I was doing my daily check of our external servers when I noticed that one of them, which hosts several of our virtual machines, was not responding. Since it has a Dell RAC card installed, I logged in via the web console to see what was going on. I started an active console session and saw nothing but a black screen, and the server was unresponsive. I figured this might be a glitch in the RAC web console, so I tried to reboot the machine from the console. All reboot and restart options were greyed out. At this point it was time for a field trip to the data center.
Upon gaining physical access to the machine, I saw nothing but a black screen on the console, so I manually rebooted. The memory test completed, RAID array verification messages flew by, and then nothing but a black screen. No error messages at all. At this point I knew I had a sick server and decided to take it back to the office for further investigation.
Upon arriving at the office, I loaded up BartPE to take a look. The hard disks were present and the volumes were intact, but diskpart.exe was showing an error on the mirror. My first thought was that there were bad sectors on the disk, so I planned to run chkdsk to verify this. My second thought was that the mirror itself was bad, in which case I would break the mirror, check each disk, and rebuild with a new disk if needed.
To launch chkdsk, I put in the Windows Server 2008 R2 boot disk, booted from it, and selected the repair option. I then ran the memory diagnostics. After a thorough check, unbelievably, the server started normally. A quick check of the disks showed that the mirror was intact and everything was functional. I decided to run chkdsk to scan the disks for errors, which had to be scheduled for the next reboot. So I rebooted the machine and... same result. Memory and RAID diagnostic messages followed by a black screen and no response from the console.
After a LOT of digging in log files, I finally found out what was happening.
I figured out that Windows Update had put a read-only lock on the local Windows Update DB, so the updates never completed successfully on reboot. When I tried to reboot again, the updates tried to complete one last time but gave up internally. At this point a bug was triggered that erases the boot loader config, which tells Windows which volume and partition to launch from. Basically, it cleared out the boot loader file. The only feedback I got was a black screen after the memory tests at boot.
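For anyone who wants to check for a similar condition, here is a minimal Python sketch that tests whether a file carries the read-only flag and clears it. It rests on assumptions: the path below is the usual location of the Windows Update data store, but this post does not name the exact file, and a "read-only lock" could also be an open handle or a permissions problem, which this check would not detect. The function names are made up for illustration.

```python
import os
import stat

# Assumed location of the Windows Update data store (not named in the post).
ASSUMED_DATASTORE = r"C:\Windows\SoftwareDistribution\DataStore\DataStore.edb"


def is_read_only(path):
    """Return True if the file has no owner write bit set.

    On Windows this bit mirrors the file's read-only attribute.
    """
    mode = os.stat(path).st_mode
    return not (mode & stat.S_IWRITE)


def clear_read_only(path):
    """Add the owner write bit back, leaving the other mode bits unchanged."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IWRITE)
```

On the affected server you would call is_read_only(ASSUMED_DATASTORE) from an elevated prompt, with the Windows Update service stopped before changing anything.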
I ran the memory diagnostic again and exited halfway through. This allowed me to boot the OS and manually rebuild the boot config file. However, a reboot triggered the same bug. It was only after doing the whole process a second time that I set the network to DHCP to connect to the internet, uninstalled the Windows updates that had recently come in, reinstalled them, and rebuilt the Windows Update DB. After that the updates came in fine, the DB was no longer read-only, and the server rebooted successfully each time.
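The post does not list the exact commands used to rebuild the Windows Update DB. One commonly used approach is to stop the update services, rename the SoftwareDistribution folder, and start the services again so a fresh store is typically created. The Python sketch below only builds and prints that command sequence by default. The function names and the ".old" folder name are hypothetical, and the default Windows directory is an assumption.

```python
import subprocess
from pathlib import PureWindowsPath


def update_store_reset_plan(windows_dir=r"C:\Windows"):
    """Return, in order, the commands for a common Windows Update store reset."""
    store = PureWindowsPath(windows_dir) / "SoftwareDistribution"
    return [
        ["net", "stop", "wuauserv"],  # stop the Windows Update service
        ["net", "stop", "bits"],      # stop Background Intelligent Transfer
        # 'ren' is a cmd.exe builtin, so it has to be run through cmd /c
        ["cmd", "/c", "ren", str(store), "SoftwareDistribution.old"],
        ["net", "start", "bits"],
        ["net", "start", "wuauserv"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run_plan(update_store_reset_plan()) only prints the commands. Passing dry_run=False runs them, which needs an elevated prompt on the server itself. Renaming rather than deleting the folder keeps the old store around in case you need to go back.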
Both hard disks turned out to be fine, but I had to break the mirror in the process because I thought the bad mirror was adding to the issue. I rebuilt the mirror, and the server is going back to the data center this AM to be returned to production.
Windows Update robbed a few hours of my life. I would like them back but alas, such is work in IT. Just glad to have the production server back online and I hope others find this useful.