April 21, 2024

On a beautiful Wednesday morning last April, we received a call from one of our clients (a dental clinic) reporting that things were running rather slowly and print jobs were not coming out. Long story short, the doctor or one of the staff had pulled the power cord from the back of their server to ‘help things reboot faster’. The server was in the middle of a critical write operation and, upon startup, blue-screened and failed to boot.

Our team responded onsite within an hour and discovered that the server was stuck in a boot loop. Neither the startup interruption options nor the startup repair utilities were able to solve the issue, and the errors being thrown indicated a combination of hardware problems with the hard drive controller and generic ‘my Windows is broken’ messages.

With the office unable to take or view X-rays, accept payments, schedule follow-up appointments, or check clinical notes, we made the decision to enact Plan B. The in-office backup system used by UTS lets us boot a duplicate server instance straight from the backup repository instead of restoring the whole server from backup, a process that takes some hours. Naturally, a server running in this manner is not as quick, but it gets the customer up and running with their data. Hamda performed the steps necessary to create a new virtual server instance and connected it directly to the latest backup repository. Unfortunately, although the bootup process got farther, the new server startup failed with the same hardware errors from the controller. Plan C was in order.

Plan C involves a loaner server. UTS keeps a few of these ready to go for exactly these kinds of issues, where an unexpected hardware failure is causing downtime for the client and where proper diagnosis and sourcing parts can take a lot of time. The loaner was rushed to the customer’s office, attached to the backup array, and booted up without a problem! Total downtime was under 4 hours, most of which was spent fighting through traffic. Once their business day was over, we performed a migration from the backup array to the loaner server’s internal storage drives, which improved server performance for their next business day.

The customer’s server was brought to our office for diagnosis and a serious talking-to. A replacement array controller was quoted to the customer and ordered. The server was then bench-tested to ensure that everything else was OK and eventually returned to the customer’s office, where we were able to conduct a Live Migration – a magical act possible with virtualization whereby we can take the ‘thoughts’ from the loaner server and migrate everything to the customer’s server without so much as a reboot!

On the day of the incident, the restored server was missing a bit of the work the customer had performed that morning, because the backup image that we used was taken the night before. We offered to contact Dentrix and Dexis support and work with them to identify the necessary files on the failed server image, extract them, and re-inject them. But the doctor elected to skip that work, stating that it was not a big deal for his staff to re-create a few notes, as they were fresh in their memory.

Lessons learned:

  • Don’t pull the power cord to speed up your reboots – it is a TERRIBLE IDEA!
  • On-premises backups are an absolute must for any business that relies on its computers. With a cloud-only backup, a full restore like the one we performed within minutes would take a full day, or even multiple days, depending on the amount of data and the internet connection speed.
  • Planning ahead is key to quick disaster recovery, and that’s where experience plays a great role. Having pre-thought-out procedures, ready-to-go loaner servers, and systems designed with quick recovery in mind is what makes this possible.
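
To put rough numbers on the cloud-only restore point above, here is a minimal back-of-the-envelope sketch in Python. The data sizes, the 100 Mbps link speed, and the 80% efficiency factor are all hypothetical values chosen for illustration; they are not figures from this incident, and a real restore also depends on the backup provider and the restore software.

```python
def restore_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to download data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the advertised speed that is
    actually usable (protocol overhead, shared office traffic, etc.).
    """
    megabits = data_gb * 8 * 1000                  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)  # transfer time at the usable rate
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical backup sizes pulled down over a hypothetical 100 Mbps connection.
    for size_gb in (500, 2000):
        hours = restore_hours(size_gb, link_mbps=100)
        print(f"{size_gb} GB at 100 Mbps: about {hours:.1f} hours")
```

Under these assumptions, 500 GB works out to roughly 14 hours and 2 TB to more than two days of downloading alone, before the server can even be booted.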
