Unix Server Reboot Controversy

Paul Venezia has been getting a certain amount of flak for suggesting that rebooting Unix boxes is an inherently bad thing. I mostly agree with him.

People have been pointing out the value of regular reboots as part of maintenance. Paul maintains that any bugs that get shaken out by a regular reboot are merely evidence of an insufficiently well managed configuration. I see a lack of reboots as evidence of an insufficiently well tested configuration. If you want to be sure that all your services start up and shut down in the right order through a full reboot, the best way to test it is by doing a reboot. At that point you need never reboot your box again, provided you are also in the habit of never changing its configuration. Clearly there is a value judgement to be made here. Not all changes to a system will have an impact on its startup. But modern systems are complex, with a lot of interdependencies, and humans are imperfect creatures. A scheduled reboot is a small amount of managed downtime to double-check that you got things right, in the hope of staving off a longer period of unmanaged downtime at some point in the future.

The other thing that often happens at reboot time is hardware failure. This is particularly the case with ECC memory modules exceeding their error thresholds, but to a lesser extent you see the same thing with hard disks and power supplies. It’s much nicer to be able to call up your vendor for parts during working hours than in the middle of the night. This behaviour depends on how much your servers resemble big iron. As a server’s irony is embiggened it is much more likely to tell you things are going wrong, and to let you replace parts while still up and running.

The obvious point that I’ve failed to make so far is that this is only the case if rebooting a box is not the same thing as a service outage. If you have one solitary mail server with no failover then rebooting that box will take out your mail service for however long the reboot takes. In this case I would understand perfectly well if you didn’t want to reboot it once a month when the chance of it suffering hardware failure in any given three-year period is actually pretty small. I would still caution, though, that finding out about the flaky PSU when you get a power cut at three AM is less favourable than finding the same thing at eight AM on a Thursday during your scheduled reboot.

I briefly mentioned failover. This is probably the reason most larger systems do scheduled reboots. If you have gone to the time and expense of installing complicated hardware and software failover systems you really have to test them with some regularity. Regular reboots and failover from one side of an HA system to the other are a sensible part of any such testing.

If you have a small number of single machines running lone services then regular reboots of your servers may be positively harmful to your service uptime. If you have a large fleet of machines running clustered services then regular reboots are almost certainly beneficial. Deciding where you are on this spectrum is the tricky bit. As ever it’s a weighing up of costs and benefits against a background of not entirely quantifiable risks.
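To make that weighing up a little more concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the parameters, and especially the numbers, which are invented rather than measured. It simply adds planned reboot downtime (counted only when a reboot actually takes the service out) to the expected unplanned downtime per year.

```python
def expected_annual_downtime_minutes(
    reboots_per_year: int,
    minutes_per_reboot: float,
    reboot_causes_outage: bool,
    unplanned_failures_per_year: float,
    minutes_per_unplanned_failure: float,
) -> float:
    """Very rough expected service downtime per year, in minutes.

    Hypothetical model: planned downtime only counts when a reboot
    takes the service down (i.e. there is no failover partner).
    """
    planned = reboots_per_year * minutes_per_reboot if reboot_causes_outage else 0.0
    unplanned = unplanned_failures_per_year * minutes_per_unplanned_failure
    return planned + unplanned


# Invented figures: a lone mail server, never rebooted on schedule.
lone_never = expected_annual_downtime_minutes(0, 10, True, 0.1, 240)

# The same lone server rebooted monthly; we assume (without evidence)
# that this halves both the rate and the length of surprise outages.
lone_monthly = expected_annual_downtime_minutes(12, 10, True, 0.05, 120)

# A clustered service rebooted monthly: failover hides the reboot itself.
cluster_monthly = expected_annual_downtime_minutes(12, 10, False, 0.05, 120)

print(lone_never, lone_monthly, cluster_monthly)
```

With these made-up inputs the lone server comes out worse for being rebooted and the cluster comes out better, which is the shape of the argument above. The real difficulty is that the failure rates are exactly the "not entirely quantifiable" part, so treat the output as a way of framing the question rather than an answer.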

All that being said, I believe Paul’s main beef was with the people who reboot as an initial fix for any mishap. Yes, junior MCSE, I’m looking at you. Obviously, if your first response to any service outage is to hit the reset button then there is something wrong with you and you don’t belong anywhere near any IT system.