I was at a conference recently where one of the speakers was talking about their service's uptime during the past year. 99.998% uptime flashed on the screen to the applause of the mostly technical multitude! Some quick math told me that this system had experienced only 10.512 minutes of unscheduled downtime during the previous 525,600 minutes (remember the song from “Rent”)! Very impressive numbers! However, it got me thinking about overall computer system reliability, how we put our faith in numbers and sort of arbitrarily create standards which are difficult and expensive to maintain.
Years ago, when competition was coming to our country’s landline telephone network, the “Bell” companies and AT&T touted that no other company or system could get close to their “Five 9s” in reliability. Again, doing the math, that’s only 5.256 minutes of unscheduled network downtime annually. “Ma Bell” was right. The landline networks (both voice and data) are “only” 99.995% reliable (averaging several government and industry surveys). So the average landline voice or data network can expect to have 26.28 minutes of unscheduled downtime annually.
For most of us, that few minutes of downtime goes totally unnoticed and isn’t very important in our lives. Remember, that’s “network” downtime. If a backhoe cuts the wire or cable to your building and you lose both voice and Internet, that doesn’t count (here). Same if you lose power to your building or the local cell tower takes a lightning strike. Those lofty numbers only apply to the “central” voice and data networks.
So what’s your level of reliability tolerance? And what do most systems give you? Back to the math. 99.5% sounds pretty good. But a system like that will experience 43.8 HOURS of downtime in a year. That’s nearly two full days. 99.9% reliability reduces the number to 8.76 hours annually, more than a full workday. If those downtimes occurred all at once, you’d certainly notice!
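The arithmetic behind all of these figures is the same: multiply the minutes in a year by the fraction of time the system is down. A minimal sketch (the function name and layout are my own, purely for illustration):

```python
# Illustrative sketch: converting an uptime percentage into
# expected annual unscheduled downtime.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 -- the "Rent" number

def annual_downtime_minutes(uptime_percent: float) -> float:
    """Expected unscheduled downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

# A few of the "9s" levels discussed above:
for uptime in (99.5, 99.9, 99.99, 99.999):
    mins = annual_downtime_minutes(uptime)
    print(f"{uptime}% uptime -> {mins:,.2f} min/yr ({mins / 60:.2f} hours)")
```

Run it and you get the numbers from the examples here: 99.5% works out to 2,628 minutes (43.8 hours), 99.9% to 8.76 hours, and “Five 9s” to just 5.256 minutes a year.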
Everyone’s reliability tolerance is different. And each of ours varies with the time of day, what is going on at our businesses, and what is going on in our lives. During a personal emergency, any phone, data, or cellphone downtime can be devastating. While we’re asleep, a couple of minutes of network downtime is just fine and will likely go totally unnoticed. There are lots of claims about what data network downtime costs various businesses. Like a lot of things, the real answer is: "it depends".
The better way to look at data network downtime is to figure out what information or services you don’t want to be without, and how long you can afford to be down. A common system example is email. A typical client needs access to new email during waking hours and needs that access regardless of where they are. Older emails are important, even critical, but don't necessarily need to be accessible instantaneously all the time. So, looking back at the reliability numbers, current emails may need four 9s of uptime while accessing older messages may only require three 9s. Why then design, manage, and pay for a four or nearly five 9s overall email system when you might not need that level of reliability for everything you do?
Systems like the one just described take careful thought in their architecture, engineering, implementation, and maintenance. Using our example from above, we design the current email system to be highly reliable and to include an automatic failover feature whereby new email remains accessible in the event of a main system outage. The rest of the email folder system and the email archiving is reasonably reliable, has good data integrity and backup, but doesn’t need to be stored fully offsite with automatic failover. This approach saves lots of dollars, reduces maintenance headaches, and gets the job done to many clients’ satisfaction.
Different systems require different reliability targets based on the organization’s policies and needs. We're happy to help you figure out how many 9s you need!