Another week, another story about a major data center outage. This time it’s British Airways under public scrutiny as the company scrambles to discover the source of data center downtime that grounded hundreds of flights.
While the cause of that outage hasn't yet been released, that hasn't stopped some experts from suggesting human error as the culprit. They aren't likely to be off base, either: human error remains the leading cause of IT infrastructure outages. Therefore, minimizing human error should be a primary focus of reliability efforts.
While we all make mistakes, when critical infrastructure is at stake — not to mention thousands of dollars in downtime-related costs — it's worth some investment to try to reduce the potential negative effects of people on IT systems. Here are some tips to help you avoid downtime stemming from human error.
With the average cost of data center outages hovering at $740,000 (according to a Ponemon / Emerson study from 2016), operators must take action to avoid the most common causes of downtime. Let’s take a quick dive into the leading origins of unplanned downtime and how you can avoid them in your data center.
As data center design continues to evolve, one stalwart piece hasn't changed much: cabinet and rack security and monitoring. After all, how complicated can a door lock get? Almost every data center has some form of lock on its racks and cabinets (especially colocation facilities, where multiple clients access shared floor space), but not all locks are created equal. Newer technologies allow automated access logs, biometric security, wireless unlocking, and more.
With different compliance standards and security requirements for various applications, some colocation providers will install custom locks for your cabinet if necessary. Physical security measures remain vitally important, as social engineering and theft can extend to hardware and not just data. How then do data center providers go about securing cabinets and racks?
Airflow containment refers to the practice of segregating the aisles of a data center so that hot exhaust air from servers does not mix with incoming cold air, while also directing airflow into or out of the data center floor more efficiently. According to the Uptime Institute's 2014 Data Center Industry Survey, only 30% of operators had some form of containment covering at least three-quarters of their data center. Less than half of all survey respondents had at least 50% of their data center heat contained.
That leaves a lot of white space without any form of containment, even though containment is one of the best ways to improve energy efficiency and translates into a more reliable environment as well as direct cost savings.
Things have improved in the years since, to be sure. But airflow containment remains a significant upfront investment that data center operations teams might not consider, especially at smaller providers or in-house facilities. However, it can show a real ROI.
OK, so data centers don’t use Duracell batteries (ours are much, much heavier, more expensive, and specialized). I just couldn’t resist the Matrix reference.
And what data center operator does have time for downtime? Nobody, with average costs hovering at nearly $8,000 per minute. A 2013 study from Ponemon discovered that 55% of data center outages were caused by—you guessed it—UPS battery failure.
Data center UPS (uninterruptible power supply) systems are supported by dozens of bricklike batteries, and if even one of them has a bad cell, it can take down the whole system. Even a brief hiccup in power can then lead to downtime for the entire data center. Despite all this, only 48% of surveyed operators regularly tested or monitored their UPS battery health.
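Given that fewer than half of operators regularly test their UPS batteries, even a simple automated voltage check can catch a weak block before it takes down the string. Here is a minimal sketch in Python of how such a check might look; the voltage thresholds and the shape of the readings are illustrative assumptions for a 12 V VRLA battery block, not vendor specifications — real limits come from the battery manufacturer's datasheet and the UPS management interface.

```python
# Minimal sketch of a UPS battery string health check.
# The thresholds below are illustrative assumptions for a 12 V VRLA
# block, not vendor specifications.
LOW_VOLTS = 12.0   # assumed lower bound for a healthy float voltage
HIGH_VOLTS = 14.0  # assumed upper bound for a healthy float voltage

def check_battery_string(cell_voltages):
    """Return the indices of battery blocks outside the healthy range."""
    return [
        i for i, volts in enumerate(cell_voltages)
        if not LOW_VOLTS <= volts <= HIGH_VOLTS
    ]

# Example: one weak block in an otherwise healthy four-block string.
readings = [13.4, 13.5, 11.2, 13.6]
suspect = check_battery_string(readings)
print(suspect)  # [2]
```

In practice the readings would come from the UPS's SNMP or battery-monitoring interface and the check would run on a schedule, alerting operators when any block drifts out of range rather than waiting for a utility power event to expose the bad cell.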
What types of batteries are used in the data center? How can operators keep batteries in good shape?