Another week, another story about a major data center outage. This time it’s British Airways under public scrutiny as the company scrambles to discover the source of data center downtime that grounded hundreds of flights.
While the cause of that outage hasn't yet been released, that hasn't stopped some experts from suggesting human error. They aren't likely to be off base, either: human error remains the leading cause of IT infrastructure outages. Minimizing it, therefore, should be a primary focus of any reliability effort.
We all make mistakes, but when critical infrastructure is at stake, not to mention thousands of dollars in downtime-related costs, it's worth investing in reducing the damage people can inadvertently do to IT systems. Here are some tips to help you avoid downtime stemming from human error.
Traditional methods of avoiding downtime tend to focus on redundancy in data center design and equipment, geographically separate facilities with linked systems, and, more recently, automation via DCIM and software-defined data center technology.
These are all valuable additions to a data center and can go a long way toward improving the reliability of the facility as a whole. Multiple fiber connections, diesel-powered generators, redundant network design, multiple UPS systems, and a disaster recovery plan are essential components of a modern enterprise data center. There should never be a single point of failure in your equipment that can take out the entire facility.
And yet, British Airways may have faced that very problem. Their data centers were almost certainly designed with reliability and redundancy in mind, but something halted system failover to the second site. A UPS system at one of the sites was shut down despite having main power, batteries, and diesel backup available, possibly due to a surge or loss of voltage on the utility feed. Why systems did not move over to the second site remains a mystery.
While automation is likely the future of data center management, it can't replace humans just yet (and maybe never completely). As the industry embraces software-defined technology and robotics, we may see routine data center maintenance tasks handed off to robots with a much lower chance of failure. In the meantime, humans are still racking servers and still performing software updates. That's unlikely to ever go away completely.
Alarms and access control points are vital to data center security and to creating an audit trail. They can help you pinpoint the cause of downtime, but they may or may not be able to outright prevent it. Still, configuring danger alerts for critical systems like cooling, power, humidity, and fire suppression can help you get out in front of mechanical errors.
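The alerting idea above can be sketched in a few lines of code: compare each sensor reading against an acceptable operating band and raise an alert when it falls outside. This is a minimal illustration, not a production monitoring system; the metric names and threshold values here are hypothetical examples, not vendor or standards defaults.

```python
# Minimal sketch of threshold-based alerting for data center
# environmental readings. Metric names and thresholds are
# hypothetical examples chosen for illustration.

# Acceptable operating ranges (inclusive) per metric.
THRESHOLDS = {
    "temperature_c": (18.0, 27.0),
    "humidity_pct": (40.0, 60.0),
    "ups_load_pct": (0.0, 80.0),
}

def check_readings(readings):
    """Return a list of alert strings for any out-of-range metric."""
    alerts = []
    for metric, value in readings.items():
        if metric not in THRESHOLDS:
            continue  # ignore metrics we have no thresholds for
        low, high = THRESHOLDS[metric]
        if not (low <= value <= high):
            alerts.append(f"ALERT: {metric}={value} outside [{low}, {high}]")
    return alerts

if __name__ == "__main__":
    sample = {"temperature_c": 31.2, "humidity_pct": 45.0, "ups_load_pct": 92.0}
    for alert in check_readings(sample):
        print(alert)
```

In practice these checks would run continuously against live sensor feeds and route alerts to an on-call channel, but the core logic of "compare reading to band, escalate on breach" is the same.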
Your Data Center Operations team should develop, document, and practice policy and procedure for every action within the data center. For software updates and other activities outside the purview of the DCOps team, similar policies should be developed.
It sounds daunting to carefully craft step-by-step process documents for every little data center operation, and it's true that employees can still end up skipping steps they find to be an encumbrance. That's why stringent policy and procedure has to be combined with careful hiring and training of employees as well as 24/7 monitoring of all systems and access points. It may be only a matter of (down)time until everyone faces an outage, but these practices can push your availability as close to 100% as possible.