Minimizing Human Error in Data Center Operations
March 18, 2015
Written by Joseph Parrino

Wouldn’t it be nice to be able to create a clockwork data center where you get the system up and just let it run? You wouldn’t have to worry about potential system failures, power outages, or hacker attacks. Unfortunately, managing a data center requires ongoing vigilance by human operators, which means that data center management is subject to human error.

Human bias can be one of the biggest threats to the data center. People make mistakes, especially under pressure. When faced with a problem or a crisis, the human tendency is to apply what you think you know to the problem at hand. As I noted in a 2013 T5 Data Centers blog post on overconfidence, bias, and the Recency Effect, people tend to be overconfident and have short memories, and both tendencies lead to wrong decisions. More often than not, those biases lead to errors, so understanding human bias is the first step in minimizing the impact of the human element in data center management.

Human Bias in the Data Center

Psychologists have identified a number of cognitive biases that we all share that can have a direct impact on decision-making. These cognitive biases tend to get in the way of rational assessment and lead to incorrect assumptions based on past or anticipated experience.

Good data center management is built on tested protocols and procedures, and those protocols are subject to cognitive bias. If a system changeover or response becomes a “routine” procedure, it can become more difficult to anticipate and address what is not routine.

Psychologists have catalogued hundreds of cognitive biases, so let’s consider a few that can interfere with good data center decision-making:

  • The Anchoring Effect – Also known as the relativity trap, this describes the tendency to judge something against a single reference point or a limited set of items rather than assessing a broader spectrum. For example, if an item is on sale it seems a bargain compared to the normal price, but you are not comparing it to what the same item costs everywhere else. The same is true in the data center. You tend to look at performance metrics in terms of recent highs and lows, ignoring the larger spectrum.
  • The Gambler’s Fallacy – The gambler’s fallacy is the false belief that the past outcomes of independent events change the odds of future ones. The most common example is a coin toss. If a fair coin comes up heads five times out of six, the fallacy is to assume that the streak tells you something about the next toss (that heads is on a run, or that tails is “due”) when the odds remain exactly even. In the data center, assuming that a failure is more or less likely to recur simply because of how often it has shown up recently, rather than because of an identified cause, could lead to errors.
  • Neglect of Probability – In this instance, you just ignore probability altogether. For example, you ignore the statistics which show that wearing a seatbelt is safer, rationalizing that it might be harder to get out of a smashed vehicle with your seatbelt fastened, or believing the relatively few stories of survival where someone was thrown from the vehicle. Consider the same argument used to rationalize skipping a server failover test or a systems backup – just because the system has never failed doesn’t mean you can ignore the protocol.
  • Observational Selection Bias – Your perception changes based on recent experience. For example, if you just bought a new car, suddenly you are aware of everyone who drives that same make and model. The number of cars hasn’t increased, only your perception has changed. You could experience the same phenomenon once you start hunting for gremlins you expect to see in the data center.
  • Negativity Bias – People tend to dwell on bad news. Consider, for example, that the amount of crime and violence may be decreasing, but access to more media sources gives the perception that things are worse. The same can apply in the data center. If you dwell too much on what fails, you overlook what is working (and what may fail later).
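
The coin-toss example in the Gambler’s Fallacy item can be checked by brute force. The sketch below is only an illustration, and the function name is made up for this post: it lists every equally likely sequence of fair tosses, keeps the ones that begin with a streak of heads, and counts how often the next toss is also heads.

```python
from fractions import Fraction
from itertools import product


def prob_heads_after_streak(streak_len):
    """Exact chance that a fair coin shows heads right after `streak_len` straight heads."""
    # Every sequence of streak_len + 1 fair tosses is equally likely.
    sequences = product("HT", repeat=streak_len + 1)
    # Keep only the sequences that open with the streak of heads.
    after_streak = [s for s in sequences if all(t == "H" for t in s[:streak_len])]
    # Of those, count the ones where the following toss is heads as well.
    heads_next = [s for s in after_streak if s[-1] == "H"]
    return Fraction(len(heads_next), len(after_streak))


print(prob_heads_after_streak(5))  # 1/2
```

However long the streak, the answer stays at one half. A failure log is different from a coin only when there is an identified cause linking one event to the next, which is something to establish from data rather than from the streak itself.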

Minimizing Human Error in the Data Center

Even IT professionals can fall victim to cognitive bias, but there are steps you can take to ensure that human error doesn’t interfere with effective data center management:

  1. Rely on your tools – Chances are that you will be familiar with the data center hardware and know how it is supposed to perform. Don’t rely on your assumptions. Use the tools at hand to measure performance and, more importantly, rely on the data provided. Use the objective data you have, not what you think you know.
  2. Apply automation – Using technology to manage routine, predictable failures is one way to reduce human bias. Trending performance indices over time is a very effective means of predicting impending failure. However, you still need human experience to program the technology and determine what to look for.
  3. Develop written protocols – Having well-defined protocols and procedures reduces errors. Be sure to have well-documented protocols for common problems and failures. Of course, you can’t predict every kind of problem, but you can create procedures to isolate an infected server or switch to a backup system while you triage the primary failure.
  4. Practice makes perfect – Review recovery protocols regularly and train on new systems and new procedures.
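
As a rough illustration of trending a performance index over time (step 2), the sketch below fits a straight line to a series of readings and projects when that line would reach an alarm threshold. Everything specific here is an assumption made for the example: the function names, the sample inlet temperatures, the 27-degree limit, and the idea that the metric drifts in a straight line. Real metrics may need a different model and real monitoring tools have their own interfaces.

```python
def fit_trend(times, values):
    """Least-squares straight line through (time, value) samples; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching time/value samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    spread = sum((t - mean_t) ** 2 for t in times)
    if spread == 0:
        raise ValueError("all samples share the same timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / spread
    return slope, mean_v - slope * mean_t


def time_until_threshold(times, values, threshold):
    """Projected time after the last sample until a rising trend reaches `threshold`.

    Returns None when the fitted trend is flat or falling, and 0.0 when the
    fitted line is already at or above the threshold.
    """
    slope, intercept = fit_trend(times, values)
    if slope <= 0:
        return None
    fitted_now = slope * times[-1] + intercept
    if fitted_now >= threshold:
        return 0.0
    return (threshold - fitted_now) / slope


# Hypothetical hourly inlet temperature readings in degrees Celsius.
hours = [0, 1, 2, 3, 4]
inlet_temp_c = [22.0, 22.5, 23.0, 23.5, 24.0]
print(time_until_threshold(hours, inlet_temp_c, threshold=27.0))  # 6.0
```

The point of a projection like this is that it comes from the whole series rather than from the last reading or the last incident anyone remembers. Choosing the metric, the window, and the threshold still takes human experience.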

Using past experience to predict potential future failures has its value. However, your best defense against data center failures is maintaining some healthy skepticism. Strive to remain observant and objective. Don’t trust only what you know, but rely on external indicators to guide diagnostics. If you can strike a balance between automation, well-defined protocols, and experience, tempered with a little healthy skepticism, you will be able to deal with almost any data center problem.

What’s your strategy for minimizing the impact of human bias in the data center?