Our team was sitting in front of a customer recently when one of their executives asked me pointedly, “What keeps you awake at night?” For me the answer was easy…“Humans in the data center,” I responded.
We are flawed. There’s no getting around it. We’re flawed in our thinking and flawed in our judgment, especially in pressure situations. We are born and conditioned with internal biases. It’s ‘Nature + Nurture.’ We overconfidently think we can predict future outcomes by assessing recent results. Read that last sentence again. We’re overconfident and we have short memories. Psychologists call these Cognitive Biases. I’ll list a few near the end of this article and you can still try to convince yourself you’re o.k. (unfortunately you’re not!).
I became aware of our flawed thinking in 2005. I was attending an IEEE conference called RAMS – Reliability and Maintainability Symposium. There was a fascinating presentation by NASA engineer Matthew Melis, who works at the NASA Glenn Research Center. He talked about his team’s contribution to the Columbia Accident Investigation Board’s report on the Space Shuttle disaster that occurred on February 1, 2003. They used ballistic testing to confirm that the Reinforced Carbon-Carbon (RCC) wing edge was breached by a foam strike from the External Tank at 81.7 seconds into the launch. During vehicle re-entry, the damage enabled hot atmospheric gases to penetrate the wing and melt its aluminum structure, ultimately causing Columbia to break apart and resulting in the heartbreaking loss of its seven crew members.
During the mission, the team investigated what they had come to regard as “normal anomalies” during a launch…foam strikes from the External Tank. This relates to a Cognitive Bias called Recency. It led them into a feeling of overconfidence…thinking everything would be o.k. After all, this had happened at some level on the Shuttle’s 90+ prior launches.
Foam strikes had been a ‘fact of life’ since day one, and the problem had never been totally resolved. Insulating foam on the External Tank was necessary to prevent water from condensing and forming ice on the tank. The insulating foam regularly became dislodged at the bipod forward attach point, seen at the top of Figure 1.
At 81.7 seconds into launch, a mass of foam measuring 21 to 27 inches long by 12 to 18 inches wide dislodged from the External Tank. Traveling at between 416 and 573 miles per hour, it breached the leading RCC edge of Columbia’s left wing.
“Certainly a very interesting and tragic story…but what does this have to do with data centers?”
More than you think.
The decisions NASA made based on the recurrence of these events, coupled with some other internal organizational issues, have similarities to operating and managing data centers.
NASA knew the foam strike had occurred, but it wasn’t until the next day, after completing the review of the film, that they saw its significance. Still, because of prior occurrences (recency), and although they had the capability to perform closer imaging, decisions were made not to conduct additional investigations (overconfidence). Besides, time was short. The ISS crew would exceed their 180-day stay, and Node 2 of the Space Station needed to be installed (organizational issue: schedule pressure).
Those of us who operate data centers have seen this as well. We get comfortable with a switching operation or a “routine” procedure. This can lead to complacency. Stuff gets old, and stuff doesn’t always work the same as it did when it was young. Yet we overconfidently think it will.
This leads us to underestimate the probability of something out of the ordinary happening. Then, when something out of the ordinary does happen, we haven’t thought through all the outcomes and scenarios. We now risk creating a larger problem.
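To put a rough number on how a clean track record can mislead us, here is a minimal sketch in Python. It assumes every operation is independent and carries the same chance of going wrong, which real switching operations and Shuttle launches do not strictly satisfy. The probabilities and the count of 90 operations are hypothetical illustration values, not NASA or industry figures, and the function names are made up for this example.

```python
def prob_clean_streak(p_failure: float, n_operations: int) -> float:
    """Chance of n operations in a row with no failure.

    Assumes independent operations with a constant per-operation
    failure probability (a simplifying assumption, not a measured model).
    """
    return (1.0 - p_failure) ** n_operations


def prob_at_least_one_failure(p_failure: float, n_operations: int) -> float:
    """Chance that at least one of n operations fails, same assumptions."""
    return 1.0 - prob_clean_streak(p_failure, n_operations)


if __name__ == "__main__":
    n = 90  # hypothetical number of past "routine" operations
    for p in (0.001, 0.01, 0.02):  # hypothetical per-operation risks
        streak = prob_clean_streak(p, n)
        print(f"p={p:.3f}: chance of {n} clean runs in a row = {streak:.1%}")
```

Under these assumptions, even a 1-in-100 risk per operation leaves roughly a 40% chance of seeing 90 clean runs in a row. A long streak of success is weak evidence that the next operation is safe, which is exactly the trap that recency and overconfidence set for us.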
The Cognitive Biases Discussed in this post:
1. Recency Effect – “placing too much emphasis on recent occurrences and not being aware of older anomalies”
2. Overconfidence – “someone who is inexperienced or has too much self-admiration, and therefore has too much trust in his own judgment and abilities”
The Longer List of Cognitive Biases:
1. Sunk Cost – “the irrational escalation of commitment to an outcome”
2. Hindsight Bias – “the inclination to see past events as being predictable”
3. Bandwagon Effect – “the tendency to do (or believe) things because many other people do (or believe) the same”
4. Confirmation Bias – “the tendency to search for or interpret information in a way that confirms one’s preconceptions”
5. Self-Serving Bias – “the tendency to claim more responsibility for successes than failures”
6. Illusion of Control – “the tendency for humans to believe they can control or at least influence outcomes that they clearly cannot”
Think back on your respective data center operations. I’ll bet you can relate to many, if not all, of these.
So how do we avoid these pitfalls? Staying scared every time we’re changing the configuration of our power and cooling systems? To a degree, yes. Having a policy of operating by procedure only? This helps. Imagining every possible failure scenario that could occur during the switching operation? Likely impossible, but it should be the goal. Becoming as educated as possible on our critical infrastructure systems helps us imagine more failure scenarios.
There was something else mentioned that wasn’t listed in the above biases. Schedule Pressure. Is this a bias or an organizational issue? If we’re operating within a Change Window and we are at an elevated level of risk, how can this be managed?
NASA was under some level of schedule pressure during both of its Shuttle accidents. In the data center, I’ve seen negative results from this: too many people working on a UPS within too short a time window.
Our thinking is flawed. Are these human flaws manageable? To a degree, yes. By becoming aware they exist, we can remain on guard for bad decision-making opportunities. Further, by understanding our infrastructure as well as we can, we can at least have a fighting chance of predicting the outcome.
Columbia Accident Investigation Board Report
Matthew E. Melis – NASA Glenn Research Center