Incident Response: How We Prepare for the Unexpected
April 17, 2015
Written by Joseph Parrino

How do we prepare for the unexpected? A well-designed data center is built to be resilient to natural disasters, so the probability that one will disrupt operations is low. Some component failures, however, can create a situation that warrants immediate attention. T5 recommends that every data center operator have an Incident Response Plan in place to deal with those unexpected failures.

First, let’s clarify what we mean by Incident Response. Our primary role at T5 Data Centers is to maintain what we like to call ‘Total Availability,’ which means a safe and reliable data center with ‘forever’ uptime. When we encounter an incident, we know it can lead to a disruption in service if enough layers of redundancy are compromised.

Since we can’t prepare for every possible scenario, the next best thing is to understand how the systems interface with one another. This enables us to maintain the data center’s critical power and mechanical systems in the most robust condition at all times. When we encounter an out-of-normal incident, we act quickly, but more importantly, we respond using a logical and orderly process.

Here’s a three-step strategy we use to address data center operations incidents:

1. Strive for Early and Fast Diagnosis of the Problem

When an incident occurs, our first task is to attempt to identify the source of the problem. We need to determine what happened and what immediate steps we need to take, if any, to ensure the critical load is protected.

Early and quick diagnosis comes from maintaining a detailed understanding of the data center architecture through training and drills. Having a big-picture understanding of the data center and its interrelated pieces makes it easier to identify the source of the problem. The tools already in place, such as a Building Management System (BMS) or Electrical Power Management System (EPMS), provide a macro view of the infrastructure. From there, we can drill down to identify the component that is not in its normal state.
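The macro-view-then-drill-down approach can be sketched in a few lines of Python. This is only an illustration, not how any particular BMS or EPMS works: the `Point` record, the system names, and the snapshot data are all hypothetical, and a real system would supply this information through its own interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One monitored point (hypothetical record, not a real BMS/EPMS schema)."""
    system: str     # grouping used for the macro view, e.g. "UPS-A"
    component: str  # the individual item being monitored
    normal: str     # state expected during normal operation
    actual: str     # state currently reported


def macro_view(points):
    """Count out-of-normal points per system (the big-picture view)."""
    counts = {}
    for p in points:
        counts.setdefault(p.system, 0)
        if p.actual != p.normal:
            counts[p.system] += 1
    return counts


def drill_down(points, system):
    """List the points in one system that are not in their normal state."""
    return [p for p in points if p.system == system and p.actual != p.normal]


# Hypothetical snapshot, for illustration only.
SNAPSHOT = [
    Point("UPS-A", "input breaker", "closed", "closed"),
    Point("UPS-A", "operating mode", "online", "static bypass"),
    Point("UPS-B", "input breaker", "closed", "closed"),
    Point("CHILLER-1", "supply valve", "open", "open"),
]

if __name__ == "__main__":
    print(macro_view(SNAPSHOT))
    for p in drill_down(SNAPSHOT, "UPS-A"):
        print(f"{p.system} / {p.component}: expected {p.normal}, found {p.actual}")
```

The macro view shows which system to look at first, and the drill-down names the specific component to investigate.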

When things go wrong, the stress level rises, which makes it more difficult to make rational, well-reasoned decisions. That’s why we develop Emergency Operating Plans (EOPs) and response protocols in advance, and continuously drill the teams.

Early Diagnosis Summary:

a. Training and drills provide the best opportunity for an accurate diagnosis.

i. Know the architecture of the systems and follow a logical and orderly troubleshooting methodology.

b. Use the tools to drill down to the problem.

i. BMS/EPMS can provide the macro view.

ii. Drill down to the component that’s not in its normal state.

iii. Protect the systems that are still functional, always striving to maximize layers of redundancy to maintain the best possible resiliency.

1. Sometimes there are no ‘perfect’ configurations.
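One way to think about protecting the remaining layers of redundancy is to track, for each system, how many units can still fail before the critical load is at risk. The sketch below is a simplified illustration under that assumption; the system names and unit counts are hypothetical, and real capacity planning involves far more than a unit count.

```python
def redundancy_margin(required, healthy):
    """Units that can still fail before the load is at risk.

    A margin of 0 means no redundancy is left; a negative margin is a shortfall.
    """
    return healthy - required


def rank_by_risk(systems):
    """Return system names ordered from thinnest margin to widest.

    `systems` maps a name to a (required_units, healthy_units) pair.
    """
    return sorted(systems, key=lambda name: redundancy_margin(*systems[name]))


# Hypothetical state during an incident, for illustration only.
SYSTEMS = {
    "Generators": (2, 4),  # two spare units remain
    "UPS": (2, 2),         # redundancy already consumed
    "Chillers": (3, 4),    # one spare unit remains
}

if __name__ == "__main__":
    for name in rank_by_risk(SYSTEMS):
        print(name, redundancy_margin(*SYSTEMS[name]))
```

Ranking by margin highlights where the next failure would hurt most, which is where protective action matters first.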

2. Manage the Incident Response

The initial diagnosis can be erroneous. Even experienced teams can make wrong assumptions about the cause of an incident. For example, suppose a power supply fails on a single-corded 48-port switch in an A/B edge network. If the network traffic doesn’t reroute to the B-segment, numerous servers will be affected. This can initially look like a widespread outage in the critical power distribution system when, in fact, the power systems are normal.
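A simple cross-check for this kind of misdiagnosis is to ask what the affected servers have in common upstream. The sketch below assumes a dependency map is available; the server, PDU, and switch names are hypothetical, and the approach is an illustration rather than a description of any specific tool.

```python
def shared_dependencies(affected, dependencies):
    """Return the upstream items common to every affected server.

    `dependencies` maps a server name to the set of upstream items
    (power feeds, network switches, and so on) it relies on.
    """
    sets = [set(dependencies[server]) for server in affected]
    return set.intersection(*sets) if sets else set()


# Hypothetical dependency map, for illustration only.
DEPENDENCIES = {
    "srv-01": {"PDU-A1", "PDU-B1", "edge-sw-07"},
    "srv-02": {"PDU-A2", "PDU-B2", "edge-sw-07"},
    "srv-03": {"PDU-A1", "PDU-B2", "edge-sw-07"},
}

if __name__ == "__main__":
    affected = ["srv-01", "srv-02", "srv-03"]
    print(shared_dependencies(affected, DEPENDENCIES))
```

Here the affected servers sit on different power feeds but share one switch, which points the investigation at the network rather than the power distribution system.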

When a customer-affecting incident does happen in the data center, the phone will start ringing with demands for action. Our objective is to remain calm and remind everyone that we don’t want to make a bad situation worse. An ill-conceived solution could create other problems downstream. We use our training, our tools and protocols to properly identify and manage the problem. Acting without understanding the situation can create confusion and unnecessary guesswork.

Manage the Incident Response Summary:

a. Even with seasoned teams, early diagnosis is often erroneous or only part of the picture. As ongoing investigations take place, the picture begins to come into focus.

b. The phone will start ringing, and everyone will want to know when the system(s) will be restored.

i. If still assessing, remain calm – understand the pressure to “do something” is normal but will also increase.

ii. Remind everyone that it’s in no one’s best interest to make a bad situation even worse.

iii. Call knowledge resources if needed.

iv. Acting without fully (or at least mostly) understanding can result in confusion and guessing. Guessing is not a good position to be in.

3. Resolve the Incident

Once we’ve started initial diagnostics and believe we’ve identified the root cause, we determine the appropriate response. The more complex the system, the more difficult it may be to identify the root cause. We realize we may need to drill deeper to identify contributing factors.

By way of example, we recently ran into a situation where one of our UPS systems went into static bypass. We had the proper redundant systems in place so there was no load loss, but it took some investigating to identify the cause. We called on our commissioning agent and the manufacturer’s engineering team to investigate the source of the failure. Once the problem was corrected, we documented it in the equipment history database and shared it with the technicians in our other markets.

When we run into problems we can’t diagnose, we call in the experts. We call the commissioning agent or the manufacturer’s engineering teams to get more insight into hardware problems. Manufacturers often want to know about product issues in order to implement new improvements.

Finally, we use the incident as a learning opportunity. We continually evaluate which design changes or process improvements can make the data center infrastructure more resilient.

Incident Resolution Summary:

a. Get to the Root Cause

i. If it can’t be determined in-house, call in the experts.

ii. As system complexity increases, sometimes Root Cause may not be confidently determined.

b. Notify the manufacturer of a potential defect.

i. Manufacturers want to know and are often willing to help.

c. Share Lessons Learned across the portfolio.

i. Not discussing with other teams is a lost opportunity.

d. Encourage participation in suggesting design changes or process improvements to make the system more resilient.

We can’t anticipate every kind of systems failure, but we can be prepared with the right protocols and procedures. Being prepared means maintaining the data center in the most resilient position, maintaining as many layers of redundancy as possible, and learning from the experience for next time.