A Southwest Airlines computer glitch on July 20 caused 2,300 canceled flights across the United States. The outage lasted 12 hours and disrupted the airline's website and operations, including check-in, boarding passes and ticket booking. It meant chaos and long lines for Southwest customers, and the cost was estimated at up to $10 million in lost revenue.

This outage reminds us of the critical importance of IT systems — uptime, quick response time and changes that do not interrupt this service. Years ago, Holiday Inn had a slogan: "The best surprise is no surprise." This is certainly fitting for customers of our IT systems.

What can a CIO do to address and improve the reliability of IT systems, both those in-house and those in the cloud? Let's take a closer look.

Why is reliability important?

Expectations are high. With the introduction of electronic health record systems in healthcare, a physician may be depending on system availability during a patient visit. There may no longer be a paper backup if the system is unavailable or slow. It is critical to the business for the system to be there and to perform. It is like plugging into an outlet: electricity is expected, and at a steady current.

Reliability impacts customer service. As in the Southwest case, the outage resulted in long lines of disgruntled customers. Have you ever been on a call with a customer service rep and heard "our system is slow today"? How did you feel about that company and its level of service? System availability and response time enable call center or operations staff to effectively help a business customer.

IT efficiency is affected. Outages and troubleshooting affect IT efficiency. Your team may be spending too much time firefighting. If things do not break, you can focus those scarce resources on projects that move the business forward (offense) rather than problem resolution and recovery (defense).

The reputation of IT is at stake. If the system is plagued with outages, it becomes difficult for IT to request strategic projects. A seat at the executive table assumes IT foundation elements like system reliability are covered adequately.

7 ways to improve reliability

1. Establish SLAs. Service level agreements between IT and the business go a long way to setting and guiding expectations. In healthcare, for example, EHR uptime of 99.9 percent was considered best in class a few years ago, based on analysts' research. We used this as the benchmark to set expectations.
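To make a target like 99.9 percent concrete, it can help to translate it into hours. The following is a minimal Python sketch, assuming the SLA is measured around the clock over a 365-day year with no exclusions for planned maintenance. The function names are illustrative and do not come from any particular tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumption: 24x7 measurement, leap years ignored


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime an uptime target permits over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


def measured_uptime_percent(downtime_hours: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the uptime percentage achieved, given total downtime in the period."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    target = 99.9  # the benchmark mentioned in the text
    budget = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {budget:.2f} hours of downtime per year")

    # Hypothetical scenario: one 12-hour outage and no other downtime all year.
    achieved = measured_uptime_percent(12)
    print(f"A single 12-hour outage yields {achieved:.3f}% uptime")
```

Under these assumptions, a 99.9 percent target leaves roughly 8.76 hours of downtime per year, so a single 12-hour outage would by itself miss the target. An actual SLA may define the measurement window and exclusions differently.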

2. Implement redundancy. Analyze your entire end-to-end system (power, connectivity, storage, servers) to identify single points of failure. Build in redundancy (for example, a backup generator for power) where budgets allow. Allocate time to test your redundancy to ensure proper failover.

In my experience, components can and will fail. Even redundant components may fail, as in the case of the Southwest outage. Southwest Chief Operating Officer Mike Van de Ven told The Dallas Morning News, "We have redundant systems that should have kicked in place, and they didn't."

Redundancy does not guarantee 100 percent uptime, but it does greatly reduce risk.
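A rough way to see both halves of that statement is a simplified availability model. The Python sketch below assumes a single primary and a single backup that fail independently, and it treats the chance that failover actually works as a separate probability. The function name and all the numbers are hypothetical and chosen only for illustration.

```python
def redundant_availability(primary: float, backup: float,
                           failover_success: float = 1.0) -> float:
    """Estimate availability of a primary/backup pair.

    Simplified model (an assumption, not a measurement): the service is up
    if the primary is up, or if the primary is down, the failover works,
    and the backup is up. Failures are treated as independent.
    All inputs are probabilities between 0 and 1.
    """
    for name, value in (("primary", primary), ("backup", backup),
                        ("failover_success", failover_success)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return primary + (1.0 - primary) * failover_success * backup


if __name__ == "__main__":
    # Hypothetical component availability of 99 percent each.
    scenarios = {
        "no working failover": 0.0,
        "failover works 90% of the time": 0.9,
        "failover always works": 1.0,
    }
    for label, chance in scenarios.items():
        result = redundant_availability(0.99, 0.99, failover_success=chance)
        print(f"{label}: {result * 100:.3f}% available")
```

In this model the redundant pair never reaches 100 percent, and an untested failover that does not kick in leaves you no better off than the primary alone, which is why testing failover matters as much as buying the second component.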

3. Use change management techniques. Analysts reported years ago that 80 percent of outages are due to people issues and only 20 percent to hardware and software. So it makes sense to pay close attention to the people aspects of change.

  • Communicate changes inside and outside of IT. Share with the business those projects and system changes going into production, so there is an opportunity for the business to prepare and adapt. Perhaps, for example, the business would want to clear a backlog of work before the system change occurs. Share within IT as well for the same reason. Alert the help desk especially, as this team will be fielding calls if there are any issues. Help them be adequately staffed and knowledgeable when business users call.
  • Prepare contingency plans. What is Plan B if the changes do not go as anticipated? It is much better to prepare for such an eventuality ahead of time rather than in the heat of battle.
  • Use checklists to help avoid people problems. Checklists are even being used in many operating rooms now, as a double check to assure all steps are followed. Why not in IT as well? Checklists can help reduce errors, especially with less experienced staff.
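The three practices above can be combined into a simple pre-change gate. The following Python sketch is one possible shape for that, not a description of any specific change management product. The class name, method names and item wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative items drawn from the practices above; adapt the wording to your shop.
DEFAULT_ITEMS = (
    "Business notified of the change and its timing",
    "IT teams notified of the change",
    "Help desk briefed and staffed",
    "Contingency plan (Plan B) documented",
)


@dataclass
class ChangeChecklist:
    """Tracks sign-off on each checklist item for one production change."""
    change_name: str
    items: Dict[str, bool] = field(
        default_factory=lambda: {item: False for item in DEFAULT_ITEMS}
    )

    def confirm(self, item: str) -> None:
        """Mark one item as done; unknown items are rejected to catch typos."""
        if item not in self.items:
            raise KeyError(f"Unknown checklist item: {item}")
        self.items[item] = True

    def outstanding(self) -> List[str]:
        """Return the items that have not been confirmed yet."""
        return [item for item, done in self.items.items() if not done]

    def ready_for_production(self) -> bool:
        """The change may proceed only when nothing is outstanding."""
        return not self.outstanding()


if __name__ == "__main__":
    checklist = ChangeChecklist("Example upgrade")  # hypothetical change name
    checklist.confirm("Help desk briefed and staffed")
    print("Ready:", checklist.ready_for_production())
    for item in checklist.outstanding():
        print("Still open:", item)
```

The point of the design is that the change cannot be marked ready while any step is open, which gives less experienced staff the same double check that an experienced operator would perform from memory.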

4. Reduce risk by using automation. Where there is vulnerability due to human intervention, such as entering a date range, consider automating the task to remove that risk. We did this with a computer operations task, for example, to enter a date range for lengthy batch processing. Automation eliminated this task for operations, while reducing the risk of an error, delayed output and a costly rerun.
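As an illustration of this kind of automation, the Python sketch below computes a date range instead of asking an operator to type one. It assumes, hypothetically, that the batch job processes the previous calendar month; the original task's actual range is not specified here, and the function name is made up for the example.

```python
from datetime import date, timedelta
from typing import Tuple


def previous_month_range(today: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar month before `today`.

    Assumption for illustration: the batch run covers the previous full month.
    """
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st always lands on the prior month's last day,
    # which handles month lengths, leap years and year boundaries.
    last_of_previous = first_of_this_month - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    return first_of_previous, last_of_previous


if __name__ == "__main__":
    start, end = previous_month_range(date.today())
    print(f"Batch date range: {start.isoformat()} to {end.isoformat()}")
```

Deriving the range from the system date removes the keystrokes where a transposed digit or wrong month could trigger a lengthy rerun. The job's run date still needs to be right, so scheduling remains something to verify.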

5. Leverage the vendor's reliability. For the 20 percent of outages that are hardware or software related, pay attention to the vendors with whom you partner. We tended to use solid, mature vendors (names like Dell, Cisco and EMC) for reliable uptime on strategic systems. We were open to startups for customer innovation, like improving the web experience, or for systems that were not mission-critical.

For further reliability protection, look to your vendor contracts. Establish contractual SLAs and provide vendor penalties for not meeting these service level agreements. See my previous article about improving your technology contracts.

6. Prepare to react. As hard as you try to prevent an outage, it may still occur. Prepare for that eventuality as well.

You, your team and your vendors are all in this together. With many technology components affecting uptime and response time, it is essential for the technical experts to work together. No one expert knows all the answers. By working together and recovering quickly, you can greatly diminish the negative impact on the business.

I was once responsible for leading a virtual IT SWAT team, an on-call team of experts in various technology components. This cross-functional team was defined and ready to address critical outages when needed. Consider your own virtual technology SWAT team.

7. Learn lessons from your outages. Take time after an incident to analyze what happened and, more importantly, how to prevent it from ever happening again. A lessons-learned, postmortem meeting may be as important as recovering from the outage.

What about the cloud?

Similarly, address the cloud vendor's reliability. Protect yourself with contractual metrics and penalties. Assess the applicability of cloud solutions to mission-critical systems.

And where you use the cloud, contractually provide for switching vendors if the arrangement does not provide the reliability you and the business demand. Make sure the contract allows enough time to select and start up a new solution.

Where does security fit?

Outages may stem not only from internal systems and cloud-based solutions but also from a security incident, such as a denial-of-service attack. Here are some considerations.

Prevention. Similar to the strategies above, try to prevent an adverse impact. Establish security policies and awareness, implement tools like intrusion detection, keep systems current with security-oriented patches, and try to incorporate security practices into the development process.

Reaction. In spite of prevention, you may still encounter a security incident. Hackers are becoming more and more sophisticated in their attacks. So prepare for your reaction to a security incident. Like a disaster-recovery plan, a security-response plan entails roles and responsibilities in firefighting the incident, the processes to follow, escalation procedures if the incident is not resolved quickly, and communication to stakeholders.

Preparedness is the key to an effective recovery.

Results

Using the techniques in this article, our team was able to achieve 99.96 percent uptime or better for four consecutive years, exceeding the best-in-class benchmark for healthcare at that time.

This was achieved primarily with in-house systems, as cloud solutions did not always offer improved reliability stats. Also in healthcare, patient privacy was a serious concern, so we opted not to use the cloud for mission-critical systems. We did, however, pursue the cloud for nonstrategic systems, where uptime and response time demands were lower.

Let's all learn a lesson from Southwest Airlines about the critical importance of IT systems. It's time to take the proper steps to ensure this doesn't happen to your company.