As I write this on a Monday morning British Airways flight, BA is struggling to recover from one of the worst IT outages an airline has ever suffered, and accordingly, also one of the worst PR disasters.

Over the weekend, BA’s IT systems all went down. All British Airways flights were grounded for almost two full days. And not just at one airport in the UK, but worldwide. The news has been filled with pictures of mountains of luggage, terminals overflowing with distressed passengers, people sleeping on floors, lines zigzagging across packed check-in halls and out through the doors. Tearful kids, angry parents, and furious business travelers are the faces of this story.

Image: people waiting at an airport terminal. Source: The Independent

British Airways' reputation in jeopardy

No matter how you look at it, this is a disaster for British Airways. Because it took place at the start of the Bank Holiday weekend and most schools’ half-term week, we can be sure that many families have had their precious long-booked holidays trashed. The compensation cost for all passengers worldwide for two days will be horrific, and BA's reputation may never recover.

This hurts the British psyche particularly, since BA is somehow still regarded as our national airline, with lingering pride in its quality and reputation. After all, it’s got the name "British" in it. I’m sure the Queen gets involved there somehow as well. We regard BA as if it were still publicly owned, nationalized and subsidized by the government and paid for with British taxes.

Of course, this is not the case. BA is a very commercial, private company, fighting for profit in a competitive market and trying to retain its "premier brand" reputation while competing for every pound against the small, super cheap, no-frills airlines for the same customers.

So, the outage is bad. The fact that it was British Airways makes it somehow much worse. The outage has been major headline news in the UK for two days so far. It’s still going on, although my experience today is that many of the issues have been sorted out and only a few flights appear as cancelled.

Recent WannaCry ransomware outbreak

This is the second major IT outage to hit a pillar of British national identity recently. Only three weeks ago, the WannaCry/WannaCrypt ransomware outbreak hit the National Health Service hard, closing hospitals and turning away patients. The NHS is genuinely government funded, free for all who need it, and baked deep in the British psyche. The ransomware outbreak impacted IT worldwide, but it most visibly hurt the UK health service.

The main conclusions drawn about the impact on the NHS pointed to a lack of sufficient IT funding, old computers running Windows XP, and slow update and patch timescales. It was only a matter of time before WannaCry or something similar hit the NHS. The impact was huge.

Now it is British Airways.

The big question: What caused the British Airways IT outage?

So far, all we know is that power supply issues caused the outage of all the IT systems that manage check-in, luggage, boarding and, well, pretty much everything, apparently. It appears that some power failure basically turned off all BA IT systems, and they simply couldn’t just be turned back on again.

At this point, that story doesn’t really sit well with me. If it is true, it would suggest that BA runs its IT systems on an infrastructure that is not mirrored, replicated or built with failover capabilities across more than one location. I could understand a small non-IT company putting its eggs into one basket, but these days even mid-sized companies understand the basics of IT service continuity. It’s very normal for business-critical applications run from the cloud to be automatically operated in a failover datacenter model, so that if an entire DC were hit by, say, a small asteroid, there would only be a short recovery time before all systems spin back up in the mirrored location with minimal data loss.
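To make the failover idea concrete, here is a minimal sketch in Python of the decision at its heart: probe each site in priority order and route to the first one that answers. Everything in it is an assumption made for illustration. The Datacenter class, the is_healthy probe and the choose_active function are hypothetical names, and none of this reflects how BA's systems are actually built.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Datacenter:
    """Hypothetical description of one site; not a real BA or cloud API."""
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe, e.g. a heartbeat check


def choose_active(sites: Sequence[Datacenter]) -> Optional[Datacenter]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        try:
            if site.is_healthy():
                return site
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    return None


if __name__ == "__main__":
    primary = Datacenter("primary", lambda: False)     # simulate a power loss
    secondary = Datacenter("secondary", lambda: True)  # mirrored site still up
    active = choose_active([primary, secondary])
    print(active.name if active else "no site available")
```

A real setup also has to replicate data between sites and avoid both sites believing they are active at once, which is where most of the engineering effort goes. The point of the sketch is only that the routing decision itself is simple once a second site exists.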

Personally, I think there is more to the BA story. If it is "a UPS failed and took out our entire business," then that in itself is a shocking example of criminally bad IT management. But I suspect there is more yet to come out. In fact, I hope there is. So far, though, the communication from BA has been tight-lipped, brief and very limited.

Six lessons we can learn from the BA outage

In the absence of the full story, here are six lessons we can take away from the British Airways (and NHS) outages, assuming that "power supply problems" turns out to be the full story.

1. IT outages are no longer: "Sorry, our computers are down, can I call you back?" In today's digital world, IT outages impact real people, cost public money, cause horrible inconvenience, and put lives at risk.

2. IT is critical to your business. Don’t under-invest and hope the worst won’t happen. It just did to the NHS and British Airways. IT and digital operations are now everything. As Matt Hooper says, "Praying is not an acceptable IT strategy."

3. Agility in IT operations can prevent catastrophic failures. In a modern, agile, digital world, resilient practices enable smaller, faster, deliberate failure experiments and the building of better anti-fragile solutions. These are there to stop the sky falling in with catastrophic failures. Resilience and a lack of fragility are key.

4. But if the sky does fall in, Service Continuity Management—even just basic Disaster Recovery—is essential. Where is your DR plan? What happens if your current primary DC is hit by a plane, asteroid, explosion or earthquake? That plan should be documented and tested. How do you operate without IT?

5. IT Service Management should have prevented this, and IT Service Management could now be helping BA learn from this. Improvement comes from knowledge and knowledge comes from learning. It’s basic ITSM—avoid future incidents by analyzing the cause of previous incidents and applying changes to prevent them from happening again. In fact, so many aspects of ITSM seem to apply here: from major incident, through service continuity, to problem, knowledge and change management and then into continual service improvement.

6. Most importantly: All IT professionals need to be able to learn from BA. If BA fails to share a comprehensive and honest, detailed account of what really went wrong, then they are failing the global IT community and consciously putting us all at risk of future IT outages, in any industry. Honesty and transparency are important not only for brand trust and credibility, but also to ensure that the same mistakes are not being made by NASA, the military, the police, the electricity, water, and gas suppliers, or the nuclear power generators. In the scientific community, failures in experiments are shared as widely as successes, so that future scientists of any country or employer can avoid making the same mistakes. This basic culture of shared learning needs to be encouraged for the benefit of all IT operations everywhere. IT operations is too important to be hidden behind a curtain of silence.

The BA outage is slowly easing. Tweets and press reports suggest that systems are coming back online and the number of cancelled flights is gradually falling. I travelled this morning without any delay, but it’s not quite over. The dust has certainly not settled, and I sincerely hope that what comes next includes an honest sharing of mistakes made and lessons learned. If not, we are just waiting until the next major outage, and that could be much worse than spending a night on a departures hall floor.
