I was recently reminded that Air New Zealand’s airline reservation system went out of service at about 9:30 am on Sunday morning, October 11, 2009. This story is very interesting to me, since my team is building just such a system (with a very different underlying implementation).
IBM said that the outage happened because of a power failure at an IBM data center in Newton that took out the mainframe. Many existing airline reservation systems run on a single IBM mainframe; mainframes are known for being rock-solid reliable, but not without electricity! The power failure, in turn, was caused by a failed oil pressure sensor on a backup generator. What’s more, the problem happened during a scheduled maintenance session!
The outage affected more than 10,000 passengers, leaving airports “in disarray”. Most systems were restored around 1:30 pm [four hours later], but the passenger backlog did not start to clear until self check-in kiosks were up and running again at about 3:30 pm [six hours later]. Air New Zealand was, to put it mildly, furious.
As usual, people never (well, hardly ever) adequately test their redundant backup technology! In particular, they should have run the generators for long enough to expose this kind of failure. I recently heard about another such problem, in a discussion at ITA, where the backup generators worked, but not for long enough. (At least I think it was a distinct case, since I heard it a while ago, but I could be wrong.)
You must run these tests reasonably frequently, since components can break over time, and a latent fault can lie in wait until the moment you actually need the backup. I plan to write more about this in a future blog post.
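To make the point concrete, here is a minimal sketch of the kind of drill I have in mind. It’s Python, and the hook functions (`start_backup`, `backup_healthy`, `stop_backup`) are hypothetical stand-ins, not anything from a real data center or from our system; the point is simply that a drill should only count as a pass if the backup carries the load for the full required window, not just at startup.

```python
import time

# Hypothetical hooks: in a real facility these would switch load to the
# backup generator (or failover system) and sample its health sensors.
def start_backup():
    print("switching load to backup")

def backup_healthy():
    # e.g. poll oil pressure, voltage, and load; stubbed out here
    return True

def stop_backup():
    print("switching load back to primary")

def run_drill(duration_s=4 * 3600, poll_s=60):
    """Run the backup under real load for the full window.

    A test that only confirms the backup *starts* would have missed a
    failure like the oil pressure sensor; the backup has to keep
    running long enough for slow failures to show themselves.
    """
    start_backup()
    deadline = time.time() + duration_s
    try:
        while time.time() < deadline:
            if not backup_healthy():
                # The latent fault surfaced during a drill,
                # not during a real outage. That is the win.
                return False
            time.sleep(poll_s)
        return True
    finally:
        stop_backup()
```

The exact duration and polling interval are placeholders; what matters is that the test window is long enough to catch the failures that only appear under sustained load.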
I don’t know, but I’ve been told: most, if not all, airlines do not actually have a disaster-recovery setup that would switch over to a geographically distant site. Evidently airlines are surprisingly “penny wise and pound foolish” when it comes to redundant components, which they are loath to pay for. (I think we were talking about network connections, but it’s too long ago for me to remember clearly. The same principle applies across the board.)
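For a sense of what a geographically distant failover can look like at the application level, here is a minimal, hypothetical sketch: a client that tries the primary site first and falls over to a distant secondary. The site URLs and the layout are invented for illustration; real reservation systems would do this at many layers, not just in one client.

```python
import urllib.request

# Hypothetical endpoints: a primary data center and a geographically
# distant disaster-recovery site. These URLs are invented.
SITES = [
    "https://res-primary.example.com",
    "https://res-dr.example.com",
]

def query(path, timeout_s=5):
    """Try each site in order, falling over to the distant replica
    if the primary is unreachable or errors out."""
    last_error = None
    for site in SITES:
        try:
            with urllib.request.urlopen(site + path, timeout=timeout_s) as resp:
                return resp.read()
        except OSError as e:  # covers URLError and socket timeouts
            last_error = e    # this site is down; try the next one
    raise RuntimeError(f"all sites failed: {last_error}")
```

Of course, the hard (and expensive) part is not the failover logic but keeping the distant site’s data current enough to take over, which is exactly the cost airlines seem reluctant to pay.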
Afterward, apparently IBM’s main job was to grovel. Air New Zealand, in the person of CEO Rob Fyfe, said in strong language that IBM took a long time to react, accept responsibility, and apologize. He called IBM “amateur”, which is quite an insult for IBM, and said that his IT team was already looking for alternative suppliers. (I don’t know how that turned out.)
IBM did apologize by the evening of the next day, saying they “immediately engaged a team of 32 local IT professionals, supported by global colleagues.” This means Mr. Fyfe considers two working days to be a very long time for such an apology. Perhaps he was mainly putting on a public show of anger, intended more for his customers and shareholders than for IBM. But I don’t think that actually matters from IBM’s point of view, and as someone on a team building such a system, that’s the point of view I am most concerned with.
(By the way, back at MIT in the late 1970s, when a guy from Digital Equipment showed up for “preventive maintenance” on one of our timesharing systems (the removable disks on the MIT-MC KL/10), we called it “causative maintenance”. He once made a mistake that caused a lot of trouble.)
I don’t mean this just to rag on IBM. Making systems that run 24/7 is quite difficult and expensive. Our team can perhaps learn a little from this: if nothing else, it is one more data point about the cost of a failure in an airline reservation system, and hence about the cost/benefit of preventing one. When airlines talk about high availability, they are not kidding!