
Top 7 IT Environment Outages from Last Year

02 MAY, 2018

Contributed by Author TJ Simmons

The backbone of business today is digital, and every day we rely more and more on the IT services and infrastructure around us. When things go wrong, however, they can go wrong dramatically, leading to loss of service, loss of revenue, unhappy customers, brand damage, and potentially even worse; in some cases, lives are at stake.

With that in mind, here are our top seven IT environment outages of 2017, in chronological order.

 

1 – GitLab, January 31
Cause: Accidental Removal of Database

GitLab had an outage lasting nearly 18 hours on January 31st. It impacted my own team: we couldn't push our changes to the repository for a day. Thankfully, it didn't coincide with a major or scheduled release, and we waited it out without incident.

In our case we lost no data. Some weren't so lucky: according to GitLab, the company lost "six hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com." Nevertheless, GitLab earned a lot of goodwill for its transparency in handling the outage and, to its credit, defended "team-member-1," the experienced DBA who made the mistake.

 

2 – Amazon Web Services, February 28
Cause: Wrong Command Accidentally Typed

On February 28th, an Amazon Web Services engineer was trying to debug the S3 storage system in the provider's Virginia data center when he mistyped a command. The result? It felt like half the internet was down for four hours. Affected systems included enterprise platforms like Slack, Zendesk, and Trello, as well as consumer-facing ones like Quora.

What made things worse was that the AWS Service Health Dashboard (which AWS customers check for updates on outages) was itself broken: it showed that only S3 was having issues, when in reality multiple AWS services weren't working properly. AWS reported that the dashboard error occurred because "the icons for the many AWS services were hosted in the Northern Virginia region." That's the same region where the outage hit the affected services. Yes, this outage was so bad that AWS couldn't even keep its own health dashboard working.

AWS took preventive measures after this, such as running audits on its operational systems. It even adjusted one such system “to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.” In other words, Amazon doesn’t want to undergo the embarrassment of the 2017 outage ever again.
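The safeguard AWS described can be pictured as a simple invariant: never honor a capacity-removal request that would take a subsystem below its minimum. The sketch below is purely illustrative; the `Subsystem` type and `remove_capacity` function are hypothetical names, not AWS internals.

```python
# Hypothetical sketch of a capacity-removal guard: refuse to remove
# more servers than a subsystem can spare. Names are illustrative,
# not real AWS tooling.

from dataclasses import dataclass

@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # minimum servers needed to serve traffic safely

def remove_capacity(subsystem: Subsystem, count: int) -> int:
    """Remove up to `count` servers, never going below the minimum.

    Returns the number of servers actually removed.
    """
    removable = max(0, subsystem.active_servers - subsystem.min_required)
    to_remove = min(count, removable)
    subsystem.active_servers -= to_remove
    return to_remove

# An operator typo asking for far too many removals gets clamped:
index = Subsystem("s3-index", active_servers=100, min_required=80)
removed = remove_capacity(index, 500)  # typo: the operator meant 5
print(removed, index.active_servers)   # removes only 20, leaving 80
```

The point of the guard is that the blast radius of a mistyped command is bounded by policy, not by the operator's attention.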

 

3 – WhatsApp, May 4
Cause: Unknown (According to Speculation, Updates Were at Fault)

On the same day that Facebook CEO Mark Zuckerberg announced that WhatsApp had more than 175 million users, WhatsApp went down for four hours. Users loudly complained about how little information Facebook was sharing about the outage. Even the official WhatsApp Status Twitter account, which appeared to have been dormant for at least a year, stayed silent on the matter.

After WhatsApp went back online, Reuters reported that the company’s spokesman only acknowledged the outage and mentioned that they “have now fixed the issue and apologize for the inconvenience.” WhatsApp gave no reason for the outage, but people speculated that the outage was due to an update.

This outage teaches a valuable lesson: when the makers of a heavily used service roll out a new release, the rollout should be planned as carefully as the development of the release itself. It's also worth noting that people can forgive, and even empathize with, human error, as the GitLab accidental deletion incident showed. But you must own up to it.

When you deliberately make changes, people have expectations about the smoothness of those changes. It’s a no-win situation, of course. Do it well, and you’re just meeting those expectations. But do it badly, and you incur the wrath of many—in WhatsApp’s case, possibly hundreds of millions.

This means that managing your test environment, and testing the update there prior to release, is crucial for a smooth rollout. If you have an ecosystem like Facebook's, with multiple interconnected systems, managing a test environment manually becomes even more challenging. Automated testing and automated management of the test environment are key.

 

4 – British Airways, June 2
Cause: Accidental Power Shutoff by Contractor

Unlike the other cases so far, this one was decidedly low-tech. But the outage was just as severe, if not more so. On June 2, British Airways left 75,000 people stranded in UK airports over the weekend. All flights out of London's Heathrow and Gatwick were canceled. Computer systems were knocked out. Flight operations were disrupted. Call centers and websites went down under an overwhelming number of requests from customers.

The cause? A contractor working in the British Airways data center accidentally shut down a power supply unit.

If you think you have difficult customers, try having 75,000 angry passengers with disrupted flight plans. Experts warned that the final cost of this outage would be “significant” for the UK flag carrier. It was a fair assessment. British Airways ended up paying “£200 for a hotel room (with couples expected to share), £50 for transport between the airport and the hotel, and £25 per day for meals and refreshments” for every affected passenger.

 

5 – Marketo, July 26
Cause: Someone Forgot to Renew the Domain

Maybe you haven’t heard of Marketo. If that’s the case, let me tell you a bit about them. Marketo has been a publicly listed company since 2013. It’s widely considered to be at the forefront of the market for small-business lead management—so much so that by 2017, Gartner’s Magic Quadrant had named Marketo a leader in CRM lead management for the sixth year in a row.

So what happened? Well, they forgot to renew their domain. As a result, businesses using Marketo's services sent out email campaigns full of completely broken links. It was a Marketo customer who spotted the expired domain and renewed it before anyone else could. It's bad enough to have an outage; needing your customer to bail you out is especially embarrassing.

Even the renewal couldn’t completely resolve the issue, as DNS propagation can take up to 48 hours. The Marketo team probably had to ask ISPs to flush their DNS caches.
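Expiry lapses like this are easy to catch with a scheduled check. The sketch below is a hedged illustration: it shells out to the `whois` CLI (which must be installed), and WHOIS output formats vary by registrar, so the "Registry Expiry Date" field it parses is one common format, not a universal one.

```python
# A hedged sketch of domain-expiry monitoring, the kind of scheduled
# check that would have caught an unrenewed domain. WHOIS output is
# registrar-specific; "Registry Expiry Date" is one common field name.

import re
import subprocess
from datetime import datetime, timedelta, timezone

EXPIRY_RE = re.compile(r"Registry Expiry Date:\s*(\S+)", re.IGNORECASE)

def parse_expiry(whois_text: str):
    """Extract the expiry timestamp from WHOIS output, if present."""
    m = EXPIRY_RE.search(whois_text)
    if not m:
        return None
    return datetime.fromisoformat(m.group(1).replace("Z", "+00:00"))

def days_until_expiry(domain: str) -> float:
    """Shell out to the `whois` CLI and compute the days remaining."""
    out = subprocess.run(["whois", domain],
                         capture_output=True, text=True).stdout
    expiry = parse_expiry(out)
    if expiry is None:
        raise ValueError(f"no expiry date found for {domain}")
    return (expiry - datetime.now(timezone.utc)) / timedelta(days=1)

# e.g. run daily from cron and alert while there is still time to act:
# if days_until_expiry("example.com") < 30:
#     page_the_on_call_team()  # hypothetical alerting hook
```

Registrars also offer auto-renewal and expiry reminders; a check like this is a backstop for when those are misconfigured or the billing contact has left the company.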

Marketo’s outage had to be on this list, simply because, for some customers, the outage may have lasted as long as 48 hours.

 

6 – WhatsApp, November 3
Cause: Unknown. Again.

Yup, it happened again: WhatsApp went down for the second time in less than six months. Thankfully, clocking in at one hour, this outage didn't last as long as the May one. But that's where the differences end. Everything else was the same, from the lack of information about when things would be back up to the lack of clarity about the cause—and, again, the speculation that it had to do with software issues. Compare this with GitLab's constant tweeting about its progress in recovering its database.

WhatsApp had two outages in the span of less than six months. That suggests the issues are systemic in nature. And when issues are systemic, they can be found and fixed permanently. I cannot emphasize enough how important it is to permanently fix preventable issues: as we revisit all these outages, the outcomes are almost always negative. You've got to prevent them.

 

7 – Marketo, November 22
Cause: Firewall Issues from Data Center

Yup, another Marketo outage. Less than four months after they forgot to renew their domain and faced the outage mentioned earlier, they faced another outage lasting several hours. The length of this outage wasn’t as bad, but the timing was awful. It happened when marketers were preparing for the weekend of sales activity following the Thanksgiving holiday—in other words, the weekend of Black Friday.

Once again, this certainly sounds like something that was entirely preventable. The company announced a root-cause analysis shortly after restoring its services, which should be a basic requirement for every outage on this list.

 

Conclusion: Minimize Preventable Outages

Of all these outages, the most preventable should be the least forgivable. Systemizing your release management and environment operations processes is therefore key to preventing future preventable outages.

Not sure what features constitute a good release management system? Take a look at the Enov8 release management datasheet. Enov8 offers a holistic approach to release management, ranging from at-scale Enterprise Release Management (Agile Release Trains) through Implementation Planning to Agile Deployment Operations and Automation (including CI/CD).

If you have your own favorite outages, weigh in! Which do you think were the worst?

