Amazon Elastic Compute Cloud hiccup takes down Airbnb, Foursquare, and other startups for hours

Many digital companies, including travel startups like Airbnb and Foursquare, have been knocked offline for hours during the business day today.

The startups rely on web hosting and data services provided “in the cloud” on EC2, Amazon’s service for renting computing capacity in its huge data centers.

In its most recent update, Amazon Web Services says it is “experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region.”

Some analysts suspect the cause is a bug with provisioned IOPS in EBS (Elastic Block Store).

One travel startup reports that its “volumes that don’t have provisioned iops are totally fine.” Companies that set up a multi-region fault-tolerant configuration are also fine.
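For readers wondering what a multi-region fallback looks like in practice, here is a minimal sketch of the idea, not any company's actual setup. The hostnames and the simulated health check are hypothetical, invented purely for illustration; a real deployment would probe each region over the network and would usually handle failover at the DNS or load-balancer level.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical hostnames, listed in order of preference.
ENDPOINTS = [
    "app.us-east-1.example.com",
    "app.us-west-2.example.com",
]


def simulated_health_check(endpoint: str) -> bool:
    """Stand-in for a real probe: pretend the us-east-1 deployment is degraded."""
    return "us-east-1" not in endpoint


# With us-east-1 marked unhealthy, traffic falls through to the second region.
print(pick_endpoint(ENDPOINTS, simulated_health_check))
```

The point of the sketch is only that a service with a healthy copy in a second region has somewhere to send traffic when the first one degrades.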

While we don’t yet know the cause of this crash, a recent academic study found that vulnerable software includes “Amazon’s EC2 Java library and all cloud clients based on it.”

Whatever the reason, the disruption is enough to make a startup think about at least having a spare server.

Airbnb had a Zen tweet about the news:

Apologies. Our site is having a case of the Mondays… We’ll Airbrb as soon as possible

AWS holds an estimated 80% of the cloud services market. But given its recent series of crashes, one could be forgiven for wondering if it’s doing enough to prevent offline moments.

On the other hand, the cloud has brought advantages. Cost savings is one. Hosting 1 terabyte of data a decade ago could cost $1 million a year and now it costs $50, according to Jim Davidson of Farelogix.

AWS’s SLA sets an annual uptime percentage of 99.95%, which translates to about 263 minutes (roughly 4.4 hours) of downtime per year.
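The downtime figure follows directly from the uptime percentage. As a quick check, the short sketch below works through the arithmetic, assuming a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an uptime percentage leaves room for."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


minutes = allowed_downtime_minutes(99.95)
print(f"{minutes:.1f} minutes per year, or about {minutes / 60:.1f} hours")
```

At 99.95%, the remaining 0.05% of 525,600 minutes comes to 262.8 minutes, which rounds to the 263 quoted above.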

Switching hosts isn’t easy either. Moving EC2 instances would require copying hundreds of gigabytes of files for startups the size of Airbnb.

As Tnooz contributor Steven Joyce of Rezgo has pointed out:

10 years ago these start-ups were hosting on dedicated managed servers in a single data center somewhere. Only these data centers were small and expensive and suffered from (arguably) more outage issues.

We hosted with a data center whose building power was cut off by a construction accident. It took over 12 hours to get service restored. I’m not saying crashes should be tolerated, I’m just saying that they happen and to assume otherwise is foolish.

One company benefiting from the crisis seems to be PagerDuty, which provides SaaS tools for IT on-call schedule management, alerting, and incident tracking.

About Sean O'Neill

Sean O’Neill is a UK-based reporter for Tnooz.

Since university, he's been a full-time journalist for US consumer magazines and websites, and since 2007 he has covered B2C travel news full-time.

He lives in London and is travel tech columnist for BBC Travel. He used to work in New York City as the online senior editor for Arthur Frommer’s Budget Travel.

In the past, O'Neill held editor, writer, and reporter positions at Kiplinger’s Personal Finance and Foreign Policy magazines in Washington, DC.

Comments

  1. This kind of downtime is totally unwarranted, since there are plenty of technologies that let anyone using AWS deal with outages through “failover”: not putting all your servers in one facility, and being able to switch to other datacenters in real time.

    Also, this is the second major outage in the N. Virginia datacenter this year, yet all these companies are stacking most of their services there with no high availability? No failover? Come on.

    http://www.brandonholtsclaw.com/blog/2012/how-not-to-fail-at-the-cloud/
    http://benjaminkerensa.com/2012/06/30/reflecting-on-netflix-instagram-pinterest-downtime

  2. Sean O'Neill says:

    Thanks for the comment, Benjamin.

  3. Phil says:

    Here is a helpful tool, free to the public, for checking the real-time health status of Amazon Elastic Compute Cloud (EC2) in all regions: http://www.systemswatch.com (a good site for finding out quickly whether the problem is you or AWS).

  4. I have to agree with Benjamin.

    One of the beautiful properties of the ‘AWS-cloud’ is that it’s extremely simple (relatively of course) to geographically distribute load for high-availability/failover cases by using different Availability Zones in one or several regions.

    It’s really beyond me why established companies such as Airbnb and Foursquare appear to have neglected this essential part of their infrastructure.
