Last week, hotel booking service Room 77 was knocked offline in a mammoth cloud-computing outage that also crashed the websites of dozens of companies.
At 11:10 pm EDT on Friday 29 June, Room 77’s engineering team received an automated notification from its alert system that its website was down, which signaled the start of a 19-hour outage that also affected Netflix, Pinterest and Instagram.
Like many start-ups, Room 77 relies on web hosting and data services in the cloud, renting computing capacity from huge data centers run by Amazon Web Services (AWS).
On Friday evening, the AWS Service Health Dashboard revealed that the company had lost its main power supply as well as its backup power generator in the Washington, DC, metro area.
A glitch in time, ain’t Amazon prime
Two causes are to blame for the service disruption, says Amazon spokesman Drew Herdener: a lightning storm on the US East Coast that took out power to several main AWS data centers and their back-up generator, plus the extra second added to the world clock between Saturday and Sunday to make up for quirks in the Earth’s rotation.
Life’s a glitch and then you cry
At the time of the outage, most of Room 77’s staff members were having dinner or were at home. Within minutes, they began to react to the problem.
Says Kevin Fliess, vice president of products and general manager:
“As a web-based business you have to expect some unplanned downtime. Within ten minutes of the outage everyone on the team from technology to marketing knew the situation and we began working through a set of tasks to ensure that our customers were well supported during the outage.
“Our first priority was to ensure that our existing customers holding reservations were well supported.”
“During the outage we were able to provide support via email and through our toll-free number. Having these other support channels in place helped ensure smooth business continuity. If we had web-based support only, it would have been much more painful.”
Room 77’s servers were restored on Saturday at about 7:30 PM EDT.
Amazon looking more like a dwarf now?
AWS holds an estimated 80% of the cloud services market. Yet this week some experts were wondering if this is the end of AWS’s near-monopoly grip on the infrastructure-as-a-service market. They point to Research in Motion (RIM), which saw its loss of smartphone market share accelerate dramatically after an October 2011 global outage of service to BlackBerry devices.
This was not AWS’s first outage. Two weeks earlier, it had a six-hour outage. In April, it had a four-day “epic fail”. In March, a former AWS employee who left for travel search startup Hipmunk wrote publicly on online forum Reddit that AWS services were full of glitches. (Incidentally, Hipmunk’s servers are on AWS, but on AWS West rather than AWS East, so it avoided the power outage trouble.)
Google is a likely new rival to AWS. Last Thursday, the Internet behemoth announced its plans to launch a cloud-services platform called Compute Engine at prices that undercut AWS’s rate sheet on a like-for-like service basis.
Google is offering 3.75GB of memory with 1 virtual core and 420GB of hard disk space for $0.145 an hour, compared with Amazon’s nearly identical service (it has 10GB less hard disk space) at $0.16 an hour.
The closest offering by Rackspace, the second-largest provider after AWS, costs $0.24 an hour and includes a mere 4GB of hard disk space. It’s rumored in online forums that Microsoft will adjust the price list for its similar service to respond to the new market dynamics.
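To put those hourly rates in perspective, here is a rough sketch of what each would cost over a month. It assumes a single instance running nonstop for a 30-day month, which is our assumption rather than anything the providers quote, and it ignores the differences in disk space noted above.

```python
# Hourly list prices quoted in this article (USD per instance-hour).
HOURLY_RATES = {
    "Google Compute Engine": 0.145,
    "Amazon EC2": 0.16,
    "Rackspace": 0.24,
}

# Assumption (not from the providers): one instance running
# around the clock for a 30-day month.
HOURS_PER_MONTH = 24 * 30


def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Return the cost of running one instance for `hours`, rounded to cents."""
    return round(hourly_rate * hours, 2)


for provider, rate in HOURLY_RATES.items():
    print(f"{provider}: ${monthly_cost(rate):.2f} per month")
```

On these assumptions, the gap between Google and Amazon works out to roughly $11 a month per instance, a difference that matters more as the number of servers grows.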
Amazon’s Herdener says AWS prices its plans by depth of reliability: the costliest plans include redundancies that distribute customers’ loads among multiple data centers, making outages up to 99 percent unlikely on AWS’s much-touted Elastic Compute Cloud (EC2) service.
Meanwhile, startups scared off by the cloud and worried that AWS and other services are too broad or risky for their needs might consider buying a Storage Area Network (SAN), which comes with plenty of data redundancies (including back-up power supplies) for about $50,000 from a supplier like HP or IBM.
Developing back-ups for the back-ups
At Room 77, Fliess says the company is taking several measures to ensure that this does not happen again.
“For example, we will be investing in greater data center redundancy across geographical regions, improving monitoring and alert systems, and providing better customer support.”
The lightning bolt that hit AWS is an opportunity for companies that can help startups back up their data seamlessly, instead of keeping all their data in one basket.
New startups, like CliQr, claim to have technology to allow companies to easily move internal enterprise applications between clouds, such as “private clouds” run on their own machines, or “public clouds” that could be operated by multiple cloud service providers like Amazon.com, Google, Microsoft and Rackspace.
In Room 77’s case, Fliess plans for greater “data center redundancy across geographical regions” and “improving monitoring and alert systems.”
“As a start-up we’ve leveraged AWS as a cost-effective channel for data storage and computing capacity.
“AWS as well as their competitors are offering services from multiple datacenters, although it costs a lot more to deploy servers across multiple geographical regions.
“As our traffic grows, we are increasing our infrastructure investment in order to create redundancy across geographical regions. That way, if a catastrophic event knocks out a single data center, we'll have uninterrupted web and mobile operations.”
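For readers curious what redundancy across regions means in practice, the idea can be sketched in a few lines. This is a minimal illustration and not Room 77's actual setup: the region names and the `pick_region` function are hypothetical, and a real deployment would also need health checks, data replication and traffic routing.

```python
from typing import Mapping, Optional, Sequence


def pick_region(health: Mapping[str, bool],
                preference: Sequence[str]) -> Optional[str]:
    """Return the first region in `preference` reported healthy, or None.

    `health` maps a region name to True (up) or False (down).
    Regions missing from `health` are treated as down.
    """
    for region in preference:
        if health.get(region, False):
            return region
    return None


# Hypothetical example: the primary East Coast region is down,
# so traffic is served from the West Coast instead.
status = {"us-east": False, "us-west": True}
print(pick_region(status, ["us-east", "us-west"]))
```

With only one region in the list, an outage there leaves nowhere to send traffic, which is the situation many start-ups found themselves in.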
Twitter as emergency customer tool
Many companies used Twitter to keep their consumers updated, shining a spotlight on social media’s importance as a back-up customer service arm for many start-ups.
Two hours into the crisis, Room 77 tweeted the bad news to its customers. Room 77 was offline for 19 hours, a frustratingly long time for a consumer-facing company not to be able to reach customers. Social media may earn a reputation for being handy in a pinch.
NB: Lightning bolt image via Shutterstock.
Clarification: This post was updated on 12 July to redact a reference to IHG Hotels. A representative of the company had originally confirmed that IHG had experienced downtime but another representative has clarified that this was not the case.