by David Marshall

Amazon’s S3 data storage service hit with outage

analysis

Feb 16, 2008

In 2006, Amazon.com launched several pay-as-you-go services aimed at the developer community: Simple Storage Service (S3), which offers unlimited Internet storage; Elastic Compute Cloud (EC2), which lets developers create and manage virtual machine instances; and Simple Queue Service, for message delivery.

For the most part, these services have been fairly robust and have worked as advertised. They have also remained fairly inexpensive, since Amazon does not back them with a ‘5-nines’ SLA guarantee.

However, Amazon Web Services was dealt a substantial blow yesterday when a temporary outage reportedly took down thousands of Web sites, each of which relied on the company’s hosted storage solution. Reports of Amazon’s S3 service being unavailable appeared around 4:30 AM PST, and the service was restored within about three hours.

A message thread titled ‘Massive (500) Internal Server Error.outage started 35 minutes ago’ quickly popped up on the Amazon Web Services Forum, where users of the service demanded answers.

Comments included: “and this is why you have to setup a fail-safe”, “the s3 service is great but this just proves you can’t rely on it”, “My business is effectively closed right now because Amazon did something wrong. I’ll have to reconsider using the service now.”, and “Errors happen, but there MUST be a fail-safe way of reporting them.”

Not all of the comments were negative. Some said, “While I’m surprised this kind of error is possible, a big thanks to Amazon for getting onto this so quickly” and “its at least good that amazon fixed the issue within 2 hours thats fast if you compare other companys that might fix it in a day or two”.

The Amazon Web Services Team posted comments throughout the thread and eventually responded with the following update:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
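The gap Amazon describes, steady overall volume hiding a shift toward more expensive authenticated requests, can be illustrated with a small sketch. The Python below is not Amazon's monitoring code; the class name `AuthRatioMonitor`, the window size, and the alert threshold are all hypothetical, chosen only to show the idea of watching the mix of requests rather than just the count.

```python
from collections import deque


class AuthRatioMonitor:
    """Tracks the share of authenticated requests among the most recent ones.

    Hypothetical illustration only: the window and threshold values are
    assumptions, not figures from Amazon's statement.
    """

    def __init__(self, window: int = 1000, threshold: float = 0.5):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # True means authenticated
        self.authenticated = 0

    def record(self, is_authenticated: bool) -> None:
        # When the window is full, the oldest entry is about to be evicted,
        # so remove it from the running count first.
        if len(self.recent) == self.window and self.recent[0]:
            self.authenticated -= 1
        self.recent.append(is_authenticated)
        if is_authenticated:
            self.authenticated += 1

    def ratio(self) -> float:
        return self.authenticated / len(self.recent) if self.recent else 0.0

    def should_alert(self) -> bool:
        return self.ratio() > self.threshold


monitor = AuthRatioMonitor(window=10, threshold=0.5)
for _ in range(10):
    monitor.record(False)   # ten anonymous requests fill the window
for _ in range(6):
    monitor.record(True)    # same request rate, but the mix shifts

print(monitor.ratio(), monitor.should_alert())
```

In this toy run the window always holds ten requests, so a volume-only check would see nothing unusual, while the ratio check flags the change once authenticated calls pass half of the window.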

If cloud computing is going to take off and become widely adopted, services like this need to reach 5-nines (99.999%) reliability and uptime, even if that means the price has to go up. At the very least, the option has to be made available to users. Until then, it sounds like most users would at least be happy with a dashboard and a notification system from Amazon that alerts them when something goes wrong. Without notification from Amazon, many customers spent time and money troubleshooting the problem on their end before finally realizing that it was on Amazon’s side of the fence.
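To put the 5-nines figure in perspective, the short Python sketch below works out how much downtime per year 99.999% availability allows and compares it with the outage window given in Amazon's statement (4:31am to 6:48am PST). The function name is made up for this illustration, and the calculation assumes a 365-day year.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Outage window as reported in Amazon's statement.
outage = timedelta(hours=6, minutes=48) - timedelta(hours=4, minutes=31)
outage_minutes = outage.total_seconds() / 60

budget = allowed_downtime_minutes(0.99999)

print(f"5-nines yearly budget: {budget:.2f} minutes")
print(f"Reported outage:       {outage_minutes:.0f} minutes")
print(f"Outage / budget:       {outage_minutes / budget:.1f}x")
```

A 5-nines guarantee leaves a little over five minutes of downtime per year, so the 137-minute outage alone would have used up roughly 26 years of that allowance.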
