Contributor

Lessons learned from the recent AWS S3 outage

Opinion
Mar 13, 2017

Cross-region replication and backups can help applications survive regional cloud service outages.

Amazon S3 underpins many AWS services, including AWS Lambda, Elastic Beanstalk, and Amazon’s own Service Health Dashboard. It also serves as an object and media store for many other internet services that rely on it every day.

On February 28, 2017, AWS experienced an hours-long outage of the Amazon S3 service in the US-EAST-1 region. That created a cascading effect of outages across a good chunk of the internet, including services like Docker Hub.

A human error turned out to be the root cause:

At 9:37 AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that are used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended.

As it turns out, there is a common misconception about the difference between durability and availability. Durability measures how reliable the storage is and answers the question “Am I going to lose my data?” Availability, on the other hand, measures how accessible the data is, i.e. “Am I going to be able to retrieve my data?”

AWS S3 offers 99.999999999% durability within a single region. Going by Amazon’s own example, that means if you store 10,000 objects in S3, you can on average expect to lose a single object once every 10 million years. Amazon S3 accomplishes this by replicating the data across multiple facilities within a region.

Standard S3 availability of objects, on the other hand, is 99.99% per year within a region. That means in any given 12-month period you should expect a total of about 52 minutes and 33 seconds of not being able to access your data.
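Both figures are easy to sanity-check. The short sketch below simply redoes the arithmetic from the two paragraphs above in Python:

```python
# Durability: 99.999999999% ("eleven nines") implies an annual
# object-loss probability of 1e-11. With 10,000 objects, the expected
# loss rate is 10,000 * 1e-11 = 1e-7 objects per year -- one object
# roughly every 10 million years.
annual_loss_probability = 1e-11   # 1 - 0.99999999999
objects = 10_000
years_per_expected_loss = 1 / (objects * annual_loss_probability)
print(f"~1 object lost every {years_per_expected_loss:,.0f} years")

# Availability: 99.99% leaves 0.01% of a 365-day year unavailable.
downtime_minutes = (1 - 0.9999) * 365 * 24 * 60
minutes, seconds = divmod(downtime_minutes * 60, 60)
print(f"{int(minutes)} min {seconds:.1f} s of expected unavailability per year")
```

Running this reproduces the roughly 52.5 minutes of allowable annual downtime mentioned above.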

AWS offers both IaaS and PaaS services. At the IaaS level, AWS customers have full control over their virtual servers and networks. They can configure any software and services they desire, and they manage them on their own. Any outage at this level is the customer’s responsibility.

At the PaaS level, AWS offers fully managed platform services such as object storage, databases, queues, and so on. The client delegates responsibility for the availability and durability of these services to the managed service provider, in this case AWS. AWS platform services that are consumed via proprietary APIs are particularly vulnerable to a regional outage caused by human error at AWS.

Human error can cause an outage anywhere: on-premises, in the cloud, managed, or self-hosted. Consider the recent Delta computer outage as an example of an entire self-hosted system going down. Delegating the responsibility for managing a platform service to a cloud provider doesn’t change the fact that human error can bring it down, but it does amplify the impact. Whereas the Delta outage affected only Delta, the AWS S3 outage affected a good chunk of the internet.

Fortunately, AWS S3 offers ample tools for reducing the impact of an outage. Let’s consider just a few.

S3 cross-region replication

Data stored in a particular S3 region is replicated across all availability zones and can sustain an outage in any one zone. It can’t, however, survive an outage of an entire region, such as the one that happened on February 28. Replicating S3 objects across geographic regions helps satisfy increased redundancy requirements.
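A minimal sketch of what a cross-region replication configuration looks like with boto3 is shown below. The bucket names, region, and IAM role ARN are placeholders, not real resources, and versioning must already be enabled on both the source and destination buckets:

```python
import json

# Sketch of an S3 cross-region replication configuration. The role ARN
# and bucket names below are hypothetical placeholders.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
        {
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = replicate all objects
            "Destination": {
                # Destination bucket lives in a different region, e.g. us-west-2
                "Bucket": "arn:aws:s3:::my-backup-bucket-us-west-2",
            },
        }
    ],
}

# Applying it would look roughly like this (requires AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   s3.put_bucket_replication(
#       Bucket="my-primary-bucket",
#       ReplicationConfiguration=replication_config,
#   )

print(json.dumps(replication_config, indent=2))
```

Once the rule is in place, new objects written to the source bucket are copied asynchronously to the destination region.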

Backups

Cross-region replication can help increase availability, while backups to Amazon Glacier can contribute to increased durability. Conveniently, AWS offers an automatic lifecycle mechanism to back up objects in S3 to Glacier.

Consider content distribution with CloudFront

If your S3 objects are frequently accessed, it may make sense to configure AWS CloudFront to serve them. CloudFront caches the data at edge locations close to where users need it most and may help alleviate the effects of an S3 outage in some use cases.
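The rough shape of a CloudFront distribution fronting an S3 bucket is sketched below. The bucket name, origin ID, and caller reference are placeholders, and a real `create_distribution` call requires additional fields beyond this minimal outline:

```python
import json

# Outline of a CloudFront distribution config with an S3 origin.
# All names here are hypothetical placeholders.
distribution_config = {
    "CallerReference": "cloudfront-s3-demo-001",  # must be unique per request
    "Comment": "Serve my-primary-bucket through CloudFront edge locations",
    "Enabled": True,
    "Origins": {
        "Quantity": 1,
        "Items": [
            {
                "Id": "my-s3-origin",
                "DomainName": "my-primary-bucket.s3.amazonaws.com",
                "S3OriginConfig": {"OriginAccessIdentity": ""},
            }
        ],
    },
    "DefaultCacheBehavior": {
        "TargetOriginId": "my-s3-origin",
        # Redirect plain-HTTP viewers to HTTPS at the edge
        "ViewerProtocolPolicy": "redirect-to-https",
    },
}

# A real deployment would call (requires AWS credentials):
#   import boto3
#   cf = boto3.client("cloudfront")
#   cf.create_distribution(DistributionConfig=distribution_config)

print(json.dumps(distribution_config, indent=2))
```

Once deployed, clients fetch objects from the nearest edge cache rather than directly from the S3 region.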

Final thoughts

Managed platform services are a cornerstone of cloud computing. Using one like S3 can reduce DevOps costs and help bring applications to market faster. While AWS has been extremely reliable over the years, Amazon has experienced self-inflicted outages in the past, and the recent S3 outage is no exception. Some combination of cross-region replication, backups, and content distribution should reduce the impact of such outages.

Oleg Dulin is a Big Data software engineer and consultant in the New York City area.

In 1997 Oleg co-founded the Clarkson University Linux Users Group. The group was influential in bringing open-source awareness to Clarkson and later morphed into a dedicated lab and curriculum called the Clarkson Open Source Institute. While at Clarkson, Oleg advocated for open source, Linux, and community, and helped with the construction of Clarkson’s first open-source high-performance computing cluster, called “The North Country.”

While at IBM T. J. Watson Research Center in 1999-2000 Oleg co-authored a paper on federated information systems that was presented at Engineering of Federated Information Systems (EFIS) conference in 2000. This R&D project involved building a proof-of-concept federated IS that integrated structured (SQL) and unstructured (multi-media) data under a single set of API and user interfaces.

From 2001 to 2003 Oleg worked as a data integration consultant at a major investment bank in NYC on a web portal for private banking. This project involved aggregation of secure financial data from multiple legacy databases and presenting it in a customizable web portal.

In 2004, while working at a startup called ConfigureCode, Oleg contributed to two patent applications involving construction and semantic validation of mixed-schema XML documents. This technology was utilized in a Data Capture and Tracking System for Human Resources data integration.

From 2005 to 2011 Oleg worked at a Wall St. company (see Oleg’s LinkedIn Profile for more details) where he was instrumental in improving data quality, reducing trading errors, implementing analytics and reporting within the context of an equities order management system. The system was a 24/7 high performance computing platform that processed billions of dollars worth of trade executions daily.

From fall of 2011 to the end of 2016, Oleg worked at Liquid Analytics as Cloud Platform Architect, where he was a thought leader in the implementation of a cloud-based PaaS for mobile business intelligence.

Presently, Oleg works at ADP Innovation Lab as Chief Architect.

The opinions expressed in this blog are those of Oleg Dulin and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
