The incident: Amazon S3 outage

On February 28th, 2017, the Amazon Simple Storage Service (commonly known as Amazon S3) was unavailable for several hours. The incident affected, among others, many companies whose businesses depend on their online presence. Some of them were frozen during the Amazon S3 outage, unable to keep their business up and running. Among them were services that many of us use every day, like Slack, Quora and Lonely Planet.

Many people rushed to say that all of AWS was failing, but that’s a false perception. Amazon S3 only had problems in the Northern Virginia (US-EAST-1) Region.

S3 is one of AWS’s most popular services. It’s so popular that even other AWS services rely on it to some degree: if we look at the AWS Status dashboard for Feb 28th, we’ll see that more services started partially failing or stopped working properly (among them Amazon Athena, Amazon EMR, Amazon Kinesis Firehose, Auto Scaling and AWS CloudFormation).

On March 2nd, AWS published a note for its customers explaining what went wrong.

Due to a human error (an incorrect input entered on the command line), a large set of servers was removed unintentionally. These servers support two S3 subsystems, one of which manages the metadata and location information of all S3 objects in the region. That triggered a significant loss of capacity, and while the S3 APIs were unavailable, other services that rely on S3 for storage were impacted, including the S3 console and new EC2 instance launches.

The Amazon S3 outage affected just one of the 16 regions AWS offered at the time: Northern Virginia (US-EAST-1), where some services were down for 4 hours and 20 minutes. The other regions continued running smoothly, and that’s why some other companies that rely on AWS for their critical systems, like Netflix, didn’t have any downtime.

Understanding Regions and Availability Zones on AWS

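To understand why the Amazon S3 outage affected only some companies, we have to clarify the difference between AWS Regions and Availability Zones (AZs). A Region is a separate geographic area, and each Region contains multiple isolated Availability Zones.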

Amazon S3 is a service that is provided on a regional level. That means that when we create an S3 bucket, we can select the region where we want to deploy it. Then, AWS automatically uses the available AZs in that region to keep the data available and safeguarded.
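
As a small illustration of S3 being a regional service, here is a hedged sketch of how the region is chosen when a bucket is created through boto3's create_bucket call. The helper name create_bucket_kwargs and the bucket name are hypothetical; the helper only builds the arguments and does not contact AWS.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the boto3 S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be sent as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


if __name__ == "__main__":
    # "example-backup-bucket" is a placeholder, not a real bucket.
    print(create_bucket_kwargs("example-backup-bucket", "eu-west-1"))
```

With boto3 installed and credentials configured, the result could be passed on as boto3.client("s3", region_name=region).create_bucket(**kwargs).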

How can we avoid regional service failures?

To avoid downtime when incidents like the Amazon S3 outage occur, we have several options. Every solution has to weigh a number of factors: the criticality of our service, the economics of service unavailability, etc.

We would like to present one of them to you (it’s not the only one!): configuring DNS failover with Route 53 and enabling cross-region replication.

By configuring DNS failover with Route 53, you’ll be able to detect when an S3 bucket is down and automatically divert requests to a bucket in another region. To do that, you need the same data in two different regions, which you can achieve by enabling cross-region replication. With these two measures, you’ll be covered in case of a similar outage in the region that hosts your data.
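
The two pieces described above can be sketched as plain Python dictionaries shaped for boto3's put_bucket_replication and change_resource_record_sets calls. This is only a sketch under assumptions, not a tested setup: all ARNs, hostnames and the health check ID are hypothetical placeholders, and the failover targets stand for whatever endpoint fronts each bucket (for example, a CloudFront distribution, as mentioned in the comments). Replication also requires versioning to be enabled on both buckets, and it applies to objects written after the rule is in place.

```python
def replication_configuration(role_arn, destination_bucket_arn):
    """Build a ReplicationConfiguration that copies every new object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [{
            "ID": "replicate-everything",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix matches all keys
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": destination_bucket_arn},
        }],
    }


def failover_record(name, target, role, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, secondary_target, health_check_id):
    """Pair a health-checked primary record with a secondary fallback."""
    return {
        "Comment": "Fail over to the replica region when the primary is unhealthy",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", health_check_id),
            failover_record(name, secondary_target, "SECONDARY"),
        ],
    }


if __name__ == "__main__":
    # Every value below is a placeholder chosen for illustration.
    print(replication_configuration(
        "arn:aws:iam::123456789012:role/example-replication-role",
        "arn:aws:s3:::example-replica-bucket",
    ))
    print(failover_change_batch(
        "assets.example.com.",
        "primary.example.net",
        "secondary.example.net",
        "00000000-0000-0000-0000-000000000000",
    ))
```

With boto3, the first dictionary would be passed as ReplicationConfiguration to the S3 client's put_bucket_replication, and the second as ChangeBatch to the Route 53 client's change_resource_record_sets, together with the hosted zone ID. Route 53 answers with the secondary record only while the health check attached to the primary record is failing.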

This solution only covers problems with Amazon S3 but, as we saw before, other services in the same region were affected by the outage. We have to think about those other services too (EC2 failure, EBS failure, etc.), and that can be a complex and daunting task, although it is possible to design with regional failures in mind.

The key, in this case, is working with a Cloud companion, a partner that knows and understands your business requirements. A good SysArchitect will help you set up an infrastructure that serves requests from other regions with your needs and constraints in mind (DR, pilot light, active-active scenario, etc.). The cloud enables us to build infrastructures designed to meet your requirements while remaining cost-efficient. If you have any doubts or need someone to help you build an always-available infrastructure, you can rely on our great team of SysArchitects.

TAGS: Amazon S3 outage, amazon web services, aws, az, downtime, high-availability, regions, S3

Comments
sohail | January 12, 2019 7:16 am

If an S3 bucket in one region is impacted, then the traffic diverts to the replication partner region. Is the replication partner in read-only mode, or can users still modify the objects? If they can, what’s the behavior once the primary S3 bucket is restored? Does a sync of the delta need to be initiated from the partner to the restored S3 bucket?

Reply
A. Yamin | April 26, 2018 7:22 am

The DNS failover with S3 does not work. Please remove that from your post or provide detailed step by step configuration. (I suggest you practically test it first)

Reply
P. Puig | April 26, 2018 9:08 am

Hello, could you elaborate on what you tried and how you tested it? Thanks.

Reply
A. Yamin | April 26, 2018 3:07 pm

P.S. happy to see you prove me wrong 🙂

Reply
A. Yamin | April 26, 2018 3:05 pm
Reply
Emma Briones | April 30, 2018 11:20 am

Hi A.Yamin,

We don’t see any contradiction between what is mentioned in the post you’re sharing and what we are stating in our post.
If you want to use HTTPS in an S3 bucket with your own domain name, you need to use CloudFront.

Anyway, this was just an informative post on how to prevent S3 outages, not a step-by-step guide.
Seeing that a step-by-step guide may be of interest to our readers, we’ll take it into account and we’ll try to write one in the future. Stay tuned.

Thanks for taking the time to read our blog and interact.
