The incident: Amazon S3 outage
On February 28th, 2017, the Amazon Simple Storage Service (commonly known as Amazon S3) was unavailable for several hours. The incident affected, among others, many companies whose business depends on their online presence. Some of them were frozen during the outage, unable to keep their business up and running. Among them were services that many of us use every day, like Slack, Quora and Lonely Planet.
Many people rushed to say that all of AWS was failing, but that’s a false perception. Amazon S3 only had problems in the Northern Virginia (US-EAST-1) Region.
S3 is one of AWS’s most popular services. It’s so popular that even other AWS services rely on it to some degree: if we look at the AWS Status dashboard for Feb 28th, we’ll see that more services started partially failing or stopped working properly (among them Amazon Athena, Amazon EMR, Amazon Kinesis Firehose, Auto Scaling and AWS CloudFormation).
On March 2nd, AWS published a note for its customers explaining what went wrong.
Due to a human error (an incorrect input entered on the command line), a larger set of servers than intended was removed. These servers supported two S3 subsystems, one of which manages the metadata and location information of all S3 objects in the region. That triggered a significant loss of capacity, and other services that rely on S3 for storage, including the S3 console and new EC2 instance launches, were impacted while the S3 APIs were unavailable.
The Amazon S3 outage affected just one of the 16 regions offered by AWS at the time: Northern Virginia (US-EAST-1), where some services were down for 4 hours and 20 minutes. The other regions continued running smoothly, and that’s why some other companies that rely on AWS for their critical systems, like Netflix, didn’t have any downtime.
Understanding Regions and Availability Zones on AWS
To understand why the Amazon S3 outage affected only some companies, we have to clarify the difference between AWS regions and Availability Zones.
- Region: a separate geographic area, completely independent from the other regions. At the time of writing, Amazon Web Services provides 16 regions spread worldwide. By default, resources aren’t replicated across regions unless you explicitly set that up. So a problem in one region can affect other services in that same region, but it is not supposed to spread to other regions.
- Availability Zone (AZ): each AWS region has multiple isolated locations known as Availability Zones. AZs in the same region are connected through low-latency links, so communication between them is fast. Many AWS-managed services replicate by default across the AZs of a region to offer high availability.
Amazon S3 is provided at a regional level. That means that when we create an S3 bucket, we select the region where it will live. AWS then automatically uses the available AZs in that region to keep the data available and durable.
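To make the regional scope concrete, here is a minimal sketch in Python. It is a toy model, not AWS code: the `Bucket` class, the bucket names and the `readable_during_outage` helper are all made up for illustration. It only captures the idea that a bucket lives in one region and survives a regional outage if a copy exists somewhere else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    """Toy model of an S3 bucket (hypothetical, for illustration only)."""
    name: str
    region: str                   # a bucket is created in exactly one region
    replica_regions: tuple = ()   # regions holding a cross-region copy, if any


def readable_during_outage(bucket: Bucket, failed_region: str) -> bool:
    """Return True if at least one copy of the data sits outside the failed region."""
    copies = (bucket.region, *bucket.replica_regions)
    return any(region != failed_region for region in copies)


# Hypothetical buckets: names and layout are assumptions.
buckets = [
    Bucket("assets-single-region", "us-east-1"),
    Bucket("assets-replicated", "us-east-1", ("us-west-2",)),
    Bucket("assets-eu", "eu-west-1"),
]

# Which buckets lose all their copies if US-EAST-1 goes down?
affected = [b.name for b in buckets if not readable_during_outage(b, "us-east-1")]
print(affected)  # ['assets-single-region']
```

Only the bucket that lives solely in the failed region is affected, which mirrors why some companies were hit by the outage and others were not.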
How can we avoid regional service failures?
To avoid downtime when incidents like the Amazon S3 outage occur, we have several options. Any solution has to take many factors into account: how critical our service is, what an outage costs us, and so on.
We would like to present one of them (it’s not the only one!): configuring DNS failover with Route 53 and enabling cross-region replication.
By configuring DNS failover with Route 53, health checks can detect when an S3 bucket is unreachable, and requests are automatically diverted to a bucket in another region. For that to work, you need the same data in two different regions, which you can get by enabling cross-region replication. With these two measures in place, you’ll be covered if a similar outage hits the region that hosts your data.
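As a rough sketch of what those two pieces of configuration can look like, the Python below builds the request payloads as plain dictionaries, following the shapes that boto3’s `put_bucket_replication` and `change_resource_record_sets` accept. Everything specific here is an assumption made for illustration: the bucket names, the IAM role ARN, the health check ID, the hosted zone ID and the hostnames (which stand for whatever endpoint serves each region’s bucket in your setup). The code makes no AWS calls. Note that cross-region replication requires versioning to be enabled on both buckets.

```python
import json

# Hypothetical placeholders: replace with your own resources.
SOURCE_BUCKET = "example-assets-us-east-1"
REPLICA_BUCKET = "example-assets-us-west-2"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/example-s3-replication"
PRIMARY_HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"
HOSTED_ZONE_ID = "ZEXAMPLE123"


def replication_configuration(role_arn: str, destination_bucket: str) -> dict:
    """Build a rule that replicates every new object to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all keys
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def failover_record(name: str, target: str, role: str, health_check_id=None) -> dict:
    """Build one Route 53 failover record; role is 'PRIMARY' or 'SECONDARY'."""
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{role.lower()}-{target}",
        "Failover": role,
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        # Route 53 stops answering with this record while the check is failing.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


replication = replication_configuration(REPLICATION_ROLE_ARN, REPLICA_BUCKET)
change_batch = {
    "Comment": "Fail over to the replica region when the primary is unhealthy",
    "Changes": [
        failover_record("assets.example.com.", "primary.assets.example.com.",
                        "PRIMARY", PRIMARY_HEALTH_CHECK_ID),
        failover_record("assets.example.com.", "secondary.assets.example.com.",
                        "SECONDARY"),
    ],
}

# With boto3 installed and credentials configured, the payloads would be sent as:
#   boto3.client("s3").put_bucket_replication(
#       Bucket=SOURCE_BUCKET, ReplicationConfiguration=replication)
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId=HOSTED_ZONE_ID, ChangeBatch=change_batch)
print(json.dumps(replication, indent=2))
print(json.dumps(change_batch, indent=2))
```

Attaching the health check only to the primary record is a design choice in this sketch: the secondary record is then served whenever the primary is reported unhealthy. Keep in mind that replication applies to objects written after it is enabled, so existing data has to be copied separately.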
This solution only covers problems with Amazon S3 but, as we saw before, other services in the same region were affected by the outage. We have to think about those other services too (EC2 failure, EBS failure, etc.). That can be a complex and daunting task, although it is possible to design with regional failures in mind.
The key, in this case, is working with a Cloud companion, a partner that knows and understands your business requirements. A good SysArchitect will help you set up an infrastructure that can serve requests from other regions, with your needs and constraints in mind (DR, Pilot light, Active-Active scenario, etc.). The cloud enables us to build infrastructures designed to meet your requirements while staying cost-efficient. If you have any doubts or need someone to help you build an always-available infrastructure, you can rely on our great team of SysArchitects.