AWS S3 Failure

Photo by Lukas Budimaier

01 Mar 2017

AWS S3 is down!

You might have noticed that the internet was struggling today. This was due to a problem with Amazon’s Simple Storage Service (S3).

The above is pretty much the use case for the majority of those using S3 as a backend. S3 offers durable, available, automatically scalable storage for minimal cost.

It’s a great service I rave about in most posts on this blog.

S3 Availability

The key point of note here is availability.

S3 is designed for 99.99% availability of objects stored in the Standard and Reduced Redundancy storage classes (the SLA itself only promises 99.9%).

This translates to a downtime of just less than 53 minutes per year.

0.01% of 1 year = 52.59 minutes
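The arithmetic above can be reproduced for any availability figure. The following is a minimal sketch; the function name `downtime_budget_minutes` is my own, and it assumes an average year of 365.25 days, which gives 52.596 minutes at 99.99%.

```python
# Downtime budget implied by an availability percentage.
# Assumption: an "average" year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed at the given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common availability targets.
    for pct in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% availability -> {budget:.2f} minutes per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the gap between 99.9% and 99.99% matters so much in practice.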

According to Tarsnap, the actual downtime was roughly 3 hours and 20 minutes (when looking at GETs).

The breakdown was as follows:

    17:37:29 UTC: First InternalError response from S3
    17:37:32 UTC: Last successful request
    17:37:56 UTC: S3 switches from 100% InternalError responses to 503 responses
    19:37 UTC: AWS notified of 'high error rates'
    20:34:36 UTC: S3 switches from 503 responses back to InternalError responses
    20:35:50 UTC: First successful request
    20:54 UTC: AWS notified of GET requests partially succeeding
    ~21:03 UTC: Most GET requests succeeding
    21:13 UTC: AWS notifies of GET requests fully restored
    ~21:52 UTC: Most PUT requests succeeding
    22:11 UTC: AWS notifies of PUT requests fully restored
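To put the timeline in perspective, here is a rough sketch that measures the gaps between a few of the events listed above and compares the hard outage against the 52.59 minute yearly budget. The variable and function names are my own, and it assumes all events fall on the same UTC day.

```python
from datetime import datetime, timedelta

# Yearly downtime budget at 99.99%, as quoted earlier in this post.
ANNUAL_BUDGET_MINUTES = 52.59


def parse_utc(hms: str) -> datetime:
    """Parse an HH:MM:SS time; assumes every event is on the same UTC day."""
    return datetime.strptime(hms, "%H:%M:%S")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two events on the same day."""
    return parse_utc(end) - parse_utc(start)


# Times taken from the timeline above (seconds padded where not given).
last_success = "17:37:32"
aws_first_notice = "19:37:00"
first_success = "20:35:50"
gets_restored = "21:13:00"

hard_outage = elapsed(last_success, first_success)
time_to_notice = elapsed(last_success, aws_first_notice)
until_gets_restored = elapsed(last_success, gets_restored)

# How many years' worth of 99.99% budget the hard outage consumed.
budget_ratio = hard_outage.total_seconds() / 60 / ANNUAL_BUDGET_MINUTES

if __name__ == "__main__":
    print("No successful requests for:", hard_outage)
    print("Time until first AWS notice:", time_to_notice)
    print("Time until GETs fully restored:", until_gets_restored)
    print(f"Budget consumed: {budget_ratio:.1f}x the yearly allowance")
```

Even counting only the window with no successful requests at all, the outage used up more than three years of the 99.99% allowance.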

The gap between customers seeing the problem (17:37 UTC) and AWS acknowledging the issue and notifying the community (19:37 UTC) is terribly large: about two hours. For a company that prides itself on reliability, this was a scary thing to witness.

I’m not expecting a response.

AWS Status

One of the most interesting takeaways was that the AWS Status page wasn’t showing any errors. This was due to the status page using S3 as part of its backend.

Wow!

The status page has since been updated to reflect the real state without relying on the service it was reporting about.

Interestingly, I never once saw an issue listed on the status page for regions outside North America, but I definitely could not access a bucket in Sydney, Asia Pacific. Maybe this was a partial fix? Stay tuned, I guess.

Redundancy is the key to survival

Did you notice this website go down? No?

That’s because I have AWS CloudFront in front of S3, caching responses. For a static website, this is a simple solution that also offers a number of other benefits.

Some of my other services would have gone down (Cat Facts), if they were being utilised during this time. Luckily for me, at the time of writing, all the services I host on AWS are low traffic.

The correct way to get around a problem like this, where a 3rd party service goes down, is to not keep all your eggs in one bucket.

Google Cloud Storage (GCS) offers direct support for the S3 XML API and has multi-regional support at a fraction of the cost of S3. Find out how to port S3 to GCS here.

But why stop there?

You can also use Azure or the lesser-known Backblaze B2.
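Spreading data across providers only helps if the reading side knows how to fall back. Below is a minimal sketch of that idea, not tied to any particular SDK: `fetch_with_fallback`, `failing_primary` and `working_secondary` are hypothetical names, and the two fetchers are stand-ins for whatever real storage clients you use.

```python
from typing import Callable, Iterable, Tuple

# A fetcher takes an object key and returns its bytes, or raises on failure.
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(key: str,
                        backends: Iterable[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each backend in order; return (backend_name, data) from the first success."""
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real client would catch its own error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all backends failed for {key!r}: " + "; ".join(errors))


# Hypothetical stand-ins for real storage clients.
def failing_primary(key: str) -> bytes:
    """Simulates the primary provider being down."""
    raise ConnectionError("503 Service Unavailable")


REPLICA = {"index.html": b"<h1>hello</h1>"}


def working_secondary(key: str) -> bytes:
    """Simulates a second provider holding a copy of the same objects."""
    return REPLICA[key]


if __name__ == "__main__":
    source, data = fetch_with_fallback(
        "index.html",
        [("primary", failing_primary), ("secondary", working_secondary)],
    )
    print(f"served {len(data)} bytes from {source}")
```

The catch is that writes have to be replicated to every backend ahead of time; a fallback is only as good as the copy it falls back to.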

I prefer to trust no one and look after the data myself, and you should too.



Thanks for reading

If you enjoyed the content please consider leaving a comment, sharing or hiring me.

Cheers,
Michael


