You might have noticed that the internet was struggling today. This was due to a problem with Amazon’s Simple Storage Service (S3).
The above is pretty much the use case for the majority of those using S3 as a backend. S3 offers durable, available, automatically scalable storage for minimal cost.
It’s a great service I rave about in most posts on this blog.
The key point of note here is availability.
S3 is designed for 99.99% availability of the objects stored in it, for both the Standard and Reduced Redundancy storage classes.
This translates to a downtime of just less than 53 minutes per year.
0.01% of 1 year = 52.59 minutes
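The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_budget_minutes` is made up for this example, and it assumes an average year of 365.25 days, which is what gives the 52.59 figure.

```python
# Minutes in an average year (365.25 days, so leap years are averaged in).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at this availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print the yearly downtime budget for 99.99% availability.
    print(f"{downtime_budget_minutes(99.99):.3f} minutes per year")
```

The same function makes it easy to compare other targets, for example 99.9% versus 99.99%.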
According to Tarsnap, the actual downtime was roughly 3 hours and 20 minutes (when looking at GETs).
The breakdown was as follows:
17:37:29 UTC: First InternalError response from S3
17:37:32 UTC: Last successful request
17:37:56 UTC: S3 switches from 100% InternalError responses to 503 responses
19:37 UTC: AWS notified of 'high error rates'
20:34:36 UTC: S3 switches from 503 responses back to InternalError responses
20:35:50 UTC: First successful request
20:54 UTC: AWS notified of GET requests partially succeeding
~21:03 UTC: Most GET requests succeeding
21:13 UTC: AWS notifies of GET requests fully restored
~21:52 UTC: Most PUT requests succeeding
22:11 UTC: AWS notifies of PUT requests fully restored
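How long the outage lasted depends on which two timestamps you measure between. As a rough sketch, the snippet below takes three of the times from the timeline above and computes two possible durations. The date is an assumption taken from the tweets quoted later in this post, and the helper name `at` is made up for this example.

```python
from datetime import datetime

# Assumed date of the outage (from the tweets quoted in this post); all times are UTC.
DAY = "2017-02-28"


def at(hms: str) -> datetime:
    """Parse an HH:MM:SS timestamp from the timeline on the assumed day."""
    return datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")


last_success = at("17:37:32")   # last successful request
first_success = at("20:35:50")  # first successful request after the outage
gets_restored = at("21:13:00")  # AWS reports GET requests fully restored

# Time during which no request succeeded at all.
hard_outage = first_success - last_success
# Time until AWS declared GET requests fully restored.
until_gets_restored = gets_restored - last_success

if __name__ == "__main__":
    print("No successful requests for:", hard_outage)
    print("Until GETs fully restored: ", until_gets_restored)
```

Either way, a single afternoon used up several years' worth of a 53 minute annual budget.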
The gap between customers seeing the problem, AWS acknowledging the issue, and AWS notifying the community is terribly large. For a company that prides itself on reliability, this was a scary thing to witness.
I’m not expecting a response.
One of the most interesting takeaways was that the AWS status page wasn’t showing any errors. This was because the status page used S3 as part of its backend.
@awscloud please don't host your status service on the service it's reporting a status of...— Michael Standen (@_MichaelStanden) February 28, 2017
Amazon admits the status page can’t be updated because the images are in S3: pic.twitter.com/gTtWajirSh— MikeTalonNYC (@MikeTalonNYC) February 28, 2017
The status page has since been updated to reflect the real state without relying on the service it was reporting about.
Interestingly, I never once saw an issue listed on the status page for regions outside North America, but I definitely could not access a bucket in Sydney, Asia Pacific. Maybe this was a partial fix? Stay tuned I guess.
Did you notice this website go down? No?
That's because I have AWS CloudFront in front of S3, caching responses. For a static website, this is a simple solution that also offers a number of other benefits.
Some of my other services would have gone down (Cat Facts), if they were being utilised during this time. Luckily for me, at the time of writing, all the services I host on AWS are low traffic.
The correct way to get around a problem like this, where a third-party service goes down, is to not keep all your eggs in one bucket.
Google Cloud Storage (GCS) offers direct support for the S3 XML API, and has multi-regional support at a fraction of the cost of S3. Find out how to port S3 to GCS here
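As a minimal sketch of the "more than one bucket" idea, the function below tries a list of storage backends in order and returns the first successful read. Everything here is hypothetical: `fetch_with_fallback` and the `Fetcher` type are names invented for this example, and each fetcher is assumed to be a small wrapper you write yourself around a real client (one for S3, one for GCS, and so on).

```python
from typing import Callable, Iterable, List

# A fetcher takes an object key and returns its bytes, or raises on failure.
# In practice each one would wrap a real storage client (hypothetical here).
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(key: str, fetchers: Iterable[Fetcher]) -> bytes:
    """Try each backend in order and return the first successful result."""
    errors: List[Exception] = []
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # any failure just means "try the next backend"
            errors.append(exc)
    raise RuntimeError(f"all backends failed for {key!r}: {errors}")


if __name__ == "__main__":
    def broken_primary(key: str) -> bytes:
        # Stand-in for a primary store that is currently down.
        raise ConnectionError("primary store unavailable")

    def working_secondary(key: str) -> bytes:
        # Stand-in for a healthy copy held with a second provider.
        return b"contents of " + key.encode()

    print(fetch_with_fallback("index.html", [broken_primary, working_secondary]))
```

This only covers reads. Writes are harder, because every copy has to be kept in sync, so treat this as a starting point rather than a complete design.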
But why stop there?
I prefer to trust no one and look after the data myself, and you should too.