Ziggeo Was First to Report Yesterday’s AWS Outage, but Remained Unaffected
Here's how Ziggeo first noticed the AWS outage, and why our system remained unaffected:
Yesterday, our internal error tracking system registered an elevated
number of failed requests in Amazon's Simple Queue Service (SQS).
Most of our internal jobs at Ziggeo are distributed to worker nodes via
SQS, so we immediately realized something was off.
Our job scheduling gracefully falls back to a completely separate job
scheduling system once requests to SQS fail repeatedly, so the overall
health of our application was not affected by the SQS issues.
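To illustrate the idea, here is a minimal Python sketch of this kind of fallback. It is not our production code: the class name FallbackScheduler, the max_failures threshold, and the two send functions are hypothetical stand-ins for the SQS client and the separate scheduling system.

```python
from typing import Callable


class FallbackScheduler:
    """Send jobs to a primary queue; switch to a fallback after repeated failures."""

    def __init__(
        self,
        primary: Callable[[dict], None],   # hypothetical: would wrap the SQS send call
        fallback: Callable[[dict], None],  # hypothetical: the separate scheduling system
        max_failures: int = 3,             # assumed threshold, not taken from the post
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def using_fallback(self) -> bool:
        # Once the primary has failed max_failures times in a row, stop trying it.
        return self.consecutive_failures >= self.max_failures

    def submit(self, job: dict) -> str:
        """Deliver the job and return the name of the path that accepted it."""
        if not self.using_fallback:
            try:
                self.primary(job)
                self.consecutive_failures = 0  # a success resets the counter
                return "primary"
            except Exception:
                self.consecutive_failures += 1
        # Either the primary just failed or it is already considered unhealthy.
        self.fallback(job)
        return "fallback"
```

In this sketch no job is dropped: a job whose primary send fails is handed to the fallback right away, and after the threshold is reached the primary is skipped entirely. A real system would also need a way to probe the primary and switch back once it recovers, which is left out here.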
After investigating the issue further, we realized a couple of
services, including DynamoDB, were showing highly elevated failure rates.
We checked Hacker News to see whether anybody else was experiencing similar
issues, and upon finding no reports, we posted our
observations there.
Almost immediately we made it to #1 on Hacker News, and stayed there for almost half a day.
Many services, including Heroku, went down completely in the
US East region.
Because Ziggeo's system was designed with resilience in mind, it remained unaffected.