Ziggeo Was First to Report Yesterday’s AWS Outage, But Remained Unaffected

Here's how Ziggeo first noticed the AWS outage, and why our system remained unaffected.

Yesterday, our internal error tracking system registered an elevated number of failed requests to Amazon's Simple Queue Service (SQS). Most of our internal jobs at Ziggeo are distributed to worker nodes via SQS, so we immediately realized something was off. Our job scheduling gracefully falls back to a completely separate job scheduling system once requests to SQS fail repeatedly, so the overall health of our application was not affected by the SQS issues.

After investigating further, we realized that a couple of services, including DynamoDB, were showing highly elevated failure rates. We checked Hacker News to see whether anybody else was experiencing similar issues and, finding no reports, posted our observations there. Almost immediately we made it to #1 on Hacker News and stayed there for almost half a day.

Many services, including Heroku, became completely unavailable in the US East region. Because Ziggeo's system was designed with resilience in mind, Ziggeo remained unaffected.
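
For readers curious what such a fallback might look like, here is a minimal sketch of the idea described above: try the primary queue, count consecutive failures, and route jobs to a separate scheduler once a threshold is reached. This is not Ziggeo's actual implementation. The class name `FallbackScheduler`, the `max_failures` threshold, and the two enqueue callables are all hypothetical, chosen purely for illustration.

```python
from typing import Callable, Dict

Job = Dict[str, str]


class FallbackScheduler:
    """Hypothetical sketch: send jobs to a primary queue, and switch to a
    separate fallback scheduler after repeated consecutive failures."""

    def __init__(
        self,
        primary: Callable[[Job], None],
        fallback: Callable[[Job], None],
        max_failures: int = 3,  # assumed threshold, not from the original post
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def using_fallback(self) -> bool:
        """True once the primary has failed max_failures times in a row."""
        return self.consecutive_failures >= self.max_failures

    def submit(self, job: Job) -> str:
        """Enqueue a job and return the name of the backend that accepted it."""
        if not self.using_fallback():
            try:
                self.primary(job)
                self.consecutive_failures = 0  # a success resets the counter
                return "primary"
            except Exception:
                self.consecutive_failures += 1
        # Either the primary just failed for this job, or it is currently
        # considered unhealthy, so the job goes to the fallback scheduler.
        self.fallback(job)
        return "fallback"
```

In a real deployment the `primary` callable would presumably wrap an SQS call and `fallback` would wrap the second scheduling system. A production version would also need some way to probe the primary again and switch back after it recovers, which this sketch deliberately leaves out.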