Incident Report – All Services Down
The incident occurred on 22 January 2025.
Incident Effects
Anyone using our service or certain parts of our website received errors, usually in the 500 range, indicating a problem on the server side.
Some parts of our service may still have appeared to work, or did work, because of CDN and caching layers that exist not only on our side but also at other layers of the network globally.
Who was affected
This change affected customers in all regions (US and EU).
What happened
A mistake during a quick update made the AWS account inaccessible, which caused downtime for all customers. Because the update was made on the AWS account itself, it was not covered by the tests we would normally have for a code change.
As a result:
- the update took effect immediately
- the error went unnoticed
The detection and the fix
We have several kinds of reporting, and even though our AWS account was not working, that reporting kicked in and started raising alerts, which led to investigation and resolution. The fix was carried out quickly, but the effects did not disappear right away. In total, all of our systems were unavailable for 1 hour.
Aftermath
Once the issue was fixed, no additional action was needed on our side. All servers started up automatically and synced, and everything ran smoothly.
Although we knew that the account change was the cause, we continued monitoring for any issues caused by this "restart". Fortunately, everything kept running smoothly.
Plans for the future
It is impossible to create tests and automation that would run and confirm whether a change made directly on AWS was done properly. This makes it hard to catch a typo, or a change that looks innocent yet could make the service unavailable to some extent.
That said, we plan to be more careful when making these changes so that the same thing does not happen again.
Final Thoughts
During the outage, no actions (such as recordings or API calls) would have been captured by our system. In most cases the recorder itself would not have been shown on your page either. If an upload was in progress, it would have stopped, and it is not possible to restore it.
That said, we are glad that we were able to detect the problem very quickly and that the entire system came back online without any hiccups. It is a great lesson to keep in mind: always "measure twice and cut once", even when the change is not a code or system update but "just a quick account modification".