Incident Report – New recordings turned into Failed videos for some customers

The incident occurred on the 20th of May this year.

Incident Effects

The incident showed up as videos with a FAILED status. Only new videos coming into our system were affected. Existing videos and their availability were not affected in any way.

Who was affected

This affected some of our customers. Customers who had no new videos coming into their system during the short time the incident was occurring were not affected.

Not every customer with incoming activity on their account was affected either. Only customers whose incoming videos (uploads or recordings) went through the servers where the incident was occurring were affected.

What happened

A security update had to be applied to the runc package. You can read more about it at the following link: https://alas.aws.amazon.com/ALAS-2021-1499.html

Because this package is used by part of our system, the update was rolled out.

Based on preliminary tests, the update did not seem to cause any unwanted behavior. As more and more of our dynamic system received the update, the issue started to show itself.

The detection and the fix

Our team quickly detected the unusual behavior and started to look into what was happening. Some of our customers also reached out to let us know what they had noticed (thank you!). This helped the right people quickly see what was happening and how to stop it.

The affected servers were quickly brought down and the update was addressed.

After this, new videos were no longer failing. However, the videos that had already failed still had the FAILED status.

Aftermath

Once the initial fix was in place, we started looking for traces of anything else being off. After confirming that the initial fix had resolved the issue, our team started to work on a permanent fix. A few hours later, on the same day, a fix was introduced that let us set up our servers so that the CVE in question no longer applied and no issues were present.

Once this fix was introduced, our team created an action that was run to fix the failed videos. All videos that came into our system shortly before the update was released, during the incident, and for several hours afterwards were re-queued to be checked and transcoded again by our system.

This allowed us to fix all the videos that would not otherwise have failed.
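
As an illustration of the re-queue step described above, here is a minimal sketch of selecting videos by arrival time. It is not the actual action that was run. The Video class, the select_for_requeue function, and the lead and tail padding values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Video:
    # Hypothetical record; a real system would carry more fields.
    video_id: str
    received_at: datetime


def select_for_requeue(videos, rollout_start, fix_time,
                       lead=timedelta(hours=1), tail=timedelta(hours=6)):
    """Return IDs of videos received inside the padded incident window.

    The window starts `lead` before the update rollout and ends `tail`
    after the fix, so borderline videos get re-checked as well.
    Both padding values are assumptions, not figures from the incident.
    """
    window_start = rollout_start - lead
    window_end = fix_time + tail
    return [
        v.video_id
        for v in videos
        if window_start <= v.received_at <= window_end
    ]
```

Padding the window on both sides trades a little extra re-processing for confidence that no affected video is missed.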

Plans for the future

Going forward, we will run our tests in a way that lets us better detect whether an update could cause issues on production servers. Our DevOps team has also already added additional checks on transcodings and video outcomes so that we get notified faster.

While we recognize that some updates, whether made by us or by an underlying system, might take us by surprise, we believe that with the steps we have already taken and those that will be implemented soon, we will be able to detect and minimize the effects of any future incident.
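
A check on video outcomes of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the checks our DevOps team actually added: the function names, the 20% threshold, the minimum sample size, and the READY status name are all hypothetical. Only the FAILED status comes from this report.

```python
def failure_rate(statuses):
    """Fraction of recent video outcomes that ended in FAILED."""
    if not statuses:
        return 0.0
    failed = sum(1 for s in statuses if s == "FAILED")
    return failed / len(statuses)


def should_alert(statuses, threshold=0.2, min_samples=10):
    """Alert when enough recent outcomes exist and too many of them failed.

    `threshold` and `min_samples` are illustrative values; the minimum
    sample size avoids alerting on a single failure in a quiet period.
    """
    return len(statuses) >= min_samples and failure_rate(statuses) > threshold
```

For example, a window of ten outcomes with five FAILED entries would trigger an alert under these assumed values, while three failures out of three would not, because the sample is too small.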

Final Thoughts

We want to thank everyone who noticed something was different and reached out to us, whether that was when the error started to appear or afterwards. Your feedback reduced the amount of logs our team had to go through and gave us confirmation that things were back to normal.
