May 15 Outage Post-Mortem
Last night, we had an incident in our data-processing pipeline that resulted in some data loss.
At around 11:30 PM Pacific Time, an on-call engineer was paged because a server had become unresponsive and unreachable. While this is quite rare, our infrastructure is robust enough to tolerate individual servers going down, and it does not usually cause any service disruption.
Unfortunately, the situation was a bit different this time. Shortly after the first page, another server started exhibiting the same symptoms. Eventually, all our servers were affected by the outage.
We eventually managed to stabilize the system. The servers were up and running again, database availability was improving, and the databases began their automatic data repair and recovery. The web servers were accepting requests again, which meant the Skylight agents could resume sending reports. However, the data-processing part of the system remained stuck in a restart loop due to a Kafka error.
Normally, Kafka is one of the most robust parts of our system. When the web servers receive reports from agents, they perform some lightweight authentication and validation steps and then promptly write the contents of those reports to Kafka, where they await further processing. Because of Kafka's track record of being highly available, this split of responsibilities in our architecture has historically prevented data loss during outages: as long as the data is in Kafka, we can always catch up on processing it after the issue is resolved.
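For context, here is a rough sketch of what that hand-off looks like. It is written against the kafka-python client purely for illustration; the topic name, report shape, and authentication helper are simplified stand-ins and not our production code.

```python
import json

from kafka import KafkaProducer

# Wait for acknowledgement from all in-sync replicas before considering a
# write successful, so a single broker failure doesn't lose the report.
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    acks="all",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def authenticate(token):
    # Stand-in for the real agent token check; always accepts in this sketch.
    return bool(token)

def handle_agent_report(report, auth_token):
    # Lightweight authentication and validation only; the heavy processing
    # happens later, in a separate worker that consumes this topic.
    if not authenticate(auth_token):
        raise PermissionError("unknown agent token")
    if "app_id" not in report or "traces" not in report:
        raise ValueError("malformed agent report")
    producer.send("agent-reports", report)
    producer.flush()
```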
This time, we were not so lucky. Because all of our servers went down in short succession, our redundancy and replication strategy was defeated. Not only was the Kafka cluster unavailable during the outage window, but when the nodes eventually came back online, they suffered from a data consistency issue that left a significant amount of data unavailable. When the workers started back up, they tried to resume processing from where they left off; because that data was no longer available in Kafka, they crashed and became stuck in a restart loop, unable to make progress.
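To make that failure mode concrete, here is a simplified sketch of a worker that resumes from its committed offset and treats a missing offset as fatal. Again, the client, topic, and group names are illustrative assumptions rather than our actual worker code.

```python
from kafka import KafkaConsumer
from kafka.errors import OffsetOutOfRangeError

def process(report_bytes):
    # Stand-in for the real report-processing step.
    pass

consumer = KafkaConsumer(
    "agent-reports",
    bootstrap_servers=["kafka-1:9092"],
    group_id="report-workers",
    enable_auto_commit=False,
    # Any value other than "earliest"/"latest" makes the client raise
    # OffsetOutOfRangeError instead of silently resetting the offset.
    auto_offset_reset="none",
)

try:
    for record in consumer:
        process(record.value)
        consumer.commit()
except OffsetOutOfRangeError:
    # The committed offset points at data Kafka no longer has. The worker
    # exits, the supervisor restarts it, and it immediately hits the same
    # error again: the restart loop described above.
    raise
```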
To be clear, Kafka was never meant to store data permanently: once an agent report has been processed by a worker, the data in Kafka is no longer needed. However, it does mean that any reports submitted during the outage window that hadn't yet been processed were lost permanently. In other words, you may notice a gap in the data on your Skylight dashboard from around 11:30 PM Pacific Time on Thursday, May 14, possibly up to around 3 AM on Friday, May 15.
Once we became certain that the data in Kafka was unrecoverable, we instructed the workers to skip over the agent reports in the affected window so that they could start ingesting new data and unblock the processing of incoming agent reports. We are very sorry about this, but given the circumstances at the time, we believe this was the right call.
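For the curious, that fix amounted to moving the workers' committed offsets past the missing data. A minimal sketch of that kind of one-off operation, again using kafka-python with illustrative topic and group names, looks something like this:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=["kafka-1:9092"],
    group_id="report-workers",
    enable_auto_commit=False,
)

# Point every partition of the topic at the oldest offset Kafka still has;
# everything before that (the unprocessed reports from the outage window)
# is skipped.
partitions = [
    TopicPartition("agent-reports", p)
    for p in consumer.partitions_for_topic("agent-reports")
]
consumer.assign(partitions)

for tp, earliest in consumer.beginning_offsets(partitions).items():
    consumer.seek(tp, earliest)

# Commit the new positions so the workers resume from here once restarted.
consumer.commit()
consumer.close()
```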
Initially, we had assumed the incident was caused by a widespread network partition at the data center, as that seemed to fit the reachability issues we observed at the time. Upon further investigation, it turned out to be caused by a bad automatic security update, similar to another incident in the past. A bad package was pushed to Ubuntu's security channel yesterday, and under certain circumstances installing it causes a kernel panic, which is what happened here.
Now that we have identified the root cause of the incident, we will work on mitigating the risk of it happening again. In the short term, we plan to adjust our automatic security updates to allow more time to discover and respond to issues like this one. We also plan to look into improving the redundancy of our Kafka cluster and potentially moving to a managed Kafka solution.
Once again, we are really sorry about this.