November Outage Post-Mortem

For the last couple of months, we have been investing in improving our operations and infrastructure here at Skylight. As our efforts begin to bear fruit, we have had a pretty good run without major incidents – until last week where we were hit by one of the biggest outages in our recent history. As they say, it never rains, but oh does it pour!

Incident Summary

Between November 28 and November 30 (Pacific Time from here on), we experienced a major outage. Our data processing sub-system was offline for 17 hours, the longest in our history. During this time, the Skylight agent was sending data with no impact to customer applications. We built the backend systems in compartments so the data collection sub-system is not affected when the data processing sub-system experiences downtime. The Skylight UI was also unaffected, beyond the fact that there was no realtime data to show.

While our architecture meant that we were unlikely to lose data, the extensive downtime accumulated a huge amount of data to be processed once the system was back online. Since this magnitude of backlog is very rare (a first, in fact), our infrastructure was not well-tuned to handle this type of workload. As a result, we were unable to process data as quickly as we would have liked.

After the initial recovery, some of our customers experienced multi-hour data processing delays (time between data was sent from the agent and when they become available on the Skylight UI) for the next two days. This is certainly not the experience we aim for, and we are sorry for letting you down. This post-mortem serves as a detailed report of what happened, and what we are doing to prevent it from happening again.

Detailed Timeline

This section gives an overview of the key events that contributed to the outage. Most of these facts were not known at the time of occurrence.

November 27 (Tuesday)

  • At 5:18 PM, Ubuntu published a new OpenJDK build into the security channel.
  • Following the standard Ubuntu (and Debian) best practices, our servers are configured to automatically apply security updates. Over the next couple of hours, the security update was picked up and applied on all our servers without human intervention.
  • Around 11 PM, one of the worker servers encountered an unrelated Java exception during processing, which caused our monitoring system to automatically kill and restart the process. This is a routine procedure and our system is designed to tolerate and recover from these kinds of failures. However, since this was the first restart after the security update, the data processing application was running the new OpenJDK build for the first time.
  • At 11:09 PM, the on-call engineer was paged to handle a disk-space issue on a worker server.
  • Shortly after, we detected an elevated error rate, caused by OS-level SIGBUS errors ("a fault occurred in a recent unsafe memory access operation in compiled Java code") and a mix of other application-level exceptions (e.g. integer underflow), all of which originated from within a dependency used to maintain an off-heap query cache. We later determined these errors had the same root cause.
  • The new errors quickly overwhelmed the worker server. The data processing application was stuck in a "restart loop" and was unable to make meaningful progress. The resulted in a significant loss in data processing capacity which affected about 25% of customer applications.
  • At 11:20 PM, the on-call engineer responded to the alert and began investigating. At first, the focus was to triage the disk-space issue that was brought to our attention. We observed that the off-heap query cache was taking up a lot (but not all) of available disk space, but it seemed to have settled with no signs of growth or fluctuation. Even with that oversized cache, we appeared to have a reasonable (if only barely) buffer of free space left, so it was not obviously an urgent issue.
  • We shifted into investigating the stream of the far more unusual new errors and restoring the worker’s processing capacity. For the rest of the hour, we attempted the routine troubleshooting and recovery procedures—rebooting, cycling hardware, provisioning extra disk space, etc—to no avail.

November 28 (Wednesday)

  • Between 12 AM and 1 AM, the remaining worker servers were restarted for a variety of routine reasons. At this point all the servers were running the new OpenJDK build and were stuck in similar restart loops. This resulted in a total loss of data processing capacity.
  • At the time, the root cause of the failure was not known. Since the worker servers were largely isolated from each other, we had been assuming up until this point that this was an isolated problem restricted to the first worker. These workers were set up to process different, independent shards of customer data, so our leading theory was that we had some form of corrupted data, caused by an agent bug. The spread of the failure was quite surprising and extremely puzzling.
  • Since we were focusing on recovering the first server thus far, we missed the window of opportunity to compare and identify any differences between a "healthy" and "unhealthy" worker server. By the time we looked into the other servers, they all exhibited similar failure modes with no apparent links to be found in the application logs.
  • Since the symptoms were consistent with data corruption (invalid data was produced somewhere upstream resulting in unaccounted for conditions somewhere down the chain), this was the main focus of our investigation. This kind of problem is extremely difficult to debug (especially at 2 AM!), as the cause and effect can sometimes be very far apart, making usual debugging techniques ineffective. We resorted to code reading—both in our application codebase and in the third-party dependency—carefully tracing all the data paths to identify ways this sort of problem could manifest, as well as reviewing all recent commits.
  • When morning rolled around, additional engineers got involved in the situation, handling status updates, customer communications, etc. Our status page assumed a maximum data processing delay of six hours. By this point, we had reached that limit and the figure was not accurate. Perhaps this was too much optimism on our part, but the reason a cutoff was chosen here was to avoid potentially generating an unbounded amount of load on the collector (querying a lot of historic data) during an already bad outage situation.
  • At this point, even though we still had not identified a root cause, we were pretty confident that the problem was related to the off-heap query cache. We starting evaluating a plan to operate with the cache completely disabled. This was not something we did lightly, as we knew that behavior could diverge from what was typically possible on production. We also knew that this could put an unprecedented amount of read load on the database, which our database was not tuned for. But since we were running low on other leads, we prepared a patch for this.
  • At around 11 AM, we finished drawing up a plan and decided to test the patch to disable the query cache on a single worker server. While it did not exhibit cache-related issues, it got stuck in a restart loop due to a new exception. Unfortunately, this time it was a garden-variety neighborhood NullPointerException exception, which took us a while to triage.
  • We determined that this error was caused by some missing data in the database. The exact cause was unknown at the time, but we figured out how to work around it with another patch. We realized (way) later that this was because we were persisting transient 1 minute rollups with a 6-hour TTL (expiration time). Normally, we would finalize the rollup window soon after and rewrite the data with a much longer TTL, matching our data retention policy. Since we were down for more than 6 hours at this point, the transient data had already expired. The problem itself wasn’t particularly fatal, but we had never experienced a downtime long enough to have accounted for this scenario.
  • At around 2 PM, we successfully restarted the first worker server. We resumed data processing for some customers and assessed the extra workload's impact on the database. After the longest 15 hours at Skylight, we breathed our first sigh of relief and took breaks for food and coffee.
  • As it turned out, the concern over the extra database workload was mostly unnecessary. Due to other issues we experienced in the past (which are now fixed), we ended up provisioning a lot of extra database capacity. This came in pretty handy for the situation as it allowed the database cluster to handle the extra workload reasonably well.
  • By 5 PM, we felt confident enough to restart another worker server. By 7 PM, we had all of our worker servers back online.
  • Due to the six hour reporting limit on the lag monitoring, we didn't have very good visibility into the backlog situation in terms of wall clock time. We did, however, have metrics in terms of raw bytes from the input side of the pipeline. We knew that we were burning through backlog at a positive rate (i.e. our data processing rate was higher than our ingestion rate), so we were no longer at risk of any data loss (which would only happen if we were to over-run the ingestion buffer) and it was "just a matter of time" until we were fully caught up. We decided that it was safest to let things settle overnight rather than to make any more last-minute changes with our sleep-deprived brains.
  • At this point, we all went home* for dinner and had a good night's sleep. (* Yehuda and I were at the Bay Area for TC39, and Peter worked from home, so we only "went home" metaphorically.)

November 29 (Thursday)

  • By daytime, the data processing delay on the first worker server had reduced to under the 6 hour mark, which was a good sign. However, this didn't mean we were 6 hours wall clock time away from fully catching up. The volume of data that we process at Skylight is directly tied to the amount of traffic our customers’ applications receive. In aggregate, this adds up to a huge daytime traffic bias – we often process several times more data at (US) daytime "peak hours".
  • While we had provisioned more than enough processing capacity to handle the typical peak traffic, there wasn’t much spare capacity left to process the backlog. It also didn’t help that we had the query cache disabled, which meant we were operating at somewhat degraded performance. The end result was that while we were able to catch up significantly during the night (we were once over 17 hours behind in processing data!), the gap didn't narrow significantly during the day.
  • At around 8:30 AM, we deployed a patch to our monitoring system that increased the delay reporting window. At this point, most of the workers were caught up and somewhere within the 6 hour window, where they continued to hover for the rest of the work day. During this period, our customers experienced various levels of delayed access to their performance data (e.g. only data from the morning would be available when checking the app in the afternoon). Once again, we are truly sorry for the confusion and inconvenience that this caused.
  • We looked into provisioning extra capacity to get us through the hump, but ultimately decided against it. In hindsight, this may have been an incorrect judgement call, but, at the time we made it, we were both more optimistic about the processing rate and more conservative about introducing unnecessary variables and sources of errors into the system. Throughout this incident, our top priority was to avoid data-loss. At the time, we still had not identified the root cause, and we were unsure how much extra load the database cluster could handle. Since it generally takes a couple of hours to provision extra servers and reconfigure our pipeline, we believed the payoff was probably not worth the extra risk.
  • We also attempted to re-enable the query cache, but we ran into the same issues as before.
  • Since there wasn’t much else to do to speed things up on the processing side, we shifted focus back to investigating the root causes for the incident for the rest of the day.
  • By 5 PM, the first worker was fully caught up, with the remaining workers hovering around the 5 hour mark.

November 30 (Friday)

  • By 6 AM, all but one of the workers were fully caught up, while the last worker was hovering around the 2 hour mark (due to unbalanced sharding). Based on our experience from Thursday, we knew that this would probably remain the case for the rest of the day. At this point, we felt that we understood the capacity of the system fairly well and decided to provision extra capacity and rebalance the shards.
  • By noon, we deployed the infrastructure changes successfully.
  • By around 4 PM, all workers were fully caught up.
  • By the end of day, we completed our root cause analysis (see below for the findings).

Root Cause Analysis

Once again, most of this information was not known at the time of their occurrences. This section is written in an order optimized for understanding the problem.

A month ago, Ubuntu published an OpenJDK security update to address a variety of CVE vulnerabilities. This version was based on the upstream OpenJDK 8 Update 181 release. Unfortunately, it was later discovered that this release also introduced a regression. As far as we know, we were not affected by that particular issue, as we have been running this in production for about a month at this point.

To address the issue, Ubuntu published the aforementioned OpenJDK update last Tuesday. Based on the Ubuntu Security Notice, there don't seem to be other notable security patches in this release other than to address the regression caused by the October update. However, this release is based on the newer upstream OpenJDK 8 Update 191 release, which contained a variety of other bugfix patches.

Among those patches was a fix for JDK-8168628. On the surface, the symptoms of this bug were quite similar to our situation: it involves SIGBUS faults while using mmap, which our off-heap query cache ultimately relies on under the hood. However, in practice this patch had the opposite effect for us: we did not experience any of those issues before the patch, and now we do.

To understand why, we need to explain how our off-heap query cache works. For those unfamiliar, the POSIX mmap API is a way to memory-map a file. This allows you to arbitrarily address contents of a big file as if they are just content in memory (e.g. using pointer arithmetic and dereferencing, as opposed to using the regular file-oriented APIs such as lseek and read).

When accessing an address for the first time, the kernel will lazily load the relevant part of the file into memory, usually one or more pages at a time. Likewise, when running low on memory, portions of the memory-mapped file can be flushed to disk and unloaded from RAM. In a way, it can act like an application-level swap space.

This is perfect for our use case. As our data (other than the transient rollups mentioned above) are immutable-once-written, this allowed us to maintain a local cache larger than the available amount of working memory alone would allow. Even when we do have to read from disk, it will still be slightly faster than making a trip to the database and will reduce the read load on the database.

On each worker server, we have configured two such local, equally sized, off-heap caches via Chronicle Map to store two different types of query data. The exact size of the caches are heuristically (but deterministically) calculated by the library using static configuration we provide.

Later introspection determined the size of each of these caches was around 90 GB each (180 GB combined), which added up to more than the disk space we had provisioned (100 GB dedicated to the caches) on the worker servers.

This was not a problem for us in practice, as we rely on sparse file support in XFS, which lazily allocates backing disk space as needed. Under typical operating load, our query caches are running at around 10-20% full at all time, which means we are normally only using around 20-30 GB disk space, leaving a pretty healthy margin.

Furthermore, because the content of the cache is tied to the lifetime of the data processing application process, in the extremely unlikely event that we run out of disk space for the query cache (this has never happened before, as far as we are aware), it will simply crash the process and instantly empty the cache/disk and start over again.

With all that said, even though it does not cause any problems in practice, you could argue that this is a misconfiguration on our part, as we also don't have any particular reason to configure the cache to be bigger than the available disk space. It was simply one of those cases where you plugged in some initial guesstimates which worked well and never had a reason to go back and fine-tune the numbers.

During the outage, we observed that one of our two caches consumed over 90GB of disk space (90% of what was available), while the second one was only using a few hundred KBs.

Using strace, we have traced down the system call that was responsible for this behavior. It turns out that, in an attempt to fix JDK-8168628, OpenJDK 8 Update 191 changed the behavior of RandomAccessFile::setLength on Linux-based systems from performing a fallocate instead of a ftruncate system call (patch diff).

This method is used by Chronicle Map to initialize the cache files on disk on startup. The change in behavior is subtle but important – where ftruncate merely updates the metadata for the file, while fallocate eagerly allocates backing disk space and zero-fills the file.

To understand the motivation of this change, JDK-8168628 describes the scenario where you are mapping a big file and eventually fail to lazily allocate the necessary disk space on subsequent access, causing a SIGBUS crash despite having "pre-allocated" the file with RandomAccessFile::setLength. In order to fix this crash, the assumption was that this method should eagerly allocate the requested disk space so that it would fail early.

This change had some unfortunate consequences. For starters, zero-ing a huge file is not a quick operation. But more importantly, quoting the latest comment from the bug thread: “it break[s] sparse files”… which is exactly what happened to us. The result of this change was that, when initializing the first cache, Java would now eagerly allocate 90 GB of backing disk. Since we had 100 GB available, this operation would succeed, although it would be a bit slow and result in a lot of unnecessary disk IO. However, when it came time to initialize the second cache, it would also try to eagerly allocate 90 GB of backing disk. Only this time, it would fail to do so as there is not enough disk space left.

For some reason, the resulting IOException was discarded and ignored. Perhaps the thought was that this was not necessarily a fatal problem, due to sparse files, or perhaps the hope was that disk space would eventually free up before they were really needed.

Whatever the logic, the application/library would happily proceed to mmap-ing the file, despite the error. However, since the first cache used up almost all the available disk space, shortly after, we ran out of backing disk and failed to allocate space on subsequent access, resulting in a SIGBUS fault. Ironically, this was exactly the situation described in JDK-8168628, and what the patch was trying to prevent in the first place.

The general consensus is that this patch was a bad idea and was fixing the wrong problem. In fact, it has already been reverted in JDK 11 and will hopefully make its way back to JDK 8 eventually.

Going Forward

Given that the incident was ultimately triggered by an automatic system update, it is natural to question whether that should have been enabled at all. As painful as this was, we still believe problems like these are a relatively rare occurrence and the benefits of automatic patching far outweigh the risk.

In a small team like ours, if everything is an emergency, then nothing really is. Perhaps in an ideal world, we would promise to review every security patch before applying them manually in a timely manner, the truth is we just don’t have that kind of resources to dedicate. Without the automatic update, we would likely fall behind on applying security patches, which would have far worse consequences.

That being said, we need to keep a closer eye on the upgrade process. In hindsight, the upgrade logs should have been one of the first things we checked. We will be adding this to our troubleshooting checklists, as well as setting up better notifications and monitoring around this to improve our situational awareness.

We will also work on improving our communications during incidents like this. While we have always had a status page setup, it was not prominently linked to from within the Skylight UI. This resulted in a pretty confusing experience during the outage, as customers would notice the missing data and assumed there was a problem on their end. To address this problem, we will be adding a live incident banner to the Skylight UI that shows the latest incident information right within the UI during an outage. This change should go live sometime this week, but hopefully you won’t see it anytime soon.

In the longer term, we will work on improving our architecture, such that we will be able to provision extra resources more fluidly during the recovery phase. This has always been on our wishlist, but it requires making some pretty significant changes to our architecture. Given that it would only make a difference in the rare outage scenarios, it was difficult to prioritize the work since we would rather be working on customer-facing features. However, we also understand that when it comes to winning your trust, reliability is just as important of a feature. We will do our best to strike the right balance here.

Once again, we are sorry for the trouble and would like to thank you for your patience during this incident. We will do better.