Get Skylight

What I Learned About Hunting Memory Leaks in Ruby 2.1

I recently spent a bunch of time deep diving into some mystery memory leaks in Ruby. They were persistent and annoying, though not devastating by any means—the kind we could get away with ignoring for a long while, but shouldn't.

Spoiler alert: we let them hang out for longer than we should have. The day finally came when we could no longer get away with letting the leaks fester, so I grabbed my hardhat and dove in. What follows is the dramatic story of my journey.

I kid! It was more anticlimactic than dramatic (much of our mystery leak was fixed by a similarly mysterious cure), but I did learn a lot along the way, and I thought I could be helpful to others dealing with similar issues. So let's dive right in.

The Backstory

We'd been noticing a slow but persistent leak in our Rails app for a while. We're running on Heroku with 512MB limits and we were hitting those pretty fast. This is bad news because once an app reaches the cap, it starts using swap, and swap is very slow. We were able to deal with this by restarting the app every day or so, which was annoying but not a big deal.

It became more of a problem when we tried to upgrade to Rails 4.1 and Ruby 2.1, because Ruby 2.1 makes significant changes to garbage collection. The updates improved performance, but they also increased the memory baseline. Once we attempted to deploy the change, we found ourselves hitting the caps much more quickly.

We again decided to hold off on the upgrade until we could investigate more, but eventually, we found that the memory leak was getting worse even without upgrading. We were going to have to do something.

I felt like the first investigation had flailed a bit (thus leading us to the Mexican-Memory-Standoff we were in), so I spent a lot of time examining things.

I collected a multitude of memory dumps searching for leaks. Unfortunately, I didn't learn much that was new and definitive. In order to collect these dumps though, I had to run under Ruby 2.1. This meant that for a couple hours at a time I would deploy Rails 4.1 and Ruby 2.1.

During these temporary deploys I noticed that while our memory baseline was high, it wasn't growing quickly. In fact, after waiting for a while, it leveled off. Our memory leak was gone!

At this point we were running Rails 4.1.6 and Ruby 2.1.2. Apparently, something had changed since Rails 4.1.1 and Ruby 2.1.1. There was a known bug in Ruby 2.1.1 so it's entirely possible that it's what we were hitting.

This is all a bit anticlimactic as we never found out exactly what caused our leaks. I did learn a lot about how to look for memory leaks in Ruby apps though. I'd like to share a bit about this with you.

ObjectSpace and Heap Dumping

The most useful built-in tool for dealing with memory issues in Ruby is ObjectSpace. This object has been around for quite some time but has been slowly gaining additional usefulness. Most recently, in Ruby 2.1, it received the ability to do heap dumps.

This means that you can collect heap dumps of your production application and compare the outputs. There's an incredible amount of information provided in these dumps (many of mine clocked in at around 100mb) and the main challenge is figuring out how to parse it out to something useful. Hopefully, once you've finished reading this post you'll know how to get a heap dump and have a bit of an idea of how to extract useful information from it.

Getting a Heap Dump

The simplest way to get a JSON formatted heap dump as a file is documented in the slides for an excellect talk by Sam Saffron:

rbtrace -p PID -e 'Thread.new{require "objspace"; ObjectSpace.trace_object_allocations_start; GC.start(); ObjectSpace.dump_all(output: File.open("heap.json", "w"))}.join'  

This starts by using rbtrace to connect to the running Rails process, and it follows up by executig the provided code.

Next we start a new thread to keep this from affecting the main thread. We then call ObjectSpace.trace_object_allocations_start to gather additional metadata about allocations. After running a full garbage collection to get to a clean state, we then dump the output to the given file.

(By the way, in some cases you may run up against a maximum command length that you can send to rbtrace. You'll need to put the main body of your command in a file and load that file in your rbtrace call to work around this.)

This worked great for me locally, but as we're running on Heroku, I didn't have direct console access to the dyno. To deal with this, I actually added a custom admin endpoint to our Rails app that essentially ran:

system("rbtrace -p #{Process.pid} -e 'load \"#{Rails.root}/script/heap_dump.rb\"'").  

Inside heap_dump.rb was an expanded version of Sam's script that uploaded the resulting file to S3 tagged with the timestamp and the current Heroku hostname (after all we have to know which dyno it came from).

All I had to do was hit this admin endpoint I'd created and it would kick off a heap dump. One snag was that the heap dump creation usually took longer than the 30 second timeout in Heroku's router. This meant that I'd always get a timeout error in my browser. Heroku doesn't actually kill the triggered process, so this wasn't a big deal and I'd still get my dumps.

The more annoying part was that the process of getting the dump appeared to put a pretty serious strain on the system. After triggering one or two dumps, I'd often noticed that the dyno had died and was restarted. This was especially aggravating since I needed at least two (and preferrably three) dumps from the same dyno to get useful data.

Heads up: the dumps will have a lot of lines that look like this:

{"type":"ROOT", "root":"vm", "references":["0x7fe7e08cb7a0"]}
{"type":"ROOT", "root":"finalizers", "references":["0x108d050f8", "0x108dcbdc0", "0x108ea0700", "0x108f74708", "0x10948b110", "0x109541e88", "0x10b180ba8"]}
{"address":"0x105a8c018", "type":"STRING", "class":"0x7fe7e0829658", "embedded":true, "bytesize":9, "value":"        \n", "encoding":"UTF-8", "flags":{"wb_protected":true, "old":true, "marked":true}}
{"address":"0x105a8c0b8", "type":"NODE", "node_type":"NODE_CREF", "references":["0x7fe7e0b96c28", "0x105a8d580"], "flags":{"wb_protected":true, "old":true, "marked":true}}
{"address":"0x105a8c0e0", "type":"DATA", "class":"0x7fe7e08cb480", "struct":"iseq", "references":["0x7fe7e0c8fa30", "0x105a916a8", "0x105a916a8", "0x105a9b2e8", "0x105a9b2e8", "0x105a8c0b8", "0x7fe7e0b96c28"], "memsize":808, "flags":{"wb_protected":true, "old":true, "marked":true}}

If you're paying close attention you'll notice that it's not actually valid JSON. It's a text file where each line is a JSON object.

Diffing the Dumps

Eventually, I managed to collect enough dumps to do some analysis. In his slide deck, Sam provides some basic code for doing a simple two dump diff, but he suggested that you could get more accurate results by comparing three dumps. He details the process as follows:

  • Gather three snapshots
  • Remove common objects in snapshot 1 from snapshot 2
  • Remove missing objects in snapshot 3 from snapshot 2

The goal here is to get a list of all objects that were added between the first and second snapshot and were still present in the third snapshot. This isn't a guarantee that those objects won't get cleaned up even later, but it at least tells us we didn't catch any very short lived objects.

Sam didn't share a script for this process so I wrote one based on his simple two snapshot diff:

#!/usr/bin/env ruby

require 'set'  
require 'json'

if ARGV.length != 3  
  puts "Usage: detect_leaks [FIRST.json] [SECOND.json] [THIRD.json]"
  exit 1
end

first_addrs = Set.new  
third_addrs = Set.new

# Get a list of memory addresses from the first dump
File.open(ARGV[0], "r").each_line do |line|  
  parsed = JSON.parse(line)
  first_addrs << parsed["address"] if parsed && parsed["address"]
end

# Get a list of memory addresses from the last dump
File.open(ARGV[2], "r").each_line do |line|  
  parsed = JSON.parse(line)
  third_addrs << parsed["address"] if parsed && parsed["address"]
end

diff = []

# Get a list of all items present in both the second and
# third dumps but not in the first.
File.open(ARGV[1], "r").each_line do |line|  
  parsed = JSON.parse(line)
  if parsed && parsed["address"]
    if !first_addrs.include?(parsed["address"]) && third_addrs.include?(parsed["address"])
      diff << parsed
    end
  end
end

# Group items
diff.group_by do |x|  
  [x["type"], x["file"], x["line"]]
end.map do |x,y|  
  # Collect memory size
  [x, y.count, y.inject(0){|sum,i| sum + (i['bytesize'] || 0) }, y.inject(0){|sum,i| sum + (i['memsize'] || 0) }]
end.sort do |a,b|  
  b[1] <=> a[1]
end.each do |x,y,bytesize,memsize|  
  # Output information about each potential leak
  puts "Leaked #{y} #{x[0]} objects of size #{bytesize}/#{memsize} at: #{x[1]}:#{x[2]}"
end

# Also output total memory usage, because why not?
memsize = diff.inject(0){|sum,i| sum + (i['memsize'] || 0) }  
bytesize = diff.inject(0){|sum,i| sum + (i['bytesize'] || 0) }  
puts "\n\nTotal Size: #{bytesize}/#{memsize}"  

This will give you a file that looks something like the following but with more lines:

Leaked 77 STRING objects of size 10947/10961 at: /Users/peterwagenet/.rvm/gems/ruby-2.1.3/gems/actionview-4.1.6/lib/action_view/template.rb:297  
Leaked 15 DATA objects of size 0/51808 at: /Users/peterwagenet/.rvm/gems/ruby-2.1.3/gems/actionview-4.1.6/lib/action_view/template.rb:297  
Leaked 1 FILE objects of size 0/216 at: /Users/peterwagenet/.rvm/gems/ruby-2.1.3/gems/puma-2.7.1/lib/puma/server.rb:290  
Leaked 1 DATA objects of size 0/0 at: /Users/peterwagenet/.rvm/gems/ruby-2.1.3/gems/puma-2.7.1/lib/puma/client.rb:35


Total Size: 11767/68771  

Putting It All Together

If you're lucky, something in your diff will stand out. Still, you need to be careful. Just because something shows up here as a leak doesn't mean it's actually a leak. For instance, what initially looked like a leak in ActiveRecord turned out to be the statement cache which caches 1000 items by default. These objects are long lived, but they're not a leak.

Unfortunately, for us, nothing really stood out here. Once we adjusted for stuff from the S3 gem (used only to upload dumps) and realized that the objects in ActiveRecord were from the statement cache, there wasn't anything significant to account for our troubles. Fortunately, as I mentioned at the start, this was because out leaks were somehow gone!

They say all's well that ends well, and I guess that's true here. That said, I still would have liked to know what caused our problems in the first place...

Learn More

If you've got memory concerns in your app, we're pretty close to rolling out memory profiling in Skylight. Once it's out and you hook it up we'll be able to tell you what methods are allocating the most memory. Even if you're not leaking, lots of allocation puts a heavy load on your GC, which makes your app slower. Skylight will help make your app fast.

Resources

I spent a lot of my research time finding smarter people to learn from. These are some of the things I found most interesting and useful: