Ruby Code Tips (Best Tutorial 2019)

Ruby Code Tips for fast Execution


To learn what makes Ruby code fast, we must first understand what makes Ruby code slow. If you’ve done any performance optimization in the past, you probably think you know what makes code slow.

 

You may think you know even if you haven’t done any performance optimization. In this blog, we explain several tips and techniques for fast execution of Ruby code. But first, let me see if I can guess what you think.

 

Your first guess is algorithmic complexity of the code: extra nested loops, computations, that sort of stuff. And what would you do to fix the algorithmic complexity? Well, you would profile the code, locate the slow section, identify the reason for the slowness, and rewrite the code to avoid the bottleneck.

 

Rinse and repeat until fast.


Sounds like a good plan, right? However, it doesn’t always work for Ruby code. Algorithmic complexity can be a major cause for performance problems. But Ruby has another cause that developers often overlook.

 

Let me show you what I’m talking about. Let’s consider a simple example that takes a two-dimensional array of strings and formats it as a CSV.

 

Let’s jump right in. Key in or download this simple program.

blg1/example_unoptimized.rb
require "benchmark"
num_rows = 100000
num_cols = 10
data = Array.new(num_rows) { Array.new(num_cols) { "x"*1000 } }
time = Benchmark.realtime do
csv = data.map { |row| row.join(",") }.join("\n")
end
puts time.round(2)

 

We’ll run the program and see how it performs. But before that, we need to set up the execution environment. There are five major Ruby versions in use today: 1.8.7, 1.9.3, 2.0, 2.1, and 2.2.

 

These versions have very different performance characteristics. Ruby 1.8 is the oldest and the slowest of them, with a different interpreter architecture and implementation.

 

Ruby 1.9.3 and 2.0 are the current mainstream releases with similar performance. Ruby 2.1 and 2.2 are the only versions that were developed with performance in mind, at least if we believe their release notes, and thus should be the fastest.

 

It’s hard to target old software platforms, so I’ll make a necessary simplification in this blog. I will neither write examples nor measure performance for Ruby 1.8. I do this because Ruby 1.8 is not only internally different, it’s also source-incompatible, making my task extremely complicated.

 

However, even if you have a legacy system running Ruby 1.8 with no chance to upgrade, you can still use the performance optimization advice from this blog. Everything I describe in the blog applies to 1.8.

 

In fact, you might even get more improvement. The old interpreter is so inefficient that any little change can make a big difference. In addition to that, I will give 1.8-specific advice where appropriate.

 

The easiest way to run several Rubys without messing up your system is to use rbenv or rvm.

I’ll use the former in this blog.

Get rbenv from GitHub (sstephenson/rbenv).

 

Follow the installation instructions from its README.md. Once you install it, download the latest releases of the Ruby versions that you’re interested in. This is what I did; you may want to get more recent versions:

$ rbenv install -l
...
1.9.3-p551
2.0.0-p598
2.1.5
2.2.0
...
$ rbenv install -k 1.9.3-p551
$ rbenv install -k 2.0.0-p598
$ rbenv install -k 2.1.5
$ rbenv install -k 2.2.0

 

Note how I install Ruby interpreters with the -k option. This keeps the sources in rbenv’s directory after compilation.

 

In due time we’ll talk about the internal Ruby architecture and implementation, and you might want to have a peek at the source code. For now, just save it for the future.

Ruby version

To run your code under a specific Ruby version, use this:

$ rbenv versions
* system (set by /home/user/.rbenv/version)
  1.9.3-p551
  2.0.0-p598
  2.1.5
  2.2.0
$ rbenv shell 1.9.3-p551
$ ruby blg1/example_unoptimized.rb

 

To get a rough idea of how things perform, you can run examples just one time. But you shouldn’t make comparisons or draw any conclusions based on only one measurement.

 

To do that, you need to obtain statistically correct measurements. This involves running examples multiple times, statistically post-processing the measurement results, eliminating external factors like power management on most modern computers, and more.

 

In short, it’s hard to obtain a truly meaningful measurement. But for our present purposes, it is fine if you run an example several times until you see the repeating pattern in the numbers. I’ll do my measurements the right way, skipping any details of the statistical analysis for now.

 

OK, so let’s get back to our example and actually run it:

$ rbenv shell 1.9.3-p551
$ ruby example_unoptimized.rb
9.18
$ rbenv shell 2.0.0-p598
$ ruby example_unoptimized.rb
11.42
$ rbenv shell 2.1.5
$ ruby example_unoptimized.rb
2.65
$ rbenv shell 2.2.0
$ ruby example_unoptimized.rb
2.43

Let’s organize the measurements in a tabular format for easy comparison. Further, in the blog, I’ll skip the session printouts and will just include the comparison tables.

                 1.9.3    2.0      2.1     2.2
Execution time   9.18     11.42    2.65    2.43

What? Concatenating 100,000 rows, 10 columns each, takes up to 10 seconds? That’s way too much. Ruby 2.1 and 2.2 are better but still take too long. Why is our simple program so slow?

 

Let’s look at our code one more time. It seems like an idiomatic Ruby one-liner that is internally just a loop with a nested loop. The algorithmic complexity of this code is going to be O(n × m) no matter what. So the question is, what can we optimize?

 

I’ll give you a hint. Run this program with garbage collection disabled. For that, just add a GC.disable statement before the benchmark block, like this:

blg1/example_no_gc.rb
require "benchmark"
num_rows = 100000
num_cols = 10
data = Array.new(num_rows) { Array.new(num_cols) { "x"*1000 } }
GC.disable
time = Benchmark.realtime do
csv = data.map { |row| row.join(",") }.join("\n")
end
puts time.round(2)
Now let’s run this and compare our measurements with the original program.
                 1.9.3    2.0      2.1     2.2
GC enabled       9.18     11.42    2.65    2.43
GC disabled      1.14     1.15     1.19    1.16
% of time in GC  88%      90%      55%     52%

Do you see why the code is so slow? Our program spends the majority of its execution time in the garbage collector—a whopping 90% of the time in older Rubys and a significant 50% of the time in modern versions.

 

I started my career as a C++ developer. That’s why I was stunned when I first realized how much time Ruby GC takes. This surprises even seasoned developers who have worked with garbage-collected languages like Java and C#. Ruby GC takes as much time as our code itself or more.

 

Yes, Ruby 2.1 and later perform much better. But even they require half the execution time for garbage collection in our example.

What’s the deal with the Ruby GC?

  • Did our code use too much memory?
  • Is the Ruby GC too slow?

The answer is a resounding yes to both questions.

 

High memory consumption is intrinsic to Ruby. It’s a side effect of the language design. “Everything is an object” means that programs need extra memory to represent data as Ruby objects. Also, slow garbage collection is a well-known historical problem with Ruby.

 

Its mark-and-sweep, stop-the-world GC is not only one of the slowest garbage collection algorithms; it also has to pause the whole application while it runs. That’s why our application takes almost a dozen seconds to complete.

 

You have surely noticed a significant performance improvement with Ruby 2.1 and 2.2. These versions feature a much-improved GC called restricted generational GC. For now, it’s important to remember that the latest two Ruby releases are much faster thanks to the better GC.

 

High GC times are surprising to the uninitiated. Less surprising, but still important, is the fact that without GC all Ruby versions perform the same, finishing in about 1.15 seconds.

Internally the Ruby VMs are not that different across the versions starting from 1.9.

 

The biggest improvement relevant to performance is the restricted generational GC that came with Ruby 2.1. But that, of course, has no effect on code performance when GC is disabled.

 

If you’re a Ruby 1.8 user, you shouldn’t expect to get the performance of 1.9 and later, even with GC turned off. Modern Rubys have a virtual machine to execute precompiled code. Ruby 1.8 executes code in a much slower fashion by traversing the syntax tree.

 

OK, let’s get back to our example and think about why GC took so much time. What did it do? Well, we know that the more memory we use, the longer GC takes to complete.

 

So we must have allocated a lot of memory, right? Let’s see how much by printing memory size before and after our benchmark. The way to do this is to print the process’s RSS, or Resident Set Size, which is the portion of a process’s memory that’s held in RAM.

 

On Linux and Mac OS X you can get RSS from the ps command (ps reports it in kilobytes): puts "%d MB" % (`ps -o rss= -p #{Process.pid}`.to_i / 1024)

On Windows, your best bet is to use the OS.rss function from the OS gem, rdp/os. 

The gem is outdated and unmaintained, but it should still work for you.

blg1/example_measure_memory.rb
require "benchmark"
num_rows = 100000
num_cols = 10
data = Array.new(num_rows) { Array.new(num_cols) { "x"*1000 } }
puts "%d MB" % (`ps -o rss= -p #{Process.pid}`.to_i/1024)
GC.disable
time = Benchmark.realtime do
csv = data.map { |row| row.join(",") }.join("\n") end
puts "%d MB" % (`ps -o rss= -p #{Process.pid}`.to_i/1024) puts time.round(2)
$ rbenv shell 2.2.0
$ ruby example_measure_memory.rb 1040 MB
2958 MB

Aha. Things are getting more and more interesting. Our initial dataset is roughly 1 gigabyte. Here and later in this blog, when I write kB I mean 1024 bytes, MB means 1024 * 1024 bytes, and GB means 1024 * 1024 * 1024 bytes (yes, I know, it’s old school).

 

So, we consumed 2 extra gigabytes of memory to process that 1 GB of data. Your gut feeling is that it should have taken only 1 GB extra. Instead, we took 2 GB. No wonder GC has a lot of work to do!

 

You probably have a bunch of questions already. Why did the program need 2 GB instead of 1 GB? How do we deal with this? Is there a way for our code to use less memory? The answers are in the next section, but first, let’s review what we’ve learned so far.

 

Optimize Memory


High memory consumption is what makes Ruby slow. Therefore, to optimize we need to reduce the memory footprint. This will, in turn, reduce the time for garbage collection.

 

You might ask, why don’t we disable GC altogether? That is rarely a good thing to do. Turning off GC significantly increases peak memory consumption. The operating system may run out of memory or start swapping. Both results will hit performance much harder than Ruby GC itself.

 

So let’s get back to our example and think how we can reduce memory consumption. We know that we use 2 GB of memory to process 1 GB of data. So we’ll need to look at where that extra memory is used.

blg1/example_annotated.rb
require "benchmark"

num_rows = 100000
num_cols = 10
data = Array.new(num_rows) { Array.new(num_cols) { "x"*1000 } }

time = Benchmark.realtime do
  csv = data.map do |row|
    row.join(",")
  end.join("\n")
end

puts time.round(2)

I made the map block more verbose to show you where the problem is. The CSV rows that we generate inside that block are actually intermediate results stored into memory until we can finally join them by the newline character. This is exactly where we use that extra 1 GB of memory.

 

Let’s rewrite this in a way that doesn’t store any intermediate results. For that, I’ll explicitly loop over rows with a nested loop over columns and store results as I go into the csv.

blg1/example_optimized.rb
require "benchmark"
num_rows = 100000
num_cols = 10
data = Array.new(num_rows) { Array.new(num_cols) { "x"*1000 } }
time = Benchmark.realtime do
csv = ''
num_rows.times do |i|
num_cols.times do |j|
csv << data[i][j]
csv << "," unless j == num_cols - 1
end
csv << "\n" unless i == num_rows - 1
end
end
puts time.round(2)

The code got uglier, but how fast is it now? Let’s run it and compare it with the unoptimized version.

These are great results! Our simple changes got rid of the GC overhead. The optimized program is even faster than the original with no GC.

 

And if you run the optimized version with GC disabled, you’ll find that its GC time is merely 10% of the total execution time. Because of this, our program performs about the same in all Ruby versions.

 

By making simple changes, we got a 2.5 to 10 times performance improvement. All it required was to look through the code and think about how much memory each line and function call takes.

 

Once you catch memory copying, or extra memory allocation, or another case of a memory-inefficient operation, you rewrite the code to avoid that. Simple, isn’t it?

 

Actually, it is. It turns out that to get significant speedup you might not need code profiling. Memory optimization is easier: just review, think, and rewrite.

 

Only when you are sure that the code spends a reasonable time in GC should you look further and try to locate algorithmic complexity or other sources of poor performance.

 

But in my experience, there’s often no need to optimize anything other than memory. For me, the following 80-20 rule of Ruby performance optimization is always true: 80% of performance improvements come from memory optimization, and the remaining 20% from everything else.

 

Review, think, and rewrite. Maybe we should think about thinking. If optimizing memory requires rethinking what the code does, then what exactly should we think about? We’ll talk about that in the next section, but first, let’s review what we’ve learned so far.

 

Ruby optimization is more about rethinking what the code does and less about finding bottlenecks with specialized tools. The major skill to learn is rather the right way of thinking about performance. This is what I call the Ruby Performance Mindset.

 

How do you get into this mindset? Let me give you a hint. When you write code, remember that memory consumption and garbage collection are, most likely, why Ruby is slow, and constantly ask yourself these three questions:

 

1. Is Ruby the right tool to solve my problem?


Ruby is a general-purpose programming language, but that doesn’t mean you should use it to solve all your problems. There are things that Ruby is not so good at. The prime example is large dataset processing. That needs memory: exactly the sort of thing that you want to avoid.

 

This task is better done in a database or in background processes written in other programming languages. Twitter, for example, once had a Ruby on Rails front-end backed by Scala workers. Another example is statistical computations, which are better done with, say, the R language.

 

2. How much memory will my code use?

The less memory your code uses, the less work Ruby GC has to do. You already know some tricks to reduce memory consumption—the ones that we used in our example: line-by-line data processing and avoiding intermediate objects. I’ll show you more in subsequent blogs.

 

3. What is the raw performance of this code?


Once you’re sure the memory is used optimally, take a look at the algorithmic complexity of the code itself.

 

Asking these three questions, in the stated order, will get you into the Ruby Performance Mindset. And then you may begin to find that new code that you write is fast right from the start, without any optimization required. 

 

Ah, but what can you do about an old program? What problems should you look for? It turns out that the majority of performance problems come from a relatively limited number of sources. In the next blog, we’ll talk about these, and how to fix them.

 

Fix Common Performance Problems


There is nothing new under the sun.

The reasons code is slow invariably come down to familiar issues. This is especially true for us Ruby developers. We are far removed from writing bare-metal code. We heavily use language features, standard libraries, gems, and frameworks.

 

And each of these brings along its performance issues. Some of these are actually memory inefficient by design! We should be extremely careful about how we write our code and what features or libraries we use.

 

We have talked about two of the common reasons for poor performance in the previous blog: extra memory allocation and data structure copying. What are the others?

 

Execution context copying, memory-heavy iterators, slow type conversions, and iterator-unsafe functions are a few of the culprits. In the next sections, I’ll walk you through the steps to avoid them. But before we start, let’s briefly talk about a subject we’ve avoided so far: measurements.

 

We need some way to know that the changes we make really improve performance.

In the previous blog we used Benchmark.realtime to measure execution time and `ps -o rss= -p #{Process.pid}`.to_i to measure current memory usage.

 

To understand how reduced memory usage translates into improved performance, we’ll also measure the number of GC calls and the time required for GC.

 

The former is easy to measure. Ruby provides the GC.stat function that returns the number of GC runs (among other stats that we’ll ignore for now). The latter is harder: it requires running the same program twice, once with GC enabled and once with it disabled, and attributing the time difference to GC.

 

Let’s build a tool. We’ll create a wrapper function that will measure execution time, the number of GC runs, and total allocated memory. In addition to that, let’s make the function read the --no-gc command-line option and turn off GC if requested.

blg2/wrapper.rb
require "json"
require "benchmark"
def measure(&block)
no_gc = (ARGV[0] == "--no-gc")
if no_gc
GC.disable
else
# collect memory allocated during library loading
# and our own code before the measurement GC.start
end
memory_before = `ps -o rss= -p #{Process.pid}`.to_i/1024
gc_stat_before = GC.stat
time = Benchmark.realtime do
yield
end
puts ObjectSpace.count_objects
unless no_gc
GC.start(full_mark: true, immediate_sweep: true, immediate_mark: false)
end
puts ObjectSpace.count_objects
gc_stat_after = GC.stat
memory_after = `ps -o rss= -p #{Process.pid}`.to_i/1024
puts({
RUBY_VERSION => {
gc: no_gc ? 'disabled' : 'enabled',
time: time.round(2),
gc_count: gc_stat_after[:count] - gc_stat_before[:count],
memory: "%d MB" % (memory_after - memory_before)
}
}.to_json)
end

 

OK, there’s another way to measure GC time: GC::Profiler, which Ruby 1.9.2 introduced. The problem is that it adds significant overhead to both memory and CPU usage. That’s acceptable for profiling, where absolute numbers are not important and you’re interested only in relative values. It’s less useful for the measurements we want to do in this blog.

 

Memory measurements with and without GC will of course differ. In the former case, we will get the amount of memory allocated by the block that stays allocated after we’re done. We’ll use this number to find memory leaks.

 

In the latter case, we’ll get total memory consumption: the amount of memory allocated during the execution of the block. That’s the metric we’ll use most often in this blog, as it directly shows how much work your program creates for the GC.

 

So let’s do some measuring. Here and later in this blog, I will use Ruby 2.2 to run my examples unless otherwise noted.

blg2/wrapper_example.rb
require 'wrapper'
require 'csv'

measure do
  data = CSV.open("data.csv")
  output = data.readlines.map do |line|
    line.map { |col| col.downcase.gsub(/\b('?[a-z])/) { $1.capitalize } }
  end
  File.open("output.csv", "w+") { |f| f.write output.join("\n") }
end

$ cd code/blg2
$ ruby -I . wrapper_example.rb
{"2.2.0":{"gc":"enabled","time":14.96,"gc_count":27,"memory":"479 MB"}}
$ ruby -I . wrapper_example.rb --no-gc
{"2.2.0":{"gc":"disabled","time":10.17,"gc_count":0,"memory":"1555 MB"}}

The results are exactly what we saw before. But in addition, we see that GC kicked in 27 times during execution.

 

As usual with these measurements, you will have to run the wrapper several times to obtain a (more or less) accurate measurement. But there’s no need yet to aim for statistical significance. We’ll handle that problem later.

 

So let’s take this wrapper as a basic measurement tool and see what is slow in Ruby and how to fix it.

 

Save Memory


The first step to make your application faster is to save memory. Every time you create or copy something in memory, you add work for GC. Let’s look at the best practices to write code that doesn’t use too much memory.

 

Modify Strings in Place

Ruby programs use a lot of strings and copy them a lot. In most cases, they really shouldn’t. You can do most string manipulations in place, meaning that instead of making a changed copy, you change the original.

 

Ruby has a bunch of “bang!” functions for in-place modification. 

Those are gsub!, capitalize!, downcase!, upcase!, delete!, reverse!, slice!, and others. It’s always a good idea to use them as much as you can when you no longer need the original string.
blg2/string_in_place1.rb
Line 1  require 'wrapper'
     2  str = "X" * 1024 * 1024 * 10 # 10 MB string
     3
     4  measure do
     5    str = str.downcase
     6  end
     7  measure do
     8    str.downcase!
     9  end

$ ruby -I . string_in_place1.rb --no-gc
{"2.2.0":{"gc":"disabled","time":0.02,"gc_count":0,"memory":"9 MB"}}
{"2.2.0":{"gc":"disabled","time":0.01,"gc_count":0,"memory":"0 MB"}}

 

The String#downcase call on line 5 allocates another 10 MB of memory to hold a copy of the string, then changes the copy to lowercase. The bang version of the same function on line 8 needs no extra memory. And that’s exactly what we see in the measurements.

 

Another useful in-place modification function is String#<<. It concatenates strings by appending a new string to the original. When asked to append one string to another, most developers write this:

x = "foo"
x += "bar"
This code is equivalent to
x = "foo"
y = x + "bar"
x = y

 

Here Ruby allocates extra memory to store the result of the concatenation. The same code using the << operator needs no additional memory if the resulting string is shorter than 40 bytes. If the string is larger than that, Ruby allocates only enough memory to store the appended part. So next time, write this instead:

x = "foo"

x << "bar"

Behind the scenes, String#<< may not be able to grow the original string enough to do a true in-place modification. In that case, it has to move the string data to a new location in memory.

 

However, that happens in the realloc() C library function behind Ruby’s back and does not trigger GC.

 

Another thing worth pointing out is that “bang!” functions are not guaranteed to do an in-place modification. Most of them do, but that’s implementation dependent. So don’t be surprised when one of them doesn’t optimize anything.

 

Modify Arrays and Hashes in Place


Like strings, arrays and hashes can be modified in place.

If you look at the Ruby API documentation, you’ll again see “bang!” functions like map!, select!, reject!, and others. 

The idea is the same: do not create a modified copy of the same array unless really necessary.

 

String, array, and hash in-place modification functions are extremely powerful when used together. Compare these two examples:

blg2/combined_in_place1.rb
require 'wrapper'

data = Array.new(100) { "x" * 1024 * 1024 }
measure do
  data.map { |str| str.upcase }
end

blg2/combined_in_place2.rb
require 'wrapper'

data = Array.new(100) { "x" * 1024 * 1024 }
measure do
  data.map! { |str| str.upcase! }
end

                 map and upcase    map! and upcase!
Total time       0.22 s            0.14 s
Extra memory     100 MB            0 MB
# of GC calls    3                 0

See how this code got 35% faster by simply adding two “!” characters? Easy optimization, isn’t it? The second example gives no work to GC at all despite crunching through 100 MB of data.

 

Read Files Line by Line


It takes memory to read a whole file at once. We expect that, of course, and sometimes do it anyway for convenience. The question is, how big is the overhead?

 

It’s insignificant if you just read the file. For example, reading the 26 MB data.csv file takes roughly 26 MB of memory.

blg2/file_reading1.rb
require 'wrapper'

measure do
  File.read("data.csv")
end

$ ruby -I . file_reading1.rb --no-gc
{"2.2.0":{"gc":"disabled","time":0.02,"gc_count":0,"memory":"25 MB"}}

 

Here we simply create one File object (it takes just 40 bytes on a 64-bit architecture) and store the 26 MB string there. No extra memory is used.

 

Things rapidly become less perfect when we try to parse the file. For example, it takes 186 MB to split the same CSV file into lines and columns.

blg2/file_reading2.rb
require 'wrapper'

measure do
  File.readlines("data.csv").map! { |line| line.split(",") }
end

$ ruby -I . file_reading2.rb --no-gc
{"2.2.0":{"gc":"disabled","time":0.45,"gc_count":0,"memory":"186 MB"}}

 

What does Ruby use this memory for? The file has about 163,000 rows of data in 11 columns. So, to store the parsed contents we should allocate 163,000 objects for rows and 1,793,000 objects for columns—1,956,000 objects in total.

 

On a 64-bit architecture that requires approximately 75 MB. Together with 26 MB necessary to read the file, our program needs at least 101 MB of memory.

 

In addition to that, not all strings are small enough to fit into 40-byte Ruby objects. Ruby will allocate more memory to store them. That’s what the remaining 85 MB are used for. As a result, our simple program takes seven times the size of the data after parsing.

 

The Ruby CSV parser takes even more: 368 MB of memory, about 14 times the data size.

blg2/file_reading3.rb
require 'wrapper'
require 'csv'

measure do
  CSV.read("data.csv")
end

$ ruby -I . file_reading3.rb --no-gc
{"2.2.0":{"gc":"disabled","time":2.66,"gc_count":0,"memory":"368 MB"}}

 

This memory consumption math is really disturbing. In my experience, the size of the data after parsing increases anywhere from two up to ten times depending on the nature of the data in real-world applications. That’s a lot of work for Ruby GC.

 

The solution? Read and parse data files line by line as much as possible. In the previous blog, we did that for the CSV file and got a two times speedup. Whenever you can, read files line by line, as in this example:

blg2/file_reading4.rb
require 'wrapper'

measure do
  file = File.open("data.csv", "r")
  while line = file.gets
    line.split(",")
  end
end

And do the same with CSV files:

blg2/file_reading5.rb
require 'csv'
require 'wrapper'

measure do
  file = CSV.open("data.csv")
  while line = file.readline
  end
end

Now, let’s measure these examples with our wrapper code. To our surprise, memory allocation is about the same as before: 171 MB and 367 MB.

$ ruby -I . file_reading4.rb --no-gc
{"2.2.0":{"gc":"disabled","time":0.45,"gc_count":0,"memory":"171 MB"}}
$ ruby -I . file_reading5.rb --no-gc
{"2.2.0":{"gc":"disabled","time":2.64,"gc_count":0,"memory":"367 MB"}}

 

But if you think about this a little more, you’ll understand. It doesn’t matter how we parse the file—in one go, or line by line. We’ll end up allocating the same amount of memory anyway. And look at execution time. It’s the same as before. What’s the deal?

 

We’ve been measuring the total amount of memory allocated. That makes sense when we want to know exactly how much memory in total a certain snippet of code needs.

 

But it doesn’t tell us anything about peak memory consumption. During program execution, GC deallocates unused memory. This reduces both peak memory consumption and GC time, because there is much less data held in memory at any given moment.

 

When we read a file line by line, we’re telling Ruby that we don’t need the previous lines anymore. GC will then collect them as your program executes. So, to see the optimization, you need to turn on GC. Let’s do that and compare before and after numbers.

Before optimization:
$ ruby -I . file_reading2.rb
{"2.2.0":{"gc":"enabled","time":0.68,"gc_count":11,"memory":"144 MB"}}
$ ruby -I . file_reading3.rb
{"2.2.0":{"gc":"enabled","time":3.25,"gc_count":17,"memory":"175 MB"}}
After optimization:
$ ruby -I . file_reading4.rb
{"2.2.0":{"gc":"enabled","time":0.44,"gc_count":106,"memory":"0 MB"}}
$ ruby -I . file_reading5.rb
{"2.2.0":{"gc":"enabled","time":2.62,"gc_count":246,"memory":"1 MB"}}

 

Now you see why reading files line by line is such a good idea. First, you use almost no additional memory: you store only the line you are currently processing, plus any previous lines that have been allocated since the last GC run.

 

Second, the program will run faster. Speedup depends on the data size; in our examples, it is 35% for plain file reading and 20% for CSV parsing.

 

Watch for Memory Leaks Caused by Callbacks


Rails developers know and use callbacks a lot. But when done wrong, callbacks can hurt performance. For example, let’s write a logger object that will lazily record object creation.

 

For that, instead of writing the output right away, it will log events and replay them later all at once. It is tempting to implement the event logger using Ruby closures (lambdas or Procs) like this:

blg2/callbacks1.rb
module Logger
  extend self
  attr_accessor :output, :log_actions

  def log(&event)
    self.log_actions ||= []
    self.log_actions << event
  end

  def play
    output = []
    log_actions.each { |e| e.call(output) }
    puts output.join("\n")
  end
end

class Thing
  def initialize(id)
    Logger.log { |output| output << "created thing #{id}" }
  end
end

def do_something
  1000.times { |i| Thing.new(i) }
end

do_something
GC.start
Logger.play
puts ObjectSpace.each_object(Thing).count

 

We log an event by storing a block of code that gets executed later. The code actually looks quite cool. At least I feel cool every time I use bits of functional programming in Ruby.

 

Unfortunately, when I write something cool or smart, it tends to turn out slow and inefficient. The same thing happens here. Such logging keeps references to all created objects even after we no longer need them. The GC.start and ObjectSpace.each_object(Thing).count calls at the end of the program prove it:

$ ruby -I . callbacks1.rb
created thing 0
created thing 1
«...»
created thing 999
1000

After we’re done with the do_something, we don’t really need all one thousand of these Thing objects. But even an explicit GC.start call does not collect them. What’s going on?

 

The callbacks stored in the Logger module are the reason the objects are still there. When you pass an anonymous block in the Thing constructor to the Logger#log function, Ruby converts it into a Proc object and stores references to all objects in the block’s execution context.

 

That includes the Thing instance. In this way, we end up keeping references from the Logger object to all one thousand instances of Thing. It’s a classic example of a memory leak.

 

A dumbed-down version of the Logger class will look less cool but will prevent the memory leak. You can, of course, write an even more dumb version that doesn’t use any callbacks at all, but I’ll keep them for this example.

 style="margin:0;width:959px;height:307px">blg2/callbacks2.rb
module Logger
extend self
attr_accessor :output
def log(&event)
self.output ||= []
event.call(output)
end
def play
puts output.join("\n")
end
end
class Thing
def initialize(id)
Logger.log { |output| output << "created thing #{id}" }
end
end
def do_something
1000.times { |i| Thing.new(i) }
end
do_something
GC.start
Logger.play
puts ObjectSpace.each_object(Thing).count
$ ruby -I . callbacks1.rb created thing 0
created thing 1
«...»
created thing 999
0

In this case, no memory is leaked and all Thing objects are garbage collected.

So be careful every time you create a block or Proc callback. Remember, if you store it somewhere, you will also keep references to its execution context. That not only hurts the performance but also might even leak memory.

 

Are All Anonymous Blocks Dangerous to Performance?


Anonymous blocks do not store execution context unless they are converted to Proc objects. When calling a function that takes an anonymous block, Ruby stores a reference to the caller’s stack frame.

 

It’s OK to do that because the called function is guaranteed to return before the caller’s stack frame is popped. When calling a function that captures the block into a named argument, Ruby assumes the block may be long-lived and clones the execution context right there.

 

An obvious case of anonymous block to Proc conversion is when your receiving function declares a &block argument.

def take_block(&block)
  block.call("argument")
end

take_block { |arg| puts arg }

 

It’s a good idea to change such code to use anonymous blocks. We don’t really need the Proc conversion since the block is simply executed, and never stored as in the logger example in the previous section.

def take_block
  yield("argument")
end

take_block { |arg| puts arg }

 

However, it’s not always clear when the conversion happens. It may be hidden well down into the call stack, or even happen in C code inside the Ruby interpreter. Let’s look at this example:

blg2/signal1.rb
Line 1  class LargeObject
     2    def initialize
     3      @data = "x" * 1024 * 1024 * 20
     4    end
     5  end
     6
     7  def do_something
     8    obj = LargeObject.new
     9    trap("TERM") { puts obj.inspect }
    10  end
    11
    12  do_something
    13  # force major GC to make sure we free all objects that can be freed
    14  GC.start(full_mark: true, immediate_sweep: true)
    15  puts "LargeObject instances left in memory: %d" %
    16    ObjectSpace.each_object(LargeObject).count

$ ruby -I . signal1.rb
LargeObject instances left in memory: 1

 

This example behaves suspiciously similar to what we saw with the smart logger in the previous section. It leaves one large object behind. There’s only one place in the code that could cause that.

 

Line 9 passes an anonymous block to the trap function. A quick look at the Ruby source code reveals that the trap implementation calls cmd = rb_block_proc(), which indeed converts the block to a Proc behind the scenes. If you comment out line 9, the program reports 0 large objects left after execution.

 

So, if you suspect memory leaks in named blocks, you’ll have to review the code down the stack—at least down to the Ruby standard library, including functions implemented in C. It’s not as hard as it sounds.

 

You can always look up the function implementation in the Ruby API docs from the website. Ruby source code is well written and clean. You’ll be able to make sense of it even if you don’t know any C, as with the trap example earlier.

 

Optimize Your Iterators


To a Ruby newcomer, iterators typically look like convenient syntax for loops. In fact, iterators are such a good abstraction that even seasoned developers often forget that they are really nothing more than methods of the Array and Hash classes that take a block argument.

 

However, keeping this in mind is important for performance. We talked in Modify Arrays and Hashes in Place, about the importance of in-place operations on hashes and arrays. But that’s not the end of the story.

Because a Ruby iterator is a method on an object (Array, Range, Hash, and so on), it has two characteristics that affect performance:

 

1. Ruby GC will not garbage collect the object you’re iterating before the iterator is finished. This means that when you have a large list in memory, that whole list will stay in memory even if you no longer need the parts you’ve already traversed.

 

2. Iterators, being functions, can and will create temporary objects behind the scenes. This adds work for the garbage collector and hurts performance.

 

Compounding these performance hits, iterators (just like loops) are sensitive to the algorithmic complexity of the code. An operation that by itself is just a tad slow becomes a huge time sink when repeated hundreds of thousands of times. So let’s see when exactly iterators become slow and what can we do about that.

 

Free Objects from Collections During Iteration


Let’s assume we have a list of objects, say one thousand elements of class Thing. We iterate over the list, do something useful, and discard the list. I’ve seen and written a lot of such code in production applications. For example, you read data from a file, calculate some stats, and return only the stats.

class Thing; end

list = Array.new(1000) { Thing.new }
list.each do |item|
  # do something with the item
end
list = nil

 

Obviously, we can’t deallocate list before each finishes. So the whole list stays in memory even when we no longer need access to the previously traversed items. Let’s prove that by counting the number of Thing instances before each iteration.

blg2/each_bang.rb
class Thing; end

list = Array.new(1000) { Thing.new }
puts ObjectSpace.each_object(Thing).count # 1000 objects

list.each do |item|
  GC.start
  puts ObjectSpace.each_object(Thing).count # same count as before
  # do something with the item
end

list = nil
GC.start
puts ObjectSpace.each_object(Thing).count # everything has been deallocated

$ ruby -I . each_bang.rb
1000
1000
«...»
1000
1000
0

As expected, only when we clear the list reference does the whole list get garbage collected. We can do better by using a while loop and removing elements from the list as we process them, like this:

blg2/each_bang.rb
class Thing; end

list = Array.new(1000) { Thing.new } # allocate 1000 objects again
puts ObjectSpace.each_object(Thing).count

while list.count > 0
  GC.start # this will garbage collect the item from the previous iteration
  puts ObjectSpace.each_object(Thing).count # watch the counter decreasing
  item = list.shift
end

GC.start # this will garbage collect the item from the last iteration
puts ObjectSpace.each_object(Thing).count

$ ruby -I . each_bang.rb
1000
999
«...»
2
1

See how the object counter decreases as we loop through the list? I’m again running GC before each iteration to show you that all previous elements are garbage and will be collected. In the real world, you wouldn’t want to force GC. Just let it do its job and your loop will neither take too much time nor run out of memory.

 

Don’t worry about the negative effects of list modification inside the loop. GC time savings will outweigh them if you process lots of objects. That happens both when your list is large and when you load linked data from these objects— for example, Rails associations.

 

Use the Each! Pattern


If we wrap our loop that removes items from an array during iteration into a Ruby iterator, we’ll get what its creator, Alexander Goldstein, called “Each!”. This is how the simplest each! iterator looks:

blg2/each_bang_pattern.rb
class Array
  def each!
    while count > 0
      yield(shift)
    end
  end
end

Array.new(10000).each! { |element| puts element.class }

 

This implementation is not 100% idiomatic Ruby because it doesn’t return an Enumerator if there’s no block passed. But it illustrates the concept well enough. Also, note how it avoids creating Proc objects from anonymous blocks (there’s no &block argument).

 

Avoid Iterators That Create Additional Objects

It turns out that some Ruby iterators (not all of them as we will see) internally create additional Ruby objects. Compare these two examples:

blg2/iterator_each1.rb
GC.disable
before = ObjectSpace.count_objects

Array.new(10000).each do |i|
  [0,1].each do |j|
  end
end

after = ObjectSpace.count_objects
puts "# of arrays: %d" % (after[:T_ARRAY] - before[:T_ARRAY])
puts "# of nodes: %d" % (after[:T_NODE] - before[:T_NODE])

$ ruby -I . iterator_each1.rb
# of arrays: 10001
# of nodes: 0

blg2/iterator_each2.rb
GC.disable
before = ObjectSpace.count_objects

Array.new(10000).each do |i|
  [0,1].each_with_index do |j, index|
  end
end

after = ObjectSpace.count_objects
puts "# of arrays: %d" % (after[:T_ARRAY] - before[:T_ARRAY])
puts "# of nodes: %d" % (after[:T_NODE] - before[:T_NODE])

$ ruby -I . iterator_each2.rb
# of arrays: 10001
# of nodes: 20000

As you’d expect, the code creates 10,000 temporary [0,1] arrays. But something fishy is going on with the number of T_NODE objects. Why would each_with_index create 20,000 extra objects?

 

The answer is in the Ruby source code. Here’s the implementation of each:

VALUE
rb_ary_each(VALUE array)
{
    long i;
    volatile VALUE ary = array;

    RETURN_SIZED_ENUMERATOR(ary, 0, 0, ary_enum_length);
    for (i=0; i<RARRAY_LEN(ary); i++) {
        rb_yield(RARRAY_AREF(ary, i));
    }
    return ary;
}

Compare it to the implementation of each_with_index:

static VALUE
enum_each_with_index(int argc, VALUE *argv, VALUE obj)
{
    NODE *memo;

    RETURN_SIZED_ENUMERATOR(obj, argc, argv, enum_size);
    memo = NEW_MEMO(0, 0, 0);
    rb_block_call(obj, id_each, argc, argv, each_with_index_i, (VALUE)memo);
    return obj;
}

static VALUE
each_with_index_i(RB_BLOCK_CALL_FUNC_ARGLIST(i, memo))
{
    long n = RNODE(memo)->u3.cnt++;
    return rb_yield_values(2, rb_enum_values_pack(argc, argv), INT2NUM(n));
}

 

Even if your C-fu is not that strong, you’ll still see that each_with_index creates an additional NODE *memo variable. Because our each_with_index loop is nested in another loop, we get to create 10,000 additional nodes.

 

Worse, the internal function each_with_index_i allocates one more node. Thus we end up with the 20,000 extra T_NODE objects that you see in our example output.

 

How does that affect performance? Imagine your nested loop is executed not 10,000 times, but 1 million times. You’ll get 2 million objects created. And while they can be freed during the iteration, GC still gets way too much work to do. How’s that for an iterator that you would otherwise easily mistake for a syntactic construct?

 

It would be nice to know which iterators are bad for performance and which are not, wouldn’t it? I thought so, and so I calculated the number of additional T_NODE objects created per iterator.

 

Iterators that create 0 additional objects are safe to use in nested loops. But be careful with those that allocate two or even three additional objects: all?, each_with_index, inject, and others.

 

Doing these measurements, we can also spot that iterators of the Array class, and in some cases the Hash class, behave differently. It turns out that Range and Hash use the default iterator implementations from the Enumerable module, while Array reimplements most of them.

 

That not only results in better algorithmic performance (that was the reason behind the reimplementation), but also in better memory consumption. This means that most of Array’s iterators are safe to use, with the notable exceptions of each_with_index and inject.

 

Date#parse


Date parsing in Ruby has been traditionally slow, but this function is especially harmful to performance. Let’s see how much time it uses in a loop with 100,000 iterations:

blg2/date_parsing1.rb
require 'date'
require 'benchmark'

date = "2014-05-23"
time = Benchmark.realtime do
  100000.times do
    Date.parse(date)
  end
end
puts "%.3f" % time

$ ruby date_parsing1.rb
0.833

Each Date#parse call takes only about 0.008 ms. But in a moderately large loop, that translates into almost a second of execution time.

 

A better solution is to let the date parser know which date format to use, like this:

blg2/date_parsing2.rb
require 'date'
require 'benchmark'

date = "2014-05-23"
time = Benchmark.realtime do
  100000.times do
    Date.strptime(date, '%Y-%m-%d')
  end
end
puts "%.3f" % time

$ ruby date_parsing2.rb
0.182

That is already 4.6 times faster. But avoiding date string parsing altogether is faster still:

blg2/date_parsing3.rb
require 'date'
require 'benchmark'

date = "2014-05-23"
time = Benchmark.realtime do
  100000.times do
    Date.civil(date[0,4].to_i, date[5,2].to_i, date[8,2].to_i)
  end
end
puts "%.3f" % time

$ ruby date_parsing3.rb
0.108

 

While slightly uglier, that code is almost eight times faster than the original, and almost two times faster than the Date#strptime version. 

Object#class, Object#is_a?, Object#kind_of?

 

These functions have considerable performance overhead when used in loops or in frequently called functions like constructors or == comparison operators.

blg2/class_check1.rb
require 'benchmark'

obj = "sample string"
time = Benchmark.realtime do
  100000.times do
    obj.class == String
  end
end
puts time

$ ruby class_check1.rb
0.022767841

blg2/class_check2.rb
require 'benchmark'

obj = "sample string"
time = Benchmark.realtime do
  100000.times do
    obj.is_a?(String)
  end
end
puts time

$ ruby class_check2.rb
0.019568893

In a moderately large loop, again 100,000 iterations, such checks take 19–22 ms. That doesn’t sound bad, except that, for example, a Rails application can call comparison operators more than 1 million times per request and spend longer than 200 ms doing type checks.

 

It’s a good idea to move type checking away from iterators or frequently called functions and operators. If you can’t, unfortunately, there’s not much you can do about that.

BigDecimal::==(String)

 

Code that gets data from databases uses big decimals a lot. That is especially true for Rails applications. Such code often creates a BigDecimal from a string that it reads from a database, and then compares it directly with strings.

 

The catch is that the natural way to do this comparison is unbelievably slow in Ruby version 1.9.3 and lower:

blg2/bigdecimal1.rb
require 'bigdecimal'
require 'benchmark'

x = BigDecimal("10.2")
time = Benchmark.realtime do
  100000.times do
    x == "10.2"
  end
end
puts time

$ rbenv shell 1.9.3-p551
$ ruby bigdecimal1.rb
0.773866128
$ rbenv shell 2.0.0-p598
$ ruby bigdecimal1.rb
0.025224029
$ rbenv shell 2.1.5
$ ruby bigdecimal1.rb
0.027570681
$ rbenv shell 2.2.0
$ ruby bigdecimal1.rb
0.02474011096637696

Older Rubys have unacceptably slow implementations of the BigDecimal::== function. This performance problem goes away with a Ruby 2.0 upgrade. But if you can’t upgrade, use this smart trick. Convert a BigDecimal to a String before comparison:

blg2/bigdecimal2.rb
require 'bigdecimal'
require 'benchmark'

x = BigDecimal("10.2")
time = Benchmark.realtime do
  100000.times do
    x.to_s == "10.2"
  end
end
puts time

$ rbenv shell 1.9.3-p551
$ ruby bigdecimal2.rb
0.195041792

This hack is three to four times faster. It’s not thirty times faster, as in the Ruby 2.x implementation, but it’s still an improvement.

 

Write Less Ruby


One of the gurus who taught me programming used to say that the best code is the code that does not exist. If we could solve the problem without writing any code, then we wouldn’t have to optimize it. Right?

 

Unfortunately, in the real world, we still write code to solve our problems. But that doesn’t mean that it has to be Ruby code. Other tools do certain things better. We have seen that Ruby is especially bad in two areas: large dataset processing and complex computations. So let’s see what you can use instead, and how that improves performance.

 

Offload Work to the Database

The Ruby community tends to view databases only as data storage tools. Rails developers are especially prone to this because they often use ActiveRecord and ActiveModel abstractions without having to interface with the database directly.

 

So yes, you can build a Rails application without knowing any SQL or understanding the differences between MySQL and PostgreSQL. But by doing this, you’ll trade performance for convenience and miss out on the data processing power that databases provide.

 

It turns out—surprise, surprise—that databases are really good at complex computations and other kinds of data manipulation. Let me show you just how good they are.

 

Let’s imagine we have a large database with company employees, say, 10,000 people working in 25 various departments. We know each person’s salary, and we want to calculate the employees’ rank within a department by salary.

 

I’ll use PostgreSQL for this example and will create random data for simplicity. To reproduce this example, you should install and launch the PostgreSQL database server.

$ createdb company_data
$ psql company_data

create table empsalaries(
  department_id integer,
  employee_id integer,
  salary integer
);

insert into empsalaries (
  select (1 + round(random()*25)), *, (50000 + round(random()*250000))
  from generate_series(1, 10000)
);

create index empsalaries_department_id_idx on empsalaries (department_id);

 

Let me explain this in case you’re not familiar with PostgreSQL. The insert statement generates a series of 10,000 rows (our employee IDs), and for each of those rows assigns a random department ID and a random salary between $50,000 and $300,000.

 

Let’s first use ActiveRecord to calculate an employee rank. For that, we’ll create a folder called group_rank with Gemfile and group_rank.rb in it.

blg2/group_rank/Gemfile
source 'https://rubygems.org'
gem 'activerecord'
gem 'pg'
blg2/group_rank/group_rank.rb
require 'rubygems'
require 'active_record'
require 'benchmark'

ActiveRecord::Base.establish_connection(
  :adapter => "postgresql",
  :database => "company_data"
)

class Empsalary < ActiveRecord::Base
  attr_accessor :rank
end

time = Benchmark.realtime do
  salaries = Empsalary.all.order(:department_id, :salary)
  key, counter = nil, nil
  salaries.each do |s|
    if s.department_id != key
      key, counter = s.department_id, 0
    end
    counter += 1
    s.rank = counter
  end
end
puts "Group rank with ActiveRecord: %5.3fs" % time

Now let’s run bundler to install all the required gems and launch the application to see how long it takes to execute:

$ cd group_rank
$ rbenv shell 2.2.0
$ bundle install --path .bundle/gems
$ bundle exec ruby group_rank.rb
Group rank with ActiveRecord: 0.264s

 

Taking 264 ms to process a mere 10,000 rows is pretty bad. Now try to do the same thing with 100,000 rows and 1 million rows. Ruby >= 2.0 will take 2.4 and 24 seconds, respectively.

 

Older Rubys like 1.8 and 1.9 might not even finish because GC will kick in too often. I was patient enough to wait 110 seconds for Ruby 1.9 to process 1 million rows. I’m quite sure the users of my code are not that patient.

 

Now let’s see how fast PostgreSQL can do the same thing on 10,000 rows:

$ psql company_data
=# \timing
Timing is on.
=# select department_id, employee_id, salary,
rank() over(partition by department_id order by salary desc)
from empsalaries;
Time: 22.573 ms

 

PostgreSQL is more than ten times faster. As a bonus, it also scales nicely: it needs just 280 ms for 100,000 rows and 2.3 seconds for 1 million rows.

Notice how PostgreSQL’s performance is consistently ten times faster than the best of Ruby’s. Yes, my example uses Postgres-specific features like window functions. But that’s exactly my point.

 

The database is much better at data processing. That makes a huge difference. We have seen that ten times is not a limit. Sometimes it’s a difference between never finishing the task in Ruby and completing it in several seconds simply by letting your database do what it’s good at.

 

Rewrite in C

Ruby is implemented in C, so it has an easy way to interface with C code. And if your Ruby code is slow, you can always rewrite it in C. Wait! What? Fear not, I’m not going to talk you into writing C code yourself.

 

You can certainly do that, but it’s out of the scope of this blog. Instead, I’d like to point out that there are plenty of Ruby gems written in C that do the job faster than their counterparts.

 

I divide these native code gems into two types:

1. Gems that rewrite slow parts of Ruby or Ruby on Rails in C

2. Gems that implement a specific task in C

 

The Date::Performance gem is a good example of the first type. It’s an old gem that all Ruby 1.8 developers should use. It transparently replaces the slow Ruby Date and DateTime libraries with a similar implementation written in C.

 

Note that the Date::Performance gem is for Ruby 1.8 only. Ruby 1.9 and later have a date library that is much faster.

 

Let me show how much faster Date::Performance is. For that, we’ll switch to Ruby 1.8, install the date-performance gem, and measure the execution time (without GC, to factor it out) of a program that creates a lot of Date objects.

$ rbenv shell 1.8.7-p375
$ gem install date-performance
Fetching: date-performance-0.4.8.gem (100%)
Building native extensions. This could take a while...
Successfully installed date-performance-0.4.8
1 gem installed

Let’s see how Date from the standard library performs.

blg2/date_without_date_performance.rb
require 'date'
require 'benchmark'

GC.disable
memory_before = `ps -o rss= -p #{Process.pid}`.to_i/1024
time = Benchmark.realtime do
  100000.times do
    Date.new(2014,5,1)
  end
end
memory_after = `ps -o rss= -p #{Process.pid}`.to_i/1024
puts "time: #{time}, memory: #{"%d MB" % (memory_after - memory_before)}"

$ ruby date_without_date_performance.rb
time: 2.19644594192505, memory: 262 MB

We need 2.2 seconds to create 100,000 dates. Now let’s compare this with Date::Performance.

blg2/date_with_date_performance.rb
require 'benchmark'
require 'rubygems'
require 'date/performance'

GC.disable
memory_before = `ps -o rss= -p #{Process.pid}`.to_i/1024
time = Benchmark.realtime do
  100000.times do
    Date.new(2014,5,1)
  end
end
memory_after = `ps -o rss= -p #{Process.pid}`.to_i/1024
puts "time: #{time}, memory: #{"%d MB" % (memory_after - memory_before)}"

$ ruby date_with_date_performance.rb
time: 0.294741868972778, memory: 84 MB

 

The same code written in C is almost eight times faster! And as a bonus, it uses 178 MB less memory. Both are great improvements. That’s why I advise everybody who is stuck with good old Ruby 1.8 to use the Date::Performance gem.

 

There are also gems that implement a specific task in C. The best example of this is markdown libraries: some are written in C, some in pure Ruby, and performance comparisons, like the one made by Jashank Jeremy, one of the Jekyll blog engine contributors, consistently show the C implementations to be much faster.

 

Make Rails Faster


In principle, you already know how to make Rails faster: the same performance optimization strategies that we’ve discussed in the previous blog will work for any Rails application.

 

Use less memory, avoid heavy function calls in iterators, and write less Ruby and Rails. These are the big things that make Rails applications faster, and you’ll learn how to apply them in this blog.

 

But before we start, make sure you have at least a bare Rails application set up and running. All the examples you’ll see in this blog require a Rails 4.x application with a database connection.

 

I’ll also assume we are both using a PostgreSQL 9.x database. PostgreSQL is my preferred choice not only because it is one of the best-performing freely available databases, but also because I need a lot of random data for the examples, and that’s easy to generate with the Postgres-specific generate_series function. That lets us start with an empty database and add schema and data in migrations as necessary.

So, take the Rails app (bare or your own), and let’s optimize it.

 

Make ActiveRecord Faster

ActiveRecord is a wrapper around your data. By definition, that should take memory, and oh indeed it does. It turns out the overhead is quite significant, both in the number of objects and in raw memory.

 

To see the overhead, let’s create a database table with 10 string columns and fill it with 10,000 rows, each row containing 10 strings of 100 chars. 

 

blg3/app/db/migrate/20140722140429_large_tables.rb
class LargeTables < ActiveRecord::Migration
  def up
    create_table :things do |t|
      10.times do |i|
        t.string "col#{i}"
      end
    end
    execute <<-END
      insert into things(col0, col1, col2, col3, col4,
                         col5, col6, col7, col8, col9) (
        select
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x')
        from generate_series(1, 10000)
      );
    END
  end

  def down
    drop_table :things
  end
end

This migration creates 10 million bytes of data (10,000 * 10 * 100), approximately 9.5 MB. A database is quite efficient at storing that. For example, my PostgreSQL installation uses just 11 MB:

$ psql app_development
app_development=# select pg_size_pretty(pg_relation_size('things'));
 pg_size_pretty
----------------
 11 MB

Let’s see how memory-efficient ActiveRecord is. We’ll need to create a Thing model:

blg3/app/app/models/thing.rb
class Thing < ActiveRecord::Base
end
And we’ll need to adapt our wrapper.rb measurement helper from the previous blog to Rails:
blg3/app/lib/measure.rb
class Measure
  def self.run(options = {gc: :enable})
    if options[:gc] == :disable
      GC.disable
    elsif options[:gc] == :enable
      # collect memory allocated during library loading
      # and our own code before the measurement
      GC.start
    end
    memory_before = `ps -o rss= -p #{Process.pid}`.to_i/1024
    gc_stat_before = GC.stat
    time = Benchmark.realtime do
      yield
    end
    gc_stat_after = GC.stat
    GC.start if options[:gc] == :enable
    memory_after = `ps -o rss= -p #{Process.pid}`.to_i/1024
    puts({
      RUBY_VERSION => {
        gc: options[:gc],
        time: time.round(2),
        gc_count: gc_stat_after[:count].to_i - gc_stat_before[:count].to_i,
        memory: "%d MB" % (memory_after - memory_before)
      }
    }.to_json)
  end
end

For this to work, add the lib directory to Rails’ autoload_paths in config/application.rb.

blg3/app/config/application.rb

config.autoload_paths << Rails.root.join('lib')

 

Got that? Good. Now we can run our migration and measure the memory usage. Note that this needs to be done in production mode to make sure we do not include any of Rails development mode’s side effects.

$ RAILS_ENV=production bundle exec rake db:create
$ RAILS_ENV=production bundle exec rake db:migrate
$ RAILS_ENV=production bundle exec rails console
2.2.0 :001 > Measure.run { Thing.all.load }
{"2.2.0":{"gc":"enable","time":0.32,"gc_count":1,"memory":"33 MB"}}
 => nil

 

ActiveRecord uses 3.5 times more memory than the size of the data. It also triggers one garbage collection during loading.

ActiveRecord is convenient, but that convenience comes at a steep price. I realize I’m not going to convince you to avoid ActiveRecord altogether. But you do need to understand the consequences of using it.

 

In 80% of cases, the speed of development is worth more than the cost in execution speed. In the remaining 20% of cases, you have other options. 

Load Only the Attributes You Need


Your first option is to load only the data you intend to use. Rails makes this very easy to do, like this:

$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)

2.2.0 :001 > Measure.run { Thing.all.select([:id, :col1, :col5]).load }

{"2.2.0":{"gc":"enable","time":0.21,"gc_count":1,"memory":"7 MB"}} => nil

 

This uses almost 5 times less memory and runs 1.5 times faster than Thing.all.load. The more columns you have, the more it makes sense to add select into the query, especially if you join tables.

 

Preload Aggressively

Another best practice is preloading. Every time you query a has_many or belongs_to relationship, preload.

 

For example, let’s add a has_many relationship call to our Thing. We’ll need to set up the migration and ActiveRecord model.

blg3/app/db/migrate/20140724142101_minions.rb
class Minions < ActiveRecord::Migration
  def up
    create_table :minions do |t|
      t.references :thing
      10.times do |i|
        t.string "mcol#{i}"
      end
    end
    execute <<-END
      insert into minions(thing_id, mcol0, mcol1, mcol2, mcol3, mcol4,
                          mcol5, mcol6, mcol7, mcol8, mcol9) (
        select things.id,
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x')
        from things, generate_series(1, 10)
      );
    END
  end

  def down
    drop_table :minions
  end
end

blg3/app/app/models/minion.rb
class Minion < ActiveRecord::Base
  belongs_to :thing
end

blg3/app/app/models/thing.rb
class Thing < ActiveRecord::Base
  has_many :minions
end

Run the migration with RAILS_ENV=production bundle exec rake db:migrate and you will get 10 Minions for each Thing in the database. Iterating over that data without preloading is not such a good idea.

$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > Measure.run { Thing.all.each { |thing| thing.minions.load } }
{"2.2.0":{"gc":"enable","time":272.93,"gc_count":16,"memory":"478 MB"}} => nil

Good luck waiting for this one line of code to finish. It needs not only to load everything into memory but also to execute 10,000 queries against the database to fetch the minions for each thing.

 

Preloading is the better way.

$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)

2.2.0 :001 > Measure.run { Thing.all.includes(:minions).load }

{"2.2.0":{"gc":"enable","time":11.59,"gc_count":19,"memory":"518 MB"}} => nil

 

Depending on the Rails version, this might be slightly less memory efficient. But the code finishes more than 20 times faster because Rails performs only two database queries: one to load things, and another to load minions.

 

Combine Selective Attribute Loading and Preloading


Even better is to take my advice from the Load Only the Attributes You Need section and select only the columns we need. But there’s a catch. Rails does not have a convenient way of selecting a subset of columns from the dependent model. For example, this will fail:

 

Thing.all.includes(:minions).select("col1", "minions.mcol4").load

It fails because includes(:minions) runs an additional query to fetch minions for the things it selected. And Rails is not smart enough to figure out which of the selected columns belong to the minions table.

 

If we queried from the side of the belongs_to association, we would use joins.

Minion.where(id: 1).joins(:thing).select("things.col1", "minions.mcol4")

 

From the has_many side joins will return duplicates of the same Thing object, 10 duplicates in our case. To combat that, we can use the PostgreSQL-specific array_agg feature that aggregates an array of columns from the joined table.

$ RAILS_ENV=production bundle exec rails console Loading production environment (Rails 4.1.4)
2.2.0 :001 > query = "select id, col1, array_agg(mcol4) from things
2.2.0 :002">   inner join
2.2.0 :003">   (select thing_id, mcol4 from minions) minions
2.2.0 :004">   on (things.id = minions.thing_id)
2.2.0 :005">   group by id, col1"
 => "select id, col1, array_agg(mcol4) from things inner join
(select thing_id, mcol4 from minions) minions
on (things.id = minions.thing_id)
group by id, col1"
2.2.0 :006 > Measure.run { Thing.find_by_sql(query) }
{"2.2.0":{"gc":"enable","time":0.62,"gc_count":1,"memory":"8 MB"}} => nil

Just look at the memory consumption: 8 MB instead of 518 MB from a full select with preloading. As a bonus, this runs 20 times faster. Restricting the number of columns you select can save you seconds of execution time and hundreds of megabytes of memory.

 

Use the each! Pattern for Rails with find_each and find_in_batches

 

It is expensive to instantiate a lot of ActiveRecord models. Rails developers knew that and added two functions to loop through large datasets in batches.

 

Both find_each and find_in_batches load 1,000 objects at a time by default and return them to you: the former, one by one; the latter, a whole batch at once.

 

You can ask for smaller or larger batches with the :batch_size option.
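For example, a quick sketch (do_something and process_batch are hypothetical stand-ins for your own code):

Thing.find_each(batch_size: 500) { |thing| do_something(thing) }
Thing.find_in_batches(batch_size: 500) { |batch| process_batch(batch) }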

find_each and find_in_batches still have to load all the objects into memory. So how do they improve performance? The effect is the same as with the each! pattern we discussed earlier: once you’re done with a batch, GC can collect it. Let’s see how that works.

$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > ObjectSpace.each_object(Thing).count
=> 0
2.2.0 :002 > Thing.find_in_batches { |batch|
2.2.0 :003?> GC.start
2.2.0 :004?> puts ObjectSpace.each_object(Thing).count
2.2.0 :005?> }
1000
2000
… 6 lines elided
2000
2000 => nil
2.2.0 :006 > GC.start
 => nil
2.2.0 :007 > ObjectSpace.each_object(Thing).count
 => 0

 

GC indeed collects objects from previous batches, so no more than two batches are in memory during the iteration. Compare this with the regular each iterator over the list of objects returned by Thing.all.

$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > ObjectSpace.each_object(Thing).count
=> 0
2.2.0 :002 > Thing.all.each_with_index { |thing, i|
2.2.0 :003?> if i % 1000 == 0
2.2.0 :004?> GC.start
2.2.0 :005?> puts ObjectSpace.each_object(Thing).count
2.2.0 :006?> end
2.2.0 :007?> }; nil
10000
10000
… 6 lines elided
10000
10000 => nil

 

Here we keep 10,000 objects for the whole duration of each loop. This increases both total memory consumption and GC time. It also increases the risk of running out of memory if the dataset is too big (remember, ActiveRecord needs 3.5 times more space to store your data).

 

Use ActiveRecord without Instantiating Models


If all you need is to run a database query or update a column in the table, consider using the following ActiveRecord functions that do not instantiate models.

ActiveRecord::Base.connection.execute("select * from things")
This function executes the query and returns its result unparsed.

ActiveRecord::Base.connection.select_values("select col5 from things")
Similar to the previous function, but returns an array of values only from the first column of the query result.

Thing.all.pluck(:col1, :col5)
A variation of the previous two functions. Returns an array of values that contains either the whole row or just the columns you specified in the arguments to pluck.

Thing.where("id < 10").update_all(col1: 'something')
Updates columns in the table.

These not only save you memory but also run faster because they neither instantiate models nor execute before/after filters. All they do is run plain SQL queries and, in some cases, return arrays as the result.
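A quick console session shows the shape of what comes back (a minimal illustration; values abridged for readability):

2.2.0 :001 > ActiveRecord::Base.connection.select_values("select col5 from things limit 2")
 => ["xxx...", "xxx..."]
2.2.0 :002 > Thing.limit(2).pluck(:id, :col1)
 => [[1, "xxx..."], [2, "xxx..."]]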

 

Make ActionView Faster


It’s not unusual for template rendering to take longer than controller code. But you may think that you can’t do much to speed it up. Most templates are just a collection of calls to rendering helper functions that you didn’t write and can’t really optimize—except when they’re called in a loop.

 

Rendering is basically string manipulation. As we already know, that takes both CPU time and memory. In a loop, we multiply the effect of what is already slow. So every time you iterate over a large dataset in a template, see whether you can optimize it.

 

Rails template rendering has performance characteristics similar to Ruby iterators. It’s fine to do just about anything until you render partials in a loop. There are two reasons for that.

 

First, rendering comes at a cost. It takes time to initialize the view object, compute the execution context, and pass the required variables. So every partial that you render in a loop should be your first suspect for poor performance.

 

Second, the majority of Rails view helpers are iterator-unsafe. One call to link_to will not slow you down, but a thousand of them will.

 

Render Partials in a Loop Faster

When asked to render a set of objects, your template code would probably look something like this:

<% objects.each do |object| %>
  <%= render partial: 'object', locals: { object: object } %>
<% end %>

 

There’s nothing wrong with the code, except that it becomes slow on a large collection of objects. How slow? I measured the rendering of 10,000 empty partials in different versions of Rails and the results were not pleasant.

Rails 2.x       Rails 3.x       Rails 4.x
0.335 ± 0.006   1.355 ± 0.033   1.840 ± 0.045

 

Although 10,000 objects is not a large dataset, just rendering the placeholders for them will set you back almost 2 seconds on recent Rails. That’s disturbing.

 

Also disturbing is that rendering gets much worse with each subsequent version of Rails. But before you drift into memories of the good old Rails 2.x days, let me point out that even 0.3 seconds for doing nothing is already too much.

 

Rails 3.0 and higher has a solution to this problem called render collection:


<%= render :partial => 'object', :collection => @objects %>

Or, in a shorter notation:

<%= render @objects %>

 

This inserts a partial for each member of the collection, automatically figuring out the partial name and passing the local variable. That also performs 20 times faster.

Rails 3.x       Rails 4.x
0.066 ± 0.001   0.100 ± 0.005

 

The reason rendering a collection is faster is that it initializes the template only once. Then it reuses the same template to render all objects from the collection. Rendering 10,000 partials in a loop will have to repeat the initialization 10,000 times.

 

Render collection has none of that overhead, and it doesn’t produce excessive log output, either. That makes it clearly superior to rendering partials in a loop.

 

There’s no render collection in Rails 2.x. But if you’re still using that version, try the template inliner plug-in. It achieves the same effect by textually inserting partial code into the parent template before Rails compiles it.

 

The template inliner is something I wrote when I worked at Acunote, the online project management system built with Ruby on Rails.

 

There we rendered hundreds of tasks on the page, each task having 8–10 fields. For each field, we had a separate partial, and there was no render collection in Rails 2.x. That’s when the template inliner was born.

 

To use it, add the plug-in to your Rails application, and append inline: true to the render statement:

<% @objects.each do |object| %>
  <%= render partial: 'object', locals: { object: object }, inline: true %>
<% end %>

 

Rails never sees the render partial call, and as a result, we get the same two-orders-of-magnitude performance improvement.

Rails 2.x       Rails 2.x with template inliner
0.335 ± 0.006   0.003 ± 0.0001

 

Avoid Iterator-Unsafe Helpers and Functions


All rendering helpers are what I call iterator-unsafe. They take both time and memory, so be careful when using them in a loop, especially with link_to, url_for, and image_tag.

 

I do not have any better advice than to be careful, for two reasons. First, you cannot avoid using these helpers (especially in newer Rails). Second, it’s very hard to benchmark them.

 

Helpers’ performance depends on too many factors, making any synthetic benchmark useless. For example, link_to and url_for get slower when the complexity of your routing increases.

 

And image_tag performs worse as you add more assets. In one application it’s safe to render a thousand URLs in a loop, whereas in another it’s not. So…be careful.

 

Test Performance

As experienced software developers, we know that testing is the best way to ensure that our code works as advertised. When you first write the code, a test proves that it does what you think it does. When you fix a bug, the test prevents it from happening again.

 

I’m a big fan of applying the same approach for performance. What if we write tests that first set the expected level of performance, and then make sure that performance doesn’t degrade below this level? Sounds like a good idea, right?

 

I learned about this concept while working on Acunote. We started Acunote when Rails was at version 1.0, so performance was a huge concern. Performance testing helped us not only to understand and improve application performance but to survive through numerous Rails upgrades.

 

It turned out that even a minor version upgrade could introduce a performance regression in some unexpected way. We wouldn’t have been able to detect and fix these regressions without the performance tests.

 

So let me show you how we did performance testing in Acunote and how you can do it too.

A unit test for a function might look something like this:

def test_do_something
  assert_equal 4, do_something(2, 2)
end

 

This test, in fact, performs three separate steps: evaluation, assertion, and error reporting. Testing frameworks abstract these three steps, so we end up writing just one line of code to do all three.

 

To evaluate, our example test runs the do_something function to get its return value. The assertion then compares this actual value with the expected value. If the assertion fails, the test reports the error.

 

A performance test should perform these same three steps, but each one of them will be slightly different. Say we want to write the performance test for this same do_something function. The test will look like this:

def test_something_performance
  actual_performance = performance_benchmark do
    do_something(2, 2)
  end
  assert_performance actual_performance
end

The evaluation step is simply benchmarking. The actual value for assert_performance is the current set of performance measurements.

 

Ah, but what is our expected level of performance? We said that our performance test should ensure that performance doesn’t degrade below an expected level.

 

A reasonable answer is that our assert_performance should make sure that performance is the same as or better than before. So the test should somehow know the performance measurements from the previous test run.

 

Those measurements make up the expected value that we’ll compare to. What if there are no previous results? Then the only thing the test should do is store the results for future comparison.

 

We already know how to compare performance measurements from the previous blog. So the remaining thing to figure out is how and where to store the previous test results. This is something regular tests don’t do.

 

Should the test find a slowdown, we want to see the performance before and after, and their difference. As we know from the previous blog, all before and after numbers should come with their deviations, and the difference should come with its confidence interval.

 

This means the reporting step in performance tests is also very different from what you usually see in tests.

 

OK, that’s the big picture of performance testing. Now, the details.

Benchmark


Let’s take everything you learned in the previous blog on measurements and apply that knowledge here to write a benchmark function. To reiterate, here’s what such a function should do:

  • Run the code multiple times to gather measurements. It’s best if we can do 30 runs or more.
  • Skip the results of the first run to reduce the warm-up effects and let caching do its job.
  • Force GC before each run.
  • Fork the process before measurement to make sure all runs are isolated and don’t interfere with each other.
  • Store all measurements somewhere (in the file, on S3, etc.) to be processed later.
  • Calculate and report average performance and its standard deviation.

 

This list makes for a pretty detailed spec, so let’s go ahead and write the benchmark function.

blg8/performance_benchmark.rb
require 'benchmark'

def performance_benchmark(name, &block)
  # 31 runs, we'll discard the first result
  (0..30).each do |i|
    # force GC in parent process to make sure we reclaim
    # any memory taken by forking in previous run
    GC.start

    # fork to isolate our run
    pid = fork do
      # again run GC to reduce effects of forking
      GC.start
      # disable GC if you want to see the raw performance of your code
      GC.disable if ENV["RUBY_DISABLE_GC"]

      # because we are in a forked process, we need to store
      # results in some shared space.
      # local file is the simplest way to do that
      benchmark_results = File.open("benchmark_results_#{name}", "a")

      elapsed_time = Benchmark::realtime do
        yield
      end

      # do not count the first run
      if i > 0
        # we use system clock for measurements,
        # so microsecond is the last significant figure
        benchmark_results.puts elapsed_time.round(6)
      end
      benchmark_results.close
      GC.enable if ENV["RUBY_DISABLE_GC"]
    end
    Process::waitpid pid
  end

  measurements = File.readlines("benchmark_results_#{name}").map do |value|
    value.to_f
  end
  File.delete("benchmark_results_#{name}")

  average = measurements.inject(0) do |sum, x|
    sum + x
  end.to_f / measurements.size
  stddev = Math.sqrt(
    measurements.inject(0){ |sum, x| sum + (x - average)**2 }.to_f /
      (measurements.size - 1)
  )

  # return both average and standard deviation,
  # this time in millisecond precision
  # for all practical purposes that should be enough
  [name, average.round(3), stddev.round(3)]
end

We made three simplifications in the benchmarking function. First, we used the Ruby round function, which doesn’t follow the common tie-breaking rule when the digit to be dropped is a 5 followed only by zeros.

Instead of rounding to the nearest even number, it always rounds away from zero. Second, we decreased precision to milliseconds even though the system clock can measure time with microsecond precision. Finally, we hard-coded the number of measurements to 30.
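You can see the rounding behavior in irb (a minimal illustration, not part of the original examples; banker’s rounding would return 2 for the first expression):

2.5.round  # => 3: Ruby rounds half away from zero
3.5.round  # => 4: here banker's rounding happens to agree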

 

You can easily undo the first and the last simplifications, but I recommend you keep the second. Ruby isn’t a systems programming language, so we usually don’t care about microseconds of execution time. In fact, in most cases we don’t care about milliseconds or even tens of milliseconds—that’s why we rounded off our measurements in our example.

 

Now let’s see how our benchmarking function works. Run this simple program:

blg8/test_performance_benchmark.rb
require 'performance_benchmark'
result = performance_benchmark("sleep 1 second") do
  sleep 1
end
puts "%-28s %0.3f ± %0.3f" % result
$ cd code/blg8
$ ruby -I . test_performance_benchmark.rb
sleep 1 second 1.000 ± 0.000

 

As expected, sleep(1) takes exactly one second, with no deviation. We can be sure our measurements are correct. Now it’s time to write the function that asserts performance.

 

Assert Performance


We know that assert_performance should measure the current performance, compare it with the performance from the previous run, and store the current measurements as the reference value for the next run.

 

Of course, the first test run should just store the results because there is no previous data to compare against.

 

Now let’s think through success and failure scenarios for such tests. Failure is easy. If performance is significantly worse, then report the failure. The success scenario, though, has two possible outcomes: one when performance is not significantly different, and another when it has significantly improved.

 

It looks like it’s not enough just to report failure/success. We need to report the current measurement, as well as any significant difference in performance.

 

So let’s get back to the editor and try to do exactly that.

blg8/assert_performance.rb
require 'minitest/autorun'

class Minitest::Test
  def assert_performance(current_performance)
    self.assertions += 1 # increase Minitest assertion counter

    benchmark_name, current_average, current_stddev = *current_performance
    past_average, past_stddev = load_benchmark(benchmark_name)
    save_benchmark(benchmark_name, current_average, current_stddev)

    optimization_mean, optimization_standard_error = compare_performance(
      past_average, past_stddev, current_average, current_stddev
    )
    optimization_confidence_interval = [
      optimization_mean - 2*optimization_standard_error,
      optimization_mean + 2*optimization_standard_error
    ]

    conclusion = if optimization_confidence_interval.all? { |i| i < 0 }
      :slowdown
    elsif optimization_confidence_interval.all? { |i| i > 0 }
      :speedup
    else
      :unchanged
    end

    print "%-28s %0.3f ± %0.3f: %-10s" %
      [benchmark_name, current_average, current_stddev, conclusion]
    if conclusion != :unchanged
      print " by %0.3f..%0.3f with 95%% confidence" %
        optimization_confidence_interval
    end
    print "\n"

    if conclusion == :slowdown
      raise MiniTest::Assertion.new("#{benchmark_name} got slower")
    end
  end

  private

  def load_benchmark(benchmark_name)
    return [nil, nil] unless File.exist?("benchmarks/#{benchmark_name}")
    benchmark = File.read("benchmarks/#{benchmark_name}")
    benchmark.split(" ").map { |value| value.to_f }
  end

  def save_benchmark(benchmark_name, current_average, current_stddev)
    File.open("benchmarks/#{benchmark_name}", "w+") do |f|
      f.write "%0.3f %0.3f" % [current_average, current_stddev]
    end
  end

  def compare_performance(past_average, past_stddev,
                          current_average, current_stddev)
    # when there's no past data, just report no performance change
    past_average ||= current_average
    past_stddev  ||= current_stddev

    optimization_mean = past_average - current_average
    optimization_standard_error =
      (current_stddev**2/30 + past_stddev**2/30)**0.5

    # drop non-significant digits that our calculations might introduce
    optimization_mean = optimization_mean.round(3)
    optimization_standard_error = optimization_standard_error.round(3)

    [optimization_mean, optimization_standard_error]
  end
end

Again, this includes some simplifications you can easily undo. First, we save the benchmark results to the file in a predefined hard-coded location.

 

Second, we hardcode the number of measurement repetitions to 30, exactly as in the performance_benchmark function. And third, our assert_performance works only with Minitest 5.0 and later, so we need to install the minitest gem.

 

But now that we have our assert, we can write our first performance test.

blg8/test_assert_performance1.rb
require 'assert_performance'
require 'performance_benchmark'
class TestAssertPerformance < Minitest::Test
  def test_assert_performance
    actual_performance = performance_benchmark("string operations") do
      result = ""
      700.times do
        result += ("x"*1024)
      end
    end
    assert_performance actual_performance
  end
end
Let’s run it (don’t forget to gem install minitest first).
$ ruby -I . test_assert_performance1.rb
# Running:
string operations 0.172 ± 0.011: unchanged
.
Finished in 2.294557s, 0.4358 runs/s, 0.4358 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips
The first run will save the measurements to the benchmarks/string operations file.
If we rerun the test without making any changes, it should report no change.
$ ruby -I . test_assert_performance1.rb
# Running:
string operations 0.168 ± 0.016: unchanged
.
Finished in 2.313815s, 0.4322 runs/s, 0.4322 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips

 

As expected, the test reports that performance hasn’t changed despite the difference in average numbers. That’s statistical analysis at work! Now you know why we spent so much time talking about it.

Now let’s optimize the program. 

blg8/test_assert_performance2.rb
require 'assert_performance'
require 'performance_benchmark'
class TestAssertPerformance < Minitest::Test
  def test_assert_performance
    actual_performance = performance_benchmark("string operations") do
      result = ""
      700.times do
        result << ("x"*1024)
      end
    end
    assert_performance actual_performance
  end
end
Let’s run the performance test again.
$ bundle exec ruby -I . test_assert_performance2.rb
# Running:
string operations 0.004 ± 0.000: speedup by 0.161..0.167 with 95% confidence
Finished in 1.089948s, 0.9175 runs/s, 0.9175 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips

And of course, the test reports a huge optimization. That’s exactly what we like to see when we optimize.

 

However, if the execution environment isn’t perfect, our performance test might report a slowdown or optimization even if we did nothing. For example, I can get the slowdown error from the first unoptimized test on my laptop when it gets busy doing something else. This is one such test run:

$ ruby -I . test_assert_performance1.rb
# Running:
string operations 0.201 ± 0.059: slowdown by -0.044..-0.022 with 95% confidence
F
Finished in 2.456716s, 0.4070 runs/s, 0.4070 assertions/s.

1) Failure:

TestAssertPerformance#test_assert_performance [test_assert_performance1.rb:10]:

string operations got slower

1 runs, 1 assertions, 1 failures, 0 errors, 0 skips

 

See how big my standard deviation is? It’s almost a third of the average. This means that some of the measurements were outliers, and they made the test fail.

 

We already talked about two ways of dealing with that. One is to further minimize external factors. Another is to exclude outliers. But there’s one more: you can increase the confidence level for the optimization interval.

 

The 95% confidence interval we use is roughly plus or minus two standard errors from the mean of the difference between the before and after numbers. We can demand 99% confidence instead. That increases the interval to about plus or minus three standard errors.
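In the assert_performance code from earlier, that’s a one-line change (a sketch against our own implementation above):

optimization_confidence_interval = [
  optimization_mean - 3*optimization_standard_error,
  optimization_mean + 3*optimization_standard_error
]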

 

Note how a simple tweak of the confidence interval can change the test outcome. So I recommend that you play with this and come up with the confidence level that works reliably for your performance tests.

 

There is, of course, a limit to confidence level increases. See how we were barely able to determine that performance in our test stayed the same. Had the standard deviation been one millisecond less, we would have declared this run a slowdown.

 

You might be tempted to increase the interval size to four or five standard errors from the mean. But in practice, three standard errors (99%) is the highest confidence you should aim for.

 

You can’t demand the confidence of Large Hadron Collider experiments from your Ruby tests. If your tests are still not reliable, step back and look for more external factors, or start excluding outliers from your measurements.

 

Report Slowdowns and Optimizations


The test prints benchmarks together with any deviations from the previous runs. Is there anything else to report? Yes, but not in the test output.

 

A performance test is a perfect candidate for daily or continuous testing. Make sure you notify developers about any changes in performance, good or bad. I personally take the complete output of the test and report that.

 

It contains enough information for a human to assess performance. Your choice might be different, so I won’t talk further about reporting here.

 

Test Rails Application Performance

If you’re writing Rails applications, you’ll surely want to apply the same performance testing techniques to them. Rails developers have long since recognized that. Rails 3 even included a performance testing framework. And while Rails 4 no longer bundles it, the code is still there in the rails-perftest gem.

 

So should we simply use that gem for Rails performance testing? No, not really. The rails-perftest gem tries to be a jack of all trades and ends up master of none. It does both benchmarking and profiling, and lets you collect metrics other than execution time.

 

At the same time, it doesn’t do enough runs for its benchmarks to be statistically significant. And it doesn’t do any comparison. And, honestly, mixing profiling with performance testing in one tool doesn’t sound like a good idea.

 

With that in mind, I think we’re better off adapting our performance_benchmark and assert_performance functions to work with Rails. So let’s see what it takes.

 

Make Rails Performance Test an Integration Test

Rails is a complex stack of software. In a performance test, we need to make sure we benchmark the complete stack, not just a part of it.

 

It might be tempting to performance test only a controller action, or even a function in the model. But what if we add some middleware that totally ruins our performance? Will our performance test spot that?

 

The only kind of test that runs the whole Rails stack every request is the integration test, so let’s start writing one.

class RailsAppPerformanceTest < ActionDispatch::IntegrationTest
  test "performance test something" do
    actual_performance = performance_benchmark("something") do
      get "/something"
    end
    assert_performance actual_performance
  end
end

 

The good thing about this test is, again, that it processes the request almost the same way our production application would. Ah, but the devil is in the details. While the get or post calls from the integration test do execute the whole Rails stack, they do excessive logging and no caching.

 

By default, tests run with the :debug log level. To imitate production, we’ll need to set it to :info. We can either create a separate environment for performance testing or simply set the log level before each test, like this:

class RailsAppPerformanceTest < ActionDispatch::IntegrationTest
  def setup
    @previous_log_level = Rails.application.config.log_level
    Rails.application.config.log_level = :info
  end

  def teardown
    Rails.application.config.log_level = @previous_log_level
  end
end

 

Caching is more complicated. If our application heavily relies on caching, we must be sure to turn it on for performance tests.

 

Our benchmark function skips the results of the first test run, so it will correctly ignore the first, uncached, request. This means no changes to the testing infrastructure are needed:

 

we just execute Rails.application.config.action_controller.perform_caching = true in the same place where we change the log level.
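Putting the two together, the setup might look like this (a sketch; remember to restore both settings in teardown):

def setup
  @previous_log_level = Rails.application.config.log_level
  @previous_caching = Rails.application.config.action_controller.perform_caching
  Rails.application.config.log_level = :info
  Rails.application.config.action_controller.perform_caching = true
end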

 

Benchmark Rails the Right Way


So we wrote an integration performance test, reduced logging, and decided on caching. Is there anything else to think about? It turns out, yes.

 

The majority of Rails apps work with a database. They load and store data there. When we talked about benchmarking Ruby code earlier, we didn’t think about the byproducts of that code. Instead, we assumed that there were no side effects and that it was safe to rerun the same function again and again.

 

Not so much with Rails. In most cases, the time you measure will increase with each test run during the benchmark. Why? Because when you hit Rails, more often than not you change the database. This way you might end up with more and more data to process in each subsequent test run.

 

Say our Rails action inserts a record into the database, then returns a summary of the stored data. The more we insert, the slower our action becomes. The measurement from the 30th test run will be way different from the first. Remember what that means for performance tests?

 

Large standard deviations in the benchmarks and large standard errors of the optimization. As a result, we won’t be able to compare the performance.

 

So we’ll need to make sure the test runs leave no byproducts. The easiest way to do that is to start a transaction, measure, and roll it back. That’s exactly what Rails does in between tests.

 

But now we’ll have to do that in between measurements inside one test. Let me show you how to modify the performance_benchmark function to do that.

blg8/rails_performance_benchmark.rb
require 'benchmark'

class PerformanceTestTransactionError < StandardError
end

def performance_benchmark(name, &block)
  # 31 runs, we'll discard the first result
  (0..30).each do |i|
    # force GC in parent process to make sure we reclaim
    # any memory taken by forking in previous run
    GC.start

    # fork to isolate our run
    pid = fork do
      # again run GC to reduce effects of forking
      GC.start
      # disable GC if you want to see the raw performance of your code
      GC.disable if ENV["RUBY_DISABLE_GC"]

      # because we are in a forked process, we need to store
      # results in some shared space.
      # local file is the simplest way to do that
      benchmark_results = File.open("benchmark_results_#{name}", "a")

      elapsed_time = nil
      begin
        ActiveRecord::Base.transaction do
          elapsed_time = Benchmark::realtime do
            yield
          end
          raise PerformanceTestTransactionError
        end
      rescue PerformanceTestTransactionError
        # rollback transaction as expected
      end

      # do not count the first run
      if i > 0
        # we use system clock for measurements,
        # so microsecond is the last significant figure
        benchmark_results.puts elapsed_time.round(6)
      end
      benchmark_results.close
      GC.enable if ENV["RUBY_DISABLE_GC"]

      # Hack! Do this only if you use database sockets.
      # dup2 all file descriptors to /dev/null so that forked process
      # forgets them and doesn't close them at exit.
      # Otherwise the forked process will close the database connection.
      3.upto(256) { |fd| IO.new(fd).reopen("/dev/null") rescue nil }
    end
    Process::waitpid pid
  end

  measurements = File.readlines("benchmark_results_#{name}").map do |value|
    value.to_f
  end
  File.delete("benchmark_results_#{name}")

  average = measurements.inject(0) do |sum, x|
    sum + x
  end.to_f / measurements.size
  stddev = Math.sqrt(
    measurements.inject(0){ |sum, x| sum + (x - average)**2 }.to_f /
      (measurements.size - 1)
  )

  # return both average and standard deviation,
  # this time in millisecond precision
  # for all practical purposes that should be enough
  [name, average.round(3), stddev.round(3)]
end

 

All it takes is to wrap your measurement into the transaction and roll it back after you’re done. Some people may need the socket hack that I highlighted in the previous example.

 

That’s a trick I learned at Acunote. We used to develop and test on Linux and connected to the PostgreSQL database via local sockets. If you do the same, you’ll need the hack because the forked measurement process will attempt to close all its sockets at exit.

 

And one of those sockets will be the database connection that the forked process is sharing with its parent. So after the socket is closed, the parent won’t be able to continue benchmarking.

 

With all these modifications in place, our benchmarking function and our assertion are Rails compatible and ready to be used. So, you’re good to go and write your own Rails performance test for your applications.

 

But before we jump into that, let me tell you about one more kind of performance test that’s not applicable to most of the pure Ruby applications but that’s really important for Rails.

 

Test Database Performance


A Rails application is not just Ruby code. With all the ActiveModel and ActiveRecord abstractions, developers tend to forget about the underlying database. But its performance is essential to the performance of the whole application.

 

If the queries we run are slow, the application will be slow. If we execute too many queries, the application will be slow. See the pattern? We absolutely need to take care that our performance tests account for the database performance.

 

We have two kinds of database-related performance problems: slow queries and too many queries. Are our performance tests helping us to prevent each of these two kinds of database slowdowns?

 

It turns out the answer is no in both cases. What can we do about that? Generate enough data for tests in the first case, and write another kind of test in the second. Let me explain.

 

Generate Enough Data for Performance Tests

Say we have an application that has a slow query. If we see that query in development mode, we’ll optimize it, and write the performance test that calls the code that executes this query. Such a test will make sure the query doesn’t get any slower with time. Nothing more to do here.

 

But what if we spot the slow query only in production? For example, we see it in the database server logs, or in the NewRelic report. This usually means our test database doesn’t have enough data. Production has way more data, and that makes queries slower there.

 

So our best strategy is to generate enough data for our performance tests.

How much is enough? That depends on both our application and our data. In some cases our request only inserts data, and there’s nothing we need to add to our test database.

 

In other cases, our request processes all data from a table that, for example, contains 10,000 rows on production. Then our test database also needs 10,000 rows in that table.

 

Yet another case is when the request uses just 100 rows out of 10,000 total. So it might be enough to have only 100 rows prepared for the performance test. Or it might not be enough: for example, if our database structure lacks an index, fetching 100 rows out of 10,000 may be nowhere near as fast as fetching 100 rows out of 100.

 

There are more cases of course, so I can’t give you more specific advice here. When writing the performance test, think through the data usage and use your best judgment to generate just enough data to match the production behavior for your particular situation.
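As a starting point, the generate_series trick from our migrations works just as well for seeding a performance test database (a sketch; adjust the columns and the row count to match your production numbers):

ActiveRecord::Base.connection.execute <<-END
  insert into things(col0, col1) (
    select rpad('x', 100, 'x'), rpad('x', 100, 'x')
    from generate_series(1, 10000)
  );
END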

 

Test Database Queries


Executing slow queries is obviously bad for performance. But running too many queries is also bad.

We’ve seen that in the example from Preload Aggressively. There, the extra 10,000 queries took roughly 260 extra seconds of execution time. Although realistic, that example is rather extreme. You’ll surely notice this kind of slowdown before the code goes into production.

 

A more subtle case is when our request runs just a few dozen queries. In development that might not be noticeable at all because we have less data and run everything in a single process with no concurrency.

 

But our production environment is going to be exactly the opposite, with more data and high concurrency. So our harmless dozen queries can become a huge performance problem.

 

Because Rails is such a good abstraction, it’s hard to understand how many queries are executed just by looking at the source code. Gems that magically do authentication, authorization, validation, and many other things completely hide the database operations from us.

 

The only place to see the queries executed is the development log. And, honestly, how often do you look there? In a moderately complex application, the log is too verbose to parse visually, let alone to count queries.

 

So how do we make sure we don’t run too many queries? When I worked on Acunote, we ran into the same problem. Our solution was to write what we called a query test.

 

A query test is a test that executes the snippet of code, gathers the list of SQL queries from the log, and asserts that this list hasn’t changed from the previous test run. That’s like a performance test that measures the number of queries instead of the execution time.

 

Rails 3.0 and higher makes it easy to gather the list of queries.

We can simply subscribe to the sql.active_record hook of ActiveSupport:: Notifications.
blg8/app/test/integration/query_test.rb
def track_queries
  result = []
  ActiveSupport::Notifications.subscribe "sql.active_record" do |*args|
    event = ActiveSupport::Notifications::Event.new(*args)
    query_name = event.payload[:name]
    next if ['SCHEMA'].include?(query_name) # skip AR schema lookups
    result << query_name
  end
  yield
  ActiveSupport::Notifications.unsubscribe("sql.active_record")
  result
end

 

This function will execute a block and return the list of query names. Now how do we assert that the list stays the same? We can use assert_equal, but we’d have to modify the test by hand every time we make a legitimate change that results in an additional query.

 

The assert_value gem we wrote while working on Acunote is a better way to do that. It compares the expected value with the actual value, just like assert_equal. But when the actual value changes, it asks us to either confirm the change as legitimate or reject it as a test failure. If we confirm, it goes into the test file and updates the expected value.

 

This turned out to be useful not only for performance testing. We ended up using assert_value for the majority of our tests in Acunote, and I now use it in all my other projects.

So let’s use assert_value to write a query test.

First, let’s add my gem to the Gemfile:

gem 'assert_value', require: false

Second, update your bundle:

$ bundle install

 

Let’s take the preloading example from Preload Aggressively and then write a query test for it. But first, let’s modify it a bit. We’ll put the code into the controller action to imitate the real Rails application, and limit the number of rows to 10 for brevity.

 

Also, our controller action will execute either the unoptimized or optimized code depending on the params.

blg8/app/app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  def do_something
    if params[:preload]
      Thing.limit(10).includes(:minions).load
    else
      Thing.limit(10).each { |thing| thing.minions.load }
    end
    render nothing: true
  end
end
We’ll need a route to request that action, so we add the following line to our
config/routes.rb:
root 'application#do_something'
Now let’s put together everything we’ve talked about, and write our query test.
blg8/app/test/integration/query_test.rb
require 'test_helper'
require 'assert_value'

class QueryTest < ActionDispatch::IntegrationTest
  def test_loading_things
    ActiveRecord::Base.connection.execute <<-END
      insert into things(col0, col1, col2, col3, col4,
                         col5, col6, col7, col8, col9) (
        select
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x')
        from generate_series(1, 10000)
      );
    END
    queries = track_queries do
      get "/"
    end
    assert_value queries.join("\n")
  end

  private

  def track_queries
    result = []
    ActiveSupport::Notifications.subscribe "sql.active_record" do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      query_name = event.payload[:name]
      next if ['SCHEMA'].include?(query_name) # skip AR schema lookups
      result << query_name
    end
    yield
    ActiveSupport::Notifications.unsubscribe("sql.active_record")
    result
  end
end

 

Like our previous performance tests, this query test is an integration test, and for the same reason: we need to test the complete Rails stack to be sure there are no extra queries magically added by gems and plug-ins behind our back.

 

So, let’s run this test.

$ bundle exec ruby -I test test/integration/query_test.rb

The first time we run the test, it will collect the list of queries that are executed within the track_queries block, and ask us to confirm that the list is correct.

$ bundle exec ruby -I test test/integration/query_test.rb
# Running:
@@ -1,0, +1,11 @@
+Thing Load
+Minion Load
+Minion Load
.........
Accept the new value: yes to all, no to all, yes, no? [Y/N/y/n] (y):
Press y, and the test will update the assert_value call in the test/integration/query_test.rb file.
blg8/app/test/integration/query_test.rb
assert_value queries.join("\n"), <<-END
Thing Load
Minion Load
Minion Load
Minion Load
.........
END
As expected, this unoptimized code runs nine too many Minion Load queries.
Let’s see how the optimized version performs. First, change the code inside the track_queries block:
blg8/app/test/integration/query_test.rb
queries = track_queries do
  get "/", preload: true
end
Next, run the test again and accept the new value.
$ bundle exec ruby -I test test/integration/query_test.rb
# Running:
@@ -1,11, +1,2 @@
Thing Load
Minion Load
-Minion Load
-Minion Load
..................
Accept the new value: yes to all, no to all, yes, no? [Y/N/y/n] (y): y .
Finished in 21.327313s, 0.0469 runs/s, 0.0938 assertions/s.
1 runs, 2 assertions, 0 failures, 0 errors, 0 skips
The test will update the query assert to this:
blg8/app/test/integration/query_test.rb
assert_value queries.join("\n"), <<-END
Thing Load
Minion Load
END

Again as expected, the optimized version runs only two queries to load all our objects. After optimization, you can use this test as a reference. It will fail every time you introduce an additional query. At that point, you’ll either accept the change or reject it and fix your code.

 

Takeaways


Testing is the best way to maintain application performance after optimization.

1. Write performance tests—special kinds of integration tests that benchmark your code, keep results, and then assert performance by comparing current and previous benchmarks.

 

2. Make sure your performance tests get the measurements and comparisons right. Use the framework we wrote in this blog to create your own performance tests.

 

3. When performance testing Rails, don’t forget about database performance. Make sure you create enough data for performance tests, and check how many queries your requests run.

Congratulations! Now you know everything you need to know to optimize your Ruby code, measure the optimization results, and ensure that your optimizations persist.

 

But our quest for faster Ruby applications is not over yet. To optimize, we thought of our code as a white box that we can dissect and improve.

 

As you might guess, another approach is to think of it as a black box and optimize the way we run the code by speeding up its dependencies and the whole execution environment. So let’s do just that.

 

Our ability to focus, as software engineers, makes us vulnerable to tunnel vision syndrome. What that means when it comes to optimization is that when our Ruby program is slow, we tend to concentrate only on Ruby code optimization.

 

But there are other ways to make our Ruby program faster, often resulting in a greater improvement than the obvious approach of looking for ways to optimize that Ruby code.

 

To find these other ways we have to step out of the box and look at how our program runs in the real world, what other software it uses, and where it’s deployed. Our program will become faster if we find a better way to run it by optimizing all its dependencies and deployment infrastructure.

 

How exactly do we do that? I’ll show you a couple of examples here in this blog. But, unlike in the rest of my blog, I can neither give you a complete solution nor outline the steps to be taken.

 

There are simply too many ways to run the Ruby code, too many external tools it may use, and too many deployment platforms. So look at the material in this blog as a source of inspiration for your own thinking outside the box, not as a complete guide.

 

Cycle Long-Running Instances


Let’s first look at how our program runs, and decide what we can do to make it run faster.

 

Imagine you started a program. Let’s say it’s a web browser, and after some time it’s become slow. What do you do? It’s not a trick question. You know what you should do: restart it, and it’ll be fast again.

 

Can we apply the same principle to Ruby programs? It turns out we can: any long-running Ruby application will become faster after a restart.

 

Back in Profile Memory, we talked about how garbage collector performance declines when the amount of memory allocated by the Ruby process increases.

 

Spoiler alert: I’m going to get ahead of myself and add one more fact. In most cases, the Ruby process will never give the memory allocated for Ruby objects back to the operating system. Take a peek at the next blog if you’re curious about what happens, and why.

 

For now, let’s concentrate only on these two facts: (a) the performance declines with the increased memory usage, and (b) the amount of memory allocated by the Ruby process only grows with time.

 

What does this add up to? Slowdown. The longer our Ruby program runs, the slower it gets. No amount of code optimization can prevent this. Only a restart solves this slowdown!

 

So if you have a long-running Ruby instance, you’ll need to cycle it. And by cycling I mean restarting it when it uses too much memory. You can cycle Ruby applications in several ways:

 

Use a hosting platform that does it for you. For example, Heroku cycles its “dynos” daily and also aborts the process when it exceeds the dyno’s memory limit.

 

Use a process management tool, like monit, god, upstart, runit, or foreman with systemd.

Ask the operating system to kill your application if it exceeds the memory limit. Note that this still depends on a process management tool to restart the application after it gets killed.

 

If you deploy on Heroku, its daily cycling might work well for you. But when I worked on Acunote, we had our own deployment infrastructure and dedicated servers. So we had to use other techniques to combat excessive memory usage. I’ll show a few examples of how we did it.

 

Example: Configure Monit to Cycle Ruby Processes

With monit we can check the totalmem and loadavg variables and restart based on their values, like this:

check process my_ruby_process
  with pidfile /var/run/my_ruby_process/my_ruby_process.pid
  start program = "my_ruby_process start"
  stop program = "my_ruby_process stop"
  # eating up memory?
  if totalmem is greater than 300.0 MB for 3 cycles then restart
  # bad, bad, bad
  if loadavg(5min) greater than 10 for 8 cycles then restart
  # something is wrong, call the sys-admin
  if 20 restarts within 20 cycles then timeout

Example: Set Operating System Memory Limit

On Unix systems, we can use the setrlimit system call to enforce a process memory limit.

 

For example, on Linux and Mac OS X, set RLIMIT_AS:

# 600 MB RSS limit
Process.setrlimit(Process::RLIMIT_AS, 600 * 1024 * 1024)

 

Example: Cycle Unicorn Workers in the Rails Application

Rails applications running on the Unicorn web server can cycle themselves without any external process management tool. The idea is to set a memory limit for workers and let the master process restart them once they get killed.

 

For example, this is what I have in my config/unicorn.rb:

after_fork do |server, worker|
  worker.set_memory_limits
end

class Unicorn::Worker
  MEMORY_LIMIT = 600 #MB
  def set_memory_limits
    Process.setrlimit(Process::RLIMIT_AS, MEMORY_LIMIT * 1024 * 1024)
  end
end

 

That works for Unicorn 4.4, so you might need to change it to work with newer versions.

The only problem with this approach is that the operating system may kill your worker while serving the request. To avoid that, I also set up what I call the kind memory limit.

 

We can set this limit to a value lower than the RSS memory limit, and check for it after every request. Once the worker reaches the kind memory limit, it gracefully shuts itself down.

 

This way, in most cases workers quit before reaching the RSS memory limit enforced by the operating system. That becomes a safeguard only against long-running requests that grow too big in memory.

Here’s how I set up the kind limit with Unicorn:
class Unicorn::HttpServer
  KIND_MEMORY_LIMIT = 250 #MB

  alias process_client_orig process_client
  undef_method :process_client
  def process_client(client)
    process_client_orig(client)
    exit if get_memory_size(Process.pid) > KIND_MEMORY_LIMIT
  end

  def get_memory_size(pid)
    status = File.read("/proc/#{pid}/status")
    matches = status.match(/VmRSS:\s+(\d+)/)
    matches[1].to_i / 1024
  end
end

 

This example will work only for Linux and other Unixes because it gets the current process memory usage from the /proc filesystem. If you’d like to port it for Mac OS or Windows, you’ll have to rewrite the get_memory_size function.

 

There are, of course, many other ways to keep the Ruby process from growing in memory. I can’t describe all of them in this blog, but by now you should have a general idea. Whatever tool you use, make sure it restarts the long-running Ruby application before it grows too big in memory.

 

Fork to Run Heavy Jobs


Cycling long-running Ruby instances helps deal with sudden increases in memory consumption. But often we know beforehand that the code we’re about to execute will need a lot of memory.

 

For example, our database query returned 100,000 rows, and we need to compute complex statistics based on that data.

 

We can let that memory-heavy operation run and then let our infrastructure restart the Ruby process. But there’s a better solution. We can fork our process and execute the memory-heavy code in the child process. This way, only the child process grows in memory, and when it exits, the parent process remains unaffected.

 

The simplest possible implementation looks like this:

pid = fork do
  heavy_function
end
Process::waitpid(pid)

 

You might recognize this code from the performance_benchmark function in the previous blog. We used the same fork-and-run approach to isolate benchmarks from the parent process, and from themselves.

 

You might also recall the downside of this approach. Such code has no easy way of returning data to the parent process. If you want to do it, you’ll need to open a pipe between parent and child, use temporary storage, or store results into the database.

 

In the previous blog we already used the temporary storage to communicate between the forked process and its parent. So now let’s see how to send the data via the I/O pipe.

blg9/forked_process_io_pipe_example.rb
require 'bigdecimal'

def heavy_function
  # this allocates approx. 450,000 extra objects before returning the result
  Array.new(100000) { BigDecimal(rand(), 3) }.inject(0) { |sum, i| sum + i }
end

# disable GC to compute object allocation statistics
GC.disable
puts "Total Ruby objects before operation: #{ObjectSpace.count_objects[:TOTAL]}"

# open pipe, then close "read" end on child side,
# and "write" end on parent side
read, write = IO.pipe
pid = fork do
  # child may run GC as usual
  GC.enable
  read.close
  result = heavy_function
  # use Marshal.dump to save Ruby objects into the pipe
  Marshal.dump(result, write)
  exit!(0)
end
write.close
result = read.read
# make sure we wait until the child finishes
Process.wait(pid)
# use Marshal.load to load Ruby objects from the pipe
puts Marshal.load(result).inspect
# this number should be not too different from the previous one
puts "Total Ruby objects after operation: #{ObjectSpace.count_objects[:TOTAL]}"
When we run the code, we see that despite the child allocating 400,000–450,000 objects, the parent process doesn’t grow at all.
$ ruby forked_process_io_pipe_example.rb
Total Ruby objects before operation: 30163
#<BigDecimal:7f99b3a612e8,'0.5016076916 4137E5',18(27)>
Total Ruby objects after operation: 30163

 

This technique is very useful for long-running Ruby applications that occasionally have to perform memory-heavy operations. But for Rails, there are usually better solutions.

 

Most modern deployments support the idea of background jobs. For example, the delayed_job gem essentially implements the same idea. It lets you delay any function call by serializing the method call and its data into the database, and then executing the code in a separate, short-lived process (usually launched by a rake task).
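For illustration, here's roughly what that looks like with delayed_job's delay proxy; HeavyReport and its generate method are hypothetical stand-ins for your own code:

# assuming the delayed_job gem is in your Gemfile and its jobs table exists
report = HeavyReport.new
# the call is serialized into the database and executed later
# by a separate worker process, e.g. one started with `rake jobs:work`
report.delay.generate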

 

There are many other background job implementations that do the same thing. You can use any of them.

 

But beware of the ones that use threads instead of separate processes. A notable example is Sidekiq, which usually runs as a single Ruby process with several dozen Ruby threads.

 

All those threads share the same ObjectSpace, so when one thread grows, the whole process needs a restart. Make sure you use one of the process management tools we talked about earlier to monitor and restart the Sidekiq worker.
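For example, if monit happens to be one of those tools in your stack, a rule along these lines restarts Sidekiq once it grows past a threshold. Treat this as a sketch: the pidfile path, start/stop commands, and the 512 MB limit are assumptions to adjust for your deployment.

check process sidekiq with pidfile /var/run/sidekiq.pid
  start program = "/etc/init.d/sidekiq start"
  stop program = "/etc/init.d/sidekiq stop"
  # restart once resident memory stays above the limit for two checks
  if totalmem > 512 MB for 2 cycles then restart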

 

Both cycling and forking keep the Ruby process under a certain memory limit so that GC has less work to do and takes less time to complete. It’s GC time that we’re really optimizing here.

 

Do Out-of-Band Garbage Collection

Despite all our optimization efforts, GC will continue to take a substantial part of execution time. So what do we do if we can’t reduce GC time? We force GC when our application isn’t doing anything.

 

That approach is called Out-of-Band Garbage Collection (OOBGC). Idle time is something that we usually observe in web applications and services. So let me show you how to configure OOBGC for the most popular Ruby web servers.

 

Example: OOBGC with Unicorn

Unicorn has direct support for OOBGC.
If you use Ruby 2.1 or later, add the gctools gem to your Gemfile and put this into your config.ru for Unicorn:
require 'gctools/oobgc'
use(GC::OOB::UnicornMiddleware)

 

When does the OOBGC happen? You might guess it should run after every N requests. But that would add unnecessary load to your server. Not all requests are the same, and some of them might leave little to no garbage.

 

The gctools library does it in a better way. Ruby 2.1 exposes enough information about its internal state for gctools to decide when a collection is actually required.

 

However, if you're still on Ruby 2.0 or lower, the only OOBGC strategy is to force GC after every N requests. Unicorn has built-in middleware for that:

require 'unicorn/oob_gc'

use(Unicorn::OobGC, 1)

 

Here the second parameter to the use() call is the frequency of OOBGC: 1 means after every request, 2 means after every two requests, and so on.

 

Example: OOBGC in Other Web Servers or Applications

We can use the gctools library to do OOBGC in any code. Once we determine when our code has idle time, we can call this:

require 'gctools/oobgc'

GC::OOB.run

 

There are two things to keep in mind when implementing OOBGC on your own.

First, make sure that you get your OOBGC timing right. For web applications, it’s after the request body is flushed into the stream. For background workers, it’s after finishing the task and before pulling the next task from the queue.
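As a concrete illustration, a hand-rolled background worker might look like the following sketch. The Queue here stands in for your real job queue, and process is a hypothetical task handler:

require 'gctools/oobgc'

queue = Queue.new  # stands in for your real job queue
loop do
  task = queue.pop  # block until the next task arrives
  process(task)     # hypothetical task handler
  GC::OOB.run       # the worker is idle now: collect garbage out of band
end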

 

Second, be careful if you use threads. If you force GC from one thread while another is still doing its job, you will block that other thread. So make sure all threads in the process are idle before calling OOBGC.
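There's no bulletproof way to verify this from pure Ruby, but a rough approximation is to check Thread#status before forcing the collection. This is a sketch of the idea, not a complete solution:

def oobgc_if_idle
  # a status of "run" means a thread is currently executing Ruby code
  busy = Thread.list.any? { |t| t != Thread.current && t.status == "run" }
  GC::OOB.run unless busy
end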

 

That’s why, for example, the Puma web server doesn’t support OOBGC. Unlike Unicorn, Puma’s workers can be multithreaded, and there’s no single point in time when you can safely perform OOBGC.

 

Tune Your Database

Your database server can be either a liability or an asset to performance. If you use one of the modern SQL databases, don't treat it as a dumb storage mechanism. We saw in Offload Work to the Database how rewriting Ruby code into SQL can give you one or more orders of magnitude of improvement.
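As a quick reminder of what that looks like in practice, compare summing in Ruby with letting the database aggregate. Order and its amount column are hypothetical ActiveRecord examples:

total = Order.all.map(&:amount).inject(0, :+)  # loads every row into Ruby
total = Order.sum(:amount)                     # a single SQL SUM() query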

 

But it's equally important to have your database server tuned for optimal performance. You'll want to do this because the default settings are usually inadequate, especially for PostgreSQL.

 

Example: Tune Up PostgreSQL for Optimal Performance

This example is relevant only if you have to set up the database server on your own. If you host on Heroku, you can expect your PostgreSQL to be configured reasonably well. The same might be true for other hosting solutions.

 

PostgreSQL has a plethora of configuration options, so instead of diving into details, we’ll talk at a high level about what we need to configure.

  • Let the database use as much memory as possible. Ideally, all your data should fit into RAM for faster access.
  • Make sure the database has enough memory to store intermediate results, especially for sorts and aggregations.
  • Set the database to log slow queries and preserve as much information about them as possible to reproduce the problem.

 

Let me show you the PostgreSQL configuration snippet that implements these goals. This is an extract from the config we used for Acunote. It’s not a complete config, so you should review these settings, read comments, and merge with your own config as necessary.

blg9/postgresql.conf
# For all memory settings below, RAM_FOR_DATABASE is the amount of memory
# available to PostgreSQL after the operating system and all other
# services are started.
#
# Evaluate the Ruby pseudocode in angle brackets and replace
# it with actual values.

# Use as Much Memory as Possible

# How much memory we have for database disk caches in memory.
# Note: disk caching is controlled by the operating system,
# so this setting is just a guideline.
# Recommended setting: RAM_FOR_DATABASE * 3/4
effective_cache_size = <ram_for_database.to_i * 3/4>MB

# Shared memory that PostgreSQL itself allocates for data caching in RAM.
# Recommended setting: RAM_FOR_DATABASE/4
# Warning: on Linux make sure to increase the SHMMAX kernel setting.
shared_buffers = <ram_for_database.to_i / 4>MB

# Allocate Enough Memory for Intermediate Results

# Work memory for queries (to store sort results, aggregates, etc.).
# This is a per-connection setting, so it depends on the expected
# maximum number of active connections.
# Recommended setting: (RAM_FOR_DATABASE/max_connections) rounded down to 2^x
work_mem = <2**(Math.log(ram_for_database.to_i /
  expected_max_active_connections.to_i)/Math.log(2)).floor>MB

# Memory for vacuum, autovacuum, and index creation.
# Recommended setting: RAM_FOR_DATABASE/16 rounded down to 2^x
maintenance_work_mem = <2**(Math.log(ram_for_database.to_i / 16)/Math.log(2)).floor>MB

# Log Slow Queries

# Log only autovacuum runs longer than 1 second (value in milliseconds).
log_autovacuum_min_duration = 1000
# Log long queries (value in milliseconds).
log_min_duration_statement = 1000
# Explain long queries in the log using the auto_explain plug-in.
shared_preload_libraries = 'auto_explain'
custom_variable_classes = 'auto_explain'
auto_explain.log_min_duration = 1000
# But do not use explain analyze, which may be slow.
auto_explain.log_analyze = off

You might have noticed that this configuration mostly optimizes PostgreSQL memory usage. Yes, it’s memory again. We spent the greater part of this blog talking about Ruby memory optimization, and now it’s our database that also needs memory tuning.

 

That’s not a coincidence. Modern software is rarely limited by CPU. The most severe limitation is the amount of available memory, followed by network latency and throughput, and disk I/O. That’s why, no matter what database you use, you need to make sure it has and uses as much memory as possible.

 

Buy Enough Resources for Production

A large number of Ruby applications run in the cloud today. There are many providers of deployment infrastructure, but the better ones tend to be expensive. So you often have to find the optimal compromise between the price you pay and the resource limits you get for that price.

 

Hosting providers usually emphasize the number of CPU cores and the size of the storage in their pricing plans. Both numbers are largely irrelevant: CPU performance is usually not a problem, and storage can easily be added. Here are what I believe are the most important criteria when evaluating a potential deployment stack:

 

1. Total RAM available

After reading this blog, it should be no surprise that memory comes first on my list.

 

2. I/O performance.

This is the most overlooked parameter, and it is often hard to evaluate without deploying at least a test application. It doesn't matter for some applications, but if you write logs or cache to disk, pay attention to it.

 

3. Database configuration.

If you do not set up the database server yourself, make sure your provider follows the best practices that we talked about earlier.

 

4. Everything else.

Don’t be afraid to pay for more memory. That is often cheaper than paying for extra servers (virtual machines). For example, as of this writing, Heroku, one of the most popular Rails deployment solutions, offers two kinds of dynos (virtual machines): 1X and 2X.

 

The first has 512 MB of RAM, and the second has 1024 MB. 2X is about 2–2.5 times more expensive.

 

On the 1X dyno you can run only one Rails instance, because the average instance size is about 250–300 MB. On the 2X dyno you can run three instances, for example three Unicorn workers. That more than justifies the increase in cost.

 

Also, when your Rails application misbehaves and grows in memory, you'll at least have some extra memory on a 2X dyno to use before you cycle the offending process. In my experience that makes all the difference.

 

With a 1X dyno your application can simply stop responding, whereas a 2X dyno will only slow down for a short time.

 

I/O has bitten me several times in the past. At Acunote, we used to deploy on Engine Yard before we bought our own hardware. Their virtual machines at the time did not have their own storage; instead they used a network filesystem (Red Hat GFS).

 

GFS seemed to work really well for us until I found that certain cache expiration calls took too long to execute.

 

We cached to disk and expired entries by traversing the whole cache directory and matching file paths against the expiration regular expression.

 

It turned out that GFS had a slow fstat() implementation, so traversing a large cache directory could take several seconds. We had to change our caching strategy to limit the number of directories in the expiration search path.
