(Insightful) Ramblings by batasrki

Riak Search as a Means of Querying


Introduction

Up tonight, I’ll explore enhancing the querying in my app with Riak Search. Basho has recently added a list-keys warning to the Ruby driver, and all of my current MapReduce code is triggering it. Riak Search should help by letting me narrow the set of keys in the events bucket to the ones closer to what I’m looking for. Since that won’t traverse the entire bucket, I’ll avoid triggering the warning.

Background

To give you some background, buckets in Riak are just namespaces for keys and their values, plus some metadata about each key/value pair. As such, writing a MapReduce query that selects only the bucket, and not some range of keys within that bucket, will cause Riak to traverse all of the keys on all of the nodes in the system. Needless to say, that operation will get extremely slow as the number of keys grows. There is a bit more explanation on the recently added wiki page for the Riak Ruby client.

What not to do

As a quick example of what not to do for a production app, but something that you should play with in development, here’s a sample MapReduce query that tries to pull out events for a certain month:

# month is a number from 1-12, validated elsewhere
# self.client is an instance of Riak::Client which hooks you up to the database
# the parameter to add is the bucket name
def events_for(month)
  job = Riak::MapReduce.new(self.client).add("events")
  job.map("function(value, keydata, arg){ var item = Riak.mapValuesJson(value)[0]; var mo = item.event_date.split('-')[1]; if(mo == #{month}){ return [item]; } else { return []; } }", :keep => true)
  job.run
end

The Riak::MapReduce object is initialized with a client object that represents the database connection, whether it’s a cluster or a single node. Its add method takes a few parameters, but initially all I understood it to need was a bucket name. The other two parameters concern themselves with keys and key filters, the latter being out of scope for this blog post.
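For reference, here’s a quick sketch of the two forms of add I’m aware of; the key name is a made-up placeholder, and I’m leaving key filters out as promised.

the add variants (sketch)
job = Riak::MapReduce.new(self.client)

# Whole-bucket input: the form used above, and the one that triggers
# the full key traversal (and the list-keys warning).
job.add("events")

# Bucket/key pair: feeds a single object into the map phase instead.
# "some-key" is hypothetical, just for illustration.
job.add("events", "some-key")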

The map function takes a string of Javascript which is run on each node that has the data. This keeps the querying of the data very close to where the data lives, so I don’t have to worry about node availability. There are built-in Javascript functions (and I’m using one here), as well as a set of community-contributed functions that one can use.

It’s in the Javascript function that I do the filtering by month. In another post, I’ll explore pushing functions down to each node instead of mixing a lot of Javascript into my Ruby code. While this looked like a straightforward way to introduce dynamic querying to a key/value store (and I felt smug about it), it has the wonderful property of traversing every single key looking for those namespaced to the events bucket and then shoving them one by one into my Javascript function. The function finds the documents whose date attribute has the month I want and returns a collection of those as items in a hash to Ruby. It’s now quite clearly the wrong way to go about querying in Riak. Let’s see if I can make it better.

Riak Search

Out of the two enhancements to the querying capabilities that Basho has added to Riak, Riak Search is the older one. I am starting to appreciate having a full-text search engine tightly integrated into a database, the way Riak, as well as PostgreSQL, has done. Having started my Rails career with MySQL-backed apps and futzed around with various search solutions, I appreciate that this is another thing I don’t have to worry about.

Enabling Riak Search

First thing I need to do is enable the full-text search on each node in the cluster. This is something that is easily automated, so it’s going on the growing list of things to automate. Right now, I’ll do it manually.

After getting the cluster up and running, I’ll edit each node’s app.config to enable the searching. Changing the value of enabled from false to true will get it going.

app.config
%% Riak Search Config
{riak_search, [
              %% To enable Search functionality set this 'true'.
              {enabled, true}
              ]},

Indexing data

Both the wiki and Mathias Meyer’s great Riak Handbook provide syntax for enabling the pre-commit hook for the search engine. The hook will index new data before it’s committed to the database.

Note I said new data. As of right now, the pre-commit hook will not index existing records. I do feel that this is a bit of a weakness and something to be corrected, though. Anyway, cribbing from the wiki, the syntax to enable the pre-commit hook is:

search-cmd install events
 :: Installing Riak Search <--> KV hook on bucket 'events'.

With that done and with the database primed with a few records, let’s see if we can actually search the events.

Searching

OK, so, going back to the original MapReduce function, what I wanted were events whose date was in March. The search syntax for that is the following:

Event model in irb
Event.client.search("events", "event_date:*03*")

{"responseHeader"=>{"status"=>0, "QTime"=>16, "params"=>{"q"=>"event_date:*03*", "q.op"=>"or", "filter"=>"", "wt"=>"json"}}, "response"=>{"numFound"=>0, "start"=>0, "maxScore"=>"0.0", "docs"=>[]}}

As I’m experimenting with search, I don’t have an elegant abstraction around the lower-level driver syntax just yet. That’s a TODO for another night. What I find strange is that I know there are events in March, but this search hasn’t found any. That’s either because the documents weren’t indexed or I did something wrong.

Let’s try with the year and month instead of just month. That’s closer to how it’ll be used anyway:

Event model in irb, part 2
Event.client.search("events", "event_date:2012-03*")

{"responseHeader"=>{"status"=>0, "QTime"=>22, "params"=>{"q"=>"event_date:2012-03*", "q.op"=>"or", "filter"=>"", "wt"=>"json"}}, "response"=>{"numFound"=>25, "start"=>0, "maxScore"=>"0.00000e+0", "docs"=>[{"id"=>"1iCUQIvt4Zz7cccLFjDPAr4pLFf", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-16T00:00:00+00:00", "location"=>"Vaughan"}, "props"=>{}}, {"id"=>"1uMMj5EBEJF26gGzM6ORvJIxBym", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-31T00:00:00+00:00", "location"=>"Toronto"}, "props"=>{}}, {"id"=>"3eL8Zsq7uhXMcNVkiT2EqMOmjIx", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-11T00:00:00+00:00", "location"=>"Maple"}, "props"=>{}}, {"id"=>"63uNQcQqunUIWuxFaUyFJmJvquf", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-31T00:00:00+00:00", "location"=>"Vaughan"}, "props"=>{}}, {"id"=>"891nojSWgC86mc9a3sgBzogEhAB", "index"=>"events", "fields"=>{"category"=>"business", "event_date"=>"2012-03-09T00:00:00+00:00", "location"=>""}, "props"=>{}}, {"id"=>"8WFTYibfz46VNjzJeBYHpcvAKOU", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-28T00:00:00+00:00", "location"=>"Vaughan"}, "props"=>{}}, {"id"=>"8XJZR8K6sXYylBz4Kq1xRwsiddd", "index"=>"events", "fields"=>{"category"=>"business", "event_date"=>"2012-03-13T00:00:00+00:00", "location"=>""}, "props"=>{}}, {"id"=>"8yfHBxxhSM12dM9u7RsCjInHehV", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-16T00:00:00+00:00", "location"=>"Vaughan"}, "props"=>{}}, {"id"=>"9dAuFCdh32OKIoIMgGWtt1kE4gZ", "index"=>"events", "fields"=>{"category"=>"business", "event_date"=>"2012-03-23T00:00:00+00:00", "location"=>"Toronto"}, "props"=>{}}, {"id"=>"Alj2Iwqg0D3dCwAENMjDbeggM2t", "index"=>"events", "fields"=>{"category"=>"personal", "event_date"=>"2012-03-07T00:00:00+00:00", "location"=>"Woodbridge"}, "props"=>{}}]}}

Ah, nice. So, it’s interesting that having wildcards on both sides of the search term doesn’t work, yet adding a year and a trailing wildcard does. That’s something to explore further later, but let’s analyze the results.

The Hash structure returned from the search engine has a few interesting fields, but what I’m interested in is data retrieval. The docs key points to an array of hashes, each hash being a returned document. I can easily parse that out, throw away the fields I don’t care about and present the clean hash as attributes to my upstream object. Now that the low level is working nicely, that part will be easy.
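As a rough sketch of that parsing (not the final abstraction, and the merged "key" field is just my own naming), it could look something like this:

parsing the search response (sketch)
results = Event.client.search("events", "event_date:2012-03*")
# Each entry under "docs" is one matching document; "fields" holds the
# indexed attributes and "id" is the Riak key.
attributes = results["response"]["docs"].map do |doc|
  doc["fields"].merge("key" => doc["id"])
end
# => [{"category"=>"personal", "event_date"=>"2012-03-16T00:00:00+00:00",
#      "location"=>"Vaughan", "key"=>"1iCUQIvt4Zz7cccLFjDPAr4pLFf"}, ...]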

It looked like I might even be able to drop MapReduce. Well, it did, until I came across these tidbits:

Currently, the wildcard must come at the end of the term in both cases. (Basho wiki.basho.com/…)

It needs a minimum of three characters though,… (Mathias Meyer, The Riak Handbook)

Well, that explains why the first search failed. OK, so the search is not yet as full-featured as I’d like, but every piece of documentation I read suggests that it can be integrated with MapReduce. Surely with MapReduce’s power, I’ll be able to get some nice querying going.

Integration with MapReduce

Since this is backing a web application, I need to make it easy for users to do ad hoc searching of their events. As we’ve seen, search can get me partly there, so let’s integrate with the MapReduce object to add a bit more power. What I want is to grab all events in March that have no location. Although Sean Cribbs has started work on documentation for the Riak Ruby client, it’s still early enough that the MapReduce and Search documentation isn’t there yet. As such, let’s refer to the search spec.

code I'm guessing will work
job = Riak::MapReduce.new(Event.client)
job.search("events", "event_date:2012-03*")
job.map("function(value, keydata, arg){ var data = Riak.mapValuesJson(value)[0]; if(data.location === \"\") {return [data];} else{ return [];} }", :keep => true)
job.run

#returns
[{"category"=>"business", "event_date"=>"2012-03-13T00:00:00+00:00", "location"=>""}, {"category"=>"business", "event_date"=>"2012-03-09T00:00:00+00:00", "location"=>""}]

Nice, I guessed correctly. I wasn’t completely sure, but I got it on the first try, so that’s good. The best news is that the list-keys warning was not triggered. Exactly what I wanted!

Caveats

The biggest caveat that I can immediately perceive is the inability of the search engine to index the existing set of documents. There are ways around that, of course. If the dataset is small enough, it can be reimported after the pre-commit hook is set up. That’s what I did here. There is also a way to index the data through the client. I may explore that later, but for now, you can refer to the spec file linked above.
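For a small development dataset, the reimport can be as simple as re-saving every object so the hook gets a chance to index it. This is roughly what I did; note that it lists keys, so it’s only sensible on a toy-sized bucket:

reindexing a small bucket (sketch)
bucket = Event.client.bucket("events")
# Listing keys is exactly the expensive operation this post is trying to avoid,
# which is why this is only acceptable on a tiny development dataset.
bucket.keys.each do |key|
  object = bucket.get(key)
  object.store  # re-saving triggers the search pre-commit hook
end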

Another, fairly obvious caveat is that inserting records with search enabled is going to be slower and more taxing on each node than it would be without it. Mathias suggests benchmarking before and after, and I think that’s a worthwhile suggestion. For myself, I know that the app I’m writing is a lot more read-oriented than write-oriented, so I’m willing to take that hit. I’ll still benchmark it, because it’s good practice, but I need this type of querying power.

Conclusion

Overall, I like Riak Search. It is not hard to pick up, and it’s not hard to modify existing code to leverage it. I would like it if indexing were a bit more comprehensive, but that’s not a showstopper for me.

As far as putting additional pressure on each node due to the overhead of indexing goes, it’s manageable for me right now. However, Basho has recently introduced secondary indexes to Riak. Since adding a secondary index to a document is an application-level concern rather than a database-level one, it looks like a lighter load on the nodes themselves, while introducing a bit of complexity to the application. I’ll explore that next.

Riak and Vagrant


tl; dr;

Creating a Riak cluster as a set of nodes through Vagrant and talking to the cluster through the official Ruby driver turned out to be harder than expected. Cryptic errors in Riak itself and a lack of documentation for hooking the driver up to the cluster were the two root causes. The final, working versions of the Vagrantfile and the Riak config files are below.

Vagrantfile
Vagrant::Config.run do |config|
  config.vm.define :node1 do |node|
    node.vm.box = "ubuntu-1110-server-amd64"
    node.vm.network :hostonly, "192.168.33.10"
    node.vm.forward_port 8098, 8091
    node.vm.forward_port 8087, 8081
  end

  config.vm.define :node2 do |node|
    node.vm.box = "ubuntu-1110-server-amd64"
    node.vm.network :hostonly, "192.168.33.11"
    node.vm.forward_port 8098, 8092
    node.vm.forward_port 8087, 8082
  end

  config.vm.define :node3 do |node|
    node.vm.box = "ubuntu-1110-server-amd64"
    node.vm.network :hostonly, "192.168.33.12"
    node.vm.forward_port 8098, 8093
    node.vm.forward_port 8087, 8083
  end
end
vm.args
## Name of the riak node
## You need to change the values on both sides of the @ sign
-name node1@192.168.33.10

Introduction

A few nights ago, I embarked on getting a proper Riak cluster going on my big dev machine. Up to this point, I’ve been developing against a single Riak instance. Since Riak is meant to be used in a cluster and its behaviour can differ when multiple nodes are active, I decided to upgrade to a cluster. I could have just used the devrel builds of Riak, by downloading the latest version of Riak’s source, issuing the make all and make devrel commands and generally following these steps. However, since I do plan on eventually shipping my Riak-powered project, what I really wanted to do was build a true cluster of Ubuntu servers, each running the 64-bit binary package of Riak.

Existing information

After I’d done too many of the steps below, I finally entered ‘riak vagrant’ as a search term in Google. I ended up on a post on the Basho blog that details how to use Vagrant and Chef to automatically provision nodes for the cluster. After much head-smacking, I realized that I actually preferred to step through the build manually, lest the automatic provisioning obscure a part of the process. I do realize that others may want to expedite said process, which is why I’m linking to the post. Onwards, then.

Cluster build options

There are several ways for me to get real instances running, including EC2 and Rackspace. I didn’t want, or really need, to figure out right now how to set up the instances I want on those two, much less try to get Riak installed and running on them. There is a third option, and as you might have guessed from the title of this post, it involves Vagrant.

Vagrant is a wonderful, VirtualBox-powered tool that lets one locally provision instances of operating systems and interact with them as if they were real server boxes. After I downloaded and installed Vagrant, I went in search of a good-enough box that would serve as the basis of the nodes in my cluster. After a quick visit to Vagrantbox.es, I downloaded the Ubuntu 11.10 64-bit box.

The next steps added the box to my setup:

vagrant box add (fully qualified box name)
vagrant init ubuntu-1110-server-amd64

With the box added, I scoured Vagrant’s excellent documentation for ways to get multiple instances of the same box up and running. I first tried making multiple subdirectories and either copying the Vagrantfile into each and making appropriate modifications or symlinking the original Vagrantfile to each subdirectory. While the first approach worked, it clearly wouldn’t scale as I added more nodes to the cluster. The latter approach, not surprisingly, failed spectacularly.

I then ran across the Multi-VM Environments page in the documentation. This is exactly what I wanted, and I quickly removed the subdirectories and began building my cluster in the parent directory.

Setting up Riak on the nodes

Now that I have 3 nodes up and running, it’s time to get Riak installed and running. After SSH-ing into each instance in turn, I download the above binary distribution using curl:

# same for node2 and node3
vagrant ssh node1

curl -O http://downloads.basho.com.s3-website-us-east-1.amazonaws.com/riak/1.1/1.1.2/riak_1.1.2-1_amd64.deb

This downloads version 1.1.2 of Riak, which is currently the latest stable version. Change the URL as needed. The install initially fails, because the installed libssl is version 1.0 and Riak wants 0.9.8. A sudo apt-get install command later, the install succeeds and we’re up and running. The nice thing about the binary install is that it sets up Riak to start on boot, which means I don’t have to muck around with that.

Setting up the Riak cluster

Having Riak installed on all the VMs, it’s time to make them into a cluster. The documentation on how to connect nodes into a cluster is sparse and scattered across multiple places. Currently, the best place to look for Riak documentation is the Riak Handbook. It’s a self-published e-book, but it does aggregate a lot of information about Riak in one place, and it’s much easier to search than Basho’s wiki and the Riak mailing list, at least right now.

So, as per the handbook, all one needs to do is change the 127.0.0.1 to the current machine’s IP address in the vm.args file.

vm.args
## From
-name riak@127.0.0.1

## To
-name riak@192.168.33.11

Restarting the node fails with an error, though.

riak restart
# Node 'node1@192.168.33.11' not responding to pings.

After referring to the Vagrant/Chef example above, I notice that the ports in its app.config are bound to 0.0.0.0, i.e. all interfaces. Changing my config to match finally lets me use the following command to add the current node to a cluster:

riak admin
riak-admin join riak@192.168.33.10

After I set up each node with its own static IP, I changed the vm.args file as above and executed the command. The riak-admin status command should show nodes connected like this:

riak-admin status | grep connected
# connected_nodes: [192.168.33.10]

Everything seems connected and working, so we should be good.

Talking to Riak nodes

Now that each node has Riak installed and running, I want to make sure that I can talk to each node from the host machine. The Vagrant documentation specifies that I can make a few types of networks for my nodes:

  1. hostonly, meaning no external access, all traffic is between the nodes and the host machine
  2. bridged, meaning each node shows up as a physical interface, presumably letting external traffic access it

I chose the hostonly option for a few reasons, chief being that this is an experiment. Also, I think it’s beneficial from a system design perspective to not expose my database nodes to the vagaries of the internet. I could be wrong, but I’ve heard something to that effect.

Anyhow, I would also like to query these instances for data in Riak, so I have to forward traffic from my host machine to each node’s Riak port, which by default is 8098. Actually, from what I’ve gathered from reading through the Chef blog post linked above, I have to set up two forwards. Port 8098 is the HTTP traffic port to Riak; Riak also exposes a Protocol Buffers port, 8087. In Vagrant, it’s done like so:

Vagrantfile
node.vm.forward_port 8098, 8091
node.vm.forward_port 8087, 8081

This needs to be done for each node in the cluster, where the second number in each line is the port on the host machine. I found that syntax slightly unintuitive and subject to multiple documentation lookups, but I guess I’ll get used to it eventually.
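Since the Vagrantfile is plain Ruby, the per-node repetition can also be collapsed into a loop. This is just a sketch that should be equivalent to the Vagrantfile at the top of this post:

Vagrantfile, loop version (sketch)
Vagrant::Config.run do |config|
  (1..3).each do |i|
    config.vm.define "node#{i}".to_sym do |node|
      node.vm.box = "ubuntu-1110-server-amd64"
      node.vm.network :hostonly, "192.168.33.#{9 + i}"  # .10, .11, .12
      node.vm.forward_port 8098, 8090 + i               # HTTP on host ports 8091-8093
      node.vm.forward_port 8087, 8080 + i               # Protocol Buffers on 8081-8083
    end
  end
end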

A quick check through curl reveals that each Riak instance responds to the basic curl command.

curl
curl -i http://localhost:8091

HTTP/1.1 200 OK
Vary: Accept
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Link: </buckets>; rel="riak_kv_wm_buckets",</riak>; rel="riak_kv_wm_buckets",</buckets>; rel="riak_kv_wm_index",</buckets>; rel="riak_kv_wm_keylist",</buckets>; rel="riak_kv_wm_link_walker",</riak>; rel="riak_kv_wm_link_walker",</mapred>; rel="riak_kv_wm_mapred",</buckets>; rel="riak_kv_wm_object",</riak>; rel="riak_kv_wm_object",</ping>; rel="riak_kv_wm_ping",</buckets>; rel="riak_kv_wm_props",</stats>; rel="riak_kv_wm_stats"
Date: Mon, 23 Apr 2012 01:27:57 GMT
Content-Type: text/html
Content-Length: 616

<html><body><ul><li><a href="/buckets">riak_kv_wm_buckets</a></li><li><a href="/riak">riak_kv_wm_buckets</a></li><li><a href="/buckets">riak_kv_wm_index</a></li><li><a href="/buckets">riak_kv_wm_keylist</a></li><li><a href="/buckets">riak_kv_wm_link_walker</a></li><li><a href="/riak">riak_kv_wm_link_walker</a></li><li><a href="/mapred">riak_kv_wm_mapred</a></li><li><a href="/buckets">riak_kv_wm_object</a></li><li><a href="/riak">riak_kv_wm_object</a></li><li><a href="/ping">riak_kv_wm_ping</a></li><li><a href="/buckets">riak_kv_wm_props</a></li><li><a href="/stats">riak_kv_wm_stats</a></li></ul></body></html>%

It’s now time to hook up the Riak Ruby client. It’s at this point that all documentation runs out and I end up reading the specs, source code and such. I lie, there is documentation on how to add nodes to the client’s initialization:

riak-ruby-client
# Automatically balance between multiple nodes
client = Riak::Client.new(:nodes => [
  {:host => '10.0.0.1'},
  {:host => '10.0.0.2', :pb_port => 1234},
  {:host => '10.0.0.3', :http_port => 5678}
])

The trouble here is that I went through the pain of setting up a YAML file with the database configuration and I needed to port the above to YAML syntax. Reading through the YAML documentation, I find out the syntax needed for representing the array in YAML:

riak.yml
development: &base
  http_backend: :Excon
  nodes:
    -
      host: 'localhost'
      http_port: 8091
    -
      host: 'localhost'
      http_port: 8092
    -
      host: 'localhost'
      http_port: 8093

I had to modify my homegrown code that turns string keys in a hash into symbols, since the Riak client expects symbols as keys, not strings. This has caused me problems a few times, but I don’t know if the client should be patched to accept either. With all that done, I expected to have full access to all nodes, reading from them and writing to them.
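For completeness, here’s the gist of that symbolizing step plus the client setup; the helper name and exact implementation are mine, not part of the Riak client:

symbolizing the YAML config (sketch)
require 'yaml'
require 'riak'

# Recursively turn string keys into symbols, since Riak::Client.new
# expects symbol keys in its options hash.
def symbolize_keys(value)
  case value
  when Hash  then value.each_with_object({}) { |(k, v), h| h[k.to_sym] = symbolize_keys(v) }
  when Array then value.map { |v| symbolize_keys(v) }
  else value
  end
end

config = YAML.load_file("config/riak.yml")["development"]
client = Riak::Client.new(symbolize_keys(config))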

Um, no

Instead, the error I got through the client and through curl is {insufficient_vnodes, 0, expected, 3}. I had set up read/write values for Riak in order to control how data is replicated during writes, as well as how it’s read from the cluster.

This terse message hints at the issue at hand. Even though I had set up the cluster as it is detailed in all of the sources I found, the nodes did not distribute the ring of hashed keys that point to data between themselves.

The way to see what’s really going on in the cluster is to make sure that the connected_nodes information in the status correlates with the ring_members information a few rows lower. In my case, as you can see below, the two rows did not reflect the same information. The nodes seem to see and connect to each other, but they have not shared the ring of data between them and are not in a cluster.

connected_nodes : [riak@192.168.33.11, riak@192.168.33.12]
---snip 10 rows
ring_members : ['riak@127.0.0.1']

Another tipoff is that the ring_num_partitions number of the node you’re querying is the full size of the ring you’ve set up, 64 by default. When the node is in a cluster, it claims only a fraction of that number, roughly the size of the ring divided by the number of nodes in the cluster (with the default 64 partitions and three nodes, that’s about 21 or 22 partitions each).

Changing the information in vm.args seemed to help.

vm.args
% Doesn't work
-name riak@192.168.33.10

% Works
-name node1@192.168.33.10

Restarting the nodes and querying each node’s ring_status and member_status shows me conflicting information. The members are picked up as expected, but the ring status still seems to refer to riak@127.0.0.1, which it shouldn’t. Searching through the app.config file, I find out where the ring information for each node is stored on the filesystem. In my case, it’s at /var/lib/riak/ring. Stopping the node, clearing out the files in this directory and starting the node finally, mercifully, shows me the correct information on all nodes.

Finally, it works

Double-checking that I can query a node through curl and through Ruby confirms that the system now works. As expected, a 404 is returned for a non-existent key, and a 204 response for a successful POST and DELETE. I can now continue writing code against this system in the same way as I had done against a single node.
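The Ruby side of that double-check looks roughly like this; the bucket and key names are made up for illustration, and the exact exception raised for a 404 may vary by client version:

smoke test in irb (sketch)
bucket = client.bucket("events")

obj = bucket.new("smoke-test")
obj.content_type = "application/json"
obj.data = { "hello" => "riak" }
obj.store                        # succeeds across the cluster (204 over HTTP)

bucket.get("smoke-test").data    # => {"hello"=>"riak"}
bucket.delete("smoke-test")      # 204

bucket.get("no-such-key")        # raises an error wrapping the 404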

Conclusion

Mark Phillips, Basho’s Director of Community, recently asked on the Riak mailing list how adoption of Riak can be improved. After going through the above, I have to say that documentation should be priority #1. I do realize that Riak is still in fast-development mode. Features are flying in every day and the core developers are continuously improving stability and user-friendliness.

Having said that, the wiki on Basho’s website either doesn’t have all of the necessary information or it doesn’t have it organized well. It’s nearly impossible to search it intuitively. The Riak Handbook I linked to above is a better source, but it’s not free and the examples in it are written in JavaScript. I have nothing against JavaScript, but porting the examples is yet another thing I need to do to get things working in Ruby. In any case, the book cannot cover every error, such as the error I ran into here. This is information that needs to be a part of the living documentation provided by the database manufacturer themselves.

Furthermore, the documentation for Riak’s Ruby driver is sorely lacking. I refer to the test suite for almost every single thing I need to do with the driver, because the README does not provide examples of how to do things. This is something I will definitely help out with, since Github provides a wiki that I can contribute to. I encourage everyone else using this driver to do the same.

I really like Riak as a database, and these are all growing pains one needs to experience, I guess. I just expected it not to hurt as much as it did. :)

No Rails in My Specs


tl; dr

Removing Rails from the runtime of an RSpec suite will result in massive performance improvements while reducing the number of steps needed to run them. The trade-offs are:

  1. A different way of writing specs, especially controller ones
  2. There is overhead due to having to handle requires on your own

These trade-offs may not be worth the performance gain. Skim the code blocks if the post is too long.

Since my models are not subclasses of ActiveRecord::Base, I am unaware of what additional work is involved for apps that do use ActiveRecord. Hat tip to Tommy Morgan for bringing up that concern.

Introduction

So, last night, I tweeted my amazement at a massive performance improvement while running specs for a Rails app I’m building. A few people have asked if I could write down my thoughts, so here they are.

First, though, a little bit of background. This after-hours app of mine is a bank account analysis web application. Basically, I wanted to be able to see at a glance what my wife and I are spending money on and how much of it, and to eventually come up with a prediction model based on past financial history. Also, I’m treating this application as a non-trivial testbed of technologies and techniques. For example, the data store is Riak and I am splitting the application up into services. Currently, I have two parts:

  1. A Sinatra-powered API service for querying the data and importing new source data
  2. A thin Rails client whose sole job is presenting the query results from the API service to the end user

Initial assessment

So, the Rails client is running on Ruby 1.9.3 and Rails 3.1.3, while Typhoeus 0.3.3 is my HTTP library of choice. For testing, I’m using RSpec 2.7.0. It’s important that I note the versions here, because things may change for the better. I used test-first methodology to drive out the initial iteration of the client, based on a few requirements I jotted down on a piece of paper. I deliberately skipped creating ActiveRecord (or any other ORM) models and I went with straight-up Ruby objects. After all, the data store for the client is the API service, so I didn’t see the need to duplicate the data store part here.

While the model part is not the usual Rails fare, the controller and view parts are pretty standard with some modern techniques, like using presenters/decorators instead of helpers, mixed in. I wrote the controller and model specs in the usual way, mocking out the API response in the controller and letting the model specs hit the API service like they would in the production mode.

When I ran the suite of 30 specs, I was flabbergasted to see the suite take 9 seconds to run. The specs ran fairly fast and most of the time seemed to be spent loading up the environment. This run time surprised me, because the entire Ruby universe on Twitter kept saying how much better 1.9.3 was at loading up the Rails environment than 1.9.2. I can only imagine how long this would’ve taken to run using 1.9.2. To me, 9 seconds for each spec suite run is unacceptable. To hell with premature optimization, this was an issue right now and it’d be better if I solved it while I only have 30 specs.

Spork

As usual, the first tool I reach for when trying to improve the run time of a spec suite is Spork. It’s a well-known and trusted tool among Rails people and I’ve used it before. A quick gem install later, followed by the bootstrap and I’m off and running.

Originally, my spec_helper.rb file looked like this:

require 'rubygems'
require 'spork'

Spork.prefork do
end

Spork.each_run do
  # This code will be run each time you run your specs.
end
ENV["RAILS_ENV"] ||= 'test'
require File.expand_path("../../config/environment", __FILE__)
require 'rspec/rails'
require 'rspec/autorun'

# Requires supporting ruby files with custom matchers and macros, etc,
# in spec/support/ and its subdirectories.
Dir[Rails.root.join("spec/support/**/*.rb")].each {|f| require f}

RSpec.configure do |config|
  config.mock_with :rspec
  config.infer_base_class_for_anonymous_controllers = false
end

I moved everything into the Spork.prefork block and left the Spork.each_run block empty. This worked immediately, dropping my run time from 9 seconds (well, after the initial 8-second load time) to 0.8 seconds. I cannot emphasize enough how big a win Spork is early in the life of a project. I mean, 2 minutes of work has shaved off a tonne of time. However, this wasn’t without its issues.

The big issue I ran into almost right away was that changes to my model classes weren’t being applied between runs. This is a rhythm-killer for me. I had to stop and reload spork every time I changed a model class, paying the 8-second load penalty. Controller specs weren’t affected, so I suspect this had something to do with the fact that my model classes weren’t subclassing ActiveRecord::Base. A bit of googling later, my spec_helper file became this:

require 'rubygems'
require 'spork'

Spork.prefork do
  # Loading more in this block will cause your tests to run faster. However, 
  # if you change any configuration or code from libraries loaded here, you'll
  # need to restart spork for it take effect.
  ENV["RAILS_ENV"] ||= 'test'
  require File.expand_path("../../config/environment", __FILE__)
  require 'rspec/rails'
  require 'rspec/autorun'
  Dir[Rails.root.join("spec/support/**/*.rb")].each {|f| require f}
  RSpec.configure do |config|
    config.mock_with :rspec
    # If true, the base class of anonymous controllers will be inferred
    # automatically. This will be the default behavior in future versions of
    # rspec-rails.
    config.infer_base_class_for_anonymous_controllers = false
  end
end

Spork.each_run do
  Dir["#{Rails.root}/app/controllers//*.rb"].each do |controller|
    load controller
  end
  Dir["#{Rails.root}/app/models//*.rb"].each do |model|
    load model
  end
  Dir["#{Rails.root}/app/models/api/*.rb"].each do |model|
    load model
  end
  Dir["#{Rails.root}/app/presenters//*.rb"].each do |presenter|
    load presenter
  end
end

I was annoyed that I had to do that, but it worked as advertised so I moved on.

To the extreme

The immediate annoyance I encountered with this setup is the new process I had to adopt. In order to write new specs, I had to do the following:

  1. Start up the API service, complete with the database start-up
  2. Start up spork, wait 8 seconds
  3. Run the existing specs
  4. Write a new one and cycle

Now, I realize that there is a gem out there, guard, which will automate steps 2 and 3 for me. However, this is a thin Rails client with very few moving parts and I didn’t feel like repeating the incantations needed to set up guard properly. Moreover, step 1 is still something I have to do manually, so the win with guard isn’t as big.
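For the record, the incantations aren’t huge; a Guardfile along these lines (a guess on my part, not something I’ve actually set up for this app) is roughly what guard-rspec wants:

Guardfile (sketch)
# Run specs through the spork DRb server and map app files to their specs.
guard 'rspec', :cli => '--drb' do
  watch(%r{^spec/.+_spec\.rb$})
  watch(%r{^app/(.+)\.rb$})    { |m| "spec/#{m[1]}_spec.rb" }
  watch('spec/spec_helper.rb') { "spec" }
end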

Furthermore, forgetting to start guard or spork will cause me to get frustrated while I wait the ~9 seconds it takes to run the suite, plus the usual 8-second load time when I do start spork up. I just want to run rspec spec/, for crying out loud.

I had heard of Corey Haines’ presentation on fast rails tests where he essentially removes Rails from the spec suite runtime. There were also a few blog posts around the same issue. I figured I had nothing to lose trying out this approach, since my Rails client isn’t a “real”™ Rails application.

So, I removed:

require 'spec_helper'

from every spec file I had and set about fixing all of the errors that popped up. What I came up with within an hour is this:

APP_ROOT = File.expand_path(File.join(File.dirname(__FILE__), '..'))
$: << File.join(APP_ROOT, "app", "controllers") << File.join(APP_ROOT, "app", "models") << File.join(APP_ROOT, "app", "presenters")

module ActionController
  class Base
    def self.protect_from_forgery; end
    def params; {}; end
  end
end

require 'active_support/core_ext/string/inflections'
require 'date'
require 'vcr_setup'
require 'application_controller'
require 'transactions_controller'
require 'categories_controller'
require 'api/transaction'
require 'transaction_collection'
require 'transaction'
require 'category_formatter'

def assigns(name)
  controller.instance_variable_get "@#{name}"
end

RSpec.configure do |config|
  config.mock_with :rspec
end

As you’ll note, the spec_helper file is pretty different compared to a regular one, no surprise there. I used the setup from this blog post, Running Rails Rspec Tests - Without Rails, and took inspiration from this gist, with a few modifications.

As the blog post explains in detail, requiring Rails is the thing that kills the startup time of the spec suite. I don’t think it’s a big revelation, but it’s one worth repeating. If you want really fast Rails tests, remove RAILS!

The above code monkey-patches ActionController::Base and provides a dummy implementation of a few methods to ensure that ApplicationController and its subclasses don’t blow up. After adding the paths to various parts of the app to the load path (that’s the $: thingie), requiring the various classes and defining a helper method, we’re pretty much done.

There are a few other things I needed to explicitly require after ripping out Rails, namely the date library from Ruby’s stdlib and the inflector from ActiveSupport that is used in the presenter spec.

The controller specs look like this now:

require 'spec_helper'

describe TransactionsController do

  describe "GET 'index'" do
    let(:controller) { TransactionsController.new}
    before do
      builder = TransactionCollection.new
      parsed_json = [{"key12" =>{:category => "insurance", :amount => 332}}]
      API::Transaction.stub!(:transactions_for).with("insurance").and_return(parsed_json)
    end

    it "has a category assigned" do
      controller.class.send(:define_method, :params) do
        {:category_id => "insurance"}
      end
      controller.index
      assigns("category").should_not be_nil
    end

    it "returns a set of transactions" do
      controller.class.send(:define_method, :params) do
        {:category_id => "insurance"}
      end
      controller.index
      transactions = assigns("transactions")
      transactions.first.key.should eql "key12"
    end

    it "returns an empty array when there are no transactions for the category" do
      controller.class.send(:define_method, :params) do
        {:category_id => "testme"}
      end
      API::Transaction.stub!(:transactions_for).with("testme").and_return([])
      controller.index
      assigns("transactions").should eql []
    end
  end
end

Disregarding the duplicated setup that I left in there, the invocation of the index action in TransactionsController is really not that much different from how it’s normally done. The duplicated part sets up the params hash with the expected key/value pair.
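That duplication is easy enough to squeeze out with a tiny helper; something like this (my naming, not part of the app yet) would do:

params helper (sketch)
# Define a params method on the controller's class returning the given hash,
# mirroring what each example above does inline.
def stub_params(controller, params_hash)
  controller.class.send(:define_method, :params) { params_hash }
end

# usage inside an example:
#   stub_params(controller, :category_id => "insurance")
#   controller.index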

The model specs have not changed while I was ripping stuff apart, which made me really happy.

VCR

The eagle-eyed among you might have noticed this line in the spec_helper:

require 'vcr_setup'

After I got the suite running, I noticed that what was previously 0.8 seconds with spork was now consistently around 2 seconds. The only cause for this slowdown was the model specs that hit the API service. Looking through Avdi Grimm’s blog posts and various podcast appearances, I came upon the VCR gem.

What this gem does is record the HTTP requests your specs make in the course of a run, store the results in a YAML file to which you can give a name, cut off HTTP access from your specs and then return the contents of each YAML file as if the spec still made that HTTP request. I think showing a bit of code may explain this better than words:

it "should return a set of categories" do
  VCR.use_cassette("built-in categories") do
    API::Transaction.categories.should_not be_empty
  end
end

As you can see, I’ve named this cassette “built-in categories” which will cause VCR to store a YAML file named built_in_categories.yml under the spec/vcr_cassettes directory. On the first run after install, VCR will execute the HTTP request inside the use_cassette block and save the result, replaying it with every subsequent run. There are options available to let you control if and when these cassettes should expire, as well as many more. The documentation is here.
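I won’t reproduce my exact vcr_setup file here, but a minimal one for this stack would look roughly like the following, assuming VCR 2.x and its Typhoeus hook (check the VCR docs if your versions differ):

vcr_setup.rb (sketch)
require 'vcr'

VCR.configure do |c|
  # Where the named cassettes (e.g. built_in_categories.yml) end up.
  c.cassette_library_dir = File.expand_path('../vcr_cassettes', __FILE__)
  # Intercept HTTP requests made through Typhoeus, the app's HTTP library.
  c.hook_into :typhoeus
end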

This library is an absolute gem (pardon the pun). It works as advertised and extremely smoothly. Even things that weren’t well-documented, but seemed like they should work, did. For example, if you’re executing an HTTP request in a before block for multiple spec runs, the same syntax applies and works. I am thoroughly impressed with this gem, as is Avdi.

The long-awaited payoff

So, why go through all of this? If you’re asking yourself that, don’t worry, I’ve asked myself the same. For me, there are a few big wins.

Firstly, the speed. Repeated full suite runs clock in at 0.5 seconds (I said 0.7 on Twitter, but I was wrong). That’s right, it takes me half a second to run all the specs, the controller ones, the model ones and the presenter ones. When I compare this to the initial run of 9 seconds in total, I grin.

Secondly, spork is now out of the picture. For me, this is a huge relief on mental load. The process that I outlined above is reduced to steps 3 and 4. I can type in rspec spec/, note any failures I left for myself from the night before, make the specs pass, refactor and move on. The rhythm is back and it’s the rhythm that makes me productive.

This does not imply that spork sucks. Installing spork, modifying the spec_helper, firing spork up and testing is a very legitimate way of improving spec suite run time. I’ve outlined why I don’t like this process, but I’m not ever going to argue against it if it works for you.

I wanted to prove to myself that it’s possible to have Rails specs without Rails. Having an unusual use case helped push me down this path. It is possible and, as of right now, it isn’t that much work. I do not know if this approach scales, though. I need to emphasize that point. I may come back a few months down the road saying that this is too much work. I don’t know.

What I do know is that it’s been a worthwhile experiment and I hope that documenting it in this way will help someone else, as well. Good luck.

Props and fist bumps go to Tommy Morgan for proof-reading, Tony Collen for the gist and the blog post link and Myron Marston for the VCR gem.