(Insightful) Ramblings by batasrki

No subtitle needed

Experiments Part 5


Previously in Experiments, part 4,…

I added a form for creating new links. Instead of introducing a database, the links are being added to an atom.

In tonight’s installment…

I am going to start working on the trickier parts of bookmarking links: retrieving the title of a link and possibly the body. At some point, I’ll have to commit to a database, but I can push that off for at least one more post. I would like to have background-queue-like functionality without introducing an actual background queue, and I think core.async can help me do that.

Onward, then…

What I would like to do first is fetch the title of the posted URL. This will enable me to have a nice description of the link I saved. I need a library to which I can give a URL and have it fetch the HTML of the page. It would be nice if it could also parse that HTML. The most popular library in Clojure that does this is called Enlive.

As usual, I’m going to add the library to the project.clj file and require it in the handler.clj.

;; in project.clj, in the :dependencies vector
[enlive "1.1.5"]

;; in handler.clj
(:require [net.cgrand.enlive-html :as html])

With that done, I need to add a couple of functions to fetch and parse HTML.

(defn fetch-url [url]
  (with-open [inputstream (-> (java.net.URL. url)
                              .openConnection
                              (doto (.setRequestProperty "User-Agent"
                                                         "ReadLaterCrawler/1.0 ..."))
                              .getContent)]
    (html/html-resource inputstream)))

This is the first function where Clojure’s interop with Java starts to show. Using Java’s URL class to open the connection, the code threads that connection through a few calls, ending with getContent. Enlive then turns that resource into a map that I can iterate over. A quick REPL test shows me how this looks.

OK, I have a map representing the HTML, I need to get at the title.

(defn get-title [url]
  (first (map html/text (html/select (fetch-url url) [:title]))))

Enlive lets me select elements based on a vector I pass in. Since all I want is the title, it’s a simple vector being passed in. html/select takes the map of HTML and the selector vector and gives me back a map representing the selected element.

I then extract the content and make it the return value.

Integration

Now that I have that working, I am going to do a simple integration into the current flow. POSTing the new link will block until the title is extracted, which is what core.async will solve.

;; in POST
(POST "/api/links" request
  (let [next-id (inc (apply max (map #(:id %) @links)))
        new-link (:body request)
        new-link-with-fields (assoc new-link :id next-id :title (get-title (:url new-link)) :created_at "2015-12-15")]
    (swap! links conj new-link-with-fields)
    (json/write-str @links)))

The change I made is in assoc, where instead of :title (:url new-link), I now have :title (get-title (:url new-link)). This should block while getting the title.

The time it takes for the new link to show up varies based on the response time of the website, so while it can be quick some of the time, other times it might take a while.

That’s it for now. In the next post, I am going to make that fetching asynchronous, which may take a bit of rejiggering of things.

Until then.

Experiments Part 4


Previously in Experiments, part 3,…

I set up a card-based UI and added a few more records to my test data set.

In tonight’s installment…

It’s still (mostly) about the UI. I’m going to create a form for adding new links. I will be appending the links to the test data set and push dealing with the DB until later.

Onward, then…

Up to now, this has been a read-only app. It’s time to add the creation part. I’m not going to worry about persistent storage yet. Instead, I’ll add the newly created link to the atom I created in the last post.

Firstly, I’m going to add a form component.

var LinkForm = React.createClass({
  render: function() {
    return (
      <div className="col-lg-6 well">
        <form className="form-horizontal" onSubmit={this.handleSubmit}>
          <fieldset>
            <legend>Save a link</legend>
            <div className="form-group">
              <label className={"col-lg-2 control-label"}>URL</label>
              <div className="col-lg-10">
                <input name="url" placeholder="URL" type="url" className="form-control" ref="url" />
              </div>
            </div>
            <div className="form-group">
              <label className={"col-lg-2 control-label"}>From</label>
              <div className="col-lg-10">
                <input name="client" placeholder="From" type="text" className="form-control" ref="client" />
              </div>
            </div>
            <div className="form-group">
              <div className={"col-lg-10 col-lg-offset-2"}>
                <button className={"btn btn-primary"} type="submit">Save</button>
              </div>
            </div>
          </fieldset>
        </form>
      </div>
    );
  }
});

For now, I’ll stick it under the list component, but later I may break it into a modal.

render: function() {
  return (
    <section className="linkListContainer">
      <LinkList data = {this.state.data} />
      <LinkForm onLinkSubmit={this.handleLinkSubmit} />
    </section>
  );
}

So far, so straightforward. I’ve added a form, laid out in typical Bootstrap style, then added it as a component to the parent component.

Submits

A sharp-eyed reader will notice two seemingly missing pieces: onLinkSubmit={this.handleLinkSubmit} in the declaration of the form component, and onSubmit={this.handleSubmit} on the <form> element itself.

These two functions are the drivers of the form’s behaviour. I’m going to show and explain the latter one first.

var LinkForm = React.createClass({
  handleSubmit: function(e) {
    e.preventDefault();
    var url = this.refs.url.value.trim();
    var client = this.refs.client.value.trim();

    if (!url || !client) {
      return;
    }

    this.props.onLinkSubmit({url: url, client: client});
    this.refs.url.value = '';
    this.refs.client.value = '';
    return;
  },
  // render function below

All the form’s submit function does is parse the values out of the two input fields, do a quick presence validation, pass the values on to the parent component, and clear the fields.

I’ve highlighted the cool and important part. The function itself doesn’t actually make any XHR calls. Its responsibility is to get the values and pass them up the hierarchy. As far as I can tell, this helps with centralizing state changes. The parent component will deal with the actual data changes and it will then tell every descendant to re-render itself.

Here’s the function that does the actual server call.

// componentDidMount above
handleLinkSubmit: function(link) {
  $.ajax({
    url: this.props.url,
    dataType: "json",
    contentType: "application/json",
    type: "POST",
    data: JSON.stringify(link),
    success: function(data) {
      this.setState({data: data});
    }.bind(this),
    error: function(xhr, status, err) {
      console.error(this.props.url, status, err.toString());
    }.bind(this)
  });
},
// render below

It’s pretty straightforward. The link object is passed in from the form component’s handleSubmit function, stringified, and sent to the server. On a successful response, the LinkListContainer component receives a full set of data, which is what the internal state is set to. This will in turn cause child components to re-render themselves.

Fake it until you…need it

I’m now going to implement the POST route on the server-side, where I’ll attach the new link to the existing data set and send that down.

;; data set refresher
(def links
  (atom '({:id 1 :url "https://google.ca" :title "Google" :client "static" :created_at "2015-12-01"}
          {:id 2 :url "https://twitter.com" :title "Twitter" :client "static" :created_at "2015-12-01"}
          {:id 3 :url "https://github.com" :title "Github" :client "static" :created_at "2015-12-08"}
          {:id 4 :url "https://www.shopify.ca" :title "Shopify" :client "static" :created_at "2015-12-08"}
          {:id 5 :url "https://www.youtube.com" :title "YouTube" :client "static" :created_at "2015-12-08"})))
;; in app-routes
(POST "/api/links" request
  (let [next-id (inc (apply max (map #(:id %) @links)))
        new-link (:body request)
        new-link-with-fields (assoc new-link :id next-id :title (:url new-link) :created_at "2015-12-15")]
    (swap! links conj new-link-with-fields)
    (json/write-str @links)))

I love how this works. I really, really do. I’m going to explain each line starting from let.

next-id (inc (apply max (map #(:id %) @links)))

ReactJS would like a key to hang its rendering on. I am using the ID of each record. This line implements a simplistic ID auto-increment. It lifts the ID values out of each record in the links atom. It then applies max to the new list to find the highest value. Finally, it increments that value by one.
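The same lift-max-increment shape is easy to see outside of Clojure. A quick Ruby sketch, against a made-up in-memory dataset (the names and IDs here are hypothetical):

```ruby
# Hypothetical in-memory dataset standing in for the links atom.
links = [
  { id: 1, title: "Google" },
  { id: 2, title: "Twitter" },
  { id: 5, title: "GitHub" }
]

# Lift the IDs out of each record, find the highest, and increment it,
# mirroring (inc (apply max (map #(:id %) @links))).
next_id = links.map { |link| link[:id] }.max + 1
# next_id is 6, even though the collection only holds 3 records
```

This simplistic scheme assumes the collection is non-empty and that IDs are never reused.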

new-link (:body request)

Maps are the predominant data structure in Clojure web development. At least, they are from what I’ve seen in my limited exposure. Here, I take the body of the request which represents the JSON data I passed in from the client.

new-link-with-fields (assoc new-link :id next-id :title (:url new-link) :created_at "2015-12-15")

I need to add a few more things to the new record, namely title, id, and created_at. The last one will be hard-coded, the title will just be the URL for now, and the ID will be the next-id I computed above.

Also, isn’t it neat how I can create a new value within a let and then use it to create another one within the same let?!? I think it is.
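For readers coming from Ruby, assoc behaves much like a non-mutating Hash#merge: it returns a new map with the extra keys, leaving the original untouched. A sketch with hypothetical values:

```ruby
# A new link as it might arrive from the client.
new_link = { url: "https://example.com", client: "static" }

# Like assoc, merge returns a *new* hash; new_link itself is unchanged.
with_fields = new_link.merge(
  id: 6,
  title: new_link[:url],       # title defaults to the URL for now
  created_at: "2015-12-15"
)
```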

(swap! links conj new-link-with-fields)

Once I have a new, valid record, I need to add it to my dataset. Since I’m using the atom construct, I need to use swap! to update its value. Atoms are neat, too. So many neat things.

(json/write-str @links)

Finally, with all of the transformation done, I create a JSON response that is sent back to the client. When I run the whole app, it works exactly as if there was a database backing it. Woot.

Experiments Part 3


Previously in Experiments, part 2,…

I set up dummy data on the backend, then created the initial set of components on the front end that consume that data.

In tonight’s installment…

It’s all about the UI. I’m going to add a few more links to the dummy set, then set up my card-based UI.

Onward, then…

Firstly, I’m going to add 3 more links to my dataset so that I can play with the layout.

(def links
  (atom '({:id 1 :url "https://google.ca" :title "Google" :client "static" :created_at "2015-12-01"}
          {:id 2 :url "https://twitter.com" :title "Twitter" :client "static" :created_at "2015-12-01"}
          {:id 3 :url "https://github.com" :title "Github" :client "static" :created_at "2015-12-08"}
          {:id 4 :url "https://www.shopify.ca" :title "Shopify" :client "static" :created_at "2015-12-08"}
          {:id 5 :url "https://www.youtube.com" :title "YouTube" :client "static" :created_at "2015-12-08"})))

(defroutes app-routes
  (GET "/" []
       (views/index))
  (GET "/api/links" []
       (json/write-str @links))
  (route/not-found "Not Found"))

I’m extracting the static set into an atom, and then de-referencing it to grab its value inside the JSON render call. I can easily add more to this set if I need to play with bigger sets while I lay things out.

Now, I’m ready to switch the table UI to a card one.

Card UI

To do that, I’m going to need to add a component. I’d like to have a group of 4 cards per row, so I need to partition my set into groups of 4, and I need a LinkGroup component to manage each partition.

var LinkGroup = React.createClass({
  render: function() {
    var linkNodes = this.props.data.map(function (link) {
      return (<Link key={link.id} title={link.title} url={link.url} client={link.client} created_at={link.created_at} />);
    });

    return (
      <section className={"card-columns"}>
        {linkNodes}
      </section>
    );
  }
});

var LinkList = React.createClass({
  render: function() {
    var grouped = this.partition(this.props.data, 4);

    var linkGroups = grouped.map(function (link_group) {
      return(<LinkGroup data={link_group} />);
    });

    return (
      <section>
        {linkGroups}
      </section>
    );
  },
  partition: function(list, in_groups_of) {
    var container = [];
    var sublist = [];
    var initial = 0;
    var iteration_count = Math.ceil(list.length / in_groups_of);

    for(var k = 1; k <= iteration_count; k++) {
      for (var i = initial; i < (in_groups_of * k); i++) {
        if (list[i] !== undefined) {
          sublist.push(list[i]);
        }
      }
      container.push(sublist);
      sublist = [];
      initial = in_groups_of * k;
    }
    return container;
  }
});

Let me explain the changes. The LinkList component is now rendering a set of LinkGroup components. Each LinkGroup component is rendering its own subset of Link components. I chose, for now, to do the partitioning of the data in Javascript. It was a fun diversion, though I may decide to push that onto the server at a later point.
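As an aside, Ruby gets this grouping for free with Enumerable#each_slice; a one-line sketch of the same behaviour the partition function implements:

```ruby
# Group ten items into rows of at most four, as the JS partition does;
# the last slice simply holds whatever is left over.
grouped = (1..10).to_a.each_slice(4).to_a
# => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```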

I’ve also added a class for card columns that will be leveraged below. Now, I need to modify my Link component to suit the new layout.

var Link = React.createClass({
  render: function() {
    return (
      <div className={"card card-block"}>
        <h4 className="card-title">
          <a target="_blank" href={this.props.url}>{this.props.title}</a>
        </h4>
        <p className="card-text">
          Created by {this.props.client} at {this.props.created_at}
        </p>
      </div>
    );
  }
});

The changes include ripping out <tr> and <td> elements and replacing them with <div>s and headers and paragraphs. I like this change, because I plan on adding another aspect of each link, namely a bit of text parsed out from it. I feel this layout suits that plan better.

Refreshing the page, I see the new layout and with that, I’m done. Next up is database work.

Until then.

Experiments Part 2


Previously in Experiments,…

I set up a Clojure backend and installed a ReactJS front-end. I also created a basic skeleton, but that didn’t actually show up, because JSX.

In tonight’s installment…

Firstly, I’m going to fix up the transpiling from JSX to JS, so that changes to the React front-end are picked up and shown. I’ll add the expected route, but serve dummy data. Then, I’ll get the front-end to consume that data.

Onward, then…

OK, so, the ReactJS tutorial site is a bit misleading. I am guessing they expect me to use straight-up HTML and add the text/babel script tag. However, I don’t know how to do that through Hiccup. Moreover, it seems a bit weird anyway. There’s a way to do it without using the specialized tag, but it’ll take a bit more effort.

Using the Offline transform section of the Getting Started page, I’m going to…ugh…install an NPM module globally (seriously, tho?) and use it to transform my core.js JSX joint into a pure JavaScript file.

$ npm install -g babel-cli babel-preset-react
$ cd resources/public/js
$ mkdir build
$ babel --presets react core.js --watch --out-dir build

Now that babel is transpiling my JSX into compatible JS, I need to adjust where I look for that JS file.

(include-js "/js/build/core.js")

Running lein ring server opens up a browser tab pointing to http://localhost:3000. As usual, I immediately open the Web Inspector tab and notice the following error: ReferenceError: LinkList is not defined. This is good news! It means that the JSX was successfully transpiled into JS and the browser has attempted to build a ReactJS component.

I’m going to fix the error quickly, by adding the missing component, which is just an empty table tag, for now.

var LinkList = React.createClass({
  render: function() {
    return (
      <table></table>
    );
  }
});

Cool. That brings the next error, TypeError: this.state is null. If you recall, that’s from the following line:

<LinkList data = {this.state.data} />

It makes sense; I didn’t initialise the state to anything, but I want an element of it, data. That’s fine, I can set it to something empty. ReactJS has a handy function, called getInitialState, that I can use to set…well, the initial state for each component.

// This is done on the outer component, LinkListContainer
getInitialState: function() {
  return { data: [] };
}

And with that, no more errors in Web Inspector console!

An interlude

I want to stop here, for a second, and clarify something about ReactJS. ReactJS is a front-end technology. It only cares about consuming data. As such, the JavaScript snippets of code I am presenting can be used with any web framework.

This is one of the things that I like about it and its competitors. However, unlike BackboneJS (though, I admit, it’s been a long while since I’ve used it) and AngularJS, giving ReactJS data to consume is very easy. Insert your favourite “Facebook is a data consuming monster” joke here.

Continuing….

Back in Clojure-land, I’m going to get the api/links route working. Initially, it’ll just deliver a static list of links, each link being a hashmap. The nice thing about Clojure and its libraries is consistency of data structures. Querying a database returns a hashmap representation of each record, and that representation is easily serialized into a JSON response.

;; in require, need to add a JSON library
(:require [clojure.data.json :as json])

;; appending a route to defroutes
(GET "/api/links" []
  (let [links '({:url "https://google.ca" :title "Google" :client "static" :created_at "2015-12-01"}
                {:url "https://twitter.com" :title "Twitter" :client "static" :created_at "2015-12-01"})]
    (json/write-str links)))

Giving it a quick run in the console, curl -vi http://localhost:3000/api/links, I see the desired results. With that done, I need to add a way to consume this data on the client and finish for tonight.

Oh, I forgot. I need to wrap the response as JSON, otherwise things go wonky.

(:require [ring.middleware.json :as wrapper])

;; then down at the bottom

(def app
  (wrap-defaults
    (wrapper/wrap-json-body app-routes {:keywords? true})
    site-defaults))

Back to the client…

I specified the URL to the data as a property of the outermost container. I would like to query that URL now and populate this.state.data variable of the component with the data from the URL. For that, I’m going to use the componentDidMount function. Within, I am using the standard $.ajax call from jQuery and then binding the results to the data property of the state object.

componentDidMount: function() {
  $.ajax({
    url: this.props.url,
    dataType: "json",
    success: function(data) {
      this.setState({ data: data });
    }.bind(this),
    error: function(xhr, status, err) {
      console.error(this.props.url, status, err.toString());
    }.bind(this)
  });
},

ReactJS frowns upon assigning to this.state directly with the = operator. Instead, I can use the setState function.

Reloading the page, I still see nothing, but that’s fine, I haven’t built a component to consume the data. In the Network tab of Web Inspector, though, I see the API call.

Last thing in this installment…

So, the only thing left is to render out a table of links. For that, I need to fill out the LinkList component and add a Link component. The former describes a list of links, while the latter is specific to one link record. Old-timers reading this post will recognize this as the master-detail layout.

var Link = React.createClass({
  render: function() {
    return (
      <tr>
        <td><a target="_blank" href={this.props.url}>{this.props.title}</a></td>
        <td>{this.props.client}</td>
        <td>{this.props.created_at}</td>
      </tr>
    );
  }
});

var LinkList = React.createClass({
  render: function() {
    var linkNodes = this.props.data.map(function (link) {
      return (<Link key={link.id} title={link.title} url={link.url} client={link.client} created_at={link.created_at} />);
    });
    return (
      <table className={"table table-striped table-hover"}>
        <thead>
          <tr>
            <th>URL</th>
            <th>From</th>
            <th>On</th>
          </tr>
        </thead>
        <tbody>
          {linkNodes}
        </tbody>
      </table>
    );
  }
});

I am fairly certain using that map function isn’t the “ideal” way of building sub-components, but it works. Refreshing the tab, I see the 2 hard-coded links I added, and the UI looks like the existing one.

That’s it for now, then.

Good day/night/whatever.

Experiments, Part 1


A different tack

I’ve been very lucky to have had steady employment using Ruby and Rails. At times, it felt like cheating. How can I have so much fun and get paid well for it?

Continuing my search for that fun high, I started looking at other technologies. I’ve been an admirer of Clojure for a while. Its approach to composing programs seems to suit how I think software development should be. In the same vein, ReactJS’s approach to building UIs using components fits how I think UIs should be built. As such, I’m embarking on a rewrite of my bookmarking site, Particle Transporter, using Clojure and ReactJS.

Goals

The big goals for this rewrite are:

  1. Have a nicer UI for the listing of links than what’s on the current site
  2. Implement a filter (filter by tags to start with)
  3. Fetch the title of a given link in the background using core.async
  4. Fetch the first paragraph (if available) of a given link in the background using core.async
  5. Have an offline mode, to ease development from multiple machines

Having spent a few months at Shopify, I really started liking a UI concept called Cards. It seems to nicely encapsulate a piece of data. This makes it an easy choice for ReactJS, since we can take each saved link and all its data and metadata and wrap it into a card. However, I’m getting ahead of myself. Firstly, I’m going to set up the environment in which I can create this app.

Setup

Since this will be a Clojure app, I’m going to use Compojure as the base. Leiningen is a must, obviously, and it provides a template that generates a Compojure-flavoured skeleton.

$ lein new compojure read-later
$ cd read-later && ls

Cool, the filesystem structure is there. I am going to give it a quick whirl, running lein ring server on the command line within the directory. A few seconds later, I see “Hello World” in a browser tab.

Now that I have a running server, I’m going to add a view that sets up the skeleton of the html page. To do that, I’ll install Hiccup. It allows me to write views in Clojure, which will then render HTML. Stopping and starting the server using lein will install the newly specified dependency automatically.

OK, next, I need to add the base view and add ReactJS files.

(ns blog-post-app.views
  (:require [hiccup.page :refer [html5 include-js include-css]]))

(defn base-page [title & body]
  (html5
   [:head
    (include-css "https://cdn.rawgit.com/twbs/bootstrap/v4-dev/dist/css/bootstrap.css")
    [:title title]]
   [:body
    [:div.container-fluid
     [:h1
      [:a {:class :brand :href "/"} "Particle Transporter Reborn"]]
     [:div {:id "content"}]]
    (include-js "https://cdnjs.cloudflare.com/ajax/libs/react/0.14.3/react.js")
    (include-js "https://cdnjs.cloudflare.com/ajax/libs/react/0.14.3/react-dom.js")
    (include-js "https://cdnjs.cloudflare.com/ajax/libs/babel-core/5.8.23/browser.min.js")
    (include-js "https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js")
    (include-js "/js/core.js")]))

(defn index []
  (base-page
      "Saved Links - fPT"))

I am going to reference the views file in the default handler.clj file and adjust the home route to render the HTML page.

(ns blog-post-app.handler
  (:require [compojure.core :refer :all]
            [compojure.route :as route]
            [blog-post-app.views :as views]
            [ring.middleware.defaults :refer [wrap-defaults site-defaults]]))

(defroutes app-routes
  (GET "/" []
       (views/index))
  (route/not-found "Not Found"))

(def app
  (wrap-defaults app-routes site-defaults))

Refreshing the page, I see the header link and an empty page. Checking the console, I see no errors, so all the JS files have been required successfully. Onto the next step, creating a ReactJS view.

ReactJS view

The JavaScript file is straightforward to begin with.

var LinkListContainer = React.createClass({
  render: function() {
    return (
      <section className="linkListContainer">
        <LinkList data = {this.state.data} />
      </section>
    );
  }
});

ReactDOM.render(<LinkListContainer url="/api/links" />, document.getElementById("content"));

I specified the top-level component and the child component that will do most of the work. The top-level component is given a URL from which data will be fetched and an HTML element that it should render into.

That’s it for part 1. Next, I will specify the data source and render out the links.

Passing Lambdas as Parameters


Background

At my current job, I have been working on merchant stores’ integration with third parties. As part of this task, I found I needed to ensure that our app batches API calls to the third party. After a quick measurement, I found that batching API calls lowered processing time by a factor of about 2x–5x compared to making the same calls sequentially. However, whereas sequential processing had a clear approach to dealing with successful and failed calls, the batch call had none of that.

Lambdas

The answer was suggested by my team lead and clarified by a co-worker. I then remembered I knew this about Ruby and I’ve been using a similar approach extensively when writing Clojure programs.

Much like JS, Ruby allows us to pass a lambda or a Proc into a method as a parameter. We can then call this stored function with certain parameters. In the case above, what I wanted was to partition the responses to batched API calls into successful and failed ones. I could then invoke 2 lambdas, one for each set of responses and do the appropriate thing.

A trivial example

I don’t want to share the code specific to my problem, so here’s a trivialized version of the same approach:

for_evens = lambda { |evens| p "All even numbers in the set are: #{evens}" }

for_odds = lambda { |odds| p "All odd numbers in the set are: #{odds}" }

the_set = (1..10).to_a

def partition_set_and_invoke_lambdas(the_set, for_evens, for_odds)
  evens, odds = the_set.partition {|item| item % 2 == 0}

  for_evens.call(evens)
  for_odds.call(odds)
end

partition_set_and_invoke_lambdas(the_set, for_evens, for_odds)

"All even numbers in the set are: [2, 4, 6, 8, 10]"
"All odd numbers in the set are: [1, 3, 5, 7, 9]"
 => "All odd numbers in the set are: [1, 3, 5, 7, 9]"

Cool, eh?

RSpec Matcher and Time Value


This will be a quick post that documents an interesting find in RSpec.

Background

I encountered a classic Ruby testing failure. I have an object whose method takes 2 time values, a start datetime and an end datetime. The intent of the method is to find records matching those values and send an e-mail to a related User account.

As many experienced Ruby programmers have seen, the failure looks like this:

Failure/Error: response = put :update, {id: event.id,
  <WaitlistNotifier (class)> received :alert_matching with unexpected arguments
    expected: (Thu, 29 Jan 2015 23:56:42 UTC +00:00, Fri, 30 Jan 2015 00:01:42 UTC +00:00)
    got: (Thu, 29 Jan 2015 23:56:42 UTC +00:00, Fri, 30 Jan 2015 00:01:42 UTC +00:00)

Upon closer inspection, you will notice that the values appear to be exactly the same. You may ask yourself then, “Why is there a failure if the values are the same?”.

Equality things

One of the possible reasons why this spec failed is that while the datetime values match down to the second, they’re probably mismatching on the millisecond. The accepted solution is to use the be_within matcher.
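The mismatch is easy to reproduce outside of RSpec. A small Ruby sketch of two timestamps that agree to the second but not beyond it (the epoch values are made up):

```ruby
# Two timestamps within the same second, a quarter-second apart.
t1 = Time.at(1422575802.10)
t2 = Time.at(1422575802.35)

same_second   = (t1.to_i == t2.to_i)  # both truncate to the same second
exactly_equal = (t1 == t2)            # false: they differ sub-second
close_enough  = (t1 - t2).abs < 0.5   # the kind of check be_within(0.5) performs
```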

it "should match as close as possible" do
  expect(time_value_under_test).to be_within(0.1).of(expected_time_value)
end

If you’re comparing time or date/time values, you can stop there. However, my use case has a twist on it. Instead of comparing whether or not the time value is what I expect, I wanted to ensure that the method I’m calling is passed the appropriate time values.

class WaitlistNotifier
  def alert_matching(datetime_start, datetime_end)
  end
end

describe SomeController do
  it "should pass the appropriate time values to the notifier" do
      expect(WaitlistNotifier).to receive(:alert_matching).with(start_time, end_time)
  end
end

As you can see, I don’t care if the values are equal down to the millisecond, all I care is that it’s the appropriate value passed in. RSpec makes no such distinction, nor should it.

Having consulted a few peeps on Twitter, including Myron Marston, I got my answer. The be_within matcher can be used within a with call!

describe SomeController do
  it "should pass the appropriate time values to the notifier" do
      expect(WaitlistNotifier).to receive(:alert_matching).
                                      with(a_value_within(1).of(start_time),
                                           a_value_within(1).of(end_time))
  end
end

According to the RSpec documentation, a_value_within is an alias of be_within.

I think this is a pretty cool and pretty powerful part of RSpec and I confess that I was not aware of it AT ALL.

Hopefully, this will save someone time in the future.

CRDT for Real Data in Riak and Ruby


Intro

In the previous post, I briefly introduced CRDTs in Ruby and have shown a basic usage case for one Conflict-Free Replicated Data Type. Tonight, we will explore one slightly more complicated use case that can be seen in real applications in the wild.

Use case

For the purpose of this post, imagine we’re running a Netflix clone. We have accounts and we have movies. For the purposes of recommendations, we want to store what movies each account holder has viewed. As is the case with Netflix, an account may be signed in and used across more than one device by holders of differing tastes. Since we don’t really want to impose the restriction of just one account being signed in, we will record statistics from each session.

As you can see, this may lead to write conflicts, since any account holder may want to watch any movie at any time. For the sake of efficiency, we will only store an ID to each movie in the account’s movies watched collection.

The data type used

Out of the currently known CRDTs, the one that seems to fit this problem the best is the GSet, or Grow-only Set, data type. As its name suggests, this CRDT only ever grows in size, it does not shrink. Since recommendation engines thrive on tonnes of data, having an ever-increasing set of records is very beneficial.
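The property that makes a GSet safe to replicate is that its merge, set union, is commutative, associative, and idempotent, so replicas converge no matter the order or number of merges. A quick Ruby sketch with made-up replica states:

```ruby
require 'set'

# Three replica states that diverged under concurrent writes.
a = Set['movie-1', 'movie-2']
b = Set['movie-2', 'movie-3']
c = Set['movie-4']

# Union merges in any order, grouping, or repetition give the same result.
commutative = ((a | b) == (b | a))
associative = (((a | b) | c) == (a | (b | c)))
idempotent  = (((a | b) | a) == (a | b))
```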

The basic layout of a GSet is as follows:

{
  'type': 'g-set',
  'e': ['a', 'b', 'c']
}

The type information there is for the meangirls library to invoke the correct operational semantics. Other than that, it’s just a simple list of data, always growing in size. The catch is that the meangirls library does not include a class for a GSet. It has a TwoPhaseSet class, which is a combination of two GSets, but it doesn’t actually implement a GSet on its own.

That’s not a big deal, we will implement one ourselves.

GSet implementation

Since it has fairly simple requirements, let’s take a look at the GSet class. I’ll explain parts of it after the code listing.

gset.rb
class GSet < Meangirls::Set
  attr_accessor :a

  def initialize(hash=nil)
    if hash
      raise ArgumentError, "hash must contain a" unless hash['a']

      @a = Set.new hash['a']
    else
      # Empty set
      @a = Set.new
    end
  end

  def <<(e)
    @a << e
    self
  end
  alias add <<

  def ==(other)
    other.kind_of?(self.class) && a == other.a
  end

  def as_json
    {
      'type' => type,
      'a' => a.to_a
    }
  end

  def clone
    c = super
    c.a = a.clone
    c
  end

  def merge(other)
    unless other.kind_of? self.class
      raise ArgumentError, "other must be a #{self.class}"
    end

    self.class.new('a' => (a | other.a))
  end

  def include?(e)
    @a.include? e
  end

  def to_set
    @a
  end

  def type
    'g-set'
  end
end

If you peek at meangirls’ implementation of TwoPhaseSet, you will notice that we copied most of its code here, save for the remove-set logic. One side effect is that our element list lives under the 'a' key, as in TwoPhaseSet, rather than the 'e' key shown in the layout above. Let’s go through the important parts.

def initialize(hash=nil)
  if hash
    raise ArgumentError, "hash must contain a" unless hash['a']

    @a = Set.new hash['a']
  else
    # Empty set
    @a = Set.new
  end
end

Our constructor may or may not take a hash. This is done so that we can reconstruct a GSet from JSON data, such as a Riak record.

def <<(e)
  @a << e
  self
end
alias add <<

We only ever add elements to our GSet. Additional guarantees, such as requiring that an element be a String ID, can be added here.
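If we wanted such a guarantee, a guarded variant of << could look like the sketch below. The GuardedGSet name and the String-only rule are ours for illustration; they are not part of meangirls:

```ruby
require 'set'

# A GSet whose << refuses anything that is not a String id. This mirrors
# the add path of the GSet above, plus a type guard.
class GuardedGSet
  def initialize
    @a = Set.new
  end

  def <<(e)
    raise ArgumentError, "element must be a String id" unless e.is_a?(String)
    @a << e
    self
  end
  alias add <<

  def include?(e)
    @a.include?(e)
  end
end

set = GuardedGSet.new
set << "CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"  # fine
# set << 42 would raise ArgumentError
```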

def merge(other)
  unless other.kind_of? self.class
    raise ArgumentError, "other must be a #{self.class}"
  end

  self.class.new('a' => (a | other.a))
end

When merging, we must first ensure that we’re merging a GSet with a GSet; we would get weird situations and breakage otherwise. After that, a simple set union over our two sets of IDs ensures we only ever have unique ID values.
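The union-based merge is precisely what makes the GSet a CRDT: set union is commutative, associative and idempotent, so replicas can merge in any order, any number of times, and still converge. A quick check with plain Sets standing in for GSets:

```ruby
require 'set'

a = Set.new(%w[m1 m2])
b = Set.new(%w[m2 m3])
c = Set.new(%w[m3 m4])

# The three properties that make union a safe merge operation.
raise "not commutative" unless (a | b) == (b | a)
raise "not associative" unless ((a | b) | c) == (a | (b | c))
raise "not idempotent"  unless (a | a) == a
```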

Setup

With that done, let’s add models for the Account and Movie records. Persistence is handled elsewhere; this time, we will store records through irb.

account.rb
class Account
  attr_accessor :username, :email, :movies_watched

  def initialize(hash=nil)
    if hash
      @username = hash["username"]
      @email = hash["email"]
      @movies_watched = GSet.new(hash["movies_watched"])
    else
      @movies_watched = GSet.new
    end
  end

  def as_json
    {
      "username" => username,
      "email" => email,
      "movies_watched" => movies_watched.as_json
    }
  end

  def to_json
    as_json.to_json
  end
end
movie.rb
class Movie
  attr_accessor :title

  def initialize(hash=nil)
    if hash
      @title = hash["title"]
    end
  end

  def as_json
    {
      "title" => title
    }
  end

  def to_json
    as_json.to_json
  end
end

Let’s store a few arbitrary movies, keep their keys in memory and cause a write conflict. We will then see how we can resolve this conflict through the usage of our CRDT.

irb
movie_bucket = client.bucket("movies")
 => #<Riak::Bucket {movies}>

movie_record_keys = []

m1 = Movie.new("title" => "Test Movie")
 => #<Movie:0x007ffa0be21348 @title="Test Movie">

m2 = Movie.new("title" => "Another test movie")
 => #<Movie:0x007ffa0be43ba0 @title="Another test movie">

r1 = movie_bucket.new
 => #<Riak::RObject {movies} [#<Riak::RContent [application/json]:nil>]>

r1.data = m1.as_json
 => {"title"=>"Test Movie"}

r1.store
 => #<Riak::RObject {movies,CAQ3ewkn0vVI5q8Kxt4LPcrFYk9} [#<Riak::RContent [application/json]:{"title"=>"Test Movie"}>]>

movie_record_keys << r1.key

r2 = movie_bucket.new
 => #<Riak::RObject {movies} [#<Riak::RContent [application/json]:nil>]>

r2.data = m2.as_json
 => {"title"=>"Another test movie"}

r2.store
 => #<Riak::RObject {movies,aU6uCCZJ10xGIPJonpSYftZ4H0w} [#<Riak::RContent [application/json]:{"title"=>"Another test movie"}>]>

movie_record_keys << r2.key
 => ["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9", "aU6uCCZJ10xGIPJonpSYftZ4H0w"]

OK, now we have two movies, so let’s simulate simultaneous account access by two holders, each of whom is watching a movie. As shown in the previous post, storing values at the same time will cause a write conflict, resulting in the creation of siblings. As usual, the accounts bucket should have allow_mult set to true, so that Riak doesn’t discard one of the writes.

Let’s first create an account.

irb
accounts_bucket = client.bucket("accounts")
 => #<Riak::Bucket {accounts}>

accounts_bucket.allow_mult = true
 => true

account = Account.new
 => #<Account:0x007ffa0ce89758 @movies_watched=#<GSet:0x007ffa0ce89730 @a=#<Set: {}>>>

account.username = "an-user"
 => "an-user"
account.email = "anuser@email.com"
 => "anuser@email.com"

record = accounts_bucket.new
 => #<Riak::RObject {accounts} [#<Riak::RContent [application/json]:nil>]>

record.data = account.as_json
 => {"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>[]}}

record.store
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>[]}}>]>

OK, with the record stored, we retrieve it twice without updating. This ensures that the vclock value for the record is not updated, allowing us the opportunity to cause a write conflict.

irb
one_access = accounts_bucket.get "6z4xJjAFMJVHWn6Ni2MF5J3IBBq"
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>[]}}>]>

# same as above
second_access = accounts_bucket.get "6z4xJjAFMJVHWn6Ni2MF5J3IBBq"

first_instance = Account.new(one_access.data)
 => #<Account:0x007ffa0cf2b710 @movies_watched=#<GSet:0x007ffa0cf2b670 @a=#<Set: {}>>, @username="an-user", @email="anuser@email.com">

second_instance = Account.new(second_access.data)

Let’s now add the ID of the first created movie to one instance of this Account record and the ID of the second movie to the other one.

irb
first_instance.movies_watched.add(movie_record_keys.first)
 => #<GSet:0x007ffa0cf2b670 @a=#<Set: {"CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"}>>

second_instance.movies_watched << movie_record_keys.last
 => #<GSet:0x007ffa0cf31f70 @a=#<Set: {"aU6uCCZJ10xGIPJonpSYftZ4H0w"}>>

Let’s now try storing these objects back into Riak and seeing what happens.

irb
one_access.data = first_instance.as_json
 => {"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}

one_access.store
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}>]>

second_access.data = second_instance.as_json
 => {"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w"]}}

second_access.store
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w"]}}>, #<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}>]>

And there we go. We have successfully (again) caused a conflict. Let’s write some code to resolve it. The trick to resolving conflicts like this is that you must not access the data or raw_data methods on the conflicted object. These will raise an exception and mess up your day.

Instead, let’s introduce a class method on the Account model that will handle conflict resolution for us.

account.rb
class Account
  # existing methods
  #
  #
  #
  def self.from_persistence(obj)
    if obj.conflict?
      resolved = {}
      watched_lists = []
      obj.siblings.each do |sibling|
        watched_lists << sibling.data.delete("movies_watched")
        resolved.merge!(sibling.data)
      end

      first_one = GSet.new(watched_lists.shift)
      watched_lists.each {|list| first_one = first_one.merge(GSet.new(list))}

      resolved.merge!("movies_watched" => first_one.as_json)
      new(resolved)
    else
      new(obj.data)
    end
  end
end

In this method, we first check whether the raw record coming from Riak has the conflict? flag set, which indicates that there has been a write conflict somewhere. If it does, we loop through all the siblings and pull out the movies_watched lists. The rest of the record should not be subject to conflict resolution, so we accept whatever values are there.

We then build a GSet from the first list in the collection and merge each remaining list into it, one GSet at a time. We add the result to the hash of resolved values and instantiate our Account model with the new values.
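The heart of from_persistence is that fold over the sibling lists. Here is the same idea in isolation, with plain Sets standing in for GSet instances and an array of hashes standing in for Riak siblings (the data shapes are assumed from the records above):

```ruby
require 'set'

# Each sibling carries its own movies_watched list; the resolved value is
# simply the union of all of them.
siblings = [
  { "movies_watched" => ["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"] },
  { "movies_watched" => ["aU6uCCZJ10xGIPJonpSYftZ4H0w"] },
]

merged = siblings
  .map { |s| Set.new(s["movies_watched"]) }
  .reduce(:|)  # fold the sibling sets together with set union
```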

The second part to this conflict resolution is storing the resolved object back into Riak. Again, as in the previous post, we do that by creating a new Riak::RObject instance, setting its key to the conflicted one and setting the siblings property of the conflicted instance to an array of Riak::RObjects of size 1. The only member is the new instance. We store that and we’re done.

irb
record = accounts_bucket.get "6z4xJjAFMJVHWn6Ni2MF5J3IBBq"
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w"]}}>, #<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}>]>

a = Account.from_persistence(record)
 => #<Account:0x007ff318320c68 @username="an-user", @email="anuser@email.com", @movies_watched=#<GSet:0x007ff318320bc8 @a=#<Set: {"aU6uCCZJ10xGIPJonpSYftZ4H0w", "CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"}>>>

resolved = accounts_bucket.new
 => #<Riak::RObject {accounts} [#<Riak::RContent [application/json]:nil>]>

resolved.data = a.as_json
 => {"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w", "CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}

resolved.key = record.key
 => "6z4xJjAFMJVHWn6Ni2MF5J3IBBq"

record.siblings = [resolved]
 => [#<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w", "CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}>]>]

record.store
 => #<Riak::RObject {accounts,6z4xJjAFMJVHWn6Ni2MF5J3IBBq} [#<Riak::RContent [application/json]:{"username"=>"an-user", "email"=>"anuser@email.com", "movies_watched"=>{"type"=>"g-set", "a"=>["aU6uCCZJ10xGIPJonpSYftZ4H0w", "CAQ3ewkn0vVI5q8Kxt4LPcrFYk9"]}}>]>

Conclusion

Hopefully, this has shed a bit more light on how CRDTs can be used in real applications. The key here is that the need for conflict resolution must be anticipated during data model design, and the data, or parts of it, structured as a CRDT. Once that is done, merging conflicted records is pretty easy and availability is preserved.

Until next time.

CRDT Primer in Riak and Ruby

| Comments

Introduction

In this post, I shall attempt to show how to use CRDTs in a Ruby class, backed by the Riak database. CRDT stands for Conflict-free (originally Convergent or Commutative) Replicated Data Type. There is no Wikipedia entry for this yet, so I’m linking to a blog post which links to a paper.

Background

CRDTs solve a particular problem well. In a distributed database, like Riak, it is quite possible for a value under a key to receive multiple writes from different clients. Now, by default, Riak will discard everything but the latest write. However, it is possible to instruct Riak to keep all conflicting writes, so that we may resolve the conflicts at the application level. Resolving these conflicts can be really hard and this is where CRDTs step in.

It needs to be said that, as with almost everything in software development, whether or not one needs to use these data types depends on one’s application and data usage needs. It’s quite possible that keeping the last write is all that a developer will ever want.

While the concept itself may be fairly easy to grasp, its implementation has been a struggle for me. I hope to gain more of an understanding while explaining the topic.

Let’s get started

There are a few Ruby gems implementing CRDTs in existence. The best known one is meangirls by distributed-systems extraordinaire Kyle Kingsbury. We will be using this one, as it seems the most complete.

Caveat: the meangirls library isn’t gemified, so we will have to include its source. Also, it only takes care of creating the proper JSON representation of various CRDTs it supports. We have to take care of storing the representation, as well as how things are added or removed from a set.

Firstly, let’s create a directory for the test code and pull down the source code for meangirls.

$ mkdir crdt-test
$ cd crdt-test
$ mkdir vendor
$ git clone git@github.com:aphyr/meangirls.git vendor/
$ touch Gemfile tester.rb

Next, let’s add the Ruby Riak client gem and a JSON gem.

Gemfile
source 'https://rubygems.org'

gem 'yajl-ruby'
gem 'riak-client'

Let’s install those and we will be off and running:

$ bundle install

OK, with all that set up, let’s start playing around with the meangirls library. In this post, I will use the two-phase-set CRDT, but I will add the how-tos for other supported types in later blog posts.

The two-phase-set CRDT is interesting because it allows us to add and remove things from a set while never actually deleting data. It also prevents re-insertion of deleted elements, which can come in handy.

Let’s first set up a playground.

tester.rb
require_relative 'vendor/lib/meangirls.rb'
require 'yajl'
require 'riak'

client = Riak::Client.new(:nodes => [
                          {:host => "33.33.33.11", :http_port => 8111},
                          {:host => "33.33.33.12", :http_port => 8112},
                          {:host => "33.33.33.13", :http_port => 8113}
                          ])
bucket = client.bucket 'meangirls'
bucket.allow_mult = true
p bucket.props

We load in the library and the two supporting gems. We then hook up to a Riak cluster and create a bucket for our data.

Let’s open up irb and write some code.

irb
require './tester.rb'
twop_set = Meangirls::TwoPhaseSet.new

twop_set.add :alpha
 => #<Meangirls::TwoPhaseSet:0x007fd7549680b0 @a=#<Set: {:alpha}>, @r=#<Set: {}>>
twop_set.add :beta
 => #<Meangirls::TwoPhaseSet:0x007fd7549680b0 @a=#<Set: {:alpha, :beta}>, @r=#<Set: {}>>
twop_set.to_json
 => "{\"type\":\"2p-set\",\"a\":[\"alpha\",\"beta\"],\"r\":[]}"
twop_set.delete :alpha
 => :alpha
twop_set.to_json
 => "{\"type\":\"2p-set\",\"a\":[\"alpha\",\"beta\"],\"r\":[\"alpha\"]}"
twop_set.add :alpha
Meangirls::ReinsertNotAllowed: Meangirls::ReinsertNotAllowed
  from /Users/spejic/code/crdt-post/vendor/lib/meangirls/two_phase_set.rb:25:in '<<'
  from (irb):15
  from /Users/spejic/.rvm/rubies/ruby-2.0.0-p195/bin/irb:16:in '<main>'

As we can see, we can add elements to the set, remove elements (by adding them to the remove set), and get a JSON representation of the merged add and remove sets; we are also prevented from re-adding a deleted element. That last part makes sense, because we didn’t actually delete the element; it’s still there in the add set. How could we add the same element again if we never really removed it?
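To make those semantics concrete, here is a toy 2P-set in plain Ruby. The class and method names are ours; this is a sketch of the idea, not the meangirls implementation:

```ruby
require 'set'

class TinyTwoPhaseSet
  def initialize
    @a = Set.new  # everything ever added
    @r = Set.new  # everything ever removed
  end

  def add(e)
    raise "reinsert not allowed" if @r.include?(e)
    @a << e
  end

  def delete(e)
    @r << e if @a.include?(e)
  end

  # A member is anything added and never removed.
  def include?(e)
    @a.include?(e) && !@r.include?(e)
  end
end

s = TinyTwoPhaseSet.new
s.add(:alpha)
s.delete(:alpha)
s.include?(:alpha)  # => false, and :alpha can never come back
```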

Before explaining why we’re using this CRDT, let’s first come up with a way to store the data.

Let’s create an empty two phase set and store that in Riak.

irb
twop_set = Meangirls::TwoPhaseSet.new
data_object = bucket.new
data_object.raw_data = twop_set.to_json
data_object.store
=> #<Riak::RObject {meangirls,7ELYcWLmeoaiphhgCjyemyz54JB} [#<Riak::RContent [application/json]:{"type"=>"2p-set", "a"=>[], "r"=>[]}>]>

OK, the set is stored. One of the usual patterns used while developing Riak-backed applications is to store the keys in something like Redis or memcache. This allows the developers to access the data quickly, since key-based access is the fastest type of data access in Riak.

However, this also opens up the possibility that two distinct sets of data will be written to the same key. This is exactly what a CRDT aims to prevent.

Let’s simulate that possibility and see how meangirls handles it.

irb
fetch_object = bucket.get '7ELYcWLmeoaiphhgCjyemyz54JB'

# fetch another object
another_fetch_object = bucket.get '7ELYcWLmeoaiphhgCjyemyz54JB'

# make sets
set_from_first = Meangirls::TwoPhaseSet.new(fetch_object.data)
set_from_second = Meangirls::TwoPhaseSet.new(another_fetch_object.data)

# add data to each set
set_from_first.add(:alpha)
set_from_second.add(:beta)

# serialize each set back
fetch_object.raw_data = set_from_first.to_json
fetch_object.store

another_fetch_object.raw_data = set_from_second.to_json
another_fetch_object.store

# fetch the conflicts and merge them
fetch_object = bucket.get '7ELYcWLmeoaiphhgCjyemyz54JB'
fetch_object.conflict?
=> true

conflicted_parts = fetch_object.siblings.map do |sibling|
  Meangirls::TwoPhaseSet.new(sibling.data)
end
=> [#<Meangirls::TwoPhaseSet:0x0000000309d898 
            @a=#<Set: {"beta"}>, @r=#<Set: {}>>,
    #<Meangirls::TwoPhaseSet:0x0000000309d578 @a=#<Set: {"alpha"}>, @r=#<Set: {}>>]

resolved = conflicted_parts.first.merge(conflicted_parts.last)
=> #<Meangirls::TwoPhaseSet:0x00000003501c90
    @a=#<Set: {"beta", "alpha"}>, @r=#<Set: {}>>

# write the merged object back
resolved_data_object = Riak::RObject.new(bucket, fetch_object.key)
resolved_data_object.raw_data = resolved.to_json
resolved_data_object.content_type = "application/json"

fetch_object.siblings = [resolved_data_object]
fetch_object.store

We did a few things here, so let’s break it down. Firstly, fetching the same object out of Riak twice ensures that we have two references to the same vector clock value. Vector clocks are how a distributed system like Riak keeps track of object updates.

Having the same vector clock on two in-memory instances of our Riak object means that were both of them to write updates to that object, we would have a conflict. The default way Riak handles conflicts is the last-write-wins strategy, which is determined by the timestamp of the write coming into the system. All previous writes are discarded in that case, potentially losing information.
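Last-write-wins can be sketched in a couple of lines. This is a toy illustration of the strategy, not Riak’s actual implementation:

```ruby
# With allow_mult off, only the write carrying the newest timestamp
# survives; the other update is silently lost.
writes = [
  { data: { "movies_watched" => ["m1"] }, ts: 100 },
  { data: { "movies_watched" => ["m2"] }, ts: 105 },
]

survivor = writes.max_by { |w| w[:ts] }
survivor[:data]  # => {"movies_watched"=>["m2"]}, and "m1" is lost
```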

However, since we set our bucket to allow multiple values for an object, Riak will instead keep all conflicts and present us with them.

This is exactly the scenario we follow. We add an item to each set and write both sets back using our two in-memory references. Riak sees the conflict, but keeps both writes.

Upon the next read, we are presented with those conflicts. The Ruby client for Riak does a nice thing here and keeps things coherent by encapsulating those conflicts into the Riak::RObject instance.

Accessing any field on this conflicted instance before resolving the conflicts will result in an exception, forcing us to resolve the conflicts before proceeding.

conflicted
fetch_object = bucket.get '7ELYcWLmeoaiphhgCjyemyz54JB'
fetch_object.conflict?
=> true
fetch_object.data
#=> Riak::Conflict: The object is in conflict (has siblings) and cannot be treated singly or saved: #<Riak::RObject {meangirls,7ELYcWLmeoaiphhgCjyemyz54JB} [#<Riak::RContent [application/json]:{"type"=>"2p-set", "a"=>["beta"], "r"=>[]}>, #<Riak::RContent [application/json]:{"type"=>"2p-set", "a"=>["alpha"], "r"=>[]}>]>
  from /home/srdjan/.rvm/gems/ruby-2.0.0-p195@crdt-post/gems/riak-client-1.2.0/lib/riak/robject.rb:169:in `content'
  from (irb):91
  from /home/srdjan/.rvm/rubies/ruby-2.0.0-p195/bin/irb:13:in `<main>'

So, we have conflicts and need to resolve them. The conflicted objects are presented as a collection of siblings. We loop through that collection, create a TwoPhaseSet object out of each sibling and add them to a collection of their own. We only have two conflicts here, so I cheated a bit on how they merge together, but I hope the intent is clear.

resolve conflicts
conflicted_parts = fetch_object.siblings.map do |sibling|
  Meangirls::TwoPhaseSet.new(sibling.data)
end
=> [#<Meangirls::TwoPhaseSet:0x0000000309d898 
            @a=#<Set: {"beta"}>, @r=#<Set: {}>>,
    #<Meangirls::TwoPhaseSet:0x0000000309d578 @a=#<Set: {"alpha"}>, @r=#<Set: {}>>]

resolved = conflicted_parts.first.merge(conflicted_parts.last)
=> #<Meangirls::TwoPhaseSet:0x00000003501c90
    @a=#<Set: {"beta", "alpha"}>, @r=#<Set: {}>>

The last part of the conflict resolution is tricky and under-documented. I only have two tweets from Sean Cribbs to go on.

@batasrki @davidannweiser No, just return the same object, with siblings resolved. Best not to immediately write back though

and

@batasrki RObject.siblings = [ resolved_content_object ]

I, honestly, could not find any other information on this part, so what I’m presenting is probably not the best way. It does work, however.

create a new object
# write the merged object back
resolved_data_object = Riak::RObject.new(bucket, fetch_object.key)
resolved_data_object.raw_data = resolved.to_json
resolved_data_object.content_type = "application/json"

fetch_object.siblings = [resolved_data_object]
fetch_object.store

What we’re doing here is creating a new Riak::RObject instance, but setting its key to be the same as the conflicted object. This ensures that we update the data in Riak rather than create a new copy of it.

We then set that object’s raw_data property to the resolved set we created earlier and its content type to JSON. We don’t store this object, instead we overwrite the conflicted object’s siblings collection with a list of size 1 containing the new object.

We then store the conflicted object back into Riak. If we pull this data from Riak subsequently, we will see the updated, resolved set stored and no conflicts.

Conclusion

So, there we have it. This is the first in a few posts on how to use CRDTs to resolve conflicts in an eventually consistent system. This is a powerful technique (pattern?), especially when using a distributed database like Riak.

In the next post, I will attempt to use the 2-phase set to resolve conflicts in richer JSON documents like user records.

In posts after that one, I will explore the other provided data types in the Meangirls library.

Until next time, please watch Kyle Kingsbury’s excellent Ricon East video on what happens in distributed databases when CRDTs aren’t used, Call me maybe.

Riak 2i as a Means of Querying

| Comments

Introduction

Up tonight, I will explore how Riak’s secondary indexes feature differs from Riak Search in the context of a query. I will be using the same structure as in my post on Riak Search and Map/Reduce. Riak’s secondary indexes have a different setup than Riak Search. Whereas Search had to be enabled on each node in turn, secondary indexes (or 2i from now on) are stored as metadata on each key/value pair. There are a few advantages to this approach and, at this present moment, a few disadvantages.

Easier integration

The major disadvantage of Riak Search, in my mind, is that you have to know you will need it before you start putting data into Riak. There is no way, right now, to index existing data in order to take advantage of the wonderful capabilities Riak Search offers. At this point in time, the only two ways to add the index are to blow away your dataset, enable search and restore from backup, or to stream keys and their values through the system after turning Riak Search on. I, admittedly, don’t fully understand the latter option, so when I decided to use Riak Search in my app, I did the former. It may not be possible for everyone to do so. 2i, on the other hand, is stored per key. This means that when you’re ready to add 2i to your data, it’s as simple as reading in a key, adding the index through the application layer and re-saving the data, as illustrated below.

add index - event.rb
class Event
  def add_indexes(key)
    event = Event.find(key)
    event.indexes['month_year_bin'] << "#{event.month}-#{event.year}"
    # no hyphen here: "2011-8".to_i would stop at the dash and yield 2011
    event.indexes['year_month_int'] << "#{event.year}#{event.month}".to_i
    event.save
  end
end

I’ve added two indexes that look the same, both for illustration purposes and because I’m unsure exactly how I will use them yet. Besides, deleting an unwanted index is as easy as deleting a key from a hash.

delete index - event.rb
event = Event.find("somekey")
event.indexes.delete("month_year_bin")
event.save

It can be done while the system is up and running and serving users. This is a major boost in convenience.

Querying

There are several ways I can select a collection of keys from 2i.

equality

The simplest is the equality query.

2i m/r integration - event.rb
job = Riak::MapReduce.new(Event.client)
job.index("events", "year_month_int", "#{event.year}#{event.month}".to_i)
job.map("function(value, keydata, arg){ var data = Riak.mapValuesJson(value)[0]; if(data.location === \"\") {return [data];} else{ return [];} }", :keep => true)
job.run

#returns
[{"category"=>"business", "event_date"=>"2011-08-13T00:00:00+00:00", "location"=>""}, {"category"=>"business", "event_date"=>"2011-08-09T00:00:00+00:00", "location"=>""}]

All right. I like that. I think I’ll keep the int index and drop the bin one. It’s easier to construct and I think it’s more efficient.

range

Even better, the int index type enables me to do a range query easily. For example, let’s add another index that stores the day along with the month and year of an event, and then select all of the events in a month’s range.

2i m/r integration v2 - event.rb
event.indexes["year_month_day_int"] << "#{event.year}#{event.month}#{event.day}".to_i

#and then
job = Riak::MapReduce.new(Event.client)
job.index("events", "year_month_day_int", 201181..2011831)
job.map("function(value, keydata, arg){ var data = Riak.mapValuesJson(value)[0]; if(data.location === \"\") {return [data];} else{ return [];} }", :keep => true)
job.run

#returns
[{"category"=>"business", "event_date"=>"2011-08-13T00:00:00+00:00", "location"=>""}, {"category"=>"business", "event_date"=>"2011-08-09T00:00:00+00:00", "location"=>""}]

That’s very nice. This enables me to offer my users a weekly view of their data, as well as a monthly one. I don’t think I could manage that with Riak Search. There is also another type of index, the multi-valued index, whose usage is detailed on the GitHub wiki. As usual, the results returned are piped back into my domain object, which pulls out keys and values and turns them into something I can use in the rest of the system.
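One caution with integer indexes built by bare string concatenation: without zero-padding, September 1st becomes 201191, which falls inside an August range like 201181..2011831. Padding the month and day to a fixed width avoids this. A sketch (index_key is our own helper, not part of riak-client):

```ruby
require 'date'

# Build a fixed-width YYYYMMDD integer so that numeric order matches
# chronological order.
def index_key(date)
  format('%04d%02d%02d', date.year, date.month, date.day).to_i
end

august = index_key(Date.new(2011, 8, 1))..index_key(Date.new(2011, 8, 31))
august.cover?(index_key(Date.new(2011, 8, 13)))  # => true
august.cover?(index_key(Date.new(2011, 9, 1)))   # => false
```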

Disadvantages

The main disadvantage, for me, is the inability to combine multiple secondary indexes to pull out data. Ideally, I would love to query only by index, without needing a portion of my code in Javascript or Erlang to do the additional filtering. Also, as of right now, there is no consistent ordering of the keys selected through the index. That will be corrected later, but it is something I need to be careful of right now.

Conclusion

I think I’ll be able to live with the disadvantages as they are. I am not a real power user as far as I’ve seen and the current, minimal abilities of both Riak Search and Riak 2i suit me well. Well, I’m off to build the front-end for the new goodies I discovered writing this post. Until the next time.