In the previous post, I briefly introduced CRDTs in Ruby and have shown a basic usage case for one Conflict-Free Replicated Data Type. Tonight, we will explore one slightly more complicated use case that can be seen in real applications in the wild.
For the purpose of this post, imagine we’re running a Netflix clone. We have accounts and we have movies. For the purposes of recommendations, we want to store what movies each account holder has viewed. As is the case with Netflix, an account may be signed in and used across more than one device by holders of differing tastes. Since we don’t really want to impose the restriction of just one account being signed in, we will record statistics from each session.
As you can see, this may lead to write conflicts, since any account holder may want to watch any movie at any time. For the sake of efficiency, we will only store an ID to each movie in the account’s movies watched collection.
The data type used
Out of the currently known CRDTs, the one that seems to fit this problem the best is the GSet, or Grow-only Set, data type. As its name suggests, this CRDT only ever grows in size, it does not shrink. Since recommendation engines thrive on tonnes of data, having an ever-increasing set of records is very beneficial.
The basic layout of a GSet is as follows:
The type information there is for the meangirls library to invoke the correct operational semantics. Other than that, it’s just a simple list of data, always growing in size. Only, the meangirls library does not include a class for a GSet. It has a class for TwoPhaseSet, which is a combination of two GSets, but it doesn’t actually implement a GSet by itself.
That’s not a big deal, we will implement one ourselves.
Since it has fairly simple requirement, let’s take a look at the GSet class. I’ll explain parts of it after the code listing.
If you peek at meangirls implementation of a TwoPhaseSet, you will notice that we copied most of its code into here, save for the remove set logic. Let’s go through the important parts.
Our constructor may or may not take a hash. This is done, so that we can re-construct a GSet from JSON data, such as a Riak record.
We only ever add elements to our GSet. Additional guarantees, like that an element can only be an Integer, can be added here.
When merging, we first must ensure that we’re merging a GSet to a GSet. We would get into weird situations and breakage, otherwise. After that, a simple set union over our 2 sets of IDs will ensure we only ever have unique ID values here.
With that done, let’s add models for the Account records and Movie records. Persistence is done somewhere else. This time, we will store records through irb.
Let’s store a few arbitrary movies, keep their keys in memory and cause a write conflict. We will then see how we can resolve this conflict through the usage of our CRDT.
OK, now we have two movies, so let’s simulate a simultaneous account access by 2 holders, each of which is watching a movie. As shown in the previous post, storing values at the same time will cause a write conflict, resulting in creation of siblings. As usual, the accounts bucket should have allow_mult set to true, so that Riak doesn’t discard one of the writes.
Let’s first create an account.
OK, with the record stored, we retrieve it twice without updating. This ensures that the vclock value for the record is not updated, allowing us the opportunity to cause a write conflict.
Let’s now add the ID of the first created movie to one instance of this Account record and the ID of the second movie to the other one.
Let’s now try storing these objects back into Riak and seeing what happens.
And, there we go. We have successfully (again) caused a conflict. Let’s write some code to resolve it. The trick here to resolving conflicts like this is that you must not access the data nor the raw_data methods. These will throw an exception and mess up your day.
Let’s instead, introduce a class method in the Account model that will handle conflict resolution for us.
In this method, we check first if the raw record coming from Riak has the conflict? flag set, which indicates that there’s been a write conflict somewhere. If it is, we loop through all the siblings and pull out the movies_watched lists. The rest of this record should not be subject to conflict resolution, so we will accept whatever values there are for them.
We then initialize the first item in the list of GSets and then merge the others in that list with the first one. We append this to the hash of resolved values and then instantiate our Account model with the new values.
The second part to this conflict resolution is storing the resolved object back into Riak. Again, as in the previous post, we do that by creating a new Riak::RObject instance, setting its key to the conflicted one and setting the siblings property of the conflicted instance to an array of Riak::RObjects of size 1. The only member is the new instance. We store that and we’re done.
Hopefully, this has shed a bit more light on how CRDTs can be used in real applications. The big key here is that during data model design, a need for conflict resolution must be anticipated and then the data, or parts of it, must be structured as a CRDT. Once that is done, merging conflicted records is pretty easy and availability is preserved.