Getting Started With Riak & Python

What is Riak?

Riak is one of a handful of non-relational datastores that has gained some exposure lately, and with good reason. It's written in Erlang and Dynamo-inspired, and while it's technically a key-value store, in our experience it functions very well as a document store (meaning you can store complex data as the value). Reading through the overview is highly recommended.

It's also extremely predictable in production, handles node failures well and (here's the really interesting bit) demonstrates linear performance as you add nodes (see these benchmarks). This is impressive because many distributed systems, especially in the NoSQL space, don't behave like this.

Like many other NoSQL options, it uses MapReduce for many types of queries (which you can write in either JavaScript for ad hoc queries or Erlang for repeated, speed-sensitive queries). Also of interest is that it provides an additional way to get at values via Riak Search, a Solr-like search component. This can be more efficient in many cases because it uses inverted indexes rather than traversing every document as part of a MapReduce job.
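To make the MapReduce side concrete, here's a sketch of an ad hoc query that collects every entry title. The JavaScript source and the client chain (`client.add(...).map(...).run()` from the Basho Python client) are shown commented out so this runs without a cluster; the `map_titles` helper below just mirrors in Python what the map phase does to each stored value, and the names are illustrative:

```python
import json

# The JavaScript source for an ad hoc map phase. Riak runs this on each
# node, once per object in the bucket.
js_map = """
function(v) {
    var doc = JSON.parse(v.values[0].data);
    return [doc.title];
}
"""

# Against a live node, the query would look roughly like:
#
#     client = riak.RiakClient()
#     titles = client.add('entry').map(js_map).run()

# What that map phase does to each stored value, mirrored in Python:
def map_titles(raw_value):
    doc = json.loads(raw_value)
    return [doc['title']]

print(map_titles('{"title": "First Post!", "slug": "first-post"}'))
# → ['First Post!']
```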

I've spent a lot of time evaluating most of the NoSQL field recently, and I keep coming back to Riak (well, and of course Redis) as the weapon of choice.

Why should I use it?

If you're evaluating datastores, the properties of Riak are very compelling.

  • Fast - A single node can handle ~900 reqs/sec, trailing Cassandra by a mere 150 reqs/sec in our testing.
  • Easy to integrate - If you can work with JSON (who can't?), both Riak's storage AND its MapReduce can natively handle it. Instant document store.
  • Real scalability - Nodes really can drop in and out without destroying the cluster. There's some sync time between the other nodes if a node drops out or a new node enters the cluster, but thanks to the ring & key replication, nothing will go down.
  • Low risk of data loss - Unlike certain other NoSQL options, Riak uses a WAL (write-ahead log) & key replication to ensure your data is safe & available. This is more important than you might think.
  • Links - It took me a long time to grasp them, but links are a very powerful tool in Riak. They let you natively define relations between data in your datastore, something you have to build yourself in most other stores.
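The "easy to integrate" point is worth a concrete look: the value you hand to the client is a plain Python structure, and what lands in Riak is just its JSON serialization. A minimal illustration using only the standard library (no Riak required):

```python
import json

entry = {
    'title': 'First Post!',
    'slug': 'first-post',
    'tags': ['riak', 'python'],
}

# What the client stores on your behalf...
stored = json.dumps(entry)

# ...and what you get back out, ready to use as a dict again.
roundtripped = json.loads(stored)

print(roundtripped['title'])
# → First Post!
```

Because the MapReduce side speaks JSON too, the same structure is what your JavaScript map functions see as each object's value.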

You can find our Riak benchmark at https://gist.github.com/653649. The PB numbers are from using Protocol Buffers, which are much faster than the HTTP interface (which is still quite decent & easy to get started with).

So why are you writing this?

An excellent question. Being Pythonistas (among other things), our first question was how to use this interesting tool from our language of choice. The Basho-official client library, while good, has really poor documentation, so figuring things out took some trial & error. Hopefully this post clears up some of those gotchas and piques your interest. Let's dive in.

Installing Riak

If you're on a Mac, Homebrew is an easy option. brew install riak (plus a coffee break while Erlang builds) is all you need to get started. Once it's installed, a simple riak start will fire up a single node by default. It's entirely doable to run a whole cluster off a single laptop (by configuring it to start several nodes bound together), and the interface is the same whether you're talking to one node or the cluster (another nice feature of Riak).

The Basho folks also offer binary packages available on their website for all sorts of platforms, so if that's your cup of tea, go for it.

Finally, you should grab the Python binding. A pip install riak should be all that's needed to get you up and running. Time to play!

Poking the Bear

Rather than some piddly "Hello World" test, let's be Cool Web Kids(tm) and build some of the core of a Riak-powered blog. We won't hook up a web interface (command-line blogging?), but doing so would be trivial in a WSGI or Django app. Your code might look like so:

import riak
import uuid
import time


# For regular HTTP...
# client = riak.RiakClient()

# For Protocol Buffers (go faster!)
client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)

entry_bucket = client.bucket('entry')
comment_bucket = client.bucket('comment')


def create_entry(entry_dict):
    # ``entry_dict`` should look something like:
    # {
    #     'title': 'First Post!',
    #     'author': 'Daniel',
    #     'slug': 'first-post',
    #     'posted': time.time(),
    #     'tease': 'A test post to my new Riak-powered blog.',
    #     'content': 'Hmph. The tease kinda said it all...',
    # }
    entry = entry_bucket.new(entry_dict['slug'], data=entry_dict)
    entry.store()


def create_comment(entry_slug, comment_dict):
    # ``comment_dict`` should look something like:
    # {
    #     'author': 'Daniel',
    #     'url': 'http://pragmaticbadger.com/',
    #     'posted': time.time(),
    #     'content': 'IS IT WEBSCALE? I HEARD /DEV/NULL IS WEBSCALE.',
    # }
    # Error handling omitted for brevity...
    entry = entry_bucket.get(entry_slug)

    # Give it a UUID for the key.
    comment = comment_bucket.new(str(uuid.uuid1()), data=comment_dict)
    comment.store()

    # Add the link.
    entry.add_link(comment)
    entry.store()


def get_entry_and_comments(entry_slug):
    entry = entry_bucket.get(entry_slug)
    comments = []

    # They come out in the order you added them, so there's no
    # sorting to be done.
    for comment_link in entry.get_links():
        # Gets the related object, then the data out of its value.
        comments.append(comment_link.get().get_data())

    return {
        'entry': entry.get_data(),
        'comments': comments,
    }


# To test:
if __name__ == '__main__':
    create_entry({
        'title': 'First Post!',
        'author': 'Daniel',
        'slug': 'first-post',
        'posted': time.time(),
        'tease': 'A test post to my new Riak-powered blog.',
        'content': 'Hmph. The tease kinda said it all...',
    })

    create_comment('first-post', {
        'author': 'Matt',
        'url': 'http://pragmaticbadger.com/',
        'posted': time.time(),
        'content': 'IS IT WEBSCALE? I HEARD /DEV/NULL IS WEBSCALE.',
    })
    create_comment('first-post', {
        'author': 'Daniel',
        'url': 'http://pragmaticbadger.com/',
        'posted': time.time(),
        'content': 'You better believe it!',
    })

    data = get_entry_and_comments('first-post')
    print "Entry:"
    print data['entry']['title']
    print data['entry']['tease']
    print
    print "Comments:"

    for comment in data['comments']:
        print "%s - %s" % (comment['author'], comment['content'])

Gotchas

If you look at the code, there are a couple of interesting things going on. First, we're using buckets (which are essentially keyspaces) the way we'd use database tables. We could've lumped everything together in one bucket (Riak is schema-free with JSON payloads), but separate buckets are generally regarded as the better practice.

Two, we're using Protocol Buffers rather than the HTTP interface. It doesn't matter much for an example this small, but if you want to see what Riak can really do, those couple of overrides in the client initialization make a big difference.

Three, don't bother instantiating your own RiakObject. Using it directly is a fairly manual, error-prone process; you're better off calling bucket.new or bucket.get, which take care of things under the hood for you.

Four, let the binding handle the JSON bits for you if you can. Using RiakObject directly, you can override this behavior entirely, but as the example shows, we never had to worry about it and could just work with our nice Python data structures.

Five, links have the happy property that they come back in the order they were added to the parent object, so in our case there's no reordering to worry about. Riak can do more than this (the map phase can filter based on data in the value & omit portions), but that behavior is good enough for our use case.
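If you'd rather not rely on insertion order, sorting the fetched comments by their posted timestamp is cheap anyway. A quick sketch using plain Python on comment dicts shaped like the ones in the example (timestamps here are made up):

```python
# Comment dicts as returned by get_data(), in arbitrary order.
comments = [
    {'author': 'Matt', 'posted': 1289945000.0, 'content': 'Webscale!'},
    {'author': 'Daniel', 'posted': 1289944000.0, 'content': 'First!'},
]

# Oldest first, by the 'posted' timestamp we stored with each comment.
ordered = sorted(comments, key=lambda c: c['posted'])

print([c['author'] for c in ordered])
# → ['Daniel', 'Matt']
```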

Also, note that what you get back from entry.get_links() aren't the objects themselves, but RiakLink proxies. This can be useful: the link metadata comes back as part of the earlier entry_bucket.get, but the linked objects themselves aren't fetched until you ask for them (via comment_link.get(), as above), so the proxies stay lightweight and let you fetch only what you need.

It's a webservice, silly!

Another neat property of Riak is that everything is exposed as a webservice, which means you can inspect aspects of the data straight from your browser (as well as easily integrate with JavaScript, use standard HTTP libraries, etc.). For instance, now that your Riak node is up, you can hit http://localhost:8098/riak/entry/ or http://localhost:8098/riak/entry/first-post/ and see the data that's in Riak.
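The URL scheme is predictable enough to build by hand: /riak/&lt;bucket&gt; addresses a bucket, and /riak/&lt;bucket&gt;/&lt;key&gt; addresses a single object (8098 is the default HTTP port). The little helper below is just a sketch of that scheme; it skips the trailing slash, and the name is our own invention:

```python
RIAK_BASE = 'http://localhost:8098'

def riak_url(bucket, key=None):
    # /riak/<bucket> for the bucket, /riak/<bucket>/<key> for one object.
    url = '%s/riak/%s' % (RIAK_BASE, bucket)
    if key is not None:
        url = '%s/%s' % (url, key)
    return url

print(riak_url('entry'))                # → http://localhost:8098/riak/entry
print(riak_url('entry', 'first-post'))  # → http://localhost:8098/riak/entry/first-post
```

Point any HTTP client (or your browser) at those URLs and you get the JSON payloads back directly.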

You can have more fun with this, including adding/changing/deleting data, querying, link-walking & checking stats through the same interface. The REST API page of the Riak wiki has more details.

Conclusion

We've personally been pretty tentative about the NoSQL bandwagon in the past (lots of hype & broken promises), but Riak gives us a lot to love with very little in the way of hurt. We're not about to drop PostgreSQL as our primary data storage layer, but Riak seems to make a great complementary datastore and I'm looking forward to using it more in the future.

Disclaimer - No, we did not receive any compensation from Basho or anyone else to write this article. I'm genuinely excited by what Riak has to offer and want to help spread knowledge & avoid gotchas so that others don't have the initial struggles I did.

Posted on November 16, 2010 @ 10:55 p.m. by Daniel