Rick Grehan
Contributing Editor

Review: RethinkDB rethinks real-time Web apps

Sep 23, 2015

RethinkDB's NoSQL document store delivers high-speed change notifications for super-responsive apps


Like MongoDB or Couchbase, RethinkDB is a clustered, document-oriented database that delivers the benefits of flexible schema, easy development, and high scalability. Unlike those other document databases, RethinkDB supports “real time” applications with the ability to continuously push updated query results to applications that subscribe to changes.

In this case, a “real time” application is one that must support a large flow of client requests that alter the state of the database and keep all clients apprised of those changes. A common example of a real-time application is the multiplayer game: Hundreds or thousands of users are pushing buttons, those button pushes are changing the game state, and all of the users must see all of the changes in real time.

RethinkDB expends a great deal of effort ensuring that data change events are quickly dispatched throughout the cluster. And it provides this high-speed event processing mechanism while offering plenty of control over database consistency and durability.

Nevertheless, even the RethinkDB engineers admit that, if your primary consideration for a database is ACID compliance, RethinkDB probably shouldn’t be your first choice. The principal reason: RethinkDB does not support transactions across multiple documents. However, within a single document, RethinkDB is fully ACID compliant.

In spite of this, RethinkDB’s “real-time push” technology (explained below), which keeps clients apprised of database changes, makes it an ideal underpinning for applications that must provide clients with the most up-to-date view of database state. Further, RethinkDB’s easy-to-grasp query language — embedded in a host of popular programming languages — and its out-of-the-box management and monitoring GUI make for a smooth on-ramp to learning how to put RethinkDB to work in such applications.

NoSQL, so no schemas

RethinkDB is a JSON document database. A JSON document represents a structured object consisting of key/value pairs. A value can be a primitive data type (string, number, Boolean), an array, or a nested JSON object. This means, of course, that JSON can describe arbitrarily complex objects.

RethinkDB stores documents in tables. While this might lead one to think that RethinkDB has relational database ingredients, the fact is a table is simply a logical container; the RethinkDB engineers chose to call that container “table” so that developers coming from a relational background would feel comfortable.

A table in RethinkDB places no real restrictions on the structure of its contained documents. In the relational world, all rows within a table necessarily have the same structure (that is, they have the same fields). But RethinkDB is schema-less: No two documents, not even documents within the same table, need to have the same structure. Of course, it’s generally beneficial for all documents in a table to have the same structure, as it simplifies organization and management. But the flexibility is there, if needed. The RethinkDB documentation provides a nice overview of data modeling options in its “yes it’s a table but not really” world.

Real time with changefeeds

In a typical database system, clients discover alterations to database contents by querying the database. To learn if customer X has updated her shopping cart, you fetch customer X’s shopping cart and look inside. Of course, you can improve the throughput if you associate a timestamp with the shopping cart. Rather than fetch the entire shopping cart object, you can fetch the timestamp. If it’s changed since the last time it was fetched, then you get the shopping cart. In both cases, though, an application becomes aware of database changes by polling it.
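The timestamp-based polling described above can be sketched in a few lines of plain Python. This is a generic illustration, not RethinkDB’s API; the `carts` store and its field names are invented for the example:

```python
# Hypothetical in-memory store standing in for the database.
carts = {"customer_x": {"items": ["hat"], "updated_at": 100.0}}

def fetch_cart_if_changed(customer, last_seen):
    """Poll cheaply: check the timestamp first, fetch the full cart only on change."""
    ts = carts[customer]["updated_at"]
    if ts == last_seen:
        return None, last_seen      # nothing new; skip the expensive fetch
    return carts[customer], ts      # changed; fetch the whole cart

cart, seen = fetch_cart_if_changed("customer_x", 0.0)    # first poll: fetches
cart2, seen2 = fetch_cart_if_changed("customer_x", seen)  # second poll: no change
```

Even with the timestamp optimization, the application still has to keep asking; that repeated asking is what RethinkDB’s push model eliminates.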

RethinkDB provides a “real-time push” capability. RethinkDB client applications register themselves as listeners to specific database events. If something changes in the database, the database notifies the client, which doesn’t have to repeatedly poll the database.

This capability, called a changefeed, can be applied to a table, a document, or a query. Your application registers a changefeed via the changes() command. (RethinkDB’s query language is embedded, so commands are native language method calls, as described in more detail later.) The result of the changes() command is an “infinite cursor” — that is, a cursor object that provides a more or less unending set of change documents. Typically, this is handled inside what amounts to a callback function, in which code iteratively fetches a change document from the cursor. The code looks like this:

r.table('tablename').changes().run(conn, function(err, cursor) {
    cursor.each( /* ... code to handle changefeed ... */ );
});

The change document is a JSON document containing the previous and current value of the item that’s been changed. The cursor blocks when no new changes are pending.

You can configure changefeeds to throttle the delivery of information. For example, you might configure a changefeed to accumulate changes over a short interval before sending a response to the listening application; RethinkDB will merge the accumulated changes into a single response, which reduces network traffic. In addition, if many alterations occur between subsequent fetches from the changefeed cursor, RethinkDB will collapse the result so that intermediate changes are discarded, and only the previous and current values are reported.
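That collapsing behavior can be sketched in plain Python. This is a conceptual stand-in, not RethinkDB code; the shape of the change documents (`old_val`/`new_val`) mirrors what changefeeds deliver:

```python
def squash(changes):
    """Collapse a burst of change documents per document id into one,
    keeping the oldest old_val and the newest new_val, so intermediate
    states are discarded -- analogous to RethinkDB's merging."""
    merged, order = {}, []
    for ch in changes:
        doc_id = (ch["new_val"] or ch["old_val"])["id"]
        if doc_id not in merged:
            merged[doc_id] = dict(ch)
            order.append(doc_id)
        else:
            merged[doc_id]["new_val"] = ch["new_val"]  # drop intermediate values
    return [merged[i] for i in order]

burst = [
    {"old_val": {"id": 1, "qty": 1}, "new_val": {"id": 1, "qty": 2}},
    {"old_val": {"id": 1, "qty": 2}, "new_val": {"id": 1, "qty": 3}},
]
collapsed = squash(burst)
# one change document remains: old qty 1, new qty 3
```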

Change documents also include informative state information, such as whether the item is an initial document (in which case the change is really an addition). Changes are buffered at the server, and if that buffer hits its limit, the server will discard early change documents and insert a special error document into the stream of changes. This error document will indicate how many documents were skipped on account of the buffer overrun.
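The overflow behavior is easy to picture as a bounded buffer. The sketch below is an invented illustration of the idea, not the server’s actual implementation; the class name and error wording are hypothetical:

```python
from collections import deque

class ChangeBuffer:
    """Bounded change buffer: on overflow, drop the oldest changes and
    report an error document noting how many were skipped."""
    def __init__(self, limit):
        self.limit = limit
        self.buf = deque()
        self.skipped = 0

    def push(self, change):
        if len(self.buf) == self.limit:
            self.buf.popleft()          # discard the earliest change
            self.skipped += 1
        self.buf.append(change)

    def drain(self):
        out = list(self.buf)
        if self.skipped:
            out.insert(0, {"error": "skipped %d changes due to buffer overrun" % self.skipped})
        self.buf.clear()
        self.skipped = 0
        return out

buf = ChangeBuffer(limit=3)
for i in range(5):
    buf.push({"new_val": {"id": i}})
out = buf.drain()   # first element is the error document; 2 changes were dropped
```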

Sharding in RethinkDB

As a distributed database, RethinkDB spreads data around the cluster by sharding. RethinkDB uses range sharding, assigning each document to a shard based on that document’s primary key. Documents whose keys are relatively contiguous — in the same range — are placed in the same shard. (This is in contrast to hash-based sharding, which assigns documents to shards more or less randomly, as the sharding is governed by a hashed value.)

While range sharding improves query responsiveness, it runs the risk of unbalancing the cluster, should primary keys (like last names in a customer database) be unevenly distributed. Fortunately, RethinkDB intelligently “slices” ranges, cutting overfilled ranges into more numerous segments to rebalance an unbalanced cluster. You can also manually rebalance a cluster through the management console, should the need arise.
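The contrast between range and hash sharding can be shown with stdlib Python. This is a conceptual sketch, not RethinkDB internals; the split points are invented:

```python
import bisect
import hashlib

# Range sharding: split points partition the primary-key space. Keys in the
# same range land on the same shard, so contiguous keys stay together.
split_points = ["h", "p"]    # hypothetical boundaries -> 3 shards

def range_shard(key):
    return bisect.bisect_right(split_points, key)   # shard index 0..2

# Hash sharding, for contrast: a hash scatters contiguous keys across shards.
def hash_shard(key, nshards=3):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % nshards

assert range_shard("adams") == 0
assert range_shard("baker") == 0    # contiguous keys, same shard
assert range_shard("smith") == 2
```

Rebalancing in this picture amounts to choosing new split points so that each range holds roughly the same number of documents.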

Naturally, RethinkDB also supports replication. Each shard is assigned to a primary replica node, but also copied to one or more secondary replica nodes. Reads and writes for a given document are routed to the primary replica node for the shard within which the document resides. As long as the primary replica node is available, reads and writes are consistent. Otherwise, writes are deferred, and reads are served by one of the secondary replica nodes.

Ordinarily, application read operations are up to date. That is, if a read operation follows a write operation (on a particular document), the read is guaranteed to see the effects of the write because both will be served by the shard’s primary node. However, if you want to speed up your application’s read throughput, you can configure reads to be out of date, in which case reads will be directed to the nearest replica node. Recognize, of course, that you’re exchanging speed for concurrency — it’s possible that a read operation will return “old” data, which has been modified by a write on the primary replica node, but not yet migrated to the secondary replica.
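The trade-off can be sketched with two dictionaries standing in for the primary and secondary replicas. This is an illustration of the consistency model only, not driver code (though `"single"` and `"outdated"` are the names RethinkDB uses for its read modes):

```python
# Two replicas of the same shard; replication lags behind the primary.
primary = {"doc1": {"qty": 1}}
secondary = {"doc1": {"qty": 1}}

def write(doc_id, doc):
    primary[doc_id] = doc           # writes always go to the primary

def replicate():
    secondary.update(primary)       # happens asynchronously in practice

def read(doc_id, mode="single"):
    # "single" (the default) reads the primary; "outdated" may read a
    # secondary that hasn't caught up, trading freshness for speed.
    store = primary if mode == "single" else secondary
    return store[doc_id]

write("doc1", {"qty": 2})
assert read("doc1")["qty"] == 2                   # read-your-writes
assert read("doc1", mode="outdated")["qty"] == 1  # stale until replication
```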

Finally, RethinkDB supports failover, which requires that the cluster have at least three nodes and that tables be configured with three or more replicas, so that a majority of the surviving replicas can elect a new primary. If a node becomes unavailable and happens to host the primary replica for a table, then one of the secondary nodes is selected by RethinkDB to become the new primary. No data is lost. Should the lost node come back online, it will resume its position as primary. Note that, even if a majority of replicas for a given shard are lost, data can still be retrieved, though it requires a special recovery operation.

ReQL, the RethinkDB query language

RethinkDB’s query language is called ReQL; it’s easy enough to guess what that stands for. Applications employ ReQL via a client driver library. Official drivers are available for Ruby, Python, and JavaScript (which also serves Node.js). Community drivers are available for many other languages, including PHP, C++, and C#. (See the RethinkDB website for a list of available drivers.)

The driver supplies the methods that a developer uses to build ReQL queries. ReQL is an “embedded query language,” meaning you create ReQL queries by chaining together function/method calls in whatever native language you’re using. You’re not writing queries as strings that are handed off to a query language process for parsing, as done in SQL.

An example ReQL query (written in Ruby) might look like this:

r.db("birds").table("sightings").insert({:name => "Warbler", :location => "North Street", :quantity => 2}).run(conn)

In the above, r is the RethinkDB namespace, and conn represents the connection to the server. Obviously, this is an insert operation on the sightings table within the birds database. A small, three-element JSON document is being added to the table.

Typically, data in a ReQL query flows from left to right. You begin with the RethinkDB namespace, make a call to reference a table, then chain commands — for example, filter() to winnow the documents within that table, or pluck() to select specific fields from them — and so on. This sort of chaining is frequently seen in JavaScript, so JavaScript programmers should find the construction of ReQL queries easy to grasp.

Because ReQL queries are written in the native language of the client driver, they are immune to injection attacks. You are also free to use constructs of the driver language — Python lambdas, for instance — to make queries more expressive. The driver translates the query into a kind of pseudo-language and sends the translation to the server for execution. Thus, you can treat ReQL queries much like stored procedures. You can construct a query, put it in a variable, and pass it to the server to be executed later.
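The “build a query, store it, run it later” pattern can be illustrated with a toy fluent builder in plain Python. This is an invented stand-in for a ReQL driver, not the real API; the `Query` class and its methods are hypothetical:

```python
class Query:
    """Toy embedded query: chained calls build a term list; nothing
    executes until run() is called, so a query can be stored and reused."""
    def __init__(self, steps=()):
        self.steps = list(steps)

    def table(self, name):
        return Query(self.steps + [("table", name)])

    def filter(self, pred):
        return Query(self.steps + [("filter", pred)])

    def run(self, db):
        rows = None
        for op, arg in self.steps:
            if op == "table":
                rows = db[arg]
            elif op == "filter":
                rows = [r for r in rows if all(r.get(k) == v for k, v in arg.items())]
        return rows

db = {"sightings": [{"name": "Warbler", "quantity": 2},
                    {"name": "Heron", "quantity": 1}]}
q = Query().table("sightings").filter({"name": "Warbler"})  # built, not yet run
result = q.run(db)                                          # executed on demand
```

Because the filter predicate travels as structured data rather than as a spliced-together string, there is nothing for an attacker to inject — the same property that makes embedded query languages like ReQL immune to injection.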

Unfortunately, RethinkDB does not yet have a query optimizer, though the system is designed to support one. An optimizer is planned for a future release. However, RethinkDB will parallelize queries when possible. A RethinkDB cluster is symmetric in that a query can be sent to any cluster member, which will forward the query to the proper destination for processing.

You can actually perform the equivalent of a relational JOIN operation. In a relational system, a JOIN connects two tables on specified columns. In RethinkDB, a JOIN connects two tables on specified fields, one field drawn from the documents of each table.

In addition, ReQL supports a variant of map-reduce called group-map-reduce (GMR). The map operation will fetch a sequence from the database, a sequence being a set of documents or document fields, and transform it into a different sequence. The reduce operation aggregates the results prior to delivery as the query’s response. The additional “group” step can be used to amass the sequences into partitions for which separate results are produced. (For example, you might use the group operation to gather the results of a map-reduce process by gender.)
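The group-map-reduce pipeline described above can be sketched in plain Python. This is a conceptual illustration, not ReQL; the function name, document fields, and sample data are invented:

```python
from collections import defaultdict

def group_map_reduce(docs, group_key, mapper, reducer, initial):
    """Partition documents by group_key, map each document to a value,
    and reduce each partition to a single aggregate."""
    groups = defaultdict(lambda: initial)
    for doc in docs:
        groups[doc[group_key]] = reducer(groups[doc[group_key]], mapper(doc))
    return dict(groups)

people = [
    {"name": "Ann",  "gender": "F", "purchases": 3},
    {"name": "Bob",  "gender": "M", "purchases": 1},
    {"name": "Cara", "gender": "F", "purchases": 2},
]

# Total purchases per gender: group by "gender", map to the purchase
# count, reduce by summing within each group.
totals = group_map_reduce(people, "gender",
                          mapper=lambda d: d["purchases"],
                          reducer=lambda a, b: a + b,
                          initial=0)
# totals == {"F": 5, "M": 1}
```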

RethinkDB’s GMR system is a low-level API into the database. Higher-level ReQL commands (JOIN, for example) are compiled into GMR queries and executed on the server by the GMR infrastructure. As with map-reduce operations in Hadoop, resources are allocated automatically, and the system manages the priorities of concurrent queries. More extensive control of resource allocation for GMR queries is planned in a future release of RethinkDB.

Managing RethinkDB

Right out of the box, RethinkDB supplies a browser-based management GUI. The GUI’s dashboard includes multiple views of various aspects of the cluster. For example, you can display cluster I/O performance (reads/writes per second), along with summaries such as the number of servers, number of tables, indexes, disk usage, and so on.

Select a table, and you can view statistics for that table (such as number of documents), as well as see how the data in the table is distributed across individual nodes. Similarly, you can drill down into information on individual server nodes in the cluster to glean the node’s uptime, its cache size, how many shards it is responsible for, whether it is the primary or secondary replica node for a given table, and so on.

Using the GUI’s data explorer, you can enter and execute ReQL queries in JavaScript. Enable the query profiler, and you get a JSON document describing the steps the system went through to satisfy the query. This document provides information such as how much time the system spent reading each shard, how much time was spent evaluating the primary index, and so on. You can also browse any RethinkDB database and display data in one of several views. The tree view shows the JSON data in “pretty printed” form, while the raw view spews out the JSON as it is. The table view outputs the data as it might appear if RethinkDB were an RDBMS: Each document is a row, and the document fields become columns in the tabular output.

Responsive JSON database

RethinkDB sports plenty of knobs that you can turn to tune its behavior to meet your application’s needs. While it’s true that RethinkDB is not truly ACID compliant, its durability and consistency are adjustable across a wide range, usually on a per-table basis. For example, if you want speedy writes at the expense of consistency, you can configure write operations to a table to be acknowledged when at least one replica — rather than a majority — confirms the write. Similarly, you can select “soft” or “hard” durability. Soft durability acknowledges a write as soon as data is cached in memory, while hard acknowledges the write only after it has been committed to disk.

To improve the responsiveness in large clusters, you can specify that some nodes of a cluster be proxy nodes. A proxy node stores no data, but acts as a request/response router; it knows the optimal route to send a request. In addition, proxy nodes can de-duplicate changefeed messages, alleviating network traffic congestion.

ReQL, though not SQL-like, is nonetheless easy to comprehend. The fact that it is written in any of a number of familiar languages (Python, Ruby, JavaScript) means that most programmers will have no difficulty mastering its component parts.

Embedded query languages occupy a soft spot in my heart for several reasons. They can laugh at injection attacks. They don’t demand mental gear-shifts as you move from one language paradigm to another (as when you have to use SQL statements as strings in an object-oriented language). You don’t have to fiddle with constructs to move data between the query language’s environment and the host language’s environment. And you’re free to use the host language’s expressiveness in your queries.

That’s a lot of positives, and ReQL has them all. Perhaps the only negative is that there is no official ReQL driver for Java — yet. This is somewhat understandable; the preferred languages for RethinkDB developers tend to be dynamic languages. However, a community-supported driver for Java is available, and the RethinkDB engineers tell me that an official Java driver is in the works. That single blot will soon be erased.

Along with the powerful query language, RethinkDB uniquely combines all of the benefits of a distributed, document-oriented database with the ability to continuously push updated query results to applications. It’s a quick and easy download, and the website’s documentation provides plenty of tutorial information to get started. There’s no reason not to.