Over the past few years, NoSQL databases have received a great deal of attention, with countless attestations of their virtues in enabling consumer-facing, Web-based businesses to manage fast-growing user demand and make use of the huge quantities of data that their users create.
It’s clear that NoSQL adoption has paid dividends to the Twitters and Netflixes of the world. But it’s been less apparent just how much attention mainstream organizations ought to pay to the trend, since relational databases are familiar and well-entrenched, and since many well-established solutions exist for scaling relational databases.
Despite the heavy focus on the virtues of NoSQL for “Internet-scale” businesses, the products and services covered under the NoSQL umbrella are well worth consideration by organizations of all sizes, not as across-the-board replacements for relational databases but as additional tools for meeting business goals.
NoSQL refers to a broad class of database products that tend not to expose SQL interfaces. What separates these products from traditional databases has less to do with SQL and more to do with a departure from relational models. In particular, these databases do away with fixed schemas, which can be beneficial when developing applications with changing requirements. For this reason, non-relational is a better, if less broadly referenced, handle for this group of products.
One of the canonical documents describing the design concepts and rationales for non-relational databases is Amazon.com’s 2007 paper on its “Dynamo” data store, which the company developed to meet its internal service-level requirements.
The paper describes how traditional relational database management systems, with their focus on prioritizing data consistency above write-operation availability, proved ill-suited to the Web retailer’s needs in the context of Amazon’s infrastructure, which consists of large numbers of commodity servers of varying capacities. For Amazon, blocking customers from adding new items to their carts while waiting for separate application nodes to get in sync was too high a price to pay, so Dynamo was designed to boost availability by de-prioritizing consistency.
While the scale of Amazon’s infrastructure and user base is unusual (as is Amazon’s capacity for rolling its own data store solution), the need to prioritize certain application characteristics above others is common to every organization. Today’s crop of non-relational database products provides businesses with more options without requiring that they create solutions from scratch.
There are several different types of non-relational databases that fall under the NoSQL umbrella, including key-value stores, document-oriented databases, columnar databases and graph databases, each with its own data model, scaling strategy and use cases.
Assigning a particular NoSQL database to a specific category can get confusing, because some of the categories tend to blend into each other. For understanding the broad categories of NoSQL data stores, I found a paper by Rick Cattell helpful. In it, the former Sun Microsystems database architect breaks down the options into key-value stores, document stores and extensible-record stores.
In a key-value store, individual records amount to some arbitrary lump of information, indexed by a key. These systems typically do not interpret the data themselves, leaving that function to the application. Riak, which is supported by Basho Technologies, and Oracle’s Berkeley DB are examples of popular key-value stores.
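To make the "arbitrary lump of information" idea concrete, here is a minimal Python sketch of a key-value store. The `KeyValueStore` class and the `cart:alice` key are hypothetical names invented for illustration and do not reflect the API of Riak or Berkeley DB. The point it tries to show is that the store only sees bytes, while the application decides how to encode and decode them.

```python
import json
from typing import Optional


class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it just files it under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()

# The application chooses the encoding (JSON here) and owns the meaning.
cart = {"user": "alice", "items": ["book", "lamp"]}
store.put("cart:alice", json.dumps(cart).encode("utf-8"))

raw = store.get("cart:alice")               # opaque bytes, as far as the store knows
decoded = json.loads(raw.decode("utf-8"))   # interpretation happens in the application
```

Because the store cannot look inside the value, a question such as "which carts contain a lamp?" has to be answered by the application, typically by fetching and decoding records itself.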
In a document store, records are documents that consist of a variable number of named attributes of various types, such as integers, strings and nested objects. Document-oriented databases tend to recognize the structure of the data they store and have more querying functionality than key-value stores. MongoDB, from 10gen, and Apache CouchDB, which is supported by Couchbase, are examples of popular document stores.
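The contrast with a key-value store can be sketched in a few lines of Python. The `DocumentCollection` class below is a hypothetical toy, not the interface of MongoDB or CouchDB, and the dotted-path query syntax is an assumption chosen for brevity. It illustrates two properties described above: documents in the same collection need not share attributes, and the store understands their structure well enough to query on fields, including nested ones.

```python
_MISSING = object()  # sentinel for "this document has no such field"


def _lookup(doc, path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


class DocumentCollection:
    """Toy in-memory document collection (illustrative only)."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)
        return doc_id

    def find(self, criteria):
        """Return every document whose fields equal all given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(_lookup(doc, path) == want for path, want in criteria.items())
        ]


users = DocumentCollection()
# No fixed schema: each document carries whatever attributes it needs.
users.insert({"name": "Ana", "age": 31, "address": {"city": "Lisbon"}})
users.insert({"name": "Ben", "tags": ["admin"]})
users.insert({"name": "Caz", "age": 31, "address": {"city": "Porto"}})

aged_31 = [doc["name"] for doc in users.find({"age": 31})]
in_porto = [doc["name"] for doc in users.find({"address.city": "Porto"})]
```

Here the query runs inside the store rather than in application code, which is the extra querying functionality the category is known for.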
Extensible-record stores, which are also known as wide-column stores, provide a data model similar to relational databases, but with a focus on organizing data into columns (rather than rows) and column families (rather than tables). Apache Cassandra, which is supported by DataStax, and Apache HBase, supported by Cloudera, are examples of popular extensible-record stores.
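A rough way to picture this data model is a map from row key to column family to column name to value. The Python sketch below uses a hypothetical `WideColumnTable` class and made-up family and column names. It does not mirror the Cassandra or HBase APIs, and it ignores on-disk layout entirely. It only shows that column families are declared up front while the columns inside them can differ from row to row.

```python
class WideColumnTable:
    """Toy wide-column table: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self._families = tuple(column_families)
        self._rows = {}

    def put(self, row_key, family, column, value):
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        row = self._rows.setdefault(
            row_key, {name: {} for name in self._families}
        )
        # Columns inside a family are created on demand, per row.
        row[family][column] = value

    def get_family(self, row_key, family):
        row = self._rows.get(row_key, {})
        return dict(row.get(family, {}))


users = WideColumnTable(["profile", "activity"])
users.put("alice", "profile", "email", "alice@example.com")
users.put("alice", "activity", "2011-10-01", "login")
users.put("alice", "activity", "2011-10-02", "purchase")
users.put("bob", "profile", "email", "bob@example.com")
users.put("bob", "profile", "phone", "555-0100")

alice_activity = users.get_family("alice", "activity")
bob_profile_columns = sorted(users.get_family("bob", "profile"))
```

Note that the row for "bob" has a `phone` column that the row for "alice" lacks, and no activity columns at all, without any schema change.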
More important than worrying about which bucket a given non-relational database fits into is focusing on the particular set of features it offers: in particular, which controls it provides for balancing availability, consistency and fault-tolerance, how it handles scaling and which interfaces it provides for accessing data.
For example, Apache Cassandra enables administrators to set their desired trade-offs between availability and consistency on a per-query basis. To maximize consistency, administrators can configure a Cassandra cluster to hold off on reporting a write complete or responding to a read until all of the replica nodes responsible for that data have responded. To maximize availability, the system can complete an operation if any one node completes a write or responds to a read. Administrators can also opt for several gradations in between to reach a balance and to provide for resiliency in case nodes fail.
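The trade-off can be simulated in plain Python. The sketch below is a toy model under stated assumptions, not Cassandra's implementation or client API: the `ToyCluster`, `Replica` and `Unavailable` names are invented, every node is treated as a replica of every key, and a simple counter stands in for timestamps. The level names ONE, QUORUM and ALL are borrowed from Cassandra's consistency levels.

```python
class Unavailable(Exception):
    """Raised when too few replicas are up to satisfy the requested level."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (timestamp, value)


class ToyCluster:
    def __init__(self, replica_count):
        self.replicas = [Replica(f"node-{i}") for i in range(replica_count)]
        self._clock = 0  # stand-in for write timestamps

    def _required(self, level):
        total = len(self.replicas)
        return {"ONE": 1, "QUORUM": total // 2 + 1, "ALL": total}[level]

    def _live(self, level):
        live = [r for r in self.replicas if r.up]
        if len(live) < self._required(level):
            raise Unavailable(f"{level} needs {self._required(level)} replicas")
        return live

    def write(self, key, value, level):
        live = self._live(level)  # refuse up front if the level cannot be met
        self._clock += 1
        for replica in live:
            replica.data[key] = (self._clock, value)
        return len(live)

    def read(self, key, level):
        live = self._live(level)
        # Ask only as many replicas as the level requires; newest answer wins.
        asked = live[: self._required(level)]
        answers = [r.data[key] for r in asked if key in r.data]
        if not answers:
            return None
        return max(answers, key=lambda pair: pair[0])[1]


cluster = ToyCluster(3)
cluster.write("cart", "v1", "ALL")

cluster.replicas[0].up = False           # one node fails
try:
    cluster.write("cart", "v2", "ALL")   # strictest level: cannot proceed
    all_write_failed = False
except Unavailable:
    all_write_failed = True

acks = cluster.write("cart", "v2", "QUORUM")  # a majority is still enough

cluster.replicas[0].up = True            # node returns, still holding "v1"
stale = cluster.read("cart", "ONE")      # fast and available, but may be old
fresh = cluster.read("cart", "QUORUM")   # overlaps the write quorum
```

In this run the ALL write is rejected while a node is down, the ONE read returns the older value from the recovered node, and the QUORUM read returns the newer value because any two of three replicas must include one that saw the QUORUM write.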
MongoDB provides for scaling out across nodes in a cluster through auto-partitioning. If a data set grows too large for a single machine, MongoDB can chunk up the collection and distribute it across the nodes assigned to it, with distributed replica sets to recover from a node failure.
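The general idea of splitting a growing collection into chunks and spreading them over machines can be sketched as follows. This is a simplified illustration, not MongoDB's actual algorithm: the `ShardedCollection` class, the tiny `max_chunk_size`, the split-at-the-median rule and the "place the new chunk on the least-loaded shard" rule are all assumptions made for the example, and replica sets are left out.

```python
class ShardedCollection:
    """Toy range-partitioned collection that splits oversized chunks."""

    def __init__(self, shard_names, max_chunk_size=4):
        self.shards = list(shard_names)
        self.max_chunk_size = max_chunk_size
        # Chunks are kept in key order; "lower" is the inclusive lower bound
        # (None means "no lower bound").
        self.chunks = [{"lower": None, "docs": {}, "shard": self.shards[0]}]

    def _chunk_index_for(self, key):
        # Scan from the highest range down to find the chunk covering the key.
        for index in range(len(self.chunks) - 1, -1, -1):
            lower = self.chunks[index]["lower"]
            if lower is None or key >= lower:
                return index
        return 0

    def _docs_on(self, shard):
        return sum(len(c["docs"]) for c in self.chunks if c["shard"] == shard)

    def _split(self, index):
        chunk = self.chunks[index]
        keys = sorted(chunk["docs"])
        middle = keys[len(keys) // 2]
        # Move the upper half of the key range into a new chunk.
        upper = {k: chunk["docs"].pop(k) for k in keys if k >= middle}
        target = min(self.shards, key=self._docs_on)  # least-loaded shard
        self.chunks.insert(
            index + 1, {"lower": middle, "docs": upper, "shard": target}
        )

    def insert(self, key, doc):
        index = self._chunk_index_for(key)
        self.chunks[index]["docs"][key] = doc
        if len(self.chunks[index]["docs"]) > self.max_chunk_size:
            self._split(index)

    def find(self, key):
        return self.chunks[self._chunk_index_for(key)]["docs"].get(key)

    def shard_of(self, key):
        return self.chunks[self._chunk_index_for(key)]["shard"]


orders = ShardedCollection(["shard-a", "shard-b"], max_chunk_size=4)
for order_id in range(1, 11):
    orders.insert(order_id, {"n": order_id})

chunk_count = len(orders.chunks)
layout = [(c["lower"], c["shard"], sorted(c["docs"])) for c in orders.chunks]
```

After ten inserts the single starting chunk has split into four key ranges that alternate between the two shards, and a lookup still needs only the key to find the right chunk.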
Among the primary challenges for administrators working to wrap their minds around NoSQL databases are the differences in how data stored in these systems is accessed. Because the products differ so much from one another, there isn’t a direct equivalent to the role SQL plays in the relational world. Rather, most non-relational databases provide bindings for accessing data using multiple programming languages.
A number of SQL-like querying languages have sprung up to offer higher-level data access, such as Google’s GQL for its AppEngine platform as a service (PaaS), MongoDB’s Mongo Query Language, Cassandra Query Language and the nascent UnQL (Unstructured Query Language). For Apache Hadoop-based systems, Apache Pig and Apache Hive offer two separate routes for working with data from a higher level.
In my own efforts to better understand the differences in accessing data on relational and non-relational data stores, I’ve found the open-source Django-nonrelational project helpful. Django is a Python framework for building Web-based applications that sports an object-relational mapping layer for abstracting the differences between separate relational databases. Django-nonrelational supports Google’s AppEngine datastore, and offers in-development backend support for Cassandra and MongoDB.
For administrators and developers familiar with Django, experimenting with the various backends provides a hands-on reference for the differences between relational and non-relational stores, and between some of the different NoSQL systems.