SQL. It’s so last century, amirite?
Although it was created around ten years earlier, Structured Query Language (SQL) really began to catch hold in the commercial world during the mid-eighties, and brought with it some great improvements over what had come before.
Probably the most important feature was transactionality, which enabled concurrency and rollback capabilities. SQL also lent itself to ad-hoc business reporting, since queries could be easily written by sales and other departments to gain fresh insights into their company’s data without application programming.
As testament to its usefulness, SQL reigned supreme for over 20 years before it met any serious opposition, mostly owing to how deeply integrated it had become in major business applications.
But I’ve come here to bury Caesar, not to praise him.
Let’s talk NoSQL.
Google Trends says the term has been bandied about on the web since around 2009. But that doesn’t mean everyone was happy with SQL up until then. Google’s Ngram Viewer (which searches a large corpus of books published prior to the web’s existence) says that the phrase “SQL alternative” began appearing in books around 1984 — a full 2 years before it’s first standards release!
Interestingly the trendline for this term takes a nose-dive around the point when “NoSQL” began trending on the web. It turns out that was all due to Twitter.
Back in 2009, a fellow named Johan Oskarsson wanted to set up a meetup to discuss alternatives to SQL. To make sure all the right people showed up, he figured he needed a catchy hashtag. He contacted his pal Eric Evans, who came up with the term. They had no idea they’d be starting a movement!
What’s Wrong With SQL?
Generally speaking, the syntax isn’t that onerous. Complex SQL queries can get pretty ugly, but that’s why God made DBAs; to cope with this kind of stuff. Regardless how you feel about slinging SQL, there are other, more impactful reasons for enterprise to seek an alternative.
The two main issues are…
1. Data Isn’t Stored in the Same Shape as It Is Used
One of the biggest hassles with SQL is what is called “object-relational impedance mismatch“.
In short, the way we represent and reason about data in our user interfaces is considerably different from the way we store that data in a relational database. With modern UIs, we typically manipulate hierarchies of “objects”, often without a defined schema. But in a relational database, we deal with tables of data related to each other by keys. Schemas are not optional.
This disparity in data organization gives rise to object-relational mapping middleware such as Hibernate. Along with that comes extra management overhead, since changes to the data model require changes not only to the UI and DB, but also to the ORM layer that allows them to communicate. It can also add a certain amount of processing time to each request/response cycle.
2. Clustering is Difficult at Web Scale
Roughly speaking, clustering is an attempt to pull together data from all the rows of all the tables that share a particular database column, so that it will be handy when people ask for it. Until the web super-sized the data of very large enterprises, clustering mostly worked fine.
But, as their databases grew, Google and Amazon both reached a critical mass at which the insurmountable challenges of clustering led them to develop their own SQL alternatives — Bigtable and DynamoDB, respectively.
As they sharded and replicated the data to more and more servers around the world, failures or delays between network nodes caused problems with database availability and consistency.
The NoSQL Landscape
There are many implementations of each, and I won’t attempt to enumerate or rank them here, as this article is more about the types than any specific instance.
Many of the implementations out there land squarely in one or the other of these four categories. Some straddle the fence, combining features from different approaches.
Let’s look briefly at each of these approaches, then we’ll try to get a sense of how they differ in practical terms, and how you might go about choosing one.
KVP (Key / Value Pair) Databases
You can think of KVP databases as persistent hashes. Each key has an associated value, and the developer can write, fetch, or delete arbitrary values by key.
- More implementations than any other NoSQL approach
- Value could be any type (string, number, blob, etc.)
- Some support ordering of keys
- Many are in-memory implementations for speed
- Takes up less memory than most other types since relationships are simplest
This is the closest approximation of how we handle the data on the client, so the “impedance mismatch” problem is typically minimized with the use of a document database.
- Similar to KVP databases
- Values are either simple scalars, lists of key/value pairs, or ‘documents’ (complex objects)
- Documents can be modified whole or in part
- Most commonly JSON, eliminating impedance mismatch entirely for HTML5 apps
- Some implementations support secondary indexing
- Similar to SQL in use of ‘rows’
- Similar to Document DB in that sections can be addressed
- As with most other NoSQL approaches, notion of schema is flexible / dynamic
- Radically different from other NoSQL approaches
- Data model is based on nodes and edges rather than aggregation
- While both are “relational”, graph databases are more optimized than SQL for navigating and retrieving data by relationship
Characteristics of SQL vs NoSQL Systems
In response to the classic ACID mantra of the relational database world, the NoSQLr’s had to counter them with their own acronym, BASE.
Atomicity (of updates), Consistency (of state), Isolation (of users from in-progress updates), Durability (of transactions across system failures).
Basically Available (data is never ‘locked’), Soft state (system can change over time even without input, because of…), Eventually consistent (once input stops).
Of the two, BASE definitely scores higher on the contrive-o-meter. And it doesn’t really engender a lot of excitement about what’s on offer. For real insight we need to compare SQL to NoSQL with something called CAP theorem.
Like many of the “Iron Triangles” described in various industries, CAP describes a set of three important attributes, of which you can only optimize for two. In this case it is Consistency, Availability, and Partition tolerance.
Unsurprisingly the CA side of the triangle is where SQL lives, because the P vertex is one of the big pressures that started the NoSQL movement to start with — namely that clustering at scale is a pain with relational databases.
Column family databases are tabular and the closest in data model to RDBMSs. It is really the only NoSQL solution that’s managed to stake out territory on the CA edge.
The others are split between AP and CP, based on the priorities chosen by their implementers.
Choosing a database is ultimately about which two vertices of the CAP triangle are most important for your application.
I hope this article frames the subject well enough get you on your way to exploring the wild and wooly world of NoSQL. The accompanying slide deck mentions several well-known implementations of each NoSQL database type, but is by no means exhaustive.
Below are some links for further reading/viewing.