Thursday, July 1st, 2010

Why NoSQL Won’t Replace Relational Databases Anytime Soon

These days there is a lot of buzz about NoSQL databases. We hear that new databases are about to replace relational ones because the relational data model is so old.

I’m sure this won’t happen anytime soon. On the other hand, I don’t believe NoSQL databases will go away either. My prediction is that new NoSQL databases will keep popping up. Each one will fill a specific need for a time, and then fade away. No two NoSQL databases will ever be compatible. Standardization is fundamentally contrary to the reason why NoSQL databases exist.

Hang on and I will show you why.

Cutting Time to Market
Let’s start here: The only thing that matters in mainstream application development is time to market. Even code quality is sacrificed to have an application hit the market before the competition. This means developer productivity is a top priority. Write as few lines of code as possible to get the job done.

The main way to improve developer productivity is to raise the level of abstraction. Figuratively speaking, you don’t want to build the house from nails and a heap of boards. What you want is prefab modules. In computing it means you no longer code in C. You use Java or some other contemporary language that reduces the code volume by a significant factor.

Progress in the art of programming is all about creating abstractions. They are the prefab modules of programming. Object orientation, message passing, automatic memory management, dependency injection: all such concepts offer abstractions with one goal: to help developers get more bang per line of code. Abstractions certainly don’t help the computer. When the code hits the CPU almost all traces of classes, interfaces and methods are gone. There is only a stream of primitive byte-shuffling instructions. Abstractions are strictly for human consumption.

The principle of the highest possible level of abstraction also applies to data, but may be less obvious. People have generally left C, but you may still run into projects that claim that they “just need something simple” for managing data and that the file system will do.

The Relational Data Model
The highest current level of abstraction in data management is the relational data model. This is true despite the fact that the theory was proposed in the 70’s. Its solid mathematical foundation makes a lot of difference. Its most notable abstraction is the possibility of a non-procedural query language that decouples applications from their data store.

Note: SQL is not inherent in the relational model. There have been other relational query languages, but SQL ended up as an ANSI standard.
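To make that decoupling concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in relational engine. The table and data are invented for illustration; the point is that the query states what rows we want, never how to retrieve them.

```python
import sqlite3

# In-memory SQLite database as a stand-in for any relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.executemany("INSERT INTO subscribers VALUES (?, ?, ?)",
                 [(1, "Alice", "prepaid"), (2, "Bob", "postpaid"), (3, "Carol", "prepaid")])

# Declarative: we say WHAT we want, not HOW to fetch it.
# The engine may scan, use an index, or reorder operations -- the query is unchanged.
rows = conn.execute(
    "SELECT name FROM subscribers WHERE plan = ? ORDER BY name", ("prepaid",)
).fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```

If an index is later added to the `plan` column, the application code above does not change at all; only the engine’s execution strategy does. That is the decoupling in action.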

The high level of abstraction was a problem at the time the relational model was proposed. As an illustration, Ingres was a pioneering relational research effort. It typically ran under Unix on a PDP-11. It had to be partitioned into several interconnected processes because the PDP-11 (a 16-bit architecture) did not support processes larger than 64 KB (that’s right, kilobytes). This was a long time ago, well before the IBM PC era.

The feasibility of relational databases was questioned for decades. Among the most prominent contenders back then were the so-called CODASYL databases. The network data model was specified by CODASYL by the end of the 60’s. I mention it because modern-day databases like Neo4j have a quite similar data model. It’s one of the oldest ideas in database technology.

You may have heard about the so-called impedance mismatch. It means that the data structures of a programming language, Java for example, cannot be immediately translated to relational table rows. Most people blame the relational data model, but I also find fault with programming languages. The data structures of current programming languages are largely low-level abstractions based on one-way pointers (lists) and one-way associations (maps). They have stayed essentially the same for decades, so there should be room for new levels of abstraction.
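The mismatch is easy to see in a few lines of code. Here is a sketch, again using Python and sqlite3 with made-up Order and Item types: the nested list inside the object has no direct relational counterpart, so translation code has to flatten it on the way in and reassemble it on the way out.

```python
import sqlite3
from dataclasses import dataclass, field

# An object graph: an Order owns a list of line items (a one-way pointer).
@dataclass
class Item:
    sku: str
    qty: int

@dataclass
class Order:
    order_id: int
    items: list = field(default_factory=list)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER)")

def save(order):
    # The nested list must be flattened into a child table
    # linked back to its parent by a foreign key.
    conn.execute("INSERT INTO orders VALUES (?)", (order.order_id,))
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(order.order_id, i.sku, i.qty) for i in order.items])

def load(order_id):
    # Going the other way, the object graph is rebuilt from flat rows.
    rows = conn.execute("SELECT sku, qty FROM items WHERE order_id = ?",
                        (order_id,)).fetchall()
    return Order(order_id, [Item(sku, qty) for sku, qty in rows])

save(Order(42, [Item("widget", 3), Item("gizmo", 1)]))
print(load(42).items)  # [Item(sku='widget', qty=3), Item(sku='gizmo', qty=1)]
```

Object-relational mappers exist precisely to generate this kind of translation code, which is a hint at how much of it real applications would otherwise write by hand.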

Quantum Jumps in Computing
Note that doubt was cast on Java for performance reasons in its early years. It was believed to be mainly a tool for building applets (client side).

Today a real-time telecom system may well be implemented in Java (server side) and backed by a relational database. A real-time billing system, for instance, may handle millions of subscribers and generate gigabytes of billing records every day. It took a while, but today there is no doubt that the relational model and Java are feasible for many demanding purposes.

My point is this: A quantum jump in computing begins with conjuring up a level of abstraction that is not yet feasible. Initially people will complain that it’s a terrible waste of CPU cycles and that performance sucks. Given time there will be enough CPU cycles and memory and gradually everyone will use the new abstractions to improve their productivity.

Stretching the Limits
So much for mainstream applications. At any given time there are people who stretch the limits. This was true about electronics CAD systems in the 80’s – the relational databases of the day could not handle them. It is true now. For instance, there are extremely visible web applications with millions of users spread around the globe. Relational databases just won’t cut it.

The challenges are different at different points in time, but usually a combination of two principles is used to solve those hard cases:

  • Give up finding a general solution and solve only the relevant special cases
  • Take the hit of accepting lower levels of abstraction

For instance, if you are not in banking or telecom billing you may find that your application can do without airtight transactions. Eventual consistency may be good enough. Perhaps some components have to be coded in C after all. Perhaps you invent a way of partitioning the problem over a thousand computers in parallel. The important thing is to get the job done. The flip side of leaving the beaten track is a sharp increase in the volume of code you must produce. The companies behind the million-user web applications have the resources to do it.

NoSQL Databases
Most of the NoSQL databases we see are designed to gain some desirable characteristic (performance, scalability, etc.) at the expense of being general-purpose, or by operating at a lower level of abstraction. You may find that your application needs one of them to satisfy extreme requirements. Then go for it and accept that you will pay a price in terms of more developer hours. A few years down the road the trade-off will be different. General-purpose databases will manage more complex stuff, which means that the previous generation of NoSQL databases will be less needed. New ones will be necessary for those who stretch the new limits.
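Here is what the lower abstraction level costs in practice. The sketch below uses a plain dictionary as a stand-in for a key-value store (the keys and records are invented): with only get/put by exact key, a query that SQL would express as one declarative join becomes hand-written application code.

```python
# A key-value API: only lookup by exact key -- no query language.
store = {
    "user:1": {"name": "Alice", "city": "c:1"},
    "user:2": {"name": "Bob",   "city": "c:2"},
    "c:1":    {"city": "Oslo"},
    "c:2":    {"city": "Lund"},
}

def users_in(city_name):
    # The "join" between users and cities is now our job: scan every
    # user key, chase the reference, filter. More code, fewer guarantees.
    result = []
    for key, user in store.items():
        if key.startswith("user:"):
            city = store[user["city"]]
            if city["city"] == city_name:
                result.append(user["name"])
    return result

print(users_in("Oslo"))  # ['Alice']
```

Every such query the application needs is another hand-rolled function like this one, which is where the extra developer hours go.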

NoSQL databases may be extremely capable. They are temporary fixes nonetheless. They get their edge by not trying to solve all the problems that contemporary relational databases do and/or by operating at a lower level of abstraction. Each one strikes its own trade-offs. You should find out what they are.

Maybe I’m unfair. Maybe some NoSQL databases build on new theory and not just on pragmatic trade-offs.

Maybe one day we will see a data model on a level of abstraction higher than the relational one. For instance, I have waited a long time for “table” to appear as a valid column data type in a new strain of relational databases. It sounds conceptually simple but is very disruptive. The query language as well as fundamental principles of storage organization are challenged.
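As a thought experiment only, here is what a table-valued column might look like, modeled with nested Python lists since no mainstream SQL engine stores data this way. The `unnest` helper is hypothetical, though it mirrors what the UNNEST operation does for array columns in some modern engines.

```python
# Thought experiment: a relation where one column's values are
# themselves little relations (lists of rows).
employees = [
    {"name": "Alice", "phones": [{"kind": "work", "num": "111"},
                                 {"kind": "home", "num": "222"}]},
    {"name": "Bob",   "phones": [{"kind": "work", "num": "333"}]},
]

def unnest(rows, column):
    # Flatten the nested table back into ordinary flat rows,
    # pairing each inner row with its parent's other columns.
    for row in rows:
        for inner in row[column]:
            yield {**{k: v for k, v in row.items() if k != column}, **inner}

flat = list(unnest(employees, "phones"))
print(flat[0])  # {'name': 'Alice', 'kind': 'work', 'num': '111'}
```

Today the same data would be normalized into a separate phones table joined by key; making the nesting a first-class part of the model is exactly the disruption to query language and storage organization mentioned above.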

Summing Up
In summary, when selecting a database system keep your head cool and don’t make decisions based on hype. Rational thinking still works. That and hands-on testing will help you steer away from costly database mistakes.
