Tuesday, July 27th, 2010

The Schema-Less Database: Freedom or Bondage?

Some of the new strain of NoSQL databases are schema-less. They also claim this is a feature that brings flexibility. If the schema is such a roadblock, why was it invented in the first place?

Let’s begin with a fundamental question: What is a database schema? Answer: The database schema is the degree to which the database system understands your conceptual model. Understands in the sense of automatically supporting and upholding the model. A lot of schema, a lot of support. No schema, no support. A schema-less database system doesn’t have a clue about your model.

When I say “the model” I mean the conceptual model that is the foundation of every database application. The model is usually a judicious simplification of the real world. Take an order entry system, for example. The model may stipulate that there are Products, Customers and Orders and relationships between them. This is an extremely narrow view of a business, but it may be ok for a model. The model bridges the gap between data and information. It gives meaning to the bytes stored in a database. Programmers continuously use the model when they code an application. The model is the crown jewel of every database application.

A modern relational database systems offers several standardized mechanisms to support a conceptual model. You may tell it to check value ranges and check relationships. You may create indexes to check that some value is unique or to speed up selected types of queries. In their limited way, these mechanisms help protect the database from storing garbage.

So why would anyone be against having their model supported by the database system? No one, if it wasn’t for one very disturbing fact about conceptual models: They evolve. All models in practical use change. If the model changes, and here is the crux of the matter, so must the schema.

High-end relational database systems support many types of schema changes. Adding or dropping a column is seldom a problem. Adding an index is straightforward. In theory. In practice, if a table has half a billion rows, creating an index may take a full day. It’s doable, but it takes so much time that it may interfere with normal operation.

Schema changes in large databases can be a serious pain. Schema-less databases don’t have a schema, so is the pain gone? The answer is no, the integrity of the conceptual model still has to be protected. The burden is shifted from the database system to applications. More code must be written because the database system no longer stops a buggy application from filling the database with nonsense. Application development time increases.

In my opinion “schema-less” is not a feature but a pragmatic trade-off to achieve performance in certain very large databases at the cost of longer development time.

In the specific case of Neo4j the almost total schema-less-ness seems counterproductive in my opinion. Every node in a database may be different. An application must check every node it retrieves to find out if it has the expected attributes, leading to performance loss. In practice every application will invent its own way of managing node types. The database system would do this more efficiently than applications. Versions of node types could be used to retain flexibility.

In conclusion, a schema-less database brings you freedom and flexibility in the same sense that driving without a safety belt does. In some special applications you may be forced to sacrifice schema support due to overriding performance concerns. Mainstream applications should use all schema support they can get.

Comments are closed.