Tuesday, June 22nd, 2010

Network versus Relational: Part II

In a recent post I ran a simplistic database performance test involving Neo4j (a graph or network database) and Apache Derby (a relational database). Both are Java-based. Relational databases are challenged by deep hierarchies, so the test was exactly that: Build a deep hierarchy and retrieve data from it.

The retrieval test was a surprise because relational Derby seemed to perform better than Neo4j.

Anders Nawroth, Neo Technology, correctly commented that the retrieval test case searches for any file in a hierarchy, i.e. the search is independent of the hierarchy. The relational test program took advantage of this fact while Neo4j traversed the the graph.

In the interest of fairness this post digs somewhat deeper into the retrieval test case. The results are intriguing.

To force the retrieval test to take the hierarchy into account, let’s restate it this way: Find the oldest file bigger than 5 MiB in a given subtree of a given scan.

For Neo4j the revamped test case makes no big difference. It’s still a matter of graph traversal.

However, the idea of traversing a hierarchy with a relational database makes me cringe. A general solution implies one query per parent node to find the children. Performance is awful. In real life I would find a way to avoid it. (Note that I already did because every file node has a pointer to the scan it belongs to.) In the revamped test case I prefer to check the file path to find out if it belongs to the given subtree. I must give up my NoSQL ambitions and use a query like this,

SELECT * FROM file WHERE scan = ? AND size > 5000000 AND path LIKE '/usr/local/%' ORDER BY mtime

The result is ordered, so the answer is the first row of the result set. Before running this test I added another file system scan to the database, rooted in /usr, increasing the number of records of the file table to 1.15 million.

The revamped test case applied to the /usr/local subtree of the /usr scan took Derby 3.7 s. It may be compared directly to Neo4j running the original test case on a scan rooted in /usr/local (3.6 s). Derby still stands up to the competition.

Now let’s turn the test case around a different way. In the original test we allowed Derby to do a plain table scan. A similar approach may be used with Neo4j. It has a complementary access method (getAllNodes) for iterating over all nodes regardless of their position in the graph.

There is an important difference between Neo’s getAllNodes and a relational table scan. Neo4j is a schema-less database (except for relationship types). The getAllNodes method iterates over the entire database. In the original test program the type of a node is implied by its position in the hierarchy. With getAllNodes there is no way to determine the position of a node returned by the iterator. The solution is to add a node type attribute to every node. We also have to tag every file/directory node with the scan it belongs to, just like the relational schema.

(Note in passing: There is no such thing as a schema-free database. The question is only who does the job, the database system or you. But that’s the subject of another post.)

For every node in the iteration we must now check the node type and the scan. The search time rose to somewhere around 9 s, worse than graph traversal. The database contained 430.000 nodes at this point. Just iterating over all nodes without accessing any attributes or relationships took 3.0 s.

Derby, on the other hand, solved the original problem in 2.5 s having 1.150.000 files in the database. Finding the oldest file over 5 MiB among 521.000 files took 4.7 s (still with 1.15 million files in the database). Both cases involved at least one comparison for every table row.

Conclusion: I really cannot convince myself that Neo4j graph traversal is faster than Derby table scans. Database creation turned out to be 3 times faster than Derby, but most applications have many more reads than writes, making basic retrieval very important.

Disclaimer: This judgement is based on a toy example and does not necessarily apply to your application. Please make your own tests before doing anything expensive. Please also tell me if you find evidence contrary to this test.

In this test I managed to avoid hierarchy navigation with the relational database. Of course there are applications where this just isn’t possible and a non-relational database is the only way to get acceptable performance. I’ll look into this general issue in an upcoming post.

Comments are closed.