Wednesday, June 2nd, 2010

Network versus Relational

It’s time for the long awaited Network versus Relational database head-to-head, no-mercy showdown. Network databases is represented by Neo4j 1.0, relational by Apache Derby

This shop is biased towards Groovy, so Groovy is used for all test programs. It saves a lot of coding. To a certain extent it also allows us to run a relational database without writing SQL, adding a special twist to the “nosql” concept.

The test programs have been presented in previous posts. This is a link to the Neo4j test case. Here is a link to the Derby test case. The task is to create a database representing a file system hierarchy and then to find the oldest file with a size greater than 5 MiB from the database. A deep hierarchical data structure was chosen for the test because that’s where network databases are expected to shine. Relational databases, on the other hand, are challenged by deep hierarchies.

Any database evaluation is flame war fuel. A few disclaimers might ward off some of it.

The simple test case used here is like peeping through a keyhole. What you see is real but very limited. There is a lot more to evaluate. Fundamentals not tested here include managing concurrency and making efficient use of multi-core CPUs.

We measure performance. But seriously, if performance really matters one should not use Groovy.

The testing environment is summarized at the end of this post.

First task: Create a database. Here are the measurements. The Files column contains the number of files scanned. This is also approximately the number of database records created. The execution time includes opening and closing the database, a significant portion of the total execution time, especially for Neo4j.

FilesTime (s)DB size (MB)
7 1094.813.421
53 37816.451.01513
520 691479.8121

The table shows that Neo4j was around 3 times faster than Derby in the two first cases. In the biggest test case Neo4j blew up with an out of memory error. One may argue that creating half a million records in a single transaction is unreasonable, but Derby did it without complaining. Neo4j also uses slightly more disk space for its databases.

Second task: Find the oldest file with a size greater than 5 MiB. Neither database uses any index, linear scan is expected. The test was only run on a database containing 53 378 records because of the Neo4j memory problems with the bigger database. Timing in this case only includes the search. Opening and closing the database is not measured. The timing is an average of a few runs.

Neo4j: 3.6 seconds
Derby: 2.2 seconds

In this task Derby outran Neo4j, a very surprising result. One possible explanation is that in the Derby test program the innermost loop never loops. It is converted behind the scenes into a single SQL query. The corresponding loop in the Neo4j test program is executed literally. Even so I find it a bit surprising that network traversal in Neo4j appears slower than a table scan in Derby.

In summary Neo4j was much quicker writing a database, but Derby unexpectedly won the retrieval test. If I can find some time I will dig deeper into this and follow up with a later post.

Testing environment: OpenSUSE 11.1 64-bit Linux on a vanilla HP xw4400 workstation with an Intel Core2 CPU 6400 at 2.13 GHZ and 4 GB of memory. More software versions: Groovy 1.7.2, Java 1.6.0. Default Java memory settings.

2 Comments on “Network versus Relational”

  1. I think the difference between the implementations is that Neo4j actually traverses all the way down into the file system hierarchy, while the Derby solution does not. The data has a network/hierarchical structure, but the query doesn’t! To remedy this, you could simply add one more constraint: only look at files inside a specific directory (and its subdirectories) inside the file system.

  2. Håkan

    Anders: That’s right. I hope to follow up with another post real soon now.