Thursday, September 2nd, 2010

Neo4j Performance Revisited and Appreciated

My previous Neo4j performance micro-benchmark left a disturbing hole: there was no explanation of why Neo4j didn’t cope well with big transactions.

A closer study uncovered the pretty obvious reason. It also turned out to be the key to zippier Neo4j performance.

Perhaps I should explain my seeming obsession with Neo4j. Once upon a time I was part of the development of an “object oriented” database system, doing much of its design. The quotes around “object oriented” are there because it was really a graph database in today’s terminology. It was written in C and had strong schema support. C plus schema support made data access extremely fast, a matter of a handful of CPU cycles. EasyDB, the current name, has been further developed and is still available from Basesoft. It is a NoSQL database, although the term didn’t exist at the time it was engineered.

The key to Neo4j performance is memory. I reran the test cases with the Java memory options -Xms128m -Xmx1g, allowing the JVM to expand to 1 GB of heap space rather than the default 64 MB.
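
To check that such options actually took effect, the JVM can be asked what heap limits it sees. The following is only a minimal sketch: the class name HeapReport and its helper methods are made up for illustration and are not part of the test programs described here.

```java
class HeapReport {
    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line summary of the heap limits the running JVM reports. */
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        return "max=" + toMegabytes(rt.maxMemory()) + " MB, "
             + "committed=" + toMegabytes(rt.totalMemory()) + " MB, "
             + "free=" + toMegabytes(rt.freeMemory()) + " MB";
    }

    public static void main(String[] args) {
        // Intended to be run as: java -Xms128m -Xmx1g HeapReport
        System.out.println(describeHeap());
    }
}
```

Depending on the collector in use, the reported maximum may come out somewhat below the -Xmx value, so a figure a little under 1024 MB does not necessarily mean the option was ignored.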

With these memory settings, scanning 115,735 files and writing file information to the database became 35 times (!) faster with Neo4j than with Apache Derby or PostgreSQL. Neo4j quickly claimed all of the 1 GB heap space, actively using nearly 800 MB, almost 8 kB per node. Despite this remarkable performance, it still spent a large fraction of the time doing MarkSweep garbage collection.

In comparison, Apache Derby had a modest memory footprint: under 100 MB and only Scavenge garbage collection.

MarkSweep garbage collection is very time-consuming. It seems Derby has logic to avoid incurring it. Neo4j has a tendency to trigger MarkSweep whenever it comes even remotely near the memory limit. It may escape the fatal OutOfMemory, but performance suffers badly.
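
The split between Scavenge and MarkSweep collections can also be read from inside a program through the standard java.lang.management API. This is a rough sketch, not the monitoring setup used for the measurements above: the class name GcReport is hypothetical, and the collector names the JVM reports depend on which garbage collector is active (with the parallel collector they are typically "PS Scavenge" and "PS MarkSweep").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcReport {
    /** Returns one line per collector: its name, collection count and accumulated time. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    /** Sums the collection time over all collectors, skipping undefined (-1) values. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Allocate short-lived objects so the young-generation collector has a chance to run.
        for (int i = 0; i < 200000; i++) {
            byte[] junk = new byte[1024];
        }
        for (String line : snapshot()) {
            System.out.println(line);
        }
        System.out.println("total GC time: " + totalGcMillis() + " ms");
    }
}
```

Printing such a snapshot before and after a test run gives a crude picture of how much of the elapsed time went to each collector.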

As for reading the database, more memory didn’t change the relative performance figures a lot. Derby and PostgreSQL stand up well against Neo4j, using a fraction of the memory. A few observations:

  • Neo’s getAllNodes uses a lot of memory, 600 MB in this test, and seems prone to triggering MarkSweep.
  • The Groovy overhead became visible as a side effect of monitoring the test programs. Only Derby spent a sizable amount of time (20%) in Groovy-specific code. For Neo4j, garbage collection dwarfs the Groovy overhead.

Let it be perfectly clear that the reading test case for Derby and PostgreSQL involves only one or two queries. I think it is relevant to compare row-to-row table scan in the relational databases to node-to-node traversal in Neo4j.

In my test cases the data model could be adapted to the problem. The relational databases only had to use one or two queries. There might be situations when this is not possible. In such a case the hierarchy must be traversed with one query per node, with a devastating performance hit for the relational databases. Only a specialized database like Neo4j keeps its performance without flinching.
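
To make the "one query per node" cost concrete, here is a small simulation. It is only a sketch under stated assumptions: no real database is involved, the class TraversalCost and its (id, parentId) rows are invented for illustration, and each call to queryChildren stands in for one round trip such as a SELECT on the parent id.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraversalCost {
    // Hypothetical stand-in for a table of (id, parentId) rows.
    private final Map<Integer, List<Integer>> childrenByParent =
            new HashMap<Integer, List<Integer>>();

    // Number of simulated queries issued so far.
    int queryCount = 0;

    void addRow(int id, int parentId) {
        List<Integer> children = childrenByParent.get(parentId);
        if (children == null) {
            children = new ArrayList<Integer>();
            childrenByParent.put(parentId, children);
        }
        children.add(id);
    }

    /** Simulates one query for the direct children of a node and counts it. */
    List<Integer> queryChildren(int parentId) {
        queryCount++;
        List<Integer> children = childrenByParent.get(parentId);
        return children == null ? Collections.<Integer>emptyList() : children;
    }

    /** Walks the whole subtree below rootId, issuing one query per visited node. */
    int countDescendantsPerNode(int rootId) {
        int visited = 0;
        Deque<Integer> stack = new ArrayDeque<Integer>();
        stack.push(rootId);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            for (int child : queryChildren(node)) {
                visited++;
                stack.push(child);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        TraversalCost t = new TraversalCost();
        t.addRow(1, 0);
        t.addRow(2, 0);
        t.addRow(3, 1);
        t.addRow(4, 1);
        int descendants = t.countDescendantsPerNode(0);
        System.out.println("descendants=" + descendants + ", queries=" + t.queryCount);
    }
}
```

For the five-node tree in main this issues five queries to find four descendants. The query count grows with the number of nodes, so a hierarchy of 115,735 files would need on the order of that many round trips, compared with the one or two queries used in the test.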

Software: Neo4j 1.1, Apache Derby, PostgreSQL 8.4.4, Groovy 1.7.4, Java 1.6.0_21. Platform: 64-bit openSUSE 11.3 on a humble Dell box with a Pentium Dual-Core E5300 CPU @ 2.60 GHz and 2 GB of memory.

2 Comments on “Neo4j Performance Revisited and Appreciated”

  1. Nice write-up!
    To avoid GC hits during read operations, try to play with the memory settings listed under which might help?

