Friday, May 7th, 2010

Neo4j vs. Relational: The Relational Combatant

A previous post promised a head-to-head no-mercy Neo4j vs. relational showdown. It also provided Groovy programs to store and retrieve file system data in a Neo4j graph database. Neo4j was recently released in a 1.0 version (see the Neo4j site).

Now it's time for the relational combatant to enter the scene: Apache Derby.
We will write another Groovy program to store file system data, this time using a Derby relational database. To make the task more interesting we will try to earn a "nosql" medal by not using SQL.

Derby is a relatively new player in the database arena. I chose it mainly because I haven't used it before. The solid documentation makes a good initial impression by presenting a quite feature complete SQL/JDBC database.

The database was defined like this (nosql has to wait a minute, this is nothing but SQL):

SQL:

  1.  

We will use this schema to store file system scans in the scan table. Every file or directory in a file system scan will add an entry to the file table. Every file and directory except the top directory is linked to its parent directory through the parent column.

Scans and files are uniquely identified by an integer key. A specific file may occur more than once, but in different scans, so the file path alone is not unique.

The schema uses a common trick for managing hierarchies in a relational database: cross-level links. In this case every file will have a direct link (the scan column) to the scan it is part of.

Here is a Groovy program that scans the file system and writes the database.

GROOVY:

  1. import java.sql.*
  2. import groovy.sql.*
  3. 'jdbc:derby:/sdb3/cur/data/derby1'// Collect file system data, shut down database
  4. 'FILE'"Files collected: ${fileCount}""OOPS! ${exc.message}"// Derby shutdown requires a new connection
  5. ';shutdown=true'"Shutdown message: ${exc.message}""Processing time: ${stopTime - START} ms"'SCAN'// Recursively collect directory data
  6. 'SELECT MAX(id) as maxid FROM file'// Be useful
  7. "Top directory path required")

The program expects a directory path on the command line. It will traverse the file system beginning with that directory and store file and directory data. Like the Neo4j equivalent it does everything in a single transaction.

You will find the main logic in the recursive collectDir method. It runs through all entries of a directory. If an entry is a directory it invokes itself. Otherwise it just adds a record to the file table.

The "nosql"-ness comes from using Groovy DataSet. DataSets have no way of picking up values assigned to autoincremented columns, so there is a tiny bit of SQL after all. We also had to avoid autoincrement and manage record ids explicitly. This detail takes away some of the Groovy elegance.

After making a few scans it is time to retrieve data. The obvious choice is to use ij that comes with Apache Derby. It is an interactive, command line tool. The beauty of JDBC is that there are many other JDBC-compatible tool we could use, some with a graphical user interface.

Listing scans and asking for the oldest file with a size > 5000000 looks like this (with a scan id provided in the second statement):

SQL:

  1.  

However, this is supposed to be a head-to-head showdown, so we also provide Groovy code to do the same thing.

GROOVY:

  1. import java.sql.*
  2. import groovy.sql.*
  3. 'jdbc:derby:/sdb3/cur/data/derby1'"OOPS! ${exc.message}"// Derby shutdown requires a new connection
  4. ';shutdown=true'"Shutdown message: ${exc.message}"'SCAN'"Scan ${it.ID}: ${it.PATH}""${row.PATH} size ${row.SIZE} modified ${row.MTIME}""File matching criteria not found""VIEW${scanId}"'SELECT count(*) as VCOUNT FROM sys.systables WHERE tablename=?'"CREATE VIEW ${viewName} AS SELECT * FROM file WHERE scan=${scanId}"// Be useful
  5.  

The code takes some explanation. The listScans method is obvious enough, but how about findFile?

The findAll call in the findFile method looks like a procedural statement where we iterate over the tuples of the file table. Maybe you don't believe this, but the two closures (in findAll and sort) are never executed! They are compiled into a SQL statement which is run by firstRow. This is stunning nosql.

Well, to arrive at the elegant code we have to go through some SQL hoops after all. (Maybe someone out there can show me a better solution.) The only values allowed in the findAll conditions are literals. So we define a view for the query to operate on. This is done in the createView method. It is possible to use exception handling to check if a view exists, but there will be an ugly message on the console even if you catch the exception. In addition, strings containing ${...} (GString) are lazily evaluated and intelligently replaced by SQL parameters by DB.execute. So it is necessary to convert the GString to an ordinary String before submitting it as SQL because the view name cannot be a parameter.

In summary we have created almost nosql counterparts to the programs exercising Neo4j in the previous post. Running the last program yields,

/usr/local/thunderbird/thunderbird-bin size 12403548 modified 2006-12-07 09:05:44.0

With a sigh of relief it we note that this is the same file found by the Neo4j retrieval program. The modification time format differs only by including fractional seconds. The reason is that we use the Derby Timestamp datatype without formatting.

The two combatants are finally ready for the final fierce, no-mercy showdown.

Comments are closed.