Friday, May 7th, 2010

Neo4j vs. Relational: The Relational Combatant

A previous post promised a head-to-head no-mercy Neo4j vs. relational showdown. It also provided Groovy programs to store and retrieve file system data in a Neo4j graph database. Neo4j was recently released in a 1.0 version (see the Neo4j site).

Now it's time for the relational combatant to enter the scene: Apache Derby.
We will write another Groovy program to store file system data, this time using a Derby relational database. To make the task more interesting we will try to earn a "nosql" medal by not using SQL.

Derby is a relatively new player in the database arena. I chose it mainly because I haven't used it before. The solid documentation makes a good initial impression by presenting a quite feature complete SQL/JDBC database.

The database was defined like this (nosql has to wait a minute, this is nothing but SQL):

SQL:

  1. CREATE TABLE scan (
  2.        id int PRIMARY KEY generated BY DEFAULT AS identity,
  3.        path varchar(400) NOT NULL,
  4.        tstamp timestamp DEFAULT current_timestamp NOT NULL
  5. );
  6. CREATE TABLE file (
  7.        id int PRIMARY KEY generated BY DEFAULT AS identity,
  8.        scan int NOT NULL REFERENCES scan(id),
  9.        parent int REFERENCES file(id),
  10.        path varchar(400) NOT NULL,
  11.        size bigint,
  12.        mtime timestamp
  13. );

We will use this schema to store file system scans in the scan table. Every file or directory in a file system scan will add an entry to the file table. Every file and directory except the top directory is linked to its parent directory through the parent column.

Scans and files are uniquely identified by an integer key. A specific file may occur more than once, but in different scans, so the file path alone is not unique.

The schema uses a common trick for managing hierarchies in a relational database: cross-level links. In this case every file will have a direct link (the scan column) to the scan it is part of.

Here is a Groovy program that scans the file system and writes the database.

GROOVY:

  1. import java.sql.*
  2. import groovy.sql.*
  3.  
  4. DBURL = 'jdbc:derby:/sdb3/cur/data/derby1'
  5. START = System.currentTimeMillis()
  6. int LASTID
  7.  
  8. // Collect file system data, shut down database
  9. def collect(String topDirPath) {
  10.     try {
  11.         def cnx = java.sql.DriverManager.getConnection(DBURL)
  12.         def db = new Sql(cnx)
  13.         setNextId(db)
  14.         def scanId = addScan(db, topDirPath)
  15.         def fileCount = collectDir(db.dataSet('FILE'), scanId, null, new File(topDirPath))
  16.         cnx.commit()
  17.         println "Files collected: ${fileCount}"
  18.     } catch (Throwable exc) {
  19.         cnx.rollback()
  20.         println "OOPS! ${exc.message}"
  21.     } finally {
  22.         try { // Derby shutdown requires a new connection
  23.             java.sql.DriverManager.getConnection(DBURL + ';shutdown=true')
  24.         } catch (SQLException exc) {
  25.             println "Shutdown message: ${exc.message}"
  26.         }
  27.     }
  28.  
  29.     def stopTime = System.currentTimeMillis()
  30.     println "Processing time: ${stopTime - START} ms"
  31. }
  32.  
  33. int addScan(Sql db, String topDirPath) {
  34.     def ds = db.dataSet('SCAN')
  35.     ds.add(id: ++LASTID, path: topDirPath)
  36.     return LASTID
  37. }
  38.  
  39. int addDirOrFile(DataSet ds, int scanId, Integer parentId, File file) {
  40.     def values = [id: ++LASTID, scan: scanId, path: file.path,
  41.                   mtime: new Timestamp(file.lastModified())]
  42.     if (parentId) values.parent = parentId
  43.     if (!file.isDirectory()) values.size = file.size()
  44.     ds.add(values)
  45.     return LASTID
  46. }
  47.  
  48. // Recursively collect directory data
  49. int collectDir(DataSet ds, int scanId, Integer parentId, File dir) {
  50.     int currentDirId = addDirOrFile(ds, scanId, parentId, dir)
  51.     int fileCount = 1
  52.  
  53.     dir.eachFile {file ->
  54.         if (file.isDirectory()) {
  55.             fileCount += collectDir(ds, scanId, currentDirId, file)
  56.         } else {
  57.             addDirOrFile(ds, scanId, currentDirId, file)
  58.             fileCount++
  59.         }
  60.     }
  61.  
  62.     return fileCount
  63. }
  64.  
  65. int setNextId(Sql db) {
  66.     def row = db.firstRow('SELECT MAX(id) as maxid FROM file')
  67.     Integer id = row.MAXID
  68.     LASTID = id ?: 1
  69. }
  70.  
  71. // Be useful
  72. (args.size()> 0)? collect(args[0]) : println("Top directory path required")

The program expects a directory path on the command line. It will traverse the file system beginning with that directory and store file and directory data. Like the Neo4j equivalent it does everything in a single transaction.

You will find the main logic in the recursive collectDir method. It runs through all entries of a directory. If an entry is a directory it invokes itself. Otherwise it just adds a record to the file table.

The "nosql"-ness comes from using Groovy DataSet. DataSets have no way of picking up values assigned to autoincremented columns, so there is a tiny bit of SQL after all. We also had to avoid autoincrement and manage record ids explicitly. This detail takes away some of the Groovy elegance.

After making a few scans it is time to retrieve data. The obvious choice is to use ij that comes with Apache Derby. It is an interactive, command line tool. The beauty of JDBC is that there are many other JDBC-compatible tool we could use, some with a graphical user interface.

Listing scans and asking for the oldest file with a size > 5000000 looks like this (with a scan id provided in the second statement):

SQL:

  1. SELECT * FROM scan ORDER BY tstamp;
  2. SELECT path, size, mtime FROM file WHERE scan=570467 AND size> 5000000 ORDER BY mtime fetch first row only;

However, this is supposed to be a head-to-head showdown, so we also provide Groovy code to do the same thing.

GROOVY:

  1. import java.sql.*
  2. import groovy.sql.*
  3.  
  4. DBURL = 'jdbc:derby:/sdb3/cur/data/derby1'
  5. def DB = null
  6.  
  7. def query(String scanIdString) {
  8.     try {
  9.         def cnx = java.sql.DriverManager.getConnection(DBURL)
  10.         DB = new Sql(cnx)
  11.         scanIdString? findFile(scanIdString as Long) : listScans()
  12.         DB.commit()
  13.     } catch (Throwable exc) {
  14.         DB.rollback()
  15.         println "OOPS! ${exc.message}"
  16.     } finally {
  17.         try { // Derby shutdown requires a new connection
  18.             java.sql.DriverManager.getConnection(DBURL + ';shutdown=true')
  19.         } catch (SQLException exc) {
  20.             println "Shutdown message: ${exc.message}"
  21.         }
  22.     }
  23. }
  24.  
  25. def listScans() {
  26.     def ds = DB.dataSet('SCAN')
  27.     ds.each {println "Scan ${it.ID}: ${it.PATH}"}
  28. }
  29.  
  30. def findFile(Long scanId) {
  31.     def ds = DB.dataSet(createView(scanId))
  32.     def row = ds.findAll{it.size> 5000000}.sort{it.mtime}.firstRow()
  33.     if (row) {
  34.         println "${row.PATH} size ${row.SIZE} modified ${row.MTIME}"
  35.     } else {
  36.         println "File matching criteria not found"
  37.     }
  38. }
  39.  
  40. String createView(Long scanId) {
  41.     String viewName = "VIEW${scanId}"
  42.     def row = DB.firstRow('SELECT count(*) as VCOUNT FROM sys.systables WHERE tablename=?', [viewName])
  43.     Integer count = row.VCOUNT
  44.     if (count == 0) {
  45.         String q = "CREATE VIEW ${viewName} AS SELECT * FROM file WHERE scan=${scanId}"
  46.         DB.execute(q)
  47.     }
  48.  
  49.     return viewName
  50. }
  51.  
  52. // Be useful
  53. (args.size()> 0)? query(args[0]) : query(null)

The code takes some explanation. The listScans method is obvious enough, but how about findFile?

The findAll call in the findFile method looks like a procedural statement where we iterate over the tuples of the file table. Maybe you don't believe this, but the two closures (in findAll and sort) are never executed! They are compiled into a SQL statement which is run by firstRow. This is stunning nosql.

Well, to arrive at the elegant code we have to go through some SQL hoops after all. (Maybe someone out there can show me a better solution.) The only values allowed in the findAll conditions are literals. So we define a view for the query to operate on. This is done in the createView method. It is possible to use exception handling to check if a view exists, but there will be an ugly message on the console even if you catch the exception. In addition, strings containing ${...} (GString) are lazily evaluated and intelligently replaced by SQL parameters by DB.execute. So it is necessary to convert the GString to an ordinary String before submitting it as SQL because the view name cannot be a parameter.

In summary we have created almost nosql counterparts to the programs exercising Neo4j in the previous post. Running the last program yields,

/usr/local/thunderbird/thunderbird-bin size 12403548 modified 2006-12-07 09:05:44.0

With a sigh of relief it we note that this is the same file found by the Neo4j retrieval program. The modification time format differs only by including fractional seconds. The reason is that we use the Derby Timestamp datatype without formatting.

The two combatants are finally ready for the final fierce, no-mercy showdown.

Comments are closed.