Friday, May 7th, 2010

Groovy and Neo4j More Seriously

Neo4j is a graph database, recently released in a 1.0 version (see the Neo4j site). A previous post showed a trivial example of using Neo4j from Groovy.

This post contains an example somewhat closer to real life. It is also the first entry in a Neo vs. relational head-to-head no-mercy showdown.

The program scans the local file system and stores file data in a Neo database. The file system is a deep hierarchy. Even though it is fairly tidy, it is the type of structure that might cause problems with a relational database. Here is the Groovy code.

GROOVY:

  import org.neo4j.kernel.*
  import org.neo4j.graphdb.*

  DBPATH = '/sdb3/cur/data/neo2'

  // Declare relationship types
  // Expando magic adds to the PropertyContainer interface
  // Collect file system data
  "${topDirPath} to be scanned into Neo store ${DB.storeDir}"
  "Files collected: ${fileCount}"
  "OOPS! ${exc.message}"
  "Processing time: ${stopTime - START} ms"

  // Add a new scan (the top directory) to the database
  // RETURN the added node
  // Recursively collect directory data
  // Be useful
  "Top directory path required"

The program takes a directory path from the command line. It scans the file system from that directory and creates a data structure linked to the reference node. The reference node is the well known entry point to any Neo database.

Every time the program is run the addScan method (line 42) adds a new node linked to the reference node by the SCANS relationship. A node representing a scan is a normal file node, but its link to the reference node has a property, the time of the scan.

Beginning from the top directory, the collectDir method (line 59) recursively builds a data structure in Neo that mimics the file system. A node is created for each directory and each file. The path is stored in a path property. Files are distinguished from directories by having a size property. Every node also has an mtime property that records when the file or directory was last modified.
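The shape of that recursion can be tried out without a database. What follows is only a minimal sketch, not the collectDir method from the listing: plain Groovy maps stand in for Neo nodes, and the children key is a hypothetical replacement for the relationships that link a directory node to its contents. The property names path, mtime and size are the ones described above.

```groovy
import java.nio.file.Files

// Plain maps stand in for Neo nodes; this sketch never touches Neo4j.
// The 'children' key is hypothetical, the other property names follow the text.
Map collectDir(File dir) {
    def node = [path: dir.path, mtime: dir.lastModified(), children: []]
    // listFiles() returns null for an unreadable directory
    (dir.listFiles() ?: []).each { File f ->
        if (f.isDirectory()) {
            node.children << collectDir(f)
        } else {
            // Only files carry a size property
            node.children << [path: f.path, mtime: f.lastModified(), size: f.length()]
        }
    }
    return node
}

// Build a small throwaway directory tree to scan
def top = Files.createTempDirectory('scan').toFile()
def sub = new File(top, 'sub')
sub.mkdir()
new File(top, 'a.txt').text = 'hello'
new File(sub, 'b.txt').text = 'hi'

def scan = collectDir(top)
def files = scan.children.findAll { it.containsKey('size') }
def dirs = scan.children.findAll { !it.containsKey('size') }
println "files: ${files*.path}, directories: ${dirs*.path}"

// Clean up the temporary tree
top.deleteDir()
```

The sketch follows symbolic links like any other directory entry, so a link that points back up the tree would make it recurse forever. A real scanner needs a guard against that.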

The collect method (line 19) wraps up the action by supplying exception and transaction management. Well, everything is actually done in a single transaction. The method also adds timing.

The last line starts the program after checking that there is a command line argument. See the previous post for details on how to run the program.

After adding a few file system scans it is time to query the database. We would like to find, in a given scan, the file larger than 5 MiB with the oldest modification time. We have to write another program.

GROOVY:

  import org.neo4j.kernel.*
  import org.neo4j.graphdb.*
  import java.text.SimpleDateFormat

  DBPATH = '/sdb3/cur/data/neo2'

  // Define relationship types
  // Metaclass magic
  "OOPS! ${exc.message}"
  'yyyy-MM-dd HH:mm:ss'
  // Omit failed runs
  'tstamp'
  "Scan ${it.getId()}: ${fmt.format(tstamp)} ${it.path}"
  "Scan with id ${scanId} not found"
  'size'
  'yyyy-MM-dd HH:mm:ss'
  "${found.path} size ${found.size} modified ${fmt.format(found.mtime)}"
  "File matching criteria not found"

  // Be useful

The program takes an optional command line argument. If you don't provide one, it lists all file system scans in the database. The list may look like this:
Scan 207864: 2010-05-06 14:55:04 /usr/local
Scan 158019: 2010-05-06 14:53:36 /usr/local
Scan 96129: 2010-05-04 17:16:04 /usr/local
Scan 42655: 2010-05-04 13:19:45 /usr/local
Scan 35546: 2010-05-04 13:17:55 /usr/local/src
Scan 28437: 2010-04-05 20:28:29 /usr/local/src
Scan 21328: 2010-04-05 20:27:43 /usr/local/src

Pick one of the scans and run the program again with the scan id as command line argument. Here is the result of specifying scan 207864:
/usr/local/thunderbird/thunderbird-bin size 12403548 modified 2006-12-07 09:05:44

So we found there is a really old Thunderbird version taking up space on this machine. This exercise had some practical value after all...

A few concluding comments on the code. Searches are based on the Neo4j Traverser. There is no indexing, so searches are linear. The listScans method (line 32) uses an all-constant Traverser to run through all scans.

The findFile method (line 52) begins by finding a specific scan. We assume the number of scans is small, so a Groovy find outside the database is ok. After finding the scan the doFindFile method (line 67) is invoked. In this case we define our own ReturnableEvaluator the Groovy way (line 70). Finally we use a Groovy inject (line 76) to find the oldest of the files.
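The find, filter and inject steps can be illustrated on plain data. This is a rough sketch under stated assumptions, not the code from the listing: the scans list, the traverse helper and the sample paths are all hypothetical, and nested maps stand in for Neo nodes. The closure passed to traverse plays the role that the ReturnableEvaluator plays in the real program.

```groovy
// Hypothetical in-memory scans; in the post these are Neo nodes reached by a Traverser
def MIB = 1024L * 1024L
def scans = [
    [id: 1, path: '/data', children: [
        [path: '/data/old.bin',   size: 6 * MIB, mtime: 1000L],
        [path: '/data/small.txt', size: 10L,     mtime: 500L],
        [path: '/data/sub', mtime: 3000L, children: [
            [path: '/data/sub/new.bin', size: 8 * MIB, mtime: 2000L]
        ]]
    ]]
]

// Linear depth-first walk; the closure decides which nodes are returned
List traverse(Map node, Closure returnable) {
    def hits = returnable(node) ? [node] : []
    (node.children ?: []).each { hits.addAll(traverse(it, returnable)) }
    return hits
}

// Groovy find picks the scan by id, outside the "database"
def scan = scans.find { it.id == 1 }

// Keep only files (nodes with a size property) larger than 5 MiB
def big = traverse(scan) { it.size != null && it.size > 5 * MIB }

// inject folds the candidates down to the one with the smallest mtime
def found = big.inject(null) { oldest, f ->
    (oldest == null || f.mtime < oldest.mtime) ? f : oldest
}

println found ? "${found.path} size ${found.size}" : 'File matching criteria not found'
```

Starting inject from null keeps the empty case simple: if no file matches, found stays null and the "not found" message is printed.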
