Friday, May 7th, 2010

Groovy and Neo4j More Seriously

Neo4j is a graph database, recently released in a 1.0 version (see the Neo4j site). A previous post showed a trivial example of using Neo4j from Groovy.

This post contains an example somewhat closer to real life. It is also the first entry in a Neo vs. relational head-to-head no-mercy showdown.

The program scans the local file system and stores file data in a Neo database. The file system is a deep hierarchy. Even though it is fairly tidy it is the type of structure that might cause problems with a relational database. Here is the Groovy code.

GROOVY:

  1. import org.neo4j.kernel.*
  2. import org.neo4j.graphdb.*
  3.  
  4. DBPATH = '/sdb3/cur/data/neo2'
  5. START = System.currentTimeMillis()
  6. DB = new EmbeddedGraphDatabase(DBPATH)
  7.  
  8. // Declare relationship types
  9. enum RELTYPE implements RelationshipType {
  10.     CONTAINS, SCANS
  11. }
  12.  
  13. // Expando magic adds to the PropertyContainer interface
  14. ExpandoMetaClass.enableGlobally()
  15. PropertyContainer.metaClass.getProperty = {name -> delegate.getProperty(name)}
  16. PropertyContainer.metaClass.setProperty = {name, val -> delegate.setProperty(name, val)}
  17.  
  18. // Collect file system data
  19. def collect(String topDirPath) {
  20.     println "${topDirPath} to be scanned into Neo store ${DB.storeDir}"
  21.     def tx = DB.beginTx()
  22.     def topDir = new File(topDirPath)
  23.  
  24.     try {
  25.         def scan = addScan(topDir)
  26.         def fileCount = 1 + collectDir(scan, topDir)
  27.         tx.success()
  28.         println "Files collected: ${fileCount}"
  29.     } catch (Throwable exc) {
  30.         println "OOPS! ${exc.message}"
  31.     } finally {
  32.         tx.finish();
  33.         DB.shutdown()
  34.     }
  35.  
  36.     def stopTime = System.currentTimeMillis()
  37.     println "Processing time: ${stopTime - START} ms"
  38. }
  39.  
  40. // Add a new scan (the top directory) to the database
  41. // RETURN the added node
  42. def addScan(File topDir) {
  43.     def scan = createFileNode(topDir)
  44.     def refNode = DB.referenceNode
  45.     def rel = refNode.createRelationshipTo(scan, RELTYPE.SCANS)
  46.     rel.tstamp = System.currentTimeMillis()
  47.     return scan
  48. }
  49.  
  50. Node createFileNode(File file) {
  51.     def node = DB.createNode()
  52.     node.path = file.path
  53.     if (!file.isDirectory()) node.size = file.size()
  54.     node.mtime = file.lastModified()
  55.     return node
  56. }
  57.  
  58. // Recursively collect directory data
  59. int collectDir(Node parentNode, File parentDir) {
  60.     int fileCount = 0
  61.     parentDir.eachFile {file ->
  62.         def node = createFileNode(file)
  63.         fileCount++
  64.         parentNode.createRelationshipTo(node, RELTYPE.CONTAINS)
  65.  
  66.         if (file.isDirectory()) {
  67.             fileCount += collectDir(node, file)
  68.         }
  69.     }
  70.  
  71.     return fileCount
  72. }
  73.  
  74. // Be useful
  75. (args.size()> 0)? collect(args[0]) : println("Top directory path required")

The program takes a directory path from the command line. It scans the file system from that directory and creates a data structure linked to the reference node. The reference node is the well known entry point to any Neo database.

Every time the program is run the addScan method (line 42) adds a new node linked to the reference node by the SCANS relationship. A node representing a scan is a normal file node, but its link to the reference node has a property, the time of the scan.

Beginning from the top directory the collectDir method (line 59) recursively builds a data structure in Neo that mimics the file system. A node is created for each directory and each file. The path is stored in a path property. Files are distinguished from directories by having a size property. These nodes have a property called mtime that records the point in time when the file or directory was last modified.

The collect method (line 19) wraps up the action by supplying exception and transaction management. Well, everything is actually done in a single transaction. The method also adds timing.

The last line starts the program after checking that there is a command line argument. See the previous post for details on how to run the program.

After adding a few file system scans it is time to query the database. We would like to find the file with the oldest modification time with size > 5 MiB in a given scan. We have to write another program.

GROOVY:

  1. import org.neo4j.kernel.*
  2. import org.neo4j.graphdb.*
  3. import java.text.SimpleDateFormat
  4.  
  5. DBPATH = '/sdb3/cur/data/neo2'
  6. DB = new EmbeddedGraphDatabase(DBPATH)
  7.  
  8. // Define relationship types
  9. enum RELTYPE implements RelationshipType {
  10.     CONTAINS, SCANS
  11. }
  12.  
  13. // Metaclass magic
  14. ExpandoMetaClass.enableGlobally()
  15. PropertyContainer.metaClass.getProperty = {name -> delegate.getProperty(name)}
  16. PropertyContainer.metaClass.setProperty = {name, val -> delegate.setProperty(name, val)}
  17.  
  18. def query(String scanIdString) {
  19.     def tx = DB.beginTx()
  20.  
  21.     try {
  22.         scanIdString? findFile(scanIdString as Long) : listScans()
  23.         tx.success()
  24.     } catch (Throwable exc) {
  25.         println "OOPS! ${exc.message}"
  26.     } finally {
  27.         tx.finish();
  28.         DB.shutdown()
  29.     }
  30. }
  31.  
  32. def listScans() {
  33.     def refNode = DB.referenceNode
  34.     def trav = refNode.traverse(Traverser.Order.BREADTH_FIRST,
  35.                                 StopEvaluator.DEPTH_ONE,
  36.                                 ReturnableEvaluator.ALL_BUT_START_NODE,
  37.                                 RELTYPE.SCANS,
  38.                                 Direction.OUTGOING)
  39.     def fmt = new SimpleDateFormat('yyyy-MM-dd HH:mm:ss')
  40.     trav.each {
  41.         TraversalPosition pos = trav.currentPosition()
  42.         Relationship rel = pos.lastRelationshipTraversed()
  43.  
  44.         // Omit failed runs
  45.         if (rel.hasProperty('tstamp')) {
  46.             Date tstamp = new Date(rel.tstamp)
  47.             println "Scan ${it.getId()}: ${fmt.format(tstamp)} ${it.path}"
  48.         }
  49.     }
  50. }
  51.  
  52. def findFile(Long scanId) {
  53.     def refNode = DB.referenceNode
  54.     def trav = refNode.traverse(Traverser.Order.BREADTH_FIRST,
  55.                                 StopEvaluator.DEPTH_ONE,
  56.                                 ReturnableEvaluator.ALL_BUT_START_NODE,
  57.                                 RELTYPE.SCANS,
  58.                                 Direction.OUTGOING)
  59.     def scan = trav.find {it.getId() == scanId}
  60.     if (scan) {
  61.         doFindFile(scan)
  62.     } else {
  63.         println "Scan with id ${scanId} not found"
  64.     }
  65. }
  66.  
  67. def doFindFile(Node scan) {
  68.     def trav = scan.traverse(Traverser.Order.DEPTH_FIRST,
  69.                              StopEvaluator.END_OF_GRAPH,
  70.                              {pos ->
  71.                                  def node = pos.currentNode()
  72.                                  node.hasProperty('size') && node.size> 5000000
  73.                              } as ReturnableEvaluator,
  74.                              RELTYPE.CONTAINS,
  75.                              Direction.OUTGOING)
  76.     def found = trav.inject(null) {oldest, cur ->
  77.         oldest? ((cur.mtime <oldest.mtime)? cur : oldest) : cur
  78.     }
  79.  
  80.     if (found) {
  81.         def fmt = new SimpleDateFormat('yyyy-MM-dd HH:mm:ss')
  82.         println "${found.path} size ${found.size} modified ${fmt.format(found.mtime)}"
  83.     } else {
  84.         println "File matching criteria not found"
  85.     }
  86. }
  87.  
  88. // Be useful
  89. (args.size()> 0)? query(args[0]) : query(null)

The program takes an optional command line argument. If you don't provide one it lists all file system scans in the database. The list may look like this,
Scan 207864: 2010-05-06 14:55:04 /usr/local
Scan 158019: 2010-05-06 14:53:36 /usr/local
Scan 96129: 2010-05-04 17:16:04 /usr/local
Scan 42655: 2010-05-04 13:19:45 /usr/local
Scan 35546: 2010-05-04 13:17:55 /usr/local/src
Scan 28437: 2010-04-05 20:28:29 /usr/local/src
Scan 21328: 2010-04-05 20:27:43 /usr/local/src

Pick one of the scans and run the program again with the scan id as command line argument. Here is the result of specifying scan 207864:
/usr/local/thunderbird/thunderbird-bin size 12403548 modified 2006-12-07 09:05:44

So we found there is a real old Thunderbird version taking up space on this machine. This exercise had some practical value after all...

A few concluding comments on the code. Searches are based on the Neo4j Traverser. There is no indexing, so searches are linear. The listScans method (line 32) uses an all-constant Traverser to run through all scans.

The findFile method (line 52) begins by finding a specific scan. We assume the number of scans is small, so a Groovy find outside the database is ok. After finding the scan the doFindFile method (line 67) is invoked. In this case we define our own ReturnableEvaluator the Groovy way (line 70). Finally we use a Groovy inject (line 76) to find the oldest of the files.

Comments are closed.