Friday, May 2nd, 2008

PDF the Groovy Way

Because of the strong ties between Java and Groovy, Groovy benefits from an enormous number of Java libraries and utilities. This time we will take a brief look at what you can do with PDF documents.

The leading Java library for PDF manipulation is iText. It has functions for all your fancies, and then some. In my experience it is also remarkably well supported. Being Java, it is platform independent.

Let's create a program that, given a folder, recursively extracts document properties from all PDF files in the folder and subfolders.

GROOVY:

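    // Scan a folder tree and print the document properties of every PDF found.
    def root = new File("c:/tmp/groovy/docs")
    println "PDF documents found under ${root.path}"
    root.eachFileRecurse { file ->
        // Skip directories and non-PDF files; return only ends this closure call.
        if (!file.isFile() || !PdfProps.isPdf(file)) return
        def pdf = new PdfProps(file)
        println "--- PDF document: ${pdf.file.name}"
        // props is the assumed name of the property map built by PdfProps
        pdf.props.each { entry ->
            println "${entry.key} = ${entry.value}"
        }
    }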

What you see above is the neat file scanning logic. Note that return just returns from the current closure invocation, not from the script. The script assumes there is a class called PdfProps with a method isPdf for checking if a file is a PDF file. That class is also expected to build a map containing document properties.
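
The closure-return behaviour is easy to check in isolation. The following is a minimal sketch, not the script itself: it builds a throwaway folder tree and uses a hypothetical extension check (looksLikePdf) as a stand-in for PdfProps.isPdf.

```groovy
import java.nio.file.Files

// Hypothetical stand-in for PdfProps.isPdf: looks at the file extension only.
def looksLikePdf = { File f -> f.name.toLowerCase().endsWith('.pdf') }

// Build a small throwaway tree: two "PDF" files and one text file.
def root = Files.createTempDirectory('pdfscan').toFile()
def sub = new File(root, 'sub')
sub.mkdir()
new File(root, 'a.pdf').text = '%PDF-1.4'
new File(root, 'notes.txt').text = 'not a pdf'
new File(sub, 'b.pdf').text = '%PDF-1.4'

def found = []
root.eachFileRecurse { file ->
    // return ends only this closure call; the traversal moves on to the next entry
    if (!file.isFile() || !looksLikePdf(file)) return
    found << file.name
}
found.sort()
println found
```

Both PDF files end up in the list even though the closure returned early for the subfolder and the text file.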

Here is an implementation of PdfProps.

GROOVY:

    import com.lowagie.text.pdf.PdfDate
    import com.lowagie.text.pdf.PdfReader

    // Read the document properties of a PDF document.
    // The file where this document is stored.
    // Hash map containing document properties
    // Construct: opens a document, reads its info, then closes the file.
    // Map from our own property names to values
    'fname'
    'fpath'
    'fsize'     // kilobytes
    'app'       'Creator'
    'title'     'Title'
    'author'    'Author'
    'created'   'CreationDate'
    'mod'       'ModDate'
    'subject'   'Subject'
    'keywords'  'Keywords'
    // Convert PDF date to java.util.Date

The isPdf method uses concise Groovy idioms for reading the first line of the file and matching it against a regular expression. Another Groovy idiom is used in the mapNonEmpty method to check that a string is neither null, nor empty, nor made up only of white space.
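
Both idioms fit in a few lines. The sketch below is an assumption about how they might look, not the original PdfProps code: the regular expression, the nonBlank closure name, and the temporary test files are all made up for illustration.

```groovy
import java.nio.file.Files

// Assumed check: the first line of a PDF starts with a header such as "%PDF-1.4".
def isPdf = { File f ->
    if (!f.isFile()) return false
    // readLine() yields null for an empty file, and null never matches
    String firstLine = f.withReader { it.readLine() }
    // ==~ requires the whole line to match the pattern
    firstLine ==~ /%PDF-\d\.\d.*/
}

// Groovy truth: null and '' are false, so trim() also rules out whitespace-only strings.
def nonBlank = { String s -> s?.trim() ? true : false }

def dir = Files.createTempDirectory('pdfcheck').toFile()
def pdf = new File(dir, 'doc.pdf')
pdf.text = '%PDF-1.4\nbinary content would follow'
def txt = new File(dir, 'readme.txt')
txt.text = 'plain text'

println "${pdf.name}: ${isPdf(pdf)}"
println "${txt.name}: ${isPdf(txt)}"
println([null, '', '   ', 'Title'].collect { nonBlank(it) })
```

Checking the header rather than the file extension means a renamed or mislabelled file is still classified by its content.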

The iText library is actually only used in a few places: to create a PDF reader, to extract the document properties from that reader, and to decode PDF dates. The rest is mainly bookkeeping.
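
Those three touch points might look roughly like this. It is a sketch under assumptions, not the original class: the helper names readInfo and toDate are hypothetical, and the @Grab coordinates assume the com.lowagie iText 2.1.7 artifact from Maven Central. PdfReader, getInfo and PdfDate.decode are the iText calls involved.

```groovy
// Assumed Maven coordinates for the com.lowagie line of iText.
@Grab('com.lowagie:itext:2.1.7')
import com.lowagie.text.pdf.PdfDate
import com.lowagie.text.pdf.PdfReader

// Hypothetical helper: return the raw Info entries (Title, Author, CreationDate, ...).
Map readInfo(File file) {
    def reader = new PdfReader(file.path)
    try {
        return reader.info
    } finally {
        // Release the file as soon as the properties have been read.
        reader.close()
    }
}

// Hypothetical helper: PDF dates are strings such as "D:20040213112900+01'00'".
// PdfDate.decode parses them into a java.util.Calendar.
Date toDate(String pdfDate) {
    PdfDate.decode(pdfDate)?.time
}

println toDate("D:20040213112900+01'00'")
```

Everything else in such a class is plain Groovy: file names, sizes and the map itself need no PDF library at all.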

This example shows that Groovy is terse without losing readability. The ability to tap transparently into the enormous treasure of Java libraries is invaluable.

Oh, did I forget something? Like running the program? OK, let's use the script engine I introduced in another post.

GROOVY:

    "c:/tmp/groovy"  "PdfTest.groovy"
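
The engine from the earlier post is not reproduced here. Assuming it is Groovy's stock GroovyScriptEngine, pointing it at a root folder and running a script by name could look like the sketch below. The temporary folder, the Hello.groovy script and the name binding are invented for the example; for the program in this post the root would be c:/tmp/groovy and the script PdfTest.groovy.

```groovy
import java.nio.file.Files

// Throwaway script root containing one tiny script (both invented for this example).
def root = Files.createTempDirectory('scripts').toFile()
new File(root, 'Hello.groovy').text = 'return "hello from ${name}"'

// The engine resolves script names relative to the given root.
def engine = new GroovyScriptEngine(root.path)

// The binding passes variables into the script; run returns the script's result.
def result = engine.run('Hello.groovy', new Binding(name: 'script'))
println result
```

The same engine also picks up other scripts and classes under that root, which fits a main script that refers to a separate PdfProps class.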

And here is some sample output.


PDF documents found under c:\tmp\groovy\docs
--- PDF document: cederqvist-1.11.13.pdf
fname = cederqvist-1.11.13.pdf
fpath = c:\tmp\groovy\docs\cederqvist-1.11.13.pdf
fsize = 1024
app = TeX
created = Fri Feb 13 11:29:00 CET 2004
--- PDF document: ora-jdbc-devgref.pdf
fname = ora-jdbc-devgref.pdf
fpath = c:\tmp\groovy\docs\ora-jdbc-devgref.pdf
fsize = 8079
app = FrameMaker 7.0
title = Oracle Database JDBC Developer's Guide and Reference
author = Oracle Corporation
created = Fri Nov 11 21:42:11 CET 2005
mod = Fri Nov 11 21:42:11 CET 2005
subject = Oracle Database
keywords = JDBC, drivers, Java
--- PDF document: ora-sql-ref.pdf
fname = ora-sql-ref.pdf
fpath = c:\tmp\groovy\docs\ora-sql-ref.pdf
fsize = 12154
app = FrameMaker 7.0
title = Oracle Database SQL Reference
author = Oracle Corporation
created = Mon Nov 07 09:42:39 CET 2005
mod = Mon Nov 07 10:41:30 CET 2005
subject = Oracle Database
--- PDF document: pdf_reference.pdf
fname = pdf_reference.pdf
fpath = c:\tmp\groovy\docs\pdf_reference.pdf
fsize = 15696
app = Adobe Acrobat 8.1 Combine Files
title = PDF Reference and Related Documentation
author = Adobe Developer Support
created = Tue Jan 17 23:22:59 CET 2006
mod = Mon Oct 22 22:00:03 CEST 2007
subject = Adobe Acrobat, Version 8.1
