Friday, May 2nd, 2008

PDF the Groovy Way

Because of the strong ties between Java and Groovy, Groovy benefits from an enormous number of Java utilities. This time we will take a brief look at what you can do with PDF documents.

The leading Java library for PDF manipulation is iText. It has functions for all your fancies, and then some. In my experience it also has remarkably good support. It's Java and thus platform independent.

Let's create a program that, given a folder, recursively extracts document properties from all PDF files in the folder and subfolders.

GROOVY:

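def root = new File("c:/tmp/groovy/docs")
println "PDF documents found under ${root.path}"
root.eachFileRecurse { file ->
    if (!file.isFile() || !PdfProps.isPdf(file)) return
    def pdf = new PdfProps(file)
    println "--- PDF document: ${pdf.file.name}"
    // props is an assumed name for the property map built by PdfProps
    pdf.props.each { entry ->
        println "${entry.key} = ${entry.value}"
    }
}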

What you see above is the neat file scanning logic. Note that return just returns from the current closure invocation, not from the script. The script assumes there is a class called PdfProps with a method isPdf for checking if a file is a PDF file. That class is also expected to build a map containing document properties.
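
The remark about return can be checked in isolation. The following is a minimal sketch, not part of the original script; the file names and variable names are made up for illustration. It shows that return inside a closure passed to each only ends the current invocation, so the iteration carries on with the next element.

```groovy
// Hypothetical data, for illustration only.
def names = ['a.pdf', 'notes.txt', 'b.pdf']
def seen = []

names.each { name ->
    // Leaves this closure invocation only; each() moves on to the next element.
    if (!name.endsWith('.pdf')) return
    seen << name
}

// Still reached: the script itself did not return.
println seen   // prints [a.pdf, b.pdf]
```

If the closure's return ended the whole script, the final println would never run.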

Here is an implementation of PdfProps.

GROOVY:

import com.lowagie.text.pdf.PdfDate
import com.lowagie.text.pdf.PdfReader

// Read the document properties of a PDF document.
// The file where this document is stored.
// Hash map containing document properties
// Construct: opens a document, reads its info, then closes the file.
// Map from our own property names to values:
//   'fname', 'fpath', 'fsize' (kilobytes)
//   'app' = 'Creator', 'title' = 'Title', 'author' = 'Author',
//   'created' = 'CreationDate', 'mod' = 'ModDate',
//   'subject' = 'Subject', 'keywords' = 'Keywords'
// Convert PDF date to java.util.Date

The isPdf method uses concise Groovy idioms to read the first line of the file and match it against a regular expression. The mapNonEmpty method uses another Groovy idiom to check that a string is not null, empty, or whitespace-only.
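
As a rough illustration of these two idioms, here is a sketch under stated assumptions: the method signatures, the exact regular expression and the sample values are mine, not taken from the original class. Only the method names isPdf and mapNonEmpty come from the text.

```groovy
// Hypothetical stand-ins for the helpers described above; signatures are assumptions.

// True if the first line of the file looks like a PDF header, e.g. "%PDF-1.4".
boolean isPdf(File file) {
    if (!file.isFile()) return false
    String firstLine = file.withReader('ISO-8859-1') { it.readLine() }
    return firstLine != null && (firstLine ==~ /%PDF-\d\.\d.*/)
}

// Stores the value under key only if it is not null, empty, or whitespace-only.
// value?.trim() is null-safe, and an empty string is false under Groovy truth.
void mapNonEmpty(Map map, String key, String value) {
    if (value?.trim()) map[key] = value
}

def props = [:]
mapNonEmpty(props, 'title', 'PDF Reference')
mapNonEmpty(props, 'author', '   ')   // skipped: whitespace only
mapNonEmpty(props, 'subject', null)   // skipped: null
println props                         // prints [title:PDF Reference]
```

The ==~ operator requires the whole line to match, which is why the pattern ends in .* to allow for anything after the version number.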

The iText library is actually used in only three places: to create a PDF reader, to extract the document properties from the reader, and to decode a PDF date. The rest is mainly bookkeeping.
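
Those three iText touch points can be sketched on their own. This is not the original PdfProps code: the function names readInfo and toDate are hypothetical, and the sketch assumes an iText 2.x jar (package com.lowagie) is on the classpath. The calls used are the PdfReader(String) constructor, getInfo(), close() and the static PdfDate.decode(String).

```groovy
import com.lowagie.text.pdf.PdfDate
import com.lowagie.text.pdf.PdfReader

// Reads the info dictionary of a PDF and converts its date entries.
// readInfo is a hypothetical name; requires iText 2.x on the classpath.
Map readInfo(String path) {
    def reader = new PdfReader(path)              // 1. create a PDF reader
    Map info
    try {
        info = new LinkedHashMap(reader.info)     // 2. extract document properties
    } finally {
        reader.close()                            // release the file in any case
    }
    ['CreationDate', 'ModDate'].each { key ->
        if (info[key]) info[key] = toDate(info[key])
    }
    return info
}

// 3. Decode a PDF date such as "D:20040213112900+01'00'" into a java.util.Date.
// PdfDate.decode returns a Calendar, or null when it cannot parse the string;
// in that case the raw string is kept.
def toDate(String pdfDate) {
    PdfDate.decode(pdfDate)?.time ?: pdfDate
}
```

Closing the reader in a finally block is a design choice of this sketch, so that the file is released even if reading the info dictionary fails.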

This example shows that Groovy is terse without losing readability. Being able to tap transparently into the enormous treasure of Java libraries is invaluable.

Oh, did I forget something? Like running the program? OK, let's use the script engine I introduced in another post.

GROOVY:

def engine = new GroovyScriptEngine("c:/tmp/groovy")   // assumed: GroovyScriptEngine with this script root
engine.run("PdfTest.groovy", new Binding())

And here is some sample output.

PDF documents found under c:\tmp\groovy\docs
--- PDF document: cederqvist-1.11.13.pdf
fname = cederqvist-1.11.13.pdf
fpath = c:\tmp\groovy\docs\cederqvist-1.11.13.pdf
fsize = 1024
app = TeX
created = Fri Feb 13 11:29:00 CET 2004
--- PDF document: ora-jdbc-devgref.pdf
fname = ora-jdbc-devgref.pdf
fpath = c:\tmp\groovy\docs\ora-jdbc-devgref.pdf
fsize = 8079
app = FrameMaker 7.0
title = Oracle Database JDBC Developer's Guide and Reference
author = Oracle Corporation
created = Fri Nov 11 21:42:11 CET 2005
mod = Fri Nov 11 21:42:11 CET 2005
subject = Oracle Database
keywords = JDBC, drivers, Java
--- PDF document: ora-sql-ref.pdf
fname = ora-sql-ref.pdf
fpath = c:\tmp\groovy\docs\ora-sql-ref.pdf
fsize = 12154
app = FrameMaker 7.0
title = Oracle Database SQL Reference
author = Oracle Corporation
created = Mon Nov 07 09:42:39 CET 2005
mod = Mon Nov 07 10:41:30 CET 2005
subject = Oracle Database
--- PDF document: pdf_reference.pdf
fname = pdf_reference.pdf
fpath = c:\tmp\groovy\docs\pdf_reference.pdf
fsize = 15696
app = Adobe Acrobat 8.1 Combine Files
title = PDF Reference and Related Documentation
author = Adobe Developer Support
created = Tue Jan 17 23:22:59 CET 2006
mod = Mon Oct 22 22:00:03 CEST 2007
subject = Adobe Acrobat, Version 8.1
