Friday, May 2nd, 2008

Office Documents the Groovy Way

Is there anything groovy about office documents? How about getting document properties out of all kinds of MS Office documents without running an Office application? Add platform independence as a bonus. Did you say impossible?

The whole set of MS Office applications has a common basic file format, or should I say, they all wrap their documents in a similar way. This common wrapper contains, among other things, document properties. In Office applications you find document properties in the File menu.
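That common wrapper is the OLE2 compound file format, and files stored in it begin with a fixed eight-byte signature. As a rough sketch only, and not the approach this post relies on, the signature can be checked in plain Groovy without any library. The helper name looksLikeOle2 is hypothetical.

```groovy
// Hypothetical helper, not part of the original post.
// Returns true when the file starts with the OLE2 compound file signature.
boolean looksLikeOle2(File file) {
    def signature = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]
    if (!file.isFile() || file.length() < signature.size()) {
        return false
    }
    def header = new byte[signature.size()]
    file.withInputStream { stream ->
        // readFully fills the whole buffer or throws
        new DataInputStream(stream).readFully(header)
    }
    // Bytes are signed in Java, so mask them before comparing
    header.toList().collect { it & 0xFF } == signature
}
```

A signature match only says the file uses the wrapper. It does not say which Office application wrote it, or that it contains any document properties.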

Apache POI is a Java library that understands the binary file formats of the main Office applications up to Office 2007, when XML-based formats took over. It is a very capable tool. Reading document properties is just an easy start.

We will use the Groovy language. The solution will look very similar to a previous post that shows how you may read PDF document properties. Groovy blends effortlessly with Java.

First the framework, a script that descends into a folder and its subfolders finding all Office documents. It also prints the properties of the Office docs it finds.

GROOVY:

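    // Reconstructed from the fragments that survive of this listing;
    // the OfficeProps member names (isOfficeDoc as a static method, props) are assumed.
    def root = new File("c:/tmp/groovy/docs")
    println "Office documents found under ${root.path}"
    root.eachFileRecurse { file ->
        if (OfficeProps.isOfficeDoc(file)) {
            def doc = new OfficeProps(file)
            println "--- Office document: ${doc.file.name}"
            doc.props.each { entry ->
                println "${entry.key} = ${entry.value}"
            }
        }
    }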
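The traversal itself does not depend on POI, so it can be tried on its own. Here is a minimal sketch, assuming a Groovy version that ships groovy.io.FileType. It uses a plain file extension filter as a stand-in for a real Office document check. The helper name findOfficeCandidates and the extension list are assumptions made for illustration.

```groovy
import groovy.io.FileType

// Hypothetical helper: collect files under root whose extension
// suggests a binary Office document. This is only a name-based guess.
List<File> findOfficeCandidates(File root) {
    def extensions = ['doc', 'xls', 'ppt']
    def found = []
    // Visit regular files only, descending into all subfolders
    root.eachFileRecurse(FileType.FILES) { file ->
        def name = file.name.toLowerCase()
        if (extensions.any { ext -> name.endsWith('.' + ext) }) {
            found << file
        }
    }
    // Sort by path so the result order does not depend on the file system
    found.sort { it.path }
}
```

An extension filter is cheap but easily fooled by renamed files, which is one reason to let the class that reads the document decide.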

It's great to be able to scan a file tree so easily. But the main job remains to be done. We need a class OfficeProps that has a method isOfficeDoc and that is able to collect document properties. It could look like this.

GROOVY:

    import org.apache.poi.poifs.eventfilesystem.POIFSReader
    import org.apache.poi.poifs.filesystem.POIFSFileSystem

    // File reader and its callback listener.
    // The file where this document is stored.
    // Document properties.

If it doesn't look very complicated, it's because it's incomplete! The file system reader uses callbacks for extracting information. We will add that in a second. I made the file reader static to save some initialization per document. You will have to change this if you need a thread-safe class.

The class also shows a useful Groovy idiom for reading a file. The withInputStream construct makes sure the file is closed when you return from the closure.
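For readers who have not met withInputStream before, here is a small stand-alone sketch of the idiom. It has nothing to do with POI. The helper name readHead is hypothetical and exists only for this illustration.

```groovy
// Hypothetical helper: read up to `count` leading bytes of a file.
// The stream is opened by withInputStream and closed when the closure ends,
// whether it ends normally or with an exception.
byte[] readHead(File file, int count) {
    file.withInputStream { stream ->
        def buffer = new byte[count]
        int read = stream.read(buffer)
        // read is -1 at end of file and 0 when count is 0
        read <= 0 ? new byte[0] : Arrays.copyOf(buffer, read)
    }
}
```

The value of the closure becomes the value of the withInputStream call, so no explicit try/finally block is needed around the read.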

Perhaps you wonder why a plain Word or Excel file is called a file system? The reason is that the internal structure of these files actually is a file hierarchy with folders and files. For this simple application we don't see much of that structure, however.

Now we will create the listener that the OfficeProps class requires. It is the core of this application.

GROOVY:

    import java.sql.Timestamp

    import org.apache.poi.hpsf.DocumentSummaryInformation
    import org.apache.poi.hpsf.NoPropertySetStreamException
    import org.apache.poi.hpsf.PropertySet
    import org.apache.poi.hpsf.PropertySetFactory
    import org.apache.poi.hpsf.SummaryInformation
    import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent
    import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener

    // Listener callback class for reading Office document properties
    // Map where collected properties end up
    // Keys: 'fname', 'fpath'
    // Convert to kB
    // Key: 'fsize'
    // Ignore non-property set streams
    // Message: "Property set stream \"" ... "\": "
    // Keys: 'title', 'app', 'author', 'created', 'mod', 'subject', 'keywords',
    //       'comment', 'pages', 'template', 'category', 'company', 'manager'
    // Conditionally maps the value
    // Remove strange characters
    // Remove control characters from a string.

Oops, that was quite a mouthful. I could have left out the sanitize method, but it contains a few examples of useful Groovy idioms. And besides, I have seen lots of weird characters in document titles.
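To give an idea of what such a cleanup can look like, here is a minimal sketch. It is a stand-in written for illustration, not the sanitize method from the listing. The only assumptions are that control characters should become spaces and that runs of spaces should collapse into one.

```groovy
// Hypothetical stand-in for a sanitize method; the behavior is an assumption.
String sanitize(String text) {
    if (text == null) {
        return ''
    }
    // Replace every control character (tabs, line breaks, bells...) with a space
    def spaced = text.replaceAll('\\p{Cntrl}', ' ')
    // Collapse runs of spaces and drop leading and trailing blanks
    spaced.replaceAll(' {2,}', ' ').trim()
}
```

For example, a title stored as "Future\u0007 Programming\r\nLanguages\t" comes out as "Future Programming Languages".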

The main event handler tries to create a property set out of every event and returns by means of an exception if the event has nothing to do with property sets. It then goes on to check if the property set is summary information or document summary information. (I didn't choose this terminology, Microsoft did.) In either case we read the properties.

Finally some output from a collection of random Word, Excel and PowerPoint documents. People generally don't use document properties, which is a pity. They are read by all search engines. But then the Office applications do nothing to connect document properties to actual contents.


Office documents found under c:\tmp\groovy\docs
--- Office document: FuturePL.ppt
fname = FuturePL.ppt
fpath = c:\tmp\groovy\docs\FuturePL.ppt
fsize = 184
company = UGA
title = Future Programming Languages
app = Microsoft PowerPoint
author = HONGCHAOLI
created = 2005-04-13 02:35:57.815
mod = 2005-04-15 05:28:20.784
pages = 0
template = Proposal
--- Office document: Guillaume-Laforge-Grails-IJTC.ppt
fname = Guillaume-Laforge-Grails-IJTC.ppt
fpath = c:\tmp\groovy\docs\Guillaume-Laforge-Grails-IJTC.ppt
fsize = 885
title = IJTC 2007 Speaker Template
mod = 2007-11-06 10:55:33.345
pages = 0
--- Office document: hitwise-2008-february-mobile-phone-sites.xls
fname = hitwise-2008-february-mobile-phone-sites.xls
fpath = c:\tmp\groovy\docs\hitwise-2008-february-mobile-phone-sites.xls
fsize = 20
app = Microsoft Excel
author = Vahe Habeshian
created = 2008-04-08 19:47:06.0
mod = 2008-04-08 19:48:06.0
pages = 0
--- Office document: table62.xls
fname = table62.xls
fpath = c:\tmp\groovy\docs\table62.xls
fsize = 143
app = Microsoft Excel
author = EIA
created = 2007-09-13 21:30:47.0
mod = 2007-09-15 18:34:00.0
pages = 0
company = DOE/EIA
--- Office document: WLAN_Roaming.doc
fname = WLAN_Roaming.doc
fpath = c:\tmp\groovy\docs\WLAN_Roaming.doc
fsize = 110
company = TKK
title = ROAMING CONSIDERATIONS FOR THE FINNISH WLAN MARKET
app = Microsoft Word 9.0
author = Eino Kivisaari
created = 2004-03-28 20:11:00.0
mod = 2004-03-28 20:11:00.0
pages = 1
template = Normal.dot
