Saturday, September 13th, 2008

Groovy: Regular Expressions At Work

After a previous post covering the basics, we'll go on with another geek-level look at regular expressions in Groovy. This time we will examine groups. Groups are about parsing: how to pick out the parts of what a regular expression matches. We will also uncover a few more gotchas.

A part of a regular expression enclosed in parentheses is a group. Groups are numbered from left to right by their opening parenthesis, starting at 1. Note that the numbering is static: it is fixed by the pattern, not by the input. Even if the closing group parenthesis is followed by a "+", you won't get more than one group; the group simply holds whatever its last repetition matched. Alternatively, the group may be followed by "*" and match nothing, and you will still get a slot for it in the result. A Matcher can be asked for the number of groups with groupCount(), but in most practical cases you never have to use it.
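To make the static numbering concrete, here is a minimal sketch using two throwaway patterns that are not from the original post. It relies only on the standard java.util.regex.Matcher methods groupCount(), find() and group(). Note that in this sketch the group that matches nothing comes back as null, because it never took part in the match.

```groovy
// Group numbering is fixed by the pattern text, not by the input.
def plus = ('abc' =~ /(\w)+/)
assert plus.groupCount() == 1      // one pair of parentheses, one group
assert plus.find()
def lastLetter = plus.group(1)     // holds only the final repetition: 'c'

def star = ('x' =~ /(\d)*x/)
assert star.groupCount() == 1      // the group is still counted
assert star.find()
def noDigit = star.group(1)        // null: the group matched nothing
```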

The first post was largely about the difference between "=~" (find) and "==~" (matches). The =~ operator is generally better integrated into the Groovy language.

There are several ways to iterate over the matches found by the =~ operator. The regex in this example is the same one as in the previous post. It matches a string of the form "17m" or "7km". The "km" literal is enclosed in a non-capturing group, which means it's not counted as a group. Strictly speaking the inner group is optional here: alternation binds more loosely than concatenation, so (m|km) already means "m" or "km". The non-capturing group just makes the intent explicit.

GROOVY:

  1. PATSTR = /\s*(\d+)(m|(?:km))/
  2. PAT = ~PATSTR
  3. S3 = '1km 200m 33m'

Lines 5-8 introduce a toMeters method. It takes two strings, a count and a unit, and returns an integer number of meters. The method is tested in lines 10-11 for good measure.
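A toMeters helper of the kind described could look roughly like this. The body is an assumption reconstructed from the description (two strings in, an integer number of meters out), not the original listing.

```groovy
// Hypothetical reconstruction: convert a count and a unit ('m' or 'km') to meters.
int toMeters(String count, String unit) {
    int n = Integer.parseInt(count)
    return unit == 'km' ? n * 1000 : n
}

// Quick sanity checks, in the spirit of the post.
assert toMeters('2', 'km') == 2000
assert toMeters('98', 'm') == 98
```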

Lines 13-17 introduce the first Groovy way of iterating over matches. Each match is passed to the closure. The first closure parameter gets the whole matched string, while the following parameters receive the group matches. Since the number of groups is determined statically you can always know how many closure parameters to use. The toMeters function is used in our example to add the distances to a total. It's a matter of taste, but this form seems clean, readable and concise to me.
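As a sketch of this closure form, the following iterates over the matcher with each and sums the distances. The toMeters body is the same assumed reconstruction as before, and the variable names follow the post.

```groovy
def PAT = ~/\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'

// Assumed helper: count and unit strings in, meters out.
int toMeters(String count, String unit) {
    int n = Integer.parseInt(count)
    return unit == 'km' ? n * 1000 : n
}

def total = 0
// One closure parameter for the whole match, then one per group.
(S3 =~ PAT).each { whole, count, unit ->
    total += toMeters(count, unit)
}
assert total == 1233   // 1000 + 200 + 33
```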

Lines 19-23 are very similar. The difference is we use the eachMatch method defined on java.lang.String. This means it cannot be used on pre-compiled regexes. In my opinion it's better to stick to the first form because it's always the same, regardless of pre-compilation.
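The eachMatch variant might be sketched like this, passing the regex as a plain string rather than a compiled Pattern. Again, toMeters is an assumed reconstruction.

```groovy
def PATSTR = /\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'

// Assumed helper: count and unit strings in, meters out.
int toMeters(String count, String unit) {
    int n = Integer.parseInt(count)
    return unit == 'km' ? n * 1000 : n
}

def total = 0
// eachMatch is called on the String and takes the regex as a string.
S3.eachMatch(PATSTR) { whole, count, unit ->
    total += toMeters(count, unit)
}
assert total == 1233
```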

Lines 25-29 introduce an oddball variant, using an iterator explicitly obtained from the Matcher. The API documentation says nothing about what the iterator returns. For the sake of consistency one would expect an array of strings. In reality it only returns the matched string. This seems like a misfeature, because it's not very useful.
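A cautious sketch of the explicit iterator follows. What next() hands back seems to depend on the Groovy version (a plain string, as observed above, or a list of groups), so the loop below is written to cope with either shape and only collects the whole match.

```groovy
def PAT = ~/\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'

def wholeMatches = []
def iter = (S3 =~ PAT).iterator()
while (iter.hasNext()) {
    def item = iter.next()
    // Accept either a String or a list whose first element is the whole match.
    wholeMatches << (item instanceof List ? item[0] : item)
}
// The leading \s* means later matches include their leading space.
assert wholeMatches == ['1km', ' 200m', ' 33m']
```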

It's time for a gotcha. You may be tempted to first check if a regex matches, and then iterate over the matches. Don't do it. Watch this, a modification of the previous example.

GROOVY:

  1. PATSTR = /\s*(\d+)(m|(?:km))/
  2. PAT = ~PATSTR
  3. S3 = '1km 200m 33m'

The difference is in line 11, where we try to protect ourselves before diving into iteration. Note that the test itself performs a find, so it consumes the first match. The iterator only finds the next two matches.
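The effect can be reproduced with plain java.util.regex calls. This is a sketch of the underlying mechanism, not the original listing: it calls find() explicitly for the "protective" test. Whether Groovy's own iteration methods reset the matcher first may differ between Groovy versions.

```groovy
def PAT = ~/\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'

def m = S3 =~ PAT
def found = []
if (m.find()) {            // OOPS: the test already advances past the first match
    while (m.find()) {     // so the loop starts at the second match
        found << m.group()
    }
}
// '1km' has been silently skipped.
assert found == [' 200m', ' 33m']
```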

Besides, the double parentheses in line 11 are Groovy's way of forcing us to state explicitly that we want an assignment in the condition expression, a very good idea.

So far we have only used =~, also known as find. The ==~ operation, or matches, does not have much elegant syntax to support it. It can't have more than one match, so using an iterator isn't very natural, even though it's allowed. So how do we access matches?

Matches may be accessed by treating a Matcher as a two-dimensional array. The first index is the match index, i.e. the first match has index 0. Each match is represented by an array of strings, the second array index. The first element (index 0) is all of the matched string. The next element (index 1) is group 1, and so on. The array syntax is always available because a Matcher is always a Matcher. Sometimes it may seem that =~ and ==~ are quite different, but below the syntactical level there is one and the same regex machinery using the same Matcher class.
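A short sketch of the two-dimensional indexing, using the same regex and sample string as the earlier examples:

```groovy
def PAT = ~/\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'

def m = S3 =~ PAT
// First index: which match. Second index: 0 is the whole match, 1.. are the groups.
def first = m[0]
def secondWhole = m[1][0]
def secondCount = m[1][1]
def thirdUnit = m[2][2]

assert first == ['1km', '1', 'km']
```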

When using ==~ we know there is at most one match, so the first array index is always 0. Let's have a look at some code. The following example uses the same regex and the same toMeters method as the preceding ones.

GROOVY:

  1. PAT = ~/\s*(\d+)(m|(?:km))/
  2. S1 = '1km'

When using ==~ it's necessary to check if the regex matches (line 11) before trying anything else. If you consider global variables evil (I do) the main choice is to use the matches method and not the ==~ operator. (This is explained in the previous post.)

When the matches method or the ==~ operator returns true there is exactly one match, so there will always be a "[0]" in the two-dimensional array syntax (line 12). This is not exactly pretty, so there is a second version of the same thing in lines 16-21. I think it's more readable to use the group method (line 19). It's plain Java, no Groovy syntax is left.
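Both styles can be sketched side by side. The helper body is the same assumed reconstruction as before, and the matcher is created explicitly with Pattern.matcher so that no global state is involved.

```groovy
def PAT = ~/\s*(\d+)(m|(?:km))/
def S1 = '1km'

// Assumed helper: count and unit strings in, meters out.
int toMeters(String count, String unit) {
    int n = Integer.parseInt(count)
    return unit == 'km' ? n * 1000 : n
}

assert S1 ==~ PAT                  // the operator only yields a boolean

def meters = null
def m = PAT.matcher(S1)
if (m.matches()) {                 // always check before touching groups
    // Plain Java style: the group method.
    meters = toMeters(m.group(1), m.group(2))
    // Groovy array style: the first index is always 0 here.
    assert m[0] == ['1km', '1', 'km']
}
assert meters == 1000
```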

In summary, we have covered several ways of accessing regex matches. The find or =~ operator has several options. Iterating over a closure is readable and elegant, just don't fall victim to the gotchas. The matches or ==~ operator has little Groovy syntax that really helps. We find ourselves writing plain Java. It's not wrong, of course, but it's a little more verbose, a little less groovy.
