Sunday, September 7th, 2008

Groovy: Regular Expressions Basics

The Groovy programming language simplifies many common tasks, including working with regular expressions. This post shows common uses that Groovy makes easy. However, regular expressions in Groovy are not free from irregularities, and this geek-level post explains one of them.

Regular expressions are a powerful and useful tool in any programmer's toolbox. Groovy relies on Java's regular expression package, java.util.regex, a top-notch implementation. As you might expect, Groovy adds grooviness to Java by making regular expressions a seamless part of the language.

Let us consider simple tests where the question is whether a string matches a pattern. You can copy and paste the following examples into a Groovy console to run them. In these samples I use the assert statement. In normal code the tests would typically be conditions in if statements.

GROOVY:

  1. PATSTR = /\s*(\d+)(m|(?:km))/
  2. PAT = ~PATSTR
  3. S1 = '1km'
  4. S2 = '2m'
  5. S3 = '1km 200m 33m'
  6. SX = 'Unrelated text'

Line 1 defines a regular expression. It matches optional leading whitespace, then an integer followed by either 'm' or 'km'. I use the so-called slashy syntax for the string. Any string literal is OK, but in a quoted string all backslashes must be doubled.
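To make the point about doubled backslashes concrete, here is a minimal sketch that writes the same regex both ways. The variable names slashy and quoted are only for illustration.

```groovy
// Slashy string: each backslash is written once.
def slashy = /\s*(\d+)(m|(?:km))/

// Single-quoted string: every backslash must be doubled.
def quoted = '\\s*(\\d+)(m|(?:km))'

// Both literals hold exactly the same pattern text.
assert slashy == quoted
```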

Line 2 precompiles the string into a pattern (java.util.regex.Pattern). Precompilation improves pattern matching performance, but don't bother with it unless your regex is heavily used. On the other hand, the precompilation syntax is really simple. Just be aware of the difference between "= ~" and "=~". The space is significant: the first assigns a precompiled pattern, the second is the find operator.
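The difference the space makes can be checked directly. This is a small sketch, with the names pat, text and found chosen here only for illustration.

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

def PATSTR = /\s*(\d+)(m|(?:km))/

// "= ~": plain assignment of the result of the unary ~ operator,
// which compiles the string into a Pattern.
def pat = ~PATSTR
assert pat instanceof Pattern

// "=~": the find operator, which produces a Matcher
// for the string on its left-hand side.
def text = '2m'
def found = (text =~ PATSTR)
assert found instanceof Matcher
```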

The tests use the two operators "=~" (find) and "==~" (matches). The find operator (=~) checks whether the pattern occurs somewhere in the string. The matches operator (==~) requires that the pattern match the whole string from beginning to end. As you can see in lines 8-9, the operators work regardless of whether the pattern is precompiled. Precompilation may be added without changing the test syntax.
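A sketch of such tests follows, with the definitions repeated so that the block runs on its own in a Groovy console. The particular assertions are illustrative assumptions about how the two operators are exercised, not a copy of the original listing.

```groovy
def PATSTR = /\s*(\d+)(m|(?:km))/
def PAT = ~PATSTR
def S1 = '1km'
def S3 = '1km 200m 33m'
def SX = 'Unrelated text'

// find (=~): the pattern occurs somewhere in the string.
assert S1 =~ PATSTR
assert S1 =~ PAT
assert S3 =~ PATSTR
assert !(SX =~ PATSTR)

// matches (==~): the pattern must cover the whole string.
assert S1 ==~ PATSTR
assert S1 ==~ PAT
assert !(S3 ==~ PATSTR)
assert !(SX ==~ PAT)
```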

Now let's try something different.

GROOVY:

  1. PATSTR = /\s*(\d+)(m|(?:km))/
  2. S1 = '1km'
  3. S3 = '1km 200m 33m'
  4. // The following *FAILS* with
  5. // groovy.lang.MissingPropertyException: No such property: count for class: java.lang.Boolean

The pattern and strings are copied from the previous example. The first difference is in line 5. The property count added to the result of the =~ (find) operator yields 3. It tells us that the regex found 3 matches in the string. Quite useful and kind of natural.

Natural? Wait a minute, we did a test. A test should return a Boolean. It seems Groovy adds some magic here. If you dig below the surface you will find that the =~ operator returns a Matcher (java.util.regex.Matcher), not a Boolean. Groovy gives the Matcher a count property. The additional magic is that, when a Matcher is used where a Boolean is expected, it yields true if it finds at least one match.
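Both halves of the magic can be observed in a few lines. This is a minimal sketch. It creates a fresh Matcher for each check, on the assumption that counting and Boolean conversion both advance the Matcher's internal position.

```groovy
import java.util.regex.Matcher

def PATSTR = /\s*(\d+)(m|(?:km))/
def S3 = '1km 200m 33m'
def SX = 'Unrelated text'

// The find operator hands back a Matcher, not a Boolean.
def m = (S3 =~ PATSTR)
assert m instanceof Matcher

// count reports how many times the pattern occurs in the string.
assert m.count == 3

// Where a Boolean is expected, a fresh Matcher is true
// if it finds at least one match.
def hit = (S3 =~ PATSTR) ? true : false
def miss = (SX =~ PATSTR) ? true : false
assert hit
assert !miss
```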

After uncovering the groovy magic of the =~ operator, of course we'd love to repeat the experience with the ==~ (matches) operator. The test case is in line 9. Try this, and you'll be amazed that the example blows up with an ugly exception. Line 6 was added to avoid suspicion that the string no longer matches.

You may be disappointed to learn that the ==~ operator returns a plain Boolean, not a Matcher. This may seem confusing, but returning a Matcher might have been even more confusing. The poor Matcher has no means to tell if you just tested find or matches. When converted to a Boolean it will always answer the same question: "was there at least one match?"
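The following sketch confirms the type of the result and reproduces the failure in a controlled way. The names result and failed are only for illustration.

```groovy
def PATSTR = /\s*(\d+)(m|(?:km))/
def S1 = '1km'

// The matches operator yields a plain Boolean.
def result = (S1 ==~ PATSTR)
assert result instanceof Boolean
assert result

// A Boolean has no count property, so asking for it fails.
def failed = false
try {
    result.count
} catch (MissingPropertyException e) {
    failed = true
}
assert failed
```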

What can you do if you really want to use ==~ and get hold of the Matcher? The next snippet shows a couple of ways.

GROOVY:

  1. import java.util.regex.Matcher
  2. import java.util.regex.Pattern
  3.  
  4. PATSTR = /\s*(\d+)(m|(?:km))/
  5. S1 = '1km'
  6. S3 = '1km 200m 33m'
  7. '^' + PATSTR + '$'

Line 8 shows the first way. You can always add "^" before and "$" after a pattern. This will require a match to extend from the beginning to the end of the string. We must use the =~ operator here.
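A minimal sketch of the anchoring approach, assuming the anchored pattern is built by plain string concatenation. The name anchored is hypothetical.

```groovy
def PATSTR = /\s*(\d+)(m|(?:km))/
def S1 = '1km'
def S3 = '1km 200m 33m'

// Wrap the pattern so that a find must span the whole string.
def anchored = '^' + PATSTR + '$'

// With the anchors in place, =~ behaves like ==~ but still returns a Matcher.
assert (S1 =~ anchored).count == 1
assert (S3 =~ anchored).count == 0
```

Because the result is still a Matcher, properties such as count remain available.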

Lines 10-12 illustrate another way. We simply drop all groovieness and run plain Java. It's verbose but it works.
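In plain java.util.regex terms, that approach might look like the sketch below. The variable names p and m are only for illustration.

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

def PATSTR = /\s*(\d+)(m|(?:km))/
def S1 = '1km'

// Compile the pattern and create the Matcher explicitly.
Pattern p = Pattern.compile(PATSTR)
Matcher m = p.matcher(S1)

// matches() requires the whole string to match and leaves
// the capture groups available on the Matcher.
assert m.matches()
assert m.group(1) == '1'
assert m.group(2) == 'km'
```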

Lines 14-15 contain a suggestion that I don't really like. The Matcher class remembers the last Matcher in a static field. This is really ugly. Getting rid of global variables was one of the main reasons for going object oriented. Anyway, the ==~ operator is used and you may pick up the corresponding Matcher.
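A sketch of that approach follows. It assumes that Groovy records the Matcher used by ==~ and exposes it through the static lastMatcher property that Groovy adds to Matcher. Treat that as an assumption to verify against your Groovy version.

```groovy
import java.util.regex.Matcher

def PATSTR = /\s*(\d+)(m|(?:km))/
def S1 = '1km'

// Run the matches operator; the result itself is only a Boolean.
def ok = (S1 ==~ PATSTR)
assert ok

// Assumption: the Matcher behind the ==~ above is remembered
// and can be picked up right afterwards.
Matcher m = Matcher.lastMatcher
assert m.group(1) == '1'
assert m.group(2) == 'km'
```

The lookup has to follow the ==~ test immediately, since any later regex operation may replace the remembered Matcher.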

In summary, this post shows some basic use of Groovy regular expressions. As usual in Groovy a lot of power goes into concise syntax. However, we also discovered that the Groovy cloth was not enough for a full three-piece suit. In some places the underlying Java machinery shines through.
