magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks

It’s been a day for revisiting previous topics, so I thought I’d readdress some of the “troubles(Strange Java regexp behaviour – grouping)”:http://www.magpiebrain.com/archives/000219.html I was having with regular expressions last week. To recap, I was writing an expression to grab the serial number from the following string:

|          Serial          |
|        1234567890        |

The regexp I was using didn’t seem to work – @Serial(?s).*([0-9]+)@ should have captured the whole serial number into a group, but was only capturing its last digit. Many commenters pointed out that the reason for this is that @.*@ is a greedy operator (“Doug’s”:http://www.magpiebrain.com/archives/000219.html#comment419 comment should especially be noted – rarely have I seen such effort put into a blog comment!).

Simply put, a greedy operator matches as many characters as possible – when this stuff was new to me I used to think of a greedy operator as a little Pacman, chomping his way through my string, waiting till the last possible moment before letting the next operator get a look in. In this instance, the @*@ ate everything and then handed back only just enough text for the @[0-9]+@ to match – which was just the last digit of the serial number. As you would expect, to balance greedy operators, you have lazy (or reluctant) operators. To further abuse a metaphor, I think of a lazy operator as a very full Pacman, who is looking for any excuse to go off for a nap. The lazy form of the @*@ operator in Java is @*?@ – in my case this operator gives up when it sees the first digit, letting the @[0-9]+@ take over and match the rest of the number. So let’s look at my fixed code:


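import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\n" separates the header row from the row holding the serial number
String input = "|          Serial          |\n|         1234567890        |";
// (?s) lets . match line terminators; .*? is reluctant, so it stops at the first digit
Pattern p = Pattern.compile("Serial(?s).*?([0-9]+)");
Matcher m = p.matcher(input);

while (m.find()) {
  System.out.println("Found match: " + m.group());
  System.out.println("Found serial number: " + m.group(1));
}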
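
To see the difference side by side, here is a small sketch that runs the greedy and the reluctant pattern against the same string. The class name GreedyVsReluctant and the helper firstGroup are names made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns what group 1 captured in the first match,
    // or null if the pattern does not match at all.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "|          Serial          |\n|        1234567890        |";

        // Greedy .* swallows the rest of the input, then backtracks one
        // character at a time until [0-9]+ can match a single digit.
        System.out.println("greedy:    " + firstGroup("Serial(?s).*([0-9]+)", input));

        // Reluctant .*? grows one character at a time and stops as soon as
        // [0-9]+ can match, so the group gets the whole number.
        System.out.println("reluctant: " + firstGroup("Serial(?s).*?([0-9]+)", input));
    }
}
```

Run against the string from the recap, the greedy version prints 0 and the reluctant version prints 1234567890.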

This particular mistake was quite embarrassing. I’ve always prided myself on my regexp knowledge and to make such a bonehead mistake (not to mention exhibit at least one fundamental misunderstanding about the whole thing) has gone some way to puncture my ego, which I guess is no bad thing… The moral of the story? Reach for the manual before reaching for the blog – you might still make mistakes, but at least you’re making them in private that way!

3 Responses to “Pacman and greedy regexp operators”

  1. Cedric

    A better and more efficient way is *not* to use .* at all but instead, use the complement of what you are trying to match:

    [^0-9]+[0-9]+

  2. Sam

    But that approach makes assumptions about the rest of the string – the string I showed was actually only a small part of a much larger body of text, and using [^0-9]+[0-9]+ would have matched all digit sequences rather than the specific one – I knew I wanted the series of numbers that occurs on the line of text after the word “Serial”. Sure, I could have made the regexp incredibly precise, but I wanted some measure of flexibility – using Serial(?s).*?([0-9]+) I’m making no assumptions as to line length, padding characters, line feeds etc.
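
A rough sketch of that difference, using a hypothetical larger input with other digit runs in it (the "Order 42" and "Batch 7" lines are invented for the example, as are the class and method names). The digit-class pattern used here, [^0-9]*([0-9]+), is only an assumed stand-in for the complement approach, not necessarily the exact expression suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SerialVsAllDigits {
    // Complement-style pattern (an assumption for illustration): skip any
    // non-digits, then capture a run of digits. It has no idea which run
    // is the serial number, so it reports every one of them.
    static List<String> allDigitRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = Pattern.compile("[^0-9]*([0-9]+)").matcher(input);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    // Anchored pattern from the post: only the first run of digits that
    // follows the word "Serial" is captured. Returns null if none is found.
    static String serialAfterLabel(String input) {
        Matcher m = Pattern.compile("Serial(?s).*?([0-9]+)").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical surrounding text; only the middle two lines come from the post.
        String input = "Order 42\n|  Serial  |\n|  1234567890  |\nBatch 7";
        System.out.println(allDigitRuns(input));     // every digit run in the text
        System.out.println(serialAfterLabel(input)); // just the one after "Serial"
    }
}
```

With this input the first call returns 42, 1234567890 and 7, while the second returns only 1234567890.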
