magpiebrain

Sam Newman's site, a Consultant at ThoughtWorks

I was playing around with the Regular Expression support in Java 1.4, with a view to repeating my earlier tutorials on the use of sed for text manipulation (this time with Java), when I can across a rather strange problem. Imagine my input is a multi-line string, part of which looks like this:

|          Serial          |
|        1234567890        |

I want to match the serial number, which in this case is 1234567890. First off, I want to match the title itself (I cannot just assume any numbers are serial numbers) but I also have to match the numbers themselves. The code to match the string I want to extract the number from looks like this:



String input = "|          Serial          |n|        1234567890        |";
Pattern p = Pattern.compile("Serial(?s).*[0-9]+");
Matcher m = p.matcher(input);

while(m.find()) {
  System.out.println("Matched String " + m.group());
}


Note: The use of the embedded (?s) tag forces the . to match line terminators – by default it doesn’t unless this flag or DOTALL is used.

Put simply the pattern reads “Match the work serial, followed by any characters until you get to a list of numbers and stop there. Sure enough, running this gives the following result:


Found match Serial          |
|        1234567890

Next I group the numbers being matched using ‘(‘ and ‘)’. This gives me grouping exactly as with sed – I can now index these matching groups, using m.group(index):



  String input = "|          Serial          |n|         1234567890        |";
  Pattern p = Pattern.compile("Serial(?s).*([0-9]+)");
  Matcher m = p.matcher(input);

  while(m.find()) {
    System.out.println("Found match: " + m.group());
    System.out.println("Found serial number: " + m.group(1));
  }


But this gives the following output:


Found match: Serial          |
|        1234567890
Found serial number: 0

For some reason the grouping is only matching the last number, not the whole list. I can’t for the life of me work out why…. Oh well, expect some tutorials on the use of regular expressions soon.

_Updated_: ditched all the dashes as they screwed up the formatting.

_Updated_: Fixed the second code fragment

Advertisements

9 Responses to “Strange Java regexp behaviour – grouping”

  1. Tom Hawtin

    I believe most of the digits are being swallowed by the .*. So use [^0-9]* instead. I think. I’m not a regex man.

    Reply
  2. Anonymouse

    I’ve not used the (?s) before, but normally you put parentheses around the groups you want to capture, so perhaps the following pattern might work…

    (Serial)(?s).*([0-9] )

    Also, why not try d for the digits, unless you require a specific format for the number.

    Reply
  3. Doug

    Well, all I see in the article are “textile”s, so I don’t know what you tried and what you got aside from your description.

    But it sounds like your problem is that you forgot that .* is “greedy”… it will swallow up as many characters as possible while the part to the right can still somehow match. Perhaps you meant .*? which is the “reluctant” form?

    Reply
  4. Sam Newman

    Doug, that may well be it! Oh, and sorry about the textile tags – using all those ‘-‘ characters screwed up the textile formatting.

    Reply
  5. Andrew

    I think you’ll find the api docs sorta handy here.

    The .* is greedy. Try .?* instead to get a non-greedy quantifier, and make the [0-9] possessive.

    The parantheses too.

    Reply
  6. Sam Newman

    The greedy nature of .* was the problem. The decision to make .* greedy is a strange one – in sed (and therefore (e)grep, ed and qed) the .* is non-greedy, hence my mistake. I wonder if .* is greedy in perl?

    Reply
  7. Doug

    Not to be contrary, but…

    The * operator (Kleene closure) is greedy in sed, too. Actually, I don’t know of any package where it ISN’T greedy.

    If you’ll check the bottom of p.22 (document p.A3) of http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf you’ll see that * was greedy in qed back in 1970. If you think about it, it HAD to be: changing “a string of digits” to something else pretty much presumes that you want to change the WHOLE string of digits.

    Most open-source implementations of basic regular-expression matching are based on Henry Spencer’s “open source clean room reimplementation” of the V8 regexp library back in the mid-80’s. In
    http://groups.google.com/groups?selm=1316@panda.UUCP you will find greediness explicitly described: “the possibilities for ‘*’, ‘+’, and ‘?’ are considered longest-first”. I would presume that this behavior was mimicked from the V8 regexp library, since “this implementation is believed 100% compatible with V8”.

    Implementations of extended regular-expression matching are generally based on the IEEE 1003.2 (POSIX.2) standard. The documentation of the “regex” library which implements that standard says: “Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later.” See http://www.daemon-systems.org/man/re_format.7.html for one example of the man page.

    For its part, Bell’s man page http://www.cs.bell-labs.com/magic/man2html/6/regexp says: “A match to any part of a regular expression extends as far as possible without preventing a match to the remainder of the regular expression.” It also references man pages for sed, ed, grep, and awk, and none of those pages suggest that they modify regexp behavior to be non-greedy.

    Interestingly, the de-facto standard for regular expressions these days is set by Perl. In fact, the Unicode regular expression standard at http://www.unicode.org/unicode/reports/tr18/ which Java is based on is, in turn, based on the Perl specifications. The Perl specifications at http://www.perldoc.com/perl5.6/pod/perlre.html say: “By default, a quantified subpattern is ‘greedy’, that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.”

    More than you wanted to know, I’m sure.

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Basic HTML is allowed. Your email address will not be published.

Subscribe to this comment feed via RSS

%d bloggers like this: