I was playing around with the Regular Expression support in Java 1.4, with a view to repeating my earlier tutorials on the use of sed for text manipulation (this time with Java), when I can across a rather strange problem. Imagine my input is a multi-line string, part of which looks like this:
| Serial | | 1234567890 |
I want to match the serial number, which in this case is 1234567890
. First off, I want to match the title itself (I cannot just assume any numbers are serial numbers) but I also have to match the numbers themselves. The code to match the string I want to extract the number from looks like this:
String input = "| Serial |n| 1234567890 |";
Pattern p = Pattern.compile("Serial(?s).*[0-9]+");
Matcher m = p.matcher(input);
while(m.find()) {
System.out.println("Matched String " + m.group());
}
(?s)
tag forces the .
to match line terminators – by default it doesn’t unless this flag or DOTALL is used.
Put simply the pattern reads “Match the work serial, followed by any characters until you get to a list of numbers and stop there. Sure enough, running this gives the following result:
Found match Serial | | 1234567890
Next I group the numbers being matched using ‘(‘ and ‘)’. This gives me grouping exactly as with sed – I can now index these matching groups, using m.group(index)
:
String input = "| Serial |n| 1234567890 |";
Pattern p = Pattern.compile("Serial(?s).*([0-9]+)");
Matcher m = p.matcher(input);
while(m.find()) {
System.out.println("Found match: " + m.group());
System.out.println("Found serial number: " + m.group(1));
}
But this gives the following output:
Found match: Serial | | 1234567890 Found serial number: 0
For some reason the grouping is only matching the last number, not the whole list. I can’t for the life of me work out why…. Oh well, expect some tutorials on the use of regular expressions soon.
_Updated_: ditched all the dashes as they screwed up the formatting.
_Updated_: Fixed the second code fragment
9 Responses to “Strange Java regexp behaviour – grouping”
Hmmm… your regex doesn’t seem right.
Serial(?s).*</[0-9] )
Did MT eat up some symbols?
I believe most of the digits are being swallowed by the .*. So use [^0-9]* instead. I think. I’m not a regex man.
I’ve not used the (?s) before, but normally you put parentheses around the groups you want to capture, so perhaps the following pattern might work…
(Serial)(?s).*([0-9] )
Also, why not try d for the digits, unless you require a specific format for the number.
Well, all I see in the article are “textile”s, so I don’t know what you tried and what you got aside from your description.
But it sounds like your problem is that you forgot that .* is “greedy”… it will swallow up as many characters as possible while the part to the right can still somehow match. Perhaps you meant .*? which is the “reluctant” form?
Doug, that may well be it! Oh, and sorry about the textile tags – using all those ‘-‘ characters screwed up the textile formatting.
I think you’ll find the api docs sorta handy here.
The .* is greedy. Try .?* instead to get a non-greedy quantifier, and make the [0-9] possessive.
The parantheses too.
The greedy nature of .* was the problem. The decision to make .* greedy is a strange one – in sed (and therefore (e)grep, ed and qed) the .* is non-greedy, hence my mistake. I wonder if .* is greedy in perl?
Greedy is the default in perl.
Not to be contrary, but…
The * operator (Kleene closure) is greedy in sed, too. Actually, I don’t know of any package where it ISN’T greedy.
If you’ll check the bottom of p.22 (document p.A3) of http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf you’ll see that * was greedy in qed back in 1970. If you think about it, it HAD to be: changing “a string of digits” to something else pretty much presumes that you want to change the WHOLE string of digits.
Most open-source implementations of basic regular-expression matching are based on Henry Spencer’s “open source clean room reimplementation” of the V8 regexp library back in the mid-80’s. In
http://groups.google.com/groups?selm=1316@panda.UUCP you will find greediness explicitly described: “the possibilities for ‘*’, ‘+’, and ‘?’ are considered longest-first”. I would presume that this behavior was mimicked from the V8 regexp library, since “this implementation is believed 100% compatible with V8”.
Implementations of extended regular-expression matching are generally based on the IEEE 1003.2 (POSIX.2) standard. The documentation of the “regex” library which implements that standard says: “Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later.” See http://www.daemon-systems.org/man/re_format.7.html for one example of the man page.
For its part, Bell’s man page http://www.cs.bell-labs.com/magic/man2html/6/regexp says: “A match to any part of a regular expression extends as far as possible without preventing a match to the remainder of the regular expression.” It also references man pages for sed, ed, grep, and awk, and none of those pages suggest that they modify regexp behavior to be non-greedy.
Interestingly, the de-facto standard for regular expressions these days is set by Perl. In fact, the Unicode regular expression standard at http://www.unicode.org/unicode/reports/tr18/ which Java is based on is, in turn, based on the Perl specifications. The Perl specifications at http://www.perldoc.com/perl5.6/pod/perlre.html say: “By default, a quantified subpattern is ‘greedy’, that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.”
More than you wanted to know, I’m sure.