Home & blog  /  2012  /  Jul  /  view post  /

REGEX round-up: two issues...

posted: 02 Jul '12 19:27 tags: REGEX, Stack-Overflow, parsing, JavaScript, PHP

I've recently become something of an addict to answering questions on Stack Overflow. As something of a gleeful REGEX sadist, this is one of the areas I answer on.

Two questions recently highlighted again some of the inabilities in REGEX.

1. Capturing repeating sub-groups

One concerns captured sub-groups. If you're not sure what this means, take a look at this:

"hello, there".match(/hello, (there)/);

The result of that is an array of the following structure:

1[

2     "hello, there", //the whole match is always the first key

3     "there" //...followed by any sub-matches

4]

As the comment says, in most REGEX implementations if not all, the first element of the matches array is the entire match. But I also asked for a sub-match, (there), so that gets sent in as the second array element. And so on, for as many sub-matches as I specified.

This is fine, but gets tricky where you want to repeat a sub-match and capture each repetition of it separately. Consider this:

"foo bar bar bar".match(/foo (bar ?)+/);

That pattern asks for "foo" followed by "bar" one or more times (since I use the "once or more" operator, +, after my sub-group). It returns this:

["foo bar bar bar", "bar"]

You can see the instruction to allow repetition of the sub-group has been honoured insofar as the pattern, in its entirety, matches. But you will also notice that it captured only one of the sub-group instances, not three.

If you're finding this a bit of a headache (and you either love REGEX or find it melts your brain), the bottom line is this:

Your matches array will contain only as many results as there are sub-groups explicitly defined in your pattern (plus the entire match as the first element).

At least in JavaScrtipt and PHP, anyway. I have seen fleeting references to functionality in other languages that can capture repeated - i.e. implied instances of - sub-groups, but I don't know for sure.

Partial workaround

As confirmed by the priceless Regular-Expressions.info, the above pattern is a common pitfall, but I include it just to illustrate the point.

A partial workaround is to capture the repeated sub-group in yet another sub-group. So:

"foo bar bar bar".match(/foo ((bar ?)+)/);

That returns this:

["foo bar bar bar", "bar bar bar", "bar"]

...because this time I made a point of capturing not only the sub-group to be repeated, but also the cumulative result OF that repetition.

But the problem remains that we have not captured each instance separately. When I say "problem", it's not an error of R&G&X; it wasn't designed to do this. It's just an inability.

It's not a show-stopping problem. If you really want "bar" to be represented three times separately in an array, it doesn't take much to first do the REGEX then split the second array element (in our case) by the space delimiter).

2. REGEX is not a parser

Another question that I answered the other day considered the use of REGEX to parse a proprietary tag format.

This should set alarm bells ringing immediately. REGEX is not a parsing tool.

In any case the user wanted to parse the various tags from the following string. The complication is in the fact that tags may contain nested sub-tags.

Hi [{tag:there}], [{tag:how are you?}]. [{I'm good}], thanks.

A similar question came up today, asking the same thing, but with HTML. (Incidentally, REGEX is always a bad choice for parsing XML-based languages, given the DOM route).

The problem with both is this: REGEX cannot hope to reliably know which opening tags correspond to which closing tags.

Take the following HTML (based on the question I mentioned above):

1<table>

2     <tr>

3         <td>Nottingham Forest</td>

4     </tr>

5     <tr>

6         <td>Notts County</td>

7     </tr>

8     <tr>

9         <td>Mansfield Town</td>

10     </tr>

11</table>

Let's say you want to capture the row that contains the string "County":

/<tr\b[^>]*?>[\s\S]*?County[\s\S]*?<\/tr>/gi

Tested against our HTML, this will be the result:

1<tr>

2     <td>Nottingham Forest</td>

3</tr>

4<tr>

5     <td>Notts County</td>

6</tr>

This is logical when you think how REGEXP works. It begins at the start of the string, and is asked to find an opening <

tag. It is then told to allow for zero or more characters of any kind ([\s\S]up to and including the string "Forest".

(Side note: if you're wondering, I use [\s\S] rather than . because the latter, though a wildcard character, does not in fact match white-space characters, which is a big deal here as our string is multi-line, and line breaks are spacial characters. The only way to match truly anything is with the former, which literally says "get me all non-space and space characters", i.e. everything.)

This is precisely the problem. In telling it to allow any characters of any kind before the string "Forest", we actually cover from the start of the string (i.e. the first row) into the second.

In other words, our match will be from the beginning to the end of the requested row - never just the requested row.

Ideally we'd limit that first [\s\S] from matching another

A partial workaround

Let's return to the first example, with the proprietary tag format. The aim, of course, is to extract each of the three tags. This question was a PHP one, so I came up with this horror.

1$str = "Hi [{tag:there}], [{tag:how are you?}]. [{I'm good}], thanks.";

2$matches = array();

3function replace_cb($this_match) {

4     global $matches;

5     $this_match = $this_match[0];

6     foreach($matches as $index => $match) $this_match = str_replace('**'.($index + 1).'**', $match, $this_match);

7     array_push($matches, $this_match);

8     return '**'.count($matches).'**';

9}

10while(preg_match('/\[\{[^\[]*?\}\]/', $str)) $str = preg_replace_callback('/\[\{[^\[]*?\}\]/', 'replace_cb', $str);

That actually works. The concept is to first locate and extract all tags that do not contain nested sub-tags, and then work outwards. I'll save you the laborious, line-by-line explanation, but if we print_r($matches), we get this:

1Array

2(

3     [0] => [{tagname:content}]

4     [1] => [{tag2: more data here}]

5     [2] => [{tag1:xnkudfdhkfujhkdjki diidfo now nested tag [{tag2: more data here}] kj udf}]

6)

post new comment

Comments (0)