
Understanding multi-line mode for JS REGEX

posted: 30 Jun '11 21:37 tags: REGEXP, Javascript, look-behind, simulation

I wanted to do a post on regular expressions' multi-line mode, since from looking around the net there appears to be a common misconception about what this does.

That's kind of understandable; you might expect something called multi-line mode to enable your REGEX pattern to match within strings that contain line breaks.

But that happens anyway, with or without multi-line mode turned on. See:

var myStr = "This is a \n multi-line \n string";
var words = myStr.match(/\w+/g);
if (words) alert(words.join('\n'));

That will find and alert out all the words (notice I pass the global flag, as I want all the words, not just the first) - even though I ran the REGEX on a multi-line string and didn't stipulate multi-line mode.

No; what multi-line mode is about is changing the behaviour of the ^ and $ anchors.

Normally, these match the start and end of the string, respectively. In multi-line mode, though - which is turned on by passing an 'm' after the final forward slash of your pattern - their meanings are extended to also match the positions just before ($) and just after (^) a line-break.
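To see the anchors change meaning in isolation, here's a minimal sketch. The sample string and variable names are made up for illustration; it grabs the first and last word of each line, with and without the 'm' flag.

```javascript
// Hypothetical sample text: three lines separated by line breaks.
const text = "alpha one\nbeta two\ngamma three";

// Without 'm': ^ only matches at the very start of the string,
// and $ only at the very end.
const firstWordsPlain = text.match(/^\w+/g);
const lastWordsPlain = text.match(/\w+$/g);

// With 'm': ^ also matches just after a line break,
// and $ also matches just before one.
const firstWordsMulti = text.match(/^\w+/gm);
const lastWordsMulti = text.match(/\w+$/gm);

console.log(firstWordsPlain); // [ 'alpha' ]
console.log(lastWordsPlain);  // [ 'three' ]
console.log(firstWordsMulti); // [ 'alpha', 'beta', 'gamma' ]
console.log(lastWordsMulti);  // [ 'one', 'two', 'three' ]
```

Same patterns, same string; only the flag differs.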

So imagine I have a pattern which matches a run of characters of any kind (that's what the [\s\S] does: it matches any character at all, line breaks included), taking as few as it can before the $ anchor is satisfied (that's the lazy +? quantifier). Let's try it without multi-line mode first:

var myStr = "This is a \n multi-line \n string";
alert(myStr.match(/^[\s\S]+?$/g));

There, I simply get back the whole string. The ^ matches the start of the string, the pattern then consumes characters one at a time, and the only place $ can match is the very end of the string. But in multi-line mode:

var myStr = "This is a \n multi-line \n string";
alert(myStr.match(/^[\s\S]+?$/gm));

This time we get back an array of 3 items - one for each sequence of words delimited by the line breaks. (The quantifier has to be lazy for this to work: a greedy [\s\S]+ would swallow the line breaks too and run straight on to the end of the string, giving a single match again.)

So what's happening there is the ^ matches the start of the string, the pattern matches "This is a " but then reaches the starting edge of a line-break. In multi-line mode, the $ matches this, so that's the end of a match. And since I'm in global mode, matching continues after the line-break.
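If you want to watch the matches happen one at a time, here's a rough sketch that walks through them with exec and records where each one starts. It uses . rather than [\s\S] (my substitution here, since . never matches a line break, so each match is forced to stay on one line), and the variable names are just for illustration.

```javascript
const myStr = "This is a \n multi-line \n string";

// '.' stops at line breaks, so in multi-line mode each match is one line.
const linePattern = /^.+$/gm;

const lines = [];
let match;
// With the global flag, exec() resumes from where the last match ended.
while ((match = linePattern.exec(myStr)) !== null) {
  lines.push({ start: match.index, text: match[0] });
}

console.log(lines);
// [ { start: 0, text: 'This is a ' },
//   { start: 11, text: ' multi-line ' },
//   { start: 24, text: ' string' } ]

// Without 'm' the same pattern finds nothing at all: '.' cannot cross
// the first line break, and $ can only match at the end of the string.
const withoutFlag = myStr.match(/^.+$/g);
console.log(withoutFlag); // null
```

Note the second and third matches start at 11 and 24, i.e. the positions immediately after each line break.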

So in conclusion, not the most indicative of names, but it can be useful. Not often, mind...


Simulating REGEX look-behinds in JavaScript

posted: 27 Feb '11 12:48 tags: REGEXP, Javascript, look-behind, simulation

It's no secret that JavaScript's implementation of regular expressions is pretty basic compared to, say, that of PHP. Even so, the lack of support for look-behind assertions (LBAs) is massively frustrating.

So I wrote my own workaround. Head over here to download, get usage info or view a demo.

My simulation defines three methods: match2 and replace2 (of the String object), and test2 (of the RegExp object). All work like their native counterparts, except they each accept an additional parameter - the LBA (as a string).

So to change all scary animals to scary monsters, you could do this:

alert('scary lion; scary crocodile; cute puppy'.replace2(/[a-z]+/gi, 'monster', '(?<=scary )'));
//== scary monster; scary monster; cute puppy
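For the curious, here's one way a simulation along these lines could work. This is a simplified sketch under my own assumptions and not the actual code behind replace2: the function name replaceWithLookBehind is hypothetical, it only handles a plain replacement string (no $1-style references), and it filters matches after the fact rather than influencing where a match may begin, which a true look-behind would do.

```javascript
// Hypothetical helper illustrating the idea; not the replace2 implementation.
function replaceWithLookBehind(str, pattern, replacement, lookBehind) {
  // Accept "(?<=...)" (positive) or "(?<!...)" (negative) as a string.
  const parsed = /^\(\?<([=!])([\s\S]*)\)$/.exec(lookBehind);
  if (!parsed) {
    throw new Error('Expected a look-behind like "(?<=...)" or "(?<!...)"');
  }
  const negative = parsed[1] === '!';

  // Anchor the look-behind body to the END of the text preceding a match.
  const before = new RegExp('(?:' + parsed[2] + ')$');

  return str.replace(pattern, function (matched, ...rest) {
    // After any capture groups (strings or undefined), the first numeric
    // argument passed to the callback is the offset of the match.
    const offset = rest.find(function (arg) {
      return typeof arg === 'number';
    });
    const precededByIt = before.test(str.slice(0, offset));
    // Replace only when the look-behind condition holds; else keep the match.
    return precededByIt !== negative ? replacement : matched;
  });
}

const result = replaceWithLookBehind(
  'scary lion; scary crocodile; cute puppy',
  /[a-z]+/gi,
  'monster',
  '(?<=scary )'
);
console.log(result); // scary monster; scary monster; cute puppy
```

The trick is simply to test the look-behind pattern, anchored with $, against everything in the string that comes before each match.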

You see, my current project involves writing a script which localises British into US English.

Some words change only when used as nouns. For example 'torch' becomes 'flashlight', but only when used as a noun. So it would change in the sentence "to shine a torch", but not in "to torch a building". 'Film' > 'movie' is another such example.

I realised the way to detect the role of a word within a sentence was (at least chiefly) to look at the word preceding it. For example if the word was preceded by the infinitive preposition 'to', or by a pronoun, it was most likely a verb.
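The heuristic itself can be sketched without any look-behind at all, by capturing the preceding word and putting it back. To be clear, this is only an illustration: the function name localiseTorch and the list of verb cues are made up for the example, and real text would need a far longer list (plus handling for capitalisation and for a word at the very start of a sentence).

```javascript
// Hypothetical cue words: if one of these comes right before the word,
// treat the word as a verb and leave it alone.
const verbCues = ['to', 'i', 'you', 'we', 'they', 'he', 'she', 'it'];

function localiseTorch(sentence) {
  // Capture the preceding word and the gap so both can be re-emitted as-is.
  return sentence.replace(/\b(\w+)(\s+)torch\b/gi, function (whole, prev, gap) {
    if (verbCues.indexOf(prev.toLowerCase()) !== -1) {
      return whole; // most likely a verb: "to torch a building"
    }
    return prev + gap + 'flashlight'; // most likely a noun: "a torch"
  });
}

console.log(localiseTorch('to shine a torch'));    // to shine a flashlight
console.log(localiseTorch('to torch a building')); // to torch a building
```

It works, but the preceding word becomes part of the match, which gets awkward quickly once patterns are more involved.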

Then I remembered you can't do LBAs in JavaScript, and I spent days pulling my hair out.

I hope it proves as useful to some of you as it has to me! Head over here to download, get usage info or view a demo.
