Thursday, April 16, 2009

greedy greediness in Regular Expression

These quantifiers are greedy - that is your pattern will try to match as much text as possible. Sometimes it presents a problem. Let's consider a typical example - define a pattern to match delimited text, i.e. text enclosed in quotes, brackets, etc. Since we don't know what kind of text is inside the quotes we'll use

/".*"/

But this pattern will match everything between the first " and the last " in the following line:

this file is normally "$VIM/.gvimrc". You can check this with ":version".

This problem can be resolved by using non-greedy quantifiers:
Quantifier

Description
\{-}
matches 0 or more of the preceding atom, as few as possible
\{-n,m}
matches 1 or more of the preceding characters...
\{-n,}
matches at lease or more of the preceding characters...
\{-,m}
matches 1 or more of the preceding characters...
where n and m are positive integers (>0)

Let's use \{-} in place of * in our pattern. So, now ".\{-}" will match the first quoted text:

this file is normally "$VIM/gvimrc". You can check this with ":version".

.\{-} pattern is not without surprises. Look what will happen to the following text after we apply:

:s:.\{-}:_:g

Before:

n and m are decimal numbers between

After:

_n_ _a_n_d_ _m_ _a_r_e_ _d_e_c_i_m_a_l_ _n_u_m_b_e_r_s_ _b_e_t_w_e_e_n_

"As few as possible" applied here means zero character replacements. However match does occur between characters! To explain this behavior I quote Bram himself:

Matching zero characters is still a match. Thus it will replace zero characters with a "_". And then go on to the next position, where it will match again.

It's true that using "\{-}" is mostly useless. It works this way to be consistent with "*", which also matches zero characters. There are more useless ones: "x\{-1,}" always matches one x. You could just use "x". More useful is something like "x\{70}". The others are just consistent behavior: ..., "x\{-3,}", "x\{-2,}", "x\{-1,}.

No comments: