Simple uses of quantifiers can make regexes take quadratic time unnecessarily
Suppose I want to extract all substrings of a text that consist of one or more letter `a`s followed by a letter `b`. This is obviously both:
- trivial to do with a single short regex
- trivial to do in linear time
but if you try to do it with a regex in linear time, you may well screw it up and never even realise. There is a nasty trap here!
Until recently, I would have solved this problem with code something like this (example in JavaScript):
```javascript
function extractABs(text) {
  return text.match(/a+b/g);
}
```
... and I would have assumed that this would execute in linear time. After all, you can easily envisage the linear-time algorithm: you just need to make a single pass over `text`, keeping track as you go of the index at which each sequence of consecutive `a`s starts, and checking at the end of each one whether it is followed by a `b` or by another character.
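Here's a minimal sketch of that single-pass approach (my own illustration, not code from any library):

```javascript
// Linear time: a single pass over the string, no regex.
// (A sketch only; note it returns [] rather than null when there are no matches.)
function extractABsLinear(text) {
  const results = [];
  let runStart = -1; // index where the current run of 'a's began, or -1 if not in a run
  for (let i = 0; i < text.length; i++) {
    if (text[i] === 'a') {
      if (runStart === -1) runStart = i;
    } else {
      if (runStart !== -1 && text[i] === 'b') {
        results.push(text.slice(runStart, i + 1));
      }
      runStart = -1;
    }
  }
  return results;
}
```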
But the regex-based solution above does not execute in linear time in the worst case! Chuck the following code into your browser console or Node shell and behold the results:
```javascript
function extractABs(text) {
  return text.match(/a+b/g);
}

const a50000 = 'a'.repeat(50000);
const a100000 = 'a'.repeat(100000);
const a150000 = 'a'.repeat(150000);
const a200000 = 'a'.repeat(200000);
const a250000 = 'a'.repeat(250000);

function timeRegex(text) {
  console.time();
  extractABs(text);
  console.timeEnd();
}

timeRegex(a50000);
timeRegex(a100000);
timeRegex(a150000);
timeRegex(a200000);
timeRegex(a250000);
```
The times I got were 2.112s for 50000 `a`s, 8.451s (4x) for 100000 `a`s, 19.009s (9x) for 150000 `a`s, 33.757s (16x) for 200000 `a`s, and 52.856s (25x) for 250000 `a`s. In other words, execution time is clearly increasing quadratically with the size of the input.
Why is this happening? This is not an example of what many sources call "catastrophic backtracking", which typically happens when you write regexes with nested quantifiers (like `(a+)+`) and results in exponential time complexity. In fact, the cause of the non-linear time complexity here doesn't involve any backtracking at all.
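For contrast, here's what the catastrophic case looks like, sketched with a deliberately small input, since its runtime roughly doubles with every extra `a`:

```javascript
// Nested quantifiers: the engine tries exponentially many ways to split the
// run of 'a's between the inner + and the outer + before giving up.
console.time('nested quantifiers');
/(a+)+b/.test('a'.repeat(26) + 'c'); // there's no 'b', so every split gets tried
console.timeEnd('nested quantifiers');
```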
Instead the problem is simply that, after trying to match the regex at index 0 (the start) of our length-n string, consuming n characters, and failing, the regex engine then moves on to trying to match the regex at index 1 of the string, consumes n-1 characters and fails, then tries to match at index 2 of the string, consumes n-2 characters and fails, and so on. The total number of characters that need to be consumed this way is of course the nth triangular number[1], (n² + n) / 2, and thus we have quadratic time complexity.
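To make that arithmetic concrete, here's a toy loop (my own sketch, mimicking the engine's matching strategy rather than its actual implementation) that counts the characters consumed when matching `/a+b/` against a string of n `a`s:

```javascript
// Try the pattern at each start index; consuming each 'a' costs one step.
function countSteps(n) {
  const text = 'a'.repeat(n);
  let steps = 0;
  for (let start = 0; start < text.length; start++) {
    let i = start;
    while (i < text.length && text[i] === 'a') { i++; steps++; } // consume a+
    // ...then the check for 'b' fails, and the engine retries at start + 1
  }
  return steps;
}

console.log(countSteps(10));  // 55   === (10² + 10) / 2
console.log(countSteps(100)); // 5050 === (100² + 100) / 2
```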
There's a simple fix: use a lookbehind assertion in your regex that says that the sequence of `a`s, `a+`, must not be preceded by an `a`:
```javascript
function extractABs(text) {
  return text.match(/(?<!a)a+b/g);
}
```
Now the pathological case with 250000 `a`s, which was previously taking a minute to execute, takes a millisecond instead (on my machine). This works because the lookbehind short-circuits the regex engine's attempts to match the pattern in the middle of a substring of `a`s. After failing to match at index 0, the regex engine will still ask "does the pattern match at index 1?", but instead of having to consume 249999 characters to get an answer to that question, it will see that there's an `a` directly before index 1, immediately answer "no", and move on. Each such question therefore requires only constant time to answer, and thus we're back to linear time complexity for the whole matching operation.
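One thing worth convincing yourself of: the lookbehind doesn't change what gets matched, because a match of the greedy `a+` can never start immediately after an `a` anyway. A quick check:

```javascript
const sample = 'xaabyabzaaab';
console.log(sample.match(/a+b/g));       // [ 'aab', 'ab', 'aaab' ]
console.log(sample.match(/(?<!a)a+b/g)); // [ 'aab', 'ab', 'aaab' ]
```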
This tactic seems broadly applicable. In situations where you have a regex that starts with an indefinitely repeated expression (`a+`, `(foo)*`, `[0-9]+`, whatever), if you want to keep the regex's execution time linear, you almost certainly want to precede it with a negative lookbehind (`(?<!a)a+`, `(?<!foo)(foo)*`, `(?<![0-9])[0-9]+`, etc.).
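For example, applying the same fix to the digit-run case (a toy benchmark of my own; the trailing `x` is just there to make the match fail the same way the missing `b` did above):

```javascript
const digits = '1'.repeat(100000);

console.time('without lookbehind');
digits.match(/[0-9]+x/g); // quadratic: re-scans the run at every start index
console.timeEnd('without lookbehind');

console.time('with lookbehind');
digits.match(/(?<![0-9])[0-9]+x/g); // linear: fails fast mid-run
console.timeEnd('with lookbehind');
```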
But until today, I didn't know this. I suspect most programmers I've worked with didn't know this. How many quadratic-time regexes have we unwittingly written, to perform what ought to be linear-time operations, due to this trap? I'm sure I've authored a few...
[1] Actually, that's kind of an oversimplification, since there is backtracking involved here which I would guess roughly doubles the time taken. It just doesn't increase the time complexity.