A (RegExp) Catastrophe
11th Oct 2012
Have you ever come across the term "Catastrophic Backtracking" in relation to regular expressions? Well I came face-to-face with it's fury the other day while researching a bug in an old CMS module which I was trying my best to fix.
It started out with an outrageous claim from a QA tester that he could crash Chrome by typing in a large number of characters in to a text box. I tested it out and to my surprise the problem was worse than he realised and it all stemmed from a particular regular expression being used to validate an email address. It looked like this:
^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$
Looks pretty harmless?? But combine it with a test like so:
/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dddddddddddddddddddddddddddddddddddddddd');
Whoooop browser bye bye. In fact you could simplify it to the following test as we found on the stackoverflow thread:
/(d+)*A/.test('dddddddddddddddddddddddddddddddddddddddd');
So what's going on here? Well for the full explanation provided by Tim take a look at the answer, but to give you a quick summary. (d+)*
in essence can be fulfilled by the following matches (each group separated by spaces):
ddddd
dddd d
ddd dd
dd ddd
d dddd
ddd d d
dd d dd
d d ddd
dd dd d
d dd dd
d ddd d
d d d dd
d d dd d
d dd d d
dd d d d
d d d d d
This essentially means that there are an exponential number of possibilities dependant on the length of the string, and at a character length of only 40 you have a possible 549,755,813,888 combinations which takes time to go through and causes the browser to hang.
So keep an eye out for the following combinations in your regular expressions where d can be a single character or (more likely) the result of another block:
(d+)*
(d*)*
(d+)+
Read more information on Catastrophic Backtracking.
Sidenote: As a lot of people made quite clear in the discussion on Stack Overflow, it isn't recommended to use a regular expression to test the full email. It is much safer and efficient to break it up in to it's sections and validate each individually.