jQuery Regular Expressions Review

Update 2010-09-23: Most of the recommendations described below were implemented into the jQuery codebase 2010-09-22. (Git commit: 9dc6e0c572b9c809a3a4c123071d96d48a01dd1c)

Introduction

jQuery is a very popular Javascript framework which internally utilizes a number of regular expressions. Most of these regexes work correctly and accurately match (and fail to match) their intended targets. However, quite a few of these can be improved with regard to efficiency (some with regard to matching, others with regard to non-matching - both are important). A comprehensive review of all jQuery regular expressions was recently performed, and this article presents the results of this analysis and offers improved versions. (The jQuery version analyzed was 1.4.3pre: Git commit #747ba7defd82bffa6c7ccb69e53b834cbfddb62c 20-Sep-2010.)

Note: The improvements presented here are offered freely to the jQuery project, and come primarily as a result of applying efficiency techniques gleaned from Jeffrey Friedl's classic work: Mastering Regular Expressions (3rd Edition) (MRE3).

Table of jQuery Regular expressions.

The following table presents a subset of the jQuery regular expressions organized by file name, line number and variable name, along with recommended improved versions. All the regexes in the jQuery core codebase were scrutinized, but only those which could be improved with regard to efficiency (as judged by this author) are presented. Except where noted, all the improved regexes are functionally identical to their original counterparts, and although most of them are physically longer, they should all be somewhat faster (and some, significantly faster). Notes are provided below the table to describe the reasoning behind the changes. Also, to provide some additional eye/mind-candy, the regexes are presented in the table with both colorized and dynamic syntax highlighting applied. Enjoy!

Table 1. jQuery Regular Expression Recommended Changes
File name Line # Var name	Original regex (Before)
	Recommended regex (After)
	Notes
ajax.js 004 rscript	/<script(.\|\s)*?\/script>/gi
	/<script\b[^<](?:(?!<\/script>)<[^<])*<\/script>/gi
	Update 2010-09-23: Here is another version which matches script tags that are not properly terminated. /<script\b[^<](?:(?!<\/script>)<[^<])*(?:<\/script>\|$)/gi
	Notes: 1, 2, 3, 4, 5 The original regex is particularly inefficient in that it uses a lazy quantifier applied to a capture group containing alternatives. (A fast regex avoids unnecessary alternation, capturing groups and lazy quantifiers - this regex has all three!) For each and every character in a target string script tag, the regex engine must perform three steps including; checking the alternatives, saving the capture group contents and backtracking. Given a 2.5KB test script, the old regex requires 7966 steps to complete its match while the new regex requires only 58. Note also that some might suggest a much simpler regex syntax: "`/<script[\S\s]?\/script>/gi`". Although this is an improvement over the original (it eliminates the unnecessary capture group and alternation), its speed still does not compare with the much more efficient regex presented above. The "unrolling-the-loop"* technique implemented here makes the regex a bit harder to read, but it is much faster, because it allows the engine to consume large swaths of characters in each "gulp" (using the greedy "`[^<]`" construct), and this significantly* reduces the need for backtracking. Although there are some regex engines which can efficiently process the lazy quantifier (e.g. Perl), most can't (and don't). There are certainly circumstances where a lazy quantifier is the best choice, but this is not one of them. (For more insight into this topic, see Steven Levithan's excellent post: "Performance of Greedy vs. Lazy Regex Quantifiers") Update 2010-09-23: Another (very serious) problem with the original regex is that it exhibits catastrophic backtracking when presented a target string having a script tag that is not properly terminated (i.e. it does not have a `</script>` closing tag). This happens because the first capture group "`(.\|\s)?`" contains two alternatives which overlap and can both match the same characters* (i.e. the space and tab characters). When presented with a script tag that is not properly terminated and contains lots of spaces, the regex engine will take a LONG time to declare a non-match. (e.g. With the 2.5KB example script, the debugger gave up trying to declare a non-match after 1 million steps.) This brings up another point that this regex should probably match non-properly-terminated scripts. In other words, if a script tag does not have its closing `</script>` tag, it should match everything from the opening tag up to the end of the string. A new regex which does precisely this is added as a second recommended option above.
ajax.js 005 rselectTextarea	/select\|textarea/i
	/^(?:select\|textarea)/i
	Notes: 6, 7 This regex is used to match a target string consisting of a single word: a DOM node's `nodeName` property. We know that the regex must always match starting at the very first character position of the target string. (Knowing details like this about the target string allows crafting a much more efficient regex.) The original regex is fast when it matches, but is inefficiently slow when it doesn't. Without the "`^`" "start-of-string" anchor, the original regex forces the regex engine to painstakingly (and needlessly) "bump along" and test the whole regex against each and every position in the target string before it can declare match failure. (Eliminating unnecessary work like this is one of the keys to crafting an efficient regex.) The new regex adds the start-of-string anchor, allowing the regex to declare failure immediately, at the beginning of the string. In this case the time savings is small (either regex is fast when applied to small target strings), but the savings add up when this is called many times. Note that the improvement just described applies to many of the regex changes below.
ajax.js 006 rinput	/color\|date\|datetime\|email\|hidden\|month\|number\| password\|range\|search\|tel\|text\|time\|url\|week/i
	/\b(?:color\|date\|datetime\|email\|hidden\|month\|number\| password\|range\|search\|tel\|text\|time\|url\|week)/i
	Notes: 7, 8, 9 In a manner similar to the previous example, adding (or "exposing") a word boundary anchor to the beginning of this regex improves the efficiency in the case of non-matches, by reducing the number of positions within the string where the regex engine attempts a match. Instead of attempting a match at every location within a target string, the engine now only needs to check on word boundaries.
ajax.js 009 rts	/(\?\|&)_=.*?(&\|$)/
	/([?&])_=[^&\r\n]*(&?)/
	Notes: 2, 3, 5 The first improvement is removing the unnecessary alternation of the first capture group and replacing it with a simple character class. The second (and most significant) improvement is eliminating the lazy-dot-star and replacing it with a more precise and efficient greedy-star applied to an "anything-except-ampersand-or-end-of-line" character class. And thirdly, the unnecessary alternation of the second capture group is removed.
attributes.js 007 rtype	/(button\|input)/i
	/^(?:button\|input)/i
	Notes: 1, 6, 7 Once again, as previously described, adding the beginning-of-string anchor speeds up the case of a non-match. And the capture group was not being used so was changed to a non-capturing group.
attributes.js 008 rfocusable	/(button\|input\|object\|select\|textarea)/i
	/^(?:button\|input\|object\|select\|textarea)/i
	Notes: 1, 6, 7
attributes.js 009 rclickable	/^(a\|area)$/i
	/^a(?:rea)?$/i
	Notes: 1, 2, 10 Although this changed version eliminates the use of both a capture group and alternation, it does add one new quantifier (the "`?`"). It is unknown if this change will measurably perform any faster. (It should be a bit faster, but since either regex can only match a very short string, it won't make much of a difference because both versions will be lighting fast.)
attributes.js 010 rradiocheck	/radio\|checkbox/
	/^(?:radio\|checkbox)/i
	Notes: 6, 7, 12, 13 The original regex does not have the "i" ignore-case modifier. It is assumed that the regex should match these strings regardless of case, so the "i" modifier was added. (Note that this is one case where the new regex does not match the same strings as the original - but hopefully, in this case, this is a good thing.)
core.js 023 quickExpr	/^[^<](<[\w\W]+>)[^>]$\|^#([\w\-]+)$/
	/^(?:[^<](<[\w\W]+>)[^>]$\|#([\w-]+)$)/
	Notes: 6, 7 The original regex has two global alternatives each beginning with a "`^`" start-of-string anchor, and ending with a "`$`" end-of-string anchor. However, because there are two global alternatives, the regex engine is typically not smart enough to know that a match can't happen anywhere beyond the first character, so it goes ahead and (needlessly) attempts both alternatives at every position along the string. The improved regex factors out the start of string anchor and encloses the two alternatives within a non-capturing group. This exposes the anchor by itself at the beginning so that the engine can declare match failure after testing the beginning of string only. This saves a lot of unnecessary steps when testing against strings which do not match. Note, however, that the opposite is true for the "`$`" end-of-string anchor. In this case, it is actually more efficient to keep each one at the end of its respective alternative - factoring this out and placing it outside the non-capture group would be slightly (but negligibly) less efficient.
css.js 004 ropacity	/opacity=([^)]*)/
	~~/opacity=/~~
	Notes: 1, 10, 11 The way this regex is being used in the code is with a `RegExp.test()` method call. The capture group (everything following the "=" in the regex), is not being used for any purpose, and since it has a star quantifier, can match nothing at all. Thus, in this case, there is no purpose served by testing for characters beyond the equals sign. This portion of the regex is unnecessary, and can be safely removed, reducing its length and improving the speed in the case of a successful match. Update 2010-09-23: As pointed out by John Resig, the note above is incorrect. The capturing group of the original regex is being used - it is being accessed via the `RegExp.$1` function property. Thus, the original regex is A-Ok as-is.
effects.js 004 rfxtypes	/toggle\|show\|hide/
	/\b(?:toggle\|show\|hide)/i
	Notes: 6, 7, 12, 13 The original regex is case sensitive and will not match upper or mixed case versions of the three alternatives. Assuming this is not the desired behavior the new regex adds the "i" modifier so that mixed cases do match. (Note that this changes what the regex will actually match - hopefully for the better.) A "`\b`" word boundary anchor is placed at the beginning and the alternatives wrapped in a non-capture group to allow faster non-matching.
event.js 005-007 fcleanup	return nm.replace(/[^\w\s\.\\|`]/g, function( ch ) { return "\\" + ch; });
	return nm.replace(/[^\w\s.\|`]/g, "\\$&");
	Notes: 11, 14, 15 The old regex is unnecessarily escaping characters within a character class (this does not affect its speed but does make the regex longer than it needs to be). Also, the replace function needlessly utilizes an anonymous callback function - (a simple replacement string will do the trick). Fixing these two inefficiencies reduces the code size and improves execution speed.
event.js 700 rscript	/textarea\|input\|select/i
	/^(?:textarea\|input\|select)/i
	Notes: 6, 7
manipulation.js 5,6,12-16,238,482 rxhtmlTag rselfClosing fcloseTag	var rxhtmlTag = /(<([\w:]+)[^>]*?)\/>/g, rselfClosing = /^(?:area\|br\|col\|embed\|hr\|img\| input\|link\|meta\|param)$/i, fcloseTag = function( all, front, tag ) { return rselfClosing.test( tag ) ? all : front + "></" + tag + ">"; }; value = value.replace(rxhtmlTag, fcloseTag); elem = elem.replace(rxhtmlTag, fcloseTag);
	var rxhtmlTag = /<(?!area\|br\|col\|embed\|hr\|img\| input\|link\|meta\|param)(([\w:]+)[^>]*)\/>/ig; value = value.replace(rxhtmlTag, "<$1></$2>"); elem = elem.replace(rxhtmlTag, "<$1></$2>");
	Notes: 5, 7, 9, 11, 15, 16 This jQuery code is taking invalid HTML self-closing tags (e.g. `<TAG/>`), and converting them to (hopefully) valid syntax - having both opening and closing tag components (e.g. `<TAG></TAG>`). The original code uses two regexes and a callback function to perform this operation. This can be simplified by combining the two regexes into one and replacing the callback function with a single replacement string. This change reduces the code size and significantly improves the execution speed for both matching and non-matching cases.
manipulation.js 010 rnocache	/<script\|<object\|<embed\|<option\|<style/i
	/<(?:script\|object\|embed\|option\|style)/i
	Notes: 5, 7, 17 The original regex has numerous global alternatives, each beginning with the same literal text (the "`<`" character). With regexes having global alternatives, the regex engine must test all of these alternatives at each and every position in a target string. The new regex factors out the common literal text, and wraps the alternatives in a non-capture group. By "exposing" literal text at the beginning of the regex in this manner, the regex engine only needs to check for matches at character positions beginning with a "`<`". This speeds up both matching and non-matching cases.
offset.js 075 literal regex	/^t(able\|d\|h)$/i
	/^t(?:able\|d\|h)$/i
	Notes: 1, 10, 18 Converting the capture group of this regex into a non-capture group is good policy, but will result in only a negligible performance gain in this case. But more importantly, this regex (and quite a few others), should be cached in pseudo-static variables outside its function (as most of the other regexes in the jQuery codebase already are).
offset.js 210 rscript	/^body\|html$/i
	/^(?:body\|html)$/i
	Notes: 6, 7 The original regex has two global alternatives, one with a start-of-string anchor and the other with an end-of-string anchor. The intent was to anchor both ends but this is not the effect - the regex engine must still attempt a match at each and every position in the target string. The new regex adds the missing non-capture group which fulfills the original regex's intent.

Notes:

Eliminated unnecessary use of capturing group.
Eliminated unnecessary use of alternation.
Eliminated inefficient use of lazy quantifier.
Implemented "unrolling-the-loop" efficiency technique.
A successful match is significantly faster.
Added non-capture group and beginning of string anchor.
A successful non-match is significantly faster.
Added non-capture group and word boundary anchor.
Linefeed added to regex to allow table to fit on screen.
Negligible performance improvement.
Reduced code size.
Added "i" ignore-case modifier.
The old and new regexes can match different text.
Removed unnecessary backslash escapes.
Removed unnecessary callback function.
Removed unnecessary local variables.
Added non-capture group and exposed literal starting text.
Literal RegExp object should be cached.

Comments

As per note 18, there are still a number of literal regexes in the jQuery codebase that have not yet been cached. Caching these is recommended, (but this article does not address them).

Many of the regexes above are being applied to very short strings (e.g. a node's nodeName property), so any performance gains achieved will be negligible for any single call. However, if executed many times, the small performance gains will add up. And it is this author's strong opinion that all regexes implemented in popular software, (such as jQuery), should be optimized, even if the gains are infinitesimal. These small gains really start to add up when one considers the number of sites using this code and the number of users frequently visiting all these sites!

Testing

All the regexes were tested for accuracy and run through the RegexBuddy regular expression debugger to count the number of steps required to perform matching and non-matching on a set of appropriate test data. In all cases, the recommended new regexes required fewer steps on average. (There are a few cases where the original regex would match in a few less steps, but in these cases the new regex would perform significantly better in the case of non-matches.) Although these new regexes were carefully constructed with great attention to detail, the accuracy testing performed was not exhaustive. Also note that no time benchmarks have yet been performed.

If you are a RegexBuddy user and would like to test the regexes for yourself, here is a link to the RegexBuddy library file containing all the before-and-after regexes along with the test data used to test their accuracy: jQueryRegexes.rbl

Summary

jQuery is an excellent Javascript library that has quite a few regular expressions that can be improved with regard to efficiency. This article presents new versions of these regexes which should prove to perform better than the originals. It is this author's hope that they may be of some use to the jQuery project.