Fixing the SyntaxHighlighter 3.0.83 Parser Bug

The Problem:

The SyntaxHighlighter Javascript script utilizes one set of regular expressions for each "brush" used to provide syntax coloring to computer code text. Unfortunately, the logic it uses to apply these regexes is inherently flawed. It assumes that all the regexes are independent from each other and can be applied separately (and this is not true). The current parser applies each regex, one at a time, to the whole code text and then merges all of the matches together and then tries to make sense of how they all overlap. This failed logic results in a variety of erroneous colorization manifestations (see below). Here are the three SyntaxHighlighter (3.0.83) functions which implement this flawed parser design:

    /**
     * Applies all regular expression to the code and stores all found
     * matches in the `this.matches` array.
     * @param {Array} regexList     List of regular expressions.
     * @param {String} code         Source code.
     * @return {Array}              Returns list of matches.
     */
    findMatches: function(regexList, code)
    {
        var result = [];
        
        if (regexList != null)
            for (var i = 0; i &lt; regexList.length; i++) 
                // BUG: length returns len+1 for array if methods added to prototype chain (oising@gmail.com)
                if (typeof (regexList[i]) == "object")
                    result = result.concat(getMatches(code, regexList[i]));
        
        // sort and remove nested the matches
        return this.removeNestedMatches(result.sort(matchesSortCallback));
    },
    
    /**
     * Checks to see if any of the matches are inside of other matches. 
     * This process would get rid of highligted strings inside comments, 
     * keywords inside strings and so on.
     */
    removeNestedMatches: function(matches)
    {
        // Optimized by Jose Prado (http://joseprado.com)
        for (var i = 0; i &lt; matches.length; i++) 
        { 
            if (matches[i] === null)
                continue;
            
            var itemI = matches[i],
                itemIEndPos = itemI.index + itemI.length
                ;
            
            for (var j = i + 1; j &lt; matches.length &amp;&amp; matches[i] !== null; j++) 
            {
                var itemJ = matches[j];
                
                if (itemJ === null) 
                    continue;
                else if (itemJ.index &gt; itemIEndPos) 
                    break;
                else if (itemJ.index == itemI.index &amp;&amp; itemJ.length &gt; itemI.length)
                    matches[i] = null;
                else if (itemJ.index &gt;= itemI.index &amp;&amp; itemJ.index &lt; itemIEndPos) 
                    matches[j] = null;
            }
        }
        
        return matches;
    },
/**
 * Executes given regular expression on provided code and returns all
 * matches that are found.
 * 
 * @param {String} code    Code to execute regular expression on.
 * @param {Object} regex   Regular expression item info from &lt;code&gt;regexList&lt;/code&gt; collection.
 * @return {Array}         Returns a list of Match objects.
 */ 
function getMatches(code, regexInfo)
{
    function defaultAdd(match, regexInfo)
    {
        return match[0];
    };
    
    var index = 0,
        match = null,
        matches = [],
        func = regexInfo.func ? regexInfo.func : defaultAdd
        ;
    
    while((match = regexInfo.regex.exec(code)) != null)
    {
        var resultMatch = func(match, regexInfo);
        
        if (typeof(resultMatch) == 'string')
            resultMatch = [new sh.Match(resultMatch, match.index, regexInfo.css)];

        matches = matches.concat(resultMatch);
    }
    
    return matches;
};

This design assumes that there is only one set of matches that a single regex will make on a given string. Thus, it checks for and finds all the matches found when the regex is applied starting on the first character of the string. But, as the bugs section below clearly shows, there are more than one set of matches that a given regex can match, depending on where in the string you begin looking for the match. The matches are clearly different depending on where you start!

Bugs! What Bugs?

Examples demonstrating erroneous parsing.

This page is currently running the verbose, non-minimized SyntaxHighlighter 3.0.83 script and it is being used to colorize all the code blocks. The following colorized code samples utilize a high contrast color scheme to better demonstrate how the parser out-of-sync error works. Comments are green, single quoted strings are red and double quoted strings are blue.

Standard: Here is what it should always look like:

// Commenting will get you everywhere
'single' line with more 'single' line with more 'single' line with more
"double" line with more "double" line with more "double" line with more

Error #1: Ok let's add one double quote to one of the single quoted strings and watch what happens:

// Commenting will get you everywhere
'single' line with more 'sin"gle' line with more 'single' line with more
"double" line with more "double" line with more "double" line with more

Error #2: Ok let's move the double quote to the comment and watch what happens:

// Commenting will get you every"where
'single' line with more 'single' line with more 'single' line with more
"double" line with more "double" line with more "double" line with more

What happens if there are two "hidden" quotes?:

// Comment"ing will get you everywhere
'single' line with more 'single' line with more 'sin"gle' line with more
"double" line with more "double" line with more "double" line with more

Error #3: Three?:

// Comment"ing will get you every"where
'single' line with more 'single' line with more 'sin"gle' line with more
"double" line with more "double" line with more "double" line with more

Error #4: Next try adding a single quote to the comment:

// Commenting won't get you anywhere
'single' line with more 'single' line with more 'single' line with more
"double" line with more "double" line with more "double" line with more

Error #5: Lastly, just one of each quote type in the comment:

// Commenting "won't get you anywhere
'single' line with more 'single' line with more 'single' line with more
"double" line with more "double" line with more "double" line with more

These examples demonstrate only a few ways the parser can be tripped-up. There are likely more strange highlighting combinations and erroneous manifestations. Here's a couple bugs on the list that are very likely caused by this:

Quote characters in comments break string matching (javascript syntax)
php brush mutiline double quote not working
SQL brush error when single quote and backslash (2.1.364)
and there's more I'm sure...

The Solution:

The regexes are NOT independent of one another. They DO affect each other. A quote may appear within a comment, and it must never be considered as one end of a string delimiter. The regex used to match a string should never be applied inside a comment. The regex to match a comment should never be applied inside a string. Each bit of code text gets just one color and belongs to just one of the regexes.

A much better (and accurate) way to parse the code is to check all regexes at each location in the code string and allow only one regex to match any portion of the code string. In other words, the parser starts off at position zero and checks if any of the regexes match there. The first regex which matches (precisely at this location), wins, and this match is pushed onto the stack of matches. (If none of the regexes match precisely at this position, but one or more match at positions further down the string, then the nearest match is selected as the next match.) The code index pointer is advanced to the end of the current match and then, once again, all of the regexes are tested at this new, advanced location. In this manner the parser "walks" the code string only once and is quite efficient. But more importantly - it parses the syntax correctly. The following function (which is a drop in replacement for: findMatches() and removeNestedMatches()), correctly performs the parsing job in the manner just described.

    /* Combines functionality of old findMatches(), getMatches() and removeNestedMatches()
     * functions. Parse a string of code using an array of regular expressions.
     * 
     * @param {Array} regexList     List of regular expressions.
     * @param {String} code         Source code.
     * @return {Array}              Returns list of matches.
     */
    findMatchesNew: function(regexList, code)                   // Rev:20110216_1700
    {
        var matches = [];                                       // Return value. Array of Match objects.
        if (!regexList) return matches;                         // If regexlist is not defined, we're done.
        var nregex = regexList.length;                          // Loop end value.
        var pos = 0;                                            // Position within code string to begin regex search.
        var re, minIndex, resultMatch, func, match;             // Some locals.
        function defaultAdd(m, r) { return m[0]; };             // Dummy brush modify result function.
        var nextMatch = true;                                   // Prime the pump.
        while(nextMatch) {                                      // One regex match per loop. Advance pos through code.
            nextMatch = null;                                   // Assume this will be the last time through.
            minIndex = code.length + 1;                         // Reset min-match-index to impossibly large value.
            for (var i = 0; i < nregex; i++) {                  // Loop though all regexes at this pos.
                if (typeof(re = regexList[i]) === "object" && re.regex.global) { // Process only 'g'-flagged regexes.
                    re.regex.lastIndex = pos;                   // Start search at same pos for each regex.
                    if ((match = re.regex.exec(code)) !== null) {// Check if this regex matches anywhere?
                        if (match.index === pos) {              // Yes. Check if matched on first char?
                            nextMatch = {r: re, m: match};      // Yes. This is unconditionally our next match.
                            break;                              // Exit "try-all-regexes" for loop.
                        } // Otherwise we matched, but not on the first char. Need to run all regexes to find best.
                        if (match.index < minIndex) {           // Check if we have a new best match?
                            minIndex = match.index;             // Yes, set the bar a little lower.
                            nextMatch = {r: re, m: match};      // Save needed items for Match object.
                        }       
                    }
                } // End if re is object.
            } // End for loop.
            if (nextMatch) {                                    // Check if one of the regexes matched?
                func = nextMatch.r.func ? nextMatch.r.func : defaultAdd;
                resultMatch = func(nextMatch.m, nextMatch.r);   // Allow brush option to customize result.
                if (typeof(resultMatch) === 'string')           // Check if we need a new Match object?
                    resultMatch = new sh.Match(resultMatch, nextMatch.m.index, nextMatch.r.css); // Yes.
                matches.push(resultMatch);                      // Place this match on top of the stack.
                pos = nextMatch.r.regex.lastIndex;              // Adjust pos forward to end of this match.
            }
        } // End while loop.
        return matches;                                         // return array of regex matches
    },

It works! Preliminary testing indicates that this new function works very well. It is probably quite a bit faster, too, (since it doesn't need to do as much work as before). Implementing this function will certainly clear up a lot of erroneous syntax highlighting. Interestingly, the above Javascript snippets appear to be colorized just fine, using the old parser. Although it has a flawed design at its core, it does manage to get it right quite often. (But, unfortunately, never 100% of the time - which is what is called for here.)

Disclaimer:

This function fixes only one portion of the SyntaxHighlighter script. There is another part of the script dealing with "html-scripts" which also calls the getMatches() function. This author has not taken the time to figure out what that portion of the script does. Thus, there very well be some "gotchas" that will need to be ironed out.

Conclusion:

Hopefully, this helps paint a better picture of what is happening. I hope that the author of the project can take a look at this and consider implementing the corrected logic. The function above is offered freely in the hopes that it may be useful. Cheers everybody and Happy Syntax Highlighting!

Jeff Roberson - February 7, 2011