FluxBB Parser Regular Expressions

© Jeff Roberson
Created: 2009-Jul-12
Edited: 2010-Jan-08
Revision History

Introduction

The FluxBB forum software is a lightweight, web-standards-aware (and XHTML-valid) open source project which adheres to the "Keep it simple, stupid" philosophy. Much of the code is clean and well designed. However, in this author's opinion, some of the regular expressions (regexes) in the latest version (1.3) are not very well written. This article describes in detail some of the regexes located in the parser.php file and provides improved versions of them. The regexes were obtained from the latest version of the parser (as of 2009-07-11) in the FluxBB source trunk (i.e. parser.php version 1078). What follows is a detailed analysis of the first two regexes in the parser.php file, both of which have errors. As time goes on, additional regexes may be optimized and published here if the author's interest in the FluxBB project continues.

Note that in addition to FluxBB 1.4.1078, this article applies equally to PunBB 1.3.4 and FluxBB 1.3-legacy forums, because all three share much of the same code in parser.php (including the two regexes described herein).

REGEX 1: FluxBB 1.4.1078 parser.php line 62

Let's first look at the FluxBB 1.4.1078 regex on line 62 (PunBB 1.3.4 line 37 and FluxBB 1.3-legacy line 37). This part of the code simply checks whether the signature text contains any BBCodes that are not allowed in a signature, i.e. [quote], [code] or [list]. If the regex finds one of these "illegal" signature tags, the code generates an error. The as-written regex attempts to match an opening QUOTE BBCode tag in any of the following forms: [quote], [quote=&quot;name&quot;], [quote="name"], [quote='name'] or [quote=name]. However, the regex fails to match all but the attribute-free [quote] syntax. The following is an analysis (complete with commented, free-spacing mode versions) of the old regex along with a fixed version and a simplified new version:

Here is the old regex (FluxBB parser.php version 1.4.1078 line 62):

Old regex:

    '#\[quote(=(&quot;|"|\'|)(.*)\\1)?\]|\[/quote\]|\[code\]|\[/code\]|\[list(=([1a\*]))?\]|\[/list\]#i'

Old regex (with comments):

    '%
    \[quote                # match start of [quote variations
    (                      # capture nothing important in group 1
      =                    # equals sign delimits quote from its attribute
      (                    # capture opening attribute quote variation in group 2
        &quot;|"|\'|       # which can be &quot;, double quote, single quote, or NULL
      )                    # end capture group 2
      (.*)                 # capture quote attribute in group 3 (too GREEDY!)
      \\1                  # match group 1 (Should be group \2 ERROR!)
    )?                     # end optional capture group 1
    \]                     # match end of opening quote tag
    | \[/quote\]           # match quote end tag
    | \[code\]             # match code start tag
    | \[/code\]            # match code end tag
    | \[list               # match start of [list variations
    (                      # capture list attribute in group 4
      =                    # equals sign delimits list from its attribute
      (                    # capture list type variation in group 5
        [1a\*]             # which can be "1", "a" or "*"
      )                    # end capture group 5 with list type
    )                      # end capture group 4 with list attribute
    ?\]                    # match literal list closing bracket
    | \[/list\]            # match list end tag
    %ix'

Critique of old regex:
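The backreference problem can be checked directly. What follows is a minimal PHP sketch, not code from FluxBB: the sample tags and the $old_hits array are hypothetical, and the &quot; alternative in the pattern is an assumption about how the entity-encoded form appears in the source. Because \1 sits inside group 1 itself, PCRE sees a reference to a group that has not finished matching, so the optional attribute part can never succeed.

```php
<?php
// Old signature check. In a single-quoted PHP string, \\1 becomes the regex
// backreference \1. The &quot; alternative is an assumption (see lead-in).
$old = '#\[quote(=(&quot;|"|\'|)(.*)\\1)?\]|\[/quote\]|\[code\]|\[/code\]|\[list(=([1a\*]))?\]|\[/list\]#i';

// Hypothetical sample tags, each tested on its own.
$samples = array('[quote]', '[quote=name]', '[quote="name"]', "[quote='name']", '[/quote]');

$old_hits = array();
foreach ($samples as $tag) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $old_hits[$tag] = preg_match($old, $tag);
    printf("%-16s %s\n", $tag, $old_hits[$tag] === 1 ? 'match' : 'no match');
}
```

Only the attribute-free [quote] and the closing [/quote] are reported as matches. Changing \\1 to \\2 in the pattern lets the attribute forms match as well.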

Here is the fixed regex (JMR version 2009-07-13):

Fixed regex:

    '%\[(?:quote(?:=("|\'|).*?\1)?|/quote|/?code|list(?:=[1a*])?|/list)\]%ix'

Fixed regex (with comments):

    '%
    \[                           # anchor beginning of regex with literal opening bracket common to all cases
    (?:                          # start non-capturing group to enclose alternatives
      quote(?:=("|\'|).*?\1)?+   # match quote, quote="text", quote=\'text\' or quote=text
    | /quote                     # or a /quote
    | /?code                     # or a code or /code
    | list(?:=[1a*])?+           # or a list, list=1, list=a or list=*
    | /list                      # or a /list
    )                            # end non-capturing group enclosing alternatives
    \]                           # anchor end of regex with literal closing bracket common to all cases
    %ix'

Description of fixed version enhancements:

Compared to the original, this fixed regex is smaller and faster, and it actually matches all the targeted BBCode tags. However, in the parser code, this regex is used only once, as a quick check to test signature text for the presence of disallowed tags. In this case, an even simpler regex, such as the following, might do the job better.

Here is a new simplified regex (JMR version 2009-07-12):

New simpler regex:

    '%\[/?(?:quote|code|list)\b[^\]]*\]%i'

New simpler regex (with comments):

    '%
    \[                     # anchor start of regex to literal opening bracket
    /?                     # optional slash to match closing tag
    (?:quote|code|list)    # list of tags not allowed in signature
    \b                     # tag name must be whole word
    [^\]]*+                # allow any tag attributes
    \]                     # anchor end of regex to literal closing bracket
    %ix'

This regex is even better for accomplishing the simple task at hand. It does not bother checking the validity of any opening tag attributes and combines both opening and closing tags into one sub expression. It matches everything the previous regexes do, and does it faster and is much easier to read. (Readability is important for maintainability.)
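As a rough check of that claim, the sketch below runs the fixed and the simplified regex over the same set of tags. It is only an illustration: the sample strings and the $fixed_hits and $simple_hits arrays are hypothetical names, not anything from parser.php.

```php
<?php
// The fixed and simplified regexes, copied from the one-line forms above.
$fixed  = '%\[(?:quote(?:=("|\'|).*?\1)?|/quote|/?code|list(?:=[1a*])?|/list)\]%ix';
$simple = '%\[/?(?:quote|code|list)\b[^\]]*\]%i';

// Hypothetical samples: every disallowed tag form, plus one allowed tag.
$samples = array(
    '[quote]', '[quote=name]', '[quote="name"]', "[quote='name']", '[/quote]',
    '[code]', '[/code]', '[list]', '[list=1]', '[list=a]', '[list=*]', '[/list]',
    '[b]bold[/b]', // allowed in a signature, so neither regex should match
);

$fixed_hits = array();
$simple_hits = array();
foreach ($samples as $tag) {
    $fixed_hits[$tag]  = preg_match($fixed, $tag);
    $simple_hits[$tag] = preg_match($simple, $tag);
    printf("%-16s fixed=%d simple=%d\n", $tag, $fixed_hits[$tag], $simple_hits[$tag]);
}
```

The word boundary in the simplified version keeps it from firing on a longer tag name that merely starts with "list", "code" or "quote".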

Summary of REGEX 1 analysis

The error in this regex does not really manifest in real-world use. From a functional viewpoint, it doesn't matter that the original regex fails to match the opening [quote=xyz] syntax variations, because it will always match the closing [/quote] and throw the same error. The improved regexes provided here are more efficient and will match (or fail to match) more quickly than the original. But this, too, is a moot point: the regex is only applied to the text of a signature, which by its very nature is short, so efficiency issues have negligible effect (and the improvements made to this particular regex are merely academic).
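The point about the closing tag can be seen by asking preg_match() which text it actually matched. This is a small sketch with a hypothetical signature string; the &quot; alternative in the pattern is again an assumption about the original source.

```php
<?php
// Old signature check (the &quot; alternative is an assumption).
$old = '#\[quote(=(&quot;|"|\'|)(.*)\\1)?\]|\[/quote\]|\[code\]|\[/code\]|\[list(=([1a\*]))?\]|\[/list\]#i';

// Hypothetical signature containing a quote with an attribute.
$signature = '[quote=Bob]Hello[/quote]';

// $m[0] receives the text of the first match found in the signature.
$found = preg_match($old, $signature, $m);
echo 'found: ', $found, ', matched text: ', $m[0], "\n";
```

The opening tag is skipped, and the closing tag is what triggers the error.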

REGEX 2: FluxBB 1.4.1078 parser.php lines 74, 535 and 648

This regex occurs in three places in the FluxBB 1.4.1078 parser.php (lines 74, 535 and 648) and in four places in PunBB 1.3.4 (lines 51, 523, 652 and 691) and FluxBB 1.3-legacy (lines 49, 519, 648 and 687). It is used to grab the contents enclosed within a LIST BBCode tag. The tricky part of this one is that the contents of the LIST tag can contain embedded (or nested) LIST tags. To accomplish this feat, the regex takes advantage of the advanced recursive syntax of the PCRE engine (?R). However, if one of the nested LIST tags is of the simplest form, i.e. [list] without any attribute, then this particular regex erroneously fails to match the entire contents of the outer LIST tag, and returns an "unbalanced" subset portion of the contents. As you can see, this is a non-trivial and rather complex regex that is much easier to read (and debug) when presented in free-spacing mode with lots of comments as shown below.

Here is the old regex (FluxBB parser.php version 1.4.1078 line 74):

Old regex:

    '/\[list(?:=([1a\*]))?\]((?>(?:(?!\[list(?:=(?:[1a\*]))\]|\[\/list\]).+?)|(?R))*)\[\/list\]/ems'

Old regex (with comments):

    '/
    \[list                   # match opening bracket and tag name of outermost list tag
    (?:                      # start non-capturing group to match optional list type
      =                      # equals sign delimits list from its type attribute
      (                      # capture outermost list type into group 1
        [1a\*]               # list type is 1=numeric, a=alpha or *=bulleted
      )                      # end capture group 1
    )?                       # end optional non-capturing group
    \]                       # match closing bracket of outermost opening list tag
    (                        # capture contents of list tag in group 2
      (?>                    # atomic group to capture either list contents or whole nested list
        (?:                  # start (unnecessary) non-capture group
          (?!                # match position that is not followed by either an...
            \[list           # opening bracket and tag name of nested opening list tag
            (?:              # start non-capturing group to match type (NOT OPTIONAL = ERROR!)
              =              # equals sign delimits list from its attribute
              (?:            # start (unnecessary does nothing) non-capture group
                [1a\*]       # list type is 1=numeric, a=alpha or *=bulleted
              )              # end (unnecessary) non-capture group
            )                # end non-capturing group (NOT OPTIONAL = ERROR!)
            \]               # match closing bracket of opening list tag
          |                  # or...
            \[\/list\]       # a closing LIST tag
          )                  # end negative lookahead assertion
          .+?                # lazily match one or more of anything (effectively always just one)
        )                    # end (unnecessary) non-capture group
      |                      # or...
        (?R)                 # recursively match a whole nested LIST element
      )*                     # as many times as necessary until deepest nested LIST tag grabbed
    )                        # end capturing contents of list tag into group 2
    \[\/list\]               # match outermost closing list tag
    /xems'

Critique of old regex:
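The unbalanced capture is easy to reproduce. The sketch below is an illustration only: the sample text and variable names are hypothetical, and the "e" modifier has been dropped from the pattern because PHP 7 and later no longer accept it.

```php
<?php
// Old LIST regex, with the "e" modifier removed so that preg_match() accepts
// the pattern on current PHP (an assumption made for this sketch).
$old_list = '/\[list(?:=([1a\*]))?\]((?>(?:(?!\[list(?:=(?:[1a\*]))\]|\[\/list\]).+?)|(?R))*)\[\/list\]/ms';

// Hypothetical input: an outer list holding a nested attribute-free [list].
$text = '[list][*]a[list][*]b[/list][/list]';

preg_match($old_list, $text, $old_m);
echo 'whole match: ', $old_m[0], "\n"; // stops at the first [/list]
echo 'contents:    ', $old_m[2], "\n"; // nested [list] opened but never closed
```

The lookahead only guards against [list=...] and [/list], so the nested [list] is consumed one character at a time as ordinary content and the recursion is never reached.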

Here is the fixed regex (JMR version 2010-01-08):

Note: Update 08-Jan-2010: This regex has been updated to fix a memory hog problem involving the negative lookahead sub expression. The new regex below uses a lazy-dot-star combined with a positive lookahead at the end, wrapped inside an atomic group to achieve the same result. This one should also be a bit faster.

Fixed regex:

    '%\[list(?:=([1a*]))?+\]((?:(?>.*?(?=\[list(?:=[1a*])?+\]|\[/list\]))|(?R))*)\[/list\]%ise'

Fixed regex (with comments):

    '%
    \[list                 # match opening bracket and tag name of outermost list tag
    (?:                    # start non-capturing group to match optional list type
      =                    # equals sign delimits list from its attribute
      (                    # capture outermost list type into group 1
        [1a*]              # list type is 1=numeric, a=alpha or *=bulleted
      )                    # end capture group 1
    )?+                    # end optional non-capturing group
    \]                     # match closing bracket of outermost opening list tag
    (                      # capture contents of list tag in group 2
      (?:                  # non capture group for either contents or whole nested list
        (?>                # atomically grab contents up to a [list*] or [/list]
          .*?              # lazily grab everything up to the next [list*] or [/list]
          (?=              # match position that is followed by either an...
            \[list         # opening bracket and tag name of nested opening list tag
            (?:            # start non-capturing group to match optional list type
              =[1a*]       # list type is 1=numeric, a=alpha or *=bulleted
            )?+            # end non-capturing group for optional list type attribute
            \]             # match closing bracket of opening list tag
          |                # or...
            \[/list\]      # a closing LIST tag
          )                # end positive lookahead assertion (we are not on a list tag)
        )                  # end atomic group
      |                    # or...
        (?R)               # recursively match a whole nested LIST element
      )*                   # as many times as necessary until deepest nested LIST tag grabbed
    )                      # end capturing contents of list tag into group 2
    \[/list\]              # match outermost closing list tag
    %isxe'

Description of fixed version enhancements:

Here is a link to a PHP script which demonstrates the error in this regex and how the fixed version corrects it: Test_FluxBB1.4.1078_parser_line_74_20100108_1100.zip
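For readers without the archive at hand, the following short sketch (not the script from the zip file) exercises the fixed regex on two nested lists. The input strings and variable names are hypothetical, and the "e" modifier is omitted because PHP 7 and later no longer accept it.

```php
<?php
// Fixed LIST regex, with the "e" modifier removed (assumption: current PHP).
$fixed_list = '%\[list(?:=([1a*]))?+\]((?:(?>.*?(?=\[list(?:=[1a*])?+\]|\[/list\]))|(?R))*)\[/list\]%is';

// Hypothetical inputs: a nested attribute-free list, and a nested typed list.
$nested = '[list][*]a[list][*]b[/list][/list]';
$typed  = '[list=1][*]x[list=a][*]y[/list][/list]';

preg_match($fixed_list, $nested, $m_plain);
preg_match($fixed_list, $typed, $m_typed);

echo 'plain contents: ', $m_plain[2], "\n";
echo 'typed contents: ', $m_typed[2], ' (outer type ', $m_typed[1], ")\n";
```

In both cases group 2 holds the complete, balanced contents of the outer list, and group 1 keeps the type of the outer list rather than that of the nested one.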

Summary of REGEX 2 analysis

Other than the error, the original regex is not half bad. Fixing the error, cleaning up the matching of the LIST contents, and adding comments is all that is required.

Correction 2010-01-08: Actually, the old regex was pretty bad. It had a massive memory hog bug which could cause an Apache Internal Server error when PHP was run as CGI rather than an Apache module. The new regex effectively fixes this problem.

Summary

The FluxBB forum software is good stuff, but some of the regular expressions recently added and/or modified in the core parser script are not as good as they should/could be. In addition to styling and efficiency issues, some have outright errors. This article analyzed and optimized the first two regexes in the parser, both of which had significant issues (although the improvements presented here have little effect on the overall performance and functionality of the software as a whole). With relatively complex regexes such as these, the source code should define them using the free-spacing regex modifier and provide verbose comments which describe each sub expression, as is presented here. These comments provide documentation and allow the programmer/reader to see just how the thing actually works. The critique presented in this article is certainly not meant to be any sort of personal criticism of the regex authors, but rather is intended to help make the FluxBB/PunBB forum software (which is already pretty good), even better.
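As an illustration of that recommendation, a commented regex might be defined in PHP source along the following lines. This is a sketch only: signature_has_disallowed_tags() is a hypothetical helper name, not a function in FluxBB, and the pattern is the simplified signature regex from REGEX 1.

```php
<?php
// Hypothetical helper, not part of FluxBB: returns true when $text contains
// a tag that is not allowed in a signature.
function signature_has_disallowed_tags($text)
{
    // Comments inside the pattern must avoid the % delimiter and any
    // unescaped single quote, since either would end the pattern early.
    $re = '%
        \[                    # literal opening bracket
        /?                    # optional slash (closing tag)
        (?:quote|code|list)   # tags not allowed in a signature
        \b                    # tag name must be a whole word
        [^\]]*+               # any tag attributes
        \]                    # literal closing bracket
        %ix';
    return preg_match($re, $text) === 1;
}

var_dump(signature_has_disallowed_tags('Regards, [b]Jeff[/b]'));
var_dump(signature_has_disallowed_tags('[list][*]one[/list]'));
```

The x modifier makes PCRE ignore the whitespace and the # comments, so the commented form should behave the same as the one-line form while remaining readable in the source file.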

Writing a really good regular expression, one which matches precisely what you want and not what you don't want, is both (black) art and science. A well-crafted, accurate and efficient regex can reap significant rewards in CPU time savings (and $$$), particularly when the regex is run a lot (such as those in the core of PHP forum software on a popular board). It is thus well worth the effort to take extra time to really optimize these low-level PHP workhorses. However, writing a fully optimized regex takes quite a bit of time and effort (studying and practicing) for the programmer to develop the necessary skills. To achieve these skills, this author strongly recommends reading (and studying): Mastering Regular Expressions - 3rd Edition by Jeffrey Friedl.
