By Jeff Roberson
Created: 2007-Dec-19
Edited: 2009-Apr-19
Revision History
More than a year has passed since I first submitted this article to the ZTree community. In the meantime, I've been using regexs more and more, to the point that I now don't know how I could get by without them! In fact, I now understand what Jeffrey Friedl describes in his book as: thinking in regular expressions. And speaking of Jeff, I finished reading his book and then went and read it again! This is, hands down, the most useful book I have ever read. A few months back I needed to do a mass file renaming operation that could only be accomplished using a regex in ZTree. Alas, I had to rename the files one at a time, which was time consuming. There were other times when I wanted to tag a set of files based on filename where only a regex could do the job, and once again I was out of luck. I still love ZTree and it's still one of my overall favorite apps of all time, but I strongly believe (now more than ever), that adding PCRE regex support to the ZTW core would turn it into the ultimate file management utility.
The article below is updated, but has not changed a whole lot (see the revision history page for details). One big change is that I did remove all suggestions to use the Boost C++ Regex Library. (After briefly corresponding with Kim I found out that ZTree is written in straight C and not C++.) Additionally, I learned that the PCRE C library is The library of choice. Also, I've updated all the links from here to the ZTree forum archives. The recent forum upgrade busted all the old links - (Note that I used a regex search and replace to quickly and effortlessly fix all the links!) Most of all, I'd just like to reiterate my encouragement to Kim to catch the "regex bug" like I did and then incorporate this extremely powerful feature into ZTree. And in that spirit, I'm going to go send Kim another $10 donation for his efforts, just as soon as I get this new revision posted.
<shameless-plug> In the last year, I have also discovered the wonderful regex related products offered by "Just Great Software" (all written by Jan Goyvaerts, the same one-man-show who created and maintains the Regular-Expressions.info site). The RegexBuddy program helps you compose regexs by providing syntax highlighting and a verbose human readable description for each and every sub-component of any regex. (And this description can be exported to HTML/TXT/Clipboard to provide very slick documentation like this example.) It also provides real-time pattern matching feedback by highlighting any text test data - you see right away what matches and what doesn't as you type in a regex pattern. The PowerGrep program is undoubtedly the most powerful Windows search and replace tool ever. Period. This tool is just Awesome (but it doesn't do the regex file renaming that I hope ZTree will someday be able to do). And the third program I bought from JGsoft is EditPad Pro, which has become my go-to text editor of choice (displacing UltraEdit to a back seat position). Its handling of regular expressions for search and replace, and user editable syntax highlighting and code navigation schemes is exceptional. And all these programs (which have lots of other cool features I don't have space to mention here) are fast, have a very small footprint, have completely non-obtrusive install/uninstall procedures and have extensive accurate documentation - (just like ZTree!). Can you tell that I love this software? </shameless-plug> Ok enough of that - go learn regular expressions and check out JGsoft.
Jeff Roberson 27-Mar-09
Two years ago I had only a vague notion about what "Regular Expressions" were and I rationalized them to be nothing more than another mysterious peculiarity conjured up "over there" in unixland. In my ignorance, I spent a long time (more than a decade) ignoring them (sound familiar?). As it happens, these past couple of years I've been focusing on learning new skills in the realm of web applications development: (X)HTML, CSS, JavaScript, PHP, Ajax, MySQL and Apache. In my studies, one term kept popping up over and over: Regular Expressions. Curiosity got the better of me and I finally broke down and purchased: "Mastering Regular Expressions (3rd Edition 2006)" by Jeffrey Friedl. One year ago, after working through the first two chapters (which teach all the basic syntax), I found out what all the fuss is about. Frankly, regular expressions (or regex (rhymes with "FedEx") or just RE for short) are so simple, elegant, powerful and downright usable, that I can't believe it took me so long to learn them (and I'm kicking myself for waiting so long).
Regular Expressions consist of a concise, mini programming language to process text. When you first see a non-trivial regex out in the wild, it appears to be a bunch of cryptic gibberish. But once you learn the basic syntax you find out that all those complex looking expressions are actually composed of a series of small, simple constructs all strung together. All I can say is that if you don't already know them, do yourself a favor and spend a couple hours and learn the basics - that's all the time it takes to get started. The basic syntax is quite simple to pick up, and the rewards come quickly. While researching this article, I spent quite a bit of time searching the web for a really decent tutorial on regular expressions and came up with a list of a few good ones (see next section). To intelligently discuss this topic in a constructive manner, one really needs to be familiar with at least the fundamentals. So if you are new to regular expressions, please invest a bit of time to learn the basic syntax. (I can 99.5% guarantee that you will be glad that you did!)
For applications written in C, such as ZTree, the obvious (no-brainer) choice for a regex engine is the PCRE C library (see: http://www.pcre.org/). This library is rock-solid and is thoroughly tested - it is the same library that is used internally by the Apache web server and the PHP scripting language (which are currently running much if not most of the internet). In fact, at this very moment, millions of computers are likely running C code within the PCRE library, serving up dynamic (PHP), clean-URL (.htaccess) web pages to millions of web surfers around the world. Thus, there is absolutely no question about the quality, maturity and reliability of this library. It is fast, powerful and it implements a very rich set of regex features. It is also open source and is free to be used in any application, including commercial ones such as ZTree. And it supports UNICODE! However, on the downside, the PCRE library may be somewhat difficult to integrate since it does not have any built-in replace functions and has a lower level interface requiring knowledge of how the RE engine works. Kim would need to spend quite a bit of time to become familiar with the mechanics and syntax of working with its API. To help with this effort, I am currently working on writing an in-depth article on how to set up, compile and use the PCRE library with Visual C version 6.
Although there are many "flavors" of regular expression implementations out there in the wild, (and a "Standardized" RE syntax has yet to fully emerge), many if not most modern tools (Perl, PHP, JavaScript, .NET, Java, Python, Tcl, MySQL, Apache, Ruby etc.) have been gravitating towards the feature rich Perl style syntax. But all these modern implementations do share many base operators and syntax (i.e. character classes, metacharacters, alternation, grouping, backreferences, repetition, anchors and lookaround). In other words, they all share a very large common denominator. The differences between these RE "flavors" appear in the more esoteric complex operators, the modifier options and each tool's specific implementation details. The Perl style syntax is both powerful and ubiquitous and I would strongly suggest we closely adhere to this well established precedent. Before getting to the ZTreeWin RE specific syntax, a quick review of commonly used Perl syntax is in order:
When performing a simple search, the regex search pattern string is typically delimited by a pair of forward slash characters (/). Everything between the delimiting slashes (underlined in red) is the regular expression search pattern itself. These characters are written by the user (you and me) in the language of regular expressions, and they have very precise and specific meaning to the RE engine, but these characters have no meaning whatsoever for the host program (in our case: ZTreeWin). As far as ZTree is concerned, this regex package is a "Black Box" which is simply delivered to the regex engine function along with a target string to be matched. ZTreeWin does not need to speak "RE". It is nothing more than a messenger - the regex engine does all the heavy lifting.
When performing a search and replace operation, three slashes are used and the replacement string is placed between the second and third slashes (see above). Once again the regex search pattern (underlined in red) is between the first and second slash, and the replacement string (underlined in blue) is sandwiched between the second and third slash. This regex finds the first match of "cat", "dog" or "mouse", then replaces it with "animal". Note that groups (in parentheses) in the search pattern are captured and placed into variables: $1, $2, $3, etc., which can then be placed anywhere in the replacement string. This is very handy and is demonstrated in the example below.
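To make the first-match behavior concrete, here is a minimal sketch in Python, whose `re` module stands in for the regex engine. The sample sentence is made up, and writing the pattern as `(cat|dog|mouse)` is my assumed reading of the example; note that Python spells the Perl variable `$1` as `\1` in the replacement string.

```python
import re

text = "The cat chased the mouse past the dog."

# count=1 mimics the default Perl behavior: only the first match is replaced.
first_only = re.sub(r"(cat|dog|mouse)", "animal", text, count=1)
print(first_only)

# The captured group can be reused in the replacement string.
# Python writes \1 where Perl-style syntax writes $1.
tagged = re.sub(r"(cat|dog|mouse)", r"animal (\1)", text, count=1)
print(tagged)
```

Only "cat" is touched in both calls; "mouse" and "dog" are left alone because the search stops after the first match.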
In addition to the RE search pattern string and replacement string, the regular expression engine allows the user to specify modification flags which affect the behavior of the string processing. These Modifiers are specified by one or more single characters immediately following the last slash and can appear in any order. For example the "i" modifier tells the engine to ignore the case of the input string. If not set, the search is case sensitive by default. The "g" modifier tells the engine to globally find and replace all matches within the target string with the replacement string. If not set, only the first match is replaced by default. The RE modifiers (underlined in green) follow the last slash. In this search and replace operation, all occurrences of "cat", "dog" and "mouse" are replaced with "animal".
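The effect of the "i" and "g" modifiers can be sketched in Python as follows. This is only an illustration under assumptions: Python has no trailing modifier letters, so `flags=re.IGNORECASE` plays the role of "i", and `count=1` versus the default `count=0` plays the role of a missing or present "g". The sample text is invented.

```python
import re

text = "Cat, dog, mouse and CAT"
pattern = r"cat|dog|mouse"

# No modifiers: case sensitive, first match only.
plain = re.sub(pattern, "animal", text, count=1)

# "g" only: every case-sensitive match is replaced.
g_only = re.sub(pattern, "animal", text)

# "i" only: case is ignored, but still just the first match.
i_only = re.sub(pattern, "animal", text, count=1, flags=re.IGNORECASE)

# "gi": all matches, ignoring case.
g_and_i = re.sub(pattern, "animal", text, flags=re.IGNORECASE)

for label, result in [("", plain), ("g", g_only), ("i", i_only), ("gi", g_and_i)]:
    print(f"{label:>2}: {result}")
```

Without "i", the capitalized "Cat" and "CAT" are never matched, so the first case-sensitive hit is "dog".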
At first, very few (if any) of these user specified modifiers would need to be incorporated into the ZTree implementation because a default set of behaviors would be sufficient for most of our needs to start with. Some of the modifiers we may well wish to incorporate (particularly with 'CTRL+S' file searching), are the previously mentioned "i" and "g" modifiers as well as the "m" = multiple line mode and "s" = single line mode modifiers. (For file search operations, the ignore case option is already provided as a switch in ZTree's search dialog boxes, so this modifier would not need to be specified in the RE package string.) Note that (IMHO) incorporating these various search engine modifier/options into the ZTree interface may become kind of tricky and will likely be the biggest task of implementation. However, a wise choice for a simple initial set of default behaviors should help speed things along nicely.
We could easily implement this same Perl syntax (which is spoken by many far and wide) throughout ZTreeWin by simply assigning one new unique character, (the RE Quote Delimiter, or simply: "REQD" for short), which behaves exactly like the slash from the previous paragraphs. This new REQD character will have a unique appearance and keystroke sequence. As a suggestion I would propose using a graphic glyph such as: "●" (which is a Unicode #25CF). To type in this new RE delimiting character, I would suggest we simply use the 'ALT' keystroke modifier combined with the single/double quote keyboard key '"': i.e. 'ALT+QUOTE'. Thus, from the user's perspective, a regex is entered by simply enclosing it within the special new delimiter characters. Then the ZTree parser takes this whole regex package string and chops it up into its component parts (i.e. pattern, replacement and modifiers), then passes them on to the regex library functions.
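The "chopping up" step could look something like the following Python sketch. The function name `parse_re_package` is hypothetical, and the sketch assumes the simplest possible rule: the REQD character never appears inside the pattern or replacement, so no escaping is handled. ZTree itself is written in C; Python is used here only to show the idea.

```python
REQD = "\u25cf"  # the proposed RE Quote Delimiter glyph (Unicode #25CF)

def parse_re_package(package, delim=REQD):
    """Split a delimited regex package into (pattern, replacement, modifiers).

    A search package has two delimiters: <d>pattern<d>modifiers
    A replace package has three:         <d>pattern<d>replacement<d>modifiers
    replacement is None for a plain search.
    """
    if not package.startswith(delim):
        raise ValueError("package must start with the delimiter")
    parts = package[len(delim):].split(delim)
    if len(parts) == 2:
        pattern, modifiers = parts
        return pattern, None, modifiers
    if len(parts) == 3:
        pattern, replacement, modifiers = parts
        return pattern, replacement, modifiers
    raise ValueError("expected two or three delimiters")

print(parse_re_package(REQD + "cat|dog|mouse" + REQD + "animal" + REQD + "gi"))
print(parse_re_package(REQD + r"\d+" + REQD))
```

The three returned pieces are exactly what would be handed on to the regex library; the host program never has to interpret the pattern itself.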
Note that this RE syntax for a replace operation is very similar to the current ZTree implementation for the Rename command, the only difference being that the ZTree syntax uses three quotes and the RE syntax uses three REQD chars. Also the modifiers would have different meanings.
You may recall, back in 2006 Laurent Duchastel presented a challenge in this thread titled: "[Discuss] Interesting challenge". The goal was to rename a bunch of files having a format like this: "LLLL_DDMMYY_X-X.JPG", so that the new names would come out like this: "LLLL_YYYYMMDD_X-X.JPG". This problem has a Y2K wrench because some of the files are in the twentieth century (with numbers like '85 and '99), and some files have a year in the twenty-first century (with numbers like '02 and '06). Yes, it turns out that ZTree was able to tackle this problem using some creative techniques which required multiple steps. But in the same thread Ian Binnie demonstrated in this post that the problem could be easily solved using two regex search and replace operations. I have also studied this problem and have come up with two solutions to the problem using regexs; one simple with less robust filespec matching which is described below, and another more complex solution with more rigorous filename matching and enhanced functionality (and this one is thoroughly described in this text file). So if regular expressions were implemented as proposed above, ZTree would be able to solve this puzzle in one whack of the 'CTRL+R' "Rename tagged files" command by piping two regexs together. Here is the simple, non-strict solution, followed by a blow by blow account of what's going on in each of the two regexs (from the RE engine's perspective):
●_(\d\d)(\d\d)(\d\d)_●_20$3$2$1_●|●_20([1-9])●_19$1●
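For anyone who wants to try the two piped regexs today, here is a rough Python equivalent. The function name `rename_ddmmyy` and the sample file names are made up for illustration; `\g<3>` is simply Python's spelling of `$3`, and `count=1` reflects the first-match-only default since no "g" modifier is given.

```python
import re

def rename_ddmmyy(name):
    # First regex: swap DDMMYY into YYMMDD and assume the 21st century.
    step1 = re.sub(r"_(\d\d)(\d\d)(\d\d)_", r"_20\g<3>\g<2>\g<1>_", name, count=1)
    # Second regex: a year whose decade digit is 1-9 (e.g. '85, '99)
    # really belongs to the 20th century, so turn 20 into 19.
    step2 = re.sub(r"_20([1-9])", r"_19\g<1>", step1, count=1)
    return step2

for old in ["ABCD_150685_1-2.JPG", "ABCD_010206_1-2.JPG"]:
    print(old, "->", rename_ddmmyy(old))
```

The second step leaves years '00 through '09 in the 2000s because the character class `[1-9]` does not match a decade digit of 0.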
A note about substitution variables... Each set of parentheses within a regex pattern string defines a group, which captures its contents into a temporary variable. Groups can be nested. The variables are assigned in order (from 1 to 9) and the counting is incremented with each new left parenthesis. These variables can then be used in both the regex pattern string itself, and in the replacement string (which was demonstrated in the previous example). But the syntax is a little tricky. To specify a variable in the regex search pattern, the: "\1", "\2", "\3"... syntax is used, while in the replacement string, the "$1", "$2", "$3"... syntax is used. As an example of a regex that uses both, let's solve a common grammar error: the doubling doubling of words within a sentence. You can easily find all doubled words separated by whitespace and correct the problem (i.e. delete the second word and the whitespace between them), with a find and replace regex something like this: ●\b(\w+)\s+\1●$1●g. Note that the "\b" is a special word boundary metacharacter, the "\w" is a special word metacharacter (equivalent to "[a-zA-Z0-9_]"), and the "\s" is a special whitespace metacharacter, and the "+" is a metacharacter that says: "one or more of the preceding character". You can probably now see how all these substitution variables can be used in some very powerful ways.
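The doubled-word fix can be tried out with a short Python sketch. The test sentences are invented. The second pattern, with a trailing `\b`, is my own tweak rather than part of the regex given above: without it, a word followed by a longer word that starts with the same letters (such as "the theory") also gets collapsed.

```python
import re

sentence = "the doubling doubling of words within a a sentence"

# The pattern as given: \1 in the pattern, and \1 (Perl's $1) in the replacement.
# All matches are replaced, which corresponds to the "g" modifier.
fixed = re.sub(r"\b(\w+)\s+\1", r"\1", sentence)
print(fixed)

# Caveat: the loose pattern also matches a word plus the start of the next word.
loose = re.sub(r"\b(\w+)\s+\1", r"\1", "the theory")
# Assumed refinement: a trailing \b requires the repeat to be a whole word.
strict = re.sub(r"\b(\w+)\s+\1\b", r"\1", "the theory")
print(repr(loose), repr(strict))
```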
In this post, Kim suggested someone could implement a third party regex engine and plug it into ZTree using the ZAAP interface. After playing around with rpVT and zbarspy, and looking into the specification of the interface (zbar.dat), I can see that ZAAP truly provides a very powerful and flexible interface into the mind of ZTree (well... part of its mind anyway). Yes, a separate synchronized Assistant Application (AA) program could be called on to perform operations (RE or otherwise) on either a single file ('Y'), or multiple tagged files ('CTRL-Y'). And yes, a separate Regular Expression Assistant Application Program (REAAP) could be designed to perform regex operations like searches, replaces and file renaming, and could also reply back to ZTW with "TAG" or "UNTAG" responses (i.e. the REAAP program could untag files within ZTree). Although this would actually work, in comparison with a native built-in RE interface, it would be kinda clunky with some major drawbacks (speed, resources, ergonomics and functionality) as follows:
In summary... Although it would work, the ZTW/REAAP combination approach to implementing regular expressions, is more complex, is more resource hungry for RAM, CPU cycles, processes and disk I/O, is fatter and slower, will likely be much more prone to bugs and less stable (and more difficult to debug), is less efficient, has fewer features and requires extra keystrokes. Writing an entirely new REAAP program would require much more work than simply upgrading ZTree. On the other hand, a natively RE enhanced ZTree would be simpler, smaller, faster, more ergonomic, more efficient, more powerful and more... shall I say, elegant?
As you may have surmised, I have spent a lot of time thinking about this subject (and working on this article). I may be new to the ZTree forum, but I've been deeply involved in low-level programming for more than 30 years, and since 1991, XTree/ZTree has become my all-time favorite (and most useful) tool. Like many of you, I use ZTree every day from sun up to sun down, (with many thousands of hours behind the wheel) and I feel naked when working on a Windows box that doesn't have it. My recent discovery of the extreme utility of regular expressions led me to think about how they could supercharge the power of ZTree and take it to a whole 'nother level. Although the ZAAP interface and the F9 menus are very powerful and do allow for adding some RE functionality, I strongly believe that only by embedding Perl style regular expressions into the very heart of ZTree, can we achieve truly earth shattering improvements to the most powerful and useful file manager program on Earth. Furthermore, I believe that doing so will turn out to be surprisingly easy. So all those in favor, say: AYE! Thank you for listening. 'Nuf said.
Note: Before bringing this topic up, I did (quite) a bit of digging into the archives, so I'm familiar with most of what's already been said on the subject. And for your convenience and viewing pleasure, here is a list of links to all the threads containing posts having: "regular expression" or "regex" in their titles...
Once again, Thank You Kim Henkel! for giving us ZTree! We all owe you a deep and heartfelt thanks. In fact, I'm going to go over to www.ztree.com right now and give you a $10 donation to show my gratitude. Again... Many Thanks and Peace on Earth.