ZTree Regular Expressions

By Jeff Roberson
Created: 2007-Dec-19
Edited: 2009-Apr-19
Revision History

Table of Contents

  1. Update: March 2009
  2. Background
  3. RE Primer, Quick Start and Tutorial
  4. Proposed Requirements for ZTreeWin RE Implementation
  5. The PCRE C Regular Expression Library
  6. ZTree RE User Interface - the Details
  7. A ZTree Regular Expression Example
  8. Why not just use ZAAP?
  9. Miscellaneous Thoughts and Comments on Past Threads
  10. Summary
  11. Forum Archive RE References
  12. And Last But Not Least...

Update: March 2009

More than a year has passed since I first submitted this article to the ZTree community. In the meantime, I've been using regexs more and more, to the point that I now don't know how I could get by without them! In fact, I now understand what Jeffrey Friedl describes in his book as: thinking in regular expressions. And speaking of Jeff, I finished reading his book and then went and read it again! This is, hands down, the most useful book I have ever read. A few months back I needed to do a mass file renaming operation that could only be accomplished using a regex in ZTree. Alas, I had to rename the files one at a time, which was time consuming. There were other times when I wanted to tag a set of files based on filename where only a regex could do the job, and once again I was out of luck. I still love ZTree and it's still one of my overall favorite apps of all time, but I strongly believe (now more than ever), that adding PCRE regex support to the ZTW core would turn it into the ultimate file management utility.

The article below is updated, but has not changed a whole lot (see the revision history page for details). One big change is that I did remove all suggestions to use the Boost C++ Regex Library. (After briefly corresponding with Kim I found out that ZTree is written in straight C and not C++.) Additionally, I learned that the PCRE C library is The library of choice. Also, I've updated all the links from here to the ZTree forum archives. The recent forum upgrade busted all the old links - (Note that I used a regex search and replace to quickly and effortlessly fix all the links!) Most of all, I'd just like to reiterate my encouragement to Kim to catch the "regex bug" like I did and then incorporate this extremely powerful feature into ZTree. And in that spirit, I'm going to go send Kim another $10 donation for his efforts, just as soon as I get this new revision posted.

<shameless-plug> In the last year, I have also discovered the wonderful regex related products offered by "Just Great Software" (all written by Jan Goyvaerts, the same one-man-show who created and maintains the Regular-Expressions.info site). The RegexBuddy program helps you compose regexs by providing syntax highlighting and a verbose human readable description for each and every sub-component of any regex. (And this description can be exported to HTML/TXT/Clipboard to provide very slick documentation like this example.) It also provides real-time pattern matching feedback by highlighting any text test data - you see right away what matches and what doesn't as you type in a regex pattern. The PowerGrep program is undoubtedly the most powerful Windows search and replace tool ever. Period. This tool is just Awesome (but it doesn't do the regex file renaming that I hope ZTree will someday be able to do). And the third program I bought from JGsoft is EditPad Pro, which has become my go-to text editor (displacing UltraEdit to a back seat position). Its handling of regular expressions for search and replace, and its user editable syntax highlighting and code navigation schemes, are exceptional. And all these programs (which have lots of other cool features I don't have space to mention here) are fast, have a very small footprint, have completely non-obtrusive install/uninstall procedures and have extensive accurate documentation - (just like ZTree!). Can you tell that I love this software? </shameless-plug> Ok enough of that - go learn regular expressions and check out JGsoft.

Jeff Roberson 27-Mar-09

Background

Two years ago I had only a vague notion about what "Regular Expressions" were and I rationalized them to be nothing more than another mysterious peculiarity conjured up "over there" in unixland. In my ignorance, I spent a long time (more than a decade) ignoring them (sound familiar?). As it happens, these past couple of years I've been focusing on learning new skills in the realm of web applications development: (X)HTML, CSS, Javascript, PHP, Ajax, MySQL and Apache. In my studies, one term kept popping up over and over: Regular Expressions. Curiosity got the better of me and I finally broke down and purchased: "Mastering Regular Expressions (3rd Edition 2006)" by Jeffrey Friedl. One year ago, after working through the first two chapters (which teach all the basic syntax), I found out what all the fuss is about. Frankly, regular expressions (or regex (rhymes with "FedEx") or just RE for short) are so simple, elegant, powerful and downright usable, that I can't believe it took me so long to learn them (and I'm kicking myself for waiting so long).

Regular Expressions consist of a concise, mini programming language to process text. When you first see a non-trivial regex out in the wild, it appears to be a bunch of cryptic gibberish. But once you learn the basic syntax you find out that all those complex looking expressions are actually composed of a series of small, simple constructs all strung together. All I can say is that if you don't already know them, do yourself a favor and spend a couple hours and learn the basics - that's all the time it takes to get started. The basic syntax is quite simple to pick up, and the rewards come quickly. While researching this article, I spent quite a bit of time searching the web for a really decent tutorial on regular expressions and came up with a list of a few good ones (see next section). To intelligently discuss this topic in a constructive manner, one really needs to be familiar with at least the fundamentals. So if you are new to regular expressions, please invest a bit of time to learn the basic syntax. (I can 99.5% guarantee that you will be glad that you did!)

RE Primer, Quick Start and Tutorial

Proposed Requirements for ZTreeWin RE Implementation

  1. Backward Compatibility: RE implementation shall not change, impact or interfere with current ZTree behavior in any way. Everything that you are used to being able to do, you can still do the same way as before. All keystrokes and character mnemonics retain their exact same meaning and behavior. The new RE features are seamlessly and non-obtrusively added into the existing interface to provide progressive enhancement to those power users that wish to take advantage of them.
  2. Minimum Impact: The changes required to the ZTree user interface should be minimized. By simply adding one new character glyph, a special RE quote delimiter (REQD) and associated keystroke sequence, RE capability can be incorporated throughout the ZTree interface without having to add any special new RE Mode switches or configuration options. The addition of RE will be transparent with no conflicts to existing behavior. (See discussion below for how this can work.) Also, adding RE capabilities to ZTree should not adversely affect its performance. By implementing all the RE engine functions in a separate DLL (and using delayed loading), the ZTree startup time would not be affected at all. However, there would be a short pause the first time an RE search is performed while the DLL is loaded.
  3. Maximum Power: The most powerful regex implementation "flavor" should be chosen for ZTree to maximize the flexibility of the user's searching expression. This would be the de facto standard Perl style feature set. Regular expressions give lots of power to the user so they should be applied to as many ZTree components as practical/feasible. The following come to mind (Note that in this article, keystroke sequences are enclosed between single quotes):
    • 'F' = Filespecs filter
    • 'ALT+T,F' = Tag by Filespec
    • 'R' = Rename file
    • 'CTRL+R' = Rename tagged files
    • 'V,F' = Viewer Find/search
    • 'CTRL+S' = Search tagged files
    • 'CTRL+S, CTRL+R' = Search and Replace in tagged files ;^)
    From a practical standpoint, these capabilities can be added one at a time. The first one I would vote for is: 'CTRL+R' = Rename tagged files (or possibly: 'CTRL+S' = Search tagged files).
  4. Uniform ZTree RE syntax: The RE implementation syntax should be consistent throughout the ZTree user interface. Just one new additional keystroke combination (the RE quote delimiter, or "REQD" for short), will be used to begin and end each RE sub-string for all RE enhanced components. Once again, no new RE switches, toggles, modes or configuration options are required.
  5. Industry Standard RE Syntax: This would be the Perl style syntax. All characters between the new REQD delimiters will follow the standard RE language syntax. i.e. If you know Perl RE, you know ZTree RE. (... and PHP RE, Javascript RE, .NET RE, java RE, Python RE, Tcl RE, MySQL RE, Apache RE, etc.)
  6. Ergonomic and Streamlined Interface: Within the ZTreeWin interface, the RE sub-strings need to be visually identifiable as being RE so that you can readily see them in the history lists and easily distinguish them from standard ZTree, non-RE search patterns. This can be achieved by assigning a visually unique glyph to the new REQD character, something like this: '●' (see discussion below). From the user's perspective, typing in an RE sub-string will be very similar to how the 'ALT+[' and 'ALT+]' keystrokes and their associated glyphs currently work for Filespec Filters (i.e. ◄ABC►). Entering RE sub-strings should follow the [XZ]Tree philosophy and require minimal keystrokes to get the job done.
  7. Integration with existing ZTree Filespec Filters: Currently, ZTree allows multiple filespec sub strings to be combined using OR logic (for positive name specs) and AND NOT logic (for negative name specs) for file names. To extend this schema, the user should be able to add one (or more) regex sub-strings in combination with ZTree format filespecs. e.g. You can happily combine a date and size spec along with an RE name spec and a ZTree name spec. Also, with filespec filters, the NOT prefix operator: "-" could be applied to RE sub strings as well.
  8. Keep it simple stupid! (KISS): It needs to be easy for Kim to implement RE into the very heart of ZTreeWin in a simple, methodical and reliable way so as not to disrupt the program's stability. In other words, it should not require a major overhaul to existing code. The PCRE C regex library is a free, well written, rock-solid and thoroughly tested bit of code and would be relatively easy and quick to implement. However, it would require Kim to first learn regular expressions to a pretty deep level, and this will take quite a bit of time.
  9. Install a "Safety" on this powerful Regex weapon: Regular expressions are very powerful. (Perhaps too powerful for a naive user.) A "weapon" like this needs to have redundant safety mechanisms, particularly when a "replace" operation is performed on multiple files (to either the file names or their contents). Thus, the default ZTree behavior should always prompt the user for confirmation before making any changes to the file system. Each prompt should show both the "before" and "after" strings. (This is exactly the way ZTree works right now for operations such as the 'CTRL+R' Rename Tagged Files command.) When using regexs for non-replacement operations (such as setting filespecs or simple searching through file contents), no hard safety mechanisms are required. However, in either case a poorly designed regex can sometimes take literally forever to complete (due to catastrophic backtracking). When this happens (and it will happen), it will hang ZTree, making it unresponsive. Kim could run the search operation in a separate thread of execution and implement a timeout to alleviate this potential problem. And globally, it is probably a good idea to have a configuration option (YACO) to turn ZTree regular expression functionality ON and OFF - in which case the default should be OFF. There should also be a strong CAUTION message in the documentation.
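
The "before" and "after" confirmation described in item 9 can be sketched in a few lines. This is only an illustration written in Python (ZTree itself is C); the function name preview_renames and the sample file names are hypothetical and not part of any existing ZTree interface.

```python
import re

def preview_renames(names, pattern, replacement):
    """Return (before, after) pairs for the names the regex would change.

    Nothing on disk is touched; the caller shows the pairs to the user
    and performs the renames only after explicit confirmation.
    """
    rx = re.compile(pattern)
    pairs = []
    for name in names:
        new_name = rx.sub(replacement, name, count=1)
        if new_name != name:
            pairs.append((name, new_name))
    return pairs

if __name__ == "__main__":
    sample = ["draft_01.txt", "notes.txt"]  # hypothetical file names
    for before, after in preview_renames(sample, r"^draft_", "final_"):
        print(f"{before} -> {after}")
```

Separating the preview from the actual rename keeps a bad pattern from ever reaching the file system unreviewed.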

The PCRE C Regular Expression Library

For applications written in C, such as ZTree, the obvious (no-brainer) choice for a regex engine is the PCRE C library (see: http://www.pcre.org/). This library is rock-solid and is thoroughly tested - it is the same library that is used internally by the Apache web server and the PHP scripting language (which are currently running much if not most of the internet). In fact, at this very moment, millions of computers are likely running C code within the PCRE library, serving up dynamic (PHP), clean-URL (.htaccess) web pages to millions of web surfers around the world. Thus, there is absolutely no question about the quality, maturity and reliability of this library. It is fast, powerful and it implements a very rich set of regex features. It is also open source and is free to be used in any application, including commercial ones such as ZTree. And it supports UNICODE! However, on the downside, the PCRE library may be somewhat difficult to implement since it does not have any built-in replace functions and has a lower level interface requiring knowledge of how the RE engine works. Kim would need to spend quite a bit of time to become familiar with the mechanics and syntax of working with its API. To help with this effort, I am currently working on writing an in-depth article on how to set up, compile and use the PCRE library with Visual C version 6.
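
To give a feel for what "no built-in replace functions" means in practice, here is a rough sketch of how a replace can be assembled from a match-only primitive that merely reports where a match starts and ends. It is written in Python for brevity, with Python's re module standing in for the engine; replace_all is a hypothetical helper and does not mirror the PCRE API.

```python
import re

def replace_all(pattern, replacement, subject):
    """Build a global replace out of repeated match-only searches.

    The engine is only asked where the next match begins and ends; the
    caller copies the unmatched text and splices in the replacement.
    The replacement is treated as literal text (no $1 expansion here).
    """
    rx = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(subject):
        m = rx.search(subject, pos)
        if m is None:
            break
        out.append(subject[pos:m.start()])  # text before the match
        out.append(replacement)
        if m.end() == m.start():
            # Empty match: copy one character and step past it,
            # otherwise the loop would never advance.
            out.append(subject[m.start():m.start() + 1])
            pos = m.start() + 1
        else:
            pos = m.end()
    out.append(subject[pos:])  # whatever follows the last match
    return "".join(out)
```

Expanding captured groups in the replacement string would be one more piece of bookkeeping layered on top of this loop.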

ZTree RE User Interface - the Details

Although there are many "flavors" of regular expression implementations out there in the wild, (and a "Standardized" RE syntax has yet to fully emerge), many if not most modern tools (Perl, PHP, Javascript, .NET, java, Python, Tcl, MySQL, Apache, Ruby etc.) have been gravitating towards the feature rich Perl style syntax. But all these modern implementations do share many base operators and syntax (i.e. character classes, metacharacters, alternation, grouping, backreferences, repetition, anchors and lookaround). In other words, they all share a very large common denominator. The differences between these RE "flavors" appear in the more esoteric complex operators, the modifier options and each tool's specific implementation details. The Perl style syntax is both powerful and ubiquitous and I would strongly suggest we closely adhere to this well established precedent. Before getting to the ZTreeWin RE specific syntax, a quick review of commonly used Perl syntax is in order:

Searching: /(cat|sat|fat)/

When performing a simple search, the regex search pattern string is typically delimited by a pair of forward slash characters (/). Everything between the delimiting slashes (underlined in red) is the regular expression search pattern itself. These characters are written by the user (You and I) in the language of regular expressions, and they have very precise and specific meaning to the RE engine, but these characters have no meaning whatsoever for the host program (in our case: ZTreeWin). As far as ZTree is concerned, this regex package is a "Black Box" which is simply delivered to the regex engine function along with a target string to be matched. ZTreeWin does not need to speak "RE". It is nothing more than a messenger - the regex engine does all the heavy lifting.

Replacing: /(cat|dog|mouse)/animal/

When performing a search and replace operation, three slashes are used and the replacement string is placed between the second and third slashes (see above). Once again the regex search pattern (underlined in red) is between the first and second slash, and the replacement string (underlined in blue) is sandwiched between the second and third slash. This regex finds the first match of "cat", "dog" or "mouse", then replaces it with "animal". Note that groups (in parentheses) in the search pattern are captured and placed into variables: $1, $2, $3, etc., which can then be placed anywhere in the replacement string. This is very handy and is demonstrated in the example below.

Modifiers: /(cat|dog|mouse)/animal/ig

In addition to the RE search pattern string and replacement string, the regular expression engine allows the user to specify modification flags which affect the behavior of the string processing. These Modifiers are specified by one or more single characters immediately following the last slash and can appear in any order. For example the "i" modifier tells the engine to ignore the case of the input string. If not set, the search is case sensitive by default. The "g" modifier tells the engine to globally find and replace all matches within the target string with the replacement string. If not set, only the first match is replaced by default. The RE modifiers (underlined in green) follow the last slash. In this search and replace operation, all occurrences of "cat", "dog" and "mouse" are replaced with "animal".
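
The effect of the "i" and "g" modifiers can be tried out with a short sketch. Python is used here purely for illustration: it has no trailing-modifier syntax, so "g" corresponds to leaving the replacement count unlimited and "i" to the re.IGNORECASE flag. The sample sentence is made up.

```python
import re

text = "Cat chased dog; dog chased MOUSE."
pattern = r"(cat|dog|mouse)"

# No modifiers: case sensitive, only the first match is replaced.
first_only = re.sub(pattern, "animal", text, count=1)

# "i" and "g" together: ignore case and replace every match.
all_matches = re.sub(pattern, "animal", text, flags=re.IGNORECASE)

print(first_only)   # Cat chased animal; dog chased MOUSE.
print(all_matches)  # animal chased animal; animal chased animal.
```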

At first, very few (if any) of these user specified modifiers would need to be incorporated into the ZTree implementation because a default set of behaviors would be sufficient for most of our needs to start with. Some of the modifiers we may well wish to incorporate (particularly with 'CTRL+S' file searching), are the previously mentioned "i" and "g" modifiers as well as the "m" = multiple line mode and "s" = single line mode modifiers. (For file search operations, the ignore case option is already provided as a switch in ZTree's search dialog boxes, so this modifier would not need to be specified in the RE package string.) Note that (IMHO) incorporating these various search engine modifier/options into the ZTree interface may become kind of tricky and will likely be the biggest task of implementation. However, a wise choice for a simple initial set of default behaviors should help speed things along nicely.

ZTree RE Syntax: ●(cat|dog|mouse)●animal●ig

We could easily implement this same Perl syntax (which is spoken by many far and wide) throughout ZTreeWin by simply assigning one new unique character, (the RE Quote Delimiter, or simply: "REQD" for short), which behaves exactly like the slash from the previous paragraphs. This new REQD character will have a unique appearance and keystroke sequence. As a suggestion I would propose using a graphic glyph such as: "●" (which is Unicode U+25CF). To type in this new RE delimiting character, I would suggest we simply use the 'ALT' keystroke modifier combined with the single/double quote keyboard key '"': i.e. 'ALT+QUOTE'. Thus, from the user's perspective, a regex is entered by simply enclosing it within the special new delimiter characters. Then the ZTree parser takes this whole regex package string and chops it up into its component parts (i.e. pattern, replacement and modifiers), then passes them on to the regex library functions.

Note that this RE syntax for a replace operation is very similar to the current ZTree implementation for the Rename command, the only difference being that the ZTree syntax uses three quotes and the RE syntax uses three REQD chars. Also, the modifiers would have different meanings.
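
The "chopping up" of a regex package into pattern, replacement and modifiers could look something like the following sketch. It is written in Python for illustration only; parse_re_package is a hypothetical name, and the sketch assumes the delimiter never appears inside the pattern or the replacement (no escaping rule is proposed in this article).

```python
REQD = "\u25cf"  # the proposed delimiter glyph, U+25CF

def parse_re_package(package):
    """Split a REQD-delimited package into (pattern, replacement, modifiers).

    A search-only package has two delimiters and yields replacement=None;
    a search-and-replace package has three.
    """
    if not package.startswith(REQD):
        raise ValueError("package must start with the REQD delimiter")
    parts = package[1:].split(REQD)
    if len(parts) == 2:   # delimiter, pattern, delimiter, modifiers
        pattern, modifiers = parts
        return pattern, None, modifiers
    if len(parts) == 3:   # pattern, replacement, then modifiers
        pattern, replacement, modifiers = parts
        return pattern, replacement, modifiers
    raise ValueError("expected two or three REQD delimiters")

if __name__ == "__main__":
    print(parse_re_package("\u25cf(cat|dog|mouse)\u25cfanimal\u25cfig"))
```

The host program never has to interpret the pattern itself; it only has to find the delimiters and hand the pieces on.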

A ZTree Regular Expression Example

You may recall, back in 2006 Laurent Duchastel presented a challenge in this thread titled: "[Discuss] Interesting challenge". The goal was to rename a bunch of files having a format like this: "LLLL_DDMMYY_X-X.JPG", so that the new names would come out like this: "LLLL_YYYYMMDD_X-X.JPG". This problem has a Y2K wrench because some of the files are in the twentieth century (with numbers like '85 and '99), and some files have a year in the twenty first century (with numbers like '02 and '06). Yes, it turns out that ZTree was able to tackle this problem using some creative techniques which required multiple steps. But in the same thread Ian Binnie demonstrated in this post that the problem could be easily solved using two regex search and replace operations. I have also studied this problem and have come up with two solutions to the problem using regexs; one simple with less robust filespec matching which is described below, and another more complex solution with more rigorous filename matching and enhanced functionality (and this one is thoroughly described in this text file). So if regular expressions were implemented as proposed above, ZTree would be able to solve this puzzle in one whack of the 'CTRL+R' "Rename tagged files" command by piping two regexs together. Here is the simple, non-strict solution, followed by a blow by blow account of what's going on in each of the two regexs (from the RE engine's perspective):

●_(\d\d)(\d\d)(\d\d)_●_20$3$2$1_●|●_20([1-9])●_19$1●
  1. First regex: Match an underscore followed by a first group of two digits (captured as $1) followed by a second group of two digits (captured as $2) followed by a third group of two digits (captured as $3) followed by an underscore. Replace this matched sub-string with an underscore followed by a '2' followed by a '0' followed by the third captured group of digits followed by the second captured group of digits followed by the first captured group of digits followed by an underscore. (See the RegexBuddy description here.)
  2. Second regex: Match an underscore followed by a '2' followed by a '0' followed by a group of one digit from '1' to '9' (captured as $1). Replace this matched sub-string with an underscore followed by a '1' followed by a '9' followed by the first (and only) captured group of one digit. (See the RegexBuddy description here.)
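
The two substitutions can be tried outside ZTree with a small sketch. Python is used for illustration, so the replacement strings use \1, \2, \3 where the Perl style syntax uses $1, $2, $3. The function name rename_ddmmyy and the sample file names are made up.

```python
import re

def rename_ddmmyy(name):
    """Apply the two substitutions in sequence, each at most once."""
    # First regex: _DDMMYY_ becomes _20YYMMDD_
    name = re.sub(r"_(\d\d)(\d\d)(\d\d)_", r"_20\3\2\1_", name, count=1)
    # Second regex: a year whose first digit is 1-9 is moved to the 1900s
    name = re.sub(r"_20([1-9])", r"_19\1", name, count=1)
    return name

if __name__ == "__main__":
    for sample in ["ABCD_251285_1-2.JPG", "ABCD_010206_3-4.JPG"]:
        print(sample, "->", rename_ddmmyy(sample))
```

As the second regex shows, this simple version treats every two-digit year from 10 to 99 as belonging to the twentieth century, which is part of what makes it the non-strict solution.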

A note about substitution variables... Each set of parentheses within a regex pattern string defines a group, which captures its contents into a temporary variable. Groups can be nested. The variables are assigned in order (from 1 to 9) and the counting is incremented with each new left parenthesis. These variables can then be used in both the regex pattern string itself, and in the replacement string (which was demonstrated in the previous example). But the syntax is a little tricky. To specify a variable in the regex search pattern, the: "\1", "\2", "\3"... syntax is used, while in the replacement string, the "$1", "$2", "$3"... syntax is used. As an example of a regex that uses both, let's solve a common grammar error: the doubling doubling of words within a sentence. You can easily find all doubled words separated by whitespace and correct the problem (i.e. delete the second word and the whitespace between them), with a find and replace regex something like this: ●\b(\w+)\s+\1●$1●g. Note that the "\b" is a special word boundary metacharacter, the "\w" is a special word metacharacter (equivalent to "[a-zA-Z0-9_]"), the "\s" is a special whitespace metacharacter, and the "+" is a metacharacter that says: "one or more of the preceding character". You can probably now see how all these substitution variables can be used in some very powerful ways.
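
Here is the doubled-word fix as a runnable sketch. It uses Python for illustration, where the group is written \1 in both the pattern and the replacement string (Python has no $1 form), and where every match is replaced by default, which corresponds to the "g" modifier. The function name fix_doubled_words is made up.

```python
import re

def fix_doubled_words(text):
    # \1 in the pattern must repeat exactly what group 1 captured;
    # the replacement keeps a single copy of the word.
    return re.sub(r"\b(\w+)\s+\1", r"\1", text)

if __name__ == "__main__":
    print(fix_doubled_words("the doubling doubling of words"))
```

One caveat with the pattern as written: since nothing follows the "\1", it would also match a word followed by a longer word that starts with the same letters (such as "the theory"). Adding a second "\b" after the "\1" would restrict it to true doubles.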

Why not just use ZAAP?

In this post, Kim suggested someone could implement a third party regex engine and plug it into ZTree using the ZAAP interface. After playing around with rpVT and zbarspy, and looking into the specification of the interface (zbar.dat), I can see that ZAAP truly provides a very powerful and flexible interface into the mind of ZTree (well... part of its mind anyway). Yes, a separate synchronized Assistant Application (AA) program could be called on to perform operations (RE or otherwise) on either a single file ('Y'), or multiple tagged files ('CTRL-Y'). And yes, a separate Regular Expression Assistant Application Program (REAAP) could be designed to perform regex operations like searches, replaces and file renaming, and could also reply back to ZTW with "TAG" or "UNTAG" responses (i.e. the REAAP program could untag files within ZTree). Although this would actually work, in comparison with a native built-in RE interface, it would be kinda clunky with some major drawbacks (speed, resources, ergonomics and functionality) as follows:

In summary... Although it would work, the ZTW/REAAP combination approach to implementing regular expressions, is more complex, is more resource hungry for RAM, CPU cycles, processes and disk I/O, is fatter and slower, will likely be much more prone to bugs and less stable (and more difficult to debug), is less efficient, has fewer features and requires extra keystrokes. Writing an entirely new REAAP program would require much more work than simply upgrading ZTree. On the other hand, a natively RE enhanced ZTree would be simpler, smaller, faster, more ergonomic, more efficient, more powerful and more... shall I say, elegant?

Miscellaneous Thoughts and Comments on Past Threads

Summary

As you may have surmised, I have spent a lot of time thinking about this subject (and working on this article). I may be new to the ZTree forum, but I've been deeply involved in low-level programming for more than 30 years, and since 1991, XTree/ZTree has been my all time favorite (and most useful) tool. Like many of you, I use ZTree every day from sun up to sun down (with many thousands of hours behind the wheel) and I feel naked when working on a Windows box that doesn't have it. My recent discovery of the extreme utility of regular expressions led me to think about how they could supercharge the power of ZTree and take it to a whole 'nother level. Although the ZAAP interface and the F9 menus are very powerful and do allow for adding some RE functionality, I strongly believe that only by embedding Perl style regular expressions into the very heart of ZTree, can we achieve truly earth shattering improvements to the most powerful and useful file manager program on Earth. Furthermore, I believe that doing so will turn out to be surprisingly easy. So all those in favor, say: AYE! Thank you for listening. 'Nuf said.

Forum Archive RE References

Note: Before bringing this topic up, I did (quite) a bit of digging into the archives, so I'm familiar with most of what's already been said on the subject. And for your convenience and viewing pleasure, here is a list of links to all the threads containing posts having: "regular expression" or "regex" in their titles...

And Last But Not Least...

Once again, Thank You Kim Henkel! for giving us ZTree! We all owe you a deep and heartfelt thanks. In fact, I'm going to go over to www.ztree.com right now and give you a $10 donation to show my gratitude. Again... Many Thanks and Peace on Earth.
