Why Regular Expression “Does Not Exist” Tests are Convoluted

In order to perform a “does not exist” test in a regular expression, you have to leverage negative lookaheads (negative lookarounds), if your flavor of regular expressions supports them:

Although this expression seems to works in RegEx Buddy’s .Net evaluation mode, it actually does NOT in production .Net code (call to RegEx.IsMatch) :

Watch movie online The Lego Batman Movie (2017)

…instead, this syntax is valid (works in both .Net and RegEx Buddy):

Watch movie online The Lego Batman Movie (2017)

(See “The best solution is: ^((?!hede).*)$ AND NOT ^((?!hede).)*$“)

If you’re still struggling to figure out what the regular expression means, here it is in plain English:

does not contain abc
and does not contain def
and does not start with ghijklm
and does not start with nopqrst
and does not start with uvwxyz

So one lesson learned: test using your target environment (I got too cozy with testing in RegEx Buddy). RegEx Buddy is a great way to write, test, and debug regular expressions, but in this case, there was a difference in run-time behavior. One way to conduct an ad-hoc test using the .Net regex class (without bringing up Visual Studio) is to invoke it from PowerShell. From the command line (cmd.exe):

The above example tests whether var1 is a valid identifier (variable name). The \b means is a word boundary anchor, so the test is performed on the first word.

So why do we need to use the negative lookaround syntax to test a “not exists” condition in the first place?

From http://www.perlmonks.org/?node_id=588315#588368

“To negate a regex, you convert it to an NFA to a DFA, complement the DFA (invert accept/reject states), and convert that back to a regex. This is basic stuff from a first course in CS theory. The problem is that this is really inefficient. The NFA->DFA step introduces an exponential blowup in size. Even the special case of deciding whether the negation of a regex is the empty regex (the regex that accepts nothing) is PSPACE-complete (that means it’s bad), let alone trying to compute more arbitrary regex negations.”

To appreciate what this means, here’s a general explanation on how regular expression matches are performed:

From http://en.wikipedia.org/wiki/Regular_expression

There are at least three different algorithms that decide if and how a given regular expression matches a string.

The oldest and fastest rely on a result in formal language theory that allows every nondeterministic finite automaton (NFA) to be transformed into a deterministic finite automaton (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(m^2 n). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. Adding caching to the NFA algorithm is often called the “lazy DFA” algorithm, or just the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.

The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. This behavior can cause a security problem called Regular expression Denial of Service.

Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must include some kind of backtracking. Some implementations try to provide the best of both algorithms by first running a fast DFA algorithm, and revert to a potentially slower backtracking algorithm only when a backreference is encountered during the match.

Since there is a lot of hidden potential with yo-yoing and backtracking involved with the negative test, one wild guess is that the .Net regex engine needed some help with pinning the front and end anchors.

An alternative approach would be to test for the positive and negate outside of the regular expression (in code).