Abrazolica

Home Archive Tags About RSS

A Recursive Regular Expression for Matching Regular Expressions

posted on: Sunday, August 18th, 2013, 17:19

Programming languages like Perl and Java add features to regular expressions that make them powerful enough to match strings in a non-regular context free language. The most powerful feature is probably recursion. This is where the regular expression contains a copy of itself. A simple recursive regular expression (RRE) is \(R=aRb|\epsilon\). The variable \(R\) represents the regular expression and it matches strings with zero or more \(a\)'s followed by the same number of \(b\)'s (\(\epsilon\) is the empty string). It looks like a simple language but it can't be recognized by an ordinary regular expression (ORE).

An ORE in simplest form uses only the operations of alternation, concatenation, and Kleene star, in order of lowest to highest precedence. The precedence can be changed using parentheses. Let \(\cup\) be the alternation operator (alternation is a set union operation), let the juxtaposition of symbols represent concatenation, and let \(*\) be the Kleene star operator then a context free grammar for OREs over the alphabet \(\{a,b\}\) is:

\[\begin{aligned} R &\rightarrow T\cup R | T\\ T &\rightarrow FT | F\\ F &\rightarrow X^* | X\\ X &\rightarrow (R) | a | b \end{aligned}\]

This is not a regular grammar which means you can not create an ORE that will recognize all OREs. You can at best create one that will recognize a subset of all possible OREs. But you can create an RRE that will recognize all syntactically correct OREs.

To do that we're going to need different symbols for alternation and Kleene star in the RRE to distinguish them from the same operators in the ORE. Let \(+\) represent alternation in the RRE and let \(\#\) represent Kleene star. Now start with the first production in the grammar. It says that a regular expression can be expressed as a union of any number of terms \(T\), so the first step in constructing the RRE is \(R=(T\cup)^{\#}T\) where \(T\) still needs to be defined. The second production says that each \(T\) can be a concatenation of any number of factors, \(T = F^{\#}F\). Each factor in turn can be a starred or an unstarred expression, \(F=X^*+X\). An expression can be one of the alphabet letters or a parenthesized regular expression, \(X=a+b+(R)\). If we eliminate all of the variables except for \(F\) then the RRE can be defined as

\[\begin{aligned} R &= (F^{\#}F\cup)^{\#}F^{\#}F\\ F &= a+a^*+b+b^*+(R)+(R)^* \end{aligned}\]

This is an RRE because if you substitute \(F\) into the expression for \(R\) you get \(R\) on both sides of the equation. The following shows this converted into a Perl regular expression. It will match any grep style regular expression, i.e. a regular expression that uses \(|\) for alternation, \(*\) for Kleene star, and parentheses for grouping.


$R = qr/((\((??{$R})\)|\((??{$R})\)\*|a|a\*|b|b\*)*(\((??{$R})\)|\((??{$R})\)\*|a|a\*|b|b\*)\|)*(\((??{$R})\)|\((??{$R})\)\*|a|a\*|b|b\*)*(\((??{$R})\)|\((??{$R})\)\*|a|a\*|b|b\*)/

From the definition of the RRE you can find the generating function for the number of syntactically correct regular expressions of a given length. Skipping the derivation, the generating function is

\[\begin{aligned} R(z) &=\frac{1-2z-5z^2-3z^3-\sqrt{z^6+6z^5+13z^4+6z^3-6z^2-4z+1}}{2z^2(1+z)^2}\\ &= 2z+6z^2+22z^3+84z^4+328z^5+1318z^6+5400z^7+\cdots \end{aligned}\]

The function says there are 2 regular expressions of length 1, 6 of length 2, 22 of length 3 and so on.

The 2 regular expressions of length 1 are: \[\begin{matrix} a & b \end{matrix}\]

The 6 regular expressions of length 2 are: \[\begin{matrix} ab & ba & aa & bb & a^* & b^* \end{matrix}\]

The 22 regular expressions of length 3 are: \[\begin{matrix} (a) & (b) & aa^* & ba^* & ab^* & bb^* & a|a & b|a\\ a^*a & b^*a & aaa & baa & aba & bba & a|b & b|b\\ a^*b & b^*b & aab & bab & abb & bbb \end{matrix}\]

The 84 regular expression of length 4 are: \[\begin{matrix} (a^*) & (b^*) & a(a) & b(a) & (aa) & (ba) & a(b) & b(b) & (ab) & (bb) & (a)^*\\ (b)^* & a|a^* & b|a^* & a^*a^* & b^*a^* & aaa^* & baa^* & aba^* & bba^* & a|b^* & b|b^*\\ a^*b^* & b^*b^* & aab^* & bab^* & abb^* & bbb^* & (a)a & (b)a & a^*|a & b^*|a & aa|a\\ ba|a & ab|a & bb|a & aa^*a & ba^*a & ab^*a & bb^*a & a|aa & b|aa & a^*aa & b^*aa\\ aaaa & baaa & abaa & bbaa & a|ba & b|ba & a^*ba & b^*ba & aaba & baba & abba\\ bbba & (a)b & (b)b & a^*|b & b^*|b & aa|b & ba|b & ab|b & bb|b & aa^*b & ba^*b\\ ab^*b & bb^*b & a|ab & b|ab & a^*ab & b^*ab & aaab & baab & abab & bbab & a|bb\\ b|bb & a^*bb & b^*bb & aabb & babb & abbb & bbbb \end{matrix}\]

While these are all syntactically correct, they don't necessarily all make sense. You would never for example write \(a|a\) or \((a)\) even though they are perfectly good regular expressions.

If this sort of thing interests you then you may be interested in our new book Finite Automata and Regular Expressions: Problems and Solutions.