# Abrazolica

Counting with Regular Expressions

A regular expression is used to define a set of words constructed from a finite alphabet of letters. The set of words so defined is called a regular language. The best way to understand what this means is by example. Let the alphabet consist of four letters $A=\{a,b,c,d\}$. One simple language is all the two letter words that begin with the letter $a$. The regular expression defining the language is: $aa+ab+ac+ad$. The $+$ operator in the expression is analogous to a logic $or$ operator or a set theory union operator. The expression can be simplified a bit by writing it as $a(a+b+c+d)$ which represents the concatenation of $a$ with any of the possible one letter words.

This example shows the two basic operations used to construct regular expressions: union and concatenation. If $R$ and $S$ are regular expressions then $R+S$ and $RS$ are also regular expressions. The simplest regular expression is just a single letter, $R=a$ or $R=\lambda$ where $\lambda$ is an empty string i.e. a string with no letters. Multiple concatenations can be written in a manner analogous to exponentiation, $ababab=(ab)^3$ or if $R=ab$ then $RRR=R^3$. Note that $(ab)^3\ne a^3b^3=aaabbb$.

Another operator used in regular expressions is called the Kleene star. It is defined as follows: $R^*=\lambda + R + R^ 2 +R^3 +\cdots$. A starred regular expression can appear zero or more times concatenated with itself. Using the Kleene star, the regular expression for all possible words is simply $(a+b+c+d)^*$.

A language can have words of many different lengths. The number of words with a given length can be found using something called a generating function which we will denote as $N(z)$. For regular languages $N(z)$ will be either a polynomial or a rational function of two polynomials that can be expanded as a power series. In both cases the coefficient of $z^n$ will equal the number of words that have length $n$. $N(z)$ can be found directly from the regular expression by replacing each alphabet letter by $z$ and replacing each starred expression, $R^*$ by $(1-R(z))^{-1}$ where $R(z)$ is gotten by replacing each alphabet letter in $R$ by $z$. The $+$ operators are then treated as addition operators and concatenation is treated as multiplication. For example the generating function for the language defined by $(a+b)(c+d)^*$ becomes

$N(z)=(z+z)(1-z-z)^{-1}=\frac{2z}{1-2z}$

The following examples show how to use these ideas to solve different kinds of counting problems.

A binary alphabet has only two letters $A=\{a,b\}$. The regular expression for all possible words that can be formed is $(a+b)^*$. Simple combinatorial reasoning says that the number of words of length $n$ must be $2^n$ since there are two choices for each of the $n$ letters. You get the same answer by deriving the generating function from the regular expression. Replace $a$ and $b$ with $z$ and transform the star using the rule given above. This results in the following generating function:

$N(z)=\frac{1}{1-2z}=1+2z+4z^2+8z^3+\cdots$

The coefficient of $z^n$ in the power series expansion is $2^n$, the same answer we got using a combinatorial argument.

The problem can easily be generalized to an alphabet with $m$ letters, $A=\{a_1, a_2, \ldots, a_m\}$. The regular expression for all possible words is $(a_1 + a_2 + \cdots + a_m)^*$. Converting this into a generating function gives:

$N(z)=\frac{1}{1-mz}=1+mz+m^2z^2+m^3z^3+\cdots$

The coefficient of $z^n$ in the power series expansion is $m^n$, a result you can also get using combinatorial reasoning.

Using the same alphabet with $m$ letters, we want to count words that contain exactly $k$ copies of $a_1$. The regular expression for this language is $(a_2+a_3+\cdots +a_m)^*(a_1(a_2+a_3+\cdots +a_m)^*)^k$. Converting this into a generating function gives:

$N(z)=\frac{1}{1-(m-1)z}\left(\frac{z}{1-(m-1)z}\right)^k=\frac{z^k}{(1-(m-1)z)^{k+1}}$

The power series expansion of the generating function is:

$N(z)=\sum_{n=k}^{\infty}\binom{n}{k}(m-1)^{n-k}z^n$

So the number of words of length $n$ with exactly $k$ copies of $a_1$ is $\binom{n}{k}(m-1)^{n-k}$. The combinatorial interpretation is that there are $\binom{n}{k}$ ways to choose which $k$ positions become $a_1$ and there are $m-1$ choices for each of the remaining $n-k$ positions. For the binary alphabet $A=\{a,b\}$, $m=2$ and the number of binary words that contain $a$ exactly $k$ times is $\binom{n}{k}$. For the alphabet $A=\{a,b,c\}$, $m=3$ and the number of words that contain $a$ exactly $k$ times is $\binom{n}{k}2^{n-k}$.

Given the alphabet $A=\{a,b,c,d\}$, how many words of length $n$ contain exactly one $c$ and one $d$? The regular expression for the language is $(a+b)^*(c(a+b)^*d+d(a+b)^*c)(a+b)^*$ and the generating function is:

$N(z)=\frac{1}{1-2z}\frac{2z^2}{1-2z}\frac{1}{1-2z}=\frac{2z^2}{(1-2z)^3}$

In the power series expansion the coefficient of $z^n$ is $\binom{n}{2}2^{n-1}$.

Using the same alphabet, how many words of length $n$ contain exactly two copies each of $c$ and $d$? The words in this language are specified by regular expressions of the form: $(a+b)^*x(a+b)^*x(a+b)^*x(a+b)^*x(a+b)^*$ where two of the four $x$ will equal $c$ and two will equal $d$. The language is completely specified by $\binom{4}{2}=6$ regular expressions of this type. The generating function for each of the regular expressions is given by the above equation for alphabets with $m=3$ letters and $k=4$ copies of a given letter (the alphabet is $A=\{a,b,x\}$ and there are 4 copies of $x$). The generating function for the language is therefore:

$N(z)=\frac{6z^4}{(1-2z)^5}$

Can you show how to generalize this result to a language where each word contains exactly $r$ copies of $c$ and $s$ copies of $d$? Show that the generating function in this case is:

$N(z)=\binom{r+s}{r}\frac{z^{r+s}}{(1-2z)^{r+s+1}}$

Suppose now that we have two alphabets, $A=\{a_1,a_2,\ldots,a_l\}$ with $l$ letters, and $B=\{b_1,b_2,\ldots,b_m\}$ with $m$ letters. We want a language where each word can have any number of letters from $A$ but every letter of $B$ must occur a specified number of times. The letter $b_i$, $i=1,2,\ldots,m$ must occur exactly $j_i$ times in each word for a total of $k=j_1+j_2+\cdots+j_m$ letters from the $B$ alphabet. Let $b$ represent any of the $b_i$ letters, then a regular expression specifying an acceptable word will look like: $(a_1+a_2+\cdots+a_l)^*(b(a_1+a_2+\cdots+a_l)^*)^k$. The entire language will be specified by $\binom{k}{j_1 j_2 \cdots j_m}$ such regular expressions. The generating function for the language is then

$N(z)=\binom{k}{j_1 j_2 \cdots j_m}\frac{z^k}{(1-lz)^{k+1}}$

There are many interesting counting problems involving runs and patterns in words. For example, given the binary alphabet $A=\{a,b\}$, how many words of length $n$ have no runs of $k$ or more $a$'s? The regular expression for a run of less than $k$ $a$'s is $R_k=\lambda+a+a^2+\cdots+a^{k-1}$ and the generating function is

$N_k(z)=1+z+z^2+\cdots+z^{k-1}=\frac{1-z^k}{1-z}$

The regular expression for the language is $(R_kb)^*R_k$ and the generating function is

$N(z)=\frac{N_k(z)}{1-zN_k(z)}=\frac{1-z^k}{1-2z+z^{k+1}}$

We can also look at words with no runs of $k$ or more $a$'s or $b$'s. For no runs of more than a single $a$ or $b$ the regular expression is $(\lambda+a)(ba)^*(\lambda+b)$ and the generating function is

$N(z)=\frac{(1+z)^2}{1-z^2}=\frac{1+z}{1-z}$

If you expand this you find that there are only two words for any length $n$ which you would expect since a word must start with either $a$ or $b$ and then alternate.

For no runs of more than two $a$'s or $b$'s the regular expression is $(\lambda+a+a^2)((b+b^2)(a+a^2))^*(\lambda+b+b^2)$ and the generating function is

$N(z)=\frac{(1+z+z^2)^2}{1-(z+z^2)^2}=\frac{1+z+z^2}{1-z-z^2}$

See if you can show that for general $k$, the generating function is:

$N(z)=\frac{(1+z+z^2+\cdots+z^k)^2}{1-(z+z^2+\cdots+z^k)^2}=\frac{1-z^{k+1}}{1-2z+z^{k+1}}$

These examples only scratch the surface of what you can do with regular expressions in combinatorics.

© 2010-2013 Stefan Hollos and Richard Hollos