Parikh's theorem

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language.^[1] It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar.^[2] It was first proved by Rohit Parikh in 1961^[3] and republished in 1966.^[4]

Definitions and formal statement

Let $\Sigma =\{a_{1},a_{2},\ldots ,a_{k}\}$ be an alphabet. The Parikh vector of a word $w$ is $p(w)$, where the function ${\textstyle p:\Sigma ^{*}\to \mathbb {N} ^{k}}$ is given by^[1] $p(w)=(|w|_{a_{1}},|w|_{a_{2}},\ldots ,|w|_{a_{k}})$ and $|w|_{a_{i}}$ denotes the number of occurrences of the letter $a_{i}$ in the word $w$ .
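As a minimal sketch of this definition, the Parikh vector can be computed by counting letters. The function name `parikh_vector` and the choice to represent the ordered alphabet as a string are assumptions made for illustration only.

```python
from collections import Counter


def parikh_vector(word, alphabet):
    """Return the Parikh vector of `word` for the ordered `alphabet`.

    `alphabet` is a sequence of distinct one-character symbols; its order
    fixes the order of the coordinates (an illustrative convention).
    """
    counts = Counter(word)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"letters outside the alphabet: {sorted(unknown)}")
    # A Counter returns 0 for letters that do not occur in the word.
    return tuple(counts[a] for a in alphabet)


print(parikh_vector("abba", "ab"))  # (2, 2)
print(parikh_vector("ba", "ab"))    # (1, 1): the order of letters is ignored
```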

A subset of $\mathbb {N} ^{k}$ is said to be linear if it is of the form $u_{0}+\mathbb {N} u_{1}+\dots +\mathbb {N} u_{m}=\{u_{0}+t_{1}u_{1}+\dots +t_{m}u_{m}\mid t_{1},\ldots ,t_{m}\in \mathbb {N} \}$ for some vectors ${\textstyle u_{0},\ldots ,u_{m}}$ in $\mathbb {N} ^{k}$ . A subset of $\mathbb {N} ^{k}$ is said to be semi-linear if it is a union of finitely many linear subsets.
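To make the two definitions concrete, the following sketch tests membership of a vector in a linear or semi-linear set by exhaustive search. The names `in_linear_set` and `in_semilinear_set` are hypothetical, and the code assumes every vector has non-negative integer entries, as in the definition; it is meant for small examples, not as an efficient algorithm.

```python
def in_linear_set(v, base, periods):
    """Decide whether v = base + t_1*periods[0] + ... + t_m*periods[m-1]
    for some non-negative integers t_i.

    Assumes all vectors lie in N^k (non-negative entries).
    """
    target = tuple(x - b for x, b in zip(v, base))
    if any(x < 0 for x in target):
        return False
    # Zero periods contribute nothing, and dropping them guarantees that
    # the loop below terminates.
    periods = [p for p in periods if any(p)]

    def search(rest, i):
        if all(x == 0 for x in rest):
            return True
        if i == len(periods):
            return False
        current = rest
        # Try t_i = 0, 1, 2, ... while the remainder stays non-negative.
        while all(x >= 0 for x in current):
            if search(current, i + 1):
                return True
            current = tuple(x - y for x, y in zip(current, periods[i]))
        return False

    return search(target, 0)


def in_semilinear_set(v, linear_parts):
    """`linear_parts` is a finite list of (base, periods) pairs."""
    return any(in_linear_set(v, base, periods) for base, periods in linear_parts)


print(in_linear_set((3, 3), (0, 0), [(1, 1)]))  # True
print(in_linear_set((3, 2), (0, 0), [(1, 1)]))  # False
```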

Theorem—Let $L$ be a context-free language or a regular language, let $P(L)$ be the set of Parikh vectors of words in $L$ , that is, ${\textstyle P(L)=\{p(w)\mid w\in L\}}$ . Then $P(L)$ is a semi-linear set.

Conversely, if $S$ is any semi-linear set, then there exists a regular language (which a fortiori is context-free) whose set of Parikh vectors is $S$ .

In short, the sets of Parikh vectors of context-free languages and of regular languages coincide: in both cases they are exactly the semi-linear sets.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.
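As an illustration, the context-free language $\{a^{n}b^{n}\mid n\geq 0\}$ and the regular language $(ab)^{*}$ both have the set of Parikh vectors $\{(n,n)\mid n\geq 0\}$ , so they are commutatively equivalent. The sketch below checks this on all words up to a length bound; the helper names and the bound are illustrative assumptions, and a bounded check is of course not a proof.

```python
import re
from itertools import product

ALPHABET = "ab"


def parikh(word):
    """Parikh vector of `word` over the fixed alphabet (a, b)."""
    return tuple(word.count(a) for a in ALPHABET)


def is_anbn(word):
    """Membership test for the context-free language { a^n b^n }."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


def words_up_to(max_len):
    """Yield every word over ALPHABET of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            yield "".join(letters)


def parikh_images(max_len):
    """Parikh vectors of both languages, restricted to short words."""
    context_free, regular = set(), set()
    for w in words_up_to(max_len):
        if is_anbn(w):
            context_free.add(parikh(w))
        if re.fullmatch(r"(ab)*", w):
            regular.add(parikh(w))
    return context_free, regular


cf, reg = parikh_images(8)
print(cf == reg)  # True
```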

Proof

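The first part, that $P(L)$ is semi-linear for every context-free language $L$ , is the harder direction. The reader is referred to Goldstine's simplified proof.^[5]

The second part, that every semi-linear set is the set of Parikh vectors of some regular language, is easy to prove, as follows.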

Given a semi-linear set $S$ , we construct a regular language whose set of Parikh vectors is $S$ .

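$S$ is a union of finitely many (possibly zero) linear sets. The empty language is regular, a finite union of regular languages is regular, and the set of Parikh vectors of a union of languages is the union of their sets of Parikh vectors. It therefore suffices to prove that any linear set is the set of Parikh vectors of a regular language.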

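Let $S=\{u_{0}+t_{1}u_{1}+\dots +t_{m}u_{m}\mid t_{1},\ldots ,t_{m}\in \mathbb {N} \}$ . Then $S$ is the set of Parikh vectors of the regular language $\{z_{0}\}\cdot (\cup _{i=1}^{m}\{z_{i}\})^{*}$ , where each $z_{i}$ is any fixed word with Parikh vector $u_{i}$ . Indeed, a word of this language consists of $z_{0}$ followed by some number $t_{i}$ of copies of each $z_{i}$ in some order, so its Parikh vector is $u_{0}+t_{1}u_{1}+\dots +t_{m}u_{m}$ , and every choice of $t_{1},\ldots ,t_{m}$ occurs.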
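The construction in this proof can be sketched as a regular expression. In the sketch below, the word chosen for a vector $(n_{1},\ldots ,n_{k})$ is $a_{1}^{n_{1}}\cdots a_{k}^{n_{k}}$ ; any other word with the same Parikh vector would do. The function names are hypothetical, and the output uses the syntax of Python's `re` module.

```python
import re


def word_for(vector, alphabet):
    """One word with the given Parikh vector: a_1^{n_1} ... a_k^{n_k}."""
    return "".join(letter * count for letter, count in zip(alphabet, vector))


def regex_for_linear_set(base, periods, alphabet):
    """Regular expression z_0 (z_1 | ... | z_m)* for the linear set
    base + N*periods[0] + ... + N*periods[m-1].
    """
    z0 = re.escape(word_for(base, alphabet))
    # Zero periods add nothing to the set, so they are skipped; this also
    # avoids empty alternatives under the star.
    words = [re.escape(word_for(u, alphabet)) for u in periods if any(u)]
    if not words:
        return z0
    return z0 + "(?:" + "|".join(words) + ")*"


pattern = regex_for_linear_set((1, 0), [(1, 1), (0, 2)], "ab")
print(pattern)                                # a(?:ab|bb)*
print(bool(re.fullmatch(pattern, "aabbb")))   # True: (2, 3) = (1, 0) + (1, 1) + (0, 2)
print(bool(re.fullmatch(pattern, "ab")))      # False: (1, 1) is not in the set
```

A semi-linear set would then correspond to the alternation of the expressions obtained for its linear parts.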

Strengthening for bounded languages

A language $L$ is bounded if $L\subseteq w_{1}^{*}\ldots w_{k}^{*}$ for some fixed words $w_{1},\ldots ,w_{k}$ . Ginsburg and Spanier^[6] gave a necessary and sufficient condition, similar to Parikh's theorem, for a bounded language to be context-free.

Call a linear set stratified if, in its definition, each vector $u_{i}$ with $i\geq 1$ has at most two non-zero coordinates, and for all $i,j\geq 1$ such that $u_{i}$ and $u_{j}$ each have exactly two non-zero coordinates, at positions $i_{1}<i_{2}$ and $j_{1}<j_{2}$ respectively, the positions do not interleave as $i_{1}<j_{1}<i_{2}<j_{2}$ . A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Ginsburg-Spanier—A bounded language $L$ is context-free if and only if $\{(n_{1},\ldots ,n_{k})\mid w_{1}^{n_{1}}\ldots w_{k}^{n_{k}}\in L\}$ is a stratified semi-linear set.
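The stratification condition can be checked mechanically for a given list of period vectors $u_{1},\ldots ,u_{m}$ . The sketch below does this; the function names are hypothetical. Note that it tests one particular representation of a linear set only: a `False` result does not by itself show that the set has no stratified representation.

```python
from itertools import permutations


def support(vector):
    """Positions of the non-zero coordinates of `vector`."""
    return [i for i, x in enumerate(vector) if x != 0]


def is_stratified(periods):
    """Check the stratification condition for the period vectors
    u_1, ..., u_m of one representation of a linear set.
    """
    supports = [support(u) for u in periods]
    # Each period may have at most two non-zero coordinates.
    if any(len(s) > 2 for s in supports):
        return False
    # Periods with exactly two non-zero coordinates must not interleave.
    pairs = [s for s in supports if len(s) == 2]
    for (i1, i2), (j1, j2) in permutations(pairs, 2):
        if i1 < j1 < i2 < j2:
            return False
    return True


# Exponent vectors of a^n b^n c^m d^m: side by side, stratified.
print(is_stratified([(1, 1, 0, 0), (0, 0, 1, 1)]))  # True
# Exponent vectors of a^n b^m c^m d^n: nested, stratified.
print(is_stratified([(1, 0, 0, 1), (0, 1, 1, 0)]))  # True
# Exponent vectors of a^n b^m c^n d^m: interleaved, not stratified.
print(is_stratified([(1, 0, 1, 0), (0, 1, 0, 1)]))  # False
```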

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a regular language. It can also be used to show that some context-free languages admit only ambiguous grammars. Such languages are called inherently ambiguous languages. From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.
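The singleton-alphabet claim follows in a few lines from the formal statement; the following worked step is a sketch using only the notation introduced above. Over $\Sigma =\{a\}$ the map $p$ sends $a^{n}$ to $n$ and is therefore a bijection between $\Sigma ^{*}$ and $\mathbb {N}$ . If $L$ is context-free, then $P(L)$ is semi-linear, so there is a regular language $R$ with $P(R)=P(L)$ , and injectivity of $p$ gives

$$L=p^{-1}(P(L))=p^{-1}(P(R))=R.$$

Hence $L$ is regular. Read in the other direction, this also gives non-context-freeness results: for example, $\{a^{q}\mid q{\text{ prime}}\}$ is not context-free, because an infinite semi-linear subset of $\mathbb {N}$ contains an infinite arithmetic progression, whereas the set of primes does not.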

References

[1] Kozen, Dexter (1997). Automata and Computability. New York: Springer-Verlag. ISBN 3-540-78105-6.
[2] Lindqvist, Håkan. "Parikh's theorem" (PDF). Umeå Universitet.
[3] Parikh, Rohit (1961). "Language Generating Devices". Quarterly Progress Report, Research Laboratory of Electronics, MIT.
[4] Parikh, Rohit (1966). "On Context-Free Languages". Journal of the Association for Computing Machinery. 13 (4): 570–581. doi:10.1145/321356.321364. S2CID 12263468.
[5] Goldstine, J. (1977). "A simplified proof of Parikh's theorem". Discrete Mathematics. 19 (3): 235–239. doi:10.1016/0012-365X(77)90103-0. ISSN 0012-365X.
[6] Ginsburg, Seymour; Spanier, Edwin H. (1966). "Semigroups, Presburger formulas, and languages". Pacific Journal of Mathematics. 16 (2): 285–296. doi:10.2140/pjm.1966.16.285.
