Pattern language (formal languages)

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Jochen Burghardt (talk | contribs) at 17:14, 13 March 2014 (Learning patterns: split 2nd alg step). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In theoretical computer science, a pattern language is a formal language that can be defined as the set of all particular instances of a string of constants and variables. Pattern languages were introduced by Dana Angluin in the context of machine learning.[1]

Definition

Given a finite set Σ of constant symbols and a countable set X of variable symbols disjoint from Σ, a pattern is a finite non-empty string of symbols from Σ∪X. The length of a pattern p, denoted by |p|, is the number of its symbols. The set of all patterns containing exactly n distinct variables (each of which may occur several times) is denoted by Pₙ, and the set of all patterns by P*. A substitution is a mapping f: P* → P* such that[note 1]

  • f is a homomorphism with respect to string concatenation (⋅), formally: ∀p,q ∈ P*. f(p⋅q) = f(p)⋅f(q);
  • f is non-erasing, formally: ∀p ∈ P*. f(p) ≠ ε, where ε denotes the empty string; and
  • f respects constants, formally: ∀s∈Σ. f(s) = s.
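
The three conditions imply that a substitution is fully determined by the non-empty images it assigns to the variables. The following is a minimal sketch of that idea, not taken from the cited paper. It assumes that a pattern is stored as a Python string, that every character listed in `constants` is a constant symbol, and that every other character is a variable; the function name `apply_substitution` is hypothetical.

```python
def apply_substitution(pattern, mapping, constants="01"):
    """Apply a substitution, given by the images of its variables, to a pattern.

    `mapping` sends single-character variables to non-empty patterns;
    variables that are not mentioned are mapped to themselves.
    """
    pieces = []
    for symbol in pattern:
        if symbol in constants:
            pieces.append(symbol)             # f respects constants
        else:
            image = mapping.get(symbol, symbol)
            if image == "":
                raise ValueError("a substitution must be non-erasing")
            pieces.append(image)
    # Joining the per-symbol images is exactly the homomorphism property.
    return "".join(pieces)


print(apply_substitution("x0x", {"x": "1y"}))  # -> 1y01y
```

Because the result is built symbol by symbol, applying the function to a concatenation of two patterns gives the concatenation of the two results.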

If p = f(q) for some patterns p, q ∈ P* and some substitution f, then p is said to be less general than q, written p ≤ q; in that case, necessarily |p| ≥ |q| holds. For a pattern p, its language is defined as the set of all less general patterns that are built from constants only, formally: L(p) = { s ∈ Σ+ : s ≤ p }, where Σ+ denotes the set of all finite non-empty strings of symbols from Σ.

For example, using the constants Σ = { 0, 1 } and the variables X = { x, y, z, ... }, the patterns 0x10xx1 ∈ P₁ and xxy ∈ P₂ have length 7 and 3, respectively. Instances of the former pattern are 00z100z0z1 and 01z101z1z1; they are obtained by the substitutions that map x to 0z and to 1z, respectively, and each other symbol to itself. Both 00z100z0z1 and 01z101z1z1 are also instances of xxy. In fact, L(0x10xx1) is a subset of L(xxy). The languages of the patterns x0 and x1 consist of all bit strings of length at least 2 that denote an even and an odd number, respectively. The language of xx is the set of all strings obtainable by concatenating a non-empty bit string with itself, e.g. 00, 11, 0101, 1010, 11101110 ∈ L(xx).
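
Membership of a constant string in L(p) can be checked by searching for a non-erasing assignment of substrings to the variables. The sketch below is a plain backtracking search under the assumption that patterns are Python strings whose characters outside `constants` are variables; the name `matches` is hypothetical, and the worst-case running time is exponential.

```python
def matches(pattern, s, constants="01"):
    """Return True if the constant string s belongs to L(pattern)."""

    def go(i, j, binding):
        # i: position in the pattern, j: position in s
        if i == len(pattern):
            return j == len(s)
        symbol = pattern[i]
        if symbol in constants:
            return j < len(s) and s[j] == symbol and go(i + 1, j + 1, binding)
        if symbol in binding:
            # A repeated variable must be replaced by the same string again.
            image = binding[symbol]
            return s.startswith(image, j) and go(i + 1, j + len(image), binding)
        # First occurrence: try every non-empty image that leaves at least
        # one character for each remaining pattern symbol.
        last_end = len(s) - (len(pattern) - i - 1)
        for end in range(j + 1, last_end + 1):
            binding[symbol] = s[j:end]
            if go(i + 1, end, binding):
                return True
            del binding[symbol]
        return False

    return go(0, 0, {})


print(matches("xx", "11101110"))  # True, with x replaced by 1110
print(matches("xx", "010"))       # False
print(matches("x0", "110"))       # True, 110 denotes an even number
print(matches("x1", "110"))       # False
```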

Properties

Figure: NP-hardness of pattern language membership, by reduction from the NP-complete 1-in-3-SAT problem: Given a CNF of m clauses with n variables, a pattern of length 3n+4m+1 with 2n variables and a string of length 4n+5m+1 can be constructed as shown (m=3 and n=4 in the example). Upper-case variables in the pattern correspond to negated variables in the CNF. The string matches the pattern if and only if an assignment exists such that in each clause exactly one literal is 1 (meaning "true" in the CNF). In the left part, e.g. "0wW0" is matched by "01110" exactly if one of w,W is matched by "1" (corresponding to "false") and the other by "11" (corresponding to "true"), i.e. if w corresponds to the negation of W. In the right part, e.g. "0xYZ0" is matched by "011110" exactly if exactly one of x,Y,Z is matched by "11" and the others by "1", i.e. if exactly one literal corresponds to "true".

The problem of deciding whether s ∈ L(p) for an arbitrary string s ∈ Σ+ and pattern p is NP-complete (see the reduction described above), and hence so is the problem of deciding p ≤ q for arbitrary patterns p, q.[2]
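
The reduction from 1-in-3-SAT described above can be sketched in a few lines. The code below is an illustration, not Angluin's original presentation. It assumes clauses are given as triples of non-zero integers (k for variable k, -k for its negation), uses the letters a to z for positive literals and A to Z for negated ones, and is therefore limited to 26 variables. The helper `in_language` is a hypothetical membership test that translates a pattern into a Python regular expression with named groups and backreferences.

```python
import re


def reduce_1in3sat(clauses, n):
    """Build (pattern, string) from a 1-in-3-SAT instance with n variables."""
    if n > 26:
        raise ValueError("this sketch encodes variables as single letters")

    def sym(literal):
        letter = chr(ord("a") + abs(literal) - 1)
        return letter if literal > 0 else letter.upper()

    # Left part: one block "0vV" per variable, forcing v and V to be
    # replaced by "1" and "11" in some order.
    left = "".join("0" + sym(k) + sym(-k) for k in range(1, n + 1))
    # Right part: one block "0" + three literals per clause.
    right = "".join("0" + "".join(sym(l) for l in clause) for clause in clauses)
    pattern = left + right + "0"
    string = "0111" * n + "01111" * len(clauses) + "0"
    return pattern, string


def in_language(pattern, s, constants="01"):
    """Membership test via a backreference regex (variables must be letters)."""
    seen, parts = set(), []
    for ch in pattern:
        if ch in constants:
            parts.append(re.escape(ch))
        elif ch in seen:
            parts.append(f"(?P={ch})")      # same image as the first occurrence
        else:
            seen.add(ch)
            parts.append(f"(?P<{ch}>.+)")   # non-empty image
    return re.fullmatch("".join(parts), s) is not None


pattern, string = reduce_1in3sat([(1, 1, 2)], 2)
print(pattern)                       # 0aA0bB0aab0
print(string)                        # 01110111011110
print(in_language(pattern, string))  # True: variable 1 false, variable 2 true
```

For the single clause (1, 1, 1) over one variable, no assignment makes exactly one literal true, and correspondingly the constructed string is not in the language of the constructed pattern.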

The class of pattern languages is not closed under ...

  • union: e.g. for Σ = {0,1} as above, L(01)∪L(10) is not a pattern language;
  • complement: Σ+ \ L(0) is not a pattern language;
  • intersection: L(x0y)∩L(x1y) is not a pattern language;
  • Kleene plus: L(0)+ is not a pattern language;
  • homomorphism: f(L(x)) = L(0)+ is not a pattern language, assuming f(0) = 0 = f(1);
  • inverse homomorphism: f⁻¹(111) = { 01, 10, 000 } is not a pattern language, assuming f(0) = 1 and f(1) = 11.

The class of pattern languages is closed under ...

  • concatenation: L(p)⋅L(q) = L(pq);
  • reversal: L(p)^rev = L(p^rev).[3]

The class of pattern languages is incomparable with the class of regular languages and with the class of context-free languages:

  • the pattern language L(xx) is not context-free (hence not regular either) due to the pumping lemma;
  • the non-pattern language L(x0y)∩L(x1y) is produced by the regular (hence also context-free) grammar   S→0A|1B,   A→0A|1C,   B→0C|1B,   C→0|1|0C|1C,   with start symbol S.

If p, q ∈ P₁ are patterns containing exactly one variable, then p ≤ q if and only if L(p) ⊆ L(q); the same equivalence holds for patterns of equal length.[4] For patterns of different length, the above example p = 0x10xx1 and q = xxy shows that L(p) ⊆ L(q) may hold without implying p ≤ q. However, any two patterns p and q, of arbitrary lengths, generate the same language if and only if they are equal up to consistent variable renaming.[5] Each pattern p is a common generalization of all strings in its generated language L(p), modulo associativity of (⋅).
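
Since two patterns generate the same language exactly when they agree up to consistent variable renaming, language equality can be decided by renaming variables in order of first occurrence and comparing the results. The sketch below assumes patterns are Python strings whose characters outside `constants` are variables; the names `canonical` and `same_language` are hypothetical.

```python
def canonical(pattern, constants="01"):
    """Rename variables to x1, x2, ... in order of first occurrence.

    The result is a tuple of tokens, so a variable name such as "x1"
    cannot be confused with a variable followed by a constant.
    """
    names = {}
    tokens = []
    for symbol in pattern:
        if symbol in constants:
            tokens.append(symbol)
        else:
            if symbol not in names:
                names[symbol] = f"x{len(names) + 1}"
            tokens.append(names[symbol])
    return tuple(tokens)


def same_language(p, q, constants="01"):
    """Decide L(p) = L(q) by comparing canonical forms."""
    return canonical(p, constants) == canonical(q, constants)


print(same_language("xxy", "zzw"))  # True
print(same_language("xxy", "xyy"))  # False
```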

Learning patterns

Given a sample set S of strings, a pattern p is called descriptive of S if S ⊆ L(p), but not S ⊆ L(q) ⊂ L(p) for any other pattern q.

Given any sample set S, a descriptive pattern for S can be computed by

  • enumerating all patterns (up to variable renaming) not longer than the shortest string in S,
  • selecting from them the patterns that generate a superset of S,
  • selecting from them the patterns of maximal length, and
  • selecting from them a pattern that is minimal with respect to ≤.[6]
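
The steps above can be sketched as a brute-force search. The code below is an illustration under stated assumptions, not an efficient or authoritative implementation. Patterns are tuples whose constants are characters and whose variables are integers numbered by first occurrence, the sample is a non-empty collection of non-empty strings, and all function names are hypothetical. Instead of enumerating every length first, it tries lengths from the longest downwards and stops at the first length that has candidates, which selects the same patterns of maximal length.

```python
def patterns_of_length(n, constants="01"):
    """Yield every pattern of length n, one per variable-renaming class."""
    def extend(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for c in constants:
            yield from extend(prefix + [c], used)
        for v in range(used + 1):  # reuse a variable or introduce the next one
            yield from extend(prefix + [v], max(used, v + 1))
    yield from extend([], 0)


def instance_of(p, q):
    """True if p = f(q) for some substitution f, i.e. p <= q."""
    def go(i, j, binding):
        if i == len(q):
            return j == len(p)
        symbol = q[i]
        if isinstance(symbol, str):  # constant of q must appear unchanged in p
            return j < len(p) and p[j] == symbol and go(i + 1, j + 1, binding)
        if symbol in binding:
            image = binding[symbol]
            return p[j:j + len(image)] == image and go(i + 1, j + len(image), binding)
        for end in range(j + 1, len(p) - (len(q) - i - 1) + 1):
            binding[symbol] = p[j:end]
            if go(i + 1, end, binding):
                return True
            del binding[symbol]
        return False
    return go(0, 0, {})


def descriptive_pattern(sample, constants="01"):
    """Return a pattern that is descriptive of the given sample."""
    words = [tuple(s) for s in sample]
    for n in range(min(len(w) for w in words), 0, -1):
        # Patterns of length n whose language contains the whole sample.
        candidates = [q for q in patterns_of_length(n, constants)
                      if all(instance_of(w, q) for w in words)]
        for p in candidates:
            # p is minimal if no other candidate is less general than p.
            if not any(q != p and instance_of(q, p) for q in candidates):
                return p
    return None


print(descriptive_pattern(["0101", "1010"]))  # (0, 1, 0, 1), i.e. the pattern xyxy
print(descriptive_pattern(["00", "11"]))      # (0, 0), i.e. the pattern xx
```

The number of candidate patterns grows quickly with the length of the shortest sample string, so this sketch is only practical for very short strings.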

Based on this algorithm, the class of pattern languages can be identified in the limit from positive examples.[7]

Notes

  1. ^ Angluin's notion of substitution differs from the usual notion of string substitution.

References

  1. ^ Dana Angluin (1980). "Finding Patterns Common to a Set of Strings" (PDF). Journal of Computer and System Sciences. 21: 46–62.
  2. ^ Theorem 3.6, p.50; Corollary 3.7, p.52
  3. ^ Theorem 3.10, p.53
  4. ^ Lemma 3.9, p.52; Corollary 3.4, p.50
  5. ^ Theorem 3.5, p.50
  6. ^ Theorem 4.1, p.53
  7. ^ Dana Angluin (1980). "Inductive Inference of Formal Languages from Positive Data" (PDF). Information and Control. 45: 117–135.; here: Example 1, p.125