
CDM: Context-Free Grammars

Klaus Sutner

Carnegie Mellon Universality

60-cont-free 2017/12/15 23:17


1 Generating Languages

2 Properties of CFLs


Generation vs. Recognition 3

Turing machines can be used to check membership in decidable sets. They can also be used to enumerate semidecidable sets, whence the classical notion of recursively enumerable sets.

For languages L ⊆ Σ* there is a similar notion of generation.

The idea is to set up a system of simple rules that can be used to derive all words in a particular formal language. These systems are typically highly nondeterministic and it is not clear how to find (efficient) recognition algorithms for the corresponding languages.


Noam Chomsky 4

Historically, these ideas go back to work by Chomsky in the 1950s. Chomsky was mostly interested in natural languages: the goal is to develop grammars that differentiate between grammatical and ungrammatical sentences.

1 The cat sat on the mat.

2 The mat on the sat cat.

Alas, this turns out to be inordinately difficult: syntax and semantics of natural languages are closely connected and very complicated.

But for artificial languages such as programming languages, Chomsky’s approach turned out to be perfectly suited.


Cat-Mat Example 5

Sentence
    Noun Phrase
        Determiner: The
        Noun: cat
    Verb Phrase
        Verb: sat
        Prepositional Phrase
            Preposition: on
            Noun Phrase
                Determiner: the
                Noun: mat
    Punctuation: .


Mat-Cat Example 6

Noun Phrase
    Noun Phrase
        Determiner: The
        Noun: mat
    Prepositional Phrase
        Preposition: on
        Noun Phrase
            Determiner: the
            Adjective: sat
            Noun: cat
    Punctuation: .


Killer App: Programming Languages 7

Many programming languages have a block structure like so:

begin
    begin
    end
    begin
        begin
        end
        begin
        end
    end
end

Clearly, this is not a regular language and cannot be checked by a finite state machine. We need more compute power.
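The need for unbounded memory can be made concrete with a small sketch. The checker below (the function name `well_nested` is hypothetical, and the input is assumed to be already split into begin/end tokens) keeps the current nesting depth in a counter. Since the depth can grow without bound, no fixed finite set of states can replace it.

```python
def well_nested(tokens):
    """Return True if every 'end' closes an earlier unmatched 'begin'."""
    depth = 0  # number of currently open blocks; this value is unbounded
    for tok in tokens:
        if tok == "begin":
            depth += 1
        elif tok == "end":
            depth -= 1
            if depth < 0:  # an 'end' without a matching 'begin'
                return False
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return depth == 0  # all blocks must be closed at the end


# The block structure shown on the slide, as a token sequence.
program = "begin begin end begin begin end begin end end end".split()
print(well_nested(program))
```

This is only an illustration of the counting idea, not a parser for any particular programming language.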


Generalizing 8

We have two rather different ways of describing regular languages:

finite state machine acceptors

regular expressions

We could try to generalize either one of these.

Let’s start with the algebra angle and handle the machine model later.


Grammars 9

Definition

A (formal) grammar is a quadruple

G = 〈V, Σ, P, S〉

where V and Σ are disjoint alphabets, S ∈ V, and P is a finite set of productions or rules.

the symbols of V are (syntactic) variables,

the symbols of Σ are terminals,

S is called the start symbol (or axiom).

We often write Γ = V ∪ Σ for the complete alphabet of G.


Context Free Grammars 10

Definition (CFG)

A context free grammar is a grammar where the productions have the form

P ⊆ V × Γ*

It is convenient to write productions in the form

π : A → α

where A ∈ V and α ∈ Γ*.

The idea is that we may replace A by α.


Naming Conventions 11

A, B, C, … represent elements of V,

S ∈ V is the start symbol,

a, b, c, … represent elements of Σ,

X, Y, Z, … represent elements of Γ,

w, x, y, … represent elements of Σ*,

α, β, γ, … represent elements of Γ*.


Derivations 12

Given a CFG G define a one-step relation ⇒^1 ⊆ Γ* × Γ* as follows:

αAβ ⇒^1 αγβ if A → γ ∈ P

As usual, by induction define

α ⇒^(k+1) β if ∃ γ (α ⇒^k γ ∧ γ ⇒^1 β)

and

α ⇒* β if ∃ k α ⇒^k β

in which case one says that α derives or yields β. α is a sentential form if it can be derived from the start symbol S.

To keep notation simple we’ll often just write α ⇒ β.
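The one-step relation and its iteration can be sketched in a few lines of Python. The representation is an assumption made for illustration: every symbol is a single character, a grammar is a dictionary from variables to lists of right-hand sides, and ε is the empty string. The names `one_step` and `derives_within` are hypothetical, and the grammar S → aSb | ε is just a convenient test case.

```python
def one_step(form, productions):
    """All forms obtained from `form` by applying one production once."""
    results = set()
    for i, symbol in enumerate(form):
        # Terminals have no entry in `productions`, so they are skipped.
        for rhs in productions.get(symbol, []):
            results.add(form[:i] + rhs + form[i + 1:])
    return results


def derives_within(start, target, productions, k):
    """True if `target` is derivable from `start` in at most k steps."""
    frontier = {start}  # forms reachable in at most the number of rounds so far
    for _ in range(k):
        if target in frontier:
            return True
        frontier |= {b for a in frontier for b in one_step(a, productions)}
    return target in frontier


P = {"S": ["aSb", ""]}  # illustrative grammar: S -> aSb | epsilon
print(sorted(one_step("aSb", P)))
print(derives_within("S", "aabb", P, 3))
```

The search is exhaustive and can blow up quickly; it only serves to make the definitions of ⇒^1 and ⇒^k tangible.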


Context Free Languages 13

Definition

The language of a context free grammar G is defined to be

L(G) = { x ∈ Σ* | S ⇒* x }

Thus L(G) is the set of all sentential forms in Σ*. We also say that G generates L(G).

A language is context free (CFL) if there exists a context free grammar that generates it.

Note that in a CFG one can replace a single syntactic variable A by strings over Γ independently of where A occurs; whence the name “context free.” Later on we will generalize to replacement rules that operate on a whole block of symbols (context sensitive grammars).


Example: Regular 14

Let G = 〈 {S, A, B}, {a, b}, P, S 〉 where the set P of productions is defined by:

S → aA | aB
A → aA | aB
B → bB | b

A typical derivation is:

S ⇒ aA ⇒ aaA ⇒ aaaB ⇒ aaabB ⇒ aaabb

It is not hard to see that L(G) = a^+ b^+.

Not too interesting, we already know how to deal with regular languages.

Can you see the finite state machine hiding in the grammar? Is it minimal?
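The claim L(G) = a^+ b^+ can be sanity-checked on short words. The sketch below, with the hypothetical helper `words_up_to`, enumerates all terminal words of length at most 6 that the grammar derives and compares them with the words matched by the regular expression `a+b+`. Pruning by length is only sound here because no production of this grammar shrinks a sentential form; that is an assumption of the sketch, not a general technique.

```python
import re
from itertools import product

# Productions of the example grammar; symbols are single characters.
P = {"S": ["aA", "aB"], "A": ["aA", "aB"], "B": ["bB", "b"]}


def words_up_to(start, productions, max_len):
    """Terminal words of length <= max_len derivable from `start`."""
    seen, stack, words = {start}, [start], set()
    while stack:
        form = stack.pop()
        if not any(c in productions for c in form):
            words.add(form)  # no variables left: a word of the language
            continue
        for i, c in enumerate(form):
            for rhs in productions.get(c, []):
                new = form[:i] + rhs + form[i + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    stack.append(new)
    return words


generated = words_up_to("S", P, 6)
expected = {
    "".join(t)
    for n in range(1, 7)
    for t in product("ab", repeat=n)
    if re.fullmatch(r"a+b+", "".join(t))
}
print(generated == expected)
```

Agreement up to a fixed length is evidence, not a proof; the proof is a routine induction on the length of derivations.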


Derivation Graph 15

Derivations of length at most 6 in this grammar.


Labeled 16

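[Figure: the same derivation graph with every node labeled by its sentential form. The root S has successors aA and aB. Each form ending in A has the two successors obtained from A → aA and A → aB; each form ending in B has the two successors obtained from B → bB and B → b. The terminal strings shown are ab, abb, abbb, abbbb, aab, aabb, aabbb, aaab, aaabb and aaaab.]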


Example: Mystery 17

Let G = 〈 {A, B}, {a, b}, P, A 〉 where the set P of productions is defined by:

A → AA | AB | a
B → AA | BB | b

A typical derivation is:

A ⇒ AA ⇒ AAB ⇒ AABB ⇒ AABAA ⇒ aabaa

In this case it is not obvious what the language of G is (assuming it has some easy description; it does). More next time when we talk about parsing.


Derivation Graph 18

Derivations of length at most 3 in this grammar. Three terminal strings appear at this point.


Depth 4 19


Example: Counting 20

Let G = 〈 {S}, {a, b}, P, S 〉 where the set P of productions is defined by:

S → aSb | ε

A typical derivation is:

S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb

Clearly, this grammar generates the language { a^i b^i | i ≥ 0 }.

It is easy to see that this language is not regular.


Derivation Graph 21


Example: Palindromes 22

Let G = 〈 {S}, {a, b}, P, S 〉 where the set P of productions is defined by:

S → aSa | bSb | a | b | ε

A typical derivation is:

S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aababaa

This grammar generates the language of palindromes.

Exercise

Give a careful proof of this claim.


Derivation Graph 23


Example: Parens 24

Let G = 〈 {S}, {(, )}, P, S 〉 where the set P of productions is defined by:

S → SS | (S) | ε

A typical derivation is:

S ⇒ SS ⇒ (S)S ⇒ (S)(S) ⇒ (S)((S)) ⇒ ()(())

This grammar generates the language of well-formed parenthesized expressions.

Exercise

Give a careful proof of this claim.


Derivation Graph 25


Example: Expressions of Arithmetic 26

Let G = 〈 {E}, {+, ∗, (, ), v}, P, E 〉 where the set P of productions is defined by:

E → E + E | E ∗ E | (E) | v

A typical derivation is:

E ⇒ E ∗ E ⇒ E ∗ (E) ⇒ E ∗ (E + E) ⇒ v ∗ (v + v)

This grammar generates a language of arithmetical expressions with plus and times. Alas, there are problems: the following derivation is slightly awkward.

E ⇒ E + E ⇒ E + (E) ⇒ E + (E ∗ E) ⇒ v + (v ∗ v)

Our grammar is symmetric in + and ∗; it knows nothing about precedence.


Derivation Graph 27


Ambiguity 28

We may not worry about awkwardness, but the following problem is fatal:

E ⇒ E + E ⇒ E + E ∗ E ⇒ v + v ∗ v

E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ v + v ∗ v

There are two derivations for the same word v + v ∗ v.

Since derivations determine the semantics of a string, this is really bad news: a compiler could interpret v + v ∗ v in two different ways, producing different results.
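To see how many genuinely different structures a word has, one can count the ways the grammar splits it. The sketch below is a hypothetical helper, `count_trees`, that counts these structural readings for the expression grammar by dynamic programming over substrings. It assumes single-character symbols, writes ASCII `*` for ∗, and relies on the grammar having no ε-productions and no unit productions (otherwise the recursion would not terminate).

```python
from functools import lru_cache

# E -> E+E | E*E | (E) | v, with ASCII '*' standing in for the times sign.
P = {"E": ["E+E", "E*E", "(E)", "v"]}


def count_trees(word, start="E", productions=P):
    """Number of distinct ways `start` derives `word`, one per structure."""

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Ways in which `symbol` derives word[i:j].
        if symbol not in productions:  # terminal symbol
            return int(j - i == 1 and word[i] == symbol)
        return sum(split(rhs, i, j) for rhs in productions[symbol])

    @lru_cache(maxsize=None)
    def split(rhs, i, j):
        # Ways in which the symbol string `rhs` derives word[i:j].
        if not rhs:
            return int(i == j)
        head, rest = rhs[0], rhs[1:]
        # Every symbol yields at least one letter, so head takes word[i:m]
        # with i < m, leaving at least len(rest) letters for the rest.
        return sum(
            trees(head, i, m) * split(rest, m, j)
            for m in range(i + 1, j - len(rest) + 1)
        )

    return trees(start, 0, len(word))


print(count_trees("v+v*v"))    # 2
print(count_trees("(v+v)*v"))  # 1
```

The count 2 for v+v*v corresponds to the two derivations above: one groups the word as v + (v ∗ v), the other as (v + v) ∗ v.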


Parse Trees 29

Derivation chains are hard to read; a better representation is a tree.

Let G = 〈V, Σ, P, S〉 be a context free grammar.

A parse tree of G (aka grammatical tree) is an ordered tree on nodes N, together with a labeling λ : N → V ∪ Σ ∪ {ε} such that

For all interior nodes x: λ(x) ∈ V,

If x1, …, xk are the children, in left-to-right order, of interior node x then λ(x) → λ(x1) … λ(xk) is a production of G,

λ(x) = ε implies x is an only child.


Derivation Trees 30

Here are the parse trees of the “expressions grammar” from above.

E
    E
    +
    E
        E
        ∗
        E

E
    E
        E
        +
        E
    ∗
    E

Note that the trees provide a method to evaluate arithmetic expressions, so the existence of two trees becomes a nightmare.


Information Hiding 31

A parse tree typically represents several derivations:

E
    E
        v
    ∗
    E
        E
            v
        +
        E
            v

represents for example

θ1 : E ⇒ E ∗ E ⇒ E ∗ E + E ⇒ v ∗ E + E ⇒ v ∗ v + E ⇒ v ∗ v + v
θ2 : E ⇒ E ∗ E ⇒ E ∗ E + E ⇒ E ∗ E + v ⇒ E ∗ v + v ⇒ v ∗ v + v
θ3 : E ⇒ E ∗ E ⇒ v ∗ E ⇒ v ∗ E + E ⇒ v ∗ v + E ⇒ v ∗ v + v

but not

θ4 : E ⇒ E + E ⇒ E ∗ E + E ⇒ v ∗ E + E ⇒ v ∗ v + E ⇒ v ∗ v + v


Leftmost Derivations 32

Let G be a grammar and assume α ⇒^1 β.

We call this derivation step leftmost if, for some production A → γ,

α = xAα′, β = xγα′, x ∈ Σ*

A whole derivation is leftmost if it only uses leftmost steps. Thus, each replacement is made in the first possible position.

Proposition

Parse trees correspond exactly to leftmost derivations.


Ambiguity 33

Definition

A CFG G is ambiguous if there is a word in the language of G that has two different parse trees.

Alternatively, there are two different leftmost derivations.

As the arithmetic example demonstrates, trees are connected to semantics, so ambiguity is a serious problem in a programming language.


Unambiguous Arithmetic 34

For a “reasonable” context free language it is usually possible to remove ambiguity by rewriting the grammar.

For example, here is an unambiguous grammar for our arithmetic expressions.

E → E + T | T
T → T ∗ F | F
F → (E) | v

In this grammar, v + v ∗ v has only one parse tree.

Here {E, T, F} are syntactic variables that correspond to expressions, terms and factors. Note that it is far from clear how to come up with these syntactic categories.
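The grouping enforced by the E/T/F grammar can be made visible with a small recursive-descent sketch. This is an illustration, not a method from the slides: the left-recursive rules E → E + T and T → T ∗ F are handled by the usual loop that collects operands and groups them to the left, the function name `parse` is hypothetical, and ASCII `*` stands in for ∗. The result is a nested tuple rather than a full parse tree.

```python
def parse(text):
    """Group `text` according to E -> E+T | T, T -> T*F | F, F -> (E) | v."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def eat(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def expr():  # E: one or more terms separated by '+', grouped to the left
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())
        return node

    def term():  # T: one or more factors separated by '*', grouped to the left
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())
        return node

    def factor():  # F: a parenthesized expression or the letter v
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        eat("v")
        return "v"

    tree = expr()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree


print(parse("v+v*v"))  # ('+', 'v', ('*', 'v', 'v'))
```

Because a term must be completed before a `+` is consumed, the product v ∗ v is always grouped first, which is exactly the precedence the original grammar failed to express.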


Inherently Ambiguous Languages 35

Alas, there are CFLs where this trick will not work: every CFG for the language is already ambiguous. Here is a well-known example:

L = { a^i b^j c^k | i = j ∨ j = k; i, j, k ≥ 1 }

L consists of two parts, and each part easily admits an unambiguous grammar.

But strings of the form a^i b^i c^i belong to both parts and introduce a kind of ambiguity that cannot be removed.

BTW, { a^i b^i c^i | i ≥ 0 } is not context free.

Exercise

Show that L really is inherently ambiguous.


1 Generating Languages

2 Properties of CFLs


Regular Implies Context Free 37

Lemma

Every regular language is context free.

Proof. Suppose M = 〈Q, Σ, δ; q0, F〉 is a DFA for L. Consider a CFG with V = Q and productions

p → a q if δ(p, a) = q
p → ε if p ∈ F

Let q0 be the start symbol.

□
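The construction in the proof is short enough to write out. In the sketch below the representation is an assumption: the transition function is a dictionary keyed by (state, letter), a right-hand side is a tuple of symbols, and the empty tuple stands for ε. The names `dfa_to_cfg` and `derives` are hypothetical, and the example DFA (with sink state 3) is one possible automaton for a^+ b^+.

```python
def dfa_to_cfg(states, alphabet, delta, q0, final):
    """Productions p -> a q for delta(p, a) = q, and p -> epsilon for final p."""
    productions = {p: [] for p in states}
    for p in states:
        for a in alphabet:
            productions[p].append((a, delta[(p, a)]))  # p -> a q
        if p in final:
            productions[p].append(())  # p -> epsilon
    return productions, q0


def derives(productions, start, word):
    """Check whether the right-linear grammar built above derives `word`."""
    variables = {start}  # variables that can follow the prefix read so far
    for a in word:
        variables = {
            rhs[1]
            for p in variables
            for rhs in productions[p]
            if rhs and rhs[0] == a
        }
    # The derivation finishes by erasing the remaining variable.
    return any(() in productions[p] for p in variables)


# A DFA for a+b+ : 0 is the start state, 2 is final, 3 is a sink.
delta = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 2,
    (3, "a"): 3, (3, "b"): 3,
}
productions, start = dfa_to_cfg({0, 1, 2, 3}, "ab", delta, 0, {2})
print(derives(productions, start, "aab"))
```

Each derivation of the grammar mirrors a run of the automaton: the single variable in the sentential form is the current state.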


Substitutions 38

Definition

A substitution is a map σ : Σ → P(Γ*).

The idea is that for any word x = x1 x2 … xn ∈ Σ* we can define its image under σ to be the language

σ(x) = σ(x1) · σ(x2) · … · σ(xn)

Likewise, σ(L) = ⋃_{x∈L} σ(x).

If σ(a) = {w} then we have essentially a homomorphism.


The Substitution Lemma 39

Lemma

Let L ⊆ Σ* be a CFL and suppose σ : Σ → P(Γ*) is a substitution such that σ(a) is context free for every a ∈ Σ. Then the language σ(L) is also context free.

Proof.

Let G = 〈V, Σ, P, S〉 and G_a = 〈V_a, Γ, P_a, S_a〉 be CFGs for the languages L and L_a = σ(a) respectively. We may safely assume that the corresponding sets of syntactic variables are pairwise disjoint.

Define G′ as follows. Replace all terminals a on the right hand side of a production in G by the corresponding variable S_a.

It is obvious that f(L(G′)) = L where f is the homomorphism defined by f(S_a) = a.


Proof, cont’d 40

Now define a new grammar H as follows.

The variables of H are V ∪ ⋃_{a∈Σ} V_a, the terminals are Γ, the start symbol is S and the productions are given by

P′ ∪ ⋃_{a∈Σ} P_a

where P′ is the set of productions of G′.

Then the language generated by H is σ(L).

It is clear that H derives every word in σ(L).

For the opposite direction consider the parse trees in H.

□


Closure Properties 41

Corollary

Suppose L, L1, L2 ⊆ Σ* are CFLs. Then the following languages are also context free: L1 ∪ L2, L1 · L2 and L*: context free languages are closed under union, concatenation and Kleene star.

Proof.

This follows immediately from the substitution lemma and the fact that the languages {a, b}, {ab} and {a}* are trivially context free.

□
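The grammar H from the substitution lemma, and the three applications in the corollary, can be sketched as follows. The representation is an assumption: a grammar is a dictionary from variables to lists of right-hand sides, a right-hand side is a tuple of symbols, and the empty tuple is ε. The function name `substitute` and the sample languages x^n y^n and z^n are hypothetical choices for illustration.

```python
def substitute(G, sub_grammars):
    """Productions of H, where sub_grammars[a] = (P_a, S_a) generates sigma(a).

    Assumes all sets of variables are pairwise disjoint, as in the proof.
    """
    H = {}
    for A, rhss in G.items():
        # P': replace each terminal a on a right-hand side by S_a.
        H[A] = [
            tuple(sub_grammars[X][1] if X in sub_grammars else X for X in rhs)
            for rhs in rhss
        ]
    for P_a, _S_a in sub_grammars.values():
        H.update(P_a)  # add the productions of every G_a
    return H


L1 = ({"A": [("x", "A", "y"), ()]}, "A")  # generates x^n y^n
L2 = ({"B": [("z", "B"), ()]}, "B")       # generates z^n
subs = {"a": L1, "b": L2}

union = substitute({"S": [("a",), ("b",)]}, subs)       # from the language {a, b}
concat = substitute({"S": [("a", "b")]}, subs)          # from the language {ab}
star = substitute({"S": [("a", "S"), ()]}, {"a": L1})   # from the language {a}*
print(union["S"], concat["S"], star["S"])
```

In each case the start symbol stays S, and the new productions S → A | B, S → AB and S → AS | ε are the familiar grammar constructions for union, concatenation and Kleene star.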


Non Closure 42

Proposition

CFLs are not closed under intersection and complement.

Consider

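L1 = { a^i b^i c^j | i, j ≥ 0 }    L2 = { a^i b^j c^j | i, j ≥ 0 }

Both L1 and L2 are context free. We will see in a moment that L1 ∩ L2 = { a^i b^i c^i | i ≥ 0 } fails to be context free. Since CFLs are closed under union, closure under complement would imply closure under intersection by De Morgan's law, so complement fails as well.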


More Closure 43

Lemma

Suppose L is a CFL and R is regular. Then L ∩ R is also context free.

Proof.

This will be easy once we have a machine model for CFLs (push-down automata), more later.

□


Dyck Languages 44

One can generalize strings of balanced parentheses to strings involving multiple types of parens.

To this end one uses special alphabets with paired symbols:

Γ = Σ ∪ { ā | a ∈ Σ }

The Dyck language D_k is generated by the grammar

S → SS | a S ā | ε

A typical derivation looks like so:

S ⇒ SS ⇒ aSāS ⇒ aaSāāS ⇒ aaSāāaSā ⇒ aaāāaā

Exercise

Find an alternative definition of a Dyck language.


A Characterization 45

Let us write D_k for the Dyck language with k = |Σ| kinds of parens.

For D1 there is a nice characterization via a simple counting function. Define #_a x to be the number of letters a in word x.

f_a(x) = #_a x − #_ā x

Lemma

A string x belongs to the Dyck language D1 ⊆ {a, ā}* iff

f_a(x) = 0 and

for any prefix z of x: f_a(z) ≥ 0.
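The two conditions of the lemma translate directly into a single left-to-right scan. In this sketch the character `(` plays the role of a and `)` the role of ā; the function name `in_dyck1` is hypothetical.

```python
def in_dyck1(x):
    """Check the two counting conditions for a word over '(' and ')'."""
    height = 0  # f_a of the prefix read so far
    for c in x:
        if c == "(":
            height += 1
        elif c == ")":
            height -= 1
        else:
            raise ValueError(f"unexpected letter: {c!r}")
        if height < 0:  # some prefix z has f_a(z) < 0
            return False
    return height == 0  # f_a(x) = 0


print(in_dyck1("()(())"))
```

Only one counter is needed, which is why this check is so much simpler than a general membership test for context free languages.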


A Paren Mountain 46

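[Figure: a mountain-shaped plot of the prefix counts f_a(z) against the position in a balanced string; the horizontal axis runs from 1 to 21, the vertical axis from 1 to 5.]

Note that one can read off a proof for the correctness of the grammar S → SS | a S ā | ε from the picture.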


k-Parens 47

For D_k we can still count but this time we need values in ℕ^k

f(x) = (f_{a1}(x), f_{a2}(x), …, f_{ak}(x))

Then we need f(x) = 0 and f(z) ≥ 0 for all prefixes z of x, just like for D1.

Alas, this is not enough: we also have to make sure that proper nesting occurs between different types of parens.

The critical problem is that we do not want a b ā b̄.


Matching Pairs 48

Let x = x1x2 . . . xn.

Note that if x ∈ D_k and x_i = a then there is a unique minimal j > i such that f_a(x[i]) = f_a(x[j]) (why?).

Intuitively, x_j = ā is the matching right paren for a = x_i.

Hence we obtain an interval [i, j] associated with the a in position i. Call the collection of all such intervals I_a, a ∈ Σ.

The critical additional condition for a balanced string is that none of the intervals in ⋃ I_a overlap: they are all nested or disjoint.

Exercise

Show that these conditions really describe the language Dk.
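The interval condition can be tried out on small examples. The sketch below uses two kinds of parens, `(`/`)` and `[`/`]`, and the hypothetical names `matching_intervals` and `in_dyck`. One reading of the definition is assumed here: the matching position j for an opener at position i is the least j > i at which the count f_a of the prefix of length j returns to the count of the prefix of length i − 1.

```python
from itertools import combinations

PAIRS = {"(": ")", "[": "]"}  # a -> a-bar; two kinds of parens, so k = 2


def matching_intervals(x):
    """Intervals [i, j], 1-based, found by counting each kind separately.

    Returns None if some kind fails the counting conditions.
    """
    intervals = []
    for a, abar in PAIRS.items():
        f = [0]  # f[p] = f_a of the prefix of length p
        for c in x:
            f.append(f[-1] + (c == a) - (c == abar))
        if f[-1] != 0 or min(f) < 0:
            return None
        for i in range(1, len(x) + 1):
            if x[i - 1] == a:
                # least j > i where the count is back at its value before i
                j = next(p for p in range(i + 1, len(x) + 1) if f[p] == f[i - 1])
                intervals.append((i, j))
    return intervals


def in_dyck(x):
    if any(c not in PAIRS and c not in PAIRS.values() for c in x):
        return False
    intervals = matching_intervals(x)
    if intervals is None:
        return False
    for (i, j), (p, q) in combinations(intervals, 2):
        if i < p < j < q or p < i < q < j:  # overlap: not nested, not disjoint
            return False
    return True


print(in_dyck("([])"), in_dyck("([)]"))
```

The word `([)]` passes both counting tests but is rejected because the intervals [1, 3] and [2, 4] overlap, which is the situation a b ā b̄ from the previous slide.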


Dyck vs. CF 49

In a strong sense, Dyck languages are the “most general” context free languages: all context free languages are built around the notion of matching parens, though this may not at all be obvious from their definitions (and, actually, not even from their grammars).

Theorem (Chomsky-Schützenberger 1963)

Every context free language L ⊆ Σ* has the form L = h(D ∩ R) where D is a Dyck language, R is regular and h is a homomorphism.

The proof also relies on a machine model, more later.


Parikh Vectors 50

Suppose Σ = {a1, a2, …, ak}. For x ∈ Σ*, the Parikh vector of x is defined by

#x = (#_{a1} x, #_{a2} x, …, #_{ak} x) ∈ ℕ^k

Lift to languages via

#L = { #x | x ∈ L } ⊆ ℕ^k

In a sense, the Parikh vector gives the commutative version of a word: we just count all the letters, but ignore order entirely.

For example, for the Dyck language D1 over {a, ā} we have #D1 = { (i, i) | i ≥ 0 }.
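Computing a Parikh vector is a matter of counting letters in a fixed order. A minimal sketch, with the hypothetical function name `parikh` and the alphabet passed as a string that fixes the order a1, …, ak:

```python
from collections import Counter


def parikh(x, alphabet):
    """Parikh vector of x with respect to the letter order in `alphabet`."""
    counts = Counter(x)
    return tuple(counts[a] for a in alphabet)


# Words with the same letters in a different order have the same vector.
print(parikh("abba", "ab"), parikh("baab", "ab"))
# '(' and ')' stand in for a and a-bar: a Dyck word has equal counts.
print(parikh("(())()", "()"))
```

Letters of the alphabet that do not occur in x contribute a zero entry, since a `Counter` returns 0 for missing keys.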


Semi-Linear Sets 51

A set A ⊆ ℕ^k is semi-linear if it is a finite union of sets of the form

{ a_0 + ∑_i x_i a_i | x_i ≥ 0 }

where the vectors a_i ∈ ℕ^k are fixed and the coefficients x_i range over ℕ.

In the special case k = 1, semi-linear sets are often called ultimately periodic:

A = A_t ∪ (a + mℕ + A_p)

where A_t ⊆ {0, …, a − 1} and A_p ⊆ {0, …, m − 1} are the transient and periodic part, respectively.

Observe that for any language L ⊆ {a}*: L is regular iff #L ⊆ ℕ is semi-linear.


Parikh’s Theorem 52

Theorem

For any context free language L, the Parikh set #L is semi-linear.

Instead of a proof, consider the example of the Dyck language D1

S → SS | a S ā | ε

Let A = #D1, then A is the least set X ⊆ ℕ^2 such that

S → SS: X is closed under addition

S → aSā: X is closed under x ↦ x + (1, 1)

S → ε: X contains (0, 0)

Clearly, A = { (i, i) | i ≥ 0 }.
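The least set satisfying the three closure conditions can be computed by iteration, at least inside a bounded box. The sketch below, with the hypothetical function name `parikh_d1`, restricts attention to vectors with entries at most `bound`. This truncation is harmless because both operations only increase coordinates, so a vector inside the box is never produced from one outside it.

```python
def parikh_d1(bound):
    """Least X within [0, bound]^2 with (0,0) in X, closed under the rules."""
    X = {(0, 0)}  # S -> epsilon
    while True:
        new = {(u + 1, v + 1) for (u, v) in X}                   # S -> a S a-bar
        new |= {(u + s, v + t) for (u, v) in X for (s, t) in X}  # S -> SS
        new = {p for p in new if max(p) <= bound} - X
        if not new:  # fixed point reached
            return X
        X |= new


print(sorted(parikh_d1(5)))
```

The result is the diagonal { (i, i) | 0 ≤ i ≤ bound }, in agreement with the claim about A.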


Application Parikh 53

It follows immediately that every context free language over Σ = {a} is already regular.

As a consequence, { a^p | p prime } is not context free.

This type of argument also works for a slightly messier language like

L = { a^k b^ℓ | k > ℓ ∨ (k ≤ ℓ ∧ k prime) }

Note that in this case L and #L are essentially the same, so it all comes down to the set of primes not being semi-linear.


Markings 54

Another powerful method to show that a language fails to be context free is a generalization of the infamous pumping lemma for regular languages. Alas, this time we need to build up a bit of machinery.

Definition

Let w ∈ Σ*, say, n = |w|. A position in w is a number p, 1 ≤ p ≤ n.

A set K ⊆ [n] of positions is called a marking of w.

A 5-factorization of w consists of 5 words x1, x2, x3, x4, x5 such that x1x2x3x4x5 = w.

Given a factorization and a marking of w let

K(x_i) = { p ∈ K | |x_1 … x_{i−1}| < p ≤ |x_1 … x_i| } ⊆ K

Thus K(x_i) simply consists of all the marked positions in block x_i.
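The definition of the marked positions per block is easy to get wrong by one, so a small sketch may help. The function name `marked_per_block` is hypothetical; positions are 1-based as in the definition, and the factorization is given as a list of five words.

```python
def marked_per_block(blocks, K):
    """Return the list of sets K(x_1), ..., K(x_5) for a factorization."""
    result, start = [], 0  # start = |x_1 ... x_{i-1}|
    for block in blocks:
        end = start + len(block)  # end = |x_1 ... x_i|
        result.append({p for p in K if start < p <= end})
        start = end
    return result


# w = aabbcc with the two b's (positions 3 and 4) marked.
blocks = ["aa", "b", "", "b", "cc"]
print(marked_per_block(blocks, {3, 4}))
```

An empty block never contains a marked position, and the sets K(x_i) always partition K.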


The Iteration Theorem 55

Theorem

Let G = 〈V, Σ, P, S〉 be a CFG. Then there exists a number N = N(G) such that for all x ∈ L(G), K ⊆ [|x|] a marking of x of cardinality at least N:

there exists a 5-factorization x1, …, x5 of x such that, letting Ki = K(xi), we have:

K1, K2, K3 ≠ ∅ or K3, K4, K5 ≠ ∅

|K2 ∪ K3 ∪ K4| ≤ N.

∀ t ≥ 0 ( x1 x2^t x3 x4^t x5 ∈ L(G) ).

Proof. Stare at parse trees. □


Non-Closure 56

Lemma

{ a^i b^i c^i | i ≥ 0 } is not context free.

Proof.

Recall that this shows non-closure under complements and intersections.

So a CFG can count and compare two letters, as in

L1 = { a^i b^i c^j | i, j ≥ 0 }

L2 = { a^i b^j c^j | i, j ≥ 0 }

but three letters are not manageable.

The intuition behind this will become clear next time when we introduce a machine model.


Proof 57

Let N be as in the iteration theorem and set

w = a^N b^N c^N,

K = [N + 1, 2N] (so the b’s in the middle are marked).

Then there is a factorization x1, …, x5 of w such that, letting Ki = K(xi), we have:

Case 1: K1, K2, K3 ≠ ∅
Then x1 = a^N b^i, x2 = b^j, x3 = b^k y where j > 0.

But then x1x3x5 contains N letters a and fewer than N letters b, so x1x3x5 ∉ L, contradiction.

Case 2: K3, K4, K5 ≠ ∅
Then x3 = y b^i, x4 = b^j, x5 = b^k c^N where j > 0.

Again x1x3x5 contains N letters c and fewer than N letters b, so x1x3x5 ∉ L, contradiction.
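The case analysis can be double-checked by brute force for small N. The sketch below, with the hypothetical names `in_L` and `no_factorization_survives`, runs through every 5-factorization of a^N b^N c^N, keeps those that satisfy the first condition of the iteration theorem for the marking K = [N + 1, 2N], and tests whether deleting x2 and x4 (the case t = 0) leaves the language. The true constant N(G) of a hypothetical grammar is of course unknown; the code only illustrates the argument for the values tried.

```python
from itertools import combinations_with_replacement


def in_L(s):
    """Membership in { a^i b^i c^i | i >= 0 }."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n


def no_factorization_survives(N):
    w = "a" * N + "b" * N + "c" * N
    K = set(range(N + 1, 2 * N + 1))  # the positions of the b's are marked
    # Four cut points in non-decreasing order describe a 5-factorization.
    for cuts in combinations_with_replacement(range(len(w) + 1), 4):
        bounds = (0,) + cuts + (len(w),)
        blocks = [w[bounds[i]:bounds[i + 1]] for i in range(5)]
        Ki = [{p for p in K if bounds[i] < p <= bounds[i + 1]} for i in range(5)]
        marked_ok = all(Ki[:3]) or all(Ki[2:])  # K1,K2,K3 or K3,K4,K5 nonempty
        if marked_ok and in_L(blocks[0] + blocks[2] + blocks[4]):
            return False  # this factorization would survive t = 0
    return True


print(no_factorization_survives(3), no_factorization_survives(4))
```

Every admissible factorization fails already at t = 0, which is the contradiction used in both cases of the proof.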


More Non-Closure 58

It follows that

{ x ∈ {a, b, c}* | |x|_a = |x|_b = |x|_c }

is not context free: otherwise the intersection with a*b*c* would also be context free.

Exercise

Show that the copy language

L_copy = { xx | x ∈ Σ* }

fails to be context free. Compare this to the palindrome language { x x^op | x ∈ Σ* }.