Language of a Grammar


Transcript of Language of a Grammar

Page 1: Language of a Grammar

CFG [1]

Language of a Grammar

If G is a grammar we write

L(G) = { w ∈ T∗ | S ⇒∗ w }

Definition: A language L is context-free iff there is a grammar G such that L = L(G)

start symbol corresponds to start state

variable symbols correspond to states

terminal symbols T correspond to the alphabet Σ

1

Page 2: Language of a Grammar

CFG [2]

Context-Free Languages and Regular Languages

Theorem: If L is regular then L is context-free.

Proof: We know L = L(A) for a DFA A. From A we can build a CFG G such that L(A) = L(G)

The variables are A, B, C, with start symbol A; the terminal symbols are 0 and 1; and the productions are

A → 1A | 0B   B → 0B | 1C   C → ε | 0B | 1C

2

Page 3: Language of a Grammar

CFG [3]

Context-Free Languages and Regular Languages

Let L_X be the language generated by the grammar with X as start symbol. We prove (mutual induction!) that w ∈ L_X iff δ̂(X, w) = C, by induction on |w|

Such a CFG is called right regular

It would also be possible to define L by a left regular grammar with start symbol C

A → ε | A1   B → A0 | C0 | B0   C → B1 | C1

The intuition here is that L_X represents the strings labeling the paths from A to X

3

Page 4: Language of a Grammar

CFG [4]

Example of a derivation

Given the grammar for English above we can generate (leftmost derivation)

SENTENCE ⇒ SUBJECT VERB OBJECT

⇒ ARTICLE NOUN VERB OBJECT ⇒ the NOUN VERB OBJECT

⇒ the cat VERB OBJECT

⇒ the cat caught OBJECT ⇒ the cat caught ARTICLE NOUN

⇒ the cat caught a NOUN ⇒ the cat caught a dog

4

Page 5: Language of a Grammar

CFG [5]

Derivation Tree

Notice that the following generation is possible (rightmost derivation)

SENTENCE ⇒ SUBJECT VERB OBJECT

⇒ SUBJECT VERB ARTICLE NOUN

⇒ SUBJECT VERB ARTICLE dog

⇒ SUBJECT VERB a dog ⇒ SUBJECT caught a dog

⇒ ARTICLE NOUN caught a dog ⇒ ARTICLE cat caught a dog

⇒ the cat caught a dog

5

Page 6: Language of a Grammar

CFG [6]

Derivation Tree

Both generations correspond to the same derivation tree or parse tree, which reflects the internal structure of the sentence

Number of leftmost derivations of one word

= number of rightmost derivations

= number of parse trees

6

Page 7: Language of a Grammar

CFG [7]

A grammar for arithmetical expressions

S → (S) | S + S | S × S | I

I → 1 | 2 | 3

The terminal symbols are {(, ),+,×, 1, 2, 3}

The variable symbols are S and I

7

Page 8: Language of a Grammar

CFG [8]

Ambiguity

Definition: A grammar G is ambiguous iff there is some word in L(G) which has two distinct derivation trees

Intuitively, there are two possible meanings of this word

Example: the previous grammar for arithmetical expressions is ambiguous since the word 2 + 1 × 3 has two possible parse trees

8

Page 9: Language of a Grammar

CFG [9]

Ambiguity

An example of ambiguity in programming languages is the dangling else, with the following productions

C → if b then C else C

C → if b then C

C → s

9

Page 10: Language of a Grammar

CFG [10]

Ambiguity

A word like

if b then if b then s else s

can be interpreted as

if b then (if b then s else s)

or

if b then (if b then s) else s

10

Page 11: Language of a Grammar

CFG [11]

Context-Free Languages and Inductive Definitions

Each CFG can be seen as an inductive definition

For instance the grammar for arithmetical expressions can be seen as the following inductive definition

• 1, 2, 3 are arithmetical expressions

• if w is an arithmetical expression then so is (w)

• if w1, w2 are arithmetical expressions then so are w1 + w2 and w1 × w2

A natural way to do proofs on context-free languages is to follow this inductive structure

11

Page 12: Language of a Grammar

CFG [12]

Context-Free Languages and Regular Languages

The following language L = {a^n b^n | n ≥ 1} is context-free

We know that it is not regular

Proposition: The following grammar G generates L

S → ab | aSb

12

Page 13: Language of a Grammar

CFG [13]

Context-Free Languages and Regular Languages

We prove w ∈ L(G) implies w ∈ L by induction on the derivation of w ∈ L(G)

• ab ∈ L(G)

• if w ∈ L(G) then awb ∈ L(G)

We can also prove that w ∈ L(G) implies w ∈ L by induction on the length of a derivation S ⇒∗ w

We prove a^n b^n ∈ L(G) by induction on n

13

Page 14: Language of a Grammar

CFG [14]

Abstract Syntax

The parse tree often has too much information w.r.t. the internal structure of a document. This structure is best reflected by an abstract syntax tree. We give only an example here.

Here is a BNF for arithmetic expressions

E → E + E | E × E | (E) | I   I → 1 | 2 | 3

Parse tree for 2 + (1 × 3)

14

Page 15: Language of a Grammar

CFG [15]

Abstract Syntax

This can be compared with the abstract syntax tree for the expression 2 + (1 × 3)

Concrete syntax describes the way documents are written, while abstract syntax describes the pure structure of a document.

15

Page 16: Language of a Grammar

CFG [16]

Abstract Syntax

In Haskell, use of data types for abstract syntax

data Exp = Plus Exp Exp | Times Exp Exp | Num Atom

data Atom = One | Two | Three

ex = Plus (Num Two) (Times (Num One) (Num Three))
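To illustrate how such an abstract syntax tree is used, here is a small evaluator; it is not part of the slides, just a minimal sketch over the data types above.

eval :: Exp -> Int
eval (Plus e1 e2)  = eval e1 + eval e2
eval (Times e1 e2) = eval e1 * eval e2
eval (Num a)       = atom a

atom :: Atom -> Int
atom One   = 1
atom Two   = 2
atom Three = 3

-- eval ex == 5, the value of 2 + (1 × 3)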

16

Page 17: Language of a Grammar

CFG [17]

Ambiguity

Definition: A grammar G is ambiguous iff there is some word in L(G) which has two distinct derivation trees

Intuitively, there are two possible meanings of this word

Example: the previous grammar for arithmetical expressions is ambiguous since the word 2 + 1 × 3 has two possible parse trees

17

Page 18: Language of a Grammar

CFG [18]

Ambiguity

Let Σ be {0, 1}.

The following grammar of parenthesis expressions is ambiguous

E → ε | EE | 0E1

18

Page 19: Language of a Grammar

CFG [19]

A simple example

Σ = {0, 1}

L = { u u^R | u ∈ Σ∗ }, where u^R is the reverse of u

This language is not regular: use the pumping lemma on 0^k 1 1 0^k

L = L(G) for the grammar

S → ε | 0S0 | 1S1

We prove that if S ⇒∗ v then v ∈ L by induction on the length of S ⇒∗ v

We prove u u^R ∈ L(G) for u ∈ Σ∗ by induction on the length of u

19

Page 20: Language of a Grammar

CFG [20]

A simple example

Theorem: The grammar for S is not ambiguous

Proof: By induction on |v| we prove that there is at most one leftmost derivation S ⇒∗ v

20

Page 21: Language of a Grammar

CFG [21]

Polish notation

The following is a grammar for arithmetical expressions

E → ∗EE | +EE | I,   I → a | b

Theorem: This grammar is not ambiguous

We show by induction on |u| the following.

Lemma: for any k there is at most one leftmost derivation of E^k ⇒∗ u
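The lemma can be read as saying that a deterministic parser exists: the first symbol of u determines which production is used. Here is a minimal Haskell sketch of such a parser (not from the slides); the data type PExp and the use of the ASCII characters '+', '*', 'a', 'b' for the terminals are assumptions of the sketch.

data PExp = Add PExp PExp | Mul PExp PExp | Atom Char
  deriving Show

-- parse exactly one expression, returning the rest of the input;
-- the first symbol alone decides which production is applied
parseE :: String -> Maybe (PExp, String)
parseE ('+':rest) = do
  (e1, r1) <- parseE rest
  (e2, r2) <- parseE r1
  return (Add e1 e2, r2)
parseE ('*':rest) = do
  (e1, r1) <- parseE rest
  (e2, r2) <- parseE r1
  return (Mul e1 e2, r2)
parseE (c:rest) | c == 'a' || c == 'b' = Just (Atom c, rest)
parseE _ = Nothing

-- the lemma is about E^k: parse k expressions in a row
parseEk :: Int -> String -> Maybe ([PExp], String)
parseEk 0 u = Just ([], u)
parseEk k u = do
  (e, r)   <- parseE u
  (es, r') <- parseEk (k - 1) r
  return (e : es, r')

For each k and u there is at most one way parseEk can succeed, which mirrors the uniqueness of leftmost derivations E^k ⇒∗ u.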

21

Page 22: Language of a Grammar

CFG [22]

Polish notation

The proof is by induction on |u|. If |u| = n + 1 with n ≥ 1 there are three cases

(1) If u = +v then the derivation has to be of the form

E^k ⇒ +EEE^{k-1} ⇒∗ +v

for a derivation E^{k+1} ⇒∗ v, and we conclude by the induction hypothesis

(2) If u = ∗v then the derivation has to be of the form

E^k ⇒ ∗EEE^{k-1} ⇒∗ ∗v

for a derivation E^{k+1} ⇒∗ v, and we conclude by the induction hypothesis

22

Page 23: Language of a Grammar

CFG [23]

Polish notation

(3) If u = iv with i = a or i = b, then the derivation has to be of the form

E^k ⇒ iE^{k-1} ⇒∗ iv

for a derivation E^{k-1} ⇒∗ v, and we conclude by the induction hypothesis

23

Page 24: Language of a Grammar

CFG [24]

Polish notation

It follows from this result that we have the following property.

Corollary: If ∗u1u2 = ∗v1v2 ∈ L(E) with u1, u2, v1, v2 ∈ L(E) then u1 = v1 and u2 = v2. Similarly if +u1u2 = +v1v2 ∈ L(E) with u1, u2, v1, v2 ∈ L(E) then u1 = v1 and u2 = v2.

But the result also says that if u ∈ L(E) then there is a unique parse tree for u.

24

Page 25: Language of a Grammar

CFG [25]

Ambiguity

Now, a more complicated example. Let Σ be {0, 1}.

The following grammar of parenthesis expressions is ambiguous

E → ε | EE | 0E1

We replace this by the following equivalent grammar

S → 0S1S | ε

Lemma: L(S) = L(E)

Theorem: The grammar for S is not ambiguous

25

Page 26: Language of a Grammar

CFG [26]

Ambiguity

Lemma: L(S)L(S) ⊆ L(S)

Proof: we prove that if u ∈ L(S) then uL(S) ⊆ L(S) by induction on |u|

If u = ε then uL(S) = L(S)

If |u| = n + 1 then u = 0v1w with v, w ∈ L(S) and |v|, |w| ≤ n. By the induction hypothesis, we have wL(S) ⊆ L(S) and so

uL(S) = 0v1wL(S) ⊆ 0v1L(S) ⊆ L(S)

since v ∈ L(S) and 0L(S)1L(S) ⊆ L(S) Q.E.D.

26

Page 27: Language of a Grammar

CFG [27]

Ambiguity

We can also do an induction on the length of a derivation S ⇒∗ u

Using this lemma, we can show L(E) ⊆ L(S)

27

Page 28: Language of a Grammar

CFG [28]

Ambiguity

Lemma: L(E) ⊆ L(S)

Proof: We prove that if u ∈ L(E) then u ∈ L(S) by induction on the length of a derivation E ⇒∗ u

If E ⇒ ε = u then u ∈ L(S)

If E ⇒ EE ⇒∗ vw = u then by induction v, w ∈ L(S) and by the previous Lemma we have u ∈ L(S)

If E ⇒ 0E1 ⇒∗ 0v1 = u then by induction v ∈ L(S) and so u = 0v1ε ∈ L(S). Q.E.D.

28

Page 29: Language of a Grammar

CFG [29]

Ambiguity

The proof that the grammar for S is not ambiguous is difficult

One first tries to show that there is at most one left-most derivation

S ⇒∗lm u

for any string u ∈ Σ∗

If u is not ε, then u must be of the form 0u_1 and the derivation must be

S ⇒ 0S1S ⇒∗ 0u_1

with S1S ⇒∗ u_1

29

Page 30: Language of a Grammar

CFG [30]

Ambiguity

This suggests the following statement ψ(u), to be proved by induction on the length of u

For any k there exists at most one leftmost derivation S(1S)^k ⇒∗ u

We can then prove ψ(u) by induction on |u|

If u = ε we should have k = 0 and the derivation has to be S ⇒ ε

30

Page 31: Language of a Grammar

CFG [31]

Ambiguity

If ψ(v) holds for |v| = n and |u| = n + 1 then u = 0v or u = 1v with |v| = n. We have two cases

(1) u = 1v and S(1S)^k ⇒∗ 1v: the derivation has the form

S(1S)^k ⇒ ε(1S)^k ⇒∗ 1v

for a derivation S(1S)^{k-1} ⇒∗ v, and we conclude by the induction hypothesis

(2) u = 0v and S(1S)^k ⇒∗ 0v: the derivation has the form

S(1S)^k ⇒ 0S1S(1S)^k ⇒∗ 0v

for a derivation S(1S)^{k+1} ⇒∗ v, and we conclude by the induction hypothesis

31

Page 32: Language of a Grammar

CFG [32]

Inherent Ambiguity

There exists a context-free language L such that for any grammar G, if L = L(G) then G is ambiguous

L = {a^n b^n c^m d^m | n, m ≥ 1} ∪ {a^n b^m c^m d^n | n, m ≥ 1}

L is context-free

S → AB | C A→ aAb | ab

B → cBd | cd C → aCd | aDd D → bDc | bc

32

Page 33: Language of a Grammar

CFG [33]

Eliminating ε- and unit productions

Definition: A unit production is a production of the form A → B with A, B nonterminal symbols.

This is similar to ε-transitions in an ε-NFA

Definition: An ε-production is a production of the form A → ε

Theorem: For any CFG G there exists a CFG G′ with no ε- or unit productions such that L(G′) = L(G) − {ε}

33

Page 34: Language of a Grammar

CFG [34]

Elimination of unit productions

Starting from P, let P1 be the smallest set of productions containing P such that if A → B and B → β are in P1 then so is A → β, and let G1 = (V, T, P1, S).

Let P2 be the set of non-unit productions of P1 and G2 = (V, T, P2, S)

Theorem: L(G1) = L(G2)

34

Page 35: Language of a Grammar

CFG [35]

Elimination of unit productions

Proof: If u ∈ L(G1) and S ⇒∗ u is a derivation of minimal length then this derivation is in G2. Otherwise it has the shape

S ⇒∗ α1Aα2 ⇒ α1Bα2 ⇒^n β1Bβ2 ⇒ β1ββ2 ⇒∗ u

and we have a shorter derivation

S ⇒∗ α1Aα2 ⇒^n β1Aβ2 ⇒ β1ββ2 ⇒∗ u

contradiction.

35

Page 36: Language of a Grammar

CFG [36]

Elimination of unit productions

S → CBh | D

A→ aaC

B → Sf | ggg

C → cA | d | C

D → E | SABC

E → be

36

Page 37: Language of a Grammar

CFG [37]

Elimination of unit productions

We eliminate unit productions

S → SABC | be | CBh

A→ aaC

B → Sf | ggg

C → cA | d

37

Page 38: Language of a Grammar

CFG [38]

Elimination of ε-productions

If G = (V, T, P, S), build the new system P1 by closing P under the rule

If A → αBβ and B → ε then A → αβ

We have L(G1) = L(G), where G1 = (V, T, P1, S). Let P2 be the system obtained from P1 by removing all ε-productions

Theorem: L(G2) = L(G) − {ε}

Proof: We clearly have L(G2) ⊆ L(G1). We prove that if S ⇒∗ u, with u ∈ T∗ and u ≠ ε, is a derivation of minimal length then it does not use any ε-production, so it is a derivation in G2. Q.E.D.

38

Page 39: Language of a Grammar

CFG [39]

Eliminating ε- and unit productions

Starting from G = (V, T, P, S) we build a larger set P1 of productions containing P and closed under the two rules

1. if A → w1Bw2 and B → ε are in P1 then A → w1w2 is in P1

2. if A → B and B → w are in P1 then so is A → w

We only add productions whose right-hand side is an old right-hand side or is obtained from one by deleting symbols, so this process stops.

It can be shown that L(V, T, P1, S) = L(G) and that if P′ is the set of productions in P1 that are neither ε- nor unit productions then L(V, T, P′, S) = L(G) − {ε}
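Here is a minimal Haskell sketch of this closure process (not from the slides); it assumes productions are represented as pairs of a variable (an uppercase character) and a right-hand-side string.

import Data.Char (isUpper)
import Data.List (nub)

type Prod = (Char, String)   -- (variable, right-hand side)

-- one step of the closure: apply the two rules above to the current set
step :: [Prod] -> [Prod]
step ps = nub (ps ++ rule1 ++ rule2)
  where
    nullable = [ b | (b, "") <- ps ]
    rule1    = [ (a, pre ++ drop 1 post)
               | (a, rhs) <- ps
               , i <- [0 .. length rhs - 1]
               , let (pre, post) = splitAt i rhs
               , head post `elem` nullable ]
    rule2    = [ (a, w) | (a, [b]) <- ps, isUpper b, (b', w) <- ps, b' == b ]

-- iterate until no new production is added
closure :: [Prod] -> [Prod]
closure ps = let ps' = step ps in if ps' == ps then ps else closure ps'

-- keep only the productions that are neither ε- nor unit productions
simplify :: [Prod] -> [Prod]
simplify ps = [ (a, w) | (a, w) <- closure ps
              , not (null w)
              , not (length w == 1 && isUpper (head w)) ]

For the example on the next page, simplify [('S',"aSb"),('S',"SS"),('S',"")] returns the productions S → aSb, S → SS, S → ab.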

39

Page 40: Language of a Grammar

CFG [40]

Eliminating ε- and unit productions

Example: If we start from the grammar

S → aSb | SS | ε

we get first the new productions

S → ab | S | S

and if we eliminate the ε- and unit productions we get

S → aSb | SS | ab

40

Page 41: Language of a Grammar

CFG [41]

Eliminating ε- and unit productions

Example: If we start from the grammar

S → AB A→ aAA | ε B → bBB | ε

we get first the new productions

S → A | B A→ aA | a B → bB | b

and if we eliminate the ε- and unit productions we get

S → AB | aAA | aA | a | bB | b A→ aAA | aA | a B → bBB | bB | b

41

Page 42: Language of a Grammar

CFG [42]

Eliminating Useless Symbols

A symbol X is useful if there is some derivation S ⇒∗ αXβ ⇒∗ w where w is in T∗

X can be in V or T

X is useless iff it is not useful

X is generating iff X ⇒∗ w for some w in T∗

X is reachable iff S ⇒∗ αXβ for some α, β

42

Page 43: Language of a Grammar

CFG [43]

Reachable Symbols

By analogy with accessible states, we can define accessible or reachable symbols. We give an inductive definition

• BASIS: The start symbol S is reachable

• INDUCTION: If A is reachable and A → w is a production, then all symbols occurring in w are reachable.
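A minimal Haskell sketch of this inductive definition as a fixpoint computation (not from the slides); a grammar is assumed to be given as a list of productions, each a variable paired with its right-hand-side string.

import Data.List (nub)

type Prod = (Char, String)   -- (variable, right-hand side)

-- one round: every symbol in the body of a production whose left-hand side
-- is already reachable becomes reachable
stepReach :: [Prod] -> [Char] -> [Char]
stepReach ps reach = nub (reach ++ [ x | (a, w) <- ps, a `elem` reach, x <- w ])

-- iterate from the start symbol until nothing new is added
reachable :: [Prod] -> Char -> [Char]
reachable ps s = go [s]
  where go r = let r' = stepReach ps r in if r' == r then r else go r'

-- the grammar of the next page:  S → aB | BC,  A → aA | c | aDb,  B → DB | C,  C → b,  D → B
exampleG :: [Prod]
exampleG = [('S',"aB"),('S',"BC"),('A',"aA"),('A',"c"),('A',"aDb"),
            ('B',"DB"),('B',"C"),('C',"b"),('D',"B")]
-- reachable exampleG 'S' == "SaBCDb": the variable A (and the terminal c) is not reachable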

43

Page 44: Language of a Grammar

CFG [44]

Reachable Symbols

Example: Consider the following CFG

S → aB | BC A→ aA | c | aDb

B → DB | C C → b D → B

Then S is accessible, hence also B and C, and hence D is accessible.

But A is not accessible.

We can take away A from this grammar and we get the same language

S → aB | BC B → DB | C C → b D → B

44

Page 45: Language of a Grammar

CFG [45]

Generating Symbols

We define when an element of V ∪ T (a terminal or nonterminal symbol) is generating by an inductive definition

• BASIS: all elements of T are generating

• INDUCTION: if there is a production X → w where all symbols occurring in w are generating then X is generating

This gives exactly the generating variables
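Similarly, here is a minimal Haskell sketch of the generating-symbols fixpoint (not from the slides), with the same representation of productions; a symbol is added once the whole body of one of its productions is already generating.

import Data.Char (isLower)
import Data.List (nub)

type Prod = (Char, String)   -- (variable, right-hand side); terminals are lowercase

stepGen :: [Prod] -> [Char] -> [Char]
stepGen ps gen = nub (gen ++ [ a | (a, w) <- ps, all (`elem` gen) w ])

generating :: [Prod] -> [Char]
generating ps = go terminals
  where
    terminals = nub [ x | (_, w) <- ps, x <- w, isLower x ]
    go g = let g' = stepGen ps g in if g' == g then g else go g'

-- the example of the next page:  S → aS | W | U,  W → aW,  U → a,  V → aa
-- generating [('S',"aS"),('S',"W"),('S',"U"),('W',"aW"),('U',"a"),('V',"aa")] == "aUVS"
-- so W is not generating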

45

Page 46: Language of a Grammar

CFG [46]

Generating Symbols

Example: We consider

S → aS | W | U W → aW

U → a V → aa

Then U and V are generating because U → a and V → aa

Hence S is generating because S → U

W is not generating: the only production for W is W → aW

46

Page 47: Language of a Grammar

CFG [47]

Eliminating Useless Symbols

To eliminate useless symbols in a grammar G, first eliminate all nongenerating symbols; we get an equivalent grammar G1. Then eliminate all symbols in G1 that are not reachable.

We get a grammar G2 that is equivalent to G1 and to G

We have to do this in this order

Examples: For the grammar

S → AB | a A→ b

47

Page 48: Language of a Grammar

CFG [48]

B is not generating, so we get the grammar

S → a   A → b

and then A is not reachable, so we get the grammar

S → a

48

Page 49: Language of a Grammar

CFG [49]

Elimination of useless variables

S → gAe | aY B | CY, A→ bBY | ooC

B → dd | D, C → jV B | gi

D → n, U → kW

V → baXXX | oV, W → c

X → fV, Y → Y hm

49

Page 50: Language of a Grammar

CFG [50]

Elimination of useless variables

Simplified grammar

S → gAe

A→ ooC

C → gi

50

Page 51: Language of a Grammar

CFG [51]

Linear production systems

Several algorithms we have seen are instances of graph-searching algorithms, i.e. of derivability in linear production systems

51

Page 52: Language of a Grammar

CFG [52]

Linear Production systems

For testing for accessibility, for the grammar

S → aB | BC, A→ aA | c | aDb

B → DB | C, C → b | B

we associate the production system

→ S, S → B, S → C

A→ A, A→ D, B → B

B → D, B → C, C → B

and we can produce S,B,D,C

52

Page 53: Language of a Grammar

CFG [53]

Linear Production systems

A lot of problems in elementary logic are of this form

A→ B, B → C, A,C → D

What can we deduce from A?

53

Page 54: Language of a Grammar

CFG [54]

Linear Production systems

For computing generating symbols we have a more general form of production system

For instance for the grammar

A→ ABC, A→ C, B → Ca, C → a

we can associate the following production system

A,B,C → A, C → A, C → B, → C

and we can produce C, B, A. There is an algorithm for this kind of problem in 7.4.3

54

Page 55: Language of a Grammar

CFG [55]

Chomsky Normal Form

Definition: A CFG is in Chomsky Normal Form (CNF) iff all productions are of the form A → BC or A → a

Theorem: For any CFG G there is a CFG G′ in Chomsky Normal Form such that L(G′) = L(G) − {ε}

55

Page 56: Language of a Grammar

CFG [56]

Chomsky Normal Form

We can assume that G has no ε- or unit productions. For each terminal a we introduce a new nonterminal A_a with the production

A_a → a

We can then assume that all productions are of the form A → a or A → B1B2 . . . Bk with k ≥ 2

If k > 2 we introduce a new variable C with productions A → B1C and C → B2 . . . Bk, until we have only right-hand sides of length ≤ 2
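A minimal Haskell sketch of this splitting step (not from the slides); variables are represented as strings, and the fresh names C<i>_<j> are an assumption of the sketch (they must not clash with existing variables).

type Sym  = String
type Prod = (Sym, [Sym])   -- right-hand sides are lists of symbols

-- split one production A → B1 B2 ... Bk (k > 2) into binary ones,
-- drawing fresh variable names from the supply
splitProd :: [Sym] -> Prod -> [Prod]
splitProd fresh (a, rhs)
  | length rhs <= 2 = [(a, rhs)]
  | otherwise       = (a, [head rhs, c]) : splitProd cs (c, tail rhs)
  where (c:cs) = fresh

-- apply the splitting to every production, giving each production its own name supply
toBinary :: [Prod] -> [Prod]
toBinary ps = concat (zipWith split [0 :: Int ..] ps)
  where split i = splitProd [ "C" ++ show i ++ "_" ++ show j | j <- [0 :: Int ..] ]

-- For the grammar of the next page, toBinary [("S",["A","S","B"]),("S",["S","S"]),("S",["A","B"])]
-- gives S → A C0_0, C0_0 → S B, S → S S, S → A B, i.e. the slide's S → AC, C → SB up to naming.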

56

Page 57: Language of a Grammar

CFG [57]

Chomsky Normal Form

Example: For the grammar

S → aSb | SS | ab

we get first

S → ASB | SS | AB   A → a   B → b

and then

S → AC | SS | AB   A → a   B → b   C → SB

which is in Chomsky Normal Form

57

Page 58: Language of a Grammar

CFG [58]

The Chomsky Hierarchy

Noam Chomsky 1956

Four types of grammars

Type 0: no restrictions

Type 1: Context-sensitive, rules αAβ → αγβ with γ nonempty

Type 2: Context-free or context-insensitive

Type 3: Regular, rules of the form A → aB or A → ε (or, uniformly, of the form A → Ba or A → ε)

Type 3 ⊆ Type 2 ⊆ Type 1 ⊆ Type 0

Grammars for programming languages are usually Type 2

58

Page 59: Language of a Grammar

CFG [59]

Context-Free Languages and Regular Languages

Theorem: If L is regular then L is context-free.

Proof: We know L = L(A) for a DFA A. We have A = (Q, Σ, δ, q0, F). We define a CFG G = (Q, Σ, P, q0) where P is the set of productions q → aq′ if δ(q, a) = q′, and q → ε if q ∈ F. We have then q ⇒∗ uq′ iff δ̂(q, u) = q′, and q ⇒∗ u iff δ̂(q, u) ∈ F. In particular u ∈ L(G) iff u ∈ L(A).

A grammar where all productions are of the form A → aB or A → ε is called right regular
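A minimal Haskell sketch of this construction (not from the slides), with the DFA given by its transition list and its accepting states; the types below are assumptions of the sketch.

type State = Char
type Prod  = (State, String)   -- q → w, where w is a terminal followed by a state, or ε

-- productions q → a q' for every transition δ(q, a) = q', and q → ε for every accepting q
dfaToCFG :: [((State, Char), State)] -> [State] -> [Prod]
dfaToCFG delta finals =
  [ (q, [a, q']) | ((q, a), q') <- delta ] ++ [ (q, "") | q <- finals ]

-- Applied to the DFA that the grammar of page 2 appears to come from
-- (transitions A-1->A, A-0->B, B-0->B, B-1->C, C-0->B, C-1->C, accepting state C):
-- dfaToCFG [(('A','1'),'A'),(('A','0'),'B'),(('B','0'),'B'),(('B','1'),'C'),
--           (('C','0'),'B'),(('C','1'),'C')] "C"
-- yields the productions A → 1A | 0B, B → 0B | 1C, C → 0B | 1C | ε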

59

Page 60: Language of a Grammar

CFG [60]

Pumping Lemma for Right Regular Languages

Let G = (V, T, P, S) be a right regular grammar, and let N be |V|.

If a1 . . . an is a string of length n ≥ N, any derivation

S ⇒ a1B1 ⇒ a1a2B2 ⇒ · · · ⇒ a1 . . . aiA

⇒ · · · ⇒ a1 . . . ajA ⇒ · · · ⇒ a1 . . . an

has at least n steps, and there is at least one variable A which is used twice (pigeonhole principle)

If x = a1 . . . ai and y = ai+1 . . . aj and z = aj+1 . . . an we have |xy| ≤ N and x y^k z ∈ L(G) for all k

60

Page 61: Language of a Grammar

CFG [61]

Pumping Lemma for Context-Free Languages

Let L be a context-free language

Theorem: There exists N such that if z ∈ L and N ≤ |z| then one can write z = uvwxy such that

z = uvwxy, |vx| > 0, |vwx| ≤ N, u v^k w x^k y ∈ L for all k

61

Page 62: Language of a Grammar

CFG [62]

Pumping Lemma for Context-Free Languages

Theorem: The language {a^k b^k c^k | k > 0} is not context-free

Proof: Assume L to be context-free. Then we have N as stated in the Pumping Lemma. Consider z = a^N b^N c^N. We have N ≤ |z| so we can write z = uvwxy such that

z = uvwxy, |vx| > 0, |vwx| ≤ N, u v^k w x^k y ∈ L for all k

Since |vwx| ≤ N there is one letter d ∈ {a, b, c} that does not occur in vwx, and since |vx| > 0 there is another letter e ≠ d that occurs in vx. Then e has more occurrences than d in u v^2 w x^2 y, and this contradicts u v^2 w x^2 y ∈ L. Q.E.D.

62

Page 63: Language of a Grammar

CFG [63]

Proof of the CFL Pumping Lemma

We can assume that the language is presented by a grammar in Chomsky Normal Form, working with L − {ε}

The crucial remark is that a binary tree of height p + 1 has at most 2^p leaves

The height of a binary tree is the number of nodes on the longest path from the root to a leaf

63

Page 64: Language of a Grammar

CFG [64]

Proof of the CFL Pumping Lemma

Example: for the grammar in Chomsky Normal Form

S → AC | AB, A → a, B → b, C → SB

consider a parse tree for a^4 b^4 corresponding to the derivation

S ⇒ AC ⇒ aC ⇒ aSB ⇒ aACB ⇒ aaCB ⇒ aaSBB

⇒ a^2 ACBB ⇒ a^3 CBB ⇒ a^3 SBBB ⇒ a^3 ABBBB ⇒ a^4 BBBB

⇒ a^4 bBBB ⇒ a^4 b^2 BB ⇒ a^4 b^3 B ⇒ a^4 b^4

The symbol S appears twice on a path: u = aa, v = a, w = ab, x = b, y = bb

64

Page 65: Language of a Grammar

CFG [65]

Non closure under intersection

T = {a, b, c}

L1 = {a^k b^k c^m | k, m > 0}

L2 = {a^m b^k c^k | k, m > 0}

L1 and L2 are CFL, but the intersection

L1 ∩ L2 = {a^k b^k c^k | k > 0}

is not CF

65

Page 66: Language of a Grammar

CFG [66]

Non closure under intersection

However one can show (we will not do the proof in this course, but you should know the result)

Theorem: If L1 ⊆ Σ∗ is context-free and L2 ⊆ Σ∗ is regular then L1 ∩ L2 is context-free

Application: The following language, for Σ = {0, 1}

L = {uu | u ∈ Σ∗}

is not context-free, by considering the intersection with L(0∗1∗0∗1∗)

One can show that the complement of L is context-free!

66

Page 67: Language of a Grammar

CFG [67]

Closure under union

If L1 = L(G1) and L2 = L(G2), with disjoint sets of variables V1 and V2 and the same alphabet T, we can define

G = (V1 ∪ V2 ∪ {S}, T, P1 ∪ P2 ∪ {S → S1 | S2}, S)

It is then direct to show that L(G) = L(G1) ∪ L(G2) since a derivation has the form

S ⇒ S1 ⇒∗ u

or

S ⇒ S2 ⇒∗ u

67

Page 68: Language of a Grammar

CFG [68]

Non-Closure Under Complement

L1 ∩ L2 = (L1ᶜ ∪ L2ᶜ)ᶜ, where Lᶜ denotes the complement of L

So CFL cannot be closed under complement in general. Otherwise they would be closed under intersection.

68

Page 69: Language of a Grammar

CFG [69]

Closure Under Concatenation

If L1 = L(G1) and L2 = L(G2), with disjoint sets of variables V1 and V2 and the same alphabet T, we can define

G = (V1 ∪ V2 ∪ {S}, T, P1 ∪ P2 ∪ {S → S1S2}, S)

It is then direct to show that L(G) = L(G1)L(G2) since a derivation has the form

S ⇒ S1S2 ⇒∗ u1u2

with S1 ⇒∗ u1, S2 ⇒∗ u2

69

Page 70: Language of a Grammar

CFG [70]

LL(1) parsing

A grammar is LL(1) if, in a sequence of leftmost productions, we can decide which production to use by looking only at the first symbol of the string to be parsed

For instance S → +SS | a | b is LL(1)

A regular grammar S → aA, A → bA | ε, . . . is LL(1) iff it corresponds to a deterministic FA

There are algorithms to decide if a grammar is LL(1) (not done in this course)

Any LL(1) grammar is unambiguous (because, by definition, there is at most one leftmost derivation for any string)
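For the LL(1) grammar S → +SS | a | b above, the parsing decision can be made by pattern matching on the next character; here is a minimal recursive-descent sketch in Haskell (not from the slides).

data S = SPlus S S | SA | SB
  deriving Show

-- parse one S, returning the rest of the input; the first character selects the production
parseS :: String -> Maybe (S, String)
parseS ('+':rest) = do
  (s1, r1) <- parseS rest
  (s2, r2) <- parseS r1
  return (SPlus s1 s2, r2)
parseS ('a':rest) = Just (SA, rest)
parseS ('b':rest) = Just (SB, rest)
parseS _          = Nothing

-- a word is accepted iff one S is parsed and the input is fully consumed
accepts :: String -> Bool
accepts w = case parseS w of
              Just (_, "") -> True
              _            -> False
-- accepts "+a+bb" == True,  accepts "+a" == False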

70

Page 71: Language of a Grammar

CFG [71]

Grammar transformations

The grammar

S → AB, A → aA | a, B → bB | c

is equivalent to the grammar

S → aAB, A → aA | ε, B → bB | c

71

Page 72: Language of a Grammar

CFG [72]

Grammar transformations

The grammar

S → Bb B → Sa | a

which is not LL(1) is equivalent to the grammar

S → abT T → abT | ε

which is LL(1)

72

Page 73: Language of a Grammar

CFG [73]

Grammar transformations

The grammar

A→ Aa | b

is equivalent to the grammar

A→ bB, B → aB | ε

In general, however, there is no algorithm to decide L(G1) = L(G2)

For regular expressions, we have an algorithm to decide L(E1) = L(E2)

73

Page 74: Language of a Grammar

CFG [74]

The CYK Algorithm

We now present an algorithm to decide if w ∈ L(G), assuming G to be in Chomsky Normal Form.

This is an example of the technique of dynamic programming

Let n be |w|. The naive algorithm (trying all derivations of length < 2n) may be exponential. This technique gives an O(n^3) algorithm!!

74

Page 75: Language of a Grammar

CFG [75]

dynamic programming

fib 0 = fib 1 = 1

fib (n+2) = fib n + fib (n+1)

fib 5 calls fib 4 and fib 3, and fib 4 calls fib 3 again

So in a top-down computation there is duplication of work (if one does not use memoization)
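A minimal Haskell sketch of the bottom-up computation (not from the slides): the list of Fibonacci numbers is built once, so each value is computed only once.

-- lazily built list of all Fibonacci numbers: 1, 1, 2, 3, 5, 8, ...
fibs :: [Integer]
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)

fib :: Int -> Integer
fib n = fibs !! n

-- map fib [0..5] == [1,1,2,3,5,8], matching the values on the next page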

75

Page 76: Language of a Grammar

CFG [76]

dynamic programming

For a bottom-up computation

fib 2 = 2, fib 3 = 3, fib 4 = 5, fib 5 = 8

What is going on in the CYK algorithm or Earley algorithm is similar

S → AB | BC, A→ BA | a, B → CC | b, C → AB | a

bab ∈ L(G)?? and aba ∈ L(G)?

76

Page 77: Language of a Grammar

CFG [77]

dynamic programming

The idea is to represent bab as the collection of the facts b(0, 1), a(1, 2), b(2, 3)

We compute then the facts X(i, k) for i < k by induction on k − i

Only one rule:

If we have a production C → AB and A in X(i, j) and B in X(j, k) then Cis in X(i, k)
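A minimal Haskell sketch of this computation (not from the slides), for a grammar in Chomsky Normal Form given by its binary and terminal productions; the facts X(i, k) are stored in a map indexed by (i, k) and filled in by increasing k − i.

import qualified Data.Map as M
import Data.List (nub)

type Var = Char

-- a CNF grammar: binary productions C → AB and terminal productions A → a
data CNF = CNF { binaryP :: [(Var, (Var, Var))], termP :: [(Var, Char)] }

-- cyk g w computes every fact X(i, k): the variables deriving w_i ... w_{k-1}
cyk :: CNF -> String -> M.Map (Int, Int) [Var]
cyk g w = foldl addSpan base [2 .. n]
  where
    n    = length w
    base = M.fromList [ ((i, i + 1), nub [ a | (a, c) <- termP g, c == w !! i ])
                      | i <- [0 .. n - 1] ]
    addSpan table len = foldl add table [ (i, i + len) | i <- [0 .. n - len] ]
      where
        add t (i, k) = M.insert (i, k) (nub
          [ c | j <- [i + 1 .. k - 1]
              , (c, (a, b)) <- binaryP g
              , a `elem` M.findWithDefault [] (i, j) t
              , b `elem` M.findWithDefault [] (j, k) t ]) t

member :: CNF -> Var -> String -> Bool
member g s w = s `elem` M.findWithDefault [] (0, length w) (cyk g w)

-- the grammar of page 76:  S → AB | BC,  A → BA | a,  B → CC | b,  C → AB | a
g76 :: CNF
g76 = CNF [ ('S',('A','B')), ('S',('B','C')), ('A',('B','A'))
          , ('B',('C','C')), ('C',('A','B')) ]
          [ ('A','a'), ('B','b'), ('C','a') ]
-- member g76 'S' "bab" == True,  member g76 'S' "aba" == False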

77

Page 78: Language of a Grammar

CFG [78]

The CYK Algorithm

The algorithm is best understood in terms of production systems

Example: the grammar

S → AB | BA | SS | AC | BD

A→ a, B → b, C → SB, D → SA

becomes the production system

78

Page 79: Language of a Grammar

CFG [79]

The CYK Algorithm

A(x, y), B(y, z) → S(x, z), B(x, y), A(y, z) → S(x, z)

S(x, y), S(y, z) → S(x, z), A(x, y), C(y, z) → S(x, z)

B(x, y), D(y, z) → S(x, z), S(x, y), B(y, z) → C(x, z)

S(x, y), A(y, z) → D(x, z), a(x, y) → A(x, y), b(x, y) → B(x, y)

79

Page 80: Language of a Grammar

CFG [80]

The CYK Algorithm

The problem whether one can derive S ⇒∗ aabbab is transformed into the problem: can one produce S(0, 6) in this production system given the facts

a(0, 1), a(1, 2), b(2, 3), b(3, 4), a(4, 5), b(5, 6)

80

Page 81: Language of a Grammar

CFG [81]

The CYK Algorithm

For this we apply a forward chaining/bottom up sequence of productions

A(0, 1), A(1, 2), B(2, 3), B(3, 4), A(4, 5), B(5, 6)

S(1, 3), S(3, 5), S(4, 6)

S(1, 5), C(1, 4), C(3, 6)

S(0, 4), . . .

S(0, 6)

81

Page 82: Language of a Grammar

CFG [82]

The CYK Algorithm

For instance the fact that C(3, 6) is produced corresponds to the derivation

C ⇒ SB ⇒ BAB ⇒ bAB ⇒ baB ⇒ bab

In this way, we get a solution in O(n^3)!

82

Page 83: Language of a Grammar

CFG [83]

Forward-chaining inference

This idea actually works for any grammar. For instance

S → SS | aSb | ε

is represented by the production system

→ S(x, x), S(x, y), S(y, z) → S(x, z)

a(x, y), S(y, z), b(z, t) → S(x, t)

and the problem to decide S ⇒∗ aabb is replaced by the problem to derive S(0, 4) from the facts

a(0, 1), a(1, 2), b(2, 3), b(3, 4)

83

Page 84: Language of a Grammar

CFG [84]

Forward-chaining inference

This is the main idea behind the Earley algorithm

Mainly used for parsing in computational linguistics

Earley parsers are interesting because they can parse all context-free languages

84

Page 85: Language of a Grammar

CFG [85]

Complement of a CFL

We have seen that CFLs are not closed under intersection, but are closed under union

It follows that they are not closed under complement

Here is an explicit example: one can show that the complement of

{a^n b^n c^n | n > 0}

is a CFL

85

Page 86: Language of a Grammar

CFG [86]

Undecidable Problems

We have given algorithms to decide L(G) ≠ ∅ and w ∈ L(G). What is surprising is that it can be shown that there are no algorithms for the following problems

Given G1 and G2, do we have L(G1) ⊆ L(G2)? Do we have L(G1) = L(G2)?

Given G and a regular expression R, do we have L(G) = L(R)? L(R) ⊆ L(G)?

Do we have L(G) = T∗, where T is the alphabet of G? (Compare to the case of regular languages)

Given G, is G ambiguous?

86

Page 87: Language of a Grammar

CFG [87]

Undecidable Problems

One shows this by reducing the Post Correspondence Problem to these problems

Given u1, . . . , un and v1, . . . , vn in {0, 1}∗, is it possible to find indices i1, . . . , ik (k ≥ 1) such that

u_i1 . . . u_ik = v_i1 . . . v_ik

Example: 1, 10, 011 and 101, 00, 11

Challenge example: 001, 01, 01, 10 and 0, 011, 101, 001

87

Page 88: Language of a Grammar

CFG [88]

Haskell Program

isPrefix [] ys = True
isPrefix (x:xs) (y:ys) = x == y && isPrefix xs ys
isPrefix xs ys = False

isComp (xs,ys) = isPrefix xs ys || isPrefix ys xs

exists p [] = False
exists p (x:xs) = p x || exists p xs

exhibit p (x:xs) = if p x then x else exhibit p xs

88

Page 89: Language of a Grammar

CFG [89]

Haskell Program

addNum k [] = []
addNum k (x:xs) = (k,x):(addNum (k+1) xs)

nextStep xs ys =
  concat (map (\ (n,(s,t)) ->
                 map (\ (ns,(u,v)) -> (ns ++ [n], (u ++ s, v ++ t))) ys)
              xs)

89

Page 90: Language of a Grammar

CFG [90]

Haskell Program

mainLoop xs ys =
  let bs = filter (isComp . snd) ys
      prop (_,(u,v)) = u == v
  in if exists prop bs then exhibit prop bs
     else if bs == [] then error "NO SOLUTION"
     else mainLoop xs (nextStep xs bs)

90

Page 91: Language of a Grammar

CFG [91]

Haskell Program

post xs =
  let as = addNum 1 xs
  in mainLoop as (map (\ (n,z) -> ([n],z)) as)

xs1 = [("1","101"),("10","00"),("011","11")]

xs2 = [("001","0"),("01","011"),("01","101"),("10","001")]

91

Page 92: Language of a Grammar

CFG [92]

Haskell Program

Main> post xs1
([1,3,2,3],("101110011","101110011"))

Main> post xs2

ERROR - Garbage collection fails to reclaim sufficient space
[2,2,2,3,2,2,2,3,3,4,4,6,8,8,15,21,15,17,18,24,15,12,12,18,18,24,24,45,63,66,84,91,140,182,201,346,418,324,330,321,423,459,780

92

Page 93: Language of a Grammar

CFG [93]

Post Correspondence Problem and CFL

To the sequence u1, . . . , un we associate the following grammar GA

The alphabet is {0, 1, a1, . . . , an}

The productions are

A→ u1a1 | . . . | unan | u1Aa1 | . . . | unAan

This grammar is unambiguous

93

Page 94: Language of a Grammar

CFG [94]

Post Correspondence Problem and CFL

To the sequence v1, . . . , vn we associate the following grammar GB

The alphabet is the same {0, 1, a1, . . . , an}

The productions are

B → v1a1 | . . . | vnan | v1Ba1 | . . . | vnBan

This grammar is unambiguous

94

Page 95: Language of a Grammar

CFG [95]

Post Correspondence Problem and CFL

Theorem: We have L(GA) ∩ L(GB) ≠ ∅ iff the Post Correspondence Problem for u1, . . . , un and v1, . . . , vn has a solution

95

Page 96: Language of a Grammar

CFG [96]

Post Correspondence Problem and CFL

Finally we have the grammar G with productions

S → A | B

Theorem: The grammar G is ambiguous iff the Post Correspondence Problem for u1, . . . , un and v1, . . . , vn has a solution

96

Page 97: Language of a Grammar

CFG [97]

Post Correspondence Problem and CFL

The complement of L(GA) is CF

We see this on one example u1 = 0, u2 = 10

The complement of L(GB) is CF

Hence we have a grammar GC for the union of the complement of L(GA) and the complement of L(GB)

97

Page 98: Language of a Grammar

CFG [98]

Post Correspondence Problem and CFL

Theorem: We have L(GC) = T∗ iff L(GA) ∩ L(GB) = ∅

Hence the problems, for a regular expression E and a CFG G,

L(E) = L(G)

L(E) ⊆ L(G)

are in general undecidable

98