Source: polaris.cs.uiuc.edu/~padua/cs426/cs426-5.pdf
CS426 Compiler Construction
Fall 2006
Solving systems of regular expression equations
Consider the following DFSM
For each transition from state Q to state R under input a, we create a production of the form Q → aR, and for each final state P, a production of the form P → ε. Each state then yields one equation: the state equals the sum (+) of the right hand sides of its productions.
So, we have the equations
A1 = 1A1 + 0A2
A2 = 0A3 + 1A1
A3 = 0A3 + 1A1 + ε
[Figure: DFSM with states A1, A2, and A3 and transitions labeled 0 and 1.]
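These equations can be solved by substitution, using the rule that an equation X = αX + β has the solution X = α*β:
First solve for A3
A3 = 0*(1A1 + ε) = 0*1A1 + 0*
Then replace A3 in the equation for A2
A2 = 0(0*1A1 + 0*) + 1A1 = 00*1A1 + 00* + 1A1
Finally, replacing A2 in the equation for A1, we have
A1 = 1A1 + 000*1A1 + 000* + 01A1
   = (1 + 000*1 + 01)A1 + 000*
   = (1 + 000*1 + 01)*000*
the obvious answer: the strings that end in 00.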
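The solution can be checked mechanically. The sketch below is an illustration, not part of the original slides: it reads the transition table off the three equations, assumes that A1 is the start state (the slides solve for A1) and that A3 is the only final state (only A3 has an ε term), and compares the machine with the derived regular expression on every binary string up to a chosen length. The names DELTA, dfsm_accepts, and first_disagreement are hypothetical.

```python
import re
from itertools import product

# Transitions read off the equations: A1 = 1A1 + 0A2, A2 = 0A3 + 1A1,
# A3 = 0A3 + 1A1 + ε (the ε term marks A3 as final).
DELTA = {
    ("A1", "1"): "A1", ("A1", "0"): "A2",
    ("A2", "0"): "A3", ("A2", "1"): "A1",
    ("A3", "0"): "A3", ("A3", "1"): "A1",
}
FINAL = {"A3"}

# The derived solution (1 + 000*1 + 01)*000* in Python regex syntax.
SOLUTION = re.compile(r"(1|000*1|01)*000*")


def dfsm_accepts(word, start="A1"):
    """Run the DFSM on `word`; the start state A1 is an assumption."""
    state = start
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINAL


def first_disagreement(max_len=10):
    """Return the first string on which machine and regex differ, else None."""
    for length in range(max_len + 1):
        for letters in product("01", repeat=length):
            word = "".join(letters)
            if dfsm_accepts(word) != bool(SOLUTION.fullmatch(word)):
                return word
    return None


if __name__ == "__main__":
    print(first_disagreement())
```

An exhaustive comparison up to a bounded length is not a proof of equivalence, but it catches algebra slips in the substitution steps quickly.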
Parsing
• In our compiler model, the parser accepts tokens and verifies that the string can be generated by the grammar.
• Three types of parsers
-- tabular
-- top-down
-- bottom up
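[Figure: the source program feeds the lexical analyzer, which hands a token to the parser on each "get next token" request; the parser produces a parse tree for the rest of the front end. Both the lexical analyzer and the parser use the symbol table.]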
Backtrack parsing
• Backtrack parsing simulates a nondeterministic machine.
• There are both top-down and bottom up forms of backtrack parsing.
• Both forms may take an exponential amount of time to parse.
• Let us only discuss the top down version.
• The name top-down parsing comes from the idea that we attempt to produce a parse tree for the input string starting from the top (root) and working down to the leaves.
• For top-down backtrack parsing, we begin with a tree containing one node labeled S, the start symbol of the grammar.
• We assume that the alternate productions for every nonterminal have been ordered. That is, if A → α1 | α2 | ... | αn are all the A productions in the grammar, we assign some ordering to the αi’s (the alternates for A).
• Recursively proceed as follows
-- If the active node is labeled by a nonterminal, say A, then choose the first alternate, say X1 ... Xk, for A and create k direct descendants for A labeled X1, X2, ..., Xk. Make X1 the active node. If k = 0, then make the node immediately to the right of A active.
-- If the active node is labeled by a terminal, say a, then compare the current input symbol with a. If they match, then make active the node immediately to the right of a and move the input pointer one symbol to the right. If a does not match the current input symbol, go back to the node where the previous production was applied, adjust the input pointer if necessary, and try the next alternate. If no alternate is possible, go back to the next previous node, and so forth.
• Consider the grammar S → aSbS | aS | c and the input string aacbc. The sequence of parse trees would be as follows:
[Figure: sequence of partial parse trees. At each S the parser tries the alternates in order (aSbS, then aS, then c); trees whose next leaf does not match the input are marked "no match, backtrack". The final tree applies S → aSbS at the root, S → aS followed by S → c to the first S below the root, and S → c to the last S.]
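The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the course's implementation: instead of building tree nodes it keeps the list of symbols still to be matched (the leaves of the tree to the right of the active node), and the names GRAMMAR and backtrack_parse are hypothetical. Returning None from a recursive call plays the role of "go back and try the next alternate", and the input pointer is restored automatically because pos is passed by value.

```python
# Ordered alternates for each nonterminal: S -> aSbS | aS | c
GRAMMAR = {"S": [["a", "S", "b", "S"], ["a", "S"], ["c"]]}


def backtrack_parse(symbols, text, pos=0, grammar=GRAMMAR, used=()):
    """Try to match `symbols` against text[pos:].

    Returns the tuple of productions applied (in order) if the whole input
    is consumed, or None if every choice of alternates fails.
    Note: a left-recursive grammar would make this recurse forever.
    """
    if not symbols:
        return used if pos == len(text) else None
    first, rest = symbols[0], symbols[1:]
    if first in grammar:                       # active node is a nonterminal
        for alternate in grammar[first]:       # try alternates in order
            result = backtrack_parse(alternate + rest, text, pos, grammar,
                                     used + ((first, "".join(alternate)),))
            if result is not None:
                return result
        return None                            # no alternate works: backtrack
    if pos < len(text) and text[pos] == first:  # active node is a terminal
        return backtrack_parse(rest, text, pos + 1, grammar, used)
    return None                                # mismatch: backtrack


if __name__ == "__main__":
    for nonterminal, alternate in backtrack_parse(["S"], "aacbc"):
        print(nonterminal, "->", alternate)
```

On aacbc the successful sequence of productions matches the final tree of the example: aSbS at the root, then aS, then c twice.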
The Cocke-Younger-Kasami Algorithm
• Tabular methods include: Cocke-Younger-Kasami algorithm and Earley’s algorithm.
• These are essentially dynamic programming methods and are discussed mainly because of their simplicity.
• However, they are highly inefficient (n^3 time and n^2 space for an input of length n). They are the parsing equivalent of “bubble sort”.
• Definition: A Context Free Grammar (N, Σ, P, S) is said to be in Chomsky normal form (CNF) if each production in P is of one of the forms
(1) A → BC with A, B, C in N,
(2) A → a with a in Σ, or
(3) if ε is in L(G), then S → ε is a production and S does not appear on the right hand side of any production.
• Theorem: Let L be a context free language (i.e. a language that can be generated by a context free grammar). Then L = L(G) for some CFG G in Chomsky Normal Form.
• Example: The CFG, G, defined by
S → aAB | BA
A → BBB | a
B → AS | b
can be converted to the equivalent CNF grammar, G’
S → a’ <AB> | BA
A → B <BB> | a
B → AS | b
<AB> → AB
<BB> → BB
a’ → a
where G=(N,{a,b},P,S) and G’=(N’,{a,b},P’,S). Here P and P’ correspond to the two sets of productions above and N’ ={S, A, B, <AB>, <BB>, a’}.
• Let w = a_1 a_2 ... a_n be an input string which is to be parsed according to G. We assume that each a_i is in Σ for 1 ≤ i ≤ n. The essence of the algorithm is the construction of a parse table T, whose elements we denote t_ij for 1 ≤ i ≤ n and 1 ≤ j ≤ n-i+1. Each t_ij will have a value which is a subset of N. A will be in t_ij iff A ⇒+ a_i a_(i+1) ... a_(i+j-1), that is, if A derives the j input symbols beginning at position i. As a special case, the input string w is in L(G) iff S is in t_1n.
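The table construction can be sketched as follows. This is a minimal illustration, assuming the standard dynamic-programming fill order (substrings of length 1 first, then longer ones); it uses the CNF grammar G’ from the example and the 1-based indexing of t_ij from the definition above. The names TERMINAL_RULES, PAIR_RULES, and cyk are hypothetical.

```python
# CNF grammar G' from the example, split by production shape.
TERMINAL_RULES = [("A", "a"), ("B", "b"), ("a'", "a")]          # A -> a
PAIR_RULES = [                                                  # A -> B C
    ("S", "a'", "<AB>"), ("S", "B", "A"),
    ("A", "B", "<BB>"), ("B", "A", "S"),
    ("<AB>", "A", "B"), ("<BB>", "B", "B"),
]


def cyk(word, start="S"):
    """Return True iff `start` derives `word` under the CNF grammar above."""
    n = len(word)
    if n == 0:
        return False  # G' has no production S -> ε
    # t[i][j] = set of nonterminals deriving the j symbols starting at
    # position i (1-based, as in the definition of t_ij).
    t = [[set() for _ in range(n + 2)] for _ in range(n + 2)]
    for i in range(1, n + 1):
        t[i][1] = {lhs for lhs, a in TERMINAL_RULES if a == word[i - 1]}
    for j in range(2, n + 1):                 # length of the substring
        for i in range(1, n - j + 2):         # start position
            for k in range(1, j):             # length of the left part
                for lhs, left, right in PAIR_RULES:
                    if left in t[i][k] and right in t[i + k][j - k]:
                        t[i][j].add(lhs)
    return start in t[1][n]


if __name__ == "__main__":
    for w in ["aab", "ba", "ab", "abbbb"]:
        print(w, cyk(w))
```

The three nested loops over j, i, and k give the n^3 running time, and the table itself accounts for the n^2 space.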
Top Down parsing.
A non-backtracking form of top-down parser is called a predictive parser (section 2.4 of the textbook).
The main idea is that at each point it is possible to know by looking at the next input token which production should be applied. For most programming languages, the fact that each type of statement is initiated by a different keyword (e.g. for, while, if) facilitates the development of top-down predictive parsers.
• Consider the following grammar
type → simple | * id | array [ simple ] of type
simple → integer | char | num : num
• Here the tokens are in bold.
• Starting from type, it is always clear which production to apply.
-- Thus, if the next token is integer, char, or num, the production type → simple should be applied.
-- If it is *, then the production type → * id should be applied.
-- Finally, if it is array, type → array [ simple ] of type should be applied.
• Clearly, no need for backtracking ever arises.
• Predictive parsing relies on the information about what first symbols can be generated by the right hand side of a production.
• Thus,
-- FIRST(simple) = {integer, char, num}
-- FIRST(* id) = { * }
-- FIRST(array [ simple ] of type) = {array}
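The FIRST sets translate directly into the branch conditions of a predictive parser. The sketch below is an illustration for the type grammar only; it assumes the input has already been split into token strings, and the names ParseError and accepts_type are hypothetical.

```python
class ParseError(Exception):
    """Raised when the next token fits no production."""


def accepts_type(tokens):
    """Return True iff `tokens` is derivable from `type`; no backtracking."""
    pos = 0

    def lookahead():
        return tokens[pos] if pos < len(tokens) else None

    def match(expected):
        nonlocal pos
        if lookahead() != expected:
            raise ParseError(f"expected {expected!r}, found {lookahead()!r}")
        pos += 1

    def parse_simple():
        # simple -> integer | char | num : num
        if lookahead() == "integer":
            match("integer")
        elif lookahead() == "char":
            match("char")
        elif lookahead() == "num":
            match("num"); match(":"); match("num")
        else:
            raise ParseError(f"unexpected token {lookahead()!r}")

    def parse_type():
        # The lookahead token selects the production via the FIRST sets.
        if lookahead() in ("integer", "char", "num"):   # FIRST(simple)
            parse_simple()
        elif lookahead() == "*":                        # FIRST(* id)
            match("*"); match("id")
        elif lookahead() == "array":                    # FIRST(array ...)
            match("array"); match("["); parse_simple(); match("]")
            match("of"); parse_type()
        else:
            raise ParseError(f"unexpected token {lookahead()!r}")

    try:
        parse_type()
    except ParseError:
        return False
    return pos == len(tokens)   # all input must be consumed


if __name__ == "__main__":
    print(accepts_type("array [ num : num ] of integer".split()))
```

Because the three FIRST sets are disjoint, each call inspects one token and commits to a single production.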
Problems with predictive parsing.
Consider the grammar
S → S a
S → b
By just looking at the next token it is impossible to determine how many times the first production should be applied.
For example, if the input is baaaaa then the first production must be applied five times before the second one is.
Grammars like this are said to be left recursive.
Left recursion can be eliminated by changing the grammar. In this case to:
S → b S’
S’ → a S’ | ε
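The transformation generalizes: productions A → Aα | β become A → β A’ and A’ → α A’ | ε. A minimal sketch, assuming only immediate left recursion (indirect left recursion through other nonterminals is not handled) and representing each alternate as a tuple of symbols, with the empty tuple standing for ε; the function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nonterminal, alternates):
    """Rewrite A -> A α | β as A -> β A' and A' -> α A' | ε.

    `alternates` is a list of tuples of symbols; () stands for ε.
    Returns a dict mapping each nonterminal to its new alternates.
    """
    fresh = nonterminal + "'"
    # α parts: alternates that start with the nonterminal itself.
    recursive = [alt[1:] for alt in alternates if alt and alt[0] == nonterminal]
    # β parts: all other alternates.
    others = [alt for alt in alternates if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: list(alternates)}   # nothing to do
    return {
        nonterminal: [beta + (fresh,) for beta in others],
        fresh: [alpha + (fresh,) for alpha in recursive] + [()],
    }


if __name__ == "__main__":
    print(eliminate_immediate_left_recursion("S", [("S", "a"), ("b",)]))
```

Applied to S → S a | b, it produces the two productions shown above.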
A second problem is that it may not be possible to determine by just looking at the next token which production to apply.
Consider the following grammar:
S → i E t S | i E t S e S | a
E → b
By just looking at the next token it is impossible to determine whether the first or the second production should be applied, since both begin with i.
Again, changing the grammar can solve this problem:
S → i E t S S’ | a
S’ → e S | ε
E → b
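This rewriting is usually called left factoring: alternates that share a prefix are merged, and the differing tails move to a new nonterminal. The sketch below is a simplified illustration, assuming one pass that groups alternates by their first symbol and factors out the longest prefix common to each group; alternates are tuples of symbols with () standing for ε, and the names common_prefix and left_factor are hypothetical.

```python
def common_prefix(sequences):
    """Longest prefix shared by all the given tuples of symbols."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor(nonterminal, alternates):
    """One pass of left factoring for a single nonterminal."""
    groups = {}
    for alt in alternates:
        key = alt[0] if alt else None          # group by first symbol
        groups.setdefault(key, []).append(alt)

    rules = {nonterminal: []}
    primes = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            rules[nonterminal].extend(group)   # nothing to factor
            continue
        prefix = common_prefix(group)
        primes += 1
        fresh = nonterminal + "'" * primes     # S', S'', ...
        rules[nonterminal].append(prefix + (fresh,))
        rules[fresh] = [alt[len(prefix):] for alt in group]
    return rules


if __name__ == "__main__":
    s_alternates = [("i", "E", "t", "S"),
                    ("i", "E", "t", "S", "e", "S"),
                    ("a",)]
    print(left_factor("S", s_alternates))
```

For the grammar above this yields S → i E t S S’ | a and S’ → ε | e S, the same productions as on the slide up to the order of the alternates.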
Example
Consider the grammar:
A → id = E
E → E + T
E → E - T
E → T
E → -T
T → T * F
T → T / F
T → F
F → (E)
F → id
F → number
After eliminating the left recursion, we obtain
A → id = E
E → T E’
E → -T E’
E’ → -T E’
E’ → +T E’
E’ → ε
T → F T’
T’ → * F T’
T’ → / F T’
T’ → ε
F → (E)
F → id
F → number
From this grammar we can create the recursive descent parser.
procedure A() {                    // A → id = E
    match(id); t = idnum;
    match("=");
    E();
    emit("fstore"); emitln(t);
}

procedure E() {                    // E → -T E' | T E'
    if (lookahead == "-") {
        match("-"); T(); emitln("fneg");
    } else {
        T();
    }
    E'();
}

procedure E'() {                   // E' → -T E' | +T E' | ε
    if (lookahead == "-") {
        match("-"); T(); emitln("fsub"); E'();
    } else if (lookahead == "+") {
        match("+"); T(); emitln("fadd"); E'();
    }
}

procedure T() {                    // T → F T'
    F(); T'();
}

procedure T'() {                   // T' → * F T' | / F T' | ε
    if (lookahead == "*") {
        match("*"); F(); emitln("fmul"); T'();
    } else if (lookahead == "/") {
        match("/"); F(); emitln("fdiv"); T'();
    }
}

procedure F() {                    // F → number | (E) | id
    if (lookahead == num) {
        emit("ldc"); emitln(value); match(num);
    } else if (lookahead == "(") {
        match("("); E(); match(")");
    } else {
        match(id); emit("fload"); emitln(idnum);
    }
}
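For readers who want to run the parser, here is a Python sketch of the same procedures. It is an illustration under assumptions, not the course's code: the tokenizer is a hypothetical stand-in for the lexical analyzer, variables are referenced by name rather than by the idnum symbol-table index, emitted instructions are collected in a list instead of being printed, and the names tokenize, Parser, and compile_assignment are invented for the example.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Hypothetical lexer: yields (kind, value) pairs ending with eof."""
    tokens = []
    for number, ident, other in TOKEN.findall(text):
        if number:
            tokens.append(("num", number))
        elif ident:
            tokens.append(("id", ident))
        elif other.strip():
            tokens.append((other, other))      # operators and parentheses
    tokens.append(("eof", ""))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens, self.pos, self.code = tokenize(text), 0, []

    @property
    def lookahead(self):
        return self.tokens[self.pos][0]

    def match(self, kind):
        if self.lookahead != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.lookahead!r}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def A(self):                               # A -> id = E
        target = self.match("id")
        self.match("=")
        self.E()
        self.match("eof")
        self.code.append(f"fstore {target}")

    def E(self):                               # E -> -T E' | T E'
        if self.lookahead == "-":
            self.match("-"); self.T(); self.code.append("fneg")
        else:
            self.T()
        self.E_prime()

    def E_prime(self):                         # E' -> -T E' | +T E' | ε
        if self.lookahead in ("-", "+"):
            op = self.match(self.lookahead)
            self.T()
            self.code.append("fsub" if op == "-" else "fadd")
            self.E_prime()

    def T(self):                               # T -> F T'
        self.F(); self.T_prime()

    def T_prime(self):                         # T' -> * F T' | / F T' | ε
        if self.lookahead in ("*", "/"):
            op = self.match(self.lookahead)
            self.F()
            self.code.append("fmul" if op == "*" else "fdiv")
            self.T_prime()

    def F(self):                               # F -> number | (E) | id
        if self.lookahead == "num":
            self.code.append(f"ldc {self.match('num')}")
        elif self.lookahead == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.code.append(f"fload {self.match('id')}")


def compile_assignment(text):
    parser = Parser(text)
    parser.A()
    return parser.code


if __name__ == "__main__":
    print("\n".join(compile_assignment("x = -a + 2 * (b - 3)")))
```

The emitted sequence is postfix: operands are loaded first and each operator is emitted after its right operand has been parsed, which is why the emit calls follow the calls to T() and F().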