03-60-214 Lexical analysis S a, b B a - University of...

103
1 03-60-214 Lexical analysis S a, b a B

Transcript of 03-60-214 Lexical analysis S a, b B a - University of...

1

03-60-214Lexicalanalysis S

a, b

a B

2

Lexicalanalysisinperspec5ve

•  LEXICALANALYZER–  ScanInput–  RemoveWhiteSpace,NewLine,…–  Iden5fyTokens–  CreateSymbolTable–  InsertTokensintoSymbolTable–  GenerateErrors–  SendTokenstoParser

lexical analyzer parser

symbol table

source program

token

get next token

•  PARSER–  PerformSyntaxAnalysis

–  Ac5onsDictatedbyTokenOrder

–  UpdateSymbolTableEntries

–  CreateAbstractRepresenta5onofSource

–  GenerateErrors

•  LEXICALANALYZER:Transformscharacterstreamtotokenstream–  Alsocalledscanner,lexer,linearanalyzer

3

Whereweare

Total=price+tax;

Total = price + tax ;

Lexical analyzer

Parser

id+id

Expr

assignment

=id

Regular expression

4

Basicterminologiesinlexicalanalysis

• Token–  Aclassifica5onforacommonsetofstrings–  Examples:if,<identifier>, <number> …

• Pa[ern–  Theruleswhichcharacterizethesetofstringsforatoken–  RecallfileandOSwildcards(*.java)

• Lexeme–  Actualsequenceofcharactersthatmatchespa[ernandisclassifiedby

atoken–  Iden5fiers:if, price, 10.00,etc…

Regular expression

if (price + gst – rebate <= 10.00) gift := false

5

Examplesoftoken,lexemeandpa[ernif (price + gst – rebate <= 10.00) gift := false

Token lexeme Informaldescrip5onofpa[ern

if if If

Lparen ( (

Iden5fier price Stringconsistsofle[ersandnumbersandstartswithale[er

operator + +

iden5fier gst Stringconsistsofle[ersandnumbersandstartswithale[er

operator - -

iden5fier rebate Stringconsistsofle[ersandnumbersandstartswithale[er

operator <= Lessthanorequalto

number 10.00 Anynumericconstant

rparen ) )

iden5fier gid Stringconsistsofle[ersandnumbersandstartswithale[er

operator := Assignmentsymbol

iden5fier false Stringconsistsofle[ersandnumbersandstartswithale[er

Regular expression

6

Regularexpression•  Scannerisbasedonregularexpression.•  Rememberlanguageisasetofstrings.•  Examplesofregularexpression

–  Le[era a–  Keywordif if–  Allthele[ers a|b|c|...|z|A|B|C...|Z–  Allthedigits 0|1|2|3|4|5|6|7|8|9–  AlltheIden5fiers le[er(le[er|digit)*

•  Basicopera5ons:–  Setunion,e.g.,a|b–  Concatena5on,e.g,ab–  Kleeneclosure,e.g.,a*

Regular expression

7

Regularexpression

• Regularexpression:construc5ngsequencesofsymbols(strings)fromanalphabet.

• LetΣbeanalphabet,raregularexpressionthenL(r)isthelanguagethatischaracterizedbytherulesofr

• Defini5onofregularexpression–  εisaregularexpressionthatdenotesthelanguage{ε}

•  Notethatitisnot{}–  IfaisinΣ,aisaregularexpressionthatdenotes{a}–  LetrandsberegularexpressionswithlanguagesL(r)andL(s).Then

•  r|sisaregularexpressionàL(r)∪L(s)•  rsisaregularexpressionàL(r)L(s)•  r*isaregularexpressionà(L(r))*

•  Itisaninduc5vedefini5on!• Dis5nc5onbetweenregularlanguageandregularexpression

Regular expression

8

Formallanguageopera5ons

Opera5on Nota5on Defini5on ExampleL={a,b}M={0,1}

unionofLandM L∪M

L∪M={s|sisinLorsisinM}

{a,b,0,1}

concatena)onofLandM

LM

LM={st|sisinLandtisinM}

{a0,a1,b0,b1}

KleeneclosureofL

L*

L*denoteszeroormoreconcatena5onsofL

Allthestringsconsistsof“a”and“b”,plustheemptystring.{ε, a,aa,bb,ab,ba,aaa,…}

posi)veclosure L+

L+denotes“oneormoreconcatena5onsof“L

Allthestringsconsistsof“a”and“b”.

Regular expression

9

Regularexpressionexamplerevisited

• Examplesofregularexpression–  le[eràa|b|c|...|z|A|B|C...|Z–  digità0|1|2|3|4|5|6|7|8|9–  Iden5fieràle[er(le[er|digit)*

• Exercise:whyisitaregularexpression?

Regular expression

10

Precedenceofoperators

•  Can the following RE be simplified? (a) | ((b)*(c))

•  *isofthehighestprecedence;•  ts(Concatena5on)comesnext;•  |lowest.

•  Example–  (a) | ((b)*(c))isequivalenttoa|b*c

Regular expression

11

Proper5esofregularexpressions

Property Descrip5on

r|s=s|r |iscommuta5ve

r|(s|t)=(r|s)|t |isassocia5ve

(rs)t=r(st) Concatena5onisassocia5ve

r(s|t)=rs|rt(s|t)r=sr|tr

Concatena5ondistributesover|

......

What is why we can write

•  either a|b or b|a

•  a|b|c, or (a|b)|c

•  abc, or (ab)c

Regular expression

12

Nota5onalshorthandofregularexpression

•  Oneormoreinstance–  L+=LL*–  L*=L+|ε –  Example

•  digitsà digit digit* •  digitsàdigit+

•  Zerooroneinstance–  L?=L|ε–  Example:

•  Op5onal_frac5onà.digits|ε •  op5onal_frac5onà(.digits)?

•  Characterclasses–  [abc]=a|b|c–  [a-z]=a|b|c...|z

Regular expression

REexample

• Stringsoflength5• Supposetheonlycharactersareaandb• Solu5on

(a|b)(a|b)(a|b)(a|b)(a|b)

• Simplifica5on(a|b){5}

13

REexample

• Stringscontainingatmostonea b*ab* | b* Or b*(a|ε) b* | b*

• Moreconciserepresenta5onb*a?b*

14

REforemailaddress

• [email protected]@[email protected]…le[er=[a-zA-Z]digit=[0-9]w=le[er|digitw+(.w+)*@w+.w+(.w+)*

15

REforoddnumbers

Isthefollowingcorrect?[13579]+33343443[0-9]*[13579]

16

17

Moreregularexpressionexample•  REforrepresen5ngmonths

–  Exampleoflegalinputs•  Febcanberepresentedas02or2•  Novemberisrepresentedas11

•  Firsttry:(0|1)?[0-9]–  Matchesalllegalinputs?Yes

•  1,2,11,12,01,02,...–  Matchesnoillegalinputs?No

•  13,14,..etc•  Secondtry:

(0|1)?[0-9]=(ε|(0|1))[0-9]=[0-9]|(0|1)[0-9]=[0-9]|(0[0-9]|1[0-9][0-9]|(0[0-9]|1[0-2]–  Matchesalllegalinputs?Yes

•  1,2,11,12,01,02,...–  Matchesnoillegalinputs?No

•  0,00

Regular expression

18

Deriveregularexpressions

• Solu5on:[1-9]|(0[1-9])|(1[012])–  Either1-9,or0followedby1to9,or1followedby0,1,or2.–  Matchesalllegalinputs–  Matchesnoillegalinputs

• Moreconcisesolu5on:0?[1-9]|1[012]–  Isitequalto[1-9]|(0[1-9])|(1[012])?0?[1-9]|1[012]=(ε|0)[1-9]|1[012](byshorthandnota5on)=ε[1-9]|0[1-9]|1[012](bydistribu5onover|)=[1-9]|0[1-9]|1[012]

Regular expression

19

Regularexpressionexample(realnumber)

• Realnumbersuchas0,1,2,3.14–  Digit:[0-9]–  Integer:[0-9]+–  Firsttry:[0-9]+(.[0-9]+)?

•  Wanttoallow“.25”aslegalinput?

–  Secondtry:[0-9]+ | ([0-9]*.[0-9]+)

• Op5onalunaryminus:-?([0-9]+|([0-9]*.[0-9]+))-?(\d+|(\d*.\d+))

Regular expression

20

Regularexpressionexercises

• Canthestringbaabecreatedfromtheregularexpressiona*b*a*b*?

• Describethelanguage(inwords)representedby(a*a)b|ab.a+b

• Writetheregularexpressionthatrepresents:–  AllstringsoverΣ={a,b}thatendina.[ab]*a

(a|b)*a–  AllstringsoverΣ={0,1}ofevenlength. ((0|1)(0|1))*(00|01|10|11)*

Regular expression

21

Regulargrammarandregularexpression• Theyareequivalent

–  Everyregularexpressioncanbeexpressedbyregulargrammar–  Everyregulargrammarcanbeexpressedbyregularexpression–  Differentwaystoexpressthesamething

• Whytwonota5ons–  Grammarisamoregeneralconcept–  REismoreconcise

• Howtotranslatebetweenthem–  Useautomata–  Willintroducelater

Regular expression

22

Whatwelearntlastclass

• Defini5onofregularexpression–  εisaregularexpressionthatdenotesthelanguage{ε}

•  Notethatitisnot{}–  IfaisinΣ,aisaregularexpressionthatdenotes{a}–  LetrandsberegularexpressionswithlanguagesL(r)andL(s).Then

•  (r)|(s)isaregularexpressionàL(r)∪L(s)•  (r)(s)isaregularexpressionàL(r)L(s)•  (r)*isaregularexpressionà(L(r))*

Regular expression

23

Applica5onsofregularexpression

•  InWindows–  InwindowsyoucanuseREtosearchforfilesortextsinafile

•  Inunix,therearemanyRErelevanttools,suchasGrep–  StandsforGlobalRegularExpressionsandPrint(orGlobalRegular

ExpressionandParser…);–  UsefulUNIXcommandtofindpa[ernsofcharactersinatextfile;

•  XMLDTDcontentmodel–  <!ELEMENTstudent(name,(phone|cell)*,address,course+)>

<student><name>Jianguo</name><phone>1234567</phone><phone>2345678</phone><address>401sunsetave</address><course>214</course></student>

•  JavaCoreAPIhasregexpackage!•  Scannergenera5on

Regular expression

24

• REinXMLSchema<xsd:simpleTypename="TelephoneNumber"><xsd:restric5onbase="xsd:string"><xsd:lengthvalue="8"/><xsd:pa[ernvalue="\d{3}-\d{4}"/></xsd:restric5on></xsd:simpleType>

Regular expression

25

RegularexpressionsusedinScanner,Stringetc

•  Asampleproblem–  Developaprogramthat,givenasinputthreepointsP1,P2,andP3onthecartesian

coordinateplane,reportswhetherP3liesonthelinecontainingP1andP2.–  Inordertoinputthethreepoints,theuserenterssixintegersx1,y1,x2,y2,x3,andy3.

ThethreepointsareP1=(x1,y1),P2=(x2,y2),andP3=(x3,y3).–  Theprogramshouldrepeatun5ltheuser'sinputissuchthatP1andP2arethesame

point.•  Sampleinput

–  Enterx1:0–  Entery1:0–  Enterx2:2–  Entery2:5–  Enterx3:1–  Entery3:3

•  Output–  Thepoint(1,3)ISNOTonthelineconstructedfrompoints(2,5)and(0,0).

Regular expression in Java

26

Howtoreadandprocesstheinput

•  FirstTryScannersc=newScanner(System.in);intx1=sc.nextInt();java.u5l.InputMismatchExcep5on

•  Wewanttocapturethevaluesforxandy,anddiscardeverythingelse

•  DescribeeverythingelseasdelimitersScannersc=newScanner(System.in).useDelimiter(“…”);

•  Thedelimiterscanbeanyregularexpression

SampleinputEnterx1:2Entery1:8Enterx2:0Entery2:0Enterx3:1Entery3:4

Regular expression in Java

27

useDelimiterinScannerclass

• useDelimiter(“Enterx1:”)–  Thiswillthrowaway“Enterx1:”only.Todiscard“Enterx2:”aswell,

youmaywanttoaddthefollowing

• useDelimiter(“Enterx1:|Enterx2:”)–  Ver5calbarmeans“OR”—either“Enter x1:”OR“Enter x2:” –  ItiscalledRegularexpression;–  Nowyouknowhowtoexpandtothecaseforx3--

• useDelimiter(“Enterx1:|Enterx2:|Enterx3”)–  Wecansimplifiedtheaboveusingothernota5onsinREGUALR

EXPRESSION.

Regular expression in Java

28

Useregularexpressiontocapturetheinput

•  useDelimiter(“Enterx\\d”)–  Where\\dmeansanydigit.Nowhowtoreadinthevaluesforyaxis?

•  useDelimiter(“Enter\\w\\d”)–  \\wmeansanyle[erordigit

•  useDelimiter(“Enter\\w{2}”)–  Whatifthereareleadingandtrailingspacesaroundthesampleinput?

•  useDelimiter(“(|\\t)Enter\\w{2}(|\\t)”)•  useDelimiter(“\\sEnter\\w{2}\\s”)

–  \\sstandsforallkindsofwhitespace.•  useDelimiter(\\s*Enter\\w{2}:\\s*)

–  *meansthatzeroormorespacescanoccur.

Regular expression in Java

29

Codefragmenttoreaddata

//readfromfilefortes5ngpurposeScannersc=newScanner(newFile("online.txt")). useDelimiter("\\s*Enter\\w{2}:\\s*");

intx1=sc.nextInt();inty1=sc.nextInt();intx2=sc.nextInt();inty2=sc.nextInt();intx3=sc.nextInt();inty3=sc.nextInt();

Regular expression in Java

30

RegexpackageinJava

•  Javahasregularpackagejava.u5l.regex,• Asimpleexample:

–  Pickoutthevaliddatesinastring–  E.g.inthestring“finalexam2008-04-22,or2008-4-22,butnot

2008-22-04”–  Validdates:2008-04-22,2008-4-22

• Firstweneedtowritetheregularexpressions.\d{4}-(0?[1-9]|1[012])-\d{2}

Regular expression in Java

31

Regexpackage

• First,youmustcompilethepa[ernimport java.util.regex.*;Pattern p = Pattern.compile(“\\d{4}-(0?[1-9]|1[012])-\

\d{2}");–  Notethatinjavayouneedtowrite\\dinsteadof\d

• Next,youmustcreateamatcherforaspecificpieceoftextbysendingamessagetoyourpa[ern–  Matcherm=p.matcher(“…yourtextgoeshere….");

• Pointstono5ce:–  Pa[ernandMatcherarebothinjava.u5l.regex–  NeitherPa[ernnorMatcherhasapublicconstructor;youcreate

thesebyusingmethodsinthePa[ernclass–  Thematchercontainsinforma5onaboutboththepa[erntouseand

thetexttowhichitwillbeapplied

Regular expression in Java

32

Regexinjava

• Nowthatwehaveamatcherm,–  m.matches()returnstrueifthepa[ernmatchestheen5retext

string,andfalseotherwise–  m.lookingAt()returnstrueifthepa[ernmatchesatthe

beginningofthetextstring,andfalseotherwise–  m.find()returnstrueifthepa[ernmatchesanypartofthetext

string,andfalseotherwise•  Ifcalledagain,m.find()willstartsearchingfromwherethelastmatchwasfound

•  m.find()willreturntrueforasmanymatchesasthereareinthestring;aderthat,itwillreturnfalse

•  Whenm.find()returnsfalse,matchermwillberesettothebeginningofthetextstring(andmaybeusedagain)

Regular expression in java

33

Regexexampleimportjava.u5l.regex.*;publicclassRegexTest{publicstaAcvoidmain(Stringargs[]){

Stringpa[ern="\\d{4}-(0?[1-9]|1[012])-\\d{2}";Stringtext="finalexam2008-04-22,or2008-4-22,butnot 2008-22-04";

Pa[ernp=Pa[ern.compile(pa[ern);Matcherm=p.matcher(text);while(m.find()){ System.out.println("validdate:"+text.substring(m.start(),m.end()));}}

}

Printout:–  validdate:2008-04-22–  validdate:2008-4-22

Regular expression in java

34

Moreshorthandnota5oninspecifictools,likeregexpackageinJava

•  Differentsodwaretoolshaveslightlydifferentnota5ons(e.g.regex,grep,JLEX);

•  Shorthandnota5onsfromregexpackage–  . anyonecharacterexceptalineterminator–  \d adigit:[0-9]–  \D anon-digit:[^0-9]–  \s awhitespacecharacter:[\t\n\r]–  \S anon-whitespacecharacter:[^\s]–  \w awordcharacter:[a-zA-Z_0-9]–  \W anon-wordcharacter:[^\w]

•  GetfamiliarwithregularexpressionusingtheregexTesterApplet.

•  NotethatStringclasssinceJava1.4providessimilarmethodsforregularexpression

Regular expression in java

35

TryRegexTester

•  Runningatcoursewebsiteasanapplet;–  h[p://cs.uwindsor.ca/~jlu/214/regex_tester.htm

• Writeregularexpressionsandtrythematch(),find()methods;

• Whatif3q14insteadof3.14isthestringtobematched?Why?•  groupsarenumberedbycoun5ngtheiropeningparenthesesfromledtoright.

–  ((A)(B(C)))hasfourgroups:–  1((A)(B(C)))–  2(A)–  3(B(C))–  4(C)

Regular expression

36

Prac5ceregularexpressionusinggrep

Usegreptosearchforcertainpa[erninhtmlfiles;–  SearchforCanadianpostalcodeinatextfile;–  SearchforOntariocarplatenumberinatextfile.

–  usetcsh.Type•  %tcsh

–  Preparetextfile,say“test”,thatconsistsofsamplepostalcodeetc.–  Type

•  grep‘[a-z][0-9][a-z][0-9][a-z][0-9]’test

•  grep–i‘[a-z][0-9][a-z][0-9][a-z][0-9]’test

Regular expression in unix

37

Prac5cethefollowinggrepcommands:grep'cat'grepTest-youwillfindboth"cat"and"vaca5on”grep'\<cat\>'grepTest

--wordboundarygrep-i'\<cat\>'grepTest

–  ignorethecasegrep'\<ega\.a[\.com\>'grepTest

-metacharactergrep'"[^"]*"'grepTest

--findquotedstringegrep'[a-z][0-9][a-z]?[0-9][a-z][0-9]'grepTest

–  findpostalcode,onlyifitisinlowercaseegrep-i'[a-z][0-9][a-z]?[0-9][a-z][0-9]'grepTest

–  ignorethecasegrep'q[^u]'grepTest

--won'tfind"iraq",butwillfind"iraqi"grep'^cat'grepTest

findonlylinesstartwithcat•  egrepissimilartogrep,butsomepa[ernsonlyworkinegrep.

Regular expression in unix

megawatt computing ega.att.com Line with uppercase Cat and Dog. 214 cat vacation Cat Dog vacation only capital Cat the cat likes the dog "quoted string " iraq iraqi Qantas ADBB 123 ADBB12 N9B 3P5 N9B3P5 01-16-2004 14-14-2004

38

Unixmachineaccount

• Applyforaunixaccount:–  [email protected]

• Accessunixmachinesathome:–  YouneedtouseSSH;–  Oneplacetodownload:

•  www.uwindsor.ca/its-->services/downloads

39

REandFinitestateAutomaton(FA)•  Regularexpressionisadeclara5vewaytodescribethetokens

–  Itdescribeswhatisatoken,butnothowtorecognizethetoken.•  FAisusedtodescribehowthetokenisrecognized

–  FAiseasytobesimulatedbycomputerprograms;

•  Thereisa1-1correspondencebetweenFAandregularexpression–  Scannergenerator(suchasJLex)bridgesthegapbetweenregularexpression

andFA.

Scanner generator

FA RE Java scanner program

String stream

Tokens

40

Insidescannergenerator

•  Maincomponentsofscannergenera5on–  REtoNFA–  NFAtoDFA–  Minimiza5on–  DFAsimula5on

RE

NFA

DFA

Minimized DFA

Program

Thompson construction

Subset construction

DFA simulation Scanner generator

Minimization

41

Finiteautomata

•  FAalsocalledFiniteStateMachine(FSM)–  Abstractmodelofacompu5ngen5ty;–  Decideswhethertoacceptorrejectastring.

•  TwotypesofFA:–  Non-determinis5c(NFA):Hasmorethanonealterna5veac5on

forthesameinputsymbol.–  Determinis5c(DFA):Hasatmostoneac5onforagiveninput

symbol.•  Example:howdowewriteaprogramtorecognizejavaiden5fiers?

Start letter

letter

digit

s0 s1 S0:

if (getChar() is letter) goto S1;

S1: if (getChar() is letter or digit) goto S1;

Autom

ata

S

a, b a

B

Automaton:amechanismthatisrela5velyself-opera5ng--webster

42

Non-determinis5cFiniteAutomata(FA)

•  NFA(Non-determinis5cFiniteAutomaton)isa5-tuple(S,Σ, δ, S0, F):–  S:asetofstates;–  Σ:thesymbolsoftheinputalphabet;–  δ:atransi5onfunc5on;

•  move(state,symbol)àasetofstates

–  S0:s0∈S,thestartstate;–  F:F⊆S,asetoffinaloraccep5ngstates.

•  Non-determinis5c--astateandsymbolpaircanbemappedtoasetofstates.–  Itisdeterminis5ciftheresultoftransi5onconsistsofonlyonestate.

•  Finite—thenumberofstatesisfinite.

Autom

ata

Start letter

letter

digit

s0 s1

43

Transi5onDiagram

•  FAcanberepresentedusingtransi5ondiagram.•  CorrespondingtoFAdefini5on,atransi5ondiagramhas:

–  States:Representedbycircles;–  Σ: Alphabet,representedbylabelsonedges; –  Moves:Representedbylabeleddirectededgesbetweenstates.Thelabel

istheinputsymbol;–  StartState:arrowhead;–  FinalState(s):representedbydoublecircles.

•  Exampletransi5ondiagramtorecognize(a|b)*abb

q0 q3 a

q1 b

a, b

q2 b

Autom

ata

44

SimpleexamplesofFA

• Epsilon

• a

• a*

• a+

•  (a|b)*

start

a

0

start

a

1 a

0

start

a

0

b

start

a, b

0

start

1

a 0

start

1

ε 0

Autom

ata

45

ProceduresofdefiningaDFA/NFA

• Defineinputalphabetandini5alstate• Drawthetransi5ondiagram• Check

–  allstateshaveout-goingarcslabeledwithalltheinputsymbols(DFA).–  Arethereanymissingfinalstates?–  Arethereanyduplicatestates?–  allstringsinthelanguagecanbeaccepted.–  allstringsnotinthelanguagecannotbeaccepted.

• Nameallthestates• Define(S,Σ,δ,q0,F)

Autom

ata

46

Exampleofconstruc5ngaFA

• ConstructaDFAthatacceptsalanguageLoverΣ={0,1}suchthatListhesetofallstringswithanynumberof“0”sfollowedbyanynumberof“1”s.

• Regularexpression:0*1*• Σ={0,1}• Drawini5alstateofthetransi5ondiagram

Start

Autom

ata

47

Exampleofconstruc5ngaFA(cont.)

• Dradthetransi5ondiagram

Start 1

0 1

0

•  Is“111”accepted?• Theledmoststatehasmissedanarcwithinput“1”

Start 1

0 1

0

1

Autom

ata

•  Is“00”accepted?

48

Exampleofconstruc5ngaFA(cont.)

•  Is“00”accepted?• Theledmosttwostatesarealsofinalstates

–  Firststatefromtheled:εisalsoaccepted–  Secondstatefromtheled:

stringswith“0”sonlyarealsoaccepted

Start 1 0 1

0

1

Autom

ata

49

Exampleofconstruc5ngaFA(cont.)

• Theledmosttwostatesareduplicate–  theirarcspointtothesamestateswiththesamesymbols

Start 1

0 1

• Checkthattheyarecorrect–  Allstringsinthelanguagecanbeaccepted

•  εisaccepted•  stringswith“0”s/“1”sonlyareaccepted

–  Allstringsnotbelongedtothelanguagecannotbeaccepted• Nameallthestates

Start 1

0 1

q0 q1

Autom

ata

50

HowdoesFAwork

•  NFAdefini5onfor(a|b)*abb–  S={q0,q1,q2,q3}–  Σ={a,b}–  Transi5ons:move(q0,a)={q0,q1},move(q0,b)={q0},....–  s0=q0–  F={q3}

•  Transi5ondiagramrepresenta5on–  Non-determinism:

•  exi5ngfromonestatetherearemul5pleedgeslabeledwithsamesymbol,or•  Thereareepsilonedges.

–  HowdoesFAwork?Input:ababb

move(q0,a)=q1move(q1,b)=q2move(q2,a)=?(undefined)REJECT!

move(q0, a) = q0 move(q0, b) = q0 move(q0, a) = q1 move(q1, b) = q2 move(q2, b) = q3 ACCEPT !

q0 q3 a

q1 b

a,b

q2 b

Autom

ata

51

FAfor(a|b)*abb

–  WhatdoesitmeanthatastringisacceptedbyaFA?•  AnFAacceptsaninputstringxiffthereisapathfromthestartstatetoafinalstate,suchthattheedgelabelsalongthispathspelloutx;

–  Apathfor“aabb”:q0àaq0àaq1àbq2àbq3–  Is“aab”acceptable?

q0àaq0àaq1àbq2q0àaq0àaq0àbq0

•  Theanswerisno;•  Finalstatemustbereached;•  Ingeneral,therecouldbeseveralpaths.

–  Is“aabbb”acceptable?q0àaq0àaq1àbq2àbq3

•  Theanswerisno.•  Labelsonthepathmustspellouttheen5restring.

q0 q3 a

q1 b

a,b

q2 b

Autom

ata

52

ExampleofNFAwithepsilonsymbol

•  NFAaccep5ngaa*|bb* –  Is“aaab”acceptable?–  Is“aaa”acceptable?

0

1

3 b

b

4 ε

ε

a

a

2

Autom

ata

Simula5nganNFA

•  Keeptrackofasetofstates,ini5allythestartstateandeverythingreachablebyε-moves.

• Foreachcharacterintheinput:–  Maintainasetofnextstates,ini5allyempty.

• Foreachcurrentstate:–  Followalltransi5onslabeledwiththecurrentle[er.–  Addthesestatestothesetofnewstates.

• Addeverystatereachablebyanε-movetothesetofnextstates.

• Complexity:O(mn2)forstringsoflengthmandautomatawithnstates

53

q0 q3 a

q1 b

a,b

q2 b

54

Transi5ontable

•  Itisoneofthewaystoimplementthetransi5onfunc5on–  Thereisarowforeachstate;–  Thereisacolumnforeachsymbol;–  Entryin(states,symbola)isthesetofstatescanbereachedfromstateson

inputa.

•  Nondeterminis5c:–  Theentriesaresetsinsteadofasinglestate

States Input

a b

>q0 {q0,q1} {q0}

q1 {q2}

q2 {q3}

*q3

Autom

ata

q0 q3 a

q1 b

a,b

q2 b

55

DFA(Determinis5cFiniteAutomaton)•  AspecialcaseofNFA

–  Thetransi5onfunc5onmapsthepair(state,symbol)toonestate.•  Whenrepresentedbytransi5ondiagram,foreachstateSandsymbola,there

isatmostoneedgelabeledaleavingS;•  Whenrepresentedtransi5ontable,eachentryinthetableisasinglestate.

–  Thereisnoε-transi5on•  Example:DFAfor(a|b)*abb

•  RecalltheNFA:

States Input

a b

Q0 Q1 q0

Q1 Q1 Q2

Q2 Q1 Q3

Q3 Q1 Q0

q0

q3

a

a

q1 q2

a b b

b

a b

q0 q3 a

q1 b

a,b

q2 b

Autom

ata

56

DFAtoprogram

•  NFAismoreconcise,butnoteasytoimplement;

•  InDFA,sincetransi5ontablesdon’thaveanyalterna5veop5ons,DFAsareeasilysimulatedviaanalgorithm.

RE

NFA

DFA

Minimized DFA

Program

Thompson construction

Subset construction

DFA simulation Scanner generator

Minimization

Autom

ata simulation

57

SimulateaDFA

•  AlgorithmtosimulateDFAInput:Stringx,DFAD.

•  Transi5onfunc5onismove(s,c);•  StartstateisS0;•  FinalstatesareF.

Output:“yes”ifDacceptsx;“no”otherwise;Algorithm:

currentState ← s0 currentChar ← nextchar; while currentChar ≠ eof { currentState ← move(currentState, currentChar); currentChar ← nextchar; } if currentState is in F then return “yes” else return “no”

•  RuntheFAsimulator!• Writeasimulator.

Autom

ata simulation

58

NFAtoDFA

• Whereweare:wearegoingtodiscussthetransla5onfromNFAtoDFA.

• Theorem:AlanguageLisacceptedbyanNFAiffitisacceptedbyaDFA

• Subsetconstruc5onisusedtoperformthetransla5onfromNFAtoDFA.

RE

NFA

DFA

Minimized DFA

Program

Thompson construction

Subset construction

DFA simulation Scanner generator

Minimization

NFA to D

FA

59

Summarize

• Wehavecoveredmanyconcepts–  RE,Regulargrammar,FA(NFA,DFA),

Transi5onDiagram,Transi5onTable.• Whatistherela5onshipbetweenthem?

–  RE,Regulargrammar,NFA,DFA,Transi5onDiagramareallofthesameexpressivepower;

–  REisadeclara5vedescrip5on,henceeasierforustowrite;

–  DFAisclosertomachine;–  Transi5onDiagramisagraphic

representa5onofFA;–  Transi5onTableisoneofthemethodsto

implementthetransi5onfunc5onsinFA.• Whataboutregulargrammar?

–  Wewillseeitsrelevanceinsyntaxanalysis.•  Anotherpath:howtoderiveREfromDFA?

RE

NFA

DFA

Minimized DFA

Program

Thompson construction

Subset construction

DFA simulation

60

Regularexpressionandregulargrammar

• REtoregulargrammar:–  DrawanautomatafortheRE–  Constructregulargrammarfromautomaton–  Example

S0àle[erS1S1àle[erS1S1àdigitS1S1àε

Start letter

letter

digit

s0 s1

61

Conver5ngDFAstoREs

1.  Combineseriallinksbyconcatena5on2.  Combineparallellinksbyalterna5on3.  Removeself-loopsbyKleeneclosure4.  Selectanode(otherthanini5alorfinal)forremoval.Replace

itwithasetofequivalentlinkswhosepathexpressionscorrespondtotheinandoutlinks

5.  Repeatsteps1-4un5lthegraphconsistsofasinglelinkbetweentheentryandexitnodes.

62

Example

0 1 2

6

4 3 d

a

b c

d

7

5 a

b d

d

b

c

0 1 2

6

4 3 d a|b|c d

7

5 a

b d

d

b|c

0 4 3 d(a|b|c)d

5 a d

b(b|c)d

63

Example(cont.)

0 4 3 d(a|b|c)d

5 a d

b(b|c)da

0 4 3 d(a|b|c)d

5 a (b(b|c)da)*d

0 d(a|b|c)da(b(b|c)da)*d

5

64

Issuesnotcovered

• RegularexpressiontoDFAdirectly;

RE

NFA

DFA

Minimized DFA

Program

Thompson construction

Subset construction

DFA simulation

65

LexicalacceptorsandLexicalanalyzers

• DFA/NFAacceptsorrejectsastring;–  Theyarecalledlexicalacceptors;

• Butthepurposeofalexicalanalyzerisnotjusttoacceptorrejectstring.Thereareseveralissues:–  Mul)plematches:Oneregularexpressionmaymatchseveral

substrings.•  e.g.,ID=le[er+,String=“abc”,IDcanmatchwitha, ab, abc.•  Weshouldfindthelongestmatches,i.e.,longestsubstringoftheinputthatmatchestheregularexpression;

–  Mul)pleREs:WhatifonestringcanmatchseveralREs?•  e.g.,ID=le[er+,INT=“int”,•  String“int”canbebothareservedwordINT,andaniden5fier.Howcanwedecideitisareservedwordinsteadanusualiden5fier?

–  Ac)ons:Onceatokenisrecognized,wewanttoperformdifferenttasksonthem,insteadofsimplyreturnthestringrecognized.

66

Longestmatch

•  WhenseveralsubstringscanmatchthesameRE,weshouldreturnthelongestone.–  e.g.,ID=le[er+,String=“abc”,IDcanmatchwitha,ab,abc.

•  Problem:whatifalexergoespastafinalstateofashortertoken,butthendoesn’tfindanyothermatchingtokenlater?

•  Example:ConsiderR=00|10|0011andinputw=0010.

• WereachstateCwithnotransi5ononinput0.•  Solu5on:Keepingtrackofthelongestmatchjustmeansrememberingthelast5metheDFAwasinafinalstate;

S 0 1

0

0

1

1A B

F

C

E

D

67

Longestmatch(cont.)

•  Thisisdonebyintroducingthefollowingvariables:–  LastFinal:finalstatemostrecentlyencountered;–  InpputPosi5onAtLastFinal:mostrecentposi5onintheinputstringinwhich

theexecu5onoftheDFAwasinafinalstate;–  Yytext:Textofthetokenbeingmatched,i.e.,substringbetween

ini5alInputPosi5onandinputPosi5onAtLastFinal.•  Thiswayalongestmatchisrecognizedwhentheexecu5onoftheDFAreachesadead-end,i.e.,astatewithnotransi5ons.

•  Each5meatokenisrecognized,theexecu5onoftheDFAresumesintheini5alstatetorecognizethenexttoken.

•  Ingeneral,whenatokenisrecognized,currentInputPosi5onmaybefarbeyondinputPosi5onAtLastFinal.

S 0 1

0

0

1

1A B

F

C

E

D

68

Handlingmul5pleREs

•  CombinetheNFAsofalltheREsintoasinglefiniteautomaton.• WhatiftwoREsmatchesthesamestring?

–  E.g.,forastring“abb”,bothREs“a*bb”and“ab*”matchesthestring.WhichREisintended?

–  Itisimportantbecausedifferentac5onsmaytakedependingontheREbeingmatched;

–  Solu5on:OrderREs:theREprecedeswillmatchfirst.

•  Howaboutreservedwords?–  Forstring“int”,shouldwereturntokenINTortokenID?–  Twosolu5ons:

•  Constructareservedwordtableandlookupthetableevery5meaniden5fierisencountered;

•  Put“int”asanRE,andputthatREbeforetheiden5fierRE.Sowheneverthestring“int”ismet,RE“int”willbematchedfirstandthetokenINTwillbereturned(insteadofthetokenID).

69

Ac5ons

• Ac5onscanbeaddedforfinalstates;• Ac5onscanbedescribedinausualprogramminglanguage.InJLex,ac5onisdescribedinJava.

70

Buildascannerforasimplelanguage

• Thelanguageofassignmentstatements: LHS=RHS intLHS=RHS

…–  led-handsideofassignmentisaniden5fier,withop5onaltype

declara5on;–  Iden5fierisale[erfollowedbyoneormorele[ersordigits–  right-handsideisoneofthefollowing:

•  ID+ID•  ID*ID•  ID==ID

• Examplestatementintx3=x1+x2

71

Step1:Definetokens

Ourlanguagehassixtypeoftokens.–  theycanbedefinedbysixregularexpressions:

token Regularexpression

ASSIGN "="

ID le[er(le[er|digit)*

INT “int”

PLUS "+"

TIMES "*"

EQUALS "=="

72

Step2:ConvertREstoNFAs

“=”

letter

“+”

“*”

“=”

ASSIGN:

ID:

PLUS:

TIMES:

EQUALS:

Letter, digit

“=”

Step 3: Combine the NFAs, Convert NFAs to DFAs, minimize the DFAs

INT: “n” “i” “t”

ε

73

Step4:ExtendtheDFA• ModifytheDFAsothatafinalstatecanhave

–  anassociatedac5on,suchas:"putbackonecharacter"or"returntokenXXX“.

• Forexample,theDFAthatrecognizesiden5fierscanbemodifiedasfollows–  recallthatscanneriscalledbyaparser(onetokenisreturnedpereach

call)–  henceac5onreturnputsthescannerintostateS

letter

any char except letter or digit

letter | digit action:

•  put back 1 char •  return ID

S

74

Step5:CombinedFAforourlanguage•  combine the DFAs for all of the tokens in to a single FA.

letter any char except letter or digit

letter | digit put back 1 char; return ID

S “*” ID

return EQUALS “=”

any char except “=”

put back 1 char; return ASSIGN

return TIMES

“+”

return PLUS

“=”

F1

F4

F3

F2

TMP F5

“n” “i”

“t”

SP return INT, put back one char

F6

I1 I2

I3

•  It is not a DFA. Just for illustration purpose.

F7 SP

75

Exampletracefor“intx3=x1+x2”

Input Lastfinalstate

Currentstate

InputposiAonatlastFinalState

CurrentinputposiAon

IniAalinputposiAon

acAon

intx3=x1+x2 0 S 0 0 0

ntx3=x1+x2 0 I1 1 1 0

tx3=x1+x2 0 I2 2 2 0

[sp]x3=x1+x2 0 I3 3 3 0

x3=x1+x2 0 F6 4 4 0 Ac5on1

putBackOneChar();Yytext=substring(0,4-1);in5alInputPosi5on=3;currentSate=S;

[sp]x3=x1+x2 0 S 3 3 3

x3=x1+x2 0 SP 4 4 3 Ac5on2

Yytext=substring(3,3);ini5alInputPosi5on=4;currentState=S;

x3=x1+x2 0 S 4 4 4

3=x1+x2 0 ID 5 5 4

=x1+x2 0 ID 6 6 4

x1+x2 0 F2 7 7 4 Ac5on3

putBackOneChar;Yytext=substring(4,7-1);ini5alInputPosi5on=6;currentState=S;

=x1+x2 0 S 6 6 6

......

76

Scannergenerator:history

• LEX–  Alexicalanalyzergenerator,wri[enbyLeskandSchmidtatBellLabs

in1975fortheUNIXopera5ngsystem;–  Itnowexistsformanyopera5ngsystems;–  LEXproducesascannerwhichisaCprogram;–  LEXacceptsregularexpressionsandallowsac5ons(i.e.,codeto

executed)tobeassociatedwitheachregularexpression.

•  JLex–  Lexthatgeneratesascannerwri[eninJava;–  ItselfisalsoimplementedinJava.

• Therearemanysimilartools,formostprogramminglanguages

JLex

77

Overallpicture

Tokens

Scanner generator

NFA RE Java scanner program

String stream

DFA

Minimize DFA

Simulate DFA

JLex

78

Insidelexicalanalyzergenerator

• Howdoesalexicalanalyzerwork?–  Getinputfromuserwhodefines

tokensintheformthatisequivalenttoregulargrammar

–  TurntheregulargrammarintoaNFA

–  ConverttheNFAintoDFA–  Generatethecodethatsimulates

theDFA

Classes in JLex:

CAccept CAcceptAnchor CAlloc CBunch CDfa CDTrans CEmit CError CInput CLexGen CMakeNfa CMinimize CNfa CNfa2Dfa CNfaPair CSet CSimplifyNfa CSpec CUtility Main SparseBitSet ucsb

JLex

79

Howscannergeneratorisused

•  Writethescannerspecifica5on;•  Generatethescannerprogramusingscannergenerator;•  Compilethescannerprogram;•  Runthescannerprogramoninputstreams,andproducesequencesoftokens.

Scanner generator (e.g., JLex)

Scanner definition (e.g., JLex spec)

Scanner program, e.g., Scanner.java

Java compiler

Scanner program (e.g., Scanner.java)

Scanner.class

Scanner Scanner.class Input stream Sequence of

tokens

JLex

80

JLexspecifica5on

•  JLexspecifica5onconsistsofthreeparts,separatedby“%%”

UserJavacode,tobecopiedverba5mintothescannerprogram,placedbeforethelexerclass;

%%JLexdirec5ves,macrodefini5ons,commonlyusedtospecifyle[ers,digits,whitespace;%%Regularexpressionsandac5ons:

•  Specifyhowtodivideinputintotokens;•  Regularexpressionsarefollowedbyac5ons;

–  Printerrormessages;returntokencodes;

JLex

81

FirstJLexexamplesimple.lex

•  Recognizeintandiden5fiers.1.  %%2.  %{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{3.  MyLexeryy=newMyLexer(System.in);4.  while(true){5.  yy.yylex();6.  }7.  }8.  %}9.  %notunix10. %typevoid11. %classMyLexer12. %eofval{return;13. %eofval}

14.  IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*

15. %%

16.  "int"{System.out.println("INTrecognized");}17.  {IDENTIFIER}{System.out.println("IDis..."+yytext());}18.  \r|\n{}19.  .{}

JLex

82

Codegeneratedwillbeinsimple.lex.java

classMyLexer{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{

MyLexeryy=newMyLexer(System.in);while(true){yy.yylex();

}}publicvoidyylex(){......case5:{System.out.println("INTrecognized");}case7:{System.out.println("IDis..."+yytext());}......}}

Copied from internal code directive

JLex

83

RunningtheJLexexample•  StepstoruntheJLex

D:\214>javaJLex.Mainsimple.lexProcessingfirstsec5on--usercode.Processingsecondsec5on--JLexdeclara5ons.Processingthirdsec5on--lexicalrules.Crea5ngNFAmachinerepresenta5on.NFAcomprisedof22states.Workingoncharacterclasses.::::::::.NFAhas10dis5nctcharacterclasses.Crea5ngDFAtransi5ontable.WorkingonDFAstates...........MinimizingDFAtransi5ontable.9statesaderremovalofredundantstates.Outpu�nglexicalanalyzercode.D:\214>movesimple.lex.javaMyLexer.javaD:\214>javacMyLexer.javaD:\214>javaMyLexer//itiswai5ngforkeyboardinputintmyid0INTrecognizedIDis...myid0

JLex

84

Exercises

• TrytomodifyJLexdirec5vesinthepreviousJLexspec,andobservewhetheritiss5llworking.Ifitisnotworking,trytounderstandthereason.–  Remove“%notunix”direc5ve;–  Change“return;”to“returnnull;”;–  Remove“%typevoid”;–  ......

• MovetheIden5fierregularexpressionbeforethe“int”RE.Whatwillhappentotheinput“int”?

• Whatifyouremovethelastline(line19,“.{}”)?

JLex

85

Changesimple.lex:readinputfromfile

1.  importjava.io.*;2.  %%3.  %{publicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{4.  MyLexeryy=newMyLexer(newFileReader(“input”));5.  while(yy.yylex()>=0);6.  }7.  %}8.  %integer9.  %classMyLexer

10.  %%

11.  "int"{System.out.println("INTrecognized");}12.  [a-zA-Z_][a-zA-Z0-9_]*{System.out.println("IDis..."+yytext());}13.  \r|\n|.{}

•  %integer:tomakethereturningtypeofyylex()asint.

JLex

86

Extendtheexample:addreturninganduseclasses–  Whenatokenisrecognized,inmostofthecasewewanttoreturnatokenobject,sothat

otherprogramscanuseit.classUseLexer{publicsta5cvoidmain(String[]args)throwsjava.io.IOExcep5on{Tokent;MyLexer2lexer=newMyLexer2(System.in);while((t=lexer.yylex())!=null)System.out.println(t.toString());}}classToken{Stringtype;Stringtext;intline;Token(Stringt,Stringtxt,intl){type=t;text=txt;line=l;}publicStringtoString(){returntext+""+type+""+line;}}%%%notunix%line%typeToken%classMyLexer2%eofval{returnnull;%eofval}IDENTIFIER=[a-zA-Z_][a-zA-Z0-9_]*%%"int"{return(newToken("INT",yytext(),yyline));}{IDENTIFIER}{return(newToken("ID",yytext(),yyline));}\r|\n{}.{}

JLex

87

Codegeneratedfrommylexer2.lex

classUseLexer{publicsta5cvoidmain(String[]args)throwsjava.io.IOExcep5on{Tokent;MyLexer2lexer=newMyLexer2(System.in);while((t=lexer.yylex())!=null)System.out.println(t.toString());}}classToken{Stringtype;Stringtext;intline;Token(Stringt,Stringtxt,intl){type=t;text=txt;line=l;}publicStringtoString(){returntext+""+type+""+line;}}ClassMyLexer2{publicTokenyylex(){......case5:{return(newToken("INT",yytext(),yyline));}case7:{return(newToken("ID",yytext(),yyline));}

......}}

JLex

88

Runningtheextendedlexspecifica5onmylexer2.lex

D:\214>javaJLex.Mainmylexer2.lexProcessingfirstsec5on--usercode.Processingsecondsec5on--JLexdeclara5ons.Processingthirdsec5on--lexicalrules.Crea5ngNFAmachinerepresenta5on.NFAcomprisedof22states.Workingoncharacterclasses.::::::::.NFAhas10dis5nctcharacterclasses.Crea5ngDFAtransi5ontable.WorkingonDFAstates...........MinimizingDFAtransi5ontable.9statesaderremovalofredundantstates.Outpu�nglexicalanalyzercode.D:\214>movemylexer2.lex.javaMyLexer2.javaD:\214>javacMyLexer2.javaD:\214>javaUseLexerintintINT0x1x1ID1

JLex

89

Anotherexample

1importjava.io.IOExcep5on;2%%3%public4%classNumbers_15%typevoid6%eofval{return;8%eofval}910%line11%{publicsta5cvoidmain(Stringargs[]){12Numbers_1num=newNumbers_1(System.in);13try{14num.yylex();15}catch(IOExcep5one){System.err.println(e);}16}17%}1819%%20\r\n{System.out.println("---"+(yyline+1));}22.*\r\n{System.out.print("+++"+(yyline+1)+"\t"+yytext());}

JLex

90

Usercode(firstsec5onofJLex)

• Usercodeiscopiedverba5mintothelexicalanalyzersourcefilethatJLexoutputs,atthetopofthefile.–  Packagedeclara5ons;–  Importsofanexternalclass–  Classdefini5ons

• Generatedcodepackage declarations; import packages; Class definitions; class Yylex { ... ... }

• Yylexclassisthedefaultlexerclassname.Itcanbechangedtootherclassnameusing%classdirec5ve.

JLex

91

JLexdirec5ves(Secondsec5on)

•  Internalcodetolexicalanalyzerclass• Marcodefini5on• Statedeclara5on• Character/linecoun5ng• Lexicalanalyzercomponent5tle• Specifyingthereturnvalueonend-of-file• Specifyinganinterfacetoimplement

JLex

92

InternalCodetoLexicalAnalyzerClass

•  %{ …. %} direc5vepermitsthedeclara5onofvariablesandfunc5onsinternaltothegeneratedlexicalanalyzer

•  Generalform:%{<code>%}

•  Effect:<code>willbecopiedintotheLexerclass,suchasMyLexer.classMyLexer{…..<code>……}

•  Examplepublicsta5cvoidmain(Stringargv[])throwsjava.io.IOExcep5on{

MyLexeryy=newMyLexer(System.in);while(true){yy.yylex();}}

•  Differencewiththeusercodesec5on–  Itiscopiedinsidethelexerclass(e.g.,theMyLexerclass)

JLex

93

MacroDefini5on•  Purpose:defineonceandusedseveral5mes;

–  Amustwhenwewritelargelexspecifica5on.•  Generalformofmacrodefini5on:

–  <name>=<defini5on>–  shouldbecontainedonasingleline–  Macronameshouldbevalididen5fiers–  Macrodefini5onshouldbevalidregularexpressions–  Macrodefini5oncancontainothermacroexpansions,inthestandard

{<name>}formatformacroswithinregularexpressions.•  Example

–  Defini5on(inthesecondpartofJLexspec):IDENTIFIER=[a-zA-Z_][a-zA-Z0-9_]*ALPHA=[A-Za-z_]DIGIT=[0-9]ALPHA_NUMERIC={ALPHA}|{DIGIT}

–  Use(inthethirdpart):{IDENTIFIER}{returnnewToken(ID,yytext());}

JLex

94

Statedirec5ve•  Samestringcouldbematchedbydifferentregularexpressions,accordingtoitssurroundingenvironment.–  String“int”insidecommentshouldnotberecognizedasareservedword,not

evenasaniden5fier.

•  Par5cularlyusefulwhenyouneedtoanalyzemixedlanguages;•  Forexample,inJSP,JavaprogramscanbeimbeddedinsideHTMLblocks.OnceyouareinsideJavablock,youfollowtheJavasyntax.ButwhenyouareoutoftheJavablock,youneedtofollowtheHTMLsyntax.–  Injava“int”shouldberecognizedasareservedword;–  InHTML“int”shouldberecognizedjustasausualstring.

•  StatesinsideJLex<HTMLState>%{{yybegin(JavaState);}<HTMLState>“int”{returnstring;}<JavaState>%}{yybegin(HTMLState);}<JavaState>“int”{returnkeyword;}

JLex

95

StateDirec5ve(cont.)•  MechanismtomixFAstatesandREs•  Declaringasetof“startstates”(inthesecondpartofJLexspec)

%statestate0[,state1,state2,….]•  Howtousethestate(inthethirdpartofJLexspec):

–  REcanbeprefixedbythesetofstartstatesinwhichitisvalid;• Wecanmakeatransi5onfromonestatetoanotherwithinputRE

–  yybegin(STATE)isthecommandtomaketransi5ontoSTATE;•  YYINITIAL:implicitstartstateofyylex();

–  Butwecanchangethestartstate;•  Example(fromthesampleinJLexspec):

%state COMMENT %% <YYINITIAL>if {return new tok(sym.IF,”IF”);} <YYINITIAL>[a-z]+ {return new tok(sym.ID, yytext());} <YYINITIAL>”/*” {yybegin(COMMENT);} <COMMENT>”*/” {yybegin(YYINITIAL);} <COMMENT>. {}

JLex

96

Characterandlinecoun5ng

•  Some5mesitisusefultoknowwhereexactlythetokenisinthetext.Tokenposi5onisimplementedusinglinecoun5ngandcharcoun5ng.

•  Charactercoun5ngisturnedoffbydefault,ac5vatedwiththedirec5ve“%char”–  Createaninstancevariableyycharinthescanner;–  zero-basedcharacterindexofthefirstcharacteronthematchedregionof

text.

•  Linecoun5ngisturnedoffbydefault,ac5vatedwiththedirec5ve“%line”–  Createaninstancevariableyylineinthescanner;–  zero-basedlineindexatthebeginningofthematchedregionoftext.

•  Example“int”{return(newYytoken(4,yytext(),yyline,yychar,yychar+3));}

JLex

97

Lexicalanalyzercomponent5tles

• Changethenameofgenerated–  lexicalanalyzerclass %class<name>–  thetokenizingfunc5on %func5on<name>–  thetokenreturntype %type<name>

• DefaultnamesclassYylex{/*lexicalanalyzerclass*/publicYytoken/*thetokenreturntype*/yylex(){…}/*thetokenizingfunc5on*/==>Yylex.yylex() returns Yytoken type

JLex

98

SpecifyinganInterfacetoimplement

• Form:%implements<InterfaceName>• AllowstheusertospecifyaninterfacewhichtheYylexoryourlexerclasswillimplement.

• Thegeneratedparserclassdeclara5onwilllooklike:class MyLexer implementsInterfaceName{…….}

JLex

99

Regularexpressionrules

•  Generalform:regularExpression{ac5on}

•  Example:{IDENTIFIER}{System.out.println("IDis..."+yytext());}

•  Interpreta5on:Pa[entobematchedcodetobeexecutedwhenthepa[ernismatched

•  CodegeneratedinMyLexer:“case2: {System.out.println("IDis..."+yytext());}“

JLex

100

RegularExpressionRules•  Specifiesrulesforbreakingtheinputstreamintotokens•  RegularExpression+Ac5ons(javacode)[<states>]<expression>{<ac5on>}•  Whenmatchedwithmorethanonerule,

–  choosetherulethatisgivenfirstintheJlexspec.•  Referthe“int”andIDENTIFIERexample.

•  TherulesgiveninaJLexspecifica5onshouldmatchallpossibleinput.•  Anerrorwillberaisedifthegeneratedlexerreceivesinputthatdoesnotmatchanyofitsrules

–  E.g.,therulesonlylistedthecaseforIden5fiers,andsaidnothingaboutnumbers,butyourinputhasnumbers.

–  Thisisthemostcommonerror(morethan50%)–  putthefollowingruleatthebo[omofREspec

.{java.lang.System.out.println(“Error:” + yytext());} dot(.)willmatchanyinputexceptforthenewline.

JLex

101

Availablelexicalvalueswithinac5oncode

•  java.lang.Stringyytext()–  matchespor5onofthecharacterinputstream;–  alwaysac5ve.

•  Intyychar–  Zero-basedcharacterindexofthefirstcharacterinthematched

por5onoftheinputstream;–  ac5vatedby%chardirec5ve.

•  Intyyline–  Zero-basedlinenumberofthestartofthematchedpor5onofthe

inputstream;–  ac5vatedby%linedirec5ve.

JLex

102

RegularexpressioninJLex•Specialcharacters:?+|()ˆ$/;.=<>[]{}”\andblank

–  Ader\thespecialcharacterslosetheirspecialmeaning.–  Example:\+

•  Betweendoublequotes”allspecialcharactersbut\and”losetheirspecialmeaning.–  Example:”+”

•  Thefollowingescapesequencesarerecognized:\b\n\t\f\r.• With[]wecandescribesetsofcharacters.

–  [abc]isthesameas(a|b|c).Notethatitisnotequivalenttoabc–  With[ˆ]wecandescribesetsofcharacters.–  [ˆ\n\”]meansanythingbutanewlineorquotes–  [ˆa–z]meansanythingbutONElower-casele[er

• Wecanuse.asashortcutfor[ˆ\n]•  $: denotestheendofaline.If$endsaregularexpression,theexpressionmatchedonlyattheendofaline.

JLex

103

Concludingremarks• FocusedonLexicalAnalysisProcess,Including

–  RegularExpressions–  FiniteAutomaton–  Conversion–  Lex

• Regulargrammar=regularexpression• RegularexpressionàNFAàDFAàlexer• Thenextstepinthecompila5onprocessisParsing:

–  Contextfreegrammar;–  Top-downparsingandbo[omupparsing.

JLex