Regular Expression Sub-Matching using Partial Derivatives · suma0002/talks/ppdp12-part... · 2013. 8....

Post on 23-Aug-2020

Regular Expression Sub-Matching using Partial Derivatives

Martin Sulzmann (Hochschule Karlsruhe)

Kenny Zhuo Ming Lu (Nanyang Polytechnic)

Regular Expression Sub-Matching using Partial Derivatives – p. 1/18

Regular Expressions - The Basics

Words: w ::= Σ∗

Regular expressions

r ::= r + r    Choice
    | rr       Concatenation
    | r∗       Kleene star
    | ε        Empty word
    | φ        Empty language
    | l ∈ Σ    Letters

(A + (BC))∗ denotes a regular language

L( (A + (BC))∗ ) = {ε, A, BC, ABC, ...}
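The grammar can be mirrored as a small data structure. The following is a minimal sketch, assuming a nested-tuple encoding and a helper named `words` (both invented for illustration, not taken from the talk); it enumerates the words of L(r) up to a length bound.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch).
EPS = ("eps",)                       # empty word
PHI = ("phi",)                       # empty language
def sym(c): return ("sym", c)        # single letter
def alt(r, s): return ("alt", r, s)  # choice r + s
def seq(r, s): return ("seq", r, s)  # concatenation rs
def star(r): return ("star", r)      # Kleene star

def words(r, max_len):
    """Return all words of L(r) whose length is at most max_len."""
    tag = r[0]
    if tag == "eps":
        return {""}
    if tag == "phi":
        return set()
    if tag == "sym":
        return {r[1]} if max_len >= 1 else set()
    if tag == "alt":
        return words(r[1], max_len) | words(r[2], max_len)
    if tag == "seq":
        return {u + v
                for u in words(r[1], max_len)
                for v in words(r[2], max_len)
                if len(u + v) <= max_len}
    # star: keep appending non-empty words of the body until nothing new fits
    body = words(r[1], max_len) - {""}
    result, frontier = {""}, {""}
    while frontier:
        frontier = {u + v for u in frontier for v in body
                    if len(u + v) <= max_len} - result
        result |= frontier
    return result

r = star(alt(sym("A"), seq(sym("B"), sym("C"))))   # (A + (BC))*
print(sorted(words(r, 3), key=lambda w: (len(w), w)))
# ['', 'A', 'AA', 'BC', 'AAA', 'ABC', 'BCA']
```

The bound is only there to keep the enumeration finite; the language itself is infinite.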

Regular Expression Sub-Matching

Matching

w matches r iff w ∈ L(r)

ABAAC matches (A + AB)(BAA + A)(AC + C)

L( (A + AB)(BAA + A)(AC + C) ) = {ABAAAC, ABAAC, AAAC, AAC, ABBAAAC, ABBAAC, ABAC}

Which sub-parts are matched?

(x1 : (A + AB))(x2 : (BAA + A))(x3 : (AC + C))

Sub-matchings for ABAAC are x1 = AB, x2 = A, x3 = AC

Now that the difference is clear: Matching ≠ Sub-matching
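To make the difference concrete, here is a brute-force sketch (not the algorithm of the talk) that enumerates every assignment of sub-words to x1, x2, x3. The pattern is written as a list of (variable, alternatives) pairs, a simplified encoding assumed only for this example.

```python
def submatches(pattern, word):
    """Yield every assignment of sub-words to variables.

    pattern: list of (variable, alternatives) pairs matched left to right,
    where alternatives is a list of literal strings.
    """
    if not pattern:
        if word == "":
            yield {}
        return
    (var, alternatives), rest = pattern[0], pattern[1:]
    for literal in alternatives:
        if word.startswith(literal):
            for env in submatches(rest, word[len(literal):]):
                yield {var: literal, **env}

# (x1 : (A + AB))(x2 : (BAA + A))(x3 : (AC + C))
pattern = [("x1", ["A", "AB"]), ("x2", ["BAA", "A"]), ("x3", ["AC", "C"])]
results = list(submatches(pattern, "ABAAC"))
print(results)
# [{'x1': 'A', 'x2': 'BAA', 'x3': 'C'}, {'x1': 'AB', 'x2': 'A', 'x3': 'AC'}]
```

Matching only answers yes or no, whereas sub-matching has to pick one of these two assignments. The slide reports the second one.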

Slow Regular Expression Matching

Example: (x : A?^n)(y : A^n), where A^n means “A n-times” and r? = ε + r

Consider n = 2

AA ⊢ (x : A?A?)(y : AA)

Fail ⇒ Backtrack

AA ⊢ (x : A?A?)(y : AA)

...

AA ⊢ (x : A?A?)(y : AA)

Success but exponential complexity
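The blow-up can be observed with a toy backtracking matcher. This is a sketch under assumptions: the pattern A?^n A^n is encoded as a list of optional and required letters, the matcher tries "consume a letter" before "skip", and `steps` is a hypothetical helper that counts recursive calls.

```python
def backtrack(pattern, word, counter):
    """Match pattern items left to right, backtracking over optional items."""
    counter[0] += 1
    if not pattern:
        return word == ""
    kind, letter = pattern[0]
    rest = pattern[1:]
    # First try to consume one letter with this item.
    if word[:1] == letter and backtrack(rest, word[1:], counter):
        return True
    # On failure, an optional item may be skipped instead.
    if kind == "opt":
        return backtrack(rest, word, counter)
    return False

def steps(n):
    """Number of calls needed to match A^n against A?^n A^n."""
    pattern = [("opt", "A")] * n + [("req", "A")] * n
    counter = [0]
    assert backtrack(pattern, "A" * n, counter)
    return counter[0]

for n in (1, 2, 3, 10):
    print(n, steps(n))
# the first three lines printed are: 1 4, 2 11, 3 27
```

In this sketch every attempt in which some A? consumes a letter fails, so the number of calls is at least 2^(n-1).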

Fast Regular Expression Matching

For brevity, we ignore sub-matching locations

Convert A?A?AA to NFA

[Figure: NFA for A?A?AA, with the currently active states marked •]

Simultaneous search for match AA

No backtracking, linear search

States to be tracked: linear in the size of the regular expression

Linear complexity!

So far Thompson and Glushkov NFA construction

See Russ Cox, Alain Frisch et al., Ville Laurikari, ...
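The simultaneous search can be sketched as a set-of-states simulation. The NFA below is hand-built for patterns of the shape A?^n A^n (state i means "i pattern items consumed"); it is an illustration only, not a Thompson or Glushkov construction, and all names are assumptions.

```python
def closure(pattern, states):
    """Add every state reachable by skipping optional items."""
    result = set(states)
    stack = list(states)
    while stack:
        i = stack.pop()
        if i < len(pattern) and pattern[i][0] == "opt" and i + 1 not in result:
            result.add(i + 1)
            stack.append(i + 1)
    return result

def nfa_match(pattern, word):
    """Track all active states at once; each letter is read exactly once."""
    current = closure(pattern, {0})
    for c in word:
        moved = {i + 1 for i in current
                 if i < len(pattern) and pattern[i][1] == c}
        current = closure(pattern, moved)
    return len(pattern) in current

def a_opt_n_a_n(n):
    """Items of the pattern A?^n A^n."""
    return [("opt", "A")] * n + [("req", "A")] * n

print(nfa_match(a_opt_n_a_n(2), "AA"))        # True
print(nfa_match(a_opt_n_a_n(30), "A" * 30))   # True
```

The set of active states never exceeds the number of pattern items plus one, which is why no backtracking is needed.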

Our Contributions

Matching automata construction based on
    Brzozowski’s Derivatives (DFA)
    Antimirov’s Partial Derivatives (NFA)

Fast and elegant algorithms for
    POSIX matching
    greedy left-most matching

Implementation in Haskell supporting real-world regular expressions

Brzozowski’s Derivatives

r\l: derivative of r w.r.t. l (“take away the leading l”)

L(r\l) = {w | lw ∈ L(r)}

Compute r\l by induction, e.g. A\A = ε, B\A = φ, r∗\l = (r\l)r∗

r1r2\l = (r1\l)r2 + r2\l if ε ∈ L(r1), and (r1\l)r2 otherwise

Matching derivation:

r1 –l→ r2 iff r2 = r1\l

For w = l1...ln, check if r –l1→ ... –ln→ r′ where ε ∈ L(r′)
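The inductive definition translates almost line by line into code. Below is a minimal sketch of derivative-based matching (plain matching only, no sub-match recording), assuming a nested-tuple encoding of regular expressions; the function names are chosen for this example.

```python
EPS, PHI = ("eps",), ("phi",)
def sym(c): return ("sym", c)
def alt(r, s): return ("alt", r, s)
def seq(r, s): return ("seq", r, s)
def star(r): return ("star", r)

def nullable(r):
    """True iff the empty word is in L(r)."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("phi", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])       # seq

def deriv(r, l):
    """Brzozowski derivative r\\l."""
    tag = r[0]
    if tag in ("eps", "phi"):
        return PHI
    if tag == "sym":
        return EPS if r[1] == l else PHI
    if tag == "alt":
        return alt(deriv(r[1], l), deriv(r[2], l))
    if tag == "seq":
        left = seq(deriv(r[1], l), r[2])
        return alt(left, deriv(r[2], l)) if nullable(r[1]) else left
    return seq(deriv(r[1], l), r)                  # star: (r\l) r*

def matches(r, word):
    """Take derivatives letter by letter, then test for the empty word."""
    for l in word:
        r = deriv(r, l)
    return nullable(r)

A, B, C = sym("A"), sym("B"), sym("C")
# (A + AB)(BAA + A)(AC + C)
r = seq(seq(alt(A, seq(A, B)), alt(seq(seq(B, A), A), A)), alt(seq(A, C), C))
print(matches(r, "ABAAC"))   # True
print(matches(r, "ABAA"))    # False
```

Without simplification the derivative terms can grow quickly, so this sketch is meant for small inputs.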

Matching with Derivatives

Example: Match AB against (x : A + y : AB + z : B)∗

(A + AB + B)∗ {x : ε, y : ε, z : ε}

A→ (A + AB + B)∗\A
= (A + AB + B)\A (A + AB + B)∗
= (A\A + AB\A + B\A) (A + AB + B)∗
= (ε + B)1 (A + AB + B)∗ {x1 : A, y1 : A, z1 : ε}

Record matchings for each iteration

Paper records matchings within pattern

B→ ((ε + B)1 (A + AB + B)∗)\B, with p1 = (ε + B)1 and p2 = (A + AB + B)∗

(p1p2)\l = (p1\l, p2) + (empty(p1)p2\l) if p1 empty

Choice of matchings. Don’t drop p1: keep p1 and make p1 “empty”, (ε + B)1 ⇒ (ε + φ)1, so p1 won’t contribute further matchings

= ((ε + B)1\B (A + AB + B)∗) + (ε + φ)1((A + AB + B)∗\B)
... = ε1(A + AB + B)∗ + (ε + φ)1ε2(A + AB + B)∗

{y1 : AB} and {x1 : A, z2 : B}

POSIX and greedy left-most match, respectively
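The two matchings can be reproduced by brute force. The sketch below enumerates every way to split the input into iterations of the starred choice; it is an illustration under a simplified encoding (a list of (variable, literal) pairs), not the derivative construction itself.

```python
def parses(alternatives, word):
    """Yield every way to split word into iterations of a starred choice.

    alternatives: list of (variable, literal) pairs in left-to-right order.
    Each result is a list of (variable, literal) pairs, one per iteration.
    """
    if word == "":
        yield []
        return
    for var, literal in alternatives:
        if literal and word.startswith(literal):
            for rest in parses(alternatives, word[len(literal):]):
                yield [(var, literal)] + rest

# (x : A + y : AB + z : B)*
alternatives = [("x", "A"), ("y", "AB"), ("z", "B")]
all_parses = list(parses(alternatives, "AB"))
print(all_parses)
# [[('x', 'A'), ('z', 'B')], [('y', 'AB')]]
```

The first result comes from always trying the left-most alternative first and corresponds to {x1 : A, z2 : B}; the second, whose first iteration is longest, corresponds to {y1 : AB}.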

Derivatives Matching Summary

Computes all matchings ⇒ exponential complexity

Optimization:

Simplify, e.g. aggressively to the left

(1) r + r ⇒ r (keep left r)    (2) φ r ⇒ φ    (3) ε r ⇒ r

ε1(A + AB + B)∗ + (ε + φ)1ε2(A + AB + B)∗    (p1 = left summand, p2 = right summand)

⇒∗ ε1(A + AB + B)∗ + ε2(A + AB + B)∗

⇒ ε1(A + AB + B)∗

⇒ (A + AB + B)∗

Yields POSIX match
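The simplification rules can be sketched as a bottom-up rewrite on plain regular expressions (iteration labels dropped). Rules (1) to (3) are the ones listed on the slide; the extra rules r + φ ⇒ r and φ + r ⇒ r are an assumption added here so that (ε + φ) collapses to ε, as the ⇒∗ step requires.

```python
EPS, PHI = ("eps",), ("phi",)

def simp(r):
    """Simplify a regular expression given as nested tuples."""
    tag = r[0]
    if tag == "alt":
        a, b = simp(r[1]), simp(r[2])
        if a == b:
            return a                 # (1) r + r => r, keeping the left copy
        if a == PHI:
            return b                 # assumed extra rule: phi + r => r
        if b == PHI:
            return a                 # assumed extra rule: r + phi => r
        return ("alt", a, b)
    if tag == "seq":
        a, b = simp(r[1]), simp(r[2])
        if a == PHI:
            return PHI               # (2) phi r => phi
        if a == EPS:
            return b                 # (3) eps r => r
        return ("seq", a, b)
    if tag == "star":
        return ("star", simp(r[1]))
    return r

A, B = ("sym", "A"), ("sym", "B")
S = ("star", ("alt", ("alt", A, ("seq", A, B)), B))        # (A + AB + B)*
p1 = ("seq", EPS, S)                                       # eps (A + AB + B)*
p2 = ("seq", ("alt", EPS, PHI), ("seq", EPS, S))           # (eps + phi) eps (A + AB + B)*
print(simp(("alt", p1, p2)) == S)   # True
```

Because rule (1) keeps the left copy, the matching recorded in the left summand is the one that survives in the labelled version.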

Matching with Partial Derivatives

Derivatives represent states of a DFA

·\· : r ↦ L ↦ r

(A + AB + B)∗\A = (ε + B) (A + AB + B)∗

On the fly DFA construction.

Partial derivatives represent states of an NFA

·\p· : r ↦ L ↦ 2^r

L(r\l) = L(r1 + ... + rn) where r\p l = {r1, ..., rn}

(A + AB + B)∗\p A = {ε(A + AB + B)∗, B(A + AB + B)∗}

Set of partial derivatives finite and linear in size of regular expression.

Build NFA match automata.
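A minimal sketch of partial derivatives follows, again on nested tuples. It assumes a "smart" concatenation `cat` that applies ε r = r, so the first element of the example set appears as (A + AB + B)∗ rather than ε(A + AB + B)∗; all function names are chosen for illustration.

```python
EPS, PHI = ("eps",), ("phi",)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("phi", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])       # seq

def cat(p, r):
    """Concatenation that drops a leading empty word (eps r = r)."""
    return r if p == EPS else ("seq", p, r)

def pderiv(r, l):
    """Set of partial derivatives of r w.r.t. letter l."""
    tag = r[0]
    if tag in ("eps", "phi"):
        return set()
    if tag == "sym":
        return {EPS} if r[1] == l else set()
    if tag == "alt":
        return pderiv(r[1], l) | pderiv(r[2], l)
    if tag == "seq":
        left = {cat(p, r[2]) for p in pderiv(r[1], l)}
        return left | pderiv(r[2], l) if nullable(r[1]) else left
    return {cat(p, r) for p in pderiv(r[1], l)}    # star

def pmatch(r, word):
    """NFA-style matching: the current states are a set of expressions."""
    states = {r}
    for l in word:
        states = {q for s in states for q in pderiv(s, l)}
    return any(nullable(s) for s in states)

A, B = ("sym", "A"), ("sym", "B")
S = ("star", ("alt", ("alt", A, ("seq", A, B)), B))   # (A + AB + B)*
print(pderiv(S, "A") == {S, ("seq", B, S)})           # True
print(pmatch(S, "AB"))                                # True
```

Each element of the set plays the role of one NFA state, so a step of `pmatch` is one simultaneous transition.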

Matching with Partial Derivatives

(Sketch of) NFA match automata

[Figure: NFA with states p1, p2, p3; A-transitions lead from p1 to p2 and from p1 to p3]

p1\p A = {p2, p3}, where p1 = (A + AB + B)∗, p2 = ε(A + AB + B)∗, p3 = B(A + AB + B)∗

Depth-first left-most traversal ⇒ greedy left-most match

Not POSIX because structure is broken apart

See paper for details of construction

POSIX versus Greedy Left-Most Match

Derivatives for POSIX matching

POSIX = maximal match w.r.t. structure

AB matches (A + AB + B)∗

(A + AB + B)∗\A = (ε + B) (A + AB + B)∗

Partial derivatives for greedy left-most matching

Greedy left-most = maximal match ignoring any structure

AB matches (A + AB + B)∗

(A + AB + B)∗\p A = {ε(A + AB + B)∗, B(A + AB + B)∗}

Implementation

Fine-tuned Haskell implementation of greedy left-most using partial derivative NFAs.

Real-world extensions: group matchings, anchored match, ...

Competitive performance (see paper for details):
    C-based: RE2, PCRE
    Haskell-based: Weighted, TDFA

Partial derivative NFA construction “smaller” compared to Thompson and Glushkov NFA construction

Reference implementation of Thompson, Glushkov and Partial Derivative NFA construction

Conclusion

First application of derivatives and partial derivatives for regular expression sub-matching

Future work:

Implementation in other languages

Efficient POSIX implementation
    Tricky for NFA but see “backwards scanning” trick by Russ Cox
    Exploiting laziness of the on-the-fly derivative construction

Error explanation: Why is there no match?

Errata - Simplifications

Page 7, left column: The derivation should be as follows, where we have underlined the corrected parts and expressions involving φ have already been removed.

(x|ε : A∗, y|ε : A∗)
A→ (x|A : A∗, y|ε : A∗) + (x|ε : ε, y|A : A∗)
A→ ((x|AA : A∗, y|ε : A∗) + (x|A : ε, y|A : A∗)) + (x|ε : ε, y|AA : A∗)
A→ ...

Simplifications at the pattern and regular expression level.

Errata - Derivative Matching

Figure 8 “Derivative Matching”:

env(·) :: p → {Γ}

env((x|w : r)) = {{(x, w)}}   if ε ∈ L(r)
                 {}           otherwise

env((x|w : p)) = {{(x, w)} ⊎ es | es ∈ env(p)}

env((p1, p2)) = {e1 ⊎ e2 | e1 ∈ env(p1), e2 ∈ env(p2)}

env((p1 + p2)) = env(p1) ⊎ env(p2)

env(p∗) = env(p)

match(·, ·) :: p → w → {Γ}

match(p, w) = env(p\w)

There’s an issue:

In case env(p) yields {} but the pattern is empty (accepts the empty word), we should actually return {{}} instead.

Errata - Derivative Matching (2)

To explain the issue, consider the initial pattern (x|ε : (y|ε : A)∗). Building the derivative w.r.t. A yields

(x|A : (y|A : ε, (y|ε : A)∗))

Applying env() on the subpattern (y|ε : A)∗ yields {} because the underlying pattern (y|ε : A) is not empty. Clearly, (y|ε : A)∗ accepts the empty word (zero iterations). Hence, in this situation, we shouldn’t return {} (“no match”) but rather report {{}} (“empty match”).

Errata - Derivative Matching (3)

Here’s the fix:

env(·) :: p → {Γ}

env((x|w : r)) = {{(x, w)}}   if ε ∈ L(r)
                 {}           otherwise

env((x|w : p)) = envH(((x|w : p), {{(x, w)} ⊎ es | es ∈ env(p)}))

env((p1, p2)) = envH(((p1, p2), {e1 ⊎ e2 | e1 ∈ env(p1), e2 ∈ env(p2)}))

env((p1 + p2)) = envH((p1 + p2, env(p1) ⊎ env(p2)))

env(p∗) = envH((p∗, env(p)))

envH((p, e)) = {{}}   if ε ∈ L(p ↓) and e = {}
               e      otherwise
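The effect of the fix can be checked on the example pattern (x|A : (y|A : ε, (y|ε : A)∗)). The sketch below is an illustration under assumptions: patterns are encoded as nested tuples, environments are lists of (variable, word) pairs with ⊎ as list concatenation, and the `fixed` flag (not part of the erratum) switches envH on or off for comparison.

```python
EPS = ("eps",)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("phi", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])       # seq

def strip(p):
    """p↓: erase variables and recorded words, leaving a plain regex."""
    tag = p[0]
    if tag == "var":                               # (x|w : r)
        return p[3]
    if tag == "pvar":                              # (x|w : p)
        return strip(p[3])
    if tag == "pair":                              # (p1, p2)
        return ("seq", strip(p[1]), strip(p[2]))
    if tag == "choice":                            # (p1 + p2)
        return ("alt", strip(p[1]), strip(p[2]))
    return ("star", strip(p[1]))                   # p*

def env_h(p, e):
    """Report the empty match when p accepts the empty word but e is empty."""
    return [[]] if nullable(strip(p)) and e == [] else e

def env(p, fixed=True):
    tag = p[0]
    if tag == "var":
        return [[(p[1], p[2])]] if nullable(p[3]) else []
    if tag == "pvar":
        e = [[(p[1], p[2])] + es for es in env(p[3], fixed)]
    elif tag == "pair":
        e = [e1 + e2 for e1 in env(p[1], fixed) for e2 in env(p[2], fixed)]
    elif tag == "choice":
        e = env(p[1], fixed) + env(p[2], fixed)
    else:                                          # p*
        e = env(p[1], fixed)
    return env_h(p, e) if fixed else e

# (x|A : (y|A : eps, (y|eps : A)*))
p = ("pvar", "x", "A",
     ("pair", ("var", "y", "A", EPS),
              ("pstar", ("var", "y", "", ("sym", "A")))))
print(env(p, fixed=False))   # []
print(env(p))                # [[('x', 'A'), ('y', 'A')]]
```

Without envH the starred subpattern yields no environment and the whole match is lost; with it, the empty environment for zero iterations is combined with the recorded bindings.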
