Regular Expression Sub-Matching using Partial Derivatives
Martin Sulzmann (Hochschule Karlsruhe), Kenny Zhuo Ming Lu (Nanyang Polytechnic)
Regular Expression Sub-Matching using Partial Derivatives – p. 1/18
Regular Expressions - The Basics
Words: w ∈ Σ∗
Regular expressions
r ::= r + r    Choice
    | r r      Concatenation
    | r∗       Kleene star
    | ε        Empty word
    | φ        Empty language
    | l ∈ Σ    Letters
(A + (BC))∗ denotes a regular language
L( (A + (BC))∗ ) = {ε, A, BC, ABC, ...}
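The grammar and the language L(r) can be made concrete with a small sketch. The tuple encoding and the function name `words` are assumptions of this example, not part of the talk; the function simply enumerates the words of L(r) up to a length bound.

```python
# Regular expressions as nested tuples (a representation chosen for this sketch).
PHI = ("phi",)                        # empty language
EPS = ("eps",)                        # empty word
def sym(c): return ("sym", c)         # single letter
def alt(r, s): return ("alt", r, s)   # choice r + s
def seq(r, s): return ("seq", r, s)   # concatenation r s
def star(r): return ("star", r)       # Kleene star r*

def words(r, n):
    """Return all words of L(r) whose length is at most n."""
    tag = r[0]
    if tag == "phi":
        return set()
    if tag == "eps":
        return {""}
    if tag == "sym":
        return {r[1]} if n >= 1 else set()
    if tag == "alt":
        return words(r[1], n) | words(r[2], n)
    if tag == "seq":
        return {u + v for u in words(r[1], n) for v in words(r[2], n - len(u))}
    # star: keep appending non-empty words of the body until nothing new fits
    result, frontier = {""}, {""}
    while frontier:
        frontier = {u + v for u in frontier
                    for v in words(r[1], n - len(u)) if v} - result
        result |= frontier
    return result

example = star(alt(sym("A"), seq(sym("B"), sym("C"))))   # (A + (BC))*
print(sorted(words(example, 3), key=lambda w: (len(w), w)))
```

With the bound 3 this lists ε, A, AA, BC, AAA, ABC and BCA, which extends the set shown on the slide by the remaining short words.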
Regular Expression Sub-Matching
Matching
w matches r iff w ∈ L(r)
ABAAC matches (A + AB)(BAA + A)(AC + C)
L( (A + AB)(BAA + A)(AC + C) ) = {ABAAAC, ABAAC, AAAC, AAC, ABBAAAC, ABBAAC, ABAC}
Which sub-parts are matched?
(x1 : (A + AB))(x2 : (BAA + A))(x3 : (AC + C))
Sub-matchings for ABAAC are
x1 = AB
x2 = A
x3 = AC
Now that the difference is clear: Matching ≠ Sub-matching
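To make the difference between matching and sub-matching tangible, here is a brute-force sketch. It assumes every sub-pattern is a finite choice of literal words, as in the example above; the name `submatches` and the list-of-pairs encoding are made up for illustration and are not the algorithm of the talk.

```python
from itertools import product

def submatches(word, parts):
    """Enumerate all variable bindings under which the concatenated parts spell `word`.

    `parts` is a list of (variable, set of alternative words)."""
    results = []
    for pieces in product(*[sorted(alts) for _, alts in parts]):
        if "".join(pieces) == word:
            results.append({var: piece for (var, _), piece in zip(parts, pieces)})
    return results

# (x1 : (A + AB))(x2 : (BAA + A))(x3 : (AC + C))
pattern = [("x1", {"A", "AB"}), ("x2", {"BAA", "A"}), ("x3", {"AC", "C"})]

for binding in submatches("ABAAC", pattern):
    print(binding)
```

The word ABAAC has two sub-matchings: the one listed on the slide and x1 = A, x2 = BAA, x3 = C. A plain yes/no match hides this ambiguity, so a sub-matching policy such as POSIX or greedy left-most has to pick one.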
Slow Regular Expression Matching
Example: (x : A?^n)(y : A^n), where A^n means “A n-times” and r? = ε + r
Consider n = 2
AA ⊢ (x : A?A?)(y : AA)
Try x = AA: nothing is left for y
Fail ⇒ Backtrack
Try x = A: only A is left for y
...
Try x = ε: y = AA
Success, but the number of attempts grows exponentially in n
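The blow-up can be reproduced with a tiny backtracking matcher. This is a sketch under assumptions: A? is encoded so that the matcher tries to consume A before trying ε (the greedy order that triggers the worst case), and `backtrack`, `pathological` and `run` are hypothetical names.

```python
EPS = ("eps",)
A = ("sym", "A")

def chain(parts):
    """Right-nested concatenation of a non-empty list of regexes."""
    result = parts[-1]
    for r in reversed(parts[:-1]):
        result = ("seq", r, result)
    return result

def pathological(n):
    # A? is written as alt(A, eps): consuming A is tried first (assumed greedy order).
    return chain([("alt", A, EPS)] * n + [A] * n)

def backtrack(r, w, i, k, steps):
    """Match r against w from position i, then continue with k; count every attempt."""
    steps[0] += 1
    tag = r[0]
    if tag == "eps":
        return k(i)
    if tag == "sym":
        return i < len(w) and w[i] == r[1] and k(i + 1)
    if tag == "alt":
        return backtrack(r[1], w, i, k, steps) or backtrack(r[2], w, i, k, steps)
    if tag == "seq":
        return backtrack(r[1], w, i, lambda j: backtrack(r[2], w, j, k, steps), steps)
    raise ValueError(f"unsupported regex node: {tag}")

def run(n):
    """Match A^n against A?^n A^n and report (matched, number of attempts)."""
    steps = [0]
    ok = backtrack(pathological(n), "A" * n, 0, lambda j: j == n, steps)
    return ok, steps[0]

for n in (4, 8, 12):
    print(n, run(n))
```

Every one of the 2^n ways of choosing which A? consume a letter is explored before the only successful one (all of them skip) is reached, so the attempt counter is at least 2^n.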
Fast Regular Expression Matching
For brevity, we ignore sub-matching locations
Convert A?A?AA to NFA
[Diagram: NFA for A?A?AA with A-labelled transitions; the states active during the search are marked •]
Simultaneous search for match AA
No backtracking, linear search
Number of states to track is linear in the size of the regular expression
Linear complexity!
So far Thompson and Glushkov NFA construction
See Russ Cox, Alain Frisch et al., Ville Laurikari, ...
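The simultaneous search can be sketched as a state-set simulation. The automaton below is hand-built for A?^n A^n (a chain of 2n+1 states with ε-shortcuts over the optional letters); it is an assumption of this example and neither a Thompson nor a Glushkov construction.

```python
def build_nfa(n):
    """Chain NFA for A?^n A^n: states 0..2n, accepting state 2n."""
    char_edges = {i: i + 1 for i in range(2 * n)}   # state i --A--> i+1
    eps_edges = {i: i + 1 for i in range(n)}        # the first n letters may be skipped
    return char_edges, eps_edges, 2 * n

def closure(states, eps_edges):
    """Add every state reachable through epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        target = eps_edges.get(stack.pop())
        if target is not None and target not in seen:
            seen.add(target)
            stack.append(target)
    return seen

def nfa_match(n, word):
    """Track all reachable states at once; no backtracking is needed."""
    char_edges, eps_edges, accept = build_nfa(n)
    current = closure({0}, eps_edges)
    for c in word:
        step = {char_edges[s] for s in current if c == "A" and s in char_edges}
        current = closure(step, eps_edges)
    return accept in current

print(nfa_match(2, "AA"), nfa_match(30, "A" * 30))
```

Each input letter costs at most one pass over the current state set, which never holds more than 2n+1 states, so even n = 30 is answered immediately, whereas a backtracking search may need on the order of 2^30 attempts.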
Our Contributions
Matching automata construction based on
  Brzozowski’s Derivatives (DFA)
  Antimirov’s Partial Derivatives (NFA)
Fast and elegant algorithms for
  POSIX matching
  greedy left-most matching
Implementation in Haskell supporting real-world regular expressions
Brzozowski’s Derivatives
r\l “take away the leading l”
r\l derivative of r w.r.t. l
L(r\l) = {w | lw ∈ L(r)}
Compute r\l by induction, e.g.
  A\A = ε, B\A = φ, r∗\l = (r\l)r∗
  r1r2\l = (r1\l)r2 + r2\l if ε ∈ L(r1), and (r1\l)r2 otherwise
Matching derivation:
  r1 -l→ r2 iff r2 = r1\l
  w = l1...ln: check if r -l1→ ... -ln→ r′ where ε ∈ L(r′)
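The inductive definition translates almost line by line into code. The following is a minimal sketch of derivative-based matching without sub-matching; the tuple encoding and the helper names `word` and `choice` are assumptions of this example.

```python
PHI, EPS = ("phi",), ("eps",)

def word(s):
    """Regex matching exactly the non-empty string s."""
    r = ("sym", s[-1])
    for c in reversed(s[:-1]):
        r = ("seq", ("sym", c), r)
    return r

def choice(first, second):
    """Choice between two literal words."""
    return ("alt", word(first), word(second))

def nullable(r):
    """Does L(r) contain the empty word?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False                                   # phi and sym

def deriv(r, l):
    """Brzozowski derivative r\\l: take away the leading letter l."""
    tag = r[0]
    if tag in ("phi", "eps"):
        return PHI
    if tag == "sym":
        return EPS if r[1] == l else PHI
    if tag == "alt":
        return ("alt", deriv(r[1], l), deriv(r[2], l))
    if tag == "seq":
        left = ("seq", deriv(r[1], l), r[2])
        return ("alt", left, deriv(r[2], l)) if nullable(r[1]) else left
    return ("seq", deriv(r[1], l), r)              # star: (r\l) r*

def matches(r, w):
    """w matches r iff the derivative w.r.t. all letters of w is nullable."""
    for l in w:
        r = deriv(r, l)
    return nullable(r)

# (A + AB)(BAA + A)(AC + C)
pattern = ("seq", choice("A", "AB"), ("seq", choice("BAA", "A"), choice("AC", "C")))
print(matches(pattern, "ABAAC"), matches(pattern, "ABC"))
```

Without simplification the derivatives grow with every step, which is why practical implementations normalise them.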
Matching with Derivatives
Example: Match AB against (x : A + y : AB + z : B)∗
(A + AB + B)∗   {x : ε, y : ε, z : ε}
A→ (A + AB + B)∗\A
 = ((A + AB + B)\A) (A + AB + B)∗
 = (A\A + AB\A + B\A) (A + AB + B)∗
 = (ε + B)_1 (A + AB + B)∗   {x1 : A, y1 : A, z1 : ε}
Record matchings for each iteration
Paper records matchings within pattern
B→ (p1 p2)\B, where p1 = (ε + B)_1 and p2 = (A + AB + B)∗
(p1 p2)\l = (p1\l, p2) + (empty(p1) p2\l) if p1 empty
Choice of matchings. Don’t drop p1, keep p1 and make p1 “empty”: (ε + B)_1 ⇒ (ε + φ)_1, so p1 won’t contribute further matchings
 = ((ε + B)_1\B (A + AB + B)∗) + (ε + φ)_1 ((A + AB + B)∗\B)
 ... = ε_1 (A + AB + B)∗ + (ε + φ)_1 ε_2 (A + AB + B)∗
{y1 : AB} and {x1 : A, z2 : B}
POSIX and greedy left-most match
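The two resulting matchings can be cross-checked by brute force. The sketch below does not use derivatives; it enumerates every way of splitting the word into iterations of a starred choice of labelled literals, then picks one matching per policy. The function names and the simplified selection rules (longest iterations first for POSIX, first alternative first for greedy) are assumptions that hold for this flat example only.

```python
def star_matchings(word, alternatives):
    """All ways to split `word` into iterations of (x1 : w1 + ... + xk : wk)*."""
    if word == "":
        return [[]]
    results = []
    for var, literal in alternatives:              # alternatives are tried left to right
        if literal and word.startswith(literal):
            for rest in star_matchings(word[len(literal):], alternatives):
                results.append([(var, literal)] + rest)
    return results

def label(matching):
    """Attach the iteration index to each variable, as in x1, z2 on the slide."""
    return {f"{var}{i}": lit for i, (var, lit) in enumerate(matching, start=1)}

alternatives = [("x", "A"), ("y", "AB"), ("z", "B")]   # (x : A + y : AB + z : B)*
all_matchings = star_matchings("AB", alternatives)

# Simplified policies for this example only:
posix = max(all_matchings, key=lambda m: [len(lit) for _, lit in m])   # longest iterations first
greedy = all_matchings[0]                                              # left-most alternative first

print("POSIX :", label(posix))
print("greedy:", label(greedy))
```

The POSIX policy selects {y1 : AB} and the greedy left-most policy selects {x1 : A, z2 : B}, matching the order in which the slide lists them.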
Derivatives Matching Summary
Computes all matchings ⇒ exponential complexity
Optimization:
Simplify, e.g. aggressively to the left
(1) r + r ⇒ r keep left r
(2) φ r ⇒ φ
(3) ε r ⇒ r
p1 + p2, where p1 = ε_1 (A + AB + B)∗ and p2 = (ε + φ)_1 ε_2 (A + AB + B)∗
⇒∗ ε_1 (A + AB + B)∗ + ε_2 (A + AB + B)∗
⇒ ε_1 (A + AB + B)∗
⇒ (A + AB + B)∗
Yields POSIX match
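The rewrite rules can be sketched on plain regular expressions, ignoring the iteration annotations. Besides rules (1) to (3), the sketch assumes two extra rules, φ + r ⇒ r and r + φ ⇒ r, which are needed to reduce (ε + φ) to ε; they are not listed on the slide.

```python
PHI, EPS = ("phi",), ("eps",)

def simp(r):
    """Simplify bottom-up, preferring the left operand when both sides agree."""
    tag = r[0]
    if tag == "alt":
        a, b = simp(r[1]), simp(r[2])
        if a == b or b == PHI:        # (1) r + r => r (keep left); assumed: r + phi => r
            return a
        if a == PHI:                  # assumed: phi + r => r
            return b
        return ("alt", a, b)
    if tag == "seq":
        a, b = simp(r[1]), simp(r[2])
        if a == PHI:                  # (2) phi r => phi
            return PHI
        if a == EPS:                  # (3) eps r => r
            return b
        return ("seq", a, b)
    if tag == "star":
        return ("star", simp(r[1]))
    return r                          # phi, eps and letters are already simple

A, B = ("sym", "A"), ("sym", "B")
R = ("star", ("alt", A, ("alt", ("seq", A, B), B)))          # (A + AB + B)*

# eps (A + AB + B)*  +  (eps + phi) eps (A + AB + B)*
example = ("alt", ("seq", EPS, R), ("seq", ("seq", ("alt", EPS, PHI), EPS), R))
print(simp(example) == R)
```

Applied to the final derivative of the previous example, the simplifier collapses both summands to (A + AB + B)∗ and keeps the left one.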
Matching with Partial Derivatives
Derivatives represent states of a DFA
  ·\· : r ↦ L ↦ r
  (A + AB + B)∗\A = (ε + B) (A + AB + B)∗
  On the fly DFA construction.
Partial derivatives represent states of an NFA
  ·\p· : r ↦ L ↦ 2^r
  L(r\l) = L(r1 + ... + rn) where r\p l = {r1, ..., rn}
  (A + AB + B)∗\p A = {ε(A + AB + B)∗, B(A + AB + B)∗}
  Set of partial derivatives finite and linear in size of regular expression.
  Build NFA match automata.
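Partial derivatives return a set of regular expressions instead of a single one. The sketch below follows Antimirov's definition on a tuple encoding chosen for this example; the concatenation helper normalises ε r to r, so ε(A + AB + B)∗ from the slide appears simply as (A + AB + B)∗.

```python
PHI, EPS = ("phi",), ("eps",)

def seq(r, s):
    """Concatenation that normalises eps r to r (an assumption of this sketch)."""
    return s if r == EPS else ("seq", r, s)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False                                    # phi and sym

def pderiv(r, l):
    """Partial derivative r \\p l: the set of NFA successor states of r under l."""
    tag = r[0]
    if tag in ("phi", "eps"):
        return set()
    if tag == "sym":
        return {EPS} if r[1] == l else set()
    if tag == "alt":
        return pderiv(r[1], l) | pderiv(r[2], l)
    if tag == "seq":
        left = {seq(p, r[2]) for p in pderiv(r[1], l)}
        return left | pderiv(r[2], l) if nullable(r[1]) else left
    return {seq(p, r) for p in pderiv(r[1], l)}     # star

def pd_match(r, w):
    """Run the partial-derivative NFA on w by tracking the set of current states."""
    states = {r}
    for l in w:
        states = {q for p in states for q in pderiv(p, l)}
    return any(nullable(q) for q in states)

A, B = ("sym", "A"), ("sym", "B")
R = ("star", ("alt", A, ("alt", seq(A, B), B)))     # (A + AB + B)*
print(pderiv(R, "A") == {R, seq(B, R)}, pd_match(R, "AB"))
```

Because the states are sub-terms of the original expression, the state set stays small, which is the basis for the linear bound stated above.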
Matching with Partial Derivatives
(Sketch of) NFA match automata
[Diagram: NFA with states p1, p2, p3, ...; A-transitions lead from p1 to p2 and from p1 to p3]
p1 \p A = {p2, p3}, where p1 = (A + AB + B)∗, p2 = ε(A + AB + B)∗, p3 = B(A + AB + B)∗
Depth-first left-most traversal ⇒ greedy left-most match
Not POSIX because structure is broken apart
See paper for details of construction
POSIX versus Greedy Left-Most Match
Derivatives for POSIX matching
  POSIX = maximal match w.r.t. structure
  AB matches (A + AB + B)∗
  (A + AB + B)∗\A = (ε + B) (A + AB + B)∗
Partial derivatives for greedy left-most matching
  Greedy left-most = maximal match ignoring any structure
  AB matches (A + AB + B)∗
  (A + AB + B)∗\p A = {ε(A + AB + B)∗, B(A + AB + B)∗}
Implementation
Fine-tuned Haskell implementation of greedy left-most using partial derivative NFAs.
Real-world extensions: Group matchings, anchored match, ...
Competitive performance (see paper for details):
  C-based: RE2, PCRE
  Haskell-based: Weighted, TDFA
Partial derivative NFA construction “smaller” compared to Thompson and Glushkov NFA construction
Reference implementation of Thompson, Glushkov and Partial Derivative NFA construction
Conclusion
First application of derivatives and partial derivatives for regular expression sub-matching
Future work:
  Implementation in other languages
  Efficient POSIX implementation
    Tricky for NFA but see “backwards scanning” trick by Russ Cox
    Exploiting laziness of the on-the-fly derivative construction
  Error explanation
    Why is there no match?
Errata - Simplifications
Page 7, left column: The derivation should be as follows, where we have underlined the corrected parts and expressions involving φ have already been removed.
(x|ε : A∗, y|ε : A∗)
A→ (x|A : A∗, y|ε : A∗) + (x|ε : ε, y|A : A∗)
A→ ((x|AA : A∗, y|ε : A∗) + (x|A : ε, y|A : A∗)) + (x|ε : ε, y|AA : A∗)
A→ ...
Simplifications at the pattern and regular expression level.
Errata - Derivative Matching
Figure 8 “Derivative Matching”:
env(·) :: p → {Γ}
env((x|w : r)) = {{(x, w)}} if ε ∈ L(r)
                 {} otherwise
env((x|w : p)) = {{(x, w)} ⊎ es | es ∈ env(p)}
env((p1, p2)) = {e1 ⊎ e2 | e1 ∈ env(p1), e2 ∈ env(p2)}
env((p1 + p2)) = env(p1) ⊎ env(p2)
env(p∗) = env(p)
match(·, ·) :: p → w → {Γ}
match(p, w) = env(p\w)
There’s an issue:
In case env(p) yields {} but the pattern is empty, we should actually return {{}} instead.
Errata - Derivative Matching (2)
To explain the issue, consider the initial pattern (x|ε : (y|ε : A)∗).
Building the derivative w.r.t. A yields
(x|A : (y|A : ε, (y|ε : A)∗))
Applying env() on the subpattern (y|ε : A)∗ yields {} because the underlying pattern (y|ε : A) is not empty.
Clearly, (y|ε : A)∗ contains the empty word (zero iterations). Hence, in this situation, we shouldn’t return {} (“no match”) but rather report {{}} (“empty match”).
Errata - Derivative Matching (3)
Here’s the fix:
env(·) :: p → {Γ}
env((x|w : r)) = {{(x, w)}} if ε ∈ L(r)
                 {} otherwise
env((x|w : p)) = envH(((x|w : p), {{(x, w)} ⊎ es | es ∈ env(p)}))
env((p1, p2)) = envH(((p1, p2), {e1 ⊎ e2 | e1 ∈ env(p1), e2 ∈ env(p2)}))
env((p1 + p2)) = envH((p1 + p2, env(p1) ⊎ env(p2)))
env(p∗) = envH((p∗, env(p)))
envH((p, e)) = {{}} if ε ∈ L(p ↓) and e = {}
               e otherwise
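The effect of the fix can be checked on the pattern from the previous slide. The sketch below encodes patterns as tuples and environments as tuples of (variable, word) pairs; the tags `var`, `varp`, `pair`, `choice`, `pstar` and the `fixed` flag are assumptions of this example. With `fixed=False` it follows the original Figure 8 definition, with `fixed=True` it applies envH.

```python
EPS = ("eps",)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False                                   # phi and sym

def strip(p):
    """p ↓ : forget variables and recorded words, leaving a plain regex."""
    tag = p[0]
    if tag == "var":
        return p[3]
    if tag == "varp":
        return strip(p[3])
    if tag == "pair":
        return ("seq", strip(p[1]), strip(p[2]))
    if tag == "choice":
        return ("alt", strip(p[1]), strip(p[2]))
    return ("star", strip(p[1]))                   # pstar

def env(p, fixed=True):
    """List of environments; [] means "no match", [()] means "empty match"."""
    tag = p[0]
    if tag == "var":                               # (x|w : r)
        return [((p[1], p[2]),)] if nullable(p[3]) else []
    if tag == "varp":                              # (x|w : p)
        e = [((p[1], p[2]),) + es for es in env(p[3], fixed)]
    elif tag == "pair":                            # (p1, p2)
        e = [e1 + e2 for e1 in env(p[1], fixed) for e2 in env(p[2], fixed)]
    elif tag == "choice":                          # (p1 + p2)
        e = env(p[1], fixed) + env(p[2], fixed)
    else:                                          # p*
        e = env(p[1], fixed)
    if fixed and not e and nullable(strip(p)):     # envH
        return [()]
    return e

# (x|A : (y|A : eps, (y|eps : A)*))
derivative = ("varp", "x", "A",
              ("pair", ("var", "y", "A", EPS),
                       ("pstar", ("var", "y", "", ("sym", "A")))))

print(env(derivative, fixed=False))   # original definition
print(env(derivative, fixed=True))    # with envH
```

The original definition reports no environment at all, while the corrected one reports the single environment binding x and y to A.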