A Sublinear Algorithm For Weakly Approximating Edit Distance
Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami
Presentation by Itai Dinur

Edit Distance (Levenshtein distance)

Let A,B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.

Applications

Bioinformatics

Text processing

Web search

Algorithms

Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2)

Masek and Paterson gave an improved algorithm that runs in time O(n^2/log n)
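For reference, the quadratic dynamic program can be sketched as follows. This is a minimal Python illustration of the standard Wagner-Fischer recurrence, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(|a|*|b|) dynamic program."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, so the sketch uses O(|b|) memory while still taking quadratic time.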

The Edit Distance Testing Problem

On input A,B and parameters 0<α<1, C>1:

If D(A,B)≤n^α, output CLOSE with probability at least 2/3

If D(A,B)>n/C, output FAR with probability at least 2/3

Note that the output is unrestricted for n^α<D(A,B)≤n/C. E.g., the test cannot distinguish between distances n^0.1 and n^0.9

The algorithm presented for the problem runs in time Õ(n^{max{α/2,2α-1}})

Motivation

In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings

For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant

Lower Bound

Any probabilistic algorithm for the edit distance testing problem requires Ω(n^{α/2}) queries

The algorithm presented for the problem runs in time Õ(n^{max{α/2,2α-1}}), which is close to optimal for α≤2/3
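The threshold α≤2/3 follows from comparing the two exponents in the running time. A short worked derivation, using only the quantities stated on this slide:

```latex
\frac{\alpha}{2} \ge 2\alpha - 1
\iff 1 \ge \frac{3\alpha}{2}
\iff \alpha \le \frac{2}{3}
% Hence for \alpha \le 2/3 the running time is \tilde{O}(n^{\alpha/2}),
% which matches the \Omega(n^{\alpha/2}) query lower bound up to
% polylogarithmic factors. For \alpha > 2/3 the term n^{2\alpha-1} dominates.
```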

Other Approximations

There are several papers that give better approximation results, but none run in sublinear time

Andoni and Onak give an algorithm that computes the edit distance between two strings up to a factor of 2^{Õ(√log n)} in n^{1+o(1)} time

Algorithm Overview

A recursive divide and conquer algorithm

B is broken into substrings which are recursively matched against A

The matches are pieced together to form a matching for A

It is too expensive to match all the substrings

A small number of them are sampled and matched, relying on statistical properties of the matchings

Approximate Matching

Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s’…e’], s’=s+t and D(A[s’…e’],I)≤E

A = abcd1234efgh5678

B = cd02

I = B[1…4] has a (2,1)-(approximate) matching with respect to A: with s’=s+t=3, the interval A[3…6] = cd12 is within edit distance 1 of I
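Definition 1 can be checked by brute force on the slide's example. The sketch below is illustrative only: the helper names `edit_distance` and `has_matching` are hypothetical, positions are 1-indexed as on the slide, and every end point e’ is tried, which is far from sublinear.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def has_matching(a: str, interval: str, s: int, t: int, max_err: int) -> bool:
    """True if interval (starting at 1-indexed position s of B) has a
    (t, max_err)-matching: some A[s'..e'] with s' = s + t lies within
    edit distance max_err of the interval."""
    start = s + t  # s' in the definition, 1-indexed
    if start < 1 or start > len(a) + 1:
        return False
    # Try every possible end point e' (including the empty interval).
    return any(
        edit_distance(a[start - 1:end], interval) <= max_err
        for end in range(start - 1, len(a) + 1)
    )

A = "abcd1234efgh5678"
I = "cd02"
print(has_matching(A, I, s=1, t=2, max_err=1))
```

With max_err=0 the same call returns False, since no interval of A starting at position 3 equals cd02 exactly.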

Coordinated Matching

Definition 2: Let I = (I_1,…,I_k) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals I_i ∈ I, I_i has a (t_i,E)-matching with A, where |t-t_i|≤σ

A = abcd1234efgh5678

B = cd0236gjfkl5

I has a (1,1,2,1)-coordinated matching with A

Coordinated Matching to Approximate Matching

We decompose an interval I of size S into k disjoint continuous subintervals, I=(I_1,…,I_k), each of size S’=S/k (assuming k|S)

Lemma 1: If (I_1,…,I_k) has a (t,σ,εS’,δk)-coordinated matching with A, then I has a (t,βS)-(approximate) matching with A, where β = 2σ/S’ + ε + δ
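As a purely numerical illustration of Lemma 1, with hypothetical parameter values chosen here (they do not appear in the presentation):

```latex
% Assumed values: S = 1000, k = 10, so S' = S/k = 100;
% \sigma = 5, \varepsilon = 0.05, \delta = 0.1.
% The hypothesis is a (t, 5, \varepsilon S' = 5, \delta k = 1)-coordinated matching.
\beta = \frac{2\sigma}{S'} + \varepsilon + \delta
      = \frac{2 \cdot 5}{100} + 0.05 + 0.1 = 0.25,
\qquad \beta S = 250
% So I has a (t, 250)-approximate matching with A.
```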

Approximate Matching to Coordinated Matching

Lemma 2: Let c>1 and S>cE. If I has a (t,E)-matching with A, then I=(I_1,…,I_k) has a (t,E,cE/k,k/c)-coordinated matching with A

Lemma 3: If I has a (t,E)-matching with A, and k≥E, then I=(I_1,…,I_k) has a (t,E,0,E)-coordinated matching with A

To match A and B

Decompose B into a set of continuous disjoint intervals I

Lemma 2 argues that a match for A and B gives a coordinated matching for A and I

Use a subroutine (COORD-MATCHES) to find coordinated matches for I

Lemma 1 infers the existence of good matches for B from coordinated matches for I

COORD-MATCHES

COORD-MATCHES(A,I,σ,E,D,ε,c)

Let d be a constant, l=d·log(n)

Choose samples i_1,…,i_l uniformly and independently from [1,…,k]

For each chosen sample i_j compute T_j=MATCHES(A,I_{i_j},E)

Let Δ=(D/k+ε/2)·l

Return the set T, where t ∈ T iff T_j∩[t-σ…t+σ]=Ø for at most Δ sets T_j
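The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `matches` is a hypothetical callback standing in for MATCHES(A,I_i,E), `candidate_shifts` is a hypothetical explicit list of shifts to test, the logarithm is taken as natural, and `d` and `n` are passed in directly.

```python
import math
import random

def coord_matches(intervals, matches, sigma, D, eps, d, n, candidate_shifts, rng=random):
    """Sketch of COORD-MATCHES: keep each shift t that is within sigma of a
    match for all but at most Delta of the sampled intervals."""
    k = len(intervals)
    l = max(1, math.ceil(d * math.log(n)))  # number of samples, l = d*log(n)
    # T_j = MATCHES(A, I_{i_j}, E) for uniformly sampled indices i_j.
    sample_sets = [set(matches(intervals[rng.randrange(k)])) for _ in range(l)]
    delta = (D / k + eps / 2) * l  # tolerated number of "missing" samples
    result = []
    for t in candidate_shifts:
        # Count the samples with no match in the window [t - sigma, t + sigma].
        misses = sum(
            1 for T_j in sample_sets
            if not any(abs(t - tj) <= sigma for tj in T_j)
        )
        if misses <= delta:
            result.append(t)
    return result
```

For example, if every sampled interval reports the single match shift 5 and sigma=1, the sketch returns exactly the candidates 4, 5 and 6.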

Sampling Lemma

Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S the fraction p’ of these samples with property Z satisfies p-ε/2≤p’≤p+ε/2 with probability 1-1/n^c
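One standard way to see why a constant d suffices is Hoeffding's inequality. The derivation below is a sketch under the assumptions that the samples are independent, that log denotes the natural logarithm, and that n≥2; the presentation does not state which argument the authors use.

```latex
% l = d \ln n independent samples, p' = empirical fraction with property Z.
\Pr\left[\,|p' - p| > \tfrac{\varepsilon}{2}\,\right]
  \le 2\exp\left(-2\,l\,\left(\tfrac{\varepsilon}{2}\right)^{2}\right)
  = 2\exp\left(-\tfrac{l\,\varepsilon^{2}}{2}\right)
  = 2\,n^{-d\varepsilon^{2}/2}
% Choosing d \ge 2(c+1)/\varepsilon^{2} gives
2\,n^{-d\varepsilon^{2}/2} \le 2\,n^{-(c+1)} \le n^{-c} \quad (n \ge 2)
```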

COORD-MATCHES

Lemma 5: With probability 1-1/n^{c-1} over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A,I,σ,E,D,ε,c) has the following properties:

If I has a (t,σ,E,D)-coordinated matching then t ∈ T

If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching

MATCHES(A,I,E)

If E≥1, use a recursive call to COORD-MATCHES

If E<1 (i.e., E=0), then A must contain the interval I unchanged. The set of t values is computed directly using the algorithm SHIFTS

Implementing SHIFTS

A naïve implementation of SHIFTS may give an output set T consisting of n elements

We may restrict the allowed shifts to [-n^α,…,+n^α]

However, we need a running time of o(n^α), so we must further restrict the set of possible outputs

The Approximate Matching problem

Actually, we will solve the approximate matching problem: Given a block I=B[s…e] of length b=e-s+1, and a constant c_2>1, find all indexes s’ such that A[s’…(s’+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c_2

Note that if D(A,B)<n^α, it is enough to consider s’ in the interval [s-n^α,s+n^α]

The Approximate Matching problem

Naively, we can randomly sample O(log(n)) indexes i to determine (with high probability) whether the substring A[(t+1)…(t+b)] matches I for a given t, and try all 2n^α possible shifts

This requires Ω(n^α) queries to A
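The naive approach can be sketched as below. This is an illustrative Python version under assumptions made here: positions are 0-indexed start offsets, `max_frac` plays the role of 1/c_2, `radius` plays the role of n^α, and the function names are hypothetical. Because each shift is judged from a random sample, the result may include false positives.

```python
import random

def sampled_match(a, block, t, n_samples, max_frac, rng):
    """Estimate whether a[t:t+len(block)] is Hamming-close to block by
    comparing n_samples randomly chosen positions."""
    b = len(block)
    mismatches = sum(
        a[t + i] != block[i]
        for i in (rng.randrange(b) for _ in range(n_samples))
    )
    return mismatches <= max_frac * n_samples

def naive_shifts(a, block, s, radius, n_samples, max_frac, seed=0):
    """Try every start offset within `radius` of s; one sampled test each."""
    rng = random.Random(seed)
    lo = max(0, s - radius)
    hi = min(len(a) - len(block), s + radius)
    return [t for t in range(lo, hi + 1)
            if sampled_match(a, block, t, n_samples, max_frac, rng)]

shifts = naive_shifts("dbadaabcdabddcd", "abcdab", s=5, radius=3,
                      n_samples=8, max_frac=0.25)
print(5 in shifts)  # the exact occurrence at offset 5 is always accepted
```

The loop over all offsets in the window is what makes the query count linear in the window size, which is the cost the ruler procedure is designed to avoid.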

The Ruler Procedure

We can compare pairs of characters A[i],I[j] such that some pair is compared for every difference i-j from 0 to u=2n^α, using √u queries to each string, given that b>√u

In A, the character positions divisible by √u are queried: A[√u,2√u,…,u]. In I, √u consecutive positions are queried: I[1…√u]

Define cen=⌊t/√u⌋+1 and mil=t (mod √u). Then for i=cen·√u and j=√u-mil, i-j=t
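The coverage claim can be checked with a small sketch. It assumes u is a perfect square and uses 1-indexed marks as on the slide; under that indexing the differences i-j cover the u shifts 0 through u-1. The function names are hypothetical, and the cen/mil formulas are the ones that make i-j=t hold for these marks.

```python
import math

def ruler_marks(u):
    """Return the cen marks queried in A and the mil marks queried in I."""
    r = math.isqrt(u)
    if r * r != u:
        raise ValueError("this sketch assumes u is a perfect square")
    cen_marks = [c * r for c in range(1, r + 1)]  # r, 2r, ..., u
    mil_marks = list(range(1, r + 1))             # 1, 2, ..., r
    return cen_marks, mil_marks

def marks_for_shift(t, u):
    """Return the pair (i, j) of marks with i - j == t, for 0 <= t < u."""
    r = math.isqrt(u)
    cen = t // r + 1
    mil = t % r
    return cen * r, r - mil

def covered_shifts(u):
    """All differences i - j obtainable from one cen mark and one mil mark."""
    cen_marks, mil_marks = ruler_marks(u)
    return {i - j for i in cen_marks for j in mil_marks}

print(covered_shifts(9) == set(range(9)))  # 3 + 3 marks cover 9 shifts
```

Only 2√u positions are marked in total, yet every shift in the range is the difference of exactly one cen mark and one mil mark.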

The Ruler Procedure

To test whether a block matches: pick l=Θ(log(n)) random numbers m_1,m_2,…,m_l from [0,b-√u]

For each cen and mil mark, construct a fingerprint with l offsets, e.g. f(√u)=A[√u+m_1,√u+m_2,…,√u+m_l]

Detect with high probability if a block matches with shift t by comparing the cen and mil fingerprints, i.e.

f(cen·√u)=A[cen·√u+m_1,…,cen·√u+m_l] and f(t (mod √u))=I[t (mod √u)+m_1,…,t (mod √u)+m_l]

The Ruler Procedure

If b≤√u we have only O(b) mil marks and Ω(u/b) cen marks

We can find all matching shifts by using O(max{√u,u/b}log(n)) queries

Efficient Implementation of the Ruler

We need an efficient algorithm to compare all fingerprints and return valid shifts

Efficient Implementation of the Ruler

A = dbadaabcdabddcd

B = abcdab

u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3

Fingerprint  A-List  B-List
da           3
bd           6       1
ad           9
ca                   2
db                   3
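The table above can be reproduced with a short sketch. It assumes 1-indexed positions, cen marks at multiples of √u in A, mil marks 1…√u in B, and a fingerprint formed by reading the characters at mark+m_1 and mark+m_2. The names `build_fingerprint_table` and `candidate_shifts` are hypothetical.

```python
from collections import defaultdict

def build_fingerprint_table(a, b, r, offsets):
    """Map each fingerprint to (A-list, B-list), using 1-indexed marks."""
    u = len(a) - len(b)
    table = defaultdict(lambda: ([], []))
    for i in range(r, u + 1, r):  # cen marks r, 2r, ..., u in A
        fp = "".join(a[i + m - 1] for m in offsets)
        table[fp][0].append(i)
    for j in range(1, r + 1):     # mil marks 1, ..., r in B
        fp = "".join(b[j + m - 1] for m in offsets)
        table[fp][1].append(j)
    return dict(table)

def candidate_shifts(table):
    """Shifts i - j for every fingerprint present in both lists."""
    return sorted({i - j
                   for a_list, b_list in table.values()
                   for i in a_list for j in b_list})

table = build_fingerprint_table("dbadaabcdabddcd", "abcdab", r=3, offsets=(1, 3))
print(table["bd"])              # ([6], [1])
print(candidate_shifts(table))  # [5]
```

The only fingerprint with entries in both lists is bd, giving shift 6-1=5, and B does occur in A starting at position 6. Grouping positions by fingerprint avoids comparing every cen mark against every mil mark.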

Quantizing the Ruler

The explicit list of all matching t can have Ω(u) values

We round the values of t to multiples of some integer Q and return all quantized shifts

The running time is O(max{√u,u/b,u/Q}log(n))
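Quantization can be illustrated with a few lines of Python. The slide does not say how values are rounded, so this sketch assumes rounding down to the nearest multiple of Q; the function name `quantize_shifts` is hypothetical.

```python
def quantize_shifts(shifts, q):
    """Round each shift down to a multiple of q and drop duplicates."""
    return sorted({(t // q) * q for t in shifts})

print(quantize_shifts([5, 6, 7, 12, 13], 4))  # [4, 12]
```

Shifts in a range of width u collapse to at most about u/Q+1 distinct values, which is where the u/Q term in the running time comes from.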

SHIFTS(A,I,Q)

Initialize the fingerprint data structure

Pick l=Θ(log(n)) random numbers m_1,m_2,…,m_l

Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i)

Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j)

Quantize all A-lists and B-lists

For each fingerprint, output the list of quantized shifts (differences)

SHIFTS(A,I,Q)

Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b

MATCHES(A,I,E)

If E<1, use SHIFTS to compute T

If E≥1:

Set k=min{εn^{1-α},2c_1E}

Decompose I into a set I of continuous disjoint intervals of size |I|/k

Compute T=COORD-MATCHES(A,I,E,c_1E/k,k/c_1)

Return T

DECIDE(A,B,α,C)

Choose sufficiently small ε, and sufficiently large c_1 (given α,C)

Let the quantization parameter be Q=ε·min{n^{1-α},n^{α/2}}

Set T = MATCHES(A,B,n^α)

If T is nonempty, output CLOSE, otherwise output FAR

DECIDE(A,B,α,C)

For any fixed α<1, we can choose constants ε and c1 such that procedure DECIDE solves the edit distance testing problem with high probability

Running Time Analysis

Note that when k=2c_1E, COORD-MATCHES is called with edit distance parameter c_1E/k=1/2<1, i.e. the next call to MATCHES will call SHIFTS and end the recursion

At each level, the length of the interval input to MATCHES goes down by a factor of k=Ω(n^{1-α}). After r=α/(1-α) levels the intervals are of length n/n^{r(1-α)}=O(n^{1-α}) and E=O(n^α/n^{r(1-α)})=O(1), so SHIFTS will be called next

Running Time Analysis α<1/2

One level of recursion

B is broken into intervals of size O(n^α)

d·log(n) calls to SHIFTS with Q=εn^{α/2}

Each call takes O(max{√u,u/b,u/Q}·log(n)) = O(max{n^{α/2},1,n^{α/2}}·log(n)) = O(n^{α/2}·log(n))

One merge taking O(n^{α/2}·log(n))

Total running time O(n^{α/2}·log^2(n))

Running Time Analysis 1/2<α<2/3

Two levels of recursion

At the last level, B is broken into intervals of size O(n^{α/2})

log^2(n) calls to SHIFTS with Q=εn^{α/2}

Each call takes O(n^{α/2}·log(n))

log(n) merges, each taking O(n^{α/2}·log(n))

Total running time O(n^{α/2}·log^3(n))

Running Time Analysis α>2/3

r>2 levels of recursion

At the last level, B is broken into intervals of size O(n^{1-α})

log^{O(1)}(n) calls to SHIFTS with Q=εn^{1-α}

Note that n^{1-α}<n^{α/2}

Each call takes O(max{√u,u/b,u/Q}·log(n)) = O((u/b)·log(n)) = O(n^{2α-1}·log(n))

Total running time Õ(n^{2α-1})
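The dominant term in this case can be checked by substituting the parameters into the three quantities in the bound for SHIFTS. The worked computation below assumes u=Θ(n^α), since shifts are restricted to [-n^α,…,+n^α], and b=Θ(n^{1-α}); constants are dropped.

```latex
% u = n^{\alpha}, \quad b = n^{1-\alpha}, \quad Q = \varepsilon\, n^{1-\alpha}
\sqrt{u} = n^{\alpha/2}, \qquad
\frac{u}{b} = n^{\alpha-(1-\alpha)} = n^{2\alpha-1}, \qquad
\frac{u}{Q} = \frac{1}{\varepsilon}\, n^{2\alpha-1}
% For constant \varepsilon the last two terms are both \Theta(n^{2\alpha-1}), and
2\alpha - 1 > \frac{\alpha}{2} \iff \alpha > \frac{2}{3}
% so for \alpha > 2/3 the maximum is \Theta(n^{2\alpha-1}).
```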

Conclusion

We saw an algorithm for the edit distance testing problem that runs in time Õ(n^{max{α/2,2α-1}})

Any probabilistic algorithm for the edit distance testing problem requires Ω(n^{α/2}) queries