Proximity algorithms for nearly-doubling spaces

Lee-Ad Gottlieb

Robert Krauthgamer

Weizmann Institute

Proximity problems

In an arbitrary metric space, some proximity problems are hard.

For example, the nearest neighbor search problem requires Θ(n) time in the worst case.

The doubling dimension parameterizes the “bad” case…

[Figure: a query point q with the other points each at distance ~1.]


Doubling dimension

Definition: Ball B(x,r) = all points within distance r from x.

The doubling constant λ(M) of a metric M is the minimum value λ such that every ball can be covered by λ balls of half the radius.
First used by [Ass-83], algorithmically by [Cla-97].
The doubling dimension is dim(M) = log₂ λ(M) [GKL-03].
A metric is doubling if its doubling dimension is constant.
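To make the definition concrete, here is a minimal Python sketch that estimates the doubling constant of a finite point set. It is an illustration, not part of the talk. The names `greedy_half_cover` and `estimate_doubling_constant` are hypothetical, and points are assumed to be coordinate tuples under the Euclidean metric. Only balls centered at data points, with radii equal to pairwise distances, are tried, and the greedy cover need not be a minimum cover. The result is therefore a heuristic estimate, not the exact value of λ.

```python
import math
from itertools import combinations


def greedy_half_cover(points, center, r, dist=math.dist):
    """Greedily pick centers so that every point of B(center, r)
    lies within r/2 of some picked center."""
    ball = [p for p in points if dist(center, p) <= r]
    centers = []
    for p in ball:
        # p becomes a new center only if no existing center covers it.
        if all(dist(p, c) > r / 2 for c in centers):
            centers.append(p)
    return centers


def estimate_doubling_constant(points, dist=math.dist):
    """Largest greedy cover size over the tested balls (heuristic estimate)."""
    radii = {dist(p, q) for p, q in combinations(points, 2)}
    best = 1
    for x in points:
        for r in radii:
            best = max(best, len(greedy_half_cover(points, x, r, dist)))
    return best
```

Every point of the ball that is not picked as a center is within r/2 of a picked center, so the picked centers always give a valid cover of that ball by balls of half the radius.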

Packing property of doubling spaces

A set with diameter D and minimum inter-point distance a contains at most (D/a)^O(log λ) points.

Here λ ≤ 7.
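One way to see where the packing bound comes from is to apply the doubling property repeatedly. The derivation below is a sketch under stated assumptions: S is the point set, it has at least two points, balls are closed, and the choice of i is ours.

```latex
% S has diameter D and minimum inter-point distance a; \lambda is the doubling constant.
% S lies in one ball of radius D. Halving the radius i times covers it with \lambda^i balls:
S \subseteq B(x, D) \subseteq \bigcup_{j=1}^{\lambda^{i}} B\!\left(x_j, \frac{D}{2^{i}}\right).
% Choose i = \lfloor \log_2 (2D/a) \rfloor + 1, so that D / 2^{i} < a/2.
% Two points in the same small ball would then be at distance < a,
% so each small ball holds at most one point of S:
|S| \le \lambda^{i} \le \lambda^{\log_2 (2D/a) + 1}
     = \left(\frac{4D}{a}\right)^{\log_2 \lambda}
     = \left(\frac{D}{a}\right)^{O(\log \lambda)} \qquad (D \ge 2a).
```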


Applications

In the past few years, many algorithmic tasks have been analyzed via the doubling dimension.
For example, approximate nearest neighbor search can be executed in time λ^O(1) log n.

Some other algorithms analyzed via the doubling dimension:
Nearest neighbor search [KL-04, BKL-06, CG-06]
Clustering [Tal-04, ABS-08, FM-10]
Spanner construction [GGN-06, CG-06, DPP-06, GR-08]
Routing [KSW-04, Sil-05, AGGM-06, KRXY-07, KRX-08]
Travelling Salesperson [Tal-04]
Machine learning [BLL-09, GKK-10]

Message: This is an active line of research…


Problem

Most algorithms developed for doubling spaces are not robust.
Algorithmic guarantees don’t hold for nearly-doubling spaces.
If a small fraction of the working set possesses high doubling dimension, algorithmic performance degrades.

This problem motivates the following key task:
Given an n-point set S and target dimension d*,
remove from S the fewest points so that the remaining set has doubling dimension at most d*.


Two paradigms

How can removing a few “bad” points help? Two models:

1. Ignore the bad points
Outlier detection: [GHPT-05] cluster based on similarity, seeking a large subset with low intrinsic dimension.
Algorithms with slack: throw the bad points into the slack.
[KRXY-07] gave a routing algorithm with guarantees for most of the input points.
[FM-10] gave a kinetic clustering algorithm for most of the input points.
[GKK-10] gave a machine learning algorithm in which a small subset doesn’t interfere with learning.


Two paradigms (continued)

2. Tailor a different algorithm for the bad points
Example: spanner construction. A spanner is an edge subset of the full graph.
Good points: low doubling dimension gives a sparse spanner with nice properties (low stretch and degree).
Bad points: take the full graph.
If the number of bad points is O(n^0.5), we have a spanner with O(n) edges.


Results

Recall our key problem:
Given an n-point set S and target dimension d*,
remove from S the fewest points so that the remaining set has doubling dimension at most d*.

This problem is NP-hard.
Even determining the doubling dimension of a point set exactly is NP-hard! Proof on the next slide.
But the doubling dimension can be approximated within a constant factor…

Our contribution: a bicriteria approximation algorithm.
In time 2^O(d*) n^3, we remove a number of points arbitrarily close to optimal, while achieving doubling dimension 4d* + O(1).
We can also achieve near-linear runtime, at the cost of slightly higher dimension.


Warm up

Lemma: It is NP-hard to determine the doubling dimension of a set S.

Reduction: from vertex cover with bounded degree Δ = n^(1/2).
The size of any vertex cover is at least n^(1/2).

Construction: A set S of n points corresponding to the vertex set V.
Let d(u,v) = ½ if the corresponding vertices are connected by an edge.
Let d(u,v) = 1 if the corresponding vertices aren’t connected.

Analysis:
Any subset of S found in a ball of radius ½ has at most n^(1/2) points (the degree of the original graph).
S is a ball of radius 1.
The minimum covering of all of S with balls of radius ½ is equal to the minimum vertex cover of V.

Note: the reduction preserves hardness of approximation.
Corollary: It is NP-hard to determine if removing k points from S can leave a set with doubling dimension d*. So our problem is hard as well.
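The construction can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `reduction_metric`, `is_metric` and `half_ball` are hypothetical, and vertices are assumed to be numbered 0..n-1. It builds the two-valued distance function from an edge list and checks that it is a metric. The triangle inequality holds automatically because every nonzero distance is ½ or 1, and 1 ≤ ½ + ½.

```python
from itertools import product


def reduction_metric(edges):
    """Return d(u, v): 1/2 across an edge, 1 for a non-edge, 0 when u == v."""
    edge_set = {frozenset(e) for e in edges}

    def d(u, v):
        if u == v:
            return 0.0
        return 0.5 if frozenset((u, v)) in edge_set else 1.0

    return d


def is_metric(n, d):
    """Check symmetry and the triangle inequality on all vertex triples."""
    for u, v, w in product(range(n), repeat=3):
        if d(u, v) != d(v, u) or d(u, w) > d(u, v) + d(v, w):
            return False
    return True


def half_ball(n, d, v):
    """Points within distance 1/2 of v: v itself and its graph neighbours."""
    return {u for u in range(n) if d(u, v) <= 0.5}
```

For the path 0-1-2-3, `half_ball(4, d, 1)` contains vertex 1 and its two neighbours, which illustrates how a radius-½ ball mirrors a vertex neighbourhood in the original graph.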

[Figure: example instance with distances ½ and 1.]


Bicriteria algorithm

Recall that the doubling constant λ(M) of a metric M is the minimum value λ such that every r-radius ball can be covered by λ balls of half the radius.

Define the related notion of the density constant μ(M) as the minimum value μ > 0 such that every r-radius ball contains at most μ points at mutual interpoint distance at least r/2.
Nice property: the density constant can only decrease under the removal of points, unlike the doubling constant.

We can show that
√μ(S) ≤ λ(S) ≤ μ(S)
it is NP-hard to compute the density constant (ratio-preserving reduction from independent set)

[Figure: example with λ = 2, μ = 3.]


Bicriteria algorithm

We will give a bicriteria algorithm for the density constant.

Problem statement:
Given an n-point set S and target density constant μ*,
remove from S the fewest points so that the remaining set has density constant at most μ*.

A bicriteria algorithm for the density constant is itself a bicriteria algorithm for the doubling constant, within a quadratic factor.


Witness set

Given a set S, a subset S’ is a witness set for the density constant if, for some radius r, all points of S’ lie in one r-radius ball and are at mutual interpoint distance at least r/2.
Note that S’ is a concise proof that the density constant of S is at least |S’|.

Theorem: Fix a value μ’ < μ(S). A witness set of S of size at least √μ’ can be found in time 2^O(μ*) n^3.

Proof outline:
For each point p and radius r, define the r-ball of p.
Greedily cover all points in the r-ball with disjoint balls of radius r/2.
Then cover all points in each r/2-ball with disjoint balls of radius r/4.
Since there exists in S a witness set of size μ(S), there exist a p and r such that either
there are √μ(S) r/2-balls, and these form a witness set, or
one r/2-ball covers √μ(S) r/4-balls, and these form a witness set.
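The two-level search in the proof outline can be sketched as follows. This is a hedged illustration, not the authors' implementation. The names `greedy_separated`, `find_witness` and `is_witness` are hypothetical, and points are assumed to be coordinate tuples under the Euclidean metric. For each p, only radii equal to a distance from p are tried. The sketch does not attempt to match the stated running time or to prove the size guarantee. What it does ensure is that whatever it returns is a valid witness set.

```python
import math


def greedy_separated(candidates, sep, dist):
    """Greedily keep points at distance >= sep from every point kept so far."""
    kept = []
    for p in candidates:
        if all(dist(p, q) >= sep for q in kept):
            kept.append(p)
    return kept


def find_witness(points, dist=math.dist):
    """Return (witness, radius) for the largest witness set found."""
    best, best_radius = list(points[:1]), 0.0
    for p in points:
        for r in sorted({dist(p, q) for q in points if q != p}):
            ball = [q for q in points if dist(p, q) <= r]
            # Level 1: centers inside B(p, r) at mutual distance >= r/2.
            centers = greedy_separated(ball, r / 2, dist)
            if len(centers) > len(best):
                best, best_radius = centers, r
            # Level 2: inside each r/2-ball, centers at mutual distance >= r/4.
            for c in centers:
                half = [q for q in ball if dist(c, q) <= r / 2]
                sub = greedy_separated(half, r / 4, dist)
                if len(sub) > len(best):
                    best, best_radius = sub, r / 2
    return best, best_radius


def is_witness(witness, radius, points, dist=math.dist):
    """Check that the set fits in one ball of the given radius around some
    data point and that its points are pairwise at least radius/2 apart."""
    in_one_ball = any(all(dist(c, w) <= radius for w in witness) for c in points)
    separated = all(
        dist(u, v) >= radius / 2
        for i, u in enumerate(witness)
        for v in witness[i + 1:]
    )
    return in_one_ball and separated
```

A level-2 set lives inside a ball of radius r/2 and is r/4-separated, so it is a witness for radius r/2 and not for r. That is why the returned radius is halved in that branch.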


Bicriteria algorithm

Recall our problem:
Given an n-point set S and target density constant μ*,
remove from S the fewest points so that the remaining set has density constant at most μ*.

Our bicriteria solution:
Let k be the true answer (the minimum number of points that must be removed).
We remove at most k · c/(c−1) points, and the remaining set has density constant at most c²μ*².


Bicriteria algorithm

Algorithm:
Run the subroutine to identify a witness set of size at least cμ*.
Remove it.
Repeat.

Analysis:
The density constant of the resulting set is not greater than c²μ*², since we terminated without finding a witness set of size at least cμ*.
Every time a witness set of size w > cμ* is removed by our algorithm, the optimal algorithm must remove at least w − μ* of its points, or else the true solution would have density constant greater than μ*.
It follows that our algorithm removes at most k · w/(w − μ*) < k · c/(c−1) points.
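The removal loop itself is short. The sketch below is an assumption-laden illustration, not the algorithm as implemented by the authors. The names `greedy_witness` and `remove_dense_points`, and the parameters `mu_star` and `c`, are hypothetical. The witness subroutine used here is a simple single-level greedy search standing in for the subroutine of the theorem, so the approximation guarantees from the analysis are not claimed for it. Any other witness finder can be passed in through `find_witness`.

```python
import math


def greedy_witness(points, dist):
    """Stand-in subroutine: largest set found that lies in one ball B(p, r)
    with all mutual distances >= r/2 (single-level greedy search)."""
    best = list(points[:1])
    for p in points:
        for r in sorted({dist(p, q) for q in points if q != p}):
            kept = []
            for q in points:
                if dist(p, q) <= r and all(dist(q, s) >= r / 2 for s in kept):
                    kept.append(q)
            if len(kept) > len(best):
                best = kept
    return best


def remove_dense_points(points, mu_star, c=2.0, dist=math.dist,
                        find_witness=greedy_witness):
    """Repeatedly remove a witness set of size >= c * mu_star.

    Returns (remaining, removed). Stops as soon as the subroutine fails to
    find a large enough witness set.
    """
    remaining, removed = list(points), []
    while remaining:
        witness = find_witness(remaining, dist)
        if len(witness) < c * mu_star:
            break
        removed.extend(witness)
        remaining = [p for p in remaining if p not in witness]
    return remaining, removed
```

For example, with a tight five-point star near the origin plus two far-away points, `mu_star=2` and `c=2`, the loop removes the star in one round and then stops, because the two remaining points form a witness set of size only 2.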


Conclusion

We conclude that there exists a bicriteria algorithm for the density constant:
We remove at most k · c/(c−1) points, and the remaining set has density constant at most c²μ*².

It follows that there exists a bicriteria algorithm for the doubling constant:
We remove at most k · c/(c−1) points, and the remaining set has doubling constant at most c⁴λ*⁴.
