k-means, k-means++

Barna Saha

March 8, 2016

(Transcript of slides hosted on WordPress.com)

K-Means: The Most Popular Clustering Algorithm

- The k-means clustering problem is one of the oldest and most important clustering problems.

- Given an integer k and a set of data points in R^d, the goal is to choose k centers so as to minimize φ, the sum of squared distances between each point and its closest center.

- Solving this problem exactly is NP-hard.

- 25 years ago, Lloyd proposed a simple local search algorithm that is still very widely used today.

- A 2002 survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications."


Lloyd's Local Search Algorithm

1. Begin with k arbitrary centers, typically chosen uniformly at random from the data points.

2. Assign each point to its nearest cluster center.

3. Recompute each cluster center as the center of mass of the points assigned to it.

4. Repeat Steps 2-3 until the process converges.

φ is monotonically decreasing, which ensures no configuration is repeated. Convergence could still be slow: there are k^n possible clusterings.
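The four numbered steps can be sketched in a few lines of numpy (a minimal illustration, not code from the slides; the function name is mine):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's local search: alternate nearest-center assignment
    and center-of-mass updates until the configuration stabilizes."""
    rng = np.random.default_rng(seed)
    # Step 1: k centers chosen uniformly at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once no center moves.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Since φ only decreases, the loop cannot revisit a configuration, but nothing here rules out visiting exponentially many distinct ones before converging.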


Lloyd's Local Search Algorithm

- Convergence could be slow: k^n possible clusterings.

- However, Lloyd's k-means algorithm has polynomial smoothed running time.

Illustration of k-means Clustering

Figure: Convergence to a local optimum.
(Image: Agor153, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=19403547)

Drawbacks of the k-means Algorithm

- Euclidean distance is used as the metric, and variance is used as the measure of cluster scatter.

- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks to determine the number of clusters in the data set.

- Convergence to a local minimum may produce results far from the optimal clustering.
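The second drawback calls for diagnostics when choosing k. One standard heuristic (not covered in the slides) is the elbow method: compute φ for increasing k and look for where the marginal improvement flattens. A minimal numpy sketch, with hypothetical center sets standing in for actual k-means output:

```python
import numpy as np

def phi(X, centers):
    """Cost φ: sum over all points of the squared distance
    to the closest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Three well-separated pairs of points.
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.], [20., 0.], [21., 0.]])

# Hypothetical center sets standing in for k-means output at k = 1, 2, 3.
candidates = {
    1: X.mean(axis=0, keepdims=True),
    2: np.array([[0., 0.5], [15., 5.25]]),
    3: np.array([[0., 0.5], [10., 10.5], [20.5, 0.]]),
}
costs = {k: phi(X, c) for k, c in candidates.items()}
# φ always drops as k grows, so look for where the drop flattens
# (the "elbow") rather than for the minimum itself.
```

Here the drop from k = 2 to k = 3 is dramatic and further splits would gain little, suggesting k = 3 for this toy data.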

k-means++

(Figure not captured in the transcript.)

Reference: https://csci8980bigdataalgo.files.wordpress.com/2013/09/bats-means.pdf


k-means++

- The first fix: select centers based on distance.


k-means++

- But selecting by distance is sensitive to outliers.
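The slides leave the distance-based rule implicit; one plausible reading is farthest-first selection, sketched below in numpy (the function name is mine). It also exposes the outlier sensitivity: a lone far-away point is always the next pick.

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Distance-based seeding: after a random first center, repeatedly
    pick the point farthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        # Squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[d2.argmax()])  # deterministic: an outlier always wins
    return np.array(centers)
```

Because the choice is deterministic, a single corrupted point far from the data is guaranteed to become a center, which motivates the randomized interpolation on the next slide.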


k-means++

- Interpolate between the two methods: let D(x) be the distance between x and the nearest cluster center chosen so far, and sample x as a new cluster center with probability proportional to D(x)^2.

k-means++ returns a clustering C that is O(log k)-competitive in expectation.

In class we will show that if one cluster center is chosen from each optimal cluster, then k-means++ is 8-competitive.
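The D(x)^2 sampling rule can be sketched directly (a numpy sketch under my own naming, not the slides' code):

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    """k-means++ seeding: the first center is uniform over the data;
    each subsequent center is a data point x sampled with probability
    proportional to D(x)^2, its squared distance to the nearest
    center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        # D(x)^2 for every point: squared distance to its nearest center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Uniform sampling ignores distance entirely and farthest-first uses it deterministically; D(x)^2 sampling interpolates between the two, favoring distant points while giving an outlier only a proportional, not certain, chance of selection. The resulting seeds are then handed to Lloyd's iterations.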
