
k-means, k-means++

Barna Saha

March 8, 2016

K-Means: The Most Popular Clustering Algorithm

- The k-means clustering problem is one of the oldest and most important problems in clustering.

- Given an integer k and a set of data points in R^d, the goal is to choose k centers so as to minimize φ, the sum of squared distances between each point and its closest center.

- Solving this problem exactly is NP-hard.

- 25 years ago, Lloyd proposed a simple local search algorithm that is still very widely used today.

- A 2002 survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications."
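The objective φ defined above can be sketched in a few lines of NumPy (the function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def kmeans_cost(points, centers):
    """phi: sum over all points of the squared distance to the closest center."""
    # Pairwise squared distances, shape (n_points, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes its distance to the *nearest* center.
    return d2.min(axis=1).sum()

points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.0, 0.0]])
print(kmeans_cost(points, centers))  # 0.25 + 0.25 + 0.0 = 0.5
```

Minimizing φ over all choices of k centers is exactly the NP-hard problem stated above.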


Lloyd's Local Search Algorithm

1. Begin with k arbitrary centers, typically chosen uniformly at random from the data points.

2. Assign each point to its nearest cluster center.

3. Recompute each cluster center as the center of mass of the points assigned to it.

4. Repeat Steps 2-3 until the process converges.

φ is monotonically decreasing, which ensures no configuration is repeated. Convergence could still be slow: there are at most k^n possible clusterings.
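The four steps above can be sketched directly in NumPy (a minimal illustration, not an optimized implementation; the empty-cluster handling is one common convention, not something the slides specify):

```python
import numpy as np

def lloyd(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: k arbitrary centers, chosen uniformly at random from the data.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest cluster center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [100.0, 0.0], [100.1, 0.0]])
centers, labels = lloyd(pts, 2)
```

Because each of Steps 2 and 3 can only decrease φ, the loop must terminate; the k^n bound on the number of clusterings is what makes the worst-case running time large.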


Lloyd's Local Search Algorithm

- Convergence could be slow: there are at most k^n possible clusterings.

- Lloyd's k-means algorithm has polynomial smoothed running time.

Illustration of k-means clustering

Figure: Convergence to Local Optimum

By Agor153 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=19403547

Drawbacks of the k-means algorithm

- Euclidean distance is used as the metric, and variance is used as the measure of cluster scatter.

- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks to determine the number of clusters in the data set.

- Convergence to a local minimum may produce results far from the optimal clustering.
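One common diagnostic for choosing k is the "elbow" heuristic: run k-means for several values of k and look for the point where the cost φ stops dropping sharply. The slides do not prescribe a particular diagnostic, so this is an illustrative sketch (using a bare-bones Lloyd iteration):

```python
import numpy as np

def lloyd_cost(points, k, seed=0, iters=50):
    """Run Lloyd's algorithm and return the final cost phi."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Two well-separated groups: phi drops sharply from k=1 to k=2,
# then only marginally afterwards -- the "elbow" suggests k=2.
pts = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [10.0, 0], [10.1, 0], [10.2, 0]])
costs = {k: lloyd_cost(pts, k) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(k, costs[k])
```

Running several random restarts and keeping the lowest-φ result is the standard mitigation for the local-minimum drawback listed above.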

k-means++

Reference: https://csci8980bigdataalgo.files.wordpress.com/2013/09/bats-means.pdf

k-means++

- The first fix: select centers based on distance.

- Selecting centers by distance alone is sensitive to outliers.

k-means++

- Interpolate between the two methods: let D(x) be the distance between x and the nearest cluster center chosen so far. Sample x as a new cluster center with probability proportional to D(x)^2.

k-means++ returns a clustering C that is O(log k)-competitive with the optimal clustering, in expectation.

In class we will show that if one cluster center is chosen from each optimal cluster, then k-means++ is 8-competitive.
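The D(x)^2 sampling rule above can be sketched as follows (a minimal illustration of the seeding step only; names are illustrative):

```python
import numpy as np

def kmeanspp_seed(points, k, rng):
    """Pick k initial centers by D^2 sampling (the k-means++ seeding rule)."""
    n = len(points)
    # The first center is chosen uniformly at random.
    centers = [points[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = np.min(((points[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        # Sample the next center with probability proportional to D(x)^2;
        # already-chosen points have D(x) = 0 and so are never re-picked.
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
pts = np.array([[0.0, 0], [0.1, 0], [10.0, 0], [10.1, 0]])
centers = kmeanspp_seed(pts, 2, rng)
```

After seeding, Lloyd's iterations run as usual; squaring D(x) is what interpolates between uniform sampling (D^0) and farthest-point selection (D^∞), which tempers the sensitivity to outliers noted earlier.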