Point clustering with a convex, corrected K-means

Transcript of the slides (18 pages).

Page 1:

Point clustering with a convex, corrected K-means

Martin Royer

Univ. Paris-Saclay

Martin Royer (Univ. Paris-Saclay) Clustering with a convex, corrected K-means 1 / 17

Page 2:

Let X1, …, Xn be n observations in R^p, and let G = {Gk}, 1 ≤ k ≤ K, be a partition of interest.

(α) - Latent model

Assume ∀k, ∀a ∈ Gk:

X_a = µ_k + E_a with E[X_a] = µ_k and E_a ∼ ind. sub-N(0, Σ_a)

Key quantities to keep in mind:

cluster separation ∆_G(µ) := min_{k≠l} |µ_k − µ_l|_2
peak noise σ² := max_{a∈[n]} |Σ_a|_op

Solving for the MLE under (α) with homoscedastic observations ⇔ K-means

Page 3:

The K-means objective writes:

G_Kmeans ∈ argmin_{G = {Gk}, 1≤k≤K} Crit(G) := Σ_{k=1}^K Σ_{a∈Gk} ‖X_a − X̄_{Gk}‖²   (1)

         = (1/2) Σ_{k=1}^K (1/|Gk|) Σ_{a,b∈Gk²} ‖X_a − X_b‖²   (2)

         = − Σ_{k=1}^K Σ_{a,b∈Gk²} (1/|Gk|) ⟨X_a, X_b⟩ + Σ_{a=1}^n ‖X_a‖²   (3)

         = − ⟨B_G, XXᵀ⟩ + ‖X‖²_F   (4)

with X := [X_1ᵀ; …; X_nᵀ] ∈ R^{n×p}, and B_G ∈ R^{n×n} the characteristic matrix of G, with

B_ab := 1/|Gk| if a, b ∈ Gk, and 0 otherwise.
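Identity (4) is easy to sanity-check numerically; below is a minimal numpy sketch (variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 12, 3, 3
labels = np.repeat(np.arange(K), n // K)   # an arbitrary partition G
X = rng.normal(size=(n, p))                # data matrix, rows X_a^T

# Crit(G) as in (1): within-cluster sums of squared distances to the mean
crit = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
           for k in range(K))

# B_G as defined above: B_ab = 1/|G_k| if a and b share a cluster, else 0
B = np.zeros((n, n))
for k in range(K):
    idx = np.where(labels == k)[0]
    B[np.ix_(idx, idx)] = 1.0 / len(idx)

# identity (4): Crit(G) = -<B_G, X X^T> + ||X||_F^2
crit_via_B = -np.sum(B * (X @ X.T)) + np.sum(X ** 2)
assert np.isclose(crit, crit_via_B)
```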

Page 4:

A pseudo-SDP, Peng and Wei [2007]

Lemma
We have

B_Kmeans ∈ argmax_{B ∈ D} ⟨XXᵀ, B⟩   (5)

where D := { symmetric B ∈ R^{n×n} :
  • B ≥ 0 (entrywise)
  • B·1_n = 1_n
  • Tr(B) = K
  • B² = B }

Is K-means "optimal"?

Claim
Suppose Σ_1 = … = Σ_n; then

argmax_{B ∈ D} ⟨E[XXᵀ], B⟩ = B_G   (6)
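Claim (6) can be checked by brute force on a tiny instance: with homoscedastic noise, ⟨Γ, B⟩ = γ·Tr(B) = γK is constant over D, so the noise term does not move the argmax. A small numpy sketch under these assumptions (all names illustrative):

```python
import numpy as np
from itertools import product

# Tiny instance: n = 4 points, K = 2, true partition G = {{0,1},{2,3}}
mu = np.array([[0.0, 0.0], [10.0, 0.0]])        # well-separated cluster means
labels_true = np.array([0, 0, 1, 1])
A = np.eye(2)[labels_true]                      # membership matrix
gamma = 3.0                                     # homoscedastic: Tr(Cov(E_a)) = gamma
EXX = A @ mu @ (A @ mu).T + gamma * np.eye(4)   # E[X X^T] = A mu mu^T A^T + Gamma

def B_of(labels):
    """Characteristic matrix B_G of the partition encoded by labels."""
    B = np.zeros((4, 4))
    for k in set(labels):
        idx = np.where(np.array(labels) == k)[0]
        B[np.ix_(idx, idx)] = 1.0 / len(idx)
    return B

# brute force over all partitions of {0,1,2,3} into two nonempty clusters
candidates = [l for l in product([0, 1], repeat=4) if 0 < sum(l) < 4]
best = max(candidates, key=lambda l: np.sum(B_of(l) * EXX))
assert best in {(0, 0, 1, 1), (1, 1, 0, 0)}     # the true partition wins
```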

Page 5:

Convexifying K-means

K-means as an optimization over the set of matrices

D := { symmetric B ∈ R^{n×n} :
  • B ≥ 0 (entrywise)
  • B·1_n = 1_n
  • Tr(B) = K
  • B² = B }

Replace it by an optimization over the set of matrices

C := { symmetric B ∈ R^{n×n} :
  • B ≥ 0 (entrywise)
  • B·1_n = 1_n
  • Tr(B) = K
  • (I ≽) B ≽ 0 }

Claim
Suppose Σ_1 = … = Σ_n; then we have

argmax_{B ∈ D} ⟨E[XXᵀ], B⟩ = argmax_{B ∈ C} ⟨E[XXᵀ], B⟩ = B_G

Page 6:

K-means is biased

Define the membership matrix of a partition G: A_ak := 1{a ∈ Gk}, A ∈ R^{n×K}.

Define also µ := [µ_1ᵀ; …; µ_Kᵀ] ∈ R^{K×p} and E := [E_1ᵀ; …; E_nᵀ] ∈ R^{n×p}.

Then according to (α) we have X = Aµ + E and:

XXᵀ = A(µµᵀ)Aᵀ + (AµEᵀ + EµᵀAᵀ) + EEᵀ   (7)

E[XXᵀ] = A(µµᵀ)Aᵀ + diag(Tr(Cov(E_a)))_{a∈[n]} = A(µµᵀ)Aᵀ + Γ   (8)

Claim
For a given G, µ, there exists Γ such that
B_G ∉ argmax_{B ∈ D} ⟨E[XXᵀ], B⟩ and B_G ∉ argmax_{B ∈ C} ⟨E[XXᵀ], B⟩
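Decomposition (7) is a purely algebraic identity (it holds for every realization of E, before taking expectations), which a few lines of numpy confirm (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 6, 4, 2
labels = np.array([0, 0, 0, 1, 1, 1])
A = np.eye(K)[labels]                 # membership matrix A_ak = 1{a in G_k}
mu = rng.normal(size=(K, p))          # stacked means, rows mu_k^T
E = rng.normal(size=(n, p))           # stacked noise, rows E_a^T
X = A @ mu + E                        # model (alpha): X = A mu + E

lhs = X @ X.T
rhs = A @ mu @ mu.T @ A.T + (A @ mu @ E.T + E @ mu.T @ A.T) + E @ E.T
assert np.allclose(lhs, rhs)          # identity (7), no expectation needed
```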

Page 7:

Adapted from Bunea et al. [2016].

How do we estimate Γ = diag(Tr(Cov(E_a)))_{a∈[n]}?

For a ∈ Gk, suppose we can find two other members v1, v2 of Gk ("neighbours" of a). Then Tr(Cov(E_a)) is estimated by

Γ̂_aa = ⟨X_a − X_v1, X_a − X_v2⟩ = |E_a|²_2 − ⟨E_a, E_v1 + E_v2⟩ + ⟨E_v1, E_v2⟩   (9)

But v1, v2 are unknown!

Estimator for Γ

For (a, b) ∈ [n]², let V(a, b) := max_{(c,d) ∈ ([n]\{a,b})²} |⟨X_a − X_b, (X_c − X_d)/|X_c − X_d|_2⟩|,

v1 := argmin_{b ∈ [n]\{a}} V(a, b) and v2 := argmin_{b ∈ [n]\{a,v1}} V(a, b). Then

Γ̂ := diag(⟨X_a − X_v1, X_a − X_v2⟩)_{a∈[n]}.   (10)

Take-away: v1, v2 are chosen so that Γ̂_aa is a "good" proxy for Tr(Cov(E_a)).
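A direct, unoptimized transcription of estimator (10) (the O(n⁴) scan over (c, d) is kept literal for clarity; gamma_hat is an illustrative name):

```python
import numpy as np

def gamma_hat(X):
    """Literal transcription of (10): for each a pick 'neighbours' v1, v2
    minimizing V(a, b), then Gamma_aa = <X_a - X_v1, X_a - X_v2>."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]       # diffs[a, b] = X_a - X_b
    gamma = np.zeros(n)
    for a in range(n):
        V = np.full(n, np.inf)
        for b in range(n):
            if b == a:
                continue
            worst = 0.0
            for c in range(n):                  # (c, d) ranges over ([n]\{a,b})^2
                for d in range(n):
                    if c == d or c in (a, b) or d in (a, b):
                        continue
                    u = diffs[c, d]
                    worst = max(worst, abs(diffs[a, b] @ u) / np.linalg.norm(u))
            V[b] = worst
        v1 = int(np.argmin(V))
        V[v1] = np.inf                          # exclude v1 when picking v2
        v2 = int(np.argmin(V))
        gamma[a] = diffs[a, v1] @ diffs[a, v2]
    return gamma
```

On well-separated clusters of size at least 3 with small noise, the selected v1, v2 land in the same cluster as a, so gamma[a] stays close to Tr(Cov(E_a)).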

Page 8:

The following estimator "improves" on K-means:

Convex, corrected K-means

B̂ := argmax_{B ∈ C} ⟨XXᵀ − Γ̂, B⟩.   (11)

Let the effective dimension be r* := max_{a∈[n]} Tr(Σ_a) / max_{a∈[n]} |Σ_a|_op ≤ p.

Theorem (R. ’17)
Suppose m := min_k |Gk| ≥ 2. Under latent model (α), if

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n)))   (12)

then B̂ = B_G with high probability.

Page 9:

[Figure: adjusted mutual information adj_mi(Ĝ, G) vs. SNR ∆²(µ)/σ², comparing kmeans++, pecok, lowrank-spectral, hierarchical, and cord; n = 100, p = 500, K = 5.]

Page 10:

[Figure: adjusted mutual information adj_mi(Ĝ, G) vs. dimension p (10² to 10⁵, log scale), comparing pecok, lowrank-spectral, hierarchical, and cord; n = 100, K = 5.]

Page 11:

"Low" dimension regimes

Is the separation condition optimal?

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n)))

Suppose all groups have equal size m ≈ n/K.

"Low dimension": p ≲ n + m log n
"Effective low dimension": r* ≲ n + m log n ≲ p

∆²_G(µ) ≳ σ² (K + log n)   (13)

→ compare with the result by Mixon et al. [2016], recovery in: ∆²_G(µ) ≳ σ² K²
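To see how (13) follows from (12), note that in the (effective) low-dimension regime the square-root term is dominated by n + m log n, so with m ≈ n/K:

```latex
m\,\Delta_G^2(\mu) \;\gtrsim\; \sigma^2\bigl(n + m\log n\bigr)
\quad\Longleftrightarrow\quad
\Delta_G^2(\mu) \;\gtrsim\; \sigma^2\Bigl(\frac{n}{m} + \log n\Bigr)
\;\approx\; \sigma^2\,(K + \log n).
```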

Page 12:

"High" dimension regime

Is the separation condition optimal?

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n)))

Suppose all groups have equal size m ≈ n/K.

"High dimension": n + m log n ≲ r*

∆²_G(µ) ≳ σ² √((r* K / n)(K + log n))   (14)

→ compare with the result by Banks et al. [2016], a lower bound for detecting a Gaussian mixture:

∆²_G(µ) ≳ σ² √((K log K) p / n)   (15)

Page 13:

(β) - Generalized model

Assume ∃δ > 0, ∀k, ∀a ∈ Gk:

X_a = ν_a + E_a with E[X_a] = ν_a ∈ B_f(µ_k, δ) and E_a ∼ ind. sub-N(0, Σ_a)

Theorem (R. ’17)
Suppose m := min_k |Gk| ≥ 2. Under model (β), if

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n))) + δσ + δ² (√n + m)   (16)

then B̂ = B_G with high probability.

NB: when δ is of order σ√(log n), there is no difference between models (α) and (β); i.e. this is the amount of model error one can tolerate (and, morally, we could not expect more).

Page 14:

Identifiability given K

(β) - Generalized model

Assume ∃δ > 0, ∀k, ∀a ∈ Gk:

X_a = ν_a + E_a with ν_a ∈ B_f(µ_k, δ) and E_a ∼ ind. sub-N(0, Σ_a)

Quantities to consider:

discriminating power ρ(G, µ, δ) := ∆_G(µ)/δ

Identifiability
If ρ(G, µ, δ) > 4, then G is the unique maximizer of ρ over partitions of size |G|.

Page 15:

How to account for the number of clusters K?

K-adaptive estimator

Let κ ∈ R⁺, and estimate B_G using the SDP

B̂_adapt = argmax_{B ∈ C′} ⟨XXᵀ − Γ̂, B⟩ − κ Tr(B)   (17)

where C′ drops the trace constraint:

C′ := { symmetric B ∈ R^{n×n} :
  • B ≥ 0 (entrywise)
  • B·1_n = 1_n
  • (I ≽) B ≽ 0 }

Theorem (R. ’17)
If

σ² (√(r* n) + n) ≲ κ ≲ m ∆²_G(µ)   (18)

then we have the same recovery condition as before, i.e.

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n)))   (19)

Page 16:

Optimization problem with ADMM

We want to solve the semidefinite program

B̂ = argmax_{B ∈ C} ⟨gram, B⟩   (20)

over the set C := { symmetric B ∈ R^{n×n} : B ≽ 0, B ≥ 0 (entrywise), B·1 = 1, Tr(B) = K }.

We introduce splitting variables X, Y, Z and let L : R^{n×n} → R^{n+1} be a linear operator such that L(X) = b collects the n + 1 affine constraints of the set C. Then problem (20) is exactly equivalent to:

inf_{X ∈ R^{n×n}} { −⟨gram, X⟩ + δ_{{X : L(X) = b}}(X) + δ_{S⁺_n}(Y) + δ_{Pos}(Z) } =: f(X, Y, Z)   (21)

subject to X = Y = Z

Introduce the consensus set D = {χ = (x, y, z) ∈ (R^{n×n})³ : x = y = z} and a scaled dual variable U:

minimize f(X, Y, Z) + δ_D(χ) + (ρ/2) ‖(X, Y, Z) − χ + U‖²_2   (22)

subject to (X, Y, Z) − χ = 0
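Of the three block updates in this splitting, the Y-update is a Euclidean projection onto the PSD cone S⁺_n, obtained by clipping negative eigenvalues; a minimal numpy sketch (proj_psd is an illustrative name, not from the slides):

```python
import numpy as np

def proj_psd(M):
    """Euclidean projection onto the PSD cone S+_n:
    symmetrize, then clip negative eigenvalues to zero."""
    sym = (M + M.T) / 2
    w, V = np.linalg.eigh(sym)
    return (V * np.maximum(w, 0.0)) @ V.T
```

The Z-update is the entrywise clip max(·, 0), and the X-update reduces to a linear solve coming from the affine constraints L(X) = b.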

Page 17:

Thank you for your attention

J. Banks, C. Moore, N. Verzelen, R. Vershynin, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. ArXiv e-prints, July 2016.

F. Bunea, C. Giraud, M. Royer, and N. Verzelen. PECOK: a convex optimization approach to variable clustering. ArXiv e-prints, June 2016.

D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211–215, Sept 2016. doi: 10.1109/ITW.2016.7606826.

J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. on Optimization, 18(1):186–205, February 2007. ISSN 1052-6234. doi: 10.1137/050641983.

Page 18:

Effect of correcting for Γ on the separating rate of (12)

Separation rate for the un-corrected K-means:

m ∆²_G(µ) ≳ σ² (n + m log n + √(r*(n + m log n)) + r*)   (23)

In the "high dimension" regime n + m log n ≲ r*, the leading term is σ² r*, whereas with the correction Γ̂ the leading term was σ² √(r*(n + m log n)).
