Point clustering with a convex, corrected K-means
Martin Royer
Univ. Paris-Saclay
Martin Royer (Univ. Paris-Saclay) Clustering with a convex, corrected K-means 1 / 17
Let X1, . . . , Xn be n observations in Rp, G = {Gk}1≤k≤K a partition of interest.
(α) - Latent model
Assume ∀k,∀a ∈ Gk:
$X_a = \mu_k + E_a$ with $\mathbb{E}[X_a] = \mu_k$ and $E_a \sim_{\mathrm{ind}} \text{sub-}\mathcal{N}(0, \Sigma_a)$
Key quantities to keep in mind:
- cluster separation: $\Delta_G(\mu) := \min_{k \neq l} |\mu_k - \mu_l|_2$
- peak noise: $\sigma^2 := \max_{a \in [n]} |\Sigma_a|_{\mathrm{op}}$
Solving for the MLE on (α) with homoscedastic observations ⇔ K-means
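As a quick illustration (not part of the slides): a minimal numpy simulation of model (α) with Gaussian (hence sub-Gaussian) noise, clustered by plain Lloyd iterations for the K-means criterion. The helpers `farthest_first` and `lloyd_kmeans` are illustrative choices, not the talk's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate model (alpha): X_a = mu_k + E_a with Gaussian noise
n, p, K = 90, 20, 3
mu = 5.0 * rng.normal(size=(K, p))        # well-separated cluster means
labels = np.repeat(np.arange(K), n // K)  # ground-truth partition G
X = mu[labels] + rng.normal(size=(n, p))

def farthest_first(X, K):
    """Greedy farthest-point initialization: under large separation,
    this picks one seed per cluster."""
    centers = [X[0]]
    for _ in range(K - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def lloyd_kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations for the K-means criterion."""
    centers = farthest_first(X, K)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        centers = np.stack([X[assign == k].mean(0) if np.any(assign == k)
                            else centers[k] for k in range(K)])
    return assign

assign = lloyd_kmeans(X, K)
```

With this separation (≈ 31 between means vs. noise scale ≈ 6 within clusters), Lloyd's algorithm recovers the planted partition.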
The K-means objective writes:

$$\hat{G}^{\,Kmeans} \in \operatorname*{argmin}_{G=\{G_k\}_{1\le k\le K}} \mathrm{Crit}(G)$$

where

$$\mathrm{Crit}(G) := \sum_{k=1}^{K} \sum_{a \in G_k} \|X_a - \bar{X}_{G_k}\|^2 \qquad (1)$$
$$= \frac{1}{2} \sum_{k=1}^{K} \frac{1}{|G_k|} \sum_{(a,b) \in G_k^2} \|X_a - X_b\|^2 \qquad (2)$$
$$= -\sum_{k=1}^{K} \sum_{(a,b) \in G_k^2} \frac{1}{|G_k|} \langle X_a, X_b \rangle + \sum_{a=1}^{n} \|X_a\|^2 \qquad (3)$$
$$= -\langle B_G, XX^T \rangle + \|X\|_F^2 \qquad (4)$$
with $X := (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times p}$ (rows $X_a^T$), and $B_G \in \mathbb{R}^{n \times n}$ characteristic of $G$ s.t.

$$B_{ab} := \begin{cases} 1/|G_k| & \text{if } a, b \in G_k \\ 0 & \text{otherwise} \end{cases}$$
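A quick numerical check of the identities (1), (2) and (4) above, with random data (a sketch; the partition and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 4
X = rng.normal(size=(n, p))
G = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # some partition with K = 3

# (1): within-cluster sum of squared deviations from the cluster mean
crit1 = sum(((X[g] - X[g].mean(0)) ** 2).sum() for g in G)

# (2): pairwise form, (1/2) sum_k (1/|G_k|) sum_{a,b in G_k} ||X_a - X_b||^2
crit2 = 0.5 * sum(sum(((X[a] - X[b]) ** 2).sum() for a in g for b in g) / len(g)
                  for g in G)

# (4): matrix form -<B_G, X X^T> + ||X||_F^2
B = np.zeros((n, n))
for g in G:
    B[np.ix_(g, g)] = 1.0 / len(g)   # B_ab = 1/|G_k| for a, b in G_k
crit4 = -np.sum(B * (X @ X.T)) + (X ** 2).sum()

print(crit1, crit2, crit4)  # all three agree
```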
A pseudo-SDP, Peng and Wei [2007]

Lemma. We have
$$\hat{B}^{\,Kmeans} \in \operatorname*{argmax}_{B \in \mathcal{D}} \langle XX^T, B \rangle \qquad (5)$$
where
$$\mathcal{D} := \left\{ \text{symmetric } B \in \mathbb{R}^{n \times n} :\; B \ge 0,\; B\,\mathbf{1}_n = \mathbf{1}_n,\; \mathrm{Tr}(B) = K,\; B^2 = B \right\}$$
($B \ge 0$ meant entrywise)
Is K-means "optimal"?

Claim. Suppose $\Sigma_1 = \dots = \Sigma_n$; then
$$\operatorname*{argmax}_{B \in \mathcal{D}} \langle \mathbb{E}[XX^T], B \rangle = B_G \qquad (6)$$
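One can check numerically that the partition matrix $B_G$ indeed belongs to $\mathcal{D}$ (a sketch; `partition_matrix` is an illustrative helper, not from the talk):

```python
import numpy as np

def partition_matrix(groups, n):
    """B_G with B_ab = 1/|G_k| if a, b are in the same group, 0 otherwise."""
    B = np.zeros((n, n))
    for g in groups:
        B[np.ix_(g, g)] = 1.0 / len(g)
    return B

n = 9
B = partition_matrix([[0, 1, 2], [3, 4, 5], [6, 7, 8]], n)

# Membership in D: symmetric, entrywise nonnegative, rows sum to one,
# trace equals K, and B is a projection (B^2 = B).
checks = {
    "symmetric": np.allclose(B, B.T),
    "nonnegative": bool((B >= 0).all()),
    "row sums": np.allclose(B @ np.ones(n), np.ones(n)),
    "trace K": np.isclose(np.trace(B), 3),
    "projection": np.allclose(B @ B, B),
}
```

Since $B_G$ is an orthogonal projection (onto the span of the normalized cluster indicators), it also lies in the relaxed set $\mathcal{C}$ introduced next.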
Convexifying K-means

K-means as an optimization over the set of matrices
$$\mathcal{D} := \left\{ \text{symmetric } B \in \mathbb{R}^{n \times n} :\; B \ge 0,\; B\,\mathbf{1}_n = \mathbf{1}_n,\; \mathrm{Tr}(B) = K,\; B^2 = B \right\}$$
Replace it by an optimization over the set of matrices
$$\mathcal{C} := \left\{ \text{symmetric } B \in \mathbb{R}^{n \times n} :\; B \ge 0,\; B\,\mathbf{1}_n = \mathbf{1}_n,\; \mathrm{Tr}(B) = K,\; (I \succeq)\; B \succeq 0 \right\}$$

Claim. Suppose $\Sigma_1 = \dots = \Sigma_n$; then we have
$$\operatorname*{argmax}_{B \in \mathcal{D}} \langle \mathbb{E}[XX^T], B \rangle = \operatorname*{argmax}_{B \in \mathcal{C}} \langle \mathbb{E}[XX^T], B \rangle = B_G$$
K-means is biased
Define the membership matrix of a partition $G$: $A_{ak} := \mathbf{1}\{a \in G_k\}$, $A \in \mathbb{R}^{n \times K}$.
Define also $\mu \in \mathbb{R}^{K \times p}$ with rows $\mu_1^T, \dots, \mu_K^T$, and $E \in \mathbb{R}^{n \times p}$ with rows $E_1^T, \dots, E_n^T$.
Then according to (α) we have $X = A\mu + E$ and:
$$XX^T = A(\mu\mu^T)A^T + (A\mu E^T + E\mu^T A^T) + EE^T \qquad (7)$$
$$\mathbb{E}[XX^T] = A(\mu\mu^T)A^T + \mathrm{diag}\big(\mathrm{Tr}(\mathrm{Cov}(E_a))\big)_{a \in [n]} = A(\mu\mu^T)A^T + \Gamma \qquad (8)$$
Claim. For a given $G, \mu$, there exists $\Gamma$ such that $B_G \notin \operatorname*{argmax}_{B \in \mathcal{D}} \langle \mathbb{E}[XX^T], B \rangle$ and $B_G \notin \operatorname*{argmax}_{B \in \mathcal{C}} \langle \mathbb{E}[XX^T], B \rangle$.
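The decomposition (7) is a purely algebraic identity, easy to confirm numerically (a sketch with random data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 10, 5, 2
A = np.zeros((n, K))
A[:5, 0] = 1.0   # first five points in G_1
A[5:, 1] = 1.0   # last five points in G_2
mu = rng.normal(size=(K, p))
E = rng.normal(size=(n, p))
X = A @ mu + E   # model (alpha) in matrix form

lhs = X @ X.T
# Signal block + cross terms + noise Gram matrix, as in (7)
rhs = A @ (mu @ mu.T) @ A.T + (A @ mu @ E.T + E @ mu.T @ A.T) + E @ E.T
print(np.allclose(lhs, rhs))  # True: decomposition (7) holds exactly
```

Taking expectations kills the cross terms and leaves $EE^T$ with expected diagonal $\Gamma$, which is the bias the correction targets.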
Adapted from Bunea et al. [2016],
How do we estimate $\Gamma = \mathrm{diag}\big(\mathrm{Tr}(\mathrm{Cov}(E_a))\big)_{a \in [n]}$?

For $a \in G_k$, suppose we find neighbours $v_1, v_2 \in G_k$. Then $\mathrm{Tr}(\mathrm{Cov}(E_a))$ is estimated by
$$\hat{\Gamma}_{aa} = \langle X_a - X_{v_1}, X_a - X_{v_2} \rangle = |E_a|_2^2 - \langle E_a, E_{v_1} + E_{v_2} \rangle + \langle E_{v_1}, E_{v_2} \rangle \qquad (9)$$
v1, v2 unknown!
Estimator for $\Gamma$

For $(a,b) \in [n]^2$ let
$$V(a,b) := \max_{(c,d) \in ([n] \setminus \{a,b\})^2} \left| \left\langle X_a - X_b, \frac{X_c - X_d}{|X_c - X_d|_2} \right\rangle \right|,$$
$$\hat{v}_1 := \operatorname*{argmin}_{b \in [n] \setminus \{a\}} V(a,b) \quad\text{and}\quad \hat{v}_2 := \operatorname*{argmin}_{b \in [n] \setminus \{a, \hat{v}_1\}} V(a,b).$$
Then
$$\hat{\Gamma} := \mathrm{diag}\big( \langle X_a - X_{\hat{v}_1}, X_a - X_{\hat{v}_2} \rangle \big)_{a \in [n]}. \qquad (10)$$

Take-away: $\hat{v}_1, \hat{v}_2$ are chosen so that $\hat{\Gamma}_{aa}$ is a "good" proxy for $\mathrm{Tr}(\mathrm{Cov}(E_a))$.
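A direct, brute-force transcription of the estimator (10). It costs $O(n^4 p)$ and is only meant for small toy problems; `gamma_hat` is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def gamma_hat(X):
    """Neighbour-based estimator of Gamma = diag(Tr(Cov(E_a))), as in (10)."""
    n = len(X)
    D = X[:, None, :] - X[None, :, :]   # D[a, b] = X_a - X_b
    diag = np.zeros(n)
    for a in range(n):
        def V(b):
            # max over (c, d) avoiding {a, b} of |<X_a - X_b, (X_c - X_d)/|X_c - X_d|>|
            best = 0.0
            for c in range(n):
                for d in range(n):
                    if c == d or c in (a, b) or d in (a, b):
                        continue
                    u = D[c, d]
                    best = max(best, abs(D[a, b] @ u) / np.linalg.norm(u))
            return best
        others = [b for b in range(n) if b != a]
        v1 = min(others, key=V)
        v2 = min([b for b in others if b != v1], key=V)
        diag[a] = D[a, v1] @ D[a, v2]
    return np.diag(diag)

# On two well-separated clusters the neighbours land in the right cluster,
# and the diagonal concentrates around Tr(Sigma_a) = p for unit Gaussian noise.
rng = np.random.default_rng(4)
Xs = np.vstack([rng.normal(size=(6, 10)) + 6.0, rng.normal(size=(6, 10)) - 6.0])
G = gamma_hat(Xs)
```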
The following estimator "improves" on K-means:

Convex, corrected K-means
$$\hat{B} := \operatorname*{argmax}_{B \in \mathcal{C}} \langle XX^T - \hat{\Gamma}, B \rangle. \qquad (11)$$

Let the effective dimension be $r^* := \max_{a \in [n]} \mathrm{Tr}(\Sigma_a) / \max_{a \in [n]} |\Sigma_a|_{\mathrm{op}} \le p$.

Theorem (R. '17)
Suppose $m := \min_k |G_k| \ge 2$. Under latent model (α), if
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} \Big) \qquad (12)$$
then $\hat{B} = B_G$ with high probability.
[Figure: adjusted mutual information adj_mi(Ĝ, G) as a function of the SNR ∆²(µ)/σ², comparing kmeans++, pecok, lowrank-spectral, hierarchical and cord; n = 100, p = 500, K = 5]
[Figure: adjusted mutual information adj_mi(Ĝ, G) as a function of the dimension p (10² to 10⁵), comparing pecok, lowrank-spectral, hierarchical and cord; n = 100, K = 5]
"Low" dimension regimes

Is the separation condition optimal?
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} \Big)$$
Suppose all groups have equal size $m \approx n/K$.

"Low dimension" $p \lesssim n + m \log n$, or "effective low dimension" $r^* \lesssim n + m \log n \lesssim p$: the condition becomes
$$\Delta_G^2(\mu) \gtrsim \sigma^2 (K + \log n) \qquad (13)$$
→ compare the result by Mixon et al. [2016], recovery in: $\Delta_G^2(\mu) \gtrsim \sigma^2 K^2$
"High" dimension regime

Is the separation condition optimal?
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} \Big)$$
Suppose all groups have equal size $m \approx n/K$.

"High dimension" $n + m \log n \lesssim r^*$: the condition becomes
$$\Delta_G^2(\mu) \gtrsim \sigma^2 \sqrt{\frac{r^* K}{n} (K + \log n)} \qquad (14)$$
→ compare the result by Banks et al. [2016], lower bound for detecting a Gaussian mixture:
$$\Delta_G^2(\mu) \gtrsim \sigma^2 \sqrt{(K \log K)\, \frac{p}{n}} \qquad (15)$$
(β) - Generalized model
Assume ∃δ > 0, ∀k, ∀a ∈ Gk:
$X_a = \nu_a + E_a$ with $\mathbb{E}[X_a] = \nu_a \in B_f(\mu_k, \delta)$ and $E_a \sim_{\mathrm{ind}} \text{sub-}\mathcal{N}(0, \Sigma_a)$

Theorem (R. '17)
Suppose $m := \min_k |G_k| \ge 2$. Under model (β), if
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} \Big) + \delta\sigma + \delta^2(\sqrt{n} + m) \qquad (16)$$
then $\hat{B} = B_G$ with high probability.
NB: when $\delta$ is of order $\sigma\sqrt{\log n}$, there is no difference between models (α) and (β), i.e. this is the model error one can tolerate (and the moral: we couldn't expect more).
Identifiability given K
(β) - Generalized model
Assume ∃δ > 0, ∀k, ∀a ∈ Gk:
$X_a = \nu_a + E_a$ with $\nu_a \in B_f(\mu_k, \delta)$ and $E_a \sim_{\mathrm{ind}} \text{sub-}\mathcal{N}(0, \Sigma_a)$
Quantities to consider:
discriminating power $\rho(G, \mu, \delta) := \Delta_G(\mu)/\delta$
Identifiability
If ρ(G, µ, δ) > 4, then G is the unique maximizer of ρ over partitions of size |G|
How to account for number of clusters K?
K-adaptive estimator
Let $\kappa \in \mathbb{R}_+$; estimate $B_G$ using the SDP
$$\hat{B}^{\,adapt} = \operatorname*{argmax}_{B \in \mathcal{C}} \langle XX^T - \hat{\Gamma}, B \rangle - \kappa\, \mathrm{Tr}(B) \qquad (17)$$
$$\mathcal{C} := \left\{ \text{symmetric } B \in \mathbb{R}^{n \times n} :\; B \ge 0,\; B\,\mathbf{1}_n = \mathbf{1}_n,\; (I \succeq)\; B \succeq 0 \right\}$$
(note: no trace constraint here, since $K$ is unknown)
Theorem (R. ’17)
If
$$\sigma^2 \big( \sqrt{r^* n} + n \big) \lesssim \kappa \lesssim m\,\Delta_G^2(\mu) \qquad (18)$$
then we have the same recovery conditions as before, i.e.
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} \Big) \qquad (19)$$
Optimization problem with ADMM
We want to solve the semi-definite program
$$\hat{B} = \operatorname*{argmax}_{B \in \mathcal{C}} \langle \widehat{\mathrm{gram}}, B \rangle \qquad (20)$$
over the set $\mathcal{C} := \{\text{symmetric } B \in \mathbb{R}^{n \times n} : B \succeq 0,\; B \ge 0,\; B\mathbf{1} = \mathbf{1},\; \mathrm{Tr}(B) = K\}$.

We introduce $X, Y, Z$ and let $\mathcal{L} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n+1}$ be a linear operator such that $\mathcal{L}(X) = b$ collects the $n+1$ affine constraints of the set $\mathcal{C}$; then problem (20) is exactly equivalent to:
$$\inf_{X \in \mathbb{R}^{n \times n}} \Big\{ \underbrace{-\langle \widehat{\mathrm{gram}}, X \rangle + \delta_{\{X : \mathcal{L}(X) = b\}}(X) + \delta_{S_n^+}(Y) + \delta_{\mathrm{Pos}}(Z)}_{f(X, Y, Z)} \Big\} \qquad (21)$$
subject to $X = Y = Z$.

Introduce a dual variable $\chi$ from $D = \{\chi = (x, y, z) \in (\mathbb{R}^{n \times n})^3 \mid x = y = z\}$:
$$\text{minimize} \quad f(X, Y, Z) + \delta_D(\chi) + (\rho/2)\, \|(X, Y, Z) - \chi + U\|_2^2 \qquad (22)$$
$$\text{subject to} \quad (X, Y, Z) - \chi = 0$$
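A compact consensus-ADMM sketch of the splitting (20)-(22) in numpy, under stated assumptions: small $n$, a dense affine projection, a crude scaling for $\rho$, and the uncorrected Gram matrix $XX^T$ on homoscedastic toy data. `admm_sdp` is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def admm_sdp(gram, K, n_iter=500):
    """Consensus ADMM for max <gram, B> over C: X carries the affine
    constraints L(X) = b (row sums = 1, trace = K), Y the PSD cone,
    Z the entrywise nonnegativity."""
    n = gram.shape[0]
    rho = np.linalg.norm(gram)          # crude scaling of the penalty parameter
    # Dense (n+1) x n^2 matrix representing the affine operator L
    L = np.zeros((n + 1, n * n))
    for i in range(n):
        L[i, i * n:(i + 1) * n] = 1.0   # row i sums to 1
    L[n, ::n + 1] = 1.0                  # trace = K
    b = np.append(np.ones(n), float(K))
    P = L.T @ np.linalg.inv(L @ L.T)     # for Euclidean projection onto {L(x) = b}

    def proj_affine(M):
        m = M.ravel()
        return (m - P @ (L @ m - b)).reshape(n, n)

    def proj_psd(M):
        M = (M + M.T) / 2
        w, V = np.linalg.eigh(M)
        return (V * np.clip(w, 0, None)) @ V.T

    chi = np.full((n, n), 1.0 / n)      # consensus variable
    UX, UY, UZ = (np.zeros((n, n)) for _ in range(3))
    for _ in range(n_iter):
        X = proj_affine(chi - UX + gram / rho)  # linear term enters only here
        Y = proj_psd(chi - UY)
        Z = np.clip(chi - UZ, 0.0, None)
        chi = (X + UX + Y + UY + Z + UZ) / 3.0
        UX += X - chi
        UY += Y - chi
        UZ += Z - chi
    return chi

# Toy run: two well-separated clusters of four points each
rng = np.random.default_rng(3)
Xd = np.vstack([rng.normal(size=(4, 6)) + 3.0, rng.normal(size=(4, 6)) - 3.0])
B = admm_sdp(Xd @ Xd.T, K=2)
```

With this separation $\hat{B}$ comes out close to the block matrix $B_G$ with $1/4$ blocks; in practice one would feed in the corrected Gram matrix $XX^T - \hat{\Gamma}$ instead.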
Thank you for your attention
J. Banks, C. Moore, N. Verzelen, R. Vershynin, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. ArXiv e-prints, July 2016.

F. Bunea, C. Giraud, M. Royer, and N. Verzelen. PECOK: a convex optimization approach to variable clustering. ArXiv e-prints, June 2016.

D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211-215, Sept 2016. doi: 10.1109/ITW.2016.7606826.

J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. on Optimization, 18(1):186-205, February 2007. ISSN 1052-6234. doi: 10.1137/050641983.
Effect of correcting for Γ on the separating rate of (12)
Separation rate for un-corrected K-means:
$$m\,\Delta_G^2(\mu) \gtrsim \sigma^2 \Big( n + m \log n + \sqrt{r^*(n + m \log n)} + r^* \Big) \qquad (23)$$
In the "high dimension" regime $n + m \log n \lesssim r^*$, the leading term is $\sigma^2 r^*$, whereas with the correction $\hat{\Gamma}$ the leading term was $\sigma^2 \sqrt{r^*(n + m \log n)}$.