On Clustering Histograms with k-Means by Using Mixed α-Divergences



Frank Nielsen, Richard Nock, Shun-ichi Amari: On Clustering Histograms with k-Means by Using Mixed α-Divergences. Entropy 16(6): 3273-3301 (2014)

Transcript of On Clustering Histograms with k-Means by Using Mixed α-Divergences

Page 1: On Clustering Histograms with k-Means by Using Mixed α-Divergences

On Clustering Histograms with k-Means by Using Mixed α-Divergences

Entropy 16(6): 3273-3301 (2014)

Frank Nielsen (1,2), Richard Nock (3), Shun-ichi Amari (4)

(1) Sony Computer Science Laboratories, Japan. E-mail: [email protected]
(2) École Polytechnique, France
(3) NICTA/ANU, Australia
(4) RIKEN Brain Science Institute, Japan

2014


Page 2: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Clustering histograms

◮ Information Retrieval systems (IRs) based on the bag-of-words paradigm (bag-of-textons, bag-of-features, bag-of-X)
◮ The role of distances:
  ◮ Initially, create a dictionary of "words" by quantizing using k-means clustering (depends on the underlying distance)
  ◮ At query time, find the "closest" (histogram) document to the query histogram
◮ Notation: positive arrays h̃ (counting histograms) versus frequency histograms h (normalized counts), with d bins

For IRs, prefer symmetric distances (not necessarily metrics) like the Jeffreys divergence or the Jensen-Shannon divergence (unified by a one-parameter family of divergences in [11])


Page 3: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Ali-Silvey-Csiszár f-divergences

An important class of divergences: the f-divergences [10, 1, 7], defined for a convex generator f (with f(1) = f′(1) = 0 and f″(1) = 1):

$$I_f(p : q) \doteq \sum_{i=1}^{d} q_i\, f\!\left(\frac{p_i}{q_i}\right)$$

Those divergences preserve information monotonicity [3] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [3].
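As a hedged illustration (not the authors' code), a minimal NumPy sketch of an f-divergence on positive arrays, instantiated here with the generator f(t) = t log t - t + 1, which satisfies f(1) = f′(1) = 0 and f″(1) = 1 and recovers the extended Kullback-Leibler divergence:

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence I_f(p:q) = sum_i q_i f(p_i / q_i) for positive arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Generator of the extended Kullback-Leibler divergence:
# f(t) = t log t - t + 1, with f(1) = f'(1) = 0 and f''(1) = 1.
kl_generator = lambda t: t * np.log(t) - t + 1.0

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(f_divergence(p, q, kl_generator))  # extended KL(p : q)
```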


Page 4: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed divergences

Defined on three parameters:

$$M_\lambda(p : q : r) \doteq \lambda D(p : q) + (1-\lambda) D(q : r)$$

for λ ∈ [0, 1]. Mixed divergences include:

◮ the sided divergences for λ ∈ {0, 1},
◮ the symmetrized (arithmetic mean) divergence for λ = 1/2.


Page 5: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed divergence-based k-means clustering

k distinct seeds from the dataset, with $l_i = r_i$.

Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^{k}$;
repeat
  // Assignment
  for i = 1, 2, ..., k do
    $\mathcal{C}_i \leftarrow \{h \in H : i = \mathrm{argmin}_j\, M_\lambda(l_j : h : r_j)\}$;
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    $r_i \leftarrow \mathrm{argmin}_x D(\mathcal{C}_i : x) = \sum_{h_j \in \mathcal{C}_i} w_j D(h_j : x)$;
    $l_i \leftarrow \mathrm{argmin}_x D(x : \mathcal{C}_i) = \sum_{h_j \in \mathcal{C}_i} w_j D(x : h_j)$;
until convergence;
Output: Partition of H into k clusters following C;

→ Different from the k-means clustering with respect to the symmetrized divergences. A generic code sketch is given below.
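The loop above can be sketched generically in NumPy. This is an illustrative reading of the pseudocode, not the authors' implementation; the two centroid solvers are hypothetical user-supplied callables, since for a generic divergence D the relocation step has no closed form:

```python
import numpy as np

def mixed_kmeans(H, w, k, D, left_centroid, right_centroid, lam=0.5, iters=20, seed=0):
    """Sketch of mixed-divergence k-means on an (n, d) array H of positive histograms
    with weights w. left_centroid(cluster, weights) and right_centroid(cluster, weights)
    are assumed to return argmin_x sum_j w_j D(x : h_j) and argmin_x sum_j w_j D(h_j : x)."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(H), size=k, replace=False)   # k distinct seeds, with l_i = r_i
    L, R = H[idx].copy(), H[idx].copy()
    for _ in range(iters):
        # Assignment: nearest center with respect to the mixed divergence M_lambda
        cost = np.array([[lam * D(L[i], h) + (1 - lam) * D(h, R[i]) for i in range(k)]
                         for h in H])
        labels = cost.argmin(axis=1)
        # Dual-sided centroid relocation
        for i in range(k):
            Ci, wi = H[labels == i], w[labels == i]
            if len(Ci):
                R[i] = right_centroid(Ci, wi)
                L[i] = left_centroid(Ci, wi)
    return labels, L, R
```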


Page 6: On Clustering Histograms with k-Means by Using Mixed α-Divergences

α-divergences

For α ∈ ℝ, α ≠ ±1, define the α-divergences [6] on positive arrays [18]:

$$D_\alpha(p : q) \doteq \sum_{i=1}^{d} \frac{4}{1-\alpha^{2}}\left(\frac{1-\alpha}{2}\, p_i + \frac{1+\alpha}{2}\, q_i - p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}}\right)$$

with $D_\alpha(p : q) = D_{-\alpha}(q : p)$, and in the limit cases $D_{-1}(p : q) = \mathrm{KL}(p : q)$ and $D_{1}(p : q) = \mathrm{KL}(q : p)$, where KL is the extended Kullback–Leibler divergence:

$$\mathrm{KL}(p : q) \doteq \sum_{i=1}^{d} \left( p_i \log \frac{p_i}{q_i} + q_i - p_i \right).$$
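A minimal NumPy sketch of this extended α-divergence, handling the KL limit cases at α = ±1 (illustrative only, not the authors' code):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence D_alpha(p:q) on positive arrays, with the
    Kullback-Leibler limit cases at alpha = -1 and alpha = +1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, -1.0):   # D_{-1}(p:q) = extended KL(p:q)
        return float(np.sum(p * np.log(p / q) + q - p))
    if np.isclose(alpha, 1.0):    # D_{+1}(p:q) = extended KL(q:p)
        return float(np.sum(q * np.log(q / p) + p - q))
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return float(np.sum(4.0 / (1 - alpha**2) * (a * p + b * q - p**a * q**b)))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(alpha_divergence(p, q, 0.999), alpha_divergence(p, q, 1.0))  # continuity near the limit
```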


Page 7: On Clustering Histograms with k-Means by Using Mixed α-Divergences

α-divergences belong to the f-divergences

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

$$f(t) = \begin{cases} \frac{4}{1-\alpha^{2}}\left(1 - t^{\frac{1+\alpha}{2}}\right) & \text{if } \alpha \neq \pm 1,\\ t \ln t & \text{if } \alpha = 1,\\ -\ln t & \text{if } \alpha = -1. \end{cases}$$

The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3:

$$D_{3}(p : q) = \frac{1}{2} \sum_{i} \frac{(q_i - p_i)^{2}}{p_i}, \qquad D_{-3}(p : q) = \frac{1}{2} \sum_{i} \frac{(q_i - p_i)^{2}}{q_i}.$$


Page 8: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Squared Hellinger symmetric distance is the α = 0 divergence

Divergence D0 is the squared Hellinger symmetric distance (scaled by 4) extended to positive arrays:

$$D_{0}(p : q) = 2 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} \mathrm{d}x = 4\, H^{2}(p, q),$$

with the Hellinger distance:

$$H(p, q) = \sqrt{\frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} \mathrm{d}x}$$
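A quick numeric sanity check for the discrete case, reusing alpha_divergence() from the earlier sketch:

```python
import numpy as np

# Discrete case: D_0(p:q) = 2 * sum_i (sqrt(p_i) - sqrt(q_i))^2 = 4 * H^2(p, q).
p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
print(alpha_divergence(p, q, 0.0), 4 * hellinger_sq)  # both equal 2 * sum (sqrt p - sqrt q)^2
```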


Page 9: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-divergences

◮ Mixed α-divergence between a histogram x and two histograms p and q:

$$M_{\lambda,\alpha}(p : x : q) = \lambda D_\alpha(p : x) + (1-\lambda) D_\alpha(x : q) = \lambda D_{-\alpha}(x : p) + (1-\lambda) D_{-\alpha}(q : x) = M_{1-\lambda,-\alpha}(q : x : p)$$

◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

$$S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p)$$

◮ The skew symmetrized α-divergence is defined by:

$$S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1-\lambda) D_\alpha(q : p)$$
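A one-line sketch of the mixed α-divergence, reusing alpha_divergence() from the earlier sketch, together with a numeric check of the stated reference-duality identity:

```python
import numpy as np

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lambda,alpha}(p : x : q) = lambda D_alpha(p:x) + (1-lambda) D_alpha(x:q),
    reusing alpha_divergence() from the earlier sketch."""
    return lam * alpha_divergence(p, x, alpha) + (1 - lam) * alpha_divergence(x, q, alpha)

# Check of M_{lambda,alpha}(p:x:q) = M_{1-lambda,-alpha}(q:x:p):
p, x, q = np.array([0.2, 0.5, 0.3]), np.array([0.25, 0.45, 0.3]), np.array([0.3, 0.4, 0.3])
print(mixed_alpha_divergence(p, x, q, 0.3, 0.7),
      mixed_alpha_divergence(q, x, p, 0.7, -0.7))
```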


Page 10: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Coupled k-Means++ α-Seeding

Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)

Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ ℝ;
Let C ← h_j with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability

  $$\pi_H(h) \doteq \frac{w_h\, M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y\, M_{\lambda,\alpha}(c_y : y : c_y)}, \qquad (1)$$

  // where $(c_h, c_h) \doteq \mathrm{argmin}_{(z,z) \in C}\, M_{\lambda,\alpha}(z : h : z)$;
  C ← C ∪ {(h, h)};
Output: Set of initial cluster centers C;
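A compact sketch of this seeding (reusing mixed_alpha_divergence() from above); it mirrors k-means++, picking each new seed proportionally to its weighted mixed α-divergence to its closest current seed. This is an illustrative reading of the pseudocode, not the authors' implementation:

```python
import numpy as np

def mixed_alpha_seeding(H, w, k, lam, alpha, seed=0):
    """Sketch of MAS(H, k, lambda, alpha): k-means++-style seeding in which the
    potential of h is its weighted mixed alpha-divergence to its closest current seed."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    rng = np.random.default_rng(seed)
    centers = [H[rng.integers(len(H))]]                 # first seed: uniform pick
    for _ in range(1, k):
        cost = np.array([min(mixed_alpha_divergence(c, h, c, lam, alpha) for c in centers)
                         for h in H])
        probs = w * cost
        probs /= probs.sum()
        centers.append(H[rng.choice(len(H), p=probs)])
    return np.array(centers)                            # each seed serves as both l_i and r_i
```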


Page 11: On Clustering Histograms with k-Means by Using Mixed α-Divergences

A guaranteed probabilistic initialization

Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS, and let $C^{\mathrm{opt}}_{\lambda,\alpha}$ denote the optimal related clustering into k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average with respect to distribution (1), the initial clustering of MAS satisfies:

$$\mathbb{E}_{\pi}[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda)\, g(k)\, h^{2}(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1),\\ g(k)\, z(\alpha)\, h^{4}(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise.} \end{cases}$$

Here, $f(\lambda) = \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left(\frac{1+|\alpha|}{1-|\alpha|}\right)^{\frac{8|\alpha|^{2}}{(1-|\alpha|)^{2}}}$, and $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$; the min is defined on strictly positive coordinates, and π denotes the picking distribution.


Page 12: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-hard clustering: MAhC(H, k, λ, α)

Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ ℝ;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    $A_i \leftarrow \{h \in H : i = \mathrm{argmin}_j\, M_{\lambda,\alpha}(l_j : h : r_j)\}$;
  // Centroid relocation
  for i = 1, 2, ..., k do
    $r_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
    $l_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
until convergence;
Output: Partition of H into k clusters following C;
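A compact NumPy sketch of MAhC, reusing mixed_alpha_seeding() and mixed_alpha_divergence() from the earlier sketches. The centroid relocation follows the closed-form weighted α-means above; renormalizing the weights within each cluster is an assumption made here so that the updates are proper weighted means:

```python
import numpy as np

def alpha_mean(X, w, alpha):
    """Weighted alpha-mean, coordinate-wise; geometric-mean limit at alpha = 1.
    Weights are renormalized (an assumption made for this sketch)."""
    X, w = np.asarray(X, float), np.asarray(w, float)
    w = w / w.sum()
    if np.isclose(alpha, 1.0):
        return np.exp(np.sum(w[:, None] * np.log(X), axis=0))     # weighted geometric mean
    e = (1 - alpha) / 2
    return np.sum(w[:, None] * X**e, axis=0) ** (1 / e)

def mixed_alpha_hard_clustering(H, w, k, lam, alpha, iters=30):
    """Sketch of MAhC(H, k, lambda, alpha)."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    seeds = mixed_alpha_seeding(H, w, k, lam, alpha)
    L, R = seeds.copy(), seeds.copy()
    for _ in range(iters):
        # Assignment: closest pair (l_i, r_i) under the mixed alpha-divergence
        cost = np.array([[mixed_alpha_divergence(L[i], h, R[i], lam, alpha)
                          for i in range(k)] for h in H])
        labels = cost.argmin(axis=1)
        # Centroid relocation: closed-form weighted alpha-means
        for i in range(k):
            Ai, wi = H[labels == i], w[labels == i]
            if len(Ai):
                R[i] = alpha_mean(Ai, wi, alpha)    # right-sided alpha-centroid
                L[i] = alpha_mean(Ai, wi, -alpha)   # left-sided alpha-centroid
    return labels, L, R
```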


Page 13: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Sided Positive α-Centroids [14]

The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid coordinates of a set of n positive histograms $h_1, \ldots, h_n$ are weighted α-means:

$$r^{i}_{\alpha} = f_\alpha^{-1}\!\left(\sum_{j=1}^{n} w_j\, f_\alpha(h^{i}_{j})\right), \qquad l^{i}_{\alpha} = r^{i}_{-\alpha},$$

with

$$f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq 1,\\ \log x & \alpha = 1. \end{cases}$$
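The same centroids written directly through the representation function f_α and its inverse (a sketch; the log branch covers the α = 1 limit):

```python
import numpy as np

def f_alpha(x, alpha):
    """Representation function f_alpha(x) = x^{(1-alpha)/2}, with log in the limit alpha = 1."""
    return np.log(x) if np.isclose(alpha, 1.0) else x ** ((1 - alpha) / 2)

def f_alpha_inv(y, alpha):
    return np.exp(y) if np.isclose(alpha, 1.0) else y ** (2 / (1 - alpha))

def sided_alpha_centroids(H, w, alpha):
    """Right-sided centroid r_alpha = f_alpha^{-1}(sum_j w_j f_alpha(h_j)); left-sided l_alpha = r_{-alpha}."""
    H, w = np.asarray(H, float), np.asarray(w, float)[:, None]
    r = f_alpha_inv(np.sum(w * f_alpha(H, alpha), axis=0), alpha)
    l = f_alpha_inv(np.sum(w * f_alpha(H, -alpha), axis=0), -alpha)
    return l, r
```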


Page 14: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Sided Frequency α-Centroids [2]

Theorem (Amari, 2007)

The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.


Page 15: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Positive and Frequency α-centroids

Summary:

◮ Positive right-sided centroid:
$$\tilde r^{\,i}_{\alpha} = \begin{cases} \left(\sum_{j=1}^{n} w_j\, (h^{i}_{j})^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}} & \alpha \neq 1,\\ \tilde r^{\,i}_{1} = \prod_{j=1}^{n} (h^{i}_{j})^{w_j} & \alpha = 1. \end{cases}$$

◮ Positive left-sided centroid:
$$\tilde l^{\,i}_{\alpha} = \tilde r^{\,i}_{-\alpha} = \begin{cases} \left(\sum_{j=1}^{n} w_j\, (h^{i}_{j})^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}} & \alpha \neq -1,\\ \tilde l^{\,i}_{-1} = \prod_{j=1}^{n} (h^{i}_{j})^{w_j} & \alpha = -1. \end{cases}$$

◮ Frequency right-sided centroid: $r^{i}_{\alpha} = \tilde r^{\,i}_{\alpha} / w(\tilde r_{\alpha})$
◮ Frequency left-sided centroid: $l^{i}_{\alpha} = r^{i}_{-\alpha} = \tilde r^{\,i}_{-\alpha} / w(\tilde r_{-\alpha})$


Page 16: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-Centroids

The two centroids are the minimizers of:

$$\sum_{j} w_j\, M_{\lambda,\alpha}(l : h_j : r)$$

Generalizing mixed Bregman divergences [16]:

Theorem
The two mixed α-centroids are the left-sided and right-sided α-centroids.


Page 17: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Symmetrized Jeffreys-Type α-Centroids

$$S_\alpha(p, q) = \frac{1}{2}\left(D_\alpha(p : q) + D_\alpha(q : p)\right) = S_{-\alpha}(p, q) = M_{\frac{1}{2}}(p : q : p),$$

For α = ±1, we get half of the Jeffreys divergence:

$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p_i - q_i) \log \frac{p_i}{q_i}$$


Page 18: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys α-divergence and Heinz means

When p and q are frequency histograms, we have for α ≠ ±1:

$$J_\alpha(p : q) = \frac{8}{1-\alpha^{2}}\left(1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(p_i, q_i)\right),$$

where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [8, 5]:

$$H_\beta(a, b) = \frac{a^{\beta} b^{1-\beta} + a^{1-\beta} b^{\beta}}{2}$$

Heinz means interpolate between the geometric and arithmetic means and satisfy the inequality:

$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\alpha(a, b) \leq H_{0}(a, b) = \frac{a+b}{2}.$$
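A tiny numeric illustration of the Heinz mean and of its interpolation between the geometric (β = 1/2) and arithmetic (β = 0) means:

```python
import numpy as np

def heinz_mean(a, b, beta):
    """Heinz mean H_beta(a, b) = (a^beta b^(1-beta) + a^(1-beta) b^beta) / 2."""
    return (a**beta * b**(1 - beta) + a**(1 - beta) * b**beta) / 2

a, b = 2.0, 8.0
print(heinz_mean(a, b, 0.5), np.sqrt(a * b))   # geometric mean: both 4.0
print(heinz_mean(a, b, 0.0), (a + b) / 2)      # arithmetic mean: both 5.0
```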


Page 19: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys divergence in the limit case

For α = ±1, Sα(p, q) tends to the Jeffreys divergence:

$$J(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_{i=1}^{d} (p_i - q_i)(\log p_i - \log q_i)$$

The Jeffreys divergence takes the same mathematical form for frequency histograms:

$$J(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_{i=1}^{d} (p_i - q_i)(\log p_i - \log q_i)$$


Page 20: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Analytic formula for the positive Jeffreys centroid [12]

Theorem (Jeffreys positive centroid [12])

The Jeffreys positive centroid $c = (c^{1}, \ldots, c^{d})$ of a set $\{h_1, \ldots, h_n\}$ of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

$$c^{i} = \frac{a^{i}}{W\!\left(\frac{a^{i}}{g^{i}}\, e\right)}$$

where $a^{i} = \sum_{j=1}^{n} \pi_j h^{i}_{j}$ denotes the coordinate-wise arithmetic weighted mean and $g^{i} = \prod_{j=1}^{n} (h^{i}_{j})^{\pi_j}$ the coordinate-wise geometric weighted mean.

The Lambert analytic function W [4] (positive branch) is defined by $W(x)\, e^{W(x)} = x$ for x ≥ 0.
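A short sketch of this closed-form centroid using SciPy's principal-branch Lambert W (illustrative only; the weights π_j are user-supplied):

```python
import numpy as np
from scipy.special import lambertw  # principal branch W_0 of the Lambert W function

def jeffreys_positive_centroid(H, weights):
    """Component-wise Jeffreys positive centroid c^i = a^i / W(a^i e / g^i), where a and g
    are the coordinate-wise arithmetic and geometric weighted means (a sketch)."""
    H = np.asarray(H, float)
    w = np.asarray(weights, float)[:, None]
    a = np.sum(w * H, axis=0)                      # arithmetic weighted mean
    g = np.exp(np.sum(w * np.log(H), axis=0))      # geometric weighted mean
    return a / lambertw(a * np.e / g).real

H = np.array([[1.0, 4.0, 2.0],
              [3.0, 2.0, 5.0]])
print(jeffreys_positive_centroid(H, [0.5, 0.5]))
```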


Page 21: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys frequency centroid [12]

Theorem (Jeffreys frequency centroid [12])

Let c denote the Jeffreys frequency centroid and $c' = \tilde c / w_{\tilde c}$ the normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{c'} = \frac{S_1(c', H)}{S_1(c, H)}$ is such that $1 \leq \alpha_{c'} \leq \frac{1}{w_{\tilde c}}$ (with $w_{\tilde c} \leq 1$).

Better upper bounds are given in [12].


Page 22: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Reducing an n-size problem to a 2-size problem

Generalize [17] (symmetrized Kullback–Leibler divergence) and [15] (symmetrized Bregman divergence)

Lemma (Reduction property)

The symmetrized J_α-centroid of a set of n weighted histograms amounts to computing the symmetrized α-centroid of the weighted α-mean and −α-mean:

$$\min_{x} J_\alpha(x, H) = \min_{x} \left( D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x) \right).$$


Page 23: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Frequency symmetrized α-centroid

Minimizer of $\min_{x \in \Delta_d} \sum_{j} w_j\, S_\alpha(x, h_j)$.

Instead of seeking x in the probability simplex, we can optimize over the unconstrained domain $\mathbb{R}^{d-1}$ by using the natural parameter reparameterization [13] of multinomials.

Lemma
The α-divergence between distributions belonging to the same exponential family amounts to computing a divergence on the corresponding natural parameters:

$$A_\alpha(p : q) = \frac{4}{1-\alpha^{2}}\left(1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)}\right),$$

where $J_F^{\beta}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1-\beta) F(\theta_2) - F(\beta \theta_1 + (1-\beta) \theta_2)$ is a skewed Jensen divergence defined for the log-normaliser F of the family.
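As an illustration of the Lemma, and under the assumption (made here for the example) that a frequency histogram p is treated as a categorical distribution with natural parameters θ_i = log(p_i/p_d), i = 1, ..., d-1, and log-normalizer F(θ) = log(1 + Σ_i e^{θ_i}), the natural-parameter formula can be checked numerically against the direct α-divergence, reusing alpha_divergence() from the earlier sketch:

```python
import numpy as np

def F(theta):
    """Log-normalizer of the categorical family in its natural parameterization (assumed)."""
    return np.log1p(np.sum(np.exp(theta)))

def skew_jensen(theta1, theta2, beta):
    """Skewed Jensen divergence J_F^beta(theta1 : theta2) for the log-normalizer F."""
    return beta * F(theta1) + (1 - beta) * F(theta2) - F(beta * theta1 + (1 - beta) * theta2)

def A_alpha(p, q, alpha):
    """alpha-divergence computed through the natural-parameter formula of the Lemma."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    theta_p, theta_q = np.log(p[:-1] / p[-1]), np.log(q[:-1] / q[-1])
    return 4 / (1 - alpha**2) * (1 - np.exp(-skew_jensen(theta_p, theta_q, (1 - alpha) / 2)))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(A_alpha(p, q, 0.5), alpha_divergence(p, q, 0.5))  # agree for frequency histograms
```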


Page 24: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Implementation (in processing.org)

[Figure] Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins, with k = 8, α = 0.7 and λ = 1/2.

Page 25: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Soft Mixed α-Clustering

Learn both α and λ (α-EM [9])

Input: Histogram set H with |H| = m, integer k > 0, real λ ← λ_init ∈ [0, 1], real α ∈ ℝ;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
  // Expectation
  for i = 1, 2, ..., m do
    for j = 1, 2, ..., k do
      $p(j \mid h_i) = \frac{\pi_j \exp\left(-M_{\lambda,\alpha}(l_j : h_i : r_j)\right)}{\sum_{j'} \pi_{j'} \exp\left(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'})\right)}$;
  // Maximization
  for j = 1, 2, ..., k do
    $\pi_j \leftarrow \frac{1}{m} \sum_i p(j \mid h_i)$;
    $l_j \leftarrow \left(\frac{1}{\sum_i p(j \mid h_i)} \sum_i p(j \mid h_i)\, h_i^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
    $r_j \leftarrow \left(\frac{1}{\sum_i p(j \mid h_i)} \sum_i p(j \mid h_i)\, h_i^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
  // Alpha - Lambda updates
  $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, \frac{\partial}{\partial \alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
  if λ_init ∉ {0, 1} then
    $\lambda \leftarrow \lambda - \eta_2 \left( \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, D_\alpha(h_i : r_j) \right)$;
  // for some small η1, η2; ensure that λ ∈ [0, 1].
until convergence;
Output: Soft clustering of H according to the k densities p(j | ·) following C;
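A simplified NumPy sketch of the soft clustering (E- and M-steps only; α and λ are kept fixed here, so the gradient updates shown in the pseudocode are omitted), reusing mixed_alpha_seeding(), mixed_alpha_divergence() and alpha_mean() from the earlier sketches:

```python
import numpy as np

def soft_mixed_alpha_clustering(H, k, lam, alpha, iters=30):
    """Simplified sketch of soft mixed alpha-clustering with fixed alpha and lambda."""
    H = np.asarray(H, float)
    m = len(H)
    w0 = np.full(m, 1.0 / m)
    seeds = mixed_alpha_seeding(H, w0, k, lam, alpha)
    L, R = seeds.copy(), seeds.copy()
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities p(j | h_i) proportional to pi_j exp(-M_{lambda,alpha}(l_j : h_i : r_j))
        M = np.array([[mixed_alpha_divergence(L[j], h, R[j], lam, alpha) for j in range(k)]
                      for h in H])
        resp = pi * np.exp(-M)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mixture weights and dual-sided weighted alpha-mean centroids
        pi = resp.mean(axis=0)
        for j in range(k):
            L[j] = alpha_mean(H, resp[:, j], -alpha)   # exponent (1+alpha)/2, as in the slide
            R[j] = alpha_mean(H, resp[:, j], alpha)    # exponent (1-alpha)/2
    return resp, pi, L, R
```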

Page 26: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Summary

1. Mixed divergences, mixed divergence k-means++ seeding, coupled k-means seeding
2. Sided left or right α-centroid k-means
3. Coupled k-means with respect to mixed α-divergences relying on dual α-centroids
4. Symmetrized Jeffreys-type α-centroid (variational) k-means

All technical proofs and details in: Entropy 16(6): 3273-3301 (2014)


Page 27: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references I

[1] Syed Mumtaz Ali and Samuel David Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

[2] Shun-ichi Amari. Integration of stochastic models by minimizing α-divergence. Neural Computation, 19(10):2780–2796, 2007.

[3] Shun-ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.

[4] D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Trans. Math. Softw., 21(2):161–171, June 1995.

[5] Adam Besenyei. On the invariance equation for Heinz means. Mathematical Inequalities & Applications, 15(4):973–979, 2012.

[6] Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134–170, 2011.

[7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.


Page 28: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references II

[8] Erhard Heinz. Beiträge zur Störungstheorie der Spektralzerlegung. Mathematische Annalen, 123:415–438, 1951.

[9] Yasuo Matsuyama. The alpha-EM algorithm: surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Transactions on Information Theory, 49(3):692–706, 2003.

[10] Tetsuzo Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of Japan, 18(3), March 1963.

[11] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. CoRR, abs/1009.4004, 2010.

[12] Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters (SPL), 20(7), July 2013.

[13] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, 2009.

[14] Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71–78, DTU Lyngby, Denmark, June 2009. IEEE.


Page 29: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references III

[15] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

[16] Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 154–169, Berlin, Heidelberg, 2008. Springer-Verlag.

[17] Raymond N. J. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3):96–99, March 2002.

[18] Huaiyu Zhu and Richard Rohwer. Measurements of generalisation based on information geometry. In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394–398. Springer US, 1997.
