On Clustering Histograms with k-Means by Using Mixed α-Divergences



Frank Nielsen, Richard Nock, Shun-ichi Amari: On Clustering Histograms with k-Means by Using Mixed α-Divergences. Entropy 16(6): 3273-3301 (2014)

Transcript of On Clustering Histograms with k-Means by Using Mixed α-Divergences

Page 1: On Clustering Histograms with k-Means by Using Mixed α-Divergences

On Clustering Histograms with k-Means by Using Mixed α-Divergences

Entropy 16(6): 3273-3301 (2014)

Frank Nielsen (1,2), Richard Nock (3), Shun-ichi Amari (4)

(1) Sony Computer Science Laboratories, Japan. E-mail: [email protected]
(2) École Polytechnique, France
(3) NICTA/ANU, Australia
(4) RIKEN Brain Science Institute, Japan

2014


Page 2: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Clustering histograms

◮ Information Retrieval systems (IRs) based on the bag-of-words paradigm (bag-of-textons, bag-of-features, bag-of-X)
◮ The role of distances:
  ◮ Initially, create a dictionary of "words" by quantizing using k-means clustering (depends on the underlying distance)
  ◮ At query time, find the "closest" (histogram) document to the query histogram
◮ Notation: positive arrays h̃ (counting histograms) versus frequency histograms h (normalized counts), with d bins

For IRs, prefer symmetric distances (not necessarily metrics) like the Jeffreys divergence or the Jensen-Shannon divergence (unified by a one-parameter family of divergences in [11])


Page 3: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Ali-Silvey-Csiszár f-divergences

An important class of divergences: the f-divergences [10, 1, 7], defined for a convex generator f (with f(1) = f′(1) = 0 and f″(1) = 1):

$$I_f(p : q) \doteq \sum_{i=1}^{d} q_i\, f\!\left(\frac{p_i}{q_i}\right)$$

Those divergences preserve information monotonicity [3] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [3].
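As a hedged illustration (not the authors' code), a minimal NumPy sketch of an f-divergence on positive arrays, instantiated here with the generator f(t) = t log t - t + 1, which satisfies f(1) = f′(1) = 0 and f″(1) = 1 and recovers the extended Kullback-Leibler divergence:

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence I_f(p:q) = sum_i q_i f(p_i / q_i) for positive arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Generator of the extended Kullback-Leibler divergence:
# f(t) = t log t - t + 1, with f(1) = f'(1) = 0 and f''(1) = 1.
kl_generator = lambda t: t * np.log(t) - t + 1.0

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(f_divergence(p, q, kl_generator))  # extended KL(p : q)
```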


Page 4: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed divergences

Defined on three parameters:

$$M_\lambda(p : q : r) \doteq \lambda D(p : q) + (1-\lambda) D(q : r)$$

for λ ∈ [0, 1]. Mixed divergences include:

◮ the sided divergences for λ ∈ {0, 1},
◮ the symmetrized (arithmetic mean) divergence for λ = 1/2.


Page 5: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed divergence-based k-means clustering

k distinct seeds from the dataset, with $l_i = r_i$.

Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^{k}$;
repeat
  // Assignment
  for i = 1, 2, ..., k do
    $\mathcal{C}_i \leftarrow \{h \in H : i = \mathrm{argmin}_j\, M_\lambda(l_j : h : r_j)\}$;
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    $r_i \leftarrow \mathrm{argmin}_x D(\mathcal{C}_i : x) = \sum_{h_j \in \mathcal{C}_i} w_j D(h_j : x)$;
    $l_i \leftarrow \mathrm{argmin}_x D(x : \mathcal{C}_i) = \sum_{h_j \in \mathcal{C}_i} w_j D(x : h_j)$;
until convergence;
Output: Partition of H into k clusters following C;

→ Different from the k-means clustering with respect to the symmetrized divergences. A generic code sketch is given below.
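The loop above can be sketched generically in NumPy. This is an illustrative reading of the pseudocode, not the authors' implementation; the two centroid solvers are hypothetical user-supplied callables, since for a generic divergence D the relocation step has no closed form:

```python
import numpy as np

def mixed_kmeans(H, w, k, D, left_centroid, right_centroid, lam=0.5, iters=20, seed=0):
    """Sketch of mixed-divergence k-means on an (n, d) array H of positive histograms
    with weights w. left_centroid(cluster, weights) and right_centroid(cluster, weights)
    are assumed to return argmin_x sum_j w_j D(x : h_j) and argmin_x sum_j w_j D(h_j : x)."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(H), size=k, replace=False)   # k distinct seeds, with l_i = r_i
    L, R = H[idx].copy(), H[idx].copy()
    for _ in range(iters):
        # Assignment: nearest center with respect to the mixed divergence M_lambda
        cost = np.array([[lam * D(L[i], h) + (1 - lam) * D(h, R[i]) for i in range(k)]
                         for h in H])
        labels = cost.argmin(axis=1)
        # Dual-sided centroid relocation
        for i in range(k):
            Ci, wi = H[labels == i], w[labels == i]
            if len(Ci):
                R[i] = right_centroid(Ci, wi)
                L[i] = left_centroid(Ci, wi)
    return labels, L, R
```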


Page 6: On Clustering Histograms with k-Means by Using Mixed α-Divergences

α-divergences

For α ∈ ℝ, α ≠ ±1, define the α-divergences [6] on positive arrays [18]:

$$D_\alpha(p : q) \doteq \sum_{i=1}^{d} \frac{4}{1-\alpha^{2}}\left(\frac{1-\alpha}{2}\, p_i + \frac{1+\alpha}{2}\, q_i - p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}}\right)$$

with $D_\alpha(p : q) = D_{-\alpha}(q : p)$, and in the limit cases $D_{-1}(p : q) = \mathrm{KL}(p : q)$ and $D_{1}(p : q) = \mathrm{KL}(q : p)$, where KL is the extended Kullback–Leibler divergence:

$$\mathrm{KL}(p : q) \doteq \sum_{i=1}^{d} \left( p_i \log \frac{p_i}{q_i} + q_i - p_i \right).$$
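A minimal NumPy sketch of this extended α-divergence, handling the KL limit cases at α = ±1 (illustrative only, not the authors' code):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence D_alpha(p:q) on positive arrays, with the
    Kullback-Leibler limit cases at alpha = -1 and alpha = +1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, -1.0):   # D_{-1}(p:q) = extended KL(p:q)
        return float(np.sum(p * np.log(p / q) + q - p))
    if np.isclose(alpha, 1.0):    # D_{+1}(p:q) = extended KL(q:p)
        return float(np.sum(q * np.log(q / p) + p - q))
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return float(np.sum(4.0 / (1 - alpha**2) * (a * p + b * q - p**a * q**b)))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(alpha_divergence(p, q, 0.999), alpha_divergence(p, q, 1.0))  # continuity near the limit
```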


Page 7: On Clustering Histograms with k-Means by Using Mixed α-Divergences

α-divergences belong to the f-divergences

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

$$f(t) = \begin{cases} \frac{4}{1-\alpha^{2}}\left(1 - t^{\frac{1+\alpha}{2}}\right) & \text{if } \alpha \neq \pm 1,\\ t \ln t & \text{if } \alpha = 1,\\ -\ln t & \text{if } \alpha = -1. \end{cases}$$

The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3:

$$D_{3}(p : q) = \frac{1}{2} \sum_{i} \frac{(q_i - p_i)^{2}}{p_i}, \qquad D_{-3}(p : q) = \frac{1}{2} \sum_{i} \frac{(q_i - p_i)^{2}}{q_i}.$$


Page 8: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Squared Hellinger symmetric distance is the α = 0 divergence

Divergence D0 is the squared Hellinger symmetric distance (scaled by 4) extended to positive arrays:

$$D_{0}(p : q) = 2 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} \mathrm{d}x = 4\, H^{2}(p, q),$$

with the Hellinger distance:

$$H(p, q) = \sqrt{\frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} \mathrm{d}x}$$
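A quick numeric sanity check for the discrete case, reusing alpha_divergence() from the earlier sketch:

```python
import numpy as np

# Discrete case: D_0(p:q) = 2 * sum_i (sqrt(p_i) - sqrt(q_i))^2 = 4 * H^2(p, q).
p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
print(alpha_divergence(p, q, 0.0), 4 * hellinger_sq)  # both equal 2 * sum (sqrt p - sqrt q)^2
```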


Page 9: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-divergences

◮ Mixed α-divergence between a histogram x and two histograms p and q:

$$M_{\lambda,\alpha}(p : x : q) = \lambda D_\alpha(p : x) + (1-\lambda) D_\alpha(x : q) = \lambda D_{-\alpha}(x : p) + (1-\lambda) D_{-\alpha}(q : x) = M_{1-\lambda,-\alpha}(q : x : p)$$

◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

$$S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p)$$

◮ The skew symmetrized α-divergence is defined by:

$$S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1-\lambda) D_\alpha(q : p)$$
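A one-line sketch of the mixed α-divergence, reusing alpha_divergence() from the earlier sketch, together with a numeric check of the stated reference-duality identity:

```python
import numpy as np

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lambda,alpha}(p : x : q) = lambda D_alpha(p:x) + (1-lambda) D_alpha(x:q),
    reusing alpha_divergence() from the earlier sketch."""
    return lam * alpha_divergence(p, x, alpha) + (1 - lam) * alpha_divergence(x, q, alpha)

# Check of M_{lambda,alpha}(p:x:q) = M_{1-lambda,-alpha}(q:x:p):
p, x, q = np.array([0.2, 0.5, 0.3]), np.array([0.25, 0.45, 0.3]), np.array([0.3, 0.4, 0.3])
print(mixed_alpha_divergence(p, x, q, 0.3, 0.7),
      mixed_alpha_divergence(q, x, p, 0.7, -0.7))
```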


Page 10: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Coupled k-Means++ α-Seeding

Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)

Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ ℝ;
Let C ← h_j with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability

  $$\pi_H(h) \doteq \frac{w_h\, M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y\, M_{\lambda,\alpha}(c_y : y : c_y)}, \qquad (1)$$

  // where $(c_h, c_h) \doteq \mathrm{argmin}_{(z,z) \in C}\, M_{\lambda,\alpha}(z : h : z)$;
  C ← C ∪ {(h, h)};
Output: Set of initial cluster centers C;
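A compact sketch of this seeding (reusing mixed_alpha_divergence() from above); it mirrors k-means++, picking each new seed proportionally to its weighted mixed α-divergence to its closest current seed. This is an illustrative reading of the pseudocode, not the authors' implementation:

```python
import numpy as np

def mixed_alpha_seeding(H, w, k, lam, alpha, seed=0):
    """Sketch of MAS(H, k, lambda, alpha): k-means++-style seeding in which the
    potential of h is its weighted mixed alpha-divergence to its closest current seed."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    rng = np.random.default_rng(seed)
    centers = [H[rng.integers(len(H))]]                 # first seed: uniform pick
    for _ in range(1, k):
        cost = np.array([min(mixed_alpha_divergence(c, h, c, lam, alpha) for c in centers)
                         for h in H])
        probs = w * cost
        probs /= probs.sum()
        centers.append(H[rng.choice(len(H), p=probs)])
    return np.array(centers)                            # each seed serves as both l_i and r_i
```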


Page 11: On Clustering Histograms with k-Means by Using Mixed α-Divergences

A guaranteed probabilistic initialization

Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS, and let $C^{\mathrm{opt}}_{\lambda,\alpha}$ denote the optimal related clustering into k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average with respect to distribution (1), the initial clustering of MAS satisfies:

$$\mathbb{E}_{\pi}[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda)\, g(k)\, h^{2}(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1),\\ g(k)\, z(\alpha)\, h^{4}(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise.} \end{cases}$$

Here, $f(\lambda) = \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left(\frac{1+|\alpha|}{1-|\alpha|}\right)^{\frac{8|\alpha|^{2}}{(1-|\alpha|)^{2}}}$, and $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$; the min is defined on strictly positive coordinates, and π denotes the picking distribution.


Page 12: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-hard clustering: MAhC(H, k, λ, α)

Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ ℝ;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    $A_i \leftarrow \{h \in H : i = \mathrm{argmin}_j\, M_{\lambda,\alpha}(l_j : h : r_j)\}$;
  // Centroid relocation
  for i = 1, 2, ..., k do
    $r_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
    $l_i \leftarrow \left(\sum_{h \in A_i} w_h\, h^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
until convergence;
Output: Partition of H into k clusters following C;
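A compact NumPy sketch of MAhC, reusing mixed_alpha_seeding() and mixed_alpha_divergence() from the earlier sketches. The centroid relocation follows the closed-form weighted α-means above; renormalizing the weights within each cluster is an assumption made here so that the updates are proper weighted means:

```python
import numpy as np

def alpha_mean(X, w, alpha):
    """Weighted alpha-mean, coordinate-wise; geometric-mean limit at alpha = 1.
    Weights are renormalized (an assumption made for this sketch)."""
    X, w = np.asarray(X, float), np.asarray(w, float)
    w = w / w.sum()
    if np.isclose(alpha, 1.0):
        return np.exp(np.sum(w[:, None] * np.log(X), axis=0))     # weighted geometric mean
    e = (1 - alpha) / 2
    return np.sum(w[:, None] * X**e, axis=0) ** (1 / e)

def mixed_alpha_hard_clustering(H, w, k, lam, alpha, iters=30):
    """Sketch of MAhC(H, k, lambda, alpha)."""
    H, w = np.asarray(H, float), np.asarray(w, float)
    seeds = mixed_alpha_seeding(H, w, k, lam, alpha)
    L, R = seeds.copy(), seeds.copy()
    for _ in range(iters):
        # Assignment: closest pair (l_i, r_i) under the mixed alpha-divergence
        cost = np.array([[mixed_alpha_divergence(L[i], h, R[i], lam, alpha)
                          for i in range(k)] for h in H])
        labels = cost.argmin(axis=1)
        # Centroid relocation: closed-form weighted alpha-means
        for i in range(k):
            Ai, wi = H[labels == i], w[labels == i]
            if len(Ai):
                R[i] = alpha_mean(Ai, wi, alpha)    # right-sided alpha-centroid
                L[i] = alpha_mean(Ai, wi, -alpha)   # left-sided alpha-centroid
    return labels, L, R
```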


Page 13: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Sided Positive α-Centroids [14]

The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid coordinates of a set of n positive histograms $h_1, \ldots, h_n$ are weighted α-means:

$$r^{i}_{\alpha} = f_\alpha^{-1}\!\left(\sum_{j=1}^{n} w_j\, f_\alpha(h^{i}_{j})\right), \qquad l^{i}_{\alpha} = r^{i}_{-\alpha},$$

with

$$f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq 1,\\ \log x & \alpha = 1. \end{cases}$$
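The same centroids written directly through the representation function f_α and its inverse (a sketch; the log branch covers the α = 1 limit):

```python
import numpy as np

def f_alpha(x, alpha):
    """Representation function f_alpha(x) = x^{(1-alpha)/2}, with log in the limit alpha = 1."""
    return np.log(x) if np.isclose(alpha, 1.0) else x ** ((1 - alpha) / 2)

def f_alpha_inv(y, alpha):
    return np.exp(y) if np.isclose(alpha, 1.0) else y ** (2 / (1 - alpha))

def sided_alpha_centroids(H, w, alpha):
    """Right-sided centroid r_alpha = f_alpha^{-1}(sum_j w_j f_alpha(h_j)); left-sided l_alpha = r_{-alpha}."""
    H, w = np.asarray(H, float), np.asarray(w, float)[:, None]
    r = f_alpha_inv(np.sum(w * f_alpha(H, alpha), axis=0), alpha)
    l = f_alpha_inv(np.sum(w * f_alpha(H, -alpha), axis=0), -alpha)
    return l, r
```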


Page 14: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Sided Frequency α-Centroids [2]

Theorem (Amari, 2007)

The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.


Page 15: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Positive and Frequency α-centroids

Summary:

◮ Positive right-sided centroid:
$$\tilde r^{\,i}_{\alpha} = \begin{cases} \left(\sum_{j=1}^{n} w_j\, (h^{i}_{j})^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}} & \alpha \neq 1,\\ \tilde r^{\,i}_{1} = \prod_{j=1}^{n} (h^{i}_{j})^{w_j} & \alpha = 1. \end{cases}$$

◮ Positive left-sided centroid:
$$\tilde l^{\,i}_{\alpha} = \tilde r^{\,i}_{-\alpha} = \begin{cases} \left(\sum_{j=1}^{n} w_j\, (h^{i}_{j})^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}} & \alpha \neq -1,\\ \tilde l^{\,i}_{-1} = \prod_{j=1}^{n} (h^{i}_{j})^{w_j} & \alpha = -1. \end{cases}$$

◮ Frequency right-sided centroid: $r^{i}_{\alpha} = \tilde r^{\,i}_{\alpha} / w(\tilde r_{\alpha})$
◮ Frequency left-sided centroid: $l^{i}_{\alpha} = r^{i}_{-\alpha} = \tilde r^{\,i}_{-\alpha} / w(\tilde r_{-\alpha})$


Page 16: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Mixed α-Centroids

The two centroids are the minimizers of:

$$\sum_{j} w_j\, M_{\lambda,\alpha}(l : h_j : r)$$

Generalizing mixed Bregman divergences [16]:

Theorem
The two mixed α-centroids are the left-sided and right-sided α-centroids.


Page 17: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Symmetrized Jeffreys-Type α-Centroids

$$S_\alpha(p, q) = \frac{1}{2}\left(D_\alpha(p : q) + D_\alpha(q : p)\right) = S_{-\alpha}(p, q) = M_{\frac{1}{2}}(p : q : p),$$

For α = ±1, we get half of the Jeffreys divergence:

$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p_i - q_i) \log \frac{p_i}{q_i}$$


Page 18: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys α-divergence and Heinz means

When p and q are frequency histograms, we have for α ≠ ±1:

$$J_\alpha(p : q) = \frac{8}{1-\alpha^{2}}\left(1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(p_i, q_i)\right),$$

where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [8, 5]:

$$H_\beta(a, b) = \frac{a^{\beta} b^{1-\beta} + a^{1-\beta} b^{\beta}}{2}$$

Heinz means interpolate between the geometric and arithmetic means and satisfy the inequality:

$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\alpha(a, b) \leq H_{0}(a, b) = \frac{a+b}{2}.$$
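A tiny numeric illustration of the Heinz mean and of its interpolation between the geometric (β = 1/2) and arithmetic (β = 0) means:

```python
import numpy as np

def heinz_mean(a, b, beta):
    """Heinz mean H_beta(a, b) = (a^beta b^(1-beta) + a^(1-beta) b^beta) / 2."""
    return (a**beta * b**(1 - beta) + a**(1 - beta) * b**beta) / 2

a, b = 2.0, 8.0
print(heinz_mean(a, b, 0.5), np.sqrt(a * b))   # geometric mean: both 4.0
print(heinz_mean(a, b, 0.0), (a + b) / 2)      # arithmetic mean: both 5.0
```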


Page 19: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys divergence in the limit case

For α = ±1, Sα(p, q) tends to the Jeffreys divergence:

$$J(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_{i=1}^{d} (p_i - q_i)(\log p_i - \log q_i)$$

The Jeffreys divergence takes the same mathematical form for frequency histograms:

$$J(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_{i=1}^{d} (p_i - q_i)(\log p_i - \log q_i)$$


Page 20: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Analytic formula for the positive Jeffreys centroid [12]

Theorem (Jeffreys positive centroid [12])

The Jeffreys positive centroid $c = (c^{1}, \ldots, c^{d})$ of a set $\{h_1, \ldots, h_n\}$ of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

$$c^{i} = \frac{a^{i}}{W\!\left(\frac{a^{i}}{g^{i}}\, e\right)}$$

where $a^{i} = \sum_{j=1}^{n} \pi_j h^{i}_{j}$ denotes the coordinate-wise arithmetic weighted mean and $g^{i} = \prod_{j=1}^{n} (h^{i}_{j})^{\pi_j}$ the coordinate-wise geometric weighted mean.

The Lambert analytic function W [4] (positive branch) is defined by $W(x)\, e^{W(x)} = x$ for x ≥ 0.
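A short sketch of this closed-form centroid using SciPy's principal-branch Lambert W (illustrative only; the weights π_j are user-supplied):

```python
import numpy as np
from scipy.special import lambertw  # principal branch W_0 of the Lambert W function

def jeffreys_positive_centroid(H, weights):
    """Component-wise Jeffreys positive centroid c^i = a^i / W(a^i e / g^i), where a and g
    are the coordinate-wise arithmetic and geometric weighted means (a sketch)."""
    H = np.asarray(H, float)
    w = np.asarray(weights, float)[:, None]
    a = np.sum(w * H, axis=0)                      # arithmetic weighted mean
    g = np.exp(np.sum(w * np.log(H), axis=0))      # geometric weighted mean
    return a / lambertw(a * np.e / g).real

H = np.array([[1.0, 4.0, 2.0],
              [3.0, 2.0, 5.0]])
print(jeffreys_positive_centroid(H, [0.5, 0.5]))
```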


Page 21: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Jeffreys frequency centroid [12]

Theorem (Jeffreys frequency centroid [12])

Let c denote the Jeffreys frequency centroid and $c' = \tilde c / w_{\tilde c}$ the normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{c'} = \frac{S_1(c', H)}{S_1(c, H)}$ is such that $1 \leq \alpha_{c'} \leq \frac{1}{w_{\tilde c}}$ (with $w_{\tilde c} \leq 1$).

Better upper bounds are given in [12].


Page 22: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Reducing an n-size problem to a 2-size problem

Generalize [17] (symmetrized Kullback–Leibler divergence) and [15] (symmetrized Bregman divergence)

Lemma (Reduction property)

The symmetrized J_α-centroid of a set of n weighted histograms amounts to computing the symmetrized α-centroid of the weighted α-mean and −α-mean:

$$\min_{x} J_\alpha(x, H) = \min_{x} \left( D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x) \right).$$


Page 23: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Frequency symmetrized α-centroid

Minimizer of $\min_{x \in \Delta_d} \sum_{j} w_j\, S_\alpha(x, h_j)$.

Instead of seeking x in the probability simplex, we can optimize over the unconstrained domain $\mathbb{R}^{d-1}$ by using the natural parameter reparameterization [13] of multinomials.

Lemma
The α-divergence between distributions belonging to the same exponential family amounts to computing a divergence on the corresponding natural parameters:

$$A_\alpha(p : q) = \frac{4}{1-\alpha^{2}}\left(1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)}\right),$$

where $J_F^{\beta}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1-\beta) F(\theta_2) - F(\beta \theta_1 + (1-\beta) \theta_2)$ is a skewed Jensen divergence defined for the log-normaliser F of the family.
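As an illustration of the Lemma, and under the assumption (made here for the example) that a frequency histogram p is treated as a categorical distribution with natural parameters θ_i = log(p_i/p_d), i = 1, ..., d-1, and log-normalizer F(θ) = log(1 + Σ_i e^{θ_i}), the natural-parameter formula can be checked numerically against the direct α-divergence, reusing alpha_divergence() from the earlier sketch:

```python
import numpy as np

def F(theta):
    """Log-normalizer of the categorical family in its natural parameterization (assumed)."""
    return np.log1p(np.sum(np.exp(theta)))

def skew_jensen(theta1, theta2, beta):
    """Skewed Jensen divergence J_F^beta(theta1 : theta2) for the log-normalizer F."""
    return beta * F(theta1) + (1 - beta) * F(theta2) - F(beta * theta1 + (1 - beta) * theta2)

def A_alpha(p, q, alpha):
    """alpha-divergence computed through the natural-parameter formula of the Lemma."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    theta_p, theta_q = np.log(p[:-1] / p[-1]), np.log(q[:-1] / q[-1])
    return 4 / (1 - alpha**2) * (1 - np.exp(-skew_jensen(theta_p, theta_q, (1 - alpha) / 2)))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
print(A_alpha(p, q, 0.5), alpha_divergence(p, q, 0.5))  # agree for frequency histograms
```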


Page 24: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Implementation (in processing.org)

[Figure] Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins, with k = 8, α = 0.7 and λ = 1/2.

Page 25: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Soft Mixed α-Clustering

Learn both α and λ (α-EM [9])

Input: Histogram set H with |H| = m, integer k > 0, real λ ← λ_init ∈ [0, 1], real α ∈ ℝ;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
  // Expectation
  for i = 1, 2, ..., m do
    for j = 1, 2, ..., k do
      $p(j \mid h_i) = \frac{\pi_j \exp\left(-M_{\lambda,\alpha}(l_j : h_i : r_j)\right)}{\sum_{j'} \pi_{j'} \exp\left(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'})\right)}$;
  // Maximization
  for j = 1, 2, ..., k do
    $\pi_j \leftarrow \frac{1}{m} \sum_i p(j \mid h_i)$;
    $l_j \leftarrow \left(\frac{1}{\sum_i p(j \mid h_i)} \sum_i p(j \mid h_i)\, h_i^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
    $r_j \leftarrow \left(\frac{1}{\sum_i p(j \mid h_i)} \sum_i p(j \mid h_i)\, h_i^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
  // Alpha - Lambda updates
  $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, \frac{\partial}{\partial \alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
  if λ_init ∉ {0, 1} then
    $\lambda \leftarrow \lambda - \eta_2 \left( \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j \mid h_i)\, D_\alpha(h_i : r_j) \right)$;
  // for some small η1, η2; ensure that λ ∈ [0, 1].
until convergence;
Output: Soft clustering of H according to the k densities p(j | ·) following C;
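A simplified NumPy sketch of the soft clustering (E- and M-steps only; α and λ are kept fixed here, so the gradient updates shown in the pseudocode are omitted), reusing mixed_alpha_seeding(), mixed_alpha_divergence() and alpha_mean() from the earlier sketches:

```python
import numpy as np

def soft_mixed_alpha_clustering(H, k, lam, alpha, iters=30):
    """Simplified sketch of soft mixed alpha-clustering with fixed alpha and lambda."""
    H = np.asarray(H, float)
    m = len(H)
    w0 = np.full(m, 1.0 / m)
    seeds = mixed_alpha_seeding(H, w0, k, lam, alpha)
    L, R = seeds.copy(), seeds.copy()
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities p(j | h_i) proportional to pi_j exp(-M_{lambda,alpha}(l_j : h_i : r_j))
        M = np.array([[mixed_alpha_divergence(L[j], h, R[j], lam, alpha) for j in range(k)]
                      for h in H])
        resp = pi * np.exp(-M)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mixture weights and dual-sided weighted alpha-mean centroids
        pi = resp.mean(axis=0)
        for j in range(k):
            L[j] = alpha_mean(H, resp[:, j], -alpha)   # exponent (1+alpha)/2, as in the slide
            R[j] = alpha_mean(H, resp[:, j], alpha)    # exponent (1-alpha)/2
    return resp, pi, L, R
```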

Page 26: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Summary

1. Mixed divergences, mixed divergence k-means++ seeding, coupled k-means seeding
2. Sided left or right α-centroid k-means
3. Coupled k-means with respect to mixed α-divergences relying on dual α-centroids
4. Symmetrized Jeffreys-type α-centroid (variational) k-means

All technical proofs and details in: Entropy 16(6): 3273-3301 (2014)


Page 27: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references I

[1] Syed Mumtaz Ali and Samuel David Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

[2] Shun-ichi Amari. Integration of stochastic models by minimizing α-divergence. Neural Computation, 19(10):2780–2796, 2007.

[3] Shun-ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.

[4] D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry. Real values of the W-function. ACM Trans. Math. Softw., 21(2):161–171, June 1995.

[5] Adam Besenyei. On the invariance equation for Heinz means. Mathematical Inequalities & Applications, 15(4):973–979, 2012.

[6] Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134–170, 2011.

[7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.


Page 28: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references II

[8] Erhard Heinz. Beiträge zur Störungstheorie der Spektralzerlegung. Mathematische Annalen, 123:415–438, 1951.

[9] Yasuo Matsuyama. The alpha-EM algorithm: surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Transactions on Information Theory, 49(3):692–706, 2003.

[10] Tetsuzo Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of Japan, 18(3), March 1963.

[11] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. CoRR, abs/1009.4004, 2010.

[12] Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters (SPL), 20(7), July 2013.

[13] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, 2009.

[14] Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. In International Symposium on Voronoi Diagrams (ISVD), pages 71–78, DTU Lyngby, Denmark, June 2009. IEEE.


Page 29: On Clustering Histograms with k-Means by Using Mixed α-Divergences

Bibliographic references III

[15] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

[16] Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 154–169, Berlin, Heidelberg, 2008. Springer-Verlag.

[17] Raymond N. J. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3):96–99, March 2002.

[18] Huaiyu Zhu and Richard Rohwer. Measurements of generalisation based on information geometry. In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394–398. Springer US, 1997.
