Non-informative reparametrisation for location-scale mixtures



Kaniav Kamary¹, Kate Lee², Christian P. Robert¹,³
¹ CEREMADE, Université Paris–Dauphine, Paris  ² Auckland University of Technology, New Zealand  ³ Dept. of Statistics, University of Warwick, and CREST, Paris

Introduction

Traditional definition of mixture density:

$$f(x \mid \theta, p) = \sum_{i=1}^{k} p_i \, f(x \mid \theta_i), \qquad \sum_{i=1}^{k} p_i = 1, \qquad (1)$$

which gives a separate meaning to each component. For the location-scale Gaussian mixture:

$$f(x \mid \theta, p) = \sum_{i=1}^{k} p_i \, \mathcal{N}(x \mid \mu_i, \sigma_i).$$

Mengersen and Robert (1996) [2] established that an improper prior on $(\mu_1, \sigma_1)$ leads to a proper posterior under the reparametrisation $\mu_i = \mu_{i-1} + \sigma_{i-1}\delta_i$ and $\sigma_i = \tau_i \sigma_{i-1}$, $\tau_i < 1$. Diebolt and Robert (1994) [3] discussed the alternative approach of imposing proper posteriors on improper priors by banning almost empty components from the likelihood function.
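As an illustrative sketch (Python, not part of the poster; the function name is hypothetical), the component parameters of this reparametrisation can be generated recursively from the first component and the local perturbations $(\delta_i, \tau_i)$:

```python
import numpy as np

def mengersen_robert_components(mu1, sigma1, delta, tau):
    """mu_i = mu_{i-1} + sigma_{i-1}*delta_i and sigma_i = tau_i*sigma_{i-1}, tau_i < 1."""
    mus, sigmas = [mu1], [sigma1]
    for d, t in zip(delta, tau):
        mus.append(mus[-1] + sigmas[-1] * d)
        sigmas.append(t * sigmas[-1])
    return np.array(mus), np.array(sigmas)
```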

Setting the global mean and variance, $\mathbb{E}_{\theta,p}(X) = \mu$ and $\operatorname{var}_{\theta,p}(X) = \sigma^2$, imposes natural constraints on the component parameters:

$$\mu = \sum_{i=1}^{k} p_i \mu_i\,; \qquad \sigma^{2} = \sum_{i=1}^{k} p_i \mu_i^{2} + \sum_{i=1}^{k} p_i \sigma_i^{2} - \mu^{2}\,; \qquad \mathbb{E}_{\theta,p}(X^{2}) = \sum_{i=1}^{k} p_i \mu_i^{2} + \sum_{i=1}^{k} p_i \sigma_i^{2},$$

which implies that $(\mu_1, \dots, \mu_k, \sigma_1, \dots, \sigma_k)$ belongs to a specific ellipse.
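A quick numerical check of these moment identities (a Python sketch, not from the poster; all names are hypothetical): pick arbitrary component parameters and compare the formula-based mean and variance with Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.5, 0.2])          # component weights
mu_c = np.array([-2.0, 0.5, 4.0])      # component means mu_i
sd_c = np.array([1.0, 0.7, 2.0])       # component standard deviations sigma_i

# moment formulas above
mu = np.sum(p * mu_c)
sigma2 = np.sum(p * mu_c**2) + np.sum(p * sd_c**2) - mu**2

# Monte Carlo check: draw from the mixture and compare
z = rng.choice(len(p), size=200_000, p=p)
x = rng.normal(mu_c[z], sd_c[z])
print(mu, x.mean())        # should agree closely
print(sigma2, x.var())     # should agree closely
```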

New reparametrisation: modify the parameterization of the location-scale mixture in terms of the global mean and variance of the mixture distribution.

Writing

$$f(x \mid \theta, p) = \sum_{i=1}^{k} p_i \, f\!\left(x \,\Big|\, \mu + \sigma\gamma_i/\sqrt{p_i},\; \sigma\eta_i/\sqrt{p_i}\right), \qquad (2)$$

leads to a parameter space in which $(p_1, \dots, p_k, \gamma_1, \dots, \gamma_k, \eta_1, \dots, \eta_k)$ is constrained by

$$p_i, \eta_i \ge 0 \ (1 \le i \le k), \qquad \sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} \sqrt{p_i}\,\gamma_i = 0, \qquad \sum_{i=1}^{k} \{\eta_i^2 + \gamma_i^2\} = 1,$$

which implies, for all $i$, $0 \le p_i \le 1$, $-1 \le \gamma_i \le 1$, $0 \le \eta_i \le 1$. These constraints mean that $(\gamma_1, \dots, \gamma_k, \eta_1, \dots, \eta_k)$ belongs to the unit hypersphere of $\mathbb{R}^{2k}$ centred at the origin, intersected with a hyperplane of this space through the origin, that is, a unit sphere within that hyperplane.
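A minimal Python sketch of evaluating the reparametrised density (2), with the above constraints checked explicitly (illustrative only, independent of the Ultimixt implementation; all names are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, mu, sigma, p, gamma, eta):
    """Density of the reparametrised Gaussian mixture (2): component i has
    mean mu + sigma*gamma_i/sqrt(p_i) and sd sigma*eta_i/sqrt(p_i)."""
    p, gamma, eta = map(np.asarray, (p, gamma, eta))
    assert np.isclose(p.sum(), 1.0)                        # weights sum to one
    assert np.isclose(np.sum(np.sqrt(p) * gamma), 0.0)     # global mean constraint
    assert np.isclose(np.sum(gamma**2 + eta**2), 1.0)      # global variance constraint
    means = mu + sigma * gamma / np.sqrt(p)
    sds = sigma * eta / np.sqrt(p)
    x = np.atleast_1d(x)[:, None]
    return np.sum(p * norm.pdf(x, loc=means, scale=sds), axis=1)
```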

Spherical coordinate representation of the γ's: Suppose that $\sum_{i=1}^{k} \gamma_i^2 = \phi^2$. The vector $\gamma$ belongs both to the hypersphere of radius $\phi$ and to the hyperplane orthogonal to the vector $(\sqrt{p_1}, \dots, \sqrt{p_k})$.

The first orthogonal basis vector is

$$\Lambda_{1,j} = \begin{cases} -\sqrt{p_2}, & j = 1 \\ \sqrt{p_1}, & j = 2 \\ 0, & j > 2 \end{cases}$$

and, for $s > 1$, the $s$-th vector is given by

$$\Lambda_{s,j} = \begin{cases} -(p_j p_{s+1})^{1/2} \big/ \left(\sum_{l=1}^{s} p_l\right)^{1/2}, & j \le s \\ \left(\sum_{l=1}^{s} p_l\right)^{1/2}, & j = s+1 \\ 0, & j > s+1 \end{cases}$$

with the $s$-th orthonormal basis vector $F_s = \Lambda_s / \|\Lambda_s\|$.

Figure: Image from Robert Osserman.

$(\gamma_1, \dots, \gamma_k)$ can be written as

$$(\gamma_1, \dots, \gamma_k) = \phi\cos(\varpi_1)F_1 + \phi\sin(\varpi_1)\cos(\varpi_2)F_2 + \dots + \phi\sin(\varpi_1)\cdots\sin(\varpi_{k-2})F_{k-1}$$

with the angles $\varpi_1, \dots, \varpi_{k-3}$ in $[0, \pi]$ and $\varpi_{k-2}$ in $[0, 2\pi]$.
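A numerical sketch of this construction (Python, illustrative only; function names are hypothetical): build the orthonormal basis $F_s$ from the weights and recover $\gamma$ from the radius $\phi$ and the angles $\varpi$.

```python
import numpy as np

def gamma_basis(p):
    """Orthonormal basis F_1,...,F_{k-1} of the hyperplane orthogonal to
    (sqrt(p_1),...,sqrt(p_k)), built from the Lambda_s vectors above."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    F = np.zeros((k - 1, k))
    F[0, 0], F[0, 1] = -np.sqrt(p[1]), np.sqrt(p[0])   # Lambda_1
    for s in range(2, k):                              # Lambda_s, s = 2,...,k-1 (1-based)
        head = np.sqrt(p[:s].sum())
        F[s - 1, :s] = -np.sqrt(p[:s] * p[s]) / head
        F[s - 1, s] = head
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def gamma_from_angles(p, phi, varpi):
    """gamma = phi * sum_s c_s F_s with spherical-coordinate weights c_s."""
    varpi = np.asarray(varpi, dtype=float)
    F = gamma_basis(p)
    k = len(p)
    c = np.ones(k - 1)
    for s in range(k - 1):
        c[s] *= np.prod(np.sin(varpi[:s]))
        if s < k - 2:
            c[s] *= np.cos(varpi[s])
    return phi * c @ F
```

By construction, `np.dot(np.sqrt(p), gamma_from_angles(p, phi, varpi))` is numerically zero and the squared norm of the result equals `phi**2`, so the output satisfies the constraints above.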

Foundational consequences: the restricted parameter space is compact, which helps in selecting improper and non-informative priors over mixtures.

Prior modeling:
Global mean and variance: the posterior distribution associated with the prior $\pi(\mu, \sigma) = 1/\sigma$ is proper when (a) proper distributions are used on the other parameters and (b) there are at least two observations in the sample.
Component weights: $(p_1, \dots, p_k) \sim \mathrm{Dir}(\alpha_0, \dots, \alpha_0)$.
Angles $\varpi$: $\varpi_1, \dots, \varpi_{k-3} \sim \mathcal{U}[0, \pi]$ and $\varpi_{k-2} \sim \mathcal{U}[0, 2\pi]$.
Radius $\phi$ and $\eta_1, \dots, \eta_k$: if $k$ is small, $(\phi^2, \eta_1^2, \dots, \eta_k^2) \sim \mathrm{Dir}(\alpha, \dots, \alpha)$, while for $k$ larger than 3, $(\eta_1, \dots, \eta_k)$ is written through spherical coordinates

$$\eta_i = \begin{cases} \sqrt{1 - \phi^2}\,\cos(\xi_i), & i = 1 \\ \sqrt{1 - \phi^2}\,\prod_{j=1}^{i-1}\sin(\xi_j)\,\cos(\xi_i), & 1 < i < k \\ \sqrt{1 - \phi^2}\,\prod_{j=1}^{i-1}\sin(\xi_j), & i = k \end{cases}$$

Unlike the $\varpi$'s, the support of all angles $\xi_1, \dots, \xi_{k-1}$ is limited to $[0, \pi/2]$, due to the positivity requirement on the $\eta_i$'s:

$$(\xi_1, \dots, \xi_{k-1}) \sim \mathcal{U}\!\left([0, \pi/2]^{k-1}\right).$$
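The corresponding computation of the $\eta_i$'s from $\phi$ and the $\xi$ angles, as a minimal Python sketch (illustrative only; the function name is hypothetical):

```python
import numpy as np

def eta_from_angles(phi, xi):
    """eta_1,...,eta_k from the radius phi and angles xi_1,...,xi_{k-1} in [0, pi/2]."""
    xi = np.asarray(xi, dtype=float)
    k = len(xi) + 1
    r = np.sqrt(1.0 - phi**2)
    eta = np.empty(k)
    for i in range(k):
        eta[i] = r * np.prod(np.sin(xi[:i]))
        if i < k - 1:
            eta[i] *= np.cos(xi[i])
    return eta   # satisfies sum(eta**2) == 1 - phi**2 up to rounding
```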

MCMC algorithm

Metropolis-within-Gibbs algorithm for the reparameterised mixture model:

1 Generate initial values $(\mu^{(0)}, \sigma^{(0)}, p^{(0)}, \phi^{(0)}, \xi_1^{(0)}, \dots, \xi_{k-1}^{(0)}, \varpi_1^{(0)}, \dots, \varpi_{k-2}^{(0)})$.
2 For $t = 1, \dots, T$, the update of $(\mu^{(t)}, \sigma^{(t)}, p^{(t)}, \phi^{(t)}, \xi_1^{(t)}, \dots, \xi_{k-1}^{(t)}, \varpi_1^{(t)}, \dots, \varpi_{k-2}^{(t)})$ proceeds as follows:
  2.1 Generate a proposal $\mu' \sim \mathcal{N}(\mu^{(t-1)}, \varepsilon_\mu)$ and update $\mu^{(t)}$ against $\pi(\cdot \mid x, \sigma^{(t-1)}, p^{(t-1)}, \phi^{(t-1)}, \xi^{(t-1)}, \varpi^{(t-1)})$.
  2.2 Generate a proposal $\log(\sigma)' \sim \mathcal{N}(\log(\sigma^{(t-1)}), \varepsilon_\sigma)$ and update $\sigma^{(t)}$ against $\pi(\cdot \mid x, \mu^{(t)}, p^{(t-1)}, \phi^{(t-1)}, \xi^{(t-1)}, \varpi^{(t-1)})$.
  2.3 Generate a proposal $(\phi^2)' \sim \mathrm{Beta}\big((\phi^2)^{(t)}\varepsilon_\phi + 1, (1 - (\phi^2)^{(t)})\varepsilon_\phi + 1\big)$ and update $\phi^{(t)}$ against $\pi(\cdot \mid x, \mu^{(t)}, \sigma^{(t)}, p^{(t-1)}, \xi^{(t)}, \varpi^{(t)})$.
  2.4 Generate a proposal $p' \sim \mathrm{Dir}(p_1^{(t-1)}\varepsilon_p + 1, \dots, p_k^{(t-1)}\varepsilon_p + 1)$ and update $p^{(t)}$ against $\pi(\cdot \mid x, \mu^{(t)}, \sigma^{(t)}, \phi^{(t)}, \xi^{(t)}, \varpi^{(t)})$.
  2.5 Generate proposals $\xi_i' \sim \mathcal{U}[\xi_i^{(t)} - \varepsilon_\xi, \xi_i^{(t)} + \varepsilon_\xi]$, $i = 1, \dots, k-1$, and update $(\xi_1^{(t)}, \dots, \xi_{k-1}^{(t)})$ against $\pi(\cdot \mid x, \mu^{(t)}, \sigma^{(t)}, p^{(t)}, \phi^{(t)}, \varpi^{(t)})$.
  2.6 Generate proposals $\varpi_i' \sim \mathcal{U}[\varpi_i^{(t)} - \varepsilon_\varpi, \varpi_i^{(t)} + \varepsilon_\varpi]$, $i = 1, \dots, k-2$, and update $(\varpi_1^{(t)}, \dots, \varpi_{k-2}^{(t)})$ against $\pi(\cdot \mid x, \mu^{(t)}, \sigma^{(t)}, p^{(t)}, \phi^{(t)}, \xi^{(t)})$.

where $p^{(t)} = (p_1^{(t)}, \dots, p_k^{(t)})$, $x = (x_1, \dots, x_n)$, $\xi^{(t)} = (\xi_1^{(t)}, \dots, \xi_{k-1}^{(t)})$ and $\varpi^{(t)} = (\varpi_1^{(t)}, \dots, \varpi_{k-2}^{(t)})$.
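As a rough illustration of one such step (Python; not the Ultimixt code, all names hypothetical, reusing the `mixture_density` sketch above), a random-walk Metropolis update of $\mu$ under the prior $\pi(\mu, \sigma) = 1/\sigma$ might look like:

```python
import numpy as np

def log_posterior(x, mu, sigma, p, gamma, eta):
    """Log-likelihood of the reparametrised mixture (2) plus the log prior -log(sigma);
    the proper priors on p, phi, xi, varpi are constant in a mu-only update."""
    return np.sum(np.log(mixture_density(x, mu, sigma, p, gamma, eta))) - np.log(sigma)

def update_mu(x, mu, sigma, p, gamma, eta, eps_mu, rng=None):
    """Step 2.1: Gaussian random-walk proposal on mu, accepted with the usual MH ratio."""
    rng = rng or np.random.default_rng()
    prop = rng.normal(mu, eps_mu)
    log_ratio = (log_posterior(x, prop, sigma, p, gamma, eta)
                 - log_posterior(x, mu, sigma, p, gamma, eta))
    return prop if np.log(rng.uniform()) < log_ratio else mu
```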

Ultimixt package

▸ Implementation of the Metropolis-within-Gibbs algorithm for the reparametrized mixture distribution;
▸ Calibrate the scales of the various proposals by aiming at an average acceptance rate of either 0.44 or 0.234, depending on the dimension of the simulated parameter;
▸ Accurately estimate the component parameters.

Point estimator of the component parameters in the case of label switching:
▸ K-means clustering algorithm;
▸ Reordering labels towards producing the shortest distance between the current posterior sample and the (or a) maximum posterior probability (MAP) estimate [1]; a minimal sketch of this reordering follows.
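A minimal sketch of the reordering step (Python; illustrative only, not the Ultimixt implementation, names hypothetical): for each posterior draw of the component means, pick the label permutation closest to a reference ordering such as the MAP draw.

```python
import numpy as np
from itertools import permutations

def relabel_draws(mu_draws, reference):
    """mu_draws: (T, k) posterior draws of component means; reference: (k,) e.g. a MAP draw.
    Returns the draws with, for each iteration, the label permutation closest to the reference."""
    k = mu_draws.shape[1]
    perms = list(permutations(range(k)))
    out = np.empty_like(mu_draws)
    for t, draw in enumerate(mu_draws):
        best = min(perms, key=lambda s: np.sum((draw[list(s)] - reference) ** 2))
        out[t] = draw[list(best)]
    return out
```

For the moderate values of $k$ typical of mixture models, the exhaustive search over $k!$ permutations is cheap.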

Mixture of two normal distributions

A sample of size 50 is simulated from $0.65\,\mathcal{N}(-8, 2) + 0.35\,\mathcal{N}(-0.5, 1)$.

Figure: Empirical densities of 10 sequences of the Metropolis-within-Gibbs algorithm run in parallel for $2 \times 10^5$ iterations.

▸ Outcomes of 10 parallel chains, started randomly from different starting values, are indistinguishable;
▸ Chains are well mixed;
▸ Sampler output covers the entire sample space;
▸ Estimated densities converge to a neighborhood of the true values;
▸ Estimated mixture density is remarkably smooth.

Mixture of three normal distributions

A sample of size 50 is simulated from the model $0.27\,\mathcal{N}(-4.5, 1) + 0.4\,\mathcal{N}(10, 1) + 0.33\,\mathcal{N}(3, 1)$.

Figure: Sequences of $\mu_i$, $\sigma_i$ and $p_i$ and estimated mixture density; the mixture density estimate is based on $10^4$ MCMC iterations.

Overfitting case

Extreme-valued posterior samples appear for an overfitted model.

Galaxy dataset: Point estimator of the parameters of a mixture of (Left) 6 components; (Right) 4 components.

References

[1] S. Frühwirth-Schnatter (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. J. American Statist. Assoc., 96, 194–209.

[2] K. Mengersen and C. Robert (1996). Testing for mixtures: A Bayesian entropic approach (with discussion). In Bayesian Statistics 5 (J. Berger, J. Bernardo, A. Dawid, D. Lindley and A. Smith, eds.). Oxford University Press, Oxford, 255–276.

[3] J. Diebolt and C. Robert (1994). Estimation of finite mixture distributions by Bayesian sampling. J. Royal Statist. Society Series B, 56, 363–375.

[email protected]