
Towards Maximum Likelihood: Learning Undirected Graphical Models using Persistent Sequential Monte Carlo

Hanchen Xiong

Institute of Computer Science, University of Innsbruck, Austria

November 26, 2014

Background

Undirected Graphical Models:

- Markov Random Fields (or Markov Networks)
- Conditional Random Fields

Modeling:

$p(x;\theta) = \dfrac{\exp(-E(x;\theta))}{Z(\theta)} \qquad (1)$

Energy:

$E(x;\theta) = -\theta^\top \phi(x) \qquad (2)$

with random variables $x = [x_1, x_2, \ldots, x_D] \in \mathcal{X}^D$, where $x_d$ can take $N_d$ discrete values, $\phi(x)$ is a $K$-dimensional vector of sufficient statistics, and parameter $\theta \in \mathbb{R}^K$.
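To make the model concrete, here is a minimal sketch (illustrative, not from the slides) that evaluates $p(x;\theta)$ for a tiny binary UGM, with unary and pairwise sufficient statistics and $Z(\theta)$ computed by brute-force enumeration:

```python
import itertools
import numpy as np

def phi(x):
    """Sufficient statistics of a binary vector x in {0,1}^D:
    unary terms plus all pairwise products (a small Boltzmann machine)."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([x, pairs])          # K = D + D(D-1)/2 features

def energy(x, theta):
    return -theta @ phi(x)                     # E(x; theta) = -theta^T phi(x), eq. (2)

def log_Z(theta, D):
    # Exact log partition function by enumerating all 2^D states (small D only).
    return np.logaddexp.reduce([-energy(x, theta)
                                for x in itertools.product([0, 1], repeat=D)])

D = 4
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=D + D * (D - 1) // 2)
x = np.array([1, 0, 1, 1])
print(-energy(x, theta) - log_Z(theta, D))     # log p(x; theta), eq. (1)
```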

Maximum Log-Likelihood Training

UGMs' log-likelihood functions are concave w.r.t. $\theta$ [Koller and Friedman, 2009].

Given training data $\mathcal{D} = \{x^{(m)}\}_{m=1}^{M}$, the derivative of the average log-likelihood $\mathcal{L}(\theta|\mathcal{D}) = \frac{1}{M}\sum_{m=1}^{M} \log p(x^{(m)};\theta)$ is

$\dfrac{\partial \mathcal{L}(\theta|\mathcal{D})}{\partial \theta} = \underbrace{\mathbb{E}_{\mathcal{D}}[\phi(x)]}_{\psi^+} - \underbrace{\mathbb{E}_{\theta}[\phi(x)]}_{\psi^-} = \dfrac{1}{M}\sum_{m=1}^{M}\phi(x^{(m)}) - \sum_{x' \in \mathcal{X}^D} p(x';\theta)\,\phi(x') \qquad (3)$

Interpretation: learning iteratively pulls down the energy of the data space occupied by $\mathcal{D}$ (positive phase) and raises the energy over the whole space $\mathcal{X}^D$ (negative phase), until the two reach a balance ($\psi^+ = \psi^-$).
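As a worked illustration of eq. (3), the sketch below (reusing phi, energy and the imports from the previous block) computes the exact gradient of the average log-likelihood by enumerating $\mathcal{X}^D$; this is feasible only for small $D$, which is exactly why the approximations on the following slides are needed:

```python
def avg_loglik_grad(data, theta, D):
    # psi+ : empirical expectation of the sufficient statistics, E_D[phi(x)]
    pos = np.mean([phi(x) for x in data], axis=0)
    # psi- : exact model expectation E_theta[phi(x)] over all of X^D
    states = list(itertools.product([0, 1], repeat=D))
    log_p = np.array([-energy(x, theta) for x in states])
    log_p -= np.logaddexp.reduce(log_p)        # normalize: subtract log Z(theta)
    neg = np.exp(log_p) @ np.array([phi(x) for x in states])
    return pos - neg                           # eq. (3)
```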


Existing Learning Methods

Approximate the second term ($\psi^-$) of the gradient of $\mathcal{L}$:

$\dfrac{\partial \mathcal{L}(\theta|\mathcal{D})}{\partial \theta} = \underbrace{\mathbb{E}_{\mathcal{D}}[\phi(x)]}_{\psi^+} - \underbrace{\mathbb{E}_{\theta}[\phi(x)]}_{\psi^-}$

- Markov Chain Monte Carlo Maximum Likelihood (MCMCML) [Geyer, 1991]
- Contrastive Divergence (CD) [Hinton, 2002]
- Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
- Tempered Transition (TT) [Salakhutdinov, 2010]
- Parallel Tempering (PT) [Desjardins et al., 2010]

CD, PCD, TT and PT can all be summarized as instances of Robbins-Monro's stochastic approximation procedure (SAP) [Robbins and Monro, 1951].


Robbins-Monro’s SAP

1: Training data set $\mathcal{D} = \{x^{(1)}, \ldots, x^{(M)}\}$. Randomly initialize model parameters $\theta_0$ and $N$ particles $\{s_{0,1}, \ldots, s_{0,N}\}$.
2: for t = 0 : T do   // T iterations
3:   for n = 1 : N do   // go through all N particles
4:     Sample $s_{t+1,n}$ from $s_{t,n}$ using transition operator $H_{\theta_t}$;
5:   end for
6:   Update: $\theta_{t+1} = \theta_t + \eta \left[ \frac{1}{M}\sum_{m=1}^{M}\phi(x^{(m)}) - \frac{1}{N}\sum_{n=1}^{N}\phi(s_{t+1,n}) \right]$
7:   Decrease $\eta$.
8: end for

When using a Gibbs sampler as $H_{\theta_t}$, SAP becomes PCD; similarly, the transitions of TT and PT can be substituted as well.
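A minimal sketch of this SAP loop for a small, fully visible Boltzmann machine, with one Gibbs sweep as the transition operator $H_{\theta_t}$ (i.e. PCD-1). It reuses phi, energy and rng from the first sketch; all other names are illustrative:

```python
def gibbs_sweep(s, theta):
    """One full Gibbs sweep over the D variables of a binary state s."""
    s = s.copy()
    for d in range(len(s)):
        s[d] = 1; e1 = energy(s, theta)
        s[d] = 0; e0 = energy(s, theta)
        p1 = 1.0 / (1.0 + np.exp(e1 - e0))     # p(x_d = 1 | x_{-d})
        s[d] = rng.random() < p1
    return s

def sap(data, D, T=200, N=20, eta0=0.1):
    theta = np.zeros(D + D * (D - 1) // 2)
    particles = rng.integers(0, 2, size=(N, D))           # persistent particles
    data_stats = np.mean([phi(x) for x in data], axis=0)  # psi+ is fixed
    for t in range(T):
        particles = np.array([gibbs_sweep(s, theta) for s in particles])
        model_stats = np.mean([phi(s) for s in particles], axis=0)
        theta += (eta0 / (1.0 + t)) * (data_stats - model_stats)  # decreasing eta
    return theta
```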


MCMCML

In MCMCML, a proposal distribution $p(x;\theta_0)$ is set up in the same form as the UGM, and we have

$\dfrac{Z(\theta)}{Z(\theta_0)} = \dfrac{\sum_x \exp(\theta^\top\phi(x))}{\sum_x \exp(\theta_0^\top\phi(x))} = \sum_x \dfrac{\exp(\theta^\top\phi(x))}{\exp(\theta_0^\top\phi(x))} \times \dfrac{\exp(\theta_0^\top\phi(x))}{\sum_x \exp(\theta_0^\top\phi(x))} = \sum_x \exp\!\left((\theta-\theta_0)^\top\phi(x)\right) p(x;\theta_0) \approx \dfrac{1}{S}\sum_{s=1}^{S} w^{(s)} \qquad (4)$

where

$w^{(s)} = \exp\!\left((\theta-\theta_0)^\top\phi(\bar{x}^{(s)})\right), \qquad (5)$

with $\{\bar{x}^{(s)}\}_{s=1}^{S}$ sampled from $p(x;\theta_0)$. MCMCML is thus an importance sampling approximation of the gradient.
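The core of eqs. (4)-(5) in code: a sketch (assuming the phi helper from the first block) that estimates $Z(\theta)/Z(\theta_0)$ from samples of the proposal $p(x;\theta_0)$, working in the log domain for numerical stability:

```python
def z_ratio_estimate(samples, theta, theta0):
    # log w^(s) = (theta - theta0)^T phi(x_bar^(s))      -- eq. (5)
    log_w = np.array([(theta - theta0) @ phi(x) for x in samples])
    # Z(theta)/Z(theta0) ~= (1/S) sum_s w^(s)            -- eq. (4)
    return np.exp(np.logaddexp.reduce(log_w) - np.log(len(samples)))
```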


MCMCML

1: $t \leftarrow 0$, initialize the proposal distribution $p(x;\theta_0)$
2: Sample $\{\bar{x}^{(s)}\}$ from $p(x;\theta_0)$
3: while not stop criterion do
4:   Calculate $w^{(s)}$ using (5)
5:   Calculate the gradient $\partial\tilde{\mathcal{L}}(\theta_t|\mathcal{D})/\partial\theta_t$ using the importance sampling approximation
6:   Update $\theta_{t+1} = \theta_t + \eta\, \partial\tilde{\mathcal{L}}(\theta_t|\mathcal{D})/\partial\theta_t$
7:   $t \leftarrow t + 1$
8: end while
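Putting the steps together, a sketch of the MCMCML loop under the same assumptions (a fixed set of stored proposal samples stands in for step 2; the negative phase of the gradient uses self-normalized importance weights):

```python
def mcmcml(data, samples, theta0, T=100, eta=0.05):
    theta = theta0.copy()
    feats = np.array([phi(x) for x in samples])           # phi of proposal samples
    data_stats = np.mean([phi(x) for x in data], axis=0)
    for _ in range(T):
        log_w = feats @ (theta - theta0)                  # eq. (5), log domain
        w = np.exp(log_w - np.logaddexp.reduce(log_w))    # self-normalized weights
        model_stats = w @ feats                           # E_theta[phi] by importance sampling
        theta += eta * (data_stats - model_stats)
    return theta
```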

MCMCML

- MCMCML's performance highly depends on the initial proposal distribution;
- at time t, it is helpful to update the proposal distribution to $p(x;\theta_{t-1})$;
- this is analogous to sequential importance sampling with resampling at every iteration, except that the sequence of distributions is constructed by learning;
- this also resembles the SAP learning schemes above;
- a similar connection between PCD and sequential Monte Carlo was found in [Asuncion et al., 2010].


SAP Learning as Sequential Monte Carlo

1: Initialize $p(x;\theta_0)$, $t \leftarrow 0$
2: Sample particles $\{\bar{x}_0^{(s)}\}_{s=1}^{S} \sim p(x;\theta_0)$
3: while not stop criterion do
4:   Assign $w^{(s)} \leftarrow 1/S, \ \forall s$   // importance reweighting
5:   // resampling is skipped because it has no effect with uniform weights
6:   switch (algorithmic choice)   // MCMC transition
7:   case CD:
8:     generate a brand-new particle set $\{\bar{x}_{t+1}^{(s)}\}_{s=1}^{S}$ with Gibbs sampling started from $\mathcal{D}$
9:   case PCD:
10:    evolve the particle set $\{\bar{x}_t^{(s)}\}_{s=1}^{S}$ to $\{\bar{x}_{t+1}^{(s)}\}_{s=1}^{S}$ with one step of Gibbs sampling
11:  case Tempered Transition:
12:    evolve the particle set $\{\bar{x}_t^{(s)}\}_{s=1}^{S}$ to $\{\bar{x}_{t+1}^{(s)}\}_{s=1}^{S}$ with tempered transitions
13:  case Parallel Tempering:
14:    evolve the particle set $\{\bar{x}_t^{(s)}\}_{s=1}^{S}$ to $\{\bar{x}_{t+1}^{(s)}\}_{s=1}^{S}$ with parallel tempering
15:  end switch
16:  Compute the gradient $\Delta\theta_t$ according to (3);
17:  $\theta_{t+1} = \theta_t + \eta\Delta\theta_t$, $t \leftarrow t + 1$;
18:  reduce $\eta$;
19: end while

SAP Learning as Sequential Monte Carlo

A sequential Monte Carlo (SMC) algorithm can only work on the condition that the sequential, intermediate distributions are well constructed: two successive ones should be close.

The gap between successive distributions in SAP scales with $\eta \times D$:

1. the learning rate $\eta$;
2. the dimensionality of $x$: $D$.


Persistent Sequential Monte Carlo

Every sequential, intermediate distribution is constructed by learning, so learning and sampling are interestingly entangled.

Applying the SMC philosophy further in sampling yields Persistent SMC (PSMC):

[Diagram: at step t, half of the particles $\{\bar{x}^{(s)}\}_{s=1}^{S/2}$ are drawn from the uniform distribution $\mathcal{U}_D$ and half $\{\bar{x}^{t-1,(s)}\}_{s=1}^{S/2}$ are the persistent particles from $p_H(x;\theta_{t-1})$; they are propagated through the sub-sequence $p_0(x;\theta_t), p_1(x;\theta_t), \ldots, p_H(x;\theta_t)$, yielding $\{\bar{x}^{t,(s)}\} \sim p_H(x;\theta_t)$, which seeds the sub-sequence for $\theta_{t+1}$ in the same way.]

$\mathcal{U}_D$ is a uniform distribution on $x$, and the intermediate sequential distributions are

$p_h(x;\theta_{t+1}) \propto p(x;\theta_t)^{1-\beta_h}\, p(x;\theta_{t+1})^{\beta_h}$

where $0 = \beta_0 \leq \beta_1 \leq \cdots \leq \beta_H = 1$.
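For exponential-family UGMs this geometric bridge has a convenient closed form: interpolating the two densities amounts to interpolating their natural parameters. A one-function sketch (assuming phi from the first block):

```python
def log_p_h_unnorm(x, theta_t, theta_t1, beta_h):
    # p_h(x) propto p(x; theta_t)^(1-beta_h) * p(x; theta_{t+1})^(beta_h)
    theta_h = (1.0 - beta_h) * theta_t + beta_h * theta_t1
    return theta_h @ phi(x)                    # = -E(x; theta_h), up to log Z
```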

Number of Sub-Sequential Distributions

One issue arising in PSMC is the number of $\beta_h$'s, i.e. $H$.

It can be handled by exploiting the degeneracy of the particle set: the importance weight for each particle is

$w^{(s)} = \dfrac{p_h(\bar{x}^{(s)};\theta_{t+1})}{p_{h-1}(\bar{x}^{(s)};\theta_{t+1})} = \exp\!\left(E(\bar{x}^{(s)};\theta_t)\right)^{\Delta\beta_h} \exp\!\left(E(\bar{x}^{(s)};\theta_{t+1})\right)^{-\Delta\beta_h} \qquad (6)$

where $\Delta\beta_h$ is the step length from $\beta_{h-1}$ to $\beta_h$, i.e. $\Delta\beta_h = \beta_h - \beta_{h-1}$.

The effective sample size (ESS) of a particle set is [Kong et al., 1994]

$\sigma = \dfrac{\left(\sum_{s=1}^{S} w^{(s)}\right)^{2}}{S \sum_{s=1}^{S} \left(w^{(s)}\right)^{2}} \in \left[\dfrac{1}{S},\, 1\right] \qquad (7)$

The ESS $\sigma$ is actually a function of $\Delta\beta_h$: set a threshold on $\sigma$ and, at every $h$, find the biggest feasible step $\Delta\beta_h$ using bidirectional search.
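A sketch of this adaptive step-size search (assuming the energy helper from the first block). By eq. (6), $\log w^{(s)} = \Delta\beta_h\,(E(\bar{x}^{(s)};\theta_t) - E(\bar{x}^{(s)};\theta_{t+1}))$, and the ESS of eq. (7) shrinks as the step grows, so the largest step keeping the ESS above a threshold can be found by bisection, one way to realize the bidirectional search mentioned above:

```python
def ess(log_w):
    w = np.exp(log_w - np.max(log_w))          # weights are invariant to a shift
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())   # eq. (7), in [1/S, 1]

def find_delta_beta(particles, theta_t, theta_t1, beta_h, thresh=0.9, iters=30):
    d_e = np.array([energy(x, theta_t) - energy(x, theta_t1) for x in particles])
    lo, hi = 0.0, 1.0 - beta_h                 # cannot step past beta = 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess(mid * d_e) >= thresh:           # eq. (6): log w = Delta-beta * d_e
            lo = mid                           # step keeps enough effective samples
        else:
            hi = mid
    return lo
```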


PSMC

Input: a training dataset $\mathcal{D} = \{x^{(m)}\}_{m=1}^{M}$, learning rate $\eta$
1: Initialize $p(x;\theta_0)$, $t \leftarrow 0$
2: Sample particles $\{\bar{x}_0^{(s)}\}_{s=1}^{S} \sim p(x;\theta_0)$
3: while not stop criterion do   // root-SMC
4:   $h \leftarrow 0$, $\beta_0 \leftarrow 0$
5:   while $\beta_h < 1$ do   // sub-SMC
6:     assign importance weights $\{w^{(s)}\}_{s=1}^{S}$ to the particles according to (6)
7:     resample the particles based on $\{w^{(s)}\}_{s=1}^{S}$
8:     find the step length $\Delta\beta_h$
9:     $\beta_{h+1} = \beta_h + \Delta\beta_h$
10:    $h \leftarrow h + 1$
11:  end while
12:  Compute the gradient $\Delta\theta_t$ according to (3)
13:  $\theta_{t+1} = \theta_t + \eta\Delta\theta_t$
14:  $t \leftarrow t + 1$
15: end while
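A condensed, illustrative sketch of the whole PSMC loop, not the authors' implementation: it reuses phi, energy, rng, gibbs_sweep and find_delta_beta from the earlier sketches, refreshes half of the particles from the uniform distribution as in the diagram, moves particles with a Gibbs sweep at each bridge step, and floors $\Delta\beta_h$ to guarantee progress:

```python
def psmc(data, D, T=100, S=50, eta0=0.1):
    theta = np.zeros(D + D * (D - 1) // 2)
    particles = rng.integers(0, 2, size=(S, D))              # persistent particles
    data_stats = np.mean([phi(x) for x in data], axis=0)
    for t in range(T):
        # gradient step (root-SMC) using particles that target p(x; theta_t)
        model_stats = np.mean([phi(s) for s in particles], axis=0)
        theta_new = theta + (eta0 / (1.0 + t)) * (data_stats - model_stats)
        # half of the particle set is refreshed from U_D (slides' diagram)
        particles[: S // 2] = rng.integers(0, 2, size=(S // 2, D))
        beta = 0.0
        while beta < 1.0 - 1e-9:                             # sub-SMC bridge
            d_beta = find_delta_beta(particles, theta, theta_new, beta)
            d_beta = min(max(d_beta, 1e-3), 1.0 - beta)      # floor/clamp the step
            log_w = d_beta * np.array([energy(s, theta) - energy(s, theta_new)
                                       for s in particles])  # eq. (6)
            w = np.exp(log_w - np.max(log_w)); w /= w.sum()
            idx = rng.choice(S, size=S, p=w)                 # multinomial resampling
            beta += d_beta
            theta_mid = (1.0 - beta) * theta + beta * theta_new
            particles = np.array([gibbs_sweep(particles[i], theta_mid) for i in idx])
        theta = theta_new
    return theta
```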

Experiments

Two experiments targeting two challenges:

1. large learning rates;
2. high-dimensional distributions.

First Experiment: Small Learning Rates

A small Boltzmann machine with only 10 variables is used to avoid effects of model complexity; learning rate $\eta_t = \frac{1}{100+t}$.

[Figure: performance of the algorithms with the first learning-rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of $\beta$s in PSMC and standard SMC at each iteration (blue) and their mean values (red).]

First Experiment: Bigger Learning Rates

Learning rate $\eta_t = \frac{1}{20+0.5t}$.

[Figure: performance of the algorithms with the second learning-rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of $\beta$s in PSMC and standard SMC at each iteration (blue) and their mean values (red).]

First Experiment: Large Learning Rates

Learning rate $\eta_t = \frac{1}{10+0.1t}$.

[Figure: performance of the algorithms with the third learning-rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PT, TT, PSMC and SMC; (b) and (c): the number of $\beta$s in PSMC and standard SMC at each iteration (blue) and their mean values (red).]

Second Experiment: Small Dimensionality/Scale

We used the popular restricted Boltzmann machine (RBM) to model handwritten digit images (the MNIST database); 10 hidden units.

[Figure: (a): testing-data and training-data log-likelihood vs. number of epochs for PCD-100, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of temperatures used by PSMC and standard SMC at each epoch.]

Second Experiment: Large Dimensionality/Scale

500 hidden units.

[Figure: (d): testing-data and training-data log-likelihood vs. number of epochs for PCD-200, PCD-1, PT, TT and PSMC; (e) and (f): the number of temperatures used by PSMC and standard SMC at each epoch.]

Conclusion

- A new interpretation of learning undirected graphical models: sequential Monte Carlo (SMC);
- it reveals two challenges in learning: large learning rates and high dimensionality;
- a deeper application of SMC in learning leads to Persistent SMC;
- PSMC yields higher likelihood than state-of-the-art algorithms in the challenging cases.


END


References

Asuncion, A. U., Liu, Q., Ihler, A. T., and Smyth, P. (2010). Particle filtered MCMC-MLE with connections to contrastive divergence. In ICML.

Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In AISTATS.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.

Salakhutdinov, R. (2010). Learning in Markov random fields using tempered transitions. In NIPS.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071.