
Towards Maximum Likelihood: Learning Undirected Graphical Models using Persistent Sequential Monte Carlo

Hanchen Xiong

Institute of Computer Science, University of Innsbruck, Austria

November 26, 2014



Background

Undirected Graphical Models (UGMs):

Markov Random Fields (or Markov Networks)

Conditional Random Fields

Modeling:

$$p(x;\theta) = \frac{\exp(-E(x;\theta))}{Z(\theta)} \quad (1)$$

$$\text{Energy: } E(x;\theta) = -\theta^{\top}\phi(x) \quad (2)$$

with random variables x = [x1, x2, . . . , xD] ∈ X^D, where each xd can take Nd discrete values, φ(x) is a K-dimensional vector of sufficient statistics, and parameter θ ∈ R^K.
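To make Eqs. (1)-(2) concrete, here is a minimal Python sketch. The feature map φ, the parameter values, and the 3-variable set-up are hypothetical illustrations, not taken from the slides; the partition function is computed by brute-force enumeration, which is only feasible for such a tiny model.

```python
import itertools
import numpy as np

# Hypothetical feature map for a 3-variable binary model:
# phi(x) = [x1, x2, x3, x1*x2, x2*x3], i.e. K = 5 sufficient statistics.
def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3], dtype=float)

def energy(x, theta):
    # Eq. (2): E(x; theta) = -theta^T phi(x)
    return -theta @ phi(x)

theta = np.array([0.5, -0.2, 0.1, 1.0, -0.8])     # arbitrary illustrative parameters

# Exact partition function by enumerating all 2^3 states (only feasible for tiny models)
states = list(itertools.product([0, 1], repeat=3))
Z = sum(np.exp(-energy(x, theta)) for x in states)

# Eq. (1): normalized probability of one configuration
x = (1, 0, 1)
print(np.exp(-energy(x, theta)) / Z)
```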



Maximum Log-Likelihood Training

The likelihood function of a UGM is concave w.r.t. θ [Koller and Friedman, 2009].

Given training data D = {x^(m)}_{m=1}^{M}, the derivative of the average log-likelihood L(θ|D) = (1/M) Σ_{m=1}^{M} log p(x^(m); θ) is

$$\frac{\partial \mathcal{L}(\theta\,|\,\mathcal{D})}{\partial \theta} = \underbrace{\mathbb{E}_{\mathcal{D}}[\phi(x)]}_{\psi^{+}} - \underbrace{\mathbb{E}_{\theta}[\phi(x)]}_{\psi^{-}} = \frac{1}{M}\sum_{m=1}^{M}\phi(x^{(m)}) - \sum_{x' \in \mathcal{X}^{D}} p(x';\theta)\,\phi(x') \quad (3)$$

Interpretation: gradient ascent iteratively pulls down the energy of the region of data space occupied by D (positive phase), but raises the energy over the whole data space X^D (negative phase), until a balance is reached (ψ+ = ψ−).
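For a toy model both phases of Eq. (3) can be computed exactly by enumeration; the sketch below (same hypothetical feature map as above, toy training data) illustrates the positive phase minus the negative phase. In realistic models the negative phase is intractable, which is what the sampling methods on the following slides approximate.

```python
import itertools
import numpy as np

# Same hypothetical 3-variable binary model and feature map as above
def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3], dtype=float)

def model_probs(theta, states):
    # p(x; theta) ∝ exp(theta^T phi(x)) over the full state space X^D
    logits = np.array([theta @ phi(x) for x in states])
    p = np.exp(logits - logits.max())
    return p / p.sum()

states = list(itertools.product([0, 1], repeat=3))
theta = np.zeros(5)
data = [(1, 1, 0), (1, 1, 1), (0, 1, 1)]                  # toy training set D

# Eq. (3): positive phase (data expectation) minus negative phase (model expectation)
psi_pos = np.mean([phi(x) for x in data], axis=0)
p = model_probs(theta, states)
psi_neg = sum(p_x * phi(x) for p_x, x in zip(p, states))
print(psi_pos - psi_neg)
```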




Existing Learning Methods

Approximate the second term of the gradient of L:

$$\frac{\partial \mathcal{L}(\theta\,|\,\mathcal{D})}{\partial \theta} = \underbrace{\mathbb{E}_{\mathcal{D}}[\phi(x)]}_{\psi^{+}} - \underbrace{\mathbb{E}_{\theta}[\phi(x)]}_{\psi^{-}}$$

Markov Chain Monte Carlo Maximum Likelihood (MCMCML) [Geyer, 1991]

Contrastive Divergence (CD) [Hinton, 2002]

Persistent Contrastive Divergence (PCD) [Tieleman, 2008]

Tempered Transition (TT) [Salakhutdinov, 2010]

Parallel Tempering (PT) [Desjardins et al., 2010]

CD, PCD, TT and PT can be summarized as a Robbins-Monro stochastic approximation procedure (SAP) [Robbins and Monro, 1951].




Robbins-Monro’s SAP

1: Training data set D = {x^(1), . . . , x^(M)}. Randomly initialize model parameters θ0 and N particles {s_{0,1}, . . . , s_{0,N}}.
2: for t = 0 : T do   // T iterations
3:   for n = 1 : N do   // go through all N particles
4:     Sample s_{t+1,n} from s_{t,n} using transition operator H_{θt};
5:   end for
6:   Update: θ_{t+1} = θ_t + η [ (1/M) Σ_{m=1}^{M} φ(x^(m)) − (1/N) Σ_{n=1}^{N} φ(s_{t+1,n}) ]
7:   Decrease η.
8: end for

When the Gibbs sampler is used as H_{θt}, SAP becomes PCD; similarly, TT and PT can be substituted as the transition operator as well.
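A minimal sketch of this SAP loop with a Gibbs transition operator (the PCD case), written for a small fully visible Boltzmann machine. The parameterization, step-size schedule and toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(s, W, b):
    # One full sweep of single-site Gibbs updates for a fully visible
    # Boltzmann machine: p(x_d = 1 | x_rest) = sigmoid(W[d] @ x + b[d]),
    # assuming W is symmetric with a zero diagonal.
    for d in range(len(s)):
        s[d] = rng.random() < sigmoid(W[d] @ s + b[d])
    return s

def sap_pcd(data, T=500, N=50, eta0=0.1):
    # Robbins-Monro SAP with a Gibbs transition operator, i.e. the PCD case.
    M, D = data.shape
    W, b = np.zeros((D, D)), np.zeros(D)
    particles = rng.integers(0, 2, size=(N, D)).astype(float)  # persistent chains
    data_xx, data_x = data.T @ data / M, data.mean(axis=0)      # positive-phase statistics
    for t in range(T):
        eta = eta0 / (1.0 + 0.01 * t)                           # decreasing step size
        for n in range(N):                                      # evolve all N particles
            particles[n] = gibbs_sweep(particles[n], W, b)
        model_xx = particles.T @ particles / N                  # negative-phase statistics
        model_x = particles.mean(axis=0)
        W += eta * (data_xx - model_xx)
        np.fill_diagonal(W, 0.0)
        b += eta * (data_x - model_x)
    return W, b

# toy usage: 20 binary training vectors of dimension 5 (purely illustrative)
data = rng.integers(0, 2, size=(20, 5)).astype(float)
W, b = sap_pcd(data)
```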




MCMCML

In MCMCML, a proposal distribution p(x; θ0) is set up in the same form as the UGM, and we have

$$\frac{Z(\theta)}{Z(\theta_0)} = \frac{\sum_{x}\exp(\theta^{\top}\phi(x))}{\sum_{x}\exp(\theta_0^{\top}\phi(x))} = \sum_{x}\frac{\exp(\theta^{\top}\phi(x))}{\exp(\theta_0^{\top}\phi(x))}\times\frac{\exp(\theta_0^{\top}\phi(x))}{\sum_{x}\exp(\theta_0^{\top}\phi(x))} = \sum_{x}\exp\!\big((\theta-\theta_0)^{\top}\phi(x)\big)\,p(x;\theta_0) \approx \frac{1}{S}\sum_{s=1}^{S}w^{(s)} \quad (4)$$

where

$$w^{(s)} = \exp\!\big((\theta-\theta_0)^{\top}\phi(\bar{x}^{(s)})\big), \quad (5)$$

with x̄^(s) ∼ p(x; θ0). MCMCML is thus an importance-sampling approximation of the gradient.
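The sketch below computes the quantities in Eqs. (4)-(5): the importance weights w^(s), the estimate of Z(θ)/Z(θ0), and a self-normalized estimate of the negative phase E_θ[φ(x)]. Since θ0 = 0 makes p(x; θ0) uniform here, uniform binary samples stand in for exact draws from the proposal; the feature map and parameter values are the same hypothetical ones used earlier.

```python
import numpy as np

# Hypothetical feature map (same toy 3-variable model as before)
def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3], dtype=float)

rng = np.random.default_rng(1)
theta0 = np.zeros(5)                              # theta0 = 0 makes p(x; theta0) uniform
theta = np.array([0.4, -0.1, 0.2, 0.6, -0.3])     # current parameters (illustrative)
x_bar = rng.integers(0, 2, size=(100, 3))         # exact samples from the uniform proposal

feats = np.array([phi(x) for x in x_bar])
w = np.exp(feats @ (theta - theta0))              # Eq. (5): w^(s) = exp((theta - theta0)^T phi(x_bar^(s)))

Z_ratio = w.mean()                                # Eq. (4): Z(theta)/Z(theta0) ≈ (1/S) sum_s w^(s)

# self-normalized importance-sampling estimate of the negative phase E_theta[phi(x)]
psi_neg = (w[:, None] * feats).sum(axis=0) / w.sum()
print(Z_ratio, psi_neg)
```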




MCMCML

1: t ← 0, initialize the proposal distribution p(x; θ0)
2: Sample {x̄^(s)} from p(x; θ0)
3: while ! stop criterion do
4:   Calculate w^(s) using (5)
5:   Calculate the gradient ∂L̃(θt|D)/∂θt using the importance-sampling approximation
6:   Update θ_{t+1} = θ_t + η ∂L̃(θt|D)/∂θt
7:   t ← t + 1
8: end while



MCMCML

MCMCML's performance highly depends on the initial proposal distribution;

at time t, it is helpful to update the proposal distribution to p(x; θ_{t−1});

this is analogous to sequential importance sampling with resampling at every iteration, except that the sequence of distributions is constructed by learning;

this also resembles the SAP learning scheme;

a similar connection between PCD and sequential Monte Carlo was found in [Asuncion et al., 2010].




SAP Learning as Sequential Monte Carlo

1: Initialize p(x; θ0), t ← 0
2: Sample particles {x̄^(s)_0}_{s=1}^{S} ∼ p(x; θ0)
3: while ! stop criterion do
4:   Assign w^(s) ← 1/S, ∀s   // importance reweighting
5:   // resampling is ignored because it has no effect
6:   switch (algorithmic choice)   // MCMC transition
7:   case CD:
8:     generate a brand-new particle set {x̄^(s)_{t+1}}_{s=1}^{S} with Gibbs sampling from D
9:   case PCD:
10:     evolve particle set {x̄^(s)_t}_{s=1}^{S} to {x̄^(s)_{t+1}}_{s=1}^{S} with one step of Gibbs sampling
11:   case Tempered Transition:
12:     evolve particle set {x̄^(s)_t}_{s=1}^{S} to {x̄^(s)_{t+1}}_{s=1}^{S} with tempered transitions
13:   case Parallel Tempering:
14:     evolve particle set {x̄^(s)_t}_{s=1}^{S} to {x̄^(s)_{t+1}}_{s=1}^{S} with parallel tempering
15:   end switch
16:   Compute the gradient ∆θt according to (4);
17:   θ_{t+1} = θ_t + η∆θt, t ← t + 1;
18:   reduce η;
19: end while


SAP Learning as Sequential Monte Carlo

A sequential Monte Carlo (SMC) algorithm can work on the condition that the sequence of intermediate distributions is well constructed: two successive distributions should be close.

The gap between successive distributions in SAP is governed by η × D:
1. the learning rate η;
2. the dimensionality of x: D.




Persistent Sequential Monte Carlo

Every sequential, intermediate distribution is constructed by learning, so learning and sampling are interestingly entangled.

Applying the SMC philosophy further in sampling yields Persistent SMC (PSMC).

[Diagram: at learning iteration t, S/2 particles are drawn from a uniform distribution U_D and S/2 are the particles {x̄^{t−1,(s)}} carried over from p_H(x; θ_{t−1}); they are propagated through the sequence p_0(x; θ_t), p_1(x; θ_t), . . . , p_H(x; θ_t), yielding particles {x̄^{t,(s)}} ∼ p_H(x; θ_t), which (together with fresh uniform samples) seed the next sequence p_0(x; θ_{t+1}), . . . , p_H(x; θ_{t+1}), and so on.]




Persistent Sequential Monte Carlo

[Diagram repeated from the previous slide, annotated with the definitions below.]

U_D is a uniform distribution on x, and the intermediate sequential distributions are

$$p_h(x;\theta_{t+1}) \propto p(x;\theta_t)^{\,1-\beta_h}\; p(x;\theta_{t+1})^{\,\beta_h}$$

where 0 = β_0 ≤ β_1 ≤ · · · ≤ β_H = 1.
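A sketch of one such "sub-SMC" pass through the geometric bridge, written for a fully visible Boltzmann machine with a fixed β schedule (the adaptive choice of ∆βh is the topic of the next slide). It exploits the fact that both endpoints share the same sufficient statistics, so each p_h is simply the model with linearly interpolated parameters; the model, schedule and toy usage are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(s, W, b):
    # single-site Gibbs sweep for a fully visible Boltzmann machine
    for d in range(len(s)):
        s[d] = rng.random() < sigmoid(W[d] @ s + b[d])
    return s

def sub_smc(particles, W_old, b_old, W_new, b_new, betas):
    # Move particles from p(x; theta_t) towards p(x; theta_{t+1}) through the
    # geometric bridge p_h ∝ p(x; theta_t)^(1-beta_h) * p(x; theta_{t+1})^beta_h.
    # Both endpoints share the same sufficient statistics, so p_h is just the
    # model with interpolated parameters (1-beta_h)*theta_t + beta_h*theta_{t+1}.
    S = len(particles)
    energy = lambda s, W, b: -(0.5 * s @ W @ s + b @ s)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        d_beta = beta - beta_prev
        E_old = np.array([energy(s, W_old, b_old) for s in particles])
        E_new = np.array([energy(s, W_new, b_new) for s in particles])
        w = np.exp(d_beta * (E_old - E_new))        # incremental weights, Eq. (6)
        particles = particles[rng.choice(S, size=S, p=w / w.sum())]   # resample
        Wh = (1 - beta) * W_old + beta * W_new      # intermediate model p_h
        bh = (1 - beta) * b_old + beta * b_new
        for n in range(S):                          # MCMC move targeting p_h
            particles[n] = gibbs_sweep(particles[n], Wh, bh)
    return particles

# toy usage: bridge particles from a uniform "old" model to a random "new" model
D, S = 5, 40
W_old, b_old = np.zeros((D, D)), np.zeros(D)
W_new = rng.normal(scale=0.3, size=(D, D)); W_new = (W_new + W_new.T) / 2
np.fill_diagonal(W_new, 0.0)
b_new = rng.normal(scale=0.3, size=D)
parts = rng.integers(0, 2, size=(S, D)).astype(float)   # exact samples from the uniform model
parts = sub_smc(parts, W_old, b_old, W_new, b_new, betas=np.linspace(0.0, 1.0, 6))
```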



Number of Sub-Sequential Distributions

One issue arising in PSMC is the number of βh, i.e. H.

It can be handled by exploiting the degeneracy of the particle set: the importance weight of each particle is

$$w^{(s)} = \frac{p_h(\bar{x}^{(s)};\theta_{t+1})}{p_{h-1}(\bar{x}^{(s)};\theta_{t+1})} = \exp\!\big(E(\bar{x}^{(s)};\theta_t)\big)^{\Delta\beta_h}\,\exp\!\big(E(\bar{x}^{(s)};\theta_{t+1})\big)^{-\Delta\beta_h} \quad (6)$$

where ∆βh is the step length from βh−1 to βh, i.e. ∆βh = βh − βh−1.

The effective sample size (ESS) of a particle set is [Kong et al., 1994]

$$\sigma = \frac{\big(\sum_{s=1}^{S} w^{(s)}\big)^{2}}{S\,\sum_{s=1}^{S} \big(w^{(s)}\big)^{2}} \in \Big[\frac{1}{S},\, 1\Big] \quad (7)$$

The ESS σ is actually a function of ∆βh.

Set a threshold on σ and, at every h, find the biggest admissible step ∆βh by using bidirectional search.
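A sketch of this ESS-controlled step-length choice: σ from Eq. (7) as a function of ∆βh, with a simple bisection used here as a stand-in for the bidirectional search mentioned above; the threshold value and the toy energies are illustrative assumptions.

```python
import numpy as np

def ess(d_beta, E_old, E_new):
    # Incremental weights for step length d_beta (Eq. 6) and their normalized
    # effective sample size sigma (Eq. 7), which lies in [1/S, 1].
    w = np.exp(d_beta * (E_old - E_new))
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

def next_beta(beta, E_old, E_new, sigma_min=0.9, tol=1e-4):
    # Largest step d_beta whose ESS stays above the threshold, found by
    # bisection (a simple stand-in for the bidirectional search above).
    lo, hi = 0.0, 1.0 - beta
    if ess(hi, E_old, E_new) >= sigma_min:
        return 1.0                                  # can jump straight to the target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid, E_old, E_new) >= sigma_min:
            lo = mid
        else:
            hi = mid
    return beta + lo

# illustrative energies of S = 100 particles under theta_t and theta_{t+1}
rng = np.random.default_rng(3)
E_old = rng.normal(size=100)
E_new = E_old + rng.normal(scale=0.5, size=100)
print(next_beta(0.0, E_old, E_new))
```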




PSMC

Input: a training dataset D = {x^(m)}_{m=1}^{M}, learning rate η
1: Initialize p(x; θ0), t ← 0
2: Sample particles {x̄^(s)_0}_{s=1}^{S} ∼ p(x; θ0)
3: while ! stop criterion do   // root-SMC
4:   h ← 0, β0 ← 0
5:   while βh < 1 do   // sub-SMC
6:     assign importance weights {w^(s)}_{s=1}^{S} to the particles according to (6)
7:     resample particles based on {w^(s)}_{s=1}^{S}
8:     find the step length ∆βh
9:     βh+1 = βh + ∆βh
10:     h ← h + 1
11:   end while
12:   Compute the gradient ∆θt according to (4)
13:   θ_{t+1} = θ_t + η∆θt
14:   t ← t + 1
15: end while
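Putting the pieces together, here is a compact, self-contained sketch of a PSMC-style training loop for a small fully visible Boltzmann machine. It follows the diagram's logic of taking the gradient step first and then bridging the persistent particles to the new model, uses a plain particle average for the negative phase, and replaces the bidirectional search with simple step halving; all of these are simplifying assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def energy(s, W, b):
    # E(x; W, b) = -(0.5 x^T W x + b^T x) for a fully visible Boltzmann machine
    return -(0.5 * s @ W @ s + b @ s)

def gibbs_sweep(s, W, b):
    for d in range(len(s)):
        s[d] = rng.random() < sigmoid(W[d] @ s + b[d])
    return s

def ess(w):
    # normalized effective sample size, Eq. (7)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

def psmc_train(data, T=200, S=50, eta=0.05, sigma_min=0.9):
    M, D = data.shape
    W, b = np.zeros((D, D)), np.zeros(D)
    particles = rng.integers(0, 2, size=(S, D)).astype(float)   # persistent particle set
    data_xx, data_x = data.T @ data / M, data.mean(axis=0)
    for t in range(T):                                          # root-SMC
        model_xx = particles.T @ particles / S
        model_x = particles.mean(axis=0)
        W_new = W + eta * (data_xx - model_xx)                  # gradient step -> theta_{t+1}
        np.fill_diagonal(W_new, 0.0)
        b_new = b + eta * (data_x - model_x)
        beta = 0.0                                              # sub-SMC: bridge to theta_{t+1}
        while beta < 1.0:
            E_old = np.array([energy(s, W, b) for s in particles])
            E_new = np.array([energy(s, W_new, b_new) for s in particles])
            d_beta = 1.0 - beta                                 # try the full remaining step...
            while ess(np.exp(d_beta * (E_old - E_new))) < sigma_min and d_beta > 1e-3:
                d_beta *= 0.5                                   # ...halve until the ESS is acceptable
            beta += d_beta
            w = np.exp(d_beta * (E_old - E_new))                # incremental weights, Eq. (6)
            particles = particles[rng.choice(S, size=S, p=w / w.sum())]
            Wh = (1 - beta) * W + beta * W_new                  # intermediate model p_h
            bh = (1 - beta) * b + beta * b_new
            for n in range(S):
                particles[n] = gibbs_sweep(particles[n], Wh, bh)
        W, b = W_new, b_new
    return W, b

# toy usage on random binary data (purely illustrative)
data = rng.integers(0, 2, size=(30, 6)).astype(float)
W, b = psmc_train(data)
```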



Experiments

Two experiments addressing two challenges:

large learning rates;

high-dimensional distributions.



First Experiment: Small Learning Rates

A small-size Boltzmann machine with only 10 variables is used to avoid the effect of model complexity; learning rate ηt = 1/(100 + t).

[Figure: The performance of algorithms with the first learning rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of βs in PSMC and standard SMC at each iteration (blue) and their mean values (red).]



First Experiment: Bigger Learning Rates

Learning rate ηt = 1/(20 + 0.5t).

[Figure: The performance of algorithms with the second learning rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of βs in PSMC and standard SMC at each iteration (blue) and their mean values (red).]



First Experiment: Large Learning Rates

Learning rate ηt = 1/(10 + 0.1t).

[Figure: The performance of algorithms with the third learning rate scheme. (a): log-likelihood vs. number of epochs for ML, PCD-10, PT, TT, PSMC and SMC; (b) and (c): the number of βs in PSMC and standard SMC at each iteration (blue) and their mean values (red).]



Second Experiment: Small Dimensionality/Scale

We used the popular restricted Boltzmann machine to model handwritten digit images (the MNIST database); 10 hidden units.

[Figure: (a): testing-data and training-data log-likelihood vs. number of epochs for PCD-100, PCD-1, PT, TT, PSMC and SMC; (b) and (c): the number of temperatures in PSMC and standard SMC at each iteration.]


Second Experiment: Large Dimensionality/Scale

500 hidden units.

[Figure: (d): testing-data and training-data log-likelihood vs. number of epochs for PCD-200, PCD-1, PT, TT and PSMC; (e) and (f): the number of temperatures in PSMC and standard SMC at each iteration.]



Conclusion

A new interpretation of learning undirected graphical models: sequential Monte Carlo (SMC).

Reveals two challenges in learning: large learning rates and high dimensionality.

A deeper application of SMC in learning → Persistent SMC.

Yields higher likelihood than state-of-the-art algorithms in challenging cases.




References

Asuncion, A. U., Liu, Q., Ihler, A. T., and Smyth, P. (2010). Particle filtered MCMC-MLE with connections to contrastive divergence. In ICML.

Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In AISTATS.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.



Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.

Salakhutdinov, R. (2010). Learning in Markov random fields using tempered transitions. In NIPS.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071.
