ST451 - Lent term Bayesian Machine Learning

ST451 - Lent termBayesian Machine Learning

Kostas Kalogeropoulos

Variational Bayes / Approximation

1 / 31

Summary of last lecture

Classification Problem: Categorical y , mixed X .

Generative models: Specify π(y) with ‘prior’ probabilities, thenπ(X |y) for each category of y , e.g. LDA

Discriminative models: Logistic regression, maximum likelihoodvia Newton-Raphson.

Bayesian Logistic Regression: Use of Laplace approximationsimilar results with MLE.

Prediction Assesment: Accuracy, Area under the ROC curve andlog score rule.

2 / 31

Motivating Example 1‘Default’ dataset consist of three variables: annual income, credit cardbalance and whether or not the person has defaulted in his/her creditcard.

0 500 1000 1500 2000 2500

02

00

00

40

00

06

00

00

Balance

Inco

me

No Yes

0500

1000

1500

2000

2500

Default

Ba

lan

ce

No Yes

020000

40000

60000

Default

Inco

me

The aim is to build a model to predict whether a person will defaultbased on annual income and monthly credit card balance.

3 / 31

Motivating Example 2

Volatility Index (VIX) provided by Chicago Board of Exchange (CBOE).Derived from the S&P 500 index options. Represents market’sexpectation of its future 30-day volatility. A measure of market risk.

4 / 31

Variational vs Laplace ApproximationIn both of these examples (and many others), the posterior and theexact distribution of MLEs are typically intractable.

Last week we used the Laplace approximation. This week we will lookinto the Variational approximation. Below we see these approximationsin terms of the pdfs (left) and negative log scale (right).

5 / 31

Outline

1 Essentials of Variational Inference

2 Examples with mean field approximation

3 Automatic Variational Inference

6 / 31

Outline




7 / 31

Main idea

Ideally we would like to use the posterior π(θ|y). But it is not alwaysavailable.

Laplace approximation uses a Normal distribution based on a singlepoint (the mode).

Variational approximation usually follows the steps below1 Consider a family of distributions q(θ|y , φ) with parameters φ, e.g.

Normal, Gamma etc.2 Select φ such that q(θ|y , φ) is as close as possible to π(θ|y).

8 / 31

Approximating a Gamma(0,1) with a N(µ, σ2)

9 / 31

Variational Bayes

As close as possible translates into minimising the KL divergence

KL(q||π) =

∫q(θ|y , φ) log

q(θ|y , φ)

π(θ|y)dθ

It can be shown that KL(q||π) ≥ 0 and KL(q||π) = 0 iff q D= π.

But π(θ|y) is intractable so the above is not very useful. Instead weconsider the evidence lower bound (ELBO)

ELBO(φ) =

∫q(θ|y , φ) log

f (y |θ)π(θ)

q(θ|y , φ)dθ

10 / 31

Variational Bayes (cont’d)Note that the sum of KL(q||π) and ELBO(φ) is equal to∫

q(θ|y , φ) logq(θ|y , φ)

π(θ|y)dθ +

∫q(θ|y , φ) log

f (y |θ)π(θ)

q(θ|y , φ)dθ∫

q(θ|y , φ)

{log

q(θ|y , φ)

π(θ|y)+ log

f (y |θ)π(θ)

q(θ|y , φ)

}dθ∫

q(θ|y , φ)

{log

f (y |θ)π(θ)

π(θ|y)

}dθ =

∫q(θ|y , φ) logπ(y)dθ

logπ(y)

∫q(θ|y , φ)dθ = logπ(y).

Notes:1 Since KL(q||p) ≥ 0, we get that logπ(y) ≥ELBO(φ). Hence, the

name evidence lower bound (ELBO).2 The sum above is independent of φ so minimising KL(q||p) is the

same as maximising ELBO(φ).

11 / 31

Mean field approximation

Ofcourse the minimum KL divergence can still be large, depends onthe choice of q.

The most widely used choice is known as the mean field approximationand assumes that q can be factorised into some components

q(θ|y , φ) =∏

i

q(θi |y , φi)) =∏

i

q(θi)

This results in an algorithm that iteratively maximises ELBO(φ) wrteach q(θi) keeping q(θj), j 6= i , or else q(θ−i) fixed.

We refer to each q(θi) as VB component.

12 / 31

Mean field approximation (cont’d)

Let θ = (θi , θ−i ), q(θ|y , φ) = q(θi )q(θ−i ), and π(y , θ) = f (y |θ)π(θ). Then

ELBO =

∫ ∫q(θi )q(θ−i ) log

π(y , θ)

q(θi )q(θ−i )dθ−idθi

=

∫q(θi )

{∫logπ(y , θ)q(θ−i )dθ−i

}dθi

−∫ ∫

q(θi )q(θ−i ) log q(θi )dθ−idθi

−∫ ∫

q(θi )q(θ−i ) log q(θ−i )dθ−idθi

=

∫q(θi )

{∫logπ(y , θ)q(θ−i )dθ−i

}dθi

−∫

q(θi ) log q(θi )dθi −∫

q(θ−i ) log q(θ−i )dθ−i

13 / 31

Mean field approximation (cont’d)

Note that∫

logπ(y , θ)q(θ−i)dθ−i = Eq(θ−i )[logπ(y , θ)] and define

q∗(θi) = 1z exp

(Eq(θ−i )[logπ(y , θ)]

). Then

log q∗(θi) = log{∫

logπ(y , θ)q(θ−i)dθ−i

}+ c

and we can therefore write

ELBO =

∫q(θi) log

q∗(θi)

q(θi)dθi −

∫q(θ−i) log q(θ−i)dθ−i + c

= −KL(q(θi)||q∗(θi)−∫

q(θ−i) log q(θ−i)dθ−i + c

14 / 31

Mean field approximation (cont’d)Hence maximising ELBO wrt q(θi) is the same as minimisingKL(q(θi)||q∗(θi) wrt q(θi).

KL(q(θi)||q∗(θi) is minimised when

q(θi) = q∗(θi) =1z

exp(Eq(θ−i )[logπ(y , θ)]

)

This suggests the following algorithm1 Initialise each q(θi) to the prior π(θi),2 For each θi , update q(θi), based on Eq(θ−i )[logπ(y , θ)],3 Continue until ELBO converges.

Generally well behaved algorithm when it can be derived, i.e. when wecan recognise the q∗(θi)’s.

15 / 31

Outline




16 / 31

Example: N(µ, τ−1)

Let y = (y1, . . . , yn) (yi independent) from N(µ, τ−1). AssignN(µ0, (λ0τ)−1) as prior for µ|τ and Gamma(α0, β0) for τ . The posteriorcan be derived but let’s consider its variational approximation.

Assume q(θ) = q(µ)q(τ)

The log joint density is (c will be denoting constant from now on)

logπ(y , θ) = n2 log τ − τ

2

n∑i=1

(yi − µ)2 + 12 log τ − τλ0

2 (µ− µ0)2

+ (α0 − 1) log τ − β0τ + c

We will now derive the VB components q(µ) and q(τ)

17 / 31

Example: N(µ, τ−1) - q(µ)For q(µ) we can focus on the terms involving µ.

log q(µ) = Eq(τ)

[− τ

2

n∑i=1

(yi − µ)2 − τλ02 (µ− µ0)2

]+ c

= −Eq(τ)(τ)

2

[n∑

i=1

(yi − µ)2 + λ0(µ− µ0)2

]+ c.

By inspection we can identify q(µ) to be the N(µφ, τ−1φ ), where

µφ =λ0µ0 +

∑ni=1 yi

λ0 + nτφ = (λ0 + n)E(τ)

The quantity E(τ) = Eq(τ)(τ) will be provided by the derivation of q(τ)

18 / 31

Example: N(µ, τ−1) - q(τ)For q(τ) we take as before all the terms involving τ

log q(τ) = Eq(µ)

[(n+12 + α0 − 1

)log τ − β0τ − τ

2

n∑i=1

(yi − µ)2

− τλ02 (µ− µ0)2

]+ c

=(n+1

2 + α0 − 1)

log τ − β0τ

− τ2Eq(µ)

[n∑

i=1

(yi − µ)2 + λ0(µ− µ0)2

]

By inspection we can identify q(τ) to be the Gamma(αφ, βφ), where

αφ = α0 + n+12 ,

βφ = β0 + 12

(S2

y − 2E(µ)Sy + nE(µ2))

+ λ02

(µ2

0 − 2µ0E(µ) + E(µ2)),

E(µ) = Eq(µ)(µ),

Sy =∑

i

yi , S2y =

∑i

y2i .

19 / 31

Example: N(µ, τ−1) - overall algorithm

So overall we set q(µ) =N(µφ, τ−1φ ) and q(τ) =Gamma(αφ, βφ).

Then we look for the q parameters φ = (µφ, τφ, αφ, βφ) that maximiseELBO by first initialising and then iteratively updating

µφ =λ0µ0 +

∑ni=1 yi

λ0 + n,

τφ = (λ0 + n)E(τ),

αφ = α0 + n+12 ,

βφ = β0 + 12

(S2

y − 2E(µ)Sy + nE(µ2))

+ λ02

(µ2

0 − 2µ0E(µ) + E(µ2)),

E(τ) = αφ/βφ,

E(µ) = µφ,

E(µ2) =1τφ

+ µ2φ.

20 / 31

Graphical illustration of the previous algorithm

21 / 31

Remarks on mean field approximation

Possible to extend this to linear regression and other exponentialfamily models including logistic regression. In some cases somemodel-specific tricks are needed.

Generally provides a good approximation to the mean butunderestimates the variance (as it cannot capture posteriordependencies).

Model selection can be done by optimising each model separatelyand then comparing their ELBO’s. Prediction is alsostraightforward

A fair amount of derivations are required and it is easy to lose thebig picture. A black box would be useful.

22 / 31

Outline




23 / 31

Motivating Example 2

Volatility Index (VIX) provided by Chicago Board of Exchange (CBOE).Derived from the S&P 500 index options. Represents market’sexpectation of its future 30-day volatility. A measure of market risk.

24 / 31

Modelling VIX

VIX trajectories are mean reverting and autocorrelated. A simplemodel that captures these stylised facts is

Yt = Yt−1 + κ(µ− Yt−1)δ + σεt ,

where Yt is VIX at time t, and εt are independent error terms.

µ : long term mean, σ volatility of volatility, κ mean reversion speed.

A convenient option is to set εt ∼ N(0, δ) as it gives closed formposterior and distribution of MLEs.

But it is not a good choice for the spikes that we observe. A tdistribution with low degrees of freedom is a much better option, yetintractable. No much room for the previous tricks either.

25 / 31

Hurdles towards Automatic Variational Inference

The procedure for variational inference can be automated. The aim isto be able to specify the likelihood and the prior and nothing else.

But even under the framework of mean field approximation, there aretwo main hurdles:

Each θi may be given a different distribution depending also on itsrange, e.g. R, R+, [0,1] etc.It is not always possible to derive the algorithm presented earlier.Even if it was possible its final form would depend on the model,so cannot be automated.

The recent Automatic Differentiation Variational Inference (ADVI)approach of Kucukelbir et al (2016) addresses those issues.

26 / 31

Transformation to the Rp

The first step is to transform all the θi components to the real line usinglog or logit transformations where needed. Hence we transform fromthe parameter space Θ to Rp via the function T (·).

We can then define ζ := T (θ) and the the joint density (likelihood timesprior) can be written as

π(y , θ) = π(y ,T−1(ζ)

) ∣∣∣det JT−1(ζ)

∣∣∣Given that ζ is defined in Rp, we can assign the Normal distribution onit. The default option is to assume p independent Normals.

q(ζ|y , φ) =

p∏i=1

N(ζi |µi , σ2i )

Note that the corresponding q(θ|y) is not necessarily Normal.27 / 31

Optimisation

Numerical optimisation can be used. It is essential to calculate thegradient of ELBO(φ) to obtain good performance. We can write

ELBO(φ) =

∫q(θ|y , φ) log

f (y |θ)π(θ)

q(θ|y , φ)dθ

=

∫q(θ|y , φ) log f (y |θ)dθ −

∫q(θ|y , φ) log

q(θ|y , φ)

π(θ)dθ

= Eq(θ) [log f (y |θ)]− KL [q(θ|y , φ)||π(θ)]

The second term in the expression above can be derived analytically(Kucukelbir et al 2016), so automatic differentiation can be used.

The first term is tricky because it doesn’t have a closed form.Automatic differentiation can only be used for terms inside theexpectation.

28 / 31

ReparameterisationTo see this note the we want to calculate

∇φEq(φ) [log f (y |θ)] = ∇φEq(φ)

[f(y |T−1(ζ)

) ∣∣∣det JT−1(ζ)

∣∣∣] ,i.e. an expectation wrt φ.

This problem can be addressed by (further) reparameterisation, i.e. bystandardising the ζ ’s

ηi =ζi − µi

σi, i = 1, . . . , k .

Now we can write (for the corresponding transformation Tφ from θ to η)

∇φEq(φ) [log f (y |θ)] = ∇φEq(η)

[f(y |T−1

φ (η)) ∣∣∣det JT−1

φ (η)

∣∣∣] ,and we can see that φ now appears only inside the expectation.

29 / 31

Stochastic Gradient Descent (SGD)

The following trick allows quick and automatic estimates of the gradientof ELBO using Monte Carlo and automatic differentiation.

It can be used in state-of-the-art algorithms for big models and bigdata (e.g. deep networks), such as the stochastic gradient descent.

In such contexts there is no need to calculate the gradient from theentire data, but only from a small batch.

At the moment, in deep learning, this the only scalable method forBayesian Inference.

30 / 31

Today’s lecture - Reading

Bishop: 10.1 10.3 10.6.

Murphy: 21.1 21.2 21.3.1 21.5.

Kucukelbir A., Tran D., Ranganath R., Gelman A., Blei D.M. (2016)Automatic Differentiation Variational Inference. Available on Arxiv

31 / 31

ST451 - Lent term Bayesian Machine Learning

Documents

Transcript of ST451 - Lent term Bayesian Machine Learning