ST451 - Lent term Bayesian Machine Learning


Transcript of ST451 - Lent term Bayesian Machine Learning

Page 1: ST451 - Lent term Bayesian Machine Learning

ST451 - Lent term Bayesian Machine Learning

Kostas Kalogeropoulos

Variational Bayes / Approximation

1 / 31

Page 2: ST451 - Lent term Bayesian Machine Learning

Summary of last lecture

Classification Problem: Categorical y, mixed X.

Generative models: Specify π(y) with ‘prior’ probabilities, then π(X|y) for each category of y, e.g. LDA.

Discriminative models: Logistic regression, maximum likelihood via Newton-Raphson.

Bayesian Logistic Regression: Use of the Laplace approximation; similar results to MLE.

Prediction Assessment: Accuracy, area under the ROC curve, and the log score rule.

2 / 31

Page 3: ST451 - Lent term Bayesian Machine Learning

Motivating Example 1

The ‘Default’ dataset consists of three variables: annual income, credit card balance, and whether or not the person has defaulted on his/her credit card.

[Figure: scatterplot of credit card Balance against Income, coloured by default status (No/Yes), alongside boxplots of Balance and Income for each Default group.]

The aim is to build a model to predict whether a person will default based on annual income and monthly credit card balance.

3 / 31

Page 4: ST451 - Lent term Bayesian Machine Learning

Motivating Example 2

The Volatility Index (VIX) is provided by the Chicago Board Options Exchange (CBOE). It is derived from S&P 500 index options and represents the market's expectation of future 30-day volatility: a measure of market risk.

4 / 31

Page 5: ST451 - Lent term Bayesian Machine Learning

Variational vs Laplace Approximation

In both of these examples (and many others), the posterior and the exact distribution of the MLEs are typically intractable.

Last week we used the Laplace approximation. This week we will look into the Variational approximation. Below we see these approximations in terms of the pdfs (left) and on the negative log scale (right).

5 / 31

Page 6: ST451 - Lent term Bayesian Machine Learning

Outline

1 Essentials of Variational Inference

2 Examples with mean field approximation

3 Automatic Variational Inference

6 / 31

Page 7: ST451 - Lent term Bayesian Machine Learning

Outline

1 Essentials of Variational Inference

2 Examples with mean field approximation

3 Automatic Variational Inference

7 / 31

Page 8: ST451 - Lent term Bayesian Machine Learning

Main idea

Ideally we would like to use the posterior π(θ|y). But it is not always available.

Laplace approximation uses a Normal distribution based on a single point (the mode).

Variational approximation usually follows the steps below:

1 Consider a family of distributions q(θ|y, φ) with parameters φ, e.g. Normal, Gamma etc.

2 Select φ such that q(θ|y, φ) is as close as possible to π(θ|y).

8 / 31

Page 9: ST451 - Lent term Bayesian Machine Learning

Approximating a Gamma(0,1) with a N(µ, σ²)

9 / 31

Page 10: ST451 - Lent term Bayesian Machine Learning

Variational Bayes

As close as possible translates into minimising the KL divergence

\[
\mathrm{KL}(q\,\|\,\pi) = \int q(\theta \mid y, \phi)\,\log\frac{q(\theta \mid y, \phi)}{\pi(\theta \mid y)}\,d\theta
\]

It can be shown that KL(q||π) ≥ 0 and KL(q||π) = 0 iff q = π (in distribution).

But π(θ|y) is intractable, so the above is not very useful. Instead we consider the evidence lower bound (ELBO)

\[
\mathrm{ELBO}(\phi) = \int q(\theta \mid y, \phi)\,\log\frac{f(y \mid \theta)\,\pi(\theta)}{q(\theta \mid y, \phi)}\,d\theta
\]
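To make the ELBO concrete, here is a minimal Monte Carlo sketch (not from the slides): draw θ from q and average log f(y|θ) + log π(θ) − log q(θ|y, φ). The model (yi ~ N(θ, 1), prior θ ~ N(0, 1)) and the variational family q = N(m, s²) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Illustrative model: y_i ~ N(theta, 1), prior theta ~ N(0, 1), q = N(m, s^2).
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=50)

def elbo_mc(m, s, n_draws=5000):
    theta = rng.normal(m, s, size=n_draws)                       # theta_s ~ q(theta | y, phi)
    log_lik = stats.norm.logpdf(y[None, :], loc=theta[:, None], scale=1.0).sum(axis=1)
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=1.0)
    log_q = stats.norm.logpdf(theta, loc=m, scale=s)
    return np.mean(log_lik + log_prior - log_q)                  # Monte Carlo estimate of ELBO(phi)

# For this conjugate model the exact posterior is N(n*ybar/(n+1), 1/(n+1)),
# so the ELBO should be largest for q close to those values.
n = len(y)
print(elbo_mc(m=n * y.mean() / (n + 1), s=np.sqrt(1 / (n + 1))))
print(elbo_mc(m=0.0, s=1.0))   # a poorer q gives a smaller ELBO (up to Monte Carlo noise)
```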

10 / 31

Page 11: ST451 - Lent term Bayesian Machine Learning

Variational Bayes (cont’d)

Note that the sum of KL(q||π) and ELBO(φ) is equal to

\[
\int q(\theta \mid y, \phi)\,\log\frac{q(\theta \mid y, \phi)}{\pi(\theta \mid y)}\,d\theta
+ \int q(\theta \mid y, \phi)\,\log\frac{f(y \mid \theta)\,\pi(\theta)}{q(\theta \mid y, \phi)}\,d\theta
= \int q(\theta \mid y, \phi)\left\{\log\frac{q(\theta \mid y, \phi)}{\pi(\theta \mid y)} + \log\frac{f(y \mid \theta)\,\pi(\theta)}{q(\theta \mid y, \phi)}\right\}d\theta
\]
\[
= \int q(\theta \mid y, \phi)\left\{\log\frac{f(y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid y)}\right\}d\theta
= \int q(\theta \mid y, \phi)\,\log\pi(y)\,d\theta
= \log\pi(y)\int q(\theta \mid y, \phi)\,d\theta = \log\pi(y).
\]

Notes:

1 Since KL(q||π) ≥ 0, we get that log π(y) ≥ ELBO(φ). Hence the name evidence lower bound (ELBO).

2 The sum above is independent of φ, so minimising KL(q||π) is the same as maximising ELBO(φ).

11 / 31

Page 12: ST451 - Lent term Bayesian Machine Learning

Mean field approximation

Of course, the minimum KL divergence can still be large; it depends on the choice of q.

The most widely used choice is known as the mean field approximation and assumes that q can be factorised into some components

\[
q(\theta \mid y, \phi) = \prod_i q(\theta_i \mid y, \phi_i) = \prod_i q(\theta_i)
\]

This results in an algorithm that iteratively maximises ELBO(φ) wrt each q(θi), keeping the q(θj), j ≠ i, or else q(θ−i), fixed.

We refer to each q(θi) as a VB component.

12 / 31

Page 13: ST451 - Lent term Bayesian Machine Learning

Mean field approximation (cont’d)

Let θ = (θi, θ−i), q(θ|y, φ) = q(θi)q(θ−i), and π(y, θ) = f(y|θ)π(θ). Then

\[
\mathrm{ELBO} = \int\!\!\int q(\theta_i)\,q(\theta_{-i})\,\log\frac{\pi(y, \theta)}{q(\theta_i)\,q(\theta_{-i})}\,d\theta_{-i}\,d\theta_i
\]
\[
= \int q(\theta_i)\left\{\int \log\pi(y, \theta)\,q(\theta_{-i})\,d\theta_{-i}\right\}d\theta_i
- \int\!\!\int q(\theta_i)\,q(\theta_{-i})\,\log q(\theta_i)\,d\theta_{-i}\,d\theta_i
- \int\!\!\int q(\theta_i)\,q(\theta_{-i})\,\log q(\theta_{-i})\,d\theta_{-i}\,d\theta_i
\]
\[
= \int q(\theta_i)\left\{\int \log\pi(y, \theta)\,q(\theta_{-i})\,d\theta_{-i}\right\}d\theta_i
- \int q(\theta_i)\,\log q(\theta_i)\,d\theta_i
- \int q(\theta_{-i})\,\log q(\theta_{-i})\,d\theta_{-i}
\]

13 / 31

Page 14: ST451 - Lent term Bayesian Machine Learning

Mean field approximation (cont’d)

Note that ∫ log π(y, θ) q(θ−i) dθ−i = E_{q(θ−i)}[log π(y, θ)] and define q*(θi) = (1/z) exp(E_{q(θ−i)}[log π(y, θ)]). Then

\[
\log q^*(\theta_i) = \int \log\pi(y, \theta)\,q(\theta_{-i})\,d\theta_{-i} + c
\]

and we can therefore write

\[
\mathrm{ELBO} = \int q(\theta_i)\,\log\frac{q^*(\theta_i)}{q(\theta_i)}\,d\theta_i - \int q(\theta_{-i})\,\log q(\theta_{-i})\,d\theta_{-i} + c
= -\mathrm{KL}\big(q(\theta_i)\,\|\,q^*(\theta_i)\big) - \int q(\theta_{-i})\,\log q(\theta_{-i})\,d\theta_{-i} + c
\]

14 / 31

Page 15: ST451 - Lent term Bayesian Machine Learning

Mean field approximation (cont’d)

Hence maximising ELBO wrt q(θi) is the same as minimising KL(q(θi)||q*(θi)) wrt q(θi).

KL(q(θi)||q*(θi)) is minimised when

\[
q(\theta_i) = q^*(\theta_i) = \frac{1}{z}\exp\!\big(\mathbb{E}_{q(\theta_{-i})}[\log\pi(y, \theta)]\big)
\]

This suggests the following algorithm:

1 Initialise each q(θi) to the prior π(θi),

2 For each θi, update q(θi) based on E_{q(θ−i)}[log π(y, θ)],

3 Continue until the ELBO converges.

Generally this is a well-behaved algorithm when it can be derived, i.e. when we can recognise the q*(θi)'s.

15 / 31

Page 16: ST451 - Lent term Bayesian Machine Learning

Outline

1 Essentials of Variational Inference

2 Examples with mean field approximation

3 Automatic Variational Inference

16 / 31

Page 17: ST451 - Lent term Bayesian Machine Learning

Example: N(µ, τ⁻¹)

Let y = (y1, . . . , yn) (yi independent) from N(µ, τ⁻¹). Assign N(µ0, (λ0τ)⁻¹) as prior for µ|τ and Gamma(α0, β0) for τ. The posterior can be derived, but let's consider its variational approximation.

Assume q(θ) = q(µ)q(τ)

The log joint density is (c will denote a constant from now on)

\[
\log\pi(y, \theta) = \frac{n}{2}\log\tau - \frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2 + \frac{1}{2}\log\tau - \frac{\tau\lambda_0}{2}(\mu - \mu_0)^2 + (\alpha_0 - 1)\log\tau - \beta_0\tau + c
\]

We will now derive the VB components q(µ) and q(τ)

17 / 31

Page 18: ST451 - Lent term Bayesian Machine Learning

Example: N(µ, τ⁻¹) - q(µ)

For q(µ) we can focus on the terms involving µ.

\[
\log q(\mu) = \mathbb{E}_{q(\tau)}\!\left[-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2 - \frac{\tau\lambda_0}{2}(\mu - \mu_0)^2\right] + c
= -\frac{\mathbb{E}_{q(\tau)}(\tau)}{2}\left[\sum_{i=1}^{n}(y_i - \mu)^2 + \lambda_0(\mu - \mu_0)^2\right] + c.
\]

By inspection we can identify q(µ) to be the N(µφ, τφ⁻¹), where

\[
\mu_\phi = \frac{\lambda_0\mu_0 + \sum_{i=1}^{n} y_i}{\lambda_0 + n}, \qquad \tau_\phi = (\lambda_0 + n)\,\mathbb{E}(\tau)
\]

The quantity E(τ) = E_{q(τ)}(τ) will be provided by the derivation of q(τ).

18 / 31

Page 19: ST451 - Lent term Bayesian Machine Learning

Example: N(µ, τ⁻¹) - q(τ)

For q(τ) we take as before all the terms involving τ

\[
\log q(\tau) = \mathbb{E}_{q(\mu)}\!\left[\left(\frac{n+1}{2} + \alpha_0 - 1\right)\log\tau - \beta_0\tau - \frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2 - \frac{\tau\lambda_0}{2}(\mu - \mu_0)^2\right] + c
\]
\[
= \left(\frac{n+1}{2} + \alpha_0 - 1\right)\log\tau - \beta_0\tau - \frac{\tau}{2}\,\mathbb{E}_{q(\mu)}\!\left[\sum_{i=1}^{n}(y_i - \mu)^2 + \lambda_0(\mu - \mu_0)^2\right]
\]

By inspection we can identify q(τ) to be the Gamma(αφ, βφ), where

\[
\alpha_\phi = \alpha_0 + \frac{n+1}{2}, \qquad
\beta_\phi = \beta_0 + \frac{1}{2}\left(S_y^2 - 2\,\mathbb{E}(\mu)S_y + n\,\mathbb{E}(\mu^2)\right) + \frac{\lambda_0}{2}\left(\mu_0^2 - 2\mu_0\,\mathbb{E}(\mu) + \mathbb{E}(\mu^2)\right),
\]
\[
\mathbb{E}(\mu) = \mathbb{E}_{q(\mu)}(\mu), \qquad S_y = \sum_i y_i, \qquad S_y^2 = \sum_i y_i^2.
\]

19 / 31

Page 20: ST451 - Lent term Bayesian Machine Learning

Example: N(µ, τ⁻¹) - overall algorithm

So overall we set q(µ) = N(µφ, τφ⁻¹) and q(τ) = Gamma(αφ, βφ).

Then we look for the q parameters φ = (µφ, τφ, αφ, βφ) that maximise ELBO by first initialising and then iteratively updating

\[
\mu_\phi = \frac{\lambda_0\mu_0 + \sum_{i=1}^{n} y_i}{\lambda_0 + n}, \qquad
\tau_\phi = (\lambda_0 + n)\,\mathbb{E}(\tau), \qquad
\alpha_\phi = \alpha_0 + \frac{n+1}{2},
\]
\[
\beta_\phi = \beta_0 + \frac{1}{2}\left(S_y^2 - 2\,\mathbb{E}(\mu)S_y + n\,\mathbb{E}(\mu^2)\right) + \frac{\lambda_0}{2}\left(\mu_0^2 - 2\mu_0\,\mathbb{E}(\mu) + \mathbb{E}(\mu^2)\right),
\]
\[
\mathbb{E}(\tau) = \alpha_\phi/\beta_\phi, \qquad
\mathbb{E}(\mu) = \mu_\phi, \qquad
\mathbb{E}(\mu^2) = \frac{1}{\tau_\phi} + \mu_\phi^2.
\]
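The updates above translate directly into a short coordinate-ascent routine. A minimal sketch follows; the hyperparameter defaults and the simulated data are assumptions for illustration, and for brevity convergence is monitored through E(τ) rather than the ELBO.

```python
import numpy as np

def cavi_normal_gamma(y, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0,
                      n_iter=100, tol=1e-10):
    """Coordinate-ascent VB for the N(mu, 1/tau) model with the Normal-Gamma
    prior above; the hyperparameter defaults are illustrative assumptions."""
    n, Sy, Sy2 = len(y), y.sum(), (y ** 2).sum()

    # q(mu) = N(mu_phi, 1/tau_phi), q(tau) = Gamma(alpha_phi, beta_phi)
    mu_phi = (lam0 * mu0 + Sy) / (lam0 + n)       # does not change across iterations
    alpha_phi = alpha0 + (n + 1) / 2              # does not change across iterations
    E_tau = alpha0 / beta0                        # initialise E_q(tau) at the prior mean

    for _ in range(n_iter):
        tau_phi = (lam0 + n) * E_tau              # precision of q(mu)
        E_mu, E_mu2 = mu_phi, 1.0 / tau_phi + mu_phi ** 2

        beta_phi = (beta0
                    + 0.5 * (Sy2 - 2 * E_mu * Sy + n * E_mu2)
                    + 0.5 * lam0 * (mu0 ** 2 - 2 * mu0 * E_mu + E_mu2))
        new_E_tau = alpha_phi / beta_phi          # E_q(tau) = alpha_phi / beta_phi
        if abs(new_E_tau - E_tau) < tol:
            E_tau = new_E_tau
            break
        E_tau = new_E_tau

    return mu_phi, (lam0 + n) * E_tau, alpha_phi, beta_phi

# Example usage with simulated data (illustrative only)
rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=0.5, size=200)
print(cavi_normal_gamma(y))   # (mu_phi, tau_phi, alpha_phi, beta_phi)
```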

20 / 31

Page 21: ST451 - Lent term Bayesian Machine Learning

Graphical illustration of the previous algorithm

21 / 31

Page 22: ST451 - Lent term Bayesian Machine Learning

Remarks on mean field approximation

Possible to extend this to linear regression and other exponential family models, including logistic regression. In some cases some model-specific tricks are needed.

Generally provides a good approximation to the mean but underestimates the variance (as it cannot capture posterior dependencies).

Model selection can be done by optimising each model separately and then comparing their ELBOs. Prediction is also straightforward.

A fair amount of derivation is required and it is easy to lose the big picture. A black box would be useful.

22 / 31

Page 23: ST451 - Lent term Bayesian Machine Learning

Outline

1 Essentials of Variational Inference

2 Examples with mean field approximation

3 Automatic Variational Inference

23 / 31

Page 24: ST451 - Lent term Bayesian Machine Learning

Motivating Example 2

The Volatility Index (VIX) is provided by the Chicago Board Options Exchange (CBOE). It is derived from S&P 500 index options and represents the market's expectation of future 30-day volatility: a measure of market risk.

24 / 31

Page 25: ST451 - Lent term Bayesian Machine Learning

Modelling VIX

VIX trajectories are mean reverting and autocorrelated. A simple model that captures these stylised facts is

\[
Y_t = Y_{t-1} + \kappa(\mu - Y_{t-1})\,\delta + \sigma\,\varepsilon_t,
\]

where Yt is the VIX at time t, and the εt are independent error terms.

µ: long-term mean, σ: volatility of volatility, κ: mean-reversion speed.

A convenient option is to set εt ∼ N(0, δ), as it gives a closed-form posterior and distribution of MLEs.

But it is not a good choice for the spikes that we observe. A t distribution with low degrees of freedom is a much better option, yet intractable. There is not much room for the previous tricks either.
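A minimal simulation sketch of this model with heavy-tailed shocks is given below; the parameter values, the time step δ, and the degrees of freedom are illustrative assumptions, not estimates from the VIX data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only (assumptions, not fitted to VIX)
kappa, mu, sigma = 5.0, 20.0, 6.0   # mean-reversion speed, long-run mean, vol of vol
delta = 1.0 / 252                   # assumed daily time step
nu, T = 4, 1000                     # low degrees of freedom, number of steps

y = np.empty(T)
y[0] = mu
for t in range(1, T):
    eps = rng.standard_t(nu) * np.sqrt(delta)                       # heavy-tailed shock, scale ~ sqrt(delta)
    y[t] = y[t - 1] + kappa * (mu - y[t - 1]) * delta + sigma * eps

print(f"simulated series: mean {y.mean():.1f}, min {y.min():.1f}, max {y.max():.1f}")
```

The t shocks produce occasional large spikes that a Normal error term of the same scale would rarely generate.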

25 / 31

Page 26: ST451 - Lent term Bayesian Machine Learning

Hurdles towards Automatic Variational Inference

The procedure for variational inference can be automated. The aim is to be able to specify the likelihood and the prior and nothing else.

But even under the framework of mean field approximation, there are two main hurdles:

1 Each θi may be given a different distribution, depending also on its range, e.g. R, R+, [0, 1], etc.

2 It is not always possible to derive the algorithm presented earlier. Even if it were possible, its final form would depend on the model, so it cannot be automated.

The recent Automatic Differentiation Variational Inference (ADVI)approach of Kucukelbir et al (2016) addresses those issues.
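In practice this means the model can be handed over almost as written. A hedged sketch using the PyMC library (an assumed choice of package, not prescribed by the slides; the exact API differs between PyMC3 and later PyMC versions):

```python
import numpy as np
import pymc as pm   # in older installations: import pymc3 as pm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=200)     # simulated data, illustrative only

with pm.Model():
    # Same Normal-Gamma model as in the mean field example; hyperparameters are assumptions.
    tau = pm.Gamma("tau", alpha=1.0, beta=1.0)
    mu = pm.Normal("mu", mu=0.0, tau=1.0 * tau)  # lambda_0 = 1
    pm.Normal("y_obs", mu=mu, tau=tau, observed=y)

    # Only the likelihood and the prior were specified above; ADVI does the rest.
    approx = pm.fit(n=20_000, method="advi")
    posterior = approx.sample(1_000)             # draws from the fitted q
```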

26 / 31

Page 27: ST451 - Lent term Bayesian Machine Learning

Transformation to R^p

The first step is to transform all the θi components to the real line, using log or logit transformations where needed. Hence we transform from the parameter space Θ to R^p via the function T(·).

We can then define ζ := T(θ) and the joint density (likelihood times prior) can be written as

\[
\pi(y, \theta) = \pi\!\left(y, T^{-1}(\zeta)\right)\,\left|\det J_{T^{-1}}(\zeta)\right|
\]

Given that ζ is defined in R^p, we can assign a Normal distribution to it. The default option is to assume p independent Normals.

\[
q(\zeta \mid y, \phi) = \prod_{i=1}^{p} \mathrm{N}(\zeta_i \mid \mu_i, \sigma_i^2)
\]

Note that the corresponding q(θ|y) is not necessarily Normal.

27 / 31

Page 28: ST451 - Lent term Bayesian Machine Learning

Optimisation

Numerical optimisation can be used. It is essential to calculate the gradient of ELBO(φ) to obtain good performance. We can write

\[
\mathrm{ELBO}(\phi) = \int q(\theta \mid y, \phi)\,\log\frac{f(y \mid \theta)\,\pi(\theta)}{q(\theta \mid y, \phi)}\,d\theta
= \int q(\theta \mid y, \phi)\,\log f(y \mid \theta)\,d\theta - \int q(\theta \mid y, \phi)\,\log\frac{q(\theta \mid y, \phi)}{\pi(\theta)}\,d\theta
\]
\[
= \mathbb{E}_{q(\theta)}\left[\log f(y \mid \theta)\right] - \mathrm{KL}\left[q(\theta \mid y, \phi)\,\|\,\pi(\theta)\right]
\]

The second term in the expression above can be derived analytically (Kucukelbir et al 2016), so automatic differentiation can be used.

The first term is tricky because it doesn't have a closed form. Automatic differentiation can only be used for terms inside the expectation.

28 / 31

Page 29: ST451 - Lent term Bayesian Machine Learning

Reparameterisation

To see this, note that we want to calculate

\[
\nabla_\phi\,\mathbb{E}_{q(\phi)}\left[\log f(y \mid \theta)\right]
= \nabla_\phi\,\mathbb{E}_{q(\phi)}\left[\log\!\left( f\!\left(y \mid T^{-1}(\zeta)\right)\left|\det J_{T^{-1}}(\zeta)\right| \right)\right],
\]

i.e. an expectation whose distribution depends on φ.

This problem can be addressed by (further) reparameterisation, i.e. by standardising the ζ's

\[
\eta_i = \frac{\zeta_i - \mu_i}{\sigma_i}, \qquad i = 1, \dots, p.
\]

Now we can write (for the corresponding transformation Tφ from θ to η)

\[
\nabla_\phi\,\mathbb{E}_{q(\phi)}\left[\log f(y \mid \theta)\right]
= \nabla_\phi\,\mathbb{E}_{q(\eta)}\left[\log\!\left( f\!\left(y \mid T_\phi^{-1}(\eta)\right)\left|\det J_{T_\phi^{-1}}(\eta)\right| \right)\right],
\]

and we can see that φ now appears only inside the expectation.
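A small numerical sketch of this reparameterised gradient for a single Gaussian component; the model yi ~ N(θ, 1) with q(θ) = N(µ, σ²) and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=100)
n, mu, sigma = len(y), 0.5, 0.8

eta = rng.standard_normal(50_000)        # eta ~ N(0, 1): its distribution is free of phi = (mu, sigma)
theta = mu + sigma * eta                 # theta = mu + sigma * eta, so phi now sits inside the expectation
score = y.sum() - n * theta              # d/dtheta log f(y | theta) for the Normal likelihood

grad_mu = np.mean(score * 1.0)           # chain rule: dtheta/dmu = 1
grad_sigma = np.mean(score * eta)        # chain rule: dtheta/dsigma = eta

# For this toy model the exact gradients are n*(ybar - mu) and -n*sigma
print(grad_mu, n * (y.mean() - mu))
print(grad_sigma, -n * sigma)
```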

29 / 31

Page 30: ST451 - Lent term Bayesian Machine Learning

Stochastic Gradient Descent (SGD)

The following trick allows quick and automatic estimates of the gradient of the ELBO using Monte Carlo and automatic differentiation.

It can be used in state-of-the-art algorithms for big models and big data (e.g. deep networks), such as stochastic gradient descent.

In such contexts there is no need to calculate the gradient from the entire data, but only from a small batch.
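A brief sketch of the mini-batch idea (all values are illustrative assumptions): the batch contribution to the likelihood term is rescaled by N/|B|, which keeps the gradient estimate unbiased while touching only a fraction of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=10_000)   # toy data: y_i ~ N(theta, 1), q(theta) = N(mu, sigma^2)
N, batch_size = len(y), 100
mu, sigma = 0.5, 0.8

batch = rng.choice(y, size=batch_size, replace=False)   # small random batch
eta = rng.standard_normal(1_000)
theta = mu + sigma * eta                                 # reparameterised draws from q

# Rescale the batch score by N/|B| so it estimates the full-data term
score_batch = (N / batch_size) * (batch.sum() - batch_size * theta)
grad_mu_hat = np.mean(score_batch)                       # noisy but unbiased gradient wrt mu
print(grad_mu_hat, N * (y.mean() - mu))                  # compare with the full-data value
```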

At the moment, in deep learning, this is the only scalable method for Bayesian inference.

30 / 31

Page 31: ST451 - Lent term Bayesian Machine Learning

Today’s lecture - Reading

Bishop: 10.1, 10.3, 10.6.

Murphy: 21.1, 21.2, 21.3.1, 21.5.

Kucukelbir A., Tran D., Ranganath R., Gelman A., Blei D.M. (2016). Automatic Differentiation Variational Inference. Available on arXiv.

31 / 31