GAUSSIAN PROCESS REGRESSION
CSE 515T, Spring 2015

Transcript of the lecture slides on Gaussian process regression.


1. BACKGROUND: The kernel trick again...


The Kernel Trick

Consider again the linear regression model:

$$y(x) = \phi(x)^\top w + \varepsilon,$$

with prior $p(w) = \mathcal{N}(w; 0, \Sigma)$.

The kernel trick is to define the function

$$K(x, x') = \phi(x)^\top \Sigma\, \phi(x'),$$

which allows us to. . .


The Kernel Trick

...given training data $\mathcal{D}$ and test inputs $X_*$, we can write the predictive distribution in this nice form:

$$p(y_* \mid X_*, \mathcal{D}, \sigma^2) = \mathcal{N}\bigl(y_*; \mu_{y_*\mid\mathcal{D}}, K_{y_*\mid\mathcal{D}}\bigr),$$

where

$$\mu_{y_*\mid\mathcal{D}} = K_*^\top (K + \sigma^2 I)^{-1} y,$$
$$K_{y_*\mid\mathcal{D}} = K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*,$$

and we have defined

$$K = K(X, X), \qquad K_* = K(X, X_*), \qquad K_{**} = K(X_*, X_*).$$

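To make these expressions concrete, here is a minimal NumPy sketch of the predictive computation. It is not from the lecture; the function and argument names are my own, and it assumes the kernel matrices have already been computed.

```python
import numpy as np

def predict(K, K_star, K_star_star, y, sigma2):
    """Predictive distribution p(y* | X*, D, sigma^2) from precomputed kernel matrices.

    K           : (n, n) train-train matrix K(X, X)
    K_star      : (n, m) train-test matrix  K(X, X*)
    K_star_star : (m, m) test-test matrix   K(X*, X*)
    y           : (n,)   observed targets
    sigma2      : float  observation-noise variance
    """
    A = K + sigma2 * np.eye(len(y))                            # K + sigma^2 I
    mu = K_star.T @ np.linalg.solve(A, y)                      # K*^T (K + sigma^2 I)^{-1} y
    cov = K_star_star - K_star.T @ np.linalg.solve(A, K_star)  # K** - K*^T (K + sigma^2 I)^{-1} K*
    return mu, cov
```

Using a linear solve rather than an explicit matrix inverse is the usual numerically sensible choice here.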

The Kernel Trick

This does more than just make the expressions pretty. In particular, it is often easier (or cheaper) to calculate $K$ directly rather than explicitly compute $\phi(x)$ and take the dot product.

Example: “all subsets” kernel.

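To illustrate this point, one standard presentation of the "all subsets" kernel (taking $\Sigma = I$) uses a feature map with one product feature per subset of the $d$ input dimensions, i.e. $2^d$ features, yet the kernel collapses to $K(x, z) = \prod_{i=1}^d (1 + x_i z_i)$, which costs only $O(d)$ to evaluate. The sketch below is my own illustration (not from the lecture) checking that the two computations agree on a small example.

```python
import numpy as np
from itertools import combinations

def all_subsets_features(x):
    """Explicit feature map: one product feature per subset of the d dimensions (2^d features)."""
    d = len(x)
    return np.array([np.prod(x[list(idx)]) for r in range(d + 1)
                     for idx in combinations(range(d), r)])

def all_subsets_kernel(x, z):
    """The same inner product computed directly in O(d) time."""
    return np.prod(1.0 + x * z)

x, z = np.random.randn(8), np.random.randn(8)
explicit = all_subsets_features(x) @ all_subsets_features(z)   # 2^8 = 256 features each
direct = all_subsets_kernel(x, z)                              # 8 multiplications
assert np.isclose(explicit, direct)
```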

The Kernel Trick

Idea: completely abandon the procedure of deriving explicit feature expansions and simply define (positive-definite) kernel functions $K$ directly!


The Kernel Trick

Maybe we could skip the entire procedure of thinking about $w$, which we never explicitly use here...

$$p(y_* \mid X_*, \mathcal{D}, \sigma^2) = \mathcal{N}\bigl(y_*; \mu_{y_*\mid\mathcal{D}}, K_{y_*\mid\mathcal{D}}\bigr),$$

where

$$\mu_{y_*\mid\mathcal{D}} = K_*^\top (K + \sigma^2 I)^{-1} y,$$
$$K_{y_*\mid\mathcal{D}} = K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*.$$


2. GAUSSIAN PROCESSES: A reimagining of Bayesian regression


For more information

http://www.gaussianprocess.org/

(Also code!)


Regression

Consider the general regression problem. Here we have:

• an input domain $\mathcal{X}$ (for example, $\mathbb{R}^n$, but in general anything),

• an unknown function $f\colon \mathcal{X} \to \mathbb{R}$, and

• (perhaps noisy) observations of the function: $\mathcal{D} = \{(x_i, y_i)\}$, where $y_i = f(x_i) + \varepsilon_i$.

Our goal is to predict the values of the function $f(X_*)$ at some test locations $X_*$.


Gaussian processes

• Gaussian processes take a nonparametric approach to regression. We select a prior distribution over the function $f$ and condition this distribution on our observations, using the posterior distribution to make predictions.

• Gaussian processes are very powerful and leverage the many convenient properties of the Gaussian distribution to enable tractable inference.


From the Gaussian distribution to GPs

• How can we leverage these useful properties of the Gaussian distribution to approach the regression problem? We have a problem: the latent function $f$ is usually infinite-dimensional; however, the multivariate Gaussian distribution is only useful in finite dimensions.

• The Gaussian process is a natural generalization of the multivariate Gaussian distribution to potentially infinite settings.


GPs: Definition

Definition (GPs). A Gaussian process is a (potentially infinite) collection of random variables such that the joint distribution of any finite number of them is multivariate Gaussian.


GPs: Notation

A Gaussian process distribution on f is written

$$p(f) = \mathcal{GP}(f; \mu, K),$$

and, just like the multivariate Gaussian distribution, it is parameterized by its first two moments (now functions):

• $\mathbb{E}[f] = \mu\colon \mathcal{X} \to \mathbb{R}$, the mean function, and

• $\mathbb{E}\bigl[\bigl(f(x) - \mu(x)\bigr)\bigl(f(x') - \mu(x')\bigr)\bigr] = K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, a positive-semidefinite covariance function or kernel.


GPs: Mean and covariance functions

• The mean function encodes the central tendency of the function, and is often assumed to be a constant (usually zero).

• The covariance function encodes information about the shape and structure we expect the function to have. A simple and very common example is the squared exponential covariance:

$$K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x'\rVert^2\bigr),$$

which encodes the notion that "nearby points should have similar function values."


GPs: Prior on finite sets

Suppose we have selected a GP prior $\mathcal{GP}(f; \mu, K)$ for the function $f$. Consider a finite set of points $X \subseteq \mathcal{X}$. The GP prior on $f$, by definition, implies the following joint distribution on the associated function values $f = f(X)$:

$$p(f \mid X) = \mathcal{N}\bigl(f; \mu(X), K(X, X)\bigr).$$

That is, we simply evaluate the mean and covariance functions at $X$ and take the associated multivariate Gaussian distribution. Very simple!

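A minimal NumPy sketch of this construction (my own names, not from the lecture), using a zero mean and the squared exponential covariance from the previous slide: evaluate $\mu$ and $K$ on a finite grid of points and sample from the resulting multivariate Gaussian.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lam=1.0, ell=1.0):
    """Squared exponential covariance: lam^2 * exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return lam**2 * np.exp(-0.5 * d2 / ell**2)

# Evaluate the GP prior on a finite set of points and draw a few samples.
X = np.linspace(0, 10, 200)[:, None]
mu = np.zeros(len(X))                       # zero mean function
K = sq_exp_kernel(X, X)
jitter = 1e-10 * np.eye(len(X))             # tiny jitter for numerical stability
samples = np.random.multivariate_normal(mu, K + jitter, size=3)
```

Plotting each row of `samples` against `X` reproduces pictures like those on the next few slides.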

Prior: Sampling examples

[Figure: mean function $\mu(x)$, $\pm 2\sigma$ band, and samples drawn from the GP prior with $K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x'\rVert^2\bigr)$.]


Prior: Sampling examples

[Figure: mean function $\mu(x)$, $\pm 2\sigma$ band, and samples drawn from the GP prior with $K(x, x') = \lambda^2 \exp\bigl(-\lVert x - x'\rVert^2 / (2\ell^2)\bigr)$, $\lambda = 1/2$, $\ell = 2$.]


Prior: Sampling examples

[Figure: mean function $\mu(x)$, $\pm 2\sigma$ band, and samples drawn from the GP prior with $K(x, x') = \exp\bigl(-\lVert x - x'\rVert\bigr)$.]


From the prior to the posterior

So far, I've only told you how to construct prior distributions over the function $f$. How do we condition our prior on some observations $\mathcal{D} = (X, f)$ to make predictions about the value of $f$ at some points $X_*$?


From the prior to the posterior

We begin by writing the joint distribution between the training function values $f(X) = f$ and the test function values $f(X_*) = f_*$:

$$p(f, f_*) = \mathcal{N}\!\left(\begin{bmatrix} f \\ f_* \end{bmatrix};\; \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix},\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \ldots$$


From the prior to the posterior

...we then condition this multivariate Gaussian on the known training values $f$. We already know how to do that!

$$p(f_* \mid X_*, \mathcal{D}) = \mathcal{N}\bigl(f_*; \mu_{f\mid\mathcal{D}}(X_*), K_{f\mid\mathcal{D}}(X_*, X_*)\bigr),$$

where (writing $K = K(X, X)$)

$$\mu_{f\mid\mathcal{D}}(x) = \mu(x) + K(x, X)\, K^{-1} \bigl(f - \mu(X)\bigr),$$
$$K_{f\mid\mathcal{D}}(x, x') = K(x, x') - K(x, X)\, K^{-1} K(X, x').$$

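Here is a minimal sketch of this conditioning step (my own names, not from the lecture). `kernel` is any function returning $K(A, B)$, such as the squared exponential defined earlier; in practice a small jitter on $K(X, X)$ may be needed before solving.

```python
import numpy as np

def gp_posterior(X, f, X_star, kernel, mean=lambda Z: np.zeros(len(Z))):
    """Noise-free GP conditioning: posterior mean and covariance at test points X_star."""
    K = kernel(X, X)
    K_s = kernel(X, X_star)
    K_ss = kernel(X_star, X_star)
    mu_post = mean(X_star) + K_s.T @ np.linalg.solve(K, f - mean(X))   # mu(x) + K(x,X) K^{-1} (f - mu(X))
    cov_post = K_ss - K_s.T @ np.linalg.solve(K, K_s)                  # K(x,x') - K(x,X) K^{-1} K(X,x')
    return mu_post, cov_post
```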

From the prior to the posterior

Notice that the functions $\mu_{f\mid\mathcal{D}}$ and $K_{f\mid\mathcal{D}}$ are valid mean and covariance functions, respectively. That means the previous slide is telling us the posterior distribution over $f$ is a Gaussian process!


The posterior mean

One way to understand the posterior mean function $\mu_{f\mid\mathcal{D}}$ is as a correction to the prior mean consisting of a weighted combination of kernel functions, one for each training data point:

$$\mu_{f\mid\mathcal{D}}(x) = \mu(x) + K(x, X)\bigl(K(X, X)\bigr)^{-1}\bigl(f - \mu(X)\bigr) = \mu(x) + \sum_{i=1}^N \alpha_i\, K(x_i, x),$$

where $\alpha_i$ is the $i$th entry of the weight vector $\alpha = \bigl(K(X, X)\bigr)^{-1}\bigl(f - \mu(X)\bigr)$.

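In code, the weights $\alpha$ come from a single linear solve. The short sketch below (hypothetical data, my own names) checks that the weighted-sum form matches the matrix form from the previous slide, with a zero prior mean for simplicity.

```python
import numpy as np

def sq_exp(X1, X2):
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2)

X = np.random.rand(5, 1) * 10                      # hypothetical noise-free training inputs
f = np.sin(X).ravel()                              # training function values (mu = 0)
x_star = np.array([[3.7]])                         # a single test point

alpha = np.linalg.solve(sq_exp(X, X), f)           # alpha = K(X, X)^{-1} (f - mu(X))
mu_weighted = sq_exp(X, x_star).ravel() @ alpha    # mu(x) + sum_i alpha_i K(x_i, x)
mu_matrix = (sq_exp(x_star, X) @ np.linalg.solve(sq_exp(X, X), f)).item()
assert np.isclose(mu_weighted, mu_matrix)
```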

Prior

[Figure: the GP prior mean $\mu(x)$, $\pm 2\sigma$ band, and samples under $K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x'\rVert^2\bigr)$, shown for comparison with the posterior on the following slides.]


Posterior example

[Figure: observations, posterior mean $\mu(x)$, and $\pm 2\sigma$ band after conditioning on noise-free observations.]


Posterior: Sampling

[Figure: observations, posterior mean $\mu(x)$, $\pm 2\sigma$ band, and samples drawn from the GP posterior.]


Dealing with noise

So far, we have assumed we can observe the function $f$ exactly, which is uncommon in regression settings. How do we deal with observation noise?

tl;dr: the same way we did with Bayesian linear regression!


Dealing with noise

We must create a model for our observations given the latent function. To begin, we will choose the simple iid, zero-mean, additive Gaussian noise model:

$$y(x) = f(x) + \varepsilon, \qquad p(\varepsilon \mid x) = \mathcal{N}(\varepsilon; 0, \sigma^2);$$

combined, we have

$$p(y \mid f) = \mathcal{N}(y; f, \sigma^2 I).$$


Noisy posterior

To derive the posterior given noisy observations $\mathcal{D}$, we again write the joint distribution, now between the training observations $y$ and the test function values $f_*$:

$$p(y, f_*) = \mathcal{N}\!\left(\begin{bmatrix} y \\ f_* \end{bmatrix};\; \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix},\; \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \ldots \tag{1}$$


Noisy posterior

. . . and condition as before.

$$p(f_* \mid X_*, \mathcal{D}) = \mathcal{N}\bigl(f_*; \mu_{f\mid\mathcal{D}}(X_*), K_{f\mid\mathcal{D}}(X_*, X_*)\bigr),$$

where

$$\mu_{f\mid\mathcal{D}}(x) = \mu(x) + K(x, X)\bigl(K(X, X) + \sigma^2 I\bigr)^{-1}\bigl(y - \mu(X)\bigr),$$
$$K_{f\mid\mathcal{D}}(x, x') = K(x, x') - K(x, X)\bigl(K(X, X) + \sigma^2 I\bigr)^{-1} K(X, x').$$

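A sketch of the noisy-posterior computation (my own names, hypothetical data, not from the lecture); note that the only change from the noise-free case is the $\sigma^2 I$ added to the train-train covariance.

```python
import numpy as np

def sq_exp(X1, X2, lam=1.0, ell=1.0):
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return lam**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior_noisy(X, y, X_star, sigma2, mean=lambda Z: np.zeros(len(Z))):
    """Posterior over f at X_star given noisy observations y = f(X) + eps."""
    V = sq_exp(X, X) + sigma2 * np.eye(len(y))               # K(X, X) + sigma^2 I
    K_s = sq_exp(X, X_star)
    mu = mean(X_star) + K_s.T @ np.linalg.solve(V, y - mean(X))
    cov = sq_exp(X_star, X_star) - K_s.T @ np.linalg.solve(V, K_s)
    return mu, cov

# Hypothetical usage in the spirit of the plots: noisy observations with sigma = 0.1.
X = np.random.rand(20, 1) * 10
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
X_star = np.linspace(0, 10, 100)[:, None]
mu, cov = gp_posterior_noisy(X, y, X_star, sigma2=0.1**2)
band = 2 * np.sqrt(np.diag(cov))                             # the +/- 2 sigma band on f
```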

Noisy posterior: Sampling

[Figure: noisy observations, posterior mean $\mu(x)$, $\pm 2\sigma$ band, and posterior samples with noise standard deviation $\sigma = 0.1$.]


Noisy posterior: Sampling

[Figure: noisy observations, posterior mean $\mu(x)$, $\pm 2\sigma$ band, and posterior samples with noise standard deviation $\sigma = 0.5$.]


3. HYPERPARAMETERS: We're not done yet...


Hyperparameters

• So far, we have assumed that the Gaussian process prior distribution on $f$ has been specified a priori.

• But this prior distribution itself has parameters, for example the length scale $\ell$, the output scale $\lambda$, and the noise variance $\sigma^2$. As parameters of a prior distribution, we call these hyperparameters.

• For convenience, we will write $\theta$ to denote the vector of all hyperparameters of the model (including those of $\mu$ and $K$).

• How do we learn θ?


Marginal likelihood

Assume we have chosen a parameterized prior

$$p(f \mid \theta) = \mathcal{GP}\bigl(f; \mu(x; \theta), K(x, x'; \theta)\bigr).$$

We will measure the quality of the fit to our training data $\mathcal{D} = (X, y)$ with the marginal likelihood, the probability of observing the given data under our prior:

$$p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, df,$$

where we have marginalized the unknown function values $f$ (hence, marginal likelihood).


Marginal likelihood: Evaluating

Thankfully, this is an integral we can do analytically under the Gaussian noise assumption!

$$p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, df$$
$$= \int \underbrace{\mathcal{N}(y; f, \sigma^2 I)}_{\text{iid noise}}\; \underbrace{\mathcal{N}\bigl(f; \mu(X; \theta), K(X, X; \theta)\bigr)}_{\text{GP prior}}\, df$$
$$= \mathcal{N}\bigl(y; \mu(X; \theta), K(X, X; \theta) + \sigma^2 I\bigr).$$

(The convolution of two Gaussians is again Gaussian.)


Marginal likelihood: Evaluating

The log-likelihood of our data under the chosen prior is then (writing $V = K(X, X; \theta) + \sigma^2 I$):

$$\log p(y \mid X, \theta) = \underbrace{-\,\frac{(y - \mu)^\top V^{-1} (y - \mu)}{2}}_{\text{data fit}}\; \underbrace{-\,\frac{\log \det V}{2}}_{\text{Occam's razor}}\; -\; \frac{N \log 2\pi}{2}.$$

The first term is large when the data fit the model well, and the second term is large when the volume of the prior covariance is small; that is, when the model is simpler.

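A common numerically stable way to evaluate this expression uses a Cholesky factorization of $V$; the sketch below is my own (names and interface are assumptions, not from the lecture), with the three terms from the slide labeled in the comments.

```python
import numpy as np

def log_marginal_likelihood(y, mu, V):
    """log p(y | X, theta) with V = K(X, X; theta) + sigma^2 I and prior mean mu = mu(X; theta)."""
    N = len(y)
    L = np.linalg.cholesky(V)                      # V = L L^T
    r = np.linalg.solve(L, y - mu)
    data_fit = -0.5 * r @ r                        # -(y - mu)^T V^{-1} (y - mu) / 2
    complexity = -np.sum(np.log(np.diag(L)))       # -log det(V) / 2   ("Occam's razor")
    constant = -0.5 * N * np.log(2 * np.pi)
    return data_fit + complexity + constant
```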

Hyperparameters: Example

[Figure: GP posterior fit with hyperparameters $\theta = (\lambda, \ell, \sigma) = (1, 1, 1/5)$; $\log p(y \mid X, \theta) = -27.6$.]


Hyperparameters: Example

[Figure: GP posterior fit with hyperparameters $\theta = (\lambda, \ell, \sigma) = (2, 1/3, 1/20)$; $\log p(y \mid X, \theta) = -46.5$.]


Hyperparameters are important

Comparing the marginal likelihoods, we see that the observed data are over 100 million times more likely to have been generated by the first model than by the second! Clearly, hyperparameters can be quite important.
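Concretely, the ratio of the two marginal likelihoods is

$$\frac{p(y \mid X, \theta_1)}{p(y \mid X, \theta_2)} = \exp\bigl(-27.6 - (-46.5)\bigr) = \exp(18.9) \approx 1.6 \times 10^8.$$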


Can we marginalize hyperparameters?

To be fully Bayesian, we would choose a hyperprior $p(\theta)$ over $\theta$ and marginalize the unknown hyperparameters when making predictions:

$$p(f_* \mid x_*, \mathcal{D}) = \frac{\int p(f_* \mid x_*, \mathcal{D}, \theta)\, p(y \mid X, \theta)\, p(\theta)\, d\theta}{\int p(y \mid X, \theta)\, p(\theta)\, d\theta}.$$

Unfortunately, this integral cannot be resolved analytically.

(Of course. . . )


Maximum likelihood-II

• Instead, if we believe the posterior distribution over $\theta$ to be well-concentrated (for example, if we have many training examples), we may approximate $p(\theta \mid \mathcal{D})$ with a delta distribution at the point with the maximum marginal likelihood:

$$\theta_{\text{MLE}} = \operatorname*{arg\,max}_{\theta}\; p(y \mid X, \theta).$$

This is called maximum likelihood-II (ML-II) inference.

• This effectively gives the approximation

$$p(f_* \mid x_*, \mathcal{D}) \approx p(f_* \mid x_*, \mathcal{D}, \theta_{\text{MLE}}).$$

• How can we find $\theta_{\text{MLE}}$? (One common numerical approach is sketched below.)

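One common approach is to minimize the negative log marginal likelihood numerically, for example with `scipy.optimize.minimize`. The sketch below is an illustration under my own assumptions (hypothetical data, zero prior mean, squared exponential covariance with output scale $\lambda$, length scale $\ell$, and noise $\sigma$, optimized in log space to keep them positive); it is not the course's reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sq_exp(X1, X2, lam, ell):
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return lam**2 * np.exp(-0.5 * d2 / ell**2)

def neg_log_marginal_likelihood(log_theta, X, y):
    """Negative log p(y | X, theta) for theta = (lambda, ell, sigma), zero prior mean."""
    lam, ell, sigma = np.exp(log_theta)                    # work in log space for positivity
    V = sq_exp(X, X, lam, ell) + sigma**2 * np.eye(len(y))
    L = np.linalg.cholesky(V + 1e-9 * np.eye(len(y)))      # jitter for numerical stability
    r = np.linalg.solve(L, y)
    return 0.5 * r @ r + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

# Hypothetical data; BFGS with numerical gradients is sufficient at this scale.
X = np.random.rand(30, 1) * 10
y = np.sin(X).ravel() + 0.2 * np.random.randn(30)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.2]), args=(X, y))
lam_mle, ell_mle, sigma_mle = np.exp(result.x)
```

In practice one would supply analytic gradients of the log marginal likelihood (or use a GP library), but the idea is the same: $\theta_{\text{MLE}}$ is found by standard numerical optimization.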