Bayesian Interpretations of Regularization — mit.edu9.520/spring09/Classes/class15-bayes.pdf

Transcript
Page 1

Bayesian Interpretations of Regularization

Charlie Frogner

9.520 Class 15

April 1, 2009

Page 2

The Plan

Regularized least squares maps {(x_i, y_i)}_{i=1}^n to a function that minimizes the regularized loss:

f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2

Can we justify Tikhonov regularization from a probabilistic point of view?

Page 3

The Plan

Bayesian estimation basics

Bayesian interpretation of ERM

Bayesian interpretation of linear RLS

Bayesian interpretation of kernel RLS

Transductive model

Infinite dimensions = weird

Page 4

Some notation

S = {(x_i, y_i)}_{i=1}^n is the set of observed input/output pairs in R^d × R (the training set).

X and Y denote the matrices [x_1, . . . , x_n]^T ∈ R^{n×d} and [y_1, . . . , y_n]^T ∈ R^n, respectively.

θ is a vector of parameters in R^p.

p(Y|X, θ) is the joint distribution over outputs Y given inputs X and the parameters.

Page 5

Estimation

The setup:

A model: relates observed quantities (α) to an unobserved quantity (say β).

Want: an estimator – maps observed data α back to an estimate of the unobserved β.

Nothing new yet...

Estimator

β ∈ B is unobserved, α ∈ A is observed. An estimator for β is a function

β̂ : A → B

such that β̂(α) is an estimate of β given an observation α.

Page 6

Estimation

Tikhonov fits in the estimation framework.

f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2

Regression model:

y_i = f(x_i) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2 I)

Page 7

Bayesian Estimation

Difference: a Bayesian model specifies p(β, α), usually by a measurement model p(α|β) and a prior p(β).

Bayesian model

β is unobserved, α is observed.

p(β, α) = p(α|β) · p(β)

Page 8

ERM as a Maximum Likelihood Estimator

(Linear) Empirical risk minimization:

f_S(x) = x^T \theta_{ERM}(S), \qquad \theta_{ERM}(S) = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2

Measurement model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I)

X is fixed/non-random, θ is unknown.

Page 10

ERM as a Maximum Likelihood Estimator

Measurement model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I)

Want to estimate θ.

Can do this without defining a prior on θ.

Maximize the likelihood, i.e. the probability of the observations.

Likelihood

The likelihood of any fixed parameter vector θ is:

L(\theta|X) = p(Y \,|\, X, \theta)

Note: we always condition on X.

Page 11

ERM as a Maximum Likelihood Estimator

Measurement model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I)

Likelihood:

L(\theta|X) = N(Y;\, X\theta, \sigma_\varepsilon^2 I) \propto \exp\left(-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2\right)

The maximum likelihood estimator is ERM:

\arg\min_{\theta} \frac{1}{2}\|Y - X\theta\|^2
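As a sanity check, here is a minimal NumPy sketch (not from the slides; the synthetic data and the name sigma_eps are illustrative): maximizing this Gaussian likelihood is exactly solving ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the measurement model Y | X, theta ~ N(X theta, sigma_eps^2 I).
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma_eps = 0.1
Y = X @ theta_true + sigma_eps * rng.normal(size=n)

# Maximizing exp(-||Y - X theta||^2 / (2 sigma_eps^2)) over theta is the same
# as minimizing (1/2)||Y - X theta||^2, i.e. ordinary least squares / ERM.
theta_ml, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_ml - theta_true)  # small residual for small noise
```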

Page 13

Really?

\frac{1}{2}\|Y - X\theta\|^2

Page 14

Really?

e^{-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2}

Page 15

What about regularization?

Linear regularized least squares:

f_S(x) = x^T \theta, \qquad \theta_{RLS}(S) = \arg\min_{\theta} \frac{1}{2}\sum_{i=1}^{n}(y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2

Is there a model of Y and θ that yields linear RLS?

Yes.

e^{-\frac{1}{2\sigma_\varepsilon^2}\left(\sum_{i=1}^{n}(y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2\right)}

\propto \; p(Y|X, \theta) \cdot p(\theta)

Page 17

What about regularization?

Linear regularized least squares:

f_S(x) = x^T \theta, \qquad \theta_{RLS}(S) = \arg\min_{\theta} \frac{1}{2}\sum_{i=1}^{n}(y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2

Is there a model of Y and θ that yields linear RLS?

Yes.

e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^{n}(y_i - x_i^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\|\theta\|^2}

\propto \; p(Y|X, \theta) \cdot p(\theta)

Page 19

Linear RLS as a MAP estimator

Measurement model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I)

Add a prior:

\theta \sim N(0, I)

So σ_ε² = λ. How to estimate θ?

Page 20

The Bayesian method

Take p(Y|X, θ) and p(θ).

Apply Bayes' rule to get the posterior:

p(\theta|X, Y) = \frac{p(Y|X, \theta) \cdot p(\theta)}{p(Y|X)} = \frac{p(Y|X, \theta) \cdot p(\theta)}{\int p(Y|X, \theta)\, p(\theta)\, d\theta}

Use the posterior to estimate θ.

Page 21

Estimators that use the posterior

Bayes least squares estimator

The Bayes least squares estimator for θ given the observed Y is:

\theta_{BLS}(Y|X) = \mathbb{E}_{\theta|X,Y}[\theta]

i.e. the mean of the posterior.

Maximum a posteriori estimator

The MAP estimator for θ given the observed Y is:

\theta_{MAP}(Y|X) = \arg\max_{\theta}\, p(\theta|X, Y)

i.e. a mode of the posterior.

Page 23

Linear RLS as a MAP estimator

Model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

Joint over Y and θ:

\begin{bmatrix} Y|X \\ \theta \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{bmatrix} \right)

Condition on Y | X.

Page 24

Linear RLS as a MAP estimator

Model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

Posterior:

\theta \,|\, X, Y \sim N(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})

where

\mu_{\theta|X,Y} = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y

\Sigma_{\theta|X,Y} = I - X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} X

This is Gaussian, so

\theta_{MAP}(Y|X) = \theta_{BLS}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y

Page 26

Linear RLS as a MAP estimator

Model:

Y \,|\, X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

\theta_{MAP}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y

Recall the linear RLS solution:

\theta_{RLS}(Y|X) = \arg\min_{\theta} \frac{1}{2}\sum_{i=1}^{n}(y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2 = X^T (XX^T + \lambda I)^{-1} Y

These coincide when σ_ε² = λ. How do we write the estimated function?
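A quick numerical check of this identification (an illustrative NumPy sketch; lam stands in for λ = σ_ε²): the MAP/posterior-mean formula and the RLS normal equations give the same θ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
lam = 0.5  # plays the role of sigma_eps^2 under the prior theta ~ N(0, I)

# MAP / posterior mean: theta = X^T (X X^T + lam I)^{-1} Y  (an n x n solve)
theta_map = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

# RLS / ridge normal equations: theta = (X^T X + lam I)^{-1} X^T Y  (a d x d solve)
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

print(np.allclose(theta_map, theta_rls))  # True
```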

Page 28

What about Kernel RLS?

We can use basically the same trick to derive kernel RLS:

f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{2}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2

How?

Feature space: f(x) = \langle \theta, \phi(x) \rangle_F

e^{-\frac{1}{2}\left(\sum_{i=1}^{n}(y_i - \phi(x_i)^T\theta)^2 + \frac{\lambda}{2}\theta^T\theta\right)}

The feature space must be finite-dimensional.

Page 30

What about Kernel RLS?

We can use basically the same trick to derive kernel RLS:

f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{2}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2

How?

Feature space: f(x) = \langle \theta, \phi(x) \rangle_F

e^{-\frac{1}{2}\sum_{i=1}^{n}(y_i - \phi(x_i)^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}

The feature space must be finite-dimensional.

Page 32

More notation

φ(X) = [φ(x_1), . . . , φ(x_n)]^T

K(X, X) is the kernel matrix: [K(X, X)]_{ij} = K(x_i, x_j)

K(x, X) = [K(x, x_1), . . . , K(x, x_n)]

f(X) = [f(x_1), . . . , f(x_n)]^T

Page 33

What about Kernel RLS?

Model:

Y \,|\, X, \theta \sim N(\phi(X)\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

Then:

\theta_{MAP}(Y|X) = \phi(X)^T (\phi(X)\phi(X)^T + \sigma_\varepsilon^2 I)^{-1} Y

What is φ(X)φ(X)^T? It's K(X, X).
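To see φ(X)φ(X)^T = K(X, X) concretely, here is a small sketch (not from the slides) using the homogeneous quadratic kernel K(x, z) = (x^T z)², whose explicit feature map is φ(x) = vec(xx^T):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2: phi(x) = vec(x x^T).
    return np.outer(x, x).ravel()

Phi = np.vstack([phi(x) for x in X])  # phi(X): one feature vector per row
K = (X @ X.T) ** 2                    # kernel matrix K(X, X)

print(np.allclose(Phi @ Phi.T, K))    # True: phi(X) phi(X)^T = K(X, X)
```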

Page 35

What about Kernel RLS?

Model:

Y \,|\, X, \theta \sim N(\phi(X)\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

Then:

\theta_{MAP}(Y|X) = \phi(X)^T (K(X,X) + \sigma_\varepsilon^2 I)^{-1} Y

Estimated function?

f_{MAP}(x) = \phi(x)\,\theta_{MAP}(Y|X) = \phi(x)\phi(X)^T (K(X,X) + \sigma_\varepsilon^2 I)^{-1} Y = K(x,X)(K(X,X) + \lambda I)^{-1} Y = f_{RLS}(x)
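In code, φ never has to appear explicitly; a minimal kernel RLS sketch (the Gaussian kernel and data are illustrative choices, with lam standing in for σ_ε²):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||a_i - b_j||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
lam = 0.1  # sigma_eps^2 in the Bayesian reading

# f_MAP(x) = K(x, X) (K(X, X) + lam I)^{-1} Y = f_RLS(x)
c = np.linalg.solve(gaussian_kernel(X, X) + lam * np.eye(len(Y)), Y)
X_test = np.linspace(-3, 3, 5)[:, None]
print(gaussian_kernel(X_test, X) @ c)  # predictions at the test points
```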

Page 37

A prior over functions

Model:

Y \,|\, X, \theta \sim N(\phi(X)\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim N(0, I)

Can we write this as a prior on H_K?

Page 38

A prior over functions

Remember Mercer's theorem:

K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i) \psi_k(x_j)

where \nu_k \psi_k(\cdot) = \int K(\cdot, y)\, \psi_k(y)\, dy for all k. The functions {√ν_k ψ_k(·)} form an orthonormal basis for H_K.

Let φ(·) = [√ν_1 ψ_1(·), . . . , √ν_p ψ_p(·)]. Then:

\mathcal{H}_K = \{\theta^T \phi(\cdot) \,|\, \theta \in \mathbb{R}^p\}

Page 39

A prior over functions

We showed: when θ ∼ N(0, I),

f_{MAP}(\cdot) = \theta_{MAP}(Y|X)^T \phi(\cdot) = f_{RLS}(\cdot)

Taking φ(·) = [√ν_1 ψ_1, . . . , √ν_p ψ_p], this prior is equivalently:

f(\cdot) = \theta^T \phi(\cdot), \qquad \theta \sim N(0, I)

i.e. the functions in H_K are Gaussian distributed:

p(f) \propto \exp\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\left(-\frac{1}{2}\theta^T\theta\right)

Note: again we need H_K to be finite-dimensional.
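One way to visualize this prior (an illustrative sketch; the three-dimensional feature map below is a hypothetical stand-in for [√ν_1 ψ_1, . . . , √ν_p ψ_p]): draws θ ∼ N(0, I) induce random functions f = θ^T φ(·).

```python
import numpy as np

def phi(x):
    # Hypothetical finite-dimensional feature map; H_K is then spanned by
    # polynomials of degree <= 2 and K(x, z) = phi(x)^T phi(z).
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 200)
Phi = phi(x)  # shape (200, 3)

# Each draw theta ~ N(0, I) gives one sample from the prior
# p(f) ∝ exp(-||f||^2_{H_K} / 2), since ||f||^2_{H_K} = theta^T theta.
for _ in range(3):
    theta = rng.normal(size=3)
    f = Phi @ theta
    print(f[:5])  # first few values of one random function
```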

Page 40

A prior over functions

So:

p(f) \propto \exp\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) \;\Leftrightarrow\; \theta \sim N(0, I) \;\Rightarrow\; f_{MAP} = f_{RLS}

Assuming p(f) \propto \exp(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2),

p(f|X, Y) = \frac{p(Y|X, f) \cdot p(f)}{p(Y|X)} \propto \exp\left(-\frac{1}{2}\|Y - f(X)\|^2\right) \exp\left(-\frac{1}{2}\|f\|^2\right) = \exp\left(-\frac{1}{2}\|Y - f(X)\|^2 - \frac{1}{2}\|f\|^2\right)

Page 41

A quick recap

We wanted to know if RLS has a probabilistic interpretation.

Empirical risk minimization is ML:

p(Y|X, \theta) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2}

Linear RLS is MAP:

p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}

Kernel RLS is also MAP:

p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - \phi(X)\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}

Equivalent to a Gaussian prior on H_K:

p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - f(X)\|^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2}

But these don't work for infinite-dimensional function spaces...

Page 45

Transductive setting

We hinted at problems if dim H_K = ∞.

Idea: Forget about estimating θ (i.e. f).

Instead: Estimate the predicted outputs

Y^* = [y_1^*, . . . , y_M^*]^T

at test inputs

X^* = [x_1^*, . . . , x_M^*]^T

Need the joint distribution over Y^* and Y.

Page 46

Transductive setting

Say Y^* and Y are jointly Gaussian:

\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{bmatrix} \right)

Want: kernel RLS.

General form for the posterior:

Y^* \,|\, X, Y \sim N(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})

where

\mu_{Y^*|X,Y} = \mu_{Y^*} + \Lambda_{YY^*}^T \Lambda_Y^{-1} (Y - \mu_Y)

\Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}
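These are the standard Gaussian conditioning formulas; a generic sketch (the function and variable names are mine, not the slides'):

```python
import numpy as np

def condition(mu_Y, mu_s, Lam_Y, Lam_Ys, Lam_s, Y):
    """Posterior of Y* given Y for the joint Gaussian above.
    Lam_Y = Lambda_Y, Lam_Ys = Lambda_{YY*}, Lam_s = Lambda_{Y*}."""
    # Solve Lam_Y^{-1} [Y - mu_Y | Lam_Ys] in one call.
    A = np.linalg.solve(Lam_Y, np.column_stack([Y - mu_Y, Lam_Ys]))
    mu = mu_s + Lam_Ys.T @ A[:, 0]        # mu_{Y*} + Lam_{YY*}^T Lam_Y^{-1} (Y - mu_Y)
    Sigma = Lam_s - Lam_Ys.T @ A[:, 1:]   # Lam_{Y*} - Lam_{YY*}^T Lam_Y^{-1} Lam_{YY*}
    return mu, Sigma
```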

Page 48

Transductive setting

Set Λ_Y = K(X, X) + σ²I, Λ_{YY^*} = K(X, X^*), Λ_{Y^*} = K(X^*, X^*).

Posterior:

Y^* \,|\, X, Y \sim N(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})

where

\mu_{Y^*|X,Y} = \mu_{Y^*} + K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}(Y - \mu_Y)

\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma^2 I)^{-1} K(X, X^*)

So: Y^*_{MAP} = f_{RLS}(X^*).
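Putting it together for the zero-mean case (μ_Y = μ_{Y^*} = 0), a minimal Gaussian-process-regression sketch; the kernel choice and data are illustrative:

```python
import numpy as np

def k(A, B, gamma=1.0):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def gp_posterior(X, Y, X_star, sigma2, gamma=1.0):
    # Zero-mean specialization of the formulas above.
    L = k(X, X, gamma) + sigma2 * np.eye(len(Y))
    K_sX = k(X_star, X, gamma)
    mu = K_sX @ np.linalg.solve(L, Y)                              # = f_RLS(X*)
    Sigma = k(X_star, X_star, gamma) - K_sX @ np.linalg.solve(L, K_sX.T)
    return mu, Sigma

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(25, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=25)
mu, Sigma = gp_posterior(X, Y, np.linspace(-3, 3, 7)[:, None], sigma2=0.01)
print(mu)                       # posterior mean = kernel RLS predictions
print(np.sqrt(np.diag(Sigma)))  # pointwise posterior standard deviations
```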

Page 49

Transductive setting

Model:

\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X,X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)

MAP estimate (posterior mean) = RLS function at every point x^*, regardless of dim H_K.

Are the prior and posterior (on points!) consistent with a distribution on H_K?

Page 50

Transductive setting

Strictly speaking, θ and f don't come into play here at all:

Have: p(Y^*|X, Y)

Do not have: p(θ|X, Y) or p(f|X, Y)

But, if H_K is finite-dimensional, the joint over Y and Y^* is consistent with:

Y = f(X) + ε,

Y^* = f(X^*), and

f ∈ H_K is a random trajectory from a Gaussian process over the domain, with mean μ and covariance K.

(Ergo, people call this "Gaussian process regression.")
(Also "Kriging," because of a guy: Danie Krige.)

Page 52

Recap redux

Empirical risk minimization is the maximum likelihood estimator when:

y = x^T \theta + \varepsilon

Linear RLS is the MAP estimator when:

y = x^T \theta + \varepsilon, \qquad \theta \sim N(0, I)

Kernel RLS is the MAP estimator when:

y = \phi(x)^T \theta + \varepsilon, \qquad \theta \sim N(0, I)

in finite-dimensional H_K.

Kernel RLS is the MAP estimator at points when:

\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X,X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)

in possibly infinite-dimensional H_K.

Page 53

Is this useful in practice?

Want confidence intervals + believe the posteriors are meaningful = yes.

Maybe other reasons?

[Figure: "Gaussian Regression + Confidence Intervals" — the regression function with confidence bands and the observed points, plotted as Y against X.]

Page 54

What is going on with infinite-dimensional H_K?

Wrote down statements like: θ ∼ N(0, I) and f ∼ N(0, I).

The space H_K can be written

\mathcal{H}_K = \{f : \|f\|_{\mathcal{H}_K}^2 < \infty\} = \left\{\theta^T\phi(\cdot) : \sum_{i=1}^{\infty} \theta_i^2 < \infty\right\}

Difference between finite and infinite: not every θ yields a function θ^T φ(·) in H_K.

A hint: θ ∼ N(0, I) ⇒ E‖θ^T φ(·)‖²_{H_K} = ∞.

In fact: θ ∼ N(0, I) ⇒ θ^T φ(·) ∈ H_K with probability zero.

So be careful out there.

Page 57

A hint that things are amiss

Assume: θ ∼ N(0, I), {φ_i}_{i=1}^∞ an orthonormal basis in H_K.

E\|\theta^T\phi\|_{\mathcal{H}_K}^2 = E\left\|\sum_{i=1}^{\infty}\theta_i\phi_i\right\|_{\mathcal{H}_K}^2 = E\sum_{i=1}^{\infty}\theta_i^2 = \sum_{i=1}^{\infty} 1 = \infty
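Numerically (an illustrative sketch): the partial sums Σ θ_i² of i.i.d. squared standard normals grow linearly with the dimension, so the expected H_K-norm of a draw from N(0, I) diverges as p → ∞.

```python
import numpy as np

rng = np.random.default_rng(6)

# ||theta^T phi||^2_{H_K} = sum_i theta_i^2 for an orthonormal basis {phi_i}.
# With theta_i ~ N(0, 1) i.i.d., each term has mean 1, so the sum is ~ p.
for p in [10, 100, 1_000, 10_000]:
    theta = rng.normal(size=p)
    print(p, (theta ** 2).sum())
```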