Bayesian Linear Regression - University at Buffalo

Sargur Srihari, [email protected]
Linear Regression: Model Complexity M
• Polynomial regression
– Red lines are best fits with M = 0,1,3,9 and N=10
[Figure: polynomial fits; panels labeled "Poor representations of sin(2πx)" and "Best fit to sin(2πx)"]
y(x,w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
Maximum Likelihood Regression
• Input vector x, basis functions {φ_1(x),…, φ_M(x)}: the model is y(x,w) = w^T φ(x)
• Objective function: E_D(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2, with gradient contribution ∇E_n = −(t_n − w^{(τ)T} φ(x_n)) φ(x_n)
• Radial basis functions, e.g., Gaussian: φ_j(x) = exp{−(x − μ_j)^2 / (2s^2)}
• Maximum likelihood solution: w_ML = (Φ^T Φ)^{-1} Φ^T t, where Φ is the design matrix with Φ_{nj} = φ_j(x_n) and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudoinverse
• Regularized sum-of-squares error with N examples (λ is the regularization coefficient): E(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 + (λ/2) w^T w
• Regularized solution: w = (λI + Φ^T Φ)^{-1} Φ^T t
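As a quick illustration of both solutions, here is a minimal NumPy sketch (the polynomial basis, toy sin(2πx) data, and variable names are assumptions for illustration, not taken from the slides):

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (assumed for illustration)
rng = np.random.default_rng(0)
N, M = 10, 3                                     # number of examples, polynomial degree
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j
Phi = np.vander(x, M + 1, increasing=True)       # shape (N, M+1)

# Maximum likelihood solution: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Regularized solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3                                       # regularization coefficient lambda
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

print("w_ML :", w_ml)
print("w_reg:", w_reg)
```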
Shortcomings of MLE
• M.L.E. of the parameters w does not address model complexity M (how many basis functions?)
  – Complexity is effectively controlled by the data size N
  – More data allows a better fit without overfitting
• Regularization also controls overfitting (λ controls its effect)
  – But M and the choice of basis functions φ_j are still important
  – M can be determined by a holdout set, but that is wasteful of data
• Model complexity and over-fitting are better handled using a Bayesian approach
E(w) = E_D(w) + λ E_W(w)
Bayesian Linear Regression
• Using Bayes rule, the posterior is proportional to Likelihood × Prior: p(w|t) ∝ p(t|w) p(w)
  – where p(t|w) is the likelihood of the observed data
  – and p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – A likelihood p(t|w) that is a product of Gaussians based on the noise model
  – And conclude that the posterior is also Gaussian
Gaussian Prior Parameters
• Assume a multivariate Gaussian prior for w (which has components w0,…,w_{M−1})
p(w) = N (w|m0 , S0) with mean m0 and covariance matrix S0
If we choose S0 = α-1I it means that the variances of the weights are all equal to α-1 and covariances are zero
[Figure: prior p(w) over (w0, w1), with zero mean (m0 = 0) and isotropic covariance (the same variance for each weight)]
Likelihood of the Data is Gaussian
• Assume a noise precision parameter β
• Likelihood of t ={t1,..,tN} is then
–This is the probability of target data t given the parameters w and input X={x1,..,xN}
–Due to Gaussian noise, likelihood p(t |w) is also a Gaussian
p(t | X, w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1})
• Noise model: t = y(x,w) + ε, where ε is Gaussian noise, so that p(t|x,w,β) = N(t | y(x,w), β^{-1})
  – Note that the output t is a scalar
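A small sketch of this noise model and the resulting Gaussian log-likelihood; the straight-line basis φ(x) = (1, x), the weight values, and the toy data are illustrative assumptions:

```python
import numpy as np

def log_likelihood(t, Phi, w, beta):
    """ln p(t | X, w, beta) for the product-of-Gaussians likelihood."""
    resid = t - Phi @ w
    return 0.5 * len(t) * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(resid ** 2)

# Generate targets via t = y(x, w) + eps with Gaussian noise of precision beta
rng = np.random.default_rng(1)
w_true, beta = np.array([-0.3, 0.5]), 25.0
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x])      # phi(x) = (1, x)
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.size)

print(log_likelihood(t, Phi, w_true, beta))
```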
Posterior Distribution is also Gaussian
• Prior: p(w) = N(w | m0, S0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise
• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians:
  – If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian:
  – Let p(w) = N(w | µ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1})
  – Then the marginal is p(t) = N(t | Aµ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L(t − b) + Λµ}, Σ), where Σ = (Λ + A^T L A)^{-1}
Exact Form of the Posterior Distribution
• We have p(w) = N(w | m0, S0) and the Gaussian likelihood above
• The posterior is also Gaussian, written directly as p(w|t) = N(w | mN, SN)
  – where the mean of the posterior is mN = SN (S0^{-1} m0 + β Φ^T t), with Φ the design matrix
  – and the covariance of the posterior is given by SN^{-1} = S0^{-1} + β Φ^T Φ
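A minimal sketch of these update equations, assuming a general Gaussian prior N(w | m0, S0); the function name `posterior` and its arguments are illustrative, not from the slides (the same computation is reused in the later examples):

```python
import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Gaussian posterior p(w|t) = N(w | mN, SN) for Bayesian linear regression.

    SN^{-1} = S0^{-1} + beta * Phi^T Phi
    mN      = SN (S0^{-1} m0 + beta * Phi^T t)
    """
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN
```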
[Figure: prior and posterior in (w0, w1) weight space for scalar input x and y(x,w) = w0 + w1 x, with prior p(w|α) = N(w | 0, α^{-1} I)]
Properties of the Posterior
1. Since the posterior p(w|t) = N(w | mN, SN) is Gaussian, its mode coincides with its mean
   – Thus the maximum-posterior weight vector is wMAP = mN
2. For an infinitely broad prior S0 = α^{-1} I, i.e., precision α → 0, the mean mN reduces to the maximum likelihood solution wML = (Φ^T Φ)^{-1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, then the posterior at any stage acts as the prior distribution for the subsequent data points
Posterior with a Zero-Mean Isotropic Prior
• Zero-mean (m0 = 0) isotropic (same variances) Gaussian prior with a single precision parameter α: p(w|α) = N(w | 0, α^{-1} I)
• The corresponding posterior distribution is p(w|t) = N(w | mN, SN), where mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
• Note: β is the noise precision and α is the precision of the parameters w in the prior
[Figure: posterior over (w0, w1); with infinitely many samples the posterior collapses toward a point estimate]
Equivalence to MLE with Regularization
• Since p(w|α) = N(w | 0, α^{-1} I) and p(t|X,w,β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}), the log of the posterior is
  ln p(w|t) = −(β/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 − (α/2) w^T w + const
• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error with the addition of the quadratic regularization term w^T w, with λ = α/β
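The equivalence can be checked numerically. In this sketch (the toy straight-line data and the basis φ(x) = (1, x) are assumptions), the posterior mean mN coincides with the regularized least-squares solution with λ = α/β:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=30)
Phi = np.column_stack([np.ones_like(x), x])          # phi(x) = (1, x)
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, size=x.size)

# Posterior mean with zero-mean isotropic prior: mN = beta * SN * Phi^T t
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(mN, w_ridge))                      # True
```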
Bayesian Linear Regression Example (Straight-Line Fit)
• Single input variable x, single target variable t
• Goal is to fit the linear model y(x,w) = w0 + w1 x
• The goal of linear regression is to recover w = [w0, w1] given the samples
• Synthetic data are generated from f(x,w) = w0 + w1 x with w0 = −0.3 and w1 = 0.5
  – First choose xn from U(x | −1, 1), then evaluate f(xn, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target tn
  – Noise precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2
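A sketch of this example, generating data as described (β = 25, α = 2, true w = [−0.3, 0.5]) and updating the posterior sequentially so that each posterior acts as the prior for the next point; plotting of the figures is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0                 # prior precision, noise precision (std dev 0.2)
w_true = np.array([-0.3, 0.5])

def phi(x):
    return np.array([1.0, x])           # basis for y(x, w) = w0 + w1*x

# Start from the prior p(w | alpha) = N(w | 0, alpha^{-1} I)
mN = np.zeros(2)
SN = np.eye(2) / alpha

for n in range(20):
    xn = rng.uniform(-1.0, 1.0)
    tn = w_true @ phi(xn) + rng.normal(0.0, 0.2)

    # Sequential update: the current posterior acts as the prior for the new point
    S_prev_inv = np.linalg.inv(SN)
    SN = np.linalg.inv(S_prev_inv + beta * np.outer(phi(xn), phi(xn)))
    mN = SN @ (S_prev_inv @ mN + beta * phi(xn) * tn)

print("posterior mean after 20 points:", mN)   # approaches w_true as N grows
```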
Samples from the Prior/Posterior in Data Space
• Each sample of w corresponds to a straight line y(x,w) = w0 + w1 x in data space (modified as examples are observed)
[Figure: lines in data space drawn with no examples and with two examples]
Prior and Posterior Distributions in Parameter Space
• We look at the sequential update of the posterior as data points are observed
[Figure: rows correspond to no data points, the first data point (x1, t1), the second data point, and twenty data points. Columns show (left) the likelihood for the latest point alone (e.g., the 2nd or 20th point), where the band represents values of (w0, w1) corresponding to straight lines passing near that data point; (middle) the prior/posterior p(w) or p(w|t) in (w0, w1) space; (right) six samples (regression functions) y(x,w) with w drawn from the posterior. The white cross marks the true parameter value; with infinitely many points the posterior becomes a delta function centered at the true parameters. Each likelihood panel plots p(w|t) contributions for a single data point.]
Generalization of the Gaussian Prior
• The Gaussian prior over the parameters is p(w|α) = N(w | 0, α^{-1} I)
  – Maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error with a quadratic regularizer
• Other priors yield the Lasso and variations, e.g., p(w|α) ∝ exp(−(α/2) Σ_{j=1}^M |w_j|^q)
  – q = 2 corresponds to the Gaussian (q = 1 gives the Lasso)
  – Corresponds to minimization of the regularized error function
    E(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 + (λ/2) Σ_{j=1}^M |w_j|^q
Predictive Distribution
• Usually we are not interested in the value of w itself, but in predicting t for a new value of x: p(t | t, X, x), or p(t | t)
  – leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
  – Using the sum rule of probability, we can write p(t | t) = ∫ p(t | w) p(w | t) dw
Predictive Distribution with α, β, x, t
• We can predict t for a new value of x using
  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
  – with explicit dependence on the prior parameter α, the noise parameter β, and the targets t in the training set
  – Here p(t | x, w, β) = N(t | y(x,w), β^{-1}) is the conditional of the target t given the weights w, and p(w|t) = N(w | mN, SN) is the posterior over the weights, where mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
  – We have left out the conditioning variables X and x for convenience, and applied the sum rule of probability p(t) = ∫ p(t|w) p(w) dw
• The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian
  p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)),   where σN^2(x) = 1/β + φ(x)^T SN φ(x)
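A minimal sketch computing this predictive mean and variance; the Gaussian basis with nine centers, the bias term, and the toy sin(2πx) data are assumptions for illustration:

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    """Design matrix with a bias term and Gaussian basis functions."""
    x = np.atleast_1d(x)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), Phi])

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x_train = rng.uniform(0, 1, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=25)
Phi = gaussian_basis(x_train, centers)

SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

x_new = np.array([0.3])
phi_new = gaussian_basis(x_new, centers)[0]
mean = mN @ phi_new                                   # mN^T phi(x)
var = 1 / beta + phi_new @ SN @ phi_new               # sigma_N^2(x)
print(mean, var)
```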
Variance of the Predictive Distribution
• σN^2(x) = 1/β + φ(x)^T SN φ(x)
  – The first term, 1/β, represents the noise in the data
  – The second term, φ(x)^T SN φ(x), represents the uncertainty associated with the parameters w, where SN is the covariance of the posterior over w, SN^{-1} = α I + β Φ^T Φ
  – Since the noise process and the distribution of w are independent Gaussians, their variances are additive
• As the number of samples increases the posterior becomes narrower: σ_{N+1}^2(x) ≤ σ_N^2(x)
• As N → ∞, the second term of the variance goes to zero, and the variance of the predictive distribution arises solely from the additive noise parameter β
Mean of the Predictive Distribution
• Predictive distribution: p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)), where σN^2(x) = 1/β + φ(x)^T SN φ(x)
  – mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
  – α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t|x,w,β) = N(t | y(x,w), β^{-1}), with y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x)
[Figure: plot of p(t|x) fitted to one data point, showing the mean (red) and one standard deviation (pink)]
Predictive Distribution Variance
• σN(x), the standard deviation of t, is smallest in the neighborhood of the data points; uncertainty decreases as more data points are observed
  – where p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)) and p(t|x,w,β) = N(t | y(x,w), β^{-1})
[Figure: predictive distribution for N = 1, 2, 4, 25 data points; the shaded band shows one standard deviation from the mean]
• The plot only shows the point-wise predictive variance
  – To show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w)
Samples from the Posterior p(w|t)
• We assume a Gaussian prior over the parameters and use data from sin(2πx)
• Draw samples of w from the posterior p(w|t) = N(w | mN, SN) and plot the corresponding functions y(x,w) = w^T φ(x)
[Figure: sampled functions for N = 1, 2, 4, 25 data points]
• This shows the covariance between predictions at different values of x
  – For a given pair x, x′, the values y, y′ are determined by the kernel k(x, x′), which in turn is determined by the samples
Disadvantage of a Local Basis
• Predictive distribution, assuming a Gaussian prior and Gaussian noise t = y(x,w) + ε
  – where the noise is defined probabilistically as p(t|x,w,β) = N(t | y(x,w), β^{-1})
• With localized basis functions, e.g., Gaussian:
  – In regions away from the basis function centers, the contribution of the second term of the variance σN^2(x) goes to zero, leaving only the noise contribution β^{-1}
• The model becomes very confident outside the region occupied by the basis functions
  – This problem is avoided by the alternative Bayesian approach of Gaussian Processes
p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)),   where σN^2(x) = 1/β + φ(x)^T SN φ(x) and SN^{-1} = α I + β Φ^T Φ
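The effect can be checked numerically. In this sketch (a purely localized Gaussian basis with no bias term, and toy sin(2πx) data, both assumptions), the predictive variance far from the basis centers collapses to the noise level 1/β:

```python
import numpy as np

def gaussian_phi(x, centers, s=0.1):
    """Gaussian basis functions (no bias term), localized at the given centers."""
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x_train = rng.uniform(0, 1, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=25)
Phi = gaussian_phi(x_train, centers)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

def predictive_var(x):
    phi_x = gaussian_phi(x, centers)[0]
    return 1 / beta + phi_x @ SN @ phi_x

print(predictive_var(0.5))   # inside the data region: noise + parameter uncertainty
print(predictive_var(5.0))   # far from all centers: ~1/beta only -> overconfident
```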
Dealing with an Unknown β
• If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), which is given by a Gaussian-gamma distribution
  – In this case the predictive distribution is a Student's t-distribution
Mean of p(w|t) has a Kernel Interpretation
• The regression function is y(x,w) = Σ_j w_j φ_j(x) = w^T φ(x)
• If we take a Bayesian approach with Gaussian prior p(w) = N(w | m0, S0), then we have:
  – Posterior p(w|t) = N(w | mN, SN), where mN = SN (S0^{-1} m0 + β Φ^T t) and SN^{-1} = S0^{-1} + β Φ^T Φ
  – With a zero-mean isotropic prior p(w|α) = N(w | 0, α^{-1} I): mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
• The posterior mean β SN Φ^T t has a kernel interpretation
  – Sets the stage for kernel methods and Gaussian processes
Equivalent Kernel
• The posterior mean of w is mN = β SN Φ^T t, where SN^{-1} = S0^{-1} + β Φ^T Φ
  – S0 is the covariance matrix of the prior p(w), β is the noise precision parameter, and Φ is the design matrix that depends on the samples
• Substitute the mean value into the regression function: the mean of the predictive distribution at a point x is
  y(x, mN) = mN^T φ(x) = β φ(x)^T SN Φ^T t = Σ_{n=1}^N β φ(x)^T SN φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n
  – where k(x, x′) = β φ(x)^T SN φ(x′) is the equivalent kernel
• Thus the mean of the predictive distribution is a linear combination of the training set target variables t_n
  – Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in SN
Kernel Function
• Regression functions such as y(x, mN) = Σ_{n=1}^N k(x, x_n) t_n, which take a linear combination of the training set target values, are known as linear smoothers
• They depend on the input values x_n from the data set, since these appear in the definition of SN
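A sketch of the equivalent kernel and the linear-smoother form of the predictive mean, reusing the assumed toy Gaussian-basis setup; the helper `k` is illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

x_train = rng.uniform(0, 1, size=50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

# Equivalent kernel k(x, x') = beta * phi(x)^T SN phi(x')
def k(x, x_prime):
    return beta * (phi(x) @ SN @ phi(x_prime).T)[0, 0]

x0 = 0.4
# Linear smoother: the mean prediction is a weighted sum of training targets
smoother = sum(k(x0, xn) * tn for xn, tn in zip(x_train, t_train))
direct = mN @ phi(x0)[0]
print(np.isclose(smoother, direct))   # True: y(x, mN) = sum_n k(x, x_n) t_n
```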
Equivalent Kernel for a Gaussian Basis φ(x)
[Figure: plot of k(x, x′) = β φ(x)^T SN φ(x′) as a function of x and x′; it peaks when x = x′. For three values of x the behavior of k(x, x′) is shown as a slice; the kernels are localized around x. The data set used to generate the kernel was 200 values of x equally spaced in (−1, 1).]
Kernel Used Directly in Regression
• The mean of the predictive distribution is obtained by forming a weighted combination of the target values: y(x, mN) = Σ_{n=1}^N k(x, x_n) t_n, with k(x, x′) = β φ(x)^T SN φ(x′) and SN^{-1} = S0^{-1} + β Φ^T Φ
• Data points close to x are given higher weight than points further removed from x
• For a polynomial basis φ_j(x) = x^j the equivalent kernel is a localized function of x′, even though the corresponding basis function is nonlocal
[Figure: k(x, x′) plotted as a function of x′ for x = 0]
[Figure: equivalent kernel k(x, x′) = β φ(x)^T SN φ(x′) for another nonlocal basis; it is again a localized function of x′ even though the corresponding basis function φ is nonlocal]
Covariance Between Predictions
• cov[y(x), y(x′)] = cov[φ(x)^T w, w^T φ(x′)] = φ(x)^T SN φ(x′) = β^{-1} k(x, x′)
  – where we have used p(w|t) = N(w | mN, SN) and k(x, x′) = β φ(x)^T SN φ(x′)
• From the form of the equivalent kernel k(x, x′), the predictive means at nearby points y(x), y(x′) will be highly correlated; for more distant pairs the correlation is smaller
• The kernel captures the covariance: an important insight is that the value of the kernel function between two points is directly related to the covariance between their target values
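This can be verified by drawing samples of w from the posterior and comparing the empirical covariance of y(x), y(x′) with φ(x)^T SN φ(x′) = β^{-1} k(x, x′); the Gaussian-basis setup below is again an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(x - centers) ** 2 / (2 * s ** 2))   # phi(x) for scalar x

x_train = rng.uniform(0, 1, size=50)
Phi = np.array([phi(xn) for xn in x_train])
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

x1, x2 = 0.3, 0.35
# Analytic covariance: phi(x)^T SN phi(x') = beta^{-1} k(x, x')
analytic = phi(x1) @ SN @ phi(x2)

# Monte Carlo estimate from samples of w ~ p(w|t)
W = rng.multivariate_normal(mN, SN, size=200_000)
y1, y2 = W @ phi(x1), W @ phi(x2)
empirical = np.cov(y1, y2)[0, 1]
print(analytic, empirical)        # the two should agree closely
```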
Predictive Plot vs. Posterior Plots
• The predictive distribution p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)), with σN^2(x) = 1/β + φ(x)^T SN φ(x), allows us to visualize the pointwise uncertainty in the predictions, governed by σN^2(x)
• Drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x,w), we visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the kernel
Directly Specifying the Kernel Function
• The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression:
  – Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel,
  – Directly define kernel functions and use them to make predictions for a new input x, given the observation set
• This leads to a practical framework for regression (and classification) called Gaussian Processes
Summing Kernel Values over Samples
• The effective kernel defines the weights by which the target values are combined to make a prediction at x
• It can be shown that the weights sum to one, i.e., Σ_{n=1}^N k(x, x_n) = 1 for all values of x
  – This result can be proven intuitively: the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data in which t_n = 1 for all n
  – Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), we can fit the training data exactly, and hence ŷ(x) = 1
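A quick numerical check of this summation property under the stated conditions; the Gaussian basis plus a constant (bias) basis function and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)

def phi(x):
    x = np.atleast_1d(x)
    g = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), g])      # constant (bias) basis + Gaussians

x_train = rng.uniform(0, 1, size=50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

x0 = 0.5
k_vals = beta * (phi(x0) @ SN @ Phi.T).ravel()         # k(x0, x_n) for every training point
print(k_vals.sum())                                    # close to 1 (exact only in the stated limit)
```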
Kernel Function Properties
• The equivalent kernel can be positive or negative
  – Although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training set target variables
• The equivalent kernel satisfies an important property shared by kernel functions in general:
  – It can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions: k(x, z) = ψ(x)^T ψ(z), where ψ(x) = β^{1/2} SN^{1/2} φ(x)
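A short check of this inner-product form; taking the symmetric matrix square root of SN via an eigendecomposition is an implementation assumption, and the toy basis/data are as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(x - centers) ** 2 / (2 * s ** 2))

Phi = np.array([phi(xn) for xn in rng.uniform(0, 1, size=50)])
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

# Symmetric square root of SN via eigendecomposition: SN^{1/2}
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

psi = lambda x: np.sqrt(beta) * SN_half @ phi(x)       # psi(x) = beta^{1/2} SN^{1/2} phi(x)

x1, x2 = 0.2, 0.7
k_direct = beta * phi(x1) @ SN @ phi(x2)               # k(x, z) = beta phi(x)^T SN phi(z)
k_inner = psi(x1) @ psi(x2)                            # = psi(x)^T psi(z)
print(np.isclose(k_direct, k_inner))                   # True
```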