# Bayesian Linear Regression - University at Buffalo


## Linear Regression: model complexity M

• Polynomial regression

  y(x,w) = w0 + w1x + w2x² + … + wM x^M

– Red lines are best fits with M = 0, 1, 3, 9 and N = 10
– Low-order fits are poor representations of sin(2πx); an intermediate M gives the best fit to sin(2πx)

## Max Likelihood Regression

• Input vector x, basis functions {φ1(x),.., φM(x)}:

  y(x,w) = Σj wj φj(x) = w^T φ(x)

• Objective function (sum-of-squares error over N examples):

  E_D(w) = ½ Σ_{n=1}^N {tn − w^T φ(xn)}²

  with stochastic-gradient update ∇En = −(tn − w^{(τ)T} φ(xn)) φ(xn)

• Radial basis fns: φj(x) = exp(−(x − μj)² / 2s²)

• Maximum-likelihood solution: w_ML = (Φ^TΦ)⁻¹Φ^T t,
  where Φ is the design matrix with Φnj = φj(xn), and Φ† = (Φ^TΦ)⁻¹Φ^T is the Moore-Penrose pseudoinverse

• Regularized MSE with N examples (λ is the regularization coefficient):

  E(w) = ½ Σ_{n=1}^N {tn − w^T φ(xn)}² + (λ/2) w^T w

• Regularized solution is: w = (λI + Φ^TΦ)⁻¹Φ^T t
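As a concrete illustration (a minimal sketch, not from the slides), both solutions can be computed directly with NumPy; the Gaussian basis centers `mu` and width `s` below are arbitrary illustrative choices.

```python
import numpy as np

# Synthetic data: noisy samples of sin(2*pi*x), as in the slides' running example
rng = np.random.default_rng(0)
N = 10
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Gaussian radial basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
mu = np.linspace(0, 1, 9)      # basis centers (illustrative choice)
s = 0.1                        # basis width (illustrative choice)

def design_matrix(x):
    """Design matrix Phi with Phi[n, j] = phi_j(x_n); first column is the bias."""
    rbf = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), rbf])

Phi = design_matrix(x)

# Maximum-likelihood solution w_ML = (Phi^T Phi)^-1 Phi^T t, via the Moore-Penrose pseudoinverse
w_ml = np.linalg.pinv(Phi) @ t

# Regularized solution w = (lambda*I + Phi^T Phi)^-1 Phi^T t
lam = 1e-3
M = Phi.shape[1]
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```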

## Shortcomings of MLE

• M.L.E. of the parameters w does not address M (model complexity: how many basis functions?)
  – Overfitting is controlled by the data size N: more data allows a better fit without overfitting
  – Regularization also controls overfitting (λ controls its effect):

    E(w) = E_D(w) + λ E_W(w)

  – But M and the choice of basis functions φj are still important
  – M can be determined by a holdout set, but that is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach

## Bayesian Linear Regression

• Using Bayes' rule, the posterior is proportional to Likelihood × Prior:

  p(w|t) ∝ p(t|w) p(w)

  – where p(t|w) is the likelihood of the observed data
  – p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – A likelihood p(t|w) that is a product of Gaussians, based on the noise model
  – And conclude that the posterior is also Gaussian

## Gaussian Prior Parameters

Assume a multivariate Gaussian prior for w (which has components w0,..,wM−1):

  p(w) = N(w | m0, S0)  with mean m0 and covariance matrix S0

If we choose S0 = α⁻¹I, the variances of the weights are all equal to α⁻¹ and the covariances are zero.

[Figure: contours of p(w) in (w0, w1) space, with zero mean (m0 = 0) and isotropic over the weights (same variances)]

## Likelihood of Data is Gaussian

• Assume a noise precision parameter β, i.e., t = y(x,w) + ε, where ε is Gaussian noise so that p(t|x,w,β) = N(t | y(x,w), β⁻¹). Note that the output t is a scalar.
• The likelihood of t = {t1,..,tN} is then

  p(t | X, w, β) = ∏_{n=1}^N N(tn | w^T φ(xn), β⁻¹)

  – This is the probability of the target data t given the parameters w and inputs X = {x1,..,xN}
  – Due to the Gaussian noise, the likelihood p(t|w) is also Gaussian

## Posterior Distribution is also Gaussian

• Prior: p(w) = N(w | m0, S0), i.e., it is Gaussian
• Likelihood comes from Gaussian noise: p(t | X, w, β) = ∏n N(tn | w^T φ(xn), β⁻¹)
• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians:
  – If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian
  – Let p(w) = N(w | µ, Λ⁻¹) and p(t|w) = N(t | Aw + b, L⁻¹)
  – Then the marginal is p(t) = N(t | Aµ + b, L⁻¹ + AΛ⁻¹A^T), and
    the conditional is p(w|t) = N(w | Σ{A^T L(t − b) + Λµ}, Σ), where Σ = (Λ + A^T L A)⁻¹

## Exact form of Posterior Distribution

• We have p(w) = N(w | m0, S0) and the Gaussian likelihood above
• The posterior is also Gaussian, written directly as p(w|t) = N(w | mN, SN)
  – where mN is the mean of the posterior, given by mN = SN(S0⁻¹m0 + β Φ^T t), with Φ the design matrix
  – and SN is the covariance matrix of the posterior, given by SN⁻¹ = S0⁻¹ + β Φ^TΦ

[Figure: prior p(w|α) = N(w | 0, α⁻¹I) and posterior in weight space (w0, w1) for scalar input x and y(x,w) = w0 + w1x]
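The expressions for mN and SN follow by matching the linear-Gaussian result of the previous slide to the regression model; a short derivation, filling in the identification step that the slides leave implicit:

```latex
% Identify the general linear-Gaussian result with the regression model:
%   p(w)        = N(w \mid \mu, \Lambda^{-1})   with  \mu = m_0,\; \Lambda = S_0^{-1}
%   p(t \mid w) = N(t \mid A w + b, L^{-1})     with  A = \Phi,\; b = 0,\; L = \beta I
% The conditional p(w \mid t) = N(w \mid \Sigma\{A^T L(t-b) + \Lambda\mu\}, \Sigma) then gives
\begin{align}
  S_N^{-1} &= \Lambda + A^T L A = S_0^{-1} + \beta \Phi^T \Phi \\
  m_N      &= S_N\left(A^T L t + \Lambda m_0\right)
            = S_N\left(S_0^{-1} m_0 + \beta \Phi^T t\right)
\end{align}
```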


## Properties of Posterior

1. Since the posterior p(w|t) = N(w | mN, SN) is Gaussian, its mode coincides with its mean
   – Thus the maximum-posterior weight is wMAP = mN
2. For an infinitely broad prior S0 = α⁻¹I, i.e., precision α → 0, the mean mN reduces to the maximum-likelihood value, i.e., the mean is the solution vector

   wML = (Φ^TΦ)⁻¹Φ^T t

3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points


## Posterior with a Zero-Mean Isotropic Prior

• Consider a zero-mean (m0 = 0), isotropic (same variances) Gaussian prior with a single precision parameter α:

  p(w|α) = N(w | 0, α⁻¹I)

• The corresponding posterior distribution is p(w|t) = N(w | mN, SN),
  where mN = β SN Φ^T t and SN⁻¹ = αI + β Φ^TΦ
• Note: β is the noise precision and α is the prior precision of the parameters w (so the prior variance is α⁻¹)

[Figure: posterior in (w0, w1) space; with infinitely many samples it collapses to a point estimate]


## Equivalence to MLE with Regularization

• Since p(w|α) = N(w | 0, α⁻¹I) and p(t | X, w, β) = ∏n N(tn | w^T φ(xn), β⁻¹), we have p(w|t) ∝ p(t|w) p(w)
• The log of the posterior is

  ln p(w|t) = −(β/2) Σn {tn − w^T φ(xn)}² − (α/2) w^T w + const

• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error with the addition of the quadratic regularization term w^Tw, with λ = α/β
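A quick numerical check of this equivalence (a sketch with an arbitrary small dataset and illustrative values of α and β): the posterior mean mN should coincide with the regularized least-squares solution using λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 20, 2.0, 25.0            # prior precision alpha, noise precision beta
x = rng.uniform(-1, 1, N)
t = -0.3 + 0.5 * x + rng.normal(0, 1 / np.sqrt(beta), N)

Phi = np.column_stack([np.ones(N), x])    # design matrix for y = w0 + w1*x
M = Phi.shape[1]

# Posterior mean: m_N = beta * S_N * Phi^T t, with S_N^-1 = alpha*I + beta*Phi^T Phi
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

assert np.allclose(m_N, w_reg)            # MAP solution equals the regularized solution
```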


## Bayesian Linear Regression Example (Straight Line Fit)

• Single input variable x, single target variable t
• Goal is to fit the linear model y(x,w) = w0 + w1x
• Goal of linear regression is to recover w = [w0, w1] given the samples
• Synthetic data generated from f(x,w) with w0 = −0.3 and w1 = 0.5:
  – First choose xn from U(x | −1, 1), then evaluate f(xn, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target tn
  – Precision parameter β = (1/0.2)² = 25
• For the prior over w we choose α = 2
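A minimal sketch of this example in NumPy (the random seed and sample count are arbitrary choices): generate the synthetic data and compute the posterior p(w|t) = N(w | mN, SN) for the two-parameter linear model.

```python
import numpy as np

# True parameters and settings from the slides
a0, a1 = -0.3, 0.5             # true w0, w1
noise_sd = 0.2
beta = 1.0 / noise_sd ** 2      # noise precision = 25
alpha = 2.0                     # prior precision

rng = np.random.default_rng(42)
N = 20
x = rng.uniform(-1, 1, N)                        # x_n ~ U(-1, 1)
t = a0 + a1 * x + rng.normal(0, noise_sd, N)     # t_n = f(x_n, w) + Gaussian noise

# Design matrix for y(x, w) = w0 + w1*x
Phi = np.column_stack([np.ones(N), x])

# Posterior: S_N^-1 = alpha*I + beta*Phi^T Phi,  m_N = beta*S_N*Phi^T t
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

print("posterior mean:", m_N)   # should be close to [-0.3, 0.5]
print("posterior cov:\n", S_N)
```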


• Each sample of w drawn from the prior or posterior represents a straight line y(x,w) = w0 + w1x in data space, modified as examples are observed

[Figure: sampled lines in data space with no examples and with two examples, alongside the corresponding distributions in (w0, w1) parameter space]


## Prior and Posterior Distributions in Parameter Space

• We look at the sequential update of the posterior: p(w) combined with each observation gives p(w|t), which acts as the prior for the next data point

[Figure (rows: no data point, first data point (x1,t1), second data point, twenty data points; columns: likelihood for the current point alone, prior/posterior, data space):
 – Before any data points are observed, the prior is shown with six sample regression functions y(x,w), with w drawn from it
 – After the first data point (x,t) is observed, the likelihood band represents the values of (w0, w1) giving straight lines passing near that point; we are plotting p(w|t) for a single data point
 – The likelihoods for the 2nd and 20th points alone are shown similarly
 – The white cross marks the true parameter value; with infinitely many points the posterior becomes a delta function centered at the true parameters]
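The sequential behaviour described above can be sketched directly: update the posterior one observation at a time (the posterior after n points serves as the prior for point n+1) and draw sample lines from it. Plotting details are omitted; only the update and the sampling step are shown, with illustrative settings.

```python
import numpy as np

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(0)

def true_f(x):
    return -0.3 + 0.5 * x

# Start from the prior p(w) = N(0, alpha^-1 I)
m = np.zeros(2)
S_inv = alpha * np.eye(2)

for n in range(20):
    x_n = rng.uniform(-1, 1)
    t_n = true_f(x_n) + rng.normal(0, 0.2)
    phi = np.array([1.0, x_n])                    # phi(x_n) for y = w0 + w1*x
    # Sequential update: the current posterior acts as the prior for the next point
    S_inv_new = S_inv + beta * np.outer(phi, phi)
    m = np.linalg.solve(S_inv_new, S_inv @ m + beta * phi * t_n)
    S_inv = S_inv_new

print(m)   # posterior mean after 20 points, close to [-0.3, 0.5]

# Draw six sample regression lines y(x, w) with w from the current posterior
S = np.linalg.inv(S_inv)
w_samples = rng.multivariate_normal(m, S, size=6)
xs = np.linspace(-1, 1, 100)
lines = w_samples[:, 0:1] + w_samples[:, 1:2] * xs   # shape (6, 100), one line per sample
```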


## Generalization of the Gaussian Prior

• The Gaussian prior over the parameters is p(w|α) = N(w | 0, α⁻¹I) ∝ exp(−(α/2) w^T w)
  – Maximization of the log posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error with a quadratic regularizer
• A more general prior, p(w|α) ∝ exp(−(α/2) Σj |wj|^q), yields the Lasso (q = 1) and variations:
  – q = 2 corresponds to the Gaussian prior
  – It corresponds to minimization of the regularized error function

  E(w) = ½ Σ_{n=1}^N {tn − w^T φ(xn)}² + (λ/2) Σj |wj|^q


## Predictive Distribution

• Usually we are not interested in the value of w itself, but in predicting t for a new value of x:
  p(t | t, X, x), or p(t | t) when leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach (the sum rule of probability); we can now write the predictive distribution as an integral over w (below)


## Predictive Distribution with α, β, x, t

• We can predict t for a new value of x using

  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw

  – with explicit dependence on the prior parameter α, the noise parameter β, and the targets t in the training set
  – the conditional of the target t given the weights w is p(t | x, w, β) = N(t | y(x,w), β⁻¹)
  – the posterior of the weights w is p(w|t) = N(w | mN, SN), where mN = β SN Φ^T t and SN⁻¹ = αI + β Φ^TΦ
  – we have left out the conditioning variables X and x for convenience, and applied the sum rule of probability, p(t) = ∫ p(t|w) p(w) dw
• The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian

  p(t | x, t, α, β) = N(t | mN^T φ(x), σN²(x))   where   σN²(x) = 1/β + φ(x)^T SN φ(x)

## Variance of Predictive Distribution

  σN²(x) = 1/β + φ(x)^T SN φ(x),   with SN⁻¹ = αI + β Φ^TΦ

• The first term, 1/β, represents the noise in the data
• The second term represents the uncertainty associated with the parameters w, where SN is the covariance of the posterior p(w | t, α, β)
• Since the noise process and the distribution of w are independent Gaussians, their variances are additive
• As the number of samples increases, the posterior becomes narrower: σ²_{N+1}(x) ≤ σ²_N(x)
• As N → ∞, the second term of the variance goes to zero and the variance of the predictive distribution arises solely from the additive noise (governed by β)

## Mean of Predictive Distribution

• Predictive distribution:

  p(t | x, t, α, β) = N(t | mN^T φ(x), σN²(x))   where   σN²(x) = 1/β + φ(x)^T SN φ(x)

  – where mN = β SN Φ^T t and SN⁻¹ = αI + β Φ^TΦ
  – α and β come from the assumptions p(w|α) = N(w | 0, α⁻¹I) and p(t|x,w,β) = N(t | y(x,w), β⁻¹), with y(x,w) = Σ_{j=0}^{M−1} wj φj(x)

[Figure: plot of p(t|x) for one data point showing the mean (red) and one standard deviation (pink)]
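A sketch of computing this predictive mean and variance on the running straight-line example (same synthetic settings as before; the test inputs are arbitrary):

```python
import numpy as np

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(3)
N = 20
x = rng.uniform(-1, 1, N)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, N)

phi = lambda x: np.column_stack([np.ones(np.atleast_1d(x).shape), np.atleast_1d(x)])
Phi = phi(x)

S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution p(t | x*, t) = N(t | m_N^T phi(x*), sigma_N^2(x*))
x_star = np.linspace(-1, 1, 5)
Phi_star = phi(x_star)
pred_mean = Phi_star @ m_N
pred_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_star, S_N, Phi_star)  # 1/beta + phi^T S_N phi

print(np.c_[x_star, pred_mean, np.sqrt(pred_var)])   # input, mean, std dev
```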

## Predictive Distribution Variance

• σN(x), the standard deviation of t, is smallest in the neighborhood of the data points
• Uncertainty decreases as more data points are observed

[Figure: fits to sin(2πx) data for N = 1, 2, 4, 25, showing the predictive mean and a band of one standard deviation from the mean]

• The plot only shows the point-wise predictive variance; to show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w)

## Samples from the Posterior over Functions

• Predictive distribution p(t | x, t, α, β) = N(t | mN^T φ(x), σN²(x)), with noise model p(t|x,w,β) = N(t | y(x,w), β⁻¹), where we have assumed a Gaussian prior over the parameters
• Using data from sin(2πx), draw w from the posterior p(w|t) = N(w | mN, SN) and plot samples of y(x,w) = w^T φ(x)

[Figure: sampled functions y(x,w) for N = 1, 2, 4, 25 data points]

• This shows the covariance between predictions at different values of x
• For a given function, for a pair x, x′, the values y, y′ are determined by the kernel k(x, x′), which in turn is determined by the samples

## Disadvantage of Local Basis

• Predictive distribution, assuming a Gaussian prior and Gaussian noise t = y(x,w) + ε, where the noise is defined probabilistically as p(t|x,w,β) = N(t | y(x,w), β⁻¹):

  p(t | x, t, α, β) = N(t | mN^T φ(x), σN²(x)),   σN²(x) = 1/β + φ(x)^T SN φ(x),   SN⁻¹ = αI + β Φ^TΦ

• With localized basis functions, e.g., Gaussian:
  – in regions away from the basis function centers, the contribution of the second term of the variance σN²(x) goes to zero, leaving only the noise contribution β⁻¹
• The model therefore becomes very confident outside the region occupied by the basis functions
  – This problem is avoided by the alternative Bayesian approach of Gaussian Processes
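A small sketch of this effect (all settings below are illustrative): with Gaussian basis functions centered in [−1, 1], the predictive variance far outside that interval falls back to the noise floor 1/β, even though no data has been seen there.

```python
import numpy as np

alpha, beta = 2.0, 25.0
centers = np.linspace(-1, 1, 9)    # Gaussian basis centers (illustrative)
s = 0.2                            # basis width (illustrative)

def phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 4)          # deliberately few data points
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 4)
Phi = phi(x)

M = Phi.shape[1]
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

def pred_var(x_star):
    P = phi(x_star)
    return 1.0 / beta + np.einsum('ij,jk,ik->i', P, S_N, P)

print(pred_var(np.array([0.0])))   # inside the basis region: parameter uncertainty adds to 1/beta
print(pred_var(np.array([5.0])))   # far from all centers, phi(x) ~ 0: collapses to ~1/beta = 0.04
```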


## Dealing with Unknown β

• If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), which is given by a Gaussian-gamma distribution
  – In this case the predictive distribution is a Student's t-distribution

## Mean of p(w|t) has a Kernel Interpretation

• The regression function is y(x,w) = Σj wj φj(x) = w^T φ(x)
• If we take a Bayesian approach with Gaussian prior p(w) = N(w | m0, S0), then we have:
  – Posterior p(w|t) = N(w | mN, SN), where mN = SN(S0⁻¹m0 + β Φ^T t) and SN⁻¹ = S0⁻¹ + β Φ^TΦ
  – With a zero-mean isotropic prior p(w|α) = N(w | 0, α⁻¹I): mN = β SN Φ^T t, SN⁻¹ = αI + β Φ^TΦ
• The posterior mean β SN Φ^T t has a kernel interpretation
  – This sets the stage for kernel methods and Gaussian processes

• The posterior mean of w is mN = β SN Φ^T t
  – where SN⁻¹ = S0⁻¹ + β Φ^TΦ
  – S0 is the covariance matrix of the prior p(w), β is the noise parameter, and Φ is the design matrix that depends on the samples
• Substitute the mean value into the regression function y(x,w) = w^T φ(x)
• The mean of the predictive distribution at a point x is then

  y(x, mN) = mN^T φ(x) = β φ(x)^T SN Φ^T t = Σ_{n=1}^N β φ(x)^T SN φ(xn) tn = Σ_{n=1}^N k(x, xn) tn

  – where k(x, x′) = β φ(x)^T SN φ(x′) is the equivalent kernel
• Thus the mean of the predictive distribution is a linear combination of the training-set target variables tn
  – Note: the equivalent kernel depends on the input values xn from the dataset because they appear in SN
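A sketch checking this identity numerically (using the same illustrative Gaussian-basis setup as earlier): the predictive mean computed as mN^T φ(x) should equal the kernel-weighted sum Σn k(x, xn) tn.

```python
import numpy as np

alpha, beta = 2.0, 25.0
centers, s = np.linspace(-1, 1, 9), 0.2        # illustrative Gaussian basis

def phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
Phi = phi(x)

S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_star = np.array([0.3])
# Equivalent kernel k(x*, x_n) = beta * phi(x*)^T S_N phi(x_n), one weight per training point
k = beta * (phi(x_star) @ S_N @ Phi.T).ravel()

mean_direct = (phi(x_star) @ m_N).item()       # m_N^T phi(x*)
mean_kernel = k @ t                            # sum_n k(x*, x_n) t_n
assert np.isclose(mean_direct, mean_kernel)
```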

## Kernel Function

• Regression functions such as y(x, mN) = Σn k(x, xn) tn, which take a linear combination of the training-set target values, are known as linear smoothers
• They depend on the input values xn from the data set, since these appear in the definition of SN

[Figure: plot of the equivalent kernel k(x, x′) = β φ(x)^T SN φ(x′) as a function of x and x′; it peaks when x = x′]

• For three values of x, the behavior of k(x, x′) is shown as a slice
• Kernels are localized around x, i.e., they peak when x = x′
• The data set used to generate the kernel was 200 values of x equally spaced in (−1, 1)

## Equivalent Kernel for a Gaussian Basis φ(x)

• The kernel is used directly in regression: the mean of the predictive distribution is obtained by forming a weighted combination of the target values,

  y(x, mN) = Σn k(x, xn) tn,   with   k(x, x′) = β φ(x)^T SN φ(x′),   SN⁻¹ = S0⁻¹ + β Φ^TΦ

• Data points close to x are given higher weight than points further removed from x
• For a polynomial basis φj(x) = x^j, the kernel (plotted as a function of x′ for x = 0) is a localized function of x′ even though the corresponding basis function is nonlocal


## Covariance Between Predictions and the Equivalent Kernel

• k(x, x′) = β φ(x)^T SN φ(x′) is a localized function of x′ even when the corresponding basis functions are nonlocal
• The covariance between y(x) and y(x′) is

  cov[y(x), y(x′)] = cov[φ(x)^T w, w^T φ(x′)] = φ(x)^T SN φ(x′) = β⁻¹ k(x, x′)

  – where we have used p(w|t) = N(w | mN, SN) and k(x, x′) = β φ(x)^T SN φ(x′)
• From the form of the equivalent kernel k(x, x′), the predictive mean at nearby points y(x), y(x′) will be highly correlated; for more distant pairs the correlation is smaller. The kernel captures the covariance.
• An important insight: the value of the kernel function between two points is directly related to the covariance between their target values

## Predictive Plot vs. Posterior Plots

• The predictive distribution p(t | x, t, α, β) = N(t | mN^T φ(x), σN²(x)), with σN²(x) = 1/β + φ(x)^T SN φ(x),
  – allows us to visualize the pointwise uncertainty in the predictions, governed by σN²(x)
• Drawing samples from the posterior p(w|t)
  – and plotting the corresponding functions y(x,w), we visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the kernel

## Directly Specifying the Kernel Function

• The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression:
  – Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel,
  – directly define a kernel function and use it to make predictions for a new input x, given the observation set
• This leads to a practical framework for regression (and classification) called Gaussian Processes

## Summing Kernel Values Over Samples

• The effective kernel defines the weights by which the target values are combined to make a prediction at x. It can be shown that these weights sum to one, i.e.,

  Σ_{n=1}^N k(x, xn) = 1   for all values of x

• This result can be proven intuitively (see the sketch below): the summation is equivalent to considering the predictive mean y(x, mN) for a set of target data in which tn = 1 for all n
• Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), then we can fit the training data exactly, and hence y(x, mN) ≈ 1
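A quick numerical check of this summation property, reusing the illustrative Gaussian-basis setup but adding a constant bias basis function (the result is approximate for finite N, since α > 0 regularizes the fit):

```python
import numpy as np

alpha, beta = 2.0, 25.0
centers, s = np.linspace(-1, 1, 9), 0.2

def phi(x):
    """Basis with a constant bias function plus Gaussian RBFs."""
    x = np.atleast_1d(x)
    rbf = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), rbf])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)                       # N = 200 >> M, as in the kernel plots
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
Phi = phi(x)

S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

# Row of equivalent-kernel weights k(x*, x_n) for a test point x*
x_star = np.array([0.25])
k = beta * (phi(x_star) @ S_N @ Phi.T).ravel()

print(k.sum())   # close to 1 (approximately, because the prior slightly shrinks the fit)
```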

## Kernel Function Properties

• The equivalent kernel can be positive or negative
  – Although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training-set target variables
• The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector ψ(x) of nonlinear functions,

  k(x, z) = ψ(x)^T ψ(z)   where   ψ(x) = β^{1/2} SN^{1/2} φ(x)
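As a closing sketch (same illustrative basis setup; SN^{1/2} is taken as the symmetric matrix square root, which exists because SN is a covariance matrix), the inner-product form can be verified numerically:

```python
import numpy as np

alpha, beta = 2.0, 25.0
centers, s = np.linspace(-1, 1, 9), 0.2
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

# Symmetric square root of S_N via its eigendecomposition
evals, evecs = np.linalg.eigh(S_N)
S_N_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

psi = lambda x: np.sqrt(beta) * (phi(x) @ S_N_sqrt)       # psi(x) = beta^(1/2) S_N^(1/2) phi(x)

a, b = np.array([0.1]), np.array([0.4])
k_direct = beta * (phi(a) @ S_N @ phi(b).T).item()        # k(x, z) = beta * phi(x)^T S_N phi(z)
k_inner = (psi(a) @ psi(b).T).item()                      # psi(x)^T psi(z)
assert np.isclose(k_direct, k_inner)
```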
