Bayesian Linear Regression - University at Buffalo

Sargur Srihari, [email protected]
Linear Regression: Model Complexity M
• Polynomial regression
– Red lines are best fits with M = 0,1,3,9 and N=10
[Figure: polynomial fits; panels labeled "Poor representations of sin(2πx)" and "Best fit to sin(2πx)"]
y(x,w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
Maximum Likelihood Regression
• Input vector x, basis functions {φ_1(x),…, φ_M(x)}: the model is y(x,w) = w^T φ(x)
• Objective function: E_D(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2, with gradient contribution ∇E_n = −(t_n − w^{(τ)T} φ(x_n)) φ(x_n)
• Radial basis functions, e.g., Gaussian: φ_j(x) = exp{−(x − μ_j)^2 / (2s^2)}
• Maximum likelihood solution: w_ML = (Φ^T Φ)^{-1} Φ^T t, where Φ is the design matrix with Φ_{nj} = φ_j(x_n) and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudoinverse
• Regularized sum-of-squares error with N examples (λ is the regularization coefficient): E(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 + (λ/2) w^T w
• Regularized solution: w = (λI + Φ^T Φ)^{-1} Φ^T t
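As a quick illustration of both solutions, here is a minimal NumPy sketch (the polynomial basis, toy sin(2πx) data, and variable names are assumptions for illustration, not taken from the slides):

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (assumed for illustration)
rng = np.random.default_rng(0)
N, M = 10, 3                                     # number of examples, polynomial degree
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j
Phi = np.vander(x, M + 1, increasing=True)       # shape (N, M+1)

# Maximum likelihood solution: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Regularized solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3                                       # regularization coefficient lambda
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

print("w_ML :", w_ml)
print("w_reg:", w_reg)
```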
Shortcomings of MLE
• M.L.E. of the parameters w does not address model complexity M (how many basis functions?)
  – Complexity is effectively controlled by the data size N
  – More data allows a better fit without overfitting
• Regularization also controls overfitting (λ controls its effect)
  – But M and the choice of basis functions φ_j are still important
  – M can be determined by a holdout set, but that is wasteful of data
• Model complexity and over-fitting are better handled using a Bayesian approach
E(w) = E_D(w) + λ E_W(w)
Bayesian Linear Regression
• Using Bayes rule, the posterior is proportional to Likelihood × Prior: p(w|t) ∝ p(t|w) p(w)
  – where p(t|w) is the likelihood of the observed data
  – and p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – A likelihood p(t|w) that is a product of Gaussians based on the noise model
  – And conclude that the posterior is also Gaussian
Gaussian Prior Parameters
• Assume a multivariate Gaussian prior for w (which has components w0,…,w_{M−1})
p(w) = N (w|m0 , S0) with mean m0 and covariance matrix S0
If we choose S0 = α-1I it means that the variances of the weights are all equal to α-1 and covariances are zero
[Figure: prior p(w) over (w0, w1), with zero mean (m0 = 0) and isotropic covariance (the same variance for each weight)]
Likelihood of the Data is Gaussian
• Assume a noise precision parameter β
• Likelihood of t ={t1,..,tN} is then
–This is the probability of target data t given the parameters w and input X={x1,..,xN}
–Due to Gaussian noise, likelihood p(t |w) is also a Gaussian
p(t | X, w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1})
• Noise model: t = y(x,w) + ε, where ε is Gaussian noise, so that p(t|x,w,β) = N(t | y(x,w), β^{-1})
  – Note that the output t is a scalar
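A small sketch of this noise model and the resulting Gaussian log-likelihood; the straight-line basis φ(x) = (1, x), the weight values, and the toy data are illustrative assumptions:

```python
import numpy as np

def log_likelihood(t, Phi, w, beta):
    """ln p(t | X, w, beta) for the product-of-Gaussians likelihood."""
    resid = t - Phi @ w
    return 0.5 * len(t) * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(resid ** 2)

# Generate targets via t = y(x, w) + eps with Gaussian noise of precision beta
rng = np.random.default_rng(1)
w_true, beta = np.array([-0.3, 0.5]), 25.0
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x])      # phi(x) = (1, x)
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.size)

print(log_likelihood(t, Phi, w_true, beta))
```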
Posterior Distribution is also Gaussian
• Prior: p(w) = N(w | m0, S0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise
• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians:
  – If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian:
  – Let p(w) = N(w | µ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1})
  – Then the marginal is p(t) = N(t | Aµ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L(t − b) + Λµ}, Σ), where Σ = (Λ + A^T L A)^{-1}
Exact Form of the Posterior Distribution
• We have p(w) = N(w | m0, S0) and the Gaussian likelihood above
• The posterior is also Gaussian, written directly as p(w|t) = N(w | mN, SN)
  – where the mean of the posterior is mN = SN (S0^{-1} m0 + β Φ^T t), with Φ the design matrix
  – and the covariance of the posterior is given by SN^{-1} = S0^{-1} + β Φ^T Φ
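A minimal sketch of these update equations, assuming a general Gaussian prior N(w | m0, S0); the function name `posterior` and its arguments are illustrative, not from the slides (the same computation is reused in the later examples):

```python
import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Gaussian posterior p(w|t) = N(w | mN, SN) for Bayesian linear regression.

    SN^{-1} = S0^{-1} + beta * Phi^T Phi
    mN      = SN (S0^{-1} m0 + beta * Phi^T t)
    """
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN
```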
[Figure: prior and posterior in (w0, w1) weight space for scalar input x and y(x,w) = w0 + w1 x, with prior p(w|α) = N(w | 0, α^{-1} I)]
Properties of the Posterior
1. Since the posterior p(w|t) = N(w | mN, SN) is Gaussian, its mode coincides with its mean
   – Thus the maximum-posterior weight vector is wMAP = mN
2. For an infinitely broad prior S0 = α^{-1} I, i.e., precision α → 0, the mean mN reduces to the maximum likelihood solution wML = (Φ^T Φ)^{-1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, then the posterior at any stage acts as the prior distribution for the subsequent data points
Posterior with a Zero-Mean Isotropic Prior
• Zero-mean (m0 = 0) isotropic (same variances) Gaussian prior with a single precision parameter α: p(w|α) = N(w | 0, α^{-1} I)
• The corresponding posterior distribution is p(w|t) = N(w | mN, SN), where mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
• Note: β is the noise precision and α is the precision of the parameters w in the prior
[Figure: posterior over (w0, w1); with infinitely many samples the posterior collapses toward a point estimate]
Equivalence to MLE with Regularization
• Since p(w|α) = N(w | 0, α^{-1} I) and p(t|X,w,β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}), the log of the posterior is
  ln p(w|t) = −(β/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 − (α/2) w^T w + const
• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error with the addition of the quadratic regularization term w^T w, with λ = α/β
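The equivalence can be checked numerically. In this sketch (the toy straight-line data and the basis φ(x) = (1, x) are assumptions), the posterior mean mN coincides with the regularized least-squares solution with λ = α/β:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=30)
Phi = np.column_stack([np.ones_like(x), x])          # phi(x) = (1, x)
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(0, 0.2, size=x.size)

# Posterior mean with zero-mean isotropic prior: mN = beta * SN * Phi^T t
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(mN, w_ridge))                      # True
```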
Bayesian Linear Regression Example (Straight-Line Fit)
• Single input variable x, single target variable t
• Goal is to fit the linear model y(x,w) = w0 + w1 x
• The goal of linear regression is to recover w = [w0, w1] given the samples
• Synthetic data are generated from f(x,w) = w0 + w1 x with w0 = −0.3 and w1 = 0.5
  – First choose xn from U(x | −1, 1), then evaluate f(xn, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target tn
  – Noise precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2
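A sketch of this example, generating data as described (β = 25, α = 2, true w = [−0.3, 0.5]) and updating the posterior sequentially so that each posterior acts as the prior for the next point; plotting of the figures is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0                 # prior precision, noise precision (std dev 0.2)
w_true = np.array([-0.3, 0.5])

def phi(x):
    return np.array([1.0, x])           # basis for y(x, w) = w0 + w1*x

# Start from the prior p(w | alpha) = N(w | 0, alpha^{-1} I)
mN = np.zeros(2)
SN = np.eye(2) / alpha

for n in range(20):
    xn = rng.uniform(-1.0, 1.0)
    tn = w_true @ phi(xn) + rng.normal(0.0, 0.2)

    # Sequential update: the current posterior acts as the prior for the new point
    S_prev_inv = np.linalg.inv(SN)
    SN = np.linalg.inv(S_prev_inv + beta * np.outer(phi(xn), phi(xn)))
    mN = SN @ (S_prev_inv @ mN + beta * phi(xn) * tn)

print("posterior mean after 20 points:", mN)   # approaches w_true as N grows
```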
Samples from the Prior/Posterior in Data Space
• Each sample of w corresponds to a straight line y(x,w) = w0 + w1 x in data space (modified as examples are observed)
[Figure: lines in data space drawn with no examples and with two examples]
Prior and Posterior Distributions in Parameter Space
• We look at the sequential update of the posterior as data points are observed
[Figure: rows correspond to no data points, the first data point (x1, t1), the second data point, and twenty data points. Columns show (left) the likelihood for the latest point alone (e.g., the 2nd or 20th point), where the band represents values of (w0, w1) corresponding to straight lines passing near that data point; (middle) the prior/posterior p(w) or p(w|t) in (w0, w1) space; (right) six samples (regression functions) y(x,w) with w drawn from the posterior. The white cross marks the true parameter value; with infinitely many points the posterior becomes a delta function centered at the true parameters. Each likelihood panel plots p(w|t) contributions for a single data point.]
Generalization of the Gaussian Prior
• The Gaussian prior over the parameters is p(w|α) = N(w | 0, α^{-1} I)
  – Maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error with a quadratic regularizer
• Other priors yield the Lasso and variations, e.g., p(w|α) ∝ exp(−(α/2) Σ_{j=1}^M |w_j|^q)
  – q = 2 corresponds to the Gaussian (q = 1 gives the Lasso)
  – Corresponds to minimization of the regularized error function
    E(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 + (λ/2) Σ_{j=1}^M |w_j|^q
Predictive Distribution
• Usually we are not interested in the value of w itself, but in predicting t for a new value of x: p(t | t, X, x), or p(t | t)
  – leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
  – Using the sum rule of probability, we can write p(t | t) = ∫ p(t | w) p(w | t) dw
Predictive Distribution with α, β, x, t
• We can predict t for a new value of x using
  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
  – with explicit dependence on the prior parameter α, the noise parameter β, and the targets t in the training set
  – Here p(t | x, w, β) = N(t | y(x,w), β^{-1}) is the conditional of the target t given the weights w, and p(w|t) = N(w | mN, SN) is the posterior over the weights, where mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
  – We have left out the conditioning variables X and x for convenience, and applied the sum rule of probability p(t) = ∫ p(t|w) p(w) dw
• The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian
  p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)),   where σN^2(x) = 1/β + φ(x)^T SN φ(x)
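A minimal sketch computing this predictive mean and variance; the Gaussian basis with nine centers, the bias term, and the toy sin(2πx) data are assumptions for illustration:

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    """Design matrix with a bias term and Gaussian basis functions."""
    x = np.atleast_1d(x)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), Phi])

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x_train = rng.uniform(0, 1, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=25)
Phi = gaussian_basis(x_train, centers)

SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

x_new = np.array([0.3])
phi_new = gaussian_basis(x_new, centers)[0]
mean = mN @ phi_new                                   # mN^T phi(x)
var = 1 / beta + phi_new @ SN @ phi_new               # sigma_N^2(x)
print(mean, var)
```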
Variance of the Predictive Distribution
• σN^2(x) = 1/β + φ(x)^T SN φ(x)
  – The first term, 1/β, represents the noise in the data
  – The second term, φ(x)^T SN φ(x), represents the uncertainty associated with the parameters w, where SN is the covariance of the posterior over w, SN^{-1} = α I + β Φ^T Φ
  – Since the noise process and the distribution of w are independent Gaussians, their variances are additive
• As the number of samples increases the posterior becomes narrower: σ_{N+1}^2(x) ≤ σ_N^2(x)
• As N → ∞, the second term of the variance goes to zero, and the variance of the predictive distribution arises solely from the additive noise parameter β
Mean of the Predictive Distribution
• Predictive distribution: p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)), where σN^2(x) = 1/β + φ(x)^T SN φ(x)
  – mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
  – α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t|x,w,β) = N(t | y(x,w), β^{-1}), with y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x)
[Figure: plot of p(t|x) fitted to one data point, showing the mean (red) and one standard deviation (pink)]
Predictive Distribution Variance
• σN(x), the standard deviation of t, is smallest in the neighborhood of the data points; uncertainty decreases as more data points are observed
  – where p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)) and p(t|x,w,β) = N(t | y(x,w), β^{-1})
[Figure: predictive distribution for N = 1, 2, 4, 25 data points; the shaded band shows one standard deviation from the mean]
• The plot only shows the point-wise predictive variance
  – To show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w)
Samples from the Posterior p(w|t)
• We assume a Gaussian prior over the parameters and use data from sin(2πx)
• Draw samples of w from the posterior p(w|t) = N(w | mN, SN) and plot the corresponding functions y(x,w) = w^T φ(x)
[Figure: sampled functions for N = 1, 2, 4, 25 data points]
• This shows the covariance between predictions at different values of x
  – For a given pair x, x′, the values y, y′ are determined by the kernel k(x, x′), which in turn is determined by the samples
Disadvantage of a Local Basis
• Predictive distribution, assuming a Gaussian prior and Gaussian noise t = y(x,w) + ε
  – where the noise is defined probabilistically as p(t|x,w,β) = N(t | y(x,w), β^{-1})
• With localized basis functions, e.g., Gaussian:
  – In regions away from the basis function centers, the contribution of the second term of the variance σN^2(x) goes to zero, leaving only the noise contribution β^{-1}
• The model becomes very confident outside the region occupied by the basis functions
  – This problem is avoided by the alternative Bayesian approach of Gaussian Processes
p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)),   where σN^2(x) = 1/β + φ(x)^T SN φ(x) and SN^{-1} = α I + β Φ^T Φ
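The effect can be checked numerically. In this sketch (a purely localized Gaussian basis with no bias term, and toy sin(2πx) data, both assumptions), the predictive variance far from the basis centers collapses to the noise level 1/β:

```python
import numpy as np

def gaussian_phi(x, centers, s=0.1):
    """Gaussian basis functions (no bias term), localized at the given centers."""
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x_train = rng.uniform(0, 1, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=25)
Phi = gaussian_phi(x_train, centers)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

def predictive_var(x):
    phi_x = gaussian_phi(x, centers)[0]
    return 1 / beta + phi_x @ SN @ phi_x

print(predictive_var(0.5))   # inside the data region: noise + parameter uncertainty
print(predictive_var(5.0))   # far from all centers: ~1/beta only -> overconfident
```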
Dealing with an Unknown β
• If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), which is given by a Gaussian-gamma distribution
  – In this case the predictive distribution is a Student's t-distribution
Mean of p(w|t) has a Kernel Interpretation
• The regression function is y(x,w) = Σ_j w_j φ_j(x) = w^T φ(x)
• If we take a Bayesian approach with Gaussian prior p(w) = N(w | m0, S0), then we have:
  – Posterior p(w|t) = N(w | mN, SN), where mN = SN (S0^{-1} m0 + β Φ^T t) and SN^{-1} = S0^{-1} + β Φ^T Φ
  – With a zero-mean isotropic prior p(w|α) = N(w | 0, α^{-1} I): mN = β SN Φ^T t and SN^{-1} = α I + β Φ^T Φ
• The posterior mean β SN Φ^T t has a kernel interpretation
  – Sets the stage for kernel methods and Gaussian processes
Equivalent Kernel
• The posterior mean of w is mN = β SN Φ^T t, where SN^{-1} = S0^{-1} + β Φ^T Φ
  – S0 is the covariance matrix of the prior p(w), β is the noise precision parameter, and Φ is the design matrix that depends on the samples
• Substitute the mean value into the regression function: the mean of the predictive distribution at a point x is
  y(x, mN) = mN^T φ(x) = β φ(x)^T SN Φ^T t = Σ_{n=1}^N β φ(x)^T SN φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n
  – where k(x, x′) = β φ(x)^T SN φ(x′) is the equivalent kernel
• Thus the mean of the predictive distribution is a linear combination of the training set target variables t_n
  – Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in SN
Kernel Function
• Regression functions such as y(x, mN) = Σ_{n=1}^N k(x, x_n) t_n, which take a linear combination of the training set target values, are known as linear smoothers
• They depend on the input values x_n from the data set, since these appear in the definition of SN
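A sketch of the equivalent kernel and the linear-smoother form of the predictive mean, reusing the assumed toy Gaussian-basis setup; the helper `k` is illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

x_train = rng.uniform(0, 1, size=50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

# Equivalent kernel k(x, x') = beta * phi(x)^T SN phi(x')
def k(x, x_prime):
    return beta * (phi(x) @ SN @ phi(x_prime).T)[0, 0]

x0 = 0.4
# Linear smoother: the mean prediction is a weighted sum of training targets
smoother = sum(k(x0, xn) * tn for xn, tn in zip(x_train, t_train))
direct = mN @ phi(x0)[0]
print(np.isclose(smoother, direct))   # True: y(x, mN) = sum_n k(x, x_n) t_n
```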
Equivalent Kernel for a Gaussian Basis φ(x)
[Figure: plot of k(x, x′) = β φ(x)^T SN φ(x′) as a function of x and x′; it peaks when x = x′. For three values of x the behavior of k(x, x′) is shown as a slice; the kernels are localized around x. The data set used to generate the kernel was 200 values of x equally spaced in (−1, 1).]
Kernel Used Directly in Regression
• The mean of the predictive distribution is obtained by forming a weighted combination of the target values: y(x, mN) = Σ_{n=1}^N k(x, x_n) t_n, with k(x, x′) = β φ(x)^T SN φ(x′) and SN^{-1} = S0^{-1} + β Φ^T Φ
• Data points close to x are given higher weight than points further removed from x
• For a polynomial basis φ_j(x) = x^j the equivalent kernel is a localized function of x′, even though the corresponding basis function is nonlocal
[Figure: k(x, x′) plotted as a function of x′ for x = 0]
[Figure: equivalent kernel k(x, x′) = β φ(x)^T SN φ(x′) for another nonlocal basis; it is again a localized function of x′ even though the corresponding basis function φ is nonlocal]
Covariance Between Predictions
• cov[y(x), y(x′)] = cov[φ(x)^T w, w^T φ(x′)] = φ(x)^T SN φ(x′) = β^{-1} k(x, x′)
  – where we have used p(w|t) = N(w | mN, SN) and k(x, x′) = β φ(x)^T SN φ(x′)
• From the form of the equivalent kernel k(x, x′), the predictive means at nearby points y(x), y(x′) will be highly correlated; for more distant pairs the correlation is smaller
• The kernel captures the covariance: an important insight is that the value of the kernel function between two points is directly related to the covariance between their target values
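This can be verified by drawing samples of w from the posterior and comparing the empirical covariance of y(x), y(x′) with φ(x)^T SN φ(x′) = β^{-1} k(x, x′); the Gaussian-basis setup below is again an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(x - centers) ** 2 / (2 * s ** 2))   # phi(x) for scalar x

x_train = rng.uniform(0, 1, size=50)
Phi = np.array([phi(xn) for xn in x_train])
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t_train

x1, x2 = 0.3, 0.35
# Analytic covariance: phi(x)^T SN phi(x') = beta^{-1} k(x, x')
analytic = phi(x1) @ SN @ phi(x2)

# Monte Carlo estimate from samples of w ~ p(w|t)
W = rng.multivariate_normal(mN, SN, size=200_000)
y1, y2 = W @ phi(x1), W @ phi(x2)
empirical = np.cov(y1, y2)[0, 1]
print(analytic, empirical)        # the two should agree closely
```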
Predictive Plot vs. Posterior Plots
• The predictive distribution p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)), with σN^2(x) = 1/β + φ(x)^T SN φ(x), allows us to visualize the pointwise uncertainty in the predictions, governed by σN^2(x)
• Drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x,w), we visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the kernel
Directly Specifying the Kernel Function
• The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression:
  – Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel,
  – Directly define kernel functions and use them to make predictions for a new input x, given the observation set
• This leads to a practical framework for regression (and classification) called Gaussian Processes
Summing Kernel Values over Samples
• The effective kernel defines the weights by which the target values are combined to make a prediction at x
• It can be shown that the weights sum to one, i.e., Σ_{n=1}^N k(x, x_n) = 1 for all values of x
  – This result can be proven intuitively: the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data in which t_n = 1 for all n
  – Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), we can fit the training data exactly, and hence ŷ(x) = 1
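A quick numerical check of this summation property under the stated conditions; the Gaussian basis plus a constant (bias) basis function and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)

def phi(x):
    x = np.atleast_1d(x)
    g = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), g])      # constant (bias) basis + Gaussians

x_train = rng.uniform(0, 1, size=50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=50)
Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

x0 = 0.5
k_vals = beta * (phi(x0) @ SN @ Phi.T).ravel()         # k(x0, x_n) for every training point
print(k_vals.sum())                                    # close to 1 (exact only in the stated limit)
```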
Kernel Function Properties
• The equivalent kernel can be positive or negative
  – Although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training set target variables
• The equivalent kernel satisfies an important property shared by kernel functions in general:
  – It can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions: k(x, z) = ψ(x)^T ψ(z), where ψ(x) = β^{1/2} SN^{1/2} φ(x)
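A short check of this inner-product form; taking the symmetric matrix square root of SN via an eigendecomposition is an implementation assumption, and the toy basis/data are as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0, 1, 9)
phi = lambda x: np.exp(-(x - centers) ** 2 / (2 * s ** 2))

Phi = np.array([phi(xn) for xn in rng.uniform(0, 1, size=50)])
SN = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

# Symmetric square root of SN via eigendecomposition: SN^{1/2}
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

psi = lambda x: np.sqrt(beta) * SN_half @ phi(x)       # psi(x) = beta^{1/2} SN^{1/2} phi(x)

x1, x2 = 0.2, 0.7
k_direct = beta * phi(x1) @ SN @ phi(x2)               # k(x, z) = beta phi(x)^T SN phi(z)
k_inner = psi(x1) @ psi(x2)                            # = psi(x)^T psi(z)
print(np.isclose(k_direct, k_inner))                   # True
```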