Regression Estimation – Least Squares and Maximum Likelihood
Dr. Frank Wood
Least Squares Max(min)imization
• Function to minimize w.r.t. β0 and β1:

Q = \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2

• Minimize this by maximizing −Q
• Find partials and set both equal to zero

(go to board)
Normal Equations
• The result of this maximization step is a pair of equations called the normal equations; b0 and b1 are called point estimators of β0 and β1, respectively
• This is a system of two equations in two unknowns. The solution is given by…

\sum Y_i = n b_0 + b_1 \sum X_i

\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2

(Write these on board.)
Solution to Normal Equations
• After a lot of algebra one arrives at

b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}

b_0 = \bar{Y} - b_1 \bar{X}

\bar{X} = \frac{\sum X_i}{n}, \quad \bar{Y} = \frac{\sum Y_i}{n}
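As a sanity check, these formulas are easy to evaluate directly. A minimal Python/NumPy sketch (not from the slides; the data below are made up for illustration):

import numpy as np

# Hypothetical example data: any paired (X, Y) observations will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])

xbar, ybar = x.mean(), y.mean()

# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# b0 = Ybar - b1 * Xbar
b0 = ybar - b1 * xbar

print(b0, b1)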
Least Squares Fit
[Figure: scatter of data with two lines. Estimate: y = 2.09x + 8.36, MSE 4.15. True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]
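The plot can be reproduced in spirit with a short simulation; a sketch (assuming NumPy; the noise level and seed are made up, so the fitted coefficients and MSEs will differ from the slide's):

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
# Simulate from the slide's "true" line y = 2x + 9 plus Gaussian noise.
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=x.size)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

mse_fit = np.mean((y - (b0 + b1 * x)) ** 2)     # least squares line
mse_true = np.mean((y - (2.0 * x + 9.0)) ** 2)  # true line
print(mse_fit, mse_true)

On the training data the least squares line can never have larger MSE than the true line, which is why the slide shows 4.15 < 4.22.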
Guess #1
[Figure: the same data with a constant guess. Guess: y = 0x + 21.2, MSE 37.1. True: y = 2x + 9, MSE 4.22.]
Guess #2
[Figure: the same data with a closer guess. Guess: y = 1.5x + 13, MSE 7.84. True: y = 2x + 9, MSE 4.22.]
Looking Ahead: Matrix Least Squares
• The solution to this equation is the solution to least squares linear regression (and to maximum likelihood under the normal error distribution assumption)

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}
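In matrix form the fit is a single library call. A minimal sketch (assuming NumPy; same made-up data as before):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])

# Design matrix with rows [Xi, 1], matching the slide's parameter order [beta1, beta0].
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b1, b0 = beta
print(b0, b1)  # agrees with the normal-equation solution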
Questions to Ask
• Is the relationship really linear?
• What is the distribution of the “errors”?
• Is the fit good?
• How much of the variability of the response is accounted for by including the predictor variable?
• Is the chosen predictor variable the best one?
Is This Better?
[Figure: the same data fit with a 7th-order polynomial, MSE 3.18.]
Goals for First Half of Course
• How to do linear regression
– Self-familiarization with software tools
• How to interpret standard linear regression results
• How to derive tests
• How to assess and address deficiencies in regression models
Properties of Solution
• The ith residual is defined to be

e_i = Y_i - \hat{Y}_i

• The sum of the residuals is zero:

\sum_i e_i = \sum (Y_i - b_0 - b_1 X_i) = \sum Y_i - n b_0 - b_1 \sum X_i = 0

by the first normal equation.
Properties of Solution
• The sum of the observed values Y_i equals the sum of the fitted values \hat{Y}_i:

\sum_i \hat{Y}_i = \sum_i (b_1 X_i + b_0)
= \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X})
= b_1 \sum_i X_i + n\bar{Y} - b_1 n\bar{X}
= b_1 n\bar{X} + \sum_i Y_i - b_1 n\bar{X}
= \sum_i Y_i
Properties of Solution
• The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable in the ith trial:

\sum_i X_i e_i = \sum X_i (Y_i - b_0 - b_1 X_i) = \sum_i X_i Y_i - b_0 \sum X_i - b_1 \sum X_i^2 = 0

by the second normal equation.
Properties of Solution
• The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the fitted value of the response variable for the ith trial:

\sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i) e_i = b_0 \sum_i e_i + b_1 \sum_i e_i X_i = 0

by the previous properties.
Properties of Solution
• The regression line always goes through the point (\bar{X}, \bar{Y})
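All of these properties are easy to confirm numerically. A minimal sketch (assuming NumPy; same made-up data as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x
e = y - yhat

print(np.isclose(e.sum(), 0.0))           # residuals sum to zero
print(np.isclose(y.sum(), yhat.sum()))    # observed sum equals fitted sum
print(np.isclose((x * e).sum(), 0.0))     # X-weighted residuals sum to zero
print(np.isclose((yhat * e).sum(), 0.0))  # fitted-value-weighted residuals sum to zero
print(np.isclose(b0 + b1 * xbar, ybar))   # line passes through (Xbar, Ybar)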
Estimating Error Term Variance σ²
• Review estimation in non-regression setting.
• Show estimation results for regression setting.
Estimation Review
• An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample
• e.g. the sample mean

\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i
Point Estimators and Bias
• Point estimator: \hat{\theta} = f(\{Y_1, \ldots, Y_n\})
• Unknown quantity / parameter: \theta
• Definition: the bias of an estimator is

B(\hat{\theta}) = E(\hat{\theta}) - \theta
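Bias can be approximated by simulation: draw many samples, apply the estimator to each, and compare the average estimate to the true θ. A minimal sketch (assuming NumPy; the Normal(5, 0.75²) setup mirrors the example on the next slide):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 0.75, 20, 100_000

# Apply the estimator (the sample mean) to many independent samples.
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# B(theta_hat) = E(theta_hat) - theta, approximated by the Monte Carlo average.
print(estimates.mean() - mu)  # close to 0: the sample mean is unbiased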
One Sample Example
[Figure: density with µ = 5, σ = 0.75; the samples, the true θ, and the estimated θ are marked. (run bias_example_plot.m)]
Distribution of Estimator
• If the estimator is a function of the samples and the distribution of the samples is known, then the distribution of the estimator can (often) be determined
– Methods
• Distribution (CDF) functions
• Transformations
• Moment generating functions
• Jacobians (change of variable)
Example
• Samples from a Normal(µ, σ²) distribution
• Estimate the population mean

Y_i \sim \text{Normal}(\mu, \sigma^2), \quad \theta = \mu, \quad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i
Sampling Distribution of the Estimator
• First moment:

E(\hat{\theta}) = E\!\left(\frac{1}{n} \sum_{i=1}^n Y_i\right) = \frac{1}{n} \sum_{i=1}^n E(Y_i) = \frac{n\mu}{n} = \theta

• This is an example of an unbiased estimator:

B(\hat{\theta}) = E(\hat{\theta}) - \theta = 0
Variance of Estimator
• Definition: the variance of an estimator is

V(\hat{\theta}) = E([\hat{\theta} - E(\hat{\theta})]^2)

• Remember:

V(cY) = c^2 V(Y)

V\!\left(\sum_{i=1}^n Y_i\right) = \sum_{i=1}^n V(Y_i)

where the latter holds only if the Y_i are independent with finite variance.
Example Estimator Variance
• For the sample mean estimator of a Normal(µ, σ²) population:

V(\hat{\theta}) = V\!\left(\frac{1}{n} \sum_{i=1}^n Y_i\right) = \frac{1}{n^2} \sum_{i=1}^n V(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

• Note the assumptions (the Y_i are independent with finite variance)
Distribution of sample mean estimator
[Figure: histogram of the sample mean estimator over 1000 samples.]
Bias Variance Trade-off
• The mean squared error of an estimator is

MSE(\hat{\theta}) = E([\hat{\theta} - \theta]^2)

• It can be re-expressed as

MSE(\hat{\theta}) = V(\hat{\theta}) + (B(\hat{\theta}))^2
MSE = VAR + BIAS²
• Proof:

MSE(\hat{\theta}) = E((\hat{\theta} - \theta)^2)
= E(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2)
= E([\hat{\theta} - E(\hat{\theta})]^2) + 2E([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]) + E([E(\hat{\theta}) - \theta]^2)
= V(\hat{\theta}) + 2[E(\hat{\theta}) - \theta] \, E(\hat{\theta} - E(\hat{\theta})) + (B(\hat{\theta}))^2
= V(\hat{\theta}) + 2 \cdot 0 + (B(\hat{\theta}))^2
= V(\hat{\theta}) + (B(\hat{\theta}))^2

Here E(\hat{\theta}) - \theta is a constant, so it factors out of the cross term, and E(\hat{\theta} - E(\hat{\theta})) = 0.
Trade-off
• Think of variance as confidence and bias as correctness.
– Intuitions (largely) apply
• Sometimes a biased estimator can produce lower MSE if it lowers the variance.
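A concrete instance: when estimating a normal variance, the (biased) divisor n+1 gives a lower MSE than the unbiased divisor n−1. A simulation sketch (assuming NumPy; parameters made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Divisors: n-1 (unbiased), n (maximum likelihood), n+1 (minimum MSE for normal data).
for k in (n - 1, n, n + 1):
    est = ss / k
    print(k, np.mean((est - sigma2) ** 2))  # MSE shrinks as the divisor grows to n+1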
Estimating Error Term Variance σ²
• Regression model: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
• The variance of each observation Y_i is σ² (the same as for the error term εi)
• Each Y_i comes from a different probability distribution, with a mean that depends on the level X_i
• The deviation of an observation Y_i must therefore be calculated around its own estimated mean
s² Estimator for σ²
• MSE is an unbiased estimator of σ²
• The sum of squares SSE has n − 2 degrees of freedom associated with it

s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}

E(MSE) = \sigma^2
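Continuing the earlier sketch, s² is just the residual sum of squares divided by n − 2 (assuming NumPy; same made-up data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])
n = x.size
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

e = y - (b0 + b1 * x)  # residuals around each fitted mean
sse = np.sum(e ** 2)   # error sum of squares
s2 = sse / (n - 2)     # two degrees of freedom spent estimating b0 and b1
print(s2)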
Normal Error Regression Model
• No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1 that also have minimum variance among all unbiased linear estimators
• To set up interval estimates and make tests we need to specify the distribution of the εi
• We will assume that the εi are normally distributed
Normal Error Regression Model
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

• Y_i is the value of the response variable in the ith trial
• β0 and β1 are parameters
• X_i is a known constant, the value of the predictor variable in the ith trial
• εi ~iid N(0, σ²)
• i = 1, …, n
Notational Convention
• When you see εi ~iid N(0, σ²),
• it is read as: εi is distributed independently and identically according to a normal distribution with mean 0 and variance σ²
• Examples
– θ ~ Poisson(λ)
– z ~ G(θ)
Maximum Likelihood Principle
• The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.
Likelihood Function
• If

X_i \sim F(\Theta), \quad i = 1, \ldots, n

then the likelihood function is

L(\{X_i\}_{i=1}^n, \Theta) = \prod_{i=1}^n F(X_i; \Theta)
Example, N(10,3) Density, Single Obs.
[Figure: N(10, 3) density with a single observation marked. Caption: N=10, −log likelihood = 4.3038. Axes: Samples vs. N(10, 3) density.]
Example, N(10,3) Density, Single Obs. Again
[Figure: the same N(10, 3) density with a single observation, shown again. Caption: N=10, −log likelihood = 4.3038.]
Example, N(10,3) Density, Multiple Obs.
[Figure: N(10, 3) density with multiple observations marked. Caption: N=10, −log likelihood = 36.2204.]
Maximum Likelihood Estimation
• The likelihood function

L(\{X_i\}_{i=1}^n, \Theta) = \prod_{i=1}^n F(X_i; \Theta)

can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters.
• To do this, find solutions (analytically or by following the gradient) to

\frac{dL(\{X_i\}_{i=1}^n, \Theta)}{d\Theta} = 0
Important Trick
• (Almost) never maximize the likelihood function itself; maximize the log likelihood function instead. Quite often the log of the density is easier to work with mathematically.

\log(L(\{X_i\}_{i=1}^n, \Theta)) = \log\!\left(\prod_{i=1}^n F(X_i; \Theta)\right) = \sum_{i=1}^n \log(F(X_i; \Theta))
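Numerics give a second reason: the raw product underflows quickly, while the log sum stays well behaved. A sketch (assuming NumPy and SciPy; treating the 3 in the earlier N(10, 3) figures as the standard deviation):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=1000)

# The raw likelihood underflows to 0.0 even for moderate n...
print(np.prod(stats.norm.pdf(x, loc=10.0, scale=3.0)))

# ...while the log likelihood is a perfectly ordinary number.
print(np.sum(stats.norm.logpdf(x, loc=10.0, scale=3.0)))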
ML Normal Regression
• Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2}

which, if you maximize (how?) w.r.t. the parameters, you get…
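One answer to the "how?" is to minimize the negative log likelihood numerically. A sketch (assuming NumPy and SciPy; the data are simulated, so only illustrative):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.arange(1.0, 31.0)
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=x.size)

def nll(params):
    # Negative log likelihood; optimize log(sigma^2) so sigma^2 stays positive.
    b0, b1, log_s2 = params
    s2 = np.exp(log_s2)
    r = y - b0 - b1 * x
    return 0.5 * x.size * np.log(2.0 * np.pi * s2) + np.sum(r ** 2) / (2.0 * s2)

res = minimize(nll, x0=np.array([y.mean(), 0.0, np.log(y.var())]))
print(res.x[:2])         # close to the least squares b0, b1
print(np.exp(res.x[2]))  # ML variance estimate, SSE / n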
Maximum Likelihood Estimator(s)
• β0: b0, the same as in the least squares case
• β1: b1, the same as in the least squares case
• σ²:

\hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

• Note that the ML estimator is biased, whereas s² is unbiased, and

s^2 = MSE = \frac{n}{n-2} \hat{\sigma}^2
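The n/(n − 2) relationship is easy to verify. A sketch (assuming NumPy; simulated data, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.arange(1.0, n + 1.0)
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=n)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
e = y - (b0 + b1 * x)

sigma2_ml = np.sum(e ** 2) / n  # ML estimator (biased)
s2 = np.sum(e ** 2) / (n - 2)   # unbiased MSE
print(sigma2_ml, s2, (n / (n - 2)) * sigma2_ml)  # the last two agree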
Comments
• Least squares minimizes the squared error between the prediction and the true output
• The normal distribution is fully characterized by its first two central moments (mean and variance)
• Food for thought:
– What does the bias in the ML estimator of the error variance mean? And where does it come from?