Regression Estimation – Least Squares and Maximum Likelihood
Dr. Frank Wood
Least Squares Max(min)imization
• Function to minimize w.r.t. β0 and β1:

Q = \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2

• Minimize this by maximizing −Q
• Find partials and set both equal to zero

(go to board)
Normal Equations
• The result of this maximization step is a pair of equations called the normal equations; b0 and b1 are called point estimators of β0 and β1, respectively
• This is a system of two equations in two unknowns. The solution is given by…

\sum Y_i = n b_0 + b_1 \sum X_i

\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2

(Write these on board.)
Solution to Normal Equations
• After a lot of algebra one arrives at

b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}

b_0 = \bar{Y} - b_1 \bar{X}

\bar{X} = \frac{\sum X_i}{n}, \quad \bar{Y} = \frac{\sum Y_i}{n}
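As a sanity check, these formulas are easy to evaluate directly. A minimal Python/NumPy sketch (not from the slides; the data below are made up for illustration):

import numpy as np

# Hypothetical example data: any paired (X, Y) observations will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])

xbar, ybar = x.mean(), y.mean()

# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# b0 = Ybar - b1 * Xbar
b0 = ybar - b1 * xbar

print(b0, b1)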
Least Squares Fit
[Figure: scatter of data with two lines. Estimate: y = 2.09x + 8.36, MSE 4.15. True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]
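The plot can be reproduced in spirit with a short simulation; a sketch (assuming NumPy; the noise level and seed are made up, so the fitted coefficients and MSEs will differ from the slide's):

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
# Simulate from the slide's "true" line y = 2x + 9 plus Gaussian noise.
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=x.size)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

mse_fit = np.mean((y - (b0 + b1 * x)) ** 2)     # least squares line
mse_true = np.mean((y - (2.0 * x + 9.0)) ** 2)  # true line
print(mse_fit, mse_true)

On the training data the least squares line can never have larger MSE than the true line, which is why the slide shows 4.15 < 4.22.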
Guess #1
[Figure: the same data with a constant guess. Guess: y = 0x + 21.2, MSE 37.1. True: y = 2x + 9, MSE 4.22.]
Guess #2
[Figure: the same data with a closer guess. Guess: y = 1.5x + 13, MSE 7.84. True: y = 2x + 9, MSE 4.22.]
Looking Ahead: Matrix Least Squares
• The solution to this equation is the solution to least squares linear regression (and to maximum likelihood under the normal error distribution assumption)

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}
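In matrix form the fit is a single library call. A minimal sketch (assuming NumPy; same made-up data as before):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])

# Design matrix with rows [Xi, 1], matching the slide's parameter order [beta1, beta0].
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b1, b0 = beta
print(b0, b1)  # agrees with the normal-equation solution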
Questions to Ask
• Is the relationship really linear?
• What is the distribution of the “errors”?
• Is the fit good?
• How much of the variability of the response is accounted for by including the predictor variable?
• Is the chosen predictor variable the best one?
Is This Better?
[Figure: the same data fit with a 7th-order polynomial, MSE 3.18.]
Goals for First Half of Course
• How to do linear regression
– Self-familiarization with software tools
• How to interpret standard linear regression results
• How to derive tests
• How to assess and address deficiencies in regression models
Properties of Solution
• The ith residual is defined to be

e_i = Y_i - \hat{Y}_i

• The sum of the residuals is zero:

\sum_i e_i = \sum (Y_i - b_0 - b_1 X_i) = \sum Y_i - n b_0 - b_1 \sum X_i = 0

by the first normal equation.
Properties of Solution
• The sum of the observed values Y_i equals the sum of the fitted values \hat{Y}_i:

\sum_i \hat{Y}_i = \sum_i (b_1 X_i + b_0)
= \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X})
= b_1 \sum_i X_i + n\bar{Y} - b_1 n\bar{X}
= b_1 n\bar{X} + \sum_i Y_i - b_1 n\bar{X}
= \sum_i Y_i
Properties of Solution
• The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable in the ith trial:

\sum_i X_i e_i = \sum X_i (Y_i - b_0 - b_1 X_i) = \sum_i X_i Y_i - b_0 \sum X_i - b_1 \sum X_i^2 = 0

by the second normal equation.
Properties of Solution
• The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the fitted value of the response variable for the ith trial:

\sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i) e_i = b_0 \sum_i e_i + b_1 \sum_i e_i X_i = 0

by the previous properties.
Properties of Solution
• The regression line always goes through the point (\bar{X}, \bar{Y})
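All of these properties are easy to confirm numerically. A minimal sketch (assuming NumPy; same made-up data as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x
e = y - yhat

print(np.isclose(e.sum(), 0.0))           # residuals sum to zero
print(np.isclose(y.sum(), yhat.sum()))    # observed sum equals fitted sum
print(np.isclose((x * e).sum(), 0.0))     # X-weighted residuals sum to zero
print(np.isclose((yhat * e).sum(), 0.0))  # fitted-value-weighted residuals sum to zero
print(np.isclose(b0 + b1 * xbar, ybar))   # line passes through (Xbar, Ybar)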
Estimating Error Term Variance σ²
• Review estimation in non-regression setting.
• Show estimation results for regression setting.
Estimation Review
• An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample
• e.g. the sample mean

\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i
Point Estimators and Bias
• Point estimator: \hat{\theta} = f(\{Y_1, \ldots, Y_n\})
• Unknown quantity / parameter: \theta
• Definition: the bias of an estimator is

B(\hat{\theta}) = E(\hat{\theta}) - \theta
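Bias can be approximated by simulation: draw many samples, apply the estimator to each, and compare the average estimate to the true θ. A minimal sketch (assuming NumPy; the Normal(5, 0.75²) setup mirrors the example on the next slide):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 0.75, 20, 100_000

# Apply the estimator (the sample mean) to many independent samples.
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# B(theta_hat) = E(theta_hat) - theta, approximated by the Monte Carlo average.
print(estimates.mean() - mu)  # close to 0: the sample mean is unbiased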
One Sample Example
[Figure: density with µ = 5, σ = 0.75; the samples, the true θ, and the estimated θ are marked. (run bias_example_plot.m)]
Distribution of Estimator
• If the estimator is a function of the samples and the distribution of the samples is known, then the distribution of the estimator can (often) be determined
– Methods
• Distribution (CDF) functions
• Transformations
• Moment generating functions
• Jacobians (change of variable)
Example
• Samples from a Normal(µ, σ²) distribution
• Estimate the population mean

Y_i \sim \text{Normal}(\mu, \sigma^2), \quad \theta = \mu, \quad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i
Sampling Distribution of the Estimator
• First moment:

E(\hat{\theta}) = E\!\left(\frac{1}{n} \sum_{i=1}^n Y_i\right) = \frac{1}{n} \sum_{i=1}^n E(Y_i) = \frac{n\mu}{n} = \theta

• This is an example of an unbiased estimator:

B(\hat{\theta}) = E(\hat{\theta}) - \theta = 0
Variance of Estimator
• Definition: the variance of an estimator is

V(\hat{\theta}) = E([\hat{\theta} - E(\hat{\theta})]^2)

• Remember:

V(cY) = c^2 V(Y)

V\!\left(\sum_{i=1}^n Y_i\right) = \sum_{i=1}^n V(Y_i)

where the latter holds only if the Y_i are independent with finite variance.
Example Estimator Variance
• For the sample mean estimator of a Normal(µ, σ²) population:

V(\hat{\theta}) = V\!\left(\frac{1}{n} \sum_{i=1}^n Y_i\right) = \frac{1}{n^2} \sum_{i=1}^n V(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

• Note the assumptions (the Y_i are independent with finite variance)
Distribution of sample mean estimator
[Figure: histogram of the sample mean estimator over 1000 samples.]
Bias Variance Trade-off
• The mean squared error of an estimator is

MSE(\hat{\theta}) = E([\hat{\theta} - \theta]^2)

• It can be re-expressed as

MSE(\hat{\theta}) = V(\hat{\theta}) + (B(\hat{\theta}))^2
MSE = VAR + BIAS²
• Proof:

MSE(\hat{\theta}) = E((\hat{\theta} - \theta)^2)
= E(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2)
= E([\hat{\theta} - E(\hat{\theta})]^2) + 2E([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]) + E([E(\hat{\theta}) - \theta]^2)
= V(\hat{\theta}) + 2[E(\hat{\theta}) - \theta] \, E(\hat{\theta} - E(\hat{\theta})) + (B(\hat{\theta}))^2
= V(\hat{\theta}) + 2 \cdot 0 + (B(\hat{\theta}))^2
= V(\hat{\theta}) + (B(\hat{\theta}))^2

Here E(\hat{\theta}) - \theta is a constant, so it factors out of the cross term, and E(\hat{\theta} - E(\hat{\theta})) = 0.
Trade-off
• Think of variance as confidence and bias as correctness.
– Intuitions (largely) apply
• Sometimes a biased estimator can produce lower MSE if it lowers the variance.
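A concrete instance: when estimating a normal variance, the (biased) divisor n+1 gives a lower MSE than the unbiased divisor n−1. A simulation sketch (assuming NumPy; parameters made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Divisors: n-1 (unbiased), n (maximum likelihood), n+1 (minimum MSE for normal data).
for k in (n - 1, n, n + 1):
    est = ss / k
    print(k, np.mean((est - sigma2) ** 2))  # MSE shrinks as the divisor grows to n+1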
Estimating Error Term Variance σ²
• Regression model: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
• The variance of each observation Y_i is σ² (the same as for the error term εi)
• Each Y_i comes from a different probability distribution, with a mean that depends on the level X_i
• The deviation of an observation Y_i must therefore be calculated around its own estimated mean
s² Estimator for σ²
• MSE is an unbiased estimator of σ²
• The sum of squares SSE has n − 2 degrees of freedom associated with it

s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}

E(MSE) = \sigma^2
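Continuing the earlier sketch, s² is just the residual sum of squares divided by n − 2 (assuming NumPy; same made-up data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([11.2, 12.9, 15.1, 16.8, 19.3])
n = x.size
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

e = y - (b0 + b1 * x)  # residuals around each fitted mean
sse = np.sum(e ** 2)   # error sum of squares
s2 = sse / (n - 2)     # two degrees of freedom spent estimating b0 and b1
print(s2)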
Normal Error Regression Model
• No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1 that also have minimum variance among all unbiased linear estimators
• To set up interval estimates and make tests we need to specify the distribution of the εi
• We will assume that the εi are normally distributed
Normal Error Regression Model
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

• Y_i is the value of the response variable in the ith trial
• β0 and β1 are parameters
• X_i is a known constant, the value of the predictor variable in the ith trial
• εi ~iid N(0, σ²)
• i = 1, …, n
Notational Convention
• When you see εi ~iid N(0, σ²),
• it is read as: εi is distributed independently and identically according to a normal distribution with mean 0 and variance σ²
• Examples
– θ ~ Poisson(λ)
– z ~ G(θ)
Maximum Likelihood Principle
• The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.
Likelihood Function
• If

X_i \sim F(\Theta), \quad i = 1, \ldots, n

then the likelihood function is

L(\{X_i\}_{i=1}^n, \Theta) = \prod_{i=1}^n F(X_i; \Theta)
Example, N(10,3) Density, Single Obs.
[Figure: N(10, 3) density with a single observation marked. Caption: N=10, −log likelihood = 4.3038. Axes: Samples vs. N(10, 3) density.]
Example, N(10,3) Density, Single Obs. Again
[Figure: the same N(10, 3) density with a single observation, shown again. Caption: N=10, −log likelihood = 4.3038.]
Example, N(10,3) Density, Multiple Obs.
[Figure: N(10, 3) density with multiple observations marked. Caption: N=10, −log likelihood = 36.2204.]
Maximum Likelihood Estimation
• The likelihood function

L(\{X_i\}_{i=1}^n, \Theta) = \prod_{i=1}^n F(X_i; \Theta)

can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters.
• To do this, find solutions (analytically or by following the gradient) to

\frac{dL(\{X_i\}_{i=1}^n, \Theta)}{d\Theta} = 0
Important Trick
• (Almost) never maximize the likelihood function itself; maximize the log likelihood function instead. Quite often the log of the density is easier to work with mathematically.

\log(L(\{X_i\}_{i=1}^n, \Theta)) = \log\!\left(\prod_{i=1}^n F(X_i; \Theta)\right) = \sum_{i=1}^n \log(F(X_i; \Theta))
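Numerics give a second reason: the raw product underflows quickly, while the log sum stays well behaved. A sketch (assuming NumPy and SciPy; treating the 3 in the earlier N(10, 3) figures as the standard deviation):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=1000)

# The raw likelihood underflows to 0.0 even for moderate n...
print(np.prod(stats.norm.pdf(x, loc=10.0, scale=3.0)))

# ...while the log likelihood is a perfectly ordinary number.
print(np.sum(stats.norm.logpdf(x, loc=10.0, scale=3.0)))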
ML Normal Regression
• Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2}

which, if you maximize (how?) w.r.t. the parameters, you get…
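One answer to the "how?" is to minimize the negative log likelihood numerically. A sketch (assuming NumPy and SciPy; the data are simulated, so only illustrative):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.arange(1.0, 31.0)
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=x.size)

def nll(params):
    # Negative log likelihood; optimize log(sigma^2) so sigma^2 stays positive.
    b0, b1, log_s2 = params
    s2 = np.exp(log_s2)
    r = y - b0 - b1 * x
    return 0.5 * x.size * np.log(2.0 * np.pi * s2) + np.sum(r ** 2) / (2.0 * s2)

res = minimize(nll, x0=np.array([y.mean(), 0.0, np.log(y.var())]))
print(res.x[:2])         # close to the least squares b0, b1
print(np.exp(res.x[2]))  # ML variance estimate, SSE / n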
Maximum Likelihood Estimator(s)
• β0: b0, the same as in the least squares case
• β1: b1, the same as in the least squares case
• σ²:

\hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

• Note that the ML estimator is biased, whereas s² is unbiased, and

s^2 = MSE = \frac{n}{n-2} \hat{\sigma}^2
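The n/(n − 2) relationship is easy to verify. A sketch (assuming NumPy; simulated data, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.arange(1.0, n + 1.0)
y = 2.0 * x + 9.0 + rng.normal(scale=2.0, size=n)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
e = y - (b0 + b1 * x)

sigma2_ml = np.sum(e ** 2) / n  # ML estimator (biased)
s2 = np.sum(e ** 2) / (n - 2)   # unbiased MSE
print(sigma2_ml, s2, (n / (n - 2)) * sigma2_ml)  # the last two agree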
Comments
• Least squares minimizes the squared error between the prediction and the true output
• The normal distribution is fully characterized by its first two central moments (mean and variance)
• Food for thought:
– What does the bias in the ML estimator of the error variance mean? And where does it come from?