Weighted Least Squares I

Gopi Goswami, Stat221 (www.fas.harvard.edu/~stat221), Harvard University, February 17, 2006


Weighted Least Squares I

• for $i = 1, 2, \ldots, n$ we have (see Bradley [1]):

– data: $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} f(y_i \mid \theta_i)$, where $\theta_i = E(Y_i \mid x_i)$

– covariates: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$

– let $X_{n \times p}$ be the matrix of covariates with rows $x_i^T$

– parameter of interest: $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$, $p < n$

– $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$

– $\mathrm{Var}(Y_i \mid x_i) = v_i(\phi)$ has a known form that does not depend on $\beta$; the $v_i(\phi)$'s are not all the same, and $\phi$ is known

• want to estimate β

• ignoring the underlying density, one could use the Weighted Least Squares estimator:

$$\hat\beta_{WLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - \beta^T x_i\right)^2$$
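• as a quick illustration (not part of the original slides), here is a minimal numerical sketch of this estimator: it simulates data with known, unequal variances and minimizes the weighted sum of squares directly; the simulated data, the choice of $v_i(\phi)$, and all names are illustrative assumptions:

```python
# A minimal sketch of the WLS criterion, minimized numerically.
# The simulated data and the choice of v_i(phi) are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # covariate matrix with rows x_i^T
beta_true = np.array([1.0, -2.0, 0.5])
v = 0.5 * (1.0 + X[:, 0] ** 2)              # known variances v_i(phi), not all equal
Y = X @ beta_true + rng.normal(scale=np.sqrt(v))

def wls_objective(beta):
    """sum_i v_i(phi)^{-1} (Y_i - beta^T x_i)^2"""
    resid = Y - X @ beta
    return np.sum(resid ** 2 / v)

beta_wls = minimize(wls_objective, x0=np.zeros(p)).x
print(beta_wls)                             # should be close to beta_true
```

A closed-form solution of the same problem appears on the "WLS and MLE II" slide below.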


WLS II

• one could also use the Maximum Likelihood Estimator:

$$\hat\beta_{MLE} = \arg\max_\beta \log L(\beta) = \arg\max_\beta \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)$$

• for WLS we solve the following normal equations:

$$\sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - \beta^T x_i\right)x_{ij} = 0, \qquad j = 1, 2, \ldots, p \tag{1}$$

• for MLE we solve the following system of equations:

$$\frac{\partial}{\partial\beta_j} \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i) = 0, \qquad j = 1, 2, \ldots, p \tag{2}$$

• for certain choices of $f(\cdot \mid \cdot)$, $\hat\beta_{WLS} = \hat\beta_{MLE}$; what are those?


NEF of Distributions I

• NEF stands for Natural Exponential Family

• a NEF looks like:

$$f(y \mid \theta) = h(y)\exp[P(\theta)y - Q(\theta)]$$

where $\theta = E(Y)$ and the range of $Y$ does not depend on $\theta$

• consider $\int f(y \mid \theta)\,dy = 1$, i.e. $\int h(y)\exp[P(\theta)y - Q(\theta)]\,dy = 1$, and assume differentiation under the integral sign is possible (a derivation sketch follows below)

– apply $\frac{d}{d\theta}$ to both sides of the above to get: $\theta = E(Y) = \frac{Q'(\theta)}{P'(\theta)}$, why?

– apply $\frac{d^2}{d\theta^2}$ to both sides of the above to get: $\mathrm{Var}(Y) = \frac{1}{P'(\theta)}$, why?
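• the two "why?"s can be answered by a short calculation; here is a sketch (assuming, as above, that we may differentiate under the integral sign):

```latex
\begin{align*}
0 &= \frac{d}{d\theta}\int h(y)\,e^{P(\theta)y - Q(\theta)}\,dy
   = \int \bigl(P'(\theta)\,y - Q'(\theta)\bigr)\,f(y \mid \theta)\,dy \\
  &= P'(\theta)\,E(Y) - Q'(\theta)
  \quad\Longrightarrow\quad E(Y) = \frac{Q'(\theta)}{P'(\theta)} \;(= \theta).
\end{align*}
% Differentiate once more, writing P'(\theta)y - Q'(\theta) = P'(\theta)(y - \theta):
\begin{align*}
0 &= \int \Bigl[\bigl(P''(\theta)\,y - Q''(\theta)\bigr)
     + \bigl(P'(\theta)(y - \theta)\bigr)^2\Bigr]\,f(y \mid \theta)\,dy \\
  &= P''(\theta)\,\theta - Q''(\theta) + P'(\theta)^2\,\mathrm{Var}(Y).
\end{align*}
% Differentiating the identity Q'(\theta) = P'(\theta)\,\theta gives
% Q''(\theta) = P''(\theta)\,\theta + P'(\theta), hence
% Var(Y) = P'(\theta) / P'(\theta)^2 = 1 / P'(\theta).
```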


WLS and MLE I

• if the $f(Y_i \mid \beta^T x_i)$ all come from a NEF, then $\hat\beta_{WLS} = \hat\beta_{MLE}$

• sketch of proof:

$$\begin{aligned}
\sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)
&= \sum_{i=1}^n \left\{\log h(Y_i) + P(\beta^T x_i)Y_i - Q(\beta^T x_i)\right\} \\
\implies \frac{\partial}{\partial\beta_j} \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)
&= \sum_{i=1}^n \left\{P'(\theta_i)x_{ij}Y_i - Q'(\theta_i)x_{ij}\right\} \\
&= \sum_{i=1}^n P'(\theta_i)\left(Y_i - \frac{Q'(\theta_i)}{P'(\theta_i)}\right)x_{ij} \\
&= \sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - E(Y_i \mid x_i)\right)x_{ij} \\
&= \sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - \beta^T x_i\right)x_{ij}
\end{aligned}$$

(the last two steps use the NEF identities $P'(\theta_i) = 1/\mathrm{Var}(Y_i \mid x_i) = v_i(\phi)^{-1}$ and $Q'(\theta_i)/P'(\theta_i) = E(Y_i \mid x_i) = \beta^T x_i$)


WLS and MLE II

• so equation (2) boils down to solving:

$$\sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - \beta^T x_i\right)x_{ij} = 0, \qquad j = 1, 2, \ldots, p$$

• the above is exactly the same as equation (1); Q.E.D.

• note the solution to the above equations also satisfies (how?):

$$\left(X^T W X\right)\hat\beta_{WLS} = X^T W Y \;\Longrightarrow\; \hat\beta_{WLS} = \left(X^T W X\right)^{-1} X^T W Y$$

where $W$ is diagonal with $(W)_{ii} = v_i(\phi)^{-1}$
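• in code the closed form is a single linear solve; a sketch (same synthetic setup as before, names illustrative; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
# Closed-form WLS: solve (X^T W X) beta = X^T W Y, with W diagonal.
# Synthetic data below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
v = 0.5 * (1.0 + X[:, 0] ** 2)          # known variances v_i(phi)
Y = X @ beta_true + rng.normal(scale=np.sqrt(v))

w = 1.0 / v                             # (W)_ii = v_i(phi)^{-1}
XtW = X.T * w                           # X^T W without forming the n x n matrix W
beta_wls = np.linalg.solve(XtW @ X, XtW @ Y)
print(beta_wls)                         # matches the numerical minimizer above
```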


Example I

• Heteroskedastic Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Normal}_1(\theta_i,\ \sigma^2 \cdot k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k : \mathbb{R}^p \to (0, \infty)$

• $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$

• want to estimate β

• so we take diagonal $W$ such that $(W)_{ii} = 1/(\sigma^2 \cdot k(x_i))$ and

$$\hat\beta_{WLS} = \left(X^T W X\right)^{-1}\left(X^T W Y\right)$$

• now $\hat\beta_{WLS} = \hat\beta_{MLE}$ because the Normal distribution comes from a NEF:

$$\mathrm{Normal}_1(\theta,\ \sigma^2 \cdot k(x_i);\ y) = \underbrace{\frac{\exp\!\left[\frac{-y^2}{2\sigma^2 \cdot k(x_i)}\right]}{\sqrt{2\pi\sigma^2 \cdot k(x_i)}}}_{h(y)}\ \exp\!\left[\underbrace{\frac{\theta}{\sigma^2 \cdot k(x_i)}}_{P(\theta)}\, y \;-\; \underbrace{\frac{\theta^2}{2\sigma^2 \cdot k(x_i)}}_{Q(\theta)}\right]$$
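• a short simulation sketch of this example (the particular $k(\cdot)$ is an illustrative assumption): both WLS and unweighted OLS are unbiased here, but WLS is the MLE and is more efficient:

```python
# Heteroskedastic least squares: Var(Y_i | x_i) = sigma^2 * k(x_i).
# The particular k(.) below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
sigma2 = 1.0
k = np.exp(X[:, 1])                     # known k : R^p -> (0, inf)
Y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2 * k))

w = 1.0 / (sigma2 * k)                  # (W)_ii = 1 / (sigma^2 k(x_i))
XtW = X.T * w
beta_wls = np.linalg.solve(XtW @ X, XtW @ Y)
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # unweighted, for comparison
print(beta_wls, beta_ols)
```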


Iteratively Reweighted Least Squares I

• suppose in the previous setting, for a known non-linear function $m(\cdot, \cdot)$ with first derivatives, we have: $\theta_i = m(\beta, x_i)$

• want to estimate β

• ignoring the underlying density, one uses the Iteratively Reweighted Least Squares estimator:

$$\hat\beta_{IRLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - m(\beta, x_i)\right)^2$$

• one can show that under this set-up, as well, $\hat\beta_{IRLS} = \hat\beta_{MLE}$

• the proof is very similar to the proof of $\hat\beta_{WLS} = \hat\beta_{MLE}$ that we did before; it is left as an assignment problem


IRLS II

• here we need to solve the following normal equations:

$$\sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - m(\beta, x_i)\right)\frac{\partial}{\partial\beta_j} m(\beta, x_i) = 0, \qquad j = 1, 2, \ldots, p \tag{3}$$

• the problem is that the normal equations (3) are not easily solved for $\beta$

• one could use the Newton–Raphson (NR) algorithm; instead we are going to use something different


IRLS III

• a new iterative route:

• let the current iterate be $\hat\beta_{n-1}$

• “linearize the problem” using a Taylor expansion:

$$m(\beta, x_i) \approx m(\hat\beta_{n-1}, x_i) + \left(\beta - \hat\beta_{n-1}\right)^T \left[\nabla_\beta m(\hat\beta_{n-1}, x_i)\right]$$

• now solve the simpler problem:

$$\hat\beta_n = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\left\{\left(Y_i - m(\hat\beta_{n-1}, x_i) + \hat\beta_{n-1}^T\left[\nabla_\beta m(\hat\beta_{n-1}, x_i)\right]\right) - \beta^T\left[\nabla_\beta m(\hat\beta_{n-1}, x_i)\right]\right\}^2$$


IRLS IV

• the simpler problem can be solved with the following normal equations:

$$\sum_{i=1}^n v_i(\phi)^{-1}\left\{\left(Y_i - m(\hat\beta_{n-1}, x_i) + \hat\beta_{n-1}^T\left[\nabla_\beta m(\hat\beta_{n-1}, x_i)\right]\right) - \beta^T\left[\nabla_\beta m(\hat\beta_{n-1}, x_i)\right]\right\}\frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i) = 0, \qquad j = 1, 2, \ldots, p \tag{4}$$

• now take:

$$(\hat X_{n-1})_{ij} = \frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i)$$

$$(\hat W_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

$$(\hat Y_{n-1})_i = Y_i - m(\hat\beta_{n-1}, x_i)$$


IRLS V

• equation (4) amounts to solving (why?):

$$\begin{aligned}
\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1} &= \left(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\right)\left(\hat\beta_n - \hat\beta_{n-1}\right) \\
\implies \hat\beta_n - \hat\beta_{n-1} &= \left(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\right)^{-1}\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1} \\
\implies \hat\beta_n &= \hat\beta_{n-1} + \left(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\right)^{-1}\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1}
\end{aligned} \tag{5}$$

• the second term above looks like the WLS solution from regressing $\hat Y_{n-1}$ on $\hat X_{n-1}$ with weights $\hat W_{n-1}$; we iterate this procedure, hence the name

• the IRLS algorithm (a code sketch follows below):

– start with a properly chosen initial $\hat\beta_0$ and apply the above updating scheme (until convergence) to get $\hat\beta_0 \to \hat\beta_1 \to \hat\beta_2 \to \cdots \to \hat\beta_{IRLS}$
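• a minimal sketch of update (5), assuming the gradient of $m$ is available analytically; the mean function $m(\beta, x) = \exp(\beta^T x)$, the data, and all names are illustrative assumptions, not from the slides:

```python
# IRLS sketch for a non-linear mean m(beta, x), using update (5).
# The mean function m(beta, x) = exp(beta^T x) is an illustrative assumption.
import numpy as np

def irls(Y, X, v, m, grad_m, beta0, tol=1e-8, max_iter=100):
    """Iterate beta_n = beta_{n-1} + (Xh^T W Xh)^{-1} Xh^T W Yh, where Xh has
    rows grad_m(beta, x_i), W = diag(1/v_i), and Yh_i = Y_i - m(beta, x_i)."""
    beta = beta0.copy()
    for _ in range(max_iter):
        Xh = grad_m(beta, X)               # n x p matrix of partial derivatives
        Yh = Y - m(beta, X)                # working residuals
        XtW = Xh.T / v
        step = np.linalg.solve(XtW @ Xh, XtW @ Yh)
        beta = beta + step
        if np.linalg.norm(step) < tol:     # convergence is not guaranteed; cf. IRLS VI
            break
    return beta

# illustrative usage with the assumed mean function
rng = np.random.default_rng(3)
n, p = 400, 2
X = rng.normal(scale=0.3, size=(n, p))
beta_true = np.array([1.0, -0.5])
v = np.full(n, 0.1)                        # known variances v_i(phi)
m = lambda b, X: np.exp(X @ b)
grad_m = lambda b, X: np.exp(X @ b)[:, None] * X
Y = m(beta_true, X) + rng.normal(scale=np.sqrt(v))
print(irls(Y, X, v, m, grad_m, beta0=np.zeros(p)))   # should approach beta_true
```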


IRLS VI

• note that equation (5) looks like an NR-type update; this is a so-called Newton–Raphson-like algorithm

• IRLS may or may not converge depending on starting values, much like NR


Example I

• Heteroskedastic Non-linear Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Normal}_1(\theta_i,\ \sigma^2 \cdot k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k : \mathbb{R}^p \to (0, \infty)$

• $\theta_i = E(Y_i \mid x_i) = m(\beta, x_i)$, for a known non-linear function $m(\cdot, \cdot)$ with first derivatives

• want to estimate β

• here, for computing $\hat\beta_{IRLS}$ ($= \hat\beta_{MLE}$, why?), we will need:

$$(\hat X_{n-1})_{ij} = \frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i)$$

$$(\hat W_{n-1})_{ij} = \begin{cases} 1/(\sigma^2 \cdot k(x_i)) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

$$(\hat Y_{n-1})_i = Y_i - m(\hat\beta_{n-1}, x_i)$$


IRLS and Scoring I

• consider the Generalized Linear Model (GLM) set-up (a quick recap):

– random component: $f(Y_i \mid \theta_i)$ comes from a NEF, with $\theta_i = E(Y_i \mid x_i)$

– systematic component: $\eta_i = \beta^T x_i$, also called the linear predictor

– link function: an invertible function $g(\cdot)$, with first derivative, such that $\eta_i = g(\theta_i)$

• let $\mathrm{Var}(Y_i \mid x_i) = v_i(\beta, \phi)$, for some known parameter $\phi$

• want to estimate $\beta$

• we are going to use scoring to find the MLE $\hat\beta_{MLE}$


IRLS and Scoring II

• the log-likelihood and its derivative, the score:

$$\sum_{i=1}^n \log f(Y_i \mid \theta_i) = \sum_{i=1}^n \{\log h(Y_i) + P(\theta_i)Y_i - Q(\theta_i)\} \tag{6}$$

$$\begin{aligned}
\implies \frac{\partial}{\partial\beta_j} \sum_{i=1}^n \log f(Y_i \mid \theta_i)
&= \sum_{i=1}^n \frac{\partial}{\partial\beta_j}\{P(\theta_i)Y_i - Q(\theta_i)\} \\
&= \sum_{i=1}^n v_i(\beta, \phi)^{-1}\left(Y_i - E(Y_i \mid x_i)\right) d_i\, x_{ij} \quad \text{(why? see the sketch below)} \\
&= u_j, \text{ say,}
\end{aligned}$$

here $d_i := \frac{\partial\theta_i}{\partial\eta_i}$ for all $i$, and the $d_i$ and $u_j$ are both functions of $\beta$
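• the "(why?)" is the chain rule plus the NEF identities from before; a sketch:

```latex
% theta_i depends on beta only through eta_i = beta^T x_i, so
\frac{\partial}{\partial\beta_j}\{P(\theta_i)Y_i - Q(\theta_i)\}
  = \bigl(P'(\theta_i)\,Y_i - Q'(\theta_i)\bigr)\,
    \frac{\partial\theta_i}{\partial\eta_i}\,
    \frac{\partial\eta_i}{\partial\beta_j}
  = P'(\theta_i)\,\bigl(Y_i - \theta_i\bigr)\,d_i\,x_{ij},
% and, from the NEF slides, P'(\theta_i) = 1/\mathrm{Var}(Y_i \mid x_i)
% = v_i(\beta, \phi)^{-1}, while \theta_i = E(Y_i \mid x_i).
```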


IRLS and Scoring III

• if v(·, ·) doesn’t depend on β (assume it from now on), then the informationmatrix entries simplify to:

I(β)kj = E

[−

∂βk

uj

]=

n∑

i=1

vi(φ)−1d2i xijxik (why?)

• in case v(·, ·) does depend on β, one needs carefully compute the informationmatrix entries on a case-by-case basis
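• a sketch of this "(why?)" as well, using $\partial\theta_i/\partial\beta_k = d_i x_{ik}$ and $E(Y_i - \theta_i \mid x_i) = 0$:

```latex
% differentiate u_j = sum_i v_i(phi)^{-1} (Y_i - theta_i) d_i x_{ij} in beta_k:
-\frac{\partial u_j}{\partial\beta_k}
  = \sum_{i=1}^n v_i(\phi)^{-1}
    \Bigl[ d_i x_{ik}\; d_i x_{ij}
         - (Y_i - \theta_i)\,\frac{\partial d_i}{\partial\beta_k}\,x_{ij} \Bigr],
% and taking expectations kills the second term, since E(Y_i - theta_i | x_i) = 0:
E\Bigl[-\frac{\partial u_j}{\partial\beta_k}\Bigr]
  = \sum_{i=1}^n v_i(\phi)^{-1}\,d_i^2\,x_{ij}\,x_{ik}.
```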


IRLS and Scoring IV

• define:

$$(X)_{ij} = x_{ij}$$

$$(\hat W_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1}\, d_i^2(\hat\beta_{n-1}) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

$$(\hat R_{n-1})_i = \left(Y_i - g^{-1}(\hat\beta_{n-1}^T x_i)\right) \big/\, d_i(\hat\beta_{n-1})$$

• so we have (why?):

$$I(\hat\beta_{n-1}) = X^T \hat W_{n-1} X$$

$$u(\hat\beta_{n-1})_j = \sum_{i=1}^n v_i(\phi)^{-1}\left(Y_i - g^{-1}(\hat\beta_{n-1}^T x_i)\right) d_i(\hat\beta_{n-1})\, x_{ij}, \quad\text{i.e.}\quad u(\hat\beta_{n-1}) = X^T \hat W_{n-1} \hat R_{n-1}$$


IRLS and Scoring V

• now the scoring update satisfies:

$$\hat\beta_n = \hat\beta_{n-1} + \left[I(\hat\beta_{n-1})\right]^{-1} u(\hat\beta_{n-1}) \;\Longrightarrow\; \hat\beta_n = \hat\beta_{n-1} + \left(X^T \hat W_{n-1} X\right)^{-1} X^T \hat W_{n-1} \hat R_{n-1}$$

• so the scoring update for the MLE reduces to an IRLS update for NEF densities


Example I

• Logistic Regression:

– for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Bernoulli}(\theta_i)$

– we have $\eta_i = \beta^T x_i$

– also, $\eta_i = g(\theta_i) = \log\!\left(\frac{\theta_i}{1 - \theta_i}\right)$, the well-known logit transform

• note that if we instead take $\eta_i = g(\theta_i) = \Phi^{-1}(\theta_i)$, the well-known probit transform, then we get the probit regression model (here $\Phi^{-1}(\cdot)$ is the inverse cdf of the $\mathrm{Normal}_1(0, 1)$ distribution)

• what will be the expressions for $\hat W_{n-1}$ and $\hat R_{n-1}$ in this case? (one possible answer is sketched below)
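• one possible answer sketch: for the logit link, $d_i = \partial\theta_i/\partial\eta_i = \theta_i(1 - \theta_i)$ and $v_i = \mathrm{Var}(Y_i \mid x_i) = \theta_i(1 - \theta_i)$, so $(\hat W_{n-1})_{ii} = v_i^{-1} d_i^2 = \theta_i(1 - \theta_i)$ and $(\hat R_{n-1})_i = (Y_i - \theta_i)/(\theta_i(1 - \theta_i))$, all evaluated at $\hat\beta_{n-1}$; a minimal implementation (synthetic data and all names are illustrative assumptions):

```python
# Scoring / IRLS sketch for logistic regression.
# For the logit link, d_i = theta_i(1 - theta_i) = v_i, so
# W_ii = v_i^{-1} d_i^2 = theta_i(1 - theta_i), R_i = (Y_i - theta_i)/d_i.
import numpy as np

def logistic_irls(Y, X, tol=1e-10, max_iter=50):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        theta = 1.0 / (1.0 + np.exp(-(X @ beta)))   # theta_i = g^{-1}(eta_i)
        w = theta * (1.0 - theta)                   # (W)_ii
        R = (Y - theta) / w                         # working residuals
        XtW = X.T * w
        step = np.linalg.solve(XtW @ X, XtW @ R)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# illustrative usage on synthetic Bernoulli data
rng = np.random.default_rng(4)
n, p = 1000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -1.0, 2.0])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))
print(logistic_irls(Y, X))                          # should approach beta_true
```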


References

[1] Edwin L. Bradley. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. Journal of the American Statistical Association, 68:199–200, 1973.

[2] A. Charnes, E. L. Frome, and P. L. Yu. The equivalence of generalized least squares and maximum likelihood estimates in the exponential family. Journal of the American Statistical Association, 71:169–171, 1976.

[3] P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical Society, Series B, 46:149–192, 1984.
