[email protected] Stat221 www.fas.harvard.edu/˜stat221'
&
$
%
Weighted Least Squares I
• for i = 1, 2, . . . , n we have (see [1, Bradley]):
– data: Yi | xi ~ f(yi | θi), independently but not identically distributed (i.n.i.d.), where θi = E(Yi | xi)
– covariates: xi = (xi1, xi2, . . . , xip)^T
– let X (n × p) be the matrix of covariates with rows xi^T
– parameter of interest: β = (β1, β2, . . . , βp), p < n
– θi = E(Yi | xi) = β^T xi
– Var(Yi | xi) = vi(φ) has a known form that does not depend on β; the vi(φ)'s are not all equal, and φ is known
• want to estimate β
• ignoring the underlying density, one could use the Weighted Least Squares estimator:
βWLS = arg min_β ∑_{i=1}^n vi(φ)^{-1} (Yi − β^T xi)^2
February 17, 2006 © 2006 Gopi Goswami ([email protected])
[email protected] Stat221 www.fas.harvard.edu/˜stat221'
&
$
%
WLS II

• one could also use the Maximum Likelihood Estimator:
βMLE = arg max_β log(L(β)) = arg max_β ∑_{i=1}^n log(f(Yi | β^T xi))
• for WLS we solve the following normal equations:
∑_{i=1}^n vi(φ)^{-1} (Yi − β^T xi) xij = 0,   j = 1, 2, . . . , p   (1)
• for MLE we solve the following system of equations:
∂
∂βj
n∑
i=1
log(f(Yi | βT xi)
)= 0, j = 1, 2, . . . , p (2)
• for certain choices of f(· | ·), βWLS = βMLE; what are those choices?
NEF of Distributions: I

• NEF stands for Natural Exponential Family
• a NEF density looks like:
f(y | θ) = h(y) exp[P(θ) y − Q(θ)]
where θ = E(Y) and the range of Y doesn't depend on θ
• consider ∫ f(y | θ) dy = 1, i.e. ∫ h(y) exp[P(θ) y − Q(θ)] dy = 1, and assume differentiation under the integral sign is possible
– apply d/dθ to both sides of the above to get: θ = E(Y) = Q′(θ)/P′(θ), why?
– apply d²/dθ² to both sides of the above to get: Var(Y) = 1/P′(θ), why?
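As a concrete instance (my example, not from the slides): the Poisson(θ) pmf e^{−θ} θ^y / y! = (1/y!) exp[y log θ − θ] is a NEF with h(y) = 1/y!, P(θ) = log θ, Q(θ) = θ, and a short sympy sketch verifies both identities:

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)

# Poisson(theta) in NEF form: P(theta) = log(theta), Q(theta) = theta
P = sp.log(theta)
Q = theta

mean = sp.simplify(sp.diff(Q, theta) / sp.diff(P, theta))  # Q'(theta)/P'(theta)
var = sp.simplify(1 / sp.diff(P, theta))                   # 1/P'(theta)

print(mean, var)  # both equal theta, matching E(Y) = Var(Y) = theta for Poisson
```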
WLS and MLE I

• if the f(Yi | β^T xi) all come from a NEF, then βWLS = βMLE
• sketch of proof:
∑_{i=1}^n log(f(Yi | β^T xi)) = ∑_{i=1}^n {log(h(Yi)) + P(β^T xi) Yi − Q(β^T xi)}
=⇒ ∂/∂βj ∑_{i=1}^n log(f(Yi | β^T xi)) = ∑_{i=1}^n {P′(θi) xij Yi − Q′(θi) xij}
= ∑_{i=1}^n P′(θi) (Yi − Q′(θi)/P′(θi)) xij
= ∑_{i=1}^n vi(φ)^{-1} (Yi − E(Yi | xi)) xij
= ∑_{i=1}^n vi(φ)^{-1} (Yi − β^T xi) xij
WLS and MLE II

• so equation (2) boils down to solving:
∑_{i=1}^n vi(φ)^{-1} (Yi − β^T xi) xij = 0,   j = 1, 2, . . . , p
• the above is exactly the same as equation (1), Q.E.D.
• note that the solution to the above equations also satisfies (how?):
(X^T W X) βWLS = X^T W Y
=⇒ βWLS = (X^T W X)^{-1} X^T W Y, where W is diagonal with (W)ii = vi(φ)^{-1}
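A minimal numerical sketch of this closed form on simulated data (all names and constants below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # covariate matrix, rows xi^T
beta_true = np.array([1.0, -2.0, 0.5])
v = 0.1 + rng.uniform(size=n)               # known, unequal variances vi(phi)
Y = X @ beta_true + rng.normal(scale=np.sqrt(v))

W = np.diag(1.0 / v)                        # (W)ii = vi(phi)^{-1}
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# beta_wls solves the normal equations (1): X^T W (Y - X beta) = 0
print(np.max(np.abs(X.T @ W @ (Y - X @ beta_wls))))  # ~ 0 up to round-off
```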
Example I
• Heteroskedastic Least Squares: for i = 1, 2, . . . , n we have,
Yi | xi ~ i.n.i.d. Normal1(θi, σ^2 · k(xi)), for some known constant σ^2 and a known function k(·) with k : R^p −→ (0, ∞)
• θi = E(Yi | xi) = β^T xi
• want to estimate β
• so we take diagonal W such that (W)ii = 1/(σ^2 · k(xi)) and
βWLS = (X^T W X)^{-1} X^T W Y
• now βWLS = βMLE because the Normal distribution comes from a NEF:
Normal1(θ, σ^2 · k(xi); y) = h(y) exp[P(θ) y − Q(θ)]
with
h(y) = exp[−y^2 / (2σ^2 · k(xi))] / √(2πσ^2 · k(xi))
P(θ) = θ / (σ^2 · k(xi))
Q(θ) = θ^2 / (2σ^2 · k(xi))
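This factorization can be checked symbolically by comparing exponents; a sketch, where s2k is my shorthand for σ^2 · k(xi) and the common prefactor 1/√(2πσ^2 · k(xi)) sits inside h(y):

```python
import sympy as sp

y, th = sp.symbols('y theta', real=True)
s2k = sp.symbols('s2k', positive=True)     # stands for sigma^2 * k(x_i)

# exponent of the Normal1(theta, s2k) density; the prefactor is shared by both sides
normal_exponent = -(y - th)**2 / (2*s2k)

P = th / s2k                   # P(theta)
Q = th**2 / (2*s2k)            # Q(theta)
h_exponent = -y**2 / (2*s2k)   # exponent carried by h(y)

nef_exponent = h_exponent + P*y - Q
diff = sp.simplify(normal_exponent - nef_exponent)
print(diff)  # 0: the exponents agree, so the density is in NEF form
```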
Iteratively Reweighted Least Squares I
• suppose in the previous setting, for a known non-linear function m(·, ·) with first derivatives, we have: θi = m(β, xi)
• want to estimate β
• ignoring the underlying density, one uses the Iteratively Reweighted Least Squares estimator:
βIRLS = arg min_β ∑_{i=1}^n vi(φ)^{-1} (Yi − m(β, xi))^2
• one can show that under this setup, as well, βIRLS = βMLE
• the proof is very similar to the proof of βWLS = βMLE, which we did before; it is left as an assignment problem
IRLS II

• here we need to solve the following normal equations:
∑_{i=1}^n vi(φ)^{-1} (Yi − m(β, xi)) ∂m(β, xi)/∂βj = 0,   j = 1, 2, . . . , p   (3)
• the problem is that the normal equations (3) are not easily solved for β
• one could use the NR (Newton-Raphson) algorithm; instead we are going to use something different
IRLS III

• a new iterative route:
• let the current update be β_{n−1}
• "linearize the problem" using a Taylor expansion:
m(β, xi) ≈ m(β_{n−1}, xi) + (β − β_{n−1})^T [∇β m(β_{n−1}, xi)]
• now solve the simpler problem:
β_n = arg min_β ∑_{i=1}^n vi(φ)^{-1} { Yi − m(β_{n−1}, xi) + β_{n−1}^T [∇β m(β_{n−1}, xi)] − β^T [∇β m(β_{n−1}, xi)] }^2
IRLS IV

• the simpler problem can be solved with the following normal equations:
∑_{i=1}^n vi(φ)^{-1} { Yi − m(β_{n−1}, xi) + β_{n−1}^T [∇β m(β_{n−1}, xi)] − β^T [∇β m(β_{n−1}, xi)] } ∂m(β_{n−1}, xi)/∂βj = 0,   j = 1, 2, . . . , p   (4)
• now take:
(X_{n−1})ij = ∂m(β_{n−1}, xi)/∂βj
(W_{n−1})ij = vi(φ)^{-1} if i = j, 0 otherwise
(Y_{n−1})i = Yi − m(β_{n−1}, xi)
IRLS V

• equation (4) amounts to solving (why?):
X_{n−1}^T W_{n−1} Y_{n−1} = (X_{n−1}^T W_{n−1} X_{n−1}) (β_n − β_{n−1})
=⇒ β_n − β_{n−1} = (X_{n−1}^T W_{n−1} X_{n−1})^{-1} X_{n−1}^T W_{n−1} Y_{n−1}
=⇒ β_n = β_{n−1} + (X_{n−1}^T W_{n−1} X_{n−1})^{-1} X_{n−1}^T W_{n−1} Y_{n−1}   (5)
• so the second term above looks like the WLS solution of regressing Y_{n−1} on X_{n−1} with weights W_{n−1}; we iterate this procedure, hence the name
• the IRLS algorithm:
– start with a properly chosen initial β_0 and apply the above updating scheme (until convergence) to get β_0 −→ β_1 −→ β_2 −→ · · · −→ βIRLS
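The updating scheme above can be sketched in a short program; everything below (the mean function m(β, xi) = exp(β^T xi), the data, and the constants) is an illustrative choice of mine, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 2
x = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3])

def m(beta, x):
    # illustrative nonlinear mean: m(beta, x_i) = exp(beta^T x_i)
    return np.exp(x @ beta)

def grad_m(beta, x):
    # row i holds the gradient of m(beta, x_i) with respect to beta
    return np.exp(x @ beta)[:, None] * x

v = np.full(n, 0.05)                       # known variances v_i(phi)
Y = m(beta_true, x) + rng.normal(scale=np.sqrt(v))

beta = np.zeros(p)                         # beta_0
for _ in range(100):
    Xn = grad_m(beta, x)                   # (X_{n-1})_ij
    W = np.diag(1.0 / v)                   # (W_{n-1})_ii = v_i(phi)^{-1}
    R = Y - m(beta, x)                     # (Y_{n-1})_i
    step = np.linalg.solve(Xn.T @ W @ Xn, Xn.T @ W @ R)   # WLS step of update (5)
    beta = beta + step
    if np.linalg.norm(step) < 1e-12:       # stop once the update stalls
        break
```

As the slides note below, convergence depends on the starting value; β_0 = 0 works here only because the illustrative problem is well-behaved.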
IRLS VI

• note from equation (5) that the update is of NR type; IRLS is a so-called Newton-Raphson-like algorithm
• IRLS may or may not converge depending on the starting value, much like NR
Example I
• Heteroskedastic Non-linear Least Squares: for i = 1, 2, . . . , n we have,
Yi | xi ~ i.n.i.d. Normal1(θi, σ^2 · k(xi)), for some known constant σ^2 and a known function k(·) with k : R^p −→ (0, ∞)
• θi = E(Yi | xi) = m(β, xi), for a known non-linear function m(·, ·) with first derivatives
• want to estimate β
• here, for computing βIRLS (= βMLE, why?), we will need:
(X_{n−1})ij = ∂m(β_{n−1}, xi)/∂βj
(W_{n−1})ij = 1/(σ^2 · k(xi)) if i = j, 0 otherwise
(Y_{n−1})i = Yi − m(β_{n−1}, xi)
IRLS and Scoring I
• consider the Generalized Linear Model (GLM) setup (a quick recap):
– random component: the f(Yi | θi) come from a NEF, with θi = E(Yi | xi)
– systematic component: ηi = β^T xi, also called the linear predictor
– link function: an invertible function g(·) with first derivative such that ηi = g(θi)
• let Var(Yi | xi) = vi(β, φ), for some known parameter φ
• want to estimate β
• going to use scoring to find the MLE: βMLE
IRLS and Scoring II
• the log likelihood and its derivative, the score:
∑_{i=1}^n log(f(Yi | θi)) = ∑_{i=1}^n {log(h(Yi)) + P(θi) Yi − Q(θi)}   (6)
=⇒ ∂/∂βj ∑_{i=1}^n log(f(Yi | θi)) = ∑_{i=1}^n ∂/∂βj {P(θi) Yi − Q(θi)}
= ∑_{i=1}^n vi(β, φ)^{-1} (Yi − E(Yi | xi)) di xij   (why?)
= uj, say,
here di := ∂θi/∂ηi, ∀ i, and the di and uj are both functions of β
IRLS and Scoring III
• if v(·, ·) doesn't depend on β (assume this from now on), then the information matrix entries simplify to:
I(β)kj = E[−∂uj/∂βk] = ∑_{i=1}^n vi(φ)^{-1} di^2 xij xik   (why?)
• in case v(·, ·) does depend on β, one needs to carefully compute the information matrix entries on a case-by-case basis
IRLS and Scoring IV
• define:
(X)ij = xij
(W_{n−1})ij = vi(φ)^{-1} di(β_{n−1})^2 if i = j, 0 otherwise
(R_{n−1})i = (Yi − g^{-1}(β_{n−1}^T xi)) / di(β_{n−1})
• so we have (why?):
I(β_{n−1}) = X^T W_{n−1} X
u(β_{n−1})j = ∑_{i=1}^n vi(φ)^{-1} (Yi − g^{-1}(β_{n−1}^T xi)) di(β_{n−1}) xij
=⇒ u(β_{n−1}) = X^T W_{n−1} R_{n−1}
IRLS and Scoring V
• now the scoring update satisfies:
β_n = β_{n−1} + [I(β_{n−1})]^{-1} u(β_{n−1})
=⇒ β_n = β_{n−1} + (X^T W_{n−1} X)^{-1} X^T W_{n−1} R_{n−1}
• so the scoring update for the MLE reduces to an IRLS update for NEF densities
Example I
• Logistic Regression:
– for i = 1, 2, . . . , n we have, Yi | xi ~ i.n.i.d. Bernoulli(θi)
– we have ηi = β^T xi
– also, ηi = g(θi) = log(θi / (1 − θi)), the well-known logit transform
• note that if we take ηi = g(θi) = Φ^{-1}(θi), the well-known probit transform, then we will have the probit regression model (here Φ^{-1}(·) is the inverse cdf of the Normal1(0, 1) distribution)
• what will be the expressions for W_{n−1} and R_{n−1} in this case?
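A runnable sketch of the scoring updates for the logistic model on simulated data (data and constants are illustrative choices; for the Bernoulli/logit case di = θi(1 − θi) = vi, so (W)ii = vi^{-1} di^2 = θi(1 − θi)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.5])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

beta = np.zeros(p)                              # starting value beta_0
for _ in range(50):
    theta = 1.0 / (1.0 + np.exp(-(X @ beta)))   # theta_i = g^{-1}(eta_i)
    d = theta * (1.0 - theta)                   # d_i = dtheta/deta; also v_i here
    W = np.diag(d)                              # (W)_ii = v_i^{-1} d_i^2 = d_i
    R = (Y - theta) / d                         # working residual (R_{n-1})_i
    step = np.linalg.solve(X.T @ W @ X, X.T @ W @ R)
    beta = beta + step                          # scoring / IRLS update
    if np.linalg.norm(step) < 1e-12:
        break
```

At convergence, beta is the logistic-regression MLE; with larger n it concentrates around beta_true.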
References
[1] Edwin L. Bradley. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. Journal of the American Statistical Association, 68:199–200, 1973.
[2] A. Charnes, E. L. Frome, and P. L. Yu. The equivalence of generalized least squares and maximum likelihood estimates in the exponential family. Journal of the American Statistical Association, 71:169–171, 1976.
[3] P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical Society, Series B, Methodological, 46:149–192, 1984.