Linear algebra Probability theory Statistical inference
Lecture 2
Let X be the n by 1 matrix (really, just a vector) with all entries equal to 1,

X = (1, 1, …, 1)^T.

And consider the span of the columns (really, just one column) of X: the set of all vectors of the form

Xβ_0

(where β_0 here is any real number).
Consider some other n-dimensional vector Y,

Y = (Y_1, Y_2, …, Y_n)^T.

Maybe Y is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to Y?

That is, what value of β_0 minimizes the (squared) distance between Xβ_0 and Y,

||Y − Xβ_0||^2 = (Y − Xβ_0)^T (Y − Xβ_0) = ∑_{i=1}^n (Y_i − β_0)^2.
Let's solve for the value of β_0 that minimizes the distance

∑_{i=1}^n (Y_i − β_0)^2

by differentiating with respect to β_0 and setting to zero:

−2 ∑_{i=1}^n (Y_i − β_0) = 0.

In matrix notation, differentiate

(Y − Xβ_0)^T (Y − Xβ_0)

with respect to β_0 and set to zero:

−2 X^T (Y − Xβ_0) = 0.
Solving, we obtain for the nearest vector in the subspace

Xβ_0 = (1, 1, …, 1)^T β_0,

where

β_0 = (∑_{i=1}^n Y_i)/n,

or equivalently,

Xβ_0 = X(X^T X)^{−1} X^T Y.
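As a quick numerical check (a minimal numpy sketch, not part of the lecture; the data are made up), the projection formula X(X^T X)^{−1} X^T Y with X the column of ones reproduces the sample mean:

```python
import numpy as np

# X is the n-by-1 column of ones; project Y onto its span.
Y = np.array([2.0, 4.0, 6.0, 8.0])
X = np.ones((4, 1))

# beta0 = (X^T X)^{-1} X^T Y, computed by solving the normal equations.
beta0 = np.linalg.solve(X.T @ X, X.T @ Y)
proj = X @ beta0  # nearest vector in the span of the ones vector

# The minimizer is just the sample mean, repeated n times.
assert np.isclose(beta0[0], Y.mean())
assert np.allclose(proj, np.full(4, Y.mean()))
```

Solving the normal equations with `np.linalg.solve` avoids forming the matrix inverse explicitly.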
What if we take a more general (but still one-column) X? Suppose that the entries of X are arbitrary numbers X_{i1},

X = (X_{11}, X_{21}, …, X_{n1})^T.

And consider the span of the columns of X: the set of all vectors of the form

Xβ_1

(where again β_1 here is any real number).
And as before, consider some other n-dimensional vector Y,

Y = (Y_1, Y_2, …, Y_n)^T.

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of β_1 minimizes the (squared) distance between Xβ_1 and Y,

||Y − Xβ_1||^2 = (Y − Xβ_1)^T (Y − Xβ_1) = ∑_{i=1}^n (Y_i − β_1 X_{i1})^2.
Let's solve for the value of β_1 that minimizes the distance

∑_{i=1}^n (Y_i − β_1 X_{i1})^2

by differentiating with respect to β_1 and setting to zero:

−2 ∑_{i=1}^n X_{i1} (Y_i − β_1 X_{i1}) = 0.

In matrix notation, differentiate

(Y − Xβ_1)^T (Y − Xβ_1)

with respect to β_1 and set to zero:

−2 X^T (Y − Xβ_1) = 0.
Solving, we obtain for the nearest vector in the subspace

Xβ_1 = (X_{11}, X_{21}, …, X_{n1})^T β_1,

where

β_1 = (∑_{i=1}^n Y_i X_{i1}) / (∑_{i=1}^n X_{i1}^2),

or equivalently,

Xβ_1 = X(X^T X)^{−1} X^T Y.
Now let's go to an n by 2 matrix X, whose i-th row is (1, X_{i1}):

X =
( 1  X_{11} )
( 1  X_{21} )
(  ⋮    ⋮   )
( 1  X_{n1} ).

And consider the span of the columns of X: the set of all vectors of the form

Xβ,

where here β is the two-dimensional column vector (β_0, β_1)^T.
And as before, consider some other n-dimensional vector Y,

Y = (Y_1, Y_2, …, Y_n)^T.

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of the two-dimensional β minimizes the (squared) distance between Xβ and Y,

||Y − Xβ||^2 = (Y − Xβ)^T (Y − Xβ) = ∑_{i=1}^n (Y_i − [β_0 + β_1 X_{i1}])^2.
Let's solve for the value of β that minimizes the distance

∑_{i=1}^n (Y_i − [β_0 + β_1 X_{i1}])^2

by taking the gradient with respect to β and setting it to zero:

−2 ∑_{i=1}^n (Y_i − [β_0 + β_1 X_{i1}]) = 0

−2 ∑_{i=1}^n (Y_i − [β_0 + β_1 X_{i1}]) X_{i1} = 0.

In matrix notation, take the gradient of

(Y − Xβ)^T (Y − Xβ)

with respect to β and set it to zero:

−2 X^T (Y − Xβ) = 0.
Solving, we obtain for the nearest vector in the subspace

Xβ = (1, 1, …, 1)^T β_0 + (X_{11}, X_{21}, …, X_{n1})^T β_1,

where

β_0 = \bar{Y} − β_1 \bar{X}_1

β_1 = (∑_{i=1}^n (X_{i1} − \bar{X}_1) Y_i) / (∑_{i=1}^n (X_{i1} − \bar{X}_1)^2)

(with \bar{Y} and \bar{X}_1 the averages of the Y_i and of the X_{i1}), or equivalently,

Xβ = X(X^T X)^{−1} X^T Y.
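The closed-form intercept and slope can be checked against the matrix solution (a minimal numpy sketch, not part of the lecture; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
Y = 1.0 + 2.0 * x + rng.normal(size=20)  # made-up data for illustration

# Matrix solution: beta = (X^T X)^{-1} X^T Y with an intercept column.
X = np.column_stack([np.ones(20), x])
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Closed-form slope and intercept from the derivation above.
b1 = np.sum((x - x.mean()) * Y) / np.sum((x - x.mean()) ** 2)
b0 = Y.mean() - b1 * x.mean()

assert np.allclose(beta, [b0, b1])
```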
In general, for an n by p matrix X, the vector in the span of the columns of X nearest to Y, the so-called projection of Y onto the span of the columns of X, is the vector

Xβ,

where β is the minimizer of

||Y − Xβ||^2 = (Y − Xβ)^T (Y − Xβ).

If we take the gradient with respect to β and set it to zero, we arrive at

X^T (Y − Xβ) = 0,

from which it follows that

β = (X^T X)^{−1} X^T Y.
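For a general p-column X, the normal-equations solution agrees with a library least-squares routine, and the residual is orthogonal to the columns of X (a minimal numpy sketch, not part of the lecture; the design matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))  # an arbitrary n-by-p design matrix
Y = rng.normal(size=30)

# beta = (X^T X)^{-1} X^T Y via the normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Agrees with the library least-squares solver.
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta, beta_ls)

# The residual Y - X beta is orthogonal to every column of X.
assert np.allclose(X.T @ (Y - X @ beta), 0.0)
```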
Let Σ = D be a diagonal matrix with all positive entries.
• Note that Σ is a simple example of a symmetric, positive definite matrix.
• Note that we could write Σ = IDI^T, where I is the identity matrix.
• Note that the columns of I are orthogonal and of unit length,
• they are eigenvectors,
• with eigenvalues equal to the corresponding elements of D.
Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of I,

c = c_1 (1, 0, …, 0)^T + c_2 (0, 1, …, 0)^T + ⋯ + c_p (0, 0, …, 1)^T.

What happens when you compute Σc? You get

Σc = d_1 c_1 (1, 0, …, 0)^T + d_2 c_2 (0, 1, …, 0)^T + ⋯ + d_p c_p (0, 0, …, 1)^T.
And if you further compute c^T Σ c, you get

c^T Σ c = d_1 c_1 c^T (1, 0, …, 0)^T + d_2 c_2 c^T (0, 1, …, 0)^T + ⋯ + d_p c_p c^T (0, 0, …, 1)^T
= c_1^2 d_1 + c_2^2 d_2 + … + c_p^2 d_p.
Suppose you wanted to maximize c^T Σ c among unit length c. That is, how do you find c to maximize

c_1^2 d_1 + c_2^2 d_2 + … + c_p^2 d_p

subject to the constraint that

c_1^2 + c_2^2 + … + c_p^2 = 1?

Take c to be the eigenvector associated with the largest d!
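A small numerical illustration (not part of the lecture; the diagonal entries are made up): over many random unit vectors, c^T Σ c never exceeds the largest diagonal entry, which the corresponding standard basis vector attains.

```python
import numpy as np

Sigma = np.diag([3.0, 1.0, 0.5])  # diagonal Sigma with positive entries
rng = np.random.default_rng(2)

# Many random unit vectors c.
C = rng.normal(size=(1000, 3))
C /= np.linalg.norm(C, axis=1, keepdims=True)

# c^T Sigma c for every c at once.
vals = np.einsum('ij,jk,ik->i', C, Sigma, C)

# The eigenvector for the largest d is the first standard basis vector.
e1 = np.array([1.0, 0.0, 0.0])
assert vals.max() <= e1 @ Sigma @ e1 + 1e-12
assert np.isclose(e1 @ Sigma @ e1, 3.0)
```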
Let's start all over again. But this time, we'll not take Σ = D, a diagonal matrix with all positive entries. Instead, take Σ = PDP^T, where D is again a diagonal matrix with all positive entries, and P is a matrix whose columns are orthonormal (and span p-dimensional space).
• Note that Σ is a more complicated example of a symmetric, positive definite matrix.
• Note that we write Σ = PDP^T, where P is not the identity matrix any more, but rather some other orthonormal matrix.
• Note that the columns of P are by definition orthogonal and of unit length.
• And just like the columns of I were eigenvectors, so are the columns of P,
• again with eigenvalues equal to the corresponding elements of D.
Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of P (not of I now, but rather of P),

c = c_1 P_1 + c_2 P_2 + ⋯ + c_p P_p.

(The columns of P are a basis for p-dimensional space.)

What happens when you compute Σc? You get

Σc = PDP^T c = PDP^T (c_1 P_1 + c_2 P_2 + ⋯ + c_p P_p)
= PD (c_1, c_2, …, c_p)^T
= P (d_1 c_1, d_2 c_2, …, d_p c_p)^T
= c_1 d_1 P_1 + c_2 d_2 P_2 + ⋯ + c_p d_p P_p.
And if you further compute c^T Σ c, you get

c^T Σ c = d_1 c_1 c^T P_1 + d_2 c_2 c^T P_2 + ⋯ + d_p c_p c^T P_p
= c_1^2 d_1 + c_2^2 d_2 + … + c_p^2 d_p.
Suppose you wanted to maximize c^T Σ c as a function of unit length vectors c. That is, how do you find c to maximize

c_1^2 d_1 + c_2^2 d_2 + … + c_p^2 d_p

subject to the constraint that

c_1^2 + c_2^2 + … + c_p^2 = 1?

Again, take c to be the eigenvector associated with the largest d!
One last fact that will be relevant when we use these results: every symmetric positive definite matrix Σ can be written in the form PDP^T, where the columns of P are an orthonormal basis for p-dimensional space, the columns are the eigenvectors of Σ, and the diagonal matrix D has as components the corresponding (positive) eigenvalues.
It might help to think of Σ = PDP^T as a linear transformation. Think of how it maps the unit sphere. . .

The transformation corresponding to Σ maps the orthonormal eigenvectors, the columns of P, into stretched or shrunken versions of themselves. That is, Σ maps the unit sphere into an ellipsoid, with axes along the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.

From this point of view, does it make sense that to maximize c^T Σ c for c on the unit sphere, one can do no better than taking c equal to the eigenvector with the largest eigenvalue?
Suppose that a p by p matrix Σ is symmetric, so that Σ = Σ^T. Suppose also that Σ is positive definite, so that for any non-zero p-dimensional vector c, c^T Σ c is greater than zero. Then
• All of the eigenvalues of Σ are real and positive.
• Eigenvectors of Σ are orthogonal (or else share the same eigenvalue).
• We can find p linearly independent, orthogonal, unit-length p-dimensional eigenvectors P_j.
• Let P be the p by p matrix whose columns are the P_j, and let D be the corresponding diagonal matrix whose entries are the eigenvalues.
• Then Σ = PDP^T.
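This spectral decomposition can be verified numerically (a minimal numpy sketch, not part of the lecture; Σ here is an arbitrary constructed example):

```python
import numpy as np

# Build a symmetric positive definite Sigma (arbitrary example).
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)

# eigh returns eigenvalues (ascending) and orthonormal eigenvectors
# for a symmetric matrix.
d, P = np.linalg.eigh(Sigma)

assert np.all(d > 0)                             # positive eigenvalues
assert np.allclose(P.T @ P, np.eye(4))           # orthonormal columns
assert np.allclose(P @ np.diag(d) @ P.T, Sigma)  # Sigma = P D P^T
```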
Suppose you want to maximize c^T Σ c with respect to p-dimensional unit vectors c.
• Local maximizers are given by c = P_j,
• and the corresponding local maxima are the eigenvalues.
Suppose you want to maximize c^T Σ c with respect to p-dimensional unit vectors c such that

c^T Σ P_j = 0

for the P_j corresponding to some set of eigenvectors.
• Local maximizers are given by the P_j associated with the other eigenvectors,
• and the corresponding local maxima are the eigenvalues.
• The linear transformation Σ maps the unit sphere to an ellipsoid with axes the P_j, and with the lengths of the axes equal to twice the eigenvalues.
• For a vector c, P^T c has as its components the a_j for which ∑_{j=1}^p a_j P_j = c.
• So DP^T c stretches or shrinks those a_j by the associated eigenvalues, λ_j,
• and so PDP^T c is ∑_{j=1}^p a_j λ_j P_j.
• In short, ∑_{j=1}^p a_j P_j → ∑_{j=1}^p a_j λ_j P_j.
• Let θ be a (vector of) random variable(s) with (joint) density π(θ).
• Let Y be a (vector of) random variable(s) with (joint) conditional density f_θ(y) given θ.
• The conditional density of θ given Y = y is

π(θ|y) = π(θ) f_θ(y) / ∫ π(θ) f_θ(y) dθ.

• The conditional expectation of θ given Y = y is

E{θ|y} = ∫ θ π(θ|y) dθ.

• And the value of θ that maximizes the posterior density solves

(d/dθ) ln π(θ) + (d/dθ) ln f_θ(y) = 0.
Suppose that X and Y are jointly distributed random variables with joint density f_{XY}(x, y).
• The density of Y is

f_Y(y) = ∫ f_{XY}(x, y) dx,

• the density of X is

f_X(x) = ∫ f_{XY}(x, y) dy,

• and the conditional density of Y given X is

f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x).
The expectations of X and Y and the conditional expectation of Y given X are

E{X} = ∫ x f_X(x) dx

E{Y} = ∫ y f_Y(y) dy

E{Y|X} = ∫ y f_{Y|X}(y|X) dy.

And we have

E{Y} = E{E{Y|X}}.
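A toy simulation of the iterated-expectations identity (not part of the lecture; the model is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# X ~ Bernoulli(0.7); given X, Y is normal with mean 2 + 3*X.
X = (rng.random(n) < 0.7).astype(float)
Y = 2.0 + 3.0 * X + rng.normal(size=n)

# E{E{Y|X}} = 0.3*2 + 0.7*5 = 4.1, which matches the overall mean of Y.
iterated = 0.3 * 2.0 + 0.7 * 5.0
assert abs(Y.mean() - iterated) < 0.05
```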
The law of the unconscious statistician says that

E{g(X)} = ∫ g(x) f_X(x) dx,

so that also

Var{g(X)} = ∫ (g(x) − E{g(X)})^2 f_X(x) dx.
The variance of Y and the conditional variance of Y given X are

Var(Y) = ∫ (y − E{Y})^2 f_Y(y) dy

Var(Y|X) = ∫ (y − E{Y|X})^2 f_{Y|X}(y|X) dy.
The covariance between two random variables X and Y is defined as

Cov(X, Y) = E{(Y − E{Y})(X − E{X})},

and we have

Var(Y) = E{Var(Y|X)} + Var(E{Y|X}).
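The law of total variance can also be checked by simulation (a minimal numpy sketch, not part of the lecture; the model is made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Toy model: X ~ Bernoulli(0.7), and Y | X ~ Normal(2 + 3*X, 1).
X = (rng.random(n) < 0.7).astype(float)
Y = 2.0 + 3.0 * X + rng.normal(size=n)

within = 1.0                # E{Var(Y|X)}: the conditional variance is 1
between = 9.0 * 0.7 * 0.3   # Var(E{Y|X}) = Var(2 + 3X) = 9 Var(X)

assert abs(Y.var() - (within + between)) < 0.05
```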
For a vector of random variables X, we define the expectation vector: E{X} is the vector with entries equal to the expectations of the components of X.
And we define the covariance matrix, Cov(X),

Cov(X) = E{(X − E{X})(X − E{X})^T},

with diagonal entries equal to the variances of the components of X, and the covariances arranged in the off-diagonals.

Note that a covariance matrix is symmetric, and, as long as the components of X are not linear functions of each other, positive definite.
Independence
• Random variables are independent if their joint density is equal to the product of their marginals.
• Independence captures the notion of one random variable's value having no implications for the value of the other.
• If two random variables are independent, their covariance is equal to zero.
If X is a q-dimensional vector of random variables with expectation vector µ and covariance matrix Σ, and if M is an r by q matrix of constants, and ν is an r-dimensional vector of constants, then

E{MX + ν} = Mµ + ν

and

Cov(MX + ν) = M Cov(X) M^T.
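A simulation check of both identities (a minimal numpy sketch, not part of the lecture; µ, Σ, M, and ν are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=n)  # rows are draws of X

M = np.array([[1.0, 1.0],
              [2.0, -1.0],
              [0.0, 3.0]])        # r-by-q, here 3-by-2
nu = np.array([0.5, 0.0, -2.0])

Z = X @ M.T + nu  # each row is M x + nu

assert np.allclose(Z.mean(axis=0), M @ mu + nu, atol=0.02)
assert np.allclose(np.cov(Z.T), M @ Sigma @ M.T, atol=0.1)
```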
Chebyshev's inequality: for a random variable X with mean µ and finite variance,

P{|X − µ| ≥ ε} ≤ Var(X)/ε^2.
Suppose the X_i are all independent, i from 1 to n, and suppose that each X_i has a finite variance (which we will denote σ_i^2). Then the variance of \bar{X}, that is, the variance of

(1/n) ∑_{i=1}^n X_i,

is equal to

(1/n^2) ∑_{i=1}^n σ_i^2.

And, in particular, if the σ_i^2 have an upper bound in common, then the variance tends to zero for large values of n.
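The σ^2/n behavior is easy to see in simulation (a minimal numpy sketch, not part of the lecture; the distribution and constants are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2, n, reps = 4.0, 100, 20_000

# Draw many independent samples of size n and record each sample mean.
means = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

# The variance of the sample mean is sigma^2 / n.
assert abs(means.var() - sigma2 / n) < 0.005
```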
In this situation, from Chebyshev's inequality, we find that

P{|\bar{X} − µ| ≥ ε}

is also small. Here, µ is the average of the expectations of the X_i.

In short, by taking more data, we can learn.
With minimal technical assumptions about the finite variances of the independent X_i, we can go beyond the fact that \bar{X} − µ tends to zero and consider how it varies around zero:

P{ √n ( (1/n) ∑_{i=1}^n X_i − (1/n) ∑_{i=1}^n E{X_i} ) ≤ x √( (1/n) ∑_{i=1}^n σ_i^2 ) } → ∫_{−∞}^{x} (e^{−t^2/2} / √(2π)) dt.

Not only can we learn, we can know how well we've learned!
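A simulation of the central limit theorem in the i.i.d. case (a minimal numpy sketch, not part of the lecture; the Bernoulli(0.2) choice is made up to emphasize a non-normal starting distribution):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 10_000, 50_000

# Sums of n Bernoulli(0.2) draws, standardized by sqrt(n * p * (1 - p)).
S = rng.binomial(n, 0.2, size=reps)
Z = (S / n - 0.2) * np.sqrt(n) / np.sqrt(0.2 * 0.8)

# P{Z <= 1} should be close to the standard normal CDF at 1 (about 0.8413).
assert abs(np.mean(Z <= 1.0) - 0.8413) < 0.02
```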
When you analyze data,
• There is data,
• A statistical method is applied,
• There are the results of your method,
• You also produce some indication of the precision of your results,
• The results and precision estimates are used to draw conclusions.

How do you know what method to apply?
Given a statistical model, and given an analytic goal, there is (almost always) an appropriate method already implemented in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera.
• What is a statistical model?
• What is an analytic goal?
• How does one elicit them from the "client"?
A probability model has
• A sample space for the observables (the random variables),
• A joint distribution on the sample space.
• The joint distribution reflects all the sources of variability that are inherent in the random variables.
A statistical model is a family of probability models
• on the sample space for the observables (the random variables).
• What is specified about the joint distribution reflects what is known about the distribution of the random variables.
• What is unspecified reflects what is unknown about the distribution of the random variables.
• That we have random variables reflects that even if we knew everything that could be known about the distribution, there would still be randomness.

Parameters index the possible distributions for the data.
What must be considered in devising a statistical model? What is known, and what is unknown, about
• The sources of variability,
• The sampling plan,
• Mechanisms underlying the phenomena under examination,
• Counterfactuals: issues of causation and confounding,
• Practical issues relating to complexity, sample size, and computation.
The analytic goal is a statement of the researchers' goal in terms of the parameters. The mathematical version of this is "decision theory":
• Parameters Θ
• indexing probability models on outcomes Y,
• Possible "actions" A,
• A loss associated with parameter-action pairs, L(θ, a),
• Decision rules mapping data to actions, d : Y → A.
• We evaluate decision rules via

E_θ{L(θ, d(Y))}.
• n subjects, randomly assigned to treatment or placebo.
• Cure or failure recorded for all.
• Researchers wish to convince the EPA that the treatment is efficacious, but only if it really is.

Model? Analytic goal?
• n patient charts chosen at random from a physician's practice.
• Total gains generated by "up-coding" recorded for each.
• Prosecutors need to assess the total gains in order to recommend the amount to be recovered.

Model? Analytic goal?
• n loan applications, or cell histologies, or examples of past weather patterns.
• Associated foreclosure outcomes, or cancer outcomes, or rainfall.
• Researchers want to help others do prediction with new data.

Model? Analytic goal?
• A parameterization is a mapping from the parameter space to the probability models for the data.
• The likelihood is the density (or probability mass function) of the observed data, viewed as a function of the parameter.
• The maximum likelihood estimator is the value of the parameter that maximizes the likelihood.
• We usually find the MLE by differentiating the logarithm of the likelihood and setting it to zero.
• Under standard regularity conditions, maximum likelihood estimates are
• Asymptotically unbiased (E_θ{\hat{θ}} → θ), and
• Asymptotically efficient, in the sense of having the smallest variance among (asymptotically) unbiased estimates.
Ordinary least squares linear regression as maximum likelihood:
• The (conditional) model,
• The likelihood,
• The score equations.
Mixture models and the EM algorithm:
• Mixture models when the component identifiers are available,
• The likelihood when they are not,
• An iterative approach to estimation.
Suppose you really believe that the parameter has a distribution π(θ), and that nature (or god, or . . . ) chose from that distribution when it created θ.
• And suppose you wanted to estimate θ.
• Suppose we have some loss function, say L(\hat{θ}, θ),
• so that we need to find an estimate \hat{θ} to minimize E{L(\hat{θ}, θ)}.
• What expectation are we talking about? The expectation over θ!
• Find \hat{θ} to minimize E{L(\hat{θ}(Y), θ)|Y}.
• Or maybe just approximate that optimal choice with the expectation, or mode, or . . .
Bayes' theorem:

θ ∼ π(θ)

Y|θ ∼ f_θ(y)

π(θ|y) = π(θ) f_θ(y) / ∫ π(θ) f_θ(y) dθ.
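Bayes' theorem can be exercised numerically with a conjugate example (a minimal numpy sketch, not part of the lecture; the beta-binomial model and the numbers are made up):

```python
import numpy as np
from math import comb

# Prior: theta ~ Beta(2, 2); data: y = 7 successes in 10 Bernoulli trials.
# The posterior is known to be Beta(2 + 7, 2 + 3); here we apply Bayes'
# theorem directly on a grid and check the posterior mean.
a, b, n_trials, y = 2, 2, 10, 7

theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
dt = theta[1] - theta[0]

prior = theta ** (a - 1) * (1 - theta) ** (b - 1)  # pi(theta), up to a constant
lik = comb(n_trials, y) * theta ** y * (1 - theta) ** (n_trials - y)

post = prior * lik
post /= post.sum() * dt  # divide by the integral of pi(theta) f_theta(y)

post_mean = (theta * post).sum() * dt
assert abs(post_mean - (a + y) / (a + b + n_trials)) < 1e-3
```

Note that the normalizing integral in the denominator never needs a closed form; a grid approximation suffices.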
• And suppose you wanted to estimate θ after observing some data generated according to f_θ(y).
• Suppose we have some loss function, say L(\hat{θ}, θ),
• so that we need to find a function \hat{θ}(y) to minimize E{L(\hat{θ}(Y), θ)}.
• What expectation are we talking about? The expectation over θ and Y!

E{L(\hat{θ}(Y), θ)} = E{E{L(\hat{θ}(Y), θ)|θ}} = E{E{L(\hat{θ}(Y), θ)|Y}}

• Find \hat{θ}(Y) to minimize E{L(\hat{θ}(Y), θ)|Y}.
• Or maybe just approximate that optimal choice with the posterior expectation, or mode, or . . .