Lecture 2: Data Mining
Linear algebra, Probability theory, Statistical inference

Transcript of Lecture 2, Data Mining, Columbia University

Page 1: Lecture 2 Data Mining

Linear algebra Probability theory Statistical inference

Lecture 2

Page 2: Lecture 2 Data Mining

Let X be the n by 1 matrix (really, just a vector) with all entries equal to 1,

$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$

And consider the span of the columns (really, just one column) of X, the set of all vectors of the form

$$X\beta_0$$

(where $\beta_0$ here is any real number).

Page 3: Lecture 2 Data Mining

Consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to Y?

That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and Y,

$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T(Y - X\beta_0) = \sum_{i=1}^{n} (Y_i - \beta_0)^2.$$

Page 4: Lecture 2 Data Mining

Let's solve for the value of $\beta_0$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - \beta_0)^2$$

by differentiating with respect to $\beta_0$ and setting to zero:

$$-2\sum_{i=1}^{n} (Y_i - \beta_0) = 0.$$

In matrix notation, differentiate

$$(Y - X\beta_0)^T(Y - X\beta_0)$$

with respect to $\beta_0$ and set to zero:

$$-2X^T(Y - X\beta_0) = 0.$$

Page 5: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_0,
\qquad \text{where} \qquad
\beta_0 = \frac{\sum_{i=1}^{n} Y_i}{n},$$

or equivalently,

$$X\beta_0 = X(X^TX)^{-1}X^TY.$$
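As a quick numerical check (a minimal sketch with assumed toy numbers, using numpy; not part of the original slides), projecting Y onto the span of the all-ones column reproduces the sample mean:

```python
import numpy as np

# Minimal sketch (toy numbers assumed, not from the lecture): the projection of
# Y onto the span of the all-ones vector has every entry equal to the mean of Y.
n = 5
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X = np.ones((n, 1))

beta0 = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
projection = X @ beta0                       # nearest vector in the span

print(beta0)                                 # [6.]
print(projection)                            # [6. 6. 6. 6. 6.]
print(np.isclose(beta0[0], Y.mean()))        # True: beta_0 is the sample mean
```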

Page 6: Lecture 2 Data Mining

What if we take a more general (but one-dimensional) X: suppose that the entries of X are arbitrary numbers $X_{i1}$,

$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}.$$

And consider the span of the columns of X, the set of all vectors of the form

$$X\beta_1$$

(where again $\beta_1$ here is any real number).

Page 7: Lecture 2 Data Mining

And as before, consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and Y,

$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T(Y - X\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2.$$

Page 8: Lecture 2 Data Mining

Let's solve for the value of $\beta_1$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2$$

by differentiating with respect to $\beta_1$ and setting to zero:

$$-2\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})X_{i1} = 0.$$

In matrix notation, differentiate

$$(Y - X\beta_1)^T(Y - X\beta_1)$$

with respect to $\beta_1$ and set to zero:

$$-2X^T(Y - X\beta_1) = 0.$$

Page 9: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \beta_1,
\qquad \text{where} \qquad
\beta_1 = \frac{\sum_{i=1}^{n} Y_i X_{i1}}{\sum_{i=1}^{n} X_{i1}^2},$$

or equivalently,

$$X\beta_1 = X(X^TX)^{-1}X^TY.$$

Page 10: Lecture 2 Data Mining

Now let's go to an n by 2 matrix X:

$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}.$$

And consider the span of the columns of X, the set of all vectors of the form

$$X\beta,$$

where here $\beta$ is the two-dimensional column vector $(\beta_0, \beta_1)^T$.

Page 11: Lecture 2 Data Mining

And as before, consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and Y,

$$\|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta) = \sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$

Page 12: Lecture 2 Data Mining

Let's solve for the value of $\beta$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2$$

by taking the gradient with respect to $\beta$ and setting it to zero:

$$-2\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0$$

$$-2\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])X_{i1} = 0$$

In matrix notation, take the gradient of

$$(Y - X\beta)^T(Y - X\beta)$$

with respect to $\beta$ and set to zero:

$$-2X^T(Y - X\beta) = 0.$$

Page 13: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \beta_1,$$

where

$$\beta_0 = \bar{Y} - \beta_1 \bar{X}_1, \qquad
\beta_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)\,Y_i}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2},$$

or equivalently,

$$X\beta = X(X^TX)^{-1}X^TY.$$
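As a numerical check (a minimal sketch with simulated data, using numpy; not from the slides), the closed-form intercept and slope above agree with the matrix solution $(X^TX)^{-1}X^TY$ for an n by 2 design matrix:

```python
import numpy as np

# Minimal sketch (simulated data assumed, not from the lecture): compare the
# closed-form intercept/slope with the normal-equations solution.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)                        # the X_{i1} column
Y = 1.5 + 2.0 * x1 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1])          # columns: all ones and X_{i1}
beta = np.linalg.solve(X.T @ X, X.T @ Y)       # (X^T X)^{-1} X^T Y

b1 = np.sum((x1 - x1.mean()) * Y) / np.sum((x1 - x1.mean())**2)
b0 = Y.mean() - b1 * x1.mean()

print(np.allclose(beta, [b0, b1]))             # True
```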

Page 14: Lecture 2 Data Mining

In general, for an n by p matrix X, the vector in the span of the columns of X nearest to Y, the so-called projection of Y onto the span of the columns of X, is the vector

$$X\beta,$$

where $\beta$ is the minimizer of

$$\|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta).$$

If we take the gradient with respect to $\beta$ and set it to zero we arrive at

$$X^T(Y - X\beta) = 0,$$

from which it follows that

$$\beta = (X^TX)^{-1}X^TY.$$
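A short sketch for the general n by p case (simulated data assumed; not from the slides): the normal-equations solution matches a library least-squares solver, and the residual is orthogonal to every column of X, which is exactly the condition $X^T(Y - X\beta) = 0$:

```python
import numpy as np

# Minimal sketch (simulated data assumed, not from the lecture): general p.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # library least squares

print(np.allclose(beta_normal_eq, beta_lstsq))       # True

# The residual Y - X beta is orthogonal to every column of X:
resid = Y - X @ beta_normal_eq
print(np.allclose(X.T @ resid, 0.0))                 # True
```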

Page 15: Lecture 2 Data Mining

Let Σ = D be a diagonal matrix with all positive entries.
• Note that Σ is a simple example of a symmetric, positive definite matrix.
• Note that we could write $\Sigma = IDI^T$, where I is the identity matrix.
• Note that the columns of I are orthogonal, of unit length, and
• they are eigenvectors,
• with eigenvalues equal to the corresponding elements of D.

Page 16: Lecture 2 Data Mining

Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of I,

$$c = c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

What happens when you compute Σc? You get

$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

Page 17: Lecture 2 Data Mining

And if you further compute $c^T\Sigma c$, you get

$$c^T\Sigma c = d_1 c_1 c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}
= c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$

Page 18: Lecture 2 Data Mining

Suppose you wanted to maximize $c^T\Sigma c$ among unit length c.

That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Take c to be the eigenvector associated with the largest d!

Page 19: Lecture 2 Data Mining

Let's start all over again. But this time, we'll not take Σ = D a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where D is again a diagonal matrix with all positive entries, and P is a matrix whose columns are orthonormal (and span p-dimensional space).
• Note that Σ is a complicated example of a symmetric, positive definite matrix.
• Note that we write $\Sigma = PDP^T$, where P is not the identity matrix any more, but rather some other orthonormal matrix.
• Note that the columns of P are by definition orthogonal, of unit length.
• And just like the columns of I were eigenvectors, so are the columns of P,
• again with eigenvalues equal to the corresponding elements of D.

Page 20: Lecture 2 Data Mining

Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of P (not of I now, but rather of P),

$$c = c_1 P_1 + c_2 P_2 + \cdots + c_p P_p.$$

(The columns of P are a basis for p-dimensional space.)

What happens when you compute Σc? You get

$$\Sigma c = PDP^T c = PDP^T(c_1 P_1 + c_2 P_2 + \cdots + c_p P_p)
= PD \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix}
= P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix}
= c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p.$$

Page 21: Lecture 2 Data Mining

And if you further compute $c^T\Sigma c$, you get

$$c^T\Sigma c = d_1 c_1 c^T P_1 + d_2 c_2 c^T P_2 + \cdots + d_p c_p c^T P_p
= c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$

Page 22: Lecture 2 Data Mining

Suppose you wanted to maximize $c^T\Sigma c$ as a function of unit length vectors c.

That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Again, take c to be the eigenvector associated with the largest d!
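A numerical illustration (a sketch with an assumed random Σ, using numpy; not from the slides): after building a symmetric positive definite $\Sigma = PDP^T$, the eigenvector with the largest eigenvalue maximizes $c^T\Sigma c$ over unit vectors c:

```python
import numpy as np

# Minimal sketch (assumed random example, not from the lecture): Rayleigh
# quotient c^T Sigma c over unit vectors is maximized by the top eigenvector.
rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # symmetric positive definite

eigvals, eigvecs = np.linalg.eigh(Sigma) # ascending eigenvalues, orthonormal columns
top_vec = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue

print(top_vec @ Sigma @ top_vec)         # equals the largest eigenvalue
print(eigvals[-1])

# No random unit vector does better:
c = rng.normal(size=(1000, p))
c /= np.linalg.norm(c, axis=1, keepdims=True)
print(np.max(np.einsum('ij,jk,ik->i', c, Sigma, c)) <= eigvals[-1] + 1e-9)  # True
```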

Page 23: Lecture 2 Data Mining

One last fact that will be relevant when we use these results: every symmetric positive definite matrix Σ can be written in the form $PDP^T$, where the columns of P are an orthonormal basis for p-dimensional space, the columns are the eigenvectors of Σ, and the diagonal matrix D has as components the corresponding (positive) eigenvalues.

Page 24: Lecture 2 Data Mining

It might help to think of $\Sigma = PDP^T$ as a linear transformation. Think of how it maps the unit sphere...

The transformation corresponding to Σ maps the orthonormal eigenvectors, the columns of P, into stretched or shrunken versions of themselves. That is, Σ maps the unit sphere into an ellipsoid, with axes along the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.

From this point of view, does it make sense that to maximize $c^T\Sigma c$ for c on the unit sphere, one can do no better than taking c equal to the eigenvector with the largest eigenvalue?

Page 25: Lecture 2 Data Mining

Suppose that a p by p matrix Σ is symmetric, so that $\Sigma = \Sigma^T$.

Suppose also that Σ is positive definite, so that for any non-zero p-dimensional vector c, $c^T\Sigma c$ is greater than zero. Then
• All of the eigenvalues of Σ are real and positive.
• All of the eigenvectors of Σ are orthogonal (or have the same eigenvalue).
• We can find p linearly independent, orthogonal, unit-length p-dimensional eigenvectors $P_j$.
• Let P be the p by p matrix whose columns are the $P_j$, and let D be the corresponding diagonal matrix whose entries are the eigenvalues.
• Then, $\Sigma = PDP^T$.

Page 26: Lecture 2 Data Mining

Suppose you want to maximize $c^T\Sigma c$ with respect to p-dimensional unit vectors c.
• Local maximizers are given by $c = P_j$,
• and the corresponding local maxima are the eigenvalues.

Page 27: Lecture 2 Data Mining

Suppose you want to maximize $c^T\Sigma c$ with respect to p-dimensional unit vectors c such that

$$c^T\Sigma P_j = 0$$

for the $P_j$ corresponding to some set of eigenvectors.
• Local maximizers are given by the $P_j$ associated with the other eigenvectors,
• and the corresponding local maxima are the eigenvalues.

Page 28: Lecture 2 Data Mining

• The linear transformation Σ maps the unit sphere to an ellipsoid with axes along the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.
• For a vector c, $P^Tc$ has as its components the $a_j$ for which $\sum_{j=1}^{p} a_j P_j = c$.
• So $DP^Tc$ stretches or shrinks those $a_j$ by the associated eigenvalues, $\lambda_j$,
• and so $PDP^Tc$ is $\sum_{j=1}^{p} a_j \lambda_j P_j$.
• In short,

$$\sum_{j=1}^{p} a_j P_j \longrightarrow \sum_{j=1}^{p} a_j \lambda_j P_j.$$

Page 29: Lecture 2 Data Mining

• Let θ be a (vector of) random variable(s) with (joint) density π(θ).
• Let Y be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given θ.
• The conditional density of θ given Y = y is

$$\pi(\theta\mid y) = \frac{\pi(\theta)\, f_\theta(y)}{\int \pi(\theta)\, f_\theta(y)\, d\theta}$$

(a small numerical sketch of this follows below).
• The conditional expectation of θ given Y = y is

$$E\{\theta\mid y\} = \int \pi(\theta\mid y)\, \theta\, d\theta,$$

• and the value of θ that maximizes the posterior solves

$$\frac{d}{d\theta}\ln\pi(\theta) + \frac{d}{d\theta}\ln f_\theta(y) = 0.$$
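As a concrete illustration (a sketch with assumed toy numbers, not from the slides), the posterior for a Bernoulli success probability can be computed on a grid directly from the formula above, along with its posterior mean and its maximizer:

```python
import numpy as np

# Minimal sketch (assumed toy numbers, not from the lecture): posterior for a
# Bernoulli parameter theta given y = 7 successes in n = 10 trials, flat prior.
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                      # pi(theta)
likelihood = theta**7 * (1 - theta)**3           # f_theta(y), up to a constant

numer = prior * likelihood
posterior = numer / np.sum(numer * dtheta)       # divide by the integral over theta

post_mean = np.sum(posterior * theta * dtheta)   # E{theta | y}
post_mode = theta[np.argmax(posterior)]          # maximizer of the posterior

print(round(post_mean, 3))   # about 0.667, the posterior mean
print(round(post_mode, 2))   # about 0.7, the posterior mode (here also the MLE)
```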

Page 30: Lecture 2 Data Mining

Suppose that X and Y are jointly distributed random variables with joint density $f_{XY}(x, y)$. Then
• the density of Y is

$$f_Y(y) = \int f_{XY}(x, y)\, dx,$$

• the density of X is

$$f_X(x) = \int f_{XY}(x, y)\, dy,$$

• and the conditional density of Y given X is

$$f_{XY}(x, y) / f_X(x).$$

Page 31: Lecture 2 Data Mining

The expectations of X and Y, and the conditional expectation of Y given X, are

$$E\{X\} = \int f_X(x)\, x\, dx$$

$$E\{Y\} = \int f_Y(y)\, y\, dy$$

$$E\{Y\mid X\} = \int f_{Y\mid X}(y\mid X)\, y\, dy.$$

And we have

$$E\{Y\} = E\{E\{Y\mid X\}\}.$$

Page 32: Lecture 2 Data Mining

The law of the unconscious statistician says that

$$E\{g(X)\} = \int f_X(x)\, g(x)\, dx,$$

so that also

$$\mathrm{Var}\{g(X)\} = \int f_X(x)\,\big(g(x) - E\{g(X)\}\big)^2\, dx.$$

Page 33: Lecture 2 Data Mining

The variance of Y and the conditional variance of Y given X are

$$\mathrm{Var}(Y) = \int f_Y(y)\,\big(y - E\{Y\}\big)^2\, dy$$

$$\mathrm{Var}(Y\mid X) = \int f_{Y\mid X}(y\mid X)\,\big(y - E\{Y\mid X\}\big)^2\, dy.$$

Page 34: Lecture 2 Data Mining

The covariance between two random variables X and Y is defined as

$$E\{(Y - E\{Y\})(X - E\{X\})\},$$

and we have

$$\mathrm{Var}(Y) = E\{\mathrm{Var}(Y\mid X)\} + \mathrm{Var}(E\{Y\mid X\}).$$
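A quick Monte Carlo check of the last identity (an assumed simulation, not from the slides), with Y = 2X + noise so that E{Y|X} and Var(Y|X) are known exactly:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): check
# Var(Y) = E{Var(Y|X)} + Var(E{Y|X}) for Y = 2X + noise.
rng = np.random.default_rng(3)
n = 1_000_000
X = rng.normal(size=n)                       # X ~ N(0, 1)
Y = 2.0 * X + rng.normal(scale=0.5, size=n)  # Y | X ~ N(2X, 0.25)

# Here E{Y|X} = 2X and Var(Y|X) = 0.25, so the right-hand side is
# E{0.25} + Var(2X) = 0.25 + 4 = 4.25.
lhs = Y.var()
rhs = 0.25 + np.var(2.0 * X)
print(round(lhs, 2), round(rhs, 2))          # both close to 4.25
```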

Page 35: Lecture 2 Data Mining

For a vector of random variables X, we define the expectation vector: E{X} is the vector with entries equal to the expectations of the components of X.

Page 36: Lecture 2 Data Mining

And we define the covariance matrix, Cov(X),

$$E\{(X - E\{X\})(X - E\{X\})^T\},$$

with diagonal entries equal to the variances of the components of X, and the covariances arranged in the off-diagonals.

Note that a covariance matrix is symmetric and, as long as the components of X are not linear functions of each other, positive definite.

Page 37: Lecture 2 Data Mining

Independence
• Random variables are independent if their joint density is equal to the product of their marginals.
• Independence captures the notion of one random variable's value having no implications for the value of the other.
• If two random variables are independent, their covariance is equal to zero.

Page 38: Lecture 2 Data Mining

If X is a q-dimensional vector of random variables with expectation vector µ and covariance matrix Σ, and if M is an r by q matrix of constants, and ν is an r-dimensional vector of constants, then

$$E\{MX + \nu\} = M\mu + \nu$$

and

$$\mathrm{Cov}(MX + \nu) = M\,\mathrm{Cov}(X)\,M^T.$$
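A numerical check (an assumed simulation with made-up M, ν, and Σ; not from the slides) that the sample covariance of MX + ν matches M Cov(X) Mᵀ:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): Cov(MX + nu).
rng = np.random.default_rng(4)
q, r, n = 3, 2, 500_000

Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(q), cov=Sigma_true, size=n)  # n x q

M = rng.normal(size=(r, q))
nu = rng.normal(size=r)
Z = X @ M.T + nu                              # each row is M x_i + nu

print(np.round(np.cov(Z, rowvar=False), 2))   # empirical Cov(MX + nu)
print(np.round(M @ Sigma_true @ M.T, 2))      # M Cov(X) M^T -- nearly identical
```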

Page 39: Lecture 2 Data Mining

Chebyshev's inequality:

$$P\{|X - \mu| \ge \varepsilon\} \le \mathrm{Var}(X)/\varepsilon^2.$$

Page 40: Lecture 2 Data Mining

Suppose the $X_i$ are all independent, i from 1 to n, and suppose that each $X_i$ has a finite variance (which we will denote $\sigma_i^2$). Then the variance of $\bar{X}$, that is, the variance of

$$\frac{1}{n}\sum_{i=1}^{n} X_i,$$

is equal to

$$\frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2.$$

And, in particular, if the $\sigma_i^2$ have an upper bound in common, then the variance tends to zero for large values of n.

Page 41: Lecture 2 Data Mining

In this situation, from Chebyshev's inequality, we find that

$$P\{|\bar{X} - \mu| \ge \varepsilon\}$$

is also small. Here, µ is the average of the expectations of the $X_i$.

In short, by taking more data, we can learn.
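A small simulation (assumed, not from the slides) showing the variance of the sample mean shrinking like 1/n, which is what drives the Chebyshev bound above to zero:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): the variance of
# the mean of n independent draws shrinks like sigma^2 / n.
rng = np.random.default_rng(5)
mu, sigma = 3.0, 2.0

for n in [10, 100, 1000, 10000]:
    means = rng.normal(mu, sigma, size=(1000, n)).mean(axis=1)
    print(n, round(means.var(), 5))   # close to sigma**2 / n = 4 / n
```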

Page 42: Lecture 2 Data Mining

With minimal technical assumptions about the finite variances of the independent $X_i$, we can go beyond the behavior of $\bar{X} - \mu$ to consider not just that it tends to zero, but also how it varies around zero:

$$P\left\{ \frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n} E\{X_i\}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2}} \le x \right\} \longrightarrow \int_{-\infty}^{x} \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt.$$

Not only can we learn, we can know how well we've learned!
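A simulation sketch (assumed toy distribution, not from the slides): standardized sums of independent uniform draws already track the standard normal CDF closely at moderate n:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): central limit
# theorem for sums of independent Uniform(0, 1) draws.
rng = np.random.default_rng(6)
n, reps = 200, 50_000

X = rng.uniform(0.0, 1.0, size=(reps, n))            # E{X_i} = 1/2, Var(X_i) = 1/12
Z = (X.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)     # standardized sums

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(x, round(np.mean(Z <= x), 3))   # close to Phi(x): 0.159, 0.5, 0.841, 0.977
```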

Page 43: Lecture 2 Data Mining


Page 44: Lecture 2 Data Mining

When you analyze data,
• There is data
• A statistical method is applied
• There are the results of your method
• You also produce some indication of the precision of your results
• The results and precision estimates are used to draw conclusions

How do you know what method to apply?

Page 45: Lecture 2 Data Mining


Page 46: Lecture 2 Data Mining

Given a statistical model, and given an analytic goal, there is (almost always) an appropriate method already in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera.

• What is a statistical model?
• What is an analytic goal?
• How does one elicit them from the "client"?

Page 47: Lecture 2 Data Mining


Page 48: Lecture 2 Data Mining

A probability model has
• A sample space for the observables (the random variables)
• A joint distribution on the sample space
• The joint distribution reflects all the sources of variability that are inherent in the random variables.

Page 49: Lecture 2 Data Mining

A statistical model is a family of probability models
• on the sample space for the observables (the random variables)
• What is specified about the joint distribution reflects what is known about the distribution of the random variables
• What is unspecified reflects what is unknown about the distribution of the random variables
• That we have random variables reflects that even if we knew everything that could be known about the distribution, there would still be randomness.

Parameters index the possible distributions for the data.

Page 50: Lecture 2 Data Mining

What must be considered in devising a statistical model? What is known, and what is unknown, about
• The sources of variability
• The sampling plan
• Mechanisms underlying the phenomena under examination
• Counterfactuals: issues of causation and confounding
• Practical issues relating to complexity, sample size, and computation

Page 51: Lecture 2 Data Mining

The analytic goal is a statement of the researchers' goal in terms of the parameters.

The mathematical version of this is "decision theory."
• Parameters Θ,
• indexing probability models on outcomes Y
• Possible "actions" A
• A loss associated with parameter-action pairs, L(θ, a)
• Decision rules map data to actions, $d : Y \longrightarrow A$.
• We evaluate decision rules via

$$E_\theta\{L(\theta, d(Y))\}.$$

Page 52: Lecture 2 Data Mining

• n subjects, randomly assigned to treatment or placebo
• Cure or Failure recorded for all
• Researchers wish to convince the EPA that the treatment is efficacious, but only if it really is

Model? Analytic Goal?

Page 53: Lecture 2 Data Mining

• n patient charts chosen at random from a physician's practice
• Total gains generated by "up-coding" recorded for each
• Prosecutors need to assess the total gains in order to recommend the amount to be recovered

Model? Analytic Goal?

Page 54: Lecture 2 Data Mining

• n loan applications, or cell histologies, or examples of past weather patterns
• Associated foreclosure outcomes, or cancer outcomes, or rainfall
• Researchers want to help others do prediction with new data

Model? Analytic Goal?

Page 55: Lecture 2 Data Mining

• A parameterization is a mapping from the parameter space to the probability models for the data.
• The likelihood is the density (or probability mass function) of the observed data as a function of the parameter.
• The maximum likelihood estimator is the value of the parameter that maximizes the likelihood.
• We usually find the MLE by differentiating the logarithm of the likelihood and setting it to zero (a small numerical sketch follows below).
• Maximum likelihood estimates are
• asymptotically unbiased ($E_\theta\{\hat\theta\} \to \theta$), and
• asymptotically efficient, in the sense of having the smallest variance among unbiased estimates.
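Referenced above: a minimal sketch (assumed toy data, not from the slides) of maximum likelihood for a Bernoulli success probability, where maximizing the log-likelihood over a grid recovers the closed-form MLE, the sample proportion:

```python
import numpy as np

# Minimal sketch (assumed toy data, not from the lecture): Bernoulli MLE.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

theta = np.linspace(0.001, 0.999, 999)
log_lik = np.sum(y) * np.log(theta) + np.sum(1 - y) * np.log(1 - theta)

mle = theta[np.argmax(log_lik)]   # value of theta maximizing the log-likelihood
print(round(mle, 2))              # 0.7
print(y.mean())                   # 0.7 -- the closed-form MLE
```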

Page 56: Lecture 2 Data Mining

Ordinary least squares linear regression as maximum likelihood.
• The (conditional) model
• The likelihood
• The score equations

Page 57: Lecture 2 Data Mining

Mixture models and the EM algorithm
• Mixture models when the component identifiers are available
• The likelihood when they are not
• An iterative approach to estimation

Page 58: Lecture 2 Data Mining

Suppose you really believe that the parameter has a distribution π(θ). And that nature or god or ... chose from that distribution when it created θ.
• And suppose you wanted to estimate θ.
• Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

• so that we need to find a function $\hat\theta$ to minimize

$$E\{L(\hat\theta, \theta)\}.$$

• What expectation are we talking about? The expectation over θ!
• Find $\hat\theta$ to minimize

$$E\{L(\hat\theta(Y), \theta)\mid Y\}.$$

• Or maybe just approximate that optimal choice with the expectation or mode or ...

Page 59: Lecture 2 Data Mining

Bayes' theorem:

$$\theta \sim \pi(\theta)$$

$$Y\mid\theta \sim f_\theta(y)$$

$$\pi(\theta\mid y) = \frac{\pi(\theta)\, f_\theta(y)}{\int \pi(\theta)\, f_\theta(y)\, d\theta}.$$

Page 60: Lecture 2 Data Mining

• And suppose you wanted to estimate θ after observing some data generated according to $f_\theta(y)$.
• Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

• so that we need to find a function $\hat\theta(y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\}.$$

• What expectation are we talking about? The expectation over θ and Y!

$$E\{L(\hat\theta(Y), \theta)\} = E\{E\{L(\hat\theta(Y), \theta)\mid\theta\}\} = E\{E\{L(\hat\theta(Y), \theta)\mid Y\}\}$$

• Find $\hat\theta(Y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\mid Y\}.$$

• Or maybe just approximate that optimal choice with the posterior expectation, or mode, or ... (a small numerical check follows below).
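Referenced above: a minimal sketch (assumed toy setup, not from the slides) showing that under squared error loss the minimizer of E{L(θ̂, θ) | Y} is the posterior mean, using the grid posterior from the earlier Bernoulli example:

```python
import numpy as np

# Minimal sketch (assumed toy setup, not from the lecture): under squared error
# loss, the estimate minimizing posterior expected loss is the posterior mean.
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
posterior = theta**7 * (1 - theta)**3            # 7 successes in 10 trials, flat prior
posterior /= np.sum(posterior * dtheta)

def expected_loss(theta_hat):
    # E{(theta_hat - theta)^2 | Y} under the grid posterior
    return np.sum((theta_hat - theta)**2 * posterior * dtheta)

candidates = np.linspace(0.01, 0.99, 99)
losses = np.array([expected_loss(t) for t in candidates])

best = candidates[np.argmin(losses)]
post_mean = np.sum(theta * posterior * dtheta)
print(round(best, 2), round(post_mean, 3))       # 0.67 and about 0.667
```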