Lecture 2: Data Mining
Linear algebra, Probability theory, Statistical inference

Transcript of Lecture 2, Data Mining, Columbia University

Page 1: Lecture 2 Data Mining

Linear algebra Probability theory Statistical inference

Lecture 2

Page 2: Lecture 2 Data Mining

Let X be the n by 1 matrix (really, just a vector) with all entries equal to 1,

$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$

And consider the span of the columns (really, just one column) of X, the set of all vectors of the form

$$X\beta_0$$

(where $\beta_0$ here is any real number).

Page 3: Lecture 2 Data Mining

Consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to Y?

That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and Y,

$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T(Y - X\beta_0) = \sum_{i=1}^{n} (Y_i - \beta_0)^2.$$

Page 4: Lecture 2 Data Mining

Let's solve for the value of $\beta_0$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - \beta_0)^2$$

by differentiating with respect to $\beta_0$ and setting to zero:

$$-2\sum_{i=1}^{n} (Y_i - \beta_0) = 0.$$

In matrix notation, differentiate

$$(Y - X\beta_0)^T(Y - X\beta_0)$$

with respect to $\beta_0$ and set to zero:

$$-2X^T(Y - X\beta_0) = 0.$$

Page 5: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_0,
\qquad \text{where} \qquad
\beta_0 = \frac{\sum_{i=1}^{n} Y_i}{n},$$

or equivalently,

$$X\beta_0 = X(X^TX)^{-1}X^TY.$$
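As a quick numerical check (a minimal sketch with assumed toy numbers, using numpy; not part of the original slides), projecting Y onto the span of the all-ones column reproduces the sample mean:

```python
import numpy as np

# Minimal sketch (toy numbers assumed, not from the lecture): the projection of
# Y onto the span of the all-ones vector has every entry equal to the mean of Y.
n = 5
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X = np.ones((n, 1))

beta0 = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
projection = X @ beta0                       # nearest vector in the span

print(beta0)                                 # [6.]
print(projection)                            # [6. 6. 6. 6. 6.]
print(np.isclose(beta0[0], Y.mean()))        # True: beta_0 is the sample mean
```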

Page 6: Lecture 2 Data Mining

What if we take a more general (but one-dimensional) X: suppose that the entries of X are arbitrary numbers $X_{i1}$,

$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}.$$

And consider the span of the columns of X, the set of all vectors of the form

$$X\beta_1$$

(where again $\beta_1$ here is any real number).

Page 7: Lecture 2 Data Mining

And as before, consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and Y,

$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T(Y - X\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2.$$

Page 8: Lecture 2 Data Mining

Let's solve for the value of $\beta_1$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2$$

by differentiating with respect to $\beta_1$ and setting to zero:

$$-2\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})X_{i1} = 0.$$

In matrix notation, differentiate

$$(Y - X\beta_1)^T(Y - X\beta_1)$$

with respect to $\beta_1$ and set to zero:

$$-2X^T(Y - X\beta_1) = 0.$$

Page 9: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \beta_1,
\qquad \text{where} \qquad
\beta_1 = \frac{\sum_{i=1}^{n} Y_i X_{i1}}{\sum_{i=1}^{n} X_{i1}^2},$$

or equivalently,

$$X\beta_1 = X(X^TX)^{-1}X^TY.$$

Page 10: Lecture 2 Data Mining

Now let's go to an n by 2 matrix X:

$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}.$$

And consider the span of the columns of X, the set of all vectors of the form

$$X\beta,$$

where here $\beta$ is the two-dimensional column vector $(\beta_0, \beta_1)^T$.

Page 11: Lecture 2 Data Mining

And as before, consider some other n-dimensional vector Y,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and Y,

$$\|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta) = \sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$

Page 12: Lecture 2 Data Mining

Let's solve for the value of $\beta$ that minimizes the distance

$$\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2$$

by taking the gradient with respect to $\beta$ and setting it to zero:

$$-2\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0$$

$$-2\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])X_{i1} = 0$$

In matrix notation, take the gradient of

$$(Y - X\beta)^T(Y - X\beta)$$

with respect to $\beta$ and set to zero:

$$-2X^T(Y - X\beta) = 0.$$

Page 13: Lecture 2 Data Mining

Solving, we obtain for the nearest vector in the subspace

$$X\beta = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \beta_1,$$

where

$$\beta_0 = \bar{Y} - \beta_1 \bar{X}_1, \qquad
\beta_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)\,Y_i}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2},$$

or equivalently,

$$X\beta = X(X^TX)^{-1}X^TY.$$
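As a numerical check (a minimal sketch with simulated data, using numpy; not from the slides), the closed-form intercept and slope above agree with the matrix solution $(X^TX)^{-1}X^TY$ for an n by 2 design matrix:

```python
import numpy as np

# Minimal sketch (simulated data assumed, not from the lecture): compare the
# closed-form intercept/slope with the normal-equations solution.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)                        # the X_{i1} column
Y = 1.5 + 2.0 * x1 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1])          # columns: all ones and X_{i1}
beta = np.linalg.solve(X.T @ X, X.T @ Y)       # (X^T X)^{-1} X^T Y

b1 = np.sum((x1 - x1.mean()) * Y) / np.sum((x1 - x1.mean())**2)
b0 = Y.mean() - b1 * x1.mean()

print(np.allclose(beta, [b0, b1]))             # True
```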

Page 14: Lecture 2 Data Mining

In general, for an n by p matrix X, the vector in the span of the columns of X nearest to Y, the so-called projection of Y onto the span of the columns of X, is the vector

$$X\beta,$$

where $\beta$ is the minimizer of

$$\|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta).$$

If we take the gradient with respect to $\beta$ and set it to zero we arrive at

$$X^T(Y - X\beta) = 0,$$

from which it follows that

$$\beta = (X^TX)^{-1}X^TY.$$
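A short sketch for the general n by p case (simulated data assumed; not from the slides): the normal-equations solution matches a library least-squares solver, and the residual is orthogonal to every column of X, which is exactly the condition $X^T(Y - X\beta) = 0$:

```python
import numpy as np

# Minimal sketch (simulated data assumed, not from the lecture): general p.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # library least squares

print(np.allclose(beta_normal_eq, beta_lstsq))       # True

# The residual Y - X beta is orthogonal to every column of X:
resid = Y - X @ beta_normal_eq
print(np.allclose(X.T @ resid, 0.0))                 # True
```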

Page 15: Lecture 2 Data Mining

Let Σ = D be a diagonal matrix with all positive entries.
• Note that Σ is a simple example of a symmetric, positive definite matrix.
• Note that we could write $\Sigma = IDI^T$, where I is the identity matrix.
• Note that the columns of I are orthogonal, of unit length, and
• they are eigenvectors,
• with eigenvalues equal to the corresponding elements of D.

Page 16: Lecture 2 Data Mining

Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of I,

$$c = c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

What happens when you compute Σc? You get

$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

Page 17: Lecture 2 Data Mining

And if you further compute $c^T\Sigma c$, you get

$$c^T\Sigma c = d_1 c_1 c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}
= c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$

Page 18: Lecture 2 Data Mining

Suppose you wanted to maximize $c^T\Sigma c$ among unit length c.

That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Take c to be the eigenvector associated with the largest d!

Page 19: Lecture 2 Data Mining

Let's start all over again. But this time, we'll not take Σ = D a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where D is again a diagonal matrix with all positive entries, and P is a matrix whose columns are orthonormal (and span p-dimensional space).
• Note that Σ is a complicated example of a symmetric, positive definite matrix.
• Note that we write $\Sigma = PDP^T$, where P is not the identity matrix any more, but rather some other orthonormal matrix.
• Note that the columns of P are by definition orthogonal, of unit length.
• And just like the columns of I were eigenvectors, so are the columns of P,
• again with eigenvalues equal to the corresponding elements of D.

Page 20: Lecture 2 Data Mining

Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of P (not of I now, but rather of P),

$$c = c_1 P_1 + c_2 P_2 + \cdots + c_p P_p.$$

(The columns of P are a basis for p-dimensional space.)

What happens when you compute Σc? You get

$$\Sigma c = PDP^T c = PDP^T(c_1 P_1 + c_2 P_2 + \cdots + c_p P_p)
= PD \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix}
= P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix}
= c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p.$$

Page 21: Lecture 2 Data Mining

And if you further compute $c^T\Sigma c$, you get

$$c^T\Sigma c = d_1 c_1 c^T P_1 + d_2 c_2 c^T P_2 + \cdots + d_p c_p c^T P_p
= c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$

Page 22: Lecture 2 Data Mining

Suppose you wanted to maximize $c^T\Sigma c$ as a function of unit length vectors c.

That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Again, take c to be the eigenvector associated with the largest d!
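A numerical illustration (a sketch with an assumed random Σ, using numpy; not from the slides): after building a symmetric positive definite $\Sigma = PDP^T$, the eigenvector with the largest eigenvalue maximizes $c^T\Sigma c$ over unit vectors c:

```python
import numpy as np

# Minimal sketch (assumed random example, not from the lecture): Rayleigh
# quotient c^T Sigma c over unit vectors is maximized by the top eigenvector.
rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # symmetric positive definite

eigvals, eigvecs = np.linalg.eigh(Sigma) # ascending eigenvalues, orthonormal columns
top_vec = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue

print(top_vec @ Sigma @ top_vec)         # equals the largest eigenvalue
print(eigvals[-1])

# No random unit vector does better:
c = rng.normal(size=(1000, p))
c /= np.linalg.norm(c, axis=1, keepdims=True)
print(np.max(np.einsum('ij,jk,ik->i', c, Sigma, c)) <= eigvals[-1] + 1e-9)  # True
```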

Page 23: Lecture 2 Data Mining

One last fact that will be relevant when we use these results: every symmetric positive definite matrix Σ can be written in the form $PDP^T$, where the columns of P are an orthonormal basis for p-dimensional space, the columns are the eigenvectors of Σ, and the diagonal matrix D has as components the corresponding (positive) eigenvalues.

Page 24: Lecture 2 Data Mining

It might help to think of $\Sigma = PDP^T$ as a linear transformation. Think of how it maps the unit sphere...

The transformation corresponding to Σ maps the orthonormal eigenvectors, the columns of P, into stretched or shrunken versions of themselves. That is, Σ maps the unit sphere into an ellipsoid, with axes along the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.

From this point of view, does it make sense that to maximize $c^T\Sigma c$ for c on the unit sphere, one can do no better than taking c equal to the eigenvector with the largest eigenvalue?

Page 25: Lecture 2 Data Mining

Suppose that a p by p matrix Σ is symmetric, so that $\Sigma = \Sigma^T$.

Suppose also that Σ is positive definite, so that for any non-zero p-dimensional vector c, $c^T\Sigma c$ is greater than zero. Then
• All of the eigenvalues of Σ are real and positive.
• All of the eigenvectors of Σ are orthogonal (or have the same eigenvalue).
• We can find p linearly independent, orthogonal, unit-length p-dimensional eigenvectors $P_j$.
• Let P be the p by p matrix whose columns are the $P_j$, and let D be the corresponding diagonal matrix whose entries are the eigenvalues.
• Then, $\Sigma = PDP^T$.

Page 26: Lecture 2 Data Mining

Suppose you want to maximize $c^T\Sigma c$ with respect to p-dimensional unit vectors c.
• Local maximizers are given by $c = P_j$,
• and the corresponding local maxima are the eigenvalues.

Page 27: Lecture 2 Data Mining

Suppose you want to maximize $c^T\Sigma c$ with respect to p-dimensional unit vectors c such that

$$c^T\Sigma P_j = 0$$

for the $P_j$ corresponding to some set of eigenvectors.
• Local maximizers are given by the $P_j$ associated with the other eigenvectors,
• and the corresponding local maxima are the eigenvalues.

Page 28: Lecture 2 Data Mining

• The linear transformation Σ maps the unit sphere to an ellipsoid with axes along the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.
• For a vector c, $P^Tc$ has as its components the $a_j$ for which $\sum_{j=1}^{p} a_j P_j = c$.
• So $DP^Tc$ stretches or shrinks those $a_j$ by the associated eigenvalues, $\lambda_j$,
• and so $PDP^Tc$ is $\sum_{j=1}^{p} a_j \lambda_j P_j$.
• In short,

$$\sum_{j=1}^{p} a_j P_j \longrightarrow \sum_{j=1}^{p} a_j \lambda_j P_j.$$

Page 29: Lecture 2 Data Mining

• Let θ be a (vector of) random variable(s) with (joint) density π(θ).
• Let Y be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given θ.
• The conditional density of θ given Y = y is

$$\pi(\theta\mid y) = \frac{\pi(\theta)\, f_\theta(y)}{\int \pi(\theta)\, f_\theta(y)\, d\theta}$$

(a small numerical sketch of this follows below).
• The conditional expectation of θ given Y = y is

$$E\{\theta\mid y\} = \int \pi(\theta\mid y)\, \theta\, d\theta,$$

• and the value of θ that maximizes the posterior solves

$$\frac{d}{d\theta}\ln\pi(\theta) + \frac{d}{d\theta}\ln f_\theta(y) = 0.$$
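As a concrete illustration (a sketch with assumed toy numbers, not from the slides), the posterior for a Bernoulli success probability can be computed on a grid directly from the formula above, along with its posterior mean and its maximizer:

```python
import numpy as np

# Minimal sketch (assumed toy numbers, not from the lecture): posterior for a
# Bernoulli parameter theta given y = 7 successes in n = 10 trials, flat prior.
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                      # pi(theta)
likelihood = theta**7 * (1 - theta)**3           # f_theta(y), up to a constant

numer = prior * likelihood
posterior = numer / np.sum(numer * dtheta)       # divide by the integral over theta

post_mean = np.sum(posterior * theta * dtheta)   # E{theta | y}
post_mode = theta[np.argmax(posterior)]          # maximizer of the posterior

print(round(post_mean, 3))   # about 0.667, the posterior mean
print(round(post_mode, 2))   # about 0.7, the posterior mode (here also the MLE)
```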

Page 30: Lecture 2 Data Mining

Suppose that X and Y are jointly distributed random variables with joint density $f_{XY}(x, y)$. Then
• the density of Y is

$$f_Y(y) = \int f_{XY}(x, y)\, dx,$$

• the density of X is

$$f_X(x) = \int f_{XY}(x, y)\, dy,$$

• and the conditional density of Y given X is

$$f_{XY}(x, y) / f_X(x).$$

Page 31: Lecture 2 Data Mining

The expectations of X and Y, and the conditional expectation of Y given X, are

$$E\{X\} = \int f_X(x)\, x\, dx$$

$$E\{Y\} = \int f_Y(y)\, y\, dy$$

$$E\{Y\mid X\} = \int f_{Y\mid X}(y\mid X)\, y\, dy.$$

And we have

$$E\{Y\} = E\{E\{Y\mid X\}\}.$$

Page 32: Lecture 2 Data Mining

The law of the unconscious statistician says that

$$E\{g(X)\} = \int f_X(x)\, g(x)\, dx,$$

so that also

$$\mathrm{Var}\{g(X)\} = \int f_X(x)\,\big(g(x) - E\{g(X)\}\big)^2\, dx.$$

Page 33: Lecture 2 Data Mining

The variance of Y and the conditional variance of Y given X are

$$\mathrm{Var}(Y) = \int f_Y(y)\,\big(y - E\{Y\}\big)^2\, dy$$

$$\mathrm{Var}(Y\mid X) = \int f_{Y\mid X}(y\mid X)\,\big(y - E\{Y\mid X\}\big)^2\, dy.$$

Page 34: Lecture 2 Data Mining

The covariance between two random variables X and Y is defined as

$$E\{(Y - E\{Y\})(X - E\{X\})\},$$

and we have

$$\mathrm{Var}(Y) = E\{\mathrm{Var}(Y\mid X)\} + \mathrm{Var}(E\{Y\mid X\}).$$
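A quick Monte Carlo check of the last identity (an assumed simulation, not from the slides), with Y = 2X + noise so that E{Y|X} and Var(Y|X) are known exactly:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): check
# Var(Y) = E{Var(Y|X)} + Var(E{Y|X}) for Y = 2X + noise.
rng = np.random.default_rng(3)
n = 1_000_000
X = rng.normal(size=n)                       # X ~ N(0, 1)
Y = 2.0 * X + rng.normal(scale=0.5, size=n)  # Y | X ~ N(2X, 0.25)

# Here E{Y|X} = 2X and Var(Y|X) = 0.25, so the right-hand side is
# E{0.25} + Var(2X) = 0.25 + 4 = 4.25.
lhs = Y.var()
rhs = 0.25 + np.var(2.0 * X)
print(round(lhs, 2), round(rhs, 2))          # both close to 4.25
```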

Page 35: Lecture 2 Data Mining

For a vector of random variables X, we define the expectation vector: E{X} is the vector with entries equal to the expectations of the components of X.

Page 36: Lecture 2 Data Mining

And we define the covariance matrix, Cov(X),

$$E\{(X - E\{X\})(X - E\{X\})^T\},$$

with diagonal entries equal to the variances of the components of X, and the covariances arranged in the off-diagonals.

Note that a covariance matrix is symmetric and, as long as the components of X are not linear functions of each other, positive definite.

Page 37: Lecture 2 Data Mining

Independence
• Random variables are independent if their joint density is equal to the product of their marginals.
• Independence captures the notion of one random variable's value having no implications for the value of the other.
• If two random variables are independent, their covariance is equal to zero.

Page 38: Lecture 2 Data Mining

If X is a q-dimensional vector of random variables with expectation vector µ and covariance matrix Σ, and if M is an r by q matrix of constants, and ν is an r-dimensional vector of constants, then

$$E\{MX + \nu\} = M\mu + \nu$$

and

$$\mathrm{Cov}(MX + \nu) = M\,\mathrm{Cov}(X)\,M^T.$$
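A numerical check (an assumed simulation with made-up M, ν, and Σ; not from the slides) that the sample covariance of MX + ν matches M Cov(X) Mᵀ:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): Cov(MX + nu).
rng = np.random.default_rng(4)
q, r, n = 3, 2, 500_000

Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(q), cov=Sigma_true, size=n)  # n x q

M = rng.normal(size=(r, q))
nu = rng.normal(size=r)
Z = X @ M.T + nu                              # each row is M x_i + nu

print(np.round(np.cov(Z, rowvar=False), 2))   # empirical Cov(MX + nu)
print(np.round(M @ Sigma_true @ M.T, 2))      # M Cov(X) M^T -- nearly identical
```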

Page 39: Lecture 2 Data Mining

Chebyshev's inequality:

$$P\{|X - \mu| \ge \varepsilon\} \le \mathrm{Var}(X)/\varepsilon^2.$$

Page 40: Lecture 2 Data Mining

Suppose the $X_i$ are all independent, i from 1 to n, and suppose that each $X_i$ has a finite variance (which we will denote $\sigma_i^2$). Then the variance of $\bar{X}$, that is, the variance of

$$\frac{1}{n}\sum_{i=1}^{n} X_i,$$

is equal to

$$\frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2.$$

And, in particular, if the $\sigma_i^2$ have an upper bound in common, then the variance tends to zero for large values of n.

Page 41: Lecture 2 Data Mining

In this situation, from Chebyshev's inequality, we find that

$$P\{|\bar{X} - \mu| \ge \varepsilon\}$$

is also small. Here, µ is the average of the expectations of the $X_i$.

In short, by taking more data, we can learn.
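A small simulation (assumed, not from the slides) showing the variance of the sample mean shrinking like 1/n, which is what drives the Chebyshev bound above to zero:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): the variance of
# the mean of n independent draws shrinks like sigma^2 / n.
rng = np.random.default_rng(5)
mu, sigma = 3.0, 2.0

for n in [10, 100, 1000, 10000]:
    means = rng.normal(mu, sigma, size=(1000, n)).mean(axis=1)
    print(n, round(means.var(), 5))   # close to sigma**2 / n = 4 / n
```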

Page 42: Lecture 2 Data Mining

With minimal technical assumptions about the finite variances of the independent $X_i$, we can go beyond the behavior of $\bar{X} - \mu$ to consider not just that it tends to zero, but also how it varies around zero:

$$P\left\{ \frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n} E\{X_i\}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2}} \le x \right\} \longrightarrow \int_{-\infty}^{x} \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt.$$

Not only can we learn, we can know how well we've learned!
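A simulation sketch (assumed toy distribution, not from the slides): standardized sums of independent uniform draws already track the standard normal CDF closely at moderate n:

```python
import numpy as np

# Minimal sketch (assumed simulation, not from the lecture): central limit
# theorem for sums of independent Uniform(0, 1) draws.
rng = np.random.default_rng(6)
n, reps = 200, 50_000

X = rng.uniform(0.0, 1.0, size=(reps, n))            # E{X_i} = 1/2, Var(X_i) = 1/12
Z = (X.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)     # standardized sums

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(x, round(np.mean(Z <= x), 3))   # close to Phi(x): 0.159, 0.5, 0.841, 0.977
```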

Page 43: Lecture 2 Data Mining


Page 44: Lecture 2 Data Mining

When you analyze data,
• There is data
• A statistical method is applied
• There are the results of your method
• You also produce some indication of the precision of your results
• The results and precision estimates are used to draw conclusions

How do you know what method to apply?

Page 45: Lecture 2 Data Mining


Page 46: Lecture 2 Data Mining

Given a statistical model, and given an analytic goal, there is (almost always) an appropriate method already in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera.

• What is a statistical model?
• What is an analytic goal?
• How does one elicit them from the "client"?

Page 47: Lecture 2 Data Mining


Page 48: Lecture 2 Data Mining

A probability model has
• A sample space for the observables (the random variables)
• A joint distribution on the sample space
• The joint distribution reflects all the sources of variability that are inherent in the random variables.

Page 49: Lecture 2 Data Mining

A statistical model is a family of probability models
• on the sample space for the observables (the random variables)
• What is specified about the joint distribution reflects what is known about the distribution of the random variables
• What is unspecified reflects what is unknown about the distribution of the random variables
• That we have random variables reflects that even if we knew everything that could be known about the distribution, there would still be randomness.

Parameters index the possible distributions for the data.

Page 50: Lecture 2 Data Mining

What must be considered in devising a statistical model? What is known, and what is unknown, about
• The sources of variability
• The sampling plan
• Mechanisms underlying the phenomena under examination
• Counterfactuals: issues of causation and confounding
• Practical issues relating to complexity, sample size, and computation

Page 51: Lecture 2 Data Mining

The analytic goal is a statement of the researchers' goal in terms of the parameters.

The mathematical version of this is "decision theory."
• Parameters Θ,
• indexing probability models on outcomes Y
• Possible "actions" A
• A loss associated with parameter-action pairs, L(θ, a)
• Decision rules map data to actions, $d : Y \longrightarrow A$.
• We evaluate decision rules via

$$E_\theta\{L(\theta, d(Y))\}.$$

Page 52: Lecture 2 Data Mining

• n subjects, randomly assigned to treatment or placebo
• Cure or Failure recorded for all
• Researchers wish to convince the EPA that the treatment is efficacious, but only if it really is

Model? Analytic Goal?

Page 53: Lecture 2 Data Mining

• n patient charts chosen at random from a physician's practice
• Total gains generated by "up-coding" recorded for each
• Prosecutors need to assess the total gains in order to recommend the amount to be recovered

Model? Analytic Goal?

Page 54: Lecture 2 Data Mining

• n loan applications, or cell histologies, or examples of past weather patterns
• Associated foreclosure outcomes, or cancer outcomes, or rainfall
• Researchers want to help others do prediction with new data

Model? Analytic Goal?

Page 55: Lecture 2 Data Mining

• A parameterization is a mapping from the parameter space to the probability models for the data.
• The likelihood is the density (or probability mass function) of the observed data as a function of the parameter.
• The maximum likelihood estimator is the value of the parameter that maximizes the likelihood.
• We usually find the MLE by differentiating the logarithm of the likelihood and setting it to zero (a small numerical sketch follows below).
• Maximum likelihood estimates are
• asymptotically unbiased ($E_\theta\{\hat\theta\} \to \theta$), and
• asymptotically efficient, in the sense of having the smallest variance among unbiased estimates.
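Referenced above: a minimal sketch (assumed toy data, not from the slides) of maximum likelihood for a Bernoulli success probability, where maximizing the log-likelihood over a grid recovers the closed-form MLE, the sample proportion:

```python
import numpy as np

# Minimal sketch (assumed toy data, not from the lecture): Bernoulli MLE.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

theta = np.linspace(0.001, 0.999, 999)
log_lik = np.sum(y) * np.log(theta) + np.sum(1 - y) * np.log(1 - theta)

mle = theta[np.argmax(log_lik)]   # value of theta maximizing the log-likelihood
print(round(mle, 2))              # 0.7
print(y.mean())                   # 0.7 -- the closed-form MLE
```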

Page 56: Lecture 2 Data Mining

Ordinary least squares linear regression as maximum likelihood.
• The (conditional) model
• The likelihood
• The score equations

Page 57: Lecture 2 Data Mining

Mixture models and the EM algorithm
• Mixture models when the component identifiers are available
• The likelihood when they are not
• An iterative approach to estimation

Page 58: Lecture 2 Data Mining

Suppose you really believe that the parameter has a distribution π(θ). And that nature or god or ... chose from that distribution when it created θ.
• And suppose you wanted to estimate θ.
• Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

• so that we need to find a function $\hat\theta$ to minimize

$$E\{L(\hat\theta, \theta)\}.$$

• What expectation are we talking about? The expectation over θ!
• Find $\hat\theta$ to minimize

$$E\{L(\hat\theta(Y), \theta)\mid Y\}.$$

• Or maybe just approximate that optimal choice with the expectation or mode or ...

Page 59: Lecture 2 Data Mining

Bayes' theorem:

$$\theta \sim \pi(\theta)$$

$$Y\mid\theta \sim f_\theta(y)$$

$$\pi(\theta\mid y) = \frac{\pi(\theta)\, f_\theta(y)}{\int \pi(\theta)\, f_\theta(y)\, d\theta}.$$

Page 60: Lecture 2 Data Mining

• And suppose you wanted to estimate θ after observing some data generated according to $f_\theta(y)$.
• Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

• so that we need to find a function $\hat\theta(y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\}.$$

• What expectation are we talking about? The expectation over θ and Y!

$$E\{L(\hat\theta(Y), \theta)\} = E\{E\{L(\hat\theta(Y), \theta)\mid\theta\}\} = E\{E\{L(\hat\theta(Y), \theta)\mid Y\}\}$$

• Find $\hat\theta(Y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\mid Y\}.$$

• Or maybe just approximate that optimal choice with the posterior expectation, or mode, or ... (a small numerical check follows below).
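Referenced above: a minimal sketch (assumed toy setup, not from the slides) showing that under squared error loss the minimizer of E{L(θ̂, θ) | Y} is the posterior mean, using the grid posterior from the earlier Bernoulli example:

```python
import numpy as np

# Minimal sketch (assumed toy setup, not from the lecture): under squared error
# loss, the estimate minimizing posterior expected loss is the posterior mean.
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
posterior = theta**7 * (1 - theta)**3            # 7 successes in 10 trials, flat prior
posterior /= np.sum(posterior * dtheta)

def expected_loss(theta_hat):
    # E{(theta_hat - theta)^2 | Y} under the grid posterior
    return np.sum((theta_hat - theta)**2 * posterior * dtheta)

candidates = np.linspace(0.01, 0.99, 99)
losses = np.array([expected_loss(t) for t in candidates])

best = candidates[np.argmin(losses)]
post_mean = np.sum(theta * posterior * dtheta)
print(round(best, 2), round(post_mean, 3))       # 0.67 and about 0.667
```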