Machine Learning
Maximum Likelihood Estimation and Bayesian Parameter Estimation (Parametric Learning)

Phong VO, [email protected]
September 11, 2010

Introduction

• From the previous lecture, designing a classifier assumes knowledge of p(x|ωi) and P(ωi) for each class; e.g. for Gaussian densities, we need to know µi, Σi for i = 1, . . . , c.

• Unfortunately, this information is not available directly.

• Given training samples with the true class label for each sample, we have a learning problem.

• If the form of the densities is known (i.e. the number of parameters and some general knowledge about the problem), a parameter estimation problem results.


Example 1. Assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi, although we do not know the exact values of these quantities. This knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters µi and Σi.


Approaches to Parameter Estimation

• In maximum likelihood estimation, we assume the parameters are fixed but unknown. The MLE approach seeks the "best" parameter estimate, where "best" means the set of parameters that maximizes the probability of obtaining the training set.

• Bayesian estimation models the parameters to be estimated as random variables with some (assumed) known a priori distribution. The training samples are "observations" which allow conversion of the a priori information into an a posteriori density. The Bayesian approach uses the training set to update the training-set-conditioned density function of the unknown parameters.


Maximum Likelihood Estimation

• ML estimates nearly always have good convergence properties as the number of training samples increases.

• It is often simpler than alternative methods, such as Bayesian techniques.


Formulation

• Assume D = D1 ∪ . . . ∪ Dc, with the samples in Dj having been drawn independently according to the probability law p(x|ωj).

• Assume that p(x|ωj) has a known parametric form, determined uniquely by the value of a parameter vector θj, e.g. p(x|ωj) ∼ N(µj, Σj) where θj = {µj, Σj}.

• The dependence of p(x|ωj) on θj is expressed as p(x|ωj,θj).

• Our problem: use the training samples to obtain good estimates for the unknown parameter vectors θ1, . . . , θc.


• Assume further that samples in Di give no information about θj if i ≠ j. In other words, the parameters are functionally independent.

• The problem of classification is thus turned into c problems of parameter estimation: use a set D of training samples drawn independently from the probability density p(x|θ) to estimate the unknown parameter vector θ.

• Suppose that D = {x1, . . . , xn}. Since the samples were drawn independently, we have

p(D|θ) = ∏_{k=1}^n p(xk|θ)

p(D|θ) is called the likelihood of θ with respect to the set of samples.
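As a quick illustration (not part of the original slides), here is a minimal numpy sketch that evaluates this likelihood for a univariate Gaussian model; the toy data set and parameter values are made up. Working with ln p(D|θ) = ∑ ln p(xk|θ) avoids the numerical underflow that the raw product suffers for large n.

```python
import numpy as np

def log_likelihood(D, mu, sigma):
    """ln p(D|theta) = sum_k ln p(x_k|theta) for a univariate N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (D - mu)**2 / (2 * sigma**2))

D = np.random.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical training set
print(log_likelihood(D, mu=2.0, sigma=1.5))   # near the generating parameters
print(log_likelihood(D, mu=0.0, sigma=1.5))   # a worse theta: noticeably lower
```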


• The maximum likelihood estimate of θ is the value θ̂ that maximizes p(D|θ).


• Taking the logarithm on both sides (purely for analytical convenience), we define l(θ) as the log-likelihood function

l(θ) = ln p(D|θ)

• Since the logarithm is monotonically increasing, the θ̂ that maximizes the log-likelihood also maximizes the likelihood,

θ̂ = arg max_θ l(θ) = arg max_θ ∑_{k=1}^n ln p(xk|θ)

• θ̂ can be found by taking derivatives of the log-likelihood function,


∇θ l = ∑_{k=1}^n ∇θ ln p(xk|θ), where ∇θ ≡ (∂/∂θ1, . . . , ∂/∂θp)^t,

and then solving the equation

∇θ l = 0.
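In simple cases this equation has a closed-form solution (derived below for the Gaussian); otherwise it is solved numerically. A sketch under the same hypothetical univariate Gaussian model, maximizing l(θ) with scipy by minimizing its negative:

```python
import numpy as np
from scipy.optimize import minimize

D = np.random.normal(loc=2.0, scale=1.5, size=1000)   # toy data again

def neg_log_likelihood(theta):
    mu, log_sigma = theta              # parametrize by log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return (0.5 * len(D) * np.log(2 * np.pi * sigma**2)
            + np.sum((D - mu)**2) / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))      # mu_hat, sigma_hat: close to 2.0, 1.5
```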


• A solution could be a global maximum, a local maximum, or even a minimum. We have to check each of them individually.

• NOTE: A related class of estimators, maximum a posteriori or MAP estimators, find the value of θ that maximizes p(D|θ)p(θ), i.e. l(θ) + ln p(θ). Thus an ML estimator is a MAP estimator for the uniform or "flat" prior.
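A sketch of the difference under the same toy Gaussian setup (helper names are mine, not from the slides): the MAP objective just adds the log-prior to the log-likelihood, and a flat prior leaves the maximizer unchanged.

```python
import numpy as np

def log_map_objective(D, mu, sigma, log_prior):
    """l(theta) + ln p(theta), the quantity a MAP estimator maximizes."""
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                     - (D - mu)**2 / (2 * sigma**2))
    return log_lik + log_prior(mu, sigma)

flat_prior = lambda mu, sigma: 0.0                       # constant: MAP == MLE
gauss_prior = lambda mu, sigma: -0.5 * (mu / 10.0)**2    # mild pull toward mu = 0
```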


MLE: The Gaussian Case for Unknown µ

• In this case, only the mean µ is unknown. Under this condition, we consider a sample point xk and find

ln p(xk|µ) = −(1/2) ln((2π)^d |Σ|) − (1/2)(xk − µ)^t Σ⁻¹ (xk − µ)

and

∇θ ln p(xk|µ) = Σ⁻¹(xk − µ).

• The maximum likelihood estimate for µ must satisfy


∑_{k=1}^n Σ⁻¹(xk − µ̂) = 0.

• Solving the above equation, we obtain

µ̂ = (1/n) ∑_{k=1}^n xk

• Interpretation: The maximum likelihood estimate for the unknown population mean is just the arithmetic average of the training samples, i.e. the sample mean. Thinking of the n samples as a cloud of points, the sample mean is the centroid of the cloud.


MLE: The Gaussian Case for Unknown µ and Σ

• Consider the univariate normal case, θ = {θ1, θ2} = {µ, σ²}. The log-likelihood of a single point is

ln p(xk|θ) = −(1/2) ln(2πθ2) − (1/(2θ2))(xk − θ1)²

and its derivative is

∇θ ln p(xk|θ) = ( (1/θ2)(xk − θ1), −1/(2θ2) + (xk − θ1)²/(2θ2²) )^t.


• Setting ∇θ l = 0, we obtain

∑_{k=1}^n (1/θ̂2)(xk − θ̂1) = 0

and

−∑_{k=1}^n 1/θ̂2 + ∑_{k=1}^n (xk − θ̂1)²/θ̂2² = 0.

• Substituting µ̂ = θ̂1 and σ̂² = θ̂2, we obtain

µ̂ = (1/n) ∑_{k=1}^n xk


and

σ̂² = (1/n) ∑_{k=1}^n (xk − µ̂)².
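These closed forms are easy to sanity-check numerically; a minimal sketch with a made-up generating distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(2.0, 1.5, size=10_000)      # true mu = 2.0, true sigma^2 = 2.25

mu_hat = D.mean()                          # (1/n) sum_k x_k
sigma2_hat = ((D - mu_hat)**2).mean()      # (1/n) sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)                  # both close to the true values
# Note the ML variance estimate divides by n and is therefore biased;
# np.var(D, ddof=1) gives the unbiased version that divides by n - 1.
```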

Exercise 1. Estimate µ and Σ for the case of a multivariate Gaussian.


Bayesian Parameter Estimation

• Bayes' formula allows us to compute the posterior probabilities P(ωi|x) from the prior probabilities P(ωi) and the class-conditional densities p(x|ωi).

• How can we obtain those quantities?

– Prior probabilities: from knowledge of the functional forms for the unknown densities and the ranges of values of the unknown parameters.

– Class-conditional densities: from the training samples.


• Given training samples D, Bayes' formula then becomes

P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / ∑_{j=1}^c p(x|ωj, D) P(ωj|D)

• Assume that the a priori probabilities are known, P(ωi) = P(ωi|D), and that the samples in Di have no influence on p(x|ωj, D) if i ≠ j. Then

P(ωi|x, D) = p(x|ωi, Di) P(ωi) / ∑_{j=1}^c p(x|ωj, Dj) P(ωj)

• We have c separate problems of the following form: use a set D of samples drawn independently according to the fixed but unknown probability distribution p(x) to determine p(x|D). Our supervised learning problem is turned into an unsupervised density estimation problem.


The Parameter Distribution

• Although the desired probability density p(x) is unknown, we assume that it has a known parametric form.

• The unknown factor is the value of a parameter vector θ. As long as θ is known, the function p(x|θ) is known.

• Information that we have about θ prior to observing the samples is assumed to be contained in a known prior density p(θ).

• Observation of the samples converts this to a posterior density p(θ|D), which is expected to be sharply peaked about the true value of θ.


• Our basic goal is to compute p(x|D), which is as close as we can come to obtaining the unknown p(x). By integrating the joint density p(x, θ|D), we have

p(x|D) = ∫ p(x, θ|D) dθ (1)
       = ∫ p(x|θ, D) p(θ|D) dθ (2)
       = ∫ p(x|θ) p(θ|D) dθ (3)

where the last step uses the fact that x depends on D only through θ, i.e. p(x|θ, D) = p(x|θ).
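When this integral has no closed form, it can be approximated on a grid over θ. A minimal sketch for a univariate Gaussian with unknown mean; the prior, data, and grid bounds here are invented for illustration:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

D = np.array([1.8, 2.3, 2.1, 1.6, 2.4])   # toy samples; sigma assumed known
sigma = 1.0
theta = np.linspace(-5.0, 10.0, 2001)     # discretized parameter axis
dtheta = theta[1] - theta[0]

# p(theta|D) proportional to p(D|theta) p(theta), evaluated on the grid
log_lik = sum(np.log(gaussian(x, theta, sigma)) for x in D)
post = np.exp(log_lik) * gaussian(theta, 0.0, 3.0)    # assumed prior N(0, 3^2)
post /= post.sum() * dtheta                           # normalize numerically

# p(x|D) = integral of p(x|theta) p(theta|D) dtheta, eq. (3), same grid
x = 2.0
print((gaussian(x, theta, sigma) * post).sum() * dtheta)
```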


BPE: Gaussian Case

• Calculate p(µ|D) and p(x|D) for the case where p(x|µ) ∼ N(µ, Σ).

• Consider the univariate case where µ is the only unknown parameter,

p(x|µ) ∼ N(µ, σ²)

• We assume that the prior density p(µ) has a known distribution,

p(µ) ∼ N(µ0, σ0²)

Interpretation: µ0 represents our best a priori guess for µ, and σ0² measures our uncertainty about this guess.


• Once µ is "guessed", it determines the density for x. Letting D = {x1, . . . , xn}, Bayes' formula gives us

p(µ|D) = p(D|µ) p(µ) / ∫ p(D|µ) p(µ) dµ ∝ ∏_{k=1}^n p(xk|µ) p(µ),

where it is easy to see how the training samples affect the estimate of the true µ.

• Since p(xk|µ) ∼ N(µ, σ²) and p(µ) ∼ N(µ0, σ0²), we have


p(µ|D) ∝ ∏_{k=1}^n [ (1/(σ√(2π))) exp( −(1/2) ((xk − µ)/σ)² ) ] · (1/(σ0√(2π))) exp( −(1/2) ((µ − µ0)/σ0)² )   (4)

∝ exp( −(1/2) [ ∑_{k=1}^n ((µ − xk)/σ)² + ((µ − µ0)/σ0)² ] )   (5)

∝ exp( −(1/2) [ (n/σ² + 1/σ0²) µ² − 2 ( (1/σ²) ∑_{k=1}^n xk + µ0/σ0² ) µ ] )   (6)

Here the bracketed product in (4) collects the likelihood factors p(xk|µ), and the last factor is the prior p(µ).

• p(µ|D) is again a normal density; it is said to be a reproducing density, and p(µ) is a conjugate prior.


• If we write p(µ|D) ∼ N(µn, σn²), then

p(µ|D) = (1/(σn√(2π))) exp( −(1/2) ((µ − µn)/σn)² )

• Equating coefficients shows that

µn = ( nσ0² / (nσ0² + σ²) ) x̄n + ( σ² / (nσ0² + σ²) ) µ0

where x̄n = (1/n) ∑_{k=1}^n xk, and

σn² = σ0²σ² / (nσ0² + σ²)


Interpretation: these equations show how the prior information is combined with the empirical information in the samples to obtain the a posteriori density p(µ|D).
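A minimal sketch of this update rule (all values invented; σ is the known noise level):

```python
import numpy as np

def posterior_params(D, sigma, mu0, sigma0):
    """Return (mu_n, sigma_n^2) for a N(mu, sigma^2) likelihood with N(mu0, sigma0^2) prior."""
    n, xbar = len(D), float(np.mean(D))
    denom = n * sigma0**2 + sigma**2
    mu_n = (n * sigma0**2 / denom) * xbar + (sigma**2 / denom) * mu0
    sigma2_n = sigma0**2 * sigma**2 / denom
    return mu_n, sigma2_n

D = np.random.default_rng(1).normal(2.0, 1.0, size=50)
print(posterior_params(D, sigma=1.0, mu0=0.0, sigma0=2.0))
# mu_n sits between the prior guess 0.0 and the sample mean (near 2.0),
# weighted toward the data because n * sigma0^2 >> sigma^2 here.
```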


Interpretation

• µn represents our best guess for µ after observing n samples

• σn² measures our uncertainty about this guess; since

σn² → σ²/n as n → ∞,

each additional observation decreases our uncertainty about the true value of µ.

• As n increases, p(µ|D) approaches a Dirac delta function.

• This behavior is known as Bayesian learning.


• µn is a positive combination of x̄n and µ0, and µn always lies between them:

µn → x̄n as n → ∞ if σ0 ≠ 0
µn = µ0 if σ0 = 0
µn ≈ x̄n if σ0 ≫ σ

• The dogmatism: (prior knowledge)/(empirical data) ∼ σ²/σ0²

• If the dogmatism is not infinite, then after enough samples are taken the exact values assumed for µ0 and σ0² will be unimportant, and µn will converge to the sample mean.
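A quick numerical check of this convergence, plugging made-up values into the µn formula above with a deliberately dogmatic prior (σ0 ≪ σ):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, mu0, sigma0 = 1.0, 0.0, 0.2     # dogmatic prior: sigma0 << sigma
for n in (1, 10, 100, 10_000):
    D = rng.normal(3.0, sigma, size=n)               # true mean is 3.0
    xbar = D.mean()
    mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    print(n, round(float(mu_n), 3))    # climbs from near mu0 = 0 toward 3.0
```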


Computing the Class-Conditional Density

• Having obtained the a posteriori density for the mean, p(µ|D), we now compute the "class-conditional" density p(x|D):

p(x|D) = ∫ p(x|µ) p(µ|D) dµ   (7)
       = ∫ (1/(σ√(2π))) exp( −(1/2) ((x − µ)/σ)² ) (1/(σn√(2π))) exp( −(1/2) ((µ − µn)/σn)² ) dµ   (8)
       = (1/(2πσσn)) exp( −(1/2) (x − µn)²/(σ² + σn²) ) f(σ, σn),   (9)


where

f(σ, σn) = ∫ exp( −(1/2) ((σ² + σn²)/(σ²σn²)) ( µ − (σn²x + σ²µn)/(σ² + σn²) )² ) dµ.

• Hence p(x|D) is normally distributed with mean µn and variance σ² + σn²:

p(x|D) ∼ N(µn, σ² + σn²).

• The density p(x|D) is the desired class-conditional density p(x|ωj,Dj).
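A Monte Carlo sanity check of this result (the posterior parameters below are hypothetical): sampling µ from p(µ|D) and then x from p(x|µ) should reproduce mean µn and variance σ² + σn².

```python
import numpy as np

rng = np.random.default_rng(3)
mu_n, sigma_n, sigma = 2.5, 0.3, 1.0              # hypothetical values

mus = rng.normal(mu_n, sigma_n, size=1_000_000)   # mu ~ p(mu|D)
xs = rng.normal(mus, sigma)                       # x ~ p(x|mu), one per mu

print(xs.mean(), xs.var())    # ~ 2.5 and ~ 1.0**2 + 0.3**2 = 1.09
```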

Exercise 2. Use Bayesian estimation to calculate the a posteriori density p(θ|D) and the desired probability density p(x|D) for the multivariate case where p(x|µ) ∼ N(µ, Σ).


BPE: General Theory

The basic assumptions for the applicability of Bayesian estimation are summarized as follows:

1. The form of the density p(x|θ) is assumed to be known, but the value of the parameter vector θ is not known exactly.

2. Our initial knowledge about θ is assumed to be contained in a known a priori density p(θ).

3. The rest of our knowledge about θ is contained in a set D of n samples x1, . . . , xn drawn independently according to the unknown probability density p(x).


The basic problem is to compute the posterior density p(θ|D), since from it we can obtain

p(x|D) = ∫ p(x|θ) p(θ|D) dθ.

By Bayes' formula we have

p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ,

and by the independence assumption

p(D|θ) = ∏_{k=1}^n p(xk|θ).
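These three formulas translate directly into a generic grid approximation; a sketch, where p_x_given_theta is any user-supplied univariate density function (the helper names are mine, not from the slides):

```python
import numpy as np

def posterior_on_grid(D, p_x_given_theta, prior, theta):
    """p(theta|D) on a uniform grid: prod_k p(x_k|theta) * p(theta), normalized."""
    log_p_D = np.zeros_like(theta)
    for x in D:                                      # independence assumption
        log_p_D += np.log(p_x_given_theta(x, theta))
    post = np.exp(log_p_D - log_p_D.max()) * prior   # shift max for stability
    dtheta = theta[1] - theta[0]
    return post / (post.sum() * dtheta)              # integrate to 1 numerically

def predictive(x, post, p_x_given_theta, theta):
    """p(x|D) = integral of p(x|theta) p(theta|D) dtheta (rectangle rule)."""
    dtheta = theta[1] - theta[0]
    return (p_x_given_theta(x, theta) * post).sum() * dtheta
```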


Frequentist Perspective

• Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.

• Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.

• Statistical procedures should be designed to have well-defined long-run frequency properties.


Bayesian Perspective

• Probability describes degrees of belief, not limiting frequency.

• We can make probability statements about parameters, even though they are fixed constants.

• We make inferences about a parameter θ by producing a probability distribution for θ.


Frequentists vs. Bayesians

• Bayesian inference is a controversial approach because it inherently embraces a subjective notion of probability.
