Learning From Data: MLE · 2011-02-15


Maximum Likelihood Estimators

Learning From Data: MLE


Parameter Estimation

Assuming sample x1, x2, ..., xn is from a parametric distribution f(x|θ), estimate θ.

E.g.: Given sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate

θ = probability of Heads

f(x|θ) is the Bernoulli probability mass function with parameter θ

Likelihood

P(x | θ): probability of event x given model θ.

Viewed as a function of x (fixed θ), it's a probability.

E.g., Σx P(x | θ) = 1

Viewed as a function of θ (fixed x), it's a likelihood. E.g., Σθ P(x | θ) can be anything; only the relative values are of interest. E.g., if θ = probability of heads in a sequence of coin flips, then P(HHTHH | 0.6) > P(HHTHH | 0.5), i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5.

And what θ makes HHTHH most likely?


Likelihood Function

Probability of HHTHH, given P(H) = θ:

θ       θ⁴(1 − θ)
0.2     0.0013
0.5     0.0313
0.8     0.0819
0.95    0.0407

[Figure: P(HHTHH | θ) = θ⁴(1 − θ) plotted against θ on [0, 1]; the curve peaks at θ = 0.8.]
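The table and the shape of the curve are easy to verify numerically; a minimal Python sketch (the function name is mine, not from the slides):

    # A quick check of the table and plot above.
    import numpy as np

    def likelihood_hhthh(theta):
        # HHTHH has 4 heads and 1 tail, so P(HHTHH | theta) = theta^4 * (1 - theta)
        return theta ** 4 * (1 - theta)

    for theta in (0.2, 0.5, 0.8, 0.95):
        print(f"theta = {theta:4.2f}   P(HHTHH | theta) = {likelihood_hhthh(theta):.4f}")

    # Coarse grid search over [0, 1]; the maximum is at theta = 0.8,
    # agreeing with the calculus answer d/dtheta [theta^4 (1 - theta)] = 0.
    grid = np.linspace(0.0, 1.0, 10001)
    print("argmax:", grid[np.argmax(likelihood_hhthh(grid))])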


Maximum Likelihood Parameter Estimation

One (of many) approaches to param. est.: likelihood of (indp) observations x1, x2, ..., xn:

L(x1, x2, ..., xn | θ) = ∏_{i=1}^{n} f(xi | θ)

As a function of θ, what θ maximizes the likelihood of the data actually observed? Typical approach: set

∂/∂θ L(x1, ..., xn | θ) = 0    or    ∂/∂θ log L(x1, ..., xn | θ) = 0


(Also verify it’s max, not min, & not better on boundary)
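When the calculus is inconvenient, the same recipe can be carried out numerically. A sketch using the coin sample HHTTTTTHTHTTTHH from the earlier slide (the helper name and the use of scipy are my choices, not the slides'):

    # Maximize the log-likelihood of the observed flips numerically and compare
    # with the closed-form answer n1/n derived in Example 1 below.
    import numpy as np
    from scipy.optimize import minimize_scalar

    flips = "HHTTTTTHTHTTTHH"
    n1 = flips.count("H")                 # number of heads
    n0 = flips.count("T")                 # number of tails

    def neg_log_likelihood(theta):
        return -(n1 * np.log(theta) + n0 * np.log(1.0 - theta))

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
    print("numeric MLE:", res.x, "  closed form n1/n:", n1 / (n0 + n1))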

Example 1

n coin flips, x1, x2, ..., xn; n0 tails, n1 heads, n0 + n1 = n;

θ = probability of heads

Set dL/dθ = 0 (equivalently, d(log L)/dθ = 0) and solve: θ̂ = n1/n.

Observed fraction of successes in sample is MLE of success probability in population.
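The algebra behind that step is not reproduced in this transcript; a symbolic reconstruction of the standard calculation (sympy, variable names mine):

    # log L = log[ theta^{n1} (1 - theta)^{n0} ] = n1 log(theta) + n0 log(1 - theta);
    # setting the derivative to zero recovers theta-hat = n1 / (n0 + n1) = n1 / n.
    import sympy as sp

    theta, n0, n1 = sp.symbols("theta n0 n1", positive=True)
    logL = n1 * sp.log(theta) + n0 * sp.log(1 - theta)
    print(sp.solve(sp.Eq(sp.diff(logL, theta), 0), theta))   # [n1/(n0 + n1)]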

Bias


A desirable property: an estimator Y of a parameter θ is an unbiased estimator if E[Y] = θ.

For the coin example above, the MLE is unbiased: Y = fraction of heads = (Σ1≤i≤n Xi)/n, where Xi is the indicator for heads in the i-th trial, so E[Y] = (Σ1≤i≤n E[Xi])/n = nθ/n = θ.

Aside: are all unbiased estimators equally good?

• No!

• E.g., “Ignore all but 1st flip; if it was H, let Y’ = 1; else Y’ = 0”

• Exercise: show this is unbiased

• Exercise: if observed data has at least one H and at least one T, what is the likelihood of the data given the model with θ = Y’ ?
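A small simulation makes the contrast concrete (not one of the exercises, just an illustration; parameter values are mine): both Y and Y' average out to θ, but Y' throws away n − 1 flips and is far noisier.

    # Compare the MLE Y (fraction of heads) with Y' (first flip only) over many samples.
    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, trials = 0.6, 15, 100_000
    flips = rng.random((trials, n)) < theta        # each row: one sample of n flips

    Y = flips.mean(axis=1)                         # MLE: fraction of heads
    Y_prime = flips[:, 0].astype(float)            # "ignore all but 1st flip"

    print("E[Y]  ~", Y.mean(), "  var(Y)  ~", Y.var())              # ~ theta, theta(1-theta)/n
    print("E[Y'] ~", Y_prime.mean(), "  var(Y') ~", Y_prime.var())  # ~ theta, theta(1-theta)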


Parameter Estimation

Assuming sample x1, x2, ..., xn is from a parametric distribution f(x|θ), estimate θ.

E.g.: Given n normal samples, estimate mean & variance

f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

θ = (µ, σ²)

[Figure: the normal density curve, with µ and µ ± σ marked on the x-axis.]

Ex. 2: I got data; a little birdie tells me it's normal, and promises σ² = 1

[Figure sequence: the observed data points are shown as X's along the x-axis, overlaid with a N(µ, 1) density curve (µ ± σ marked) at several candidate positions. Each slide asks: "Which is more likely: (a) this? (b) or this? (c) or this?"]

Looks good by eye, but how do I optimize my estimate of μ ?


Ex. 2: xi ∼ N(µ, σ²), σ² = 1, µ unknown

Set dL/dθ = 0 (and verify it's a max, not a min, and not better on the boundary).

Sample mean is MLE of population mean.
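Again the algebra isn't captured in this transcript; a symbolic sketch of the usual calculation (sympy; a three-point sample just for illustration, names mine):

    # With sigma^2 = 1, log L = sum_i [ -(1/2) log(2 pi) - (x_i - theta)^2 / 2 ];
    # d/dtheta log L = sum_i (x_i - theta) = 0 gives theta-hat = sample mean.
    import sympy as sp

    theta = sp.symbols("theta", real=True)
    xs = sp.symbols("x1:4", real=True)             # x1, x2, x3
    logL = sum(-sp.Rational(1, 2) * sp.log(2 * sp.pi) - (x - theta) ** 2 / 2 for x in xs)
    print(sp.solve(sp.Eq(sp.diff(logL, theta), 0), theta))   # [x1/3 + x2/3 + x3/3]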

Ex. 3: I got data; a little birdie tells me it's normal (but does not tell me σ²)

[Figure sequence: the same observed data, now overlaid with N(µ, σ²) density curves for several candidate pairs (µ, σ), e.g. σ = 1 vs. σ = 3, shifted vs. centered on the data. Each slide asks: "Which is more likely: (a) this? (b) or this? (c) or this? (d) or this?"]

Looks good by eye, but how do I optimize my estimates of μ & σ ?


Ex. 3: xi ∼ N(µ, σ²); µ, σ² both unknown

[Figure: the likelihood plotted as a surface over θ1 and θ2.]

Sample mean is MLE of population mean, again
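That "again" can be checked symbolically: even with θ2 = σ² unknown, the partial derivative with respect to θ1 still yields the sample mean. A sketch (sympy, three-point sample, names mine):

    # log L = sum_i [ -(1/2) log(2 pi theta2) - (x_i - theta1)^2 / (2 theta2) ];
    # d/dtheta1 log L = sum_i (x_i - theta1)/theta2 = 0 gives theta1-hat = sample mean,
    # independent of theta2.
    import sympy as sp

    theta1 = sp.symbols("theta1", real=True)
    theta2 = sp.symbols("theta2", positive=True)
    xs = sp.symbols("x1:4", real=True)
    logL = sum(-sp.Rational(1, 2) * sp.log(2 * sp.pi * theta2) - (x - theta1) ** 2 / (2 * theta2) for x in xs)
    print(sp.solve(sp.Eq(sp.diff(logL, theta1), 0), theta1))   # [x1/3 + x2/3 + x3/3]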


Ex. 3, (cont.)

ln L(x1, x2, ..., xn | θ1, θ2) = Σ1≤i≤n [ −(1/2) ln(2πθ2) − (xi − θ1)²/(2θ2) ]

∂/∂θ2 ln L(x1, x2, ..., xn | θ1, θ2) = Σ1≤i≤n [ −1/(2θ2) + (xi − θ1)²/(2θ2²) ] = 0

θ̂2 = ( Σ1≤i≤n (xi − θ̂1)² ) / n = s̄²

Sample variance is MLE of population variance

Bias? If Y is the sample mean, Y = (Σ1≤i≤n Xi)/n, then E[Y] = (Σ1≤i≤n E[Xi])/n = nµ/n = µ, so the MLE is an unbiased estimator of the population mean.

Similarly, (Σ1≤i≤n (Xi − µ)²)/n is an unbiased estimator of σ². Unfortunately, if µ is unknown and is estimated from the same data, as above, then θ̂2 = (Σ1≤i≤n (Xi − θ̂1)²)/n is a consistent, but biased, estimate of the population variance. (An example of overfitting.) The unbiased estimate is (Σ1≤i≤n (Xi − θ̂1)²)/(n − 1).
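A quick simulation illustrates the bias (my sketch, not from the slides): dividing by n systematically underestimates σ², dividing by n − 1 does not.

    # Monte Carlo check: E[ sum (x - xbar)^2 / n ] = (n-1)/n * sigma^2, not sigma^2.
    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 0.0, 1.0, 5, 200_000
    x = rng.normal(mu, sigma, size=(trials, n))

    ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    print("mean of sum/n      :", (ss / n).mean())         # ~ 0.8  (biased low)
    print("mean of sum/(n - 1):", (ss / (n - 1)).mean())   # ~ 1.0  (unbiased)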

Moral: MLE is a great idea, but not a magic bullet.

Ex. 3, (cont.)

More on Bias of θ̂2

θ̂2 is consistent, i.e., limn→∞ θ̂2 = σ², the correct value.

Biased? Yes. Why? As an extreme, think about n = 1. Then θ̂2 = 0; probably an underestimate!

Also, think about n = 2. Then θ̂1 is exactly between the two sample points, the position that exactly minimizes the expression for θ̂2. Any other choices of θ1, θ2 would make the likelihood of the observed data slightly lower. But it's actually pretty unlikely that two sample points would fall exactly equidistant from, and on opposite sides of, the true mean, so the MLE θ̂2 systematically underestimates σ².

(But not by much, and the bias shrinks with sample size.)

Summary

MLE is one way to estimate parameters from data. You choose the form of the model (normal, binomial, ...); the math chooses the value(s) of the parameter(s). It has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically it just assumes your sample is "representative".

Of course, unusual samples will give bad estimates (estimate normal human heights from a sample of NBA stars?) but that is an unlikely event

Often, but not always, MLE has other desirable properties like being unbiased
