
Page 1

Example: Correct the Bias in (S²)^(1/2)

Define derivatives:

    y(x) = x^b ,   y′(x) = b x^(b−1) ,   y″(x) = b(b−1) x^(b−2)

Evaluate the bias, using Var(S²) = 2σ⁴/(N−1) for Gaussian data:

    ⟨(S²)^b⟩ ≈ y(⟨S²⟩) + ½ y″(⟨S²⟩) Var(S²) + …
             = y(σ²) + ½ y″(σ²) · 2σ⁴/(N−1) + …
             = σ^(2b) + ½ b(b−1) σ^(2(b−2)) · 2σ⁴/(N−1) + …
             = σ^(2b) [ 1 + b(b−1)/(N−1) + … ]

Setting b = p/2:

    ⟨(S²)^(p/2)⟩ = σ^p [ 1 + p(p−2)/(4(N−1)) + O(N^(−2)) ]

Bias-corrected estimator:

    S ≡ [ 1 − p(p−2)/(4(N−1)) + … ]^(1/p) · [ (1/(N−1)) Σ_{i=1..N} (X_i − µ)² ]^(1/2)

so that ⟨S^p⟩ = σ^p. In particular, S² is an unbiased estimator of σ², but (S²)^(1/2) is a biased estimator of σ unless this correction is applied.
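A minimal Monte Carlo sketch of this correction (not from the slides): for p = 1 the prediction is ⟨√S²⟩ ≈ σ[1 − 1/(4(N−1))], and multiplying by [1 + 1/(4(N−1))] should remove the bias to O(N⁻²). The sample size, trial count and σ below are illustrative choices, and S² is taken as the usual unbiased variance about the sample mean (for which Var(S²) = 2σ⁴/(N−1) holds for Gaussian data).

```python
# Check that sqrt(S^2) underestimates sigma by ~[1 - 1/(4(N-1))] and that the
# p = 1 correction factor removes the bias (Gaussian data, illustrative numbers).
import numpy as np

rng = np.random.default_rng(1)
N, Ntrial, sigma = 5, 200_000, 2.0
p = 1                                              # we want an unbiased estimate of sigma^p

x = rng.normal(0.0, sigma, size=(Ntrial, N))
S2 = x.var(axis=1, ddof=1)                         # unbiased sample variance, <S^2> = sigma^2
S_naive = np.sqrt(S2)                              # biased low
corr = (1 - p * (p - 2) / (4 * (N - 1))) ** (1.0 / p)   # = 1 + 1/(4(N-1)) for p = 1
S_corr = corr * S_naive

print("predicted <S_naive>/sigma:", 1 - 1 / (4 * (N - 1)))
print("measured  <S_naive>/sigma:", S_naive.mean() / sigma)
print("measured  <S_corr >/sigma:", S_corr.mean() / sigma)    # ~1 up to O(N^-2)
```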

Review: "Badness-of-Fit" Statistics

    Sample Variance (minimised by the mean):
        S² ≡ (1/(N−1)) Σ_{i=1..N} (X_i − µ_i)²      →  X̄
    Chi-squared (minimised by the optimal average):
        χ² ≡ Σ_{i=1..N} [ (X_i − µ_i)/σ_i ]²        →  X̂
    Mean Absolute Deviation (minimised by the median):
        MAD ≡ (1/N) Σ_{i=1..N} | X_i − µ_i |        →  X_M
    Sum of Absolute Normalised Errors:
        SANE ≡ Σ_{i=1..N} | X_i − µ_i | / σ_i
    Sigma-clip: reject points beyond ±Kσ.

Badness functions: ε², |ε|, χ², |χ|.
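A minimal sketch (with made-up data and names) computing the four statistics in the review table for data X_i, model values µ_i and error bars σ_i; the outlier shows why the |·|-based statistics are more robust than the squared ones.

```python
# Compute the badness-of-fit statistics from the review table (illustrative data).
import numpy as np

def badness_stats(X, mu, sigma):
    r = X - mu
    S2   = np.sum(r**2) / (len(X) - 1)      # sample variance    -> minimised by the mean
    chi2 = np.sum((r / sigma)**2)           # chi-squared        -> minimised by the optimal average
    MAD  = np.mean(np.abs(r))               # mean abs deviation -> minimised by the median
    SANE = np.sum(np.abs(r) / sigma)        # sum of absolute normalised errors
    return S2, chi2, MAD, SANE

X     = np.array([9.8, 10.4, 11.1, 9.5, 30.0])   # last point is an outlier
sigma = np.array([0.5, 0.5, 0.5, 0.5, 0.5])
mu    = np.full_like(X, 10.0)                    # constant model mu_i = 10
print(badness_stats(X, mu, sigma))               # squared stats blow up; |.| stats less so
```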

Error Bars live with the Model

Not with the Data

Usually the distinction is unimportant.

But sometimes it is important.

Error bars live with the model, not the data!

Example: Poisson data. How do we attach error bars to the data points?

The wrong way: assigning σ(X_i) = √X_i gives a downward bias. Points lower than average by chance are given smaller error bars, and hence more weight than they deserve.

The right way: assign σ = √λ, where λ = mean count rate predicted by the model.

[Figure: Poisson counts X_i versus i, with the model mean λ marked.]

    Prob(x = n | λ) = λ^n e^(−λ) / n! ,   n = 0, 1, 2, …

    ⟨X_i⟩ = λ ,   σ²(X_i) = λ

If σ(X_i) = √X_i, then 1/σ_i² = ∞ when X_i = 0, and

    X̂ ≡ [ Σ_i X_i/σ_i² ] / [ Σ_i 1/σ_i² ] = 0·∞ / ∞ = 0 ,   clearly wrong!
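A small simulation sketch of this point (illustrative numbers, not from the slides): with σ_i² = X_i the weighted mean collapses to Σ(X_i/X_i)/Σ(1/X_i) = N/Σ(1/X_i), the harmonic mean, which is biased low; with σ² = λ from the model every weight is the same and the estimate is just the plain mean.

```python
# "Wrong way" (weights 1/X_i) vs "right way" (weights 1/lambda) for Poisson data.
import numpy as np

rng = np.random.default_rng(2)
lam, N, Ntrial = 10.0, 20, 50_000
X = rng.poisson(lam, size=(Ntrial, N)).astype(float)
X = X[(X > 0).all(axis=1)]                   # the wrong way is undefined when any X_i = 0

wrong = X.shape[1] / (1.0 / X).sum(axis=1)   # sum(X_i/X_i) / sum(1/X_i): harmonic mean
right = X.mean(axis=1)                       # equal model weights cancel: plain mean

print("true lambda      :", lam)
print("wrong-way average:", wrong.mean())    # noticeably below lambda
print("right-way average:", right.mean())    # ~ lambda
```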

Maximum Likelihood (ML) Estimation

Likelihood of parameters α for a given dataset:

    L(α) ≡ P(X | α) = P(X₁ | α) × P(X₂ | α) × … × P(X_N | α) ≡ Π_{i=1..N} P(X_i | α)

Example: Gaussian errors:

    P(X_i | α) = [ 1 / (√(2π) σ_i) ] exp{ −½ [ (X_i − µ_i(α)) / σ_i ]² }

    L(α) = exp{ −χ²/2 } / Z_D ,   Z_D ≡ (2π)^(N/2) Π_{i=1..N} σ_i

    BoF ≡ −2 ln L = χ² + Σ_i ln σ_i² + N ln(2π)

To maximise L(α), minimise χ² + Σ_i ln σ_i².

Maximum Likelihood Parameters

    α_ML satisfies   0 = ∂/∂α [ −2 ln L(α) ] ,

    Var[α_ML] ≈ 2 / { ∂²/∂α² [ −2 ln L(α) ] }   evaluated at α = α_ML

Generalises χ² fitting:
1. For parameters that affect σ
2. For non-Gaussian errors
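A quick finite-difference sketch of the variance formula above, on the simplest case where the exact answer is known: Gaussian data with known σ and α = µ, for which Var[µ_ML] = σ²/N. The data values and step size are illustrative.

```python
# Var[alpha_ML] ~ 2 / d^2(-2 ln L)/d alpha^2, checked against sigma^2/N.
import numpy as np

rng = np.random.default_rng(3)
sigma, N = 2.0, 25
X = rng.normal(5.0, sigma, N)

def minus2lnL(mu):                       # chi^2, up to a mu-independent constant
    return np.sum(((X - mu) / sigma) ** 2)

mu_ml = X.mean()                         # maximises L for this model
h = 1e-3                                 # finite-difference step for the 2nd derivative
curv = (minus2lnL(mu_ml + h) - 2 * minus2lnL(mu_ml) + minus2lnL(mu_ml - h)) / h**2

print("Var from curvature:", 2.0 / curv)       # ~ sigma^2 / N
print("exact sigma^2 / N :", sigma**2 / N)
```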

Need ML when Parameters alter Error Bars

•  Data points X_i with no error bars:
•  To find µ, minimise χ².
•  To find σ, minimising χ² fails:

    χ² = Σ_{i=1..N} [ (X_i − µ) / σ ]²  →  0   as σ → ∞

•  The ML method instead minimises

    −2 ln L = χ² + N ln σ²

[Figures: data X_i versus i with no error bars; −2 ln L = χ² + N ln σ² versus σ, showing the decreasing χ² term, the rising N ln σ² term, and the minimum at σ_ML.]
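A numerical sketch of this slide (illustrative data, brute-force scan rather than any particular optimiser): χ² alone can be driven to zero by letting σ grow, but −2 ln L = χ² + N ln σ² has a genuine minimum at σ_ML² = Σ(X_i − µ)²/N.

```python
# chi^2 vs -2 ln L = chi^2 + N ln sigma^2 as a function of sigma.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(90.0, 10.0, 20)                    # 20 points scattered about 90

mu = X.mean()                                     # ML estimate of mu (minimises chi^2)
sig = np.linspace(1.0, 30.0, 2000)
chi2 = np.array([np.sum(((X - mu) / s) ** 2) for s in sig])
m2lnL = chi2 + len(X) * np.log(sig**2)            # chi^2 + N ln sigma^2 (constants dropped)

print("chi^2 keeps falling with sigma:", chi2[::500])       # -> 0 as sigma -> infinity
print("sigma_ML from scan            :", sig[np.argmin(m2lnL)])
print("analytic sigma_ML             :", np.sqrt(np.mean((X - mu) ** 2)))
```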


Page 2

Need ML to fit low-count Poisson Data

Example: Poisson data:

    P(X = n | λ) = e^(−λ) λ^n / n! ,   n = 0, 1, …, ∞

Likelihood for N Poisson data points:

    L(λ) = Π_{i=1..N} P(X_i | λ) = Π_{i=1..N} e^(−λ) λ^(X_i) / X_i!

    ln L = Σ_i ( −λ + X_i ln λ − ln X_i! )

Maximum likelihood estimator of λ:

    ∂ ln L / ∂λ = −N + (1/λ) Σ_i X_i = 0   at λ = λ_ML

    ∴ λ_ML = (1/N) Σ_i X_i .

[Figures: low-count data X_i versus i; the likelihood L(λ) versus λ, peaking at λ_ML = 4.5; Poisson distributions P(X | λ) for λ = 1, λ_ML = 4.5 and λ = 7.]
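A brute-force sketch of the Poisson ML fit (illustrative counts, chosen so their mean is 4.5 like the λ_ML in the figure): scan ln L(λ) over a grid and confirm the peak sits at the sample mean, as the derivative above predicts.

```python
# ln L(lambda) = sum_i( -lambda + X_i ln lambda - ln X_i! ), peak at mean(X_i).
import numpy as np
from scipy.special import gammaln            # ln X! = gammaln(X + 1)

X = np.array([2, 5, 0, 7, 4, 6, 3, 9])       # low-count Poisson data (illustrative)
lam = np.linspace(0.5, 12.0, 5000)

lnL = np.array([np.sum(-l + X * np.log(l) - gammaln(X + 1)) for l in lam])

print("lambda_ML from scan:", lam[np.argmax(lnL)])
print("sample mean of X_i :", X.mean())      # the analytic ML estimator
```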

Conditional Probabilities

P(X,Y) = joint probability density of X and Y.
P(X) = projection of P(X,Y) onto the X axis:

    P(X) = ∫ P(X,Y) dY

Conditional Probability: P(X | Y) = "probability of X given Y" = "normalised slice" of P(X,Y) at a fixed value of Y:

    P(X | Y) ≡ P(X,Y) / P(Y) = P(X,Y) / ∫ P(X,Y) dX

[Figure: contours of P(X,Y) with its projection P(X) and a slice P(X | Y) at fixed Y.]

Test Understanding

    Y = 3X ,   X = Gaussian

    P(Y | X = 2) = ?
    P(Y | X > 2) = ?

[Figure: the line Y = 3X over a Gaussian P(X); since Y = 3X, the slice at X = 2 is concentrated at Y = 6, and X > 2 corresponds to Y > 6.]

Conditional Probabilities

P(X) = projection onto the X axis; P(Y) = projection onto the Y axis:

    P(X) = ∫ P(X,Y) dY ,   P(Y) = ∫ P(X,Y) dX

Conditional Probability: P(X | Y) = normalised slice at fixed Y; P(Y | X) = normalised slice at fixed X:

    P(X | Y) ≡ P(X,Y) / P(Y) ,   P(Y | X) ≡ P(X,Y) / P(X)

    P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X)

[Figure: P(X,Y) with its projections P(X), P(Y) and slices P(X | Y), P(Y | X).]
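A minimal discrete sketch of these relations (made-up probability table): build a joint P(X,Y), project it to get the marginals, and normalise slices to get the conditionals; the product rules then hold identically.

```python
# Marginals, conditionals, and P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X) for a discrete table.
import numpy as np

P_XY = np.array([[0.10, 0.20, 0.05],         # rows index X, columns index Y
                 [0.05, 0.30, 0.30]])
assert np.isclose(P_XY.sum(), 1.0)

P_X = P_XY.sum(axis=1)                       # P(X): "projection onto the X axis"
P_Y = P_XY.sum(axis=0)                       # P(Y): "projection onto the Y axis"
P_X_given_Y = P_XY / P_Y                     # P(X|Y): each column normalised
P_Y_given_X = P_XY / P_X[:, None]            # P(Y|X): each row normalised

assert np.allclose(P_X_given_Y * P_Y, P_XY)              # P(X|Y) P(Y) = P(X,Y)
assert np.allclose(P_Y_given_X * P_X[:, None], P_XY)     # P(Y|X) P(X) = P(X,Y)
print("checks passed")
```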


Page 3

Bayes’ Theorem and Bayesian Inference

Bayes’ Theorem:

    Since P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X),

    then P(X | Y) = P(Y | X) P(X) / P(Y) = P(Y | X) P(X) / ∫ P(Y | X) P(X) dX

Bayesian Inference: shows us how to change our probability distribution over various models in light of new data.

    P(model | data) = P(data | model) P(model) / P(data)

Inferences depend on the Prior, not just the Data

Bayesian inference (M = model, D = data):

•  The Likelihood, P(data | model), is quantified by a "badness-of-fit" statistic, e.g. P(data | model) ~ exp(−χ²/2).
•  The Prior, P(model), expresses your prejudice (prior knowledge).
•  The Posterior, P(model | data), gives your inference: the relative probabilities of different models (parameters) in light of the data.

Posterior Probability = (Likelihood × Prior Probability) / Evidence

    P(M | D) = P(D | M) P(M) / P(D) = P(D | M) P(M) / ∫ P(D | M) P(M) dM

Relative probability of two models M₁ and M₂:

    P(M₁ | D) / P(M₂ | D) = [ P(D | M₁) / P(D | M₂) ] × [ P(M₁) / P(M₂) ] ≈ exp( −Δχ²/2 ) × P(M₁) / P(M₂)

No absolute inferences! New data updates your prior expectations, but your conclusions depend also on your prior.
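A short numerical sketch of the model-comparison ratio above (all numbers illustrative): two models fit the same data with different χ², and the posterior odds are the exp(−Δχ²/2) likelihood ratio times the prior odds.

```python
# Posterior odds ~ exp(-Delta chi^2 / 2) x prior odds.
import numpy as np

chi2_1, chi2_2 = 52.3, 58.1               # model 1 fits better by Delta chi^2 = 5.8
prior_odds = 0.5                          # P(M1)/P(M2): M2 thought twice as likely a priori

likelihood_ratio = np.exp(-(chi2_1 - chi2_2) / 2.0)      # ~ 18
posterior_odds = likelihood_ratio * prior_odds

print("likelihood ratio P(D|M1)/P(D|M2):", likelihood_ratio)
print("posterior odds  P(M1|D)/P(M2|D) :", posterior_odds)   # the data overcome the prior here
```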

Choice of Prior

•  A model for a set of data X depends on a parameter α.

•  Knowledge of α before measuring X is quantified by the prior P(α).

•  Choice of P(α) is arbitrary, subject to common sense!

•  After measuring X, Bayes' theorem gives the posterior:

    P(α | X) ∝ P(X | α) P(α) = L(α) P(α)

•  Different priors P(α) lead to different inferences:

[Figure: the same likelihood L(α) combined with a uniform prior P(α) and with P(α) ~ 1/α (uniform in log α) gives different posteriors P(α | X) ∝ L(α) P(α).]
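A sketch of the figure's point with an assumed likelihood shape (the Gaussian-in-α form and all numbers are illustrative): the same L(α) under a uniform prior versus a 1/α prior yields posteriors with different peaks, hence different inferences.

```python
# Posterior P(alpha|X) ~ L(alpha) P(alpha) under two different priors.
import numpy as np

alpha = np.linspace(0.1, 10.0, 2000)
L = np.exp(-0.5 * ((alpha - 3.0) / 1.5) ** 2)       # an illustrative likelihood peaking at 3

def posterior(prior):
    p = L * prior
    return p / (p.sum() * (alpha[1] - alpha[0]))    # normalise on the grid

post_uniform = posterior(np.ones_like(alpha))       # uniform P(alpha)
post_log     = posterior(1.0 / alpha)               # P(alpha) ~ 1/alpha (uniform in log alpha)

for name, post in [("uniform prior", post_uniform), ("1/alpha prior", post_log)]:
    print(name, "-> posterior peak at alpha =", alpha[np.argmax(post)])
# The 1/alpha prior pulls the peak to smaller alpha (here 1.5 instead of 3.0).
```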

Max Likelihood and Bayesian Inference

[Figure: the Model maps parameters (α₁, α₂) in Parameter Space to P(X | α) over Data Space (X₁, X₂); a specific dataset picks out the Likelihood Function over parameter space.]

    L(α) ≡ P(X | α) ,   and α_ML maximises L(α).

Bayesian Inference: the Likelihood updates the Prior, giving the Posterior Probability:

    P(α | X) = P(X | α) P(α) / ∫ P(X | α) P(α) dα ∝ L(α) P(α)

    α_MAP maximises the posterior L(α) P(α).

N=1 Gaussian Datum with Uniform Prior

Data: X ± σ.   Model parameter: µ.

Likelihood function:

    L(µ) ≡ P(X | µ) = exp{ −½ [ (X − µ)/σ ]² } / ( √(2π) σ )

    µ_ML = X maximises L(µ).

Posterior probability:

    P(µ | X) = P(X | µ) P(µ) / P(X) ,   P(X) = ∫ P(X | µ) P(µ) dµ

Uniform prior: P(µ) = constant, so P(µ | X) ∝ P(X | µ).

Maximum Likelihood implicitly assumes a Uniform Prior.

[Figure: the likelihood P(X | µ) peaks at µ_ML = X; with a uniform prior the posterior P(µ | X) has the same shape.]

N=1 Gaussian Datum with Gaussian Prior

Gaussian data: X ± σ.

Likelihood:

    L(µ) ≡ P(X | µ) = [ 1 / (√(2π) σ) ] exp{ −½ [ (X − µ)/σ ]² }

Prior:

    P(µ) = [ 1 / (√(2π) σ₀) ] exp{ −½ [ (µ − µ₀)/σ₀ ]² }

Posterior: P(µ | X) ∝ Likelihood × Prior:

    L(µ) P(µ) ∝ exp{ −½ [ (X − µ)/σ ]² } exp{ −½ [ (µ − µ₀)/σ₀ ]² }
              ∝ exp{ −½ [ (µ − µ_MAP) / σ(µ_MAP) ]² }

Maximum Posterior (MAP) estimate:

    µ_MAP = ( µ₀/σ₀² + X/σ² ) / ( 1/σ₀² + 1/σ² ) ,   Var(µ_MAP) = 1 / ( 1/σ₀² + 1/σ² ).

Same as the Optimal Average! The Gaussian prior acts like one more data point.
The data "pull" the probability away from the prior, and vice-versa.
Verify this result.

[Figure: the likelihood L(µ) centred on X, the prior P(µ) centred on µ₀, and the product Likelihood × Prior peaking at µ_MAP in between.]
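A numerical way to "verify this result" (illustrative values of X, σ, µ₀, σ₀): evaluate the posterior Likelihood × Prior on a grid of µ and check that its peak and variance match the formulas above.

```python
# One Gaussian datum X +/- sigma with a Gaussian prior mu0 +/- sigma0.
import numpy as np

X, sigma    = 4.0, 1.0        # the datum and its error bar
mu0, sigma0 = 0.0, 2.0        # Gaussian prior on mu

mu = np.linspace(-5.0, 10.0, 20001)
post = np.exp(-0.5 * ((X - mu) / sigma) ** 2) * np.exp(-0.5 * ((mu - mu0) / sigma0) ** 2)
post /= post.sum() * (mu[1] - mu[0])                    # normalise numerically

mu_map  = (mu0 / sigma0**2 + X / sigma**2) / (1 / sigma0**2 + 1 / sigma**2)
var_map = 1.0 / (1 / sigma0**2 + 1 / sigma**2)

print("posterior peak (grid):", mu[np.argmax(post)], " formula:", mu_map)
print("posterior variance   :", np.sum(post * (mu - mu_map) ** 2) * (mu[1] - mu[0]),
      " formula:", var_map)
```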


Page 4

N Gaussian Data with Gaussian Prior

Likelihood:

    L(µ) ≡ P(X | µ) = Π_{i=1..N} P(X_i | µ) = exp{ −½ χ² } / [ (2π)^(N/2) Π_i σ_i ]

Prior:

    P(µ) = [ 1 / (√(2π) σ₀) ] exp{ −½ [ (µ − µ₀)/σ₀ ]² }

Posterior: P(µ | X) ∝ Likelihood × Prior:

    L(µ) P(µ) ∝ exp{ −χ²/2 − ½ [ (µ − µ₀)/σ₀ ]² } ∝ exp{ −½ [ (µ − µ_MAP) / σ(µ_MAP) ]² }

Maximum Posterior (MAP) estimate:

    µ_MAP = [ µ₀/σ₀² + Σ_{i=1..N} X_i/σ_i² ] / [ 1/σ₀² + Σ_{i=1..N} 1/σ_i² ] ,

    σ²(µ_MAP) = 1 / [ 1/σ₀² + Σ_{i=1..N} 1/σ_i² ].

Same as the Optimal Average! The Gaussian prior acts like one more data point.

[Figure: the N likelihoods P(X_i | µ), the prior P(µ) centred on µ₀, and the product Likelihood × Prior peaking at µ_MAP.]
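A short sketch of the N-point MAP formula (illustrative data): the prior enters the inverse-variance weighted average as one extra point µ₀ ± σ₀, and with a very broad prior the result reduces to the plain optimal average.

```python
# mu_MAP as an optimal average with the prior counted as one more data point.
import numpy as np

X     = np.array([4.2, 5.1, 3.8, 4.6])
sigma = np.array([0.5, 0.8, 0.4, 0.6])
mu0, sigma0 = 0.0, 2.0                   # Gaussian prior

w  = 1.0 / sigma**2                      # data weights
w0 = 1.0 / sigma0**2                     # prior weight ("one more data point")
mu_map  = (mu0 * w0 + np.sum(X * w)) / (w0 + np.sum(w))
var_map = 1.0 / (w0 + np.sum(w))

print("mu_MAP            :", mu_map)
print("sigma(mu_MAP)     :", np.sqrt(var_map))
print("plain optimal avg :", np.sum(X * w) / np.sum(w))   # recovered as sigma0 -> infinity
```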

Summary:

1.  Error bars live with the Model, not with the Data.

2.  Bayes' Theorem (Bayesian Inference):

    P(Model | Data) = P(Data | Model) P(Model) / P(Data)

3.  Maximum Likelihood, e.g. for Gaussian data:

    L(Model) ≡ P(Data | Model)

    BoF = −2 ln L = χ² + Σ_{i=1..N} ln σ_i² + const

4.  Minimise χ² if Gaussian errors with known σ_i.

5.  Or maximise the likelihood (e.g. minimise BoF = −2 ln L) if the error bars are unknown, or for low-count Poisson data.

6.  Or do a full Bayesian analysis, including the prior, e.g. for Gaussian data:

    BoF = −2 ln P(Model | Data) = χ² + Σ_{i=1..N} ln σ_i² − 2 ln P(Model) + const