Bayesian and frequentist inference
Overview Methods of inference Asymptotic theory Examples Constructing priors Conclusions
Bayesian and frequentist inference
Nancy Reid
March 26, 2007
Don Fraser, Ana-Maria Staicu
Overview
Methods of inference
Asymptotic theory: approximate posteriors; matching priors
Examples: logistic regression; normal circle
Constructing priors
Conclusions
A Basic Structure
• data y = (y1, . . . , yn) ∈ R^n
• model f(y; θ), a probability density
• parameters θ = (θ1, . . . , θk) ∈ R^k
• inference about θ (or components of θ):
  • an interval of values of θ consistent with the data
  • a probability distribution for θ, conditional on the data
• “The average 12 month weight loss on the Atkins diet was 10.3 lbs, compared to 3.5, 5.7 and 4.8 lbs on the Zone, LEARN and Ornish diets, respectively.”¹
• “mean 12 month weight loss on the Atkins diet was 4.7 kg (95% confidence interval 3.1 to 6.3 kg)”
• “mean 12 month weight loss was significantly different between the Atkins and the Zone diets (p < 0.05)”
• The probability that the average weight loss on the Atkins diet is greater than that on the Zone diet is 0.983
¹ JAMA, Mar 7
Bayesian inference
• π(θ | y) ∝ f(y; θ) π(θ)

$$\pi(\theta \mid y) = \frac{f(y; \theta)\,\pi(\theta)}{\int f(y; \theta)\,\pi(\theta)\,d\theta}$$

• parameter of interest ψ(θ)

$$\pi_m(\psi \mid y) = \int_{\{\psi(\theta)=\psi\}} \pi(\theta \mid y)\,d\theta$$

• integrals often computed using simulation methods
• the marginal posterior density summarizes all information about ψ available from the data
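The recipe above — posterior ∝ likelihood × prior, normalized by the integral — can be sketched numerically. This is a hedged illustration with a made-up binomial example and a flat prior (not an example from the talk), using a simple grid in place of simulation:

```python
# Grid-based posterior for a binomial model with a flat prior.
# Data (n = 20 trials, s = 14 successes) are invented for illustration.

n, s = 20, 14
grid = [i / 1000 for i in range(1, 1000)]            # theta in (0, 1)

def likelihood(theta):
    return theta ** s * (1 - theta) ** (n - s)

prior = lambda theta: 1.0                            # flat prior pi(theta)

unnorm = [likelihood(t) * prior(t) for t in grid]    # f(y; theta) * pi(theta)
norm = sum(unnorm) * 0.001                           # approximates the integral
posterior = [u / norm for u in unnorm]               # pi(theta | y)

# posterior mean; the exact Beta(15, 7) answer is 15/22 ~ 0.682
mean = sum(t * p for t, p in zip(grid, posterior)) * 0.001
print(round(mean, 3), round(15 / 22, 3))
```

With one parameter a grid suffices; the slide's point is that for higher-dimensional θ the same normalizing integral is what forces simulation or approximation.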
nonBayesian inference
• find a function of the data, t(y) say, with
  1. distribution depending only on ψ
  2. distribution that is known
• usually based on the log-likelihood function:

$$\ell(\theta; y) = \log f(y; \theta), \qquad \hat\theta \equiv \arg\sup_\theta \ell(\theta; y), \qquad \hat\theta_\psi \equiv \arg\sup_{\psi(\theta)=\psi} \ell(\theta; y)$$

• likelihood ratio statistic: $2\{\ell(\hat\theta; y) - \ell(\hat\theta_\psi; y)\} \xrightarrow{d} \chi^2_{d_1}$
• Wald statistic: $\hat\Sigma^{-1/2}(\hat\psi - \psi) \xrightarrow{d} N(0, I)$
• if ψ is scalar: $r^*(\psi) \overset{.}{\sim} N(0, 1)$
• the statistic is derived from the model (via the likelihood function)
• its distribution comes from asymptotic theory
• the likelihood is not (usually) integrated but is (usually) maximized
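The likelihood ratio statistic can be computed directly for a scalar-parameter model. A minimal sketch with made-up exponential data (my own example, not from the slides), using the fact that the χ²₁ tail probability is 2{1 − Φ(√W)}:

```python
import math

# y_i ~ Exponential(mean theta); test theta = theta0 with the LR statistic.
# The data below are invented for illustration.

y = [0.8, 1.9, 0.4, 2.7, 1.1, 0.6, 3.2, 1.4, 0.9, 2.0]
theta0 = 1.0

def loglik(theta):
    # l(theta; y) = sum_i { -log theta - y_i / theta }
    return sum(-math.log(theta) - yi / theta for yi in y)

theta_hat = sum(y) / len(y)                    # MLE of the mean
W = 2 * (loglik(theta_hat) - loglik(theta0))   # LR statistic, ~ chi^2_1
# chi^2_1 tail via the normal: P(W > w) = 2 * (1 - Phi(sqrt(w)))
p = 2 * (1 - 0.5 * (1 + math.erf(math.sqrt(W) / math.sqrt(2))))
print(round(W, 3), round(p, 3))
```

Note that the likelihood is maximized (at θ̂ and at θ₀), not integrated, and the reference distribution comes from asymptotic theory, exactly as the bullets above describe.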
... Bayesian inference
• well defined basis for inference
• internally consistent
• leads to optimal results from one point of view
• requires a probability distribution for θ (a prior distribution)
• not necessarily calibrated
• not always clear how much answers depend on the choice of prior
• requires (high-)dimensional integration, or simulation, or approximation
• useful in applications
... nonBayesian inference
• properties well understood
• calibrated: properties correspond to long-run frequency
• choice of function t(y) may be non-optimal
• approximation to the distribution of t(Y) may be poor
• approximation to the distribution of t(Y) may be excellent
• useful in applications
• Fisher (1920s, frequentist): “I know only one case in mathematics of a doctrine which has been accepted and developed by the most eminent men of their time, and is now perhaps accepted by men now living, which at the same time has appeared to a succession of sound writers to be fundamentally false and devoid of foundation”
• Lindley (1950s, Bayesian): “Every statistician would be a Bayesian if he took the trouble to read the literature thoroughly and was honest enough to admit that he might have been wrong”
Approximate posteriors

$$\int_{-\infty}^{\psi} \pi_m(\psi' \mid y)\,d\psi' = \Phi(r^*_B)\{1 + O(n^{-3/2})\}$$

$$r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}, \qquad r = \mathrm{sign}(\psi - \hat\psi)\,[2\{\ell_p(\hat\psi) - \ell_p(\psi)\}]^{1/2}$$

$$q_B = -\ell_p'(\psi)\, j_p^{-1/2}(\hat\psi)\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{1/2}}{|j_{\lambda\lambda}(\hat\theta)|^{1/2}}\;\frac{\pi(\hat\theta)}{\pi(\hat\theta_\psi)}$$

where $\ell_p(\psi) = \ell(\psi, \hat\lambda_\psi) = \ell(\hat\theta_\psi)$ is the profile log-likelihood.
marginal posterior density:

$$\pi_m(\psi \mid y) = \frac{\int_{\{\psi(\theta)=\psi\}} \exp\{\ell(\theta; y)\}\,\pi(\theta)\,d\theta}{\int \exp\{\ell(\theta; y)\}\,\pi(\theta)\,d\theta}$$

$$= c\,\frac{\exp\{\ell(\hat\theta_\psi)\}\,|j_{\lambda\lambda}(\hat\theta_\psi)|^{-1/2}\,\pi(\hat\theta_\psi)}{\exp\{\ell(\hat\theta)\}\,|j(\hat\theta)|^{-1/2}\,\pi(\hat\theta)}\,\{1 + O(n^{-3/2})\}$$

$$\doteq c\,\exp\{\ell(\hat\theta_\psi) - \ell(\hat\theta)\}\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{-1/2}}{j_p^{-1/2}(\hat\psi)\,|j_{\lambda\lambda}(\hat\theta)|^{-1/2}}\;\frac{\pi(\hat\theta_\psi)}{\pi(\hat\theta)}$$

changing variables from ψ to r,

$$\pi_m(r \mid y) = c\,\exp\!\left(-\tfrac12 r^2\right)\frac{r}{q_B} = c\,\exp\!\left(-\tfrac12 r^2 - \log\frac{q_B}{r}\right)$$
... approximation to the posterior is really good

• Example: yi ∼ N(θ, I), yi = (yi1, . . . , yik), θ = (θ1, . . . , θk)
• ψ = √(θ1² + · · · + θk²)
• flat prior π(θ) dθ ∝ dθ
• posterior is proper: nψ² | y ∼ χ′²k(n‖ȳ‖²), noncentral, with ψ̂ = ‖ȳ‖

$$r^*_B = \sqrt{n}(\psi - \hat\psi) + \frac{1}{\sqrt{n}(\psi - \hat\psi)}\log\left\{\left(\frac{\hat\psi}{\psi}\right)^{(k-1)/2}\right\}$$
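Under my reading of the slide's formulas (the sign conventions here are my assumption), the approximation can be checked numerically against the exact noncentral chi-square posterior, using only the standard library:

```python
import math

# Normal circle check: theta | y ~ N(ybar, I/n) under a flat prior, so
# n*psi^2 | y is noncentral chi^2 with k df and noncentrality n*||ybar||^2.
# Compare the exact posterior CDF of psi with Phi(r*_B).

def chi2_cdf_even(x, df):
    # chi-square CDF for even df: 1 - e^{-x/2} * sum_{i < df/2} (x/2)^i / i!
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def ncx2_cdf(x, df, nc, terms=120):
    # noncentral chi-square CDF as a Poisson mixture of central chi-squares
    w, total = math.exp(-nc / 2), 0.0
    for j in range(terms):
        total += w * chi2_cdf_even(x, df + 2 * j)
        w *= (nc / 2) / (j + 1)
    return total

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rstar_B(psi, psi_hat, n, k):
    r = math.sqrt(n) * (psi - psi_hat)          # signed likelihood root
    return r + (1.0 / r) * math.log((psi_hat / psi) ** ((k - 1) / 2))

k, n, psi_hat = 2, 1, 4.0                       # psi_hat = ||ybar||, made up
max_err = 0.0
for psi in (3.0, 3.5, 4.5, 5.0):
    exact = ncx2_cdf(n * psi ** 2, k, n * psi_hat ** 2)
    approx = Phi(rstar_B(psi, psi_hat, n, k))
    max_err = max(max_err, abs(exact - approx))
print(max_err)   # should be very small, consistent with the figures that follow
```

Even with n = 1 and k = 2 the agreement is striking, which is the point of the next two slides.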
... normal circle, k=2

[Figures: left panel, the marginal posterior of psi (2 to 8) for the normal circle with n = 1; right panel, the difference between exact and approximate posteriors, within about 3 × 10⁻⁴ of zero.]
... normal circle, increasing k

[Figures: left panel, marginal posteriors of psi (2 to 8) as k changes; right panel, differences between exact and approximate posteriors for k = 5, 10, 20, up to about 0.05.]
Are Bayesian and nonBayesian solutions really that different?
• posterior probability limit: $\Pr_{\theta \mid Y}\{\theta \le \theta^{(1-\alpha)} \mid y\} = 1 - \alpha$
• upper confidence bound: $\Pr_{Y \mid \theta}\{\theta \le \theta^{(1-\alpha)}(\pi, Y)\} = 1 - \alpha$
• the first depends on the prior; the second is evaluated under the model
• is there a prior for which both limits are equal? no², but there might be a prior for which they are approximately equal: $\Pr_{Y \mid \theta}\{\theta \le \theta^{(1-\alpha)}(\pi, Y)\} = 1 - \alpha + O(1/n)$
• such a prior is called a matching prior
² except in a location model
... matching priors
• if θ ∈ R, then π(θ) ∝ i^{1/2}(θ) (Welch & Peers, 1963)
• if θ = (ψ, λ), then π(θ) ∝ i_{ψψ}^{1/2}(θ) g(λ) (Peers, 1965; Tibshirani, 1989), where we reparametrize if necessary so that i_{ψλ}(θ) = 0
• matching priors are one version of objective priors
• they depend on the model
• reference priors of Berger and Bernardo are a different version of objective priors, with a slightly different goal: maximize the ‘distance’ from prior to posterior
• good news: Bayesian inference with matching priors is calibrated
• bad news: there is a whole family of such priors, even in relatively simple situations
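The calibration claim can be illustrated by simulation. For an exponential-mean model the Welch & Peers prior is π(θ) ∝ i^{1/2}(θ) ∝ 1/θ, and the resulting posterior upper limit is 2S/χ²₂ₙ,₀.₀₅ (S = Σyᵢ), whose frequentist coverage is exact in this scale model. The example and all numbers are mine, not from the talk:

```python
import math, random

# Coverage of the 95% posterior upper limit for theta under pi(theta) ~ 1/theta,
# where y_i ~ Exponential(mean theta): posterior of 2S/theta is chi^2_{2n}.

def chi2_cdf_even(x, df):
    # chi-square CDF for even df (closed form)
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def chi2_quantile_even(p, df, lo=0.0, hi=500.0):
    # bisection for the chi-square quantile
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

random.seed(2)
n, theta, reps = 5, 3.0, 20000
c = chi2_quantile_even(0.05, 2 * n)             # chi^2_{2n} 5% point
cover = 0
for _ in range(reps):
    S = sum(random.expovariate(1 / theta) for _ in range(n))
    if theta <= 2 * S / c:                      # posterior 95% upper limit covers?
        cover += 1
print(cover / reps)   # close to 0.95: posterior limits are confidence limits
```

This is the one-parameter case; the "bad news" bullet is that with nuisance parameters the factor g(λ) is not pinned down.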
... good news: matching priors are unique
when used in the r* approximation (DiCiccio & Martin 1993; Staicu 2007)

$$\int_{-\infty}^{\psi} \pi_m(\psi' \mid y)\,d\psi' = \Phi(r^*_B)\{1 + O(n^{-3/2})\}$$

$$r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}, \qquad r = \mathrm{sign}(\psi - \hat\psi)\,[2\{\ell_p(\hat\psi) - \ell_p(\psi)\}]^{1/2}$$

$$q_B = -\ell_p'(\psi)\, j_p^{-1/2}(\hat\psi)\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{1/2}}{|j_{\lambda\lambda}(\hat\theta)|^{1/2}}\;\frac{\pi(\hat\psi, \hat\lambda)}{\pi(\psi, \hat\lambda_\psi)} = \cdots\;\frac{i_{\psi\psi}^{1/2}(\hat\psi, \hat\lambda)\,g(\hat\lambda)}{i_{\psi\psi}^{1/2}(\psi, \hat\lambda_\psi)\,g(\hat\lambda_\psi)}$$

when ψ is orthogonal to λ, $\hat\lambda_\psi = \hat\lambda\{1 + O_p(n^{-1})\}$, so $g(\hat\lambda_\psi)/g(\hat\lambda) = 1 + O_p(n^{-1})$ and the choice of g does not affect the approximation
Logistic regression

The first ten out of 79 sets of observations on the physical characteristics of urine. Presence/absence of calcium oxalate crystals is indicated by 1/0. Two cases with missing values.³

Case  Crystals  Specific gravity    pH   Osmolarity  Conductivity  Urea  Calcium
  1      0          1.021          4.91      725          —         443    2.45
  2      0          1.017          5.74      577         20.0       296    4.49
  3      0          1.008          7.20      321         14.9       101    2.36
  4      0          1.011          5.51      408         12.6       224    2.15
  5      0          1.005          6.52      187          7.5        91    1.16
  6      0          1.020          5.27      668         25.3       252    3.34
  7      0          1.012          5.62      461         17.4       195    1.40
  8      0          1.029          5.67     1107         35.9       550    8.48
  9      0          1.015          5.41      543         21.9       170    1.16
 10      0          1.021          6.13      779         25.7       382    2.21
  .      .            .             .         .            .          .      .

³ Andrews & Herzberg, 1985
... logistic regression

Model: independent binary responses Y1, . . . , Yn with

$$\Pr(Y_i = 1) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}$$

linear exponential family:

$$f(y; \beta) = \exp\{\beta_0\,\Sigma y_i + \beta_1\,\Sigma x_{1i}y_i + \cdots + \beta_6\,\Sigma x_{6i}y_i - c(\beta) - d(y)\}$$

parameter of interest ψ = β6, say
orthogonal nuisance parameter λj = E(y xj)
λ̂ψ ≡ λ̂ (special to exponential families)
π(ψ, λ) ∝ i_{ψψ}^{1/2}(ψ, λ), unique
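A first-order fit of this kind can be sketched from scratch. This is a hypothetical illustration with synthetic data (not the urine data above), fitting a one-covariate logistic regression by Newton's method and comparing the first-order Wald p-value with the likelihood-root p-value for one coefficient:

```python
import math, random

# Synthetic logistic regression: y_i ~ Bernoulli(expit(b0 + b1*x_i)).
# Compare Wald and likelihood-root p-values for testing beta1 = 0.

random.seed(1)
n = 77
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(0.3 + 0.8 * xi))) else 0
     for xi in x]

def loglik(b0, b1):
    return sum(yi * (b0 + b1 * xi) - math.log(1 + math.exp(b0 + b1 * xi))
               for xi, yi in zip(x, y))

def fit(b1_fixed=None):
    # Newton-Raphson; with b1 fixed, only the intercept is profiled out
    b0 = 0.0
    b1 = 0.0 if b1_fixed is None else b1_fixed
    for _ in range(50):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        if b1_fixed is None:
            det = h00 * h11 - h01 * h01
            b0 += (h11 * g0 - h01 * g1) / det
            b1 += (h00 * g1 - h01 * g0) / det
        else:
            b0 += g0 / h00
    return b0, b1, (h00, h01, h11)

b0h, b1h, (h00, h01, h11) = fit()
se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))     # sqrt([j^{-1}]_{11})
b0c, _, _ = fit(b1_fixed=0.0)
lr = max(0.0, 2 * (loglik(b0h, b1h) - loglik(b0c, 0.0)))
r = math.copysign(math.sqrt(lr), b1h)                # signed likelihood root
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
p_wald = 2 * (1 - Phi(abs(b1h / se_b1)))
p_lr = 2 * (1 - Phi(abs(r)))
print(round(p_wald, 4), round(p_lr, 4))
```

The slide's Φ(r*) and Φ(r*_B) columns refine the likelihood-root p-value computed here; the hoa package's cond library does that refinement for real exponential-family fits.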
[Figure: BN and Laplace approximations plotted against beta_p, for values from −0.08 to 0.02.]
... logistic regression

Confidence/posterior interval for the parameter of interest:

method       lower bound   upper bound   p-value for 0
Φ(r*)          −0.058        0.00029        0.052
Φ(r*_B)        −0.058        0.00028        0.052
1st order      −0.063       −0.00062        0.047

The first line uses a (very accurate) approximation to the conditional distribution of Σ yi x6i, given the sufficient statistics for the nuisance parameters (exponential family). Fitted using the cond library of the package hoa.
Note: n = 77, d = 7, but the information in 77 binary observations seems to be comparable to the information in about 10 continuous observations.
Matching priors need to be targeted on the parameter of interest
• as do all objective priors
• ‘flat’ priors for vector parameters don’t lead to calibrated inferences
• Example: yi ∼ N(θi, 1), i = 1, . . . , k
• exact and approximate posterior for Σθi² nearly identical
Matching priors need to be targeted on the parameter of interest
• as do all objective priors
• ‘flat’ priors for vector parameters don’t lead to calibrated inferences
• Example: yi ∼ N(θi, 1), i = 1, . . . , k
• exact and approximate posterior for Σθi² nearly identical
• posterior used prior π(θ) ∝ 1, marginalized to ψ = Σθi²
• posterior limits are not probability limits
• there is an exact nonBayesian solution, based on the marginal density of Σyi² (noncentral χ²) (Dawid, Stone and Zidek, 1974)
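The contrast can be seen in a small simulation (my own construction, not from the talk): if inference were calibrated, the flat-prior posterior CDF evaluated at the true ψ = Σθi² would be Uniform(0, 1) over repeated samples, but it concentrates near 0; the probability transform based on the exact marginal noncentral χ² distribution of Σyi² is calibrated:

```python
import math, random

# k-dimensional normal means, y_i ~ N(theta_i, 1).  Flat prior => posterior
# of psi = sum(theta_i^2) given y is noncentral chi^2(k, sum(y_i^2)).
# Marginally, sum(y_i^2) ~ noncentral chi^2(k, psi_true).

def chi2_cdf_even(x, df):
    # chi-square CDF, even df
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def ncx2_cdf(x, df, nc, terms=120):
    # noncentral chi-square CDF (Poisson mixture of central chi-squares)
    w, total = math.exp(-nc / 2), 0.0
    for j in range(terms):
        total += w * chi2_cdf_even(x, df + 2 * j)
        w *= (nc / 2) / (j + 1)
    return total

random.seed(3)
k, theta_i, reps = 10, 0.5, 500
psi_true = k * theta_i ** 2                     # = 2.5
bayes, freq = [], []
for _ in range(reps):
    y2 = sum((theta_i + random.gauss(0, 1)) ** 2 for _ in range(k))
    bayes.append(ncx2_cdf(psi_true, k, y2))     # posterior CDF at the truth
    freq.append(ncx2_cdf(y2, k, psi_true))      # exact probability transform
print(sum(bayes) / reps, sum(freq) / reps)      # calibrated would give ~0.5 both
```

The first average is nearly 0 (posterior limits are not probability limits), while the second hovers near 0.5, which is the Dawid, Stone and Zidek point in miniature.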
[Figure: dependence on dimension — the difference plotted against psi − y1 (−3 to 3), for dim = 2, 5, 10; differences reach about 0.6.]
[Figure: dependence on n — the difference plotted against y1 − psi (−3 to 3), for n = 1, 2, 3, 4, 5, 10, 20, 30.]
nonBayesian asymptotics
• find a function of the data with known distribution depending on ψ
• our candidate: r*_F = r*_F(ψ; y) with a N(0, 1) distribution

$$r^*_F = r + \frac{1}{r}\log\frac{q_F}{r}, \qquad q_F = \{\chi(\hat\theta) - \chi(\hat\theta_\psi)\}/\hat\sigma_\chi$$

• a maximum likelihood type departure
• χ(θ) = scalar function of ϕ(θ)
• θ → ϕ(θ) = ϕ(θ; y⁰): a reparametrization of the model
... nonBayesian asymptotics
• where does this come from and why does it work?
• using ϕ we can construct an approximate exponential family model
• this is the route to accurate density approximation using saddlepoint-type arguments
• to get ϕ(θ), differentiate the log-likelihood on the sample space:

$$\varphi_j(\theta) = \frac{d}{dt}\,\ell(\theta;\, y^0 + t v_j)\Big|_{t=0}$$

• jiggle the observed data in directions vj, and record the change in the log-likelihood
• gives a way to implement approximate conditioning
• leads to quite accurate frequentist solutions
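The sample-space derivative can be sketched by finite differences. A minimal example for an exponential-mean model, where the derivative is also available analytically as a check (the data and direction v are mine, chosen only for illustration):

```python
import math

# phi_j(theta) = d/dt l(theta; y0 + t*v_j) |_{t=0}, by central differences,
# for the model l(mu; y) = sum_i { -log mu - y_i / mu }.

y0 = [1.2, 0.7, 2.3, 0.4, 1.9]          # "observed" data, made up

def loglik(mu, y):
    return sum(-math.log(mu) - yi / mu for yi in y)

def phi(mu, v, eps=1e-6):
    # directional derivative of the log-likelihood in the data direction v
    yp = [yi + eps * vi for yi, vi in zip(y0, v)]
    ym = [yi - eps * vi for yi, vi in zip(y0, v)]
    return (loglik(mu, yp) - loglik(mu, ym)) / (2 * eps)

v = [1.0] * len(y0)                     # jiggle every observation equally
# analytically d l / d y_i = -1/mu, so phi should equal -n/mu here
print(phi(2.0, v), -len(y0) / 2.0)
```

In real uses the directions v_j come from an ancillary construction; here the point is only the mechanics of "jiggling the data and recording the change in the log-likelihood".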
... does this help with finding priors?
1. strong matching priors (FR, 2002)
• posterior marginal distribution s(ψ) = Φ(r*_B)
• nonBayesian p-value function p(ψ) = Φ(r*_F)

$$r^*_F = r + \frac{1}{r}\log\frac{q_F}{r}, \qquad r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}$$

• set qF = qB, and solve for the prior π (if possible)
• guarantees matching to third order
• leads to a data-dependent prior
• automatically targeted on the parameter of interest
• only a function of the parameter of interest
qB = −`′p(ψ)j−1/2p (ψ̂) blue prior
qF = {χ(θ̂)− χ(θ̂ψ)}/σ̂χ
... does this help?
2. approximate location models
• every model f(y; θ) can be approximated (locally) by a location model
• location models have an automatic prior for vector θ: flat
• change at θ that generates an increment dθ̂ at the observed data: W(θ), say
• flat prior is $\pi(\theta) \propto |W(\theta)| = |AV(\theta)| = |d\hat\theta/d\theta|$
• V is an n × k matrix with ith row

$$V_i(\theta) = -\frac{F_{;\theta'}(y^0_i; \theta)}{F_y(y^0_i; \theta)}$$

• gives a natural default prior for vector parameters
• still need to deal with marginalization to target parameters of interest
Note: writing the approximate location model as f(y − β(θ)), say, the density factors as f₁(a) f₂(θ̂ − β(θ)); setting f₂′(·) = 0 gives dθ̂ = β_θ(θ) dθ, so the flat prior is |β_θ(θ)|.
Summarizing
• asymptotic theory is useful for obtaining (very) good approximations to distributions
• can be used in Bayesian or nonBayesian contexts
• can be used to derive posterior intervals that are calibrated
• can be used to simplify the choice of matching priors
• finding priors is not straightforward
• likelihood plus the first derivative change in the sample space is ‘intrinsic’ to inference, at least asymptotically to O(n⁻³ᐟ²)
• flat priors for a location parameter will not lead to calibrated posteriors for curved parameters
We’re all Bayesians really
Economist Jan 14, 2006
... Bayesian
Griffiths and Tenenbaum, Psychological Science
• asked subjects to make predictions based on limited evidence
• concluded that the predictions were consistent with Bayesian methods of inference
• “the results of our experiment reveal a far closer correspondence between optimal statistical inference and everyday cognition than that suggested by previous research”
... Bayesian
• Goldstein (2006): “we argue first that the subjectivist Bayes approach is the only feasible method for tackling many scientific problems”
• Neal (1998): “Using ‘technological’ or ‘reference’ priors chosen solely for convenience, or out of a mis-guided desire for pseudo-objectivity ... [or] using priors that vary with the amount of data that you have collected ... have no real Bayesian justification, and since they are usually offered with no other justification either, I consider them to be highly dubious.”
• Wasserman (2006): “I think it would be hard to defend a sequence of subjective analyses that yield poor frequency coverage”
References
1. ba.stat.cmu, Volume 2, #3 (Berger 2006; Goldstein 2006; Wasserman 2006)
2. DiCiccio and Martin 1993
3. Fraser and Reid 2002
4. Little 2006
5. Tibshirani 1989
6. Welch and Peers 1963