Bayesian and frequentist inference
Overview Methods of inference Asymptotic theory Examples Constructing priors Conclusions
Bayesian and frequentist inference
Nancy Reid
March 26, 2007
Don Fraser, Ana-Maria Staicu
Overview
Methods of inference
Asymptotic theory: approximate posteriors; matching priors
Examples: logistic regression; normal circle
Constructing priors
Conclusions
A Basic Structure
• data y = (y1, . . . , yn) ∈ R^n
• model f(y; θ), a probability density
• parameters θ = (θ1, . . . , θk) ∈ R^k
• inference about θ (or components of θ):
  • an interval of values of θ consistent with the data
  • a probability distribution for θ, conditional on the data
• “The average 12 month weight loss on the Atkins diet was 10.3 lbs, compared to 3.5, 5.7 and 4.8 lbs on the Zone, LEARN and Ornish diets, respectively.”¹
• “mean 12 month weight loss on the Atkins diet was 4.7 kg (95% confidence interval 3.1 to 6.3 kg)”
• “mean 12 month weight loss was significantly different between the Atkins and the Zone diets (p < 0.05)”
• The probability that the average weight loss on the Atkins diet is greater than that on the Zone diet is 0.983
¹ JAMA, Mar 7
Bayesian inference
• π(θ | y) ∝ f(y; θ) π(θ)

$$\pi(\theta \mid y) = \frac{f(y; \theta)\,\pi(\theta)}{\int f(y; \theta)\,\pi(\theta)\,d\theta}$$

• parameter of interest ψ(θ)

$$\pi_m(\psi \mid y) = \int_{\{\psi(\theta)=\psi\}} \pi(\theta \mid y)\,d\theta$$

• integrals often computed using simulation methods
• the marginal posterior density summarizes all information about ψ available from the data
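The recipe above — posterior ∝ likelihood × prior, normalized by the integral — can be sketched numerically. This is a hedged illustration with a made-up binomial example and a flat prior (not an example from the talk), using a simple grid in place of simulation:

```python
# Grid-based posterior for a binomial model with a flat prior.
# Data (n = 20 trials, s = 14 successes) are invented for illustration.

n, s = 20, 14
grid = [i / 1000 for i in range(1, 1000)]            # theta in (0, 1)

def likelihood(theta):
    return theta ** s * (1 - theta) ** (n - s)

prior = lambda theta: 1.0                            # flat prior pi(theta)

unnorm = [likelihood(t) * prior(t) for t in grid]    # f(y; theta) * pi(theta)
norm = sum(unnorm) * 0.001                           # approximates the integral
posterior = [u / norm for u in unnorm]               # pi(theta | y)

# posterior mean; the exact Beta(15, 7) answer is 15/22 ~ 0.682
mean = sum(t * p for t, p in zip(grid, posterior)) * 0.001
print(round(mean, 3), round(15 / 22, 3))
```

With one parameter a grid suffices; the slide's point is that for higher-dimensional θ the same normalizing integral is what forces simulation or approximation.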
nonBayesian inference
• find a function of the data, t(y) say, with
  1. distribution depending only on ψ
  2. distribution that is known
• usually based on the log-likelihood function:

$$\ell(\theta; y) = \log f(y; \theta), \qquad \hat\theta \equiv \arg\sup_\theta \ell(\theta; y), \qquad \hat\theta_\psi \equiv \arg\sup_{\psi(\theta)=\psi} \ell(\theta; y)$$

• likelihood ratio statistic: $2\{\ell(\hat\theta; y) - \ell(\hat\theta_\psi; y)\} \xrightarrow{d} \chi^2_{d_1}$
• Wald statistic: $\hat\Sigma^{-1/2}(\hat\psi - \psi) \xrightarrow{d} N(0, I)$
• if ψ is scalar: $r^*(\psi) \overset{.}{\sim} N(0, 1)$
• the statistic is derived from the model (via the likelihood function)
• its distribution comes from asymptotic theory
• the likelihood is not (usually) integrated but is (usually) maximized
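The likelihood ratio statistic can be computed directly for a scalar-parameter model. A minimal sketch with made-up exponential data (my own example, not from the slides), using the fact that the χ²₁ tail probability is 2{1 − Φ(√W)}:

```python
import math

# y_i ~ Exponential(mean theta); test theta = theta0 with the LR statistic.
# The data below are invented for illustration.

y = [0.8, 1.9, 0.4, 2.7, 1.1, 0.6, 3.2, 1.4, 0.9, 2.0]
theta0 = 1.0

def loglik(theta):
    # l(theta; y) = sum_i { -log theta - y_i / theta }
    return sum(-math.log(theta) - yi / theta for yi in y)

theta_hat = sum(y) / len(y)                    # MLE of the mean
W = 2 * (loglik(theta_hat) - loglik(theta0))   # LR statistic, ~ chi^2_1
# chi^2_1 tail via the normal: P(W > w) = 2 * (1 - Phi(sqrt(w)))
p = 2 * (1 - 0.5 * (1 + math.erf(math.sqrt(W) / math.sqrt(2))))
print(round(W, 3), round(p, 3))
```

Note that the likelihood is maximized (at θ̂ and at θ₀), not integrated, and the reference distribution comes from asymptotic theory, exactly as the bullets above describe.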
... Bayesian inference
• well defined basis for inference
• internally consistent
• leads to optimal results from one point of view
• requires a probability distribution for θ (a prior distribution)
• not necessarily calibrated
• not always clear how much answers depend on the choice of prior
• requires (high-)dimensional integration, or simulation, or approximation
• useful in applications
... nonBayesian inference
• properties well understood
• calibrated: properties correspond to long-run frequency
• choice of function t(y) may be non-optimal
• approximation to the distribution of t(Y) may be poor
• approximation to the distribution of t(Y) may be excellent
• useful in applications
• Fisher (1920s, frequentist): “I know only one case in mathematics of a doctrine which has been accepted and developed by the most eminent men of their time, and is now perhaps accepted by men now living, which at the same time has appeared to a succession of sound writers to be fundamentally false and devoid of foundation”
• Lindley (1950s, Bayesian): “Every statistician would be a Bayesian if he took the trouble to read the literature thoroughly and was honest enough to admit that he might have been wrong”
Approximate posteriors

$$\int_{-\infty}^{\psi} \pi_m(\psi' \mid y)\,d\psi' = \Phi(r^*_B)\{1 + O(n^{-3/2})\}$$

$$r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}, \qquad r = \mathrm{sign}(\psi - \hat\psi)\,[2\{\ell_p(\hat\psi) - \ell_p(\psi)\}]^{1/2}$$

$$q_B = -\ell_p'(\psi)\, j_p^{-1/2}(\hat\psi)\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{1/2}}{|j_{\lambda\lambda}(\hat\theta)|^{1/2}}\;\frac{\pi(\hat\theta)}{\pi(\hat\theta_\psi)}$$

where $\ell_p(\psi) = \ell(\psi, \hat\lambda_\psi) = \ell(\hat\theta_\psi)$ is the profile log-likelihood.
marginal posterior density:

$$\pi_m(\psi \mid y) = \frac{\int_{\{\psi(\theta)=\psi\}} \exp\{\ell(\theta; y)\}\,\pi(\theta)\,d\theta}{\int \exp\{\ell(\theta; y)\}\,\pi(\theta)\,d\theta}$$

$$= c\,\frac{\exp\{\ell(\hat\theta_\psi)\}\,|j_{\lambda\lambda}(\hat\theta_\psi)|^{-1/2}\,\pi(\hat\theta_\psi)}{\exp\{\ell(\hat\theta)\}\,|j(\hat\theta)|^{-1/2}\,\pi(\hat\theta)}\,\{1 + O(n^{-3/2})\}$$

$$\doteq c\,\exp\{\ell(\hat\theta_\psi) - \ell(\hat\theta)\}\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{-1/2}}{j_p^{-1/2}(\hat\psi)\,|j_{\lambda\lambda}(\hat\theta)|^{-1/2}}\;\frac{\pi(\hat\theta_\psi)}{\pi(\hat\theta)}$$

changing variables from ψ to r,

$$\pi_m(r \mid y) = c\,\exp\!\left(-\tfrac12 r^2\right)\frac{r}{q_B} = c\,\exp\!\left(-\tfrac12 r^2 - \log\frac{q_B}{r}\right)$$
... approximation to the posterior is really good

• Example: yi ∼ N(θ, I), yi = (yi1, . . . , yik), θ = (θ1, . . . , θk)
• ψ = √(θ1² + · · · + θk²)
• flat prior π(θ) dθ ∝ dθ
• posterior is proper: nψ² | y ∼ χ′²k(n‖ȳ‖²), noncentral, with ψ̂ = ‖ȳ‖

$$r^*_B = \sqrt{n}(\psi - \hat\psi) + \frac{1}{\sqrt{n}(\psi - \hat\psi)}\log\left\{\left(\frac{\hat\psi}{\psi}\right)^{(k-1)/2}\right\}$$
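Under my reading of the slide's formulas (the sign conventions here are my assumption), the approximation can be checked numerically against the exact noncentral chi-square posterior, using only the standard library:

```python
import math

# Normal circle check: theta | y ~ N(ybar, I/n) under a flat prior, so
# n*psi^2 | y is noncentral chi^2 with k df and noncentrality n*||ybar||^2.
# Compare the exact posterior CDF of psi with Phi(r*_B).

def chi2_cdf_even(x, df):
    # chi-square CDF for even df: 1 - e^{-x/2} * sum_{i < df/2} (x/2)^i / i!
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def ncx2_cdf(x, df, nc, terms=120):
    # noncentral chi-square CDF as a Poisson mixture of central chi-squares
    w, total = math.exp(-nc / 2), 0.0
    for j in range(terms):
        total += w * chi2_cdf_even(x, df + 2 * j)
        w *= (nc / 2) / (j + 1)
    return total

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rstar_B(psi, psi_hat, n, k):
    r = math.sqrt(n) * (psi - psi_hat)          # signed likelihood root
    return r + (1.0 / r) * math.log((psi_hat / psi) ** ((k - 1) / 2))

k, n, psi_hat = 2, 1, 4.0                       # psi_hat = ||ybar||, made up
max_err = 0.0
for psi in (3.0, 3.5, 4.5, 5.0):
    exact = ncx2_cdf(n * psi ** 2, k, n * psi_hat ** 2)
    approx = Phi(rstar_B(psi, psi_hat, n, k))
    max_err = max(max_err, abs(exact - approx))
print(max_err)   # should be very small, consistent with the figures that follow
```

Even with n = 1 and k = 2 the agreement is striking, which is the point of the next two slides.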
... normal circle, k=2

[Figures: left panel, the marginal posterior of psi (2 to 8) for the normal circle with n = 1; right panel, the difference between exact and approximate posteriors, within about 3 × 10⁻⁴ of zero.]
... normal circle, increasing k

[Figures: left panel, marginal posteriors of psi (2 to 8) as k changes; right panel, differences between exact and approximate posteriors for k = 5, 10, 20, up to about 0.05.]
Are Bayesian and nonBayesian solutions really that different?
• posterior probability limit: $\Pr_{\theta \mid Y}\{\theta \le \theta^{(1-\alpha)} \mid y\} = 1 - \alpha$
• upper confidence bound: $\Pr_{Y \mid \theta}\{\theta \le \theta^{(1-\alpha)}(\pi, Y)\} = 1 - \alpha$
• the first depends on the prior; the second is evaluated under the model
• is there a prior for which both limits are equal? no², but there might be a prior for which they are approximately equal: $\Pr_{Y \mid \theta}\{\theta \le \theta^{(1-\alpha)}(\pi, Y)\} = 1 - \alpha + O(1/n)$
• such a prior is called a matching prior
² except in a location model
... matching priors
• if θ ∈ R, then π(θ) ∝ i^{1/2}(θ) (Welch & Peers, 1963)
• if θ = (ψ, λ), then π(θ) ∝ i_{ψψ}^{1/2}(θ) g(λ) (Peers, 1965; Tibshirani, 1989), where we reparametrize if necessary so that i_{ψλ}(θ) = 0
• matching priors are one version of objective priors
• they depend on the model
• reference priors of Berger and Bernardo are a different version of objective priors, with a slightly different goal: maximize the ‘distance’ from prior to posterior
• good news: Bayesian inference with matching priors is calibrated
• bad news: there is a whole family of such priors, even in relatively simple situations
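The calibration claim can be illustrated by simulation. For an exponential-mean model the Welch & Peers prior is π(θ) ∝ i^{1/2}(θ) ∝ 1/θ, and the resulting posterior upper limit is 2S/χ²₂ₙ,₀.₀₅ (S = Σyᵢ), whose frequentist coverage is exact in this scale model. The example and all numbers are mine, not from the talk:

```python
import math, random

# Coverage of the 95% posterior upper limit for theta under pi(theta) ~ 1/theta,
# where y_i ~ Exponential(mean theta): posterior of 2S/theta is chi^2_{2n}.

def chi2_cdf_even(x, df):
    # chi-square CDF for even df (closed form)
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def chi2_quantile_even(p, df, lo=0.0, hi=500.0):
    # bisection for the chi-square quantile
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

random.seed(2)
n, theta, reps = 5, 3.0, 20000
c = chi2_quantile_even(0.05, 2 * n)             # chi^2_{2n} 5% point
cover = 0
for _ in range(reps):
    S = sum(random.expovariate(1 / theta) for _ in range(n))
    if theta <= 2 * S / c:                      # posterior 95% upper limit covers?
        cover += 1
print(cover / reps)   # close to 0.95: posterior limits are confidence limits
```

This is the one-parameter case; the "bad news" bullet is that with nuisance parameters the factor g(λ) is not pinned down.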
... good news: matching priors are unique
when used in the r* approximation (DiCiccio & Martin 1993; Staicu 2007)

$$\int_{-\infty}^{\psi} \pi_m(\psi' \mid y)\,d\psi' = \Phi(r^*_B)\{1 + O(n^{-3/2})\}$$

$$r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}, \qquad r = \mathrm{sign}(\psi - \hat\psi)\,[2\{\ell_p(\hat\psi) - \ell_p(\psi)\}]^{1/2}$$

$$q_B = -\ell_p'(\psi)\, j_p^{-1/2}(\hat\psi)\;\frac{|j_{\lambda\lambda}(\hat\theta_\psi)|^{1/2}}{|j_{\lambda\lambda}(\hat\theta)|^{1/2}}\;\frac{\pi(\hat\psi, \hat\lambda)}{\pi(\psi, \hat\lambda_\psi)} = \cdots\;\frac{i_{\psi\psi}^{1/2}(\hat\psi, \hat\lambda)\,g(\hat\lambda)}{i_{\psi\psi}^{1/2}(\psi, \hat\lambda_\psi)\,g(\hat\lambda_\psi)}$$

when ψ is orthogonal to λ, $\hat\lambda_\psi = \hat\lambda\{1 + O_p(n^{-1})\}$, so $g(\hat\lambda_\psi)/g(\hat\lambda) = 1 + O_p(n^{-1})$ and the choice of g does not affect the approximation
Logistic regression

The first ten out of 79 sets of observations on the physical characteristics of urine. Presence/absence of calcium oxalate crystals is indicated by 1/0. Two cases with missing values.³

Case  Crystals  Specific gravity    pH   Osmolarity  Conductivity  Urea  Calcium
  1      0          1.021          4.91      725          —         443    2.45
  2      0          1.017          5.74      577         20.0       296    4.49
  3      0          1.008          7.20      321         14.9       101    2.36
  4      0          1.011          5.51      408         12.6       224    2.15
  5      0          1.005          6.52      187          7.5        91    1.16
  6      0          1.020          5.27      668         25.3       252    3.34
  7      0          1.012          5.62      461         17.4       195    1.40
  8      0          1.029          5.67     1107         35.9       550    8.48
  9      0          1.015          5.41      543         21.9       170    1.16
 10      0          1.021          6.13      779         25.7       382    2.21
  .      .            .             .         .            .          .      .

³ Andrews & Herzberg, 1985
... logistic regression

Model: independent binary responses Y1, . . . , Yn with

$$\Pr(Y_i = 1) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}$$

linear exponential family:

$$f(y; \beta) = \exp\{\beta_0\,\Sigma y_i + \beta_1\,\Sigma x_{1i}y_i + \cdots + \beta_6\,\Sigma x_{6i}y_i - c(\beta) - d(y)\}$$

parameter of interest ψ = β6, say
orthogonal nuisance parameter λj = E(y xj)
λ̂ψ ≡ λ̂ (special to exponential families)
π(ψ, λ) ∝ i_{ψψ}^{1/2}(ψ, λ), unique
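A first-order fit of this kind can be sketched from scratch. This is a hypothetical illustration with synthetic data (not the urine data above), fitting a one-covariate logistic regression by Newton's method and comparing the first-order Wald p-value with the likelihood-root p-value for one coefficient:

```python
import math, random

# Synthetic logistic regression: y_i ~ Bernoulli(expit(b0 + b1*x_i)).
# Compare Wald and likelihood-root p-values for testing beta1 = 0.

random.seed(1)
n = 77
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(0.3 + 0.8 * xi))) else 0
     for xi in x]

def loglik(b0, b1):
    return sum(yi * (b0 + b1 * xi) - math.log(1 + math.exp(b0 + b1 * xi))
               for xi, yi in zip(x, y))

def fit(b1_fixed=None):
    # Newton-Raphson; with b1 fixed, only the intercept is profiled out
    b0 = 0.0
    b1 = 0.0 if b1_fixed is None else b1_fixed
    for _ in range(50):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        if b1_fixed is None:
            det = h00 * h11 - h01 * h01
            b0 += (h11 * g0 - h01 * g1) / det
            b1 += (h00 * g1 - h01 * g0) / det
        else:
            b0 += g0 / h00
    return b0, b1, (h00, h01, h11)

b0h, b1h, (h00, h01, h11) = fit()
se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))     # sqrt([j^{-1}]_{11})
b0c, _, _ = fit(b1_fixed=0.0)
lr = max(0.0, 2 * (loglik(b0h, b1h) - loglik(b0c, 0.0)))
r = math.copysign(math.sqrt(lr), b1h)                # signed likelihood root
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
p_wald = 2 * (1 - Phi(abs(b1h / se_b1)))
p_lr = 2 * (1 - Phi(abs(r)))
print(round(p_wald, 4), round(p_lr, 4))
```

The slide's Φ(r*) and Φ(r*_B) columns refine the likelihood-root p-value computed here; the hoa package's cond library does that refinement for real exponential-family fits.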
[Figure: BN and Laplace approximations plotted against beta_p, for values from −0.08 to 0.02.]
... logistic regression

Confidence/posterior interval for the parameter of interest:

method       lower bound   upper bound   p-value for 0
Φ(r*)          −0.058        0.00029        0.052
Φ(r*_B)        −0.058        0.00028        0.052
1st order      −0.063       −0.00062        0.047

The first line uses a (very accurate) approximation to the conditional distribution of Σ yi x6i, given the sufficient statistics for the nuisance parameters (exponential family). Fitted using the cond library of the package hoa.
Note: n = 77, d = 7, but the information in 77 binary observations seems to be comparable to the information in about 10 continuous observations.
Matching priors need to be targeted on the parameter of interest
• as do all objective priors
• ‘flat’ priors for vector parameters don’t lead to calibrated inferences
• Example: yi ∼ N(θi, 1), i = 1, . . . , k
• exact and approximate posterior for Σθi² nearly identical
Matching priors need to be targeted on the parameter of interest
• as do all objective priors
• ‘flat’ priors for vector parameters don’t lead to calibrated inferences
• Example: yi ∼ N(θi, 1), i = 1, . . . , k
• exact and approximate posterior for Σθi² nearly identical
• posterior used prior π(θ) ∝ 1, marginalized to ψ = Σθi²
• posterior limits are not probability limits
• there is an exact nonBayesian solution, based on the marginal density of Σyi² (noncentral χ²) (Dawid, Stone and Zidek, 1974)
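The contrast can be seen in a small simulation (my own construction, not from the talk): if inference were calibrated, the flat-prior posterior CDF evaluated at the true ψ = Σθi² would be Uniform(0, 1) over repeated samples, but it concentrates near 0; the probability transform based on the exact marginal noncentral χ² distribution of Σyi² is calibrated:

```python
import math, random

# k-dimensional normal means, y_i ~ N(theta_i, 1).  Flat prior => posterior
# of psi = sum(theta_i^2) given y is noncentral chi^2(k, sum(y_i^2)).
# Marginally, sum(y_i^2) ~ noncentral chi^2(k, psi_true).

def chi2_cdf_even(x, df):
    # chi-square CDF, even df
    term, s = 1.0, 0.0
    for i in range(df // 2):
        s += term
        term *= (x / 2) / (i + 1)
    return 1.0 - math.exp(-x / 2) * s

def ncx2_cdf(x, df, nc, terms=120):
    # noncentral chi-square CDF (Poisson mixture of central chi-squares)
    w, total = math.exp(-nc / 2), 0.0
    for j in range(terms):
        total += w * chi2_cdf_even(x, df + 2 * j)
        w *= (nc / 2) / (j + 1)
    return total

random.seed(3)
k, theta_i, reps = 10, 0.5, 500
psi_true = k * theta_i ** 2                     # = 2.5
bayes, freq = [], []
for _ in range(reps):
    y2 = sum((theta_i + random.gauss(0, 1)) ** 2 for _ in range(k))
    bayes.append(ncx2_cdf(psi_true, k, y2))     # posterior CDF at the truth
    freq.append(ncx2_cdf(y2, k, psi_true))      # exact probability transform
print(sum(bayes) / reps, sum(freq) / reps)      # calibrated would give ~0.5 both
```

The first average is nearly 0 (posterior limits are not probability limits), while the second hovers near 0.5, which is the Dawid, Stone and Zidek point in miniature.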
[Figure: dependence on dimension — the difference plotted against psi − y1 (−3 to 3), for dim = 2, 5, 10; differences reach about 0.6.]
[Figure: dependence on n — the difference plotted against y1 − psi (−3 to 3), for n = 1, 2, 3, 4, 5, 10, 20, 30.]
nonBayesian asymptotics
• find a function of the data with known distribution depending on ψ
• our candidate: r*_F = r*_F(ψ; y) with a N(0, 1) distribution

$$r^*_F = r + \frac{1}{r}\log\frac{q_F}{r}, \qquad q_F = \{\chi(\hat\theta) - \chi(\hat\theta_\psi)\}/\hat\sigma_\chi$$

• a maximum likelihood type departure
• χ(θ) = scalar function of ϕ(θ)
• θ → ϕ(θ) = ϕ(θ; y⁰): a reparametrization of the model
... nonBayesian asymptotics
• where does this come from and why does it work?
• using ϕ we can construct an approximate exponential family model
• this is the route to accurate density approximation using saddlepoint-type arguments
• to get ϕ(θ), differentiate the log-likelihood on the sample space:

$$\varphi_j(\theta) = \frac{d}{dt}\,\ell(\theta;\, y^0 + t v_j)\Big|_{t=0}$$

• jiggle the observed data in directions vj, and record the change in the log-likelihood
• gives a way to implement approximate conditioning
• leads to quite accurate frequentist solutions
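The sample-space derivative can be sketched by finite differences. A minimal example for an exponential-mean model, where the derivative is also available analytically as a check (the data and direction v are mine, chosen only for illustration):

```python
import math

# phi_j(theta) = d/dt l(theta; y0 + t*v_j) |_{t=0}, by central differences,
# for the model l(mu; y) = sum_i { -log mu - y_i / mu }.

y0 = [1.2, 0.7, 2.3, 0.4, 1.9]          # "observed" data, made up

def loglik(mu, y):
    return sum(-math.log(mu) - yi / mu for yi in y)

def phi(mu, v, eps=1e-6):
    # directional derivative of the log-likelihood in the data direction v
    yp = [yi + eps * vi for yi, vi in zip(y0, v)]
    ym = [yi - eps * vi for yi, vi in zip(y0, v)]
    return (loglik(mu, yp) - loglik(mu, ym)) / (2 * eps)

v = [1.0] * len(y0)                     # jiggle every observation equally
# analytically d l / d y_i = -1/mu, so phi should equal -n/mu here
print(phi(2.0, v), -len(y0) / 2.0)
```

In real uses the directions v_j come from an ancillary construction; here the point is only the mechanics of "jiggling the data and recording the change in the log-likelihood".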
... does this help with finding priors?
1. strong matching priors (FR, 2002)
• posterior marginal distribution s(ψ) = Φ(r*_B)
• nonBayesian p-value function p(ψ) = Φ(r*_F)

$$r^*_F = r + \frac{1}{r}\log\frac{q_F}{r}, \qquad r^*_B = r + \frac{1}{r}\log\frac{q_B}{r}$$

• set qF = qB, and solve for the prior π (if possible)
• guarantees matching to third order
• leads to a data-dependent prior
• automatically targeted on the parameter of interest
• only a function of the parameter of interest
qB = −`′p(ψ)j−1/2p (ψ̂) blue prior
qF = {χ(θ̂)− χ(θ̂ψ)}/σ̂χ
... does this help?
2. approximate location models
• every model f(y; θ) can be approximated (locally) by a location model
• location models have an automatic prior for vector θ: flat
• change at θ that generates an increment dθ̂ at the observed data: W(θ), say
• flat prior is $\pi(\theta) \propto |W(\theta)| = |AV(\theta)| = |d\hat\theta/d\theta|$
• V is an n × k matrix with ith row

$$V_i(\theta) = -\frac{F_{;\theta'}(y^0_i; \theta)}{F_y(y^0_i; \theta)}$$

• gives a natural default prior for vector parameters
• still need to deal with marginalization to target parameters of interest
Note: writing the approximate location model as f(y − β(θ)), say, the density factors as f₁(a) f₂(θ̂ − β(θ)); setting f₂′(·) = 0 gives dθ̂ = β_θ(θ) dθ, so the flat prior is |β_θ(θ)|.
Summarizing
• asymptotic theory is useful for obtaining (very) good approximations to distributions
• can be used in Bayesian or nonBayesian contexts
• can be used to derive posterior intervals that are calibrated
• can be used to simplify the choice of matching priors
• finding priors is not straightforward
• likelihood plus the first derivative change in the sample space is ‘intrinsic’ to inference, at least asymptotically to O(n⁻³ᐟ²)
• flat priors for a location parameter will not lead to calibrated posteriors for curved parameters
We’re all Bayesians really
Economist Jan 14, 2006
... Bayesian
Griffiths and Tenenbaum, Psychological Science
• asked subjects to make predictions based on limited evidence
• concluded that the predictions were consistent with Bayesian methods of inference
• “the results of our experiment reveal a far closer correspondence between optimal statistical inference and everyday cognition than that suggested by previous research”
... Bayesian
• Goldstein (2006): “we argue first that the subjectivist Bayes approach is the only feasible method for tackling many scientific problems”
• Neal (1998): “Using ‘technological’ or ‘reference’ priors chosen solely for convenience, or out of a mis-guided desire for pseudo-objectivity ... [or] using priors that vary with the amount of data that you have collected ... have no real Bayesian justification, and since they are usually offered with no other justification either, I consider them to be highly dubious.”
• Wasserman (2006): “I think it would be hard to defend a sequence of subjective analyses that yield poor frequency coverage”
References
1. ba.stat.cmu, Volume 2, #3 (Berger 2006; Goldstein 2006; Wasserman 2006)
2. DiCiccio and Martin 1993
3. Fraser and Reid 2002
4. Little 2006
5. Tibshirani 1989
6. Welch and Peers 1963