Basics of Decision Theory - University of North Carolina ...


Basics of Decision Theory

Andrew Nobel

September, 2020

Decision Theory

Framework for formalizing inference problems

I Inference targets: unknown quantities of interest

I Observations: access to information about unknown quantities

I Decision rules: take action based on observations

I Loss and risk: assessments of performance, comparison of decision rules

General Setting

Set X of possible observations, called the sample space

I Single measurement X = {0, 1}, N, or R

I Multiple measurements X = {0, 1}n, Nn, or Rn

Family P = {f(x|θ) : θ ∈ Θ} of pdf/pmf on X

I Indices θ ∈ Θ called parameters, “states of nature”

I Index set Θ called parameter space, assumed known

Inference

I Observe X ∈ X (random) with X ∼ f(x|θ) ∈ P, where θ is unknown

I Goal: Learn about unknown θ based on the observed value x of X

Principal Inference Problems

1. Point estimation

I Observe X ∼ f(x|θ). Obtain an estimate θ̂ of θ, where θ̂ ∈ Θ.

2. Hypothesis testing: Given subset Θ0 ⊆ Θ of interest

I Observe X ∼ f(x|θ). Decide if θ ∈ Θ0 or θ ∉ Θ0.

3. Confidence set estimation

I Observe X ∼ f(x|θ). Find a small set C ⊆ Θ that is likely to contain θ.

Inference as Deterministic/Stochastic Procedure

Two complementary perspectives

I Inference process as deterministic map from data x to estimates/decisions

I Stochastic behavior of these maps when applied to random observation X

Actions and Decisions

Inference amounts to making a decision about the parameter θ based on the observed value x ∈ X of X ∼ f(·|θ). Use the term “data” for realized values of random quantities

I Decision space A = set of all possible decisions

I Decision rule is a map d : X → A from data to decisions

I Family D of allowable decision rules

Decision space A and family D will depend on

I Nature of the inference problem at hand

I Criteria such as invariance, smoothness, unbiasedness

I Computational constraints

Loss and Risk

Definition: A loss function is a map ℓ : Θ × A → R. Interpret ℓ(θ, a) as the cost if we make decision a when the true state of nature is θ

Definition: The risk function of a decision rule d : X → A is defined by

R(θ, d) = Eθ ℓ(θ, d(X)), θ ∈ Θ

I Eθh(X) is the expectation of h(X) when X ∼ f(x|θ)

I R(θ, d) = expected loss of rule d when applied to observation X ∼ f(x|θ)

I Continuous case R(θ, d) = ∫ ℓ(θ, d(x)) f(x|θ) dx

I Discrete case R(θ, d) = ∑x ℓ(θ, d(x)) f(x|θ)
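The discrete-case formula above can be evaluated directly. A minimal sketch in Python for a single Bernoulli observation under squared loss (the names risk, squared, and d are hypothetical, for illustration only):

```python
# Hedged sketch: evaluating the discrete-case risk formula directly.
# Single observation X ~ Bern(theta), estimator d(x) = x, squared loss.

def risk(theta, d, loss):
    pmf = {0: 1 - theta, 1: theta}  # f(x|theta) on the sample space {0, 1}
    return sum(loss(theta, d(x)) * p for x, p in pmf.items())

squared = lambda theta, a: (theta - a) ** 2
d = lambda x: x  # estimate theta by the observation itself

# For this rule the risk is E(X - theta)^2 = theta * (1 - theta)
theta = 0.3
print(risk(theta, d, squared))  # theta * (1 - theta) = 0.21, up to float rounding
```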

Framework for Point Estimation

Goal: Estimate the parameter θ based on an observation X ∼ f(x|θ) ∈ P

I Typically A = Θ, i.e., the decision space is the parameter space

I Decision rule d : X → Θ is an estimator. Common to write d(x) as θ̂(x)

Common loss functions

I Squared loss ℓ(θ, θ′) = (θ − θ′)²

I Absolute loss ℓ(θ, θ′) = |θ − θ′|

I qth power loss ℓ(θ, θ′) = |θ − θ′|^q for some q > 0

I Zero-one loss ℓ(θ, θ′) = I(θ ≠ θ′)

I Kullback–Leibler loss ℓ(θ, θ′) = ∫ f(x|θ) log(f(x|θ)/f(x|θ′)) dx

Framework for Hypothesis Testing

Given: Partition Θ = Θ0 ∪Θ1 of parameter space

Goal: Decide if θ ∈ Θ0 or θ ∈ Θ1 based on X ∼ f(x|θ) ∈ P

I Decision space A = {0, 1} where a indicates decision θ ∈ Θa

I Decision rule d : X → {0, 1}

I Zero-one loss ℓ(θ, a) = I(θ ∉ Θa) (1 if decision is incorrect, 0 otherwise)

Under zero-one loss the risk function has the form

R(θ, d) = Eθ ℓ(θ, d(X)) =

Pθ(d(X) = 1) if θ ∈ Θ0 (Type I error)

Pθ(d(X) = 0) if θ ∈ Θ1 (Type II error)
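The Type I / Type II structure of the risk can be made concrete with a threshold test. A minimal sketch, assuming a single observation X ∼ N(θ, 1) and the test d(x) = 1 iff x > c (the helper names norm_cdf and risk are hypothetical):

```python
from math import erf, sqrt

# Hedged sketch: risk of a threshold test for H0: theta <= 0 vs H1: theta > 0
# based on a single X ~ N(theta, 1), with decision rule d(x) = 1 iff x > c.

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def risk(theta, c):
    if theta <= 0:                       # theta in Theta_0
        return 1 - norm_cdf(c - theta)   # Type I error: P_theta(d(X) = 1)
    return norm_cdf(c - theta)           # Type II error: P_theta(d(X) = 0)

c = 1.645  # approximate 0.05-level cutoff
print(risk(0.0, c))  # Type I error at the boundary, about 0.05
print(risk(2.0, c))  # Type II error at theta = 2, about 0.36
```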

Framework for Interval Estimation

Goal: Find small confidence set C ⊆ Θ likely to contain θ based on X ∼ f(x|θ) ∈ P

I Decision space A ⊆ 2^Θ, e.g., intervals, rectangles, balls

I Decision rule C : X → A

I Weighted 0-1 loss ℓ(θ, C) = I(θ ∉ C) + λ Vol(C), for some λ > 0

Under the weighted zero-one loss the risk function has the form

R(θ, C) = Eθ ℓ(θ, C(X)) = Pθ(θ ∉ C(X)) + λ Eθ[Vol(C(X))]

Note that

I Minimizing risk entails a trade-off between coverage probability and the size of the confidence set

I In frequentist setting observation X is random, but parameter θ is not

Frequentist and Bayesian Perspectives on Inference

Different approaches stemming in part from different interpretations of probability

Frequentist

I Probability defined through repetitions of a random experiment

I True parameter θ is a fixed element of Θ, but otherwise unknown

I Analysis and interpretation of inference based on (potentially unrealized) replications of the basic experiment

Bayesian

I Probability understood as a (potentially subjective) measure of belief

I Belief about the true parameter before and after an experiment represented respectively by prior and posterior distributions on the parameter space Θ

I Experiment regarded as unique. Inference based on updating the prior based on data, without reference to other experiments or repetition

Overview of Bayesian Inference

Basic ingredients

I Family P = {f(x|θ) : θ ∈ Θ} of sampling densities on X

I Prior density π(θ) on parameter space Θ

I Joint density f(x, θ) = f(x|θ) π(θ), marginal density m(x) = ∫ f(x, θ) dθ

I Observation model: First θ drawn from π, then X drawn from f(x|θ)

Idea: The prior density π(θ) reflects belief/information about the parameter before the experiment is conducted. Given data x, update the prior using Bayes formula to obtain

π(θ|x) = f(x|θ) π(θ) / m(x)   (posterior density)

Key point: All inferences about θ (point estimates, hypothesis tests, interval estimates) are based on the posterior density
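As one concrete instance of the update, the Beta-Bernoulli model gives a closed-form posterior: with prior Beta(a, b) and n Bernoulli observations containing s successes, Bayes formula yields the posterior Beta(a + s, b + n − s). A minimal sketch (the helper name posterior_params is hypothetical):

```python
# Hedged sketch: conjugate Bayes updating in the Beta-Bernoulli model.
# Prior Beta(a, b) plus n Bernoulli observations with s successes gives
# the posterior Beta(a + s, b + n - s).

def posterior_params(a, b, xs):
    s = sum(xs)                      # number of successes
    return a + s, b + len(xs) - s    # updated Beta parameters

a, b = posterior_params(1, 1, [1, 0, 1, 1])  # uniform prior, 3 successes in 4 trials
print(a, b)         # 4 2, i.e. a Beta(4, 2) posterior
print(a / (a + b))  # posterior mean 2/3
```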

Comparing Decision Rules

Recall: Risk of decision rule d : X → A under loss ℓ summarized by the risk function

R(θ, d) = Eθ ℓ(θ, d(X))

Question: Given two decision rules d1 and d2, how should we compare their associated risk functions R(θ, d1) and R(θ, d2)?

I Frequentist perspective: Consider maximum risk over θ ∈ Θ

I Bayesian perspective: Consider average risk over prior π

Point Estimation Under Squared Loss

Given family P = {f(x|θ) : θ ∈ Θ} with Θ ⊆ R, and an estimator θ̂ : X → Θ

I The bias of θ̂ at θ is biasθ(θ̂) = Eθ[θ̂(X)] − θ

I The variance of θ̂ at θ is Varθ(θ̂) = Eθ[θ̂(X) − Eθ θ̂(X)]²

I Say θ̂ is unbiased if biasθ(θ̂) = 0 for all θ

Bias-Variance Decomposition: Under the squared loss ℓ(θ, a) = (θ − a)²

R(θ, θ̂) = Varθ(θ̂) + (biasθ(θ̂))²

Upshot: For an estimator θ̂ to perform well it should

I Be centered near the true parameter (small bias)

I Not be too spread out (small variance)
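The decomposition can be checked exactly in a small discrete example. A hedged sketch, using a single Bernoulli observation and the purely illustrative shrinkage rule d(x) = (x + 1)/3 (the helper name moments is hypothetical):

```python
# Hedged sketch: checking R = Var + bias^2 exactly for a single X ~ Bern(theta)
# and the illustrative shrinkage rule d(x) = (x + 1)/3 under squared loss.

def moments(theta, d):
    pmf = {0: 1 - theta, 1: theta}
    mean = sum(d(x) * p for x, p in pmf.items())                  # E_theta[d(X)]
    var = sum((d(x) - mean) ** 2 * p for x, p in pmf.items())     # Var_theta(d(X))
    risk = sum((theta - d(x)) ** 2 * p for x, p in pmf.items())   # squared-loss risk
    return mean, var, risk

theta = 0.4
d = lambda x: (x + 1) / 3
mean, var, risk = moments(theta, d)
bias = mean - theta
print(abs(risk - (var + bias ** 2)) < 1e-12)  # True: decomposition holds
```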

Example: Estimation of a Normal Mean

Observation: X ∼ N (θ, 1) with θ ∈ R

Goal: Estimate θ under the squared error loss.

Consider two estimators

I θ̂1(x) = x, with risk function R(θ, θ̂1) = 1

I θ̂2(x) = 3, with risk function R(θ, θ̂2) = (θ − 3)²

Neither risk function dominates the other

Example: Probability of Success in Bernoulli Trial

Observation: X1, . . . , Xn ∼ Bern(θ) with θ ∈ (0, 1)

Goal: Estimate θ under the squared error loss.

Consider two estimators

I θ̂1(x) = x̄n, with R(θ, θ̂1) = θ(1 − θ)/n

I θ̂2(x) = (n x̄n + √n/2)/(n + √n), with R(θ, θ̂2) = 1/(4(1 + √n)²) (constant)

Neither risk function dominates the other
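Plugging in numbers shows how the two risk functions cross. A small sketch (the function names risk1 and risk2 are hypothetical; the formulas are from this example):

```python
from math import sqrt

# Hedged sketch: the two Bernoulli risk functions. Neither dominates:
# the sample mean wins near the endpoints, the shrinkage rule near 1/2.

def risk1(theta, n):
    return theta * (1 - theta) / n       # risk of the sample mean

def risk2(theta, n):
    return 1 / (4 * (1 + sqrt(n)) ** 2)  # shrinkage rule; constant in theta

n = 25
print(risk1(0.5, n) > risk2(0.5, n))    # True: shrinkage better at theta = 1/2
print(risk1(0.05, n) < risk2(0.05, n))  # True: sample mean better at theta = 0.05
```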

Maximum Risk and Bayes Risk

Idea: Single number summaries of overall risk

Definition: Given family P = {f(x|θ) : θ ∈ Θ} and loss function ℓ : Θ × A → R

(i) The maximum risk of a decision rule d : X → A is

Rm(d) = sup_{θ∈Θ} R(θ, d)

(ii) The Bayes risk of a decision rule d : X → A under prior density π is

Rπ(d) = ∫ R(θ, d) π(θ) dθ

Example: Probability of Success in Bernoulli Trial

Recall: Observe X1, . . . , Xn ∼ Bern(θ). Estimators θ̂1, θ̂2 for θ with

R(θ, θ̂1) = θ(1 − θ)/n   and   R(θ, θ̂2) = n/(4(n + √n)²) = 1/(4(1 + √n)²)

A. Maximum risk: Prefer estimator θ̂2 as

Rm(θ̂1) = 1/(4n) > 1/(4(1 + √n)²) = Rm(θ̂2)

B. Bayes risk: Under the uniform prior π(θ) = 1, prefer estimator θ̂1 for n ≥ 20 as

Rπ(θ̂1) = 1/(6n) < 1/(4(1 + √n)²) = Rπ(θ̂2)
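Both comparisons above, including the n ≥ 20 cutoff for the Bayes-risk comparison, can be verified numerically. A sketch with hypothetical helper names, using the formulas from these slides:

```python
from math import sqrt

# Hedged sketch: single-number risk summaries for the Bernoulli example.

def max_risk_1(n):
    return 1 / (4 * n)   # sup over theta of theta*(1 - theta)/n

def bayes_risk_1(n):
    return 1 / (6 * n)   # integral of theta*(1 - theta)/n over the uniform prior

def risk_2(n):
    # The shrinkage estimator has constant risk, so max risk = Bayes risk
    return 1 / (4 * (1 + sqrt(n)) ** 2)

print(max_risk_1(100) > risk_2(100))  # True: max risk favors theta_hat_2
print(bayes_risk_1(19) < risk_2(19))  # False: at n = 19 Bayes risk still favors theta_hat_2
print(bayes_risk_1(20) < risk_2(20))  # True: at n = 20 Bayes risk favors theta_hat_1
```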

Minimax Rules and Bayes Rules

Definition: The minimax risk for a family of decision rules D is

R∗m = inf_{d∈D} Rm(d) = inf_{d∈D} sup_{θ∈Θ} R(θ, d)

A rule d ∈ D is said to be minimax if Rm(d) = R∗m.

Definition: The optimal Bayes risk for a family of decision rules D under a prior π is

R∗π = inf_{d∈D} Rπ(d) = inf_{d∈D} ∫ R(θ, d) π(θ) dθ

A rule d ∈ D is said to be a Bayes rule for π if Rπ(d) = R∗π . Note: R∗π depends on π

Fact: Minimax risk is always bounded below by the optimal Bayes risk: for every prior distribution π on Θ one has R∗m ≥ R∗π

Finding Bayes Rules by Minimizing Posterior Risk

Given: Family P = {f(x|θ) : θ ∈ Θ} and prior density π on Θ. Recall the posterior density of θ given X = x is

π(θ|x) = f(x|θ) π(θ) / m(x)   where m(x) = ∫ f(x|θ) π(θ) dθ

Definition: The posterior risk of a decision a ∈ A given x under π is

Rπ(a|x) = ∫_Θ ℓ(θ, a) π(θ|x) dθ = E[ℓ(θ, a) | X = x]

Fact: Under mild conditions, the decision rule

dπ(x) = argmin_{a∈A} Rπ(a|x)

is a Bayes rule for π, provided that it is contained in D
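For intuition, the posterior risk can be minimized numerically over a grid of candidate actions. A hedged sketch assuming a Beta(2, 3) posterior and squared loss, where the argmin should land near the posterior mean 2/5 (all names are hypothetical):

```python
# Hedged sketch: find the Bayes decision by minimizing posterior risk on a
# grid, assuming a Beta(2, 3) posterior density and squared loss.

def posterior_pdf(theta):
    # Beta(2, 3) density: theta * (1 - theta)^2 / B(2, 3), with B(2, 3) = 1/12
    return 12 * theta * (1 - theta) ** 2

grid = [i / 1000 for i in range(1, 1000)]            # candidate actions / quadrature nodes
weights = [posterior_pdf(t) * 0.001 for t in grid]   # Riemann-sum weights

def posterior_risk(a):
    # E[(theta - a)^2 | X = x], approximated by a Riemann sum
    return sum((t - a) ** 2 * w for t, w in zip(grid, weights))

best = min(grid, key=posterior_risk)
print(best)  # close to the posterior mean 2/5 = 0.4
```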

Bayesian Point Estimators Under Different Loss Functions

Given: Family P = {f(x|θ) : θ ∈ Θ} with Θ = R and prior density π(θ).

A. Under squared loss ℓ(θ, θ′) = (θ − θ′)², Bayes estimator is the posterior mean

θ̂π(x) = ∫_Θ θ π(θ|x) dθ

B. Under absolute loss ℓ(θ, θ′) = |θ − θ′|, Bayes estimator is the posterior median

θ̂π(x) = u such that ∫_{−∞}^{u} π(θ|x) dθ = 1/2

C. Under zero-one loss ℓ(θ, θ′) = I(θ ≠ θ′), Bayes estimator is the posterior mode

θ̂π(x) = argmax_{θ∈Θ} π(θ|x)
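The three estimators can be compared on a single posterior. A hedged sketch computing mean, median, and mode on a grid for an assumed Beta(3, 2) posterior, whose closed-form mean is 3/5 and mode is (3 − 1)/(3 + 2 − 2) = 2/3 (all names hypothetical):

```python
# Hedged sketch: the three Bayes point estimates under the three losses,
# computed on a grid for an assumed Beta(3, 2) posterior.

def pdf(theta):
    return 12 * theta ** 2 * (1 - theta)  # Beta(3, 2) density, B(3, 2) = 1/12

grid = [i / 10000 for i in range(1, 10000)]
weights = [pdf(t) / 10000 for t in grid]  # Riemann-sum weights
total = sum(weights)

# Squared loss -> posterior mean
mean = sum(t * w for t, w in zip(grid, weights)) / total

# Zero-one loss -> posterior mode
mode = max(grid, key=pdf)

# Absolute loss -> posterior median: first grid point with CDF >= 1/2
cum, median = 0.0, None
for t, w in zip(grid, weights):
    cum += w
    if cum >= total / 2:
        median = t
        break

print(round(mean, 3), round(mode, 3))  # about 0.6 and 0.667, median in between
```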

Bayes Rules with Constant Risk are Minimax

Theorem: Let dπ be the Bayes rule for a family D under a prior π. If the risk function R(θ, dπ) is constant, then dπ is minimax for D.

Note: If dπ is minimax then π is said to be a least favorable prior

Example: Consider X1, . . . , Xn ∼ Bern(θ). Consider the point estimator

θ̂(x) = (n x̄n + √n/2)/(n + √n)

under the squared error loss

I Have seen that the risk R(θ, θ̂) = 1/(4(1 + √n)²) is constant

I Can show θ̂ is the posterior mean for θ under a Beta(√n/2, √n/2) prior

I By the Theorem, θ̂ is minimax
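The constant-risk claim can be verified via the bias-variance decomposition. A hedged sketch (the function name risk is hypothetical; the bias and variance formulas follow from the form of the estimator):

```python
from math import sqrt

# Hedged sketch: verify constant risk for theta_hat(x) = (n*xbar + sqrt(n)/2)/(n + sqrt(n)).
# Under squared loss:
#   bias = (sqrt(n)/2 - sqrt(n)*theta) / (n + sqrt(n))
#   var  = n * theta * (1 - theta) / (n + sqrt(n))**2

def risk(theta, n):
    denom = n + sqrt(n)
    bias = (sqrt(n) / 2 - sqrt(n) * theta) / denom
    var = n * theta * (1 - theta) / denom ** 2
    return var + bias ** 2

n = 16
vals = [risk(theta, n) for theta in (0.1, 0.3, 0.5, 0.9)]
print(max(vals) - min(vals) < 1e-12)                         # True: risk does not depend on theta
print(abs(vals[0] - 1 / (4 * (1 + sqrt(n)) ** 2)) < 1e-12)   # True: matches 1/(4(1 + sqrt(n))^2)
```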

Admissibility

Setting: General inference problem with family D of candidate decision rules

Definition: A decision rule d ∈ D is inadmissible if there is some d′ ∈ D such that

(i) R(θ, d′) ≤ R(θ, d) for all θ ∈ Θ

(ii) R(θ, d′) < R(θ, d) for some θ ∈ Θ

If no such d′ exists, then d is said to be admissible

I Admissibility depends on the family D and the loss function `

I A rule d is either admissible or inadmissible

I Admissible rules are candidates for good/reasonable rules

I There may be many admissible rules

I Admissibility is a weak criterion. Obviously silly rules can be admissible.

Example

Observations: X1, . . . , Xn i.i.d. Bern(θ) with θ ∈ (0, 1)

Goal: Estimate θ under squared loss. Candidate estimators

I θ̂1(x) = x̄n with R(θ, θ̂1) = θ(1 − θ)/n

I θ̂2(x) = x1 with R(θ, θ̂2) = θ(1 − θ)

I θ̂3(x) = 1/2 with R(θ, θ̂3) = (θ − 1/2)²

Fact

1. θ̂1 is admissible

2. θ̂2 is inadmissible (bettered by θ̂1)

3. θ̂3 is admissible (lazy, but unbeatable when θ = 1/2)

Admissibility of Bayes Rules

Thm: Consider a Bayesian decision problem in which

I Θ ⊆ Rp is open

I π(θ) > 0 for every θ ∈ Θ

I The Bayes rule dπ for π has finite Bayes risk Rπ(dπ)

If R(θ, d) is a continuous function of θ for each d ∈ D, then dπ is admissible.

Idea: If there were a rule d′ such that R(θ, d′) ≤ R(θ, dπ) for all θ, with strict inequality for some θ, then the Bayes risk of d′ would be less than that of dπ, a contradiction.