Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory

Peter D. Grünwald and A. Philip Dawid

Presented by: Arindam Banerjee

(Transcript of slides)

Maximum Entropy

Given a stochastic source X ∼ P, H(P) is the minimum (expected) number of bits necessary for lossless compression of X:

H(P) = inf_{q ∈ A} E_P[− log q(X)]

The Maximum Entropy principle suggests picking the distribution P ∈ Γ with the maximum entropy:

sup_{P ∈ Γ} H(P) = sup_{P ∈ Γ} inf_{q ∈ A} E_P[− log q(X)]

Examples: Gibbs distributions, exponential family models, statistical mechanics models
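A minimal numerical sketch of this definition (toy distributions chosen for illustration, not from the slides): the expected code length E_P[− log q(X)] is minimized over q exactly at q = P, where it equals H(P).

```python
import math

def expected_log_loss(p, q):
    # E_P[-log q(X)] in nats, for discrete distributions given as lists
    return sum(pi * -math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # H(P) = inf_q E_P[-log q(X)], attained at q = p
    return expected_log_loss(p, p)

P = [0.5, 0.25, 0.25]
H = entropy(P)
# Any mismatched q incurs the extra cost E_P[-log q] - H(P) = KL(P || q) >= 0
for Q in ([1/3, 1/3, 1/3], [0.7, 0.2, 0.1], P):
    assert expected_log_loss(P, Q) >= H - 1e-12
```

The guard `pi > 0` uses the convention 0 · log 0 = 0.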

Robust Bayes

Log loss game: DM chooses q ∈ A, nature chooses x, DM suffers loss − log q(x)

Nature can pick x following any distribution P ∈ Γ

DM's strategy is to minimize the worst-case expected loss, so DM picks the q that achieves

inf_{q ∈ A} sup_{P ∈ Γ} E_P[− log q(X)]

This q is the robust Bayes act against Γ
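As a sketch of the robust Bayes criterion (toy two-element Γ and a grid search, chosen for illustration): the DM looks for the q whose worst-case expected log loss over Γ is smallest.

```python
import math

def exp_log_loss(p, q):
    # E_P[-log q(X)] for discrete p, q
    return sum(pi * -math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical uncertainty set: nature may use either of two distributions
Gamma = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]

# Grid search over the simplex for the act minimizing the worst-case loss
best_q, best_worst = None, float("inf")
for i in range(1, 99):
    for j in range(1, 99 - i):
        q = [i / 100, j / 100, 1 - (i + j) / 100]
        worst = max(exp_log_loss(P, q) for P in Gamma)
        if worst < best_worst:
            best_q, best_worst = q, worst

# The robust act hedges: its worst case beats naively playing either P in Γ
assert best_worst <= min(max(exp_log_loss(P, Q) for P in Gamma) for Q in Gamma)
```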

Maximum Entropy is Robust Bayes

Following the minimax results of game theory, conjecture:

sup_{P ∈ Γ} inf_{q ∈ A} E_P[− log q(X)] = inf_{q ∈ A} sup_{P ∈ Γ} E_P[− log q(X)]

The conjecture is true under very general conditions

The robust Bayes act is q = p∗, the density of the maximum entropy distribution P∗

Hence, maximum entropy is robust Bayes

A game-theoretic/statistical justification of “maxent”
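The minimax identity can be checked numerically on a toy convex Γ (all mixtures of two fixed distributions; all numbers here are illustrative):

```python
import math

def ell(p, q):  # E_P[-log q(X)]
    return sum(pi * -math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def H(p):
    return ell(p, p)

# Hypothetical convex Γ: the segment between two distributions
P1, P2 = [0.6, 0.3, 0.1], [0.1, 0.3, 0.6]
mixtures = [[l * a + (1 - l) * b for a, b in zip(P1, P2)]
            for l in [k / 200 for k in range(201)]]

sup_inf = max(H(P) for P in mixtures)   # maximum-entropy side

# For fixed q, P -> E_P[-log q] is linear, so the sup over Γ is
# attained at an endpoint; grid-search q for the robust Bayes side.
inf_sup = float("inf")
for i in range(1, 99):
    for j in range(1, 99 - i):
        q = [i / 100, j / 100, 1 - (i + j) / 100]
        inf_sup = min(inf_sup, max(ell(P1, q), ell(P2, q)))

assert abs(sup_inf - inf_sup) < 1e-3    # the two criteria agree
```

The optimal q found by the search is (up to grid resolution) the maxent density p∗, as the slides claim.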


Example

Maxent models with mean-value constraints: Γ = {P : E_P(T) = τ}, where T = t(X) ∈ R^k is a statistic

There will exist a distribution P∗ with p∗(x) = exp(−α_0 − α^T t(x)) and E_{P∗}[T] = τ

Then, for any P ∈ Γ,

E_P[− log p∗(X)] = α_0 + α^T τ = H(P∗)

P∗ maximizes entropy since

H(P) = inf_{q ∈ A} E_P[− log q(X)] ≤ E_P[− log p∗(X)] = H(P∗)

p∗ is robust Bayes since, for any q ∈ A (with equality when q = p∗),

sup_{P ∈ Γ} E_P[− log q(X)] ≥ E_{P∗}[− log q(X)] ≥ E_{P∗}[− log p∗(X)] = H(P∗)

Further, the value of the game is

sup_{P ∈ Γ} inf_{q ∈ A} E_P[− log q(X)] = inf_{q ∈ A} sup_{P ∈ Γ} E_P[− log q(X)] = H(P∗)
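This exponential-family construction can be carried out numerically on a toy finite problem (X = {0, 1, 2}, t(x) = x, τ = 0.7; sign convention p∗(x) = exp(−α_0 − α t(x)) so that E_P[− log p∗(X)] = α_0 + ατ; all values illustrative):

```python
import math

xs = [0, 1, 2]
tau = 0.7

def mean_under(a):
    # E[T] under the tilted distribution proportional to exp(-a * x)
    w = [math.exp(-a * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

# mean_under is decreasing in a: bisect for the natural parameter
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if mean_under(mid) < tau else (mid, hi)
a = (lo + hi) / 2
a0 = math.log(sum(math.exp(-a * x) for x in xs))
p_star = [math.exp(-a0 - a * x) for x in xs]

H_star = a0 + a * tau          # H(P*) = alpha_0 + alpha * tau

# Any other P with E_P[T] = tau gives the same E_P[-log p*] ...
P = [0.4, 0.5, 0.1]            # mean = 0.5 + 0.2 = 0.7
loss = sum(pi * (a0 + a * x) for pi, x in zip(P, xs))
assert abs(loss - H_star) < 1e-9
# ... and, since H(P) <= E_P[-log p*], entropy no larger than H(P*)
H_P = -sum(pi * math.log(pi) for pi in P)
assert H_P <= H_star + 1e-9
```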

Decision Problems

A basic game: DM chooses action a ∈ A, nature reveals x ∈ X ,DM suffers loss L(x, a)

If nature is using x ∼ P, P ∈ P, the expected lossL(P, a) = EP [L(X, a)]

If DM is using randomized act a ∼ ζ, L(P, ζ) = EP×ζ [L(X, A)]

ζP is a Bayes act if ∀ζ

EP [L(X, ζ) − L(X, ζP )] ≥ 0

Generalized entropy H(P ) = infa∈A L(P, a)

Consider the action space A to be a set of distributions Q overnature’s set X

Scoring rule S(x, Q) = L(x, ζQ) is proper if S(P, P ) = H(P )

. – p.6


Examples

Brier score: S(x, Q) = ‖δ_x − q‖², so that

S(P, Q) = Σ_j q_j² − 2 Σ_j q_j p_j + 1

H(P) = S(P, P) = 1 − Σ_j p_j²

Logarithmic score: S(x, Q) = − log q(x), with

S(P, Q) = E_P[− log q(X)]

H(P) = S(P, P) = E_P[− log p(X)]

Bregman score:

S(x, Q) = −ψ(q(x)) − Σ_t (δ_x(t) − q(t)) ψ′(q(t))

S(P, Q) = S(P, P) + d_ψ(P, Q)

H(P) = S(P, P) = E_P[−ψ(p(X))]
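Propriety of the first two scores can be verified directly on random toy distributions (the closed forms below are the ones stated in the slides):

```python
import math, random

def brier_S(p, q):
    # S(P, Q) = sum_j q_j^2 - 2 sum_j q_j p_j + 1
    return sum(qj * qj for qj in q) - 2 * sum(qj * pj for qj, pj in zip(q, p)) + 1

def log_S(p, q):
    # S(P, Q) = E_P[-log q(X)]
    return sum(pj * -math.log(qj) for pj, qj in zip(p, q) if pj > 0)

def rand_dist(rng, n=3):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    P, Q = rand_dist(rng), rand_dist(rng)
    # proper: S(P, Q) >= S(P, P) = H(P)
    assert brier_S(P, Q) >= brier_S(P, P) - 1e-12
    assert abs(brier_S(P, P) - (1 - sum(pj * pj for pj in P))) < 1e-12
    assert log_S(P, Q) >= log_S(P, P) - 1e-12
```

For the Brier score, S(P, Q) − S(P, P) = Σ_j (p_j − q_j)², which makes propriety immediate.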

Maximum Entropy and Robust Bayes

DM only knows that nature picks some P ∈ Γ

The “robust Bayes” criterion:

inf_{ζ ∈ Z} sup_{P ∈ Γ} L(P, ζ)

The maximum entropy criterion:

sup_{P ∈ Γ} inf_{ζ ∈ Z} L(P, ζ)

(P∗, ζ∗) is an equilibrium of the game if H∗ = L(P∗, ζ∗) is finite and the following hold:

(a) L(P∗, ζ∗) ≤ L(P∗, ζ) for all ζ ∈ Z

(b) L(P∗, ζ∗) ≥ L(P, ζ∗) for all P ∈ Γ


Maximum Entropy and Robust Bayes (Contd)

Lemma: Suppose there exist a maximum entropy distribution P∗ ∈ Γ and a robust Bayes act ζ∗ ∈ Z. Then

sup_{P ∈ Γ} inf_{ζ ∈ Z} L(P, ζ) ≤ L(P∗, ζ∗) ≤ inf_{ζ ∈ Z} sup_{P ∈ Γ} L(P, ζ)

Further, if the game has a value V∗, then V∗ = L(P∗, ζ∗) and (P∗, ζ∗) is an equilibrium of the game

Theorem: Suppose that an equilibrium (P∗, ζ∗) exists in a game. Then:

(i) the game has value H∗ = L(P∗, ζ∗)

(ii) ζ∗ is a Bayes act against P∗

(iii) H(P∗) = H∗

(iv) P∗ maximizes the entropy H(P) over Γ

(v) ζ∗ is robust Bayes against Γ


Minimax Theorems

Simple case: X is finite and L is bounded

Theorem: If Γ is a closed convex subset of P, then the game has a finite value H∗, and H(P) achieves its maximum H∗ for some P∗ ∈ Γ

Theorem: If Γ is convex and the unrestricted risk set is closed, then there is a robust Bayes act ζ∗ ∈ Z. Further, there is P∗ ∈ Γ such that ζ∗ is Bayes against P∗ and (P∗, ζ∗) is an equilibrium

More general minimax theorems can be derived under much weaker conditions on Γ and L


Mean Value Constraints

Let T = t(X) be a fixed real/vector-valued statistic

Consider the class of distributions with mean-value constraints:

Γ = Γ_τ = {P ∈ P : E_P(T) = τ}

Specific entropy function: h(τ) = sup_{P ∈ Γ_τ} H(P)

An act ζ is linear if L(x, ζ) = β_0 + β^T t(x)

P is linear if it has a linear Bayes act ζ; (P, ζ) is then called a linear pair; if E_P(T) = τ is finite, τ is a linear point

Theorem: If τ is a linear point with associated linear pair (P_τ, ζ_τ) and linear coefficients (β_0, β), then:

(i) ζ_τ is an equalizer rule against Γ_τ

(ii) (P_τ, ζ_τ) is an equilibrium

(iii) ζ_τ is robust Bayes against Γ_τ

(iv) h(τ) = H(P_τ) = β_0 + β^T τ
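The equalizer property (i) can be seen directly for the log loss: a linear act has expected loss β_0 + β^T τ under every P ∈ Γ_τ. A toy check (X = {0, 1, 2}, t(x) = x, τ = 0.7; all numbers illustrative), sampling random members of Γ_τ:

```python
import math, random

xs, tau = [0, 1, 2], 0.7

def mean_under(a):
    w = [math.exp(-a * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

# Bisect for the natural parameter of the maxent member P_tau,
# whose Bayes act is linear: L(x, zeta_tau) = a0 + a * x
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if mean_under(mid) < tau else (mid, hi)
a = (lo + hi) / 2
a0 = math.log(sum(math.exp(-a * x) for x in xs))

# Random members of Gamma_tau: fix p1, solve p2 from the mean constraint
rng = random.Random(1)
losses = []
for _ in range(100):
    p1 = rng.uniform(0.0, tau)
    p2 = (tau - p1) / 2
    P = [1 - p1 - p2, p1, p2]
    losses.append(sum(pi * (a0 + a * x) for pi, x in zip(P, xs)))

# Constant across Gamma_tau: the equalizer value h(tau) = a0 + a * tau
assert max(losses) - min(losses) < 1e-9
```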


Discrepancy

Discrepancy between a distribution P and an act ζ:

D(P, ζ) = L(P, ζ) − H(P)

If a Bayes act ζ_P exists, then

D(P, ζ) = E_P[L(X, ζ) − L(X, ζ_P)]

When the loss is a proper scoring rule S, the divergence is d(P, Q) = S(P, Q) − H(P)

Lemma: Let P_1, …, P_n have finite entropies and (p_1, …, p_n) be a probability vector. Then, with P̄ = Σ_i p_i P_i,

H(P̄) = Σ_i p_i H(P_i) + Σ_i p_i d(P_i, P̄)

d(P̄, Q) = Σ_i p_i d(P_i, Q) − Σ_i p_i d(P_i, P̄)
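For the log score, d(P, Q) is the KL divergence, and both identities of the lemma can be checked on toy numbers (the first is the familiar compensation / mutual-information identity):

```python
import math

def H(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical components and mixture weights
Ps = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
w = [0.5, 0.3, 0.2]
Pbar = [sum(wi * Pi[j] for wi, Pi in zip(w, Ps)) for j in range(3)]

# H(Pbar) = sum_i w_i H(P_i) + sum_i w_i d(P_i, Pbar)
lhs = H(Pbar)
rhs = sum(wi * H(Pi) for wi, Pi in zip(w, Ps)) + \
      sum(wi * kl(Pi, Pbar) for wi, Pi in zip(w, Ps))
assert abs(lhs - rhs) < 1e-10

# d(Pbar, Q) = sum_i w_i d(P_i, Q) - sum_i w_i d(P_i, Pbar)
Q = [0.2, 0.5, 0.3]
assert abs(kl(Pbar, Q) - (sum(wi * kl(Pi, Q) for wi, Pi in zip(w, Ps))
                          - sum(wi * kl(Pi, Pbar) for wi, Pi in zip(w, Ps)))) < 1e-10
```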

Relative Entropy

Given a reference act ζ_0, the relative loss is

L_0(x, a) = L(x, a) − L(x, ζ_0)

Similarly, L_0(x, ζ) = L(x, ζ) − L(x, ζ_0) and L_0(P, ζ) = L(P, ζ) − L(P, ζ_0)

If a Bayes act ζ_P against P exists, the generalized relative entropy H_0(P) = inf_{a ∈ A} L_0(P, a) is

H_0(P) = E_P[L(X, ζ_P) − L(X, ζ_0)]

Maximizing generalized relative entropy is the same as minimizing discrepancy:

H_0(P) = H(P) − L(P, ζ_0) = −D(P, ζ_0)

Choose P ∈ Γ to minimize the discrepancy D(P, ζ_0) from the reference act ζ_0
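For the log score with a hypothetical reference act q_0, this identity reads H_0(P) = −KL(P ‖ Q_0), so maximizing relative entropy means minimizing KL to the reference:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distribution and reference act (illustrative numbers)
P = [0.5, 0.3, 0.2]
Q0 = [0.25, 0.25, 0.5]

# H_0(P) = E_P[-log p(X)] - E_P[-log q0(X)] = E_P[log(q0(X)/p(X))]
H0 = sum(pi * math.log(q0i / pi) for pi, q0i in zip(P, Q0) if pi > 0)
assert abs(H0 + kl(P, Q0)) < 1e-12   # H_0(P) = -D(P, zeta_0)
```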


Pythagorean Inequality

Theorem: Suppose (P∗, ζ∗) is an equilibrium. Then for all P ∈ Γ,

D(P, ζ∗) + D(P∗, ζ_0) ≤ D(P, ζ_0)

Conversely, if the inequality holds with finite right-hand side for all P ∈ Γ, then (P∗, ζ∗) is an equilibrium

Corollary: If S is a proper scoring rule, then in the relative game with loss S_0(P, Q), if (P∗, P∗) is an equilibrium, then for all P ∈ Γ,

d(P, P∗) + d(P∗, P_0) ≤ d(P, P_0)

Theorem: Suppose (P∗, ζ∗) is an equilibrium of the relative game. If ζ∗ is an equalizer rule, i.e., L_0(P, ζ∗) = H_0(P∗) for all P ∈ Γ, then the Pythagorean equality holds. Conversely, if the Pythagorean equality holds, then L_0(P, ζ∗) = H_0(P∗) for all P ∈ Γ with D(P, ζ_0) < ∞; in particular, if D(P, ζ_0) < ∞ for all P ∈ Γ, then ζ∗ is an equalizer rule
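For KL divergence with a mean-value constraint set, the equalizer case applies and the inequality holds with equality. A toy check (X = {0, 1, 2}, Γ = {P : E_P[X] = 0.7}, hypothetical reference Q_0; P∗ is the exponential tilt of Q_0 matching the mean):

```python
import math, random

xs, tau = [0, 1, 2], 0.7
Q0 = [0.2, 0.5, 0.3]          # hypothetical reference distribution

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tilt_mean(a):
    # mean of the distribution proportional to Q0(x) * exp(-a * x)
    w = [q * math.exp(-a * x) for q, x in zip(Q0, xs)]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

# tilt_mean is decreasing in a: bisect for the tilt matching tau
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if tilt_mean(mid) < tau else (mid, hi)
a = (lo + hi) / 2
w = [q * math.exp(-a * x) for q, x in zip(Q0, xs)]
Z = sum(w)
P_star = [wi / Z for wi in w]

# Pythagorean equality over random members of Gamma:
# KL(P, P*) + KL(P*, Q0) = KL(P, Q0)
rng = random.Random(2)
for _ in range(100):
    p1 = rng.uniform(0.0, tau)
    p2 = (tau - p1) / 2
    P = [1 - p1 - p2, p1, p2]
    assert abs(kl(P, P_star) + kl(P_star, Q0) - kl(P, Q0)) < 1e-6
```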