
ROBUST BAYESIAN UPDATING
JACK JEWSON, JIM Q. SMITH AND CHRIS HOLMES (OXFORD), COLLABORATING WITH JEREMIAS KNOBLAUCH AND THEO DAMOULAS

M-OPEN INFERENCE
• “All models are wrong but some are useful” (G. E. P. Box).
• Cannot learn θ0 generating the data.
• Define the parameter of interest by defining a divergence between the model and the sample distribution of the data (Walker, 2013) (JSPI).

GENERAL BAYESIAN UPDATING
• The ‘true’ Bayes act of a decision problem (parametrised by θ):

θ∗ = argmin_θ ∫_X ℓ(θ, x) dG,   (1)

where G(x) is the sample distribution of x.
• The traditional Bayesian builds a belief model to approximate G(x).

Without a model, the General Bayesian’s posterior beliefs (Bissiri, Holmes and Walker, 2016) (JRSSB) must be close to:

• the prior (measured using the KL-divergence),
• and the data (measured using the expected loss).

The posterior minimising the sum of these is:

π(θ | x) ∝ π(θ) exp(−w Σ_i ℓ(θ, x_i)).   (2)
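As a concrete illustration (not from the poster), the update in (2) can be approximated on a parameter grid for any loss; the absolute-error loss, data and prior below are invented for the demo. With that loss, the posterior targets the median of the data-generating distribution, with no likelihood ever written down.

```python
import numpy as np

def general_bayes_posterior(theta_grid, prior, loss, x, w=1.0):
    """Grid approximation to the general Bayesian posterior of Eq. (2):
    pi(theta | x) proportional to pi(theta) * exp(-w * sum_i loss(theta, x_i))."""
    total_loss = np.array([sum(loss(th, xi) for xi in x) for th in theta_grid])
    log_post = np.log(prior) - w * total_loss
    log_post -= log_post.max()                  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / (post.sum() * (theta_grid[1] - theta_grid[0]))  # normalise on the grid

# Absolute-error loss: belief updating about the median.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=200)
grid = np.linspace(-2.0, 6.0, 801)
prior = np.exp(-0.5 * grid**2 / 25.0)           # vague N(0, 25) prior, unnormalised
post = general_bayes_posterior(grid, prior, lambda th, xi: abs(th - xi), x)
print(grid[np.argmax(post)])                    # close to the sample median of x
```

The loss-scale w matters here: unlike the log-score case below, a generic loss has no canonical scale, which is one motivation for the divergence-based view in the final section.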

BAYES AS GENERAL BAYES
If ℓ(θ, x) = − log f(x; θ) then the general Bayesian update recovers Bayes’ rule:

π(θ | x) ∝ π(θ) ∏_i f(x_i; θ).   (3)

• Bayesian updating is learning about the parameter which minimises the KL-divergence to the sample distribution of the data.
• But as f(x; θ) → 0, − log(f(x; θ)) → ∞.
• Results in an (implicit) desire to correctly capture the tail behaviour of the underlying process in order to conduct principled inference.
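The equivalence in (3) is easy to check numerically. The conjugate normal example below is an illustration, not taken from the poster: for a N(θ, 1) model with a N(0, 1) prior, the exact posterior is N(Σx_i / (n + 1), 1 / (n + 1)), and the grid posterior built from exp(−Σ_i ℓ(θ, x_i)) with the log-loss matches it.

```python
import numpy as np

# With l(theta, x) = -log f(x; theta) and w = 1, the general update (2)
# is exactly Bayes' rule (3). Check against the conjugate closed form.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=50)
grid = np.linspace(-2.0, 4.0, 6001)

log_prior = -0.5 * grid**2                                        # N(0, 1) prior, up to a constant
neg_loss = -0.5 * ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum_i log f(x_i; theta), up to a constant
log_post = log_prior + neg_loss                                    # prior * exp(-sum of losses), in logs
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum() * (grid[1] - grid[0])

conjugate_mean = x.sum() / (len(x) + 1)
print(grid[np.argmax(post)], conjugate_mean)   # the two posterior means agree
```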

ROBUST BAYESIAN ONLINE CHANGEPOINT DETECTION (BOCPD)
Standard BOCPD (e.g. Knoblauch and Damoulas (2018) (ICML)):
• Detects changepoints (CPs) online, providing full uncertainty quantification.
• Combines a run-length (time since last CP) posterior and a parameter posterior within each segment.
• Use predictive density of the next observation as the run-length likelihood.
• Outliers have low predictive density and cause spurious CPs.
• Efficient recursion to update the posterior online.
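The recursion referred to above can be sketched as follows. This is a minimal, illustrative implementation of the standard (non-robust, Adams-and-MacKay-style) run-length filter with a Normal-Inverse-Gamma conjugate model, not the β-divergence-robustified version the poster proposes; the hazard rate, priors and synthetic data are assumptions for the demo.

```python
import numpy as np
from scipy import stats

def bocpd_run_lengths(x, hazard=1 / 100, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Standard BOCPD run-length filter with a Normal-Inverse-Gamma conjugate
    model, so the one-step predictive is a Student-t. Returns the (T+1) x (T+1)
    matrix of run-length probabilities after each observation."""
    T = len(x)
    log_R = np.full((T + 1, T + 1), -np.inf)
    log_R[0, 0] = 0.0
    mu, kappa, alpha, beta = (np.array([v]) for v in (mu0, kappa0, alpha0, beta0))
    for t, xt in enumerate(x):
        # Predictive density of x_t under every current run length.
        scale = np.sqrt(beta * (kappa + 1) / (alpha * kappa))
        log_pred = stats.t.logpdf(xt, df=2 * alpha, loc=mu, scale=scale)
        # Growth (no CP) versus changepoint (reset to run length 0).
        log_R[t + 1, 1 : t + 2] = log_R[t, : t + 1] + log_pred + np.log(1 - hazard)
        log_R[t + 1, 0] = np.logaddexp.reduce(log_R[t, : t + 1] + log_pred + np.log(hazard))
        log_R[t + 1] -= np.logaddexp.reduce(log_R[t + 1])     # normalise the row
        # Conjugate updates, prepending a fresh prior for run length 0.
        mu, kappa, alpha, beta = (
            np.concatenate(([mu0], (kappa * mu + xt) / (kappa + 1))),
            np.concatenate(([kappa0], kappa + 1)),
            np.concatenate(([alpha0], alpha + 0.5)),
            np.concatenate(([beta0], beta + kappa * (xt - mu) ** 2 / (2 * (kappa + 1)))),
        )
    return np.exp(log_R)

# A mean shift at t = 100: the run-length posterior resets there.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
R = bocpd_run_lengths(data)
print(np.argmax(R[150]))   # MAP run length 50 steps after the change
```

Because the reset probability is driven by the predictive density (the log-score), a single outlier with low density can trigger a spurious reset here; replacing the log-score with the β-divergence score, as in the robust variant below, bounds that influence.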

Robust BOCPD (Knoblauch, Jewson and Damoulas (2018) (arXiv)):
• Maintains full and principled uncertainty quantification.
• Robustify the run-length posterior using the β-divergence score in place of the log-score.
• Can set hyperparameters such that one observation alone cannot declare a CP.
• Also use the β-divergence for the parameter posterior.
• Propose a structured (quasi-conjugate) variational inference routine to conduct high-dimensional inference for the β-Bayes online.
• Initialise β to give maximum influence to regions where data is a priori expected to arrive.
• Update β online using a higher-level loss.
• Could possibly be extended to robust Bayesian model selection using a loss function on the prior predictive.

Synthetic example
Five-dimensional Vector Autoregression (VAR) with one dimension injected with t4-noise.


Figure 1: Maximum A Posteriori (MAP) CPs of robust (standard) BOCPD shown as solid (dashed) vertical lines. True CPs at t = 200, 400. In high dimensions it becomes increasingly likely that the model’s tails are misspecified in at least one dimension.

‘well-log’ dataset
Univariate data seeking to detect changes in rock strata.

[Figure omitted: Nuclear Response against Time (top panel); run-length distributions (middle and bottom panels) on a log-probability greyscale.]

Figure 2: Maximum A Posteriori (MAP) segmentation and run-length distributions of the well-log data. Robust segmentation depicted using solid lines; CPs additionally declared under standard BOCPD with dashed lines. The corresponding run-length distributions for robust (middle) and standard (bottom) BOCPD are shown in greyscale. The most likely run-lengths are dashed.

London Air Pollution
Dataset recording Nitrogen Oxide levels across 29 stations in London, modelled using spatially structured Bayesian VARs.

[Figure omitted: model posteriors P(m|y) and run-length distributions against Time, 2002-09 to 2003-08, on a log-probability greyscale.]

Figure 3: On-line model posteriors for three different VAR models (solid, dashed, dotted) and run-length distributions in greyscale with most likely run-lengths, for standard (top two panels) and robust (bottom two panels) BOCPD. Also marked are the congestion charge introduction, 17/02/2003 (solid vertical line), and the MAP segmentations (crosses).

A PRINCIPLED ALTERNATIVE
• Each divergence d(·, ·) has a corresponding loss function ℓ_d(·, ·).
• Equation (2) allows for principled belief updating for the parameter minimising divergences other than the KL-divergence (Jewson, Smith and Holmes, 2018) (Entropy):

π^(d)(θ | x) ∝ π^(d)(θ) exp(−Σ_{i=1}^n ℓ_d(x_i, f(·; θ))).   (4)

• Not a pseudo or approximate posterior as previously thought.
• w = 1 as we are doing model-based inference with a well-defined divergence.
• The principled justification allows the divergence to become a subjective judgement alongside the prior and model.
• Represents how strongly you believe in your model (especially its tails).
• Decouples belief elicitation and robustness.
• Decision-theoretic reasons for Total Variation (TV), Hellinger (Hell) or alpha-divergences, but these require a density estimate.
• Alternatively, the β-divergence with loss

ℓ_β(θ, x) = (1/(1 + β)) ∫_Y f(y; θ)^{β+1} dy − (1/β) f(x; θ)^β.   (5)
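For a Gaussian model the integral term in (5) is available in closed form, which makes the contrast with the log-score concrete: as x moves into the tails, −log f(x; θ) grows without bound, while ℓ_β stays bounded by the integral term. The function below is an illustrative sketch (the name and example values are not from the poster); the closed form ∫ N(y; µ, σ²)^{β+1} dy = (2πσ²)^{−β/2} (1 + β)^{−1/2} is standard.

```python
import numpy as np

def beta_loss_gaussian(x, mu, sigma, beta):
    """Eq. (5) for a N(mu, sigma^2) model, using the closed-form integral
    (2*pi*sigma^2)^(-beta/2) / sqrt(1 + beta)."""
    integral = (2 * np.pi * sigma**2) ** (-beta / 2) / np.sqrt(1 + beta)
    fx = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return integral / (1 + beta) - fx**beta / beta

# The log-score grows without bound in the tails; the beta loss does not.
x_tail = 100.0
log_loss = 0.5 * x_tail**2 + 0.5 * np.log(2 * np.pi)       # -log f(100; 0, 1): ~5000
b_loss = beta_loss_gaussian(x_tail, 0.0, 1.0, beta=0.5)    # bounded by the integral term
print(log_loss, b_loss)
```

This bounded influence of tail observations is exactly what the robust BOCPD section exploits when it swaps the log-score for the β-divergence score.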

SIMPLE DEMONSTRATION


Figure 4: Left: Posterior predictive distributions fitting N(µ, σ²) to g = 0.9N(0, 1) + 0.1N(5, 5²) using the KL-Bayes, Hell-Bayes, TV-Bayes, alpha-Bayes (α = 0.75) and beta-Bayes (β = 0.5). Right: The influence (Kurtek and Bharath, 2015) (Biometrika) of removing one of 1000 observations from a t(4) distribution when fitting a N(µ, σ²) under the beta-Bayes for different values of β.