
ROBUST BAYESIAN UPDATING
Jack Jewson, Jim Q. Smith and Chris Holmes (Oxford), collaborating with Jeremias Knoblauch and Theo Damoulas

M-OPEN INFERENCE
• "All models are wrong but some are useful" (G. E. P. Box).
• Cannot learn the θ0 generating the data.
• Define the parameter of interest by defining a divergence between the model and the sample distribution of the data (Walker, 2013, JSPI).

GENERAL BAYESIAN UPDATING
• The 'true' Bayes act of the decision problem (parametrised by θ):

      θ* = argmin_θ ∫_X ℓ(θ, x) dG,    (1)

  where G(x) is the sample distribution of x.
• The traditional Bayesian builds a belief model to approximate G(x).

• Without a model, the General Bayesian's posterior beliefs (Bissiri, Holmes and Walker, 2016, JRSSB) must be close to:
  • the prior (measured using the KL-divergence), and
  • the data (measured using the expected loss).
• The posterior minimising the sum of these is:

      π(θ | x) ∝ π(θ) exp( −w Σ_i ℓ(θ, x_i) ).    (2)
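As a minimal sketch of the update in (2), and not part of the poster, the general Bayesian posterior can be evaluated on a parameter grid for any loss; the Gaussian location model, N(0, 10) prior, absolute-error loss and w = 1 below are illustrative assumptions.

```python
import numpy as np

# General Bayesian update, Eq. (2): posterior ∝ prior * exp(-w * sum_i loss(theta, x_i)).
# Illustrative choices: N(0, 10) prior, absolute-error loss, w = 1.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)              # observed data
theta = np.linspace(-5.0, 10.0, 2001)                    # grid over the location parameter

log_prior = -0.5 * theta**2 / 10.0                       # N(0, 10) prior, up to a constant
total_loss = np.abs(x[None, :] - theta[:, None]).sum(axis=1)  # sum_i |x_i - theta|
w = 1.0                                                  # learning rate

log_post = log_prior - w * total_loss
post = np.exp(log_post - log_post.max())                 # stabilise before normalising
post /= post.sum()                                       # normalise over the grid
print("general Bayes posterior mean:", (theta * post).sum())
```

Swapping the loss line changes the target of inference; the rest of the update is unchanged.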

BAYES AS GENERAL BAYES
If ℓ(θ, x) = −log f(x; θ), then the general Bayesian update recovers Bayes' rule:

      π(θ | x) ∝ π(θ) Π_i f(x_i; θ).    (3)

• Bayesian updating is learning about the parameter which minimises the KL-divergence to the sample distribution of the data.
• But as f(x; θ) → 0, −log f(x; θ) → ∞.
• This results in an (implicit) desire to correctly capture the tail behaviour of the underlying process in order to conduct principled inference.
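As a quick check of this equivalence (an assumed toy setup, not from the poster), plugging the negative log-likelihood of a known-variance normal model into the same grid update reproduces the conjugate posterior:

```python
import numpy as np
from scipy.stats import norm

# With loss(theta, x) = -log f(x; theta) and w = 1, Eq. (2) reduces to Bayes' rule, Eq. (3).
# Toy check: N(theta, 1) likelihood with a conjugate N(0, 10) prior.
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)
theta = np.linspace(-5.0, 10.0, 2001)

log_lik = norm.logpdf(x[None, :], loc=theta[:, None], scale=1.0).sum(axis=1)
log_post = -0.5 * theta**2 / 10.0 + log_lik              # log prior minus the summed log-loss
post = np.exp(log_post - log_post.max())
post /= post.sum()

mu_n = x.sum() / (1.0 / 10.0 + len(x))                   # closed-form conjugate posterior mean
print((theta * post).sum(), mu_n)                        # the two means agree
```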

ROBUST BAYESIAN ONLINE CHANGEPOINT DETECTION (BOCPD)

Standard BOCPD (e.g. Knoblauch and Damoulas, 2018, ICML)
• Detects changepoints (CPs) online, providing full uncertainty quantification.
• Combines a run-length (time since the last CP) posterior and a parameter posterior within each segment.
• Uses the predictive density of the next observation as the run-length likelihood.
• Outliers have low predictive density and cause spurious CPs.
• Uses an efficient recursion to update the posterior online.

Robust BOCPD (Knoblauch, Jewson and Damoulas, 2018, arXiv)
• Maintains full and principled uncertainty quantification.
• Robustifies the run-length posterior using the β-divergence score in place of the log-score (see the sketch after this list).
• Hyperparameters can be set such that one observation alone cannot declare a CP.
• Also uses the β-divergence for the parameter posterior.
• Proposes a structured (quasi-conjugate) variational inference routine to conduct high-dimensional inference for the β-Bayes online.
• Initialises β to give maximum influence to regions where data are a priori expected to arrive.
• Updates β online using a higher-level loss.
• Could possibly be extended to robust Bayesian model selection using a loss function on the prior predictive.
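The sketch below (not the paper's implementation; a Gaussian predictive and β = 0.25 are assumed purely for illustration) shows why the β-divergence score bounds the influence a single observation can have on the run-length likelihood, whereas the log-score does not:

```python
import numpy as np

def log_score(x, mu, sigma):
    """Negative log predictive density of N(mu, sigma^2) at x (the standard BOCPD score)."""
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi))

def beta_score(x, mu, sigma, beta):
    """beta-divergence score of a N(mu, sigma^2) predictive at x; for a Gaussian the
    integral term has the closed form (2*pi*sigma^2)^(-beta/2) / sqrt(1 + beta)."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    integral = (2.0 * np.pi * sigma**2) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    return integral / (1.0 + beta) - dens**beta / beta

for x_new in [0.0, 3.0, 10.0, 50.0]:                     # increasingly extreme observations
    print(x_new, log_score(x_new, 0.0, 1.0), beta_score(x_new, 0.0, 1.0, beta=0.25))
# The log-score grows quadratically with the outlier, while the beta-score stays bounded,
# so a single extreme observation cannot on its own force a spurious changepoint.
```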

Synthetic example
Five-dimensional Vector Autoregression (VAR) with one dimension injected with t_4 noise.

[Figure 1: one plotted series against Time, t = 0 to 600.]

Figure 1: Maximum A Posteriori (MAP) CPs of robust (standard) BOCPD shown as solid (dashed) vertical lines. True CPs at t = 200, 400. In high dimensions it becomes increasingly likely that the model's tails are misspecified in at least one dimension.
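The synthetic setup can be mimicked roughly as follows; the VAR(1) coefficient matrices, noise scales and segment structure below are assumptions for illustration, not the exact configuration used in the experiment.

```python
import numpy as np

# Rough generator: a 5-dimensional VAR(1) whose first dimension receives heavy-tailed
# t_4 innovations, with changepoints at t = 200 and t = 400 (illustrative coefficients).
rng = np.random.default_rng(2)
d, T = 5, 600
A_segments = [0.3 * np.eye(d), -0.3 * np.eye(d), 0.6 * np.eye(d)]  # one matrix per segment

y = np.zeros((T, d))
for t in range(1, T):
    A = A_segments[0] if t < 200 else A_segments[1] if t < 400 else A_segments[2]
    noise = rng.normal(size=d)
    noise[0] = rng.standard_t(df=4)                      # the misspecified, heavy-tailed dimension
    y[t] = A @ y[t - 1] + noise
print(y.shape)                                           # (600, 5) series to feed to a BOCPD routine
```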

'well-log' dataset
Univariate data in which we seek to detect changes in rock strata.

[Figure 2: Nuclear Response against Time (0 to 4000), with run-length distributions beneath.]

Figure 2: Maximum A Posteriori (MAP) segmentation and run-length distributions of the well-log data. Robust segmentation depicted using solid lines; CPs additionally declared under standard BOCPD with dashed lines. The corresponding run-length distributions for robust (middle) and standard (bottom) BOCPD are shown in greyscale. The most likely run-lengths are dashed.

London Air Pollution
Dataset recording Nitrogen Oxide levels across 29 stations in London, modelled using spatially structured Bayesian VARs.

[Figure 3: model posteriors P(m|y) and run-length distributions against Time, 2002-09 to 2003-08.]

Figure 3: On-line model posteriors for three different VAR models (solid, dashed, dotted) and run-length distributions in greyscale with most likely run-lengths, for standard (top two panels) and robust (bottom two panels) BOCPD. Also marked are the congestion charge introduction, 17/02/2003 (solid vertical line), and the MAP segmentations (crosses).

A PRINCIPLED ALTERNATIVE
• Each divergence d(·, ·) has a corresponding loss function ℓ_d(·, ·).
• Equation (2) allows principled belief updating for parameters minimising divergences other than the KL-divergence (Jewson, Smith and Holmes, 2018, Entropy):

      π^(d)(θ | x) ∝ π^(d)(θ) exp( −Σ_{i=1}^{n} ℓ_d(x_i, f(·; θ)) ).    (4)

• This is not a pseudo or approximate posterior, as previously thought.
• w = 1, since we are doing model-based inference with a well-defined divergence.
• The principled justification allows the divergence to become a subjective judgement alongside the prior and the model.
• It represents how strongly you believe in your model (especially its tails).
• It decouples belief elicitation and robustness.
• There are decision-theoretic reasons for the Total Variation (TV), Hellinger (Hell) or alpha-divergences, but these require a density estimate.
• Alternatively, use the β-divergence with loss

      ℓ_β(θ, x) = (1 / (1 + β)) ∫_Y f(y; θ)^(β+1) dy − (1/β) f(x; θ)^β.    (5)
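A sketch of the resulting β-Bayes update, i.e. Eq. (4) with the loss in Eq. (5), for a N(μ, 1) model fit to contaminated data, mirroring the demonstration in the next section; the grid, the N(0, 10) prior and β = 0.5 are illustrative choices, not taken from the poster.

```python
import numpy as np
from scipy.stats import norm

# beta-Bayes posterior over the location of a N(mu, 1) model on 10%-contaminated data.
rng = np.random.default_rng(3)
n = 500
outlier = rng.random(n) < 0.1
x = np.where(outlier, rng.normal(5.0, 5.0, n), rng.normal(0.0, 1.0, n))

beta = 0.5
mu = np.linspace(-2.0, 4.0, 1201)
f = norm.pdf(x[None, :], loc=mu[:, None], scale=1.0)     # f(x_i; mu) for every grid point

integral = (2.0 * np.pi) ** (-beta / 2.0) / np.sqrt(1.0 + beta)  # closed form of the integral in Eq. (5)
loss = integral / (1.0 + beta) - (f ** beta) / beta               # Eq. (5) for each (mu, x_i) pair
log_post = -0.5 * mu**2 / 10.0 - loss.sum(axis=1)                 # N(0, 10) log prior minus summed beta-loss
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("beta-Bayes mean:", (mu * post).sum(), "sample mean:", x.mean())
# The beta-Bayes location stays close to the uncontaminated component at 0, while the
# sample mean (roughly the KL-Bayes estimate here) is pulled towards the outliers.
```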

SIMPLE DEMONSTRATION

[Figure 4: left panel, fitted densities against x; right panel ('KL and power'), influence against Y.]

Figure 4: Left: Posterior predictive distributions fitting N(μ, σ²) to g = 0.9 N(0, 1) + 0.1 N(5, 5²) using the KL-Bayes, Hell-Bayes, TV-Bayes, alpha-Bayes (α = 0.75) and beta-Bayes (β = 0.5). Right: The influence (Kurtek and Bharath, 2015, Biometrika) of removing one of 1000 observations from a t(4) distribution when fitting N(μ, σ²) under the beta-Bayes for different values of β.