Exchangeability
Peter Orbanz
Columbia University
PARAMETERS AND PATTERNS
Parameters
P(X|θ) = Probability[data|pattern]
[Figure: GP regression example reproduced from C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006 (www.GaussianProcess.org/gpml), Fig. 2.5. (a) Data generated from a GP with SE-kernel hyperparameters (ℓ, σf, σn) = (1, 1, 0.1), shown with the 95% confidence region for the underlying function f obtained by GP prediction with these hyperparameters. (b), (c) The 95% confidence regions for hyperparameter values (0.3, 1.08, 0.00005) and (3.0, 1.16, 0.89) respectively.]
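The effect of the length-scale in the figure can be sketched numerically. A minimal NumPy example (not from the slides; grid and seed are illustrative) draws GP prior sample paths under the SE kernel for the three length-scales shown:

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0, sf=1.0):
    """Squared-exponential covariance k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))."""
    d = x1[:, None] - x2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)
samples = {}
for ell in (0.3, 1.0, 3.0):
    K = se_kernel(x, x, ell=ell)
    # small jitter keeps the covariance matrix positive definite for sampling
    samples[ell] = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
# smaller ell -> wigglier sample paths; larger ell -> smoother ones
```

Plotting the three entries of `samples` reproduces the qualitative behavior of panels (a)-(c).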
Inference idea
data = underlying pattern + independent noise
Peter Orbanz 2 / 25
TERMINOLOGY
Parametric model
I Number of parameters fixed (or constantly bounded) w.r.t. sample size
Nonparametric model
I Number of parameters grows with sample size
I ∞-dimensional parameter space
Example: Density estimation

Parametric
[Figure: samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ; ellipses show lines of equal probability density (Duda, Hart & Stork, Pattern Classification, Fig. 2.9).]

Nonparametric
[Figure: two-dimensional circularly symmetric normal Parzen windows φ(x/h) for h = 1, .5, .2, and the Parzen-window density estimates they produce from the same set of five samples (Duda, Hart & Stork, Pattern Classification, Figs. 4.3-4.4).]

Peter Orbanz 3 / 25
NONPARAMETRIC BAYESIAN MODEL
Definition
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Interpretation
Parameter space T = set of possible patterns. Recall previous tutorials:

Model             | T                | Application
Gaussian process  | Smooth functions | Regression problems
DP mixtures       | Smooth densities | Density estimation
CRP, 2-param. CRP | Partitions       | Clustering
Solution to Bayesian problem = posterior distribution on patterns
[Sch95]Peter Orbanz 4 / 25
DE FINETTI’S THEOREM
Infinite exchangeability
For all π ∈ S∞ (= infinite symmetric group):

P(X1, X2, . . . ) = P(Xπ(1), Xπ(2), . . . ) or π(P) = P

Theorem (de Finetti)

P exchangeable ⇔ P(X1, X2, . . . ) = ∫_M(X) ( ∏_{n=1}^∞ Q(Xn) ) dν(Q)

I Q is a random measure
I ν uniquely determined by P
Peter Orbanz 5 / 25
FINITE EXCHANGEABILITY
Finite sequence X1, . . . , Xn
Exchangeability of a finite sequence ⇏ de Finetti representation
Example: Two exchangeable random bits
         X1 = 0   X1 = 1
X2 = 0     0       1/2
X2 = 1    1/2       0

Suppose de Finetti holds; then

0 = P(X1 = X2 = 1) = ∫_[0,1] p² dν(p)       ⇒ ν{p = 0} = 1
0 = P(X1 = X2 = 0) = ∫_[0,1] (1 − p)² dν(p) ⇒ ν{p = 1} = 1

a contradiction, so no de Finetti representation exists.

Intuition
Finite exchangeability does not eliminate sequential patterns.
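A quick numeric check of the argument (illustrative, not from the slides): under any i.i.d. Bernoulli(p) mixture, P(X1 = X2 = 1) + P(X1 = X2 = 0) = E[p² + (1 − p)²] ≥ 1/2, while the table above requires this sum to be 0:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 1001)
# under an i.i.d. Bernoulli(p) source: P(X1 = X2 = 1) = p^2, P(X1 = X2 = 0) = (1 - p)^2
same = p**2 + (1 - p) ** 2
# the sum is at least 1/2 for every p (minimum at p = 1/2),
# so no mixing measure nu can make it average to 0
assert same.min() >= 0.5 - 1e-12
```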
[DF80]Peter Orbanz 6 / 25
SUPPORT OF PRIORS
[Figure: the model {Pθ | θ ∈ T} as a subset of M(X). If the true distribution P0 = Pθ0 lies in the model, the model is well-specified; if P0 lies outside the model, it is misspecified.]
[Gho10, KvdV06]Peter Orbanz 7 / 25
SUPPORT OF NONPARAMETRIC PRIORS
Large support
I Support of nonparametric priors is larger (∞-dimensional) than that of parametric priors (finite-dimensional).
I However: No uniform prior (or even “neutral” improper prior) exists on M(X ).
Interpretation of nonparametric prior assumptions
Concentration of a nonparametric prior on a subset of M(X) typically represents a structural prior assumption.
I GP regression with unknown bandwidth:
  I Any continuous function possible
  I Prior can express e.g. "very smooth functions are more probable"
I Clustering: Expected number of clusters is...
  I ...small −→ CRP prior
  I ...power law −→ two-parameter CRP
Peter Orbanz 8 / 25
PARAMETERIZED MODELS
Probability model

[Diagram: probability space Ω with measure P; a random variable X : Ω → X induces the law P(X) = X[P]; the parameter Θ(ω) is a further variable with values in T.]

Parameterized model P[X|Θ]

[Diagram: Ω → X∞ (the sequence X∞), X∞ → M(X) ⊃ P (the limiting empirical measure F), P → T (the parameterization T).]

I P = {P[X|θ] | θ ∈ T}
I F ≡ law of large numbers (maps a sequence to its limiting empirical measure)
I T : P[ . |Θ = θ] ↦ θ bijection
I Θ := T ◦ F ◦ X∞
[Sch95]Peter Orbanz 9 / 25
JUSTIFICATION: BY EXCHANGEABILITY
Again: de Finetti
P(X1, X2, . . . ) = ∫_M(X) ( ∏_{n=1}^∞ Q(Xn) ) dν(Q) = ∫_T ( ∏_{n=1}^∞ Q(Xn|Θ = θ) ) dν_T(θ)
I Θ random measure (since Θ(ω) ∈ M(X ))
Convergence results
The de Finetti theorem comes with a convergence result attached:
I Empirical measure: Fn → Θ weakly as n → ∞
I Posterior Λn(Θ|X1, . . . , Xn) = Λn( . , ω) in M(T) exists
I Posterior convergence: Λn( . , ω) → δ_{Θ(ω)} as n → ∞
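These convergence statements can be illustrated in the simplest exchangeable setting, binary sequences with a Beta mixing measure (an illustrative special case, not part of the slides): the empirical measure approaches the directing parameter and the posterior concentrates around it.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7                                   # directing "pattern": Bernoulli(theta)
x = (rng.random(5000) < theta).astype(int)    # data, i.i.d. given theta
a, b = 1.0, 1.0                               # Beta(1, 1) prior on theta
n, k = len(x), int(x.sum())
post_mean = (a + k) / (a + b + n)             # Beta(a + k, b + n - k) posterior mean
post_sd = np.sqrt(post_mean * (1 - post_mean) / (a + b + n + 1))
# empirical measure F_n is close to theta, posterior is nearly a point mass at theta
assert abs(x.mean() - theta) < 0.05
assert post_sd < 0.01
```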
[Kal01]Peter Orbanz 10 / 25
SPECIAL TYPES OF EXCHANGEABLE DATA
MODIFICATIONS
Pólya Urns
P(Xn+1|X1 = x1, . . . , Xn = xn) = 1/(α + n) ∑_{j=1}^n δ_{xj}(Xn+1) + α/(α + n) G0(Xn+1)

Exchangeable:
I ν is DP(α, G0)
I ∏_{n=1}^∞ Q(Xn|θ) = ∏_{n=1}^∞ θ(Xn) = ∏_{n=1}^∞ ( ∑_{j=1}^∞ cj δ_{tj}(Xn) )
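The urn scheme above can be simulated directly; a minimal sketch (function name, base measure, and parameters are illustrative choices):

```python
import numpy as np

def polya_urn(n, alpha, base_draw, rng):
    """Sequentially sample X_1, ..., X_n: with prob. alpha/(alpha + i) draw a fresh
    value from G0, otherwise copy one of the i previous values uniformly at random."""
    xs = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            xs.append(base_draw(rng))            # new value from the base measure G0
        else:
            xs.append(xs[rng.integers(i)])       # reuse an earlier value
    return xs

rng = np.random.default_rng(2)
xs = polya_urn(1000, alpha=2.0, base_draw=lambda r: float(r.normal()), rng=rng)
# values repeat: the number of distinct values grows only like alpha * log(n)
```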
Exchangeable increment processes (H. Bühlmann)
Stationary, exchangeable increment process = mixture of Lévy processes

P((Xt)_{t∈R+}) = ∫ L_{α,γ,µ}((Xt)_{t∈R+}) dν(α, γ, µ)

L_{α,γ,µ} = Lévy process with jump measure µ
[B60, Kal01]Peter Orbanz 12 / 25
MODIFICATION 2: RANDOM PARTITIONS
Random partition of N
Π = {B1, B2, . . .}, e.g. {{1, 3, 5, . . .}, {2, 4, 10, . . .}}
Paint-box distribution
I Weights s1, s2, . . . ≥ 0 with ∑_j sj ≤ 1
I U1, U2, . . . ∼ Uniform[0, 1]

[Figure: the unit interval cut into subintervals of lengths s1, s2, . . . plus a leftover interval of length 1 − ∑_j sj; the Ui fall uniformly onto it.]

Sampling Π ∼ β[ . |s]:
I i, j ∈ N in same block ⇔ Ui, Uj in same interval
I i in a singleton block ⇔ Ui in the leftover interval of length 1 − ∑_j sj
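The sampling rule can be sketched as follows (the interval bookkeeping uses cumulative sums; the particular weights are illustrative):

```python
import numpy as np

def paintbox(n, s, rng):
    """Partition {0, ..., n-1}: drop U_i ~ Uniform[0,1]; i, j share a block iff
    their U's land in the same s-interval; U's in the leftover mass 1 - sum(s)
    give singleton blocks."""
    edges = np.concatenate(([0.0], np.cumsum(s)))    # interval boundaries
    u = rng.random(n)
    blocks, singletons = {}, []
    for i, ui in enumerate(u):
        j = int(np.searchsorted(edges, ui, side="right")) - 1
        if j >= len(s):                  # fell past the last s-interval: own block
            singletons.append({i})
        else:
            blocks.setdefault(j, set()).add(i)
    return list(blocks.values()) + singletons

rng = np.random.default_rng(3)
part = paintbox(20, s=[0.5, 0.3], rng=rng)   # leftover mass 1 - 0.8 = 0.2
```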
Theorem (Kingman)
Π exchangeable ⇔ P(Π ∈ . ) = ∫ β[Π ∈ . |s] Q(ds)
[Kin78]Peter Orbanz 13 / 25
ROTATION INVARIANCE
Rotatable sequence
Pn(X1, . . . ,Xn) = Pn(Rn(X1, . . . ,Xn)) for all Rn ∈ O(n)
Infinite case
X1,X2, . . . rotatable :⇔ X1, . . . ,Xn rotatable for all n
Theorem (Freedman)
An infinite sequence is rotatable iff

P(X1, X2, . . . ) = ∫_{R+} ( ∏_{n=1}^∞ N_σ(Xn) ) dν_{R+}(σ)

where N_σ denotes the N(0, σ) Gaussian distribution
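A small numerical check of the invariance (the mixing distribution for σ is an arbitrary illustrative choice): rotating a Gaussian vector leaves its Euclidean norm unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
sigma = float(rng.gamma(2.0))           # one draw of the mixing parameter sigma
x = rng.normal(0.0, sigma, size=n)      # i.i.d. N(0, sigma) given sigma
# a random rotation R in O(n) from the QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(n, n)))
y = R @ x
# rotation preserves the Euclidean norm, hence the finite-dimensional laws
assert np.isclose(np.linalg.norm(x), np.linalg.norm(y))
```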
Peter Orbanz 14 / 25
TWO INTERPRETATIONS
As special case of de Finetti
I Rotatable⇒ exchangeable
I General de Finetti: Parameter space T = M(X )
I Rotation invariance: T shrinks to {N_σ | σ ∈ R+}
As invariance under different symmetry
I Exchangeability = invariance of P(X1,X2, ...) under group action
I Freedman: Different group (O(n) rather than S∞)
I In these cases: symmetry⇒ decomposition theorem
Peter Orbanz 15 / 25
NON-EXCHANGEABLE DATA
EXCHANGEABILITY: RANDOM GRAPHS
Random graph with independent edges
Given: θ : [0, 1]² → [0, 1] symmetric function
I U1, U2, . . . ∼ Uniform[0, 1]
I Edge (i, j) present: (i, j) ∼ Bernoulli(θ(Ui, Uj))

Call this distribution Γ(G ∈ . |θ).

[Figure: the function θ on [0, 1]²; its value at (U1, U2) is the probability of edge {1, 2}.]

Theorem (Aldous; Hoover)
A random (dense) graph G is exchangeable iff

P(G ∈ . ) = ∫_T Γ(G ∈ . |θ) Q(dθ)

[Figure: a sample graph on vertices 1, . . . , 9.]
[Ald81, Hoo79]Peter Orbanz 17 / 25
DE FINETTI: GEOMETRY
Finite case
P = ∑_{ei∈E} νi ei

I E = {e1, e2, e3}
I (ν1, ν2, ν3) barycentric coordinates

[Figure: a triangle (simplex) with extreme points e1, e2, e3; P lies inside with barycentric coordinates ν1, ν2, ν3.]
Infinite/continuous case
P( . ) = ∫_E e( . ) dν(e) = ∫_T k(θ, . ) dν_T(θ)

I k : T → E ⊂ M(X) probability kernel (= conditional probability)
I k is a random measure with values k(θ, . ) ∈ E
I de Finetti: k(θ, . ) = ∏_{n∈N} Q( . |θ) and T = M(X)
Peter Orbanz 18 / 25
DECOMPOSITION BY SYMMETRY
Theorem (Varadarajan)
I G nice group acting on a space Y
I Call a measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A.
I E := ergodic probability measures

Then there is a Markov kernel k : Y → E s.t.:

P ∈ M(Y) G-invariant ⇔ P(A) = ∫_T k(θ, A) dν(θ)
de Finetti
I G = S∞ and Y = X∞
I G-invariant sets = exchangeable events
I E = factorial distributions (“Hewitt-Savage 0-1 law”)
[Var63]Peter Orbanz 19 / 25
SYMMETRY AND SUFFICIENCY
SUFFICIENT STATISTICS
Problem
Apparently no direct connection with standard models
Sufficient statistic
Functions Sn of the data are sufficient if:
I Intuitively: Sn(X1, . . . , Xn) contains all information the sample provides on the parameter
I Formally:

Pn(X1, . . . , Xn|Θ, Sn) = Pn(X1, . . . , Xn|Sn) for all n
Sufficiency and symmetry
I P exchangeable ⇔ Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n δ_{xi} sufficient
I P rotatable ⇔ Sn(x1, . . . , xn) = √(∑_{i=1}^n xi²) = ‖(x1, . . . , xn)‖2 sufficient
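The first equivalence is easy to sanity-check numerically: the empirical measure (here, value counts over a finite alphabet; toy data, illustrative) does not change when the sample is permuted.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.integers(0, 3, size=20)          # a sample with values in {0, 1, 2}
perm = rng.permutation(20)
counts = np.bincount(x, minlength=3)     # empirical measure of the sample
counts_perm = np.bincount(x[perm], minlength=3)
# the empirical measure depends only on the multiset of values, not their order,
# matching "P exchangeable <=> empirical measure sufficient"
assert np.array_equal(counts, counts_perm)
```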
Peter Orbanz 21 / 25
DECOMPOSITION BY SUFFICIENCY
Theorem (Diaconis and Freedman; Lauritzen; several others)
Given: sufficient statistic Sn for each n;
kn( . , sn) = conditional probability of X1, . . . , Xn given sn
1. kn converges to a limit function:
kn( . , Sn(X1(ω), . . . , Xn(ω))) → k∞( . , ω) as n → ∞
2. P(X1,X2, . . . ) has the decomposition
P( . ) = ∫ k∞( . , ω) dν(ω)
3. The model P ⊂ M(X ) is a convex set with extreme points k∞( . , ω)
4. The measure ν is uniquely determined by P
(Theorem statement omits technical conditions.)
Peter Orbanz 22 / 25
EXAMPLES
de Finetti’s theorem
P exchangeable ⇔ Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n δ_{xi} sufficient
Rotation invariance
P rotatable ⇔ Sn(x1, . . . , xn) = ‖(x1, . . . , xn)‖2 sufficient
Kingman’s theorem
Π exchangeable ⇔ asymptotic block sizes are sufficient statistic
Exponential families (Küchler and Lauritzen)
Choose X = R∞. Under suitable regularity conditions: Sn is additive, i.e.

Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n S0(xi)

if and only if the ergodic measures form an exponential family.

[KL89]
Peter Orbanz 23 / 25
SUMMARY
Non-exchangeable data
I Identify invariance principle and its ergodic measures
I Ergodic measures↔ generalize i.i.d. distributions↔ likelihood
I Prior = distribution on ergodic measures
Random structure                 | Theorem of                  | Mixtures of...
Exchangeable sequences           | de Finetti; Hewitt & Savage | product distributions
Processes with exch. increments  | Bühlmann                    | Lévy processes
Exchangeable partitions          | Kingman                     | "paint-box" distributions
Exchangeable arrays              | Aldous; Hoover; Kallenberg  | sampling schemes on [0, 1]²
Block-exchangeable sequences     | Diaconis & Freedman         | Markov chains
Exchangeable R^d-sequences with  | Küchler & Lauritzen         | Exponential families
  additive sufficient statistics |                             |
Peter Orbanz 24 / 25
REFERENCES I
[Ald81] David J. Aldous. Representations for partially exchangeable arrays of random variables. J. Multivariate Anal., 11(4):581–598, 1981.
[B60] H. Bühlmann. Austauschbare stochastische Variabeln und ihre Grenzwertsätze. PhD thesis, University of California Press, 1960.

[DF80] P. Diaconis and D. Freedman. Finite exchangeable sequences. The Annals of Probability, 8(4):745–764, 1980.

[Gho10] S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.

[Hoo79] D. N. Hoover. Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study, Princeton, 1979.
[Kal01] O. Kallenberg. Foundations of Modern Probability. Springer, 2nd edition, 2001.
[Kin78] J. F. C. Kingman. The representation of partition structures. J. London Math. Soc., 2(18):374–380, 1978.
[KL89] U. Küchler and S. L. Lauritzen. Exponential families, extreme point models and minimal space-time invariant functions for stochastic processes with stationary and independent increments. Scand. J. Stat., 16:237–261, 1989.

[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837–877, 2006.
[Sch95] M. J. Schervish. Theory of Statistics. Springer, 1995.
[Var63] V. S. Varadarajan. Groups of automorphisms of Borel spaces. Transactions of the American Mathematical Society, 109(2):191–220, 1963.
Peter Orbanz 25 / 25