Exchangeability
Peter Orbanz
Columbia University
PARAMETERS AND PATTERNS
Parameters
P(X|θ) = Probability[data|pattern]
[Figure: GP regression example reproduced from C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006 (www.GaussianProcess.org/gpml), Fig. 2.5. (a) Data generated from a GP with SE-kernel hyperparameters (ℓ, σf, σn) = (1, 1, 0.1), shown with the 95% confidence region for the underlying function f obtained by GP prediction with these hyperparameters. (b), (c) The 95% confidence regions for hyperparameter values (0.3, 1.08, 0.00005) and (3.0, 1.16, 0.89) respectively.]
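The effect of the length-scale in the figure can be sketched numerically. A minimal NumPy example (not from the slides; grid and seed are illustrative) draws GP prior sample paths under the SE kernel for the three length-scales shown:

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0, sf=1.0):
    """Squared-exponential covariance k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))."""
    d = x1[:, None] - x2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)
samples = {}
for ell in (0.3, 1.0, 3.0):
    K = se_kernel(x, x, ell=ell)
    # small jitter keeps the covariance matrix positive definite for sampling
    samples[ell] = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
# smaller ell -> wigglier sample paths; larger ell -> smoother ones
```

Plotting the three entries of `samples` reproduces the qualitative behavior of panels (a)-(c).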
Inference idea
data = underlying pattern + independent noise
Peter Orbanz 2 / 25
TERMINOLOGY
Parametric model
I Number of parameters fixed (or constantly bounded) w.r.t. sample size
Nonparametric model
I Number of parameters grows with sample size
I ∞-dimensional parameter space
Example: Density estimation

Parametric
[Figure: samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ; ellipses show lines of equal probability density (Duda, Hart & Stork, Pattern Classification, Fig. 2.9).]

Nonparametric
[Figure: two-dimensional circularly symmetric normal Parzen windows φ(x/h) for h = 1, .5, .2, and the Parzen-window density estimates they produce from the same set of five samples (Duda, Hart & Stork, Pattern Classification, Figs. 4.3-4.4).]

Peter Orbanz 3 / 25
NONPARAMETRIC BAYESIAN MODEL
Definition
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Interpretation
Parameter space T = set of possible patterns. Recall previous tutorials:

Model             | T                | Application
Gaussian process  | Smooth functions | Regression problems
DP mixtures       | Smooth densities | Density estimation
CRP, 2-param. CRP | Partitions       | Clustering
Solution to Bayesian problem = posterior distribution on patterns
[Sch95]Peter Orbanz 4 / 25
DE FINETTI’S THEOREM
Infinite exchangeability
For all π ∈ S∞ (= infinite symmetric group):

P(X1, X2, . . . ) = P(Xπ(1), Xπ(2), . . . ) or π(P) = P

Theorem (de Finetti)

P exchangeable ⇔ P(X1, X2, . . . ) = ∫_M(X) ( ∏_{n=1}^∞ Q(Xn) ) dν(Q)

I Q is a random measure
I ν uniquely determined by P
Peter Orbanz 5 / 25
FINITE EXCHANGEABILITY
Finite sequence X1, . . . , Xn
Exchangeability of a finite sequence ⇏ de Finetti representation
Example: Two exchangeable random bits
         X1 = 0   X1 = 1
X2 = 0     0       1/2
X2 = 1    1/2       0

Suppose de Finetti holds; then

0 = P(X1 = X2 = 1) = ∫_[0,1] p² dν(p)       ⇒ ν{p = 0} = 1
0 = P(X1 = X2 = 0) = ∫_[0,1] (1 − p)² dν(p) ⇒ ν{p = 1} = 1

a contradiction, so no de Finetti representation exists.

Intuition
Finite exchangeability does not eliminate sequential patterns.
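A quick numeric check of the argument (illustrative, not from the slides): under any i.i.d. Bernoulli(p) mixture, P(X1 = X2 = 1) + P(X1 = X2 = 0) = E[p² + (1 − p)²] ≥ 1/2, while the table above requires this sum to be 0:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 1001)
# under an i.i.d. Bernoulli(p) source: P(X1 = X2 = 1) = p^2, P(X1 = X2 = 0) = (1 - p)^2
same = p**2 + (1 - p) ** 2
# the sum is at least 1/2 for every p (minimum at p = 1/2),
# so no mixing measure nu can make it average to 0
assert same.min() >= 0.5 - 1e-12
```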
[DF80]Peter Orbanz 6 / 25
SUPPORT OF PRIORS
[Figure: the model {Pθ | θ ∈ T} as a subset of M(X). If the true distribution P0 = Pθ0 lies in the model, the model is well-specified; if P0 lies outside the model, it is misspecified.]
[Gho10, KvdV06]Peter Orbanz 7 / 25
SUPPORT OF NONPARAMETRIC PRIORS
Large support
I Support of nonparametric priors is larger (∞-dimensional) than that of parametric priors (finite-dimensional).
I However: No uniform prior (or even “neutral” improper prior) exists on M(X ).
Interpretation of nonparametric prior assumptions
Concentration of a nonparametric prior on a subset of M(X) typically represents a structural prior assumption.
I GP regression with unknown bandwidth:
  I Any continuous function possible
  I Prior can express e.g. "very smooth functions are more probable"
I Clustering: Expected number of clusters is...
  I ...small −→ CRP prior
  I ...power law −→ two-parameter CRP
Peter Orbanz 8 / 25
PARAMETERIZED MODELS
Probability model

[Diagram: probability space Ω with measure P; a random variable X : Ω → X induces the law P(X) = X[P]; the parameter Θ(ω) is a further variable with values in T.]

Parameterized model P[X|Θ]

[Diagram: Ω → X∞ (the sequence X∞), X∞ → M(X) ⊃ P (the limiting empirical measure F), P → T (the parameterization T).]

I P = {P[X|θ] | θ ∈ T}
I F ≡ law of large numbers (maps a sequence to its limiting empirical measure)
I T : P[ . |Θ = θ] ↦ θ bijection
I Θ := T ◦ F ◦ X∞
[Sch95]Peter Orbanz 9 / 25
JUSTIFICATION: BY EXCHANGEABILITY
Again: de Finetti
P(X1, X2, . . . ) = ∫_M(X) ( ∏_{n=1}^∞ Q(Xn) ) dν(Q) = ∫_T ( ∏_{n=1}^∞ Q(Xn|Θ = θ) ) dν_T(θ)
I Θ random measure (since Θ(ω) ∈ M(X ))
Convergence results
The de Finetti theorem comes with a convergence result attached:
I Empirical measure: Fn → Θ weakly as n → ∞
I Posterior Λn(Θ|X1, . . . , Xn) = Λn( . , ω) in M(T) exists
I Posterior convergence: Λn( . , ω) → δ_{Θ(ω)} as n → ∞
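These convergence statements can be illustrated in the simplest exchangeable setting, binary sequences with a Beta mixing measure (an illustrative special case, not part of the slides): the empirical measure approaches the directing parameter and the posterior concentrates around it.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7                                   # directing "pattern": Bernoulli(theta)
x = (rng.random(5000) < theta).astype(int)    # data, i.i.d. given theta
a, b = 1.0, 1.0                               # Beta(1, 1) prior on theta
n, k = len(x), int(x.sum())
post_mean = (a + k) / (a + b + n)             # Beta(a + k, b + n - k) posterior mean
post_sd = np.sqrt(post_mean * (1 - post_mean) / (a + b + n + 1))
# empirical measure F_n is close to theta, posterior is nearly a point mass at theta
assert abs(x.mean() - theta) < 0.05
assert post_sd < 0.01
```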
[Kal01]Peter Orbanz 10 / 25
SPECIAL TYPES OF EXCHANGEABLE DATA
MODIFICATIONS
Pólya Urns
P(Xn+1|X1 = x1, . . . , Xn = xn) = 1/(α + n) ∑_{j=1}^n δ_{xj}(Xn+1) + α/(α + n) G0(Xn+1)

Exchangeable:
I ν is DP(α, G0)
I ∏_{n=1}^∞ Q(Xn|θ) = ∏_{n=1}^∞ θ(Xn) = ∏_{n=1}^∞ ( ∑_{j=1}^∞ cj δ_{tj}(Xn) )
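The urn scheme above can be simulated directly; a minimal sketch (function name, base measure, and parameters are illustrative choices):

```python
import numpy as np

def polya_urn(n, alpha, base_draw, rng):
    """Sequentially sample X_1, ..., X_n: with prob. alpha/(alpha + i) draw a fresh
    value from G0, otherwise copy one of the i previous values uniformly at random."""
    xs = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            xs.append(base_draw(rng))            # new value from the base measure G0
        else:
            xs.append(xs[rng.integers(i)])       # reuse an earlier value
    return xs

rng = np.random.default_rng(2)
xs = polya_urn(1000, alpha=2.0, base_draw=lambda r: float(r.normal()), rng=rng)
# values repeat: the number of distinct values grows only like alpha * log(n)
```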
Exchangeable increment processes (H. Bühlmann)
Stationary, exchangeable increment process = mixture of Lévy processes

P((Xt)_{t∈R+}) = ∫ L_{α,γ,µ}((Xt)_{t∈R+}) dν(α, γ, µ)

L_{α,γ,µ} = Lévy process with jump measure µ
[B60, Kal01]Peter Orbanz 12 / 25
MODIFICATION 2: RANDOM PARTITIONS
Random partition of N
Π = {B1, B2, . . .}, e.g. {{1, 3, 5, . . .}, {2, 4, 10, . . .}}
Paint-box distribution
I Weights s1, s2, . . . ≥ 0 with ∑_j sj ≤ 1
I U1, U2, . . . ∼ Uniform[0, 1]

[Figure: the unit interval cut into subintervals of lengths s1, s2, . . . plus a leftover interval of length 1 − ∑_j sj; the Ui fall uniformly onto it.]

Sampling Π ∼ β[ . |s]:
I i, j ∈ N in same block ⇔ Ui, Uj in same interval
I i in a singleton block ⇔ Ui in the leftover interval of length 1 − ∑_j sj
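The sampling rule can be sketched as follows (the interval bookkeeping uses cumulative sums; the particular weights are illustrative):

```python
import numpy as np

def paintbox(n, s, rng):
    """Partition {0, ..., n-1}: drop U_i ~ Uniform[0,1]; i, j share a block iff
    their U's land in the same s-interval; U's in the leftover mass 1 - sum(s)
    give singleton blocks."""
    edges = np.concatenate(([0.0], np.cumsum(s)))    # interval boundaries
    u = rng.random(n)
    blocks, singletons = {}, []
    for i, ui in enumerate(u):
        j = int(np.searchsorted(edges, ui, side="right")) - 1
        if j >= len(s):                  # fell past the last s-interval: own block
            singletons.append({i})
        else:
            blocks.setdefault(j, set()).add(i)
    return list(blocks.values()) + singletons

rng = np.random.default_rng(3)
part = paintbox(20, s=[0.5, 0.3], rng=rng)   # leftover mass 1 - 0.8 = 0.2
```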
Theorem (Kingman)
Π exchangeable ⇔ P(Π ∈ . ) = ∫ β[Π ∈ . |s] Q(ds)
[Kin78]Peter Orbanz 13 / 25
ROTATION INVARIANCE
Rotatable sequence
Pn(X1, . . . ,Xn) = Pn(Rn(X1, . . . ,Xn)) for all Rn ∈ O(n)
Infinite case
X1,X2, . . . rotatable :⇔ X1, . . . ,Xn rotatable for all n
Theorem (Freedman)
An infinite sequence is rotatable iff

P(X1, X2, . . . ) = ∫_{R+} ( ∏_{n=1}^∞ N_σ(Xn) ) dν_{R+}(σ)

where N_σ denotes the N(0, σ) Gaussian distribution
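A small numerical check of the invariance (the mixing distribution for σ is an arbitrary illustrative choice): rotating a Gaussian vector leaves its Euclidean norm unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
sigma = float(rng.gamma(2.0))           # one draw of the mixing parameter sigma
x = rng.normal(0.0, sigma, size=n)      # i.i.d. N(0, sigma) given sigma
# a random rotation R in O(n) from the QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(n, n)))
y = R @ x
# rotation preserves the Euclidean norm, hence the finite-dimensional laws
assert np.isclose(np.linalg.norm(x), np.linalg.norm(y))
```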
Peter Orbanz 14 / 25
TWO INTERPRETATIONS
As special case of de Finetti
I Rotatable⇒ exchangeable
I General de Finetti: Parameter space T = M(X )
I Rotation invariance: T shrinks to {N_σ | σ ∈ R+}
As invariance under different symmetry
I Exchangeability = invariance of P(X1,X2, ...) under group action
I Freedman: Different group (O(n) rather than S∞)
I In these cases: symmetry⇒ decomposition theorem
Peter Orbanz 15 / 25
NON-EXCHANGEABLE DATA
EXCHANGEABILITY: RANDOM GRAPHS
Random graph with independent edges
Given: θ : [0, 1]² → [0, 1] symmetric function
I U1, U2, . . . ∼ Uniform[0, 1]
I Edge (i, j) present: (i, j) ∼ Bernoulli(θ(Ui, Uj))

Call this distribution Γ(G ∈ . |θ).

[Figure: the function θ on [0, 1]²; its value at (U1, U2) is the probability of edge {1, 2}.]

Theorem (Aldous; Hoover)
A random (dense) graph G is exchangeable iff

P(G ∈ . ) = ∫_T Γ(G ∈ . |θ) Q(dθ)

[Figure: a sample graph on vertices 1, . . . , 9.]
[Ald81, Hoo79]Peter Orbanz 17 / 25
DE FINETTI: GEOMETRY
Finite case
P = ∑_{ei∈E} νi ei

I E = {e1, e2, e3}
I (ν1, ν2, ν3) barycentric coordinates

[Figure: a triangle (simplex) with extreme points e1, e2, e3; P lies inside with barycentric coordinates ν1, ν2, ν3.]
Infinite/continuous case
P( . ) = ∫_E e( . ) dν(e) = ∫_T k(θ, . ) dν_T(θ)

I k : T → E ⊂ M(X) probability kernel (= conditional probability)
I k is a random measure with values k(θ, . ) ∈ E
I de Finetti: k(θ, . ) = ∏_{n∈N} Q( . |θ) and T = M(X)
Peter Orbanz 18 / 25
DECOMPOSITION BY SYMMETRY
Theorem (Varadarajan)
I G nice group acting on a space Y
I Call a measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A.
I E := ergodic probability measures

Then there is a Markov kernel k : Y → E s.t.:

P ∈ M(Y) G-invariant ⇔ P(A) = ∫_T k(θ, A) dν(θ)
de Finetti
I G = S∞ and Y = X∞
I G-invariant sets = exchangeable events
I E = factorial distributions (“Hewitt-Savage 0-1 law”)
[Var63]Peter Orbanz 19 / 25
SYMMETRY AND SUFFICIENCY
SUFFICIENT STATISTICS
Problem
Apparently no direct connection with standard models
Sufficient statistic
Functions Sn of the data are sufficient if:
I Intuitively: Sn(X1, . . . , Xn) contains all information the sample provides on the parameter
I Formally:

Pn(X1, . . . , Xn|Θ, Sn) = Pn(X1, . . . , Xn|Sn) for all n
Sufficiency and symmetry
I P exchangeable ⇔ Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n δ_{xi} sufficient
I P rotatable ⇔ Sn(x1, . . . , xn) = √(∑_{i=1}^n xi²) = ‖(x1, . . . , xn)‖2 sufficient
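The first equivalence is easy to sanity-check numerically: the empirical measure (here, value counts over a finite alphabet; toy data, illustrative) does not change when the sample is permuted.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.integers(0, 3, size=20)          # a sample with values in {0, 1, 2}
perm = rng.permutation(20)
counts = np.bincount(x, minlength=3)     # empirical measure of the sample
counts_perm = np.bincount(x[perm], minlength=3)
# the empirical measure depends only on the multiset of values, not their order,
# matching "P exchangeable <=> empirical measure sufficient"
assert np.array_equal(counts, counts_perm)
```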
Peter Orbanz 21 / 25
DECOMPOSITION BY SUFFICIENCY
Theorem (Diaconis and Freedman; Lauritzen; several others)
Given: sufficient statistic Sn for each n;
kn( . , sn) = conditional probability of X1, . . . , Xn given sn
1. kn converges to a limit function:
kn( . , Sn(X1(ω), . . . , Xn(ω))) → k∞( . , ω) as n → ∞
2. P(X1,X2, . . . ) has the decomposition
P( . ) = ∫ k∞( . , ω) dν(ω)
3. The model P ⊂ M(X ) is a convex set with extreme points k∞( . , ω)
4. The measure ν is uniquely determined by P
(Theorem statement omits technical conditions.)
Peter Orbanz 22 / 25
EXAMPLES
de Finetti’s theorem
P exchangeable ⇔ Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n δ_{xi} sufficient
Rotation invariance
P rotatable ⇔ Sn(x1, . . . , xn) = ‖(x1, . . . , xn)‖2 sufficient
Kingman’s theorem
Π exchangeable ⇔ asymptotic block sizes are sufficient statistic
Exponential families (Küchler and Lauritzen)
Choose X = R∞. Under suitable regularity conditions: Sn is additive, i.e.

Sn(x1, . . . , xn) = 1/n ∑_{i=1}^n S0(xi)

if and only if the ergodic measures form an exponential family.

[KL89]
Peter Orbanz 23 / 25
SUMMARY
Non-exchangeable data
I Identify invariance principle and its ergodic measures
I Ergodic measures↔ generalize i.i.d. distributions↔ likelihood
I Prior = distribution on ergodic measures
Random structure                 | Theorem of                  | Mixtures of...
Exchangeable sequences           | de Finetti; Hewitt & Savage | product distributions
Processes with exch. increments  | Bühlmann                    | Lévy processes
Exchangeable partitions          | Kingman                     | "paint-box" distributions
Exchangeable arrays              | Aldous; Hoover; Kallenberg  | sampling schemes on [0, 1]²
Block-exchangeable sequences     | Diaconis & Freedman         | Markov chains
Exchangeable R^d-sequences with  | Küchler & Lauritzen         | Exponential families
  additive sufficient statistics |                             |
Peter Orbanz 24 / 25
REFERENCES I
[Ald81] David J. Aldous. Representations for partially exchangeable arrays of random variables. J. Multivariate Anal., 11(4):581–598, 1981.
[B60] H. Bühlmann. Austauschbare stochastische Variabeln und ihre Grenzwertsätze. PhD thesis, University of California Press, 1960.

[DF80] P. Diaconis and D. Freedman. Finite exchangeable sequences. The Annals of Probability, 8(4):745–764, 1980.

[Gho10] S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.

[Hoo79] D. N. Hoover. Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study, Princeton, 1979.
[Kal01] O. Kallenberg. Foundations of Modern Probability. Springer, 2nd edition, 2001.
[Kin78] J. F. C. Kingman. The representation of partition structures. J. London Math. Soc., 2(18):374–380, 1978.
[KL89] U. Küchler and S. L. Lauritzen. Exponential families, extreme point models and minimal space-time invariant functions for stochastic processes with stationary and independent increments. Scand. J. Stat., 16:237–261, 1989.

[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837–877, 2006.
[Sch95] M. J. Schervish. Theory of Statistics. Springer, 1995.
[Var63] V. S. Varadarajan. Groups of automorphisms of Borel spaces. Transactions of the American Mathematical Society, 109(2):191–220, 1963.
Peter Orbanz 25 / 25