10/22/2018
CS 3750 Advanced Machine Learning
Gaussian Processes: classification
Jinpeng Zhou
jiz150@pitt.edu
Gaussian Processes (GP)
A GP is a collection of random variables $f(\mathbf{x})$ such that any finite set $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$ has a joint Gaussian distribution.
Observations: $\mathbf{y} = f(\mathbf{x}) + \varepsilon$
[Figure: sample functions $f(\mathbf{x})$ drawn from a GP prior, with inputs labeled $x_i$.]
Weight-Space View
$y = f(\mathbf{x}) + \varepsilon = \mathbf{x}^T \mathbf{w} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$
$\Rightarrow \mathbf{y} \sim \mathcal{N}(X^T \mathbf{w}, \sigma^2 I)$
Assume $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$, then:
$p(\mathbf{w} \mid X, \mathbf{y}) = \dfrac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)} \sim \mathcal{N}\!\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$
where $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_p^{-1}$
Posterior on weights is Gaussian.
Weight-Space View
$p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}\!\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \quad A = \tfrac{1}{\sigma^2} X X^T + \Sigma_p^{-1}$
Thus (a similar result holds when using basis functions, $f(\mathbf{x}) = \phi(\mathbf{x})^T \mathbf{w}$):
$p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w}$
$= \mathcal{N}\!\left(\tfrac{1}{\sigma^2} \mathbf{x}_*^T A^{-1} X \mathbf{y},\; \mathbf{x}_*^T A^{-1} \mathbf{x}_*\right) = \mathcal{N}(\mu, \Sigma)$
• $f_*$ has a Gaussian distribution, and its $\Sigma$ does not depend on $\mathbf{y}$.
• $\mathbf{x}$ always appears as a scalar product $\mathbf{x}^T \mathbf{x}'$, which enables the kernel trick.
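To make the weight-space formulas concrete, here is a minimal numpy sketch (toy 1-D data with a bias feature; all names and hyperparameters are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data with a bias feature; columns of X are the inputs x_i.
n, sigma = 20, 0.1
X = np.vstack([rng.uniform(-1, 1, n), np.ones(n)])   # 2 x n design matrix
w_true = np.array([1.5, -0.5])
y = X.T @ w_true + sigma * rng.normal(size=n)

Sigma_p = np.eye(2)                                  # prior covariance of w

# Posterior: p(w | X, y) = N( (1/sigma^2) A^{-1} X y, A^{-1} ),
# with A = (1/sigma^2) X X^T + Sigma_p^{-1}
A = X @ X.T / sigma ** 2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma ** 2

# Predictive distribution at a test input x*:
x_star = np.array([0.3, 1.0])
f_mean = x_star @ w_mean          # (1/sigma^2) x*^T A^{-1} X y
f_var = x_star @ A_inv @ x_star   # x*^T A^{-1} x*  (does not depend on y)
print(w_mean, f_mean, f_var)
```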
Function-space View
Given:
  mean function $m(\mathbf{x})$
  covariance function $k(\mathbf{x}, \mathbf{x}')$
Define a GP as:
$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$
such that for each $\mathbf{x}_i$:
$f(\mathbf{x}_i) \sim \mathcal{N}(m(\mathbf{x}_i), k(\mathbf{x}_i, \mathbf{x}_i))$
$\mathrm{cov}(f(\mathbf{x}_i), f(\mathbf{x}_j)) = k(\mathbf{x}_i, \mathbf{x}_j)$
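A short sketch of this definition in action, assuming a squared exponential covariance (an illustrative choice): any finite grid of inputs yields a joint Gaussian that can be sampled directly.

```python
import numpy as np

def se_kernel(x1, x2, length_scale=0.3):
    # Squared exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * length_scale ** 2))

x = np.linspace(0, 1, 100)              # any finite set of inputs
K = se_kernel(x, x)

# f(x_1), ..., f(x_n) are jointly Gaussian; sample three functions from the
# prior with mean function m(x) = 0 (jitter keeps K numerically positive definite).
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros_like(x), K + 1e-10 * np.eye(len(x)), size=3)
print(samples.shape)                    # (3, 100)
```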
Function-space View
Given a joint Gaussian distribution:
$\begin{pmatrix} \mathbf{x}_A \\ \mathbf{x}_B \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} \right)$
the conditional densities are also Gaussian:
$\mathbf{x}_A \mid \mathbf{x}_B \sim \mathcal{N}\!\left( \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(\mathbf{x}_B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} \right)$
$\mathbf{x}_B \mid \mathbf{x}_A \sim \mathcal{N}(\ldots, \ldots)$
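A quick numerical illustration of the conditioning formula (a sketch; the partition and numbers are arbitrary):

```python
import numpy as np

# Joint Gaussian over (x_A, x_B) with 1-D A and B blocks.
mu = np.array([0.0, 1.0])                 # (mu_A, mu_B)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])            # [[S_AA, S_AB], [S_BA, S_BB]]

x_B = 2.0                                 # observed value of x_B

# x_A | x_B ~ N(mu_A + S_AB S_BB^{-1} (x_B - mu_B), S_AA - S_AB S_BB^{-1} S_BA)
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x_B - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(cond_mean, cond_var)                # 0.3, 0.82
```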
Function-space View
Consider training and test points:
$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} m(X) \\ m(X_*) \end{pmatrix}, \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)$
$K$ is the covariance matrix.
With $y = f(\mathbf{x}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, thus:
$\begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix} \Big|\; X, X_* \sim \mathcal{N}\!\left( \begin{pmatrix} m(X) \\ m(X_*) \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{pmatrix} \right)$
$\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\ldots, \ldots)$
GP for Regression (GPR)
Squares are observed, circles are latent.
The thick bar is like a "bus" connecting all nodes.
Each $y_i$ is conditionally independent of all other nodes given $f_i$.
GPR
[Graphical model: inputs $x_1, \ldots, x_i, \ldots, x_n$ and $x_*$; latent values $f_1, \ldots, f_i, \ldots, f_n$ and $f_*$; observed outputs $y_1, \ldots, y_i, \ldots, y_n$ and prediction $y_*$.]
GPR
$\mathbf{y} = \mathbf{f} + \varepsilon$, where $\mathbf{f} \sim \mathcal{GP}(m = 0, k)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$:
$\begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix} \sim \mathcal{N}\!\left( \mathbf{0}, \begin{pmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{pmatrix} \right)$
Prediction: $\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\ldots, \ldots) = \mathcal{N}(\mu_*, \Sigma_*)$
where:
$\mu_* = K(X_*, X)\left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$
$\Sigma_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X)\left( K(X, X) + \sigma^2 I \right)^{-1} K(X, X_*)$
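The prediction equations translate directly into a few lines of numpy. A minimal sketch (the squared exponential kernel and all hyperparameters are illustrative assumptions):

```python
import numpy as np

def se_kernel(x1, x2, length_scale=0.3):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * length_scale ** 2))

rng = np.random.default_rng(2)
sigma = 0.1
X = rng.uniform(0, 1, 15)                       # training inputs
y = np.sin(2 * np.pi * X) + sigma * rng.normal(size=15)
X_star = np.linspace(0, 1, 50)                  # test inputs

Ky = se_kernel(X, X) + sigma ** 2 * np.eye(len(X))    # K(X,X) + sigma^2 I
Ks = se_kernel(X_star, X)                             # K(X*, X)
Ky_inv = np.linalg.inv(Ky)

# mu* = K(X*,X) (K(X,X) + sigma^2 I)^{-1} y
mu_star = Ks @ Ky_inv @ y
# Sigma* = K(X*,X*) + sigma^2 I - K(X*,X) (K(X,X) + sigma^2 I)^{-1} K(X,X*)
Sigma_star = (se_kernel(X_star, X_star) + sigma ** 2 * np.eye(len(X_star))
              - Ks @ Ky_inv @ Ks.T)
print(mu_star.shape, Sigma_star.shape)          # (50,), (50, 50)
```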
GP Classification (GPC)
Map the output into $[0, 1]$ by using a response function:
• Sigmoid function (logistic): $\sigma(x) = \left(1 + e^{-x}\right)^{-1}$
• Cumulative normal (probit): $\Phi(x) = \int_{-\infty}^{x} \mathcal{N}(t \mid 0, 1)\, dt$
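Both response functions are one-liners; the probit can be written with the error function via $\Phi(x) = (1 + \mathrm{erf}(x/\sqrt{2}))/2$ (a sketch):

```python
import math

def logistic(x):
    # Sigmoid: (1 + e^{-x})^{-1}
    return 1.0 / (1.0 + math.exp(-x))

def probit(x):
    # Cumulative normal: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(logistic(0.0), probit(0.0))   # both map 0 to 0.5
```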
GPC examples
[Figure: squared exponential kernel with hyperparameter length-scale = 0.1; logistic response function.]
GPC
[Diagram: $y = f(x)$ gives a GP over $f$; $y = \sigma(f(x))$ keeps the GP over $f$ but gives a non-Gaussian distribution over $y$.]
GPC
The target $\mathbf{y} = \sigma(f(\mathbf{x}))$ is non-Gaussian.
E.g., it is a Bernoulli for classes 0, 1:
$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \sigma(f(\mathbf{x}_i))^{y_i} \left(1 - \sigma(f(\mathbf{x}_i))\right)^{1 - y_i}$
How to predict $p(\mathbf{y}_* \mid \mathbf{y}, X, X_*)$?
GPC
Assume a single test case $\mathbf{x}_*$; predict its class, 0 or 1.
Assume a sigmoid response function and a GP prior over $f$.
Thus, try to involve $f_*$:
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(y_* = 1, f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$= \int p(y_* = 1 \mid f_*, \mathbf{y}, X, \mathbf{x}_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$= \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
Note that $\sigma(f_*)$ is a sigmoid function.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian, the convolution of a sigmoid and a Gaussian can be computed as:
$\int \sigma(t)\, \mathcal{N}(t \mid \mu, \sigma^2)\, dt \approx \sigma\!\left( \mu \left(1 + \pi \sigma^2 / 8\right)^{-1/2} \right)$
Otherwise, it is analytically intractable!
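A quick numerical check of this sigmoid-Gaussian approximation (a sketch using simple grid quadrature; $\mu$ and $\sigma^2$ are arbitrary):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

mu, var = 0.7, 2.0

# Left side: grid quadrature of  integral sigma(t) N(t | mu, var) dt
t = np.linspace(mu - 12 * np.sqrt(var), mu + 12 * np.sqrt(var), 20001)
gauss = np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
exact = np.sum(logistic(t) * gauss) * (t[1] - t[0])

# Right side: sigma( mu * (1 + pi * var / 8)^(-1/2) )
approx = logistic(mu / np.sqrt(1 + np.pi * var / 8))
print(exact, approx)   # the two values agree closely
```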
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(f_*, \mathbf{f} \mid \mathbf{y}, X, \mathbf{x}_*)\, d\mathbf{f}$
$= \int \frac{p(\mathbf{y} \mid f_*, \mathbf{f}, X, \mathbf{x}_*)\, p(f_*, \mathbf{f} \mid X, \mathbf{x}_*)}{p(\mathbf{y} \mid X, \mathbf{x}_*)}\, d\mathbf{f}$
$= \int \frac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, p(f_* \mid \mathbf{f}, X, \mathbf{x}_*)}{p(\mathbf{y} \mid X)}\, d\mathbf{f}$
$= \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Recall from GP regression: $\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)$
$\mu_* = K(X_*, X)\left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$
$\Sigma_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X)\left( K(X, X) + \sigma^2 I \right)^{-1} K(X, X_*)$
Replace $X_*$ with $\mathbf{x}_*$:
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f}) \sim \mathcal{N}\!\left( \mathbf{k}_*^T C_N^{-1} \mathbf{f},\; c - \mathbf{k}_*^T C_N^{-1} \mathbf{k}_* \right)$
where $\mathbf{k}_* = K(X, \mathbf{x}_*)$, $C_N = K(X, X) + \sigma^2 I$, and $c = k(\mathbf{x}_*, \mathbf{x}_*) + \sigma^2$.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now $p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian!
IF:
$p(\mathbf{f} \mid \mathbf{y}, X)$ is a Gaussian
THEN:
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian!
Is $p(\mathbf{f} \mid \mathbf{y}, X)$ a Gaussian?
$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}$
$p(\mathbf{f} \mid X) \sim \mathcal{N}(\mathbf{0}, K), \quad p(\mathbf{y} \mid \mathbf{f}, X) = \prod_{i=1}^{n} p(y_i \mid f_i)$
$\Rightarrow p(\mathbf{f} \mid \mathbf{y}, X) = \frac{\mathcal{N}(\mathbf{0}, K)}{p(\mathbf{y} \mid X)} \prod_{i=1}^{n} p(y_i \mid f_i)$
Note: $p(y_i \mid f_i)$ is non-Gaussian (e.g., Bernoulli)
$\Rightarrow p(\mathbf{f} \mid \mathbf{y}, X)$ is non-Gaussian.
Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian, the integral can be computed. So, try to approximate $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ as a Gaussian via:
• Laplace approximation
• Expectation propagation (EP)
• Variational approximation, Monte Carlo sampling, etc.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now $p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian!
IF:
$p(\mathbf{f} \mid \mathbf{y}, X)$ is approximated as a Gaussian
THEN:
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is approximated as a Gaussian!
Laplace Approximation
Goal: use a Gaussian to approximate $p(\mathbf{f})$.
Gaussian: $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
Idea:
• $\mu$ should match the mode: $\arg\max_{x} p(x)$
• Relation between $\exp\!\left( c\,(x - \mu)^2 \right)$ and $p(x)$
Laplace Approximation
• $\mu$ should match the mode: $\arg\max_{x} p(x)$
  $p'(\mu) = 0$, i.e., $\mu$ is placed at the mode
• Relation between $\exp\!\left( c\,(x - \mu)^2 \right)$ and $p(x)$:
  Taylor expansion of $\ln p(x)$ at $\mu$.
  Let $t(x) = \ln p(x)$; the Taylor expansion of $t(x)$ at $\mu$ is:
  $t(x) = t(\mu) + \frac{t'(\mu)}{1!}(x - \mu) + \frac{t''(\mu)}{2!}(x - \mu)^2 + \cdots$
Laplace Approximation
Taylor quadratic approximation:
$t(x) = t(\mu) + \frac{t'(\mu)}{1!}(x - \mu) + \frac{t''(\mu)}{2!}(x - \mu)^2 + \cdots$
$\approx t(\mu) + \frac{t'(\mu)}{1!}(x - \mu) + \frac{t''(\mu)}{2!}(x - \mu)^2$
Note that $t'(\mu) = \frac{p'(\mu)}{p(\mu)} = 0$, thus:
$t(x) \approx t(\mu) + \frac{t''(\mu)}{2!}(x - \mu)^2$
Laplace Approximation
$t(x) \approx t(\mu) + \frac{t''(\mu)}{2!}(x - \mu)^2 \qquad //\ t(x) = \ln p(x)$
$\Rightarrow \exp(t(x)) \approx \exp(t(\mu))\, \exp\!\left( \frac{t''(\mu)}{2!}(x - \mu)^2 \right)$
$\Rightarrow p(x) \approx p(\mu)\, \exp\!\left( -\frac{A (x - \mu)^2}{2} \right)$
where $A = -t''(\mu) = -\frac{d^2}{dx^2} \ln p(x) \Big|_{x = \mu}$
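A 1-D sketch of this recipe on an unnormalized Gamma density (an illustrative target: Newton's method finds the mode $\mu$, and $A = -t''(\mu)$ gives the variance):

```python
# Unnormalized target: Gamma(a=3, 1) density, p(x) proportional to x^(a-1) e^{-x},
# so t(x) = ln p(x) = (a - 1) ln x - x  (up to a constant).
a = 3.0

# Newton's method on t'(x) = (a - 1)/x - 1 to find the mode mu.
x = 1.0
for _ in range(50):
    t1 = (a - 1) / x - 1.0          # t'(x)
    t2 = -(a - 1) / x ** 2          # t''(x)
    x -= t1 / t2

mu = x                              # converges to a - 1 = 2
A = (a - 1) / mu ** 2               # A = -t''(mu) = 0.5
print("Laplace approx: N(mu=%.3f, var=%.3f)" % (mu, 1.0 / A))
```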
Laplace Approximation
$p(x) \approx p(\mu)\, \exp\!\left( -\frac{A (x - \mu)^2}{2} \right)$
Multivariate case:
$p(\mathbf{x}) \approx p(\boldsymbol{\mu})\, \exp\!\left( -\frac{(\mathbf{x} - \boldsymbol{\mu})^T A (\mathbf{x} - \boldsymbol{\mu})}{2} \right)$
Corresponding Gaussian approximation:
$p(\mathbf{x}) \approx \mathcal{N}(\boldsymbol{\mu}, A^{-1})$
where:
$\frac{d}{d\mathbf{x}} \ln p(\mathbf{x}) \Big|_{\mathbf{x} = \boldsymbol{\mu}} = 0$
$-\frac{d^2}{d\mathbf{x}^2} \ln p(\mathbf{x}) \Big|_{\mathbf{x} = \boldsymbol{\mu}} = A$
Laplace Approximation
Back to $p(\mathbf{f} \mid \mathbf{y}, X)$:
$\ln p(\mathbf{f} \mid \mathbf{y}, X) = \ln \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}$
$= \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X) - \ln p(\mathbf{y} \mid X)$
$\ln p(\mathbf{y} \mid X)$ is independent of $\mathbf{f}$, and we only need first and second derivatives with respect to $\mathbf{f}$.
Thus, define $\Psi(\mathbf{f})$ for the derivative calculation:
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
Note that:
o $p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \sigma(f(\mathbf{x}_i))^{y_i} \left(1 - \sigma(f(\mathbf{x}_i))\right)^{1 - y_i}$
  $= \prod_{i=1}^{n} \left(1 - \sigma(f(\mathbf{x}_i))\right) \left( \frac{\sigma(f(\mathbf{x}_i))}{1 - \sigma(f(\mathbf{x}_i))} \right)^{y_i}$
  $= \prod_{i=1}^{n} \frac{1}{1 + e^{f(\mathbf{x}_i)}}\, e^{f(\mathbf{x}_i)\, y_i}$
o $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, C_N)$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
o $p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \frac{1}{1 + e^{f(\mathbf{x}_i)}}\, e^{f(\mathbf{x}_i)\, y_i}$
o $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, C_N)$
Thus:
$\Psi(\mathbf{f}) = \mathbf{y}^T \mathbf{f} - \sum_{i=1}^{n} \ln\!\left(1 + e^{f(\mathbf{x}_i)}\right) - \frac{1}{2} \mathbf{f}^T C_N^{-1} \mathbf{f} - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln |C_N|$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
$= \mathbf{y}^T \mathbf{f} - \sum_{i=1}^{n} \ln\!\left(1 + e^{f(\mathbf{x}_i)}\right) - \frac{1}{2} \mathbf{f}^T C_N^{-1} \mathbf{f} - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln |C_N|$
$\nabla \Psi(\mathbf{f}) = \nabla \ln p(\mathbf{y} \mid \mathbf{f}) - C_N^{-1} \mathbf{f} = \mathbf{y} - \sigma(\mathbf{f}) - C_N^{-1} \mathbf{f}$
where $\sigma(\mathbf{f}) = \left( \sigma(f(\mathbf{x}_1)), \ldots, \sigma(f(\mathbf{x}_n)) \right)^T$
$\nabla\nabla \Psi(\mathbf{f}) = \nabla\nabla \ln p(\mathbf{y} \mid \mathbf{f}) - C_N^{-1} = -W_N - C_N^{-1}$
$W_N = \mathrm{diag}\!\left( \sigma(f(\mathbf{x}_1))\left(1 - \sigma(f(\mathbf{x}_1))\right), \ldots, \sigma(f(\mathbf{x}_n))\left(1 - \sigma(f(\mathbf{x}_n))\right) \right)$
Laplace Approximation
Corresponding Gaussian approximation:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx \mathcal{N}(\boldsymbol{\mu}, A^{-1})$
where:
$\nabla \Psi(\mathbf{f}) = \mathbf{y} - \sigma(\mathbf{f}) - C_N^{-1} \mathbf{f} = 0$
$\Rightarrow \mathbf{y} - \sigma(\mathbf{f}) = C_N^{-1} \mathbf{f} \quad \overset{?}{\Rightarrow} \boldsymbol{\mu}$
$-\nabla\nabla \Psi(\mathbf{f}) = W_N + C_N^{-1}$
$\Rightarrow A = W_N + C_N^{-1}$
Laplace Approximation
We cannot directly solve for $\mathbf{f}$ from $\mathbf{y} - \sigma(\mathbf{f}) = C_N^{-1} \mathbf{f}$ (it is nonlinear in $\mathbf{f}$).
Recall its motivation:
place $\mathbf{f}$ at the mode, i.e., maximize $p(\mathbf{f} \mid \mathbf{y}, X)$.
Thus, instead of maximizing directly, find the $\mathbf{f}^*$ that minimizes $-\ln p(\mathbf{f} \mid \mathbf{y}, X)$ by Newton iteration:
$\mathbf{f}_{\mathrm{new}} = \mathbf{f}_{\mathrm{old}} - (\nabla\nabla \Psi)^{-1} \nabla \Psi$
Finally, we get the Gaussian approximation:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx \mathcal{N}\!\left( \mathbf{f}^*, \left( W_N + C_N^{-1} \right)^{-1} \right)$
$\sigma(\mathbf{f}) = \left( \sigma(f(\mathbf{x}_1)), \ldots, \sigma(f(\mathbf{x}_n)) \right)^T$
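Putting the pieces together, a minimal sketch of this Newton iteration for the GPC mode on synthetic 1-D data (the kernel, jitter, and labels are illustrative assumptions):

```python
import numpy as np

def se_kernel(x1, x2, length_scale=0.5):
    # Squared exponential covariance on 1-D inputs
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * length_scale ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, 30)
y = (X > 0).astype(float)                 # binary labels in {0, 1}

C_N = se_kernel(X, X) + 1e-6 * np.eye(len(X))   # prior covariance of f (with jitter)
C_inv = np.linalg.inv(C_N)

# Newton iteration: f_new = f_old - (grad grad Psi)^{-1} grad Psi
f = np.zeros(len(X))
for _ in range(20):
    s = 1.0 / (1.0 + np.exp(-f))          # sigma(f), elementwise
    grad = y - s - C_inv @ f              # grad Psi(f)
    W = np.diag(s * (1.0 - s))            # W_N
    hess = -(W + C_inv)                   # grad grad Psi(f)
    f -= np.linalg.solve(hess, grad)

# Laplace approximation: p(f | y, X) ~= N(f_hat, (W_N + C_N^{-1})^{-1})
s = 1.0 / (1.0 + np.exp(-f))
A_inv = np.linalg.inv(np.diag(s * (1.0 - s)) + C_inv)
print(f[:5], A_inv.shape)
```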
Laplace Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now we have:
$p(\mathbf{f} \mid \mathbf{y}, X)$ is approximated as a Gaussian.
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.
Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim \mathcal{N}(\mu, \sigma^2)$.
Thus, $p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) \approx \sigma\!\left( \mu \left(1 + \pi \sigma^2 / 8\right)^{-1/2} \right)$
Expectation Propagation
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)} = \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
Normalization term:
$Z = p(\mathbf{y} \mid X) = \int p(\mathbf{f} \mid X) \prod_{i=1}^{n} p(y_i \mid f_i)\, d\mathbf{f}$
Expectation Propagation
$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
$p(y_i \mid f_i)$ is non-Gaussian due to the response function.
Try to find a Gaussian $t_i$ to approximate it:
$p(y_i \mid f_i) \approx t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i\, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$
$\Rightarrow \prod_{i=1}^{n} p(y_i \mid f_i) \approx \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma}) \prod_{i=1}^{n} \tilde{Z}_i$
where $\tilde{\Sigma} = \mathrm{diag}(\tilde{\sigma}_i^2)$.
Expectation Propagation
$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
$\approx \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$
$= \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K)\, \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma}) \prod_{i=1}^{n} \tilde{Z}_i$
Thus:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx q(\mathbf{f} \mid \mathbf{y}, X) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
where $\boldsymbol{\mu} = \Sigma \tilde{\Sigma}^{-1} \tilde{\boldsymbol{\mu}}$ and $\Sigma = \left( K^{-1} + \tilde{\Sigma}^{-1} \right)^{-1}$.
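A small sketch showing how the prior and the diagonal Gaussian sites collapse into the single Gaussian $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ above (the site parameters here are placeholder numbers, not EP-converged values):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
B = rng.normal(size=(n, n))
K = B @ B.T + n * np.eye(n)                   # a positive-definite prior covariance

mu_tilde = rng.normal(size=n)                 # site means (placeholders)
var_tilde = rng.uniform(0.5, 2.0, size=n)     # site variances (placeholders)
S_tilde_inv = np.diag(1.0 / var_tilde)        # Sigma_tilde^{-1}

# Sigma = (K^{-1} + Sigma_tilde^{-1})^{-1},  mu = Sigma Sigma_tilde^{-1} mu_tilde
Sigma = np.linalg.inv(np.linalg.inv(K) + S_tilde_inv)
mu = Sigma @ S_tilde_inv @ mu_tilde
print(mu, Sigma.shape)
```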
Expectation Propagation
How to choose the $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ that define $t_i$?
Choose the $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ that minimize the difference between $p(\mathbf{f} \mid \mathbf{y}, X)$ and $q(\mathbf{f} \mid \mathbf{y}, X)$:
$A \prod_{i=1}^{n} p(y_i \mid f_i) \approx A \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2), \quad \text{where } A = \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K)$
Using the Kullback-Leibler (KL) divergence $\mathrm{KL}(p(x)\,\|\,q(x))$:
• $\min_{\{t_i\}} \mathrm{KL}\!\left( A \prod_{i=1}^{n} p(y_i \mid f_i)\; \Big\|\; A \prod_{i=1}^{n} t_i \right)$ — intractable ✗
• $\min_{t_i} \mathrm{KL}\!\left( A\, p(y_i \mid f_i) \prod_{j \ne i} t_j\; \Big\|\; A\, t_i \prod_{j \ne i} t_j \right)$ — iterative on $t_i$ ✓
Expectation Propagation
[Diagram sequence over sites $1, 2, \ldots, n$: updating all $t_i$ at once (the default) is intractable ✗. EP instead updates one site per step: the 1st iteration updates $t_1$ based on the marginal $\int q(\mathbf{f} \mid \mathbf{y}, X)\, df_{j \ne 1}$, holding $\prod_{j \ne 1} t_j$ fixed; the 2nd iteration updates $t_2$ based on the marginal for $t_2$; …; the $n$th iteration updates $t_n$, holding $\prod_{j \ne n} t_j$ fixed. Repeat until convergence.]
Expectation Propagation
Iteratively update $t_i$:
1. Start from the current approximate posterior and leave out the current $t_i$ to get the cavity distribution $q_{-i}(f_i)$:
$q_{-i}(f_i) \propto \int A \prod_{j \ne i} t_j\, df_{j \ne i} = \frac{\int A \prod_{j=1}^{n} t_j\, df_{j \ne i}}{t_i} = \frac{\int q(\mathbf{f} \mid \mathbf{y}, X)\, df_{j \ne i}}{t_i} = \frac{q(f_i \mid \mathbf{y}, X)}{t_i}$
$= \frac{\mathcal{N}(f_i \mid \mu_i, \sigma_i^2)}{\tilde{Z}_i\, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)} = \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2)$
where:
$\sigma_i^2 = \Sigma_{ii}, \quad \sigma_{-i}^2 = \left( \sigma_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}, \quad \mu_{-i} = \sigma_{-i}^2 \left( \sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i \right)$
Recall:
$q(\mathbf{f} \mid \mathbf{y}, X) = A \prod_{i=1}^{n} t_i = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
$t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i\, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$
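A sketch of the cavity computation for one site (the marginal and site parameters are placeholder numbers):

```python
# Cavity distribution for site i, from the current marginal
# q(f_i | y, X) = N(mu_i, var_i) and the current site t_i = Z_t N(mu_t, var_t).
mu_i, var_i = 0.4, 0.8          # marginal parameters (placeholders)
mu_t, var_t = 0.1, 2.0          # site parameters (placeholders)

# sigma_{-i}^2 = (sigma_i^{-2} - sigma_t^{-2})^{-1}
var_cav = 1.0 / (1.0 / var_i - 1.0 / var_t)
# mu_{-i} = sigma_{-i}^2 (sigma_i^{-2} mu_i - sigma_t^{-2} mu_t)
mu_cav = var_cav * (mu_i / var_i - mu_t / var_t)
print(mu_cav, var_cav)          # 0.6, 1.333...
```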
Expectation Propagation
Iteratively update $t_i$:
2. Find the new Gaussian marginal $\hat{q}(f_i) \approx p(f_i \mid \mathbf{y}, X)$ which best approximates the product of $q_{-i}(f_i)$ and the exact likelihood $p(y_i \mid f_i)$:
$q_{-i}(f_i)\, p(y_i \mid f_i) \approx \hat{q}(f_i) = \hat{Z}_i\, \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2)$
by:
$\min_{\hat{q}} \mathrm{KL}\!\left( q_{-i}(f_i)\, p(y_i \mid f_i)\; \|\; \hat{q}(f_i) \right) \qquad //\ \text{tractable: for a Gaussian } \hat{q} \text{ this reduces to moment matching}$
Note: $p(y_i \mid f_i) \ne t_i$; $p(y_i \mid f_i)$ comes from the response function, so the product is the non-Gaussian marginal $p(f_i \mid \mathbf{y}, X)$ (up to normalization).
Expectation Propagation
Iteratively update $t_i$:
3. Update the parameters $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ of $t_i$ based on $\hat{Z}_i, \hat{\mu}_i, \hat{\sigma}_i^2$ from the $\hat{q}(f_i)$ just found. E.g.:
Recall: $q_{-i}(f_i) = \frac{q(f_i \mid \mathbf{y}, X)}{t_i} = \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2)$
where $\sigma_{-i}^2 = \left( \sigma_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}$ (old marginal, old site).
Now we have the new $\hat{q}(f_i)$, and the cavity stays fixed:
$\sigma_{-i}^2 = \left( \hat{\sigma}_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}$ (new marginal, new site)
$\Rightarrow \tilde{\sigma}_i^2 = \left( \hat{\sigma}_i^{-2} - \sigma_{-i}^{-2} \right)^{-1}$ (new site variance)
Expectation Propagation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{1}{Z}\, \mathcal{N}(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
Now we have:
$p(y_i \mid f_i)$ is approximated as a Gaussian.
Subsequently, $p(\mathbf{f} \mid \mathbf{y}, X)$ is a Gaussian.
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.
Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim \mathcal{N}(\mu, \sigma^2)$.
Approximation
• Laplace approximation
  • Approximates $p(\mathbf{f} \mid \mathbf{y}, X)$ directly
  • Second-order Taylor approximation of $\ln p$
  • Mean $\boldsymbol{\mu}$ is placed at the mode
  • Covariance $A^{-1}$ is also determined at the mode
• EP
  • Approximates each likelihood term $p(y_i \mid f_i)$
  • Iteratively updates the sites $t_i$ by minimizing a KL divergence
Approximation
Laplace peaks at the posterior mode.
EP places the probability mass more accurately.
GPC
• Nonparametric classification based on Bayesian methodology
• The classification decision is made directly from the observed data
• The computational cost is O(N³) in general, due to the covariance matrix inversion. (DeepMind's recently proposed "Neural Processes" achieve O(N) by combining GPs with neural networks.)