

CS 3750 Advanced Machine Learning

Gaussian Processes: classification

Jinpeng Zhou

jiz150@pitt.edu

Gaussian Processes (GP)

A GP is a collection of random variables $f(\mathbf{x})$ such that:

any finite set $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N)$ has a joint Gaussian distribution.

Noisy observations: $y = f(\mathbf{x}) + \varepsilon$, with likelihood $p(y \mid \mathbf{x})$.

(Figure: sample functions over inputs $x_i$.)


Weight-Space View

๐‘ฆ = ๐‘“ ๐ฑ + ๐œ€ = ๐ฑ๐“๐ฐ+ ๐œ€, ๐œ€ ~ ๐‘ 0, ๐œŽ2

โ‡’ ๐ฒ ~ ๐‘ ๐—๐“๐ฐ, ๐œŽ2๐‘ฐ

Assume ๐ฐ~๐‘ต(๐ŸŽ, ๐šบ๐ฉ), then:

๐‘ ๐ฐ ๐—, ๐ฒ) =๐‘ ๐ฒ ๐—,๐ฐ ๐‘ ๐ฐ

๐‘ ๐ฒ ๐—)~๐‘

1

๐œŽ2๐ดโˆ’1๐—๐ฒ, ๐ดโˆ’1

where ๐ด =1

๐œŽ2๐‘‹๐‘‹๐‘‡ + ฮฃ๐‘

โˆ’1

Posterior on weights is Gaussian.

Weight-Space View

๐‘ ๐ฐ ๐—, ๐ฒ) ~ ๐‘1

๐œŽ2๐ดโˆ’1๐—๐ฒ, ๐ดโˆ’1 , ๐ด =

1

๐œŽ2๐‘‹๐‘‹๐‘‡ + ฮฃ๐‘

โˆ’1

Thus (Similar result when use basis function: ๐Ÿ ๐ฑ = ๐šฝ(๐ฑ)๐“๐ฐ):

๐‘ ๐‘“โˆ— ๐ฑโˆ—, ๐—, ๐ฒ) = โˆ—๐‘(๐‘“ ๐ฑโˆ—, ๐ฐ ๐‘ ๐ฐ ๐—, ๐ฒ)๐‘‘๐ฐ

= ๐‘1

๐œŽ2๐ฑโˆ—๐‘‡๐ดโˆ’1๐—๐ฒ, ๐ฑโˆ—

๐‘‡๐ดโˆ’1๐ฑโˆ— = ๐‘(๐œ‡, ฮฃ)

โ€ข ๐‘“โˆ— has a Gaussian distribution and its ฮฃ doesnโ€™t depend on y

โ€ข x always appears as xTx (โ€œscalar productโ€ kernel trick)

Kernel Distribution of ๐’‡โˆ— Prediction
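As a concrete illustration of the weight-space view, here is a minimal numpy sketch (the toy data, prior covariance, and noise level are illustrative assumptions) that computes the weight posterior and the predictive mean and variance for a test input:

```python
import numpy as np

# Hypothetical toy data: d features, n samples (X has one column per sample, as in the slides).
rng = np.random.default_rng(0)
d, n = 2, 20
X = rng.normal(size=(d, n))
w_true = np.array([1.5, -0.7])
sigma2 = 0.1                              # noise variance sigma^2
y = X.T @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma_p = np.eye(d)                       # prior covariance of w

# Posterior: p(w | X, y) = N( (1/sigma^2) A^{-1} X y, A^{-1} ),
# with A = (1/sigma^2) X X^T + Sigma_p^{-1}
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma2

# Predictive distribution for a test input x*:
x_star = np.array([0.3, -1.2])
f_mean = x_star @ w_mean                  # (1/sigma^2) x*^T A^{-1} X y
f_var = x_star @ A_inv @ x_star           # x*^T A^{-1} x*  (does not depend on y)
print(w_mean, f_mean, f_var)
```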


Function-space View

Given:

a mean function $m(\mathbf{x})$

a covariance function $k(\mathbf{x}, \mathbf{x}')$

Define the GP as:

$f(\mathbf{x}) \sim GP\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big)$

such that for each $\mathbf{x}_i$:

$f(\mathbf{x}_i) \sim N\big(m(\mathbf{x}_i), k(\mathbf{x}_i, \mathbf{x}_i)\big)$

$\mathrm{cov}\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big) = k(\mathbf{x}_i, \mathbf{x}_j)$

Function-space View

Given a joint Gaussian distribution:

$\begin{bmatrix} \mathbf{x}_A \\ \mathbf{x}_B \end{bmatrix} \sim N\left( \begin{bmatrix} \boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right)$

The conditional densities are also Gaussian:

$\mathbf{x}_A \mid \mathbf{x}_B \sim N\big(\boldsymbol{\mu}_A + \Sigma_{AB}\Sigma_{BB}^{-1}(\mathbf{x}_B - \boldsymbol{\mu}_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}\big)$

$\mathbf{x}_B \mid \mathbf{x}_A \sim N(\ldots, \ldots)$
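This conditioning identity is the workhorse for everything that follows; a small numpy helper, sketched under the same block partition as on the slide:

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b):
    """Parameters of p(x_A | x_B = x_b) for a jointly Gaussian (x_A, x_B)."""
    solve = np.linalg.solve                 # avoids forming S_bb^{-1} explicitly
    mean = mu_a + S_ab @ solve(S_bb, x_b - mu_b)
    cov = S_aa - S_ab @ solve(S_bb, S_ab.T)
    return mean, cov
```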


Function-space View

Consider training and test points.

๐Ÿ๐Ÿโˆ—

~ ๐‘(๐’Ž ๐‘ฟ๐’Ž ๐‘ฟโˆ—

,๐‘ฒ ๐‘ฟ,๐‘ฟ ๐‘ฒ ๐‘ฟ,๐‘ฟโˆ—

๐‘ฒ(๐‘ฟโˆ—, ๐‘ฟ) ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟโˆ—)

๐‘ฒ is the covariance matrix.

๐‘ฆ = ๐‘“ ๐ฑ + ๐œ€, ๐œ€ ~ ๐‘ 0, ๐œŽ2 . Thus:

๐ฒ๐ฒโˆ—

| ๐—, ๐—โˆ—~ ๐‘(๐’Ž ๐‘ฟ๐’Ž ๐‘ฟโˆ—

,๐‘ฒ ๐‘ฟ, ๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐ ๐‘ฒ ๐‘ฟ,๐‘ฟโˆ—

๐‘ฒ(๐‘ฟโˆ—, ๐‘ฟ) ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟโˆ— + ๐ˆ๐Ÿ๐‘ฐ)

๐ฒโˆ—| ๐ฒ, ๐‘ฟ, ๐‘ฟโˆ— ~ ๐‘(โ€ฆ ,โ€ฆ )

GP for Regression (GPR)

Squares are observed, circles are latent

The thick bar is like a "bus" connecting all the nodes

Each y is conditionally independent of all other nodes given f


GPR

(Graphical model: each training input $x_i$ has a latent $f_i$ and an observation $y_i$; the test input $x_*$ has latent $f_*$ and prediction $y_*$; all $f$ nodes are connected through the GP.)

GPR

๐ฒ = ๐Ÿ + ๐›†, where ๐Ÿ ~ GP m = 0, k and ๐›† ~ N(0, ฯƒ2):

๐ฒ๐ฒโˆ—

~ ๐‘(0,๐‘ฒ ๐‘ฟ,๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐ ๐‘ฒ ๐‘ฟ,๐‘ฟโˆ—

๐‘ฒ(๐‘ฟโˆ—, ๐‘ฟ) ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟโˆ— + ๐ˆ๐Ÿ๐‘ฐ)

Prediction: ๐ฒโˆ— | ๐ฒ, ๐‘ฟ, ๐‘ฟโˆ— ~ ๐‘(โ€ฆ,โ€ฆ) = ๐‘(๐โˆ—, ๐œฎโˆ—)

Where:

๐โˆ— = ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟ ๐‘ฒ ๐‘ฟ,๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐโˆ’๐Ÿ๐’š

๐œฎโˆ— = ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟโˆ— + ๐ˆ๐Ÿ๐‘ฐ - ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟ ๐‘ฒ ๐‘ฟ,๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐโˆ’๐Ÿ๐‘ฒ ๐‘ฟ,๐‘ฟโˆ—
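These two prediction equations translate directly into numpy. A minimal GPR sketch, assuming a squared-exponential kernel and an illustrative noise level:

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def gpr_predict(X, y, X_star, sigma2=0.1, ell=1.0):
    """mu_*    = K(X*,X) [K(X,X) + sigma^2 I]^{-1} y
       Sigma_* = K(X*,X*) + sigma^2 I - K(X*,X) [K(X,X) + sigma^2 I]^{-1} K(X,X*)"""
    K = se_kernel(X, X, ell) + sigma2 * np.eye(len(X))
    K_s = se_kernel(X_star, X, ell)                      # K(X*, X)
    K_ss = se_kernel(X_star, X_star, ell) + sigma2 * np.eye(len(X_star))
    mu_star = K_s @ np.linalg.solve(K, y)
    Sigma_star = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mu_star, Sigma_star

# Usage with toy 1-D data:
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 15)
y = np.sin(X) + 0.1 * rng.normal(size=15)
mu, Sigma = gpr_predict(X, y, np.linspace(-3, 3, 50))
```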


GP Classification (GPC)

Map the output into [0, 1] by using a response function:

• Sigmoid (logistic) function:

$\sigma(x) = (1 + e^{-x})^{-1}$

• Cumulative normal (probit):

$\Phi(x) = \int_{-\infty}^{x} N(t \mid 0, 1)\, dt$
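Both response functions are one-liners with standard libraries; a quick sketch:

```python
import numpy as np
from scipy.stats import norm

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigma(x) = (1 + e^{-x})^{-1}

def probit(x):
    return norm.cdf(x)                # Phi(x) = standard normal CDF

print(logistic(0.0), probit(0.0))     # both map 0 to 0.5
```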

GPC examples

Squared exponential kernel with hyperparameter length-scale = 0.1

Logistic response function.


GPC

(Diagram: for regression, $y = f(x)$, there is a GP over $f$; for classification, $y = \sigma(f(x))$, there is a GP over $f$ but a non-Gaussian distribution over $y$.)

GPC

Target ๐ฒ = ๐ˆ(๐Ÿ ๐ฑ ) is a non-Gaussian.

E.g., itโ€™s a Bernoulli for class 0, 1:

๐’‘ ๐ฒ ๐Ÿ = ฯ‚๐‘–=1๐‘› ๐œŽ(f(xi))

๐‘ฆ๐‘– 1 โˆ’ ๐œŽ(f(xi)1โˆ’๐‘ฆ๐‘–

How to predict p(๐ฒโˆ— | ๐ฒ, ๐‘ฟ, ๐‘ฟโˆ—) ?


GPC

Assume a single test case $\mathbf{x}_*$; predict its class, 0 or 1.

Assume a sigmoid response function and a GP prior over $f$.

Thus, try to involve $f_*$:

$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(y_* = 1, f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

$= \int p(y_* = 1 \mid f_*, \mathbf{y}, X, \mathbf{x}_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

$= \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

Approximation

$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

Note that $\sigma(f_*)$ is a sigmoid function.

If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian, the convolution of a sigmoid and a Gaussian can be computed approximately as:

$\int \sigma(t)\, N(t \mid \mu, s^2)\, dt \approx \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2 / 8}}\right)$

Otherwise, analytically intractable!
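A quick numerical sanity check of this approximation (the test values of $\mu$ and $s^2$ are arbitrary), comparing it against direct quadrature of the sigmoid-Gaussian integral:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def exact(mu, s2):
    """Numerical quadrature of  integral sigma(t) N(t | mu, s2) dt."""
    f = lambda t: sigmoid(t) * norm.pdf(t, loc=mu, scale=np.sqrt(s2))
    val, _ = integrate.quad(f, mu - 10 * np.sqrt(s2), mu + 10 * np.sqrt(s2))
    return val

def approx(mu, s2):
    """sigma( mu / sqrt(1 + pi * s2 / 8) )."""
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

for mu, s2 in [(0.5, 1.0), (-1.0, 4.0), (2.0, 0.25)]:
    print(mu, s2, exact(mu, s2), approx(mu, s2))
```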


If ๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a Gaussianโ€ฆ

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— ,โˆ—๐‘(f = ๐Ÿ ๐ฒ, ๐‘ฟ, ๐ฑโˆ— ๐‘‘๐Ÿ

=๐‘ ๐ฒ | fโˆ—, ๐Ÿ, ๐‘ฟ, ๐ฑโˆ— ๐‘ fโˆ—, ๐Ÿ, ๐‘ฟ,๐ฑโˆ—

๐‘ ๐ฒ, ๐‘ฟ, ๐ฑโˆ—๐‘‘๐Ÿ

=๐‘ ๐ฒ | ๐Ÿ, ๐‘ฟ ๐‘ ๐Ÿ, ๐‘ฟ, ๐ฑโˆ— ๐‘ fโˆ— | ๐Ÿ, ๐‘ฟ,๐ฑโˆ—

๐‘ ๐ฒ, ๐‘ฟ, ๐ฑโˆ—๐‘‘๐Ÿ

=๐‘ ๐ฒ | ๐Ÿ, ๐‘ฟ ๐‘ ๐Ÿ, ๐‘ฟ

๐‘ ๐ฒ, ๐‘ฟ๐‘ fโˆ— | ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

= ๐‘(๐Ÿ ๐ฒ,๐‘ฟ ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

If ๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a Gaussianโ€ฆ

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— = ๐‘(๐Ÿ ๐ฒ,๐‘ฟ ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

Recall from GP regression: ๐ฒโˆ— | ๐ฒ, ๐‘ฟ, ๐‘ฟโˆ— ~ ๐‘(๐โˆ—, ๐œฎโˆ—)

๐โˆ— = ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟ ๐‘ฒ ๐‘ฟ, ๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐโˆ’๐Ÿ๐’š

๐œฎโˆ— = ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟโˆ— + ๐ˆ๐Ÿ๐‘ฐ - ๐‘ฒ ๐‘ฟโˆ—, ๐‘ฟ ๐‘ฒ ๐‘ฟ, ๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐโˆ’๐Ÿ๐‘ฒ ๐‘ฟ,๐‘ฟโˆ—

Replace ๐‘ฟโˆ— with ๐ฑโˆ—:

๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ~ ๐‘ต(๐’Œโˆ—๐‘ป๐‘ช๐’

โˆ’๐Ÿ๐’š, ๐‘ โˆ’ ๐’Œโˆ—๐‘ป๐‘ช๐’

โˆ’๐Ÿ๐’Œโˆ—)

Where: ๐’Œโˆ— = ๐‘ฒ ๐‘ฟ, ๐ฑโˆ— ๐‘ช๐’ = ๐‘ฒ ๐‘ฟ,๐‘ฟ + ๐ˆ๐Ÿ๐‘ฐ

๐‘ = ๐‘˜ ๐ฑโˆ—, ๐ฑโˆ— + ๐ˆ๐Ÿ


If ๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a Gaussianโ€ฆ

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— = ๐‘(๐Ÿ ๐ฒ,๐‘ฟ ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

Now ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ is a Gaussian!

IF:

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ is a Gaussian

THEN:

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a Gaussian!

Is ๐‘(๐Ÿ ๐ฒ, ๐‘ฟ a Gaussian?

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ =๐’‘ ๐’š ๐Ÿ, ๐‘ฟ ๐’‘ ๐’‡ ๐‘ฟ)

๐’‘ ๐’š ๐‘ฟ)

๐’‘ ๐Ÿ ๐‘ฟ) ~ ๐‘ต ๐ŸŽ,๐‘ฒ , ๐’‘ ๐’š ๐Ÿ, ๐‘ฟ = ฯ‚๐‘–=1๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘–

โ‡’ ๐‘(๐Ÿ ๐ฒ,๐‘ฟ =๐‘ต ๐ŸŽ,๐‘ฒ

๐’‘ ๐’š ๐‘ฟ)ฯ‚๐‘–=1๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘–

Note: ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘– is Non-Gaussian (e.g., Bernoulli)

โ‡’ ๐‘(๐Ÿ ๐ฒ,๐‘ฟ is Non-Gaussian.


Approximation

$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ were a Gaussian, this integral could be computed with the sigmoid-Gaussian approximation above. So, try to approximate $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ as a Gaussian using:

โ€ข Laplace Approximation

โ€ข Expectation Propagation (EP)

โ€ข Variational Approximation, Monte Carlo Sampling, etc.

If ๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a Gaussianโ€ฆ

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— = ๐‘(๐Ÿ ๐ฒ,๐‘ฟ ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

Now ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ is a Gaussian!

IF:

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ is a approximated as Gaussian

THEN:

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— is a approximated as Gaussian!


Laplace Approximation

Goal: use a Gaussian to approximate $p(x)$.

Gaussian: $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Idea:

• $\mu$ should match the mode: $\arg\max_x p(x)$

• relate $\exp\!\big(k(x - \mu)^2\big)$ to $p(x)$

Laplace Approximation

โ€ข ๐œ‡ should match the mode: ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ๐‘ฅ ๐‘(๐’™)

๐‘โ€ฒ ๐œ‡ = 0, ๐œ‡ is placed at the mode

โ€ข Relation between exp ๐‘˜(๐‘ฅ โˆ’ ๐œ‡)2 and ๐‘(๐’™)

Taylor expansion of ๐’๐’ ๐’‘(๐’™) at ๐

Let ๐’• ๐’™ = ๐’๐’๐’‘(๐’™) , Taylor expansion of ๐‘ก ๐‘ฅ at ๐ :

๐‘ก ๐‘ฅ = ๐‘ก(๐œ‡) +๐‘กโ€ฒ ๐œ‡

1!๐‘ฅ โˆ’ ๐œ‡ +

๐‘กโ€ฒโ€ฒ ๐œ‡

2!๐‘ฅ โˆ’ ๐œ‡ 2 +โ‹ฏ


Laplace Approximation

Taylor quadratic approximation:

$t(x) = t(\mu) + \frac{t'(\mu)}{1!}(x - \mu) + \frac{t''(\mu)}{2!}(x - \mu)^2 + \cdots \approx t(\mu) + \frac{t'(\mu)}{1!}(x - \mu) + \frac{t''(\mu)}{2!}(x - \mu)^2$

Note that $t'(\mu) = \frac{p'(\mu)}{p(\mu)} = 0$, thus:

$t(x) \approx t(\mu) + \frac{t''(\mu)}{2!}(x - \mu)^2$

Laplace Approximation

๐‘ก ๐‘ฅ โ‰ˆ ๐‘ก ๐œ‡ +๐‘กโ€ฒโ€ฒ ๐œ‡

2!๐‘ฅ โˆ’ ๐œ‡ 2

// ๐‘ก ๐‘ฅ = ๐‘™๐‘› ๐‘(๐‘ฅ)

โ‡’ exp ๐‘ก ๐‘ฅ โ‰ˆ exp ๐‘ก ๐œ‡ +๐‘กโ€ฒโ€ฒ ๐œ‡

2!๐‘ฅ โˆ’ ๐œ‡ 2

โ‡’ ๐‘ ๐‘ฅ โ‰ˆ ๐‘’๐‘ฅ๐‘ ๐‘ก ๐œ‡ ๐‘’๐‘ฅ๐‘๐‘กโ€ฒโ€ฒ ๐œ‡ ๐‘ฅโˆ’๐œ‡ 2

2!

โ‡’ ๐‘ ๐‘ฅ โ‰ˆ ๐‘ ๐œ‡ ๐‘’๐‘ฅ๐‘ โˆ’๐ด ๐‘ฅโˆ’๐œ‡ 2

2

Where: A = โˆ’ ๐‘กโ€ฒโ€ฒ ๐œ‡ = โˆ’๐‘‘2

๐‘‘๐‘ฅ2๐‘™๐‘› ๐‘ ๐‘ฅ |๐‘ฅ=๐œ‡


Laplace Approximation

$p(x) \approx p(\mu)\, \exp\!\left(-\frac{A\,(x - \mu)^2}{2}\right)$

$p(\mathbf{x}) \approx p(\boldsymbol{\mu})\, \exp\!\left(-\frac{(\mathbf{x} - \boldsymbol{\mu})^T A\, (\mathbf{x} - \boldsymbol{\mu})}{2}\right)$

Corresponding Gaussian approximation:

$p(\mathbf{x}) \approx N(\boldsymbol{\mu}, A^{-1})$

Where:

$\frac{d}{d\mathbf{x}}\ln p(\mathbf{x})\Big|_{\mathbf{x}=\boldsymbol{\mu}} = 0, \qquad -\frac{d^2}{d\mathbf{x}^2}\ln p(\mathbf{x})\Big|_{\mathbf{x}=\boldsymbol{\mu}} = A$

Laplace Approximation

Back to $p(\mathbf{f} \mid \mathbf{y}, X)$:

$\ln p(\mathbf{f} \mid \mathbf{y}, X) = \ln \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)} = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X) - \ln p(\mathbf{y} \mid X)$

$\ln p(\mathbf{y} \mid X)$ is independent of $\mathbf{f}$, and we only need first and second derivatives with respect to $\mathbf{f}$.

Thus, define $\Psi(\mathbf{f})$ for the derivative calculation:

$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$


Laplace Approximation

ฮจ ๐Ÿ = ln ๐‘ ๐ฒ ๐Ÿ + ln ๐‘ ๐Ÿ ๐—

Note that:

o ๐‘ ๐ฒ ๐Ÿ = ฯ‚๐‘–=1๐‘› ๐œŽ(f(xi))

๐‘ฆ๐‘– 1 โˆ’ ๐œŽ(f(xi)1โˆ’๐‘ฆ๐‘–

= ฯ‚๐‘–=1๐‘› 1 โˆ’ ๐œŽ(f(xi)

๐œŽ(f(xi))

1โˆ’๐œŽ(f(xi)

๐‘ฆ๐‘–

= ฯ‚๐‘–=1๐‘› 1

1+๐‘’f(xi)๐‘’f xi

๐‘ฆ๐‘–

o ๐Ÿ | ๐— ~ ๐‘ต(๐ŸŽ, ๐‘ช๐’)

Laplace Approximation

ฮจ ๐Ÿ = ln ๐‘ ๐ฒ ๐Ÿ + ln ๐‘ ๐Ÿ ๐—

o ๐‘ ๐ฒ ๐Ÿ = ฯ‚๐‘–=1๐‘› 1

1+๐‘’f(xi)๐‘’f xi

๐‘ฆ๐‘–

o ๐Ÿ | ๐— ~ ๐‘ต(๐ŸŽ, ๐‘ช๐’)

Thus:

ฮจ ๐Ÿ = ๐ฒ๐“๐Ÿ โˆ’ ฯƒ๐‘–=1๐‘› ln 1 + ๐‘’f xi โˆ’

๐Ÿ๐“๐‚๐งโˆ’๐Ÿ๐Ÿ

๐Ÿโˆ’

๐ง ln 2๐œ‹

๐Ÿโˆ’

ln ๐ถ๐‘›

๐Ÿ


Laplace Approximation

ฮจ ๐Ÿ = ln ๐‘ ๐ฒ ๐Ÿ + ln ๐‘ ๐Ÿ ๐—

= ๐ฒ๐“๐Ÿ โˆ’ ฯƒ๐‘–=1๐‘› ln 1 + ๐‘’f xi โˆ’

๐Ÿ๐“๐‘ช๐’โˆ’๐Ÿ๐Ÿ

๐Ÿโˆ’

๐ง ln 2๐œ‹

๐Ÿโˆ’

ln ๐ถ๐‘›

๐Ÿ

๐›ปฮจ ๐Ÿ = ๐›ปln ๐‘ ๐ฒ ๐Ÿ โˆ’ ๐‚๐งโˆ’๐Ÿ๐Ÿ = ๐ฒ โˆ’ ๐ˆ(๐Ÿ) โˆ’ ๐‚๐ง

โˆ’๐Ÿ๐Ÿ

where ๐ˆ ๐Ÿ = ๐ˆ(๐Ÿ ๐’™๐Ÿ ), โ€ฆ , ๐ˆ(๐Ÿ(๐’™๐’))๐‘ป

๐›ป๐›ปฮจ ๐Ÿ = ๐›ป๐›ปln ๐‘ ๐ฒ ๐Ÿ โˆ’ ๐‚๐งโˆ’๐Ÿ = โˆ’๐‘พ๐’ โˆ’ ๐‘ช๐’

โˆ’๐Ÿ

๐‘Š๐‘› =๐ˆ ๐Ÿ ๐’™๐Ÿ 1 โˆ’ ๐ˆ ๐Ÿ ๐’™๐Ÿ

โ‹ฑ

๐ˆ ๐Ÿ ๐’™๐’ 1 โˆ’ ๐ˆ ๐Ÿ ๐’™๐’

Laplace Approximation

Corresponding Gaussian approximation:

$p(\mathbf{f} \mid \mathbf{y}, X) \approx N(\boldsymbol{\mu}, A^{-1})$

Where:

$\nabla\Psi(\boldsymbol{\mu}) = \mathbf{y} - \boldsymbol{\sigma}(\boldsymbol{\mu}) - C_n^{-1}\boldsymbol{\mu} = 0 \;\Rightarrow\; \mathbf{y} - \boldsymbol{\sigma}(\boldsymbol{\mu}) = C_n^{-1}\boldsymbol{\mu}$ (no closed-form solution for $\boldsymbol{\mu}$)

$-\nabla\nabla\Psi(\boldsymbol{\mu}) = W_n + C_n^{-1} \;\Rightarrow\; A = W_n + C_n^{-1}$


Laplace Approximation

We cannot directly solve for $\mathbf{f}$ from $\mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}) = C_n^{-1}\mathbf{f}$.

Recall the motivation: place $\boldsymbol{\mu}$ at the mode, i.e., maximize $p(\mathbf{f} \mid \mathbf{y}, X)$.

Thus, instead, find the $\mathbf{f}^*$ that minimizes $-\ln p(\mathbf{f} \mid \mathbf{y}, X)$ by Newton iteration:

$\mathbf{f}^{\text{new}} = \mathbf{f}^{\text{old}} - (\nabla\nabla\Psi)^{-1}\,\nabla\Psi$

Finally, we get the Gaussian approximation:

$p(\mathbf{f} \mid \mathbf{y}, X) \approx N\big(\mathbf{f}^*, (W_n + C_n^{-1})^{-1}\big)$

$\boldsymbol{\sigma}(\mathbf{f}) = \big(\sigma(f(x_1)), \ldots, \sigma(f(x_n))\big)^T$

Laplace Approximation

$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$

Now we have:

$p(\mathbf{f} \mid \mathbf{y}, X)$ is approximated as a Gaussian.

$p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.

Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim N(\mu, s^2)$.

Thus, $p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) \approx \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2 / 8}}\right)$


Expectation Propagation

๐‘(fโˆ— ๐ฒ, ๐‘ฟ, ๐ฑโˆ— = ๐‘(๐Ÿ ๐ฒ,๐‘ฟ ๐‘(fโˆ— ๐‘ฟ, ๐ฑโˆ—, ๐Ÿ ๐‘‘๐Ÿ

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ =๐‘ ๐ฒ ๐Ÿ ๐‘ ๐Ÿ ๐—

๐‘ ๐ฒ ๐—)

=1

Z๐‘ต ๐ŸŽ,๐‘ฒ ฯ‚๐‘–=1

๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘–

Normalization term:

๐‘ = ๐‘ ๐ฒ ๐—) = ๐‘ ๐Ÿ ๐— ฯ‚๐‘–=1๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘– ๐‘‘๐Ÿ

Expectation Propagation

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ =1

๐‘๐‘ต ๐ŸŽ,๐‘ฒ ฯ‚๐‘–=1

๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘–

๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘– is Non-Gaussian due to response function.

Try to find a Gaussian ๐‘ก๐‘– to approximate it:

๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘– โ‰ˆ ๐‘ก๐‘– ๐‘“๐‘– เทจ๐‘๐‘– , ๐œ‡๐‘– , ๐œŽ๐‘–2) = เทฉ๐‘๐‘– ๐‘(๐‘“๐‘– | ๐œ‡๐‘–, ๐œŽ๐‘–

2)

โ‡’ ฯ‚๐‘–=1๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘– โ‰ˆ ฯ‚๐‘–=1

๐‘› ๐‘ก๐‘– ๐‘“๐‘– เทจ๐‘๐‘–, ๐œ‡๐‘– , ๐œŽ๐‘–2)

= N ๐, เทจฮฃ ฯ‚๐‘–=1๐‘› เทจ๐‘๐‘–, where เทจฮฃ is ๐‘‘๐‘–๐‘Ž๐‘” ๐œŽ๐‘–

2


Expectation Propagation

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ =1

๐‘๐‘ต ๐ŸŽ,๐‘ฒ ฯ‚๐‘–=1

๐‘› ๐‘ ๐‘ฆ๐‘– | ๐‘“๐‘–

โ‰ˆ1

๐‘๐‘ต ๐ŸŽ,๐‘ฒ ฯ‚๐‘–=1

๐‘› ๐‘ก๐‘– ๐‘“๐‘– เทจ๐‘๐‘– , ๐œ‡๐‘– , ๐œŽ๐‘–2)

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ โ‰ˆ1

๐‘๐‘ต ๐ŸŽ,๐‘ฒ N ๐, เทฉ๐šบ ฯ‚๐‘–=1

๐‘› เทจ๐‘๐‘–

Thus:

๐‘(๐Ÿ ๐ฒ, ๐‘ฟ โ‰ˆ ๐‘ž(๐Ÿ ๐ฒ,๐‘ฟ = N(๐, ฮฃ)

Where: ๐ = ฮฃเทจฮฃโˆ’1๐ ฮฃ = ๐‘ฒโˆ’1 + เทจฮฃโˆ’1โˆ’1

Expectation Propagation

How to choose the $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ that define $t_i$?

Choose the $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ that minimize the difference between $p(\mathbf{f} \mid \mathbf{y}, X)$ and $q(\mathbf{f} \mid \mathbf{y}, X)$:

$A \prod_{i=1}^{n} p(y_i \mid f_i) \approx A \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$, where $A = \frac{1}{Z}\, N(\mathbf{0}, K)$

Using the Kullback-Leibler (KL) divergence $KL(p(x)\,\|\,q(x))$:

• $\min_{\{t_i\}} KL\Big(A \prod_{i=1}^{n} p(y_i \mid f_i)\,\Big\|\, A \prod_{i=1}^{n} t_i\Big)$: intractable ✗

• $\min_{t_i} KL\Big(A\, p(y_i \mid f_i) \prod_{j \neq i} t_j\,\Big\|\, A\, t_i \prod_{j \neq i} t_j\Big)$: iterative on $t_i$ ✓


Expectation Propagation

(Diagram: the default scheme updates all site terms $t_1, t_2, \ldots, t_n$ jointly, which is intractable ✗.)

Expectation Propagation

(Diagram: 1st iteration, update $t_1$ based on its marginal $\int q(\mathbf{f} \mid \mathbf{y}, X)\, df_{j \neq 1}$, holding $\prod_{j \neq 1} t_j$ fixed.)


Expectation Propagation

(Diagram: 2nd iteration, update $t_2$ based on the marginal for $t_2$.)

Expectation Propagation

(Diagram: nth iteration, update $t_n$ based on the marginal for $t_n$, holding $\prod_{j \neq n} t_j$ fixed.)


Expectation Propagation

Repeat until convergence.



Expectation Propagation

Iteratively update ๐‘ก๐‘–

1. Start from the current approximate posterior and leave out the current $t_i$ to get the cavity distribution $q_{-i}(f_i)$:

$q_{-i}(f_i) = \int A \prod_{j \neq i} t_j\, df_{j \neq i} = \frac{\int t_i\, A \prod_{j \neq i} t_j\, df_{j \neq i}}{t_i} = \frac{\int A \prod_{j=1}^{n} t_j\, df_{j \neq i}}{t_i} = \frac{\int q(\mathbf{f} \mid \mathbf{y}, X)\, df_{j \neq i}}{t_i} = \frac{q(f_i \mid \mathbf{y}, X)}{t_i} = \frac{N(f_i \mid \mu_i, \sigma_i^2)}{\tilde{Z}_i\, N(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)} = N(f_i \mid \mu_{-i}, \sigma_{-i}^2)$

where: $\sigma_i^2 = \Sigma_{ii}$, $\quad \sigma_{-i}^2 = \big(\sigma_i^{-2} - \tilde{\sigma}_i^{-2}\big)^{-1}$, $\quad \mu_{-i} = \sigma_{-i}^2\big(\sigma_i^{-2}\mu_i - \tilde{\sigma}_i^{-2}\tilde{\mu}_i\big)$

Recall: $q(\mathbf{f} \mid \mathbf{y}, X) = A \prod_{i=1}^{n} t_i = N(\boldsymbol{\mu}, \Sigma)$ and $t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i\, N(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$.

Expectation Propagation

Iteratively update ๐‘ก๐‘–

2. Find the new Gaussian marginal $\hat{q}(f_i) \approx q(f_i \mid \mathbf{y}, X)$ that best approximates the product of $q_{-i}(f_i)$ and $p(y_i \mid f_i)$:

$q_{-i}(f_i)\, p(y_i \mid f_i) \approx \hat{q}(f_i) = \hat{Z}_i\, N(\hat{\mu}_i, \hat{\sigma}_i^2)$

by

$\min_{t_i} KL\big(q_{-i}(f_i)\, p(y_i \mid f_i)\,\|\,\hat{q}(f_i)\big)$   // tractable

Here $p(y_i \mid f_i)$ comes from the response function (the term approximated by $t_i$), so $q_{-i}(f_i)\, p(y_i \mid f_i)$ plays the role of the non-Gaussian marginal $p(f_i \mid \mathbf{y}, X)$.


Expectation Propagation

Iteratively update ๐‘ก๐‘–

3. Update the parameters $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ of $t_i$ based on the $\hat{Z}_i, \hat{\mu}_i, \hat{\sigma}_i^2$ of the found $\hat{q}(f_i)$. E.g.:

Recall: $q_{-i}(f_i) = \frac{q(f_i)}{t_i} = N(f_i \mid \mu_{-i}, \sigma_{-i}^2)$, where $\sigma_{-i}^2 = \big(\sigma_i^{-2} - \tilde{\sigma}_i^{-2}\big)^{-1}$ (computed from the old marginal and old site).

Now, with the new marginal $\hat{q}(f_i)$, require $\sigma_{-i}^2 = \big(\hat{\sigma}_i^{-2} - \tilde{\sigma}_i^{-2}\big)^{-1}$

$\Rightarrow$ new site variance $\tilde{\sigma}_i^2 = \big(\hat{\sigma}_i^{-2} - \sigma_{-i}^{-2}\big)^{-1}$
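A hedged sketch of a single site update: the cavity and site-variance formulas follow the slides, while the site-mean update and the use of numerical quadrature for the moments of $\hat{q}(f_i)$ are illustrative assumptions (closed forms exist when the probit response is used):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def ep_site_update(y_i, mu_i, s2_i, mu_t, s2_t):
    """One EP update for site i.
    Inputs: marginal q(f_i|y,X) = N(mu_i, s2_i) and current site t_i ~ N(mu_t, s2_t).
    Returns updated site parameters (mu_t_new, s2_t_new)."""
    # 1. Cavity: sigma_{-i}^2 = (s2_i^{-1} - s2_t^{-1})^{-1},
    #    mu_{-i} = sigma_{-i}^2 (mu_i/s2_i - mu_t/s2_t)
    s2_c = 1.0 / (1.0 / s2_i - 1.0 / s2_t)
    mu_c = s2_c * (mu_i / s2_i - mu_t / s2_t)

    # 2. Moments of q_hat(f_i) proportional to q_{-i}(f_i) p(y_i|f_i), via quadrature
    #    (p(y_i|f_i) is the Bernoulli/logistic likelihood).
    lik = lambda f: sigmoid(f) if y_i == 1 else 1.0 - sigmoid(f)
    w = lambda f: lik(f) * norm.pdf(f, loc=mu_c, scale=np.sqrt(s2_c))
    lo, hi = mu_c - 10 * np.sqrt(s2_c), mu_c + 10 * np.sqrt(s2_c)
    Z = integrate.quad(w, lo, hi)[0]
    mu_hat = integrate.quad(lambda f: f * w(f), lo, hi)[0] / Z
    s2_hat = integrate.quad(lambda f: (f - mu_hat) ** 2 * w(f), lo, hi)[0] / Z

    # 3. New site parameters from q_hat and the cavity (the mean update parallels
    #    the variance update on the slide; it is an assumption of this sketch):
    s2_t_new = 1.0 / (1.0 / s2_hat - 1.0 / s2_c)
    mu_t_new = s2_t_new * (mu_hat / s2_hat - mu_c / s2_c)
    return mu_t_new, s2_t_new

# Usage with illustrative numbers:
print(ep_site_update(1, 0.2, 1.0, 0.0, 10.0))
```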

Expectation Propagation

$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$

$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$

$p(\mathbf{f} \mid \mathbf{y}, X) = \frac{1}{Z}\, N(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$

Now we have:

each $p(y_i \mid f_i)$ is approximated as a Gaussian,

so $p(\mathbf{f} \mid \mathbf{y}, X)$ is (approximately) a Gaussian,

and $p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.

Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim N(\mu, s^2)$.


Approximation

• Laplace Approximation

• approximates $p(\mathbf{f} \mid \mathbf{y}, X)$ directly

• 2nd-order Taylor approximation

• the mean $\boldsymbol{\mu}$ is placed at the mode

• the covariance $A^{-1}$ is also determined at the mode

• EP

• approximates each likelihood term $p(y_i \mid f_i)$

• iteratively updates $t_i$ by minimizing a KL divergence

Approximation

Laplace peaks at posterior mode

EP has a more accurate placement of probability mass


GPC

• Nonparametric classification based on Bayesian methodology

• The classification decision is made directly from the observed data

• The computational cost is $O(N^3)$ in general, due to the covariance matrix inversion. (DeepMind's recently proposed "Neural Processes" achieved $O(N)$ by combining GPs with neural networks.)

References

1. Ebden, Mark. "Gaussian processes for regression: A quick introduction." The Website of the Robotics Research Group, Department of Engineering Science, University of Oxford 91 (2008): 424-436.

2. Williams, Christopher K., and Carl Edward Rasmussen. "Gaussian Processes for Machine Learning." The MIT Press 2, no. 3 (2006): 4.

3. Kuss, Malte, and Carl E. Rasmussen. "Assessing approximations for Gaussian process classification." In Advances in Neural Information Processing Systems, pp. 699-706. 2006.

4. Milos Hauskrecht. "CS2750 Machine Learning: Linear regression." http://people.cs.pitt.edu/~milos/courses/cs2750/Lectures/Class8.pdf

5. Nicholas Zabaras. "Introduction to Kernel Methods and Gaussian Processes." https://www.dropbox.com/s/2yonhmzn57lvv1d/Lec27-KernelMethods.pdf?dl=0

6. Yee Whye Teh. "Expectation and Belief Propagation." https://www.stats.ox.ac.uk/~teh/teaching/probmodels/lecture4bp.pdf