Scalable Gaussian Processes


Transcript of Scalable Gaussian Processes

Page 1: Scalable Gaussian Processes

Scalable Gaussian Processes

Zhenwen Dai

Amazon

9 September 2019 @GPSS 2019


Page 2: Scalable Gaussian Processes

Gaussian process

Input and Output Data:

$$\mathbf{y} = (y_1, \ldots, y_N), \qquad X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^\top$$

$$p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}|\mathbf{f}, \sigma^2 I), \qquad p(\mathbf{f}|X) = \mathcal{N}(\mathbf{f}|\mathbf{0}, K(X,X))$$

[Figure: a GP regression fit on 1-D data showing the mean, the data, and the confidence interval.]

Page 3: Scalable Gaussian Processes

Behind a Gaussian process fit

Maximum likelihood estimate of the hyper-parameters.

$$\theta^* = \arg\max_\theta \log p(\mathbf{y}|X, \theta) = \arg\max_\theta \log \mathcal{N}\left(\mathbf{y}|\mathbf{0}, K + \sigma^2 I\right)$$

Prediction on a test point given the observed data and the optimized hyper-parameters:

$$p(\mathbf{f}_*|X_*, \mathbf{y}, X, \theta) = \mathcal{N}\left(\mathbf{f}_* \,\middle|\, K_*(K + \sigma^2 I)^{-1}\mathbf{y},\; K_{**} - K_*(K + \sigma^2 I)^{-1}K_*^\top\right)$$


Page 4: Scalable Gaussian Processes

How to implement the log-likelihood (1)

Compute the covariance matrix K:

$$K = \begin{pmatrix} k(\mathbf{x}_1,\mathbf{x}_1) & \cdots & k(\mathbf{x}_1,\mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N,\mathbf{x}_1) & \cdots & k(\mathbf{x}_N,\mathbf{x}_N) \end{pmatrix}, \quad \text{where } k(\mathbf{x}_i,\mathbf{x}_j) = \gamma \exp\left(-\frac{1}{2l^2}(\mathbf{x}_i - \mathbf{x}_j)^\top(\mathbf{x}_i - \mathbf{x}_j)\right)$$

The complexity is $O(N^2 Q)$, where Q is the input dimensionality.
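A minimal NumPy sketch of this computation (my own illustration, assuming the RBF kernel above with variance gamma and lengthscale l); forming the pairwise squared distances is what gives the $O(N^2 Q)$ cost:

import numpy as np

def rbf_cov(X, gamma=1.0, lengthscale=1.0):
    # Pairwise squared distances via broadcasting: O(N^2 Q) time, O(N^2) memory.
    sqdist = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return gamma * np.exp(-0.5 * sqdist / lengthscale**2)

X = np.random.randn(1000, 10)   # N = 1000, Q = 10
K = rbf_cov(X)                  # 1000 x 1000 covariance matrix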


Page 5: Scalable Gaussian Processes

How to implement the log-likelihood (2)

Plug in the log-pdf of multi-variate normal distribution:

$$\log p(\mathbf{y}|X) = \log \mathcal{N}\left(\mathbf{y}|\mathbf{0}, K + \sigma^2 I\right) = -\frac{1}{2}\log\left|2\pi(K + \sigma^2 I)\right| - \frac{1}{2}\mathbf{y}^\top(K + \sigma^2 I)^{-1}\mathbf{y}$$

$$= -\frac{1}{2}\left(\|L^{-1}\mathbf{y}\|^2 + N\log 2\pi\right) - \sum_i \log L_{ii}$$

Take a Cholesky decomposition: $L = \mathrm{chol}(K + \sigma^2 I)$.

The computational complexity is $O(N^3 + N^2 + N)$. Therefore, the overall complexity including the computation of K is $O(N^3)$.


Page 6: Scalable Gaussian Processes

A quick profiling (N=1000, Q=10)

Time unit is microseconds.

Line # Time % Time Line Contents

2 def log_likelihood(kern, X, Y, sigma2):

3 6.0 0.0 N = X.shape[0]

4 55595.0 58.7 K = kern.K(X)

5 4369.0 4.6 Ky = K + np.eye(N)*sigma2

6 30012.0 31.7 L = np.linalg.cholesky(Ky)

7 4361.0 4.6 LinvY = dtrtrs(L, Y, lower=1)[0]

8 49.0 0.1 logL = N*np.log(2*np.pi)/-2.

9 82.0 0.1 logL += np.square(LinvY).sum()/-2.

10 208.0 0.2 logL += -np.log(np.diag(L)).sum()

11 2.0 0.0 return logL


Page 7: Scalable Gaussian Processes

Empirical analysis of computational time

I collect the run time for N = {10, 100, 500, 1000, 1500, 2000}. They take 1.3 ms, 8.5 ms, 28 ms, 0.12 s, 0.29 s and 0.76 s respectively.

[Figure: measured run time (seconds) against data size N, with a fit (mean, data, confidence) extrapolating the trend.]
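The extrapolation quoted on the next slide can be reproduced approximately with a simple cubic fit to these timings (my own illustration; the slide itself appears to use a GP fit with a confidence interval):

import numpy as np

N = np.array([10, 100, 500, 1000, 1500, 2000], dtype=float)
t = np.array([1.3e-3, 8.5e-3, 28e-3, 0.12, 0.29, 0.76])

# Fit t ~ a * N^3 by least squares (the dominant O(N^3) term).
a = np.sum(t * N**3) / np.sum(N**6)
pred = a * 1e6**3                   # extrapolate to N = 1,000,000
print(pred, "seconds =", pred / 3600 / 24 / 365, "years")
# Prints roughly 9e7 seconds, i.e. about 3 years, consistent with the next slide.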


Page 8: Scalable Gaussian Processes

What if we have 1 million data points?

The mean of the predicted computational time is $9.4 \times 10^7$ seconds ≈ 2.98 years.


Page 9: Scalable Gaussian Processes

What about waiting for faster computers?

$$\text{Computational time} = \frac{\text{amount of work}}{\text{computer speed}}$$

If computer speed increases at a pace of 20% year over year:

After 10 years, it will take about 176 days.

After 50 years, it will take about 2.9 hours.

If we double the size of the data, it takes 11.4 years of such speed-up to catch up.
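These figures follow directly from the assumed 20% annual speed-up; a quick sanity check (my own arithmetic):

import math

base = 9.4e7                        # predicted seconds for N = 1e6 today
print(base / 1.2**10 / 86400)       # ~176 days after 10 years of 20% growth
print(base / 1.2**50 / 3600)        # ~2.9 hours after 50 years

# Doubling N multiplies the O(N^3) work by 8; years of 20% growth to absorb it:
print(math.log(8) / math.log(1.2))  # ~11.4 years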


Page 10: Scalable Gaussian Processes

What about parallel computing / GPU?

There is ongoing work on speeding up the Cholesky decomposition with multi-core CPUs or GPUs.

Main limitation: heavy communication and shared memory.


Page 11: Scalable Gaussian Processes

Other approaches

Apart from speeding up the exact computation, there has been a lot of work on approximating GP inference.

These methods often target a specific scenario and provide a good approximation for that scenario.

This talk provides an overview of the common approximations.


Page 12: Scalable Gaussian Processes

Big data (?)

lots of data ≠ complex function

In real-world problems, we often collect a lot of data for modeling relatively simple relations.

[Figure: a GP fit (mean, data, confidence) to a large, densely sampled 1-D data set with a relatively simple underlying function.]

Page 13: Scalable Gaussian Processes

Data subsampling?

Real data are often not evenly distributed.

We tend to get a lot of data on common cases and very little data on rare cases.

[Figure: a GP fit (mean, data, confidence) to unevenly distributed 1-D data, dense in common regions and sparse in rare ones.]

Page 14: Scalable Gaussian Processes

Covariance matrix of redundant data

With redundant data, the covariance matrix becomes low rank.

What about low rank approximation?

[Figure: plot illustrating the near-low-rank structure of the covariance matrix for redundant data.]

Page 15: Scalable Gaussian Processes

Low-rank approximation

Let’s recall the log-likelihood of GP:

$$\log p(\mathbf{y}|X) = \log \mathcal{N}\left(\mathbf{y}|\mathbf{0}, K + \sigma^2 I\right),$$

where K is the covariance matrix computed from X according to the kernel function $k(\cdot,\cdot)$ and $\sigma^2$ is the variance of the Gaussian noise distribution.

Assume K to be low rank.

This leads to the Nyström approximation [Williams and Seeger, 2001].


Page 16: Scalable Gaussian Processes

Approximation by subset

Let's randomly pick a subset of the training data: $Z \in \mathbb{R}^{M \times Q}$.

Approximate the covariance matrix K by $\tilde{K}$:

$$\tilde{K} = K_z K_{zz}^{-1} K_z^\top, \quad \text{where } K_z = K(X,Z) \text{ and } K_{zz} = K(Z,Z).$$

Note that $K \in \mathbb{R}^{N \times N}$, $K_z \in \mathbb{R}^{N \times M}$ and $K_{zz} \in \mathbb{R}^{M \times M}$.

The log-likelihood is approximated by

$$\log p(\mathbf{y}|X, \theta) \approx \log \mathcal{N}\left(\mathbf{y} \,\middle|\, \mathbf{0},\, K_z K_{zz}^{-1} K_z^\top + \sigma^2 I\right).$$
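A minimal NumPy sketch of this construction (my own illustration; cov_fn stands for whatever kernel function is in use and is an assumed helper):

import numpy as np

def nystrom_approx(X, Z, cov_fn):
    # K_tilde = K_z K_zz^{-1} K_z^T, built from an M-point subset Z of X.
    K_z = cov_fn(X, Z)                       # N x M
    K_zz = cov_fn(Z, Z)                      # M x M
    # Solve instead of explicit inverse; small jitter for numerical stability.
    K_zz_inv_Kzt = np.linalg.solve(K_zz + 1e-8 * np.eye(len(Z)), K_z.T)
    return K_z @ K_zz_inv_Kzt                # N x N low-rank approximation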


Page 17: Scalable Gaussian Processes

Efficient computation using Woodbury formula

The naive formulation does not bring any computational benefits.

$$\mathcal{L} = -\frac{1}{2}\log\left|2\pi(\tilde{K} + \sigma^2 I)\right| - \frac{1}{2}\mathbf{y}^\top(\tilde{K} + \sigma^2 I)^{-1}\mathbf{y}$$

Apply the Woodbury formula:

$$\left(K_z K_{zz}^{-1} K_z^\top + \sigma^2 I\right)^{-1} = \sigma^{-2}I - \sigma^{-4}K_z\left(K_{zz} + \sigma^{-2}K_z^\top K_z\right)^{-1}K_z^\top$$

Note that $\left(K_{zz} + \sigma^{-2}K_z^\top K_z\right) \in \mathbb{R}^{M \times M}$.

The computational complexity reduces to $O(NM^2)$.
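A sketch of how the Woodbury identity turns the solve into $M \times M$ operations (my own illustration under the notation above, not code from the talk):

import numpy as np

def woodbury_solve(K_z, K_zz, y, sigma2):
    # Computes (K_z K_zz^{-1} K_z^T + sigma^2 I)^{-1} y in O(N M^2) using
    # (...)^{-1} = sigma^{-2} I - sigma^{-4} K_z (K_zz + sigma^{-2} K_z^T K_z)^{-1} K_z^T
    A = K_zz + K_z.T @ K_z / sigma2          # M x M
    inner = np.linalg.solve(A, K_z.T @ y)    # M-vector
    return y / sigma2 - K_z @ inner / sigma2**2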


Page 18: Scalable Gaussian Processes

Nystrom approximation

The above approach is called the Nyström approximation [Williams and Seeger, 2001].

The approximation is done directly on the covariance matrix, without the concept of pseudo data.

The approximation becomes exact if the whole data set is taken, i.e., $KK^{-1}K^\top = K$.

The subset selection is done randomly.


Page 19: Scalable Gaussian Processes

Gaussian process with Pseudo Data (1)

Snelson and Ghahramani [2006] propose the idea of having pseudo data, which is later referred to as the fully independent training conditional (FITC) approximation.

Augment the training data (X, y) with pseudo data u at locations Z.

$$p\left(\begin{bmatrix}\mathbf{y}\\\mathbf{u}\end{bmatrix} \,\middle|\, \begin{bmatrix}X\\Z\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\mathbf{y}\\\mathbf{u}\end{bmatrix} \,\middle|\, \mathbf{0},\; \begin{bmatrix}K_{ff} + \sigma^2 I & K_{fu}\\ K_{fu}^\top & K_{uu}\end{bmatrix}\right)$$

where $K_{ff} = K(X,X)$, $K_{fu} = K(X,Z)$ and $K_{uu} = K(Z,Z)$.


Page 20: Scalable Gaussian Processes

Gaussian process with Pseudo Data (2)

Thanks to the marginalization property of Gaussian distribution,

$$p(\mathbf{y}|X) = \int_{\mathbf{u}} p(\mathbf{y},\mathbf{u}|X,Z).$$

Further re-arrange the notation:

$$p(\mathbf{y},\mathbf{u}|X,Z) = p(\mathbf{y}|\mathbf{u},X,Z)\, p(\mathbf{u}|Z)$$

where $p(\mathbf{u}|Z) = \mathcal{N}(\mathbf{u}|\mathbf{0}, K_{uu})$ and

$$p(\mathbf{y}|\mathbf{u},X,Z) = \mathcal{N}\left(\mathbf{y} \,\middle|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; K_{ff} - K_{fu}K_{uu}^{-1}K_{fu}^\top + \sigma^2 I\right).$$


Page 21: Scalable Gaussian Processes

FITC approximation (1)

So far, $p(\mathbf{y}|X)$ has not been changed, but there is no speed-up: $K_{ff} \in \mathbb{R}^{N \times N}$ appears in $K_{ff} - K_{fu}K_{uu}^{-1}K_{fu}^\top + \sigma^2 I$.

The FITC approximation assumes

$$p(\mathbf{y}|\mathbf{u},X,Z) = \mathcal{N}\left(\mathbf{y} \,\middle|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; \Lambda + \sigma^2 I\right),$$

where $\Lambda = \left(K_{ff} - K_{fu}K_{uu}^{-1}K_{fu}^\top\right) \circ I$.
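A minimal sketch of the diagonal correction $\Lambda$ (my own illustration; the kernel matrices are assumed precomputed, and only the diagonal of $K_{ff}$ is needed):

import numpy as np

def fitc_noise_variances(K_ff_diag, K_fu, K_uu, sigma2):
    # Q_ff = K_fu K_uu^{-1} K_fu^T (low rank); Lambda keeps only the diagonal
    # of the residual K_ff - Q_ff, so the effective noise is Lambda + sigma^2 I.
    V = np.linalg.solve(np.linalg.cholesky(K_uu), K_fu.T)   # M x N
    Q_ff_diag = np.sum(V**2, axis=0)                        # diag of K_fu K_uu^{-1} K_fu^T
    Lambda = K_ff_diag - Q_ff_diag
    return Lambda + sigma2                                  # per-point noise variances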


Page 22: Scalable Gaussian Processes

FITC approximation (2)

Marginalize u from the model definition:

$$p(\mathbf{y}|X,Z) = \mathcal{N}\left(\mathbf{y} \,\middle|\, \mathbf{0},\; K_{fu}K_{uu}^{-1}K_{fu}^\top + \Lambda + \sigma^2 I\right)$$

The Woodbury formula can be applied in the same way as in the Nyström approximation:

$$\left(K_z K_{zz}^{-1} K_z^\top + \Lambda + \sigma^2 I\right)^{-1} = A - A K_z\left(K_{zz} + K_z^\top A K_z\right)^{-1} K_z^\top A,$$

where $A = (\Lambda + \sigma^2 I)^{-1}$.


Page 23: Scalable Gaussian Processes

FITC approximation (3)

FITC does not require the pseudo data to be a subset of the training data.

The inducing inputs Z can be optimized via gradient optimization.

Like the Nyström approximation, when taking all the training data as inducing inputs, the FITC approximation is equivalent to the original GP:

$$p(\mathbf{y}|X, Z = X) = \mathcal{N}\left(\mathbf{y}|\mathbf{0}, K_{ff} + \sigma^2 I\right)$$

FITC can be combined easily with expectation propagation (EP). Bui et al. [2017] provide an overview and a nice connection with the variational sparse GP.


Page 24: Scalable Gaussian Processes

Model Approximation vs. Approximate Inference

When the exact model/inference is intractable, there are typically two types of approaches:

Approximate the original model with a simpler one such that inference becomes tractable, like the Nyström approximation and FITC.

Keep the original model but derive an approximate inference method which is often not able to return the true answer, like variational inference.


Page 25: Scalable Gaussian Processes

Model Approximation vs. Approximate Inference

A problem with model approximation is that, when an approximated model requires some tuning, e.g., for hyper-parameters, it is unclear how to improve it based on training data.

In the case of FITC, we know the model is correct if Z = X; however, optimizing Z will not necessarily lead to a better location.

In fact, optimizing Z can lead to overfitting [Quinonero-Candela and Rasmussen, 2005].


Page 26: Scalable Gaussian Processes

Variational Sparse Gaussian Process (1)

Titsias [2009] introduces a variational approach for sparse GP.

It follows the same concept of pseudo data:

$$p(\mathbf{y}|X) = \int_{\mathbf{f},\mathbf{u}} p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u},X,Z)\, p(\mathbf{u}|Z)$$

where $p(\mathbf{u}|Z) = \mathcal{N}(\mathbf{u}|\mathbf{0}, K_{uu})$ and

$$p(\mathbf{f}|\mathbf{u},X,Z) = \mathcal{N}\left(\mathbf{f} \,\middle|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; K_{ff} - K_{fu}K_{uu}^{-1}K_{fu}^\top\right).$$


Page 27: Scalable Gaussian Processes

Variational Sparse Gaussian Process (2)

Instead of approximating the model, Titsias [2009] derives a variational lower bound.

Normally, a variational lower bound of a marginal likelihood, also known as the evidence lower bound (ELBO), looks like

$$\log p(\mathbf{y}|X) = \log \int_{\mathbf{f},\mathbf{u}} p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u},X,Z)\, p(\mathbf{u}|Z) \;\geq\; \int_{\mathbf{f},\mathbf{u}} q(\mathbf{f},\mathbf{u}) \log \frac{p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u},X,Z)\, p(\mathbf{u}|Z)}{q(\mathbf{f},\mathbf{u})}.$$


Page 28: Scalable Gaussian Processes

Special Variational Posterior

Titsias [2009] defines an unusual variational posterior:

$$q(\mathbf{f},\mathbf{u}) = p(\mathbf{f}|\mathbf{u},X,Z)\, q(\mathbf{u}), \quad \text{where } q(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\boldsymbol{\mu}, \Sigma).$$

Plug it into the lower bound:

$$\mathcal{L} = \int_{\mathbf{f},\mathbf{u}} p(\mathbf{f}|\mathbf{u},X,Z)\, q(\mathbf{u}) \log \frac{p(\mathbf{y}|\mathbf{f})\, \cancel{p(\mathbf{f}|\mathbf{u},X,Z)}\, p(\mathbf{u}|Z)}{\cancel{p(\mathbf{f}|\mathbf{u},X,Z)}\, q(\mathbf{u})}$$

$$= \langle \log p(\mathbf{y}|\mathbf{f}) \rangle_{p(\mathbf{f}|\mathbf{u},X,Z)q(\mathbf{u})} - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u}|Z)\right)$$

$$= \left\langle \log \mathcal{N}\left(\mathbf{y} \,\middle|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; \sigma^2 I\right) \right\rangle_{q(\mathbf{u})} - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u}|Z)\right)$$


Page 29: Scalable Gaussian Processes

Special Variational Posterior

There is no inversion of any big covariance matrices in the first term:

$$-\frac{N}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\left\langle \left(K_{fu}K_{uu}^{-1}\mathbf{u} - \mathbf{y}\right)^\top \left(K_{fu}K_{uu}^{-1}\mathbf{u} - \mathbf{y}\right) \right\rangle_{q(\mathbf{u})}$$

The overall complexity of the lower bound is $O(NM^2)$.
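For completeness (a standard Gaussian identity, not spelled out on the slide), writing $A = K_{fu}K_{uu}^{-1}$ and $q(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\boldsymbol{\mu}, \Sigma)$, the expectation of the quadratic form has the closed form

$$\left\langle (A\mathbf{u} - \mathbf{y})^\top(A\mathbf{u} - \mathbf{y}) \right\rangle_{q(\mathbf{u})} = (A\boldsymbol{\mu} - \mathbf{y})^\top(A\boldsymbol{\mu} - \mathbf{y}) + \operatorname{tr}\left(A\Sigma A^\top\right),$$

and both terms cost $O(NM^2)$ since $A \in \mathbb{R}^{N \times M}$.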


Page 30: Scalable Gaussian Processes

Tighten the Bound

Find the optimal parameters of q(u):

$$\boldsymbol{\mu}^*, \Sigma^* = \arg\max_{\boldsymbol{\mu},\Sigma} \mathcal{L}(\boldsymbol{\mu}, \Sigma).$$

Make the bound as tight as possible by plugging in µ∗ and Σ∗:

$$\mathcal{L} = \log \mathcal{N}\left(\mathbf{y} \,\middle|\, \mathbf{0},\; K_{fu}K_{uu}^{-1}K_{fu}^\top + \sigma^2 I\right) - \frac{1}{2\sigma^2}\operatorname{tr}\left(K_{ff} - K_{fu}K_{uu}^{-1}K_{fu}^\top\right).$$

The overall complexity of the lower bound remains $O(NM^2)$.
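The collapsed bound above can be evaluated without forming any $N \times N$ matrix. A compact NumPy sketch (my own illustration via the Woodbury identity and the determinant lemma, not the speaker's or any particular library's implementation; the kernel matrices are assumed precomputed):

import numpy as np

def collapsed_bound(K_ff_diag, K_fu, K_uu, y, sigma2):
    # L = log N(y | 0, Q_ff + sigma^2 I) - 1/(2 sigma^2) tr(K_ff - Q_ff),
    # with Q_ff = K_fu K_uu^{-1} K_fu^T, computed in O(N M^2).
    N, M = K_fu.shape
    Lu = np.linalg.cholesky(K_uu + 1e-8 * np.eye(M))
    A = np.linalg.solve(Lu, K_fu.T) / np.sqrt(sigma2)     # M x N, Q_ff = sigma2 * A^T A
    B = np.eye(M) + A @ A.T
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y)
    log_det = N * np.log(2 * np.pi * sigma2) + 2 * np.sum(np.log(np.diag(LB)))
    quad = (y @ y - c @ c) / sigma2
    trace_term = (np.sum(K_ff_diag) - sigma2 * np.sum(A**2)) / (2 * sigma2)
    return -0.5 * (log_det + quad) - trace_term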


Page 31: Scalable Gaussian Processes

Variational sparse GP

Note that $\mathcal{L}$ is not a valid log-pdf: $\int_{\mathbf{y}} \exp(\mathcal{L}(\mathbf{y}))\,\mathrm{d}\mathbf{y} \leq 1$, due to the trace term.

As the inducing points are variational parameters, optimizing the inducing inputs Z always leads to a better bound.

The model does not “overfit” with too many inducing points.

[Figure: a variational sparse GP fit showing the mean, the inducing inputs, the data, and the confidence interval.]

Page 32: Scalable Gaussian Processes

Are big covariance matrices always (almost) low-rank?

Of course not.

A time series example: $y = f(t) + \epsilon$.

The data are collected continuously at even time intervals.


Page 33: Scalable Gaussian Processes

A time series example: 10 data points

When we observe until t = 1.0:


Page 34: Scalable Gaussian Processes

A time series example: 100 data points

When we observe until t = 10.0:


Page 35: Scalable Gaussian Processes

A time series example: 1000 data points

When we observe until t = 100.0:


Page 36: Scalable Gaussian Processes

Banded precision matrix

For kernels like the Matérn family, the precision matrix is banded.

For example, the Matérn-1/2 kernel, also known as the exponential kernel:

$$k(x, x') = \sigma^2 \exp\left(-\frac{|x - x'|}{l}\right).$$

This slide is taken from Nicolas Durrande.


Page 37: Scalable Gaussian Processes

Closed form precision matrix

The precision matrix of Matérn kernels can be computed in closed form.

The lower triangular matrix from the Cholesky decomposition of the precision matrix is banded as well.

$$\log p(\mathbf{y}|X) = -\frac{1}{2}\log\left|2\pi(LL^\top)^{-1}\right| - \frac{1}{2}\operatorname{tr}\left(\mathbf{y}\mathbf{y}^\top LL^\top\right)$$

where $L$ is the lower triangular matrix from the Cholesky decomposition of the precision matrix $Q$, i.e., $Q = LL^\top$.

The computational complexity becomes O(N).
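As an illustration of the $O(N)$ computation (my own sketch, not from the talk): for a tridiagonal precision matrix, such as the standard AR(1)/Ornstein-Uhlenbeck form that corresponds to a Matérn-1/2 kernel on a regular grid, the banded Cholesky routine in SciPy gives the log-determinant, and the quadratic term is a sparse product:

import numpy as np
from scipy.linalg import cholesky_banded

def banded_loglik(y, diag, offdiag):
    # log N(y | 0, Q^{-1}) for a tridiagonal precision Q, in O(N):
    #   log p(y) = -1/2 log|2 pi Q^{-1}| - 1/2 y^T Q y
    N = len(y)
    # Lower banded storage: row 0 = main diagonal, row 1 = first sub-diagonal.
    ab = np.zeros((2, N))
    ab[0] = diag
    ab[1, :-1] = offdiag
    L = cholesky_banded(ab, lower=True)          # banded Cholesky factor of Q
    logdet_Q = 2 * np.sum(np.log(L[0]))          # |Q| = prod(diag(L))^2
    Qy = diag * y
    Qy[:-1] += offdiag * y[1:]
    Qy[1:] += offdiag * y[:-1]
    return 0.5 * (logdet_Q - N * np.log(2 * np.pi) - y @ Qy)

# Example: assumed AR(1)/OU precision for a Matern-1/2 kernel on a regular grid.
N, dt, l, var = 1000, 0.1, 1.0, 2.0
rho = np.exp(-dt / l)
s = 1.0 / (var * (1 - rho**2))
diag = s * np.r_[1.0, np.full(N - 2, 1 + rho**2), 1.0]
offdiag = np.full(N - 1, -s * rho)
print(banded_loglik(np.random.randn(N), diag, offdiag))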


Page 38: Scalable Gaussian Processes

Other approximations

deterministic/stochastic frequency approximation

distributed approximation

conjugate gradient methods for covariance matrix inversion


Page 39: Scalable Gaussian Processes

Q & A!


Page 40: Scalable Gaussian Processes

Parallel Sparse Gaussian Process

Beyond approximating the inference method, maybe we could exploit parallelization.

For Gaussian processes, this turns out to be very hard, because parallel Cholesky decomposition is very difficult.

Dai et al. [2014] and Gal et al. [2014] propose a parallel inference method for sparse GP.


Page 41: Scalable Gaussian Processes

Data Parallelism

Consider a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. Assume there are C computational cores/machines.

A data parallelism algorithm divides the data set into C partitions as evenly as possible: $\mathcal{D} = \bigcup_{c=1}^{C}\mathcal{D}_c$.

The parallelism happens in such a way that the function running on each core only requires the data from its local partition.


Page 42: Scalable Gaussian Processes

A simple example: neural network regression

$$l = \sum_{n=1}^{N} \|y_n - f_\theta(\mathbf{x}_n)\|^2 = \sum_{c=1}^{C} \sum_{n_c \in \mathcal{D}_c} \|y_{n_c} - f_\theta(\mathbf{x}_{n_c})\|^2$$

1 Each core computes its local objective $l_c = \sum_{n_c \in \mathcal{D}_c} \|y_{n_c} - f_\theta(\mathbf{x}_{n_c})\|^2$.

2 Each core computes the gradient of its local objective, $\partial l_c/\partial\theta$.

3 Aggregate all the local objectives and gradients: $l = \sum_{c=1}^{C} l_c$ and $\partial l/\partial\theta = \sum_{c=1}^{C} \partial l_c/\partial\theta$.

4 Take a step along the gradient following a gradient descent algorithm.

5 Repeat from Step 1 until convergence.
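A minimal single-process simulation of these steps for a plain linear model (my own illustration; the "cores" are just a Python loop over the partitions):

import numpy as np

def local_obj_and_grad(theta, Xc, yc):
    # Local objective l_c and its gradient on one partition D_c.
    resid = Xc @ theta - yc
    return np.sum(resid**2), 2 * Xc.T @ resid

# Toy data split into C partitions.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
C = 4
parts = np.array_split(np.arange(1000), C)

theta = np.zeros(5)
for it in range(100):
    # Steps 1-3: each "core" computes l_c and dl_c/dtheta, then aggregate.
    results = [local_obj_and_grad(theta, X[idx], y[idx]) for idx in parts]
    l = sum(r[0] for r in results)
    grad = sum(r[1] for r in results)
    theta -= 1e-4 * grad              # Step 4: gradient descent step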


Page 43: Scalable Gaussian Processes

Data Parallelism for Sparse GP

The variational lower bound (after applying the Woodbury formula) is

$$\mathcal{L} = -\frac{N}{2}\log 2\pi\sigma^2 + \frac{1}{2}\log\frac{|K_{uu}|}{|K_{uu} + \sigma^{-2}\Phi|} - \frac{1}{2\sigma^2}\mathbf{y}^\top\mathbf{y} + \frac{1}{2\sigma^4}\mathbf{y}^\top K_{fu}\left(K_{uu} + \sigma^{-2}\Phi\right)^{-1}K_{fu}^\top\mathbf{y} - \frac{1}{2\sigma^2}\phi + \frac{1}{2\sigma^2}\operatorname{tr}\left(K_{uu}^{-1}\Phi\right)$$

where $\Phi = K_{fu}^\top K_{fu}$ and $\phi = \operatorname{tr}(K_{ff})$.


Page 44: Scalable Gaussian Processes

Data Parallelism for Sparse GP

The lower bound is not fully distributable like in the simple example.

All the terms involving data can be written as a sum across data points:

$$\mathbf{y}^\top\mathbf{y} = \sum_{n=1}^{N} y_n^2, \quad \mathbf{y}^\top K_{fu} = \sum_{n=1}^{N} y_n K_{f_n u}, \quad \Phi = \sum_{n=1}^{N} K_{f_n u}^\top K_{f_n u}, \quad \phi = \sum_{n=1}^{N} K_{f_n f_n},$$

where $K_{f_n u} = K(\mathbf{x}_n, Z)$ and $K_{f_n f_n} = K(\mathbf{x}_n, \mathbf{x}_n)$.


Page 45: Scalable Gaussian Processes

Data Parallelism for Sparse GP

1 [local] Compute all the data-related terms locally: $\mathbf{y}_c^\top\mathbf{y}_c$, $\mathbf{y}_c^\top K_{f_c u}$, $\Phi_c$ and $\phi_c$.

2 [global] Aggregate all the local terms and compute the lower bound $\mathcal{L}$ on one node.

3 [global] Compute the gradient of the bound w.r.t. the model parameters.

4 [global] Compute the gradients w.r.t. the local terms, $\partial\mathcal{L}/\partial K_{f_c u}$, $\partial\mathcal{L}/\partial\Phi_c$ and $\partial\mathcal{L}/\partial\phi_c$, and broadcast them to the individual nodes.

5 [local] Compute the gradient contribution of the local terms and aggregate the local gradients into the final gradient.

6 [global] Take a gradient step and repeat from Step 1.


Page 46: Scalable Gaussian Processes

Data Parallelism for Sparse GP

[Figure: left, average time per iteration (seconds) against the number of data points for 1-32 CPU cores and 1-4 GPUs; right, the percentage of indistributable computational time for the same configurations.]

Page 47: Scalable Gaussian Processes

Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18:3649–3720, 2017.

Zhenwen Dai, Andreas Damianou, James Hensman, and Neil D. Lawrence. Gaussian process models with parallelization and GPU acceleration. In NIPS Workshop on Software Engineering for Machine Learning, 2014.

Yarin Gal, Mark van der Wilk, and Carl Edward Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems 27, pages 3257–3265, 2014.

Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
