Al Parker September 14, 2010 Drawing samples from high dimensional Gaussians using polynomials

Transcript of Al Parker September 14, 2010 Drawing samples from high dimensional Gaussians using polynomials

  • Slide 1
  • Al Parker September 14, 2010 Drawing samples from high dimensional Gaussians using polynomials
  • Slide 2
  • Acknowledgements: Colin Fox, Physics, University of Otago; New Zealand Institute of Mathematics, University of Auckland; Center for Biofilm Engineering, Bozeman
  • Slide 3
  • The normal or Gaussian distribution
  • Slide 4
  • How to sample from a Gaussian N(μ, σ²)? Sample z ~ N(0,1), then y = (σ²)^{1/2} z + μ ~ N(μ, σ²)
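As a quick illustration of this recipe (a minimal NumPy sketch; the mean and variance values are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 1.5, 4.0                 # illustrative mean and variance
z = rng.standard_normal()             # z ~ N(0, 1)
y = np.sqrt(sigma2) * z + mu          # y = (sigma^2)^{1/2} z + mu ~ N(mu, sigma^2)
```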
  • Slide 5
  • The multivariate Gaussian distribution
  • Slide 6
  • Slide 7
  • How to sample from a Gaussian N(μ, Σ)? Sample z ~ N(0, I), then y = Σ^{1/2} z + μ ~ N(μ, Σ) (e.g. y = W Λ^{1/2} z + μ)
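The same recipe in the multivariate case, as a sketch in NumPy with an arbitrary small SPD covariance chosen purely for illustration; either the Cholesky factor or the eigen-decomposition can play the role of Σ^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])                  # illustrative mean
Sigma = np.array([[2.0, 0.5, 0.0],               # illustrative SPD covariance
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# Option 1: Sigma^{1/2} = C, the lower-triangular Cholesky factor (Sigma = C C^T)
C = np.linalg.cholesky(Sigma)
z = rng.standard_normal(3)                       # z ~ N(0, I)
y = C @ z + mu                                   # y ~ N(mu, Sigma)

# Option 2: Sigma^{1/2} = W Lam^{1/2} from the eigen-decomposition Sigma = W Lam W^T
lam, W = np.linalg.eigh(Sigma)
y2 = W @ (np.sqrt(lam) * rng.standard_normal(3)) + mu   # also ~ N(mu, Sigma)
```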
  • Slide 8
  • Example: From 64 faces, modeling face space with a Gaussian Process N(μ, Σ). Pixel intensity at the ith row and jth column is y(s(i,j)); y(s) ∈ R^{112 x 112}, μ(s) ∈ R^{112 x 112}, Σ(s,s') ∈ R^{12544 x 12544}
  • Slide 9
  • ~ N(μ, Σ)
  • Slide 10
  • How to estimate μ, Σ for N(μ, Σ)? MLE/BLUE (least squares), MVQUE, or use a Bayesian posterior via MCMC
  • Slide 11
  • Another example: Interpolation
  • Slide 12
  • One can assume a covariance function which has some parameters
  • Slide 13
  • I used a Bayesian posterior for the covariance parameters given the data to construct Σ|data
  • Slide 14
  • Simulating the process: samples from N(μ, Σ|data), y = Σ^{1/2} z + μ ~ N(μ, Σ|data)
  • Slide 15
  • Gaussian Processes modeling global ozone. Cressie and Johannesson, Fixed rank kriging for very large spatial datasets, 2006
  • Slide 16
  • Gaussian Processes modeling global ozone
  • Slide 17
  • The problem: To generate a sample y = Σ^{1/2} z + μ ~ N(μ, Σ), how to calculate the factorization Σ = Σ^{1/2} (Σ^{1/2})^T? Σ^{1/2} = W Λ^{1/2} by eigen-decomposition, 10/3 n^3 flops; Σ^{1/2} = C by Cholesky factorization, 1/3 n^3 flops. For LARGE Gaussians (n > 10^5, e.g. in image analysis and global data sets), these approaches are not possible: n^3 flops is computationally TOO EXPENSIVE, and storing an n x n matrix requires TOO MUCH MEMORY
  • Slide 18
  • Some solutions: Work with sparse precision matrix Σ^{-1} models (Rue, 2001); circulant embeddings (Gneiting et al, 2005); iterative methods. Advantages: COST: n^2 flops per iteration; MEMORY: only vectors of size n x 1 need be stored. Disadvantages: if the method runs for n iterations, then there is no cost savings over a direct method
  • Slide 19
  • Gibbs: an iterative sampler of N(0, A) and N(0, A^{-1}). Let A = Σ or A = Σ^{-1}. 1. Split A into D = diag(A), L = lower(A), L^T = upper(A). 2. Sample z ~ N(0, I). 3. Take conditional samples in each coordinate direction, so that a full sweep of all n coordinates is y_k = -D^{-1} L y_k - D^{-1} L^T y_{k-1} + D^{-1/2} z. y_k converges in distribution geometrically to N(0, A^{-1}); A y_k converges in distribution geometrically to N(0, A)
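A minimal sketch of one such sweep in NumPy, assuming A is a dense SPD precision matrix (for sparse A each row update would only touch the nonzero entries):

```python
import numpy as np

def gibbs_sweep(A, y, rng):
    """One forward sweep of the coordinate-wise Gibbs sampler targeting N(0, A^{-1}).

    Written componentwise this is y_k = -D^{-1} L y_k - D^{-1} L^T y_{k-1} + D^{-1/2} z,
    because each coordinate update uses the coordinates already updated in this sweep.
    """
    n = A.shape[0]
    for i in range(n):
        # Full conditional of y_i given the other (current) coordinates:
        # mean = -(1/A_ii) * sum_{j != i} A_ij y_j, variance = 1/A_ii
        s = A[i, :] @ y - A[i, i] * y[i]
        y[i] = -s / A[i, i] + rng.standard_normal() / np.sqrt(A[i, i])
    return y
```

Repeating the sweep gives iterates whose distribution converges to N(0, A^{-1}); as stated above, A y_k then converges in distribution to N(0, A).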
  • Slide 20
  • Gibbs: an iterative sampler. Gibbs sampling from N(μ, Σ) starting from (0,0)
  • Slide 21
  • Gibbs: an iterative sampler. Gibbs sampling from N(μ, Σ) starting from (0,0)
  • Slide 22
  • There's a link to solving Ax=b: solving Ax=b is equivalent to minimizing an n-dimensional quadratic (when A is spd), and a Gaussian is sufficiently specified by the same quadratic (with A = Σ^{-1} and b = Aμ), written out below
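For completeness, the quadratic referred to here is the standard one: f(x) = (1/2) x^T A x - b^T x, whose minimizer is x = A^{-1} b = μ, while the Gaussian density is proportional to exp(-(1/2) y^T A y + b^T y); the same A and b therefore specify both the optimization problem and the distribution.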
  • Slide 23
  • Gauss-Seidel linear solve of Ax=b: 1. Split A into D = diag(A), L = lower(A), L^T = upper(A). 2. Minimize the quadratic f(x) in each coordinate direction, so that a full sweep of all n coordinates is x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b. x_k converges geometrically to A^{-1} b
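The corresponding Gauss-Seidel sweep, sketched in the same style (dense A assumed for clarity), is the identical coordinate-by-coordinate update with the injected noise replaced by b:

```python
import numpy as np

def gauss_seidel_sweep(A, b, x):
    """One forward Gauss-Seidel sweep for Ax = b.

    Componentwise this is x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b:
    each coordinate is set to the exact minimizer of the quadratic along that direction.
    """
    n = A.shape[0]
    for i in range(n):
        s = A[i, :] @ x - A[i, i] * x[i]
        x[i] = (b[i] - s) / A[i, i]
    return x
```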
  • Slide 24
  • Gauss-Seidel linear solve of Ax=b
  • Slide 25
  • x_k converges geometrically to A^{-1} b: (x_k - A^{-1} b) = G^k (x_0 - A^{-1} b), where ρ(G) < 1
  • Slide 26
  • Theorem: A Gibbs sampler is a Gauss-Seidel linear solver. Proof: A Gibbs sampler is y_k = -D^{-1} L y_k - D^{-1} L^T y_{k-1} + D^{-1/2} z, while a Gauss-Seidel linear solve of Ax=b is x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b
  • Slide 27
  • Gauss-Seidel is a stationary linear solver. A Gauss-Seidel linear solve of Ax=b is x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b. Gauss-Seidel can be written as M x_k = N x_{k-1} + b where M = D + L and N = -L^T, with A = M - N, the general form of a stationary linear solver
  • Slide 28
  • Stationary linear solvers of Ax=b: 1. Split A = M - N. 2. Iterate M x_k = N x_{k-1} + b, i.e. x_k = M^{-1} N x_{k-1} + M^{-1} b = G x_{k-1} + M^{-1} b. x_k converges geometrically to A^{-1} b: (x_k - A^{-1} b) = G^k (x_0 - A^{-1} b) when ρ(G) = ρ(M^{-1} N) < 1
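A sketch of this generic iteration, assuming dense NumPy matrices and a user-supplied splitting matrix M (e.g. M = np.tril(A) gives the Gauss-Seidel splitting):

```python
import numpy as np

def stationary_solve(A, b, M, n_iter):
    """Iterate M x_k = N x_{k-1} + b with N = M - A.

    Converges geometrically to A^{-1} b when rho(M^{-1} N) < 1.
    """
    N = M - A
    x = np.zeros_like(b, dtype=float)
    for _ in range(n_iter):
        x = np.linalg.solve(M, N @ x + b)
    return x
```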
  • Slide 29
  • Stationary samplers from stationary solvers. Solving Ax=b: 1. Split A = M - N. 2. Iterate M x_k = N x_{k-1} + b; then x_k converges to A^{-1} b if ρ(M^{-1} N) < 1. Sampling from N(0, A) and N(0, A^{-1}): 1. Split A = M - N. 2. Iterate M y_k = N y_{k-1} + c_{k-1} where c_{k-1} ~ N(0, M^T + N); then y_k converges to N(0, A^{-1}) and A y_k converges to N(0, A) if ρ(M^{-1} N) < 1
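The sampling analogue, sketched under the same assumptions and additionally assuming M^T + N is SPD so the noise can be drawn via a (dense, purely illustrative) Cholesky factor; for the Gauss-Seidel splitting, M^T + N is just the diagonal D:

```python
import numpy as np

def stationary_sampler(A, M, n_iter, rng):
    """Iterate M y_k = N y_{k-1} + c_{k-1}, with N = M - A and c_{k-1} ~ N(0, M^T + N).

    If rho(M^{-1} N) < 1 the iterates converge in distribution to N(0, A^{-1}),
    and A y_k converges in distribution to N(0, A).
    """
    N = M - A
    C = np.linalg.cholesky(M.T + N)          # factor of the noise covariance M^T + N
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        c = C @ rng.standard_normal(len(y))  # c ~ N(0, M^T + N)
        y = np.linalg.solve(M, N @ y + c)    # M y_k = N y_{k-1} + c_{k-1}
    return y
```

For example, stationary_sampler(A, np.tril(A), 200, np.random.default_rng(0)) runs the Gibbs/Gauss-Seidel version, since M = D + L gives N = -L^T and M^T + N = D.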
  • Slide 30
  • How to sample c_{k-1} ~ N(0, M^T + N)? Gauss-Seidel: M = D + L, c_{k-1} ~ N(0, D). SOR (successive over-relaxation): M = (1/w) D + L, c_{k-1} ~ N(0, ((2-w)/w) D). Richardson: M = I, c_{k-1} ~ N(0, 2I - A). Jacobi: M = D, c_{k-1} ~ N(0, 2D - A)
  • Slide 31
  • Theorem: A stationary linear solver converges iff the corresponding stationary sampler converges, and the convergence of both is geometric. Proof: They have the same iteration operator. For linear solves, x_k = G x_{k-1} + M^{-1} b, so that (x_k - A^{-1} b) = G^k (x_0 - A^{-1} b). For sampling, y_k = G y_{k-1} + M^{-1} c_{k-1}, so E(y_k) = G^k E(y_0) and Var(y_k) = A^{-1} - G^k A^{-1} G^{kT}. Proof for Gaussians given by Barone and Frigessi, 1990; for arbitrary distributions by Duflo, 1997
  • Slide 32
  • Acceleration schemes for stationary linear solvers can be used to accelerate stationary samplers. Polynomial acceleration of a stationary solver of Ax=b is: 1. Split A = M - N. 2. Iterate x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k), which replaces (x_k - A^{-1} b) = G^k (x_0 - A^{-1} b) with a k-th order polynomial (x_k - A^{-1} b) = p(G) (x_0 - A^{-1} b)
  • Slide 33
  • Chebyshev acceleration: x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (it is not very expensive to get estimates of these eigenvalues). Gauss-Seidel converged like this
  • Slide 34
  • x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (not very expensive to get estimates of these eigenvalues); convergence (geometric-like) with Chebyshev acceleration
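For concreteness, here is a sketch of Chebyshev acceleration on the linear-solver side, written in the standard semi-iterative form (as in textbook treatments such as Saad's) rather than the exact (v_k, u_k) parameterization shown on the slide. It assumes A is SPD, M is an SPD splitting/preconditioning matrix, and that bounds lmin, lmax on the eigenvalues of M^{-1} A are available; these follow from the two extreme eigenvalues of G, since λ(M^{-1} A) = 1 - λ(G):

```python
import numpy as np

def chebyshev_solve(A, b, M, lmin, lmax, n_iter):
    """Chebyshev-accelerated iteration for Ax = b using the splitting matrix M.

    lmin, lmax bound the eigenvalues of M^{-1} A; cheap estimates (e.g. a few
    Lanczos or power iterations) are enough in practice.
    """
    theta = (lmax + lmin) / 2.0
    delta = (lmax - lmin) / 2.0
    sigma = theta / delta
    rho = 1.0 / sigma
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    d = np.linalg.solve(M, r) / theta
    for _ in range(n_iter):
        x = x + d
        r = r - A @ d
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * np.linalg.solve(M, r)
        rho = rho_new
    return x
```

The accelerated sampler on the next slide has the same structure, with b replaced by the iteration-dependent noise c_k.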
  • Slide 35
  • Polynomial accelerated stationary sampler from N(0, A) and N(0, A^{-1}): 1. Split A = M - N. 2. Iterate y_{k+1} = (1 - v_k) y_{k-1} + v_k y_k + v_k u_k M^{-1} (c_k - A y_k), where c_k ~ N(0, ((2 - v_k)/v_k) (((2 - u_k)/u_k) M^T + N))
  • Slide 36
  • Theorem: A polynomial accelerated sampler converges with the same convergence rate as the corresponding linear solver as long as v_k, u_k are independent of the iterates y_k. (Plots: Gibbs sampler vs. Chebyshev accelerated Gibbs)
  • Slide 37
  • Chebyshev acceleration is guaranteed to be faster than a Gibbs sampler. Covariance matrix convergence: ||A^{-1} - S_k||_2
  • Slide 38
  • Chebyshev accelerated Gibbs sample in 10^6 dimensions: data = SPHERE + ε; sample from (SPHERE | data), where ε ~ N(0, σ^2 I)
  • Slide 39
  • Conclusions: Gaussian Processes are cool! Common techniques from numerical linear algebra can be used to sample from Gaussians: Cholesky factorization (precise but expensive); any stationary linear solver can be used as a stationary sampler (inexpensive but with geometric convergence); stationary samplers can be accelerated by polynomials (guaranteed!). Polynomial accelerated samplers: Chebyshev, Conjugate Gradients, Lanczos sampler
  • Slide 40
  • Slide 41
  • Estimation of (σ^2, r) from the data using a Markov chain
  • Slide 42
  • Marginal Posteriors
  • Slide 43
  • Conjugate Gradient (CG) acceleration: x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k), where v_k, u_k are functions of the residuals b - A x_k; convergence guaranteed in at most n steps with CG acceleration
  • Slide 44
  • Conjugate Gradient (CG) acceleration: The theorem does not apply since the parameters v_k, u_k are functions of the residuals b - A y_k. We have devised an approach called a CD sampler to construct samples with covariance Var(y_k) = V_k D_k^{-1} V_k^T ≈ A^{-1}, where V_k is a matrix of unit-length residuals b - A x_k from the standard CG algorithm
  • Slide 45
  • A GOOD THING: The CG algorithm is a great linear solver! If the eigenvalues of A are in c clusters, then a solution to Ax=b is found in c iterations