
Nearly-Tight Sample Complexity Bounds for Learning Mixtures of Gaussians

Hassan Ashtiani¹², Shai Ben-David², Christopher Liaw³, Nicholas J.A. Harvey³, Abbas Mehrabian⁴, Yaniv Plan³

¹McMaster University, ²University of Waterloo, ³University of British Columbia, ⁴McGill University

Our results

Theorem. The sample complexity for learning mixtures of k Gaussians in R^d up to total variation distance ε is (Θ̃(·) suppresses polylog(kd/ε) factors)

• Θ̃(kd²/ε²) (for general Gaussians)

• Θ̃(kd/ε²) (for axis-aligned Gaussians)

Correspondingly, given n samples from the true distribution, the minimax risk is Õ(√(kd²/n)) and Õ(√(kd/n)), respectively.
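To spell out the correspondence (a short derivation added here for clarity; the poster states only the result), the risk bounds follow by solving the sample-complexity bound for ε:

```latex
% Invert n = \tilde{\Theta}(kd^2/\varepsilon^2) (general case) to get the achievable error:
n = \tilde{\Theta}\!\left(\frac{kd^2}{\varepsilon^2}\right)
\quad\Longleftrightarrow\quad
\varepsilon = \tilde{\Theta}\!\left(\sqrt{\frac{kd^2}{n}}\right);
% the axis-aligned case is identical with kd in place of kd^2.
```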

PAC Learning of Distributions

• Given i.i.d. samples from an unknown target distribution D, output D̂ such that

  dTV(D, D̂) = sup_E |Pr_D[E] − Pr_D̂[E]| = (1/2)‖f_D − f_D̂‖₁ ≤ ε.

• F: an arbitrary class of distributions (e.g. Gaussians).
  k-mix(F): the class of k-mixtures of F, i.e.

  k-mix(F) := { ∑_{i∈[k]} w_i D_i : w_i ≥ 0, ∑_i w_i = 1, D_i ∈ F }.

• Sample complexity of F is the minimum number m_F(ε) such that there is an algorithm that, given m_F(ε) samples from D, outputs D̂ with dTV(D, D̂) ≤ ε.

• PAC learning is not equivalent to parameter estimation, where the goal is to recover the parameters of the distribution.
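As a concrete toy instance of the k-mix(F) definition (our illustration; the weights and parameters below are arbitrary), sampling from ∑_i w_i D_i means choosing component i with probability w_i and then drawing from D_i:

```python
# Toy sampler for a k-mixture of one-dimensional Gaussians (F = Gaussians in R).
# The mixing weights and component parameters are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

w = np.array([0.5, 0.3, 0.2])        # mixing weights: w_i >= 0, sum_i w_i = 1
mus = np.array([-2.0, 0.0, 3.0])     # component means
sigmas = np.array([1.0, 0.5, 2.0])   # component standard deviations

def sample_k_mix(n):
    """Draw n i.i.d. samples from the mixture sum_i w_i * N(mu_i, sigma_i^2)."""
    comps = rng.choice(len(w), size=n, p=w)   # pick component i with probability w_i
    return rng.normal(mus[comps], sigmas[comps])

samples = sample_k_mix(10_000)
```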

Compression Framework

We develop a novel compression framework that uses few samples to build a representative family of distributions.

1. Encoder is given the true distribution D ∈ F & draws m(ε) points from D.

2. Encoder sends t(ε) points and/or “helper” bits to the decoder.

3. Decoder outputs D̂ ∈ F such that dTV(D, D̂) ≤ ε w.p. 2/3.

If this is possible, we say F is (m(ε), t(ε))-compressible.
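The shape of the protocol can be pictured as the following minimal interface sketch (the names Message, Encoder, Decoder are ours, purely for illustration):

```python
# Illustrative interface for the compression protocol; type names are ours.
from typing import Callable, List, Tuple

# A message consists of t(eps) of the drawn sample points plus helper bits.
Message = Tuple[List[float], List[int]]

# The encoder knows the true distribution D and sees the m(eps) points drawn from it.
Encoder = Callable[[List[float]], Message]

# The decoder sees only the message and must output some D-hat in F with
# d_TV(D, D-hat) <= eps, with probability at least 2/3 over the draws.
Decoder = Callable[[Message], object]
```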

Compression Theorem

Compression Theorem [ABHLMP ’18]. If F is (m(ε), t(ε))-compressible, then the sample complexity for learning F is Õ(m(ε) + t(ε)/ε²).

Compression of Mixtures

Lemma. If F is (m(ε), t(ε))-compressible, then k-mix(F) is (k·m(ε/k)/ε, k·t(ε/k))-compressible.

Compression Theorem for Mixtures. If F is (m(ε), t(ε))-compressible, then the sample complexity for learning k-mix(F) is Õ(k·m(ε/k)/ε + k·t(ε/k)/ε²).

Example: Gaussians in R

Claim. Gaussians in R are (1/ε, 2)-compressible.

1. True distribution is N(µ, σ²); encoder draws 1/ε points from N(µ, σ²).

2. With high probability, ∃ X_i ≈ µ + σ and X_j ≈ µ − σ.

3. Encoder sends X_i, X_j; decoder recovers µ, σ approximately.
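A minimal numerical sketch of this claim (our illustration; the encoder below “cheats” by using the true µ, σ to pick the two points, which the poster only guarantees exist with high probability among the draws):

```python
# Sketch of the (1/eps, 2)-compression of a one-dimensional Gaussian.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, eps = 3.0, 2.0, 0.05

# Encoder: draw ~1/eps points from the true distribution N(mu, sigma^2).
X = rng.normal(mu, sigma, size=int(np.ceil(1 / eps)))

# Encoder: send the two draws closest to mu + sigma and mu - sigma.  (True
# parameters are used here only to exhibit such points; with ~1/eps draws
# some point typically lands near each target.)
X_i = X[np.argmin(np.abs(X - (mu + sigma)))]
X_j = X[np.argmin(np.abs(X - (mu - sigma)))]

# Decoder: recover mu and sigma approximately from the two sent points.
mu_hat = (X_i + X_j) / 2
sigma_hat = (X_i - X_j) / 2
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
```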

Outline of the Algorithm

Assume: (i) F is (m(ε), t(ε))-compressible; (ii) true dist. D ∈ F.
Input: Error parameter ε > 0.

1. Draw m(ε) i.i.d. samples from D.

2. Encoder has at most m(ε)^{t(ε)} · 2^{t(ε)} possible messages, so enumerate all M = m(ε)^{t(ε)} · 2^{t(ε)} of the decoder’s outputs, D_1, . . . , D_M. By assumption, dTV(D_i, D) ≤ ε for some i.

3. Use the tournament algorithm [DL ’01] to find the best distribution amongst D_1, . . . , D_M; O(log(M)/ε²) samples suffice for this step.

Sample complexity is m(ε) + O(log(M)/ε²) = Õ(m(ε) + t(ε)/ε²).
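For step 3, here is a hedged sketch of a Scheffé-style tournament in the spirit of [DL ’01] (the pairwise-wins rule and the Monte Carlo probability estimates are our simplifications, not the exact procedure): each pair of candidates is compared on its Scheffé set A_ij = {x : f_i(x) > f_j(x)}, and the candidate whose mass on A_ij is closer to the empirical mass of A_ij wins the pairing.

```python
# Simplified Scheffe-style tournament over candidate distributions D_1, ..., D_M.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def scheffe_tournament(candidates, data, n_mc=20_000):
    """Return the candidate winning the most pairwise Scheffe contests.

    candidates: list of frozen scipy.stats distributions (the decoder outputs).
    data: i.i.d. samples from the unknown true distribution D.
    """
    m = len(candidates)
    wins = np.zeros(m)
    for i in range(m):
        for j in range(i + 1, m):
            # Scheffe set A_ij = {x : f_i(x) > f_j(x)}.
            in_A = lambda x: candidates[i].pdf(x) > candidates[j].pdf(x)
            emp = np.mean(in_A(data))  # empirical mass of A_ij under the data
            # Candidate probabilities of A_ij, estimated by Monte Carlo.
            p_i = np.mean(in_A(candidates[i].rvs(size=n_mc, random_state=rng)))
            p_j = np.mean(in_A(candidates[j].rvs(size=n_mc, random_state=rng)))
            # Whoever is closer to the empirical mass wins this pairing.
            if abs(p_i - emp) <= abs(p_j - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return candidates[int(np.argmax(wins))]

# Toy usage: true distribution N(0, 1) and three candidate hypotheses.
data = stats.norm(0, 1).rvs(size=2000, random_state=rng)
cands = [stats.norm(0, 1), stats.norm(0.5, 1), stats.norm(0, 2)]
best = scheffe_tournament(cands, data)
print(best.mean(), best.std())
```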

Proof of Upper Bound

Lemma. Gaussians in R^d are (O(d), Õ(d²))-compressible.

Sketch of lemma. Suppose the true Gaussian is N(µ, Σ).

• Encoder draws O(d) points from N(µ, Σ).

• Points give the rough shape of the ellipsoid induced by µ, Σ; encoder sends points & Õ(d²) bits; decoder approximates the ellipsoid.

• Decoder outputs N(µ̂, Σ̂).

Proof of upper bound. Combine lemma with compression theorem.
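Spelled out (our expansion of this one-line proof), plugging m(ε) = O(d) and t(ε) = Õ(d²) into the Compression Theorem for Mixtures gives the claimed rate:

```latex
% Sample complexity of k-mixtures of Gaussians in R^d via the mixture compression theorem:
\tilde{O}\!\left(\frac{k\,m(\varepsilon/k)}{\varepsilon} + \frac{k\,t(\varepsilon/k)}{\varepsilon^2}\right)
= \tilde{O}\!\left(\frac{kd}{\varepsilon} + \frac{kd^2}{\varepsilon^2}\right)
= \tilde{O}\!\left(\frac{kd^2}{\varepsilon^2}\right)
\qquad (\text{since } \varepsilon \le 1 \le d).
```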

Lower Bound Technique

Theorem (Fano’s Inequality). If D_1, . . . , D_r are distributions such that dTV(D_i, D_j) ≥ ε and KL(D_i, D_j) ≤ ε² for all i ≠ j, then the sample complexity is Ω(log(r)/ε²).

• Use the probabilistic method to find 2^{Ω(d²)} Gaussian distributions satisfying the hypothesis of Fano’s Inequality.

• Repeat the following procedure 2^{Ω(d²)} times:

  1. Start with the identity covariance matrix.

  2. Choose a random subspace S_a of dimension d/10 & perturb the eigenvalues by ε/√d along S_a.

  Let Σ_a be the corresponding covariance matrix and D_a = N(0, Σ_a).

Claim. If a ≠ b, then KL(D_a, D_b) ≤ ε² and dTV(D_a, D_b) ≥ ε with probability 1 − exp(−Ω(d²)).

Can lift the construction to get 2^{Ω(kd²)} k-mixtures of d-dimensional Gaussians satisfying Fano’s Inequality.
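A hedged numerical sketch of the construction (the QR factorization used to draw a uniformly random subspace is our implementation choice; the subspace dimension d/10 and the ε/√d eigenvalue perturbation follow the poster):

```python
# Random covariance perturbation used in the lower-bound construction.
import numpy as np

def random_perturbed_covariance(d, eps, rng):
    """Return Sigma_a = I + (eps/sqrt(d)) * P, with P the projection onto a random d/10-dim subspace S_a."""
    s = max(1, d // 10)                 # dimension of the perturbed subspace
    G = rng.standard_normal((d, s))
    Q, _ = np.linalg.qr(G)              # orthonormal basis of a random subspace S_a
    P = Q @ Q.T                         # projection onto S_a
    # Eigenvalues: 1 + eps/sqrt(d) along S_a, and 1 on its orthogonal complement.
    return np.eye(d) + (eps / np.sqrt(d)) * P

rng = np.random.default_rng(0)
d, eps = 50, 0.1
Sigma_a = random_perturbed_covariance(d, eps, rng)
Sigma_b = random_perturbed_covariance(d, eps, rng)
# Each D_a = N(0, Sigma_a); the claim is that, with probability 1 - exp(-Omega(d^2))
# over the random subspaces, distinct pairs are eps-far in TV yet eps^2-close in KL.
```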

Remark. Lower bound for axis-aligned proved by [SOAJ ’14].

References

[DL ’01] Devroye, L., & Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer Science & Business Media.

[SOAJ ’14] Suresh, A. T., Orlitsky, A., Acharya, J., & Jafarpour, A. (2014). Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (pp. 1395–1403).