
Nearly-Tight Sample Complexity Bounds for Learning Mixtures of Gaussians

Hassan Ashtiani12, Shai Ben-David2, Christopher Liaw3, Nicholas J.A. Harvey3, Abbas Mehrabian4, Yaniv Plan3

1McMaster University, 2University of Waterloo, 3University of British Columbia, 4McGill University

Our results

Theorem. The sample complexity of learning mixtures of k Gaussians in R^d up to total variation distance ε is (Θ(·) suppresses polylog(kd/ε) factors):

• Θ(kd²/ε²) for general Gaussians

• Θ(kd/ε²) for axis-aligned Gaussians

Correspondingly, given n samples from the true distribution, the minimax risk is O(√(kd²/n)) and O(√(kd/n)), respectively.

PAC Learning of Distributions

• Given i.i.d. samples from an unknown target distribution D, output D̂ such that

  dTV(D, D̂) = sup_E |Pr_D[E] − Pr_D̂[E]| = (1/2)‖f_D − f_D̂‖₁ ≤ ε.

• F: an arbitrary class of distributions (e.g. Gaussians).
  k-mix(F): k-mixtures of F, i.e. k-mix(F) := {∑_{i∈[k]} w_i D_i : w_i ≥ 0, ∑_i w_i = 1, D_i ∈ F}.

• The sample complexity of F is the minimum number m_F(ε) such that there is an algorithm that, given m_F(ε) samples from D, outputs D̂ with dTV(D, D̂) ≤ ε.

• PAC learning is not equivalent to parameter estimation, where the goal is to recover the parameters of the distribution.
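As a quick numerical illustration of the L1 form of this definition (not part of the poster; the grid-based integration below is just one way to approximate the integral), here is a short Python sketch computing dTV between two one-dimensional Gaussians:

```python
import numpy as np
from scipy.stats import norm

# dTV(D, D') = (1/2) * integral of |f_D(x) - f_D'(x)| dx, approximated on a grid.
f_D = norm(0.0, 1.0).pdf    # density of D  = N(0, 1)
f_Dh = norm(0.5, 1.0).pdf   # density of D' = N(0.5, 1)

x = np.linspace(-10.0, 10.0, 200_001)
tv = 0.5 * np.sum(np.abs(f_D(x) - f_Dh(x))) * (x[1] - x[0])
print(tv)  # ≈ 0.197 for this pair
```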

Compression Framework

We develop a novel compression framework that uses few samples to build a representative family of distributions.

1. The encoder is given the true distribution D ∈ F and draws m(ε) points from D.

2. The encoder sends t(ε) points and/or “helper” bits to the decoder.

3. The decoder outputs D̂ ∈ F such that dTV(D, D̂) ≤ ε with probability 2/3.

If this is possible, we say F is (m(ε), t(ε))-compressible.
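The protocol can be summarized as an encoder/decoder interface. The Python sketch below only fixes the shape of such a scheme; the class and method names are illustrative and not taken from the paper:

```python
from typing import List, Tuple

Sample = List[float]  # a sample is a point in R^d


class CompressionScheme:
    """An (m(eps), t(eps))-compression scheme for a class F of distributions."""

    def m(self, eps: float) -> int:
        """Number of samples the encoder draws from the true D in F."""
        raise NotImplementedError

    def t(self, eps: float) -> int:
        """Total budget of transmitted points and helper bits."""
        raise NotImplementedError

    def encode(self, eps: float, true_dist, samples: List[Sample]) -> Tuple[List[Sample], List[int]]:
        """Given the true distribution and the m(eps) drawn samples, choose at
        most t(eps) of the samples and helper bits to send to the decoder."""
        raise NotImplementedError

    def decode(self, eps: float, points: List[Sample], bits: List[int]):
        """Output some D-hat in F with dTV(D, D-hat) <= eps, w.p. at least 2/3."""
        raise NotImplementedError
```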

Compression Theorem

Compression Theorem [ABHLMP '18]. If F is (m(ε), t(ε))-compressible, then the sample complexity of learning F is O(m(ε) + t(ε)/ε²).

Compression of Mixtures

Lemma. If F is (m(ε), t(ε))-compressible, then k-mix(F) is (k·m(ε/k)/ε, k·t(ε/k))-compressible.

Compression Theorem for Mixtures. If F is (m(ε), t(ε))-compressible, then the sample complexity of learning k-mix(F) is O(k·m(ε/k)/ε + k·t(ε/k)/ε²).
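As a back-of-the-envelope use of this bound (constants and polylog factors dropped; the function below is purely illustrative), one can plug a class's compression parameters into the mixture bound, e.g. the (O(d), O(d²)) scheme for single Gaussians used in the upper bound below:

```python
def mixture_sample_complexity(m, t, k, eps):
    """Evaluate k*m(eps/k)/eps + k*t(eps/k)/eps^2, ignoring constants and polylogs."""
    return k * m(eps / k) / eps + k * t(eps / k) / eps ** 2


d = 10
# For Gaussians in R^d (lemma in "Proof of Upper Bound"): m(eps) ~ d, t(eps) ~ d^2.
print(mixture_sample_complexity(m=lambda e: d, t=lambda e: d * d, k=5, eps=0.1))
# = 5*10/0.1 + 5*100/0.01 = 500 + 50000 = 50500
```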

Example: Gaussians in R

Claim. Gaussians in R are (1/ε, 2)-compressible.

1. The true distribution is N(µ, σ²); the encoder draws 1/ε points from N(µ, σ²).

2. With high probability, there exist drawn points X_i ≈ µ + σ and X_j ≈ µ − σ.

3. The encoder sends X_i, X_j; the decoder recovers µ, σ approximately.
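A minimal numerical sketch of this one-dimensional scheme (assuming the encoder simply transmits the drawn points closest to µ + σ and µ − σ; the function names are illustrative):

```python
import numpy as np


def encode(mu, sigma, eps, rng):
    """Encoder: knows the true N(mu, sigma^2), draws ~1/eps points, and sends
    the two drawn points closest to mu + sigma and mu - sigma."""
    x = rng.normal(mu, sigma, size=int(np.ceil(1.0 / eps)))
    xi = x[np.argmin(np.abs(x - (mu + sigma)))]  # point near mu + sigma
    xj = x[np.argmin(np.abs(x - (mu - sigma)))]  # point near mu - sigma
    return xi, xj


def decode(xi, xj):
    """Decoder: reconstructs the Gaussian from the two transmitted points."""
    return (xi + xj) / 2.0, abs(xi - xj) / 2.0   # (mu_hat, sigma_hat)


rng = np.random.default_rng(0)
print(decode(*encode(mu=3.0, sigma=2.0, eps=0.05, rng=rng)))  # roughly (3.0, 2.0)
```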

Outline of the Algorithm

Assume: (i) F is (m(ε), t(ε))-compressible; (ii) the true distribution D ∈ F.
Input: error parameter ε > 0.

1. Draw m(ε) i.i.d. samples from D.

2. The encoder's message consists of t(ε) of the m(ε) drawn points plus t(ε) helper bits, so the decoder has at most M = m(ε)^t(ε) · 2^t(ε) possible outputs; enumerate all of them as D_1, ..., D_M. By assumption, dTV(D_i, D) ≤ ε for some i.

3. Use the tournament algorithm of [DL '01] to find the best distribution among D_1, ..., D_M; O(log(M)/ε²) samples suffice for this step (a sketch of such a tournament appears after this outline).

The sample complexity is m(ε) + O(log(M)/ε²) = O(m(ε) + t(ε)/ε²).
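The tournament in step 3 is the Scheffé-set tournament of [DL '01]. The following is a simplified sketch for one-dimensional Gaussian candidates; the Monte Carlo estimation of the Scheffé-set probabilities and all names are simplifications for illustration, not the poster's implementation:

```python
import numpy as np
from scipy.stats import norm


def scheffe_tournament(candidates, data, n_mc=100_000, seed=0):
    """Round-robin Scheffé tournament (sketch).

    candidates: list of (mu, sigma) pairs, each a 1-D Gaussian hypothesis.
    data:       i.i.d. samples from the unknown distribution.
    Candidate i beats j if its mass on the Scheffé set
    A_ij = {x : f_i(x) > f_j(x)} is closer to the empirical mass of A_ij.
    """
    rng = np.random.default_rng(seed)
    k = len(candidates)
    wins = np.zeros(k, dtype=int)
    pdfs = [norm(mu, sigma).pdf for mu, sigma in candidates]
    # Monte Carlo samples from each candidate, used to estimate its mass on A_ij.
    mc = [rng.normal(mu, sigma, n_mc) for mu, sigma in candidates]
    for i in range(k):
        for j in range(i + 1, k):
            emp = np.mean(pdfs[i](data) > pdfs[j](data))    # empirical mass of A_ij
            p_i = np.mean(pdfs[i](mc[i]) > pdfs[j](mc[i]))  # P_i(A_ij), Monte Carlo
            p_j = np.mean(pdfs[i](mc[j]) > pdfs[j](mc[j]))  # P_j(A_ij), Monte Carlo
            if abs(p_i - emp) <= abs(p_j - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return candidates[int(np.argmax(wins))]


# Usage: the true distribution is N(0, 1); three candidate Gaussians.
data = np.random.default_rng(1).normal(0.0, 1.0, 5_000)
print(scheffe_tournament([(0.0, 1.0), (0.5, 1.0), (0.0, 2.0)], data))
# expected winner: (0.0, 1.0)
```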

Proof of Upper Bound

Lemma. Gaussians in Rd are (O(d), O(d2))-compressible.Sketch of lemma. Suppose true Gaussian is N (µ,Σ).• Encoder draws O(d) points from N (µ,Σ).• Points give rough shape of ellipsoid induced by µ,Σ; encoder

sends points & O(d2) bits; decoder approximates ellipsoid.• Decoder outputs N (µ, Σ).

Proof of upper bound. Combine the lemma with the compression theorem for mixtures: plugging m(ε) = O(d) and t(ε) = O(d²) into the mixture bound gives O(kd/ε + kd²/ε²) = O(kd²/ε²) samples (polylog factors suppressed).

Lower Bound Technique

Theorem (Fano's Inequality). If D_1, ..., D_r are distributions such that dTV(D_i, D_j) ≥ ε and KL(D_i, D_j) ≤ ε² for all i ≠ j, then the sample complexity is Ω(log(r)/ε²).

• Use the probabilistic method to find 2^Ω(d²) Gaussian distributions satisfying the hypothesis of Fano's Inequality.

• Repeat the following procedure 2^Ω(d²) times (a numerical sketch follows below):

  1. Start with the identity covariance matrix.

  2. Choose a random subspace S_a of dimension d/10 and perturb the eigenvalues by ε/√d along S_a.

  Let Σ_a be the corresponding covariance matrix and D_a = N(0, Σ_a).
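A small numerical sketch of one draw of this construction (assuming the perturbation simply scales the eigenvalues along a random d/10-dimensional subspace; the function name and constants are illustrative):

```python
import numpy as np


def perturbed_covariance(d, eps, rng):
    """One draw D_a = N(0, Sigma_a): perturb the identity covariance by
    eps/sqrt(d) along a uniformly random subspace of dimension d // 10."""
    s = max(d // 10, 1)
    q, _ = np.linalg.qr(rng.standard_normal((d, s)))   # orthonormal basis of S_a
    return np.eye(d) + (eps / np.sqrt(d)) * (q @ q.T)  # eigenvalues: 1 or 1 + eps/sqrt(d)


rng = np.random.default_rng(0)
Sigma_a = perturbed_covariance(d=50, eps=0.1, rng=rng)
```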

Claim. If a ≠ b, then KL(D_a, D_b) ≤ ε² and dTV(D_a, D_b) ≥ ε with probability 1 − exp(−Ω(d²)).

This construction can be lifted to obtain 2^Ω(kd²) k-mixtures of d-dimensional Gaussians satisfying Fano's Inequality.

Remark. The lower bound for axis-aligned Gaussians was proved by [SOAJ '14].

References

[DL '01] Devroye, L., & Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer Science & Business Media.

[SOAJ '14] Suresh, A. T., Orlitsky, A., Acharya, J., & Jafarpour, A. (2014). Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (pp. 1395-1403).