# Learning Mixtures of Product Distributions

Date posted: 11-Jan-2016

### Transcript of Learning Mixtures of Product Distributions

Learning Mixtures of Product Distributions

Jon Feldman (Columbia University), Rocco Servedio (Columbia University), Ryan O'Donnell (IAS)

Learning Distributions

There is an unknown distribution $P$ over $\mathbb{R}^n$, or maybe just over $\{0,1\}^n$.

An algorithm gets access to random samples from P.

In time polynomial in $n/\epsilon$ it should output a hypothesis distribution $Q$ which (w.h.p.) is $\epsilon$-close to $P$.

[Technical details later.]

Learning Distributions

[Figure: a distribution over $\mathbb{R}$.]

Hopeless in general!

Learning Classes of Distributions

Since this is hopeless in general, one assumes that $P$ comes from a class of distributions $C$. We speak of whether $C$ is polynomial-time learnable or not; this means that there is one algorithm that learns every $P$ in $C$.

Some easily learnable classes: $C$ = {Gaussians over $\mathbb{R}^n$}; $C$ = {product distributions over $\{0,1\}^n$}.

Learning product distributions over $\{0,1\}^n$

E.g. $n = 3$. Samples: 010, 011, 011, 111, 010, 011, 010, 010, 111, 000. Hypothesis: [.2 .9 .5]
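The hypothesis in this example is just the vector of per-coordinate sample means. A minimal sketch of that estimate on the ten samples above (the helper name `fit_product_distribution` is made up for illustration):

```python
# Sketch: fit a product distribution over {0,1}^n by per-coordinate sample means.
# The samples are the ten 3-bit strings from the slide.
samples = [
    (0, 1, 0), (0, 1, 1), (0, 1, 1), (1, 1, 1), (0, 1, 0),
    (0, 1, 1), (0, 1, 0), (0, 1, 0), (1, 1, 1), (0, 0, 0),
]


def fit_product_distribution(samples):
    """Return the empirical mean of each coordinate (hypothetical helper)."""
    n = len(samples[0])
    return [sum(x[j] for x in samples) / len(samples) for j in range(n)]


hypothesis = fit_product_distribution(samples)
print(hypothesis)  # [0.2, 0.9, 0.5]
```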

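Mixtures of product distributions

Fix $k \ge 2$ and let $\pi_1 + \pi_2 + \cdots + \pi_k = 1$. The $\pi$-mixture of distributions $P^1, \dots, P^k$ is: draw $i$ according to the mixture weights $\pi_i$, then draw from $P^i$.

In the case of product distributions over $\{0,1\}^n$:

$$
\begin{array}{ll}
\pi_1 & [\, \mu^1_1 \;\; \mu^1_2 \;\; \mu^1_3 \;\; \cdots \;\; \mu^1_n \,] \\
\pi_2 & [\, \mu^2_1 \;\; \mu^2_2 \;\; \mu^2_3 \;\; \cdots \;\; \mu^2_n \,] \\
\vdots & \\
\pi_k & [\, \mu^k_1 \;\; \mu^k_2 \;\; \mu^k_3 \;\; \cdots \;\; \mu^k_n \,]
\end{array}
$$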

Learning mixture example

E.g. $n = 4$. Samples: 1100, 0001, 0101, 0110, 0001, 1110, 0101, 0011, 1110, 1010.

True distribution: 60% [ .8 .8 .6 .2 ], 40% [ .2 .4 .3 .8 ]
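The two-step sampling process (pick a component, then draw each bit independently) can be sketched as follows. This is only an illustration using the true distribution from the example; the function name, seed, and sample count are arbitrary choices, not part of the talk.

```python
import random

# Parameters of the example mixture: 60% / 40% over 4-bit strings.
WEIGHTS = [0.6, 0.4]
MEANS = [[0.8, 0.8, 0.6, 0.2], [0.2, 0.4, 0.3, 0.8]]


def draw_sample(rng, weights, means):
    """Draw one bit-string from the mixture (hypothetical helper)."""
    # Step 1: pick component i with probability weights[i].
    i = rng.choices(range(len(weights)), weights=weights)[0]
    # Step 2: draw each bit independently with the means of component i.
    return tuple(1 if rng.random() < mu else 0 for mu in means[i])


rng = random.Random(0)  # fixed seed, chosen arbitrarily
samples = [draw_sample(rng, WEIGHTS, MEANS) for _ in range(20000)]

# The overall mean of coordinate j should be close to sum_i weight_i * mean_i[j].
empirical = [sum(x[j] for x in samples) / len(samples) for j in range(4)]
expected = [sum(w * m[j] for w, m in zip(WEIGHTS, MEANS)) for j in range(4)]
print(empirical)
print(expected)
```

Note that the overall coordinate means alone do not identify the mixture; that is why the algorithm later works with pairwise correlations.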

Prior work

[KMRRSS94]: learned in time poly($n/\epsilon$, $2^k$) in the special case that there is a number $p < 1/2$ such that every $\mu^i_j$ is either $p$ or $1-p$.

[FM99]: learned mixtures of 2 product distributions over $\{0,1\}^n$ in polynomial time (with a few minor technical deficiencies).

[CGG98]: learned a generalization of 2 product distributions over $\{0,1\}^n$, no deficiencies.

The latter two leave mixtures of 3+ as an open problem: there is a qualitative difference between 2 & 3. [FM99] also leaves open learning mixtures of Gaussians and other $\mathbb{R}^n$ distributions.

Our results

A poly($n/\epsilon$) time algorithm learning a mixture of $k$ product distributions over $\{0,1\}^n$ for any constant $k$.

Evidence that getting a poly($n/\epsilon$) algorithm for $k = \omega(1)$ [even in the case where the $\mu$'s are in $\{0, 1/2, 1\}$] will be very hard (if possible).

Generalizations: Let $C^1, \dots, C^n$ be nice classes of distributions over $\mathbb{R}$ (definable in terms of $O(1)$ moments). The algorithm learns a mixture of $O(1)$ distributions in $C^1 \times \cdots \times C^n$. Only pairwise independence of coords is used.

Technical definitions

When is a hypothesis distribution $Q$ $\epsilon$-close to the target distribution $P$?

$L_1$ distance? $\sum_x |P(x) - Q(x)|$.

KL divergence: $\mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log[P(x)/Q(x)]$.

Getting a KL-close hypothesis is more stringent: fact: $L_1 \le O(\sqrt{\mathrm{KL}})$.

We learn under KL divergence, which leads to some technical advantages (and some technical difficulties).
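One standard precise form of the relation between the two distances is Pinsker's inequality, $\|P - Q\|_1 \le \sqrt{2\,\mathrm{KL}(P \,\|\, Q)}$ with KL measured in nats. A small sketch that computes both distances and checks the inequality; the two distributions on 2-bit strings are made-up examples, not from the talk:

```python
import math


def l1_distance(p, q):
    """Sum over x of |p(x) - q(x)|; p and q are dicts over the same support."""
    return sum(abs(p[x] - q[x]) for x in p)


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p(x) == 0 contribute nothing."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


# Illustrative distributions over {0,1}^2 (assumed for this sketch).
P = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
Q = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

l1 = l1_distance(P, Q)
kl = kl_divergence(P, Q)
print(l1, kl, math.sqrt(2 * kl))
```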

Learning distributions summary

Learning a class of distributions $C$. Let $P$ be any distribution in the class. Given $\epsilon$ and $\delta > 0$. Get samples and do poly($n/\epsilon$, $\log(1/\delta)$) much work. With probability at least $1 - \delta$ output a hypothesis $Q$ which satisfies $\mathrm{KL}(P \,\|\, Q) < \epsilon$.

Some intuition for $k = 2$

Idea: Find two coordinates $j$ and $j'$ to key off.

Suppose you notice that the bits in coords j and j' are very frequently different.

Then probably most of the 01 examples come from one mixture component and most of the 10 examples come from the other.

Use this separation to estimate all other means.

More details for the intuition

Suppose you somehow know the following three things:

The mixture weights are 60% / 40%.

There are $j$ and $j'$ such that the means satisfy

$$
\begin{vmatrix} p_j & p_{j'} \\ q_j & q_{j'} \end{vmatrix} > \epsilon.
$$

The values $p_j$, $p_{j'}$, $q_j$, $q_{j'}$ themselves.

More details for the intuition

Main algorithmic idea: For each coord $m$, estimate (to within $\epsilon^2$) the correlation between $j$ & $m$ and $j'$ & $m$.

$$
\begin{aligned}
\mathrm{corr}(j, m) &= (.6\, p_j)\, p_m + (.4\, q_j)\, q_m \\
\mathrm{corr}(j', m) &= (.6\, p_{j'})\, p_m + (.4\, q_{j'})\, q_m
\end{aligned}
$$

Solve this system of equations for $p_m$, $q_m$. Done! Since the determinant is $> \epsilon$, any error in the correlation estimates does not blow up too much.
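As a sanity check of the idea, here is a minimal sketch that solves the 2×2 system by Cramer's rule. It assumes the 60%/40% example mixture from earlier, keys off coordinates 1 and 4 (indices 0 and 3), and uses exact correlations in place of sample estimates; all names are illustrative.

```python
# Example mixture (assumed for this sketch): weights .6/.4, means p and q.
W_P, W_Q = 0.6, 0.4
P = [0.8, 0.8, 0.6, 0.2]
Q = [0.2, 0.4, 0.3, 0.8]
J, J_PRIME = 0, 3  # the two coordinates we key off


def corr(a, b):
    """Exact E[x_a * x_b] under the mixture, for distinct coordinates a != b."""
    return W_P * P[a] * P[b] + W_Q * Q[a] * Q[b]


def solve_means(m):
    """Recover (p_m, q_m) from corr(j, m) and corr(j', m) via Cramer's rule."""
    a11, a12 = W_P * P[J], W_Q * Q[J]
    a21, a22 = W_P * P[J_PRIME], W_Q * Q[J_PRIME]
    det = a11 * a22 - a12 * a21  # far from 0 for a good pair (j, j')
    c1, c2 = corr(J, m), corr(J_PRIME, m)
    p_m = (c1 * a22 - a12 * c2) / det
    q_m = (a11 * c2 - a21 * c1) / det
    return p_m, q_m


for m in (1, 2):
    print(m, solve_means(m))
```

With estimated rather than exact correlations, the error in the recovered means scales roughly like the estimation error divided by the determinant, which is why a determinant bounded away from 0 matters.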

Two questions

1. This assumes that there is some $2 \times 2$ submatrix which is far from singular. In general, there is no reason to believe this is the case. But if not, then one set of means is very nearly a multiple of the other set; the problem becomes very easy.

2. How did we know $\pi_1$, $\pi_2$? How did we know which $j$ and $j'$ were good? How did we know the 4 means $p_j$, $p_{j'}$, $q_j$, $q_{j'}$?

Guessing

Just guess. I.e., try all possibilities.

Guess if the $2 \times n$ matrix is essentially rank 1 or not. Guess $\pi_1$, $\pi_2$ to within $\epsilon^2$. (Time: $1/\epsilon^4$.) Guess the correct $j$, $j'$. (Time: $n^2$.) Guess $p_j$, $p_{j'}$, $q_j$, $q_{j'}$ to within $\epsilon^2$. (Time: $1/\epsilon^8$.)

Solve the system of equations in every case. Time: poly($n/\epsilon$).

Checking guesses

After this we get a whole bunch of candidate hypotheses. When we get lucky and make all the right guesses, the resulting candidate hypothesis will be a good one; say, it will be $\epsilon$-close in KL to the truth.

Can we pick the (or, a) candidate hypothesis which is KL-close to the truth? I.e., can we guess and check?

Yes: use a Maximum Likelihood test.

Checking with ML

Suppose $Q$ is a candidate hypothesis for $P$. Estimate its log likelihood on a sample set $S$:

$$
\begin{aligned}
\log \prod_{x \in S} Q(x) &= \sum_{x \in S} \log Q(x) \\
&\approx |S| \cdot \mathbf{E}_{x \sim P}[\log Q(x)] \\
&= |S| \sum_x P(x) \log Q(x) \\
&= |S| \Big[ \sum_x P(x) \log P(x) - \mathrm{KL}(P \,\|\, Q) \Big].
\end{aligned}
$$

Checking with ML, cont'd

By Chernoff bounds, if we take enough samples, all candidate hypotheses $Q$ will have their estimated log-likelihoods close to their expectations. Any KL-close $Q$ will look very good in the ML test. Anything which looks good in the ML test is KL-close. Thus, assuming there is an $\epsilon$-close candidate hypothesis among the guesses, we find an $O(\epsilon)$-close candidate hypothesis. I.e., we can guess and check.
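The identity behind the ML test can be checked numerically. The sketch below departs from the talk in one stated way: instead of estimating log-likelihoods from samples, it computes the expectation $\mathbf{E}_{x \sim P}[\log Q(x)]$ exactly by enumerating all of $\{0,1\}^4$, so the result is deterministic. The target is the 60%/40% example mixture; the two wrong candidates are made up for illustration.

```python
import itertools
import math


def mixture_prob(x, weights, means):
    """Probability of bit-string x under a mixture of product distributions."""
    total = 0.0
    for w, mu in zip(weights, means):
        prob = w
        for bit, m in zip(x, mu):
            prob *= m if bit == 1 else 1.0 - m
        total += prob
    return total


def expected_log_likelihood(p, q, n):
    """Exact E_{x~p}[log q(x)]; p and q are (weights, means) pairs."""
    return sum(
        mixture_prob(x, *p) * math.log(mixture_prob(x, *q))
        for x in itertools.product((0, 1), repeat=n)
    )


def kl(p, q, n):
    """Exact KL(p || q) in nats by enumeration."""
    return sum(
        mixture_prob(x, *p) * math.log(mixture_prob(x, *p) / mixture_prob(x, *q))
        for x in itertools.product((0, 1), repeat=n)
    )


MEANS = [[0.8, 0.8, 0.6, 0.2], [0.2, 0.4, 0.3, 0.8]]
TRUE = ([0.6, 0.4], MEANS)
CANDIDATES = {
    "truth": TRUE,
    "swapped weights": ([0.4, 0.6], MEANS),           # illustrative bad guess
    "single product": ([1.0], [[0.56, 0.64, 0.48, 0.44]]),  # matches marginals only
}

scores = {name: expected_log_likelihood(TRUE, c, 4) for name, c in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Each candidate's score falls short of the truth's score by exactly its KL divergence from $P$, so ranking by likelihood is ranking by KL.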

Overview of the algorithm

We now give the precise algorithm for learning a mixture of $k$ product distributions, along with intuition for why it works. Intuitively:

1. Estimate all the pairwise correlations of bits.
2. Guess a number of parameters of the mixture distribution.
3. Use the guesses and correlation estimates to solve for the remaining parameters.
4. Show that whenever the guesses are close, the resulting parameter estimates give a close-in-KL candidate hypothesis.
5. Check the candidates with the ML algorithm, pick the best one.

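The algorithm

1. Estimate all pairwise correlations $\mathrm{corr}(j, j')$ to within $(\epsilon/n)^k$. (Time: $(n/\epsilon)^k$.) Note:

$$
\mathrm{corr}(j, j') = \sum_{i=1}^{k} \pi_i\, \mu^i_j\, \mu^i_{j'} = \langle \tilde{\mu}_j, \tilde{\mu}_{j'} \rangle,
\quad \text{where } \tilde{\mu}_j = \big( \sqrt{\pi_i}\, \mu^i_j \big)_{i = 1..k}.
$$

2. Guess all $\pi_i$ to within $(\epsilon/n)^k$. (Time: $(n/\epsilon)^{k^2}$.)

Now it suffices to estimate all vectors $\tilde{\mu}_j$, $j = 1 \dots n$.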
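The note in step 1 says that a pairwise correlation is a dot product of weight-scaled mean vectors. A small sketch checking this on a made-up mixture with $k = 3$ and $n = 4$ (the weights and means are assumptions for illustration), together with the obvious empirical estimator of $\mathbf{E}[x_j x_{j'}]$:

```python
import math

# Illustrative mixture (not from the talk): k = 3 components, n = 4 coordinates.
WEIGHTS = [0.5, 0.3, 0.2]
MEANS = [
    [0.9, 0.1, 0.1, 0.3],
    [0.1, 0.9, 0.1, 0.5],
    [0.1, 0.1, 0.9, 0.7],
]


def corr(j, jp):
    """Exact E[x_j * x_j'] for distinct coordinates j != j'."""
    return sum(w * mu[j] * mu[jp] for w, mu in zip(WEIGHTS, MEANS))


def mu_tilde(j):
    """Column j of the scaled matrix: entry i is sqrt(weight_i) * mean of x_j in component i."""
    return [math.sqrt(w) * mu[j] for w, mu in zip(WEIGHTS, MEANS)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def estimate_corr(samples, j, jp):
    """Empirical estimate of E[x_j * x_j'] from a list of bit-tuples."""
    return sum(x[j] * x[jp] for x in samples) / len(samples)


for j in range(4):
    for jp in range(j + 1, 4):
        print(j, jp, corr(j, jp), dot(mu_tilde(j), mu_tilde(jp)))
```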


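Guessing matrices from most of their Gram matrices

Let $A$ be the $k \times n$ matrix of $\tilde{\mu}^i_j$'s, i.e. the matrix whose columns are the vectors $\tilde{\mu}_j$:

$$
A = \begin{pmatrix} \tilde{\mu}_1 & \tilde{\mu}_2 & \cdots & \tilde{\mu}_n \end{pmatrix}
$$

After estimating all correlations, we know all dot products of distinct columns of $A$ to high accuracy. Goal: determine all entries of $A$, making only $O(1)$ guesses.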

Two remarks

This is the final problem, where all the main action and technical challenge lies. Note that all we ever do with the samples is estimate pairwise correlations.

If we knew the dot products of the columns of $A$ with themselves, we'd have the whole matrix $A^T A$. That would be great; we could just factor it and recover $A$ exactly. Unfortunately, there doesn't seem to be any way to get at these quantities $\sum_{i=1}^{k} \pi_i (\mu^i_j)^2$.

Keying off a nonsingular submatrix

Idea: find a nonsingular $k \times k$ matrix to key off. As before, the usual case is that $A$ has full rank. Then there is a $k \times k$ nonsingular submatrix $A_J$. Guess this matrix (time: $n^k$) and all its entries to within $(\epsilon/n)^k$ (time: $(n/\epsilon)^{k^3}$, the final running time). Now use this submatrix and the correlation estimates to find all other entries of $A$:

$$
\text{for all } m: \quad A_J^{T} A_m = \big( \mathrm{corr}(m, j) \big)_{j \in J}
$$
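A minimal sketch of this solve step for $k = 3$, assuming the submatrix $A_J$ is known exactly and using exact correlations. The mixture parameters are made up for illustration, and the small Gaussian-elimination routine stands in for any linear solver.

```python
import math

# Illustrative mixture (assumed): k = 3 components, n = 4 coordinates.
WEIGHTS = [0.5, 0.3, 0.2]
MEANS = [
    [0.9, 0.1, 0.1, 0.3],
    [0.1, 0.9, 0.1, 0.5],
    [0.1, 0.1, 0.9, 0.7],
]
J = [0, 1, 2]  # columns forming the nonsingular k x k submatrix A_J


def corr(a, b):
    """Exact E[x_a * x_b] for distinct coordinates a != b."""
    return sum(w * mu[a] * mu[b] for w, mu in zip(WEIGHTS, MEANS))


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a square system."""
    k = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (aug[r][k] - tail) / aug[r][r]
    return x


def recover_means(m):
    """Solve A_J^T A_m = (corr(m, j))_{j in J}, then undo the sqrt(weight) scaling."""
    a_j_transposed = [[math.sqrt(w) * mu[j] for w, mu in zip(WEIGHTS, MEANS)] for j in J]
    column = solve_linear(a_j_transposed, [corr(m, j) for j in J])
    return [entry / math.sqrt(w) for entry, w in zip(column, WEIGHTS)]


print(recover_means(3))  # approximately the coordinate-4 means of the three components
```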

Non-full rank case

But what if $A$ is not full rank? (Or in the actual analysis, if $A$ is extremely close to being rank deficient.) A genuine problem. Then $A$ has some perpendicular space of dimension $0 < d \le k$, spanned by some orthonormal vectors $u_1, \dots, u_d$. Guess $d$ and the vectors $u_1, \dots, u_d$. Now adjoin these columns to $A$, getting a full rank matrix.

$$
A' = \begin{pmatrix} A & u_1 & u_2 & \cdots & u_d \end{pmatrix}
$$

Non-full rank case, cont'd

Now $A'$ has full rank and we can do the full rank case! Why do we still know all pairwise dot products of the columns of $A'$? Dot products of the $u$'s with the columns of $A$ are 0, and distinct $u$'s are orthogonal to each other. The dot product of each $u$ with itself is 1. (Don't need this.)

4. Guess a $k \times k$ submatrix of $A'$ and all its entries. Use these to solve for all other entries.

The actual analysis

The actual analysis of this algorithm is quite delicate. There are some linear algebra & numerical analysis ideas. The main issue is: the degree to which $A$ is essentially of rank $k - d$ is similar to the degree to which all guessed vectors $u$ really do have dot product 0 with $A$'s original columns. The key is to find a large multiplicative gap between $A$'s singular values, and treat its location as the essential rank of $A$. This is where the necessary accuracy $(\epsilon/n)^k$ comes in.

Can we learn a mixture of $\omega(1)$?

Claim: Let $T$ be a decision tree on $\{0,1\}^n$ with $k$ leaves. Then the uniform distribution over the inputs which make $T$ output 1 is a mixture of at most $k$ product distributions.

Indeed, all product distributions have means 0, ½, or 1.

[Figure: a decision tree querying $x_1$, $x_2$, $x_2$, $x_3$, with edges labeled 0/1 and leaves labeled 0/1.]

2/3: [0, 0, ½, ½, ½, …]

1/3: [1, 1, 0, ½, ½, …]
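The claim can be made concrete: each accepting leaf fixes the variables on its path and leaves the rest uniform, and its weight is proportional to $2^{-\text{depth}}$. A sketch under the assumption that the tree is given as a list of accepting leaves, each a dict from variable index to required bit, with no variable repeated along a path; the representation and function name are illustrative.

```python
from fractions import Fraction


def leaves_to_mixture(accepting_leaves, n):
    """Turn the accepting leaves of a decision tree into a mixture of product distributions.

    accepting_leaves: list of dicts {variable index: required bit}, one per leaf
    that outputs 1 (hypothetical representation). Returns (weights, means).
    """
    # A leaf at depth d is reached by a 2^-d fraction of all inputs.
    masses = [Fraction(1, 2 ** len(leaf)) for leaf in accepting_leaves]
    total = sum(masses)
    weights = [mass / total for mass in masses]
    # Fixed variables get mean 0 or 1; free variables are uniform, mean 1/2.
    means = [
        [Fraction(leaf[j]) if j in leaf else Fraction(1, 2) for j in range(n)]
        for leaf in accepting_leaves
    ]
    return weights, means


# Accepting leaves matching the slide's example: (x1=0, x2=0) and (x1=1, x2=1, x3=0).
leaves = [{0: 0, 1: 0}, {0: 1, 1: 1, 2: 0}]
weights, means = leaves_to_mixture(leaves, n=5)
print(weights)  # [Fraction(2, 3), Fraction(1, 3)]
print(means)
```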

Learning DT