Learning From Data Lecture 19 A Peek At Unsupervised...

23
Learning From Data Lecture 19 A Peek At Unsupervised Learning k-Means Clustering Probability Density Estimation Gaussian Mixture Models M. Magdon-Ismail CSCI 4100/6100

Transcript of Learning From Data Lecture 19 A Peek At Unsupervised...

Page 1: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Learning From DataLecture 19

A Peek At Unsupervised Learning

k-Means ClusteringProbability Density Estimation

Gaussian Mixture Models

M. Magdon-IsmailCSCI 4100/6100

Page 2: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

recap: Radial Basis Functions

Nonparametric RBF Parametric k-RBF-Network

g(x) =N∑

n=1

(

αn(x)∑N

m=1 αm(x)

)

· yn

αn(x) = φ(

||x−xn ||r

)

(bump on x)

r = 0.05

No Training

h(x) = w0 +

k∑

j=1

wj · φ

(

||x− µj ||

r

)

= wtΦ(x)

(bump on µj)

linear model given µj

choose µj as centers of k-clusters of data

x

y

x

y

k = 4, r = 1k

k = 10, regularized

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 2 /23 Unsupervised learning −→

Page 3: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Unsupervised Learning

• Preprocessor to organize the data for supervised learning:Organize data for faster nearest neighbor search

Determine centers for RBF bumps.

• Important to be able to organize the data to identify patterns.Learn the patterns in data, e.g. the patterns in a language before getting into a supervised setting.

amazon.com organizes books into categories

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 3 /23 Clustering digits −→

Page 4: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Clustering Digits

21-NN rule, 10 Classes 10 Clustering of Data

0

1

2

3

4

67

89

Average Intensity

Sym

metry

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 4 /23 Clustering −→

Page 5: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Clustering

A cluster is a collection of points S

A k-clustering is a partition of the data into k clusters S1, . . . , Sk.

∪kj=1Sj = D

Si ∩ Sj = ∅ for i 6= j

Each cluster has a center µj

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 5 /23 k-means error −→

Page 6: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

How good is a clustering?

Points in a cluster should be similar (close to each other, and the center)

Error in cluster j:

Ej =∑

xn∈Sj

||xn − µj ||2.

k-Means Clustering Error:

Ein(S1, . . . , Sk;µ1, . . . ,µk) =k∑

j=1

Ej

=N∑

n=1

||xn − µ(xn) ||2

µ(xn) is the center of the cluster to which xn belongs.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 6 /23 −→

Page 7: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

k-Means Clustering

You get to pick S1, . . . , Sk and µ1, . . . ,µk to minimize Ein(S1, . . . , Sk;µ1, . . . ,µk)

If centers µj are known, picking the sets is easy:Add to Sj all points closest to µj

If the clusters Sj are known, picking the centers is easy:Center µj is the centroid of cluster Sj

µj =1

|Sj|

xn∈Sj

xn

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 7 /23 Lloyd’s algorithm −→

Page 8: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Lloyd’s Algorithm for k-Means Clustering

Ein(S1, . . . , Sk;µ1, . . . ,µk) =N∑

n=1

||xn − µ(xn) ||2

1: Initialize Pick well separated centers µj.

2: Update Sj to be all points closest µj.

Sj ← {xn : ||xn − µj || ≤ ||xn − µℓ || for ℓ = 1, . . . , k}.

3: Update µj to the centroid of Sj.

µj ←1

|Sj|

xn∈Sj

xn

4: Repeat steps 2 and 3 until Ein stops decreasing.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 8 /23 Update clusters −→

Page 9: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Lloyd’s Algorithm for k-Means Clustering

Ein(S1, . . . , Sk;µ1, . . . ,µk) =N∑

n=1

||xn − µ(xn) ||2

1: Initialize Pick well separated centers µj.

2: Update Sj to be all points closest µj.

Sj ← {xn : ||xn − µj || ≤ ||xn − µℓ || for ℓ = 1, . . . , k}.

3: Update µj to the centroid of Sj.

µj ←1

|Sj|

xn∈Sj

xn

4: Repeat steps 2 and 3 until Ein stops decreasing.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 9 /23 Update centers −→

Page 10: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Lloyd’s Algorithm for k-Means Clustering

Ein(S1, . . . , Sk;µ1, . . . ,µk) =N∑

n=1

||xn − µ(xn) ||2

1: Initialize Pick well separated centers µj.

2: Update Sj to be all points closest µj.

Sj ← {xn : ||xn − µj || ≤ ||xn − µℓ || for ℓ = 1, . . . , k}.

3: Update µj to the centroid of Sj.

µj ←1

|Sj|

xn∈Sj

xn

4: Repeat steps 2 and 3 until Ein stops decreasing.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 10 /23 Application to RBF-Network −→

Page 11: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Application to k-RBF-Network

10-center RBF-network 300-center RBF-network

Choosing k - knowledge of problem (10 digits) or CV.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 11 /23 Probability density estimation −→

Page 12: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Probability Density Estimation

P (x)

P (x) measures how likely it is to generate inputs similar to x.

Estimating P (x) results in a ‘softer/finer’ representation than clustering

Clusters are regions of high probability.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 12 /23 Parzen windows −→

Page 13: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Parzen Windows – RBF density estimation

Basic idea: put a bump of ‘size’ (volume) 1N on each data point.

x

P(x)

P̂ (x) =1

Nrd

N∑

i=1

φ

(

||x− xi ||

r

)

φ(z) = 1(2π)d/2

e−1

2z2

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 13 /23 Digits data −→

Page 14: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Digits Data

RBF Density Estimate Density Contours

xy xy

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 14 /23 GMM −→

Page 15: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

The Gaussian Mixture Model (GMM)

Instead of N bumps −→ k ≪ N bumps. (Similar to nonparametric RBF −→ parametric k-RBF-network)

Instead of uniform spherical bumps −→ each bump has its own shape.

Bump centers:µ1, . . . ,µk

Bump shapes:Σ1, . . . ,Σk

Gaussian formula for the bump:

N (x;µj,Σj) =1

(2π)d/2|Σj|1/2e−

12(x− µj)

tΣj−1(x− µj).

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 15 /23 GMM formula −→

Page 16: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

GMM Density Estimate

N (x;µj,Σj) =1

(2π)d/2|Σj|1/2e−

12(x− µj)

tΣj−1(x− µj).

P̂ (x) =k∑

j=1

wj N (x;µj,Σj)

(Sum of k weighted bumps).

wj > 0,k∑

j=1

wj = 1

You get to pick {wj,µj,Σj}j=1,...,k

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 16 /23 Maximum likelihood −→

Page 17: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Maximize Likelihood Estimation

Pick {wj,µj,Σj}j=1,...,k to best explain the data.

Maximize the likelihood of the data given {wj,µj,Σj}j=1,...,k

(We saw this when we derived the cross entropy error for logistic regression)

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 17 /23 E-M algorithm −→

Page 18: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Expectation-Maximization: The E-M Algorithm

A simple algorithm to get to the local minimum of the likelihood.

Partition variables into two sets. Given one-set, you can estimate the other

‘Bootstrap’ your way to a decent solution.

Lloyd’s algorithm for k-means is an example for ‘hard clustering’

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 18 /23 γnj −→

Page 19: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Bump Memberships

Fraction of xn belonging to bump j (a ‘hidden variable’)

γnj

Nj =N∑

n=1

γnj (‘number’ of points in bump j)

wj =Nj

N(probability bump j)

µj =1

Nj

N∑

n=1

γnjxn (centroid of bump j)

Σj =1

Nj

N∑

n=1

γnjxnxt

n − µjµt

j (covariance matrix of bump j)

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 19 /23 Parameters given γnj −→

Page 20: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Bump Memberships

Fraction of xn belonging to bump j (a ‘hidden variable’)

γnj

Nj =N∑

n=1

γnj (‘number’ of points in bump j)

wj =Nj

N(probability bump j)

µj =1

Nj

N∑

n=1

γnjxn (centroid of bump j)

Σj =1

Nj

N∑

n=1

γnjxnxt

n − µjµt

j (covariance matrix of bump j)

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 20 /23 Restimating γnj −→

Page 21: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

Re-Estimating Bump Memberships

γnj =wj N (xn;µj,Σj)

∑kℓ=1wℓ N (xn;µℓ,Σℓ)

γnj is the probability that xn came from bump j

probability of bump j: wj

probability density for xn given bump j: N (xn;µj,Σj)

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 21 /23 E-M Algorithm −→

Page 22: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

E-M Algorithm

E-M Algorithm for GMMs:

1: Start with estimates for the bump membership γnj.

2: Estimate wj,µj,Σj given the bump memberships.

3: Update the bump memberships given wj,µj,Σj;

4: Iterate to step 2 until convergence.

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 22 /23 GMM on digits −→

Page 23: Learning From Data Lecture 19 A Peek At Unsupervised Learningmagdon/courses/LFD-Slides/SlidesLect19.pdfClustering A cluster is a collection of points S A k-clustering is a partition

GMM on Digits Data

10-center GMM Density Contours

xy x

y

c© AML Creator: Malik Magdon-Ismail Unsupervised Learning: 23 /23