
Learning From Data, Lecture 19

A Peek At Unsupervised Learning

k-Means Clustering
Probability Density Estimation
Gaussian Mixture Models

M. Magdon-Ismail, CSCI 4100/6100

recap: Radial Basis Functions

Nonparametric RBF:

g(x) = \sum_{n=1}^{N} \left( \frac{\alpha_n(x)}{\sum_{m=1}^{N} \alpha_m(x)} \right) \cdot y_n, \qquad \alpha_n(x) = \phi\!\left(\frac{||x - x_n||}{r}\right) \quad \text{(bump on each } x_n\text{)}

r = 0.05; no training.

Parametric k-RBF-Network:

h(x) = w_0 + \sum_{j=1}^{k} w_j \cdot \phi\!\left(\frac{||x - \mu_j||}{r}\right) = w^t \Phi(x) \quad \text{(bump on each } \mu_j\text{)}

A linear model given the \mu_j; choose the \mu_j as the centers of k clusters of the data.

[Figures: the nonparametric fit with r = 0.05, and k-RBF-network fits with k = 4, r = 1/k and with k = 10, regularized; axes x vs. y.]
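As a quick refresher, here is a minimal NumPy sketch of the nonparametric RBF estimate above, assuming a Gaussian bump for \phi; the function names are illustrative:

import numpy as np

def phi(z):
    # Gaussian bump: phi(z) = exp(-z^2 / 2)
    return np.exp(-0.5 * z ** 2)

def rbf_nonparametric(x, X, y, r=0.05):
    # g(x) = sum_n [alpha_n(x) / sum_m alpha_m(x)] * y_n,
    # with alpha_n(x) = phi(||x - x_n|| / r): a bump on each data point.
    alpha = phi(np.linalg.norm(X - x, axis=1) / r)
    return np.dot(alpha, y) / np.sum(alpha)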


Unsupervised Learning

• Preprocessor to organize the data for supervised learning:
– organize data for faster nearest-neighbor search;
– determine centers for RBF bumps.

• Important to be able to organize the data to identify patterns:
– learn the patterns in data, e.g. the patterns in a language, before getting into a supervised setting;
– amazon.com organizes books into categories.


Clustering Digits

[Figures: left, the 21-NN rule on the digits data (10 classes); right, a 10-clustering of the data. Axes: Average Intensity vs. Symmetry; points labeled by digit.]



Clustering

A cluster is a collection of points S.

A k-clustering is a partition of the data into k clusters S_1, \ldots, S_k:

\bigcup_{j=1}^{k} S_j = D, \qquad S_i \cap S_j = \emptyset \text{ for } i \neq j.

Each cluster has a center \mu_j.


How good is a clustering?

Points in a cluster should be similar (close to each other, and to the center).

Error in cluster j:

E_j = \sum_{x_n \in S_j} ||x_n - \mu_j||^2.

k-Means Clustering Error:

E_{\text{in}}(S_1, \ldots, S_k; \mu_1, \ldots, \mu_k) = \sum_{j=1}^{k} E_j = \sum_{n=1}^{N} ||x_n - \mu(x_n)||^2,

where \mu(x_n) is the center of the cluster to which x_n belongs.
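In code, the k-means error is a one-liner; a minimal NumPy sketch, assuming X is an N×d data matrix, labels holds each point's cluster index, and centers is a k×d matrix of the \mu_j (names are illustrative):

import numpy as np

def k_means_error(X, labels, centers):
    # E_in = sum_n ||x_n - mu(x_n)||^2, where centers[labels] picks out
    # the center mu(x_n) of the cluster each point belongs to.
    return np.sum((X - centers[labels]) ** 2)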


k-Means Clustering

You get to pick S_1, \ldots, S_k and \mu_1, \ldots, \mu_k to minimize E_{\text{in}}(S_1, \ldots, S_k; \mu_1, \ldots, \mu_k).

If the centers \mu_j are known, picking the sets is easy: add to S_j all points closest to \mu_j.

If the clusters S_j are known, picking the centers is easy: center \mu_j is the centroid of cluster S_j,

\mu_j = \frac{1}{|S_j|} \sum_{x_n \in S_j} x_n.


Lloyd’s Algorithm for k-Means Clustering

E_{\text{in}}(S_1, \ldots, S_k; \mu_1, \ldots, \mu_k) = \sum_{n=1}^{N} ||x_n - \mu(x_n)||^2

1: Initialize: pick well separated centers \mu_j.

2: Update S_j to be all points closest to \mu_j:

S_j \leftarrow \{x_n : ||x_n - \mu_j|| \leq ||x_n - \mu_\ell|| \text{ for } \ell = 1, \ldots, k\}.

3: Update \mu_j to the centroid of S_j:

\mu_j \leftarrow \frac{1}{|S_j|} \sum_{x_n \in S_j} x_n.

4: Repeat steps 2 and 3 until E_{\text{in}} stops decreasing.
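A minimal NumPy sketch of Lloyd's algorithm as stated above; initializing from random data points (rather than well separated centers) is a simplification, and the names are illustrative:

import numpy as np

def lloyd(X, k, max_iters=100, seed=None):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers (here: k random data points; the slide
    # recommends well separated centers instead).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_error = np.inf
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: move each center to the centroid of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Step 4: repeat until E_in stops decreasing.
        error = np.sum((X - centers[labels]) ** 2)
        if error >= prev_error:
            break
        prev_error = error
    return labels, centers

Each of steps 2 and 3 can only decrease E_in given the other, so the algorithm converges, though only to a local minimum; in practice one runs it from several initializations and keeps the best.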


Application to k-RBF-Network

[Figures: a 10-center RBF-network fit vs. a 300-center RBF-network fit.]

Choosing k: use knowledge of the problem (10 digits) or cross validation.


Probability Density Estimation

[Figure: an estimated density P(x) over the input space.]

P(x) measures how likely it is to generate inputs similar to x.

Estimating P(x) results in a 'softer/finer' representation than clustering.

Clusters are regions of high probability.



Parzen Windows – RBF density estimation

Basic idea: put a bump of 'size' (volume) 1/N on each data point.

[Figure: the estimate P̂(x): a sum of small bumps centered on the data points, plotted against x.]

\hat{P}(x) = \frac{1}{N r^d} \sum_{i=1}^{N} \phi\!\left(\frac{||x - x_i||}{r}\right), \qquad \phi(z) = \frac{1}{(2\pi)^{d/2}} \, e^{-\frac{1}{2} z^2}
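A minimal NumPy sketch of the Parzen window estimate, evaluating P̂ at a single query point x (variable names are illustrative):

import numpy as np

def parzen_density(x, X, r):
    # P_hat(x) = (1 / (N r^d)) sum_i phi(||x - x_i|| / r),
    # with phi(z) = (2 pi)^(-d/2) exp(-z^2 / 2).
    N, d = X.shape
    z = np.linalg.norm(X - x, axis=1) / r
    bumps = np.exp(-0.5 * z ** 2) / (2 * np.pi) ** (d / 2)
    return bumps.sum() / (N * r ** d)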


Digits Data

[Figures: the RBF density estimate on the digits data, as a surface over (x, y) and as density contours.]


The Gaussian Mixture Model (GMM)

Instead of N bumps → k ≪ N bumps. (Similar to nonparametric RBF → parametric k-RBF-network.)

Instead of uniform spherical bumps → each bump has its own shape.

Bump centers: \mu_1, \ldots, \mu_k
Bump shapes: \Sigma_1, \ldots, \Sigma_k

Gaussian formula for the bump:

N(x; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_j)^t \Sigma_j^{-1} (x - \mu_j)}.
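A minimal NumPy sketch of this Gaussian bump (the later sketches reuse it); using np.linalg.solve avoids explicitly forming \Sigma_j^{-1}:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # N(x; mu, Sigma) = exp(-0.5 (x-mu)^t Sigma^{-1} (x-mu))
    #                   / ((2 pi)^{d/2} |Sigma|^{1/2})
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm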


GMM Density Estimate

N(x; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_j)^t \Sigma_j^{-1} (x - \mu_j)}

\hat{P}(x) = \sum_{j=1}^{k} w_j \, N(x; \mu_j, \Sigma_j) \qquad \text{(a sum of } k \text{ weighted bumps)}

w_j > 0, \qquad \sum_{j=1}^{k} w_j = 1

You get to pick \{w_j, \mu_j, \Sigma_j\}_{j=1,\ldots,k}.
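The mixture density then follows directly; a short sketch reusing the gaussian_pdf function above:

def gmm_density(x, w, mus, Sigmas):
    # P_hat(x) = sum_j w_j N(x; mu_j, Sigma_j), with w_j > 0, sum_j w_j = 1.
    return sum(w[j] * gaussian_pdf(x, mus[j], Sigmas[j]) for j in range(len(w)))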



Maximum Likelihood Estimation

Pick \{w_j, \mu_j, \Sigma_j\}_{j=1,\ldots,k} to best explain the data:

maximize the likelihood of the data given \{w_j, \mu_j, \Sigma_j\}_{j=1,\ldots,k}.

(We saw this when we derived the cross entropy error for logistic regression)
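Concretely, the quantity being maximized is the log-likelihood of the data; a sketch reusing gmm_density above:

import numpy as np

def log_likelihood(X, w, mus, Sigmas):
    # log P(data) = sum_n log P_hat(x_n); maximize over {w_j, mu_j, Sigma_j}.
    return sum(np.log(gmm_density(x, w, mus, Sigmas)) for x in X)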


Expectation-Maximization: The E-M Algorithm

A simple algorithm to get to a local maximum of the likelihood.

Partition the variables into two sets. Given one set, you can estimate the other.

'Bootstrap' your way to a decent solution.

Lloyd's algorithm for k-means is an example, for 'hard' clustering.


Bump Memberships

Fraction of x_n belonging to bump j (a 'hidden variable'): \gamma_{nj}

N_j = \sum_{n=1}^{N} \gamma_{nj} \qquad \text{('number' of points in bump } j\text{)}

w_j = \frac{N_j}{N} \qquad \text{(probability of bump } j\text{)}

\mu_j = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_{nj} \, x_n \qquad \text{(centroid of bump } j\text{)}

\Sigma_j = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_{nj} \, x_n x_n^t - \mu_j \mu_j^t \qquad \text{(covariance matrix of bump } j\text{)}
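A minimal NumPy sketch of these updates, assuming gamma is an N×k matrix of memberships (this is the M-step of the algorithm below):

import numpy as np

def m_step(X, gamma):
    N, d = X.shape
    Nj = gamma.sum(axis=0)              # 'number' of points in each bump
    w = Nj / N                          # bump probabilities
    mus = (gamma.T @ X) / Nj[:, None]   # weighted centroids
    Sigmas = []
    for j in range(gamma.shape[1]):
        # Sigma_j = (1/N_j) sum_n gamma_nj x_n x_n^t - mu_j mu_j^t
        S = (gamma[:, j, None] * X).T @ X / Nj[j]
        Sigmas.append(S - np.outer(mus[j], mus[j]))
    return w, mus, np.array(Sigmas)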



Re-Estimating Bump Memberships

\gamma_{nj} = \frac{w_j \, N(x_n; \mu_j, \Sigma_j)}{\sum_{\ell=1}^{k} w_\ell \, N(x_n; \mu_\ell, \Sigma_\ell)}

\gamma_{nj} is the probability that x_n came from bump j:

probability of bump j: w_j
probability density for x_n given bump j: N(x_n; \mu_j, \Sigma_j)
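A sketch of this update (the E-step), reusing gaussian_pdf from before; gamma is built row by row, then each row is normalized to sum to 1:

import numpy as np

def e_step(X, w, mus, Sigmas):
    # gamma_nj proportional to w_j N(x_n; mu_j, Sigma_j), normalized over j.
    gamma = np.array([[w[j] * gaussian_pdf(x, mus[j], Sigmas[j])
                       for j in range(len(w))] for x in X])
    return gamma / gamma.sum(axis=1, keepdims=True)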


E-M Algorithm

E-M Algorithm for GMMs:

1: Start with estimates for the bump memberships \gamma_{nj}.

2: Estimate w_j, \mu_j, \Sigma_j given the bump memberships.

3: Update the bump memberships given w_j, \mu_j, \Sigma_j.

4: Iterate to step 2 until convergence.
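Putting the pieces together, a minimal sketch of the full loop, reusing the e_step and m_step sketches above; a fixed iteration count stands in for a proper convergence test on the likelihood:

import numpy as np

def em_gmm(X, k, iters=100, seed=None):
    rng = np.random.default_rng(seed)
    # 1: start with random bump memberships (rows normalized to sum to 1).
    gamma = rng.random((len(X), k))
    gamma /= gamma.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # 2: estimate w_j, mu_j, Sigma_j given the memberships.
        w, mus, Sigmas = m_step(X, gamma)
        # 3: update the memberships given w_j, mu_j, Sigma_j.
        gamma = e_step(X, w, mus, Sigmas)
    return w, mus, Sigmas, gamma

As with Lloyd's algorithm, each iteration can only improve the objective, so E-M converges to a local maximum of the likelihood; running from several random initializations and keeping the best is standard practice.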


GMM on Digits Data

[Figures: a 10-center GMM density estimate on the digits data, as a surface over (x, y) and as density contours.]
