Learning From Data, Lecture 19
A Peek At Unsupervised Learning
k-Means Clustering, Probability Density Estimation, Gaussian Mixture Models
M. Magdon-Ismail, CSCI 4100/6100
recap: Radial Basis Functions
Nonparametric RBF (no training; e.g. r = 0.05):

    g(x) = Σ_{n=1}^{N} ( αn(x) / Σ_{m=1}^{N} αm(x) ) · yn,    αn(x) = φ(||x − xn|| / r)    (bump on each xn)

Parametric k-RBF-Network (a linear model given the µj):

    h(x) = w0 + Σ_{j=1}^{k} wj · φ(||x − µj|| / r) = wᵗΦ(x)    (bump on each µj)

Choose the µj as centers of k clusters of the data.
[Figures: fits in the (x, y) plane with k = 4, r = 1/k (left) and k = 10, regularized (right).]
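The parametric k-RBF-network above is cheap to evaluate once the centers are fixed. A minimal sketch, assuming a Gaussian bump φ(z) = e^{−z²/2} and hand-picked illustrative weights (in practice w0 and w come from the linear fit given the µj):

```python
import numpy as np

def phi(z):
    # Gaussian bump (an assumption; any decaying radial profile works)
    return np.exp(-0.5 * z ** 2)

def rbf_network(x, w0, w, centers, r):
    # h(x) = w0 + sum_j w_j * phi(||x - mu_j|| / r)
    z = np.linalg.norm(x - centers, axis=1) / r
    return w0 + w @ phi(z)

# illustrative: two 1-d centers with hand-picked weights
centers = np.array([[0.0], [1.0]])
h = rbf_network(np.array([0.0]), w0=0.5, w=np.array([1.0, -1.0]),
                centers=centers, r=1.0)
```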
© AML Creator: Malik Magdon-Ismail (Unsupervised Learning, 23 slides)
Unsupervised Learning
• Preprocessor to organize the data for supervised learning:
  organize data for faster nearest neighbor search;
  determine centers for RBF bumps.
• Important to be able to organize the data to identify patterns:
  learn the patterns in data, e.g. the patterns in a language, before getting into a supervised setting.
  (amazon.com organizes books into categories.)
Clustering Digits
[Figures: left, the 21-NN rule with 10 classes (digits 0-9); right, a 10-clustering of the data. Axes: Average Intensity vs. Symmetry.]
Clustering
A cluster is a collection of points S.

A k-clustering is a partition of the data into k clusters S1, …, Sk:

    ∪_{j=1}^{k} Sj = D,    Si ∩ Sj = ∅ for i ≠ j.

Each cluster has a center µj.
How good is a clustering?
Points in a cluster should be similar (close to each other, and the center)
Error in cluster j:

    Ej = Σ_{xn∈Sj} ||xn − µj||².

k-Means Clustering Error:

    Ein(S1, …, Sk; µ1, …, µk) = Σ_{j=1}^{k} Ej = Σ_{n=1}^{N} ||xn − µ(xn)||²,

where µ(xn) is the center of the cluster to which xn belongs.
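The clustering error is straightforward to compute from the cluster assignments; a minimal sketch (the array names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans_error(X, labels, centers):
    # E_in = sum_n ||x_n - mu(x_n)||^2, where mu(x_n) = center of x_n's cluster
    return np.sum((X - centers[labels]) ** 2)

# illustrative: two points sharing one center at their midpoint
X = np.array([[0.0, 0.0], [2.0, 0.0]])
E = kmeans_error(X, labels=np.array([0, 0]), centers=np.array([[1.0, 0.0]]))
```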
k-Means Clustering
You get to pick S1, …, Sk and µ1, …, µk to minimize Ein(S1, …, Sk; µ1, …, µk).

If the centers µj are known, picking the sets is easy: add to Sj all points closest to µj.

If the clusters Sj are known, picking the centers is easy: center µj is the centroid of cluster Sj,

    µj = (1/|Sj|) Σ_{xn∈Sj} xn.
Lloyd’s Algorithm for k-Means Clustering
    Ein(S1, …, Sk; µ1, …, µk) = Σ_{n=1}^{N} ||xn − µ(xn)||²

1: Initialize: pick well separated centers µj.
2: Update Sj to be all points closest to µj:
    Sj ← {xn : ||xn − µj|| ≤ ||xn − µℓ|| for ℓ = 1, …, k}.
3: Update µj to the centroid of Sj:
    µj ← (1/|Sj|) Σ_{xn∈Sj} xn.
4: Repeat steps 2 and 3 until Ein stops decreasing.
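Steps 1-4 above can be sketched directly; a minimal implementation, assuming the initial centers are well separated so that no cluster goes empty:

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    # Alternate the two easy steps; E_in never increases, so this converges.
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: move each center to its cluster's centroid
        # (assumes no cluster is empty, hence well separated initial centers)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels, centers = lloyd(X, centers=np.array([[0.0, 0.0], [10.0, 0.0]]))
```

Note that Lloyd's algorithm only reaches a local minimum of Ein, which is why the well separated initialization in step 1 matters.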
Application to k-RBF-Network
[Figures: a 10-center RBF-network (left) vs. a 300-center RBF-network (right).]

Choosing k: use knowledge of the problem (10 digits) or cross validation (CV).
Probability Density Estimation
P(x) measures how likely it is to generate inputs similar to x.

Estimating P(x) results in a 'softer/finer' representation than clustering:
clusters are regions of high probability.
Parzen Windows – RBF density estimation
Basic idea: put a bump of 'size' (volume) 1/N on each data point.

[Figure: the density estimate P̂(x) versus x; a sum of bumps over the data points.]

    P̂(x) = (1/(N r^d)) Σ_{i=1}^{N} φ(||x − xi|| / r),    φ(z) = (1/(2π)^{d/2}) e^{−z²/2}
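The Parzen estimate is a direct sum over the data; a minimal sketch assuming the Gaussian window, with illustrative names:

```python
import numpy as np

def parzen_density(x, X, r):
    # P_hat(x) = (1 / (N r^d)) * sum_i phi(||x - x_i|| / r)
    # with phi(z) = (2 pi)^(-d/2) exp(-z^2 / 2): each bump has volume 1/N
    N, d = X.shape
    z = np.linalg.norm(x - X, axis=1) / r
    return np.sum(np.exp(-0.5 * z ** 2)) / (N * r ** d * (2 * np.pi) ** (d / 2))

# sanity check: one data point at the origin, d = 1, r = 1, evaluated at 0
p = parzen_density(np.array([0.0]), np.array([[0.0]]), r=1.0)
```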
Digits Data
[Figures: the RBF density estimate (left) and its density contours (right); axes x vs. y.]
The Gaussian Mixture Model (GMM)
Instead of N bumps → k ≪ N bumps. (Similar to nonparametric RBF → parametric k-RBF-network.)
Instead of uniform spherical bumps → each bump has its own shape.

Bump centers: µ1, …, µk.    Bump shapes: Σ1, …, Σk.

Gaussian formula for the bump:

    N(x; µj, Σj) = (1/((2π)^{d/2} |Σj|^{1/2})) exp(−(1/2)(x − µj)ᵗ Σj⁻¹ (x − µj)).
GMM Density Estimate
    P̂(x) = Σ_{j=1}^{k} wj N(x; µj, Σj)    (sum of k weighted bumps),

with wj > 0 and Σ_{j=1}^{k} wj = 1.

You get to pick {wj, µj, Σj}_{j=1,…,k}.
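Evaluating P̂(x) for given parameters is a direct transcription of the two formulas above; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def gaussian(x, mu, Sigma):
    # N(x; mu, Sigma) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-(x-mu)' Sigma^-1 (x-mu) / 2)
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)) / norm

def gmm_density(x, w, mus, Sigmas):
    # P_hat(x) = sum_j w_j N(x; mu_j, Sigma_j)
    return sum(wj * gaussian(x, mu, S) for wj, mu, S in zip(w, mus, Sigmas))

# one standard 1-d bump with weight 1 reduces to the Gaussian density
p = gmm_density(np.array([0.0]), w=[1.0],
                mus=[np.array([0.0])], Sigmas=[np.eye(1)])
```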
Maximum Likelihood Estimation
Pick {wj,µj,Σj}j=1,...,k to best explain the data.
Maximize the likelihood of the data given {wj,µj,Σj}j=1,...,k
(We saw this when we derived the cross entropy error for logistic regression)
Expectation-Maximization: The E-M Algorithm
A simple algorithm to get to a local maximum of the likelihood.

Partition the variables into two sets; given one set, you can estimate the other.

'Bootstrap' your way to a decent solution.

Lloyd's algorithm for k-means is an example, for 'hard' clustering.
Bump Memberships
Fraction of xn belonging to bump j (a 'hidden variable'): γnj.

    Nj = Σ_{n=1}^{N} γnj    ('number' of points in bump j)

    wj = Nj / N    (probability of bump j)

    µj = (1/Nj) Σ_{n=1}^{N} γnj xn    (centroid of bump j)

    Σj = (1/Nj) Σ_{n=1}^{N} γnj xn xnᵗ − µj µjᵗ    (covariance matrix of bump j)
Re-Estimating Bump Memberships
    γnj = wj N(xn; µj, Σj) / Σ_{ℓ=1}^{k} wℓ N(xn; µℓ, Σℓ)

γnj is the probability that xn came from bump j:
    probability of bump j: wj;
    probability density for xn given bump j: N(xn; µj, Σj).
E-M Algorithm
E-M Algorithm for GMMs:
1: Start with estimates for the bump membership γnj.
2: Estimate wj,µj,Σj given the bump memberships.
3: Update the bump memberships given wj, µj, Σj.
4: Iterate to step 2 until convergence.
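Steps 1-4 can be sketched as follows; this combines the parameter formulas and the membership update from the previous slides. The covariance uses the centered form, which is algebraically equal to the Σj formula given earlier; the Gaussian helper and variable names are illustrative:

```python
import numpy as np

def gauss(x, mu, Sigma):
    # N(x; mu, Sigma) for a single point x
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)) / norm

def em_gmm(X, gamma, n_iter=20):
    # gamma is N x k: gamma[n, j] = fraction of x_n belonging to bump j
    N, d = X.shape
    k = gamma.shape[1]
    for _ in range(n_iter):
        # step 2: estimate w_j, mu_j, Sigma_j from the memberships
        Nj = gamma.sum(axis=0)
        w = Nj / N
        mus = (gamma.T @ X) / Nj[:, None]
        Sigmas = [(gamma[:, j, None] * (X - mus[j])).T @ (X - mus[j]) / Nj[j]
                  for j in range(k)]
        # step 3: re-estimate memberships, gamma_nj ∝ w_j N(x_n; mu_j, Sigma_j)
        p = np.array([[w[j] * gauss(X[n], mus[j], Sigmas[j]) for j in range(k)]
                      for n in range(N)])
        gamma = p / p.sum(axis=1, keepdims=True)
    return w, mus, Sigmas, gamma

# two clearly separated 1-d clusters, hard initial memberships
X = np.array([[-5.0], [-4.0], [4.0], [5.0]])
gamma0 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w, mus, Sigmas, gamma = em_gmm(X, gamma0)
```

Like Lloyd's algorithm, this only reaches a local optimum, so the initial memberships matter.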
GMM on Digits Data
[Figures: the 10-center GMM density estimate (left) and its density contours (right); axes x vs. y.]