Cheat Sheet: Algorithms for Supervised- and Unsupervised Learning [1]

Each algorithm below is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, Online learning.

k-nearest neighbour

Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:
$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)$
• $N_k(x, \hat{x})$: the $k$ points in $x$ closest to $\hat{x}$
• Euclidean distance: $\sqrt{\sum_{i=1}^D (x_i - \hat{x}_i)^2}$
• $\delta(a, b) = 1$ if $a = b$; $0$ otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn an appropriate $k$; otherwise there is no training, and classification is based directly on the stored points.

Regularisation: $k$ acts to regularise the classifier: as $k \to N$ the boundary becomes smoother.

Complexity: $O(NM)$ space complexity, since all training instances and all their features must be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
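
To make the table entry concrete, here is a minimal Python/NumPy sketch of the classifier; the function name, toy data and choice of $k$ are illustrative additions, not part of the original sheet:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distances from x_new to every stored training instance.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points (N_k in the cheat sheet's notation).
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the neighbours (the arg max over delta counts).
    return Counter(t_train[nearest]).most_common(1)[0][0]

# Toy usage: two 2-D clusters with labels 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.2, 0.1]), k=3))  # expected: 0
```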

Naive Bayes

Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo 'Naive'.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k p(x|C_k)\,p(C_k) = \arg\max_k \prod_{i=1}^D p(x_i|C_k)\,p(C_k) = \arg\max_k \sum_{i=1}^D \log p(x_i|C_k) + \log p(C_k)$

Objective: No optimisation needed.

Training:
• Multivariate likelihood: $p(x|C_k) = \prod_{i=1}^D p(x_i|C_k)$, with
  $p_{\mathrm{MLE}}(x_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood: $p(x|C_k) = \prod_{i=1}^D p(\mathrm{word}_i|C_k)^{x_i}$, with
  $p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{jd}}$
  ... where $x_{ji}$ is the count of word $i$ in training example $j$, and $x_{jd}$ is the count of feature $d$ in training example $j$.
• Gaussian likelihood: $p(x|C_k) = \prod_{i=1}^D \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})$

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.
• Multivariate likelihood:
  $p_{\mathrm{MAP}}(x_i = v|C_k) = \dfrac{(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood:
  $p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{jd} - D + \sum_{d=1}^D \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
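
A short sketch of the multinomial variant with a symmetric Dirichlet smoothing term ($\alpha = 2$ corresponds to add-one smoothing), assuming NumPy; the helper names and toy word-count matrix are illustrative:

```python
import numpy as np

def train_multinomial_nb(X, t, K, alpha=2.0):
    """MAP estimates for a multinomial naive Bayes model.

    X : (N, D) array of word counts, t : (N,) class labels in {0..K-1},
    alpha : symmetric Dirichlet parameter (alpha = 2 gives add-one smoothing).
    """
    N, D = X.shape
    log_prior = np.log(np.bincount(t, minlength=K) / N)        # log p(C_k)
    log_lik = np.zeros((K, D))
    for k in range(K):
        counts = X[t == k].sum(axis=0) + (alpha - 1.0)         # smoothed word counts
        log_lik[k] = np.log(counts / counts.sum())             # log p(word_i | C_k)
    return log_prior, log_lik

def predict_nb(x, log_prior, log_lik):
    """arg max_k  sum_i x_i log p(word_i|C_k) + log p(C_k)."""
    return int(np.argmax(log_lik @ x + log_prior))

# Toy usage: 3 documents over a 4-word vocabulary, 2 classes.
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3]])
t = np.array([0, 0, 1])
log_prior, log_lik = train_multinomial_nb(X, t, K=2)
print(predict_nb(np.array([1, 0, 0, 0]), log_prior, log_lik))  # expected: 0
```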

Log-linear

Description: Estimate $p(C_k|x)$ directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)$
... where:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_m \lambda_m \phi_m(x, C_k)}, \qquad Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$

Objective: Minimise the negative log-likelihood:
$L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = -\log \prod_{(x,t)\in\mathcal{D}} p(t|x) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x)$
$= \sum_{(x,t)\in\mathcal{D}} \Big[\log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t)\Big]$
$= \sum_{(x,t)\in\mathcal{D}} \Big[\log \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)} - \sum_m \lambda_m \phi_m(x, t)\Big]$

Training: Gradient descent (or gradient ascent if maximising the objective):
$\lambda^{(n+1)} = \lambda^{(n)} - \eta\,\nabla L$
... where $\eta$ is the step parameter.
$\nabla L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] - \phi(x, t)$
$\nabla L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \dfrac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$
... where $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts, and for each class $C_k$:
$\sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] = \sum_{(x,t)\in\mathcal{D}} \phi(x, C_k)\, p(C_k|x)$

Regularisation: Penalise large values of the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian):
$L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big[-\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big]$
$= \arg\min_\lambda \Big[-\log e^{-\frac{(0-\lambda)^2}{2\sigma^2}} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big]$
$= \arg\min_\lambda \Big[\sum_m \dfrac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big]$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + x^T x')^2$). By the Representer Theorem:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\lambda^T \phi(x, C_k)}$
$= \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni}\, \phi(x_n, C_i)^T \phi(x, C_k)}$
$= \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni}\, K((x_n, C_i), (x, C_k))}$
$= \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \alpha_{nk}\, K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
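
A minimal batch gradient-descent sketch of the MAP objective above, assuming NumPy and the simple feature map that places $x$ in class $k$'s slot (i.e. plain softmax regression); the function names, step size and toy data are illustrative assumptions:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def train_loglinear(X, t, K, eta=0.1, sigma2=10.0, iters=500):
    """Batch gradient descent on the MAP objective.

    Uses phi(x, C_k) = x placed in class k's slot, so lambda is a (K, D) matrix.
    """
    N, D = X.shape
    lam = np.zeros((K, D))
    T = np.eye(K)[t]                                  # one-hot targets
    for _ in range(iters):
        P = softmax(X @ lam.T)                        # p(C_k | x_n) for all n, k
        # gradient = expected counts - empirical counts + Gaussian-prior term
        grad = (P - T).T @ X + lam / sigma2
        lam -= eta * grad / N
    return lam

def predict_loglinear(lam, x):
    return int(np.argmax(lam @ x))                    # arg max_k sum_m lambda_m phi_m(x, C_k)

# Toy usage: two linearly separable 2-D classes (with a bias feature prepended).
X = np.array([[1, 0.0, 0.2], [1, 0.3, 0.1], [1, 1.0, 0.9], [1, 0.8, 1.1]])
t = np.array([0, 0, 1, 1])
lam = train_loglinear(X, t, K=2)
print(predict_loglinear(lam, np.array([1, 0.1, 0.1])))  # expected: 0
```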

Perceptron

Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector whenever it incorrectly classifies a training instance.

Model: Binary, linear classifier:
$y(x) = \mathrm{sign}(w^T x)$
... where $\mathrm{sign}(x) = +1$ if $x \geq 0$; $-1$ if $x < 0$.
Multiclass perceptron:
$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Tries to minimise the error function over the incorrectly classified input vectors:
$\arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n$
... where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector on a misclassification:
$w^{(i+1)} = w^{(i)} - \eta\,\nabla E_P(w) = w^{(i)} + \eta\, x_n t_n$
... where typically $\eta = 1$.
For the multiclass perceptron:
$w^{(i+1)} = w^{(i)} + \phi(x, t) - \phi(x, y(x))$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:
$y(x) = \mathrm{sign}\Big(\sum_i c_i \times \mathrm{sign}(w_i^T x)\Big)$
... where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INML)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$ and one weight per training instance:
$y(x) = \mathrm{sign}\Big(\sum_{n=1}^N w_n t_n K(x, x_n)\Big)$
... and the update: $w_n^{(i+1)} = w_n^{(i)} + 1$

Online learning: The perceptron is an online algorithm by default.
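
A small NumPy sketch of the binary perceptron update with $\eta = 1$; the toy data and the early-stopping check for a mistake-free pass are illustrative:

```python
import numpy as np

def train_perceptron(X, t, epochs=10):
    """Binary perceptron with targets t in {-1, +1} and eta = 1.

    Sweeps the data, adding t_n * x_n whenever sign(w.x) misclassifies x_n.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w += t_n * x_n              # w <- w + eta * x_n * t_n, eta = 1
                updated = True
        if not updated:                     # converged: no mistakes in a full pass
            break
    return w

def predict_perceptron(w, x):
    return 1 if w @ x >= 0 else -1

# Toy usage: linearly separable points with a bias feature prepended.
X = np.array([[1, 0.0, 0.1], [1, 0.2, 0.0], [1, 1.0, 1.1], [1, 0.9, 1.0]])
t = np.array([-1, -1, 1, 1])
w = train_perceptron(X, t)
print(predict_perceptron(w, np.array([1, 0.05, 0.05])))  # expected: -1
```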

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
$y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0$

Objective:
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2$ s.t. $t_n(w^T x_n + w_0) \geq 1 \;\forall n$
Dual: $\tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $\lambda_n \geq 0$, $\sum_{n=1}^N \lambda_n t_n = 0 \;\forall n$

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \xi_n$ s.t. $t_n(w^T x_n + w_0) \geq 1 - \xi_n$, $\xi_n \geq 0 \;\forall n$
Dual: $\tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $0 \leq \lambda_n \leq C$, $\sum_{n=1}^N \lambda_n t_n = 0 \;\forall n$

Complexity:
• QP: $O(n^3)$
• SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:
$y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0 = \sum_{n=1}^N \lambda_n t_n K(x, x_n) + w_0$
$\tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
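
The dual QP is best left to a solver, so as an illustration here is a Pegasos-style stochastic sub-gradient sketch of the soft-margin primal (in the spirit of Shalev-Shwartz et al., cited above); the step-size schedule, the regularisation value and the toy data are assumptions made for the example, and the bias is simply folded into the weight vector:

```python
import numpy as np

def train_svm_pegasos(X, t, lam=0.1, epochs=100, seed=0):
    """Pegasos-style stochastic sub-gradient descent on the objective
    lam/2 ||w||^2 + (1/N) sum_n max(0, 1 - t_n w.x_n), with t in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):
            step += 1
            eta = 1.0 / (lam * step)                 # decaying step size
            if t[n] * (w @ X[n]) < 1:                # margin violated: hinge loss active
                w = (1 - eta * lam) * w + eta * t[n] * X[n]
            else:
                w = (1 - eta * lam) * w              # only the regulariser contributes
    return w

# Toy usage: separable data with a bias feature prepended (the bias is
# regularised here, a simplification relative to the explicit w_0 above).
X = np.array([[1, 0.0, 0.1], [1, 0.2, 0.0], [1, 1.0, 1.1], [1, 0.9, 1.0]])
t = np.array([-1, -1, 1, 1])
w = train_svm_pegasos(X, t)
print(np.sign(X @ w))   # should classify the training points correctly
```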

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\forall n\; \sum_k r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster $k$.
Geometric distance: the Euclidean distance ($\ell_2$ norm):
$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^D (x_{ni} - \mu_{ki})^2}$

Objective:
$\arg\min_{r, \mu} \sum_{n=1}^N \sum_{k=1}^K r_{nk}\, \|x_n - \mu_k\|_2^2$
... i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation: $r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal for $k$; $0$ otherwise.
Maximisation: $\mu^{(k)}_{\mathrm{MLE}} = \dfrac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$
... where $\mu^{(k)}$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means, as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
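
A compact NumPy sketch of the expectation/maximisation alternation above; initialising the centroids by sampling K data points and the empty-cluster fallback are illustrative choices, not part of the original sheet:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain (batch) k-means: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]         # initial centroids
    for _ in range(iters):
        # Expectation: assign each point to its closest centroid (r_nk).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Maximisation: each centroid becomes the mean of its assigned points.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                           # converged
            break
        mu = new_mu
    return mu, assign

# Toy usage: two obvious 2-D clusters.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(3, 0.1, (10, 2))])
mu, assign = kmeans(X, K=2)
print(mu.round(1))
```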

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities
$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\, p(z^{(i)})$
... with $z^{(i)} \sim \mathrm{Multinomial}(\gamma)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^K \gamma_{nj} = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:
$L(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^N \log \Big[\sum_{k=1}^K \pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)\Big]$

Training:
Expectation: for each $n, k$ set
$\gamma_{nk} = p(z^{(i)} = k \,|\, x^{(i)}; \gamma, \mu, \Sigma) \;(= p(k|x_n))$
$= \dfrac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\, p(z^{(i)} = k; \pi)}{\sum_{l=1}^K p(x^{(i)}|z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \pi)} = \dfrac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}$
Maximisation:
$\pi_k = \dfrac{1}{N} \sum_{n=1}^N \gamma_{nk}, \qquad \mu_k = \dfrac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}}, \qquad \Sigma_k = \dfrac{\sum_{n=1}^N \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
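
A sketch of the EM updates above, assuming NumPy and SciPy (scipy.stats.multivariate_normal) are available; the initialisation scheme and the small diagonal term added for numerical stability are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal   # assumed available

def gmm_em(X, K, iters=100, seed=0):
    """EM for a mixture of Gaussians; returns mixing weights, means, covariances."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.eye(D)] * K)
    for _ in range(iters):
        # E-step: responsibilities gamma_nk = pi_k N(x_n|mu_k,Sigma_k) / sum_j ...
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)           # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

# Toy usage: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
pi, mu, Sigma = gmm_em(X, K=2)
print(pi.round(2), mu.round(1))
```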

[1] Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.