Cheat Sheet: Algorithms for Supervised- and Unsupervised Learning [1]

Each algorithm is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.

k-nearest neighbour

Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:

$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)$

... where:

• $N_k(x, \hat{x}) \leftarrow$ the $k$ points in $x$ closest to $\hat{x}$
• Euclidean distance: $\sqrt{\sum_{i=1}^{D} (x_i - \hat{x}_i)^2}$
• $\delta(a, b) \leftarrow 1$ if $a = b$; $0$ otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate $k$; otherwise there is no training, and classification is based on the existing points.

Regularisation: $k$ acts to regularise the classifier: as $k \to N$ the boundary becomes smoother.

Complexity: $O(NM)$ space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
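To make the classification rule concrete, here is a minimal NumPy sketch of the k-NN rule above; the function and variable names (knn_predict, train_X, train_t) and the toy data are illustrative assumptions, not part of the original sheet.

```python
import numpy as np

def knn_predict(train_X, train_t, x_hat, k=3):
    """Classify x_hat with the most frequent label among its k nearest training points."""
    # Euclidean distances sqrt(sum_i (x_i - x_hat_i)^2) to every stored training point
    dists = np.sqrt(((train_X - x_hat) ** 2).sum(axis=1))
    # N_k(x, x_hat): indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote: argmax_C sum_i delta(t_i, C)
    labels, counts = np.unique(train_t[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two 2-D clusters
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_t = np.array([0, 0, 1, 1])
print(knn_predict(train_X, train_t, np.array([0.95, 0.9]), k=3))  # -> 1
```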

Naive Bayes

Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo 'Naive'.

Model:

$y(x) = \arg\max_k p(C_k|x)$
$= \arg\max_k p(x|C_k)\, p(C_k)$
$= \arg\max_k \prod_{i=1}^{D} p(x_i|C_k)\, p(C_k)$
$= \arg\max_k \sum_{i=1}^{D} \log p(x_i|C_k) + \log p(C_k)$

Objective: No optimisation needed.

Training:

Multivariate likelihood: $p(x|C_k) = \prod_{i=1}^{D} p(x_i|C_k)$

$p_{MLE}(x_i = v|C_k) = \dfrac{\sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^{N} \delta(t_j = C_k)}$

Multinomial likelihood: $p(x|C_k) = \prod_{i=1}^{D} p(\mathrm{word}_i|C_k)^{x_i}$

$p_{MLE}(\mathrm{word}_i = v|C_k) = \dfrac{\sum_{j=1}^{N} \delta(t_j = C_k)\, x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \delta(t_j = C_k)\, x_{jd}}$

... where:

• $x_{ji}$ is the count of word $i$ in training example $j$;
• $x_{jd}$ is the count of word $d$ in training example $j$.

Gaussian likelihood: $p(x|C_k) = \prod_{i=1}^{D} \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})$

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.

Multivariate likelihood:

$p_{MAP}(x_i = v|C_k) = \dfrac{(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)}$

Multinomial likelihood:

$p_{MAP}(\mathrm{word}_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)\, x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \left(\delta(t_j = C_k)\, x_{jd}\right) - D + \sum_{d=1}^{D} \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
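As a concrete illustration of the multinomial model and its MAP (Dirichlet-smoothed) estimate, here is a hedged NumPy sketch working in log space; the symmetric prior alpha and the array layout (X[j, i] holding word counts) are assumptions made for the example.

```python
import numpy as np

def train_multinomial_nb(X, t, num_classes, alpha=2.0):
    """X[j, i] = count of word i in training example j; t[j] = class label.
    Returns log p(C_k) and log p(word_i | C_k) under a symmetric Dirichlet prior alpha."""
    N, D = X.shape
    log_prior = np.zeros(num_classes)
    log_lik = np.zeros((num_classes, D))
    for k in range(num_classes):
        Xk = X[t == k]
        log_prior[k] = np.log(len(Xk) / N)       # p_MLE(C_k)
        counts = Xk.sum(axis=0)                  # sum_j delta(t_j = C_k) * x_ji
        # p_MAP(word_i | C_k) = (alpha - 1 + counts_i) / (sum_d counts_d - D + D*alpha)
        # alpha = 2 corresponds to add-one (Laplace) smoothing
        log_lik[k] = np.log((alpha - 1 + counts) / (counts.sum() - D + D * alpha))
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, x):
    """argmax_k sum_i x_i log p(word_i | C_k) + log p(C_k)."""
    return np.argmax(log_lik @ x + log_prior)
```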

Log-linear

Description: Estimate $p(C_k|x)$ directly, by assuming a maximum-entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:

$y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)$

... where:

$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}$

$Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$

Objective: Minimise the negative log-likelihood:

$L_{MLE}(\lambda, \mathcal{D}) = -\log \prod_{(x,t)\in\mathcal{D}} p(t|x) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x)$

$= \sum_{(x,t)\in\mathcal{D}} \left[ \log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t) \right]$

$= \sum_{(x,t)\in\mathcal{D}} \left[ \log \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)} - \sum_m \lambda_m \phi_m(x, t) \right]$

Training: Gradient descent (or gradient ascent if maximising the objective):

$\lambda^{(n+1)} = \lambda^{(n)} - \eta \nabla L$

... where $\eta$ is the step parameter.

$\nabla L_{MLE}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$

$\nabla L_{MAP}(\lambda, \mathcal{D}, \sigma) = \frac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$

... where $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts, and for each class $C_k$:

$\sum_{(x,t)\in\mathcal{D}} E[\phi(x, \cdot)] = \sum_{(x,t)\in\mathcal{D}} \phi(x, C_k)\, p(C_k|x)$

Regularisation: Penalise large values for the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian).

Objective function:

$L_{MAP}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \left[ -\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \right]$

$= \arg\min_\lambda \left[ -\log e^{-\frac{(0-\lambda)^2}{2\sigma^2}} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \right]$

$= \arg\min_\lambda \left[ \frac{\sum_m \lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \right]$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + w^T x)^2$). By the Representer Theorem:

$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\lambda^T \phi(x, C_k)}$

$= \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\, \phi(x_n, C_i)^T \phi(x, C_k)}$

$= \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\, K((x_n, C_i), (x, C_k))}$

$= \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \alpha_{nk}\, K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
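A rough sketch of the MAP training loop above, under the common assumption that the feature map $\phi(x, C_k)$ simply places the input vector in the block of weights belonging to class $k$ (i.e. softmax regression with one weight vector per class); the learning rate, $\sigma$ and iteration count are illustrative choices, not values from the sheet.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # subtract the row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_loglinear(X, t, num_classes, eta=0.1, sigma=10.0, iters=500):
    """Batch gradient descent on L_MAP = sum_m lambda_m^2 / (2 sigma^2) - sum log p(t|x)."""
    N, D = X.shape
    W = np.zeros((num_classes, D))              # the lambda parameters, one row per class
    T = np.eye(num_classes)[t]                  # one-hot targets: the empirical counts phi(x, t)
    for _ in range(iters):
        P = softmax(X @ W.T)                    # p(C_k | x) for every training instance
        grad = W / sigma**2 + (P - T).T @ X     # prior term + (expected - empirical) feature counts
        W = W - (eta / N) * grad                # lambda^{n+1} = lambda^n - eta * grad
    return W

def predict_loglinear(W, x):
    return np.argmax(W @ x)                     # argmax_k sum_m lambda_m phi_m(x, C_k)
```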

Perceptron

Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector when incorrectly classifying a training instance.

Model: Binary, linear classifier:

$y(x) = \mathrm{sign}(w^T x)$

... where:

$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}$

Multiclass perceptron:

$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Tries to minimise the error function, i.e. the number of incorrectly classified input vectors:

$\arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n$

... where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector if it is misclassified:

$w^{(i+1)} = w^{(i)} - \eta \nabla E_P(w) = w^{(i)} + \eta\, x_n t_n$

... where typically $\eta = 1$.

For the multiclass perceptron:

$w^{(i+1)} = w^{(i)} + \phi(x, t) - \phi(x, y(x))$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:

$y(x) = \mathrm{sign}\left( \sum_i c_i \times \mathrm{sign}(w_i^T x) \right)$

... where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INMK)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$, and one weight per training instance:

$y(x) = \mathrm{sign}\left( \sum_{n=1}^{N} w_n t_n K(x, x_n) \right)$

... and the update:

$w_n^{(i+1)} = w_n^{(i)} + 1$

Online learning: The perceptron is an online algorithm by default.
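A short sketch of the binary perceptron loop above, with targets in $\{-1, +1\}$ and $\eta = 1$; the epoch count and toy data are assumptions for the example.

```python
import numpy as np

def train_perceptron(X, t, epochs=10):
    """Binary perceptron: update w whenever sign(w.x) disagrees with the target t_n in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if np.sign(w @ x_n) != t_n:   # misclassified (np.sign(0) = 0 also triggers an update)
                w = w + x_n * t_n         # w_{i+1} = w_i + eta * x_n * t_n, with eta = 1
    return w

# Toy usage on linearly separable points (a constant 1 is appended as a bias feature)
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(np.sign(X @ w))                     # -> [ 1.  1. -1. -1.]
```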

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:

$y(x) = \sum_{n=1}^{N} \lambda_n t_n x^T x_n + w_0$

Objective:

Primal:

$\arg\min_{w, w_0} \frac{1}{2}||w||^2$
s.t. $t_n(w^T x_n + w_0) \geq 1 \;\; \forall n$

Dual:

$\tilde{L}(\Lambda) = \sum_{n=1}^{N} \lambda_n - \frac{1}{2}\sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$
s.t. $\lambda_n \geq 0, \;\; \sum_{n=1}^{N} \lambda_n t_n = 0, \;\; \forall n$

Training:

• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.

Primal:

$\arg\min_{w, w_0} \frac{1}{2}||w||^2 + C\sum_{n=1}^{N} \xi_n$
s.t. $t_n(w^T x_n + w_0) \geq 1 - \xi_n, \;\; \xi_n \geq 0 \;\; \forall n$

Dual:

$\tilde{L}(\Lambda) = \sum_{n=1}^{N} \lambda_n - \frac{1}{2}\sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$
s.t. $0 \leq \lambda_n \leq C, \;\; \sum_{n=1}^{N} \lambda_n t_n = 0, \;\; \forall n$

Complexity:

• QP: $O(n^3)$
• SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:

$y(x) = \sum_{n=1}^{N} \lambda_n t_n x^T x_n + w_0 = \sum_{n=1}^{N} \lambda_n t_n K(x, x_n) + w_0$

$\tilde{L}(\Lambda) = \sum_{n=1}^{N} \lambda_n - \frac{1}{2}\sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$
$= \sum_{n=1}^{N} \lambda_n - \frac{1}{2}\sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example:

• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
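Since the sheet points to Pegasos for online learning, here is a rough sketch of a Pegasos-style stochastic sub-gradient update for the linear soft-margin primal; the regularisation constant, iteration count, seed, and the omission of the bias term and projection step are all simplifying assumptions.

```python
import numpy as np

def pegasos_svm(X, t, lam=0.1, iters=1000, seed=0):
    """Stochastic sub-gradient descent on lam/2 ||w||^2 + (1/N) sum_n max(0, 1 - t_n w.x_n),
    with targets t in {-1, +1}. Bias term and optional projection step omitted."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for step in range(1, iters + 1):
        i = rng.integers(N)                # pick one training instance at random
        eta = 1.0 / (lam * step)           # decreasing step size 1 / (lambda * t)
        if t[i] * (w @ X[i]) < 1:          # margin violated: hinge-loss sub-gradient is active
            w = (1 - eta * lam) * w + eta * t[i] * X[i]
        else:                              # only the regulariser contributes to the sub-gradient
            w = (1 - eta * lam) * w
    return w
```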

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\forall n\; \sum_k r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster $k$.

Geometric distance: the Euclidean distance, $\ell_2$ norm:

$||x_n - \mu_k||_2 = \sqrt{\sum_{i=1}^{D} (x_{ni} - \mu_{ki})^2}$

Objective:

$\arg\min_{r,\mu} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, ||x_n - \mu_k||_2^2$

... i.e. minimise the distance from each cluster centre to each of its points.

Training:

Expectation:

$r_{nk} = \begin{cases} 1 & \text{if } ||x_n - \mu_k||^2 \text{ minimal for } k \\ 0 & \text{otherwise} \end{cases}$

Maximisation:

$\mu_k^{MLE} = \dfrac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$

... where $\mu_k$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means, as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
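A compact NumPy sketch of the expectation/maximisation loop above; the initialisation (K points chosen at random) and the fixed iteration count are assumptions.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate hard assignments r_nk and centroid updates mu_k."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]                # initial centroids
    for _ in range(iters):
        # Expectation: assign each point to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        r = np.argmin(dists, axis=1)
        # Maximisation: each centroid becomes the mean of the points assigned to it
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return mu, r
```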

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters by specifying probabilities:

$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\, p(z^{(i)})$

... with $z^{(i)} \sim \mathrm{Multinomial}(\gamma)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^{K} \gamma_{nj} = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:

$\mathcal{L}(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k) \right)$

Training:

Expectation: for each $n, k$ set:

$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)}; \gamma, \mu, \Sigma) \;\; (= p(k|x_n))$

$= \dfrac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\, p(z^{(i)} = k; \pi)}{\sum_{l=1}^{K} p(x^{(i)}|z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \pi)}$

$= \dfrac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}$

Maximisation:

$\pi_k = \frac{1}{N}\sum_{n=1}^{N} \gamma_{nk}$

$\mu_k = \dfrac{\sum_{n=1}^{N} \gamma_{nk} x_n}{\sum_{n=1}^{N} \gamma_{nk}}$

$\Sigma_k = \dfrac{\sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
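A hedged sketch of the EM updates above for a mixture of Gaussians, assuming scipy is available for the Gaussian density; the initialisation, iteration count and the small diagonal jitter added to each covariance are assumptions made for numerical convenience, not part of the sheet.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, iters=50, seed=0):
    """EM for a mixture of Gaussians: the E-step computes responsibilities gamma_nk,
    the M-step re-estimates pi_k, mu_k and Sigma_k."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mixture weights, means and covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # small diagonal jitter (not on the sheet) keeps the covariance invertible
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```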

[1] Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.