
• Lecture 4: Regression ctd and multiple classes

C19 Machine Learning Hilary 2015 A. Zisserman

Regression: Lasso, L1 regularization; SVM regression and epsilon-insensitive loss

More loss functions

Multi-class classification: using binary classifiers, random forests, neural networks

• Regression cost functions

Minimize with respect to $w$:

$$\sum_{i=1}^{N} l\left(f(x_i, w),\, y_i\right) + \lambda R(w)$$

(the first term is the loss function, the second the regularization)

There is a choice of both loss functions and regularization.

So far we have seen ridge regression:

squared loss: $\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2$

squared regularizer: $\lambda \|w\|^2$

Now, consider other losses and regularizers.
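As an aside, a minimal numpy sketch of the closed-form ridge solution for a linear model $f(x, w) = w^\top x$; the toy data and the value of lambda are made up for the illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    # Normal equations for the ridge objective: (X^T X + lam I) w = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data (made up for the example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)

w_ridge = ridge_fit(X, y, lam=1.0)
print(np.round(w_ridge, 3))
```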

• The Lasso, or L1 norm regularization

Minimize with respect to $w \in \mathbb{R}^d$:

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

(loss function + regularization)

LASSO = Least Absolute Shrinkage and Selection Operator

This is a quadratic optimization problem, and there is a unique solution.

p-norm definition:

$$\|w\|_p = \left( \sum_{j=1}^{d} |w_j|^p \right)^{1/p}$$

• Sparsity property of the Lasso

Contour plots for d = 2: the loss $\sum_{i=1}^{N} (y_i - f(x_i, w))^2$ against the regularizers $\|w\|^2$ (ridge regression) and $\sum_{j=1}^{d} |w_j|$ (lasso).

Minimum where loss contours tangent to regularizers

For the lasso case, minima occur at corners

Consequently one of the weights is zero

In high dimensions many weights can be zero

• Comparison of learnt weight vectors

[Figure: learnt weight vectors for Least Squares, Ridge Regression and LASSO]

Linear regression: N = 100, d = 10, $y = w^\top x$, $\lambda = 1$

Lasso: number of non-zero coefficients = 5
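This comparison can be reproduced roughly with scikit-learn, assuming a setup like the slide's N = 100, d = 10 linear model; the data, the true weights, and the mapping between the slide's lambda and scikit-learn's alpha are assumptions made for the sketch:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, d = 100, 10
w_true = np.array([1.5, -2.0, 0.0, 0.0, 3.0, 0.0, 0.0, -1.0, 0.0, 0.5])  # made-up sparse weights
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=1.0, fit_intercept=False).fit(X, y)

print("ridge non-zero weights:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
print("lasso non-zero weights:", np.sum(np.abs(lasso.coef_) > 1e-6))  # several exactly zero
```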

• Lasso in action

[Figure: coefficient values against percent of lambdaMax for Ridge Regression and L1-Regularized Least Squares, as the regularization parameter lambda varies]

• Sparse weight vectors

Weights being zero is a method of feature selection: it zeroes out the unimportant features

The SVM classifier also has this property (sparse alpha in the dual representation)

Ridge regression does not

• Other regularizers

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|^q$$

For $q \ge 1$, the cost function is convex and has a unique minimum. The solution can be obtained by quadratic optimization.

For $q < 1$, the problem is not convex, and obtaining the global minimum is more difficult.

• SVMs for Regression

Use the $\epsilon$-insensitive error measure

$$V_\epsilon(r) = \begin{cases} 0 & \text{if } |r| < \epsilon \\ |r| - \epsilon & \text{otherwise} \end{cases}$$

This can also be written as

$$V_\epsilon(r) = \left(|r| - \epsilon\right)_+$$

where $(\cdot)_+$ indicates the positive part of $(\cdot)$.

Or equivalently as

$$V_\epsilon(r) = \max\left(|r| - \epsilon,\, 0\right)$$

[Figure: $V_\epsilon(r)$ compared with the square loss — the cost is zero inside the epsilon tube]
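A direct transcription of the $\epsilon$-insensitive loss as a small sketch (the function name is mine):

```python
import numpy as np

def eps_insensitive(r, eps):
    """V_eps(r) = max(|r| - eps, 0): zero inside the eps tube, linear outside."""
    return np.maximum(np.abs(r) - eps, 0.0)

print(eps_insensitive(np.array([-0.3, 0.05, 0.5]), eps=0.1))  # [0.2, 0.0, 0.4]
```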

• As before, introduce slack variables for points that violate the $\epsilon$-insensitive error.

For each data point $x_i$, two slack variables, $\xi_i$ and $\hat{\xi}_i$, are required (depending on whether $f(x_i)$ is above or below the tube).

Learning is by the optimization

$$\min_{w \in \mathbb{R}^d,\, \xi_i,\, \hat{\xi}_i} \; C \sum_{i}^{N} \left( \xi_i + \hat{\xi}_i \right) + \frac{1}{2} \|w\|^2$$

subject to

$$y_i \le f(x_i, w) + \epsilon + \xi_i, \quad y_i \ge f(x_i, w) - \epsilon - \hat{\xi}_i, \quad \xi_i \ge 0, \quad \hat{\xi}_i \ge 0 \quad \text{for } i = 1 \dots N$$

Again, this is a quadratic programming problem:

It can be dualized

Some of the data points will become support vectors

It can be kernelized

• Example: SV regression with Gaussian basis functions

[Figure: data points on [0, 1] and the fitted regression function]

Regression function: Gaussians centred on the data points

$$f(x, a) = \sum_{i=1}^{N} a_i\, e^{-(x - x_i)^2 / 2\sigma^2}, \qquad N = 50$$

Parameters are: C, epsilon, sigma
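A rough equivalent of this example using scikit-learn's SVR, which solves the same $\epsilon$-insensitive optimization with an RBF kernel; the toy sine data and the parameter values are assumptions, not the slide's actual data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

sigma = 0.5
# scikit-learn parameterizes the RBF kernel by gamma = 1 / (2 sigma^2)
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.2, gamma=1.0 / (2 * sigma**2))
svr.fit(x.reshape(-1, 1), y)

print("number of support vectors:", len(svr.support_))
y_pred = svr.predict(np.linspace(0, 1, 200).reshape(-1, 1))
```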

• [Figure: SV regression fit with an RBF kernel — estimated function, $\pm\epsilon$ tube, support vectors and margin vectors]

C = 1000, epsilon = 0.2, sigma = 0.5

As epsilon increases: the fit becomes looser, and fewer data points are support vectors.

• Loss functions for regression

[Figure: the square, $\epsilon$-insensitive and Huber losses plotted against $y - f(x)$]

quadratic (square) loss: $\ell(y, f(x)) = \frac{1}{2}\left(y - f(x)\right)^2$

$\epsilon$-insensitive loss: $\ell(y, f(x)) = \max\left(|y - f(x)| - \epsilon,\, 0\right)$

Huber loss (mixed quadratic/linear, giving robustness to outliers): $\ell(y, f(x)) = h(y - f(x))$ with

$$h(r) = \begin{cases} r^2 & \text{if } |r| \le c \\ 2c|r| - c^2 & \text{otherwise} \end{cases}$$

All of these are convex.
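The three losses side by side in numpy, as a sketch mirroring the formulas above (the function names and default parameter values are mine):

```python
import numpy as np

def square_loss(r):
    return 0.5 * r**2

def eps_insensitive_loss(r, eps=0.5):
    return np.maximum(np.abs(r) - eps, 0.0)

def huber_loss(r, c=1.0):
    # quadratic for |r| <= c, linear (2c|r| - c^2) beyond, so the two pieces join smoothly
    return np.where(np.abs(r) <= c, r**2, 2 * c * np.abs(r) - c**2)

r = np.linspace(-3, 3, 7)
print(square_loss(r))
print(eps_insensitive_loss(r))
print(huber_loss(r))
```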

• Final notes on cost functions

Regressors and classifiers can be constructed by mixing and matching loss functions and regularizers to obtain a learning machine suited to a particular application, e.g. for a classifier:

L1-SVM:

$$\min_{w \in \mathbb{R}^d} \sum_{i}^{N} \max\left(0,\, 1 - y_i f(x_i)\right) + \lambda \|w\|_1$$

Least squares SVM:

$$\min_{w \in \mathbb{R}^d} \sum_{i}^{N} \left[\max\left(0,\, 1 - y_i f(x_i)\right)\right]^2 + \lambda \|w\|^2$$

where $f(x) = w^\top x + b$.
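For concreteness, evaluating the two objectives for a given weight vector (a sketch only; training would still require an optimizer, and the data here are placeholders):

```python
import numpy as np

def hinge(scores, y):
    return np.maximum(0.0, 1.0 - y * scores)

def l1_svm_objective(w, b, X, y, lam):
    # sum_i max(0, 1 - y_i f(x_i)) + lam * ||w||_1,  with f(x) = w.x + b
    return hinge(X @ w + b, y).sum() + lam * np.abs(w).sum()

def ls_svm_objective(w, b, X, y, lam):
    # sum_i [max(0, 1 - y_i f(x_i))]^2 + lam * ||w||^2
    return (hinge(X @ w + b, y) ** 2).sum() + lam * (w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))   # toy +/-1 labels
w, b = rng.normal(size=5), 0.0
print(l1_svm_objective(w, b, X, y, lam=0.1), ls_svm_objective(w, b, X, y, lam=0.1))
```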

• Multi-class Classification

• Multi-class classification: what we would like

Goal: a decision rule that divides the input space into K decision regions separated by decision boundaries

• Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm:

For each test point, x, to be classified, find the K nearest samples in the training data

Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

Naturally applicable to the multi-class case
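A minimal brute-force K-NN classifier in numpy following the algorithm above (function name and toy data are mine):

```python
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify a single test point x by majority vote of its K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)          # Euclidean distance to every training point
    nearest = np.argsort(dists)[:K]                      # indices of the K nearest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

# Toy 2-D, 3-class data (made up for the example)
X_train = np.array([[0, 0], [0.2, 0.1], [3, 3], [3.1, 2.9], [0, 3], [0.1, 3.2]])
y_train = np.array([0, 0, 1, 1, 2, 2])
print(knn_classify(X_train, y_train, np.array([0.1, 2.8]), K=3))   # -> 2
```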

• Build from binary classifiers

Learn: K two-class 1-vs-the-rest classifiers $f_k(x)$

[Figure: three classes C1, C2, C3 with the boundaries 1 vs 2 & 3, 2 vs 1 & 3 and 3 vs 1 & 2; regions marked '?' are ambiguous]

• Build from binary classifiers continued

Learn: K two-class 1-vs-the-rest classifiers $f_k(x)$

Classification: choose the class with the most positive score, $\max_k f_k(x)$

[Figure: classes C1, C2, C3 with the boundaries 1 vs 2 & 3, 2 vs 1 & 3 and 3 vs 1 & 2]

• Build from binary classifiers continued

Learn: K two-class 1-vs-the-rest classifiers $f_k(x)$

Classification: choose the class with the most positive score, $\max_k f_k(x)$

[Figure: the resulting decision regions for C1, C2, C3]

• Application: hand written digit recognition

Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

Training: learn K = 10 two-class 1-vs-the-rest SVM classifiers $f_k(x)$

Classification: choose the class with the most positive score

$$f(x) = \max_k f_k(x)$$
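A sketch of this pipeline with scikit-learn, using the small built-in 8×8 digits set rather than the 28×28 images on the slide (so the vectors are 64-dimensional here); the choice of LinearSVC and its parameters are mine:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

digits = load_digits()                       # 8x8 images flattened to 64-vectors
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Train K = 10 one-vs-the-rest linear SVMs, one per digit class
classifiers = [LinearSVC(C=1.0, max_iter=10000).fit(X_train, (y_train == k).astype(int))
               for k in range(10)]

# Classification: choose the class with the most positive score f_k(x)
scores = np.stack([clf.decision_function(X_test) for clf in classifiers], axis=1)
y_pred = np.argmax(scores, axis=1)
print("accuracy:", np.mean(y_pred == y_test))
```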

• Example

[Figure: hand-drawn digits and their classifications]

• Why not learn a multi-class SVM directly?

For example, for three classes, learn $w = (w_1, w_2, w_3)^\top$ using the cost function

$$\min_{w} \|w\|^2 \quad \text{subject to}$$

$$w_1^\top x_i \ge w_2^\top x_i \;\;\&\;\; w_1^\top x_i \ge w_3^\top x_i \quad \text{for } i \in \text{class 1}$$

$$w_2^\top x_i \ge w_3^\top x_i \;\;\&\;\; w_2^\top x_i \ge w_1^\top x_i \quad \text{for } i \in \text{class 2}$$

$$w_3^\top x_i \ge w_1^\top x_i \;\;\&\;\; w_3^\top x_i \ge w_2^\top x_i \quad \text{for } i \in \text{class 3}$$

This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum.

Note, a margin can also be included in the constraints.

In practice there is little or no improvement over the binary case.
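A sketch of this three-class QP using cvxpy, with the margin of 1 mentioned in the note included (without it, the minimum-norm solution is trivially $w = 0$); the toy data are made up and assumed linearly separable:

```python
import numpy as np
import cvxpy as cp

# Toy, well-separated 2-D data for three classes (an assumption for the example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(scale=0.3, size=(10, 2)) + m
               for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 10)

W = cp.Variable((3, 2))                       # one weight vector per class
constraints = []
for i, k in enumerate(y):
    for j in range(3):
        if j != k:
            # the correct class must score higher by a margin of 1
            constraints.append(W[k] @ X[i] >= W[j] @ X[i] + 1)

prob = cp.Problem(cp.Minimize(cp.sum_squares(W)), constraints)
prob.solve()
pred = np.argmax(X @ W.value.T, axis=1)
print("training accuracy:", np.mean(pred == y))
```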

• Random Forests

• Random Forest Overview

A natural multi-class classifier

Fast at test time

• Generic trees and decision trees

[Figure: a general tree structure — a root node, internal (split) nodes and terminal (leaf) nodes]

[Figure: a decision tree on features $x_1, x_2$ — the node tests $x_2 > 1$ and $x_1 > 1$ split the plane into regions classified as the green, blue or red class]

• Testing

[Figure: a test point '?' passed down the tree]

Choose the class with the max posterior at the leaf node

• Decision forest model: training and information gain

Node training (for discrete, non-parametric distributions): the information gain of a candidate split is measured using Shannon's entropy

[Figure: class histograms before the split, and after Split 1 and Split 2]
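A sketch of the information-gain computation behind node training, assuming the standard definition (parent entropy minus the size-weighted entropies of the children); the function names and toy labels are mine:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete (non-parametric) label distribution, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropies after the split."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = np.array([0, 0, 0, 1, 1, 1, 2, 2])
split1 = (parent[:4], parent[4:])            # fairly pure children
split2 = (parent[::2], parent[1::2])         # mixed children
print(information_gain(parent, *split1), information_gain(parent, *split2))
```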

• Trees vs Forests

A single tree may overfit to the training data

Instead train multiple trees and combine their predictions

Each tree can differ in both training data and node tests

Achieve this by injecting randomness into the training algorithm

[Figure: decision boundary of a single tree on $(x_1, x_2)$, showing a lack of margin]

• Forests and trees

A forest is an ensemble of trees. The trees are all slightly different from one another.

• Classification forest - injecting randomness I

(1) Bagging (randomizing the training set)

[Figure: forest training — the full training set, and the randomly sampled subset of training data made available to tree t]
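The bagging step in numpy, as one plausible implementation (a bootstrap sample with replacement per tree; the exact sampling scheme used in the lecture's figures is not specified, so this is an assumption):

```python
import numpy as np

def bagged_subsets(n_samples, n_trees, rng):
    """For each tree t, draw a bootstrap sample (with replacement) of the training indices."""
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_trees)]

rng = np.random.default_rng(0)
subsets = bagged_subsets(n_samples=100, n_trees=5, rng=rng)
# Each tree t would then be trained only on X[subsets[t]], y[subsets[t]]
print([len(np.unique(s)) for s in subsets])   # roughly 63 unique samples per tree on average
```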

• Classification forest - injecting randomness II

(2) Node training: random subsets of the node tests are made available at each node

[Figure: forest training — node weak learner and node test parameters, with trees T1, T2, T3]

• From a single tree classification ... what do we do at the leaf?

[Figure: the leaf nodes of a tree]

Prediction model: probabilistic

• How to combine the predictions of the trees?

[Figure: the ensemble model — trees t = 1, 2, 3 and the forest output probability]
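One common way to form the forest output probability is to average the per-tree posteriors, which is what scikit-learn's random forest does; a sketch (the iris data and parameters are placeholders):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Per-tree class posteriors at the leaves reached by each test point...
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
# ...averaged over the trees to give the forest output probability
forest_prob = per_tree.mean(axis=0)
print(np.allclose(forest_prob, forest.predict_proba(X[:5])))   # expected: True
```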

• Random Forests - summary

Node tests are usually chosen to be cheap to evaluate (weak classifiers), e.g. comparing pairs of feature components

At run time, classification is very fast (like AdaBoost)

Clas