
  • Lecture 4: Regression (ctd) and multiple classes

    C19 Machine Learning, Hilary 2015, A. Zisserman

    Regression: the Lasso and L1 regularization; SVM regression and the epsilon-insensitive loss; more loss functions

    Multi-class classification: using binary classifiers; random forests; neural networks

  • Regression cost functions

    Minimize with respect to w:

        \sum_{i=1}^{N} l(f(x_i, w), y_i) + \lambda R(w)

    (loss function + regularization)

    There is a choice of both loss function and regularizer.

    So far we have seen ridge regression:

        squared loss:          \sum_{i=1}^{N} (y_i - f(x_i, w))^2

        squared regularizer:   \lambda \|w\|^2

    Now, consider other losses and regularizers.
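    As a concrete illustration of the ridge objective above, here is a minimal numpy sketch (not from the lecture) that fits a linear model f(x, w) = w . x using the closed-form ridge solution w = (X^T X + \lambda I)^{-1} X^T y; the data and \lambda are made up for illustration.

    ```python
    import numpy as np

    # Synthetic data (illustrative only): N points in d dimensions
    rng = np.random.default_rng(0)
    N, d = 100, 10
    X = rng.normal(size=(N, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=N)

    # Ridge regression: minimize  sum_i (y_i - w.x_i)^2 + lam * ||w||^2
    # Closed-form solution: w = (X^T X + lam * I)^{-1} X^T y
    lam = 1.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print("ridge weights:", np.round(w_ridge, 3))
    ```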

  • The Lasso or L1 norm regularization

    Minimize with respect to w in R^d:

        \sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \sum_{j=1}^{d} |w_j|

    (loss function + regularization)

    LASSO = Least Absolute Shrinkage and Selection Operator

    This is a quadratic optimization problem; there is a unique solution.

    p-norm definition:

        \|w\|_p = \left( \sum_{j=1}^{d} |w_j|^p \right)^{1/p}

  • Sparsity property of the Lasso: contour plots for d = 2

    [Figure: contours of the loss \sum_{i=1}^{N} (y_i - f(x_i, w))^2 together with the
    regularizer contours, \|w\|^2 for ridge regression and \sum_{j=1}^{d} |w_j| for the lasso]

    The minimum is where the loss contours are tangent to the regularizer contours

    For the lasso case, minima occur at corners

    Consequently one of the weights is zero

    In high dimensions many weights can be zero

  • Comparison of learnt weight vectors

    [Figure: the learnt weight vectors (coefficient value against index) for Least
    Squares, Ridge Regression and LASSO]

    Linear regression, y = w . x, with N = 100, d = 10, lambda = 1

    Lasso: number of non-zero coefficients = 5
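    A setup like this slide's (N = 100, d = 10, y = w . x, lambda = 1) can be mimicked with scikit-learn's Ridge and Lasso estimators. The sketch below uses synthetic data of my own with a sparse ground-truth w, so the exact count of non-zero lasso coefficients will not match the slide; note also that scikit-learn scales the squared loss by 1/(2N), so its alpha is not identical to the lambda above.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    # Illustrative data: only a few features truly matter, so the lasso can zero the rest
    rng = np.random.default_rng(1)
    N, d = 100, 10
    X = rng.normal(size=(N, d))
    w_true = np.zeros(d)
    w_true[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
    y = X @ w_true + 0.1 * rng.normal(size=N)

    ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
    lasso = Lasso(alpha=1.0, fit_intercept=False).fit(X, y)

    print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-6))   # typically all d
    print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-6))   # typically far fewer
    ```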

  • Lasso in action

    [Figure: coefficient values against the regularization parameter lambda (as a
    percent of lambdaMax) for Ridge Regression and L1-Regularized Least Squares]

  • Sparse weight vectors

    Weights being zero is a method of feature selection: it zeroes out the unimportant features

    The SVM classifier also has this property (sparse alpha in the dual representation)

    Ridge regression does not

  • Other regularizers

        \sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \sum_{j=1}^{d} |w_j|^q

    For q >= 1, the cost function is convex and has a unique minimum. The solution can
    be obtained by quadratic optimization.

    For q < 1, the problem is not convex, and obtaining the global minimum is more difficult.

  • SVMs for Regression

    Use the \epsilon-insensitive error measure

        V_\epsilon(r) = 0                  if |r| < \epsilon
                      = |r| - \epsilon     otherwise

    This can also be written as

        V_\epsilon(r) = (|r| - \epsilon)_+

    where (.)_+ indicates the positive part of (.)

    Or equivalently as

        V_\epsilon(r) = \max(|r| - \epsilon, 0)

    [Figure: V_\epsilon(r) compared with the square loss; the cost is zero inside the
    epsilon tube]
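    The \epsilon-insensitive loss is easy to write down directly; a small numpy sketch (illustrative only, with a made-up \epsilon):

    ```python
    import numpy as np

    def eps_insensitive(r, eps=0.2):
        """V_eps(r) = max(|r| - eps, 0): zero inside the epsilon tube, linear outside."""
        return np.maximum(np.abs(r) - eps, 0.0)

    r = np.linspace(-1.0, 1.0, 5)
    print(eps_insensitive(r))   # residuals inside [-0.2, 0.2] incur zero cost
    ```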

  • As before, introduce slack variables for points that violate the \epsilon-insensitive error.

    For each data point x_i, two slack variables \xi_i, \hat{\xi}_i are required
    (depending on whether f(x_i) is above or below the tube)

    Learning is by the optimization

        \min_{w \in R^d, \xi_i, \hat{\xi}_i}  C \sum_{i=1}^{N} (\xi_i + \hat{\xi}_i) + \frac{1}{2}\|w\|^2

    subject to

        y_i \le f(x_i, w) + \epsilon + \xi_i,   y_i \ge f(x_i, w) - \epsilon - \hat{\xi}_i,
        \xi_i \ge 0,  \hat{\xi}_i \ge 0,   for i = 1 ... N

    Again, this is a quadratic programming problem

    It can be dualized

    Some of the data points will become support vectors

    It can be kernelized

  • Example: SV regression with Gaussian basis functions

    [Figure: the training data on [0, 1] and the fitted regression function]

    Regression function: Gaussians centred on the data points

        f(x, a) = \sum_{i=1}^{N} a_i \, e^{-(x - x_i)^2 / 2\sigma^2},    N = 50

    Parameters are: C, epsilon, sigma

  • [Figure: the estimated function, the \pm\epsilon tube, and the support and margin
    vectors for an RBF fit with C = 1000, epsilon = 0.2, sigma = 0.5]

    As epsilon increases, the fit becomes looser and fewer data points are support vectors
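    Kernelized SV regression of this kind is available in scikit-learn as sklearn.svm.SVR; the sketch below reuses the slide's parameters (C = 1000, epsilon = 0.2, sigma = 0.5, with gamma = 1/(2 sigma^2)) on synthetic data of my own, so the support-vector count will not match the figure exactly.

    ```python
    import numpy as np
    from sklearn.svm import SVR

    # Synthetic 1-D regression data on [0, 1] (illustrative only)
    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 1.0, size=50)).reshape(-1, 1)
    y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=50)

    # Gaussian (RBF) kernel SVR with the slide's parameters; gamma = 1 / (2 sigma^2)
    sigma = 0.5
    svr = SVR(kernel="rbf", C=1000.0, epsilon=0.2, gamma=1.0 / (2 * sigma**2))
    svr.fit(x, y)

    print("number of support vectors:", len(svr.support_))
    # Increasing epsilon widens the tube: the fit becomes looser and fewer points
    # remain support vectors.
    ```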

  • Loss functions for regression

    [Figure: the square, \epsilon-insensitive and Huber losses plotted against y - f(x)]

    quadratic (square) loss:      \ell(y, f(x)) = \frac{1}{2}(y - f(x))^2

    \epsilon-insensitive loss:    \ell(y, f(x)) = \max(|y - f(x)| - \epsilon, 0)

    Huber loss (mixed quadratic/linear), for robustness to outliers:

        \ell(y, f(x)) = h(y - f(x)),  where
        h(r) = r^2             if |r| <= c
             = 2c|r| - c^2     otherwise

    All of these are convex.
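    For completeness, the square and Huber losses in this figure can be written as plain functions (the \epsilon-insensitive loss was sketched earlier); the threshold c below is an illustrative assumption.

    ```python
    import numpy as np

    def square_loss(r):
        # quadratic (square) loss: (1/2) r^2, with r = y - f(x)
        return 0.5 * r**2

    def huber_loss(r, c=1.0):
        # quadratic for |r| <= c, linear (2c|r| - c^2) beyond: gives robustness to outliers
        return np.where(np.abs(r) <= c, r**2, 2 * c * np.abs(r) - c**2)

    r = np.linspace(-3.0, 3.0, 7)
    print(square_loss(r))
    print(huber_loss(r))   # grows only linearly for large |r|
    ```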

  • Final notes on cost functions

    Regressors and classifiers can be constructed by a mix and match of loss functions
    and regularizers, to obtain a learning machine suited to a particular application,
    e.g. for a classifier:

    L1-SVM

        \min_{w \in R^d} \sum_{i=1}^{N} \max(0, 1 - y_i f(x_i)) + \lambda \|w\|_1

    Least squares SVM

        \min_{w \in R^d} \sum_{i=1}^{N} [\max(0, 1 - y_i f(x_i))]^2 + \lambda \|w\|^2

    where f(x) = w^\top x + b
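    As a concrete reading of the mix-and-match idea, the sketch below simply evaluates the L1-SVM objective above for a given w (hinge loss plus an L1 penalty). The value of lambda, the toy data, and the bias term are illustrative assumptions, and this is only the objective, not a solver.

    ```python
    import numpy as np

    def l1_svm_objective(w, b, X, y, lam=0.1):
        """Hinge loss sum_i max(0, 1 - y_i f(x_i)) plus lam * ||w||_1, with f(x) = w.x + b."""
        margins = 1.0 - y * (X @ w + b)
        hinge = np.maximum(margins, 0.0).sum()
        return hinge + lam * np.abs(w).sum()

    # Tiny illustrative problem: labels in {-1, +1}
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    print(l1_svm_objective(np.array([0.5, 0.5]), 0.0, X, y))
    ```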

  • Multi-class Classification

  • Multi-class classification: what we would like

    Assign an input vector to one of K classes

    Goal: a decision rule that divides input space into K decision regions
    separated by decision boundaries

  • Reminder: K Nearest Neighbour (K-NN) Classifier

    Algorithm:
      For each test point x to be classified, find the K nearest samples in the training data
      Classify the point x according to the majority vote of their class labels

    e.g. K = 3

    Naturally applicable to the multi-class case
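    For reference, the K-NN rule just described corresponds to scikit-learn's KNeighborsClassifier; a minimal sketch with made-up 2-D data and K = 3:

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Illustrative training data: three classes in 2-D
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                   rng.normal([3, 0], 0.5, (20, 2)),
                   rng.normal([0, 3], 0.5, (20, 2))])
    y = np.repeat([0, 1, 2], 20)

    # K = 3: classify a test point by the majority vote of its 3 nearest training samples
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[2.5, 0.2], [0.1, 2.8]]))
    ```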

  • Build from binary classifiers

    Learn: K two-class 1-vs-the-rest classifiers f_k(x)

    [Figure: three classes C1, C2, C3 and the binary decision boundaries
    "1 vs 2 & 3", "2 vs 1 & 3", "3 vs 1 & 2"; the regions marked "?" are ambiguous]

  • Build from binary classifiers (continued)

    Learn: K two-class 1-vs-the-rest classifiers f_k(x)

    Classification: choose the class with the most positive score, \max_k f_k(x)

    [Figure: classes C1, C2, C3 and the boundaries "1 vs 2 & 3", "2 vs 1 & 3", "3 vs 1 & 2"]

  • Build from binary classifiers (continued)

    Learn: K two-class 1-vs-the-rest classifiers f_k(x)

    Classification: choose the class with the most positive score, \max_k f_k(x)

    [Figure: the resulting decision regions for C1, C2, C3 under \max_k f_k(x)]

  • Application: hand-written digit recognition

    Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

    Training: learn K = 10 two-class 1-vs-the-rest SVM classifiers f_k(x)

    Classification: choose the class with the most positive score

        f(x) = \max_k f_k(x)
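    As a rough stand-in for the MNIST setup on this slide, scikit-learn's small 8 x 8 digits dataset with a linear SVM wrapped in a one-vs-rest scheme gives the same pipeline in miniature. This is an illustrative sketch, not the lecture's experiment (the images here are 64-vectors rather than 784-vectors).

    ```python
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # 8x8 digit images flattened to 64-vectors, 10 classes (0-9)
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train K = 10 two-class 1-vs-the-rest linear SVMs f_k(x);
    # prediction takes the class with the most positive score, f(x) = max_k f_k(x)
    clf = OneVsRestClassifier(LinearSVC(dual=False)).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))
    ```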

  • Example

    [Figure: hand-drawn digits and the classifications assigned to them]

  • Why not learn a multi-class SVM directly? For example, for three classes:

    Learn w = (w_1, w_2, w_3)^\top using the cost function

        \min_w \|w\|^2   subject to

        w_1^\top x_i \ge w_2^\top x_i  and  w_1^\top x_i \ge w_3^\top x_i   for i in class 1
        w_2^\top x_i \ge w_3^\top x_i  and  w_2^\top x_i \ge w_1^\top x_i   for i in class 2
        w_3^\top x_i \ge w_1^\top x_i  and  w_3^\top x_i \ge w_2^\top x_i   for i in class 3

    This is a quadratic optimization problem subject to linear constraints, and there is
    a unique minimum

    Note, a margin can also be included in the constraints

    In practice there is little or no improvement over the binary case
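    A single-machine multi-class linear SVM in this spirit (with margins in the constraints) is what scikit-learn's LinearSVC offers via multi_class="crammer_singer"; the sketch below contrasts it with the default one-vs-rest setting. This is an analogous formulation, not a line-by-line transcription of the constraints above.

    ```python
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    ovr = LinearSVC(multi_class="ovr", dual=False).fit(X_train, y_train)
    cs = LinearSVC(multi_class="crammer_singer").fit(X_train, y_train)

    # In practice the two typically score very similarly, echoing the slide's remark
    print("one-vs-rest:   ", ovr.score(X_test, y_test))
    print("crammer-singer:", cs.score(X_test, y_test))
    ```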

  • Random Forests

  • Random Forest Overview

    A natural multiple-class classifier

    Fast at test time

    Start with a single tree, and then move on to forests

  • Generic trees and decision trees

    [Figure: a general tree structure, with a root node, internal (split) nodes and
    terminal (leaf) nodes]

    [Figure: a decision tree in the (x1, x2) plane, with node tests x2 > 1 and x1 > 1
    and leaf nodes classifying points as the green, blue or red class]

  • Testing

    [Figure: a test point '?' is passed down the tree]

    Choose the class with the maximum posterior at the leaf node

  • Decision forest model: training and information gain

    Node training: choose the node test that maximizes the information gain of the split

        Information gain:     I = H(S) - \sum_{i \in \{L, R\}} \frac{|S^i|}{|S|} H(S^i)

        Shannon's entropy:    H(S) = - \sum_{c} p(c) \log p(c)

    (for discrete, non-parametric distributions)

    [Figure: the class distribution before the split, and the child distributions for
    two candidate splits, Split 1 and Split 2]
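    A hedged sketch of the node-training criterion: Shannon entropy of a discrete class distribution and the information gain of a candidate split, with made-up label arrays. This follows the standard definitions rather than any code from the lecture.

    ```python
    import numpy as np

    def entropy(labels):
        """Shannon entropy H(S) = -sum_c p(c) log2 p(c) of a set of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        """I = H(S) - sum_i |S_i|/|S| * H(S_i) over the two children of a split."""
        n = len(parent)
        return (entropy(parent)
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))

    # Two candidate splits of the same parent set: the purer split has the higher gain
    parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    print(information_gain(parent, parent[:4], parent[4:]))    # perfectly pure children
    print(information_gain(parent, parent[::2], parent[1::2])) # children as mixed as the parent
    ```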

  • Trees vs Forests

    A single tree may overfit to the training data

    Instead, train multiple trees and combine their predictions

    Each tree can differ in both training data and node tests

    Achieve this by injecting randomness into the training algorithm

    [Figure: a single tree's decision boundary in the (x1, x2) plane, showing a lack of margin]

  • Forests and trees

    A forest is an ensemble of trees. The trees are all slightly different from one another.

  • Classification forest - injecting randomness I

    (1) Bagging (randomizing the training set)

    From the full training set, a randomly sampled subset of the training data is
    made available for each tree t

    [Figure: forest training, with a different random subset feeding each tree]

  • Classification forest - injecting randomness II

    (2) Node training: random subsets of the node tests are made available at each node

    [Figure: forest training, showing the node weak learner and the node test
    parameters for trees T1, T2, T3]

  • From a single-tree classification: what do we do at the leaf?

    [Figure: a tree and its leaf nodes]

    Prediction model: probabilistic

  • How to combine the predictions of the trees?

    [Figure: trees t = 1, 2, 3, each producing a class posterior, and the forest
    output probability]

    The ensemble model
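    One standard reading of this ensemble model, where each tree stores class posteriors at its leaves and the forest averages them, matches what scikit-learn's RandomForestClassifier does in predict_proba; a minimal illustrative sketch on the small digits dataset (not the lecture's data):

    ```python
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Bagging and per-node feature subsampling inject the randomness discussed above
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)

    # The forest posterior is the average of the individual tree posteriors
    tree_probs = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
    print(np.allclose(tree_probs, forest.predict_proba(X_test)))   # expected: True
    print("test accuracy:", forest.score(X_test, y_test))
    ```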

  • Random Forests - summary

    Node tests are usually chosen to be cheap to evaluate (weak classifiers),
    e.g. comparing pairs of feature components

    At run time, classification is very fast (like AdaBoost)
