Machine Learning Seminar: Support Vector Regression
Presented by: Heng Ji, 10/08/03
Outline
• Regression Background
• Linear ε-Insensitive Loss Algorithm
  • Primal Formulation
  • Dual Formulation
  • Kernel Formulation
• Quadratic ε-Insensitive Loss Algorithm
• Kernel Ridge Regression & Gaussian Process
Regression = find a function that fits the observations
Observations:
(1949, 100), (1950, 117), ..., (1996, 1462), (1997, 1469), (1998, 1467), (1999, 1474)
(x,y) pairs
Linear fit...
Not so good...
ŷ = {215, 184, ..., 1314, 1345}
y = {100, 117, ..., 1467, 1474}
Better linear fit...
Take the logarithm of y and fit a straight line
Transform back to original
So so...
ŷ = {83, 88, ..., 1660, 1765}
y = {100, 117, ..., 1467, 1474}
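The log-transform trick from this slide is a few lines of NumPy. A minimal sketch, with a handful of hypothetical (year, value) pairs standing in for the full observation list:

```python
import numpy as np

# Hypothetical (year, value) pairs standing in for the slide's observation table.
years = np.array([1949, 1950, 1996, 1997, 1998, 1999], dtype=float)
y = np.array([100, 117, 1462, 1469, 1467, 1474], dtype=float)

# Fit a straight line to log(y) instead of y itself...
slope, intercept = np.polyfit(years, np.log(y), deg=1)

# ...then transform back to the original scale.
y_hat = np.exp(intercept + slope * years)
print(np.round(y_hat))
```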
So what is regression about?
Construct a model of a process, using examples of the process.
Input: x (possibly a vector)
Output: y (generated by the process)
Examples: pairs of input and output {y, x}
Our model: the function f(x), our estimate of the true underlying function g(x)
Assumption about the process
The “fixed regressor model”
y(n) = g[x(n)] + ε(n)

x(n): observed input
y(n): observed output
g[x(n)]: true underlying function
ε(n): i.i.d. noise process with zero mean

Data set: {x(n), y(n)}, n = 1, ..., N
Example
[Figure: observations (y) scattered around the underlying function (g), plotted over x ∈ [-1, 1]]

g(x) = 0.5 + x + x² + 6x³
Model Sets (examples)
g(x) = 0.5 + x + x² + 6x³

M₁ = {a + bx}: linear
M₂ = {a + bx + cx²}: quadratic
M₃ = {a + bx + cx² + dx³}: cubic
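To make the model-set comparison concrete, here is a small sketch that draws noisy samples from g(x) = 0.5 + x + x² + 6x³ and fits each of the three families; the sample size and noise level are my own choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # True underlying function from the slide.
    return 0.5 + x + x**2 + 6 * x**3

# Noisy observations on [-1, 1]; the noise level is an assumption.
x = rng.uniform(-1, 1, size=20)
y = g(x) + rng.normal(scale=1.0, size=x.shape)

# Fit the model sets M1 (linear), M2 (quadratic), M3 (cubic).
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"M{degree}: training MSE = {mse:.3f}")
```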
Idealized regression
[Figure: the true function g(x), the model set (our hypothesis set), the best approximation f_opt(x) within it, and the error between g(x) and f_opt(x)]
Find an appropriate model family, then find the f(x) with minimum "distance" to g(x) (the "error").
How do we measure "distance"?
• Q: What is the distance (difference) between functions f and g?
Margin Slack Variable
For an example (xᵢ, yᵢ) and a function f, the margin slack variable is

ξᵢ = max(0, |yᵢ − f(xᵢ)| − (θ − γ))

θ: target accuracy in testing
γ: difference between the target accuracy and the margin in training
ε-Insensitive Loss Function
Let ε = θ − γ. The margin slack variable becomes
ξᵢ = max(0, |yᵢ − f(xᵢ)| − ε)

Linear ε-insensitive loss:
L(y) = |y − f(x)|_ε = max(0, |y − f(x)| − ε)

Quadratic ε-insensitive loss:
L(y) = |y − f(x)|²_ε = (max(0, |y − f(x)| − ε))²
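Both losses are one-liners in code; a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def linear_eps_insensitive(y, f_x, eps):
    # max(0, |y - f(x)| - eps): zero inside the eps-tube, grows linearly outside.
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def quadratic_eps_insensitive(y, f_x, eps):
    # (max(0, |y - f(x)| - eps))^2: zero inside the tube, grows quadratically outside.
    return np.maximum(0.0, np.abs(y - f_x) - eps) ** 2

# A residual of 0.3 with eps = 0.1 costs 0.2 (linear) or 0.04 (quadratic).
print(linear_eps_insensitive(1.3, 1.0, 0.1), quadratic_eps_insensitive(1.3, 1.0, 0.1))
```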
Linear ε-Insensitive Loss ⇒ a Linear SV Machine
[Figure: the ε-tube around the regression function; points outside the tube incur slacks ξ, ξ*, measured from the residual yᵢ − ⟨w, xᵢ⟩]
Basic Idea of SV Regression
Starting point: we have input data X = {(x₁, y₁), ..., (x_N, y_N)}
Goal:
We want to find a robust function f(x) that has at most ε deviation from the targets y, while at the same time being as flat as possible.
Idea: Simple Regression Problem + Optimization + Kernel Trick
Thus setting: f(x) = ⟨w, x⟩ + b, with w ∈ X, b ∈ ℝ

Primal Regression Problem

min  ½ ||w||²
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε
Linear ε-Insensitive Loss Regression

min  ½ ||w||² + C Σᵢ (ξᵢ + ξᵢ*)
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0

ε decides the width of the insensitive zone
C sets a trade-off between the training error and ||w||
ε and C must be tuned simultaneously
Regression is more difficult than classification?
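The same optimization is available off the shelf; a sketch using scikit-learn's SVR with a linear kernel, where C and epsilon map directly onto the trade-off above (data and hyperparameter values are assumptions):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy 1-D data: a noisy line.
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X.ravel() + 0.5 + rng.normal(scale=0.1, size=50)

# epsilon sets the width of the insensitive zone; C trades training error against ||w||.
model = SVR(kernel="linear", C=10.0, epsilon=0.1)
model.fit(X, y)

print("w =", model.coef_.ravel(), " b =", model.intercept_)
print("support vectors:", len(model.support_), "of", len(y))
```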
Parameters used in SV Regression
Dual Formulation
• The Lagrangian function will help us formulate the dual problem.
• ε: insensitive-loss parameter; βᵢ, βᵢ*: Lagrange multipliers
  ξᵢ: difference value for points above the ε-band
  ξᵢ*: difference value for points below the ε-band
The Lagrangian (all sums run over i = 1, ..., l):

L = ½ ||w||² + C Σᵢ (ξᵢ + ξᵢ*)
    − Σᵢ βᵢ (ε + ξᵢ − yᵢ + ⟨w, xᵢ⟩ + b)
    − Σᵢ βᵢ* (ε + ξᵢ* + yᵢ − ⟨w, xᵢ⟩ − b)
    − Σᵢ (ηᵢ ξᵢ + ηᵢ* ξᵢ*)

• Optimality Conditions
  ∂L/∂w = w − Σᵢ (βᵢ − βᵢ*) xᵢ = 0
  ∂L/∂b = Σᵢ (βᵢ* − βᵢ) = 0
  ∂L/∂ξᵢ = C − βᵢ − ηᵢ = 0
  ∂L/∂ξᵢ* = C − βᵢ* − ηᵢ* = 0
Dual Formulation (Cont'd)
• Dual Problem

max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) ⟨xᵢ, xⱼ⟩ − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ∈ [0, C]

• Solving

w = Σᵢ (βᵢ − βᵢ*) xᵢ,  and  f(x) = Σᵢ (βᵢ − βᵢ*) ⟨xᵢ, x⟩ + b
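A quick numerical check of this expansion (a sketch on made-up data; scikit-learn's SVR stores βᵢ − βᵢ* for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X).ravel() + rng.normal(scale=0.05, size=40)

model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)

beta_diff = model.dual_coef_.ravel()     # (beta_i - beta_i*) for each support vector
sv = model.support_vectors_

# f(x) = sum_i (beta_i - beta_i*) <x_i, x> + b, evaluated on the training inputs.
f = (X @ sv.T) @ beta_diff + model.intercept_
print("matches model.predict:", np.allclose(f, model.predict(X)))
```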
KKT Optimality Conditions and b
• KKT optimality conditions:
  βᵢ (ε + ξᵢ − yᵢ + ⟨w, xᵢ⟩ + b) = 0
  βᵢ* (ε + ξᵢ* + yᵢ − ⟨w, xᵢ⟩ − b) = 0
  (C − βᵢ) ξᵢ = 0
  (C − βᵢ*) ξᵢ* = 0
• b can be computed as follows:
  b = yᵢ − ⟨w, xᵢ⟩ − ε   for βᵢ ∈ (0, C)
  b = yᵢ − ⟨w, xᵢ⟩ + ε   for βᵢ* ∈ (0, C)
This means that the Lagrange multipliers will only be non-zero for points outside the band. Thus these points are the support vectors
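This sparsity property is easy to check empirically: with scikit-learn's SVR as the solver (a sketch on synthetic data), every point that is not a support vector should lie inside (or on) the ε-band.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(60, 1))
y = 1.5 * X.ravel() + rng.normal(scale=0.2, size=60)

eps = 0.2
model = SVR(kernel="linear", C=1.0, epsilon=eps).fit(X, y)

residuals = np.abs(y - model.predict(X))
non_sv = np.setdiff1d(np.arange(len(y)), model.support_)

# Non-support-vectors have zero multipliers, so they sit inside (or on) the eps-band.
print("max |residual| over non-SVs:", residuals[non_sv].max(), "(eps =", eps, ")")
```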
The Idea of SVM
• input space → feature space
Kernel Version
Why can we use a kernel?
The complexity of a function's representation depends only on the number of SVs, and the complete algorithm can be described in terms of inner products.
An implicit mapping to the feature space
Mapping via Kernel
w = Σᵢ (βᵢ − βᵢ*) φ(xᵢ)

max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) k(xᵢ, xⱼ) − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ∈ [0, C]
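A sketch of the kernelized version using scikit-learn's SVR with an RBF kernel, where k(xᵢ, xⱼ) replaces the inner product and the feature map φ is never formed explicitly (data and hyperparameters are my own choices):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=80)

# RBF kernel: k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); the feature map stays implicit.
model = SVR(kernel="rbf", gamma=1.0, C=10.0, epsilon=0.05).fit(X, y)
print("support vectors used:", len(model.support_), "of", len(y))
```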
Quadratic ε-Insensitive Loss Regression

Problem: min  ½ ||w||² + C Σᵢ (ξᵢ² + ξᵢ*²)
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0

Kernel Formulation:
max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) (k(xᵢ, xⱼ) + δᵢⱼ/C) − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ≥ 0
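scikit-learn does not expose a kernelized quadratic-loss SVR directly, but its LinearSVR supports a squared ε-insensitive loss, which gives a primal-space sketch of the formulation above (data and parameter values are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X.ravel() - 1.0 + rng.normal(scale=0.3, size=100)

# 'squared_epsilon_insensitive' penalizes the slack variables quadratically.
model = LinearSVR(loss="squared_epsilon_insensitive", C=1.0, epsilon=0.1, max_iter=10_000)
model.fit(X, y)
print("w =", model.coef_, " b =", model.intercept_)
```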
Kernel Ridge Regression & Gaussian Processes
ε = 0 ⇒ least-squares linear regression; the weight-decay factor is controlled by C (λ ~ 1/C)

min  λ ||w||² + Σᵢ ξᵢ²
subject to  yᵢ − ⟨w, xᵢ⟩ = ξᵢ

Kernel Formulation (I: identity matrix):
f(x) = ⟨w, x⟩ = y′ (K + λI)⁻¹ k
which is also the mean of a Gaussian distribution.
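The closed-form solution above is a few lines of NumPy; this sketch uses an RBF kernel (my choice) for k and K:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows in A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.cos(2 * X).ravel() + rng.normal(scale=0.05, size=30)

lam = 0.1                                              # ridge parameter lambda (~ 1/C)
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lambda*I)^{-1} y

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
k = rbf_kernel(X_test, X)                              # kernel between test and training points
f_test = k @ alpha                                     # f(x) = y'(K + lambda*I)^{-1} k
print(f_test)
```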
Architecture of SV Regression Machine
similar to regression in a three-layered neural network!?
Conclusion
SVM is a useful alternative to neural networkTwo key concepts of SVM• optimization• kernel trickAdvantages of SV Regression• Represent solution by a small subset of training
points• Ensure the existence of global minimum• Ensure the optimization of a reliable eneralization
bound
Discussion 1: Influence of the insensitivity band on regression quality
17 measured training data points are used. Left: ε = 0.1, 15 SVs are chosen. Right: ε = 0.5, the 6 chosen SVs produce a much better regression function.
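A sketch of the same experiment with scikit-learn's SVR (on synthetic data, so the exact SV counts from the slide will not be reproduced): widening ε reduces the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(17, 1))          # 17 training points, echoing the slide
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=17)

for eps in (0.1, 0.5):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon = {eps}: {len(model.support_)} support vectors")
```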
Discussion 2: ε-Insensitive Loss
• Enables sparseness among the SVs, but does it guarantee sparseness?
• Robust (to small changes in the data/model)
• Less sensitive to outliers