Machine Learning Seminar: Support Vector Regression
Presented by: Heng Ji, 10/08/03
Outline
• Regression Background
• Linear ε-Insensitive Loss Algorithm
  • Primal Formulation
  • Dual Formulation
  • Kernel Formulation
• Quadratic ε-Insensitive Loss Algorithm
• Kernel Ridge Regression & Gaussian Process
Regression = find a function that fits the observations
Observations:
(1949, 100), (1950, 117), ..., (1996, 1462), (1997, 1469), (1998, 1467), (1999, 1474)
(x,y) pairs
Linear fit...
Not so good...
ŷ = {215, 184, ..., 1314, 1345}
y = {100, 117, ..., 1467, 1474}
Better linear fit...
Take the logarithm of y and fit a straight line
Transform back to original
So so...
ŷ = {83, 88, ..., 1660, 1765}
y = {100, 117, ..., 1467, 1474}
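The log-transform trick from this slide is a few lines of NumPy. A minimal sketch, with a handful of hypothetical (year, value) pairs standing in for the full observation list:

```python
import numpy as np

# Hypothetical (year, value) pairs standing in for the slide's observation table.
years = np.array([1949, 1950, 1996, 1997, 1998, 1999], dtype=float)
y = np.array([100, 117, 1462, 1469, 1467, 1474], dtype=float)

# Fit a straight line to log(y) instead of y itself...
slope, intercept = np.polyfit(years, np.log(y), deg=1)

# ...then transform back to the original scale.
y_hat = np.exp(intercept + slope * years)
print(np.round(y_hat))
```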
So what is regression about?
Construct a model of a process, using examples of the process.
Input: x (possibly a vector)
Output: y (generated by the process)
Examples: pairs of input and output {y, x}
Our model: the function f(x), our estimate of the true underlying function g(x)
Assumption about the process
The “fixed regressor model”
y(n) = g[x(n)] + ε(n)

x(n): observed input
y(n): observed output
g[x(n)]: true underlying function
ε(n): i.i.d. noise process with zero mean

Data set: {x(n), y(n)}, n = 1, ..., N
Example
[Figure: observations (y) scattered around the underlying function (g), plotted over x ∈ [-1, 1]]

g(x) = 0.5 + x + x² + 6x³
Model Sets (examples)
g(x) = 0.5 + x + x² + 6x³

M₁ = {a + bx}: linear
M₂ = {a + bx + cx²}: quadratic
M₃ = {a + bx + cx² + dx³}: cubic
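To make the model-set comparison concrete, here is a small sketch that draws noisy samples from g(x) = 0.5 + x + x² + 6x³ and fits each of the three families; the sample size and noise level are my own choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # True underlying function from the slide.
    return 0.5 + x + x**2 + 6 * x**3

# Noisy observations on [-1, 1]; the noise level is an assumption.
x = rng.uniform(-1, 1, size=20)
y = g(x) + rng.normal(scale=1.0, size=x.shape)

# Fit the model sets M1 (linear), M2 (quadratic), M3 (cubic).
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"M{degree}: training MSE = {mse:.3f}")
```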
Idealized regression
[Figure: the true function g(x), the model set (our hypothesis set), the best approximation f_opt(x) within it, and the error between g(x) and f_opt(x)]
Find an appropriate model family, then find the f(x) with minimum "distance" to g(x) (the "error").
How do we measure "distance"?
• Q: What is the distance (difference) between functions f and g?
Margin Slack Variable
For an example (xᵢ, yᵢ) and a function f, the margin slack variable is

ξᵢ = max(0, |yᵢ − f(xᵢ)| − (θ − γ))

θ: target accuracy in testing
γ: difference between the target accuracy and the margin in training
ε-Insensitive Loss Function
Let ε = θ − γ. The margin slack variable becomes
ξᵢ = max(0, |yᵢ − f(xᵢ)| − ε)

Linear ε-insensitive loss:
L(y) = |y − f(x)|_ε = max(0, |y − f(x)| − ε)

Quadratic ε-insensitive loss:
L(y) = |y − f(x)|²_ε = (max(0, |y − f(x)| − ε))²
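Both losses are one-liners in code; a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def linear_eps_insensitive(y, f_x, eps):
    # max(0, |y - f(x)| - eps): zero inside the eps-tube, grows linearly outside.
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def quadratic_eps_insensitive(y, f_x, eps):
    # (max(0, |y - f(x)| - eps))^2: zero inside the tube, grows quadratically outside.
    return np.maximum(0.0, np.abs(y - f_x) - eps) ** 2

# A residual of 0.3 with eps = 0.1 costs 0.2 (linear) or 0.04 (quadratic).
print(linear_eps_insensitive(1.3, 1.0, 0.1), quadratic_eps_insensitive(1.3, 1.0, 0.1))
```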
Linear ε-Insensitive Loss ⇒ a Linear SV Machine
[Figure: the ε-tube around the regression function; points outside the tube incur slacks ξ, ξ*, measured from the residual yᵢ − ⟨w, xᵢ⟩]
Basic Idea of SV Regression
Starting point: we have input data X = {(x₁, y₁), ..., (x_N, y_N)}
Goal:
We want to find a robust function f(x) that has at most ε deviation from the targets y, while at the same time being as flat as possible.
Idea: Simple Regression Problem + Optimization + Kernel Trick
Thus setting: f(x) = ⟨w, x⟩ + b, with w ∈ X, b ∈ ℝ

Primal Regression Problem

min  ½ ||w||²
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε
Linear ε-Insensitive Loss Regression

min  ½ ||w||² + C Σᵢ (ξᵢ + ξᵢ*)
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0

ε decides the width of the insensitive zone
C sets a trade-off between the training error and ||w||
ε and C must be tuned simultaneously
Regression is more difficult than classification?
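The same optimization is available off the shelf; a sketch using scikit-learn's SVR with a linear kernel, where C and epsilon map directly onto the trade-off above (data and hyperparameter values are assumptions):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy 1-D data: a noisy line.
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X.ravel() + 0.5 + rng.normal(scale=0.1, size=50)

# epsilon sets the width of the insensitive zone; C trades training error against ||w||.
model = SVR(kernel="linear", C=10.0, epsilon=0.1)
model.fit(X, y)

print("w =", model.coef_.ravel(), " b =", model.intercept_)
print("support vectors:", len(model.support_), "of", len(y))
```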
Parameters used in SV Regression
Dual Formulation
• The Lagrangian function will help us formulate the dual problem.
• ε: insensitive-loss parameter; βᵢ, βᵢ*: Lagrange multipliers
  ξᵢ: difference value for points above the ε-band
  ξᵢ*: difference value for points below the ε-band
The Lagrangian (all sums run over i = 1, ..., l):

L = ½ ||w||² + C Σᵢ (ξᵢ + ξᵢ*)
    − Σᵢ βᵢ (ε + ξᵢ − yᵢ + ⟨w, xᵢ⟩ + b)
    − Σᵢ βᵢ* (ε + ξᵢ* + yᵢ − ⟨w, xᵢ⟩ − b)
    − Σᵢ (ηᵢ ξᵢ + ηᵢ* ξᵢ*)

• Optimality Conditions
  ∂L/∂w = w − Σᵢ (βᵢ − βᵢ*) xᵢ = 0
  ∂L/∂b = Σᵢ (βᵢ* − βᵢ) = 0
  ∂L/∂ξᵢ = C − βᵢ − ηᵢ = 0
  ∂L/∂ξᵢ* = C − βᵢ* − ηᵢ* = 0
Dual Formulation (Cont'd)
• Dual Problem

max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) ⟨xᵢ, xⱼ⟩ − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ∈ [0, C]

• Solving

w = Σᵢ (βᵢ − βᵢ*) xᵢ,  and  f(x) = Σᵢ (βᵢ − βᵢ*) ⟨xᵢ, x⟩ + b
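A quick numerical check of this expansion (a sketch on made-up data; scikit-learn's SVR stores βᵢ − βᵢ* for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X).ravel() + rng.normal(scale=0.05, size=40)

model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)

beta_diff = model.dual_coef_.ravel()     # (beta_i - beta_i*) for each support vector
sv = model.support_vectors_

# f(x) = sum_i (beta_i - beta_i*) <x_i, x> + b, evaluated on the training inputs.
f = (X @ sv.T) @ beta_diff + model.intercept_
print("matches model.predict:", np.allclose(f, model.predict(X)))
```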
KKT Optimality Conditions and b
• KKT optimality conditions:
  βᵢ (ε + ξᵢ − yᵢ + ⟨w, xᵢ⟩ + b) = 0
  βᵢ* (ε + ξᵢ* + yᵢ − ⟨w, xᵢ⟩ − b) = 0
  (C − βᵢ) ξᵢ = 0
  (C − βᵢ*) ξᵢ* = 0
• b can be computed as follows:
  b = yᵢ − ⟨w, xᵢ⟩ − ε   for βᵢ ∈ (0, C)
  b = yᵢ − ⟨w, xᵢ⟩ + ε   for βᵢ* ∈ (0, C)
This means that the Lagrange multipliers will only be non-zero for points outside the band. Thus these points are the support vectors
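This sparsity property is easy to check empirically: with scikit-learn's SVR as the solver (a sketch on synthetic data), every point that is not a support vector should lie inside (or on) the ε-band.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(60, 1))
y = 1.5 * X.ravel() + rng.normal(scale=0.2, size=60)

eps = 0.2
model = SVR(kernel="linear", C=1.0, epsilon=eps).fit(X, y)

residuals = np.abs(y - model.predict(X))
non_sv = np.setdiff1d(np.arange(len(y)), model.support_)

# Non-support-vectors have zero multipliers, so they sit inside (or on) the eps-band.
print("max |residual| over non-SVs:", residuals[non_sv].max(), "(eps =", eps, ")")
```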
The Idea of SVM
• input space → feature space
Kernel Version
Why can we use a kernel?
The complexity of a function's representation depends only on the number of SVs, and the complete algorithm can be described in terms of inner products.
An implicit mapping to the feature space
Mapping via Kernel
w = Σᵢ (βᵢ − βᵢ*) φ(xᵢ)

max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) k(xᵢ, xⱼ) − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ∈ [0, C]
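A sketch of the kernelized version using scikit-learn's SVR with an RBF kernel, where k(xᵢ, xⱼ) replaces the inner product and the feature map φ is never formed explicitly (data and hyperparameters are my own choices):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=80)

# RBF kernel: k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); the feature map stays implicit.
model = SVR(kernel="rbf", gamma=1.0, C=10.0, epsilon=0.05).fit(X, y)
print("support vectors used:", len(model.support_), "of", len(y))
```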
Quadratic ε-Insensitive Loss Regression

Problem: min  ½ ||w||² + C Σᵢ (ξᵢ² + ξᵢ*²)
subject to
  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ
  ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0

Kernel Formulation:
max  −½ Σᵢ Σⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) (k(xᵢ, xⱼ) + δᵢⱼ/C) − ε Σᵢ (βᵢ + βᵢ*) + Σᵢ yᵢ (βᵢ − βᵢ*)
subject to  Σᵢ (βᵢ − βᵢ*) = 0,  βᵢ, βᵢ* ≥ 0
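scikit-learn does not expose a kernelized quadratic-loss SVR directly, but its LinearSVR supports a squared ε-insensitive loss, which gives a primal-space sketch of the formulation above (data and parameter values are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X.ravel() - 1.0 + rng.normal(scale=0.3, size=100)

# 'squared_epsilon_insensitive' penalizes the slack variables quadratically.
model = LinearSVR(loss="squared_epsilon_insensitive", C=1.0, epsilon=0.1, max_iter=10_000)
model.fit(X, y)
print("w =", model.coef_, " b =", model.intercept_)
```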
Kernel Ridge Regression & Gaussian Processes
ε = 0 ⇒ least-squares linear regression; the weight-decay factor is controlled by C (λ ~ 1/C)

min  λ ||w||² + Σᵢ ξᵢ²
subject to  yᵢ − ⟨w, xᵢ⟩ = ξᵢ

Kernel Formulation (I: identity matrix):
f(x) = ⟨w, x⟩ = y′ (K + λI)⁻¹ k
which is also the mean of a Gaussian distribution.
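The closed-form solution above is a few lines of NumPy; this sketch uses an RBF kernel (my choice) for k and K:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows in A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.cos(2 * X).ravel() + rng.normal(scale=0.05, size=30)

lam = 0.1                                              # ridge parameter lambda (~ 1/C)
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lambda*I)^{-1} y

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
k = rbf_kernel(X_test, X)                              # kernel between test and training points
f_test = k @ alpha                                     # f(x) = y'(K + lambda*I)^{-1} k
print(f_test)
```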
Architecture of SV Regression Machine
similar to regression in a three-layered neural network!?
Conclusion
SVM is a useful alternative to neural networkTwo key concepts of SVM• optimization• kernel trickAdvantages of SV Regression• Represent solution by a small subset of training
points• Ensure the existence of global minimum• Ensure the optimization of a reliable eneralization
bound
Discussion 1: Influence of the insensitivity band on regression quality
17 measured training data points are used. Left: ε = 0.1, 15 SVs are chosen. Right: ε = 0.5, the 6 chosen SVs produce a much better regression function.
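A sketch of the same experiment with scikit-learn's SVR (on synthetic data, so the exact SV counts from the slide will not be reproduced): widening ε reduces the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(17, 1))          # 17 training points, echoing the slide
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=17)

for eps in (0.1, 0.5):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon = {eps}: {len(model.support_)} support vectors")
```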
Discussion 2: ε-Insensitive Loss
• Enables sparseness among the SVs, but does it guarantee sparseness?
• Robust (to small changes in the data/model)
• Less sensitive to outliers