CHAPTER 10: Linear Discrimination
Eick/Alpaydin: Topic 13
Lecture Notes for E. Alpaydın, 2004, Introduction to Machine Learning © The MIT Press (V1.1)
Likelihood- vs. Discriminant-based Classification
Likelihood-based: assume a model for p(x|C_i); use Bayes' rule to calculate P(C_i|x)
  g_i(x) = log P(C_i|x)
Discriminant-based: assume a model for g_i(x|Φ_i); no density estimation
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
Linear Discriminant
Linear discriminant:
  g_i(x | w_i, w_i0) = w_i^T x + w_i0 = Σ_{j=1}^d w_ij x_j + w_i0
Advantages:
  Simple: O(d) space/computation
  Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
  Optimal when p(x|C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
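A minimal sketch of evaluating such a linear discriminant in plain Python; the weight values are made-up illustration numbers, not taken from the text:

```python
# Linear discriminant: g_i(x) = w_i^T x + w_i0.
# The weight vector w and bias w0 below are hypothetical illustration values.

def linear_discriminant(w, w0, x):
    """Return g(x) = sum_j w[j]*x[j] + w0 -- O(d) time and space."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

# Credit-scoring flavour: positive weights favour the class, negative ones oppose it.
w, w0 = [0.5, -0.25], 1.0
print(linear_discriminant(w, w0, [2.0, 4.0]))  # 0.5*2 - 0.25*4 + 1 = 1.0
```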
Generalized Linear Model
Quadratic discriminant:
  g_i(x | W_i, w_i, w_i0) = x^T W_i x + w_i^T x + w_i0
Higher-order (product) terms, e.g.:
  z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
  g_i(x) = Σ_{j=1}^k w_ij φ_ij(x)
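A short sketch of the product-term mapping above: transform (x1, x2) into z-space and apply an ordinary linear discriminant there. The weights are hypothetical values chosen so the resulting boundary in x-space is a circle:

```python
# Map x = (x1, x2) to z-space with the product terms from the slide:
# z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, z5 = x1*x2.

def phi(x1, x2):
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]

def g(x1, x2, w, w0):
    # Linear in z-space => quadratic decision boundary in x-space.
    return sum(wj * zj for wj, zj in zip(w, phi(x1, x2))) + w0

# Hypothetical weights realizing g(x) = x1^2 + x2^2 - 1 (a circular boundary).
w, w0 = [0.0, 0.0, 1.0, 1.0, 0.0], -1.0
print(g(2.0, 0.0, w, w0))  # 4 - 1 = 3.0 -> positive side of the boundary
```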
Two Classes
  g(x) = g_1(x) − g_2(x)
       = (w_1^T x + w_10) − (w_2^T x + w_20)
       = (w_1 − w_2)^T x + (w_10 − w_20)
       = w^T x + w_0
Choose C_1 if g(x) > 0, and C_2 otherwise
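The two-class rule above, sketched directly from the difference of the two discriminants; all weight values are hypothetical:

```python
# Two-class rule: g(x) = g1(x) - g2(x) = (w1 - w2)^T x + (w10 - w20) = w^T x + w0;
# choose C1 if g(x) > 0, otherwise C2. All numbers are illustration values.

def choose_class(w1, w10, w2, w20, x):
    w = [a - b for a, b in zip(w1, w2)]   # combined weight vector w1 - w2
    w0 = w10 - w20                        # combined bias
    gx = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return "C1" if gx > 0 else "C2"

print(choose_class([1.0, 0.0], 0.0, [0.0, 1.0], 0.0, [2.0, 1.0]))  # C1
```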
Geometry
Support Vector Machines
One possible solution: hyperplane B1
Another possible solution: hyperplane B2
Other possible solutions exist as well
Which one is better, B1 or B2? How do you define "better"?
Find the hyperplane that maximizes the margin ⇒ B1 is better than B2
[Figure: two separating hyperplanes B1 and B2 with their margin boundaries b11/b12 and b21/b22; the margin is the distance between a hyperplane's two margin boundaries]
Support Vector Machines
Hyperplane B1:
  w · x + b = 0
Margin boundaries b11 and b12:
  w · x + b = +1 and w · x + b = −1
  f(x) = +1 if w · x + b ≥ 1
  f(x) = −1 if w · x + b ≤ −1
  Margin = 2 / ||w||^2
Examples are:
  (x1, .., xn, y) with y ∈ {−1, 1}
Support Vector Machines
We want to maximize:
  Margin = 2 / ||w||^2
Which is equivalent to minimizing:
  L(w) = ||w||^2 / 2
But subject to the following N constraints:
  y_i (w · x_i + b) ≥ 1,  i = 1, .., N
This is a constrained convex quadratic optimization problem that can be solved in polynomial time
Numerical approaches to solve it (e.g., quadratic programming) exist
The function to be optimized has only a single minimum ⇒ no local minimum problem
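A tiny check of this formulation on a toy 1-D data set where the optimum is known by symmetry; the data points are made up for illustration:

```python
# Toy check of the hard-margin formulation on a 1-D, two-point problem:
# x = +1 with y = +1 and x = -1 with y = -1. By symmetry the optimal
# hyperplane is w = 1, b = 0; both constraints y_i(w*x_i + b) >= 1 are tight.

def objective(w):
    return 0.5 * w * w            # L(w) = ||w||^2 / 2

def feasible(w, b, data):
    return all(y * (w * x + b) >= 1 for x, y in data)

data = [(1.0, 1), (-1.0, -1)]
w, b = 1.0, 0.0
print(feasible(w, b, data), objective(w))   # True 0.5
# Any smaller |w| violates a constraint, so w = 1 is the minimizer:
print(feasible(0.9, 0.0, data))             # False
```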
Support Vector Machines
What if the problem is not linearly separable?

Linear SVM for Non-linearly Separable Problems
Introduce slack variables ξ_i. Need to minimize:
  L(w) = ||w||^2 / 2 + C Σ_{i=1}^N ξ_i^k
where ||w||^2 / 2 is the inverse size of the margin between the hyperplanes, the second term measures the prediction error, C is a parameter, and ξ_i is a slack variable that allows constraint violation to a certain degree
Subject to (i = 1, .., N):
  (1) y_i (w · x_i + b) ≥ 1 − ξ_i
  (2) ξ_i ≥ 0
C is chosen using a validation set, trying to keep the margins wide while keeping the training error low.
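A sketch of evaluating this soft-margin objective for given w, b (with k = 1); the data set and C are hypothetical illustration values:

```python
# Soft-margin objective L(w) = ||w||^2/2 + C * sum_i xi_i^k (here k = 1),
# with slack xi_i = max(0, 1 - y_i*(w.x_i + b)) measuring how far example i
# violates its margin constraint. Data and C are illustration values.

def soft_margin_objective(w, b, data, C):
    slack = [max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
             for x, y in data]
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slack)

data = [([2.0], 1), ([0.5], 1), ([-2.0], -1)]   # middle point sits inside the margin
print(soft_margin_objective([1.0], 0.0, data, C=1.0))  # 0.5 + 1.0*0.5 = 1.0
```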
Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Alternative 1: use a technique that employs non-linear decision boundaries (a non-linear function)
Alternative 2: transform into a higher-dimensional attribute space and find linear decision boundaries in this space:
  1. Transform the data into a higher-dimensional space
  2. Find the best hyperplane using the methods introduced earlier
Nonlinear Support Vector Machines
1. Choose a non-linear kernel function Φ to transform into a different, usually higher-dimensional, attribute space
2. Minimize
  L(w) = ||w||^2 / 2
but subject to the following N constraints:
  y_i (w · Φ(x_i) + b) ≥ 1,  i = 1, .., N
Find a good hyperplane in the transformed space
Example: Polynomial Kernel Function
Polynomial kernel function:
  Φ(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
  K(u, v) = Φ(u) · Φ(v) = (u · v + 1)^2
A support vector machine with polynomial kernel function classifies a new example z as follows:
  sign(Σ_i λ_i y_i Φ(x_i) · Φ(z) + b) = sign(Σ_i λ_i y_i (x_i · z + 1)^2 + b)
Remark: the λ_i and b are determined using the methods for linear SVMs that were discussed earlier
Kernel function trick: perform computations in the original space, although we solve an optimization problem in the transformed space ⇒ more efficient. More details: Topic 14.
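The kernel trick above can be verified numerically: the explicit inner product in the transformed space equals the kernel computed in the original space. The test points are arbitrary illustration values:

```python
# Kernel trick check: with Phi(x1,x2) = (x1^2, x2^2, sqrt(2)x1x2, sqrt(2)x1, sqrt(2)x2, 1),
# the inner product Phi(u).Phi(v) equals (u.v + 1)^2 computed in the original space.
from math import sqrt, isclose

def phi(x1, x2):
    return [x1 * x1, x2 * x2, sqrt(2) * x1 * x2, sqrt(2) * x1, sqrt(2) * x2, 1.0]

def K(u, v):
    # Kernel evaluated in the ORIGINAL 2-D space -- no explicit transformation.
    return (u[0] * v[0] + u[1] * v[1] + 1.0) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
lhs = sum(a * b for a, b in zip(phi(*u), phi(*v)))   # inner product in z-space
print(isclose(lhs, K(u, v)))  # True: (1*3 + 2*(-1) + 1)^2 = 4 on both sides
```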
Summary: Support Vector Machines
Support vector machines learn hyperplanes that separate two classes while maximizing the margin between them (the empty space between the instances of the two classes).
Support vector machines introduce slack variables in case the classes are not linearly separable, trying to maximize margins while keeping the training error low.
The most popular versions of SVMs use non-linear kernel functions to map the attribute space into a higher-dimensional space to facilitate finding "good" linear decision boundaries in the modified space.
Support vector machines find "margin optimal" hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and support vector machines tend to fail if the number of examples goes beyond 500/2000/5000…
In general, support vector machines achieve quite high accuracies compared to other techniques.
Useful Support Vector Machine Links
Lecture notes are much more helpful to understand the basic ideas:
  http://www.ics.uci.edu/~welling/teaching/KernelsICS273B/Kernels.html
  http://cerium.raunvis.hi.is/~tpr/courseware/svm/kraekjur.html
Some tools that are often used in publications:
  libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  spider: http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html
Tutorial Slides: http://www.support-vector.net/icml-tutorial.pdf
Surveys: http://www.svms.org/survey/Camp00.pdf
More General Material:
  http://www.learning-with-kernels.org/
  http://www.kernel-machines.org/
  http://kernel-machines.org/publications.html
  http://www.support-vector.net/tutorial.html
Remark: Thanks to Chaofan Sun for providing these links!
Optimal Separating Hyperplane
(Cortes and Vapnik, 1995; Vapnik, 1995)
X = {x^t, r^t} where r^t = +1 if x^t ∈ C_1 and r^t = −1 if x^t ∈ C_2
Find w and w_0 such that
  w^T x^t + w_0 ≥ +1 for r^t = +1
  w^T x^t + w_0 ≤ −1 for r^t = −1
which can be rewritten as
  r^t (w^T x^t + w_0) ≥ +1
Alpaydin transparencies on Support Vector Machines—not used in lecture!
Margin
Distance from the discriminant to the closest instances on either side
Distance of x^t to the hyperplane is
  |w^T x^t + w_0| / ||w||
We require
  r^t (w^T x^t + w_0) / ||w|| ≥ ρ, ∀t
For a unique solution, fix ρ||w|| = 1, and to maximize the margin:
  min ½ ||w||^2 subject to r^t (w^T x^t + w_0) ≥ +1, ∀t
min ½ ||w||^2 subject to r^t (w^T x^t + w_0) ≥ +1, ∀t

Unconstrained primal Lagrangian (with multipliers α^t ≥ 0):
  L_p = ½ ||w||^2 − Σ_{t=1}^N α^t [r^t (w^T x^t + w_0) − 1]
      = ½ ||w||^2 − Σ_{t=1}^N α^t r^t (w^T x^t + w_0) + Σ_{t=1}^N α^t
Setting the derivatives to zero:
  ∂L_p/∂w = 0 ⇒ w = Σ_{t=1}^N α^t r^t x^t
  ∂L_p/∂w_0 = 0 ⇒ Σ_{t=1}^N α^t r^t = 0
Most α^t are 0 and only a small number have α^t > 0; they are the support vectors

Substituting w = Σ_t α^t r^t x^t back into L_p gives the dual:
  L_d = ½ w^T w − w^T Σ_t α^t r^t x^t − w_0 Σ_t α^t r^t + Σ_t α^t
      = −½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t
which is maximized subject to Σ_t α^t r^t = 0 and α^t ≥ 0, ∀t
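A sketch of recovering the primal solution from dual variables, using the 1-D toy set {(x = +1, r = +1), (x = −1, r = −1)}; the stated dual optimum α^1 = α^2 = 0.5 is an assumption that is easy to verify by hand for this symmetric problem:

```python
# Recover w = sum_t alpha^t r^t x^t from the dual solution, and identify the
# support vectors (alpha^t > 0). The alpha values below are the (hypothetical,
# hand-verifiable) dual optimum for this tiny symmetric data set.

alphas = [0.5, 0.5]
rs = [1, -1]
xs = [1.0, -1.0]

assert sum(a * r for a, r in zip(alphas, rs)) == 0          # dual constraint
w = sum(a * r * x for a, r, x in zip(alphas, rs, xs))
support_vectors = [x for a, x in zip(alphas, xs) if a > 0]
print(w, support_vectors)   # 1.0 [1.0, -1.0]
```

Here both training points are support vectors, and the recovered w = 1 matches the hard-margin solution of the same toy problem.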
Soft Margin Hyperplane
Not linearly separable: allow deviations
  r^t (w^T x^t + w_0) ≥ 1 − ξ^t, with slack variables ξ^t ≥ 0
Soft error: Σ_t ξ^t
New primal is
  L_p = ½ ||w||^2 + C Σ_t ξ^t − Σ_t α^t [r^t (w^T x^t + w_0) − 1 + ξ^t] − Σ_t μ^t ξ^t
Kernel Machines
Preprocess input x by basis functions:
  z = φ(x),  g(z) = w^T z
  g(x) = w^T φ(x)
The SVM solution:
  w = Σ_t α^t r^t φ(x^t)
  g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x) = Σ_t α^t r^t K(x^t, x)
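The kernel-machine discriminant above can be sketched as a sum over support vectors evaluated entirely through the kernel; the support vectors, α values, and labels are hypothetical illustration values:

```python
# Kernel-machine discriminant: g(x) = sum_t alpha^t r^t K(x^t, x) + w0,
# evaluated via the kernel only (no explicit phi). Data are illustration values.

def K(u, v):                       # polynomial kernel of degree 2
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** 2

def g(x, svs, alphas, rs, w0=0.0):
    return sum(a * r * K(sv, x) for sv, a, r in zip(svs, alphas, rs)) + w0

svs = [[1.0, 1.0], [-1.0, -1.0]]   # hypothetical support vectors
alphas = [0.25, 0.25]
rs = [1, -1]
print(g([2.0, 2.0], svs, alphas, rs))   # 0.25*25 - 0.25*9 = 4.0 -> positive side
```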
Kernel Functions
Polynomials of degree q:
  K(x^t, x) = (x^T x^t + 1)^q
For example, for q = 2 and two attributes (Cherkassky and Mulier, 1998):
  K(x, y) = (x^T y + 1)^2
          = (x_1 y_1 + x_2 y_2 + 1)^2
          = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
which corresponds to the basis expansion
  Φ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2]^T
Radial-basis functions:
  K(x^t, x) = exp(−||x − x^t||^2 / (2 s^2))
Sigmoidal functions:
  K(x^t, x) = tanh(2 x^T x^t + 1)
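The three kernel families above, sketched as plain functions and evaluated at a few sample points; the inputs and the RBF width s are illustration values:

```python
# Evaluating the three kernel families from the slide at sample points.
from math import exp, tanh, isclose

def poly_kernel(xt, x, q=2):
    return (sum(a * b for a, b in zip(xt, x)) + 1.0) ** q

def rbf_kernel(xt, x, s=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(xt, x))   # ||x - x^t||^2
    return exp(-sq / (2.0 * s * s))

def sigmoid_kernel(xt, x):
    return tanh(2.0 * sum(a * b for a, b in zip(xt, x)) + 1.0)

print(poly_kernel([1.0, 0.0], [1.0, 0.0]))        # (1 + 1)^2 = 4.0
print(isclose(rbf_kernel([0.0], [0.0]), 1.0))     # distance 0 -> kernel value 1
```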
SVM for Regression
Use a linear model (possibly kernelized):
  f(x) = w^T x + w_0
Use the ε-sensitive error function:
  e_ε(r^t, f(x^t)) = 0 if |r^t − f(x^t)| < ε
  e_ε(r^t, f(x^t)) = |r^t − f(x^t)| − ε otherwise
Introduce slack variables ξ_+^t, ξ_−^t ≥ 0 and solve:
  min ½ ||w||^2 + C Σ_t (ξ_+^t + ξ_−^t)
subject to
  r^t − (w^T x^t + w_0) ≤ ε + ξ_+^t
  (w^T x^t + w_0) − r^t ≤ ε + ξ_−^t
  ξ_+^t, ξ_−^t ≥ 0
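The ε-sensitive error above can be sketched directly: it is zero inside the tube of width ε around the prediction and grows linearly outside it. The numeric values are illustration values:

```python
# Epsilon-sensitive error from the slide: zero inside the tube of width
# epsilon around f(x^t), linear outside it.

def eps_loss(r, fx, eps):
    diff = abs(r - fx)
    return 0.0 if diff < eps else diff - eps

print(eps_loss(1.0, 1.1, eps=0.25))  # inside the tube -> 0.0
print(eps_loss(1.0, 1.5, eps=0.25))  # outside: 0.5 - 0.25 = 0.25
```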