
### Transcript of Linear Discrimination Centering on Support Vector Machines

1. CHAPTER 10: Linear Discrimination (Eick/Alpaydın: Topic 13)

2. Likelihood- vs. Discriminant-based Classification

• Likelihood-based: Assume a model for p(x|C_i); use Bayes' rule to calculate P(C_i|x)
• g_i(x) = log P(C_i|x)
• Discriminant-based: Assume a model for g_i(x|Φ_i); no density estimation
• Estimating the boundaries is enough; no need to accurately estimate the densities inside the boundaries

(Slides based on Lecture Notes for E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2004, V1.1.)

3. Linear Discriminant

• Linear discriminant: g_i(x | w_i, w_i0) = w_i^T x + w_i0
• Advantages:
• Simple: O(d) space/computation
• Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring; see the sketch after this list)
• Optimal when the p(x|C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
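To make the knowledge-extraction point concrete, here is a minimal sketch of a linear discriminant as a weighted sum of attributes; the attribute names, weights, and bias are illustrative, not from the slides:

```python
# A minimal sketch of a linear discriminant used for knowledge extraction
# (attribute names and weights are illustrative).
import numpy as np

attributes = ["income", "debt", "years_employed"]
w = np.array([0.8, -1.2, 0.5])   # sign shows direction of influence, magnitude its strength
w0 = -0.3

def g(x):
    """Linear discriminant: weighted sum of attributes plus bias."""
    return np.dot(w, x) + w0

x = np.array([1.0, 0.5, 2.0])    # one (standardized) applicant
print("score:", g(x), "-> class C1" if g(x) > 0 else "-> class C2")
```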

4. Generalized Linear Model

• Quadratic discriminant: g_i(x) = x^T W_i x + w_i^T x + w_i0
• Higher-order (product) terms, e.g. z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
• Map from x to z using nonlinear basis functions and use a linear discriminant in z-space: g_i(x) = Σ_j w_ij φ_ij(x) (see the sketch below)
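A minimal sketch of the idea, assuming a quadratic basis expansion and hand-picked weights (both illustrative): the discriminant is linear in z-space but quadratic in the original x-space.

```python
# A generalized linear discriminant: map x to z with nonlinear basis
# functions, then apply a linear discriminant in z-space.
import numpy as np

def phi(x):
    """Quadratic basis expansion of a 2-d input."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([0.0, 0.0, -1.0, -1.0, 0.0])   # weights in z-space
w0 = 1.0

def g(x):
    # Linear in z = phi(x), quadratic in x: here g(x) = 1 - x1^2 - x2^2,
    # a circular decision boundary.
    return np.dot(w, phi(x)) + w0

print(g(np.array([0.2, 0.3])))   # > 0: inside the circle
print(g(np.array([2.0, 0.0])))   # < 0: outside
```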

5. Two Classes

6. Geometry

7. Support Vector Machines

• One Possible Solution

8. Support Vector Machines

• Another possible solution

9. Support Vector Machines

• Other possible solutions

10. Support Vector Machines

• Which one is better? B1 or B2?
• How do you define better?

11. Support Vector Machines

• Find the hyperplane that maximizes the margin => B1 is better than B2

12. Support Vector Machines

• Examples are: (x1, ..., xn, y) with y ∈ {-1, 1}

13. Support Vector Machines

• We want to maximize: the margin 2 / ||w||
• Which is equivalent to minimizing: ||w||^2 / 2
• But subject to the following N constraints: y_i (w · x_i + b) ≥ 1, for i = 1, ..., N
• This is a constrained convex quadratic optimization problem that can be solved in polynomial time
• Numerical approaches to solve it (e.g., quadratic programming) exist
• The function to be optimized has only a single minimum, so there is no local-minimum problem (see the sketch below)
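To make the quadratic program concrete, here is a minimal sketch that solves the hard-margin primal for a toy data set with SciPy's general-purpose SLSQP solver; the data and the choice of solver are illustrative (dedicated QP packages are normally used in practice):

```python
# Hard-margin SVM primal: minimize ||w||^2 / 2 subject to
# y_i (w . x_i + b) >= 1 for every training example.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data with labels y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as theta = [w_1, w_2, b].
def objective(theta):
    w = theta[:-1]
    return 0.5 * np.dot(w, w)          # minimize ||w||^2 / 2

# One inequality constraint per example: y_i (w . x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda th, xi=xi, yi=yi: yi * (th[:-1] @ xi + th[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

result = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = result.x[:-1], result.x[-1]
print("w =", w, "b =", b)              # maximum-margin separating hyperplane
print("margin =", 2.0 / np.linalg.norm(w))
```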

14. Support Vector Machines

• What if the problem is not linearly separable?

15. Linear SVM for Non-linearly Separable Problems

• What if the problem is not linearly separable?
• Introduce slack variables ξ_i ≥ 0 that allow each constraint to be violated to a certain degree
• Need to minimize: ||w||^2 / 2 + C Σ_{i=1..N} ξ_i (the first term is the inverse of the margin width between the hyperplanes; the second measures the prediction error)
• Subject to (i = 1, ..., N): y_i (w · x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0
• The parameter C is chosen using a validation set, trying to keep the margin wide while keeping the training error low (see the sketch below)
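A minimal sketch of the role of C, using scikit-learn's linear SVC on an illustrative, non-separable data set: small C favors a wide margin at the cost of training errors, large C the opposite.

```python
# Effect of C on a soft-margin linear SVM (data and values illustrative).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly linearly separable.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"training error = {1 - clf.score(X, y):.3f}")
# Small C -> wide margin, more violations tolerated;
# large C -> narrow margin, fewer violations.
```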

16. Nonlinear Support Vector Machines

• What if the decision boundary is not linear?

Alternative 1: Use a technique that directly employs non-linear decision boundaries (learns a non-linear function)

17. Nonlinear Support Vector Machines

• Transform the data into a higher-dimensional attribute space
• Find the best hyperplane in this space using the methods introduced earlier (Alternative 2: transform, then find linear decision boundaries in the transformed space; see the sketch below)
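A minimal sketch of this alternative, assuming an illustrative one-dimensional data set and the hand-picked mapping z = (x, x^2):

```python
# Data that is not linearly separable in 1-d becomes separable after
# mapping to a higher-dimensional space.
import numpy as np
from sklearn.svm import LinearSVC

# Class +1 sits between the two groups of class -1:
# no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# Map x -> z = (x, x^2); in z-space a threshold on x^2 separates the classes.
Z = np.column_stack([x, x ** 2])
clf = LinearSVC(C=1e6).fit(Z, y)       # large C approximates a hard margin
print("training accuracy in z-space:", clf.score(Z, y))  # 1.0
```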

18. Nonlinear Support Vector Machines

• Choose a non-linear kernel function (equivalently, a mapping φ) to transform into a different, usually higher-dimensional, attribute space
• Minimize: ||w||^2 / 2 + C Σ_{i=1..N} ξ_i
• but subject to the following N constraints: y_i (w · φ(x_i) + b) ≥ 1 - ξ_i, for i = 1, ..., N

Find a good hyperplane in the transformed space.

19. Example: Polynomial Kernel Function

• Polynomial kernel function: K(u, v) = (u · v + 1)^2
• φ(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
• K(u, v) = φ(u) · φ(v) = (u · v + 1)^2
• A support vector machine with a polynomial kernel function classifies a new example z as follows:
• sign((Σ_i λ_i y_i φ(x_i) · φ(z)) + b) =
• sign((Σ_i λ_i y_i (x_i · z + 1)^2) + b)
• Remark: the λ_i and b are determined using the methods for linear SVMs that were discussed earlier (see the sketch below)
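A minimal numeric check of this identity (the vectors u and v are arbitrary illustrative values):

```python
# The explicit feature map phi and the kernel K give the same inner product.
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def K(u, v):
    """Polynomial kernel computed in the original 2-d space."""
    return (np.dot(u, v) + 1.0) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
print(np.dot(phi(u), phi(v)))   # 4.0
print(K(u, v))                  # 4.0: same value, without ever forming phi
```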

Kernel function trick: perform the computations in the original space even though we solve an optimization problem in the transformed space; this is more efficient. More details in Topic 14.

20. Summary Support Vector Machines

• Support vector machines learn hyperplanes that separate two classes while maximizing the margin between them (the empty space between the instances of the two classes).
• Support vector machines introduce slack variables in the case that the classes are not linearly separable, trying to maximize the margin while keeping the training error low.
• The most popular versions of SVMs use non-linear kernel functions to map the attribute space into a higher-dimensional space, to facilitate finding good linear decision boundaries in the modified space.
• Support vector machines find margin-optimal hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and training tends to become impractical as the number of examples grows into the thousands.
• In general, support vector machines achieve quite high accuracies compared to other techniques.

21. Useful Support Vector Machine Links

Lecture notes that are helpful for understanding the basic ideas:
• http://www.ics.uci.edu/~welling/teaching/KernelsICS273B/Kernels.html
• http://cerium.raunvis.hi.is/~tpr/courseware/svm/kraekjur.html

Tools often used in publications:
• libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• spider: http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html

Tutorial slides: http://www.support-vector.net/icml-tutorial.pdf
Surveys: http://www.svms.org/survey/Camp00.pdf
More general material:
• http://www.learning-with-kernels.org/
• http://www.kernel-machines.org/
• http://kernel-machines.org/publications.html
• http://www.support-vector.net/tutorial.html

Remarks: Thanks to Chaofan Sun for providing these links!

22. Optimal Separating Hyperplane (Cortes and Vapnik, 1995; Vapnik, 1995)

Note: The following Alpaydın transparencies on support vector machines were not used in the lecture.

23. Margin

• Distance from the discriminant to the closest instances on either side
• Distance of x to the hyperplane is |w^T x + w_0| / ||w||
• We require r^t (w^T x^t + w_0) / ||w|| ≥ ρ, for all t
• For a unique solution, fix ρ||w|| = 1; then, to maximize the margin, minimize ||w||^2 / 2 (a worked restatement follows below)
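Collecting the bullets above into one statement (a sketch of the standard formulation, in Alpaydın's notation with labels r^t ∈ {-1, +1}):

```latex
% Margin requirement and the quadratic program obtained after
% fixing rho * ||w|| = 1:
\frac{r^t(\mathbf{w}^\top \mathbf{x}^t + w_0)}{\lVert\mathbf{w}\rVert} \ge \rho
\quad\Longrightarrow\quad
\min_{\mathbf{w},\,w_0} \frac{1}{2}\lVert\mathbf{w}\rVert^2
\ \ \text{s.t.}\ \ r^t(\mathbf{w}^\top \mathbf{x}^t + w_0) \ge 1,\ \forall t
```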

24.-26. (Equation-only slides.) Most α^t are 0 and only a small number have α^t > 0; the corresponding instances are the support vectors

27. Soft Margin Hyperplane

• Not linearly separable: require r^t (w^T x^t + w_0) ≥ 1 - ξ^t, with slack variables ξ^t ≥ 0
• Soft error: Σ_t ξ^t
• New primal is L_p = ||w||^2 / 2 + C Σ_t ξ^t, minimized subject to the relaxed constraints

28. Kernel Machines

• Preprocess input x by basis functions
• z = φ(x), g(z) = w^T z
• g(x) = w^T φ(x)
• The SVM solution: w = Σ_t α^t r^t φ(x^t), so g(x) = Σ_t α^t r^t K(x^t, x), where K(x^t, x) = φ(x^t)^T φ(x)

29. Kernel Functions

• Polynomials of degree q: K(x^t, x) = (x^T x^t + 1)^q
• Radial-basis functions: K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))
• Sigmoidal functions: K(x^t, x) = tanh(2 x^T x^t + 1) (Cherkassky and Mulier, 1998; see the sketch below)
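Minimal sketches of these three kernels in code; the parameter values q and s are illustrative defaults:

```python
# The three kernel functions listed above, for single input vectors.
import numpy as np

def poly_kernel(xt, x, q=2):
    """Polynomial kernel of degree q."""
    return (np.dot(x, xt) + 1.0) ** q

def rbf_kernel(xt, x, s=1.0):
    """Radial-basis (Gaussian) kernel with width s."""
    return np.exp(-np.linalg.norm(xt - x) ** 2 / (2.0 * s ** 2))

def sigmoid_kernel(xt, x):
    """Sigmoidal kernel."""
    return np.tanh(2.0 * np.dot(x, xt) + 1.0)

xt, x = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(poly_kernel(xt, x), rbf_kernel(xt, x), sigmoid_kernel(xt, x))
```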

30. SVM for Regression

• Use a linear model (possibly kernelized)
• f(x) = w^T x + w_0
• Use the ε-sensitive error function: e_ε(r^t, f(x^t)) = 0 if |r^t - f(x^t)| < ε, and |r^t - f(x^t)| - ε otherwise (a sketch follows below)
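A minimal sketch of the ε-sensitive error and a kernelized SVM regressor, using scikit-learn's SVR on an illustrative data set:

```python
# Epsilon-insensitive error: zero inside the tube, linear outside it.
import numpy as np
from sklearn.svm import SVR

def eps_insensitive(r, fx, eps=0.1):
    """Zero error inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(r - fx) - eps)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
r = np.sin(X).ravel() + rng.normal(0, 0.1, 60)   # noisy targets

model = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, r)
pred = model.predict(X)
print("mean eps-insensitive error:", eps_insensitive(r, pred).mean())
```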
