Linear Discrimination Centering on Support Vector Machines
1. CHAPTER 10: Linear Discrimination (Eick/Alpaydin: Topic 13)
2. Likelihood- vs. Discriminant-based Classification
- Likelihood-based: assume a model for p(x | C_i) and use Bayes' rule to calculate P(C_i | x)
- g_i(x) = log P(C_i | x)
- Discriminant-based: assume a model for g_i(x | Φ_i), where Φ_i are the discriminant's parameters; no density estimation
- Estimating the boundaries is enough; no need to accurately estimate the densities inside the boundaries
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
3. Linear Discriminant
- Linear discriminant: g_i(x | w_i, w_i0) = w_i^T x + w_i0 (a small sketch follows after this list)
- Advantages:
- Simple: O(d) space/computation
- Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
- Optimal when p(x | C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
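As an illustration that is not part of the original slides, here is a minimal sketch of evaluating a two-class linear discriminant in Python; the weight values are hypothetical placeholders:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g(x) = w^T x + w0 for a single example x."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Two-class rule: choose C1 if g(x) > 0, otherwise C2."""
    return 1 if linear_discriminant(x, w, w0) > 0 else -1

# Hypothetical weights, e.g. from a credit-scoring model:
# positive/negative signs and magnitudes are directly interpretable.
w = np.array([0.8, -0.3])
w0 = -0.5
print(classify(np.array([1.0, 0.2]), w, w0))   # g(x) = 0.24 > 0  ->  class 1
```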
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
4. Generalized Linear Model
- Quadratic discriminant: g_i(x) = x^T W_i x + w_i^T x + w_i0
- Higher-order (product) terms, e.g.: z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
- Map from x to z using nonlinear basis functions and use a linear discriminant in z-space (a sketch follows after this list)
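A small sketch, again not from the slides, showing the z-space idea with a quadratic basis expansion; the weights are hand-picked hypothetical values:

```python
import numpy as np

def quadratic_basis(x):
    """Map x = (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def generalized_linear_discriminant(x, w, w0):
    """g(x) = w^T phi(x) + w0: linear in z-space, nonlinear in x-space."""
    return np.dot(w, quadratic_basis(x)) + w0

# Hand-picked (hypothetical) weights realizing the circular boundary x1^2 + x2^2 = 1
w, w0 = np.array([0.0, 0.0, 1.0, 1.0, 0.0]), -1.0
print(generalized_linear_discriminant(np.array([0.5, 0.5]), w, w0))  # -0.5 (inside the circle)
```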
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
5. Two Classes
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
6. Geometry
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
7. Support Vector Machines
- One Possible Solution
8. Support Vector Machines
- Another possible solution
9. Support Vector Machines
- Other possible solutions
10. Support Vector Machines
- Which one is better? B1 or B2?
- How do you define better?
11. Support Vector Machines
- Find the hyperplane that maximizes the margin => B1 is better than B2
12. Support Vector Machines
- Examples are of the form (x_1, ..., x_n, y) with y ∈ {-1, 1}
13. Support Vector Machines
- We want to maximize the margin: 2 / ||w||
- Which is equivalent to minimizing: ||w||^2 / 2
- But subject to the following N constraints: y_i (w · x_i + b) ≥ 1 for i = 1, ..., N
- This is a constrained convex quadratic optimization problem that can be solved in polynomial time
- Numerical approaches to solve it (e.g., quadratic programming) exist (see the sketch after this list)
- The function to be optimized has only a single global minimum, so there is no local-minimum problem
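The following sketch is not from the slides; it approximates the hard-margin linear SVM with scikit-learn by using a very large C (the toy data and the choice C=1e6 are assumptions for illustration only):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data with labels y in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2 subject to y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```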
14. Support Vector Machines
- What if the problem is not linearly separable?
15. Linear SVM for Non-linearly Separable Problems
- What if the problem is not linearly separable?
- Introduce slack variables ξ_i ≥ 0; a slack variable allows a constraint violation to a certain degree
- Need to minimize: ||w||^2 / 2 + C Σ_i ξ_i, where ||w||^2 / 2 is the inverse of the size of the margin between the hyperplanes, Σ_i ξ_i measures the prediction error on the training set, and C is the parameter that trades the two off
- Subject to (i = 1, ..., N): y_i (w · x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
- C is chosen using a validation set, trying to keep the margins wide while keeping the training error low (see the sketch after this list)
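A minimal sketch (not from the slides) of choosing C on a held-out validation set; the dataset, the candidate values of C, and the 70/30 split are all arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy, slightly noisy two-class data (scikit-learn uses 0/1 labels internally)
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.05, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Small C favors a wide margin (more slack); large C favors low training error
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    acc = SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc
print("chosen C =", best_C, ", validation accuracy =", best_acc)
```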
16. Nonlinear Support Vector Machines
- What if decision boundary is not linear?
Alternative 1: Use a technique that employs non-linear decision boundaries (a non-linear function)
17. Nonlinear Support Vector Machines
- Transform data into higher dimensional space
- Find the best hyperplane using the methods introduced earlier
Alternative 2: Transform into a higher-dimensional attribute space and find linear decision boundaries in this space
18. Nonlinear Support Vector Machines
- Choose a non-linear kernel function Φ to transform into a different, usually higher dimensional, attribute space
- Minimize ||w||^2 / 2 in the transformed space,
- but subject to the following N constraints: y_i (w · Φ(x_i) + b) ≥ 1 for i = 1, ..., N
Find a good hyperplane in the transformed space
19. Example: Polynomial Kernel Function
- Polynomial kernel function: Φ(x_1, x_2) = (x_1^2, x_2^2, √2·x_1·x_2, √2·x_1, √2·x_2, 1)
- K(u, v) = Φ(u) · Φ(v) = (u · v + 1)^2
- A support vector machine with a polynomial kernel function classifies a new example z as follows:
- sign(Σ_i λ_i y_i Φ(x_i) · Φ(z) + b) =
- sign(Σ_i λ_i y_i (x_i · z + 1)^2 + b)
- Remark: the λ_i and b are determined using the methods for linear SVMs that were discussed earlier (a sketch that checks the kernel identity follows below)
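As a quick numerical check that is not part of the original slides, the sketch below verifies that the explicit mapping Φ above reproduces the kernel value (u · v + 1)^2 without ever leaving the original 2-D space:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(u, v):
    """K(u, v) = (u . v + 1)^2, computed in the original space."""
    return (np.dot(u, v) + 1.0) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(u), phi(v)))   # 4.0 -- dot product in the transformed space
print(poly_kernel(u, v))        # 4.0 -- same value from the kernel trick
```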
Kernel function trick: perform the computations in the original space, although we solve an optimization problem in the transformed space; this is more efficient. More details in Topic 14.
20. Summary: Support Vector Machines
- Support vector machines learn hyperplanes that separate two classes while maximizing the margin between them (the empty space between the instances of the two classes).
- If the classes are not linearly separable, support vector machines introduce slack variables and try to maximize the margin while keeping the training error low.
- The most popular versions of SVMs use non-linear kernel functions to map the attribute space into a higher dimensional space to facilitate finding good linear decision boundaries in the modified space.
- Support vector machines find margin-optimal hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and support vector machines tend to become impractical when the number of examples goes beyond roughly 500/2,000/5,000, depending on the implementation.
- In general, support vector machines achieve quite high accuracies compared to other techniques.
21. Useful Support Vector Machine Links
- Lecture notes that are much more helpful for understanding the basic ideas:
  http://www.ics.uci.edu/~welling/teaching/KernelsICS273B/Kernels.html
  http://cerium.raunvis.hi.is/~tpr/courseware/svm/kraekjur.html
- Tools that are often used in publications:
  libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  spider: http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html
- Tutorial slides: http://www.support-vector.net/icml-tutorial.pdf
- Surveys: http://www.svms.org/survey/Camp00.pdf
- More general material:
  http://www.learning-with-kernels.org/
  http://www.kernel-machines.org/
  http://kernel-machines.org/publications.html
  http://www.support-vector.net/tutorial.html
- Remark: Thanks to Chaofan Sun for providing these links!
22. Optimal Separating Hyperplane
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1) (Cortes and Vapnik, 1995; Vapnik, 1995)
Alpaydin transparencies on Support Vector Machines (not used in the lecture!)
23. Margin
- Distance from the discriminant to the closest instances on either side
- Distance of x to the hyperplane is |w^T x + w_0| / ||w|| (see the sketch after this list)
- We require y^t (w^T x^t + w_0) / ||w|| ≥ ρ for all t, where ρ is the margin
- For a unique solution, fix ρ ||w|| = 1; then, to maximize the margin: min ½ ||w||^2 subject to y^t (w^T x^t + w_0) ≥ +1 for all t
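A tiny sketch (not from the slides) of the distance computation; w and w_0 below are hypothetical values chosen so that ||w|| = 5:

```python
import numpy as np

def distance_to_hyperplane(x, w, w0):
    """Signed distance of x to the hyperplane w^T x + w0 = 0; its absolute value is the distance."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0          # hypothetical hyperplane, ||w|| = 5
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(x, w, w0))     # (6 + 4 - 5) / 5 = 1.0
```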
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
24. Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
25. Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
26. Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
- Most α^t are 0 and only a small number have α^t > 0; they are the support vectors
27. Soft Margin Hyperplane
- Not linearly separable: y^t (w^T x^t + w_0) ≥ 1 - ξ^t, with slack variables ξ^t ≥ 0
- Soft error: Σ_t ξ^t
- New primal is: L_p = ½ ||w||^2 + C Σ_t ξ^t - Σ_t α^t [y^t (w^T x^t + w_0) - 1 + ξ^t] - Σ_t μ^t ξ^t
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
28. Kernel Machines
- Preprocess input x by basis functions
- z = Φ(x), g(z) = w^T z
- g(x) = w^T Φ(x)
- The SVM solution: w = Σ_t α^t y^t Φ(x^t), so that g(x) = Σ_t α^t y^t Φ(x^t)^T Φ(x) = Σ_t α^t y^t K(x^t, x)
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)
29. Kernel Functions
- Polynomials of degree q: K(x^t, x) = (x^T x^t + 1)^q
- Radial-basis functions: K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))
- Sigmoidal functions: K(x^t, x) = tanh(2 x^T x^t + 1) (sketches of all three follow after this list)
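A small sketch, not from the slides, of the three kernel families written as plain functions; the parameter defaults (q = 2, s = 1) are arbitrary example values:

```python
import numpy as np

def polynomial_kernel(u, v, q=2):
    """K(u, v) = (u . v + 1)^q"""
    return (np.dot(u, v) + 1.0) ** q

def rbf_kernel(u, v, s=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 s^2))"""
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2.0 * s**2))

def sigmoid_kernel(u, v):
    """K(u, v) = tanh(2 u . v + 1)"""
    return np.tanh(2.0 * np.dot(u, v) + 1.0)

u, v = np.array([1.0, 0.5]), np.array([0.0, 2.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```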
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1) (Cherkassky and Mulier, 1998)
30. SVM for Regression
- Use a linear model (possibly kernelized)
- f(x) = w^T x + w_0
- Use the ε-sensitive error function: e_ε(r^t, f(x^t)) = 0 if |r^t - f(x^t)| < ε, and |r^t - f(x^t)| - ε otherwise (see the sketch after this list)
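A minimal sketch (not from the slides) of the ε-sensitive error for a single prediction; the value ε = 0.1 is an arbitrary example:

```python
def eps_sensitive_error(r, f_x, eps=0.1):
    """Zero error inside the epsilon tube around f(x); linear penalty outside it."""
    residual = abs(r - f_x)
    return 0.0 if residual < eps else residual - eps

print(eps_sensitive_error(1.0, 1.05))   # inside the tube  -> 0.0
print(eps_sensitive_error(1.0, 1.30))   # outside the tube -> 0.3 - 0.1 = 0.2
```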
Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, The MIT Press (V1.1)