

Learning in Indefiniteness

Purushottam Kar

Department of Computer Science and Engineering, Indian Institute of Technology Kanpur

August 2, 2010


Outline

1 A brief introduction to learning

2 Kernels - Definite and Indefinite

3 Using kernels as measures of distance
    Landmarking based approaches
    Approximate embeddings into Pseudo Euclidean spaces
    Exact embeddings into Banach spaces

4 Using kernels as measures of similarity
    Approximate embeddings into Pseudo Euclidean spaces
    Exact embeddings into Kreĭn spaces
    Landmarking based approaches

5 Conclusion


Outline

A Quiz


Learning

Learning 100


Learning

Learning as pattern recognition

Binary classification ✓
Multi-class classification
Multi-label classification
Regression
Clustering
Ranking
...


Learning

Binary classification

Learning Dichotomies from examples
Learning the distinction between a bird and a non-bird
Main approaches:
  - Generative (Bayesian classification)
  - Predictive
    - Feature Based
    - Kernel Based ✓

This talk: Kernel Based predictive approaches to binary classification


Learning

Probably Approximately Correct learning [Kearns and Vazirani, 1997]

Definition
A class of boolean functions F defined on a domain X is said to be PAC-learnable if there exists a class of boolean functions H defined on X, an algorithm A and a function S : R+ × R+ → N such that for all distributions µ defined on X, all t ∈ F, all ε, δ > 0: A, when given (x_i, t(x_i))_{i=1}^n, x_i ∈_R µ where n = S(1/ε, 1/δ), returns with probability (taken over the choice of x_1, . . . , x_n) greater than 1 − δ, a function h ∈ H such that

    Pr_{x ∈_R µ} [h(x) ≠ t(x)] ≤ ε.

t is the Target function, F the Concept Class
h is the Hypothesis, H the Hypothesis Class
S is the Sample Complexity of the algorithm A

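To make the definition concrete, here is a minimal sketch that PAC-learns threshold functions t(x) = sign(x − θ) on X = [0, 1] with an ERM-style rule; the sample size n = ⌈(1/ε) ln(1/δ)⌉ is the standard bound for this particular class and is an assumption of this illustration, not a claim taken from the slide above.

```python
import numpy as np

def learn_threshold(X, y):
    """ERM for thresholds h(x) = +1 iff x >= theta: place theta just above
    the largest example labelled -1 (predict all +1 if there is none)."""
    neg = X[y == -1]
    return neg.max() + 1e-12 if len(neg) > 0 else -np.inf

def pac_demo(theta_true=0.37, eps=0.05, delta=0.05, trials=200):
    rng = np.random.default_rng(0)
    n = int(np.ceil((1 / eps) * np.log(1 / delta)))   # assumed sample complexity for this class
    failures = 0
    for _ in range(trials):
        X = rng.random(n)                             # mu = uniform on [0, 1] (one possible distribution)
        y = np.where(X >= theta_true, 1, -1)          # labels supplied by the target t
        theta_hat = learn_threshold(X, y)
        X_test = rng.random(20000)                    # Monte-Carlo estimate of Pr[h(x) != t(x)]
        err = np.mean((X_test >= theta_hat) != (X_test >= theta_true))
        failures += err > eps
    print(f"n = {n}, runs with error > eps: {failures}/{trials} (should be roughly <= {delta * trials:.0f})")

pac_demo()
```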

Learning

Limitations of PAC learning

Most interesting function classes are not PAC learnable with polynomial sample complexities, e.g. Regular Languages
Adversarial combinations of target functions and distributions can make learning impossible
Weaker notions of learning
  - Weak-PAC learning - require only that ε be bounded away from 1/2
  - Restrict oneself to benign distributions (uniform, mixture of Gaussians)
  - Restrict oneself to benign learning scenarios (target function-distribution pairs that are benign) ✓
  - Vaguely defined in literature


Learning

Weak∗-Probably Approximately Correct learning

Definition
A class of boolean functions F defined on a domain X is said to be weak∗-PAC-learnable if for every t ∈ F and distribution µ defined on X, there exists a class of boolean functions H defined on X, an algorithm A and a function S : R+ × R+ → N such that for all ε, δ > 0: A, when given (x_i, t(x_i))_{i=1}^n, x_i ∈_R µ where n = S(1/ε, 1/δ), returns with probability (taken over the choice of x_1, . . . , x_n) greater than 1 − δ, a function h ∈ H such that

    Pr_{x ∈_R µ} [h(x) ≠ t(x)] ≤ ε.


Kernels

Kernels


Kernels

Kernels

Definition
Given a non-empty set X, a symmetric real-valued (resp. Hermitian complex-valued) function f : X × X → R (resp. f : X × X → C) is called a kernel.

All notions of (symmetric) distances, similarities are kernels
Alternatively kernels can be thought of as measures of similarity or distance


Kernels

Definiteness

Definition
A matrix A ∈ R^{n×n} is said to be positive definite if ∀c ∈ R^n, c ≠ 0, c⊤Ac > 0.

Definition
A kernel K defined on a domain X is said to be positive definite if ∀n ∈ N, ∀x_1, . . . , x_n ∈ X, the matrix G = (G_ij) = (K(x_i, x_j)) is positive definite. Alternatively, for every g ∈ L²(X), ∫∫_X g(x)g(x′)K(x, x′) dx dx′ ≥ 0.

Definition
A kernel K is said to be indefinite if it is neither positive definite nor negative definite.

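A quick numerical illustration of the matrix-level definition (a sketch; the kernels and the sample below are arbitrary choices): build the Gram matrix G = (K(x_i, x_j)) for a sample and inspect the signs of its eigenvalues.

```python
import numpy as np

def gram(K, X):
    """Gram matrix G_ij = K(x_i, x_j) for the points in X."""
    return np.array([[K(xi, xj) for xj in X] for xi in X])

def definiteness_on_sample(K, X, tol=1e-10):
    """Inspect eigenvalue signs of the Gram matrix on one sample; a single sample
    can certify indefiniteness but never definiteness over the whole domain."""
    ev = np.linalg.eigvalsh(gram(K, X))
    if np.all(ev > tol):
        return "positive definite on this sample"
    if np.all(ev < -tol):
        return "negative definite on this sample"
    return "indefinite (or singular) on this sample"

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))      # a PD kernel
neg_sqdist = lambda x, y: -np.sum((x - y) ** 2)       # conditionally PD, not PD
print(definiteness_on_sample(rbf, X), "|", definiteness_on_sample(neg_sqdist, X))
```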

Kernels

The Kernel Trick

All PD Kernels turn out to be inner products in some Hilbert space
Thus, any algorithm that only takes as input pairwise inner products can be made to implicitly work in such spaces
Results known as Representer Theorems keep any Curses of dimensionality at bay
...
Testing the Mercer condition difficult
Indefinite kernels known to give good performance
Ability to use indefinite kernels increases the scope of learning-the-kernel algorithms
Learning paradigm somewhere between PAC and weak∗-PAC

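One illustration of the last few points (a sketch, assuming the well-known sigmoid kernel tanh(a⟨x, y⟩ + b), which in general does not satisfy the Mercer condition): its Gram matrices typically carry both positive and negative eigenvalues, yet the kernel is still routinely plugged into kernel algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))

a, b = 1.0, -1.0                     # illustrative hyper-parameters
G = np.tanh(a * X @ X.T + b)         # sigmoid (tanh) kernel: not PD in general

ev = np.linalg.eigvalsh(G)
print("Gram eigenvalues:", np.round(ev, 3))
print("mixed signs (indefinite here)?", bool((ev.min() < -1e-10) and (ev.max() > 1e-10)))
```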

Kernels as distances

Kernels as distances


Kernels as distances - Landmarking based approaches

Nearest neighbor classification [Duda et al., 2000]

Learning domain is some distance (possibly metric) space (X, d)
Given T = (x_i, t(x_i))_{i=1}^n, x_i ∈ X, t(x_i) ∈ {−1, +1}, T = T+ ∪ T−
Classify a new point x as + if d(x, T+) < d(x, T−), otherwise as −
When will this work?
  - Intuitively, when a large fraction of domain points are closer (according to d) to points of the same label than to points of the different label
  - Pr_{x ∈_R µ} [d(x, X^{t(x)}) < d(x, X^{−t(x)})] ≥ 1 − ε, where X^{ℓ} denotes the domain points of label ℓ

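A minimal sketch of this classification rule, assuming the distance d is supplied as a plain function (the toy data and the non-metric squared distance are illustrative choices):

```python
import numpy as np

def nn_classify(x, T_pos, T_neg, d):
    """Label x as +1 if the nearest positive training point (under d) is closer
    than the nearest negative one, else -1."""
    d_pos = min(d(x, a) for a in T_pos)   # d(x, T+)
    d_neg = min(d(x, b) for b in T_neg)   # d(x, T-)
    return 1 if d_pos < d_neg else -1

# The rule needs no metric structure: squared Euclidean distance (for which the
# triangle inequality fails) can be plugged in just as well.
sq = lambda u, v: float(np.sum((np.asarray(u) - np.asarray(v)) ** 2))
T_pos = [(0.0, 0.0), (0.2, 0.1)]
T_neg = [(1.0, 1.0), (0.9, 1.2)]
print(nn_classify((0.1, 0.2), T_pos, T_neg, sq))   # closer to T+, so +1
```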

Kernels as distances - Landmarking based approaches

What is a good distance function?

Definition
A distance function d is said to be strongly (ε, γ)-good for a learning problem if at least a 1 − ε probability mass of examples x ∈ µ satisfy

    Pr_{x′, x′′ ∈_R µ} [ d(x, x′) < d(x, x′′) | x′ ∈ X^{t(x)}, x′′ ∈ X^{−t(x)} ] ≥ 1/2 + γ.

A smoothed version of the earlier intuitive notion of a good distance function
Correspondingly the algorithm is also a smoothed version of the classical NN algorithm

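The conditional probability in this definition can be estimated on a labelled sample; a rough Monte-Carlo sketch, where the sample stands in for µ and same/opposite-label points stand in for X^{t(x)} and X^{−t(x)} (all names and parameters below are illustrative):

```python
import numpy as np

def goodness_margin(x, y_x, X, y, d, rng, m=200):
    """Monte-Carlo estimate of
    Pr[d(x, x') < d(x, x'') | x' same label as x, x'' opposite label] - 1/2."""
    same = X[y == y_x]
    opp = X[y != y_x]
    xs = same[rng.integers(len(same), size=m)]
    xo = opp[rng.integers(len(opp), size=m)]
    wins = np.mean([d(x, a) < d(x, b) for a, b in zip(xs, xo)])
    return wins - 0.5

def strongly_good_fraction(X, y, d, gamma, rng=None):
    """Empirical stand-in for the 1 - eps mass: fraction of sample points whose
    estimated margin is at least gamma."""
    if rng is None:
        rng = np.random.default_rng(0)
    margins = [goodness_margin(x, yx, X, y, d, rng) for x, yx in zip(X, y)]
    return float(np.mean(np.array(margins) >= gamma))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
euclid = lambda u, v: np.linalg.norm(u - v)
print(strongly_good_fraction(X, y, euclid, gamma=0.1))
```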

Kernels as distances - Landmarking based approaches

Learning with a good distance function

Theorem ([Wang et al., 2007])
Given a strongly (ε, γ)-good distance function, the following classifier h, for any ε, δ > 0, when given n = (1/γ²) lg(1/δ) pairs of positive and negative training points (a_i, b_i)_{i=1}^n, a_i ∈_R µ+, b_i ∈_R µ−, with probability greater than 1 − δ, has an error no more than ε + δ

    h(x) = sgn [f(x)],   f(x) = (1/n) Σ_{i=1}^n sgn [d(x, b_i) − d(x, a_i)]

What about the NN algorithm - any guarantees for that?
For metric distances - in a few slides
Note that this is an instance of weak∗-PAC learning
Guarantees for NN on non-metric distances?

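A direct sketch of the classifier appearing in the theorem (the distance, data and landmark pairs below are toy choices, not the authors' setup):

```python
import numpy as np

def wang_classifier(A, B, d):
    """h(x) = sgn[(1/n) * sum_i sgn(d(x, b_i) - d(x, a_i))], with a_i drawn from
    the positive class and b_i from the negative class."""
    def h(x):
        votes = [np.sign(d(x, b) - d(x, a)) for a, b in zip(A, B)]
        return 1 if np.mean(votes) >= 0 else -1
    return h

rng = np.random.default_rng(0)
gamma, delta = 0.2, 0.05
n = int(np.ceil((1 / gamma ** 2) * np.log2(1 / delta)))   # n = (1/gamma^2) lg(1/delta)
A = rng.normal(+1.0, 1.0, (n, 2))                         # a_i ~ mu+ (toy stand-in)
B = rng.normal(-1.0, 1.0, (n, 2))                         # b_i ~ mu- (toy stand-in)
h = wang_classifier(A, B, lambda u, v: np.linalg.norm(u - v))
print(n, h(np.array([0.8, 0.9])), h(np.array([-1.2, -0.5])))
```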

Kernels as distances - Landmarking based approaches

Other landmarking approaches

[Weinshall et al., 1998], [Jacobs et al., 2000] investigate algorithms where a (set of) representative(s) is chosen for each label: e.g. the centroid of all training points with that label
[Pekalska and Duin, 2001] consider combining classifiers based on different dissimilarity functions as well as building classifiers on combinations of different dissimilarity functions
[Weinberger and Saul, 2009] propose methods to learn a Mahalanobis distance to improve NN classification


Kernels as distances - Landmarking based approaches

Other landmarking approaches

[Gottlieb et al., 2010] present efficient schemes for NN classifiers (Lipschitz extension classifiers) in doubling spaces

    h(x) = sgn [f(x)],   f(x) = min_{x_i ∈ T} ( t(x_i) + 2 d(x, x_i) / d(T+, T−) )

  - make use of approximate nearest neighbor search algorithms
  - show that the pseudo-dimension of Lipschitz classifiers in doubling spaces is bounded
  - are able to provide schemes for optimizing the bias-variance trade-off

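A brute-force sketch of this Lipschitz-extension classifier, assuming a metric d (the approximate nearest-neighbor machinery from the paper is omitted, and the toy points are illustrative):

```python
import numpy as np

def lipschitz_extension_classifier(T, labels, d):
    """Brute-force version of f(x) = min_i ( t(x_i) + 2 d(x, x_i) / d(T+, T-) ),
    classifying by the sign of f."""
    pos = [x for x, t in zip(T, labels) if t > 0]
    neg = [x for x, t in zip(T, labels) if t < 0]
    sep = min(d(p, q) for p in pos for q in neg)        # d(T+, T-)
    L = 2.0 / sep                                       # Lipschitz constant of the labels on T
    def f(x):
        return min(t + L * d(x, xi) for xi, t in zip(T, labels))
    return (lambda x: 1 if f(x) >= 0 else -1), f

d = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
T = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [+1, +1, -1, -1]
h, f = lipschitz_extension_classifier(T, labels, d)
print(h((0.5, 0.5)), h((2.8, 0.2)), round(f((1.5, 0.5)), 3))
```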

Kernels as distances - PE space approaches

Data sensitive embeddings

Landmarking based approaches can be seen as implicitly embedding the domain into an n dimensional feature space
Perform an explicit embedding of training data to some vector space that is isometric and learn a classifier
Perform (approximately) isometric embeddings of test data into the same vector space to classify them
Exact for transductive problems, approximate for inductive ones
Long history of such techniques from early AI - Multidimensional scaling


Kernels as distances - Pseudo Euclidean spaces

The Minkowski space-time

Definition
R⁴ = R³ ⊕ R¹ := R^(3,1) endowed with the inner product ⟨(x_1, y_1, z_1, t_1), (x_2, y_2, z_2, t_2)⟩ = x_1x_2 + y_1y_2 + z_1z_2 − t_1t_2 is a 4-dimensional Minkowski space with signature (3,1). The norm imposed by this inner product is ‖(x_1, y_1, z_1, t_1)‖² = x_1² + y_1² + z_1² − t_1²

Can have vectors of negative (squared) length due to the imaginary time coordinate
The definition can be extended to arbitrary R^(p,q) (PE Spaces)

Theorem ([Goldfarb, 1984], [Haasdonk, 2005])
Any finite pseudo metric (X, d), |X| = n, can be isometrically embedded in (R^(p,q), ‖ · ‖²) for some values of p + q < n.

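A small sketch of the signature-(p, q) inner product, illustrating a vector of negative squared length (the vectors are arbitrary examples):

```python
import numpy as np

def pe_inner(u, v, p, q):
    """Pseudo-Euclidean inner product on R^(p,q): the first p coordinates count
    positively, the last q negatively."""
    signs = np.concatenate([np.ones(p), -np.ones(q)])
    return float(np.sum(signs * u * v))

# Minkowski space R^(3,1): a "time-like" vector has negative squared norm.
u = np.array([0.0, 0.0, 0.0, 2.0])
v = np.array([1.0, 1.0, 0.0, 0.5])
print(pe_inner(u, u, 3, 1), pe_inner(v, v, 3, 1))   # -4.0 and 1.75
```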

Kernels as distances - Pseudo Euclidean spaces

The Embedding

Embedding the training set
Given a distance matrix R^{n×n} ∋ D = (d(x_i, x_j)), find the corresponding inner products in the PE space as G = −(1/2) JDJ where J = I − (1/n) 1 1⊤. Do an eigendecomposition G = QΛQ⊤ = Q |Λ|^{1/2} M |Λ|^{1/2} Q⊤ where

    M = [ I_{p×p}      0
             0     −I_{q×q} ].

The representation of the points is X = Q |Λ|^{1/2}.

Embedding a new point
Perform a linear projection into the space found above. Given d = (d(x, x_i)), the vector of distances to the old points, the inner products to all the old points are found as g = −(1/2) (d − (1/n) 1⊤D) J. Now find the mean square error solution to x M X⊤ = g as x = g X |Λ|⁻¹ M.
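A sketch of both steps in NumPy, following the formulas above; the tolerance used to discard near-zero eigenvalues and the toy dissimilarity matrix are assumptions of this illustration:

```python
import numpy as np

def pe_embed_train(D, tol=1e-10):
    """G = -1/2 J D J with J = I - (1/n) 1 1^T; eigendecompose G = Q Lambda Q^T
    and return the PE coordinates X = Q |Lambda|^(1/2) plus the signature matrix M."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ D @ J
    lam, Q = np.linalg.eigh(G)
    keep = np.abs(lam) > tol                  # drop (near-)zero directions
    lam, Q = lam[keep], Q[:, keep]
    X = Q * np.sqrt(np.abs(lam))
    M = np.diag(np.sign(lam))
    return X, lam, M

def pe_embed_test(dvec, X, lam, M, D):
    """Project a new point with distance vector dvec = (d(x, x_i))_i:
    g = -1/2 (dvec - (1/n) 1^T D) J, then x = g X |Lambda|^{-1} M."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    g = -0.5 * (dvec - np.ones(n) @ D / n) @ J
    return g @ X @ np.diag(1.0 / np.abs(lam)) @ M

pts = np.array([0.0, 1.0, 2.0, 4.0])
D = np.abs(pts[:, None] - pts[None, :]) ** 4   # a symmetric dissimilarity chosen to give mixed signature
X, lam, M = pe_embed_train(D)
print("signature (p, q):", int(np.sum(lam > 0)), int(np.sum(lam < 0)))
print("training point re-embeds onto itself:",
      np.allclose(pe_embed_test(D[0], X, lam, M, D), X[0], atol=1e-8))
```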

Kernels as distances - PE space approaches

Classification in PE spaces

Earliest observations by [Goldfarb, 1984], who realized the link between landmarking and embedding approaches
[Pekalska and Duin, 2000], [Pekalska et al., 2001], [Pekalska and Duin, 2002] use this space to learn SVM, LPM, Quadratic Discriminant and Fisher Linear Discriminant classifiers
[Harol et al., 2006] propose enlarging the PE space to allow for less distortion when embedding test points
[Duin and Pekalska, 2008] propose refinements to the distance measure by making modifications to the PE space, allowing for better NN classification
Guarantees for classifiers learned in PE spaces?


Kernels as distances - Banach space approaches

Data insensitive embeddings

Possible if the distance measure can be isometrically embedded into some space
Learn a simple classifier there and interpret it in terms of the distance measure
Require algorithms that can work without explicit embeddings
Exact for transductive as well as inductive problems
Recent interest due to advent of large margin classifiers


Kernels as distances - Banach spaces

Normed Spaces

Definition
Given a vector space V over a field F ⊆ C, a norm is a function ‖ · ‖ : V → R such that ∀u, v ∈ V, a ∈ F: ‖av‖ = |a|‖v‖, ‖u + v‖ ≤ ‖u‖ + ‖v‖, and ‖v‖ = 0 if and only if v = 0. A vector space that is complete with respect to a norm is called a Banach space.

Theorem ([von Luxburg and Bousquet, 2004])
Given a metric space M = (X, d) and the space of all Lipschitz functions Lip(X) defined on M, there exists a Banach space B and maps Φ : X → B and Ψ : Lip(X) → B′, the operator norm on B′ giving the Lipschitz constant for each function f ∈ Lip(X), such that both can be realized simultaneously as isomorphic isometries.

The Kuratowski embedding gives a constructive proof

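For a finite metric space the constructive proof is simple to illustrate: the Kuratowski map x ↦ (d(x, x_1), . . . , d(x, x_n)) is an isometry into (R^n, ‖ · ‖_∞). A minimal sketch, assuming d satisfies the triangle inequality (the points and metric below are arbitrary):

```python
import numpy as np

def kuratowski_embed(points, d):
    """Map each point x to (d(x, x_1), ..., d(x, x_n)); on a finite metric space
    this is an isometry into (R^n, ||.||_inf)."""
    return np.array([[d(x, p) for p in points] for x in points])

pts = [(0, 0), (1, 2), (3, 1), (2, 2)]
d = lambda u, v: abs(u[0] - v[0]) + abs(u[1] - v[1])   # Manhattan metric on a few points
E = kuratowski_embed(pts, d)
ok = all(abs(np.max(np.abs(E[i] - E[j])) - d(pts[i], pts[j])) < 1e-12
         for i in range(len(pts)) for j in range(len(pts)))
print("sup-norm distances match the metric:", ok)
```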

Kernels as distances - Banach spaces

Classification in Banach spaces

[von Luxburg and Bousquet, 2004] proposes large margin classification schemes on Banach spaces relying on Convex hull interpretations of SVM classifiers

    inf_{p+ ∈ C+, p− ∈ C−} ‖p+ − p−‖                                                     (1)

    sup_{T ∈ B′} inf_{p+ ∈ C+, p− ∈ C−} ⟨T, p+ − p−⟩ / ‖T‖                               (2)

    inf_{T ∈ B′, b ∈ R} ‖T‖ = L(T)
    subject to t(x_i) (⟨T, x_i⟩ + b) ≥ 1, ∀i = 1, . . . , n.                             (3)

    inf_{T ∈ B′, b ∈ R} L(T) + C Σ_{i=1}^n ξ_i
    subject to t(x_i) (⟨T, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i = 1, . . . , n.              (4)


Kernels as distances - Banach spaces

Representer Theorems

Lets us escape the curse of dimensionality

Theorem (Lipschitz extension)
Given a Lipschitz function f defined on a finite subset X′ ⊂ X, one can extend f to f′ on the entire domain such that Lip(f′) = Lip(f).

Solution to Program 3 is always of the form

    f(x) = ( d(x, T−) − d(x, T+) ) / d(T+, T−)

Solution to Program 4 is always of the form

    g(x) = α min_i (t(x_i) + L_0 d(x, x_i)) + (1 − α) max_i (t(x_i) − L_0 d(x, x_i))

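A sketch of evaluating the Program 4 solution form; in the actual method α and L_0 come out of the optimization, so the values used below are purely illustrative:

```python
import numpy as np

def soft_lipschitz_classifier(T, labels, d, L0, alpha=0.5):
    """Evaluate g(x) = alpha * min_i(t(x_i) + L0 d(x, x_i))
                     + (1 - alpha) * max_i(t(x_i) - L0 d(x, x_i))
    and classify by its sign."""
    def g(x):
        upper = min(t + L0 * d(x, xi) for xi, t in zip(T, labels))   # largest L0-Lipschitz extension
        lower = max(t - L0 * d(x, xi) for xi, t in zip(T, labels))   # smallest L0-Lipschitz extension
        return alpha * upper + (1 - alpha) * lower
    return g, (lambda x: 1 if g(x) >= 0 else -1)

d = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
T = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [+1, +1, -1, -1]
g, h = soft_lipschitz_classifier(T, labels, d, L0=2.0 / 3.0)   # L0, alpha: illustrative values only
print(round(g((0.5, 0.5)), 3), h((0.5, 0.5)), h((2.7, 0.4)))
```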

Kernels as distances - Banach spaces

But ...

Not a representer theorem involving distances to individual training points
Shown not to exist in certain cases - but the examples don't seem natural
By restricting oneself to different subspaces of Lip(X) one recovers the SVM, LPM and NN algorithms
Can one use bi-Lipschitz embeddings instead?
Can one define "distance kernels" that allow one to restrict oneself to specific subspaces of Lip(X)?


Kernels as distances - Banach spaces

Other Banach Space Approaches

[Hein et al., 2005] consider low distortion embeddings into Hilbert spaces, giving a re-derivation of the SVM algorithm

Definition
A matrix A ∈ R^{n×n} is said to be conditionally positive definite if ∀c ∈ R^n with c⊤1 = 0, c⊤Ac > 0.

Definition
A kernel K defined on a domain X is said to be conditionally positive definite if ∀n ∈ N, ∀x_1, . . . , x_n ∈ X, the matrix G = (G_ij) = (K(x_i, x_j)) is conditionally positive definite.

Theorem
A metric d is Hilbertian, i.e. it can be isometrically embedded into a Hilbert space, iff −d² is conditionally positive definite.

Kernels as distances Banach spaces

Other Banach Space Approaches

[Hein et al., 2005] consider low distortion embeddings into Hilbertspaces giving a re-derivation of the SVM algorithm

DefinitionA matrix A ∈ Rn×n is said to be conditionally positive definite if∀c ∈ Rn, c>1 = 0, c>Ac > 0.

DefinitionA kernel K defined on a domain X is said to be conditionally positivedefinite if ∀n ∈ N, ∀x1, . . . xn ∈ X , the matrix G = (Gij) = (K (xi , xj)) isconditionally positive definite.

TheoremA metric d is Hibertian if it can be isometrically embedded into aHilbert space iff −d2 is conditionally positive definite

Purushottam Kar (CSE/IITK) Learning in Indefiniteness August 2, 2010 29 / 60

Kernels as distances Banach spaces

Other Banach Space Approaches

[Hein et al., 2005] consider low distortion embeddings into Hilbertspaces giving a re-derivation of the SVM algorithm

DefinitionA matrix A ∈ Rn×n is said to be conditionally positive definite if∀c ∈ Rn, c>1 = 0, c>Ac > 0.

DefinitionA kernel K defined on a domain X is said to be conditionally positivedefinite if ∀n ∈ N, ∀x1, . . . xn ∈ X , the matrix G = (Gij) = (K (xi , xj)) isconditionally positive definite.

TheoremA metric d is Hibertian if it can be isometrically embedded into aHilbert space iff −d2 is conditionally positive definite

Purushottam Kar (CSE/IITK) Learning in Indefiniteness August 2, 2010 29 / 60

Kernels as distances Banach spaces

Other Banach Space Approaches

[Hein et al., 2005] consider low distortion embeddings into Hilbert spaces, giving a re-derivation of the SVM algorithm

Definition: A matrix $A \in \mathbb{R}^{n \times n}$ is said to be conditionally positive definite if $\forall\, c \in \mathbb{R}^n$ with $c^\top \mathbf{1} = 0$, $c^\top A c \geq 0$.

Definition: A kernel K defined on a domain $\mathcal{X}$ is said to be conditionally positive definite if $\forall n \in \mathbb{N}$, $\forall x_1, \ldots, x_n \in \mathcal{X}$, the matrix $G = (G_{ij}) = (K(x_i, x_j))$ is conditionally positive definite.

Theorem: A metric d is Hilbertian, i.e., it can be isometrically embedded into a Hilbert space, iff $-d^2$ is conditionally positive definite.


Kernels as distances Banach spaces

Other Banach Space Approaches

[Der and Lee, 2007] consider exploiting the semi-inner product structure present in Banach spaces to yield SVM formulations
I Aim for a kernel trick for general metrics
I Lack of symmetry and bi-linearity for semi-inner products prevents such kernel tricks for general metrics
[Zhang et al., 2009] propose Reproducing Kernel Banach Spaces, akin to RKHS, that admit kernel tricks
I Use a bilinear form on B × B′ instead of B × B
I No succinct characterizations of what can yield an RKBS
I For finite domains, any kernel is a reproducing kernel for some RKBS (trivial)


Kernels as distances Banach spaces

Kernel Trick for Distances ?

Theorem ([Scholkopf, 2000]): A kernel C defined on some domain $\mathcal{X}$ is CPD iff for some fixed $x_0 \in \mathcal{X}$, the kernel $K(x, x') = C(x, x') - C(x, x_0) - C(x', x_0)$ is PD. Such a C is also a Hilbertian metric.

The SVM algorithm is incapable of distinguishing between C and K [Boughorbel et al., 2005]:
$$\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j C(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

What about higher order CPD kernels - their characterization ?
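This invariance is simple to verify numerically. The sketch below is an illustration (not from the slides): it uses the classical CPD kernel C(x, x′) = −‖x − x′‖², builds K via the construction in the theorem for an arbitrary x₀, and checks that the two quadratic forms coincide whenever Σᵢ αᵢyᵢ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                     # 20 points in R^5
y = rng.choice([-1.0, 1.0], size=20)

# C(x, x') = -||x - x'||^2 is CPD but not PD
C = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# K(x, x') = C(x, x') - C(x, x0) - C(x', x0), with x0 = X[0]
c0 = C[:, 0]
K = C - c0[:, None] - c0[None, :]

# draw any alpha and project it onto the SVM dual constraint sum_i alpha_i y_i = 0
alpha = rng.random(20)
alpha -= y * (alpha @ y) / 20.0

qC = (alpha * y) @ C @ (alpha * y)
qK = (alpha * y) @ K @ (alpha * y)
print(np.isclose(qC, qK))                        # True: the dual cannot tell C from K
```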


Kernels as similarity


Kernels as similarity

The Kernel Trick

Mercer’s theorem tells us that a similarity space (X, K) is embeddable in a Hilbert space iff K is a PSD kernel
Quite similar to what we had for Banach spaces, only with more structure now
Can formulate large margin classifiers as before
Representer Theorem [Scholkopf and Smola, 2001]: solutions are of the form $f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$
Generalization guarantees: method of Rademacher averages [Mendelson, 2003]
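The large-margin program itself is not reproduced here, but kernel ridge regression gives a minimal, self-contained instance of a solution of the representer form f(x) = Σᵢ αᵢK(x, xᵢ). The sketch below is an illustration with a Gaussian kernel, not part of the original slides.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian (PSD) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# toy 1-d regression data
Xtr = np.linspace(0.0, 3.0, 30)[:, None]
ytr = np.sin(2 * Xtr[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=30)

lam = 1e-2
alpha = np.linalg.solve(rbf(Xtr, Xtr) + lam * np.eye(30), ytr)   # representer coefficients

def f(Xte):
    # exactly the representer form: f(x) = sum_i alpha_i K(x, x_i)
    return rbf(Xte, Xtr) @ alpha

print(f(np.array([[1.5]])))      # close to sin(3.0) ~ 0.14
```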


Kernels as similarity Indefinite Similarity Kernels

The Lazy approaches

Why bother building a theory when one already exists!
I Use a PD approximation to the given indefinite kernel!! [Chen et al., 2009]: Spectrum Shift, Spectrum Clip, Spectrum Flip (see the sketch after this list)
I [Luss and d’Aspremont, 2007] folds this process into the SVM algorithm by treating an indefinite kernel as a noisy version of a Mercer kernel
I Tries to handle test points consistently, but no theoretical justification of the process
I Mercer kernels are not dense in the space of symmetric kernels
[Haasdonk and Bahlmann, 2004] propose distance substitution kernels: substituting distance/similarity measures into kernels of the form K(‖x − y‖), K(⟨x, y⟩)
I These yield PD kernels iff the distance measure is Hilbertian
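The three spectrum modifications named above all act on the eigendecomposition of the symmetric indefinite Gram matrix. The sketch below is a rendering of the standard constructions (clip zeroes negative eigenvalues, flip takes absolute values, shift adds |λ_min| throughout); it is an illustration, not code from [Chen et al., 2009].

```python
import numpy as np

def spectrum_fix(K, mode="clip"):
    """Return a PSD surrogate of a symmetric indefinite Gram matrix K."""
    w, V = np.linalg.eigh((K + K.T) / 2.0)        # symmetrize defensively
    if mode == "clip":
        w = np.maximum(w, 0.0)                    # drop the negative part
    elif mode == "flip":
        w = np.abs(w)                             # reflect negative eigenvalues
    elif mode == "shift":
        w = w + max(0.0, -w.min())                # add |lambda_min| to the spectrum
    return (V * w) @ V.T

# toy indefinite similarity matrix
K = np.array([[ 1.0, 0.9, -0.8],
              [ 0.9, 1.0,  0.2],
              [-0.8, 0.2,  1.0]])
for m in ("clip", "flip", "shift"):
    print(m, np.linalg.eigvalsh(spectrum_fix(K, m)).min() >= -1e-12)   # all True
```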


Kernels as similarity PE space approaches

Working with Indefinite Similarities

Embed training sets into PE spaces (Minkowski spaces) as before
[Graepel et al., 1998] proposes to learn SVMs in this space - unfortunately not a large margin formulation
[Graepel et al., 1999] propose LP machines in a ν-SVM like formulation to obtain sparse classifiers
[Mierswa, 2006] proposes using evolutionary algorithms to solve non-convex formulations involving indefinite kernels


Kernels as similarity PE space approaches

Working with Indefinite Similarities

[Haasdonk, 2005] embeds training data into a PE space and formulates a ν-SVM-like classifier there
Not a margin maximization formulation
New points are not embedded into this space - rather the SVM-like representation is used (without justification)
Optimization not possible since program formulations are non-convex - stabilization used
Can any guarantees be given for this formulation?


Kernels as similarity Kreın space approaches

Kreın spaces

Definition: An inner product space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is called a Kreın space if there exist two Hilbert spaces $\mathcal{H}_+$ and $\mathcal{H}_-$ such that $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ and $\forall f, g \in \mathcal{K}$, $\langle f, g \rangle_{\mathcal{K}} = \langle f, g \rangle_{\mathcal{H}_+} - \langle f, g \rangle_{\mathcal{H}_-}$.

Definition: Given a domain $\mathcal{X}$, a subset $\mathcal{K} \subset \mathbb{R}^{\mathcal{X}}$ is called a Reproducing Kernel Kreın space if the evaluation functional $T_x : f \mapsto f(x)$ is continuous on $\mathcal{K}$ with respect to its strong topology.

Theorem ([Ong et al., 2004]): A kernel K on $\mathcal{X}$ is a reproducing kernel for some Kreın space $\mathcal{K}$ iff there exist PD kernels $K_+$ and $K_-$ such that $K = K_+ - K_-$.
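On a finite sample the decomposition in the theorem can be written down explicitly from the spectrum of the Gram matrix. The sketch below is only a finite-dimensional analogue (an illustration, not from the slides), splitting an indefinite symmetric matrix into a difference of two PSD matrices.

```python
import numpy as np

def krein_split(K):
    """Write a symmetric matrix as K = K_plus - K_minus with both parts PSD."""
    w, V = np.linalg.eigh((K + K.T) / 2.0)
    K_plus = (V * np.maximum(w, 0.0)) @ V.T       # positive spectral part
    K_minus = (V * np.maximum(-w, 0.0)) @ V.T     # (negated) negative spectral part
    return K_plus, K_minus

K = np.array([[ 1.0, 0.9, -0.8],
              [ 0.9, 1.0,  0.2],
              [-0.8, 0.2,  1.0]])
Kp, Km = krein_split(K)
print(np.allclose(K, Kp - Km))                                        # True
print(np.linalg.eigvalsh(Kp).min() >= -1e-12,
      np.linalg.eigvalsh(Km).min() >= -1e-12)                         # both PSD
```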


Kernels as similarity Kreın space approaches

Classification in Kreın spaces

[Ong et al., 2004] proves all the necessary results for learning large margin classifiers
Proves that even stabilization leads to an SVM-like Representer Theorem
No large margin formulations considered due to singularity issues
I Instead, regularization is performed by truncating the spectrum of K
I Iterative methods to minimize squared error lead to regularization
Proves generalization error bounds using the method of Rademacher averages


Kernels as similarity Landmarking based approaches

Landmarking approaches

[Graepel et al., 1999] consider landmarking with indefinite kernels
Perform L1 regularization for the large margin classifier to obtain sparse solutions - yields an LP formulation
Also propose the ν-SVM formulation to get control over the number of margin violations
Allows us to perform optimizations in the bias-variance trade-off
However, no guarantees given - these were provided later by [Hein et al., 2005], [von Luxburg and Bousquet, 2004]


Kernels as similarity Landmarking based approaches

What is a good similarity function

Definition: A kernel function K is said to be (ε, γ)-kernel good for a learning problem if $\exists\, \beta \in \mathcal{K}_K$ such that
$$\Pr_{x \in_R \mu}\left[\, t(x)\,\langle \beta, \Phi_K(x) \rangle > \gamma \,\right] \geq 1 - \epsilon.$$

Definition: A kernel function K is said to be strongly (ε, γ)-good for a learning problem if at least a 1 − ε probability mass of the domain satisfies
$$\mathop{\mathbb{E}}_{x' \in_R \mu^+}\left[ K(x, x') \right] > \mathop{\mathbb{E}}_{x' \in_R \mu^-}\left[ K(x, x') \right] + \gamma.$$


Kernels as similarity Landmarking based approaches

Learning with a good similarity function

Theorem ([Balcan et al., 2008a]): Given a strongly (ε, γ)-good similarity function, the following classifier h, for any ε, δ > 0, when given $n = \frac{16}{\gamma^2} \lg\left(\frac{2}{\delta}\right)$ pairs of positive and negative training points $(a_i, b_i)_{i=1}^{n}$, $a_i \in_R \mu^+$, $b_i \in_R \mu^-$, has, with probability greater than 1 − δ, an error no more than ε + δ:
$$h(x) = \operatorname{sgn}\left[f(x)\right], \qquad f(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, a_i) - \frac{1}{n} \sum_{i=1}^{n} K(x, b_i)$$

Have to introduce a weighing function to extend the scope of the algorithm
Can be shown to imply that the landmarking kernel induced by a random sample is a good kernel with high probability
Yet another instance of weak∗-PAC learning
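The classifier in the theorem takes only a few lines to implement. The sketch below is an illustration (not from the slides), with a toy Gaussian similarity on the real line standing in for K.

```python
import numpy as np

def balcan_classifier(K, A, B):
    """h(x) = sgn[(1/n) sum_i K(x, a_i) - (1/n) sum_i K(x, b_i)].

    K : similarity function K(x, y) -> float
    A : landmark points drawn from the positive class
    B : landmark points drawn from the negative class
    """
    def h(x):
        f = np.mean([K(x, a) for a in A]) - np.mean([K(x, b) for b in B])
        return 1.0 if f > 0 else -1.0
    return h

# toy similarity on the real line, purely illustrative
sim = lambda x, y: np.exp(-(x - y) ** 2)
h = balcan_classifier(sim, A=[1.8, 2.1, 2.4], B=[-2.0, -1.7, -2.3])
print(h(2.0), h(-1.9))    # 1.0 -1.0
```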


Kernels as similarity Landmarking based approaches

Kernels as Kernels vs. Kernels as Similarity

Similarity → Kernel: (ε, γ)-good ⇒ (ε + δ, γ/2)-kernel good
Kernel → Similarity: (ε, γ)-kernel good ⇒ $\left(\epsilon + \epsilon_0,\ \tfrac{1}{2}(1 - \epsilon)\epsilon_0\gamma^2\right)$-good
[Srebro, 2007] There exist learning instances for which kernels perform better as kernels than as similarity functions
[Balcan et al., 2008b] There exist function classes and distributions such that no kernel performs well on all the functions. However, there exist similarity functions that give optimal performance
Role of the weighing function not investigated


Conclusion


Conclusion

The big picture

Finite-dimensional embeddings (PE, Minkowski spaces)
I Work well in transductive settings
I Allow for support vector like effects
I Not much work on generalization guarantees
I Not much known about distortion incurred when embedding test points
I Should work well owing to Representer Theorems


Conclusion

The big picture

Exact embeddings (Banach, Kreın spaces)
I Work well in inductive settings
I Allow for support vector like effects
I Generalization guarantees well studied
I Embeddings are isometric or “isosimilar”
I Too much power though ([von Luxburg and Bousquet, 2004], [Ong et al., 2004])


Conclusion

The big picture

Landmarking approaches
I Work well in inductive settings
I Don’t allow support vector like effects (got to keep all the landmarks)
I Generalization guarantees are available
I But how does one find a “good” kernel?


Conclusion

Open questions

Choosing the kernel: still requires one to attend Hogwarts
Existing approaches to learning kernels are pathetic
[Balcan et al., 2008c] proposes to learn with multiple similarity functions
Need testable definitions of goodness of kernels


Conclusion

Open questions

Application of indefinite kernels to other tasks
I clustering [Balcan et al., 2008d]
I principal components
I multi-class classification [Balcan and Blum, 2006]
Analysis of the feature maps induced by embeddings into Banach, Kreın spaces [Balcan et al., 2006]


Bibliography

Balcan, M.-F. and Blum, A. (2006). On a Theory of Learning with Similarity Functions. In International Conference on Machine Learning, pages 73–80.

Balcan, M.-F., Blum, A., and Srebro, N. (2008a). A Theory of Learning with Similarity Functions. Machine Learning, 71(1-2):89–112.

Balcan, M.-F., Blum, A., and Srebro, N. (2008b). Improved Guarantees for Learning via Similarity Functions. In Annual Conference on Computational Learning Theory, pages 287–298.


Balcan, M.-F., Blum, A., and Srebro, N. (2008c). Learning with Multiple Similarity Functions. Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, Advances in Neural Information Processing Systems.

Balcan, M.-F., Blum, A., and Vempala, S. (2006). Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings. Machine Learning, 65(1):79–94.

Balcan, M.-F., Blum, A., and Vempala, S. (2008d). A Discriminative Framework for Clustering via Similarity Functions. In ACM Annual Symposium on Theory of Computing, pages 671–680.


Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005). Conditionally Positive Definite Kernels for SVM Based Image Recognition. In IEEE International Conference on Multimedia & Expo, pages 113–116.

Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., and Cazzanti, L. (2009). Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research, 10:747–776.

Der, R. and Lee, D. (2007). Large Margin Classification in Banach Spaces. In International Conference on Artificial Intelligence and Statistics, volume 2 of JMLR Workshop and Conference Proceedings, pages 91–98.


Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley-Interscience, second edition.

Duin, R. P. W. and Pekalska, E. (2008). On Refining Dissimilarity Matrices for an Improved NN Learning. In International Conference on Pattern Recognition, pages 1–4.

Goldfarb, L. (1984). A Unified Approach to Pattern Recognition. Pattern Recognition, 17(5):575–582.

Gottlieb, L.-A., Kontorovich, A. L., and Krauthgamer, R. (2010). Efficient Classification for Metric Data. In Annual Conference on Computational Learning Theory.


Graepel, T., Herbrich, R., Bollmann-Sdorra, P., and Obermayer, K. (1998). Classification on Pairwise Proximity Data. In Advances in Neural Information Processing Systems, pages 438–444.

Graepel, T., Herbrich, R., Scholkopf, B., Smola, A., Bartlett, P., Muller, K.-R., Obermayer, K., and Williamson, R. (1999). Classification of Proximity Data with LP machines. In Ninth International Conference on Artificial Neural Networks, pages 304–309.

Haasdonk, B. (2005). Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492.


Haasdonk, B. and Bahlmann, C. (2004). Learning with Distance Substitution Kernels. In Annual Symposium of Deutsche Arbeitsgemeinschaft fur Mustererkennung, pages 220–227.

Harol, A., Pekalska, E., Verzakov, S., and Duin, R. P. W. (2006). Augmented Embedding of Dissimilarity Data into (Pseudo-)Euclidean Spaces. In IAPR Workshop on Structural, Syntactic, and Statistical Pattern Recognition, pages 613–621.

Hein, M., Bousquet, O., and Scholkopf, B. (2005). Maximal Margin Classification for Metric Spaces. Journal of Computer and System Sciences, 71(3):333–359.


Jacobs, D. W., Weinshall, D., and Gdalyahu, Y. (2000). Classification with Nonmetric Distances: Image Retrieval and Class Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):583–600.

Kearns, M. and Vazirani, U. (1997). An Introduction to Computational Learning Theory. The MIT Press.

Luss, R. and d’Aspremont, A. (2007). Support Vector Machine Classification with Indefinite Kernels. In Advances in Neural Information Processing Systems.


Mendelson, S. (2003). A Few Notes on Statistical Learning Theory. In Mendelson, S. and Smola, A. J., editors, Advanced Lectures on Machine Learning, Machine Learning Summer School, volume 2600 of Lecture Notes in Computer Science. Springer.

Mierswa, I. (2006). Making Indefinite Kernel Learning Practical. Technical report, Collaborative Research Center 475, University of Dortmund.

Ong, C. S., Mary, X., Canu, S., and Smola, A. J. (2004). Learning with non-positive Kernels. In International Conference on Machine Learning.


Pekalska, E. and Duin, R. P. W. (2000). Classifiers for Dissimilarity-Based Pattern Recognition. In International Conference on Pattern Recognition, pages 2012–2016.

Pekalska, E. and Duin, R. P. W. (2001). On Combining Dissimilarity Representations. In Multiple Classifier Systems, pages 359–368.

Pekalska, E. and Duin, R. P. W. (2002). Dissimilarity representations allow for building good classifiers. Pattern Recognition Letters, 23(8):943–956.


Pekalska, E., Paclık, P., and Duin, R. P. W. (2001). A Generalized Kernel Approach to Dissimilarity-based Classification. Journal of Machine Learning Research, 2:175–211.

Scholkopf, B. (2000). The Kernel Trick for Distances. In Advances in Neural Information Processing Systems, pages 301–307.

Scholkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, first edition.


Srebro, N. (2007). How Good Is a Kernel When Used as a Similarity Measure? In Annual Conference on Computational Learning Theory, pages 323–335.

von Luxburg, U. and Bousquet, O. (2004). Distance-Based Classification with Lipschitz Functions. Journal of Machine Learning Research, 5:669–695.

Wang, L., Yang, C., and Feng, J. (2007). On Learning with Dissimilarity Functions. In International Conference on Machine Learning, pages 991–998.


Weinberger, K. Q. and Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.

Weinshall, D., Jacobs, D. W., and Gdalyahu, Y. (1998). Classification in Non-Metric Spaces. In Advances in Neural Information Processing Systems, pages 838–846.

Zhang, H., Xu, Y., and Zhang, J. (2009). Reproducing Kernel Banach Spaces for Machine Learning. Journal of Machine Learning Research, 10:2741–2775.
