Infinite Task Learning with Vector-Valued RKHSs
Transcript of “Infinite Task Learning with Vector-Valued RKHSs”
-
Infinite Task Learning with Vector-Valued RKHSs
Alex Lambert. Joint work with R. Brault, Z. Szabo, M. Sangnier, F. d’Alché-Buc. September 13, 2018
-
Motivation
-
An example of task: Quantile Regression

• (X, Y) ∈ R^d × R random variables
• θ ∈ (0, 1)

Learn the conditional quantile

q(x) = inf { y ∈ R : P(Y ⩽ y | X = x) = θ }

from iid copies (x_i, y_i)_{i=1}^n.

Figure 1: Example of several quantile functions (toy dataset).
-
An example of task: Quantile Regression

Minimize in h:

E_{X,Y}[ max(θ(Y − h(X)), (θ − 1)(Y − h(X))) ]

Figure 2: Two independently learnt quantile estimations (θ = 0.5 and θ = 0.75).

• Not adapted to the structure of the problem
• No way to recover other quantiles
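The displayed expectation is the pinball (quantile) loss. A minimal NumPy sketch, not the authors' code (`pinball_loss` and the toy sample are illustrative), checking the classical fact that minimizing the empirical pinball risk over constants recovers the empirical θ-quantile:

```python
import numpy as np

def pinball_loss(y, pred, theta):
    """Pinball loss: max(theta*(y - pred), (theta - 1)*(y - pred))."""
    r = y - pred
    return np.maximum(theta * r, (theta - 1.0) * r)

# Minimizing the empirical pinball risk over a constant predictor
# recovers the empirical theta-quantile of the sample.
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
grid = np.linspace(-1, 5, 601)
risks = [pinball_loss(y, c, 0.5).mean() for c in grid]
best = grid[int(np.argmin(risks))]  # close to the median, 2.0
```

For θ = 0.5 this is (half) the absolute loss; asymmetric θ tilts the minimizer toward higher or lower quantiles.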
-
An example of task: Cost-Sensitive Classification

• Binary classification with an asymmetric loss function. Minimize

E_{X,Y}[ |(θ + 1)/2 − 1_{−1}(Y)| · |1 − Y h(X)|_+ ]

Figure 3: Independent cost-sensitive classification.

• No structure, no interpolation
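The loss above is a class-weighted hinge. A hypothetical NumPy sketch (the function name and demo values are illustrative, not from the talk):

```python
import numpy as np

def cost_sensitive_hinge(y, pred, theta):
    """Weighted hinge loss |(theta+1)/2 - 1_{-1}(y)| * max(0, 1 - y*pred).

    theta in (-1, 1): as theta -> 1, errors on the positive class (y = +1)
    dominate; as theta -> -1, errors on the negative class dominate.
    """
    weight = np.abs((theta + 1.0) / 2.0 - (y == -1).astype(float))
    return weight * np.maximum(0.0, 1.0 - y * pred)

# At theta = 0 both classes get weight 1/2 (a rescaled standard hinge).
losses = cost_sensitive_hinge(np.array([1, -1]), np.array([0.0, 0.0]), 0.0)
```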
-
An example of task: Density Level Set Estimation

(Schölkopf et al., 2000) Given (x_i)_{i=1}^n iid and θ ∈ (0, 1), minimize for (h, t) ∈ H_k × R

J(h, t) = (1/(θn)) Σ_{i=1}^n max(0, t − h(x_i)) − t + (1/2)‖h‖²_{H_k}

Decision function:

d(x) = 1_{R_+}(h(x) − t)

θ-property of the decision function
The decision function should separate new data into two subsets, with a proportion θ of outliers.
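This is the one-class SVM objective, so the θ-property can be probed empirically with scikit-learn's `OneClassSVM`, whose `nu` parameter plays the role of θ. A sketch under assumed choices (RBF kernel, Gaussian toy data):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # toy inlier data

theta = 0.2
# nu upper-bounds the fraction of margin errors (outliers) and
# lower-bounds the fraction of support vectors.
ocsvm = OneClassSVM(kernel="rbf", nu=theta, gamma=0.5).fit(X)

# theta-property: roughly a proportion theta of the points fall
# outside the estimated level set (predict == -1).
outlier_frac = float(np.mean(ocsvm.predict(X) == -1))
```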
-
Multi-Task Learning

Given a problem indexed by some hyperparameter θ, solve concomitantly a finite number of tasks given by (θ_1, …, θ_p).

• Output in R^p
• Sum the loss functions associated to each θ_i
• Add regularization to benefit from the similarity between tasks
• Create specific model constraints from prior knowledge of the tasks

How to extend this to a continuum of tasks?
-
A functional approach

Proposed framework: learn function-valued functions

‘input ↦ (hyperparameter ↦ output)’

‘x ↦ (θ ↦ y)’

Goal: learn a global function while preserving the desired properties of the output function for each hyperparameter θ.
-
Supervised Learning Framework
-
Parametrized Task

ERM setting: minimize in h ∈ H ⊂ F(X; F(Θ; R)), for a training set S = (x_i, y_i)_{i=1}^n and λ > 0,

R_S(h) = Σ_{i=1}^n V(y_i, h(x_i)) + λ Ω(h)

where

V(y, h(x)) := ∫_Θ v(θ, y, h(x)(θ)) dμ(θ)

and Ω(h) is a regularization term.
-
Sampled Empirical Risk

Estimating the integral: quadrature, Monte-Carlo or Quasi-Monte-Carlo.

Ṽ(y, h(x)) := Σ_{j=1}^m w_j v(θ_j, y, h(x)(θ_j))

• The weights w_j cannot depend on h
• QMC: low-discrepancy sequences (Sobol) lead to error rates O(log(m)/m)
• No need to approximate too precisely
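A sampled risk of this form only needs the nodes θ_j and weights w_j. A minimal sketch with SciPy's Sobol generator (variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import qmc

# Low-discrepancy sampling of the hyperparameter theta in (0, 1).
m = 128  # Sobol sequences behave best with powers of two
sobol = qmc.Sobol(d=1, scramble=True, seed=0)
thetas = sobol.random(m).ravel()

# Uniform weights w_j = 1/m: crucially, they do not depend on h.
w = np.full(m, 1.0 / m)

# Sanity check: QMC estimate of E[theta] under mu = U(0, 1) is ~1/2.
estimate = float(np.sum(w * thetas))
```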
-
Functional space H

vv-RKHS framework (Carmeli et al., 2006):

• Hilbert space of functions with values in a Hilbert space
• Regularity properties (bounded evaluation functionals)

Take two scalar kernels k_X: X × X → R and k_Θ: Θ × Θ → R, and construct

K: X × X → L(H_{k_Θ}),  K(x, z) = k_X(x, z) Id_{H_{k_Θ}}

Structure: H_K ≃ H_{k_X} ⊗ H_{k_Θ}, i.e.

H_K = span{ k_X(·, x) · k_Θ(·, θ) : (x, θ) ∈ X × Θ }
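For this identity-decomposable kernel, the Gram matrix over the grid of pairs (x_i, θ_j) factorizes as a Kronecker product of the two scalar Gram matrices, which is what keeps computations tractable. A small NumPy sketch (Gaussian kernels and shapes are illustrative assumptions):

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix of the Gaussian kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n, m = 5, 4
X = rng.normal(size=(n, 3))          # inputs x_i
thetas = rng.uniform(size=(m, 1))    # hyperparameters theta_j

K_X = gaussian_gram(X, X, 1.0)            # n x n
K_T = gaussian_gram(thetas, thetas, 5.0)  # m x m

# For K(x, z) = k_X(x, z) Id, the Gram matrix over all (x_i, theta_j)
# pairs is the Kronecker product of K_X and K_T.
G = np.kron(K_X, K_T)  # (n*m) x (n*m), symmetric positive semidefinite
```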
-
Optimization

Optimization problem:

arg min_{h ∈ H_K} R̃_S(h) + λ‖h‖²_{H_K},  λ > 0    (1)

Representer Theorem
Assume that the local loss function is a proper l.s.c. function. Then the solution h* of problem (1) is unique and verifies, for all (x, θ) ∈ X × Θ,

h*(x)(θ) = Σ_{i=1}^n Σ_{j=1}^m α_{ij} k_X(x, x_i) k_Θ(θ, θ_j)

for some (α_{ij})_{i,j=1}^{n,m} ∈ R^{n×m}.
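Once the coefficients α are known, evaluating h*(x)(θ) is a double kernel sum. A hypothetical sketch (names, kernels and coefficients are illustrative):

```python
import numpy as np

def predict(x, theta, alpha, X_train, thetas, k_X, k_T):
    """Evaluate h(x)(theta) = sum_{i,j} alpha_ij k_X(x, x_i) k_T(theta, theta_j)."""
    kx = np.array([k_X(x, xi) for xi in X_train])     # (n,)
    kt = np.array([k_T(theta, tj) for tj in thetas])  # (m,)
    return float(kx @ alpha @ kt)

rng = np.random.default_rng(0)
n, m = 4, 3
X_train = rng.normal(size=(n, 2))
thetas = rng.uniform(size=m)
alpha = rng.normal(size=(n, m))  # coefficients from the representer theorem

k_X = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))
k_T = lambda a, b: float(np.exp(-10.0 * (a - b) ** 2))

value = predict(X_train[0], thetas[0], alpha, X_train, thetas, k_X, k_T)
```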
-
Optimization

Solved by L-BFGS-B together with a smoothing of the local loss.

• Complexity in O(#iterations · (n²m + nm²))
• Smoothing à la Huber: infimal convolution with ‖·‖²
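For the pinball loss, the infimal convolution with a squared norm has a closed form: the Moreau envelope, a Huber-like function that is quadratic near the kink and linear elsewhere. A sketch of one such smoothing (the parameter `gamma` and this exact variant are assumptions, not necessarily the authors' choice):

```python
import numpy as np

def smoothed_pinball(r, theta, gamma):
    """Moreau envelope of the pinball loss max(theta*r, (theta-1)*r),
    i.e. its infimal convolution with u -> u**2 / (2*gamma).

    Quadratic on [gamma*(theta-1), gamma*theta], linear outside,
    and continuously differentiable everywhere.
    """
    return np.where(
        r >= gamma * theta,
        theta * r - gamma * theta ** 2 / 2.0,
        np.where(
            r <= gamma * (theta - 1.0),
            (theta - 1.0) * r - gamma * (theta - 1.0) ** 2 / 2.0,
            r ** 2 / (2.0 * gamma),
        ),
    )
```

As gamma → 0 the envelope recovers the pinball loss, so the smoothing parameter trades differentiability against fidelity.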
-
Statistical Guarantees

Context of uniform stability in vv-RKHSs (Kadri et al., 2015).

Generalization bound
Let h* ∈ H_K be the solution of the problem above for the QR or CSC problem with QMC approximation. For a large class of kernels,

R(h*) ⩽ R̃_S(h*) + O_{P_{X,Y}}(1/√(λn)) + O(log(m)/√(λm))

• Requires bounded random variables in QR
• Trade-off between n and m
• Mild hypotheses on the kernels
-
Numerical experiments: Infinite Quantile Regression

Crossing penalty: hard or soft constraints.

Ω_nc(h) := λ_nc ∫_X ∫_Θ | −∂h/∂θ (x)(θ) |_+ dμ(θ) dP(x)

Figure 4: Comparison with and without the crossing penalty for IQR (non-crossing: λ_nc = 2.0; crossing: λ_nc = 0).

• Matches the state of the art on 20 UCI datasets (Sangnier et al., 2016).
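In practice the double integral in Ω_nc can be approximated on the sampled grid by finite differences of the learnt quantile surface. A hypothetical discretization (function and argument names are illustrative):

```python
import numpy as np

def crossing_penalty(H, thetas, lam_nc):
    """Finite-difference surrogate of the non-crossing penalty.

    H[i, j] ~ h(x_i)(theta_j) on an increasing grid thetas; the penalty
    charges negative slopes in theta, i.e. |-dh/dtheta|_+ averaged over
    the sampled pairs (x_i, theta_j).
    """
    slopes = np.diff(H, axis=1) / np.diff(thetas)[None, :]
    return lam_nc * float(np.mean(np.maximum(0.0, -slopes)))
```

Monotone quantile curves incur zero penalty; any crossing (a curve decreasing in θ) contributes proportionally to the violated slope.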
-
Numerical experiments: Infinite Cost-Sensitive Classification

Figure 5: ICSC vs independent learning at θ = −0.9, 0 and 0.9. Continuum: sensitivity/specificity 0.42/1.0, 0.92/0.74 and 1.0/0.42. Independent: sensitivity/specificity 0.22/1.0, 0.8/0.68 and 1.0/0.04.

• Improves performance
• Hard to tune the kernels
-
An unsupervised task: Density level set estimation
-
Functional learning

Integrated problem: minimize in (h, t) ∈ H_K × H_{k_b}

∫_0^1 [ (1/(θn)) Σ_{i=1}^n max(0, t(θ) − h(x_i)(θ)) − t(θ) + (1/2)‖h(·)(θ)‖²_{H_{k_X}} ] dμ(θ)

Take (θ_j)_{j=1}^m ∈ (0, 1) a QMC sequence, and minimize

J(h, t) = (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m [ (1/θ_j) max(0, t(θ_j) − h(x_i)(θ_j)) − t(θ_j) + ‖h(·)(θ_j)‖²_{H_{k_X}} ] + (λ/2)‖t‖²_{H_{k_b}}
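Given the values h(x_i)(θ_j), t(θ_j) and the RKHS norms as precomputed arrays, the sampled objective J(h, t) is a direct double sum. A hypothetical transcription (argument names, and the grouping of terms under the double sum, are illustrative assumptions):

```python
import numpy as np

def sampled_objective(H, t, thetas, h_norms_sq, t_norm_sq, lam):
    """Sampled one-class objective J(h, t) from precomputed quantities:
    H[i, j] = h(x_i)(theta_j), t[j] = t(theta_j),
    h_norms_sq[j] = ||h(.)(theta_j)||^2 in H_{k_X}, t_norm_sq = ||t||^2.
    """
    hinge = np.maximum(0.0, t[None, :] - H) / thetas[None, :]
    per_pair = hinge - t[None, :] + h_norms_sq[None, :]
    n, m = H.shape
    return float(per_pair.sum() / (n * m) + lam / 2.0 * t_norm_sq)
```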
-
Solving in Vector-Valued RKHSs

Finite expansion of the solution
There exist (α_{ij})_{i,j=1}^{n,m} ∈ R^{n×m} and (β_j)_{j=1}^m ∈ R^m such that, for all (x, ν) ∈ X × (0, 1),

h*(x)(ν) = Σ_{i=1}^n Σ_{j=1}^m α_{ij} k_X(x, x_i) k_ν(ν, ν_j)

t*(ν) = Σ_{j=1}^m β_j k_b(ν, ν_j)

• Weak regularizer, but a representer theorem still holds
• Classical convex problem in R^{(n+1)m}: off-the-shelf solvers (L-BFGS)
-
Numerical experiments: Infinite One-Class SVM

Figure 6: Level set estimation (proportion of inliers; Train, Test and Oracle Train curves): the ν-property is approximately satisfied. Top: Wilt benchmark; bottom: Spambase dataset.
-
Perspectives
-
Perspectives

Investigate:

• Algorithmic guarantees
• A new regularization term: Σ_j ‖h(·)(θ_j)‖_{H_{k_X}}
• Hard monotonicity constraints
• Scaling up: ORFF (Brault et al., 2016)
-
References
-
Schölkopf, Bernhard et al. (2000). “New support vector algorithms.” In: Neural Computation 12.5, pp. 1207–1245.
Carmeli, Claudio, Ernesto De Vito, and Alessandro Toigo (2006). “Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem.” In: Analysis and Applications 4.4, pp. 377–408.
Kadri, Hachem et al. (2015). “Operator-valued kernels for learning from functional response data.” In: Journal of Machine Learning Research 16, pp. 1–54.
Brault, Romain, Markus Heinonen, and Florence d’Alché-Buc (2016). “Random Fourier Features for Operator-Valued Kernels.” In: Asian Conference on Machine Learning (ACML), pp. 110–125.
Sangnier, Maxime, Olivier Fercoq, and Florence d’Alché-Buc (2016). “Joint quantile regression in vector-valued RKHSs.” In: Advances in Neural Information Processing Systems (NIPS), pp. 3693–3701.