Kernel Ridge Regression - Rensselaer Polytechnic Institute


Kernel Ridge Regression

Prof. Bennett

Based on Chapter 2 of Shawe-Taylor and Cristianini

Outline

- Overview
- Ridge Regression
- Kernel Ridge Regression
- Other Kernels
- Summary

Recall E&K model

R(t) = at² + bt + c is linear in its parameters. Define the mapping θ(t) and make the function linear in θ(t), i.e. in feature space:

θ(t) = [t², t, 1]',   s = [a, b, c]'

R(t) = θ(t)·s = [t², t, 1]·[a, b, c]' = at² + bt + c
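To make this concrete, here is a minimal numpy sketch (data and coefficient values are made up) that fits R(t) = at² + bt + c by mapping t to θ(t) = [t², t, 1] and solving an ordinary linear least-squares problem in the mapped space:

```python
# Minimal sketch: the model is nonlinear in t but linear in s = [a, b, c],
# so ordinary least squares on the mapped features recovers the coefficients.
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 2.0, -1.0, 0.5                              # illustrative "true" coefficients
t = np.linspace(-3, 3, 50)
R = a * t**2 + b * t + c + 0.1 * rng.standard_normal(t.size)

Theta = np.column_stack([t**2, t, np.ones_like(t)])   # theta(t) = [t^2, t, 1]
s_hat, *_ = np.linalg.lstsq(Theta, R, rcond=None)     # linear in the parameters s
print(s_hat)                                          # approximately [2.0, -1.0, 0.5]
```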

Linear versus Nonlinear

Linear: f(t) = bt + c    Nonlinear: f(θ(t)) = at² + bt + c

Kernel Method

Two parts:
- Mapping into an embedding or feature space defined by the kernel.
- A learning algorithm for discovering linear patterns in that space.

Illustrate using linear ridge regression.

Linear Regression in Feature Space

Key Idea: Map the data to a higher dimensional space (feature space) and perform linear regression in the embedded space.

Embedding map: φ: x ∈ Rⁿ → F ⊆ R^N,  N >> n

Nonlinear Regression in Feature Space

Input space: x = [r, s]

Feature space: θ(x) = [r², s², √2 rs]

g(x) = ⟨θ(x), w⟩_F = w₁r² + w₂s² + √2 w₃ rs
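As an illustration (not from the slides), a small numpy sketch of this explicit quadratic feature map and the resulting linear function in feature space; the weight values are made up:

```python
# Minimal sketch of the explicit map theta([r, s]) = [r^2, s^2, sqrt(2)*r*s].
import numpy as np

def theta(x):
    r, s = x
    return np.array([r**2, s**2, np.sqrt(2) * r * s])

w = np.array([1.0, 2.0, 0.5])        # illustrative weights w1, w2, w3 in feature space
x = np.array([3.0, -1.0])            # input x = [r, s]
g = theta(x) @ w                     # g(x) = w1*r^2 + w2*s^2 + sqrt(2)*w3*r*s
print(g)
```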

Nonlinear Regression in Feature Space

Input space: x = [r, s]

Feature space: θ(x) = [r³, s³, r²s, rs², r², s², …]

g(x) = ⟨θ(x), w⟩_F = w₁r³ + w₂s³ + w₃r²s + …

Let’s try a quadratic on Aquasol

What are all the terms we need to add to our 525-dimensional space to map it into feature space?

Just do squared terms and cross terms.

Kernel and Duality to the rescue

- Duality: an alternative but equivalent view of the problem.
- Kernel trick: makes the mapping to feature space efficient.

Linear Regression

Given training data (points and labels):

S = ((x₁, y₁), (x₂, y₂), …, (x_l, y_l)),   x_i ∈ Rⁿ,  y_i ∈ R

Construct a linear function:

g(x) = ⟨w, x⟩ = w'x = Σ_{i=1}^n w_i x_i

Least Squares Approximation

Want: g(x) ≈ y

Define the error: ξ = f(x, y) = y − g(x)

Minimize the loss:

L(g, S) = L(w, S) = Σ_{i=1}^l ℓ(y_i, g(x_i)) = Σ_{i=1}^l (y_i − g(x_i))²

Ridge Regression

Use the least norm solution for fixed λ > 0.

Regularized problem:

min_w L_λ(w, S) = λ‖w‖² + ‖y − Xw‖²

Optimality condition:

∂L_λ(w, S)/∂w = 2λw − 2X'y + 2X'Xw = 0

(X'X + λI_n)w = X'y

Requires O(n³) operations.
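A minimal numpy sketch of this primal solve on synthetic data (the sizes and λ are arbitrary):

```python
# Primal ridge regression: solve (X'X + lambda*I_n) w = X'y, an O(n^3) operation.
import numpy as np

rng = np.random.default_rng(0)
l, n = 100, 5                                   # l data points, n features
X = rng.standard_normal((l, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(l)
lam = 0.1

w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(w)
```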

Ridge Regression (cont)

The inverse always exists for any λ > 0.

Alternative representation: w = (X'X + λI)⁻¹X'y

(X'X + λI)w = X'y  ⇒  λw = X'y − X'Xw
⇒  w = λ⁻¹X'(y − Xw) = X'α,   where α = λ⁻¹(y − Xw)

Ridge Regression (cont)

Continuing from w = X'α with α = λ⁻¹(y − Xw):

λα = y − Xw = y − XX'α
⇒  (XX' + λI)α = y
⇒  α = (G + λI)⁻¹y,   where G = XX'

Solving this l×l system of equations is O(l³).


Gram or Kernel Matrix

Gram Matrix: composed of inner products of the data.

K = G = XX',   K_{i,j} = ⟨x_i, x_j⟩

Dual Ridge Regression

To predict a new point x:

g(x) = ⟨w, x⟩ = Σ_i α_i⟨x_i, x⟩ = y'(G + λI)⁻¹z,   where z_i = ⟨x_i, x⟩

G = XX',   G_{ij} = ⟨x_i, x_j⟩

Note: we need only compute G, the Gram matrix.

Ridge regression requires only inner products between data points.
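A minimal sketch of this dual computation on made-up data: solve for α as derived above and predict a new point using only inner products:

```python
# Dual ridge regression: alpha = (G + lambda*I)^{-1} y with G = XX',
# and g(x) = sum_i alpha_i <x_i, x> for a new point x.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))                # 20 points in R^5
y = rng.standard_normal(20)
lam = 0.1

G = X @ X.T                                     # Gram matrix, G_ij = <x_i, x_j>
alpha = np.linalg.solve(G + lam * np.eye(len(y)), y)

x_new = rng.standard_normal(5)
z = X @ x_new                                   # z_i = <x_i, x_new>
print(alpha @ z)                                # prediction uses inner products only
```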

Efficiency

Computing w in primal ridge regression is O(n³); computing α in dual ridge regression is O(l³).

To predict a new point x:
- primal, O(n):  g(x) = ⟨w, x⟩ = Σ_{i=1}^n w_i x_i
- dual, O(nl):  g(x) = ⟨w, x⟩ = Σ_{i=1}^l α_i⟨x_i, x⟩ = Σ_{i=1}^l α_i (Σ_{j=1}^n x_{ij} x_j)

The dual is better if n >> l.
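As a sanity check (synthetic data), the primal and dual routes give the same model and the same prediction, since w = X'α:

```python
# Primal vs. dual ridge regression: both give the same model, w = X'alpha.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lam = 0.5

w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)    # primal: O(n^3)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(30), y)     # dual:   O(l^3)

x_new = rng.standard_normal(4)
print(np.allclose(w, X.T @ alpha))                         # True: w = X'alpha
print(np.isclose(w @ x_new, alpha @ (X @ x_new)))          # True: same prediction
```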

Notes on Ridge Regression

- "Regularization" is key to addressing stability and generalization.
- Regularization lets the method work when n >> l.
- The dual is more efficient when n >> l.
- The dual only requires inner products of the data.

Nonlinear Regression in Feature Space

In the dual representation:

g(x) = ⟨φ(x), w⟩_F = Σ_i α_i ⟨φ(x_i), φ(x)⟩

So if we can compute the inner product efficiently, our method is efficient.

Let's try it for our sample problem:

⟨φ(u), φ(v)⟩ = ⟨(u₁², u₂², √2 u₁u₂), (v₁², v₂², √2 v₁v₂)⟩
            = u₁²v₁² + u₂²v₂² + 2u₁u₂v₁v₂
            = (u₁v₁ + u₂v₂)²
            = ⟨u, v⟩²

Define: K(u, v) = ⟨u, v⟩²
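A quick numeric check of this identity (vectors chosen arbitrarily): the explicit map and the kernel K(u, v) = ⟨u, v⟩² give the same value:

```python
# Kernel trick check: <phi(u), phi(v)> equals <u, v>^2 for the quadratic map.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

u = np.array([1.5, -2.0])
v = np.array([0.3, 4.0])

print(phi(u) @ phi(v))      # inner product computed in feature space
print((u @ v) ** 2)         # kernel evaluated in input space: same number
```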

Kernel Function

A kernel is a function K such that

K(x, u) = ⟨φ(x), φ(u)⟩_F

where φ is a mapping from the input space to the feature space F.

There are many possible kernels. The simplest is the linear kernel:

K(x, u) = ⟨x, u⟩


Ridge Regression in Feature Space

To predict a new point x:

g(φ(x)) = ⟨w, φ(x)⟩ = Σ_i α_i⟨φ(x_i), φ(x)⟩ = y'(G + λI)⁻¹z,   where z_i = ⟨φ(x_i), φ(x)⟩

To compute the Gram matrix:

G = φ(X)φ(X)',   G_{ij} = ⟨φ(x_i), φ(x_j)⟩ = K(x_i, x_j)

Use the kernel to compute the inner product.
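Putting the pieces together, a minimal kernel ridge regression sketch on synthetic 1-D data, using the degree-2 polynomial kernel K(u, v) = (⟨u, v⟩ + 1)²; the data, kernel choice, and λ are illustrative:

```python
# Kernel ridge regression: alpha = (G + lambda*I)^{-1} y with G_ij = K(x_i, x_j),
# prediction g(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def K(A, B):
    return (A @ B.T + 1.0) ** 2                  # pairwise degree-2 polynomial kernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 1))             # 40 one-dimensional inputs
y = X[:, 0] ** 2 - X[:, 0] + 0.05 * rng.standard_normal(40)
lam = 0.1

G = K(X, X)                                      # Gram matrix in feature space
alpha = np.linalg.solve(G + lam * np.eye(len(y)), y)

X_test = np.array([[0.5], [-1.0]])
print(K(X_test, X) @ alpha)                      # predictions for the new points
```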

Nonlinear Regression in Feature Space

The kernel trick works for any dual representation:

g(x) = ⟨φ(x), w⟩_F = Σ_i α_i ⟨φ(x_i), φ(x)⟩ = Σ_i α_i K(x_i, x)

Popular Kernels based on vectors

By Hilbert-Schmidt kernels (Courant and Hilbert 1953):

⟨θ(u), θ(v)⟩ ≡ K(u, v) for certain θ and K, e.g.

- Degree-d polynomial:  K(u, v) = (⟨u, v⟩ + 1)^d
- Radial Basis Function machine:  K(u, v) = exp(−‖u − v‖² / (2σ²))
- Two-layer neural network:  K(u, v) = sigmoid(η⟨u, v⟩ + c)
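Minimal sketches of these kernels as functions; the parameter values are placeholders and tanh is used as the sigmoid:

```python
# Popular kernels on vectors; d, sigma, eta, and c are illustrative parameters.
import numpy as np

def poly_kernel(u, v, d=3):
    return (u @ v + 1.0) ** d                                    # degree-d polynomial

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma**2))  # radial basis function

def sigmoid_kernel(u, v, eta=0.1, c=0.0):
    return np.tanh(eta * (u @ v) + c)                            # two-layer neural network kernel

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```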

Kernels Intuition

Kernels encode the notion of similarity to be used for a specific application.
- Documents use the cosine of "bags of words".
- Gene sequences can use edit distance.

Similarity defines distance:

‖u − v‖² = (u − v)'(u − v) = ⟨u, u⟩ − 2⟨u, v⟩ + ⟨v, v⟩

The trick is to get the right encoding for the domain.
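A small sketch of this identity: the squared distance between two points in feature space computed from kernel evaluations alone, here with the quadratic kernel from the earlier example:

```python
# Kernel-induced distance: ||phi(u) - phi(v)||^2 = K(u,u) - 2*K(u,v) + K(v,v).
import numpy as np

def K(u, v):
    return (u @ v) ** 2            # quadratic kernel used earlier

u = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])

print(K(u, u) - 2 * K(u, v) + K(v, v))   # squared feature-space distance
```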

Important Points

- Kernel method = linear method + embedding in feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.

Kernel Ridge Regression

- Simple to derive the kernel method.
- Works great in practice with some finessing.
- Next time: practical issues; a more standard dual derivation.

Optimal Solution

Want: y ≈ Xw + be, where e is a vector of ones.

Mathematical model:

min_{w,b} L(w, b, S) = ‖y − (Xw + be)‖² + λ‖w‖²

Optimality conditions:

∂L(w, b, S)/∂w = −2X'(y − Xw − be) + 2λw = 0

∂L(w, b, S)/∂b = −2e'(y − Xw − be) = 0

Let's try it: in-class lab

Basic Ridge Regression

min_w L_λ(w, S) = λ‖w‖² + ‖y − Xw‖²,   λ > 0

Optimality condition: w = (X'X + λI_n)⁻¹X'y

Dual equivalent: α = (G + λI)⁻¹y,   where G_{i,j} = ⟨x_i, x_j⟩,   w = X'α

Better model:

Basic Ridge Regression

Center X and y to get X_c, y_c.

min_{w,b} L_λ(w, b, S) = λ‖w‖² + ‖y − (Xw + be)‖²,   λ > 0

w = (X_c'X_c + λI_n)⁻¹X_c'y_c

Dual equivalent: α = (G_c + λI)⁻¹y_c,   where (G_c)_{i,j} = ⟨x_{c,i}, x_{c,j}⟩,   w = X_c'α

Recenter Data

Shift y by its mean:

μ = (1/l) Σ_{i=1}^l y_i,   y_{c,i} := y_i − μ

Shift x by its mean:

x̄ = (1/l) Σ_{i=1}^l x_i,   x_{c,i} := x_i − x̄

Ridge Regression with bias

Center the data by x̄ and μ:  X_c = X − e x̄',   y_c = y − μe

Calculate w:  w = (X_c'X_c + λI)⁻¹X_c'y_c

Calculate b:  b = μ

To predict a new point x:  g(x) = (x − x̄)'w + b
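A minimal sketch of this centering recipe on made-up data: center X and y, solve ridge on the centered data, set b to the mean of y, and predict with g(x) = (x − x̄)'w + b:

```python
# Ridge regression with bias via centering.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.standard_normal(50)
lam = 0.1

mu_x, mu_y = X.mean(axis=0), y.mean()          # mu_x is x-bar, mu_y is mu above
Xc, yc = X - mu_x, y - mu_y                    # recenter the data

w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
b = mu_y

x_new = rng.standard_normal(3)
print((x_new - mu_x) @ w + b)                  # prediction with bias
```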