
Kernel Smoothing Methods

Hanchen Wang

Ph.D. Candidate in Information Engineering, University of Cambridge

September 29, 2019


Overview

1 6.0 what is kernel smoothing?

2 6.1 one-dimensional kernel smoothers

3 6.2 selecting the width λ of the kernel

4 6.3 local regression in $\mathbb{R}^p$

5 6.4 structured local regression models in $\mathbb{R}^p$

6 6.5 local likelihood and other models

7 6.6 kernel density estimation and classification

8 6.7 radial basis functions and kernels

9 6.8 mixture models for density estimation and classification

10 6.9 computational considerations

11 Q & A: relationship between kernel smoothing methods and kernel methods

12 one more thing: solution manuals for these textbooks


6.0 what is a kernel smoothing method?

a class of regression techniques that achieve flexibility in estimating the function $f(X)$ over the domain $\mathbb{R}^p$ by fitting a different but simple model separately at each query point $x_0$

the resulting estimated function $\hat f(X)$ is smooth in $\mathbb{R}^p$

fitting is done at evaluation time; these memory-based methods require in principle little or no training, similar to kNN lazy learning

they require a hyperparameter setting, such as the metric window size $\lambda$

here kernels are mostly used as a device for localization, rather than as the high-dimensional (implicit) feature extractor of kernel methods


6.1 one-dimensional kernel smoothers, overview

$$Y = \sin(4X) + \varepsilon, \qquad X \sim U[0, 1], \quad \varepsilon \sim N(0, 1/3)$$

(figure: red point → $\hat f(x_0)$; red circles → observations contributing to the fit at $x_0$; solid yellow region → the weights assigned to those observations)


6.1 one-dimensional kernel smoothers, overview

k-nearest-neighbor average: discontinuous; equal weight to all points in the neighborhood

$$\hat f(x) = \operatorname{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$

Nadaraya–Watson kernel-weighted average:

$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

Epanechnikov quadratic kernel:

$$D(t) = \begin{cases} \tfrac{3}{4}\left(1 - t^2\right) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

more generally, with an adaptive neighborhood:

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

tri-cube kernel:

$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

compact support or not? differentiable at the boundary?
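Below is a minimal sketch of the Nadaraya–Watson smoother with the Epanechnikov kernel on the simulated $Y = \sin(4X) + \varepsilon$ data from the previous slide (the random seed, sample size, and bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(4 * x) + rng.normal(0, np.sqrt(1 / 3), N)   # treating 1/3 as the noise variance

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, and 0 otherwise."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def nadaraya_watson(x0, lam=0.2):
    """Kernel-weighted average of the y_i around x0 with metric window size lam."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0, 1, 11)
print(np.round([nadaraya_watson(x0) for x0 in grid], 3))
```

Swapping `epanechnikov` for a tri-cube or Gaussian $D$ changes only the local weighting, not the structure of the estimator.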


6.1 one-dimensional kernel smoothers, local linear

boundary issues arise → fit a locally weighted linear regression

$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0, \qquad \min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2$$


6.1 one-dimensional kernel smoothers, local linear

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2$$

in matrix form, with

$$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}, \quad \mathbf{W}(x_0) = \operatorname{diag}\big(K_\lambda(x_0, x_1), \ldots, K_\lambda(x_0, x_N)\big),$$

the criterion becomes

$$\min_{\alpha(x_0),\,\beta(x_0)} \left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)^{T}\mathbf{W}(x_0)\left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)$$

$$\Rightarrow\quad \hat f(x_0) = (1,\ x_0)\left(\mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B}\right)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^N l_i(x_0)\, y_i$$

$\hat f(x_0)$ is linear w.r.t. the $y_i$

$l_i(x_0)$ combines the kernel $K_\lambda(x_0, x_i)$ with the least-squares operations: the equivalent kernel
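A short sketch of the closed form above, computing the equivalent-kernel weights $l_i(x_0)$ and the fit $\hat f(x_0)$ at a single query point (data generation, bandwidth, and the query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(4 * x) + rng.normal(0, np.sqrt(1 / 3), N)

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def local_linear(x0, lam=0.2):
    """f_hat(x0) = (1, x0) (B^T W B)^{-1} B^T W y with W = diag(K_lambda(x0, x_i))."""
    B = np.column_stack([np.ones_like(x), x])          # N x 2 regression matrix
    W = np.diag(epanechnikov(np.abs(x - x0) / lam))    # kernel weights on the diagonal
    l = np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)  # equivalent kernel
    return l @ y, l

fhat, l = local_linear(0.05)                            # a query point near the boundary
print(round(fhat, 3), round(l.sum(), 6))                # the weights l_i(x0) sum to 1
```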


6.1 one-dimensional kernel smoothers, local linear

why ’this bias is removed to first order’:

$$E\big(\hat f(x_0)\big) = \sum_{i=1}^N l_i(x_0)\, f(x_i) = f(x_0)\sum_{i=1}^N l_i(x_0) + f'(x_0)\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + R$$

where the remainder $R$ collects third- and higher-order terms

it can be proved that $\sum_{i=1}^N l_i(x_0) = 1$ and $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$, so the bias depends only on the quadratic term and $R$

there is still room for improvement: quadratic fits outperform linear fits in regions of curvature
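A quick numerical check of these two identities using the equivalent-kernel weights of a local linear fit (same illustrative data assumptions as the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

x0, lam = 0.1, 0.3
B = np.column_stack([np.ones_like(x), x])
W = np.diag(epanechnikov(np.abs(x - x0) / lam))
l = np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)   # l_i(x0)

print(np.isclose(l.sum(), 1.0))                # sum_i l_i(x0) = 1
print(np.isclose(((x - x0) * l).sum(), 0.0))   # sum_i (x_i - x0) l_i(x0) = 0
```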


6.1 one-dimensional kernel smoothers, local polynomial

we can fit local polynomials of any degree $d$: $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j$

$$\min_{\alpha(x_0),\,\beta_j(x_0),\, j=1,\ldots,d} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j\right]^2$$

the bias of a degree-$d$ fit provably has components only of degree $d+1$ and higher

no free lunch → increased variance:

$$y_i = f(x_i) + \varepsilon_i,\quad \varepsilon_i \sim N(0, \sigma^2) \ \Rightarrow\ \operatorname{Var}\big(\hat f(x_0)\big) = \sigma^2 \|l(x_0)\|^2, \qquad d \uparrow\ \Rightarrow\ \|l(x_0)\|^2 \uparrow$$

- local linear fits: reduce the bias at the boundaries at a modest cost in variance
- local quadratic fits: do little for the bias at the boundaries and increase the variance a lot, but are most helpful in reducing bias due to curvature in the interior of the domain
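A small sketch of the variance claim: $\|l(x_0)\|^2$ (the variance multiplier) typically grows with the polynomial degree $d$, most visibly near the boundary ($d = 0$ is the Nadaraya–Watson average; data, bandwidth, and query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def l_norm_sq(x0, d, lam=0.3):
    """||l(x0)||^2 for a local polynomial fit of degree d at x0."""
    B = np.column_stack([x ** j for j in range(d + 1)])   # columns 1, x, ..., x^d
    W = np.diag(epanechnikov(np.abs(x - x0) / lam))
    b0 = np.array([x0 ** j for j in range(d + 1)])        # (1, x0, ..., x0^d)
    l = b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    return float(np.sum(l ** 2))

for d in (0, 1, 2):
    print(d, round(l_norm_sq(0.0, d), 4))   # variance multiplier at the left boundary
```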


6.2 and 6.3

selecting the width λ of the kernel: a natural bias–variance tradeoff

- narrower window: larger variance, smaller bias
- wider window: smaller variance, larger bias

the same intuition holds for local regression (linear/polynomial) estimates; in practice λ is often chosen by cross-validation, as in the sketch below
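A minimal leave-one-out cross-validation sketch for choosing λ for the Nadaraya–Watson smoother (data and the candidate bandwidth grid are illustrative assumptions; see 6.9 for the $O(N^2)$ cost of this step):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(4 * x) + rng.normal(0, np.sqrt(1 / 3), N)

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def loo_error(lam):
    """Leave-one-out squared error of the Nadaraya-Watson fit with width lam."""
    err = 0.0
    for i in range(N):
        w = epanechnikov(np.abs(x - x[i]) / lam)
        w[i] = 0.0                                   # leave observation i out
        if w.sum() > 0:
            err += (y[i] - np.sum(w * y) / w.sum()) ** 2
    return err / N

candidates = [0.05, 0.1, 0.2, 0.4]
print(min(candidates, key=loo_error))                # width with the smallest LOO error
```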

local regression in $\mathbb{R}^p$

$p$-dimensional ≠ $p$ × 1-dimensional → there are interaction terms between dimensions; consider $p = 3$, so each point is a $3 \times 1$ vector $x := (x_{(1)}, x_{(2)}, x_{(3)})^T$

then the general form of local kernel regression with a degree-$d$ polynomial is:

$$\min_{\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left(y_i - B^{(d)}(x_i)^T \beta(x_0)\right)^2$$

where $\beta(x_0)$ stacks one coefficient per element of the basis $B^{(d)}$, and

$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right)$$

$$B^{(0)}(x)^T = (1), \qquad B^{(1)}(x)^T = \left(1, x_{(1)}, x_{(2)}, x_{(3)}\right)$$

$$B^{(2)}(x)^T = \left(1, x_{(1)}, x_{(2)}, x_{(3)}, x_{(1)}^2, x_{(2)}^2, x_{(3)}^2, x_{(1)}x_{(2)}, x_{(1)}x_{(3)}, x_{(2)}x_{(3)}\right)$$

boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger; in fact, one manifestation of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows
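A sketch of the quadratic basis $B^{(2)}(x)$ for $p = 3$ and the corresponding kernel-weighted least-squares fit at one query point (the data-generating function, bandwidth, and query point are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p = 500, 3
X = rng.uniform(0, 1, (N, p))
y = np.sin(4 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, N)

def basis2(x):
    """B^(2)(x)^T = (1, x1, x2, x3, x1^2, x2^2, x3^2, x1 x2, x1 x3, x2 x3)."""
    cross = [x[i] * x[j] for i, j in combinations(range(p), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def local_quadratic(x0, lam=0.5):
    """Degree-2 local fit at x0 with the radial kernel D(||x - x0|| / lam)."""
    B = np.array([basis2(xi) for xi in X])                # N x 10 basis matrix
    w = epanechnikov(np.linalg.norm(X - x0, axis=1) / lam)
    BtW = B.T * w                                         # B^T W without forming W
    beta = np.linalg.solve(BtW @ B, BtW @ y)
    return basis2(x0) @ beta

print(round(local_quadratic(np.array([0.5, 0.5, 0.5])), 3))
```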


6.4 structured local regression models in Rp

structured kernels: modify the kernel, e.g. standardize each dimension to unit standard deviation

more generally, use a positive semidefinite matrix $\mathbf{A}$ → a Mahalanobis-type metric¹

$$K_{\lambda,A}(x_0, x) = D\!\left(\frac{(x - x_0)^T \mathbf{A}\,(x - x_0)}{\lambda}\right)$$

structured regression functions

analysis-of-variance (ANOVA) decompositions:

$$E(Y \mid X) = f(X_1, X_2, \ldots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k<\ell} g_{k\ell}(X_k, X_\ell) + \cdots$$

varying coefficient model: the coefficients vary with a conditioning variable $Z$

$$f(X) = \alpha(Z) + \beta_1(Z) X_1 + \cdots + \beta_q(Z) X_q$$

$$\min_{\alpha(z_0),\,\beta(z_0)} \sum_{i=1}^N K_\lambda(z_0, z_i)\left(y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0)\right)^2$$

¹ http://contrib.scikit-learn.org/metric-learn/
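A small sketch of the structured kernel $K_{\lambda,A}$ with a diagonal $\mathbf{A}$ that standardizes each dimension by its inverse variance (one simple choice of structure; the data and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, [1.0, 5.0, 0.2], (200, 3))       # very different per-dimension scales

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

A = np.diag(1.0 / X.var(axis=0))                     # diagonal A: unit-variance scaling

def structured_weight(x0, xi, lam=1.0):
    """K_{lambda,A}(x0, x) = D((x - x0)^T A (x - x0) / lambda)."""
    d = xi - x0
    return epanechnikov(d @ A @ d / lam)

x0 = X.mean(axis=0)
w = np.array([structured_weight(x0, xi) for xi in X])
print(round(float(w.mean()), 3), int((w > 0).sum()))  # average weight, points in window
```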


6.5 local likelihood and other models

the concept of local regression and varying coefficient models is extremely broad

local likelihood inference: the parameter associated with $x_0$ is $\theta(x_0) = x_0^T\beta(x_0)$, fitted by a kernel-weighted log-likelihood

$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\!\left(y_i, x_i^T\beta(x_0)\right)$$

more generally, for a parameter $\theta(z_0)$ made local in a variable $z$:

$$l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0, z_i)\, l\!\left(y_i, \eta(x_i, \theta(z_0))\right), \qquad \eta(x, \theta) = x^T\theta$$

autoregressive time series model: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$

with lag set $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-k})$ this is $y_t = z_t^T\beta + \varepsilon_t$, fitted locally with a kernel $K_\lambda(z_0, z_t)$

recall the multiclass linear logistic regression model from ch.4:

$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}$$

its local log-likelihood is

$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\left[\beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T(x_i - x_0) - \log\!\left(1 + \sum_{k=1}^{J-1} \exp\!\big(\beta_{k0}(x_0) + \beta_k(x_0)^T(x_i - x_0)\big)\right)\right]$$
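A minimal sketch of local likelihood for the binary case ($J = 2$, a single logit), maximizing the kernel-weighted binomial log-likelihood at one query point with a generic optimizer (the data-generating model, bandwidth, and the use of SciPy's BFGS are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0, 1, N)
p_true = 1.0 / (1.0 + np.exp(-(4 * x - 2)))          # true P(G = 1 | X = x)
g = rng.binomial(1, p_true)                          # observed class labels in {0, 1}

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def local_logistic(x0, lam=0.3):
    """Maximize sum_i K_lambda(x0, x_i) * log-lik of a locally linear logit around x0."""
    w = epanechnikov(np.abs(x - x0) / lam)
    def neg_loglik(beta):
        eta = beta[0] + beta[1] * (x - x0)           # beta_0(x0) + beta(x0)(x_i - x0)
        return -np.sum(w * (g * eta - np.log1p(np.exp(eta))))
    beta_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
    return 1.0 / (1.0 + np.exp(-beta_hat[0]))        # P_hat(G = 1 | X = x0)

print(round(local_logistic(0.5), 3))
```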


6.6 kernel density estimation and classification

Kernel Density Estimation

suppose we have a random sample $x_1, \ldots, x_N$ drawn from a probability density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$; a natural local estimate, and its smooth counterpart, are

$$\hat f_X(x_0) = \frac{\#\{x_i \in N(x_0)\}}{N\lambda} \qquad\to\qquad \hat f_X(x_0) = \frac{1}{N\lambda}\sum_{i=1}^N K_\lambda(x_0, x_i)$$

with the Gaussian kernel in $\mathbb{R}^p$ this becomes

$$\hat f_X(x_0) = \frac{1}{N\left(2\lambda^2\pi\right)^{p/2}}\sum_{i=1}^N e^{-\frac{1}{2}\left(\|x_i - x_0\|/\lambda\right)^2}$$

Kernel Density Classification

nonparametric density estimates for classification using Bayes' theorem: for a $J$-class problem, fit nonparametric density estimates $\hat f_j(X)$, $j = 1, \ldots, J$, along with estimates of the class priors $\hat\pi_j$ (usually the sample proportions), then

$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j \hat f_j(x_0)}{\sum_{k=1}^J \hat\pi_k \hat f_k(x_0)}$$

The Naive Bayes Classifier

especially appropriate when the dimension $p$ is high, making density estimation unattractive; the naive Bayes model assumes that given a class $G = j$, the features $X_k$ are independent:

$$f_j(X) = \prod_{k=1}^p f_{jk}(X_k)$$
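A compact sketch of the Gaussian kernel density estimate in $\mathbb{R}^p$ and the resulting kernel density classifier with sample-proportion priors (the two-class data and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 2, 0.5
X0 = rng.normal(0.0, 1.0, (100, p))                  # sample from class 0
X1 = rng.normal(2.0, 1.0, (60, p))                   # sample from class 1

def kde(x0, X):
    """f_hat(x0) = sum_i exp(-0.5 (||x_i - x0|| / lam)^2) / (N (2 pi lam^2)^(p/2))."""
    sq = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-0.5 * sq / lam ** 2).sum() / (len(X) * (2 * np.pi * lam ** 2) ** (p / 2))

def posterior_class1(x0):
    """Pr_hat(G = 1 | X = x0) by Bayes' theorem with sample-proportion priors."""
    pi0, pi1 = len(X0) / (len(X0) + len(X1)), len(X1) / (len(X0) + len(X1))
    f0, f1 = kde(x0, X0), kde(x0, X1)
    return pi1 * f1 / (pi0 * f0 + pi1 * f1)

print(round(posterior_class1(np.array([1.0, 1.0])), 3))
```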


6.7 radial basis functions and kernels

OMITTED


6.8 and 6.9

Mixture Models for Density Estimation and Classification

Gaussian mixture model (GMM), more in ch.8:

$$f(x) = \sum_{m=1}^M \alpha_m\, \phi(x; \mu_m, \Sigma_m), \qquad \sum_m \alpha_m = 1$$

responsibilities (soft assignments):

$$\hat r_{im} = \frac{\hat\alpha_m\, \phi\big(x_i; \hat\mu_m, \hat\Sigma_m\big)}{\sum_{k=1}^M \hat\alpha_k\, \phi\big(x_i; \hat\mu_k, \hat\Sigma_k\big)}$$
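A short sketch of the responsibility formula $\hat r_{im}$ for a two-component mixture; the component parameters below are hypothetical stand-ins for values that would normally be estimated by EM (ch.8):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.5, 50)])[:, None]

# Hypothetical "fitted" parameters; in practice these come from the EM algorithm.
alphas = [0.75, 0.25]
mus = [np.array([-2.0]), np.array([3.0])]
covs = [np.array([[1.0]]), np.array([[0.25]])]

# dens[i, m] = alpha_m * phi(x_i; mu_m, Sigma_m)
dens = np.column_stack([
    a * multivariate_normal.pdf(x, mean=m, cov=c) for a, m, c in zip(alphas, mus, covs)
])
resp = dens / dens.sum(axis=1, keepdims=True)        # r_hat_im, each row sums to 1
print(np.round(resp[:3], 3))                         # soft assignments of first 3 points
```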

Computational Considerations

both local regression and density estimation are memory-based methods:

- the model is the entire training data set, and fitting is done at evaluation or prediction time
- the computational cost to fit at a single query point $x_0$ is $O(N)$ flops
- for comparison, an expansion in $M$ basis functions costs $O(M)$ for one evaluation, and typically $M \sim O(\log N)$
- basis function methods have an initial (training) cost of at least $O(NM^2 + M^3)$
- the smoothing parameter(s) $\lambda$ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of $O(N^2)$ flops
- popular implementations of local regression (such as the loess function in R) include optimization techniques that bring the cost down to roughly $O(NM)$


Q & A: relationship between kernel smoothing methods and kernel methods (often confused due to an abuse of terminology)

Kernel Methods

- arise from the dual representation
- the kernel is an inner product of (usually higher-dimensional) feature vectors: $k(x, x') = \phi(x)^T \phi(x')$
- the advantage of such representations is that "we can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x)"²
- a more general idea covering concepts such as kernel regression/classification, kernel smoothing regression, Gaussian processes, etc.

Kernel Smoothing Methods

- these specify how to derive smoother and less biased fitted curves by fitting simple models locally with kernel weights

the similarity between the two concepts is that they share many basic kernel function forms, such as the Gaussian kernel or radial basis functions

² Ch. 6.1 Kernel Methods, Pattern Recognition and Machine Learning, C. M. Bishop

one more thing: solution manuals for these textbooks

for Pattern Recognition and Machine Learning:

https://github.com/zhengqigao/PRML-Solution-Manual

for Elements of Statistical Learning:

https://github.com/hansen7/ESL_Solution_Manual

solution manuals for many other statistics/probability textbooks:

https://waxworksmath.com/index.aspx


The End
