
Kernel Smoothing Methods

Hanchen Wang

Ph.D. Candidate in Information Engineering, University of Cambridge

September 29, 2019


Overview

1 6.0 what is kernel smoothing?

2 6.1 one-dimensional kernel smoothers

3 6.2 selecting the width λ of the kernel

4 6.3 local regression in R^p

5 6.4 structured local regression models in R^p

6 6.5 local likelihood and other models

7 6.6 kernel density estimation and classification

8 6.7 radial basis functions and kernels

9 6.8 mixture models for density estimation and classification

10 6.9 computational considerations

11 Q & A: relationship between kernel smoothing methods and kernel methods

12 one more thing: solution manuals for these textbooks


6.0 what is a kernel smoothing method?

a class of regression techniques that achieve flexibility in estimating the function f(X) over the domain R^p by fitting a different but simple model separately at each query point x0

the resulting estimated function f̂(X) is smooth in R^p

fitting is done at evaluation time; memory-based methods require, in principle, little or no training, similar to kNN lazy learning

they require setting hyperparameters such as the metric window size λ

here kernels are mostly used as a device for localization, rather than as the high-dimensional (implicit) feature extractor of kernel methods


6.1 one-dimensional kernel smoothers, overview

Y = sin(4X) + ε, X ∼ U[0, 1], ε ∼ N(0, 1/3)

in the figure: red point → f̂(x0); red circles → observations contributing to the fit at x0; solid yellow region → the weights assigned to those observations


6.1 one-dimensional kernel smoothers, overview

k-nearest-neighbor average: discontinuous, gives equal weight to all points in the neighborhood

$$\hat f(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$

Nadaraya–Watson kernel-weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

Epanechnikov quadratic kernel:

$$D(t) = \begin{cases} \tfrac{3}{4}\,(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

more generally, with an adaptive neighborhood width h_λ(x0):

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

tri-cube kernel:

$$D(t) = \begin{cases} \big(1 - |t|^3\big)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

compact support or not? differentiable at the boundary?
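a minimal NumPy sketch of the Nadaraya–Watson average with the Epanechnikov kernel on the toy data from the previous slide; the window width λ = 0.2 and the sample size of 100 are illustrative choices, not prescribed values

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov quadratic kernel D(t) from the slide above."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1.0 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam=0.2, kernel=epanechnikov):
    """Kernel-weighted average: f_hat(x0) = sum_i K(x0, xi) yi / sum_i K(x0, xi)."""
    w = kernel(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w) if np.sum(w) > 0 else np.nan

# toy data from the earlier slide: Y = sin(4X) + eps, X ~ U[0, 1], eps ~ N(0, 1/3)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), 100)   # std = sqrt(1/3)

grid = np.linspace(0.0, 1.0, 51)
fit = np.array([nadaraya_watson(x0, x, y) for x0 in grid])
```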


6.1 one-dimensional kernel smoothers, local linear

boundary issues arise → fit a locally weighted linear regression

$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0, \qquad \big(\hat\alpha(x_0), \hat\beta(x_0)\big) = \arg\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \big]^2$$


6.1 one-dimensional kernel smoothers, local linear

$$\min_{\alpha(x_0),\,\beta(x_0)} \ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \big]^2$$

in matrix form, with $\mathbf{y} = (y_1, \dots, y_N)^T$, $\mathbf{B}$ the $N \times 2$ regression matrix whose $i$-th row is $(1, x_i)$, and $\mathbf{W}(x_0) = \mathrm{diag}\big(K_\lambda(x_0, x_1), \dots, K_\lambda(x_0, x_N)\big)$:

$$\min_{\alpha(x_0),\,\beta(x_0)} \ \left( \mathbf{y} - \mathbf{B} \begin{pmatrix} \alpha(x_0) \\ \beta(x_0) \end{pmatrix} \right)^{\!T} \mathbf{W}(x_0) \left( \mathbf{y} - \mathbf{B} \begin{pmatrix} \alpha(x_0) \\ \beta(x_0) \end{pmatrix} \right)$$

$$\rightarrow \quad \hat f(x_0) = (1,\, x_0)\, \big( \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B} \big)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

f̂(x0) is linear in the y_i

the weights l_i(x0) combine the kernel K_λ(x0, x_i) with the least-squares operations, and are referred to as the equivalent kernel
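a short sketch of the closed-form local linear fit above, written to expose the equivalent-kernel weights l_i(x0); it assumes the epanechnikov() helper and the (x, y) toy data from the earlier sketch

```python
import numpy as np

def equivalent_kernel(x0, x, lam=0.2):
    """Weights l_i(x0) of the local linear fit, so that f_hat(x0) = sum_i l_i(x0) * y_i."""
    w = epanechnikov(np.abs(x - x0) / lam)       # K_lambda(x0, xi)
    B = np.column_stack([np.ones_like(x), x])    # N x 2 matrix with rows (1, xi)
    W = np.diag(w)                               # W(x0) = diag(K_lambda(x0, xi))
    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)

def local_linear(x0, x, y, lam=0.2):
    """f_hat(x0) = (1, x0) (B^T W B)^{-1} B^T W y, i.e. a dot product of l(x0) with y."""
    return equivalent_kernel(x0, x, lam) @ y
```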


6.1 one-dimensional kernel smoothers, local linear

why ’this bias is removed to first order’:

$$\mathrm{E}\big[\hat f(x_0)\big] = \sum_{i=1}^{N} l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^{N} l_i(x_0) + f'(x_0) \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R,$$

where the remainder $R$ involves third- and higher-order terms

it can be shown that $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$, so the first-order bias term vanishes (see the numerical check below)

there is still room for improvement: local quadratic fits outperform local linear fits in regions of curvature
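a quick numerical check of these two identities, reusing equivalent_kernel() and the toy data (x, y) from the sketches above; x0 is an arbitrary point near the boundary, where the correction matters most

```python
x0 = 0.02
l = equivalent_kernel(x0, x, lam=0.2)
print(np.sum(l))               # approximately 1
print(np.sum((x - x0) * l))    # approximately 0 -> first-order bias term vanishes
```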


6.1 one-dimensional kernel smoothers, local polynomial

we can fit local polynomials of any degree d:

$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^{\,j}$$

$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \ \sum_{i=1}^{N} K_\lambda(x_0, x_i) \Big[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{\,j} \Big]^2$$

the bias of a degree-d fit provably has components only of degree d + 1 and higher

no free lunch → increased variance

with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, we have $\mathrm{Var}\big(\hat f(x_0)\big) = \sigma^2\, \|l(x_0)\|^2$, and $\|l(x_0)\|^2$ increases with $d$

- Local linear fits: decrease the bias at the boundaries at a modest cost in variance.
- Local quadratic fits: do little for the bias at the boundaries and increase the variance a lot, but are most helpful in reducing bias due to curvature in the interior of the domain.
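a small sketch illustrating the variance formula above: the equivalent-kernel weights for a degree-d local polynomial fit, and how ||l(x0)||² grows with d; it reuses epanechnikov() and the toy data (x, y) from the earlier sketches

```python
import numpy as np

def local_poly_weights(x0, x, lam=0.2, d=2):
    """Equivalent-kernel weights l_i(x0) for a degree-d local polynomial fit,
    so that f_hat(x0) = l(x0) @ y."""
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.vander(x, N=d + 1, increasing=True)   # rows (1, xi, xi^2, ..., xi^d)
    W = np.diag(w)
    b0 = x0 ** np.arange(d + 1)                  # (1, x0, ..., x0^d)
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

# Var(f_hat(x0)) = sigma^2 * ||l(x0)||^2; the norm grows with the degree d
for d in (1, 2, 3):
    l = local_poly_weights(0.5, x, lam=0.2, d=d)
    print(d, np.sum(l ** 2))
```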


6.2 and 6.3

selecting the width λ of the kernel

a natural bias–variance tradeoff:

narrower window → larger variance, smaller bias; wider window → smaller variance, larger bias

the same intuition applies to local regression (linear/polynomial) estimates

local regression in R^p

a p-dimensional fit ≠ p separate one-dimensional fits → interaction terms between dimensions appear; consider p = 3, so each point is a 3 × 1 vector x := (x_(1), x_(2), x_(3))^T

the general form of local kernel regression with a degree-d polynomial is then:

$$\min_{\beta_j^{(k)}(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left( y_i - B^{(d)}(x_i)^T \big(\beta_1^{(0)}(x_0),\, \beta_1^{(1)}(x_0),\, \beta_2^{(1)}(x_0),\, \dots,\, \beta_{d(d+1)/2}^{(d)}(x_0)\big)^T \right)^{\!2}$$

where

$$K_\lambda(x_0, x_i) = D\!\left(\frac{\|x_i - x_0\|}{\lambda}\right)$$

$$B^{(0)}(x)^T = (1), \qquad B^{(1)}(x)^T = \big(1,\, x_{(1)},\, x_{(2)},\, x_{(3)}\big)$$

$$B^{(2)}(x)^T = \big(1,\, x_{(1)},\, x_{(2)},\, x_{(3)},\, x_{(1)}^2,\, x_{(2)}^2,\, x_{(3)}^2,\, x_{(1)}x_{(2)},\, x_{(1)}x_{(3)},\, x_{(2)}x_{(3)}\big)$$

Boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger. In fact, one manifestation of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows.
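a hypothetical sketch of local quadratic regression in R^p (here p = 3) with a spherical Epanechnikov kernel, building the basis B^(2)(x) listed above; quad_basis and local_quad_fit are illustrative names and lam = 1.0 is an arbitrary default

```python
import numpy as np
from itertools import combinations

def quad_basis(x):
    """B^(2)(x): intercept, linear terms, squares, and pairwise cross terms
    (10 components for p = 3)."""
    x = np.asarray(x, dtype=float)
    cross = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

def local_quad_fit(x0, X, y, lam=1.0):
    """Local quadratic regression at x0 in R^p with K(x0, x) = D(||x - x0|| / lam)."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(t <= 1, 0.75 * (1.0 - t ** 2), 0.0)    # Epanechnikov weights
    B = np.array([quad_basis(xi) for xi in X])           # N x 10 basis matrix for p = 3
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return quad_basis(x0) @ beta
```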


6.4 structured local regression models in R^p

structured kernels

modify the kernel → standardize each dimension to unit standard deviation

more generally, use a positive semidefinite matrix A → a Mahalanobis-type metric¹

$$K_{\lambda, A}(x_0, x) = D\!\left( \frac{(x - x_0)^T A\, (x - x_0)}{\lambda} \right)$$

structured regression functions

analysis-of-variance (ANOVA) decompositions

$$\mathrm{E}(Y \mid X) = f(X_1, X_2, \dots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots$$

varying coefficient models: the coefficients of a linear model in X1, …, Xq are allowed to vary with the remaining (conditioning) variables Z

f (X ) = α(Z) + β1(Z)X1 + · · ·+ βq(Z)Xq

$$\min_{\alpha(z_0),\, \beta(z_0)} \ \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, \big( y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0) \big)^2$$

¹ http://contrib.scikit-learn.org/metric-learn/
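a sketch of the structured kernel above, with an Epanechnikov profile D and a user-supplied positive semidefinite matrix A (e.g. the inverse sample covariance, to standardize and decorrelate the coordinates); the values below are illustrative

```python
import numpy as np

def structured_kernel(x0, x, A, lam=1.0):
    """K_{lambda,A}(x0, x) = D((x - x0)^T A (x - x0) / lambda),
    with D the Epanechnikov kernel and A positive semidefinite."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    t = d @ A @ d / lam
    return 0.75 * (1.0 - t ** 2) if abs(t) <= 1 else 0.0

# example: differences along the second coordinate count for 10x less
A = np.diag([1.0, 0.1])
print(structured_kernel(np.zeros(2), np.array([0.5, 0.5]), A))
```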


6.5 local likelihood and other models

the concept of local regression and varying coefficient models is extremely broad

local likelihood inference: make the parameter of a parametric model local in the covariates, e.g. $\theta(x_0) = x_0^T \beta(x_0)$

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\big(y_i,\, x_i^T \beta(x_0)\big)$$

$$l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l\big(y_i,\, \eta(x_i, \theta(z_0))\big), \qquad \eta(x, \theta) = x^T \theta$$

autoregressive time series model: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$

lag set: $z_t = (y_{t-1}, y_{t-2}, \dots, y_{t-k})$, so that $y_t = z_t^T \beta + \varepsilon_t$

recall the multiclass linear logistic regression model from ch. 4:

$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}$$

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\!\left[ 1 + \sum_{k=1}^{J-1} \exp\!\big( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \big) \right] \right\}$$
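a minimal sketch of local (kernel-weighted) logistic regression for the binary case J = 2, using scikit-learn's sample_weight support to fit the locally centered model; local_logistic_prob is an illustrative helper, not a reference implementation

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_prob(x0, X, y, lam=1.0):
    """Local binary logistic regression at x0: fit the linear logistic model on
    centered features (xi - x0) with weights K_lambda(x0, xi); the prediction at
    x0 is predict_proba at the origin (assumes both classes occur in the window)."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(t <= 1, 0.75 * (1.0 - t ** 2), 0.0)      # Epanechnikov weights
    inside = w > 0                                         # points inside the window
    clf = LogisticRegression().fit(X[inside] - x0, y[inside], sample_weight=w[inside])
    return clf.predict_proba(np.zeros((1, X.shape[1])))[0, 1]   # Pr(class 1 | x0)
```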


6.6 kernel density estimation and classification

Kernel Density Estimation

suppose we have a random sample x1, …, xN drawn from a probability density f_X(x), and we wish to estimate f_X at a point x0

a rough local estimate counts the x_i in a neighborhood of x0 of width λ:

$$\hat f_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N\lambda}$$

the smooth Parzen estimate, and its Gaussian-kernel form:

$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i) \;\;\rightarrow\;\; \hat f_X(x_0) = \frac{1}{N\,(2\lambda^2\pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\,(\|x_i - x_0\|/\lambda)^2}$$

Kernel Density Classification

nonparametric density estimates can be used for classification via Bayes' theorem

for a J-class problem, fit nonparametric density estimates f̂_j(X), j = 1, …, J, along with estimates of the class priors π̂_j (usually the sample proportions); then

$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k\, \hat f_k(x_0)}$$

The Naive Bayes Classifier

especially appropriate when the dimension p is high, making full density estimation unattractive; the naive Bayes model assumes that, given a class G = j, the features X_k are independent:

$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k)$$
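a short sketch of the Gaussian kernel density estimate and kernel density classification formulas above; gaussian_kde and kde_classify are illustrative names and the bandwidth lam = 0.5 is arbitrary

```python
import numpy as np

def gaussian_kde(x0, X, lam=0.5):
    """Gaussian-kernel density estimate at x0 (formula above); X is N x p."""
    N, p = X.shape
    sq = np.sum((X - x0) ** 2, axis=1) / lam ** 2
    return np.sum(np.exp(-0.5 * sq)) / (N * (2.0 * np.pi * lam ** 2) ** (p / 2))

def kde_classify(x0, X, g, lam=0.5):
    """Bayes' rule with per-class KDEs and sample-proportion priors pi_hat_j;
    returns the MAP class and the vector of posterior probabilities."""
    classes = np.unique(g)
    post = np.array([np.mean(g == j) * gaussian_kde(x0, X[g == j], lam) for j in classes])
    return classes[np.argmax(post)], post / post.sum()
```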


6.7 radial basis functions and kernels

OMITTED


6.8 and 6.9

Mixture Models for Density Estimation and Classification

Gaussian mixture model (GMM), more in ch. 8:

$$f(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x;\, \mu_m, \Sigma_m), \qquad \sum_m \alpha_m = 1$$

the fitted responsibility of component $m$ for observation $i$ is

$$\hat r_{im} = \frac{\hat\alpha_m\, \phi(x_i;\, \hat\mu_m, \hat\Sigma_m)}{\sum_{k=1}^{M} \hat\alpha_k\, \phi(x_i;\, \hat\mu_k, \hat\Sigma_k)}$$
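a sketch of computing the responsibilities r̂_im for given mixture parameters (the E-step of the EM algorithm used to fit the GMM), assuming SciPy for the Gaussian densities

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, mus, covs):
    """E-step responsibilities r_im of the Gaussian mixture above.
    alphas: (M,) mixing weights summing to 1; mus, covs: the M component means/covariances."""
    dens = np.column_stack([a * multivariate_normal.pdf(X, mean=m, cov=c)
                            for a, m, c in zip(alphas, mus, covs)])   # N x M: alpha_m * phi(x_i; mu_m, Sigma_m)
    return dens / dens.sum(axis=1, keepdims=True)                      # normalize each row over components
```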

Computational Considerations

both local regression and density estimation are memory-based methods

the model is the entire training data set, and fitting is done at evaluation or prediction time

the computational cost to fit at a single evaluation point x0 is O(N) flops

for comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N); basis-function methods have an initial cost of at least O(NM² + M³)

smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N²) flops

popular implementations of local regression (such as the loess function in S-PLUS and R) use optimization techniques that bring the cost down to roughly O(NM)


Q & A: relationship between kernel smoothing methods and kernel methods - often confused due to an abuse of terminology

Kernel Methods

arise from the dual representation

the kernel is an inner product of the (usually higher-dimensional) feature vectors: k(x, x′) = φ(x)^T φ(x′)

the advantage of such representations is that 'we can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x)'²

a more general idea, covering concepts such as linear kernel regression/classification, kernel smoothing regression, Gaussian processes, etc.

Kernel Smoothing Methods

basically, they specify methods for deriving smoother and less biased fitted curves

what the two concepts share is that they use many of the same basic kernel function forms, such as the Gaussian kernel or radial basis functions

² Ch. 6.1, Kernel Methods, Pattern Recognition and Machine Learning, C. M. Bishop


one more thing: solution manuals for these textbooks

for Pattern Recognition and Machine Learning:

https://github.com/zhengqigao/PRML-Solution-Manual

for Elements of Statistical Learning:

https://github.com/hansen7/ESL_Solution_Manual

lots of other statistics/probability textbook solution manuals:

https://waxworksmath.com/index.aspx


The End
