
Kernel Smoothing Methods

Hanchen Wang

Ph.D. Candidate in Information Engineering, University of Cambridge

September 29, 2019


Overview

1 6.0 what is kernel smoothing?

2 6.1 one-dimensional kernel smoothers

3 6.2 selecting the width λ of the kernel

4 6.3 local regression in R^p

5 6.4 structured local regression models in R^p

6 6.5 local likelihood and other models

7 6.6 kernel density estimation and classification

8 6.7 radial basis functions and kernels

9 6.8 mixture models for density estimation and classification

10 6.9 computational considerations

11 Q & A: relationship between kernel smoothing methods and kernel methods

12 one more thing: solution manuals for these textbooks


6.0 what is a kernel smoothing method?

a class of regression techniques that achieve flexibility in estimating the function f(X) over the domain R^p by fitting a different but simple model separately at each query point x0

the resulting estimated function f̂(X) is smooth in R^p

fitting is done at evaluation time; memory-based methods require, in principle, little or no training, similar to kNN lazy learning

they require setting hyperparameters such as the metric window size λ

here kernels are mostly used as a device for localization, rather than as the high-dimensional (implicit) feature extractor of kernel methods


6.1 one-dimensional kernel smoothers, overview

Y = sin(4X) + ε, X ∼ U[0, 1], ε ∼ N(0, 1/3)

in the figure: red point → f̂(x0); red circles → observations contributing to the fit at x0; solid yellow region → the weights assigned to those observations


6.1 one-dimensional kernel smoothers, overview

k-nearest-neighbor average: discontinuous, gives equal weight to all points in the neighborhood

$$\hat f(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$

Nadaraya–Watson kernel-weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

Epanechnikov quadratic kernel:

$$D(t) = \begin{cases} \tfrac{3}{4}\,(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

more generally, with an adaptive neighborhood width h_λ(x0):

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

tri-cube kernel:

$$D(t) = \begin{cases} \big(1 - |t|^3\big)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

compact support or not? differentiable at the boundary?
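a minimal NumPy sketch of the Nadaraya–Watson average with the Epanechnikov kernel on the toy data from the previous slide; the window width λ = 0.2 and the sample size of 100 are illustrative choices, not prescribed values

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov quadratic kernel D(t) from the slide above."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1.0 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam=0.2, kernel=epanechnikov):
    """Kernel-weighted average: f_hat(x0) = sum_i K(x0, xi) yi / sum_i K(x0, xi)."""
    w = kernel(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w) if np.sum(w) > 0 else np.nan

# toy data from the earlier slide: Y = sin(4X) + eps, X ~ U[0, 1], eps ~ N(0, 1/3)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), 100)   # std = sqrt(1/3)

grid = np.linspace(0.0, 1.0, 51)
fit = np.array([nadaraya_watson(x0, x, y) for x0 in grid])
```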


6.1 one-dimensional kernel smoothers, local linear

boundary issues arise → fit a locally weighted linear regression

$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0, \qquad \big(\hat\alpha(x_0), \hat\beta(x_0)\big) = \arg\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \big]^2$$


6.1 one-dimensional kernel smoothers, local linear

$$\min_{\alpha(x_0),\,\beta(x_0)} \ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \big]^2$$

in matrix form, with $\mathbf{y} = (y_1, \dots, y_N)^T$, $\mathbf{B}$ the $N \times 2$ regression matrix whose $i$-th row is $(1, x_i)$, and $\mathbf{W}(x_0) = \mathrm{diag}\big(K_\lambda(x_0, x_1), \dots, K_\lambda(x_0, x_N)\big)$:

$$\min_{\alpha(x_0),\,\beta(x_0)} \ \left( \mathbf{y} - \mathbf{B} \begin{pmatrix} \alpha(x_0) \\ \beta(x_0) \end{pmatrix} \right)^{\!T} \mathbf{W}(x_0) \left( \mathbf{y} - \mathbf{B} \begin{pmatrix} \alpha(x_0) \\ \beta(x_0) \end{pmatrix} \right)$$

$$\rightarrow \quad \hat f(x_0) = (1,\, x_0)\, \big( \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B} \big)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

f̂(x0) is linear in the y_i

the weights l_i(x0) combine the kernel K_λ(x0, x_i) with the least-squares operations, and are referred to as the equivalent kernel
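a short sketch of the closed-form local linear fit above, written to expose the equivalent-kernel weights l_i(x0); it assumes the epanechnikov() helper and the (x, y) toy data from the earlier sketch

```python
import numpy as np

def equivalent_kernel(x0, x, lam=0.2):
    """Weights l_i(x0) of the local linear fit, so that f_hat(x0) = sum_i l_i(x0) * y_i."""
    w = epanechnikov(np.abs(x - x0) / lam)       # K_lambda(x0, xi)
    B = np.column_stack([np.ones_like(x), x])    # N x 2 matrix with rows (1, xi)
    W = np.diag(w)                               # W(x0) = diag(K_lambda(x0, xi))
    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)

def local_linear(x0, x, y, lam=0.2):
    """f_hat(x0) = (1, x0) (B^T W B)^{-1} B^T W y, i.e. a dot product of l(x0) with y."""
    return equivalent_kernel(x0, x, lam) @ y
```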


6.1 one-dimensional kernel smoothers, local linear

why ’this bias is removed to first order’:

$$\mathrm{E}\big[\hat f(x_0)\big] = \sum_{i=1}^{N} l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^{N} l_i(x_0) + f'(x_0) \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R,$$

where the remainder $R$ involves third- and higher-order terms

it can be shown that $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$, so the first-order bias term vanishes (see the numerical check below)

there is still room for improvement: local quadratic fits outperform local linear fits in regions of curvature
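a quick numerical check of these two identities, reusing equivalent_kernel() and the toy data (x, y) from the sketches above; x0 is an arbitrary point near the boundary, where the correction matters most

```python
x0 = 0.02
l = equivalent_kernel(x0, x, lam=0.2)
print(np.sum(l))               # approximately 1
print(np.sum((x - x0) * l))    # approximately 0 -> first-order bias term vanishes
```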


6.1 one-dimensional kernel smoothers, local polynomial

we can fit local polynomials of any degree d:

$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^{\,j}$$

$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \ \sum_{i=1}^{N} K_\lambda(x_0, x_i) \Big[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{\,j} \Big]^2$$

the bias of a degree-d fit provably has components only of degree d + 1 and higher

no free lunch → increased variance

with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, we have $\mathrm{Var}\big(\hat f(x_0)\big) = \sigma^2\, \|l(x_0)\|^2$, and $\|l(x_0)\|^2$ increases with $d$

- Local linear fits: decrease the bias at the boundaries at a modest cost in variance.
- Local quadratic fits: do little for the bias at the boundaries and increase the variance a lot, but are most helpful in reducing bias due to curvature in the interior of the domain.
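a small sketch illustrating the variance formula above: the equivalent-kernel weights for a degree-d local polynomial fit, and how ||l(x0)||² grows with d; it reuses epanechnikov() and the toy data (x, y) from the earlier sketches

```python
import numpy as np

def local_poly_weights(x0, x, lam=0.2, d=2):
    """Equivalent-kernel weights l_i(x0) for a degree-d local polynomial fit,
    so that f_hat(x0) = l(x0) @ y."""
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.vander(x, N=d + 1, increasing=True)   # rows (1, xi, xi^2, ..., xi^d)
    W = np.diag(w)
    b0 = x0 ** np.arange(d + 1)                  # (1, x0, ..., x0^d)
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

# Var(f_hat(x0)) = sigma^2 * ||l(x0)||^2; the norm grows with the degree d
for d in (1, 2, 3):
    l = local_poly_weights(0.5, x, lam=0.2, d=d)
    print(d, np.sum(l ** 2))
```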


6.2 and 6.3

selecting the width λ of the kernel

a natural bias–variance tradeoff:

narrower window → larger variance, smaller bias; wider window → smaller variance, larger bias

the same intuition applies to local regression (linear/polynomial) estimates

local regression in R^p

a p-dimensional fit ≠ p separate one-dimensional fits → interaction terms between dimensions appear; consider p = 3, so each point is a 3 × 1 vector x := (x_(1), x_(2), x_(3))^T

the general form of local kernel regression with a degree-d polynomial is then:

$$\min_{\beta_j^{(k)}(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left( y_i - B^{(d)}(x_i)^T \big(\beta_1^{(0)}(x_0),\, \beta_1^{(1)}(x_0),\, \beta_2^{(1)}(x_0),\, \dots,\, \beta_{d(d+1)/2}^{(d)}(x_0)\big)^T \right)^{\!2}$$

where

$$K_\lambda(x_0, x_i) = D\!\left(\frac{\|x_i - x_0\|}{\lambda}\right)$$

$$B^{(0)}(x)^T = (1), \qquad B^{(1)}(x)^T = \big(1,\, x_{(1)},\, x_{(2)},\, x_{(3)}\big)$$

$$B^{(2)}(x)^T = \big(1,\, x_{(1)},\, x_{(2)},\, x_{(3)},\, x_{(1)}^2,\, x_{(2)}^2,\, x_{(3)}^2,\, x_{(1)}x_{(2)},\, x_{(1)}x_{(3)},\, x_{(2)}x_{(3)}\big)$$

Boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger. In fact, one manifestation of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows.
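a hypothetical sketch of local quadratic regression in R^p (here p = 3) with a spherical Epanechnikov kernel, building the basis B^(2)(x) listed above; quad_basis and local_quad_fit are illustrative names and lam = 1.0 is an arbitrary default

```python
import numpy as np
from itertools import combinations

def quad_basis(x):
    """B^(2)(x): intercept, linear terms, squares, and pairwise cross terms
    (10 components for p = 3)."""
    x = np.asarray(x, dtype=float)
    cross = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

def local_quad_fit(x0, X, y, lam=1.0):
    """Local quadratic regression at x0 in R^p with K(x0, x) = D(||x - x0|| / lam)."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(t <= 1, 0.75 * (1.0 - t ** 2), 0.0)    # Epanechnikov weights
    B = np.array([quad_basis(xi) for xi in X])           # N x 10 basis matrix for p = 3
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return quad_basis(x0) @ beta
```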


6.4 structured local regression models in R^p

structured kernels

modify the kernel → standardize each dimension to unit standard deviation

more generally, use a positive semidefinite matrix A → a Mahalanobis-type metric¹

$$K_{\lambda, A}(x_0, x) = D\!\left( \frac{(x - x_0)^T A\, (x - x_0)}{\lambda} \right)$$

structured regression functions

analysis-of-variance (ANOVA) decompositions

$$\mathrm{E}(Y \mid X) = f(X_1, X_2, \dots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots$$

varying coefficient models: the coefficients of a linear model in X1, …, Xq are allowed to vary with the remaining (conditioning) variables Z

f (X ) = α(Z) + β1(Z)X1 + · · ·+ βq(Z)Xq

$$\min_{\alpha(z_0),\, \beta(z_0)} \ \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, \big( y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0) \big)^2$$

¹ http://contrib.scikit-learn.org/metric-learn/
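a sketch of the structured kernel above, with an Epanechnikov profile D and a user-supplied positive semidefinite matrix A (e.g. the inverse sample covariance, to standardize and decorrelate the coordinates); the values below are illustrative

```python
import numpy as np

def structured_kernel(x0, x, A, lam=1.0):
    """K_{lambda,A}(x0, x) = D((x - x0)^T A (x - x0) / lambda),
    with D the Epanechnikov kernel and A positive semidefinite."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    t = d @ A @ d / lam
    return 0.75 * (1.0 - t ** 2) if abs(t) <= 1 else 0.0

# example: differences along the second coordinate count for 10x less
A = np.diag([1.0, 0.1])
print(structured_kernel(np.zeros(2), np.array([0.5, 0.5]), A))
```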


6.5 local likelihood and other models

the concept of local regression and varying coefficient models is extremely broad

local likelihood inference: make the parameter of a parametric model local in the covariates, e.g. $\theta(x_0) = x_0^T \beta(x_0)$

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\big(y_i,\, x_i^T \beta(x_0)\big)$$

$$l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l\big(y_i,\, \eta(x_i, \theta(z_0))\big), \qquad \eta(x, \theta) = x^T \theta$$

autoregressive time series model: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$

lag set: $z_t = (y_{t-1}, y_{t-2}, \dots, y_{t-k})$, so that $y_t = z_t^T \beta + \varepsilon_t$

recall the multiclass linear logistic regression model from ch. 4:

$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}$$

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\!\left[ 1 + \sum_{k=1}^{J-1} \exp\!\big( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \big) \right] \right\}$$
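a minimal sketch of local (kernel-weighted) logistic regression for the binary case J = 2, using scikit-learn's sample_weight support to fit the locally centered model; local_logistic_prob is an illustrative helper, not a reference implementation

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_prob(x0, X, y, lam=1.0):
    """Local binary logistic regression at x0: fit the linear logistic model on
    centered features (xi - x0) with weights K_lambda(x0, xi); the prediction at
    x0 is predict_proba at the origin (assumes both classes occur in the window)."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(t <= 1, 0.75 * (1.0 - t ** 2), 0.0)      # Epanechnikov weights
    inside = w > 0                                         # points inside the window
    clf = LogisticRegression().fit(X[inside] - x0, y[inside], sample_weight=w[inside])
    return clf.predict_proba(np.zeros((1, X.shape[1])))[0, 1]   # Pr(class 1 | x0)
```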


6.6 kernel density estimation and classification

Kernel Density Estimation

suppose we have a random sample x1, …, xN drawn from a probability density f_X(x), and we wish to estimate f_X at a point x0

a rough local estimate counts the x_i in a neighborhood of x0 of width λ:

$$\hat f_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N\lambda}$$

the smooth Parzen estimate, and its Gaussian-kernel form:

$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i) \;\;\rightarrow\;\; \hat f_X(x_0) = \frac{1}{N\,(2\lambda^2\pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\,(\|x_i - x_0\|/\lambda)^2}$$

Kernel Density Classification

nonparametric density estimates can be used for classification via Bayes' theorem

for a J-class problem, fit nonparametric density estimates f̂_j(X), j = 1, …, J, along with estimates of the class priors π̂_j (usually the sample proportions); then

$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k\, \hat f_k(x_0)}$$

The Naive Bayes Classifier

especially appropriate when the dimension p is high, making full density estimation unattractive; the naive Bayes model assumes that, given a class G = j, the features X_k are independent:

$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k)$$
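a short sketch of the Gaussian kernel density estimate and kernel density classification formulas above; gaussian_kde and kde_classify are illustrative names and the bandwidth lam = 0.5 is arbitrary

```python
import numpy as np

def gaussian_kde(x0, X, lam=0.5):
    """Gaussian-kernel density estimate at x0 (formula above); X is N x p."""
    N, p = X.shape
    sq = np.sum((X - x0) ** 2, axis=1) / lam ** 2
    return np.sum(np.exp(-0.5 * sq)) / (N * (2.0 * np.pi * lam ** 2) ** (p / 2))

def kde_classify(x0, X, g, lam=0.5):
    """Bayes' rule with per-class KDEs and sample-proportion priors pi_hat_j;
    returns the MAP class and the vector of posterior probabilities."""
    classes = np.unique(g)
    post = np.array([np.mean(g == j) * gaussian_kde(x0, X[g == j], lam) for j in classes])
    return classes[np.argmax(post)], post / post.sum()
```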


6.7 radial basis functions and kernels

OMITTED


6.8 and 6.9

Mixture Models for Density Estimation and Classification

Gaussian mixture model (GMM), more in ch. 8:

$$f(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x;\, \mu_m, \Sigma_m), \qquad \sum_m \alpha_m = 1$$

the fitted responsibility of component $m$ for observation $i$ is

$$\hat r_{im} = \frac{\hat\alpha_m\, \phi(x_i;\, \hat\mu_m, \hat\Sigma_m)}{\sum_{k=1}^{M} \hat\alpha_k\, \phi(x_i;\, \hat\mu_k, \hat\Sigma_k)}$$
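a sketch of computing the responsibilities r̂_im for given mixture parameters (the E-step of the EM algorithm used to fit the GMM), assuming SciPy for the Gaussian densities

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, mus, covs):
    """E-step responsibilities r_im of the Gaussian mixture above.
    alphas: (M,) mixing weights summing to 1; mus, covs: the M component means/covariances."""
    dens = np.column_stack([a * multivariate_normal.pdf(X, mean=m, cov=c)
                            for a, m, c in zip(alphas, mus, covs)])   # N x M: alpha_m * phi(x_i; mu_m, Sigma_m)
    return dens / dens.sum(axis=1, keepdims=True)                      # normalize each row over components
```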

Computational Considerations

both local regression and density estimation are memory-based methods

the model is the entire training data set, and fitting is done at evaluation or prediction time

the computational cost to fit at a single evaluation point x0 is O(N) flops

for comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N); basis-function methods have an initial cost of at least O(NM² + M³)

smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N²) flops

popular implementations of local regression (such as the loess function in S-PLUS and R) use optimization techniques that bring the cost down to roughly O(NM)


Q & A: relationship between kernel smoothing methods and kernel methods - often confused due to an abuse of terminology

Kernel Methods

arise from the dual representation

the kernel is an inner product of the (usually higher-dimensional) feature vectors: k(x, x′) = φ(x)^T φ(x′)

the advantage of such representations is that 'we can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x)'²

a more general idea, covering concepts such as linear kernel regression/classification, kernel smoothing regression, Gaussian processes, etc.

Kernel Smoothing Methods

basically, they specify methods for deriving smoother and less biased fitted curves

what the two concepts share is that they use many of the same basic kernel function forms, such as the Gaussian kernel or radial basis functions

² Ch. 6.1, Kernel Methods, Pattern Recognition and Machine Learning, C. M. Bishop


one more thing: solution manuals for these textbooks

for Pattern Recognition and Machine Learning:

https://github.com/zhengqigao/PRML-Solution-Manual

for Elements of Statistical Learning:

https://github.com/hansen7/ESL_Solution_Manual

lots of other statistics/probability textbook solution manuals:

https://waxworksmath.com/index.aspx


The End
