Kernel Smoothing Methods
Hanchen Wang
Ph.D. Candidate in Information Engineering, University of Cambridge
September 29, 2019
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 1 / 18
Overview
1 6.0 what is kernel smoothing?
2 6.1 one-dimensional kernel smoothers
3 6.2 selecting the width λ of the kernel
4 6.3 local regression in Rp
5 6.4 structured local regression models in Rp
6 6.5 local likelihood and other models
7 6.6 kernel density estimation and classification
8 6.7 radial basis functions and kernels
9 6.8 mixture models for density estimation and classification
10 6.9 computational considerations
11 Q & A: relationship between kernel smoothing methods and kernel methods
12 one more thing: solution manual to these textbooks
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 2 / 18
6.0 what is kernel smoothing?
a class of regression techniques that achieve flexibility in estimating the function f(X) over the domain ℝ^p by fitting a different but simple model separately at each query point x₀
the resulting estimated function f̂(X) is smooth in ℝ^p
fitting is done at evaluation time; memory-based methods require in principle little or no training, similar to k-NN lazy learning
they require hyperparameter settings such as the metric window size λ
kernels here are mostly used as a device for localization, rather than as high-dimensional (implicit) feature extractors as in kernel methods
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 3 / 18
6.1 one-dimensional kernel smoothers, overview
$Y = \sin(4X) + \varepsilon$, $X \sim U[0, 1]$, $\varepsilon \sim N(0, 1/3)$
red point → $\hat f(x_0)$; red circles → observations contributing to the fit at $x_0$; solid yellow region → the weights assigned to observations
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 4 / 18
6.1 one-dimensional kernel smoothers, overview
k-nearest-neighbor average: discontinuous; equal weight for all points in the neighborhood

$$\hat f(x) = \operatorname{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$

Nadaraya–Watson kernel-weighted average:

$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

Epanechnikov quadratic kernel:

$$D(t) = \begin{cases} \frac{3}{4}\left(1 - t^2\right) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

more general, with an adaptive neighborhood:

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

tri-cube kernel:

$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

compact support or not? differentiable at the boundary?
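A minimal NumPy sketch of the Nadaraya–Watson average with the Epanechnikov kernel, on toy data mirroring the simulation above (all names are illustrative, not from the slides):

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov quadratic kernel D(t), compactly supported on |t| <= 1."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average: sum_i K(x0, xi) yi / sum_i K(x0, xi)."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

# toy data as on the slide: Y = sin(4X) + eps, X ~ U[0,1], eps ~ N(0, 1/3)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, np.sqrt(1/3), 100)  # 1/3 read as a variance
print(nadaraya_watson(0.5, x, y, lam=0.2))
```

Because the weights die off smoothly within the window, the fitted curve avoids the discontinuities of the k-NN average.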
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 5 / 18
6.1 one-dimensional kernel smoothers, local linear
boundary issues arise → fit a locally weighted linear regression
$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0, \qquad (\hat\alpha, \hat\beta) = \arg\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0)\, x_i]^2$$
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 6 / 18
6.1 one-dimensional kernel smoothers, local linear
$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0)\, x_i]^2$$

in matrix form, with $\mathbf{B}$ the $N \times 2$ matrix whose $i$-th row is $(1,\ x_i)$ and $\mathbf{W}(x_0) = \operatorname{diag}\left(K_\lambda(x_0, x_1), \ldots, K_\lambda(x_0, x_N)\right)$:

$$\min_{\alpha(x_0),\, \beta(x_0)} \left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)^T \mathbf{W}(x_0) \left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)$$

$$\Rightarrow\ \hat f(x_0) = (1,\ x_0)\left(\mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B}\right)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^N l_i(x_0)\, y_i$$

$\hat f(x_0)$ is linear w.r.t. the $y_i$
the weights $l_i(x_0)$ combine the kernel $K_\lambda(x_0, x_i)$ with the least squares operations; they define the equivalent kernel
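A sketch of the closed form above, reusing the epanechnikov kernel from the earlier snippet; it returns both the fit and the equivalent-kernel weights $l_i(x_0)$:

```python
import numpy as np

def local_linear(x0, x, y, lam, kernel):
    """Locally weighted linear fit at x0.

    Returns f_hat(x0) and the equivalent-kernel weights l(x0), so that
    f_hat(x0) = l(x0) @ y, i.e. the fit is linear in the y_i."""
    B = np.column_stack([np.ones_like(x), x])       # N x 2 regression matrix
    W = np.diag(kernel(np.abs(x - x0) / lam))       # N x N diagonal kernel weights
    # l(x0)^T = (1, x0) (B^T W B)^{-1} B^T W
    l = np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    return l @ y, l

# sanity checks from the next slide: sum_i l_i = 1, sum_i (x_i - x0) l_i = 0
# fhat, l = local_linear(0.5, x, y, 0.2, epanechnikov)
# assert np.isclose(l.sum(), 1) and np.isclose(((x - 0.5) * l).sum(), 0)
```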
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 7 / 18
6.1 one-dimensional kernel smoothers, local linear
why "this bias is removed to first order":

$$\mathrm{E}\left(\hat f(x_0)\right) = \sum_{i=1}^N l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^N l_i(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + R,$$

where the remainder $R$ collects third- and higher-order terms

it can be proved that $\sum_{i=1}^N l_i(x_0) = 1$ and $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$ for local linear regression, so the zeroth-order term reproduces $f(x_0)$ exactly and the first-order term vanishes; the leading bias is the quadratic term
there is still room for improvement: quadratic fits outperform linear fits in regions of curvature
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 8 / 18
6.1 one-dimensional kernel smoothers, local polynomial
we can fit local polynomials of any degree $d$: $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j$, where

$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\ldots,d} \sum_{i=1}^N K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j \right]^2$$

the bias of a degree-$d$ fit provably has components only of degree $d+1$ and higher
no free lunch → increased variance: with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, we get $\operatorname{Var}\left(\hat f(x_0)\right) = \sigma^2 \|l(x_0)\|^2$, and $\|l(x_0)\|^2$ increases with $d$

- local linear fits: bias decreases at the boundaries at a modest cost in variance
- local quadratic fits: do little for the bias at the boundaries and increase the variance a lot, but are most helpful in reducing bias due to curvature in the interior of the domain
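A degree-$d$ generalization of the earlier sketch ($d = 0$ recovers Nadaraya–Watson, $d = 1$ the local linear fit); a sketch under the same assumptions, with epanechnikov as a possible kernel argument:

```python
import numpy as np

def local_poly(x0, x, y, lam, d, kernel):
    """Degree-d locally weighted polynomial fit at x0."""
    B = np.vander(x, N=d + 1, increasing=True)     # columns 1, x, ..., x^d
    w = kernel(np.abs(x - x0) / lam)               # kernel weights K(x0, xi)
    WB = B * w[:, None]
    # weighted least squares: solve (B^T W B) beta = B^T W y
    coef = np.linalg.solve(B.T @ WB, WB.T @ y)     # (alpha, beta_1, ..., beta_d)
    return np.polyval(coef[::-1], x0)              # alpha + sum_j beta_j * x0^j
```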
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 9 / 18
6.2 and 6.3
selecting the width λ of the kernel: a natural bias–variance tradeoff
narrower window: larger variance, smaller bias
wider window: smaller variance, larger bias
the same intuition holds for local regression (linear/polynomial) estimates
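A common off-line way to choose λ (not spelled out on the slide) is leave-one-out cross-validation over a grid; a rough sketch, assuming nadaraya_watson and the toy data x, y from the earlier snippet:

```python
import numpy as np

def loocv_mse(lam, x, y, fit):
    """Mean squared leave-one-out prediction error of a smoother fit(x0, x, y, lam)."""
    idx = np.arange(len(x))
    errs = [(y[i] - fit(x[i], x[idx != i], y[idx != i], lam))**2 for i in idx]
    return np.mean(errs)

# caveat: with a compact kernel, a too-small lam can leave a point with no neighbors
lams = np.linspace(0.05, 0.5, 10)                  # candidate bandwidths
best_lam = min(lams, key=lambda l: loocv_mse(l, x, y, nadaraya_watson))
```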
local regression in Rp
p-dimensional ≠ p × 1-dimensional → interaction terms between dimensions; consider p = 3, so each point is a 3 × 1 vector: $x := (x_{(1)}, x_{(2)}, x_{(3)})^T$
then the general form for local kernel regression with a degree-$d$ polynomial is:

$$\min_{\beta_j^{(k)}(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i) \left( y_i - B^{(d)}(x_i)^T \left( \beta_1^{(0)}(x_0),\ \beta_1^{(1)}(x_0),\ \beta_2^{(1)}(x_0),\ \ldots,\ \beta_{d(d+1)/2}^{(d)}(x_0) \right)^T \right)^2$$

where

$$K_\lambda(x_0, x_i) = D\!\left(\frac{\|x_i - x_0\|}{\lambda}\right)$$

$$B^{(0)}(x)^T = (1), \qquad B^{(1)}(x)^T = \left(1,\ x_{(1)},\ x_{(2)},\ x_{(3)}\right),$$
$$B^{(2)}(x)^T = \left(1,\ x_{(1)},\ x_{(2)},\ x_{(3)},\ x_{(1)}^2,\ x_{(2)}^2,\ x_{(3)}^2,\ x_{(1)}x_{(2)},\ x_{(1)}x_{(3)},\ x_{(2)}x_{(3)}\right)$$

boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger; in fact, one manifestation of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 10 / 18
6.4 structured local regression models in Rp
structured kernels
modify the kernel → e.g. standardize each dimension to unit standard deviation
more generally, use a positive semidefinite matrix A → Mahalanobis metric¹:

$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$$

structured regression functions
analysis-of-variance (ANOVA) decompositions:

$$\mathrm{E}(Y \mid X) = f(X_1, X_2, \ldots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots$$

varying coefficient model: the coefficients vary with a conditioning variable Z:

$$f(X) = \alpha(Z) + \beta_1(Z) X_1 + \cdots + \beta_q(Z) X_q$$

$$\min_{\alpha(z_0),\, \beta(z_0)} \sum_{i=1}^N K_\lambda(z_0, z_i) \left( y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0) \right)^2$$
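The varying coefficient fit is again just kernel-weighted least squares, with the kernel acting in z-space only; a minimal sketch (Z is an N × r array of conditioning variables, X is N × q, and the names are illustrative):

```python
import numpy as np

def varying_coeff_fit(z0, Z, X, y, lam, kernel):
    """Fit alpha(z0), beta(z0) by minimizing
    sum_i K(z0, zi) * (yi - alpha(z0) - xi^T beta(z0))^2."""
    B = np.column_stack([np.ones(len(y)), X])            # N x (q+1): intercept + features
    w = kernel(np.linalg.norm(Z - z0, axis=1) / lam)     # kernel weights in z-space
    WB = B * w[:, None]
    return np.linalg.solve(B.T @ WB, WB.T @ y)           # (alpha(z0), beta_1..beta_q)
```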
¹ http://contrib.scikit-learn.org/metric-learn/
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 11 / 18
6.5 local likelihood and other models
the concept of local regression and varying coefficient models is extremely broad
local likelihood inference: associate with each observation a locally linear parameter $\theta(x_0) = x_0^T \beta(x_0)$ and maximize the kernel-weighted log-likelihood:

$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\left(y_i,\ x_i^T \beta(x_0)\right)$$

more generally, the parameter can be made local in a different variable $z$:

$$l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0, z_i)\, l\left(y_i,\ \eta(x_i, \theta(z_0))\right), \qquad \eta(x, \theta) = x^T \theta$$

autoregressive time series model: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$
with the lag set $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-k})$, this is $y_t = z_t^T \beta + \varepsilon_t$
recall the multiclass linear logistic regression model from ch. 4:

$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}$$

the local log-likelihood for this J-class model is:

$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i) \left\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\left[ 1 + \sum_{k=1}^{J-1} \exp\left( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \right) \right] \right\}$$
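For J = 2 the local log-likelihood above can be maximized numerically; a sketch with scipy for binary labels g ∈ {0, 1} (a generic optimizer stands in for whatever solver one prefers, and the names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def local_logistic_prob(x0, X, g, lam, kernel):
    """Local linear logistic fit at x0; returns the estimate of Pr(G = 1 | x0)."""
    w = kernel(np.linalg.norm(X - x0, axis=1) / lam)
    Xc = X - x0                                   # center: the intercept is the logit at x0
    def neg_local_loglik(b):
        eta = b[0] + Xc @ b[1:]
        # kernel-weighted binomial log-likelihood: g*eta - log(1 + exp(eta))
        return -np.sum(w * (g * eta - np.logaddexp(0.0, eta)))
    b_hat = minimize(neg_local_loglik, np.zeros(X.shape[1] + 1)).x
    return 1.0 / (1.0 + np.exp(-b_hat[0]))
```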
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 12 / 18
6.6 kernel density estimation and classification
Kernel Density Estimation
suppose we have a random sample $x_1, \ldots, x_N$ drawn from a probability density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$; a rough local estimate, with $N(x_0)$ a small metric neighborhood of width $\lambda$, and its smooth kernel counterpart:

$$\hat f_X(x_0) = \frac{\#\{x_i \in N(x_0)\}}{N \lambda} \quad\rightarrow\quad \hat f_X(x_0) = \frac{1}{N \lambda} \sum_{i=1}^N K_\lambda(x_0, x_i)$$

with the Gaussian kernel in $\mathbb{R}^p$:

$$\hat f_X(x_0) = \frac{1}{N \left(2 \lambda^2 \pi\right)^{p/2}} \sum_{i=1}^N e^{-\frac{1}{2}\left(\|x_i - x_0\| / \lambda\right)^2}$$
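A direct transcription of the Gaussian-kernel estimate above (scipy.stats.gaussian_kde is the off-the-shelf alternative):

```python
import numpy as np

def gaussian_kde_at(x0, X, lam):
    """f_hat(x0) = (1 / (N (2 pi lam^2)^(p/2))) * sum_i exp(-||xi - x0||^2 / (2 lam^2))."""
    N, p = X.shape
    sq_dist = np.sum((X - x0)**2, axis=1)
    return np.sum(np.exp(-0.5 * sq_dist / lam**2)) / (N * (2 * np.pi * lam**2)**(p / 2))
```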
Kernel Density Classification
nonparametric density estimates for classification using Bayes' theorem: for a J-class problem, fit nonparametric density estimates $\hat f_j(X)$, $j = 1, \ldots, J$, along with estimates of the class priors $\hat\pi_j$ (usually the sample proportions); then

$$\widehat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^J \hat\pi_k\, \hat f_k(x_0)}$$
The Naive Bayes Classifier
especially appropriate when the dimension p is high, making full density estimation unattractive; the naive Bayes model assumes that given a class G = j, the features $X_k$ are independent:

$$f_j(X) = \prod_{k=1}^p f_{jk}(X_k)$$
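Combining the two ideas above, class priors from sample proportions and one univariate KDE per feature per class; a sketch assuming gaussian_kde_at from the previous snippet:

```python
import numpy as np

def naive_bayes_kde(x0, class_samples, lam):
    """Posterior over J classes at point x0; class_samples[j] is an N_j x p array."""
    n_total = sum(len(X) for X in class_samples)
    post = []
    for X in class_samples:
        prior = len(X) / n_total                   # pi_j from sample proportions
        dens = 1.0
        for k in range(X.shape[1]):                # naive Bayes: product over features
            dens *= gaussian_kde_at(x0[k:k+1], X[:, k:k+1], lam)
        post.append(prior * dens)
    post = np.array(post)
    return post / post.sum()                       # Bayes' theorem
```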
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 13 / 18
6.7 radial basis functions and kernels
OMITTED
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 14 / 18
6.8 and 6.9
Mixture Models for Density Estimation and Classification
Gaussian mixture model (GMM), more in ch. 8:

$$f(x) = \sum_{m=1}^M \alpha_m\, \phi(x; \mu_m, \Sigma_m), \qquad \sum_m \alpha_m = 1$$

the fitted responsibilities:

$$\hat r_{im} = \frac{\hat\alpha_m\, \phi\left(x_i; \hat\mu_m, \hat\Sigma_m\right)}{\sum_{k=1}^M \hat\alpha_k\, \phi\left(x_i; \hat\mu_k, \hat\Sigma_k\right)}$$
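A sketch of the responsibility computation (the E-step of the EM algorithm used to fit the mixture; the parameter estimates are assumed given):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_responsibilities(X, alphas, mus, Sigmas):
    """r_im = alpha_m phi(x_i; mu_m, Sigma_m) / sum_k alpha_k phi(x_i; mu_k, Sigma_k)."""
    weighted = np.column_stack([
        a * multivariate_normal.pdf(X, mean=mu, cov=S)     # alpha_m * Gaussian density
        for a, mu, S in zip(alphas, mus, Sigmas)
    ])                                                     # N x M
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize over components
```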
Computational Considerations
both local regression and density estimation are memory-based methods: the model is the entire training data set, and fitting is done at evaluation or prediction time
the computational cost to fit at a single observation x₀ is O(N) flops
for comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N); basis function methods have an initial cost of at least O(NM² + M³)
the smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N²) flops
popular implementations of local regression (such as the loess function in S-PLUS and R) use optimization techniques that bring the cost down to O(NM)
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 15 / 18
Q & A: relationship between kernel smoothing methods and kernel methods - confused due to abuse of terminology
Kernel Methods
arise from the dual representation
based on an inner product of the (usually higher-dimensional) feature vectors: $k(x, x') = \phi(x)^T \phi(x')$
the advantage of such representations is that "we can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x)", from ²
a more general idea, containing concepts such as linear kernel regression/classification, kernel smoothing regression, Gaussian processes, etc. (see the tiny illustration below)
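A tiny illustration of the dual-representation point: a degree-2 polynomial kernel evaluates the inner product of an implicit feature map without ever forming it (the feature map is written out only to verify the identity):

```python
import numpy as np

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

# implicit degree-2 feature map for k(x, z) = (x^T z)^2 in two dimensions
phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

assert np.isclose((x @ z)**2, phi(x) @ phi(z))   # same value, phi never needed
```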
Kernel Smoothing Methods
basically, they specify methods for deriving smoother and less biased fitted curves
the similarity of the two concepts is that they share many basic kernel function forms, such as the Gaussian kernel or radial basis functions
² Ch. 6.1, Kernel Methods, Pattern Recognition and Machine Learning, by C. M. Bishop
one more thing: solution manual to these textbooks
for Pattern Recognition and Machine Learning :
https://github.com/zhengqigao/PRML-Solution-Manual
for Elements of Statistical Learning :
https://github.com/hansen7/ESL_Solution_Manual
solution manuals for lots of other statistics/probability textbooks:
https://waxworksmath.com/index.aspx
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 17 / 18
The End
Hanchen Wang ([email protected]) Kernel Smoothing Methods September 29, 2019 18 / 18