Regression and the LMS Algorithm · CSE 5526: Regression 37 . Nonlinear neurons • To extend the...
Transcript of Regression and the LMS Algorithm · CSE 5526: Regression 37 . Nonlinear neurons • To extend the...
-
CSE 5526: Regression 1
CSE 5526: Introduction to Neural Networks
Regression and the LMS Algorithm
-
CSE 5526: Regression 2
Problem statement
-
CSE 5526: Regression 3
Linear regression with one variable
• Given a set of N pairs of data {xi, di}, approximate d by a linear function of x (regressor) i.e. or where the activation function φ(x) = x is a linear function, corresponding to a linear neuron. y is the output of the neuron, and is called the regression (expectational) error
bwxd +≈
ii
iiiii
bwxbwxyd
εεϕε
++=++=+= )(
iii yd −=ε
-
CSE 5526: Regression 4
Linear regression (cont.)
• The problem of regression with one variable is how to choose w and b to minimize the regression error
• The least squares method aims to minimize the square error:
∑∑==
−==N
iii
N
ii ydE
1
2
1
2 )(21
21 ε
-
CSE 5526: Regression 5
Linear regression (cont.)
• To minimize the two-variable square function, set
=∂∂
=∂∂
0
0
wE
bE
-
CSE 5526: Regression 6
Linear regression (cont.)
∑
∑=−−−=
−−∂∂
=∂∂
iii
iii
bwxd
bwxdbb
E
0)(
)(21 2
∑
∑=−−−=
−−∂∂
=∂∂
iiii
iii
xbwxd
bwxdww
E
0)(
)(21 2
-
CSE 5526: Regression 7
Analytic solution approaches
• Solve one equation for b in terms of w • Substitute into other equation, solve for w • Substitute solution for w back into equation for b
• Setup system of equations in matrix notation • Solve matrix equation
• Rewrite problem in matrix form • Compute matrix gradient • Solve for w
-
CSE 5526: Regression 8
Linear regression (cont.)
• Hence
where an overbar (i.e. ) indicates the mean
∑∑∑∑∑
−
−
=
ii
iii
ii
ii
ii
xxN
dxxdxb 2
2
)(
∑∑
−
−−=
ii
iii
xx
ddxxw 2)(
))((
x
Derive yourself!
-
CSE 5526: Regression 9
Linear regression in matrix notation
• Let 𝑿𝑿 = 𝒙𝒙𝟏𝟏 𝒙𝒙𝟐𝟐 𝒙𝒙𝟑𝟑 …𝒙𝒙𝑵𝑵 𝑻𝑻 • Then the model predictions are 𝒚𝒚 = 𝑿𝑿𝑿𝑿 • And the mean square error can be written 𝐸𝐸 𝑿𝑿 = 𝒅𝒅 − 𝒚𝒚 2 = 𝒅𝒅 − 𝑿𝑿𝑿𝑿 2
• To find the optimal w, set the gradient of the error with respect to w equal to 0 and solve for w
𝜕𝜕𝜕𝜕𝑿𝑿𝐸𝐸 𝑿𝑿 = 0 = 𝜕𝜕
𝜕𝜕𝑿𝑿𝒅𝒅 − 𝑿𝑿𝑿𝑿 2
• See The Matrix Cookbook (Petersen & Pedersen)
-
CSE 5526: Regression 10
Linear regression in matrix notation
• 𝜕𝜕𝜕𝜕𝑿𝑿𝐸𝐸 𝑿𝑿 = 𝜕𝜕
𝜕𝜕𝑿𝑿𝒅𝒅 − 𝑿𝑿𝑿𝑿 2
=𝜕𝜕𝜕𝜕𝑿𝑿
𝒅𝒅 − 𝑿𝑿𝑿𝑿 𝑇𝑇 𝒅𝒅 − 𝑿𝑿𝑿𝑿
=𝜕𝜕𝜕𝜕𝑿𝑿
𝒅𝒅𝑇𝑇𝒅𝒅 − 𝟐𝟐𝑿𝑿𝑇𝑇𝑿𝑿𝑇𝑇𝒅𝒅 + 𝑿𝑿𝑇𝑇𝑿𝑿𝑇𝑇𝑿𝑿𝑿𝑿 = −2𝑿𝑿𝑇𝑇𝒅𝒅 − 2𝑿𝑿𝑇𝑇𝑿𝑿𝑿𝑿
• 𝜕𝜕𝜕𝜕𝑿𝑿𝐸𝐸 𝑿𝑿 = 0 = −2𝑿𝑿𝑇𝑇𝒅𝒅 − 2𝑿𝑿𝑇𝑇𝑿𝑿𝑿𝑿
⇒ 𝑿𝑿 = 𝑿𝑿𝑇𝑇𝑿𝑿 −1𝑿𝑿𝑇𝑇𝒅𝒅 • Much cleaner!
-
CSE 5526: Regression 11
Finding optimal parameters via search
• Often there is no closed form solution for 𝜕𝜕𝜕𝜕𝑿𝑿𝐸𝐸 𝑿𝑿 = 0
• We can still use the gradient in a numerical solution • We will still use the same example to permit comparison • For simplicity’s sake, set b = 0
E(w) is called a cost function
∑=
−=N
iii wxdwE
1
2)(21)(
-
CSE 5526: Regression 12
Cost function
w
E(w)
w*
Emin
Question: how can we update w from w0 to minimize E?
w0
-
CSE 5526: Regression 13
Gradient and directional derivatives
• Consider a two-variable function f(x, y). Its gradient at the point (x0, y0)T is defined as
where ux and uy are unit vectors in the x and y directions, and and
yyxx
yyxx
T
yxfyxf
yyxf
xyxf
u),(u),(
),(,),(f
0000
00
+=
∂
∂∂
∂=∇
==
xff x ∂∂= yff y ∂∂=
-
CSE 5526: Regression 14
Gradient and directional derivatives (cont.)
• At any given direction, u = aux + buy, with , the directional derivative at (x0, y0)T along the unit vector u is • Which direction has the greatest slope? The gradient because of the
dot product!
u),(f
),(),(
)],(),([)],(),([lim
),(),(lim),(
00
0000
00000000
0
0000
000u
T
yx
h
hx
yx
yxbfyxafh
yxfhbyxfhbyxfhbyhaxfh
yxfhbyhaxfyxfD
∇=
+=
−+++−++=
−++=
→
→
122 =+ ba
-
CSE 5526: Regression 15
Gradient and directional derivatives (cont.)
• Example: f(x, y) = 52𝑥𝑥2 − 3𝑥𝑥𝑥𝑥 + 5
2𝑥𝑥2 + 2𝑥𝑥 + 2𝑥𝑥
-
CSE 5526: Regression 16
Gradient and directional derivatives (cont.)
• Example: f(x, y) = 52𝑥𝑥2 − 3𝑥𝑥𝑥𝑥 + 5
2𝑥𝑥2 + 2𝑥𝑥 + 2𝑥𝑥
-
CSE 5526: Regression 17
Gradient and directional derivatives (cont.)
• The level curves of a function 𝑓𝑓(𝑥𝑥,𝑥𝑥) are curves such that 𝑓𝑓 𝑥𝑥,𝑥𝑥 = 𝑘𝑘
• Thus, the directional derivative along a level curve is 0
• And the gradient vector is perpendicular to the level curve
0u),(f 00u =∇=TyxD
-
CSE 5526: Regression 18
Gradient and directional derivatives (cont.)
• The gradient of a cost function is a vector with the dimension of w that points to the direction of maximum E increase and with a magnitude equal to the slope of the tangent of the cost function along that direction • Can the slope be negative?
-
CSE 5526: Regression 19
Gradient illustration
w
E(w)
w*
Emin
w0
Δw w
wwEwwEwE
w ∆∆−−∆+
=
∇
→∆ 2)()(lim
)(
000
0
Gradient
-
CSE 5526: Regression 20
Gradient descent
• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing
n: iteration number η: learning rate •See previous figure
)()()1( nEnwnw ∇−=+ η
-
CSE 5526: Regression 21
Gradient descent (cont.)
• For the mean-square-error cost function and linear neurons
2
22
)]()()([21
)]()([21)(
21)(
nxnwnd
nyndnenE
−=
−==
)()()()(
21
)()(
2
nxnenwne
nwEnE
−=∂∂
=∂∂
=∇
-
CSE 5526: Regression 22
Gradient descent (cont.)
• Hence • This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff
rule
)()]()([)()()()()1(
nxnyndnwnxnenwnw−+=
+=+ηη
-
CSE 5526: Regression 23
Stochastic gradient descent
• If the cost function is of the form
𝐸𝐸 𝑤𝑤 = �𝐸𝐸𝑛𝑛 𝑤𝑤𝑁𝑁
𝑛𝑛=1
• Then one gradient descent step requires computing
Δw =𝜕𝜕𝜕𝜕𝑤𝑤
𝐸𝐸 𝑤𝑤 = �𝜕𝜕𝜕𝜕𝑤𝑤
𝐸𝐸𝑛𝑛 𝑤𝑤𝑁𝑁
𝑛𝑛=1
• Which means computing 𝐸𝐸(𝑤𝑤) or its gradient for every data point
• Many steps may be required to reach an optimum
-
CSE 5526: Regression 24
Stochastic gradient descent
• It is generally much more computationally efficient to use
Δ𝑤𝑤 = �𝜕𝜕𝜕𝜕𝑤𝑤
𝐸𝐸𝑛𝑛 𝑤𝑤𝑛𝑛𝑖𝑖+𝑛𝑛𝑏𝑏−1
𝑛𝑛=𝑛𝑛𝑖𝑖
• For small values of 𝑛𝑛𝑏𝑏 • This update rule may converge in many fewer
passes through the data (epochs)
-
CSE 5526: Regression 25
Stochastic gradient descent example
-
CSE 5526: Regression 26
Stochastic gradient descent error functions
-
CSE 5526: Regression 27
Stochastic gradient descent gradients
-
CSE 5526: Regression 28
Stochastic gradient descent animation
-
CSE 5526: Regression 29
Gradient descent animation
-
CSE 5526: Regression 30
Multi-variable LMS
• The analysis for the one-variable case extends to the multi-variable case where w0= b (bias) and x0 = 1, as done for perceptron learning
2)]()()([21)( nnndnE T xw−=
T
mwE
wE
wE
∂∂
∂∂
∂∂
=∇ ,...,,)w(E10
-
CSE 5526: Regression 31
Multi-variable LMS (cont.)
• The LMS algorithm
)()]()([)()()()(
)()()1(
nnyndnnnen
nnn
xwxw
Eww
−+=+=
∇−=+
ηηη
-
CSE 5526: Regression 32
LMS algorithm remarks
• The LMS rule is exactly the same equation as the perceptron learning rule
• Perceptron learning is for nonlinear (M-P) neurons, whereas LMS learning is for linear neurons. • i.e., perceptron learning is for classification and LMS is
for function approximation • LMS should be less sensitive to noise in the input
data than perceptrons • On the other hand, LMS learning converges slowly
• Newton’s method changes weights in the direction of the minimum E(w) and leads to fast convergence. • But it is not online and is computationally expensive
-
CSE 5526: Regression 33
Stability of adaptation
When η is too small, learning converges slowly
When η is too large, learning doesn’t converge
-
CSE 5526: Regression 34
Learning rate annealing
• Basic idea: start with a large rate but gradually decrease it • Stochastic approximation
c is a positive parameter
ncn =)(η
-
CSE 5526: Regression 35
Learning rate annealing (cont.)
• Search-then-converge η0 and τ are positive parameters •When n is small compared to τ, learning rate is approximately constant •When n is large compared to τ, learning rule schedule roughly follows stochastic approximation
)(1)( 0
τηη
nn
+=
-
CSE 5526: Regression 36
Rate annealing illustration
-
CSE 5526: Regression 37
Nonlinear neurons
• To extend the LMS algorithm to nonlinear neurons, consider differentiable activation function φ at iteration n
2
2
)()(21
)]()([21)(
−=
−=
∑j
jj nxwnd
nyndnE
ϕ
-
CSE 5526: Regression 38
Nonlinear neurons (cont.)
• By chain rule of differentiation
( )( ) )()()(
)()()]()([
nxnvnenxnvnynd
wv
vy
yE
wE
j
j
jj
ϕ
ϕ′−=
′−−=
∂∂
∂∂
∂∂
=∂∂
-
CSE 5526: Regression 39
Nonlinear neurons (cont.)
• Gradient descent gives • The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ
then
)()()(
)())(()()()1(
nxnnwnxnvnenwnw
jj
jjj
ηδ
ϕη
+=
′+=+
)exp(11)(
avv
−+=ϕ
)](1)[()( vvav ϕϕϕ −=′ (see textbook)
-
CSE 5526: Regression 40
Role of activation function
v
φ
v
φ′
The role of φ′: weight update is most sensitive when v is near zero
CSE 5526: Introduction to Neural NetworksProblem statementLinear regression with one variableLinear regression (cont.)Linear regression (cont.)Linear regression (cont.)Analytic solution approachesLinear regression (cont.)Linear regression in matrix notationLinear regression in matrix notationFinding optimal parameters via searchCost functionGradient and directional derivativesGradient and directional derivatives (cont.)Gradient and directional derivatives (cont.)Gradient and directional derivatives (cont.)Gradient and directional derivatives (cont.)Gradient and directional derivatives (cont.)Gradient illustrationGradient descentGradient descent (cont.)Gradient descent (cont.)Stochastic gradient descentStochastic gradient descentStochastic gradient descent exampleStochastic gradient descent error functionsStochastic gradient descent gradientsStochastic gradient descent animationGradient descent animationMulti-variable LMSMulti-variable LMS (cont.)LMS algorithm remarksStability of adaptationLearning rate annealingLearning rate annealing (cont.)Rate annealing illustrationNonlinear neuronsNonlinear neurons (cont.)Nonlinear neurons (cont.)Role of activation function