Regression and the LMS Algorithm
CSE 5526: Introduction to Neural Networks
mr-pc.org/t/cse5526/pdf/02-regression.pdf

Transcript of "Regression and the LMS Algorithm" (40 pages)

Page 1: Regression and the LMS Algorithm

Page 2: Problem statement

Page 3: Linear regression with one variable

• Given a set of N pairs of data $\{x_i, d_i\}$, approximate $d$ by a linear function of $x$ (the regressor):

$d_i \approx w x_i + b$

i.e.

$d_i = y_i + \varepsilon_i = \varphi(w x_i + b) + \varepsilon_i = w x_i + b + \varepsilon_i$

where the activation function $\varphi(x) = x$ is a linear function, corresponding to a linear neuron, $y_i$ is the output of the neuron, and

$\varepsilon_i = d_i - y_i$

is called the regression (expectational) error.

Page 4: Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error

• The least squares method aims to minimize the square error:

$E = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2$

Page 5: Linear regression (cont.)

• To minimize this squared-error function of the two variables, set

$\frac{\partial E}{\partial w} = 0, \qquad \frac{\partial E}{\partial b} = 0$

Page 6: Linear regression (cont.)

$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b} \frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b) = 0$

$\frac{\partial E}{\partial w} = \frac{\partial}{\partial w} \frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b)\, x_i = 0$

Page 7: Analytic solution approaches

• Solve one equation for b in terms of w
• Substitute into the other equation, solve for w
• Substitute the solution for w back into the equation for b

• Set up the system of equations in matrix notation
• Solve the matrix equation

• Rewrite the problem in matrix form
• Compute the matrix gradient
• Solve for w

Page 8: Linear regression (cont.)

• Hence

$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}$

$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$

where an overbar (e.g. $\bar{x}$) indicates the mean

• Derive yourself!
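As a quick check of these formulas, here is a minimal NumPy sketch (the data and the helper name fit_line are illustrative, not from the slides); it uses the equivalent intercept form $b = \bar{d} - w\bar{x}$, which follows from the first normal equation:

```python
import numpy as np

def fit_line(x, d):
    """Closed-form least-squares fit of d ≈ w*x + b for one regressor."""
    x_bar, d_bar = x.mean(), d.mean()
    # Slope from centered cross-moments, as in the formula above
    w = np.sum((x - x_bar) * (d - d_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept from the first normal equation: sum(d - w*x - b) = 0
    b = d_bar - w * x_bar
    return w, b

# Noisy samples of d = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)
d = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)
print(fit_line(x, d))  # approximately (2.0, 1.0)
```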

Page 9: Linear regression in matrix notation

• Let $X = (x_1\; x_2\; x_3 \ldots x_N)^T$
• Then the model predictions are $y = Xw$
• And the mean square error can be written

$E(w) = \|d - y\|^2 = \|d - Xw\|^2$

• To find the optimal w, set the gradient of the error with respect to w equal to 0 and solve for w:

$\frac{\partial}{\partial w} E(w) = 0 = \frac{\partial}{\partial w} \|d - Xw\|^2$

• See The Matrix Cookbook (Petersen & Pedersen)

Page 10: Linear regression in matrix notation

•
$\frac{\partial}{\partial w} E(w) = \frac{\partial}{\partial w} \|d - Xw\|^2$

$= \frac{\partial}{\partial w} (d - Xw)^T (d - Xw)$

$= \frac{\partial}{\partial w} \left( d^T d - 2 w^T X^T d + w^T X^T X w \right)$

$= -2 X^T d + 2 X^T X w$

• Setting the gradient to zero,

$\frac{\partial}{\partial w} E(w) = 0 = -2 X^T d + 2 X^T X w$

$\Rightarrow w = (X^T X)^{-1} X^T d$

• Much cleaner!
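In code, the normal-equations solution is a one-liner; a minimal sketch with synthetic, illustrative data (solving the linear system rather than forming an explicit inverse is the standard, numerically safer choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))      # N = 100 samples, 3 regressors
w_true = np.array([2.0, -1.0, 0.5])
d = X @ w_true + 0.1 * rng.standard_normal(100)

# w = (X^T X)^{-1} X^T d, solved as a linear system
w = np.linalg.solve(X.T @ X, X.T @ d)
print(w)  # close to w_true
# np.linalg.lstsq(X, d, rcond=None) computes the same fit more robustly
```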

Page 11: Finding optimal parameters via search

• Often there is no closed-form solution for $\frac{\partial}{\partial w} E(w) = 0$

• We can still use the gradient in a numerical solution
• We will still use the same example to permit comparison
• For simplicity's sake, set b = 0

E(w) is called a cost function

$E(w) = \frac{1}{2}\sum_{i=1}^{N} (d_i - w x_i)^2$
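This cost function and its derivative translate directly into code; a small sketch (the data and function names are illustrative):

```python
import numpy as np

def cost(w, x, d):
    """E(w) = 1/2 * sum_i (d_i - w * x_i)^2, with b fixed to 0."""
    return 0.5 * np.sum((d - w * x) ** 2)

def cost_gradient(w, x, d):
    """dE/dw = -sum_i (d_i - w * x_i) * x_i"""
    return -np.sum((d - w * x) * x)

x = np.array([0.0, 1.0, 2.0])
d = 2.0 * x                      # data generated with true w = 2
print(cost(2.0, x, d))           # 0.0 at the minimum
print(cost_gradient(2.0, x, d))  # 0.0 at the minimum
```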

Page 12: Cost function

[Figure: cost function E(w) versus w, with minimum $E_{\min}$ at $w = w^*$ and an initial weight $w_0$.]

Question: how can we update w from $w_0$ to minimize E?

Page 13: Gradient and directional derivatives

• Consider a two-variable function f(x, y). Its gradient at the point $(x_0, y_0)^T$ is defined as

$\nabla f(x_0, y_0) = \left( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \right)^T \bigg|_{x = x_0,\, y = y_0} = f_x(x_0, y_0)\, \mathbf{u}_x + f_y(x_0, y_0)\, \mathbf{u}_y$

where $\mathbf{u}_x$ and $\mathbf{u}_y$ are unit vectors in the x and y directions, and $f_x = \partial f / \partial x$ and $f_y = \partial f / \partial y$

Page 14: Gradient and directional derivatives (cont.)

• In any given direction $\mathbf{u} = a \mathbf{u}_x + b \mathbf{u}_y$, with $a^2 + b^2 = 1$, the directional derivative at $(x_0, y_0)^T$ along the unit vector $\mathbf{u}$ is

$D_{\mathbf{u}} f(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + ha,\, y_0 + hb) - f(x_0, y_0)}{h}$

$= \lim_{h \to 0} \frac{[f(x_0 + ha,\, y_0 + hb) - f(x_0,\, y_0 + hb)] + [f(x_0,\, y_0 + hb) - f(x_0, y_0)]}{h}$

$= a f_x(x_0, y_0) + b f_y(x_0, y_0) = \nabla f(x_0, y_0)^T \mathbf{u}$

• Which direction has the greatest slope? The gradient, because of the dot product!
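This identity is easy to verify numerically; a small sketch (the function f here is an arbitrary illustration, not one from the slides):

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3.0 * x * y

def grad_f(p):
    x, y = p
    return np.array([2.0 * x + 3.0 * y, 3.0 * x])

p0 = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0            # unit vector: a^2 + b^2 = 1
h = 1e-6
numeric = (f(p0 + h * u) - f(p0)) / h     # finite-difference D_u f
analytic = grad_f(p0) @ u                 # gradient dotted with u
print(numeric, analytic)                  # both approximately 7.2
```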

Page 15: Gradient and directional derivatives (cont.)

• Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$

Page 16: Gradient and directional derivatives (cont.)

• Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$

Page 17: Gradient and directional derivatives (cont.)

• The level curves of a function $f(x, y)$ are curves such that $f(x, y) = k$

• Thus, the directional derivative along a level curve is 0:

$D_{\mathbf{u}} f(x_0, y_0) = \nabla f(x_0, y_0)^T \mathbf{u} = 0$

• And the gradient vector is perpendicular to the level curve

Page 18: Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, with magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?

Page 19: Gradient illustration

[Figure: cost function E(w) versus w, showing the minimum $E_{\min}$ at $w = w^*$, an initial weight $w_0$, a step $\Delta w$, and the gradient (tangent slope) at $w_0$.]

$E'(w_0) = \lim_{\Delta w \to 0} \frac{E(w_0 + \Delta w) - E(w_0 - \Delta w)}{2 \Delta w}$

Page 20: Gradient descent

• Minimize the cost function via gradient (steepest) descent, a case of hill-climbing:

$w(n+1) = w(n) - \eta \nabla E(n)$

where n is the iteration number and η is the learning rate
• See previous figure
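A minimal sketch of this update loop for the one-variable cost above (the data and the learning rate η = 0.01 are illustrative choices):

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])
d = 2.0 * x                          # targets generated with true w = 2

w, eta = 0.0, 0.01                   # initial weight w(0) and learning rate
for n in range(200):
    grad = -np.sum((d - w * x) * x)  # dE/dw for E(w) = 1/2 sum (d_i - w x_i)^2
    w = w - eta * grad               # w(n+1) = w(n) - eta * gradient
print(w)                             # converges toward 2.0
```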

Page 21: Gradient descent (cont.)

• For the mean-square-error cost function and linear neurons,

$E(n) = \frac{1}{2} e^2(n) = \frac{1}{2} [d(n) - y(n)]^2 = \frac{1}{2} [d(n) - w(n)\, x(n)]^2$

$\nabla E(n) = \frac{\partial E(n)}{\partial w(n)} = -e(n)\, x(n)$

Page 22: Gradient descent (cont.)

• Hence

$w(n+1) = w(n) + \eta\, e(n)\, x(n) = w(n) + \eta\, [d(n) - y(n)]\, x(n)$

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
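A sketch of the LMS rule as an online loop, one update per sample (the noisy synthetic data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
d = 2.0 * x + 0.05 * rng.standard_normal(500)  # noisy targets, true w = 2

w, eta = 0.0, 0.1
for n in range(len(x)):
    y = w * x[n]               # linear neuron output y(n)
    e = d[n] - y               # error e(n) = d(n) - y(n)
    w = w + eta * e * x[n]     # w(n+1) = w(n) + eta * e(n) * x(n)
print(w)                       # close to 2.0
```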

Page 23: Stochastic gradient descent

• If the cost function is of the form

$E(w) = \sum_{n=1}^{N} E_n(w)$

• Then one gradient descent step requires computing

$\Delta w = \frac{\partial}{\partial w} E(w) = \sum_{n=1}^{N} \frac{\partial}{\partial w} E_n(w)$

• Which means computing $E(w)$ or its gradient for every data point

• Many steps may be required to reach an optimum

Page 24: Stochastic gradient descent

• It is generally much more computationally efficient to use

$\Delta w = \sum_{n=n_i}^{n_i + n_b - 1} \frac{\partial}{\partial w} E_n(w)$

for small values of $n_b$
• This update rule may converge in many fewer passes through the data (epochs)
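This is the mini-batch form of stochastic gradient descent; a sketch under illustrative choices ($n_b$ = 10, η = 0.05, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
d = 2.0 * x + 0.05 * rng.standard_normal(500)

w, eta, nb = 0.0, 0.05, 10
for epoch in range(5):                      # passes through the data
    for i in range(0, len(x), nb):
        xb, db = x[i:i + nb], d[i:i + nb]   # one mini-batch of nb samples
        grad = -np.sum((db - w * xb) * xb)  # sum of per-sample gradients
        w = w - eta * grad
print(w)                                    # close to 2.0
```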

Page 25: Stochastic gradient descent example

Page 26: Stochastic gradient descent error functions

Page 27: Stochastic gradient descent gradients

Page 28: Stochastic gradient descent animation

Page 29: Gradient descent animation

Page 30: Multi-variable LMS

• The analysis for the one-variable case extends to the multi-variable case, where $w_0 = b$ (bias) and $x_0 = 1$, as done for perceptron learning:

$E(n) = \frac{1}{2} [d(n) - \mathbf{w}^T(n)\, \mathbf{x}(n)]^2$

$\nabla E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_m} \right)^T$

Page 31: Multi-variable LMS (cont.)

• The LMS algorithm:

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta \nabla E(n) = \mathbf{w}(n) + \eta\, e(n)\, \mathbf{x}(n) = \mathbf{w}(n) + \eta\, [d(n) - y(n)]\, \mathbf{x}(n)$
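A sketch of the multi-variable LMS loop with the bias absorbed into the weight vector, as on the slide (the data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 500, 2
X = np.hstack([np.ones((N, 1)), rng.uniform(-1.0, 1.0, (N, m))])  # x0 = 1
w_true = np.array([0.5, 2.0, -1.0])                               # w0 = b
d = X @ w_true + 0.05 * rng.standard_normal(N)

w, eta = np.zeros(m + 1), 0.1
for n in range(N):
    e = d[n] - w @ X[n]        # e(n) = d(n) - w(n)^T x(n)
    w = w + eta * e * X[n]     # w(n+1) = w(n) + eta * e(n) * x(n)
print(w)                       # close to w_true
```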

Page 32: LMS algorithm remarks

• The LMS rule is exactly the same equation as the perceptron learning rule
• Perceptron learning is for nonlinear (M-P) neurons, whereas LMS learning is for linear neurons
  • i.e., perceptron learning is for classification and LMS is for function approximation
• LMS should be less sensitive to noise in the input data than perceptrons
• On the other hand, LMS learning converges slowly
• Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence
  • But it is not online and is computationally expensive

Page 33: Stability of adaptation

When η is too small, learning converges slowly

When η is too large, learning doesn’t converge

Page 34: Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation:

$\eta(n) = \frac{c}{n}$

where c is a positive parameter

Page 35: Learning rate annealing (cont.)

• Search-then-converge:

$\eta(n) = \frac{\eta_0}{1 + n/\tau}$

where $\eta_0$ and τ are positive parameters
• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the schedule roughly follows stochastic approximation
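Both schedules are one-liners in code; a sketch with illustrative constants (c = 1, η₀ = 0.5, τ = 100):

```python
def stochastic_approximation(n, c=1.0):
    """eta(n) = c / n"""
    return c / n

def search_then_converge(n, eta0=0.5, tau=100.0):
    """eta(n) = eta0 / (1 + n/tau)"""
    return eta0 / (1.0 + n / tau)

for n in (1, 10, 100, 1000):
    print(n, stochastic_approximation(n), search_then_converge(n))
# search-then-converge stays near eta0 while n << tau,
# then decays roughly like 1/n once n >> tau
```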

Page 36: Rate annealing illustration

Page 37: Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n:

$E(n) = \frac{1}{2} [d(n) - y(n)]^2 = \frac{1}{2} \left[ d(n) - \varphi\!\left( \sum_j w_j x_j(n) \right) \right]^2$

Page 38: Nonlinear neurons (cont.)

• By the chain rule of differentiation,

$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial v} \frac{\partial v}{\partial w_j} = -[d(n) - y(n)]\, \varphi'(v(n))\, x_j(n) = -e(n)\, \varphi'(v(n))\, x_j(n)$

Page 39: Nonlinear neurons (cont.)

• Gradient descent gives

$w_j(n+1) = w_j(n) + \eta\, e(n)\, \varphi'(v(n))\, x_j(n) = w_j(n) + \eta\, \delta(n)\, x_j(n)$

• The above is called the delta (δ) rule

• If we choose a logistic sigmoid for φ,

$\varphi(v) = \frac{1}{1 + \exp(-av)}$

then

$\varphi'(v) = a\, \varphi(v)\, [1 - \varphi(v)]$ (see textbook)
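A sketch of the delta rule for a single logistic neuron (a = 1, η = 0.5, and the synthetic data are illustrative choices, not from the slides):

```python
import numpy as np

def phi(v, a=1.0):
    """Logistic sigmoid phi(v) = 1 / (1 + exp(-a v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((500, 1)), rng.uniform(-2.0, 2.0, (500, 1))])  # x0 = 1
w_true = np.array([0.5, 2.0])
d = phi(X @ w_true)                         # targets from a known neuron

a, eta, w = 1.0, 0.5, np.zeros(2)
for n in range(len(X)):
    v = w @ X[n]                            # induced local field v(n)
    y = phi(v, a)                           # neuron output y(n)
    delta = (d[n] - y) * a * y * (1.0 - y)  # e(n) * phi'(v), phi' = a*phi*(1-phi)
    w = w + eta * delta * X[n]              # delta rule update
print(w)                                    # moves toward w_true
```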

Page 40: Role of activation function

[Figure: plots of φ(v) and φ′(v) versus v.]

The role of φ′: the weight update is most sensitive when v is near zero