
Transcript of "Regression and the LMS Algorithm" (CSE 5526: Introduction to Neural Networks)

  • CSE 5526: Regression 1

    CSE 5526: Introduction to Neural Networks

    Regression and the LMS Algorithm

  • CSE 5526: Regression 2

    Problem statement

  • CSE 5526: Regression 3

    Linear regression with one variable

    • Given a set of N pairs of data {xᵢ, dᵢ}, approximate d by a linear function of x (the regressor), i.e.

      $$d_i \approx w x_i + b$$

      or

      $$d_i = y_i + \varepsilon_i = \varphi(w x_i + b) + \varepsilon_i = w x_i + b + \varepsilon_i$$

      where the activation function φ(x) = x is a linear function, corresponding to a linear neuron; y is the output of the neuron, and

      $$\varepsilon_i = d_i - y_i$$

      is called the regression (expectational) error

  • CSE 5526: Regression 4

    Linear regression (cont.)

    • The problem of regression with one variable is how to choose w and b to minimize the regression error

    • The least squares method aims to minimize the square error:

      $$E = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2$$

  • CSE 5526: Regression 5

    Linear regression (cont.)

    • To minimize the two-variable square function, set

      $$\frac{\partial E}{\partial w} = 0, \qquad \frac{\partial E}{\partial b} = 0$$

  • CSE 5526: Regression 6

    Linear regression (cont.)

    $$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b) = 0$$

    $$\frac{\partial E}{\partial w} = \frac{\partial}{\partial w}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b)\, x_i = 0$$

  • CSE 5526: Regression 7

    Analytic solution approaches

    • Solve one equation for b in terms of w; substitute into the other equation and solve for w; substitute the solution for w back into the equation for b

    • Set up the system of equations in matrix notation; solve the matrix equation

    • Rewrite the problem in matrix form; compute the matrix gradient; solve for w

  • CSE 5526: Regression 8

    Linear regression (cont.)

    • Hence

      $$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}$$

      $$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$$

      where an overbar (e.g. $\bar{x}$) indicates the mean

      Derive yourself!
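    As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch (not from the slides; the synthetic data and names are illustrative) that computes w and b for a small one-variable dataset:

      import numpy as np

      # Synthetic 1-D data: d is roughly a linear function of x plus noise
      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=50)
      d = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)

      # Closed-form least-squares estimates (the formulas from the slide)
      w = np.sum((x - x.mean()) * (d - d.mean())) / np.sum((x - x.mean()) ** 2)
      b = d.mean() - w * x.mean()   # algebraically equivalent form of the slide's expression for b
      print(w, b)                   # should be close to 2.0 and 0.5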

  • CSE 5526: Regression 9

    Linear regression in matrix notation

    • Let $X = [\mathbf{x}_1\ \mathbf{x}_2\ \mathbf{x}_3 \ldots \mathbf{x}_N]^T$
    • Then the model predictions are $\mathbf{y} = X\mathbf{w}$
    • And the square error can be written

      $$E(\mathbf{w}) = \|\mathbf{d} - \mathbf{y}\|^2 = \|\mathbf{d} - X\mathbf{w}\|^2$$

    • To find the optimal w, set the gradient of the error with respect to w equal to 0 and solve for w:

      $$\frac{\partial}{\partial \mathbf{w}} E(\mathbf{w}) = 0 = \frac{\partial}{\partial \mathbf{w}} \|\mathbf{d} - X\mathbf{w}\|^2$$

    • See The Matrix Cookbook (Petersen & Pedersen)

  • CSE 5526: Regression 10

    Linear regression in matrix notation

    $$\frac{\partial}{\partial \mathbf{w}} E(\mathbf{w}) = \frac{\partial}{\partial \mathbf{w}} \|\mathbf{d} - X\mathbf{w}\|^2 = \frac{\partial}{\partial \mathbf{w}} (\mathbf{d} - X\mathbf{w})^T (\mathbf{d} - X\mathbf{w}) = \frac{\partial}{\partial \mathbf{w}} \left( \mathbf{d}^T\mathbf{d} - 2\mathbf{w}^T X^T \mathbf{d} + \mathbf{w}^T X^T X \mathbf{w} \right) = -2 X^T \mathbf{d} + 2 X^T X \mathbf{w}$$

    • Setting $\frac{\partial}{\partial \mathbf{w}} E(\mathbf{w}) = -2 X^T \mathbf{d} + 2 X^T X \mathbf{w} = 0$ gives

      $$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{d}$$

    • Much cleaner!
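    For comparison with the scalar closed-form solution, a minimal NumPy sketch of the matrix solution (illustrative data; it uses a least-squares solver rather than an explicit inverse for numerical stability):

      import numpy as np

      # Design matrix with a column of ones so that w[0] plays the role of the bias b
      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=(50, 1))
      X = np.hstack([np.ones((50, 1)), x])
      d = 0.5 + 2.0 * x[:, 0] + 0.1 * rng.standard_normal(50)

      # w = (X^T X)^{-1} X^T d, computed via a least-squares solver
      w, *_ = np.linalg.lstsq(X, d, rcond=None)
      print(w)   # approximately [0.5, 2.0]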

  • CSE 5526: Regression 11

    Finding optimal parameters via search

    • Often there is no closed-form solution for $\frac{\partial}{\partial \mathbf{w}} E(\mathbf{w}) = 0$

    • We can still use the gradient in a numerical solution
    • We will still use the same example to permit comparison
    • For simplicity's sake, set b = 0

      $$E(w) = \frac{1}{2}\sum_{i=1}^{N} (d_i - w x_i)^2$$

      E(w) is called a cost function
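    To make the one-parameter cost function concrete, a short sketch (synthetic data, illustrative only) that evaluates E(w) over a range of w values and locates its minimum numerically:

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=50)
      d = 2.0 * x + 0.1 * rng.standard_normal(50)   # b = 0 for simplicity, as on the slide

      def cost(w):
          # E(w) = 1/2 * sum_i (d_i - w * x_i)^2
          return 0.5 * np.sum((d - w * x) ** 2)

      ws = np.linspace(0.0, 4.0, 401)
      E = np.array([cost(w) for w in ws])
      print(ws[np.argmin(E)])   # near the least-squares solution w* ≈ 2.0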

  • CSE 5526: Regression 12

    Cost function

    [Figure: the cost function E(w) plotted against w, with its minimum E_min at w = w* and an initial guess w_0.]

    Question: how can we update w from w_0 to minimize E?

  • CSE 5526: Regression 13

    Gradient and directional derivatives

    • Consider a two-variable function f(x, y). Its gradient at the point $(x_0, y_0)^T$ is defined as

      $$\nabla f(x_0, y_0) = \left( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \right)^T_{\,x = x_0,\, y = y_0} = f_x(x_0, y_0)\,\mathbf{u}_x + f_y(x_0, y_0)\,\mathbf{u}_y$$

      where $\mathbf{u}_x$ and $\mathbf{u}_y$ are unit vectors in the x and y directions, and $f_x = \partial f / \partial x$, $f_y = \partial f / \partial y$

  • CSE 5526: Regression 14

    Gradient and directional derivatives (cont.)

    • For any given direction $\mathbf{u} = a\mathbf{u}_x + b\mathbf{u}_y$, with $a^2 + b^2 = 1$, the directional derivative at $(x_0, y_0)^T$ along the unit vector u is

      $$\begin{aligned}
      D_{\mathbf{u}} f(x_0, y_0) &= \lim_{h \to 0} \frac{f(x_0 + ah,\, y_0 + bh) - f(x_0, y_0)}{h} \\
      &= \lim_{h \to 0} \frac{[f(x_0 + ah,\, y_0 + bh) - f(x_0,\, y_0 + bh)] + [f(x_0,\, y_0 + bh) - f(x_0, y_0)]}{h} \\
      &= a f_x(x_0, y_0) + b f_y(x_0, y_0) \\
      &= \nabla f(x_0, y_0)^T \mathbf{u}
      \end{aligned}$$

    • Which direction has the greatest slope? The gradient, because of the dot product!

  • CSE 5526: Regression 15

    Gradient and directional derivatives (cont.)

    • Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$

  • CSE 5526: Regression 16

    Gradient and directional derivatives (cont.)

    • Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$
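    Assuming the example function above is reconstructed correctly as $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$, its gradient (a small worked step, not shown explicitly on the slides) is

      $$\nabla f(x, y) = \begin{pmatrix} 5x - 3y + 2 \\ -3x + 5y + 2 \end{pmatrix}$$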

  • CSE 5526: Regression 17

    Gradient and directional derivatives (cont.)

    • The level curves of a function f(x, y) are curves such that f(x, y) = k

    • Thus, the directional derivative along a level curve is 0:

      $$D_{\mathbf{u}} f(x_0, y_0) = \nabla f(x_0, y_0)^T \mathbf{u} = 0$$

    • And the gradient vector is perpendicular to the level curve

  • CSE 5526: Regression 18

    Gradient and directional derivatives (cont.)

    • The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, with a magnitude equal to the slope of the tangent to the cost function along that direction
    • Can the slope be negative?

  • CSE 5526: Regression 19

    Gradient illustration

    w

    E(w)

    w*

    Emin

    w0

    Δw w

    wwEwwEwE

    w ∆∆−−∆+

    =

    →∆ 2)()(lim

    )(

    000

    0

    Gradient

  • CSE 5526: Regression 20

    Gradient descent

    • Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

      $$w(n+1) = w(n) - \eta\, \nabla E(n)$$

      n: iteration number, η: learning rate (see previous figure)
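    A minimal gradient-descent sketch for the one-parameter cost above (illustrative data and learning rate; not code from the course):

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=50)
      d = 2.0 * x + 0.1 * rng.standard_normal(50)

      w, eta = 0.0, 0.05                    # initial weight and learning rate
      for n in range(100):
          grad = -np.sum((d - w * x) * x)   # dE/dw for E(w) = 1/2 sum (d_i - w x_i)^2
          w = w - eta * grad                # w(n+1) = w(n) - eta * gradient
      print(w)                              # converges toward the least-squares solution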

  • CSE 5526: Regression 21

    Gradient descent (cont.)

    • For the mean-square-error cost function and linear neurons

      $$E(n) = \frac{1}{2} e^2(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}[d(n) - w(n)\,x(n)]^2$$

      $$\nabla E(n) = \frac{\partial E(n)}{\partial w(n)} = e(n)\,\frac{\partial e(n)}{\partial w(n)} = -e(n)\,x(n)$$

  • CSE 5526: Regression 22

    Gradient descent (cont.)

    • Hence

      $$w(n+1) = w(n) + \eta\, e(n)\, x(n) = w(n) + \eta\, [d(n) - y(n)]\, x(n)$$

    • This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
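    A sketch of the online LMS update, processing one sample per iteration (synthetic data; names are illustrative):

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=500)
      d = 2.0 * x + 0.1 * rng.standard_normal(500)

      w, eta = 0.0, 0.1
      for n in range(len(x)):
          y = w * x[n]              # linear neuron output y(n)
          e = d[n] - y              # regression error e(n)
          w = w + eta * e * x[n]    # LMS / Widrow-Hoff update
      print(w)                      # approaches 2.0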

  • CSE 5526: Regression 23

    Stochastic gradient descent

    • If the cost function is of the form

      $$E(w) = \sum_{n=1}^{N} E_n(w)$$

    • Then one gradient descent step requires computing

      $$\Delta w = \frac{\partial}{\partial w} E(w) = \sum_{n=1}^{N} \frac{\partial}{\partial w} E_n(w)$$

    • Which means computing $E_n(w)$ or its gradient for every data point

    • Many steps may be required to reach an optimum

  • CSE 5526: Regression 24

    Stochastic gradient descent

    • It is generally much more computationally efficient to use

      $$\Delta w = \sum_{n=n_i}^{n_i + n_b - 1} \frac{\partial}{\partial w} E_n(w)$$

      for small values of $n_b$

    • This update rule may converge in many fewer passes through the data (epochs)
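    A stochastic (small-batch) variant of the descent loop sketched earlier, again with illustrative data and an assumed batch size n_b:

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=500)
      d = 2.0 * x + 0.1 * rng.standard_normal(500)

      w, eta, nb = 0.0, 0.05, 10                       # nb = samples per update
      for epoch in range(5):
          for i in range(0, len(x), nb):
              xb, db = x[i:i + nb], d[i:i + nb]
              grad = -np.sum((db - w * xb) * xb)       # gradient over the small batch only
              w = w - eta * grad
      print(w)                                         # close to 2.0 after a few epochs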

  • CSE 5526: Regression 25

    Stochastic gradient descent example

  • CSE 5526: Regression 26

    Stochastic gradient descent error functions

  • CSE 5526: Regression 27

    Stochastic gradient descent gradients

  • CSE 5526: Regression 28

    Stochastic gradient descent animation

  • CSE 5526: Regression 29

    Gradient descent animation

  • CSE 5526: Regression 30

    Multi-variable LMS

    • The analysis for the one-variable case extends to the multi-variable case, where w₀ = b (bias) and x₀ = 1, as done for perceptron learning

      $$E(n) = \frac{1}{2}\left[ d(n) - \mathbf{w}^T(n)\,\mathbf{x}(n) \right]^2$$

      $$\nabla E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_m} \right)^T$$

  • CSE 5526: Regression 31

    Multi-variable LMS (cont.)

    • The LMS algorithm

    $$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\, \nabla E(n) = \mathbf{w}(n) + \eta\, e(n)\, \mathbf{x}(n) = \mathbf{w}(n) + \eta\, [d(n) - y(n)]\, \mathbf{x}(n)$$
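    A sketch of the multi-variable LMS update with the bias folded in as w₀ (and x₀ = 1), on illustrative synthetic data:

      import numpy as np

      rng = np.random.default_rng(0)
      N, m = 500, 3
      X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, size=(N, m))])  # x0 = 1 for the bias
      w_true = np.array([0.5, 2.0, -1.0, 3.0])
      d = X @ w_true + 0.1 * rng.standard_normal(N)

      w, eta = np.zeros(m + 1), 0.1
      for n in range(N):
          e = d[n] - w @ X[n]           # e(n) = d(n) - w^T(n) x(n)
          w = w + eta * e * X[n]        # w(n+1) = w(n) + eta * e(n) * x(n)
      print(w)                          # approaches w_true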

  • CSE 5526: Regression 32

    LMS algorithm remarks

    • The LMS rule is exactly the same equation as the perceptron learning rule

    • Perceptron learning is for nonlinear (M-P) neurons, whereas LMS learning is for linear neurons
      • i.e., perceptron learning is for classification and LMS is for function approximation

    • LMS should be less sensitive to noise in the input data than perceptrons
    • On the other hand, LMS learning converges slowly

    • Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence
      • But it is not online and is computationally expensive

  • CSE 5526: Regression 33

    Stability of adaptation

    When η is too small, learning converges slowly

    When η is too large, learning doesn’t converge

  • CSE 5526: Regression 34

    Learning rate annealing

    • Basic idea: start with a large rate but gradually decrease it

    • Stochastic approximation schedule

      $$\eta(n) = \frac{c}{n}$$

      where c is a positive parameter

  • CSE 5526: Regression 35

    Learning rate annealing (cont.)

    • Search-then-converge schedule

      $$\eta(n) = \frac{\eta_0}{1 + n/\tau}$$

      where η₀ and τ are positive parameters
    • When n is small compared to τ, the learning rate is approximately constant
    • When n is large compared to τ, the schedule roughly follows the stochastic approximation
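    A small sketch comparing the two annealing schedules (the parameter values are illustrative):

      import numpy as np

      def stochastic_approximation(n, c=1.0):
          # eta(n) = c / n
          return c / n

      def search_then_converge(n, eta0=0.1, tau=100.0):
          # eta(n) = eta0 / (1 + n / tau)
          return eta0 / (1.0 + n / tau)

      for n in (1, 10, 100, 1000):
          print(n, stochastic_approximation(n), search_then_converge(n))
      # search-then-converge stays near eta0 while n << tau, then decays roughly like 1/n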

  • CSE 5526: Regression 36

    Rate annealing illustration

  • CSE 5526: Regression 37

    Nonlinear neurons

    • To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n

      $$E(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}\left[ d(n) - \varphi\!\left( \sum_j w_j\, x_j(n) \right) \right]^2$$

  • CSE 5526: Regression 38

    Nonlinear neurons (cont.)

    • By the chain rule of differentiation

      $$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial v}\,\frac{\partial v}{\partial w_j} = -[d(n) - y(n)]\,\varphi'(v(n))\,x_j(n) = -e(n)\,\varphi'(v(n))\,x_j(n)$$

  • CSE 5526: Regression 39

    Nonlinear neurons (cont.)

    • Gradient descent gives

      $$w_j(n+1) = w_j(n) + \eta\, e(n)\, \varphi'(v(n))\, x_j(n) = w_j(n) + \eta\, \delta(n)\, x_j(n)$$

    • The above is called the delta (δ) rule

    • If we choose a logistic sigmoid for φ

      $$\varphi(v) = \frac{1}{1 + \exp(-a v)}$$

      then

      $$\varphi'(v) = a\, \varphi(v)\,[1 - \varphi(v)] \quad \text{(see textbook)}$$
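    A sketch of the delta rule for a single sigmoid neuron, using the logistic activation and its derivative given above (the data are illustrative; targets are generated by a sigmoid so they lie in (0, 1)):

      import numpy as np

      def phi(v, a=1.0):
          return 1.0 / (1.0 + np.exp(-a * v))      # logistic sigmoid

      def phi_prime(v, a=1.0):
          p = phi(v, a)
          return a * p * (1.0 - p)                 # phi'(v) = a * phi(v) * [1 - phi(v)]

      rng = np.random.default_rng(0)
      X = np.hstack([np.ones((500, 1)), rng.uniform(-1, 1, size=(500, 2))])  # x0 = 1 (bias)
      w_true = np.array([0.2, 1.5, -2.0])
      d = phi(X @ w_true)                          # targets from a sigmoid neuron

      w, eta = np.zeros(3), 0.5
      for epoch in range(20):
          for n in range(len(X)):
              v = w @ X[n]                         # induced local field v(n)
              e = d[n] - phi(v)                    # error e(n)
              w = w + eta * e * phi_prime(v) * X[n]  # delta rule update
      print(w)                                     # moves toward w_true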

  • CSE 5526: Regression 40

    Role of activation function

    [Figure: the activation φ(v) and its derivative φ′(v) plotted against v.]

    The role of φ′: the weight update is most sensitive when v is near zero
