
Part II 1

CSE 5526: Introduction to Neural Networks

Linear Regression


Part II 2

Problem statement


Part II 3

Problem statement


Part II 4

Linear regression with one variable

• Given a set of N pairs of data $\langle x_i, d_i \rangle$, approximate d by a linear function of x (the regressor), i.e.

$$d \approx wx + b$$

or

$$d_i = y_i + \varepsilon_i = \varphi(wx_i + b) + \varepsilon_i = wx_i + b + \varepsilon_i$$

where the activation function φ(x) = x is a linear function, corresponding to a linear neuron. Here $y_i$ is the output of the neuron, and

$$\varepsilon_i = d_i - y_i$$

is called the regression (expectational) error


Part II 5

Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error

• The least squares method aims to minimize the square error E:

$$E = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2$$


Part II 6

Linear regression (cont.)

• To minimize E, a quadratic function of the two variables w and b, set

$$\frac{\partial E}{\partial w} = 0, \qquad \frac{\partial E}{\partial b} = 0$$


Part II 7

Linear regression (cont.)

$$\frac{\partial E}{\partial b} = \frac{\partial\,\frac{1}{2}\sum_i (d_i - wx_i - b)^2}{\partial b} = -\sum_i (d_i - wx_i - b) = 0$$

$$\frac{\partial E}{\partial w} = \frac{\partial\,\frac{1}{2}\sum_i (d_i - wx_i - b)^2}{\partial w} = -\sum_i (d_i - wx_i - b)\,x_i = 0$$


Part II 8

Linear regression (cont.)

• Hence

$$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N\left[\sum_i x_i^2 - \bar{x}\sum_i x_i\right]}$$

$$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$$

where an overbar (e.g. $\bar{x}$) indicates the mean

Derive yourself!
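A minimal NumPy sketch of this closed-form fit (the function name and synthetic data are illustrative, not from the slides); it computes b through the equivalent relation $b = \bar{d} - w\bar{x}$, which follows from the normal equation for b:

```python
import numpy as np

def fit_linear_regressor(x, d):
    """Closed-form least-squares fit of d ~ w*x + b for one-variable data."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    x_bar, d_bar = x.mean(), d.mean()
    # w = sum_i (x_i - x_bar)(d_i - d_bar) / sum_i (x_i - x_bar)^2
    w = np.sum((x - x_bar) * (d - d_bar)) / np.sum((x - x_bar) ** 2)
    # Algebraically equivalent to the slide's formula for b
    b = d_bar - w * x_bar
    return w, b

# Illustrative usage on noisy synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
d = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)
print(fit_linear_regressor(x, d))  # approximately (2.0, 0.5)
```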


Part II 9

Linear regression (cont.)

• This method gives an optimal solution, but it can be time-and memory-consuming as a batch solution


Part II 10

Finding optimal parameters via search

• Without loss of generality, set b = 0

E(w) is called a cost function

$$E(w) = \frac{1}{2}\sum_{i=1}^{N} (d_i - wx_i)^2$$
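As a small illustration (function and variable names are assumptions, not from the slides), this cost function can be written directly in NumPy:

```python
import numpy as np

def cost(w, x, d):
    """E(w) = 0.5 * sum_i (d_i - w * x_i)^2, with the bias b set to 0."""
    return 0.5 * np.sum((np.asarray(d) - w * np.asarray(x)) ** 2)
```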


Part II 11

Cost function

[Figure: cost function E(w) versus w, with minimum E_min at w = w*]

Question: how can we update w to minimize E?


Part II 12

Gradient and directional derivatives

• Without loss of generality, consider a two-variable function f(x, y). The gradient of f(x, y) at a given point (x₀, y₀)ᵀ is

$$\nabla f = \left(\frac{\partial f(x,y)}{\partial x},\ \frac{\partial f(x,y)}{\partial y}\right)^T\Bigg|_{x=x_0,\ y=y_0} = f_x(x_0, y_0)\,\mathbf{u}_x + f_y(x_0, y_0)\,\mathbf{u}_y$$

where $\mathbf{u}_x$ and $\mathbf{u}_y$ are unit vectors in the x and y directions, and $f_x = \partial f/\partial x$ and $f_y = \partial f/\partial y$


Part II 13

Gradient and directional derivatives (cont.)

• Along any given direction $\mathbf{u} = a\,\mathbf{u}_x + b\,\mathbf{u}_y$ with $a^2 + b^2 = 1$, the directional derivative at (x₀, y₀)ᵀ along the unit vector u is

$$D_{\mathbf{u}} f(x_0, y_0) = \lim_{h\to 0}\frac{f(x_0 + ah,\, y_0 + bh) - f(x_0, y_0)}{h}$$

$$= \lim_{h\to 0}\frac{[f(x_0 + ah,\, y_0 + bh) - f(x_0,\, y_0 + bh)] + [f(x_0,\, y_0 + bh) - f(x_0, y_0)]}{h}$$

$$= a\,f_x(x_0, y_0) + b\,f_y(x_0, y_0) = \nabla f(x_0, y_0)^T \mathbf{u}$$

• Which direction has the greatest slope? – The gradient, because of the dot product!


Part II 14

Gradient and directional derivatives (cont.)

• Example: see blackboard
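A possible worked example of this kind (an illustrative choice, not necessarily the one given on the blackboard): let $f(x, y) = x^2 + y^2$, take the point $(1, 1)^T$ and the direction $\mathbf{u} = (1/\sqrt{2},\ 1/\sqrt{2})^T$. Then

$$\nabla f(1,1) = (2x,\ 2y)^T\big|_{(1,1)} = (2,\ 2)^T$$

$$D_{\mathbf{u}} f(1,1) = \nabla f(1,1)^T \mathbf{u} = \frac{2}{\sqrt{2}} + \frac{2}{\sqrt{2}} = 2\sqrt{2}$$

No other unit direction yields a larger value, consistent with the gradient pointing in the direction of steepest increase.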


Part II 15

Gradient and directional derivatives (cont.)

• To find the gradient at a particular point (x₀, y₀)ᵀ, first find the level curve or contour of f(x, y) at that point, C(x₀, y₀). A tangent vector u to C satisfies

$$D_{\mathbf{u}} f(x_0, y_0) = \nabla f^T \mathbf{u} = 0$$

because f(x, y) is constant on a level curve. Hence the gradient vector is perpendicular to the tangent vector


Part II 16

An illustration of level curves


Part II 17

Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, and whose magnitude equals the slope of the tangent to the cost function along that direction
• Can the slope be negative?


Part II 18

Gradient illustration

[Figure: cost function E(w) versus w, showing the minimum E_min at w*, a point w₀ and a step Δw, with the gradient as the slope of the tangent at w₀]

$$\nabla E(w_0) = \lim_{\Delta w \to 0}\frac{E(w_0 + \Delta w) - E(w_0 - \Delta w)}{2\,\Delta w}$$


Part II 19

Gradient descent

• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

$$w(n+1) = w(n) - \eta\,\nabla E(n)$$

n: iteration number
η: learning rate

• See previous figure


Part II 20

Gradient descent (cont.)

• For the mean-square-error cost function:

$$E(n) = \frac{1}{2}e^2(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}[d(n) - w(n)x(n)]^2$$

(the last equality holds for linear neurons)

$$\nabla E(n) = \frac{\partial E(n)}{\partial w(n)} = \frac{1}{2}\frac{\partial e^2(n)}{\partial w(n)} = -e(n)\,x(n)$$


Part II 21

Gradient descent (cont.)

• Hence

$$w(n+1) = w(n) + \eta\,e(n)\,x(n) = w(n) + \eta\,[d(n) - y(n)]\,x(n)$$

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
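A short sketch of this online update loop in Python (function name, initial weight, and hyperparameter values are illustrative):

```python
def lms_scalar(x, d, eta=0.1, epochs=20):
    """Online LMS / Widrow-Hoff updates for a single weight w (bias set to 0, as above)."""
    w = 0.0
    for _ in range(epochs):
        for xn, dn in zip(x, d):
            y = w * xn           # linear neuron output y(n)
            e = dn - y           # error e(n) = d(n) - y(n)
            w += eta * e * xn    # w(n+1) = w(n) + eta * e(n) * x(n)
    return w
```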


Part II 22

Multi-variable case

• The analysis for the one-variable case extends to the multi-variable case

$$E(n) = \frac{1}{2}\left[d(n) - \mathbf{w}^T(n)\,\mathbf{x}(n)\right]^2$$

where w₀ = b (the bias) and x₀ = 1, as done for perceptron learning

$$\nabla E(\mathbf{w}) = \left(\frac{\partial E}{\partial w_0},\ \frac{\partial E}{\partial w_1},\ \ldots,\ \frac{\partial E}{\partial w_m}\right)^T$$


Part II 23

Multi-variable case (cont.)

• The LMS algorithm

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\nabla E(n) = \mathbf{w}(n) + \eta\,e(n)\,\mathbf{x}(n) = \mathbf{w}(n) + \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$$
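The same loop for the multi-variable case, as a hedged NumPy sketch (names and hyperparameter values are illustrative); the bias is absorbed by prepending x₀ = 1:

```python
import numpy as np

def lms(X, d, eta=0.05, epochs=20):
    """Online LMS for a linear neuron with weight vector w (w[0] plays the role of the bias b)."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # prepend x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xn, dn in zip(X, d):
            e = dn - w @ xn      # e(n) = d(n) - w(n)^T x(n)
            w += eta * e * xn    # w(n+1) = w(n) + eta * e(n) * x(n)
    return w
```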


Part II 24

LMS algorithm

• Remarks
  • The LMS rule has exactly the same mathematical form as the perceptron learning rule
  • Perceptron learning is for McCulloch-Pitts neurons, which are nonlinear, whereas LMS learning is for linear neurons. In other words, perceptron learning is for classification and LMS is for function approximation
  • LMS should be less sensitive to noise in the input data than perceptrons; on the other hand, LMS learning converges slowly
  • Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence, but it is not an online method and is computationally expensive


Part II 25

Stability of adaptation

When η is too small, learning converges slowly


Part II 26

Stability of adaptation (cont.)

When η is too large, learning doesn’t converge


Part II 27

Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation

$$\eta(n) = \frac{c}{n}$$

where c is a positive parameter


Part II 28

Learning rate annealing (cont.)

• Search-then-converge

$$\eta(n) = \frac{\eta_0}{1 + (n/\tau)}$$

where η₀ and τ are positive parameters

• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the schedule roughly follows stochastic approximation
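Both schedules are easy to express directly; a small sketch (parameter values are illustrative):

```python
def eta_stochastic(n, c=1.0):
    """Stochastic approximation: eta(n) = c / n."""
    return c / n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    """Search-then-converge: eta(n) = eta0 / (1 + n / tau)."""
    return eta0 / (1.0 + n / tau)

# For n much smaller than tau the rate stays near eta0;
# for n much larger than tau it decays roughly like 1/n.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(eta_search_then_converge(n), 5))
```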


Part II 29

Rate annealing illustration


Part II 30

Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n

$$E(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}\Big[d(n) - \varphi\Big(\sum_j w_j x_j(n)\Big)\Big]^2$$


Part II 31

Nonlinear neurons (cont.)

• By the chain rule of differentiation

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial v}\,\frac{\partial v}{\partial w_j} = -[d(n) - y(n)]\,\varphi'(v(n))\,x_j(n) = -e(n)\,\varphi'(v(n))\,x_j(n)$$


Part II 32

Nonlinear neurons (cont.)

• Gradient descent then gives

$$w_j(n+1) = w_j(n) + \eta\,e(n)\,\varphi'(v(n))\,x_j(n) = w_j(n) + \eta\,\delta(n)\,x_j(n)$$

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ,

$$\varphi(v) = \frac{1}{1 + \exp(-av)}$$

then

$$\varphi'(v) = a\,\varphi(v)\,[1 - \varphi(v)] \qquad \text{(see textbook)}$$
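A sketch of one delta-rule update with the logistic sigmoid (function names and the learning-rate value are illustrative):

```python
import numpy as np

def sigmoid(v, a=1.0):
    """Logistic sigmoid phi(v) = 1 / (1 + exp(-a*v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def delta_rule_step(w, x, d, eta=0.1, a=1.0):
    """One delta-rule update for a single sigmoidal neuron."""
    v = w @ x                        # induced local field v(n)
    y = sigmoid(v, a)                # output y(n) = phi(v(n))
    e = d - y                        # error e(n) = d(n) - y(n)
    phi_prime = a * y * (1.0 - y)    # phi'(v) = a * phi(v) * (1 - phi(v))
    delta = e * phi_prime            # delta(n) = e(n) * phi'(v(n))
    return w + eta * delta * x       # w(n+1) = w(n) + eta * delta(n) * x(n)
```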


Part II 33

Role of activation function

[Figure: the activation function φ(v) and its derivative φ′(v) plotted against v]

The role of φ′: weight update is most sensitive when v is near zero