
Part II 1

CSE 5526: Introduction to Neural Networks

Linear Regression

Part II 2

Problem statement

Part II 3

Problem statement

Part II 4

Linear regression with one variable

• Given a set of N pairs of data ⟨x_i, d_i⟩, approximate d by a linear function of x (the regressor), i.e.

d ≈ wx + b

or

d_i = y_i + ε_i = φ(wx_i + b) + ε_i = wx_i + b + ε_i

where the activation function φ(x) = x is a linear function, and it corresponds to a linear neuron. y_i is the output of the neuron, and

ε_i = d_i − y_i

is called the regression (expectational) error

Part II 5

Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error

• The least squares method aims to minimize the square error E:

E = (1/2) Σ_{i=1}^{N} ε_i² = (1/2) Σ_{i=1}^{N} (d_i − y_i)²

Part II 6

Linear regression (cont.)

• To minimize this square error, a function of the two variables w and b, set

∂E/∂w = 0
∂E/∂b = 0

Part II 7

Linear regression (cont.)

∂E/∂b = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂b = −Σ_i (d_i − wx_i − b) = 0

∂E/∂w = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂w = −Σ_i (d_i − wx_i − b) x_i = 0

Part II 8

Linear regression (cont.)

• Hence

b = [Σ_i x_i² Σ_i d_i − Σ_i x_i Σ_i x_i d_i] / [N Σ_i x_i² − (Σ_i x_i)²]

w = Σ_i (x_i − x̄)(d_i − d̄) / Σ_i (x_i − x̄)²

where an overbar (i.e., x̄) indicates the mean

Derive yourself!
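As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that evaluates the two closed-form expressions above on a small synthetic data set; the variable names mirror the slide notation, and the data values are made up for the example.

```python
import numpy as np

# Synthetic data: d is roughly linear in x plus noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
d = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)
N = len(x)

# Slope from the mean-centered form: w = sum((x - x̄)(d - d̄)) / sum((x - x̄)^2)
w = np.sum((x - x.mean()) * (d - d.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept from the closed form: b = (Σx²·Σd − Σx·Σxd) / (N·Σx² − (Σx)²)
b = (np.sum(x**2) * np.sum(d) - np.sum(x) * np.sum(x * d)) / (
    N * np.sum(x**2) - np.sum(x) ** 2)

print(w, b)  # should be close to the generating values 2.0 and 0.5
```

The same intercept can also be written as b = d̄ − w x̄, which is often the easier form to remember.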

Part II 9

Linear regression (cont.)

• This method gives an optimal solution, but it can be time- and memory-consuming as a batch solution

Part II 10

Finding optimal parameters via search

• Without loss of generality, set b = 0

E(w) = (1/2) Σ_{i=1}^{N} (d_i − w x_i)²

E(w) is called a cost function

Part II 11

Cost function

[Figure: cost function E(w) versus w, with minimum value E_min attained at w = w*]

Question: how can we update w to minimize E?

Part II 12

Gradient and directional derivatives

• Without loss of generality, consider a two-variable function f(x, y). The gradient of f(x, y) at a given point (x_0, y_0)^T is

∇f = (∂f(x, y)/∂x, ∂f(x, y)/∂y)^T evaluated at x = x_0, y = y_0
   = f_x(x_0, y_0) u_x + f_y(x_0, y_0) u_y

where u_x and u_y are unit vectors in the x and y directions, and f_x = ∂f/∂x and f_y = ∂f/∂y

Part II 13

Gradient and directional derivatives (cont.)

• For any given direction u = a u_x + b u_y, with a² + b² = 1, the directional derivative at (x_0, y_0)^T along the unit vector u is

D_u f(x_0, y_0) = lim_{h→0} [f(x_0 + ah, y_0 + bh) − f(x_0, y_0)] / h
= lim_{h→0} {[f(x_0 + ah, y_0 + bh) − f(x_0, y_0 + bh)] + [f(x_0, y_0 + bh) − f(x_0, y_0)]} / h
= a f_x(x_0, y_0) + b f_y(x_0, y_0)
= ∇f(x_0, y_0)^T u

• Which direction has the greatest slope? – The gradient, because of the dot product!

Part II 14

Gradient and directional derivatives (cont.)

• Example: see blackboard
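The blackboard example is not part of the transcript; the following sketch is my own illustrative check, using the assumed function f(x, y) = x² + 3xy, that the directional derivative from the limit definition matches a f_x + b f_y = ∇f^T u.

```python
import numpy as np

def f(x, y):                      # illustrative two-variable function
    return x**2 + 3.0 * x * y

def grad_f(x, y):                 # analytic gradient (f_x, f_y)
    return np.array([2.0 * x + 3.0 * y, 3.0 * x])

x0, y0 = 1.0, 2.0
a, b = 3.0 / 5.0, 4.0 / 5.0       # unit direction: a^2 + b^2 = 1

# Directional derivative via the limit definition (small h)
h = 1e-6
D_u_numeric = (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h

# Directional derivative via the gradient: a*f_x + b*f_y = ∇f^T u
D_u_gradient = grad_f(x0, y0) @ np.array([a, b])

print(D_u_numeric, D_u_gradient)  # the two values agree closely
```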

Part II 15

Gradient and directional derivatives (cont.)

• To find the gradient at a particular point (x_0, y_0)^T, first find the level curve or contour of f(x, y) at that point, C(x_0, y_0). A tangent vector u to C satisfies

D_u f(x_0, y_0) = ∇f(x_0, y_0)^T u = 0

because f(x, y) is constant on a level curve. Hence the gradient vector is perpendicular to the tangent vector
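A small numerical illustration (my own, not from the slides): for the assumed function f(x, y) = x² + y² the level curve through a point is a circle, the tangent direction at (x_0, y_0) is proportional to (−y_0, x_0), and the gradient (2x_0, 2y_0) is perpendicular to it.

```python
import numpy as np

x0, y0 = 2.0, 1.0                       # point on the level curve x^2 + y^2 = 5
grad = np.array([2.0 * x0, 2.0 * y0])   # gradient of f(x, y) = x^2 + y^2
tangent = np.array([-y0, x0])           # tangent direction of the circle at (x0, y0)
u = tangent / np.linalg.norm(tangent)   # unit tangent vector

print(grad @ u)  # dot product is 0: D_u f = 0 along the level curve
```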

Part II 16

An illustration of level curves

Part II 17

Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum E increase and has a magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?

Part II 18

Gradient illustration

[Figure: cost function E(w) versus w, with minimum E_min at w = w*; the gradient at a point w_0 is the slope of the tangent there, estimated from E(w_0 ± Δw)]

Gradient:

∂E(w_0)/∂w = lim_{Δw→0} [E(w_0 + Δw) − E(w_0 − Δw)] / (2Δw)
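As a sketch (not from the slides), the central-difference formula above can be used to check an analytic gradient numerically; here the cost is the one-variable E(w) from the earlier slide, with illustrative data values.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 1.5])      # illustrative inputs
d = np.array([1.1, -2.2, 3.9, 3.1])      # illustrative targets

def E(w):                                # E(w) = 1/2 * sum((d_i - w*x_i)^2)
    return 0.5 * np.sum((d - w * x) ** 2)

w0, dw = 0.8, 1e-5

# Central difference: [E(w0 + Δw) - E(w0 - Δw)] / (2Δw)
grad_numeric = (E(w0 + dw) - E(w0 - dw)) / (2 * dw)

# Analytic gradient: dE/dw = -sum((d_i - w*x_i) * x_i)
grad_analytic = -np.sum((d - w0 * x) * x)

print(grad_numeric, grad_analytic)       # the two should agree closely
```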

Part II 19

Gradient descent

• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

w(n+1) = w(n) − η ∇E(n)

n: iteration number
η: learning rate

• See previous figure
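A minimal sketch of this update applied to the one-variable cost E(w) = (1/2) Σ (d_i − w x_i)² with b = 0; the data values and learning rate are illustrative choices, not from the slides.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 1.5])   # illustrative inputs
d = 2.0 * x                            # targets consistent with w* = 2, b = 0

w, eta = 0.0, 0.1                      # initial weight and learning rate
for n in range(100):
    grad = -np.sum((d - w * x) * x)    # ∇E(w) = -Σ (d_i - w x_i) x_i
    w = w - eta * grad                 # w(n+1) = w(n) - η ∇E(n)

print(w)  # converges toward w* = 2.0
```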

Part II 20

Gradient descent (cont.)

• For the mean-square-error cost function:

E(n) = (1/2) e²(n) = (1/2) [d(n) − y(n)]²
     = (1/2) [d(n) − w(n) x(n)]²   (for linear neurons)

∇E(n) = ∂E(n)/∂w(n) = (1/2) ∂e²(n)/∂w(n) = −e(n) x(n)

Part II 21

Gradient descent (cont.)

• Hence

w(n+1) = w(n) + η e(n) x(n)
       = w(n) + η [d(n) − y(n)] x(n)

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
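Here is a minimal sketch of the one-variable LMS (Widrow-Hoff) update applied sample by sample; the data set and learning rate are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)            # illustrative inputs
d = 1.5 * x + 0.05 * rng.standard_normal(200)   # noisy targets, true w = 1.5

w, eta = 0.0, 0.1
for n in range(len(x)):
    y = w * x[n]                  # linear neuron output y(n) = w(n) x(n)
    e = d[n] - y                  # error e(n) = d(n) - y(n)
    w = w + eta * e * x[n]        # LMS: w(n+1) = w(n) + η e(n) x(n)

print(w)  # approaches the true slope 1.5
```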

Part II 22

Multi-variable case

• The analysis for the one-variable case extends to the multi-variable case

E(n) = (1/2) [d(n) − w^T(n) x(n)]²

where w_0 = b (bias) and x_0 = 1, as done for perceptron learning

∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_m)^T

Part II 23

Multi-variable case (cont.)

• The LMS algorithm

w(n+1) = w(n) − η ∇E(n)
       = w(n) + η e(n) x(n)
       = w(n) + η [d(n) − y(n)] x(n)
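A minimal multi-variable sketch, with the bias absorbed as w_0 = b and x_0 = 1 as in the previous slide; the dimensionality, data, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3
w_true = np.array([0.5, 1.0, -2.0, 0.3])            # [b, w_1, ..., w_m]

X = rng.uniform(-1.0, 1.0, size=(500, m))
X = np.hstack([np.ones((500, 1)), X])                # prepend x_0 = 1 for the bias
d = X @ w_true + 0.05 * rng.standard_normal(500)     # noisy linear targets

w, eta = np.zeros(m + 1), 0.05
for n in range(len(d)):
    y = w @ X[n]                   # y(n) = w(n)^T x(n)
    e = d[n] - y                   # e(n) = d(n) - y(n)
    w = w + eta * e * X[n]         # w(n+1) = w(n) + η e(n) x(n)

print(w)  # close to w_true
```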

Part II 24

LMS algorithm

• Remarks
• The LMS rule is exactly the same in math form as the perceptron learning rule
• Perceptron learning is for McCulloch-Pitts neurons, which are nonlinear, whereas LMS learning is for linear neurons. In other words, perceptron learning is for classification and LMS is for function approximation
• LMS should be less sensitive to noise in the input data than perceptrons. On the other hand, LMS learning converges slowly
• Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence, but it is not an online algorithm and is computationally expensive

Part II 25

Stability of adaptation

When η is too small, learning converges slowly

Part II 26

Stability of adaptation (cont.)

When η is too large, learning doesn’t converge

Part II 27

Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation

η(n) = c / n

c is a positive parameter

Part II 28

Learning rate annealing (cont.)

• Search-then-converge

η(n) = η_0 / (1 + n/τ)

η_0 and τ are positive parameters

• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the learning rate schedule roughly follows stochastic approximation
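A small sketch computing the two annealing schedules side by side; the constants c, η_0, and τ are illustrative choices, not values from the slides.

```python
import numpy as np

n = np.arange(1, 1001)          # iteration numbers

c, eta0, tau = 1.0, 0.5, 100.0  # illustrative constants

eta_stochastic = c / n                  # stochastic approximation: η(n) = c / n
eta_search = eta0 / (1.0 + n / tau)     # search-then-converge: η(n) = η0 / (1 + n/τ)

# For n << τ the second schedule stays near η0; for n >> τ it decays roughly like η0·τ/n.
print(eta_search[:3], eta_search[-3:])
```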

Part II 29

Rate annealing illustration

Part II 30

Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n

E(n) = (1/2) [d(n) − y(n)]² = (1/2) [d(n) − φ(Σ_j w_j x_j(n))]²

Part II 31

Nonlinear neurons (cont.)

• By the chain rule of differentiation

∂E/∂w_j = (∂E/∂y)(∂y/∂v)(∂v/∂w_j)
        = −[d(n) − y(n)] φ′(v(n)) x_j(n)
        = −e(n) φ′(v(n)) x_j(n)

Part II 32

Nonlinear neurons (cont.)

• Gradient descent then gives

w_j(n+1) = w_j(n) + η e(n) φ′(v(n)) x_j(n)
         = w_j(n) + η δ(n) x_j(n)

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ

φ(v) = 1 / (1 + exp(−av))

then

φ′(v) = a φ(v)[1 − φ(v)]   (see textbook)
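A minimal sketch of the delta rule for a single sigmoidal neuron; the data, the slope a, and the learning rate are illustrative choices, not from the slides.

```python
import numpy as np

def phi(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))          # logistic sigmoid

def phi_prime(v, a=1.0):
    s = phi(v, a)
    return a * s * (1.0 - s)                     # φ'(v) = a φ(v)[1 - φ(v)]

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
d = phi(X @ np.array([2.0, -1.0]))               # targets from a "true" sigmoidal neuron

w, eta = np.zeros(2), 0.5
for n in range(len(d)):
    v = w @ X[n]                                 # induced local field v(n)
    e = d[n] - phi(v)                            # error e(n) = d(n) - y(n)
    delta = e * phi_prime(v)                     # δ(n) = e(n) φ'(v(n))
    w = w + eta * delta * X[n]                   # w_j(n+1) = w_j(n) + η δ(n) x_j(n)

print(w)  # moves toward the generating weights [2.0, -1.0]
```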

Part II 33

Role of activation function

[Figure: the activation φ(v) and its derivative φ′(v) plotted against v]

The role of φ′: weight update is most sensitive when v is near zero