
Part II 1

CSE 5526: Introduction to Neural Networks

Linear Regression

Part II 2

Problem statement

Part II 3

Problem statement

Part II 4

Linear regression with one variable

• Given a set of N pairs of data ⟨x_i, d_i⟩, approximate d by a linear function of x (the regressor), i.e.

d ≈ wx + b

or

d_i = y_i + ε_i = φ(wx_i + b) + ε_i = wx_i + b + ε_i

where the activation function φ(x) = x is a linear function, and it corresponds to a linear neuron. y_i is the output of the neuron, and

ε_i = d_i − y_i

is called the regression (expectational) error

Part II 5

Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error

• The least squares method aims to minimize the square error E:

E = (1/2) Σ_{i=1}^{N} ε_i² = (1/2) Σ_{i=1}^{N} (d_i − y_i)²

Part II 6

Linear regression (cont.)

• To minimize this square error, a function of the two variables w and b, set

∂E/∂w = 0
∂E/∂b = 0

Part II 7

Linear regression (cont.)

∂E/∂b = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂b = −Σ_i (d_i − wx_i − b) = 0

∂E/∂w = ∂[(1/2) Σ_i (d_i − wx_i − b)²]/∂w = −Σ_i (d_i − wx_i − b) x_i = 0

Part II 8

Linear regression (cont.)

• Hence

b = [Σ_i x_i² Σ_i d_i − Σ_i x_i Σ_i x_i d_i] / [N Σ_i x_i² − (Σ_i x_i)²]

w = Σ_i (x_i − x̄)(d_i − d̄) / Σ_i (x_i − x̄)²

where an overbar (i.e., x̄) indicates the mean

Derive yourself!
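As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that evaluates the two closed-form expressions above on a small synthetic data set; the variable names mirror the slide notation, and the data values are made up for the example.

```python
import numpy as np

# Synthetic data: d is roughly linear in x plus noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
d = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)
N = len(x)

# Slope from the mean-centered form: w = sum((x - x̄)(d - d̄)) / sum((x - x̄)^2)
w = np.sum((x - x.mean()) * (d - d.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept from the closed form: b = (Σx²·Σd − Σx·Σxd) / (N·Σx² − (Σx)²)
b = (np.sum(x**2) * np.sum(d) - np.sum(x) * np.sum(x * d)) / (
    N * np.sum(x**2) - np.sum(x) ** 2)

print(w, b)  # should be close to the generating values 2.0 and 0.5
```

The same intercept can also be written as b = d̄ − w x̄, which is often the easier form to remember.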

Part II 9

Linear regression (cont.)

• This method gives an optimal solution, but it can be time- and memory-consuming as a batch solution

Part II 10

Finding optimal parameters via search

• Without loss of generality, set b = 0

E(w) = (1/2) Σ_{i=1}^{N} (d_i − w x_i)²

E(w) is called a cost function

Part II 11

Cost function

[Figure: cost function E(w) versus w, with minimum value E_min attained at w = w*]

Question: how can we update w to minimize E?

Part II 12

Gradient and directional derivatives

• Without loss of generality, consider a two-variable function f(x, y). The gradient of f(x, y) at a given point (x_0, y_0)^T is

∇f = (∂f(x, y)/∂x, ∂f(x, y)/∂y)^T evaluated at x = x_0, y = y_0
   = f_x(x_0, y_0) u_x + f_y(x_0, y_0) u_y

where u_x and u_y are unit vectors in the x and y directions, and f_x = ∂f/∂x and f_y = ∂f/∂y

Part II 13

Gradient and directional derivatives (cont.)

• For any given direction u = a u_x + b u_y, with a² + b² = 1, the directional derivative at (x_0, y_0)^T along the unit vector u is

D_u f(x_0, y_0) = lim_{h→0} [f(x_0 + ah, y_0 + bh) − f(x_0, y_0)] / h
= lim_{h→0} {[f(x_0 + ah, y_0 + bh) − f(x_0, y_0 + bh)] + [f(x_0, y_0 + bh) − f(x_0, y_0)]} / h
= a f_x(x_0, y_0) + b f_y(x_0, y_0)
= ∇f(x_0, y_0)^T u

• Which direction has the greatest slope? – The gradient, because of the dot product!

Part II 14

Gradient and directional derivatives (cont.)

• Example: see blackboard
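The blackboard example is not part of the transcript; the following sketch is my own illustrative check, using the assumed function f(x, y) = x² + 3xy, that the directional derivative from the limit definition matches a f_x + b f_y = ∇f^T u.

```python
import numpy as np

def f(x, y):                      # illustrative two-variable function
    return x**2 + 3.0 * x * y

def grad_f(x, y):                 # analytic gradient (f_x, f_y)
    return np.array([2.0 * x + 3.0 * y, 3.0 * x])

x0, y0 = 1.0, 2.0
a, b = 3.0 / 5.0, 4.0 / 5.0       # unit direction: a^2 + b^2 = 1

# Directional derivative via the limit definition (small h)
h = 1e-6
D_u_numeric = (f(x0 + a * h, y0 + b * h) - f(x0, y0)) / h

# Directional derivative via the gradient: a*f_x + b*f_y = ∇f^T u
D_u_gradient = grad_f(x0, y0) @ np.array([a, b])

print(D_u_numeric, D_u_gradient)  # the two values agree closely
```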

Part II 15

Gradient and directional derivatives (cont.)

• To find the gradient at a particular point (x_0, y_0)^T, first find the level curve or contour of f(x, y) at that point, C(x_0, y_0). A tangent vector u to C satisfies

D_u f(x_0, y_0) = ∇f(x_0, y_0)^T u = 0

because f(x, y) is constant on a level curve. Hence the gradient vector is perpendicular to the tangent vector
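A small numerical illustration (my own, not from the slides): for the assumed function f(x, y) = x² + y² the level curve through a point is a circle, the tangent direction at (x_0, y_0) is proportional to (−y_0, x_0), and the gradient (2x_0, 2y_0) is perpendicular to it.

```python
import numpy as np

x0, y0 = 2.0, 1.0                       # point on the level curve x^2 + y^2 = 5
grad = np.array([2.0 * x0, 2.0 * y0])   # gradient of f(x, y) = x^2 + y^2
tangent = np.array([-y0, x0])           # tangent direction of the circle at (x0, y0)
u = tangent / np.linalg.norm(tangent)   # unit tangent vector

print(grad @ u)  # dot product is 0: D_u f = 0 along the level curve
```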

Part II 16

An illustration of level curves

Part II 17

Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum E increase and has a magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?

Part II 18

Gradient illustration

[Figure: cost function E(w) versus w, with minimum E_min at w = w*; the gradient at a point w_0 is the slope of the tangent there, estimated from E(w_0 ± Δw)]

Gradient:

∂E(w_0)/∂w = lim_{Δw→0} [E(w_0 + Δw) − E(w_0 − Δw)] / (2Δw)
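As a sketch (not from the slides), the central-difference formula above can be used to check an analytic gradient numerically; here the cost is the one-variable E(w) from the earlier slide, with illustrative data values.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 1.5])      # illustrative inputs
d = np.array([1.1, -2.2, 3.9, 3.1])      # illustrative targets

def E(w):                                # E(w) = 1/2 * sum((d_i - w*x_i)^2)
    return 0.5 * np.sum((d - w * x) ** 2)

w0, dw = 0.8, 1e-5

# Central difference: [E(w0 + Δw) - E(w0 - Δw)] / (2Δw)
grad_numeric = (E(w0 + dw) - E(w0 - dw)) / (2 * dw)

# Analytic gradient: dE/dw = -sum((d_i - w*x_i) * x_i)
grad_analytic = -np.sum((d - w0 * x) * x)

print(grad_numeric, grad_analytic)       # the two should agree closely
```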

Part II 19

Gradient descent

• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

w(n+1) = w(n) − η ∇E(n)

n: iteration number
η: learning rate

• See previous figure
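A minimal sketch of this update applied to the one-variable cost E(w) = (1/2) Σ (d_i − w x_i)² with b = 0; the data values and learning rate are illustrative choices, not from the slides.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 1.5])   # illustrative inputs
d = 2.0 * x                            # targets consistent with w* = 2, b = 0

w, eta = 0.0, 0.1                      # initial weight and learning rate
for n in range(100):
    grad = -np.sum((d - w * x) * x)    # ∇E(w) = -Σ (d_i - w x_i) x_i
    w = w - eta * grad                 # w(n+1) = w(n) - η ∇E(n)

print(w)  # converges toward w* = 2.0
```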

Part II 20

Gradient descent (cont.)

• For the mean-square-error cost function:

E(n) = (1/2) e²(n) = (1/2) [d(n) − y(n)]²
     = (1/2) [d(n) − w(n) x(n)]²   (for linear neurons)

∇E(n) = ∂E(n)/∂w(n) = (1/2) ∂e²(n)/∂w(n) = −e(n) x(n)

Part II 21

Gradient descent (cont.)

• Hence

w(n+1) = w(n) + η e(n) x(n)
       = w(n) + η [d(n) − y(n)] x(n)

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
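Here is a minimal sketch of the one-variable LMS (Widrow-Hoff) update applied sample by sample; the data set and learning rate are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)            # illustrative inputs
d = 1.5 * x + 0.05 * rng.standard_normal(200)   # noisy targets, true w = 1.5

w, eta = 0.0, 0.1
for n in range(len(x)):
    y = w * x[n]                  # linear neuron output y(n) = w(n) x(n)
    e = d[n] - y                  # error e(n) = d(n) - y(n)
    w = w + eta * e * x[n]        # LMS: w(n+1) = w(n) + η e(n) x(n)

print(w)  # approaches the true slope 1.5
```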

Part II 22

Multi-variable case

• The analysis for the one-variable case extends to the multi-variable case

E(n) = (1/2) [d(n) − w^T(n) x(n)]²

where w_0 = b (bias) and x_0 = 1, as done for perceptron learning

∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_m)^T

Part II 23

Multi-variable case (cont.)

• The LMS algorithm

w(n+1) = w(n) − η ∇E(n)
       = w(n) + η e(n) x(n)
       = w(n) + η [d(n) − y(n)] x(n)
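A minimal multi-variable sketch, with the bias absorbed as w_0 = b and x_0 = 1 as in the previous slide; the dimensionality, data, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3
w_true = np.array([0.5, 1.0, -2.0, 0.3])            # [b, w_1, ..., w_m]

X = rng.uniform(-1.0, 1.0, size=(500, m))
X = np.hstack([np.ones((500, 1)), X])                # prepend x_0 = 1 for the bias
d = X @ w_true + 0.05 * rng.standard_normal(500)     # noisy linear targets

w, eta = np.zeros(m + 1), 0.05
for n in range(len(d)):
    y = w @ X[n]                   # y(n) = w(n)^T x(n)
    e = d[n] - y                   # e(n) = d(n) - y(n)
    w = w + eta * e * X[n]         # w(n+1) = w(n) + η e(n) x(n)

print(w)  # close to w_true
```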

Part II 24

LMS algorithm

• Remarks
• The LMS rule is exactly the same in math form as the perceptron learning rule
• Perceptron learning is for McCulloch-Pitts neurons, which are nonlinear, whereas LMS learning is for linear neurons. In other words, perceptron learning is for classification and LMS is for function approximation
• LMS should be less sensitive to noise in the input data than perceptrons. On the other hand, LMS learning converges slowly
• Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence, but it is not an online algorithm and is computationally expensive

Part II 25

Stability of adaptation

When η is too small, learning converges slowly

Part II 26

Stability of adaptation (cont.)

When η is too large, learning doesn’t converge

Part II 27

Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation

η(n) = c / n

c is a positive parameter

Part II 28

Learning rate annealing (cont.)

• Search-then-converge

η(n) = η_0 / (1 + n/τ)

η_0 and τ are positive parameters

• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the learning rate schedule roughly follows stochastic approximation
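A small sketch computing the two annealing schedules side by side; the constants c, η_0, and τ are illustrative choices, not values from the slides.

```python
import numpy as np

n = np.arange(1, 1001)          # iteration numbers

c, eta0, tau = 1.0, 0.5, 100.0  # illustrative constants

eta_stochastic = c / n                  # stochastic approximation: η(n) = c / n
eta_search = eta0 / (1.0 + n / tau)     # search-then-converge: η(n) = η0 / (1 + n/τ)

# For n << τ the second schedule stays near η0; for n >> τ it decays roughly like η0·τ/n.
print(eta_search[:3], eta_search[-3:])
```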

Part II 29

Rate annealing illustration

Part II 30

Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n

E(n) = (1/2) [d(n) − y(n)]² = (1/2) [d(n) − φ(Σ_j w_j x_j(n))]²

Part II 31

Nonlinear neurons (cont.)

• By the chain rule of differentiation

∂E/∂w_j = (∂E/∂y)(∂y/∂v)(∂v/∂w_j)
        = −[d(n) − y(n)] φ′(v(n)) x_j(n)
        = −e(n) φ′(v(n)) x_j(n)

Part II 32

Nonlinear neurons (cont.)

• Gradient descent then gives

w_j(n+1) = w_j(n) + η e(n) φ′(v(n)) x_j(n)
         = w_j(n) + η δ(n) x_j(n)

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ

φ(v) = 1 / (1 + exp(−av))

then

φ′(v) = a φ(v)[1 − φ(v)]   (see textbook)
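A minimal sketch of the delta rule for a single sigmoidal neuron; the data, the slope a, and the learning rate are illustrative choices, not from the slides.

```python
import numpy as np

def phi(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))          # logistic sigmoid

def phi_prime(v, a=1.0):
    s = phi(v, a)
    return a * s * (1.0 - s)                     # φ'(v) = a φ(v)[1 - φ(v)]

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
d = phi(X @ np.array([2.0, -1.0]))               # targets from a "true" sigmoidal neuron

w, eta = np.zeros(2), 0.5
for n in range(len(d)):
    v = w @ X[n]                                 # induced local field v(n)
    e = d[n] - phi(v)                            # error e(n) = d(n) - y(n)
    delta = e * phi_prime(v)                     # δ(n) = e(n) φ'(v(n))
    w = w + eta * delta * X[n]                   # w_j(n+1) = w_j(n) + η δ(n) x_j(n)

print(w)  # moves toward the generating weights [2.0, -1.0]
```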

Part II 33

Role of activation function

[Figure: the activation φ(v) and its derivative φ′(v) plotted against v]

The role of φ′: weight update is most sensitive when v is near zero