
Transcript of Artificial Neural Networks - ut

Page 1: Artificial Neural Networks - ut

Artificial Neural Networks

Introduction to Computational Neuroscience, Ardi Tampuu, 17.10.2016

Page 2: Artificial Neural Networks - ut

Artificial neural network

NB! Inspired by biology, not based on biology!

Page 3: Artificial Neural Networks - ut

Applications

Automatic speech recognition

Automatic image classification and tagging

Natural language modeling

Page 4: Artificial Neural Networks - ut

Learning objectives

How do artificial neural networks work?

What types of artificial neural networks are used for what tasks?

What are the state-of-the-art results achieved with artificial neural networks?

Page 5: Artificial Neural Networks - ut

How DO neural networks work?

Part 1

Page 6: Artificial Neural Networks - ut

Frank Rosenblatt (1957)

Added a learning rule to the McCulloch-Pitts neuron.

Page 7: Artificial Neural Networks - ut

Perceptron

Prediction:

y = 1, if x1w1 + x2w2 + b > 0
y = 0, otherwise

[Figure: inputs x1, x2 and a constant input 1 feed a summation unit Σ through weights w1, w2 and bias b; the unit outputs y.]

Page 8: Artificial Neural Networks - ut

Perceptron

Prediction:

y = 1, if x1w1 + x2w2 + b > 0
y = 0, otherwise

Learning:

wi ← wi + (t − y)xi
b ← b + (t − y)

If prediction == target, do nothing.
If prediction < target, increase the weights of positive inputs and decrease the weights of negative inputs.
If prediction > target, vice versa.
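As a rough sketch (not from the slides), the prediction and learning rules above can be written in a few lines of Python; the function and variable names are illustrative:

def predict(x, w, b):
    # y = 1 if the weighted sum of the inputs plus the bias is positive, otherwise 0
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if s > 0 else 0

def update(x, t, w, b):
    # perceptron learning rule: wi <- wi + (t - y) xi,  b <- b + (t - y)
    y = predict(x, w, b)
    w = [wi + (t - y) * xi for wi, xi in zip(w, x)]
    b = b + (t - y)
    return w, b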

Page 9: Artificial Neural Networks - ut

Let’s try it out!

Truth table for OR:

X  Y  X OR Y
0  0  0
0  1  1
1  0  1
1  1  1

Weights A (on X), B (on Y) and bias C. Initialize A, B, C = 0, so the output is 0.

Go over the examples in the table:
1. t = y, so no changes
2. y = 0, t = 1 → A = 0, B = 1, C = 1
3. y = t = 1
4. y = t = 1
5. y = 1, t = 0 → A = 0, B = 1, C = 0
6. y = t = 1
7. y = 0, t = 1 → A = 1, B = 1, C = 0

Learning:

wi ← wi + (t − y)xi
b ← b + (t − y)
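Continuing the sketch from the previous page, a short loop that repeats the table until no example changes the weights (OR is linearly separable, so the perceptron convergence theorem guarantees this terminates):

# OR truth table: inputs (X, Y) and target t
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w, b = [0, 0], 0            # A, B and C from the slide, all initialized to 0
changed = True
while changed:
    changed = False
    for x, t in data:
        old = (list(w), b)
        w, b = update(x, t, w, b)
        if (list(w), b) != old:
            changed = True

print(w, b)                 # converges to [1, 1] and 0, which implements OR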

Page 10: Artificial Neural Networks - ut

Perceptron limitations

Perceptron learning algorithm converges only for linearly separable problems (because it only has one layer)

Minsky, Papert, “Perceptrons” (1969)

Page 11: Artificial Neural Networks - ut

Multi-layer perceptrons

Add non-linear activation functions

Add hidden layer(s)

Universal approximation theorem (important!): Any continuous function can be approximated by a finite feed-forward neural network with one hidden layer.

Page 12: Artificial Neural Networks - ut

Forward propagation

[Figure: two inputs x1, x2 plus a constant +1 feed two sigmoid hidden units; the hidden units plus a constant +1 feed a single linear output unit.]

Hidden layer:
y1 = σ(b11 + x1w11 + x2w21)
y2 = σ(b12 + x1w12 + x2w22)

Output layer (no nonlinearity):
z = b21 + y1w21 + y2w22
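A minimal sketch of this forward pass in Python (the output-layer weights are renamed v21, v22 here because the slide reuses the names w21, w22 for both layers):

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, p):
    # hidden layer: two sigmoid units
    y1 = sigmoid(p["b11"] + x1 * p["w11"] + x2 * p["w21"])
    y2 = sigmoid(p["b12"] + x1 * p["w12"] + x2 * p["w22"])
    # output layer: linear, no nonlinearity
    z = p["b21"] + y1 * p["v21"] + y2 * p["v22"]
    return y1, y2, z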

Page 13: Artificial Neural Networks - ut

Loss function

• Function approximation (squared error):
  L = 1/2 (t − z)²

• Binary classification (cross-entropy):
  L = −log(z), if t = 1
  L = −log(1 − z), if t = 0

• Multi-class classification (cross-entropy):
  L = −Σj tj log(zj)

[Figure: example loss curves as a function of the prediction z, including −log(z) and −log(1 − z).]
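For concreteness, the three losses written out in plain Python (illustrative helper names, not from the slides):

import math

def squared_error(t, z):
    # function approximation: L = 1/2 (t - z)^2
    return 0.5 * (t - z) ** 2

def binary_cross_entropy(t, z):
    # binary classification: L = -log(z) if t == 1, -log(1 - z) if t == 0
    return -math.log(z) if t == 1 else -math.log(1.0 - z)

def cross_entropy(t, z):
    # multi-class classification: L = -sum_j t_j * log(z_j)
    return -sum(tj * math.log(zj) for tj, zj in zip(t, z))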

Page 14: Artificial Neural Networks - ut

Backpropagation

[Figure: the same two-layer network as before, with error signals flowing backwards from the output.]

Forward pass:
y1 = σ(b11 + x1w11 + x2w21)
y2 = σ(b12 + x1w12 + x2w22)
z = b21 + y1w21 + y2w22

Output error:
ez = dL/dz = z − t

Errors at the hidden units:
ey1 = ez w21 σ'(b11 + x1w11 + x2w21)
ey2 = ez w22 σ'(b12 + x1w12 + x2w22)

Gradients for the output layer:
∆b21 = ez
∆w21 = ez y1
∆w22 = ez y2

Gradients for the hidden layer:
∆b11 = ey1      ∆b12 = ey2
∆w11 = ey1 x1   ∆w12 = ey2 x1
∆w21 = ey1 x2   ∆w22 = ey2 x2

Derivative of the sigmoid:
σ'(x) = σ(x)(1 − σ(x))
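A compact sketch of this backward pass, continuing the forward() helper above and assuming the squared-error loss (so dL/dz = z − t); as before, v21 and v22 stand in for the slide's output-layer w21, w22:

def backward(x1, x2, t, p):
    y1, y2, z = forward(x1, x2, p)
    ez = z - t                                  # dL/dz for L = 1/2 (t - z)^2
    # errors at the hidden units; sigma'(a) = y * (1 - y)
    ey1 = ez * p["v21"] * y1 * (1.0 - y1)
    ey2 = ez * p["v22"] * y2 * (1.0 - y2)
    # gradients, laid out as on the slide
    return {
        "b21": ez, "v21": ez * y1, "v22": ez * y2,
        "b11": ey1, "w11": ey1 * x1, "w21": ey1 * x2,
        "b12": ey2, "w12": ey2 * x1, "w22": ey2 * x2,
    }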

Page 15: Artificial Neural Networks - ut

Gradient Descent

• Gradient descent finds weight values that result in small loss.

• Gradient descent is only guaranteed to find a local minimum.

• But there are plenty of them, and they are often good enough!
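The update itself is one line per parameter: move each weight a small step against its gradient. A minimal sketch, reusing the hypothetical backward() helper from the previous page:

learning_rate = 0.1

def gradient_descent_step(x1, x2, t, p):
    grads = backward(x1, x2, t, p)
    for name, g in grads.items():
        p[name] -= learning_rate * g            # step against the gradient
    return p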

Page 16: Artificial Neural Networks - ut

Walking around in the energy (loss) landscape based only on local gradient information.

Page 17: Artificial Neural Networks - ut

Things to remember...

The perceptron, invented in the late 1950s, was the first artificial neuron model with a learning rule.

Perceptron can learn only linearly separable classification problems.

Feed-forward networks with non-linear activation functions and hidden layers can overcome limitations of perceptrons.

Multi-layer artificial neural networks are trained using backpropagation and gradient descent.

Page 18: Artificial Neural Networks - ut

Neural networks taxonomy

Part 2

Page 19: Artificial Neural Networks - ut

Simple feed-forward networks

• Architecture:
  – Each node connected to all nodes of the previous layer.
  – Information moves in one direction only.

• Used for:
  – Function approximation
  – Simple classification problems
  – Not too many inputs (~100)

[Figure: INPUT LAYER → HIDDEN LAYER → OUTPUT LAYER.]

Page 20: Artificial Neural Networks - ut

Convolutional neural networks

Page 21: Artificial Neural Networks - ut

Hubel & Wiesel (1959)

• Performed experiments with an anesthetized cat.

• Discovered topographical mapping, sensitivity to orientation and hierarchical processing.

Simple cells – convolution

Complex cells – pooling

Page 22: Artificial Neural Networks - ut

Convolution in neural nets

Recommending music on Spotify

Page 23: Artificial Neural Networks - ut

Convolutional neural networks

• Architecture:
  – Convolutional layer: local connections + weight sharing.
  – Pooling layer: translation invariance.

• Used for:
  – images,
  – any other data with a locality property, e.g. adjacent characters make up a word.

[Figure: a 1-D example with an INPUT LAYER, a CONVOLUTIONAL LAYER using the shared weights 1, 0, −1, and a max-POOLING LAYER.]
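To make the figure concrete, here is a tiny sketch of a 1-D convolution with the shared weights 1, 0, −1 followed by max-pooling; the input signal is an assumed example, not taken from the slide:

def conv1d(signal, kernel):
    # slide the same kernel across the signal (local connections + weight sharing)
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(scores, size=2):
    # translation invariance: keep only the maximum of each group of adjacent scores
    return [max(scores[i:i + size]) for i in range(0, len(scores), size)]

signal = [0, 1, 2, -1, 0, 3]        # assumed example input
conv = conv1d(signal, [1, 0, -1])   # [-2, 2, 2, -4]
pooled = max_pool(conv)             # [2, 2]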

Page 24: Artificial Neural Networks - ut

Convolution

Convolution searches for the same pattern over the entire image and calculates a score for each match.

1 0 1

0 1 0

1 0 1

Page 25: Artificial Neural Networks - ut

Convolution

Convolution searches for the same pattern over the entire image and calculates a score for each match.

Now try this:

 1 -1  1
-1  1 -1
 1 -1  1

And this...

 1  1  1
 1  1 -1
 1 -1 -1

Page 26: Artificial Neural Networks - ut

What do these filters do?

1 1 1

1 1 1

1 1 1

0 1 0

1 -4 1

0 1 0

Page 27: Artificial Neural Networks - ut

1 1 1

1 1 1

1 1 1

0 1 0

1 -4 1

0 1 0
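As a hedged illustration (assuming NumPy and SciPy are available), applying the two filters above to a grayscale image shows what they do: the all-ones filter sums (blurs) each 3x3 neighbourhood, while the second one, a Laplacian, responds strongly at edges.

import numpy as np
from scipy.ndimage import convolve

image = np.random.rand(8, 8)              # assumed example grayscale image

box = np.ones((3, 3))                     # sums the 3x3 neighbourhood: a blur, up to scaling
laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]])         # near zero in flat regions, large at edges

blurred = convolve(image, box)
edges = convolve(image, laplacian)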

Page 28: Artificial Neural Networks - ut

Pooling

Pooling achieves translation invariance by taking maximum of adjacent convolution scores.

Page 29: Artificial Neural Networks - ut

Example: handwritten digit recognition

Y. LeCun et al., “Handwritten digit recognition: Applications of neural net chips and automatic learning”, 1989.

LeCun et al. (1989)

Page 30: Artificial Neural Networks - ut

Recurrent neural networks

• Architecture:
  – Hidden layer nodes connected to each other.
  – Allows retaining internal state and memory.

• Used for:
  – speech recognition,
  – handwriting recognition,
  – any “time” series: brain activity, DNA reads

[Figure: INPUT LAYER → RECURRENT HIDDEN LAYER → OUTPUT LAYER.]
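A minimal sketch of what the recurrent connections mean in practice: at every time step the hidden state is computed from the current input and the previous hidden state, and the same weight matrices are reused at every step (all names and sizes below are illustrative):

import numpy as np

def rnn_forward(inputs, W_in, W_rec, W_out, h0):
    # inputs: a list of input vectors, one per time step
    h = h0
    outputs = []
    for x in inputs:
        # the new hidden state depends on the current input AND the previous state
        h = np.tanh(W_in @ x + W_rec @ h)
        outputs.append(W_out @ h)
    return outputs, h

# toy sizes: 3 inputs, 4 hidden units, 2 outputs, sequence of length 5
rng = np.random.default_rng(0)
W_in, W_rec, W_out = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
seq = [rng.normal(size=3) for _ in range(5)]
outs, h_final = rnn_forward(seq, W_in, W_rec, W_out, np.zeros(4))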

Page 31: Artificial Neural Networks - ut

Backpropagation through time

[Figure: the recurrent network unrolled over time. Inputs I1…I4 feed hidden states H1…H4 (starting from an initial state H0), which produce outputs O1…O4 that are compared against targets T1…T4. The same weights W are reused at every time step, and the error is propagated backwards through the unrolled network.]

Page 32: Artificial Neural Networks - ut

Auto-encoders

• Architecture:
  – Input and output are the same!!
  – Hidden layer functions as a “bottleneck”.
  – Network is trained to reconstruct the input from the hidden layer activations.

• Used for:
  – image search
  – dimensionality reduction

[Figure: INPUT LAYER → HIDDEN LAYER (bottleneck) → OUTPUT LAYER = INPUT LAYER.]
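As a very small sketch of the idea (a linear autoencoder, purely illustrative): encode the input into a lower-dimensional bottleneck, decode it back, and adjust both weight matrices to minimize the reconstruction error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))           # assumed data: 100 samples, 8 features

W_enc = 0.1 * rng.normal(size=(8, 3))   # encoder: 8 features -> 3 (the bottleneck)
W_dec = 0.1 * rng.normal(size=(3, 8))   # decoder: 3 -> 8

lr = 0.01
for _ in range(1000):
    H = X @ W_enc                       # bottleneck activations
    X_hat = H @ W_dec                   # reconstruction
    err = X_hat - X                     # the target is the input itself
    # gradient descent on the mean squared reconstruction error
    W_dec -= lr * H.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)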

Page 33: Artificial Neural Networks - ut

We didn’t talk about...

• Restricted Boltzmann Machines (RBMs)

• Long Short Term Memory networks (LSTMs)

• Echo State Networks / Liquid State Machines

• Hopfield Network

• Self-organizing maps (SOMs)

• Radial basis function networks (RBFs)

• But we covered the most important ones!

Page 34: Artificial Neural Networks - ut

Things to remember...

Simple feed-forward networks are usually used for function approximation, e.g. predicting energy consumption.

Convolutional neural networks are mostly used for images.

Recurrent neural networks are used for speech recognition and language modeling.

Autoencoders are used for dimensionality reduction.

Page 35: Artificial Neural Networks - ut

State-of-the-art results

Part 3

Page 36: Artificial Neural Networks - ut

Deep Learning

Artificial neural networks and backpropagation have been around since the 1980s. What’s all this fuss about “deep learning”?

• What has changed:
  – we have much bigger datasets,
  – we have much faster computers (think GPUs),
  – we have learned a few tricks for training networks with very, very many (150) layers.

Page 37: Artificial Neural Networks - ut

GoogLeNet

ImageNet 2014 winner – 27 layers, 5M weights.

Szegedy et al., “Going Deeper with Convolutions” (2014).

Page 38: Artificial Neural Networks - ut

ImageNet classification

Try it yourself: http://www.clarifai.com/#demo

Wu et al., “Deep Image: Scaling up Image Recognition” (2015). Ioffe, Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (2015).

Current best: 4.9% (human error: 5.1%).

Page 39: Artificial Neural Networks - ut

Automatic image descriptions

Karpathy, Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions” (2014)

Page 40: Artificial Neural Networks - ut

Reinforcement learning

Games: Pong, Breakout, Space Invaders, Seaquest, Beam Rider, Enduro

[Figure: the network receives the game screen as input and the score as reward, and outputs actions.]

https://github.com/tambetm/simple_dqn
Mnih et al., “Human-level control through deep reinforcement learning” (2015)

Page 41: Artificial Neural Networks - ut

Multiagent reinforcement learning

Tampuu, Matiisen et al., “Multiagent Cooperation and Competition with Deep Reinforcement Learning” (2015)

Videos on YouTube about competitive mode and collaborative mode

Page 42: Artificial Neural Networks - ut

Program execution

Curriculum learning – learning simple expressions first and then more complex – proved to be essential.

Zaremba, Sutskever, “Learning to Execute” (2015).

Page 43: Artificial Neural Networks - ut

The future of AI?

● Neural Turing Machines

● Memory Networks – writing and reading from external memory (infinite memory)

For example: Hybrid computing using a neural network with dynamic external memory (Graves, Hassabis et al., 2016)

Page 44: Artificial Neural Networks - ut

Thank you!