Artificial Neural Networks
Introduction to Computational Neuroscience, Ardi Tampuu, 17.10.2016
Artificial neural network
NB! Inspired by biology, not based on biology!
Applications
Automatic speech recognition
Automatic image classification and tagging
Natural language modeling
Learning objectives
How do artificial neural networks work?
What types of artificial neural networks are used for what tasks?
What are the state-of-the-art results achieved with artificial neural networks?
How DO neural networks work?
Part 1
Frank Rosenblatt (1957)
Added a learning rule to the McCulloch-Pitts neuron.

Perceptron
Prediction:
y = 1 if x1·w1 + x2·w2 + b > 0, otherwise 0
[Diagram: inputs x1, x2 and a constant 1 feed a summation unit Σ through weights w1, w2 and bias b, producing the output y]
Perceptron
Learning:
wi ← wi + (t − y)·xi
b ← b + (t − y)
If prediction == target, do nothing
If prediction < target, increase weights of positive inputs, decrease weights of negative inputs
If prediction > target, vice versa
Let's try it out! Learning OR:

X | Y | X OR Y
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

Weights A (for X), B (for Y) and bias C. Initialize A, B, C = 0, so the output is 0.
Go over the examples in the table:
1. t = y, so no changes
2. y = 0, t = 1 → A = 0, B = 1, C = 1
3. y = t = 1
4. y = t = 1
5. y = 1, t = 0 → A = 0, B = 1, C = 0
6. y = t = 1
7. y = 0, t = 1 → A = 1, B = 1, C = 0

Learning:
wi ← wi + (t − y)·xi
b ← b + (t − y)
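The worked example above can be sketched in a few lines of Python. The OR truth table and the learning rule come from the slides; the function and variable names are illustrative.

```python
# Perceptron learning rule: w_i <- w_i + (t - y) * x_i, b <- b + (t - y)
def predict(w, b, x):
    # y = 1 if x1*w1 + x2*w2 + b > 0, otherwise 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# OR truth table from the slide: (inputs, target)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w, b = [0, 0], 0            # initialize weights and bias to 0
for epoch in range(10):     # a few passes over the examples
    for x, t in data:
        err = t - predict(w, b, x)   # 0 when prediction == target
        w = [wi + err * xi for wi, xi in zip(w, x)]
        b += err

print(w, b)                                  # [1, 1] 0
print([predict(w, b, x) for x, _ in data])   # [0, 1, 1, 1] — the X OR Y column
```

After a few passes the weights stop changing, exactly as in the hand-worked table above.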
Perceptron limitations
Perceptron learning algorithm converges only for linearly separable problems (because it only has one layer)
Minsky, Papert, “Perceptrons” (1969)
Multi-layer perceptrons
Add non-linear activation functions
Add hidden layer(s)
Universal approximation theorem (important!): any continuous function can be approximated by a finite feed-forward neural network with one hidden layer.
Forward propagation

y1 = σ(b11 + x1·w11 + x2·w21)
y2 = σ(b12 + x1·w12 + x2·w22)
z = b21 + y1·w21 + y2·w22 (no nonlinearity on the output)

[Diagram: inputs x1, x2 and a bias unit feed two hidden Σ units producing y1 and y2, which together with a bias feed the output unit z; note that the slide reuses the labels w21, w22 for the second-layer weights]
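The forward pass above translates directly into Python. Since the slide reuses the names w21, w22 in both layers, the second-layer weights are renamed v1, v2 here; the weight values are arbitrary illustrations.

```python
import math

def sigmoid(a):
    # the logistic sigmoid used as the hidden-layer nonlinearity
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, p):
    # hidden layer: two sigmoid units
    y1 = sigmoid(p["b11"] + x1 * p["w11"] + x2 * p["w21"])
    y2 = sigmoid(p["b12"] + x1 * p["w12"] + x2 * p["w22"])
    # output layer: linear, no nonlinearity
    z = p["b21"] + y1 * p["v1"] + y2 * p["v2"]
    return z

# arbitrary example weights
params = {"b11": 0.1, "b12": -0.2, "w11": 0.5, "w12": -0.4,
          "w21": 0.3, "w22": 0.8, "b21": 0.0, "v1": 1.0, "v2": -1.0}
print(forward(1.0, 0.0, params))
```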
Loss function
• Function approximation: L = ½·(t − z)²
• Binary classification: L = −log(z) if t = 1, −log(1 − z) if t = 0
• Multi-class classification: L = −Σj tj·log(zj)
[Plot: loss curves as a function of the prediction z, e.g. −log(z), ½·(10 − z)², −log(1 − z)]
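The three losses above in Python (function names are illustrative; z is the network prediction, t the target):

```python
import math

def squared_error(t, z):
    # function approximation: L = 1/2 (t - z)^2
    return 0.5 * (t - z) ** 2

def binary_cross_entropy(t, z):
    # binary classification: t is 0 or 1, z a probability in (0, 1)
    return -math.log(z) if t == 1 else -math.log(1.0 - z)

def cross_entropy(t, z):
    # multi-class: t is a one-hot target vector, z a probability distribution
    return -sum(tj * math.log(zj) for tj, zj in zip(t, z))

print(squared_error(10, 8))                          # 2.0
print(binary_cross_entropy(1, 0.5))                  # ≈ 0.693
print(cross_entropy([0, 1, 0], [0.25, 0.5, 0.25]))   # ≈ 0.693
```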
Backpropagation

Forward pass:
y1 = σ(b11 + x1·w11 + x2·w21)
y2 = σ(b12 + x1·w12 + x2·w22)
z = b21 + y1·w21 + y2·w22

Backward pass:
dL/dz = ez = z − t
ey1 = ez·w21·σ'(b11 + x1·w11 + x2·w21)
ey2 = ez·w22·σ'(b12 + x1·w12 + x2·w22)

Output layer updates: Δb21 = ez, Δw21 = ez·y1, Δw22 = ez·y2
Hidden layer updates:
Δb11 = ey1, Δw11 = ey1·x1, Δw21 = ey1·x2
Δb12 = ey2, Δw12 = ey2·x1, Δw22 = ey2·x2

Derivative of sigmoid:
σ'(x) = σ(x)·(1 − σ(x))
Gradient Descent
• Gradient descent finds weight values that result in a small loss.
• Gradient descent is guaranteed to find only a local minimum.
• But there are plenty of them, and they are often good enough!
Walking around in the energy (loss) landscape using only local gradient information.
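Putting backpropagation and gradient descent together for the 2-2-1 network from the slides gives one training step like the sketch below. The second-layer weights are renamed v1, v2 (the slide reuses w21/w22 in both layers), and the learning rate and random initialization are illustrative assumptions.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(p, x1, x2, t, lr=0.1):
    # forward pass (same equations as the forward-propagation slide)
    a1 = p["b11"] + x1 * p["w11"] + x2 * p["w21"]
    a2 = p["b12"] + x1 * p["w12"] + x2 * p["w22"]
    y1, y2 = sigmoid(a1), sigmoid(a2)
    z = p["b21"] + y1 * p["v1"] + y2 * p["v2"]
    # backward pass: e_z = dL/dz for L = 1/2 (t - z)^2
    ez = z - t
    ey1 = ez * p["v1"] * y1 * (1.0 - y1)   # sigma'(a) = sigma(a) * (1 - sigma(a))
    ey2 = ez * p["v2"] * y2 * (1.0 - y2)
    # gradient descent: step each weight against its gradient
    p["b21"] -= lr * ez
    p["v1"] -= lr * ez * y1
    p["v2"] -= lr * ez * y2
    p["b11"] -= lr * ey1; p["w11"] -= lr * ey1 * x1; p["w21"] -= lr * ey1 * x2
    p["b12"] -= lr * ey2; p["w12"] -= lr * ey2 * x1; p["w22"] -= lr * ey2 * x2
    return 0.5 * (t - z) ** 2              # loss before the update

random.seed(0)
p = {k: random.uniform(-1, 1) for k in
     ["b11", "b12", "w11", "w12", "w21", "w22", "b21", "v1", "v2"]}
losses = [train_step(p, 1.0, 0.0, 1.0) for _ in range(100)]
print(losses[0], "->", losses[-1])   # the loss shrinks step by step
```

Repeated on one example, each step moves the weights downhill, so the loss decreases toward a (local) minimum.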
Things to remember...
The perceptron was the first artificial neuron model, invented in the late 1950s.
A perceptron can learn only linearly separable classification problems.
Feed-forward networks with non-linear activation functions and hidden layers overcome the limitations of the perceptron.
Multi-layer artificial neural networks are trained using backpropagation and gradient descent.
Neural network taxonomy
Part 2
Simple feed-forward networks
• Architecture:
– Each node is connected to all nodes of the previous layer.
– Information moves in one direction only.
• Used for:
– Function approximation
– Simple classification problems
– Not too many inputs (~100)
[Diagram: INPUT LAYER → HIDDEN LAYER → OUTPUT LAYER]
Convolutional neural networks
Hubel & Wiesel (1959)
• Performed experiments with anesthetized cat.
• Discovered topographical mapping, sensitivity to orientation and hierarchical processing.
Simple cells – convolution
Complex cells – pooling
Convolution in neural nets
Recommending music on Spotify
Convolutional neural networks
• Architecture:
– Convolutional layer: local connections + weight sharing.
– Pooling layer: translation invariance.
• Used for:
– images,
– any other data with a locality property, e.g. adjacent characters make up a word.
[Diagram: a 1-D INPUT LAYER convolved with shared weights (1, 0, −1) produces the CONVOLUTIONAL LAYER, followed by a max POOLING LAYER]
Convolution
Convolution searches for the same pattern over the entire image and calculates a score for each match.
1 0 1
0 1 0
1 0 1
Convolution
Now try this:
1 -1 1
-1 1 -1
1 -1 1

And this:
1 1 1
1 1 -1
1 -1 -1
What do these filters do?

1 1 1
1 1 1
1 1 1

0 1 0
1 -4 1
0 1 0
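A hypothetical few-line 2-D convolution shows what the two filters above do: the all-ones filter sums (averages/blurs) each 3×3 neighbourhood, while the centre −4 filter responds only to changes, so a flat image gives zero everywhere.

```python
# Minimal 2-D convolution (no padding), applied to a tiny constant image
def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

blur = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]    # averaging / blur filter
edge = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # edge-detection filter

flat = [[5, 5, 5, 5]] * 4                   # constant image: no edges
print(conv2d(flat, edge))   # [[0, 0], [0, 0]] — flat regions give zero response
print(conv2d(flat, blur))   # [[45, 45], [45, 45]] — nine pixels summed
```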
Pooling
Pooling achieves translation invariance by taking maximum of adjacent convolution scores.
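The convolution-then-pooling pipeline can be sketched in 1-D; the weights (1, 0, −1) come from the earlier diagram, while the input values are illustrative.

```python
# Convolution: slide the shared weights over the input, one score per position
def conv1d(xs, w):
    k = len(w)
    return [sum(w[i] * xs[j + i] for i in range(k))
            for j in range(len(xs) - k + 1)]

# Max-pooling: keep only the largest score in each window
def max_pool(xs, size=2):
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

scores = conv1d([0, 1, 2, -1, 0, 1], [1, 0, -1])
print(scores)            # [-2, 2, 2, -2]
print(max_pool(scores))  # [2, 2] — small shifts of the pattern give the same output
```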
Example: handwritten digit recognition
Y. LeCun et al., “Handwritten digit recognition: Applications of neural net chips and automatic learning”, 1989.
LeCun et al. (1989)
Recurrent neural networks
• Architecture:
– Hidden layer nodes are connected to each other.
– Allows retaining internal state and memory.
• Used for:
– speech recognition,
– handwriting recognition,
– any "time" series: brain activity, DNA reads.
[Diagram: INPUT LAYER → RECURRENT HIDDEN LAYER → OUTPUT LAYER]
Backpropagation through time
[Diagram: the network unrolled over time — inputs I1…I4 feed hidden states H1…H4 (each Ht also receives Ht−1, starting from H0), producing outputs O1…O4 compared against targets T1…T4; the same weights W are shared across all time steps]
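The unrolling in the diagram can be sketched as a minimal one-unit recurrent forward pass: the same three weights are reused at every time step (all names and values below are illustrative).

```python
import math

# Hidden state update: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + b),
# with the SAME weights shared across all time steps
def rnn_forward(inputs, w_in, w_rec, b, h0=0.0):
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, -1.0, 0.5], w_in=0.8, w_rec=0.5, b=0.0)
print(states)  # each state depends on the whole input history so far
```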
Auto-encoders
• Architecture:
– Input and output are the same!
– Hidden layer functions as a "bottleneck".
– Network is trained to reconstruct the input from hidden layer activations.
• Used for:
– image search
– dimensionality reduction
[Diagram: INPUT LAYER → HIDDEN LAYER (bottleneck) → OUTPUT LAYER = INPUT LAYER]
We didn’t talk about...
• Restricted Boltzmann Machines (RBMs)
• Long Short-Term Memory networks (LSTMs)
• Echo State Networks / Liquid State Machines
• Hopfield Network
• Self-organizing maps (SOMs)
• Radial basis function networks (RBFs)
• But we covered the most important ones!
Things to remember...
Simple feed-forward networks are usually used for function approximation, e.g. predicting energy consumption.
Convolutional neural networks are mostly used for images.
Recurrent neural networks are used for speech recognition and language modeling.
Autoencoders are used for dimensionality reduction.
State-of-the-art results
Part 3
Deep Learning
Artificial neural networks and backpropagation have been around since the 1980s. What's all this fuss about "deep learning"?
• What has changed:
– we have much bigger datasets,
– we have much faster computers (think GPUs),
– we have learned a few tricks for training networks with very, very many (150+) layers.
GoogLeNet
ImageNet 2014 winner – 27 layers, 5M weights.
Szegedy et al., “Going Deeper with Convolutions” (2014).
ImageNet classification
Current best: 4.9% error (human error: 5.1%).
Try it yourself: http://www.clarifai.com/#demo
Wu et al., "Deep Image: Scaling up Image Recognition" (2015). Ioffe, Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015).
Automatic image descriptions
Karpathy, Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions” (2014)
Reinforcement learning
Atari games: Pong, Breakout, Space Invaders, Seaquest, Beam Rider, Enduro.
[Diagram: the agent receives the screen and the score from the game and sends back actions]
Mnih et al., "Human-level control through deep reinforcement learning" (2015). Code: https://github.com/tambetm/simple_dqn
Multiagent reinforcement learning
Tampuu, Matiisen et al., "Multiagent Cooperation and Competition with Deep Reinforcement Learning" (2015)
Videos on YouTube about competitive mode and collaborative mode
Program execution
Curriculum learning – learning simple expressions first and more complex ones later – proved to be essential.
Zaremba, Sutskever, “Learning to Execute” (2015).
The future of AI?
● Neural Turing Machines
● Memory Networks – writing to and reading from an external memory (effectively unlimited memory)
For example: Hybrid computing using a neural network with dynamic external memory (Graves, Hassabis et al. 2016)
Thank you!