Lecture 6: Perceptron & Logistic Regression


Transcript of Lecture 6: Perceptron & Logistic Regression

Page 1: Lecture 6: Perceptron & Logistic Regression

Lecture 6: Perceptron & Logistic Regression

Andreas Wichert, Department of Computer Science and Engineering

Técnico Lisboa

Page 2: Lecture 6: Perceptron & Logistic Regression

Similarity to real neurons...

Page 3: Lecture 6: Perceptron & Logistic Regression

• The dot product is a linear representation; its result is the value net

• Non-linearity can be achieved by a non-linear activation (transfer) function φ() with

Page 4: Lecture 6: Perceptron & Logistic Regression

Sgn

• An example of a nonlinear transfer function is the sgn function

Page 5: Lecture 6: Perceptron & Logistic Regression

• The sgn activation function can be scaled to the two values 0 and 1 for not firing and firing, giving sgn0(net)

• Both non-linear functions sgn(net) and sgn0(net) are non-continuous. The activation function sgn0(net) can be approximated by the non-linear continuous function σ(net)

Page 6: Lecture 6: Perceptron & Logistic Regression

(a) The activation function sgn0(net). (b) The function σ(net) with α = 1. (c) The function σ(net) with α = 5. (d) The function σ(net) with α = 10 is very similar to sgn0(net); larger α makes it even more similar
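• A minimal NumPy sketch of this approximation, assuming the usual logistic form σ(net) = 1/(1 + exp(−α·net)); the function names sgn0 and sigma are only illustrative:

import numpy as np

def sgn0(net):
    # Step activation scaled to {0, 1}: 0 for not firing, 1 for firing
    return np.where(net >= 0, 1.0, 0.0)

def sigma(net, alpha=1.0):
    # Logistic (sigmoid) activation; larger alpha makes it steeper,
    # so it approaches sgn0(net)
    return 1.0 / (1.0 + np.exp(-alpha * net))

net = np.linspace(-3, 3, 7)
for alpha in (1, 5, 10):
    print(alpha, np.round(sigma(net, alpha), 3))
print("sgn0", sgn0(net))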

Page 7: Lecture 6: Perceptron & Logistic Regression

Perceptron (1957)

• Linear threshold unit (LTU)

[Diagram: a linear threshold unit with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, a bias input x0 = 1 with weight w0, a summation unit Σ, and output o]

McCulloch-Pitts model of a neuron (1943)

The “bias”, a constant term that does not depend on any input value

Page 8: Lecture 6: Perceptron & Logistic Regression

Linearly separable patterns

x0 = 1, bias...

Page 9: Lecture 6: Perceptron & Logistic Regression

• The goal of a perceptron is to correctly classify the set of patterns D = {x1, x2, ..., xm} into one of the classes C1 and C2

• The output for class C1 is o=1 and for C2 is o=-1

• For n = 2 ⇒

Page 10: Lecture 6: Perceptron & Logistic Regression

Perceptron learning rule

• Consider linearly separable problems
• How to find appropriate weights?

• Initialize the weight vector w to some small random values

• Check whether the output o belongs to the desired class, i.e. whether it has the desired value d

• η is called the learning rate
• 0 < η ≤ 1

Δw = η · (d − o) · x
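• A minimal sketch of this learning rule in Python, assuming the input vectors are augmented with x0 = 1 for the bias and the targets d are coded as ±1 (the function name train_perceptron is illustrative, not from the slides):

import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    # X: (m, n) patterns, d: (m,) desired outputs in {-1, +1}
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend bias input x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) >= 0 else -1      # sgn activation
            w += eta * (target - o) * x             # Δw = η · (d − o) · x
    return w

# Example: the linearly separable AND problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([-1, -1, -1, 1])
w = train_perceptron(X, d)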

Page 11: Lecture 6: Perceptron & Logistic Regression

• In supervised learning the network has its output compared with known correct answers

• Supervised learning
• Learning with a teacher

• (d-o) plays the role of the error signal

• The algorithm converges to the correct classification
• if the training data is linearly separable
• and η is sufficiently small

Page 12: Lecture 6: Perceptron & Logistic Regression

XOR problem and Perceptron

• Shown by Minsky and Papert in the mid 1960s

Page 13: Lecture 6: Perceptron & Logistic Regression
Page 14: Lecture 6: Perceptron & Logistic Regression

• and x0 = 1

Page 15: Lecture 6: Perceptron & Logistic Regression

Discriminant Functions

• The hyperplane is described by w and b

Page 16: Lecture 6: Perceptron & Logistic Regression

Learning

• δk = (tk − ok) plays the role of the error signal, being either 0, 1, or −1

Page 17: Lecture 6: Perceptron & Logistic Regression

Learning

• In the original Perceptron with sgn it is

• It gives the same results if we scale η correspondingly

Page 18: Lecture 6: Perceptron & Logistic Regression

Multiclass linear discriminant

Page 19: Lecture 6: Perceptron & Logistic Regression

Problem

Page 20: Lecture 6: Perceptron & Logistic Regression

Gradient Descent

• If the training data is not linearly separable, the perceptron learning algorithm can fail to converge
• However, the delta rule, which is based on gradient descent, overcomes these difficulties
• Consider a simpler linear unit with a linear activation function

Page 21: Lecture 6: Perceptron & Logistic Regression

• We can define the training error for a training data set Dt of N elements with

Page 22: Lecture 6: Perceptron & Logistic Regression

• The training error is a function of w; with w = (w0, w1) we can represent all possible values as a two-dimensional function

• The error surface defined by E(w) contains only one global minimum, since it is convex.

Page 23: Lecture 6: Perceptron & Logistic Regression

• To find a local minimum of a function E(w) using gradient descent, one takes steps proportional to the negative of the gradient

Page 24: Lecture 6: Perceptron & Logistic Regression
Page 25: Lecture 6: Perceptron & Logistic Regression
Page 26: Lecture 6: Perceptron & Logistic Regression

• The update rule for gradient descent is given by
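• The exact formula is on the slide; as a hedged sketch, for a linear unit o = w·x with training error E(w) = ½ Σ (t − o)², batch gradient descent steps against the gradient using all N examples at once (names are illustrative):

import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=1000):
    # Batch gradient descent for a linear unit on the squared training error
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                  # linear outputs for all N training examples
        grad = -(X.T @ (t - o))    # gradient of E(w) = 1/2 * sum (t - o)^2
        w -= eta * grad            # step proportional to the negative gradient
    return w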

Page 27: Lecture 6: Perceptron & Logistic Regression

Stochastic gradient descent

• The gradient descent training rule updates the weights by summing over all N training examples.
• Stochastic gradient descent (SGD) approximates gradient descent by updating the weights incrementally, calculating the error for each example according to the update rule given by
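• A sketch of the incremental version, assuming the same linear unit and squared error as above: the weights are updated after every single example instead of after the whole sum (this is the delta rule / LMS update discussed on the next slide):

import numpy as np

def sgd_delta_rule(X, t, eta=0.01, epochs=100):
    # Stochastic gradient descent: one weight update per training example
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)              # linear output for one example
            w += eta * (target - o) * x   # incremental delta-rule update
    return w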

Page 28: Lecture 6: Perceptron & Logistic Regression

• This rule is known as the delta rule or LMS (least mean-square) weight update.
• It is used in Adaline (Adaptive Linear Element) for adaptive filters and was developed by Widrow and Hoff

Page 29: Lecture 6: Perceptron & Logistic Regression

• ADALINE. An adaptive linear neuron. Manually adapted synapses. Designed and built by Ted Hoff in 1960. Demonstrates the least mean square (LMS) learning algorithm. (Courtesy of Professor Bernard Widrow.)

Page 30: Lecture 6: Perceptron & Logistic Regression

Multiclass linear discriminant

• For K artificial neurons an index k is used with k ∈ {1, 2, ..., K}
• The index identifies the weight vector wk and the output ok of the neuron

Page 31: Lecture 6: Perceptron & Logistic Regression

REGULARISATION

Page 32: Lecture 6: Perceptron & Logistic Regression

sign

• |wj| is convex
• The subdifferential at the origin is the interval [−1, 1]
• The subdifferential at any point x0 < 0 is the singleton set {−1}
• The subdifferential at any point x0 > 0 is the singleton set {1}
• This is similar to the sign function, but it is not a single-valued function at 0, instead including all possible subderivatives
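• A minimal sketch of how this subdifferential is used in practice, assuming an L1-regularised squared error (the loss choice and the function names are assumptions for illustration, not from the slides):

import numpy as np

def l1_subgradient(w):
    # A subgradient of |w| element-wise: sign(w) away from 0;
    # at w == 0 any value in [-1, 1] is a valid subderivative, here we pick 0
    return np.sign(w)

def l1_regularised_step(w, X, t, lam=0.1, eta=0.01):
    # One subgradient step on mean squared error plus lambda * sum |w_j|
    y = X @ w
    grad_loss = X.T @ (y - t) / len(t)
    grad_penalty = lam * l1_subgradient(w)
    return w - eta * (grad_loss + grad_penalty)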

Page 33: Lecture 6: Perceptron & Logistic Regression

Continuous activation functions

Page 34: Lecture 6: Perceptron & Logistic Regression

Sigmoid Activation

Page 35: Lecture 6: Perceptron & Logistic Regression

Logistic Regression

Page 36: Lecture 6: Perceptron & Logistic Regression

Probabilistic Generative Models

Page 37: Lecture 6: Perceptron & Logistic Regression
Page 38: Lecture 6: Perceptron & Logistic Regression

Training

Page 39: Lecture 6: Perceptron & Logistic Regression

Error function

• The error function is defined by the negative logarithm of the likelihood
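• As a hedged sketch, with y = σ(w·x) and targets t ∈ {0, 1}, the negative log-likelihood is E(w) = −Σ [t ln y + (1 − t) ln(1 − y)], and its gradient gives the simple training loop below (function names are illustrative):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_logistic_regression(X, t, eta=0.1, epochs=1000):
    # Gradient descent on the negative log-likelihood, targets t in {0, 1}
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = sigmoid(X @ w)      # predicted probabilities P(C1 | x)
        grad = X.T @ (y - t)    # gradient of the cross-entropy error
        w -= eta * grad
    return w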

Page 40: Lecture 6: Perceptron & Logistic Regression
Page 41: Lecture 6: Perceptron & Logistic Regression
Page 42: Lecture 6: Perceptron & Logistic Regression
Page 43: Lecture 6: Perceptron & Logistic Regression

Multiclass logistic regression

Page 44: Lecture 6: Perceptron & Logistic Regression
Page 45: Lecture 6: Perceptron & Logistic Regression

• The softmax function is used in various multi-class classification methods, such as multinomial logistic regression (also known as softmax regression), with the prediction
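• A minimal sketch of the softmax prediction yk = exp(netk) / Σj exp(netj); the max-subtraction is only a numerical-stability detail, not part of the definition:

import numpy as np

def softmax(net):
    # Softmax over the K output activations net_k
    e = np.exp(net - np.max(net))   # subtract max for numerical stability
    return e / e.sum()

def predict(W, x):
    # Multinomial (softmax) regression: W holds one weight vector w_k per row
    y = softmax(W @ x)
    return np.argmax(y), y          # predicted class and class probabilities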

Page 46: Lecture 6: Perceptron & Logistic Regression

Cross Entropy Loss Function for softmax

• Taking the gradient of the error function with respect to the vector wj we obtain

Page 47: Lecture 6: Perceptron & Logistic Regression
Page 48: Lecture 6: Perceptron & Logistic Regression

Kronecker Function

Page 49: Lecture 6: Perceptron & Logistic Regression

• Taking the gradient of the error function with respect to netj over all training patterns we get
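• For softmax outputs with one-hot targets, the well-known result is ∂E/∂netj = yj − tj per pattern and ∂E/∂wj = Σn (ynj − tnj) xn over all training patterns; a minimal sketch (shapes and names are illustrative):

import numpy as np

def softmax_rows(net):
    e = np.exp(net - net.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_gradients(W, X, T):
    # X: (N, d) patterns, T: (N, K) one-hot targets, W: (K, d) weights
    Y = softmax_rows(X @ W.T)   # softmax outputs y_nk for every pattern
    delta = Y - T               # dE/dnet_j = y_j - t_j for each pattern
    grad_W = delta.T @ X        # dE/dw_j = sum_n (y_nj - t_nj) x_n
    return delta, grad_W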

Page 50: Lecture 6: Perceptron & Logistic Regression
Page 51: Lecture 6: Perceptron & Logistic Regression
Page 52: Lecture 6: Perceptron & Logistic Regression
Page 53: Lecture 6: Perceptron & Logistic Regression

Literature

• Simon O. Haykin, Neural Networks and Learning Machines (3rd Edition), Pearson, 2008
• Chapter 1

• Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, 2006
• Chapter 4

• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, O'Reilly Media, 1st edition, 2017
• Chapter 4

Page 54: Lecture 6: Perceptron & Logistic Regression

Literature

• Machine Learning - A Journey to Deep Learning, A. Wichert, Luis Sa-Couto, World Scientific, 2021

• Chapter 5

Page 55: Lecture 6: Perceptron & Logistic Regression

Literature (Additional)

• Introduction To The Theory Of Neural Computation (Santa Fe Institute Series Book 1), John A. Hertz, Anders S. Krogh, Richard G. Palmer, Addison-Wesley Pub. Co, Redwood City, CA; 1 edition (January 1, 1991)

• Chapter 5