Lecture 6: Perceptron & Logistic Regression


Andreas Wichert
Department of Computer Science and Engineering

Técnico Lisboa

Similarity to real neurons...

• The dot product is a linear operation; its result is represented by the value net

• Non-linearity can be achieved by a non-linear activation (transfer) function φ(), for example φ(net) = sgn(net)

• An example of a non-linear transfer function is the sgn function

• The sgn activation function can be rescaled to the two values 0 and 1 for not firing and firing, giving sgn0

• Both non-linear functions sgn(net) and sgn0(net) are discontinuous. The activation function sgn0(net) can be approximated by the non-linear continuous function σ(net)

(a) The activation function sgn0(net). (b) The function σ(net) with α = 1. (c) The function σ(net) with α = 5. (d) The function σ(net) with α = 10, which is very similar to sgn0(net); larger α makes it even more similar
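A minimal NumPy sketch of this approximation; the function names sgn0 and sigma and the sample net values are my own, chosen for illustration only:

```python
import numpy as np

def sgn0(net):
    """Step activation rescaled to {0, 1}: 0 for not firing, 1 for firing."""
    return np.where(net >= 0, 1.0, 0.0)

def sigma(net, alpha=1.0):
    """Continuous approximation of sgn0: the logistic function with steepness alpha."""
    return 1.0 / (1.0 + np.exp(-alpha * net))

net = np.array([-2.0, -0.5, 0.5, 2.0])
print("sgn0    :", sgn0(net))
for alpha in (1, 5, 10):
    # Larger alpha makes sigma(net) steeper and closer to the step function sgn0(net).
    print(f"alpha={alpha:2d}:", np.round(sigma(net, alpha), 3))
```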

Perceptron (1957)

• Linear threshold unit (LTU)

[Figure: linear threshold unit with bias input x0 = 1 and weight w0, inputs x1, x2, ..., xn with weights w1, w2, ..., wn, a summation node Σ, and output o]

McCulloch-Pitts model of a neuron (1943)

The “bias”, a constant term that does not depend on any input value

Linearly separable patterns

• x0 = 1 is the constant bias input

• The goal of a perceptron is to correctly classify the set of patterns D = {x1, x2, ..., xm} into one of the classes C1 and C2

• The output for class C1 is o=1 and for C2 is o=-1

• For n = 2 the decision boundary w0 + w1x1 + w2x2 = 0 is a line in the input plane

Perceptron learning rule

• Consider linearly separable problems

• How to find appropriate weights?

• Initialize the weight vector w to some small random values

• Check whether the output o belongs to the desired class, i.e. has the desired value d

• η is called the learning rate, with 0 < η ≤ 1

Δw = η · (d − o) · x

• In supervised learning the network has its output compared with known correct answers

• Supervised learning: learning with a teacher

• (d-o) plays the role of the error signal

• The algorithm converges to the correct classification if the training data is linearly separable and η is sufficiently small
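A minimal sketch of the perceptron learning rule in NumPy, assuming labels d ∈ {−1, +1} and an augmented bias input x0 = 1; the toy data and the function name train_perceptron are hypothetical:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Perceptron learning rule: Delta w = eta * (d - o) * x, with o = sgn(w . x).

    X is augmented with x0 = 1 for the bias weight w0; d holds labels in {-1, +1}.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the bias input x0 = 1
    w = 0.01 * np.random.randn(X.shape[1])          # small random initial weights
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) >= 0 else -1      # sgn activation
            if o != target:
                w += eta * (target - o) * x         # update only on misclassification
                errors += 1
        if errors == 0:                             # converged: all patterns correct
            break
    return w

# Hypothetical linearly separable toy data: class C1 (o = 1) vs. C2 (o = -1).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))
```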

XOR problem and Perceptron

• Shown by Minsky and Papert in the mid 1960s

• and x0 = 1

Discriminant Functions

• The hyperplane is described by w and b

Learning

• δk = (tk − ok) plays the role of the error signal, with δk being either 0, 1, or −1

Learning

• In the original Perceptron with sgn the error signal is 0, 2, or −2

• This gives the same results if we scale η correspondingly

Multiclass linear discriminant

Problem

Gradient Descent

• If the training data is not linearly separable, the perceptron learning algorithm can fail to converge

• However, the delta rule, which is based on gradient descent, overcomes these difficulties

• Consider a simpler linear unit with a linear activation function

• We can define the training error for a training data set Dt of N elements as E(w) = ½ Σk (dk − ok)²

• The training error is a function of w; with w = (w0, w1) we can represent all its possible values as a two-dimensional function

• The error surface as defined by E(w) contains only one global minimum since it is convex.

• To find a local minimum of a function E(w) using gradient descent, one takes steps proportional to the negative of the gradient

• The update rule for gradient descent is given by Δw = −η ∇E(w), with components Δwi = η Σk (dk − ok) xki
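A minimal sketch of batch gradient descent for the linear unit, assuming the squared training error E(w) = ½ Σk (dk − ok)² defined above; the toy regression data and the function name gradient_descent are hypothetical:

```python
import numpy as np

def gradient_descent(X, d, eta=0.01, steps=500):
    """Batch gradient descent on E(w) = 0.5 * sum_k (d_k - o_k)^2 for a linear unit o = w . x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        o = X @ w                                  # linear activation, all N examples at once
        grad = -(d - o) @ X                        # dE/dw, summed over the whole training set
        w -= eta * grad                            # step against the gradient
    return w

# Hypothetical toy data generated from d = 1 + 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
d = 1.0 + 2.0 * x[:, 0] + 0.1 * rng.standard_normal(50)
print(gradient_descent(x, d))   # should approach [1, 2]
```

Every update uses the whole training set, which is exactly what the stochastic variant below avoids.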

Stochastic gradient descent

• The gradient descent training rule updates the weights by summing over all N training examples

• Stochastic gradient descent (SGD) approximates gradient descent by updating the weights incrementally, calculating the error for each example and applying the update rule Δw = η · (d − o) · x after each one

• This rule is known as the delta rule or LMS (least mean-square) weight update

• It is used in Adaline (Adaptive Linear Element) for adaptive filters and was developed by Widrow and Hoff
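A minimal sketch of the delta-rule / LMS update for an Adaline-style linear unit; unlike the batch rule above, the weights move after every single example. The function name delta_rule_sgd and the toy data are hypothetical:

```python
import numpy as np

def delta_rule_sgd(X, d, eta=0.05, epochs=20):
    """Delta-rule / LMS update: after each example, w += eta * (d - o) * x with linear o = w . x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = np.dot(w, x)                        # linear unit, no threshold during learning
            w += eta * (target - o) * x             # incremental (stochastic) update
    return w

# Hypothetical 1-D example: targets generated from d = 0.5 + 1.5 * x.
x = np.linspace(-1, 1, 21).reshape(-1, 1)
d = 0.5 + 1.5 * x[:, 0]
print(delta_rule_sgd(x, d))   # should approach [0.5, 1.5]
```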

• ADALINE. An adaptive linear neuron. Manually adapted synapses. Designed and built by Ted Hoff in 1960. Demonstrates the least mean square LMS learning algorithm. (Courtesy of Professor Bernard Widrow.)

Multiclass linear discriminant

• For K artificial neurons an index k is used, with k ∈ {1, 2, ···, K}

• Neuron k is identified by its weight vector wk and its output ok
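A minimal sketch of prediction with K linear discriminants, assuming the class with the largest output ok wins; the weight matrix W and the function name multiclass_predict are hypothetical:

```python
import numpy as np

def multiclass_predict(W, x):
    """K linear discriminants: neuron k computes o_k = w_k . x; the largest output wins."""
    x = np.concatenate(([1.0], x))     # bias input x0 = 1
    o = W @ x                          # outputs o_1, ..., o_K of the K neurons
    return int(np.argmax(o))           # index of the winning class

# Hypothetical weight matrix for K = 3 classes and 2 inputs (rows are w_k, first column is w_k0).
W = np.array([[0.0,  1.0,  0.0],
              [0.0, -1.0,  1.0],
              [0.0,  0.0, -1.0]])
print(multiclass_predict(W, np.array([2.0, 0.5])))   # -> 0
```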

REGULARISATION

sign

• |wj| is convex

• The subdifferential at the origin is the interval [−1, 1]

• The subdifferential at any point x0 < 0 is the singleton set {−1}

• The subdifferential at any point x0 > 0 is the singleton set {1}

• This is similar to the sign function, but it is not a single-valued function at 0, instead including all possible subderivatives
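A minimal sketch of using a subgradient of |wj| in an L1-regularised update, assuming we pick 0 as the subderivative at the origin (any value in [−1, 1] would be valid); λ, η, and the example weights are hypothetical:

```python
import numpy as np

def l1_subgradient(w):
    """A subgradient of |w_j|: -1 for w_j < 0, +1 for w_j > 0, and (by choice) 0 at w_j = 0."""
    return np.sign(w)    # np.sign returns 0 at 0, one admissible element of [-1, 1]

# Example: subgradient step for an L1 penalty lambda * sum_j |w_j| added to the error.
w = np.array([-0.5, 0.0, 2.0])
lam, eta = 0.1, 0.01
w -= eta * lam * l1_subgradient(w)   # regularisation part of the update only
print(w)
```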

Continuous activation functions

Sigmoid Activation

Logistic Regression

Probabilistic Generative Models

Training

Error function

• The error function is defined as the negative logarithm of the likelihood
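A minimal sketch of this negative log-likelihood (cross-entropy) error for logistic regression with targets t ∈ {0, 1}; the weights, data, and function name neg_log_likelihood are hypothetical:

```python
import numpy as np

def sigma(net):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-net))

def neg_log_likelihood(w, X, t):
    """E(w) = -sum_k [ t_k * log o_k + (1 - t_k) * log(1 - o_k) ], with o_k = sigma(w . x_k)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias input x0 = 1
    o = sigma(X @ w)
    eps = 1e-12                                      # avoid log(0)
    return -np.sum(t * np.log(o + eps) + (1 - t) * np.log(1 - o + eps))

# Hypothetical weights and targets t in {0, 1}.
w = np.array([0.0, 1.0, -1.0])
X = np.array([[1.0, 0.5], [-1.0, 2.0], [0.3, -0.2]])
t = np.array([1, 0, 1])
print(neg_log_likelihood(w, X, t))
```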

Multiclass logistic regression

• The softmax function is used in various multi-class classification methods, such as multinomial logistic regression (also known as softmax regression), with the prediction given by the class with the largest softmax output ok = exp(netk) / Σj exp(netj)
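A minimal sketch of the softmax prediction, with the usual max-shift for numerical stability; the weight matrix and the function names softmax and predict are hypothetical:

```python
import numpy as np

def softmax(net):
    """Softmax: o_k = exp(net_k) / sum_j exp(net_j), shifted by max(net) for numerical stability."""
    z = net - np.max(net)
    e = np.exp(z)
    return e / np.sum(e)

def predict(W, x):
    """Multiclass logistic regression prediction: argmax over the softmax outputs."""
    x = np.concatenate(([1.0], x))     # bias input x0 = 1
    o = softmax(W @ x)                 # class probabilities
    return int(np.argmax(o)), o

# Hypothetical weight matrix for K = 3 classes and 2 inputs.
W = np.array([[0.1,  1.0, -0.5],
              [0.0, -1.0,  0.5],
              [-0.1, 0.2,  0.3]])
print(predict(W, np.array([1.0, 2.0])))
```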

Cross Entropy Loss Function for softmax

• Taking the gradient of the error function with respect to the vector wj we obtain, for one pattern, ∇wj E = (oj − tj) · x

Kronecker Delta Function

• Taking the gradient of the error function with respect to netj over all training patterns we get Σ (oj − tj)
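A minimal sketch that checks the per-pattern gradient oj − tj numerically against central differences, assuming a one-hot target t; the particular net values are hypothetical:

```python
import numpy as np

def softmax(net):
    z = net - np.max(net)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(net, t):
    """E = -sum_j t_j * log o_j with o = softmax(net) and one-hot target t."""
    return -np.sum(t * np.log(softmax(net) + 1e-12))

net = np.array([1.0, -0.5, 0.3])
t = np.array([0.0, 1.0, 0.0])                     # one-hot target
analytic = softmax(net) - t                       # dE/dnet_j = o_j - t_j

# Numerical check of the analytic gradient via central differences.
numeric = np.zeros_like(net)
for j in range(len(net)):
    h = np.zeros_like(net); h[j] = 1e-6
    numeric[j] = (cross_entropy(net + h, t) - cross_entropy(net - h, t)) / 2e-6

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```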

Literature

• Simon O. Haykin, Neural Networks and Learning Machines (3rd Edition), Pearson 2008

• Chapter 1

• Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer 2006

• Chapter 4

• Hands-On Machine Learning with Scikit-Learn and TensorFlow, Aurélien Géron, O'Reilly Media, 1st edition, 2017

• Chapter 4

Literature

• Machine Learning - A Journey to Deep Learning, A. Wichert, Luis Sa-Couto, World Scientific, 2021

• Chapter 5

Literature (Additional)

• Introduction To The Theory Of Neural Computation (Santa Fe Institute Series Book 1), John A. Hertz, Anders S. Krogh, Richard G. Palmer, Addison-Wesley Pub. Co, Redwood City, CA; 1 edition (January 1, 1991)

• Chapter 5