CS 472 - Perceptron 1

Transcript of CS 472 - Perceptron 1

Page 1: CS 472 -Perceptron 1

Page 2: CS 472 -Perceptron 1


Basic Neuron

Page 3: CS 472 -Perceptron 1


Expanded Neuron

Page 4: CS 472 -Perceptron 1

Perceptron Learning Algorithm

• First neural network learning model, from the 1960s
• Simple and limited (single-layer model)
• Basic concepts are similar for multi-layer models, so this is a good learning tool
• Still used in some current applications (large business problems where intelligibility is needed, etc.)

Page 5: CS 472 -Perceptron 1

Perceptron Node – Threshold Logic Unit

Inputs x1, x2, …, xn feed the node through weights w1, w2, …, wn; the node compares the weighted sum to a threshold θ and outputs z:

z = 1 if Σi=1..n wi xi ≥ θ
z = 0 if Σi=1..n wi xi < θ
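The threshold logic unit above maps directly to a few lines of code. Below is a minimal sketch (the function name and example values are illustrative, not from the slides):

```python
def tlu_output(x, w, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches theta, else 0."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net >= theta else 0

# Example: two inputs with weights .4 and -.2 and threshold .1
# (the values used in the worked example on later slides)
print(tlu_output([0.8, 0.3], [0.4, -0.2], 0.1))  # net = .26 >= .1, so output is 1
```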

Page 6: CS 472 -Perceptron 1

Perceptron Node – Threshold Logic Unit

z = 1 if Σi=1..n wi xi ≥ θ
z = 0 if Σi=1..n wi xi < θ

• Learn weights such that an objective function is maximized.
• What objective function should we use?
• What learning algorithm should we use?

Page 7: CS 472 -Perceptron 1

Perceptron Learning Algorithm

Two-input perceptron node with weights w1 = .4, w2 = -.2 and threshold θ = .1:

z = 1 if Σi wi xi ≥ θ
z = 0 if Σi wi xi < θ

Training set:
x1  x2  t
.8  .3  1
.4  .1  0

Page 8: CS 472 -Perceptron 1

First Training Instance

Inputs x1 = .8, x2 = .3; weights w1 = .4, w2 = -.2; threshold θ = .1

net = .8*.4 + .3*-.2 = .26

Since .26 ≥ θ, z = 1, which matches the target t = 1, so no weight change is needed.

Training set:
x1  x2  t
.8  .3  1
.4  .1  0

Page 9: CS 472 -Perceptron 1

Second Training Instance

Inputs x1 = .4, x2 = .1; weights w1 = .4, w2 = -.2; threshold θ = .1

net = .4*.4 + .1*-.2 = .14

Since .14 ≥ θ, z = 1, but the target is t = 0, so the weights are updated:

Δwi = (t - z) * c * xi

Training set:
x1  x2  t
.8  .3  1
.4  .1  0
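A minimal sketch of the two instances above in Python. The learning rate c is not given on this slide, so the 0.1 used here is purely an illustrative assumption:

```python
# Sketch of the worked example: weights .4 and -.2, threshold .1.
w = [0.4, -0.2]
theta = 0.1
c = 0.1  # assumed value for illustration only

def output(x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return net, (1 if net >= theta else 0)

net, z = output([0.8, 0.3])   # net = .26, z = 1, t = 1 -> no change
net, z = output([0.4, 0.1])   # net = .14, z = 1, t = 0 -> error, update weights
t = 0
w = [wi + c * (t - z) * xi for wi, xi in zip(w, [0.4, 0.1])]
print(w)  # [0.36, -0.21] with c = 0.1
```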

Page 10: CS 472 -Perceptron 1

Perceptron Rule Learning

Δwi = c(t – z) xi

• Where wi is the weight from input i to the perceptron node, c is the learning rate, t is the target for the current instance, z is the current output, and xi is the ith input
• Least perturbation principle
  – Only change weights if there is an error
  – Use a small c rather than changing the weights enough to make the current pattern correct
  – Scale by xi
• Create a perceptron node with n inputs
• Iteratively apply a pattern from the training set and apply the perceptron rule (a training-loop sketch follows below)
• Each iteration through the training set is an epoch
• Continue training until total training set error ceases to improve
• Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
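A minimal sketch of the training loop described above, assuming patterns already include a bias input of 1 (the function and variable names are mine, not from the slides):

```python
def train_perceptron(patterns, targets, c=1.0, max_epochs=100):
    """Perceptron rule: w_i += c * (t - z) * x_i, one update per pattern."""
    n = len(patterns[0])
    w = [0.0] * n                      # initial weights (all 0, as in the later examples)
    for epoch in range(max_epochs):
        errors = 0
        for x, t in zip(patterns, targets):
            net = sum(wi * xi for wi, xi in zip(w, x))
            z = 1 if net > 0 else 0
            if z != t:
                errors += 1
                w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
        if errors == 0:                # stop once an epoch makes no mistakes
            break
    return w
```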

Page 11: CS 472 -Perceptron 1


Page 12: CS 472 -Perceptron 1

Augmented Pattern Vectors

1 0 1 -> 0
1 0 0 -> 1

Augmented version:
1 0 1 1 -> 0
1 0 0 1 -> 1

• Treat the threshold like any other weight. No special case. Call it a bias since it biases the output up or down.
• Since we start with random weights anyway, we can ignore the -θ notion and just think of the bias as an extra available weight. (Note the author uses a -1 input.)
• Always use a bias weight (a small augmentation sketch follows below)
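A minimal sketch of the augmentation step, appending a constant 1 so the bias/threshold is learned like any other weight (variable names are illustrative):

```python
# Augment each pattern with a constant 1; the bias input could also be -1,
# as the author notes on this slide.
patterns = [[1, 0, 1], [1, 0, 0]]
augmented = [x + [1] for x in patterns]
print(augmented)  # [[1, 0, 1, 1], [1, 0, 0, 1]]
```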

Page 13: CS 472 -Perceptron 1

Perceptron Rule Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0

Page 14: CS 472 -Perceptron 1

Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0

Page 15: CS 472 -Perceptron 1

Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0             0    0           1 1 1 1
1 0 1 1   1           1 1 1 1

Page 16: CS 472 -Perceptron 1

Peer Instruction

• I pose a challenge question (often multiple choice), which will help solidify understanding of topics we have studied
  – There might not be just one correct answer
• You each get some time (1-2 minutes) to come up with your answer and vote – use Mentimeter (anonymous)
• Then you get some time to convince your group (neighbors) why you think you are right (2-3 minutes)
  – Learn from and teach each other!
• You vote again. You may change your vote if you want.
• We discuss the different responses together, show the votes, give you an opportunity to justify your thinking, and give you further insights


Page 17: CS 472 -Perceptron 1

Peer Instruction (PI) Why

• Studies show this approach improves learning
• Learn by doing, discussing, and teaching each other
  – Curse of knowledge / expert blind spot
  – Compared to talking with a peer who just figured it out and who can explain it in your own jargon
  – You never really know something until you can teach it to someone else – more improved learning!
• Learn to reason about your thinking and answers
• More enjoyable – you are involved and active in the learning


Page 18: CS 472 -Perceptron 1

How Groups Interact

• Best if group members have different initial answers
• 3 is the "magic" group number
  – You can self-organize "on-the-fly" or sit together specifically to be a group
  – Can go 2-4 on a given day to make sure everyone is involved
• Teach and learn from each other: discuss, reason, articulate
• If you know the answer, listen to where colleagues are coming from first, then be a great humble teacher; you will also learn by doing that, and you'll be on the other side in the future
  – I can't do that as well because every small group has different misunderstandings, and in groups you get to focus on your particular questions
• Be ready to justify your vote and reasoning to the class!


Page 19: CS 472 -Perceptron 1

**Challenge Question** - Perceptron

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0             0    0           1 1 1 1
1 0 1 1   1           1 1 1 1

• Once it converges the final weight vector will be:
  A. 1 1 1 1
  B. -1 0 1 0
  C. 0 0 0 0
  D. 1 0 0 0
  E. None of the above

Page 20: CS 472 -Perceptron 1

Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0             0    0           1 1 1 1
1 0 1 1   1           1 1 1 1             3    1           0 0 0 0
0 1 1 1   0           1 1 1 1

Page 21: CS 472 -Perceptron 1

Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0             0    0           1 1 1 1
1 0 1 1   1           1 1 1 1             3    1           0 0 0 0
0 1 1 1   0           1 1 1 1             3    1           0 -1 -1 -1
0 0 1 1   0           1 0 0 0

Page 22: CS 472 -Perceptron 1

Example

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
0 0 1 1   0           0 0 0 0             0    0           0 0 0 0
1 1 1 1   1           0 0 0 0             0    0           1 1 1 1
1 0 1 1   1           1 1 1 1             3    1           0 0 0 0
0 1 1 1   0           1 1 1 1             3    1           0 -1 -1 -1
0 0 1 1   0           1 0 0 0             0    0           0 0 0 0
1 1 1 1   1           1 0 0 0             1    1           0 0 0 0
1 0 1 1   1           1 0 0 0             1    1           0 0 0 0
0 1 1 1   0           1 0 0 0             0    0           0 0 0 0
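The trace above can be reproduced with a short script. This is a minimal sketch (variable names are mine) of exactly the setup on the slide: three inputs plus a bias input of 1, weights starting at 0, c = 1, output 1 if net > 0:

```python
patterns = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
targets  = [0, 1, 1, 0]
w, c = [0, 0, 0, 0], 1

for epoch in range(3):                       # the trace settles within two epochs
    for x, t in zip(patterns, targets):
        net = sum(wi * xi for wi, xi in zip(w, x))
        z = 1 if net > 0 else 0
        w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
        print(x, t, net, z, w)

# Final weight vector: [1, 0, 0, 0], matching the last rows of the table above
```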

Page 23: CS 472 -Perceptron 1

Perceptron Homework

• Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 1: Δwi = c(t – z) xi
• Show weights after each pattern for just one epoch
• Training set:
  1 0 1 -> 0
  1 .5 0 -> 0
  1 -.4 1 -> 1
  0 1 .5 -> 1

Pattern   Target (t)  Weight Vector (wi)  Net  Output (z)  ΔW
                      1 1 1 1

Page 24: CS 472 -Perceptron 1

Training Sets and Noise

• Assume a probability of error at each input and output value each time a pattern is trained on
• 0 0 1 0 1 1 0 0 1 1 0 -> 0 1 1 0
• e.g. P(error) = .05
• Or a probability that the algorithm is occasionally applied wrong (opposite)
• Averages out over learning

Page 25: CS 472 -Perceptron 1

Linear Separability

[Figure: examples of the two classes plotted in the X1–X2 plane]

2-d case (two inputs):

W1X1 + W2X2 > θ   (Z = 1)
W1X1 + W2X2 < θ   (Z = 0)

So, what is the decision boundary?

W1X1 + W2X2 = θ
X2 + W1X1/W2 = θ/W2
X2 = (-W1/W2)X1 + θ/W2

Y = MX + B

Page 26: CS 472 -Perceptron 1

Linear Separability

[Figure: examples of the two classes plotted in the X1–X2 plane]

2-d case (two inputs):

W1X1 + W2X2 > θ   (Z = 1)
W1X1 + W2X2 < θ   (Z = 0)

So, what is the decision boundary?

W1X1 + W2X2 = θ
X2 + W1X1/W2 = θ/W2
X2 = (-W1/W2)X1 + θ/W2

Y = MX + B

If there is no bias weight, the hyperplane must go through the origin.
Note that since θ = -bias, the equation with bias (B) is:
X2 = (-W1/W2)X1 - B/W2
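A minimal sketch of the slope/intercept algebra above for a 2-input perceptron with a bias weight b (so θ = -b); the function name and the example weights are illustrative:

```python
def boundary_line(w1, w2, b):
    """Decision boundary X2 = slope * X1 + intercept for w1*X1 + w2*X2 + b = 0."""
    slope = -w1 / w2          # M = -W1/W2
    intercept = -b / w2       # -bias/W2 (equivalently theta/W2)
    return slope, intercept

print(boundary_line(1.0, 2.0, -2.0))  # (-0.5, 1.0): the line X2 = -0.5*X1 + 1
```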

Page 27: CS 472 -Perceptron 1


Linear Separability

Page 28: CS 472 -Perceptron 1


Linear Separability and Generalization

When is data noise vs. a legitimate exception?

Page 29: CS 472 -Perceptron 1


Limited Functionality of Hyperplane

Page 30: CS 472 -Perceptron 1

How to Handle Multi-Class Output

• This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.)
• Create 1 perceptron for each output class, where the training set considers all other classes to be negative examples (one vs. the rest)
  – Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
  – If there is a tie, choose the perceptron with the highest net value (see the prediction sketch below)
• Create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes (one vs. one)
  – Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
  – In case of a tie, use the net values to decide
  – The number of models grows with the square of the number of output classes
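A minimal one-vs-the-rest prediction sketch under the assumptions above (one weight vector per class, tie broken by the highest net value; names and data are illustrative):

```python
def predict_one_vs_rest(x, class_weights, theta=0.0):
    """class_weights: dict mapping class label -> weight vector (same length as x)."""
    nets = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in class_weights.items()}
    fired = [c for c, net in nets.items() if net > theta]   # perceptrons that output 1
    candidates = fired if fired else list(nets)             # if none fire, fall back to all
    return max(candidates, key=lambda c: nets[c])           # break ties by highest net

print(predict_one_vs_rest([1.0, 0.5], {"A": [0.2, 0.4], "B": [0.5, -0.1]}))  # "B"
```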


Page 31: CS 472 -Perceptron 1

UC Irvine Machine Learning Database
Iris Data Set

4.8,3.0,1.4,0.3, Iris-setosa
5.1,3.8,1.6,0.2, Iris-setosa
4.6,3.2,1.4,0.2, Iris-setosa
5.3,3.7,1.5,0.2, Iris-setosa
5.0,3.3,1.4,0.2, Iris-setosa
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
6.9,3.1,4.9,1.5, Iris-versicolor
5.5,2.3,4.0,1.3, Iris-versicolor
6.5,2.8,4.6,1.5, Iris-versicolor
6.0,2.2,5.0,1.5, Iris-virginica
6.9,3.2,5.7,2.3, Iris-virginica
5.6,2.8,4.9,2.0, Iris-virginica
7.7,2.8,6.7,2.0, Iris-virginica
6.3,2.7,4.9,1.8, Iris-virginica

Page 32: CS 472 -Perceptron 1

Objective Functions: Accuracy/Error

• How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
• Consider how accurate the model is on the data set
  – Classification accuracy = # correct / total instances
  – Classification error = # misclassified / total instances (= 1 – accuracy)
• Usually minimize a loss function (aka cost, error)
• For real-valued outputs and/or targets
  – Pattern error = target – output: errors could cancel each other
    • Σ|tj – zj| (L1 loss), where j indexes all outputs in the pattern
    • Common approach is squared error = Σ(tj – zj)² (L2 loss)
  – Total sum squared error = Σ pattern squared errors = Σi Σj (tij – zij)², where i indexes all the patterns in the training set
• For nominal data, pattern error is typically 1 for a mismatch and 0 for a match
  – For nominal (including binary) outputs and targets, SSE and classification error are equivalent


Page 33: CS 472 -Perceptron 1

Mean Squared Error

• Mean Squared Error (MSE) – SSE/n, where n is the number of instances in the data set
  – This can be nice because it normalizes the error for data sets of different sizes
  – MSE is the average squared error per pattern
• Root Mean Squared Error (RMSE) – the square root of the MSE
  – This puts the error value back into the same units as the features, since we squared the error in the SSE, and can thus be more intuitive
  – RMSE is the average distance (error) of targets from the outputs, in the same scale as the features
  – Note RMSE is the root of the total data set MSE, NOT the sum of the roots of each individual pattern's MSE (see the sketch below)
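A minimal sketch of the four error measures for a small set of (output, target) pairs; the values here are just an illustration, not from the slides:

```python
import math

outputs = [0.5, 1.0]
targets = [1.0, 1.0]

l1   = sum(abs(t - z) for t, z in zip(targets, outputs))      # L1 loss = 0.5
sse  = sum((t - z) ** 2 for t, z in zip(targets, outputs))    # SSE / L2 = 0.25
mse  = sse / len(targets)                                     # MSE = 0.125
rmse = math.sqrt(mse)                                         # RMSE ~= 0.354
print(l1, sse, mse, rmse)
```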


Page 34: CS 472 -Perceptron 1

**Challenge Question** - Error

• Given the following data set, what is the L1 (Σ|ti – zi|), SSE/L2 (Σ(ti – zi)²), MSE, and RMSE error for the entire data set?

x    y    Output  Target
2    -3   1       1
0    1    0       1
.5   .6   .8      .2

       Data Set
L1     x
SSE    x
MSE    x
RMSE   x

A. .4   1     1    1
B. 1.6  2.36  1    1
C. .4   .64   .21  0.453
D. 1.6  1.36  .67  .82
E. None of the above

Page 35: CS 472 -Perceptron 1

**Challenge Question** - Error

• Given the following data set, what is the L1 (Σ|ti – zi|), SSE/L2 (Σ(ti – zi)²), MSE, and RMSE error for the entire data set?

x    y    Output  Target
2    -3   1       1
0    1    0       1
.5   .6   .8      .2

       Data Set
L1     1.6
SSE    1.36
MSE    1.36/3 = .453
RMSE   .453^.5 = .67

A. .4   1     1    1
B. 1.6  2.36  1    1
C. .4   .64   .21  0.453
D. 1.6  1.36  .67  .82
E. None of the above

Page 36: CS 472 -Perceptron 1

SSE Homework

• Given the following data set, what is the L1, SSE (L2), MSE, and RMSE error of Output1, Output2, and the entire data set? Fill in the cells that have an x.

x    y    Output1  Target1  Output2  Target2
-1   -1   0        1        .6       1.0
-1   1    1        1        -.3      0
1    -1   1        0        1.2      .5
1    1    0        0        0        -.2

       Output1  Output2  Data Set
L1     x        x        x
SSE    x        x        x
MSE    x        x        x
RMSE   x        x        x

Page 37: CS 472 -Perceptron 1

Gradient Descent Learning: Minimize (Maximize) the Objective Function

SSE: Sum Squared Error = Σ(t – z)²

[Figure: error landscape – the SSE surface plotted over the weight values, with its minimum at 0]

Page 38: CS 472 -Perceptron 1

Deriving a Gradient Descent Learning Algorithm

• Goal is to decrease overall error (or other loss function) each time a weight is changed
• Total sum squared error is one possible loss function E: Σ(t – z)²
• Seek a weight-changing algorithm such that ∂E/∂wij is negative
• If a formula can be found then we have a gradient descent learning algorithm
• Delta rule is a variant of the perceptron rule which gives a gradient descent learning algorithm with perceptron nodes

Page 39: CS 472 -Perceptron 1

Delta Rule Algorithm

Δwi = c(t − net) xi

• Delta rule uses (target − net), before the net value goes through the threshold, in the learning rule to decide the weight update (a one-update sketch follows below)
• Weights are updated even when the output would be correct
• Because this model is single layer, and because of the SSE objective function, the error surface is guaranteed to be parabolic with only one minimum
• Learning rate
  – If the learning rate is too large, it can jump around the global minimum
  – If too small, it will get to the minimum, but will take a longer time
  – Can decrease the learning rate over time to give higher speed and still attain the global minimum (although the exact minimum is still just for the training set and thus…)
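A minimal sketch of a single delta-rule update (function name and values are illustrative); note the error term uses net directly, before thresholding, in contrast to the perceptron rule:

```python
def delta_rule_update(w, x, t, c):
    """One delta-rule step: w_i += c * (t - net) * x_i."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + c * (t - net) * xi for wi, xi in zip(w, x)]

print(delta_rule_update([0.0, 0.0, 0.0], [1, 0, 1], 1, 0.1))  # [0.1, 0.0, 0.1]
```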

Page 40: CS 472 -Perceptron 1

Batch vs. Stochastic Update

• To get the true gradient with the delta rule, we need to sum errors over the entire training set and only update weights at the end of each epoch
• Batch (gradient) vs. stochastic (on-line, incremental) – see the sketch below
  – SGD (Stochastic Gradient Descent)
  – With the stochastic delta rule algorithm, you update after every pattern, just like with the perceptron algorithm (even though that means each change may not be exactly along the true gradient)
  – Stochastic is more efficient and best to use in almost all cases, though not everyone has figured it out yet
  – We'll talk about this a little more when we get to Backpropagation


Page 41: CS 472 -Perceptron 1

Perceptron Rule vs. Delta Rule

• Perceptron rule (target − thresholded output) is guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge – it could get into a cycle
• Single-layer delta rule is guaranteed to have only one global minimum. Thus, it will converge to the best SSE solution whether the problem is linearly separable or not.
  – Could have a higher misclassification rate than with the perceptron rule and a less intuitive decision surface – we will discuss this later with regression
• Stopping criteria
  – For these models we stop when we are no longer making progress
  – When you have gone a few epochs with no significant improvement/change between epochs (including oscillations)

Page 42: CS 472 -Perceptron 1

Exclusive Or

[Figure: XOR patterns plotted in the X1–X2 plane – output 1 at (0,1) and (1,0), output 0 at (0,0) and (1,1)]

Is there a dividing hyperplane?

Page 43: CS 472 -Perceptron 1

Linearly Separable Boolean Functions

• d = # of dimensions (i.e. inputs)

Page 44: CS 472 -Perceptron 1

Linearly Separable Boolean Functions

• d = # of dimensions
• P = 2^d = # of patterns

Page 45: CS 472 -Perceptron 1

Linearly Separable Boolean Functions

• d = # of dimensions
• P = 2^d = # of patterns
• 2^P = 2^(2^d) = # of functions

n   Total Functions   Linearly Separable Functions
0   2                 2
1   4                 4
2   16                14

Page 46: CS 472 -Perceptron 1

Linearly Separable Boolean Functions

• d = # of dimensions
• P = 2^d = # of patterns
• 2^P = 2^(2^d) = # of functions

n   Total Functions   Linearly Separable Functions
0   2                 2
1   4                 4
2   16                14
3   256               104
4   65536             1882
5   4.3 × 10^9        94572
6   1.8 × 10^19       1.5 × 10^7
7   3.4 × 10^38       8.4 × 10^9

Page 47: CS 472 -Perceptron 1

Linearly Separable Functions

LS(P, d) = 2 Σi=0..d (P−1)! / ((P−1−i)! i!)   for P > d
         = 2^P                                for P ≤ d

(All patterns are separable for d = P, i.e. all 8 ways of dividing 3 vertices of a cube for d = P = 3)

where P is the # of patterns for training and d is the # of inputs

lim d→∞ (# of LS functions) = ∞
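A minimal sketch of the LS(P, d) count above (it assumes the P points are in general position, which is the setting of the formula):

```python
from math import comb

def ls(P, d):
    """Number of linearly separable dichotomies of P points (in general position) in d dimensions."""
    if P <= d:
        return 2 ** P
    return 2 * sum(comb(P - 1, i) for i in range(d + 1))

print(ls(4, 2))  # 14, matching the d = 2 row of the table on the previous slide
print(ls(3, 3))  # 8, the "all 8 ways of dividing 3 vertices of a cube" case (P <= d)
```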

Page 48: CS 472 -Perceptron 1

Linear Models which are Non-Linear in the Input Space

• So far we have used

  f(x, w) = sign(Σi=1..n wi xi)

• We could preprocess the inputs in a non-linear way and do

  f(x, w) = sign(Σi=1..m wi φi(x))

• To the perceptron algorithm it looks just the same and it can use the same learning algorithm; it just has different inputs (SVM)
• For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs x², y², and x·y (see the feature-expansion sketch below)
• The perceptron would just think it is a 5-dimensional task, and it is linear (a 5-d hyperplane) in those 5 dimensions
  – But what kind of decision surfaces would it allow for the original 2-d input space?
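A minimal sketch of the preprocessing φ described above for two inputs (the function name is illustrative):

```python
def quadric_features(x, y):
    """Expand two inputs into the five features mentioned above: x, y, x^2, y^2, x*y.
    A linear separator over these features is a quadratic surface in the original space."""
    return [x, y, x * x, y * y, x * y]

print(quadric_features(2.0, -1.0))  # [2.0, -1.0, 4.0, 1.0, -2.0]
```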

Page 49: CS 472 -Perceptron 1

Quadric Machine

• All quadratic surfaces (2nd order)
  – ellipsoid
  – parabola
  – etc.
• That significantly increases the number of problems that can be solved
• But there are still many problems which are not quadrically separable
• Could go to 3rd- and higher-order features, but the number of possible features grows exponentially
• Multi-layer neural networks will allow us to discover high-order features automatically from the input space

Page 50: CS 472 -Perceptron 1

Simple Quadric Example

• What is the decision surface for a 1-d (input) problem?
• A perceptron with just feature f1 cannot separate the data
• Could we add a transformed feature to our perceptron?

[Figure: 1-d data plotted on the f1 axis from -3 to 3]

Page 51: CS 472 -Perceptron 1

Simple Quadric Example

• A perceptron with just feature f1 cannot separate the data
• Could we add a transformed feature to our perceptron?
• f2 = f1²

[Figure: 1-d data plotted on the f1 axis from -3 to 3]

Page 52: CS 472 -Perceptron 1

Simple Quadric Example

• A perceptron with just feature f1 cannot separate the data
• Could we add another feature to our perceptron, f2 = f1²?
• Note we could also think of this as just using feature f1 but now allowing a quadric surface to divide the data

[Figures: the 1-d data on the f1 axis from -3 to 3, and the same data plotted in the (f1, f2) plane]

Page 53: CS 472 -Perceptron 1

Quadric Machine Homework

• Assume a 2-input perceptron expanded to be a quadric perceptron (it outputs 1 if net > 0, else 0). Note that with bipolar (binary) inputs of -1, 1, x² and y² would always be 1 and thus do not add information and are not needed. (They would just act like two more bias weights. With non-binary inputs we would want them.)
• Assume a learning rate c of .4 and initial weights all 0: Δwi = c(t – z) xi
• Show weights after each pattern for one epoch with the following non-linearly separable training set
• Has it learned to solve the problem after just one epoch?

x    y    Target
-1   -1   0
-1   1    1
1    -1   1
1    1    0