Neural Networks - SCHOOL OF COMPUTER SCIENCE, Carnegie Mellon

Page 1

Neural Networks

Machine Learning 10-701

Tom M. Mitchell, Center for Automated Learning and Discovery

Carnegie Mellon University

October 4, 2005

Required reading:
• Neural nets: Mitchell chapter 4

Optional reading:
• Bias/Variance error decomposition: Bishop 9.1, 9.2

Page 2

Today:

• Finish up
  – MLE vs MAP for logistic regression
  – Generative/Discriminative classifiers

• Artificial neural networks

Page 3

MLE vs MAP

• Maximum conditional likelihood estimate

• Maximum a posteriori estimate, with Gaussian prior P(W) = N(0, σI)

  ln P(W) ↔ c ∑i wi²

Page 4

MLE vs MAP

• Maximum conditional likelihood estimate

• Maximum a posteriori estimate, with Gaussian prior P(W) = N(<μ1 ... μn>, σI)

  ln P(W) ↔ c ∑i (wi - μi)²
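A minimal sketch of how the two estimates differ in practice, assuming gradient-ascent training of logistic regression (the function, learning rate, and step count below are illustrative assumptions, not from the lecture): the Gaussian prior contributes a penalty c ∑i (wi - μi)² whose gradient pulls the weights toward the prior mean.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, mu=None, c=0.0, lr=0.1, steps=1000):
        """MLE when c == 0; MAP with Gaussian prior mean mu when c > 0."""
        w = np.zeros(X.shape[1])
        if mu is None:
            mu = np.zeros_like(w)             # zero-mean prior: plain L2 penalty
        for _ in range(steps):
            p = sigmoid(X @ w)                # P(Y = 1 | x, W) for each example
            grad_ll = X.T @ (y - p)           # gradient of the conditional log-likelihood
            grad_prior = -2.0 * c * (w - mu)  # gradient of the Gaussian log-prior penalty
            w += lr * (grad_ll + grad_prior)  # setting c = 0 recovers the MLE update
        return w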

Page 5

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers:

• Assume some functional form for P(X|Y), P(X)

• Estimate parameters of P(X|Y), P(X) from training data

• Use Bayes rule to calculate P(Y|X= xi)
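For concreteness, the Bayes rule computation referred to above (the slide's own equation did not survive extraction; this is the standard form, in LaTeX):

    P(Y = y_k \mid X = x_i)
      = \frac{P(X = x_i \mid Y = y_k)\, P(Y = y_k)}
             {\sum_j P(X = x_i \mid Y = y_j)\, P(Y = y_j)}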

Discriminative classifiers:

• Assume some functional form for P(Y|X)

• Estimate parameters of P(Y|X) from training data

Page 6

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X=<X1 ... Xn>

Number of parameters:
• NB: 4n + 1  (for each of the two classes: n means and n variances, plus one class prior)
• LR: n + 1  (n weights plus one bias term)

Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled

Page 7

Naïve Bayes vs. Logistic Regression
• Generative and Discriminative classifiers

• Asymptotic comparison (# training examples → ∞)
  – when the model assumptions are correct:
    • GNB and LR produce identical classifiers
  – when the model assumptions are incorrect:
    • LR is less biased (does not assume conditional independence)
    • therefore LR is expected to outperform GNB

[Ng & Jordan, 2002]

Page 8

Naïve Bayes vs. Logistic Regression
• Generative and Discriminative classifiers

• Non-asymptotic analysis (see [Ng & Jordan, 2002] )

• convergence rate of parameter estimates (slightly oversimplified – see paper for bounds)

• GNB order log n (where n = # of attributes in X)

• LR order n

GNB converges more quickly to its (perhaps less helpful) asymptotic estimates
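A hedged sketch of the kind of learning-curve comparison reported in [Ng & Jordan, 2002], run with scikit-learn on a UCI-style dataset it ships with (dataset choice and hyperparameters are illustrative assumptions, not the lecture's experiments). The pattern one would expect from the analysis above: GNB does relatively better with few training examples, while LR catches up or overtakes it as the training set grows.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)   # a UCI dataset bundled with sklearn
    sizes = np.linspace(0.1, 1.0, 5)             # fractions of the training set to use

    for name, clf in [("GNB", GaussianNB()),
                      ("LR", LogisticRegression(max_iter=5000))]:
        n_train, _, test_scores = learning_curve(clf, X, y, train_sizes=sizes, cv=5)
        print(name, dict(zip(n_train, test_scores.mean(axis=1).round(3))))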

Page 9

Some experiments from UCI data sets

Page 10

What you should know:

• Logistic regression
  – Functional form follows from Naïve Bayes assumptions
  – But training procedure picks parameters without the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • ‘regularization’

• Gradient ascent/descent
  – General approach when closed-form solutions unavailable

• Generative vs. Discriminative classifiers

Page 11

Artificial Neural Networks

Page 12

Artificial Neural Networks to learn f: X → Y

• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

• Represent f by a network of threshold units
• Each unit is a logistic function

• MLE: train weights of all units to minimize sum of squared errors at network outputs
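A minimal sketch of this idea, assuming one hidden layer, sigmoid (logistic) units throughout, batch gradient descent, and a toy target invented for illustration; the network size, learning rate, and data below are assumptions, not the lecture's example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))                 # toy inputs
    y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]    # an invented non-linear target

    n_hidden, lr = 5, 0.5
    W1 = rng.normal(scale=0.5, size=(2, n_hidden))        # input-to-hidden weights
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))        # hidden-to-output weights

    for _ in range(2000):
        h = sigmoid(X @ W1)                        # hidden logistic units
        o = sigmoid(h @ W2)                        # network output
        delta_o = (y - o) * o * (1 - o)            # output-layer error term
        delta_h = (delta_o @ W2.T) * h * (1 - h)   # backpropagated to hidden layer
        W2 += lr * h.T @ delta_o / len(X)          # gradient step on the squared error
        W1 += lr * X.T @ delta_h / len(X)

    print("final SSE:", float(((y - sigmoid(sigmoid(X @ W1) @ W2)) ** 2).sum()))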

Page 15

ALVINN [Pomerleau 1993]: a neural network trained to steer an autonomous vehicle from camera images.

Page 17

MLE Training for Neural Networks

• Consider the regression problem f: X → Y, for scalar Y:
  y = f(x) + ε,   f deterministic, ε noise ~ N(0, σε)
  where f is approximated by the learned neural network
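The slide's equations did not survive extraction; the standard derivation they summarize (consistent with the Gaussian-noise assumption above, with f̂(x; W) denoting the learned network) is:

    W_{MLE} = \arg\max_W \prod_d P(y^d \mid x^d, W)
            = \arg\max_W \prod_d \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}
                  \exp\!\Big(-\frac{(y^d - \hat f(x^d; W))^2}{2\sigma_\varepsilon^2}\Big)
            = \arg\min_W \sum_d \big(y^d - \hat f(x^d; W)\big)^2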

Page 18

MAP Training for Neural Networks

• Consider the regression problem f: X → Y, for scalar Y:
  y = f(x) + ε,   f deterministic, ε noise ~ N(0, σε)

Gaussian prior P(W) = N(0, σI)

  ln P(W) ↔ c ∑i wi²
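Again reconstructing the lost equation under the same assumptions: combining the Gaussian likelihood with this Gaussian weight prior gives the regularized ("weight decay") objective, for some constant c > 0 determined by the prior and noise variances:

    W_{MAP} = \arg\max_W \ln\Big[ P(W) \prod_d P(y^d \mid x^d, W) \Big]
            = \arg\min_W \Big[ \sum_d \big(y^d - \hat f(x^d; W)\big)^2 + c \sum_i w_i^2 \Big]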

Page 19

MLE Training for Neural Networks

• Consider the regression problem f: X → Y, for Y = <y1 ... yN>:
  yi = fi(x) + εi,   each fi deterministic, εi noise ~ N(0, σ) drawn independently for each output yi
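With the outputs' noise terms independent, the likelihood factors over outputs as well as examples, so the lost equation presumably reduces (as in the scalar case) to a sum of squared errors over all outputs:

    W_{MLE} = \arg\min_W \sum_d \sum_{i=1}^{N} \big(y_i^d - \hat f_i(x^d; W)\big)^2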

Page 20

t_d = target output for training example d

o_d = observed unit output for training example d
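These symbols label the training objective and update rule the slide presumably showed (the standard convention in Mitchell chapter 4; the 1/2 factor and step size η are reconstructed here rather than copied from the slide):

    E_D(W) = \tfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2,
    \qquad
    \Delta w_i = -\eta \,\frac{\partial E_D}{\partial w_i}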

Page 34

Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]

No hierarchy given.

Train with assertions, e.g., Can(Canary,Fly)
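A hedged sketch of how such an assertion could become a training example for a network of this kind: the item and relation are encoded as one-hot inputs and the asserted attributes as target outputs. The vocabularies and helper below are illustrative assumptions, not the paper's actual encoding.

    items = ["canary", "robin", "salmon", "oak", "rose"]
    relations = ["isa", "can", "has"]
    attributes = ["fly", "sing", "swim", "grow", "bird", "fish", "plant", "animal"]

    def encode(relation, item, asserted_attributes):
        """Map an assertion like Can(Canary, Fly) to (input vector, target vector)."""
        x = [1.0 if i == item else 0.0 for i in items] + \
            [1.0 if r == relation else 0.0 for r in relations]
        t = [1.0 if a in asserted_attributes else 0.0 for a in attributes]
        return x, t

    # Can(Canary, Fly) and Can(Canary, Sing) as one training example:
    x, t = encode("can", "canary", ["fly", "sing"])
    # (x, t) can then be fed to a multilayer network of logistic units like the
    # one on the earlier slides; no hierarchy is ever given as input.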

Page 35

Humans act as though they have a hierarchical memory organization

1. Victims of Semantic Dementia progressively lose knowledge of objects.
   But they lose specific details first and general properties later, suggesting a hierarchical memory organization.

(Hierarchy diagram: Thing → Living / NonLiving; Living → Animal / Plant; Animal → Bird / Fish; Bird → Canary)

2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.

* some debate remains on this.

Question: What learning mechanism could produce this emergent hierarchy?

Page 36

Memory deterioration follows the semantic hierarchy [McClelland & Rogers, Nature 2003]

Page 38

ANN Also Models Progressive Deterioration [McClelland & Rogers, Nature 2003]

average effect of noise in inputs to hidden layers

Page 39

Original MLE error fn.

Page 40

Bias/Variance Decomposition of Error

Page 41

Bias-Variance decomposition of error
Reading: Bishop chapter 9.1, 9.2

• Consider the simple regression problem f: X → Y:
  y = f(x) + ε,   f deterministic, ε noise ~ N(0, σ)
  with h(x) the learned estimate of f

What are the sources of prediction error?

Page 42

Sources of error
• What if we have a perfect learner and infinite data?
  – Our learned h(x) satisfies h(x) = f(x)
  – We still have remaining, unavoidable error: σ²
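In symbols (a one-line check of the claim above, using the noise model y = f(x) + ε):

    h(x) = f(x) \;\Rightarrow\;
    \mathbb{E}\big[(y - h(x))^2\big] = \mathbb{E}\big[\varepsilon^2\big] = \sigma^2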

Page 43

Sources of error
• What if we have only n training examples?
• What is our expected error?
  – Taken over random training sets of size n, drawn from distribution D = p(x, y)
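The decomposition this sequence of slides builds toward (the standard bias-variance result from Bishop 9.1-9.2, stated for the expected squared error at a fixed x, where h_S denotes the hypothesis learned from training set S of size n and the last term is the unavoidable noise):

    \mathbb{E}_{S,\varepsilon}\big[(y - h_S(x))^2\big]
      = \underbrace{\big(\mathbb{E}_S[h_S(x)] - f(x)\big)^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}_S\big[(h_S(x) - \mathbb{E}_S[h_S(x)])^2\big]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{noise}}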

Page 44

Sources of error