Neural Networks - School of Computer Science, Carnegie Mellon


Neural Networks

Machine Learning 10-701

Tom M. Mitchell
Center for Automated Learning and Discovery

Carnegie Mellon University

October 4, 2005

Required reading:
• Neural nets: Mitchell chapter 4

Optional reading:
• Bias/Variance error decomposition: Bishop 9.1, 9.2

Today:

• Finish up
  – MLE vs MAP for logistic regression
  – Generative/Discriminative classifiers

• Artificial neural networks

MLE vs MAP

• Maximum conditional likelihood estimate

• Maximum a posteriori estimate
  Gaussian P(W) = N(0, σI)

ln P(W) ↔ c ∑_i w_i²

MLE vs MAP

• Maximum conditional likelihood estimate

• Maximum a posteriori estimate
  Gaussian P(W) = N(<μ_1 ... μ_n>, σI)

ln P(W) ↔ c ∑_i (w_i − μ_i)²
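
A minimal NumPy sketch of the two criteria for logistic regression (the synthetic data, step size, and penalty strength below are illustrative assumptions, not values from the lecture): gradient ascent on the conditional log-likelihood gives the MLE; adding the Gaussian log-prior term, which contributes −c ∑_i (w_i − μ_i)², gives the MAP estimate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, mu=None, c=0.0, lr=0.01, steps=2000):
    """Gradient ascent on the conditional log-likelihood of W.

    c = 0           -> MLE (maximum conditional likelihood estimate)
    c > 0, given mu -> MAP with Gaussian prior N(mu, sigma*I), whose log
                       contributes the -c * sum_i (w_i - mu_i)^2 penalty.
    """
    w = np.zeros(X.shape[1])
    mu = np.zeros_like(w) if mu is None else mu    # zero-mean prior by default
    for _ in range(steps):
        p = sigmoid(X @ w)                         # P(Y = 1 | X, W)
        grad_loglik = X.T @ (y - p)                # gradient of conditional log-likelihood
        grad_logprior = -2.0 * c * (w - mu)        # gradient of the log-prior term
        w += lr * (grad_loglik + grad_logprior)
    return w

# Illustrative synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (sigmoid(X @ np.array([2.0, -1.0, 0.5])) > rng.uniform(size=200)).astype(float)

w_mle = train_logistic(X, y)            # MLE
w_map = train_logistic(X, y, c=1.0)     # MAP: weights shrink toward the prior mean 0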

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers:

• Assume some functional form for P(X|Y), P(Y)

• Estimate parameters of P(X|Y), P(Y) from training data

• Use Bayes rule to calculate P(Y | X = x_i)

Discriminative classifiers:

• Assume some functional form for P(Y|X)

• Estimate parameters of P(Y|X) from training data
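
As a concrete contrast, here is a minimal NumPy sketch of the generative route, using per-feature Gaussian class-conditionals as the assumed functional form (the Gaussian choice, the independence of features within each class, and the small variance floor are illustrative assumptions): estimate P(Y) and P(X|Y) from training data, then apply Bayes rule to obtain P(Y | X = x). A discriminative classifier such as logistic regression would instead parameterize P(Y|X) and fit it directly.

import numpy as np

def fit_generative(X, y):
    """Estimate P(Y) and per-class, per-feature Gaussian P(X_i | Y)."""
    params = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        params[cls] = {
            "prior": len(Xc) / len(X),            # P(Y = cls)
            "mean": Xc.mean(axis=0),              # means of P(X_i | Y = cls)
            "var": Xc.var(axis=0) + 1e-9,         # variances, with a small floor
        }
    return params

def posterior(params, x):
    """Bayes rule: P(Y | X = x) ∝ P(X = x | Y) P(Y)."""
    log_joint = {}
    for cls, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mean"]) ** 2 / p["var"])
        log_joint[cls] = np.log(p["prior"]) + log_lik
    m = max(log_joint.values())
    unnorm = {cls: np.exp(v - m) for cls, v in log_joint.items()}
    z = sum(unnorm.values())
    return {cls: v / z for cls, v in unnorm.items()}

# Example: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(posterior(fit_generative(X, y), np.array([2.5, 2.5])))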

Naïve Bayes vs Logistic Regression

Consider Y boolean, X_i continuous, X = <X_1 ... X_n>

Number of parameters:
• NB: 4n + 1
• LR: n + 1

Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled
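
Spelling the counts out (boolean Y, continuous X_i modeled with class-conditional Gaussians), in LaTeX notation:

\underbrace{1}_{P(Y=1)} + \underbrace{2n}_{\mu_{i,0},\;\mu_{i,1}} + \underbrace{2n}_{\sigma^2_{i,0},\;\sigma^2_{i,1}} = 4n + 1 \qquad \text{(Naive Bayes)}

\underbrace{n}_{w_1,\dots,w_n} + \underbrace{1}_{w_0} = n + 1 \qquad \text{(logistic regression)}

The Naive Bayes counts also decouple: each mean and variance is a per-class, per-feature statistic that can be estimated on its own, whereas the logistic regression weights must be fit jointly, since the best value of each w_i depends on the others.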

Naïve Bayes vs. Logistic Regression
• Generative and Discriminative classifiers

• Asymptotic comparison (# training examples → infinity)

• when model correct

• GNB, LR produce identical classifiers

• when model incorrect

• LR is less biased – does not assume conditional independence

• therefore expected to outperform GNB

[Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression
• Generative and Discriminative classifiers

• Non-asymptotic analysis (see [Ng & Jordan, 2002])

• convergence rate of parameter estimates (slightly oversimplified – see paper for bounds)

• GNB order log n (where n = # of attributes in X)

• LR order n

GNB converges more quickly to its (perhaps less helpful) asymptotic estimates

Some experiments from UCI data sets
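
The UCI plots themselves are not reproduced in this transcript. The sketch below (assuming scikit-learn and its bundled breast-cancer data, which are not part of the lecture) shows how one might run the same kind of comparison: GNB typically approaches its asymptote with few examples, while LR often catches up and passes it as the training-set size m grows.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (25, 50, 100, len(X_tr)):                       # growing training-set sizes
    gnb = GaussianNB().fit(X_tr[:m], y_tr[:m])
    lr = LogisticRegression(max_iter=5000).fit(X_tr[:m], y_tr[:m])
    print(f"m={m:4d}  GNB acc={gnb.score(X_te, y_te):.3f}  LR acc={lr.score(X_te, y_te):.3f}")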

What you should know:

• Logistic regression
  – Functional form follows from Naïve Bayes assumptions
  – But training procedure picks parameters without the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • 'regularization'

• Gradient ascent/descent
  – General approach when closed-form solutions are unavailable

• Generative vs. Discriminative classifiers

Artificial Neural Networks

Artificial Neural Networks to learn f: X → Y

• f might be a non-linear function
• X (vector of) continuous and/or discrete variables
• Y (vector of) continuous and/or discrete variables

• Represent f by a network of threshold units
• Each unit is a logistic function

• MLE: train weights of all units to minimize sum of squared errors at network outputs
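
A minimal NumPy sketch of exactly this setup (layer sizes, the synthetic target, and the learning rate are illustrative): a network of logistic (sigmoid) units whose weights are trained by gradient descent, i.e., backpropagation, to minimize the sum of squared errors at the network outputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))                    # inputs
Y = ((X[:, 0] * X[:, 1]) > 0).astype(float)[:, None]     # a non-linear target

W1 = rng.normal(scale=0.5, size=(2, 5))                  # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(5, 1))                  # hidden -> output weights
lr = 0.5

for _ in range(5000):
    H = sigmoid(X @ W1)                  # hidden (logistic) unit outputs
    O = sigmoid(H @ W2)                  # network outputs
    # Gradients of the sum-of-squared-errors objective, by backpropagation.
    dO = (O - Y) * O * (1 - O)
    dW2 = H.T @ dO
    dH = (dO @ W2.T) * H * (1 - H)
    dW1 = X.T @ dH
    W2 -= lr * dW2 / len(X)
    W1 -= lr * dW1 / len(X)

print("final sum of squared errors:", float(np.sum((sigmoid(sigmoid(X @ W1) @ W2) - Y) ** 2)))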

ALVINN [Pomerleau 1993]

• Consider regression problem f: X → Y, for scalar Y:
  y = f(x) + ε, where f is deterministic and ε is noise drawn from N(0, σ_ε)

MLE Training for Neural Networks

Learned neural network
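
The step from that noise model to the training criterion, written out (a standard reconstruction in LaTeX, using the notation of the slide):

\begin{aligned}
W_{MLE} &= \arg\max_W \prod_{d \in D} P(y_d \mid x_d, W) \\
        &= \arg\max_W \prod_{d \in D} \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}
           \exp\!\left( -\frac{(y_d - f(x_d; W))^2}{2\sigma_\varepsilon^2} \right) \\
        &= \arg\min_W \sum_{d \in D} \big( y_d - f(x_d; W) \big)^2
\end{aligned}

So, under the Gaussian-noise assumption, maximizing the conditional likelihood of the training data is the same as minimizing the network's sum of squared errors.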

• Consider regression problem f: X → Y, for scalar Y:
  y = f(x) + ε, where f is deterministic and ε is noise drawn from N(0, σ_ε)

MAP Training for Neural Networks

Gaussian P(W) = N(0, σI)

ln P(W) ↔ c ∑_i w_i²
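
Combining that prior with the Gaussian-noise likelihood gives the regularized objective for MAP training of the network (a reconstruction in LaTeX; c absorbs the variance constants):

\begin{aligned}
W_{MAP} &= \arg\max_W \Big[ \ln P(W) + \sum_{d \in D} \ln P(y_d \mid x_d, W) \Big] \\
        &= \arg\min_W \Big[ \sum_{d \in D} \big( y_d - f(x_d; W) \big)^2 + c \sum_i w_i^2 \Big]
\end{aligned}

That is, the original MLE error function plus a penalty on large weights, the usual 'weight decay' regularizer.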

• Consider regression problem f: X → Y, for Y = <y_1 ... y_N>:
  y_i = f_i(x) + ε_i, where f is deterministic and the noise ε_i ~ N(0, σ) is drawn independently for each output y_i

MLE Training for Neural Networks

t_d = target output

o_d = observed unit output
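
In this notation, with independent Gaussian noise on each output unit, the MLE criterion is the familiar sum of squared errors over training examples d and output units k (Mitchell, ch. 4):

E(W) \;=\; \frac{1}{2} \sum_{d \in D} \sum_{k \in \mathrm{outputs}} \big( t_{kd} - o_{kd} \big)^2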

Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]

No hierarchy given.

Train with assertions, e.g., Can(Canary,Fly)
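
A rough sketch of how such assertions might be turned into training data for a feed-forward network (the item/relation/attribute vocabulary and the encoding below are illustrative assumptions, not details taken from McClelland & Rogers): each (item, relation) pair becomes a one-hot input, and the set of attributes asserted for it becomes a multi-hot target.

import numpy as np

# Illustrative vocabulary (assumed for this sketch, not from the paper).
items = ["canary", "salmon", "oak", "rose"]
relations = ["ISA", "can", "has"]
attributes = ["bird", "fish", "tree", "flower", "fly", "swim", "grow", "wings", "gills"]

# Assertions in the Can(Canary, Fly) style: (item, relation, attribute).
assertions = [
    ("canary", "ISA", "bird"), ("canary", "can", "fly"),  ("canary", "has", "wings"),
    ("salmon", "ISA", "fish"), ("salmon", "can", "swim"), ("salmon", "has", "gills"),
    ("oak", "ISA", "tree"),    ("oak", "can", "grow"),
    ("rose", "ISA", "flower"), ("rose", "can", "grow"),
]

def encode_input(item, relation):
    """One-hot item concatenated with one-hot relation."""
    x = np.zeros(len(items) + len(relations))
    x[items.index(item)] = 1.0
    x[len(items) + relations.index(relation)] = 1.0
    return x

# Multi-hot targets: every attribute asserted for a given (item, relation) pair.
targets = {}
for item, rel, attr in assertions:
    t = targets.setdefault((item, rel), np.zeros(len(attributes)))
    t[attributes.index(attr)] = 1.0

X = np.stack([encode_input(i, r) for (i, r) in targets])
Y = np.stack(list(targets.values()))
# X and Y can now be fed to a sigmoid network trained on squared error (as in
# the earlier sketch); the learned hidden representations are what the model
# is inspected for, since no hierarchy is given explicitly in the inputs.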

Humans act as though they have a hierarchical memory organization

1. Victims of Semantic Dementia progressively lose knowledge of objects, but they lose specific details first and general properties later, suggesting hierarchical memory

[Taxonomy diagram: Thing → Living, NonLiving; Living → Animal, Plant; Animal → Bird, Fish; Bird → Canary]

2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.

* some debate remains on this.

Question: What learning mechanism could produce this emergent hierarchy?

Memory deterioration follows semantic hierarchy[McClelland & Rogers, Nature 2003]

ANN Also Models Progressive Deterioration [McClelland & Rogers, Nature 2003]

average effect of noise in inputs to hidden layers

[Annotation from the MAP training slide: regularized error = original MLE error fn. + c ∑_i w_i²]

Bias/Variance Decomposition of Error

• Consider simple regression problem f: X → Y:
  y = f(x) + ε, where f is deterministic and ε is noise drawn from N(0, σ)

What are the sources of prediction error?

Bias – Variance decomposition of error (Reading: Bishop chapter 9.1, 9.2)

Sources of error
• What if we have a perfect learner and infinite data?
  – Our learned h(x) satisfies h(x) = f(x)
  – Still have remaining, unavoidable error: σ²

Sources of error
• What if we have only n training examples?
• What is our expected error
  – taken over random training sets of size n, drawn from distribution D = p(x, y)?
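
These questions lead to the decomposition covered in Bishop 9.1–9.2; written out in LaTeX for the expected squared error at a point x, taken over training sets D of size n and the noise ε:

E_{D,\varepsilon}\big[ (y - h_D(x))^2 \big]
  \;=\; \underbrace{\sigma^2}_{\text{unavoidable noise}}
  \;+\; \underbrace{\big( f(x) - E_D[h_D(x)] \big)^2}_{\text{bias}^2}
  \;+\; \underbrace{E_D\big[ \big( h_D(x) - E_D[h_D(x)] \big)^2 \big]}_{\text{variance}}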
