
Classifier Evaluation

• How to estimate classifier performance.

• Learning curves

• Feature curves

• Rejects and ROC curves


Learning Curve

[Figure: learning curves. The true classification error ε decreases with the size of the training set; a Bayes-consistent classifier approaches the Bayes error ε*, while a sub-optimal classifier levels off above it. PRTools: cleval, testc]


The Apparent Classification Error

[Figure: the true error ε and the apparent error εA of the training set as a function of the training set size; the gap between them is the bias.]

The apparent error (or resubstitution error) of the training set is positively biased (optimistic).

An independent test set is needed!


Error Estimation by Test Set

[Diagram: the design set is split into a training set, used to train the classifier, and a test set, used for testing it; this yields the classification error estimate.]

Another training set → another classifier. Another test set → another error estimate.

Standard deviation of the test-set error estimate for a test set of N objects and true error ε (the binomial standard deviation sqrt(ε(1−ε)/N)):

              ε = 0.01   ε = 0.03   ε = 0.1
  N = 10        0.031      0.054      0.095
  N = 100       0.010      0.017      0.030
  N = 1000      0.003      0.005      0.009

gendat, testc
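As a sanity check of the table above, these standard deviations can be reproduced with a few lines of Python. This is only an illustrative sketch (not the PRTools gendat/testc experiment of the slide), assuming the number of test errors is binomially distributed:

import numpy as np

# Standard deviation of the error estimate from a test set of N objects,
# assuming the number of errors is binomial with probability eps per object.
def error_std(eps, N):
    return np.sqrt(eps * (1.0 - eps) / N)

for N in (10, 100, 1000):
    print(N, [round(error_std(eps, N), 3) for eps in (0.01, 0.03, 0.1)])
# N=10   -> [0.031, 0.054, 0.095]
# N=100  -> [0.010, 0.017, 0.030]
# N=1000 -> [0.003, 0.005, 0.009]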


Training Set Size ↔ Test Set Size

• Training set should be large for good classifiers.

• Test set should be large for a reliable, unbiased error estimate.

• In practice often just a single design set is given.

Same set for training and testing may give a good classifier, but will definitely yield an optimistically biased error estimate.

A small, independent test set yields an unbiased, but unreliable (large variance) error estimate for a well-trained classifier.

A large, independent test set yields an unbiased and reliable (small variance) error estimate for a badly trained classifier.

A 50-50 split is an often-used standard solution for the training and test set sizes. Is it really good? There are better alternatives.

gendat
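A minimal Python/scikit-learn sketch of such a hold-out evaluation (the slide itself refers to PRTools' gendat); the data set, the 50-50 split and the linear discriminant are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
# Split the design set 50-50 into a training set and an independent test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
# Unbiased error estimate; its variance depends on the test set size.
test_error = 1.0 - clf.score(X_te, y_te)
print(test_error)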


Cross-validation

[Diagram: the design set is split into n parts; classifier i is trained on the other n − 1 parts, tested on the remaining part, and the procedure is rotated n times to obtain the classification error estimate.]

Size of the test set: 1/n of the design set.

Size of the training set: (n − 1)/n of the design set.

Train and test n times. Average the errors. (Good choice: n = 10)

All objects are tested once → the most reliable test result that is possible.

Final classifier: trained on all objects → the best possible classifier.

Error estimate is slightly pessimistically biased.

crossval
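A Python/scikit-learn sketch of n-fold cross-validation with n = 10 (the slide uses PRTools' crossval); the data set and classifier are again illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
# 10-fold cross-validation: train on 9/10 of the design set, test on the
# remaining 1/10, rotate 10 times and average the error estimates.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10)
cv_error = 1.0 - scores.mean()
print(cv_error)
# The final classifier is trained on all objects.
final_clf = LinearDiscriminantAnalysis().fit(X, y)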


Leave-one-out Procedure

[Diagram: classifier i is trained on all objects but one, the single left-out object is tested, and the procedure is rotated n times to obtain the classification error estimate.]

Cross-validation in which n is the total number of objects. One object is tested at a time.

n classifiers have to be computed. In general infeasible for large n.

Doable for k-NN classifier (needs no training).

crossval

testk
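A Python/scikit-learn sketch of leave-one-out estimation with a k-NN classifier (the slide refers to PRTools' crossval and testk); the 1-NN choice and the data set are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
# Leave-one-out: n folds for n objects, one object tested at a time.
# Cheap for k-NN, since "training" only stores the data.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
loo_error = 1.0 - scores.mean()
print(loo_error)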


Expected Learning Curves by Estimated Errors

[Figure: expected learning curves. The true error ε, the apparent error εA of the training set, the cross-validation error estimate (rotation method) and the leave-one-out (n − 1) error estimate, all as a function of the training set size. PRTools: cleval]


Averaged Learning Curve

a = gendath([200 200]);
e = cleval(a,ldc,[2,3,5,7,10,15,20],500);
plote(e);

For obtaining 'theoretically expected' curves, many repetitions are needed.

cleval
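A rough Python/scikit-learn analogue of the averaged learning-curve experiment above; the data generator, classifier, training sizes and number of repetitions are illustrative stand-ins for the PRTools cleval call shown on the slide:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
sizes = [2, 3, 5, 7, 10, 15, 20]   # training objects per class
reps = 100                         # repetitions to average over

rng = np.random.default_rng(0)
for m in sizes:
    errors = []
    for _ in range(reps):
        # Draw m training objects per class, test on the remaining objects.
        tr = np.concatenate([rng.choice(np.where(y == c)[0], m, replace=False)
                             for c in np.unique(y)])
        te = np.setdiff1d(np.arange(len(y)), tr)
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        errors.append(1.0 - clf.score(X[te], y[te]))
    print(m, round(float(np.mean(errors)), 3))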


Repeated Learning Curves

a = gendath([200 200]);
for j = 1:10
  e = cleval(a,ldc,[2,3,5,7,10,15,20],1);
  hold on; plote(e);
end

Small sample sizes have a very large variability.

cleval

plote


Learning Curves for Different Classifier Complexity

[Figure: learning curves for classifiers of increasing complexity; classification error versus training set size, all approaching the Bayes error.]

More complex classifiers are better in case of large training sets and worse in case of small training sets.

cleval


Peaking Phenomenon, Overtraining, Curse of Dimensionality, Rao's Paradox

[Figure: classification error versus feature set size (dimensionality) / classifier complexity, for several training set sizes.]

clevalf


Example Overtraining, Polynomial Classifier

[Figure: six decision boundaries (panels 1-6) of a polynomial classifier of increasing complexity, illustrating overtraining. PRTools: svc([],'p',1)]


Example Overtraining (2)

[Figure: classification error as a function of the classifier complexity for the polynomial classifier. PRTools: svc([],'p',1), testc, gendat]
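A hedged Python/scikit-learn sketch of this kind of overtraining experiment: apparent (training set) error versus independent test error for a polynomial-kernel SVM of increasing degree. The data set, training size and degrees are illustrative, not the exact svc([],'p',1) experiment of the slides:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
# Deliberately small training set; the rest is an independent test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=20, random_state=0)

for degree in (1, 2, 3, 4, 5, 6):
    clf = SVC(kernel='poly', degree=degree, coef0=1.0, C=10.0).fit(X_tr, y_tr)
    apparent = 1.0 - clf.score(X_tr, y_tr)   # optimistically biased
    test = 1.0 - clf.score(X_te, y_te)       # unbiased estimate
    print(degree, round(apparent, 2), round(test, 2))
# Typically the apparent error keeps decreasing with the degree while the
# test error eventually increases again: overtraining.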


Example Overtraining (3)

Run the smoothing parameter in the Parzen classifier from small to large.

Example Overtraining (4)

[Figure: classification error as a function of the complexity, here controlled by the Parzen smoothing parameter h. PRTools: parzenc([],h)]


Overtraining → Increasing Bias

[Figure: the true error and the apparent error as a function of the feature set size (dimensionality) / classifier complexity; the bias between them grows with increasing complexity.]


Example Curse of Dimensionality

Fisher classifier for various feature rankings


Confusion Matrix (1)

[Figure: a confusion matrix relates the real labels (rows) to the obtained labels (columns).]

confmat


Confusion Matrix (2)

confmat

testc

                 classified to
objects from      A    B    C
  class A         8    2    0     0.20 error in class A
  class B         6   23    1     0.23 error in class B
  class C         4    1   15     0.25 error in class C

                                  0.228 averaged error
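A short Python/scikit-learn sketch of computing such a confusion matrix with per-class errors and the class-averaged error; the label vectors below are made-up examples, not the data behind the table above:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and estimated labels for three classes A, B and C.
y_true = np.array(['A'] * 10 + ['B'] * 30 + ['C'] * 20)
y_pred = y_true.copy()
y_pred[:2] = 'B'                      # e.g. two class-A objects classified as B
y_pred[35:38] = 'C'                   # and three class-B objects classified as C

cm = confusion_matrix(y_true, y_pred, labels=['A', 'B', 'C'])
print(cm)                             # rows: objects from; columns: classified to
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
print(per_class_error)                # error in class A, B and C
print(per_class_error.mean())         # class-averaged error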


Confusion Matrix (3)

                 classified to
objects from      A    B    C
  class A         8    2    0     0.20
  class B         6   24    0     0.20
  class C         0    0   20     0.00
                                  0.133 averaged error

                 classified to
objects from      A    B    C
  class A         1    9    0     0.90
  class B         0   30    0     0.00
  class C         0    0   20     0.00
                                  0.30 averaged error

                 classified to
objects from      A    B    C
  class A         7    2    1     0.30
  class B         5   21    4     0.30
  class C         3    2   15     0.30
                                  0.30 averaged error

Classification details are only observable in the confusion matrix!

Conclusions on Error Estimations

• Larger training sets yield better classifiers.

• Independent test sets are needed for obtaining unbiased error estimates.

• Larger test sets yield more accurate error estimates.

• Leave-one-out cross-validation seems to be an optimal compromise, but might be computationally infeasible.

• 10-fold cross-validation is a good practice.

• More complex classifiers need larger training sets to avoid overtraining.

• This holds in particular for larger feature sizes, due to the curse of dimensionality.

• For too small training sets, simpler classifiers or smaller feature sets are needed.

• Confusion matrices allow a detailed look at the per-class classification.


Reject and ROC Analysis

• Reject Types
• Reject Curves
• Performance Measures
• Varying Costs and Priors
• ROC Analysis

[Figure: error versus reject curve; the error drops from ε0 at reject rate r = 0 to εr at reject rate r.]


Outlier Reject

Reject objects for which probability density is low:


Note: in these areas the posterior probabilities might be high!

rejectc


Ambiguity Reject

Reject objects for which the classification is unsure: about equal posterior probabilities.

rejectc
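A minimal Python sketch of such an ambiguity reject (the slides use PRTools' rejectc): objects whose maximum posterior probability falls below a threshold are left unclassified. The classifier, data and threshold value are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

posteriors = clf.predict_proba(X_te)
confidence = posteriors.max(axis=1)
threshold = 0.6                         # reject when both posteriors are near 0.5
accepted = confidence >= threshold

reject_rate = 1.0 - accepted.mean()
error_after_reject = np.mean(clf.predict(X_te)[accepted] != y_te[accepted])
print(reject_rate, error_after_reject)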


Reject Curve

The classification error ε0 can be reduced to εr by rejecting a fraction r of the objects.

[Figure: a 2D scatter plot with a reject band of width 2d around the decision boundary, and the corresponding error-reject curve: the error drops from ε0 to εr as a function of the rejected fraction r. PRTools: reject, plote]


Rejecting Classifier

[Figure: the original classifier and the rejecting classifier; the rejecting classifier leaves a fraction r of the objects near the decision boundary unclassified.]


Reject Fraction

For reasonably well-trained classifiers, r > 2·(ε(r=0) − ε(r)): to decrease the error by 1%, more than 2% has to be rejected.

[Figure: the densities of class A and class B along x; the rejected fraction r around the decision boundary is larger than the resulting error decrease.]


How much to reject?

Compare the cost of a rejected object, cr, with the cost of a classification error, cε: the total cost of operating at reject rate r and error ε is c = cr·r + cε·ε. For a given total cost c this is a linear function in the (r, ε) space. Shift it until a possible operating point is reached.

[Figure: error ε versus reject rate r; lines of constant total cost c are shifted until they touch the reject curve, which defines the operating point d* of the reject threshold d. PRTools: reject, plote]
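A hedged Python sketch of this trade-off: sweep the reject threshold, record the (r, ε) operating points, and pick the one that minimizes the total cost cr·r + cε·ε. The cost values and the classifier are illustrative assumptions, not values from the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

conf = clf.predict_proba(X_te).max(axis=1)
pred = clf.predict(X_te)
c_reject, c_error = 0.2, 1.0                  # assumed costs per object

best = None
for d in np.linspace(0.5, 1.0, 51):           # sweep the reject threshold
    accepted = conf >= d
    r = 1.0 - accepted.mean()                 # rejected fraction
    eps = np.mean((pred != y_te) & accepted)  # error, counted over all objects
    cost = c_reject * r + c_error * eps       # total cost per object
    if best is None or cost < best[0]:
        best = (cost, d, r, eps)
print(best)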


Error / Performance Measures

[Figure: the density of class A along x, split by the classifier S(x) = 0 into a part ηA assigned to A and a part εA assigned to B (class B shown for reference).]

Objects of class A, with prior p(A), are classified by S(x) = 0 into a part ηA, assigned to A, and a part εA, assigned to B.

p(A) = ηA + εA


Error / Performance Measures

[Figure: likewise, the density of class B along x, split into a part ηB assigned to B and a part εB assigned to A.]

Objects of class B, with prior p(B), are classified by S(x) = 0 into a part ηB, assigned to B, and a part εB, assigned to A.

p(B) = ηB + εB

Error / Performance Measures

εA : error contribution of class A
εB : error contribution of class B

ε : classification error
η : performance

ε + η = 1
ε = εA + εB
εB = εB1 + εB2
ηA = ηA1 + ηA2
ηB = ηB1 + ηB2
η = ηA + ηB
p(A) + p(B) = 1
p(A) = ηA + εA
p(B) = ηB + εB

εA/p(A) : error in class A
εB/p(B) : error in class B

ηA/p(A) : sensitivity (recall) of class A (what goes correct in class A)

ηA/(ηA + εB) : precision of class A (what belongs to A among what is classified as A)

[Figure: the two class densities along x and the decision regions, with the areas ηA1, ηA2, ηB1, ηB2, εA, εB1 and εB2 indicating the correctly and incorrectly classified parts of each class.]


Error / Performance Measures

Given a trained classifier and a test set:

                  classified as
                    T      N
true      T        TP     FN
labels    N        FP     TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Error = (FN + FP) / (TP + FN + FP + TN)
False Discovery Rate (FDR) = FP / (TP + FP)

[Figure: the densities of the target (T) and non-target (N) classes along x with the decision threshold.]


Error / Performance Measures

• Error: probability of erroneous classifications

• Performance: 1 – error

• Sensitivity of a target class (e.g. diseased patients): performance for objects from that target class.

• Specificity: performance for all objects outside the target class.

• Precision of a target class: fraction of correct objects among all objects assigned to that class.

• Recall: fraction of correctly classified objects. This is identical to the performance. It is also identical to the sensitivity when related to a particular class.
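A tiny Python helper illustrating these definitions for a two-class (target / non-target) problem; the TP/FN/FP/TN counts in the example call are invented:

def measures(TP, FN, FP, TN):
    total = TP + FN + FP + TN
    return {
        'error':       (FN + FP) / total,
        'performance': (TP + TN) / total,
        'sensitivity': TP / (TP + FN),   # recall of the target class
        'specificity': TN / (TN + FP),   # performance outside the target class
        'precision':   TP / (TP + FP),   # correct among all assigned to the target
        'FDR':         FP / (TP + FP),   # false discovery rate
    }

print(measures(TP=80, FN=20, FP=10, TN=90))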


ROC Curve

[Figure: the two class densities along x with decision threshold d; shifting d trades the error εA against the error εB.]

For every classifier S(x) − d = 0 two errors, εA and εB, are made. The ROC curve shows them as a function of d.

roc, plote


ROC Curve

The ROC is independent of the class prior probabilities. A change of the priors is analogous to a shift of the decision boundary.

[Figure: scatter plots with the decision boundaries for priors = [0.5 0.5] and priors = [0.2 0.8]; changing the priors shifts the decision boundary.]


ROC Analysis

ROC: Receiver-Operator Characteristic (from communication theory)

[Figure: two ROC conventions. Medical diagnostics / database retrieval: performance of the target class versus the error in the non-targets. Two-class pattern recognition: error in class A versus error in class B. PRTools: roc, plote]
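A Python/scikit-learn sketch of computing an ROC curve (the slides use PRTools' roc and plote). scikit-learn sweeps the threshold d over the classifier output and reports false and true positive rates, which map onto the error conventions above; the data and classifier are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, auc

X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=3)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

scores = clf.decision_function(X_te)      # S(x); the threshold d is swept
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(auc(fpr, tpr))
# With class 1 as the target class: error in non-targets = fpr,
# error in the target class = 1 - tpr.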


When are ROC Curves Useful? (1)

[Figure: ROC curve plotted as performance versus cost.]

Trade-off between cost and performance.

What is the performance for a given cost?

What is the cost of a preset performance?


When are ROC Curves Useful? (2)

[Figure: ROC curve in the (εA, εB) plane with several operating points marked.]

Study of the effect of changing priors

Operating points: the value of d in S(x) − d = 0.


When are ROC Curves Useful? (3)

Compare the behavior of different classifiers.

[Figure: ROC curves of the classifiers S1(x), S2(x) and S3(x) in the (εA, εB) plane.]

Combine classifiers: any 'virtual' classifier between S1(x) and S2(x) in the (εA, εB) space can be realized by using at random α times S1(x) and (1 − α) times S2(x):

εA = α·εA,1 + (1 − α)·εA,2
εB = α·εB,1 + (1 − α)·εB,2

where εA,i and εB,i denote the errors of classifier Si(x).
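A small numpy sketch of this random-mixing argument: interpolating between two operating points by applying S1(x) with probability α and S2(x) otherwise; the error values are invented for illustration:

import numpy as np

epsA1, epsB1 = 0.05, 0.30    # operating point of S1(x)  (invented numbers)
epsA2, epsB2 = 0.25, 0.08    # operating point of S2(x)

for alpha in np.linspace(0.0, 1.0, 6):
    # Use S1 with probability alpha, S2 with probability (1 - alpha).
    epsA = alpha * epsA1 + (1.0 - alpha) * epsA2
    epsB = alpha * epsB1 + (1.0 - alpha) * epsB2
    print(round(alpha, 1), round(epsA, 3), round(epsB, 3))
# Every point on the straight line between the two operating points is realizable.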


Summary Reject and ROC

• Reject for solving ambiguity: rejecting objects close to the decision boundary → lower costs.

• Reject option for protection against outliers.

• ROC analysis for performance – cost trade-off.

• ROC analysis in case of unknown or varying priors.

• ROC analysis for comparing / combining classifiers.