ERASER: A Benchmark to Evaluate Rationalized NLP Models · illustrative of four (out of seven)...

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics pages 4443ndash4458July 5 - 10 2020 ccopy2020 Association for Computational Linguistics

4443

ERASER A Benchmark to Evaluate Rationalized NLP Models

Jay DeYoung⋆Ψ Sarthak Jain⋆Ψ Nazneen Fatema Rajani⋆Φ Eric LehmanΨCaiming XiongΦ Richard SocherΦ and Byron C WallaceΨ

⋆Equal contributionΨKhoury College of Computer Sciences Northeastern University

ΦSalesforce Research Palo Alto CA 94301

AbstractState-of-the-art models in NLP are now pre-dominantly based on deep neural networksthat are opaque in terms of how they cometo make predictions This limitation hasincreased interest in designing more inter-pretable deep models for NLP that reveal thelsquoreasoningrsquo behind model outputs But workin this direction has been conducted on dif-ferent datasets and tasks with correspondinglyunique aims and metrics this makes it difficultto track progress We propose the EvaluatingRationales And Simple English Reasoning(ERASER ) benchmark to advance researchon interpretable models in NLP This bench-mark comprises multiple datasets and tasks forwhich human annotations of ldquorationalesrdquo (sup-porting evidence) have been collected We pro-pose several metrics that aim to capture howwell the rationales provided by models alignwith human rationales and also how faithfulthese rationales are (ie the degree to whichprovided rationales influenced the correspond-ing predictions) Our hope is that releasing thisbenchmark facilitates progress on designingmore interpretable NLP systems The bench-mark code and documentation are availableat httpswwweraserbenchmarkcom

1 IntroductionInterest has recently grown in designing NLP sys-tems that can reveal why models make specificpredictions But work in this direction has beenconducted on different datasets and using differentmetrics to quantify performance this has made itdifficult to compare methods and track progressWe aim to address this issue by releasing a stan-dardized benchmark of datasets mdash repurposed andaugmented from pre-existing corpora spanning arange of NLP tasks mdash and associated metrics formeasuring different properties of rationales We re-fer to this as the Evaluating Rationales And SimpleEnglish Reasoning (ERASER ) benchmark

Commonsense Explanations (CoS-E)

Where do you find the most amount of leafs

(a) Compost pile (b) Flowers (c) Forest (d) Field (e) Ground

Movie Reviews

In this movie hellip Plots to take over the world The acting is great The soundtrack is run-of-the-mill but the action more than makes up for it

(a) Positive (b) Negative

Evidence Inference

Article Patients for this trial were recruited hellip Compared with 09 saline 120 mg of inhaled nebulized furosemide had no effect on breathlessness during exercise

(a) Sig decreased (b) No sig difference (c) Sig increased

Prompt With respect to breathlessness what is the reported difference between patients receiving placebo and those receiving furosemide

e-SNLI

H A man in an orange vest leans over a pickup truckP A man is touching a truck

(a) Entailment (b) Contradiction (c) Neutral

Figure 1 Examples of instances labels and rationalesillustrative of four (out of seven) datasets included inERASER The lsquoerasedrsquo snippets are rationales

In curating and releasing ERASER we take in-spiration from the stickiness of the GLUE (Wanget al 2019b) and SuperGLUE (Wang et al 2019a)benchmarks for evaluating progress in natural lan-guage understanding tasks which have driven rapidprogress on models for general language repre-sentation learning We believe the still somewhatnascent subfield of interpretable NLP stands to ben-efit similarly from an analogous collection of stan-dardized datasets and tasks we hope these willaid the design of standardized metrics to measuredifferent properties of lsquointerpretabilityrsquo and wepropose a set of such metrics as a starting point

Interpretability is a broad topic with many possi-ble realizations (Doshi-Velez and Kim 2017 Lip-ton 2016) In ERASER we focus specifically onrationales ie snippets that support outputs Alldatasets in ERASER include such rationales ex-plicitly marked by human annotators By definitionrationales should be sufficient to make predictions

4444

but they may not be comprehensive Therefore forsome datasets we have also collected comprehen-sive rationales (in which all evidence supportingan output has been marked) on test instances

The lsquoqualityrsquo of extracted rationales will dependon their intended use Therefore we propose aninitial set of metrics to evaluate rationales thatare meant to measure different varieties of lsquointer-pretabilityrsquo Broadly this includes measures ofagreement with human-provided rationales and as-sessments of faithfulness The latter aim to capturethe extent to which rationales provided by a modelin fact informed its predictions We believe theseprovide a reasonable start but view the problem ofdesigning metrics for evaluating rationales mdash espe-cially for measuring faithfulness mdash as a topic forfurther research that ERASER can facilitate Andwhile we will provide a lsquoleaderboardrsquo this is betterviewed as a lsquoresults boardrsquo we do not privilegeany one metric Instead ERASER permits compar-ison between models that provide rationales withrespect to different criteria of interest

We implement baseline models and report theirperformance across the corpora in ERASER Wefind that no single lsquooff-the-shelfrsquo architecture isreadily adaptable to datasets with very differentinstance lengths and associated rationale snippets(Section 3) This highlights a need for new modelsthat can consume potentially lengthy inputs andadaptively provide rationales at a task-appropriatelevel of granularity ERASER provides a resourceto develop such models

In sum we introduce the ERASER benchmark(wwweraserbenchmarkcom) a unified set of di-verse NLP datasets (these are repurposed and aug-mented from existing corpora1 including senti-ment analysis Natural Language Inference andQA tasks among others) in a standardized for-mat featuring human rationales for decisions alongwith starter code and tools baseline models andstandardized (initial) metrics for rationales

2 Related Work

Interpretability in NLP is a large fast-growingarea we do not attempt to provide a comprehensiveoverview here Instead we focus on directions par-ticularly relevant to ERASER ie prior work onmodels that provide rationales for their predictions

Learning to explain In ERASER we assume that

1We ask users of the benchmark to cite all original papersand provide a BibTeX entry for doing so on the website

rationales (marked by humans) are provided duringtraining However such direct supervision will notalways be available motivating work on methodsthat can explain (or ldquorationalizerdquo) model predic-tions using only instance-level supervision

In the context of modern neural models for textclassification one might use variants of attention(Bahdanau et al 2015) to extract rationales At-tention mechanisms learn to assign soft weights to(usually contextualized) token representations andso one can extract highly weighted tokens as ratio-nales However attention weights do not in gen-eral provide faithful explanations for predictions(Jain and Wallace 2019 Serrano and Smith 2019Wiegreffe and Pinter 2019 Zhong et al 2019Pruthi et al 2020 Brunner et al 2020 Moradiet al 2019 Vashishth et al 2019) This likelyowes to encoders entangling inputs complicatingthe interpretation of attention weights on inputsover contextualized representations of the same2

By contrast hard attention mechanisms dis-cretely extract snippets from the input to pass to theclassifier by construction providing faithful expla-nations Recent work has proposed hard attentionmechanisms as a means of providing explanationsLei et al (2016) proposed instantiating two modelswith their own parameters one to extract rationalesand one that consumes these to make a predictionThey trained these models jointly via REINFORCE(Williams 1992) style optimization

Recently Jain et al (2020) proposed a variantof this two-model setup that uses heuristic featurescores to derive pseudo-labels on tokens compris-ing rationales one model can then be used to per-form hard extraction in this way while a second(independent) model can make predictions on thebasis of these Elsewhere Chang et al (2019)introduced the notion of classwise rationales thatexplains support for different output classes usinga game theoretic framework Finally other recentwork has proposed using a differentiable binarymask over inputs which also avoids recourse toREINFORCE (Bastings et al 2019)

Post-hoc explanation Another strand of inter-pretability work considers post-hoc explanationmethods which seek to explain why a model madea specific prediction for a given input Commonly

2Interestingly Zhong et al (2019) find that attention some-times provides plausible but not faithful rationales ElsewherePruthi et al (2020) show that one can easily learn to deceivevia attention weights These findings highlight that one shouldbe mindful of the criteria one wants rationales to fulfill

4445

these take the form of token-level importancescores Gradient-based explanations are a standardexample (Sundararajan et al 2017 Smilkov et al2017) These enjoy a clear semantics (describinghow perturbing inputs locally affects outputs) butmay nonetheless exhibit counterintuitive behaviors(Feng et al 2018)

Gradients of course assume model differentia-bility Other methods do not require any modelproperties Examples include LIME (Ribeiro et al2016) and Alvarez-Melis and Jaakkola (2017)these methods approximate model behavior lo-cally by having it repeatedly make predictions overperturbed inputs and fitting a simple explainablemodel over the outputs

Acquiring rationales Aside from interpretabilityconsiderations collecting rationales from annota-tors may afford greater efficiency in terms of modelperformance realized given a fixed amount of anno-tator effort (Zaidan and Eisner 2008) In particularrecent work by McDonnell et al (2017 2016) hasobserved that at least for some tasks asking anno-tators to provide rationales justifying their catego-rizations does not impose much additional effortCombining rationale annotation with active learn-ing (Settles 2012) is another promising direction(Wallace et al 2010 Sharma et al 2015)

Learning from rationales Work on learning fromrationales marked by annotators for text classifica-tion dates back over a decade (Zaidan et al 2007)Earlier efforts proposed extending standard dis-criminative models like Support Vector Machines(SVMs) with regularization terms that penalizedparameter estimates which disagreed with providedrationales (Zaidan et al 2007 Small et al 2011)Other efforts have attempted to specify generativemodels of rationales (Zaidan and Eisner 2008)

More recent work has aimed to exploit ratio-nales in training neural text classifiers Zhang et al(2016) proposed a rationale-augmented Convolu-tional Neural Network (CNN) for text classifica-tion explicitly trained to identify sentences support-ing categorizations Strout et al (2019) showed thatproviding this model with rationales during train-ing yields predicted rationales that are preferredby humans (compared to rationales produced with-out explicit supervision) Other work has proposedlsquopipelinersquo approaches in which independent mod-els are trained to perform rationale extraction andclassification on the basis of these respectively(Lehman et al 2019 Chen et al 2019) assuming

Name Size (traindevtest) Tokens CompEvidence Inference 7958 972 959 4761

BoolQ 6363 1491 2817 3583

Movie Reviews 1600 200 200 774

FEVER 97957 6122 6111 327

MultiRC 24029 3214 4848 303

CoS-E 8733 1092 1092 28

e-SNLI 911938 16449 16429 16

Table 1 Overview of datasets in the ERASER bench-mark Tokens is the average number of tokens in eachdocument Comprehensive rationales mean that all sup-porting evidence is markeddenotes cases where thisis (more or less) true by default are datasets forwhich we have collected comprehensive rationales foreither a subset or all of the test datasets respectivelyAdditional information can be found in Appendix A

explicit training data is available for the formerRajani et al (2019) fine-tuned a Transformer-

based language model (Radford et al 2018) onfree-text rationales provided by humans with anobjective of generating open-ended explanations toimprove performance on downstream tasksEvaluating rationales Work on evaluating ratio-nales has often compared these to human judg-ments (Strout et al 2019 Doshi-Velez and Kim2017) or elicited other human evaluations of ex-planations (Ribeiro et al 2016 Lundberg and Lee2017 Nguyen 2018) There has also been work onvisual evaluations of saliency maps (Li et al 2016Ding et al 2017 Sundararajan et al 2017)

Measuring agreement between extracted andhuman rationales (or collecting subjective assess-ments of them) assesses the plausibility of ratio-nales but such approaches do not establish whetherthe model actually relied on these particular ratio-nales to make a prediction We refer to rationalesthat correspond to the inputs most relied upon tocome to a disposition as faithful

Most automatic evaluations of faithfulness mea-sure the impact of perturbing or erasing words ortokens identified as important on model output (Ar-ras et al 2017 Montavon et al 2017 Serrano andSmith 2019 Samek et al 2016 Jain and Wallace2019) We build upon these methods in Section4 Finally we note that a recent article urges thecommunity to evaluate faithfulness on a continuousscale of acceptability rather than viewing this as abinary proposition (Jacovi and Goldberg 2020)

3 Datasets in ERASERFor all datasets in ERASER we distribute both ref-erence labels and rationales marked by humansas supporting these in a standardized format We

4446

delineate train validation and test splits for allcorpora (see Appendix A for processing details)We ensure that these splits comprise disjoint setsof source documents to avoid contamination3 Wehave made the decision to distribute the test setspublicly4 in part because we do not view the lsquocor-rectrsquo metrics to use as settled We plan to acquireadditional human annotations on held-out portionsof some of the included corpora so as to offer hid-den test set evaluation opportunities in the future

Evidence inference (Lehman et al 2019) Adataset of full-text articles describing randomizedcontrolled trials (RCTs) The task is to inferwhether a given intervention is reported to eithersignificantly increase significantly decrease orhave no significant effect on a specified outcome ascompared to a comparator of interest Rationaleshave been marked as supporting these inferencesAs the original annotations are not necessarily ex-haustive we collected exhaustive rationale annota-tions on a subset of the validation and test data5

BoolQ (Clark et al 2019) This corpus consistsof passages selected from Wikipedia and yesnoquestions generated from these passages As theoriginal Wikipedia article versions used were notmaintained we have made a best-effort attempt torecover these and then find within them the pas-sages answering the corresponding questions Forpublic release we acquired comprehensive annota-tions on a subset of documents in our test set5

Movie Reviews (Zaidan and Eisner 2008) In-cludes positivenegative sentiment labels on moviereviews Original rationale annotations were notnecessarily comprehensive we thus collected com-prehensive rationales on the final two folds of theoriginal dataset (Pang and Lee 2004)5 In contrastto most other datasets the rationale annotationshere are span level as opposed to sentence level

FEVER (Thorne et al 2018) Short for Fact Ex-traction and VERification entails verifying claimsfrom textual sources Specifically each claim is tobe classified as supported refuted or not enoughinformation with reference to a collection of source

3Except for BoolQ wherein source documents in the orig-inal train and validation set were not disjoint and we preservethis structure in our dataset Questions of course are disjoint

4Consequently for datasets that have been part of previ-ous benchmarks with other aims (namely GLUEsuperGLUE)but which we have re-purposed for work on rationales inERASER eg BoolQ (Clark et al 2019) we have carved outfor release test sets from the original validation sets

5Annotation details are in Appendix B

texts We take a subset of this dataset includingonly supported and refuted claims

MultiRC (Khashabi et al 2018) A reading com-prehension dataset composed of questions withmultiple correct answers that by construction de-pend on information from multiple sentences Hereeach rationale is associated with a question whileanswers are independent of one another We con-vert each rationalequestionanswer triplet into aninstance within our dataset Each answer candidatethen has a label of True or False

Commonsense Explanations (CoS-E) (Rajaniet al 2019) This corpus comprises multiple-choice questions and answers from (Talmor et al2019) along with supporting rationales The ratio-nales in this case come in the form both of high-lighted (extracted) supporting snippets and free-text open-ended descriptions of reasoning Givenour focus on extractive rationales ERASER in-cludes only the former for now Following Talmoret al (2019) we repartition the training and valida-tion sets to provide a canonical test split

e-SNLI (Camburu et al 2018) This dataset aug-ments the SNLI corpus (Bowman et al 2015) withrationales marked in the premise andor hypothesis(and natural language explanations which we donot use) For entailment pairs annotators were re-quired to highlight at least one word in the premiseFor contradiction pairs annotators had to highlightat least one word in both the premise and the hy-pothesis for neutral pairs they were only allowedto highlight words in the hypothesis

Human Agreement We report human agreementover extracted rationales for multiple annotatorsand documents in Table 2 All datasets have a highCohen κ (Cohen 1960) with substantial or betteragreement

4 Metrics

In ERASER models are evaluated both for theirpredictive performance and with respect to the ra-tionales that they extract For the former we relyon the established metrics for the respective tasksHere we describe the metrics we propose to eval-uate the quality of extracted rationales We donot claim that these are necessarily the best met-rics for evaluating rationales however Indeed wehope the release of ERASER will spur additionalresearch into how best to measure the quality ofmodel explanations in the context of NLP

4447

Dataset Cohen κ F1 P R Annotatorsdoc DocumentsEvidence Inference - - - - - -BoolQ 0618 plusmn 0194 0617 plusmn 0227 0647 plusmn 0260 0726 plusmn 0217 3 199Movie Reviews 0712 plusmn 0135 0799 plusmn 0138 0693 plusmn 0153 0989 plusmn 0102 2 96FEVER 0854 plusmn 0196 0871 plusmn 0197 0931 plusmn 0205 0855 plusmn 0198 2 24MultiRC 0728 plusmn 0268 0749 plusmn 0265 0695 plusmn 0284 0910 plusmn 0259 2 99CoS-E 0619 plusmn 0308 0654 plusmn 0317 0626 plusmn 0319 0792 plusmn 0371 2 100e-SNLI 0743 plusmn 0162 0799 plusmn 0130 0812 plusmn 0154 0853 plusmn 0124 3 9807

Table 2 Human agreement with respect to rationales For Movie Reviews and BoolQ we calculate the meanagreement of individual annotators with the majority vote per token over the two-three annotators we hired viaUpwork and Amazon Turk respectively The e-SNLI dataset already comprised three annotators for this wecalculate mean agreement between individuals and the majority For CoS-E MultiRC and FEVER members ofour team annotated a subset to use a comparison to the (majority of where appropriate) existing rationales Wecollected comprehensive rationales for Evidence Inference from Medical Doctors as they have a high amount ofexpertise we would expect agreement to be high but have not collected redundant comprehensive annotations

41 Agreement with human rationales

The simplest means of evaluating extracted ratio-nales is to measure how well they agree with thosemarked by humans We consider two classes ofmetrics appropriate for models that perform dis-crete and lsquosoftrsquo selection respectively

For the discrete case measuring exact matchesbetween predicted and reference rationales is likelytoo harsh6 We thus consider more relaxed mea-sures These include Intersection-Over-Union(IOU) borrowed from computer vision (Evering-ham et al 2010) which permits credit assignmentfor partial matches We define IOU on a token levelfor two spans it is the size of the overlap of thetokens they cover divided by the size of their unionWe count a prediction as a match if it overlaps withany of the ground truth rationales by more thansome threshold (here 05) We use these partialmatches to calculate an F1 score We also measuretoken-level precision and recall and use these toderive token-level F1 scores

Metrics for continuous or soft token scoringmodels consider token rankings rewarding modelsfor assigning higher scores to marked tokens Inparticular we take the Area Under the Precision-Recall curve (AUPRC) constructed by sweeping athreshold over token scores We define additionalmetrics for soft scoring models below

In general the rationales we have for tasks aresufficient to make judgments but not necessarilycomprehensive However for some datasets wehave explicitly collected comprehensive rationalesfor at least a subset of the test set Therefore onthese datasets recall evaluates comprehensivenessdirectly (it does so only noisily on other datasets)

6Consider that an extra token destroys the match but notusually the meaning

We highlight which corpora contain comprehensiverationales in the test set in Table 3

42 Measuring faithfulness

As discussed above a model may provide ratio-nales that are plausible (agreeable to humans) butthat it did not rely on for its output In many set-tings one may want rationales that actually explainmodel predictions ie rationales extracted for aninstance in this case ought to have meaningfully in-fluenced its prediction for the same We call thesefaithful rationales How best to measure rationalefaithfulness is an open question In this first versionof ERASER we propose simple metrics motivatedby prior work (Zaidan et al 2007 Yu et al 2019)In particular following Yu et al (2019) we definemetrics intended to measure the comprehensiveness(were all features needed to make a prediction se-lected) and sufficiency (do the extracted rationalescontain enough signal to come to a disposition) ofrationales respectively

Comprehensiveness To calculate rationalecomprehensiveness we create contrast exam-ples (Zaidan et al 2007) We construct a con-trast example for xi xi which is xi with the pre-dicted rationales ri removed Assuming a classifi-cation setting let m(xi)j be the original predictionprovided by a model m for the predicted class jThen we consider the predicted probability fromthe model for the same class once the supportingrationales are stripped Intuitively the model oughtto be less confident in its prediction once rationalesare removed from xi We can measure this as

comprehensiveness =m(xi)j minusm(xiri)j (1)

A high score here implies that the rationales wereindeed influential in the prediction while a lowscore suggests that they were not A negative value

4448

Where do you find the most amount of leafs Where do you find the most amount of leafs

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip

p(Forest|xi)ltlatexit sha1_base64=ougnmb+NiPKI2tlw1oMyZAtvk8c=gtAAACBXicbVDJSgNBEO2JW4xb1KMeBoMQL2EmCnoMCuIxglkgM4SeTiVp0rPQXSMJ41y8+CtePCji1Xw5tYWQ6a+KDg8V4VVfW8SHCFlvVtZJaWV1bXsuu5jc2t7Z387l5dhbFkUGOhCGXTowoED6CGHAU0IwnU9wQ0vMHV2Gcg1Q8DO5wFIHr017Au5xR1FI7f+j0KSZRWkwchCEm16EEhWn6MGzzk3a+YJWsCcxFYs9IgcxQbeenE7IYh8CZIIq1bKtCN2ESuRMQJpzYgURZQPag5amAfVBucnki9Q81krH7IZSV4DmRP09kVBfqZHv6U6fYlNe2PxP68VYfCTXgQxQgBmy7qxsLE0BxHYna4BIZipAllkutbTdankjLUweV0CPb8y4ukXi7Zp6Xy7VmhcjmLI0sOyBEpEpuckwq5IVVSI4w8kmfySt6MJ+PFeDc+pq0ZYzazT7A+PwBuOKZWQ==ltlatexitgt


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency

xiltlatexit sha1_base64=5TrbQcGQm9R0QLC9WtTvOnnHQdk=gtAAAB8nicbVBNS8NAEN3Ur1qqh69LBbBU0mqoMeiF48V7AekoWw2m3bpZjfsTsQS+jO8eFDEq7GmGbZuDtj4YeLw3w8y8MBXcgOt+O6W19Y3NrfJ2ZWd3b+genjUMSrTlLWpEkr3QmKY4JK1gYNgvVQzkoSCdcPx7czvPjJtuJIPMElZkJCh5DGnBKzk94GLiOVP0wEfVGtu3Z0DrxKvIDVUoDWofvUjRbOESaCCGON7bgpBTjRwKti00s8MSwkdkyHzLZUkYSbI5ydP8ZlVIhwrbUsCnquJ3KSGDNJQtuZEBiZZW8muf5GcTXQc5lmgGTdLEozgQGhWf44hrRkFMLCFUc3srpiOiCQWbUsWG4C2vEo6jbp3UWcX9aaN0UcZXSCTtE58tAVaqI71EJtRJFCz+gVvTngvDjvzseiteQUM8foD5zPH8XfkZI=ltlatexitgt

xiltlatexit sha1_base64=HYbfjgGaRCmI8j+M0Errr+OmJEA=gtAAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04rGiYA2lM120i7dbMLuRiyhP8GLB0W8+ou8+Wctjlo64OBx3szzMwLEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6mfqtR1SaxLBjBP0IzqQPOSMGivdPV4r1R2K+4MZJl4OSlDjnqv9NXtxyyNUBomqNYdz02Mn1FlOBM4KXZTjQllIzrAjqWSRqj9bHbqhJxapUCWNmShszU3xMZjbQeR4HtjKgZ6kVvKv7ndVITXvkZl0lqULL5ojAVxMRk+jfpc4XMiLEllClubyVsSBVlxqZTtCF4iy8vk2a14p1XqncX5dp1HkcBjuEEzsCDS6jBLdShAQwG8Ayv8OYI58V5dz7mrStOPnMEf+B8gBh3o3cltlatexitgt

riltlatexit sha1_base64=6zdNEORaRuNf4q408y762r5g5fM=gtAAAB6nicbVBNS8NAEJ3Ur1qqh69LBbBU0mqoMeiF48V7Qe0oWy2k3bpZhN2N0IJQlePCji1VkzXjts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4dua3n1BpHstHM0nQj+hQ8pAzaqz0oPq8X664VXcOskq8nFQgR6NfuoNYpZGKA0TVOuu5ybGz6gynAmclnqpxoSyMR1i11JJI9R+Nj91Ss6sMiBhrGxJQ+bq74mMRlpPosB2RtSM9LI3EzuqkJr2MyyQ1KNliUZgKYmIy+5sMuEJmxMQSyhS3txI2oooyY9Mp2RC85ZdXSatW9S6qtfvLSv0mj6MIJ3AK5+DBFdThDhrQBAZDeIZXeHOE8+K8Ox+L1oKTzxzDHzifP1i6jdY=ltlatexitgt

Figure 2 Illustration of faithfulness scoring metrics comprehensiveness and sufficiency on the CommonsenseExplanations (CoS-E) dataset For the former erasing the tokens comprising the provided rationale (xi) ought todecrease model confidence in the output lsquoForestrsquo For the latter the model should be able to come to a similardisposition regarding lsquoForestrsquo using only the rationales ri

here means that the model became more confidentin its prediction after the rationales were removedthis would seem counter-intuitive if the rationaleswere indeed the reason for its prediction

Sufficiency This captures the degree to whichthe snippets within the extracted rationales are ade-quate for a model to make a prediction

sufficiency =m(xi)j minusm(ri)j (2)

These metrics are illustrated in Figure 2As defined the above measures have assumed

discrete rationales ri We would also like to eval-uate the faithfulness of continuous importancescores assigned to tokens by models Here weadopt a simple approach for this We convert softscores over features si provided by a model intodiscrete rationales ri by taking the topminuskd valueswhere kd is a threshold for dataset d We set kd tothe average rationale length provided by humansfor dataset d (see Table 4) Intuitively this saysHow much does the model prediction change if weremove a number of tokens equal to what humansuse (on average for this dataset) in order of theimportance scores assigned to these by the modelOnce we have discretized the soft scores into ra-tionales in this way we compute the faithfulnessscores as per Equations 1 and 2

This approach is conceptually simple It is alsocomputationally cheap to evaluate in contrast tomeasures that require per-token measurements egimportance score correlations with lsquoleave-one-outrsquoscores (Jain and Wallace 2019) or counting howmany lsquoimportantrsquo tokens need to be erased before

a prediction flips (Serrano and Smith 2019) How-ever the necessity of discretizing continuous scoresforces us to pick a particular threshold k

We can also consider the behavior of these mea-sures as a function of k inspired by the measure-ments proposed in Samek et al (2016) in the con-text of evaluating saliency maps for image classi-fication They suggested ranking pixel regions byimportance and then measuring the change in out-put as they are removed in rank order Our datasetscomprise documents and rationales with quite dif-ferent lengths to make this measure comparableacross datasets we construct bins designating thenumber of tokens to be deleted Denoting the to-kens up to and including bin k for instance i by rikwe define an aggregate comprehensiveness mea-sure

1

∣B∣ + 1(∣B∣

sumk=0

m(xi)j minusm(xirik)j) (3)

This is defined for sufficiency analogously Herewe group tokens into k = 5 bins by grouping theminto the top 1 5 10 20 and 50 of to-kens with respect to the corresponding importancescore We refer to these metrics as ldquoArea Over thePerturbation Curverdquo (AOPC)7

These AOPC sufficiency and comprehensivenessmeasures score a particular token ordering undera model As a point of reference we also reportthese when random scores are assigned to tokens

7Our AOPC metrics are similar in concept to ROAR(Hooker et al 2019) except that we re-use an existing modelas opposed to retraining for each fraction

4449

5 Baseline ModelsOur focus in this work is primarily on the ERASERbenchmark itself rather than on any particularmodel(s) But to establish a starting point for futurework we evaluate several baseline models acrossthe corpora in ERASER8 We broadly classify theseinto models that assign lsquosoftrsquo (continuous) scoresto tokens and those that perform a lsquohardrsquo (discrete)selection over inputs We additionally considermodels specifically designed to select individualtokens (and very short sequences) as rationales ascompared to longer snippets All of our implemen-tations are in PyTorch (Paszke et al 2019) and areavailable in the ERASER repository9

All datasets in ERASER comprise inputs ratio-nales and labels But they differ considerably indocument and rationale lengths (Table A) This mo-tivated use of different models for datasets appro-priate to their sizes and rationale granularities Wehope that this benchmark motivates design of mod-els that provide rationales that can flexibly adapt tovarying input lengths and expected rationale gran-ularities Indeed only with such models can weperform comparisons across all datasets

51 Hard selection

Models that perform hard selection may be viewedas comprising two independent modules an en-coder which is responsible for extracting snippetsof inputs and a decoder that makes a predictionbased only on the text provided by the encoder Weconsider two variants of such models

Lei et al (2016) In this model an encoder in-duces a binary mask over inputs x z The decoderaccepts the tokens in x unmasked by z to make aprediction y These modules are trained jointly viaREINFORCE (Williams 1992) style estimationminimizing the loss over expected binary vectorsz yielded from the encoder One of the advantagesof this approach is that it need not have access tomarked rationales it can learn to rationalize on thebasis of instance labels alone However given thatwe do have rationales in the training data we exper-iment with a variant in which we train the encoderexplicitly using rationale-level annotations

In our implementation of Lei et al (2016) wedrop in two independent BERT (Devlin et al 2019)or GloVe (Pennington et al 2014) base modules

8This is not intended to be comprehensive9httpsgithubcomjayded

eraserbenchmark

with bidirectional LSTMs (Hochreiter and Schmid-huber 1997) on top to induce contextualized rep-resentations of tokens for the encoder and decoderrespectively The encoder generates a scalar (de-noting the probability of selecting that token) foreach LSTM hidden state using a feedfoward layerand sigmoid In the variant using human rationalesduring training we minimize cross entropy lossover rationale predictions The final loss is thena composite of classification loss regularizers onrationales (Lei et al 2016) and loss over rationalepredictions when available

Pipeline models These are simple models inwhich we first train the encoder to extract ratio-nales and then train the decoder to perform pre-diction using only rationales No parameters areshared between the two models

Here we first consider a simple pipeline that firstsegments inputs into sentences It passes theseone at a time through a Gated Recurrent Unit(GRU) (Cho et al 2014) to yield hidden represen-tations that we compose via an attentive decodinglayer (Bahdanau et al 2015) This aggregate rep-resentation is then passed to a classification modulewhich predicts whether the corresponding sentenceis a rationale (or not) A second model using effec-tively the same architecture but parameterized inde-pendently consumes the outputs (rationales) fromthe first to make predictions This simple model isdescribed at length in prior work (Lehman et al2019) We further consider a lsquoBERT-to-BERTrsquopipeline where we replace each stage with a BERTmodule for prediction (Devlin et al 2019)

In pipeline models we train each stage indepen-dently The rationale identification stage is trainedusing approximate sentence boundaries from oursource annotations with randomly sampled neg-ative examples at each epoch The classificationstage uses the same positive rationales as the iden-tification stage a type of teacher forcing (Williamsand Zipser 1989) (details in Appendix C)

52 Soft selection

We consider a model that passes tokens throughBERT (Devlin et al 2019) to induce contextual-ized representations that are then passed to a bi-directional LSTM (Hochreiter and Schmidhuber1997) The hidden representations from the LSTMare collapsed into a single vector using additiveattention (Bahdanau et al 2015) The LSTM layerallows us to bypass the 512 word limit imposed by

4450

Perf IOU F1 Token F1

Evidence InferenceLei et al (2016) 0461 0000 0000Lei et al (2016) (u) 0461 0000 0000Lehman et al (2019) 0471 0119 0123Bert-To-Bert 0708 0455 0468

BoolQLei et al (2016) 0381 0000 0000Lei et al (2016) (u) 0380 0000 0000Lehman et al (2019) 0411 0050 0127Bert-To-Bert 0544 0052 0134

Movie ReviewsLei et al (2016) 0914 0124 0285Lei et al (2016) (u) 0920 0012 0322Lehman et al (2019) 0750 0063 0139Bert-To-Bert 0860 0075 0145

FEVERLei et al (2016) 0719 0218 0234Lei et al (2016) (u) 0718 0000 0000Lehman et al (2019) 0691 0540 0523Bert-To-Bert 0877 0835 0812

MultiRCLei et al (2016) 0655 0271 0456Lei et al (2016) (u) 0648 0000dagger 0000dagger

Lehman et al (2019) 0614 0136 0140Bert-To-Bert 0633 0416 0412

CoS-ELei et al (2016) 0477 0255 0331Lei et al (2016) (u) 0476 0000dagger 0000dagger

Bert-To-Bert 0344 0389 0519

e-SNLILei et al (2016) 0917 0693 0692Lei et al (2016) (u) 0903 0261 0379Bert-To-Bert 0733 0704 0701

Table 3 Performance of models that perform hard ra-tionale selection All models are supervised at the ratio-nale level except for those marked with (u) which learnonly from instance-level supervision dagger denotes cases inwhich rationale training degenerated due to the REIN-FORCE style training Perf is accuracy (CoS-E) ormacro-averaged F1 (others) Bert-To-Bert for CoS-Eand e-SNLI uses a token classification objective Bert-To-Bert CoS-E uses the highest scoring answer

BERT when we exceed this we effectively startencoding a lsquonewrsquo sequence (setting the positionalindex to 0) via BERT The hope is that the LSTMlearns to compensate for this Evidence Inferenceand BoolQ comprise very long (gt1000 token) in-puts we were unable to run BERT over these Weinstead resorted to swapping GloVe 300d embed-dings (Pennington et al 2014) in place of BERTrepresentations for tokens spans

To soft score features we consider Simple gra-dients attention induced over contextualized repre-sentations and LIME (Ribeiro et al 2016)

Perf AUPRC Comp uarr Suff darr

Evidence InferenceGloVe + LSTM - Attention 0429 0506 -0002 -0023GloVe + LSTM - Gradient 0429 0016 0046 -0138GloVe + LSTM - Lime 0429 0014 0006 -0128GloVe + LSTM - Random 0429 0014 -0001 -0026

BoolQGloVe + LSTM - Attention 0471 0525 0010 0022GloVe + LSTM - Gradient 0471 0072 0024 0031GloVe + LSTM - Lime 0471 0073 0028 -0154GloVe + LSTM - Random 0471 0074 0000 0005

MoviesBERT+LSTM - Attention 0970 0417 0129 0097BERT+LSTM - Gradient 0970 0385 0142 0112BERT+LSTM - Lime 0970 0280 0187 0093BERT+LSTM - Random 0970 0259 0058 0330

FEVERBERT+LSTM - Attention 0870 0235 0037 0122BERT+LSTM - Gradient 0870 0232 0059 0136BERT+LSTM - Lime 0870 0291 0212 0014BERT+LSTM - Random 0870 0244 0034 0122

MultiRCBERT+LSTM - Attention 0655 0244 0036 0052BERT+LSTM - Gradient 0655 0224 0077 0064BERT+LSTM - Lime 0655 0208 0213 -0079BERT+LSTM - Random 0655 0186 0029 0081

CoS-EBERT+LSTM - Attention 0487 0606 0080 0217BERT+LSTM - Gradient 0487 0585 0124 0226BERT+LSTM - Lime 0487 0544 0223 0143BERT+LSTM - Random 0487 0594 0072 0224

e-SNLIBERT+LSTM - Attention 0960 0395 0105 0583BERT+LSTM - Gradient 0960 0416 0180 0472BERT+LSTM - Lime 0960 0513 0437 0389BERT+LSTM - Random 0960 0357 0081 0487

Table 4 Metrics for lsquosoftrsquo scoring models Perf is ac-curacy (CoS-E) or F1 (others) Comprehensiveness andsufficiency are in terms of AOPC (Eq 3) lsquoRandomrsquoassigns random scores to tokens to induce orderingsthese are averages over 10 runs

6 EvaluationHere we present initial results for the baseline mod-els discussed in Section 5 with respect to the met-rics proposed in Section 4 We present results intwo parts reflecting the two classes of rationalesdiscussed above lsquoHardrsquo approaches that performdiscrete selection of snippets and lsquosoftrsquo methodsthat assign continuous importance scores to tokens

In Table 3 we evaluate models that perform dis-crete selection of rationales We view these as in-herently faithful because by construction we knowwhich snippets the decoder used to make a pre-diction10 Therefore for these methods we reportonly metrics that measure agreement with humanannotations

10This assumes independent encoders and decoders

4451

Due to computational constraints we were un-able to run our BERT-based implementation of Leiet al (2016) over larger corpora Conversely thesimple pipeline of Lehman et al (2019) assumesa setting in which rationale are sentences and sois not appropriate for datasets in which rationalestend to comprise only very short spans Again inour view this highlights the need for models thatcan rationalize at varying levels of granularity de-pending on what is appropriate

We observe that for the ldquorationalizingrdquo modelof Lei et al (2016) exploiting rationale-level super-vision often (though not always) improves agree-ment with human-provided rationales as in priorwork (Zhang et al 2016 Strout et al 2019) In-terestingly this does not seem strongly correlatedwith predictive performance

Lei et al (2016) outperforms the simple pipelinemodel when using a BERT encoder Further Leiet al (2016) outperforms the lsquoBERT-to-BERTrsquopipeline on the comparable datasets for the finalprediction tasks This may be an artifact of theamount of text each model can select lsquoBERT-to-BERTrsquo is limited to sentences while Lei et al(2016) can select any subset of the text Designingextraction models that learn to adaptively selectcontiguous rationales of appropriate length for agiven task seems a potentially promising direction

In Table 4 we report metrics for models thatassign continuous importance scores to individ-ual tokens For these models we again measuredownstream (task) performance (macro F1 or ac-curacy) Here the models are actually the sameand so downstream performance is equivalent Toassess the quality of token scores with respect tohuman annotations we report the Area Under thePrecision Recall Curve (AUPRC)

These scoring functions assign only soft scoresto inputs (and may still use all inputs to come toa particular prediction) so we report the metricsintended to measure faithfulness defined abovecomprehensiveness and sufficiency averaged overlsquobinsrsquo of tokens ordered by importance scores Toprovide a point of reference for these metrics mdashwhich depend on the underlying model mdash we re-port results when rationales are randomly selected(averaged over 10 runs)

Both simple gradient and LIME-based scoringyield more comprehensive rationales than attentionweights consistent with prior work (Jain and Wal-lace 2019 Serrano and Smith 2019) Attention

fares better in terms of AUPRC mdash suggesting bet-ter agreement with human rationales mdash which isalso in line with prior findings that it may provideplausible but not faithful explanation (Zhong et al2019) Interestingly LIME does particularly wellacross these tasks in terms of faithfulness

From the lsquoRandomrsquo results that we concludemodels with overall poor performance on their fi-nal tasks tend to have an overall poor ordering withmarginal differences in comprehensiveness and suf-ficiency between them For models that with highsufficiency scores Movies FEVER CoS-E and e-SNLI we find that random removal is particularlydamaging to performance indicating poor absoluteranking whereas those with high comprehensive-ness are sensitive to rationale length

7 Conclusions and Future DirectionsWe have introduced a new publicly available re-source the Evaluating Rationales And Simple En-glish Reasoning (ERASER) benchmark This com-prises seven datasets all of which include bothinstance level labels and corresponding supportingsnippets (lsquorationalesrsquo) marked by human annotatorsWe have augmented many of these datasets withadditional annotations and converted them into astandard format comprising inputs rationales andoutputs ERASER is intended to facilitate progresson explainable models for NLP

We proposed several metrics intended to mea-sure the quality of rationales extracted by modelsboth in terms of agreement with human annota-tions and in terms of lsquofaithfulnessrsquo We believethese metrics provide reasonable means of compar-ison of specific aspects of interpretability but weview the problem of measuring faithfulness in par-ticular a topic ripe for additional research (whichERASER can facilitate)

Our hope is that ERASER enables future workon designing more interpretable NLP models andcomparing their relative strengths across a vari-ety of tasks datasets and desired criteria It alsoserves as an ideal starting point for several futuredirections such as better evaluation metrics for in-terpretability causal analysis of NLP models anddatasets of rationales in other languages

8 AcknowledgementsWe thank the anonymous ACL reviewers

This work was supported in part by the NSF (CA-REER award 1750978) and by the Army ResearchOffice (W911NF1810328)

4452

ReferencesDavid Alvarez-Melis and Tommi Jaakkola 2017 A

causal framework for explaining the predictions ofblack-box sequence-to-sequence models In Pro-ceedings of the 2017 Conference on Empirical Meth-ods in Natural Language Processing pages 412ndash421

Leila Arras Franziska Horn Gregoire MontavonKlaus-Robert Muller and Wojciech Samek 2017rdquowhat is relevant in a text documentrdquo An inter-pretable machine learning approach In PloS one

Dzmitry Bahdanau Kyunghyun Cho and Yoshua Ben-gio 2015 Neural machine translation by jointlylearning to align and translate In 3rd Inter-national Conference on Learning RepresentationsICLR 2015 San Diego CA USA May 7-9 2015Conference Track Proceedings

Joost Bastings Wilker Aziz and Ivan Titov 2019 In-terpretable neural predictions with differentiable bi-nary variables In Proceedings of the 57th AnnualMeeting of the Association for Computational Lin-guistics pages 2963ndash2977 Florence Italy Associa-tion for Computational Linguistics

Iz Beltagy Kyle Lo and Arman Cohan 2019 Scib-ert Pretrained language model for scientific text InEMNLP

Samuel R Bowman Gabor Angeli Christopher Pottsand Christopher D Manning 2015 A large anno-tated corpus for learning natural language inferenceIn Proceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)Association for Computational Linguistics

Gino Brunner Yang Liu Damian Pascual OliverRichter Massimiliano Ciaramita and Roger Watten-hofer 2020 On identifiability in transformers InInternational Conference on Learning Representa-tions

Oana-Maria Camburu Tim Rocktaschel ThomasLukasiewicz and Phil Blunsom 2018 e-snli Nat-ural language inference with natural language expla-nations In Advances in Neural Information Process-ing Systems pages 9539ndash9549

Shiyu Chang Yang Zhang Mo Yu and TommiJaakkola 2019 A game theoretic approach to class-wise selective rationalization In Advances in Neu-ral Information Processing Systems pages 10055ndash10065

Sihao Chen Daniel Khashabi Wenpeng Yin ChrisCallison-Burch and Dan Roth 2019 Seeing thingsfrom a different angle Discovering diverse perspec-tives about claims In Proceedings of the Conferenceof the North American Chapter of the Associationfor Computational Linguistics (NAACL) pages 542ndash557 Minneapolis Minnesota

Kyunghyun Cho Bart van Merrienboer Caglar Gul-cehre Dzmitry Bahdanau Fethi Bougares HolgerSchwenk and Yoshua Bengio 2014 Learningphrase representations using RNN encoderndashdecoderfor statistical machine translation In Proceedings ofthe 2014 Conference on Empirical Methods in Nat-ural Language Processing (EMNLP) pages 1724ndash1734 Doha Qatar Association for ComputationalLinguistics

Christopher Clark Kenton Lee Ming-Wei ChangTom Kwiatkowski Michael Collins and KristinaToutanova 2019 Boolq Exploring the surprisingdifficulty of natural yesno questions In NAACL

Jacob Cohen 1960 A coefficient of agreement fornominal scales Educational and PsychologicalMeasurement 20(1)37ndash46

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Yanzhuo Ding Yang Liu Huanbo Luan and MaosongSun 2017 Visualizing and understanding neuralmachine translation In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) VancouverCanada Association for Computational Linguistics

Finale Doshi-Velez and Been Kim 2017 Towards arigorous science of interpretable machine learningarXiv preprint arXiv170208608

Mark Everingham Luc Van Gool Christopher K IWilliams John Winn and Andrew Zisserman 2010The pascal visual object classes (voc) challenge In-ternational Journal of Computer Vision 88(2)303ndash338

Shi Feng Eric Wallace Alvin Grissom Mohit IyyerPedro Rodriguez and Jordan L Boyd-Graber 2018Pathologies of neural models make interpretationdifficult In EMNLP

Matt Gardner Joel Grus Mark Neumann OyvindTafjord Pradeep Dasigi Nelson F Liu Matthew Pe-ters Michael Schmitz and Luke Zettlemoyer 2018AllenNLP A deep semantic natural language pro-cessing platform In Proceedings of Workshop forNLP Open Source Software (NLP-OSS) pages 1ndash6 Melbourne Australia Association for Computa-tional Linguistics

Sepp Hochreiter and Jurgen Schmidhuber 1997Long short-term memory Neural computation9(8)1735ndash1780

4453

Sara Hooker Dumitru Erhan Pieter-Jan Kindermansand Been Kim 2019 A benchmark for interpretabil-ity methods in deep neural networks In H Wal-lach H Larochelle A Beygelzimer F dAlche-BucE Fox and R Garnett editors Advances in Neu-ral Information Processing Systems 32 pages 9737ndash9748 Curran Associates Inc

Alon Jacovi and Yoav Goldberg 2020 Towards faith-fully interpretable nlp systems How should wedefine and evaluate faithfulness arXiv preprintarXiv200403685

Sarthak Jain and Byron C Wallace 2019 Attention isnot Explanation In Proceedings of the 2019 Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long and Short Pa-pers) pages 3543ndash3556 Minneapolis MinnesotaAssociation for Computational Linguistics

Sarthak Jain Sarah Wiegreffe Yuval Pinter and By-ron C Wallace 2020 Learning to Faithfully Ratio-nalize by Construction In Proceedings of the Con-ference of the Association for Computational Lin-guistics (ACL)

Daniel Khashabi Snigdha Chaturvedi Michael RothShyam Upadhyay and Dan Roth 2018 LookingBeyond the Surface A Challenge Set for ReadingComprehension over Multiple Sentences In Procof the Annual Conference of the North AmericanChapter of the Association for Computational Lin-guistics (NAACL)

Diederik Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization InternationalConference on Learning Representations

Eric Lehman Jay DeYoung Regina Barzilay and By-ron C Wallace 2019 Inferring which medical treat-ments work from reports of clinical trials In Pro-ceedings of the North American Chapter of the As-sociation for Computational Linguistics (NAACL)pages 3705ndash3717

Tao Lei Regina Barzilay and Tommi Jaakkola 2016Rationalizing neural predictions In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 107ndash117

Jiwei Li Xinlei Chen Eduard Hovy and Dan Jurafsky2016 Visualizing and understanding neural modelsin NLP In Proceedings of the 2016 Conference ofthe North American Chapter of the Association forComputational Linguistics Human Language Tech-nologies pages 681ndash691 San Diego California As-sociation for Computational Linguistics

Zachary C Lipton 2016 The mythos of model inter-pretability arXiv preprint arXiv160603490

Scott M Lundberg and Su-In Lee 2017 A unifiedapproach to interpreting model predictions In Ad-vances in Neural Information Processing Systemspages 4765ndash4774

Tyler McDonnell Mucahid Kutlu Tamer Elsayed andMatthew Lease 2017 The many benefits of anno-tator rationales for relevance judgments In IJCAIpages 4909ndash4913

Tyler McDonnell Matthew Lease Mucahid Kutlu andTamer Elsayed 2016 Why is that relevant col-lecting annotator rationales for relevance judgmentsIn Fourth AAAI Conference on Human Computationand Crowdsourcing

Gregoire Montavon Sebastian Lapuschkin AlexanderBinder Wojciech Samek and Klaus-Robert Muller2017 Explaining nonlinear classification decisionswith deep taylor decomposition Pattern Recogni-tion 65211ndash222

Pooya Moradi Nishant Kambhatla and Anoop Sarkar2019 Interrogating the explanatory power of atten-tion in neural machine translation In Proceedings ofthe 3rd Workshop on Neural Generation and Trans-lation pages 221ndash230 Hong Kong Association forComputational Linguistics

Mark Neumann Daniel King Iz Beltagy and WaleedAmmar 2019 Scispacy Fast and robust modelsfor biomedical natural language processing CoRRabs190207669

Dong Nguyen 2018 Comparing automatic and humanevaluation of local explanations for text classifica-tion In Proceedings of the 2018 Conference of theNorth American Chapter of the Association for Com-putational Linguistics Human Language Technolo-gies Volume 1 (Long Papers) pages 1069ndash1078

Bo Pang and Lillian Lee 2004 A sentimental edu-cation Sentiment analysis using subjectivity sum-marization based on minimum cuts In Proceed-ings of the 42nd Annual Meeting of the Associationfor Computational Linguistics (ACL-04) pages 271ndash278 Barcelona Spain

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga et al 2019 Pytorch An imperative stylehigh-performance deep learning library In Ad-vances in Neural Information Processing Systemspages 8024ndash8035

David J Pearce 2005 An improved algorithm for find-ing the strongly connected components of a directedgraph Technical report Victoria University NZ

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global vectors for word rep-resentation In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 1532ndash1543 Doha Qatar Asso-ciation for Computational Linguistics

Danish Pruthi Mansi Gupta Bhuwan Dhingra Gra-ham Neubig and Zachary C Lipton 2020 Learn-ing to deceive with attention-based explanations InAnnual Conference of the Association for Computa-tional Linguistics (ACL)

4454

Sampo Pyysalo F Ginter Hans Moen T Salakoski andSophia Ananiadou 2013 Distributional semanticsresources for biomedical text processing Proceed-ings of Languages in Biology and Medicine

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training

Nazneen Fatema Rajani Bryan McCann CaimingXiong and Richard Socher 2019 Explain yourselfleveraging language models for commonsense rea-soning Proceedings of the Association for Compu-tational Linguistics (ACL)

Marco Ribeiro Sameer Singh and Carlos Guestrin2016 why should i trust you Explaining the pre-dictions of any classifier In Proceedings of the 2016Conference of the North American Chapter of theAssociation for Computational Linguistics Demon-strations pages 97ndash101

Wojciech Samek Alexander Binder Gregoire Mon-tavon Sebastian Lapuschkin and Klaus-RobertMuller 2016 Evaluating the visualization of whata deep neural network has learned IEEE trans-actions on neural networks and learning systems28(11)2660ndash2673

Tal Schuster Darsh J Shah Yun Jie Serene Yeo DanielFilizzola Enrico Santus and Regina Barzilay 2019Towards debiasing fact verification models In Pro-ceedings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing (EMNLP) As-sociation for Computational Linguistics

Sofia Serrano and Noah A Smith 2019 Is attentioninterpretable In Proceedings of the 57th AnnualMeeting of the Association for Computational Lin-guistics pages 2931ndash2951 Florence Italy Associa-tion for Computational Linguistics

Burr Settles 2012 Active learning Synthesis Lec-tures on Artificial Intelligence and Machine Learn-ing 6(1)1ndash114

Manali Sharma Di Zhuang and Mustafa Bilgic 2015Active learning with rationales for text classificationIn Proceedings of the 2015 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics Human Language Technologiespages 441ndash451

Kevin Small Byron C Wallace Carla E Brodley andThomas A Trikalinos 2011 The constrained weightspace svm learning with ranked features In Pro-ceedings of the International Conference on Inter-national Conference on Machine Learning (ICML)pages 865ndash872

D Smilkov N Thorat B Kim F Viegas and M Wat-tenberg 2017 SmoothGrad removing noise byadding noise ICML workshop on visualization fordeep learning

Robyn Speer 2019 ftfy Zenodo Version 55

Nitish Srivastava Geoffrey Hinton Alex KrizhevskyIlya Sutskever and Ruslan Salakhutdinov 2014Dropout A simple way to prevent neural networksfrom overfitting Journal of Machine Learning Re-search 151929ndash1958

Julia Strout Ye Zhang and Raymond Mooney 2019Do human rationales improve machine explana-tions In Proceedings of the 2019 ACL WorkshopBlackboxNLP Analyzing and Interpreting NeuralNetworks for NLP pages 56ndash62 Florence Italy As-sociation for Computational Linguistics

Mukund Sundararajan Ankur Taly and Qiqi Yan 2017Axiomatic attribution for deep networks In Pro-ceedings of the 34th International Conference onMachine Learning-Volume 70 pages 3319ndash3328JMLR org

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

James Thorne Andreas Vlachos ChristosChristodoulopoulos and Arpit Mittal 2018FEVER a Large-scale Dataset for Fact Extractionand VERification In Proceedings of the NorthAmerican Chapter of the Association for Computa-tional Linguistics (NAACL) pages 809ndash819

Shikhar Vashishth Shyam Upadhyay Gaurav SinghTomar and Manaal Faruqui 2019 Attention in-terpretability across nlp tasks arXiv preprintarXiv190911218

Byron C Wallace Kevin Small Carla E Brodley andThomas A Trikalinos 2010 Active learning forbiomedical citation screening In Proceedings ofthe 16th ACM SIGKDD international conference onKnowledge discovery and data mining pages 173ndash182 ACM

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a Superglue Astickier benchmark for general-purpose language un-derstanding systems In H Wallach H LarochelleA Beygelzimer F dAlche-Buc E Fox and R Gar-nett editors Advances in Neural Information Pro-cessing Systems 32 pages 3266ndash3280 Curran Asso-ciates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In Inter-national Conference on Learning Representations

4455

Sarah Wiegreffe and Yuval Pinter 2019 Attention isnot not explanation In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 11ndash20 Hong Kong China Associ-ation for Computational Linguistics

Ronald J Williams 1992 Simple statistical gradient-following algorithms for connectionist reinforce-ment learning Machine learning 8(3-4)229ndash256

Ronald J Williams and David Zipser 1989 A learn-ing algorithm for continually running fully recurrentneural networks Neural computation 1(2)270ndash280

Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Rrsquoemi Louf Morgan Funtow-icz and Jamie Brew 2019 Huggingfacersquos trans-formers State-of-the-art natural language process-ing ArXiv abs191003771

Mo Yu Shiyu Chang Yang Zhang and TommiJaakkola 2019 Rethinking cooperative rationaliza-tion Introspective extraction and complement con-trol In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages4094ndash4103 Hong Kong China Association forComputational Linguistics

Omar Zaidan Jason Eisner and Christine Piatko2007 Using annotator rationales to improve ma-chine learning for text categorization In Proceed-ings of the conference of the North American chap-ter of the Association for Computational Linguistics(NAACL) pages 260ndash267

Omar F Zaidan and Jason Eisner 2008 Modeling an-notators A generative approach to learning from an-notator rationales In Proceedings of the Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 31ndash40

Ye Zhang Iain Marshall and Byron C Wallace 2016Rationale-augmented convolutional neural networksfor text classification In Proceedings of the Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP) volume 2016 page 795 NIHPublic Access

Ruiqi Zhong Steven Shao and Kathleen McKeown2019 Fine-grained sentiment analysis with faithfulattention arXiv preprint arXiv190806870

AppendixA Dataset PreprocessingWe describe what if any additional processing weperform on a per-dataset basis All datasets wereconverted to a unified format

MultiRC (Khashabi et al 2018) We perform min-imal processing We use the validation set as thetesting set for public release

Evidence Inference (Lehman et al 2019) We per-form minimal processing As not all of the pro-vided evidence spans come with offsets we deleteany prompts that had no grounded evidence spans

Movie reviews (Zaidan and Eisner 2008) We per-form minimal processing We use the ninth fold asthe validation set and collect annotations on thetenth fold for comprehensive evaluation

FEVER (Thorne et al 2018) We perform substan-tial processing for FEVER - we delete the rdquoNotEnough Infordquo claim class delete any claims withsupport in more than one document and reparti-tion the validation set into a validation and a testset for this benchmark (using the test set wouldcompromise the information retrieval portion ofthe original FEVER task) We ensure that thereis no document overlap between train validationand test sets (we use Pearce (2005) to ensure thisas conceptually a claim may be supported by factsin more than one document) We ensure that thevalidation set contains the documents used to cre-ate the FEVER symmetric dataset (Schuster et al2019) (unfortunately the documents used to createthe validation and test sets overlap so we cannotprovide this partitioning) Additionally we cleanup some encoding errors in the dataset via Speer(2019)

BoolQ (Clark et al 2019) The BoolQ dataset re-quired substantial processing The original datasetdid not retain source Wikipedia articles or col-lection dates In order to identify the sourceparagraphs we download the 122018 Wikipediaarchive and use FuzzyWuzzy httpsgithub

comseatgeekfuzzywuzzy to identify the sourceparagraph span that best matches the original re-lease If the Levenshtein distance ratio does notreach a score of at least 90 the corresponding in-stance is removed For public release we use theofficial validation set for testing and repartitiontrain into a training and validation set

e-SNLI (Camburu et al 2018) We perform mini-mal processing We separate the premise and hy-pothesis statements into separate documents

Commonsense Explanations (CoS-E) (Rajaniet al 2019) We perform minimal processing pri-marily deletion of any questions without a rationale

4456

Dataset Documents Instances Rationale Evidence Statements Evidence Lengths

MultiRCTrain 400 24029 174 56298 215Val 56 3214 185 7498 228Test 83 4848 - - -Evidence InferenceTrain 1924 7958 134 10371 393Val 247 972 138 1294 403Test 240 959 - - -Exhaustive Evidence InferenceVal 81 101 447 5040 352Test 106 152 - - -Movie ReviewsTrain 1599 1600 935 13878 77Val 150 150 745 11430 66Test 200 200 - - -Exhaustive Movie ReviewsVal 50 50 1910 5920 128FEVERTrain 2915 97957 200 146856 313Val 570 6122 216 8672 282Test 614 6111 - - -BoolQTrain 4518 6363 664 63630 1102Val 1092 1491 713 14910 1065Test 2294 2817 - - -e-SNLITrain 911938 549309 273 11990350 18Val 16328 9823 256 236390 16Test 16299 9807 - - -CoS-ETrain 8733 8733 266 8733 74Val 1092 1092 271 1092 76Test 1092 1092 - - -

Table 5 Detailed breakdowns for each dataset - the number of documents instances evidence statements andlengths Additionally we include the percentage of each relevant document that is considered a rationale For testsets counts are for all instances including documents with non comprehensive rationales

Dataset Labels Instances Documents Sentences Tokens

Evidence Inference 3 9889 2411 1560 47606BoolQ 2 10661 7026 1753 35825Movie Reviews 2 2000 1999 368 7741FEVER 2 110190 4099 121 3265MultiRC 2 32091 539 149 3025CoS-E 5 10917 10917 10 276e-SNLI 3 568939 944565 17 160

Table 6 General dataset statistics number of labels instances unique documents and average numbers of sen-tences and tokens in documents across the publicly released trainvalidationtest splits in ERASER For CoS-Eand e-SNLI the sentence counts are not meaningful as the partitioning of questionsentenceanswer formatting isan arbitrary choice in this framework

4457

or questions with rationales that were not possi-ble to automatically map back to the underlyingtext As recommended by the authors of Talmoret al (2019) we repartition the train and validationsets into a train validation and test set for thisbenchmark We encode the entire question and an-swers as a prompt and convert the problem into afive-class prediction We also convert the ldquoSanityrdquodatasets for user convenience

All datasets in ERASER were tokenized usingspaCy11 library (with SciSpacy (Neumann et al2019) for Evidence Inference) In addition we alsosplit all datasets except e-SNLI and CoS-E intosentences using the same library

B Annotation detailsWe collected comprehensive rationales for a subsetof some test sets to accurately evaluate model recallof rationales

1 Movies We used the Upwork Platform12 tohire two fluent english speakers to annotateeach of the 200 documents in our test setWorkers were paid at rate of USD 85 per hourand on average it took them 5 min to anno-tate a document Each annotator was asked toannotate a set of 6 documents and comparedagainst in-house annotations (by authors)

2 Evidence Inference We again used Upworkto hire 4 medical professionals fluent in en-glish and having passed a pilot of 3 documents125 documents were annotated (only once byone of the annotators which we felt was ap-propriate given their high-level of expertise)with an average cost of USD 13 per documentAverage time spent of single document was31 min

3 BoolQ We used Amazon Mechanical Turk(MTurk) to collect reference comprehensiverationales from randomly selected 199 docu-ments from our test set (ranging in 800 to 1500tokens in length) Only workers from AU NZCA US GB with more than 10K approvedHITs and an approval rate of greater than 98were eligible For every document 3 annota-tions were collected and workers were paidUSD 150 per HIT The average work time(obtained through MTurk interface) was 21min We did not anticipate the task taking so

11httpsspacyio12httpwwwupworkcom

long (on average) the effective low pay ratewas unintended

C Hyperparameter and training detailsC1 (Lei et al 2016) models

For these models we set the sparsity rate at 001and we set the contiguity loss weight to 2 timessparsity rate (following the original paper) Weused bert-base-uncased (Wolf et al 2019) as to-ken embedder (for all datasets except BoolQ Ev-idence Inference and FEVER) and BidirectionalLSTM with 128 dimensional hidden state in eachdirection A dropout (Srivastava et al 2014) rateof 02 was used before feeding the hidden repre-sentations to attention layer in decoder and linearlayer in encoder One layer MLP with 128 dimen-sional hidden state and ReLU activation was usedto compute the decoder output distribution

For three datasets mentioned above we useGloVe embeddings (httpnlpstanfordedudataglove840B300dzip)

A learning rate of 2e-5 with Adam (Kingma andBa 2014) optimizer was used for all models and weonly fine-tuned top two layers of BERT encoderTh models were trained for 20 epochs and earlystopping with patience of 5 epochs was used Thebest model was selected on validation set using thefinal task performance metric

The input for the above model was encodedin form of [CLS] document [SEP] query[SEP]

This model was implemented using theAllenNLP library (Gardner et al 2018)

C2 BERT-LSTMGloVe-LSTM

This model is essentially the same as the decoder inprevious section The BERT-LSTM uses the samehyperparameters and GloVe-LSTM is trained witha learning rate of 1e-2

C3 Lehman et al (2019) models

With the exception of the Evidence Inferencedataset these models were trained using the GLoVe(Pennington et al 2014) 200 dimension word vec-tors and Evidence Inference using the (Pyysaloet al 2013) PubMed word vectors We use Adam(Kingma and Ba 2014) with a learning rate of1e-3 Dropout (Srivastava et al 2014) of 005 ateach layer (embedding GRU attention layer) ofthe model for 50 epochs with a patience of 10 Wemonitor validation loss and keep the best modelon the validation set

4458

C4 BERT-to-BERT model

We primarily used the lsquobert-base-uncasedlsquo modelfor both components of the identification and clas-sification pipeline with the sole exception beingEvidence Inference with SciBERT (Beltagy et al2019) We trained with the standard BERT parame-ters of a learning rate of 1e-5 Adam (Kingma andBa 2014) for 10 epochs We monitor validationloss and keep the best model on the validation set

4444

but they may not be comprehensive Therefore forsome datasets we have also collected comprehen-sive rationales (in which all evidence supportingan output has been marked) on test instances

The lsquoqualityrsquo of extracted rationales will dependon their intended use Therefore we propose aninitial set of metrics to evaluate rationales thatare meant to measure different varieties of lsquointer-pretabilityrsquo Broadly this includes measures ofagreement with human-provided rationales and as-sessments of faithfulness The latter aim to capturethe extent to which rationales provided by a modelin fact informed its predictions We believe theseprovide a reasonable start but view the problem ofdesigning metrics for evaluating rationales mdash espe-cially for measuring faithfulness mdash as a topic forfurther research that ERASER can facilitate Andwhile we will provide a lsquoleaderboardrsquo this is betterviewed as a lsquoresults boardrsquo we do not privilegeany one metric Instead ERASER permits compar-ison between models that provide rationales withrespect to different criteria of interest

We implement baseline models and report theirperformance across the corpora in ERASER Wefind that no single lsquooff-the-shelfrsquo architecture isreadily adaptable to datasets with very differentinstance lengths and associated rationale snippets(Section 3) This highlights a need for new modelsthat can consume potentially lengthy inputs andadaptively provide rationales at a task-appropriatelevel of granularity ERASER provides a resourceto develop such models

In sum we introduce the ERASER benchmark(wwweraserbenchmarkcom) a unified set of di-verse NLP datasets (these are repurposed and aug-mented from existing corpora1 including senti-ment analysis Natural Language Inference andQA tasks among others) in a standardized for-mat featuring human rationales for decisions alongwith starter code and tools baseline models andstandardized (initial) metrics for rationales

2 Related Work

Interpretability in NLP is a large fast-growingarea we do not attempt to provide a comprehensiveoverview here Instead we focus on directions par-ticularly relevant to ERASER ie prior work onmodels that provide rationales for their predictions

Learning to explain In ERASER we assume that

1We ask users of the benchmark to cite all original papersand provide a BibTeX entry for doing so on the website

rationales (marked by humans) are provided duringtraining However such direct supervision will notalways be available motivating work on methodsthat can explain (or ldquorationalizerdquo) model predic-tions using only instance-level supervision

In the context of modern neural models for textclassification one might use variants of attention(Bahdanau et al 2015) to extract rationales At-tention mechanisms learn to assign soft weights to(usually contextualized) token representations andso one can extract highly weighted tokens as ratio-nales However attention weights do not in gen-eral provide faithful explanations for predictions(Jain and Wallace 2019 Serrano and Smith 2019Wiegreffe and Pinter 2019 Zhong et al 2019Pruthi et al 2020 Brunner et al 2020 Moradiet al 2019 Vashishth et al 2019) This likelyowes to encoders entangling inputs complicatingthe interpretation of attention weights on inputsover contextualized representations of the same2

By contrast hard attention mechanisms dis-cretely extract snippets from the input to pass to theclassifier by construction providing faithful expla-nations Recent work has proposed hard attentionmechanisms as a means of providing explanationsLei et al (2016) proposed instantiating two modelswith their own parameters one to extract rationalesand one that consumes these to make a predictionThey trained these models jointly via REINFORCE(Williams 1992) style optimization

Recently Jain et al (2020) proposed a variantof this two-model setup that uses heuristic featurescores to derive pseudo-labels on tokens compris-ing rationales one model can then be used to per-form hard extraction in this way while a second(independent) model can make predictions on thebasis of these Elsewhere Chang et al (2019)introduced the notion of classwise rationales thatexplains support for different output classes usinga game theoretic framework Finally other recentwork has proposed using a differentiable binarymask over inputs which also avoids recourse toREINFORCE (Bastings et al 2019)

Post-hoc explanation Another strand of inter-pretability work considers post-hoc explanationmethods which seek to explain why a model madea specific prediction for a given input Commonly

2Interestingly Zhong et al (2019) find that attention some-times provides plausible but not faithful rationales ElsewherePruthi et al (2020) show that one can easily learn to deceivevia attention weights These findings highlight that one shouldbe mindful of the criteria one wants rationales to fulfill

4445







BoolQ 6363 1491 2817 3583


FEVER 97957 6122 6111 327

MultiRC 24029 3214 4848 303

CoS-E 8733 1092 1092 28

e-SNLI 911938 16449 16429 16







4446














4 Metrics


4447















4448


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip



(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency













1

∣B∣ + 1(∣B∣

sumk=0





4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4445







BoolQ 6363 1491 2817 3583


FEVER 97957 6122 6111 327

MultiRC 24029 3214 4848 303

CoS-E 8733 1092 1092 28

e-SNLI 911938 16449 16429 16







4446














4 Metrics


4447















4448


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip



(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency













1

∣B∣ + 1(∣B∣

sumk=0





4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4446














4 Metrics


4447















4448


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip



(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency













1

∣B∣ + 1(∣B∣

sumk=0





4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4447















4448


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip



(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency













1

∣B∣ + 1(∣B∣

sumk=0





4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4448


(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip hellip



(a) Compost pile

(b) Flowers

(c) Forest(d) Field

(e) Ground

hellip

Comprehensiveness

Suffiency













1

∣B∣ + 1(∣B∣

sumk=0





4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4449



51 Hard selection





eraserbenchmark





52 Soft selection


4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4450









Bert-To-Bert 0344 0389 0519

















4451














4452






















4453























4454






















4455



















4456







4457



















4458



4451














4452






















4453























4454






















4455



















4456







4457



















4458



4452






















4453























4454






















4455



















4456







4457



















4458



4453























4454






















4455



















4456







4457



















4458



4454






















4455



















4456







4457



















4458



4455



















4456







4457



















4458



4456







4457



















4458



4457



















4458



4458



ERASER: A Benchmark to Evaluate Rationalized NLP Models · illustrative of four (out of seven)...

Documents

Transcript of ERASER: A Benchmark to Evaluate Rationalized NLP Models · illustrative of four (out of seven)...