
Automatically Detecting Action Items in Audio Meeting Recordings

William Morgan   Pi-Chuan Chang   Surabhi Gupta
Department of Computer Science
Stanford University
353 Serra Mall
Stanford, CA
[email protected]  [email protected]@cs.stanford.edu

Jason M. Brenier
Department of Linguistics
Center for Spoken Language Research
Institute of Cognitive Science
University of Colorado at Boulder
594 UCB
Boulder, Colorado
[email protected]

Abstract

Identification of action items in meeting recordings can provide immediate access to salient information in a medium notoriously difficult to search and summarize. To this end, we use a maximum entropy model to automatically detect action item-related utterances from multi-party audio meeting recordings. We compare the effect of lexical, temporal, syntactic, semantic, and prosodic features on system performance. We show that on a corpus of action item annotations on the ICSI meeting recordings, characterized by high imbalance and low inter-annotator agreement, the system performs at an F measure of 31.92%. While this is low compared to better-studied tasks on more mature corpora, the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.

1 Introduction

Meetings are a ubiquitous feature of workplace environments, and recordings of meetings provide obvious benefit in that they can be replayed or searched through at a later date. As recording technology becomes more easily available and storage space becomes less costly, the feasibility of producing and storing these recordings increases. This is particularly true for audio recordings, which are cheaper to produce and store than full audio-video recordings.

However, audio recordings are notoriously difficult to search or to summarize. This is doubly true of multi-party recordings, which, in addition to the difficulties presented by single-party recordings, typically contain backchannels, elaborations, and side topics, all of which further confound search and summarization processes. Making efficient use of large meeting corpora thus requires intelligent summary and review techniques.

One possible user goal given a corpus of meeting recordings is to discover the action items decided within the meetings. Action items are decisions made within the meeting that require post-meeting attention or labor. Rapid identification of action items can provide immediate access to salient portions of the meetings. A review of action items can also function as (part of) a summary of the meeting content.

To this end, we explore the application of maximum entropy classifiers to the task of automatically detecting action item utterances in audio recordings of multi-party meetings. Although available corpora for action items are not ideal, it is hoped that the feature analysis presented here will be of use to later work on other corpora.

2 Related work

Multi-party meetings have attracted a significant amount of recent research attention. The creation of the ICSI corpus (Janin et al., 2003), comprised of 72 hours of meeting recordings with an average of 6 speakers per meeting, with associated transcripts, has spurred further annotations for various types of information, including dialog acts (Shriberg et al., 2004), topic hierarchies and action items (Gruenstein et al., 2005), and "hot spots" (Wrede and Shriberg, 2003).

The classification of individual utterances based on their role in the dialog, i.e. as opposed to their semantic payload, has a long history, especially in the context of dialog act (DA) classification.


Research on DA classification initially focused on two-party conversational speech (Mast et al., 1996; Stolcke et al., 1998; Shriberg et al., 1998) and, more recently, has extended to multi-party audio recordings like the ICSI corpus (Shriberg et al., 2004). Machine learning techniques such as graphical models (Ji and Bilmes, 2005), maximum entropy models (Ang et al., 2005), and hidden Markov models (Zimmermann et al., 2005) have been used to classify utterances from multi-party conversations.

Only more recently has work focused specifically on action items themselves been developed. SVMs have been successfully applied to the task of extracting action items from email messages (Bennett and Carbonell, 2005; Corston-Oliver et al., 2004). Bennett and Carbonell, in particular, distinguish the task of action item detection in email from the more well-studied task of text classification, noting the finer granularity of the action item task and the difference of semantics vs. intent. (Although recent work has begun to blur this latter division, e.g. Cohen et al. (2004).)

In the audio domain, annotations for action item utterances on several recorded meeting corpora, including the ICSI corpus, have recently become available (Gruenstein et al., 2005), enabling work on this topic.

3 Data

We use action item annotations produced by Gruenstein et al. (2005). This corpus provides topic hierarchy and action item annotations for the ICSI meeting corpus as well as other corpora of meetings; due to the ready availability of other types of annotations for the ICSI corpus, we focus solely on the annotations for these meetings. Figure 1 gives an example of the annotations.

The corpus covers 54 ICSI meetings annotated by two human annotators, and several other meetings annotated by one annotator. Of the 54 meetings with dual annotations, 6 contain no action items. For this study we consider only those meetings which contain action items and which are annotated by both annotators.

As the annotations were produced by a small number of untrained annotators, an immediate question is the degree of consistency and reliability. Inter-annotator agreement is typically measured by the kappa statistic (Carletta, 1996), defined as:

\kappa = \frac{P(O) - P(E)}{1 - P(E)}

where P(O) is the probability of the observed agreement, and P(E) the probability of the "expected agreement" (i.e., under the assumption the two sets of annotations are independent). The kappa statistic ranges from −1 to 1, indicating perfect disagreement and perfect agreement, respectively.

Figure 2: Distribution of κ (inter-annotator agreement) across the 54 ICSI meetings tagged by two annotators. Of the two meetings with κ = 1.0, one has only two action items and the other only four.

Overall inter-annotator agreement as measured by κ on the action item corpus is poor, as noted in Purver et al. (2006), with an overall κ of 0.364 and values for individual meetings ranging from 1.0 to less than zero. Figure 2 shows the distribution of κ across all 54 annotated ICSI meetings.
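As a concrete illustration of this measure, the following sketch (ours, not part of the original study; the example labels are hypothetical) computes κ for two annotators' per-utterance binary labels directly from the definition above.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two parallel sequences of labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # P(O): observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): expected agreement if the two annotators labeled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-utterance labels (1 = action item) from two annotators.
a1 = [0, 0, 1, 1, 0, 0, 0, 1]
a2 = [0, 0, 1, 0, 0, 0, 1, 1]
print(round(cohens_kappa(a1, a2), 3))  # -> 0.467
```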

To reduce the effect of poor inter-annotator agreement, we focus on the top 15 meetings as ranked by κ; the minimum κ in this set is 0.435. Although this reduces the total amount of data available, our intention is that this subset of the most consistent annotations will form a higher-quality corpus.

While the corpus classifies related action item utterances into action item "groups," in this study we wish to treat the annotations as simply binary attributes. Visual analysis of annotations for several meetings outside the set of chosen 15 suggests that the union of the two sets of annotations yields the most consistent resulting annotation; thus, for this study, we consider an utterance to be an action item if at least one of the annotators marked it as such.
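A minimal sketch of this union labeling rule (our illustration; variable names are hypothetical):

```python
def union_labels(labels_a, labels_b):
    """An utterance is a positive example if either annotator marked it."""
    return [int(a or b) for a, b in zip(labels_a, labels_b)]

print(union_labels([0, 1, 0, 1], [0, 0, 1, 1]))  # -> [0, 1, 1, 1]
```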

The 15-meeting subset contains 24,250 utterances total; under the union strategy above, 590 of these are action item utterances. Figure 3 shows the number of action item utterances and the number of total utterances in the 15 selected meetings.

A1  A2
X   X    So that will be sort of the assignment for next week, is to—
X   X    to—for slides and whatever net you picked and what it can do and—and how far you've gotten. Pppt!
X   -    Well, I'd like to also,
X   X    though, uh, ha- have a first cut at what the
X   X    belief-net looks like.
-   X    Even if it's really crude.
-   -    OK? So, you know,
-   -    here a- here are—
-   X    So we're supposed to @@ about features and whatnot, and—

Figure 1: Example transcript and action item annotations (marked "X") from annotators A1 and A2. "@@" signifies an unintelligible word. This transcript is from an ICSI meeting recording and has κ = 0.373, ranking it 16th out of 54 meetings in annotator agreement.

Figure 3: Number of total and action item utterances across the 15 selected meetings. There are 24,250 utterances total, 590 of which (2.4%) are action item utterances.

One noteworthy feature of the ICSI corpus underlying the action item annotations is the "digit reading task," in which the participants of meetings take turns reading aloud strings of digits. This task was designed to provide a constrained-vocabulary training set for speech recognition developers interested in multi-party speech. In this study we did not remove these sections; the net effect is that some portions of the data consist of these fairly atypical utterances.

4 Experimental methodology

We formulate the action item detection task as one of binary classification of utterances. We apply a maximum entropy (maxent) model (Berger et al., 1996) to this task.

Maxent models seek to maximize the conditional probability of a class c given the observations X using the exponential form

P(c|X) = \frac{1}{Z(X)} \exp\left[ \sum_i \lambda_{i,c} f_{i,c}(X) \right]

where f_{i,c}(X) is the i-th feature of the data X in class c, λ_{i,c} is the corresponding weight, and Z(X) is a normalization term. Maxent models choose the weights λ_{i,c} so as to maximize the entropy of the induced distribution while remaining consistent with the data and labels; the intuition is that such a distribution makes the fewest assumptions about the underlying data.

Our maxent model is regularized by a quadratic prior and uses quasi-Newton parameter optimization. Due to the limited amount of training data (see Section 3) and to avoid overfitting, we employ 10-fold cross validation in each experiment.
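As a rough illustration of this setup, the sketch below uses scikit-learn's logistic regression, which for binary classification is equivalent to a maxent model with an L2 (quadratic) prior trained by a quasi-Newton (L-BFGS) optimizer. This is a stand-in for the authors' own implementation; the feature matrix and labels are random placeholders rather than the corpus features, so the printed numbers are meaningless.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

# Placeholder data: one row of (range-normalized) features per utterance,
# binary labels with heavy class imbalance, as in the action item corpus.
rng = np.random.default_rng(0)
X = rng.random((1000, 50))
y = (rng.random(1000) < 0.024).astype(int)

# A quadratic (Gaussian) prior corresponds to L2 regularization, and
# lbfgs is a quasi-Newton optimizer, mirroring the description above.
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)

# 10-fold cross-validation: every utterance is scored by a model that
# never saw it during training.
pred = cross_val_predict(clf, X, y, cv=10)
p, r, f, _ = precision_recall_fscore_support(
    y, pred, average="binary", zero_division=0)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```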

To evaluate system performance, we calculate the F measure (F) of precision (P) and recall (R), defined as:

P = \frac{|A \cap C|}{|A|}, \quad R = \frac{|A \cap C|}{|C|}, \quad F = \frac{2PR}{P + R}

where A is the set of utterances marked as action items by the system, and C is the set of (all) correct action item utterances.


The use of precision and recall is motivated by the fact that the large imbalance between positive and negative examples in the corpus (Section 3) means that simpler metrics like accuracy are insufficient—a system that simply classifies every utterance as negative will achieve an accuracy of 97.5%, which clearly is not a good reflection of desired behavior. Recall and F measure for such a system, however, will be zero.

Likewise, a system that flips a coin weighted in proportion to the number of positive examples in the entire corpus will have an accuracy of 95.25%, but will only achieve P = R = F = 2.4%.
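The baseline figures quoted above follow from the corpus counts in Section 3; the short calculation below (ours) reproduces them.

```python
total, positive = 24250, 590
p_pos = positive / total  # ~0.024

# All-negative baseline: correct on every negative, zero recall.
acc_all_neg = (total - positive) / total          # ~0.9757
f_all_neg = 0.0

# Weighted coin: predict positive with probability p_pos, independently.
# Expected accuracy = p_pos^2 + (1 - p_pos)^2.
acc_coin = p_pos ** 2 + (1 - p_pos) ** 2          # ~0.9525
# Expected TP ~ p_pos * positive, predicted positives ~ p_pos * total,
# so expected precision and recall (and hence F) all equal p_pos.
p_coin = r_coin = f_coin = p_pos                  # ~0.024

print(f"all-negative: acc={acc_all_neg:.4f}, F={f_all_neg:.3f}")
print(f"weighted coin: acc={acc_coin:.4f}, P=R=F={p_coin:.3f}")
```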

5 Features

As noted in Section 3, we treat the task of producing action item annotations as a binary classification task. To this end, we consider the following sets of features. (Note that all real-valued features were range-normalized so as to lie in [0, 1] and that no binning was employed.)
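A minimal sketch of the range normalization step (ours, not the authors' code):

```python
import numpy as np

def range_normalize(column):
    """Map a real-valued feature column onto [0, 1] via min-max scaling."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:                      # constant feature: map to 0
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

print(range_normalize([2.0, 3.5, 5.0]))  # -> [0.  0.5 1. ]
```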

5.1 Immediate lexical features

We extract word unigram and bigram features from the transcript for each utterance. We normalize for case and for certain contractions; for example, "I'll" is transformed into "I will".

Note that these are oracle features, as the transcripts are human-produced and not the product of automatic speech recognizer (ASR) system output.
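A sketch of how such unigram and bigram features might be extracted (ours; the tokenization and contraction list are illustrative assumptions, and the "*"/"$" boundary symbols follow the convention used in Table 3):

```python
import re

# Illustrative contraction expansions; the paper's full list is not given.
CONTRACTIONS = {"i'll": "i will", "we'll": "we will", "let's": "let us"}

def lexical_features(utterance):
    """Case-normalized unigram and bigram features for one utterance."""
    tokens = re.findall(r"[a-z'@-]+|[.,!?]", utterance.lower())
    tokens = [w for tok in tokens for w in CONTRACTIONS.get(tok, tok).split()]
    # "*" marks the start of the utterance and "$" the end, as in Table 3.
    padded = ["*"] + tokens + ["$"]
    unigrams = {f"uni:{w}" for w in tokens}
    bigrams = {f"bi:{a} {b}" for a, b in zip(padded, padded[1:])}
    return unigrams | bigrams

print(sorted(lexical_features("I'll email the paper.")))
```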

5.2 Contextual lexical features

We extract word unigram and bigram features from the transcript for the previous and next utterances across all speakers in the meeting.

5.3 Syntactic features

Under the hypothesis that action item utterances will exhibit particular syntactic patterns, we use a conditional Markov model part-of-speech (POS) tagger (Toutanova and Manning, 2000) trained on the Switchboard corpus (Godfrey et al., 1992) to tag utterance words for part of speech. We use the following POS features (a sketch of the corresponding feature extraction follows the list):

• Presence of UH tag, denoting the presence of an "interjection" (including filled pauses, unfilled pauses, and discourse markers).

• Presence of MD tag, denoting the presence of a modal verb.

• Number of NN* tags, denoting the number of nouns.

• Number of VB* tags, denoting the number of verbs.

• Presence of VBD tag, denoting the presence of a past-tense verb.
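A sketch of the feature extraction over a POS-tagged utterance (ours; the example tag sequence is hypothetical):

```python
def pos_features(tags):
    """Derive the five POS features from a list of Penn Treebank tags."""
    return {
        "has_UH": int("UH" in tags),                      # interjection present
        "has_MD": int("MD" in tags),                      # modal verb present
        "num_NN": sum(t.startswith("NN") for t in tags),  # noun count
        "num_VB": sum(t.startswith("VB") for t in tags),  # verb count
        "has_VBD": int("VBD" in tags),                    # past-tense verb present
    }

# Hypothetical tagging of "I will email the paper".
print(pos_features(["PRP", "MD", "VB", "DT", "NN"]))
```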

5.4 Prosodic features

Under the hypothesis that action item utterances will exhibit particular prosodic behavior—for example, that they are emphasized, or are pitched a certain way—we performed pitch extraction using an auto-correlation method within the sound analysis package Praat (Boersma and Weenink, 2005). From the meeting audio files we extract the following prosodic features on a per-utterance basis (pitch measures are in Hz, intensity in energy, and normalization in all cases is z-normalization); a sketch of the per-utterance summarization follows the list:

• Pitch and intensity range, minimum, andmaximum.

• Pitch and intensity mean.

• Pitch and intensity median (0.5 quantile).

• Pitch and intensity standard deviation.

• Pitch slope, processed to eliminate halving/doubling.

• Number of voiced frames.

• Duration-normalized pitch and intensity ranges and voiced frame count.

• Speaker-normalized pitch and intensity means.
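A sketch of how per-utterance summary statistics of this kind can be computed from a frame-level pitch track (ours; the frame values are hypothetical, unvoiced frames are assumed to be encoded as 0 Hz, and the paper's halving/doubling correction is not reproduced):

```python
import numpy as np

def pitch_summary(pitch_hz, duration_s, speaker_mean, speaker_std):
    """Per-utterance pitch statistics; unvoiced frames encoded as 0 Hz."""
    voiced = pitch_hz[pitch_hz > 0]
    if voiced.size == 0:
        return None
    slope = np.polyfit(np.arange(voiced.size), voiced, 1)[0]  # crude pitch slope
    return {
        "min": voiced.min(), "max": voiced.max(),
        "range": voiced.max() - voiced.min(),
        "mean": voiced.mean(), "median": np.median(voiced),
        "std": voiced.std(), "slope": slope,
        "voiced_frames": int(voiced.size),
        "range_per_s": (voiced.max() - voiced.min()) / duration_s,
        "voiced_frames_per_s": voiced.size / duration_s,
        # Speaker z-normalization of the utterance mean.
        "mean_speaker_norm": (voiced.mean() - speaker_mean) / speaker_std,
    }

frames = np.array([0, 110, 118, 125, 0, 130, 128])  # hypothetical 10 ms frames
print(pitch_summary(frames, duration_s=0.07, speaker_mean=120.0, speaker_std=15.0))
```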

5.5 Temporal features

Under the hypothesis that the length of an utterance or its location within the meeting as a whole will determine its likelihood of being an action item—for example, shorter statements near the end of the meeting might be more likely to be action items—we extract the duration of each utterance and the time from its occurrence until the end of the meeting. (Note that the use of this feature precludes operating in an online setting, where the end of the meeting may not be known in advance.)
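These two features reduce to simple timestamp arithmetic; a sketch (ours, with hypothetical field names):

```python
def temporal_features(utt_start_s, utt_end_s, meeting_end_s):
    """Utterance duration and time remaining until the end of the meeting."""
    return {
        "duration": utt_end_s - utt_start_s,
        "time_to_meeting_end": meeting_end_s - utt_end_s,  # not available online
    }

print(temporal_features(1803.2, 1806.9, 3600.0))
```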

5.6 General semantic features

Under the hypothesis that action item utterances will frequently involve temporal expressions—e.g. "Let's have the paper written by next Tuesday"—we use Identifinder (Bikel et al., 1997) to mark temporal expressions ("TIMEX" tags) in utterance transcripts, and create a binary feature denoting the existence of a temporal expression in each utterance.

Note that as Identifinder was trained on broadcast news corpora, applying it to the very different domain of multi-party meeting transcripts may not result in optimal behavior.

5.7 Dialog-specific semantic features

Under the hypothesis that action item utterances may be closely correlated with specific dialog act tags, we use the dialog act annotations from the ICSI Meeting Recorder Dialog Act Corpus (Shriberg et al., 2004). As these DA annotations do not correspond one-to-one with utterances in the ICSI corpus, we align them in the most liberal way possible, i.e., if at least one word in an utterance is annotated for a particular DA, we mark the entirety of that utterance as exhibiting that DA.
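A sketch of this liberal alignment rule (ours; the word-index representation and tag names are hypothetical, not the MRDA format):

```python
def align_dialog_acts(utterance_word_ids, da_annotations):
    """Mark an utterance with every DA tag covering at least one of its words.

    utterance_word_ids: set of word indices belonging to the utterance.
    da_annotations: list of (word_ids, tag) pairs from the DA corpus.
    """
    return {tag for word_ids, tag in da_annotations
            if utterance_word_ids & word_ids}

utt = {10, 11, 12, 13}
das = [({8, 9, 10}, "statement"), ({11, 12, 13, 14}, "commitment"),
       ({20, 21}, "question")]
print(sorted(align_dialog_acts(utt, das)))  # -> ['commitment', 'statement']
```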

We consider both fine-grained and coarse-grained dialog acts.1 The former yields 56 features, indicating occurrence of DA tags such as "appreciation," "rhetorical question," and "task management"; the latter consists of only 7 classes—"disruption," "backchannel," "filler," "statement," "question," "unlabeled," and "unknown."

6 Results

The final performance of the maxent model across different feature sets is given in Table 1. F measure scores range from 13.81% to 31.92%. Figure 4 shows the interpolated precision-recall curves for several of these feature sets; these graphs display the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.

Although ideally all combinations of features would be evaluated separately, the large number of features precludes this strategy. The combination of features explored here was chosen so as to start from simpler features and successively add more complex ones. We start with transcript features that are immediate and context-independent ("unigram", "bigram", "TIMEX"); then add transcript features that require context ("temporal", "context"), then non-transcript (i.e. audio signal) features ("prosodic"), and finally add features that require both the transcript and the audio signal ("DA").

1 We use the map 01 grouping defined in the MRDA corpus to collapse the tags.

Figure 4: Interpolated precision-recall curve for several (cumulative) feature sets (unigram; bigram; temporal; context+prosodic; fine-grained DAs). This graph suggests the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.

In total, nine combinations of features were considered. In every case except that of syntactic and coarse-grained dialog act features, the additional features improved system performance and these features were used in succeeding experiments. Syntactic and coarse-grained DA features resulted in a drop in performance and were discarded from succeeding systems.

7 Analysis

The unigram and bigram features provide significant discriminative power. Tables 2 and 3 give the top features, as determined by weight, for the models trained only on these features. It is clear from Table 3 that the detailed end-of-utterance punctuation in the human-generated transcripts provides valuable discriminative power.

The performance gain from adding TIMEX tagging features is small and likely not statistically significant. Post-hoc analysis of the TIMEX tagging (Section 5.6) suggests that Identifinder tagging accuracy is quite plausible in general, but exhibits an unfortunate tendency to mark the digit-reading (see Section 3) portion of the meetings as temporal expressions. It is plausible that removing these utterances from the meetings would allow this feature a higher accuracy.

Based on the low feature weight assigned, utterance length appears to provide no significant value to the model. However, the time until the meeting is over ranks as the highest-weighted feature in the unigram+bigram+TIMEX+temporal feature set. This feature is thus responsible for the 39.25% boost in F measure in row 3 of Table 1.


features                                                      number    F      % imp.
unigram                                                         6844   13.81
unigram+bigram                                                 61281   16.72   21.07
unigram+bigram+TIMEX                                           61284   16.84    0.72
unigram+bigram+TIMEX+temporal                                  61286   23.45   39.25
* unigram+bigram+TIMEX+temporal+syntactic                      61291   21.94   -6.44
unigram+bigram+TIMEX+temporal+context                         183833   25.62    9.25
unigram+bigram+TIMEX+temporal+context+prosodic                183871   27.44    7.10
* unigram+bigram+TIMEX+temporal+context+prosodic+coarse DAs   183878   26.47   -3.53
unigram+bigram+TIMEX+temporal+context+prosodic+fine DAs       183927   31.92   16.33

Table 1: Performance of the maxent classifier as measured by F measure, the relative improvement from the preceding feature set, and the number of features, across all feature sets tried. Lines marked with an asterisk (*) denote the addition of features which do not improve performance; these are omitted from succeeding systems.

feature       +/-   λ
"pull"         +    2.2100
"email"        +    1.7883
"needs"        +    1.7212
"added"        +    1.6613
"mm-hmm"       -    1.5937
"present"      +    1.5740
"nine"         -    1.5019
"!"            -    1.5001
"five"         -    1.4944
"together"     +    1.4882

Table 2: Features, evidence type (positive denotes action item), and weight for the top ten features in the unigram-only model. "Nine" and "five" are common words in the digit-reading task (see Section 3).

feature       +/-   λ
"- $"          -    1.4308
"i will"       +    1.4128
", $"          -    1.3115
"uh $"         -    1.2752
"w- $"         -    1.2419
". $"          -    1.2247
"email"        +    1.2062
"six $"        -    1.1874
"* in"         -    1.1833
"so $"         -    1.1819

Table 3: Features, evidence type and weight for the top ten features in the unigram+bigram model. The symbol * denotes the beginning of an utterance and $ the end. All of the top ten features are bigrams except for the unigram "email".

feature                        +/-   λ
mean intensity (norm.)          -    1.4288
mean pitch (norm.)              -    1.0661
intensity range                 +    1.0510
"i will"                        +    0.8657
"email"                         +    0.8113
reformulate/summarize (DA)      +    0.7946
"just go" (next)                +    0.7190
"i will" (prev.)                +    0.7074
"the paper"                     +    0.6788
understanding check (DA)        +    0.6547

Table 4: Features, evidence type and weight for the top ten features on the best-performing model. Bigrams labeled "prev." and "next" correspond to the lexemes from previous and next utterances, respectively. Prosodic features labeled as "norm." have been normalized on a per-speaker basis.


The addition of part-of-speech tags actually decreases system performance. It is unclear why this is the case. It may be that the unigram and bigram features already adequately capture any distinctions these features make, or simply that these features are generally not useful for distinguishing action items.

Contextual features, on the other hand, improve system performance significantly. A post-hoc analysis of the action item annotations makes clear why: action items are often split across multiple utterances (e.g. as in Figure 1), only a portion of which contain lexical cues sufficient to distinguish them as such. Contextual features thus allow utterances immediately surrounding these "obvious" action items to be tagged as well.


Prosodic features yield a 7.10% increase in F measure, and analysis shows that speaker-normalized intensity and pitch, and the range in intensity of an utterance, are valuable discriminative features. The subsequent addition of coarse-grained dialog act tags does not further improve system performance. It is likely this is due to reasons similar to those for POS tags—either the categories are insufficient to distinguish action item utterances, or whatever usefulness they provide is subsumed by other features.

Table 4 shows the feature weights for the top-ranked features on the best-scoring system. The addition of the fine-grained DA tags results in a significant increase in performance. The F measure of this best feature set is 31.92%.

8 Conclusions

We have shown that several classes of features are useful for the task of action item annotation from multi-party meeting corpora. Simple lexical features, their contextual versions, the time until the end of the meeting, prosodic features, and fine-grained dialog acts each contribute significant increases in system performance.

While the raw system performance numbers of Table 1 are low relative to other, better-studied tasks on other, more mature corpora, we believe the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.

The Gruenstein et al. (2005) corpus provides a valuable and necessary resource for research in this area, but several factors raise the question of annotation quality. The low κ scores in Section 3 are indicative of annotation problems. Post-hoc error analysis yields many examples of utterances which are somewhat difficult to imagine as possible, never mind desirable, to tag. The fact that the extremely useful oracular information present in the fine-grained DA annotation does not raise performance to the high levels that one might expect further suggests that the annotations are not ideal—or, at the least, that they are inconsistent with the DA annotations.2

This analysis is consistent with the findings of Purver et al. (2006), who achieve an F measure of less than 25% when applying SVMs to the classification task on the same corpus, and who motivate the development of a new corpus of action item annotations.

2 Which is not to say they are devoid of significant value—training and testing our best system on the corpus with the 590 positive classifications randomly shuffled across all utterances yields an F measure of only 4.82.

9 Future work

In Section 6 we showed that contextual lexical features are useful for the task of action item detection, at least in the fairly limited manner employed in our implementation, which simply looks at the immediate previous and immediate next utterances. It seems likely that applying a sequence model such as an HMM or conditional random field (CRF) will act as a generalization of this feature and may further improve performance.

Addition of features such as speaker change and "hot spots" (Wrede and Shriberg, 2003) may also aid classification. Conversely, it is possible that feature selection techniques may improve performance by helping to eliminate poor-quality features. In this work we have followed an "everything but the kitchen sink" approach, in part because we were curious about which features would prove useful. The effect of adding POS and coarse-grained DA features illustrates that this is not necessarily the ideal strategy in terms of ultimate system performance.

In general, the features evaluated in this work are an indiscriminate mix of human- and automatically-generated features; of the human-generated features, some are plausible to generate automatically, at some loss of quality (e.g. transcripts), while others are unlikely to be automatically generated in the foreseeable future (e.g. fine-grained dialog acts). Future work may focus on the effects that automatic generation of the former has on overall system performance (although this may require higher-quality annotations to be useful). For example, the detailed end-of-utterance punctuation present in the human transcripts provides valuable discriminative power (Table 3), but current ASR systems are not likely to be able to provide this level of detail. Switching to ASR output will have a negative effect on performance.

One final issue is that of utterance segmentation. The scheme used in the ICSI meeting corpus does not necessarily correspond to the ideal segmentation for other tasks. The action item annotations were performed on these segmentations, and in this study we did not attempt resegmentation, but in the future it may prove valuable to collapse, for example, successive uninterrupted utterances from the same speaker into a single utterance.

In conclusion, while overall system performance does not approach levels typical of better-studied classification tasks such as named-entity recognition, we believe that this is largely a product of the current action item annotation quality. We believe that the feature analysis presented here is useful, for this task and for other related tasks, and that, provided with a set of more consistent action item annotations, the current system can be used as is to achieve better performance.

Acknowledgments

The authors wish to thank Dan Jurafsky, Chris Manning, Stanley Peters, Matthew Purver, and several anonymous reviewers for valuable advice and comments.

References

Jeremy Ang, Yang Liu, and Elizabeth Shriberg. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In Proceedings of the ICASSP.

Paul N. Bennett and Jaime Carbonell. 2005. Detecting action-items in e-mail. In Proceedings of SIGIR.

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Conference on Applied NLP.

Paul Boersma and David Weenink. 2005. Praat: doing phonetics by computer v4.4.12 (computer program).

J. Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to classify email into "speech acts". In Proceedings of EMNLP.

Simon Corston-Oliver, Eric Ringger, Michael Gamon, and Richard Campbell. 2004. Task-focused summarization of email. In Text Summarization Branches Out: Proceedings of the ACL Workshop.

J. Godfrey, E. Holliman, and J. McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP.

Alexander Gruenstein, John Niekrasz, and Matthew Purver. 2005. Meeting structure annotation: Data and tools. In Proceedings of the 6th SIGDIAL Workshop on Discourse and Dialogue.

Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI meeting corpus. In Proceedings of the ICASSP.

Gang Ji and Jeff Bilmes. 2005. Dialog act tagging using graphical models. In Proceedings of the ICASSP.

Marion Mast, R. Kompe, S. Harbeck, A. Kießling, H. Niemann, E. Noth, E.G. Schukat-Talamazzini, and V. Warnke. 1996. Dialog act classification with the help of prosody. In Proceedings of the ICSLP.

Matthew Purver, Patrick Ehlen, and John Niekrasz. 2006. Detecting action items in multi-party meetings: Annotation and initial experiments. In Proceedings of the 3rd Joint Workshop on MLMI.

Elizabeth Shriberg, Rebecca Bates, Andreas Stolcke, Paul Taylor, Daniel Jurafsky, Klaus Ries, Noah Coccaro, Rachel Martin, Marie Meteer, and Carol Van Ess-Dykema. 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3–4):439–487.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGDIAL Workshop on Discourse and Dialogue.

Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Noah Coccaro, Daniel Jurafsky, Rachel Martin, Marie Meteer, Klaus Ries, Paul Taylor, and Carol Van Ess-Dykema. 1998. Dialog act modeling for conversational speech. In Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of EMNLP.

Britta Wrede and Elizabeth Shriberg. 2003. Spotting "hot spots" in meetings: Human judgments and prosodic cues. In Proceedings of the European Conference on Speech Communication and Technology.

Matthias Zimmermann, Yang Liu, Elizabeth Shriberg, and Andreas Stolcke. 2005. Toward joint segmentation and classification of dialog acts in multiparty meetings. In Proceedings of the 2nd Joint Workshop on MLMI.