Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19,...

140
Text as Data Justin Grimmer Associate Professor Department of Political Science Stanford University September 30th, 2014 Justin Grimmer (Stanford University) Text as Data September 30th, 2014 1 / 43

Transcript of Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19,...

Page 1: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Text as Data

Justin Grimmer

Associate ProfessorDepartment of Political Science

Stanford University

September 30th, 2014

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 1 / 43

Page 2: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Our Class Approach

1) Choose task

2) Decide on objective function

f (θ,X )

3) Optimize object function over parameters θ

Part of model creating X

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 2 / 43

Page 3: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Our Class Approach

1) Choose task

2) Decide on objective function

f (θ,X )

3) Optimize object function over parameters θ

Part of model creating X

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 2 / 43

Page 4: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Goal for Today: Document-Term Matrices

X =

1 0 0 . . . 30 2 1 . . . 0...

......

. . ....

0 0 0 . . . 5

X = N × K matrix

- N = Number of documents

- K = Number of features

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 3 / 43

Page 5: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Goal for Today: Document-Term Matrices

X =

1 0 0 . . . 30 2 1 . . . 0...

......

. . ....

0 0 0 . . . 5

X = N × K matrix

- N = Number of documents

- K = Number of features

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 3 / 43

Page 6: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Goal for Today: Document-Term Matrices

X =

1 0 0 . . . 30 2 1 . . . 0...

......

. . ....

0 0 0 . . . 5

X = N × K matrix

- N = Number of documents

- K = Number of features

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 3 / 43

Page 7: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Learning From Text

A plan for using texts

1) Acquiring text data

2) Regular expression search in text

3) Creating document-term matrices (term-document matrices)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 4 / 43

Page 8: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Finding Text Data

Many places to find text (which we we’ll review)

Goal: plan text (.txt) file. (UTF-8, ASCII)(May also want to create an XML or JSON file)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 5 / 43

Page 9: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Finding Text Data

Many places to find text (which we we’ll review)Goal: plan text (.txt) file. (UTF-8, ASCII)

(May also want to create an XML or JSON file)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 5 / 43

Page 10: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Finding Text Data

Many places to find text (which we we’ll review)Goal: plan text (.txt) file. (UTF-8, ASCII)(May also want to create an XML or JSON file)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 5 / 43

Page 11: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Plain Text

September 19, 2010 Sunday 10:46 AM EST

REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

CONSTITUTION DAY

LENGTH: 320 words

CLEMMONS, N.C., Sept. 17 -- Rep. Virginia Foxx, R-N.C.

(5th CD), issued the following press release:

Congresswoman Virginia Foxx is celebrating Constitution Day

today by visiting several schools in her district to talk

with students about the Constitution and the individuals

who helped create our charter document. She will visit

Davie County High School, Forbush High School in Yadkin

County and Piney Creek School in Alleghany County.

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 6 / 43

Page 12: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

XML

<DOC><DOCNO>101-levin-mi-1-19901027< /DOCNO><TEXT>Mr. LEVIN. Mr. President, today the House passed and sent

to the President the Great Lakes Critical Programs Act.

... Mr. President, I commend and thank Ms. Bean for her

exceptional efforts on the Great Lakes Critical Programs

Act

< /TEXT>< /DOC>

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 6 / 43

Page 13: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

JSON

{"id":"tag:search.twitter.com,2005:287886850381713411","objectType":"activity"...displayName":"Linda Bowersox",

"postedTime":"2010-03-10T05:16:14.000Z"...

"body":"@JeffFlake thank you for standing firm and voting

NO on the #FiscalCliff (via #PJNET)","object"...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 7 / 43

Page 14: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

http://dfr.jstor.org

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 8 / 43

Page 15: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 16: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 17: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 18: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 19: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 20: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Prepackaged Data Sources

Lexis Nexis (and other data base sources)

1) Batch search and download

2) Do not try to scrape Lexis Nexis(!!!!)

Application Programming Interface (APIs)

- Facilitate interaction with applications (like Twitter)

- Download data (often in JSON format) Twitter, Data.gov, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 9 / 43

Page 21: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 22: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro- Abbyy FineReader (Batch processing)- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 23: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro- Abbyy FineReader (Batch processing)- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 24: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro

- Abbyy FineReader (Batch processing)- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 25: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro- Abbyy FineReader (Batch processing)

- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 26: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro- Abbyy FineReader (Batch processing)- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 27: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Books, Archives, and Other Non-Digital Material

1) Create images of texts

2) Optical Character Recognition

- Built in Adobe Pro- Abbyy FineReader (Batch processing)- Tesseract (Google, command line tool)

3) Also use, e-book formats...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 10 / 43

Page 28: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Automated Web Collection

This week’s assignment: “scrape” a presidential debate:

- http://www.crummy.com/software/BeautifulSoup/

- Parse paragraphs, label speakers

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 11 / 43

Page 29: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Automated Web Collection

This week’s assignment: “scrape” a presidential debate:

- http://www.crummy.com/software/BeautifulSoup/

- Parse paragraphs, label speakers

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 11 / 43

Page 30: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Automated Web Collection

This week’s assignment: “scrape” a presidential debate:

- http://www.crummy.com/software/BeautifulSoup/

- Parse paragraphs, label speakers

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 11 / 43

Page 31: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Automated Web Collection

This week’s assignment: “scrape” a presidential debate:

- http://www.crummy.com/software/BeautifulSoup/

- Parse paragraphs, label speakers

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 11 / 43

Page 32: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Automated Web Collection

This week’s assignment: “scrape” a presidential debate:

- http://www.crummy.com/software/BeautifulSoup/

- Parse paragraphs, label speakers

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 11 / 43

Page 33: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Acquiring Data from Web: Distributed Human Computing

Amazon.com’s Mechanical Turk

- Marketplace for Human Itensive Tasks

- Requester (you): create HITs, offer $ (about $0.05 per task)

- Workers (bored + broke people): complete task

- Requester: evaluate and pay

Odesk, elance, ...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 12 / 43

Page 34: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

You have text, now what?

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 13 / 43

Page 35: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions (from Jurafsky Slides)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 14 / 43

Page 36: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Systematic Searches

A language for searching texts:

- Count mentions of a person

- Calculate amount of money discussed

- Prepare texts for analysis: Identify where to “split” a document

- ...

Provide a quick introduction here, with some examples

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 15 / 43

Page 37: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Disjunctions

RE Match Example Patterns Matched[mM]oney Money or money “Money”

[abc] ‘a’, ‘b’, or ‘c’ “Investing in Iran”“is dangerous business”

[1234567890] any digit “sitting on $7.5 billion dollars”“2005 and 2006, more than ”“$150 million dollars”

[\.] A period “ ‘Run!’, he screamed.”

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 16 / 43

Page 38: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Ranges

RE Match Example Patterns Matched[A-Z] an upper case letter “Rep. Anthony Weiner

(D-Brooklyn & Queens)”[a-z] a lower case letter “ACORN’s”[0-9] a single digit “(9th CD) ”

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 17 / 43

Page 39: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Negations

RE Match Example Patterns Matched[^A-Z] not an upper case letter “ACORN’s”[^Ss] neither ‘S’ nor ‘s’ “ACORN’s”[^\.] not a period “ ‘Run!’, he screamed.”

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 18 / 43

Page 40: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Optional Characters: ?, *, +

RE Match Example Patterns Matchedcolou?r Words with u 0 or 1 times “color” or

“colour ”oo*h! Words with o 0 or more times “oh!” or

“ooh!” or“oooh!”

o+h! Words with o 1 or more times “oh!” or“ooh!” or“oooooh!” or

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 19 / 43

Page 41: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Wild Cards .

RE Match Example Patterns Matchedbeg.n Any word with “beg” then “n” “begin” or

“began” or“begun” or“beggn” (Poor grammar!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 20 / 43

Page 42: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- Start of the line anchor ˆ, end of the line anchor $

RE Match Example Patterns Matched^[A-Z] Upper case start of line “Palo Alto”

“the town of Palo Alto”^[^A-Z] Not upper case start of line “the town of Palo Alto”

“Palo Alto”^. Start of line “Palo Alto”

“the town of Palo Alto”.$ Identify character that ends a line “Wait!”

“This is the end.”

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 21 / 43

Page 43: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

- “Or”| statements, Useful short hand

RE Match Example Patterns Matchedyours|mine Matches“yours” or “mine” “it’s either yours or mine”

\ d Any digit “1-Mississippi”\ D Any non-digit “1-Mississippi”

\ s Any whitespace character “1, 2”\ S Any non-whitespace character “1, 2”\ w Any alpha-numeric “1-Mississippi ”

\ W Any non-alpha numeric “1-Mississippi”

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 22 / 43

Page 44: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 45: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 46: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 47: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 48: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 49: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 50: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 51: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions, Some Basics (from Jurafsky Slides)

Quick Example to Illuminate Differences:A “simple” example: identify all instances of the.

- the

Misses capitalized examples

- [tT]he

Returns words that are too long (theocrat, theme )

- [ˆa-zA-Z][tT]he[ˆa-zA-Z]

Misses the first “the” in a sentence

- (ˆ | [ˆ a-zA-Z])[tT]he[ˆ a-zA-Z]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 23 / 43

Page 52: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

An Example: Searching for Tea Party LanguageGrimmer, Westwood, and Messing (2014): Criticism and credit

2005 2006 2007 2008 2009 2010

0.00

00.

005

0.01

00.

015

0.02

00.

025

Big Government

Year

Pro

port

ion

Pre

ss R

elea

ses

Men

tioni

ng

● ● ● ● ● ●

●●

●●

DemocratsRepublicans

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 24 / 43

Page 53: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

An Example: Searching for Tea Party LanguageGrimmer, Westwood, and Messing (2014): Criticism and credit

2005 2006 2007 2008 2009 2010

0.00

00.

005

0.01

00.

015

0.02

00.

025

Big Government

Year

Pro

port

ion

Pre

ss R

elea

ses

Men

tioni

ng

● ● ● ● ● ●

●●

●●

DemocratsRepublicans

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 24 / 43

Page 54: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

An Example: Searching for Tea Party LanguageGrimmer, Westwood, and Messing (2014): Criticism and credit

2005 2006 2007 2008 2009 2010

0.00

00.

005

0.01

00.

015

0.02

00.

025

Budget Deficit

Year

Pro

port

ion

Pre

ss R

elea

ses

Men

tioni

ng

● ●

DemocratsRepublicans

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 24 / 43

Page 55: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

An Example: Searching for Tea Party LanguageGrimmer, Westwood, and Messing (2014): Criticism and credit

Anti−spending Press Releases

Year

Pro

port

ion

of P

ress

Rel

ease

s

2005 2006 2007 2008 2009 2010

0.00

0.05

0.10

● ●● ● ●

●●●

● ●

●●

DemocratsRepublicans

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 24 / 43

Page 56: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

An Example: Searching for Tea Party LanguageGoodman, Grimmer, Parker, Zlotnik (2014): Criticism

Year

Pro

port

ion

Pre

ss R

elea

ses

2005 2006 2007 2008 2009 2010

0.00

0.05

0.10

0.15

0.20

0.25

●● ● ●

●●

● ●

● ●

Branding Rhetoric, Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 24 / 43

Page 57: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 58: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 59: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 60: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 61: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 62: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”

- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 63: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider

- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 64: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 65: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 66: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake

- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 67: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Regular Expressions on Steroids: Cheating DetectionSoftware

- WCopyFind:http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/

- What constitutes plagiarism?

- Edit distance:

- Heuristically: how many letters to change from a to b

- Sets many parameters:

- Number of differences between pair of “strings”- Length of character strings to consider- Number of matching strings to constitute match

- Useful:

- Media uptake- Joint Press Releases

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 25 / 43

Page 68: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Regular expressions and search are useful

We want to use statistics/algorithms to characterize textWe’ll put it in a document-term matrix

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 26 / 43

Page 69: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Regular expressions and search are usefulWe want to use statistics/algorithms to characterize text

We’ll put it in a document-term matrix

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 26 / 43

Page 70: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Regular expressions and search are usefulWe want to use statistics/algorithms to characterize textWe’ll put it in a document-term matrix

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 26 / 43

Page 71: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it useful

Lower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 72: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it usefulLower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 73: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it usefulLower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 74: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it usefulLower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 75: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it usefulLower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 76: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Document Term Matrices

Preprocessing Simplify text, make it usefulLower dimensionality

- For our purposes

Remember: characterize the Hay stack

- If you want to analyze a straw of hay, these methods are unlikely towork

- But even if you want to closely read texts, characterizing hay stackcan be useful

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 27 / 43

Page 77: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 78: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 79: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 80: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 81: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 82: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 83: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 84: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stems

Provide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 85: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing for Quantitative Text Analysis

One (of many) recipe for preprocessing: retain usefulinformation

1) Remove capitalization, punctuation

2) Discard Word Order (Bag of Words Assumption)

3) Discard stop words

4) Create Equivalence Class: Stem, Lemmatize, or synonym

5) Discard less useful features depends on application

6) Other reduction, specialization

Output: Count vector, each element counts occurrence of stemsProvide tools to preprocess via this recipe

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 28 / 43

Page 86: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing Texts

We’re going to use the Natural Language Toolkit (nltk) to workwith texts

- Built in functionality

- Ensures we can customize our feature spaces

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 29 / 43

Page 87: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Text Loaded into Python

Gettysburg Address

from BeautifulSoup import BeautifulSoup

from urllib import urlopen

import re, os

url =

urlopen(‘http://avalon.law.yale.edu/19th century/gettyb.asp’).read()

soup = BeautifulSoup(url)

text = soup.p.contents[0]

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 30 / 43

Page 88: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing Texts

Removing capitalization:

- Python : string.lower()

- R : tolower(‘string’)

Removing punctuation

- Python: re.sub(‘\W’, ‘ ’, string)

- R : gsub(‘\\W’, ‘ ’, string)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 31 / 43

Page 89: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing Texts

text 1 = text.lower()

text 2 = re.sub(‘\W’, ‘ ’, text 1)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 32 / 43

Page 90: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

The Bag of Words Assumption

Assumption: Discard Word OrderNow we are engaged in a great civil war, testing whether

that nation, or any nation

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 33 / 43

Page 91: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

The Bag of Words Assumption

Assumption: Discard Word Ordernow we are engaged in a great civil war testing whether

that nation or any nation

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 33 / 43

Page 92: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

The Bag of Words Assumption

Assumption: Discard Word Order

Unigrams

Unigram Counta 1any 1are 1civil 1engaged 1great 1in 1nation 2now 1or 1testing 1that 1war 1we 1whether 1

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 33 / 43

Page 93: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

The Bag of Words Assumption

Assumption: Discard Word Order

Bigrams

Bigram Countnow we 1we are 1are engaged 1engaged in 1in a 1a great 1great civil 1civil war 1war testing 1testing whether 1whether that 1that nation 1nation or 1or any 1any nation 1

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 33 / 43

Page 94: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

The Bag of Words Assumption

Assumption: Discard Word Order

Trigrams

Trigram Countnow we are 1we are engaged 1are engaged in 1engaged in a 1in a great 1a great civil 1great civil war 1civil war testing 1war testing whether 1whether that nation 1that nation or 1nation or any 1or any nation 1

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 33 / 43

Page 95: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

How Could This Possibly Work?

Speech is:

- Ironic

The Raiders make very good personnel decisions

- Subtle Negation (Source: Janyce Wiebe) :

They have not succeeded, and will never succeed, in

breaking the will of this valiant people

- Order Dependent (Source: Arthur Spirling):

Peace, no more war

War, no more peace

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 34 / 43

Page 96: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

How Could This Possibly Work?

Three answers

1) It might not: Validation is critical (task specific)

2) Central Tendency in Text: Words often imply what a text is aboutwar, civil, union or tone consecrate, dead, died, lives.

Likely to be used repeatedly: create a theme for an article

3) Human supervision: Inject human judgement (coders): helps methodsidentify subtle relationships between words and outcomes of interest

Dictionaries

Training Sets

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 35 / 43

Page 97: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Discarding Word Order in Python

from nltk import word tokenize

from nltk import bigrams

from nltk import trigrams

from nltk import ngrams

text 3 = word tokenize(text 2)

text 3 bi = bigrams(text 3)

text 3 tri = trigrams(text 3)

text 3 n = ngrams(text 3, 4)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 36 / 43

Page 98: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 99: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 100: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 101: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 102: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)

she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 103: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 104: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 105: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 106: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 107: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Stop Words

- Stop Words: English Language place holding words

the, it, if, a, able, at, be, because...

- Add “noise” to documents (without conveying much information)

- Discard stop words: focus on substantive words

Note of Caution: Monroe, Colaresi, and Quinn (2008)she, he, her, his

Many English language stop lists include gender pronouns

- Exercise caution when discarding stop words

- You may need to customize your stop word list abbreviations, titles,etc

To the Python code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 37 / 43

Page 108: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Creating an Equivalence Class of Words

Reduce dimensionality further

create equivalence class between words

- Words used to refer to same basic concept

family, families, familial→ famili

- Stemming/Lemmatizing algorithms: Many-to-one mapping fromwords to stem/lemma

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 38 / 43

Page 109: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Creating an Equivalence Class of Words

Reduce dimensionality further create equivalence class between words

- Words used to refer to same basic concept

family, families, familial→ famili

- Stemming/Lemmatizing algorithms: Many-to-one mapping fromwords to stem/lemma

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 38 / 43

Page 110: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Creating an Equivalence Class of Words

Reduce dimensionality further create equivalence class between words

- Words used to refer to same basic concept

family, families, familial→ famili

- Stemming/Lemmatizing algorithms: Many-to-one mapping fromwords to stem/lemma

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 38 / 43

Page 111: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Creating an Equivalence Class of Words

Reduce dimensionality further create equivalence class between words

- Words used to refer to same basic concept

family, families, familial→ famili

- Stemming/Lemmatizing algorithms: Many-to-one mapping fromwords to stem/lemma

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 38 / 43

Page 112: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Creating an Equivalence Class of Words

Reduce dimensionality further create equivalence class between words

- Words used to refer to same basic concept

family, families, familial→ famili

- Stemming/Lemmatizing algorithms: Many-to-one mapping fromwords to stem/lemma

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 38 / 43

Page 113: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 114: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 115: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 116: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 117: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 118: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 119: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 120: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classes

Python Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 121: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Comparing Stemming and Lemmatizing

Stemming algorithm:

- Simplistic algorithms

- Chop off end of word

- Porter stemmer, Lancaster stemmer, Snowball stemmer

Lemmatizing algorithm:

- Condition on part of speech (noun, verb, etc)

- Verify result is a word

Key comparison: equivalence classesPython Code!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 39 / 43

Page 122: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Four score and seven years ago our fathers brought forth on

this continent a new nation, conceived in liberty, and

dedicated to the proposition that all men are created

equal.

Step 1: Remove capitalization and punctuation

:

Step 2: Discard word order

:

Step 3: Remove stop words

:

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 123: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Four score and seven years ago our fathers brought forth on

this continent a new nation, conceived in liberty, and

dedicated to the proposition that all men are created

equal.

Step 1: Remove capitalization and punctuation:

Step 2: Discard word order

:

Step 3: Remove stop words

:

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 124: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Four score and seven years ago our fathers brought forth on

this continent a new nation, conceived in liberty, and

dedicated to the proposition that all men are created

equal.

Step 1: Remove capitalization and punctuation:four score and seven years ago our fathers brought forth on

this continent a new nation conceived in liberty and

dedicated to the proposition that all men are created equal

Step 2: Discard word order

:

Step 3: Remove stop words

:

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 125: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:four score and seven years ago our fathers brought forth on

this continent a new nation conceived in liberty and

dedicated to the proposition that all men are created equal

Step 2: Discard word order:

Step 3: Remove stop words

:

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 126: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:four score and seven years ago our fathers brought forth on

this continent a new nation conceived in liberty and

dedicated to the proposition that all men are created equal

Step 2: Discard word order:four, score, and, seven, years, ago, our, fathers, brought,

forth, on, this, continent, a, new, nation, conceived, in,

liberty, and, dedicated, to, the, proposition, that, all,

men, are, created, equal

Step 3: Remove stop words

:

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 127: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:Step 2: Discard word order:four, score, and, seven, years, ago, our, fathers, brought,

forth, on, this, continent, a, new, nation, conceived, in,

liberty, and, dedicated, to, the, proposition, that, all,

men, are, created, equal

Step 3: Remove stop words :

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 128: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:Step 2: Discard word order:four, score, and, seven, years, ago, our, fathers, brought,

forth, on, this, continent, a, new, nation, conceived, in,

liberty, and, dedicated, to, the, proposition, that, all,

men, are, created, equal

Step 3: Remove stop words :four, score, seven, years, ago, fathers, brought, forth,

continent, new, nation, conceived, liberty, dedicated,

proposition, men, created, equal

Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 129: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:Step 2: Discard word order:Step 3: Remove stop words :four, score, seven, years, ago, fathers, brought, forth,

continent, new, nation, conceived, liberty, dedicated,

proposition, men, created, equal

Step 4: Applying Stemming Algorithm

Step 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 130: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:Step 2: Discard word order:Step 3: Remove stop words :four, score, seven, years, ago, fathers, brought, forth,

continent, new, nation, conceived, liberty, dedicated,

proposition, men, created, equal

Step 4: Applying Stemming Algorithmfour, score, seven, year, ago, father, brought, forth,

contin, new, nation, conceiv, liberti, dedic, proposit,

men, creat, equal

Step 5: Create Count Vector (Python Code!)

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 131: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...Step 1: Remove capitalization and punctuation:Step 2: Discard word order:Step 3: Remove stop words :Step 4: Applying Stemming Algorithmfour, score, seven, year, ago, father, brought, forth,

contin, new, nation, conceiv, liberti, dedic, proposit,

men, creat, equal

Step 5: Create Count Vector (Python Code!)Stem Countago 1brought 1seven 1creat 1conceiv 1men 1father 1...

...Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 132: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

All together now...

Step 1: Remove capitalization and punctuation:Step 2: Discard word order:Step 3: Remove stop words :Step 4: Applying Stemming AlgorithmStep 5: Create Count Vector (Python Code!)

Stem Countago 1brought 1seven 1creat 1conceiv 1men 1father 1...

...

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 40 / 43

Page 133: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

This Can Actually Work!

Available March 2009: 304ppPb: 978-0-415-99701-0: $24.95www.routledge.com/politics

“The list of authors in The Future of Political Science is a 'who’s who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d’oeuvres. It hooked me thoroughly.”

—Peter Kingstone, University of Connecticut

“In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate.”

—Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

“King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read.”

—Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

The FuTure oF PoliTical Science100 Perspectivesedited by Gary King, harvard university, Kay Lehman Schlozman, Boston college and Norman H. Nie, Stanford university

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 41 / 43

Page 134: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60Hand-Coded Clusters 1.58 1.48 1.68Hand-Coding 2.06 1.88 2.24

Machine 2.24 2.08 2.40p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 135: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60

Hand-Coded Clusters 1.58 1.48 1.68Hand-Coding 2.06 1.88 2.24

Machine 2.24 2.08 2.40p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 136: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60Hand-Coded Clusters 1.58 1.48 1.68

Hand-Coding 2.06 1.88 2.24Machine 2.24 2.08 2.40

p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 137: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60Hand-Coded Clusters 1.58 1.48 1.68Hand-Coding 2.06 1.88 2.24

Machine 2.24 2.08 2.40p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 138: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60Hand-Coded Clusters 1.58 1.48 1.68Hand-Coding 2.06 1.88 2.24

Machine 2.24 2.08 2.40

p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 139: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Evaluators’ Rate Machine Choices Better Than Their Own(Grimmer and King)

Generate pairs of similar documents: Humans vs Machines

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from Overall Mean Evaluator 1 Evaluator 2

Random Selection 1.38 1.16 1.60Hand-Coded Clusters 1.58 1.48 1.68Hand-Coding 2.06 1.88 2.24

Machine 2.24 2.08 2.40p.s. The hand-coders did the evaluation!

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 42 / 43

Page 140: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc3.pdf · Plain Text September 19, 2010 Sunday 10:46 AM EST REP. FOXX VISITS LOCAL SCHOOLS, TALKS WITH STUDENTS ON

Preprocessing Text

One (of many) potential recipes task specific recipes

Assess value Particular task

Thursday: Dictionary methods

Justin Grimmer (Stanford University) Text as Data September 30th, 2014 43 / 43