Text as Data - Stanford jgrimmer/Text14/tc4.pdf Justin Grimmer (Stanford University) Text as Data...

download Text as Data - Stanford jgrimmer/Text14/tc4.pdf Justin Grimmer (Stanford University) Text as Data October

of 185

  • date post

    18-Feb-2021
  • Category

    Documents

  • view

    1
  • download

    0

Embed Size (px)

Transcript of Text as Data - Stanford jgrimmer/Text14/tc4.pdf Justin Grimmer (Stanford University) Text as Data...

  • Text as Data

    Justin Grimmer

    Associate Professor Department of Political Science

    Stanford University

    October 2nd, 2014

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 1 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories

    b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights

    - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Classification via Dictionary Methods

    1) Task

    a) Categorize documents into predetermined categories b) Measure documents association with predetermined categories

    2) Objective function:

    f (θ,X i ) = ∑N

    j=1 θjX ij∑N j=1 X ij

    where:

    - θ = (θ1, θ2, . . . , θN) are word weights - X i = (Xi1,Xi2, . . . ,XiN) count the occurrence of each corresponding

    word in document i

    3) Optimization predetermined word list, no task specific optimization

    4) Validation (Model checking) weight (model) checking, replication of hand coding, face validity

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 2 / 23

  • Word Weights: Separating Classes

    General Classification Goal: Place documents into categories

    How To Do Classification?

    - Dictionaries:

    - Rely on Humans humans to identify words that associate with classes - Measure how well words separate (positive/negative, emotional, ...)

    - Supervised Classification Methods (Later in the Quarter):

    - Rely on statistical models - Given set of coded documents, statistical relationship between

    classes/words - Statistical measures of separation

    Key point: this is the same task

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 3 / 23

  • Word Weights: Separating Classes

    General Classification Goal: Place documents into categories How To Do Classification?

    - Dictionaries:

    - Rely on Humans humans to identify words that associate with classes - Measure how well words separate (positive/negative, emotional, ...)

    - Supervised Classification Methods (Later in the Quarter):

    - Rely on statistical models - Given set of coded documents, statistical relationship between

    classes/words - Statistical measures of separation

    Key point: this is the same task

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 3 / 23

  • Word Weights: Separating Classes

    General Classification Goal: Place documents into categories How To Do Classification?

    - Dictionaries:

    - Rely on Humans humans to identify words that associate with classes - Measure how well words separate (positive/negative, emotional, ...)

    - Supervised Classification Methods (Later in the Quarter):

    - Rely on statistical models - Given set of coded documents, statistical relationship between

    classes/words - Statistical measures of separation

    Key point: this is the same task

    Justin Grimmer (Stanford University) Text as Data October 2nd, 2014 3 / 23

  • Word Weights: Separating Classes

    General Classification Goal: Place documents into categories How To Do Classification?

    - Dictionaries:

    - Rely on Humans humans to identify words that associate with classes

    - Measure how well words separate (positive/negative, emotional, ...)

    - Supervis