
Technical Appendix

This appendix provides an overview of further details related to model choices and fitting

topic models, as well as comparisons with other statistical approaches to text modeling.

Comparison of Topic Models and Alternative Quantitative Text Models

To help clarify the unique attributes of topic models (also known as Latent Dirichlet

allocation [LDA]; Blei et al., 2003), we briefly consider the motivations behind their

development. Given a word-document matrix (WDM), one approach to identifying underlying

dimensions of the data is to use principal components analysis (PCA). PCA uses the covariance

matrix from the WDM to derive a set of ordered and orthogonal dimensions characterizing the

variability in the WDM. While PCA is a general and useful dimensionality reduction technique,

it is not one that has traditionally been used for text analysis. However, the basis of PCA is

related to a greatly influential approach to modeling text, Latent Semantic Analysis (LSA;

Landauer & Dumais, 1997). LSA employs a matrix-factorization technique known as singular

value decomposition (SVD). SVD is a linear algebra method based on performing an

eigenvector decomposition of a matrix, and can be used to reduce the dimensionality of data by

selecting subsets of the dimensions that capture most of the variance in the observed data. PCA

and SVD share many similarities, but whereas PCA is applied to a covariance matrix of the

WDM, SVD is applied directly to the WDM. In LSA, the SVD of the WDM is used to

construct a “semantic space” (i.e., a set of weights that map the observed data to a reduced set of

underlying dimensions, broadly similar to factor loadings in factor analysis). Individual terms

and documents can then be mapped into this reduced dimensionality semantic space, which in

turn can be used for making comparisons of semantic similarity.
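To make the LSA pipeline concrete, the following minimal sketch (our own toy illustration; the word-document matrix and variable names are made up, and this is not the analysis used in the paper) applies an SVD to a small WDM, keeps two dimensions as the "semantic space", and compares two documents by cosine similarity:

```python
import numpy as np

# Toy word-document matrix (V word-types x D documents); counts are made up.
wdm = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 1, 2],
], dtype=float)

# SVD of the WDM: wdm ~= U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(wdm, full_matrices=False)

# Keep the k dimensions that capture most of the variance (the "semantic space").
k = 2
doc_space = (np.diag(s[:k]) @ Vt[:k, :]).T   # each row: a document in k dimensions
word_space = U[:, :k] * s[:k]                # each row: a word-type in k dimensions

# Semantic similarity between documents 0 and 2 via cosine similarity.
a, b = doc_space[0], doc_space[2]
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos, 3))
```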


However, there are shortcomings to the LSA method (for an in-depth discussion, see

Griffiths, Steyvers, & Tenenbaum, 2007). One specific example is that LSA struggles to capture

what is known as polysemy—in which the same word can have multiple distinct semantic

meanings—such as the word play (e.g., play an instrument, play at the park, or going to a play at

the theater). Methods that map documents and words into a multi-dimensional space, such as

LSA, only have a single point-representation for a given word and are unable to properly capture

polysemy. Perhaps a more relevant issue for present purposes is that the dimensions of the

semantic space are not easily interpreted. Specifically, there is no inherent meaning in the

dimensions of LSA's semantic space; much as in factor analysis, the original factor loadings are

typically hard to interpret, and hence the common practice is to apply some type of factor

rotation to enhance interpretability. However, factor rotation does not always yield interpretable

factors (e.g., the “bars” data from Griffiths & Steyvers, 2004).

The shortcomings of LSA were addressed by two models: a) Probabilistic Latent

Semantic Indexing (pLSI; Hofmann, 1999), and b) LDA (Blei et al., 2003), which re-formulated the problem of unsupervised modeling of text in terms of probability theory. pLSI modeled each

document as a mixture of probability distributions over words, and Blei et al. extended pLSI to

make it fully Bayesian by assigning Dirichlet priors on both the document mixture proportions

and the topic-word distributions, yielding LDA. Since the introduction of LDA, a large number of variants on this model have been developed, and these are now often more generally referred

to as “topic models”.

The key difference between LSA and topic models is that, in topic models, documents are

represented as an additive mixture of probability distributions over words (i.e., topics), in which

each topic typically captures an interpretable dimension of semantic meaning. Specifically, the


high-probability words within each topic learned by LDA tend to be closely semantically related.

The interpretability of topic models confers a significant advantage over previous approaches based on classical matrix factorization techniques for dimensionality reduction, such

as LSA. Given this brief history of text modeling, we now describe the details of LDA,

refraining from cumbersome statistical detail.

Formal Description of LDA

As previously indicated, LDA is an unsupervised machine learning method for finding a

set of topics that can be used to summarize a collection of documents. Given the input of the

WDM, the model estimates 1) a set of topics that captures underlying semantic subjects in the

corpus, and 2) a representation of each document in terms of the set of topics. Specifically, each

topic is modeled as a probability distribution over words, and each document is modeled as a

probability distribution (i.e., a mixture) of these topics, where the topics with high probability for

a given document capture the semantic content that is most prevalent within the document.

For clarity, we summarize here the notational conventions used in the Appendix. Upper-

case letters are used to indicate fixed integer values in our model: V = the size of the vocabulary,

i.e., the number of unique word-types in the dataset, after pre-processing such as removal of

stop-words; D = the number of documents in the corpus; T = the number of topics (as set by the

experimenter / data analyst). Bold-formatting is used for complete vectors of parameters,

whereas regular type is used to refer to a parameter value of a specific element within the vector.

For example, z is the vector of latent (unobserved) indicator variables for all observed words w

in the dataset, where zi indicates the topic-assignment for the ith observed word-token wi (and therefore can take a value corresponding to one of the topics, zi ∈ {1, 2, …, T}). Finally,

upper-case Greek letters refer to full matrices of parameters (e.g., Φ for the probabilities of


words for each topic and Θ for the probabilities of topics for each document). The probability

distributions for a specific topic ϕt or a specific document θd will use the lower-case form of the

Greek letter and be indexed by a subscript.

Using this notation, the model can be described as follows: Each topic t (where t ∈ {1, …, T}) is represented by a multinomial distribution over the set of V unique word-types in the corpus.

The distribution for topic t is given by parameter ϕt, which is a V-length vector of real-valued

numbers which sum to one. Thus, the set of T topics is represented by a V by T matrix, Φ, in

which all columns sum to one, and where ϕw,t is the probability that topic t will generate word-

type w. The number of topics that the model estimates, T, is a hyper-parameter that is specified

by the data analyst (we describe some methods for choosing the value of T below). Each document d (where d ∈ {1, …, D}) is represented by a multinomial distribution over the set of T topics. The distribution for document d is given by parameter θd, which is a T-length vector of real-valued numbers that sum to one. Thus, the set of D documents is represented by a D by T matrix, Θ, in which all rows sum to one, and where θd,t is the

probability of topic t given document d. Finally, to make the model fully Bayesian, Dirichlet priors are placed on ϕt, the distribution over words for each topic, as well as on θd, the distribution over topics for each document. Intuitively, these prior distributions serve as smoothing parameters, adding a small pseudo-count to the matrices Φ and Θ (which is very useful, since these matrices are typically very sparse and would otherwise consist mostly of zeroes).

Because LDA is a probabilistic generative model, it describes a step-wise process by

which the model generates the observed data (i.e., the set of documents and the specific words



observed for each document), and reviewing these steps can provide an intuitive feel for how the

model works. The generative process for LDA is as follows (a brief simulation of these steps is sketched after the list):

1. For each topic t, sample a multinomial distribution ϕt over the V word-types from

a Dirichlet prior distribution with hyper-parameter β.

2. For each document d, sample a multinomial distribution θd over the T topics from

a Dirichlet prior distribution with hyper-parameter α.

3. For each document d, generate wd,i, the word observed in the ith slot, as follows:

a. Sample a topic zi according to the probability of each topic given the document (these probabilities are given by the vector θd). Note that zi is an indicator variable that indicates the topic for word wd,i and takes a value from 1…T.

b. Sample word wd,i from that topic according to the probability of word w in topic zi (these probabilities are given by the topic vector ϕzi).
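To make these steps concrete, the following minimal simulation (a toy sketch of our own; the sizes and hyper-parameter values are arbitrary, and this is the assumed generative story, not the fitting procedure) draws topics, document mixtures, and words in exactly the order listed above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 8, 3, 2            # vocabulary size, documents, topics (toy values)
alpha, beta = 0.5, 0.1       # Dirichlet hyper-parameters (arbitrary toy values)
doc_lengths = [10, 12, 9]

# Step 1: for each topic, a distribution phi_t over the V word-types
# (stored here as a T x V array, the transpose of the V x T matrix in the text).
Phi = rng.dirichlet(np.full(V, beta), size=T)

# Step 2: for each document, a distribution theta_d over the T topics.
Theta = rng.dirichlet(np.full(T, alpha), size=D)     # D x T

# Step 3: for each word slot, sample a topic z, then a word w from that topic.
docs = []
for d in range(D):
    words = []
    for i in range(doc_lengths[d]):
        z = rng.choice(T, p=Theta[d])        # 3a: topic indicator for this slot
        w = rng.choice(V, p=Phi[z])          # 3b: word-type drawn from topic z
        words.append(w)
    docs.append(words)

print(docs[0])
```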

The LDA model is illustrated in Figure A1 using graphical model plate notation (Jordan,

2004). Graphical plate notation provides a visual representation for understanding

probabilistic generative models because it captures the relationships between all of the model

variables and the observed data. Each variable is represented as a node in the figure, where the

shaded nodes correspond to observed data, and the unshaded nodes correspond to unobserved

model parameters. The boxes, or “plates”, indicate that a process is repeated multiple times. For

example, the figure shows that – for all documents – a document’s probability distribution θd

over topics is sampled from a Dirichlet distribution with hyperparameter α. Then, for all words

in a document, one first samples a topic (where the z parameter is an indicator for this topic), and


then the word (w), which is observed, is sampled according to the probability distribution ϕz for

the corresponding topic.

[Figure A1: Graphical Model for LDA. Plate notation with plates for Topics, Documents, and Words; nodes α, θ, z, w (observed, shaded), and β.]

The output of the LDA model consists of (1) the vector z, containing the topic

assignments for each word in the corpus, (2) the matrix Φ of topic-word probabilities, and (3) the

matrix Θ of document-topic probabilities. Prior to considering how the output can be used and

interpreted, we review how the model is estimated, including parameters that must be set.

Model Inference and Parameter Settings

Topic Model Inference

A common inference method for topic models is known as collapsed Gibbs sampling,

which was used in the current analyses and was first described in Griffiths & Steyvers (2004). A

brief summary of this method is provided here (for additional mathematical details and

derivations, see Griffiths & Steyvers, 2004).


Learning the parameters of the model involves inferring the set of T distributions ϕt, t = 1…T, of topics over words, and the D distributions θd, d = 1…D, of documents over topics. In collapsed

Gibbs sampling, all of the topic-word and document-topic distributions are learned by

sequentially updating the vector z of latent assignments of all words in the corpus to individual

topics (the actual distributions θd and ϕt do not need to be directly updated, because they have

been “collapsed”, i.e., integrated out of the equations).

The vector of z assignments is initialized randomly (by assigning each word wi to a

random topic by setting zi to a random value between 1 and T). The z-assignments are then

updated in a random order using the following Gibbs update equation:

$$P(z_i = t \mid \mathbf{z}_{-i}, w_i, d, \alpha, \beta) \;\propto\; \frac{N^{-i}_{w,t} + \beta}{\sum_{w=1}^{V} N^{-i}_{w,t} + V\beta} \cdot \frac{N^{-i}_{d,t} + \alpha}{\sum_{t=1}^{T} N^{-i}_{d,t} + T\alpha}$$

where $N^{-i}_{w,t}$ is the current count of the number of times word w has been assigned to topic t, and $N^{-i}_{d,t}$ is the count of the number of times words in document d have been assigned to topic t. The superscript "−i" is used to indicate that these calculations are made after the current word $w_i$ (for which we are sampling a new topic-assignment $z_i$) has been removed from these counts. The

hyperparameters for the Dirichlet prior distributions, α and β, as well as the number of topics, T,

are parameters that must be specified by the researcher.

In words, the above equation states that we assign word w in document d to topic t in proportion to the current estimates of (1) the probability of that word given topic t (the first factor on the right-hand side of the equation), and (2) the probability of topic t given document d (the second factor). Each

iteration of the Gibbs sampler updates each of the zi assignments once.
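The update can be sketched in a few lines of code. The following is a minimal illustration of collapsed Gibbs sweeps on a toy corpus (our own simplified sketch: the variable names, toy documents, and fixed number of sweeps are assumptions, not the implementation used for the analyses):

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, alpha, beta = 6, 2, 0.5, 0.01

# Toy corpus: each document is a list of word-type indices (0..V-1).
docs = [[0, 1, 1, 2], [2, 3, 4], [4, 5, 5, 0]]
words = [w for doc in docs for w in doc]                 # flattened word tokens
doc_of = [d for d, doc in enumerate(docs) for _ in doc]  # document index of each token
D, N = len(docs), len(words)

# Random initialization of z and the counts it implies.
z = rng.integers(0, T, size=N)
n_wt = np.zeros((V, T))                                  # word-topic counts
n_dt = np.zeros((D, T))                                  # document-topic counts
for i in range(N):
    n_wt[words[i], z[i]] += 1
    n_dt[doc_of[i], z[i]] += 1

def gibbs_sweep():
    for i in rng.permutation(N):
        w, d = words[i], doc_of[i]
        n_wt[w, z[i]] -= 1                               # remove token i from the counts ("-i")
        n_dt[d, z[i]] -= 1
        # Gibbs update: word-given-topic term times the document-topic numerator
        # (the document-side denominator is constant in t, so it cancels on normalization).
        p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + V * beta) * (n_dt[d] + alpha)
        z[i] = rng.choice(T, p=p / p.sum())              # sample the new topic assignment
        n_wt[w, z[i]] += 1
        n_dt[d, z[i]] += 1

for _ in range(50):                                      # a few sweeps stand in for "burn-in"
    gibbs_sweep()
print(z)
```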


Once the Gibbs sampler has been allowed to sufficiently “burn-in” (by running the

sampler for a number of iterations until the distributions are stable), one can directly estimate the

parameters of interest, θd and ϕt as follows:

$$\hat{\phi}_{w,t} = \frac{N_{w,t} + \beta}{\sum_{w=1}^{V} N_{w,t} + V\beta}, \qquad \hat{\theta}_{d,t} = \frac{N_{d,t} + \alpha}{\sum_{t=1}^{T} N_{d,t} + T\alpha}$$

which are the same estimates based on the current counts of z-assignments that are used in the

Gibbs update equation and have the same interpretation. Note that the only difference between

these terms and those in the Gibbs update equation is that the “-i” notation has not been used.

This is because the Gibbs update involves conditioning zi on all of the z assignments except the

one that is currently being updated. Here, however, we are computing our final posterior

estimates of the parameters based on all assignments.
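As a self-contained illustration of these two estimators (the count matrices below are made-up stand-ins for the final assignment counts of a Gibbs run, mirroring the sketch above):

```python
import numpy as np

V, D, T, alpha, beta = 6, 3, 2, 0.5, 0.01
# Hypothetical final counts of topic assignments (z) after burn-in.
n_wt = np.array([[3, 0], [2, 1], [0, 2], [1, 0], [0, 3], [1, 2]], dtype=float)  # V x T
n_dt = np.array([[3, 1], [1, 2], [3, 2]], dtype=float)                          # D x T

Phi_hat = (n_wt + beta) / (n_wt.sum(axis=0, keepdims=True) + V * beta)      # columns sum to 1
Theta_hat = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)  # rows sum to 1

print(Phi_hat.sum(axis=0), Theta_hat.sum(axis=1))   # sanity check: all ones
```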

Setting the Hyperparameters α and β

In addition to the parameters that are estimated by the model (described above), the

hyperparameters for the Dirichlet prior distributions, α and β, as well as the number of topics, T,

must be specified by the data analyst. There is a significant body of work that has looked at how

to optimally choose the number of topics, set hyperparameters, and also has considered the

influence of these settings on the fitted model (see, e.g., the Topic Modeling Bibliography at

http://www.cs.princeton.edu/~mimno/topics.html).

Choosing hyperparameters can be complex because, ultimately, all of the hyperparameters interact in their influence on the fitted LDA model. Nonetheless, there are some heuristics for understanding each hyperparameter's influence on LDA output. For the Dirichlet priors:


The α hyperparameter serves as a smoothing parameter on θ. As α becomes smaller,

documents’ distributions over topics θd will become more peaked (i.e., the words within each

document will tend to be assigned to a smaller subset of the topics).

The β hyperparameter is a smoothing parameter on the topic-word distributions ϕt. As β

changes, this will influence the granularity of the topics (Griffiths & Steyvers, 2004); smaller

values of β will lead to more fine-grained topics, and larger values will lead to coarser, more

general topics.

A common practice for setting these values was described in Griffiths & Steyvers (2004): Given

a specific setting for the number of topics, T, setting α = 50/T and β = 0.01 generally provides good

topics (see Wallach, Mimno, & McCallum, 2009 for an alternative approach in which the

hyperparameters are optimized).
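As an illustration of this heuristic, the prior values can be computed from T and passed to an off-the-shelf implementation. The sketch below uses gensim's LdaModel purely as an example (the toy documents are made up, gensim's default estimation is variational rather than the collapsed Gibbs sampler described above, and this is not the software used for the analyses in the paper):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy pre-processed documents (lists of word-types); in practice these would be
# the stop-word-filtered transcript "documents".
texts = [["money", "bills", "budget", "money"],
         ["kids", "school", "kids", "homework"],
         ["money", "budget", "kids", "school"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

T = 2                                   # number of topics (toy value)
alpha = [50.0 / T] * T                  # Griffiths & Steyvers (2004) heuristic
eta = 0.01                              # beta in the text; gensim calls it eta

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=T,
               alpha=alpha, eta=eta, passes=10, random_state=0)
print(lda.print_topics(num_words=3))
```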

Determining the Number of Topics (T)

Given the above heuristic for the settings of α and β, the only hyperparameter left to

choose is T, the number of topics. It is useful to bear in mind that in choosing the number of

topics the data analyst is not in search of the “true” number of topics but rather a useful number

of topics for a given purpose. Moreover, different research questions with the same corpus

might very well use different numbers of topics. There are two general strategies for choosing T:

1) interpretability of topics, and 2) cross-validation using either an internal or external criterion.

Some commonly used values of T are 50, 100, 200, and 400. One option for choosing

among these settings is simply to run the model with each of these values, visually inspect the

resulting set of topics (by looking at the distribution of topics over words), and choose the set of

topics that is most interpretable. Notably, unlike other methods of dimensionality reduction, the interpretability of the topics under different settings of T is fairly stable, due in part to the robust


nature of Bayesian methods. However, two things will happen as T increases: at some point (depending on the nature and diversity of the data), there will be topic redundancy (i.e., multiple topics will capture similar dimensions of the data) and topic idiosyncrasy (i.e., individual topics will be dedicated to very small portions of the overall corpus). For example, in the

couple therapy corpus, there are 134 couples. Thus, at higher numbers of topics (e.g., 200 or

400), there were clearly topics summarizing semantic content for individual couples. In most

applications to these data, we would be interested in linguistic content that is common across couples (e.g., how does negative emotion language change during therapy?). Hence, practically, a smaller number of topics was warranted, and the number of topics in the present application was chosen by varying the total number of topics and examining the interpretability of the resulting topics.

An alternative approach to determining the number of topics is to use cross-validation. A

small proportion (e.g., 10%) of the data from each document can be withheld from the model

fitting process, and then the accuracy of the model in predicting the withheld portion can be

assessed. Models can then be fit iteratively, varying the number of topics (or hyper-parameters), and the model with the best prediction can be used to determine the number of topics. This approach is somewhat more challenging when there is no criterion external to the text; in that case, it is possible to use an internal criterion such as perplexity, which is a measure of how well the model can predict missing words in a document (e.g., see Griffiths & Steyvers, 2004).

Alternatively, if there is an external criterion (i.e., external to the text), then a similar approach

can be used to optimize the total number of topics in predicting the external criterion. Consider

the example of behavioral codes in the present data where we used cross-validation to estimate

prediction accuracy in a regression with topics. Topic models were fit on a subsample of the

complete data (e.g., a subset of the sessions or a subset of the couples). We then varied the


number of topics, and the topic output was used in a logistic regression model to predict the

behavioral codes in the withheld sessions or couples. The model accuracy for the withheld data

could then be used to determine the number of topics. However, again, we emphasize that this

approach – while certainly objective – can lead to different numbers of topics for different

outcomes (e.g., across different behavioral codes). Hence, in the topic modeling literature thus

far, it is quite common to see applications of topic models in which the number of topics was chosen based on the interpretability of the topics.
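A minimal sketch of the cross-validation idea using an internal criterion: hold out some data, fit models with different values of T, and compare held-out fit. This toy example uses scikit-learn's variational LDA and its perplexity method on randomly generated counts, and it holds out whole documents rather than a portion of the words within each document, so it only illustrates the loop structure, not the procedure actually used in the paper:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy word-document matrix in documents-by-words orientation (sklearn convention).
X = rng.poisson(0.5, size=(60, 40))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

candidate_T = [5, 10, 20]
scores = {}
for T in candidate_T:
    lda = LatentDirichletAllocation(n_components=T, random_state=0)
    lda.fit(X_train)
    scores[T] = lda.perplexity(X_test)      # lower perplexity = better held-out fit

best_T = min(scores, key=scores.get)
print(scores, best_T)
```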

Defining Documents

One additional model setting, which has not been a typical concern in topic modeling, is

the question of how to represent “documents”. Most corpora used in topic modeling consist of

self-contained news or journal articles, and thus have a natural document representation.

However, because the couple therapy corpus consists of transcripts of couples having dialogues or 3-person conversations with therapists, over multiple sessions and communication assessments, there is no single natural document representation, and defining documents was an explicit decision.

For the work presented in this paper, we chose to treat the text from an individual

speaker, within each individual therapy session or communication assessment transcript, as a

unique “document”. The reason for this choice was that it put documents in 1-1 alignment with

the behavioral coding systems, which were assigned to each spouse within each therapy session

or communication assessment. Although other meta-data (e.g., couple outcomes) were at

different levels of granularity, it would be straightforward to simply average over the

“documents” to get a topic-based representation at the level of couples. Note that since the topic

model assigns each individual word to a topic, one can naturally move from a more coarse-


grained document representation to a finer-grained representation (e.g., by treating each unique

couple as a document, and then using the assignments (i.e., z) of words for individual speakers

within individual sessions, to generate document representations at different levels of

granularity).
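To illustrate this document definition, one could group transcript utterances by couple, session, and speaker before building the WDM. The data frame, column names, and example utterances below are hypothetical stand-ins, not the actual corpus format:

```python
import pandas as pd

# Hypothetical transcript table: one row per utterance.
utterances = pd.DataFrame({
    "couple":  [1, 1, 1, 1, 2, 2],
    "session": [1, 1, 1, 2, 1, 1],
    "speaker": ["husband", "wife", "husband", "wife", "wife", "therapist"],
    "text":    ["we never talk about money", "you always bring up money",
                "the kids need new shoes", "I felt better this week",
                "work has been stressful", "tell me more about that"],
})

# One "document" = all text from one speaker within one session.
documents = (utterances
             .groupby(["couple", "session", "speaker"])["text"]
             .apply(" ".join)
             .reset_index(name="document"))
print(documents)

# Coarser representations (e.g., one document per couple) can be built the same
# way, or recovered later from the word-level topic assignments z.
```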

Utilizing the LDA Output

After the data are processed and transformed into a WDM, and the model is fit, the outputs of LDA (z, Φ, and Θ) provide various lenses for understanding the underlying semantic representation

of the corpus. Throughout the paper, we have illustrated some of these applications. Here we

will briefly summarize the LDA output and how its various components were used.

The distribution of topics over words, Φ, provides an interpretable set of dimensions (i.e.,

topics) that can be used to illustrate the latent semantic dimensions of a text corpus. In Figure 1

of the paper, we showed a number of the topics that were discovered by LDA (by showing, in order, the 15 most likely words under each topic's probability distribution, ϕt). These high-probability words most strongly define a given topic, and throughout the rest of the paper, the

high-probability words (rather than simply the topic-numbers) were used to label each topic,

which allows for easy interpretation of various dimensions (e.g., to see what dimensions of data

are predictive of specific behavioral codes).
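For example, given an estimated Φ (a V by T matrix of topic-word probabilities) and the list of word-types, topic labels of this kind can be produced by sorting each column; the small matrix and vocabulary below are made-up stand-ins:

```python
import numpy as np

vocab = np.array(["money", "bills", "budget", "kids", "school", "homework"])
Phi_hat = np.array([[.30, .02], [.25, .03], [.20, .05],
                    [.10, .40], [.10, .30], [.05, .20]])   # V x T, columns sum to 1

n_top = 3
for t in range(Phi_hat.shape[1]):
    top = vocab[np.argsort(Phi_hat[:, t])[::-1][:n_top]]   # highest-probability words
    print(f"Topic {t}: " + ", ".join(top))
```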

The distribution of documents over topics Θ gives an ordering of which topics are most

frequent under each document. In Figure 5, we averaged across the husband and wife documents

within each therapy session to compute a vector θd capturing the frequency of each topic within a

particular couple’s speech over time. By looking at how this θd distribution evolved over the

course of therapy, we are able to view both the time-course of the therapy intervention and


any effect this has on the couple’s speech over the course of therapy. We also utilize the ϕt

distributions to label the topics in Figure 5, so that the evolution of the couple’s speech over

therapy is readily interpretable. Furthermore, if we wish to drill down to the original text of the data, we can use the vector z of topic-assignments to augment the transcripts, to see

which words in each transcript are being assigned to which topics.
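A small sketch of these two uses: averaging the husband and wife θd vectors within a session (as in Figure 5), and using z to tag individual transcript words with their topics. All of the values below are made up for illustration:

```python
import numpy as np

T = 3
theta_husband = np.array([0.6, 0.3, 0.1])   # hypothetical theta_d rows for one session
theta_wife    = np.array([0.4, 0.2, 0.4])
theta_session = (theta_husband + theta_wife) / 2    # couple-level topic mixture
print(theta_session)

# Tagging transcript words with their sampled topic assignments z.
words = ["we", "never", "talk", "about", "money"]
z     = [2, 2, 0, 0, 1]                              # topic index for each word token
print([f"{w}[{t}]" for w, t in zip(words, z)])
```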

In addition, the Θ matrix was used for the prediction of behavioral codes. This document-by-topic matrix describes how prevalent each topic is in each document, and hence it can be used to create a topic-based matrix of covariates for predicting behavioral codes or other outcomes associated with couples or sessions. These covariates could then be used, as in Figure 6, to predict the behavioral codes for new documents (i.e., for data that were not used during training in the

cross-validation procedure). Additionally, as in Figure 7, the regression coefficients on each

topic that are learned by the regression model can then be used to further explore how each topic

relates to the different behavioral codes (by ordering the topics according to the regression-

model weights).
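A minimal sketch of this prediction step, using the rows of Θ as covariates in a cross-validated logistic regression; the synthetic Θ, the synthetic binary code, and the scikit-learn estimator are illustrative assumptions, not the models reported in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
D, T = 200, 25
Theta = rng.dirichlet(np.full(T, 0.5), size=D)              # D x T document-topic proportions
code = (Theta[:, 0] > np.median(Theta[:, 0])).astype(int)   # synthetic binary behavioral code

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, Theta, code, cv=5)               # held-out prediction accuracy
print(acc.mean())

# Inspecting the fitted coefficients shows which topics relate to the code (cf. Figure 7).
clf.fit(Theta, code)
top_topics = np.argsort(clf.coef_[0])[::-1][:5]
print(top_topics)
```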

Appendix References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine

Learning Research, 3, 993–1022.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National

Academy of Sciences, 101(Suppl. 1). doi:10.1073/pnas.0307752101

Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic

representation. Psychological Review, 114, 211-244.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual

International SIGIR Conference on Research and Development in Information Retrieval.


Jordan, M. I. (2004). Graphical models. Statistical Science, 19, 140-155.

doi:10.1214/088342304000000026

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic

analysis theory of acquisition, induction, and representation of knowledge. Psychological

Review, 104, 211-240. doi: 10.1037/0033-295X.104.2.211

Wallach, H., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why priors matter.

In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.
