I have found word similarity graphs (or networks) to be useful. The three primary steps involved:
I like treating words as the unit of analysis; it makes sense to me to think of topics as being made up of words often used in conjunction with one another. This is also a naturally visual method, and plotting the network helps me understand topics better than columns of weights corresponding to latent topics. However, this approach is exploratory. I generally choose the values of the J, L, I, and T hyperparameters based on subjective decisions—not by minimizing a loss function; nonetheless, it helps me understand what is being talked about in text.
I will show how to use a few simple functions I wrote as I describe this technique in more detail. The full code for each function is at the end of this post. The data I am using are headlines from news stories and blogs in the last 6 months that mention “Star Wars” (if the sequel trilogy makes you angry, you’re wrong, but I hope you still read the rest of this post). The data and all code necessary to reproduce this post can be found at my GitHub.
Many metrics exist to quantify how similar two units are to one another; distance measures can also be inverted to measure similarity (see Cha, 2007; Choi, Cha, & Tappert, 2010; Lesot, Rifqi, & Benhadda, 2009 for reviews). I started with an assumption that words belonging to the same topic will be used in the same document (which can be a story, chapter, song, sentence, headline, and so on), so I decided the foundation of word similarities here should be co-occurrences. If the words “Darth” and “Vader” appear together in 161 headlines (see below), then their similarity score would be 161.
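dat %>%
  group_by(id) %>% # one group per headline
  summarise(vader = all(c("darth", "vader") %in% word)) %>%
  with(sum(vader)) # number of headlines that contain both words
## [1] 161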
A problem arises, however, in considering how frequently each word is used. The words “Princess” and “Leia” occur 31 times together, but “Princess” is used far less frequently than “Darth” in general (see below). Does that mean the words “Princess” and “Leia” are less similar to one another than “Darth” and “Vader”? Not necessarily.
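dat %>%
  group_by(id) %>% # one group per headline
  summarise(leia = all(c("princess", "leia") %in% word)) %>%
  with(sum(leia)) # number of headlines that contain both words
## [1] 31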
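dat %>% group_by(id) %>% # one group per headline
  summarise(
    darth = "darth" %in% word,
    vader = "vader" %in% word,
    princess = "princess" %in% word,
    leia = "leia" %in% word
  ) %>%
  select(-id) %>% colSums() # number of headlines that contain each word
##    darth    vader princess     leia
##      236      232       39      116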
We can overcome this difference in base rates by normalizing (standardizing) the co-occurrences. I use the cosine similarity for this, which is identical to the Ochiai coefficient in this situation (Zhou & Leydesdorff, 2016). The cosine similarity gets its name from being the cosine of the angle located between two vectors. In our case, each vector is a word, and the length of these vectors is the number of documents. If the word appears in a document, it is scored as “1”; if it does not, it is “0.” For simplicity’s sake, let’s imagine we have two documents: “Darth” appears in the first, but not the second; “Vader” appears in both. Plotted on two-dimensional space, the vectors look like:
data.frame(word = c("darth", "vader"), d1 = 1, d2 = 0:1) %>%
ggplot(aes(x = d1, y = d2)) +
coord_cartesian(xlim = 0:1, ylim = 0:1) +
geom_segment(aes(x = 0, y = 0, xend = d1, yend = d2)) +
theme(text = element_text(size = 18))
This is a 45 degree angle. We can make sure that the cosine of 45 degrees is the same as the cosine similarity between those two vectors:
cos(45 * pi / 180) # this function takes radians, not degrees
## [1] 0.7071068
lsa::cosine(c(1, 1), c(0, 1))
##           [,1]
## [1,] 0.7071068
The binary (1 or 0) scoring means that words are never projected into negative space—no numbers below 0 are used. This means that negative similarities cannot occur. In the two-dimensional example above, the largest angle possible is 90 degrees, which has a cosine of 0; the smallest angle possible is 0 degrees, which has a cosine of 1. Similarities are thus normalized inside of the 0 (e.g., words are never used together) to 1 (e.g., words are always used together) range.
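As a rough check on the reasoning above, the sketch below writes the cosine out by hand in base R and then applies the same normalization to the headline counts reported earlier. The helper names cosine_by_hand and ochiai are made up for this illustration; they are not functions from this post or from any package.

```r
# Cosine similarity written out: dot product over the product of vector lengths
cosine_by_hand <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Two-document example: "darth" is in document 1 only, "vader" is in both
darth <- c(1, 0)
vader <- c(1, 1)
two_doc <- cosine_by_hand(darth, vader)

# For binary vectors the cosine reduces to the Ochiai coefficient:
# co-occurrences divided by the square root of the product of the word counts
ochiai <- function(n_both, n_x, n_y) n_both / sqrt(n_x * n_y)

# Headline counts reported earlier in this post
darth_vader   <- ochiai(161, 236, 232)
princess_leia <- ochiai(31, 39, 116)

round(c(two_doc = two_doc, darth_vader = darth_vader, princess_leia = princess_leia), 3)
```

With these counts, “Darth” and “Vader” come out at roughly 0.69 and “Princess” and “Leia” at roughly 0.46. The raw co-occurrences differ by a factor of about five (161 versus 31), so accounting for base rates narrows the gap considerably without reversing the order.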
I wrote a function that takes a tokenized data frame (where one column is named word and another is named id) and returns a symmetric cosine similarity matrix. There are three other arguments. First, what proportion of documents must a word appear in to be considered? This makes sure that words only used in one or two documents are not included. I generally tune this so that it takes the top 85 to 120 words. Second, what proportion of documents is too many to be considered? In the present example, the words “Star” and “Wars” appear in every headline, so they would not tell us differentiating information about topics. I usually set this to be about .80. Third, how large must the similarity be to be included in the word similarity graph? I define this as a percentile. If it is set at .50, for example, then the function will shrink the similarities that are below the median to 0. This is to cut down on spurious or inconsequential relationships in the graph. I generally set this to be somewhere between .65 and .90. There is a lot of debate in the literature about how to filter these graphs (e.g., Christensen, Kenett, Aste, Silvia, & Kwapil, 2018), and I still need to experiment with these different filtering methods to come to a more principled approach than the arbitrary one I currently use.
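To make the three arguments concrete, here is a minimal base R sketch of the steps just described. It is not the function from the end of this post: the name sketch_cosine_matrix and the toy data are hypothetical, the argument names lower, upper, and filt are borrowed from the call used in this post, and taking the percentile over the off-diagonal similarities is an assumption.

```r
# Toy tokenized data: one row per (document id, word) pair
toy <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 4, 4),
  word = c("darth", "vader", "darth", "vader",
           "princess", "leia", "leia", "vader")
)

sketch_cosine_matrix <- function(dat, lower, upper, filt) {
  # Binary document-by-word matrix (1 = word appears in the document)
  m <- (unclass(table(dat$id, dat$word)) > 0) * 1
  # Keep words whose proportion of documents falls between lower and upper
  p <- colMeans(m)
  m <- m[, p > lower & p < upper, drop = FALSE]
  # Cosine similarity: co-occurrence counts over the product of vector lengths
  co  <- crossprod(m)
  len <- sqrt(diag(co))
  sim <- co / outer(len, len)
  # Drop self-similarities, then shrink values below the filt percentile to 0
  # (assumption: the percentile is computed over off-diagonal pairs)
  diag(sim) <- 0
  cutoff <- quantile(sim[upper.tri(sim)], filt)
  sim[sim < cutoff] <- 0
  sim
}

sim <- sketch_cosine_matrix(toy, lower = 0.2, upper = 0.8, filt = 0.5)
round(sim, 2)
```

The result is a symmetric word-by-word matrix with zeros on the diagonal, which is the shape an adjacency matrix for an undirected, weighted graph needs to have.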
Using the function shown at the end of this post, I compute the cosine similarity matrix using the following code:
cos_mat <- cosine_matrix(dat, lower = .01, upper = .80, filt = .80)
This corpus contains 8,570 documents (headlines), so only words that appear in more than 85.7 and fewer than 6,856 documents are used in this graph. I only graph the similarities that are in the uppermost quintile (i.e., similarity > 80th percentile). This leaves 83 words:
dim(cos_mat)
## [1] 83 83
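The document-count bounds quoted above follow directly from the corpus size and the two proportions. The short check below reproduces that arithmetic; the helper doc_count_in_range is a hypothetical name used only for this illustration.

```r
n_docs <- 8570 # number of headlines reported in this post

# Document-count bounds implied by lower = .01 and upper = .80
bounds <- n_docs * c(lower = 0.01, upper = 0.80)
bounds

# Hypothetical helper: does a word used in n documents pass both filters?
doc_count_in_range <- function(n) n / n_docs > 0.01 & n / n_docs < 0.80
in_range <- doc_count_in_range(c(85, 86, 6855, 6857))
in_range
```

The bounds work out to 85.7 and 6,856 documents, so a word used in 86 headlines is kept while one used in 85 is dropped, and likewise 6,855 is kept while 6,857 is dropped.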
Making the Graph
A background on network theory and analysis is outside the scope of this post, but see Baggio, Scott, and Cooper (2010); Borgatti and Halgin (2011); Borgatti, Mehra, Brass, and Labianca (2009); and Telesford, Simpson, Burdette, Hayasaka, and Laurienti (2011) for introductions. We can build the network from our similarity matrix using the igraph package and then plot it using the ggraph package, which I like because it employs the same grammar as ggplot2. A random seed is set so that the layout of the graph is reproducible.
g <- graph_from_adjacency_matrix(cos_mat, mode = "undirected", weighted = TRUE)
ggraph(g, layout = "nicely") +
geom_edge_link(aes(alpha = weight), show.legend = FALSE) +
geom_node_label(aes(label = name)) +