How I Put Logos on ggplot2 Figures

Creating a ggplot2 theme that matches your organization’s colors and fonts can help your plots look slick and feel seamless with the rest of the organization’s work. One extra task that has come up for me is adding a logo to plots. While I find customizing a theme with theme() to be pretty straightforward, adding a logo is a little trickier. So in this post, I show how I add logos to ggplot2 figures. The code is cobbled together from other blog posts and StackOverflow questions, but I wanted to put it all in one place and show what was most intuitive for me.

First, let’s make a plot to add a logo to. I use the starwars data set, which is included in the dplyr package—loaded below with library(tidyverse). We can look at what species are represented more than once in this data set:

library(tidyverse)
data("starwars")

(species <- starwars %>% 
  count(species) %>% 
  filter(!is.na(species) & n > 1) %>% 
  arrange(-n) %>% 
  mutate(species = factor(species, species)))
## # A tibble: 8 x 2
##   species      n
##   <fct>    <int>
## 1 Human       35
## 2 Droid        5
## 3 Gungan       3
## 4 Kaminoan     2
## 5 Mirialan     2
## 6 Twi'lek      2
## 7 Wookiee      2
## 8 Zabrak       2

Then we plot these counts:

(p1 <- ggplot(species, aes(x = species, y = n)) +
  geom_bar(stat = "identity") +
  theme_light())

Now, for a logo. I will be working with .png files in this post, and I have a file stored in my working directory called logo.png. The following code can take a file name for a .png and return an object that ggplot2 can use:

get_png <- function(filename) {
  grid::rasterGrob(png::readPNG(filename), interpolate = TRUE)
}

l <- get_png("logo.png")

Now we have our logo as the object l. I like to stick logos below the plot and to the right. We do this in three steps:

  • annotation_custom: This places the logo in a specific range of the plot. We specify four points that draw a container around where we want to place the logo. Note that these numbers follow the same scale as your data.

  • coord_cartesian: We use this to turn the clip off so that the plot isn’t cropped down to only include where data are. This can be used any time we want to do something in the margins.

  • theme: We specify the plot margins. The units follow the pattern top, right, bottom, left (remember: trbl or “trouble”). In the code below, I specify larger padding in the third position (i.e., the bottom) so that we have some white space to work in for the logo.

I like to use grid::roundrectGrob() as a test logo when I’m trying to figure out the correct four points to supply to annotation_custom. It just draws a rectangle showing the container that your logo will be placed inside of. I assign it to t:

t <- grid::roundrectGrob()

p1 +
  annotation_custom(t, xmin = 6.5, xmax = 8.5, ymin = -5, ymax = -8.5) +
  coord_cartesian(clip = "off") +
  theme(plot.margin = unit(c(1, 1, 3, 1), "lines"))

Now that I know this is the correct placement, I swap out t for l:

p1 +
  annotation_custom(l, xmin = 6.5, xmax = 8.5, ymin = -5, ymax = -8.5) +
  coord_cartesian(clip = "off") +
  theme(plot.margin = unit(c(1, 1, 3, 1), "lines"))

And we have a logo placed right under the plot!

One issue I run into with this approach arises whenever we want to use facet_wrap() or facet_grid(): annotation_custom() will try to add the logo at the bottom of every panel:

p2 <- starwars %>% 
  mutate(human = !is.na(species) & species == "Human") %>% 
  ggplot(aes(x = height)) +
  geom_density() +
  facet_wrap(~ human)

p2 +
  annotation_custom(t, xmin = 200, xmax = 275, ymin = -.005, ymax = -.008) +
  coord_cartesian(clip = "off") +
  theme(plot.margin = unit(c(1, 1, 3, 1), "lines"))

So, what I do instead is create a plot that is only the logo. I make the x-axis data be the vector 0:1. This way, I can place the logo from .80 to 1.0 if I want it in the right-most 20% of the figure. I make the y-axis data be the integer 1; I don’t specify ymin or ymax so that the logo fills the entire height of the plot. I also use theme_void() to get rid of everything but the logo.

(p3 <- ggplot(mapping = aes(x = 0:1, y = 1)) +
  theme_void() +
  annotation_custom(l, xmin = .8, xmax = 1))

Then, I use gridExtra::grid.arrange() to stack the main plot itself on top of the logo plot. The heights argument means that p2 is 93% of the height, and p3 is 7%:

gridExtra::grid.arrange(p2, p3, heights = c(.93, .07))

Confidence Interval Coverage in Weighted Surveys: A Simulation Study

My colleague Isaac Lello-Smith and I wrote a paper on how to obtain valid confidence intervals in R for weighted surveys. You can read the .pdf here and check out the code at GitHub.

There are many schools of thought out there on methods for estimating standard errors from weighted survey data, and many books don’t tell you why a method may or may not be valid. So, we decided to simulate a situation that we often see in our work and see what worked best. A few highlights:

  • Be careful with “weights” arguments in R! Read the documentation carefully. There are many types of weights in statistics, and your standard errors, p-values, and confidence intervals can be wildly wrong if you supply the wrong type of weight (see the sketch after this list).

  • Use bootstrapping to calculate standard errors. We found that even estimation methods that were made for survey weights underestimated standard errors.

  • Be skeptical of confidence intervals in weighted survey contexts. We found that standard errors were underestimated when simulating real-world imperfections with the data, such as measurement error and target error. 95% confidence intervals were only truly 95% in best-case scenarios with low error.
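
To make the first two points concrete, here is a small sketch comparing what lm() does with a weights argument to a design-based estimate from the survey package, along with a simple bootstrap of a weighted mean. This is not the simulation code from the paper; the data below are made up for illustration.

library(survey)

set.seed(1839)
svy <- data.frame(
  y  = rnorm(500),
  wt = runif(500, .5, 2) # pretend these are sampling (probability) weights
)

# lm() treats weights= as precision weights, not sampling weights,
# so its standard error is not the design-based one we want here
summary(lm(y ~ 1, data = svy, weights = wt))$coefficients

# the survey package treats wt as sampling weights
des <- svydesign(ids = ~1, weights = ~wt, data = svy)
svymean(~y, des)

# bootstrapping the weighted mean as an alternative standard error
boot_means <- replicate(2000, {
  i <- sample(nrow(svy), replace = TRUE)
  weighted.mean(svy$y[i], svy$wt[i])
})
sd(boot_means)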

Using Word Similarity Graphs to Explore Themes in Text: A Tutorial

One of the first questions people ask about text data is, “What is the text about?” This search for topics or themes involves reducing the complexity of the text down to a handful of meaningful categories. I have found that a lot of common approaches for this are not as useful as advertised. In this post, I’m going to explain and demonstrate how I use word similarity graphs to quickly explore and make sense of topics in text data.


Other Approaches

I have used a number of platforms that advertise “using AI” to examine themes in text data. I am generally underwhelmed with these platforms: The themes rarely make much sense; the underlying algorithm is not explained, so it is hard to figure out why things don’t make sense; and the interfaces are rigid, not letting the user tune hyperparameters or change how the text is vectorized.

I have also experimented with a number of common unsupervised learning techniques. I appreciate the idea of latent Dirichlet allocation (Blei, Ng, & Jordan, 2003), because it is a probabilistic model that is explicit about its parameters, how it is estimated, the assumptions on which it relies, etc. However, I have rarely seen results that make much sense—other than the toy examples that text mining books use. Selecting an appropriate k has been difficult, too, as the four metrics programmed into the ldatuning package (Nikita, 2016) tend to disagree with one another, with some frequently recommending the smallest k considered and others the largest.

I also looked at vectorizing the documents, generally using tf-idf (Silge & Robinson, 2017, Ch. 3), and applying my go-to clustering algorithms, like k-NN or DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). I generally wasn’t satisfied with the results here, either. For example, difficulties might arise due to Zipf’s law (Silge & Robinson, 2017, Ch. 3), as the p-dimensional space we are projecting documents into is sparse, which causes problems for k-NN (e.g., Grcar, Fortuna, Mladenic, & Grobelnik, 2006).

There are undoubtedly use-cases for each of these approaches—and text mining is not my expertise—but my experiences led me to use word similarity graphs for exploring topics.


Word Similarity Graphs

I have found word similarity graphs (or networks) to be useful. There are three primary steps involved:

  1. Calculate the similarities between words. This makes the word—not the document—the unit of analysis. Instead of looking for documents close to one another in p-dimensional space, we will be looking for groups of words that co-occur often. I generally focus on words that appear in at least J% of documents—but less than L% of documents—and I flatten any similarity scores between two words that fall below the Ith percentile to 0.

  2. Format these similarity scores into a symmetric matrix, where the diagonal contains 0s and the off-diagonal cells are similarities between words. Use this as an adjacency matrix to plot a network graph.

  3. Cluster nodes in this graph using a community detection algorithm. I use the Walktrap algorithm, which is based on random walks that take T steps in the network.

I like treating words as the unit of analysis; it makes sense to me to think of topics as being made up of words often used in conjunction with one another. This is also a naturally visual method, and plotting the network helps me understand topics better than columns of weights corresponding to latent topics. However, this approach is exploratory. I generally choose the values of the J, L, I, and T hyperparameters based on subjective decisions—not by minimizing a loss function; nonetheless, it helps me understand what is being talked about in text.

I will show how to use a few simple functions I wrote as I describe this technique in more detail. The full code for each of the functions is found at the end of this post. The data I am using are headlines from news stories and blogs from the last 6 months that mention “Star Wars” (if the sequel trilogy makes you angry—you’re wrong, but I hope you still read the rest of this post). The data and all code necessary to reproduce this post can be found at my GitHub.


Preparation

The functions rely on the tidyverse being loaded into the current session, and they require the lsa and igraph packages to be installed. Before running any of the similarity or clustering functions, I run:

library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
data(stop_words)
dat <- read_csv("starwars.csv") %>% 
  transmute(
    id = 1:nrow(.), # headline identification number for reference
    text = gsub("[-/]", " ", title),
    text = tolower(gsub("[^A-Za-z ]", "", text))
  ) %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(word != "cnet") # corpus-specific stop word


Calculating Similarity

Many metrics exist to quantify how similar two units are to one another; distance measures can also be inverted to measure similarity (see Cha, 2007; Choi, Cha, & Tappert, 2010; Lesot, Rifqi, & Benhadda, 2009 for reviews). I started with an assumption that words belonging to the same topic will be used in the same document (which can be a story, chapter, song, sentence, headline, and so on), so I decided the foundation of word similarities here should be co-occurrences. If the words “Darth” and “Vader” appear together in 161 headlines (see below), then their similarity score would be 161.

dat %>% 
  group_by(id) %>% 
  summarise(vader = all(c("darth", "vader") %in% word)) %>% 
  with(sum(vader))
## [1] 161

A problem arises, however, in considering how frequently each word is used. The words “Princess” and “Leia” occur 31 times together, but “Princess” is used far less frequently than “Darth” in general (see below). Does that mean the words “Princess” and “Leia” are less similar to one another than “Darth” and “Vader”? Not necessarily.

dat %>% 
  group_by(id) %>% 
  summarise(leia = all(c("princess", "leia") %in% word)) %>% 
  with(sum(leia))
## [1] 31
dat %>% 
  group_by(id) %>% 
  summarise(
    darth = "darth" %in% word, 
    vader = "vader" %in% word, 
    princess = "princess" %in% word, 
    leia = "leia" %in% word
  ) %>% 
  select(-id) %>% 
  colSums()
##    darth    vader princess     leia 
##      236      232       39      116

We can overcome this difference in base rates by normalizing (standardizing) the co-occurrences. I use the cosine similarity for this, which is identical to the Ochiai coefficient in this situation (Zhou & Leydesdorff, 2016). The cosine similarity gets its name from being the cosine of the angle located between two vectors. In our case, each vector is a word, and the length of these vectors is the number of documents. If the word appears in a document, it is scored as “1”; if it does not, it is “0.” For simplicity’s sake, let’s imagine we have two documents: “Darth” appears in the first, but not the second; “Vader” appears in both. Plotted on two-dimensional space, the vectors look like:

data.frame(word = c("darth", "vader"), d1 = 1, d2 = 0:1) %>% 
  ggplot(aes(x = d1, y = d2)) +
  geom_point() +
  coord_cartesian(xlim = 0:1, ylim = 0:1) +
  geom_segment(aes(x = 0, y = 0, xend = d1, yend = d2)) +
  theme_minimal() +
  theme(text = element_text(size = 18))

This is a 45-degree angle. We can check that the cosine of 45 degrees is the same as the cosine similarity between those two vectors:

cos(45 * pi / 180) # this function takes radians, not degrees
## [1] 0.7071068
lsa::cosine(c(1, 1), c(0, 1))
##           [,1]
## [1,] 0.7071068

The binary (1 or 0) scoring means that words are never projected into negative space—no numbers below 0 are used. This means that negative similarities cannot occur. In the two-dimensional example above, the largest angle possible is 90 degrees, which has a cosine of 0; the smallest angle possible is 0 degrees, which has a cosine of 1. Similarities are thus normalized to the range from 0 (words are never used together) to 1 (words are always used together).
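
We can also check the two extremes with lsa::cosine(): two binary word vectors that never co-occur are orthogonal, and two that always co-occur point in the same direction:

lsa::cosine(c(1, 0), c(0, 1)) # never used together: cosine of 90 degrees, i.e., 0
lsa::cosine(c(1, 1), c(1, 1)) # always used together: cosine of 0 degrees, i.e., 1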

I wrote a function that takes a tokenized data frame—where one column is named word and another is named id—and returns a symmetric cosine similarity matrix. There are three other arguments. First, what proportion of documents must a word appear in to be considered? This makes sure that words only used in one or two documents are not included. I generally tune this so that it takes the top 85 to 120 words. Second, what proportion of documents is too many to be considered? In the present example, the words “Star” and “Wars” appear in every headline, so they would not tell us differentiating information about topics. I usually set this to be about .80. Third, how large must the similarity be to be included in the word similarity graph? I define this as a percentile. If it is set at .50, for example, then the function will shrink the similarities that are below the median to 0. This is to cut down on spurious or inconsequential relationships in the graph. I generally set this to be somewhere between .65 and .90. There is a lot of debate in the literature about how to filter these graphs (e.g., Christensen, Kenett, Aste, Silvia, & Kwapil, 2018), and I still need to experiment with these different filtering methods to come to a more principled approach than the arbitrary one I currently use.
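
My actual cosine_matrix() function is given in full at the end of this post; purely for orientation, a stripped-down sketch of the same idea might look something like this (treat the internals as illustrative, not as the exact code I use):

cosine_matrix <- function(tokens, lower, upper, filt) {
  # binary document-term matrix: 1 if the word appears in the headline, else 0
  dtm <- tokens %>% 
    distinct(id, word) %>% 
    mutate(present = 1) %>% 
    spread(word, present, fill = 0)
  
  # keep words whose document frequency is between lower and upper
  doc_freq <- colSums(dtm[, -1])
  keep <- doc_freq > lower * nrow(dtm) & doc_freq < upper * nrow(dtm)
  mat <- as.matrix(dtm[, -1][, keep])
  
  # cosine similarities between word columns; zero out the diagonal
  sims <- lsa::cosine(mat)
  diag(sims) <- 0
  
  # shrink similarities below the filt percentile to zero
  sims[sims < quantile(sims[lower.tri(sims)], filt)] <- 0
  sims
}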

Using the function shown at the end of this post, I compute the cosine similarity matrix using the following code:

cos_mat <- cosine_matrix(dat, lower = .01, upper = .80, filt = .80)

Since 8,570 documents (headlines) are in this corpus, words must appear in more than 85.7 documents and fewer than 6,856 to be included in the graph. I only graph the similarities that are in the uppermost quintile (i.e., similarity above the 80th percentile). This leaves 83 words:

dim(cos_mat)
## [1] 83 83


Making the Graph

A background on network theory and analysis is outside the scope of this post—but see Baggio, Scott, and Cooper (2010); Borgatti and Halgin (2011); Borgatti, Mehra, Brass, and Labianca (2009); and Telesford, Simpson, Burdette, Hayasaka, and Laurienti (2011) for introductions. We can build the network from our similarity matrix using igraph’s graph_from_adjacency_matrix() and then plot it with the ggraph package, which I like because it employs the same grammar as ggplot2. A random seed is set so that the layout of the graph is reproducible.

g <- graph_from_adjacency_matrix(cos_mat, mode = "undirected", weighted = TRUE)

set.seed(1839)
ggraph(g, layout = "nicely") +
  geom_edge_link(aes(alpha = weight), show.legend = FALSE) + 
  geom_node_label(aes(label = name)) +
  theme_void()
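
The community-detection pass (step 3 above) comes next. As a preview, a minimal sketch using igraph’s Walktrap implementation on the graph g from above looks like this; steps is the T hyperparameter, and 4 is simply igraph’s default rather than a tuned value:

# cluster nodes via the Walktrap algorithm on the weighted graph
comms <- cluster_walktrap(g, steps = 4)

length(comms) # number of word communities (candidate topics)
split(names(membership(comms)), membership(comms)) # which words fall in each community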