Quantifying "Low-Brow" and "High-Brow" Films

I went and saw Certain Women a few months ago. I was pretty excited to see it; a blurb in the trailer calls it “Triumphant… an indelible portrait of independent women,” which sounded promising to me. The film made a solid point in exposing the mundane, everyday ways in which women have to confront sexism. It isn't always a huge dramatic thing that is obvious to everyone—most of the time, sexism is commonplace and woven into the routine of our society.

The only problem is that I found the movie, well, pretty boring. Showing how quotidian sexism is makes for a slow-paced, quotidian plot. A few days ago, I happened upon the Rotten Tomatoes entry for the movie. It scored very well with critics (92% liked it), but rather poorly with audiences (52%). That got me thinking about the divide between critics and audiences: the biggest gaps between audience and critic scores could be an interesting way to quantify what is “high-brow” and what is “low-brow” film. So I got critic and audience scores for movies released in 2016, plotted them against one another, and looked at where they differed most.

Method

The movies I chose to examine were all listed on the 2016 in film Wikipedia page. The problem was that I needed links to Rotten Tomatoes pages, not just the names of movies. So I scraped this table, took the names of the films, and turned them into Google search URLs by using paste0 to append each film's name to "https://google.com/search?q=rottentomatoes+2016+".
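That step looks something like this (the three titles here are just for illustration, and the gsub() call to encode spaces as "+" is an assumption about how one would handle them):

# hypothetical film titles standing in for the scraped Wikipedia table
films <- c("Certain Women", "Hail, Caesar!", "The Witch")

# build one Google search URL per film; gsub() encodes spaces as "+"
searchURLs <- paste0("https://google.com/search?q=rottentomatoes+2016+",
                     gsub(" ", "+", films))

Then, I wrote a little function (using rvest and magrittr) that takes a Google search URL and fetches the URL for the first search result: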

library(rvest)    # read_html(), html_node(), html_attr()
library(magrittr) # the %>% pipe

# function for getting first hit from google page
getGoogleFirst <- function(url) {
  url %>% 
    read_html() %>% 
    html_node(".g:nth-child(1) .r a") %>% # link of the first organic result
    html_attr("href") %>% # e.g., "/url?q=https://www.rottentomatoes.com/m/...&sa=..."
    strsplit(split="=") %>% 
    getElement(1) %>% 
    strsplit(split="&") %>% # peel the target URL out of Google's redirect href
    getElement(2) %>% 
    getElement(1)
}

After running this through a loop, I got a long vector of Rotten Tomatoes links.
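The loop itself might look something like this (the object names and the pause are my own sketch):

# collect the first Google hit for each search URL
rtLinks <- character(length(searchURLs))
for (i in seq_along(searchURLs)) {
  rtLinks[i] <- getGoogleFirst(searchURLs[i])
  Sys.sleep(2) # polite pause between requests
}

Then, I fed these links into two functions that get the critic and audience scores: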

# get rotten tomatoes critic score
rtCritic <- function(url) {
  url %>% 
    read_html() %>% 
    html_node("#tomato_meter_link .superPageFontColor") %>% # the Tomatometer element
    html_text() %>% # e.g., "92%"
    strsplit(split="%") %>% # drop the percent sign
    as.numeric()
}

# get rotten tomatoes audience score
rtAudience <- function(url) {
  url %>% 
    read_html() %>% 
    html_node(".meter-value .superPageFontColor") %>% # the audience score element
    html_text() %>% 
    strsplit(split="%") %>% 
    as.numeric()
}

The film names and scores were all put into a data frame.
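Sketched with the hypothetical names from above, that assembly might be:

# one row per film: name, critic score, audience score
scores <- data.frame(film     = films,
                     critic   = sapply(rtLinks, rtCritic),
                     audience = sapply(rtLinks, rtAudience))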

Results

Overall, I collected data on 224 films. The average critic score was 56.74, while the average audience score was 58.67. Audiences tended to be slightly more positive, but the difference of 1.93 points was not statistically significant, t(223) = 1.34, p = .181. Audiences and critics tended to agree; scores between the two groups correlated strongly, r = .68.
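These are a paired t-test and a simple correlation, which would look like this with the scores data frame sketched above:

# does the average audience score differ from the average critic score?
t.test(scores$audience, scores$critic, paired=TRUE)

# how strongly do the two sets of scores agree?
cor(scores$critic, scores$audience)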

But where do audiences and critics disagree most? I calculated a difference score by taking critic minus audience scores, such that positive scores mean critics liked the film more than audiences did. The biggest difference scores in each direction are in the tables below (a tie gives the “high-brow” table six entries).
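Computing and sorting the difference scores is short work:

# positive = critics liked it more; negative = audiences liked it more
scores$difference <- scores$critic - scores$audience

head(scores[order(-scores$difference), ], 6) # most "high-brow" (six rows because of the tie)
head(scores[order(scores$difference), ], 5)  # most "low-brow"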

“High-Brow” Films

Film               Critic  Audience  Difference
The Monkey King 2     100        49          51
Hail, Caesar!          86        44          42
Little Sister          96        54          42
The Monster            78        39          39
The Witch              91        56          35
Into the Forest        77        42          35

“Low-Brow” Films

Film                                                            Critic  Audience  Difference
Hillary's America: The Secret History of the Democratic Party       4        81         -77
The River Thief                                                      0        69         -69
I'm Not Ashamed                                                     22        84         -62
Meet the Blacks                                                     13        74         -61
God's Not Dead 2                                                     9        63         -54

Interactive Plot

Below is a scatterplot of the two scores with a regression line plotted. The dots in blue are those films in the tables above. You can hover over any dot to see the film it represents as well as the audience and critic scores:
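The original interactive chart isn't reproduced here, but something similar can be sketched with plotly's ggplotly() (the scores frame and tooltip settings are my assumptions; the post doesn't say which charting library was actually used):

library(ggplot2)
library(plotly)

# scatterplot with a regression line; hovering shows the film and its scores
p <- ggplot(scores, aes(x=critic, y=audience, text=film)) +
  geom_point() +
  geom_smooth(method="lm")
ggplotly(p, tooltip=c("text", "x", "y"))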

[Interactive scatterplot: audience score plotted against critic score for all 224 films, with the films from the tables above in blue]
I won't do too much interpreting of the results—you can see for yourself where the movies fall by hovering over the dots. But I would be remiss if I didn't point out that the largest difference score belongs to an anti-Hillary Clinton movie: 4% of critics liked it, but somehow 81% of the audience did. Given all of the evidence that pro-Trump bots were all over the Internet in the run-up to the 2016 U.S. presidential election, I would not be surprised if many of those audience votes came from bots.

Apparently I'm a low-brow plebeian; I did not see any of the most “high-brow” movies, according to the metric. Both critics and audiences seemed to love Hidden Figures (saw it, and it was awesome) and Zootopia (still haven't seen it).

Let me know what you think of this “low-brow/high-brow” metric or better ways one could quantify the construct.

Sentiment Analysis of Kanye West's Discography

Introduction

Last time, I did some basic frequency analyses of Kanye West lyrics. I've been using the tidytext package at work a lot recently, and I thought I would apply some of the package's sentiment dictionaries to the Kanye West corpus I have already made (discussed here). The corpus is not big enough to do the analyses by song or by a very specific emotion (such as joy), so I will stick to tracking positive and negative sentiment of Kanye's lyrical content over the course of his career.

Method

For each song, I removed duplicate words so that, for example, the song “Amazing” doesn't have an undue influence on the analyses, given that Kanye says “amazing”—a positive word—about 50 times in it. I allowed duplicate words across an album, but not within the same song.

The tidytext package includes three different sentiment analysis dictionaries. One of the dictionaries, AFINN, assigns each word a score from -5 (very negative) to +5 (very positive). Using this dictionary, I simply took the average score for each album:

library(dplyr)    # wrangling verbs and %>%
library(tidytext) # unnest_tokens() and get_sentiments()
library(tidyr)    # spread(), used further below

afinn <- kanye %>% # starting with full data set
  unnest_tokens(word, lyrics) %>% # what was long string of text, now becomes one word per row
  inner_join(get_sentiments("afinn"), by="word") %>% # joining sentiments with these words
  group_by(song) %>% # grouping dataset by song
  filter(!duplicated(word)) %>% # dropping duplicates within each song (filter() respects the grouping; base subset() would not)
  ungroup() %>% # ungrouping real quick...
  group_by(album) %>% # and now grouping by album
  mutate(sentiment=mean(score)) %>% # getting mean by album
  slice(1) %>% # taking one row per album
  ungroup() %>% # ungrouping
  subset(select=c(album,sentiment)) # only including the album and sentiment columns

The other two dictionaries, Bing and NRC, tag each word as negative or positive. For these, I tallied how many positive and negative words were used per album and subtracted the negative count from the positive count:

bing <- kanye %>% # taking data set
  unnest_tokens(word, lyrics) %>% # making long list of lyrics into one word per row
  inner_join(get_sentiments("bing"), by="word") %>% # joining words with sentiment
  group_by(song) %>% # grouping by song
  filter(!duplicated(word)) %>% # getting rid of duplicated words (within each song)
  ungroup() %>% # ungrouping
  count(album, sentiment) %>% # counting up negative and positive sentiments per album
  spread(sentiment, n, fill=0) %>% # putting negative and positive into different columns
  mutate(sentiment=positive-negative) # subtracting negative from positive

# all of this is same as above, but with a different dictionary
nrc <- kanye %>% 
  unnest_tokens(word, lyrics) %>% 
  inner_join(get_sentiments("nrc"), by="word") %>% # NRC returns one row per word-sentiment tag
  group_by(song) %>% 
  filter(!duplicated(paste(word, sentiment))) %>% # drop repeated words but keep each word's distinct tags
  ungroup() %>% 
  count(album, sentiment) %>%
  spread(sentiment, n, fill=0) %>% # NRC also tags emotions; only the positive and negative columns get used
  mutate(sentiment=positive-negative)

I then merged all these data frames together, standardized the sentiment scores, and averaged the three z-scores together to get an overall sentiment rating for each album:

colnames(afinn)[2] <- "sentiment_afinn" # renaming column will make it easier upon joining

bing <- ungroup(subset(bing, select=c(album,sentiment))) # getting rid of unnecessary columns
colnames(bing)[2] <- "sentiment_bing"

nrc <- ungroup(subset(nrc, select=c(album,sentiment)))
colnames(nrc)[2] <- "sentiment_nrc"

suppressMessages(library(plyr)) # getting plyr temporarily because I like join_all()
album_sent <- join_all(list(afinn, bing, nrc), by="album") # joining all three datasets by album
suppressMessages(detach("package:plyr")) # getting rid of plyr, because it gets in the way of dplyr

# creating composite sentiment score:
album_sent$sent <- as.numeric((scale(album_sent$sentiment_afinn) + 
                               scale(album_sent$sentiment_bing) + 
                               scale(album_sent$sentiment_nrc))/3) # scale() returns a matrix, so flatten to a plain vector
album_sent <- album_sent[-7,c(1,5)] # subsetting data to not include watch throne and only include composite
# reordering the albums in chronological order
album_sent$album <- factor(album_sent$album, levels=levels(album_sent$album)[c(2,4,3,1,5,7,8,6)])

Results

I then plotted the sentiment scores by album, in chronological order. Higher scores represent more positive lyrics:

library(ggplot2)
library(ggthemes) # theme_fivethirtyeight()

ggplot(album_sent, aes(x=album, y=sent))+
  geom_point()+
  geom_line(group=1, linetype="longdash", size=.8, alpha=.5)+
  labs(x="Album", y="Sentiment")+
  theme_fivethirtyeight()+
  scale_x_discrete(labels=c("The College Dropout", "Late Registration", "Graduation",
                            "808s & Heartbreak", "MBDTF", "Yeezus", "The Life of Pablo"))+
  theme(axis.text.x=element_text(angle=45, hjust=1),
        text = element_text(size=12))

[Figure: composite sentiment score by album, in chronological order]

Looks like a drop in happiness after his mother passed as well as after starting to date Kim Kardashian…

This plot jibes with what we know: 808s is the “sad” album. Graduation—with all the talk about achievement, being strong, making hit records, and having a big ego—comes out as the album with the most positive words. This lends some validity to measuring sentiment with all three dictionaries from the tidytext package.

One issue with this “bag-of-words” approach, however, is that it is prone to error at finer levels of analysis, like the song. “Street Lights,” perhaps one of Kanye's saddest songs, came out with one of the most positive scores. Why? The method reads words like “fair” as positive, failing to notice that a negation word (e.g., “not”) always precedes it. One could get around this with n-grams or natural language processing.
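For instance, tokenizing into bigrams makes it possible to flag sentiment words that directly follow “not” (a sketch reusing the kanye data frame and the Bing dictionary; this isn't something I ran for this post):

# tokenize into two-word phrases and keep those that start with "not"
negated <- kanye %>%
  unnest_tokens(bigram, lyrics, token="ngrams", n=2) %>%
  separate(bigram, c("word1", "word2"), sep=" ") %>% # tidyr::separate()
  filter(word1 == "not") %>%
  inner_join(get_sentiments("bing"), by=c("word2"="word")) %>%
  count(word2, sentiment, sort=TRUE) # sentiment words whose meaning was likely flipped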

Nevertheless, there's the trajectory of Kanye's lyrical positivity over time!

Text Mining Kanye's Vocabulary

Introduction

I have been teaching myself how to do some text mining in R, and I thought a fun corpus to look at would be Kanye West lyrics. I'm still learning sentiment analysis, so this post will focus on basic frequencies for words in Kanye West songs. I'm going to look at the lyrics from Kanye's discography (his solo albums and Watch the Throne) to see (a) the words he uses most and (b) which albums have the biggest vocabularies.

Making the corpus

I wanted to scrape a website like Genius.com to automate the data collection, but sadly there were too many issues: How should I handle inconsistent formatting? What about when it says “Verse 1”? How do I edit out lyrics that Kanye didn't rap or sing? Instead, I just did a lot of copying-and-pasting into .txt files.

I edited down the files so that only verses by Kanye were included—I excluded features. I also took out the lyrics that were samples, choruses sung by other people, all spoken-word introductions and outroductions, and skits. As is typical of text mining, I took out “stop words,” which are basically just boring words that I don't want to muck up my analyses (e.g., the, and, or). I also made it so that words like “change,” “changing,” and “changed” all counted as one word.
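In tidytext terms, the tokenizing, stop-word removal, and stemming might look like this (a sketch assuming a kanye data frame with a lyrics column; SnowballC's Porter stemmer is my stand-in for the word-collapsing step):

library(dplyr)
library(tidytext)  # unnest_tokens() and the stop_words data set
library(SnowballC) # wordStem()

# one word per row, minus stop words, with crude stemming
tidy_lyrics <- kanye %>%
  unnest_tokens(word, lyrics) %>%
  anti_join(stop_words, by="word") %>% # drop "the", "and", "or", etc.
  mutate(word = wordStem(word)) # "change", "changing", "changed" all become "chang"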

Like I said above, I included all of Kanye West's solo albums. My friends and I generally consider Watch the Throne to be primarily a Kanye album (save for American Gangster; did Jay-Z ever recover from collaborating with Linkin Park in 2004?), so I included lyrics by Kanye West on this record.

Frequently used words

The first thing I did was look at the words Kanye uses most frequently throughout his discography. He uses 3,294 unique words; 1,741 of these appear only once, 479 twice, and 242 three times.
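The counts behind the figure below are a one-liner with the tidy_lyrics frame sketched above:

# every word used 99 or more times across the discography
word_counts <- tidy_lyrics %>%
  count(word, sort=TRUE) %>%
  filter(n >= 99)

Here is a figure showing all the words he has used 99 or more times: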

[Figure: all words used 99 or more times across the discography]

Kanye West coined the term “hashtag rap,” which, in Kanye's words, involves taking “the 'like' or 'as' out of the metaphor” (think: “Here's another hit—Barry Bonds” instead of “Here's another hit, like Barry Bonds”). Despite this, “like” is the word Kanye uses most throughout his discography. “Know,” “get,” “now,” “just,” “don't,” and “got” follow after. To my eye, many of these words deal with achieving or demanding things.

Which album has the biggest vocabulary?

The next thing I did was create a “uniqueness score” for each album. I started by calculating a score for each song: the number of unique words in the song divided by the total number of words in the song. For example, if a song had 25 unique words and 50 total words, it would have a score of 0.50. The phrase “Hello, hello, everybody” has a score of about 0.67, as there are two unique words (“hello” and “everybody”) among the three total words in the phrase.

Then I simply averaged these song scores within each album. So if an album had four songs with scores of .20, .40, .60, and .80, then the album's “uniqueness score” would be 0.50.
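In dplyr terms, the whole computation is two grouped summaries (again sketched with the hypothetical tidy_lyrics frame):

# per-song uniqueness, then the average per album
uniqueness <- tidy_lyrics %>%
  group_by(album, song) %>%
  summarise(score = n_distinct(word) / n()) %>% # unique words / total words in the song
  summarise(uniqueness = mean(score)) # the second summarise() collapses songs into one row per album

Here is a plot of the albums' “uniqueness scores” (i.e., average vocabulary size), in chronological order: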

[Figure: average “uniqueness score” by album, in chronological order]

He's been remarkably consistent throughout his discography. The only notable difference is 808s & Heartbreak, which makes sense—he does not rap those lyrics, but (mostly) sings them. My next post on this dataset will dig into the content of his vocabulary: How has the content of his lyrics changed over time?