Text Mining Kanye's Vocabulary

Introduction

I have been teaching myself how to do some text mining in R, and I thought a fun corpus to look at would be Kanye West lyrics. I'm still learning sentiment analysis, so this post will focus on basic frequencies for words in Kanye West songs. I'm going to look at the lyrics from Kanye's discrography (his solo albums and Watch the Throne) to see (a) the words he uses most and (b) which albums have the biggest vocabularies.

Making the corpus

I wanted to do scrape a website like Genius.com to automate the data collection, but sadly there were too many issues: How should I handle inconsistent formatting? What about when it says “Verse 1”? How do I edit out lyrics that Kanye didn't rap or sing? I just did a lot of copy-and-pasting into .txt files.

I edited down the files so that only verses by Kanye were included—I excluded features. I also took out the lyrics that were samples, choruses sung by other people, all spoken word introductions and outroductions, and skits. As is typical of text mining, I took out “stop words,” which are basically just boring words that I don't want to muck up my analyses (e.g., the, and, or). I also made it so that words like “change,” “changing,” and “changed” all counted as one word.

Like I said above, I included all of Kanye West's solo albums. My friends and I generally consider Watch the Throne to be primarily a Kanye album (save for American Gangster, Did Jay-Z ever recover from collaborating with Linkin Park in 2004?), so I included lyrics by Kanye West on this record.

Frequently used words

The first thing I did was look at the most frequently used words Kanye uses throughout his discography. He uses 3,294 unique words; he uses 1,741 of these only once, 479 twice, and 242 three times. Here is a figure showing all the words he has used 99 or more times:

plot of chunk unnamed-chunk-69

Kanye West coined the term “hashtag rap,” which, in Kanye's words, involves taking “the 'like' or 'as' out of the metaphor” (think: “Here's another hit—Barry Bonds” instead of “Here's another hit, like Barry Bonds”). Despite this, “like” is the word Kanye uses most throughout his discography. “Know,” “get,” “now,” just,“ "don't,” and “got” follow after. To my eye, this collection of words seems like a lot of words that deal with achieving or demanding things.

Which album has the biggest vocabulary?

The next thing I did was create a “uniqueness score” for each album. I started by calculating a score for each album: the number of unique words in a song divided by the number of words total in a song. For example, if a song had 25 unique words and 50 total words in the song, it would have a score of 0.5. The phrase “Hello, hello, everybody” has a score of 0.66, as there are two unique words (“hello” and “everybody”), and there are a total of three words in the phrase.

Then I simply took these songs and averaged them across album. So if an album was four songs and these songs had scores of .20, .40, .60, and .80, then the album's “uniqueness score” would be 0.50. Here is a plot of the albums' “uniqueness scores” (i.e., average vocabulary size), in chronological order:

plot of chunk unnamed-chunk-71

He's been remarkably consistent throughout his discography. The only notable difference is 808s and Heartbreak, which makes sense—he does not rap the lyrics, but sings them (mostly). What my next post on this dataset will include is the content of his vocabulary: How has the content of his lyrics changed over time?