Analyzing Rudy Gay Trades Using the CausalImpact Package

Introduction

I have been meaning to learn more about time-series analysis and Bayesian methods; I'm pumped for a Bayesian class that I'll be in this coming semester. RStudio blogged about the CausalImpact package—a Bayesian time-series package from folks at Google—back in April, and I've been meaning to play around with it ever since. There's a great talk posted on YouTube that gives a very intuitive description of thinking about causal impact in terms of counterfactuals, as well as of the CausalImpact package itself. I decided I would use it to put some common wisdom to the test: Do NBA teams get better after getting rid of Rudy Gay? I remember a lot of chatter on podcasts and on NBA Twitter after he was traded away from both the Grizzlies and the Raptors.

Method

I went back to the well and scraped Basketball-Reference using the rvest package. For each team that traded Gay mid-season, I fetched all the data from the “Schedule & Results” page and from that calculated a point differential for every game: Positive numbers meant the team in question won by that many points, while negative numbers meant they lost by that many points. I ran the CausalImpact model with no covariates or anything: I just looked at point differential over time. I did this separately for the Grizzlies' 2012-2013 season and the Raptors' 2013-2014 season (both teams traded Rudy mid-season). The pre-treatment period covers the games before the team traded Gay; the post-treatment period covers the games after.
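Here is a minimal sketch of what that call boils down to, assuming a numeric vector point_diff of game-by-game point differentials and that trade_game is the index of the first post-trade game (both names are placeholders, not my actual code):

library(CausalImpact)

pre_period  <- c(1, trade_game - 1)               # games played with Rudy Gay
post_period <- c(trade_game, length(point_diff))  # games played after the trade

model <- CausalImpact(point_diff, pre_period, post_period)

summary(model)            # the summary table
summary(model, "report")  # the plain-language write-up
plot(model)               # observed vs. counterfactual, pointwise, and cumulative plots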

Code for scraping, analyses, and plotting can be accessed over at GitHub.

Results

The package is pretty nice. The output is easy to read and interpret, and they even include little write-ups for you if you specify summary(model, "report"), where model is the name of the model you fit with the CausalImpact function. Let's take a look at the Grizzlies first.

Actual Predicted Difference 95% LB 95% UB
Average 4.4 3.6 0.82 -5 6.6
Cumulative 167.0 135.8 31.22 -190 252.5

The table shows the average and cumulative point differentials. On average, the Grizzlies scored 4.4 more points than their opponents per game after Rudy Gay was traded. Based on what the model learned from when Gay was on the team, we would have predicted this to be 3.6. Their total point differential was 167 after Rudy Gay was traded, when we would have expected about 136. The table also shows the differences: 0.82 and 31.22 points for the average and cumulative figures, respectively. The lower and upper bounds of the 95% interval fall on opposite sides of zero, though, so we can't conclude that the difference is reliably different from zero. The posterior probability of a causal effect (i.e., the probability that this increase was due to Gay leaving the team) is 61%—not a very compelling number. The report generated by the package is rather frequentist—it uses classical null hypothesis significance testing language, saying the effect “would generally not be considered statistically significant,” with a p-value of 0.387. Interesting.

What I really dig about this package are the plots it gives you. This package is based on the idea that it models a counterfactual: What would the team have done had Rudy Gay not been traded? It then compares this predicted counterfactual to what actually happened. Let's look at the plots:

[Figure: CausalImpact plots for the Grizzlies (observed vs. predicted point differential, pointwise difference, and cumulative difference)]

The top panel shows a horizontal dotted line, which is what is predicted given what we know about the team before Gay was traded. I haven't specified any seasonal trends or other predictors, so this line is flat. The black line is what actually happened. The vertical dotted line marks where Rudy Gay was traded. The middle panel shows the difference between predicted and observed. We can see that there is no reliable difference between the two after the Gay trade. Lastly, the bottom panel shows the cumulative difference (that is, adding up all of the differences between observed and predicted over time). Again, this hovers around zero, showing us that there was really no difference between the Grizzlies' actual point differential and what we predicted would have happened had Gay not been traded (i.e., the counterfactual). What about the Raptors?

The Raptors unloaded Gay to the Kings the very next season. Let's take a look at the same table and plot for the Raptors and trading Rudy:

Actual Predicted Difference 95% LB 95% UB
Average 4.4 -0.37 4.8 -0.88 10
Cumulative 279.0 -23.22 302.2 -55.59 657

[Figure: CausalImpact plots for the Raptors (observed vs. predicted point differential, pointwise difference, and cumulative difference)]

The posterior probability of a causal effect here was 95.33%—much more compelling than in the Grizzlies example. The effect was also more than five times bigger than it was for Memphis: There was a difference of 4.8 points per game (or 302 cumulatively) between what we observed and what we would have expected had the Raptors never traded Gay. Given that this effect came from one (at the time, above-average) player leaving the team, that's pretty interesting. I'm sure any team would be happy with getting almost 5 whole points better per game after getting rid of a big salary.

Conclusion

It looks like trading Rudy Gay likely had no effect on the Grizzlies, but it does seem that getting rid of him had a positive effect on the Raptors. The CausalImpact package is very user-friendly, and there are many good materials out there for understanding and interpreting the model and what's going on underneath the hood. Most of the examples I have seen are simulated data or data which are easily interpretable, so it was good practice seeing what a real, noisy dataset actually looks like.

Quantifying "Low-Brow" and "High-Brow" Films

I went and saw Certain Women a few months ago. I was pretty excited to see it; a blurb in the trailer calls it “Triumphant… an indelible portrait of independent women,” which sounds pretty solid to me. The film made a worthwhile point in that it exposed the mundane, everyday ways in which women have to confront sexism. It isn't always a huge dramatic thing that is obvious to everyone—instead, most of the time sexism is commonplace and woven into the routine of our society.

The only problem is that I found the movie, well, pretty boring. Showing how quotidian sexism is in a film makes for a slow-paced, quotidian plot. A few days ago, I happened upon the Rotten Tomatoes entry for the movie. It scored very well with critics (92% liked it), but rather poorly with audiences (52%). It made me think about the divisions between critics and audiences; I thought that the biggest differences between audience and critic scores could be an interesting way to quantify what is “high-brow” and what is “low-brow” film. So I got critic and audience scores for movies released in 2016, plotted them against one another, and looked at where they differed most.

Method

The movies I chose to examine were all listed on the 2016 in film Wikipedia page. The problem was that I needed links to Rotten Tomatoes pages, not just names of movies. So I scraped this table, took the names of the films, and turned them into Google search URLs by taking "https://google.com/search?q=rottentomatoes+2016+" and using paste0 to add the name of the film to the end of the string. Then I wrote a little function (using rvest and magrittr) that takes this Google search URL and fetches the URL of the first search result:

# function for getting first hit from google page
getGoogleFirst <- function(url) {
  url %>% 
    read_html() %>% 
    html_node(".g:nth-child(1) .r a") %>% 
    html_attr("href") %>% 
    strsplit(split="=") %>% 
    getElement(1) %>% 
    strsplit(split="&") %>% 
    getElement(2) %>% 
    getElement(1)
}

After running this through a loop, I got a long vector of Rotten Tomatoes links. Then I fed them into two functions that get the critic and audience scores:

# get rotten tomatoes critic score
rtCritic <- function(url) {
  url %>% 
    read_html() %>% 
    html_node("#tomato_meter_link .superPageFontColor") %>% 
    html_text() %>% 
    strsplit(split="%") %>% 
    as.numeric()
}
# get rotten tomatoes audience score
rtAudience <- function(url) {
  url %>% 
    read_html() %>% 
    html_node(".meter-value .superPageFontColor") %>% 
    html_text() %>% 
    strsplit(split="%") %>% 
    as.numeric()
}

The film names and scores were all put into a data frame.
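For completeness, here is a rough sketch of how those pieces fit together; films is assumed to be the character vector of 2016 film titles pulled from the Wikipedia table:

library(rvest)

search_urls <- paste0("https://google.com/search?q=rottentomatoes+2016+", films)
# (in practice, spaces in film titles may need URLencode())

rt_urls  <- sapply(search_urls, getGoogleFirst)  # first Google hit for each film
critic   <- sapply(rt_urls, rtCritic)            # critic (Tomatometer) scores
audience <- sapply(rt_urls, rtAudience)          # audience scores

movies <- data.frame(film=films, critic=critic, audience=audience,
                     stringsAsFactors=FALSE)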

Results

Overall, I collected data on 224 films. The average critic score was 56.74, while the average audience score was 58.67; audiences tended to be slightly more positive, but this difference of 1.93 points was not statistically significant, t(223) = 1.34, p = .181. Audiences and critics tended to agree: scores from the two groups correlated strongly, r = .68.

But where do audiences and critics disagree most? I calculated a difference score by taking critic - audience scores, such that positive scores meant critics liked the film more than audiences did. The biggest difference scores in each direction are shown in the tables below.
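In code, the difference score and those tables amount to something like this, again using the hypothetical movies data frame sketched above:

library(dplyr)

movies <- movies %>% mutate(difference=critic - audience)

movies %>% arrange(desc(difference)) %>% head(5)  # "high-brow": critics like it far more
movies %>% arrange(difference) %>% head(5)        # "low-brow": audiences like it far more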

“High-Brow” Films

Film Critic Audience Difference
The Monkey King 2 100 49 51
Hail, Caesar! 86 44 42
Little Sister 96 54 42
The Monster 78 39 39
The Witch 91 56 35
Into the Forest 77 42 35

“Low-Brow” Films

Film Critic Audience Difference
Hillary's America: The Secret History of the Democratic Party 4 81 -77
The River Thief 0 69 -69
I'm Not Ashamed 22 84 -62
Meet the Blacks 13 74 -61
God's Not Dead 2 9 63 -54

Interactive Plot

Below is a scatterplot of the two scores with a regression line plotted. The dots in blue are those films in the tables above. You can hover over any dot to see the film it represents as well as the audience and critic scores:



I won't do too much interpreting of the results—you can see for yourself where the movies fall by hovering over the dots. But I would be remiss if I didn't point out that the largest difference score belongs to an anti-Hillary Clinton movie: 4% of critics liked it, but somehow 81% of the audience did. Given all of the evidence that pro-Trump bots were all over the Internet in the run-up to the 2016 U.S. presidential election, I would not be surprised if many of these audience votes came from bots.

Apparently I'm a low-brow plebeian; I did not see any of the five most “high-brow” movies, according to this metric. Both critics and audiences seemed to love Hidden Figures (saw it, and it was awesome) and Zootopia (still haven't seen it).

Let me know what you think of this “low-brow/high-brow” metric or better ways one could quantify the construct.

Sentiment Analysis of Kanye West's Discography

Introduction

Last time, I did some basic frequency analyses of Kanye West lyrics. I've been using the tidytext package at work a lot recently, and I thought I would apply some of the package's sentiment dictionaries to the Kanye West corpus I have already made (discussed here). The corpus is not big enough to do the analyses by song or by a very specific emotion (such as joy), so I will stick to tracking positive and negative sentiment of Kanye's lyrical content over the course of his career.

Method

For each song, I removed duplicate words (so that, for example, the song “Amazing” doesn't have an undue influence on the analyses, given that he says “amazing”—a positive word—about 50 times in it). I allowed duplicate words across songs on the same album, but not within the same song.

The tidytext package includes three different sentiment analysis dictionaries. One of the dictionaries (AFINN) assigns each word a score from -5 (very negative) to +5 (very positive). Using this dictionary, I simply took the average score for each album:

afinn <- kanye %>% # starting with full data set
  unnest_tokens(word, lyrics) %>% # what was long string of text, now becomes one word per row
  inner_join(get_sentiments("afinn"), by="word") %>% # joining sentiments with these words
  group_by(song) %>%  # grouping dataset by song
  subset(!duplicated(word)) %>% # dropping duplicates (by song, since we are grouped by song)
  ungroup() %>% # ungrouping real quick...
  group_by(album) %>% # and now grouping by album
  mutate(sentiment=mean(score)) %>%  # getting mean by album
  slice(1) %>% # taking one row per every album
  ungroup() %>% # ungrouping
  subset(select=c(album,sentiment)) # only including the album and sentiment columns

The other two dictionaries (Bing and NRC) tag each word as negative or positive. For these, I tallied up how many positive and negative words were being used per album and subtracted the negative count from the positive count:

bing <- kanye %>% # taking data set
  unnest_tokens(word, lyrics) %>% # making long list of lyrics into one word per row
  inner_join(get_sentiments("bing"), by="word") %>% # joining words with sentiment
  group_by(song) %>% # grouping by song
  subset(!duplicated(word)) %>% # getting rid of duplicated words (within each song)
  ungroup() %>% # ungrouping
  count(album, sentiment) %>% # counting up negative and positive sentiments per album
  spread(sentiment, n, fill=0) %>% # putting negative and positive into different columns
  mutate(sentiment=positive-negative) # subtracting negative from positive

# all of this is same as above, but with a different dictionary
nrc <- kanye %>% 
  unnest_tokens(word, lyrics) %>% 
  inner_join(get_sentiments("nrc"), by="word") %>%
  group_by(song) %>% 
  subset(!duplicated(word)) %>% 
  ungroup() %>% 
  count(album, sentiment) %>%
  spread(sentiment, n, fill=0) %>% 
  mutate(sentiment=positive-negative)

I then merged all these data frames together, standardized the sentiment scores, and averaged the three z-scores together to get an overall sentiment rating for each album:

colnames(afinn)[2] <- "sentiment_afinn" # renaming column will make it easier upon joining

bing <- ungroup(subset(bing, select=c(album,sentiment))) # getting rid of unnecessary columns
colnames(bing)[2] <- "sentiment_bing"

nrc <- ungroup(subset(nrc, select=c(album,sentiment)))
colnames(nrc)[2] <- "sentiment_nrc"

suppressMessages(library(plyr)) # getting plyr temporarily because I like join_all()
album_sent <- join_all(list(afinn, bing, nrc), by="album") # joining all three datasets by album
suppressMessages(detach("package:plyr")) # getting rid of plyr, because it gets in the way of dplyr

# creating composite sentiment score:
album_sent$sent <- (scale(album_sent$sentiment_afinn) + 
                    scale(album_sent$sentiment_bing) + 
                    scale(album_sent$sentiment_nrc))/3
album_sent <- album_sent[-7,c(1,5)] # subsetting data to not include watch throne and only include composite
# reordering the albums in chronological order
album_sent$album <- factor(album_sent$album, levels=levels(album_sent$album)[c(2,4,3,1,5,7,8,6)])

Results

And then I plotted the sentiment scores by album, in chronological order. Higher scores represent more positive lyrics:

ggplot(album_sent, aes(x=album, y=sent))+
  geom_point()+
  geom_line(group=1, linetype="longdash", size=.8, alpha=.5)+
  labs(x="Album", y="Sentiment")+
  theme_fivethirtyeight()+
  scale_x_discrete(labels=c("The College Dropout", "Late Registration", "Graduation",
                            "808s & Heartbreak", "MBDTF", "Yeezus", "The Life of Pablo"))+
  theme(axis.text.x=element_text(angle=45, hjust=1),
        text = element_text(size=12))

[Figure: Composite sentiment score by album, in chronological order]

Looks like a drop in happiness after his mother passed as well as after starting to date Kim Kardashian…

This plot jibes with what we know: 808s is the “sad” album. Graduation—with all the talk about achievement, being strong, making hit records, and having a big ego—comes out as the album with the most positive words. This plot seems to lend some validity to the method of analyzing sentiment using all three dictionaries from the tidytext package.

One issue with this “bag-of-words” approach, however, is that it is prone to error at finer levels of analysis, like the song. “Street Lights,” perhaps one of Kanye's saddest songs, came out with one of the most positive scores. Why? The method was reading words like “fair” as positive, neglecting to notice that a negation word (e.g., “not”) always preceded it. One could get around this with n-grams or more sophisticated natural language processing.
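For what it's worth, a bigram-based check with tidytext would look something like the sketch below. It assumes the same kanye data frame with a lyrics column, and it simply flags sentiment words that directly follow a negation word:

library(dplyr)
library(tidyr)
library(tidytext)

negations <- c("not", "no", "never", "ain't")

negated_words <- kanye %>% 
  unnest_tokens(bigram, lyrics, token="ngrams", n=2) %>% # two-word phrases, one per row
  separate(bigram, into=c("word1", "word2"), sep=" ") %>% # split each phrase into two columns
  filter(word1 %in% negations) %>% # keep only phrases that start with a negation
  inner_join(get_sentiments("bing"), by=c("word2"="word")) %>% # sentiment of the second word
  count(word1, word2, sentiment, sort=TRUE) # e.g., "not fair" currently scored as positive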

Nevertheless, there's the trajectory of Kanye's lyrical positivity over time!

Text Mining Kanye's Vocabulary

Introduction

I have been teaching myself how to do some text mining in R, and I thought a fun corpus to look at would be Kanye West lyrics. I'm still learning sentiment analysis, so this post will focus on basic frequencies for words in Kanye West songs. I'm going to look at the lyrics from Kanye's discography (his solo albums and Watch the Throne) to see (a) the words he uses most and (b) which albums have the biggest vocabularies.

Making the corpus

I wanted to scrape a website like Genius.com to automate the data collection, but sadly there were too many issues: How should I handle inconsistent formatting? What about when it says “Verse 1”? How do I edit out lyrics that Kanye didn't rap or sing? So I just did a lot of copy-and-pasting into .txt files.

I edited down the files so that only verses by Kanye were included—I excluded features. I also took out lyrics that were samples, choruses sung by other people, all spoken-word introductions and outroductions, and skits. As is typical of text mining, I took out “stop words,” which are basically just boring words that I don't want mucking up my analyses (e.g., the, and, or). I also made it so that words like “change,” “changing,” and “changed” all counted as one word.
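The general idea of that cleaning looks something like the sketch below. This isn't necessarily the exact code I used; it assumes a data frame kanye with one row per song and a lyrics column, and it uses the tidytext stop-word list plus SnowballC stemming:

library(dplyr)
library(tidytext)
library(SnowballC)

tidy_lyrics <- kanye %>% 
  unnest_tokens(word, lyrics) %>% # long string of lyrics becomes one word per row
  anti_join(stop_words, by="word") %>% # dropping stop words like "the", "and", "or"
  mutate(word=wordStem(word)) # "change", "changing", "changed" all reduce to one stem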

Like I said above, I included all of Kanye West's solo albums. My friends and I generally consider Watch the Throne to be primarily a Kanye album (save for American Gangster, did Jay-Z ever recover from collaborating with Linkin Park in 2004?), so I included the lyrics by Kanye West on this record.

Frequently used words

The first thing I did was look at the words Kanye uses most frequently throughout his discography. He uses 3,294 unique words; 1,741 of these appear only once, 479 twice, and 242 three times. Here is a figure showing all the words he has used 99 or more times:

[Figure: All words Kanye West has used 99 or more times across his discography]

Kanye West coined the term “hashtag rap,” which, in Kanye's words, involves taking “the 'like' or 'as' out of the metaphor” (think: “Here's another hit—Barry Bonds” instead of “Here's another hit, like Barry Bonds”). Despite this, “like” is the word Kanye uses most throughout his discography. “Know,” “get,” “now,” “just,” “don't,” and “got” follow after. To my eye, many of these words deal with achieving or demanding things.

Which album has the biggest vocabulary?

The next thing I did was create a “uniqueness score” for each album. I started by calculating a score for each song: the number of unique words in the song divided by the total number of words in the song. For example, if a song had 25 unique words and 50 total words, it would have a score of 0.5. The phrase “Hello, hello, everybody” has a score of about 0.67, as there are two unique words (“hello” and “everybody”) among the three words in the phrase.
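In code, the per-song score looks something like this (a sketch, assuming a tidy data frame with album, song, and lyrics columns):

library(dplyr)
library(tidytext)

song_scores <- kanye %>% 
  unnest_tokens(word, lyrics) %>% # one word per row
  group_by(album, song) %>% 
  summarise(uniqueness=n_distinct(word)/n()) %>% # unique words divided by total words
  ungroup()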

Then I simply averaged these song scores within each album. So if an album had four songs with scores of .20, .40, .60, and .80, the album's “uniqueness score” would be 0.50. Here is a plot of the albums' “uniqueness scores” (i.e., average vocabulary size), in chronological order:

[Figure: Average “uniqueness score” by album, in chronological order]

He's been remarkably consistent throughout his discography. The only notable difference is 808s & Heartbreak, which makes sense—he does not rap the lyrics but (mostly) sings them. My next post on this dataset will look at the content of his vocabulary: How has the content of his lyrics changed over time?

Political Targeting Using k-means Clustering

Political campaigns send certain messages to people based on what the campaigns know about these potential voters, hoping that these specialized messages are more effective than generic messages at convincing people to show up and vote for their candidate.

Imagine that a Democratic campaign wants to target Democrats with specific types of mailers or phone-call conversations, but has limited resources. How could they decide who gets what type of message? I'll show how I've been playing around with k-means clustering to get some insight into how a campaign might decide which messages to send to which people.

Let's say this campaign could only, at most, afford four different types of messages. We could try to cluster four different types of Democrats, based on some information that a campaign has about the voters. I will use an unrealistic example here—given that it is survey data on specific issues—but I think it is nonetheless interesting and shows what such a simple algorithm is capable of.

My colleagues Chris Crandall, Laura Van Berkel, and I have asked online samples how they feel about specific political issues (e.g., gun laws, abortion, taxes). For the present analyses, I include only people who identify as Democrats, because I'm imagining that I'm trying to target Democratic voters.

I have 179 self-identified Democrats' answers to 17 specific policy questions, as well as how much they identify as liberal to conservative (on a 0 to 100 scale). I ran k-means clustering on these 17 policy questions, specifying four groups (i.e., the most this fictitious campaign could afford).

First, let's look at how many people were in each cluster. We can also look at where each cluster, on average, placed itself on the scale from liberal (0) to conservative (100):

Cluster Conservatism Size
1 46.32 22
2 31.84 43
3 27.96 47
4 10.45 67

These clusters are ordered by conservatism. We could see the groups as simply ranging from the most conservative Democrats to the most progressive Democrats, but can we get a more specific picture?

What I did was create four different plots—one for each cluster—laying out how, on average, each cluster scored on each specific policy item. The items are standardized, so a score of 0 means that the cluster held the same opinion as the average Democrat in the sample. This will be important for interpretation. Let's look at Cluster 1. I call these “Religious Conservative Democrats,” as I will explain shortly:

[Figure: Standardized policy positions for Cluster 1, “Religious Conservative Democrats”]

In general, these people tend to be more conservative than the average Democrat in our sample. But what really differentiates them? Three of the largest deviations from zero tell the story: These people are much more against abortion access, much more of the belief that religion is important in everyday life, and much more against gay marriage than the average Democrat. These are not just conservative Democrats, but Democrats who are more conservative due to traditional religious beliefs. If I were advising the campaign, I would say that, whatever they choose to do, they should not focus their message on progressive stances on abortion access and gay marriage.

Let's turn to Cluster 2, the “Fiscally Conservative Democrats”:

[Figure: Standardized policy positions for Cluster 2, “Fiscally Conservative Democrats”]

The biggest deviations these people show from the average Democrat are that they are more likely to be against welfare, to say that there is too much government spending, and to oppose raising taxes. They are also less supportive than the average Democrat of funding Social Security, stopping climate change, reducing economic inequality, and the government providing healthcare. I would suggest targeting these people with social issues: access to abortion, supporting gay rights, funding fewer military programs, and supporting immigration. These people are about the same as the average Democrat on those issues.

Cluster 3 is what I have named the “Moderate Democrats”:

[Figure: Standardized policy positions for Cluster 3, “Moderate Democrats”]

I almost want to name this group “Democrats Likely to Agree with All of the Questions,” because they tended to support both conservative and liberal policies more than the average Democrat (except for the death penalty item, strangely). But they can be seen as moderates, or perhaps “ambivalent.” In comparison to the average Democrat, they are both more likely to say we should control borders and immigration and more likely to say that we should reduce economic inequality. Theoretically, this group is interesting. We could ask lots of empirical questions about them.

But pragmatically? They could probably be given messages that the candidate is most passionate about, polls best, etc., as they are likely to be more favorable toward any attitude—liberal or otherwise—than the average Democrat.

You might be thinking, “These groups all seem pretty conservative.” Remember that these scores are all relative to the average Democrat. Even if a cluster scores more conservatively on an issue, its members are still likely to support the conservative position less than a Republican would.

In any case, Cluster 4 (the biggest cluster, at about 37% of the sample) is what I call the “Progressive Democrats”:

[Figure: Standardized policy positions for Cluster 4, “Progressive Democrats”]

In comparison to the average Democrat, these people support all of the liberal causes more and conservative causes less. I would suggest to those trying to campaign to these people that they would be open to the most progressive issues that the candidate has to offer.

As I mentioned before, this is a somewhat unrealistic example: Campaigns don't have surveys lying around for registered voters. But there's an increasing amount of information out there that campaigns could use in lieu of directly asking people how they feel about issues: Facebook data, donations to specific causes, sign-ups for e-mail updates from different political organizations, etc. Information for most campaigns can be sparse, but campaigns are increasingly able to access some of these more proprietary datasets.

k-means clustering is also a remarkably simple way to look at this issue. One line of code runs the algorithm. The command I ran was:

four <- kmeans(data, 4)

And then I extracted the clusters for each person with the code:

four$cluster
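A slightly fuller sketch of the workflow (standardizing the items, running k-means with several random starts, and profiling the clusters) looks like this; the object names policy_items and democrats are placeholders, not my actual variable names:

set.seed(1839) # k-means depends on random starting values

z <- scale(policy_items) # standardize the 17 policy items

four <- kmeans(z, centers=4, nstart=25) # run k-means 25 times, keep the best solution

table(four$cluster) # cluster sizes
round(four$centers, 2) # standardized item means per cluster (the profiles plotted above)
tapply(democrats$conservatism, four$cluster, mean) # average self-placement per cluster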

Even if specific decisions are not made based on these simple cluster analyses, they are easy enough to run that I believe they are a good way to explore data and see how respondents can be grouped together. Running multiple analyses specifying different numbers of clusters can help us understand how the people who answer these questions may be organized in pragmatically helpful ways.

Predicting All-NBA Teams

The NBA season is winding down, which means it is award season. Bloggers, podcasters, writers, and television personalities are all releasing who they think should make the First-, Second-, and Third-Team All-NBA squads. What I will do here is try to predict who will make the All-NBA squads. I won't try to predict which particular squad a player makes, only whether or not he makes one of the three.

Technical details

I first downloaded all of the individual statistics I could from Basketball-Reference.com for all players in all years from 1989 to 2016. I chose 1989 as the starting point, because it was the first year that the NBA voted on three All-NBA squads instead of two. These included all numbers from the Totals, Per Game, Per 36 Minutes, Per 100 Possessions, and Advanced tables on Basketball-Reference.com.

I trimmed the sample down a little bit by (a) cutting out anyone who was missing data (e.g., many big men did not have three-point percentages, because they did not take a three-point shot all season) and (b) removing noisy data by dropping anyone who played fewer than 500 minutes in the entire year. This is a cutoff many sites (e.g., ESPN.com) use for leaderboards, and it's a cutoff that I've used in research on the NBA before. This cut out 9 of the 420 players who made the All-NBA teams in my training data (years 1989 through 2013)—not a troubling number. And given that the NBA has shot more and more three-pointers in recent years, I'm not worried about cutting a few big men from the training data who didn't attempt one in an entire season.
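The trimming itself is straightforward; here is a sketch, assuming a data frame players with one row per player-season and a total-minutes column MP (both names are placeholders):

library(dplyr)
library(tidyr)

players_trimmed <- players %>% 
  drop_na() %>% # drop anyone missing a statistic (e.g., no three-point attempts all year)
  filter(MP >= 500) # keep only player-seasons with at least 500 minutes played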

Which leads me to the particulars of training the model. I split the data into four parts: the training data (seasons 1989 to 2013) and three test sets, one for each of the subsequent years (2014, 2015, and 2016). The model classified players as either a “Yes” or a “No” for making an All-NBA team. However, the model didn't know that I needed 15 players per year, consisting of 6 guards, 6 forwards, and 3 centers. So to assess model performance, I sorted players by their model-predicted probability of making an All-NBA team and took the 6 guards, 6 forwards, and 3 centers with the highest probabilities as my predictions for that year.
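Here is a sketch of that selection step, assuming a data frame preds with a position-group column position_group (“G”, “F”, or “C”) and a column prob of model-predicted probabilities; both names are placeholders:

library(dplyr)

slots <- c(G=6, F=6, C=3) # three squads of 2 guards, 2 forwards, and 1 center each

predicted_team <- preds %>% 
  arrange(desc(prob)) %>% # highest predicted probabilities first
  group_by(position_group) %>% 
  filter(row_number() <= slots[position_group]) %>% # keep the top 6 G, 6 F, and 3 C
  ungroup()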

I tried a number of algorithms: k-NN, C5.0, random forest, and neural networks. I played around with each of these models, changing the number of neighbors, hidden nodes, trees, etc. The random forest and neural networks yielded the same accuracy; I find the random forest more interpretable, so I went with that as my model. The defaults of 500 trees and √p variables per split, where p is the number of variables used to predict whether the player made the All-NBA team (in our case, 10 variables per tree), were just as accurate as any alternative I tried, so I stuck with them. So let's now turn to how accurate the model was at predicting seasons 2014-2016.
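A sketch of fitting that kind of model with the randomForest package looks like the following; the data frame and column names are placeholders, not my exact code:

library(randomForest)

set.seed(1839) # results vary slightly from run to run without a seed

rf_fit <- randomForest(
  all_nba ~ ., # factor ("Yes"/"No") predicted from every other column
  data=train_data, # seasons 1989 through 2013
  ntree=500 # the default; mtry also defaults to the square root of the number of predictors
)

# predicted probabilities of making an All-NBA team for a held-out season
probs_2014 <- predict(rf_fit, newdata=test_2014, type="prob")[, "Yes"]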

Is the model accurate at predicting years 2014-2016?

Using the criteria above, I correctly predicted 10 of the 15 All-NBA players in 2014, 13 in 2015, and 12 in 2016, for an overall accuracy of about 78%. A lot of people in the NBA world have criticized the requirement that the All-NBA teams include 2 guards, 2 forwards, and 1 center per squad. What if these restrictions were lifted? Which players did the model predict, based solely on their performance, to make the All-NBA team? Let's look year by year.

Model-predicted All-NBA players for 2014

Player Position Made Team?
Kevin Durant SF Yes
LeBron James PF Yes
Blake Griffin PF Yes
Kevin Love PF Yes
Carmelo Anthony PF No
James Harden SG Yes
Stephen Curry PG Yes
DeMarcus Cousins C No
Chris Paul PG Yes
Anthony Davis C No
Dirk Nowitzki PF No
Paul George SF Yes
Dwight Howard C Yes
Joakim Noah C Yes
Goran Dragic SG Yes

Model-predicted All-NBA players for 2015

Player Position Made Team?
Russell Westbrook PG Yes
LeBron James SF Yes
James Harden SG Yes
Anthony Davis PF Yes
Chris Paul PG Yes
Stephen Curry PG Yes
DeMarcus Cousins C Yes
Blake Griffin PF Yes
Pau Gasol PF Yes
Damian Lillard PG No
DeAndre Jordan C Yes
LaMarcus Aldridge PF Yes
Kyrie Irving PG Yes
Marc Gasol C Yes
Jimmy Butler SG No

Model-predicted All-NBA players for 2016

Player Position Made Team?
LeBron James SF Yes
Kevin Durant SF Yes
Russell Westbrook PG Yes
Chris Paul PG Yes
James Harden SG No
Stephen Curry PG Yes
Kawhi Leonard SF Yes
DeMarcus Cousins C Yes
Kyle Lowry PG Yes
Damian Lillard PG Yes
DeAndre Jordan C Yes
Isaiah Thomas PG No
Anthony Davis C No
Paul George SF Yes
DeMar DeRozan SG No

Most notable here is that James Harden had the 5th-highest probability of making the All-NBA team last year—the model gave him a probability of .82. This supports the obvious point that he was one of the biggest snubs for the All-NBA team in the history of the league. The model also tended to pick some players who weren't on very good teams. I didn't include a team's win-loss record in the input variables, because I was not sure how to calculate a win-loss percentage for players who were traded mid-season. However, the win-loss record is indirectly reflected in a player's win shares.

What were the most important statistics in predicting whether players made the All-NBA team or not? We can see this by looking at variable importance, measured by the mean decrease in the Gini index. I looked at this importance measure for each predictor variable, and two were much higher than the rest: Player Efficiency Rating (PER) and win shares. These two metrics are popular choices as indicators of overall performance, and it seems they have at least some predictive validity when it comes to which players people will select for the All-NBA squads.
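With the randomForest package, that check looks something like this (again assuming the rf_fit object from the earlier sketch):

library(randomForest)

imp <- importance(rf_fit) # one MeanDecreaseGini value per predictor
head(imp[order(imp[, "MeanDecreaseGini"], decreasing=TRUE), , drop=FALSE]) # top predictors
varImpPlot(rf_fit) # dot plot of the same ranking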

Predicting this season

Top 25 probabilities for making the All-NBA team

Player Position Probability
LeBron James SF 0.966
Kevin Durant SF 0.928
Kawhi Leonard SF 0.918
Giannis Antetokounmpo SF 0.854
Stephen Curry PG 0.800
Karl-Anthony Towns C 0.778
Jimmy Butler SF 0.762
Russell Westbrook PG 0.748
DeMarcus Cousins C 0.746
Anthony Davis C 0.742
John Wall PG 0.742
Gordon Hayward SF 0.740
DeAndre Jordan C 0.736
DeMar DeRozan SG 0.732
Marc Gasol C 0.716
Rudy Gobert C 0.688
Kyrie Irving PG 0.678
Dwight Howard C 0.662
Blake Griffin PF 0.642
Kyle Lowry PG 0.638
Isaiah Thomas PG 0.632
Andre Drummond C 0.618
James Harden PG 0.604
Kemba Walker PG 0.598
Chris Paul PG 0.574

The top 25 probabilities for making the team this season are listed above, as well as the players' positions. Using the position rules that the NBA requires for the All-NBA teams (discussed above), my predictions for the All-NBA teams, with just under ¾ of the NBA season complete, are:

Forwards

  • LeBron James
  • Kevin Durant
  • Kawhi Leonard
  • Giannis Antetokounmpo
  • Jimmy Butler
  • Gordon Hayward

Guards

  • Stephen Curry
  • Russell Westbrook
  • John Wall
  • DeMar DeRozan
  • Kyrie Irving
  • Kyle Lowry

Centers

  • Karl-Anthony Towns
  • DeMarcus Cousins
  • Anthony Davis