Political Targeting Using k-means Clustering

Political campaigns send certain messages to people based on what the campaigns know about these potential voters, hoping that these specialized messages are more effective than generic messages at convincing people to show up and vote for their canddiate.

Imagine that a Democratic campaign wants to target Democrats for specific types of mailers or phone call conversations, but has limited resources. How could they decide on who gets what type of message? I'll show how I've been playing around with k-means clustering to get some insight into how people might decide what messages to send to which people.

Let's say this campaign could only, at most, afford four different types of messages. We could try to cluster four different types of Democrats, based on some information that a campaign has about the voters. I will use an unrealistic example here—given that it is survey data on specific issues—but I think it is nonetheless interesting and shows what such a simple algorithm is capable of.

My colleagues Chris Crandall, Laura Van Berkel, and I have asked online samples how they feel about specific political issues (e.g., gun laws, aborton, taxes). For the present analyses, I include only people who identify as Democrats, because I'm imagining that I'm trying to target Democratic voters.

I have 179 self-identified Democrats' answers on 17 specific policy questions, as well as how much they identify as liberal to conservative (on a 0 to 100 scale). I ran k-means clustering to these 17 policy questions, specifying four groups (i.e., the most this fictitious campaign could afford).

First, let's look at how many people were in each cluster. We can also look at how much each cluster, on average, identified themselves as liberal (0) to conservative (100):

Cluster Conservatism Size
1 46.32 22
2 31.84 43
3 27.96 47
4 10.45 67

These clusters are ordered by conservatism. We could see each group as just most conservative Democrats to most progressive Democrats, but can we get a more specific picture here?

What I did was create four different plots—one for each cluster—laying out how, on average, each cluster scored on each specific policy items. These items are standardized, which means that a score of 0 in the group means that they had the same opinion as the average Democrat in the sample. This will be important for interpretation. Let's look at Cluster 1. I call these “Religious Conservative Democrats,” as I will explain shortly:

plot of chunk unnamed-chunk-4

In general, these people tend to be more conservative than the average Democrat in our sample. But what really differentiates these people most? Three of the largest deviations from zero show the story: These people are much more against abortion access, much more of the belief that religion is important in everyday life, and much more against gay marriage than the average Democrat. These are not just conservative Democrats, but Democrats that are more conservative due to traditional religious beliefs. If I were advising the campaign, I would say, whatever they choose to do, do not focus their message on progressive stances on abortion access and gay marriage.

Let's turn to the Cluster 2, “Fiscally Conservative Democrats”:

plot of chunk unnamed-chunk-5

The biggest deviations from the average Democrat that these people have are that they are more likely to be against welfare, say that there is too much government spending, and oppose raising taxes. These people also support funding social security, stopping climate change, reducing economic inequality, and government providing healthcare less than the average Democrat. I would suggest targeting these people with social issues: access to abortion, supporting gay rights, funding fewer military programs, and supporting immigration. These people are about the same as the average Democrat on these issues.

Cluster 3 are what I have named “Moderate Democrats”:

plot of chunk unnamed-chunk-6

I almost want to name this group, “Democrats Likely to Agree with All of the Questions,” because they tended to support both conservative and liberal policices more than the average Democrat (except for the death penalty item, strangely). But they can be seen as moderates, or perhaps “ambivalent.” In comparison to the average Democrat, they are both more likely to say we should control borders and immigration and say that we should reduce economic inequality. Theoretically, this group is interesting. We could ask lots of empirical questions about them.

But pragmatically? They could probably be given messages that the candidate is most passionate about, polls best, etc., as they are likely to be more favorable toward any attitude—liberal or otherwise—than the average Democrat.

You might be wondering, “These groups all seem pretty conservative.” Remember that these scores are all relative to the average Democrat. Even if they score more conservatively on an issue, they are likely to support it less than a Republican.

In any case, Cluster 4 (the biggest cluster, about 37% of the sample) are the “Progressive Democrats”:

plot of chunk unnamed-chunk-7

In comparison to the average Democrat, these people support all of the liberal causes more and conservative causes less. I would suggest to those trying to campaign to these people that they would be open to the most progressive issues that the candidate has to offer.

As I mentioned before, this is somewhat of an unrealistic example: Campaigns don't have surveys laying around for registered voters. But there's an increasing amount of information out there that campaigns could use in lieu of directly asking people how they feel about issues: Facebook data, donations to specific causes, signing up for e-mail updates from different political organizations, etc. Information for most campaigns can be sparse, but people are increasingly able to access some of these more proprietary datasets.

k-means clustering is also a remarkably simple way to look at this issue. One line of code runs the algorithm. The command I ran was:

four <- kmeans(data,4)

And then I extracted the clusters for each person with the code:

four$cluster

Even if specific decisions are not made based on these simple cluster analyses, they are easy enough to do that I believe it is a good way to explore data and how respondents can be grouped together. Running multiple analyses specifying different numbers of clusters can help us understand how the people that answer these questions may be organized in pragmatically helpful ways.

Predicting All-NBA Teams

The NBA season is winding down, which means it is award season. Bloggers, podcasters, writers, and television personalities are all releasing who they think should make the First-, Second-, and Third-Team All-NBA squads. What I will do here is try to predict who will make the All-NBA squads. I won't try to predict each particular team, but only if they made one of the three squads or not.

Technical details

I first downloaded all of the individual statistics I could from Basketball-Reference.com for all players in all years from 1989 to 2016. I chose 1989 as the starting point, because it was the first year that the NBA voted on three All-NBA squads instead of two. These included all numbers from the Totals, Per Game, Per 36 Minutes, Per 100 Possessions, and Advanced tables on Basketball-Reference.com.

I trimmed the sample down a little bit by (a) cutting out anyone who was missing data (i.e., many big men did not have three-point percentages, because they did not take a three-point shot all season) and (b) removing noisy data by dropping anyone who played less than 500 minutes in the entire year. This is a cutoff many sites use (e.g., ESPN.com) for leaderboards, and it's a cutoff that I've used in research on the NBA before. This cut out 9 people of the 420 that made the All-NBA teams in my training data (years 1989 through 2013)—not a troubling amount of players. And given that the NBA has shot more and more three-point shots in recent years, I'm not worried about cutting a few big men from the training data who didn't attempt one in an entire season.

Which leads me to the particulars of training the model. I split up the data into four parts. First, the training data (seasons 1989 to 2013). I then tested the model on the three subsequent years separately: 2014, 2015, and 2016. These three subsets made up my testing data. The model classified players as either a “Yes” or “No” for making an All-NBA team. However, the model didn't know that I needed 15 players per year, consisting of 6 guards, 6 forwards, and 3 centers. So to assess model performance, I sorted the predicted probability, based on the model, that a player made the All-NBA team. I took the 6 guards, 6 forwards, and 3 centers with the highest probabilities of making the team and counted them as my predictions for the year.

I tried a number of algorithms: k-NN, C5.0, random forest, and neural networks. I played around with each of these models, changing the number of neighbors or hidden nodes or trees, etc. The random forest and neural networks yielded the same accuracy; I find the random forest to be more interpretable, so I went with that as my model. The default 500 trees and √ p  variables, where p is the number of variables we are using to predict if the player made the All-NBA team (in our case, 10 variables per tree), were just as accurate as any other alternative I tried, so I stuck with the defaults. So let's now turn to how accurate the model was at predicting seasons 2014-2016.

Is the model accurate at predicting years 2014-2016?

Using the criteria above, I predicted 10 of the 15 All-NBA players in 2014, 13 in 2015, and 12 in 2016 correctly for an overall accuracy of about 78%. A lot of people in the NBA world have maligned the fact that the All-NBA teams must include 2 guards, 2 forwards, and 1 center per squad. What if these restrictions were lifted? Which players did the model predict, based solely on their performance, to make the All-NBA team? Let's look year-by-year.

Model-predicted All-NBA players for 2014

Player Position Made Team?
Kevin Durant SF Yes
LeBron James PF Yes
Blake Griffin PF Yes
Kevin Love PF Yes
Carmelo Anthony PF No
James Harden SG Yes
Stephen Curry PG Yes
DeMarcus Cousins C No
Chris Paul PG Yes
Anthony Davis C No
Dirk Nowitzki PF No
Paul George SF Yes
Dwight Howard C Yes
Joakim Noah C Yes
Goran Dragic SG Yes

Model-predicted All-NBA players for 2015

Player Position Made Team?
Russell Westbrook PG Yes
LeBron James SF Yes
James Harden SG Yes
Anthony Davis PF Yes
Chris Paul PG Yes
Stephen Curry PG Yes
DeMarcus Cousins C Yes
Blake Griffin PF Yes
Pau Gasol PF Yes
Damian Lillard PG No
DeAndre Jordan C Yes
LaMarcus Aldridge PF Yes
Kyrie Irving PG Yes
Marc Gasol C Yes
Jimmy Butler SG No

Model-predicted All-NBA players for 2016

Player Position Made Team?
LeBron James SF Yes
Kevin Durant SF Yes
Russell Westbrook PG Yes
Chris Paul PG Yes
James Harden SG No
Stephen Curry PG Yes
Kawhi Leonard SF Yes
DeMarcus Cousins C Yes
Kyle Lowry PG Yes
Damian Lillard PG Yes
DeAndre Jordan C Yes
Isaiah Thomas PG No
Anthony Davis C No
Paul George SF Yes
DeMar DeRozan SG No

Most notably here is James Harden had the 5th highest probability of making the All-NBA team last year—the model gave him a probability of .82. This supports the obvious point that he was one of the biggest snubs for the All-NBA team in the history of the league. The model also tended to classify some players who weren't on very good teams. I didn't include their team's win-loss record in the input variables for the model, because I was not sure how to go about calculating this win-loss percentage for players who were traded mid-season. However, the win-loss record was indirectly included in the win shares for a player.

What were the most important statistics in predicting if people made the All-NBA team or not? We can see this by looking at the Gini index. I looked at the distribution of the Gini index for each predictor variable, and two were much higher than the rest: Player Efficiency Rating (PER) and win shares. These two metrics are popular choices as indicators of overall performance, and it seems like they at least have some predictive validity when it comes to which players people will select for the All-NBA squads.

Predicting this season

Top 25 probabilities for making the All-NBA team

Player Position Probability
LeBron James SF 0.966
Kevin Durant SF 0.928
Kawhi Leonard SF 0.918
Giannis Antetokounmpo SF 0.854
Stephen Curry PG 0.800
Karl-Anthony Towns C 0.778
Jimmy Butler SF 0.762
Russell Westbrook PG 0.748
DeMarcus Cousins C 0.746
Anthony Davis C 0.742
John Wall PG 0.742
Gordon Hayward SF 0.740
DeAndre Jordan C 0.736
DeMar DeRozan SG 0.732
Marc Gasol C 0.716
Rudy Gobert C 0.688
Kyrie Irving PG 0.678
Dwight Howard C 0.662
Blake Griffin PF 0.642
Kyle Lowry PG 0.638
Isaiah Thomas PG 0.632
Andre Drummond C 0.618
James Harden PG 0.604
Kemba Walker PG 0.598
Chris Paul PG 0.574

The top 25 probabilities for making the team this season are listed above, as well as the players' positions. Using the position rules that the NBA requires for the All-NBA teams (discussed above), my predictions for the All-NBA teams, with just under ¾ of the NBA season complete, are:

Forwards

  • LeBron James
  • Kevin Durant
  • Kawhi Leonard
  • Giannis Antetokounmpo
  • Jimmy Butler
  • Gordon Hayward

Guards

  • Stephen Curry
  • Russell Westbrook
  • John Wall
  • DeMar DeRozan
  • Kyrie Irving
  • Kyle Lowry

Centers

  • Karl-Anthony Towns
  • DeMarcus Cousins
  • Anthony Davis