Political campaigns tailor messages to people based on what the campaigns know about these potential voters, hoping that these specialized messages are more effective than generic ones at convincing people to show up and vote for their candidate.

Imagine that a Democratic campaign wants to target Democrats for specific types of mailers or phone call conversations, but has limited resources. How could they decide who gets which type of message? I'll show how I've been playing around with k-means clustering to get some insight into how a campaign might decide what messages to send to which people.

Let's say this campaign could afford at most four different types of messages. We could try to cluster Democrats into four types, based on some information that a campaign has about the voters. I will use an unrealistic example here (it is survey data on specific issues), but I think it is nonetheless interesting and shows what such a simple algorithm is capable of.

My colleagues Chris Crandall, Laura Van Berkel, and I have asked online samples how they feel about specific political issues (e.g., gun laws, abortion, taxes). For the present analyses, I include only people who identify as Democrats, because I'm imagining that I'm trying to target Democratic voters.

I have 179 self-identified Democrats' answers on 17 specific policy questions, as well as where they place themselves from liberal to conservative (on a 0 to 100 scale). I ran k-means clustering on these 17 policy questions, specifying four groups (i.e., the most this fictitious campaign could afford).

First, let's look at how many people were in each cluster. We can also look at how much each cluster, on average, identified themselves as liberal (0) to conservative (100):

Cluster	Conservatism	Size
1	46.32	22
2	31.84	43
3	27.96	47
4	10.45	67
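A table like the one above could be put together along these lines. This is only a sketch: the data are simulated stand-ins for the survey, and the names `policy`, `conservatism`, and `summary_table` are placeholders I made up, not the objects from the original analysis. The `nstart = 25` argument is also my addition, so that k-means tries several random starting points.

```r
# Simulated stand-in for the survey: 179 respondents, 17 policy items,
# plus a 0-100 liberal-to-conservative self-rating. All values are made up.
set.seed(1)
policy <- as.data.frame(matrix(rnorm(179 * 17), nrow = 179))
names(policy) <- paste0("item", 1:17)
conservatism <- runif(179, min = 0, max = 100)

# k-means on the policy items only; nstart = 25 tries several random starts
four <- kmeans(policy, centers = 4, nstart = 25)

# One row per cluster: mean self-rated conservatism and cluster size
summary_table <- data.frame(
  cluster      = 1:4,
  conservatism = round(as.numeric(tapply(conservatism, four$cluster, mean)), 2),
  size         = as.integer(table(four$cluster))
)

# Order from most to least conservative, as in the table above
summary_table <- summary_table[order(-summary_table$conservatism), ]
print(summary_table)
```

Note that the self-rating is left out of the clustering itself and only used afterward to describe the clusters, which matches the description above.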

These clusters are ordered by conservatism. We could see each group as just most conservative Democrats to most progressive Democrats, but can we get a more specific picture here?

What I did was create four different plots, one for each cluster, laying out how, on average, each cluster scored on each specific policy item. These items are standardized, which means that a score of 0 in the group means that they had the same opinion as the average Democrat in the sample. This will be important for interpretation. Let's look at Cluster 1. I call these “Religious Conservative Democrats,” as I will explain shortly:

[Plot: Cluster 1 average standardized score on each policy item]

In general, these people tend to be more conservative than the average Democrat in our sample. But what really differentiates these people most? Three of the largest deviations from zero tell the story: These people are much more against abortion access, much more of the belief that religion is important in everyday life, and much more against gay marriage than the average Democrat. These are not just conservative Democrats, but Democrats who are more conservative due to traditional religious beliefs. If I were advising the campaign, I would say that, whatever else it chooses to do, it should not focus its message to this group on progressive stances on abortion access and gay marriage.
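Finding the "largest deviations from zero" for a cluster can be sketched as follows. The data here are simulated, the item names are placeholders, and I am assuming the items are standardized before clustering; the post does not say whether standardization happened before or after.

```r
set.seed(2)
# Hypothetical data: 179 respondents by 17 items; values are simulated
policy <- matrix(rnorm(179 * 17), nrow = 179,
                 dimnames = list(NULL, paste0("item", 1:17)))

# Standardize each item so 0 is the sample average and 1 is one SD
z <- scale(policy)

four <- kmeans(z, centers = 4, nstart = 25)

# four$centers holds each cluster's mean on each standardized item
profile <- four$centers[1, ]

# Items where cluster 1 sits farthest from the sample average
top3 <- head(sort(abs(profile), decreasing = TRUE), 3)
print(profile[names(top3)])
```

Printing `profile[names(top3)]` rather than `top3` keeps the sign, so you can see whether the cluster is above or below the average Democrat on each item.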

Let's turn to Cluster 2, “Fiscally Conservative Democrats”:

[Plot: Cluster 2 average standardized score on each policy item]

The biggest deviations from the average Democrat are that these people are more likely to be against welfare, to say that there is too much government spending, and to oppose raising taxes. They are also less supportive than the average Democrat of funding social security, stopping climate change, reducing economic inequality, and government providing healthcare. I would suggest targeting these people with social issues: access to abortion, supporting gay rights, funding fewer military programs, and supporting immigration. These people are about the same as the average Democrat on these issues.

Cluster 3 are what I have named “Moderate Democrats”:

[Plot: Cluster 3 average standardized score on each policy item]

I almost want to name this group “Democrats Likely to Agree with All of the Questions,” because they tended to support both conservative and liberal policies more than the average Democrat (except for the death penalty item, strangely). But they can be seen as moderates, or perhaps “ambivalent.” In comparison to the average Democrat, they are more likely both to say we should control borders and immigration and to say that we should reduce economic inequality. Theoretically, this group is interesting. We could ask lots of empirical questions about them.

But pragmatically? They could probably be given messages on whatever the candidate is most passionate about, whatever polls best, etc., as they are likely to be more favorable toward any position, liberal or otherwise, than the average Democrat.

You might be thinking, “These groups all seem pretty conservative.” Remember that these scores are all relative to the average Democrat. Even where a cluster scores more conservatively than the average Democrat on an issue, its members are likely still less conservative on it than a typical Republican.

In any case, Cluster 4 (the biggest cluster, about 37% of the sample) are the “Progressive Democrats”:

[Plot: Cluster 4 average standardized score on each policy item]

In comparison to the average Democrat, these people support all of the liberal causes more and conservative causes less. My suggestion to anyone campaigning to these people is that they are likely open to the most progressive positions the candidate has to offer.

As I mentioned before, this is somewhat of an unrealistic example: Campaigns don't have surveys lying around for registered voters. But there's an increasing amount of information out there that campaigns could use in lieu of directly asking people how they feel about issues: Facebook data, donations to specific causes, signing up for e-mail updates from different political organizations, etc. Information for most campaigns can be sparse, but people are increasingly able to access some of these more proprietary datasets.

k-means clustering is also a remarkably simple way to look at this issue. One line of code runs the algorithm. The command I ran was:

four <- kmeans(data,4)

And then I extracted the clusters for each person with the code:

four$cluster

Even if specific decisions are not made based on these simple cluster analyses, they are easy enough to do that I believe they are a good way to explore data and see how respondents can be grouped together. Running multiple analyses specifying different numbers of clusters can help us understand how the people who answer these questions may be organized in pragmatically helpful ways.
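Trying several numbers of clusters takes only a few more lines. Here is a rough sketch on simulated data; the `data` matrix below is a made-up placeholder, and the range of 2 to 6 clusters and `nstart = 25` are my own choices for illustration.

```r
set.seed(3)
data <- matrix(rnorm(179 * 17), nrow = 179)  # simulated placeholder data

# Fit k-means once for each candidate number of clusters
ks <- 2:6
fits <- lapply(ks, function(k) kmeans(data, centers = k, nstart = 25))

# Compare solutions: total within-cluster sum of squares and the size of
# the smallest cluster (tiny clusters may not be worth a separate message)
comparison <- data.frame(
  k             = ks,
  within_ss     = sapply(fits, function(f) f$tot.withinss),
  smallest_size = sapply(fits, function(f) min(f$size))
)
print(comparison)
```

The within-cluster sum of squares generally shrinks as clusters are added, so it is the size of the drop, together with whether the clusters are interpretable, that matters more than the raw number.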

The NBA season is winding down, which means it is award season. Bloggers, podcasters, writers, and television personalities are all announcing who they think should make the First-, Second-, and Third-Team All-NBA squads. What I will do here is try to predict who will make the All-NBA squads. I won't try to predict each particular team, only whether a player made one of the three squads or not.

Technical details

I first downloaded all of the individual statistics I could from Basketball-Reference.com for all players in all years from 1989 to 2016. I chose 1989 as the starting point, because it was the first year that the NBA voted on three All-NBA squads instead of two. These included all numbers from the Totals, Per Game, Per 36 Minutes, Per 100 Possessions, and Advanced tables on Basketball-Reference.com.

I trimmed the sample down a little bit by (a) cutting out anyone who was missing data (e.g., many big men did not have three-point percentages, because they did not take a three-point shot all season) and (b) removing noisy data by dropping anyone who played fewer than 500 minutes in the entire year. This is a cutoff many sites use (e.g., ESPN.com) for leaderboards, and it's a cutoff that I've used in research on the NBA before. This cut out 9 of the 420 players who made the All-NBA teams in my training data (years 1989 through 2013), which is not a troubling number. And given that NBA players have taken more and more three-point shots in recent years, I'm not worried about cutting a few big men from the training data who didn't attempt one in an entire season.
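The two trimming steps could look something like this. The column names (`MP` for minutes played, `X3P_pct` for three-point percentage) and the four rows are hypothetical, chosen only to show each rule dropping one player.

```r
# Hypothetical column names and made-up rows, for illustration only
players <- data.frame(
  player  = c("A", "B", "C", "D"),
  MP      = c(2500, 310, 1800, 900),
  X3P_pct = c(0.38, 0.33, NA, 0.29),
  stringsAsFactors = FALSE
)

# (a) drop rows with any missing value, (b) keep players with 500+ minutes
kept <- players[complete.cases(players) & players$MP >= 500, ]
print(kept$player)
```

Player B is dropped for playing fewer than 500 minutes, and player C is dropped for having no three-point percentage.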

Now for the particulars of training the model. I split the data into four parts. The first was the training data (seasons 1989 to 2013). I then tested the model on the three subsequent years separately: 2014, 2015, and 2016. These three subsets made up my testing data. The model classified players as either a “Yes” or “No” for making an All-NBA team. However, the model didn't know that I needed 15 players per year, consisting of 6 guards, 6 forwards, and 3 centers. So to assess model performance, I sorted players by the model's predicted probability that they made the All-NBA team. I then took the 6 guards, 6 forwards, and 3 centers with the highest probabilities and counted them as my predictions for the year.

I tried a number of algorithms: k-NN, C5.0, random forest, and neural networks. I played around with each of these models, changing the number of neighbors or hidden nodes or trees, etc. The random forest and neural networks yielded the same accuracy; I find the random forest to be more interpretable, so I went with that as my model. The defaults of 500 trees and √p variables tried at each split, where p is the number of variables we are using to predict if the player made the All-NBA team (in our case, 10 variables per split), were just as accurate as any other alternative I tried, so I stuck with the defaults. So let's now turn to how accurate the model was at predicting seasons 2014-2016.
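As a side note on the √p default: assuming the model was fit with the randomForest package, whose default for classification rounds √p down, a value of 10 tells us roughly how many predictors went in. This is a back-of-the-envelope check, not something taken from the original analysis.

```r
# Assumption: mtry = floor(sqrt(p)), the randomForest classification default
p <- 1:200

# Which predictor counts would give 10 variables tried at each split?
consistent_p <- p[floor(sqrt(p)) == 10]

# Smallest and largest predictor counts consistent with mtry = 10
print(range(consistent_p))
```

Under that assumption, anywhere from 100 to 120 predictors is consistent with 10 variables per split.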

Is the model accurate at predicting years 2014-2016?

Using the criteria above, I correctly predicted 10 of the 15 All-NBA players in 2014, 13 in 2015, and 12 in 2016, for an overall accuracy of about 78% (35 of 45). A lot of people in the NBA world have criticized the requirement that each All-NBA squad include 2 guards, 2 forwards, and 1 center. What if these restrictions were lifted? Which players did the model predict, based solely on their performance, to make the All-NBA team? Let's look year-by-year.

Model-predicted All-NBA players for 2014

Player	Position	Made Team?
Kevin Durant	SF	Yes
LeBron James	PF	Yes
Blake Griffin	PF	Yes
Kevin Love	PF	Yes
Carmelo Anthony	PF	No
James Harden	SG	Yes
Stephen Curry	PG	Yes
DeMarcus Cousins	C	No
Chris Paul	PG	Yes
Anthony Davis	C	No
Dirk Nowitzki	PF	No
Paul George	SF	Yes
Dwight Howard	C	Yes
Joakim Noah	C	Yes
Goran Dragic	SG	Yes

Model-predicted All-NBA players for 2015

Player	Position	Made Team?
Russell Westbrook	PG	Yes
LeBron James	SF	Yes
James Harden	SG	Yes
Anthony Davis	PF	Yes
Chris Paul	PG	Yes
Stephen Curry	PG	Yes
DeMarcus Cousins	C	Yes
Blake Griffin	PF	Yes
Pau Gasol	PF	Yes
Damian Lillard	PG	No
DeAndre Jordan	C	Yes
LaMarcus Aldridge	PF	Yes
Kyrie Irving	PG	Yes
Marc Gasol	C	Yes
Jimmy Butler	SG	No

Model-predicted All-NBA players for 2016

Player	Position	Made Team?
LeBron James	SF	Yes
Kevin Durant	SF	Yes
Russell Westbrook	PG	Yes
Chris Paul	PG	Yes
James Harden	SG	No
Stephen Curry	PG	Yes
Kawhi Leonard	SF	Yes
DeMarcus Cousins	C	Yes
Kyle Lowry	PG	Yes
Damian Lillard	PG	Yes
DeAndre Jordan	C	Yes
Isaiah Thomas	PG	No
Anthony Davis	C	No
Paul George	SF	Yes
DeMar DeRozan	SG	No

Most notable here is that James Harden had the 5th-highest probability of making the All-NBA team last year: the model gave him a probability of .82. This supports the obvious point that he was one of the biggest snubs for the All-NBA team in the history of the league. The model also tended to pick some players who weren't on very good teams. I didn't include team win-loss record among the model's input variables, because I was not sure how to calculate a win-loss percentage for players who were traded mid-season. However, the win-loss record was indirectly included through each player's win shares.

What were the most important statistics in predicting whether players made the All-NBA team or not? We can see this by looking at the Gini index. I looked at the distribution of the Gini index for each predictor variable, and two were much higher than the rest: Player Efficiency Rating (PER) and win shares. These two metrics are popular choices as indicators of overall performance, and it seems like they at least have some predictive validity when it comes to which players people will select for the All-NBA squads.

Predicting this season

Top 25 probabilities for making the All-NBA team

Player	Position	Probability
LeBron James	SF	0.966
Kevin Durant	SF	0.928
Kawhi Leonard	SF	0.918
Giannis Antetokounmpo	SF	0.854
Stephen Curry	PG	0.800
Karl-Anthony Towns	C	0.778
Jimmy Butler	SF	0.762
Russell Westbrook	PG	0.748
DeMarcus Cousins	C	0.746
Anthony Davis	C	0.742
John Wall	PG	0.742
Gordon Hayward	SF	0.740
DeAndre Jordan	C	0.736
DeMar DeRozan	SG	0.732
Marc Gasol	C	0.716
Rudy Gobert	C	0.688
Kyrie Irving	PG	0.678
Dwight Howard	C	0.662
Blake Griffin	PF	0.642
Kyle Lowry	PG	0.638
Isaiah Thomas	PG	0.632
Andre Drummond	C	0.618
James Harden	PG	0.604
Kemba Walker	PG	0.598
Chris Paul	PG	0.574

The top 25 probabilities for making the team this season are listed above, as well as the players' positions. Using the position rules that the NBA requires for the All-NBA teams (discussed above), my predictions for the All-NBA teams, with just under ¾ of the NBA season complete, are:

Forwards

LeBron James
Kevin Durant
Kawhi Leonard
Giannis Antetokounmpo
Jimmy Butler
Gordon Hayward

Guards

Stephen Curry
Russell Westbrook
John Wall
DeMar DeRozan
Kyrie Irving
Kyle Lowry

Centers

Karl-Anthony Towns
DeMarcus Cousins
Anthony Davis
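The position rule behind these lists can be sketched in a few lines, using the top-25 probabilities from the table above. The mapping of PG and SG to guard and SF and PF to forward is my assumption about how positions were grouped, and the names `top25`, `slot`, `quota`, and `pick` are made up for this illustration.

```r
# Top-25 probabilities copied from the table above
top25 <- data.frame(
  player = c("LeBron James", "Kevin Durant", "Kawhi Leonard",
             "Giannis Antetokounmpo", "Stephen Curry", "Karl-Anthony Towns",
             "Jimmy Butler", "Russell Westbrook", "DeMarcus Cousins",
             "Anthony Davis", "John Wall", "Gordon Hayward", "DeAndre Jordan",
             "DeMar DeRozan", "Marc Gasol", "Rudy Gobert", "Kyrie Irving",
             "Dwight Howard", "Blake Griffin", "Kyle Lowry", "Isaiah Thomas",
             "Andre Drummond", "James Harden", "Kemba Walker", "Chris Paul"),
  position = c("SF", "SF", "SF", "SF", "PG", "C", "SF", "PG", "C", "C", "PG",
               "SF", "C", "SG", "C", "C", "PG", "C", "PF", "PG", "PG", "C",
               "PG", "PG", "PG"),
  prob = c(0.966, 0.928, 0.918, 0.854, 0.800, 0.778, 0.762, 0.748, 0.746,
           0.742, 0.742, 0.740, 0.736, 0.732, 0.716, 0.688, 0.678, 0.662,
           0.642, 0.638, 0.632, 0.618, 0.604, 0.598, 0.574),
  stringsAsFactors = FALSE
)

# Assumed mapping from listed position to All-NBA slot
slot <- c(PG = "guard", SG = "guard", SF = "forward", PF = "forward",
          C = "center")
top25$slot <- unname(slot[top25$position])

# Six forwards, six guards, and three centers across the three squads
quota <- c(forward = 6, guard = 6, center = 3)

# Within each slot, keep the highest-probability players up to the quota
pick <- function(s) {
  pool <- top25[top25$slot == s, ]
  head(pool[order(-pool$prob), "player"], quota[[s]])
}
teams <- lapply(setNames(names(quota), names(quota)), pick)
print(teams)
```

Run on the table above, this reproduces the three lists, which also shows who the position rule squeezes out: DeAndre Jordan and Marc Gasol have higher probabilities than several of the guards who make it.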
