How Do We Judge a Probabilistic Model? Or, How Did FiveThirtyEight Do in Forecasting the 2018 Midterms?

Nate Silver and the FiveThirtyEight folks talked quite a bit this election cycle about how their model was probabilistic. What this means is that they don't just offer a single prediction (e.g., "Democrats will pick up 10 seats"); instead, their prediction is a distribution of possible outcomes (e.g., "We expect the result to be somewhere between Republicans picking up 5 seats and Democrats picking up 35"). One of the most common ways of thinking probabilistically is in terms of long-run frequencies: "If this same election were held a large number of times, we would expect the Democrat to win 70% of the time."

It is tricky to judge one of these models, however, because we only see one realization of this estimated distribution of outcomes. So if something with only a 5% predicted chance of happening actually does happen, was it because (a) the model was wrong, or (b) the model was right, and we just observed a rare event?

We need to judge the success of probabilistic models in a probabilistic way. This means that we need a large sample of predictions and actual results; unfortunately, this can only be done after the fact. However, it can help us think more about what works and who to trust in the future. Since FiveThirtyEight predicted 366 elections in the 2018 midterms, all using the same underlying model, we have the requisite sample size necessary to judge the model probabilistically.

How do we judge a model probabilistically? The model gives us both (a) predicted winners and (b) predicted probabilities that these predicted winners actually do win. The model can be seen as “accurate” to the extent that the predicted probabilities correspond to observed probabilities of winning. For example, we can say that the model is “probabilistically accurate” if predicted winners with a 65% chance of winning actually win 65% of the time, those with a 75% chance of winning actually win 75% of the time, etc.
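This kind of check takes only a few lines of R. Here is a minimal sketch; the function name and the binning scheme are my own, not anything FiveThirtyEight uses:

```r
# A minimal calibration check: bin the predicted win probabilities and
# compare each bin's average prediction to its observed win rate.
calibration <- function(prob, won, breaks = seq(.5, 1, .1)) {
  bin <- cut(prob, breaks, include.lowest = TRUE)
  data.frame(predicted = tapply(prob, bin, mean),
             observed  = tapply(won, bin, mean))
}
```

For a probabilistically accurate model, the `predicted` and `observed` columns should track each other closely in every bin.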

I want to step away from asking if the model was black-and-white “right or wrong,” because—strictly speaking—no model is “right” or “correct” in the sense of mapping onto ground truth perfectly. The FiveThirtyEight model—just like any statistical or machine learning model—will make certain assumptions that are incorrect, not measure important variables, be subject to error, etc. We should instead ask if the model is useful. In my opinion, a model that makes accurate predictions while being honest with us about the uncertainty in its predictions is useful.

Was the FiveThirtyEight 2018 midterm model probabilistically accurate? Before we get to the model, I want to show what the results would look like in a perfect situation. We can then compare the FiveThirtyEight model to this “perfect model.”

I am going to simulate 100,000 races. Let's imagine that we make a model, and this model gives us a winner and a predicted probability for that winner in every single race. We want this predicted probability to be exactly right. So, in my simulation below, the probabilities that actually generate the winners are the same exact probabilities that we pretend we "modeled." These are called x below, and they range from .501 to .999.

We then simulate the races with those probabilities. The results are called y below. The result is 1 if the predicted winner actually won, while the result is 0 if they lost.

n_races <- 100000
x <- runif(n_races, .501, .999)  # the "modeled" win probabilities
y <- rbinom(n_races, 1, x)       # 1 if the predicted winner won, 0 if they lost
sim_dat <- data.frame(x, y)
head(sim_dat)
##           x y
## 1 0.9215105 1
## 2 0.8136280 1
## 3 0.6239117 1
## 4 0.8290906 1
## 5 0.8180035 0
## 6 0.7973508 1

We can now plot the predicted probabilities on the x-axis and whether or not the prediction was correct on the y-axis. The points will only appear at y = 0 or y = 1. We would generally use logistic regression for this, but that imposes a slight curve on the line; since we know the relationship should be perfectly straight, I fit both an ordinary least squares line (purple) and a logistic regression line (gold) below.
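Here is a sketch of how those two fits can be drawn in base R, re-running the simulation above (the seed is my addition, for reproducibility):

```r
set.seed(538)  # my addition, so the simulation is reproducible
n_races <- 100000
x <- runif(n_races, .501, .999)
y <- rbinom(n_races, 1, x)

ols <- lm(y ~ x)                        # the straight (purple) fit
logit <- glm(y ~ x, family = binomial)  # the slightly curved (gold) fit

plot(x, y, pch = ".", xlab = "Predicted win probability",
     ylab = "Prediction correct?")
abline(ols, col = "purple")
grid_x <- seq(.501, .999, by = .001)
lines(grid_x, predict(logit, data.frame(x = grid_x), type = "response"),
      col = "gold")
abline(0, 1, lty = 3)  # dotted line: perfect calibration
```

Because the outcomes were generated from the probabilities themselves, both fitted lines should land almost exactly on the dotted 45-degree line.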

If the FiveThirtyEight model were completely accurate in its win probability prediction, we would expect the relationship between predicted win probability (x) and observed win probability (y) to look like:

And this is what the results for the 2018 midterms (all 360 Senate, House, and Governor races that have been decided) actually looked like:

I am using the deluxe forecast from the morning of the election. The dotted line represents a perfect fit. A fitted line with a steeper slope that sits mostly below the dotted line would mean the model is overestimating its certainty; a fitted line with a shallower slope that sits mostly above the dotted line would mean the model is underestimating its certainty. Since the line here sits above the dotted line, the FiveThirtyEight model was less certain in its predictions than it should have been. For example, predictions made with a 60% win probability actually turned out to be correct about 70-75% of the time.

However, there are a few outstanding races. As of publishing, these are:

| State | Race Type | Candidate | Party | Incumbent | Win Probability |
|-------|-----------|-----------|-------|-----------|-----------------|
| MS | Senate | Cindy Hyde-Smith | R | False | 0.707 |
| GA | House | Rob Woodall | R | True | 0.848 |
| NY | House | Anthony Brindisi | D | False | 0.604 |
| NY | House | Chris Collins | R | True | 0.797 |
| TX | House | Will Hurd | R | True | 0.792 |
| UT | House | Ben McAdams | D | False | 0.642 |

Let’s assume that all of these predictions come out to be wrong—that is, let’s assume the worst case scenario for FiveThirtyEight. The results would still look quite good for their model:

We can look at some bucketed analyses instead of drawing a fitted line, too. The average FiveThirtyEight predicted probability of a correct prediction was 92.1%, and the average actual correct prediction rate was 95.8%. However, most of the races were easy to predict, so let’s limit these to ones that are below 75% sure, below 65% sure, and below 55% sure:

| Threshold | Count | Mean Win Probability | Correct Rate |
|-----------|-------|----------------------|--------------|
| 0.75 | 45 | 0.627 | 0.756 |
| 0.65 | 30 | 0.591 | 0.667 |
| 0.55 | 8 | 0.522 | 0.750 |

The actual correct rate is higher than the predicted probability in all cases; again, we see that FiveThirtyEight underestimated the certainty they should have had in their predictions.
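The bucketing itself is straightforward. Here is a sketch run on simulated stand-in data rather than the actual FiveThirtyEight file; `results`, `prob`, and `won` are names I made up for illustration:

```r
set.seed(538)  # simulated stand-in for the 360 real forecasts
results <- data.frame(prob = runif(360, .501, .999))
results$won <- rbinom(360, 1, results$prob)

# for each threshold, keep the less-certain races and compare the mean
# predicted probability to the actual correct-prediction rate
bucket <- function(d, cutoff) {
  close <- d[d$prob < cutoff, ]
  data.frame(threshold = cutoff,
             count = nrow(close),
             mean_prob = round(mean(close$prob), 3),
             correct_rate = round(mean(close$won), 3))
}
do.call(rbind, lapply(c(.75, .65, .55), bucket, d = results))
```

With the real data, an underconfident model shows up exactly as in the table above: `correct_rate` exceeding `mean_prob` in each bucket.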

For a full view, here are all of the incorrect FiveThirtyEight predictions. They are sorted from highest win probability to lowest, so the table is ordered such that bigger upsets are at the top of the table.

| State | Race Type | Candidate | Party | Incumbent | Win Probability |
|-------|-----------|-----------|-------|-----------|-----------------|
| OK | House | Steve Russell | R | True | 0.934 |
| SC | House | Katie Arrington | R | False | 0.914 |
| NY | House | Daniel Donovan | R | True | 0.797 |
| FL | Governor | Andrew Gillum | D | False | 0.778 |
| FL | Senate | Bill Nelson | D | True | 0.732 |
| KS | House | Paul Davis | D | False | 0.641 |
| IA | Governor | Fred Hubbell | D | False | 0.636 |
| OH | Governor | Richard Cordray | D | False | 0.619 |
| IN | Senate | Joe Donnelly | D | True | 0.617 |
| MN | House | Dan Feehan | D | False | 0.599 |
| GA | House | Karen C. Handel | R | True | 0.594 |
| VA | House | Scott Taylor | R | True | 0.594 |
| TX | House | John Culberson | R | True | 0.553 |
| NC | House | Dan McCready | D | False | 0.550 |
| TX | House | Pete Sessions | R | True | 0.537 |

Is underestimating certainty—that is, overestimating uncertainty—a good thing? Nate Silver tongue-in-cheek bragged on his podcast that the model actually did too well. He was joking, but I think this is an important consideration: If a model is less certain than it should have been, what does that mean? I think most people would agree that it is better for a model to be 10% less certain than to be 10% more certain than it should have been, especially after the surprise of 2016. But if we are judging a model by asking if it is truthful in expressing uncertainty, then we should be critical of a model that makes correct predictions, but does so by saying it is very unsure. I don’t think there is a big enough discrepancy here to pick a fight with the FiveThirtyEight model, but I do think that inflating uncertainty can be a defense mechanism that modelers can use to protect themselves if their predictions are incorrect.

I think it is safe to say that FiveThirtyEight performed well this cycle. This also means that the polls, fundamentals, and expert ratings that underlie the model were accurate as well—at least in the aggregate. In my probabilistic opinion, we should not say that the FiveThirtyEight model was “wrong” about Steve Russell necessarily, because a predicted probability of 93% means that 7% of these will come out contrary to what was expected—and the observed probabilities mapped onto FiveThirtyEight’s predicted probabilities pretty accurately. That is, FiveThirtyEight predictions made with 93% certainty actually came to pass more than 95% of the time.

I think probabilistic forecasts are very useful for political decision makers. They keep us from seeing a big deficit in the polls and writing off the race as unwinnable—we should never assume that the probability of an event is 100%. For example, it was said in the run-up to the midterms that a race needs to be close to even consider donating money to it: "If the race isn't between 2% or 3%, you're wasting your money." But there is uncertainty in all estimates—candidates with a 90% likelihood of winning are still going to lose 10% of the time. The trick for analysts and professionals is to make sense of probabilistic forecasts. Were there any warning signs ahead of time that hinted Steve Russell, Katie Arrington, or Daniel Donovan were going to lose? Was there a paucity of polling data or not a lot of money spent in those elections? Were there highly localized aspects to the races that the model obscured by pooling information across similar demographies and geographies? People should certainly invest more in races that are closer to being a toss-up, but thinking probabilistically requires us to invest some resources in longer shots, too, because unlikely events are unlikely—but they still happen.

All data and code for this post can be found at my GitHub.

Using Beta Regression to Better Model Norms in Political Psychology

Update 2018-08-23: The link below is outdated. A full, more detailed paper can be found at my GitHub.

I recently wrote a short working paper on how to use beta regression and how it helps take into account norms in correlational studies of ideology, politics, and prejudice. It is a little long for a blog post (and this platform does not support LaTeX), so I uploaded it as a working paper. Click here to download the paper.

I hope it is instructive and informative, and that it fills in a few gaps from previous papers. Please email me if you have any questions about it. As always, the code can be found over at my GitHub.

Updated 2017-12-11


Political Targeting Using k-means Clustering

Political campaigns send certain messages to people based on what the campaigns know about these potential voters, hoping that these specialized messages are more effective than generic messages at convincing people to show up and vote for their candidate.

Imagine that a Democratic campaign wants to target Democrats for specific types of mailers or phone call conversations, but has limited resources. How could they decide on who gets what type of message? I'll show how I've been playing around with k-means clustering to get some insight into how people might decide what messages to send to which people.

Let's say this campaign could afford, at most, four different types of messages. We could try to find four clusters of Democrats, based on some information that a campaign has about the voters. I will use an unrealistic example here—given that it is survey data on specific issues—but I think it is nonetheless interesting and shows what such a simple algorithm is capable of.

My colleagues Chris Crandall, Laura Van Berkel, and I have asked online samples how they feel about specific political issues (e.g., gun laws, abortion, taxes). For the present analyses, I include only people who identify as Democrats, because I'm imagining that I'm trying to target Democratic voters.

I have 179 self-identified Democrats' answers on 17 specific policy questions, as well as how much they identify as liberal to conservative (on a 0 to 100 scale). I ran k-means clustering on these 17 policy questions, specifying four clusters (i.e., the most this fictitious campaign could afford).

First, let's look at how many people were in each cluster. We can also look at how much each cluster, on average, identified themselves as liberal (0) to conservative (100):

| Cluster | Conservatism | Size |
|---------|--------------|------|
| 1 | 46.32 | 22 |
| 2 | 31.84 | 43 |
| 3 | 27.96 | 47 |
| 4 | 10.45 | 67 |

These clusters are ordered by conservatism. We could simply see the groups as ranging from the most conservative Democrats to the most progressive Democrats, but can we get a more specific picture?

What I did was create four different plots—one for each cluster—laying out how, on average, each cluster scored on each specific policy item. These items are standardized, which means that a score of 0 indicates the cluster held the same opinion as the average Democrat in the sample. This will be important for interpretation. Let's look at Cluster 1. I call these "Religious Conservative Democrats," as I will explain shortly:

[Plot: Cluster 1's mean standardized score on each policy item]

In general, these people tend to be more conservative than the average Democrat in our sample. But what really differentiates these people most? Three of the largest deviations from zero show the story: These people are much more against abortion access, much more of the belief that religion is important in everyday life, and much more against gay marriage than the average Democrat. These are not just conservative Democrats, but Democrats that are more conservative due to traditional religious beliefs. If I were advising the campaign, I would say, whatever they choose to do, do not focus their message on progressive stances on abortion access and gay marriage.
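Computing these per-cluster profiles is simple. Here is a sketch that uses simulated stand-in data, since the survey responses themselves aren't included here; the variable names are mine:

```r
set.seed(1)
# stand-in for 179 Democrats' answers to 17 policy items
items <- as.data.frame(matrix(rnorm(179 * 17), nrow = 179))
z <- scale(items)  # standardized: 0 = the average Democrat on each item
four <- kmeans(z, centers = 4, nstart = 25)

# one row per cluster: its mean standardized score on every item,
# which is what each of the four plots displays
profiles <- aggregate(as.data.frame(z), by = list(cluster = four$cluster),
                      FUN = mean)
```

Standardizing before clustering also keeps items measured on different scales from dominating the distance calculations.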

Let's turn to the Cluster 2, “Fiscally Conservative Democrats”:

[Plot: Cluster 2's mean standardized score on each policy item]

The biggest deviations from the average Democrat that these people have are that they are more likely to be against welfare, say that there is too much government spending, and oppose raising taxes. These people also support funding social security, stopping climate change, reducing economic inequality, and government providing healthcare less than the average Democrat. I would suggest targeting these people with social issues: access to abortion, supporting gay rights, funding fewer military programs, and supporting immigration. These people are about the same as the average Democrat on these issues.

Cluster 3 are what I have named “Moderate Democrats”:

[Plot: Cluster 3's mean standardized score on each policy item]

I almost want to name this group, "Democrats Likely to Agree with All of the Questions," because they tended to support both conservative and liberal policies more than the average Democrat (except for the death penalty item, strangely). But they can be seen as moderates, or perhaps "ambivalent." In comparison to the average Democrat, they are both more likely to say we should control borders and immigration and say that we should reduce economic inequality. Theoretically, this group is interesting. We could ask lots of empirical questions about them.

But pragmatically? They could probably be given messages that the candidate is most passionate about, polls best, etc., as they are likely to be more favorable toward any attitude—liberal or otherwise—than the average Democrat.

You might be thinking, "These groups all seem pretty conservative." Remember that these scores are all relative to the average Democrat: even if a cluster scores more conservatively on an issue, its members are still likely to support the conservative position less than a Republican would.

In any case, Cluster 4 (the biggest cluster, about 37% of the sample) are the “Progressive Democrats”:

[Plot: Cluster 4's mean standardized score on each policy item]

In comparison to the average Democrat, these people support all of the liberal causes more and conservative causes less. I would suggest to those trying to campaign to these people that they would be open to the most progressive issues that the candidate has to offer.

As I mentioned before, this is somewhat of an unrealistic example: Campaigns don't have surveys laying around for registered voters. But there's an increasing amount of information out there that campaigns could use in lieu of directly asking people how they feel about issues: Facebook data, donations to specific causes, signing up for e-mail updates from different political organizations, etc. Information for most campaigns can be sparse, but people are increasingly able to access some of these more proprietary datasets.

k-means clustering is also a remarkably simple way to look at this issue. One line of code runs the algorithm. The command I ran was:

four <- kmeans(data, 4)

And then I extracted the clusters for each person from the cluster element of that fitted object:

four$cluster
Even if specific decisions are not made based on these simple cluster analyses, they are easy enough to do that I believe it is a good way to explore data and how respondents can be grouped together. Running multiple analyses specifying different numbers of clusters can help us understand how the people that answer these questions may be organized in pragmatically helpful ways.
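That exploration can be sketched with the classic "elbow" heuristic: fit k-means for several values of k and compare the total within-cluster sum of squares (again on simulated stand-in data, since the survey file isn't reproduced here):

```r
set.seed(1)
items <- scale(matrix(rnorm(179 * 17), nrow = 179))  # stand-in survey data

# total within-cluster sum of squares for k = 1 through 8 clusters
wss <- sapply(1:8, function(k) kmeans(items, centers = k,
                                      nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b", xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

A bend ("elbow") in this curve suggests a number of clusters past which additional groups add little; with real survey data the elbow is usually much more pronounced than with pure noise.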