In Support of Open Seeding in the NBA, Pt. 2

This is my second post providing support for open seeding in the NBA; this would mean that the top 16 teams in the league make the playoffs (instead of the top 8 in each conference). Last time, I showed that in only 10 of the last 34 seasons have the 16 teams with the best records been the 16 playoff teams. I wanted to look more at the player level, particularly after big names have recently migrated from the the Eastern to the Western Conference. It feels like there is a huge conference imbalance in star power in the league at a time when individual stars mean a lot to the game.

I again scraped Basketball-Reference.com for the numbers, and the code can be found at the end of this post. Basketball-Reference has league leaders pages, listing the top 20 players for a season in various statistics. I consider player efficiency rating (PER) to be the best overall performance metric, so I calculated what percent of the top 20 PER performers were in the West for every season since the 1979-1980 season. (I chose to start in this season because the league leaders pages look slightly different in previous seasons, making it more difficult to scrape). If the conferences were balanced, we would expect this to be 50%. I also did the same thing for All-NBA teams: What proportion of the All-NBA players were from the Western Conference? Again, we would expect this number to be 50% if the conferences were balanced perfectly.

results_per %>% 
  filter(complete.cases(.)) %>% 
  group_by(year) %>% 
  count(conf) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup() %>% 
  filter(conf == "W") %>% 
  ggplot(aes(x = as.numeric(year), y = prop)) +
  geom_point(color = "#008000") +
  stat_smooth(se = FALSE, method = "loess", span = 1, color = "#800080") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 1, angle = 45),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    text = element_text(size = 14)
  ) +
  labs(x = NULL) +
  scale_y_continuous(
    name = "Top PER Players in Western Conference",
    label = function(x) paste0(x * 100, "%")
  ) +
  geom_hline(yintercept = .5, linetype = 2)

results_allnba %>% 
  filter(complete.cases(.)) %>% 
  group_by(year) %>% 
  filter(conf == "W") %>% 
  ggplot(aes(x = year, y = prop, group = 1)) +
  geom_point(color = "#008000") +
  stat_smooth(se = FALSE, method = "loess", span = 1, color = "#800080") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 1, angle = 45),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    text = element_text(size = 14)
  ) +
  labs(x = NULL) +
  scale_y_continuous(
    name = "All-NBA Players in Western Conference",
    label = function(x) paste0(x * 100, "%")
  ) +
  geom_hline(yintercept = .5, linetype = 2)

Western conference players have been overrepresented in the Western Conference in most years. This means the NBA—perhaps the most star-driven league in North American sports—doesn’t have the opportunity to showcase all of their great players in the playoffs, because only 8 teams from the West are allowed to compete there. It also looks as though this imbalance is at an all-time high, which could be why talk about open seeding has reached mainstream support and discussion among thinkers around the league recently. Allowing the top 16 teams to play in the playoffs would allow more stars to play on that national stage.

R Code Appendix

Scraping this posed a fun problem, because a few of the tables I was trying to access were written as comments in the HTML code for the webpage. This meant that I could not just run rvest::html_table and choose the ones I was looking for. Instead, it meant that I had to find the XPath to that comment, scrape it, convert it to a character string, read the string in again as HTML, and then parse the table out. This code also shows how one can efficiently vectorize using an apply function and then piping into a do.call command to bind the results together into a data.frame.

The entirety of the code for this post can be found at my GitHub page.

library(tidyverse)
library(rvest)
years <- 1980:2018
per <- lapply(years, function(x) {
  paste0(
    "https://www.basketball-reference.com/leagues/NBA_",
    x,
    "_leaders.html"
  ) %>% 
    read_html() %>% 
    html_table() %>% 
    getElement(30) %>% 
    transmute(
      team = substr(X2, nchar(X2) - 2, nchar(X2)),
      per = X3
    )
})
names(per) <- years
  
standings <- lapply(years, function(x) {
  tmp <- paste0(
    "https://www.basketball-reference.com/leagues/NBA_",
    x,
    "_ratings.html"
  ) %>% 
    read_html() %>% 
    html_table() %>% 
    getElement(1) %>% 
    `[`(-1, 2:3)
  colnames(tmp) <- c("team", "conf")
  tmp
})
names(standings) <- years

key <- lapply(years, function(x) {
  tmp <- paste0(
    "https://www.basketball-reference.com/leagues/NBA_", 
    x, 
    "_standings.html"
  ) %>% 
    read_html() %>% 
    html_node(xpath = '//*[@id="all_team_vs_team"]/comment()') %>% 
    html_text() %>% 
    read_html() %>% 
    html_table() %>% 
    getElement(1) %>% 
    as.data.frame()
  
  suppressWarnings(
    tmp <- data.frame(
      team = tmp$Team,
      abbr = colnames(tmp)[-1:-2]
    ) %>% 
      full_join(standings[[as.character(x)]], by = "team") %>% 
      select("abbr", "conf")
  )
  
  colnames(tmp)[[1]] <- "team"
  tmp
})
names(key) <- years

results_per <- lapply(as.character(years), function(x) {
  tmp <- suppressWarnings(left_join(per[[x]], key[[x]], by = "team"))
  tmp$year <- x
  tmp
}) %>% 
  do.call(rbind, .)

results_allnba <- lapply(years, function(x) {
  tmp <- paste0(
    "https://www.basketball-reference.com/leagues/NBA_",
    x,
    "_per_game.html"
  ) %>% 
    read_html() %>% 
    html_table() %>% 
    getElement(1) %>% 
    mutate(Player = gsub("*", "", Player, fixed = TRUE)) %>% 
    group_by(Player) %>% 
    slice(1) %>% 
    ungroup() %>% 
    transmute(player = Player, team = Tm)
  
  suppressWarnings(
    paste0(
      "https://www.basketball-reference.com/leagues/NBA_",
      x,
      ".html#all_all-nba"
    ) %>% 
      read_html() %>% 
      html_node(xpath = '//*[@id="all_all-nba"]/comment()') %>% 
      html_text() %>% 
      read_html() %>% 
      html_table() %>% 
      do.call(rbind, .) %>% 
      separate(X1, paste0("p", 1:5), sep = "\\s{2}") %>% 
      gather() %>% 
      transmute(player = value) %>% 
      left_join(tmp, by = "player") %>% 
      left_join(key[[as.character(x)]], by = "team") %>% 
      count(conf) %>% 
      mutate(prop = n / sum(n), year = x)
  )
}) %>% 
  do.call(rbind, .)

In Support of Open Seeding in the NBA

This is a post arguing why the NBA should adopt open seeding in the playoffs: Instead of taking the top 8 teams in each conference, the top 16 teams in the NBA should make the playoffs.

The first thing I wanted to do was diagnose the problem. I looked at every year from 1984 (when the NBA adopted the 16-team playoff structure) through 2017. For each year, I tallied the number of teams making the playoffs who had a smaller win percentage than a team in the other conference. These data come from Basketball-Reference.com, and the code for scraping, cleaning, and visualizing these data can be found over at GitHub.

The following figure shows this tally per year, and the years are grouped by color based on the conference who had a team with the worse record. The years between dashed vertical lines represent the years in which division winners were guaranteed a top three or four seed.

plot of chunk figure

This analysis spans 34 seasons. Of these 34 seasons:

  • There were 10 seasons where the 16 teams with the best records were the 16 in the playoffs (although not seeded as such, since the playoff bracket is split by conference).
  • Of the 24 (71%) seasons where at least one team with a worse win percentage than a lottery team in the other conference made the playoffs, the offending conference was the East 16 times, while the West 8 times.
  • The worst year was 2008, where more than half of the playoff teams in the Eastern Conference had a worse record than the 9th-place Golden State Warriors, who went .585. Actually, the 10th-place Portland Trail Blazers went .500, placing them ahead of the 7th- and 8th-seeded Eastern Conference teams, and the 11th-place Sacramento Kings had a better record than the 8th-seeded Eastern Conference Atlanta Hawks (a team that only won 37 games).

I have heard the argument that we should not worry about unbalanced conferences in any one year, because “Sometimes the East is better, sometimes the West is better—it balances out in the long-run!” While my analyses don't control for strength of schedule in each conference, it simply isn't true that the conference imbalance evens out over time. I'm looking at the past 34 seasons, and the East was worse twice as often as the West (at least in terms of worse teams making the playoffs).

That argument also doesn't make sense to me because championships are not decided over multiple-years. They are an award given out at the end of every season. So even if it balanced out between conferences over time, this would not matter, because every year some below-average team is making the playoffs. And from these data, we can see that 71% of the seasons in the last 34 years have resulted in at least one team making the playoffs that had a worse record than a lottery team in the opposite conference.

The Importance of Blocks for NBA Defenses, Over Time

Introduction

I have always been ambivalent on how I feel about blocking shots. On one hand, they are really cool. Seeing someone send a shot into the fifth row or watching Anthony Davis recover and get hand on a shot just in time is fun. Perhaps the most important play of LeBron's entire career was a block. At the same time, blocks mean more shooting fouls, and blocks don't often result in a defensive rebound.

My interest in blocks continues in this post: I will be looking at the relationship between blocks and defensive rating, both measured at the player level. Both are per 100 possessions statistics: How many blocks did they have in 100 possessions? How many points did they allow in 100 possessions (see the Basketball-Reference glossary)?

My primary interest here is looking at the relationship over time. Are blocks more central to the quality of a defense now than they used to be? The quality of defense will be measured by defensive rating—the number of points a player is estimated to have allowed per 100 possessions. For this reason, a better defensive rating is a smaller number.

There is some reason for why blocks might be more important now than they used to be. In the 2001-2002 NBA season, the NBA changed a bunch of rules, many relating to defense. There were two rules in particular that changed how people could play defense (quoting from the link above): “Illegal defense guidelines will be eliminated in their entirety,” and “A new defensive three-second rule will prohibit a defensive player from remaining in the lane for more than three consecutive seconds without closely guarding an offensive player.” This meant that people could start playing zone defense, as long as the defense didn't park guys in the paint for longer than three seconds. This also means blocking shots became a skill that took self-control: You had to be able to slide into place when necessary to block shots—you couldn't just stand there in the paint and wait for guys to swat.

Other rules made it more likely that guys were going to be able to drive past backcourt players and reach the paint: “No contact with either hands or forearms by defenders except in the frontcourt below the free throw line extended in which case the defender may use his forearm only.” So, have blocks grown more important for a defense over time?

Analysis Details

I looked at every NBA player in every season from 1989 to 2016, but players were only included if they played more than 499 minutes in a season. If you are interested in getting this data, you can see my code for scraping Basketball-Reference—along with the rest of this code for this blog post—over at GitHub.

I fit a Bayesian multilevel model to look at this, using the brms package, which interfaces with Stan from R. I predicted a player's defensive rating from how many blocks they had per 100 possessions, what season it was, and the interaction between the two (which tests the idea that the relationship has changed over time). I also allowed the intercept and relationship between blocks and defensive rating to vary by player. I made the first year in the data, 1989, the intercept (e.g., subtracted 1989 from the year variable) and specified some pretty standard, weakly-informative priors, as well:

brm(drtg ~ blkp100*year + (1+blkp100|player),
    prior=c(set_prior("normal(0,5)", class="b"),
            set_prior("student_t(3,0,10)", class="sd"),
            set_prior("lkj(2)", class="cor")),
    data=dat, warmup=1000, iter=2000, chains=4)

Results

The probability that the relationship between blocks and defensive rating depends on year, given the data (and the priors), is greater than 0.9999. So what does this look like? Let's look at the relationship at 1989, 1998, 2007, and 2016:

Year Slope
1989 -1.04
1998 -1.48
2007 -1.92
2016 -2.36

Our model says that, in 1989, for each block someone had in 100 possessions, they allowed about 1.04 fewer points per 100 possessions. In 2016? This estimate more than doubled: For every block someone had in 100 possessions, they allowed about 2.36 fewer points per 100 possessions! Here is what these slopes look like graphically:

plot of chunk plot

However, the model assumes a linear change in the relationship between blocks and defensive rating per year. What if there was an “elbow” (i.e., drop-off) in this curve in 2002, when the rules changed? That is, maybe there was a relatively stable relationship between blocks and defensive rating until 2002, when it quickly became stronger. To investigate this, I correlated blocks per 100 possessions and defensive rating in each year separately. Then, I graphed those correlations by year. (Note: Some might argue a piecewise latent growth curve model is the appropriate test here; for brevity and simplicity's sake, I chose to take a more descriptive approach here).

plot of chunk next-plot

It is hard to tell specifically where the elbow is occuring, but it does look like there is less importance of blocks before the rule change (i.e., the correlation was getting closer to zero), whereas after the rule change, the correlation was getting stronger (e.g., getting more and more negative).

Conclusion

It looks like blocks are not only cool, but especially relevant for players to perform well defensively in recent years. A consequence of the rule changes in 2001-2002 is that blocks are more important now for individual defensive performance than they were in the past.

Analyzing Rudy Gay Trades Using the CausalImpact Package

Introduction

I have been meaning to learn more about time-series and Bayesian methods; I'm pumped for a Bayesian class that I'll be in this coming semester. RStudio blogged about the CausalImpact package back in April—a Bayesian time-series package from folks at Google—and I've been meaning to play around with it ever since. There's a great talk posted on YouTube that is a very intuitive description of thinking about causal impact in terms of counterfactuals and the CausalImpact package itself. I decided I would use it to put some common wisdom to the test: Do NBA teams get better after getting rid of Rudy Gay? I remember a lot of chatter on podcasts and on NBA Twitter after he was traded from both the Grizzlies and the Raptors.

Method

I went back to the well and scraped Basketball-Reference using the rvest package. Looking at the teams that traded Gay mid-season, I fetched all the data from the “Schedule & Results” page and from that I calculated a point differential for every game: Positive numbers meant the team with Rudy Gay won the game by that many points, while negative numbers meant they lost by that many points. I ran the CausalImpact model with no covariates or anything: I just looked at point differential over time. I did this separately for the Grizzlies 2012-2013 season and the Raptors 2013-2014 season (both teams traded Rudy mid-season). The pre-treatment sections are before the team traded Gay; the post-treatment sections are after the team traded Gay.

Code for scraping, analyses, and plotting can be accessed over at GitHub.

Results

The package is pretty nice. The output is easy to read and interpret, and they even include little write-ups for you if you specify summary(model, "report"), where model is the name of the model you fit with the CausalImpact function. Let's take a look at the Grizzlies first.

Actual Predicted Difference 95% LB 95% UB
Average 4.4 3.6 0.82 -5 6.6
Cumulative 167.0 135.8 31.22 -190 252.5

The table shows the average and cumulative point differentials. On average, the Grizzlies scored 4.4 points more than their opponent per game after Rudy Gay was traded. Based on what the model learned from when Gay was on the team, we would have predicted this to be 3.6. Their total point differential was 167 after Rudy Gay was traded, when we would have expected about 136. The table also shows the differences: 0.82 and 31.22 points for average and cumulative, respectively. The lower bound and upper bound at a 95% confidence interval fell on far opposite sides of zero, suggesting that the difference is not likely to be different from zero. The posterior probability here of a causal effect (i.e., the probability that this increase was due to Gay leaving the team) is 61%—not a very compelling number. The report generated from the package is rather frequentist—it uses classical null hypothesis significance testing language, saying the model “would generally not be considered statistically significant” with a p-value of 0.387. Interesting.

What I really dig about this package are the plots it gives you. This package is based on the idea that it models a counterfactual: What would the team have done had Rudy Gay not been traded? It then compares this predicted counterfactual to what actually happened. Let's look at the plots:

plot of chunk unnamed-chunk-34

The top figure shows a horizontal dotted line, which is what is predicted given what we know about the team before Gay was traded. I haven't specified any seasonal trends or other predictors, so this line is flat. The black line is what is actually happened. The vertical dotted line is where Rudy Gay was traded. The middle figure shows the difference between predicted and observed. We can see that there is no reliable difference between the two after the Gay trade. Lastly, the bottom figure shows the cumulative difference (that is, adding up all of the differences between observed and predicted over time). Again, this is hovering around zero, showing us that there was really no difference in the Grizzlies point differential that actually occurred and what we predicted would have happened had Gay not been traded (i.e., the counterfactual). What about the Raptors?

The Raptors unloaded Gay to the Kings the very next season. Let's take a look at the same table and plot for the Raptors and trading Rudy:

Actual Predicted Difference 95% LB 95% UB
Average 4.4 -0.37 4.8 -0.88 10
Cumulative 279.0 -23.22 302.2 -55.59 657

plot of chunk unnamed-chunk-36

The posterior probability of a causal effect here was 95.33%—something that is much more likely than the Grizzlies example. The effect was more than five times bigger than it was for Memphis: There was a difference of 4.8 points per game (or 302 cumulatively) between what we observed and what we would have expected had the Raptors never traded Gay. Given that this effect was one (at the time, above average) player leaving a team is pretty interesting. I'm sure any team would be happy with getting almost 5 whole points better per game after getting rid of a big salary.

Conclusion

It looks like trading Rudy Gay likely had no effect on the Grizzlies, but it does seem that getting rid of him had a positive effect on the Raptors. The CausalImpact package is very user-friendly, and there are many good materials out there for understanding and interpreting the model and what's going on underneath the hood. Most of the examples I have seen are simulated data or data which are easily interpretable, so it was good practice seeing what a real, noisy dataset actually looks like.

Predicting All-NBA Teams

The NBA season is winding down, which means it is award season. Bloggers, podcasters, writers, and television personalities are all releasing who they think should make the First-, Second-, and Third-Team All-NBA squads. What I will do here is try to predict who will make the All-NBA squads. I won't try to predict each particular team, but only if they made one of the three squads or not.

Technical details

I first downloaded all of the individual statistics I could from Basketball-Reference.com for all players in all years from 1989 to 2016. I chose 1989 as the starting point, because it was the first year that the NBA voted on three All-NBA squads instead of two. These included all numbers from the Totals, Per Game, Per 36 Minutes, Per 100 Possessions, and Advanced tables on Basketball-Reference.com.

I trimmed the sample down a little bit by (a) cutting out anyone who was missing data (i.e., many big men did not have three-point percentages, because they did not take a three-point shot all season) and (b) removing noisy data by dropping anyone who played less than 500 minutes in the entire year. This is a cutoff many sites use (e.g., ESPN.com) for leaderboards, and it's a cutoff that I've used in research on the NBA before. This cut out 9 people of the 420 that made the All-NBA teams in my training data (years 1989 through 2013)—not a troubling amount of players. And given that the NBA has shot more and more three-point shots in recent years, I'm not worried about cutting a few big men from the training data who didn't attempt one in an entire season.

Which leads me to the particulars of training the model. I split up the data into four parts. First, the training data (seasons 1989 to 2013). I then tested the model on the three subsequent years separately: 2014, 2015, and 2016. These three subsets made up my testing data. The model classified players as either a “Yes” or “No” for making an All-NBA team. However, the model didn't know that I needed 15 players per year, consisting of 6 guards, 6 forwards, and 3 centers. So to assess model performance, I sorted the predicted probability, based on the model, that a player made the All-NBA team. I took the 6 guards, 6 forwards, and 3 centers with the highest probabilities of making the team and counted them as my predictions for the year.

I tried a number of algorithms: k-NN, C5.0, random forest, and neural networks. I played around with each of these models, changing the number of neighbors or hidden nodes or trees, etc. The random forest and neural networks yielded the same accuracy; I find the random forest to be more interpretable, so I went with that as my model. The default 500 trees and √ p  variables, where p is the number of variables we are using to predict if the player made the All-NBA team (in our case, 10 variables per tree), were just as accurate as any other alternative I tried, so I stuck with the defaults. So let's now turn to how accurate the model was at predicting seasons 2014-2016.

Is the model accurate at predicting years 2014-2016?

Using the criteria above, I predicted 10 of the 15 All-NBA players in 2014, 13 in 2015, and 12 in 2016 correctly for an overall accuracy of about 78%. A lot of people in the NBA world have maligned the fact that the All-NBA teams must include 2 guards, 2 forwards, and 1 center per squad. What if these restrictions were lifted? Which players did the model predict, based solely on their performance, to make the All-NBA team? Let's look year-by-year.

Model-predicted All-NBA players for 2014

Player Position Made Team?
Kevin Durant SF Yes
LeBron James PF Yes
Blake Griffin PF Yes
Kevin Love PF Yes
Carmelo Anthony PF No
James Harden SG Yes
Stephen Curry PG Yes
DeMarcus Cousins C No
Chris Paul PG Yes
Anthony Davis C No
Dirk Nowitzki PF No
Paul George SF Yes
Dwight Howard C Yes
Joakim Noah C Yes
Goran Dragic SG Yes

Model-predicted All-NBA players for 2015

Player Position Made Team?
Russell Westbrook PG Yes
LeBron James SF Yes
James Harden SG Yes
Anthony Davis PF Yes
Chris Paul PG Yes
Stephen Curry PG Yes
DeMarcus Cousins C Yes
Blake Griffin PF Yes
Pau Gasol PF Yes
Damian Lillard PG No
DeAndre Jordan C Yes
LaMarcus Aldridge PF Yes
Kyrie Irving PG Yes
Marc Gasol C Yes
Jimmy Butler SG No

Model-predicted All-NBA players for 2016

Player Position Made Team?
LeBron James SF Yes
Kevin Durant SF Yes
Russell Westbrook PG Yes
Chris Paul PG Yes
James Harden SG No
Stephen Curry PG Yes
Kawhi Leonard SF Yes
DeMarcus Cousins C Yes
Kyle Lowry PG Yes
Damian Lillard PG Yes
DeAndre Jordan C Yes
Isaiah Thomas PG No
Anthony Davis C No
Paul George SF Yes
DeMar DeRozan SG No

Most notably here is James Harden had the 5th highest probability of making the All-NBA team last year—the model gave him a probability of .82. This supports the obvious point that he was one of the biggest snubs for the All-NBA team in the history of the league. The model also tended to classify some players who weren't on very good teams. I didn't include their team's win-loss record in the input variables for the model, because I was not sure how to go about calculating this win-loss percentage for players who were traded mid-season. However, the win-loss record was indirectly included in the win shares for a player.

What were the most important statistics in predicting if people made the All-NBA team or not? We can see this by looking at the Gini index. I looked at the distribution of the Gini index for each predictor variable, and two were much higher than the rest: Player Efficiency Rating (PER) and win shares. These two metrics are popular choices as indicators of overall performance, and it seems like they at least have some predictive validity when it comes to which players people will select for the All-NBA squads.

Predicting this season

Top 25 probabilities for making the All-NBA team

Player Position Probability
LeBron James SF 0.966
Kevin Durant SF 0.928
Kawhi Leonard SF 0.918
Giannis Antetokounmpo SF 0.854
Stephen Curry PG 0.800
Karl-Anthony Towns C 0.778
Jimmy Butler SF 0.762
Russell Westbrook PG 0.748
DeMarcus Cousins C 0.746
Anthony Davis C 0.742
John Wall PG 0.742
Gordon Hayward SF 0.740
DeAndre Jordan C 0.736
DeMar DeRozan SG 0.732
Marc Gasol C 0.716
Rudy Gobert C 0.688
Kyrie Irving PG 0.678
Dwight Howard C 0.662
Blake Griffin PF 0.642
Kyle Lowry PG 0.638
Isaiah Thomas PG 0.632
Andre Drummond C 0.618
James Harden PG 0.604
Kemba Walker PG 0.598
Chris Paul PG 0.574

The top 25 probabilities for making the team this season are listed above, as well as the players' positions. Using the position rules that the NBA requires for the All-NBA teams (discussed above), my predictions for the All-NBA teams, with just under ¾ of the NBA season complete, are:

Forwards

  • LeBron James
  • Kevin Durant
  • Kawhi Leonard
  • Giannis Antetokounmpo
  • Jimmy Butler
  • Gordon Hayward

Guards

  • Stephen Curry
  • Russell Westbrook
  • John Wall
  • DeMar DeRozan
  • Kyrie Irving
  • Kyle Lowry

Centers

  • Karl-Anthony Towns
  • DeMarcus Cousins
  • Anthony Davis