A Monte Carlo Study on Methods for Handling Class Imbalance in Machine Learning

December 6, 2017

For a statistical computing course, I recently ran a simulation study comparing methods for handling class imbalance (in this case, when the class of interest makes up less than about 3% of the data). I simulated 500 data sets, varying characteristics such as sample size and minority class size, and tested a number of preprocessing techniques (e.g., SMOTE) and algorithms (e.g., XGBoost). You can view the working paper by clicking here.
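To give a rough feel for what "varying sample size and minority class size" can look like, here is a minimal Python sketch. It is not the code from the paper: the function name `simulate_dataset`, the single Gaussian feature, the `shift` parameter, and the grid of conditions are all assumptions made up for illustration.

```python
import random


def simulate_dataset(n, minority_share, shift=1.0, seed=None):
    """Simulate one imbalanced data set (hypothetical setup, not the paper's).

    Returns (X, y), where y == 1 marks the rare class and X is a single
    Gaussian feature whose mean is moved by `shift` for the rare class.
    """
    rng = random.Random(seed)
    n_minority = max(1, round(n * minority_share))  # keep at least one positive
    y = [1] * n_minority + [0] * (n - n_minority)
    rng.shuffle(y)
    X = [rng.gauss(shift if label == 1 else 0.0, 1.0) for label in y]
    return X, y


# An illustrative grid of conditions: sample size crossed with minority share.
conditions = [(n, share) for n in (500, 2000) for share in (0.01, 0.03)]
datasets = [
    simulate_dataset(n, share, seed=i) for i, (n, share) in enumerate(conditions)
]
```

Each entry of `datasets` is one simulated data set for one condition; a full Monte Carlo study would repeat each condition many times and fit every model to every replicate.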

If you don't want to slog through the whole paper, the plot below shows the density of performance for each model (a combination of sampling technique and algorithm). I left off the models that used no preprocessing and the models that used oversampling, since they made so few positive predictions that metrics like F1 scores couldn't even be calculated most of the time!
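Why can't an F1 score be calculated when a model makes almost no positive predictions? Precision is true positives divided by all predicted positives, so with zero positive predictions it becomes 0/0. The sketch below shows this with a hand-rolled `f1_score` helper; the helper and the toy labels are my own illustration, not code from the study, and returning `None` for the undefined case is just one possible convention.

```python
def f1_score(y_true, y_pred):
    """Return F1 for the positive class (label 1), or None if it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fp == 0 or tp + fn == 0:
        return None  # precision or recall would be 0/0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: 3% positives, and a model that never predicts the rare class.
y_true = [0] * 97 + [1] * 3
always_negative = [0] * 100
undefined_f1 = f1_score(y_true, always_negative)

# A model that catches exactly one of the three positives, with no false alarms.
one_hit = [0] * 97 + [1, 0, 0]
partial_f1 = f1_score(y_true, one_hit)
```

Here `undefined_f1` is `None`, even though the always-negative model is 97% accurate, which is exactly why accuracy is a poor yardstick for imbalanced problems.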

Feel free to check out the GitHub repository, as well.
