On Calculating Power for Interactions in 2 x 2 Factorial Designs


Statistical power is important when doing experimental psychology. I'm not going to try to convince you of this—I think there is enough published work over the last few decades that should do so. Instead, I'm going to assume you want your study to be adequately powered and you are trying to figure out how many participants you need to find the interaction you hypothesize. I'm going to show what I think is an intuitive way of conducting a power analysis for an interaction effect in a 2 x 2 between-subjects experiment.


There has been a lot written about this, including by Uri Simonsohn (http://datacolada.org/17), Jake Westfall (http://jakewestfall.org/blog/index.php/2015/05/26/think-about-total-n-not-n-per-cell/), and Roger Giner-Sorolla (https://approachingblog.wordpress.com/2018/01/24/powering-your-interaction-2/).

Westfall notes how people can be confused when they go to G*Power and it tells them they need the same number of participants in a 2 x 2 design as they do in a simple randomized experiment with two conditions. I'll leave it to the three blog posts linked above to explain this dilemma.

Simonsohn and Giner-Sorolla both offer rules of thumb for sample size planning for a 2 x 2 interaction that sound something like: "In Situation X, you should multiply your per-cell sample size from Study 1 by Y," where Study 1 was a between-subjects experiment with two conditions. I agree with Westfall that we should not necessarily be thinking about cell size, but overall N. I also don't like using the a priori sample size from Study 1 to inform the sample size for Study 2, as this ignores the information we learned in Study 1. And using G*Power is difficult, because I don't really know how many people think of interesting, real-world phenomena and say to themselves, "Ah yes, this is probably a .15 Cohen's f-squared!"

What I find intuitive is sketching out a pattern of means you expect to see and then calculating power for those results, which is what I suggest one does when calculating power for an interaction effect in a 2 x 2 between-subjects design.


I'm going to assume familiarity with Cohen's d—that is, how many standard deviations two means are from one another. If you understand that, you can understand this approach.

Imagine we have a dependent variable that has a standard deviation of 1. That means that any mean differences we find are standardized differences; that is, they are in units of Cohen's d. If one mean is 0.2 and another is 0.7, the Cohen's d between these two means is 0.5. With this in mind, the steps I suggest (and have coded into a tool) are:

  1. For your 2 x 2 design, sketch out four means you expect to see, assuming that the dependent variable in all conditions has a standard deviation of 1. Use an observed Cohen's d to inform you of this.
  2. Get an overall sample size and simulate data based on these means and sample size.
  3. See if the p-value for the interaction effect is less than .05.
  4. Do steps 2 and 3 a large number of times.
  5. Get the proportion of times your simulated data had a p-value less than .05. This proportion is the power for that sample size.
  6. Do this for a wide range of possible overall sample sizes.

The logic behind simulating data is that frequentist statistics are all about the long run: If we were to do this exact study a large number of times, what proportion of the time would we find a significant effect? The computer can simulate this long run and tell you what proportion of the simulated studies came out significant, which is the power.

I have written a few R functions that accomplish the 6 steps above so that you don't have to (the functions are located here: https://github.com/markhwhiteii/power_twoway/blob/master/helpers.R).

I also wrapped these R functions into a Shiny web app, which you can access here: https://markhw.shinyapps.io/power_twoway/.

Basically, what the backend code does is: first, randomly assign cases to Level 1 or Level 2 of two factors (cleverly named Factor 1 and Factor 2); second, draw values of the dependent variable for these cases from a population with a standard deviation of 1 and a mean of whatever you input for that specific cell; third, run the model and see if we got a significant interaction effect. I do this for each of the sample sizes you input and for as many simulations as you specify. The proportion of significant effects found is the estimated power.
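To make that backend description concrete, here is a minimal sketch of the same loop in Python. This is not the code from helpers.R. The function names, the default seed, and the use of a normal (z) approximation in place of the t or F test a linear model would report are all my assumptions for illustration. The cell means in the demo are the reversal pattern discussed later in this post.

```python
import math
import random
from statistics import NormalDist, fmean


def interaction_p_value(cells):
    """Two-sided p-value for the 2 x 2 interaction contrast.

    `cells` maps (factor1_level, factor2_level) -> list of observations.
    Assumption: a normal (z) approximation stands in for the t/F test;
    with total N of 100 or more the difference is small.
    """
    if any(len(ys) < 2 for ys in cells.values()):
        return 1.0  # a cell is too small to estimate; count as non-significant
    means = {k: fmean(ys) for k, ys in cells.items()}
    # Pooled within-cell variance (residual variance of the full model).
    ss = sum((y - means[k]) ** 2 for k, ys in cells.items() for y in ys)
    df = sum(len(ys) for ys in cells.values()) - 4
    se = math.sqrt(ss / df * sum(1 / len(ys) for ys in cells.values()))
    contrast = (means[(1, 1)] - means[(1, 2)]) - (means[(2, 1)] - means[(2, 2)])
    return 2 * (1 - NormalDist().cdf(abs(contrast) / se))


def simulate_power(cell_means, total_n, n_sims, alpha=0.05, seed=2018):
    """Proportion of simulated studies with a significant interaction."""
    rng = random.Random(seed)  # fixed seed (arbitrary value) for repeatability
    hits = 0
    for _ in range(n_sims):
        cells = {k: [] for k in cell_means}
        for _ in range(total_n):
            # Randomly assign each case to a level of each factor.
            key = (rng.choice((1, 2)), rng.choice((1, 2)))
            # Draw the outcome from that cell's mean with SD fixed at 1.
            cells[key].append(rng.gauss(cell_means[key], 1.0))
        hits += interaction_p_value(cells) < alpha
    return hits / n_sims


reversal = {(1, 1): 0.22, (1, 2): -0.22, (2, 1): -0.22, (2, 2): 0.22}
for n in (100, 200, 300):
    print(n, simulate_power(reversal, n, n_sims=300))
```

Because cases are assigned at random, cell sizes vary a little from one simulated study to the next, which is why the standard error is computed from the realized cell sizes rather than from total_n / 4.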

I'll walk through the three interaction examples Giner-Sorolla discussed in his post: the reversal, the knockout, and the attenuation. For all of these examples, imagine we conducted a Study 1 that was a simple randomized between-subjects experiment with two conditions and found a Cohen's d of .44. Now, we are interested in throwing another manipulation in there in Study 2 (to make a 2 x 2 design) and looking for an interaction.


The Reversal

This happens when you are expecting the effect found in Study 1 to be reversed in another condition. In the present case, we predict a Cohen's d of .44 in one condition and a Cohen's d of -.44 in another. Let's assume that there are no main effects here. So what we can say is:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .22
  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is -.22

This means that when Factor 1 is at Level 1, there is a Cohen's d of .44 between Levels 1 and 2 of Factor 2 (because .22 - (-.22) = .44). But now let's add in Factor 1, Level 2:

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is -.22
  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .22

See how the difference between levels of Factor 2 has completely flipped for Level 2 of Factor 1: the effect is d = -.44 (because -.22 - .22 = -.44). The app will show you what this pattern of means looks like:

We can look at sample sizes from 100 to 400 by 25 (i.e., 100, 125, 150, 175, 200, etc.) and do 300 simulations per sample size. The inputs for the app look like:


Then we can check out the results it returns:

We get 80% power somewhere between 150 and 175 participants. One could now set the overall sample size to range from a minimum of 150 to a maximum of 175, stepping by 1 each time, to see about how many participants are needed.
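As a rough cross-check on the simulation, power for the interaction can also be approximated in closed form. The sketch below is in Python and is not part of the app. It assumes equal cell sizes, a standard deviation of exactly 1, and a normal approximation to the test statistic. Under those assumptions the interaction contrast (the difference between the two simple effects) has a standard error of 4 divided by the square root of the total N.

```python
import math
from statistics import NormalDist


def approx_power(cell_means, total_n, alpha=0.05):
    """Approximate interaction power, assuming SD = 1 and equal cell sizes."""
    z = NormalDist()
    # Difference between the two simple effects of Factor 2.
    contrast = ((cell_means[0][0] - cell_means[0][1])
                - (cell_means[1][0] - cell_means[1][1]))
    # Four cell means, each with variance 1 / (total_n / 4).
    se = 4 / math.sqrt(total_n)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(contrast) / se
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Rows are levels of Factor 1; columns are levels of Factor 2.
reversal = [[0.22, -0.22], [-0.22, 0.22]]
for n in range(150, 176, 5):
    print(n, round(approx_power(reversal, n), 3))
```

For the reversal pattern this approximation also crosses 80% between 150 and 175 participants, which agrees with the simulated result.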

The Knockout

Now, imagine that we are expecting a manipulation to completely wipe out the effect we found in Study 1. The means would look like:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .00
  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is .44

This captures the Cohen's d = .44 we found in Study 1. Then we think Level 2 of Factor 1 is going to bring everyone down to .00:

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is .00
  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .00

The pattern of means, per the app, looks like:


We can then run the power analysis leaving all of the other presets the same. Bad news this time: Even a sample size of 400 only gets us to about 60% power!

And this might be too generous a prediction, as well. The effect of .44 found in Study 1 (or at Level 1 of Factor 1 in the plot above) almost assuredly does not come from a single source. If we knock out a hypothesized source of the effect, we should only expect to attenuate the effect, since other sources of the effect still exist. We can look at that next.

The Attenuation

Let's assume we can only get rid of half of the effect. We would plug in:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .00
  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is .44
  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is .00
  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .22

Which looks like:


Even worse news this time: We are only getting to about 20% power at best in the 350 to 400 range.
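To see why the three patterns differ so much, it helps to compare their interaction contrasts: .88 for the reversal, .44 for the knockout, and .22 for the attenuation. The Python sketch below is not part of the app. It assumes equal cell sizes, SD = 1, and a normal approximation that ignores the negligible far tail, and it solves for the total N that reaches a target power.

```python
import math
from statistics import NormalDist


def required_total_n(contrast, power=0.80, alpha=0.05):
    """Approximate total N for a 2 x 2 interaction contrast (SD = 1)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Solve contrast / (4 / sqrt(N)) = z_sum for N, then round up.
    return math.ceil(16 * (z_sum / contrast) ** 2)


patterns = [("reversal", 0.88), ("knockout", 0.44), ("attenuation", 0.22)]
for label, contrast in patterns:
    print(label, required_total_n(contrast))
```

The required N scales with one over the squared contrast, so each halving of the contrast roughly quadruples the number of participants needed.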


This is, I think, an intuitive way of going about sample size planning for interaction effects in 2 x 2 between-subjects designs. Think about the Cohen's d of the pairwise comparisons you are interested in, come up with means that represent those, and then simulate data based on those means using the app (https://markhw.shinyapps.io/power_twoway/).


A few technical notes:

  1. You might find that a larger sample size is giving you lower power. Why? It is likely because you are not using enough simulations to give you a stable power estimate. You can up the number of simulations to do—but know that it will take longer. I suggest starting with a wide range of possible sample sizes and a lower number of simulations (a few hundred), figuring out a promising range, and then going back to just that range with a higher number of simulations.
  2. I hard-coded a set.seed() statement in the app, which means that—as long as you enter the exact same inputs—you'll get the same results every time.
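The first note can be made concrete. A simulated power estimate is a proportion of significant results, so its standard error follows the usual binomial formula. The short Python sketch below is an illustration, not code from the app, and the function name is mine.

```python
import math


def power_estimate_se(power, n_sims):
    """Standard error of a simulated power estimate (a binomial proportion)."""
    return math.sqrt(power * (1 - power) / n_sims)


for n_sims in (300, 1000, 5000):
    se = power_estimate_se(0.80, n_sims)
    print(n_sims, round(se, 3), "95% margin ~", round(1.96 * se, 3))
```

With 300 simulations and true power near 80%, the standard error is a little over two percentage points, so estimates at neighboring sample sizes can easily come out in the wrong order.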

The Force is Too Strong with This One? Sexism, Star Wars, and Female Heroes

I am a big Star Wars fan, and I have really enjoyed the sequel films, The Force Awakens and The Last Jedi, thus far. My colleague Matt Baldwin and I noticed a lot of sexist bashing of Rey was happening after The Force Awakens was released. Two years ago, we collected some data on how sexists dislike Rey, but never published it.

We just wrote about these findings at Inquisitive Mind magazine, and you can click here to head over there and check it out.


"Ode to Viceroy": Mac DeMarco's Influence on Interest in Viceroy Cigarettes

Mac DeMarco released his first album, 2, on October 16th, 2012. The fifth track is called “Ode to Viceroy,” which is a song about the Viceroy brand cigarette. He’s gained in popularity with his subsequent releases, Salad Days and This Old Dog, and his affection for cigarettes has turned into somewhat of a meme.

What has the effect of “Ode to Viceroy” been on the popularity of the Viceroy brand itself?

I looked at this question by analyzing frequency of Google searches (accessed via the gtrendsR R package) using the CausalImpact R package.

I pulled the frequency of Google searches for a number of cigarette brands. I went looking around Wikipedia and found that Viceroy is owned by the R. J. Reynolds Tobacco Company. This sentence was on that company’s Wikipedia page: “Brands still manufactured but no longer receiving significant marketing support include Barclay, Belair, Capri, Carlton, GPC, Lucky Strike, Misty, Monarch, More, Now, Tareyton, Vantage, and Viceroy.”

I took each of these brand names and attached “cigarettes” to the end of the query (e.g., “GPC cigarettes”). I didn’t use “More,” because the majority of queries for “More cigarettes” were probably not in reference to the More brand. I pulled monthly search numbers for each of these brands from Google Trends. I set the date range to be four years before (October 16th, 2008) and after (October 16th, 2016) 2 came out.

The CausalImpact package employs Bayesian structural time-series models in a counterfactual framework to estimate the effect of an intervention. In this case, the “intervention” is Mac DeMarco releasing 2. Basically, the model asks the question: “What would the data have looked like if no intervention had taken place?” In the present case, the model uses information I gave it from Google Trends about the handful of cigarette brands, and then it estimates search trends for Viceroy if 2 had never been released. It then compares this “synthetic” data against what we actually observed. The difference is the estimated "causal impact" of Mac DeMarco on the popularity of Viceroy cigarettes. (I highly suggest reading the paper, written by some folks at Google, introducing this method.)

When doing these analyses, we assume two crucial things: first, that none of the other brands were affected by Mac DeMarco releasing 2; and second, that the relationships between the other brands and Viceroy remained the same after the album’s release as before the release.
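The counterfactual logic, and why those two assumptions matter, can be sketched with a much simpler stand-in for the real model. The Python below is not CausalImpact and not the analysis code on my GitHub. It swaps the Bayesian structural time-series model for an ordinary least-squares line fit on the pre-period, uses a single control series, and runs on made-up numbers.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def average_effect(control, treated, n_pre):
    """Mean gap between observed and predicted values after the intervention."""
    # Learn the control-to-treated relationship from the pre-period only.
    slope, intercept = fit_line(control[:n_pre], treated[:n_pre])
    # Project that relationship forward: the "no intervention" prediction.
    counterfactual = [intercept + slope * c for c in control[n_pre:]]
    gaps = [obs - cf for obs, cf in zip(treated[n_pre:], counterfactual)]
    return fmean(gaps)


# Made-up series: treated tracks 2 * control, then jumps by 10 after period 4.
control = [10, 12, 11, 13, 12, 14, 13, 15]
treated = [20, 24, 22, 26, 34, 38, 36, 40]
print(average_effect(control, treated, n_pre=4))
```

If the intervention also moved the control series, or if the pre-period relationship stopped holding afterward, the projected line would no longer be a fair stand-in for what would have happened, and the estimated gap would be biased.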

Google Trends norms its data and scales it in a way that isn’t readily interpretable. For any trend that you retrieve, the highest number of searches is set to the value of 100. Every other value is scaled to that: If you observe a 50 for one month, that means it is 50% of the number of searches observed at the max in that time period. Keep this in mind when looking at the results.
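A tiny Python sketch of that max-equals-100 scaling, using made-up counts, shows how the numbers should be read. It is only an illustration of the rescaling described above and leaves out whatever other normalization Google applies before this step.

```python
def scale_like_google_trends(counts):
    """Rescale raw counts so the largest value becomes 100."""
    peak = max(counts)
    if peak == 0:
        return [0 for _ in counts]
    return [round(100 * c / peak) for c in counts]


raw_monthly_searches = [120, 300, 150, 600]  # made-up numbers for illustration
print(scale_like_google_trends(raw_monthly_searches))
```

The month with 300 searches becomes 50 because it has half as many searches as the peak month, which is all a value of 50 tells you.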

You can see the trend below. The black line is what we actually observed for the number of Google queries for “Viceroy cigarettes.” The dashed vertical line represents when Mac released 2. The dashed blue line is what we estimate would have been the trend if Mac hadn’t ever released his album, and the lighter blue area above and below this line represents our uncertainty in this dashed blue line. Specifically, there is 95% probability that the dashed blue line is somewhere in that light blue range.

We can see that what we actually observed goes outside of this blue range about a year after Mac released his album. According to the model, there is greater than a 99.97% probability that Mac DeMarco had a positive effect on people Googling “Viceroy cigarettes”. On average, the difference between what we actually observed and the estimated trend if Mac didn’t release 2 was 31, and there’s a 95% probability that this difference is between 27 and 35. This number is a little hard to interpret (given how the data are normed and scaled), but one could say that the estimated causal impact, for the average month in the four years after Mac's first album, is about 31% of whatever the highest observed number of monthly search queries was for “Viceroy cigarettes” in this same period.

But how do we know this is due to Mac? We can't ever be 100% sure, but if you look at Google Trends for "Viceroy cigarettes" in the four years after he released his album, the top "related query" is "Mac DeMarco."

In one song, Mac DeMarco was able to get people more interested in Viceroy cigarettes. I’m interested in how this affected sales—I’d bet there is at least some relationship between Google searches and sales.

Code for all data collection and analyses is available on my GitHub.

A Monte Carlo Study on Methods for Handling Class Imbalance in Machine Learning

I recently ran a simulation study comparing methods for handling class imbalance (in this case, when the class of interest is less than about 3% of the data) for a statistical computing course. I simulated 500 data sets, varying some characteristics like sample size and minority class size, and tested a number of preprocessing techniques (e.g., SMOTE) and algorithms (e.g., XGBoost). You can view the working paper by clicking here.

If you don't want to slog through the whole paper, the plot below shows densities of how each model (combination of sampling technique and algorithm) performed. I totally left off models that used no preprocessing and oversampling, since they made so few positive predictions that metrics like F1 scores couldn't even be calculated most of the time!
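To see why F1 can fail to exist at all, consider a model that never predicts the minority class. The Python sketch below is an illustration, not code from the paper. Returning None for the undefined case is my own choice, and libraries differ in how they handle it.

```python
def f1_score(y_true, y_pred):
    """Return F1 for the positive class (1), or None when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    if tp + fp == 0 or tp + fn == 0:
        return None  # precision or recall has a zero denominator
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = [0] * 97 + [1] * 3  # about 3% minority class
always_negative = [0] * 100  # a model that never predicts the minority class
print(f1_score(y_true, always_negative))
```

With no positive predictions, precision is zero divided by zero, so there is nothing to plug into the F1 formula.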

Feel free to check out the GitHub repository, as well.


Using Beta Regression to Better Model Norms in Political Psychology

I recently wrote a short working paper on how to use beta regression and how it helps take into account norms in correlational studies of ideology, politics, and prejudice. It is a little long for a blog post (and this platform does not support LaTeX), so I uploaded it as a working paper. Click here to download the paper.

I hope it is instructive and informative, and that it fills in a few gaps from previous papers. Please email me if you have any questions about it. As always, the code can be found over at my GitHub.

Updated 2017-12-11
