On Calculating Power for Interactions in 2 x 2 Factorial Designs

Link to the power calculation app: https://markhw.shinyapps.io/power_twoway/

Note: I have made a few updates to the app since originally publishing this blog post, including making the visuals prettier and including a field to adjust the alpha level. One can track updates at the GitHub repository, https://github.com/markhwhiteii/power_twoway.


Statistical power is important when doing experimental psychology. I'm not going to try to convince you of this—I think there is enough published work over the last few decades that should do so. Instead, I'm going to assume you want your study to be adequately powered and you are trying to figure out how many participants you need to find the interaction you hypothesize. I'm going to show what I think is an intuitive way of conducting a power analysis for an interaction effect in a 2 x 2 between-subjects experiment.


There has been a lot written about this, including by Uri Simonsohn (http://datacolada.org/17), Jake Westfall (http://jakewestfall.org/blog/index.php/2015/05/26/think-about-total-n-not-n-per-cell/), and Roger Giner-Sorolla (https://approachingblog.wordpress.com/2018/01/24/powering-your-interaction-2/).

Westfall notes how people can be confused when they go to G*Power and it tells them they need the same number of participants in a 2 x 2 design as they do in a simple randomized experiment with two conditions. I'll leave it to the three blog posts linked above to explain this dilemma.

Simonsohn and Giner-Sorolla both offer rules of thumb in sample size planning for a 2 x 2 interaction that sound something like: "In Situation X, you should multiply your per-cell sample size from Study 1 by Y," where Study 1 was a between-subjects experiment with two conditions. I agree with Westfall that we should not necessarily be thinking about cell size, but overall N. I also don't like using the a priori sample size from Study 1 to inform the sample size for Study 2, as this ignores the information we learned in Study 1. And using G*Power is difficult, because I don't really know how many people think of interesting, real-world phenomena and say to themselves, "Ah yes, this is probably a .15 Cohen's f-squared!"

What I find intuitive is sketching out a pattern of means you expect to see and then calculating power for those results, which is what I suggest one does when calculating power for an interaction effect in a 2 x 2 between-subjects design.


I'm going to assume familiarity with Cohen's d—that is, how many standard deviations two means are from one another. If you understand that, you can understand this approach.

Imagine we have a dependent variable that has a standard deviation of 1. That means that any mean differences we find are standardized differences—that is, they are in units of Cohen's d. If one mean is 0.2 and another is 0.7, the Cohen's d between these two means is 0.5. With this in mind, the steps I suggest (and coded into a tool) are:

  1. For your 2 x 2 design, sketch out four means you expect to see, assuming that the dependent variable in all conditions has a standard deviation of 1. Use an observed Cohen's d to inform you of this.

  2. Get an overall sample size and simulate data based on these means and sample size.

  3. See if the p-value for the interaction effect is less than .05.

  4. Do steps 2 and 3 a large number of times.

  5. Get the proportion of times your simulated data had a p-value less than .05. This proportion is the power for that sample size.

  6. Do this for a wide range of possible overall sample sizes.

The logic behind simulating data is that frequentist statistics are all about the long run: If we were to do this exact study a large number of times, what proportion of the time would we find a significant effect? The computer can simulate this long run and tell you what proportion of the simulated studies came out significant, which is the power.

I have written a few R functions that accomplish the 6 steps above so that you don't have to (the functions are located here: https://github.com/markhwhiteii/power_twoway/blob/master/helpers.R).

I also wrapped these R functions into a Shiny web app, which you can access here: https://markhw.shinyapps.io/power_twoway/.

Basically, what the backend code does is: first, randomly assign cases to Level 1 or Level 2 of two factors (cleverly named Factor 1 and Factor 2); second, draw values of the dependent variable for these cases from a population with a standard deviation of 1 and a mean of whatever you input for that specific cell; third, run the model and see if we got a significant interaction effect. I do this for each of the sample sizes that you input and for as many simulations as you specify. The proportion of significant effects found is the estimated power.
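The three steps just described can be sketched in a few lines of R. This is a simplified illustration, not the code in helpers.R: the function name sim_power and its arguments are assumptions made for this example, and it assumes normally distributed outcomes with a standard deviation of 1 in every cell. The means used at the bottom are an illustrative crossover pattern.

```r
# Simplified sketch; sim_power() and its arguments are hypothetical names,
# not the functions in the app's helpers.R.
# means: c(F1L1_F2L1, F1L1_F2L2, F1L2_F2L1, F1L2_F2L2), in Cohen's d units
sim_power <- function(means, n, nsims = 300, alpha = .05) {
  sig <- replicate(nsims, {
    # randomly assign each case to a level of each factor
    f1 <- sample(1:2, n, replace = TRUE)
    f2 <- sample(1:2, n, replace = TRUE)
    # look up each case's cell mean, then draw the outcome with sd = 1
    cell <- (f1 - 1) * 2 + f2
    y <- rnorm(n, mean = means[cell], sd = 1)
    # fit the 2 x 2 model and pull the interaction p-value
    fit <- lm(y ~ factor(f1) * factor(f2))
    p <- coef(summary(fit))["factor(f1)2:factor(f2)2", "Pr(>|t|)"]
    p < alpha
  })
  mean(sig)  # proportion of significant interactions = estimated power
}

set.seed(1839)  # fixed seed so the same inputs give the same estimates
sample_sizes <- c(100, 200, 300)
power_by_n <- sapply(sample_sizes, function(n) {
  sim_power(c(.22, -.22, -.22, .22), n = n)
})
names(power_by_n) <- sample_sizes
```

Because cases are assigned at random, cell sizes vary a little from one simulated study to the next. With very small n a cell could come up empty and the model would fail, so this sketch is only meant for sample sizes in the range discussed here.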

I'll walk through the three interaction examples Giner-Sorolla discussed in his post: the reversal, the knockout, and the attenuation. For all of these examples, imagine we conducted a Study 1 that was a simple randomized between-subjects experiment with two conditions and found a Cohen's d of .44. Now, we are interested in throwing another manipulation in there in Study 2 (to make a 2 x 2 design) and looking for an interaction.


The Reversal

This happens when you are expecting the effect found in Study 1 to be reversed in another condition. In the present case, we predict a Cohen's d of .44 in one condition and a Cohen's d of -.44 in another. Let's assume that there are no main effects here. So what we can say is:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .22

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is -.22

This means that when Factor 1 is at Level 1, there is a Cohen's d of .44 between Levels 1 and 2 of Factor 2 (because .22 - (-.22) = .44). But now let's add in Factor 1, Level 2:

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is -.22

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .22

See how the difference between levels in Factor 2 has completely flipped for Level 2 of Factor 1: the effect is d = -.44 (because -.22 - .22 = -.44). The app will show you what this pattern of means looks like:

We can look at sample sizes from 100 to 400 by 25 (i.e., 100, 125, 150, 175, 200, etc.) and do 300 simulations per sample size. The inputs for the app look like:


Then we can check out the results it returns:

We get 80% power somewhere between 150 and 175 participants. One could now set the minimum overall sample size to 150 and the maximum to 175, stepping by 1 each time, to see about how many participants are needed.
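As a rough cross-check on the simulated result, a normal approximation gives a closed-form answer. This sketch is not part of the app; the function name approx_n is hypothetical, and it assumes equal cell sizes, a standard deviation of 1 in every cell, and a two-sided test. Under those assumptions each cell mean has variance 4 / N, so the interaction contrast (the difference between the two simple effects) has a standard error of 4 / sqrt(N).

```r
# Hypothetical helper: approximate total N for a 2 x 2 interaction
# means: c(F1L1_F2L1, F1L1_F2L2, F1L2_F2L1, F1L2_F2L2), sd = 1 in every cell
approx_n <- function(means, power = .80, alpha = .05) {
  # interaction contrast: difference between the two simple effects
  contrast <- (means[1] - means[2]) - (means[3] - means[4])
  z_alpha <- qnorm(1 - alpha / 2)
  z_power <- qnorm(power)
  # SE of the contrast is 4 / sqrt(N), so solve for N and round up
  ceiling((4 * (z_alpha + z_power) / abs(contrast))^2)
}

n_reversal <- approx_n(c(.22, -.22, -.22, .22))
```

For the reversal pattern the contrast is .88, and the approximation lands inside the 150 to 175 range that the simulation pointed to.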

The Knockout

Now, imagine that we are expecting a manipulation to completely wipe out the effect we found in Study 1. The means would look like:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .00

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is .44

This captures the Cohen's d = .44 we found in Study 1. Then we think Level 2 of Factor 1 is going to bring everyone down to .00:

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is .00

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .00

The pattern of means, per the app, looks like:


We can then run the power analysis leaving all of the other presets the same. Bad news this time: Even a sample size of 400 gets us to only about 60% power!

And this might be too generous a prediction, as well. The effect of .44 found in Study 1 (or Level 1 of Factor 1 in the plot above) is almost assuredly not coming from one source. If we knock out a hypothesized source of the effect, we should only expect to attenuate the effect, since other sources of the effect still exist. We can look at that next.

The Attenuation

Let's assume we can only get rid of half of the effect. We would plug in:

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 1 is .00

  • The mean for participants in Factor 1, Level 1 and Factor 2, Level 2 is .44

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 1 is .00

  • The mean for participants in Factor 1, Level 2 and Factor 2, Level 2 is .22

Which looks like:


Even worse news this time: We are only getting to about 20% power at best in the 350 to 400 range.
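One way to see why power falls from the reversal to the knockout to the attenuation is to compute the interaction contrast (the difference between the two simple effects) for each pattern. The sketch below does that and then applies a normal approximation to power at N = 400. It is not part of the app; approx_power is a hypothetical helper that assumes equal cell sizes and a standard deviation of 1 in every cell, so the contrast has a standard error of 4 / sqrt(N).

```r
# Hypothetical helper: normal-approximation power for a 2 x 2 interaction
# means: c(F1L1_F2L1, F1L1_F2L2, F1L2_F2L1, F1L2_F2L2), sd = 1 in every cell
approx_power <- function(means, n, alpha = .05) {
  contrast <- (means[1] - means[2]) - (means[3] - means[4])
  z <- abs(contrast) * sqrt(n) / 4    # contrast divided by its SE, 4 / sqrt(N)
  crit <- qnorm(1 - alpha / 2)
  pnorm(z - crit) + pnorm(-z - crit)  # two-sided power
}

# The three patterns of means from this post
patterns <- list(
  reversal    = c(.22, -.22, -.22, .22),
  knockout    = c(.00,  .44,  .00, .00),
  attenuation = c(.00,  .44,  .00, .22)
)

contrasts <- sapply(patterns, function(m) (m[1] - m[2]) - (m[3] - m[4]))
power_400 <- sapply(patterns, approx_power, n = 400)
```

In absolute value the contrasts are .88, .44, and .22: each pattern halves the size of the interaction relative to the one before it. The approximate power values at N = 400 for the knockout and the attenuation agree with the roughly 60% and 20% reported by the simulations.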


This is, I think, an intuitive way of going about sample size planning for interaction effects in 2 x 2 between-subjects designs. Think about the Cohen's d of the pairwise comparisons you are interested in, come up with means that represent those, and then simulate data based on those means using the app (https://markhw.shinyapps.io/power_twoway/).


A few technical notes:

  1. You might find that a larger sample size is giving you lower power. Why? It is likely because you are not using enough simulations to give you a stable power estimate. You can up the number of simulations to do—but know that it will take longer. I suggest starting with a wide range of possible sample sizes and a lower number of simulations (a few hundred), figuring out a promising range, and then going back to just that range with a higher number of simulations.

  2. I hard-coded a set.seed() statement in the app, which means that—as long as you enter the exact same inputs—you'll get the same results every time.
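The instability described in the first note can be quantified. Each simulated study either is or is not significant, so a power estimate based on a given number of simulations has a binomial standard error. A minimal sketch (the function name mc_se is hypothetical):

```r
# Hypothetical helper: Monte Carlo standard error of a simulated power estimate
mc_se <- function(power, nsims) {
  sqrt(power * (1 - power) / nsims)
}

se_300  <- mc_se(.80, nsims = 300)   # a few hundred simulations
se_5000 <- mc_se(.80, nsims = 5000)  # a longer run

# Rough 95% margin of error around an estimate of .80 in each case
moe_300  <- 1.96 * se_300
moe_5000 <- 1.96 * se_5000
```

With 300 simulations the margin of error around .80 is roughly 4 to 5 percentage points, which is enough for neighboring sample sizes to come out in the wrong order. With 5,000 simulations it shrinks to about 1 point.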

Quantifying "Low-Brow" and "High-Brow" Films

I went and saw Certain Women a few months ago. I was pretty excited to see it; a blurb in the trailer calls it “Triumphant… an indelible portrait of independent women,” which sounds pretty solid to me. The film had a solid point in that it exposed the mundane, everyday ways in which women have to confront sexism. It isn't always a huge dramatic thing that is obvious to everyone—instead, most of the time sexism is commonplace and woven into the routine of our society.

The only problem is that I found the movie, well, pretty boring. Showing how quotidian sexism is in a film makes for a slow-paced, quotidian plot. A few days ago, I happened upon the Rotten Tomatoes entry for the movie. It scored very well with critics (92% liked it), but rather poorly with audiences (52%). It made me think of the divisions between critics and audiences; I thought that the biggest differences between audience and critic scores could be an interesting way to quantify what is “high-brow” and what is “low-brow” film. So I got critic and audience scores for movies released in 2016, plotted them against one another, and looked at where they differed most.


The movies I chose to examine were all listed on the 2016 in film Wikipedia page. The problem was I needed links to Rotten Tomatoes pages, not just names of movies. So, I scraped this table, took the names of the films, and I turned them into Google search URLs by taking "https://google.com/search?q=rottentomatoes+2016+" and using paste0 to add the name of the film at the end of the string. Then, I wrote a little function (using rvest and magrittr) that takes this Google search URL and fetches me the URL for the first result of a Google search:

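# function for getting first hit from google page
library(rvest)     # read_html(), html_node(), html_attr()
library(magrittr)  # %>%

getGoogleFirst <- function(url) {
  url %>% 
    read_html() %>%                        # download the search results page
    html_node(".g:nth-child(1) .r a") %>%  # link of the first search result
    html_attr("href") %>%                  # its href attribute
    strsplit(split="=") %>% 
    getElement(1) %>% 
    strsplit(split="&") %>% 
    getElement(2) %>% 
    # the original listing is cut off here; keeping the first piece
    # (the URL itself) is an assumed completion
    getElement(1)
}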
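The URL-building step described before the function can be sketched as follows. The prefix is the one given in the text. The call to URLencode() (from base R's utils package) is an addition for this example, since the text does not say how spaces and punctuation in film names were handled.

```r
# Film names taken from this post, used here only as sample input
films <- c("Certain Women", "Hail, Caesar!", "The Witch")

# Prefix given in the text
prefix <- "https://google.com/search?q=rottentomatoes+2016+"

# Assumption: percent-encode each film name so the URL has no spaces
encoded <- vapply(films, URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)

# paste0() appends each encoded name to the end of the prefix
search_urls <- paste0(prefix, encoded)
```

Each element of search_urls could then be passed to a function like getGoogleFirst() in a loop or with vapply(); that step needs network access and is not run here.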

After running this through a loop, I got a long vector of Rotten Tomatoes links. Then, I fed them into two functions that get critic and audience scores:

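# get rotten tomatoes critic score
rtCritic <- function(url) {
  url %>% 
    read_html() %>% 
    html_node("#tomato_meter_link .superPageFontColor") %>% 
    html_text() %>% 
    strsplit(split="%") %>% 
    # the original listing is cut off here; taking the text before the "%"
    # and converting it to a number is an assumed completion
    getElement(1) %>% 
    as.numeric()
}

# get rotten tomatoes audience score
rtAudience <- function(url) {
  url %>% 
    read_html() %>% 
    html_node(".meter-value .superPageFontColor") %>% 
    html_text() %>% 
    strsplit(split="%") %>% 
    # assumed completion, as in rtCritic() above
    getElement(1) %>% 
    as.numeric()
}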

The film names and scores were all put into a data frame.


Overall, I collected data on 224 films. The average critic score was 56.74 and the average audience score was 58.67. Audiences tended to be more positive, but the difference was small (1.93) and not statistically significant, t(223) = 1.34, p = .181. Audiences and critics tended to agree; scores between the two groups correlated strongly, r = .68.

But where do audiences and critics disagree most? I calculated a difference score by taking critic - audience scores, such that positive scores meant critics liked the film more than audiences. The five biggest difference scores in each direction are in the tables below (six in the positive direction, because two films tied at 35).

“High-Brow” Films

Film              | Critic | Audience | Difference
The Monkey King 2 |    100 |       49 |         51
Hail, Caesar!     |     86 |       44 |         42
Little Sister     |     96 |       54 |         42
The Monster       |     78 |       39 |         39
The Witch         |     91 |       56 |         35
Into the Forest   |     77 |       42 |         35

“Low-Brow” Films

Film                                                          | Critic | Audience | Difference
Hillary's America: The Secret History of the Democratic Party |      4 |       81 |        -77
The River Thief                                               |      0 |       69 |        -69
I'm Not Ashamed                                               |     22 |       84 |        -62
Meet the Blacks                                               |     13 |       74 |        -61
God's Not Dead 2                                              |      9 |       63 |        -54
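The difference score behind these tables can be sketched in R using only the rows shown above. This is 11 of the 224 films, so it illustrates the calculation and the sorting, not the full analysis; the data frame and column names are chosen for this example.

```r
# Rows copied from the two tables above (11 of the 224 films)
films <- data.frame(
  film = c("The Monkey King 2", "Hail, Caesar!", "Little Sister",
           "The Monster", "The Witch", "Into the Forest",
           "Hillary's America: The Secret History of the Democratic Party",
           "The River Thief", "I'm Not Ashamed", "Meet the Blacks",
           "God's Not Dead 2"),
  critic   = c(100, 86, 96, 78, 91, 77,  4,  0, 22, 13,  9),
  audience = c( 49, 44, 54, 39, 56, 42, 81, 69, 84, 74, 63)
)

# Positive values: critics liked the film more than audiences did
films$difference <- films$critic - films$audience

# Sort from most "high-brow" to most "low-brow"
films <- films[order(-films$difference), ]
```

The reported t(223) is consistent with a paired t-test across the 224 films, which in R would be t.test(critic, audience, paired = TRUE) on the full data. That call is an assumption about the analysis, not something stated in the post.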

Interactive Plot

Below is a scatterplot of the two scores with a regression line plotted. The dots in blue are those films in the tables above. You can hover over any dot to see the film it represents as well as the audience and critic scores:

I won't do too much interpreting of the results; you can see for yourself where the movies fall by hovering over the dots. But I would be remiss if I didn't point out that the largest difference score belonged to an anti-Hillary Clinton movie: 4% of critics liked it, but somehow 81% of the audience did. Given all of the evidence that pro-Trump bots were all over the Internet in the run-up to the 2016 U.S. presidential election, I would not be surprised if many of these audience votes came from bots.

Apparently I'm a low-brow plebeian; I did not see any of the most “high-brow” movies, according to the metric. Both critics and audiences seemed to love Hidden Figures (saw it, and it was awesome) and Zootopia (still haven't seen it).

Let me know what you think of this “low-brow/high-brow” metric or better ways one could quantify the construct.