In search of the best, general-purpose algorithm for counterbalancing conditions

Some updates based on the discussion thread below

We are looking for input on the best general-purpose counterbalancing algorithm for experiment designs. We’ve identified some of the features that we believe such an algorithm should have, as well as part of a way to do it, but we’re interested in whether we are missing a general-purpose solution that already exists out there. Please comment if you have thought about this!

Here are the desired features:

  1. Subjects are pseudo-randomly assigned to conditions.
  2. Conditions fill in more or less evenly. For example, if you have four conditions in your experiment and run 20 people, there should be 5 in each condition. If you run 200, there should be 50 in each condition. Another way of saying this is that if you take any reasonably sized sub-sequence of subjects (e.g., subjects 23-33 in a design with 100 people and four conditions), the conditions will still be roughly balanced (basically, the process is approximately ergodic; a minimal sketch of this is given just after this list).
  3. When there are multiple nested conditions (e.g., N-way designs, or counterbalancing of random factors like button placement), these should also fill in evenly, but with lower priority than the main experimental conditions. For example, imagine a study where people are assigned to one of two shape conditions (triangle or circle) but also to one of 25 different colors. You care mostly about shape in your experiment but want color randomly and evenly assigned as well. If you run 20 people, you want 10 to see triangles and 10 to see circles. For color, you don’t have enough subjects to visit all 50 shape-color combinations, but you still want color to be assigned evenly across your sample (so you don’t end up with 2-3 people seeing blue triangles).
  4. You may want to enforce conditional dependencies between factors. For example, you might want the distribution of color assignments to be separately balanced within both the circle and the triangle groups, or you might instead choose color independently of the subject’s assignment to a shape. Choosing independently can lead to issues with correlation: it may turn out that every time you have a blue shape it’s a triangle, which could make your data hard to analyze. However, if you choose conditionally (i.e., choose the color randomly given that it’s a triangle), you could end up with many more blue objects than red objects overall in your experiment. There is an obvious tradeoff here, and you may want to optimize it for your particular experiment.
  5. You want all of this to work in a situation where people might withdraw from your study at any time (e.g., online studies). This is the kicker. In a normal psychology study, you just assign subjects equally to every possible combination of conditions and don’t worry about it. If someone drops out, you just assign the next person who comes in to the same condition (or someone else later on). But in online experimentation we often have 30 people taking an experiment at the same time and are unsure which of them are actually going to finish (in other words, which conditions are actually going to end up in your analysis). You need to assign people to conditions in a way that is robust to dropouts, so that in the end the distribution of conditions is reasonably even. Thus you need to be able to adjust your assignment algorithm given some reasonable idea about who is unlikely to return a data point.
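
To make features 1 and 2 concrete, here is a minimal sketch (purely illustrative; the function and condition names are made up, and it ignores features 3-5): assign each incoming subject to one of the currently least-filled conditions, breaking ties pseudo-randomly.

import random

def assign_condition(counts):
    # Assign the next subject to one of the least-filled conditions,
    # breaking ties pseudo-randomly (features 1 and 2 only).
    fewest = min(counts.values())
    tied = [c for c, n in counts.items() if n == fewest]
    choice = random.choice(tied)
    counts[choice] += 1
    return choice

counts = {"cond1": 0, "cond2": 0, "cond3": 0, "cond4": 0}
assignments = [assign_condition(counts) for _ in range(20)]
print(counts)  # 20 subjects over four conditions -> 5 per condition

Any contiguous run of subjects under this scheme is also roughly balanced, which is the approximate ergodicity described in feature 2.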

Is there a literature on this? Has this been worked out by someone? It seems like the above features cover quite a broad range of experiment designs. We suppose people often work out this kind of balancing by hand after deciding on a sample size for an experiment, but an algorithmic solution might be useful. We were partly inspired by the recent Facebook PlanOut library, although it is designed for different types of experiments than most psychology experimenters deal with.

So far, we have created a simple, open-source Python library called Counterbalancing that accomplishes a few of these objectives. If anyone is interested in contributing to make it even better, please jump in!


  1. Here is a simple procedure that seems to satisfy #1 through #4. It is not clear that #5 can be solved without additional a priori information about which cells are most likely to lead to drop-outs. #3 appears at first to require additional information as well, but I think it can actually be reduced to the simpler requirement that “factors with fewer levels should be balanced first”. Note that in your example, arbitrarily designating the 25-level color factor as the one we “care most about” would not obviously change the assignment procedure, since we still may as well keep the shape factor balanced as we go. As for #4, I must admit that I am unclear on what you’re getting at, but I at least think that this procedure avoids the two pitfalls that you point out.

    Anyway, the procedure is this. For each new observation, we sample *without replacement* from among the unique levels of *each* of the factors that we wish to counterbalance across. When one of the factors runs out of levels, we “refill” the levels of that factor and then continue.

    Here is a concrete visual example. Imagine I have 3 boxes, each full of a certain number of tickets. The 3 boxes represent 3 factors that I want to counterbalance across, and the tickets in each box represent the unique levels of those factors. Specifically, the first box has 4 tickets, the second box has 3 tickets, and the third box has 2 tickets. So I have 4*3*2 = 24 cells to counterbalance across in total. For each new observation that is added to the dataset, I draw one ticket from each of the 3 boxes to see which levels of the 3 factors that observation receives. So ultimately every observation gets some combination of 3 tickets (again, 24 unique combinations of tickets in total). When a ticket is drawn from a box, that ticket is set down beside its box (not replaced). When a box becomes empty, I take all of the tickets from that box and put them back in the box. Continue until the desired number of observations is reached.

    Here is some brief R code that does this for the example above, with factors denoted A (levels a1 to a4), B (levels b1 to b3), and C (levels c1 to c2). This code only does 24 observations, but of course you could keep going.

    # Levels of the three factors: A has 4, B has 3, C has 2 (4*3*2 = 24 cells)
    (A <- paste0("a", 1:4))
    (B <- paste0("b", 1:3))
    (C <- paste0("c", 1:2))
    total <- length(A) * length(B) * length(C)
    # For each factor, repeatedly shuffle its levels (i.e., sample without
    # replacement and then "refill") until 'total' observations are covered
    samps <- lapply(list(A, B, C), function(x) {
      c(replicate(total / length(x), sample(x)))
    })
    do.call(cbind, samps)  # one row per observation, one column per factor

    Jake Westfall
  2. Hey Jake,
    Thanks for thinking about this problem! Maybe the best way to illustrate the problem we are talking about in #4 is by looking at your code. Here is the output of one run:
    a b c
    1 a2 b1 c2
    2 a1 b3 c1
    3 a3 b2 c1
    4 a4 b2 c2
    5 a3 b1 c2
    6 a1 b3 c1
    7 a2 b2 c2
    8 a4 b1 c1
    9 a2 b3 c2
    10 a3 b2 c1
    11 a4 b1 c2
    12 a1 b3 c1
    13 a1 b1 c1
    14 a4 b2 c2
    15 a3 b3 c2
    16 a2 b3 c1
    17 a3 b1 c1
    18 a4 b2 c2
    19 a2 b1 c2
    20 a1 b3 c1
    21 a2 b2 c2
    22 a1 b3 c1
    23 a3 b2 c2
    24 a4 b1 c1

    Here is the same list with only the A = a1 subjects:
    a b c
    2 a1 b3 c1
    6 a1 b3 c1
    12 a1 b3 c1
    13 a1 b1 c1
    20 a1 b3 c1
    22 a1 b3 c1

    You’ll notice that all of the a1’s are paired with c1’s and almost all with b3’s. This makes analyzing your data pretty hard, since you can’t separate out the effects of a1 and c1. We are hoping to make a general algorithm that could balance your desire to select a’s randomly against your desire to select a’s randomly given a particular choice of b’s and c’s. In addition, we are trying to find ways to make sure that your conditions are evenly matched after dropouts in a way that won’t screw up the data analysis. Please keep thinking about it though, and definitely check out the code we have up on GitHub (which implements this to some extent). And feel free to contribute to the project if you come up with anything else!
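
    For illustration (this is only a sketch, not necessarily what any existing library does), one way to avoid this particular confound is to sample whole factor combinations (cells) without replacement, rather than sampling each factor’s levels independently:

    import itertools
    import random

    def cell_sampler(factors):
        # Shuffle the full cross-product of levels and deal it out,
        # reshuffling when the block is exhausted. Within each block of
        # 24 subjects every cell appears exactly once, so no level of one
        # factor can get locked to a level of another.
        cells = list(itertools.product(*factors))
        while True:
            random.shuffle(cells)
            for cell in cells:
                yield cell

    sampler = cell_sampler([["a1", "a2", "a3", "a4"],
                            ["b1", "b2", "b3"],
                            ["c1", "c2"]])
    assignments = [next(sampler) for _ in range(24)]

    The tradeoff is that if data collection stops partway through a block, the individual factors are no longer guaranteed to be balanced on their own, which is part of what the rest of this thread tries to address.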

    David Halpern
  3. Hi David, yes, I see now the big flaw in my proposed solution. I guess if I had just done some basic checking on the solution I would have uncovered this problem (d’oh). I spent a little more time after this thinking about the issue, eventually deciding that I needed to return to “real work,” but I thought I would go ahead and at least share how I had started thinking about this in case you found it at all useful.

    My thought was that I would, via brute-force iterative application of the method I proposed, find a particular solution that was also completely balanced across all 24 factor combinations, and then try to study this solution to see if I could infer some general rules for how this kind of solution could be reliably generated without having to rely on brute force. Here is one particular solution that I found:

    [1,] a1 b3 c1
    [2,] a4 b1 c2
    [3,] a2 b2 c2
    [4,] a3 b2 c1
    [5,] a1 b1 c2
    [6,] a3 b3 c1
    [7,] a4 b3 c1
    [8,] a2 b1 c2
    [9,] a2 b2 c1
    [10,] a3 b3 c2
    [11,] a1 b2 c2
    [12,] a4 b1 c1
    [13,] a2 b3 c2
    [14,] a4 b2 c1
    [15,] a3 b1 c2
    [16,] a1 b2 c1
    [17,] a1 b1 c1
    [18,] a4 b3 c2
    [19,] a2 b1 c1
    [20,] a3 b2 c2
    [21,] a1 b3 c2
    [22,] a3 b1 c1
    [23,] a2 b3 c1
    [24,] a4 b2 c2

    And here is the same solution represented as indices arranged in a 4*3*2 array:

    , , c1

    b1 b2 b3
    a1 17 16 1
    a2 19 9 23
    a3 22 4 6
    a4 12 14 7

    , , c2

    b1 b2 b3
    a1 5 11 21
    a2 8 3 13
    a3 15 20 10
    a4 2 24 18

    I tried studying this latter representation for a while, trying to get some ideas about an algorithm that would move sequentially through the cells of the array in a way that both satisfies the method I proposed (so that early termination would at least guarantee that each of the factors was separately balanced as much as possible) and is also guaranteed to pass through every cell exactly once per full pass through the set of cells (so that there is balance across all the factor combinations as much as possible). I eventually decided I was spending too much time on this and gave up. But I post this here just in case you find it a useful way to think about the problem. I figured it can’t hurt.
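
    In case it is useful, here is a rough Python sketch (purely illustrative) of the brute-force search described above: keep re-running the box-of-tickets procedure, restarting whenever a factor combination repeats, until one run happens to visit all 24 cells exactly once.

    import random

    def brute_force_balanced(factors, max_restarts=2000000):
        # Draw each factor's levels without replacement, refilling a "box"
        # when it empties (the method above), and restart the attempt as
        # soon as any factor combination repeats. A completed run is as
        # balanced as possible per factor at any stopping point and also
        # visits every cell exactly once. Pure rejection sampling, so it
        # can need many (cheap) restarts.
        n_cells = 1
        for levels in factors:
            n_cells *= len(levels)
        for _ in range(max_restarts):
            boxes = [list(levels) for levels in factors]
            seen, seq = set(), []
            for _ in range(n_cells):
                row = []
                for box, levels in zip(boxes, factors):
                    if not box:
                        box.extend(levels)            # refill the empty box
                    row.append(box.pop(random.randrange(len(box))))
                row = tuple(row)
                if row in seen:                       # duplicate cell: restart
                    break
                seen.add(row)
                seq.append(row)
            else:
                return seq
        return None                                   # no luck within the budget

    print(brute_force_balanced([["a1", "a2", "a3", "a4"],
                                ["b1", "b2", "b3"],
                                ["c1", "c2"]]))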

    Jake Westfall
  4. I agree with Jake that it’s not clear you can solve #5 without prior information. Consider a case in which you simultaneously recruit 100 subjects into 2 conditions, with the intent to have 50 in each, but 10 drop out of condition A and 0 drop out of B. If you knew up front what the drop-out rate was going to be, you would have assigned more subjects to A than B. If you don’t know what the drop-out rate is going to be, and the assignment is truly simultaneous, I don’t see how you can do anything about it. You could adjust the split in the next round based on the observed completion rates, but that doesn’t help you with the batch you’ve already run.

    That said, there are two reasons I’m not sure #5 is actually an issue worth worrying about. First, in practice, if there’s a gross imbalance in cell size, the researcher is almost certain to do another round of data collection, at which point your existing algorithm could in principle balance things out just fine, so long as it’s tracking completion counts over multiple batches. Second, it’s arguably quite dangerous to strive for balanced cells in a case where there’s a systematic difference in drop-out rate. In the above example, if your algorithm tries to compensate for the imbalance by recruiting 10 more subjects into A, you’ve now introduced additional confounds, in that you have 10 subjects who are no longer randomly assigned to condition (and, on top of that, are unevenly sampled from different epochs in time). The right way to deal with this situation, I would argue, is to give up the requirement that cells be even and just focus on getting the minimal desired N in each cell. I.e., if you absolutely have to have 50 in each cell, then run another 25 subjects or so with the same randomization scheme (given A’s 20% drop-out rate, roughly 10 of the ~12 new subjects assigned to A should complete), so that you’ll end up with something like 50 / 62 in A and B. I think if you create an algorithm that tries to “correct” for past imbalances in assignment, there’s a good chance you just end up making assignment non-random. Forcing researchers to acquire a little more data than they thought they needed seems like the much better course of action in such cases.

    Tal Yarkoni
  5. Eytan Bakshy shared these thoughts in an email exchange, which I am posting here:

    ——————————————–

    There are a couple thoughts:

    0. Possibility of bias:
    The concern that we have with trying to replace users who dropped out in order to achieve balance is that it might give you a biased estimate of your causal effect. In a nutshell, if we don’t know how users’ drop-out rates are related to how they perform in the experiment, filling in subjects who dropped out can be very dangerous.

    Let’s say we were interested in some average treatment effect on some outcome Y for a treatment D(m) (where each D(m) is some potentially vector-valued parameterization).

    If treatment assignment is uncorrelated with each Y_i’s potential outcomes (or with observed/unobserved covariates), then you can compute the ATE for the contrast D(1) vs D(0) as:
    ATE = E[Y | D = D(1)] - E[Y | D = D(0)]

    In the case when subjects drop out, you have unequal assignment probabilities that depend on some observed and unobserved factors, let’s call them X. Then your estimator for the ATE would be:

    ATE =
    Sum[E[Y | D = D(1), X_j] Pr(complete | D = D(1), X_j), {j, 1, J}] -
    Sum[E[Y | D = D(0), X_j] Pr(complete | D = D(0), X_j), {j, 1, J}]

    where J is the number of strata in your partitioning of the set of covariates X. If we could do a very good job of predicting the drop-out rates as a function of X (and we knew sufficiently well what X was), then we’d be in OK shape.

    There might be some cases where we don’t think drop-out rates are related to how subjects perform, but it seems like if you are giving Turkers tasks that might be more or less difficult, this could be problematic.

    I think Dean and I are both assuming that you don’t have enough data (covariates + N) to figure out whether the data are missing in any systematic way in an MTurk experiment. How would you, e.g., test whether data missing due to drop-out are missing at random for a Stroop test experiment?

    1. The role of counterbalancing software and the problems with adaptive assignment….

    An advantage of MTurk over lab studies is that you can recruit a lot of subjects. I feel like algorithms that try to fill in missing cells encourage researchers to:
    (a) use small sample sizes, which can create risks of Type-M errors,
    (b) risk creating (unknown) errors in randomization, and
    (c) choose a design that requires some fairly complex adjustment and checking in the analysis phase.

    So, as developers of tools that support good methodological practices (like PlanOut), I would advocate approaches that try to get a minimum number of subjects per cell but don’t try to fill in subjects who drop out, since that minimizes the risks above.

    2. On field studies versus lab experiments…

    Issues of missing data in experiments don’t come up at Facebook that often because we’re generally interested in behavioral outcomes in usage activity — like how often users log in or how much content they produce — and we can measure that for every subject at any point in time, provided that they didn’t delete their account. That is, there is very little to no attrition. For experiments for which the outcomes are measured by surveys, of course, we do need to think about non-response, and Dean has done a lot of work with survey adjustment recently because of this.

    Attrition is a serious consideration in the evaluation of field experiments, most obviously in medicine, but also in political science and economics. I wonder if these missing-data issues haven’t been a consideration for cognitive psychology because most research is done in the lab. Mechanical Turk experiments seem to be somewhere in between a lab experiment and a field experiment, in that you no longer have control over your ability to observe each outcome for a given dose (or to be sure that your participant received the dose), and so it seems natural to expect that cognitive psychology might begin to borrow from methodologies in these other disciplines.

    Todd Gureckis
  6. Just as an incremental follow-up, there has been an interesting discussion here and over email about these issues. I found particularly interesting the contrast between “field” studies (where dropout and missing data are the name of the game) and lab studies (where dropouts almost never happen). I strongly suspect that thinking more about this issue could be important for psychologists who are trained to think in terms of lab studies and who are now starting to encounter something more like “field” data, given the widespread use of online data collection.

    A summary of what David and I have learned so far comes down to this:

    Our idea of adaptively filling in conditions is probably not “ideal” and could in fact be quite dangerous. The reasons are laid out clearly in Tal’s and Eytan’s posts above (as well as in an email from Dean Eckles). While in my lab the imbalance due to dropout is practically very low even on Mechanical Turk (2-3 people per condition out of around 100; see our PLOS paper about this), even a momentary change in the probabilities of assigning people to conditions can be bad. In the worst case, this kind of adaptive algorithm used by an “inattentive” researcher could introduce a pretty serious design flaw. Thus, the approach Tal, Eytan, and Dean advocated, of pre-computing the required N (based, for example, on a power analysis) and then running until this minimum is met within each cell, seems preferable. In other words, the lesson is: explicitly report, and possibly model, the issue raised in point 5 of the original post, but definitely don’t adapt to it. As a consequence, this tends to deal with issue #2 as well, since assignment probabilities remain stationary over time. There are a couple of ways of dealing with that along the lines of what Jake suggested (and Eytan shared some code using PlanOut that helped with this as well; I will try to post a link).
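
    As a purely illustrative sketch of that recommendation (the numbers are made up, and run_subject is a hypothetical stand-in for actual data collection): keep the assignment scheme fixed and simply keep collecting until every cell reaches its pre-computed minimum N of completed subjects.

    import itertools
    import random

    shapes = ["triangle", "circle"]
    colors = ["blue", "red"]
    cells = list(itertools.product(shapes, colors))
    target_n = 50                      # pre-computed, e.g. from a power analysis

    def run_subject(cell):
        # Stand-in for real data collection: subjects "complete" with
        # probability 0.9 here, purely for illustration.
        return random.random() < 0.9

    # The assignment scheme never changes in response to who dropped out;
    # we just keep going until every cell has target_n completed subjects.
    completed = {cell: 0 for cell in cells}
    while min(completed.values()) < target_n:
        cell = random.choice(cells)
        if run_subject(cell):
            completed[cell] += 1
    print(completed)

    A block-randomized scheme like the one Jake described would work just as well here; the important part is that the assignment probabilities stay fixed rather than adapting to who drops out.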

    What remains open, though, is the issue of balancing between nested factors and of dealing with (what we are calling) “priorities” over factors. I suspect the language of priorities sounds strange for an experiment, but my idea is that there are some aspects of an experiment that we often randomize without counterbalancing (like the position of the category A/B buttons in a category learning experiment), typically by just using a separate random number generator. Ideally, though, this variable should be counterbalanced as well (not just randomly sampled), without letting it trump your main experiment design. One way to do this is to explicitly recognize and track the conditional dependencies between factors and do the assignments that way. I’m interested to hear whether my thinking on this is also flawed (perhaps the solution becomes more transparent once you set aside the issue of dealing with dropouts).
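
    To make the conditional-dependency idea concrete, here is one rough sketch (the names are invented for illustration; this is not any library’s actual API): balance the high-priority factor with a without-replacement sampler, and keep a separate sampler for the low-priority factor within each level of the high-priority one.

    import random

    class Box:
        # Sample levels without replacement, reshuffling and refilling
        # whenever the box runs empty (the box-of-tickets idea above).
        def __init__(self, levels):
            self.levels = list(levels)
            self.remaining = []

        def draw(self):
            if not self.remaining:
                self.remaining = self.levels[:]
                random.shuffle(self.remaining)
            return self.remaining.pop()

    # The high-priority factor (shape) gets one box; the low-priority
    # factor (button position) gets its own box *within* each shape, so it
    # fills in evenly inside each shape without ever trumping shape balance.
    shape_box = Box(["triangle", "circle"])
    button_given_shape = {
        "triangle": Box(["left", "right"]),
        "circle": Box(["left", "right"]),
    }

    def assign():
        shape = shape_box.draw()
        button = button_given_shape[shape].draw()
        return shape, button

    print([assign() for _ in range(12)])

    With many levels in the low-priority factor (like the 25 colors in the original post), the conditional boxes cycle slowly, so the overall marginal distribution of that factor is only approximately balanced; that is exactly the tradeoff described in point #4.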

    Re: Tal’s point about having multiple “rounds” of data collection:
    In general, we tend to do data collection online this way anyway, for technical reasons. For example, on AMT, if you post too many HIT assignments at once it can slow down the page-load speeds of your webserver/web app, so it is typically good to “throttle” subjects at around 5-10 at a time and then go get more.

    Todd Gureckis