On the identifiability of parameters in reinforcement learning models

In this post we describe some recent work testing whether the parameters of commonly studied RL algorithms (e.g., Q-learning) are identifiable given observed patterns of choice behavior. This is a non-exhaustive exploration of the issue, but it might be interesting to people who use these models in their work.


Reinforcement learning (RL) is a computational framework for how agents (artificial or otherwise) can learn to make decisions based on experience with the environment. It has become a dominant theory of learning in computational cognitive science and neuroscience and has been successful in many applications (e.g., robotics).

Figure 1. A complex and irregular surface. Here the height of the surface reflects the quality of a model fit as a function of two hypothetical model parameters. Clearly there are multiple solutions that "fit" equally well in the space, making it hard to identify certain parameter combinations. For many models used in psychology, the shape of this surface is unknown and difficult to estimate.

When using a reinforcement learning model to fit human or animal data, the experimenter typically tries to estimate the latent parameters of the model from the participant’s observed choice behavior. For example, a researcher might optimize the parameters of an RL model fit to human data using Maximum Likelihood Estimation to obtain an estimate of a participant’s learning rate. A key scientific question concerns how accurate these types of fits actually are. There are many ways in which model fits of this type might be led astray.

For example, in perhaps the worst case scenario participants are using a completely different learning strategy than the one posited by the model. In this case parameter estimates from the (incorrect) model may be a poor characterization of the participant’s actual behavior. Less obviously, even if the model is “correct” (i.e., participants really are following a learning strategy identical to the one suggested by the model) it may not be possible to identify the model parameters given a sample of a participant’s behavior. For example, the likelihood space of the model may not be “well behaved” (i.e., there could be multiple local or global minima which confuse the parameter optimization routine, see Figure 1).

The Question

This post asks how identifiable the parameters of common RL algorithms are under a particular set of conditions of interest to our lab. In particular, we wondered whether parameter estimates are stable and identifiable for RL algorithms applied to dynamic decision making tasks. These tasks involve environments where the reward given for choosing a certain option depends on the history of previous choices made by the agent. Because experienced outcomes are strongly dependent on the particular choices that the agent made, it seems plausible that this class of experiments might have a flat or multi-modal likelihood surface, leading to parameter fits that mischaracterize participants’ actual learning strategies.

One interesting way to evaluate this is to first generate simulated data from a model with a given set of parameters. This simulated data acts as a stand-in for empirical data from a human subject in an experiment. However, critically we know in this case what parameter combination and model generated the data. Next, we fit the simulated data using the standard approach in the field (e.g., maximum likelihood estimation).

In an ideal world, we would be able to faithfully recover the set of parameters we used to generate the simulated choice data. Alternatively, if our parameter estimates are way off the mark, we should be cautious about interpreting the fit of such models to human subjects. The reason is that with human subjects we are uncertain about both the model AND the parameters, whereas in our simulation study we know our model is correct and are just asking about the parameters.

If our fitting approach can’t even make sense of things in the best case, we should worry about the worse case (see Figure 2 for a summary)!

Figure 2. The top panel shows the standard empirical situation. A human with unknown parameters performs a task, generating observable data. The RL model is fit to the human data, providing estimates of the latent parameters. There is no direct way to tell if the parameters are accurate. The bottom panel shows the situation explored here. The RL model with known parameter settings is applied to the same experiment given to human participants. The model's choice data are re-fit with the same RL model. The hope is that the recovered parameter estimates closely match the ones used to generate the data in the first place.
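The bottom-panel procedure in Figure 2 can be sketched in a few lines. The example below is a minimal illustration in Python, not the code used for this post: it uses a toy two-option learner with a single learning rate `alpha`, a fixed softmax inverse temperature `beta`, made-up reward probabilities, and a grid search standing in for the optimizer. All of those names and values are assumptions chosen for brevity.

```python
import math
import random

def softmax_p0(q, beta):
    """Probability of choosing option 0 under a two-option softmax."""
    return 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))

def simulate(alpha, beta, n_trials, rng, p_reward=(0.8, 0.2)):
    """Generate (choice, reward) pairs from a toy learner with known parameters."""
    q, data = [0.0, 0.0], []
    for _ in range(n_trials):
        choice = 0 if rng.random() < softmax_p0(q, beta) else 1
        reward = 1.0 if rng.random() < p_reward[choice] else 0.0
        q[choice] += alpha * (reward - q[choice])  # delta-rule update
        data.append((choice, reward))
    return data

def log_likelihood(alpha, beta, data):
    """Log probability of the observed choices under candidate parameters."""
    q, total = [0.0, 0.0], 0.0
    for choice, reward in data:
        p0 = softmax_p0(q, beta)
        total += math.log(p0 if choice == 0 else 1.0 - p0)
        q[choice] += alpha * (reward - q[choice])
    return total

def fit_alpha(data, beta, grid):
    """Maximum likelihood estimate of alpha by exhaustive grid search."""
    return max(grid, key=lambda a: log_likelihood(a, beta, data))

rng = random.Random(0)
true_alpha, beta = 0.3, 3.0            # hypothetical generating parameters
grid = [i / 20 for i in range(1, 20)]  # candidate alphas: 0.05, 0.10, ..., 0.95
data = simulate(true_alpha, beta, 500, rng)
recovered = fit_alpha(data, beta, grid)
print(f"true alpha = {true_alpha}, recovered alpha = {recovered}")
```

If `recovered` lands close to `true_alpha` across many simulated data sets, the parameter is identifiable in this toy setting. Large or erratic gaps would be a warning sign before fitting real participants.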

RL Models of Dynamic Decision Making

The RL model that we considered is based on the one described in Gureckis & Love (2009), which itself was based on the famous Q-learning algorithm from computer science (Watkins, 1989). This model is thought to be a good description of how people perform dynamic decision making tasks and thus provides an interesting domain in which to explore this issue. There are three free parameters in the model: α, γ, and τ (see Gureckis & Love, 2009, for a full description).
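For readers unfamiliar with Q-learning, here is a minimal Python sketch of the textbook update and a softmax choice rule. It assumes the usual roles for the three parameters (α as learning rate, γ as discount factor, τ as softmax temperature) and a hypothetical state index running from 0 to 10. The exact model in Gureckis & Love (2009) may differ in its details, so treat this as an illustration rather than their implementation.

```python
import math

def softmax_probs(q_values, tau):
    """Choice probabilities proportional to exp(Q / tau); larger tau gives noisier choices."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q

# Hypothetical table: 11 states, 2 actions, all values starting at zero.
q = [[0.0, 0.0] for _ in range(11)]
q_update(q, state=0, action=0, reward=400.0, next_state=1, alpha=0.9, gamma=0.3)
probs = softmax_probs(q[0], tau=1500.0)
print(q[0], probs)
```

With τ on the same scale as the rewards, as in the example call, a difference of a few hundred units between two Q-values shifts the choice probabilities only moderately away from 50/50.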

In our simulation experiment, we applied the model to the same task used in Gureckis & Love (2009). The task, presented as a simple video game called “Farming on Mars”, involved subjects choosing between two Martian farming robots with the goal of generating as much oxygen as possible for a future population of human inhabitants. One robot (the Long-Term robot) returned 400 + 1000 * h/10 units of oxygen, where h was the number of times the Long-Term robot had been chosen in the last 10 trials. The other robot (the Short-Term robot) returned 900 + 1000 * h/10 units, with h again counting recent Long-Term choices. Thus, although the Short-Term robot gave the higher reward on any individual trial, every Long-Term choice raised the payoff of both robots on later trials, so the optimal strategy was to always select the Long-Term robot (1400 units per trial once h reaches 10, versus 900 per trial for always choosing Short-Term). However, subjects in the experiment were unaware that the rewards depended on their previous choices and had to discover this over time.
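The payoff structure is easy to check numerically. The Python sketch below encodes the two reward formulas and compares the average payoff of two fixed policies. It assumes h counts Long-Term choices over the 10 trials preceding the current one and starts at zero. The function names are ours.

```python
def oxygen_reward(choice, h):
    """Oxygen returned by a robot given h, the number of recent Long-Term choices."""
    base = 400 if choice == "long" else 900
    return base + 1000 * h / 10

def average_payoff(policy, n_trials=500, window=10):
    """Run a policy (a function from trial index to 'long' or 'short') and average its rewards."""
    history, total = [], 0.0
    for t in range(n_trials):
        h = sum(1 for c in history[-window:] if c == "long")
        choice = policy(t)
        total += oxygen_reward(choice, h)
        history.append(choice)
    return total / n_trials

always_long = average_payoff(lambda t: "long")
always_short = average_payoff(lambda t: "short")
print(always_long, always_short)
```

On any single trial the Short-Term robot pays 500 units more than the Long-Term robot at the same h, which is what makes the task hard: the better long-run policy looks worse locally.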

A Simulation Experiment

As mentioned above, Gureckis & Love’s (2009) data were best-fit by a Q-Learning model which incorporated information about the latent state transitions of the task. To test if these fits were identifiable given the patterns of choice behavior recorded in the task we used the approach outlined in Figure 2.

Starting with a wide range of parameter combinations (i.e., various settings of α, γ, and τ), we generated choice data for a 500-trial “experiment” from 100 simulated subjects per combination. In the “Farming on Mars” task, human subjects generally learned to maximize, i.e., to choose the Long-Term option most of the time. We were therefore only concerned with parameter combinations that might generate human-like, maximizing behavior. We defined a simulated subject as maximizing if it chose the Long-Term robot on more than 60% of the last 100 trials, and we classified a parameter combination as consistently maximizing if more than 60% of its 100 simulated subjects met that criterion. This is roughly equivalent to a less than 5% chance that the subjects were responding randomly. Across all 1000 combinations that we tested, only 12 consistently displayed maximizing behavior. The table below shows all of the maximizing combinations and the number of simulated subjects (out of 100) that showed maximizing behavior for each.
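A small Python sketch of this classification rule, together with a check of the "less than 5%" figure, is given below. Choices are coded 1 for Long-Term and 0 for Short-Term. Whether each comparison is strict or inclusive is an assumption here, as are the function names.

```python
from math import comb

def is_maximizer(choices, last_n=100, threshold=0.6):
    """True if more than `threshold` of the final `last_n` choices were Long-Term (coded 1)."""
    tail = choices[-last_n:]
    return sum(tail) / len(tail) > threshold

def combination_maximizes(subjects, threshold=0.6):
    """True if more than `threshold` of the simulated subjects are maximizers."""
    n_max = sum(is_maximizer(c) for c in subjects)
    return n_max / len(subjects) > threshold

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a coin-flipping subject makes more than 60 Long-Term choices in 100 trials.
p_random = binomial_upper_tail(61, 100)
print(p_random)
```

The tail probability comes out below 0.05, consistent with the statement that a purely random responder would rarely pass the per-subject criterion.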


Having found a set of human-like parameters, we were now ready to answer our original question. We used Maximum Likelihood Estimation to fit parameters to each of the 100 simulated subjects generated for each of the 12 parameter combinations. For each simulated subject, we ran the fit from 50 random starting locations in the search space and kept the one with the highest likelihood. Many of these fits resulted in fairly good estimates of the parameters. Finally, we took the median of the best fits over all 100 simulated subjects to get an idea of how close the typical parameter fit was (these medians are in the last three columns of the table below).

Row  Tau   Alpha  Gamma  Maximizing (of 100)  Median fit tau  Median fit alpha  Median fit gamma
1    1330  1      0.4    61                   1292.330961     1.002661302       0.401409669
2    1520  0.9    0.3    68                   1449.070059     0.916268368       0.305883338
3    1520  1      0.3    64                   1472.299003     1.0107299         0.306711376
4    1710  0.9    0.2    60                   1794.842892     0.918753426       0.207316601
5    1710  0.9    0.3    63                   1591.74387      0.914328119       0.305986909
6    1710  1      0.3    65                   1737.748628     1.00735407        0.30399517
7    1710  1      0.4    61                   1721.370366     1.01534019        0.402315334
8    1900  0.9    0.2    64                   1947.288166     0.91075461        0.20415441
9    1900  0.9    0.3    74                   1943.371389     0.913595281       0.300533485
10   1900  0.9    0.4    61                   2012.40356      0.910731432       0.403546881
11   1900  1      0.3    72                   1869.468093     1.00419566        0.303685677
12   1900  1      0.4    60                   1879.272986     1.008092814       0.401227046
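The multi-start fitting procedure (50 random starting points, keep the best) can be sketched generically in Python. The local optimizer here is a simple coordinate pattern search that we picked for self-containment. It is a stand-in, not the optimizer used for the fits above, and the `bumpy` objective is a made-up one-dimensional function with two local minima that plays the role of a negative log-likelihood.

```python
import random

def local_search(f, x0, bounds, step=0.25, tol=1e-6):
    """Coordinate pattern search: try +/- step along each axis, halve the step when stuck."""
    x, fx = list(x0), f(x0)
    while step > tol:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for delta in (step, -step):
                cand = list(x)
                # Step size is a fraction of the parameter's range; clamp to the bounds.
                cand[i] = min(hi, max(lo, x[i] + delta * (hi - lo)))
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step /= 2
    return x, fx

def multistart_minimize(f, bounds, n_starts=50, seed=0):
    """Run the local search from random starting points and keep the lowest minimum found."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        x, fx = local_search(f, x0, bounds)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

def bumpy(x):
    """Toy objective with a shallow minimum near x = 1 and a deeper one near x = -1."""
    return (x[0] ** 2 - 1) ** 2 + 0.3 * x[0]

best_x, best_f = multistart_minimize(bumpy, bounds=[(-2.0, 2.0)])
print(best_x, best_f)
```

A single run started in the wrong basin would report the shallower minimum. Restarting from many random points and keeping the best result makes that outcome much less likely, though it does not rule it out on a rougher surface.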

Of course, the recovered parameters do not match the input parameters exactly, but we wondered if the differences were important enough to raise concern. For example, do the recovered parameters make entirely different predictions about how participants might behave in the task? To assess this, we generated new simulated choice data with the recovered parameters and compared the overall performance of these simulations to the performance of the original 12 parameter combinations. You can think of this exercise as a bit like the game “telephone”: we create a model with a known setting of parameters, generate a (noisy) sample of behavior from this model, fit these data with the same model, then generate a new (noisy) sample of behavior and see how the two compare. The final fits are thus the noisy signal that remains after the standard steps one would encounter in an empirical study.

Figure 3 below plots the percentage of maximizing choices on each trial averaged across the 100 simulated subjects for both the original parameters and the “recovered” parameters. We binned the trials into successive blocks of 50 and then plotted the mean of each bin. Qualitatively, most parameter estimates line up quite well with the originally generated data.
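The averaging and binning step is simple enough to show directly. This Python sketch assumes each simulated subject is a list of choices coded 1 for Long-Term and 0 for Short-Term, with all subjects having the same number of trials. The function names are ours.

```python
def block_means(values, block_size=50):
    """Mean of each successive block of `block_size` values."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [sum(b) / len(b) for b in blocks]

def maximizing_curve(subjects, block_size=50):
    """Proportion of Long-Term choices per block, averaged across subjects."""
    n_trials = len(subjects[0])
    per_trial = [sum(s[t] for s in subjects) / len(subjects) for t in range(n_trials)]
    return block_means(per_trial, block_size)

# Two toy subjects over 100 trials: one always Long-Term, one always Short-Term.
curve = maximizing_curve([[1] * 100, [0] * 100])
print(curve)
```

For a 500-trial simulation this yields 10 points per curve, one per block of 50 trials.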

As you can see, in most cases the behavior produced by the recovered parameters is very similar to the behavior produced by the parameters used to generate the data. In a few cases there are minor discrepancies, which most likely have to do with the parameter optimization procedure getting stuck in a local minimum.

Figure 3. A comparison of behavior in the task for both the original simulations using the 12 maximizing parameters (in blue) and the maximum likelihood fits of those simulations (in green). The numbers for the plots correspond to the rows in the table.


Overall, our exploration of the identifiability of the Q-learning model was reassuring. For parameter combinations that lead to long-term reward maximizing behavior, the fits to these simulations were faithful to the values we seeded our simulations with. This suggests that, were the Q-learning model exactly the correct model of human learning and decision making in this task, we would be safe in making inferences about the values of these parameters. This simulation also gives a bit of further support to the approach used in Gureckis & Love (2009).

More generally, this is an important and informative exercise in estimating the value of a model fit. To facilitate replication, we have shared the Julia code that we used in this simulation experiment below.


Gureckis, T.M. and Love, B.C. (2009) Short Term Gains, Long Term Pains: How Cues About State Aid Learning in Dynamic Environments. Cognition, 113, 293-313.

Watkins, C. (1989). Learning from delayed rewards. Unpublished doctoral dissertation. Cambridge, England: Cambridge University.


Here is the Julia code for simulating and fitting the data:

  1. This is very cool and somewhat surprising. I guess the problem may arise when the true model and parameters are both unknown. A model can be yoked (i.e., trial-by-trial fit) to a participant’s past responses and provide a good fit, even when the model alone could never accomplish the task.

    Brad Love
  2. Great post! But I’m curious as to what happens when you add noise–whether random or due to some other model. In my admittedly limited experience working with this type of paradigm, it seems very unlikely that what’s going on in subjects’ heads is well-described by a single model all of the time. Even if we allow that Gureckis & Love (2009) provides the single best description of what’s going on, it seems unthinkable that most subjects wouldn’t be experimenting with other strategies/learning approaches at least some of the time–particularly early on. I’m curious if you still recapture the parameters reasonably accurately if you have a second-order parameter that, say, chooses the Gureckis & Love (2009) model 80% of the time, responds randomly 10% of the time, and uses some other simple strategy (e.g., win-stay/lose-switch) 10% of the time. The point being that while the ability to recapture the parameters will obviously break down at some point, it would be very informative to know where that point is.

    The other question is how much data you need for the simulation results to stabilize. You use 500 trials here, but that’s considerably more than many people who fit RL models have available. What happens if you only have, say, 100 trials?

    Tal Yarkoni
  3. […] or as a general “sanity-check” for any model (the model, after being fit, should generate the same pattern of behavior). Obviously, it was still up to us to determine which types of noise to consider (we could have […]

    thinking about thinking » When does a rational model “fit”?
  4. I would like to try out the code, but it seems that the > (greater than) and < (less than) signs have all been manhandled along the way. I have tried to correct it but there are some areas that still seem to be missing something.

    For example line 63:63 read:
    if qSA[sP + 1, 1] maxA = 2

    I took this to mean if (qSA[sP + 1, 1]) resolves to true maxA = 2. But qSa is a matrix of Float64 and, unlike integers, cannot be treated as a boolean. So I am not clear as to what the intent was.

    Seems that a GitHub gist might have been a better place to post this code.

    Any chance of getting the correct version?

  5. Hey Andre, good call, the line should be:
    if qSA[sP + 1, 1] <= qSA[sP + 1, 2]
    Not sure how that got screwed up, but there may be other typos in there, so here's the GitHub gist: https://gist.github.com/dhalpern/c241f89a68bc799a71bb

    David Halpern