When does a rational model “fit”?

Rational models of cognition are an important theoretical tool for describing and explaining human learning, judgment, and inference. In addition to providing a computational-level explanation of how a given task should be solved by an adaptive agent, they are also frequently used to argue that people’s behavior is, in fact, rational (i.e., conforming to some optimality constraint).

However, an important question concerns how one goes from a rational model to the inference that people’s actual behavior is rational. In other words, when does one conclude that a rational model provides a good/great/excellent fit to behavior? In answer to this question, one first has to consider what it means for a rational model to fit, and therefore provide a good account of some behavioral data.

Consider the following statement:

Human behavior appears rational when the predictions derived from a rational model _________.

With these possible endings:

  1. are identical to people’s behavior
  2. are strongly correlated with people’s behavior
  3. get most aspects of people’s behavior qualitatively right using few parameters, even if some predictions are wrong 
  4. are consistent with people’s behavior assuming particular forms of measurement noise or error

Which do you think provides a sufficient basis upon which to conclude people are rational? In the psychology literature, all of those criteria have been used to some degree to endorse or reject a rational model of some behavior. This post explores the theoretical consequences of these possible answers.*

Why this isn’t so obvious…

It might seem that this question would be answered the same way for any type of modeling approach (for example via model comparison – with some penalty for model complexity), but in our discussions we are not sure it is so clear.

Rational models occupy a unique space in the taxonomy of cognitive modeling approaches. In particular, they typically do not seek to provide an explanation of how a person actually performs a task, but a description of what optimal behavior would be in any particular environment. As a consequence, they are often presented on their own without comparison to alternative models. Sometimes, this is the case because no alternative psychological or rational model has yet been proposed to explain behavior in a given domain. However, even if alternative models exist, the bar for a rational model to be deemed successful might be lower because it is not considered a description of the actual process that underlies people’s behavior. Instead, a model might be endorsed so long as it predicts the major trends in the behavioral data even if it misses a few of them.

For example, unimportant sources of noise (measurement error in behavioral data, noise in how people practically compute a rational solution, etc…) might all contribute to a mismatch between behavior and a model which might not actually impact the overall validity of its central “rationality” claims.
Alternatively, rational analysis does not exclude the possibility of a better fit via more mechanistic constraints, but such constraints might not entirely undermine claims about generally adaptive/rational behavior. We feel this perspective at least gives credence to some alternative ways of judging such models (e.g., #2, #3, and #4 mentioned above).

Beyond what these different fitting procedures mean, the different ways of ending the sentence raise another important question: Does it actually matter? Will the conclusions you draw about the “rationality” of your participants depend on which model evaluation measure you use?

Let’s look at an example – Intervention-based causal learning

Recently we have been interested in a rational model of how people choose to “tinker” with causal systems to discover how they work. In the cognitive literature, the decision to tinker is called a causal intervention. A rational model that explains what interventions people should choose to optimally discriminate between alternative hypotheses about how a system works is the Information Gain model (IG). According to this model, an optimal actor should choose an intervention, a, which maximally reduces the expected uncertainty, H(G), about which of a set of hypotheses or graphs, G, is correct (summing over possible intervention outcomes Y):

(1) $\mathrm{EIG}(a) \;=\; H(G) \;-\; \sum_{y \in Y} P(y \mid a)\, H(G \mid a, y), \qquad a^{*} = \underset{a}{\arg\max}\; \mathrm{EIG}(a)$

where $H(G) = -\sum_{g \in G} P(g)\log P(g)$ is the entropy of the learner’s current beliefs over the candidate graphs and $H(G \mid a, y)$ is the entropy of the posterior after intervening on $a$ and observing outcome $y$.

This model has been proposed as an optimal strategy for intervention selection in the machine learning literature (Murphy, 2001). It was also suggested as a rational account of people’s intervention behavior by Steyvers and colleagues (2003), who conclude that “people’s intervention choices may be explained as rational tests given their own subjective beliefs about causal structure” (p. 481).
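To make the computation concrete, here is a minimal sketch of how expected information gain could be computed for a single candidate intervention. The function names, array shapes, and the toy inputs are ours for illustration only; they are not taken from the model code used in the experiments.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_information_gain(prior, likelihoods):
    """
    prior       : P(g), shape (n_graphs,)
    likelihoods : P(y | a, g), shape (n_graphs, n_outcomes), for one intervention a
    returns     : EIG(a) = H(G) - sum_y P(y | a) H(G | a, y)
    """
    prior = np.asarray(prior, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    # Marginal probability of each outcome: P(y | a) = sum_g P(g) P(y | a, g)
    p_y = prior @ likelihoods
    eig = entropy(prior)
    for y, py in enumerate(p_y):
        if py == 0:
            continue
        # Posterior over graphs after observing outcome y under intervention a
        posterior = prior * likelihoods[:, y] / py
        eig -= py * entropy(posterior)
    return eig

# Toy example: two equally likely hypotheses, one intervention with two possible outcomes
prior = [0.5, 0.5]
likelihoods = [[1.0, 0.0],   # under graph 1, outcome 1 is certain
               [0.2, 0.8]]   # under graph 2, outcome 2 is likely
print(expected_information_gain(prior, likelihoods))  # ~0.61 bits
```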

In ongoing work in the lab, we gave human participants a causal network with three variables, or nodes, and their task was to find out how these nodes were connected to each other. Figure 1 describes the procedure of the task. For each network, we provided two exhaustive possibilities, or hypotheses, about how the network (displayed as a “chip” on the screen) was wired, shown as diagrams. To find out which hypothesis was correct, participants could click on one of the nodes to turn it on. They could then observe how that intervention affected, or did not affect, the state of the other variables. Overall, participants tested 27 different causal networks, with a different pair of hypotheses for each one. For the sake of simplicity, we analyze here only the first intervention choice people made, although in our experiment they were free to make more than one.

Figure 1: Structure of a trial. Given two hypotheses (shown at the top of the screen), participants could make an intervention by clicking on one of three components, or nodes, on the chip, thereby turning it on (green). After a short delay, the other components updated according to the true underlying graph, i.e., they could also turn on or remain off. Participants could make as many interventions as they liked (with a penalty for each) until they were ready to make a forced choice between the two hypotheses.

For the present discussion the precise details of the task itself are not all that important (we are happy to share more information, but see our upcoming papers). What matters here is that people chose to intervene on one of three nodes (labeled arbitrarily in our data figures as N1, N2, N3) on each trial of the task, for each of the 27 problem types. The key question is whether those choices can reasonably be interpreted as “consistent” with the predictions of the rational model (i.e., the IG model).

For each of the 27 problem types we can compute the expected information gain of every intervention as predicted by the IG model. It is then natural to compare these model scores to the frequencies with which participants chose each node in the experiment. Figure 2 shows the model predictions and choice data for every problem. Take a close look at the model and human data, then continue reading for different ways of interpreting the figure.

Figure 2: Predictions of the IG model and human choice frequencies for three possible nodes (N1, N2, N3), separately by problem type.

Option 1: Are people identical to the rational model?

The first thing to note about Figure 2 is that the optimal choice is to always select the intervention with the highest IG score, ignoring all other choices (thus, visually, you should imagine each plot having a single bar with non-zero height). However, people clearly deviate from that standard and often spread their choices across all the nodes with varying frequency. This is clearly a sign of sub-optimal performance given the goal of the task, and one that has been noted before (e.g., Steyvers et al., 2003). At first blush this would appear to rule out option #1 from above. People are unlikely, in any situation, to behave exactly like the type of rational model commonly employed in cognitive science, either individually or in aggregate. This could be for important or unimportant reasons; we can’t always be sure.
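As a rough illustration of how strict this criterion is, one could simply ask how often the modal human choice coincides with the max-EIG intervention, and what fraction of all individual choices lands on that intervention. The sketch below uses random placeholder arrays in place of our actual model scores and choice counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs: one row per problem type, one column per node (N1, N2, N3)
eig_scores = rng.random((27, 3))                    # stand-in for the model's EIG values
choice_counts = rng.integers(0, 20, size=(27, 3))   # stand-in for human choice counts

model_best = eig_scores.argmax(axis=1)      # IG-maximizing node for each problem
human_modal = choice_counts.argmax(axis=1)  # most frequently chosen node for each problem

match_rate = np.mean(model_best == human_modal)
prop_on_max = choice_counts[np.arange(27), model_best].sum() / choice_counts.sum()

print(f"Modal choice matches the argmax-EIG node on {match_rate:.0%} of problems")
print(f"{prop_on_max:.0%} of all individual choices fall on the max-EIG node")
```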

Option 2: Is the rational model strongly correlated with people’s decisions?

While choosing the intervention with the highest information gain is the optimal policy, judging the model only by whether people select the maximum throws away information about the relative value of the different choices as scored by the model. In other words, even though the “max” is the best/optimal choice, for two actions a and b, if EIG(a) > EIG(b), then a is still generally the more informative/better choice. Thus, a second natural way to compare the model to human data is simply to compute the correlation between the data and the EIG scores for all actions given by the model, either overall or by problem type. This is particularly appealing since EIG is measured in “bits” while participants make discrete choices, so the two are not on a common scale to begin with.

In our case, the overall correlation between expected IG and the choice frequencies for each node is 0.81 across all problems. For a zero-parameter model this is actually quite impressive. 
Basically, with zero parameters the model explains about 65% of the variance in the group-level intervention choices. We can also break the correlation down for each of the 27 problems (except in problems where IG makes the same prediction for all interventions, in which case a correlation cannot be calculated). Figure 2 shows that in a majority of the problems IG has a high (>0.9) correlation with the data, although in two problems model and data correlate negatively.
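A minimal sketch of this correlation analysis, assuming the model scores and choice frequencies are stored as 27 x 3 arrays (the random placeholder arrays below stand in for the real values):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Placeholder arrays, shape (27 problems, 3 nodes): model EIG scores and human choice frequencies
eig_scores = rng.random((27, 3))
choice_freq = rng.dirichlet(np.ones(3), size=27)

# Overall correlation across all problem-by-node cells
r_overall, _ = pearsonr(eig_scores.ravel(), choice_freq.ravel())
print(f"overall r = {r_overall:.2f}, variance explained = {r_overall**2:.0%}")

# Per-problem correlations (undefined when the model is indifferent between all three nodes)
per_problem = []
for j in range(27):
    if np.ptp(eig_scores[j]) == 0:
        continue  # all EIG values equal: correlation cannot be computed
    r_j, _ = pearsonr(eig_scores[j], choice_freq[j])
    per_problem.append(r_j)
print(f"{np.mean(np.array(per_problem) > 0.9):.0%} of problems have r > 0.9")
```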

Whether we accept that this analysis shows people behave rationally according to option #2 depends on our criterion for what counts as a sufficient correlation. If we believe that having a high correlation with the data on most problems is sufficient, for example, we might endorse the model as a good description of people’s behavior, especially in light of the fact that zero parameters were fit to the data.

On the other hand, using correlation, the mismatch between the predictions of the model and the data is attributed to a nebulous term, “unexplained variance,” which could come from any number of sources (measurement error in a limited sample, neural noise, momentary lapses in attention, subtle individual differences in memory capacity, theoretically important differences in how people approach the different problem types, etc.). Crucially, this approach doesn’t let one distinguish well between important and unimportant sources of unexplained variance.

Option 3: Does the model get most aspects of people’s behavior qualitatively right, even if a few are wrong?

In general, the answer to this question is “yes”. If you visually scan across the shaded columns in Figure 2, you can see that in cases where the model predicts that two choices are basically equally “good”, people are often split between those choices (e.g., problem 1). Similarly, in problem 8 the model predicts one choice is particularly “good” and people also prefer it. In some cases things are less clear (e.g., in problem 27 the model is indifferent between the three choices, but people prefer nodes 1 and 2). However, in a broad sense the model captures a majority of the qualitative trends (while missing just a few) with no parameters. It is up to the modeler to decide whether the cases where the model and data mismatch are important enough to dispute the claim that people appear “rational.”

Option 4: Is the model consistent with people’s behavior assuming particular forms of measurement noise or error?

The fourth option we considered is to actually quantify particular types of noise. The goal of this analysis is to move beyond the notion of “unexplained variance” to propose, along with the rational model, a viable model for how uncertainty and noise might impact the match between a model and data.

This analysis takes a few simple steps. We first made some plausible assumptions about the type and magnitude of noise that could affect our data. Then we checked whether a distribution of predictions from a noisy version of the model overlaps with the data, while also taking into account the variability or uncertainty associated with the empirical data itself. Such an approach has been suggested, for example, by Roberts and Pashler (2000, see in particular their Figures 1 and 2), who argue that model assessment should take into account the variability in the data as well as the range of predictions the model makes.

We thought it plausible to look at two types of noise that might play a role in our task:
First, it is possible that participants are not choosing interventions exactly in proportion to their expected IG. Instead, they could have a stronger or weaker tendency to choose high-IG options over their alternatives. To account for this between-subject variability, we can fit a softmax choice rule to each participant’s data, with a parameter τ that governs the degree to which choices are probabilistic rather than strictly IG-maximizing (this approach was also taken by Steyvers and colleagues, 2003). This type of noise is assumed to be part of the participant’s decision-making process.
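One way such a softmax rule might look, under a temperature parameterization (an inverse-temperature version, p(a) proportional to exp(β·EIG(a)), would work just as well; the exact parameterization used in our fits may differ from this sketch):

```python
import numpy as np

def softmax_choice_probs(eig, tau):
    """Probability of choosing each intervention given its EIG and temperature tau.
    Small tau -> nearly deterministic maximization; large tau -> nearly random choice."""
    eig = np.asarray(eig, dtype=float)
    z = (eig - eig.max()) / tau   # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

print(softmax_choice_probs([0.6, 0.3, 0.0], tau=0.2))  # strongly favors the max-EIG node
print(softmax_choice_probs([0.6, 0.3, 0.0], tau=5.0))  # close to uniform over the three nodes
```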

A second source of noise stems from the probabilistic nature of the task itself: participants could only choose one out of the three possible interventions, even though they presumably hold a graded preference over all three. Even if we knew the correct softmax parameter underlying each person’s decisions, we would still not expect perfect agreement between the data and the model, because we only observe a limited set of choices from this decision process. This type of noise or uncertainty stems from measurement error (and is not a feature of the human subject per se).
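As a quick illustration of this second kind of noise, the sketch below simulates what a group of agents with fixed, known choice probabilities would actually produce in a finite sample (the group size and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" choice probabilities for one problem and a made-up group size
true_probs = np.array([0.6, 0.3, 0.1])
n_participants = 30

# Even with the decision process fixed, observed frequencies fluctuate from sample to
# sample: this is measurement noise, not decision noise
for _ in range(3):
    counts = rng.multinomial(n_participants, true_probs)
    print(counts / n_participants)
```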

Figure 3: Hierarchical Bayesian model of the task. Each trial, j, corresponds to one of 27 problem types on which each participant chose one intervention, y. IGj is a three-vector with the expected information gain of each node on problem j.

Figure 4: Top: The corners of each simplex represent each of the three possible intervention choices on any trial. Points within the simplex correspond to the probabilities, P(a), of intervening on each of the three nodes in a causal graph. Any possible decision preference across the three nodes can be summarized as a unique point, e.g., the middle of the simplex is the point of indifference between the three choice options (example at the top). Bottom: The left panel shows an example of bootstrapped empirical data. The white dot in the center of the cloud represents the actual choice proportions made by participants in the experiment. The blue cloud of points consists of bootstrapped samples showing the uncertainty in this empirical estimate. The red cloud of points in the right panel represents a range of model predictions (explained in the text).

Given these assumptions, the model now makes a distribution of predictions about behavior in the task. To map out this distribution, we used a method based on posterior predictive simulation (see, for example, Gelman et al., 2000). We first fit a full Bayesian hierarchical model to the data (see Figure 3) that incorporates the probabilistic nature of the decision process as well as individual differences in τ. Note that this hierarchical model is not meant as a cognitive model and doesn’t change the fact that we are ultimately testing the suitability of the IG model. Indeed, the main input to the model is the set of parameter-free estimates of choice utilities provided by IG. Instead, this approach makes explicit what we mean when we say that people might be noisy, by proposing a particular form for that noise and attempting to estimate the parameters of this noise at the population level from our data. We then drew samples from the posterior distribution of this hierarchical model and repeatedly simulated the intervention choices of groups of agents of the same size as our experimental sample. These posterior samples show the range of empirical choice frequencies we could realistically have observed if people were indeed using a type of IG strategy to guide their choices.
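A simplified numpy sketch of the simulation step, assuming posterior samples of each agent’s τ have already been drawn from the hierarchical model (the sampler itself is omitted, and all array names and sizes here are placeholders rather than our actual fitted values):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(eig, tau):
    """Softmax choice probabilities over the three nodes given EIG values and temperature tau."""
    z = (np.asarray(eig, dtype=float) - np.max(eig)) / tau
    w = np.exp(z)
    return w / w.sum()

def simulate_group(eig_by_problem, taus):
    """Simulate one synthetic experiment: each agent (one tau each) makes a single
    softmax choice on every problem. Returns choice frequencies, shape (n_problems, 3)."""
    counts = np.zeros((len(eig_by_problem), 3))
    for tau in taus:
        for j, eig in enumerate(eig_by_problem):
            counts[j, rng.choice(3, p=softmax(eig, tau))] += 1
    return counts / len(taus)

# Placeholder inputs: EIG values per problem and posterior draws of per-agent taus
eig_by_problem = rng.random((27, 3))
posterior_tau_samples = np.abs(rng.normal(0.5, 0.2, size=(100, 30)))  # 100 draws x 30 agents

# One simulated dataset per posterior draw -> the red cloud of predicted choice frequencies
predicted_freqs = np.array([simulate_group(eig_by_problem, taus)
                            for taus in posterior_tau_samples])
print(predicted_freqs.shape)  # (100, 27, 3)
```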

Next, we wanted to also get an idea of the variability or uncertainty associated with the data. Since this is multinomial choice data, we used a bootstrapping method and repeatedly re-sampled from the choice data of each problem type, which gives us an estimate of the variability around the observed choices (similar to error bars for continuous data). In other words, it gives us the range of empirical outcomes we might have observed under different experiments.
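A sketch of that bootstrap for a single problem type (the observed choice vector below is made up for illustration, not our actual data):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_choice_freqs(choices, n_boot=1000):
    """choices: 1-D array of observed choices (0, 1, or 2) for one problem type.
    Returns bootstrapped choice frequencies, shape (n_boot, 3)."""
    choices = np.asarray(choices)
    n = len(choices)
    freqs = np.empty((n_boot, 3))
    for b in range(n_boot):
        resample = rng.choice(choices, size=n, replace=True)
        freqs[b] = np.bincount(resample, minlength=3) / n
    return freqs

# Made-up data for one problem: first interventions from 30 participants
observed = np.array([0] * 18 + [1] * 9 + [2] * 3)
boot = bootstrap_choice_freqs(observed)
print(boot.mean(axis=0))  # centered near the observed proportions [0.6, 0.3, 0.1]
```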

In Figure 5, we plot those bootstrapped data samples (blue) and IG posterior predictions (red) for each of the 27 problem types using a simplex plotting method (see Figure 4 for a full explanation). For each problem, every corner of a simplex corresponds to one of the three interventions that participants could make. Points within the triangle represent specific allocations of intervention choices at the population level.
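For readers unfamiliar with this kind of plot, the mapping from a choice-probability vector to a point in the triangle is just a barycentric coordinate transform; a minimal sketch (our own helper, not the plotting code used for the figures):

```python
import numpy as np

def simplex_xy(p):
    """Map a probability vector (p1, p2, p3) over the three nodes to 2-D coordinates inside
    an equilateral triangle with corners N1 = (0, 0), N2 = (1, 0), N3 = (0.5, sqrt(3)/2)."""
    p1, p2, p3 = np.asarray(p, dtype=float) / np.sum(p)
    x = p2 + 0.5 * p3
    y = (np.sqrt(3) / 2) * p3
    return x, y

print(simplex_xy([1, 1, 1]))  # center of the triangle: indifference between the three nodes
print(simplex_xy([1, 0, 0]))  # corner N1: everyone chooses node 1
```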

The simplex plots allow us to perform a type of visual model assessment, where the overlap of the two clouds of samples indicates a plausible fit of the model to the data. When the two clouds are separated, we can conclude that it is very unlikely we would have observed the data if the IG model generated people’s choices, even given our assumptions about measurement noise. As shown in Figure 5, there are indeed some problems where the overlap between the two is minimal or non-existent (e.g., problems 4, 12, or 21), which means that it is very unlikely that the IG model could have produced the behavior we observed (again, given our explicit assumptions about the noise and variability in the sample).

Figure 5: Intervention choices and predictions of the IG model, by problem type. The corners of each triangle correspond to the nodes in the causal graph that participants could intervene on. White dots indicate the actual choice frequencies. Bootstrapped samples of these choices are shown in blue. Samples from the IG model's posterior predictive distribution are shown in red.

Thus, based on this more detailed analysis, we answer this fourth question with “no”. Even after accounting for unsystematic decision noise and the uncertainty in our measurement, the model fails to plausibly capture people’s choice patterns in certain problems.

Comparison to Correlation

Interestingly, our case shows that this posterior prediction method and the correlation approach do not always lead to the same assessment of the goodness of fit in individual problems. Consider problem 11, for example, where the data and model have a strong negative correlation of r = -.94. The posterior prediction plot, however, shows that this pattern of choices is not unlikely to have arisen from a population of IG users once we account for decision noise. On the other hand, problem 14, which has a high correlation between data and model (r = 0.94), shows little overlap between model and data using the posterior prediction method in Figure 5 (the clouds of samples lie close together but barely overlap).

This shows that the posterior prediction method does not necessarily imply a stricter criterion for model fit. Instead, by assuming a plausible noise model it can help expose both cases where high correlation leads to overly optimistic conclusions and those where low correlation is falsely taken to mean that a model has failed.

What have we learned?

Different criteria for accepting or rejecting a rational model (here: the IG model) gave us different answers about whether people behave consistently with its predictions. However, comparing across these methods, we feel that the posterior prediction method, as a way of fleshing out criterion #4, was particularly useful for several reasons:

  1. It allowed us to tell whether a mismatch between the model and the data could be due to unsystematic noise, which provides a more tangible reason to accept or reject the model.
  2. Instead of chalking up the residual between the model fit and the human data to “unexplained variance”, specific proposals were made about the types of noise affecting both the participants and the empirical data.
  3. This approach takes both the data and the model seriously, by first fitting the model (including a noise model) to the data and then simulating from the model to find out the range of its predicted behaviors.
  4. It also let us look at each of the 27 problem types individually, which is useful to get a better idea about the types of behaviors that are difficult for the model to explain. In other words, we can ask if the model does poorly on particular problems more so than others.

Of course, the approach in option 4 isn’t specific to rational modeling and in fact can be applied to any type of model. In addition, it isn’t limited to the types of trinary choice data that we modeled here (e.g., in more standard data types the uncertainty in human and model data can be plotted simply with standard error bars).

Overall, we believe this approach is particularly helpful in cases where a single model is tested or evaluated on its own (as is often the case in rational analyses), but it could also be used as the basis for model comparison, or as a general “sanity check” for any model (the model, after being fit, should generate the same pattern of behavior it was fit to). Obviously, it was still up to us to determine which types of noise to consider (we could have taken into account many other sources of noise, such as perceptual noise stemming from the arrangement of nodes on the screen), but at least this method makes these noise assumptions explicit so they can be up for scientific discussion along with the rational model itself.

Most importantly, it is worth noting that if we had used a different criterion for assessing the model (e.g. #3), we may have been content with it fitting a majority of our data points reasonably well, and concluded that people are roughly consistent with the rational account. We think that would have caused us to miss out on some interesting behavioral patterns which suggest limitations of this particular theoretical account of our data.

Anyway, let us know what you think in the comments!

Footnotes

* Unlike some recent articles (Jones & Love, 2011; Marcus & Davis, 2013), the purpose of this post is not to debate the value of rational models as psychological theories; in fact, it takes as given that such models can be useful. Instead, we have just been discussing in lab meeting the best way to actually fit and evaluate a rational model, and we thought it would be fun to open the discussion.

References

Gelman, A., Goegebeur, Y., Tuerlinckx, F., & Van Mechelen, I. (2000). Diagnostic checks for discrete data regression models using posterior predictive simulations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 49(2), 247-268.

Jones, M., & Love, B. C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34(4), 169-188.

Marcus, G. F., & Davis, E. (2013). How robust are probabilistic models of higher-level cognition? Psychological Science, 0956797613495418.

Murphy, K. P. (2001). Active learning of causal Bayes net structure. Technical report, Department of Computer Science, U.C. Berkeley.

Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358.

Steyvers, M., Tenenbaum, J. B., Wagenmakers, E. J., & Blum, B. (2003). Inferring causal networks from observations and interventions. Cognitive science, 27(3), 453-489.







  1. Really interesting stuff. I agree that exact rationality/optimality in a strict sense is extremely rare – it would require zero decision noise and (in perceptual tasks) sensory noise equal to what is provided by the retinal input. Often, what is most interesting is to characterize deviations from optimality in a model-based manner. The one criterion that we typically use to declare optimality is that no plausible suboptimal model fits better (according to AIC, AICc, BIC, DIC, BMC, or ideally all these criteria). In practice, that means “no suboptimal model that we can think of” and it defines the research agenda as trying really hard to come up with alternative models, both before and after seeing the data. Optimality claims are too often made without testing many (or any) alternative models. Of course, “plausibility” is subjective so you could end up in a situation where a suboptimal model that some people find plausible but others not fits better than the optimal model – in which case I would still be careful with declaring optimality.

    Weiji
  2. Posterior predictive checks can and should be Bayesian:
    http://www.indiana.edu/~kruschke/articles/Kruschke2013BJMSP.pdf

    John K. Kruschke