Is Mechanical Turk the future of cognitive science research?

Is Internet-based data collection the future of cognitive science research? We used Amazon’s Mechanical Turk (AMT) to replicate a classic result in cognitive psychology which has primarily been established under traditional laboratory conditions (Shepard, Hovland, and Jenkins, 1961). In this post, we describe the various lengths we went to in order to get useful data from AMT and what we learned in the process. Overall, our results highlight the potential for using AMT in experimental research, but also raise a number of concerns and challenges. We invite comments, discussion, and shared experience below!

Note: Aspects of this post are now published in
Crump M.J.C., McDonnell, J.V., and Gureckis, T.M. (2013) Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE 8(3): e57410.


The Challenge of Collecting Behavioral Data


One challenging aspect of experimental psychology research is the constant struggle for data. Typically, we depend on university undergraduates who participate in studies in exchange for experience, course credit, or money. Research progress depends on the ebb and flow of the semester. As a result, it can take weeks, months, or even years to conduct a large behavioral study. This issue is even more salient for researchers at smaller universities.

One appealing solution is to collect behavioral data over the Internet. In theory, online experimentation allows researchers to collect data internationally, provides access to a very large number of people who may be interested in participating in research studies, and can be largely automated. However, the main obstacle to conducting Internet-based research is finding people who are willing to participate and compensating them. Sure, you can post a link on your webpage, but few people are likely to find it.

Recently, our lab (like a number of others) has explored the possibility of using Amazon’s Mechanical Turk (AMT) for behavioral data collection. AMT is a crowdsourcing or “artificial-artificial-intelligence” platform in which people submit simple (and usually brief) computer-based jobs to human workers online (see Box 2).

While Internet-based methods for collecting data have been around for a while, AMT is a potentially useful system for researchers since it handles both recruitment and payment in a fairly automatic way. A large number of people use AMT, making it a great way to advertise and distribute studies (the NYTimes reported over 100,000 active users in 2007).

Recently, there have been a number of excellent summaries and workshops about using AMT for research. Most notably, Winter Mason of the Stevens Institute of Technology has a Behavior Research Methods paper [pdf], which summarizes how to use the system and what it does (see also this excellent blog post and the references below).

Box 1. Useful Amazon Mechanical Turk Links


  • Artificial Intelligence, With Help From the Humans – NYTimes
  • “I make $1.45 a week and I love it” – Salon.com
  • Guide to Experiments on Amazon’s Mechanical Turk – Winter Mason
  • Demographics of Mechanical Turk – from Feb 2010
  • A blog about using AMT for behavioral research – Experimental Turk

Rather than focus on how to use AMT, this post focuses on the reliability of the data for experimental research in cognitive psychology. Before using these data in our experiments, we wanted some sense of their quality compared to data from laboratory studies. Along those lines, we had a couple of questions going into this project:

  • First, could we replicate classic findings in cognitive psychology concerning learning and memory processes with fidelity similar to that of data collected in the lab?
  • Second, what unexpected issues come up in conducting learning, memory, and reasoning studies online?
  • Third, how reliable are AMT data as a function of the payment amount?

In this sense, our analysis is similar to a couple of recent papers (e.g., Paolacci, Chandler, & Ipeirotis, 2010 and Buhrmester, Kwang, and Gosling, 2011). However, these previous reports focus on survey data, the test-retest reliability of AMT, or simple one-shot decision-making tasks. In contrast, we were interested in using AMT to replicate and extend classic findings in experimental cognitive psychology that were originally established in the lab. Our emphasis was not just on qualitative replication, but on obtaining accurate data in a study that has been frequently replicated in the lab. In addition, we wanted to explore whether it is possible to conduct publication-quality Internet experiments that unfold over many trials, require learning and sustained attention, and may take 20-60 minutes to complete.

Our initial experiments provide answers to many of these questions, and we felt our results might interest people thinking of using AMT for their own experiments.

Taking Experiments Online


Taking cognitive experiments online poses a number of challenges beyond those of simple rating tasks or surveys.

First, such experiments usually require extended time (on the order of an hour) and uninterrupted concentration, which may or may not fit well with AMT, where HITs typically emphasize short, one-shot human judgment tasks or surveys. In our initial tests, we tried to keep our tasks shorter than usual (roughly 15-30 minutes).

Second, compared to a simple HTML survey, the visual display in many cognitive experiments is typically dynamic (things may move around, there may be animations, people may respond by manipulating objects on the screen with the mouse, etc…). This raises a couple of issues about how to control the screen display to ensure that people using many different types of computers (netbooks, iPads, regular desktops) view roughly the same thing. In addition, ensuring the experiment can even run without hassle in the Worker’s browser can be a technical challenge.

Box 2. Some AMT Basics


Terminology:

  • HIT – short for “Human Intelligence Task”; a single unit of work on AMT.
  • Requester – an individual asking for work to be done (in this case, a researcher).
  • Worker – an individual who responds to requests from Requesters and performs the offered HIT (i.e., a participant in a study).

Basics:

  • Requesters can demand certain qualifications for their Workers, such as having high previous approval ratings for their work or being located in the United States.
  • HITs are most typically performed in the web browser.
  • Once they are finished with the task, Workers submit the HIT for requester approval and payment.
  • Requesters can also pay an optional bonus based on performance.
  • Payments are made automatically through the Amazon Payments system (a PayPal-like service).

After initial testing and research, we decided that there is no general solution to these issues and that it is up to the individual researcher to balance ease of programming against the ability to support many diverse computer systems (this will be discussed more in an upcoming post). For example, we ran into difficulty providing support for Internet Explorer in our task. It might have taken an extra two or three weeks to figure out all the idiosyncrasies of the various rendering engines and Javascript implementations in every browser. However, it turns out that most users on AMT do not use Internet Explorer, so this decision has (hopefully) modest consequences.

Ultimately, there are many options available for designing web-based experiments that run in a browser, and if you are savvy enough you can probably do 80% of what Matlab’s Psychophysics Toolbox does. We found it fairly simple to design a dynamic display with Javascript that did not require extensive browser reloading between trials: all stimuli load at the beginning of the experiment, the task runs as a web app in the user’s browser, and data are sent back to our server at the end of each block.
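To make this concrete, here is a minimal sketch of the approach (not our actual code; the file names and the "/record_block" endpoint are hypothetical): stimuli are preloaded when the page opens, individual trials touch only local state, and a single AJAX request is made at the end of each block.

```javascript
// Minimal sketch: preload all stimulus images up front, accumulate trial data
// locally, and contact the server only between blocks.
var stimulusFiles = ["stim1.png", "stim2.png", "stim3.png", "stim4.png",
                     "stim5.png", "stim6.png", "stim7.png", "stim8.png"];
var preloaded = stimulusFiles.map(function (file) {
    var img = new Image();
    img.src = file;                 // the browser fetches and caches the image now
    return img;
});

var blockData = [];                 // trial records accumulated during the current block

function recordTrial(stimulusIndex, response, correct, rt) {
    blockData.push({stim: stimulusIndex, resp: response, correct: correct, rt: rt});
}

function submitBlock(workerId, blockNumber) {
    // One POST per block; nothing is reloaded and no network access occurs between trials.
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/record_block");
    xhr.setRequestHeader("Content-Type", "application/json");
    xhr.send(JSON.stringify({worker: workerId, block: blockNumber, trials: blockData}));
    blockData = [];                 // reset for the next block
}
```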

Experiment 1: A replication of Shepard, Hovland, Jenkins (1961)


In our first experiment with AMT we attempted to replicate Shepard, Hovland, and Jenkins’ (1961) classic study on category learning. This is a highly replicated and influential study in cognitive psychology. In addition, it has many of the key features we want in our online experiments (the participant has to pay attention for 20-30 minutes and learn something over a number of trials).

Figure 1. Two examples of the Shepard problems. Eight stimuli (here square-like objects) are divided into two groups. We used the stimuli developed and normed by Love (2002).

In brief, Shepard et al. (1961) had participants learn various categorical groupings of a set of eight geometric objects. Each of the six groupings varies in difficulty in a way related to the complexity of the “rule” needed to correctly partition the items. Differences in difficulty persist despite the fact that, in theory, people could simply memorize each item. This is often taken to imply that people form more abstract, structured conceptions of the pattern (e.g., by forming a rule).

Figure 2. The abstract structure of the Shepard, Hovland, and Jenkins classification problems (taken from Love, Medin, & Gureckis, 2004). Each stimulus can be coded as a binary vector along the three stimulus dimensions. The problems differ in how the eight items are assigned to the two categories.

Two example problems are shown in Figure 1. The “Type I” problem requires participants to form a rule along a single dimension (‘if blue then category A, otherwise category B’). This problem is usually fairly easy to learn, while other problems are more difficult. For example, the rule in the “Type VI” problem is a complicated three-way XOR and might be best learned by memorizing the category membership of each item. A full description of the abstract structure of the Shepard learning problems is shown in Figure 2.
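For readers who think in code, here is one way to write down the abstract structure just described (an illustration on our part, not the experiment code): each stimulus is a binary vector over the three dimensions, the Type I rule reads off a single dimension, and the Type VI rule is the parity (three-way XOR) of all three dimensions.

```javascript
// The eight stimuli coded as binary vectors over three dimensions
// (e.g., color, size, shape); see Figure 2.
var items = [
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]
];

function typeICategory(stim) {
    return stim[0];                            // category depends on the first dimension alone
}

function typeVICategory(stim) {
    return (stim[0] + stim[1] + stim[2]) % 2;  // parity of all three dimensions (three-way XOR)
}

// The remaining problem types (II-V) assign the same eight items to the two
// categories in other ways, as laid out in Figure 2.
```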

In general, previous research has shown that the type I problem is learned more easily than is the type II problem. In turn, types III, IV, and V are learned more slowly than type II (within III-V, learning rates appear similar). Finally, type VI is typically the most difficult pattern to learn. The relative rate of learning for these six problems has provided an important constraint on theories of human concept and category learning. Most computational models of categorization must account for the relative difficulty of these problems in order to be viewed as a serious theoretical account. In addition, the quantitative (rather than qualitative) shape of the learning curves has been used to test and differentiate models (e.g., Love, Medin, & Gureckis, 2004). Our basic goal was to see if we could replicate this finding using participants recruited over the Internet.

Methods


Participants

Two hundred and thirty-four anonymous online participants volunteered (N = 41, 38, 39, 39, 39, and 38 in types I-VI, respectively), and each received $1.00 via AMT’s built-in payment system. In addition, 1 in 10 participants who completed the task were randomly selected for a bonus raffle of $10. This incentive was included to encourage people to finish the task even if they found it difficult. An additional fifty-six participants initiated the experiment electronically, but withdrew before the end for unknown reasons. The data from these participants were not analyzed further. Finally, seven individuals indicated in a post-experiment questionnaire that they had used pen and paper to solve the task and were excluded (although these participants still received payment). Participants electronically signed consent forms and were debriefed after the experiment. The study design was approved by the NYU Institutional Review Board.


We conducted our experiment between 1:30pm EST on February 24th, 2012 and 6pm EST on February 28th, 2012. Data collection was generally paused each evening at around 9pm EST and started again the following morning. A restriction was put in place that participants be located within the United States and have at least a 95% acceptance rate for previous HITs. The purpose of these restrictions was to increase the probability that participants were native English speakers who could fully understand the instructions, and to keep data collection within relatively normal “working” hours. In addition, our experiment code checked the “Worker ID” (a unique identifier assigned to each Worker account) and made sure that each unique account could only participate in the task once. People could evade this restriction if they had multiple Amazon accounts, but doing so would be a violation of Amazon’s Terms of Use policy.
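As an illustration of the uniqueness check (a sketch, not our exact code): AMT appends parameters such as workerId to the URL of an externally hosted HIT, and the experiment can ask its own server whether that ID has been seen before. The "/check_worker" endpoint below is hypothetical.

```javascript
// Read the Worker ID that AMT appends to the HIT's URL.
function getQueryParam(name) {
    var match = new RegExp("[?&]" + name + "=([^&]*)").exec(window.location.search);
    return match ? decodeURIComponent(match[1]) : null;
}

var workerId = getQueryParam("workerId");

// Ask our own server whether this Worker ID has already started the task.
var xhr = new XMLHttpRequest();
xhr.open("GET", "/check_worker?id=" + encodeURIComponent(workerId));
xhr.onload = function () {
    if (JSON.parse(xhr.responseText).alreadyParticipated) {
        document.body.innerHTML = "<p>Sorry, you have already completed this HIT.</p>";
    } else {
        startExperiment();          // defined elsewhere in the experiment code
    }
};
xhr.send();
```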

Design

Each participant was randomly assigned to complete one of the six learning problems defined by Shepard et al. The stimuli were simple square objects similar to the ones shown in Figure 1. The stimuli we used were developed by Love (2002), who normed the constituent dimensions for roughly equal psychological salience. The mapping between the stimuli and the abstract structure shown in Figure 2 was randomly counterbalanced across participants.

Procedure


Our replication, although presented via AMT, was procedurally similar to a highly cited replication of the Shepard et al. results by Nosofsky et al. (1994). On each trial of the task, one of the eight objects was presented in the middle of the browser window (see Figure 3). Participants indicated whether the item belonged to category A or B by clicking the appropriate button. Feedback indicating whether the participant was correct or incorrect was then presented for 500 ms.1

Figure 3. An example of the web-based interface. A single object was presented and people simply responded by indicating the correct category membership of the item. Corrective feedback was given following their response.


Trials were organized into blocks of 16 trials. In the rest period between blocks, participants were given information about their performance in the previous block and about how many more blocks remained. The experiment lasted until the participant responded correctly for two blocks in a row (32 trials) or until they completed 15 blocks. Participants were told that the experiment could last as long as 15 blocks, but that they could end early if they correctly learned the grouping quickly. Participants were asked not to use pen and paper.
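A sketch of this stopping rule (our paraphrase for illustration, not the original code):

```javascript
var MAX_BLOCKS = 15;   // hard limit on the length of the session

// blockErrorCounts[i] holds the number of errors made in block i (of 16 trials).
function experimentIsOver(blockErrorCounts) {
    var n = blockErrorCounts.length;
    var twoPerfectBlocks = n >= 2 &&
        blockErrorCounts[n - 1] === 0 &&
        blockErrorCounts[n - 2] === 0;     // 32 consecutive correct responses
    return twoPerfectBlocks || n >= MAX_BLOCKS;
}
```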


After completing the task, participants filled out a brief questionnaire that asked whether they used any external learning aids (e.g., pencil and paper), whether they used any particular strategy, how much they enjoyed the task, and how difficult they thought it was.

Results


Figure 4 shows the probability of making a classification error as a function of training block for each of the six problems. If a participant reached the performance criterion (one block 100% correct) before the 15th block, we assumed they would continue to respond perfectly for all remaining blocks. Figure 4 is split into two panels. The data collected by Nosofsky and colleagues appear in the left panel and our AMT data appear in the right panel.
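As a rough illustration of how such curves can be computed under that assumption (this is our own sketch of the analysis, not the code used to produce the figure):

```javascript
// participants: array of objects with an errorsPerBlock array containing the
// number of errors in each block the participant actually completed.
function averageErrorCurve(participants, maxBlocks, blockSize) {
    var curve = [];
    for (var b = 0; b < maxBlocks; b++) {
        var total = 0;
        participants.forEach(function (p) {
            // Blocks beyond a participant's last completed block are scored as
            // error-free, implementing the padding assumption described above.
            var errors = b < p.errorsPerBlock.length ? p.errorsPerBlock[b] : 0;
            total += errors / blockSize;          // error probability for this person in block b
        });
        curve.push(total / participants.length);  // average across participants
    }
    return curve;
}
```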

Figure 4. Probability of classification error as a function of training block. The left panel shows the learning curves estimated by Nosofsky et al. (1994) using 120 participants (40 per learning problem) who each performed two randomly selected problems. The right panel shows our AMT data with 242 participants, each of whom performed only one problem (between 38 and 41 per condition). We ended the experiment after 15 blocks, although Nosofsky et al. stopped after 25. Thus, the Nosofsky et al. data have been truncated to facilitate visual comparison.

There are a couple of patterns of interest. First, people in the AMT experiment learned over trials, reducing their error rate. In addition, the type I problem was learned very quickly (within the first two or three blocks). In contrast, the error rate for the type II problem is somewhat higher (and more similar to problems III, IV, and V). As in previous reports, the type VI problem appears to be the most difficult. Thus, at a general level, our results are qualitatively in accord with previous findings.

At the same time, in all conditions besides type I, our participants performed significantly worse than Nosofsky et al.’s participants. For example, in all problems except type VI, the probability of error in Nosofsky et al.’s study fell below 0.1 by block 15. In contrast, our error rates asymptote near 0.2. One hypothesis is that participants on AMT generally learn more slowly, but this perspective is undermined somewhat by the fact that type I was learned at a rate similar to that in Nosofsky et al.’s study (the probability of error drops below 0.1 by the second block of trials).

This slower learning of the more complex problems is also reflected in Figure 5 (left panel), which compares the average number of blocks taken to reach criterion for both our participants and for those of Nosofsky et al. In almost every problem, participants on AMT took nearly double the number of blocks compared to Nosofsky et al.’s laboratory study. Closer inspection of the data showed that this was due to a rather large proportion of participants who never mastered the problems at all (using all 15 blocks).

Figure 5. The left panel shows the average number of blocks it took participants to reach criterion (2 blocks of 16 trials in a row with no mistakes) in each problem. The yellow bars in the left panel show the estimated average number of blocks to criterion reported by Nosofsky et al. (1994). The right panel shows the proportion of participants in each condition reaching the learning criterion. Nosofsky et al. (1994) did not report their data in this way.

Figure 5 (right panel) shows the proportion of subjects reaching criterion before the end of the task. In problems II-VI, roughly half the participants were able to master the problem by the end. However, this view of the data does suggest that type II was at least marginally easier than problems III-V.

Figure 6. The number of drop-outs for each of the six problems. Dropouts are defined as people who started the experiment but didn't finish (e.g., closed the browser window).

Interestingly, the difficulty of the task (according to Shepard et al. and Nosofsky et al.) didn’t have a strong impact on people’s decisions to drop out of the task. To assess this we counted the number of participants who started the experiment but didn’t successfully finish as a function of condition (see Figure 6). As you can see, the drop-out rate does not seem to be systematically related to problem difficulty (e.g., the smallest number of dropouts was in the type III problem, which was apparently somewhat difficult for participants).

It is also worth noting that we didn’t attempt any post-hoc “clean up” of the data (e.g., excluding people who took a long time or who pressed the same key for many trials in a row). While such exclusion may be warranted in certain cases, we didn’t have clear a priori hypotheses about which kinds of exclusions would be appropriate for this data. However, given the large percentage of subjects who failed to master the problems within 15 blocks, it is unlikely that there is a simple exclusion criterion that would make our data align well with the Nosofsky et al. replication (without directly excluding people who didn’t learn).

Experiment 2: Do you get what you pay for? Exploring the effect of payment on performance.


The above results were mixed. On one level it was definitely encouraging to collect so much data in such a short time frame. In addition, participants did tend to learn (e.g., in the Type I problem and as reflected in the overall error rates). However, at least when compared to Nosofsky et al. (1994), learning performance in our replication was considerably lower.

One possibility is that if we incentivized participants to perform well, we could get better data. Put another way, is the quality of AMT data basically as good as you are willing to pay for? As noted by Gosling et al. (in press), some AMT Workers seem to participate mainly for personal enjoyment, and payment isn’t such an important issue for these individuals. For example, in their report, they found that a large number of Workers would complete a survey for $.01 (the minimum possible payment).

However, this does not apply universally. Anecdotally, we attempted to run the Shepard et al. study reported above again but offered only $.25 as payment (and no lottery or bonus). In that case we recruited only 1 subject in 12 hours (2 others dropped out after the first block of the task). Thus, Workers’ decisions to participate are influenced by the magnitude of payment and by their estimation of the difficulty or length of the task. However, this sensitivity to the possible payment might also influence task performance in theoretically significant ways.

In a second study, we tried to systematically explore how our replication results might depend on how much money the AMT Workers are offered. This issue is rarely examined systematically in the laboratory but could have important implications for online data, where participants’ decisions to take part may be more purely economic.

Specifically, we repeated the above study with two different incentive structures. We felt our initial payment scheme described above was roughly in line with what we would pay a laboratory subject for a short 15-20 minute task ($1.50 on average). Thus, we created two additional conditions: a “low incentive” group that was paid $0.75 and not offered a bonus, and a “high incentive” group that was offered a guaranteed $2 and a bonus of up to $2.50 based on task performance.

Rather than test all six Shepard et al. problems, we focused this analysis on the type II and type IV problems. By comparing the results of this replication with our previous set of data, we hoped to learn how payment affects the correspondence between our online replication and related laboratory studies.

In addition, we collected demographic information about participants in this study.

Methods


Participants

Eighty-two anonymous online participants volunteered and were evenly divided between a “low incentive” and a “high incentive” condition. Within each condition, participants were randomly assigned to either the type II or type IV problem (N = 20 or 21 per problem in each incentive condition). In the “low incentive” condition each participant received $0.75 via AMT’s built-in payment system. There was no bonus or lottery offered for these participants. In the “high incentive” condition, participants were paid a base amount of $2 for completing the experiment and a bonus of up to $2.50. The bonus was calculated as follows: at the end of the experiment, 10 random trials were selected from the participant’s data file and each trial on which the participant provided a correct response increased the bonus by $.25. If the participant reached criterion (2 blocks with 100% correct responses) we coded all remaining trials as correct. This placed a relatively stronger financial incentive on quickly mastering the problem compared to either the “low incentive” condition or the previous experiment.
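A sketch of this bonus rule (illustrative only; for simplicity the 10 trials are sampled with replacement here):

```javascript
// trialCorrect: booleans for the trials the participant actually completed.
// If the participant reached criterion, the untested remainder of the session
// is scored as correct before sampling, as described above.
function computeBonus(trialCorrect, reachedCriterion, totalPossibleTrials) {
    var padded = trialCorrect.slice();
    if (reachedCriterion) {
        while (padded.length < totalPossibleTrials) {
            padded.push(true);
        }
    }
    var bonus = 0;
    for (var i = 0; i < 10; i++) {
        var pick = Math.floor(Math.random() * padded.length);
        if (padded[pick]) {
            bonus += 0.25;       // $0.25 for each sampled correct trial, up to $2.50 total
        }
    }
    return bonus;
}
```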


An additional twenty participants initiated the experiment electronically, but withdrew before the end for unknown reasons or self-reported using pen and paper to complete the task. As before, a restriction was put in place that participants be located within the United States and have at least a 95% acceptance rate for previous HITs.


We collected data for the low-incentive condition during a 25-hour period beginning March 9th, 2012 at 5pm EST and ending March 10th at 6pm EST. Data collection was stopped at 9pm EST each evening and began again after 10am EST. We collected data for the high-incentive condition during a 2-hour period beginning March 20th, 2012 at 3:30pm EST and ending at 5:30pm EST.

Design and Procedure

The design was identical to the previous study except that participants completed only the type II or type IV problem. The procedure was identical to before; the only difference was the incentive structure (either high or low).

Results


Figure 7 compares the learning curves for both the type II and type IV problems across three incentive conditions (the “medium” data are the same as above). As you can see, the incentive structure of the task had very little impact on overall learning rates and does not fundamentally change the impression that the type II and type IV problems were learned at a roughly similar rate. This result aligns well with Mason and Watts (2009), who report that the magnitude of payment doesn’t have a strong effect on the quality of data obtained from online, crowd-sourced systems.

Figure 7. The learning curves for Shepard et al. problems II and IV as a function of task incentives. The incentive structure had little impact on participants’ learning performance.

The effect of incentives on dropouts and recruitment. Overall, the incentive structure in the task had little impact on learning performance. However, it did strongly influence the rate of signups (40 subjects were collected in 2 hours in the high incentive condition, while it took roughly two days to collect the same amount of data in the low incentive condition). In addition, it strongly influenced the dropout rate. In the high incentive condition, only 5 participants started the task but did not finish (2 in type II and 3 in type IV), giving an overall dropout rate of 11%. In contrast, 13 participants in the low incentive condition started but did not finish the task (six in type II and seven in type IV), for an overall dropout rate of ~25%. Again, this result is largely consonant with Mason and Watts (2009).

Experiment 3: An instructional manipulation check


Our results so far are interesting, but also suggest caution in using AMT data in cognitive science research. Despite some faint hints of the classic learning pattern in our data, there were fairly large discrepancies between our study and laboratory-collected data. This mostly manifested as significantly worse learning in the conditions requiring “complex” cognition (problems II-VI, relative to the simple one-dimensional rule used in problem I). One concern is that the variable testing environment online contributes to distraction or a lack of participant motivation that negatively impacts performance in more challenging cognitive tasks. This would tend to reduce the utility of systems like AMT for research on these topics.

However, rather than give up, we doubled down on our efforts. First, we made some changes to our experiment code to bring it more in line with Nosofsky et al.’s original replication. In particular, we replaced the stimuli developed by Love (2002) with the simple geometric figures used by Nosofsky et al. and Shepard et al. Pilot data suggested that the stimulus differences were not the main factor influencing performance, but to ensure more comparable results we thought it would be prudent to minimize all differences.

Second, we became concerned that some participants may not have completely understood the instructions (some responses to the post-experiment questionnaire indicated that people believed the rule was changing from one block to the next). It seemed very likely that a failure to fully understand the instructions would negatively impact performance, perhaps differentially on the more difficult problems.

To address this issue, we incorporated an instructional manipulation check, which has been shown to reduce noise in behavioral experiments (Oppenheimer, Meyvis, and Davidenko, 2009). This (rather straightforward) technique requires the participant to answer non-trivial comprehension questions about the instructions of the experiment before participating. While Oppenheimer et al. introduced somewhat insidious “gotcha” questions into their instructions, we simply presented participants with a questionnaire at the end of the instruction phase which tested knowledge of the basic task and study goals. Correct answers to the questionnaire required a complete comprehension of the goals of the experiment and addressed possible misconceptions (e.g., “Will the rule change on each block?”, “Is it possible to get 100% correct?”, “Should you use pen and paper to solve the task?”, etc…). If a participant incorrectly answered any of the questions, they were politely asked to read the instructions again. This process repeated in a loop until the participant was able to answer all the comprehension questions correctly.
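The logic of the check is simple; a sketch is below (the helper functions and the alert text are illustrative, not our actual materials):

```javascript
function runInstructionPhase(onComplete) {
    showInstructions(function () {                 // hypothetical helper: pages through the instructions
        showComprehensionQuiz(function (answers) { // hypothetical helper: collects the quiz responses
            var allCorrect = answers.every(function (a) { return a.correct; });
            if (allCorrect) {
                onComplete();                      // proceed to the first learning block
            } else {
                alert("Please re-read the instructions and try again.");
                runInstructionPhase(onComplete);   // loop until every question is answered correctly
            }
        });
    });
}
```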

Methods


Participants

Two hundred anonymous online participants volunteered and were randomly assigned to either the Type I, II, IV, or VI problem (N = 50 in each). Participants were offered $1 to complete the task along with a one in ten chance of winning a $10 bonus (only available if they completed the task). This matches the “medium incentive” condition used in Experiment 1.
An additional 33 participants initiated the experiment electronically, but withdrew before the end for unknown reasons or self-reported using pen and paper to complete the task. As before, a restriction was put in place that participants be located within the United States and have at least a 95% acceptance rate for previous HITs. We collected data beginning March 29th, 2012 at 11:30am EST and ending April 2nd at 5pm EST. Data collection was stopped around 9pm EST each evening and began again after 10am EST.

Design and Procedure

The design was identical to the previous study except that participants completed only the Type I, II, IV, or VI problem. The only major changes were to the stimuli (made to match Nosofsky et al., 1994) and the instructions (detailed above). The procedure was otherwise identical to before.

Results


Figure 8 (left panel) compares the learning curves for Nosofsky et al. (1994) and Experiment 3. The most striking pattern is the closer correspondence between our AMT data and the laboratory-collected data for problems I and IV. These data probably fall within the acceptable margin of error across independent replications of the laboratory study. As an illustration, the right panel compares the Nosofsky et al. data to an independent laboratory-based replication by Lewandowsky (2011)2. Given the intrinsic variability across replications, this suggests the AMT data do a fairly good job of replicating the laboratory-based results. In contrast, the type VI problem appears a little more difficult for participants on AMT than in the lab. However, at least compared to our results in Experiment 1, the relative ordering of the problems is much more pronounced (i.e., I is easier than IV, which is easier than VI).

Figure 8. Left panel: A comparison of the learning curves for Experiment 3 and Nosofsky et al. (1994). The Nosofsky data have been truncated to show 10 learning blocks. Relative to Figure 4, the results look much more in line with the laboratory-based study. For comparison, the right panel shows the alignment between Nosofsky et al. (1994) and a more recent laboratory replication by Lewandowsky (2011). Note that Lewandowsky (2011) also found a relatively small difference between the type II and type IV problems.

Despite the generally increased alignment between the laboratory data and the AMT data, anomalies remain. In particular, the type II problem seems systematically more difficult for participants in our online sample than in Nosofsky et al.’s laboratory study. This is clear in Figure 9, which breaks the learning curves up by problem type. The colored solid lines are Nosofsky et al. (1994), the dashed colored lines are our Experiment 3 results, and the solid black line is our Experiment 1 results. On the other hand, it is clear that the “instructional check” manipulation greatly improved performance in all conditions (except perhaps type I, which was already near floor).

Figure 9. A direct comparison of the learning curves from Nosofsky et al. (1994), Experiment 1, and Experiment 3. As in Figure 8, the data have been truncated to 10 blocks. The results show a relatively close correspondence between the AMT data in Experiment 3 and the performance of laboratory-tested individuals for problems I, IV, and VI (although VI is a little more difficult for Turk participants). In almost all conditions, the instructional manipulation check in Experiment 3 greatly improved learning (much more than did the incentive manipulation in Experiment 2).

The finding that type II is learned at roughly the same rate as type IV is interesting. However, other measures of learning suggested at least a marginal type II advantage. For example, 100% of participants in the type I problem reached the learning criterion within the 10 training blocks (2 blocks in a row with 100% correct responses). In comparison, 73.1% reached criterion in the type II problem. However, only 56.4% reached criterion in the type IV problem and 44.8% reached criterion in the type VI problem. Interestingly, our finding of similar learning curves for the type II and IV problems has some precedent in the laboratory literature. For example, as visible in Figure 8, Lewandowsky (2011) found that the type II problem was learned at roughly the same rate as the type IV problem. A similar result was reported by Love (2002), who found only a marginal type II advantage over the type IV problem in a related design. We’re in the process of conducting a couple of follow-ups that might shed some light on this issue.

General Discussion


Overall, our experiments with AMT seem promising, but also raise some interesting issues.

First, it was amazing how much data we could collect in a short period of time. Performing a full-sized replication of the Nosofsky et al. (1994) data set in under 96 hours is revolutionary. This alone speaks volumes about the potential of services like AMT for accelerating behavioral research.

Second, it is notable that participants did learn in all conditions (error rate dropped from the beginning to the end of the study in all conditions). This fact was not necessarily a given since people could have chosen to respond randomly. Manual inspection of our data suggests this almost never happened.

Third, many participants were willing to take part in the 15-30 minute study even when offered only $.75 in the low incentive condition. Given that this is about 2-3 times longer than typical HITs on the system, there appears to be a reasonable market for recruiting human participants. In our high incentive condition, we could run as many as 40 participants in two hours.


Taken together, the results are extremely encouraging. In less than 96 hours, we collected a 272-person cognitive psychology experiment over the Internet. Despite a presumably heterogeneous testing environment and population pool, we seem to have gotten sensible data that qualitatively replicate previously reported results from the lab.

Finally, we replicated the key finding of Shepard et al. and Nosofsky et al. (type I was easier than types III-V, which were easier than type VI). At the same time, our data were a little less clear than the previously published laboratory-collected studies. In general, type II seemed slightly more difficult than previously reported (at least in our learning curve analysis). At this point we are not sure what to make of this difference, except to point out that a couple of recent laboratory studies report a similar pattern (e.g., Love, 2002; Lewandowsky, 2011). In addition, online participants generally learned more slowly (this was especially true in Experiments 1 and 2, but also showed up in the type VI condition in Experiment 3). It may be that the slower learning relates to the more diverse participant sample than is typical in laboratory studies (e.g., we did find a slightly negative correlation between performance on the type II problem and self-reported age).

However, more than anything, we found that building in checks for understanding the instructions is critical for ensuring high quality data. After incorporating those changes, our data began looking more like a publication-quality replication study. Overall, a pretty worthwhile exploration.

A quick survey of the cognitive science literature suggests that Internet-based studies have not yet made it fully into mainstream cognitive journals. There are numerous Cognitive Science conference proceedings papers which use Turk data, and a few social psychology studies which use Turk to collect surveys (please use the comments if there are others we are unaware of!). However, considerably fewer traditional experimental psychology papers have published Internet data as a primary source of subject recruitment. Based on the above, it seems that reviewers and editors might consider accepting behavioral experiments done on AMT as a valid methodology (applying as much scrutiny as they would apply to any behavioral paradigm). Even for extended experiments requiring problem solving and learning, the data seem mostly in line with laboratory results.

Advice and Suggestions


To conclude, we’d like to offer a bit of practical advice based on our experience.

First, we echo the point made by Mason and Suri (in press) that researchers should pay AMT users something close to the minimum wage in the United States, or at least close to what they would offer someone to perform the task in the lab. While our above analysis suggests that low pay doesn’t necessarily affect the quality of the data, we have found that we can recruit participants faster and have fewer dropouts by making the study financially appealing. In addition, it seems ethical to attempt to roughly equate payment across the lab and the “Internet lab” (recognizing that completing a study on one’s own computer requires a bit less effort than traveling to a lab to take part in a study). Many companies offer simple HITs on AMT for as little as $.10, but such rates are somewhat out of line with what subjects in the lab are offered.

Recommendations (and superstitions)


  1. Include checks that people understand the instructions before entering the task (comprehension questions).
  2. Pay in proportion to what you would pay in the lab for similar work.
  3. Consider collecting data only during the daytime in the US.
  4. Ensure that your task is as fun and interesting as your science will allow. You are competing in a marketplace of online distraction (YouTube, etc…).
  5. Limit the length of tasks to 10-30 minutes and pay accordingly.
  6. Whenever possible include a replication of a previous result in your design. This will allow you to assess differences to past work and ensure your conclusions are better grounded.
  7. Collect demographic information along with your study. Reviewers may seek more understanding about the population.
  8. Prevent your experiment from accessing the Internet between trials (ensures intermittent network problems don’t influence the presentation).
  9. Test on more than one browser/platform.
  10. Monitor and report drop-out rates in your study as a function of experimental condition.
  11. If you plan on excluding subjects from your design, decide on the criteria BEFORE collecting your data and clearly state this in your paper. Otherwise there is tremendous flexibility to exclude and replace participants from a virtually unlimited data source.

Second, experiments that are at least somewhat fun and engaging are likely to be better received. If you have people making 5,000 discrimination judgments about simple lines or sine-wave gratings, it seems less likely you will get highly useful data. One way to view it is that you are basically competing against all the other interesting things to do on the Internet (YouTube, etc…). On the other hand, in our “comment box” at the end of our experiment, many of the participants said they found the rule-discovery task to be fun and interesting (but to be fair, others hated it!).

We considered various ways to exclude “suspicious” or “odd” behavior (e.g., pressing the same key many times in a row, long response times) but ultimately didn’t report those analyses above. The problem is that our exclusion criteria were entirely arbitrary. Generally, we do not advocate excluding participants except in the most obvious situations of abuse (e.g., pushing the same button the entire time). In any case, such restrictions should be decided before data collection and clearly reported in papers. Also, the time of day and date of data collection may be important, as the AMT population may evolve over time.

Most importantly, we found that testing participants’ comprehension of the instructions was critical. Prior to including such checks our data were much noisier. In fact, the instruction check had a considerably more robust effect on the quality of our data than did increasing the payment. In retrospect this point is intuitive, but it was a lesson worth learning sooner rather than later.

Finally, it is important to monitor and record the rate at which people begin your experiment but do not finish. This is typically not a problem in laboratory studies, since the social pressure against getting up and walking out of the lab is much higher than it is online. However, dropout rates can interact in complex ways with dependent measures such as accuracy (low-performing individuals may be more likely to drop out online). We recommend that, perhaps unlike a typical laboratory study, all Internet experiments report dropout rates as a function of condition.

The bottom line? AMT certainly seems promising for experimental cognitive science research. Our investigations suggest that the data quality is reasonably high and compares well to laboratory studies. Hopefully, the quality of the data will remain high as additional researchers start to utilize this resource. If we (scientists) respect the participants and contribute to a positive experience on AMT it could turn into an invaluable tool for accelerating empirical research.

A portion of this work was completed as part of Devin Domingo’s Psychology Honors Thesis at NYU. Be sure to join the discussion below!

References


Gosling, S.D., Vazire, S., Srivastava, S., and John, O.P. (2004). Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. American Psychologist, 59, 2, 93-104. [pdf]

Lewandowsky, S. (2011). Working memory capacity and categorization: Individual differences and modeling. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 720-738. [PDF]

Love, B.C. (2002). Comparing supervised and unsupervised category learning. Psychonomic Bulletin & Review, 9(4), 829-835.

Love, B.C., Medin, D.L., and Gureckis, T.M. (2004). SUSTAIN: A Network Model of Category Learning. Psychological Review, 111, 309-332.

Mason, W. and Suri, S. (in press). A Guide to Behavioral Experiments on Mechanical Turk. Behavior Research Methods. [pdf]

Mason, W. and Watts, D. (2009). Financial incentives and the “performance of crowds.” HCOMP ’09: Proceedings of the ACM SIGKDD Workshop on Human Computation, 77–85. [PDF]

Nosofsky, R.M., Gluck, M.A., Palmeri, T.J., McKinley, S.C., and Glauthier, P. (1994). Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory & Cognition, 22(3), 352-369. [PDF]

Oppenheimer, D.M., Meyvis, T., and Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867-872. [PDF]

Paolacci, G., Chandler, J., and Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5, 411-419. [pdf]

Shepard, R.N., Hovland, C.I., and Jenkins H.M. (1961). Learning and memorization of classifications. Psychological Monographs: General and Applied, 75(13), 1-41.

  1. Our experiment code directly modified the current browser window and sent data to the server during rest sessions using AJAX. This meant there was no reloading or network access between individual learning trials. While our timing estimates of the display are approximate (depending on the user’s browser, the speed of their computer, etc…), they are substantially more accurate than if network access were required on every trial.
  2. These data were estimated from the figures in Lewandowsky’s paper.






  1. Just to register a different opinion regarding data editing. I think it is probably wise to omit dicey data, especially when the experiment is run under uncontrollable conditions like AMT. People who respond too quickly or hit the same key over and over or who are at chance performance (in less difficult tasks) are very likely just giving you noise. But it’s not relevant noise derived from the variance of the underlying process, but rather random junk due to whatever they happen to do rather than seriously try to complete the task. It may seem more moral to include all data, but it just isn’t an accurate representation of what people are doing, if some of the people are doing irrelevant things.

    However, as the article said, it is necessary to set such criteria up in advance, or else you end up looking at every person’s data and trying to decide whether you like them or not, and that is a bad position to be in. Very bad. So, simple criteria such as performing better than 65% accuracy (if chance = 50%), no response times under a second (for a complex judgment), or accurate responding to catch trials are all good ideas in my experience.

    Data Deleter
  2. Very interesting report! I am wondering what the rationale is for “3. Consider collecting data only during the daytime in the US.” Are there any data/studies showing fluctuation of data quality across different times of the day? Or is this just to bring the conditions closer to those of lab experiments (time-wise)?

    Saiwing
  3. Saiwing: This is just a superstition we’ve adopted (shared from another lab, I think). We haven’t analyzed it rigorously (although we probably have enough data now that we could analyze the effect of time of day). Since we limited our participants to US citizens, running really late into the night means we are getting people at relatively odd hours (local time), which may be associated with mental fatigue… or even intoxication. ;) In terms of improving the signal/noise ratio, the cost of stopping for 8-9 hours every evening seems low. Experience says that the number of signups is much higher during the daytime, so you aren’t losing out on much.

    Data Deleter: Your moderate position is likely correct. However, deleting data is still a sin.

    Todd Gureckis
  4. I’ve been running audio-visual phonetic adaptation studies on MTurk, too, and I’ve been surprised to find that even those types of studies reliably replicate existing data. We have two kinds of built-in exclusion criteria. First, subjects have to complete a labeling pre-test, where they label all items on our /b/-to-/d/ continuum as /b/ or /d/, and if their labeling function is unusual (in a way defined ahead of time) or their performance is too noisy then they’re automatically excluded. Second, as in the study we’re replicating (Vroomen et al., 2007), during the actual exposure/adaptation phase there are a few “catch” trials randomly interspersed. The exposure is a couple hundred trials of passively watching the same audio-visual stimulus over and over again, so we wanted to ensure that they were actually paying attention the whole time. The catch trials are pretty subtle (the only difference is that a small white dot flashes for one frame during the video), and the response is pretty simple (press the space bar). If a subject misses too many in one block or overall (again, according to pre-defined criteria), they’re excluded, too.

    I’ll be presenting this work at CogSci12 this summer, and a preprint of the proceedings paper will be available on my website once it’s submitted.

    Dave Kleinschmidt
  5. Very interesting post, even if I believe that MTurk is not always the right instrument for collecting behavioral data (e.g., consider persuasion studies where subjects should be neither aware of being observed, nor biased by external rewards). For this reason I have been seeking alternative solutions that still rely on data collection over the Internet, but in “ecological” settings. If you are interested you can take a look at my paper “Ecological Evaluation of Persuasive Messages Using Google AdWords”, which you can download here:

    http://arxiv.org/abs/1204.5369

    The paper is more focused on language analysis (in the Natural Language Processing field) and the approach needs to be refined, but I’m pretty sure there’s room for many uses in cognitive psychology experiments.

    Marco Guerini
  6. [...] Read about our full results here. [...]

    Can AMT be used for learning and memory experiments? « Experimental Turk
  7. Thanks for this super-informative blog post! How illuminating to see real comparative data like this.

    We do a lot of web experimentation and we have also recently begun always giving a comprehensive quiz to test online users’ comprehension of the instructions – and requiring subjects to reread the instructions if they get any of the quiz questions wrong. This should definitely be standard practice – if you don’t do this you can end up with real junk data.

    Hal

    Hal Pashler
  8. Thanks Hal! Yeah, it is a bit embarrassing we had to stumble into this rather than just guessing this would be a major issue right away. However, better to learn late than never!

    Todd Gureckis