Is Mechanical Turk the future of cognitive science research?
by John McDonnell, Devin Domingo, and Todd Gureckis
Is Internet-based data collection the future of cognitive science research? We used Amazon’s Mechanical Turk (AMT) to replicate a classic result in cognitive psychology which has primarily been established under traditional laboratory conditions (Shepard, Hovland, and Jenkins, 1961). In this post, we describe the various lengths we went to in order to get useful data from AMT and what we learned in the process. Overall, our results highlight the potential for using AMT in experimental research, but also raise a number of concerns and challenges. We invite comments, discussion, and shared experience below!
Note: Aspects of this post are now published in
Crump M.J.C., McDonnell, J.V., and Gureckis, T.M. (2013) Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE 8(3): e57410.
Table of Contents
Use the following links to jump around in the article:
- The Challenge of Collecting Behavioral Data
- Taking Experiments Online
- Exp 1: A Replication of Shepard, Hovland, Jenkins (1961)
- Exp 2: Do you get what you pay for? Exploring the effect of payment on performance
- Exp 3: An instructional manipulation check
- General Discussion
- Advice and Suggestions
The Challenge of Collecting Behavioral Data
One challenging aspect of experimental psychology research is the constant struggle for data. Typically, we depend on university undergraduates who participate in studies in exchange for experience, course credit, or money. Research progress depends on the ebb and flow of the semester. As a result, it can take weeks, months, or even years to conduct a large behavioral study. This issue is even more salient for researchers at smaller universities.
One appealing solution is to collect behavioral data over the Internet. In theory, online experimentation would allow researchers to collect data internationally, enable access to a very large number of people who may be interested in participating in research studies, and can be fairly automated. However, the main obstacle to conducting Internet-based research is finding people who are willing to participate and compensating them. Sure, you can post a link on your webpage, but few people are likely to find it.
Recently, our lab (like a number of others) has explored the possibility of using Amazon’s Mechanical Turk (AMT) for behavioral data collection. AMT is a crowdsourcing or “artificial-artificial-intelligence” platform in which people submit simple (and usually brief) computer-based jobs to human workers online (see Box 2).
While Internet-based methods for collecting data have been around for a while, AMT is a potentially useful system for researchers since it handles both recruitment and payment in a fairly automatic way. A large number of people use AMT, making it a great way to advertise and distribute studies (the NYTimes reported over 100,000 active users in 2007).
Recently, there have been a number of excellent summaries and workshops about using AMT for research. Most notably, Winter Mason of the Stevens Institute of Technology has a Behavior Research Methods paper [pdf], which summarizes how to use the system and what it does (see also this excellent blog post and the references below).
Box 1. Useful Amazon Mechanical Turk Links
Rather than focus on how to use AMT, this post focuses on the reliability of the data for experimental research in cognitive psychology. Before using this data in our experiments we wanted some sense of the quality of the data when compared to laboratory studies. Along those lines, we had a couple of questions going into this project:
- First, could we replicate classic findings in cognitive psychology concerning learning and memory processes with fidelity similar to that collected in the lab?
- Second, what unexpected issues come up in conducting learning, memory, and reasoning studies online?
- Third, how reliable is data on AMT as a function of the payment amount?
In this sense, our analysis is similar to a couple of recent papers (e.g., Paolacci, Chandler, & Ipeirotis, 2010 and Buhrmester, Kwang, and Gosling, 2011). However, these previous reports focus on survey data, the test-retest reliability of AMT, or simple one-shot decision making tasks. In contrast, we were interested in using AMT to replicate and extend classic findings in experimental cognitive psychology that were originally established in the lab. Our emphasis was not just on qualitative replication, but on obtaining quantitatively accurate data in a study that has been frequently replicated in the lab. In addition, we wanted to explore whether it is possible to conduct publication-quality Internet experiments which unfold over many trials, require learning and sustained attention, and may take 20-60 minutes to complete.
Our initial experiments provide answers to many of these questions, and we felt our results might interest people thinking of using AMT for their own experiments.
Taking Experiments Online
In terms of taking cognitive experiments online, there are a number of unique challenges compared to simple ratings tasks or surveys.
First, such experiments usually take extended time (on the order of 1 hour) and uninterrupted concentration. This may or may not be a natural fit for AMT, which typically emphasizes short, one-shot human judgment tasks or surveys. In our initial tests, we tried to keep our tasks shorter than usual (roughly 15-30 minutes).
Second, compared to a simple HTML survey, the visual display in many cognitive experiments is typically dynamic (things may move around, there may be animations, people may respond by manipulating objects on the screen with the mouse, etc…). This raises a couple of issues about how to control the screen display to ensure that people using many different types of computers (netbooks, iPads, regular desktops) view roughly the same thing. In addition, ensuring the experiment can even run without hassle in the Worker’s browser can be a technical challenge.
Box 2. Some AMT Basics
- HIT – Short for “Human Intelligence Task”; a unit of work on AMT.
- Requester – An individual asking for work to be done (in this case, a researcher).
- Worker – An individual who responds to requests from Requesters and performs the offered HIT (i.e., a participant in a study).
- Requesters can demand certain qualifications for their Workers, such as having high previous approval ratings for their work or being located in the United States.
- HITs are most typically performed in the web browser.
- Once they are finished with the task, Workers submit the HIT for requester approval and payment.
- Requesters can also pay an optional bonus based on performance.
- Payments are made automatically through the Amazon Payments system (a Paypal-like system).
Experiment 1: A replication of Shepard, Hovland, Jenkins (1961)
In our first experiment with AMT we attempted to replicate Shepard, Hovland, and Jenkins’ (1961) classic study on category learning. This is a highly replicated and influential study in cognitive psychology. In addition, it has many of the key features we want in our online experiments (the participant has to pay attention for 20-30 minutes and learn something over a number of trials).
As a summary, Shepard et al. (1961) had participants learn various categorical groupings of a set of eight geometric objects. Each of the six groupings varies in difficulty in a way related to the complexity of the “rule” needed to correctly partition the items. Differences in difficulty persist despite the fact that, in theory, people could memorize each item. This is often taken to imply people are forming more abstract, structured conceptions of the pattern (e.g., by forming a rule).
Two example problems are shown in Figure 1. The “Type I” problem requires participants to form a rule along a single dimension (‘if blue then category A, otherwise category B’). This problem is usually fairly easy to learn, while other problems are more difficult. For example, the rule in the “Type VI” problem is a complicated three-way XOR and might be best learned by memorizing the category membership of each item. A full description of the abstract structure of the Shepard learning problems is shown in Figure 2.
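To make the abstract structure concrete, here is a minimal Python sketch of the Type I and Type VI category assignments over three binary stimulus dimensions. The mapping of abstract dimensions to physical features (e.g., color) is illustrative only; in the experiments it was counterbalanced across participants.

```python
from itertools import product

# The eight stimuli are all combinations of three binary dimensions
# (which dimension corresponds to color/size/shape is illustrative).
STIMULI = list(product([0, 1], repeat=3))

def type_i(stim):
    """Type I: category depends on a single dimension (e.g., 'if blue then A')."""
    return "A" if stim[0] == 0 else "B"

def type_vi(stim):
    """Type VI: a three-way XOR -- category is the parity of the three dimensions."""
    return "A" if sum(stim) % 2 == 0 else "B"
```

Note that both structures split the eight stimuli evenly (four per category), but Type VI cannot be captured by any rule simpler than the full parity computation, which is why it is usually solved by memorization.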
In general, previous research has shown that the type I problem is learned more easily than is the type II problem. In turn, types III, IV, and V are learned more slowly than type II (within III-V, learning rates appear similar). Finally, type VI is typically the most difficult pattern to learn. The relative rate of learning for these six problems has provided an important constraint on theories of human concept and category learning. Most computational models of categorization must account for the relative difficulty of these problems in order to be viewed as a serious theoretical account. In addition, the quantitative (rather than qualitative) shape of the learning curves has been used to test and differentiate models (e.g., Love, Medin, & Gureckis, 2004). Our basic goal was to see if we could replicate this finding using participants recruited over the Internet.
Two hundred and thirty-four anonymous online participants volunteered (N=41, 38, 39, 39, 39, and 38 in types I-VI respectively), and each received $1.00 via AMT’s built-in payment system. In addition, 1 in 10 participants who completed the task were randomly selected for a bonus raffle of $10. This incentive was included to encourage people to finish the task even if they found it difficult. An additional fifty-six participants initiated the experiment electronically, but withdrew before the end for unknown reasons. The data from these participants were not analyzed further. Finally, seven individuals indicated in a post-experiment questionnaire that they had used pen and paper to solve the task and were excluded (although these participants still received payment). Participants electronically signed consent forms and were debriefed after the experiment. The study design was approved by the NYU Institutional Review Board.
Each participant was randomly assigned to complete one of the six learning problems defined by Shepard et al. The stimuli were simple square objects similar to the ones shown in Figure 1. The stimuli we used were developed by Love (2002) who normed the constituent dimensions for roughly equal psychological salience. The mapping between the stimuli and the abstract structure shown in Figure 2 was randomly counterbalanced across participants.
Our replication, although presented via AMT, was procedurally similar to a highly cited replication of the Shepard et al. results by Nosofsky et al. (1994). On each trial of the task, one of the eight objects was presented in the middle of the browser window (see Figure 3). Participants indicated if the item belonged to category A or B by clicking the appropriate button. Feedback indicating whether the response was correct or incorrect was then presented for 500 ms.¹
Trials were organized into blocks of 16 trials. In the rest period between blocks, participants were given information about their performance in the previous block and about how many more blocks remained. The experiment lasted until the participant responded correctly for two blocks in a row (32 trials) or until they completed 15 blocks. Participants were told that the experiment could last as long as 15 blocks, but that they could end early if they correctly learned the grouping quickly. Participants were asked not to use pen and paper.
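The stopping rule above can be sketched as a small helper (the function name is ours; the actual experiment ran in the participant’s browser):

```python
def experiment_finished(block_accuracies, max_blocks=15):
    """True once the last two blocks were both perfect, or after 15 blocks.

    `block_accuracies` holds the proportion correct for each completed
    16-trial block, in order.
    """
    if len(block_accuracies) >= max_blocks:
        return True
    # Criterion: two consecutive blocks with 100% correct responses.
    return len(block_accuracies) >= 2 and block_accuracies[-2:] == [1.0, 1.0]
```

This is why total experiment length varies across participants: fast learners may finish after as few as two blocks, while those who never reach criterion complete all 15.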
After completing the task, participants filled out a brief questionnaire that asked if they used any external learning aids (e.g. pencil and paper), if they used any particular strategy, and how much they enjoyed the task and how difficult they thought it was.
Figure 4 shows the probability of making a classification error as a function of training block for each of the six problems. If a participant reached the performance criterion (one block 100% correct) before the 15th block, we assumed they would continue to respond perfectly for all remaining blocks. Figure 4 is split in two panels. The data collected by Nosofsky and colleagues appears in the left panel and our AMT data appear in the right panel.
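Concretely, the per-block error curves can be computed by padding each early finisher’s record with zeros before averaging (a sketch; the function name is ours):

```python
def mean_error_curve(per_subject_errors, total_blocks=15):
    """Average per-block error rate across participants.

    Participants who reached criterion early are assumed to respond
    perfectly (error = 0.0) for all remaining blocks, so their error
    lists are padded with zeros out to `total_blocks`.
    """
    padded = [errs + [0.0] * (total_blocks - len(errs))
              for errs in per_subject_errors]
    # Average each block (column) across participants.
    return [sum(block) / len(block) for block in zip(*padded)]
```

For example, averaging one participant who finished after 2 blocks with one who took 3 yields a curve over all 3 blocks, with the early finisher contributing zero error to the final block.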
There are a couple patterns of interest. First, people in the AMT experiment learn over trials and reduce the error rate. In addition, the type I problem was learned very quickly (within the first two or three blocks). In contrast, the error rate for the type II problem is somewhat higher (and more similar to problems III, IV, and V). As in previous reports, the type VI appears to be the most difficult. Thus, at a general level, our results are qualitatively in accord with previous findings.
At the same time, in all conditions besides type I, our participants performed significantly worse than Nosofsky et al.’s participants. For example, in all problems except for type VI, the probability of error in Nosofsky’s study fell below 0.1 by block 15. In contrast, our error rates asymptote near 0.2. One hypothesis is that participants on AMT generally learn more slowly, but this perspective is undermined somewhat by the fact that type I was learned at a rate similar to that in Nosofsky et al. (the probability of error drops below 0.1 by the second block of trials).
This slower learning rate for the more complex problems is also reflected in Figure 5 (left panel), which compares the average number of blocks taken to reach criterion for participants in our data and in Nosofsky et al. In almost every problem, participants on AMT took nearly double the number of blocks compared to Nosofsky et al.’s laboratory study. Closer inspection of the data showed that this was due to a rather large proportion of participants who never mastered the problems at all (taking all 15 blocks).
Figure 5 (right panel) shows the proportion of subjects reaching criterion before the end of the task. Roughly half the participants were able to master the problem by the end in problems II-VI. However, this view of the data does suggest that type II was at least marginally easier than problems III-V.
Interestingly, the difficulty of the task (according to Shepard et al. and Nosofsky et al.) didn’t have a strong impact on people deciding to drop out of the task. To assess this we counted the number of participants who started the experiment but didn’t successfully finish as a function of condition (see Figure 6). As you can see, the drop-out rate does not seem to be systematically related to the problem difficulty (e.g., the smallest number of dropouts was in the type III problem, which was apparently somewhat difficult for participants).
It is also worth noting that we didn’t attempt any post-hoc “clean up” of the data (e.g., excluding people who took a long time or who pressed the same key for many trials in a row). While such exclusion may be warranted in certain cases, we didn’t have clear a priori hypotheses about which kinds of exclusions would be appropriate for this data. However, given the large percentage of subjects who failed to master the problems within 15 blocks, it is unlikely that there is a simple exclusion criterion that would make our data align well with the Nosofsky et al. replication (without directly excluding people who didn’t learn).
Experiment 2: Do you get what you pay for? Exploring the effect of payment on performance.
The above results were mixed. On one level it was definitely encouraging to collect so much data in such a short time frame. In addition, participants did tend to learn (e.g., in the Type I problem and as reflected in the overall error rates). However, at least when compared to Nosofsky et al. (1994), learning performance in our replication was considerably lower.
One possibility is that if we incentivized participants to perform well, we could get better data. Put another way, is the quality of AMT data basically as good as you are willing to pay? As noted by Gosling et al. (in press), some AMT workers seem to participate mainly for personal enjoyment, and payment isn’t such an important issue for these individuals. For example, in their report, they found that a large number of Workers would complete a survey for $.01 (the minimum possible payment).
However, this does not apply universally. Anecdotally, we attempted to run the Shepard et al. study reported above again but only offered $.25 as payment (and no lottery or bonus). In that case we recruited only 1 subject in 12 hours (2 others dropped out after the first block of the task). Thus, workers are influenced to participate by the magnitude of payment and their estimation of the difficulty or length of the task. However, this sensitivity to the possible payment might also influence task performance in theoretically significant ways.
In a second study, we tried to systematically explore how our replication results might depend on how much money the AMT workers are offered. This issue is rarely examined systematically in the laboratory but could have important implications for online data, where participants’ decision to take part may be a more purely economic one.
Specifically, we repeated the above study with two different incentive structures. We felt our initial payment scheme described above was roughly in line with what we would pay a laboratory subject for a short 15-20 minute task ($1.50 on average). Thus, we created two additional conditions: a “low incentive” group that was paid $0.75 and offered no bonus, and a “high incentive” group that was offered a guaranteed $2 plus a performance-based bonus of up to $2.50.
Rather than test all six Shepard et al. problems, we focused this analysis on the type II and type IV problems. By comparing the results of this replication with our previous set of data, we hoped to obtain information about the relative effects of payment on the relationship between our online replication and related laboratory studies.
In addition, we collected demographic information about participants in this study.
Eighty-two anonymous online participants volunteered and were evenly divided between either a “low incentive” or “high incentive” condition. Within each condition, participants were randomly assigned to either the type II or type IV problems (N = 20 or 21 in each condition for both incentive conditions). In the “low incentive” condition each participant received $0.75 via AMT’s built-in payment system. There was no bonus or lottery offered for these participants. In the “high incentive” condition, participants were paid a base amount of $2 for completing the experiment and a bonus of up to $2.50. The bonus was calculated as follows: at the end of the experiment, 10 random trials were selected from the participant’s data file, and each sampled trial on which the participant provided a correct response increased the bonus by $.25. If the participant reached criterion (2 blocks with 100% correct responses), we coded all remaining trials as correct. This placed a relatively stronger financial incentive on quickly mastering the problem compared to either the “low incentive” condition or the previous experiment.
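The bonus rule amounts to the following (a sketch under the assumption that every trial after the participant reached criterion has already been coded as correct; the function name is ours):

```python
import random

def high_incentive_bonus(trial_correct, n_sampled=10, per_trial=0.25):
    """Sample 10 random trials; each correct one adds $0.25 (max $2.50).

    `trial_correct` is one boolean per trial, with all trials after the
    participant reached criterion already coded as correct.
    """
    sampled = random.sample(range(len(trial_correct)), n_sampled)
    return per_trial * sum(trial_correct[i] for i in sampled)
```

Because the sample is random, the expected bonus is simply $2.50 times the participant’s overall (criterion-adjusted) accuracy, which ties payment directly to how quickly the problem is mastered.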
An additional twenty participants initiated the experiment electronically, but withdrew before the end for unknown reasons or self-reported using pen and paper to complete the task. As before, a restriction was put in place that participants were located within the United States and had at least a 95% approval rate for previous HITs.
We collected data for the low-incentive condition during a 25 hour period beginning March 9th, 2012 at 5pm EST and ending March 10th at 6pm EST. Data collection was stopped at 9pm EST each evening and began again after 10am EST. We collected data for the high-incentive condition during a 2 hour period beginning March 20th, 2012 at 3:30pm EST and ending at 5:30pm EST.
Design and Procedure
The design was mostly identical to the previous study except that participants completed only the type II or type IV problem. The procedure was mostly identical to before; the only difference was the incentive (either high or low).
Figure 7 compares the learning curves for both the type II and type IV problems across three incentive conditions (the “medium” data is the same as above). As you can see, the incentive structure of the task had very little impact on overall learning rates in the task and does not fundamentally change the impression that the type II and type IV problems were learned at a roughly similar rate. This result aligns well with Mason and Watts (2009) who report that the magnitude of payment doesn’t have a strong effect on the quality of data obtained from online, crowd-sourced systems.
The effect of incentives on drop outs and recruitment. Overall, the incentive structure in the task had little impact on learning performance. However, it did strongly influence the rate of signups (40 subjects were collected in 2 hours in the high incentive condition while it took roughly two days to collect the same amount of data in the low incentive condition). In addition, it strongly influenced the dropout rate. In the high incentive condition, only 5 participants started the task but did not finish (2 in type II and 3 in type IV), giving a dropout rate overall of 11%. In contrast, 13 participants in the low incentive condition started but did not finish the task (six in type II and seven in type IV), for an overall dropout rate of ~25%. Again, this result is largely consonant with Mason and Watts (2009).
Experiment 3: An instructional manipulation check
Our results so far are interesting, but also suggest caution in using AMT data in cognitive science research. Despite some faint hints of the classic learning pattern in our data, there were fairly large discrepancies between our study and laboratory-collected data. This mostly manifested in significantly worse learning for the conditions requiring “complex” cognition (problems II-VI, relative to the simple one-dimensional rule used in problem I). One concern is that the variable testing environment online contributes to distraction or a lack of participant motivation that might negatively impact performance in more challenging cognitive tasks. This would tend to reduce the utility of systems like AMT for research on these topics.
However, rather than give up, we doubled down in our efforts. First, we made some changes to our experiment code to be more in line with Nosofsky et al.’s replication. In particular, we replaced the stimuli developed by Love (2002) with the simple geometric figures used by Nosofsky et al. and Shepard et al. Pilot data suggested that the stimulus differences were not the main factor influencing performance, but to ensure more comparable results we thought it would be prudent to minimize all differences.
Second, we became concerned that some participants may not have completely understood the instructions (some responses to the post-experiment questionnaire indicated that people believed the rule was changing from one block to the next). It seemed very likely that a failure to fully understand the instructions would negatively impact performance, perhaps differentially on the more difficult problems.
To address this issue, we incorporated an instructional manipulation check, which has been shown to reduce noise in behavioral experiments (Oppenheimer, Meyvis, and Davidenko, 2009). This (rather straightforward) technique requires the participant to answer non-trivial comprehension questions about the instructions of the experiment before participating. While Oppenheimer et al. introduced somewhat insidious “gotcha” questions into their instructions, we simply presented participants with a questionnaire at the end of the instruction phase which tested knowledge of the basic task and study goals. Correct answers to the questionnaire required a complete comprehension of the goals of the experiment and addressed possible misconceptions (e.g., “Will the rule change on each block?”, “Is it possible to get 100% correct?”, “Should you use pen and paper to solve the task?”, etc…). If a participant incorrectly answered any of the questions, they were asked politely to read the instructions again. This process repeated in a loop until the participant was able to answer all the comprehension questions correctly.
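The check loop can be sketched as follows (the questions are paraphrased from the post, and the helper names are ours; the real check ran in the participant’s browser):

```python
from collections import namedtuple

Question = namedtuple("Question", ["prompt", "answer"])

# Illustrative comprehension questions, paraphrased from the instructions.
QUESTIONS = [
    Question("Will the rule change on each block?", "no"),
    Question("Is it possible to get 100% correct?", "yes"),
    Question("Should you use pen and paper to solve the task?", "no"),
]

def run_instruction_check(show_instructions, get_answer):
    """Re-show the instructions until every comprehension question is
    answered correctly; returns how many times the instructions were shown."""
    readings = 0
    while True:
        show_instructions()
        readings += 1
        if all(get_answer(q.prompt) == q.answer for q in QUESTIONS):
            return readings
```

Because the loop only exits once all answers are correct, every participant who enters the task has demonstrably processed the key points of the instructions at least once.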
Two-hundred anonymous online participants volunteered and were randomly assigned to either the Type I, II, IV, or VI problems (N = 50 in each). Participants were offered $1 to complete the task along with a one in ten chance of winning a $10 bonus (only available if they completed the task). This matches the “medium incentive” condition used in Experiment 1.
An additional 33 participants initiated the experiment electronically, but withdrew before the end for unknown reasons or self-reported using pen and paper to complete the task. As before, a restriction was put in place that participants were located within the United States and had at least a 95% approval rate for previous HITs. We collected data beginning March 29th, 2012 at 11:30am EST and ending April 2nd at 5pm EST. Data collection was stopped around 9pm EST each evening and began again after 10am EST.
Design and Procedure
The design was identical to the previous study except that participants completed only the Type I, II, IV, or VI problem. The only major changes were to the stimuli (made to match Nosofsky et al., 1994) and the instructions (detailed above). The procedure was identical to before.
Figure 8 (left panel) compares the learning curves for Nosofsky et al. (1994) and Experiment 3. The most striking pattern is the closer correspondence between our AMT data and the laboratory-collected data for problems I and IV. These data probably fall within the acceptable margin of error across independent replications of the laboratory study. As an illustration, the right panel compares the Nosofsky et al. data to an independent laboratory-based replication by Lewandowsky (2011).² Given the intrinsic variability across replications, this suggests the AMT data do a fairly good job of replicating the laboratory-based results. In contrast, the type VI problem appears a little more difficult for participants on AMT than in the lab. However, at least compared to our results in Experiment 1, the relative ordering of the problems is much more pronounced (i.e., I is easier than IV, which is easier than VI).
Despite generally increased alignment between the laboratory data and AMT data, anomalies remain. In particular, the type II problem seems systematically more difficult for participants in our online sample than in Nosofsky et al.’s laboratory study. This is clear in Figure 10, which breaks the learning curves up by problem type. The colored solid lines are Nosofsky et al. (1994), the dashed colored lines are our Experiment 3 results, and the solid black line is our Exp. 1 results. On the other hand, it is clear that the “instructional check” manipulation greatly improved performance in all conditions (except perhaps type I, which was already near floor).
The finding that type II is learned at roughly the same rate as type IV is interesting. However, other measures of learning suggested at least a marginal type II advantage. For example, 100% of participants in the type I problem reached the learning criterion within the 10 training blocks (2 blocks in a row with 100% correct responses). In comparison, 73.1% reached criterion in the type II problem. However, only 56.4% reached criterion in the type IV problem and 44.8% reached criterion in the type VI problem. Interestingly, our finding of similar learning curves for the type II and IV problems has some precedent in the laboratory literature. For example, as visible in Figure 8, Lewandowsky (2011) found that the type II problem was learned at roughly the same rate as the type IV problem. A similar result was reported by Love (2002), who found only a marginal type II advantage compared to the type IV problem in a related design. We’re in the process of conducting a couple of follow-ups to this result which might shed some light on this issue.
General Discussion
Overall, our experiments with AMT seem promising, but also raise some interesting issues.
First, it was amazing how much data we could collect in a short period of time. Performing a full-sized replication of the Nosofsky et al. (1994) data set in under 96 hours is revolutionary. This alone speaks volumes about the potential of services like AMT for accelerating behavioral research.
Second, it is notable that participants did learn in all conditions (error rate dropped from the beginning to the end of the study in all conditions). This fact was not necessarily a given since people could have chosen to respond randomly. Manual inspection of our data suggests this almost never happened.
Third, many participants were willing to take part in the 15-30 minute study even when offered $.75 in the low incentive condition. Given that this is about 2-3 times longer than typical HITs on the system, there seems to be a reasonable market for recruiting human participants. In our high incentive condition, we could run as many as 40 participants in two hours.
Taken together, the results are extremely encouraging. In less than 96 hours, we ran a 272-person cognitive psychology experiment over the Internet. Despite a presumably heterogeneous testing environment and population pool, we seem to have gotten sensible data that qualitatively replicate previously reported results from the lab.
Finally, we replicated the key finding of Shepard et al. and Nosofsky et al. (type I was easier than types III-V which are easier than type VI). At the same time our data was a little less clear than the previously published laboratory collected studies. In general, type II seemed slightly more difficult than previously reported (at least in our learning curve analysis). At this point we are not sure what to make of this difference, except to point out that a couple recent laboratory studies report a similar pattern (e.g., Love, 2002; Lewandowsky, 2011). In addition, online participants generally learned more slowly (this was especially true in Exp 1 and 2 but also showed up in the type VI condition in Exp. 3). It may be that the slower learning relates to the more diverse participant sample than is typical in laboratory studies (e.g., we did find a slightly negative correlation between performance on the type II problem and self-reported age).
However, more than anything, we found that building in checks for understanding the instructions is critical for ensuring high quality data. After incorporating those changes, our data began looking more like a publication-quality replication study. Overall, a pretty worthwhile exploration.
A quick survey of the cognitive science literature suggests that Internet-based studies have not yet made it fully into mainstream cognitive journals. There are numerous Cognitive Science conference proceedings papers which use Turk data, and a few social psychology studies which use Turk to collect surveys (please use the comments if there are others we are unaware of!). However, considerably fewer traditional experimental psychology papers have published Internet data as a primary source of subject recruitment. Based on the above, it seems that reviewers and editors might consider accepting behavioral experiments done on AMT as a valid methodology (applying as much scrutiny as they would apply to any behavioral paradigm). Even for extended experiments requiring problem solving and learning, the data seem mostly in line with laboratory results.
Advice and Suggestions
To conclude, we’d like to offer a bit of practical advice based on our experience.
First, we echo the point made by Mason and Suri (in press) that researchers should pay AMT users something close to minimum wage in the United States, or at least close to what you would offer someone to perform the task in the lab. While our above analysis suggests that low pay doesn’t necessarily affect the quality of the data, we have found that we can recruit participants faster and have fewer dropouts by making the study financially appealing. In addition, it seems ethical to attempt to roughly equate payment across the lab and “Internet lab” (recognizing that a computer study requires a bit less effort than traveling to a lab to take part in the study). Many companies offer simple HITs on AMT for as little as $.10, but such rates are somewhat out of line with what subjects in the lab are offered.
Recommendations (and superstitions)
- Include checks that people understand the instructions before entering the task (comprehension questions).
- Pay in proportion to what you would pay in the lab for similar work.
- Consider collecting data only during the daytime in the US.
- Ensure that your task is as fun and interesting as your science will allow. You are competing in a marketplace of online distraction (YouTube, etc…).
- Limit the length of tasks to 10-30 minutes and pay accordingly.
- Whenever possible, include a replication of a previous result in your design. This will allow you to assess differences from past work and better ground your conclusions.
- Collect demographic information along with your study. Reviewers may seek more understanding about the population.
- Prevent your experiment from accessing the Internet between trials (ensures intermittent network problems don’t influence the presentation).
- Test on more than one browser/platform.
- Monitor and report drop-out rates in your study as a function of experimental condition.
- If you plan on excluding subjects from your design, decide on the criterion BEFORE collecting your data and clearly state it in your paper. Otherwise there is tremendous flexibility to exclude and replace participants from a virtually unlimited data source.
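One of the recommendations above — preventing network access between trials — is the pattern our own code followed (see the footnote below the references): buffer trial data locally and flush it to the server only during rest breaks. Here is a minimal sketch of that pattern, not our actual implementation; the endpoint and field names are made up:

```javascript
// Hypothetical trial buffer: hold responses locally during a learning
// block and flush them to the server only at rest breaks, so intermittent
// network hiccups never interrupt trial presentation.
const trialBuffer = [];

function recordTrial(trial) {
  // Called after every response; no network access happens here.
  trialBuffer.push(trial);
}

async function flushAtRestBreak(endpoint) {
  if (trialBuffer.length === 0) return;
  // Empty the buffer and POST everything in a single request.
  const payload = JSON.stringify(trialBuffer.splice(0, trialBuffer.length));
  await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: payload,
  });
}
```

Because the buffer is emptied before the request is awaited, a slow or failed save never delays the next trial; you can add retry logic at the rest break without touching the trial loop.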
Second, experiments that are at least somewhat fun and engaging are likely to be better received. If you have people making 5,000 discrimination judgments about simple lines or sine-wave gratings, it seems less likely you will get highly useful data. One way to view it is that you are basically competing against all the other interesting things to do on the Internet (YouTube, etc…). On the other hand, in the “comment box” at the end of our experiment, many of the participants said they found the rule-discovery task to be fun and interesting (but to be fair, others hated it!).
We considered various ways to exclude “suspicious” or “odd” behavior (e.g., pressing the same key many times in a row, long response times) but ultimately didn’t report those analyses above. The problem is that our exclusion criteria were entirely arbitrary. Generally, we do not advocate excluding participants except in the most obvious cases of abuse (e.g., pushing the same button the entire time). In any case, exclusion rules should be decided before data collection and clearly reported in papers. Also, the time of day and date of data collection may be important, as the AMT population may evolve over time.
Most importantly, we found that testing participants’ comprehension of the instructions was critical. Prior to including such checks, our data were much noisier. In fact, the instruction check had a considerably more robust effect on the quality of our data than did increasing the payment. In retrospect this point is intuitive, but it was a lesson worth learning sooner rather than later.
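In practice, a comprehension check can be as simple as a gate that loops participants back to the instructions until they answer a short quiz correctly. A minimal sketch follows; the questions and helper functions are hypothetical, not from our actual experiment:

```javascript
// Hypothetical comprehension quiz: each entry pairs a question with the
// expected answer.
const quiz = [
  { question: "How many categories are there?", answer: "two" },
  { question: "What happens after each response?", answer: "feedback" },
];

// `askQuestion` stands in for however your experiment collects a response
// (form field, button press, etc.).
function passedQuiz(askQuestion) {
  return quiz.every(({ question, answer }) => askQuestion(question) === answer);
}

function runInstructionLoop(showInstructions, askQuestion, startTask) {
  showInstructions();
  while (!passedQuiz(askQuestion)) {
    // Failed at least one question: show the instructions again.
    showInstructions();
  }
  startTask();
}
```

Recording how many passes through the instructions each participant needed is also a cheap extra data point about how carefully they read.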
Finally, it is important to monitor and record the rate at which people begin your experiment but do not finish. This is typically not a problem in laboratory studies, since the social pressure of getting up and walking out of the lab is much higher than it is online. However, dropout rates can interact in complex ways with dependent measures such as accuracy (low-performing individuals may be more likely to drop out online). We recommend that, perhaps unlike a typical laboratory study, all Internet experiments report dropout rates as a function of condition.
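Computing dropout per condition is straightforward bookkeeping once you log a record for everyone who starts. A small sketch (the record fields are our own, not from our actual logs):

```javascript
// Hypothetical dropout bookkeeping: tally starts and completions per
// condition, then report the dropout rate for each.
function dropoutRates(records) {
  // `records` is an array of { condition, completed } objects, one per
  // participant who began the experiment.
  const rates = {};
  for (const { condition, completed } of records) {
    const c = rates[condition] || { started: 0, finished: 0 };
    c.started += 1;
    if (completed) c.finished += 1;
    rates[condition] = c;
  }
  for (const c of Object.values(rates)) {
    c.dropoutRate = (c.started - c.finished) / c.started;
  }
  return rates;
}
```

The key is that the denominator is everyone who started, not everyone who finished — which means you must log a record at the moment a participant accepts the HIT, not only at submission.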
The bottom line? AMT certainly seems promising for experimental cognitive science research. Our investigations suggest that the data quality is reasonably high and compares well to laboratory studies. Hopefully, the quality of the data will remain high as additional researchers start to utilize this resource. If we (scientists) respect the participants and contribute to a positive experience on AMT, it could turn into an invaluable tool for accelerating empirical research.
A portion of this work was completed as part of Devin Domingo’s Psychology Honors Thesis at NYU. Be sure to join the discussion below!
Gosling, S.D., Vazire, S., Srivastava, S., and John, O.P. (2004). Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. American Psychologist, 59, 2, 93-104. [pdf]
Lewandowsky, S. (2011). Working memory capacity and categorization: Individual differences and modeling. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 720-738. [pdf]
Love, B.C. (2002). Comparing supervised and unsupervised category learning. Psychonomic Bulletin & Review, 9(4), 829-835.
Love, B.C., Medin, D.L., and Gureckis, T.M. (2004). SUSTAIN: A Network Model of Category Learning. Psychological Review, 111, 309-332.
Mason, W. and Suri, S. (in press). A Guide to Behavioral Experiments on Mechanical Turk. Behavior Research Methods. [pdf]
Mason, W. and Watts, D. (2009). Financial incentives and the “performance of crowds.” HCOMP ’09: Proceedings of the ACM SIGKDD Workshop on Human Computation, 77–85. [pdf]
Nosofsky, R.M., Gluck, M.A., Palmeri, T.J., McKinley, S.C., and Glauthier, P. (1994). Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory & Cognition, 22(3), 352-369. [pdf]
Oppenheimer, D.M., Meyvis, T., and Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867-872. [pdf]
Paolacci, G., Chandler, J., and Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5, 411-419. [pdf]
Shepard, R.N., Hovland, C.I., and Jenkins H.M. (1961). Learning and memorization of classifications. Psychological Monographs: General and Applied, 75(13), 1-41.
- Our experiment code directly modified the current browser window and sent data to the server during rest sessions using AJAX. This meant there was no reloading or network access between individual learning trials. While our timing estimates of the display are approximate (depending on the user’s browser, the speed of their computer, etc…), they are substantially more accurate than if network access were required on every trial. [↩]
- These data were estimated from the figures in Lewandowsky’s paper. [↩]