name: title class: center, middle
# Using .blue[Mechanical Turk] for Cognition Experiments .author[ John McDonnell ([@johnvmcdonnell](http://twitter.com/johnvmcdonnell)) & Todd Gureckis ([@todd_gureckis](http://twitter.com/todd_gureckis)) _[Computation and Cognition Lab](http://gureckislab.org)_ _Department of Psychology_ _New York University_ ]
--- name: overview # Overview 1. What is [Amazon Mechanical Turk](https://www.mturk.com) (AMT)? [Jump to this section](#whatisamt) - Who are the people in the system (demographics, etc...)? - How can it be used for research? - How good is the data? 2. What are the mechanics of using the system? [Jump to this section](#part2) - Step through setting up account - Using built-in tools to create simple form-based experiments 3. Building dynamic web experiments (psiTurk) [Jump to this section](#part3) - Introduce a [simple Python-based web-app](https://github.com/NYUCCL/psiTurk) we developed which could be a starting framework for your experiment - Using the AMT sandbox to develop and test your app - Manage hits using command-line tools (avoid too many visits to Amazon's site) --- name: credits # On the shoulders of giants...
### [Winter Mason](http://www.stevens.edu/provost/directory/faculty_profile.php?faculty_id=1687) (Asst. Prof at Stevens Institute, formerly Yahoo! Research)
A social psychologist who pioneered documenting how behavioral researchers can use Mechanical Turk.

### [Siddharth Suri](http://www.sidsuri.com/About_Sid.html) (Microsoft Research, formerly Yahoo! Research)
A computer scientist running web-based behavioral experiments on how aspects of social network structure influence human behavior.

### They have created some helpful resources - A really helpful and informative [post](http://smallsocialsystems.com/blog/?p=95) on Winter's blog - CogSci2011 [workshop](http://cognitivesciencesociety.org/uploads/2011-t4.pdf): "How to use Mechanical Turk for Cognitive Science Research" - Mason, W. & Suri, S. (2012). [A Guide to Behavioral Experiments on Mechanical Turk](http://www.springerlink.com/content/5236n965288116v8/fulltext.pdf). _Behavior Research Methods_, 44(1), 1-23. --- name: goals class: shortlist, middle # Our three goals 1. Help you decide if it is right to use in your work 2. Share what we've learned in the process of collecting data with the system 3. Describe our system for helping you develop your own experiments --- name: whatisamt class: center, middle # What is ".blue[Mechanical Turk]"? --- .task[Part 1: What is AMT?] # The history
- Developed at [Amazon.com](http://amazon.com) by Peter Cohen
- Originally for in-house use to detect duplicate product postings on Amazon's site .red[*]
- Jeff Bezos calls it "artificial artificial intelligence"
- Name originates from a mechanical chess-playing illusion/automaton built by Wolfgang von Kempelen in the 18th century

.footnote[.red[*] Nice summary in the [New York Times](http://www.nytimes.com/2007/03/25/business/yourmoney/25Stream.html)]

---
.task[Part 1: What is AMT?]
# The concept
"Normally, a human makes a request of a computer, and the computer does the computation of the task, but artificial artificial intelligences like Mechanical Turk invert all that. The computer has a task that is easy for a human but extraordinarily hard for the computer. So instead of calling a computer service to perform the function, it calls a human."
### - Jeff Bezos

---
.task[Part 1: What is AMT?]
# What kinds of tasks?
- Difficult for computers (or useful for training machine learning systems)
    - Provide three key words to describe a randomly selected image
    - Does this photograph contain a car?
    - Type the characters in this image (e.g., captcha)
- Traditional "work"
    - Go to http://mycompany.com and email me some comments on what you think of the design
    - Write a positive review for this product online (pay per paragraph)
    - Translate this text from English to Spanish
- .green[Help with science!]
    - **Participate in my human cognition/perception/learning experiment!**

---
name: terminology
class: shortlist, middle

.task[Part 1: What is AMT?]
# Key terminology
- .red[**HIT**] = Human Intelligence Task (a unit of work, e.g. a trial or an entire sequence of trials in an experiment)
- .orange[**Requester**] = an entity (e.g., researcher) who posts .red[**HIT**]s
- .blue[**Worker**] = a person who performs the task

---
.task[Part 1: What is AMT?]
# Who are the workers?
- 46.80% US, 34% India, 19.20% Other
- United States demographic
    - 55-65% female
    - Most make <$60k/year
    - Median age of 30
    - Tend to be young and hold a bachelor's degree
    - Distribution mostly similar to US internet pop.
- See Ipeirotis, et al. (2010) or Mason and Suri (2012).

---
.task[Part 1: What is AMT?]
# US Distribution
Taken from this [website](http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/geographic-distribution-of-turkers). --- .task[Part 1: What is AMT?] # World Distribution
Tamir (2011) --- .task[Part 1: What is AMT?] # What motivates workers? #.green.center[Money.] .smallquote[ Mitch Fernandez, 38, a disabled former United States Army linguist, said by e-mail that he became a Turk Worker for various reasons: “At first, I was just curious about the idea of crowdsourcing.” But he said he soon found that by working about two hours a day, he could often earn more than $100 a week. In the last nine months he made around $4,000, which he used to buy a high-definition television, a DVD player and a new subwoofer — all from Amazon.com.] .smallquote[ “I do this primarily for the money, but I also view it as a form of therapy to get me used to working again.” he explained. “The experience has gotten me thinking about pursuing a library science degree.” - From [New York Times](http://www.nytimes.com/2007/03/25/business/yourmoney/25Stream.html) ]
.smallquote[ "Is it worthwhile? I was genuinely surprised by the experience. If you have the ability to throw down readable writing very quickly, you can earn minimum wage with the Turk – more than I ever expected. Given the short timeframe and the wide variety of tasks available, it’s something that you can sit down and do in short little bits when it’s convenient for you.” - From [The simple Dollar](http://www.thesimpledollar.com/2009/07/05/can-you-actually-earn-reasonable-money-from-mechanical-turk/) ] --- .task[Part 1: What is AMT?] # Compensation - Median wage is $1.38/hour - Short tasks (~5 mins) award around 10 cents - Requester can reject work and revoke payment - Can also **award bonuses** - Amazon takes 10% of payments - Amazon tries to stay out of disputes --- .task[Part 1: What is AMT?] # Why mess with this? - **Convenient** - and our scripts help you even more - **Fast** - collect LOTS of data quickly - collect large N studies - pilot study materials - norm/pretest stimuli - **Affordable** - Relative to fMRI incredibly cost effective - We paid about $2.00 for 15-25 minute session - **Anonymous** - Subject never meets experimenter, in some ways testing conditions more uniform since no RAs - Subjects unlikely to know about "your lab" and the types of studies you do - **Replicable** - Very easy to share your experiment code to let people replicate your work cheaply --- .task[Part 1: What is AMT?] # Why .red[stay away]? - **Quality assurance** Are people following instructions? Can they read them? Are they "bots" just trying to rip you off? - **Attrition** the cost (social or otherwise) for closing the browser and quitting the experiment is low - **Non-lab setting** - You don’t have control over distractions. - People literally could get up and leave for 1 hour in the middle of your experiment - **Undergrads = mturk population?** Is this really a different population? --- .task[Part 1: What is AMT?] # Ethical Issues - How much should you pay? 
As much as the market supports or should it match what you pay in the lab (roughly)? - Can you get IRB approval for your experiment (yes, you MUST run it by the IRB first) - Some IRBs are more cautious than they need to be, IMHO - Since participants never meet experimenter and are free to close browser and walk away at any moment the IRB issues seem non-existent (should be exempt, but check with your local IRB) - Main concern would be if materials/stimuli are controversial and might reflect poorly on the field, sponsors, or university - Wasn't very difficult to add to an existing IRB at NYU --- class: center, middle # How does the data compare to that collected in the lab? .red[*] .left.footnote[.red[*]Check out our discussion of this at our [Thinking about Thinking](http://gureckislab.org/blog/) blog ] --- name: replicationstudies .task[Part 1: What is AMT?] # Lots of interest in this... - Gosling, S.D., Vazire, S., Srivastava, S., & John, O.P. (2004). [Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires](http://www.yalepeplab.com/teaching/psych231/documents/231_Gosling2004.pdf). _American Psychologist_, 59, 2, 93-104. - Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). [Running experiments on Amazon Mechanical Turk](http://repub.eur.nl/res/pub/31983/jdm10630a.pdf). _Judgment and Decision Making_, 5, 411-419. - Germine, L., Nakayama, K., Duchaine, B.C., Chabris, C.F., Chatterjee, G. & Wilmer, J.B. (2012, in press). [Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments.](http://www.springerlink.com/content/f0244t772070138w/) _Psychonomic Bulletin & Review_. --- .task[Part 1: What is AMT?] # Does it replicate? - We recently attempted to replicate a classic categorization finding using AMT subjects - Shepard, R.N., Hovland, C.I., & Jenkins H.M. (1961). Learning and memorization of classifications. 
_Psychological Monographs: General and Applied_, 75(13), 1-41.
- Was a good learning experience for what works (and what doesn't)
- Part of a larger study replicating a number of classic findings (Stroop, task-switching, Simon, visual cuing, attentional blink, subliminal priming, category learning)
- **Crump, M., McDonnell, J., & Gureckis, T.M. (in prep). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research: A critical look.**

---
.task[Part 1: What is AMT?]
# SHJ (1961)
- Six different ways of classifying the same 8 items (geometric figures with 3 binary dimensions) into deterministic categories
- Generally, Type I is learned faster than Type II, which is learned faster than Types III-V, with Type VI the hardest

.center[
] --- .task[Part 1: What is AMT?] # Try 1 .center[
] --- .task[Part 1: What is AMT?] # Try 1 .center[
] --- class: center, middle # Maybe we weren't paying people enough? --- .task[Part 1: What is AMT?] # Try 2 .center[
] --- .task[Part 1: What is AMT?] # Try 2 .center[
] --- class: center, middle # Maybe people didn't read or understand the instructions? --- .task[Part 1: What is AMT?] # Try 3 .center[
] --- .task[Part 1: What is AMT?] # Try 3 .center[
]

---
.task[Part 1: What is AMT?]
# Try 3
- 100% of participants in the Type I problem reached the learning criterion .red[*]
- 73.1% reached criterion in the Type II problem
- 56.4% reached criterion in the Type IV problem
- 44.8% reached criterion in the Type VI problem

.footnote[.red[*] 2 blocks in a row with 100% correct responses within the 10 training blocks]

---
.task[Part 1: What is AMT?]
# Summary
- Eh, it's OK. Not perfect.
- Hard to tell if problems are with AMT data or with previous work (perhaps selection bias with university students?)
- Important lessons
    - Payment doesn't have much effect beyond signup rate
    - Checking that people read and comprehend the instructions is critical
    - Cleaning up data may help, but it is good to decide on a criterion ahead of time
- Huge studies run in 3-4 days. Does this benefit outweigh the possible disadvantages? Do you trust your lab data more?
- Note, our other studies (e.g., Stroop, etc...) replicate with much higher fidelity. Learning experiments may reflect a sort of special case where caution is particularly advised.
- See also [this slide](#replicationstudies) for references to other studies reporting success

---
.task[Part 1: What is AMT?]
# Can you actually publish this data?
- Currently, few mainstream cognitive psychology papers report findings based purely on AMT data
- A few early examples of web-based experimentation in Cognitive Science (e.g., [Mark Steyvers](http://psiexp.ss.uci.edu/research/papers/causalinference.pdf))
- Used heavily in machine learning (e.g., NIPS), in many Cognitive Science Society conference papers, and in work on collective/group behavior
- We have a few papers about to be submitted based entirely on AMT data, so we'll see how it goes!
- Obvious solution to reviewer objections is to replicate a few conditions or experiments in the lab as well
    - Easy to do since the code runs just as well in the lab as online

---
class: center, middle

# End of Part 1
### .gray[Time for a break!]
--- name: part2 class: center, middle .task[Part 2: The mechanics]
# .blue[Part 2:] What are the mechanics of using the AMT system?
--- class: middle .task[Part 2: The mechanics] # Two perspectives: 1. What is it like to use AMT as a [**Worker**](#terminology)? 2. What is it like to use AMT as a [**Requester**](#terminology)? --- .task[Part 2: The mechanics] # Using AMT as a .blue[**Worker**] 1. Sign up on Amazon's site 2. Search for tasks (i.e., .red[**HIT**]s) to do online through the AMT interface. 3. Worker accepts a .red[**HIT**]. The task appears in a frame within the AMT website. 4. After completing the .red[**HIT**], the worker signals to Amazon that they have completed the task
Some important conceptual notes:
- (as we will see) .orange[**Requester**]s send out a job with a certain number of .red[**HIT**]s
- When a .blue[**Worker**] accepts the .red[**HIT**], it is removed from the .orange[**Requester**]'s account
- If the .blue[**Worker**] doesn't want to do the task, they have to _return_ the .red[**HIT**] to the .orange[**Requester**] so someone else can do it
- As a .orange[**Requester**], you get to set how long a .red[**HIT**] can be "held" like this

---
.task[Part 2: The mechanics]
# What is a .red[**HIT**]?
- In our experiments, a .red[**HIT**] is typically an entire experiment (as might be done in the lab)
- However, it is up to the .orange[**Requester**] to determine what constitutes a .red[**HIT**].
    - Could be one trial in an experiment, and you just offer the .blue[**Worker**] as many .red[**HIT**]s as they care to do
- Will see some examples in a minute...

---
class: center, middle

.task[Part 2: The mechanics]
# Demo: Using AMT as a .blue[**Worker**]
--- .task[Part 2: The mechanics] # More about external questions... - The link to the site. Could be anything! Must link back to Amazon at the end (we'll show you how) - You are responsible for: - Hosting - **At minimum** you need a computer with a web-accessible HTTP server running - A static IP address may simplify things - You probably want to avoid trying to host it over WiFi since NAT/Firewalls can make your site inaccessible to the global Internet - Posting your link to Amazon - Communicating back to Amazon when the .red[**HIT**] has been completed - Approving completed .red[**HIT**]s for payment in a timely manner - Dealing quickly with .blue[**Worker**] issues. Welcome to tech support. - Saving your data in a secure place, etc.. --- .task[Part 2: The mechanics] # Requesting an External Question - **URL** - basic link to your experiment. - **Title**. - **Description**. - **Keywords**. - **Lifetime**. - _...how long will it be available?_ - **Duration**. - _...how long are people allowed to take?_ - **Qualifications**. - _...country? ...consistency?_ _Oddly, there is no web interface for this. You need to use scripts. Fortunately, we've got you covered in [Part 3](psiturk.html)._ --- .task[Part 2: The mechanics] # Hosting an Advertisement - When participants see your HIT in the list, they will see basic metadata:
- Once they click "View a HIT in this group," they see whatever your site is serving up, in our case this is the default:
--- # A Participant Accepts Upon accepting the HIT, they see... whatever you want them to see! In our case, it's this:
---
# A look at the URLs
That last URL would look like this .red[*]:

- http://puncture.psych.nyu.edu:5001/mturk?assignmentId=.red[XX]&hitId=.red[XX]&workerId=.red[XX]

Basic syntax is as follows:

- http://.yellow[yoururl]?.red[**assignmentId**=XX]&.blue[**hitId**=XX]&.green[**workerId**=XX]
    - .yellow[**yoururl**] is wherever your server is ready for requests.
    - .red[**assignmentId**] is the unique ID assigned by Amazon to the event of this particular person performing this particular HIT. ASSIGNMENT_ID_NOT_AVAILABLE means they haven't accepted yet.
    - .blue[**hitId**] is the Amazon ID associated with this HIT.
    - .green[**workerId**] is the Amazon ID associated with each unique worker.

.footnote[.red[*] The .red[XX] are shorthand for what are usually long strings of letters and numbers.]

---
# Finishing up
The last thing served to the participant is a button they can use to inform Amazon that they have finished the HIT. This submits a form with their identifying information to `http://www.mturk.com/mturk/externalSubmit`.
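
The three ID parameters in the request URL described above can be read with Python's standard library. This is a minimal sketch rather than psiTurk's own code: the function names `parse_amt_url` and `has_accepted`, the `example.edu` host, and the short ID strings are all made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Value Amazon sends for assignmentId while a worker is only previewing the HIT.
PREVIEW_ASSIGNMENT_ID = "ASSIGNMENT_ID_NOT_AVAILABLE"


def parse_amt_url(url):
    """Return the assignmentId, hitId, and workerId found in a request URL.

    Any parameter that is missing from the query string comes back as None.
    """
    query = parse_qs(urlparse(url).query)
    return {key: query.get(key, [None])[0]
            for key in ("assignmentId", "hitId", "workerId")}


def has_accepted(params):
    """True once the worker has accepted the HIT (i.e., not a preview)."""
    assignment_id = params["assignmentId"]
    return assignment_id is not None and assignment_id != PREVIEW_ASSIGNMENT_ID


# Hypothetical URL: host, port, and IDs are placeholders, not real values.
example = ("http://example.edu:5001/mturk"
           "?assignmentId=A1&hitId=H1&workerId=W1")
params = parse_amt_url(example)
```

Checking `has_accepted(params)` before starting the experiment is one way to show previewing workers only the advertisement.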
--- # Crediting participants
--- class: center, middle # End of Part 2 ### .gray[Time for a break!] --- name: part3 class: center, middle .task[Part 3: Using psiTurk]
# .blue[Part 3]: Building dynamic web experiments
--- class: center, middle .task[Part 3: Using psiTurk] # psiTurk ### .gray[Our "framework" for doing online experiments] ### [https://github.com/NYUCCL/psiTurk](https://github.com/NYUCCL/psiTurk)
[boto](https://github.com/boto/boto)

---
.task[Part 3: Using psiTurk]
# Software dependencies
**Python**

If you've never used Python before, we recommend you use the [Enthought Python distribution](http://www.enthought.com/repo/.epd_academic_installers), because it comes with easy_install, a tool for easily installing Python packages.

Once you've installed Enthought, you should be able to open a terminal window and install **Flask** (the web server library we'll be using) and **SQLAlchemy** (the database manipulation package we'll be using) with the following command:

    .bash
    easy_install Flask-SQLAlchemy

If you get a permissions error, try:

    .bash
    sudo easy_install Flask-SQLAlchemy

(which will prompt you for your password)

---
.task[Part 3: Using psiTurk]
# Software dependencies
You will also find it useful to install **boto**, a Python library which enables interaction with Amazon's Mechanical Turk servers to help assign credit to participants. To install:

    .bash
    easy_install boto

Finally, you can try getting things up and running by cloning our repository using the following command:

    .bash
    git clone git://github.com/NYUCCL/psiTurk.git

(Further instructions are provided here: [http://github.com/NYUCCL/psiTurk](http://github.com/NYUCCL/psiTurk))

---
.task[Part 3: Using psiTurk]
# Model, View, Controller (MVC)
A common software development pattern. If you are _not_ a scientist and do end-user programming (e.g., Android, iOS, webapps), this is probably how you are designing things.
- **Model** is basically your data.
    - A common situation is for the model to contain the abstract logic of your application
    - What kind of data is being stored, how is it formatted, and is there computation that the app is supposed to do?
- **Controller** is the "glue" which binds the model to the view
- **View** is the UI elements presented to a user of the software
    - Independent of the model
- Idea is object-oriented encapsulation... the graphics people can design the interface while the engineers build the model

---
.task[Part 3: Using psiTurk]
# Scary? No.
Basically, we are just giving you a pre-structured website which you can fill in with the info for your study. In addition, our code handles some of the logic an AMT experiment might need, like counterbalancing, etc...
## It's not scary, it's going to be fun!
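
To give a flavor of the counterbalancing logic mentioned above, here is a minimal sketch of one common approach: assign each new participant to whichever condition has the fewest participants so far. This is an illustration under our own assumptions, not psiTurk's actual implementation; `choose_condition` and its arguments are hypothetical names.

```python
import random
from collections import Counter


def choose_condition(previous_conditions, num_conditions, rng=random):
    """Pick the condition with the fewest participants so far.

    Ties are broken at random so no condition is systematically favored.
    """
    counts = Counter(previous_conditions)
    fewest = min(counts[c] for c in range(num_conditions))
    candidates = [c for c in range(num_conditions) if counts[c] == fewest]
    return rng.choice(candidates)


# Hypothetical history: conditions 0 and 1 have two participants each,
# condition 2 has only one, so condition 2 is chosen next.
next_condition = choose_condition([0, 1, 0, 1, 2], num_conditions=3)
```

In a real experiment the history would come from the database, and participants who quit early would probably need to be excluded from the counts.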
--- .task[Part 3: Using psiTurk] # File listing - **models.py** defines the **Model** - **app.py** is the **Controller** - The various HTML files inside the templates/, static/ folders are the **View**
---
.task[Part 3: Using psiTurk]
# Model (i.e., database) structure
Defined in **models.py**:

    .python
    class Participant(Base):
        """
        Object representation of a participant in the database.
        """
        __tablename__ = TABLENAME

        assignmentid = Column(String(128), primary_key=True)
        hitid = Column(String(128))
        workerid = Column(String(128))
        ipaddress = Column(String(128))
        cond = Column(Integer)
        counterbalance = Column(Integer)
        codeversion = Column(String(128))
        beginhit = Column(DateTime, nullable=True)
        beginexp = Column(DateTime, nullable=True)
        endhit = Column(DateTime, nullable=True)
        status = Column(Integer, default=1)
        debriefed = Column(Boolean)
        datastring = Column(Text, nullable=True)

The table has fields for tracking the assignmentId, hitId, workerId, IP address of the worker, condition, the data file, etc...

---
.task[Part 3: Using psiTurk]
# Status codes
Possible status codes are defined at the top of the file

    .python
    # Status codes
    ALLOCATED = 1
    STARTED = 2
    COMPLETED = 3
    DEBRIEFED = 4
    CREDITED = 5
    QUITEARLY = 6

---
.task[Part 3: Using psiTurk]
# Model (i.e., database) structure
The model is written/updated by the controller in response to commands given by the user (or their browser)
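
As a rough illustration of the controller updating the model, the sketch below moves one participant through the status codes. It is not psiTurk code: it uses the standard library's `sqlite3` as a stand-in for SQLAlchemy, keeps only a few of the `Participant` columns, and the helpers `make_db`, `set_status`, and `get_status` are hypothetical names.

```python
import sqlite3

# Status codes as listed on the "Status codes" slide.
ALLOCATED, STARTED, COMPLETED = 1, 2, 3


def make_db():
    """Create an in-memory table with a few of the Participant columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE participant ("
        " assignmentid TEXT PRIMARY KEY,"
        " workerid TEXT,"
        " status INTEGER DEFAULT 1,"
        " datastring TEXT)"
    )
    return conn


def set_status(conn, assignment_id, status, datastring=None):
    """Update one participant's row, as a route in the controller would."""
    cursor = conn.execute(
        "UPDATE participant"
        " SET status = ?, datastring = COALESCE(?, datastring)"
        " WHERE assignmentid = ?",
        (status, datastring, assignment_id),
    )
    conn.commit()
    if cursor.rowcount != 1:
        # No row matched: the assignment was never recorded.
        raise KeyError(assignment_id)


def get_status(conn, assignment_id):
    """Return the stored status code for one assignment."""
    row = conn.execute(
        "SELECT status FROM participant WHERE assignmentid = ?",
        (assignment_id,),
    ).fetchone()
    return row[0]


conn = make_db()
conn.execute(
    "INSERT INTO participant (assignmentid, workerid) VALUES (?, ?)",
    ("A1", "W1"),
)
set_status(conn, "A1", STARTED)
set_status(conn, "A1", COMPLETED, datastring="trial data")
final_status = get_status(conn, "A1")
```

Each call to `set_status` plays the role of one browser request reaching the server; the database is the only place where the participant's progress is remembered between requests.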
---
.task[Part 3: Using psiTurk]
# Controller: The most important idea
- **app.py** runs a web server which listens for requests (i.e., attempts to access a particular URL)
- These requests are handled using .red['routes'], which are simple functions in **app.py** that run in response to a URL request.
- Normal web server:
    - You request http://gureckislab.org/coolpage.html
    - The server sends the content of coolpage.html in a format your browser can read and display
- **app.py** server:
    - You request http://gureckislab.org/mturk
    - The server runs a Python function associated with the 'mturk' .red['route']
    - This function may or may not also send the user's browser some HTML that it can read and display
    - However, since it is a Python function, it can do lots of other stuff, like add data to a database, etc...

---
.task[Part 3: Using psiTurk]
# An example
If you request .orange[http://myip.edu:PORT/consent], this function will run

    .python
    @app.route('/consent', methods=['GET'])
    def give_consent():
        """
        Serves up the consent in the popup window.
        """
        # All three Amazon IDs must be present in the query string
        if not ('hitId' in request.args and
                'assignmentId' in request.args and
                'workerId' in request.args):
            raise ExperimentError('hit_assign_worker_id_not_set_in_consent')
        hitId = request.args['hitId']
        assignmentId = request.args['assignmentId']
        workerId = request.args['workerId']
        print(hitId, assignmentId, workerId)
        return render_template('consent.html', hitid=hitId,
                               assignmentid=assignmentId, workerid=workerId)

which checks that the hitId, assignmentId, and workerId are all set and then renders the 'consent.html' template (last line) into the user's browser window.

**So a layer of computation/logic is inserted into any web request!!**

---
.task[Part 3: Using psiTurk]
# A more complex example
If you request .orange[http://myip.edu:PORT/debrief], this function will run

    .python
    @app.route('/debrief', methods=['POST', 'GET'])
    def savedata():
        """
        User has finished the experiment and is posting their data in the
        form of a (long) string. They will receive a debriefing back.
        """
        print(list(request.form.keys()))
        if not ('assignmentid' in request.form and 'data' in request.form):
            raise ExperimentError('improper_inputs')
        assignmentId = request.form['assignmentid']
        datastring = request.form['data']
        print(assignmentId, datastring)
        # Look up the single participant row for this assignment
        user = Participant.query.\
                filter(Participant.assignmentid == assignmentId).\
                one()
        user.status = COMPLETED
        user.datastring = datastring
        user.endhit = datetime.datetime.now()
        db_session.add(user)
        db_session.commit()
        return render_template('debriefing.html', assignmentId=assignmentId)

which updates the database so that we know the user finished the task, and then renders the debriefing form in the user's browser window.

---
.task[Part 3: Using psiTurk]
# An example which doesn't display anything in the browser
If you request .orange[http://myip.edu:PORT/inexp], this function will run

    .python
    @app.route('/inexp', methods=['POST'])
    def enterexp():
        """
        AJAX listener that listens for a signal from the user's script when
        they leave the instructions and enter the real experiment. After the
        server receives this signal, it will no longer allow them to
        re-access the experiment applet (meaning they can't do part of the
        experiment and refresh to start over).
        """
        print("/inexp")
        if 'assignmentid' not in request.form:
            raise ExperimentError('improper_inputs')
        assignmentId = request.form['assignmentid']
        user = Participant.query.\
                filter(Participant.assignmentid == assignmentId).\
                one()
        user.status = STARTED
        user.beginexp = datetime.datetime.now()
        db_session.add(user)
        db_session.commit()
        return "Success"

which updates the model (i.e., database) so that user.status is 'STARTED' and the start time is recorded.

---
.task[Part 3: Using psiTurk]
# Persistence
The .blue[**Worker**]'s browser is simply making web requests at various times as they go through the experiment. These call up particular '.red[routes]' in the **app.py** code.

What about multiple people doing the experiment at once? How do we know who they are?

Remember that on the first request, Amazon sends three variables to your route:

- http://.yellow[yoururl]?.red[**assignmentId**=XX]&.blue[**hitId**=XX]&.green[**workerId**=XX]

We just continue to pass these variables to all new routes we add. These keys let us look up the correct user in the database (based on the workerId and/or assignmentId) and then update the right person.

---
.task[Part 3: Using psiTurk]
# Templates
- Rather than just plain HTML, the system uses HTML templates which help you programmatically generate HTML
- **render_template()** function

        return render_template('debriefing.html', assignmentId=assignmentId)

- Makes a variable called 'assignmentId' available inside the template.