Data Analysis in Python, the Literate Way

Our lab uses Mathematica quite a bit for data analysis and building models. There aren’t that many other people in psychology at NYU (or elsewhere) who use Mathematica. Part of the reason is the large number of libraries that exist for Matlab that specifically help with fMRI analysis or experiment design. I guess Mathematica works particularly well for the specific kind of work we do. Two of its key advantages are the interactive notebook and the high-quality, flexible graphics system.


An example of an in-progress Mathematica analysis!

For those who haven’t used it, the Mathematica notebook allows you to combine text, code, and figures/plots in a single multimedia document. This is very helpful for building up and testing complex bits of code and for exploratory data analysis. Rather than cutting and pasting bits of code from a text editor into an interactive interpreter (as in Matlab, Python, or R), the notebook allows code and graphics to coexist in line with one another. It turns out it is very helpful to keep the graphics from an analysis tied to the code that generated them (along with plain text describing them). For example, if I write some code for a data analysis, I can have the plot of the data appear directly below the code itself in the notebook. Later, when I’m going back through the analysis (perhaps weeks or even months later), I can more easily make sense of what goes with what. It’s basically just a more logical way to relate the _outputs_ of a computation to the code that generated them (i.e., “literate programming”).

Anyway, as great as Mathematica is, there are a couple of important drawbacks.

First is that Mathematica isn’t free (as in beer or as in speech). This isn’t such a big deal for us (NYU has a university-wide academic license which keeps the cost lower). However, it is hard to involve undergraduates in the research process since it might be too costly to buy licenses for all of them. I’m also reluctant to ask students in a class I’m teaching to shell out for a license (even though Wolfram Research has made this easier lately via their per-semester licensing deals). In addition, I worry that my psych undergrad or grad students are less likely to encounter Mathematica again (whereas they are somewhat more likely to run into R, SPSS, Matlab, or Python).

A second disadvantage of Mathematica is that it isn’t always as fast as something like Matlab or Python. I’m not entirely sure why that is (some matrix computations are optimized), but it likely has to do with the very powerful symbolic computation tools that Mathematica provides. In many cases, these tools make programming much easier. However, once a complex model or simulation is set up, we often find it is more effective to translate it into Python, which runs fast enough for most of our work.

Finally, the Mathematica programming language is pretty old. It doesn’t have very clean object-oriented design patterns and the syntax can be a bit obtuse. For example, compare these two statements which do the same thing, one in Python and one in Mathematica:

Python
def myfunction():
    # Loop over 1 through 10, matching the Mathematica loop below
    for i in range(1, 11):
        if i > 5:
            print("This number is greater than 5 ", i)
        else:
            print("This number is less than 5 ", i)

A code snippet in Python

myfunction[] := Module[
  {i},
  For[i=1, i<=10, i++,
    If[i > 5,
      Print["This number is greater than 5 ", i],
      Print["This number is less than 5 ", i];
    ];
  ];
];

A code snippet in Mathematica
The differences may appear small, but all the square brackets in Mathematica can really make the code hard to understand (especially in more complex programs). Python’s indentation format makes the code look nice and forces every user of the language to adopt an “organized” program listing.

The bottom line is that Python’s language is cleaner, it is more contemporary, it runs faster, the number of available libraries is immense (at least equal to, if not exceeding, the functionality in Mathematica), and it is free/open source. We use Python internally for all our experiments (check out our simple API for developing psychology experiments, PyPsyExp).

Given all this, wouldn’t it be great if Python had a notebook interface?

Well, recently, the possibility of leveraging some of the benefits of Mathematica’s “computational notebook” framework in Python has emerged (thanks to Jay Martin for telling me about this!). In particular, iPython (an “enhanced” Python shell) has added a web-based notebook framework. I’ve been playing with the bleeding-edge version on GitHub lately and I’m impressed (thus, this blog post!).

The basic idea of the system is that you launch a small webserver running on your computer (using the command ipython notebook --pylab inline). Then, you point your favorite browser (I’ve found things work very well in Chrome) at the local URL it prints out (e.g., http://127.0.0.1:8888). From there, the web application serves up an interactive notebook instantiated as a web page. It might not seem like a web-based interface would be really useful, but advances in AJAX have enabled complex, dynamic applications that run entirely in your browser (think Facebook or Google Docs).

The current notebook format feels quite a bit like Mathematica’s notebook interface. There is the concept of a “cell”, which links a bit of executable code to the resulting output. Cells can also hold plain text, Markdown, LaTeX, or other types of text. In addition, there is a system for showing graphics from pylab/matplotlib, perhaps the most ubiquitous data plotting library for Python.

Two example “cells” showing a computational output and a matplotlib graph.
Overall the system is already pretty polished. For example, it does syntax highlighting (very useful). The web app can also do tab-completion on variable names (you start typing a variable and it will attempt to finish what you need to type). In addition, since it runs as a web app, you can share your notebooks “live” with other people. I haven’t tried this yet, but in theory two people could remotely work on the same analysis file. There are keyboard shortcuts for most useful actions. Finally, you can export the notebooks both in a structured “.ipynb” format (for sharing) and as a regular .py file, so you can execute the code with or without the notebook interface.
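To make the two export formats a little more concrete: an “.ipynb” file is a JSON document, so pulling the code out of one takes only a few lines. The sketch below is not the notebook's own export code. It assumes the "cells" / "cell_type" / "source" layout used by version 4 of the notebook format (files written by older versions are laid out differently), and both the function name code_cells_to_script and the reaction-time data are made up for illustration.

```python
import json

# A tiny notebook written as a Python dict. The layout ("cells", "cell_type",
# "source") is an assumption based on notebook format version 4.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Reaction times\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["rts = [512, 430, 610]\n",
                    "mean_rt = sum(rts) / len(rts)\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["print(mean_rt)\n"]},
    ],
}


def code_cells_to_script(nb):
    """Join the source of every code cell into one plain .py script."""
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip text cells; only code belongs in the script
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Round-trip through JSON text, as if the notebook had been read from disk.
loaded = json.loads(json.dumps(notebook))
script = code_cells_to_script(loaded)
print(script)
```

The resulting string can be written to a .py file and run outside the notebook; the text cell is dropped and each code cell becomes one block of the script.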

However, at the current stage, Mathematica’s notebook format is still much more refined. For example, in iPython notebooks you can’t change the color of cells, can’t collapse/hide chunks of cells/code at a time, and can’t execute multiple cells (or groups of cells) at once. In addition, since Mathematica has a much more robust graphics system, it is easier to export the resulting graphics files for “clean up” in Illustrator. Since iPython is a web-based app, all graphics are converted into something like .png files for display, which are harder to subsequently edit (although, of course, you can use matplotlib to write to a file on your local disk). Despite these limitations, development of the iPython Notebook seems active (at least judging by the discussion on GitHub), and I’m sure many of these things will be addressed as time goes on.
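As a minimal sketch of that workaround, matplotlib's savefig picks the output format from the file extension, so the same figure can be written as SVG or PDF (vector formats that should open in Illustrator) instead of relying on the inline .png. The data, axis labels, and file names below are invented for illustration, and the Agg backend is selected only so the example runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen before importing pyplot
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
blocks = [1, 2, 3, 4, 5]
accuracy = [0.52, 0.61, 0.70, 0.78, 0.83]

fig, ax = plt.subplots()
ax.plot(blocks, accuracy, marker="o")
ax.set_xlabel("Block")
ax.set_ylabel("Proportion correct")

# Write vector versions of the figure; the extension selects the format.
out_dir = tempfile.mkdtemp()
svg_path = os.path.join(out_dir, "learning_curve.svg")
pdf_path = os.path.join(out_dir, "learning_curve.pdf")
fig.savefig(svg_path)
fig.savefig(pdf_path)
plt.close(fig)

print("Wrote", svg_path, "and", pdf_path)
```

In a real analysis you would save next to the notebook rather than in a temporary directory; the temporary directory just keeps the sketch from leaving files behind in your working folder.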

A comparison of a Mathematica Notebook and an iPython Notebook doing (roughly) the same analysis!
Anyway, I think this is a pretty promising direction. We’re probably going to be using this setup more frequently in the lab. In addition, I’m considering using the iPython notebook in the computational modeling course I’m currently teaching. My thinking is that this simple, interactive editor may be a better way to get non-programmers and Python novices into data analysis and computational modeling.

p.s. This is a great link for getting it set up on Mac OS X Lion: http://minrk.posterous.com/install-ipython-qtconsolenotebook-on-osx-lion.

UPDATE: See also this page which I will be updating for my course with install instructions for various operating systems. Also, Fernando Perez (original author of iPython) shared this link about the history of the project and this link about scientific python.

  1. There’s definitely a lot of development work on the notebook at the moment. Brian Granger is posting some updates on Google+: https://plus.google.com/110706953761515533762/posts

    If you want to help steer the direction of development, there’s a standing issue with feature requests (https://github.com/ipython/ipython/issues/977), as well as issues for more concrete proposals, and discussions on the mailing list (CCed here). Some of your suggestions (like code folding) are already planned, while others are new, at least to me. And if you or anyone in your lab are interested in working on the code, new contributors are always welcome.

    Thomas Kluyver
  2. Sage is an open source, notebook-like platform which works with and is built on Python (at least mostly).

    http://www.sagemath.org/

    Davorak
  3. Interesting post. In the case of R, you can use Sweave ( http://www.statistik.lmu.de/~leisch/Sweave/ ) to approximate something similar to a notebook; or see the Reproducible Research Task View on CRAN ( http://cran.r-project.org/web/views/ReproducibleResearch.html ).

    Here’s an example I put together with pdf, code, and explanation:

    https://github.com/jeromyanglim/Sweave_Item_Analysis/blob/master/.backup/Item_Analysis_Report.pdf

    https://github.com/jeromyanglim/Sweave_Item_Analysis

    http://jeromyanglim.blogspot.com.au/2010/11/sweave-tutorial-3-console-input-and.html

    Jeromy Anglim
  4. If you are interested in literate programming in Python, I could also recommend trying pyreport (http://pypi.python.org/pypi/pyreport). It is a low-level application that runs a .py script and combines the code with the results it produces in a pretty PDF/HTML document. Essentially, it is an analog of Sweave in the Python ecosystem.

    Anton Goloborodko
  5. quote: “the Mathematica programming language is pretty old. It doesn’t have very clean object-oriented design patterns and the syntax can be a bit obtuse.”

    Your example after these lines compares two equivalent programs based on ‘procedural programming’, while Mathematica primarily provides a functional and pattern-based language. The above example can be written compactly if one knows the functional constructs.

    Lisp and C have been around for years, and so has Mathematica.

    talegari
  6. (Sorry for the sloppy initial formatting.)

    I want to defend Mathematica syntax.

    > Python’s indentation format
    > makes the code look nice
    > and forces every user of the language
    > to adopt a “organized” program listing.

    What’s the good in _forcing_ users (to do anything)?

    Python’s indentation ideology only forces me to type spaces like a monkey. :-( Code display is not a human job, and Mathematica interface does indent code, and it’s rather good at it. It also has a cleaner syntax if you do it the proper way:

    Map[
     StringJoin[
       "This number is ",
       If[# > 5,
        "greater",
        "less"],
       " than 5: ",
       ToString[#]] &,
     Range[10]]

    which could actually be expressed as a one-liner:

    ("This number is " <> If[# > 5, "greater", "less"] <> " than 5: " <> ToString[#]) & /@ Range[10]

    that is clean and comprehensible without any indentation at all.

    Akater
  7. Before ruling out R, take a look at RStudio and the R Markdown functionality (similar to Mathematica’s notebook):
    http://www.rstudio.com/ide/docs/authoring/using_markdown

    Python is great for some tasks, but once the size of the program reaches a certain complexity, I prefer a language like Java or C# (this is definitely a matter of preference and circumstances). Regarding Akater’s comment about being forced to type “monkey spaces” in Python, I would note that the IDE you select makes all the difference. With the right auto-formatting and keyboard shortcuts, the annoyances of indenting tend to disappear.

    In the end, the more tools you’re familiar with, the more efficient you become at selecting the right tool for the task. It’s definitely worth evaluating several tools instead of just sticking with what you already know when starting a new project.

    paulj