Reproducible bioinformatics pipelines with docker

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on the Tajima’s D index from the data in 1000 Genomes. The journal may ask us to show how to reproduce the analysis: which files did we used as input? Which tool did we use to calculate the Tajima’s D?

In this case, a docker file may be like the following:

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will  clone the latest version of the pipeline in the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000Genomes ftp. The third part runs the pipeline, by executing a snakemake rule.

You can run this docker container by running the following:

This will take quite a while to run, and will build a docker virtual image in your system. Afterwards, you can run the following:

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline, and eventually, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

Just a few more links on docker and bioinformatics:


  1. One thing to note, Docker is fundamentally different from Virtualization. You can create a single docker image designed to run a single command, and take arguments, mount volumes, etc. so that the resulting docker container application can be applied to your data. See my
    docker_bwa_aligner examples for more insight.

Leave a Reply