I recently came across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll refer you to the article for the details, but basically Docker is an easy way to define a self-contained environment, similar to a lightweight virtual machine, which makes it straightforward for other people to reproduce the results of an analysis with little effort on our side.
For example, let’s imagine that we are about to submit a paper whose main results are based on Tajima’s D calculated from the 1000 Genomes data. The journal may ask us to show how to reproduce the analysis: which files did we use as input? Which tool did we use to calculate Tajima’s D?
In this case, a Dockerfile might look like the following:
FROM ubuntu
MAINTAINER Giovanni DallOlio <email@example.com>
# Install all the software needed to run the pipeline
RUN apt-get -qq update
RUN apt-get install -y wget git tabix vcftools
RUN apt-get install -qqy python3-setuptools python3-docutils python3-flask
RUN easy_install3 snakemake
# clone the most recent version of the pipeline
RUN git clone https://github.com/dalloliogm/pipeline_play.git
# download the input files
RUN tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:23862589-25055718 > myregion.vcf
# Execute the pipeline
The first part of this Dockerfile sets up an Ubuntu virtual machine and installs all the software needed to execute the pipeline: tabix, vcftools and snakemake. The second part clones the latest version of the pipeline into the virtual machine, and then uses tabix to download a portion of chromosome 22 from the 1000 Genomes FTP site. The third part runs the pipeline by executing a snakemake rule.
You can build an image from this Dockerfile by running the following:
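The snakemake rule that the pipeline runs is not shown here. As a rough sketch only, a Tajima’s D calculation on the downloaded region could be done with vcftools along these lines. The 10000 bp window size, the output prefix and the `tajima_d` helper name are assumptions made for illustration, not taken from the pipeline.

```shell
# Hypothetical helper: compute Tajima's D in fixed-size windows with vcftools.
# Usage: tajima_d <input.vcf> <window_size_bp> <output_prefix>
tajima_d() {
    # --TajimaD takes the window (bin) size in base pairs;
    # vcftools writes the result to <output_prefix>.Tajima.D
    vcftools --vcf "$1" --TajimaD "$2" --out "$3"
}

# Example call with assumed values, to be run inside the container:
# tajima_d myregion.vcf 10000 myregion
```

With the example values, the statistics would end up in a file called myregion.Tajima.D, one row per window.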
docker build -t bioinfoblog_test https://raw.githubusercontent.com/dalloliogm/docker_play/master/1000genomes/Dockerfile
This will take quite a while to run, and will build a Docker image on your system. Afterwards, you can run the following:
docker run -i -t bioinfoblog_test /bin/bash
This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline and, if this pipeline were more complex than a mock example, run other rules and commands.
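The same image can also be used without opening a shell. As a minimal sketch, the `run_in_container` wrapper below (a name invented for this example) runs a single command in a fresh container and removes the container afterwards.

```shell
IMAGE=bioinfoblog_test

# Hypothetical wrapper: run one command in a throwaway container.
run_in_container() {
    # --rm deletes the container once the command exits
    docker run --rm "$IMAGE" "$@"
}

# Example (assumes myregion.vcf sits in the image's default working directory):
# run_in_container head -n 5 myregion.vcf
```

This is handy for quick checks, since nothing is left behind on the host once the command has finished.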
This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.
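On a second workstation, for instance, recreating the environment could come down to cloning the repository and building the image again. This sketch assumes the repository layout implied by the Dockerfile URL above (a 1000genomes/Dockerfile in dalloliogm/docker_play); the `build_local` function name is made up for the example.

```shell
# Hypothetical helper: clone the repository and build the image from a local copy.
build_local() {
    git clone https://github.com/dalloliogm/docker_play.git || return 1
    # The build context is the directory that contains the Dockerfile
    docker build -t bioinfoblog_test docker_play/1000genomes || return 1
    # List the image to confirm that it was created
    docker images bioinfoblog_test
}

# Example: build_local
```

Building from a local clone also gives you the chance to read the Dockerfile before running it.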
Just a few more links on docker and bioinformatics: