I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.
For example, let’s imagine that we are just about to submit a paper, and that our main results are based on Tajima’s D computed from 1000 Genomes data. The journal may ask us to show how to reproduce the analysis: which files did we use as input? Which tool did we use to calculate Tajima’s D?
In this case, a Dockerfile might look like the following:
FROM ubuntu
MAINTAINER Giovanni DallOlio <dalloliogm@gmail.com>

# Install all the software needed to run the pipeline
RUN apt-get -qq update
RUN apt-get install -y wget git tabix vcftools
RUN apt-get install -qqy python3-setuptools python3-docutils python3-flask
RUN easy_install3 snakemake

# clone the most recent version of the pipeline
WORKDIR /home/user/
RUN git clone https://github.com/dalloliogm/pipeline_play.git

# download the input files
WORKDIR /home/user/pipeline_play
RUN tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:23862589-25055718 > myregion.vcf

# Execute the pipeline
RUN snakemake
The first part of this Dockerfile sets up an Ubuntu virtual machine and installs all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part clones the latest version of the pipeline into the virtual machine, and then uses tabix to download a portion of chromosome 22 from the 1000 Genomes FTP site. The third part runs the pipeline by executing a snakemake rule.
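For reference, the two analysis steps that the Dockerfile automates could also be run by hand. Here is a sketch of what that might look like — the 10 kb window size and the `myregion` output prefix are arbitrary choices for illustration, and the actual rule in the pipeline repository may differ:

```shell
# Download the region of interest from the 1000 Genomes FTP
# (same tabix command as in the Dockerfile)
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:23862589-25055718 > myregion.vcf

# Compute Tajima's D in 10 kb windows; results are written to myregion.Tajima.D
vcftools --vcf myregion.vcf --TajimaD 10000 --out myregion
```

The point of wrapping these commands in a Dockerfile is precisely that nobody has to retype them: the environment and the commands travel together.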
You can build this Docker image by running the following:
docker build -t bioinfoblog_test https://raw.githubusercontent.com/dalloliogm/docker_play/master/1000genomes/Dockerfile
This will take quite a while to run, and will build a Docker image on your system. Afterwards, you can run the following:
docker run -i -t bioinfoblog_test /bin/bash
This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline and, if this pipeline were more complex than a mock example, run other rules and commands.
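Once inside the container, a session might look something like this — the rule name in the last command is purely hypothetical, since the mock pipeline only has a default rule:

```shell
# We land in the pipeline directory set by the last WORKDIR instruction
cd /home/user/pipeline_play

# Inspect the files produced by the pipeline
ls -l

# Dry run: list the rules snakemake would (re-)execute, without running them
snakemake -n

# Run a specific rule by name (hypothetical example)
snakemake some_other_rule
```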
This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.
Just a few more links on docker and bioinformatics:
- List of dockerized apps on biostar
- The BioDocker project
I work in data analytics and often work on bio pipelines for NGS data.
I use docker but never saw any of the resources you pointed out, thanks for sharing.
My 2cents:
how about sharing your image on the dockerhub?
https://registry.hub.docker.com/search?q=library
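Sharing the built image would boil down to a few commands — here `yourusername` is a placeholder for an actual Docker Hub account:

```shell
# Authenticate against the Docker Hub registry
docker login

# Tag the local image under your namespace, then push it
docker tag bioinfoblog_test yourusername/bioinfoblog_test
docker push yourusername/bioinfoblog_test
```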
Also ‘rocker’ might be an interesting resource for people reading this post:
http://dirk.eddelbuettel.com/blog/2014/10/23/
One thing to note: Docker is fundamentally different from virtualization. You can create a single Docker image designed to run a single command, taking arguments, mounting volumes, etc., so that the resulting Docker container can be applied to your data like an application. See my docker_bwa_aligner examples for more insight.