Reproducible bioinformatics pipelines with docker

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on the Tajima’s D index from the data in 1000 Genomes. The journal may ask us to show how to reproduce the analysis: which files did we used as input? Which tool did we use to calculate the Tajima’s D?

In this case, a docker file may be like the following:

FROM ubuntu
MAINTAINER Giovanni DallOlio <dalloliogm@gmail.com>

# Install all the software needed to run the pipeline
RUN apt-get -qq update
RUN apt-get install -y wget git tabix vcftools
RUN apt-get install -qqy python3-setuptools python3-docutils python3-flask
RUN easy_install3 snakemake


# clone the most recent version of the pipeline
WORKDIR /home/user/
RUN git clone https://github.com/dalloliogm/pipeline_play.git

# download the input files
WORKDIR /home/user/pipeline_play
RUN tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:23862589-25055718 > myregion.vcf


# Execute the pipeline
RUN snakemake

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will  clone the latest version of the pipeline in the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000Genomes ftp. The third part runs the pipeline, by executing a snakemake rule.

You can run this docker container by running the following:

docker build -t bioinfoblog_test https://raw.githubusercontent.com/dalloliogm/docker_play/master/1000genomes/Dockerfile

This will take quite a while to run, and will build a docker virtual image in your system. Afterwards, you can run the following:

docker run -i -t bioinfoblog_test /bin/bash

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline, and eventually, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

Just a few more links on docker and bioinformatics:

Posted in methodology, news | Tagged , | 3 Comments

the most useful R function of the week: unnest from tidyr

There are many great functions in CRAN and BioConductor, and certainly saying that unnest from the tidyr package is the best is a big exaggeration. However this function solved a big problem in data formatting that made me waste a lot of time in the past, that I was surprised no one had implemented a function for it yet.

Imagine we have a dataframe like the following:

> mygenes
Entrez  symbols
7841    MOGS,CDG2B,CWH41,DER7,GCS1
4248    MGAT3,GNT-III,GNT3
5728    PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1

The first column contains the Entrez of each gene. This columns is fine, as it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated to the same Entrez IDs. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.

The unnest function from tidyr allows to convert this data frame in a “tidier” form, containing one row for each combination gene symbol and alias:

> library(tidyr)
> library(dplyr)
> mygenes %>% 
    mutate(symbols=strsplit(as.character(symbols), ",")) %>% 
    unnest(symbols)

   Entrez symbols
1    7841    MOGS
2    7841   CDG2B
3    7841   CWH41
4    7841    DER7
5    7841    GCS1
6    4248   MGAT3
7    4248 GNT-III
8    4248    GNT3
9    5728    PTEN
10   5728     BZS
11   5728    CWS1
12   5728     DEC
13   5728    GLM2
14   5728    MHAM
15   5728  MMAC11
16   5728    TEP1

This code makes use of the %>% and some functions from the dplyr package, but it is still R!

Having the dataframe in this long form makes it a lot easier to deal with it. For example, let’s imagine that somebody asks us to get the Entrez IDs for the list of gene symbols DER7 and DEC. We would just have to do a simple subset on the dataframe:

> unn %>% 
    mutate(symbols=strsplit(as.character(symbols), ",")) %>% 
    unnest(symbols) %>% 
    subset(symbols %in% c("DER7", "DEC"))

   Entrez symbols
4    7841    DER7
12   5728     DEC

This is just a silly example, which may have been solved with some application of apply and grepl, but in the real world there are a lot of more complex applications for it. For example, here is some code I used to split Blat output into one line per exon (or blat alignment block):

Continue reading

Posted in news | 4 Comments

a formula interface for GeneOntology analysis

clusterProfiler is a nice R library for doing GO and KEGG enrichment analysis. It has a simple interface and it can produce some clear plots using the ggplot engine. Today I contributed a formula interface to clusterProfiler, making it easier to do enrichment of multiple groups of genes.

Let’s imagine you have a dataframe in which one column contains a list of Entrez Ids, while the other columns encode some grouping variables:

> mygenes 
Entrez    group othergroup
1         A     good
100       A     bad
1000      A     good
100101467 B     bad
100127206 B     good
100128071 B     good

The new formula interface allows to do a GO analysis on each of the groups. For example, we can group them  by the column “group”, and compare the classification of the two groups:

> install_github(c("GuangchuangYu/DOSE", "Guangchuang/clusterProfiler"))
> library(clusterProfiler)
> mygenes <- data.frame(Entrez=c('1', '100', '1000', '100101467',
                            '100127206', '100128071'),
                  group = c('A', 'A', 'A', 'B', 'B', 'B'),
                  othergroup = c('good', 'bad', 'good', 'bad', 'good', 'good'))
> GO.enrichment <- compareCluster(Entrez~group, data=mygenes, fun='groupGO')
> print(summary(GO.enrichment))
  Cluster         ID          Description Count GeneRatio     geneID
1       A GO:0016020             membrane     2       2/3   100/1000
2       A GO:0005576 extracellular region     3       3/3 1/100/1000
3       A GO:0005581      collagen trimer     0       0/3           
4       B GO:0016020             membrane     1       1/3  100127206
5       B GO:0005576 extracellular region     0       0/3           
6       B GO:0005581      collagen trimer     0       0/3

> plot(GO.enrichment, plotAll=T)  clusterProfiler example

In this case group A is enriched in membrane and extracellular region, while group B is only enriched in membrane genes. The groupGO function used here doesn’t provide p-values – we should have used enrichGO instead. I guess that the 3 Entrez ids in group A correspond to 5 genes in Gene Ontology, so that’s why the plots shows a total of 5 in group A. See clusterProfiler’s documentation for better examples.

The formula interface allows also multiple grouping. For example:

GO.enrichment <- compareCluster(Entrez~group+othergroup, data=mygenes, fun='groupGO')
  Cluster         ID          Description Count GeneRatio    geneID
1   A.bad GO:0016020             membrane     1       1/2       100
2   A.bad GO:0005576 extracellular region     1       1/2       100
3  A.good GO:0016020             membrane     1       1/3      1000
4  A.good GO:0005576 extracellular region     2       2/3    1/1000
5   B.bad GO:0016020             membrane     0       0/2          
6   B.bad GO:0005576 extracellular region     0       0/2          
7  B.good GO:0016020             membrane     1       1/2 100127206
8  B.good GO:0005576 extracellular region     0       0/2          


Of course this example is not much interesting, since it is only 6 randomly chosen genes. However with bigger datasets the formula interface can be much more powerful.

Now that I think of it, it would be better if compareCluster would return a dataframe with multiple columns, instead of merging them into a single column called Cluster. This would be make it possible to plot the results using facets or something more fancy. It would be something like this:

  group othergroup         ID          Description Count GeneRatio    geneID
1     A        bad GO:0016020             membrane     1       1/2       100
2     A        bad GO:0005576 extracellular region     1       1/2       100
3     A       good GO:0016020             membrane     1       1/3      1000
4     A       good GO:0005576 extracellular region     2       2/3    1/1000
5     B        bad GO:0016020             membrane     0       0/2          
6     B        bad GO:0005576 extracellular region     0       0/2          
7     B       good GO:0016020             membrane     1       1/2 100127206
8     B       good GO:0005576 extracellular region     0       0/2

However this would probably require introduce some retro-incompatibility in the library, and it is not a big deal as the Cluster column can be easily split using the separate function from tidyr.

Posted in news | 3 Comments

Tidy data and VCF

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of innovative and nice libraries have been released, like dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now instead my code is full %>% instructions and dplyr functions.

If you missed these libraries, a good starting point is the article “Principles of Tidy Dataset“, in which the author Hadley Wickham suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will be already familiar to you if you have experience with the reshape/reshape2 packages, and if you used ggplot2 in the past. However, it is good to read a good summary as in the article.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics – the VCF – in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would like more or less as below. On one hand, it would be a bit too redundant, and many columns would be replicated multiple times, making the file more sensible to typos introduced by the users and occupying more disk space. On the other hand, it would be easier to read, more flexible and able to accommodate other informations, like the population of each individual or more info about the genotype quality.

> vcf %>%
    gather(individual, value, -c(X.CHROM:FORMAT)) %>%
    separate(value, into=strsplit('GT:GQ:DP:HQ', ':')[[1]], ':', extra='drop') %>%
    separate('GT', into=c('allele1', 'allele2'), '[|/]') %>%
    gather(allele, genotype, -c(X.CHROM:individual, GQ:HQ)) %>%
    arrange(X.CHROM, POS, ID, individual) %>% 
    select(-INFO, -FORMAT, -FILTER) %>%  # let's omit this for better visualization
    subset(ID!='microsat1')              # let's omit this for better visualization

 X.CHROM  POS          ID     REF ALT QUAL individual GQ DP    HQ  allele genotype
 20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele1        0
 20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele2        0
 20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele1        1
 20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele2        0
 20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele1        1
 20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele2        1
 20       17330           .   T   A    3   NA00001    49  3 58,50 allele1        0
 20       17330           .   T   A    3   NA00001    49  3 58,50 allele2        0
 20       17330           .   T   A    3   NA00002     3  5  65,3 allele1        0
 20       17330           .   T   A    3   NA00002     3  5  65,3 allele2        1
 20       17330           .   T   A    3   NA00003    41  3  <NA> allele1        0
 20       17330           .   T   A    3   NA00003    41  3  <NA> allele2        0
 20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele1        1
 20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele2        2
 20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele1        2
 20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele2        1
 20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele1        2
 20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele2        2
 20       1230237         .   T   .   47   NA00001    54  7 56,60 allele1        0
 20       1230237         .   T   .   47   NA00001    54  7 56,60 allele2        0
 20       1230237         .   T   .   47   NA00002    48  4 51,51 allele1        0
 20       1230237         .   T   .   47   NA00002    48  4 51,51 allele2        0
 20       1230237         .   T   .   47   NA00003    61  2  <NA> allele1        0
 20       1230237         .   T   .   47   NA00003    61  2  <NA> allele2        0

 

 

https://www.biostars.org/p/123018/

Posted in methodology | Leave a comment

farewell to Barcelona

I am posting this a bit late (since I already moved 6 months ago); anyway, the news is that I left my lab in Barcelona, and moved to London!

prbb from avda aiguader

The institute where I did my PhD: the PRBB in Barcelona. On the other side of the building there is the beach.

I am satisfied about my time in Barcelona, where I did my master thesis and my PhD on network theory applied to human population genetics. However, it was time to move and try new experiences.

a picture taken from the 4th floor of the PRBB

a picture taken from the 4th floor of the PRBB

Apart from the change of city, I also changed my field of work, as I am now working on cancer genetics. My new group is a young group recently moved from Italy to London, famous for research on the systems-level properties of cancer genes , for a database called Network of Cancer Genes, and involved in a consortium for the sequencing of hepatocellular carcinoma. I will keep you informed of the proceedings!

Posted in news | 1 Comment

my first PyPI package: vcf2networks

My first Python package is in PyPI!! I guess that now I can officially call myself a python programmer.

VCF2Networks is a python script to calculate genotype networks from population genetics data. Genotype networks are a method used in systems biology to study the “innovability” of a given phenotype, by representing all the genotypes associated with the phenotype as a graph, and studying some properties of this graph, such as the average path length and the average degree. For more info, you can look at the slides of the “Origins of Evolutionary Innovations” book club in this blog. The script in VCF2Networks allows to take any dataset of genotypes stored in the VCF format, and calculate many of these properties.

In principle, I am planning to submit an application note about the script to a bioinformatics-oriented journal. So, if you have some little time to lent me, and you want to test it, any feedback will be very useful for me. At the moment, the major issue is to simplify the installation, because this package depends on numpy and python-igraph, and these two modules require some terrible C libraries that must be installed separately. If you are aware of any way to distribute a binary package of a python module that depends on C libraries, your suggestion will be really welcome.

Posted in news | Tagged , , , | 1 Comment

The presentation of my PhD defence

That’s it! Last week I defended my PhD thesis!! I have gone through it, and survived to tell!

I don’t feel very different from before, apart from being relieved :-). Now the future is possibly more difficult than before, because I have to look for a job position and finish a lot of things.

While I was preparing the slideshow, I realized that there are not many examples of presentations for a PhD defence online. This is bad, because you need all forms of help to prepare this presentation.The PhD defence is the last thing that you do as a PhD student, so you want to do it perfectly. It is also the moment when you describe many years of your work to the your colleagues and family. Thus, it is bad that there are few examples of slideshows for PhD defence online.

Here is the presentation that I have prepared for my defence. I hope that it will be useful to other people as an example for their defences.

I think that, for this type of presentation, the first slide to make is the “summary of the talk” slide, like the “Topics” slide I have. Usually I don’t like to have such summary slides in my presentation, but for the Thesis defence it is very important, because it gives you a feeling of security when you present. Having a well defined structure allows you to know when you can stop to drink some water or to check if everybody is following, and to know exactly what to say in each slide of the talk.

Posted in news, papers, slideshows, talks | 4 Comments