Solidarity to the workers of the Mario Negri Sud institute

I don’t usually post online petitions, but this is a bit personal to me as it regards the closure of a big research institute close to my hometown in Italy.

The Mario Negri Sud was a research institute active for the last 30 years in Abruzzo, a region in the center-south of Italy. During all these years the institute achieved excellence in many fields, from cardiovascular diseases to breast tumors, from diabetes to some rare diseases, and much more. I remember reading an article on polycythemia vera (a rare cancer) published in the NEJM just before financial problems halted most of the research, and more work by a neuroscientist who was deeply affected by the stress of the situation.

Unfortunately, last week, after 4 years of financial struggle, the institute was officially declared bankrupt. Still, this is not the worst part. Given the financial situation, it is likely that the workers of the institute, approximately 160 people including researchers and staff, will not receive their pay for the last 18 months; moreover, they will not get their pension funds (the Italian TFR), which for some people amount to tens of thousands of euros, accumulated over more than 25 years of work.

Negri sud picture
The Negri Sud institute. Click here to sign the petition.

These are people who dedicated almost all their lives to research, and it is very unfair that they are treated this way. It is well known that the life of a researcher is full of sacrifices and is never financially stable. To think that after so many years they are denied a pension and abandoned to their fate is really inconceivable. Politicians were not able to solve the situation, and they are probably guilty of making it worse. Moreover, given the geographical isolation of the institute, this situation hasn’t received much attention from the media outside of Abruzzo.

If you want to sign the petition, just click on this link to change.org. The petition is in Italian, and basically asks the presidents of the Mario Negri institute, the Abruzzo region and the Chieti province (the three founding entities of Negri Sud) to at least pay the pensions and the salaries of these workers. The change.org website will ask for your name, address, postal code, and email. The website may later send you additional emails regarding other petitions, but you can opt out at any time.

Updates on docker and bioinformatics

My previous post on docker and bioinformatics received some good attention on Twitter, which is nice because it means that this technology is gaining traction in the bioinformatics community.

Here are a few resources and articles I’ve found thanks to the conversations on Twitter.

  • Performance of Docker on an HPC cluster – a nice article showing that running an NGS pipeline in a Docker container costs about 4% in performance. It’s up to you to decide whether this is a big or a small price to pay.
  • biodocker is a project by Hexabio aiming at providing many containers with bioinformatics applications. For example, you can get a virtual machine with biopython or samtools installed in a few minutes. Update: this may have been merged with bioboxes (see discussion)
  • oswitch is a nice implementation of docker from the Queen Mary University of London, which allows you to quickly switch between docker images. I like the examples in which they run a command from a virtual image and then return directly to another environment.
  • ngeasy, a Next Generation Sequencing pipeline implemented on Docker, by a group from King’s College London (I work at the same institute but I didn’t know them!).
  • a nice discussion on Biostar on how a reproducibility problem could be solved with Docker.
  • a Docker symposium planned for the end of 2015 here at King’s.
  • BioPython containers by Tiago Antao, including some IPython tutorials

Docker is another innovation for data analysis introduced in 2014. I am surprised by how many good things were released last year, including docker and the whole dplyr/tidyr bundle. Let’s see what 2015 will bring!

The Network of Cancer Genes database

In the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.

NCG logo
The Network of Cancer Genes database

The main focus of NCG is to provide a curated list of genes associated with cancer, obtained from a manual review of the literature and classified by cancer subtype. Moreover, NCG annotates several systems-level properties of cancer genes, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list describes what you can get from NCG and how it can be useful to you.

Continue reading

Reproducible bioinformatics pipelines with docker

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave the details to the article, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort on our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on Tajima’s D calculated on the 1000 Genomes data. The journal may ask us to show how to reproduce the analysis: which files did we use as input? Which tool did we use to calculate Tajima’s D?

In this case, a docker file may be like the following:
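
Something along these lines — a minimal sketch, in which the repository URL, file paths, genomic region and rule name are placeholders I invented, not the ones from the original pipeline:

```dockerfile
# Sketch of a Dockerfile for the pipeline described below.
# The repository URL, region and rule name are invented placeholders.
FROM ubuntu:14.04

# Part 1: install the software needed to execute the pipeline
RUN apt-get update && apt-get install -y \
        git tabix vcftools python3-pip && \
    pip3 install snakemake

# Part 2: clone the pipeline, then use tabix to fetch a small portion of
# chromosome 22 directly from the 1000 Genomes FTP
RUN git clone https://github.com/myuser/tajimasd-pipeline.git /pipeline
WORKDIR /pipeline
RUN tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
        22:16050000-17000000 > chr22_subset.vcf

# Part 3: run the pipeline by executing a snakemake rule
RUN snakemake tajimas_d
```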

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will clone the latest version of the pipeline into the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000 Genomes FTP. The third part runs the pipeline, by executing a snakemake rule.

You can build this docker container by running the following command:
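
For example, assuming the Dockerfile above sits in the current directory and we call the image tajimasd-pipeline:

```bash
# build the image (this also executes the pipeline, as described above)
docker build -t tajimasd-pipeline .
```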

This will take quite a while to run, and will build a docker virtual image on your system. Afterwards, you can run the following:
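
Again using the hypothetical image name from above:

```bash
# start an interactive shell inside a container based on the image just built
docker run -t -i tajimasd-pipeline /bin/bash
```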

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline and, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

Just a few more links on docker and bioinformatics:

the most useful R function of the week: unnest from tidyr

There are many great functions in CRAN and Bioconductor, and saying that unnest from the tidyr package is the best is certainly a big exaggeration. However, this function solves a data-formatting problem that has made me waste a lot of time in the past, and I was surprised that no one had implemented a function for it before.

Imagine we have a dataframe like the following:
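
Here is a small mock version of such a dataframe (the Entrez IDs and aliases below are invented for illustration, not real annotations):

```r
library(dplyr)
library(tidyr)

# mock data: one Entrez ID per row, several comma-separated aliases per gene
genes <- data.frame(
  entrez  = c("100001", "100002", "100003"),
  aliases = c("DER7,ABC1,XYZ2", "DEC,SHARP2", "MRS2,MRS2L"),
  stringsAsFactors = FALSE
)
genes
```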

The first column contains the Entrez ID of each gene. This column is fine, as it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated with the same Entrez ID. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.

The unnest function from tidyr allows us to convert this data frame into a “tidier” form, containing one row for each combination of Entrez ID and alias:
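
A sketch of how this can be done, using the mock dataframe above:

```r
# split the comma-separated aliases into a list-column,
# then give each alias its own row
tidy_genes <- genes %>%
  mutate(aliases = strsplit(aliases, ",")) %>%
  unnest(aliases)
tidy_genes
```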

This code makes use of the %>% operator and some functions from the dplyr package, but it is still R!

Having the dataframe in this long form makes it a lot easier to deal with. For example, let’s imagine that somebody asks us to get the Entrez IDs for the gene symbols DER7 and DEC. We would just have to do a simple subset of the dataframe:
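
For example, on the tidy dataframe built above:

```r
# a plain filter is enough: no grepl on comma-separated strings
tidy_genes %>%
  filter(aliases %in% c("DER7", "DEC"))
```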

This is just a silly example, which could have been solved with some combination of apply and grepl, but in the real world there are a lot of more complex applications for it. For example, here is some code I used to split BLAT output into one line per exon (or BLAT alignment block):
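
The original snippet is not reproduced here, but the idea was along these lines; the psl dataframe below is a made-up two-row extract of BLAT’s PSL output, and the unnest call uses the current tidyr syntax:

```r
library(dplyr)
library(tidyr)

# in PSL output, blockSizes and tStarts are comma-separated lists with one
# element per alignment block (roughly, one per exon)
psl <- data.frame(
  qName      = c("transcript1", "transcript2"),
  tName      = c("chr22", "chr22"),
  blockSizes = c("120,85,210,", "95,430,"),
  tStarts    = c("16052000,16054100,16057300,", "16060000,16062500,"),
  stringsAsFactors = FALSE
)

# split both fields and expand them in parallel: one row per alignment block
psl_blocks <- psl %>%
  mutate(blockSizes = strsplit(blockSizes, ","),
         tStarts    = strsplit(tStarts, ",")) %>%
  unnest(c(blockSizes, tStarts))
psl_blocks
```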

Continue reading

a formula interface for GeneOntology analysis

clusterProfiler is a nice R library for doing GO and KEGG enrichment analysis. It has a simple interface and it can produce some clear plots using the ggplot2 engine. Today I contributed a formula interface to clusterProfiler, making it easier to do enrichment analysis on multiple groups of genes.

Let’s imagine you have a dataframe in which one column contains a list of Entrez IDs, while the other columns encode some grouping variables:
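
For example, a mock dataframe like this (the six Entrez IDs are arbitrary human genes chosen only for illustration):

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# six arbitrary Entrez IDs, split into two groups plus a second grouping variable
mydf <- data.frame(
  Entrez     = c("1", "100", "1000", "100101467", "100127206", "100128071"),
  group      = c("A", "A", "A", "B", "B", "B"),
  othergroup = c("good", "good", "bad", "bad", "good", "bad"),
  stringsAsFactors = FALSE
)
```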

The new formula interface allows us to do a GO analysis on each of the groups. For example, we can group the genes by the column “group”, and compare the classification of the two groups:
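
A sketch of the call; note that the argument names (OrgDb, ont, level) reflect recent clusterProfiler versions and may differ in older releases, so check the documentation of the version you have installed:

```r
# group the genes by the "group" column and classify each group with groupGO
# (cellular component ontology, level 2)
xx <- compareCluster(Entrez ~ group, data = mydf,
                     fun = "groupGO", OrgDb = org.Hs.eg.db,
                     ont = "CC", level = 2)
dotplot(xx)   # older versions used plot(xx)
```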

clusterProfiler example

In this case group A is enriched in membrane and extracellular region, while group B is only enriched in membrane genes. The groupGO function used here doesn’t provide p-values – we should have used enrichGO instead. I guess that the 3 Entrez IDs in group A correspond to 5 genes in Gene Ontology, which is why the plot shows a total of 5 in group A. See clusterProfiler’s documentation for better examples.

The formula interface also allows multiple grouping variables. For example:
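
Under the same assumptions as above:

```r
# group by both "group" and "othergroup" at the same time
xx2 <- compareCluster(Entrez ~ group + othergroup, data = mydf,
                      fun = "groupGO", OrgDb = org.Hs.eg.db,
                      ont = "CC", level = 2)
dotplot(xx2)
```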

Of course this example is not very interesting, since it only uses 6 randomly chosen genes. However, with bigger datasets the formula interface can be much more powerful.

Now that I think of it, it would be better if compareCluster returned a dataframe with multiple columns, instead of merging them into a single column called Cluster. This would make it possible to plot the results using facets or something fancier. It would be something like this:
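
Roughly what I mean — a sketch obtained by splitting the merged Cluster column back into the original grouping variables, assuming the group values are joined with a dot (e.g. “A.good”):

```r
library(tidyr)

# turn the merged "Cluster" column back into separate group / othergroup
# columns, which could then be used for facetting
res <- as.data.frame(xx2)
res <- separate(res, Cluster, into = c("group", "othergroup"), sep = "\\.")
head(res)
```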

However, this would probably introduce some backward incompatibility in the library, and it is not a big deal, as the Cluster column can easily be split using the separate function from tidyr.

Tidy data and VCF

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of innovative and nice libraries have been released, like dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now my code is full of %>% pipes and dplyr functions.

If you missed these libraries, a good starting point is Hadley Wickham’s article “Tidy Data”, in which he suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will already be familiar to you if you have experience with the reshape/reshape2 packages or have used ggplot2 in the past, but the article provides a good summary.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics – the VCF – could be represented in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would look more or less as below. On one hand, it would be a bit too redundant, and many columns would be replicated multiple times, making the file more susceptible to typos introduced by the users and bigger on disk. On the other hand, it would be easier to read, more flexible, and able to accommodate other information, like the population of each individual or more details about the genotype quality.
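
As a rough illustration (the coordinates, samples and quality values below are made up), each variant would be repeated once per sample, with one row per genotype call:

```
CHROM  POS       REF  ALT  SAMPLE   POPULATION  GENOTYPE  QUALITY
22     16050075  A    G    HG00096  GBR         0|0       99
22     16050075  A    G    HG00097  GBR         0|1       98
22     16050075  A    G    NA19625  ASW         1|1       87
```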


https://www.biostars.org/p/123018/