Reviewed “Bioinformatics with Python cookbook” by Tiago Antao

I’ve recently been a reviewer for the book “Bioinformatics with Python cookbook” by Tiago Antao, one of the big authors of BioPython. The book is published by Packt Publishing, and it is a collection of recipes for several bioinformatics tasks, from reading large genome files to doing population genetics and other tasks.


python book
Bioinformatics with Python Cookbook on my desktop, together with my zombie mug.


The github account of the author contains a link of all the python notebooks illustrated in the book. These notebook are freely accessible, but there is no explanation of the code, as for that you will need to buy the book. Moreover, the book provides a link to a docker image that can be used to install all the materials and software needed to execute the examples. I think this is a smart way to provide materials for exercises, and I will copy the idea in the future.

Being a reviewer, I was expected to be an expert in all the topics described in the book. However I must admit that I learned a lot from reviewing it, and that some of the recipes presented managed to surprise me. Here is a quick summary of the new things I learned:

  • How to convert many bioinformatics-related formats with pygenomics and biopython
  • How to use the rest APIs for querying ensembl
  • How to do and plot a PCA in python and eigensoft of SNP data
  • SimuPOP is a nice software for simulating population genetics events
  • DendroPy is a nice module for dealing with phyologenetic trees, like ete
  • PDB files are going to be replaced by mmCIF files, and BioPython is able to read both formats
  • pymol and cytoscape can be commanded from within a python script/ipython
  • PSIQUIC is a consistent interface to many molecular-interaction databases
  • ipython has excellent multi-core execution capacity.
  • it is easy to optimize python code with cython and numba, just by adding a few decorators

If you buy the book and find any error in the code, you can blame me as I was a reviewer and didn’t find it.

A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format
DIfference between a dataset in a wide or a long format

Continue reading

NCG enrichment implemented in DOSE

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GoSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the github version of DOSE. You can use it to analyze a list of genes, and determine if they are enriched in genes known to be mutated in a given cancer type. For example, a random list composed by genes having an Entrez Id between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:

If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.

If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:

You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:

It is worth to mention that the DOSE package allows to calculate enrichment in the Disease Ontology database, which associates genes to disease terms. In my experience, for bioinformaticians Disease Ontology is more useful than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text mine the descriptions and that can lead to a lot of noisy data.

Have a good enrichment with DOSE and NCG 😉

Solidarity to the workers of the Mario Negri Sud institute

I don’t usually post online petitions, but this is a bit personal to me as it regards the closure of a big research institute close to my hometown in Italy.

The Mario Negri Sud was a research institute active for the last 30 years in Abruzzo, a region in the center-south of Italy. During all these years the institute achieved excellence in many fields, from cardiovascular diseases to breast tumors, from diabetes to some rare diseases, and much more. I remember reading about an article on polycythemia vera (a rare cancer) published on NEJM just before financial problems halted most of the research, and more work by a neuroscientist who was really affected by the stress of the situation.

Unfortunately last week, after 4 years of financial struggle, the institute was officially declared bankrupt. Still, this is not the worst part.  Given the financial situation it is likely that the workers of the institute, aproximately 160 people between researchers and staff, will not receive their pay from the last 18 months, moreover they will not get their pension funds (the Italian TFR), which for some people amounts to tens of thousands of euros, accumulated in more than 25 years of work.

Negri sud picture
The Negri Sud institute. Click here to sign the petition.

These are people who dedicated almost all their life to research, and it is very unfair that they are threated this way. It is well known that the life of researchers is full of sacrifices and is never financially stable. To think that after many years they are denied a pension and abandoned to their fate is really inconceivable. Politicians were not able to solve the situation, and they are probably guilty of making it worse. Moreover given the geographical isolation of the institute, this situation hasn’t received much attention from the media outside of Abruzzo.

If you want to sign the petition, just click on this link to The petition is in Italian, and basically asks to the presidents of the Mario Negri institute, the Abruzzo region and the Chieti province (the three founding entitites of Negri Sud), to at least pay the pension and the salary of these workers. The website will ask for your name, direction, postal code, and email. The website may later send you additional emails regarding other petitions, but you can opt out at any time.

Updates on docker and bioinformatics

My previous post on docker and bioinformatics received some good attention on Twitter. It’s nice because it means that this technology is getting the right attention in the bioinformatics community.

Here are a few resources and articles I’ve found thanks to the conversations in Twitter.

  • Performances of Docker on a HPC cluster – a nice article showing that running a NGS pipeline in a docker container costs about 4% of the performances. It’s up to you to decide whether this is a big or a small price to pay.
  • biodocker is a project by Hexabio aiming at providing many containers with bioinformatics application. For example, you can get a virtual machine with biopython or samtools installed in a few minutes. Update: this may have been merged with bioboxes (see discussion)
  • oswitch is a nice implementation of docker from the Queen Mary University of London, which allows to quickly switch between docker images. I like the examples in which they run a command from a virtual image and then return directly to another environment.
  • ngeasy, a Next Generation Sequencing pipeline implemented on Docker, by a group from the King’s College of London (I work in the same institute but I didn’t know them!).
  • a nice discussion on Biostar on how a reproducibility problem could be solved with Docker.
  • a Docker symposium planned for the end of 2015 here at King’s.
  • BioPython containers by Tiago Antao, including some ipython tutorials

Docker is another innovation for data analysis introduced in 2014. I am surprised by how many good things were released last year, including docker and the whole dplyr/tidyr bundle. Let’s see what 2015 will bring!

The Network of Cancer Genes database

In the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.

NCG logo
The Network of Cancer Genes database

The main focus of NCG is to provide a curated list of genes associated to cancer, obtained after a manual review of the literature, and classified by cancer subtypes. Moreover NCG annotates some system-level properties of genes associated to cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list will describe you what you can get from NCG and how can it be useful to you.

Continue reading

Reproducible bioinformatics pipelines with docker

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on the Tajima’s D index from the data in 1000 Genomes. The journal may ask us to show how to reproduce the analysis: which files did we used as input? Which tool did we use to calculate the Tajima’s D?

In this case, a docker file may be like the following:

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will  clone the latest version of the pipeline in the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000Genomes ftp. The third part runs the pipeline, by executing a snakemake rule.

You can run this docker container by running the following:

This will take quite a while to run, and will build a docker virtual image in your system. Afterwards, you can run the following:

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline, and eventually, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

Just a few more links on docker and bioinformatics:

the most useful R function of the week: unnest from tidyr

There are many great functions in CRAN and BioConductor, and certainly saying that unnest from the tidyr package is the best is a big exaggeration. However this function solved a big problem in data formatting that made me waste a lot of time in the past, that I was surprised no one had implemented a function for it yet.

Imagine we have a dataframe like the following:

The first column contains the Entrez of each gene. This columns is fine, as it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated to the same Entrez IDs. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.

The unnest function from tidyr allows to convert this data frame in a “tidier” form, containing one row for each combination gene symbol and alias:

This code makes use of the %>% and some functions from the dplyr package, but it is still R!

Having the dataframe in this long form makes it a lot easier to deal with it. For example, let’s imagine that somebody asks us to get the Entrez IDs for the list of gene symbols DER7 and DEC. We would just have to do a simple subset on the dataframe:

This is just a silly example, which may have been solved with some application of apply and grepl, but in the real world there are a lot of more complex applications for it. For example, here is some code I used to split Blat output into one line per exon (or blat alignment block):

Continue reading