My first Data Dive event

This has been a lovely and sunny weekend in London, but I didn’t see any of it because I spent it all crunching dataframes and calculating numbers at my first Data Dive.


Data Dives are events organized by an international organization called DataKind, in which a group of data scientists volunteer their time to solve data analysis problems for non-profit organizations. For example, I have been analysing data for My Help at Home, a company that helps elderly people find local carers, trying to understand which factors influence the demand for and cost of private carers.

DataKind UK has a strict no-sharing policy regarding the results of the Data Dive, in order to protect the data made available by the charities. However, in the case of My Help at Home we used only publicly available data, so I guess I can show some of the results, based on the number of homes, agencies and hospitals in the UK:

Here are a few thoughts about the experience:

  • I’ve decided that I will start introducing myself as a data scientist rather than a bioinformatician. Most people outside academia do not really understand what a bioinformatician is, and it is easier to explain to them that you are a data analyst or scientist working on genetic and biological data. In the end the definition is correct: bioinformaticians really are a specialized type of data scientist.
  • This has been an opportunity to get in contact with the “real world” of data science outside academia. Most of the people I met work in the private sector, in fields like finance, consulting, gambling, and journalism. I only met a couple of people from academia, and they were both complaining about the lack of organization and planning at the university.
  • Thanks to dplyr and related libraries, R has become a really powerful tool for merging and assembling datasets. It helped me a lot during the data cleaning and assembly phase, and I think that for these tasks it is much better than Python or bash. I would recommend that anyone starting to learn R skip all the basic syntax and start directly with dplyr (e.g. see the tutorial I wrote for the PEB workshop).
  • The majority of people used Python, in particular IPython notebooks, for most of the tasks. Currently I am an R and dplyr person, but for machine learning tasks I am starting to think that Python and scikit-learn can actually be more powerful.
  • People working in consulting, who need to be able to easily create nice, interactive graphs for their work, used visual solutions such as Tableau rather than fiddling with R or other programming tools. For example, the interactive graph above was created in a couple of minutes with noveau.
Posted in life, news

Reviewed “Bioinformatics with Python cookbook” by Tiago Antao

I’ve recently been a reviewer for the book “Bioinformatics with Python cookbook” by Tiago Antao, one of the major contributors to BioPython. The book is published by Packt Publishing, and it is a collection of recipes for several bioinformatics tasks, from reading large genome files to doing population genetics.


Bioinformatics with Python Cookbook on my desktop, together with my zombie mug.


The GitHub account of the author contains links to all the Python notebooks illustrated in the book. These notebooks are freely accessible, but they come with no explanation of the code; for that you will need to buy the book. Moreover, the book provides a link to a Docker image that can be used to install all the materials and software needed to run the examples. I think this is a smart way to provide materials for exercises, and I will copy the idea in the future.

Being a reviewer, I was expected to be an expert in all the topics described in the book. However I must admit that I learned a lot from reviewing it, and that some of the recipes presented managed to surprise me. Here is a quick summary of the new things I learned:

  • How to convert many bioinformatics-related formats with pygenomics and BioPython
  • How to use the REST APIs for querying Ensembl
  • How to run and plot a PCA of SNP data with Python and EIGENSOFT
  • simuPOP is a nice program for simulating population genetics events
  • DendroPy is a nice module for dealing with phylogenetic trees, like ETE
  • PDB files are going to be replaced by mmCIF files, and BioPython is able to read both formats
  • PyMOL and Cytoscape can be controlled from within a Python script or IPython
  • PSICQUIC is a consistent interface to many molecular-interaction databases
  • IPython has excellent support for multi-core execution
  • It is easy to optimize Python code with Cython and Numba, just by adding a few decorators

If you buy the book and find any errors in the code, you can blame me, as I was a reviewer and didn’t find them.

Posted in book club, news

A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three-week course, in which we take 25 evolutionary biologists with no prior knowledge of programming, lock them in a room together with some very good teaching assistants, and keep explaining to them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and bioinformatics. The nice thing is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, called the PEB conference, which is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow up with the students, and I have been in charge of organizing a couple of workshops, one about installing software on Linux, and one about advanced R programming.

The tutorial for advanced R programming is available on GitHub and below. I think it may be of interest to anybody with some knowledge of R who wishes to learn some new tricks. In particular, the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes in data analysis that I saw on Biostar. It describes the differences between a dataframe in a wide and in a long format, how to convert one to the other, and what the advantages of doing so are. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.
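To give a flavour of the wide/long conversion and the group-based summaries, here is a minimal sketch. It is not taken from the tutorial itself: the table, the gene and sample names, and the column names are all made up for illustration, and it assumes a recent version of tidyr that provides pivot_longer and pivot_wider.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample.
wide <- data.frame(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 3),
  sample2 = c(12, 5),
  sample3 = c(11, 4)
)

# Wide -> long: one row per (gene, sample) observation.
long <- pivot_longer(wide, cols = -gene,
                     names_to = "sample", values_to = "expression")

# Group-based summary on the long table: mean expression per gene.
gene_means <- long %>%
  group_by(gene) %>%
  summarise(mean_expression = mean(expression), n_samples = n())

# Long -> wide again, to show that the conversion is reversible.
wide_again <- pivot_wider(long, names_from = sample, values_from = expression)

print(long)
print(gene_means)
```

The long format is the one that group_by and ggplot2 expect, because the grouping variable (here the sample) lives in a column of its own rather than being spread across the column names.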

Difference between a dataset in a wide or a long format


Posted in methodology, news

NCG enrichment implemented in DOSE

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GOSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the GitHub version of DOSE. You can use it to analyze a list of genes and determine whether it is enriched in genes known to be mutated in a given cancer type. For example, an arbitrary list composed of the genes having an Entrez Id between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:

> library(devtools)
> install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))
> library(DOSE)
> dev_mode()
> mygenes = as.character(4000:9000)  # generate a random list of Entrez Ids, from ID 4000 to ID 9000
> summary(enrichNCG(gene=mygenes))   # calculate enrichment of the random list of Entrez Ids
           ID          Description   GeneRatio     BgRatio     pvalue        p.adjust        qvalue
 sarcoma   sarcoma     sarcoma       15/457        30/1920     0.001495187   0.02352589      0.02185067
 leukemia  leukemia    leukemia      38/457        106/1920    0.002767752   0.02352589      0.02185067
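To get a feeling for where the p-value comes from, the sarcoma row can be approximated by hand. This is only a sketch: it assumes that the enrichment is a one-sided hypergeometric (over-representation) test, and that GeneRatio and BgRatio mean "hits / annotated genes in my list" and "genes in the term / genes in the background". Check the DOSE documentation for the exact definition before relying on it.

```r
# Numbers read from the sarcoma row of the table above.
k <- 15    # genes in my list annotated to sarcoma        (GeneRatio numerator)
n <- 457   # genes in my list that are annotated in NCG   (GeneRatio denominator)
M <- 30    # background genes annotated to sarcoma        (BgRatio numerator)
N <- 1920  # all background genes                         (BgRatio denominator)

# P(X >= k) when drawing n genes without replacement from N genes,
# M of which are sarcoma genes (assumed test, see the note above).
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# How many times more frequent sarcoma genes are in my list than in the background.
fold_enrichment <- (k / n) / (M / N)

print(p_value)
print(fold_enrichment)
```

The p.adjust and qvalue columns are corrections for multiple testing across all the cancer types tested, so they cannot be recomputed from a single row.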

If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.

> library(clusterProfiler)
> summary(compareCluster(list(L1=as.character(4000:9000), L2= as.character(3000:4000)), fun='enrichNCG'))
Cluster       ID    Description GeneRatio    BgRatio       pvalue     p.adjust       qvalue
     L1  sarcoma        sarcoma    15/457    30/1920 0.0014951873 0.0235258920 0.0218506737
     L1 leukemia       leukemia    38/457   106/1920 0.0027677520 0.0235258920 0.0218506737
     L2 leukemia       leukemia     16/96   106/1920 0.0000399849 0.0001599396 0.0001262681

If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:

> data(geneList)
> y = gseAnalyzer(geneList, setType="NCG", minGSSize=1)
> summary(y)
         ID      Description   setSize    enrichmentScore pvalue p.adjust qvalues
breast,lung      breast,lung         2         -0.9965348      0       0        0
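The enrichment score in this table can be understood with a small sketch of the classic GSEA running-sum statistic. This is not the code used by DOSE, only an illustration of the general idea; the function name enrichment_score, the toy gene names and the scores are all hypothetical.

```r
# Running-sum enrichment score, in the spirit of classic GSEA (illustrative only).
enrichment_score <- function(scores, gene_set, exponent = 1) {
  scores <- sort(scores, decreasing = TRUE)       # rank genes by score
  in_set <- names(scores) %in% gene_set           # which ranked genes are in the set
  hit_weight <- abs(scores)^exponent * in_set     # hits are weighted by their score
  p_hit  <- cumsum(hit_weight) / sum(hit_weight)  # cumulative fraction of weighted hits
  p_miss <- cumsum(!in_set) / sum(!in_set)        # cumulative fraction of misses
  running <- p_hit - p_miss
  unname(running[which.max(abs(running))])        # largest deviation from zero
}

# Toy ranked list: g1 has the highest score, g10 the lowest.
toy_scores <- setNames(10:1, paste0("g", 1:10))

es_top    <- enrichment_score(toy_scores, c("g1", "g2"))   # set at the top of the list
es_bottom <- enrichment_score(toy_scores, c("g9", "g10"))  # set at the bottom of the list
print(c(top = es_top, bottom = es_bottom))
```

A set whose genes all sit at the top of the ranking gets a score close to 1, and a set whose genes all sit at the bottom gets a score close to -1, which is the situation of the two-gene breast,lung set in the table above.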

You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:

> cnetplot(enrichNCG(as.character(4000:9000), readable=T), fixed=T)

The cnet visualization of a randomly generated dataset of genes, using enrichNCG from DOSE, which derives data from the Network of Cancer Genes.

It is worth mentioning that the DOSE package can also calculate enrichment against the Disease Ontology database, which associates genes with disease terms. In my experience, Disease Ontology is more useful to bioinformaticians than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text-mine the descriptions, and that can lead to a lot of noisy data.

Have a good enrichment with DOSE and NCG ;-)

Posted in methodology, news, projects

Solidarity to the workers of the Mario Negri Sud institute

I don’t usually post online petitions, but this one is a bit personal to me, as it concerns the closure of a big research institute close to my hometown in Italy.

The Mario Negri Sud was a research institute active for the last 30 years in Abruzzo, a region in the center-south of Italy. During all these years the institute achieved excellence in many fields, from cardiovascular diseases to breast tumors, from diabetes to some rare diseases, and much more. I remember reading about an article on polycythemia vera (a rare cancer) published in NEJM just before financial problems halted most of the research, and about more work by a neuroscientist who was really affected by the stress of the situation.

Unfortunately last week, after 4 years of financial struggle, the institute was officially declared bankrupt. Still, this is not the worst part. Given the financial situation, it is likely that the workers of the institute, approximately 160 people between researchers and staff, will not receive their pay for the last 18 months. Moreover, they will not get their pension funds (the Italian TFR), which for some people amount to tens of thousands of euros, accumulated over more than 25 years of work.

The Negri Sud institute. Click here to sign the petition.

These are people who dedicated almost all their lives to research, and it is very unfair that they are treated this way. It is well known that the life of a researcher is full of sacrifices and is never financially stable. To think that after many years they are denied a pension and abandoned to their fate is really inconceivable. Politicians were not able to solve the situation, and they are probably guilty of making it worse. Moreover, given the geographical isolation of the institute, this situation hasn’t received much attention from the media outside of Abruzzo.

If you want to sign the petition, just click on this link. The petition is in Italian, and basically asks the presidents of the Mario Negri institute, the Abruzzo region and the Chieti province (the three founding entities of Negri Sud) to at least pay the pensions and salaries of these workers. The website will ask for your name, address, postal code, and email. The website may later send you additional emails regarding other petitions, but you can opt out at any time.

Posted in news

Updates on docker and bioinformatics

My previous post on docker and bioinformatics received some good attention on Twitter. It’s nice because it means that this technology is getting the right attention in the bioinformatics community.

Here are a few resources and articles I’ve found thanks to the conversations on Twitter.

  • Performance of Docker on an HPC cluster: a nice article showing that running an NGS pipeline in a Docker container costs about 4% in performance. It’s up to you to decide whether this is a big or a small price to pay.
  • biodocker is a project by Hexabio aiming to provide many containers with bioinformatics applications. For example, you can get a container with BioPython or samtools installed in a few minutes. Update: this may have been merged with bioboxes (see discussion).
  • oswitch is a nice tool built on Docker, from Queen Mary University of London, which lets you quickly switch between Docker images. I like the examples in which they run a command from an image and then return directly to another environment.
  • ngeasy, a Next Generation Sequencing pipeline implemented on Docker, by a group from King’s College London (I work in the same institute but I didn’t know them!).
  • A nice discussion on Biostar on how a reproducibility problem could be solved with Docker.
  • A Docker symposium planned for the end of 2015 here at King’s.
  • BioPython containers by Tiago Antao, including some IPython tutorials.

Docker is another innovation for data analysis introduced in 2014. I am surprised by how many good things were released last year, including docker and the whole dplyr/tidyr bundle. Let’s see what 2015 will bring!

Posted in methodology, news

The Network of Cancer Genes database

Over the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.


The main focus of NCG is to provide a curated list of genes associated with cancer, obtained after a manual review of the literature, and classified by cancer subtype. Moreover, NCG annotates some system-level properties of genes associated with cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list describes what you can get from NCG and how it can be useful to you.


Posted in methodology, projects