Interviewed by LabWorm for NCG

Our group has been interviewed by LabWorm regarding our recent publication on Network of Cancer Genes 5.0.

I absolutely love the “artist impression” they made of our team:

The NCG team sketched by LabWorm. Thanos Mourikis, me, and Omer An.

LabWorm is a collaborative platform for sharing tools and links related to bioinformatics. They have a very modern and interactive user interface, and they are very active in adding new links and involving people in the platform.

Over my too many years of experience in the bioinformatics field, I saw many attempts at creating collections of bioinformatics tools. Unfortunately many of these failed because of lack of interest or lack of maintenance. However LabWorm seems to be doing things right for the moment, as they really work hard to engage people in their community, and they even publish some blog interviews to researchers.

The bioinformatics community really need a effective way to share tools and links, and I really hope that LabWorm will be successful in their attempt.

Are fitness genes more conserved across species? my first 30-minutes attempt

A recently published paper by Hart et al presented a genome-wide CRISPR screening to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potentiality of CRISPR to generate large scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be more conserved across species. However they do not follow-up on this hypothesis, probably for lack of space. They can’t be blamed as they already present a lot of results in the paper.

Distribution of conservation scores in the phastcons.100way.UCSC.hg19 track. Are essential genes more conserved than other genes?
Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes?

This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from bioconductor data packages, and do a Wilcoxon test to compare the two distribution. The full code is available as a github repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.

Continue reading

Tutorial on working with Genomics data with bioConductor – part I

BioConductor includes many powerful packages for working with genomics data. You can do pretty much everything, from downloading gene coordinates and sequences of any model species, to converting gene ids and symbol, and to accessing ENCODE data and anything in UCSC, Ensembl, and other resources. However these packages are not always well known,  and the initial learning curve is a steep, specially for R beginners.

This series of tutorials will describe how to get gene coordinates from bioconductor, intersect these with some interesting dataset from ENCODE, and do an enrichment analysis with DOSE. It will be fun ūüôā

Libraries required for this tutorial

For this tutorial we will use only Human data. Most of the packages needed for working with human genomics can be installed with a single command:

The Homo.sapiens package is a container that includes the most important packages for working with human data. In particular it contains two data packages:

  • TxDb.Hsapiens.UCSC.hg19.knownGene is a TxDb object containing the coordinates of all the genes, transcripts, and exons in the human genome
  • is what you need to convert all gene ids – from entrez to ensembl, to GO, and so on.

The data in these packages is updated periodically (I think every 6 months), and is pretty stable, meaning that anybody using the same packages and version should be able to reproduce the same results. An alternative to using these data packages is biomaRt, but I prefer the data packages as they can be used without internet connection.

The package AnnotationHub is used to retrieve data from multiple sources and will be described later. The BSgenome package is for retrieving the human genome sequence: we will not use it in the tutorial but I included it for completeness.

Note that I also loaded the dplyr package for this tutorial. Although dplyr is not needed for working with genomics data, I consider it one of the most useful packages in R, and this tutorial will make heavy use of it. I apologize if this tutorial is not easy to follow to people not familiar with dplyr.

Retrieving gene and transcript coordinates

The TxDb object can be used to retrieve coordinates of genes, transcripts, and exons in the human genome. For example, we can access all human transcript with the transcript() function:

See help(transcripts) for other functions that can be applied to a TxDb object. For example, genes() retrieve coordinates of genes, while exons() and promoters() work similarly.

In the example above I also specified a “columns” parameter, in order to show the gene id as well. You can use this column to get the coordinates of a specific set of genes. For example, the following will retrieve the coordinates of the genes corresponding to entrez ids 1234, 231, and 421:

Converting Entrez IDs to symbols and other IDs

One of the most tricky part in bioinformatics is converting gene ids to symbols and other ids. Many errors can be made in this process, and is therefore very important to have a consistent way to convert gene ids.

Luckily, we can use the for easily converting many ids. This package should already have been loaded with library(Homo.sapiens). To see all the possible conversion tables (bimaps) available, we can either to library( or simply write “” and then hit tab on the R command line .

One of my favorite bimaps is the one to convert gene symbols to entrez. As you may know, for historical reasons the same gene can have more than one symbol. This usually complicates things a lot, and a safe procedure is to always convert symbols to entrez before starting any analysis. The ALIAS2EG bimap is there for this type of conversion:

When you convert symbols to id, it is important to remember that not only the same gene can have more than one symbol, but also the same symbol can match multiple entrez ids. For example, here is the code to get which symbols match more than one entrez id in the human species:

Note: the “%>%” symbol and the count, arrange functions come from the dplyr package.

The only safe thing to do in these cases is to identify the duplicated symbols and either discard them or manually curate them. Let’s imagine that a fellow researchers asked us to retrieve information for the genes ACT, DOLPP1 and MGAT3. We can use the bimap to identify the genes that map to multiple entrez id, and then go back to our colleague and ask him to tell us which are the correct ids.

In these examples I converted the bimap to a dataframe and then did an intersection. However the “bioConductor” way to use these bimaps is through the select function:

One big problem with these bioconductor packages is that they clash with many dplyr functions. For example, the select function gets overwritten if you load dplyr after Homo.sapiens, and the only option to avoid headaches is to explicitly refer to the function as AnnotationDbi::select. These conflicts in the namespace can cause a lot of confusion in R, because they let to weird error messages that are completely unrelated to the real problem.

In any case, the advantage of the select function is that it allows to retrieve more id types at the same type. For example here I retrieved both entrez id and ensembl ids, and if you type columns( you will be able to see many other possible output columns.

Next parts of the tutorial

I was originally planning to write one big tutorial in the same post, but now I see that it would be much more readable if I split it into multiple posts.

Please let me know if you have any comment regarding this first tutorial, and I will try to improve it and take the feedback into account for the next parts.

my first DataDive event

This has been a lovely and sunny weekend in London, but I didn’t see any of it because I spent it all crunching dataframes and calculating numbers at my first Data Dive.


Data Dives are events organized by an international organization called DataKind, in which a bunch of data scientists volunteer to dedicate their time to solve data analysis for non-profit companies. For example I have been analysing data for My Help at Home, a company that helps elderly people finding local carers, trying to understand which factors influence the demand and costs of private carers.

DataKindUk has a strict no-sharing policy regarding the results of the Data Dive, in order to protect the data made available by the charities. However in the case of My Help at Home we used only publicly available data, so I guess I can show some of the results, based on the number of Homes, Agencies and Hospitals in UK:

Here are a few thoughts about the experience:

  • I’ve decided that I will start introducing myself as a data scientist rather than a bioinformatician. Most people from outside the academia do not really understand what a bioinformatician is, and it is easier to explain them that you are a data analyst or scientist working on genetic and biological data. In the end the definition is correct – bioinformaticians truthfully are a specialized type of data scientists.
  • This has been an opportunity to get in contact with the “real world” of data science outside the academia. Most of the people I met work for the private sectors, like financing, consulting, gambling, and journalism. I only met a couple of people from the academia, and they were both complaining about the lack of organization and planning at the university.
  • Thanks to dplyr and related libraries, R has become a really powerful tool for merging and assembling datasets. It helped me a lot during the phase of data cleaning and assembly, and I think that for these tasks it is much better than python or bash. I would recommend to anyone starting learning R to skip all the basic syntax and start directly with dplyr (e.g. see the tutorial I wrote for the PEB workshop).
  • The majority of people used python, in particular the ipython notebooks, for most of the tasks. Currently I am a R and dplyr person, but for machine learning tasks I am starting to think that python and scikit-learn can actually be more powerful.
  • People working in consulting, who for their work need to able to easily create nice and interactive graphs, used visual solutions such as tableau rather than munching with R or other programming tools. For example, the interactive graph above was created in a couple of minutes with noveau.

Reviewed “Bioinformatics with Python cookbook” by Tiago Antao

I’ve recently been a reviewer for the book “Bioinformatics with Python cookbook” by Tiago Antao, one of the big authors of BioPython. The book is published by Packt Publishing, and it is a collection of recipes for several bioinformatics tasks, from reading large genome files to doing population genetics and other tasks.


python book
Bioinformatics with Python Cookbook on my desktop, together with my zombie mug.


The github account of the author contains a link of all the python notebooks illustrated in the book. These notebook are freely accessible, but there is no explanation of the code, as for that you will need to buy the book. Moreover, the book provides a link to a docker image that can be used to install all the materials and software needed to execute the examples. I think this is a smart way to provide materials for exercises, and I will copy the idea in the future.

Being a reviewer, I was expected to be an expert in all the topics described in the book. However I must admit that I learned a lot from reviewing it, and that some of the recipes presented managed to surprise me. Here is a quick summary of the new things I learned:

  • How to convert many bioinformatics-related formats with pygenomics and biopython
  • How to use the rest APIs for querying ensembl
  • How to do and plot a PCA in python and eigensoft of SNP data
  • SimuPOP is a nice software for simulating population genetics events
  • DendroPy is a nice module for dealing with phyologenetic trees, like ete
  • PDB files are going to be replaced by mmCIF files, and BioPython is able to read both formats
  • pymol and cytoscape can be commanded from within a python script/ipython
  • PSIQUIC is a consistent interface to many molecular-interaction databases
  • ipython has excellent multi-core execution capacity.
  • it is easy to optimize python code with cython and numba, just by adding a few decorators

If you buy the book and find any error in the code, you can blame me as I was a reviewer and didn’t find it.

A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format
DIfference between a dataset in a wide or a long format

Continue reading

NCG enrichment implemented in DOSE

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GoSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the github version of DOSE. You can use it to analyze a list of genes, and determine if they are enriched in genes known to be mutated in a given cancer type. For example, a random list composed by genes having an Entrez Id between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:

If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.

If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:

You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:

It is worth to mention that the DOSE package allows to calculate enrichment in the Disease Ontology database, which associates genes to disease terms. In my experience, for bioinformaticians Disease Ontology is more useful than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text mine the descriptions and that can lead to a lot of noisy data.

Have a good enrichment with DOSE and NCG ūüėČ