A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format

DIfference between a dataset in a wide or a long format

Continue reading

Posted in methodology, news | Tagged , , , | Leave a comment

NCG enrichment implemented in DOSE

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GoSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the github version of DOSE. You can use it to analyze a list of genes, and determine if they are enriched in genes known to be mutated in a given cancer type. For example, a random list composed by genes having an Entrez Id between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:

> library(devtools)
> install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))
> library(DOSE)
> dev_mode()
> mygenes = as.character(4000:9000)  # generate a random list of Entrez Ids, from ID 4000 to ID 9000
> summary(enrichNCG(gene=mygenes))   # calculate enrichment of the random list of Entrez Ids
           ID          Description   GeneRatio     BgRatio     pvalue        p.adjust        qvalue
 sarcoma   sarcoma     sarcoma       15/457        30/1920     0.001495187   0.02352589      0.02185067
 leukemia  leukemia    leukemia      38/457        106/1920    0.002767752   0.02352589      0.02185067

If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.

> library(clusterProfiler)
> summary(compareCluster(list(L1=as.character(4000:9000), L2= as.character(3000:4000)), fun='enrichNCG'))
Cluster       ID    Description GeneRatio    BgRatio       pvalue     p.adjust       qvalue
     L1  sarcoma        sarcoma    15/457    30/1920 0.0014951873 0.0235258920 0.0218506737
     L1 leukemia       leukemia    38/457   106/1920 0.0027677520 0.0235258920 0.0218506737
     L2 leukemia       leukemia     16/96   106/1920 0.0000399849 0.0001599396 0.0001262681

If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:

> data(geneList)
> y = gseAnalyzer(geneList, setType="NCG", minGSSize=1)
         ID      Description   setSize    enrichmentScore pvalue p.adjust qvalues
breast,lung      breast,lung         2         -0.9965348      0       0        0

You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:

> cnetplot(enrichNCG(as.character(4000:9000), readable=T), fixed=T)

the cnet visualization of a randomly generated dataset of genes, using enrichNCG from DOSE, which derives data from the Network of Cancer Genes.
The cnet visualization of a randomly generated dataset of genes, 
using enrichNCG from DOSE, which derives data from the Network of Cancer Genes.

It is worth to mention that the DOSE package allows to calculate enrichment in the Disease Ontology database, which associates genes to disease terms. In my experience, for bioinformaticians Disease Ontology is more useful than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text mine the descriptions and that can lead to a lot of noisy data.

Have a good enrichment with DOSE and NCG ;-)

Posted in methodology, news, projects | Tagged , , , , , | 1 Comment

Solidarity to the workers of the Mario Negri Sud institute

I don’t usually post online petitions, but this is a bit personal to me as it regards the closure of a big research institute close to my hometown in Italy.

The Mario Negri Sud was a research institute active for the last 30 years in Abruzzo, a region in the center-south of Italy. During all these years the institute achieved excellence in many fields, from cardiovascular diseases to breast tumors, from diabetes to some rare diseases, and much more. I remember reading about an article on polycythemia vera (a rare cancer) published on NEJM just before financial problems halted most of the research, and more work by a neuroscientist who was really affected by the stress of the situation.

Unfortunately last week, after 4 years of financial struggle, the institute was officially declared bankrupt. Still, this is not the worst part.  Given the financial situation it is likely that the workers of the institute, aproximately 160 people between researchers and staff, will not receive their pay from the last 18 months, moreover they will not get their pension funds (the Italian TFR), which for some people amounts to tens of thousands of euros, accumulated in more than 25 years of work.

Negri sud picture

The Negri Sud institute. Click here to sign the petition.

These are people who dedicated almost all their life to research, and it is very unfair that they are threated this way. It is well known that the life of researchers is full of sacrifices and is never financially stable. To think that after many years they are denied a pension and abandoned to their fate is really inconceivable. Politicians were not able to solve the situation, and they are probably guilty of making it worse. Moreover given the geographical isolation of the institute, this situation hasn’t received much attention from the media outside of Abruzzo.

If you want to sign the petition, just click on this link to change.org. The petition is in Italian, and basically asks to the presidents of the Mario Negri institute, the Abruzzo region and the Chieti province (the three founding entitites of Negri Sud), to at least pay the pension and the salary of these workers. The change.org website will ask for your name, direction, postal code, and email. The website may later send you additional emails regarding other petitions, but you can opt out at any time.

Posted in news | Tagged , , | 2 Comments

Updates on docker and bioinformatics

My previous post on docker and bioinformatics received some good attention on Twitter. It’s nice because it means that this technology is getting the right attention in the bioinformatics community.

Here are a few resources and articles I’ve found thanks to the conversations in Twitter.

  • Performances of Docker on a HPC cluster – a nice article showing that running a NGS pipeline in a docker container costs about 4% of the performances. It’s up to you to decide whether this is a big or a small price to pay.
  • biodocker is a project by Hexabio aiming at providing many containers with bioinformatics application. For example, you can get a virtual machine with biopython or samtools installed in a few minutes. Update: this may have been merged with bioboxes (see discussion)
  • oswitch is a nice implementation of docker from the Queen Mary University of London, which allows to quickly switch between docker images. I like the examples in which they run a command from a virtual image and then return directly to another environment.
  • ngeasy, a Next Generation Sequencing pipeline implemented on Docker, by a group from the King’s College of London (I work in the same institute but I didn’t know them!).
  • a nice discussion on Biostar on how a reproducibility problem could be solved with Docker.
  • a Docker symposium planned for the end of 2015 here at King’s.
  • BioPython containers by Tiago Antao, including some ipython tutorials

Docker is another innovation for data analysis introduced in 2014. I am surprised by how many good things were released last year, including docker and the whole dplyr/tidyr bundle. Let’s see what 2015 will bring!

Posted in methodology, news | Tagged , , , | Leave a comment

The Network of Cancer Genes database

In the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.

NCG logo

The Network of Cancer Genes database

The main focus of NCG is to provide a curated list of genes associated to cancer, obtained after a manual review of the literature, and classified by cancer subtypes. Moreover NCG annotates some system-level properties of genes associated to cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list will describe you what you can get from NCG and how can it be useful to you.

Continue reading

Posted in methodology, projects | Tagged , , , , , , , | 10 Comments

Reproducible bioinformatics pipelines with docker

I have recently come across a nice article explaining what Docker is and how it can be useful for bioinformatics. I’ll leave you to the article for more details, but basically Docker is an easy way to define a virtual machine, which makes it very straightforward for other people to reproduce the results of an analysis, with little effort from our side.

For example, let’s imagine that we are just about to submit a paper, and that our main results are based on the Tajima’s D index from the data in 1000 Genomes. The journal may ask us to show how to reproduce the analysis: which files did we used as input? Which tool did we use to calculate the Tajima’s D?

In this case, a docker file may be like the following:

FROM ubuntu
MAINTAINER Giovanni DallOlio <dalloliogm@gmail.com>

# Install all the software needed to run the pipeline
RUN apt-get -qq update
RUN apt-get install -y wget git tabix vcftools
RUN apt-get install -qqy python3-setuptools python3-docutils python3-flask
RUN easy_install3 snakemake


# clone the most recent version of the pipeline
WORKDIR /home/user/
RUN git clone https://github.com/dalloliogm/pipeline_play.git

# download the input files
WORKDIR /home/user/pipeline_play
RUN tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:23862589-25055718 > myregion.vcf


# Execute the pipeline
RUN snakemake

The first part of this docker file will set up an ubuntu virtual machine, and install all the software needed to execute the pipeline: tabix, vcftools, snakemake. The second part will  clone the latest version of the pipeline in the virtual machine, and then use tabix to download a portion of chromosome 22 from the 1000Genomes ftp. The third part runs the pipeline, by executing a snakemake rule.

You can run this docker container by running the following:

docker build -t bioinfoblog_test https://raw.githubusercontent.com/dalloliogm/docker_play/master/1000genomes/Dockerfile

This will take quite a while to run, and will build a docker virtual image in your system. Afterwards, you can run the following:

docker run -i -t bioinfoblog_test /bin/bash

This command will open an interactive shell in the virtual machine. From there you will be able to inspect the output of the pipeline, and eventually, if this pipeline were more complex than a mock example, run other rules and commands.

This system makes it very easy to provide an environment in which our results can be reproduced. It is also very useful if we work from more than one workstation – e.g. if we need to have the same configuration at home and in the lab.

Just a few more links on docker and bioinformatics:

Posted in methodology, news | Tagged , | 3 Comments

the most useful R function of the week: unnest from tidyr

There are many great functions in CRAN and BioConductor, and certainly saying that unnest from the tidyr package is the best is a big exaggeration. However this function solved a big problem in data formatting that made me waste a lot of time in the past, that I was surprised no one had implemented a function for it yet.

Imagine we have a dataframe like the following:

> mygenes
Entrez  symbols
7841    MOGS,CDG2B,CWH41,DER7,GCS1
4248    MGAT3,GNT-III,GNT3
5728    PTEN,BZS,CWS1,DEC,GLM2,MHAM,MMAC11,TEP1

The first column contains the Entrez of each gene. This columns is fine, as it contains only one value per row, and it is easy to query or join with other dataframes. The second column, however, contains a comma-separated list of gene names, all associated to the same Entrez IDs. This column is a mess to deal with, because we need to use grepl to query it, and we can’t join it with other dataframes as long as it is in this form.

The unnest function from tidyr allows to convert this data frame in a “tidier” form, containing one row for each combination gene symbol and alias:

> library(tidyr)
> library(dplyr)
> mygenes %>% 
    mutate(symbols=strsplit(as.character(symbols), ",")) %>% 
    unnest(symbols)

   Entrez symbols
1    7841    MOGS
2    7841   CDG2B
3    7841   CWH41
4    7841    DER7
5    7841    GCS1
6    4248   MGAT3
7    4248 GNT-III
8    4248    GNT3
9    5728    PTEN
10   5728     BZS
11   5728    CWS1
12   5728     DEC
13   5728    GLM2
14   5728    MHAM
15   5728  MMAC11
16   5728    TEP1

This code makes use of the %>% and some functions from the dplyr package, but it is still R!

Having the dataframe in this long form makes it a lot easier to deal with it. For example, let’s imagine that somebody asks us to get the Entrez IDs for the list of gene symbols DER7 and DEC. We would just have to do a simple subset on the dataframe:

> unn %>% 
    mutate(symbols=strsplit(as.character(symbols), ",")) %>% 
    unnest(symbols) %>% 
    subset(symbols %in% c("DER7", "DEC"))

   Entrez symbols
4    7841    DER7
12   5728     DEC

This is just a silly example, which may have been solved with some application of apply and grepl, but in the real world there are a lot of more complex applications for it. For example, here is some code I used to split Blat output into one line per exon (or blat alignment block):

Continue reading

Posted in news | 4 Comments