Are fitness genes more conserved across species? My 30-minute attempt

A recently published paper by Hart et al. presented a genome-wide CRISPR screen to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potential of CRISPR to generate large-scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion, the authors propose that fitness genes are more likely to be conserved across species. However, they do not follow up on this hypothesis, probably for lack of space. They can’t be blamed, as they already present a lot of results in the paper.

Figure: Distribution of conservation scores in the phastCons100way.UCSC.hg19 track. Are essential genes more conserved than other genes?

This post presents a follow-up analysis of the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from Bioconductor data packages, and do a Wilcoxon test to compare the two distributions. The full code is available as a GitHub repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.
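To make the comparison step concrete, here is a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney) test in plain Python. It is not the code from the repository, which uses R/Bioconductor; it assumes the normal approximation with a tie correction and no continuity correction, and the function names and the toy scores at the bottom are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test (normal approximation, tie-corrected).

    Returns (U statistic for x, z score, p-value).
    Raises ZeroDivisionError if every value is identical.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over groups of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, z, p_value


# Toy conservation scores, invented for illustration only.
fitness_scores = [0.91, 0.85, 0.97, 0.78, 0.88]
other_scores = [0.42, 0.65, 0.30, 0.81, 0.55]
u, z, p = rank_sum_test(fitness_scores, other_scores)
print(f"U={u}, z={z:.2f}, p={p:.3f}")
```

With small samples like the toy data, an exact test would be more appropriate than the normal approximation; the approximation is reasonable at the genome-wide scale described in the post.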

Continue reading

NCG enrichment implemented in DOSE

I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GOSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.

The new function is called enrichNCG, and can be found in the GitHub version of DOSE. You can use it to analyze a list of genes, and determine if they are enriched in genes known to be mutated in a given cancer type. For example, a random list composed of genes having an Entrez ID between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:
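As a rough illustration of the statistic behind this kind of over-representation analysis, here is a minimal Python sketch of a one-sided hypergeometric test. It is not the DOSE code and does not call enrichNCG; I am assuming the usual hypergeometric model for enrichment, and the function names and the gene IDs in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, n_annotated of which carry the annotation."""
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrichment_pvalue(gene_list, term_genes, universe):
    """Enrichment p-value of one term (e.g. a cancer type) in a gene list."""
    universe = set(universe)
    selected = set(gene_list) & universe
    annotated = set(term_genes) & universe
    overlap = selected & annotated
    return hypergeom_upper_tail(
        len(universe), len(annotated), len(selected), len(overlap)
    )


# Hypothetical example: 20 genes in the universe, 5 annotated to a term,
# and a list of 4 genes that are all annotated.
universe = [f"gene{i}" for i in range(20)]
term_genes = universe[:5]
gene_list = universe[:4]
print(enrichment_pvalue(gene_list, term_genes, universe))
```

When many terms are tested at once, the resulting p-values also need a multiple-testing correction, which this sketch leaves out.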

If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.

If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:
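The idea of giving more weight to high-scoring genes can be sketched with a running-sum enrichment score in the style of classic GSEA. This is a simplified Python illustration under my own assumptions, not the DOSE implementation: it computes only the score, with no permutation-based p-value, and the function name and gene scores are invented.

```python
def enrichment_score(gene_scores, gene_set, weight=1.0):
    """Weighted running-sum enrichment score.

    gene_scores: dict mapping gene name to a numeric score.
    gene_set: collection of gene names to test.
    weight: exponent applied to |score| for genes in the set.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    gene_set = set(gene_set)
    # Walk the genes from the highest score to the lowest.
    ranked = sorted(gene_scores.items(), key=lambda kv: kv[1], reverse=True)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in gene_set)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, scored genes")

    running = 0.0
    best = 0.0
    for gene, score in ranked:
        if gene in gene_set:
            # Hits move the sum up in proportion to their score.
            running += abs(score) ** weight / hit_total
        else:
            # Misses move it down by a constant step.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Invented scores: genes in the set sit at the top of the ranking.
scores = {"geneA": 4.0, "geneB": 3.0, "geneC": 2.0, "geneD": 1.0}
print(enrichment_score(scores, {"geneA", "geneB"}))
```

A set concentrated at the top of the ranking gives a score close to 1, and a set concentrated at the bottom gives a score close to -1.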

You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:

It is worth mentioning that the DOSE package can also calculate enrichment in the Disease Ontology database, which associates genes with disease terms. In my experience, for bioinformaticians Disease Ontology is more useful than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you will have to text-mine the descriptions, and that can lead to a lot of noisy data.

Have a good enrichment with DOSE and NCG 😉

The Network of Cancer Genes database

Over the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.


The main focus of NCG is to provide a curated list of genes associated with cancer, obtained after a manual review of the literature, and classified by cancer subtype. Moreover, NCG annotates some system-level properties of genes associated with cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list describes what you can get from NCG and how it can be useful to you.

Continue reading

A script to fetch images from the UCSC browser

The UCSC browser is a nice, useful but “mammoth-ish” bioinformatics tool that, despite its Web 1.0 look, can be a very powerful ally for any bioinformatician or biologist.

I have to admit that for many years I avoided using the UCSC browser, dismissing it because of its very old-fashioned look. It was silly of me to think that way, but its interface is objectively dated: for example, the user has to reload the whole page to update the visualization, and the fonts are not anti-aliased and look ugly. To me, it didn’t seem “professional” to use a pre-Ajax website for doing research.

Recently, however, I have changed my mind, as I discovered that this tool can be very powerful for integrating data from different sources and for doing “mash-ups”. A local UCSC browser instance can be installed on a computer and used as a central repository for all the annotations produced in a research unit: for example, sequencing data, results from experiments and from statistical genome-wide tests, etc. If all the custom annotations produced in a lab are available in a local UCSC browser instance (either as custom tracks or as tables), it is possible to compare them with each other, and also with publicly available annotations, such as the positions of genes, non-coding regions and much more. The real strength of this tool is that, if you have a workflow to automate the retrieval of data from it, you can compare your results with virtually anything that is known about a genome.

So, let’s get to the point: I wrote a script to automatically fetch screenshots from a UCSC browser instance. It is available at this page:

The first difficulty I faced when writing this script was that there are many possible options for defining a region and how to visualize it. So, I made the script require three different configuration files: one for the regions to be visualized, one for the tracks to be shown, and one for the connection parameters. Here is how you would call it:

Have a look at this PDF, which was created with this script. If you continue reading the post, I will also describe the different configuration files.

Figure: Example of a report created by this script.
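To give an idea of how the three kinds of settings (regions, tracks, connection) fit together, here is a minimal Python sketch that turns them into hgTracks URLs. It is not the script from the post and it does not download any image; the variable names, the example regions and the dictionary layout are hypothetical. The only thing taken from the UCSC browser itself is the URL convention of passing `db`, `position` and `trackName=visibility` as query parameters to the `hgTracks` CGI.

```python
from urllib.parse import urlencode


def build_hgtracks_url(base_url, db, chrom, start, end, tracks):
    """Build an hgTracks URL for one region.

    base_url: address of the browser instance (the connection settings).
    db: genome assembly, e.g. "hg19".
    chrom, start, end: the region to display.
    tracks: dict mapping track name to visibility (hide, dense, pack, full...).
    """
    params = {"db": db, "position": f"{chrom}:{start}-{end}"}
    params.update(tracks)
    return f"{base_url.rstrip('/')}/cgi-bin/hgTracks?{urlencode(params)}"


# Hypothetical stand-ins for the three configuration files.
connection = {"base_url": "http://genome.ucsc.edu", "db": "hg19"}
tracks = {"knownGene": "pack", "snp138": "hide"}
regions = [
    ("chr1", 1000, 2000),
    ("chr7", 5500000, 5600000),
]

for chrom, start, end in regions:
    url = build_hgtracks_url(
        connection["base_url"], connection["db"], chrom, start, end, tracks
    )
    print(url)
```

Keeping regions, tracks and connection settings separate means the same list of regions can be rendered with a different set of tracks, or against a local browser instance, without touching the other two.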

Continue reading