Are fitness genes more conserved across species? my 30-minutes attempt

A recently published paper by Hart et al presented a genome-wide CRISPR screening to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potentiality of CRISPR to generate large scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be more conserved across species. However they do not follow-up on this hypothesis, probably for lack of space. They can’t be blamed as they already present a lot of results in the paper.

Distribution of conservation scores in the phastcons.100way.UCSC.hg19 track. Are essential genes more conserved than other genes?
Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes?

This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from bioconductor data packages, and do a Wilcoxon test to compare the two distribution. The full code is available as a github repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.

Getting the data

Like most bioinformatics pipelines, half of the time is spent in preparing the data and importing other datasets. Luckily with the BioConductor data packages we can access a wide range of datasets without many efforts.

1. loading data & calculating number of rows

The starting point is table S3 from the paper, containing a list of 17,661 gene symbols, along with a bayesian score indicating whether gene is a “fitness gene” (essential) in a given cell line. Unfortunately the table doesn’t contain any gene id, so we will have to convert the gene symbols manually.

The data can be read using the xlsx library. In this analysis I will make an heavy use of dplyr and ggplot2, so let’s load these packages as well:

The last command highlights a possible problem with the data. There are 17,661 rows in the table, but only 17,649 are unique! Thus, some gene symbols are duplicated. To find them, we can combine filter() with duplicated():

This also shows another problem with the data: it seems that some of the symbols have been converted to dates in excel (01-Mar, 02-Mar). We will fix these later.

2. converting symbols to entrez

We will start by loading the Homo.sapiens package, a container for many other packages which we will use in the analysis. See my tutorial on working with Genomics Data for more information.

The table for the symbol 2 entrez conversion is org.Hs.egSYMBOL2EG. There are many ways to use this data package but I prefer to convert it to a dataframe and do a left join:

The left_join command added a new column called gene_id, containing the entrez id of each gene.

At this point we should check whether all the id conversions occurred correctly. In principle if we had a 1:1 correspondence between gene symbol and entrez, we would expect to find 17,649 unique symbols (as we saw previously), and 17,649 entrez. However this doesn’t seem to be the case:

To identify which genes have not been converted correctly, we can just subset the rows where gene_id is NA. To improve readability, we can select only the symbol column, transpose it, and paste it to get a list of symbols that are not converted:

The output shows that there are a lot of Orf genes, that some gene symbols have been converted to dates, and that other symbols are not identified. The latter must be un-official gene symbols used by the authors instead of the official Hugo symbols.

These problems can be solved by manually renaming the “date” genes, and by using the org.Hs.egALIAS table instead of org.Hs.egSYMBOL. This will however generate duplicated ids, which will have to filter out:

In the end we have 17,648 gene symbols corresponding to 17,352. For simplicity we remove all the duplicated entrez, and take only one per entrez.

2. Getting the conservation scores

The next step is to get the conservation scores for all human genes. My first thought was to use AnnotationHub to get the scores from UCSC. However, it seems it’s not there:

The reason why the conservation scores are not in Annotation Hub is that there is a separate package for them. Thanks to Robert Castelo from UPF we can get the scores from the package phastCons100way.UCSC.hg19:

The first instruction gets the coordinates of all human genes from the TxDb object, imported when we loaded Homo.sapiens. We also filter this object to retain only the genes tested in the paper.

The second instruction gets the conservation scores across 100 species, from the phastCons package.

Finally, we print the allgenes object to see if everything went well. Notice that only 17,095 of the 17,352 unique genes have coordinates – the rest are probably withdrawn genes or genes with no annotation.

3. Are core fitness genes more conserved across species?

Now that we have the correct entrez id and the scores, we can join them in a dataframe and start doing some analysis.

First, let’s create a dataframe with all the information:

The first command extracted the gene_id and conservation scores from the GenomicRanges object (mcols), transformed it to a data.frame, joined it with the other table to add the bayesian fitness scores, and then added a column to distinguish fitness and non-fitness genes.

We can plot the two distributions using ggplot2:

conservation
Distribution of conservation scores for fitness and non-fitness genes. Unfortunately, there is no much difference there!

Unfortunately, after all these efforts, it seems that there is not much difference!

To verify this further, we can compare the two distributions with a wilcoxon test:

Our p-values here is pretty clear. The two distributions are not significantly different!

Conclusions

This post provided an example on how the Bioconductor data packages allow to easily get the coordinates and ids of a set of genes, and intersect them with other annotation tables from other sources.

Unfortunately my analysis on the conservation of fitness genes proved to be inconclusive, as there is no much difference between the two sets. However there are many other things that can be tried, from using other tests to determine sequence conservation (e.g. dN/dS), to restricting the analysis to fewer species. If you have any other idea or proposal, feel free to drop a comment here or to contribute to the github repository.

Hart, T., Chandrashekhar, M., Aregger, M., Steinhart, Z., Brown, K., MacLeod, G., Mis, M., Zimmermann, M., Fradet-Turcotte, A., Sun, S., Mero, P., Dirks, P., Sidhu, S., Roth, F., Rissland, O., Durocher, D., Angers, S., & Moffat, J. (2015). High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities Cell, 163 (6), 1515-1526 DOI: 10.1016/j.cell.2015.11.015

3 Comments

  1. Love it! This is a great use of our data that I hoped someone would follow up on…and a very nice R tutorial as well. I only wish I had seen it sooner!

    A couple of comments:

    First, I’m deeply embarrassed that we published the Excel-modified gene names. D’oh! I have HGNC IDs for all those genes, if you’re still interested.

    Second, are conservation scores calculated across the whole gene locus, or just the exons? As you know, exons are typically much much smaller than introns and are likely to be the only conserved component of the gene. It seems very likely that if introns are included, the lack of conservation there would dominate any conserved signal in the exons.

    Third, I’m sure we have more than a few false positives across all the hits in all the screens. We defined the “core essentials” as hits in >=3 of the five TKO screens (everything but A375). Maybe the signal increases as numTKOhits increases; I’d expect the numTKOhits=5 set to be very highly conserved indeed (in fact we know this to be true).

    Cheers! Nice work!

    -Traver

    1. Hi,
      thanks for replying, it’s a pleasure to see your comment 🙂
      It’s really a nice paper. I saw other publications using a similar pooled CRISPR approach after reading it.

      Don’t worry about the gene symbol problem, I’ve seen much worse :-).

      The point of calculating the conservation scores across exons only is very valid. It is quite an easy change in R, just calling the exons() functions instead of genes(), with a column for the gene id. I’ll see if I have time to repeat the analysis next week. I can also use a less stringent filter to define the core essentials genes, as you suggested.

Leave a Reply