Are fitness genes more conserved across species? my 30-minutes attempt

A recently published paper by Hart et al presented a genome-wide CRISPR screening to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potentiality of CRISPR to generate large scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be more conserved across species. However they do not follow-up on this hypothesis, probably for lack of space. They can’t be blamed as they already present a lot of results in the paper.

Distribution of conservation scores in the phastcons.100way.UCSC.hg19 track. Are essential genes more conserved than other genes?
Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes?

This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from bioconductor data packages, and do a Wilcoxon test to compare the two distribution. The full code is available as a github repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.

Continue reading

A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format
DIfference between a dataset in a wide or a long format

Continue reading