A new paper from Sabeti, on clustering the results of different tests for positive selection

Today in our journal club we discussed the latest paper from Sabeti’s lab:

Grossman, S., Shylakhter, I., Karlsson, E., Byrne, E., Morales, S., Frieden, G., Hostetter, E., Angelino, E., Garber, M., Zuk, O., Lander, E., Schaffner, S., & Sabeti, P. (2010). A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

This paper describes what most researchers in the field have been doing intuitively for a long time: combining the results from multiple tests for positive selection to obtain a single score per position.

In other fields of computational biology, such as structural bioinformatics or perhaps sequence alignment, the approach of clustering results has been applied for a long time; I can think, for example, of this one developed by my former professors in Bologna. However, this is the first time that it has been used in population genetics.

The new CMS method described in this paper by Sabeti merges the results from several different tests, including: the iHS, described by Voight et al. (2006) and designed to detect recent positive selection; the XP-EHH, which like iHS is built on extended haplotype homozygosity but is designed to detect alleles that are positively selected in one population and not in others; and the Fst, a measure of population differentiation.
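To make the last of these concrete, here is a minimal sketch of Fst for a single biallelic site in two populations, using the textbook definition (H_T - H_S) / H_T. It assumes equal population sizes and applies no sample-size correction, so it is not necessarily the estimator used in the paper; the function name is my own.

```python
def fst_two_populations(p1, p2):
    """Fst at one biallelic site from the allele frequencies in two populations.

    Assumption: both populations have equal weight and no sample-size
    correction is applied (illustrative only).
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled population
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # site is monomorphic overall, no differentiation to measure
    # Mean expected heterozygosity within each population
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_two_populations(0.9, 0.1))  # strongly differentiated site
    print(fst_two_populations(0.5, 0.5))  # identical frequencies
```

Identical frequencies give 0, and an allele fixed in one population but absent in the other gives 1.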

To be honest, our feelings about this paper during the journal club were not extremely positive: even if the idea is nice, we think this article is not at the level of others from the same author. The formula used to combine the different statistics is a simple product, over the tests, of the Bayesian probability that an allele is under selection given the result of each test, with the background distributions calculated from simulations. It is a nice idea, but it doesn’t seem to justify a paper of its own.
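As I understand it, the combination can be sketched as follows. This is my own minimal reading of the formula, not the authors' code: for each test, Bayes' rule turns the likelihood of the observed score under the "selected" and "neutral" simulations into a posterior, and the composite score is the product of those posteriors. The function names, the prior and the example likelihoods are all hypothetical.

```python
from math import prod


def posterior_selected(lik_selected, lik_neutral, prior):
    """Posterior probability of selection for one test, by Bayes' rule.

    lik_selected / lik_neutral: probability of the observed score under
    simulations with and without selection (assumed to be given).
    prior: prior probability that a site is selected (hypothetical value).
    """
    numerator = lik_selected * prior
    denominator = numerator + lik_neutral * (1.0 - prior)
    return numerator / denominator


def composite_score(likelihood_pairs, prior):
    """Product of the per-test posteriors for one site."""
    return prod(posterior_selected(ls, ln, prior) for ls, ln in likelihood_pairs)


if __name__ == "__main__":
    # One (selected, neutral) likelihood pair per test; invented numbers
    site = [(0.30, 0.02), (0.25, 0.05), (0.40, 0.10)]
    print(composite_score(site, prior=0.01))
```

Note that a test whose score is equally likely under both scenarios simply returns the prior, so it still pulls the product down instead of being ignored.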

Moreover, they cluster together tests that in fact detect different types of positive selection: for example, the XP-EHH detects alleles that have undergone a sweep in one population but are neutral in the others, while iHS detects positive selection in general. It is not clear whether, and in which cases, these two statistics would overlap. For example, an allele that is positively selected in all humans, rather than in one specific population, would have a high iHS and a low XP-EHH, so their product would be less informative than the iHS alone; and the opposite is true as well.
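A toy calculation shows the concern. The per-test probabilities below are invented for illustration and do not come from the paper; the point is only that a product lets one low score mask a strong signal from another test.

```python
from math import prod

# Hypothetical per-test probabilities of selection for two sites
global_sweep = {"iHS": 0.90, "XP-EHH": 0.05}  # selected in all populations
local_sweep = {"iHS": 0.90, "XP-EHH": 0.85}   # selected in one population only


def product_score(per_test):
    """Plain product of the per-test probabilities."""
    return prod(per_test.values())


if __name__ == "__main__":
    print(product_score(global_sweep))  # far below the iHS value of 0.90
    print(product_score(local_sweep))
```

Both sites have the same iHS, but the product ranks the global sweep far below the local one (about 0.045 against about 0.765).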

Another point is that they refer to simulations of selective sweeps made with cosi, but they don’t provide the simulations, and it is not clear how they obtained them. We are also not convinced that cosi is the best tool for these simulations, as it can’t simulate certain kinds of selective sweeps.

In general, we had the feeling that the examples shown in the article weren’t exactly representative of the results as a whole: many of the regions described as selected in the supplementary data are not very interesting when you look at them closely, and it seems that they ‘showed only the most interesting results’.

Anyway, the idea behind the paper is nice, and I am happy that we are finally entering the era of clustering results from different selection tests. I wonder whether the next article will describe a weighted product of different tests for selection, or whether someone will first try to add new tests, such as CLR or Tajima’s D.
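Just to show what I mean by a weighted product, here is a speculative sketch: each per-test probability is raised to a weight before multiplying, which is the same as a weighted sum in log space. Nothing here comes from a published method; the function name and the weights are hypothetical, and the weights are assumed to be non-negative.

```python
import math


def weighted_product(posteriors, weights):
    """Weighted product prod(p_i ** w_i), computed in log space.

    Assumption: weights are non-negative. A weight of 0 drops a test,
    a weight of 1 reproduces the plain product.
    """
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per test")
    log_score = 0.0
    for p, w in zip(posteriors, weights):
        if w == 0:
            continue  # a zero weight removes this test from the score
        if p <= 0.0:
            return 0.0  # a zero probability with positive weight zeroes the score
        log_score += w * math.log(p)
    return math.exp(log_score)


if __name__ == "__main__":
    probs = [0.90, 0.05, 0.60]  # invented per-test probabilities
    print(weighted_product(probs, [1.0, 1.0, 1.0]))  # plain product
    print(weighted_product(probs, [1.0, 0.2, 1.0]))  # second test down-weighted
```

Down-weighting a test in this way would soften the problem of one discordant statistic dominating the composite, at the cost of having to choose the weights.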