notes on the collaborative manuscript on GWAS risk studies

Here are a few notes about contributing to the Nature Genetics manuscript that I was talking about in a previous post.

note2: I have opened a discussion on Biostar; if you are interested in contributing, have a look there too.

Scope and purpose of the article

The main purpose of the article is to explain how the results of a GWAS can be functionally validated. Let’s say that a study has identified a SNP variant that is likely to be associated with a trait: the collaborative article describes the methods that can be used to demonstrate the association, from identifying eQTLs and studying them with microarrays to building animal models that simulate the effect of the variant.

In my opinion the key to understanding what the manuscript is about, and why it is being written collaboratively, lies in this recent Nature Genetics editorial:

The authors of the editorial say that most of the time, after a GWAS has found an association between a SNP variant and the risk for a trait, the result is not followed by a functional characterization of the SNP.

(edit) Moreover, you should also look at the home page of the group that wrote the original draft, which is the Post Genome Wide Association Study Initiative.

It seems that the same Nature Genetics authors are sponsoring this article as a way to promote discussion about what comes after GWAS. Instead of inviting a few selected authors to write a review on the topic, they are calling for help from all the interested scientists on the Internet. I think it is a nice way to promote discussion.

Ideas for new paragraphs

23andMe kits for $99 ($159) for the weekend

23andMe is offering their kits for $99 instead of $499 for the weekend. Actually, if you read the conditions carefully, the real price is $159, because you have to sign up for their ‘Personal Genomics Service’ for at least one year.

If you are really interested in buying a 23andMe kit, you should know that every now and then they make such offers. The last time they did it was on April 23rd, 2010, in honor of DNA Day.

As you may already have read elsewhere, 23andMe doesn’t sequence the whole genome but only genotypes a small subset of SNPs, and then reports the associations between your genotypes and some congenital diseases. I don’t actually believe that any of the information they give is useful to learn about your health, because no SNP association study has been able to explain more than 5% of the genetic variability. I recommend buying the kit only if you work in this field and you are interested in the topic, as $159 is a good offer for it.

source of the news: reddit/bioinformatics

WikiGenes: looking for authors for a Nature Genetics paper

WikiGenes is looking for contributors to a Nature Genetics paper on Genome Wide Association Studies. If you don’t mind being an author on Nature Genetics, have a look at this mail I have received:

Dear Giovanni Marco,

The editor of Nature Genetics has commissioned a collaborative standards paper on Genome Wide Association Studies. An editable draft of this paper is now online at WikiGenes,

I hope this is an interesting opportunity for you, because significant contributions to this draft might get you a co-authorship on the final paper in Nature Genetics.

I would also like to use this occasion to ask you a favor.

If you like WikiGenes, please tell your friends about it. We do not have the budget of big publishers, so we depend fully on word-of-mouth publicity.

Or you could also help us by linking to WikiGenes from your website. Thank you!

a summary of the Workshop on Evolutionary Systems Biology, Edinburgh 2010

This is a summary of the workshop on Evolutionary Systems Biology which took place a month ago in Edinburgh. At last I have the time to post it here on my blog 🙂

Have a look at my previous general post on this workshop to see what it is about. I am sorry I could not take notes on all the talks I attended. It is not that they were not interesting; I just failed to take good notes on them.

Prof. John Yin, Genetic and environmental impacts on the fitness of an RNA virus: computational models and wet-lab experiments. Systems Biology Theme Leader, Wisconsin Institute for Discovery, Department of Chemical and Biological Engineering, University of Wisconsin-Madison, USA (Homepage)
John Yin and his collaborators have developed a model with 140 equations that, given the genome of a single-stranded mRNA(+) virus, predicts the outcome of the infection. For example, one parameter of these equations is the order in which the genes are encoded. They have also verified experimentally how the efficiency of a virus infection changes when you shuffle the order of the genes in the virus, and saw that if a certain gene is not in the first position the replication of the virus is not efficient.

Pedro Beltrao: Evolution of phosphoregulation: from interactions to function. (Pedro’s blog on bioinformatics)
A study on the conservation of phosphorylation sites among eukaryotes. These sites tend to be loosely conserved, and the position of a phosphorylation site changes frequently. However, the total number of phosphorylation sites in a protein tends to stay the same across many organisms.

1000genomes data as torrent?

The day after the publication of the 1000genomes paper in Nature I attended a talk by one of the authors, Paul Flicek from the EBI, who explained the technical challenges introduced by the 1000genomes data.

He pointed out that, for the first time in history, datasets in biology have reached the size of the big datasets in physics and astronomy. The whole of GenBank, even with its exponential growth in recent years, is small compared with the output of a particle accelerator or a big telescope. However, the situation has now changed with the release of 1000genomes, and will change further with the results of other similar studies (UK10K, 10,000 genomes).

Physicists have done a much better job than we bioinformaticians in planning how to deal with huge datasets. For example, while we bioinformaticians are still using the HTTP or FTP protocols to download datasets, competing for bandwidth with people watching videos on YouTube, physicists have developed an alternative network to the Internet to share data. As another example, while we bioinformaticians are still debating whether to use databases or flat files, physicists have developed formats like HDF5 to handle huge collections of data.

Regarding the former issue, I am trying to convince the 1000genomes maintainers to release their data as a torrent. Yesterday we tried to download a 16 GB dataset from their website, but because of connection problems we could never finish the download. Let’s say that in the future we will have to download a 16 GB file every day with the results of newly sequenced genomes: is it feasible to do it over the Internet?
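To get a feeling for the numbers, here is a back-of-the-envelope sketch in Python. The link speeds are hypothetical values chosen only for illustration, the function name `transfer_hours` is mine, and I assume 1 GB = 1000 MB and a connection that sustains its nominal speed with no protocol overhead, which real connections rarely do.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of the given speed.

    Assumes decimal units (1 GB = 8000 megabits) and no protocol overhead.
    """
    megabits = size_gb * 8 * 1000
    seconds = megabits / link_mbit_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical sustained speeds, in Mbit/s.
    for speed in (10, 100, 1000):
        print(f"{speed:>5} Mbit/s: {transfer_hours(16, speed):.2f} hours")
```

Under these assumptions a 16 GB file needs about three and a half hours on a sustained 10 Mbit/s link, so a daily download is only realistic if the transfer can survive interruptions.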

A nice solution, in my opinion, would be to use the BitTorrent protocol, as adopted by the BioTorrents project and nicely described in this paper: Langille MGI, Eisen JA (2010).

Using torrents to share big datasets like the 1000genomes sequences would have a lot of benefits. For example:

  • since a torrent file stores a hash of every piece of the data (SHA-1 in the original BitTorrent protocol, rather than a single md5 sum), everybody will end up with the correct data. If you download from a website, a transfer error can go unnoticed; a torrent client, instead, checks every piece against its hash before declaring the download complete.
  • this will save a lot of bandwidth for the 1000genomes site, reducing their costs and allowing them to make better use of their resources.
  • a torrent is more likely to stay available, even if the 1000genomes authors decide not to support it anymore. It will be easier to trace back old datasets, therefore making research more reproducible.
  • a torrent is easier to download, because even if you have a bad internet connection, you can pause and resume the transfer at any time.
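The integrity check mentioned in the first point can be sketched in a few lines of Python. BitTorrent clients split the data into fixed-size pieces and compare the SHA-1 hash of each piece with the one listed in the torrent file. This is only an illustration of the idea, not the code of any real client: the function names and the tiny piece size are my own assumptions, and real torrents use much larger pieces.

```python
import hashlib


def piece_hashes(data: bytes, piece_size: int) -> list[bytes]:
    """SHA-1 digest of each consecutive piece of data."""
    return [
        hashlib.sha1(data[start:start + piece_size]).digest()
        for start in range(0, len(data), piece_size)
    ]


def corrupted_pieces(data: bytes, expected: list[bytes], piece_size: int) -> list[int]:
    """Indexes of the pieces that are missing or do not match the expected hash."""
    actual = piece_hashes(data, piece_size)
    bad = []
    for index, wanted in enumerate(expected):
        if index >= len(actual) or actual[index] != wanted:
            bad.append(index)
    return bad


if __name__ == "__main__":
    piece_size = 1024  # hypothetical, only to keep the example small
    original = bytes(range(256)) * 16  # 4096 bytes, i.e. 4 pieces
    expected = piece_hashes(original, piece_size)

    # Simulate a transfer error: one byte flipped inside the third piece.
    damaged = bytearray(original)
    damaged[2500] ^= 0xFF

    print(corrupted_pieces(original, expected, piece_size))
    print(corrupted_pieces(bytes(damaged), expected, piece_size))
```

Because the check is done piece by piece, a client only needs to fetch the damaged pieces again instead of restarting the whole 16 GB download.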

Let’s see if they will agree on this!


farewell to a friend

I just wanted to say goodbye to my cat Lanci, beloved friend and almost brother. Last week, at the age of 19, he left me and my family to go and play somewhere up there. He would have been 94 years old in human years; however, this time passed too fast for me.

He was the featured star of a videogame I programmed back at university (I don’t recommend downloading the code: I left it on an old webpage of mine and it must be full of viruses by now) and I tried to involve him in an experiment to demonstrate empathy in animals. I won’t post many pictures here, but imagine that I was 8 years old when he came into our house.

He is probably waiting for me to take a nap in the sun, or to purr next to me on the sofa, or to play and bite my hands. I will miss him a lot.

1000 genomes era has started!

The paper from the 1000 genomes consortium was published yesterday in Nature. It is not yet in PubMed, but you can read the details here:

In the early 2000s, the genome of the first human individual was published. It was a major advance, because it made it possible to study the structure of our genomes, identify where genes and non-coding regions are, etc.

However, the draft published in 2000 was based on a single individual, or on a mix of a few. But what about the differences between two individuals? If I sequence my genome and yours, what kind of differences can we expect to find? Is the genome of a person with deep African ancestry different from that of someone with European origins? It is very important to know these differences, to predict how different people may react to a drug or to an environment.

So, as explained in the article, 1000 genomes is a project to sequence the genomes of about 2,500 individuals. The availability of these sequences will make it possible to study the differences between the genomes of different people. Part of the data had been publicly available long before the publication of the paper; now, however, it can be used to publish results in a peer-reviewed journal. I expect a lot of publications in the coming months, as many laboratories have already carried out their analyses and were waiting for the publication of this paper to submit them.

These should be the links to the papers:

Notice that this is one of the first times in history that Nature has published a paper under a free-access license.

The Evolutionary Systems Biology Workshop in Edinburgh 2010 (first part)

Two weeks ago I attended the second ‘Evolutionary Systems Biology Workshop’ in Edinburgh, organized by Laurence Loewe. The first thing to say is that Edinburgh is a very nice city. It was exactly as I would have imagined a Scottish city: mountains, a castle, people in kilts, soups, pipes, pubs, wonderful beer, and even a Greek temple on a hill. Too bad I didn’t have the time to visit it as a tourist.

The workshop was very interesting and there were a lot of nice talks; however, the biggest point of discussion was: “What is the best definition of Evolutionary Systems Biology?”.

In short, Systems Biology is the science that creates formulas and models to predict how a single individual will react in a certain situation. An example would be: I put a yeast cell in a certain environment, and I want to find the equations and formulas that best predict how this yeast cell will react to the environment.

Evolutionary Systems Biology is similar to Systems Biology, but it tries to predict the reaction of an organism to an environment given its genome or changes to its genome. A nice example, explained by John Yin in the first talk of the workshop, is: I have a single-stranded mRNA virus, whose genome encodes only 4 types of proteins. What happens if I change the order of these genes? Is the virus more efficient at infecting a cell if I put the gene that encodes the polymerase after the one that encodes the capsid, or vice versa?
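Just to show how large the space of this thought experiment is, here is a small Python sketch that lists every possible order of 4 genes. It does not model the virus in any way and it is not the model presented in the talk: it only enumerates the permutations. The labels `gene_c` and `gene_d` are placeholders of mine, since only the polymerase and the capsid are named above.

```python
from itertools import permutations

# Hypothetical labels: only the polymerase and the capsid are named in the text.
GENES = ("polymerase", "capsid", "gene_c", "gene_d")


def gene_orders(genes=GENES):
    """All possible orders in which the given genes can be arranged."""
    return list(permutations(genes))


def polymerase_before_capsid(order):
    """True if the polymerase gene comes before the capsid gene in this order."""
    return order.index("polymerase") < order.index("capsid")


if __name__ == "__main__":
    orders = gene_orders()
    print(len(orders), "possible gene orders")
    for order in orders[:3]:
        print(" -> ".join(order))
```

With 4 genes there are 4! = 24 possible orders, and in half of them the polymerase comes before the capsid, so a model that scores each order can explore the whole space exhaustively.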

So, Evolutionary Systems Biology can be summarized as the science that defines models to predict what happens to an organism when I modify its genome. EvolSysBio also studies how important the order and position of a gene within the genome are to fitness. I admit that this cannot be immediately associated with what we do in my lab (I will explain it in other posts), but it is interesting nevertheless. I wrote down some notes on the talks that I attended; let’s see if I am able to transcribe them better and put them on this website soon.


1: Lim KI, Yin J. Computational fitness landscape for all gene-order permutations of an RNA virus. PLoS Comput Biol. 2009 Feb;5(2):e1000283. Epub 2009 Feb 6. PubMed PMID: 19197345; PubMed Central PMCID: PMC2627932. Open Access.

2: Loewe L. A framework for evolutionary systems biology. BMC Syst Biol. 2009 Feb 24;3:27. PubMed PMID: 19239699; PubMed Central PMCID: PMC2663779. Open Access.