a summary of the Workshop on Evolutionary Systems Biology, Edinburgh 2010

This is a summary of the workshop on Evolutionary Systems Biology, which took place a month ago in Edinburgh. At last I have the time to post it here on my blog 🙂

Have a look at my previous general post on this workshop, to see what it is about. I am sorry I could not take notes on all the talks I attended. It is not that they were not interesting, but I just failed to take nice notes on them.

Prof. John Yin, Genetic and environmental impacts on the fitness of an RNA virus: computational models and wet-lab experiments. Systems Biology Theme Leader, Wisconsin Institute for Discovery, Department of Chemical and Biological Engineering, University of Wisconsin-Madison, USA (Homepage)
John Yin and his collaborators have developed a model with 140 equations that, given the genome of a single-stranded mRNA(+) virus, predicts the outcome of the infection. For example, one parameter of these equations is the order of the genes in the genome. They have also verified experimentally how the efficiency of a virus infection changes when you shuffle the order of the genes in the virus, and saw that if a certain gene is not in the first position, the virus does not replicate efficiently.

Pedro Beltrao: Evolution of phosphoregulation: from interactions to function. (Pedro’s blog on bioinformatics)
A study on the conservation of phosphorylation sites among eukaryotes. These sites tend to be loosely conserved, and the position of a phosphorylation site changes frequently. However, the total number of phosphorylation sites in a protein tends to stay the same across many organisms.

1000genomes data as torrent?

The day after the publication of the 1000genomes paper in Nature I attended a talk by one of the authors, Paul Flicek from the EBI, who explained the technical challenges introduced by the 1000genomes data.

He pointed out that, for the first time in history, datasets in biology have reached the sizes of the big datasets in physics and astronomy. The whole of GenBank, even with its exponential growth in recent years, is small compared with the output of a particle accelerator or a big telescope. However, the situation has now changed with the release of 1000genomes, and will change further with the results from other similar studies (UK10K, 10,000 genomes).

Physicists have done a much better job than we bioinformaticians in planning how to deal with huge datasets. For example, while we bioinformaticians are still using HTTP or FTP to download datasets, competing for bandwidth with people watching videos on YouTube, physicists have developed a separate network, alternative to the Internet, to share data. As another example, while we bioinformaticians are still debating whether to use databases or flat files, physicists have developed formats like HDF5 to handle huge collections of data.

Regarding the former issue, I am trying to convince the 1000genomes maintainers to release their data as a torrent. Yesterday we tried to download a 16 GB dataset from their website, but because of connection problems we could never finish the download. Let’s say that in the future we will have to download a 16 GB file every day with the results of newly sequenced genomes: is it feasible to do that over the Internet?

A nice solution, in my opinion, would be to use the torrent protocol, adopted by the BioTorrents project and nicely described in this paper: Langille MGI, Eisen JA (2010).

Using torrents to share big datasets like the 1000genomes sequences would have a lot of benefits. For example:

  • since each torrent carries checksums of the data (SHA-1 hashes of its pieces), everybody will always get the correct data, without transfer errors. If you download data from a website, a transfer error can always go unnoticed; a torrent client, however, verifies the checksums before declaring the download complete.
  • this will save a lot of bandwidth for the 1000genomes site, reducing their costs and allowing them to make better use of their resources.
  • a torrent is more likely to remain available, even if the 1000genomes authors decide not to support it anymore. It will be easier to trace back old datasets, therefore making research more reproducible.
  • a torrent is easier to download, because even if you have a bad internet connection, you can stop and resume the transfer at any time.
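
The integrity check mentioned in the first point can be sketched roughly as follows. This is only an illustration, not the code of any real client: a torrent client checks the data piece by piece against the hashes listed in the .torrent file, so a damaged piece can be fetched again on its own. The function names, the piece size, and the test data below are all made up for the example.

```python
import hashlib
from itertools import zip_longest


def piece_digests(data: bytes, piece_length: int, algorithm: str = "sha1") -> list:
    """Split data into fixed-size pieces and return one hex digest per piece."""
    return [
        hashlib.new(algorithm, data[start:start + piece_length]).hexdigest()
        for start in range(0, len(data), piece_length)
    ]


def corrupted_pieces(data: bytes, expected: list, piece_length: int,
                     algorithm: str = "sha1") -> list:
    """Return the indexes of the pieces whose digest differs from the expected one."""
    actual = piece_digests(data, piece_length, algorithm)
    # zip_longest also reports missing or extra pieces as mismatches
    return [
        index
        for index, (got, want) in enumerate(zip_longest(actual, expected))
        if got != want
    ]


# Hypothetical 16 KiB "dataset" split into 4 KiB pieces
original = bytes(range(256)) * 64
expected = piece_digests(original, piece_length=4096)

# Simulate a transfer error: flip one byte inside the second piece
damaged = bytearray(original)
damaged[5000] ^= 0xFF

print(corrupted_pieces(original, expected, 4096))        # no damaged pieces
print(corrupted_pieces(bytes(damaged), expected, 4096))  # only the second piece
```

Only the pieces reported as corrupted would need to be downloaded again, which is also why an interrupted transfer can be resumed instead of restarted.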

Let’s see if they will agree on this!

farewell to a friend

I just wanted to say goodbye to my cat Lanci, beloved friend and almost brother. Last week, at the age of 19, he left me and my family to go and play somewhere up there. He would have been 94 in human years; however, this time passed too fast for me.

He was the featured star of a videogame I programmed back at the University (I don’t recommend that you download the code; I left it on an old webpage of mine and it must be full of viruses by now), and I tried to involve him in an experiment to demonstrate empathy in animals. I won’t post many pictures here, but imagine that I was 8 years old when he came into our house.

He is probably waiting for me to go take a nap in the sun, or to purr next to me on the sofa, or to play and bite my hands. I will miss him a lot.


1000 genomes era has started!

The paper from the 1000 genomes consortium was published yesterday in Nature. It is not yet in PubMed, but you can read the details here:

In the early 2000s, the genome of the first human individual was published. It was a major advance, because it made it possible to study the structure of our genome, identify where genes and non-coding regions are, etc.

However, the draft published in 2000 represented a single individual, or a mix of a few. But what about the differences between two individuals? If I sequence my genome and yours, what kind of differences can we expect to find? Is the genome of a person with deep African ancestry different from that of someone with European origins? It is very important to know these differences, to predict how different people may react to a drug or to an environment.

So, as explained in the article, 1000 genomes is a project to sequence the genomes of about 2,500 individuals. The availability of these sequences will make it possible to study the differences between the genomes of different people. Part of the data has been publicly available for a long time before the publication of the paper; however, now it will be possible to use it for publishing results in a peer-reviewed journal. I expect a lot of publications in the coming months, as a lot of laboratories have already carried out their analyses and were waiting for the publication of this paper to submit.

These should be the links to the papers:

Notice that this is one of the first times in history that Nature has published a paper under a free-access license.

The Evolutionary Systems Biology Workshop in Edinburgh 2010 (first part)

Two weeks ago I attended the second ‘Evolutionary Systems Biology Workshop’ in Edinburgh, organized by Laurence Loewe. The first thing to say is that Edinburgh is a very nice city. It was exactly as I would have imagined a Scottish city: mountains, a castle, people in kilts, soups, pipes, pubs, wonderful beer, and even a Greek temple on a hill. Too bad I didn’t have the time to visit it as a tourist.

The workshop was very interesting and there were a lot of nice talks; however, the biggest point of discussion was: “What is the best definition for Evolutionary Systems Biology?”.

In short, Systems Biology is the science that creates formulas and models to predict how a single individual will react in a certain situation. An example would be: I put a yeast cell in a certain environment, and I want to find the equations and formulas that best predict how this yeast cell will react to the environment.

Evolutionary Systems Biology is similar to Systems Biology, but it tries to predict the reaction of an organism to an environment given its genome or changes to its genome. A nice example, explained by John Yin in the first talk of the workshop, is: I have a single-stranded mRNA virus, whose genome encodes only 4 types of proteins. What happens if I change the order of these genes? Is the virus more efficient at infecting a cell if I put the gene that encodes the polymerase after the one that encodes the capsid, or vice versa?
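
To get a feeling for the size of this question, here is a minimal sketch that simply enumerates every possible gene order. The gene names are placeholders I made up (the talk only tells us there are 4 genes), and the sketch does not model fitness at all: it only lists the candidate genomes that a model like Yin's would then have to score.

```python
from itertools import permutations

# Placeholder names: only "polymerase" and "capsid" are mentioned in the text,
# the other two labels are invented for the example.
GENES = ("polymerase", "capsid", "gene3", "gene4")


def gene_orders(genes):
    """Return every possible ordering of the given genes."""
    return list(permutations(genes))


orders = gene_orders(GENES)
print(len(orders))  # 24 possible genomes for 4 genes (4! = 24)

# Orders in which the polymerase gene comes before the capsid gene
polymerase_before_capsid = [
    order for order in orders
    if order.index("polymerase") < order.index("capsid")
]
print(len(polymerase_before_capsid))  # 12, exactly half of the orders
```

With only 4 genes the whole space can be explored exhaustively, which is what makes this kind of virus a convenient model system.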

So, Evolutionary Systems Biology can be summarized as the science that defines models to predict what happens to an organism when I modify its genome. EvolSysBio also studies how important the order and position of a gene within the genome are to fitness. I admit that this cannot be immediately associated with what we do in my lab (I will explain it in other posts), but it is interesting nevertheless. I wrote down some notes on the talks that I attended; let’s see if I am able to transcribe them properly and put them on this website soon.

References:

1: Lim KI, Yin J. Computational fitness landscape for all gene-order permutations of an RNA virus. PLoS Comput Biol. 2009 Feb;5(2):e1000283. Epub 2009 Feb 6. PubMed PMID: 19197345; PubMed Central PMCID: PMC2627932. Open Access.

2: Loewe L. A framework for evolutionary systems biology. BMC Syst Biol. 2009 Feb 24;3:27. PubMed PMID: 19239699; PubMed Central PMCID: PMC2663779. Open Access.

Links of (enter period of time here) for bioinformaticians

These are links to things I have been reading lately.

Tips for writing scientific articles

The author of the blog Academic English Solutions published some chapters from his latest book, on how to write a scientific article. For example, I didn’t know that the Materials and Methods section of a paper should be written in the passive voice.

Another nice article with tips on how to write scientific papers has been published on Nature/jobs lately:

Good Practices for Bioinformaticians

In case you do not know it already, the authors of Software Carpentry, one of the best resources to learn programming for bioinformaticians, are preparing a new version. Keep an eye on their blog: for example, one nice article that they have published lately is Five rules for a good bioinformatician.

In the same period, Nature/news published another article along the same lines: Computational Science… Error

BioStar

I have already talked about this, but BioStar is becoming a very nice resource for bioinformaticians who don’t know who can answer a technical bioinformatics problem. Really, have a look at the questions already open and see if you can answer any! 🙂

Maybe I should start blogging again

A nice article published today by The Scientist made me think that I should start blogging again.

I started blogging about bioinformatics during the first year of my Master’s degree. At that time I was really happy about it, and I enjoyed reading blogs and participating in web forums about bioinformatics a lot: my training as a bioinformatician is mostly that of an autodidact, as at the time the courses on bioinformatics were not as well organized as they are now, and I learned most of what I know by reading blogs on the topic.

After completing the Master’s and coming back to Barcelona to start the PhD, I started blogging less frequently. Why did I stop blogging? Mostly because of a change in my habits, a lot of new things to learn, problems with my hosting service, laziness, the difficulty of writing in English or of customizing the blog’s appearance, and maybe also because I had a girlfriend (it sounds stupid to say, but now I understand why other people have never thought of blogging: it is difficult to find the time).

Has my research life been better since I started blogging? I don’t know, but what I know for sure is that my research skills would be more modest without the period I spent blogging. I think that the job of a researcher is not so different from the job of a journalist, and that is especially true for bioinformaticians: you have to find something interesting in the sea of data available out there, and then write an interesting article to tell other people about it.

Will I start blogging again after this? I hope so, but after reading the article from The Scientist, I decided that I should write more about my work, the articles I publish and the conferences that I attend. It will be easier and will help my career more. Let’s see if I can do it 🙂

the Kaggle competition: Predicting HIV Progression

I have recently decided to participate in a competition offered by the Kaggle.com website, about writing a predictor for the outcome of an HIV treatment. Since I have not been blogging here for a while now, I will use this as an excuse to start blogging again, and I will dedicate a series of posts here to my ideas for this competition.

Kaggle.com is a new web 2.0 site that proposes competitions to people able to write predictors using machine learning or other techniques. In short, they propose a case study, like this one on HIV, and they give a prize of $500 to the team that writes the best predictor for the data. In theory, if you have a nice case study to propose, you can send it to them and they will also pay you if it is interesting.

The competition I want to participate in is about predicting the outcome of an HIV treatment. In short, you have a set of ~700 individuals who received the treatment, and for each individual you have the sequences of two viral proteins, some other parameters, and the outcome of the treatment. Then, you are given a set of 300 individuals, and you have to predict how the treatment will perform on them.

Since I am more interested in learning about machine learning methods than in the prize money, I will post my thoughts about the competition here. This will probably make for a nice tutorial on the basics of bioinformatics for dummies: how to interpret the data, which databases to query, which tools to use, a bit of bash scripting, … I will start writing tomorrow 🙂

looking for someone to help me annotate the N-Glycosylation pathway on reactome

I am collaborating with the Reactome database to create an entry for the N-glycosylation pathway in humans. Most of the job is already done; the only problem is that I have to find an expert on the topic to review it before it can be published on Reactome.

So, if you know a lot about glycosylation, or know somebody interested, you can contact me… this work will probably lead to a small publication in a peer-reviewed journal, maybe Oxford Journals/Databases or some glycosylation-specific one.

I started annotating the N-glycosylation pathway by myself because I have been studying this pathway very closely and I wasn’t very happy with the annotation on KEGG/Pathways. The N-glycan biosynthesis pathway annotated there is fine, but there are a few errors* and, after trying to contact KEGG’s maintainers a few times, I decided to do all the work by myself on Reactome.

Note: I am not sure if I can post it here already, but here is a link to the provisional entry that I am annotating.

* Errors in the KEGG pathway for N-glycan: some genes and some interactions are missing; some reactions are clustered together in a way I don’t like; and I don’t understand the logic behind the last steps.

A new paper from Sabeti, on clustering the results of different tests for positive selection

Today in our journal club we discussed the latest paper from Sabeti’s lab:

Grossman, S., Shlyakhter, I., Karlsson, E., Byrne, E., Morales, S., Frieden, G., Hostetter, E., Angelino, E., Garber, M., Zuk, O., Lander, E., Schaffner, S., & Sabeti, P. (2010). A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

This paper describes what most researchers in the field have been doing intuitively for a long time: combining the results from multiple tests for positive selection, to obtain a single score per position.

In other fields of computational biology, like structural bioinformatics or maybe sequence alignment, the approach of clustering results has already been applied for a long time (I can think, for example, of this one developed by my former professors in Bologna); however, this is the first time that it is used in population genetics.

The new CMS method described in this paper by Sabeti merges the results from three different tests: the iHS, described by Voight 2006, designed to detect recent positive selection; the XP-EHH, based on iHS but designed to detect alleles which are positively selected in one population but not in others; and the Fst, a measure of population differentiation.

To be honest, our feelings on this paper during the journal club were not extremely positive, as we think that even if the idea is nice, this article is not at the level of others from the same author. The formula used to cluster together the different statistics is a single product, over the tests, of the Bayesian probability that an allele is under selection, given the result of the test and a background calculated with simulations; it is a nice idea, but it doesn’t seem to justify a paper on its own.

Moreover, they cluster together tests which in fact detect different types of positive selection: for example, the XP-EHH detects alleles which undergo a sweep in one population but are neutral in the others, while iHS detects positive selection in general. It is not clear whether these two statistics would overlap, and in which cases: for example, if an allele is positively selected in all humans, but not in any specific human population, it would have a good iHS and a low XP-EHH, so their product would be less interesting than the iHS itself; and the opposite is true as well.
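
To make this concern concrete, here is a toy sketch of such a product of probabilities, written from the description above and not taken from the authors' code. The function names, the prior of 0.5, and all the likelihood values are invented for illustration; in the paper the likelihoods come from simulations.

```python
from math import prod


def posterior_selected(p_if_selected, p_if_neutral, prior=0.5):
    """Bayes' rule: probability of selection given the score of one test.

    p_if_selected and p_if_neutral are the probabilities of observing the
    score under selection and under neutrality (hypothetical values here).
    """
    weighted = p_if_selected * prior
    return weighted / (weighted + p_if_neutral * (1 - prior))


def composite_score(likelihood_pairs, prior=0.5):
    """Multiply the per-test posteriors into a single score per position."""
    return prod(
        posterior_selected(selected, neutral, prior)
        for selected, neutral in likelihood_pairs
    )


# Invented example: an allele selected in all humans alike.
ihs_only = [(0.9, 0.1)]                 # strong iHS signal
ihs_and_xpehh = [(0.9, 0.1), (0.2, 0.8)]  # same iHS, weak XP-EHH signal

print(composite_score(ihs_only))
print(composite_score(ihs_and_xpehh))
```

In this made-up case the second score is much lower than the first, even though the iHS evidence is identical: multiplying by a test that is not expected to fire for this kind of selection dilutes the signal.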

Another point is that they refer to simulations based on selective sweeps made with cosi, but they don’t provide the simulations, and it is not clear how they have obtained them. We are also not convinced that cosi is the best tool to do these simulations, as it can’t simulate certain kinds of selective sweeps.

In general, we had the feeling that the examples that they put in the article weren’t exactly representative of the whole set of results, as many of the regions described as selected in the supplementary data are not very interesting when you look at them closely, and it seems that they showed only the most interesting results.

Anyway, the idea behind the paper is nice, and I am happy that we are finally entering the era of clustering results from different selection tests. I wonder if the next article will describe a weighted product of different tests for selection, or if someone will first try to add new tests, like CLR or Tajima’s D.