The true story behind the annotation of a pathway

These slides are from a talk I gave earlier this week to my lab, describing two papers we published recently:

(The slides are published on Nature Precedings; you can vote for them here.)

Bioinformaticians frequently use data and annotations from scientific databases, like KEGG or Uniprot. However, it is difficult to know how accurate these data are, and to what extent they can be used for a large-scale analysis.

So, the talk is about this. Let’s say you dedicate 6 months of your PhD thesis to accurately studying and annotating a set of genes, as I did for the N-Glycosylation pathway: how many errors or unclear annotations would you expect to find in scientific databases?

Another topic discussed in the talk is how to report an error to a database. Many databases do not have a transparent system to report errors, so any incongruence is corrected behind the scenes, which creates problems for reproducibility. Moreover, the work of reporting errors to a database is basically not acknowledged by the scientific community. This is unfortunate, because if it were more recognized we could have better annotations in the databases and a more active scientific community.

References:

  • Dall’Olio GM, Bertranpetit J, & Laayouni H (2010). The annotation and the usage of scientific databases could be improved with public issue tracker software. Database : the journal of biological databases and curation, 2010 PMID: 21186182
  • Dall’Olio GM, Jassal B, Montanucci L, Gagneux P, Bertranpetit J, & Laayouni H (2011). The annotation of the Asparagine N-linked Glycosylation pathway in the Reactome Database. Glycobiology PMID: 21199820

Should I start putting my slides on Nature Precedings?

After the experience with the Post-GWAS article on WikiGenes I started looking at the resources on Nature Precedings, which is where the original idea of that collaborative article came from.

Nature Precedings is a Nature Network website where researchers can post drafts, ideas, and presentations about work that can be published. This is exactly what I suspect happened with the WikiGenes article: the authors from the Post-GWAS consortium published a draft of a letter there, and the letter was noticed by a Nature editor, who suggested turning it into an article and opening it to collaborative editing.

I am looking at the presentations on Nature Precedings and thinking that maybe some of the presentations I have made or attended could be posted there. It is not very clear how the requirements for publication there are interpreted: most of the documents are pre-print papers and drafts, but some of the presentations illustrate software tools that have already been published.

In any case, I think I will start putting something there… I have some ideas that I do not have the time to develop by myself. Maybe if I upload them there, I can find somebody who wishes to collaborate with me.

technical problems solved!

These days I have been having problems with the DNS services, which should have been solved by now.

For approximately one week, this blog was not reachable from everywhere in the world, depending on which DNS servers you were using. Instead, an older version of the site was shown, at least here in Spain and in some parts of Italy.

Everything should be fixed now. Sorry for the inconvenience.

published a geekish paper on reporting errors to scientific databases

We have published a commentary paper about reporting annotation errors to scientific databases. In this work we discuss the fact that reporting annotation errors to a database is usually not acknowledged and not considered a scientific activity, while in our opinion it should be.

Let’s say that you encounter an error in the annotation of a gene in a scientific database. What would you do? Would you report the error to the maintainers so it can be fixed, or would you just pass by and go to another database? Most of the researchers that I know are not even aware that you can report errors to a scientific database, and when they encounter too many errors in one place, they just look for a better annotated resource. This approach is harmful, because the data in scientific databases do not improve as much as they could, and this favors the fragmentation of annotations across many databases.

Moreover, in this paper we argue that error reports for scientific databases should be public and transparent. Let’s say that you find an error in an annotation in Reactome. How can you communicate this to the other people using the same resource? All the researchers using that data for their work will be affected. And on the other side, how do you know whether the data that you are using are correct, or whether someone else has found incorrect or outdated information?

The article is already available online:

Dall’Olio, G. M., J. Bertranpetit, and H. Laayouni. The annotation and the usage of scientific databases could be improved with public issue tracker software. Database 2010 (December): baq035-baq035. doi:10.1093/database/baq035.

If you read it, please leave me some comments about it here. I will publish another blog post later next week to better explain why we wrote it, along with some personal comments.

The best question to ask in a bioinformatics seminar is..

I have opened a new discussion on Biostar, on ‘What is the best question that you have asked, or heard asked, in a bioinformatics seminar?’

For a scientist, it is very important to be able to ask good questions; seminars are a good place to practice. In my case, I am lucky because in the building where I am doing my PhD there are always a lot of seminars, and at least one each week is about bioinformatics.

My favorite question is to ask about the controls that a bioinformatician used to test the software he/she wrote, or the analysis he/she did. When I was studying in Bologna, my former professor of Molecular Biology used to tell us, in almost every lesson, that the most difficult part of designing an experiment is choosing the best controls. I believe that the choice of controls is the moment when a bioinformatician is closest to the biology he/she is studying, because you can’t do that if you don’t know the biology behind your project.

For example, yesterday I attended a seminar by one of the people responsible for Ensembl Compara. My question was: which controls do you use when you update the pipeline to predict orthologs? I was wondering whether the Ensembl Compara programmers, with all their experience in predicting orthologs, know of any gene for which there is so much literature that anyone can be absolutely sure about its orthologs in other species.
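To make the idea of a positive control concrete, here is a minimal sketch in Python. It is not how Ensembl Compara tests its pipeline (I do not know that); the function `check_positive_controls`, the toy predictor, and the gene names are all hypothetical. The idea is simply to keep a small set of cases where the answer is considered certain, and to re-run them every time the pipeline changes.

```python
def check_positive_controls(predict, controls):
    """Run `predict` on every control and collect the disagreements.

    `predict` is any function mapping a query gene to its predicted ortholog.
    `controls` maps a query gene to the ortholog we consider certain.
    Returns a list of (query, expected, observed) tuples, empty if all pass.
    """
    failures = []
    for query, expected in controls.items():
        observed = predict(query)
        if observed != expected:
            failures.append((query, expected, observed))
    return failures


# Hypothetical, made-up controls: NOT real gene names.
CONTROLS = {
    "geneA_human": "geneA_mouse",
    "geneB_human": "geneB_mouse",
}


def toy_predictor(query):
    """Stand-in for a real ortholog prediction pipeline (assumption)."""
    table = {"geneA_human": "geneA_mouse", "geneB_human": "geneC_mouse"}
    return table.get(query)


if __name__ == "__main__":
    for query, expected, observed in check_positive_controls(toy_predictor, CONTROLS):
        print(f"control failed: {query}: expected {expected}, got {observed}")
```

The hard part, as said above, is not the code but choosing the controls: that is where the knowledge of the biology comes in.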

So, what is your favorite question to ask in a bioinformatics seminar? Which controls are you using in your analysis? 🙂

open position at BioDec, bioinformatics company in Bologna (Italy)

There is an open position at BioDec, a bioinformatics company based in Bologna, Italy:

People at BioDec are among the authors of Ensemble, a tool to predict transmembrane helices in proteins. If you have ever clicked on the link ‘Third Party data’ in Uniprot (example), the predictions on transmembrane helices are provided by them. They are also involved in the implementation of other tools developed in Rita Casadio’s lab in Bologna and other bioinformatics laboratories in Italy.

Moreover, BioDec is the company that produces Plone4Bio, a bioinformatics library for Plone. Plone is a framework to make websites with Python, so if you are a web programmer interested in the field of bioinformatics, this may be a good experience for you.

Phylo – the multiple alignment game

Phylo is a game where players are required to manually edit a multiple alignment. The player who can make the best multiple alignment, maximizing the matches and reducing gaps, gets the best score.

This is a very fun and innovative idea. It is based on the principle that humans are better at identifying patterns than computers, and that the problem of calculating a multiple alignment is so complex that even the most advanced multiple alignment software does not find the best solution for a large set of sequences, so manual editing of a multiple alignment is always required.
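To give an idea of what "maximizing the matches and reducing gaps" can mean in practice, here is a minimal sketch of a sum-of-pairs score in Python. This is an assumption for illustration only: I do not know the scoring function that Phylo actually uses, and the weights below (match, mismatch, gap) are arbitrary.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=-1, gap=-2):
    """Score one alignment column by comparing every pair of rows."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # two gaps facing each other are ignored
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def alignment_score(rows, **weights):
    """Sum-of-pairs score of a multiple alignment given as equal-length strings."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of the alignment must have the same length")
    # zip(*rows) iterates over the alignment column by column
    return sum(column_score(column, **weights) for column in zip(*rows))


if __name__ == "__main__":
    before = ["ACGT-", "A-CGT", "ACGT-"]
    after = ["ACGT", "ACGT", "ACGT"]
    print(alignment_score(before), alignment_score(after))
```

A player moving the gaps around is, in effect, searching by eye for the arrangement of rows that gives the highest value of a function like `alignment_score`.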

You may have already heard about fold.it, a similar game based on protein folding: this is the answer for the guys who work in the field of multiple alignments and phylogeny. I am happy because I belong to this second group :-).

(source: http://oggiscienza.wordpress.com/2010/12/09/farmville-fatti-da-parte)

update on the collaborative article on Post-GWAS functional characterization

Here is a summary of the new paragraphs and additions that have been made to the collaborative WikiGenes paper in the past two weeks.

For those who have not been following: the manuscript is a perspective on approaches and good practices to study the function of a variant identified in a GWAS.
It happens too often that, after a GWAS is successful in identifying a relationship between a tag SNP and a disease, these results are not followed by a study on the biological mechanism behind the association, or by studies on the exact location of the causal variant.

Summary of changes (sorry if I forgot anything):

  • added references to the UK10K project, and improved the description of 1000 Genomes
  • created a chapter on computational methods to predict the function of a variant. We described the databases that annotate information on SNPs or other association studies and tools like GRAIL to analyze the literature; cited the utility of genome browsers like UCSC’s; and cited a study whose authors describe a pipeline to predict the effect of a non-synonymous SNP on the structure of a protein (the authors of that paper have been contacted and will contribute to the paper). We will also describe how to predict pseudogenes or functional elements, and a bit about pathway approaches.
  • described how alternative splicing can add complexity to eQTL association studies
  • described the complexity of using RNA-Seq and microarrays (also in table 1), plus a few details on Zinc-Finger technologies
  • described why it is important to take into account the interactions between chromatin fibres when studying the effect of a SNP. Different genotypes can be associated with a different chromatin network, which adds a whole level of complexity when predicting the effect of a SNP on the phenotype.
  • differences between studying SNPs and CNVs
  • discussed the usage of BRCA1 cancers as models to validate GWAS
  • added some reasons why animal models are not perfect for reproducing the effect of a SNP in humans.

There are still 11 days left to make additions, so if you can make other contributions they will be welcome.

new paper from my lab: IRiS

The latest paper published by people in my lab describes a method to reconstruct past recombination events:

  • Melé M, Javed A, Pybus M, Calafell F, Parida L, Bertranpetit J, & The Genographic Consortium (2010). A New Method to Reconstruct Recombination Events at a Genomic Scale. PLoS computational biology, 6 (11) PMID: 21124860.

Let’s say that you have a set of genotypes obtained from a human population, like the HapMap project, the HGDP samples or a custom dataset: with this algorithm you can predict some of the recombination events that occurred in recent times.

While the most common approaches to analyzing genotype panel datasets focus on identifying footprints of positive selection, the association of a SNP with a disease, etc., there have been few efforts to look at the history of recombination events. However, a recombination event can have the same importance as any mutation event. By knowing when a recombination has occurred we can infer useful information on the function and the history of the region involved.

What I like very much about this article is the impressive amount of validation work they have done on their software. I am picky about bioinformatics software being tested properly, but this time they have probably spent more time testing the algorithm than developing it.

Their first approach was to carry out a lot of simulations, using the software CoSi. They simulated the ‘whole history of the human species’ thousands of times, and then applied the algorithm to the simulated data to see how it performed. Afterwards, they also used data from a sperm typing panel, which is a good dataset to study recombination events in humans.

Predicting recent events of recombination from genotype data. The authors did a tremendous amount of work to demonstrate the validity of their software.

So, if you want to know about a good paper with good examples on how to test a computational tool, you can have a look at this paper.