After the experience with the Post-GWAS article on WikiGenes I started looking at the resources on Nature Precedings, which is where the original idea of that collaborative article came from.
Nature Precedings is a Nature Network website where researchers can post drafts, ideas, and presentations about work that can later be published. This is exactly what I suspect happened with the WikiGenes article: the authors from the Post-GWAS consortium posted a draft of a letter there, and the letter was noticed by a Nature editor, who suggested turning it into an article and opening it to collaborative editing.
I am looking at the presentations on Nature Precedings and thinking that maybe some of the presentations I have made or attended could be posted there. It is not very clear how the requirements for publication there are interpreted: most of the documents are pre-print papers and drafts, but some of the presentations illustrate software tools that have already been published.
In any case, I think I will start putting something there… I have some ideas that I do not have the time to develop by myself, maybe if I upload them there, I can find somebody wishing to collaborate with me.
These days I have been having problems with the DNS services, which should have been solved by now.
For approximately one week, this blog was not reachable from everywhere in the world, depending on which DNS servers you were using. Instead, an older version of the site was shown, at least here in Spain and in some parts of Italy.
Everything should be fixed now. Sorry for the inconvenience.
We have published a commentary paper about reporting annotation errors to scientific databases. In this work we discussed the fact that reporting annotation errors to a database is usually not acknowledged and not considered a scientific activity, while in our opinion it should be.
Let’s say that you encounter an error in the annotation of a gene in a scientific database. What would you do? Would you report the error to the maintainers so it can be fixed, or would you just pass by and go to another database? Most of the researchers that I know are not even aware that you can report errors to a scientific database, and when they encounter too many errors in one place, they just look for a better annotated resource. This approach is harmful, because the data in scientific databases does not improve as much as it could, and this favors the fragmentation of annotations across many databases.
Moreover, in this paper we said that error reports to scientific databases should be public and transparent. Let’s say that you find an error in an annotation in Reactome. How can you communicate this to the other people using the same resource? All the researchers using that data for their work will be affected. And on the other side, how do you know whether the data that you are using is correct, or whether someone else has already found incorrect or outdated information in it?
For a scientist, it is very important to be able to ask good questions; seminars are a good place to practice. In my case, I am lucky because in the building where I am doing my PhD there are always a lot of seminars, and at least one each week is about bioinformatics.
My favorite question is to ask about the controls that a bioinformatician used to test the software he or she wrote, or the analysis he or she did. When I was studying in Bologna, my former professor of Molecular Biology used to repeat to us, in almost every lesson, that the most difficult part of designing an experiment is choosing the best controls. I believe that the choice of controls is the moment when a bioinformatician is closest to the biology he/she is studying, because you can’t do that if you don’t know the biology behind your project.
For example, yesterday I attended a seminar by one of the people responsible for Ensembl Compara. My question was: which controls do you use when you update the pipeline to predict orthologs? I was wondering whether, with all the experience in predicting orthologs that the Ensembl Compara programmers have, they know of any gene for which there is so much literature that anyone can be absolutely sure about its orthologs in other species.
So, what is your favorite question to ask in a bioinformatics seminar? Which controls are you using in your analysis? 🙂
People at BioDec are among the authors of Ensemble, a tool to predict the transmembrane helices of proteins. If you have ever clicked on the link ‘Third Party data’ in UniProt (example), the predictions on transmembrane helices are provided by them. They are also involved in the implementation of other tools developed in Rita Casadio’s lab in Bologna and other bioinformatics laboratories in Italy.
Moreover, BioDec is the company that produces Plone4Bio, a library for Plone for Bioinformatics. Plone is a framework to make websites with Python, so if you are a web programmer interested in the field of bioinformatics, this may be a good experience for you.
Phylo is a game where players are required to manually edit a multiple alignment. The player who can make the best multiple alignment, maximizing the matches and reducing gaps, gets the best score.
This is a very fun and innovative idea. It is based on the principle that humans are better at identifying patterns than computers, that the problem of calculating a multiple alignment is so complex that even the most advanced multiple alignment software does not find the best solution for a large set of sequences, and that manual editing of a multiple alignment is therefore always required.
You may have already heard about fold.it, a similar game based on protein folding: this is the answer for the guys who work in the field of multiple alignments and phylogeny. I am happy because I belong to this second group :-).
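To give an idea of what "maximizing the matches and reducing gaps" can mean in practice, here is a minimal sketch of a sum-of-pairs score for a small alignment. This is not Phylo's actual scoring function, which I have not looked at: the function name `score_alignment` and the values for `match`, `mismatch` and `gap` are assumptions chosen only for illustration.

```python
from itertools import combinations


def score_alignment(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment.

    rows: list of equal-length strings, with '-' marking a gap.
    The scoring values are illustrative assumptions, not Phylo's.
    """
    if not rows:
        return 0
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")

    total = 0
    # Walk the alignment column by column and compare every pair of rows.
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # two gaps facing each other are ignored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Two ways of aligning the same pair of sequences (ACGT and ACAGT).
candidate_a = ["AC-GT", "ACAGT"]
candidate_b = ["ACGT-", "ACAGT"]
print(score_alignment(candidate_a), score_alignment(candidate_b))
```

In this toy case, placing the gap in the middle keeps four columns matched, so the first candidate gets the higher score. A player moving blocks around in the game is doing, by eye, the same kind of search for a better-scoring arrangement.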
Here is a summary of the new paragraphs and additions that have been made to the collaborative WikiGenes paper in the past two weeks.
For those who have not been following: the manuscript is a perspective on approaches and good practices to study the function of a variant identified in a GWAS.
It happens too often that, after a GWAS is successful in identifying a relationship between a tag SNP and a disease, these results are not followed by a study on the biological mechanism behind the association, or by studies on the exact location of the causal variant.
Summary of changes (sorry if I forgot anything):
added references to the UK10K project, and improved the description of the 1000 Genomes project
created a chapter on computational methods to predict the function of a variant. We described the databases that annotate information on SNPs or other association studies and tools like GRAIL to analyze the literature, mentioned the utility of genome browsers like UCSC’s, and cited a study where the authors describe a pipeline to predict the effect of a non-synonymous SNP on the structure of a protein (the authors of that paper have been contacted and will contribute to the manuscript). We will also describe how to predict pseudogenes or functional elements, and say a bit about pathway approaches.
described how alternative splicing can add complexity to eQTL association studies
described the complexity of using RNA-Seq and microarrays (also in table 1), plus a few details on Zinc-Finger technologies
described why it is important to take into account the interactions between chromatin fibres when studying the effect of a SNP. Different genotypes can be associated with a different chromatin network, which adds a whole level of complexity when predicting the effect of a SNP on the phenotype.
differences between studying SNPs and CNVs
discussed the usage of BRCA1 cancers as models to validate GWAS
added some reasons why animal models are not perfect for reproducing the effect of a SNP in humans.
There are still 11 days left to make additions, so if you can contribute something else, you are welcome to do so.
The latest paper published by people in my lab describes a method to reconstruct past recombination events:
Melé M, Javed A, Pybus M, Calafell F, Parida L, Bertranpetit J, & The Genographic Consortium (2010). A New Method to Reconstruct Recombination Events at a Genomic Scale. PLoS computational biology, 6 (11) PMID: 21124860.
Let’s say that you have a set of genotypes obtained from a human population, like the HapMap project, the HGDP samples or a custom dataset: with this algorithm you can predict some of the recombination events that occurred in recent times.
While the most common approaches to analyzing genotype panel datasets focus on identifying footprints of positive selection, the association of a SNP with a disease, and so on, there have been few efforts to look at the history of recombination events. However, a recombination event can have the same importance as any mutation event. By knowing when a recombination has occurred, we can infer useful information on the function and the history of the region involved.
What I like very much about this article is the impressive amount of validation they have done to demonstrate that their software works. I am very fussy about bioinformatics software being tested properly, but this time they have probably spent more time testing the algorithm than developing it.
Their first approach was to carry out a lot of simulations, using the software CoSi. They simulated the ‘whole history of the human species’ thousands of times, and then applied the algorithm to the simulated data to see how it performed. Afterwards, they also used data from a sperm typing panel, which is a good dataset for studying recombination events in humans.
So, if you want to know about a good paper with good examples on how to test a computational tool, you can have a look at this paper.
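The general idea of testing a tool on simulated data, where the truth is known, can be sketched in a few lines. This is only a hypothetical illustration, not the validation procedure used in the paper: the function `evaluate_breakpoints`, the representation of predictions as intervals and the example numbers are all my own assumptions.

```python
def evaluate_breakpoints(true_positions, predicted_intervals):
    """Compare predicted recombination intervals with simulated breakpoints.

    true_positions: positions of the recombination events known from the simulation.
    predicted_intervals: list of (start, end) tuples, inclusive, reported by the tool.
    Returns (precision, recall).
    """
    true_positions = list(true_positions)
    predicted_intervals = list(predicted_intervals)

    # A true event counts as detected if it falls inside any predicted interval.
    detected = [
        pos for pos in true_positions
        if any(start <= pos <= end for start, end in predicted_intervals)
    ]
    # A prediction counts as supported if it contains at least one true event.
    supported = [
        (start, end) for start, end in predicted_intervals
        if any(start <= pos <= end for pos in true_positions)
    ]

    recall = len(detected) / len(true_positions) if true_positions else 0.0
    precision = len(supported) / len(predicted_intervals) if predicted_intervals else 0.0
    return precision, recall


# Made-up example: three simulated events, four predicted intervals.
simulated_events = [100, 250, 900]
predictions = [(90, 110), (240, 260), (500, 520), (600, 610)]
precision, recall = evaluate_breakpoints(simulated_events, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Repeating this over thousands of simulated datasets gives a distribution of precision and recall, which says much more about a tool than a single worked example.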
What is PLoS-Currents? This is the first time I have read about it. Does anyone have experience with it? Can anybody explain to me what exactly it is?
Albert Istvan from Biostar explained to me that it is based on Google/Knol, a sort of Wikipedia but with restrictions on editing. It is a place where you can submit documents in the style of reports (or scientific papers), and they will be published only if they pass a review process. It is a way to publish results quickly, which is useful for fields that need quick updates, like everything related to the Influenza virus or to the new sequencing techniques.
So far, PLoS has four Currents collections, and one of them is particularly interesting: the one called PLoS/Currents Evidence on Genomic Tests. It is a place to publish documents on genetic tests and similar topics. For example, if you have bought a 23andMe kit or one from its competitors, this will be a good place to look for information about each of the tests. Maybe it is also a good resource to include in the WikiGenes/Nature Genetics paper I was writing about earlier, but I have to look at it more closely.