Links of (enter period of time here) for bioinformaticians

These are links to things I have been reading lately.

Tips for writing scientific articles

The author of the blog Academic English Solutions published some chapters from his latest book, on how to write a scientific article. For example, I didn’t know that the Materials and Methods section of a paper should be written in the passive voice.

Another nice article with tips on how to write scientific papers has been published on Nature/jobs lately:

Good Practices for Bioinformaticians

In case you do not know it already, the authors of Software Carpentry, one of the best resources to learn programming for bioinformaticians, are preparing a new version. Keep an eye on their blog: for example, one nice article that they have published lately is Five rules for a good bioinformatician.

In the same period, Nature/news published another article along the same lines: Computational Science… Error

BioStar

I have already talked about this, but Biostar is becoming a very nice resource for bioinformaticians who don’t know who can answer a technical bioinformatics question. Really, have a look at the questions already open and see if you can answer any! 🙂

Maybe I should start blogging again

A nice article published today by The Scientist made me think that I should start blogging again.

I started blogging about bioinformatics during the first year of my Master’s degree. At that time I was really happy about it, and I really enjoyed reading blogs and participating in web forums about bioinformatics: my training as a bioinformatician is mostly that of an autodidact, as at the time the courses on bioinformatics were not as well organized as they are now, and I learned most of what I know by reading blogs on the topic.

After completing the Master’s and coming back to Barcelona to start the PhD, I started blogging less frequently. Why did I stop blogging? Mostly because of a change in my habits, a lot of new things to learn, problems with my hosting service, laziness, because it was too difficult to write in English or to customize the blog’s appearance, and maybe also because I had a girlfriend (it sounds stupid to say, but now I understand why other people have never thought of blogging: it is difficult to find the time).

Has my research life been better since I started blogging? I don’t know, but what I know for sure is that my research skills would have been more modest without the period when I was blogging. I think that the job of a researcher is not so different from the job of a journalist, and that is especially true for bioinformaticians: you have to find something interesting in the sea of data available out there, and then write an interesting article to tell other people about it.

Will I start blogging again after this? I hope so, but after reading the article from The Scientist, I decided that I should write more about my work, the articles I publish and the conferences that I attend. It will be easier and will help my career more. Let’s see if I can do it 🙂

the Kaggle competition: Predicting HIV Progression

I have recently decided to participate in a competition offered by the Kaggle.com website, about writing a predictor for the outcome of an HIV treatment. Since I have not been blogging here for a while now, I will use this as an excuse to restart blogging, and I will dedicate a series of posts here to my ideas for this competition.

Kaggle.com is a new web 2.0 site that proposes competitions to people able to write predictors using machine learning or other techniques. In short, they propose a case study, like this one on HIV, and they give a prize of $500 to the team that writes the best predictor for the data. In theory, if you have a nice case study to propose, you can send it to them and they will also pay you if it is interesting.

The competition I want to participate in is about predicting the outcome of an HIV treatment. In short, you have a set of ~700 individuals who received a treatment for AIDS, and for each individual you have the sequences of two viral proteins, some other parameters, and the outcome of the treatment. Then, you are given a set of 300 individuals, and you have to predict how the treatment will perform on them.
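As a rough idea of how such data could be turned into numbers for a predictor, here is a minimal sketch that converts the two viral protein sequences of each individual into amino acid composition vectors. Everything in it is an assumption made for illustration: the field names (protein_a, protein_b, improved), the toy sequences and the choice of features are invented and are not the competition’s actual data format.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Return the fraction of each standard amino acid in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty or unusable sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def featurize(patient):
    """Concatenate the composition vectors of the two viral proteins."""
    return composition(patient["protein_a"]) + composition(patient["protein_b"])


# Toy records, invented for this example (hypothetical field names).
patients = [
    {"protein_a": "PQITLWQRPL", "protein_b": "PISPIETVPV", "improved": 1},
    {"protein_a": "PQITLWKRPI", "protein_b": "PISPIDTVPV", "improved": 0},
]

features = [featurize(p) for p in patients]
labels = [p["improved"] for p in patients]
print(len(features[0]), labels)
```

The resulting vectors (40 numbers per individual) could then be passed, together with the other parameters, to any machine learning method.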

Since I am more interested in learning about machine learning methods than in the prize money, I will post my thoughts about the competition here. This will probably make for a nice tutorial on the basics of bioinformatics for dummies: how to interpret the data, which databases to query, which tools to use, a bit of bash scripting, … I will start writing tomorrow 🙂

looking for someone to help me annotate the N-Glycosylation pathway on reactome

I am collaborating with the Reactome database to create an entry for the N-glycosylation pathway in human. Most of the job is already done; the only problem is that I have to find an expert on the topic to review it before it can be published on Reactome.

So, if you know a lot about glycosylation, or know somebody interested, you can contact me… this work will probably lead to a small publication in a peer-reviewed journal, maybe Oxford Journals/Databases or some glycosylation-specific one.

I started annotating the N-glycosylation pathway by myself because I have been studying this pathway very closely and I wasn’t very happy with the annotation on KEGG/Pathways. The N-glycan biosynthesis pathway annotated there is fine, but there are a few errors* and, after trying to contact KEGG’s maintainers a few times, I decided to do all the work by myself on Reactome.

Note: I am not sure if I can post it here already, but here is a link to the provisional entry that I am annotating.

* errors in the KEGG pathway for N-glycan: some genes and some interactions are missing; some reactions are clustered together in a way I don’t like; I don’t understand the logic behind the last steps.

A new paper from Sabeti, on clustering the results of different tests for positive selection

Today in our journal club we discussed the latest paper from Sabeti’s lab:

Grossman, S., Shlyakhter, I., Karlsson, E., Byrne, E., Morales, S., Frieden, G., Hostetter, E., Angelino, E., Garber, M., Zuk, O., Lander, E., Schaffner, S., & Sabeti, P. (2010). A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

This paper describes what most researchers in the field have been doing intuitively for a long time: combining the results from multiple tests for positive selection to obtain a single score per position.

In other fields of computational biology, like structural bioinformatics or maybe sequence alignment, the approach of combining results has been applied for a long time (I can think, for example, of this one developed by my ex-profs in Bologna); however, this is the first time that it is used in population genetics.

The new CMS method described in this paper by Sabeti merges the results from three different tests: the iHS, described by Voight et al. (2006), designed to detect recent positive selection; the XP-EHH, which like iHS is based on extended haplotype homozygosity but is designed to detect alleles that are positively selected in one population but not in others; and the Fst, a measure of population differentiation.

To be honest, our feelings about this paper during the journal club were not extremely positive: even if the idea is nice, we think this article is not at the level of others from the same author. The formula used to combine the different statistics is a simple product, across tests, of the Bayesian probability that an allele is under selection, given the result of the test and a background calculated with simulations; it is a nice idea, but it doesn’t seem to justify a paper on its own.

Moreover, they combine tests that in fact detect different types of positive selection: for example, XP-EHH detects alleles that have undergone a sweep in one population but are neutral in the others, while iHS detects positive selection in general. It is not clear whether these two statistics would overlap, and in which cases: for example, if an allele is positively selected in all humans, but not in any specific human population, it would have a good iHS and a low XP-EHH, so their product would be less interesting than the iHS itself; and the opposite is true as well.
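To make the product formula and the overlap problem more concrete, here is a minimal sketch of a composite score built as a product of per-test posterior probabilities. It is only an illustration of the general idea as I understand it, not the exact formula of the paper: the function names, the prior of 0.5 and all the likelihood values are invented for this example (in the paper the likelihoods come from simulations).

```python
from math import prod


def posterior_selected(p_score_given_selected, p_score_given_neutral, prior=0.5):
    """Bayes' rule: probability of selection given the score of one test.

    The two likelihoods would normally be estimated from simulations;
    the prior of 0.5 is an arbitrary assumption of this sketch.
    """
    numerator = p_score_given_selected * prior
    denominator = numerator + p_score_given_neutral * (1.0 - prior)
    return numerator / denominator if denominator > 0 else 0.0


def composite_score(likelihoods, prior=0.5):
    """Multiply the per-test posteriors into a single composite score."""
    return prod(posterior_selected(sel, neu, prior) for sel, neu in likelihoods)


# Invented (selected, neutral) likelihoods for two tests, e.g. iHS and XP-EHH.
both_agree = composite_score([(0.8, 0.1), (0.7, 0.2)])
ihs_only = composite_score([(0.8, 0.1), (0.1, 0.6)])  # strong iHS, weak XP-EHH
ihs_alone = posterior_selected(0.8, 0.1)

print(round(both_agree, 3), round(ihs_only, 3), round(ihs_alone, 3))
```

With these made-up numbers, the allele with a strong iHS but a weak XP-EHH gets a composite score well below the one it would get from iHS alone, which is the situation described above.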

Another point is that they refer to simulations based on selective sweeps made with cosi, but they don’t provide the simulations, and it is not clear how they have obtained them. We are also not convinced that cosi is the best tool to do these simulations, as it can’t simulate certain kinds of selective sweeps.

In general, we had the feeling that the examples that they put in the article weren’t exactly representative of the whole set of results: many of the regions described as selected in the supplementary data are not very interesting when you look at them closely, and it seems that they showed only the most interesting results.

Anyway, the idea behind the paper is nice, and I am happy that we are finally entering the era of combining results from different selection tests. I wonder if the next article will describe a weighted product of different tests for selection, or if someone will first try to add new tests, like CLR or Tajima’s D.

BioStar, the StackOverflow for bioinformaticians

If you are a programmer you may already know StackOverflow, a forum-like website dedicated to questions about computer science, with an innovative design and a very active community, and probably the best place on the Internet to ask when you have doubts related to programming.

Thanks to a recent post on the biopython-dev mailing list, I discovered that it is possible to create websites with the same engine used by StackOverflow, and customize them for a specific topic: for example, this blog and this site list all the StackExchange-like websites, ranging from mathematics to electronics to business.

So, there is also a StackExchange website dedicated to bioinformatics, and lately I have been using it. Its name is BioStar:

biostar. Click on the link to access.

If you have a question on bioinformatics, whether specific or general, you may ask it there… If we can create a community for bioinformatics similar to the one on StackOverflow, it will be a very good resource for everyone.

My first GeneOntology term!!

Today the maintainers of the GeneOntology database have accepted a term that I had proposed two days earlier. Therefore, since today I am the daddy of ‘integral to lumenal side of endoplasmic reticulum membrane’ (GO:0071556)!!! 🙂

logo of the GeneOntology project

GeneOntology is a database of terms used to describe the functions and the properties of proteins and objects of biological interest. It is like an official and structured vocabulary, which you can use when you describe a protein to be sure that there will be no misunderstanding about what you are saying.

Proposing a term to GeneOntology is very easy, and the maintainers answer quite quickly. The only thing you have to do is go to their bug tracker on sourceforge and provide the little information that they require (see the instructions). As I said before, it took only 2 days for them to accept my proposal, but my term was kind of a special case and it was clear that it was necessary to add it.

While the process of adding a GO term is quick and efficient, I think that they should improve their gene annotations a bit. In fact, there are a lot of GO terms which do not seem to be associated with any gene, for example this one (which is in some ways similar to the term that I have just proposed).

Moreover, sometimes it is difficult to navigate the GeneOntology tree itself: if you look again at GO:0071458, and look at the tree in the lower part of the page, you see that the term is repeated several times (because it belongs to several higher-level terms), and the representation is correct but a bit intimidating at first.

So, hurrá for GeneOntology… if in the future you ever work with something which is ‘integral to lumenal side of endoplasmic reticulum membrane’, be it a gene or another molecule, I hope it will remind you of me… 🙂

p.s. this is the full story about my term.

operator.itemgetter rocks!!

The operator module in Python implements many commonly used functions in C, which makes them fast.

Today I had to extract a very large number of positions from a long sequence of DNA, and operator.itemgetter solved my problem quickly.

Imagine that you have a very long sequence, seq1, and that you have to extract a series of specific positions, for example 4, 6, 123, 231… How would you do it in Python? The most intuitive way would be to repeat the index operation for every position, like seq1[4], seq1[6], seq1[123]…

Luckily, with operator.itemgetter you can do it in a single operation, and it is quite fast.
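Here is a minimal sketch of both approaches. The sequence is a toy one invented for this example (the name seq1 and the positions follow the text above).

```python
from operator import itemgetter

# Toy sequence of 400 bases, invented for this example.
seq1 = "ACGT" * 100
positions = (4, 6, 123, 231)

# Intuitive way: index the sequence once per position.
by_index = tuple(seq1[i] for i in positions)

# itemgetter builds a callable that fetches all the positions in one call.
get_positions = itemgetter(*positions)
by_itemgetter = get_positions(seq1)

print(by_itemgetter)  # ('A', 'G', 'T', 'T')
```

Note that when itemgetter is created with a single position it returns just that item rather than a tuple, so code that may receive only one position has to handle that case.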

What is amazing is that I have tried this operation on the real sequences, which cover the entire human genome, and on the real positions, of which there were several million. I was able to extract a million positions from sequences of millions of bases in a few seconds!!
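If you want to check the speed on your own machine, a small benchmark along these lines could be a starting point. The sizes are assumptions chosen to keep it quick (a random sequence of one million bases and 100,000 positions, not the real genome data), and the timings will vary from machine to machine.

```python
import random
import timeit
from operator import itemgetter

random.seed(0)  # fixed seed, only to make the toy data reproducible

# Random toy data: 1,000,000 bases and 100,000 sorted positions.
seq = "".join(random.choices("ACGT", k=1_000_000))
positions = sorted(random.sample(range(len(seq)), 100_000))

getter = itemgetter(*positions)


def with_loop():
    """Index the sequence once per position."""
    return tuple(seq[i] for i in positions)


def with_itemgetter():
    """Fetch all the positions with a single itemgetter call."""
    return getter(seq)


# Both approaches must return exactly the same bases.
assert with_loop() == with_itemgetter()

loop_time = timeit.timeit(with_loop, number=5)
getter_time = timeit.timeit(with_itemgetter, number=5)
print(f"loop: {loop_time:.3f}s  itemgetter: {getter_time:.3f}s")
```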