Maybe I should start blogging again

A nice article published today by The Scientist made me think that I should start blogging again.

I started blogging about bioinformatics during the first year of my Master's degree. At that time I was really happy about it, and I greatly enjoyed reading blogs and participating in web forums about bioinformatics: as a bioinformatician I am mostly self-taught, since at the time courses on bioinformatics were not as well organized as they are now, and I learned most of what I know by reading blogs on the topic.

After completing the Master's and returning to Barcelona to start my PhD, I started blogging less frequently. Why did I stop? Mostly a change in my habits: a lot of new things to learn, problems with my hosting service, laziness, the difficulty of writing in English and of customizing the blog's appearance, and maybe also because I had a girlfriend (it sounds stupid to say, but now I understand why other people have never thought of blogging: it is difficult to find the time).

Has my research life been better since I started blogging? I don't know, but what I know for sure is that my research skills would be more modest without the period when I was blogging. I think the job of a researcher is not so different from that of a journalist, and this is especially true for bioinformaticians: you have to find something interesting in the sea of data available out there, and then write an interesting article to tell other people about it.

Will I start blogging again after this? I hope so. After reading the article from The Scientist, I decided that I should write more about my work, the articles I publish, and the conferences I attend: it will be easier, and it will help my career more. Let's see if I can do it 🙂

The Kaggle competition: Predicting HIV Progression

I have recently decided to participate in a competition hosted on Kaggle.com about writing a predictor for the outcome of an HIV treatment. Since I have not been blogging here for a while, I will use this as an excuse to restart blogging, and I will dedicate a series of posts to my ideas for this competition.

Kaggle.com is a new web 2.0 site that proposes competitions for people who can write predictors using machine learning or other techniques. In short, they propose a case study, like this one on HIV, and give a prize of $500 to the team that writes the best predictor for the data. In theory, if you have a nice case study to propose, you can send it to them, and they will pay you if it is interesting.

The competition I want to enter is about predicting the outcome of an HIV treatment. In short, you have a set of ~700 individuals exposed to a treatment for AIDS, and for each individual you have the sequences of two viral proteins, some other parameters, and the outcome of the treatment. Then you are given a set of 300 individuals, and you have to predict how the treatment will perform on them.

Since I am more interested in learning about machine learning methods than in the prize money, I will post my thoughts about the competition here. This will probably make for a nice tutorial on the basics of bioinformatics for dummies: how to interpret the data, which databases to query, which tools to use, a bit of bash scripting… I will start writing by tomorrow 🙂

Looking for someone to help me annotate the N-glycosylation pathway on Reactome

I am collaborating with the Reactome database to create an entry for the N-glycosylation pathway in human. Most of the job is already done; the only problem is that I have to find an expert on the topic to review it before it can be published on Reactome.

reactome

So, if you know a lot about glycosylation, or know somebody who might be interested, please contact me… this work will probably lead to a small publication in a peer-reviewed journal, maybe Oxford Journals/Database or a glycosylation-specific one.

I started annotating the N-glycosylation pathway myself because I have been studying it very closely and I wasn't very happy with the annotation in KEGG/Pathways. The N-glycan biosynthesis pathway annotated there is fine, but there are a few errors*, and after trying to contact KEGG's maintainers a few times, I decided to do all the work myself on Reactome.

Note: I am not sure if I can post it here already, but here is a link to the provisional entry that I am annotating.

* errors in the KEGG pathway for N-glycans: some genes and some interactions are missing; some reactions are clustered together in a way I don't like; and I don't understand the logic behind the last steps.

A new paper from Sabeti's lab, on combining the results of different tests for positive selection

Today in our journal club we discussed the latest paper from Sabeti's lab:

Grossman, S., Shylakhter, I., Karlsson, E., Byrne, E., Morales, S., Frieden, G., Hostetter, E., Angelino, E., Garber, M., Zuk, O., Lander, E., Schaffner, S., & Sabeti, P. (2010). A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

This paper describes what most researchers in the field have been doing intuitively for a long time: combining the results from multiple tests for positive selection to obtain a single score per position.

In other fields of computational biology, like structural bioinformatics or sequence alignment, the approach of combining results has already been applied for a long time; I can think, for example, of this one developed by my former professors in Bologna. However, this is the first time it has been used in population genetics.

The new CMS method described in this paper merges the results from three different tests: iHS, described by Voight in 2006 and designed to detect recent positive selection; XP-EHH, related to iHS but designed to detect alleles which are positively selected in one population and not in others; and Fst, a measure of population differentiation.

To be honest, our feelings about this paper during the journal club were not extremely positive: even if the idea is nice, we think this article is not at the level of others from the same authors. The formula used to combine the different statistics is a simple product of the Bayesian probabilities that an allele is under selection, given each test's result and a background calculated with simulations; it is a nice idea, but by itself it doesn't seem to justify a paper.
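To make the idea concrete, here is my own sketch of the composite score as I understand it from the paper (this is not their code; the prior and the per-test density values below are made-up illustrations):

```python
# Sketch of a "composite of multiple signals" score: for each test we assume
# we know the density of its score under selection, P(s | selected), and under
# neutrality, P(s | neutral), both estimated from simulations. The composite
# is the product of the per-test posterior probabilities of selection.

def posterior(p_sel, p_neu, prior=1 / 10000):
    """Bayesian probability that the allele is selected, given one test's score."""
    return (p_sel * prior) / (p_sel * prior + p_neu * (1 - prior))

def cms(per_test_densities, prior=1 / 10000):
    """Composite score: product of posteriors over all the tests."""
    score = 1.0
    for p_sel, p_neu in per_test_densities:
        score *= posterior(p_sel, p_neu, prior)
    return score

# hypothetical densities for three tests (say iHS, XP-EHH, Fst) at one SNP
print(cms([(0.8, 0.1), (0.6, 0.2), (0.7, 0.3)]))
```

Note that a product like this implicitly treats the tests as independent, which is exactly the assumption we questioned during the journal club.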

Moreover, they combine tests which in fact detect different types of positive selection: for example, XP-EHH detects alleles which undergo a sweep in one population but are neutral in the others, while iHS detects positive selection in general. It is not clear if and when these two statistics would overlap: for example, if an allele is positively selected in all humans, but not in any specific human population, it would have a good iHS but a low XP-EHH, so their product would be less interesting than the iHS alone; and the opposite is true as well.

Another point is that they refer to simulations of selective sweeps made with cosi, but they don't provide the simulations, and it is not clear how they obtained them. We are also not convinced that cosi is the best tool for these simulations, as it can't simulate certain kinds of selective sweeps.

In general, we had the feeling that the examples in the article weren't exactly representative of the whole results, as many of the regions described as selected in the supplementary data are not very interesting when you look at them closely, and it seems that they showed only the most interesting results.

Anyway, the idea behind the paper is nice, and I am happy that we are finally entering the era of combining results from different selection tests. I wonder if the next article will describe a weighted product of different tests, or if someone will first try to add new tests, like CLR or Tajima's D.

BioStar, the StackOverflow for bioinformaticians

If you are a programmer, you may already know StackOverflow, a forum-like website dedicated to questions about computer science, with an innovative design and a very active community; it is probably the best place on the Internet to ask when you have doubts related to programming.

Thanks to a recent post on the biopython-dev mailing list, I discovered that it is possible to create websites with the same engine used by StackOverflow and customize them for a specific topic: for example, this blog and this site list all the StackExchange-like websites, ranging from mathematics to electronics to business.

So, there is also a StackExchange website dedicated to bioinformatics, which I have been using lately: its name is BioStar:

BioStar. Click on the link to access.

If you have a question on bioinformatics, whether specific or general, you can ask it there… If we can create a community similar to StackOverflow for bioinformatics, it will be a very good resource for everyone.

My first GeneOntology term!!

Today the maintainers of the GeneOntology database accepted a term that I had proposed two days earlier. So, as of today, I am the daddy of 'integral to lumenal side of endoplasmic reticulum membrane' (GO:0071556)!!! 🙂

logo of the GeneOntology project

GeneOntology is a database of terms used to describe the functions and properties of proteins and other objects of biological interest. It is like an official and structured vocabulary, which you can use when describing a protein to be sure there will be no misunderstanding about what you are saying.

Proposing a term to GeneOntology is very easy, and the maintainers answer quite quickly. The only thing you have to do is go to their bug tracker on SourceForge and provide the few pieces of information they require (see the instructions). As I said before, it took only two days for them to accept my proposal, but my term was kind of a special case, and it was clear that it was necessary to add it.

While the process to add a GO term is quick and efficient, I think they should improve their gene annotations a bit. In fact, there are a lot of GO terms which seem not to be associated with any gene; for example, this one (which is in some ways similar to the term I have just proposed).

Moreover, sometimes it is difficult to navigate the GeneOntology tree itself: if you look again at GO:0071458 and at the tree in the lower part of the page, you will see that the term is repeated several times (because it belongs to several higher-level terms); the representation is correct, but a bit intimidating at first.

So, hurrah for GeneOntology… if in the future you ever work with something that is 'integral to lumenal side of endoplasmic reticulum membrane', be it a gene or another molecule, I hope it reminds you of me… 🙂

P.S. Here is the full story of my term.

operator.itemgetter rocks!!

The operator module in Python implements many commonly used functions in C, making them faster.

Today I had to extract a very large number of positions from a long DNA sequence, and operator.itemgetter solved my problem quickly.

Imagine that you have a very long sequence, seq1, and you have to extract a series of specific positions from it, for example 4, 6, 123, 231… How would you do it in Python? The most intuitive way would be to repeat the indexing operation for every position, like seq1[4], seq1[6], seq1[123]…

Luckily, with operator.itemgetter you can do it in a single operation, and it is quite fast:
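The original snippet was lost together with the old posts, so here is a minimal reconstruction (the sequence below is a made-up toy example, not the real data):

```python
from operator import itemgetter

# toy stand-in for a real chromosome sequence (mine was the human genome)
seq1 = "ACTGGTCACCTGACCGTCATGCATGCATGCAAA" * 100

positions = [4, 6, 123, 231]

# itemgetter(*positions) builds a callable that fetches all those
# indices from any sequence in a single call, returning a tuple
get_positions = itemgetter(*positions)
bases = get_positions(seq1)  # tuple of the characters at those positions

print(bases)
```

The same `get_positions` callable can be reused on other sequences, which is where the speed advantage over a Python-level loop really shows.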

What is amazing is that I have tried this operation on the real sequences, which made up the entire human genome, and on the real positions, which numbered in the millions as well. I was able to extract a million positions from sequences of millions of bases in just a few seconds!!

A seminar on makefiles and pipelines of shell scripts

While I still haven't had the time to restore the old posts on this blog, I would like to repost my seminar on makefiles.

This slideshow is different from all the others you can find on makefiles: instead of showing how to use make to compile programs, it shows how make can be used to create pipelines of shell programs and scripts, which is very useful in bioinformatics and in other fields.

Let's say you have a lot of scripts to analyze the results of an experiment: for example, one to launch BLAST, another to parse its output, others to compare it with other databases or to run command-line programs… or just a bundle of sed/grep/gawk scripts to organize.

A makefile can be used to store complex commands like these and organize them into a pipeline: for example, a 'run_blast' operation that consists of running BLAST and parsing its results, and an 'analyze_results' operation that consists of a series of sed and gawk scripts, along with an R one. I have seen many people use shell scripts for this, but a better approach is to use a language designed to describe pipelines: this is what make is, one of the oldest (yet still widely used) languages for defining pipelines, so it is a good place to start learning.
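For illustration, a minimal sketch of such a makefile (all the tool invocations and file names here are made up; also remember that each command line must be indented with a real tab character):

```make
# run BLAST on a query file and parse its output into a table
run_blast:
	blastall -p blastp -i input.fa -d nr -o blast_output.txt
	python parse_blast.py blast_output.txt > hits.tsv

# summarize the parsed hits; listing run_blast as a prerequisite
# makes make execute that rule first
analyze_results: run_blast
	gawk -f summarize.awk hits.tsv > summary.tsv
	Rscript plot_summary.R summary.tsv
```

Typing `make analyze_results` then runs the whole pipeline in order, and each step is documented in one place instead of being scattered across shell history.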

Another difference with respect to other seminars on makefiles is that I have tried to start with a 'reduced makefile syntax', in which you just use the name of the rule, the prerequisites, and the commands, without worrying about the fact that the name of a rule should correspond to the name of an output file. If you prefer to learn the standard approach, I suggest starting with the corresponding section of Software Carpentry for bioinformatics.

Starting from 3

Dear readers of this blog,

I have decided to start writing new posts and articles on this blog, and to temporarily abandon the idea of restoring the old content, which I will try to do when I have more spare time.

The responsibility for the blackout of this site is both mine, because I haven't been very keen on fixing it, and my hosting provider's, which changed the conditions of the contract too quickly, without giving me enough time to make a proper backup and organize a migration.

Now I have a raw SQL file containing all the data from the previous blog, but for lack of time I haven't yet managed to use it to restore the previous articles; so, for the moment, I will start writing new posts and restore the old content as soon as possible.

I am sorry to have left you, few readers of this blog, for so long without any explanation or any active attempt to restore it 🙂 I hope I will be able to do better in the future.

Cheers, and break a leg for your plans of conquering the world with bioinformatics 🙂