BioStar, the StackOverflow for bioinformaticians

If you are a programmer, you may already know StackOverflow, a forum-like website dedicated to questions about computer science, with an innovative design and a very active community. It is probably the best place on the Internet to ask when you have doubts related to programming.

Thanks to a recent post on the biopython-dev mailing list, I discovered that it is possible to create websites with the same engine used by StackOverflow and customize them for a specific topic: for example, this blog and this site list all the StackExchange-like websites, ranging from mathematics to electronics to business.

There is also a StackExchange website dedicated to bioinformatics, called BioStar, and lately I have been using it:

biostar. Click on the link to access.

If you have a question on bioinformatics, whether specific or general, you may ask it there… If we can create a community for bioinformatics similar to the one around StackOverflow, it will be a very good resource for everyone.

My first GeneOntology term!!

Today the maintainers of the GeneOntology database accepted a term that I had proposed two days earlier. So, as of today, I am the daddy of ‘integral to lumenal side of endoplasmic reticulum membrane’ (GO:0071556)!!! 🙂

logo of the GeneOntology project

GeneOntology is a database of terms used to describe the functions and properties of proteins and other objects of biological interest. It is like an official, structured vocabulary that you can use when describing a protein, to be sure that there will be no misunderstanding about what you are saying.

Proposing a term to GeneOntology is very easy, and the maintainers answer quite quickly. The only thing you have to do is go to their bug tracker on SourceForge and provide the few pieces of information that they require (see the instructions). As I said before, it took only 2 days for them to accept my proposal, but my term was kind of a special case and it was clear that it needed to be added.

While the process of adding a GO term is quick and efficient, I think that they should improve their annotations for genes a bit. In fact, there are a lot of GO terms which seem not to be associated with any gene, for example this one (which is in some way similar to the term that I have just proposed).

Moreover, sometimes it is difficult to navigate the GeneOntology tree itself: if you look again at GO:0071458, and look at the tree in the lower part of the page, you see that the term is repeated several times (because it belongs to several higher-level terms), and the representation is correct but a bit intimidating at first.

So, hurrah for GeneOntology… if in the future you ever work with something which is ‘integral to lumenal side of endoplasmic reticulum membrane’, be it a gene or another molecule, I hope it will remind you of me… 🙂

p.s. this is the full story about my term.

operator.itemgetter rocks!!

The operator module in Python provides many commonly used functions implemented in C, which makes them fast.

Today I had to extract a very large number of positions from a long DNA sequence, and operator.itemgetter solved my problem quickly.

Imagine that you have a very long sequence:

Suppose you have to extract a series of specific positions, for example 4, 6, 123, 231… How would you do it in Python? The most intuitive way would be to repeat the indexing operation for every position, like seq1[4], seq1[6], seq1[123]…

Luckily, with operator.itemgetter you can do it in a single operation, and it is quite fast:
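As a minimal sketch of the idea (the short sequence `seq1` and the positions below are made up for illustration, they are not the real data):

```python
from operator import itemgetter

# Hypothetical short sequence standing in for a very long one.
seq1 = "ACGTTAGGCATCGATCGGATTACA"

# Hypothetical positions to extract (0-based indexes into seq1).
positions = [4, 6, 12, 21]

# itemgetter(*positions) builds a callable that fetches all the
# requested positions from any indexable object in a single call.
get_bases = itemgetter(*positions)
bases = get_bases(seq1)

# The same result obtained by indexing one position at a time.
bases_by_hand = tuple(seq1[p] for p in positions)

print(bases)
```

Note that itemgetter returns a tuple only when it is given two or more positions: with a single position it returns the item itself, and with no positions at all it raises a TypeError, so a list of positions that may be very short needs a bit of care.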

What is amazing is that I tried this operation on the real sequences, which make up the entire human genome, and on the real positions, of which there were millions as well. I was able to extract a million positions from sequences of millions of bases in a very few seconds!!
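If you want to check the speed on your own machine, a rough timing sketch could look like the following. It uses a synthetic random sequence and random positions (the sizes, the seed, and the helper names are my assumptions, not the real genome data), and the timings will of course depend on your hardware and Python version:

```python
import random
import timeit
from operator import itemgetter

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic stand-in for a real sequence: one million random bases.
sequence = "".join(random.choices("ACGT", k=1_000_000))

# 100,000 distinct random positions (assumed size, for illustration only).
positions = sorted(random.sample(range(len(sequence)), k=100_000))


def extract_with_itemgetter(seq, pos):
    """Fetch all positions with a single itemgetter call."""
    return itemgetter(*pos)(seq)


def extract_with_loop(seq, pos):
    """Fetch the positions one by one with plain indexing."""
    return tuple(seq[p] for p in pos)


# Both approaches must return exactly the same bases.
same_result = (
    extract_with_itemgetter(sequence, positions)
    == extract_with_loop(sequence, positions)
)

# Time five repetitions of each approach.
t_itemgetter = timeit.timeit(
    lambda: extract_with_itemgetter(sequence, positions), number=5
)
t_loop = timeit.timeit(
    lambda: extract_with_loop(sequence, positions), number=5
)

print(f"itemgetter: {t_itemgetter:.3f} s, loop: {t_loop:.3f} s")
```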

A seminar on makefiles and pipelines of shell scripts

While I still haven’t had the time to restore the old posts in this blog, I would like to post my seminar on makefiles again.

This slideshow is different from all the others you can find on makefiles, because instead of showing you how to use make to compile programs, it shows you how make can be used to create pipelines of shell programs and scripts, which is very useful in bioinformatics and in other fields.

Let’s say you have a lot of scripts to analyze the results of an experiment: for example, one to launch blast, another to parse its output, to compare it with other databases, to run command line programs… or just to organize a bundle of sed/grep/gawk scripts.

A makefile can be used to store complex commands like these and organize them in a pipeline: for example, the operation ‘run_blast’ consists of running blast and parsing its results, and ‘analyze_results’ consists of a series of sed and gawk scripts, along with an R one. I have seen many people using shell scripts to do so, but the best approach is to use a language designed to describe pipelines: this is what Make is, one of the oldest (yet still widely used) languages to define pipelines, so it is a good one to start learning with.

Another difference with respect to other seminars on makefiles is that I have tried to start with a ‘reduced makefile syntax’, in which you just use the name of the rule, the prerequisites, and the commands, without worrying about the name of the rule corresponding to the name of an output file. If you prefer to learn the standard approach, I suggest starting with the corresponding section on software carpentry for bioinformatics.

Starting from 3

Dear readers of this blog,

I have decided to start writing new posts and articles in this blog, and to temporarily abandon the idea of restoring the old contents, which I will try to do when I have more spare time.

The responsibility for the blackout of this site is both mine, because I haven’t been very keen on fixing it, and my hosting provider’s, which changed the conditions of the contract too quickly, without giving me enough time to make a proper backup and organize a migration.

Now I have a raw SQL file containing all the data from the previous blog, but because of a lack of time I haven’t yet managed to use it to restore the previous articles, so for the moment I will start writing new posts and restore the old contents as soon as possible.

I am sorry to have left you, the few readers of this blog, for so long without giving any explanation or actively trying to restore it 🙂 I hope I will be able to do better in the future.

Cheers, and break a leg for your plans of conquering the world with bioinformatics 🙂