Should a pipeline have ‘if’ conditions and loops?

On the ruffus mailing list[1] I am participating in a discussion on whether a pipeline should contain ‘if‘ conditions and loops.

I don’t like to see conditions in a pipeline. In real pipelines, the ones with pipes and water in them, there is no equivalent of ‘if‘ conditions. The tubes can be either closed or open, but that is defined in the pipeline structure. The pipeline cannot change its structure and open or modify its path depending on whether there is water or oil running in it, and the water cannot choose whether to enter a tube or not.

With Makefiles, you usually avoid if and while conditions, to keep the code easier to understand. Moreover, with make, when you have to execute the same task for multiple elements (e.g. call the same program with different inputs), you launch a series of parallel jobs rather than writing a loop: that is similar to splitting a pipeline into different tubes.

So, for me a pipeline is just a script that you call and that executes a series of steps. If you start putting if conditions and loops in it, then you won’t be able to tell which steps are called every time you launch it, and the code will be more difficult to understand.
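To make the idea a bit more concrete, here is a minimal sketch in plain Python (not ruffus, and not a Makefile). The step functions, the `STEPS` tuple and the toy inputs are all made up for illustration. The point is only that the list of steps is fixed and readable at a glance, and that the same pipeline is launched once per input as parallel jobs instead of being wrapped in a hand-written loop with conditions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


# Hypothetical steps, invented for this example.
def strip_whitespace(sequence):
    return sequence.strip()


def to_uppercase(sequence):
    return sequence.upper()


def count_gc(sequence):
    return sequence.count("G") + sequence.count("C")


# The structure of the pipeline: always the same steps, in the same order.
STEPS = (strip_whitespace, to_uppercase, count_gc)


def run_pipeline(sequence):
    """Feed the input through every step, one after the other."""
    return reduce(lambda value, step: step(value), STEPS, sequence)


def run_on_all(sequences):
    """Launch one job per input, like splitting the pipeline into tubes."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_pipeline, sequences))


inputs = [" acgt ", "GGCC\n", "attta"]
results = run_on_all(inputs)
print(results)
```

Whatever the input is, you can tell which steps will run just by reading `STEPS`, which is the property I would like a pipeline to keep.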

What do you think? Am I being too silly? Long live the pipelines!! 🙂

[1] ruffus is a tool to define bioinformatics pipelines with a python-like syntax, alternative to Makefiles

 

 

a discussion about node centralities with G.Scardoni

Last week we hosted the visit of G.Scardoni, the author of Centiscape, a Cytoscape plugin to calculate different measures for Node Centralities in a network.

I recommend reading supplementary material 1 of his paper (it’s a pdf), because it has a good description of many measures of node centrality and their possible interpretation in a biological context.

A node centrality is a parameter that, given a node’s position and interactions in a network, determines its importance. To understand it better, consider that one of the main purposes of centralities in biology is to identify which genes are more important in a biological process: which are in a bottleneck position, which are required for proper function, and which are only redundant.

The simplest measure of node centrality is the Degree, which is the number of connections of a node. It seems logical to think that genes with a higher degree (a higher number of interactions) should be more important than the others, because a loss of function there will affect more interactions. However, after reading about the Centiscape plugin I realized that there are a lot of measures of node centrality, including closeness, betweenness, stress, centroid, etc. The degree is not the best parameter to identify genes in bottleneck positions, for which we should use betweenness or stress instead.
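A small sketch may help to see the difference between degree and betweenness. The network is a toy one that I made up (nodes A to G are hypothetical genes): two triangles joined through a single node D. The betweenness is computed with Brandes' algorithm and is left unnormalized, so the numbers may be scaled differently from what Centiscape or other tools report.

```python
from collections import deque

# Toy network, invented for illustration: two triangles (A, B, C) and
# (E, F, G) connected only through the node D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def degree(graph):
    """Number of connections of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search: count the shortest paths from the source.
        order = []
        predecessors = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # In an undirected network every pair of nodes is visited twice.
    return {node: score / 2 for node, score in scores.items()}


degrees = degree(graph)
centrality = betweenness(graph)
for node in sorted(graph):
    print(node, degrees[node], centrality[node])
```

In this toy network D has only two connections, the same degree as the peripheral nodes A, B, F and G, but it gets the highest betweenness, because every shortest path between the two triangles goes through it.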

wikipedia image showing the Betweenness in a network.

To round off this post, I have opened a discussion on Biostar about which measures of Node Centrality can be applied to biological networks. Let’s see what comes out of that discussion, and whether there are other centralities I do not know yet :-).

links, resources, games, tools (January 2011)

These are links that I have collected in the past two months. I am copying them in a pseudo-random order.

PhD students life / becoming a better PhD student

1000 genomes & co

The analogy between Blast and Google

A way to explain what Blast is to young students or non-scientists is to say that ‘Blast is the equivalent of Google for searching sequences‘.

This analogy is controversial and not all bioinformaticians would agree with it, but it is one of my favorite ways to explain what Blast is to people outside science, and it is the explanation I use during Open Days or Science Meets Society events.

In this post I will discuss the Pros and the Cons of explaining Blast as the equivalent of Google for scientists. It is up to you to judge whether to start using the analogy yourself.

Pros

– Both are for searching.

The most common usage for Blast is to search for a sequence, to see whether it exists or if a similar sequence is already known. In this sense, a Blast query is equivalent to a Google query, where you type a word or a phrase and you want to know what is already known about it.
Let’s say you have the sequence ACGAGGGCATCGATCGACCTATCTCTTTCTAGGCAATC: what would be the first thing to do to learn its function and role? Just blast it, and see which results come out. Similarly, let’s say that you encounter a phrase whose meaning you don’t know, like ‘Asparagine N-Glycosylation’: how can you find out what it means? The easiest solution is just to google the phrase and see what comes out. I think that it is important for a student to understand this analogy: Blast can be used to find out what is known about a sequence of nucleotides or amino acids.

– Both are used because they are popular.

What makes Blast the most used alignment engine, when there are a lot of alternatives available?
The main reason is that Blast is the most popular and best-known tool of its kind, so many researchers just use it as the default choice. Moreover, it is robust and people trust the results it returns, as they do with Google.
This does not mean that Blast is the best alignment tool for everything: for certain tasks it may be better to try an alternative, just as there are alternatives to Google.
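Going back to the first point, the idea of "typing a sequence and seeing what comes out" can be shown with a toy search. This is only an illustration of the analogy and it is not how Blast works internally: it simply ranks the entries of a tiny made-up database by the number of 8-letter words they share with the query. The function names, the database entries and the word size are all assumptions of mine.

```python
def kmers(sequence, k=8):
    """Return the set of all words of length k found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def search(query, database, k=8):
    """Rank database entries by the number of k-letter words shared with the query."""
    query_kmers = kmers(query, k)
    hits = []
    for name, sequence in database.items():
        shared = len(query_kmers & kmers(sequence, k))
        if shared > 0:
            hits.append((name, shared))
    # Best match first, like the first result of a web search.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = "ACGAGGGCATCGATCGACCTATCTCTTTCTAGGCAATC"

# Made-up database: one entry contains the whole query, one contains
# only its first 20 letters, one is unrelated.
database = {
    "toy_seq_1": "TTTT" + query + "GGGG",
    "toy_seq_2": "GGGG" + query[:20] + "TTTTTTTTTT",
    "toy_seq_3": "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
}

hits = search(query, database)
for name, shared in hits:
    print(name, shared)
```

The entry that contains the whole query comes first, the partial match second, and the unrelated sequence is not reported at all, which is more or less what a student expects from a search engine.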


Should I start putting my slides on Nature Precedings?

After the experience with the Post-GWAS article on WikiGenes I started looking at the resources on Nature Precedings, which is where the original idea of that collaborative article came from.

Nature Precedings is a Nature Network website where researchers can post drafts, ideas, and presentations about work that may later be published. This is exactly what I suspect happened with the WikiGenes article: the authors from the Post-GWAS consortium published a draft of a letter there, and the letter was noticed by some Nature editor, who suggested transforming it into an article and opening it to collaborative editing.

I am looking at the presentations on Nature Precedings and thinking that maybe some of the presentations I made or attended could be posted there. It is not very clear how the requirements for publication there are interpreted: most of the documents are pre-print papers and drafts, but some of the presentations illustrate software tools that have already been published.

In any case, I think I will start putting something there… I have some ideas that I do not have the time to develop by myself; maybe if I upload them there, I can find somebody wishing to collaborate with me.

technical problems solved!

In the past few days I have been having problems with the DNS services, which should be solved by now.

For approximately one week, this blog was not reachable from everywhere in the world, depending on which DNS servers you were using. Instead, an older version of the site was shown, at least here in Spain and in some parts of Italy.

Everything should be fixed now. Sorry for the inconvenience.

published a geekish paper on reporting errors to scientific databases

We have published a commentary paper about reporting annotation errors to scientific databases. In this work we discussed the fact that reporting annotation errors to a database is usually not acknowledged and not considered a scientific activity, while in our opinion it should be.

Let’s say that you encounter an error in the annotation of a gene in a scientific database. What would you do? Would you report the error to the maintainers so it can be fixed, or would you just pass by and go to another database? Most of the researchers that I know are not even aware that you can report errors to a scientific database, and when they encounter too many errors in one place, they just look for a better-annotated resource. This approach is harmful, because the data in scientific databases does not improve as much as it could, and this favors the fragmentation of annotations across many databases.

Moreover, in this paper we argued that error reports for scientific databases should be public and transparent. Let’s say that you find an error in an annotation in Reactome. How can you communicate this to the other people using the same resource? All the researchers using that data for their work will be affected. And on the other side, how do you know whether the data that you are using is correct, or whether someone else has already found incorrect or outdated information?

The article is already available online:

Dall’Olio, G. M., J. Bertranpetit, and H. Laayouni. The annotation and the usage of scientific databases could be improved with public issue tracker software. Database 2010 (December): baq035-baq035. doi:10.1093/database/baq035.

If you read it, please leave me some comments about it here. I will publish another blog post next week to explain better why we wrote it, along with some personal comments.

The best question to ask in a bioinformatics seminar is..

I have opened a new discussion on Biostar, on ‘What is the best question that you have asked, or heard asking, in a bioinformatics seminar?’

For a scientist, it is very important to be able to ask good questions; seminars are a good place to practice. In my case, I am lucky because in the building where I am doing my PhD there are always a lot of seminars, and at least one each week is about bioinformatics.

My favorite question is to ask about the controls that a bioinformatician used to test the software he wrote, or the analysis he did. When I was studying in Bologna, my former professor of Molecular Biology used to repeat to us, in almost every lesson, that the most difficult part of designing an experiment is choosing the best controls. I believe that the choice of controls is the moment when a bioinformatician is closest to the biology he/she is studying, because you can’t do that if you don’t know the biology behind your project.

For example, yesterday I attended a seminar by one of the people responsible for Ensembl Compara. My question was: which controls do you use when you update the pipeline to predict orthologs? I was wondering whether the Ensembl Compara programmers, with all their experience in predicting orthologs, know of any gene for which there is so much literature that anyone can be absolutely sure about its orthologs in other species.

So, what is your favorite question to ask in a bioinformatics seminar? Which controls are you using in your analysis? 🙂