Should a pipeline have ‘if’ conditions and loops?

On the ruffus mailing list[1] I am participating in a discussion on whether a pipeline should contain ‘if’ conditions and loops.

I don’t like to see conditions in a pipeline. In real pipelines, the ones with pipes and water in them, there is no equivalent of ‘if’ conditions. The tubes can be either closed or open, but that is defined in the pipeline structure. A pipeline cannot change its structure and open or modify its path depending on whether water or oil is running in it, and the water cannot choose whether or not to enter a tube.

With Makefiles, you usually avoid ‘if’ and ‘while’ constructs, to keep the code easier to understand. Moreover, with make, when you have to execute the same task for multiple elements (e.g. call the same program with different inputs), you launch a series of parallel jobs rather than writing a loop: that is similar to splitting a pipeline into different tubes.

So, for me, a pipeline is just a script that you call and that executes a series of steps. If you start putting ‘if’ conditions and loops in it, you can no longer tell which steps are called every time you launch it, and the code becomes more difficult to understand.
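
To make the idea concrete, here is a minimal sketch in plain Python. It is not the Ruffus API and not a real pipeline of mine: the step names (clean, count_gc, report) and the example inputs are invented for illustration. The steps are always called in the same fixed order, and the same pipeline is launched on several inputs as parallel jobs, like water split into different tubes.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical steps, invented for this sketch.
def clean(raw):
    """Step 1: normalise the raw sequence."""
    return raw.strip().upper()


def count_gc(sequence):
    """Step 2: count the G and C bases."""
    return sequence, sum(base in "GC" for base in sequence)


def report(result):
    """Step 3: format the result as a line of text."""
    sequence, gc = result
    return f"{sequence}: {gc} G/C bases"


def run_pipeline(raw):
    """Every input goes through the same steps, in the same order."""
    cleaned = clean(raw)
    counted = count_gc(cleaned)
    return report(counted)


def run_all(inputs):
    """Launch one job per input; pool.map keeps the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_pipeline, inputs))


if __name__ == "__main__":
    for line in run_all([" acgt ", "ggcc", "atat"]):
        print(line)
```

Reading run_pipeline is enough to know which steps run on every launch, which is the property I would like a pipeline to keep.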

What do you think? Am I being too silly? Long live the pipelines!! :-)

[1] ruffus is a tool to define bioinformatics pipelines with a python-like syntax, as an alternative to Makefiles

Posted in Uncategorized | 2 Comments

a discussion about node centralities with G.Scardoni

Last week we hosted a visit from G. Scardoni, the author of Centiscape, a Cytoscape plugin to calculate different measures of node centrality in a network.

I recommend reading supplementary material 1 of his paper (it’s a PDF), because it has a good description of many measures of node centrality and their possible interpretation in a biological context.

A node centrality is a parameter that determines the importance of a node, given its position and interactions in a network. To understand it better, consider that one of the main purposes of centralities in biology is to identify which genes are more important in a biological process: which are in a bottleneck position, which are required for proper function, and which are merely redundant.

The simplest measure of node centrality is the degree, which is the number of connections of a node. It seems logical to think that genes with a higher degree (a higher number of interactions) should be more important than the others, because a loss of function there will affect more interactions. However, after looking at the Centiscape plugin I realized that there are many measures of node centrality, including closeness, betweenness, stress, centroid, etc. The degree is not the best parameter to identify genes in bottleneck positions, for which we should use betweenness or stress instead.

wikipedia image showing the Betweenness in a network.
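
As a small illustration of why degree and betweenness can disagree, here is a sketch in plain Python. It is not how Centiscape computes its measures: the toy network is invented, and the betweenness is computed with Brandes' algorithm for unweighted, undirected graphs, left unnormalised. The network is made of two tightly connected groups of four nodes joined only through a node called x.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph):
    """Degree centrality: the number of connections of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalised betweenness (Brandes' algorithm, unweighted graph)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        order = []
        parents = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in parents[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair has been counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Invented toy network: two groups of four nodes, bridged only by "x".
LEFT = ["a1", "a2", "a3", "a4"]
RIGHT = ["b1", "b2", "b3", "b4"]
EDGES = (
    list(combinations(LEFT, 2))
    + list(combinations(RIGHT, 2))
    + [("a1", "x"), ("x", "b1")]
)
GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    deg = degree(GRAPH)
    bet = betweenness(GRAPH)
    for node in sorted(GRAPH):
        print(node, deg[node], bet[node])
```

In this toy network x has the lowest degree (2) but the highest betweenness (16), because every shortest path between the two groups passes through it: it is the bottleneck, even though the degree alone would rank it last.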

To round off this post, I have opened a discussion on biostar about which measures of node centrality can be applied to biological networks. Let’s see what comes out of that discussion, and whether there are other centralities I do not know yet :-) .

Posted in Uncategorized | 2 Comments

second part of the slideshow on HG for bioinformatics

I gave the second part of the talk on Version Control and hg to my group (check the first part). Here are the slides:

I am working in collaboration with some of my colleagues to write a pipeline for calculating some tests for our projects.

The idea is to use hg to coordinate the writing of these scripts. We will have a reference version of the scripts on a private bitbucket.org repository; then, everybody will synchronize their local copy of the scripts from there, and upload new changes to the same place.

Some of my colleagues told me that hg is much easier than they thought. I am very happy about this, because I was worried about it being too difficult to use. I have wanted to convince my colleagues to adopt a version control tool for a long time, and it seems that it was easier than I expected.

Posted in slideshows, talks | 2 Comments

links, resources, games, tools (January 2011)

These are links that I have collected in the past two months. I am copying them in a pseudo-random order.

PhD students life / becoming a better PhD student

1000 genomes & co

Posted in Uncategorized | Leave a comment

Apps and videogames for bioinformatics/genetics geeks (January 2011)

Apps and Games

  • FreePub is mind-mapping software to organize scientific materials. Check also this presentation
  • After the games about protein folding and multiple alignments, a new geeky bioinformatics game has been published on the Internet. Check out EteRNA!
Posted in Uncategorized | Leave a comment

For N-Glycosylation freakies

If you are an N-Glycosylation geek, check out this interview with Ajit Varki, a guru of the field:

Interview with A. Varki on the importance of studying post-translational modifications

Posted in Uncategorized | Leave a comment

short introduction to version control and hg (mercurial)

Last week I gave a short introductory talk to explain hg and the concept of version control to my colleagues.

If you are new to the concepts of version control, I recommend watching the excellent introductory videos at Software Carpentry.

Posted in talks | Leave a comment

The analogy between Blast and Google

A way to explain what Blast is to young students or non-scientists is to say that ‘Blast is the equivalent of Google for searching sequences’.

This analogy is controversial and not all bioinformaticians would agree with it, but it is one of my favorite ways to explain what Blast is to people outside science, and it is the explanation I use during Open Days or Science Meets Society events.

In this post I will discuss the Pros and the Cons of explaining Blast as the equivalent of Google for scientists. It is up to you to judge whether to start using the analogy yourself.

Pros

- Both are for searching.

The most common usage for Blast is to search for a sequence, to see whether it exists or if a similar sequence is already known. In this sense, a Blast query is equivalent to a Google query, where you type a word or a phrase and you want to know what is already known about it.
Let’s say you have the sequence ACGAGGGCATCGATCGACCTATCTCTTTCTAGGCAATC: what would be the first thing to do to learn its function and role? Just blast it, and see which results come out. Similarly, let’s say that you encounter a phrase whose meaning you don’t know, like ‘Asparagine N-Glycosylation’: how can you find out what it means? The easiest solution is just to google the phrase and see what comes out. I think that it is important for a student to understand this analogy: Blast can be used to find out what is known about a sequence of nucleotides or amino acids.

- Both are used because they are popular.

What makes Blast the most used alignment engine, when there are a lot of alternatives available?
The main reason is that Blast is the most popular and best-known tool of its kind, so many researchers just use it as the default choice. Moreover, it is robust and people trust the results it returns, as they do with Google.
This does not mean that Blast is the best alignment tool for everything: for certain tasks it may be better to try an alternative, just as there are alternatives to Google.


Posted in Uncategorized | 4 Comments

The true story behind the annotation of a pathway

These slides are from a talk I gave earlier this week to my lab, describing two papers we published recently:

(the slides are published on Nature Precedings: you can vote for them here)

Bioinformaticians frequently use data and annotations from scientific databases, like KEGG or Uniprot. However, it is difficult to know how accurate this data is, and to what extent it can be used for a large-scale analysis.

So, the talk is about this. Let’s say you dedicate 6 months of your PhD thesis to accurately studying and annotating a set of genes, like I did for the N-Glycosylation pathway: how many errors or unclear annotations do you expect to find in scientific databases?

Another topic discussed in the talk is how to report an error to a database. Many databases do not have a transparent system to report errors, so any inconsistency is corrected behind the scenes, which creates problems for reproducibility. Moreover, the work of reporting errors to a database is basically not acknowledged by the scientific community. This is unfortunate, because if it were more recognized we could have better annotations in the databases and a more active scientific community.

References:

  • Dall’Olio GM, Bertranpetit J, & Laayouni H (2010). The annotation and the usage of scientific databases could be improved with public issue tracker software. Database : the journal of biological databases and curation, 2010 PMID: 21186182
  • Dall’Olio GM, Jassal B, Montanucci L, Gagneux P, Bertranpetit J, & Laayouni H (2011). The annotation of the Asparagine N-linked Glycosylation pathway in the Reactome Database. Glycobiology PMID: 21199820
Posted in slideshows | 4 Comments

Should I start putting my slides on Nature Precedings?

After the experience with the Post-GWAS article on WikiGenes I started looking at the resources on Nature Precedings, which is where the original idea of that collaborative article came from.

Nature Precedings is a Nature Network website where researchers can post drafts, ideas, and presentations about work that can be published. This is exactly what I suspect happened with the WikiGenes article: the authors from the Post-GWAS consortium published a draft of a letter there, and the letter was noticed by a Nature editor, who suggested transforming it into an article and opening it to collaborative editing.

I am looking at the presentations on Nature Precedings and thinking that maybe some of the presentations I made or attended could be posted there. It is not very clear how the requirements for publication there are interpreted: most of the documents present pre-print papers and drafts, but some of the presentations illustrate software tools that have already been published.

In any case, I think I will start putting something there… I have some ideas that I do not have the time to develop by myself; maybe if I upload them there, I can find somebody who wishes to collaborate with me.

Posted in Uncategorized | 3 Comments