colleague leaving academia

Massimo Sandal is the person who introduced me to Linux. People who know me in person will understand how much this means to me, since I am totally a hard-core Linux geek.

Some 6 or 7 years ago, in the second year of my bachelor’s degree, I joined a group of geek-lous students in the Biotec faculty who had created a mailing list to discuss Linux and free software, and who proposed to meet every now and then to install Fedora Core 1 or play with it. In practice, we ended up meeting only once or twice, but that was enough to lead me to the dark side and transform me into the Linux nerd I am now.

Since then, the open source world has become very important to me. I clearly remember the day when something in my brain switched on and I realized how much open source software is available out there, and how many things I could learn by using it. The exact moment of that conversion was when I was reading the Zope book on the train back to my home town while playing with my laptop. I could not believe that there was so much documentation and so many modules available for free: that was how the Microsoft blinkers fell from my eyes. I like to say that, after that, the speed at which I improved my programming and computer skills increased at least 5- or 10-fold.

So, I was surprised to read, in the Italian media, about a post that Massimo wrote on his blog about his decision to leave academia. I am not sad about him leaving the research field: it is a personal decision, and I respect that. However, I am sad that a person like Massimo does not feel comfortable in the academic world.

What else can I say.. I wish him all the best, and I hope he will be able to find an even more geeky and nerdish job wherever he goes. Now it’s my turn to start converting new innocent souls to the dark Free Software side.. I already started doing it 🙂

Should a pipeline have ‘if’ conditions and loops?

On the ruffus mailing list[1] I am participating in a discussion on whether a pipeline should contain ‘if’ conditions and loops.

I don’t like to see conditions in a pipeline. In real pipelines, the ones with pipes and water in them, there is no equivalent of ‘if’ conditions. The tubes can either be closed or open, but that is defined in the pipeline structure. It is not that the pipeline can change its structure and open or modify its path depending on whether there is water or oil running in it, or that the water can choose whether to enter a tube or not.

With Makefiles, you usually avoid if and while conditions, to keep the code easier to understand. Moreover, with make, when you have to execute the same task for multiple elements (e.g. call the same program with different inputs), you would rather launch a series of parallel jobs than write a loop: that is similar to splitting a pipeline into different tubes.

So, for me a pipeline is just a script that you call and that executes a fixed series of steps. If you start putting if conditions and loops in it, then you won’t be able to tell which steps are called every time you launch it, and it will be more difficult to understand the code.
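To make the argument concrete, here is a minimal sketch in plain Python (not ruffus, and not a Makefile). The three steps `clean`, `reverse` and `gc_count` are hypothetical and only stand in for real programs. The point is that the pipeline body has no conditions, so every step is called on every run, and several inputs are handled as parallel jobs rather than as a loop written inside the pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical steps, invented for this illustration only.
def clean(raw):
    """Strip whitespace and normalise to upper case."""
    return raw.strip().upper()


def reverse(seq):
    """Reverse the sequence."""
    return seq[::-1]


def gc_count(seq):
    """Count the G and C bases in the sequence."""
    return sum(base in "GC" for base in seq)


def run_pipeline(raw):
    """A fixed series of steps: no 'if', no loop, same path every time."""
    cleaned = clean(raw)
    reversed_seq = reverse(cleaned)
    return gc_count(reversed_seq)


def run_many(inputs):
    """Same pipeline on many inputs, launched as parallel jobs.

    This plays the role of splitting the pipeline into several tubes.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_pipeline, inputs))


if __name__ == "__main__":
    print(run_many(["acgt", "gggg", "aaaa"]))
```

Reading `run_pipeline` from top to bottom is enough to know which steps run, which is the property that conditions and loops would take away.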

What do you think? Am I being too silly? Long live the pipelines!! 🙂

[1] ruffus is a tool to define bioinformatics pipelines with a Python-like syntax, as an alternative to Makefiles

a discussion about node centralities with G. Scardoni

Last week we hosted a visit from G. Scardoni, the author of Centiscape, a Cytoscape plugin that calculates different measures of node centrality in a network.

I recommend reading supplementary material 1 of his paper (it’s a PDF), because it has a good description of many measures of node centrality and their possible interpretation in a biological context.

A node centrality is a parameter that, given a node’s position and interactions in a network, determines its importance. To understand it better, consider that one of the main purposes of centralities in biology is to identify which genes are more important in a biological process: which are in a bottleneck position, which are required for proper function, and which are only redundant.

The simplest measure of node centrality is the degree, which is the number of connections of a node. It seems logical to think that genes with a higher degree (a higher number of interactions) should be more important than the others, because a loss of function there will affect more interactions. However, after reading about the Centiscape plugin I realized that there are a lot of measures of node centrality, including closeness, betweenness, stress, centroid, etc. The degree is not the best parameter to identify genes in bottleneck positions, for which we should use betweenness or stress instead.
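A small worked example may help to see why degree can miss a bottleneck. The sketch below is plain Python and does not use Centiscape or Cytoscape. The toy network and its node names are invented: two triangles (A, B, C and E, F, G) joined through a single node D. Betweenness is computed here as the number of shortest paths between other pairs of nodes that pass through a node, using the standard Brandes approach for unweighted, undirected graphs.

```python
from collections import deque

# Invented toy network: two triangles joined through the bridge node D.
EDGES = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("D", "E"),
    ("E", "F"), ("E", "G"), ("F", "G"),
]


def build_graph(edges):
    """Return an undirected adjacency map {node: set of neighbours}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph):
    """Degree centrality: the number of connections of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalised betweenness for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        preds = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {node: total / 2 for node, total in score.items()}


GRAPH = build_graph(EDGES)
DEGREE = degree(GRAPH)
BETWEENNESS = betweenness(GRAPH)

if __name__ == "__main__":
    for node in sorted(GRAPH):
        print(node, DEGREE[node], BETWEENNESS[node])
```

In this toy network D has only two connections, fewer than C or E, yet every shortest path between the two triangles has to pass through it, so it gets the highest betweenness.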

Wikipedia image showing betweenness in a network.

Just to round off this post, I have opened a discussion on Biostar about which measures of node centrality can be applied to biological networks. Let’s see what comes out of that discussion, and whether there are other centralities I do not know yet :-).

second part of the slideshow on HG for bioinformatics

I gave the second part of my talk on version control and hg to my group (check the first part). Here are the slides:


I am working in collaboration with some of my colleagues to write a pipeline for calculating some tests for our projects.

The idea is to use hg to coordinate the writing of these scripts. We will have a reference version of the scripts in a private bitbucket.org repository; then, everybody will synchronize their local copy of the scripts from there, and upload new changes to the same place.

Some of my colleagues told me that hg is much easier than they thought. I am very happy about this because I was worried about it being too difficult to use. I have wanted to convince my colleagues to adopt some version control tools for a really long time, and it seems that it was easier than I expected.

links, resources, games, tools (January 2011)

These are links that I have collected in the past two months. I am copying them in a pseudo-random order.

PhD students life / becoming a better PhD student

1000 genomes & co

The analogy between Blast and Google

A way to explain what Blast is to young students or non-scientists is to say that ‘Blast is the equivalent of Google for searching sequences’.

This analogy is controversial and not all bioinformaticians would agree with it, but it is one of my favorite ways to explain what Blast is to people outside science, and it is the explanation I use during Open Days or Science Meets Society events.

In this post I will discuss the pros and cons of explaining Blast as the equivalent of Google for scientists. It is up to you to judge, and to decide whether to start using the analogy yourself.

Pros

– Both are for searching.

The most common use of Blast is to search for a sequence, to see whether it exists or whether a similar sequence is already known. In this sense, a Blast query is equivalent to a Google query, where you type a word or a phrase and you want to know what is already known about it.
Let’s say you have the sequence ACGAGGGCATCGATCGACCTATCTCTTTCTAGGCAATC: what would be the first thing to do to learn its function and role? Just blast it, and see which results come out. Similarly, let’s say that you encounter a phrase whose meaning you don’t know, like ‘Asparagine N-Glycosylation’: how can you find out what it means? The easiest solution is just to google the phrase and see what comes out. I think that it is important for a student to understand this analogy: Blast can be used to find out what is known about a sequence of nucleotides or amino acids.

– Both are used because they are popular.

What makes Blast the most used alignment engine, when there are a lot of alternatives available?
The main reason is that Blast is the most popular and best-known tool of its kind, so many researchers just use it as the default choice. Moreover, it is robust and people trust the results it returns, as with Google.
This does not mean that Blast is the best alignment tool for everything: for certain tasks it may be better to try an alternative, just as there are alternatives to Google.
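For readers who prefer code, the ‘searching’ half of the analogy can be sketched as a toy in plain Python. This is not how Blast is implemented: it only borrows the idea of indexing short words, much as a web search engine indexes the words of a page. The query is the example sequence used above, while the two database entries (`toy_gene_1`, `toy_gene_2`) and the word length `K` are invented for illustration.

```python
from collections import Counter, defaultdict

K = 8  # word length, an arbitrary choice for this toy example

# Invented database: toy_gene_1 contains a stretch of the query,
# toy_gene_2 is unrelated.
DATABASE = {
    "toy_gene_1": "TTTTGGCATCGATCGACCTATCAAAA",
    "toy_gene_2": "CCCCCCCCGGGGGGGGTTTTTTTT",
}

QUERY = "ACGAGGGCATCGATCGACCTATCTCTTTCTAGGCAATC"


def words(seq, k=K):
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(database, k=K):
    """Map every word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for word in words(seq, k):
            index[word].add(name)
    return index


def search(query, index, k=K):
    """Rank database sequences by the number of words shared with the query."""
    hits = Counter()
    for word in words(query, k):
        for name in index.get(word, ()):
            hits[name] += 1
    return hits.most_common()


INDEX = build_index(DATABASE)
RESULTS = search(QUERY, INDEX)

if __name__ == "__main__":
    for name, shared in RESULTS:
        print(name, shared)
```

As with a web search, the sequences that share the most words with the query come first, and unrelated sequences do not appear at all. A real alignment tool then goes much further, extending and scoring the matches.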


The true story behind the annotation of a pathway

These slides are from a talk I gave earlier this week to my lab, describing two papers we published recently:

(the slides are published on Nature Precedings: you can vote for them here)

Bioinformaticians frequently use data and annotations from scientific databases, like KEGG or Uniprot. However, it is difficult to know how accurate this data is, and to what extent it can be used for a large-scale analysis.

So, the talk is about this. Let’s say you dedicate 6 months of your PhD thesis to accurately studying and annotating a set of genes, like I did for the N-Glycosylation pathway: how many errors or unclear annotations do you expect to find in scientific databases?

Another topic discussed in the talk is how to report an error to a database. Many databases do not have a transparent system for reporting errors, so any inconsistency is corrected behind the scenes, which creates problems for reproducibility. Moreover, the process of reporting errors to a database is basically not acknowledged by the scientific community. This is unfortunate, because if it were more recognized we could have better annotations in the databases and a more active scientific community.

References:

  • Dall’Olio GM, Bertranpetit J, & Laayouni H (2010). The annotation and the usage of scientific databases could be improved with public issue tracker software. Database: the journal of biological databases and curation, 2010. PMID: 21186182
  • Dall’Olio GM, Jassal B, Montanucci L, Gagneux P, Bertranpetit J, & Laayouni H (2011). The annotation of the Asparagine N-linked Glycosylation pathway in the Reactome Database. Glycobiology. PMID: 21199820