a script to scrape Uniprot

I wrote a rudimentary script to scrape the Uniprot website and extract some information about a list of Uniprot entries. This may be useful as an example of how to query Uniprot (since I couldn’t find any public API or library), or to get information about a list of genes of interest.

NOTE: the correct way to do this is with the Retrieve tool on the Uniprot page. The script presented in this post is just an example of how to use the Python library Mechanize.

The code is available at bitbucket. Usage is simple: edit the files to enter your email address and the list of IDs you are interested in, and run it as a python script.




Favorite command of the day: parallel from moreutils

Yesterday I discovered a nice Unix tool to launch commands in parallel. It is called ‘parallel’ and it is very easy to use. I think it is the easiest way to parallelize things on a multi-core computer. You can install it from the ‘moreutils’ package in Linux. A separate tool with the same name, GNU parallel, is available from http://www.gnu.org/s/parallel/, but its syntax differs from the one shown in this post.

the basic usage is:

$: parallel <interpreter> <command> -- <list of arguments>

for example

$: parallel bash -c "echo hola" -- 1 2 3

This example will launch the "echo hola" command three times in parallel, one for each argument after the '--'.
You can use the command “htop” to monitor CPU usage.

Thanks to this command, it is very easy to launch a great number of jobs in parallel. For example, if I want to run 1000 simulations:

$: parallel perl launch_a_single_simulation.pl -- {1..1000}

This will run the 1000 simulations, using as many processors as are available.

By using the -i option, it is also possible to choose where in the command each argument after the '--' is inserted.

$: parallel -i bash -c "echo hola {}" -- Johannes Marc Pierre Manu
hola Marc
hola Johannes
hola Manu
hola Pierre

When using the -i option, the symbol ‘{}’ is replaced by the argument.

For example, if we want to run a job on all chromosomes, we can just say:

$: parallel -i python calculate_test_on_chromosome.py {} -- {1..22} X Y

Or, if we want to execute a script for many genes, we can say:

$: parallel -i python get_plot_by_gene --gene {} -- ALG12 MGAT3 DOLPP1
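
To make the '{}' substitution concrete, here is a minimal Python sketch of the same idea. It is only an analogue built on the standard library, not how parallel itself is implemented; the function names expand_commands and run_all are made up for this illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def expand_commands(template, arguments):
    """Build one command per argument, replacing '{}' inside every token."""
    return [[token.replace("{}", arg) for token in template] for arg in arguments]


def run_all(commands, max_jobs=None):
    """Run the commands concurrently and return their outputs in input order."""
    def run(command):
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        return result.stdout.strip()

    # max_jobs=None lets ThreadPoolExecutor pick its own default pool size.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Use the current Python interpreter as the command, so the sketch
    # does not depend on any particular shell being installed.
    template = [sys.executable, "-c", "print('hola {}')"]
    for line in run_all(expand_commands(template, ["Johannes", "Marc", "Pierre", "Manu"])):
        print(line)
```

Unlike parallel, which prints each output as soon as a job finishes (hence the shuffled names above), this sketch collects the outputs and returns them in the order of the arguments.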

Have fun 🙂

Ten Simple Rules initiative entering the final phase

The initiative for the collaborative writing of a candidate ‘Ten Simple Rules’ paper, launched on this blog two weeks ago, has been very successful. So successful that after only two weeks the manuscript is almost ready, in a state where further modifications may do more harm than good.

For this reason, we are planning to close the editing phase early. The initial deadline was May 28th; but since it does not make sense to continue working on it, we will probably leave the manuscript editable for a few more days, and then close it.

So, if you want to participate, hurry up! Join the mailing list and add your contribution!


new version of the collaborative Post-GWAS article published

There is some recent news about the collaborative article on Post-GWAS analysis launched last December[1]. It seems that a new version of the manuscript was published on Nature Precedings (link) a few weeks ago.

Well, in the end, with the exception of one figure, they included almost nothing of what was contributed in the wiki (I still have to check carefully). They thank the contributors in the acknowledgments section and leave a link to the wiki page, but say that the contributions have not been included for reasons of space.

update on the status of the ‘Ten Simple Rules’ initiative after the first 2 weeks

This is an update on the status of the ‘Ten Simple Rules for getting help from Mailing Lists and Online Scientific Communities’ initiative, after two weeks. I am posting it here, but if you want to follow the initiative you should subscribe to the dedicated mailing list.

First of all, I would like to thank the people who have participated. Honestly, I didn’t expect this initiative to proceed so fast, and I am very happy to have seen so many contributions and so much feedback :-).

It seems that the collaborative open approach has paid off, this time!

Deadlines and submission date

The manuscript is almost complete already. The original deadline was the end of May; however, I think we could probably finish it and submit it earlier. The manuscript is in a state where each further modification may do more harm than good.

contribute to a candidate ‘Ten Simple Rules’ article

A few months ago I had the idea of writing an article in the style of the PLoS Computational Biology ‘Ten Simple Rules’ series, explaining how to use mailing lists and web forums to solve technical problems related to bioinformatics. Something in the style of ‘How To Ask Questions The Smart Way’ by Eric Raymond [3], but adapted to bioinformaticians and gentler.

However, it does not make sense to submit a paper on the best rules for getting help from online communities without achieving some sort of community consensus first. There are so many online communities, and so many different approaches and best practices, that a single person cannot represent all the different opinions on this.

So, I am launching an open collaborative draft for a paper in the style of the PLoS ‘Ten Simple Rules’ series, entitled ‘Ten Simple Rules for Getting Help from Mailing Lists and Online Communities’. Here is the main page of the project:

Public Invitation to the Candidate for ‘Ten Simple Rules for Getting Help from Mailing Lists and Online Communities’

The document will be hosted on the WikiGenes Wiki, where everybody will be able to make contributions (after registering on the site). The WikiGenes engine will keep track of the individual contributions and acknowledge the authors of the bigger ones. Two months from now (on May 28th), I will close the document and invite the authors of the biggest contributions to sign it as authors; the manuscript will then be sent to PLoS CompBio, where it will eventually be published, provided it passes the editorial review process.

about my research: gene position and selective constraints

It is time I introduced a bit of the research I am doing for my PhD, here at the Pompeu Fabra-CSIC university 🙂

The main area of our research is to study whether there is a correlation between the position of a gene within a biological pathway and the strength of the selective constraints on it. For example, we ask whether genes involved in a high number of interactions and functions tend to be more conserved (i.e., show fewer changes) among species. This can be better explained with this figure from a terrible poster I presented at the workshop on Evolutionary Systems Biology last year:

In this hypothetical biological pathway, genes in upstream positions or with a high number of interactions are more functionally constrained than the others, therefore their sequence should be more conserved.

The figure represents an ideal pathway of genes, like the ones annotated in KEGG, MetaCyc or Reactome. All the nodes are genes, and the edges represent any kind of interaction between two genes: for the general discussion it is not necessary to specify whether they are metabolic, physical or other kinds of interactions.

The intensity of the colors in the figure represents the strength of the selective constraints we expect to find on each node. The gene in the most upstream position should be the one with the strongest selective constraints because, if a mutation introduces a loss of function there, all the downstream interactions will be compromised. A similar reasoning can be made for genes with a high number of interactions, which should be strongly conserved.
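
As a rough illustration of the two properties discussed above, the following Python sketch computes them on a toy pathway. Everything in it is an assumption made for the example: the pathway and its gene names are invented, and counting downstream genes and edges per node is just one simple way to quantify "upstream position" and "number of interactions", not necessarily the measures used in the actual research.

```python
from collections import deque

# Hypothetical toy pathway: each gene maps to the genes directly downstream of it.
PATHWAY = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def downstream_genes(pathway, gene):
    """Return the set of genes reachable from `gene` by following the edges."""
    seen = set()
    queue = deque(pathway[gene])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(pathway[current])
    return seen


def interaction_counts(pathway):
    """Count the edges touching each gene (incoming plus outgoing)."""
    counts = {gene: len(targets) for gene, targets in pathway.items()}
    for targets in pathway.values():
        for target in targets:
            counts[target] += 1
    return counts


if __name__ == "__main__":
    degrees = interaction_counts(PATHWAY)
    for gene in PATHWAY:
        print(gene, len(downstream_genes(PATHWAY, gene)), degrees[gene])
```

In this toy example, gene A has the most downstream genes and gene D has the most interactions, so under the hypothesis these two would be expected to show the strongest constraints.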

my Twitter account!

I have just created a twitter account. You can now follow me at: http://twitter.com/#!/dalloliogm

I tried to resist joining Twitter for a long time, but now I need it to participate in a spare-time project of mine. I recognize that Twitter can be a very useful tool for a researcher, but I am worried that it can be too intrusive and distract me too much.

Do you have any suggestions for a new twitter user? Which software (on Ubuntu) do you use to check the feeds? Which groups would you recommend to a bioinformatician?


scripting Cytoscape to plot different Node Centrality measures

I have finally made it: scripting and automating Cytoscape with Python!! Below you can see a figure that I generated automatically with Cytoscape, including the legend and the value distributions merged into a single file:

Distribution of 'Centroid' values in the pathway of N-Glycosylation. Figure generated by automatically scripting Cytoscape: Click on the figure to see the whole pdf report.

Cytoscape is a program for visualizing and analyzing networks, widely adopted by the bioinformatics community and with a lot of plugins to analyze biological data. Unfortunately for me, it is written in Java, which makes it a lot more difficult to automate (at least for people who don’t program in Java, like me).

One of the protocols I wanted to automate in Cytoscape was to plot different measures applied to the nodes of the same network, and export an image of it (along with the legend) automatically. For example, I wanted to calculate different measures of node centrality on a network with Centiscape, and then plot a figure for each measure and save it to a file.

I finally managed to automate this when I discovered the XMLRPC plugin for Cytoscape. I have learned a lesson: if you want to automate anything in Cytoscape with any programming language other than Java, use the XMLRPC plugin. There is also a Python Console plugin for Cytoscape, but I recommend using the XMLRPC plugin directly. It is better documented (I couldn’t find any documentation for the Console plugin), you can launch it from a bash terminal, and if you use ipython, you get name completion. Moreover, the XMLRPC protocol is more standard than Cytoscape’s inner Python console, so you will also learn something useful from it.
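
For readers who have never used XML-RPC, here is a minimal sketch of what a client call looks like from Python, using only the standard library. Note the assumptions: the server in this example is a local stand-in started by the script itself, not Cytoscape, and the method name count_nodes is invented for the illustration. With the real plugin you would point ServerProxy at the address the plugin listens on and call the methods listed in its documentation.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def start_stand_in_server():
    """Start a local XML-RPC server that plays the role of the remote program."""
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Hypothetical remote method: returns how many node names it received.
    server.register_function(lambda nodes: len(nodes), "count_nodes")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_count_nodes(nodes):
    """Call the stand-in server's count_nodes method through an XML-RPC proxy."""
    server = start_stand_in_server()
    host, port = server.server_address
    proxy = xmlrpc.client.ServerProxy(f"http://{host}:{port}")
    try:
        # The remote method is called as if it were a local Python method.
        return proxy.count_nodes(nodes)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(call_count_nodes(["ALG12", "MGAT3", "DOLPP1"]))
```

The point of the sketch is the client side: once the proxy object exists, every remote function is available as an ordinary method call, which is also why name completion in ipython is so convenient here.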

So, if you want to see an example of how to automate Cytoscape with Python, or want to compare different measures of node centrality on a network, you can access a repository called ‘Cytoscape compare node centralities’ that I set up on bitbucket. The code can also be used as an answer to one of the most pressing problems that affect Cytoscape users: exporting a network view along with its legend.

colleague leaving academia

Massimo Sandal is the person who introduced me to Linux. The people who know me in person will understand how much this means to me, since I am totally a hard-core Linux geek.

About 6 or 7 years ago, in the second year of my bachelor's degree, I joined a group of geek-lous students in the Biotec faculty, who had created a mailing list to discuss Linux and free software and proposed to meet every now and then to install Fedora Core 1 or play with it. In practice, we ended up meeting only once or twice: but that was enough to lead me to the dark side and transform me into the Linux nerd I am now.

Since then, the open source world has become very important to me. I clearly remember the day when something in my brain switched on and I realized how much open source software is available out there, and how many things I could learn by using it. The exact moment of that conversion was when I was reading the Zope book on the train back to my home town and playing with my laptop. I could not believe that there was so much documentation and so many modules available for free: that was how the Microsoft blinkers fell from my eyes. I like to say that, after that, the speed at which I improved my programming and computer skills increased at least 5- or 10-fold.

So, I was surprised to read, in the Italian media, about a post that Massimo has written on his blog about his decision to leave academia. I am not sad about him leaving the research field: it is a personal decision, and I respect that. However, I am sad that a person like Massimo does not feel comfortable in the academic world.

What else can I say.. I wish him all the best, and I hope he will be able to find an even more geeky and nerdish job wherever he goes. Now it’s my turn to start converting new innocent souls to the dark Free Software side.. I already started doing it 🙂