I think this is a major issue for everyone working in bioinformatics, because compared with classical computer scientists, we bioinformaticians have less of a culture on this topic, and there is generally less awareness of what good programming practices are.
One classical example is the position of the monitors and how to sit in front of them:
most bioinformaticians program in a non-ergonomic position; you can program better just by sitting more comfortably in front of the computer.
When I was still a master's student and was looking for a place to start a PhD, I was literally paranoid about this, because in every laboratory I visited as a candidate, I found that the majority of the monitors were positioned wrongly. It makes no sense: is the ‘publish or perish’ mentality so strong that we don’t even have the time to care about how we sit in front of the computer?
Well, going back to the topic of this post, I think that the tools linked by programming4scientists can be very useful; however, the most effective way to avoid distractions is to work in a good team, and to interact with your teammates.
If you work on your project alone and never speak with your supervisor, you are more likely to waste a lot of time on distractions; but if you have clear objectives, and a clear focus on what you have to do every day, you will be a lot less likely to get distracted.
I think it is the responsibility of the group leader to help students avoid distractions, and to keep them focused on a task. If I get distracted and slow down in my work, it is because I don’t have a clear idea of how or what to do next, and I have nobody to ask. At this point many people will tell you that if you are a true scientist you must be able to carry out your own project alone, but to me this sounds like an excuse or a lie.
There is a whole discipline that studies how to coordinate a group of programmers, and it is called ‘software engineering’.
I have been looking for a python alternative to Makefiles for ages, and this project seems very interesting to me. Compared with scons, paver, waf, and other alternatives, ruffus should be better because it has been designed specifically for pipelines in bioinformatics, while the others are meant for compiling programs.
It can also plot a picture representing the workflow:
A pipeline defined with the ruffus syntax and drawn as an image with a ruffus utility. It is nice to have this feature and it is easy to use.
I have tried a few examples and it seems to be working fine. The only doubt I have is whether it can work with Grid Engine systems (qsub/qmake-like), but I have filed a bug report and hope that the developers will answer me.
Make is a great tool, because it allows you to save shell statements with ease and write simple pipelines; however, its syntax is a bit complicated and you can’t have more than one wildcard character per rule. I have also tried taverna, but it is too big and slow for me: I just need a way to run command line tools locally, while it seems to force you to use web services and so on. Finally, the BioMake syntax seemed fantastic, but I didn’t succeed in installing skam on my computer.
So, it is great to have another alternative for writing pipelines, and ruffus seems to work well. Maybe, if I have time, I’ll write a short subjective comparison of all the pipeline-defining tools I know.
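To make the idea behind these tools more concrete, here is a minimal sketch in plain python of the rule that Make applies: a step is run again only when its output is missing or older than its input. This is not ruffus code; the function names (`needs_update`, `run_step`, `uppercase_file`) are hypothetical and only illustrate the principle.

```python
import os


def needs_update(source, target):
    """Return True if target is missing or older than source (the Make rule)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


def run_step(source, target, action):
    """Run action(source, target) only when the target is out of date.

    Returns True if the action was executed, False if it was skipped.
    """
    if needs_update(source, target):
        action(source, target)
        return True
    return False


def uppercase_file(source, target):
    # Hypothetical pipeline step: copy the input, converting it to upper case.
    with open(source) as src, open(target, "w") as dst:
        dst.write(src.read().upper())
```

Calling `run_step("a.txt", "a.out", uppercase_file)` twice in a row should do the work only the first time, which is what makes it cheap to relaunch a long pipeline after changing a single input file.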
I am writing a parser for the kegg KGML format, the dialect of XML used to store kegg pathways.
update: A KEGG pathway file, parsed with python and plotted with networkx
I couldn’t find any existing module to do this in python. There is a similar library for R (KEGGgraph), maybe something for perl (I didn’t find it), but apparently nothing for python.
It is really a simple and incomplete script, but now I understand which libraries to use and how to read the format, so I will be able to improve it as soon as I have free time.
It requires the networkx library, which is the standard for working with graphs and networks in python, ElementTree (already included in the standard lib) to parse the XML, and pylab/matplotlib for plotting.
Any suggestion is really appreciated. This is the first time I have worked with XML files.
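To give an idea of the approach, here is a minimal sketch (not the actual script) that reads the `entry` and `relation` elements of a KGML document with ElementTree. The toy pathway in `EXAMPLE` is invented for illustration, and the function name `parse_kgml` is hypothetical; the sketch uses only the standard library, so the graph is returned as a plain dict of nodes and a list of edges.

```python
import xml.etree.ElementTree as ET

# Invented toy document that follows the general KGML layout:
# <entry> elements are the nodes, <relation> elements are the edges.
EXAMPLE = """<pathway name="path:xxx00000" org="xxx" number="00000" title="Toy pathway">
  <entry id="1" name="xxx:geneA" type="gene"/>
  <entry id="2" name="xxx:geneB" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>"""


def parse_kgml(xml_text):
    """Parse KGML text and return (nodes, edges).

    nodes maps each entry id to its name and type;
    edges is a list of (entry1, entry2, relation type) tuples.
    """
    root = ET.fromstring(xml_text)
    nodes = {}
    for entry in root.findall("entry"):
        nodes[entry.get("id")] = {
            "name": entry.get("name"),
            "type": entry.get("type"),
        }
    edges = []
    for relation in root.findall("relation"):
        edges.append(
            (relation.get("entry1"), relation.get("entry2"), relation.get("type"))
        )
    return nodes, edges


nodes, edges = parse_kgml(EXAMPLE)
```

From here, each edge can be added to a networkx graph (for example with `add_edge` on a `DiGraph`) and drawn with matplotlib.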
I am back in Barcelona after the Italian PyCon.it, and I can finally publish the slides of my talk on ‘Python and Bioinformatics’:
I am sorry that the slides are in Italian. I could have written them in English, but I was too tired.
The talk went well, and I was asked many questions. Maybe I should have included some more practical examples, as it was too generic, but the fact is that I wanted to discuss methods a bit. I hope I didn’t bore anyone.
I hope the people who asked me won’t mind if I post some of their questions here, but maybe it will make it easier to get better answers than mine.
How can I do business in bioinformatics? Would a framework or some support service be successful?
Most of the people who attended the talk were computer scientists, and they have a better mentality than us scientists, who only think of getting grants and doing our research.
I think that doing business in bioinformatics is feasible, maybe even in Italy.
Creating a company to provide support services will have more chances of being successful than writing new software. There are a lot of laboratories which are in their ‘earlier phases of adopting bioinformatics’: many groups are realizing that it is cheaper to use the data produced by the big American centers instead of producing their own, so they are looking for somebody to guide them and teach them the fundamentals of programming and data management.
The problem is that it is difficult to train as a bioinformatician: it requires a lot of effort, there are not many courses, and it may be more worthwhile to work in better-paid fields like web programming.
I am a computer scientist interested in bioinformatics. Where can I study biology?
I couldn’t answer this question well. I think that at least a master's degree in bioinformatics is a good starting point, but I know people working in this field who have only studied computer science.
It depends on which group you are going to work in; but in general, the important thing is to work in contact with a biologist to understand whether what you are doing is good (what about pair programming?)
Is the speed of a language a limiting factor in bioinformatics? What about these solutions to speed up python?
Honestly I don’t think speed is very important in bioinformatics, because the main bottlenecks are: access to data on disk, network speed in some cases, and above all the time you need to write the program.
A good idea to speed up python is to use psyco. I tried it once, but then I forgot about it. Moreover, the good news is that, as far as I have heard, the GIL problem has been addressed in the most recent versions.
How can I search for overlapping regular expressions on a sequence, with python?
The search functions of re, the python library for regular expressions, only return non-overlapping matches. For instance, if your regular expression is ‘C.’ and you search the string ‘CCA’, you will only get one match (‘CC’), not the two overlapping ones (‘CC’ and ‘CA’).
Some time ago I solved the problem by using a good library called TAMO, created for working with motifs and advanced regular expressions. You can use it to do overlapping searches, look at the tutorial: http://fraenkel.mit.edu/TAMO/TAMO_tutorial.html
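As a lighter alternative that needs no extra library, overlapping matches can also be collected with the standard re module by wrapping the pattern in a lookahead assertion. This is a minimal sketch, not the TAMO approach; the helper name `overlapping_matches` is hypothetical.

```python
import re


def overlapping_matches(pattern, sequence):
    """Return all matches of pattern in sequence, including overlapping ones.

    The pattern is wrapped in a zero-width lookahead with a capturing group,
    so the regex engine tries a match at every position without consuming
    any characters.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [match.group(1) for match in lookahead.finditer(sequence)]


print(re.findall("C.", "CCA"))           # non-overlapping search
print(overlapping_matches("C.", "CCA"))  # overlapping search
```

The first call reports only ‘CC’, while the second one reports both ‘CC’ and ‘CA’.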
How difficult is it to contribute a module to biopython?
Well, I have submitted a few patches to biopython, but the best people to ask this question are the developers themselves.
In general, it is easier if your proposal doesn’t require adding new dependencies and if you already provide some good use cases and tests.
What about Make/taverna/other similar tools?
Somebody asked me why I don’t use taverna to create pipelines with my script.
I have tried taverna extensively, submitted some bug reports, and participated in a lot of their surveys; and I also think that it is very good software.
The problems with taverna are: no testing support, no python support, it is not clear whether you can work with local scripts only, and it can’t be launched from the shell. Make, on the other hand, is more or less standard and installed everywhere.
Some alternatives to make: scons, paver, and waf, which are written in python, and biomake (which is not in python and which I haven’t managed to compile yet).
Will the new unicode behaviour in python 3 slow down programs which read sequence files?
This is a question I asked another speaker. In python 3, all strings are unicode by default. But if I need to read sequences which only contain A, C, T, G, N (5 characters), why do I need to store them as unicode? Won’t it slow down all my parsers?
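One thing that can be checked directly is that python 3 does not force unicode on file contents: a file opened in binary mode (`"rb"`) yields `bytes` objects, which are never decoded. Below is a minimal sketch of a base counter working on bytes only; the function name `count_bases` and the toy sequence are invented, and an in-memory `io.BytesIO` stands in for a real file. It says nothing about which approach is faster, which would have to be measured.

```python
import io


def count_bases(handle):
    """Count the bases in a FASTA-like binary stream, skipping header lines."""
    counts = {}
    for line in handle:
        if line.startswith(b">"):
            continue  # header line, not sequence data
        for byte in line.strip():
            # Iterating over a bytes object yields integers in python 3,
            # so convert each one back to a one-letter key.
            base = chr(byte)
            counts[base] = counts.get(base, 0) + 1
    return counts


# io.BytesIO behaves like a file opened with open(path, "rb").
fake_file = io.BytesIO(b">seq1\nACGTN\nAAC\n")
print(count_bases(fake_file))
```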
It seems that it is difficult to work with spectrometry data, because every company has its own proprietary format and there is no universal parser.
Other questions?
Hmm, maybe I forgot some of the questions; I will add them here later.
I am not a python/bioinformatics guru, but I thought it was a good opportunity to discuss it with other people. Later I will write a post on the other talks of the conference.
Usually, people in bioinformatics start by using flat files for most things. Later, they learn relational databases to handle bigger collections of data.
If you want to know in general about mysql, this post is full of very good links and slides:
See if you can follow the tutorial, even if you don’t know what an ORM is.
Sometimes a relational database is not the best option, especially if you have hierarchical or multidimensional data (e.g. a gene and a list of transcripts, etc.)
ZODB is an object-oriented database: it allows you to store python objects via the pickle module:
I am actually using zodb, and I think I will write a blog post on it later.
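To show the general idea of pickle-based storage without installing anything, here is a minimal sketch that uses `shelve` from the standard library as a stand-in. It is not ZODB and does not use the ZODB API; it only illustrates how a hierarchical record (a gene with its list of transcripts) can be saved and loaded as a python object. The function names and the gene and transcript names are invented.

```python
import os
import shelve
import tempfile


def save_gene(db_path, name, transcripts):
    """Store a gene record (a dict holding a list) under its name."""
    with shelve.open(db_path) as db:
        # shelve pickles the value behind the scenes.
        db[name] = {"name": name, "transcripts": list(transcripts)}


def load_gene(db_path, name):
    """Load a gene record back as a normal python dict."""
    with shelve.open(db_path) as db:
        return db[name]


# Use a temporary directory so the example leaves the working directory clean.
db_path = os.path.join(tempfile.mkdtemp(), "genes")
save_gene(db_path, "geneA", ["geneA-t1", "geneA-t2"])
record = load_gene(db_path, "geneA")
print(record["transcripts"])
```

No table schema is needed: the nested structure is stored as it is, which is the main attraction of this kind of database for hierarchical data.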
Another option for storing big data in a hierarchical structure (a column can be a table containing other columns) is HDF5, a binary format used in physics and astronomy, which has a very good python API module called PyTables:
I would like to wish you all a happy San Jordi, which is one of the most important national feasts here in Catalunya.
Today (well, it was the 23rd of April), boys traditionally give a rose to their girlfriends, while girls give a book to their boyfriends.
What does this have to do with bioinformatics?
Absolutely nothing, but somebody, whose name I won’t reveal, has implicitly let me know that this blog is becoming too technical and that I should add some distraction every now and then. Happy San Jordi!!
In the last post, I forgot to put some thoughts I had when preparing the slides.
It is true that it is better to have a lot of slides, at least one or two per minute. When you show the same slide for too long, like two or three minutes, you start seeing people in the audience losing the thread.
Putting code examples in a slideshow is a pain. I have been told that you must keep the text at least 20 pt in height and use a monospace font, and that an empirical rule to know whether the slide is readable is to look at it on the monitor from at least 2 meters away.
There seems to be no simple way to highlight code in an OpenOffice presentation. There is a nice extension to do it in Writer, but unfortunately it doesn’t work with Impress.
I wonder whether one of the ‘most nerdish’ ways to prepare slides would be easier. I tried LaTeX once, but honestly, I had a lot of trouble writing python code and maintaining the indentation with LaTeX, and I don’t know whether there is a simpler way to do it.
Using a different color (for example, red) to highlight a word or a phrase is a lot better than using bold.
Maybe I am not very good at presenting talks (I hope I am not very bad either), but I am better at preparing slides that can be published on the Internet and maybe used by other people as a reference. Uff
I have talked about a bit of everything, insisting on three things:
python has a very good syntax and it is easier to learn than perl
it has good tools to do testing (which are very important for bioinformatics)
there are many good libraries for bioinformatics, even if maybe perl and R are better supported (unfortunately)
Well, I prepared the talk during the Easter holidays, so I decided to do it in a quick and easy way: I had the brilliant idea of putting in a lot of happy/sad faces, even if in the end I decided not to put in too many sad faces, because I thought it would be unpopular for a talk.
If you can give me some feedback, I will really appreciate it.
PyCon.it is the yearly conference on python in Italy, usually held in the beautiful city of Firenze (Florence). This year, the keynote talk will be given by Guido van Rossum himself.
Guido Van Rossum will be at Pycon.it this year
Earlier I proposed a talk on python and bioinformatics for the pycon3, and I have just been told that it has been accepted.
So, now I have to prepare a seminar on ‘Python and Bioinformatics‘ before May. Do you have any suggestions?
This is the schedule of the workshop. All the talks will be live-translated into English.
I am really horrified by an email that I have just received from Riccardo Fallini, Molecularlab.it’s founder and former owner.
Molecularlab is an open Italian scientific web community, and that is a rather reductive description: it is the community where I learned everything I know about e-science and the importance of communication between scientists through the Internet.
I have been a member of the Molecularlab community since at least 2004, when I was still a 2nd year bachelor student. Over the years, I became one of the most active users, and eventually a member of the Staff and moderator of the forum.
It has been so nice, a great school where I have learned most of what I know about bioinformatics, and also molecular biology and other fields. Imagine being in a place where you answer two random questions about bioinformatics every day (how do I do e-pcr? which database for promoters of Macaca mulatta? how does blast work? what is an HMM? and many more): this is what I have been doing for years in molecularlab.
So today, April the first 2009, Riccardo must have gone crazy or something like that.
He sold the whole web site to a totally unknown company called ‘Loops Flair’: I have never heard of them before, but it looks like they are some kind of weird evangelists doing research on Danio rerio.
Look for example at the innovative technologies they offer: there is this DRMPEK technology to produce fish with multiple colors at a very low cost:
And another one to automatically get a vector with a DNA plasmid from the structure of an unsequenced protein.
But what is worse is the collection of gadgets they offer for free to any new user:
I can’t believe Riccardo sold the whole of Molecularlab for just a couple of fish-like USB pens.
Now the whole Italian scientific community is left without good online support, and I would like to know how Riccardo and Nicola will explain this to the poor users and students.