Tidy data and VCF

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of nice, innovative libraries have been released, like dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now my code is full of %>% instructions and dplyr functions.

If you missed these libraries, a good starting point is the article “Principles of Tidy Dataset”, in which the author, Hadley Wickham, suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will already be familiar to you if you have experience with the reshape/reshape2 packages, or if you have used ggplot2 in the past. However, it is still useful to read a good summary like the one in the article.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics, the VCF, would look in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would look more or less as below. On one hand, it would be rather redundant: many columns would be replicated multiple times, making the file more prone to typos introduced by users and taking up more disk space. On the other hand, it would be easier to read, more flexible, and able to accommodate other information, like the population of each individual or more details about the genotype quality.
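One way to picture this layout is the minimal Python sketch below. It assumes that "tidy" here means one row per variant and individual; the column names, the toy VCF content and the `tidy_vcf` helper are all hypothetical and are not the exact format proposed in the discussion.

```python
# Hypothetical toy VCF: two variants, two individuals, genotype (GT) only.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tIND1\tIND2
1\t1000\trs1\tA\tG\t50\tPASS\t.\tGT\t0|0\t0|1
1\t2000\trs2\tC\tT\t60\tPASS\t.\tGT\t1|1\t0|0
"""

# The nine fixed columns that precede the per-individual columns in a VCF.
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def tidy_vcf(text, populations=None):
    """Return one dict per (variant, individual) pair.

    `populations` is an optional, hypothetical mapping from individual name
    to population label, added as an extra column when provided.
    """
    rows = []
    header = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        record = dict(zip(header, fields))
        format_keys = record["FORMAT"].split(":")
        for individual in header[len(FIXED):]:
            # The variant columns are repeated for every individual:
            # this is the redundancy mentioned in the text.
            row = {key: record[key] for key in ("CHROM", "POS", "ID", "REF", "ALT")}
            row["INDIVIDUAL"] = individual
            row.update(zip(format_keys, record[individual].split(":")))
            if populations:
                row["POPULATION"] = populations.get(individual, "NA")
            rows.append(row)
    return rows


if __name__ == "__main__":
    tidy_rows = tidy_vcf(VCF_TEXT, populations={"IND1": "POP_A", "IND2": "POP_B"})
    print("\t".join(tidy_rows[0].keys()))
    for tidy_row in tidy_rows:
        print("\t".join(tidy_row.values()))
```

Each variant appears once per individual, so the CHROM, POS, ID, REF and ALT values are duplicated, but adding a per-individual column such as the population becomes a matter of adding one more key to each row.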

 

 

https://www.biostars.org/p/123018/

My attempt at following every possible Best Practice in Bioinformatics

I have just uploaded my first paper to arXiv. The title is “Human Genome Variation and the concept of Genotype Networks”, and it presents a first, preliminary application of the concept of Genotype Networks to human sequencing data. I know that the title may sound a bit pretentious, but we wanted to pay tribute to a great article by John Maynard Smith, which inspired the work presented.

Nevertheless, in this blog post I am not going to discuss the contents of the paper, but only how I did this work. This was a project that I did in the last year of my PhD, and I made an extra effort to follow every best-practice rule I knew.

I started my PhD in the pre-bedtools and pre-vcftools era of bioinformatics, and I saw the evolution of this field, from a sparse group of people in nodalpoint to the rise of Biostar and Seqanswers. During this time, I have read and followed a lot of discussions about “what is the best way to do bioinformatics”, from whether to use source control, to testing, and much more. For my last project as a PhD student, I wanted to apply all the practices that I had learned, to determine whether it was really worth spending time learning them.

Premise: dates and times of the project

My PhD fellowship supports a three-month stay in another laboratory in Europe. I decided to do it in Prof. Andreas Wagner’s group in Zurich.

The decision to go to Wagner’s group was motivated by a book that he had recently published, entitled “The Origins of Evolutionary Innovations”. Before the start of this project I had read some articles by Andreas Wagner and found them very interesting, so the opportunity to stay in his lab was very exciting. However, in light of what I learned during this time, I have to admit that before December 2011 I didn’t understand most of the concepts present in the book. Thus, we can say that for this project I started from zero.

I started thinking about this project in December 2011. I did the first practical implementation during the three months of my stay in Zurich, from May to August 2012. The first preliminary results came in January 2013, and the first manuscript in April 2013. We submitted it to arXiv in August 2013. During this period, I also worked on three other projects, wrote my thesis, and taught at the Programming for Evolutionary Biology workshop in Leipzig.

I started working on this project in December 2011, and finished in August 2013. The log only shows the activity of code changes.

 

Note: this blog article is very long; you may want to download it as a PDF and read it more comfortably.

Continue reading

Two short “Agile Bioinformatics” talks

I have just come back from the 2013 edition of the Programming for Evolutionary Biology course in Leipzig! The course is still going on, but unfortunately this year I could not stay for the whole three weeks, as I have things to do here in Barcelona.

This year, apart from the “Introduction to Linux” module, I also taught a short module on “Best Practices for programming in bioinformatics”. It was pure fun; I don’t think I have ever enjoyed giving a talk so much. I explained one part about Version Control and another about Scrum, and people were really excited about it. To give you an idea of how much people liked this talk, consider that three people invited me for a beer afterwards, which for me is the highest compliment a talk can get.

I have uploaded the two slideshows to slideshare. Unfortunately, the best part of the talk was a live demonstration of how I use these practices in my daily work, and at the moment I cannot make these examples publicly available. However, you should be able to follow the slideshows anyway.

 

Notes from a “Write it clearly” course

I recently took a course on improving English writing skills for researchers. These are my notes, organized as a series of “Do and Do not” lists, plus a separate list for each section of a research paper.

Feel free to have a look at them and make use of them. If you have any comments, you can add them here or to the table. Have a happy paper-writing day!

click to access the notes.

Gamestorming for bioinformatics

Most meetings in academic research groups are awful. I have attended meetings that lasted hours and hours and didn’t produce any useful output. I know researchers who try to avoid meetings as much as they can, and prefer to work by themselves, because of too many bad experiences. In the end, the problem is that scientists are not very good communicators, and most PIs are not trained to be group leaders, so meetings end up being boring and time-wasting, more harmful than useful.

Fortunately, this year, thanks to a meetup group here in Barcelona, I discovered that there are many ways to improve meetings and make them more interesting. The most interesting is the concept of “Gamestorming”, which is based on transforming group meetings into “games”. If, instead of inviting people to attend a meeting, you ask them to take part in a short game, they are more likely to participate actively and make good contributions.

Most gamestorming techniques involve blackboards and post-its, and ask people to use them to explain their own opinions. A simple example of a gamestorming meeting would be a planning meeting where the group leader splits a blackboard into three sections, one for listing different “Project Proposals” and the other two for the “Pros” and “Cons” of each project proposal, and asks the participants to fill the blackboard using post-its. If you want a good overview of techniques for brainstorming in general, I can recommend the book “Gamestorming”, by Gray, Brown and Macanufo, from O’Reilly, which I am reading these days.

In any case, I have been thinking about which planning “games” can be adopted in bioinformatics, or by researchers in general. Here is a list of what I have introduced or plan to introduce to my group:

Continue reading

Planning an 8-hour “Introduction to Linux” course with trello

Next week I am going to give an 8-hour “Introduction to Linux” course at the “Programming for Evolutionary Biology” workshop in Leipzig. In this post, I will describe how I used a nice planning tool called “trello” to make the schedule of the course.

You should know that I am a big fan of using small paper cards to organize things. I started by using CRC cards from the Extreme Programming techniques, and now the way I organize my time is similar to the Kanban technique, although I kind of evolved it independently. In simpler words, I have the habit of cutting A4 papers into 8 smaller A6 papers, the size of a post-it, and using them to take notes and to plan my projects. If you visit my office, you will find collections of “A6” papers everywhere 🙂

One day I may prepare a blog post about how I organize my projects with A6 papers. For now, just consider that trello basically allows me to do on a web page what I usually do on paper. Also, trello makes it possible to share workflows with other people on the Internet. For example, I can show you the schedule of the Linux course that I have made:

my trello board for the "Introduction to Linux" course. Click to see it!

So, I used trello to make 5 distinct lists of cards, one for each of the 5 parts that make up the course. In each of these lists, I filled in some cards to describe the most important topics that I wanted to talk about in that part of the course. I used a red label to highlight the most important message to get across in each part of the course, the “Take-Home” message.

Continue reading