Tidy data and VCF

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of nice, innovative packages have been released, such as dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now my code is full of %>% pipes and dplyr functions.

If you missed these packages, a good starting point is the article “Principles of Tidy Dataset”, in which the author, Hadley Wickham, suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will already be familiar to you if you have experience with the reshape/reshape2 packages, or if you have used ggplot2 in the past. Even so, it is worth reading a clear summary like the one in the article.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics, the VCF, would look in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would look more or less as below. On one hand, it would be rather redundant: the columns describing each variant would be repeated once per individual, making the file more susceptible to typos introduced by users and taking up more disk space. On the other hand, it would be easier to read, more flexible, and able to accommodate other information, like the population of each individual or more details about the genotype quality.
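To make the idea concrete, here is a minimal sketch of the reshaping, written in plain Python only to keep it free of dependencies. It is an illustration under stated assumptions, not a real VCF parser: the two variants, the sample names SAMPLE_A and SAMPLE_B, the population labels and the helper name vcf_to_tidy are all made up for the example, and the code assumes a well-formed file with a FORMAT column. Each output row is one observation, that is, one variant in one individual.

```python
# Hypothetical two-variant, two-sample VCF, used only for illustration.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B
1\t1000\trs1\tA\tG\t50\tPASS\t.\tGT:GQ\t0|1:48\t1|1:43
1\t2000\trs2\tC\tT\t60\tPASS\t.\tGT:GQ\t0|0:51\t0|1:40
"""

# The eight fixed VCF columns that describe a variant.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_tidy(vcf_text, populations=None):
    """Return one dict per (variant, sample) pair.

    `populations` is an optional, hypothetical mapping from sample name
    to population label; samples missing from it are labelled "NA".
    """
    populations = populations or {}
    header = None
    rows = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#"):
            # Column header line: drop the leading '#' from '#CHROM'.
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        record = dict(zip(header, fields))
        format_keys = record["FORMAT"].split(":")
        # Sample columns start after the 8 fixed columns plus FORMAT.
        for sample in header[9:]:
            # The variant columns are repeated for every sample:
            # this is the redundancy discussed in the text.
            row = {col: record[col] for col in FIXED_COLUMNS}
            row["SAMPLE"] = sample
            row["POPULATION"] = populations.get(sample, "NA")
            # Split 'GT:GQ'-style genotype fields into separate columns.
            row.update(zip(format_keys, record[sample].split(":")))
            rows.append(row)
    return rows


# Made-up population labels, showing how extra per-individual
# information fits in as just another column.
tidy = vcf_to_tidy(VCF_TEXT, {"SAMPLE_A": "CEU", "SAMPLE_B": "YRI"})

columns = list(tidy[0].keys())
print("\t".join(columns))
for row in tidy:
    print("\t".join(row[col] for col in columns))
```

The two input variants become four rows, one per individual per variant. The chromosome, position and allele columns are written twice for each variant, which is the redundancy mentioned above, while the population and the genotype quality sit in ordinary columns that are easy to filter or group on.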



