Tidy data and VCF

I think that 2014 has been the year in which my R programming style has changed the most. This is because a lot of innovative and useful libraries have been released, like dplyr, magrittr and tidyr. I started in January as a ddply enthusiast, and now my code is full of %>% pipes and dplyr functions.

If you missed these libraries, a good starting point is the article “Principles of Tidy Dataset”, in which the author, Hadley Wickham, suggests some best practices for organising a dataset in a “tidy” format before doing any analysis. These practices will already be familiar to you if you have experience with the reshape/reshape2 packages or have used ggplot2 in the past. Even so, the article is worth reading as a concise summary.

Inspired by this article, I wrote a post on Biostar to discuss how a popular format in bioinformatics – the VCF – would look in a tidy format. Here is the link to the discussion.

The VCF in a tidy format would look more or less like the output below. On the one hand, it would be rather redundant: many columns would be replicated multiple times, making the file more prone to typos introduced by users and taking up more disk space. On the other hand, it would be easier to read, more flexible, and able to accommodate other information, like the population of each individual or more details about genotype quality.

library(dplyr)
library(tidyr)

vcf %>%
    # one row per variant and individual
    gather(individual, value, -c(X.CHROM:FORMAT)) %>%
    # split the per-sample string into the FORMAT fields
    separate(value, into=strsplit('GT:GQ:DP:HQ', ':')[[1]], sep=':', extra='drop') %>%
    # split the genotype into its two alleles (phased '|' or unphased '/')
    separate(GT, into=c('allele1', 'allele2'), sep='[|/]') %>%
    # one row per allele
    gather(allele, genotype, -c(X.CHROM:individual, GQ:HQ)) %>%
    arrange(X.CHROM, POS, ID, individual) %>%
    select(-INFO, -FORMAT, -FILTER) %>%  # omitted for better visualization
    subset(ID!='microsat1')              # omitted for better visualization

 X.CHROM  POS          ID     REF ALT QUAL individual GQ DP    HQ  allele genotype
 20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele1        0
 20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele2        0
 20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele1        1
 20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele2        0
 20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele1        1
 20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele2        1
 20       17330           .   T   A    3   NA00001    49  3 58,50 allele1        0
 20       17330           .   T   A    3   NA00001    49  3 58,50 allele2        0
 20       17330           .   T   A    3   NA00002     3  5  65,3 allele1        0
 20       17330           .   T   A    3   NA00002     3  5  65,3 allele2        1
 20       17330           .   T   A    3   NA00003    41  3  <NA> allele1        0
 20       17330           .   T   A    3   NA00003    41  3  <NA> allele2        0
 20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele1        1
 20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele2        2
 20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele1        2
 20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele2        1
 20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele1        2
 20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele2        2
 20       1230237         .   T   .   47   NA00001    54  7 56,60 allele1        0
 20       1230237         .   T   .   47   NA00001    54  7 56,60 allele2        0
 20       1230237         .   T   .   47   NA00002    48  4 51,51 allele1        0
 20       1230237         .   T   .   47   NA00002    48  4 51,51 allele2        0
 20       1230237         .   T   .   47   NA00003    61  2  <NA> allele1        0
 20       1230237         .   T   .   47   NA00003    61  2  <NA> allele2        0
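
For readers who prefer Python, the same reshaping can be sketched with the standard library alone. This is only an illustration under some assumptions: the function name `tidy_vcf` is made up, the sample columns are assumed to start right after FORMAT, every record is assumed to have a GT field, and the meta line in the example input is invented. The single record is the first variant of the table above, with FILTER and INFO set to `.` because the table omits them.

```python
import csv
import re

def tidy_vcf(text):
    """Return one dict per variant, individual and allele (hypothetical helper)."""
    rows = []
    # Drop the '##' meta lines; keep the '#CHROM' header and the records.
    lines = [l for l in text.splitlines() if l and not l.startswith('##')]
    reader = csv.reader(lines, delimiter='\t')
    header = next(reader)
    header[0] = header[0].lstrip('#')
    # Assumption: the first 9 columns are fixed, the rest are samples.
    fixed, samples = header[:9], header[9:]
    for record in reader:
        variant = dict(zip(fixed, record[:9]))
        keys = variant['FORMAT'].split(':')
        for sample, value in zip(samples, record[9:]):
            fields = dict(zip(keys, value.split(':')))
            # Phased ('|') and unphased ('/') genotypes are split the same way.
            alleles = re.split(r'[|/]', fields.pop('GT'))
            for i, genotype in enumerate(alleles, start=1):
                row = {k: variant[k]
                       for k in ('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL')}
                row.update(individual=sample,
                           allele='allele%d' % i,
                           genotype=genotype)
                row.update(fields)  # GQ, DP, HQ
                rows.append(row)
    return rows

# Toy input: an invented meta line plus the first variant of the table above.
example = (
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tNA00001\tNA00002\tNA00003\n"
    "20\t14370\trs6054257\tG\tA\t29\t.\t.\tGT:GQ:DP:HQ"
    "\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.\n"
)

for row in tidy_vcf(example):
    print(row['POS'], row['individual'], row['allele'], row['genotype'], row['GQ'])
```

The result is a plain list of dictionaries, one per allele, so it can be filtered or grouped with ordinary Python code or handed to whichever data-frame library you prefer.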

 

 

https://www.biostars.org/p/123018/

Posted in methodology

farewell to Barcelona

I am posting this a bit late (since I already moved 6 months ago); anyway, the news is that I left my lab in Barcelona, and moved to London!


The institute where I did my PhD: the PRBB in Barcelona. On the other side of the building there is the beach.

I am satisfied with my time in Barcelona, where I did my master’s thesis and my PhD on network theory applied to human population genetics. However, it was time to move on and try new experiences.

a picture taken from the 4th floor of the PRBB

Apart from the change of city, I also changed my field of work, as I am now working on cancer genetics. My new group is a young group that recently moved from Italy to London. It is known for its research on the systems-level properties of cancer genes and for a database called Network of Cancer Genes, and it is involved in a consortium for the sequencing of hepatocellular carcinoma. I will keep you informed of my progress!

Posted in news

my first PyPI package: vcf2networks

My first Python package is on PyPI!! I guess that now I can officially call myself a Python programmer.

VCF2Networks is a Python script to calculate genotype networks from population genetics data. Genotype networks are a method used in systems biology to study the “innovability” of a given phenotype, by representing all the genotypes associated with the phenotype as a graph and studying some properties of this graph, such as the average path length and the average degree. For more info, you can look at the slides of the “Origins of Evolutionary Innovations” book club in this blog. The script in VCF2Networks takes any dataset of genotypes stored in the VCF format and calculates many of these properties.
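
To make the idea concrete, here is a minimal sketch in plain Python of how a genotype network and the two properties mentioned above could be computed. It is not the code of VCF2Networks: the function names and the toy genotypes are invented for illustration, and it assumes that two genotypes are connected when they differ at exactly one position.

```python
from collections import deque
from itertools import combinations

def build_network(genotypes):
    """Connect genotypes that differ at exactly one position (assumed rule)."""
    nodes = sorted(set(genotypes))  # each distinct genotype is one node
    graph = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if sum(x != y for x, y in zip(a, b)) == 1:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def average_degree(graph):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def average_path_length(graph):
    """Mean shortest-path length over all connected pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1  # nodes reachable from source, excluding itself
    return total / pairs if pairs else 0.0

# Toy data: genotypes encoded as strings of 0/1 alleles (made up).
toy = ['000', '001', '011', '111', '000']
network = build_network(toy)
print(average_degree(network), average_path_length(network))
```

In this toy example the four distinct genotypes form a simple chain, so the average degree is 1.5 and the average path length is 10/6. For real datasets a graph library such as python-igraph would be the more sensible choice.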

I am planning to submit an application note about the script to a bioinformatics-oriented journal. So, if you have a little time to lend me and want to test it, any feedback will be very useful. At the moment, the major issue is simplifying the installation, because this package depends on numpy and python-igraph, and these two modules require some terrible C libraries that must be installed separately. If you know of any way to distribute a binary package of a Python module that depends on C libraries, your suggestions will be very welcome.

Posted in news

The presentation of my PhD defence

That’s it! Last week I defended my PhD thesis!! I have gone through it, and survived to tell the tale!

I don’t feel very different from before, apart from being relieved :-). Now the future is possibly more difficult than before, because I have to look for a job position and finish a lot of things.

While I was preparing the slideshow, I realized that there are not many examples of PhD defence presentations online. This is bad, because you need all the help you can get to prepare this presentation. The PhD defence is the last thing that you do as a PhD student, so you want to do it perfectly. It is also the moment when you describe many years of your work to your colleagues and family.

Here is the presentation that I have prepared for my defence. I hope that it will be useful to other people as an example for their defences.

I think that, for this type of presentation, the first slide to make is the “summary of the talk” slide, like the “Topics” slide I have. Usually I don’t like to have such summary slides in my presentations, but for a thesis defence it is very important, because it gives you a feeling of security when you present. Having a well-defined structure lets you know when you can stop to drink some water or check that everybody is following, and exactly what to say on each slide of the talk.

Posted in news, papers, slideshows, talks

my poster featured in the “Better Posters” blog!

My ECCB2012 poster has been featured in the Better Posters blog. Check the article here: http://betterposters.blogspot.com.es/2013/10/invitating-interaction.html

I am glad because Better Posters is one of my favorite blogs. It is a blog about designing and improving posters for scientific conferences, and it contains many tips and examples of how scientific posters can be improved.

The poster featured there is the “Post-its” poster, which I briefly described in the “best practices” article.

Here are some other comments and tips from my experience of using post-its to get feedback during a conference:


Posted in news

Two books on Human Evolution and on the concept of Race, that should be read together

I have decided that, from time to time, I will post some book recommendations here on this blog. This is the first in this series, dedicated to a pair of books on the evolution of the human genome and the concept of races / human populations.

[Book covers: “Fatal Invention” and “The 10,000 Year Explosion”]


Posted in book club, news

my attempt at following every possible Best Practice in Bioinformatics

I have just uploaded my first paper to arXiv. The title is “Human Genome Variation and the concept of Genotype Networks”, and it presents a first, preliminary application of the concept of Genotype Networks to human sequencing data. I know that the title may sound a bit pretentious, but we wanted to pay tribute to a great article by John Maynard Smith, which inspired the work presented.

However, in this blog post I am not going to discuss the contents of the paper, only how I did this work. This was a project that I did in the last year of my PhD, and I made an extra effort to follow every best practice I knew.

I started my PhD in the pre-bedtools and pre-vcftools era of bioinformatics, and I saw the evolution of this field, from a sparse group of people in nodalpoint to the rise of Biostar and Seqanswers. During this time, I have read and followed a lot of discussions about “what is the best way to do bioinformatics”, from whether to use source control, to testing, and much more. For my last project as a PhD student, I wanted to apply all the practices that I had learned, to determine whether it was really worth spending time learning them.

Premise: dates and times of the project

My PhD fellowship supports a three-month stay in another laboratory in Europe. I decided to do it in Prof. Andreas Wagner’s group in Zurich.

The decision to go to Wagner’s group was motivated by a book that he had recently published, entitled “The Origins of Evolutionary Innovations”. Before the start of this project I had read some articles by Andreas Wagner and found them very interesting, so the opportunity to stay in his lab was very exciting. However, in light of what I learned during this time, I have to admit that before December 2011 I didn’t understand most of the concepts presented in the book. Thus, we can say that for this project, I started from zero.

I started thinking of this project in December 2011. I did the first practical implementation in the three months of the stay in Zurich, from May to August 2012. The first preliminary results came in January 2013, and the first manuscript in April 2013. We submitted to arXiv in August 2013. During this period, I also worked on three other projects, wrote my thesis, and taught at the Programming for Evolutionary Biology workshop in Leipzig.

I started working on this project in December 2011, and finished in August 2013. This figure only shows the activity of code changes.

 

Note: this blog article is very long; you may want to download it as a PDF and read it more comfortably.


Posted in methodology, news, papers