The day after the 1000genomes paper was published in Nature, I attended a talk by one of the authors, Paul Flicek from the EBI, who explained the technical challenges introduced by the 1000genomes data.
He pointed out that, for the first time in history, datasets in biology have reached the size of the big datasets in physics and astronomy. The whole of GenBank, even with its exponential growth in recent years, is small compared with the output of a particle accelerator or a big telescope. However, the situation has now changed with the release of 1000genomes, and will change further with the results of other similar studies (UK10K, 10,000 genomes).
Physicists have done a much better job than we bioinformaticians in planning how to deal with huge datasets. For example, while we are still using HTTP or FTP to download datasets, competing for bandwidth with people watching videos on YouTube, physicists have built a dedicated network, separate from the public Internet, to share data. Similarly, while we are still debating whether to use databases or flat files, physicists have adopted formats like HDF5 to handle huge collections of data.
Regarding the first issue, I am trying to convince the 1000genomes maintainers to release their data as a torrent. Yesterday we tried to download a 16 GB dataset from their website, but because of connection problems we could never finish the download. Suppose that in the future we have to download a 16 GB file every day with the results of newly sequenced genomes: is it feasible to do that over the Internet?
A nice solution, in my opinion, would be to use the BitTorrent protocol, as adopted by the BioTorrents project and nicely described in this paper: Langille MGI, Eisen JA (2010).
Using torrents to share big datasets like the 1000genomes sequences would have a lot of benefits. For example:
- since each torrent file stores a SHA-1 hash for every piece of the data, everybody will end up with the correct data. If you download data from a website, a transfer error can go unnoticed; a torrent client, however, verifies every piece against its hash, and downloads it again on a mismatch, before declaring the download complete.
- this will save the 1000genomes site a lot of bandwidth, reducing their costs and allowing them to make better use of their resources.
- a torrent is more likely to remain available, even if the 1000genomes authors decide not to support it anymore, as long as other users keep sharing it. It will be easier to track down old datasets, which makes research more reproducible.
- a torrent is easier to download, because even if you have a bad internet connection, you can pause and resume the transfer at any time.
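To make the integrity point above more concrete, here is a minimal Python sketch of piece-wise hash checking, the general idea a torrent client relies on. It is not the code of any real client: the function names and the 256 KiB piece size are my own assumptions for illustration, and a real client reads the expected hashes from the .torrent file instead of computing them from a local copy.

```python
import hashlib

# Hypothetical piece size for this sketch; real torrents choose their own.
PIECE_LENGTH = 256 * 1024


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def corrupted_pieces(data, expected, piece_length=PIECE_LENGTH):
    """Return the indices of pieces that are missing or do not match."""
    actual = piece_hashes(data, piece_length)
    return [
        i for i, digest in enumerate(expected)
        if i >= len(actual) or actual[i] != digest
    ]


# Stand-in for a dataset: 1 MiB of bytes, i.e. four pieces of 256 KiB.
original = bytes(range(256)) * 4096
expected = piece_hashes(original)

# Simulate a transfer error by flipping one byte inside the second piece.
damaged = bytearray(original)
damaged[300_000] ^= 0xFF

print(corrupted_pieces(bytes(damaged), expected))  # prints [1]
```

Only the damaged piece is reported, so only that piece has to be fetched again, not the whole 16 GB file. The same bookkeeping is what makes pausing and resuming cheap: pieces that already match their hash never need to be downloaded twice.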
Let’s see if they will agree on this!