1000genomes data as torrent?

The day after the 1000genomes paper was published in Nature, I attended a talk by one of the authors, Paul Flicek from the EBI, who explained the technical challenges introduced by the 1000genomes data.

He pointed out that, for the first time in history, datasets in biology have reached the size of the big datasets in physics and astronomy. The whole of GenBank, even with its exponential growth in recent years, is small compared with the output of a particle accelerator or a big telescope. However, the situation has now changed with the release of 1000genomes, and it will change further with the results of other similar studies (UK10K, 10,000 genomes).

Physicists have done a much better job than we bioinformaticians at planning how to deal with huge datasets. For example, while we are still using HTTP or FTP to download datasets, competing for bandwidth with people watching videos on YouTube, physicists have developed an alternative network to the Internet to share data. As another example, while we are still debating whether to use databases or flat files, physicists have developed formats like HDF5 to handle huge collections of data.

Regarding the former issue, I am trying to convince the 1000genomes maintainers to release their data as a torrent. Yesterday we tried to download a 16 GB dataset from their website, but because of connection problems we could never finish the download. Suppose that in the future we have to download a 16 GB file every day with the results of newly sequenced genomes: is it feasible to do that over the Internet?
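To put a rough number on that question, here is a back-of-the-envelope sketch. The function names and the 10 Mbit/s link are hypothetical, chosen only for illustration; the 16 GB figure is the one from our failed download, and sizes are treated as decimal gigabytes.

```python
def required_mbit_per_s(size_gb, window_hours):
    """Sustained rate (megabits per second) needed to move size_gb within window_hours."""
    bits = size_gb * 1e9 * 8  # decimal gigabytes converted to bits
    return bits / (window_hours * 3600) / 1e6


def transfer_hours(size_gb, link_mbit_per_s):
    """Hours needed to move size_gb over a link that sustains link_mbit_per_s."""
    bits = size_gb * 1e9 * 8
    return bits / (link_mbit_per_s * 1e6) / 3600


if __name__ == "__main__":
    # 16 GB spread over a full day
    print(f"needed rate: {required_mbit_per_s(16, 24):.2f} Mbit/s")
    # 16 GB over a hypothetical 10 Mbit/s link
    print(f"time at 10 Mbit/s: {transfer_hours(16, 10):.2f} hours")
```

Under these assumptions a daily 16 GB file needs only about 1.5 Mbit/s of sustained throughput, so the raw rate is probably not the obstacle. The difficulty is keeping a single connection alive and error-free for hours, which is where a protocol that can resume and verify transfers helps.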

A nice solution, in my opinion, would be to use the BitTorrent protocol, as adopted by the BioTorrents project and nicely described in this paper: Langille MGI, Eisen JA (2010).

Using torrents to share big datasets like the 1000genomes sequences would have a lot of benefits. For example:

  • since each torrent carries a checksum for every piece of the data (SHA-1 hashes in the BitTorrent protocol, rather than a single md5 sum), everybody will always download the correct data, without transfer errors. If you download data from a website, a transfer error can go unnoticed; a torrent client, however, checks every piece against its hash before declaring the download complete.
  • this will save the 1000genomes site a lot of bandwidth, reducing their costs and allowing them to make better use of their resources.
  • a torrent is more likely to remain available, even if the 1000genomes authors decide not to support it anymore. It will be easier to track down old datasets, which makes research more reproducible.
  • a torrent is easier to download, because even if you have a bad internet connection, you can pause and resume the transfer at any time.
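The integrity and resume points above both rest on the same idea: the data is split into fixed-size pieces, and each piece is checked against a published hash. The sketch below illustrates that idea with Python's standard `hashlib`. It is not a BitTorrent client; the 256 KiB piece size and the function names `piece_hashes` and `bad_pieces` are assumptions made for the example.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # hypothetical piece length; real torrents choose their own


def piece_hashes(data, piece_size=PIECE_SIZE):
    """Return the SHA-1 digest of each fixed-size piece of data."""
    return [
        hashlib.sha1(data[i:i + piece_size]).digest()
        for i in range(0, len(data), piece_size)
    ]


def bad_pieces(data, expected, piece_size=PIECE_SIZE):
    """Return indices of pieces that are corrupted or not yet downloaded."""
    actual = piece_hashes(data, piece_size)
    mismatched = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    missing = list(range(len(actual), len(expected)))
    return mismatched + missing


if __name__ == "__main__":
    original = bytes(range(256)) * 4096      # 1 MiB of sample data, 4 pieces
    expected = piece_hashes(original)        # what the publisher would distribute

    damaged = bytearray(original)
    damaged[300_000] ^= 0xFF                 # flip one byte inside piece 1
    print("pieces to re-fetch after corruption:", bad_pieces(bytes(damaged), expected))

    partial = original[:600_000]             # transfer interrupted partway through
    print("pieces to fetch after interruption:", bad_pieces(partial, expected))
```

Only the pieces listed need to be fetched again, which is why an interrupted or damaged transfer does not have to restart from zero.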

Let’s see if they will agree on this!

5 Comments

  1. Thank you for the comment! I have been looking at HDF5 for a while (including the Python libraries h5py and PyTables) but have never implemented anything serious yet. I will let you know.

  2. The problem with distributing these data sets as torrents is this: how many people have both the disk and the bandwidth to act as peers for the data sets? I suspect the answer is very few, especially given that new data is added to the FTP site weekly.

    The consortium does offer a free alternative way to download these data sets, using UDP rather than FTP, with the Aspera client ascp.

    Anyone can get the data much more quickly using ascp and it doesn’t cost the client a penny.

  3. Thanks Laura. Disk space should not be a problem: if I want to download a dataset, I must also have the space to store it. Bandwidth may be a bigger issue, but in any case most torrent clients can limit both download and upload rates. Anyway, I am not familiar with how transfers through Aspera work, and I will look into it.
