a script to fetch images from the UCSC browser

The UCSC browser is a nice, useful, but “mammoth-ish” bioinformatics tool that, despite its web 1.0 look, can be a very powerful ally for any bioinformatician or biologist.

I have to admit that for many years I avoided using the UCSC browser, dismissing it because of its very old-fashioned look. It was silly of me to think that way, but its interface is objectively old: for example, the user is forced to reload the whole page to update the visualization, and the fonts are not anti-aliased and look ugly. To me, it didn’t seem “professional” to use a pre-Ajax website for doing research.

Recently, however, I have changed my mind about this, as I discovered that this tool can be very powerful for integrating data from different sources and for doing “mash-ups”. A local UCSC browser instance can be installed on a computer and used as a central repository for all the annotations produced in a research unit: for example, sequencing data, results from experiments and from statistical genome-wide tests, etc. If all the custom annotations produced in a lab are available in a local UCSC browser instance (either as custom tracks or as tables), it is possible to compare them with each other, and also against publicly available annotations, such as the positions of genes, non-coding regions and much more. The real strength of this tool is that if you have a workflow to automate the retrieval of data from it, you are able to compare your results with virtually anything that is known about a genome.

So, let’s get to the point: I wrote a script to automatically fetch screenshots from a UCSC browser instance. It is available at this page:

The first difficulty I faced when writing this script was that there are a lot of possible options for defining a region and how to visualize it. So, I made the script require three different configuration files: one for the regions to be visualized, one for the tracks to be shown, and one for the connection parameters. Here is how you would call it:

python fetch_ucsc.py --region <regions file> --tracks <tracks file> --config <connection configuration file>

Have a look at this PDF, which was created with this script. If you continue reading the post, I will also describe the different configuration files.

Example of a report created by this script.

The Regions file is a CSV file containing all the regions to be shown, one per line. For each region defined in this file, the script will connect to the UCSC browser and fetch a screenshot of that region as a PDF file. If rst2pdf is installed on your computer, the script will also generate a multi-page report.

Example of Regions file:

#label, organism, assembly, chromosome, start, end, description, upstream, downstream
IL10, human, hg18, chr1, 205007571, 205012462, "involved in immunity", 10000, 1000
PRNP, human, hg18, chr20, 4615157, 4630234, "Prion Protein", 10000, 10000
HSPB4, human, hg18, chr21, 43462210, 43465982, "Heat-shock protein", 10000, 10000

I know that CSV is not optimal for a configuration file, but in this case it seemed a good compromise. The script will fail if you miss a comma or if you change the order of the columns; but if you are careful, everything should go fine.
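To make the format more concrete, here is a minimal sketch of how a Regions file like the one above could be parsed with Python's standard csv module. It is not the code used in the script: the names `parse_regions` and `Region` are hypothetical, and I am assuming that `upstream` is subtracted from `start` and `downstream` is added to `end`, regardless of the strand.

```python
import csv
import io
from typing import NamedTuple

# Column order taken from the header line of the example Regions file.
FIELDS = ["label", "organism", "assembly", "chromosome", "start", "end",
          "description", "upstream", "downstream"]


class Region(NamedTuple):  # hypothetical container, not part of the script
    label: str
    assembly: str
    position: str  # "chromosome:start-end", including the flanking bases
    description: str


def parse_regions(handle):
    """Yield one Region per data line, skipping blank and '#' lines."""
    rows = (line for line in handle
            if line.strip() and not line.lstrip().startswith("#"))
    # skipinitialspace lets the quoted description follow ", " correctly.
    for row in csv.reader(rows, skipinitialspace=True):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(row)}: {row}")
        rec = dict(zip(FIELDS, (value.strip() for value in row)))
        # Assumption: upstream extends the start, downstream extends the end.
        start = max(1, int(rec["start"]) - int(rec["upstream"]))
        end = int(rec["end"]) + int(rec["downstream"])
        yield Region(rec["label"], rec["assembly"],
                     f"{rec['chromosome']}:{start}-{end}", rec["description"])


SAMPLE = '''#label, organism, assembly, chromosome, start, end, description, upstream, downstream
IL10, human, hg18, chr1, 205007571, 205012462, "involved in immunity", 10000, 1000
'''
regions = list(parse_regions(io.StringIO(SAMPLE)))
```

Checking the number of columns up front turns a missing comma into an explicit error message instead of a confusing failure later on.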

The Tracks file contains the list of tracks to be shown. Every time the script is called, all the regions will be fetched with the same tracks visualized. If you need to plot the same regions with different tracks visualized, just call the script more than once.

Example of Tracks file:

[visual_options]

[custom_tracks]
track1 = http://pastebin.com/raw.php?i=CKCuYGmX

[tracks]
wgRna=hide
wgEncodeReg=hide
cpgIslandExt=hide
ensGene=hide
mrna=hide
intronEst=hide
mgcGenes=hide
cons44way=hide
snp130=hide
snpArray=hide
refGene=hide
wgEncodeRegMarkPromoter=full
knownGene=full
rmsk=hide
phyloP46wayPlacental=hide

I am planning to change the format of this file to a CSV-like one, which would allow other options to be defined for each track, such as its color and height. However, implementing these parameters would require a lot of changes in how the script connects to the browser, so I am leaving this for later.
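As an illustration of how the current Tracks file could be turned into a request, here is a rough sketch that reads the [tracks] section with configparser and appends each `name=visibility` pair to the browser URL, together with a `position` parameter. The function name `build_url` is hypothetical, the sketch ignores the [custom_tracks] and [visual_options] sections, and it is only my assumption of one way to do it, not the way the script does it.

```python
import configparser
from urllib.parse import urlencode

# A shortened version of the example Tracks file.
TRACKS_INI = """
[visual_options]

[tracks]
wgRna=hide
refGene=hide
knownGene=full
"""


def build_url(base_url, position, tracks_text):
    """Return base_url with the position and track visibilities appended."""
    parser = configparser.ConfigParser(interpolation=None)
    # By default configparser lowercases option names; track names such as
    # "wgRna" are case-sensitive, so keep them exactly as written.
    parser.optionxform = str
    parser.read_string(tracks_text)

    params = [("position", position)]
    if parser.has_section("tracks"):
        params.extend(parser.items("tracks"))

    # The base URL may already carry a query string (e.g. "?db=hg18").
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


url = build_url("http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18",
                "chr1:204997571-205013462", TRACKS_INI)
```

The detail worth noticing is `optionxform`: without it, `wgRna` would silently become `wgrna`.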

The last configuration file is the Connection Configuration file. Here you can define your email address (mandatory), the URL of a custom UCSC browser installation, an HTTP proxy address and password, and so on.

Example of Connection Configuration file:

[browser]
ucsc_base_url = http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18
username =
password =
user-agent = Mechanize client to get screenshots from the UCSC browser. Home page: https://bitbucket.org/dalloliogm/ucsc-fetch
email =
httpproxy =
httproxy_port =
httproxy_password =
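Since the email address is the only mandatory field, a file like this could be checked before any connection is attempted. The following is a minimal sketch using configparser; `load_connection` is a hypothetical name, and the behaviour (raising an error on a missing email, dropping empty options) is my assumption rather than a description of what the script does.

```python
import configparser

# A shortened version of the example file, with a placeholder email filled in.
CONNECTION_INI = """
[browser]
ucsc_base_url = http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18
username =
password =
email = someone@example.org
httpproxy =
"""


def load_connection(text):
    """Return the non-empty [browser] options, requiring an email address."""
    # interpolation=None so that a '%' in a URL or password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("browser"):
        raise ValueError("missing [browser] section")

    browser = parser["browser"]
    if not browser.get("email", "").strip():
        raise ValueError("the 'email' option is mandatory")

    # Options left empty (e.g. "username =") are dropped from the result.
    return {key: value.strip()
            for key, value in browser.items() if value.strip()}


settings = load_connection(CONNECTION_INI)
```

Options that are left blank simply do not appear in the returned dictionary, so the caller can test for them with `in`.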

Finally, let me explain the requirements. The script requires the Python mechanize library to connect to the browser and, optionally, the rst2pdf tool to produce the multi-page reports.


One Response to a script to fetch images from the UCSC browser

  1. brentp says:

    Thanks, This is very cool.
    It would be a nice option if the regions file could be a BED file with the 4th column being the label and description and the up/downstream set from a kwarg on the command-line.
