The UCSC browser is a nice, useful but “mammoth-ish” bioinformatics tool that, despite its web 1.0 look, can be a very powerful ally for any bioinformatician or biologist.
I have to admit that for many years I avoided using the UCSC browser, dismissing it because of its very old-fashioned look. It was silly of me to think that way, but its interface is objectively old: for example, the user is forced to reload the whole page to update the visualization, and the fonts are not anti-aliased and look ugly. To me, it didn’t seem “professional” to use a pre-Ajax website for doing research.
Recently, however, I have changed my mind about this, as I discovered that this tool can be very powerful for integrating data from different sources and for doing “mash-ups”. A local UCSC browser instance can be installed on a computer and used as a central repository for all the annotations produced in a research unit: for example, sequencing data, results from experiments and from genome-wide statistical tests, etc. If all the custom annotations produced in a lab are available in a local UCSC browser instance (either as custom tracks or as tables), it is possible to compare them with each other, and also against publicly available annotations, such as the positions of genes, non-coding regions and much more. The real strength of this tool is that if you have a workflow to automate the retrieval of data from it, you are able to compare your results with virtually anything that is known about a genome.
So, let’s get to the point: I wrote a script to automatically fetch screenshots from a UCSC browser instance. It is available at this page:
The first difficulty I faced when writing this script was that there are a lot of possible options for defining a region and how to visualize it. So, I made the script require three different configuration files: one for the regions to be visualized, one for the tracks to be shown, and one for the connection parameters. Here is how you would call it:
python fetch_ucsc.py --region <regions file> --tracks <tracks file> --config <connection configuration file>
Have a look at this PDF that was created with this script. If you continue reading the post, I will also describe the different configuration files.
The Regions file is a CSV file containing all the regions to be shown, one per line. For each region defined in this file, the script will connect to the UCSC browser and fetch a screenshot of that region as a PDF file. If rst2pdf is installed on your computer, the script will also generate a multi-page report.
Example of Regions file:
#label, organism, assembly, chromosome, start, end, description, upstream, downstream
IL10, human, hg18, chr1, 205007571, 205012462, "involved in immunity", 10000, 1000
PRNP, human, hg18, chr20, 4615157, 4630234, "Prion Protein", 10000, 10000
HSPB4, human, hg18, chr21, 43462210, 43465982, "Heat-shock protein", 10000, 10000
I know that CSV is not optimal for a configuration file, but in this case it seemed a good compromise. The script will fail if you miss a comma or if you change the order of the columns; but if you are careful, everything should go fine.
The Tracks file contains the list of tracks to be shown. Every time the script is called, all the regions will be fetched with the same tracks visualized. If you need to plot the same regions with different tracks visualized, just call the script more than once, with a different Tracks file each time.
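Since the format is this strict, it can help to check a Regions file before launching a long run. The following is only a minimal sketch, not the code used by the script itself: the function name `parse_regions` and the error messages are made up for illustration, and the column order is simply the one given in the header comment of the example above.

```python
import csv
import io

# Column order copied from the header comment of the example Regions file.
FIELDS = ["label", "organism", "assembly", "chromosome", "start", "end",
          "description", "upstream", "downstream"]
INT_FIELDS = ("start", "end", "upstream", "downstream")


def parse_regions(handle):
    """Yield one dict per region; raise ValueError on a malformed line.

    Hypothetical helper, written only to illustrate the file format.
    """
    # Skip blank lines and '#' comment lines before passing them to csv.
    lines = (line for line in handle
             if line.strip() and not line.lstrip().startswith("#"))
    # skipinitialspace=True lets a quoted description follow ", " correctly.
    reader = csv.reader(lines, skipinitialspace=True)
    for number, row in enumerate(reader, start=1):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"region {number}: expected {len(FIELDS)} columns, got {len(row)}")
        region = dict(zip(FIELDS, (value.strip() for value in row)))
        for name in INT_FIELDS:
            # int() raises ValueError if a coordinate is not a number.
            region[name] = int(region[name])
        if region["start"] > region["end"]:
            raise ValueError(f"region {number}: start is greater than end")
        yield region


if __name__ == "__main__":
    sample = io.StringIO(
        "#label, organism, assembly, chromosome, start, end, description, upstream, downstream\n"
        'IL10, human, hg18, chr1, 205007571, 205012462, "involved in immunity", 10000, 1000\n'
    )
    for region in parse_regions(sample):
        print(region["label"], region["chromosome"], region["start"], region["end"])
```

A missing comma shows up here as a wrong column count, and a swapped column usually shows up as a non-numeric coordinate, so both mistakes mentioned above are caught early.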
Example of Tracks file:
[visual_options]
[custom_tracks]
track1 = http://pastebin.com/raw.php?i=CKCuYGmX
[tracks]
wgRna=hide
wgEncodeReg=hide
cpgIslandExt=hide
ensGene=hide
mrna=hide
intronEst=hide
mgcGenes=hide
cons44way=hide
snp130=hide
snpArray=hide
refGene=hide
wgEncodeRegMarkPromoter=full
knownGene=full
rmsk=hide
phyloP46wayPlacental=hide
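To give an idea of how such a file could translate into a request, here is a rough sketch that reads the `[tracks]` and `[custom_tracks]` sections with Python's `configparser` and appends them to an hgTracks URL as query parameters. It relies on assumptions and is not necessarily how the script talks to the browser: the function name `build_url` is invented, and the sketch assumes that the browser accepts `position`, one `trackName=visibility` pair per track, and `hgt.customText` for a custom track URL.

```python
import configparser
from urllib.parse import urlencode


def build_url(base_url, tracks_text, chromosome, start, end):
    """Return an hgTracks URL for one region (hypothetical helper)."""
    # interpolation=None: '%' characters in URLs are left untouched.
    # delimiters=("=",): ':' inside a value is never taken as a separator.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    # By default configparser lowercases keys; track names such as
    # 'wgRna' are case-sensitive, so keep them exactly as written.
    parser.optionxform = str
    parser.read_string(tracks_text)

    params = [("position", f"{chromosome}:{start}-{end}")]
    if parser.has_section("tracks"):
        # One 'trackName=visibility' pair per track (assumed URL form).
        params.extend(parser.items("tracks"))
    if parser.has_section("custom_tracks"):
        # Assumption: each custom track URL is passed as hgt.customText.
        for _name, url in parser.items("custom_tracks"):
            params.append(("hgt.customText", url))

    # The base URL of the example already carries '?db=hg18'.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


if __name__ == "__main__":
    example = (
        "[visual_options]\n"
        "[custom_tracks]\n"
        "track1 = http://pastebin.com/raw.php?i=CKCuYGmX\n"
        "[tracks]\n"
        "wgRna=hide\n"
        "knownGene=full\n"
    )
    print(build_url("http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18",
                    example, "chr1", 205007571, 205012462))
```

Keeping the parameters as a list of pairs, rather than a dictionary, preserves the order of the tracks in the file and allows more than one custom track entry.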
I am planning to change the format of this file to a CSV-like one, which would allow other options to be defined for each track, such as its color and height. However, implementing these parameters would require a lot of changes in how the script connects to the browser, so I am leaving this for later.
The last configuration file is the Connection Configuration file. Here you can define your email address (mandatory), URL to a custom UCSC browser installation, HTTP proxy address and password, and so on.
Example of Connection Configuration file:
[browser]
ucsc_base_url = http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18
username =
password =
user-agent = Mechanize client to get screenshots from the UCSC browser. Home page: https://bitbucket.org/dalloliogm/ucsc-fetch
email =
httpproxy =
httproxy_port =
httproxy_password =
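Because the email address is the only mandatory option, a quick check of this file before starting can save a failed run. The snippet below is a small sketch under that assumption, using Python's `configparser`; the function name `load_connection` and the error messages are hypothetical and do not come from the script.

```python
import configparser


def load_connection(text):
    """Return the [browser] options as a dict (hypothetical helper).

    Raises ValueError if the section or the mandatory email is missing.
    """
    # interpolation=None: '%' in URLs or passwords is left untouched.
    # delimiters=("=",): the ':' in 'Home page: https://...' stays in the value.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    if not parser.has_section("browser"):
        raise ValueError("missing [browser] section")

    # Options written as 'key =' are read as empty strings.
    settings = {key: value.strip() for key, value in parser.items("browser")}
    if not settings.get("email"):
        raise ValueError("the 'email' option is mandatory")
    return settings


if __name__ == "__main__":
    example = (
        "[browser]\n"
        "ucsc_base_url = http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18\n"
        "username =\n"
        "password =\n"
        "email = someone@example.org\n"  # placeholder address
        "httpproxy =\n"
    )
    settings = load_connection(example)
    print(settings["ucsc_base_url"])
```

To read the options from a file instead of a string, `parser.read_string(text)` can be replaced by `parser.read(path)`.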