a parser for KEGG pathways in python

I am writing a parser for KGML, the XML dialect that KEGG uses to store its pathways.

update: A KEGG pathway file, parsed with python and plotted with networkx

I couldn’t find an existing Python module that does this. There is a similar library for R (KEGGgraph) and perhaps something for Perl (I didn’t find it), but apparently nothing for Python.

Here is the code, hosted on GitHub:

It is really a simple and incomplete script, but now that I understand which libraries to use and how to read the format, I will be able to improve it as soon as I have free time.

It requires networkx, a widely used Python library for working with graphs and networks; ElementTree (already included in the standard library) to parse the XML; and pylab/matplotlib for plotting.
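
To give a rough idea of how these pieces fit together, here is a minimal sketch. It is not the script hosted on GitHub: the function names parse_kgml and to_networkx and the tiny SAMPLE_KGML document are made up for illustration, and the sample only imitates the entry/graphics/relation layout of a KGML file rather than reproducing a real KEGG pathway.

```python
import xml.etree.ElementTree as ET

# Hand-written toy document that imitates the KGML layout (NOT real KEGG data).
SAMPLE_KGML = """<pathway name="path:ko00000" org="ko" number="00000" title="Toy pathway">
  <entry id="1" name="ko:K00001" type="ortholog">
    <graphics name="geneA" type="rectangle"/>
  </entry>
  <entry id="2" name="ko:K00002" type="ortholog">
    <graphics name="geneB" type="rectangle"/>
  </entry>
  <entry id="3" name="cpd:C00001" type="compound">
    <graphics name="C00001" type="circle"/>
  </entry>
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="3"/>
  </relation>
</pathway>
"""


def parse_kgml(xml_text):
    """Return (title, nodes, edges) extracted from a KGML string.

    Hypothetical helper: nodes maps entry id -> attribute dict,
    edges is a list of (entry1, entry2, attribute dict) tuples.
    """
    root = ET.fromstring(xml_text)

    nodes = {}
    for entry in root.findall("entry"):
        graphics = entry.find("graphics")
        # Use the graphics name as a display label, falling back to the entry name.
        label = graphics.get("name") if graphics is not None else entry.get("name")
        nodes[entry.get("id")] = {
            "name": entry.get("name"),
            "type": entry.get("type"),
            "label": label,
        }

    edges = []
    for relation in root.findall("relation"):
        subtypes = [s.get("name") for s in relation.findall("subtype")]
        edges.append((
            relation.get("entry1"),
            relation.get("entry2"),
            {"type": relation.get("type"), "subtypes": subtypes},
        ))

    return root.get("title"), nodes, edges


def to_networkx(title, nodes, edges):
    """Build a directed networkx graph from the output of parse_kgml."""
    import networkx as nx  # imported here so parsing alone needs only the standard library

    graph = nx.DiGraph(title=title)
    graph.add_nodes_from(nodes.items())  # (node id, attribute dict) pairs
    graph.add_edges_from(edges)          # (source, target, attribute dict) triples
    return graph
```

Calling to_networkx(*parse_kgml(SAMPLE_KGML)) should give a small graph with three nodes and one edge. It could then be drawn with networkx.draw, passing a labels dictionary built from each node's "label" attribute, followed by pylab.show(). Keeping the parsing step separate from the graph construction is only one possible design; it makes the XML handling easy to check without networkx installed.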

Any suggestions are much appreciated: this is the first time I have worked with XML files.

Comments 6

  1. Neil wrote:

    You’re right that Perl lacks libraries for KEGG Pathways. BioPerl has SeqIO::kegg, but that is only for sequences in KEGG flatfile format.

    BioRuby is better, as you might expect since both KEGG and Ruby are out of Japan. See http://bioruby.org/rdoc/classes/Bio/KEGG.html – ruby code should be quite clear to a python coder.

    Posted 03 Jun 2009 at 12:43 am
  2. admin wrote:

    Thanks!
    Basically, it also uses a library for parsing xml and then creates a network representation.

    They use the entry.graphics.label element to get the gene names, which is exactly what I was doing, although I was unsure whether it was correct.

    Posted 03 Jun 2009 at 1:39 pm
  3. Abhishek Tiwari wrote:

    Good, but didn’t you mention some time back a Python way of accessing the KEGG data through web services? Does this parser extend that?

    Posted 03 Jun 2009 at 2:55 pm
  4. admin wrote:

    In the end I still haven’t managed to make a SOAP call to KEGG with Python: I have tried the suds library, which works but returns some errors, and SOAPpy, which I still can’t get to work behind an HTTP proxy (though I may now understand how to solve it).

    Basically, the SOAP libraries are used to download data from KEGG, whereas with this parser you manually download a KGML file (from the FTP site: ftp://ftp.genome.jp/pub/kegg/xml/ko/) and then work on it locally.

    This is the same approach used by the authors of the R library, KEGGgraph. Another advantage is that you get a Python networkx object directly, so you can work with the structure of the pathway, plot it (see the networkx site), and export it to other formats, such as the ones supported by Cytoscape.

    By the way, I think there is still no easy way to convert KGML to Cytoscape files… you can do it with this library (bearing in mind that I haven’t written any tests yet and still have a lot of refactoring to do).

    Posted 03 Jun 2009 at 3:09 pm
  5. João Rodrigues wrote:

    Regarding SOAPpy and KEGG, I have used it in one of my projects behind an http proxy. Email me if you want to have a look at the code :) anaryin [at] gmail [dot] com

    Nice idea with this parser, keep up the good work!

    Posted 09 Jun 2009 at 12:16 am
  6. Simon Cockell wrote:

    I’ve suggested this in reply to your comment on my blog, but thought I’d leave it here too. Have you tried cElementTree, rather than ElementTree, in your code? It is in the standard library (since 2.5), but is MUCH faster than ElementTree. It is also a drop-in replacement, so the only thing you’d have to change in your code is the import statement.
    i.e.
    import xml.etree.ElementTree as ET
    becomes:
    import xml.etree.cElementTree as ET

    Posted 12 Jun 2009 at 9:39 pm

Trackbacks & Pingbacks 1

  1. From Simon Cockell | This week in links #1 | Fuzzier Logic on 12 Jun 2009 at 5:39 pm

    [...] Parsing KEGG in Python [...]
