A formula interface for Gene Ontology analysis

clusterProfiler is a nice R package for GO and KEGG enrichment analysis. It has a simple interface and produces clear plots with ggplot2. Today I contributed a formula interface to clusterProfiler, which makes it easier to run an enrichment analysis on multiple groups of genes.

Let’s imagine you have a data frame in which one column contains a list of Entrez IDs, while the other columns encode some grouping variables:
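A minimal sketch of such a data frame is below. The Entrez IDs and the column names `Entrez` and `othergroup` are arbitrary placeholders chosen for illustration, and are not necessarily the ones behind the plot in this post.

```r
# Toy data frame: one column of Entrez IDs plus two grouping columns.
# All values here are illustrative placeholders.
mydf <- data.frame(
  Entrez     = c("1", "100", "1000", "100101467", "100127206", "100128071"),
  group      = c("A", "A", "A", "B", "B", "B"),
  othergroup = c("good", "good", "bad", "bad", "good", "bad"),
  stringsAsFactors = FALSE
)

mydf
```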

The new formula interface lets you run a GO analysis on each of the groups. For example, we can group the genes by the column “group” and compare the classification of the two groups:
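A call for this grouping might look like the sketch below. It assumes a recent clusterProfiler together with the org.Hs.eg.db annotation package, and it rebuilds a toy data frame with placeholder Entrez IDs so that it runs on its own. The extra arguments passed on to groupGO (here `OrgDb`) depend on the clusterProfiler version, so check the documentation of the version you have installed.

```r
library(clusterProfiler)

# Toy input with placeholder Entrez IDs (illustrative only).
mydf <- data.frame(
  Entrez = c("1", "100", "1000", "100101467", "100127206", "100128071"),
  group  = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Left of ~ : the column holding the gene IDs.
# Right of ~ : the grouping column; one groupGO run is done per group.
# OrgDb is forwarded to groupGO (an assumption about recent versions).
xx.formula <- compareCluster(Entrez ~ group, data = mydf,
                             fun = "groupGO", OrgDb = "org.Hs.eg.db")

# The Cluster column holds the group label of each row.
head(as.data.frame(xx.formula))
```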

[Figure: clusterProfiler example]

In this case group A is enriched in membrane and extracellular region, while group B is only enriched in membrane genes. The groupGO function used here doesn’t provide p-values, so we should have used enrichGO instead. I guess that the 3 Entrez IDs in group A correspond to 5 genes in Gene Ontology, which is why the plot shows a total of 5 in group A. See clusterProfiler’s documentation for better examples.

The formula interface also supports multiple grouping variables. For example:
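With two grouping columns, the call could be sketched as follows. As before, the data frame, the `othergroup` column and the `OrgDb` argument are assumptions made for illustration, not the exact code behind this post.

```r
library(clusterProfiler)

# Toy input with placeholder Entrez IDs and two grouping columns.
mydf <- data.frame(
  Entrez     = c("1", "100", "1000", "100101467", "100127206", "100128071"),
  group      = c("A", "A", "A", "B", "B", "B"),
  othergroup = c("good", "good", "bad", "bad", "good", "bad"),
  stringsAsFactors = FALSE
)

# Several grouping variables are combined with + on the right-hand side;
# each combination of group and othergroup becomes one cluster.
xx.twogroups <- compareCluster(Entrez ~ group + othergroup, data = mydf,
                               fun = "groupGO", OrgDb = "org.Hs.eg.db")

head(as.data.frame(xx.twogroups))
```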

Of course this example is not very interesting, since it uses only 6 randomly chosen genes. However, with bigger datasets the formula interface can be much more powerful.

Now that I think of it, it would be better if compareCluster returned a data frame with one column per grouping variable, instead of merging them into a single column called Cluster. This would make it possible to plot the results using facets or something fancier. It would be something like this:

However, this would probably require breaking backward compatibility in the library, and it is not a big deal, as the Cluster column can easily be split using the separate function from tidyr.
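Splitting the Cluster column with separate could look roughly like this. The data frame below is a mock stand-in for a compareCluster result, and the sketch assumes that the combined labels are joined with a dot (for example `A.good`); check the actual labels in your own output before relying on that separator.

```r
library(tidyr)

# Mock stand-in for as.data.frame() of a compareCluster result.
# Rows and labels are made up for illustration.
res <- data.frame(
  Cluster     = c("A.good", "A.bad", "B.good", "B.bad"),
  Description = c("membrane", "extracellular region", "membrane", "membrane"),
  stringsAsFactors = FALSE
)

# In a real result Cluster may be a factor, so convert it to character first.
res$Cluster <- as.character(res$Cluster)

# Split "A.good" into group = "A" and othergroup = "good".
# sep is a regular expression, so the dot must be escaped.
res_split <- separate(res, Cluster, into = c("group", "othergroup"), sep = "\\.")

res_split
```

Once the grouping variables sit in their own columns, they can be mapped to ggplot2 facets like any other variable.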

