clusterProfiler is a nice R library for doing GO and KEGG enrichment analysis. It has a simple interface and it can produce some clear plots using the ggplot engine. Today I contributed a formula interface to clusterProfiler, making it easier to do enrichment of multiple groups of genes.
Let’s imagine you have a dataframe in which one column contains a list of Entrez Ids, while the other columns encode some grouping variables:
> mygenes Entrez group othergroup 1 A good 100 A bad 1000 A good 100101467 B bad 100127206 B good 100128071 B good
The new formula interface allows to do a GO analysis on each of the groups. For example, we can group them by the column “group”, and compare the classification of the two groups:
> install_github(c("GuangchuangYu/DOSE", "Guangchuang/clusterProfiler")) > library(clusterProfiler) > mygenes <- data.frame(Entrez=c('1', '100', '1000', '100101467', '100127206', '100128071'), group = c('A', 'A', 'A', 'B', 'B', 'B'), othergroup = c('good', 'bad', 'good', 'bad', 'good', 'good')) > GO.enrichment <- compareCluster(Entrez~group, data=mygenes, fun='groupGO') > print(summary(GO.enrichment)) Cluster ID Description Count GeneRatio geneID 1 A GO:0016020 membrane 2 2/5 100/1000 2 A GO:0005576 extracellular region 3 3/5 1/100/1000 3 A GO:0005581 collagen trimer 0 0/5 4 B GO:0016020 membrane 1 1/3 100127206 5 B GO:0005576 extracellular region 0 0/3 6 B GO:0005581 collagen trimer 0 0/3 > plot(GO.enrichment, plotAll=T)
In this case group A is enriched in membrane and extracellular region, while group B is only enriched in membrane genes. The groupGO function used here doesn’t provide p-values – we should have used enrichGO instead. I guess that the 3 Entrez ids in group A correspond to 5 genes in Gene Ontology, so that’s why the plots shows a total of 5 in group A. See clusterProfiler’s documentation for better examples.
The formula interface allows also multiple grouping. For example:
GO.enrichment <- compareCluster(Entrez~group+othergroup, data=mygenes, fun='groupGO') Cluster ID Description Count GeneRatio geneID 1 A.bad GO:0016020 membrane 1 1/2 100 2 A.bad GO:0005576 extracellular region 1 1/2 100 3 A.good GO:0016020 membrane 1 1/3 1000 4 A.good GO:0005576 extracellular region 2 2/3 1/1000 5 B.bad GO:0016020 membrane 0 0/2 6 B.bad GO:0005576 extracellular region 0 0/2 7 B.good GO:0016020 membrane 1 1/2 100127206 8 B.good GO:0005576 extracellular region 0 0/2
Of course this example is not much interesting, since it is only 6 randomly chosen genes. However with bigger datasets the formula interface can be much more powerful.
Now that I think of it, it would be better if compareCluster would return a dataframe with multiple columns, instead of merging them into a single column called Cluster. This would be make it possible to plot the results using facets or something more fancy. It would be something like this:
group othergroup ID Description Count GeneRatio geneID 1 A bad GO:0016020 membrane 1 1/2 100 2 A bad GO:0005576 extracellular region 1 1/2 100 3 A good GO:0016020 membrane 1 1/3 1000 4 A good GO:0005576 extracellular region 2 2/3 1/1000 5 B bad GO:0016020 membrane 0 0/2 6 B bad GO:0005576 extracellular region 0 0/2 7 B good GO:0016020 membrane 1 1/2 100127206 8 B good GO:0005576 extracellular region 0 0/2
However this would probably require introduce some retro-incompatibility in the library, and it is not a big deal as the Cluster column can be easily split using the separate function from tidyr.