I would like to give a big thanks to Guangchuang Yu, the author of many cool R libraries like GOSemSim and ggtree, for implementing a Network of Cancer Genes enrichment function in the DOSE R library.
The new function is called enrichNCG, and can be found in the GitHub version of DOSE. You can use it to analyze a list of genes and determine whether it is enriched in genes known to be mutated in a given cancer type. For example, a random list made up of genes with an Entrez ID between 4000 and 9000 is enriched in genes mutated in sarcoma and leukemia:
> library(devtools) # provides install_github
> install_github(c("GuangchuangYu/DOSE", "GuangchuangYu/clusterProfiler"))
> library(DOSE)
> mygenes = as.character(4000:9000) # generate a random list of Entrez Ids, from ID 4000 to ID 9000
> summary(enrichNCG(gene=mygenes)) # calculate enrichment of the random list of Entrez Ids
         ID       Description GeneRatio BgRatio  pvalue      p.adjust   qvalue
sarcoma  sarcoma  sarcoma     15/457    30/1920  0.001495187 0.02352589 0.02185067
leukemia leukemia leukemia    38/457    106/1920 0.002767752 0.02352589 0.02185067
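If you wonder where the pvalue and p.adjust columns come from, here is a minimal sketch in base R. It assumes that the enrichment is a one-sided hypergeometric test followed by a Benjamini-Hochberg correction; I have not checked this against the enrichNCG source, so treat it as an illustration rather than the actual implementation. The number of tested terms (17) is not shown in the output: it is an assumption I inferred from the ratio between the raw and adjusted p-values.

```r
# Counts copied from the sarcoma row of the output above
k <- 15    # input genes annotated to sarcoma (GeneRatio numerator)
n <- 457   # input genes found in NCG (GeneRatio denominator)
M <- 30    # all sarcoma genes in NCG (BgRatio numerator)
N <- 1920  # all genes in NCG (BgRatio denominator)

# One-sided hypergeometric test: probability of seeing k or more
# sarcoma genes when drawing n genes out of N
p_sarcoma <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
print(p_sarcoma)

# Benjamini-Hochberg adjustment of the two raw p-values reported above.
# n_terms = 17 is an assumption (inferred, not reported by enrichNCG).
n_terms <- 17
p_raw <- c(sarcoma = 0.001495187, leukemia = 0.002767752)
p_adj <- p.adjust(p_raw, method = "BH", n = n_terms)
print(p_adj)
```

Under the 17-term assumption both adjusted values come out at about 0.0235, which is what the p.adjust column shows. If enrichNCG uses the same test, the hypergeometric p-value should also land close to the reported pvalue for sarcoma.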
If you have multiple sets of genes, you can also use the clusterProfiler library to compare them at the same time. Read this previous post for more examples of this functionality.
> library(clusterProfiler)
> summary(compareCluster(list(L1=as.character(4000:9000), L2=as.character(3000:4000)), fun='enrichNCG'))
Cluster ID       Description GeneRatio BgRatio  pvalue       p.adjust     qvalue
L1      sarcoma  sarcoma     15/457    30/1920  0.0014951873 0.0235258920 0.0218506737
L1      leukemia leukemia    38/457    106/1920 0.0027677520 0.0235258920 0.0218506737
L2      leukemia leukemia    16/96     106/1920 0.0000399849 0.0001599396 0.0001262681
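The GeneRatio and BgRatio columns are returned as text like "15/457", so they cannot be compared directly across clusters. The sketch below turns them into numbers and derives a fold enrichment. The helper ratio_to_number and the FoldEnrichment column are names I made up for this example, and the data frame is typed in by hand from the output above instead of being taken from compareCluster.

```r
# Hypothetical helper: convert strings like "15/457" into numeric ratios
ratio_to_number <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Values copied by hand from the compareCluster output above
res <- data.frame(
  Cluster   = c("L1", "L1", "L2"),
  ID        = c("sarcoma", "leukemia", "leukemia"),
  GeneRatio = c("15/457", "38/457", "16/96"),
  BgRatio   = c("30/1920", "106/1920", "106/1920"),
  stringsAsFactors = FALSE
)

# Fold enrichment: fraction of annotated genes in the input list,
# divided by the fraction of annotated genes in the background
res$FoldEnrichment <- ratio_to_number(res$GeneRatio) / ratio_to_number(res$BgRatio)
print(res)
```

With these numbers, leukemia genes are about three times more frequent in L2 than in the background, but only about 1.5 times more frequent in L1, which fits the much lower p-value of the L2 row.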
If you also have gene scores (e.g. a value for the expression or conservation of each gene), you can do a Gene Set Enrichment Analysis, which will give more importance to genes with higher scores:
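> # geneList: a named numeric vector of scores (names are Entrez IDs), sorted in decreasing order
> y = gseAnalyzer(geneList, setType="NCG", minGSSize=1)
> summary(y)
ID          Description setSize enrichmentScore pvalue p.adjust qvalues
breast,lung breast,lung 2       -0.9965348      0      0        0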
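To give an idea of what the enrichmentScore column means, here is a simplified sketch of the running-sum statistic behind Gene Set Enrichment Analysis, in base R. It is not the code used by DOSE and it does not compute a permutation p-value. The gene names, the scores, and the function running_es are all hypothetical and exist only for this example.

```r
# Hypothetical toy scores: names are gene IDs, values are per-gene scores
scores <- c(g1 = 3.2, g2 = 2.5, g3 = 1.1, g4 = -0.4, g5 = -1.8, g6 = -2.9)

# Simplified weighted running-sum enrichment score
running_es <- function(scores, gene_set, weight = 1) {
  scores <- sort(scores, decreasing = TRUE)      # rank genes by score
  in_set <- names(scores) %in% gene_set          # which ranked genes are in the set
  hit <- ifelse(in_set, abs(scores)^weight, 0)   # hits are weighted by their score
  hit <- hit / sum(hit)
  miss <- ifelse(in_set, 0, 1 / sum(!in_set))    # misses are penalised uniformly
  running <- cumsum(hit - miss)                  # walk down the ranked list
  unname(running[which.max(abs(running))])       # largest deviation from zero
}

es_top    <- running_es(scores, c("g1", "g2"))   # set at the top of the ranking
es_bottom <- running_es(scores, c("g5", "g6"))   # set at the bottom of the ranking
print(c(top = es_top, bottom = es_bottom))
```

A set concentrated at the top of the ranking gets a score close to 1, and a set concentrated at the bottom gets a score close to -1. The score of -0.9965 for the breast,lung set above therefore suggests that its two genes sit near the bottom of geneList.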
You can also produce many nice plots. For example this is a cnetplot, in which each gene is connected to the terms related to it:
> cnetplot(enrichNCG(as.character(4000:9000), readable=TRUE), fixed=TRUE)
<a href="http://bioinfoblog.it/wp-content/uploads/2015/04/random1_cnet.png">
<img class="alignnone wp-image-1579 size-large" src="http://bioinfoblog.it/wp-content/uploads/2015/04/random1_cnet-1024x729.png" alt="the cnet visualization of a randomly generated dataset of genes, using enrichNCG from DOSE, which derives data from the Network of Cancer Genes." width="640" height="456" /></a>
The cnet visualization of a randomly generated dataset of genes,
using enrichNCG from <a title="DOSE" href="https://github.com/GuangchuangYu/DOSE" target="_blank">DOSE</a>, which derives data from the <a title="The Network of Cancer Genes database" href="http://bioinfoblog.it/2015/03/the-network-of-cancer-genes-database/" target="_blank">Network of Cancer Genes</a>.
It is worth mentioning that the DOSE package can also calculate enrichment against the Disease Ontology database, which associates genes with disease terms. In my experience, Disease Ontology is more useful to bioinformaticians than OMIM, because it provides a clear association between genes and disease terms. If you use the raw OMIM data instead, you have to text mine the descriptions, and that can produce a lot of noisy data.
Have a good enrichment with DOSE and NCG 😉