The Network of Cancer Genes database

In the last year I have been part of the team maintaining and updating the Network of Cancer Genes database, also known as NCG.

NCG logo
The Network of Cancer Genes database

The main focus of NCG is to provide a curated list of genes associated to cancer, obtained after a manual review of the literature, and classified by cancer subtypes. Moreover NCG annotates some system-level properties of genes associated to cancer, from their protein interactions to their evolutionary age, and from the presence of paralogs in the human genome to their function.

NCG is a small database and is not supported by any big consortium, but we do our best to fill our niche :-). The following list will describe you what you can get from NCG and how can it be useful to you.

A manually annotated list of genes associated to cancer

It is difficult to keep pace with all the literature on cancer. New screenings on cancer samples are published every one or two months, usually describing novel mutations and new cancer driver genes. While these screenings add important knowledge on the mechanisms behind cancer, it is difficult to keep track of all of them, and have a clear picture of which mutations are driver in a given cancer type. The ICGC and the TCGA consortia provide some nice web interface to retrieve the genes recurrently mutated in a cancer type, but these are limited only to the data published by these two consortia. What about all the other studies published outside of ICGC and TGCA?

drivers by subtypes
Number of known cancer genes (COSMIC list) or candidates annotated in NCG

In NCG we manually review all the studies published recently, and annotate a list of genes reported as “drivers” in each study. So far we have about 70 papers annotated, and we are close to uploading a batch of about 70 more publications. The annotation process is currently done between three people, and each paper is checked more than once to make sure that the annotation is correct. It’s hard work, but then the output is a nice list of driver genes in each cancer type.

Annotation of paralogs of cancer genes

Recent estimates reported that about 80% of the human genes have at least one paralog (e.g. Dickerson and Robertson 2012). These percentages may be a bit too high, and they may be based on a excessively broad definition of paralogy, but overall we can expect that a good portion of the human genes have at least a domain or a portion of their sequence in common with other genes.

PTEN, a known tumor suppressor, has a duplication on chromosome 9. This actually correspond to the PTENP1 pseudogene.
PTEN, a known tumor suppressor, has a duplication on chromosome 9. This actually correspond to the PTENP1 pseudogene.

The presence of a paralog of a cancer gene is a factor to take into account, because it can complicate the development of drug strategies. In particular, it has been hypothesized that two paralogs can often exibit functional compensation, meaning that if we inhibit the activity of a gene, the other paralog can compensate the function, reducing any impact of the inhibition. This may render a drug less efficient in inhibiting an oncogene, or lead to unpredictable effects in other cases.

Annotation of gene age

It has been shown that cancer genes of different age can have different properties. For example most tumor suppressors tend to be old genes originated in the Universal Common Ancestor of all eukaryotes, while most oncogenes are originated in metazoans. The indication of gene age can therefore be useful to have an idea of whether a candidate gene may be an oncogene or a tumor suppressor.

The TP53 tumor suppressor originated in eukaryotes, and we can find orthologs of it in all organisms except in procaryotes.
The TP53 tumor suppressor originated in eukaryotes, and we can find orthologs of it in all organisms except in procaryotes.

The indication of age can also be useful to understand which model organisms can be used to study the gene – e.g. whether the gene is present in yeast, or only in closer species.

Role in the interactome network

Another important feature provided in NCG is the protein-protein interactions of the cancer gene. It has been reported that both oncogenes and tumor suppressor genes have on average an high number of interactions, so understanding which genes interact with a given candidate can be useful to understand the function and the involvement in cancer. The interaction network in NCG comes from the integration of 5 databases for protein-protein interactions (see An et al 2014 for more info), after some cleaning steps.

Protein interactions of the BRAF oncogenes
Protein interactions of the BRAF oncogene

Summary

NCG is database annotating cancer genes and their systems-level properties. It is now at the 4th release, but its development is still active and we are looking for new properties to annotate. If you have any idea or suggestion, just contact us!

References (cross linked at researchblogging.org)

An, O., Pendino, V., D’Antonio, M., Ratti, E., Gentilini, M., & Ciccarelli, F. (2014). NCG 4.0: the network of cancer genes in the era of massive mutational screenings of cancer genomes Database, 2014 DOI: 10.1093/database/bau015
Dickerson, J., & Robertson, D. (2011). On the Origins of Mendelian Disease Genes in Man: The Impact of Gene Duplication Molecular Biology and Evolution, 29 (1), 61-69 DOI: 10.1093/molbev/msr111

10 Comments

    1. Hi Guangchuang, I think it will be definitely very cool!!
      The best way to get the list of genes is from the Download section (http://ncg.kcl.ac.uk/download.php). There are two types of cancer genes: “cgcs”, derived “candidates”, which are the Cancer Gene Consensus (http://cancer.sanger.ac.uk/cancergenome/projects/census/ ), and “candidates” , derived from our manual curation of terms. In addition, cgcs are divided into either “dominant” and “recessive” genes, according to whether they are oncogenes or tumor suppressors.

      The list of genes is a bit old, as it is frozen since 2014. We are going to update the list at the end of the year, but it is not ready yet at the moment 😉

          1. > require(DOSE)
            Loading required package: DOSE

            > data(geneList)
            > gene = names(geneList)[abs(geneList)>2]
            > head(gene)
            [1] “4312” “8318” “10874” “55143” “55388” “991”
            >
            > y = gseAnalyzer(geneList, setType=”NCG”)
            [1] “calculating observed enrichment scores…”
            [1] “calculating permutation scores…”
            |================================================================== | 95%
            [1] “calculating p values…”
            [1] “done…”
            >
            > head(summary(y))
            [1] ID Description setSize enrichmentScore
            [5] pvalue p.adjust qvalues
            (or 0-length row.names)
            > y = gseAnalyzer(geneList, setType=”NCG”, minGSSize=1)
            [1] “calculating observed enrichment scores…”
            [1] “calculating permutation scores…”
            |===================================================================== | 98%
            [1] “calculating p values…”
            [1] “done…”
            > head(summary(y))
            ID
            breast,lung,ovarian,pancreas,prostate,sarcoma breast,lung,ovarian,pancreas,prostate,sarcoma
            liver liver
            Description
            breast,lung,ovarian,pancreas,prostate,sarcoma breast,lung,ovarian,pancreas,prostate,sarcoma
            liver liver
            setSize enrichmentScore pvalue
            breast,lung,ovarian,pancreas,prostate,sarcoma 2 -0.9965348 0.000
            liver 59 -0.4762949 0.001
            p.adjust qvalues
            breast,lung,ovarian,pancreas,prostate,sarcoma 0.000 0.00000000
            liver 0.031 0.02631579
            >
            > x = enrichNCG(gene, pvalueCutoff=1, qvalueCutoff=1, minGSSize=1, readable=TRUE)
            > head(summary(x))
            ID Description GeneRatio BgRatio pvalue p.adjust qvalue
            prostate prostate prostate 2/17 80/1920 0.1559174 0.1559174 NA
            geneID Count
            prostate FOXA1/OGN 2

            Now DOSE supports hypergeometric test and GSEA for NCG data.

    1. These are wonderful news! Thank you for implementing NCG in DOSE! I’ll write a blog post as soon as I can.

      How does the custom annotation data works? Does it allow to create custom databases of genes/terms annotations?

      1. User only needs to provides TERM2GENE annotation, which is a data.frame with 2 columns. The first column is term and the second one is gene.

        Another input TERM2NAME is optional.

        Other parameter is similar with enrichGO/enrichKEGG.

        For example:

        > require(DOSE)
        > data(geneList)
        >
        > require(clusterProfiler)
        >
        > gsea.res =GSEA(geneList, minGSSize=1, pvalueCutoff=1, TERM2GENE=cancer2gene)
        preparing geneSet collections…
        [1] “calculating observed enrichment scores…”
        [1] “calculating permutation scores…”
        |===================================================================== | 98%
        [1] “calculating p values…”
        [1] “done…”
        >
        > head(summary(gsea.res))
        ID
        breast,lung,ovarian,pancreas,prostate,sarcoma breast,lung,ovarian,pancreas,prostate,sarcoma
        liver liver
        leukemia,breast leukemia,breast
        ovarian ovarian
        lymphoma,non-hodgkin lymphoma lymphoma,non-hodgkin lymphoma
        lymphoma lymphoma
        Description
        breast,lung,ovarian,pancreas,prostate,sarcoma breast,lung,ovarian,pancreas,prostate,sarcoma
        liver liver
        leukemia,breast leukemia,breast
        ovarian ovarian
        lymphoma,non-hodgkin lymphoma lymphoma,non-hodgkin lymphoma
        lymphoma lymphoma
        setSize enrichmentScore pvalue
        breast,lung,ovarian,pancreas,prostate,sarcoma 2 -0.9965348 0.000
        liver 59 -0.4762949 0.001
        leukemia,breast 2 0.9732172 0.004
        ovarian 2 0.9473503 0.010
        lymphoma,non-hodgkin lymphoma 12 0.6290415 0.014
        lymphoma 119 0.3043574 0.017
        p.adjust qvalues
        breast,lung,ovarian,pancreas,prostate,sarcoma 0.00000000 0.00000000
        liver 0.03100000 0.02684211
        leukemia,breast 0.08266667 0.07157895
        ovarian 0.15500000 0.13421053
        lymphoma,non-hodgkin lymphoma 0.17222222 0.14912281
        lymphoma 0.17222222 0.14912281
        >
        > gene = names(geneList)[abs(geneList) > 1]
        > enrich.res = enricher(gene, minGSSize=1, pvalueCutoff=1, qvalueCutoff = 1, TERM2GENE=cancer2gene)
        > head(summary(enrich.res))
        ID Description GeneRatio
        liver liver liver 12/160
        lipoma lipoma lipoma 2/160
        gastric gastric gastric 5/160
        bone cyst bone cyst bone cyst 2/160
        cholangiocarcinoma cholangiocarcinoma cholangiocarcinoma 2/160
        non-hodgkin lymphoma non-hodgkin lymphoma non-hodgkin lymphoma 3/160
        BgRatio pvalue p.adjust qvalue
        liver 79/1920 0.02738473 0.2624927 0.2302567
        lipoma 4/1920 0.03701728 0.2624927 0.2302567
        gastric 24/1920 0.04374878 0.2624927 0.2302567
        bone cyst 5/1920 0.05835422 0.2625940 0.2303456
        cholangiocarcinoma 6/1920 0.08282484 0.2981694 0.2615521
        non-hodgkin lymphoma 18/1920 0.18465912 0.4928293 0.4323064
        geneID
        liver 727897/273/1657/4337/29994/4053/2550/4915/213/3572/80162/4857
        lipoma 57007/10186
        gastric 7272/1272/79633/5764/9723
        bone cyst 1009/4958
        cholangiocarcinoma 10403/3908
        non-hodgkin lymphoma 7037/7832/4582
        Count
        liver 12
        lipoma 2
        gastric 5
        bone cyst 2
        cholangiocarcinoma 2
        non-hodgkin lymphoma 3

        These examples using GSEA and enricher with user input (NCG data here).

        enrichNCG() and gseaAnalyzer(setType=”NCG”, …) will generate the same output.

        You may notice that Description and ID columns are the same, since we did not provide TERM2NAME data.frame, and the Description is necessary for plotting (ID is supposed for computer and Description for human).

        So, if there is no TERM2NAME annotation available, I just put the ID in Description column.

  1. GSEA and enricher is only available in clusterProfiler.

    I will keep all the functions in DOSE are all disease related.

Leave a Reply to ygc Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">