BioConductor includes many powerful packages for working with genomics data. You can do pretty much everything, from downloading gene coordinates and sequences of any model species, to converting gene ids and symbol, and to accessing ENCODE data and anything in UCSC, Ensembl, and other resources. However these packages are not always well known, and the initial learning curve is a steep, specially for R beginners.
This series of tutorials will describe how to get gene coordinates from bioconductor, intersect these with some interesting dataset from ENCODE, and do an enrichment analysis with DOSE. It will be fun 🙂
Libraries required for this tutorial
For this tutorial we will use only Human data. Most of the packages needed for working with human genomics can be installed with a single command:
|
> source("https://bioconductor.org/biocLite.R") > biocLite("Homo.sapiens", "AnnotationHub", "BSgenome.Hsapiens.UCSC.hg19") > library(Homo.sapiens) > library(AnnotationHub) > library(dplyr) |
The Homo.sapiens package is a container that includes the most important packages for working with human data. In particular it contains two data packages:
- TxDb.Hsapiens.UCSC.hg19.knownGene is a TxDb object containing the coordinates of all the genes, transcripts, and exons in the human genome
- org.Hs.eg.db is what you need to convert all gene ids – from entrez to ensembl, to GO, and so on.
The data in these packages is updated periodically (I think every 6 months), and is pretty stable, meaning that anybody using the same packages and version should be able to reproduce the same results. An alternative to using these data packages is biomaRt, but I prefer the data packages as they can be used without internet connection.
The package AnnotationHub is used to retrieve data from multiple sources and will be described later. The BSgenome package is for retrieving the human genome sequence: we will not use it in the tutorial but I included it for completeness.
Note that I also loaded the dplyr package for this tutorial. Although dplyr is not needed for working with genomics data, I consider it one of the most useful packages in R, and this tutorial will make heavy use of it. I apologize if this tutorial is not easy to follow to people not familiar with dplyr.
Retrieving gene and transcript coordinates
The TxDb object can be used to retrieve coordinates of genes, transcripts, and exons in the human genome. For example, we can access all human transcript with the transcript() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
|
> tx = TxDb.Hsapiens.UCSC.hg19.knownGene > human.transcripts = transcripts(tx, columns=c("tx_id", "tx_name","gene_id")) > human.transcripts GRanges object with 82960 ranges and 3 metadata columns: seqnames ranges strand | tx_id tx_name gene_id <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> [1] chr1 [ 11874, 14409] + | 1 uc001aaa.3 100287102 [2] chr1 [ 11874, 14409] + | 2 uc010nxq.1 100287102 [3] chr1 [ 11874, 14409] + | 3 uc010nxr.1 100287102 [4] chr1 [ 69091, 70008] + | 4 uc001aal.1 79501 [5] chr1 [321084, 321115] + | 5 uc001aaq.2 ... ... ... ... ... ... ... ... [82956] chrUn_gl000237 [ 1, 2686] - | 82956 uc011mgu.1 [82957] chrUn_gl000241 [20433, 36875] - | 82957 uc011mgv.2 [82958] chrUn_gl000243 [11501, 11530] + | 82958 uc011mgw.1 [82959] chrUn_gl000243 [13608, 13637] + | 82959 uc022brq.1 [82960] chrUn_gl000247 [ 5787, 5816] - | 82960 uc022brr.1 |
See help(transcripts) for other functions that can be applied to a TxDb object. For example, genes() retrieve coordinates of genes, while exons() and promoters() work similarly.
In the example above I also specified a “columns” parameter, in order to show the gene id as well. You can use this column to get the coordinates of a specific set of genes. For example, the following will retrieve the coordinates of the genes corresponding to entrez ids 1234, 231, and 421:
|
> subset(human.transcripts, gene_id %in% c(1234, 231, 421)) GRanges object with 5 ranges and 3 metadata columns: seqnames ranges strand | tx_id tx_name gene_id <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> [1] chr3 [ 46411633, 46417697] + | 13703 uc003cpo.4 1234 [2] chr3 [ 46411633, 46417697] + | 13704 uc010hjd.3 1234 [3] chr7 [134127107, 134143888] - | 31418 uc003vrp.1 231 [4] chr22 [ 19957402, 19966734] - | 74543 uc002zqy.3 421 [5] chr22 [ 19957402, 20004309] - | 74544 uc002zqz.3 421 |
Converting Entrez IDs to symbols and other IDs
One of the most tricky part in bioinformatics is converting gene ids to symbols and other ids. Many errors can be made in this process, and is therefore very important to have a consistent way to convert gene ids.
Luckily, we can use the org.Hs.eg.db for easily converting many ids. This package should already have been loaded with library(Homo.sapiens). To see all the possible conversion tables (bimaps) available, we can either to library(help=org.Hs.eg.db) or simply write “org.Hs.eg” and then hit tab on the R command line .
|
> org.Hs.eg<TAB> org.Hs.eg org.Hs.egCHRLENGTHS org.Hs.egENSEMBLTRANS org.Hs.egGO2EG org.Hs.egPATH org.Hs.egREFSEQ2EG org.Hs.eg_dbInfo org.Hs.eg.db org.Hs.egCHRLOC org.Hs.egENSEMBLTRANS2EG org.Hs.egMAP org.Hs.egPATH2EG org.Hs.egSYMBOL org.Hs.eg_dbconn org.Hs.eg.db:: org.Hs.egCHRLOCEND org.Hs.egENZYME org.Hs.egMAP2EG org.Hs.egPFAM org.Hs.egSYMBOL2EG org.Hs.eg_dbfile org.Hs.egACCNUM org.Hs.egENSEMBL org.Hs.egENZYME2EG org.Hs.egMAPCOUNTS org.Hs.egPMID org.Hs.egUCSCKG org.Hs.eg_dbschema org.Hs.egACCNUM2EG org.Hs.egENSEMBL2EG org.Hs.egGENENAME org.Hs.egOMIM org.Hs.egPMID2EG org.Hs.egUNIGENE org.Hs.egALIAS2EG org.Hs.egENSEMBLPROT org.Hs.egGO org.Hs.egOMIM2EG org.Hs.egPROSITE org.Hs.egUNIGENE2EG org.Hs.egCHR org.Hs.egENSEMBLPROT2EG org.Hs.egGO2ALLEGS org.Hs.egORGANISM org.Hs.egREFSEQ org.Hs.egUNIPROT library(help=org.Hs.eg.db) |
One of my favorite bimaps is the one to convert gene symbols to entrez. As you may know, for historical reasons the same gene can have more than one symbol. This usually complicates things a lot, and a safe procedure is to always convert symbols to entrez before starting any analysis. The ALIAS2EG bimap is there for this type of conversion:
|
> head(as.data.frame(org.Hs.egALIAS2EG)) gene_id alias_symbol 1 1 A1B 2 1 ABG 3 1 GAB 4 1 HYST2477 5 1 A1BG 6 2 A2MD |
When you convert symbols to id, it is important to remember that not only the same gene can have more than one symbol, but also the same symbol can match multiple entrez ids. For example, here is the code to get which symbols match more than one entrez id in the human species:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
|
> as.data.frame(org.Hs.egALIAS2EG) %>% count(alias_symbol) %>% arrange(-n) Source: local data frame [118,097 x 2] alias_symbol n (chr) (int) 1 VH 36 2 MT1 12 3 GPCR 11 4 HOX1 10 5 ACT 9 6 HOX2 9 7 PPIase 9 8 UDPGT 9 9 ALP 8 10 NAP1 8 |
Note: the “%>%” symbol and the count, arrange functions come from the dplyr package.
The only safe thing to do in these cases is to identify the duplicated symbols and either discard them or manually curate them. Let’s imagine that a fellow researchers asked us to retrieve information for the genes ACT, DOLPP1 and MGAT3. We can use the bimap to identify the genes that map to multiple entrez id, and then go back to our colleague and ask him to tell us which are the correct ids.
|
> as.data.frame(org.Hs.egALIAS2EG) %>% count(alias_symbol) %>% filter(alias_symbol %in% mygenes) Source: local data frame [3 x 2] alias_symbol n (chr) (int) 1 ACT 9 2 DOLPP1 1 3 MGAT3 2 |
In these examples I converted the bimap to a dataframe and then did an intersection. However the “bioConductor” way to use these bimaps is through the select function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
|
> AnnotationDbi::select( org.Hs.eg.db, keys=mygenes, keytype='ALIAS', columns=c('ENTREZID', 'ENSEMBL')) 'select()' returned 1:many mapping between keys and columns ALIAS ENTREZID ENSEMBL 1 ACT 12 ENSG00000196136 2 ACT 12 ENSG00000273259 3 ACT 71 ENSG00000184009 4 ACT 72 ENSG00000163017 5 ACT 9457 ENSG00000112214 6 ACT 11332 ENSG00000097021 7 ACT 345651 ENSG00000169067 8 ACT 389036 <NA> 9 ACT 440915 <NA> 10 ACT 641455 ENSG00000187537 11 ACT 641455 ENSG00000222036 12 MGAT3 4248 ENSG00000128268 13 MGAT3 346606 ENSG00000106384 14 DOLPP1 57171 ENSG00000167130 |
One big problem with these bioconductor packages is that they clash with many dplyr functions. For example, the select function gets overwritten if you load dplyr after Homo.sapiens, and the only option to avoid headaches is to explicitly refer to the function as AnnotationDbi::select. These conflicts in the namespace can cause a lot of confusion in R, because they let to weird error messages that are completely unrelated to the real problem.
In any case, the advantage of the select function is that it allows to retrieve more id types at the same type. For example here I retrieved both entrez id and ensembl ids, and if you type columns(org.Hs.eg.db) you will be able to see many other possible output columns.
Next parts of the tutorial
I was originally planning to write one big tutorial in the same post, but now I see that it would be much more readable if I split it into multiple posts.
Please let me know if you have any comment regarding this first tutorial, and I will try to improve it and take the feedback into account for the next parts.