R/genejam-freshen.R
freshenGenes.Rd
Freshen gene annotations using Bioconductor annotation data
freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)
Argument | Description |
---|---|
x | character vector or data.frame of gene symbols. |
ann_lib | character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature. |
try_list | character vector indicating one or more names of annotations to use for the input gene symbols in x. |
final | character vector to use for the final conversion step. |
split | character value used to separate delimited values in x. |
sep | character value used to concatenate multiple entries in the same field. The default is ",". |
handle_multiple | character value indicating how to handle multiple values: one of "first_try", "first_hit", "all", "best_each". |
empty_rule | character value indicating how to handle entries which did not have a match, and are therefore empty: one of "empty", "original", "na". |
include_source | logical indicating whether to include a column that shows the colname and source matched, for example "intermediate_source" for column "intermediate". |
protect_inline_sep | logical indicating whether to protect inline characters in sep. |
intermediate | character value with the colname that contains the Entrez gene ID values, by default "intermediate". |
ignore.case | logical indicating whether to perform a case-insensitive search. |
verbose | logical indicating whether to print verbose output. |
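As a rough illustration of how the default split pattern and sep relate to each other, the base R sketch below applies the same regular expression with strsplit() and rejoins the pieces with paste(). It only mimics the delimiter handling and is not the internal code of freshenGenes(); the helper name split_entries and the last input entry are made up for this example.

```r
# Default values copied from the usage above.
split <- "[ ]*[,/;]+[ ]*"
sep <- ","

# Hypothetical helper: split each delimited entry into its individual values.
split_entries <- function(x, split) {
   strsplit(x, split)
}

x <- c("APOE", "CCN2, CTGF", "NM_003166,U08032", "A1 / B2;C3")
parts <- split_entries(x, split)

# Rejoin the values of each entry using sep.
rejoined <- vapply(parts, paste, character(1), collapse = sep)

print(parts)
print(rejoined)
```

Commas, slashes, and semicolons are all treated as delimiters, and whitespace around them is dropped.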
data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".
This function takes a vector or data.frame of gene symbols, and uses Bioconductor annotation methods to find the most current official gene symbol.
The annotation process runs in two basic steps:
Convert the input gene to Entrez gene ID.
Convert Entrez gene ID to official gene symbol.
The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:
"org.Hs.egSYMBOL2EG" -- almost always 1-to-1 match
"org.Hs.egACCNUM2EG" -- mostly a 1-to-1 match
"org.Hs.egALIAS2EG" -- sometimes a 1-to-1 match, sometimes 1-to-many
When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.
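The ordered lookup in the first step can be sketched with plain R lists. This is a toy model under stated assumptions, not the package implementation: toy_maps stands in for the Bioconductor annotation maps, first_match is a hypothetical helper, and the Entrez IDs are the ones shown in the example output on this page.

```r
# Toy lookup tables standing in for the Bioconductor annotation maps.
# The Entrez IDs are taken from the example output on this page.
toy_maps <- list(
   SYMBOL2EG = list(APOE = "348", CCN2 = "1490"),
   ACCNUM2EG = list(),
   ALIAS2EG = list(APOE = "348", CCN2 = "1490", CTGF = "1490")
)

# Hypothetical helper: try each map in order, keep the first one that matches.
first_match <- function(gene, maps) {
   for (map_name in names(maps)) {
      hit <- maps[[map_name]][[gene]]
      if (!is.null(hit)) {
         return(c(intermediate = hit, source = map_name))
      }
   }
   # No map contained the gene.
   c(intermediate = NA_character_, source = NA_character_)
}

genes <- c("APOE", "CCN2", "CTGF", "NOT_A_GENE")
result <- t(sapply(genes, first_match, maps = toy_maps))
print(result)
```

In this toy model "CTGF" is absent from the first two maps and is only resolved by the alias map, which mirrors the intermediate_source column reported when include_source=TRUE.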
The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".
The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:
"org.Hs.egSYMBOL" -- official Entrez gene symbol
"org.Hs.egALIAS" -- set of recognized aliases for an Entrez gene.
"org.Hs.egGENENAME" -- official Entrez long gene name
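A minimal sketch of the second step, assuming the Entrez gene IDs are already known: each requested annotation type becomes one output column. The named vectors in toy_final are hypothetical stand-ins for the annotation maps, filled with values from the example output on this page.

```r
# Toy Entrez-to-annotation tables; values come from the example output on this page.
toy_final <- list(
   SYMBOL = c("348" = "APOE", "3006" = "H1-2"),
   GENENAME = c("348" = "apolipoprotein E",
      "3006" = "H1.2 linker histone, cluster member")
)

intermediate <- c("348", "3006")

# Look up every requested final annotation, one output column per type.
final <- c("SYMBOL", "GENENAME")
out <- data.frame(intermediate = intermediate, stringsAsFactors = FALSE)
for (type in final) {
   out[[type]] <- unname(toy_final[[type]][intermediate])
}
print(out)
```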
For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.
Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.
For case-insensitive search, which is particularly useful in non-human organisms because their gene symbols are often mixed-case, use the argument ignore.case=TRUE. In our benchmark tests it added roughly 0.1 seconds per annotation, regardless of the number of input entries. This appears to be the time it takes to spool the list of annotation keys stored in the SQLite database, and may therefore depend on the size of the annotation file.
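One plausible way to picture the fixed cost is shown below: a case-insensitive match needs the complete list of annotation keys so they can be case-folded, whatever the number of inputs. This is only a sketch of the idea, not the package internals; match_keys is a hypothetical helper, and the keys and IDs are placeholders rather than real lookups.

```r
# Toy set of annotation keys in mixed case.
# These keys and IDs are placeholders, not real lookups.
keys <- c("Apoe", "Ccn2", "Hist1h1c")
ids <- c("id_1", "id_2", "id_3")

# Hypothetical helper: match input to keys, optionally ignoring case.
match_keys <- function(x, keys, ignore.case = FALSE) {
   if (ignore.case) {
      # Requires the full key list up front, independent of length(x).
      match(toupper(x), toupper(keys))
   } else {
      match(x, keys)
   }
}

x <- c("APOE", "ccn2", "Hist1h1c")
exact <- ids[match_keys(x, keys)]
relaxed <- ids[match_keys(x, keys, ignore.case = TRUE)]
print(exact)
print(relaxed)
```

With exact matching only the third entry is found; with case folding all three are.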
Other genejam: freshenGenes2(), freshenGenes3(), get_anno_db(), is_empty()
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   cat("\nBasic usage\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}
#> Warning: package ‘IRanges’ was built under R version 3.6.2
#> Warning: package ‘S4Vectors’ was built under R version 3.6.3
#>
#> Basic usage
#>   input intermediate SYMBOL
#> 1  APOE          348   APOE
#> 2  CCN2         1490   CCN2
#> 3  CTGF         1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally show the annotation source matched
   cat("\nOptionally show the annotation source matched\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}
#>
#> Optionally show the annotation source matched
#>   input intermediate intermediate_source SYMBOL
#> 1  APOE          348  org.Hs.egSYMBOL2EG   APOE
#> 2  CCN2         1490  org.Hs.egSYMBOL2EG   CCN2
#> 3  CTGF         1490   org.Hs.egALIAS2EG   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Show comma-delimited genes
   cat("\nInput genes are comma-delimited\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}
#>
#> Input genes are comma-delimited
#>   input_v1 input_v2 intermediate SYMBOL
#> 1     APOE                   348   APOE
#> 2     CCN2                  1490   CCN2
#> 3     CTGF                  1490   CCN2
#> 4     CCN2     CTGF         1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally include more than SYMBOL in the output
   cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
   print(freshenGenes(c("APOE", "HIST1H1C"),
      final=c("SYMBOL", "ALIAS", "GENENAME")));
}
#>
#> Custom output to include SYMBOL, ALIAS, GENENAME
#>      input intermediate SYMBOL                              ALIAS
#> 1     APOE          348   APOE    AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG
#> 2 HIST1H1C         3006   H1-2 H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C
#>                              GENENAME
#> 1                    apolipoprotein E
#> 2 H1.2 linker histone, cluster member

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## More advanced, match affymetrix probesets
   if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
      cat("\nAdvanced example including Affymetrix probesets.\n");
      print(freshenGenes(c("227047_x_at","APOE","HIST1H1D","NM_003166,U08032"),
         include_source=TRUE,
         try_list=c("hgu133plus2ENTREZID","REFSEQ2EG","SYMBOL2EG","ACCNUM2EG","ALIAS2EG"),
         final=c("SYMBOL","GENENAME")))
   }
}
#>
#> Advanced example including Affymetrix probesets.
#>      input_v1 input_v2 intermediate intermediate_source  SYMBOL
#> 1 227047_x_at                 57659 hgu133plus2ENTREZID   ZBTB4
#> 2        APOE                   348  org.Hs.egSYMBOL2EG    APOE
#> 3    HIST1H1D                  3007   org.Hs.egALIAS2EG    H1-3
#> 4   NM_003166   U08032         6818  org.Hs.egREFSEQ2EG SULT1A3
#>                                  GENENAME
#> 1 zinc finger and BTB domain containing 4
#> 2                        apolipoprotein E
#> 3     H1.3 linker histone, cluster member
#> 4     sulfotransferase family 1A member 3