Freshen gene annotations using Bioconductor annotation data
Source:R/genejam-freshen.R
freshenGenes.RdFreshen gene annotations using Bioconductor annotation data
Usage
freshenGenes(
x,
ann_lib = c("", "org.Hs.eg.db"),
try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
final = c("SYMBOL"),
split = "[ ]*[,/;]+[ ]*",
revert_split = TRUE,
sep = ",",
handle_multiple = c("first_try", "first_hit", "all", "best_each"),
empty_rule = c("empty", "original", "na"),
include_source = FALSE,
protect_inline_sep = TRUE,
intermediate = "ENTREZID",
ignore.case = FALSE,
verbose = FALSE,
...
)Arguments
- x
charactervector ordata.framewith one or most columns containing gene symbols.- ann_lib
charactervector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.- try_list
charactervector indicating one or more names of annotations to use for the input gene symbols inx. The annotation should typically return the Entrez gene ID, usually given by'2EG'at the end of the name. For exampleSYMBOL2EGwill be used with ann_lib"org.Hs.eg.db"to produce annotation name"org.Hs.egSYMBOL2EG". Note that when the'2EG'form of annotation does not exist (or another suitable suffix defined in argument"revmap_suffix"inget_anno_db()), it will be derived usingAnnotationDbi::revmap(). For example if"org.Hs.egALIAS"is requested, but only"org.Hs.egALIAS2EG"is available, thenAnnotationDbi::revmap(org.Hs.egALIAS2EG)is used to create the equivalent of"org.Hs.egALIAS".- final
charactervector to use for the final conversion step. WhenfinalisNULLno conversion is performed. Whenfinalcontains multiple values, each value is returned in the output. For example,final=c("SYMBOL","GENENAME")will return a column"SYMBOL"and a column"GENENAME".- split
charactervalue used to separate delimited values inxby the functionbase::strsplit(). The default will split values separated by comma,semicolon;or forward slash/, and will trim whitespace before and after these delimiters.- revert_split
logicaldefault TRUE, whether to revert the internal split with argumentsplit, which causes columns with delimited values to be split into single value per column. Whenrevert_split=TRUEthe columns are re-combined to match the original data. Note that values are combined usingsepas delimiter, which may not be identical to the input data.- sep
charactervalue used to concatenate multiple entries in the same field. The defaultsep=","will comma-delimit multiple entries in the same field.- handle_multiple
charactervalue indicating how to handle multiple values:"first_hit"will query each column ofxuntil it finds the first possible returning match, and will ignore all subsequent possible matches for that row inx. For example, if one row inxcontains multiple values, only the first match will be used."first_try"will return the first match fromtry_listfor all columns inxthat contain a match. For example, if one row inxcontains two values, the first match fromtry_listusing one or both columns inxwill be maintained. Subsequent entries intry_listwill not be attempted for rows that already have a match."all"will return all possible matches for all entries inxusing all items intry_list.
- empty_rule
charactervalue indicating how to handle entries which did not have a match, and are therefore empty:"original"will use the original entry as the output field;"empty"will leave the entry blank.- include_source
logicalindicating whether to include a column that shows the colname and source matched. For example, if column"original_gene"matched"SYMBOL2EG"in"org.Hs.eg.db"there will be a column"found_source"with value"original_gene.org.Hs.egSYMBOL2EG".- protect_inline_sep
logicalindicating whether to protect inline characters insep, to prevent them from being used to split single values into multiple values. For example,"GENENAME"returns the full gene name, which often contains comma","characters. These commas do not separate multiple separate values, so they should not be used to split a string like"H4 clustered histone 10, pseudogene"into two strings"H4 clustered histone 10"and"pseudogene".- intermediate
characterstring with colname inxthat contains intermediate values. These values are expected from output of the first step in the workflow, for example"SYMBOL2EG"returns Entrez gene values, so if the inputxalready contains some of these values in a column, assign that colname tointermediate.- ignore.case
logicalindicating whether to use case-insensitive matching whenignore.case=TRUE, otherwise the defaultignore.case=FALSEwill perform defaultmget()which requires the upper and lowercase characters are an identical match. Whenignore.case=TRUEthis function callsgenejam::imget().- verbose
logicalindicating whether to print verbose output.
Value
data.frame with one or more columns indicating the input
data, then a column "intermediate" containing the Entrez gene ID
that was matched, then one column for each item in final,
by default "SYMBOL".
Details
This function takes a vector or data.frame of gene symbols,
and uses Bioconductor annotation methods to find the most current
official gene symbol.
The annotation process runs in two basic steps:
Convert the input gene to Entrez gene ID (
intermediate).Convert Entrez gene ID to official gene symbol (
final).
The intermediate depends upon the annotation used, but in the
default condition, most Bioconductor gene-centric annotation sources
are focused on ENTREZID as the primary key for integration.
During Step 1 above, values are successively queried to obtain
a successful conversion to intermediate, the specific logic for which
is controlled by handle_multiple. However, once the criteria
are met which produces valid intermediate, the value or values
in the intermediate are then used for Step 2 above.
Step 1. Convert to Entrez gene ID
The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:
"org.Hs.egSYMBOL2EG"– almost always 1-to-1 match"org.Hs.egACCNUM2EG"– mostly a 1-to-1 match"org.Hs.egALIAS2EG"– sometimes a 1-to-1 match, sometimes 1-to-many
When multiple Entrez gene ID values are matched, they are all
retained. See argument handle_multiple for custom options.
Step 2. Use Entrez gene ID to return official annotation
The second step converts the Entrez gene ID (or multiple IDs)
to the official gene symbol, by default using "org.Hs.egSYMBOL".
The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:
"org.Hs.egSYMBOL"– official Entrez gene symbol"org.Hs.egALIAS"– set of recognized aliases for an Entrez gene."org.Hs.egGENENAME"– official Entrez long gene name
For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.
Note that if the input data already contains Entrez gene ID
values, you can define that colname with argument intermediate.
Pre-existing intermediate values
When input data contains a column that matches argument intermediate,
the intention is for pre-existing values to be honored and maintained.
As a result, pre-existing non-empty values in intermediate are kept
without being updated. For example, in a data.frame that contains
colnames 'gene' and 'intermediate', and values are present in both
columns for a given row, the value in 'intermediate' is used without
also querying the value in 'gene' for that row.
Case-insensitive search
Case-insensitive search is particularly useful in non-human
organisms because they often use mixed-case gene symbols.
This support is enabled with the argument ignore.case=TRUE.
In our benchmark tests it appears to add roughly
0.1 seconds per annotation in try_list, regardless of the
number of input entries in x. It seems reasonable that the
cost may increase with annotation size, although in practice
the time dependency does not seem to be notably affected by
annotation size.
This time cost is related to the process of converting all keys()
in the annotation to lowercase, plus relatively minimal time to
handle scenarios where multiple keys in mixed case (capitalization)
are converted to the same lowercase key, for example 'CAL', 'Cal', 'cal'
are all recognized as 'cal', and all associated connections are combined.
This secondary step was added in version 0.0.18.900. Prior to that
version, only the first such matching lowercase result was used.
For the vast majority of gene-based queries involving human, this step is not necessary and produces identical results. The reason is that human gene symbols are entirely UPPERCASE.
See also
Other genejam:
freshenGenes2(),
freshenGenes3(),
get_anno_db(),
is_empty()
Examples
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
cat("\nBasic usage\n");
print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}
#>
#> Basic usage
#> input ENTREZID SYMBOL
#> 1 APOE 348 APOE
#> 2 CCN2 1490 CCN2
#> 3 CTGF 1490 CCN2
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
## Optionally show the annotation source matched
cat("\nOptionally show the annotation source matched\n");
print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}
#>
#> Optionally show the annotation source matched
#> input ENTREZID ENTREZID_source SYMBOL
#> 1 APOE 348 org.Hs.egSYMBOL2EG APOE
#> 2 CCN2 1490 org.Hs.egSYMBOL2EG CCN2
#> 3 CTGF 1490 org.Hs.egALIAS2EG CCN2
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
## Show comma-delimited genes
cat("\nInput genes are comma-delimited\n");
print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}
#>
#> Input genes are comma-delimited
#> input ENTREZID SYMBOL
#> 1 APOE 348 APOE
#> 2 CCN2 1490 CCN2
#> 3 CTGF 1490 CCN2
#> 4 CCN2,CTGF 1490 CCN2
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
## Optionally include more than SYMBOL in the output
cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
print(freshenGenes(c("APOE", "HIST1H1C"),
final=c("SYMBOL", "ALIAS", "GENENAME")));
}
#>
#> Custom output to include SYMBOL, ALIAS, GENENAME
#> input ENTREZID SYMBOL ALIAS
#> 1 APOE 348 APOE AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG
#> 2 HIST1H1C 3006 H1-2 H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C
#> GENENAME
#> 1 apolipoprotein E
#> 2 H1.2 linker histone, cluster member
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
## More advanced, match affymetrix probesets
if (requireNamespace("hgu133plus2.db", quietly=TRUE) &&
suppressPackageStartupMessages(require("hgu133plus2.db"))) {
cat("\nAdvanced example including Affymetrix probesets.\n");
affy_df <- freshenGenes(
c("227047_x_at", "APOE", "HIST1H1D", "NM_003166,U08032"),
include_source=TRUE,
try_list=c("hgu133plus2ENTREZID", "REFSEQ2EG",
"SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
final=c("SYMBOL", "GENENAME"))
print(affy_df)
}
}
#>
#>
#> Advanced example including Affymetrix probesets.
#> input ENTREZID ENTREZID_source SYMBOL
#> 1 227047_x_at 57659 hgu133plus2ENTREZID ZBTB4
#> 2 APOE 348 org.Hs.egSYMBOL2EG APOE
#> 3 HIST1H1D 3007 org.Hs.egALIAS2EG H1-3
#> 4 NM_003166,U08032 6818 org.Hs.egREFSEQ2EG SULT1A3
#> GENENAME
#> 1 zinc finger and BTB domain containing 4
#> 2 apolipoprotein E
#> 3 H1.3 linker histone, cluster member
#> 4 sulfotransferase family 1A member 3