Freshen gene annotations using Bioconductor annotation data

freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

Arguments

x	character vector or `data.frame` with one or more columns containing gene symbols.
ann_lib	character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.
try_list	character vector indicating one or more names of annotations to use for the input gene symbols in `x`. The annotation should typically return the Entrez gene ID, usually indicated by `'2EG'` at the end of the name. For example `SYMBOL2EG` will be used with ann_lib `"org.Hs.eg.db"` to produce annotation name `"org.Hs.egSYMBOL2EG"`. Note that when the `'2EG'` form of annotation (or the form with another suitable suffix defined in argument `"revmap_suffix"` in `get_anno_db()`) does not exist, it will be derived using `AnnotationDbi::revmap()`. For example if `"org.Hs.egALIAS"` is requested, but only `"org.Hs.egALIAS2EG"` is available, then `AnnotationDbi::revmap(org.Hs.egALIAS2EG)` is used to create the equivalent of `"org.Hs.egALIAS"`.
final	character vector to use for the final conversion step. When `final` is `NULL` no conversion is performed. When `final` contains multiple values, each value is returned in the output. For example, `final=c("SYMBOL","GENENAME")` will return a column `"SYMBOL"` and a column `"GENENAME"`.
split	character value used to separate delimited values in `x` by the function `base::strsplit()`. The default will split values separated by comma `,` semicolon `;` or forward slash `/`, and will trim whitespace before and after these delimiters.
sep	character value used to concatenate multiple entries in the same field. The default `sep=","` will comma-delimit multiple entries in the same field.
handle_multiple	character value indicating how to handle multiple values:
`"first_hit"` -- query each column of `x` until the first match is found, and ignore all subsequent possible matches for that row in `x`. For example, if one row in `x` contains multiple values, only the first match will be used.
`"first_try"` -- return the first match from `try_list` for all columns in `x` that contain a match. For example, if one row in `x` contains two values, the first match from `try_list` using one or both columns in `x` will be maintained. Subsequent entries in `try_list` will not be attempted for rows that already have a match.
`"all"` -- return all possible matches for all entries in `x` using all items in `try_list`.
empty_rule	character value indicating how to handle entries which did not have a match, and are therefore empty: `"original"` will use the original entry as the output field; `"empty"` will leave the entry blank.
include_source	logical indicating whether to include a column that shows the colname and source matched. For example, if column `"original_gene"` matched `"SYMBOL2EG"` in `"org.Hs.eg.db"` there will be a column `"found_source"` with value `"original_gene.org.Hs.egSYMBOL2EG"`.
protect_inline_sep	logical indicating whether to protect inline characters in `sep`, to prevent them from being used to split single values into multiple values. For example, `"GENENAME"` returns the full gene name, which often contains comma `","` characters. These commas do not separate multiple separate values, so they should not be used to split a string like `"H4 clustered histone 10, pseudogene"` into two strings `"H4 clustered histone 10"` and `"pseudogene"`.
intermediate	`character` string with colname in `x` that contains intermediate values. These values are expected from output of the first step in the workflow, for example `"SYMBOL2EG"` returns Entrez gene values, so if the input `x` already contains some of these values in a column, assign that colname to `intermediate`.
ignore.case	`logical` indicating whether to use case-insensitive matching. The default `ignore.case=FALSE` uses `mget()`, which requires an exact match of uppercase and lowercase characters. When `ignore.case=TRUE` this function calls `genejam::imget()`.
verbose	logical indicating whether to print verbose output.
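
To make the default `split` pattern concrete, here is a minimal base R sketch that applies it with `base::strsplit()`. The input strings are illustrative only. The second call shows the situation that `protect_inline_sep` is meant to guard against; it does not reproduce how `freshenGenes()` implements that protection.

```r
# Default delimiter pattern from the `split` argument
split_pattern <- "[ ]*[,/;]+[ ]*"

# Delimited gene symbols are separated, and whitespace around
# each delimiter is consumed by the pattern
gene_fields <- strsplit("CCN2 , CTGF/APOE; HIST1H1C", split_pattern)[[1]]
print(gene_fields)

# A long gene name that contains a comma is split the same way,
# which is the case protect_inline_sep=TRUE is meant to prevent
name_fields <- strsplit("H4 clustered histone 10, pseudogene", split_pattern)[[1]]
print(name_fields)
```

Note that the pattern only removes whitespace adjacent to a delimiter, so leading or trailing whitespace on the whole string is not affected by it.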

Value

data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".

Details

This function takes a vector or data.frame of gene symbols, and uses Bioconductor annotation methods to find the most current official gene symbol.

The annotation process runs in two basic steps:

1. Convert the input gene to Entrez gene ID.
2. Convert Entrez gene ID to official gene symbol.

Step 1. Convert to Entrez gene ID

The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:

"org.Hs.egSYMBOL2EG" -- almost always 1-to-1 match
"org.Hs.egACCNUM2EG" -- mostly a 1-to-1 match
"org.Hs.egALIAS2EG" -- sometimes a 1-to-1 match, sometimes 1-to-many

When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.
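
The ordered lookup in step 1 can be sketched in base R with toy tables. Everything here is an assumption made for illustration: `toy_maps`, `lookup_in_order()` and the input `"UNKNOWN1"` are hypothetical, the named vectors stand in for the Bioconductor annotation objects, and the loop only shows the idea of trying annotations in order and skipping entries that already matched. It is not the code used by `freshenGenes()` and it does not implement the `handle_multiple` options.

```r
# Toy lookup tables standing in for the Bioconductor annotations;
# names are input values, values are Entrez gene IDs (hypothetical data)
toy_maps <- list(
  SYMBOL2EG = c(APOE = "348", CCN2 = "1490"),
  ALIAS2EG  = c(CTGF = "1490", CCN2 = "1490")
)

# Try each table in order, only for inputs that have no match yet
lookup_in_order <- function(genes, maps) {
  entrez <- rep(NA_character_, length(genes))
  src <- rep(NA_character_, length(genes))
  for (map_name in names(maps)) {
    todo <- which(is.na(entrez))
    # indexing a named vector by a missing name returns NA
    found <- unname(maps[[map_name]][genes[todo]])
    matched <- !is.na(found)
    entrez[todo[matched]] <- found[matched]
    src[todo[matched]] <- map_name
  }
  data.frame(input = genes, intermediate = entrez,
    intermediate_source = src, stringsAsFactors = FALSE)
}

toy_result <- lookup_in_order(c("APOE", "CCN2", "CTGF", "UNKNOWN1"), toy_maps)
print(toy_result)
```

In this sketch `"CTGF"` is only found in the alias table, so its source is recorded as `"ALIAS2EG"`, while `"CCN2"` is resolved by the symbol table and the alias table is never consulted for it.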

Step 2. Use Entrez gene ID to return official annotation

The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".

The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:

"org.Hs.egSYMBOL" -- official Entrez gene symbol
"org.Hs.egALIAS" -- set of recognized aliases for an Entrez gene.
"org.Hs.egGENENAME" -- official Entrez long gene name

For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.
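
As a rough illustration of step 2, the sketch below converts Entrez gene IDs to several annotation types using toy tables, and concatenates multiple values with `sep`. The tables `toy_final` and the helper `convert_final()` are hypothetical stand-ins for `"org.Hs.egSYMBOL"`, `"org.Hs.egALIAS"` and `"org.Hs.egGENENAME"`; the values are copied from the example output on this page, and the code is not the implementation used by `freshenGenes()`.

```r
# Toy annotation tables keyed by Entrez gene ID (hypothetical data)
toy_final <- list(
  SYMBOL = list("348" = "APOE", "3006" = "H1-2"),
  ALIAS = list(
    "348" = c("AD2", "APO-E", "ApoE4", "APOE", "LDLCQ5", "LPG"),
    "3006" = c("H1-2", "H1.2", "H1C", "H1F2", "H1s-1", "HIST1H1C")),
  GENENAME = list(
    "348" = "apolipoprotein E",
    "3006" = "H1.2 linker histone, cluster member")
)

# Produce one output column per annotation type; multiple values
# for one gene are concatenated with `sep`, missing IDs become ""
convert_final <- function(entrez, final_maps, sep = ",") {
  out <- lapply(final_maps, function(m) {
    vapply(entrez, function(id) {
      vals <- m[[id]]
      if (is.null(vals)) "" else paste(vals, collapse = sep)
    }, character(1), USE.NAMES = FALSE)
  })
  data.frame(intermediate = entrez, out, stringsAsFactors = FALSE)
}

final_result <- convert_final(c("348", "3006"), toy_final)
print(final_result)
```

The `GENENAME` value for `"3006"` contains a comma that is part of the name rather than a delimiter, which is why a column like this should not be split again on `sep`.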

Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.

Case-insensitive search

For case-insensitive search, which is particularly useful for non-human organisms because their gene symbols often use mixed case, use the argument ignore.case=TRUE. In our benchmark tests it appears to add roughly 0.1 seconds per annotation, regardless of the number of input entries. This appears to be the time it takes to spool the list of annotation keys stored in the SQLite database, and may therefore depend on the size of the annotation file.
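
The idea behind case-insensitive matching can be sketched in base R as follows. This is only an illustration under stated assumptions: `toy_keys` and `match_ignore_case()` are hypothetical, and the code is not the implementation of `genejam::imget()`. It does show why the full list of keys is needed up front, since every key has to be case-normalized before the input can be compared to it.

```r
# Hypothetical annotation keys in mixed case, as in mouse gene symbols
toy_keys <- c("Apoe", "Ccn2", "Trp53")

# Compare input and keys after converting both to uppercase,
# then return each key as it is actually stored
match_ignore_case <- function(x, keys) {
  keys[match(toupper(x), toupper(keys))]
}

matched_keys <- match_ignore_case(c("APOE", "ccn2", "Xyz"), toy_keys)
print(matched_keys)
```

Inputs without a matching key, such as `"Xyz"` here, are returned as `NA`.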

Examples

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   cat("\nBasic usage\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}
#> 
#> Basic usage
#>   input intermediate SYMBOL
#> 1  APOE          348   APOE
#> 2  CCN2         1490   CCN2
#> 3  CTGF         1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally show the annotation source matched
   cat("\nOptionally show the annotation source matched\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}
#> 
#> Optionally show the annotation source matched
#>   input intermediate intermediate_source SYMBOL
#> 1  APOE          348  org.Hs.egSYMBOL2EG   APOE
#> 2  CCN2         1490  org.Hs.egSYMBOL2EG   CCN2
#> 3  CTGF         1490   org.Hs.egALIAS2EG   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Show comma-delimited genes
   cat("\nInput genes are comma-delimited\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}
#> 
#> Input genes are comma-delimited
#>   input_v1 input_v2 intermediate SYMBOL
#> 1     APOE                   348   APOE
#> 2     CCN2                  1490   CCN2
#> 3     CTGF                  1490   CCN2
#> 4     CCN2     CTGF         1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally include more than SYMBOL in the output
   cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
   print(freshenGenes(c("APOE", "HIST1H1C"),
      final=c("SYMBOL", "ALIAS", "GENENAME")));
}
#> 
#> Custom output to include SYMBOL, ALIAS, GENENAME
#>      input intermediate SYMBOL                             ALIAS
#> 1     APOE          348   APOE   AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG
#> 2 HIST1H1C         3006   H1-2 H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C
#>                              GENENAME
#> 1                    apolipoprotein E
#> 2 H1.2 linker histone, cluster member

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## More advanced, match Affymetrix probesets
   if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
      cat("\nAdvanced example including Affymetrix probesets.\n");
      print(freshenGenes(c("227047_x_at","APOE","HIST1H1D","NM_003166,U08032"),
         include_source=TRUE,
         try_list=c("hgu133plus2ENTREZID","REFSEQ2EG","SYMBOL2EG","ACCNUM2EG","ALIAS2EG"),
         final=c("SYMBOL","GENENAME")))
   }
}
#> 
#> Advanced example including Affymetrix probesets.
#>      input_v1 input_v2 intermediate intermediate_source  SYMBOL
#> 1 227047_x_at                 57659 hgu133plus2ENTREZID   ZBTB4
#> 2        APOE                   348  org.Hs.egSYMBOL2EG    APOE
#> 3    HIST1H1D                  3007   org.Hs.egALIAS2EG    H1-3
#> 4   NM_003166   U08032         6818  org.Hs.egREFSEQ2EG SULT1A3
#>                                  GENENAME
#> 1 zinc finger and BTB domain containing 4
#> 2                        apolipoprotein E
#> 3     H1.3 linker histone, cluster member
#> 4     sulfotransferase family 1A member 3