Freshen gene annotations using Bioconductor annotation data

freshenGenes2(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL", "GENENAME"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  verbose = FALSE,
  ...
)

Arguments

x	character vector or `data.frame` with one or more columns containing gene symbols.
ann_lib	character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.
try_list	character vector indicating one or more names of annotations to use for the input gene symbols in `x`. The annotation should typically return the Entrez gene ID, usually indicated by `'2EG'` at the end of the name. For example `SYMBOL2EG` will be used with `ann_lib` `"org.Hs.eg.db"` to produce annotation name `"org.Hs.egSYMBOL2EG"`. Note that when the `'2EG'` form of the annotation (or the form with another suitable suffix, as defined by argument `"revmap_suffix"` in `get_anno_db()`) does not exist, it will be derived using `AnnotationDbi::revmap()`. For example if `"org.Hs.egALIAS"` is requested, but only `"org.Hs.egALIAS2EG"` is available, then `AnnotationDbi::revmap(org.Hs.egALIAS2EG)` is used to create the equivalent of `"org.Hs.egALIAS"`.
final	character vector to use for the final conversion step. When `final` is `NULL` no conversion is performed. When `final` contains multiple values, each value is returned in the output. For example, `final=c("SYMBOL","GENENAME")` will return a column `"SYMBOL"` and a column `"GENENAME"`.
split	character value used to separate delimited values in `x` by the function `base::strsplit()`. The default will split values separated by comma `,` semicolon `;` or forward slash `/`, and will trim whitespace before and after these delimiters.
sep	character value used to concatenate multiple entries in the same field. The default `sep=","` will comma-delimit multiple entries in the same field.
handle_multiple	character value indicating how to handle multiple values: `"first_hit"` will query each column of `x` until it finds the first possible returning match, and will ignore all subsequent possible matches for that row in `x`. For example, if one row in `x` contains multiple values, only the first match will be used. `"first_try"` will return the first match from `try_list` for all columns in `x` that contain a match. For example, if one row in `x` contains two values, the first match from `try_list` using one or both columns in `x` will be maintained. Subsequent entries in `try_list` will not be attempted for rows that already have a match. `"all"` will return all possible matches for all entries in `x` using all items in `try_list`.
empty_rule	character value indicating how to handle entries which did not have a match, and are therefore empty: `"original"` will use the original entry as the output field; `"empty"` will leave the entry blank.
include_source	logical indicating whether to include a column that shows the colname and source matched. For example, if column `"original_gene"` matched `"SYMBOL2EG"` in `"org.Hs.eg.db"` there will be a column `"found_source"` with value `"original_gene.org.Hs.egSYMBOL2EG"`.
protect_inline_sep	logical indicating whether to protect inline characters in `sep`, to prevent them from being used to split single values into multiple values. For example, `"GENENAME"` returns the full gene name, which often contains comma `","` characters. These commas do not delimit separate values, so they should not be used to split a string like `"H4 clustered histone 10, pseudogene"` into two strings `"H4 clustered histone 10"` and `"pseudogene"`.
intermediate	`character` string with the colname in `x` that contains intermediate values. These values are the expected output of the first step in the workflow. For example, `"SYMBOL2EG"` returns Entrez gene values, so if the input `x` already contains some of these values in a column, assign that colname to `intermediate`.
verbose	logical indicating whether to print verbose output.
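
The `split`, `sep`, and `try_list` descriptions above can be illustrated with a few base R calls. This is a minimal sketch and not the function's internal code: the input values are made up, and the rule used to build the annotation name (drop the trailing `.db`, then append the `try_list` entry) is an assumption inferred from the `"org.Hs.egSYMBOL2EG"` example.

```r
# Illustration only: base R calls that mimic the argument descriptions;
# this is not the package's internal implementation.

# Default `split` pattern: comma, semicolon, or forward slash,
# with optional spaces on either side.
split_pattern <- "[ ]*[,/;]+[ ]*"
x <- c("APOE, TP53", "HIST1H1C / H1-2;GAPDH")  # made-up input values
pieces <- strsplit(x, split_pattern)

# Default `sep` joins multiple entries that belong to the same field.
rejoined <- vapply(pieces, paste, character(1), collapse = ",")

# Annotation name from `ann_lib` plus one `try_list` entry
# (assumed rule: drop the trailing ".db", then append the entry).
ann_name <- paste0(sub("[.]db$", "", "org.Hs.eg.db"), "SYMBOL2EG")
```

Each element of `pieces` holds the individual values from one input entry, and `rejoined` shows the same values written back as a single comma-delimited field.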

Details

This function is a convenient extension of `freshenGenes()` that adds `"GENENAME"` to the default value of `final`, so that `final=c("SYMBOL", "GENENAME")`. It therefore returns two annotation columns by default: the gene symbol and the long gene name.
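
A typical call might look like the sketch below. It assumes that the function is exported by the `genejam` package and that the Bioconductor annotation package `org.Hs.eg.db` is installed; neither is stated on this page. The gene symbols are arbitrary examples, and the code is skipped if either package is missing.

```r
# Sketch only: assumes freshenGenes2() is exported by the genejam package
# and that org.Hs.eg.db is installed. The symbols are arbitrary examples.
x <- c("APOE", "HIST1H1C", "CCN2")

if (requireNamespace("genejam", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # Default arguments, including final = c("SYMBOL", "GENENAME")
  fresh <- genejam::freshenGenes2(x)
  print(fresh)

  # Keep the original entry when nothing matches, and add a column
  # reporting which input column and annotation produced each match.
  fresh_src <- genejam::freshenGenes2(
    x,
    empty_rule = "original",
    include_source = TRUE
  )
  print(fresh_src)
}
```

Setting `empty_rule = "original"` is useful when unmatched entries should pass through unchanged rather than become blank.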
