Freshen gene annotations using Bioconductor annotation data

freshenGenes3(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL", "GENENAME", "ALIAS"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  verbose = FALSE,
  ...
)

Arguments

x	character vector or `data.frame` with one or more columns containing gene symbols.
ann_lib	character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.
try_list	character vector indicating one or more names of annotations to use for the input gene symbols in `x`. The annotation should typically return the Entrez gene ID, usually given by `'2EG'` at the end of the name. For example `SYMBOL2EG` will be used with ann_lib `"org.Hs.eg.db"` to produce annotation name `"org.Hs.egSYMBOL2EG"`. Note that when the `'2EG'` form of annotation does not exist (or another suitable suffix defined in argument `"revmap_suffix"` in `get_anno_db()`), it will be derived using `AnnotationDbi::revmap()`. For example if `"org.Hs.egALIAS"` is requested, but only `"org.Hs.egALIAS2EG"` is available, then `AnnotationDbi::revmap(org.Hs.egALIAS2EG)` is used to create the equivalent of `"org.Hs.egALIAS"`.
final	character vector to use for the final conversion step. When `final` is `NULL` no conversion is performed. When `final` contains multiple values, each value is returned in the output. For example, `final=c("SYMBOL","GENENAME")` will return a column `"SYMBOL"` and a column `"GENENAME"`.
split	character value used to separate delimited values in `x` by the function `base::strsplit()`. The default will split values separated by comma `,` semicolon `;` or forward slash `/`, and will trim whitespace before and after these delimiters.
sep	character value used to concatenate multiple entries in the same field. The default `sep=","` will comma-delimit multiple entries in the same field.
handle_multiple	character value indicating how to handle multiple values: `"first_hit"` will query each column of `x` until it finds the first match, and will ignore all subsequent possible matches for that row in `x`. For example, if one row in `x` contains multiple values, only the first match will be used. `"first_try"` will return the first match from `try_list` for all columns in `x` that contain a match. For example, if one row in `x` contains two values, the first entry in `try_list` that matches one or both columns in `x` will be kept, and subsequent entries in `try_list` will not be attempted for rows that already have a match. `"all"` will return all possible matches for all entries in `x` using all items in `try_list`.
empty_rule	character value indicating how to handle entries which did not have a match, and are therefore empty: `"original"` will use the original entry as the output field; `"empty"` will leave the entry blank.
include_source	logical indicating whether to include a column that shows the colname and source matched. For example, if column `"original_gene"` matched `"SYMBOL2EG"` in `"org.Hs.eg.db"` there will be a column `"found_source"` with value `"original_gene.org.Hs.egSYMBOL2EG"`.
protect_inline_sep	logical indicating whether to protect inline characters in `sep`, to prevent them from being used to split single values into multiple values. For example, `"GENENAME"` returns the full gene name, which often contains comma `","` characters. These commas do not delimit distinct values, so they should not be used to split a string like `"H4 clustered histone 10, pseudogene"` into two strings `"H4 clustered histone 10"` and `"pseudogene"`.
intermediate	`character` string with colname in `x` that contains intermediate values. These values are expected from output of the first step in the workflow, for example `"SYMBOL2EG"` returns Entrez gene values, so if the input `x` already contains some of these values in a column, assign that colname to `intermediate`.
verbose	logical indicating whether to print verbose output.
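
The `split`, `sep`, and `protect_inline_sep` arguments can be illustrated with base R alone. The sketch below applies the default `split` pattern with `base::strsplit()` and re-joins the pieces with the default `sep`. It does not call `freshenGenes3()` itself, and the input values are made up for illustration. The last step shows why a gene name containing a comma needs protection: the same pattern would otherwise break it into two values.

```r
# Default `split` pattern and `sep` value, copied from the usage above
split_pattern <- "[ ]*[,/;]+[ ]*"
sep <- ","

# Hypothetical delimited input values
x <- c("APOE / APOA1", "HIST1H1C;H1-2", "GAPDH")

# Split each entry on comma, semicolon, or forward slash,
# trimming spaces on either side of the delimiter
pieces <- strsplit(x, split_pattern)

# Re-join multiple entries in the same field using `sep`
joined <- vapply(pieces, paste, character(1), collapse = sep)

# A full gene name with an inline comma is split into two strings
# when it is not protected
unprotected <- strsplit("H4 clustered histone 10, pseudogene",
   split_pattern)[[1]]
```

Here `pieces[[1]]` holds the two symbols from the first entry, and `unprotected` has length two, which is the outcome `protect_inline_sep=TRUE` is meant to prevent.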

Details

This function is a convenient extension of `freshenGenes()` that adds `"GENENAME"` and `"ALIAS"` to the default value of `final`, giving `final=c("SYMBOL", "GENENAME", "ALIAS")`. It therefore returns three annotation columns by default: the gene symbol, the long gene name, and the common gene aliases. The gene aliases often include numerous previous gene symbols attributed to the gene.
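
A minimal usage sketch follows. It assumes that `freshenGenes3()` is exported by a package named `genejam` and that the `org.Hs.eg.db` annotation package is installed. Neither package name is stated on this page, so both are assumptions, and the call is skipped when either is missing. The input symbols are illustrative only.

```r
# Illustrative input symbols (hypothetical example data)
genes <- c("APOE", "CCN2", "CTGF")

# Assumption: freshenGenes3() lives in the 'genejam' package and
# human annotations come from 'org.Hs.eg.db'
if (requireNamespace("genejam", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
   fresh <- genejam::freshenGenes3(genes,
      ann_lib = "org.Hs.eg.db")
   # With the default `final`, the result is expected to include
   # columns "SYMBOL", "GENENAME", and "ALIAS"
   print(fresh)
}
```

Passing a shorter `final`, for example `final=c("SYMBOL", "GENENAME")`, should return only those columns, as described for the `final` argument.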
