Freshen gene annotations using Bioconductor annotation data

freshenGenes3(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL", "GENENAME", "ALIAS"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  verbose = FALSE,
  ...
)

Arguments

x

character vector or data.frame with one or more columns containing gene symbols.

ann_lib

character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.

try_list

character vector indicating one or more names of annotations to use for the input gene symbols in x. The annotation should typically return the Entrez gene ID, usually indicated by '2EG' at the end of the name. For example, SYMBOL2EG is combined with ann_lib "org.Hs.eg.db" to produce the annotation name "org.Hs.egSYMBOL2EG". Note that when the '2EG' form of an annotation (or the form with another suitable suffix, as defined by argument "revmap_suffix" in get_anno_db()) does not exist, it is derived using AnnotationDbi::revmap(). The same applies in the other direction: for example, if "org.Hs.egALIAS" is requested but only "org.Hs.egALIAS2EG" is available, then AnnotationDbi::revmap(org.Hs.egALIAS2EG) is used to create the equivalent of "org.Hs.egALIAS".
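The naming convention described above can be sketched in base R. This only illustrates how an annotation name is assembled; it is not the function's internal code, and the helper name make_anno_name is hypothetical. The final step runs only when AnnotationDbi and org.Hs.eg.db are installed.

```r
# Hypothetical helper: build an annotation object name from a library
# name and a try_list entry, following the convention described above.
make_anno_name <- function(ann_lib, try_name) {
  # drop the trailing ".db", then append the annotation name
  paste0(sub("[.]db$", "", ann_lib), try_name)
}

anno_name <- make_anno_name("org.Hs.eg.db", "SYMBOL2EG")
anno_name
# "org.Hs.egSYMBOL2EG"

# Optional: when the annotation packages are installed, fetch a map and
# derive its reverse with AnnotationDbi::revmap().
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  alias2eg <- getExportedValue("org.Hs.eg.db", "org.Hs.egALIAS2EG")
  eg2alias <- AnnotationDbi::revmap(alias2eg)  # Entrez gene ID -> aliases
}
```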

final

character vector to use for the final conversion step. When final is NULL no conversion is performed. When final contains multiple values, each value is returned in the output. For example, final=c("SYMBOL","GENENAME") will return a column "SYMBOL" and a column "GENENAME".

split

character value used to separate delimited values in x by the function base::strsplit(). The default will split values separated by comma ",", semicolon ";", or forward slash "/", and will trim whitespace before and after these delimiters.
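As a quick check of what the default pattern does, the following base R snippet applies it to one delimited string. The input string is made up for illustration.

```r
# Default split pattern from the usage above
split <- "[ ]*[,/;]+[ ]*"

# One input field holding several delimited gene symbols
x <- "APOE, GAPDH / ACTB;TP53"

# strsplit() returns a list with one element per input string
parts <- strsplit(x, split)[[1]]
parts
# "APOE"  "GAPDH" "ACTB"  "TP53"
```

Whitespace around each delimiter is consumed by the pattern, so the resulting values need no further trimming.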

sep

character value used to concatenate multiple entries in the same field. The default sep="," will comma-delimit multiple entries in the same field.

handle_multiple

character value indicating how to handle multiple values: "first_hit" will query each column of x until it finds the first match, and will ignore all subsequent possible matches for that row in x. For example, if one row in x contains multiple values, only the first match will be used. "first_try" will return the first match from try_list for all columns in x that contain a match. For example, if one row in x contains two values, the first match from try_list using one or both columns in x will be kept. Subsequent entries in try_list will not be attempted for rows that already have a match. "all" will return all possible matches for all entries in x using all items in try_list.
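The difference between "first_try" and "all" can be illustrated with a toy model in base R. This is a simplified sketch for a single column of input, not the function's implementation: the helper resolve, the lookup tables, and the identifiers are all hypothetical placeholders.

```r
# Toy lookup tables standing in for two try_list entries.
# The identifiers are placeholders, not real Entrez gene IDs.
lookups <- list(
  SYMBOL2EG = list(GENEA = "id1"),
  ALIAS2EG  = list(GENEA = "id9", OLDB = "id2")
)

# Hypothetical helper: resolve each value in x under one of two rules
resolve <- function(x, lookups, handle_multiple = c("first_try", "all")) {
  handle_multiple <- match.arg(handle_multiple)
  lapply(stats::setNames(x, x), function(value) {
    hits <- character(0)
    for (tbl in lookups) {
      hit <- tbl[[value]]
      if (is.null(hit)) next
      hits <- c(hits, hit)
      # "first_try": stop at the first try_list entry that matches
      if (handle_multiple == "first_try") break
    }
    unique(hits)
  })
}

first_try <- resolve(c("GENEA", "OLDB"), lookups, "first_try")
all_hits  <- resolve(c("GENEA", "OLDB"), lookups, "all")
```

In this toy model, "GENEA" resolves to a single identifier under "first_try" because the second table is never consulted, while "all" collects the hits from both tables.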

empty_rule

character value indicating how to handle entries which did not have a match, and are therefore empty: "original" will use the original entry as the output field; "empty" will leave the entry blank.
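The two documented rules can be pictured with a short base R sketch. The vectors below are made-up stand-ins for the input and the lookup result, and the code only mimics the described behavior rather than reproducing the function's internals.

```r
original <- c("APOE", "not_a_gene")
matched  <- c("APOE", "")   # "" marks an entry with no match

# empty_rule = "empty": leave unmatched entries blank
out_empty <- matched

# empty_rule = "original": fall back to the input value
out_original <- ifelse(nzchar(matched), matched, original)
```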

include_source

logical indicating whether to include a column that shows the colname and source matched. For example, if column "original_gene" matched "SYMBOL2EG" in "org.Hs.eg.db" there will be a column "found_source" with value "original_gene.org.Hs.egSYMBOL2EG".

protect_inline_sep

logical indicating whether to protect inline characters in sep, to prevent them from being used to split single values into multiple values. For example, "GENENAME" returns the full gene name, which often contains comma "," characters. These commas do not delimit separate values, so they should not be used to split a string like "H4 clustered histone 10, pseudogene" into two strings "H4 clustered histone 10" and "pseudogene".
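The problem described above is easy to reproduce in base R. The second half of the snippet shows one possible way to protect inline commas with a placeholder; that mechanism and the placeholder string are assumptions for illustration, not necessarily what the function does internally.

```r
split <- "[ ]*[,/;]+[ ]*"
values <- c("H4 clustered histone 10, pseudogene", "second value")

# Unprotected: the inline comma breaks one gene name into two pieces
naive <- strsplit(values[1], split)[[1]]

# Assumed protection scheme: swap inline commas for a placeholder
# before joining, then restore them after splitting.
placeholder <- "!:!"
protected <- gsub(",", placeholder, values, fixed = TRUE)
field <- paste(protected, collapse = ",")          # join with sep = ","
pieces <- strsplit(field, ",", fixed = TRUE)[[1]]  # split on sep again
restored <- gsub(placeholder, ",", pieces, fixed = TRUE)
```

After the round trip, restored holds the same two values as the input, with the inline comma intact.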

intermediate

character string with colname in x that contains intermediate values. These values are the expected output of the first step in the workflow. For example, "SYMBOL2EG" returns Entrez gene values, so if the input x already contains some of these values in a column, assign that colname to intermediate.

verbose

logical indicating whether to print verbose output.

Details

This function is a convenient extension of freshenGenes() that adds GENENAME and ALIAS to the default value of final, giving final=c("SYMBOL", "GENENAME", "ALIAS"). It therefore returns three annotation columns by default: the gene symbol, the long gene name, and the common gene aliases. The gene aliases often include numerous previous gene symbols attributed to the gene.
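A minimal usage sketch follows, based only on the arguments documented above. It assumes the package that provides freshenGenes3() is already loaded and that org.Hs.eg.db is installed; the guard skips the calls otherwise. The input symbols are arbitrary examples, and no output is shown because it depends on the installed annotation version.

```r
genes <- c("APOE", "GAPDH")

# Run only when the function and the annotation package are available
if (exists("freshenGenes3") &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # Default: SYMBOL, GENENAME, and ALIAS columns
  res3 <- freshenGenes3(genes, ann_lib = "org.Hs.eg.db")

  # Override final to return only the symbol and the long gene name
  res2 <- freshenGenes3(genes,
    ann_lib = "org.Hs.eg.db",
    final = c("SYMBOL", "GENENAME"))
}
```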

See also