Freshen gene annotations using Bioconductor annotation data
freshenGenes2(
x,
ann_lib = c("", "org.Hs.eg.db"),
try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
final = c("SYMBOL", "GENENAME"),
split = "[ ]*[,/;]+[ ]*",
sep = ",",
handle_multiple = c("first_try", "first_hit", "all", "best_each"),
empty_rule = c("empty", "original", "na"),
include_source = FALSE,
protect_inline_sep = TRUE,
intermediate = "intermediate",
verbose = FALSE,
...
)
Arguments
x |
character vector or data.frame with one or more columns
containing gene symbols. |
ann_lib |
character vector indicating the name or names of the
Bioconductor annotation library to use when looking up
gene nomenclature. |
try_list |
character vector indicating one or more names of
annotations to use for the input gene symbols in x. The
annotation should typically return the Entrez gene ID, usually
indicated by '2EG' at the end of the name. For example, SYMBOL2EG
is used with ann_lib "org.Hs.eg.db" to produce the annotation
name "org.Hs.egSYMBOL2EG". Note that when the '2EG' form of an
annotation does not exist (or the form with another suitable suffix
defined by argument "revmap_suffix" in get_anno_db()), it is derived
using AnnotationDbi::revmap(). For example, if "org.Hs.egALIAS"
is requested but only "org.Hs.egALIAS2EG" is available, then
AnnotationDbi::revmap(org.Hs.egALIAS2EG) is used to create the
equivalent of "org.Hs.egALIAS". |
final |
character vector to use for the final conversion
step. When final is NULL, no conversion is performed.
When final contains multiple values, each value is returned
in the output. For example, final=c("SYMBOL","GENENAME") will
return a column "SYMBOL" and a column "GENENAME". |
split |
character value used to separate delimited values in x
by the function base::strsplit(). The default will split values
separated by comma ",", semicolon ";", or forward slash "/", and will
trim whitespace before and after these delimiters. |
sep |
character value used to concatenate multiple entries in
the same field. The default sep="," will comma-delimit multiple
entries in the same field. |
handle_multiple |
character value indicating how to handle multiple
values: "first_hit" will query each column of x until it finds the
first match, and will ignore all subsequent possible
matches for that row in x. For example, if one row in x contains
multiple values, only the first match will be used. "first_try"
will return the first match from try_list for all columns in x
that contain a match. For example, if one row in x contains two
values, the first match from try_list using one or both columns in
x will be maintained. Subsequent entries in try_list will not be
attempted for rows that already have a match. "all" will return all
possible matches for all entries in x using all items in try_list. |
empty_rule |
character value indicating how to handle entries which
did not have a match, and are therefore empty: "original" will use
the original entry as the output field; "empty" will leave the
entry blank. |
include_source |
logical indicating whether to include a column
that shows the colname and source matched. For example, if column
"original_gene" matched "SYMBOL2EG" in "org.Hs.eg.db", there
will be a column "found_source" with value
"original_gene.org.Hs.egSYMBOL2EG". |
protect_inline_sep |
logical indicating whether to
protect inline characters in sep, to prevent them from
being used to split single values into multiple values.
For example, "GENENAME" returns the full gene name, which
often contains comma "," characters. These commas do
not delimit multiple values, so they should not be
used to split a string like "H4 clustered histone 10, pseudogene"
into two strings "H4 clustered histone 10" and "pseudogene". |
intermediate |
character string with the colname in x that
contains intermediate values. These values are expected to match the
output of the first step in the workflow. For example, "SYMBOL2EG"
returns Entrez gene values, so if the input x already contains
some of these values in a column, assign that colname to
intermediate. |
verbose |
logical indicating whether to print verbose output. |
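The split, sep, and protect_inline_sep descriptions above can be illustrated with base R alone. The sketch below is not the function's internal code; it only applies the default split pattern with base::strsplit() and re-joins with the default sep. The input strings are hypothetical, apart from the gene name quoted in the protect_inline_sep description.

```r
# Default 'split' pattern and 'sep' value from the usage section.
split_pattern <- "[ ]*[,/;]+[ ]*"
sep <- ","

# Hypothetical entry with several delimited gene symbols.
entry <- "APOE, HIST1H1C; GAPDH / ACTB"

# Split on comma, semicolon, or forward slash, trimming flanking spaces.
tokens <- strsplit(entry, split_pattern)[[1]]

# Re-join multiple values in one field using 'sep'.
joined <- paste(tokens, collapse = sep)

# A full gene name contains an inline comma that is not a delimiter.
# Splitting it naively breaks one value into two, which is the
# situation protect_inline_sep=TRUE is described as preventing.
gene_name <- "H4 clustered histone 10, pseudogene"
name_parts <- strsplit(gene_name, split_pattern)[[1]]

print(tokens)
print(joined)
print(name_parts)
```

The last call shows the failure mode only; how freshenGenes2() protects inline separators internally is not shown here.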
Details
This function is a convenient extension of freshenGenes()
that adds GENENAME to the default value for
final=c("SYMBOL", "GENENAME").
It therefore returns two annotation columns by default:
the gene symbol and the long gene name.
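A minimal usage sketch follows, assuming the package that provides freshenGenes2() is already attached and that the Bioconductor annotation package "org.Hs.eg.db" is installed. The input data.frame and its gene symbols are hypothetical; the column name "original_gene" is borrowed from the include_source description. The call is guarded so the snippet does nothing when either requirement is missing.

```r
# Hypothetical input: a data.frame with one column of gene symbols.
genes <- data.frame(
  original_gene = c("APOE", "HIST1H1C", "GAPDH"),
  stringsAsFactors = FALSE
)

# Assumption: freshenGenes2() is available on the search path, and
# the annotation library named in ann_lib is installed.
have_fn <- exists("freshenGenes2", mode = "function")
have_db <- requireNamespace("org.Hs.eg.db", quietly = TRUE)

if (have_fn && have_db) {
  # With the default final=c("SYMBOL", "GENENAME"), the result is
  # expected to include a "SYMBOL" and a "GENENAME" column.
  fresh <- freshenGenes2(genes, ann_lib = "org.Hs.eg.db")
  print(fresh)
}
```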
See also