Freshen gene annotations using Bioconductor annotation data
Source: R/genejam-convenience.R (freshenGenes2.Rd)
Usage
freshenGenes2(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL", "GENENAME"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "ENTREZID",
  verbose = FALSE,
  ...
)
Arguments
- x
character vector or data.frame with one or more columns containing gene symbols.
- ann_lib
character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.
- try_list
character vector indicating one or more names of annotations to use for the input gene symbols in x. The annotation should typically return the Entrez gene ID, usually indicated by '2EG' at the end of the name. For example, SYMBOL2EG will be used with ann_lib "org.Hs.eg.db" to produce the annotation name "org.Hs.egSYMBOL2EG". Note that when the '2EG' form of annotation does not exist (or another suitable suffix defined in argument "revmap_suffix" in get_anno_db()), it will be derived using AnnotationDbi::revmap(). For example, if "org.Hs.egALIAS" is requested but only "org.Hs.egALIAS2EG" is available, then AnnotationDbi::revmap(org.Hs.egALIAS2EG) is used to create the equivalent of "org.Hs.egALIAS".
- final
character vector to use for the final conversion step. When final is NULL, no conversion is performed. When final contains multiple values, each value is returned in the output. For example, final=c("SYMBOL","GENENAME") will return a column "SYMBOL" and a column "GENENAME".
- split
character value used to separate delimited values in x by the function base::strsplit(). The default will split values separated by comma ",", semicolon ";", or forward slash "/", and will trim whitespace before and after these delimiters.
- sep
character value used to concatenate multiple entries in the same field. The default sep="," will comma-delimit multiple entries in the same field.
- handle_multiple
character value indicating how to handle multiple values:
  "first_hit" will query each column of x until it finds the first possible returning match, and will ignore all subsequent possible matches for that row in x. For example, if one row in x contains multiple values, only the first match will be used.
  "first_try" will return the first match from try_list for all columns in x that contain a match. For example, if one row in x contains two values, the first match from try_list using one or both columns in x will be maintained. Subsequent entries in try_list will not be attempted for rows that already have a match.
  "all" will return all possible matches for all entries in x using all items in try_list.
- empty_rule
character value indicating how to handle entries which did not have a match, and are therefore empty: "original" will use the original entry as the output field; "empty" will leave the entry blank.
- include_source
logical indicating whether to include a column that shows the colname and source matched. For example, if column "original_gene" matched "SYMBOL2EG" in "org.Hs.eg.db" there will be a column "found_source" with value "original_gene.org.Hs.egSYMBOL2EG".
- protect_inline_sep
logical indicating whether to protect inline characters in sep, to prevent them from being used to split single values into multiple values. For example, "GENENAME" returns the full gene name, which often contains comma "," characters. These commas do not delimit separate values, so they should not be used to split a string like "H4 clustered histone 10, pseudogene" into two strings "H4 clustered histone 10" and "pseudogene".
- intermediate
character string with the colname in x that contains intermediate values. These values are expected from the output of the first step in the workflow. For example, "SYMBOL2EG" returns Entrez gene values, so if the input x already contains some of these values in a column, assign that colname to intermediate.
- verbose
logical indicating whether to print verbose output.
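The default split pattern and the annotation naming described for try_list can be checked in base R without any annotation package. The following is only a sketch: it calls base::strsplit() directly and builds the names with paste0(), the input values and variable names are illustrative, and it is not the package's internal code.

```r
# Default `split` pattern from the Usage section: comma, semicolon, or
# forward slash, with optional surrounding spaces.
split_pattern <- "[ ]*[,/;]+[ ]*"

# Illustrative input values, each holding one or more delimited entries.
x <- c("APOE, TP53", "H2AX/H2AFX", "GAPDH ; ACTB", "BRCA1")
split_values <- strsplit(x, split_pattern)

# The same pattern also splits a long gene name at its inline comma, which is
# the situation protect_inline_sep=TRUE is documented to guard against.
gene_name <- "H4 clustered histone 10, pseudogene"
split_name <- strsplit(gene_name, split_pattern)[[1]]

# Annotation names as described for try_list: the ann_lib name without its
# ".db" suffix, followed by each try_list entry (illustrative construction).
ann_lib <- "org.Hs.eg.db"
try_list <- c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG")
anno_names <- paste0(sub("[.]db$", "", ann_lib), try_list)

print(split_values)
print(split_name)
print(anno_names)
```

Splitting the gene name here yields two pieces, which shows why values returned by "GENENAME" need protection before any later split on sep.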
Details
This function is a convenience extension of freshenGenes()
that adds "GENENAME" to the default value of final,
so the default becomes final=c("SYMBOL", "GENENAME").
It therefore returns two annotation columns by default:
the gene symbol and the long gene name.
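A call that relies on these defaults might look like the following. This is a sketch that assumes the genejam and org.Hs.eg.db packages are installed. The input symbols are illustrative, and because the result depends on the installed annotation version, no expected output is shown.

```r
# Illustrative input; "APOE,TP53" is a delimited entry handled by `split`.
genes <- c("APOE", "TP53", "APOE,TP53")

# Run only when both packages are available (assumption: installed locally).
if (requireNamespace("genejam", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  fresh <- genejam::freshenGenes2(
    genes,
    ann_lib = "org.Hs.eg.db",
    empty_rule = "original",  # keep the input value when nothing matches
    include_source = TRUE     # request the "found_source" column
  )
  print(fresh)
}
```

Passing final=NULL instead would skip the final conversion step, as described under Arguments.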
See also
Other genejam:
freshenGenes(),
freshenGenes3(),
get_anno_db(),
is_empty()