Skip to contents

Freshen gene annotations using Bioconductor annotation data

Usage

freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  revert_split = TRUE,
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "ENTREZID",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

Arguments

x

character vector or data.frame with one or most columns containing gene symbols.

ann_lib

character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.

try_list

character vector indicating one or more names of annotations to use for the input gene symbols in x. The annotation should typically return the Entrez gene ID, usually given by '2EG' at the end of the name. For example SYMBOL2EG will be used with ann_lib "org.Hs.eg.db" to produce annotation name "org.Hs.egSYMBOL2EG". Note that when the '2EG' form of annotation does not exist (or another suitable suffix defined in argument "revmap_suffix" in get_anno_db()), it will be derived using AnnotationDbi::revmap(). For example if "org.Hs.egALIAS" is requested, but only "org.Hs.egALIAS2EG" is available, then AnnotationDbi::revmap(org.Hs.egALIAS2EG) is used to create the equivalent of "org.Hs.egALIAS".

final

character vector to use for the final conversion step. When final is NULL no conversion is performed. When final contains multiple values, each value is returned in the output. For example, final=c("SYMBOL","GENENAME") will return a column "SYMBOL" and a column "GENENAME".

split

character value used to separate delimited values in x by the function base::strsplit(). The default will split values separated by comma , semicolon ; or forward slash /, and will trim whitespace before and after these delimiters.

revert_split

logical default TRUE, whether to revert the internal split with argument split, which causes columns with delimited values to be split into single value per column. When revert_split=TRUE the columns are re-combined to match the original data. Note that values are combined using sep as delimiter, which may not be identical to the input data.

sep

character value used to concatenate multiple entries in the same field. The default sep="," will comma-delimit multiple entries in the same field.

handle_multiple

character value indicating how to handle multiple values:

  • "first_hit" will query each column of x until it finds the first possible returning match, and will ignore all subsequent possible matches for that row in x. For example, if one row in x contains multiple values, only the first match will be used.

  • "first_try" will return the first match from try_list for all columns in x that contain a match. For example, if one row in x contains two values, the first match from try_list using one or both columns in x will be maintained. Subsequent entries in try_list will not be attempted for rows that already have a match.

  • "all" will return all possible matches for all entries in x using all items in try_list.

empty_rule

character value indicating how to handle entries which did not have a match, and are therefore empty: "original" will use the original entry as the output field; "empty" will leave the entry blank.

include_source

logical indicating whether to include a column that shows the colname and source matched. For example, if column "original_gene" matched "SYMBOL2EG" in "org.Hs.eg.db" there will be a column "found_source" with value "original_gene.org.Hs.egSYMBOL2EG".

protect_inline_sep

logical indicating whether to protect inline characters in sep, to prevent them from being used to split single values into multiple values. For example, "GENENAME" returns the full gene name, which often contains comma "," characters. These commas do not separate multiple separate values, so they should not be used to split a string like "H4 clustered histone 10, pseudogene" into two strings "H4 clustered histone 10" and "pseudogene".

intermediate

character string with colname in x that contains intermediate values. These values are expected from output of the first step in the workflow, for example "SYMBOL2EG" returns Entrez gene values, so if the input x already contains some of these values in a column, assign that colname to intermediate.

ignore.case

logical indicating whether to use case-insensitive matching when ignore.case=TRUE, otherwise the default ignore.case=FALSE will perform default mget() which requires the upper and lowercase characters are an identical match. When ignore.case=TRUE this function calls genejam::imget().

verbose

logical indicating whether to print verbose output.

Value

data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".

Details

This function takes a vector or data.frame of gene symbols, and uses Bioconductor annotation methods to find the most current official gene symbol.

The annotation process runs in two basic steps:

  1. Convert the input gene to Entrez gene ID (intermediate).

  2. Convert Entrez gene ID to official gene symbol (final).

The intermediate depends upon the annotation used, but in the default condition, most Bioconductor gene-centric annotation sources are focused on ENTREZID as the primary key for integration.

During Step 1 above, values are successively queried to obtain a successful conversion to intermediate, the specific logic for which is controlled by handle_multiple. However, once the criteria are met which produces valid intermediate, the value or values in the intermediate are then used for Step 2 above.

Step 1. Convert to Entrez gene ID

The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:

  • "org.Hs.egSYMBOL2EG" – almost always 1-to-1 match

  • "org.Hs.egACCNUM2EG" – mostly a 1-to-1 match

  • "org.Hs.egALIAS2EG" – sometimes a 1-to-1 match, sometimes 1-to-many

When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.

Step 2. Use Entrez gene ID to return official annotation

The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".

The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:

  • "org.Hs.egSYMBOL" – official Entrez gene symbol

  • "org.Hs.egALIAS" – set of recognized aliases for an Entrez gene.

  • "org.Hs.egGENENAME" – official Entrez long gene name

For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.

Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.

Pre-existing intermediate values

When input data contains a column that matches argument intermediate, the intention is for pre-existing values to be honored and maintained. As a result, pre-existing non-empty values in intermediate are kept without being updated. For example, in a data.frame that contains colnames 'gene' and 'intermediate', and values are present in both columns for a given row, the value in 'intermediate' is used without also querying the value in 'gene' for that row.

Case-insensitive search is particularly useful in non-human organisms because they often use mixed-case gene symbols. This support is enabled with the argument ignore.case=TRUE. In our benchmark tests it appears to add roughly 0.1 seconds per annotation in try_list, regardless of the number of input entries in x. It seems reasonable that the cost may increase with annotation size, although in practice the time dependency does not seem to be notably affected by annotation size.

This time cost is related to the process of converting all keys() in the annotation to lowercase, plus relatively minimal time to handle scenarios where multiple keys in mixed case (capitalization) are converted to the same lowercase key, for example 'CAL', 'Cal', 'cal' are all recognized as 'cal', and all associated connections are combined. This secondary step was added in version 0.0.18.900. Prior to that version, only the first such matching lowercase result was used.

For the vast majority of gene-based queries involving human, this step is not necessary and produces identical results. The reason is that human gene symbols are entirely UPPERCASE.

See also

Examples

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   cat("\nBasic usage\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}
#> 
#> Basic usage
#>   input ENTREZID SYMBOL
#> 1  APOE      348   APOE
#> 2  CCN2     1490   CCN2
#> 3  CTGF     1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally show the annotation source matched
   cat("\nOptionally show the annotation source matched\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}
#> 
#> Optionally show the annotation source matched
#>   input ENTREZID    ENTREZID_source SYMBOL
#> 1  APOE      348 org.Hs.egSYMBOL2EG   APOE
#> 2  CCN2     1490 org.Hs.egSYMBOL2EG   CCN2
#> 3  CTGF     1490  org.Hs.egALIAS2EG   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Show comma-delimited genes
   cat("\nInput genes are comma-delimited\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}
#> 
#> Input genes are comma-delimited
#>       input ENTREZID SYMBOL
#> 1      APOE      348   APOE
#> 2      CCN2     1490   CCN2
#> 3      CTGF     1490   CCN2
#> 4 CCN2,CTGF     1490   CCN2

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally include more than SYMBOL in the output
   cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
   print(freshenGenes(c("APOE", "HIST1H1C"),
      final=c("SYMBOL", "ALIAS", "GENENAME")));
}
#> 
#> Custom output to include SYMBOL, ALIAS, GENENAME
#>      input ENTREZID SYMBOL                             ALIAS
#> 1     APOE      348   APOE   AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG
#> 2 HIST1H1C     3006   H1-2 H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C
#>                              GENENAME
#> 1                    apolipoprotein E
#> 2 H1.2 linker histone, cluster member

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## More advanced, match affymetrix probesets
   if (requireNamespace("hgu133plus2.db", quietly=TRUE) &&
      suppressPackageStartupMessages(require("hgu133plus2.db"))) {
      cat("\nAdvanced example including Affymetrix probesets.\n");
      affy_df <- freshenGenes(
         c("227047_x_at", "APOE", "HIST1H1D", "NM_003166,U08032"),
         include_source=TRUE,
         try_list=c("hgu133plus2ENTREZID", "REFSEQ2EG",
            "SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
         final=c("SYMBOL", "GENENAME"))
      print(affy_df)
   }
}
#> 
#> 
#> Advanced example including Affymetrix probesets.
#>              input ENTREZID     ENTREZID_source  SYMBOL
#> 1      227047_x_at    57659 hgu133plus2ENTREZID   ZBTB4
#> 2             APOE      348  org.Hs.egSYMBOL2EG    APOE
#> 3         HIST1H1D     3007   org.Hs.egALIAS2EG    H1-3
#> 4 NM_003166,U08032     6818  org.Hs.egREFSEQ2EG SULT1A3
#>                                  GENENAME
#> 1 zinc finger and BTB domain containing 4
#> 2                        apolipoprotein E
#> 3     H1.3 linker histone, cluster member
#> 4     sulfotransferase family 1A member 3