Annotate GRanges using another GRanges object

annotateGRfromGR(
  GR1,
  GR2,
  grOL = NULL,
  numAsStrings = FALSE,
  stringShrinkFunc = function(...) {     jamba::cPasteUnique(..., doSort = TRUE) },
  numShrinkFunc = sum,
  addStringCols = NULL,
  type = c("any", "start", "end", "within", "equal"),
  ignore.strand = FALSE,
  select = "all",
  sep = ",",
  verbose = FALSE,
  DEBUG = FALSE,
  ...
)

Arguments

GR1

GRanges object, the reference object to which annotations are added.

GR2

GRanges object used for annotations

grOL

overlaps object, optionally used when the GenomicRanges::findOverlaps() function has already been run, mostly intended to save re-processing the overlaps for large objects.

numAsStrings

logical indicating whether numerical values should be treated as strings for the purpose of added annotations. When TRUE, numerical values will be converted to character strings and retained as if they were labels. This argument treats all numeric columns as string, to treat only a subset use the addStringCols argument.

stringShrinkFunc

function used to shrink string annotations by overlapping feature. This function should take a list as input, and sep as an argument indicating the delimiter if applicable, and return a character vector of the same length as the list. By default, jamba::cPasteUnique() is used. Control over the sort and uniqueness should be applied with this function. Note: stringShrinkFunc can be a single function, or can be a named list of function calls, whose names match the colnames of values. If a named list is provided, the first entry is used for any any names in values which are not in names(stringShrinkFunc).

numShrinkFunc

function used to shrink numerical values by overlapping feature. For numeric values, the data.table package is used to apply the function, which enables custom functions not otherwise available via the GenomicRanges methods. Therefore, the function does not take a list as input, instead takes a numeric vector as input. Note: numShrinkFunc can be a single function, or can be a named list of function calls, whose names match the colnames of values. If a named list is provided, the first entry is used for any any names in values which are not in names(numShrinkFunc).

addStringCols

character vector, optional, of numeric colnames that should be treated as string values. This option is a subset of the numAsStrings.

type, ignore.strand, select

arguments sent to GenomicRanges::findOverlaps(). Note these options are only used when grOL is not supplied.

sep

character value indicating the delimiter to use between string values.

verbose

logical indicating whether to print verbose output.

DEBUG

logical indicating whether to print debugging output.

...

additional arguments are ignored.

Value

GRanges object with colnames added to values, with length and order equal to the input GR1 GRanges object.

Details

This function adds annotations to features in the given GRanges object, from overlapping features in a second GRanges object. It is especially useful after performing a manipulation that results in a GRanges object without any values annotation, for example GenomicRanges::reduce() or GenomicRanges::intersect().

In theory this function is relatively simple, it applies annotations to overlapping entries. In practice, it gets complicated when multiple annotations overlap the same GRange entry. In that case, numerical values by default return the base::mean() while string values call jamba::cPasteUnique() by default, and return the unique, sorted, comma-delimited set of string values. For example, overlapping several exons from the same gene, the resulting annotation might include just one gene symbol.

The numeric values can be aggregated using another function, controlled by numShrinkFunction named using the colname. For example: numShrinkFunction="min" would return the base::min() value for all numeric columns. But numShrinkFunction=list(default=mean, score=max) would return the base::mean() for all numeric columns except the "score" column which would report the base::max().

Numeric values currently use the data.table package workflow, which provides extremely efficient calculations for numeric values. However, vectorized functions have the potential to be notably faster, as is true with jamba::cPasteUnique() especially when applying jamba::mixedSort() to sort alphanumeric values. In future, the implementation may change to accomodate vectorized numeric functions, instead of using data.table.

TODO: This function is a specific case where another function shrinkDataFrame() may be a more general purpose use. That function is yet to be ported to the Jam package suite.

See also

Examples

gr12 <- GenomicRanges::GRanges( seqnames=rep(c("chr1", "chr2", "chr1"), c(3,3,3)), ranges=IRanges::IRanges( start=c(100, 200, 400, 500, 300, 100, 200, 400, 600), width=c(100,150,50, 50,50,100, 50,200,50) ), strand=rep(c("+", "-", "+"), c(3,3,3)), gene_name=rep(c("GeneA", "GeneB", "GeneC"), each=3) ) gr1 <- gr12[,0]; # Say for example you have a GRanges object with no annotation gr1;
#> GRanges object with 9 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 100-199 + #> [2] chr1 200-349 + #> [3] chr1 400-449 + #> [4] chr2 500-549 - #> [5] chr2 300-349 - #> [6] chr2 100-199 - #> [7] chr1 200-249 + #> [8] chr1 400-599 + #> [9] chr1 600-649 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths
# And had another GRanges object with annotations gr12;
#> GRanges object with 9 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 100-199 + | GeneA #> [2] chr1 200-349 + | GeneA #> [3] chr1 400-449 + | GeneA #> [4] chr2 500-549 - | GeneB #> [5] chr2 300-349 - | GeneB #> [6] chr2 100-199 - | GeneB #> [7] chr1 200-249 + | GeneC #> [8] chr1 400-599 + | GeneC #> [9] chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths
# To add annotations annotateGRfromGR(gr1, gr12);
#> GRanges object with 9 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> GR1_1 chr1 100-199 + | GeneA #> GR1_2 chr1 200-349 + | GeneA,GeneC #> GR1_3 chr1 400-449 + | GeneA,GeneC #> GR1_4 chr2 500-549 - | GeneB #> GR1_5 chr2 300-349 - | GeneB #> GR1_6 chr2 100-199 - | GeneB #> GR1_7 chr1 200-249 + | GeneA,GeneC #> GR1_8 chr1 400-599 + | GeneA,GeneC #> GR1_9 chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Notice certain features overlap multiple annotations, # which may be acceptable. # If you want to keep annotations distinct, # use annotateGRLfromGRL() grl1 <- GenomicRanges::split(gr12[,0], GenomicRanges::values(gr12)$gene_name); grl2 <- GenomicRanges::split(gr12, GenomicRanges::values(gr12)$gene_name); # The first object is a GRangesList with no annotations grl1;
#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 100-199 + #> [2] chr1 200-349 + #> [3] chr1 400-449 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr2 500-549 - #> [2] chr2 300-349 - #> [3] chr2 100-199 - #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 200-249 + #> [2] chr1 400-599 + #> [3] chr1 600-649 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #>
# The second object is a GRangesList with annotation, # assumed to be in the same order grl2;
#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 100-199 + | GeneA #> [2] chr1 200-349 + | GeneA #> [3] chr1 400-449 + | GeneA #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr2 500-549 - | GeneB #> [2] chr2 300-349 - | GeneB #> [3] chr2 100-199 - | GeneB #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 200-249 + | GeneC #> [2] chr1 400-599 + | GeneC #> [3] chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #>
annotateGRLfromGRL(grl1, grl2);
#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v1 chr1 100-199 + | GeneA #> grl1_v2 chr1 200-349 + | GeneA #> grl1_v3 chr1 400-449 + | GeneA #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v4 chr2 500-549 - | GeneB #> grl1_v5 chr2 300-349 - | GeneB #> grl1_v6 chr2 100-199 - | GeneB #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v7 chr1 200-249 + | GeneC #> grl1_v8 chr1 400-599 + | GeneC #> grl1_v9 chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #>