Annotate GRanges using another GRanges object

annotateGRfromGR(
  GR1,
  GR2,
  grOL = NULL,
  numAsStrings = FALSE,
  stringShrinkFunc = function(...) {     jamba::cPasteUnique(..., doSort = TRUE) },
  numShrinkFunc = sum,
  addStringCols = NULL,
  type = c("any", "start", "end", "within", "equal"),
  ignore.strand = FALSE,
  select = "all",
  sep = ",",
  verbose = FALSE,
  DEBUG = FALSE,
  ...
)

Arguments

GR1	GRanges object, the reference object to which annotations are added.
GR2	GRanges object used for annotations
grOL	overlaps object, optionally used when the `GenomicRanges::findOverlaps()` function has already been run, mostly intended to save re-processing the overlaps for large objects.
numAsStrings	logical indicating whether numerical values should be treated as strings for the purpose of added annotations. When `TRUE`, numerical values will be converted to character strings and retained as if they were labels. This argument treats all numeric columns as string, to treat only a subset use the `addStringCols` argument.
stringShrinkFunc	function used to shrink string annotations by overlapping feature. This function should take a list as input, and `sep` as an argument indicating the delimiter if applicable, and return a character vector of the same length as the list. By default, `jamba::cPasteUnique()` is used. Control over the sort and uniqueness should be applied with this function. Note: `stringShrinkFunc` can be a single function, or can be a named list of function calls, whose names match the colnames of `values`. If a named list is provided, the first entry is used for any any names in `values` which are not in `names(stringShrinkFunc)`.
numShrinkFunc	function used to shrink numerical values by overlapping feature. For numeric values, the `data.table` package is used to apply the function, which enables custom functions not otherwise available via the `GenomicRanges` methods. Therefore, the function does not take a list as input, instead takes a numeric vector as input. Note: `numShrinkFunc` can be a single function, or can be a named list of function calls, whose names match the colnames of `values`. If a named list is provided, the first entry is used for any any names in `values` which are not in `names(numShrinkFunc)`.
addStringCols	character vector, optional, of numeric colnames that should be treated as string values. This option is a subset of the `numAsStrings`.
type, ignore.strand, select	arguments sent to `GenomicRanges::findOverlaps()`. Note these options are only used when `grOL` is not supplied.
sep	character value indicating the delimiter to use between string values.
verbose	logical indicating whether to print verbose output.
DEBUG	logical indicating whether to print debugging output.
...	additional arguments are ignored.

Value

GRanges object with colnames added to values, with length and order equal to the input GR1 GRanges object.

Details

This function adds annotations to features in the given GRanges object, from overlapping features in a second GRanges object. It is especially useful after performing a manipulation that results in a GRanges object without any values annotation, for example GenomicRanges::reduce() or GenomicRanges::intersect().

In theory this function is relatively simple, it applies annotations to overlapping entries. In practice, it gets complicated when multiple annotations overlap the same GRange entry. In that case, numerical values by default return the base::mean() while string values call jamba::cPasteUnique() by default, and return the unique, sorted, comma-delimited set of string values. For example, overlapping several exons from the same gene, the resulting annotation might include just one gene symbol.

The numeric values can be aggregated using another function, controlled by numShrinkFunction named using the colname. For example: numShrinkFunction="min" would return the base::min() value for all numeric columns. But numShrinkFunction=list(default=mean, score=max) would return the base::mean() for all numeric columns except the "score" column which would report the base::max().

Numeric values currently use the data.table package workflow, which provides extremely efficient calculations for numeric values. However, vectorized functions have the potential to be notably faster, as is true with jamba::cPasteUnique() especially when applying jamba::mixedSort() to sort alphanumeric values. In future, the implementation may change to accomodate vectorized numeric functions, instead of using data.table.

TODO: This function is a specific case where another function shrinkDataFrame() may be a more general purpose use. That function is yet to be ported to the Jam package suite.

Examples

gr12 <- GenomicRanges::GRanges(
   seqnames=rep(c("chr1", "chr2", "chr1"), c(3,3,3)),
   ranges=IRanges::IRanges(
      start=c(100, 200, 400, 500, 300, 100, 200, 400, 600),
      width=c(100,150,50, 50,50,100, 50,200,50)
   ),
   strand=rep(c("+", "-", "+"), c(3,3,3)),
   gene_name=rep(c("GeneA", "GeneB", "GeneC"), each=3)
)
gr1 <- gr12[,0];

# Say for example you have a GRanges object with no annotation
gr1;
#> GRanges object with 9 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   100-199      +
#>   [2]     chr1   200-349      +
#>   [3]     chr1   400-449      +
#>   [4]     chr2   500-549      -
#>   [5]     chr2   300-349      -
#>   [6]     chr2   100-199      -
#>   [7]     chr1   200-249      +
#>   [8]     chr1   400-599      +
#>   [9]     chr1   600-649      +
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# And had another GRanges object with annotations
gr12;
#> GRanges object with 9 ranges and 1 metadata column:
#>       seqnames    ranges strand |   gene_name
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1   100-199      + |       GeneA
#>   [2]     chr1   200-349      + |       GeneA
#>   [3]     chr1   400-449      + |       GeneA
#>   [4]     chr2   500-549      - |       GeneB
#>   [5]     chr2   300-349      - |       GeneB
#>   [6]     chr2   100-199      - |       GeneB
#>   [7]     chr1   200-249      + |       GeneC
#>   [8]     chr1   400-599      + |       GeneC
#>   [9]     chr1   600-649      + |       GeneC
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# To add annotations
annotateGRfromGR(gr1, gr12);
#> GRanges object with 9 ranges and 1 metadata column:
#>         seqnames    ranges strand |   gene_name
#>            <Rle> <IRanges>  <Rle> | <character>
#>   GR1_1     chr1   100-199      + |       GeneA
#>   GR1_2     chr1   200-349      + | GeneA,GeneC
#>   GR1_3     chr1   400-449      + | GeneA,GeneC
#>   GR1_4     chr2   500-549      - |       GeneB
#>   GR1_5     chr2   300-349      - |       GeneB
#>   GR1_6     chr2   100-199      - |       GeneB
#>   GR1_7     chr1   200-249      + | GeneA,GeneC
#>   GR1_8     chr1   400-599      + | GeneA,GeneC
#>   GR1_9     chr1   600-649      + |       GeneC
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# Notice certain features overlap multiple annotations,
# which may be acceptable.

# If you want to keep annotations distinct,
# use annotateGRLfromGRL()
grl1 <- GenomicRanges::split(gr12[,0],
   GenomicRanges::values(gr12)$gene_name);
grl2 <- GenomicRanges::split(gr12,
   GenomicRanges::values(gr12)$gene_name);

# The first object is a GRangesList with no annotations
grl1;
#> GRangesList object of length 3:
#> $GeneA
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   100-199      +
#>   [2]     chr1   200-349      +
#>   [3]     chr1   400-449      +
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneB
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr2   500-549      -
#>   [2]     chr2   300-349      -
#>   [3]     chr2   100-199      -
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneC
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   200-249      +
#>   [2]     chr1   400-599      +
#>   [3]     chr1   600-649      +
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 

# The second object is a GRangesList with annotation,
# assumed to be in the same order
grl2;
#> GRangesList object of length 3:
#> $GeneA
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |   gene_name
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1   100-199      + |       GeneA
#>   [2]     chr1   200-349      + |       GeneA
#>   [3]     chr1   400-449      + |       GeneA
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneB
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |   gene_name
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr2   500-549      - |       GeneB
#>   [2]     chr2   300-349      - |       GeneB
#>   [3]     chr2   100-199      - |       GeneB
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneC
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |   gene_name
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1   200-249      + |       GeneC
#>   [2]     chr1   400-599      + |       GeneC
#>   [3]     chr1   600-649      + |       GeneC
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 

annotateGRLfromGRL(grl1, grl2);
#> GRangesList object of length 3:
#> $GeneA
#> GRanges object with 3 ranges and 1 metadata column:
#>           seqnames    ranges strand |           X
#>              <Rle> <IRanges>  <Rle> | <character>
#>   grl1_v1     chr1   100-199      + |       GeneA
#>   grl1_v2     chr1   200-349      + |       GeneA
#>   grl1_v3     chr1   400-449      + |       GeneA
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneB
#> GRanges object with 3 ranges and 1 metadata column:
#>           seqnames    ranges strand |           X
#>              <Rle> <IRanges>  <Rle> | <character>
#>   grl1_v4     chr2   500-549      - |       GeneB
#>   grl1_v5     chr2   300-349      - |       GeneB
#>   grl1_v6     chr2   100-199      - |       GeneB
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $GeneC
#> GRanges object with 3 ranges and 1 metadata column:
#>           seqnames    ranges strand |           X
#>              <Rle> <IRanges>  <Rle> | <character>
#>   grl1_v7     chr1   200-249      + |       GeneC
#>   grl1_v8     chr1   400-599      + |       GeneC
#>   grl1_v9     chr1   600-649      + |       GeneC
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>

Annotate GRanges using another GRanges object

Arguments

Value

Details

See also

Examples