Annotate GRanges using another GRanges object
annotateGRfromGR( GR1, GR2, grOL = NULL, numAsStrings = FALSE, stringShrinkFunc = function(...) { jamba::cPasteUnique(..., doSort = TRUE) }, numShrinkFunc = sum, addStringCols = NULL, type = c("any", "start", "end", "within", "equal"), ignore.strand = FALSE, select = "all", sep = ",", verbose = FALSE, DEBUG = FALSE, ... )
GR1 | GRanges object, the reference object to which annotations are added. |
---|---|
GR2 | GRanges object used for annotations |
grOL | overlaps object, optionally used when the
|
numAsStrings | logical indicating whether numerical values should
be treated as strings for the purpose of added annotations. When
|
stringShrinkFunc | function used to shrink string annotations
by overlapping feature. This function should take a list as input,
and |
numShrinkFunc | function used to shrink numerical values
by overlapping feature. For numeric values, the |
addStringCols | character vector, optional, of numeric colnames that
should be treated as string values. This option is a subset of the
|
type, ignore.strand, select | arguments sent to
|
sep | character value indicating the delimiter to use between string values. |
verbose | logical indicating whether to print verbose output. |
DEBUG | logical indicating whether to print debugging output. |
... | additional arguments are ignored. |
GRanges object with colnames added to values
, with length
and order equal to the input GR1
GRanges object.
This function adds annotations to features in the given GRanges
object, from overlapping features in a second GRanges object.
It is especially useful after performing a manipulation that
results in a GRanges object without any values
annotation,
for example GenomicRanges::reduce()
or GenomicRanges::intersect()
.
In theory this function is relatively simple, it applies annotations
to overlapping entries. In practice, it gets complicated when multiple
annotations overlap the same GRange entry. In that case, numerical
values by default return the base::mean()
while string values call
jamba::cPasteUnique()
by default, and return the unique, sorted,
comma-delimited set of string values. For example, overlapping several
exons from the same gene, the resulting annotation might include
just one gene symbol.
The numeric values can be aggregated using another function, controlled
by numShrinkFunction
named using the colname. For example:
numShrinkFunction="min"
would return the base::min()
value for all
numeric columns. But numShrinkFunction=list(default=mean, score=max)
would return the base::mean()
for all numeric columns except the
"score"
column which would report the base::max()
.
Numeric values currently use the data.table
package workflow, which
provides extremely efficient calculations for numeric values. However,
vectorized functions have the potential to be notably faster, as is
true with jamba::cPasteUnique()
especially when applying
jamba::mixedSort()
to sort alphanumeric values. In future, the
implementation may change to accomodate vectorized numeric functions,
instead of using data.table
.
TODO: This function is a specific case where another function
shrinkDataFrame()
may be a more general purpose use. That function
is yet to be ported to the Jam package suite.
Other jam GRanges functions:
addGRLgaps()
,
addGRgaps()
,
annotateGRLfromGRL()
,
assignGRLexonNames()
,
closestExonToJunctions()
,
combineGRcoverage()
,
exoncov2polygon()
,
findOverlapsGRL()
,
flattenExonsBy()
,
getFirstStrandedFromGRL()
,
getGRLgaps()
,
getGRcoverageFromBw()
,
getGRgaps()
,
grl2df()
,
jam_isDisjoint()
,
make_ref2compressed()
,
sortGRL()
,
spliceGR2junctionDF()
,
stackJunctions()
gr12 <- GenomicRanges::GRanges( seqnames=rep(c("chr1", "chr2", "chr1"), c(3,3,3)), ranges=IRanges::IRanges( start=c(100, 200, 400, 500, 300, 100, 200, 400, 600), width=c(100,150,50, 50,50,100, 50,200,50) ), strand=rep(c("+", "-", "+"), c(3,3,3)), gene_name=rep(c("GeneA", "GeneB", "GeneC"), each=3) ) gr1 <- gr12[,0]; # Say for example you have a GRanges object with no annotation gr1;#> GRanges object with 9 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 100-199 + #> [2] chr1 200-349 + #> [3] chr1 400-449 + #> [4] chr2 500-549 - #> [5] chr2 300-349 - #> [6] chr2 100-199 - #> [7] chr1 200-249 + #> [8] chr1 400-599 + #> [9] chr1 600-649 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths# And had another GRanges object with annotations gr12;#> GRanges object with 9 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 100-199 + | GeneA #> [2] chr1 200-349 + | GeneA #> [3] chr1 400-449 + | GeneA #> [4] chr2 500-549 - | GeneB #> [5] chr2 300-349 - | GeneB #> [6] chr2 100-199 - | GeneB #> [7] chr1 200-249 + | GeneC #> [8] chr1 400-599 + | GeneC #> [9] chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths# To add annotations annotateGRfromGR(gr1, gr12);#> GRanges object with 9 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> GR1_1 chr1 100-199 + | GeneA #> GR1_2 chr1 200-349 + | GeneA,GeneC #> GR1_3 chr1 400-449 + | GeneA,GeneC #> GR1_4 chr2 500-549 - | GeneB #> GR1_5 chr2 300-349 - | GeneB #> GR1_6 chr2 100-199 - | GeneB #> GR1_7 chr1 200-249 + | GeneA,GeneC #> GR1_8 chr1 400-599 + | GeneA,GeneC #> GR1_9 chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths# Notice certain features overlap multiple annotations, # which may be acceptable. # If you want to keep annotations distinct, # use annotateGRLfromGRL() grl1 <- GenomicRanges::split(gr12[,0], GenomicRanges::values(gr12)$gene_name); grl2 <- GenomicRanges::split(gr12, GenomicRanges::values(gr12)$gene_name); # The first object is a GRangesList with no annotations grl1;#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 100-199 + #> [2] chr1 200-349 + #> [3] chr1 400-449 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr2 500-549 - #> [2] chr2 300-349 - #> [3] chr2 100-199 - #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 200-249 + #> [2] chr1 400-599 + #> [3] chr1 600-649 + #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #># The second object is a GRangesList with annotation, # assumed to be in the same order grl2;#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 100-199 + | GeneA #> [2] chr1 200-349 + | GeneA #> [3] chr1 400-449 + | GeneA #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr2 500-549 - | GeneB #> [2] chr2 300-349 - | GeneB #> [3] chr2 100-199 - | GeneB #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> [1] chr1 200-249 + | GeneC #> [2] chr1 400-599 + | GeneC #> [3] chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #>#> GRangesList object of length 3: #> $GeneA #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v1 chr1 100-199 + | GeneA #> grl1_v2 chr1 200-349 + | GeneA #> grl1_v3 chr1 400-449 + | GeneA #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneB #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v4 chr2 500-549 - | GeneB #> grl1_v5 chr2 300-349 - | GeneB #> grl1_v6 chr2 100-199 - | GeneB #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #> #> $GeneC #> GRanges object with 3 ranges and 1 metadata column: #> seqnames ranges strand | X #> <Rle> <IRanges> <Rle> | <character> #> grl1_v7 chr1 200-249 + | GeneC #> grl1_v8 chr1 400-599 + | GeneC #> grl1_v9 chr1 600-649 + | GeneC #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths #>