Assign exon names to GRangesList

assignGRLexonNames(
  GRL,
  geneSymbolColname = "geneSymbol",
  exonNameColname = paste0(geneSymbolColname, "Exon"),
  suffix = "_exon",
  renameOnes = FALSE,
  filterTwoStrand = TRUE,
  checkDisjoin = c("warn", "none", "stop"),
  assignGRLnames = TRUE,
  verbose = FALSE,
  ...
)

Arguments

GRL

GRangesList input object. Ideally, the input GRanges are disjoint, meaning no two exons overlap, but instead are represented by non-overlapping regions that are sometimes adjacent.

geneSymbolColname

character string indicating with colname of values(GRL) contains the unique gene symbol. If this value does not already exist, it is created and populated using names(GRL). If names(GRL) does not exist, it tries to use values(GRL)[[geneSymbolColname]] then if that does not exist it tries to use values(GRL@unlistData)[[geneSymbolColname]] with the first entry in each GRanges from GRL.

exonNameColname

character string to be used for the resulting exon name. By default "Exon" is appended to the end of the geneSymbolColname.

suffix

character value indicating the suffix to add to the geneSymbolColname to indicate individual exon numbers.

renameOnes

logical indicating whether to name exon sub-sections when the exon is not subdivided. For example, when renameOnes=FALSE, an exon with one section would be named, "exon1", otherwise when renameOnes=TRUE an exon with one section would be named, "exon1a". This distinction can be helpful to recognize exons that contain no subsections.

filterTwoStrand

logical indicating whether to filter out genes occurring on two different strands.

checkDisjoin

character value indicating how to handle non-disjoint input GRL ranges. When checkDisjoin="stop" then any non-disjoint GRanges in an element of GRL will cause the function to fail.

assignGRLnames

logical indicating whether names for the resulting GRangesList should use the exon names.

verbose

logical indicating whether to print verbose output.

...

additional arguments are ignored.

Value

GRangesList object with additional columns indicating the exon name.

Details

This function takes a GRangesList object with an annotated gene symbol column, and defines exon numbers for each distinct (non-adjacent) range per gene. When multiple ranges overlap, one exon number is applied to them all, and disjoint ranges are denoted using a letter suffix.

For example the exon labels below:

|======|......|======|=======|======|......|======|=======| .exon1.........exon2a.exon2b..exon2c........exon3a..exon3b.

The full name for each feature will become:

  • Gene_exon1

  • Gene_exon2a

  • Gene_exon2b

  • Gene_exon2c

  • Gene_exon3a

  • Gene_exon3b

The reasoning, is to preserve the gene symbol for readability, but to number exons to indicate the numbered contiguous exon, with suffix to indicate the sub-section of each exon.

See also

Examples

gene1 <- GenomicRanges::GRanges(seqnames="chr1", ranges=IRanges::IRanges( start=c(100, 200, 400, 500), width=c(100, 50, 50, 150)), strand="+", geneSymbol="Gene1"); gene2 <- GenomicRanges::GRanges(seqnames="chr1", ranges=IRanges::IRanges( start=c(600, 900, 1150, 1200), width=c(200, 100, 100, 50)), strand="-", geneSymbol="Gene2"); gene3 <- GenomicRanges::GRanges(seqnames="chr1", ranges=IRanges::IRanges( start=c(1500), width=c(75)), strand="+", geneSymbol="Gene3"); GRL <- GenomicRanges::GRangesList(list(gene1, gene2, gene3)) assignGRLexonNames(GRL)
#> GRangesList object of length 3: #> $Gene1 #> GRanges object with 4 ranges and 2 metadata columns: #> seqnames ranges strand | geneSymbol geneSymbolExon #> <Rle> <IRanges> <Rle> | <character> <character> #> Gene1_exon1a chr1 100-199 + | Gene1 Gene1_exon1a #> Gene1_exon1b chr1 200-249 + | Gene1 Gene1_exon1b #> Gene1_exon2 chr1 400-449 + | Gene1 Gene1_exon2 #> Gene1_exon3 chr1 500-649 + | Gene1 Gene1_exon3 #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths #> #> $Gene2 #> GRanges object with 4 ranges and 2 metadata columns: #> seqnames ranges strand | geneSymbol geneSymbolExon #> <Rle> <IRanges> <Rle> | <character> <character> #> Gene2_exon3 chr1 600-799 - | Gene2 Gene2_exon3 #> Gene2_exon2 chr1 900-999 - | Gene2 Gene2_exon2 #> Gene2_exon1b chr1 1150-1249 - | Gene2 Gene2_exon1b #> Gene2_exon1a chr1 1200-1249 - | Gene2 Gene2_exon1a #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths #> #> $Gene3 #> GRanges object with 1 range and 2 metadata columns: #> seqnames ranges strand | geneSymbol geneSymbolExon #> <Rle> <IRanges> <Rle> | <character> <character> #> Gene3_exon1 chr1 1500-1574 + | Gene3 Gene3_exon1 #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths #>