Flatten exons by gene or transcript

flattenExonsBy(
  exonsByTx,
  tx2geneDF,
  by = c("gene", "tx"),
  detectedTx = NULL,
  genes = NULL,
  txColname = "transcript_id",
  geneColname = "gene_name",
  cdsByTx = NULL,
  cdsByGene = NULL,
  filterTwoStrand = FALSE,
  exon_method = c("disjoin", "reduce"),
  verbose = FALSE
)

Arguments

exonsByTx

GRangesList named by transcript, containing one or more GRanges representing exons. This data is often produced from TxDb data using GenomicFeatures::exonsBy(...,by="tx").

tx2geneDF

data.frame containing at least two columns with transcript and gene annotation, whose colnames are defined by arguments txColname and geneColname respectively. When using a GTF file, makeTx2geneFromGtf() can be used to create a tx2geneDF in data.frame format.
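As a minimal sketch of the expected shape of tx2geneDF, the base R snippet below builds a small cross-reference table by hand. The transcript and gene identifiers are made up for illustration; only the two column names match the defaults of txColname and geneColname.

```r
# Hypothetical transcript-to-gene table; identifiers are invented.
tx2geneDF <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3"),
  gene_name = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# The columns named by txColname and geneColname must both be present.
txColname <- "transcript_id"
geneColname <- "gene_name"
hasColumns <- all(c(txColname, geneColname) %in% colnames(tx2geneDF))
print(hasColumns)
```

If the table uses different column names, they would be passed through txColname and geneColname rather than renamed.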

by

character string indicating how to group exons: "gene" groups exons from multiple transcripts per gene, and "tx" groups exons per transcript. In both cases, exonsByTx is combined with cdsByTx when cdsByTx is also supplied.

detectedTx

character vector of detected transcripts, used to subset the overall transcripts prior to producing a flattened gene exon model.

genes

optional character vector, representing a subset of genes for which flattened exons will be prepared. This argument is useful when focusing on only one or a subset of genes.

txColname

character string indicating a column from colnames(tx2geneDF) used to identify transcripts.

geneColname

character string indicating a column from colnames(tx2geneDF) used to identify gene name, or gene symbol.

cdsByTx

GRangesList named by transcript, containing GRanges exons that only include CDS regions. This data is often produced from TxDb data using GenomicFeatures::cdsBy(...,by="tx").

cdsByGene

GRangesList named by gene, containing GRanges exons that only include CDS regions. This data is often produced from TxDb data using GenomicFeatures::cdsBy(...,by="gene"). Note this input is only used when by="gene".

filterTwoStrand

logical indicating whether genes on multiple strands are removed during assignGRLexonNames() which assigns ordered exon numbers for each unique gene. Setting this to FALSE may cause exon numbers to be incorrect, but it will retain all genes. When this argument is TRUE any genes present on multiple strands are removed.

exon_method

character string indicating the method to use when combining transcript exons by gene: "disjoin" maintains the internal boundaries for overlapping exons, so overlapping exons of different width will be sub-divided; "reduce" combines overlapping exons into one larger exon that is not sub-divided. The "reduce" method is substantially faster, but loses the ability to match a specific exon region to its source transcript isoform(s). Note that when cdsByTx is also supplied, exons will be sub-divided at the point where an exon changes from CDS-overlapping to non-coding.
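The difference between the two methods can be seen with the underlying GenomicRanges operations. This is a sketch of the concept on two invented overlapping exons, not a trace of what flattenExonsBy() does internally.

```r
library(GenomicRanges)

# Two overlapping exons from different (hypothetical) isoforms.
exons <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(100, 150), end = c(200, 300)),
  strand = "+"
)

# disjoin() keeps every internal boundary: 100-149, 150-200, 201-300.
flatDisjoin <- disjoin(exons)

# reduce() merges the overlap into one range: 100-300.
flatReduce <- reduce(exons)

print(flatDisjoin)
print(flatReduce)
```

With disjoin, the middle piece (150-200) is the region shared by both isoforms, which is the information that reduce discards.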

verbose

logical indicating whether to print verbose output.

Value

GRangesList with names dependent upon argument by: when by="gene" names are derived from values in geneColname; when by="tx" names are derived from values in txColname. Each entry in the GRangesList will contain a series of non-overlapping GRanges each representing an exon. The exon naming scheme is described in Details.

Details

This function takes as input:

  • exonsByTx as a GRangesList object of transcript exons named by the transcript_id

  • tx2geneDF a data.frame with transcript-gene cross-reference

  • detectedTx an optional character vector of transcript_id values, used to subset the overall transcripts

  • cdsByTx an optional GRangesList object, similar to exonsByTx except that it only contains the CDS portion of exons

When by="gene" this function groups exons from one or more transcript isoforms together by gene, to produce a single non-overlapping set of exons that describe each gene. When cdsByTx is provided, the output is useful in showing which regions of an exon are coding (CDS), and which regions are non-coding. When exon_method="disjoin" the output also maintains any internal exon boundaries wherever multiple exons overlap.

When by="tx" this function is primarily used to combine exonsByTx with optional cdsByTx in order to sub-divide exons into regions which are coding (CDS) and non-coding.
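The coding versus non-coding sub-division can be sketched with GenomicRanges primitives. The coordinates and the feature column name below are assumptions for illustration; the actual function may annotate its output differently.

```r
library(GenomicRanges)

# One hypothetical exon, and the CDS portion that covers its 3' part.
exon <- GRanges("chr1", IRanges(start = 100, end = 300), strand = "+")
cds <- GRanges("chr1", IRanges(start = 150, end = 300), strand = "+")

# Split the exon wherever the CDS boundary falls: 100-149 and 150-300.
parts <- disjoin(c(exon, cds))

# Label each part by whether it overlaps the CDS.
# "feature" is an illustrative column name, not the package's.
parts$feature <- ifelse(overlapsAny(parts, cds), "CDS", "noncoding")

print(parts)
```

Here the first part (100-149) is labeled non-coding and the second (150-300) is labeled CDS, matching the idea of dividing an exon at the point where coding sequence begins.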

Supplying detectedTx appears to be very helpful in reducing the overall complexity of the flattened gene-exon models, specifically by reducing the number of low-quality predicted transcripts that are represented.

Finally, this function calls assignGRLexonNames() to label exons using a defined naming scheme:

  • Contiguous exons are numbered in order, starting at 1 and increasing in the coding direction (strand-specific). For example, exons will be numbered: exon1, exon2, exon3.

  • Exons which are sub-divided are indicated with a lowercase letter suffix, for example: exon1a, exon1b, exon1c.

A text schematic is shown below:

|=======|======|......|=======|.....|=======|=======|=======|

|_exon1a|exon1b|......|_exon2_|.....|_exon3a|_exon3b|_exon3c|

Where

  • |====| represents an exon,

  • |====|====| represents one contiguous exon with two sub-divided parts, and

  • |.....| represents an intron.
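The naming scheme in the schematic can be reproduced with a small base R helper. nameFlatExons() is a hypothetical function written for this illustration, not the implementation of assignGRLexonNames(). It assumes sub-divided parts of one exon abut each other exactly, and that letters on the minus strand also follow the direction of transcription.

```r
# Hypothetical helper illustrating the exon naming scheme.
nameFlatExons <- function(start, end, strand = "+") {
  # Sort parts by genomic position.
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  n <- length(start)
  # A new contiguous exon begins when a part does not abut the previous one.
  newExon <- c(TRUE, start[-1] > end[-n] + 1)
  exonNum <- cumsum(newExon)
  if (strand == "-") {
    # Number in the direction of transcription on the minus strand.
    start <- rev(start)
    end <- rev(end)
    exonNum <- max(exonNum) - rev(exonNum) + 1
  }
  # Position of each part within its exon, and number of parts per exon.
  partIndex <- ave(exonNum, exonNum, FUN = seq_along)
  partCount <- ave(exonNum, exonNum, FUN = length)
  # Only sub-divided exons receive a letter suffix (supports up to 26 parts).
  suffix <- ifelse(partCount > 1, letters[partIndex], "")
  data.frame(start = start, end = end,
    name = paste0("exon", exonNum, suffix),
    stringsAsFactors = FALSE)
}

# Coordinates loosely following the schematic: 2 parts, 1 part, 3 parts.
flatPlus <- nameFlatExons(
  start = c(1, 9, 22, 35, 43, 51),
  end = c(8, 15, 29, 42, 50, 58),
  strand = "+")
print(flatPlus$name)
# "exon1a" "exon1b" "exon2" "exon3a" "exon3b" "exon3c"
```

Calling the same helper with strand = "-" numbers the rightmost exon as exon1, so the three-part exon becomes exon1a, exon1b, exon1c.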

It is recommended but not required to supply detectedTx, since it can greatly reduce the total number of transcripts. This step has two benefits:

  1. Supplying detectedTx can greatly simplify the resulting gene-exon models.

  2. Supplying detectedTx has the by-product of removing potentially erroneous transcripts from the source annotation, while also producing a finished result that is driven by observed data.
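The effect of detectedTx on which genes receive a flattened model can be sketched in base R. The identifiers below are invented, and the filtering shown is an assumption about the general idea, not the function's exact internal logic.

```r
# Hypothetical annotation: four transcripts across three genes.
tx2geneDF <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3", "tx4"),
  gene_name = c("GeneA", "GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Transcripts with observed data (invented for this example).
detectedTx <- c("tx1", "tx3")

# Keep only annotation rows for detected transcripts.
keptDF <- tx2geneDF[tx2geneDF$transcript_id %in% detectedTx, , drop = FALSE]
keptGenes <- unique(keptDF$gene_name)

# Genes with no detected transcript would have no flattened model.
droppedGenes <- setdiff(unique(tx2geneDF$gene_name), keptGenes)

print(keptGenes)
print(droppedGenes)
```

In this toy case GeneA is modeled from one transcript instead of two, and GeneC drops out entirely because none of its transcripts were detected.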

Potential problems with supplying detectedTx, and suggested work-around:

  1. If detectedTx is incorrect or incomplete, it may not include all genes defined in tx2geneDF. In principle, this effect is beneficial, since no flat gene-exon models are produced for genes with no observed data.

    • Workaround: Note that launchSashimiApp() has the option to query "All genes".

    • An alternative workaround is to run flattenExonsBy() without supplying detectedTx, but providing genes so that this method only produces flat gene-exon models for the genes of interest.

  2. The sashimi plot may represent exon coverage as if it were an intron, thus compressing the width of that coverage inside an intron context. However, the coverage will be displayed, giving a visual indicator that it may need to be reviewed in more detail.

See also