Flatten exons by gene or transcript

flattenExonsBy(
  exonsByTx,
  tx2geneDF,
  by = c("gene", "tx"),
  detectedTx = NULL,
  genes = NULL,
  txColname = "transcript_id",
  geneColname = "gene_name",
  cdsByTx = NULL,
  cdsByGene = NULL,
  filterTwoStrand = FALSE,
  exon_method = c("disjoin", "reduce"),
  verbose = FALSE
)

Arguments

exonsByTx	GRangesList named by transcript, containing one or more GRanges representing exons. This data is often produced from `TxDb` data using `GenomicFeatures::exonsBy(...,by="tx")`.
tx2geneDF	data.frame containing at least two columns with transcript and gene annotation, whose colnames are defined by arguments `txColname` and `geneColname` respectively. When using a GTF file, `makeTx2geneFromGtf()` can be used to create a `tx2geneDF` in `data.frame` format.
by	character string to group exons, `"gene"` groups multiple transcripts per gene, and `"tx"` groups exons per transcript. Note that in both cases, it combines `exonsByTx` and `cdsByTx` when `cdsByTx` is also supplied.
detectedTx	character vector of detected transcripts, used to subset the overall transcripts prior to producing a flattened gene exon model.
genes	optional character vector, representing a subset of genes for which flattened exons will be prepared. This argument is useful when focusing on only one or a subset of genes.
txColname	character string indicating a column from `colnames(tx2geneDF)` used to identify transcripts.
geneColname	character string indicating a column from `colnames(tx2geneDF)` used to identify gene name, or gene symbol.
cdsByTx	`GRangesList` named by transcript, containing `GRanges` exons that only include CDS regions. This data is often produced from `TxDb` data using `GenomicFeatures::cdsBy(...,by="tx")`.
cdsByGene	`GRangesList` named by gene, containing `GRanges` exons that only include CDS regions. This data is often produced from `TxDb` data using `GenomicFeatures::cdsBy(...,by="gene")`. Note this input is only used when `by="gene"`.
filterTwoStrand	`logical` indicating whether genes on multiple strands are removed during `assignGRLexonNames()` which assigns ordered exon numbers for each unique gene. Setting this to `FALSE` may cause exon numbers to be incorrect, but it will retain all genes. When this argument is `TRUE` any genes present on multiple strands are removed.
exon_method	`character` string indicating the method to use when combining transcript exons by gene: `"disjoin"` maintains the internal boundaries for overlapping exons, so overlapping exons of different width will be sub-divided; `"reduce"` combines overlapping exons into one larger exon that is not sub-divided. The `"reduce"` method is substantially faster, but loses the ability to match a specific exon region to its source transcript isoform(s). Note that when `cdsByTx` is also supplied, exons will be sub-divided at the point where an exon changes from CDS-overlapping to non-coding.
verbose	logical indicating whether to print verbose output.
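
As a minimal sketch of how the arguments above fit together, the example below builds a toy `exonsByTx` and `tx2geneDF` and passes them to `flattenExonsBy()`. The transcript names, gene names, and coordinates are invented for illustration, and the package name `splicejam` is an assumption, so the call is guarded. Real input from `GenomicFeatures::exonsBy()` carries additional metadata columns that this toy input omits.

```r
library(GenomicRanges)

# Toy annotation (hypothetical): two isoforms of "GeneA", one of "GeneB"
exonsByTx <- GRangesList(
  tx1 = GRanges("chr1", IRanges(c(100, 400), c(200, 500)), strand = "+"),
  tx2 = GRanges("chr1", IRanges(c(150, 400), c(250, 500)), strand = "+"),
  tx3 = GRanges("chr2", IRanges(c(1000, 1300), c(1100, 1400)), strand = "+")
)

# Transcript-to-gene table using the default txColname and geneColname
tx2geneDF <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3"),
  gene_name = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Assumption: flattenExonsBy() is provided by the splicejam package
if (requireNamespace("splicejam", quietly = TRUE)) {
  # Flatten all detected transcripts by gene, keeping internal boundaries
  flatByGene <- splicejam::flattenExonsBy(
    exonsByTx = exonsByTx,
    tx2geneDF = tx2geneDF,
    by = "gene",
    detectedTx = c("tx1", "tx2", "tx3"),
    exon_method = "disjoin"
  )
  # Alternatively, skip detectedTx and restrict the output with genes
  flatGeneA <- splicejam::flattenExonsBy(
    exonsByTx = exonsByTx,
    tx2geneDF = tx2geneDF,
    by = "gene",
    genes = "GeneA"
  )
}
```

The second call shows the `genes` argument, which limits the work to the genes of interest when no `detectedTx` vector is available.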

Value

GRangesList with names dependent upon argument by: when by="gene" names are derived from values in geneColname; when by="tx" names are derived from values in txColname. Each entry in the GRangesList will contain a series of non-overlapping GRanges, each representing an exon. The exon naming scheme is described in Details.

Details

This function takes as input:

exonsByTx as a GRangesList object of transcript exons named by the transcript_id
tx2geneDF a data.frame with transcript-gene cross-reference
detectedTx an optional character vector of transcript_id values, used to subset the overall transcripts
cdsByTx an optional GRangesList object, similar to exonsByTx except that it only contains the CDS portion of exons

When by="gene" this function groups exons from one or more transcript isoforms together by gene, to produce a single non-overlapping set of exons that describes each gene. When cdsByTx is provided, the output is useful in showing which regions of an exon are coding (CDS), and which regions are non-coding. When exon_method="disjoin" the output also maintains any internal exon boundaries wherever multiple exons overlap.
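
The difference between the two `exon_method` values can be illustrated with the underlying `GenomicRanges` operations. This is a sketch of the general technique on two invented overlapping exons, not a copy of the internal code of `flattenExonsBy()`.

```r
library(GenomicRanges)

# Two overlapping exons from different (hypothetical) isoforms of one gene
exons <- GRanges(
  "chr1",
  IRanges(start = c(100, 150), end = c(200, 300)),
  strand = "+"
)

# "disjoin": keep internal boundaries, giving 100-149, 150-200, 201-300
dj <- disjoin(exons, with.revmap = TRUE)

# "reduce": merge overlapping exons into a single range, 100-300
rd <- reduce(exons)

# revmap records which input exons contribute to each disjoint piece
dj$revmap
```

The `revmap` column is what allows a disjoint piece to be traced back to its source exons, which is the information that `reduce()` discards.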

When by="tx" this function is primarily used to combine exonsByTx with optional cdsByTx in order to sub-divide exons into regions which are coding (CDS) and non-coding.
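
The sub-division into coding and non-coding regions can be sketched with standard `GenomicRanges` calls. The coordinates and the `feature` column name below are assumptions made for illustration, and the approach shown is one way to obtain this split rather than a description of the function internals.

```r
library(GenomicRanges)

# One hypothetical exon whose 5' end is untranslated
exon <- GRanges("chr1", IRanges(100, 300), strand = "+")
cds <- GRanges("chr1", IRanges(150, 300), strand = "+")

# Split the exon wherever the CDS boundary falls inside it
parts <- disjoin(c(exon, cds))

# Label each piece by whether it overlaps the CDS ("feature" is illustrative)
parts$feature <- ifelse(overlapsAny(parts, cds), "cds", "noncoding")
parts
```

Here the exon is split at position 150 into a non-coding piece (100-149) and a coding piece (150-300).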

In practice, supplying detectedTx is very helpful in reducing the overall complexity of the flattened gene-exon models, specifically by reducing the number of low-quality predicted transcripts that are represented.

Finally, this function calls assignGRLexonNames() to label exons using a defined naming scheme:

Contiguous exons are numbered in order, starting at 1 and increasing in the coding direction (strand-specific). For example, exons will be numbered: exon1, exon2, exon3.
Exons which are sub-divided are indicated with a lowercase letter, for example: exon1a, exon1b, exon1c.

A text schematic is shown below:

|=======|======|......|=======|.....|=======|=======|=======|

|_exon1a|exon1b|......|_exon2_|.....|_exon3a|_exon3b|_exon3c|

Where

|====| represents an exon,
|====|====| represents one contiguous exon with two sub-divided parts, and
|.....| represents an intron.
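
The naming scheme in the schematic can be reproduced with a short base R sketch. The helper `label_exons()` is hypothetical and is not the implementation of assignGRLexonNames(); it assumes sorted, non-overlapping ranges on the "+" strand, where adjacent ranges (end + 1 equal to the next start) belong to the same contiguous exon.

```r
# Hypothetical helper, not part of the package: label sorted, non-overlapping
# "+" strand ranges as exon1a, exon1b, exon2, ...
label_exons <- function(start, end) {
  # A new exon begins wherever a range does not abut the previous range
  new_exon <- c(TRUE, start[-1] != end[-length(end)] + 1)
  exon_num <- cumsum(new_exon)
  # Position of each part within its exon, and number of parts per exon
  part_idx <- ave(seq_along(exon_num), exon_num, FUN = seq_along)
  n_parts <- ave(seq_along(exon_num), exon_num, FUN = length)
  # Only sub-divided exons receive a letter suffix
  suffix <- ifelse(n_parts > 1, letters[part_idx], "")
  paste0("exon", exon_num, suffix)
}

# Coordinates chosen to mirror the schematic above
exon_labels <- label_exons(
  start = c(1, 101, 301, 501, 601, 701),
  end = c(100, 200, 400, 600, 700, 800)
)
exon_labels
```

For a gene on the "-" strand the numbering would run in the opposite coordinate direction, which this sketch does not handle.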

It is recommended but not required to supply detectedTx, since it can greatly reduce the total number of transcripts. This step has two benefits:

Supplying detectedTx can greatly simplify the resulting gene-exon models.
Supplying detectedTx has the by-product of removing potentially erroneous transcripts from the source annotation, while also producing a finished result that is driven by observed data.
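
The effect of detectedTx on the number of transcripts per gene can be illustrated with a base R sketch. The transcript and gene names are invented, and the subsetting shown is only an approximation of what the function does with detectedTx.

```r
# Hypothetical annotation: three isoforms of GeneA, one of GeneB
tx2geneDF <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3", "tx4"),
  gene_name = c("GeneA", "GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Transcripts with observed support, for example from expression data
detectedTx <- c("tx1", "tx4")

# Keep only the detected transcripts
tx2gene_detected <- tx2geneDF[tx2geneDF$transcript_id %in% detectedTx, , drop = FALSE]

# Transcripts per gene before and after filtering
table(tx2geneDF$gene_name)
table(tx2gene_detected$gene_name)
```

With fewer isoforms per gene, fewer overlapping exon boundaries are carried into the flattened model.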

Potential problems with supplying detectedTx, and suggested work-around:

If detectedTx is incorrect, it may not include all genes defined in tx2geneDF. In principle, this effect is beneficial, because no flat gene-exon models are produced for genes with no observed data.
- Workaround: Note that launchSashimiApp() has the option to query "All genes".
- An alternative workaround is to run flattenExonsBy() without supplying detectedTx, while supplying genes so that flat gene-exon models are produced only for the genes of interest.
The sashimi plot may represent exon coverage as if it were an intron, thus compressing the width of that coverage inside an intron context. However, the coverage will be displayed, giving a visual indicator that it may need to be reviewed in more detail.
