Flatten exons by gene or transcript
flattenExonsBy( exonsByTx, tx2geneDF, by = c("gene", "tx"), detectedTx = NULL, genes = NULL, txColname = "transcript_id", geneColname = "gene_name", cdsByTx = NULL, cdsByGene = NULL, filterTwoStrand = FALSE, exon_method = c("disjoin", "reduce"), verbose = FALSE )
Argument | Description |
---|---|
exonsByTx | GRangesList named by transcript, containing one or more GRanges representing exons. This data is often produced from |
tx2geneDF | data.frame containing at least two columns with transcript and gene annotation, whose colnames are defined by arguments txColname and geneColname. |
by | character string to group exons, either "gene" or "tx". |
detectedTx | character vector of detected transcripts, used to subset the overall transcripts prior to producing a flattened gene exon model. |
genes | optional character vector, representing a subset of genes for which flattened exons will be prepared. This argument is useful when focusing on only one or a subset of genes. |
txColname | character string indicating a column from tx2geneDF. |
geneColname | character string indicating a column from tx2geneDF. |
cdsByTx | optional GRangesList, similar to exonsByTx except that it only contains the CDS portion of exons. |
cdsByGene | |
filterTwoStrand | |
exon_method | character string, either "disjoin" or "reduce". With "disjoin", the output maintains internal exon boundaries wherever multiple exons overlap. |
verbose | logical indicating whether to print verbose output. |
GRangesList with names dependent upon argument by: when by="gene", names are derived from values in geneColname; when by="tx", names are derived from values in txColname.
Each entry in the GRangesList will contain a series of non-overlapping GRanges, each representing an exon. The exon names are described below.
This function takes as input:
- exonsByTx as a GRangesList object of transcript exons named by the transcript_id
- tx2geneDF a data.frame with transcript-gene cross-reference
- detectedTx an optional character vector of transcript_id values, used to subset the overall transcripts
- cdsByTx an optional GRangesList object, similar to exonsByTx except that it only contains the CDS portion of exons
When by="gene" this function groups exons from one or more transcript isoforms together by gene, to produce a single non-overlapping set of exons that describes each gene. When cdsByTx is provided, the output is useful in showing which regions of an exon are coding (CDS), and which regions are non-coding. When exon_method="disjoin" the output also maintains any internal exon boundaries wherever multiple exons overlap.
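The difference between the two exon_method values can be illustrated with GenomicRanges directly. This is a minimal sketch with invented coordinates, assuming the option names correspond to the GenomicRanges functions disjoin() and reduce(); it is not the internal code of flattenExonsBy().

```r
library(GenomicRanges)

# Toy exons from two hypothetical isoforms of one gene (coordinates invented).
# Exons 100-200 and 150-250 overlap; exon 400-500 is shared by both isoforms.
exons <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(
    start = c(100, 150, 400, 400),
    end   = c(200, 250, 500, 500)),
  strand = "+")

# reduce() merges overlapping exons into one range each:
# 100-250, 400-500
reduced <- reduce(exons)

# disjoin() keeps every internal boundary:
# 100-149, 150-200, 201-250, 400-500
disjoint <- disjoin(exons)
```

With disjoin() the overlapping region is split into three parts, which is what allows sub-divided exons to be labeled separately.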
When by="tx" this function is primarily used to combine exonsByTx with optional cdsByTx in order to sub-divide exons into regions which are coding (CDS) and non-coding.
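The idea of sub-dividing exons into coding and non-coding regions can be sketched with GenomicRanges alone. The coordinates and the is_cds column name below are assumptions made for illustration, and the approach shown (disjoin, then flag overlaps) is one plausible way to do it rather than a description of what flattenExonsBy() does internally.

```r
library(GenomicRanges)

# Toy transcript with two exons (coordinates invented).
exons <- GRanges("chr1", IRanges(start = c(100, 400), end = c(300, 600)),
  strand = "+")

# CDS portion of the same exons: the coding region starts inside
# exon 1 and ends inside exon 2.
cds <- GRanges("chr1", IRanges(start = c(180, 400), end = c(300, 450)),
  strand = "+")

# Splitting exons at CDS boundaries gives four parts:
# 100-179, 180-300, 400-450, 451-600
parts <- disjoin(c(exons, cds))

# Flag each part as coding when it overlaps the CDS.
# "is_cds" is a hypothetical column name used only in this sketch.
parts$is_cds <- overlapsAny(parts, cds)
```

Here the first and last parts are non-coding and the two middle parts are coding.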
The use of detectedTx is very helpful in reducing the overall complexity of the flattened gene-exon models, specifically by reducing the number of low-quality predicted transcripts that are represented.
Finally, this function calls assignGRLexonNames() to label exons using a defined naming scheme:
- Contiguous exons are numbered in order, starting at 1 and increasing in the coding direction (strand-specific). For example, exons will be numbered: exon1, exon2, exon3.
- Exons which are sub-divided are indicated with a lowercase letter, for example: exon1a, exon1b, exon1c.
A text schematic is shown below:
|=======|======|......|=======|.....|=======|=======|=======|
|_exon1a|exon1b|......|_exon2_|.....|_exon3a|_exon3b|_exon3c|
Where |====| represents an exon, |====|====| represents one contiguous exon with two sub-divided parts, and |.....| represents an intron.
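The naming scheme can be sketched in base R. The helper name_flat_exons() below is hypothetical and written only for illustration; it is not assignGRLexonNames(), and it assumes the parts belong to one gene on one strand, do not overlap, and that no exon has more than 26 sub-divided parts.

```r
# Hypothetical helper, not the splicejam implementation: name the
# non-overlapping exon parts of one gene on one strand.
name_flat_exons <- function(start, end, strand = "+") {
  # Order parts in the coding direction: ascending for "+",
  # descending for "-".
  ord <- order(start, decreasing = (strand == "-"))
  s <- start[ord]
  e <- end[ord]
  n <- length(s)
  # A part continues the current exon only when it directly abuts
  # the previous part; otherwise it begins a new exon.
  if (strand == "-") {
    abuts <- c(FALSE, e[-1] + 1 == s[-n])
  } else {
    abuts <- c(FALSE, s[-1] == e[-n] + 1)
  }
  exon_num <- cumsum(!abuts)
  # Position of each part within its exon, and parts per exon.
  part_idx <- ave(seq_len(n), exon_num, FUN = seq_along)
  part_n <- ave(seq_len(n), exon_num, FUN = length)
  # Append a lowercase letter only when the exon is sub-divided.
  suffix <- ifelse(part_n > 1, letters[part_idx], "")
  # Return names in the original input order.
  out <- character(n)
  out[ord] <- paste0("exon", exon_num, suffix)
  out
}

# Coordinates invented to mirror the schematic above.
starts <- c(1, 101, 301, 501, 601, 701)
ends   <- c(100, 200, 400, 600, 700, 800)
plus_names <- name_flat_exons(starts, ends, strand = "+")
# "exon1a" "exon1b" "exon2" "exon3a" "exon3b" "exon3c"
minus_names <- name_flat_exons(starts, ends, strand = "-")
# "exon3b" "exon3a" "exon2" "exon1c" "exon1b" "exon1a"
```

On the minus strand the numbering starts from the highest coordinate, so the same ranges receive names in the opposite order.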
It is recommended but not required to supply detectedTx, since it can greatly reduce the total number of transcripts. This step has two benefits:
- Supplying detectedTx can greatly simplify the resulting gene-exon models.
- Supplying detectedTx has the by-product of removing potentially erroneous transcripts from the source annotation, while also producing a finished result that is driven by observed data.
Potential problems with supplying detectedTx, and suggested workarounds:
- If the detectedTx is incorrect, it may not include all genes defined in tx2geneDF. In principle, this effect is beneficial, since no flat gene-exon models are produced for genes with no observed data.
  - Workaround: Note that launchSashimiApp() has the option to query "All genes".
  - An alternative workaround is to run flattenExonsBy() without supplying detectedTx, but providing genes, so this method only produces flat gene-exon models for the genes of interest.
- The sashimi plot may represent exon coverage as if it were an intron, thus compressing the width of that coverage inside an intron context. However, the coverage will be displayed, giving a visual indicator that it may need to be reviewed in more detail.
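Before applying the workaround, it can help to check which genes would be left without a flat model for a given detectedTx. The sketch below uses base R and an invented tx2geneDF with the default column names from the usage above; the variable dropped_genes is hypothetical. The final call is shown as a comment because it needs a real exonsByTx object.

```r
# Toy cross-reference table using the default column names
# (transcript and gene values are invented).
tx2geneDF <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3", "tx4"),
  gene_name = c("GeneA", "GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE)

# Hypothetical set of detected transcripts.
detectedTx <- c("tx1", "tx3")

# Rows that remain after subsetting by detectedTx.
kept <- tx2geneDF[tx2geneDF$transcript_id %in% detectedTx, , drop = FALSE]

# Genes with no detected transcript: here only "GeneC".
dropped_genes <- setdiff(tx2geneDF$gene_name, kept$gene_name)

# Workaround sketch: flatten only those genes, without detectedTx.
# Not run here; requires an exonsByTx GRangesList for these transcripts.
# flatExonsByGene <- flattenExonsBy(
#   exonsByTx = exonsByTx,
#   tx2geneDF = tx2geneDF,
#   by = "gene",
#   genes = dropped_genes)
```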
Other jam RNA-seq functions: assignGRLexonNames(), closestExonToJunctions(), combineGRcoverage(), defineDetectedTx(), detectedTxInfo(), exoncov2polygon(), getGRcoverageFromBw(), groups2contrasts(), internal_junc_score(), makeTx2geneFromGtf(), make_ref2compressed(), prepareSashimi(), runDiffSplice(), sortSamples(), spliceGR2junctionDF()
Other jam GRanges functions: addGRLgaps(), addGRgaps(), annotateGRLfromGRL(), annotateGRfromGR(), assignGRLexonNames(), closestExonToJunctions(), combineGRcoverage(), exoncov2polygon(), findOverlapsGRL(), getFirstStrandedFromGRL(), getGRLgaps(), getGRcoverageFromBw(), getGRgaps(), grl2df(), jam_isDisjoint(), make_ref2compressed(), sortGRL(), spliceGR2junctionDF(), stackJunctions()