vignettes/create-a-sashimi-plot.Rmd
      create-a-sashimi-plot.RmdThere are several sources of gene-exon structures:
GTF or GFF file, for example Gencode GTF (https://www.gencodegenes.com).
GenomicFeatures::TxDb object with GenomicFeatures::makeTxDbFromGFF().splicejam::makeTx2geneFromGtf() creates a data.frame with cross-reference data for "transcript_id" and "gene_name". This step makes GTF a more reliable workflow than the TxDb workflow, since the TxDb workflow relies upon Bioconductor annotations to obtain "gene_name", and is sometimes lacking.Txdb transcript database, provided by GenomicFeatures::TxDb
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene").TxDb specification does not include "gene_name" which therefore needs to be added using Bioconductor annotation packages, such as org.Hs.eg.db (Human, Homo sapiens) or org.Mm.eg.db (Mouse, Mus musculus). This step is almost always a little lossy, seen as TxDb entries whose "transcript_id" do not have corresponding"gene_id" nor "gene_name" entries. Ability to fetch annotation directly from UCSC is currently not straightforward.GRanges or GRangesList objects, already assembled by other methods.
rtracklayer::import() has several methods that can import BED files, or GTF files. Typically, BED format Ultimately the requirement is a disjoint (non-overlapping) set of exon GRanges, pulled into a GRangesList whose names are "gene_name". This "gene_name" is mainly used as a figure caption, though it can be integrated with transcript exons.The requirement for Sashimi plots is to flatten exons to disjoint (non-overlapping) exons across the gene region of interest. The exons are used to determine the genomic regions to use for sequence coverage, and splice junction reads.
The exons are also used to define the genome coordinate transformation function used to compress introns to more manageable (smaller) size.
The recommended workflow is to use detected transcripts derived from splicejam::defineDetectedTx() or some other means. This step typically reduces the total transcripts by 60%, which greatly simplifies the resulting gene structure, and focuses the gene structure on the experimental data.
Gene transcriptome data resources are increasingly comprehensive, as they are derived from numerous tissue samples and cell types. In ideal cases, the data should represent the superset of possible gene transcripts, of which only a fraction are experimentally observed.
In addition, we observed questionable transcript annotations, with no apparent supporting data – possible artifacts of an automated annotation pipeline. While these apparent errors are found in low percentage, their presence can completely disrupt a gene structure, sometimes combining several genes (70+) into one extremely long gene locus longer than 5 megabases.
As a practical matter, we found it beneficial that many of these transcripts get removed during the splicejam::defineDetectedTx() process, conveniently simplifying the resulting gene-exon models. Note this process currently uses data from an analysis tool such as Kallisto (Bray et al 2016, https://pachterlab.github.io/kallisto/about) or Salmon (Patro et al 2017, https://combine-lab.github.io/salmon/) since these tools estimate transcript isoform abundance.
This example uses Bioconductor data package TxDb.Mmusculus.UCSC.mm10.knownGene which represents UCSC known genes for the mouse mm10 genome assembly.
First we assemble all exons by transcript, then all CDS exons by transcript. We will use these exons to assemble annotated gene models.
options("warn"=-1); suppressPackageStartupMessages(library(splicejam)); suppressPackageStartupMessages(library(jamba)); suppressPackageStartupMessages(library(kableExtra)); if (suppressPackageStartupMessages(require(TxDb.Mmusculus.UCSC.mm10.knownGene))) { # First obtain exons by transcript exonsByTxMm10 <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="tx", use.names=TRUE); values(exonsByTxMm10@unlistData)$feature_type <- "exon"; values(exonsByTxMm10@unlistData)$subclass <- "exon"; # For added insight, obtain CDS exons by transcript (optional) cdsByTxMm10 <- cdsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="tx", use.names=TRUE); values(cdsByTxMm10@unlistData)$feature_type <- "cds"; values(cdsByTxMm10@unlistData)$subclass <- "cds"; }
The TxDb objects do not permit storing gene symbols, so the example uses the Bioconductor annotation package org.Mm.eg.db to assign gene_name values. For values not found in org.Mm.eg.db, we use the format LOC# to indicate the Entrez gene ID, for example Entrez gene ID 1234 would be represented LOC1234.
if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) { # Now prepare tx_name, gene_id, gene_name data.frame, # surprisingly difficult tx2geneMm10 <- suppressMessages( AnnotationDbi::select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys(TxDb.Mmusculus.UCSC.mm10.knownGene, "GENEID"), columns=c("GENEID","TXNAME"), keytype="GENEID" ) ); tx2geneMm10 <- renameColumn(tx2geneMm10, from=c("GENEID", "TXNAME", "TXTYPE"), to=c("gene_id", "transcript_id", "transcript_type")); # add gene_name using org.Mm.eg.db if (suppressPackageStartupMessages(require(org.Mm.eg.db))) { gene_ids <- values(genes(TxDb.Mmusculus.UCSC.mm10.knownGene))$gene_id; gene_namesL <- mget(gene_ids, org.Mm.egSYMBOL, ifnotfound=NA); ## Convert list to vector taking the first gene_name each ## (All genes should only have one SYMBOL but there is no ## hard constraint so we should make absolutely sure to ## use only one value per gene.) gene_names <- unlist(heads(S4Vectors::List(gene_namesL), 1)); ## Replace NA with LOC# format ## Note we use gsub() to ensure the data fits the expected format if (any(is.na(gene_names))) { gene_na <- which(is.na(gene_names)); gene_names[gene_na] <- gsub("^([0-9]+)$", "LOC\\1", names(gene_names[gene_na])); } tx2geneMm10$gene_name <- gene_names[as.character(tx2geneMm10$gene_id)]; } else { ## If we have no gene annotations, use the gene_id values tx2geneMm10$gene_name <- as.character(tx2geneMm10$gene_id); } # print the first 20 rows to show the content print(head(tx2geneMm10, 20)); } #> Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
Next we flatten transcript exons to the gene level using the function splicejam::flattenExonsBy().
We also flatten exons to the transcript level, which has the effect of combining CDS exons alongside non-CDS exons. The result is used when plotting transcript exon structures.
In both cases, the argument genes=c("Gria1", "Ntrk2") is used as a convenience to produce data for only these two genes.
We recommend using detected transcripts here, with the argument
detectedTx.
if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) { # flatten exons to the gene level # for speed, we will only process "Gria1", and "Ntrk2" flatExonsByGeneMm10 <- flattenExonsBy(exonsByTx=exonsByTxMm10, cdsByTx=cdsByTxMm10, by="gene", genes=c("Gria1", "Ntrk2"), tx2geneDF=tx2geneMm10, verbose=FALSE); # to be fancy, also flatten transcripts, to include CDS ranges flatExonsByTxMm10 <- flattenExonsBy(exonsByTx=exonsByTxMm10, cdsByTx=cdsByTxMm10, tx2geneDF=tx2geneMm10, by="tx", genes=c("Gria1", "Ntrk2")); } #> Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
Once the gene-transcript-exon data is prepared, the exon structure can be plotted using the function splicejam::gene2gg():
if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) { # Pull out Gria1 grlGria1 <- flatExonsByGeneMm10[["Gria1"]]; ## Plot a basic gene-exon structure ggGria1exons <- gene2gg(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, exonLabelSize=6); print(ggGria1exons + ggtitle("Gria1 exons")); ## Compare to the gene structure without compressing introns gg1full <- gene2gg(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, compressGaps=FALSE) print(gg1full); ## Plot a slightly more detailed gene-transcript-exon structure ggGria1exonsTx <- gene2gg(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, flatExonsByTx=flatExonsByTxMm10, tx2geneDF=tx2geneMm10); print(ggGria1exonsTx + ggtitle("Gria1 (compressed introns)")); ## Notice how difficult it is to see exon15 and exon16 are ## mutually exclusive exons gg2full <- gene2gg(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, flatExonsByTx=flatExonsByTxMm10, tx2geneDF=tx2geneMm10, compressGaps=FALSE); print(gg2full + ggtitle("Gria1 (uncompressed introns)")); } #> Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
There are currently two main sources of coverage data:
bigWig format, accessible either as a file or web URL.NumericList. This data is generated using rtracklayer::import(...,type="bigWig") but it is recommended to use the function splicejam::getGRcoverageFromBw().The next step is to define biological samples for each set of coverage data. For this step, we use a data.frame called "filesDF", with these colnames:
"url" - one of the following:
colnames(values(covGR)) when covGR is supplied as a GRanges object containing coverage data."type" - the type of file
"bw" for bigWig data"junction" for a BED12 file containing splice junction reads"coverage_gr" for coverage data provided in covGR
"sample_id" - biological sample identifier, used to aggregate files into the same plot panel."scale_factor" - (optional) column with numeric values which are multiplied by the score of the file. These values can help normalize scores across samples.A bigWig coverage file can only store coverage for one strand, when using strand-specific RNA-seq. To provide negative strand data, either provide bigWig coverage scores which are negative values (never above zero, and containing at least one negative value), or supply a negative value for the "scale_factor" field in filesDF.
The example below defines a data.frame named filesDF which contains bigWig coverage, and splice junction data.
## assemble a data.frame baseurl <- "https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/"; # BED files with junction reads bedext <- ".STAR_mm10.combinedJunctions.bed"; bwext <- c("492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw", "492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw"); c1 <- c("CA1", "CA2"); r1 <- c("CB", "DE"); bedsamples <- paste0(rep(c1, each=2), "_", r1); bedurls <- paste0(baseurl, bedsamples, bedext); # bigWig files with strand-specific read coverage bwsamples <- paste0(rep(c1, each=4), rep(r1, each=2)); bwsamples1 <- paste0(rep(c1, each=4), "_", rep(r1, each=2)); bwurls <- paste0(baseurl, "NS50211", bwsamples, bwext); # Assemble into a data.frame filesDF <- data.frame(stringsAsFactors=FALSE, check.names=FALSE, url=c(bedurls, bwurls), type=rep(c("junction", "bw"), c(4,8)), sample_id=c(bedsamples, bwsamples1), scale_factor=rep(c(1,3), c(length(bedurls), length(bwurls)))); filesDF; #> url #> 1 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/CA1_CB.STAR_mm10.combinedJunctions.bed #> 2 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/CA1_DE.STAR_mm10.combinedJunctions.bed #> 3 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/CA2_CB.STAR_mm10.combinedJunctions.bed #> 4 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/CA2_DE.STAR_mm10.combinedJunctions.bed #> 5 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA1CB492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw #> 6 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA1CB492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw #> 7 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA1DE492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw #> 8 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA1DE492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw #> 9 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA2CB492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw #> 10 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA2CB492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw #> 11 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA2DE492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw #> 12 https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/NS50211CA2DE492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw #> type sample_id scale_factor #> 1 junction CA1_CB 1 #> 2 junction CA1_DE 1 #> 3 junction CA2_CB 1 #> 4 junction CA2_DE 1 #> 5 bw CA1_CB 3 #> 6 bw CA1_CB 3 #> 7 bw CA1_DE 3 #> 8 bw CA1_DE 3 #> 9 bw CA2_CB 3 #> 10 bw CA2_CB 3 #> 11 bw CA2_DE 3 #> 12 bw CA2_DE 3
Example coverage data is provided as data, to demonstrate the format expected. Note that in this case, there is no coverage defined for introns. The bigWig workflow by default adds gaps between exons, and retrieves coverage data for introns as well.
The GRanges object below uses an example set of exons, test_exon_wide_gr, and corresponding coverage GRanges is test_exon_wide_gr. Note that both share the same names and genomic coordinates.
Ultimately, coverage data should be split by the same GRanges used to define exons (and introns) for the Sashimi plot.
data(test_exon_wide_gr); test_exon_wide_gr; #> GRanges object with 4 ranges and 1 metadata column: #> seqnames ranges strand | gene_name #> <Rle> <IRanges> <Rle> | <character> #> wide1 chr1 100-200 + | TestGene1 #> wide2 chr1 10300-10400 + | TestGene1 #> wide3 chr1 20500-20750 + | TestGene1 #> wide4 chr1 39900-40000 + | TestGene1 #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths data(test_cov_wide_gr); test_cov_wide_gr; #> GRanges object with 4 ranges and 1 metadata column: #> seqnames ranges strand | sample_A #> <Rle> <IRanges> <Rle> | <NumericList> #> wide1 chr1 100-200 + | 246,248,261,... #> wide2 chr1 10300-10400 + | 195,202,198,... #> wide3 chr1 20500-20750 + | 195,189,178,... #> wide4 chr1 39900-40000 + | 260,257,253,... #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Some basic plotting functions are provided, in fact they are rolled into the Sashimi plot itself. They can help visualize the coverage by itself.
The function splicejam::exoncov2polygon() converts coverage to polygons, by default it returns a format convenient for geom_polygon() or ggforce::geom_shape(), but if coord_style="base" then it returns a matrix with blank rows separating each polygon, suitable for R base graphics using graphics::polygon().
The argument covNames must contain one of more columns in colnames(values(test_cov_wide_gr)).
# To plot a simple GRanges object widecovdf <- exoncov2polygon(test_cov_wide_gr, covNames="sample_A"); suppressPackageStartupMessages(require(ggplot2)); ggWide3 <- ggplot(widecovdf, aes(x=x, y=y, group=gr, fill=gr, color=gr)) + ggforce::geom_shape(alpha=0.7) + colorjam::theme_jam() + colorjam::scale_fill_jam() + colorjam::scale_color_jam(); print(ggWide3);

The plot above demonstrates default genomic coordinates, and shows how large intron regions are compared to typical exons (in mammals anyway.) For better visualization of the transcript exons, we compress introns using the function splicejam::make_ref2compresssed(). This function returns a list with several useful components, one of which is trans_grc which is a ggplot2 transformation function.
The argument nBreaks=10 defines 10 fixed x-axis label positions, which by default use exon boundaries. To label every exon boundary, use nBreaks=Inf.
The key step when using ggplot2 is to use the function ggplot2::scale_x_continuous(), using argument trans=ref2c$trans_grc. This argument defines the x-axis compression, x-axis breaks, x-axis minor breaks, and x-axis labels.
Alternatively, you can use ggplot2::coord_trans(x=ref2c$trans_grc) however, that function only compresses the x-axis coordinates and does not define breaks or labels.
# Now compress the introns keeping axis labels ref2c <- make_ref2compressed(test_cov_wide_gr, nBreaks=10); # Repeat ggplot using ref2c$trans_grc ggWide3c <- ggWide3 + scale_x_continuous(trans=ref2c$trans_grc) + xlab("chr1 (compressed introns)") + ggtitle("exons (compressed introns)"); print(ggWide3c);

Splice junction data can be provided in two forms:
BED6 or BED12 format, accessible either as a file or web URL. The BED6 format uses the score column to count reads, and uses the genomic coordinates to indicate the splice junction span. The BED12 format is inverted, so there is one base on each side, and the junction region is in the middle. In this case, the one base is trimmed off each side, and the internal region is used to indicate the junction spanning region."score" column indicates the number of splice junction reads, and the "sample_id" column indicates the biological sample. Therefore all splice junctions for all samples should be condensed into one GRanges object.Currently, the rtracklayer::import() function does not support bigBed format, and there is no apparent plan to provide that capability. It is recommended to use the UCSC tool bigBedToBed to convert bigBed files to BED format.
A set of sample splice junction data is provided to show the expected format of junction data. Note it has two columns "score" and "sample_id".
data(test_junc_wide_gr); test_junc_wide_gr; #> GRanges object with 5 ranges and 2 metadata columns: #> seqnames ranges strand | score sample_id #> <Rle> <IRanges> <Rle> | <numeric> <character> #> juncwide1 chr1 200-10299 + | 200 sample_A #> juncwide2 chr1 200-20499 + | 50 sample_A #> juncwide3 chr1 10400-20499 + | 120 sample_A #> juncwide4 chr1 10400-39899 + | 80 sample_A #> juncwide5 chr1 20750-39899 + | 170 sample_A #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Some basic plotting functions are provided, which are rolled into the Sashimi plot. They can help visualize splice junction data directly, as needed.
The key function here is splicejam::grl2df(...,shape="junction") which configures wide junction arcs suitable for splicejam::geom_diagonal_wide_arc(). (Note that ggplot2::geom_polygon() would also work.)
# To plot junctions, use grl2df(..., shape="junction") junc_wide_df <-grl2df(test_junc_wide_gr, shape="junction"); ggWide1 <- ggplot(junc_wide_df, aes(x=x, y=y, group=gr_name, fill=gr_name, color=gr_name)) + splicejam::geom_diagonal_wide_arc() + colorjam::theme_jam() + colorjam::scale_fill_jam(alpha=0.7) + colorjam::scale_color_jam() + xlab("chr1") + ggtitle("junctions (full intron width)") print(ggWide1);

For more insight into creating a Sashimi plot manually, see the help for data splicejam::test_junc_wide_gr.
The main function prepareSashimi() is intended to assemble the sources of data together into one list of data.frame results which can be used to produce ggplot2 visualizations.
For this example, we use the Gria1 gene in mouse mm10, pulling from a web-accessible track hub. Note that the bigBed files were already converted to BED12 format using the UCSC tool bigBedToBed.
if (exists("flatExonsByGeneMm10")) { shGria1 <- prepareSashimi(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, minJunctionScore=100, sample_id=c("CA1_CB", "CA2_CB"), filesDF=filesDF); }
Alternatively, you can provide data in the form of GRanges objects, and not point to specific files.
Note that filesDF is still required for coverage in GRanges format, in order to define the "sample_id" for each coverage.
data(test_exon_wide_gr); data(test_junc_wide_gr); data(test_cov_wide_gr); testfilesDF <- data.frame(url="sample_A", type="coverage_gr", sample_id="sample_A", scale_factor=1); testfilesDF; #> url type sample_id scale_factor #> 1 sample_A coverage_gr sample_A 1 # Now prepare sashimi data sh1 <- prepareSashimi(GRangesList(TestGene1=test_exon_gr), filesDF=testfilesDF, gene="TestGene1", sample_id="sample_A", covGR=test_cov_gr, juncGR=test_junc_gr);
At this point, we can just plot the result.
# Plot sample Sashimi data plotSashimi(sh1);

Once the Sashimi data is prepared, the splicejam::plotSashimi() function is used to create a visualization that can be customized in several ways.
The output of splicejam::plotSashimi() is a ggplot object, which can be printed to create a plot, or can be modified as needed.
if (exists("shGria1")) { ggGria1 <- plotSashimi(shGria1, junc_color=alpha2col("goldenrod1", 0.4), junc_fill=alpha2col("goldenrod1", 0.4), show=c("coverage", "junction", "junctionLabels"), fill_scheme="sample_id"); print(ggGria1); }
To add a gene-exon model, and even transcript-exon models, we will use the cowplot package to arrange multiple ggplot objects together.
Some guidelines:
cowplot package.+ facet_wrap(~sample_id, ncol=1).The example below demonstrates combining a gene-exon plot with a Sashimi plot, meeting the required guidelines.
if (suppressPackageStartupMessages(require(cowplot)) && exists("ggGria1")) { cpGria1 <- cowplot::plot_grid( ggGria1 + theme(axis.text.x=element_blank()) + xlab(NULL), ggGria1exonsTx + ggtitle(NULL) + expand_limits(y=-2), ncol=1, align="v", axis="lr", rel_heights=c(4,4)); print(cpGria1); }
if (require(cowplot) && exists("ggGria1")) { cpWide <- cowplot::plot_grid(ncol=1, ggGria1 + scale_x_continuous() + theme(axis.text.x=element_blank()) + xlab(NULL), gg2full, axis="lr", align="v"); print(cpWide); }
There are two ways to zoom into a region of interest:
ggforce::facet_zoom()
The example below shows the ggforce::facet_zoom() capability.
Note the most important step is to transform the x-axis values when using xlim to subset the range. The splicejam::prepareSashimi() function returns x$ref2c$transform() which transforms genomic coordinates into compressed coordinates.
plotSashimi(sh1) + ggforce::facet_zoom(xlim=sh1$ref2c$transform(c(370, 410)));

In order to zoom multiple panels, a few steps must be followed:
ggforce::facet_zoom() method above is applied to each plotcowplot::plot_grid().if (exists("flatExonsByGeneMm10")) { zoomStart <- start(flatExonsByGeneMm10[["Gria1"]]["Gria1_exon14"]); zoomEnd <- end(flatExonsByGeneMm10[["Gria1"]]["Gria1_exon18a"]); # Now prepare sashimi data shGria1L <- lapply(c("CA1_CB", "CA2_CB"), function(iSample){ # devtools::load_all("/Users/wardjm/Projects/Ward/splicejam") shX <- prepareSashimi(gene="Gria1", flatExonsByGene=flatExonsByGeneMm10, minJunctionScore=100, sample_id=iSample, doStackJunctions=FALSE, filesDF=filesDF); }); ggGriaL <- lapply(shGria1L, function(shX){ ggX <- plotSashimi(shX) + ggforce::facet_zoom(xlim=shX$ref2c$transform(c(zoomStart, zoomEnd))); }); # now use cowplot if (require(cowplot)) { cpMulti <- do.call(cowplot::plot_grid, ggGriaL); print(cpMulti); } }