Farris et al RNA-seq methods

This document describes and demonstrates analysis methods used to analyze mouse hippocampal subregion- and compartment-specific transcriptome RNA-seq data (Farris et al manuscript in preparation.)

Raw sequence read data is provided through GEO GSE116343, with RNA-seq data available at GEO GSE116342. Data used for analysis in R is provided through an R package farrisdata, installed in R with devtools::install_github("jmw86069/farrisdata").

R environment setup

This document can be reproduced using the R code displayed in this document. The initial steps involve loading R packages into memory.

suppressPackageStartupMessages(library(knitr));
suppressPackageStartupMessages(library(kableExtra));
suppressPackageStartupMessages(library(dplyr));
suppressPackageStartupMessages(library(tidyselect));
suppressPackageStartupMessages(library(plotly));
suppressPackageStartupMessages(library(made4));
suppressPackageStartupMessages(library(matrixStats));
suppressPackageStartupMessages(library(ComplexHeatmap));
suppressPackageStartupMessages(library(SummarizedExperiment));

suppressPackageStartupMessages(library(ggplot2));
suppressPackageStartupMessages(library(ggrepel));

suppressPackageStartupMessages(library(jamba));
suppressPackageStartupMessages(library(jamma));
suppressPackageStartupMessages(library(colorjam));
suppressPackageStartupMessages(library(splicejam));
suppressPackageStartupMessages(library(farrisdata));

suppressPackageStartupMessages(library(png));
suppressPackageStartupMessages(library(Cairo));

Alternatively, one can download the Rmarkdown file “farrisSeq.Rmd” from the Github repository. Open R in the folder where the file was saved, then run this command in R:

rmarkdown::render(“farrisSeq.Rmd”, “html_document”, knit_root_dir=getwd(), output_dir=getwd())

Description of Input Data

The input data for this analysis workflow is derived from Salmon quantitation files, which were already imported and assembled into relevant data matrices.

Reference color substitutions

For consistency, a common set of colors are defined to represent the different sample groups.

colorSub <- c(
   CA1="orange",
   CA2="orangered3",
   CA3="dodgerblue",
   DG="purple",
   CB="grey75",
   DE="grey30",
   CA1_CB="orange1",
   CA1_DE="darkorange2",
   CA2_CB="orangered1",
   CA2_DE="orangered3",
   CA3_CB="dodgerblue1",
   CA3_DE="dodgerblue3",
   DG_CB="darkorchid1",
   DG_DE="darkorchid3",
   `TRUE`="skyblue",
   `FALSE`="indianred3");
colorSub <- rgb2col(col2rgb(colorSub, alpha=TRUE));
showColors(split(colorSub,
   rep(c("Factors","Groups","Factors"), c(6,8,2))));

Gene expression data

Salmon quantitation files were imported using the R Bioconductor package tximport, summarized to the gene level, and stored in an R object farrisGeneSE, which is available in the R data package farrisdata, installed using devtools::install_github("jmw86069/farrisdata").

## Load the gene data
farrisGeneSE;

## class: SummarizedExperiment 
## dim: 49341 24 
## metadata(4): design contrasts genes samples
## assays(2): counts raw_counts
## rownames(49341): -343C11.2 00R_AC107638.2 ... n-TSaga9 n-TStga1
## rowData names(4): probeID ProbeName GeneName geneSymbol
## colnames(24): CA2CB492 CA2CB496 ... DGDE496 DGDE502
## colData names(4): CellType Compartment AnimalID groupName

farrisGeneSE SummarizedExperiment format

farrisGeneSE is a SummarizedExperiment object with this content:

assays list containing the numeric matrix with gene rows, sample columns, with log2 median-normalized counts. Each row is named using a unique gene symbol derived from the Gencode GTF file. To access the normalized Salmon pseudocounts:

assays(farrisGeneSE)[["counts"]]

assays also contains the raw Salmon pseudocounts accessible, using this command: assays(farrisGeneSE)[["raw_counts"]]
rowData data.frame with gene rows, using the gene symbol derived from the Gencode GTF file.

as.data.frame(head(rowData(farrisGeneSE), 10)) %>%
   dplyr::mutate() %>%
   dplyr::select(probeID, geneSymbol,
      tidyselect::ends_with("_type"),
      tidyselect::contains("has")) %>%
   kable(escape=FALSE) %>%
   kable_styling()

probeID	geneSymbol
-343C11.2	-343C11.2
00R_AC107638.2	00R_AC107638.2
00R_Pgap2	00R_Pgap2
0610005C13Rik	0610005C13Rik
0610006L08Rik	0610006L08Rik
0610007P14Rik	0610007P14Rik
0610009B22Rik	0610009B22Rik
0610009E02Rik	0610009E02Rik
0610009L18Rik	0610009L18Rik
0610009O20Rik	0610009O20Rik

colData annotated data.frame with sample rows, describing the mouse hippocampal Region, CellType, Sample, and groupName.

	CellType	Compartment	AnimalID	groupName
CA2CB492	CA2	CB	492	CA2_CB
CA2CB496	CA2	CB	496
CA2CB502	CA2	CB	502
CA2DE492	CA2	DE	492	CA2_DE
CA2DE496	CA2	DE	496
CA2DE502	CA2	DE	502
CA1CB492	CA1	CB	492	CA1_CB
CA1CB496	CA1	CB	496
CA1CB502	CA1	CB	502
CA1DE492	CA1	DE	492	CA1_DE
CA1DE496	CA1	DE	496
CA1DE502	CA1	DE	502
CA3CB492	CA3	CB	492	CA3_CB
CA3CB496	CA3	CB	496
CA3CB502	CA3	CB	502
CA3DE492	CA3	DE	492	CA3_DE
CA3DE496	CA3	DE	496
CA3DE502	CA3	DE	502
DGCB492	DG	CB	492	DG_CB
DGCB496	DG	CB	496
DGCB502	DG	CB	502
DGDE492	DG	DE	492	DG_DE
DGDE496	DG	DE	496
DGDE502	DG	DE	502

metadata list describing the experimental design and contrasts used for statistical comparisons.
- design numeric matrix describing the sample design, suitable for use directly by methods from the R Bioconductor limma package.
- contrasts numeric matrix describing contrasts, suitable for use directly by limma.
- samples vector of samples used.
- genes vector of genes used.

Transcript expression data

Salmon quantitation files were imported using tximport as above, maintaining individual isoform abundances, then stored in an R object farrisTxSE, which is available in the R data package farrisdata, installed using devtools::install_github("jmw86069/farrisdata").

## Load the gene data
data(farrisTxSE);

farrisTxSE format

The farrisTxSE object is a named list with content similar to that described above for gene data:

assays list containing the numeric matrix with gene rows, sample columns, with log2 median-normalized counts. Each row is named using a unique gene symbol derived from the Gencode GTF file.
rowData data.frame with gene rows, using the gene symbol derived from the Gencode GTF file.

tx_df <- subset(as.data.frame(rowData(farrisTxSE)),
   geneSymbol %in% c("Gria1","Shank2","Ntrk2")) %>%
   dplyr::select(geneSymbol,
      transcript_id,
      tidyselect::ends_with("_type"),
      tidyselect::contains("has"),
      tidyselect::contains("tpm"));
tx_df <- mixedSortDF(tx_df,
   byCols=match(c("geneSymbol","TxDetectedByTPM"),
     colnames(tx_df))*c(1,-1));
colorSubGene <- c(colorSub,
   group2colors(unique(tx_df$geneSymbol),
      cRange=c(20,30),
      lRange=c(80,90)));
tx_df2 <- kable_coloring(tx_df,
      colorSub=colorSubGene,
      row_color_by="geneSymbol",
      verbose=FALSE,
      returnType="kable") %>%
   collapse_rows(columns=2, valign="middle") %>%
   row_spec(0, background="#DDDDDD")

tx_df2;

	geneSymbol	transcript_id	gene_type	transcript_type	has3UTR	TxHas3UTR	TxHasExt3UTR	GeneHasExt3UTR	TxHasCDS	TxDetectedByTPM
ENSMUST00000036315.15	Gria1	ENSMUST00000036315.15	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000094179.10		ENSMUST00000094179.10	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000125292.1		ENSMUST00000125292.1	protein_coding	protein_coding	TRUE	FALSE	TRUE	TRUE	TRUE	FALSE
ENSMUST00000151045.2		ENSMUST00000151045.2	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE
ENSMUST00000173531.1		ENSMUST00000173531.1	protein_coding	processed_transcript	TRUE	FALSE	FALSE	TRUE	FALSE	FALSE
ENSMUST00000079828.5	Ntrk2	ENSMUST00000079828.5	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000109838.8	Ntrk2	ENSMUST00000109838.8	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000105900.8	Shank2	ENSMUST00000105900.8	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000146006.2		ENSMUST00000146006.2	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE
ENSMUST00000097929.3		ENSMUST00000097929.3	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE
ENSMUST00000105902.7		ENSMUST00000105902.7	protein_coding	protein_coding	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE
ENSMUST00000136979.1		ENSMUST00000136979.1	protein_coding	processed_transcript	TRUE	FALSE	FALSE	TRUE	FALSE	FALSE
ENSMUST00000213146.1		ENSMUST00000213146.1	protein_coding	protein_coding	TRUE	FALSE	TRUE	TRUE	TRUE	FALSE

colData annotated data.frame with sample rows, describing the mouse hippocampal Region, CellType, Sample, and groupName. This data is identical to the sample table shown for farrisGeneSE above.
metadata list describing the experimental design and contrasts used for statistical comparisons.
- design numeric matrix describing the sample design, suitable for use directly by methods from the R Bioconductor limma package.
- contrasts numeric matrix describing contrasts, suitable for use directly by limma.
- samples vector of samples used.
- genes vector of genes used.

Related analysis parameters

Sickle read trimming

We used sickle version 1.33 to trim paired-end reads, with parameters -q 20 -l 20 to filter reads with quality threshold 20, and requiring reads at least 20 base length.

Cutadapt adapter trimming

We used cutadapt version 1.8.1 with parameter -m 20 and with standard Illumina Truseq adapter sequence AGATCGGAAGAGC.

STAR sequence alignment

We used STAR version 2.5.1b to align versus Gencode vM12 comprehensive GTF, with the following parameters, based upon ENCODE recommended RNA-seq parameters:

--outFilterType Normal
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--alignIntronMin 20
--alignIntronMax 1000000
--alignMatesGapMax 1000000
--outSAMtype BAM SortedByCoordinate Unsorted
--outWigType bedGraph
--outWigStrand Stranded
--outWigNorm None
--outReadsUnmapped Fastx
--outMultimapperOrder Random

Salmon quantitation

We used salmon version 0.9.1 for transcript quantitation, using Gencode vM12 comprehensive GTF file, using a index with kmer size 31, and the cutadapt adapter-trimmed reads, with seqBias, VBO, and gcBias options enabled.

featureCounts read counting

We used featureCounts version 1.5.1 to produce per-gene counts, using Gencode vM12 transcripts flattened per gene.

Gene expression analysis summary

BGA plot

bgaExprs <- assays(farrisGeneSE)$counts;
bgaExprs[bgaExprs < 7] <- 7;
bgaExprsUse <- bgaExprs[rowMins(bgaExprs) < rowMaxs(bgaExprs),,drop=FALSE];
nrow(bgaExprsUse);

## [1] 18005

salmonBga1 <- bga(dataset=bgaExprsUse,
   classvec=nameVector(colData(farrisGeneSE)$groupName,
      colnames(farrisGeneSE)),
   type="coa");

salmonBga1ly <- bgaPlotly3d(salmonBga1,
   axes=c(1,2,3),
   colorSub=colorSub,
   useScaledCoords=FALSE,
   drawVectors="none",
   drawSampleLabels=FALSE,
   superGroups=gsub("_.+", "", salmonBga1$fac),
   ellipseType="none",
   sceneX=0, sceneY=1, sceneZ=1,
   verbose=FALSE);
salmonBga1ly;

Comparison with CA1 published data by Nakayama, Ainsley, Cajigas

First we calculate group medians using gene expression data, to determine the set of genes detected above an expression threshold of 128 pseudocounts (log2 = 7).

farrisGeneGroupMedians <- rowGroupMeans(assays(farrisGeneSE)[["counts"]],
   groups=colData(farrisGeneSE)$groupName,
   useMedian=TRUE);

# Plot histogram of expression in CA1_CB and CA1_DE
plotPolygonDensity(farrisGeneGroupMedians[,c("CA1_CB","CA1_DE")],
   xlim=c(0,20),
   ylim=c(0,700),
   ablineV=7);

# Detected genes in CA1_CB and CA1_DE
CA1detected <- (farrisGeneGroupMedians[,"CA1_CB"] > 7 &
   farrisGeneGroupMedians[,"CA1_DE"] > 7);
CA1detCBandDE <- rownames(farrisGeneGroupMedians)[CA1detected];
length(CA1detCBandDE);

## [1] 10877

Next we load previously gene lists representing detected CA1 dendrite genes from published studies from Nakayama et al, Ainsley et al, and Cajigas et al. These genes are available in the farrisdata package, as described above.

Produce a 4-way Venn diagram showing the overlapping genes from these studies.

GeneListL <- list(
  Farris=CA1detCBandDE,
  Cajigas=CajigasGenes,
  Ainsley=AinsleyGenes,
  Nakayama=NakayamaGenes);
GeneListIM <- list2im(GeneListL);
vps <- limma::vennDiagram(GeneListIM,
   circle.col=rainbowJam(4));

vps2 <- venn(GeneListL,
   show.plot=FALSE);
## Retrieve specific overlaps like this:
vps2vennL <- attr(vps2, "intersections");
## vps2vennL[["Farris:Cajigas:Ainsley:Nakayama"]]

Correlation Centered by Compartment, refers to Figure 2

Cell Body

Per-sample correlation, centered across all CB samples. First, we use Salmon normalized pseudocounts, restricted to genes where at least one group mean value is above log2(7), which is >= 128 normalized pseudocounts.

Then data is centered per Compartment, so that CellBody samples are centered by subtracting the mean CellBody expression per gene, and Dendrite samples are centered by subtracting the mean Dendrite expression per gene. This step uses centerGeneData() with the argument centerGroups.

Next we prepare a data.frame with color coding to show the CellType and Compartment values.

## farrisGeneGroupMedians
## Pull out only cell body samples
iSamples <- colnames(farrisGeneSE);
iSamplesCB <- vigrep("CB", iSamples);
iSamplesDE <- vigrep("DE", iSamples);
#iSamplesGrp <- colnames(iMatrix7grp);
iSamplesGrp <- colnames(farrisGeneGroupMedians);
iSamplesGrpCB <- vigrep("CB", iSamplesGrp);
iSamplesGrpDE <- vigrep("DE", iSamplesGrp);

corrCutoff <- 7;
genesAboveCutoff <- (rowMaxs(farrisGeneGroupMedians) >= corrCutoff);
genesAboveCutoffCB <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
genesAboveCutoffDE <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
genesAboveCutoffBoth <- (genesAboveCutoffCB & genesAboveCutoffDE);

iMatrix7 <- assays(farrisGeneSE[genesAboveCutoff,])[["counts"]];
centerGroups <- gsub("^.+(DE|CB).+$",
   "\\1",
   iSamples);
iMatrix7ctr <- centerGeneData(iMatrix7,
   centerGroups=centerGroups);
iMatrix7ctrCor <- cor(iMatrix7ctr);


## Generate some color bars to annotate the heatmap
colDataColorsRepL <- as.data.frame(colData(farrisGeneSE)[,c("CellType","Compartment")]);
colDataColorsRep <- df2colorSub(colDataColorsRepL, colorSub=colorSub);

Heatmap using ComplexHeatmap:

pheat_colors <- list(Compartment=colorSub[c("CB","DE")],
   CellType=colorSub[c("CA1","CA2","CA3","DG")]);
pheat_breaks <- warpAroundZero(seq(from=-1, to=1, length.out=51), lens=1);

cBR <- circlize::colorRamp2(breaks=pheat_breaks,
   col=getColorRamp("RdBu_r", n=51));

colHA <- HeatmapAnnotation(df=colDataColorsRepL[iSamplesCB,2:1],
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=pheat_colors);
rowA <- rowAnnotation(df=colDataColorsRepL[iSamplesCB,2:1],
   col=pheat_colors,
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   show_legend=FALSE);

corHM <- Heatmap(iMatrix7ctrCor[iSamplesCB,iSamplesCB],
   name="Correlation",
   clustering_method_columns="ward.D",
   clustering_method_rows="ward.D",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA,
   left_annotation=rowA,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   col=cBR);
ComplexHeatmap::draw(corHM, merge_legend=TRUE);

Cell Body and Dendrite

Correlation showing CA2_CB correlates highest with CA2_DE, etc. for each CellType.

## farrisGeneGroupMedians[genesAboveCutoff,]
iMatrix7grp <- farrisGeneGroupMedians[genesAboveCutoffCB,];
centerGroupsGrp <- gsub("^.*(DE|CB).*$",
   "\\1",
   iSamplesGrp);
## 31oct2018 using mean instead of median
iMatrix7grpCtr <- centerGeneData(iMatrix7grp,
   centerGroups=centerGroupsGrp,
   mean=TRUE);
iMatrix7grpCtrCor <- cor(iMatrix7grpCtr);

## Generate some color bars to annotate the heatmap
colDataColorsGrpL <- data.frame(rbindList(
   strsplit(nameVector(iSamplesGrp), "_")));
colnames(colDataColorsGrpL) <- c("CellType","Compartment");
colDataColorsGrp <- df2colorSub(colDataColorsGrpL,
  colorSub=colorSub);

Heatmap using ComplexHeatmap:

colHA2 <- HeatmapAnnotation(df=colDataColorsGrpL[iSamplesGrp,2:1],
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=pheat_colors);
rowA2 <- rowAnnotation(df=colDataColorsGrpL[iSamplesGrp,2:1],
   col=pheat_colors,
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   show_legend=FALSE);

corHM2 <- Heatmap(iMatrix7grpCtrCor[iSamplesGrp,iSamplesGrp],
   name="Correlation",
   clustering_method_columns="ward.D",
   clustering_method_rows="ward.D",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA2,
   left_annotation=rowA2,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   col=cBR);
ComplexHeatmap::draw(corHM2, merge_legend=TRUE);

Splom plot, refers to Figure 2, and Supplemental Figure 3

Scatterplot showing the +1,+1 selection of genes, which helps define the gene lists in the next section.

corCols <- c("CA2_DE","CA2_CB");
cutCB <- round(digits=3, log2(1.5));
cutDE <- round(digits=3, log2(1.5));
splomL <- lapply(nameVector(c("CA1","CA2","CA3","DG")), function(k) {
   corCols <- vigrep(k, colnames(iMatrix7grpCtr));
   df1 <- as.data.frame(iMatrix7grpCtr[,corCols]);
   CBhit <- (abs(df1[,1]) > cutCB) * sign(df1[,1]);
   DEhit <- (abs(df1[,2]) > cutCB) * sign(df1[,2]);
   CBhitN <- paste0(k, "_", "CBhit");
   DEhitN <- paste0(k, "_", "DEhit");
   df1[,CBhitN] <- CBhit;
   df1[,DEhitN] <- DEhit;
   df1[,"GeneName"] <- rownames(df1);
   df1;
});
#table(splomL[[1]][,3:4])
splomLsub <- lapply(splomL, function(iDF){
   subset(iDF, !iDF[,3] %in% 0 | !iDF[,4] %in% 0)
});

splomDF2 <- rbindList(lapply(splomLsub, function(iDF){
   iDF[,"CellType"] <- gsub("_.+", "", colnames(iDF)[1]);
   colnames(iDF) <- c("CB","DE","CBhit","DEhit","GeneName","CellType");
   iDF;
}));
#splomDF2[,"row"] <- ifelse(splomDF2$CellType %in% c("CA1","CA2"), "CA1", "CA3");
splomDF2[,"CBDEhit"] <- paste0(splomDF2$CBhit,":",splomDF2$DEhit);

## Color-code the splom points
ccColors <- nameVector(rep("grey", 8), unique(splomDF2$CBDEhit));
ccColors[c("1:1","-1:-1")] <- "orangered3";
ccColors[c("-1:1","1:-1")] <- "grey";

## Gene labels
GeneHighlight <- c("Plch2","Rgs14","Necab2","1700024P16Rik");
splomDF2$label <- ifelse(splomDF2$GeneName %in% GeneHighlight,
  splomDF2$GeneName, "");


splomDF2[,"CBmaxGroupMean"] <- rowMaxs(farrisGeneGroupMedians[splomDF2$GeneName,iSamplesGrpCB]);
splomDF2[,"DEmaxGroupMean"] <- rowMaxs(farrisGeneGroupMedians[splomDF2$GeneName,iSamplesGrpDE]);
splomDF2[,"CBdet7"] <- (splomDF2[,"CBmaxGroupMean"] > 7)+0;
splomDF2[,"DEdet7"] <- (splomDF2[,"DEmaxGroupMean"] > 7)+0;

gg2 <- ggplot(splomDF2,
      aes(x=CB, y=DE, color=CBDEhit, fill=CBDEhit, GeneName=GeneName)) +
   geom_point(shape=21, size=2) +
   geom_vline(xintercept=c(-1,1)*0.585,
      linetype="dashed", color="grey20") +
   geom_hline(yintercept=c(-1,1)*0.585, linetype="dashed", color="grey20") +
   facet_wrap(facets=~CellType) +
   theme_jam() +
   scale_fill_manual(values=ccColors) +
   scale_color_manual(values=makeColorDarker(ccColors)) +
   ggrepel::geom_label_repel(aes(label=label),
      force=8,
      fill="white") +
   theme(legend.position="none");
gg2;

Microarray correlation heatmap.

Neuronal Gene Lists

(Description of the four gene/transcript lists)

## anyGenes1 is the full set of all genes with abundance > 1
anyGenes1 <- rownames(farrisGeneGroupMedians)[rowMaxs(farrisGeneGroupMedians) > 1];

## Filtering rules for CB and DE
## corrCutoff is 7
corFilterRuleCB <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
corFilterRuleDE <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
corFilterRule <- (corFilterRuleCB);

DEgenes7 <- rownames(farrisGeneGroupMedians)[corFilterRuleDE];
CBgenes7 <- rownames(farrisGeneGroupMedians)[corFilterRuleCB];
DEgenes7nonCB7 <- setdiff(DEgenes7, CBgenes7); # non-pyramidal
CBgenes7nonDE7 <- setdiff(CBgenes7, DEgenes7); # "non-DE genes"
DEgenes7andCB7 <- intersect(DEgenes7, CBgenes7);   # "DE genes"

## Define neuro gene lists
geneListsAll <- list(anyGenes1=anyGenes1,
   DEandCB=DEgenes7andCB7, # detected
   CBonly=CBgenes7nonDE7,  # detected only CB, not detected DE
   `non-pyramidal`=DEgenes7nonCB7,
   `CB1andDE1`=unique(subset(splomDF2,
      CBDEhit %in% "1:1" &
      CBdet7 %in% c(1) &
      DEdet7 %in% c(1))$GeneName)
   );
#lengths(geneListsAll);
geneListsDF <- data.frame(GeneList=names(geneListsAll),
   GeneCount=format(lengths(geneListsAll), big.mark=",", trim=TRUE));
rownames(geneListsDF) <- NULL;
colorSubGeneLists <- nameVector(rainbowJam(length(geneListsAll)),
   names(geneListsAll));
geneListsDF2 <- kable_coloring(geneListsDF,
      colorSub=colorSubGeneLists,
      row_color_by="GeneList",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
geneListsDF2;

GeneList	GeneCount
anyGenes1	28,075
DEandCB	12,265
CBonly	548
non-pyramidal	1,871
CB1andDE1	1,055

Heatmap per-gene.

Differential gene expression using limma

In preparation.

Transcript analysis summary

Definition of detected transcripts

We refer to the function defineDetectedTx() in the jampack R package. The function takes normalized counts, normalized TPM values, sample group information, and several cutoff values:

cutoffTxPctMax=10 requires a transcript isoform to be at least 10% of the highest isoform present in the same sample.
cutoffTxExpr=32 requires a transcript isoform to have at least 32 normalized counts.
cutoffTxTPMExpr=2 requires a transcript isoform to have at least a TPM value of 2.
All of the above criteria must be met for an isoform to be considered “detected.”

#refreshFunctions("farrisSalmonWWS");
detectedTxTPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE)[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   zeroAsNA=FALSE,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTx <- detectedTxTPML$detectedTx;
numDetectedTx <- length(detectedTx);
detectedGenes <- mixedSort(unique(rowData(farrisTxSE[detectedTx,])$geneSymbol));
numDetectedGenes <- length(detectedGenes);

The code above defined 39,552 detected transcripts by the given criteria, covering 17,358 unique gene symbols.

Transcript type per gene list

e.g. protein_coding, etc.

First, as a point of organizing the transcript-gene associations, we read the Gencode GTF file and produce a data.frame with the transcript-gene data.

Note: This step downloads the Gencode GTF file for version vM12, then extracts data from that file. Once the data.frame is extracted, it is stored in a text file for re-use.

vM12gtf <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M12/gencode.vM12.annotation.gtf.gz";
vM12gtfBase <- basename(vM12gtf);
if (!file.exists(vM12gtfBase)) {
   curl::curl_download(url=vM12gtf,
      destfile=vM12gtfBase);
}
tx2geneFile <- file.path(".", "vM12gtf.tx2geneDF.txt");
if (!file.exists(tx2geneFile)) {
   tx2geneDF <- makeTx2geneFromGtf(GTF=vM12gtfBase,
      verbose=FALSE);
   write.table(file=tx2geneFile,
      x=tx2geneDF,
      sep="\t",
      quote=FALSE,
      na="",
      col.names=TRUE,
      row.names=FALSE);
} else {
   tx2geneDF <- read.table(tx2geneFile,
      sep="\t",
      check.names=FALSE,
      as.is=TRUE,
      fill=TRUE,
      quote="\"",
      allowEscapes=FALSE,
      comment.char="",
      header=TRUE,
      stringsAsFactor=FALSE);
}

Note the logic to define various transcript types is applied below, after the definition of Neuronal Transcript Lists.

Preparation of 3’UTR data from Gencode

We imported Gencode vM12 comprehensive GTF into R using R Bioconductor GenomicFeatures package, with which we derived 3’UTR regions, using GenomicFeatures::makeTxDbFromGFF() and GenomicFeatures::threeUTRsByTranscript(), respectively.

This step re-uses the Gencode GTF file downloaded above, then converts it to a "Txdb" object, which is a SQLite relational database format. This database is saved into a file so it can be recalled without re-creating the file again.

localDb <- file.path(".", "vM12gtf.txdb");
if (!file.exists(localDb)) {
   vM12txdb <- GenomicFeatures::makeTxDbFromGFF(vM12gtfBase);
   AnnotationDbi::saveDb(x=vM12txdb, file=localDb);
} else {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}

## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: gencode.vM12.annotation.gtf.gz
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # transcript_nrow: 122968
## # exon_nrow: 474052
## # cds_nrow: 236484
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2019-10-06 13:59:19 -0400 (Sun, 06 Oct 2019)
## # GenomicFeatures version at creation time: 1.36.4
## # RSQLite version at creation time: 2.1.2
## # DBSCHEMAVERSION: 1.2

## This snippet simply confirms the vM12txdb object is valid
## otherwise it loads from the source file. This step is
## necessary to resume a previously cached knitr session.
if (!DBI::dbIsValid(AnnotationDbi::dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
   printDebug("Refreshed vM12txdb.");
   printDebug('print(find("vM12txdb"))');
   print(find("vM12txdb"));
} else {
   printDebug("Did not refresh vM12txdb.");
}

## Loading required package: GenomicFeatures

## Loading required package: AnnotationDbi

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:plotly':
## 
##     select

## The following object is masked from 'package:dplyr':
## 
##     select

## ##  (21:14:57) 07Oct2019:  Refreshed vM12txdb. 
## ##  (21:14:57) 07Oct2019:  print(find("vM12txdb")) 
## [1] ".GlobalEnv"

## Check for valid vM12txdb since it is not cached properly by knitr
if (!DBI::dbIsValid(dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}
## Grab three prime UTR into a GRangesList
gencode3utr <- GenomicFeatures::threeUTRsByTranscript(vM12txdb,
   use.names=TRUE);
values(gencode3utr@unlistData)[,"transcript_id"] <- rep(names(gencode3utr), lengths(gencode3utr));
txMatch <- match(values(gencode3utr@unlistData)[,"transcript_id"], tx2geneDF$transcript_id);
values(gencode3utr@unlistData)[,"gene_id"] <- tx2geneDF[txMatch,"gene_id"];
values(gencode3utr@unlistData)[,"gene_name"] <- tx2geneDF[txMatch,"gene_name"];
values(gencode3utr@unlistData)[,"gene_type"] <- tx2geneDF[txMatch,"gene_type"];
values(gencode3utr@unlistData)[,"transcript_type"] <- tx2geneDF[txMatch,"transcript_type"];
names(gencode3utr@unlistData) <- pasteByRow(values(gencode3utr@unlistData)[,c("transcript_id","exon_rank")]);

## Add flags to tx2geneDF, whether a gene or tx has 3'UTR
tx2geneDF[,"GeneHas3UTR"] <- tx2geneDF$gene_id %in% 
   values(gencode3utr@unlistData)[,"gene_id"];
tx2geneDF[,"TxHas3UTR"] <- tx2geneDF$transcript_id %in%
   values(gencode3utr@unlistData)[,"transcript_id"];

## Summarize widths of three-prime-utr exons by transcript
threePrimeTx <- sum(width(gencode3utr));
GencodeVM12mm10threeUtrLength <- sum(width(gencode3utr));

## non-mito detectedTx
detectedTxMito <- subset(tx2geneDF[match(detectedTx, tx2geneDF$transcript_id),],
   grepl("^mt-", gene_name))$transcript_id;
detectedTxNonMito <- setdiff(detectedTx, detectedTxMito);

Neuronal Transcript Lists

The code below creates transcript lists equivalent to the Neuronal Gene Lists above. The main distinction is that the filtering rules are applied to transcript-level data, as opposed to gene-level summary data above. While a gene may be present in one category, perhaps only a subset of its transcript isoforms may be present in that category. Similarly, individual isoforms from the same gene may be present in multiple distinct neuronal categories.

The transcript-level subsets are used to produce the 3-prime UTR and CAI plots in subsequent figures.

## First calculate per-transcript group expression values
farrisTxGroupMedians <- rowGroupMeans(assays(farrisTxSE)[["counts"]],
   groups=colData(farrisTxSE)$groupName,
   useMedian=TRUE);
farrisTxTPMGroupMedians <- rowGroupMeans(assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   useMedian=TRUE);

## Define TPM data matrix
## Note: we add +1 to normalized TPM values, which keeps the minimum
## values above zero. Consequently, we reset values which were already
## zero to zero, so they are not adjusted.
iMatrixTxTPM <- assays(farrisTxSE)[["tpm"]] + 1;
iMatrixTxTPM[assays(farrisTxSE)[["raw_tpm"]] == 0] <- 0

## Define cutoffs using CB and DE samples
## "Any" uses permissive cutoffs:
## pseudocounts >= 5, TPM >= 1
## no requirement for isoforms to be a certain percent the max per gene
detectedTxTPManyL <- defineDetectedTx(
#   iMatrixTx=assays(farrisTxSE[,iSamplesCB])[["counts"]],
#   iMatrixTxTPM=assays(farrisTxSE[,iSamplesCB])[["tpm"]],
#   groups=colData(farrisTxSE[,iSamplesCB])$groupName,
   iMatrixTx=assays(farrisTxSE)[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   cutoffTxPctMax=0,
   cutoffTxExpr=5,
   cutoffTxTPMExpr=1,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
## Define cutoffs using cell body (CB) samples
## Requires pseudocounts >= 32, TPM >= 2
## and isoforms must be >= 10 percent the max per gene
detectedTxCBTPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE[,iSamplesCB])[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE[,iSamplesCB])[["tpm"]],
   groups=colData(farrisTxSE[,iSamplesCB])$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTxTPMCB <- detectedTxCBTPML$detectedTx;
## Define cutoffs using cell body (DE) samples
## Requires pseudocounts >= 32, TPM >= 2
## and isoforms must be >= 10 percent the max per gene
detectedTxDETPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE[,iSamplesDE])[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE[,iSamplesDE])[["tpm"]],
   groups=colData(farrisTxSE[,iSamplesDE])$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTxTPMDE <- detectedTxDETPML$detectedTx;

## Center data within Compartment
farrisTxTPMGroupMediansCtr2 <- centerGeneData(farrisTxTPMGroupMedians,
   centerGroups=gsub("^.+_", "", colnames(farrisTxTPMGroupMedians)),
   mean=TRUE, showGroups=FALSE);


## anyTx1 is the full set of all transcripts with abundance > 1
anyTx1 <- rownames(farrisTxGroupMedians)[rowMaxs(farrisTxGroupMedians) > 1];

## Filtering rules for CB and DE
## corrCutoff is 7
corFilterRuleTxCB <- (rowMaxs(farrisTxGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
corFilterRuleTxDE <- (rowMaxs(farrisTxGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
corFilterRuleTx <- (corFilterRuleTxCB);

## Subsets of detected transcripts
DEtx7 <- rownames(farrisTxGroupMedians)[corFilterRuleTxDE];
CBtx7 <- rownames(farrisTxGroupMedians)[corFilterRuleTxCB];
DEtx7nonCB7 <- setdiff(DEtx7, CBtx7); # non-pyramidal
CBtx7nonDE7 <- setdiff(CBtx7, DEtx7); # "non-DE genes"
DEtx7andCB7 <- intersect(DEtx7, CBtx7);   # "DE genes"

## Plot based upon TPM or counts
splomTxL <- lapply(nameVector(c("CA1","CA2","CA3","DG")), function(k) {
   corCols <- vigrep(k, colnames(farrisTxTPMGroupMediansCtr2));
   #df1 <- as.data.frame(farrisTxTPMGroupMediansCtr2[corFilterRuleTx,corCols]);
   df1 <- as.data.frame(farrisTxTPMGroupMediansCtr2[detectedTxTPMCB,corCols]);
   CBhit <- (abs(df1[,1]) > cutCB) * sign(df1[,1]);
   DEhit <- (abs(df1[,2]) > cutCB) * sign(df1[,2]);
   CBhitN <- paste0(k, "_", "CBhit");
   DEhitN <- paste0(k, "_", "DEhit");
   df1[,CBhitN] <- CBhit;
   df1[,DEhitN] <- DEhit;
   df1[,"GeneName"] <- rownames(df1);
   df1;
});
## Remove entries with (0,0) no change in any condition
splomTxLsub <- lapply(splomTxL, function(iDF){
   subset(iDF, !iDF[,3] %in% 0 | !iDF[,4] %in% 0)
});
## Create a data.frame for ggplot
splomTxDF2 <- rbindList(lapply(splomTxLsub, function(iDF){
   iDF[,"CellType"] <- gsub("_.+", "", colnames(iDF)[1]);
   colnames(iDF) <- c("CB","DE","CBhit","DEhit","transcript_id","CellType");
   iDF;
}));
splomTxDF2[,"Gene"] <- tx2geneDF[match(splomTxDF2$transcript_id, tx2geneDF$transcript_id),"gene_name"];
splomTxDF2[,"CBDEhit"] <- paste0(splomTxDF2$CBhit,
   ":",
   splomTxDF2$DEhit);
## Add rowMaxs for CB and DE to farrisSplomDF2
splomTxDF2[,"CBmaxGroupMedian"] <- rowMaxs(farrisTxGroupMedians[splomTxDF2$transcript_id, iSamplesGrpCB]);
splomTxDF2[,"DEmaxGroupMedian"] <- rowMaxs(farrisTxGroupMedians[splomTxDF2$transcript_id, iSamplesGrpDE]);
splomTxDF2[,"CBdet7"] <- (splomTxDF2[,"CBmaxGroupMedian"] > corrCutoff)+0;
splomTxDF2[,"DEdet7"] <- (splomTxDF2[,"DEmaxGroupMedian"] > corrCutoff)+0;

## Add TPM detected columns
splomTxDF2[,"CBmaxGroupMedianTPM"] <- rowMaxs(farrisTxTPMGroupMedians[splomTxDF2$transcript_id, iSamplesGrpCB]);
splomTxDF2[,"DEmaxGroupMedianTPM"] <- rowMaxs(farrisTxTPMGroupMedians[splomTxDF2$transcript_id, iSamplesGrpDE]);
splomTxDF2[,"CBdetTPM"] <- (splomTxDF2$transcript_id %in% detectedTxTPMCB)+0;
splomTxDF2[,"DEdetTPM"] <- (splomTxDF2$transcript_id %in% detectedTxTPMDE)+0;


## Put it all together into tx lists
txListsAll <- list(anyGenes1=anyTx1,
   DEandCB=DEtx7andCB7, # detected
   CBonly=CBtx7nonDE7,  # detected only CB, not detected DE
   `non-pyramidal`=DEtx7nonCB7,
   `CB1andDE1`=unique(subset(splomTxDF2,
      CBDEhit %in% "1:1" &
      CBdet7 %in% c(1) & #DEdetTPM %in% c(0) &
      DEdet7 %in% c(1))$transcript_id)
   );

#########################################################
## Same logic as above, using TPM instead of pseudocounts
detectedTxTPMany <- detectedTxTPManyL$detectedTx;
detectedTxTPMCB <- detectedTxCBTPML$detectedTx;
detectedTxTPMDE <- detectedTxDETPML$detectedTx;
txListsAllTPM <- list(anyGenes1=detectedTxTPMany,
   DEandCB=intersect(detectedTxTPMCB, detectedTxTPMDE),
   CBonly=setdiff(detectedTxTPMCB, detectedTxTPMDE),
   `non-pyramidal`=setdiff(detectedTxTPMDE, detectedTxTPMCB),
   `CB1andDE1`=unique(subset(splomTxDF2,
      CBDEhit %in% "1:1" &
      CBdetTPM %in% c(1) &
      DEdetTPM %in% c(1))$transcript_id)
   );

Transcript types by transcript list

ncTx2geneDFdetTPM <- subset(tx2geneDF,
   !gene_type %in% c("protein_coding","TEC","Mt_tRNA","Mt_rRNA") &
   transcript_id %in% detectedTx);
## nc detectedTx has 1548 rows, originally it had 2657 rows
## Convert to list
txListsAllTPMncDet <- lapply(txListsAllTPM, function(i){
   intersect(i, ncTx2geneDFdetTPM$transcript_id);
});
## lengths(txListsAllTPMncDet)
## 1436, 747, 381, 420, 94

## % Tx in txListsAll that are protein_coding and have 3'UTRs
cdsTx2geneDFdetTPM <- subset(tx2geneDF,
   gene_type %in% c("protein_coding") &
   transcript_id %in% detectedTx);
txListsAllTPMcdsDet <- lapply(txListsAllTPM, function(i){
   iDF <- subset(cdsTx2geneDFdetTPM, transcript_id %in% i);
   iDF$transcript_id;
});
## lengths(txListsAllTPMcdsDet)
## 24120, 17104, 3424, 5259, 1541

txListsAllTPMcdsDetHas3utr <- lapply(txListsAllTPM, function(i){
   iDF <- subset(cdsTx2geneDFdetTPM, transcript_id %in% i);
   iV <- table(iDF$TxHas3UTR);
   c(format(iV, scientific=FALSE, big.mark=","),
      pct=round(1000*nrow(subset(iDF, TxHas3UTR)) / nrow(iDF))/10);
});
#txListsAllTPMcdsDetHas3utr;
txListsAllTPMcdsDetHas3utrV <- unlist(lapply(txListsAllTPMcdsDetHas3utr, function(i){
   as.numeric(gsub(",", "", i[["TRUE"]]));
}));

## Subset txListsAllTPM for detected transcripts
txListsAllTPMdet <- lapply(txListsAllTPM, function(i){
   intersect(i, detectedTx);
});


## Make a small table with txListsAll counts
txListsAllDF <- data.frame(geneListsAll=lengths(geneListsAll),
   txListsAllTPM=lengths(txListsAllTPM),
   txListsAllTPMdet=lengths(txListsAllTPMdet),
   non_coding=lengths(txListsAllTPMncDet),
   pct_non_coding=lengths(txListsAllTPMncDet)/lengths(txListsAllTPMdet),
   protein_coding=lengths(txListsAllTPMcdsDet),
   pct_protein_coding=lengths(txListsAllTPMcdsDet)/lengths(txListsAllTPMdet),
   protein_coding_with3utr=txListsAllTPMcdsDetHas3utrV,
   pct_protein_coding_with3utr=txListsAllTPMcdsDetHas3utrV/lengths(txListsAllTPMcdsDet),
   protein_coding_without3utr=lengths(txListsAllTPMcdsDet)-txListsAllTPMcdsDetHas3utrV
);
txListsAllDF$otherNc <- txListsAllDF[,"txListsAllTPMdet"] - (txListsAllDF[,"non_coding"] + txListsAllDF[,"protein_coding"]);

txListsAllDFuse <- t(as.matrix(txListsAllDF[,c(
   "protein_coding_with3utr","protein_coding_without3utr",
   "non_coding","otherNc"
   )]));

## Scale to 100%
txListsAllDFuseScaled <- t(t(txListsAllDFuse) / colSums(txListsAllDFuse))*100;

txListsAllDFuseScaled2 <- melt(txListsAllDFuseScaled);
txListsAllDFuseScaled2$Var1 <- factor(txListsAllDFuseScaled2$Var1,
   levels=unique(provigrep(c("other","noncoding|non.coding","without","with","."),
      as.character(txListsAllDFuseScaled2$Var1))));

txListsAllDFuseScaled2$group <- pasteByRowOrdered(txListsAllDFuseScaled2[,c("Var2","Var1")]);

Properties of hippocampus dendritic RNAs, supplemental figure 7

supp7A Heatmap of high-confidence dendritic, cell body retained, and non-pyramidal RNA lists

A heatmap is used to show the pattern of gene expression among the Neuronal Gene Lists defined above.

## Display only genes from the 
glWhich <- c("CBonly", "non-pyramidal", "CB1andDE1");
gl4 <- unname(unlist(geneListsAll[glWhich]));
## Make a vector of gene list names, named by gene
gl4names <- nameVector(rep(names(geneListsAll[glWhich]),
   lengths(geneListsAll[glWhich])),
   gl4);
iSamplesGrpO <- mixedSort(iSamplesGrp);
## Use all gene group median values
iM4 <- farrisGeneGroupMedians[gl4,iSamplesGrpO];

## Column annotations using each experimental factor (not used)
colHA3 <- HeatmapAnnotation(df=colDataColorsGrpL[iSamplesGrpO,2:1],
   show_annotation_name=TRUE,
   col=pheat_colors);

## Column annotations using group name and color
colHA3grp <- HeatmapAnnotation(Group=rownames(colDataColorsGrpL[iSamplesGrpO,]),
   annotation_height=unit(7, "mm"), height=unit(7, "mm"),
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=list(Group=colorSub[iSamplesGrpO]));

## Row annotations as colors with no labels
rowHA3 <- HeatmapAnnotation(which="row",
   df=data.frame(List=gl4names[rownames(iM4)]),
   annotation_width=unit(7, "mm"), width=unit(7, "mm"),
   col=list(List=colorSubGeneLists[names(geneListsAll)[3:5]]));
## Row annotations with block labels
rowHA3B <- rowAnnotation(
   foo=anno_block(gp=gpar(fill=colorSubGeneLists[sort(names(geneListsAll)[3:5])]),
      labels=sort(names(geneListsAll)[3:5]),
      labels_gp=gpar(col="white"))
   );

## Define color ramp
pheat_expr_breaks <- warpAroundZero(seq(from=-2, to=18, length.out=51),
   baseline=5,
   lens=4);
range(pheat_expr_breaks);

## [1]  0.512 18.000

cBRexpr <- circlize::colorRamp2(breaks=pheat_expr_breaks,
   colors=colorRampPalette(tail(warpRamp(getColorRamp("RdBu_r", n=15), lens=0), 12))(51));

## Create the heatmap
splitF <- factor(gl4names[rownames(iM4)],
   levels=(c("CBonly","non-pyramidal","CB1andDE1")));
glHM <- Heatmap(iM4[,iSamplesGrpO],
   name="Log2 Counts",
   show_parent_dend_line=FALSE,
   cluster_columns=FALSE,
   show_row_names=FALSE,
   clustering_method_rows="ward.D2",
   clustering_distance_rows="euclidean",
   #clustering_method_rows="complete",
   #clustering_distance_rows="maximum",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA3grp,
   left_annotation=rowHA3B,
   split=factor(gl4names[rownames(iM4)], levels=names(geneListsAll)[3:5]),
   gap=unit(3, "mm"),
   use_raster=TRUE,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   raster_device="CairoPNG",
   col=cBRexpr);
ComplexHeatmap::draw(glHM + rowHA3B, merge_legend=TRUE);

## Print the number of rows and columns in the heatmap
## (broken in non-interactive sessions)
if (1 == 2) {
   legend(x=grconvertX(0.00, "ndc"),
      y=grconvertY(0.05, "ndc"),
      legend=paste(
         dim(iM4[,iSamplesGrpO]),
         c("rows", "columns")),
      cex=0.8,
      xpd=NA,
      bg="transparent",
      box.col="transparent");
}

supp7B Percent breakdown of transcript type by transcript list

A stacked bar chart is used to show the relative abundances of transcript types, for each Neuronal Transcript List.

## Define the colors by type
nType <- nrow(txListsAllDFuse);
nList <- ncol(txListsAllDFuse);
txColorSubL <- split(unname(
   color2gradient(n=nType,
      reverseGradient=FALSE,
      gradientWtFactor=1,
      rainbowJam(nList))),
   factor(rep(colnames(txListsAllDFuse), each=nType),
      levels=colnames(txListsAllDFuse)));
txColorSubM <- do.call(cbind, txColorSubL);
rownames(txColorSubM) <- rownames(txListsAllDFuse);
colnames(txColorSubM) <- NULL;
par("mar"=c(4,14,4,2));
imageByColors(txColorSubM,
   main="Transcript Type Color Key");

txColorSubV <- nameVector(unlist(txColorSubL),
   txListsAllDFuseScaled2$group);

ggTxType <- ggplot(txListsAllDFuseScaled2, aes(x=Var2, fill=group, color=group)) +
   geom_bar(aes(weight=value)) +
   scale_fill_manual(values=txColorSubV) + 
   scale_color_manual(values=makeColorDarker(darkFactor=20, txColorSubV)) +
   theme_jam(base_size=20) +
   theme(
      panel.grid.major.x=element_blank(),
      panel.grid.minor.x=element_blank()) +
   ylab("Percent") +
   xlab("Transcript List") +
   ggtitle("txListsAllTPM by gene_type and 3'UTR") +
   theme(legend.position="none");
print(ggTxType);

supp7C Violin plot of annotated 3’UTR lengths in each RNA list

We plotted 3’UTR lengths as violin plots using only detected transcripts based upon TPM criteria described elsewhere.

## Subset the txListsAll by those with three prime utr data
txLists3utr <- lapply(txListsAll, function(i){
   intersect(i, names(GencodeVM12mm10threeUtrLength));
});

## Pull pre-computed data from the farrisdata package
data(GencodeVM12mm10cai);
data(GencodeVM12mm10cdsLength);
data(GencodeVM12mm10caiCtLow);
data(GencodeVM12mm10caiCtBad);

## Create data in format for violin plots using ggplot2
violin3utrDF <- data.frame(check.names=FALSE,
   stringsAsFactors=FALSE,
   Tx=unlist(txLists3utr),
   gene_name=tx2geneDF[match(unlist(txLists3utr), tx2geneDF$transcript_id),"gene_name"],
   Subset=factor(rep(names(txLists3utr), lengths(txLists3utr)),
      levels=names(txLists3utr)),
   detectedTx=ifelse(unlist(txLists3utr) %in% detectedTxNonMito,
      "Detected Tx", "Undetected Tx"),
   threeUtrLength=GencodeVM12mm10threeUtrLength[unlist(txLists3utr)],
   txCai=GencodeVM12mm10cai[unlist(txLists3utr)],
   cdsLength=rmNA(naValue=0, GencodeVM12mm10cdsLength[unlist(txLists3utr)])
   );
## Now make a version that includes Tx counts in the labels
violin3utrDF$SubsetCount <- splicejam::factor2label(
   pasteByRowOrdered(violin3utrDF[,c("Subset","detectedTx")]),
   itemName="txs");
levels(violin3utrDF$SubsetCount) <- gsub("_([^(]+) Tx (.+) (txs)", " \\2 \\1 \\3",
   levels(violin3utrDF$SubsetCount))
colorSubGeneLists2 <- unlist(lapply(names(colorSubGeneLists), function(i){
   j <- vigrep(i, levels(violin3utrDF$SubsetCount));
   nameVector(rep(colorSubGeneLists[i], length(j)), j);
}));
## Note some tricky math, we take the median log length, then exponentiate
violin3utrDF$SubsetLabel <- splicejam::factor2label(
   pasteByRowOrdered(violin3utrDF[,c("Subset","detectedTx")]),
   types="count",
   aggFun=function(x,...){2^median(log2(x[x>10]),na.rm=TRUE)},
   valuesL=violin3utrDF[,"threeUtrLength",drop=FALSE],
   itemName="txs");
violin3utrDF$log3utr <- log10(1+violin3utrDF$threeUtrLength);

## Subset for specific neuronal groups
violin3utrDFsub <- subset(violin3utrDF, Subset %in% names(txListsAll)[c(3,4,5)]);

## 3 prime UTR length violin plots using ggplot2
gg3utr <- ggplot(data=subset(violin3utrDF, detectedTx %in% "Detected Tx"),
   aes(x=Subset, y=threeUtrLength)) +
   scale_y_continuous(trans="log10",
      breaks=c(10,20,50,100,200,300,500,750,1000,1500,2000,5000,10000,20000,50000),
      limits=10^c(1,4.35)) +
   #annotation_logticks(sides="lr") +
   facet_wrap(~detectedTx) +
   geom_violin(aes(fill=SubsetCount),
      scale="area",
      size=1,
      alpha=0.8,
      trim=TRUE,
      draw_quantiles=c(0.5)) +
   theme_jam(grid.minor.size=0) +
   scale_color_manual(values=colorSubGeneLists2) +
   scale_fill_manual(values=colorSubGeneLists2);
gg3utr;

Statistical tests comparing three prime UTR lengths

## Wilcoxon t-test comparing the percent CAI
threeUtrCBonly <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "CBonly")$threeUtrLength;
threeUtrNonPyr <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "non-pyramidal")$threeUtrLength;
threeUtrCB1andDE1 <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "CB1andDE1")$threeUtrLength;

## parametric t-tests, using log10-transformed UTR lengths
t35utr <- t.test(log10(threeUtrCBonly), log10(threeUtrCB1andDE1));
t34utr <- t.test(log10(threeUtrCBonly), log10(threeUtrNonPyr));
t45utr <- t.test(log10(threeUtrNonPyr), log10(threeUtrCB1andDE1));

## Wilcoxon non-parametric t-tests
wt35utr <- wilcox.test(threeUtrCBonly, threeUtrCB1andDE1);
wt34utr <- wilcox.test(threeUtrCBonly, threeUtrNonPyr);
wt45utr <- wilcox.test(threeUtrNonPyr, threeUtrCB1andDE1);

TtestStatsUtr <- list(t35utr=t35utr,
   t34utr=t34utr,
   t45utr=t45utr,
   wt35utr=wt35utr,
   wt34utr=wt34utr,
   wt45utr=wt45utr);
TtestStatsUtrDF <- rbindList(lapply(names(TtestStatsUtr), function(t1){
   t <- TtestStatsUtr[[t1]];
   data.frame(data.name=t$data.name,
      p.value=t$p.value,
      method=t$method);
}));

## Print stats table
colorSubMethod <- group2colors(levels(TtestStatsUtrDF$method));
TtestStatsUtrDF$p.value <- format(TtestStatsUtrDF$p.value,
   digits=3,
   trim=TRUE);
TtestStatsUtrDF2 <- kable_coloring(TtestStatsUtrDF,
      colorSub=colorSubMethod,
      row_color_by="method",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
TtestStatsUtrDF2;

data.name	p.value	method
log10(threeUtrCBonly) and log10(threeUtrCB1andDE1)	7.46e-26	Welch Two Sample t-test
log10(threeUtrCBonly) and log10(threeUtrNonPyr)	1.29e-16	Welch Two Sample t-test
log10(threeUtrNonPyr) and log10(threeUtrCB1andDE1)	2.33e-06	Welch Two Sample t-test
threeUtrCBonly and threeUtrCB1andDE1	2.41e-29	Wilcoxon rank sum test with continuity correction
threeUtrCBonly and threeUtrNonPyr	1.25e-17	Wilcoxon rank sum test with continuity correction
threeUtrNonPyr and threeUtrCB1andDE1	2.59e-08	Wilcoxon rank sum test with continuity correction

supp7D Proximal versus distal 3’UTR detected in each RNA list per compartment

ALE elements are defined by unique 5-prime ends of each 3-prime UTR, which allows for overlapping regions downstream, but protects against a prevalent annotation artifact seen with automated gene transcript models. For example, certain genes have a premature alternative STOP codon annotated early in the CDS, causing some isoforms to annotate nearly the entire transcript as 3-prime UTR. We found that taking the 5-prime end of 3-prime UTR regions was able to condense shared 3-prime UTR regions for transcript isoforms per gene, sufficient for this analysis.

The resulting data matrix contains rows of ALE elements per gene. The subsequent violin plot compares the ALE2 expression to ALE1 expression, where ALE1 has been subtracted to yield a log2 fold change.

## Check for valid vM12txdb since it is not cached properly by knitr
if (!DBI::dbIsValid(dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}
## Define transcripts with ALE regions
formatInt <- function(x,...){format(x, scientific=FALSE, trim=TRUE, big.mark=",", ...);}

aleL <- tx2ale(
   threeUtrGRL=gencode3utr,
   txdb=vM12txdb,
   detectedTx=detectedTx,
   iMatrix=assays(farrisTxSE)[["tpm"]],
   method="flank",
   verbose=TRUE,
   tx2geneDF=tx2geneDF);

## ##  (14:31:01) 06Oct2019:  gencode2ale(): Using threeUtrGRL as-is. 
## ##  (14:31:01) 06Oct2019:  gencode2ale(): Keeping 25,983 transcripts found in detectedTx. 
## ##  (14:31:02) 06Oct2019:  gencode2ale(): Using first 3'UTR stranded exon to determine ALEs. 
## ##  (14:31:02) 06Oct2019:  getFirstStrandedFromGRL(): sorting grl 
## ##  (14:31:02) 06Oct2019:  sortGRL(): splitting GRanges into list 
## ##  (14:31:02) 06Oct2019:  getFirstStrandedFromGRL(): performing direct logic 
## ##  (14:31:02) 06Oct2019:  gencode2ale(): Splitting ranges by gene. 
## ##  (14:31:02) 06Oct2019:  gencode2ale(): Reducing ranges. 
## ##  (14:31:02) 06Oct2019:  gencode2ale(): Annotating reduced ranges with transcript_id per gene. 
## ##  (14:31:05) 06Oct2019:  gencode2ale(): Assigning stranded numbers to the ranges. 
## ##  (14:31:05) 06Oct2019:  head(threeUtrGRLdetGeneGRLred2): 
## GRangesList object of length 6:
## $0610007P14Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames            ranges strand |            exon_name
##              <Rle>         <IRanges>  <Rle> |          <character>
##   grl1_v1    chr12 85815448-85816113      - | ENSMUSE00000116007.5
##                   transcript_id               gene_id     gene_name
##                     <character>           <character>   <character>
##   grl1_v1 ENSMUST00000021676.11 ENSMUSG00000021252.11 0610007P14Rik
##                gene_type transcript_type   exon_id exon_rank
##              <character>     <character> <numeric> <numeric>
##   grl1_v1 protein_coding  protein_coding    352426         5
## 
## $0610009B22Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames            ranges strand |            exon_name
##   grl1_v2    chr11 51685386-51685646      - | ENSMUSE00000679023.1
##                  transcript_id              gene_id     gene_name
##   grl1_v2 ENSMUST00000007921.8 ENSMUSG00000007777.9 0610009B22Rik
##                gene_type transcript_type exon_id exon_rank
##   grl1_v2 protein_coding  protein_coding  327973         2
## 
## $0610009L18Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames              ranges strand |            exon_name
##   grl1_v3    chr11 120351015-120351190      + | ENSMUSE00000372489.1
##                  transcript_id              gene_id     gene_name
##   grl1_v3 ENSMUST00000143813.1 ENSMUSG00000043644.4 0610009L18Rik
##                gene_type transcript_type exon_id exon_rank
##   grl1_v3 protein_coding  protein_coding  323716         2
## 
## ...
## <3 more elements>
## -------
## seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## ##  (14:31:05) 06Oct2019:  assignGRLexonNames(): class(GRL):CompressedGRangesList 
## ##  (14:31:05) 06Oct2019:  assignGRLexonNames(): No multi-stranded exon entries. 
## ##  (14:31:05) 06Oct2019:  assignGRLexonNames(): Checking disjoint ranges. 
## ##  (14:31:05) 06Oct2019:  assignGRLexonNames(): Reducing ranges. 
## ##  (14:31:05) 06Oct2019:  assignGRLexonNames(): head(GRLred): 
## GRangesList object of length 6:
## $0610007P14Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand |     gene_name
##          <Rle>         <IRanges>  <Rle> |   <character>
##   [1]    chr12 85815448-85816113      - | 0610007P14Rik
## 
## $0610009B22Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand |     gene_name
##   [1]    chr11 51685386-51685646      - | 0610009B22Rik
## 
## $0610009L18Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames              ranges strand |     gene_name
##   [1]    chr11 120351015-120351190      + | 0610009L18Rik
## 
## ...
## <3 more elements>
## -------
## seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): geneSymbolColname:gene_name 
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): geneSymbolColname values:0610007P14Rik,0610009B22Rik,0610009L18Rik,0610009O20Rik,0610010F05Rik,0610010F05Rik,0610010F05Rik,0610010K14Rik,0610010K14Rik,0610012G03Rik 
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): head(GRLred): 
## GRangesList object of length 6:
## $0610007P14Rik 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     gene_name    ALE_name
##          <Rle>         <IRanges>  <Rle> |   <character> <character>
##   [1]    chr12 85815448-85816113      - | 0610007P14Rik            
## 
## $0610009B22Rik 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     gene_name ALE_name
##   [1]    chr11 51685386-51685646      - | 0610009B22Rik         
## 
## $0610009L18Rik 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |     gene_name ALE_name
##   [1]    chr11 120351015-120351190      + | 0610009L18Rik         
## 
## ...
## <3 more elements>
## -------
## seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): head(GRLredStrandP): 
## [1]  3  4  8  9 10 11
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): head(GRLredStrandN): 
## [1]  1  2  5  6  7 12
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): exonNameColname values:0610007P14Rik_ale1,0610009B22Rik_ale1,0610009L18Rik_ale1,0610009O20Rik_ale1,0610010F05Rik_ale3,0610010F05Rik_ale2,0610010F05Rik_ale1,0610010K14Rik_ale2,0610010K14Rik_ale1,0610012G03Rik_ale1 
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): head(GRL[,GRLcolnames]): 
## GRangesList object of length 6:
## $0610007P14Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames            ranges strand |            exon_name
##              <Rle>         <IRanges>  <Rle> |          <character>
##   grl1_v1    chr12 85815448-85816113      - | ENSMUSE00000116007.5
##                   transcript_id               gene_id     gene_name
##                     <character>           <character>   <character>
##   grl1_v1 ENSMUST00000021676.11 ENSMUSG00000021252.11 0610007P14Rik
##                gene_type transcript_type   exon_id exon_rank
##              <character>     <character> <numeric> <numeric>
##   grl1_v1 protein_coding  protein_coding    352426         5
## 
## $0610009B22Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames            ranges strand |            exon_name
##   grl1_v2    chr11 51685386-51685646      - | ENSMUSE00000679023.1
##                  transcript_id              gene_id     gene_name
##   grl1_v2 ENSMUST00000007921.8 ENSMUSG00000007777.9 0610009B22Rik
##                gene_type transcript_type exon_id exon_rank
##   grl1_v2 protein_coding  protein_coding  327973         2
## 
## $0610009L18Rik 
## GRanges object with 1 range and 8 metadata columns:
##           seqnames              ranges strand |            exon_name
##   grl1_v3    chr11 120351015-120351190      + | ENSMUSE00000372489.1
##                  transcript_id              gene_id     gene_name
##   grl1_v3 ENSMUST00000143813.1 ENSMUSG00000043644.4 0610009L18Rik
##                gene_type transcript_type exon_id exon_rank
##   grl1_v3 protein_coding  protein_coding  323716         2
## 
## ...
## <3 more elements>
## -------
## seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): head(GRLred[,exonNameColname]): 
## GRangesList object of length 6:
## $0610007P14Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand |           ALE_name
##          <Rle>         <IRanges>  <Rle> |        <character>
##   [1]    chr12 85815448-85816113      - | 0610007P14Rik_ale1
## 
## $0610009B22Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand |           ALE_name
##   [1]    chr11 51685386-51685646      - | 0610009B22Rik_ale1
## 
## $0610009L18Rik 
## GRanges object with 1 range and 1 metadata column:
##       seqnames              ranges strand |           ALE_name
##   [1]    chr11 120351015-120351190      + | 0610009L18Rik_ale1
## 
## ...
## <3 more elements>
## -------
## seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## ##  (14:31:06) 06Oct2019:  annotateGRLfromGRL(): annoNames2:ALE_name 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Running findOverlaps(GR1, GR2) 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Completed findOverlaps(GR1, GR2) 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): length(grOL):16962 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR():    grOLq1:1,2,3,4,5,6,7,8,9,10 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR():    grOLs1:1,2,3,4,5,6,7,8,9,10 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR():    grOLq: 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR():    grOLs: 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Running shrinkMatrix on 0 entries out of 16,962 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): numCols:    
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): stringCols:ALE_name 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Shrinking string columns:ALE_name 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): nrow(grOLm1):16962 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): nrow(grOLmUse):0 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR():    ALE_name 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Appending multi- and single-entry string data.frames 
## ##  (14:31:06) 06Oct2019:  annotateGRfromGR(): Appending column to GRanges:ALE_name 
## ##  (14:31:06) 06Oct2019:  assignGRLexonNames(): Completed annotateGRLfromGRL(). 
## ##  (14:31:07) 06Oct2019:  gencode2ale(): Assigned stranded numbers to the ranges. 
## ##  (14:31:07) 06Oct2019:  gencode2ale(): Filtered 13,770 genes to 2,540 genes having multiple ranges. 
## ##  (14:31:07) 06Oct2019:  gencode2ale(): Creating transcript-to-ALE xref for multi-range genes. 
## ##  (14:31:08) 06Oct2019:  gencode2ale(): Aggregating transcript abundances by ranges.

## use tx2ale to calculate ALE using count data
iMatrixAleCt <- log2(1+
   shrinkMatrix(
      2^(assays(farrisTxSE[names(aleL$tx2ale),])[["counts"]])-1,
      groupBy=aleL$tx2ale,
      shrinkFunc=function(x){sum(x, na.rm=TRUE)},
      returnClass="matrix"));

We also applied logical filtering to the resulting genes, so the Neuronal Gene Lists defined above would be represented by genes in each region according to these rules:

anyGenes1 is allowed to contain expression data from all genes in the anyGenes list from any region.
DEandCB only includes genes where the ALE data is meets the thresholds for CB and DE samples.
CBonly only includes genes present in CB samples, and excludes DE samples.
non-pyramidal only includes genes present in DE samples, and excludes CB samples.
CB1andDE1 only includes genes where the ALE data is meets the thresholds for CB and DE samples.

Data were filtered so that the aggregate ALE TPM value for either ALE1 or ALE2 for each gene was at least 2.

groups <- nameVector(colData(farrisTxSE[,iSamples])$groupName, iSamples);
facet_groups <- nameVector(gsub("^.+_", "", groups), names(groups))

## Filter genes for the appropriate geneLists entry
subsetCBDE <- function(iDiffAleTall2ctr) {
   iDFviolinGeneL <- split(iDiffAleTall2ctr$gene_name,
      pasteByRowOrdered(iDiffAleTall2ctr[,c("geneList","Region")]));
   DEandCBsubset1 <- intersect(iDFviolinGeneL[["DEandCB_CB"]],
      iDFviolinGeneL[["DEandCB_DE"]]);
   CB1andDE1subset1 <- intersect(iDFviolinGeneL[["CB1andDE1_CB"]],
      iDFviolinGeneL[["CB1andDE1_DE"]]);
   iDiffAleTall2ctrUse <- subset(iDiffAleTall2ctr,
      (geneList %in% "DEandCB" & gene_name %in% DEandCBsubset1) |
      (geneList %in% "CB1andDE1" & gene_name %in% CB1andDE1subset1) |
      (geneList %in% "CBonly" & Region %in% "CB") |
      (geneList %in% "non-pyramidal" & Region %in% "DE") |
      (geneList %in% c("anyGenes1"))
      );
   #print(table(iDiffAleTall2ctrUse[,c("Region","geneList")]));
   iDiffAleTall2ctrUse;
}

## Create violin plot data
iMatrixAleGrp <- rowGroupMeans(aleL$iMatrixAle[,iSamples],
   useMedian=FALSE,
   groups=groups);
aleViolinL <- ale2violin(aleL$iMatrixAle[,iSamples],
   geneLists=geneListsAll,
   maxGroupMeanALE=2,
   groups=groups,
   lineAlpha=0.1,
   subsetFunc=subsetCBDE,
   facet_name="Region",
   verbose=FALSE,
   facet_groups=facet_groups);
printDebug("Number of genes represented in each panel (TPM calculations):");

## ##  (14:31:11) 06Oct2019:  Number of genes represented in each panel (TPM calculations):

table(aleViolinL$iDiffAleTall2ctrFacet[,c("Region","geneList")])/2;

##       geneList
## Region anyGenes1 DEandCB CBonly non-pyramidal CB1andDE1
##     CB      1883    1831     14             0       106
##     DE      1949    1831      0            66       106

print(aleViolinL$ggAle2);

supp7E Violin plot of the average percentage of codons with CAI <= 0.5 per RNA list

Codon CAI values are loaded via the farrisdata R package, and can be calculated using methods described in the jampack R package suite.

## codon adaptability index (cai) values are pre-computed and
## available from the farrisdata package
#data(GencodeVM12mm10cai);

## Subset transcript lists for entries having CAI data,
## which mostly filters for CDS-encoding Txs
txListsCai <- lapply(txListsAll, function(i){
   intersect(i, names(GencodeVM12mm10cai))
});
txListsCaiTPM <- lapply(txListsAllTPM, function(i){
   intersect(i, names(GencodeVM12mm10cai))
});
#txListsCai <- txListsCaiTPM;


## Create data in format for violin plots using ggplot2
violinCaiDF <- data.frame(check.names=FALSE,
   stringsAsFactors=FALSE,
   Tx=unlist(txListsCai),
   gene_name=tx2geneDF[match(unlist(txListsCai), tx2geneDF$transcript_id),"gene_name"],
   Subset=factor(rep(names(txListsCai), lengths(txListsCai)),
      levels=names(txListsCai)),
   detectedTx=ifelse(unlist(txListsCai) %in% detectedTxNonMito,
      "Detected Tx", "Undetected Tx"),
   txCai=GencodeVM12mm10cai[unlist(txListsCai)],
   cdsLength=GencodeVM12mm10cdsLength[unlist(txListsCai)],
   ctLowCai=GencodeVM12mm10caiCtLow[unlist(txListsCai)],
   pctLowCai=(GencodeVM12mm10caiCtLow[unlist(txListsCai)] /
         GencodeVM12mm10cdsLength[unlist(txListsCai)] *100),
   ctBadCai=GencodeVM12mm10caiCtBad[unlist(txListsCai)],
   pctBadCai=(GencodeVM12mm10caiCtBad[unlist(txListsCai)] /
         GencodeVM12mm10cdsLength[unlist(txListsCai)] *100),
   txCaiQ1=GencodeVM12mm10caiQ1mean[unlist(txListsCai)]
   );

## Now make a version that includes Tx counts in the labels
violinCaiDF$SubsetCount <- splicejam::factor2label(violinCaiDF$Subset,
   itemName="txs");
violinCaiDF$SubsetLabel <- splicejam::factor2label(violinCaiDF$Subset,
   types="count",
   aggFun=median,
   valuesL=violinCaiDF[,"pctLowCai",drop=FALSE],
   itemName="txs");
colorSubGeneLists3 <- unlist(lapply(names(colorSubGeneLists), function(i){
   j <- vigrep(i, c(
      levels(violinCaiDF$SubsetLabel),
      levels(violinCaiDF$SubsetCount)));
   nameVector(rep(colorSubGeneLists[i], length(j)), j);
}));

## Subset for specific neuronal groups
violinCaiDFsub <- subset(violinCaiDF, Subset %in% names(txListsAll)[c(3,4,5)]);

## CAI violin plots using ggplot2
ggPctLowCai <- ggplot(data=subset(violinCaiDFsub, detectedTx %in% "Detected Tx"),
   aes(x=Subset, y=pctLowCai)) +
   #aes(x=Subset, y=txCai)) +
   facet_wrap(~detectedTx) +
   geom_violin(aes(fill=SubsetLabel), scale="area",
   #geom_violin(aes(fill=SubsetCount), scale="area",
      alpha=0.8, trim=TRUE, draw_quantiles=c(0.5)) +
   ylim(c(0, 20)) +
   theme_jam(grid.minor.size=0) +
   scale_color_manual(values=colorSubGeneLists3) +
   scale_fill_manual(values=colorSubGeneLists3);
ggPctLowCai;

Statistical tests comparing gene lists

## Wilcoxon t-test comparing the percent CAI
caiCBonly <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "CBonly")$pctLowCai;
caiNonPyr <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "non-pyramidal")$pctLowCai;
caiCB1andDE1 <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "CB1andDE1")$pctLowCai;

## parametric t-tests
t35cai <- t.test(caiCBonly, caiCB1andDE1);
t34cai <- t.test(caiCBonly, caiNonPyr);
t45cai <- t.test(caiNonPyr, caiCB1andDE1);

## Wilcoxon non-parametric t-tests
wt35cai <- wilcox.test(caiCBonly, caiCB1andDE1);
wt34cai <- wilcox.test(caiCBonly, caiNonPyr);
wt45cai <- wilcox.test(caiNonPyr, caiCB1andDE1);

TtestStatsCai <- list(t35cai=t35cai,
   t34cai=t34cai,
   t45cai=t45cai,
   wt35cai=wt35cai,
   wt34cai=wt34cai,
   wt45cai=wt45cai);
TtestStatsCaiDF <- rbindList(lapply(names(TtestStatsCai), function(t1){
   t <- TtestStatsCai[[t1]];
   data.frame(data.name=t$data.name,
      p.value=t$p.value,
      method=t$method);
}));

## Print stats table
colorSubMethod <- group2colors(levels(TtestStatsCaiDF$method));
#TtestStatsCaiDF$p.value <- sfsmisc::pretty10exp(TtestStatsCaiDF$p.value);
TtestStatsCaiDF$p.value <- format(TtestStatsCaiDF$p.value,
   digits=3,
   trim=TRUE);
TtestStatsCaiDF2 <- kable_coloring(TtestStatsCaiDF,
      colorSub=colorSubMethod,
      row_color_by="method",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
TtestStatsCaiDF2;

data.name	p.value	method
caiCBonly and caiCB1andDE1	9.03e-24	Welch Two Sample t-test
caiCBonly and caiNonPyr	2.27e-10	Welch Two Sample t-test
caiNonPyr and caiCB1andDE1	3.24e-08	Welch Two Sample t-test
caiCBonly and caiCB1andDE1	2.23e-25	Wilcoxon rank sum test with continuity correction
caiCBonly and caiNonPyr	1.16e-11	Wilcoxon rank sum test with continuity correction
caiNonPyr and caiCB1andDE1	1.46e-08	Wilcoxon rank sum test with continuity correction

supp7F Heatmap of hypergeometric enrichment P-values of cell type by RNA list

In preparation.

Differential transcript isoform analysis limma

Differential isoform tests were performed using the R limma package, and the diffSplice() function, using normalized TPM abundances.

Definition of experiment design and contrast matrices

The experimental design and contrast matrices are defined based upon the Limma Users Guide, using zero intercept. All pairwise and two-way contrasts are included, provided each contrast changes only one factor at a time.

## Design matrix is defined by the sample groups
iGroups <- nameVector(
   colData(farrisTxSE[,iSamples])$groupName,
      iSamples);
## Re-order factor levels so CA2 is first
iGroups <- factor(iGroups,
   levels=provigrep(c("CA2", "."), levels(iGroups)));
iDC <- groups2contrasts(iGroups, returnDesign=TRUE);
iDesign <- iDC$iDesign;
## Alternative "manual" method
#iDesign <- stats::model.matrix(~0+iGroups);
#colnames(iDesign) <- levels(iGroups);
#rownames(iDesign) <- names(iGroups);

## Contrasts are CB-DE
iContrasts <- iDC$iContrasts;
## Alternative format, useful for custom contrasts
#iContrasts <- limma::makeContrasts(
#   contrasts=c(
#      "CA2_DE-CA2_CB",
#      "CA1_DE-CA1_CB"),
#   levels=iDesign);

Differential isoform statistical tests

Differential isoforms are determined using the limma::diffSplice() function, made conveniently available in the splicejam::runDiffSplice() function.

iMatrixTx <- assays(farrisTxSE[,iSamples])[["tpm"]];
iMatrixTx[iMatrixTx == 0] <- NA;
diffSpliceL <- runDiffSplice(
   iMatrixTx=iMatrixTx,
   detectedTx=detectedTx,
   tx2geneDF=tx2geneDF,
   iDesign=iDesign,
   iContrasts=iContrasts,
   cutoffFDR=0.05,
   cutoffFold=1.5,
   collapseByGene=TRUE,
   spliceTest="t",
   verbose=FALSE,
   useVoom=FALSE);

## Total number of exons:  30783 
## Total number of genes:  8589 
## Number of genes with 1 exon:  0 
## Mean number of exons in a gene:  4 
## Max number of exons in a gene:  29

diffSpliceHitsL <- lapply(diffSpliceL$statsDFs, function(iDF){
   hitColname <- head(vigrep("^hit ", colnames(iDF)), 1);
   as.character(subset(iDF, iDF[[hitColname]] != 0)$gene_name);
});

## diffSplice transcript level
diffSpliceTxL <- runDiffSplice(
   iMatrixTx=iMatrixTx,
   detectedTx=detectedTx,
   tx2geneDF=tx2geneDF,
   iDesign=iDesign,
   iContrasts=iContrasts,
   cutoffFDR=0.05,
   cutoffFold=1.5,
   collapseByGene=FALSE,
   spliceTest="t",
   verbose=FALSE,
   useVoom=FALSE);

## Total number of exons:  30783 
## Total number of genes:  8589 
## Number of genes with 1 exon:  0 
## Mean number of exons in a gene:  4 
## Max number of exons in a gene:  29

diffSpliceHitsTxL <- lapply(diffSpliceTxL$statsDFs, function(iDF){
   hitColname <- head(vigrep("^hit ", colnames(iDF)), 1);
   as.character(subset(iDF, iDF[[hitColname]] != 0)$transcript_id);
});
#sdim(diffSpliceHitsTxL);

## Subsets of named contrasts
DEcon <- unvigrep("_CB", names(diffSpliceHitsTxL));
CBcon <- unvigrep("_DE", names(diffSpliceHitsTxL));
DEconHits <- length(unique(unlist(
   diffSpliceHitsL[DEcon])));
CBconHits <- length(unique(unlist(
   diffSpliceHitsL[CBcon])));
DEconHitsTx <- length(unique(unlist(
   diffSpliceHitsTxL[DEcon])));
CBconHitsTx <- length(unique(unlist(
   diffSpliceHitsTxL[CBcon])));

CA1CA2CBcon <- vigrep("CA1.*CA2",
   unvigrep("DE", names(diffSpliceHitsTxL)));
CA1CA2DEcon <- vigrep("CA1.*CA2",
   unvigrep("CB", names(diffSpliceHitsTxL)));
CA1CA2CBconHits <- length(diffSpliceHitsL[[CA1CA2CBcon]]);
CA1CA2DEconHits <- length(diffSpliceHitsL[[CA1CA2DEcon]]);
CA1CA2CBconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2CBcon]]);
CA1CA2DEconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2DEcon]]);

CA1CA2Twocon <- vigrep("CA1.*CA2_CB",
   vigrep("DE", names(diffSpliceHitsTxL)));
CA1CA2TwoconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2Twocon]]);
CA1CA2TwoconHits <- length(diffSpliceHitsL[[CA1CA2Twocon]]);

Statistical hits were defined as having a linear fold change >= 1.5, and a Benjamini Hochberg adjusted P-value <= 0.05.

8,437 genes with multiple detected isoforms were tested, which include
30,783 detected isoforms in these multi-isoform genes.
8,305 total unique differential transcript isoforms from
4,352 unique genes across all comparisons.

Interestingly, only a minor percentage of isoform hits came from cell body comparisons (e.g. CA1 cell body vs CA2 cell body, 254 differentially spliced transcript isoforms from 169 unique genes), suggesting that mature hippocampal neurons overwhelmingly express the same isoforms in similar proportions for co-expressed genes, albeit with nearly 200 exceptions.

In contrast, the vast majority of differentially expressed isoform hits came from dendrite comparisons (2,606 differentially spliced transcript isoforms from 1,792 unique genes). A subset of these hits overlapped with the two-way comparison (e.g. CA2 cell body to dendrite vs CA1 cell body to dendrite, 223 differentially spliced transcript isoforms from 223 unique genes), suggesting either cell-type and/or compartment specific splicing.

Export stats results to Excel

The stats results will be exported to Excel, using the openxlsx R package, storing each stats table as its own worksheet.

GroupColnames <- c("Compartment","Region");
contrastsDF <- iDC$iContrastNames;
colnames(contrastsDF)[1:2] <- GroupColnames;
contrastsDFsplit <- gsub("^_", "",
   jamba::pasteByRow(
      do.call(cbind,
         lapply(nameVector(GroupColnames), function(i){
            ifelse(lengths(strsplit(as.character(contrastsDF[[i]]), ",")) > 1,
               i,
               NA)
         })
      )
   )
);
contrastsL <- split(as.character(contrastsDF$contrastName),
   contrastsDFsplit);
con2label <- function(i) {
   gsub("_$", "", cPaste(sep="_", doSort=FALSE,
      lapply(i, function(j){
         sortSamples(preControlTerms=c("CB","DE"),
            unique(unlist(
               strsplit(j, "[-_()]+"))));
      })
   ))
}
## Loop through each contrast list and save an Excel file
for (i in nameVectorN(contrastsL)) {
   iXlsx <- paste0("DiffIsoforms_by_", i, "_", jamba::getDate(), ".xlsx");
   #printDebug(i, "", ": ", iXlsx,
   #   ", ", length(contrastsL[[i]]), " worksheets");
   iConLabels <- con2label(contrastsL[[i]]);
   iMatch <- match(unique(iConLabels), iConLabels);
   for (iCon in contrastsL[[i]][iMatch]) {
      iLab <- con2label(iCon);
      #printDebug("   ", iLab);
      #iDFtx <- diffSpliceTxL$statsDFs[[iCon]];
      ## Gene-centric stats table
      iDFgene <- diffSpliceL$statsDFs[[iCon]];
      ## Remove t-statistic column, reorder hit column
      iDFgene <- dplyr::select(iDFgene,
            -tidyselect::matches("^t ")) %>%
         dplyr::select(tidyselect::matches("gene|transcript"),
            tidyselect::matches("^hit "),
            tidyselect::everything());
      jamba::writeOpenxlsx(file=iXlsx,
         x=iDFgene,
         sheetName=iLab,
         append=(!iCon == contrastsL[[i]][[1]]),
         highlightColumns=igrep("gene|transcript", colnames(iDFgene)),
         pvalueColumns=igrep("p.val|fdr", colnames(iDFgene)),
         lfcColumns=igrep("logfc", colnames(iDFgene)),
         numColumns=igrep("groupmean", colnames(iDFgene)),
         intColumns=igrep("^numTx", colnames(iDFgene)),
         intRule=c(0,7,15),
         hitColumns=igrep("^hit ", colnames(iDFgene)),
         freezePaneColumn=max(igrep("gene|transcript", colnames(iDFgene))),
         freezePaneRow=2
      );
   }
}

## Note: zip::zip() is deprecated, please use zip::zipr() instead

Splicing, Sashimi plots

Two genes with interesting alternative transcript isoforms are Ntrk2 and Gria1. Sashimi plots are shown below, which represent the RNA-seq sequence read coverage across the exons, and the splice junction reads that span from exon to exon.

The plots also include the transcript-exon model, and the flattened gene-exon model.

First, some Sashimi plot data is prepared:

Exons per transcript, and CDS exons per transcript, which are derived from the Gencode GTF loaded previously.
filesDF which is a data.frame describing each RNA-seq file used to load sequence read coverage, and splice junction reads.
sample_id which is a subset of sample_id values described in filesDF.

## Check for valid TxDb
if (!DBI::dbIsValid(AnnotationDbi::dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}

## Obtain the exons per transcript
exonsByTx <- exonsBy(vM12txdb, by="tx", use.names=TRUE);
## Obtain CDS exons per transcript, used to show UTR as non-CDS
cdsByTx <- cdsBy(vM12txdb, by="tx", use.names=TRUE);

## flatten exons by gene for these two genes
flatExonsByGene2 <- flattenExonsBy(exonsByTx=exonsByTx,
   tx2geneDF=tx2geneDF,
   detectedTx=detectedTx,
   cdsByTx=cdsByTx,
   by="gene",
   genes=c("Ntrk2", "Gria1"));

## flatten CDS exons by gene for these two genes
flatExonsByTx2 <- flattenExonsBy(exonsByTx=exonsByTx,
   tx2geneDF=tx2geneDF,
   detectedTx=detectedTx,
   cdsByTx=cdsByTx,
   by="tx",
   genes=c("Ntrk2", "Gria1"));

## Define the files used for the sashimi plots
filesDF <- farrisdata::farris_sashimi_files_df;
## Define a subset of sample_id to display
sample_id <- c("CA1_CB", "CA1_DE", "CA2_CB", "CA2_DE");

Sashimi plot for Ntrk2

## prepare sashimi plot data for Ntrk2
sashimi_ntrk2 <- prepareSashimi(gene="Ntrk2",
   tx2geneDF=tx2geneDF,
   flatExonsByTx=flatExonsByTx2,
   flatExonsByGene=flatExonsByGene2,
   colorSub=farrisdata::colorSub,
   filesDF=filesDF,
   sample_id=sample_id);

## prepare gene-exon model for Ntrk2
gg_ntrk2 <- gene2gg(gene="Ntrk2",
   gene_order="last",
   labelExons=FALSE,
   exonLabelSize=6,
   direction="y",
   flatExonsByGene=flatExonsByGene2,
   flatExonsByTx=flatExonsByTx2)

## use cowplot to assemble sashimi plot and gene-exon model
cp_ntrk2 <- cowplot::plot_grid(
   gg_ntrk2 +
      theme(axis.text.x=element_blank()) +
      #ggtitle(NULL) +
      xlab(NULL),
   plotSashimi(sashimi_ntrk2,
         color_sub=farrisdata::colorSub),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,2));
cp_ntrk2;

Sashimi plot for Gria1

## prepare sashimi plot data for Ntrk2
sashimi_gria1 <- prepareSashimi(gene="Gria1",
   tx2geneDF=tx2geneDF,
   flatExonsByTx=flatExonsByTx2,
   flatExonsByGene=flatExonsByGene2,
   filesDF=filesDF,
   minJunctionScore=200,
   sample_id=sample_id);

## prepare gene-exon model for Ntrk2
gg_gria1 <- gene2gg(gene="Gria1",
   gene_order="last",
   labelExons=FALSE,
   exonLabelSize=6,
   direction="y",
   flatExonsByGene=flatExonsByGene2,
   flatExonsByTx=flatExonsByTx2)

## use cowplot to assemble sashimi plot and gene-exon model
cp_gria1 <- cowplot::plot_grid(
   gg_gria1 +
      theme(axis.text.x=element_blank()) +
      #ggtitle(NULL) +
      xlab(NULL),
   plotSashimi(sashimi_gria1,
         color_sub=farrisdata::colorSub),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,2));
cp_gria1;

Zoom into the differential slice event, that switched from the “flop” to the “flip” form of Gria1:

## use cowplot to assemble sashimi plot and gene-exon model
gria1_r <- ranges(range(subset(flatExonsByGene2[["Gria1"]],
   gene_nameExon %in% c("Gria1_exon13", "Gria1_exon17a"))))
gria1_zoom <- c(start(gria1_r), end(gria1_r));

cp_gria1_zoom <- cowplot::plot_grid(
   gg_gria1 +
      theme(axis.text.x=element_blank()) +
      coord_cartesian(xlim=gria1_zoom, expand=FALSE) +
      xlab(NULL),
   plotSashimi(sashimi_gria1,
      label_coords=gria1_zoom,
      color_sub=farrisdata::colorSub) +
      coord_cartesian(xlim=gria1_zoom, expand=FALSE),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,3));
cp_gria1_zoom;

Gene Set Enrichment

Selection of gene hits

Used DESeq-normalized expression values, tested versus MSigDB v6.0 using hypergeometric enrichment via the R function stats::phyper(). A data.frame summary of statistical results was prepared, which included:

pathway name
pathway “Source” and “Category” as defined by MSigDB
number of genes per pathway
number of gene hits found per pathway
hypergeometric enrichment P-value
FDR-adjusted P-value
comma-delimited gene hits per pathway

The gene symbols obtained from MSigDB were converted to most recent mouse Entrez gene symbols using the Bioconductor package "org.Mm.eg.db". This same process was used with the gene symbols from the RNA-seq analysis steps above, in order to ensure the gene symbols in MSigDB and RNA-seq analysis were using the same comparable version of Entrez gene symbols.

One data.frame result was prepared for each gene hit list, and was further analyzed using methods in the "multienrichjam" package.

MultiEnrichMap

MultiEnrichMap is an analysis workflow that extends previous innovative analysis workflows:

“EnrichmentMap” by the Bader lab, ref: Merico,et al, PLoS One, 2010)
“clusterProfiler” and “enrichplot” by Dr. Guangchuang Yu.

The previous EnrichmentMap and EnrichMap workflows are intended to help understand and visualize gene set enrichment results in the context of one gene hit list.

MultiEnrichMap is an extension of those workflow that compares multiple enrichment results. This workflow is implemented in an R package “multienrichjam”, which can be installed with

devtools::install_github("jmw86069/multienrichjam")

Subset pathways for top 15 per source-category

Heatmap of enrichment P-values.

Cnet plot