This package is under active development, these functions make up a core set of methods currently used across active Omics analysis projects. A summary of goals and relevant features are described below.
Goals of jamses
The core goal is to make data analysis and visualization of SummarizedExperiment
objects straightforward for common scenarios. It also accepts SingleCellExperiment
and Seurat
objects.
Define a design matrix using
~ 0 + group
syntax, see below.Define statistical contrasts to compare one or more factors, see
groups_to_sedesign()
.Plot contrasts as defined, see
plot_sedesign()
.Confirm design,contrast matrices are always in sync, see
SEDesign-class
.Automate analysis of multiple contrasts, using limma/limmavoom/DEqMS, see
se_contrast_stats()
.-
Convenient “short hand” for contrasts, see
contrasts2comp()
. -
Integrate with other tools.
-
venndir::venndir()
- see Github"jmw86069/venndir"
to create directional Venn diagrams.
-
Approach for Statistical Contrasts
The Limma User’s Guide (LUG) is an amazing resource which describes alternate approaches for one-way and two-way contrasts, which are mathematically equivalent. For a more thorough discussion please review these approaches to confirm that ~0 + x
is mathematically identical to ~ x
, and only differs in how estimates are reported.
The approach used in
jamses
uses the~0 + x
strategy.
Each experiment group is defined using independent replicates.
This approach does not imply that there is “no intercept” during the model fit, see LUG for details.-
One-way contrasts must compare only one factor change per contrast.
For example, a valid one-way contrast is shown below:It compares
(treated-control)
with factor levelA
unchanged.
However, an invalid one-way contrast is shown below:It is invalid because allows two factor changes:
treated-control
andA-B
. -
A two-way contrast is defined as the fold change of two compatible one-way fold changes, and in the proper order.
Consider the following: Caveat: Specific contrasts can be added manually when necessary.
Design and contrasts - SEDesign
-
groups_to_sedesign()
takes by default adata.frame
where each column represents an experiment factor, and creates the following:-
design
matrix -
contrast
matrix using only appropriate changes in single experiment factor levels for one-way contrasts, and equivalent contrasts across secondary factor levels where appropriate for two-way contrasts, up tomax_depth
factor depth. -
samples
vector representing individual sample columns used from the input data.
-
-
Output is
SEDesign
as an S4 object with slot names:-
sedesign@design
- the design matrix. -
sedesign@contrasts
- the contrasts matrix. -
sedesign@samples
- vector of samples.
-
SEDesign S4 Object Details
SEDesign
is an S4 object that contains the following slots:
-
samples
: acharacter
vector of sample names, derived fromcolnames()
of the SummarizedExperiment` object.- When
samples
are subset, it has a cascade effect ondesign
andcontrasts
. When groups are no longer represented, they are removed fromdesign
andcontrasts
as relevant.
- When
-
design
: amatrix
representing the design matrix, with additional constraints:-
rownames()
are always maintained, so they can be aligned withcolnames()
of theSummarizedExperiment
object. This requirement helps ensure that the design and assay data are always in proper sync. -
colnames()
are defined as sample groups, equivalent to usingmodel.matrix( ~ 0 + groups )
for example from the The Limma User’s Guide from thelimma
Bioconductor package. - All values are always
1
or0
with no fractions. More complicated designs require manual adjustment, beyond the scope ofjamses
.
-
-
contrasts
: amatrix
representing the contrast matrix used for statistical comparisons, with additional constraints:-
rownames(contrasts)
are always defined, and always equalcolnames(contrasts)
to ensure contrasts and design are always properly synchronized.
-
Example SEDesign object
The example below uses a character
vector of group names per sample, with two factors separated by underscore "_"
. The same data can be provided as a data.frame
with two columns.
library(jamses)
library(kableExtra)
igroups <- jamba::nameVector(paste(rep(c("WT", "KO"), each=6),
rep(c("Control", "Treated"), each=3),
sep="_"),
suffix="_rep");
igroups <- factor(igroups, levels=unique(igroups));
# jamba::kable_coloring(color_cells=FALSE,
# format="markdown",
# caption="Sample to group association",
# data.frame(groups=igroups))
knitr::kable(data.frame(groups=igroups))
groups | |
---|---|
WT_Control_rep1 | WT_Control |
WT_Control_rep2 | WT_Control |
WT_Control_rep3 | WT_Control |
WT_Treated_rep1 | WT_Treated |
WT_Treated_rep2 | WT_Treated |
WT_Treated_rep3 | WT_Treated |
KO_Control_rep1 | KO_Control |
KO_Control_rep2 | KO_Control |
KO_Control_rep3 | KO_Control |
KO_Treated_rep1 | KO_Treated |
KO_Treated_rep2 | KO_Treated |
KO_Treated_rep3 | KO_Treated |
The resulting design and contrasts matrices are shown below:
sedesign <- groups_to_sedesign(igroups);
jamba::kable_coloring(
# knitr::kable(
colorSub=c(`-1`="dodgerblue", `1`="firebrick"),
caption="Design matrix output from design(sedesign).",
data.frame(check.names=FALSE, design(sedesign)));
WT_Control | WT_Treated | KO_Control | KO_Treated | |
---|---|---|---|---|
WT_Control_rep1 | 1 | 0 | 0 | 0 |
WT_Control_rep2 | 1 | 0 | 0 | 0 |
WT_Control_rep3 | 1 | 0 | 0 | 0 |
WT_Treated_rep1 | 0 | 1 | 0 | 0 |
WT_Treated_rep2 | 0 | 1 | 0 | 0 |
WT_Treated_rep3 | 0 | 1 | 0 | 0 |
KO_Control_rep1 | 0 | 0 | 1 | 0 |
KO_Control_rep2 | 0 | 0 | 1 | 0 |
KO_Control_rep3 | 0 | 0 | 1 | 0 |
KO_Treated_rep1 | 0 | 0 | 0 | 1 |
KO_Treated_rep2 | 0 | 0 | 0 | 1 |
KO_Treated_rep3 | 0 | 0 | 0 | 1 |
# knitr::kable(
jamba::kable_coloring(
colorSub=c(`-1`="dodgerblue", `1`="firebrick"),
caption="Contrast matrix output from contrasts(sedesign).",
data.frame(check.names=FALSE, contrasts(sedesign)));
KO_Control-WT_Control | KO_Treated-WT_Treated | WT_Treated-WT_Control | KO_Treated-KO_Control | (KO_Treated-WT_Treated)-(KO_Control-WT_Control) | |
---|---|---|---|---|---|
WT_Control | -1 | 0 | -1 | 0 | 1 |
WT_Treated | 0 | -1 | 1 | 0 | -1 |
KO_Control | 1 | 0 | 0 | -1 | -1 |
KO_Treated | 0 | 1 | 0 | 1 | 1 |
For convenience, SEDesign can be visualized using plot_sedesign()
:
# plot the design and contrasts
plot_sedesign(sedesign);
title(main="plot_sedesign(sedesign)\noutput:")
- Two-way contrasts are indicated by the “squiggly curved line” which connects the end of one contrast to the beginning of the next contrast. This connection describes the first contrast, subtracted by the second contrast.
Data normalization
SummarizedExperiment
objects are normalized using:
-
se_normalize()
- lightweight wrapper to several normalization functions:-
method="jammanorm"
: callsjamma::jammanorm()
which normalizes the median log fold change to zero. This method is also used inDESeq2::estimateSizeFactors()
for example. When usingjamma::jammaplot()
for MA-plots, thejammanorm()
method is conceptually equivalent to shifting the y-axis values to zero, so the median log fold change is zero. (Assumptions apply.) -
method="quantile
: callslimma::normalizeQuantiles()
for quantile normalization. -
method="limma_batch_adjust"
ormethod="lba"
: callslimma::removeBatchEffect()
to adjust of batch effects. This normalization is only recommended for visualization and clustering, not recommended prior to statistical comparisons. Instead, the recommendation is to define a blocking factor passed tose_contrast_stats()
with argumentblock
.
-
matrix_normalize()
- is the core function forse_normalize()
and operates on individual numeric data matrices.
Normalization can be reviewed with MA-plots, recommended by using
jamma::jammaplot()
. See Github repository"jmw86069/jamma"
.
Statistical comparisons
se_contrast_stats()
is the central function
Applies
SEDesign
(design and contrasts) toSummarizedExperiment
data.-
Calls the appropriate
limma
functions one or moreassays
in theSummarizedExperiment
object.-
use_voom=TRUE
will enablelimmavoom
methodology for count data. -
posthoc_test="DEqMS"
will enablelimma-DEqMS
methodology for proteomics mass spec data, which defines an error model based upon PSM counts per row. -
block
will define blocking factor forlimma
. When also applyinglimmavoom
, it performs the appropriate two-step process as described by Dr. Smyth.
-
-
Applies statistical thresholds to mark statistical hits:
-
adjp_cutoff
: adjusted P-value threshold -
fold_cutoff
: normal space fold change threshold -
mgm_cutoff
: “mgm” is an abbreviation for “max group mean”. This threshold requires at least one experiment group involved in the contrast to have group mean at or above this threshold in order to mark a statistical hit. It does not subset data prior to testing.
-
-
Additional threshold options:
-
p_cutoff
: unadjusted P-value threshold -
ave_cutoff
: average expression threshold, using the equivalent oflimma
columnAveExpr
with the mean group expression. Note thisAveExpr
value is calculated mean per row across all groups, and may be subject to skewing. -
int_adjp_cutoff
,int_fold_cutoff
,int_mgm_cutoff
: optional thresholds only used for interaction contrasts, intended for data mining purposes. For example, data mining may filter for two-way contrast results with no fold change threshold, or slightly relaxedint_adjp_cutoff=0.1
.
-
-
Options:
-
use_voom=TRUE
: enable limmavoom workflow for count data -
posthoc_test="DEqMS"
: enable DEqMS post-hoc adjustment to the limma empirical Bayes model fit. -
floor_min
,floor_value
: logic to handle numeric values below a defined noise threshold, values below this threshold becomefloor_value
. This filtering can also convert values from0
zero toNA
, where appropriate. For data with substantial missing measurements, this option may be beneficial. -
handle_na
,na_value
: logic to handleNA
values, particularly when an entire group isNA
. For proteomics, or platforms with a defined “noise floor”, it may be useful to usehandle_na="full1"
andna_value=noise_floor
. This specific scenario assignsnoise_floor
to one value in any completelyNA
group tonoise_floor
so that its variability is not used, however the fold change can still be calculated relative to the platform minimum signal -
block
: optional blocking factor, which enables thelimma::duplicateCorrelation()
calculation, which is applied to the model fit per the The Limma User’s Guide.
-
-
Returns a
list
referred to assestats
:-
"hit_array"
:array
of named numeric vectors indicating direction of statistical hits after applying the relevant threshold cutoffs:-
1
up-regulated -
-1
down-regulated
-
"stats_dfs"
:list
ofdata.frame
for each contrast, where column headers also include the statistical contrast to help confirm the identity of each analysis result."stats_df"
: one super-widedata.frame
with results across all contrasts, assay data matrices, and hit thresholds. This matrix is intended to help filter for hits across the various different contrasts and thresholds.
-
-
Save to excel with
save_sestats()
:- This function helps automate saving every statistical contrast in table format, into individual Excel worksheets. It may include additional
rowData()
annotations. - Each worksheet is styled consistently, making it easy to flip through each worksheet.
- This function helps automate saving every statistical contrast in table format, into individual Excel worksheets. It may include additional
Heatmaps
heatmap_se()
is a wrapper for the amazing ComplexHeatmap::Heatmap()
, intended to automate frequently-used options that represent repetitive work.
Common rules:
- All data is centered by default. It can be customized.
- All data is assumed to be
log2
or otherwise “appropriately transformed”. (The typical transform islog2(1 + x)
.) - The heatmap can also be displayed as sample correlations (
correlation=TRUE
) which re-uses the same data centering options (critical aspect of these plots). - The heatmap can be subset using
rows
orsestats
to show stat hit annotations on the left side. - Column and row annotations can be added using
colData(se)
androwData(se)
- Columns and rows can be split using
colData(se)
androwData(se)
colnames. - The heatmap title includes a summary of data.
Common arguments:
se
: theSummarizedExperiment
data for the heatmapassay_name
: define a specific data matrix to displayrows
: choose a specific subset of rows (optional)-
sestats
: optional stat hits in one of these formats:- output from
se_contrast_stats()
- incidence
matrix
with valuesc(-1, 0, 1)
-
list
of numeric values usingc(-1, 0, 1)
, with names representingrownames(se)
-
list
ofrownames(se)
- output from
top_colnames
: column annotations incolData(se)
to display across the top.rowData_colnames
: optional row annotations to display on the left.sample_color_list
:list
named by colnames fromcolData(se)
orrowData(se)
to define colors.-
centerby_colnames
: optional sub-groups for data centering- Useful to center multiple tissue types, or sample classes, batches, etc.
- Set
centerby_colnames=FALSE
to turn off data centering altogether.
controlSamples
: custom subset of control samplesfor data centering.row_split
: split rows using colnames inrowData(se)
, or by number of dendrogram sub-tree clusters.column_split
: split columns using colnames incolData(se)
, or by number of dendrogram sub-tree clusters.mark_rows
: optional subset of rows for callout labels.