Compute contrast statistics on SummarizedExperiment data
Source:R/jam_secontrasts.R
se_contrast_stats.Rd
Compute contrast statistics on SummarizedExperiment data
Usage
se_contrast_stats(
se,
assay_names,
adjp_cutoff = 0.05,
p_cutoff = NULL,
fold_cutoff = 1.5,
int_adjp_cutoff = adjp_cutoff,
int_p_cutoff = p_cutoff,
int_fold_cutoff = fold_cutoff,
mgm_cutoff = NULL,
ave_cutoff = NULL,
confint = FALSE,
floor_min = NULL,
floor_value = NULL,
sedesign = NULL,
icontrasts = NULL,
idesign = NULL,
igenes = NULL,
isamples = NULL,
enforce_design = TRUE,
use_voom = FALSE,
voom_block_twostep = TRUE,
posthoc_test = c("none", "DEqMS"),
posthoc_args = list(DEqMS = list(PSM_counts = NULL, fit.method = "loess")),
weights = NULL,
robust = FALSE,
handle_na = c("full1", "full", "partial", "all", "none"),
na_value = 0,
rowData_colnames = c("SYMBOL"),
collapse_by_gene = FALSE,
block = NULL,
correlation = NULL,
max_correlation_rows = 10000,
normgroup = NULL,
do_normgroups = TRUE,
seed = 123,
verbose = FALSE,
...
)
Arguments
- se
SummarizedExperiment
object.Note that
colnames(se)
should match the samples insedesign
, or therownames(idesign)
whensedesign
is not supplied.Data is subset by
colnames(se)
usingisamples
when supplied, andrownames(se)
whenigenes
is supplied.Note argument
rowData_colnames
can be used to retain somerowData(se)
columns in the statdata.frame
summaries for convenience, particularly helpful when analyzing microarray data where therownames(se)
represent probe ID or assay ID.
- assay_names
character
vector with one or more assay names fromnames(assays(se))
.- adjp_cutoff
numeric
value threshold with the adjusted P-value at or below which a contrast result can be considered a statistical hit. This threshold is applied in addition to othercutoff
values.- p_cutoff
numeric
value threshold with the unadjusted P-value at or below which a contrast result can be considered a statistical hit. This argument is not recommended, in favor of usingadjp_cutoff
. This threshold is applied in addition to othercutoff
values.- fold_cutoff
numeric
value threshold indicating the normal absolute fold change at or above which a contrast result can be considered a statistical hit. For clarity, this threshold is normal space fold change, for example 2-fold would befold_cutoff=2
.- int_adjp_cutoff, int_p_cutoff, int_fold_cutoff
optional thresholds used only when a two-way interaction style contrast is detected. These optional thresholds may be useful to apply more lenient criteria to interaction contrasts, but in that event are cautioned to be used for data mining exercises. Ideally, the thresholds are identical between pairwise and interaction contrasts, and ideally there are enough replicates in the data to support the interaction contrasts with sufficient confidence to make those comparisons.
- mgm_cutoff
numeric
value threshold of the maximum group mean value required in each contrast for the contrast to be considered a statistical hit. The "max group mean" logic requires only one group in a contrast to be above this threshold, while all other groups can have values below the threshold. This threshold is applied in addition to othercutoff
values.- ave_cutoff
numeric
value threshold of the average expression as reported bylimma::lmFit()
, within each normgroup if relevant, for the contrast to be considered a statistical hit. This threshold is applied in addition to othercutoff
values. Typically the column "AvgExpr" is calculated as a row mean.- confint
logical
passed tolimma::topTable()
indicating whether to calculate 95% confidence intervals for log2 fold changelogFC
values. Alternatively it can be anumeric
value between zero and one specifying a specific confidence interval.- floor_min, floor_value
numeric
minimum value (floor_min) at or below whichnumeric
values in the assay data matrix are reverted tofloor_value
as a replacement. This option is valuable to set allnumeric
values at or below zero to zero, or to set all values at or below zero toNA
in circumstances where zero indicates "no measurement" and would be more accurately represented as a missing measurement than a measurement of0
.Note:
floor_min
is applied beforehandle_na
, so that this step is able to createNA
values which will be handled appropriately based upon the optionhandle_na
.
- sedesign
SEDesign
object as defined bygroups_to_sedesign()
, with slotNames"design"
,"contrasts"
, and"samples"
. The argumentsidesign
andicontrasts
are ignored when this argument is defined.- icontrasts, idesign
numeric
matrices representing statistical contrasts, and sample-group design matrices. These values are ignored whensedesign
is defined.- igenes
character
vector with optional subset ofrownames(se)
, by default it uses allrownames(se)
.- isamples
character
vector with optional subset ofcolnames(se)
, by default it usescolnames(se)
that are also defined in the design matrix.- enforce_design
logical
(this option is not implemented). By default the design matrix is used to subset the inputcolnames(se)
as needed, andisamples
is used to subset the design matrix and corresponding contrasts as relevant.- use_voom
logical
indicating whether to applylimma-voom
analysis steps. When applied, data is not scaled usinglimma:voom()
, instead uses data as supplied.- voom_block_twostep
logical
indicating whether to perform the "two-step" voom methodology whenblock
is also supplied. This workflow is recommended by voom authors:call
voom()
first withoutblock
to determine overallweights
call
duplicateCorrelation()
if necessary, using the voomweights
, and theblock
argument, to calculatecorrelation
.call
voom()
again using thecorrelation
,weights
, andblock
arguments. This produces improvedweights
.call
duplicateCorrelation()
again with the updatedweights
, andblock
in order to calculate improvedcorrelation
.Then proceed with
lmFit()
using appropriateweights
usingblock
; and appropriatecorrelation
also using the properweights
andblock
.
- posthoc_test
character
string indicating an optional post-hoc test to apply."none"
: applieslimma::eBayes()
by default, the moderated t-test."DEqMS"
: applies adjustment for proteomics measurements provided by the package"DEqMS"
. Seeposthoc_args
.
- posthoc_args
list
named by theposthoc_test
above."DEqMS"
recognizes two arguments, which are passed toDEqMS::spectraCounteBayes()
:"PSM_counts"
: anumeric
vector of peptide spectra matched, one perigenes
orrownames(se)
. These values are used by DEqMS to model variability based upon the number of spectra as the key measure of confidence."fit.method"
:character
name of the model to use, default is"loess"
.
- weights
numeric
non-negative precision weights passed tolimma::lmFit()
, either as a matrix with nrow and ncol that matchigenes
andisamples
, respectively, or matching one oflength(igenes)
orlength(isamples)
. Whenigenes
orisamples
are not supplied, it usesnrow(se)
orncol(se)
, respectively.- robust
logical
passed tolimma::eBayes()
, whether estimation ofdf.prior
andvar.prior
should be robust against outlier sample variances.- handle_na
character
string indicating how to handleNA
values in the data matrix, passed tohandle_na_values()
."partial"
: ReplaceNA
values withna_value
, except when an entire group isNA
the entire group is kept atNA
."full"
: RetainNA
values, except when an entire group isNA
the replicate values are replaced withna_value
. The option"full1"
may be preferred, since it only replaces one value in the group, therefore does not misrepresent the variability of the group as zero."full1"
: RetainNA
values, except when an entire group isNA
, one replicate value in the group is replaced withna_value
. By replacing only one replicate withna_value
the group does not have a variance/dispersion, forcing the variance to be determined by the other group in the contrast, while still allowing an approximate fold change to be calculated. This option is suitable when there is a noise floor above zero, as it retains an approximate fold change while estimating a P-value using only the variance derived from the non-NA
group."all"
: replace allNA
values withna_value
."none"
: perform no replacement ofNA
values.
- na_value
numeric
passed tohandle_na_values()
- rowData_colnames
character
vector of colnames inrowData(se)
that should be retained in each statdata.frame
produced as output. The values inrowData_colnames
are intersected withcolnames(rowData(se))
to determine which columns to keep.- collapse_by_gene
logical
(not currently implemented).- block
character
,factor
, ornumeric
used as a blocking factor, using argumentblock
inlimma::lmFit()
for example.The argument can be supplied as one of the following:
character
with one or morecolnames(colData(se))
character
vector named using values inisamples
and/orcolnames(se)
character
vector without names, whose length must equallength(isamples)
or whenisamples
is not provided,ncol(se)
.
- correlation
optional inter-duplicate or inter-technical correlation matrix passed to
limma::lmFit()
.- max_correlation_rows
numeric
maximum number of rows inimatrix
to use when callinglimma::duplicateCorrelation()
. This process only occurs whenblock
is defined,correlation=NULL
andnrow(imatrix) > max_correlation_rows
. In this scenario, a random subset of rows are used to calculatecorrelation
, then thatcorrelation
is used forlimma::lmFit()
. This process is intended to help very large data volumes, where the speed oflimma::duplicateCorrelation()
is impacted in quadratic manner by increasing number of rows, while also not improving the summary correlation value.- normgroup
character
orfactor
vector with lengthncol(se)
orlength(isamples)
whenisamples
is defined. Values define independent normalization groups, which performs independent analyses within each unique normalization group. This option is intended for convenience, enabling separate variance models for each normalization group, which is appropriate when analyzing very different sample types. During limma model fit, all samples in all groups are used by default, which may incorrectly estimate variance when the variability by row is not uniform across different sample types. Whennormgroup=NULL
the default is to assume all samples are in the samenormgroup="bulk"
. Each subset of samples begins with the samesedesign
,idesign
,icontrast
, however they are fed intovalidate_sestats()
to subset the appropriate contrasts to use only samples within the current normgroup. As a result, any contrasts that span two normgroups will be removed, and will not appear in the output.The argument can be supplied as one of the following:
character
with one or morecolnames(colData(se))
character
vector named using values inisamples
and/orcolnames(se)
character
vector without names, whose length must equallength(isamples)
or whenisamples
is not provided,ncol(se)
.
- do_normgroups
logical
whether to enable normgroup processing, or to use the previous technique that kept all samples together. This argument may be removed in future, with recommendation to usenormgroup=NULL
to disable normalization group processing. Note that whennormgroup=NULL
the output should be identical withdo_normgroups=TRUE
anddo_normgroups=FALSE
.- seed
numeric
used to set a random seed withset.seed()
for reproducibility. Useseed=NULL
to avoid setting a random seed. Note this action will only affect downstream functions that employ some form of randomization.- verbose
logical
indicating whether to print verbose output.- ...
additional arguments are ignored.
Details
This function is essentially a wrapper around statistical methods
in the limma
package, with additional steps to apply statistical
thresholds to define "statistical hits" by three main criteria:
P-value or adjusted P-value
fold change
max group mean
This function is unique in that it applies the statistical methods
to one or more "signals" in the input SummarizedExperiment
assays,
specifically intended to compare things like normalization methods.
If multiple statistical thresholds are defined, each one is applied in order, which is specifically designed to compare the effect of applying different statistical thresholds. For example one may want to pre-compute "statistical hits" using adjusted P-value 0.05, and 0.01; or using fold change >= 1.5, or fold change >= 2.0. The underlying statistics are the same, but a column indicating hits is created for each set of thresholds.
Hits are annotated:
-1
for down-regulation0
for un-changed (by the given criteria)1
for up-regulation
The results are therefore intended to feed directional Venn diagrams, which display the overlapping gene hits, and whether the directions are shared or opposed.
This function can optionally apply the limma-voom workflow,
which involves calculating matrix weights using limma::voom()
,
then applying those weights during the model fit.
The output is intended to include several convenient formats:
stats_dfs
- list ofdata.frame
stats one per contraststats_df
- onedata.frame
with all stats togetherhit_array
-array
with three dimensions: signal; contrast; threshold whose cells contain hit flags (-1, 0, 1) named byrownames(se)
.
Design and contrast matrices can be defined using the function
jamses::groups_to_sedesign()
. That function assigns each sample
to a sample group, then assembles all relevant group contrasts
which involve only one-factor contrast at a time. It optionally
defines two-factor contrasts (contrast of contrasts) where
applicable.
A subset of genes (rownames(se)
) or samples (colnames(se)
) can
be defined, to restrict calculations to use only the subset data.
See also
Other jamses stats:
ebayes2dfs()
,
format_hits()
,
handle_na_values()
,
hit_array_to_list()
,
process_sestats_to_hitim()
,
run_limma_replicate()
,
save_sestats()
,
sestats_to_dfs()
,
sestats_to_df()
,
voom_jam()
Examples
set.seed(123)
expr <- rnorm(20) + 7;
noise <- rnorm(120) / 5;
fold <- rnorm(20) / 2.5;
m <- matrix(expr + noise, ncol=6);
for (i in 4:6) {
m[,i] <- m[,i] + fold;
}
# m <- matrix(rnorm(120)/2 + 5, ncol=6);
colnames(m) <- paste0(rep(c("Vehicle", "Treated"), each=3), 1:3)
rownames(m) <- paste0("row", seq_len(20))
# simulate some "hits"
m[1, 4:6] <- m[1, 4:6] + 2
m[2, 4:6] <- m[2, 4:6] + 1.5
m[3, 4:6] <- m[3, 4:6] - 1.3
# create SummarizedExperiment
se <- SummarizedExperiment::SummarizedExperiment(
assays=list(counts=m),
colData=data.frame(sample=colnames(m),
group=factor(rep(c("Vehicle", "Treated"), each=3),
c("Vehicle", "Treated"))),
rowData=data.frame(measurement=rownames(m)))
# assign colors
sample_color_list <- platjam::design2colors(se)
# heatmap
heatmap_se(se, sample_color_list=sample_color_list)
# create SEDesign
sedesign <- groups_to_sedesign(se, group_colnames="group")
# plot the design and contrasts
plot_sedesign(sedesign)
#> Warning: no non-missing arguments to max; returning -Inf
#> Warning: no non-missing arguments to max; returning -Inf
# limma contrasts
sestats <- se_contrast_stats(se=se,
sedesign=sedesign,
assay_names="counts")
# print data.frame summary of hits
sestats_to_df(sestats)
#> Signal Contrasts hit adjp0.05 fc1.5
#> Treated-Vehicle counts Treated-Vehicle 8 hits (4 up, 4 down)
# plot the design with number of hits labeled
plot_sedesign(sedesign, sestats=sestats)
#> Warning: no non-missing arguments to max; returning -Inf
#> Warning: no non-missing arguments to max; returning -Inf
# heatmap with hits
heatmap_se(se,
sample_color_list=sample_color_list,
sestats=sestats)
# review stats table
stats_df <- sestats$stats_dfs$counts[["Treated-Vehicle"]]
head(stats_df)
#> probes hit adjp0.05 fc1.5 Treated-Vehicle logFC Treated-Vehicle
#> row1 row1 1 2.3337734
#> row2 row2 1 1.4366480
#> row3 row3 -1 -1.8277444
#> row4 row4 -1 -0.6312502
#> row5 row5 0 -0.5632394
#> row6 row6 0 -0.0696241
#> fold Treated-Vehicle P.Value Treated-Vehicle adj.P.Val Treated-Vehicle
#> row1 5.041222 6.410602e-17 1.282120e-15
#> row2 2.706912 2.127696e-11 1.418464e-10
#> row3 -3.549816 3.099395e-14 3.099395e-13
#> row4 -1.548907 3.752311e-04 9.143257e-04
#> row5 -1.477583 1.285519e-03 2.571038e-03
#> row6 -1.049443 6.392157e-01 7.102397e-01
#> mgm Treated-Vehicle Vehicle mean Treated mean
#> row1 8.681105 6.347332 8.681105
#> row2 8.144589 6.707941 8.144589
#> row3 8.383734 8.383734 6.555990
#> row4 7.098608 7.098608 6.467357
#> row5 7.096697 7.096697 6.533457
#> row6 8.547980 8.547980 8.478356
# volcano plot for one contrast
jamma::volcano_plot(stats_df)
#> ## (14:56:43) 28Oct2024: volcano_plot(): sig_colname: adj.P.Val Treated-Vehicle
#> ## (14:56:43) 28Oct2024: volcano_plot(): lfc_colname: logFC Treated-Vehicle
###############################
# simulate a two-way model
adjust <- rnorm(20);
new_fold <- rnorm(20);
se2 <- cbind(se, se)
groups2 <- paste0(rep(c("WT", "KO"), each=6),
"_",
SummarizedExperiment::colData(se)$group)
groups2 <- factor(groups2, levels=unique(groups2))
colnames(se2) <- paste0(groups2, 1:3)
SummarizedExperiment::colData(se2)$sample <- colnames(se2);
SummarizedExperiment::colData(se2)$group <- groups2;
SummarizedExperiment::colData(se2)$genotype <- jamba::gsubOrdered("_.+", "",
SummarizedExperiment::colData(se2)$group)
SummarizedExperiment::colData(se2)$treatment <- jamba::gsubOrdered("^.+_", "",
SummarizedExperiment::colData(se2)$group)
SummarizedExperiment::assays(se2)$counts[,7:12] <- (
SummarizedExperiment::assays(se2)$counts[,7:12] + adjust)
SummarizedExperiment::assays(se2)$counts[,10:12] <- (
SummarizedExperiment::assays(se2)$counts[,10:12] + new_fold)
# assign colors
sample_color_list2 <- platjam::design2colors(se2,
class_colnames="genotype",
color_sub=c(
Vehicle="palegoldenrod",
Treated="firebrick4"),
group_colnames="group")
# heatmap
hm2a <- heatmap_se(se2,
sample_color_list=sample_color_list2,
controlSamples=1:3,
center_label="versus WT_Vehicle",
column_title_rot=60,
row_cex=0.4,
column_cex=0.5,
column_title_gp=grid::gpar(fontsize=8),
column_split="group")
ComplexHeatmap::draw(hm2a, column_title=attr(hm2a, "hm_title"), merge_legends=TRUE)
# heatmap centered by genotype
hm2b <- heatmap_se(se2,
sample_color_list=sample_color_list2,
controlSamples=c(1:3, 7:9),
control_label="versus vehicle",
column_gap=grid::unit(c(1, 3, 1), "mm"),
centerby_colnames="genotype",
column_title_rot=60,
row_cex=0.4,
column_cex=0.5,
column_title_gp=grid::gpar(fontsize=8),
column_split="group")
ComplexHeatmap::draw(hm2b, column_title=attr(hm2b, "hm_title"), merge_legends=TRUE)
# create SEDesign
sedesign2 <- groups_to_sedesign(se2, group_colnames="group")
# plot the design and contrasts
plot_sedesign(sedesign2, label_cex=0.7)
# note the two-way contrast can be flipped
plot_sedesign(sedesign2, flip_twoway=TRUE, label_cex=0.7)
# limma contrasts
sestats2 <- se_contrast_stats(se=se2,
sedesign=sedesign2,
assay_names="counts")
# print data.frame summary of hits
sestats_to_df(sestats2)
#> Signal
#> KO_Vehicle-WT_Vehicle counts
#> KO_Treated-WT_Treated counts
#> WT_Treated-WT_Vehicle counts
#> KO_Treated-KO_Vehicle counts
#> (KO_Treated-WT_Treated)-(KO_Vehicle-WT_Vehicle) counts
#> Contrasts
#> KO_Vehicle-WT_Vehicle KO_Vehicle-WT_Vehicle
#> KO_Treated-WT_Treated KO_Treated-WT_Treated
#> WT_Treated-WT_Vehicle WT_Treated-WT_Vehicle
#> KO_Treated-KO_Vehicle KO_Treated-KO_Vehicle
#> (KO_Treated-WT_Treated)-(KO_Vehicle-WT_Vehicle) (KO_Treated-WT_Treated)-(KO_Vehicle-WT_Vehicle)
#> hit adjp0.05 fc1.5
#> KO_Vehicle-WT_Vehicle 8 hits (4 up, 4 down)
#> KO_Treated-WT_Treated 11 hits (6 up, 5 down)
#> WT_Treated-WT_Vehicle 8 hits (4 up, 4 down)
#> KO_Treated-KO_Vehicle 14 hits (6 up, 8 down)
#> (KO_Treated-WT_Treated)-(KO_Vehicle-WT_Vehicle) 11 hits (4 up, 7 down)
# plot the design with number of hits labeled
plot_sedesign(sedesign2, sestats=sestats2, label_cex=0.7)
# heatmap with hits
hm2hits <- heatmap_se(se2,
sample_color_list=sample_color_list2,
sestats=sestats2,
controlSamples=c(1:3, 7:9),
column_gap=grid::unit(c(1, 3, 1), "mm"),
control_label="versus vehicle",
centerby_colnames="genotype",
row_split=3,
row_cex=0.4,
column_title_rot=60,
column_cex=0.5,
column_title_gp=grid::gpar(fontsize=8),
column_split="group")
ComplexHeatmap::draw(hm2hits,
column_title=attr(hm2hits, "hm_title"),
merge_legends=TRUE)
# venn diagram
hit_list <- hit_array_to_list(sestats2,
contrast_names=c(1:2))
jamba::sdim(hit_list);
#> rows class
#> KO_Vehicle-WT_Vehicle 8 numeric
#> KO_Treated-WT_Treated 11 numeric
# names(hit_list) <- gsub(":", ",<br>\n", contrast2comp(names(hit_list)))
# vo <- venndir::venndir(hit_list, expand_fraction=0.1)
# venndir::venndir_legender(venndir_out=vo, setlist=hit_list)
# vo <- venndir::venndir(hit_list, expand_fraction=0.1, proportional=TRUE)
# venndir::venndir_legender(venndir_out=vo, setlist=hit_list)