Normalize data using MA-plot logic
jammanorm(
x,
controlGenes = NULL,
minimum_mean = 0,
controlSamples = NULL,
centerGroups = NULL,
useMedian = TRUE,
useMean = NULL,
noise_floor = -Inf,
noise_floor_value = noise_floor,
verbose = FALSE,
...
)
numeric
matrix with expression data suitable for use
by jammaplot()
. Gene expression data is typically transformed
using log2(1+x)
to represent reasonably normal distribution.
numeric
value used to filter controlGenes
,
a control gene must have at least this level of expression to
be included in the normalization, where the expression is
determined by the mean or median value analogous to the x-axis
value in MA-plots.
character
vector of colnames(x)
passed
to centerGeneData()
which defines the control samples during
the data centering step.
By default, and the most common practice, MA-plots are
calculated across all samples, which effectively uses all
colnames(x)
as controlSamples
.
However, it is quite useful sometimes to provide a subset of
samples especially if there are known quality samples, to which
new samples of unknown quality are being compared.
character
vector of groups passed to
jamma::centerGeneData()
which determines how data is centered.
Each group is centered independently, to enable visual
comparisons within each relevant centering group.
It is useful to center within batches or within
subsets of samples that are not intended to be compared to
one another.
Another useful alternative is to center by each sample
group in order to view the variability among group
replicates, which should be much lower than variability across
sample groups. See centerGeneData()
for more specific examples.
logical
indicates whether to center data
using the median
value, where useMedian=FALSE
by default.
The median is preferred in cases where outliers should not influence
centering.
The mean is preferred in cases where the data should visualize
data in a manner consistent with downstream parametric
statistical analysis.
When a particular sample represents a technical outlier,
one option to visualize data without being skewed by the outlier
is to define controlSamples
to exclude the outlier sample(s).
In this way, data centering will be applied using the non-outlier
samples as reference.
(deprecated) logical
, use useMedian
. This argument
indicates whether to center data using the mean
value.
When useMean=NULL
the argument useMedian
is preferred.
For backward compatibility, when useMean
is not NULL
,
then useMedian
is defined by useMedian <- !useMean
.
numeric
values passed
to jammacalc()
. The noise_floor
is the value below which
a floor is applied. The floor sets all values below this floor
to noise_floor_value
. For example, one could apply noise_floor=0
and noise_floor_value=NA
which would change any value below 0
to NA
.
logical indicating whether to print verbose output.
additional parameters sent to downstream functions,
jamba::plotSmoothScatter()
, jammacalc()
, centerGeneData()
.
numeric
matrix
whose columns have been normalized,
with the following named attributes
:
"nf"
numeric vector of normalization factors;
"hk"
a character
vector of controlGenes for each sample;
"hk_count"
the integer
number of controlGenes for each sample.
This function normalizes data using an approach analogous to data viewed in MA-plots. Normalization is applied by shifting data along the y-axis so the mean (or median) expression among control genes is zero, indicated on an MA-plot as y=0.
Note: This method should be performed only after reviewing the MA-plots, to ensure the assumptions are met. Similarly, data can also be viewed in MA-plots after normalization to confirm and review the effect of normalization.
It is useful to run jammaplot()
data after jammanorm()
to visualize the effect of this normalization. If data is not
centered at y=0, the parameters should be adjusted.
This method effectively reinforces the assumption that the mean log fold
change for control genes is expected to be zero. When useMedian=TRUE
it reinforces the assumption that the log fold change for the
majority of control genes should be zero.
Therefore, the assumptions may be summarized as follows:
The principle assumption is that the set of controlGenes
,
whose mean expression is at or above the minimum_mean
value,
are unchanged within the respective centerGroups
. For typical
whole genome transcript microarray and RNA-seq experiments,
this assumption is typically valid when using useMedian=TRUE
.
For experiments with specific reference genes, or housekeeper
genes, this assumption may only be true for those specific genes.
The data signal is assumed to be a roughly linear representation
of the relative abundance of each measured entity, which is usually
true for log-transformed microarray intensity, or log-transformed
RNA-seq read counts. For QPCR or TaqMan, this assumption is valid
for the direct CT cycle threshold values, or after log-transformation
of the exponentiated CT values, for example 2^(40-CT).
All that said, a straightforward way to visualize this assumption
is with MA-plots, to confirm that signal is horizontal across
the full signal range - either for the majority of all genes,
or for the specific controlGenes
used for normalization.
The variability among control genes should not be more than twice
the median absolute deviation across other samples within the
relevant centerGroups
. Effectively this assumption means the
control genes on the MA-plot should not show wide spread along the
y-axis.
In cases where some samples show non-horizontal signal across the
MA-plot, the data is not conforming to a consistent and proportional
signal across the response range of the experiment. In effect, it
means signal is being compressed, or expanded along the response
as compared to other samples in the same centerGroups
. In this
scenario, the best normalization method may be
limma::normalizeQuantiles()
, limma::normalizeCyclicLoess()
,
or vsn::vsn()
. These methods adjust the distribution of signal
to enforce consistency across samples.
In general, the signal distribution itself should not be adjusted
unless necessary, in order to retain as much information from the
underlying technology as possible. This method jammanorm()
is
intended to apply linear normalization, which effectively shifts
the entire signal for a sample up or down relative to other samples.
This scenario is effective for technologies such as QPCR, TaqMan, Nanostring counts, and RNA-seq counts or RNA-seq pseudocounts.
When the MA-plot demonstrates non-horizontal signal, it is most often the result one or both of these influences:
batch effect, imposed either by different upstream sample processing steps among the samples being tested, or
platform technology that tends to produce relative signal strength but not absolute quantitative signal, commonly seen with microarray hybridization technologies such as Affymetrix, Illumina, Agilent, SomaLogic, or Myriad RBM.
Note that any upstream sample amplification technique may also impose non-linear effects on the molecules being measured.
One method to test for a batch effect is to define centerGroups
to include batch, so the data will be centered for each batch
independently. If this centering resolves the non-horizontal
signal, then batch is very likely to be a component to be modeled
in the experiment. See limma::removeBatchEffect()
. The batch effect
adjustment by limma::removeBatchEffect()
and sva::ComBat()
almost exactly subtract the batch component from the signal.
That said, it may or may not be ideal to apply batch adjustment
prior to running downstream statistical tests, as opposed to
including batch as a covariate term in the statistical
model used for testing, example when using DESeq2
.
The main benefit of applying batch adjustment at this step is to
visualize data downstream consistent with the method used by
those statistical tests, or when running a clustering technique
that does not have the capability of applying appropriate
batch effect modeling.
This normalization is a "linear normalization" in that it uniformly shifts data up or down relative to other samples, without modifying the relative distribution of signal. It is very similar to housekeeper normalization, geometric mean normalization, and global signal scaling, which are all also "linear normalization" methods. An example of non-linear normalization is quantile or VSN normalization.
The control genes can be defined upfront with controlGenes
,
which can be housekeeper genes, or a custom subset of genes.
The controlGenes
are filtered to require the mean or median
expression at or above minimum_mean
in order to be used
during normalization.
The minimum_mean
threshold is useful and important to match
the variability seen in the MA-plots. For example data below
a certain x-axis value may have very high variability, and
should usually not be used for normalization.
When the MA-plot after normalization does not show signal centered
at y=0, the most common and effective adjustment is to apply
minimum_mean
to require controGenes
to have expression
at or above this threshold. The next most effective option is
useMedian=TRUE
which will center the majority of genes at
y=0 instead of the overall mean at y=0.
The end result is very similar to other housekeeper normalization methods which typically define normalization factors by calculating the geometric mean of log-transformed housekeeper gene abundances. Such approaches usually work in part because higher expressed housekeeper genes usually have lower variability, which keeps the influence on geometric mean reasonably consistent across a broad range of expression. That said, genes with higher expression have more influence on the geometric mean than genes with much lower expression.
However, the strategy with jammanorm()
is to assert that
the mean difference from average expression for the controlGenes
should be equal to zero. The effect is applied evenly across
control genes by evaluating the mean or median difference from y=0
for each sample.
Note that some platform technologies generate a noise threshold, below which data may be skewed up or down relative to other samples. It is recommended to ignore this type of skew below the threshold when determining whether data is horizontal on MA-plots.
For example Nanostring data includes a series of positive and negative control probes, and a suitable noise threshold is either the midpoint between the lowest positive and highest negative probe, or the lowest positive probe. When this noise threshold is applied, data above the noise threshold is typically horizontal, although data below the threshold may be skewed up or down depending upon the effective input RNA concentration.
Other jam matrix functions:
centerGeneData()
,
jammacalc()
,
matrix_to_column_rank()