Normalize data using MA-plot logic

jammanorm(
  x,
  controlGenes = NULL,
  minimum_mean = 0,
  controlSamples = NULL,
  centerGroups = NULL,
  useMedian = TRUE,
  useMean = NULL,
  noise_floor = -Inf,
  noise_floor_value = noise_floor,
  verbose = FALSE,
  ...
)

Arguments

x: numeric matrix with expression data suitable for use by jammaplot(). Gene expression data is typically transformed using log2(1+x) to represent reasonably normal distribution.
minimum_mean: numeric value used to filter controlGenes, a control gene must have at least this level of expression to be included in the normalization, where the expression is determined by the mean or median value analogous to the x-axis value in MA-plots.
controlSamples: character vector of colnames(x) passed to centerGeneData() which defines the control samples during the data centering step. By default, and the most common practice, MA-plots are calculated across all samples, which effectively uses all colnames(x) as controlSamples. However, it is quite useful sometimes to provide a subset of samples especially if there are known quality samples, to which new samples of unknown quality are being compared.
centerGroups: character vector of groups passed to jamma::centerGeneData() which determines how data is centered. Each group is centered independently, to enable visual comparisons within each relevant centering group. It is useful to center within batches or within subsets of samples that are not intended to be compared to one another. Another useful alternative is to center by each sample group in order to view the variability among group replicates, which should be much lower than variability across sample groups. See centerGeneData() for more specific examples.
useMedian: logical indicates whether to center data using the median value, where useMedian=FALSE by default. The median is preferred in cases where outliers should not influence centering. The mean is preferred in cases where the data should visualize data in a manner consistent with downstream parametric statistical analysis. When a particular sample represents a technical outlier, one option to visualize data without being skewed by the outlier is to define controlSamples to exclude the outlier sample(s). In this way, data centering will be applied using the non-outlier samples as reference.
useMean: (deprecated) logical, use useMedian. This argument indicates whether to center data using the mean value. When useMean=NULL the argument useMedian is preferred. For backward compatibility, when useMean is not NULL, then useMedian is defined by useMedian <- !useMean.
noise_floor, noise_floor_value: numeric values passed to jammacalc(). The noise_floor is the value below which a floor is applied. The floor sets all values below this floor to noise_floor_value. For example, one could apply noise_floor=0 and noise_floor_value=NA which would change any value below 0 to NA.
verbose: logical indicating whether to print verbose output.
...: additional parameters sent to downstream functions, jamba::plotSmoothScatter(), jammacalc(), centerGeneData().

Value

numeric

matrix whose columns have been normalized, with the following named attributes:

"nf" numeric vector of normalization factors;
"hk" a character vector of controlGenes for each sample;
"hk_count" the integer number of controlGenes for each sample.

Details

This function normalizes data using an approach analogous to data viewed in MA-plots. Normalization is applied by shifting data along the y-axis so the mean (or median) expression among control genes is zero, indicated on an MA-plot as y=0.

Note: This method should be performed only after reviewing the MA-plots, to ensure the assumptions are met. Similarly, data can also be viewed in MA-plots after normalization to confirm and review the effect of normalization.

It is useful to run jammaplot() data after jammanorm() to visualize the effect of this normalization. If data is not centered at y=0, the parameters should be adjusted.

Assumptions

This method effectively reinforces the assumption that the mean log fold change for control genes is expected to be zero. When useMedian=TRUE it reinforces the assumption that the log fold change for the majority of control genes should be zero.

Therefore, the assumptions may be summarized as follows:

The principle assumption is that the set of controlGenes, whose mean expression is at or above the minimum_mean value, are unchanged within the respective centerGroups. For typical whole genome transcript microarray and RNA-seq experiments, this assumption is typically valid when using useMedian=TRUE. For experiments with specific reference genes, or housekeeper genes, this assumption may only be true for those specific genes.
The data signal is assumed to be a roughly linear representation of the relative abundance of each measured entity, which is usually true for log-transformed microarray intensity, or log-transformed RNA-seq read counts. For QPCR or TaqMan, this assumption is valid for the direct CT cycle threshold values, or after log-transformation of the exponentiated CT values, for example 2^(40-CT). All that said, a straightforward way to visualize this assumption is with MA-plots, to confirm that signal is horizontal across the full signal range - either for the majority of all genes, or for the specific controlGenes used for normalization.
The variability among control genes should not be more than twice the median absolute deviation across other samples within the relevant centerGroups. Effectively this assumption means the control genes on the MA-plot should not show wide spread along the y-axis.

In cases where some samples show non-horizontal signal across the MA-plot, the data is not conforming to a consistent and proportional signal across the response range of the experiment. In effect, it means signal is being compressed, or expanded along the response as compared to other samples in the same centerGroups. In this scenario, the best normalization method may be limma::normalizeQuantiles(), limma::normalizeCyclicLoess(), or vsn::vsn(). These methods adjust the distribution of signal to enforce consistency across samples.

In general, the signal distribution itself should not be adjusted unless necessary, in order to retain as much information from the underlying technology as possible. This method jammanorm() is intended to apply linear normalization, which effectively shifts the entire signal for a sample up or down relative to other samples.

This scenario is effective for technologies such as QPCR, TaqMan, Nanostring counts, and RNA-seq counts or RNA-seq pseudocounts.

When the MA-plot demonstrates non-horizontal signal, it is most often the result one or both of these influences:

batch effect, imposed either by different upstream sample processing steps among the samples being tested, or
platform technology that tends to produce relative signal strength but not absolute quantitative signal, commonly seen with microarray hybridization technologies such as Affymetrix, Illumina, Agilent, SomaLogic, or Myriad RBM.

Note that any upstream sample amplification technique may also impose non-linear effects on the molecules being measured.

One method to test for a batch effect is to define centerGroups to include batch, so the data will be centered for each batch independently. If this centering resolves the non-horizontal signal, then batch is very likely to be a component to be modeled in the experiment. See limma::removeBatchEffect(). The batch effect adjustment by limma::removeBatchEffect() and sva::ComBat() almost exactly subtract the batch component from the signal.

That said, it may or may not be ideal to apply batch adjustment prior to running downstream statistical tests, as opposed to including batch as a covariate term in the statistical model used for testing, example when using DESeq2. The main benefit of applying batch adjustment at this step is to visualize data downstream consistent with the method used by those statistical tests, or when running a clustering technique that does not have the capability of applying appropriate batch effect modeling.

About the normalization

This normalization is a "linear normalization" in that it uniformly shifts data up or down relative to other samples, without modifying the relative distribution of signal. It is very similar to housekeeper normalization, geometric mean normalization, and global signal scaling, which are all also "linear normalization" methods. An example of non-linear normalization is quantile or VSN normalization.

The control genes can be defined upfront with controlGenes, which can be housekeeper genes, or a custom subset of genes. The controlGenes are filtered to require the mean or median expression at or above minimum_mean in order to be used during normalization.

The minimum_mean threshold is useful and important to match the variability seen in the MA-plots. For example data below a certain x-axis value may have very high variability, and should usually not be used for normalization.

When the MA-plot after normalization does not show signal centered at y=0, the most common and effective adjustment is to apply minimum_mean to require controGenes to have expression at or above this threshold. The next most effective option is useMedian=TRUE which will center the majority of genes at y=0 instead of the overall mean at y=0.

Comparison to geometric mean normalization

The end result is very similar to other housekeeper normalization methods which typically define normalization factors by calculating the geometric mean of log-transformed housekeeper gene abundances. Such approaches usually work in part because higher expressed housekeeper genes usually have lower variability, which keeps the influence on geometric mean reasonably consistent across a broad range of expression. That said, genes with higher expression have more influence on the geometric mean than genes with much lower expression.

However, the strategy with jammanorm() is to assert that the mean difference from average expression for the controlGenes should be equal to zero. The effect is applied evenly across control genes by evaluating the mean or median difference from y=0 for each sample.

Noise threshold

Note that some platform technologies generate a noise threshold, below which data may be skewed up or down relative to other samples. It is recommended to ignore this type of skew below the threshold when determining whether data is horizontal on MA-plots.

For example Nanostring data includes a series of positive and negative control probes, and a suitable noise threshold is either the midpoint between the lowest positive and highest negative probe, or the lowest positive probe. When this noise threshold is applied, data above the noise threshold is typically horizontal, although data below the threshold may be skewed up or down depending upon the effective input RNA concentration.