Performs per-row centering on a numeric matrix
centerGeneData(
x,
centerGroups = NULL,
na.rm = TRUE,
controlSamples = NULL,
useMedian = TRUE,
rmOutliers = FALSE,
madFactor = 5,
controlFloor = NA,
naControlAction = c("na", "row", "floor", "min"),
naControlFloor = 0,
rowStatsFunc = NULL,
returnGroupedValues = FALSE,
returnGroups = FALSE,
mean = NULL,
verbose = FALSE,
...
)
x: numeric matrix of input data. See the assumptions below:
data is assumed to be log2-transformed, or otherwise
appropriately transformed.
centerGroups: character vector of group names, or
NULL if there are no groups.
na.rm: logical indicating whether NA values should be
ignored for summary statistics. This argument is passed
to the corresponding row stats function. Frankly, na.rm=TRUE
should be the default for all stat functions, for example
mean(..., na.rm=TRUE).
controlSamples: character vector of values in colnames(x)
which defines the columns to use when calculating group
summary values.
useMedian: logical indicating whether to use group median
values (TRUE) or group means (FALSE) when calculating summary
statistics. In either case, when rowStatsFunc
is provided, it is used instead.
rmOutliers: logical indicating whether to perform outlier
detection and removal prior to row group stats. This
argument is passed to jamba::rowGroupMeans(). Note that
outliers are only removed during the row group summary step,
and not in the centered data.
madFactor: numeric value passed to jamba::rowGroupMeans(),
indicating the MAD factor threshold to use when rmOutliers=TRUE.
The MAD of each row group is computed, the overall group median
MAD is used to define 1x MAD factor, and any MAD more than
madFactor times the group median MAD is considered an outlier
and is removed. The remaining data is used to compute row
group values.
controlFloor: numeric value used as a minimum for any control
summary value during centering.
Use NA to skip this behavior.
When defined, all control group summary values are calculated,
then any values below controlFloor are set to controlFloor
for the purpose of data centering.
By default controlFloor=NA, which imposes no floor value.
However, controlFloor=0 would be appropriate when zero is defined
as the effective noise floor, for example after background subtraction
during upstream processing or normalization.
Using a value above zero would be appropriate when the effective
noise floor of a platform is above zero, so that values are not
centered relative to noise. For example, if the effective noise
floor is 5, then centering should not "amplify" differences from
any value less than 5, since in this scenario a value of 5 or less
is effectively the same as a value of 5. The floor has the effect of
returning fold changes relative to the effective platform minimum
detectable signal.
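As a rough illustration of the floor described above, the following base R sketch applies a floor of 5 to hypothetical control summary values before subtraction. The names `ctrl_summary` and `signal` are made up for this example, and the use of `pmax()` is an assumption about how such a floor could be applied, not a description of the function's internal code.

```r
# hypothetical control summary values for three rows
ctrl_summary <- c(2, 5, 8)
# hypothetical measurements for one non-control sample
signal <- c(9, 9, 9)
controlFloor <- 5

# assumption: raise any control summary value below the floor up to the floor
ctrl_floored <- pmax(ctrl_summary, controlFloor)

# centered values without the floor, then with the floor
centered_raw <- signal - ctrl_summary
centered_floored <- signal - ctrl_floored
centered_raw
centered_floored
```

In this sketch only the first row changes, because it is the only row whose control summary value is below the floor.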
naControlAction: character string indicating how to handle the specific
scenario when the control group summary value is NA for a particular
centering operation.
"na": default is to return NA, since for example 15 - NA = NA.
"row": use the summary value across all relevant samples,
so the centering is against all non-NA values within the center group.
"floor": use the numeric value defined by naControlFloor,
to indicate a practical noise floor for the centering operation.
When naControlFloor=0 (default) this option effectively keeps
non-NA values without centering these values.
"min": use the minimum control value as the floor, which effectively
defines the floor by the lowest observed summary value across all
rows. It assumes rows are generally on the same range of detection,
even if not all rows have the same observed range. For example,
microarray probes have a reasonably similar theoretical range of
detection, even if some probes for highly expressed genes are
commonly observed with higher signal. The lowest observed signal
effectively sets the minimum detected value.
rowStatsFunc: optional function used to calculate row group
summary values. This function should take a numeric matrix as
input, and return a one-column numeric matrix as output, or
a numeric vector with length nrow(x). The function should
also accept na.rm as an argument.
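A function of the shape described above might look like the following sketch. The name `rowTrimmedMeans` and the 20% trim are hypothetical choices for illustration. The `...` in its signature is an assumption, included so the function tolerates any extra arguments it may be handed.

```r
# hypothetical row summary: 20% trimmed mean of each row
# input: numeric matrix; output: numeric vector with length nrow(x)
rowTrimmedMeans <- function(x, na.rm=TRUE, ...) {
   apply(x, 1, function(i) mean(i, trim=0.2, na.rm=na.rm))
}

# small test matrix with one extreme value and one NA
m <- matrix(c(1, 2, 3, 4, 100,
   2, 4, 6, 8, NA),
   nrow=2, byrow=TRUE)
rowTrimmedMeans(m)

# not run here; this is how it would be supplied:
# centerGeneData(x, rowStatsFunc=rowTrimmedMeans)
```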
returnGroupedValues: logical indicating whether to include
the numeric matrix of row group values used during centering,
returned in the attributes with name "x_group".
returnGroups: logical indicating whether to return the
centering summary data.frame in attributes with name "center_df".
verbose: logical indicating whether to print verbose output.
...: additional arguments are passed to jamba::rowGroupMeans().
This function centers data by subtracting the median or mean for each row.
Columns can be grouped using argument centerGroups.
Each group of columns defined by centerGroups
is centered independently.
Data can be centered relative to specific control columns
using argument controlSamples.
When controlSamples is not supplied, the default behavior
is to use all columns. This process is consistent with
typical MA-plots.
It may be preferred to define controlSamples in cases where
there are known reference samples, against which other samples
should be compared.
The controlSamples logic is applied independently to each
group defined in centerGroups.
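The grouping and control logic described above can be sketched in base R as follows. This is a simplified illustration, not the function's actual implementation: `center_sketch` is a hypothetical name, it uses the row median only, and it omits outlier removal, floors, and NA control handling. The fallback to all group columns when a group has no control columns is an assumption.

```r
# simplified sketch of grouped, control-based row centering (median only)
center_sketch <- function(x, centerGroups=NULL, controlSamples=NULL) {
   if (is.null(centerGroups)) {
      centerGroups <- rep("all", ncol(x))
   }
   if (is.null(controlSamples)) {
      controlSamples <- colnames(x)
   }
   out <- x * 1.0
   for (grp in unique(centerGroups)) {
      grp_cols <- which(centerGroups == grp)
      ctrl_cols <- grp_cols[colnames(x)[grp_cols] %in% controlSamples]
      # assumption: use all columns in the group when it has no controls
      if (length(ctrl_cols) == 0) {
         ctrl_cols <- grp_cols
      }
      # per-row median of the control columns within this group
      ctrl_median <- apply(x[, ctrl_cols, drop=FALSE], 1, median, na.rm=TRUE)
      # subtracting a length-nrow vector recycles down each column
      out[, grp_cols] <- x[, grp_cols, drop=FALSE] - ctrl_median
   }
   out
}

x <- matrix(1:100, ncol=10);
colnames(x) <- letters[1:10];
x_sketch <- center_sketch(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)]);
x_sketch[1, ]
```

Each group is handled in its own loop iteration, so the control columns of group "A" never influence the centering of group "B".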
You can confirm the centerGroups and controlSamples are
correct in the result data, by accessing the attribute
"center_df", see examples below.
Note: This function assumes input data is suitable for centering by subtraction. This data requirement is true for:
most log-transformed gene expression data
quantitative PCR (QPCR) cycle threshold (CT) values
other numeric data that has been suitably transformed to meet a reasonable parametric assumption of normality
rank-transformed data which results in difference in rank
generally speaking, any data where the difference between 5 and 7 (2) is reasonably similar to the difference between 15 and 17 (2).
it may be feasible to perform background subtraction on straight count data, for example sequence coverage at a particular location in a genome.
The data requirement is not true for:
most gene expression data in normal space (hint: if any value is above 100, it is generally not log-transformed)
numeric data that is strongly skewed
generally speaking, any data where the difference between 5 and 7 is not reasonably similar to the difference between 15 and 17. If the percent difference is more likely to be the interesting measure, data may be log-transformed for analysis.
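The following short example illustrates the point about percent differences, using made-up values. In normal space a two-fold change produces a different absolute difference at each magnitude, while after log2 transformation the same two-fold change produces the same difference.

```r
# hypothetical expression values in normal (linear) space
low_pair <- c(10, 20)
high_pair <- c(100, 200)

# linear differences depend on the magnitude of the values
diff(low_pair)
diff(high_pair)

# log2 differences are the same for both pairs, each a two-fold change
diff(log2(low_pair))
diff(log2(high_pair))
```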
For special cases, rowStatsFunc can be supplied to perform
specific group summary calculations per row.
When controlSamples is supplied, and contains all NA values
for a given row of data, within relevant centerGroups subsets,
the behavior is determined by naControlAction, whose default is "na":
naControlAction="na": values are centered versus NA which
results in all values NA (current behavior, default).
naControlAction="row": values are centered versus the row,
using all samples in the same center group. This action effectively
"centers to what we have".
naControlAction="floor": values are centered versus a numeric
floor defined by argument naControlFloor. When naControlFloor=0
then values are effectively not centered. However, naControlFloor=10
could for example be used to center values versus a practical noise
floor, if the range of detection for a particular experiment starts
at 10 as a low value.
naControlAction="min": values are centered versus the minimum
observed summary value in the data, which effectively uses the data
to define a value for naControlFloor.
The motivation to center versus something other than controlSamples
when all measurements for controlSamples are NA is to have
a numeric value to indicate that a measurement was detected in
non-control columns. This situation occurs in technologies where
control samples have very low signal, and in some cases report
NA when no measurement is detected within the instrument range
of detection.
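The four naControlAction choices can be compared on a single hypothetical row, as in the base R sketch below. All names and values here are invented for illustration, and the sketch only mimics the described behavior; it does not call centerGeneData() or reproduce its internal code.

```r
# one hypothetical row in one center group; controls c1 and c2 are both NA
row_values <- c(c1=NA, c2=NA, s1=12, s2=14)
# hypothetical control summary values across all rows; this row's is NA
all_ctrl_summaries <- c(6, 8, NA, 7)
naControlFloor <- 0

# reference value substituted for the NA control summary, per action
reference <- list(
   na = NA_real_,
   row = median(row_values, na.rm=TRUE),
   floor = naControlFloor,
   min = min(all_ctrl_summaries, na.rm=TRUE))

# center the row against each reference value
centered <- lapply(reference, function(ref) row_values - ref)
centered
```

With "na" the whole row becomes NA, while the other three actions keep a numeric value for s1 and s2, which is the motivation given above.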
Other jam matrix functions:
jammacalc(),
jammanorm(),
matrix_to_column_rank()
x <- matrix(1:100, ncol=10);
colnames(x) <- letters[1:10];
# basic centering
centerGeneData(x);
#> a b c d e f g h i j
#> [1,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [2,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [3,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [4,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [5,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [6,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [7,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [8,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [9,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [10,] -45 -35 -25 -15 -5 5 15 25 35 45
# grouped centering
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)));
#> a b c d e f g h i j
#> [1,] -20 -10 0 10 20 -20 -10 0 10 20
#> [2,] -20 -10 0 10 20 -20 -10 0 10 20
#> [3,] -20 -10 0 10 20 -20 -10 0 10 20
#> [4,] -20 -10 0 10 20 -20 -10 0 10 20
#> [5,] -20 -10 0 10 20 -20 -10 0 10 20
#> [6,] -20 -10 0 10 20 -20 -10 0 10 20
#> [7,] -20 -10 0 10 20 -20 -10 0 10 20
#> [8,] -20 -10 0 10 20 -20 -10 0 10 20
#> [9,] -20 -10 0 10 20 -20 -10 0 10 20
#> [10,] -20 -10 0 10 20 -20 -10 0 10 20
# centering versus specific control columns
centerGeneData(x,
   controlSamples=letters[c(1:3)]);
#> a b c d e f g h i j
#> [1,] -10 0 10 20 30 40 50 60 70 80
#> [2,] -10 0 10 20 30 40 50 60 70 80
#> [3,] -10 0 10 20 30 40 50 60 70 80
#> [4,] -10 0 10 20 30 40 50 60 70 80
#> [5,] -10 0 10 20 30 40 50 60 70 80
#> [6,] -10 0 10 20 30 40 50 60 70 80
#> [7,] -10 0 10 20 30 40 50 60 70 80
#> [8,] -10 0 10 20 30 40 50 60 70 80
#> [9,] -10 0 10 20 30 40 50 60 70 80
#> [10,] -10 0 10 20 30 40 50 60 70 80
# grouped centering versus specific control columns
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)]);
#> a b c d e f g h i j
#> [1,] -10 0 10 20 30 -10 0 10 20 30
#> [2,] -10 0 10 20 30 -10 0 10 20 30
#> [3,] -10 0 10 20 30 -10 0 10 20 30
#> [4,] -10 0 10 20 30 -10 0 10 20 30
#> [5,] -10 0 10 20 30 -10 0 10 20 30
#> [6,] -10 0 10 20 30 -10 0 10 20 30
#> [7,] -10 0 10 20 30 -10 0 10 20 30
#> [8,] -10 0 10 20 30 -10 0 10 20 30
#> [9,] -10 0 10 20 30 -10 0 10 20 30
#> [10,] -10 0 10 20 30 -10 0 10 20 30
# confirm the centerGroups and controlSamples
x_ctr <- centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)],
   returnGroups=TRUE);
attr(x_ctr, "center_df");
#> sample centerGroups controlSamples
#> a a A TRUE
#> b b A TRUE
#> c c A TRUE
#> d d A FALSE
#> e e A FALSE
#> f f B TRUE
#> g g B TRUE
#> h h B TRUE
#> i i B FALSE
#> j j B FALSE