Performs per-row centering on a numeric matrix
centerGeneData(
x,
centerGroups = NULL,
na.rm = TRUE,
controlSamples = NULL,
useMedian = TRUE,
rmOutliers = FALSE,
madFactor = 5,
controlFloor = NA,
naControlAction = c("na", "row", "floor", "min"),
naControlFloor = 0,
rowStatsFunc = NULL,
returnGroupedValues = FALSE,
returnGroups = FALSE,
mean = NULL,
verbose = FALSE,
...
)
x: numeric matrix of input data. See the assumptions noted below: data is assumed to be log2-transformed, or otherwise appropriately transformed.
centerGroups: character vector of group names, or NULL if there are no groups.
na.rm: logical indicating whether NA values should be ignored for summary statistics. This argument is passed to the corresponding row stats function. Frankly, na.rm=TRUE should be the default for all stat functions, for example mean(..., na.rm=TRUE).
controlSamples: character vector of values in colnames(x) which defines the columns to use when calculating group summary values.
useMedian: logical indicating whether to use group median values (TRUE) or group means (FALSE) when calculating summary statistics. In either case, when rowStatsFunc is provided, it is used instead.
rmOutliers: logical indicating whether to perform outlier detection and removal prior to row group stats. This argument is passed to jamba::rowGroupMeans(). Note that outliers are only removed during the row group summary step, and not in the centered data.
madFactor: numeric value passed to jamba::rowGroupMeans(), indicating the MAD factor threshold to use when rmOutliers=TRUE. The MAD of each row group is computed, and the overall group median MAD defines a MAD factor of 1x. Any MAD more than madFactor times the group median MAD is considered an outlier and is removed. The remaining data is used to compute row group values.
controlFloor: numeric value used as a minimum for any control summary value during centering. Use NA to skip this behavior.
When defined, all control group summary values are calculated, then any values below controlFloor are set to controlFloor for the purpose of data centering.
By default controlFloor=NA, which imposes no floor value. However, controlFloor=0 would be appropriate when zero is defined as the effective noise floor, for example after background subtraction during upstream processing or normalization.
A value above zero would be appropriate when the effective noise floor of a platform is above zero, so that values are not centered relative to noise. For example, if the effective noise floor is 5, then centering should not "amplify" differences from any value less than 5, since in this scenario a value of 5 or less is effectively the same as a value of 5. The effect is to return fold changes relative to the effective minimum detectable signal of the platform.
naControlAction: character string indicating how to handle the specific scenario when the control group summary value is NA for a particular centering operation.
"na": default is to return NA, since 15 - NA = NA.
"row": use the summary value across all relevant samples, so the centering is against all non-NA values within the center group.
"floor": use the numeric value defined by naControlFloor, to indicate a practical noise floor for the centering operation. When naControlFloor=0 (default) this option effectively keeps non-NA values without centering them.
"min": use the minimum control value as the floor, which effectively defines the floor by the lowest observed summary value across all rows. It assumes rows are generally on the same range of detection, even if not all rows have the same observed range. For example, microarray probes have a reasonably similar theoretical range of detection, even if some probes to highly-expressed genes are commonly observed with higher signal. The lowest observed signal effectively sets the minimum detected value.
rowStatsFunc: optional function used to calculate row group summary values. This function should take a numeric matrix as input, and return a one-column numeric matrix as output, or a numeric vector with length nrow(x). The function should also accept na.rm as an argument.
returnGroupedValues: logical indicating whether to include the numeric matrix of row group values used during centering, returned in the attributes with name "x_group".
returnGroups: logical indicating whether to return the centering summary data.frame in attributes with name "center_df".
verbose: logical indicating whether to print verbose output.
...: additional arguments are passed to jamba::rowGroupMeans().
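The controlFloor argument described above can be illustrated with a small base R sketch. It does not call centerGeneData(); the variable names and values are hypothetical, and leaving NA control values as NA here is an assumption of the sketch (NA controls are governed by naControlAction).

```r
# Hypothetical control summary values for four rows (log2 scale)
control_summary <- c(2, 5, 8, NA)
control_floor <- 5

# Raise any control summary below the floor up to the floor.
# pmax() keeps NA as NA by default (an assumption of this sketch).
floored <- pmax(control_summary, control_floor)

# A sample value of 9 in every row, centered without and with the floor
sample_value <- 9
sample_value - control_summary  # 7 4 1 NA
sample_value - floored          # 4 4 1 NA
```

With the floor applied, the first row reports a difference of 4 rather than 7, so a control value below the noise floor no longer inflates the apparent change.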
This function centers data by subtracting the median or mean for each row. Columns can be grouped using argument centerGroups. Each group of columns defined by centerGroups is centered independently.
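As a rough illustration of grouped centering, the base R sketch below subtracts a per-row median within each column group. The helper center_by_group() is hypothetical and is not the package implementation; it only mirrors the behavior described above under simple assumptions.

```r
# Hypothetical helper: center each row within each column group,
# using the row median (or mean) of that group's columns.
center_by_group <- function(x, groups = NULL, use_median = TRUE) {
  if (is.null(groups)) {
    groups <- rep("all", ncol(x))
  }
  stat_fun <- if (use_median) stats::median else mean
  out <- x
  for (g in unique(groups)) {
    cols <- which(groups == g)
    # one summary value per row, computed only from this group's columns
    row_stat <- apply(x[, cols, drop = FALSE], 1, stat_fun, na.rm = TRUE)
    # the vector is recycled down the columns, so row i loses row_stat[i]
    out[, cols] <- x[, cols, drop = FALSE] - row_stat
  }
  out
}

x <- matrix(1:100, ncol = 10)
colnames(x) <- letters[1:10]
x_grp <- center_by_group(x, groups = rep(c("A", "B"), c(5, 5)))
x_grp[1, ]
```

For this input the first row becomes -20, -10, 0, 10, 20 within each group, which agrees with the grouped centering output shown in the examples.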
Data can be centered relative to specific control columns using argument controlSamples. When controlSamples is not supplied, the default behavior is to use all columns. This process is consistent with typical MA-plots.
It may be preferable to define controlSamples in cases where there are known reference samples, against which other samples should be compared.
The controlSamples logic is applied independently to each group defined in centerGroups.
You can confirm that centerGroups and controlSamples are correct in the result data by accessing the attribute "center_df", see examples below.
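The interaction between centerGroups and controlSamples can be sketched in base R as follows. The helper center_vs_controls() is hypothetical, and its fallback to all columns when a group contains no control columns is an assumption made for this sketch, not documented behavior.

```r
# Hypothetical helper: within each group, center every column against the
# row median of that group's control columns.
center_vs_controls <- function(x, groups, controls) {
  out <- x
  for (g in unique(groups)) {
    cols <- which(groups == g)
    # control columns are restricted to the current group
    ctrl_cols <- cols[colnames(x)[cols] %in% controls]
    # assumption: a group without controls is centered against all its columns
    if (length(ctrl_cols) == 0) {
      ctrl_cols <- cols
    }
    row_stat <- apply(x[, ctrl_cols, drop = FALSE], 1, stats::median,
      na.rm = TRUE)
    out[, cols] <- x[, cols, drop = FALSE] - row_stat
  }
  out
}

x <- matrix(1:100, ncol = 10)
colnames(x) <- letters[1:10]
x_ctl <- center_vs_controls(x,
  groups = rep(c("A", "B"), c(5, 5)),
  controls = letters[c(1:3, 6:8)])
x_ctl[1, ]
```

Here the first row becomes -10, 0, 10, 20, 30 in each group, matching the grouped control example in the examples section.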
Note: This function assumes input data is suitable for centering by subtraction. This data requirement is true for:
most log-transformed gene expression data;
quantitative PCR (QPCR) cycle threshold (CT) values;
other numeric data that has been suitably transformed to meet a reasonable parametric assumption of normality;
rank-transformed data, which results in a difference in rank;
generally speaking, any data where the difference between 5 and 7 (2) is reasonably similar to the difference between 15 and 17 (2).
It may also be feasible to perform background subtraction on straight count data, for example sequence coverage at a particular location in a genome.
The data requirement is not true for:
most gene expression data in normal space (hint: if any value is above 100, it is generally not log-transformed);
numeric data that is strongly skewed;
generally speaking, any data where the difference between 5 and 7 is not reasonably similar to the difference between 15 and 17. If the percent difference is more likely to be the interesting measure, data may be log-transformed for analysis.
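The distinction above can be seen with a small base R sketch. The two genes and their values are hypothetical; each gene doubles between two conditions.

```r
# Two hypothetical genes, each doubling between condition 1 and condition 2
normal_space <- rbind(low = c(50, 100), high = c(5000, 10000))

# In normal space the difference depends on magnitude (50 versus 5000)
diff_normal <- normal_space[, 2] - normal_space[, 1]
diff_normal

# After log2 transformation both genes differ by 1, that is one doubling
log_space <- log2(normal_space)
diff_log <- log_space[, 2] - log_space[, 1]
diff_log
```

Subtraction in log2 space therefore yields comparable values for both genes, which is the property centering by subtraction relies on.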
For special cases, rowStatsFunc can be supplied to perform specific group summary calculations per row.
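As one possible example, a function meeting the documented requirements for rowStatsFunc (numeric matrix in, one value per row out, accepts na.rm) might look like the sketch below. The name row_trimmed_mean and the choice of a 20% trimmed mean are illustrative assumptions, and the extra `...` is included only so that unexpected arguments are tolerated.

```r
# Hypothetical row summary function: 20% trimmed mean per row.
# Takes a numeric matrix and returns a numeric vector of length nrow(x).
row_trimmed_mean <- function(x, na.rm = TRUE, ...) {
  apply(x, 1, mean, trim = 0.2, na.rm = na.rm)
}

m <- matrix(c(1, 2, 3, 4, 100,
              2, 4, 6, 8, NA),
  nrow = 2, byrow = TRUE)
row_stats <- row_trimmed_mean(m)
row_stats  # 3 5
```

Such a function could then be supplied as rowStatsFunc=row_trimmed_mean, so that the extreme value 100 in the first row has no influence on the summary used for centering.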
When controlSamples is supplied and contains all NA values for a given row of data, within the relevant centerGroups subsets, the behavior is defined by naControlAction, which defaults to "na":
naControlAction="na": values are centered versus NA, which results in all values NA (current behavior, default).
naControlAction="row": values are centered versus the row, using all samples in the same center group. This action effectively "centers to what we have".
naControlAction="floor": values are centered versus a numeric floor defined by argument naControlFloor. When naControlFloor=0, values are effectively not centered. However, naControlFloor=10 could for example be used to center values versus a practical noise floor, if the range of detection for a particular experiment starts at 10 as a low value.
naControlAction="min": values are centered versus the minimum observed summary value in the data, which effectively uses the data to define a value for naControlFloor.
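The four naControlAction options can be compared on a single row with a base R sketch. It does not call centerGeneData(); the row values, the naControlFloor value of 0, and the observed minimum of 3 are hypothetical.

```r
# One hypothetical row: control columns a and b are both NA
row_values <- c(a = NA, b = NA, c = 12, d = 14, e = 16)
is_control <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# The control summary is NA because no control value is available
control_stat <- unname(median(row_values[is_control], na.rm = TRUE))

# Hypothetical minimum control summary observed across all other rows
observed_min <- 3

# Value used in place of the NA control summary under each option
replacement <- list(
  na = NA_real_,                                       # stays NA
  row = median(unname(row_values), na.rm = TRUE),      # 14
  floor = 0,                                           # naControlFloor = 0
  min = observed_min)

centered <- lapply(replacement, function(ctrl) row_values - ctrl)
centered[["row"]]    # NA NA -2  0  2
centered[["floor"]]  # NA NA 12 14 16
centered[["min"]]    # NA NA  9 11 13
```

Only the "na" option returns an all-NA row; the other three keep a numeric signal for the non-control columns.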
The motivation to center versus something other than controlSamples when all measurements for controlSamples are NA is to have a numeric value to indicate that a measurement was detected in non-control columns. This situation occurs with technologies where control samples have very low signal, and in some cases report NA when no measurement is detected within the instrument range of detection.
Other jam matrix functions: jammacalc(), jammanorm(), matrix_to_column_rank()
x <- matrix(1:100, ncol=10);
colnames(x) <- letters[1:10];
# basic centering
centerGeneData(x);
#> a b c d e f g h i j
#> [1,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [2,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [3,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [4,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [5,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [6,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [7,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [8,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [9,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [10,] -45 -35 -25 -15 -5 5 15 25 35 45
# grouped centering
centerGeneData(x,
centerGroups=rep(c("A","B"), c(5,5)));
#> a b c d e f g h i j
#> [1,] -20 -10 0 10 20 -20 -10 0 10 20
#> [2,] -20 -10 0 10 20 -20 -10 0 10 20
#> [3,] -20 -10 0 10 20 -20 -10 0 10 20
#> [4,] -20 -10 0 10 20 -20 -10 0 10 20
#> [5,] -20 -10 0 10 20 -20 -10 0 10 20
#> [6,] -20 -10 0 10 20 -20 -10 0 10 20
#> [7,] -20 -10 0 10 20 -20 -10 0 10 20
#> [8,] -20 -10 0 10 20 -20 -10 0 10 20
#> [9,] -20 -10 0 10 20 -20 -10 0 10 20
#> [10,] -20 -10 0 10 20 -20 -10 0 10 20
# centering versus specific control columns
centerGeneData(x,
controlSamples=letters[c(1:3)]);
#> a b c d e f g h i j
#> [1,] -10 0 10 20 30 40 50 60 70 80
#> [2,] -10 0 10 20 30 40 50 60 70 80
#> [3,] -10 0 10 20 30 40 50 60 70 80
#> [4,] -10 0 10 20 30 40 50 60 70 80
#> [5,] -10 0 10 20 30 40 50 60 70 80
#> [6,] -10 0 10 20 30 40 50 60 70 80
#> [7,] -10 0 10 20 30 40 50 60 70 80
#> [8,] -10 0 10 20 30 40 50 60 70 80
#> [9,] -10 0 10 20 30 40 50 60 70 80
#> [10,] -10 0 10 20 30 40 50 60 70 80
# grouped centering versus specific control columns
centerGeneData(x,
centerGroups=rep(c("A","B"), c(5,5)),
controlSamples=letters[c(1:3, 6:8)]);
#> a b c d e f g h i j
#> [1,] -10 0 10 20 30 -10 0 10 20 30
#> [2,] -10 0 10 20 30 -10 0 10 20 30
#> [3,] -10 0 10 20 30 -10 0 10 20 30
#> [4,] -10 0 10 20 30 -10 0 10 20 30
#> [5,] -10 0 10 20 30 -10 0 10 20 30
#> [6,] -10 0 10 20 30 -10 0 10 20 30
#> [7,] -10 0 10 20 30 -10 0 10 20 30
#> [8,] -10 0 10 20 30 -10 0 10 20 30
#> [9,] -10 0 10 20 30 -10 0 10 20 30
#> [10,] -10 0 10 20 30 -10 0 10 20 30
# confirm the centerGroups and controlSamples
x_ctr <- centerGeneData(x,
centerGroups=rep(c("A","B"), c(5,5)),
controlSamples=letters[c(1:3, 6:8)],
returnGroups=TRUE);
attr(x_ctr, "center_df");
#> sample centerGroups controlSamples
#> a a A TRUE
#> b b A TRUE
#> c c A TRUE
#> d d A FALSE
#> e e A FALSE
#> f f B TRUE
#> g g B TRUE
#> h h B TRUE
#> i i B FALSE
#> j j B FALSE