Center gene data

Performs per-row centering on a numeric matrix

centerGeneData(
  x,
  centerGroups = NULL,
  na.rm = TRUE,
  controlSamples = NULL,
  useMedian = TRUE,
  rmOutliers = FALSE,
  madFactor = 5,
  controlFloor = NA,
  naControlAction = c("na", "row", "floor", "min"),
  naControlFloor = 0,
  rowStatsFunc = NULL,
  returnGroupedValues = FALSE,
  returnGroups = FALSE,
  mean = NULL,
  verbose = FALSE,
  ...
)

Arguments

x

numeric matrix of input data. The data is assumed to be log2-transformed, or otherwise appropriately transformed; see the assumptions described in Details.

centerGroups

character vector of group names, or NULL if there are no groups.

na.rm

logical indicating whether NA values should be ignored for summary statistics. This argument is passed to the corresponding row stats function. Frankly, this value should be na.rm=TRUE for all stat functions by default, for example mean(..., na.rm=TRUE) should be default.

controlSamples

character vector of values in colnames(x) which defines the columns to use when calculating group summary values.

useMedian

logical indicating whether to use group median values (TRUE) or group mean values (FALSE) when calculating summary statistics. In either case, when rowStatsFunc is provided, it is used instead.

rmOutliers

logical indicating whether to perform outlier detection and removal prior to row group stats. This argument is passed to jamba::rowGroupMeans(). Note that outliers are only removed during the row group summary step, and not in the centered data.

madFactor

numeric value passed to jamba::rowGroupMeans(), indicating the MAD factor threshold to use when rmOutliers=TRUE. The MAD of each row group is computed, the overall group median MAD is used to define 1x MAD factor, and any MAD more than madFactor times the group median MAD is considered an outlier and is removed. The remaining data is used to compute row group values.

controlFloor

numeric value used as a minimum for any control summary value during centering. Use NA to skip this behavior. When defined, all control group summary values are calculated, then any values below controlFloor are set to the controlFloor for the purpose of data centering. By default controlFloor=NA which imposes no such floor value. However, controlFloor=0 would be appropriate when zero is defined as effective noise floor after something like background subtraction during the upstream processing or upstream normalization. Using a value above zero would be appropriate when the effective noise floor of a platform is above zero, so that values are not centered relative to noise. For example, if the effective noise floor is 5, then centering should not "amplify" differences from any value less than 5, since in this scenario a value of 5 or less is effectively the same as a value of 5. It has the effect of returning fold changes relative to the effective platform minimum detectable signal.
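The effect of a floor can be sketched in base R. This is only an illustration of the arithmetic described above, not the function's internal code; the values and the floor of 5 are hypothetical.

```r
# Hypothetical control summary values for four rows (log2 scale)
control_summary <- c(2, 5, 8, 11)
# One sample's values for the same four rows
sample_values <- c(6, 6, 10, 12)

# Assumed effective noise floor of the platform
controlFloor <- 5

# Raise any control summary value below the floor up to the floor
floored_control <- pmax(control_summary, controlFloor)

# Center by subtraction against the floored control values
centered <- sample_values - floored_control

# Centering without a floor, for comparison
unfloored <- sample_values - control_summary
```

In this sketch the first row is centered to 1 rather than 4, because the control value of 2 is treated as indistinguishable from the floor of 5.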

naControlAction

character string indicating how to handle the specific scenario when the control group summary value is NA for a particular centering operation.

"na": default is to return NA since 15 - NA = NA.
"row": use the summary value across all relevant samples, so the centering is against all non-NA values within the center group.
"floor": use the numeric value defined by naControlFloor, to indicate a practical noise floor for the centering operation. When naControlFloor=0 (default) this option effectively keeps non-NA values without centering these values.
"min": use the minimum control value as the floor, which effectively defines the floor by the lowest observed summary value across all rows. It assumes rows are generally on the same range of detection, even if not all rows have the same observed range. For example, microarray probes have reasonably similar theoretical range of detection, even if some probes to highly-expressed genes are commonly observed with higher signal. The lowest observed signal effectively sets the minimum detected value.

rowStatsFunc

optional function used to calculate row group summary values. This function should take a numeric matrix as input, and return a one-column numeric matrix as output, or a numeric vector with length nrow(x). The function should also accept na.rm as an argument.
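As a sketch of a function that meets this contract, the hypothetical rowTrimmedMeans() below takes a numeric matrix, accepts na.rm, and returns a numeric vector with one value per row. The name and the 20% trim are illustrative choices, not part of the package.

```r
# Hypothetical custom row summary: 20% trimmed mean per row.
# Accepts na.rm and returns a numeric vector of length nrow(x).
rowTrimmedMeans <- function(x, na.rm=TRUE, ...) {
   apply(x, 1, function(i) mean(i, trim=0.2, na.rm=na.rm))
}

# Small test matrix with one extreme value and one NA
m <- matrix(c(1,  2, 3, 4, 100,
              5, NA, 7, 8,   9),
   nrow=2, byrow=TRUE)

row_stats <- rowTrimmedMeans(m)
```

A function of this shape could then be supplied as rowStatsFunc=rowTrimmedMeans.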

returnGroupedValues

logical indicating whether to include the numeric matrix of row group values used during centering, returned in the attributes with name "x_group".

returnGroups

logical indicating whether to return the centering summary data.frame in attributes with name "center_df".

verbose

logical indicating whether to print verbose output.

...

additional arguments are passed to jamba::rowGroupMeans().

Details

This function centers data by subtracting the median or mean for each row.

Columns can be grouped using argument centerGroups. Each group of columns defined by centerGroups is centered independently.

Data can be centered relative to specific control columns using argument controlSamples. When controlSamples is not supplied, the default behavior is to use all columns. This process is consistent with typical MA-plots.

It may be preferred to define controlSamples in cases where there are known reference samples, against which other samples should be compared.

The controlSamples logic is applied independently to each group defined in centerGroups.
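The grouping and control logic can be illustrated with a minimal base R sketch. It is not the function's implementation: it assumes median centering and that every center group contains at least one control column, and the loop variable names are hypothetical.

```r
x <- matrix(1:100, ncol=10)
colnames(x) <- letters[1:10]
centerGroups <- rep(c("A", "B"), c(5, 5))
controlSamples <- letters[c(1:3, 6:8)]

# Empty result matrix with the same shape and column names as x
x_ctr <- matrix(NA_real_, nrow=nrow(x), ncol=ncol(x),
   dimnames=dimnames(x))

for (g in unique(centerGroups)) {
   # Columns in this center group, and the controls among them
   group_cols <- colnames(x)[centerGroups == g]
   control_cols <- intersect(group_cols, controlSamples)
   # Per-row median of the control columns in this group only
   control_median <- apply(x[, control_cols, drop=FALSE], 1,
      median, na.rm=TRUE)
   # Subtract each row's control median from every column in the group
   x_ctr[, group_cols] <- x[, group_cols, drop=FALSE] - control_median
}
x_ctr[1, ]
```

Each row of this result is -10, 0, 10, 20, 30 within both groups, the same values shown for grouped centering versus specific control columns in the Examples.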

You can confirm the centerGroups and controlSamples are correct in the result data, by accessing the attribute "center_df", see examples below.

Note: This function assumes input data is suitable for centering by subtraction. This data requirement is true for:

most log-transformed gene expression data
quantitative PCR (QPCR) cycle threshold (CT) values
other numeric data that has been suitably transformed to meet a reasonable parametric assumption of normality
rank-transformed data, where centering results in a difference in rank
generally speaking, any data where the difference between 5 and 7 (2) is reasonably similar to the difference between 15 and 17 (2).
it may be feasible to perform background subtraction on straight count data, for example sequence coverage at a particular location in a genome.

The data requirement is not true for:

most gene expression data in normal space (hint: if any value is above 100, it is generally not log-transformed)
numeric data that is strongly skewed
generally speaking, any data where the difference between 5 and 7 is not reasonably similar to the difference between 15 and 17. If the percent difference is more likely to be the interesting measure, data may be log-transformed for analysis.
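A quick check along these lines can be sketched in base R. The values are hypothetical, and the threshold of 100 is only the rule of thumb given above, not a guarantee. Note that log2() requires strictly positive values; log2(1 + x) is a common alternative when zeros are present.

```r
# Hypothetical linear-scale values; row 2 is exactly twice row 1
raw <- matrix(c(10, 250, 4000,
                20, 500, 8000),
   nrow=2, byrow=TRUE)

# Rule of thumb: values above 100 suggest the data is not log-transformed
looks_linear <- any(raw > 100, na.rm=TRUE)

# Log-transform only when the data appears to be on a linear scale
x_log <- if (looks_linear) log2(raw) else raw

# A two-fold change is now a constant difference at every magnitude
fold_diff <- x_log[2, ] - x_log[1, ]
```

After the transform, the two-fold difference between the rows is 1 in every column, which is the property that makes centering by subtraction meaningful.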

For special cases, rowStatsFunc can be supplied to perform specific group summary calculations per row.

Control groups with NA values (since version 0.0.28.900)

When controlSamples is supplied and contains only NA values for a given row of data, within the relevant centerGroups subset, the behavior is determined by argument naControlAction:

naControlAction="na": values are centered versus NA which results in all values NA (current behavior, default).
naControlAction="row": values are centered versus the row, using all samples in the same center group. This action effectively "centers to what we have".
naControlAction="floor": values are centered versus a numeric floor defined by argument naControlFloor. When naControlFloor=0 then values are effectively not centered. However, naControlFloor=10 could for example be used to center values versus a practical noise floor, if the range of detection for a particular experiment starts at 10 as a low value.
naControlAction="min": values are centered versus the minimum observed summary value in the data, which effectively uses the data to define a value for naControlFloor.
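The four actions can be compared on a single row with a small base R sketch. This illustrates the arithmetic described above rather than the function's internal code; all values, and the helper center_value(), are hypothetical, and median is assumed as the summary statistic.

```r
# One row within one center group: two controls (both NA), two treated
row_values <- c(NA, NA, 12, 14)
is_control <- c(TRUE, TRUE, FALSE, FALSE)

# Hypothetical control summary values observed on other rows
other_control_summaries <- c(6, 8, 9)
naControlFloor <- 0

# The control summary for this row is NA, which triggers naControlAction
control_summary <- median(row_values[is_control], na.rm=TRUE)

# Value to center against under each action
center_value <- function(action) {
   switch(action,
      na = NA_real_,
      row = median(row_values, na.rm=TRUE),
      floor = naControlFloor,
      min = min(other_control_summaries, na.rm=TRUE))
}

# One column of centered values per action
centered <- sapply(c("na", "row", "floor", "min"),
   function(a) row_values - center_value(a))
```

Here the non-control values 12 and 14 become NA under "na", -1 and 1 under "row", stay 12 and 14 under "floor" with naControlFloor=0, and become 6 and 8 under "min".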

The motivation to center versus something other than controlSamples when all measurements for controlSamples are NA is to have a numeric value to indicate that a measurement was detected in non-control columns. This situation occurs in technologies when control samples have very low signal, and in some cases report NA when no measurement is detected within the instrument range of detection.

Examples

x <- matrix(1:100, ncol=10);
colnames(x) <- letters[1:10];
# basic centering
centerGeneData(x);
#>         a   b   c   d  e f  g  h  i  j
#>  [1,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [2,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [3,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [4,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [5,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [6,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [7,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [8,] -45 -35 -25 -15 -5 5 15 25 35 45
#>  [9,] -45 -35 -25 -15 -5 5 15 25 35 45
#> [10,] -45 -35 -25 -15 -5 5 15 25 35 45

# grouped centering
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)));
#>         a   b c  d  e   f   g h  i  j
#>  [1,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [2,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [3,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [4,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [5,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [6,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [7,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [8,] -20 -10 0 10 20 -20 -10 0 10 20
#>  [9,] -20 -10 0 10 20 -20 -10 0 10 20
#> [10,] -20 -10 0 10 20 -20 -10 0 10 20

# centering versus specific control columns
centerGeneData(x,
   controlSamples=letters[c(1:3)]);
#>         a b  c  d  e  f  g  h  i  j
#>  [1,] -10 0 10 20 30 40 50 60 70 80
#>  [2,] -10 0 10 20 30 40 50 60 70 80
#>  [3,] -10 0 10 20 30 40 50 60 70 80
#>  [4,] -10 0 10 20 30 40 50 60 70 80
#>  [5,] -10 0 10 20 30 40 50 60 70 80
#>  [6,] -10 0 10 20 30 40 50 60 70 80
#>  [7,] -10 0 10 20 30 40 50 60 70 80
#>  [8,] -10 0 10 20 30 40 50 60 70 80
#>  [9,] -10 0 10 20 30 40 50 60 70 80
#> [10,] -10 0 10 20 30 40 50 60 70 80

# grouped centering versus specific control columns
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)]);
#>         a b  c  d  e   f g  h  i  j
#>  [1,] -10 0 10 20 30 -10 0 10 20 30
#>  [2,] -10 0 10 20 30 -10 0 10 20 30
#>  [3,] -10 0 10 20 30 -10 0 10 20 30
#>  [4,] -10 0 10 20 30 -10 0 10 20 30
#>  [5,] -10 0 10 20 30 -10 0 10 20 30
#>  [6,] -10 0 10 20 30 -10 0 10 20 30
#>  [7,] -10 0 10 20 30 -10 0 10 20 30
#>  [8,] -10 0 10 20 30 -10 0 10 20 30
#>  [9,] -10 0 10 20 30 -10 0 10 20 30
#> [10,] -10 0 10 20 30 -10 0 10 20 30

# confirm the centerGroups and controlSamples
x_ctr <- centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)],
   returnGroups=TRUE);
attr(x_ctr, "center_df");
#>   sample centerGroups controlSamples
#> a      a            A           TRUE
#> b      b            A           TRUE
#> c      c            A           TRUE
#> d      d            A          FALSE
#> e      e            A          FALSE
#> f      f            B           TRUE
#> g      g            B           TRUE
#> h      h            B           TRUE
#> i      i            B          FALSE
#> j      j            B          FALSE