Collapse SummarizedExperiment data by row

Usage

se_collapse_by_row(
  se,
  rows = rownames(se),
  row_groups,
  assay_names = NULL,
  group_func_name = c("sum", "mean", "weighted.mean", "geomean", "none"),
  rowStatsFunc = NULL,
  rowDataColnames = NULL,
  keepNULLlevels = FALSE,
  delim = "[ ]*[;,]+[ ]*",
  data_transform = c("none", "log2p+sqrt", "log2+sqrt", "log2p", "log2"),
  verbose = TRUE,
  ...
)

Arguments

se

SummarizedExperiment

rows

character vector of rownames(se) to use for analysis. When rows=NULL the default is to use all rownames(se).

row_groups

character vector representing groups of rows to be combined.

assay_names

character vector of names(assays(se)) to use for the collapse operation. When assay_names=NULL the default is to use all assays(se).

group_func_name

character name of function used to aggregate measurement data within row_groups.

sum - takes the sum() of the values in each group. When the data has been transformed, this option should be used together with data_transform, so that the data is inverse-transformed prior to calculating the sum(), after which the data is re-transformed to its original state. This method is appropriate for "log2p", that is log2(1 + x), transformed abundance measurements, for example.
mean - calculates the mean value per group. Note that in this case it is usually recommended not to define data_transform, so that values are averaged in the appropriately transformed numeric space.
weighted.mean - calculates weighted.mean() where the weights w are defined by the values being averaged. This method may be appropriate and effective with normal space abundance values derived from proteomics mass spec quantitation.
geomean - calculates geometric mean of values in each group.
none -
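
The options above can be illustrated with a minimal base R sketch on toy data. It does not call se_collapse_by_row() or jamba::rowGroupMeans(); the values and group labels are made up, and the geometric mean shown is the textbook definition, which is an assumption and may differ from the implementation used internally.

```r
# Toy data: one sample measured across five peptide rows,
# belonging to two hypothetical proteins "A" and "B".
x <- c(pep1 = 10, pep2 = 30, pep3 = 60, pep4 = 4, pep5 = 16)
row_groups <- c("A", "A", "A", "B", "B")

# sum: total of the values in each group
group_sum <- tapply(x, row_groups, sum)

# mean: arithmetic mean of the values in each group
group_mean <- tapply(x, row_groups, mean)

# weighted.mean: each value is weighted by itself, so larger
# measurements contribute more to the group summary
group_wmean <- tapply(x, row_groups,
  function(v) weighted.mean(v, w = v))

# geomean: textbook geometric mean (assumed definition)
group_geomean <- tapply(x, row_groups,
  function(v) exp(mean(log(v))))

print(rbind(sum = group_sum,
  mean = group_mean,
  weighted.mean = group_wmean,
  geomean = group_geomean))
```

In this toy case the weighted mean sits above the arithmetic mean in both groups, because the largest measurements carry the largest weights.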

rowStatsFunc

function optional function used instead of group_func_name.

rowDataColnames

character subset of colnames in rowData(se) to be retained in the output data. Multiple values are combined usually by comma-delimited concatenation within row_groups, therefore it may be beneficial to include only relevant columns in that output.

keepNULLlevels

logical indicating whether to keep unused factor levels in row_groups. This argument is passed to jamba::rowGroupMeans().

delim

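character string containing a regular expression pattern used as a delimiter. The default matches one or more commas or semicolons, with optional surrounding spaces.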

data_transform

character string indicating which transformation was used when preparing the assay data. The assumption is that all assays were transformed by this method. During processing, data is inverse-transformed prior to applying group_func_name, or rowStatsFunc if supplied. After that function is applied, data is re-transformed using the same transformation. The purpose is to enable taking the sum() in proper measured absolute units (in normal space, for example) where relevant, after which the original numeric transformation is re-applied.
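
The inverse-transform, summarize, re-transform sequence described for data_transform can be sketched in base R for the "log2p" case. The helper names log2p() and inverse_log2p() are hypothetical and used only for illustration, rowsum() stands in for the internal group summary, and the matrix values are made up.

```r
# Hypothetical helpers for the "log2p" case only.
log2p <- function(x) log2(1 + x)
inverse_log2p <- function(x) 2^x - 1

# Toy abundances in normal space: four peptide rows, two samples.
abundance <- matrix(
  c(100, 300, 50, 150,
    200, 600, 80, 120),
  nrow = 4,
  dimnames = list(paste0("pep", 1:4), c("s1", "s2")))

# The assay as it would be stored, already log2(1 + x) transformed.
assay_log2p <- log2p(abundance)
row_groups <- c("protA", "protA", "protB", "protB")

# 1. inverse-transform to normal space
# 2. sum the rows within each group
# 3. re-apply the original transformation
normal_space <- inverse_log2p(assay_log2p)
group_sums <- rowsum(normal_space, group = row_groups)
collapsed <- log2p(group_sums)
print(collapsed)
```

Each collapsed value is the log2(1 + x) of the summed normal space abundances for that group and sample, so the output remains on the same scale as the input assay.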

verbose

logical indicating whether to print verbose output.

...

additional arguments are passed to jamba::rowGroupMeans().

Value

SummarizedExperiment object with these changes:

rows will be collapsed by row_groups, for each assays(se) numeric matrix defined by assay_names. The collapse may optionally apply a data transformation defined in data_transform in order to apply an appropriate numeric summary calculation.
rowData(se) will also be collapsed by shrinkDataFrame() to combine unique values from each row annotation.
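
As a rough illustration of how row annotations might be combined within each group, the following base R sketch joins unique values with commas. It is not the shrinkDataFrame() implementation; the column names, values, and the collapse_unique() helper are hypothetical.

```r
# Toy row annotation with hypothetical column names.
row_data <- data.frame(
  peptide = c("AAK", "AAK", "LLR", "GGK"),
  gene = c("GENE1", "GENE1", "GENE1", "GENE2"),
  ptm = c("Phospho", "Oxidation", "Phospho", ""),
  stringsAsFactors = FALSE)
row_groups <- row_data$gene

# Keep unique, non-empty values and join them with commas.
collapse_unique <- function(v) {
  v <- unique(v[!is.na(v) & nzchar(v)])
  paste(v, collapse = ",")
}

# Apply the helper to every column, one output row per group.
collapsed <- aggregate(row_data,
  by = list(row_group = row_groups),
  FUN = collapse_unique)
print(collapsed)
```

Columns with many distinct values per group produce long concatenated strings, which is one reason to restrict the output with rowDataColnames.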

Details

The purpose is to collapse rows of a SummarizedExperiment object, where measurements for a given entity, usually a gene, are split across multiple rows in the source data. The output of this function should be measurements appropriately summarized to the gene level.

The key arguments are group_func_name and data_transform. Note that data is inverse-transformed based upon data_transform, prior to calculating group summary values defined by group_func_name. The reason is to enable using group_func_name="sum" on normal space abundance values when input data has already been transformed, with log2(1 + x) for example. In this case it is most appropriate to take the sum of normal space abundance values, then to re-apply the transformation afterwards.

However, when using group_func_name="mean" it is usually recommended to use data_transform="none" so that data is maintained in its appropriately transformed state.

The driving use case is proteomics mass spectrometry data, where measurements are described in terms of peptide sequences, with or without optional post-translational modification (PTM), and the peptide sequences are annotated to a source protein or gene. This function can be used to:

collapse peptide-PTM data to the peptide level
collapse peptide data to the protein level

In future it may be used to collapse multiple microarray probe measurements to the gene level, although that process is more likely to be useful and recommended after performing probe-level statistical analysis.

Proteomics mass spectrometry analysis

For proteomics mass spectrometry data, proteins are inconsistently fragmented into smaller peptides of varying sizes. The peptides are usually separated on a chromatography column, from which aliquot fractions are taken and measured by mass spectrometry. The total signal derived from the original protein is therefore some combination of the measured peptide parts.

In some upstream data processing tools, such as Proteome Discoverer and PEAKS, the peptide data may be annotated with observed modification events (PTM). In this scenario, peptide measurements are split across multiple rows of data, where each row represents an observed combination of peptide and PTMs.

Collapse methods

It is fairly straightforward to observe that peptide-PTM measurement data is correlated with overall protein quantification, and that the specific combination of peptide fragments may be inconsistent across samples. That is, one may observe five peptides of protein A in one sample, and seven peptides of protein A in another sample. The quantities of each peptide may be inconsistent, due to variability in protein fragmentation across samples. However, the general sum of peptide measurements is typically fairly stable across samples, especially for proteins of moderate to high abundance which are known to have stable abundance per cell.

The choice of method to collapse measurements is not trivial, and is therefore configurable. In general, proteomics abundances are analyzed after log2(1 + x) transformation. However, measurements cannot be summed in log2 form, because adding log2 values is equivalent to multiplying the measurements in normal form. Measurements can be summed, but only after exponentiating the data; for example, the inverse transformation (2 ^ x) - 1 is sufficient.
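
The point about summing in log2 form can be checked numerically with a short base R sketch. The two abundance values are made up for illustration.

```r
# Two hypothetical peptide abundances in normal space.
x <- c(100, 300)

# Values as stored after log2(1 + x) transformation.
y <- log2(1 + x)

# Summing the stored values directly is the same as multiplying
# the (1 + x) terms in normal space, which is not a total abundance.
naive <- sum(y)
stopifnot(isTRUE(all.equal(naive, log2(prod(1 + x)))))

# Exponentiate with (2 ^ y) - 1, sum, then re-apply log2(1 + x).
proper <- log2(1 + sum(2^y - 1))
stopifnot(isTRUE(all.equal(proper, log2(1 + sum(x)))))
```

The naive result corresponds to a product of the inputs, while the second result corresponds to their sum, which is the quantity intended when collapsing peptides to a protein.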