Collapse SummarizedExperiment data by row
Usage
se_collapse_by_row(
  se,
  rows = rownames(se),
  row_groups,
  assay_names = NULL,
  group_func_name = c("sum", "mean", "weighted.mean", "geomean", "none"),
  rowStatsFunc = NULL,
  rowDataColnames = NULL,
  keepNULLlevels = FALSE,
  delim = "[ ]*[;,]+[ ]*",
  data_transform = c("none", "log2p+sqrt", "log2+sqrt", "log2p", "log2"),
  verbose = TRUE,
  ...
)
Arguments
- se
SummarizedExperiment
- rows
character vector of rows(se) to use for analysis. When rows=NULL the default is to use all rows(se).
- row_groups
character vector representing groups of rows to be combined.
- assay_names
character vector of names(assays(se)) to use for the collapse operation. When assay_names=NULL the default is to use all assays(se).
- group_func_name
character name of function used to aggregate measurement data within row_groups.
sum - takes the sum() of each value in the group. This option should be used together with data_transform when there has been any data transformation, so that the data is inverse-transformed prior to calculating the sum(), after which data is re-transformed to its original state. This method is appropriate for log2p log2(1 + x) transformed abundance measurements, for example.
mean - calculates the mean value per group. Note that in this case it is usually recommended not to define data_transform, so that values are averaged in the appropriately transformed numeric space.
weighted.mean - calculates weighted.mean() where weights w are defined by the values used. This method may be appropriate and effective with normal space abundance values derived from proteomics mass spec quantitation.
geomean - calculates the geometric mean of values in each group.
none -
- rowStatsFunc
function optional function used instead of group_func_name.
- rowDataColnames
character subset of colnames in rowData(se) to be retained in the output data. Multiple values are combined, usually by comma-delimited concatenation within row_groups, therefore it may be beneficial to include only relevant columns in that output.
- keepNULLlevels
logical indicating whether to drop unused factor levels in row_groups. This argument is passed to jamba::rowGroupMeans().
- delim
character string indicating a delimiter.
- data_transform
character string indicating which transformation was used when preparing the assay data. The assumption is that all assays were transformed by this method. During processing, data is inverse-transformed prior to applying the group_func_name, or rowStatsFunc if supplied. After that function is applied, data is re-transformed using this same transformation. The purpose is to enable taking the sum() in proper measured absolute units (in normal space for example) where relevant, after which the original numeric transformation is re-applied.
- verbose
logical indicating whether to print verbose output.
- ...
additional arguments are passed to jamba::rowGroupMeans().
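As a rough illustration of what each group_func_name option computes for a single group of values, the following base R sketch applies an equivalent calculation to a hypothetical numeric vector. It is not the internal implementation used by se_collapse_by_row(). In particular, the classical geometric mean shown here is an assumption and may differ from the definition the package uses.

```r
# Hypothetical normal-space abundances for one row group (two rows)
x <- c(2, 8)

group_sum <- sum(x)     # "sum"
group_mean <- mean(x)   # "mean"

# "weighted.mean": weights are the values themselves, so larger
# measurements contribute more to the result
group_wmean <- weighted.mean(x, w = x)

# "geomean": classical geometric mean (assumed definition)
group_geomean <- exp(mean(log(x)))

print(c(sum = group_sum,
        mean = group_mean,
        weighted.mean = group_wmean,
        geomean = group_geomean))
```

With value-derived weights, the weighted mean sits closer to the larger measurement than the plain mean does, which is the behavior described for weighted.mean above.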
Value
SummarizedExperiment object with these changes:
- rows will be collapsed by row_groups, for each assays(se) numeric matrix defined by assay_names. The collapse may optionally apply a data transformation defined in data_transform in order to apply an appropriate numeric summary calculation.
- rowData(se) will also be collapsed by shrinkDataFrame() to combine unique values from each row annotation.
Details
The purpose is to collapse rows of a SummarizedExperiment object,
where measurements for a given entity, usually a gene, are split
across multiple rows in the source data. The output of this function
should be measurements appropriately summarized to the gene level.
The key arguments are group_func_name and data_transform.
Note that data is inverse-transformed based upon data_transform,
prior to calculating group summary values defined by group_func_name.
The reason is to enable using group_func_name="sum" on normal
space abundance values, when input data has already been
transformed with log2(1 + x) for example. In this case it is most
appropriate to take the sum of normal space abundance values,
then to re-apply the transformation afterwards.
However, when using group_func_name="mean" it is usually
recommended to use data_transform="none" so that data is maintained
in its appropriately transformed state.
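The two cases above can be sketched in base R. This is a minimal illustration using hypothetical peptide rows and rowsum(), not the code path used by se_collapse_by_row(). The matrix values, row names, and group labels are invented for the example.

```r
# Hypothetical normal-space abundances: three peptide rows, two samples
x <- matrix(c(3, 7, 15,
              1, 3, 31),
            nrow = 3,
            dimnames = list(c("pepA1", "pepA2", "pepB1"), c("s1", "s2")))
row_groups <- c("protA", "protA", "protB")

# Assay data as it is typically stored: log2(1 + x) transformed
x_log <- log2(1 + x)

# "sum" with a log2p transform:
# inverse-transform, sum per group, then re-apply the transform
x_norm <- 2^x_log - 1
sum_log <- log2(1 + rowsum(x_norm, group = row_groups))

# "mean" with no transform:
# average the log2p values directly within each group
group_n <- as.vector(table(row_groups)[rownames(sum_log)])
mean_log <- rowsum(x_log, group = row_groups) / group_n

print(sum_log)
print(mean_log)
```

For a group with a single row (protB here), both approaches return the original log2p values unchanged. They differ only when multiple rows are combined.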
The driving use case is proteomics mass spectrometry data, where measurements are described in terms of peptide sequences, with or without optional post-translational modification (PTM), and the peptide sequences are annotated to a source protein or gene. This function can be used to:
- collapse peptide-PTM data to the peptide level
- collapse peptide data to the protein level
In the future it may be used to collapse multiple microarray probe measurements to the gene level, although that process is more likely to be useful and recommended after performing probe-level statistical analysis.
Proteomics mass spectrometry analysis
For proteomics mass spectrometry data, proteins are inconsistently fragmented into smaller peptides of varying sizes. The peptides are usually separated on a chromatography column, from which aliquot fractions are taken and measured by mass spectrometry. The total signal derived from the original protein is therefore some combination of the measured peptide parts.
In some upstream data processing tools, such as Proteome Discoverer and PEAKS, the peptide data may be annotated with observed modification events (PTM). In this scenario, peptide measurements are split across multiple rows of data, where each row represents an observed combination of peptide and PTMs.
Collapse methods
It is fairly straightforward to observe that peptide-PTM measurement data is correlated with overall protein quantification, and that the specific combination of peptide fragments may be inconsistent across samples. That is, one may observe five peptides of protein A in one sample, and seven peptides of protein A in another sample. The quantities of each peptide may also be inconsistent, due to variability in protein fragmentation across samples. However, the overall sum of peptide measurements is typically fairly stable across samples, especially for proteins of moderate to high abundance which are known to have stable abundance per cell.
Choice of method to collapse measurements is not trivial, and is
therefore configurable. In general, proteomics abundances are
analyzed after log2( 1 + x ) transformation. However, measurements
cannot be summed in log2 form, which would be equivalent to
multiplying measurements in normal form. Measurements can be summed,
but only after exponentiating the data; for example, the inverse
transformation ( 2 ^ x ) - 1 is sufficient.
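The arithmetic behind this point can be checked directly. The sketch below uses two hypothetical normal-space values, a and b, and compares adding their log2( 1 + x ) values with inverse-transforming before the sum.

```r
# Two hypothetical normal-space abundance values
a <- 7
b <- 24

# Naive: adding log2(1 + x) values multiplies the (1 + x) terms
naive_log <- log2(1 + a) + log2(1 + b)
naive_normal <- 2^naive_log - 1        # equals (1 + a) * (1 + b) - 1

# Inverse-transform each value, sum, then re-apply the transform
log_values <- c(log2(1 + a), log2(1 + b))
correct_log <- log2(1 + sum(2^log_values - 1))
correct_normal <- 2^correct_log - 1    # equals a + b

print(c(naive = naive_normal, correct = correct_normal))
```

The naive result grows multiplicatively with each added row, which is why the inverse transformation is applied before summing.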
See also
Other jamses SE utilities:
make_se_test(),
se_collapse_by_column(),
se_detected_rows(),
se_normalize(),
se_rbind(),
se_to_rowcoldata()