Shrink data.frame by row groups
Usage
shrinkDataFrame(
  x,
  groupBy,
  na.rm = TRUE,
  string_func = function(x) jamba::cPasteSU(x, na.rm = TRUE),
  num_func = function(x) {
    mean(x, na.rm = TRUE)
  },
  add_string_cols = NULL,
  num_to_string_func = as.character,
  keep_na_groups = TRUE,
  include_num_reps = FALSE,
  collapse_method = 2,
  verbose = FALSE,
  ...
)
Arguments
- x
data.frame (or equivalent)
- groupBy
character vector with one of the following:
  - one or more columns in colnames(x). The values in these columns will define the row groups used.
  - character or factor with length equal to nrow(x). These values will define the row groups used.
- string_func
function, default uses jamba::cPasteSU(), used for character or factor columns. Note that string columns are handled differently than numeric columns: vectorized operations are applied across the complete set of rows in one step, rather than calling data.table on each subgroup.
- num_func
function, default function(x) mean(x, na.rm=TRUE), used for numeric columns. Note that this function is applied to each row group by data.table, and is typically very efficient for numeric values.
- add_string_cols
character vector of optional numeric column names that should be handled as if they were character columns. Default NULL.
- num_to_string_func
function used for add_string_cols when converting numeric columns to character. The default as.character() retains the full numeric value; however, it may be useful to use something like function(x) signif(x, digits=3) to limit the output to only three significant digits, or function(x) format(x, digits=3).
- keep_na_groups
logical, default TRUE, whether to convert NA values in row groups to "" so they are retained in the output.
You may want to use keep_na_groups=FALSE when there are a large number of un-annotated rows that should not be aggregated together. This situation may occur when converting a probe to a gene symbol, where a subset of probes cannot be converted to a gene symbol and instead receive NA.
- include_num_reps
logical indicating whether to add a column "num_reps" to the output, with the integer number of rows in each row group.
- collapse_method
integer, default 2, indicating the internal collapse method used. Experimental.
  - 1 collapses each numeric column independently.
  - 2 collapses each set of numeric columns that use the same numeric shrink function. When all numeric columns use the same shrink function, they are all calculated in a single step, which is typically much faster.
- verbose
logical indicating whether to print verbose output.
- ...
additional arguments are ignored.
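The effect of keep_na_groups and num_to_string_func can be illustrated with base R alone. The sketch below does not call shrinkDataFrame() and is not its internal implementation; it only assumes that NA group labels are replaced with "" before grouping, as described above. The names symbol, value, kept_groups, and fmt3 are hypothetical.

```r
# Hypothetical group labels with two un-annotated (NA) rows
symbol <- c("ACTB", NA, "ACTB", NA)
value <- c(1, 2, 3, 4)

# Base R grouping silently drops rows whose group label is NA
dropped <- tapply(value, symbol, mean)

# Replacing NA with "" (the behavior described for keep_na_groups=TRUE)
# keeps those rows, but pools them into a single "" group
kept_groups <- ifelse(is.na(symbol), "", symbol)
kept <- tapply(value, kept_groups, mean)

# A possible num_to_string_func: limit values to 3 significant digits
# before converting them to character
fmt3 <- function(x) as.character(signif(x, digits = 3))
fmt3(c(1.23456, 12.3456))
```

In this sketch the pooled "" group averages two unrelated rows, which is the kind of aggregation that keep_na_groups=FALSE is meant to avoid.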
Details
The purpose is to shrink a data.frame so that it has one row per row group.
Row groups can be defined by a single column of identifiers, or by multiple
columns. The challenge is to apply a suitable function to each column,
since columns are expected to have numeric, character, or factor
types.
The default behavior:
- numeric columns are summarized with mean(x, na.rm=TRUE), so that NA values are ignored when there are non-NA values present.
- character columns are combined using unique, sorted character strings.
  This step uses jamba::cPasteSU() where the S activates sorting using jamba::mixedSort(), and U calls unique().
  - To retain all values, remove the U and call jamba::cPasteS()
  - To skip the sort, remove the S and call jamba::cPasteU()
  - To keep all values, and skip sorting, call jamba::cPaste()
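The difference between the four variants can be sketched with base-R stand-ins. These helpers (paste_su, paste_s, paste_u, paste_0) are hypothetical and are not the jamba functions: they act on a single vector for brevity and use sort() in place of jamba::mixedSort(), so the ordering of mixed alphanumeric values may differ from the real output.

```r
# Hypothetical base-R stand-ins for the four variants described above
paste_su <- function(x) paste(sort(unique(x)), collapse = ",")  # sort + unique
paste_s  <- function(x) paste(sort(x), collapse = ",")          # sort only
paste_u  <- function(x) paste(unique(x), collapse = ",")        # unique only
paste_0  <- function(x) paste(x, collapse = ",")                # neither

probes <- c("probe3", "probe1", "probe3")
su <- paste_su(probes)  # "probe1,probe3"
s  <- paste_s(probes)   # "probe1,probe3,probe3"
u  <- paste_u(probes)   # "probe3,probe1"
p0 <- paste_0(probes)   # "probe3,probe1,probe3"

# Default numeric summary: NA is ignored when other values are present
m <- mean(c(2.3, NA, 2.5), na.rm = TRUE)  # 2.4
```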
See also
Other jamses utilities:
choose_annotation_colnames(),
contrast2comp_dev(),
fold_to_log2fold(),
intercalate(),
list2im_opt(),
log2fold_to_fold(),
make_block_arrow_polygon(),
mark_stat_hits(),
matrix_normalize(),
point_handedness(),
point_slope_intercept(),
shortest_unique_abbreviation(),
shrink_df(),
shrink_matrix(),
sort_samples(),
strsplitOrdered(),
sub_split_vector(),
update_function_params(),
update_list_elements()
Examples
testdf <- data.frame(check.names=FALSE,
   SYMBOL=rep(c("ACTB", "GAPDH", "PPIA"), c(2, 3, 1)),
   `logFC B-A`=c(1.4, 1.4, 2.3, NA, 2.5, 5.1),
   probe=paste0("probe", 1:6))
shrink_df(testdf, by="SYMBOL")
#> SYMBOL logFC B-A probe
#> ACTB ACTB 1.4 probe1,probe2
#> GAPDH GAPDH 2.4 probe3,probe4,probe5
#> PPIA PPIA 5.1 probe6
# mean() without na.rm=TRUE returns NA for any group that contains NA
shrink_df(testdf, by="SYMBOL", num_func=mean)
#> SYMBOL logFC B-A probe
#> ACTB ACTB 1.4 probe1,probe2
#> GAPDH GAPDH NA probe3,probe4,probe5
#> PPIA PPIA 5.1 probe6
# handle the numeric column as a string column
shrink_df(testdf, by="SYMBOL", add_string_cols="logFC B-A")
#> SYMBOL logFC B-A probe
#> ACTB ACTB 1.4 probe1,probe2
#> GAPDH GAPDH 2.3,2.5 probe3,probe4,probe5
#> PPIA PPIA 5.1 probe6
# build a taller test data.frame with 10,000 copies, each with unique SYMBOL values
testdftall <- do.call(rbind, lapply(1:10000, function(i){
   idf <- testdf
   idf$SYMBOL <- paste0(idf$SYMBOL, "_", i)
   idf
}))
shrunk_tall <- shrink_df(testdftall,
   by="SYMBOL")
head(shrunk_tall, 6)
#> SYMBOL logFC B-A probe
#> ACTB_1 ACTB_1 1.4 probe1,probe2
#> GAPDH_1 GAPDH_1 2.4 probe3,probe4,probe5
#> PPIA_1 PPIA_1 5.1 probe6
#> ACTB_2 ACTB_2 1.4 probe1,probe2
#> GAPDH_2 GAPDH_2 2.4 probe3,probe4,probe5
#> PPIA_2 PPIA_2 5.1 probe6
shrunk_tall2 <- jamses::shrinkDataFrame(testdftall,
   groupBy="SYMBOL")
head(shrunk_tall2, 6)
#> SYMBOL logFC B-A probe
#> ACTB_1 ACTB_1 1.4 probe1,probe2
#> GAPDH_1 GAPDH_1 2.4 probe3,probe4,probe5
#> PPIA_1 PPIA_1 5.1 probe6
#> ACTB_2 ACTB_2 1.4 probe1,probe2
#> GAPDH_2 GAPDH_2 2.4 probe3,probe4,probe5
#> PPIA_2 PPIA_2 5.1 probe6