Shrink data.frame by row groups
Usage
shrinkDataFrame(
  x,
  groupBy,
  na.rm = TRUE,
  string_func = function(x) jamba::cPasteSU(x, na.rm = TRUE),
  num_func = function(x) {
    mean(x, na.rm = TRUE)
  },
  add_string_cols = NULL,
  num_to_string_func = as.character,
  keep_na_groups = TRUE,
  include_num_reps = FALSE,
  collapse_method = 2,
  verbose = FALSE,
  ...
)
Arguments
- x
  data.frame (or equivalent)
- groupBy
  character vector with one of the following:
  - one or more columns in colnames(x). The values in these columns will define the row groups used.
  - character or factor with length equal to nrow(x). These values will define the row groups used.
- string_func
  function, default uses jamba::cPasteSU(), used for character or factor columns. Note that string columns are handled differently from numeric columns: vectorized operations are applied across the complete set of rows in one step, rather than calling data.table on each subgroup.
- num_func
  function, default function(x) mean(x, na.rm=TRUE), used for numeric columns. Note that this function is applied to each row group by data.table, and is typically very efficient for numeric values.
- add_string_cols
  character with optional numeric columns that should be handled as if they were character columns. Default NULL.
- num_to_string_func
  function used for add_string_cols when converting numeric columns to character. Default as.character() retains the full numeric value; however, it may be useful to use something like function(x) signif(x, digits=3) to limit the output to only three significant digits, or function(x) format(x, digits=3).
- keep_na_groups
  logical, default TRUE, whether to convert NA values in row groups to "" so they are retained in the output.
  You may want to use keep_na_groups=FALSE when there are a large number of un-annotated rows that should not be aggregated together. This situation may occur when converting a probe to a gene symbol, where a subset of probes cannot be converted to a gene symbol and instead receive NA.
- include_num_reps
  logical indicating whether to add a column "num_reps" to the output, with the integer number of rows in each row group.
- collapse_method
  integer, default 2, indicating the internal collapse method used. Experimental.
  - 1 collapses each numeric column independently.
  - 2 collapses each set of numeric columns that use the same numeric shrink function. When all numeric columns use the same shrink function, they are all calculated in a single step, which is typically much faster.
- verbose
  logical indicating whether to print verbose output.
- ...
  additional arguments are ignored.
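The effect described for keep_na_groups can be illustrated with a small base R sketch. It does not call shrinkDataFrame() and is not the package's internal code; the values and symbols vectors are hypothetical data invented for the illustration. It only shows why replacing NA with "" keeps un-annotated rows, and why those rows then end up pooled into a single group.

```r
# Hypothetical probe-level values; two probes have no gene symbol (NA).
values <- c(1, 2, 3, 4, 5)
symbols <- c("ACTB", NA, "ACTB", NA, "GAPDH")

# Base R grouping with tapply() drops rows whose group is NA.
dropped <- tapply(values, symbols, mean)
names(dropped)
# "ACTB" "GAPDH"

# Replacing NA with "" retains those rows,
# but all of them are pooled into one "" group.
symbols_kept <- ifelse(is.na(symbols), "", symbols)
kept <- tapply(values, symbols_kept, mean)
names(kept)
# "" "ACTB" "GAPDH"
```

In this sketch the "" group averages two unrelated probes, which is the kind of pooling that keep_na_groups=FALSE is meant to avoid.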
Details
The purpose is to shrink a data.frame to have one row per row grouping.
The row grouping can use a single column of identifiers, or multiple columns. The challenge is to apply a relevant function to each column, expecting there will be columns with numeric, character, or factor types.
The default behavior:
- numeric columns are summarized with mean(x, na.rm=TRUE), so that NA values are ignored when there are non-NA values present.
- character columns are combined using unique, sorted character strings.
  This step uses jamba::cPasteSU() where the S activates sorting using jamba::mixedSort(), and U calls unique().
  - To retain all values, remove the U and call jamba::cPasteS().
  - To skip the sort, remove the S and call jamba::cPasteU().
  - To keep all values, and skip sorting, call jamba::cPaste().
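The default behavior can be approximated with base R alone, which may help clarify what each column type is expected to produce. This is a minimal sketch, not the function's implementation: it uses tapply() instead of data.table, and plain sort() instead of jamba::mixedSort(), so the ordering of mixed alphanumeric strings may differ from the real output. The name sketch_df is introduced only for this illustration.

```r
# Same test data as in the Examples section.
testdf <- data.frame(check.names=FALSE, stringsAsFactors=FALSE,
   SYMBOL=rep(c("ACTB", "GAPDH", "PPIA"), c(2, 3, 1)),
   `logFC B-A`=c(1.4, 1.4, 2.3, NA, 2.5, 5.1),
   probe=paste0("probe", 1:6))

# numeric column: mean per row group, ignoring NA values
num_summary <- tapply(testdf[["logFC B-A"]], testdf$SYMBOL,
   function(x) mean(x, na.rm=TRUE))

# character column: unique, sorted values pasted with a comma
# (plain sort() stands in for jamba::mixedSort() here)
chr_summary <- tapply(testdf$probe, testdf$SYMBOL,
   function(x) paste(sort(unique(x)), collapse=","))

# assemble one row per group
sketch_df <- data.frame(check.names=FALSE, stringsAsFactors=FALSE,
   SYMBOL=names(num_summary),
   `logFC B-A`=as.vector(num_summary),
   probe=as.vector(chr_summary))
sketch_df
```

The GAPDH group has one NA among its three values, so its mean is taken over 2.3 and 2.5 only.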
See also
Other jamses utilities: choose_annotation_colnames(), contrast2comp_dev(), fold_to_log2fold(), intercalate(), list2im_opt(), log2fold_to_fold(), make_block_arrow_polygon(), mark_stat_hits(), matrix_normalize(), point_handedness(), point_slope_intercept(), shortest_unique_abbreviation(), shrink_df(), shrink_matrix(), sort_samples(), strsplitOrdered(), sub_split_vector(), update_function_params(), update_list_elements()
Examples
testdf <- data.frame(check.names=FALSE,
   SYMBOL=rep(c("ACTB", "GAPDH", "PPIA"), c(2, 3, 1)),
   `logFC B-A`=c(1.4, 1.4, 2.3, NA, 2.5, 5.1),
   probe=paste0("probe", 1:6))
shrink_df(testdf, by="SYMBOL")
#>       SYMBOL logFC B-A                probe
#> ACTB    ACTB       1.4        probe1,probe2
#> GAPDH  GAPDH       2.4 probe3,probe4,probe5
#> PPIA    PPIA       5.1               probe6
shrink_df(testdf, by="SYMBOL", num_func=mean)
#>       SYMBOL logFC B-A                probe
#> ACTB    ACTB       1.4        probe1,probe2
#> GAPDH  GAPDH        NA probe3,probe4,probe5
#> PPIA    PPIA       5.1               probe6
shrink_df(testdf, by="SYMBOL", add_string_cols="logFC B-A")
#>       SYMBOL logFC B-A                probe
#> ACTB    ACTB       1.4        probe1,probe2
#> GAPDH  GAPDH   2.3,2.5 probe3,probe4,probe5
#> PPIA    PPIA       5.1               probe6
testdftall <- do.call(rbind, lapply(1:10000, function(i){
   idf <- testdf;
   idf$SYMBOL <- paste0(idf$SYMBOL, "_", i);
   idf;
}))
shrunk_tall <- shrink_df(testdftall,
   by="SYMBOL")
head(shrunk_tall, 6)
#>          SYMBOL logFC B-A                probe
#> ACTB_1   ACTB_1       1.4        probe1,probe2
#> GAPDH_1 GAPDH_1       2.4 probe3,probe4,probe5
#> PPIA_1   PPIA_1       5.1               probe6
#> ACTB_2   ACTB_2       1.4        probe1,probe2
#> GAPDH_2 GAPDH_2       2.4 probe3,probe4,probe5
#> PPIA_2   PPIA_2       5.1               probe6
shrunk_tall2 <- jamses::shrinkDataFrame(testdftall,
   groupBy="SYMBOL")
head(shrunk_tall2, 6)
#>          SYMBOL logFC B-A                probe
#> ACTB_1   ACTB_1       1.4        probe1,probe2
#> GAPDH_1 GAPDH_1       2.4 probe3,probe4,probe5
#> PPIA_1   PPIA_1       5.1               probe6
#> ACTB_2   ACTB_2       1.4        probe1,probe2
#> GAPDH_2 GAPDH_2       2.4 probe3,probe4,probe5
#> PPIA_2   PPIA_2       5.1               probe6
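When numeric columns are passed through add_string_cols, the choice of num_to_string_func determines how each value is rendered before pasting. The following base R sketch compares the three candidates named in the Arguments section on a single hypothetical value; it does not call shrinkDataFrame(), and the variable names are introduced only for illustration.

```r
# Hypothetical numeric value with more digits than needed in a label.
x <- 2.3456789

# Default conversion keeps the full value.
full_string <- as.character(x)
full_string
# "2.3456789"

# signif() limits significant digits but still returns a numeric value.
rounded_num <- signif(x, digits=3)
is.numeric(rounded_num)
# TRUE

# format() limits significant digits and returns a character string.
short_string <- format(x, digits=3)
short_string
# "2.35"
```

Note that format() applied to a vector pads all elements to a common width, so the result for several values at once may contain leading spaces.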