Curate data.frame into a data.frame
curateDFtoDF(
  x,
  curationL2 = NULL,
  matchWholeString = TRUE,
  trimWhitespace = TRUE,
  whitespace = "_ ",
  expandWhitespace = TRUE,
  keepAllColnames = TRUE,
  verbose = TRUE,
  ...
)
x | data.frame |
---|---|
curationL2 | list with curation rules as described below, or a character vector of YAML files, which will be imported into a list format using |
matchWholeString, trimWhitespace, whitespace, expandWhitespace | arguments passed to |
keepAllColnames | logical indicating whether to keep all colnames from |
verbose | logical indicating whether to print verbose output |
... | additional arguments are passed to |
This function takes a data.frame as input, where one or more columns are expected to be used in data curation to create another data.frame. This situation is useful when the final desired data.frame depends upon values in more than one column of the input data.frame.
Specifically, this function is a wrapper around curateVtoDF().
Typically, curationL2 is derived from YAML formatted files, and loaded into a list with this type of setup: curationL2 <- yaml::yaml.load_file("curation.yaml").
The structure of curationL2:
curationL2 is a list object whose names(curationL2) are values in colnames(x), and represent the columns of data used as input.
Each list element in curationL2 is also a list, whose names represent colnames to create or update in the output data.frame.
These inner lists contain character vectors of length 2, each holding a regular expression substitution pattern (see base::gsub) followed by a replacement pattern.
The list is processed in order, and names can be repeated as necessary to apply the proper substitution patterns in the order required. New columns created during the curation may also be used in later curation steps.
Example curation.yaml in YAML format. Note that the leading spaces are required by the format.
From_ColnameA:
  To_ColnameC:
  - - patternA
    - replacementA
  - - patternB
    - replacementB
  To_ColnameD:
  - - patternC
    - replacementC
  - - patternD
    - replacementD
From_ColnameB:
  To_ColnameE:
  - - patternE
    - replacementE
  - - patternF
    - replacementF
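To make the nesting concrete, the YAML layout above corresponds to a nested R list along the following lines. This is a hand-built sketch based on the structure described here; the exact object returned by yaml::yaml.load_file() may differ in minor details, and the pattern and replacement strings are the same placeholders used in the YAML.

```r
# Hand-built equivalent of the example curation.yaml (a sketch, not loaded from a file).
# Outer names: "from" columns. Inner names: "to" columns.
# Each rule is a character vector of length 2: c(pattern, replacement).
curationL2 <- list(
  From_ColnameA = list(
    To_ColnameC = list(
      c("patternA", "replacementA"),
      c("patternB", "replacementB")
    ),
    To_ColnameD = list(
      c("patternC", "replacementC"),
      c("patternD", "replacementD")
    )
  ),
  From_ColnameB = list(
    To_ColnameE = list(
      c("patternE", "replacementE"),
      c("patternF", "replacementF")
    )
  )
)

# Inspect the nesting
str(curationL2)
```

Building the list directly can be convenient for a few rules in a script, while a YAML file is easier to maintain and share as the rule set grows.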
When the rule creates a colname already present in colnames(x), then only values specifically matched by the substitution patterns are modified. For example, this technique can be used to modify the group assignment of a Sample_ID:
Sample_ID:
  Group:
  - - Sample1234
    - WildType
The rules above will match "Sample1234" in the "Sample_ID" column of x, and assign "WildType" to the "Group" column only for matching entries.
In addition to values in colnames(x), the "from" value may also be "rownames", which will cause the curation rules to act upon values in rownames(x) instead of values in a specific column of x.
Note that if a "to" column does not already exist, then all values in the "from" column which do not match any substitution pattern will be used to fill the remainder of the "to" column. Once the "to" column exists, then only entries with a matching substitution pattern are replaced using the replacement pattern.
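The fill behavior described above can be illustrated with a small base R sketch. The helper apply_rule() and the sample data are hypothetical and written only for this illustration; they are not part of the package and do not reproduce its internals. The sketch assumes the pattern is expanded to span the whole string, as described under the additional notes.

```r
# Hypothetical helper (not a package function): apply one (pattern, replacement)
# rule from column `from` to column `to` of data.frame `x`.
apply_rule <- function(x, from, to, pattern, replacement) {
  values <- as.character(x[[from]])
  # Expand the pattern so it spans the whole input string
  full_pattern <- paste0("^.*(", pattern, ").*$")
  matched <- grepl(full_pattern, values)
  if (!to %in% colnames(x)) {
    # New "to" column: start from the "from" values, so unmatched entries carry over
    x[[to]] <- values
  }
  # Only matched entries receive the replacement
  x[[to]][matched] <- sub(full_pattern, replacement, values[matched])
  x
}

# Made-up sample data for illustration
x <- data.frame(
  Sample_ID = c("Sample1233", "Sample1234", "Sample1235"),
  Group = c("Mutant", "Mutant", "WildType"),
  stringsAsFactors = FALSE
)

# "Group" already exists: only the matching row is modified
x <- apply_rule(x, "Sample_ID", "Group", "Sample1234", "WildType")

# "Label" does not exist yet: unmatched rows are filled from "Sample_ID"
x <- apply_rule(x, "Sample_ID", "Label", "Sample1234", "Control")

print(x)
```

In this sketch the second row of Group changes to "WildType" while the other rows keep their values, and the new Label column holds "Control" for the matched row and the original Sample_ID elsewhere.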
For example, for NanoString data, the column "CartridgeWell" can be derived from rownames(x), after which the new column "CartridgeWell" can be used in subsequent curation steps.
Additional notes:
The substitution pattern is automatically expanded to include the whole input string, if not already present. For example, supplying "WT" will match "^.*(WT).*$". However, if the substitution pattern is "^.*(WT).*$" then it will not be expanded.
When the substitution pattern is expanded, the string is also enclosed in parentheses "()", which means the replacement can use "\\1" to use the successfully matched pattern as the output string. For example, if "WT" and "Mutant" are always valid genotypes, then it would be sufficient to define substitution pattern "WT|Mutant" and replacement pattern "\\1".
When the substitution pattern is expanded, and the string is enclosed in parentheses, any parentheses in the substitution pattern are therefore one level deeper. For example, "file([A-Z]+)" will be expanded to "^.*(file([A-Z]+)).*$". See the example below, where the replacement pattern uses "\\2" to use only the internal parentheses.
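The effect of pattern expansion on backreferences can be checked directly with base::sub(). The helper expand_pattern() below is a hypothetical stand-in written for this illustration. It assumes a pattern counts as already expanded when it starts with "^" and ends with "$", which may not be the exact test the package applies.

```r
# Hypothetical helper (not a package function): expand a pattern to span the
# whole string and wrap it in parentheses, unless it already looks anchored.
expand_pattern <- function(pattern) {
  # Assumption: a pattern starting with "^" and ending with "$" is left as-is
  if (grepl("^\\^.*\\$$", pattern)) {
    return(pattern)
  }
  paste0("^.*(", pattern, ").*$")
}

values <- c("fileOSNCJ_WT_Veh", "fileCHGJI_Mut_EtOH")

# "\\1" returns whatever the supplied pattern matched
treatment <- sub(expand_pattern("Veh|EtOH"), "\\1", values)

# The user's own parentheses are one level deeper after expansion:
# "\\1" is the whole match, "\\2" is the inner group
file_full <- sub(expand_pattern("file([A-Z]+)"), "\\1", values)
file_stem <- sub(expand_pattern("file([A-Z]+)"), "\\2", values)

print(expand_pattern("file([A-Z]+)"))
print(data.frame(treatment, file_full, file_stem))
```

Because the wrapper group is always "\\1", any group written in the original pattern shifts by one, which is why the FileStem rule in the example uses "\\2".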
Other jam design functions: curateVtoDF(), groups2contrasts()
set.seed(123);
df <- data.frame(filename=paste(
   paste0("file",
      sapply(1:5, function(i) {
         paste(sample(LETTERS, 5), collapse="")
      })),
   rep(c("WT", "Mut"), each=3),
   rep(c("Veh","EtOH"), 3),
   sep="_"));
df;
#>             filename
#> 1   fileOSNCJ_WT_Veh
#> 2  fileRVKET_WT_EtOH
#> 3   fileNVESI_WT_Veh
#> 4 fileCHGJI_Mut_EtOH
#> 5  fileSDNQK_Mut_Veh
#> 6 fileOSNCJ_Mut_EtOH

# Note a couple ways of accomplishing similar results:
# Genotype matches "WT|wildtype" and replaces with "WT",
# then matches "Mut|mutant" and replaces with "Mut"
#
# Treatment matches "Veh|EtOH" and simply replaces with
# whatever was matched
curationYaml <- c(
"filename:
  Genotype:
  - - WT|wildtype
    - WT
  - - Mut|mutant
    - Mut
  Treatment:
  - - Veh|EtOH
    - \\1
  File:
  - - file([A-Z]+)
    - \\1
  FileStem:
  - - file([A-Z]+)
    - \\2");

# print the curation.yaml to show its structure
cat(curationYaml)
#> filename:
#>   Genotype:
#>   - - WT|wildtype
#>     - WT
#>   - - Mut|mutant
#>     - Mut
#>   Treatment:
#>   - - Veh|EtOH
#>     - \1
#>   File:
#>   - - file([A-Z]+)
#>     - \1
#>   FileStem:
#>   - - file([A-Z]+)
#>     - \2

#> ## (19:08:21) 09Mar2021: curateDFtoDF(): Applying curation to column:filename
#> ## (19:08:21) 09Mar2021: curateVtoDF(): Creating column:Genotype
#> ## (19:08:21) 09Mar2021: curateVtoDF(): Creating column:Treatment
#> ## (19:08:21) 09Mar2021: curateVtoDF(): Creating column:File
#> ## (19:08:21) 09Mar2021: curateVtoDF(): Creating column:FileStem
#>   Genotype Treatment      File FileStem           filename
#> 1       WT       Veh fileOSNCJ    OSNCJ   fileOSNCJ_WT_Veh
#> 2       WT      EtOH fileRVKET    RVKET  fileRVKET_WT_EtOH
#> 3       WT       Veh fileNVESI    NVESI   fileNVESI_WT_Veh
#> 4      Mut      EtOH fileCHGJI    CHGJI fileCHGJI_Mut_EtOH
#> 5      Mut       Veh fileSDNQK    SDNQK  fileSDNQK_Mut_Veh
#> 6      Mut      EtOH fileOSNCJ    OSNCJ fileOSNCJ_Mut_EtOH