R/platjam-import-metabolomics-niehs.R
import_metabolomics_niehs.Rd
Import metabolomics data from NIEHS file formats
import_metabolomics_niehs(
data_path,
shared_samples_only = TRUE,
filter_sample_type = NULL,
curation_txt = NULL,
drop_na_columns = TRUE,
drop_na_values = c(NA, "NA", "not applicable", "not specified", "none"),
simplify_singlet_columns = TRUE,
verbose = FALSE,
...
)
character
path which contains files as named in
Details.
logical
indicating whether to retain
only those samples which are shared across positive and negative
ionization data files. This step also typically removes extraneous
technical QC samples which may not have been included in both
the positive and negative ionization data.
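The effect of shared_samples_only=TRUE can be illustrated with a minimal base R sketch. The matrices and sample names below are hypothetical, and the code only mimics the idea of intersecting sample identifiers; it is not the package's internal implementation.

```r
# Hypothetical peak-area matrices: rows are metabolites, columns are samples.
pos <- matrix(1:8, nrow = 2,
   dimnames = list(c("m1", "m2"), c("S1", "S2", "S3", "QC_wash")))
neg <- matrix(1:6, nrow = 2,
   dimnames = list(c("m3", "m4"), c("S2", "S3", "S1")))

# Keep only samples present in both ionization modes, in a common order.
shared <- intersect(colnames(pos), colnames(neg))
pos_shared <- pos[, shared, drop = FALSE]
neg_shared <- neg[, shared, drop = FALSE]
```

Here the QC-only column "QC_wash" is dropped because it appears in just one of the two matrices.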
character
with optional subset of
sample types to retain. The most useful is
filter_sample_type="sample"
which will retain only biological
samples, and will drop all other samples. The commonly observed
values in the column "CORE_SampleType"
:
"sample"
: biological sample
"QC_curve"
: calibration curve sample
"blank_system"
: negative control blank sample
"<NA>"
: empty values where data provided a sample which was
not described in the associated project metadata file. This
typically occurs only for system quality control checks,
such as "SystemSuitability", "wash", "AqX_blank", and "AqX_sample".
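As a rough illustration of this filtering, the sketch below subsets a hypothetical sample table by the "CORE_SampleType" column. The data.frame and its sample_id column are invented for the example; only the column name "CORE_SampleType" and its values come from the description above.

```r
# Hypothetical sample annotation table.
sample_df <- data.frame(
   sample_id = c("S1", "S2", "QC1", "B1", "wash1"),
   CORE_SampleType = c("sample", "sample", "QC_curve", "blank_system", NA),
   stringsAsFactors = FALSE)

# Retain only the requested sample types; NA entries never match.
filter_sample_type <- "sample"
keep <- sample_df$CORE_SampleType %in% filter_sample_type
biological_df <- sample_df[keep, , drop = FALSE]
```

Using %in% rather than == means rows with a missing sample type are excluded instead of producing NA in the row index.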
data.frame
passed to curate_se_colData()
in order to include sample annotations. The default uses
identifiers from colnames(se)
for each SummarizedExperiment
object.
column headers found in the annotation metadata file.
If curation_txt
is not supplied, then values will be split into
columns by _
underscore or " "
whitespace characters.
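The fallback splitting described above might look roughly like the following base R sketch. The identifiers are hypothetical, the sketch assumes every identifier has the same number of fields, and it does not reproduce what curate_se_colData() actually does.

```r
# Hypothetical sample identifiers with mixed delimiters.
sample_ids <- c("Liver_Control_rep1", "Liver Treated rep2")

# Split on underscore or whitespace, then bind the fields into columns.
parts <- strsplit(sample_ids, "[_ ]+")
split_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
rownames(split_df) <- sample_ids
```

The resulting columns receive the generic names V1, V2, V3; supplying curation_txt is the way to obtain meaningful column names.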
logical
indicating whether to drop columns
in colData
or rowData
when all values are NA
, "NA"
,
"not applicable"
, "not specified", or "none"
, defined in drop_na_values
.
Note that columns with only one non-na value in all fields will
be removed from the colData(se)
and stored as metadata as
a single character
vector accessible via:
se@metadata$colData_values
character
vector of values considered to
be "na" when drop_na_columns=TRUE
.
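A minimal sketch of the column-dropping rule, using the default drop_na_values from the usage above. The annotation table is hypothetical and the code is an approximation of the described behavior, not the function's own implementation.

```r
drop_na_values <- c(NA, "NA", "not applicable", "not specified", "none")

# Hypothetical annotation table; 'batch' holds only "na"-like values.
annot_df <- data.frame(
   group = c("A", "B", "A"),
   batch = c("NA", "none", NA),
   note = c("not applicable", "x", "none"),
   stringsAsFactors = FALSE)

# A column is dropped only when every value is one of drop_na_values.
all_na <- vapply(annot_df, function(x) all(x %in% drop_na_values), logical(1))
annot_kept <- annot_df[, !all_na, drop = FALSE]
```

The 'note' column is retained because it contains one informative value, "x".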
logical
indicating whether to remove
colData(se)
columns with only one value, instead storing the
name and value in metadata as a character vector.
This option is enabled by default, and simplifies the resulting
colData(se)
so that it only includes columns with two or
more unique values.
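The singlet simplification can be sketched in base R as follows. The table is hypothetical, and the name colData_values is borrowed from the metadata entry mentioned above purely for illustration.

```r
# Hypothetical sample annotations; two columns hold a single value.
sample_annot <- data.frame(
   group = c("A", "B", "A"),
   instrument = c("QE1", "QE1", "QE1"),
   species = c("mouse", "mouse", "mouse"),
   stringsAsFactors = FALSE)

# Identify columns with exactly one unique value.
n_unique <- vapply(sample_annot, function(x) length(unique(x)), integer(1))
singlet <- n_unique == 1

# Store singlet names and values as a named character vector,
# and keep only the columns that vary across samples.
colData_values <- vapply(sample_annot[singlet],
   function(x) as.character(x[1]), character(1))
sample_annot_simple <- sample_annot[, !singlet, drop = FALSE]
```

No information is lost: constant annotations move to the named vector while the table keeps only columns useful for comparisons.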
logical
indicating whether to print verbose output.
additional arguments are ignored.
list
of SummarizedExperiment
objects, where the list
is defined by the type of ionization ("df_pos", "df_neg"), and the type of data ("cleaned") in the data filenames. Typically the result includes:
"df_pos_cleaned"
"df_neg_cleaned"
If only one ionization is provided, only one entry will be returned.
For each SummarizedExperiment
object:
rowData
represents metabolite annotations
colData
represents sample annotations, optionally including
annotations via a data.frame
supplied as curation_txt
.
the slot "metadata"
is a list
with the following:
isample_use
: the subset of colnames(se)
for which there
was sample metadata found in the metadata file. Some control
samples may not match the full metadata, and will be ignored
when using isample_use
.
irows_use
: all rownames(se)
for all measured metabolites.
irows_clean
: the rownames(se)
for measurements with no
annotation in the column "flag_guidance"
.
irows_flagged
: the rownames(se)
for measurements with
some non-empty annotation in the column "flag_guidance"
.
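The relationship between irows_use, irows_clean, and irows_flagged can be sketched with a hypothetical row annotation table. The sketch assumes that both empty strings and NA count as "no annotation" in "flag_guidance"; the function itself may treat these cases differently.

```r
# Hypothetical metabolite annotations with a "flag_guidance" column.
row_df <- data.frame(
   flag_guidance = c("", NA, "low_signal", ""),
   row.names = c("met1", "met2", "met3", "met4"),
   stringsAsFactors = FALSE)

# Assumption: a row is flagged when flag_guidance is neither NA nor empty.
is_flagged <- !is.na(row_df$flag_guidance) & nzchar(row_df$flag_guidance)

irows_use <- rownames(row_df)
irows_clean <- rownames(row_df)[!is_flagged]
irows_flagged <- rownames(row_df)[is_flagged]
```

Under this reading, irows_clean and irows_flagged partition irows_use, so either can be used to subset rows of the corresponding object.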
This import function is specific to NIEHS file formats produced
from their defined analysis workflow. The files typically include
"df_pos"
for positive ionization, and "df_neg"
for negative
ionization.
Optionally, when the full data processing file is present,
it will be imported alongside the cleaned data described above.
The full data processing imports detailed compound measurement
data, and is expected in one of two formats in the data_path
folder:
Files "compounds_pos.txt"
and/or "compounds_neg.txt"
, or
"1_DataProcessed.zip"
which is expected to contain
files "compounds_pos.txt"
and/or "compounds_neg.txt"
in the archive.
The "compounds" data includes important annotations for each measurement, specifically the type of numeric measurement that is supplied by the upstream software. These annotations include whether numeric values were imputed, or measured directly in each sample.
"[project_code]_NIEHS_MCF_metadata.txt"
: tab-delimited text file
which contains sample annotations.
"df_pos.datamatrix.cleaned.txt"
: Tab-delimited text file
containing peak areas.
Features are processed and cleaned by MCF for quality.
"df_pos.datamatrix.cleaned.log10.txt"
: As above, but log10 transformed.
"df_pos.datamatrix.cleaned.rowsum.txt"
: As above but using
row sum peak areas.
"df_pos.annotation.cleaned.txt"
: annotation of each measured metabolite.
"df_neg.datamatrix.cleaned.txt"
: Tab-delimited text file
containing peak areas.
Features are processed and cleaned by MCF for quality.
"df_neg.datamatrix.cleaned.log10.txt"
: As above, but log10 transformed.
"df_neg.datamatrix.cleaned.rowsum.txt"
: As above but using
row sum peak areas.
"df_neg.annotation.cleaned.txt"
: annotation of each measured metabolite.
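Before calling the import function, it may help to check which of the files listed above are present. The sketch below builds the expected "df_pos" and "df_neg" file names and tests for them in a temporary folder created only for this example; the metadata file is matched by its suffix because the project code prefix varies.

```r
# Hypothetical data folder containing a single placeholder file.
data_path <- tempfile("niehs_")
dir.create(data_path)
file.create(file.path(data_path, "df_pos.datamatrix.cleaned.txt"))

# Build the expected file names from ionization prefix and suffix.
ion <- c("df_pos", "df_neg")
suffix <- c("datamatrix.cleaned.txt",
   "datamatrix.cleaned.log10.txt",
   "datamatrix.cleaned.rowsum.txt",
   "annotation.cleaned.txt")
expected <- as.vector(outer(ion, suffix, paste, sep = "."))

# Named logical vector: which expected files exist in data_path.
found <- file.exists(file.path(data_path, expected))
names(found) <- expected

# Metadata file(s), matched by suffix only.
metadata_files <- list.files(data_path, pattern = "_NIEHS_MCF_metadata[.]txt$")
```

Inspecting found shows at a glance whether one or both ionization modes are available in the folder.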
Other jam import functions:
coverage_matrix2nmat()
,
deepTools_matrix2nmat()
,
frequency_matrix2nmat()
,
import_lipotype_csv()
,
import_nanostring_csv()
,
import_nanostring_rcc()
,
import_nanostring_rlf()
,
import_proteomics_PD()
,
import_proteomics_mascot()
,
import_salmon_quant()
,
process_metab_compounds_file()
data_path <- path.expand("~/Projects/Rider/metabolomics_jul2023/data");
se_list <- import_metabolomics_niehs(data_path);