Import metabolomics data from NIEHS file formats

import_metabolomics_niehs(
  data_path,
  shared_samples_only = TRUE,
  filter_sample_type = NULL,
  curation_txt = NULL,
  drop_na_columns = TRUE,
  drop_na_values = c(NA, "NA", "not applicable", "not specified", "none"),
  simplify_singlet_columns = TRUE,
  verbose = FALSE,
  ...
)



character path which contains files as named in Details.


logical indicating whether to retain only those samples which are shared across positive and negative ionization data files. This step also typically removes extraneous technical QC samples which may not have been included in both the positive and negative ionization data.
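
As a rough illustration of what "shared across positive and negative ionization" means, the sketch below keeps only the sample columns present in both of two toy matrices. The matrices and sample names are hypothetical, and the function's internal logic may differ.

```r
# Toy peak-area matrices; metabolite and sample names are hypothetical.
pos <- matrix(1:8, nrow = 2,
  dimnames = list(c("m1", "m2"), c("S1", "S2", "S3", "QC_pos")))
neg <- matrix(1:6, nrow = 2,
  dimnames = list(c("m3", "m4"), c("S2", "S3", "S1")))

# Samples present in both ionization modes, in the order found in 'pos'.
shared <- intersect(colnames(pos), colnames(neg))

# Subset both matrices to the shared samples, in matching column order.
pos_shared <- pos[, shared, drop = FALSE]
neg_shared <- neg[, shared, drop = FALSE]
```

Here "QC_pos" appears only in the positive data, so it is dropped from the result.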


character with optional subset of sample types to retain. The most useful is filter_sample_type="sample", which will retain only biological samples and drop all other samples. The commonly observed values in the column "CORE_SampleType" are:

  • "sample": biological sample

  • "QC_curve": calibration curve sample

  • "blank_system": negative control blank sample

  • "<NA>": empty values, where the data included a sample that was not described in the associated project metadata file. This typically occurs only for system quality control checks, such as "SystemSuitability", "wash", "AqX_blank", and "AqX_sample".
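
The effect of filtering by sample type can be sketched in base R on a toy annotation table. The sample identifiers below are hypothetical, and this is not the function's own implementation.

```r
# Toy sample annotations; values follow the sample types listed above.
sample_df <- data.frame(
  sample_id = c("S1", "S2", "curve1", "blank1", "wash1"),
  CORE_SampleType = c("sample", "sample", "QC_curve", "blank_system", NA),
  stringsAsFactors = FALSE)

# Keep only biological samples; %in% treats NA as a non-match,
# so rows with no sample type are dropped as well.
keep_types <- "sample"
sample_df_use <- sample_df[sample_df$CORE_SampleType %in% keep_types, , drop = FALSE]
```

Using %in% rather than == avoids NA rows appearing in the subset when the sample type is missing.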


data.frame passed to curate_se_colData() in order to include sample annotations. The default uses identifiers from colnames(se) for each SummarizedExperiment object, together with column headers found in the annotation metadata file. If curation_txt is not supplied, the identifiers are split into columns at "_" underscore or " " whitespace characters.
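
The fallback of splitting identifiers at underscore or whitespace can be approximated as follows. This is a minimal sketch with hypothetical sample identifiers, assuming every identifier has the same number of fields. The actual splitting is handled inside the function and may differ.

```r
# Hypothetical sample identifiers with fields separated by "_" or " ".
sample_ids <- c("ProjA_Liver_Ctrl 1", "ProjA_Liver_Trt 1", "ProjA_Kidney_Ctrl 2")

# Split on one or more underscore or whitespace characters.
fields <- strsplit(sample_ids, "[_[:space:]]+")

# Bind into a data.frame with one column per field (four fields each here).
field_df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
rownames(field_df) <- sample_ids
```

The resulting columns receive default names (V1, V2, and so on), which is one reason a curated annotation table is usually preferable.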


logical indicating whether to drop columns in colData or rowData when all values match an entry in drop_na_values, by default NA, "NA", "not applicable", "not specified", or "none".

Note that columns with only one non-NA value across all samples will be removed from colData(se) and stored in the metadata as a single character vector, accessible via: se@metadata$colData_values


character vector of values considered to be "na" when drop_na_columns=TRUE.
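
A minimal base R sketch of dropping columns whose values are all considered "na" is shown below. The annotation table and its column names are hypothetical, and the function's own logic may differ in detail.

```r
drop_na_values <- c(NA, "NA", "not applicable", "not specified", "none")

# Toy annotation table; column names are hypothetical.
annot_df <- data.frame(
  group = c("A", "B", "A"),
  dose = c("none", "not applicable", NA),
  batch = c("1", "NA", "2"),
  stringsAsFactors = FALSE)

# A column is dropped when every value matches an entry in drop_na_values.
all_na <- vapply(annot_df, function(x) all(x %in% drop_na_values), logical(1))
annot_kept <- annot_df[, !all_na, drop = FALSE]
```

The "batch" column is kept because only one of its values is "NA", while "dose" is dropped entirely.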


logical indicating whether to remove colData(se) columns with only one value, instead storing the name and value in metadata as a character vector. This option is enabled by default, and simplifies the resulting colData(se) so that it only includes columns with two or more unique values.
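
The idea of moving single-valued columns into a named character vector can be sketched as follows. The table, the "instrument" column, and its value are hypothetical, and this is an illustration rather than the function's implementation.

```r
# Toy sample annotations; 'instrument' has one value across all samples.
coldata_df <- data.frame(
  group = c("A", "B", "A"),
  instrument = c("QE", "QE", "QE"),
  stringsAsFactors = FALSE)

# Identify columns with exactly one unique value.
is_singlet <- vapply(coldata_df, function(x) length(unique(x)) == 1, logical(1))

# Store singlet values as a named character vector, then drop those columns.
colData_values <- vapply(coldata_df[is_singlet],
  function(x) as.character(x[1]), character(1))
coldata_simple <- coldata_df[, !is_singlet, drop = FALSE]
```

After this step, coldata_simple contains only columns with two or more unique values.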


logical indicating whether to print verbose output.


additional arguments are ignored.


list of SummarizedExperiment objects, where each list name is defined by the type of ionization ("df_pos", "df_neg") and the type of data ("cleaned") in the data filenames. Typically the result includes:

  • "df_pos_cleaned"

  • "df_neg_cleaned"

If only one ionization is provided, only one entry will be returned.

For each SummarizedExperiment object:

  • rowData represents metabolite annotations

  • colData represents sample annotations, optionally including annotations via a data.frame supplied as curation_txt.

  • the slot "metadata" is a list with the following:

    • isample_use: the subset of colnames(se) for which sample metadata was found in the metadata file. Some control samples may not match the full metadata, and will be excluded when using this subset.

    • irows_use: all rownames(se) for all measured metabolites.

    • irows_clean: the rownames(se) for measurements with no annotation in the column "flag_guidance".

    • irows_flagged: the rownames(se) for measurements with some non-empty annotation in the column "flag_guidance".
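
The relationship between the row subsets can be sketched on a toy annotation table. The metabolite names and the flag text are hypothetical, and treating both NA and empty strings as "no annotation" is an assumption.

```r
# Toy metabolite annotations; only 'flag_guidance' is named in the text above.
row_df <- data.frame(
  flag_guidance = c("", "low_intensity", NA, ""),
  row.names = c("met1", "met2", "met3", "met4"),
  stringsAsFactors = FALSE)

# Assumption: NA and empty strings both count as "no annotation".
is_flagged <- !is.na(row_df$flag_guidance) & nzchar(row_df$flag_guidance)

irows_use <- rownames(row_df)
irows_clean <- irows_use[!is_flagged]
irows_flagged <- irows_use[is_flagged]
```

Under these assumptions, irows_clean and irows_flagged together partition irows_use.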


This import function is specific to NIEHS file formats produced from their defined analysis workflow. The files typically include "df_pos" for positive ionization, and "df_neg" for negative ionization.

Optionally, when the full data processing file is present, it will be imported alongside the cleaned data described above. The full data processing imports detailed compound measurement data, and is expected in one of two formats in the data_path folder:

  1. Files "compounds_pos.txt" and/or "compounds_neg.txt", or

  2. "" which is expected to contain files "compounds_pos.txt" and/or "compounds_neg.txt" in the archive.
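
Before importing, one might check whether the plain-text compounds files from option 1 are present. The sketch below uses a temporary folder as a stand-in for data_path and does not cover the archive form in option 2.

```r
# Use a temporary folder as a stand-in for data_path.
data_path <- file.path(tempdir(), "niehs_compounds_demo")
dir.create(data_path, showWarnings = FALSE)
invisible(file.create(file.path(data_path, "compounds_pos.txt")))

# Check which of the two expected compounds files are present.
compound_files <- file.path(data_path,
  c("compounds_pos.txt", "compounds_neg.txt"))
found <- file.exists(compound_files)
names(found) <- basename(compound_files)
```

In this demo only "compounds_pos.txt" exists, which corresponds to a positive-ionization-only dataset.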

The "compounds" data includes important annotations for each measurement, specifically the type of numeric measurement that is supplied by the upstream software. These annotations include whether numeric values were imputed, or measured directly in each sample.

Sample Metadata

  • "[project_code]_NIEHS_MCF_metadata.txt": tab-delimited text file which contains sample annotations.

Positive Ionization Files

  • "df_pos.datamatrix.cleaned.txt": Tab-delimited text file containing peak areas. Features are processed and cleaned by MCF for quality.

  • "df_pos.datamatrix.cleaned.log10.txt": As above, but log10 transformed.

  • "df_pos.datamatrix.cleaned.rowsum.txt": As above but using row sum peak areas.

  • "df_pos.annotation.cleaned.txt": annotation of each measured metabolite.

Negative Ionization Files

  • "df_neg.datamatrix.cleaned.txt": Tab-delimited text file containing peak areas. Features are processed and cleaned by MCF for quality.

  • "df_neg.datamatrix.cleaned.log10.txt": As above, but log10 transformed.

  • "df_neg.datamatrix.cleaned.rowsum.txt": As above but using row sum peak areas.

  • "df_neg.annotation.cleaned.txt": annotation of each measured metabolite.
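
The file names listed above follow a regular pattern, so a quick pre-flight check can be sketched as below. The demo folder is a temporary stand-in for data_path, and whether the function requires every one of these files is not stated here, so missing files are only reported.

```r
# Build the file names listed above from ionization prefix and suffix.
ion <- c("df_pos", "df_neg")
suffix <- c("datamatrix.cleaned.txt", "datamatrix.cleaned.log10.txt",
  "datamatrix.cleaned.rowsum.txt", "annotation.cleaned.txt")
expected <- as.vector(outer(ion, suffix, paste, sep = "."))

# Temporary stand-in for data_path, holding one of the expected files.
demo_path <- file.path(tempdir(), "niehs_files_demo")
dir.create(demo_path, showWarnings = FALSE)
invisible(file.create(file.path(demo_path, "df_pos.annotation.cleaned.txt")))

# Report which expected files are absent from the folder.
missing_files <- expected[!file.exists(file.path(demo_path, expected))]
```

The project-specific metadata file is omitted from this check because its name depends on the project code.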


# folder containing the NIEHS data files named in Details
data_path <- path.expand("~/Projects/Rider/metabolomics_jul2023/data")
se_list <- import_metabolomics_niehs(data_path)