Import Salmon quant.sf files to SummarizedExperiment

import_salmon_quant(
  salmonOut_paths,
  import_types = c("tx", "gene", "gene_body", "gene_tx"),
  gtf = NULL,
  tx2gene = NULL,
  curation_txt = NULL,
  tx_colname = "transcript_id",
  gene_colname = "gene_name",
  gene_body_colname = "transcript_type",
  geneFeatureType = "exon",
  txFeatureType = "exon",
  countsFromAbundance = "lengthScaledTPM",
  gene_body_ids = NULL,
  trim_tx_from = c("[(][-+][)]"),
  trim_tx_to = c(""),
  verbose = FALSE,
  ...
)

Arguments

salmonOut_paths

character vector of paths, one per folder, where each folder contains the "quant.sf" output file from Salmon.

import_types

character indicating which type or types of data to return. Note that the distinction between gene and gene_body is only relevant when there are transcript entries defined with transcript_type="gene_body". These entries specifically represent unspliced transcribed regions for a gene locus, and only for multi-exon genes.

  • tx: transcript quantitation, direct import of quant.sf files.

  • gene: gene quantitation after calling tximport::summarizeToGene(), excluding transcript_type="gene_body".

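  • gene_body: gene quantitation after calling tximport::summarizeToGene(), including transcript_type="gene_body".

  • gene_tx: gene quantitation where proper transcripts are summarized to the gene level, and transcript_type="gene_body" entries are represented separately.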

gtf

character path to a GTF file, used only when tx2gene is not supplied. When used, splicejam::makeTx2geneFromGtf() is called to create a data.frame object tx2gene.

tx2gene

character path to a file, or data.frame with at least two columns matching tx_colname and gene_colname below. When supplied, the gtf argument is ignored, unless tx2gene is a file path that is not accessible, or is not a data.frame.

curation_txt

data.frame whose first column should match the sample names (the column headers of the imported data), and whose subsequent columns contain associated sample annotations. If curation_txt is not supplied, the sample names are split into columns by "_" underscore or " " whitespace characters.
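The fallback splitting described for curation_txt can be pictured with a short base R sketch. The sample names and the annotation column names below are hypothetical, and the function's internal implementation may differ from this.

```r
# Hypothetical sample names, for illustration only
sample_names <- c("WT_ctrl_rep1", "WT_ctrl_rep2", "KO_treated_rep1")

# Split each name on "_" or whitespace; each piece becomes one annotation column
parts <- strsplit(sample_names, "[_ ]+")
sample_df <- data.frame(
  do.call(rbind, parts),
  row.names = sample_names
)

# Column names are illustrative; they are not defined by the function
colnames(sample_df) <- c("genotype", "treatment", "replicate")
sample_df
```

Supplying curation_txt explicitly avoids relying on a consistent naming scheme across samples.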

tx_colname, gene_colname

character strings indicating colnames in tx2gene that should be used.

  • tx_colname represents the unique identifier for each transcript, usually "transcript_id".

  • gene_colname represents a gene label associated with gene summarized expression values, typically "gene_name".

geneFeatureType, txFeatureType

character arguments passed to splicejam::makeTx2geneFromGtf() only when supplying argument gtf with a path to a GTF file.

countsFromAbundance

character string passed to tximport::summarizeToGene() to define the method for generating estimated counts from abundance.

gene_body_ids

optional character vector with specific row identifiers that should be considered transcript_type="gene_body" entries, relevant to argument import_types above. When gene_body_ids is defined, these entries are used directly without using tx2gene. When gene_body_ids is not defined, tx2gene$transcript_type is used if present. If that column is not present, or does not contain any entries with "gene_body", then all transcripts are used for import_types="gene", and import_types="gene_body" is not valid and therefore is not returned.
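The order of precedence described for gene_body_ids can be sketched in base R as follows. The helper name resolve_gene_body_ids() and the toy tx2gene table are hypothetical, written only to mirror the description, and are not part of the package.

```r
# Hypothetical tx2gene table with one unspliced "gene_body" entry for geneA
tx2gene <- data.frame(
  transcript_id = c("tx1", "tx2", "geneA_body", "tx3"),
  gene_name = c("geneA", "geneA", "geneA", "geneB"),
  transcript_type = c("protein_coding", "protein_coding",
    "gene_body", "protein_coding")
)

# Hypothetical helper: explicit ids win, then the tx2gene column, else none
resolve_gene_body_ids <- function(tx2gene,
  gene_body_ids = NULL,
  tx_colname = "transcript_id",
  gene_body_colname = "transcript_type") {
  if (length(gene_body_ids) > 0) {
    # user-supplied identifiers are used directly, without tx2gene
    return(gene_body_ids)
  }
  if (gene_body_colname %in% colnames(tx2gene)) {
    is_body <- tx2gene[[gene_body_colname]] %in% "gene_body"
    return(tx2gene[[tx_colname]][is_body])
  }
  # no column available: there are no gene_body entries to separate
  character(0)
}

resolve_gene_body_ids(tx2gene)
```

An empty result corresponds to the case where all transcripts are used for import_types="gene" and no "gene_body" summary is returned.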

verbose

logical indicating whether to print verbose output.

...

additional arguments are passed to supporting functions.

trim_tx_from, trim_tx_to

character vector of regular expression patterns to be used optionally to curate the values in tx_colname prior to joining those values to tx2gene[[tx_colname]]. The default is to remove "(-)" and "(+)" from the transcript_id (tx_colname) column.
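The effect of the default pattern shown in the usage, "[(][-+][)]" replaced with "", can be checked with base R gsub(). The transcript identifiers below are made up for illustration, and the loop is only a sketch of pairwise pattern and replacement application.

```r
# Default pattern and replacement from the function usage
trim_tx_from <- c("[(][-+][)]")
trim_tx_to <- c("")

# Hypothetical transcript identifiers, two of them carrying a strand suffix
tx_ids <- c("ENST00000000001(+)", "ENST00000000002(-)", "ENST00000000003")

# Apply each pattern with its matching replacement, in order
for (i in seq_along(trim_tx_from)) {
  tx_ids <- gsub(trim_tx_from[i], trim_tx_to[i], tx_ids)
}
tx_ids
```

After this step the identifiers no longer carry "(+)" or "(-)", so they can be matched to tx2gene[[tx_colname]].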

Value

list with SummarizedExperiment objects, each of which contains assay names c("counts", "abundance", "length"), where c("counts", "abundance") are transformed with log2(1 + x). The transform can be reversed with 2^x - 1. The SummarizedExperiment objects by name:

  • "TxSE": transcript-level values imported from quant.sf.

  • "GeneSE": gene-level summary values, excluding "gene_body" entries.

  • "GeneBodySE": gene-level summary values, including "gene_body" entries.

  • "GeneTxSE": gene-level summary values, where transcripts are combined to gene level, and "gene_body" entries are represented separately, with suffix "_gene_body" added to the gene name.
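The log2(1 + x) transform applied to the "counts" and "abundance" assays has the inverse 2^x - 1. A minimal round trip in base R, using made-up values:

```r
# Hypothetical raw values
counts <- c(0, 1, 7, 1023)

# Forward transform, as applied to the "counts" and "abundance" assays
log_counts <- log2(1 + counts)

# Reverse transform recovers the original scale
restored <- 2^log_counts - 1
all.equal(restored, counts)
```

Zero values remain zero after the transform, which is the reason for adding 1 before taking the logarithm.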

Details

This function is intended to automate the process of importing a series of quant.sf files, then generating SummarizedExperiment objects at the transcript and gene level. It optionally includes sample annotation provided as a data.frame in argument curation_txt. It also includes transcript and gene annotations, taken either from the data.frame in argument tx2gene, or derived from the GTF file in argument gtf by calling splicejam::makeTx2geneFromGtf().

This function can optionally process data that includes full length gene body regions, annotated with "gene_body". This option is specific for Salmon quantitation where the transcripts include full length gene body for multi-exon genes, for example to measure unspliced transcript abundance.

  • import_types="gene" summarizes only the proper transcripts, excluding "gene_body" entries.

  • import_types="gene_body" summarizes all transcript and full gene body entries into one summary abundance per gene.

  • import_types="gene_tx" summarizes proper transcripts to the gene level, and separately represents "gene_body" entries for comparison.
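The difference between the three gene-level summaries can be illustrated with a toy matrix. This sketch uses plain row sums from base R rowsum() as a stand-in for tximport::summarizeToGene(), which does more than summing; all identifiers and values are hypothetical.

```r
# Hypothetical transcript-level abundance: rows are transcripts, columns are samples
tx_abundance <- matrix(
  c(10, 20, 5, 8,
    12, 18, 6, 9),
  ncol = 2,
  dimnames = list(
    c("tx1", "tx2", "geneA_body", "tx3"),
    c("sampleA", "sampleB")
  )
)
gene <- c("geneA", "geneA", "geneA", "geneB")
is_body <- c(FALSE, FALSE, TRUE, FALSE)

# "gene": proper transcripts only, gene_body rows are excluded
gene_sum <- rowsum(tx_abundance[!is_body, , drop = FALSE],
  group = gene[!is_body])

# "gene_body": all entries are combined per gene
gene_body_sum <- rowsum(tx_abundance, group = gene)

# "gene_tx": gene_body rows are kept apart, with suffix "_gene_body"
gene_tx_group <- ifelse(is_body, paste0(gene, "_gene_body"), gene)
gene_tx_sum <- rowsum(tx_abundance, group = gene_tx_group)

gene_tx_sum
```

In this toy case geneB has no "gene_body" entry, so its values are identical in all three summaries, while geneA differs by the contribution of its unspliced entry.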