Define detected transcripts

defineDetectedTx(
  iMatrixTx = NULL,
  iMatrixTxGrp = NULL,
  iMatrixTxTPM = NULL,
  iMatrixTxTPMGrp = NULL,
  groups = NULL,
  tx2geneDF = NULL,
  cutoffTxPctMax = 10,
  cutoffTxExpr = 5,
  cutoffTxTPMExpr = 0.1,
  txColname = "transcript_id",
  geneColname = "gene_name",
  zeroAsNA = TRUE,
  applyTxPctTo = c("TPM", "counts", "both", "either"),
  useMedian = FALSE,
  floorTPM = 0.001,
  floorCounts = 0.001,
  verbose = FALSE,
  ...
)

Arguments

iMatrixTx

numeric matrix of read counts (or pseudocounts) with transcript rows and sample columns. This data is assumed to be log2-transformed, and if any value is higher than 50, it will be log2-transformed with log2(x+1).

iMatrixTxGrp

numeric matrix of read counts averaged by sample group. If this matrix is not provided, it will be calculated from iMatrixTx using jamba::rowGroupMeans() and the groups parameter. This data is assumed to be log2-transformed, and if any value is higher than 50, it will be log2-transformed with log2(x+1).

iMatrixTxTPM

numeric matrix of TPM values, with sample columns and transcript rows. Note that if this parameter is not supplied, the counts in iMatrixTx will be used to determine the percent max isoform expression.

iMatrixTxTPMGrp

numeric matrix of TPM values averaged by sample group. If this matrix is not provided, it will be calculated from iMatrixTxTPM using jamba::rowGroupMeans() and the groups parameter.

groups

vector of group labels, either as character vector or factor. It should be named by colnames(iMatrixTx).

tx2geneDF

data.frame with colnames including c("transcript_id","gene_name"), where the values in the "transcript_id" column must match the rownames(iMatrixTx).

cutoffTxPctMax

numeric value scaled from 0 to 100 indicating the percentage of the maximum isoform expression per gene, for an alternate isoform to be considered for detection.

cutoffTxExpr

numeric value indicating the minimum group mean counts in iMatrixTxGrp for a transcript to be considered for detection.

cutoffTxTPMExpr

numeric value indicating the minimum group mean TPM in iMatrixTxTPMGrp for a transcript to be considered for detection.

txColname, geneColname

the colnames(tx2geneDF) representing the rownames(iMatrixTx) matched by tx2geneDF[,txColname], and the associated genes given by tx2geneDF[,geneColname]. Note that detectedTx must also contain values in rownames(iMatrixTx) and tx2geneDF[,txColname].

zeroAsNA

logical indicating whether values of zero (or less than zero) should be treated as NA values, thus removing them from mean calculations. This argument is only relevant when iMatrixTxGrp and iMatrixTxTPMGrp are not supplied. Argument zeroAsNA=TRUE is recommended when using kmer quantitation tools such as Salmon or Kallisto, which can sometimes allocate all expression to one or another transcript isoform when two isoforms are nearly identical. Also use TRUE when a value of zero represents the absence of data.

applyTxPctTo

character string indicating how to apply the cutoffTxPctMax threshold. This argument is only used when both iMatrixTx and iMatrixTxTPM are supplied, as follows: "TPM" uses only iMatrixTxTPM data; "counts" uses only iMatrixTx data; "both" requires that both iMatrixTx and iMatrixTxTPM data meet the threshold; "either" requires that one or both of iMatrixTx and iMatrixTxTPM meet the threshold.

useMedian

logical indicating whether to use group median values instead of group mean values.

floorTPM, floorCounts

numeric value indicating the floor value to use when isoform TPM or counts are below 1, used to prevent divide-by-zero when all isoforms are 0.

verbose

logical indicating whether to print verbose output.

Value

List with the following elements:

txExprGrpTx

Numeric matrix representing the expression counts per transcript, grouped by "gene_name".

txPctMaxTxGrpAll

Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxGrp data. (New to version 0.0.61.900.)

txPctMaxTxTPMGrpAll

Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxTPMGrp data. This data is returned only if iMatrixTxTPM or iMatrixTxTPMGrp were supplied. (New to version 0.0.61.900.)

txPctMaxGrpAll

Numeric matrix representing the percent max expression used for filtering, after applying applyTxPctTo: "counts" uses txPctMaxTxGrpAll; "TPM" uses txPctMaxTxTPMGrpAll; "both" uses the lower of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll, so that both must meet the threshold; "either" uses the higher of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll, so that only one must meet the threshold.

txExprGrpAll

Numeric matrix of sample group counts, exponentiated and rounded to integer values.

txTPMExprGrpAll

Numeric matrix of sample group TPM values, exponentiated and rounded to integer values.

txFilterM

Numeric matrix indicating whether each isoform met the criteria to be considered detected. The criteria must be met in the same group for an isoform to be considered detected.

detectedTx

Character vector of transcripts, as defined by the rownames(iMatrixTx).

Details

This function aims to combine evidence from RNA-seq sequence read counts (or pseudocounts from a kmer tool such as Salmon or Kallisto), along with alternative TPM quantitation, to determine the observed "detected" transcript space for a given experiment.

Each input data matrix is assumed to be appropriately log-transformed, typically using log2(1+x). If any value is >= 50 then the data matrix will be log2-transformed using log2(1+x).
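The transformation check described above can be sketched in base R. This is only an illustration under stated assumptions: ensureLog2() and its maxLog argument are hypothetical names, not part of the package, and the comparison uses >= 50 as written in this paragraph.

```r
# Hypothetical helper (not a package function): apply log2(1 + x) only when
# the matrix appears to contain untransformed values.
ensureLog2 <- function(x, maxLog = 50) {
  if (any(x >= maxLog, na.rm = TRUE)) {
    x <- log2(1 + x)
  }
  x
}

# Toy raw counts: transcripts in rows, samples in columns
rawCounts <- matrix(c(0, 31, 1023, 7),
  nrow = 2,
  dimnames = list(c("tx1", "tx2"), c("s1", "s2")))

# The value 1023 triggers the transformation
logCounts <- ensureLog2(rawCounts)

# Data that is already below the limit passes through unchanged
alreadyLog <- ensureLog2(logCounts)
```

One consequence of this heuristic is that a raw matrix whose values are all below 50 would be treated as already transformed, so it seems safest to transform the data explicitly before calling the function.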

The criteria must be met in at least one sample group, but all criteria must be met in the same sample group for an isoform to be considered "detected".

In our experience the use of TPM values appears more robust and is conceptually the best approach for comparing the relative quantity of one transcript isoform to another. Our reasoning is that TPM is intended to be roughly a molar quantity of transcript molecules, independent of the transcript length, and the potential for overlapping regions between isoforms. We also recommend the use of a kmer quantitation method, such as Salmon or Kallisto, which estimates isoform abundances not by specific read counts, but by quantifying kmers unique to particular isoforms for a given gene.

In all cases, the thresholds for detection can be modified; however, in our experience thus far the default values perform reasonably well at identifying expressed isoforms, while filtering out isoforms that we considered to be spuriously expressed.

There are three default requirements for a transcript to be considered "detected".

  1. An isoform must be expressed at a level of at least 10% of the most highly expressed isoform for a given gene, using TPM values.

  2. An isoform must have at least log2(32) pseudocounts to be considered detected, based upon our view of Salmon pseudocount data using MA-plots.

  3. An isoform must have at least log2(2) TPM units to be considered detected, based upon our view of Salmon TPM values using MA-plots.
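The three requirements, together with the rule that all of them must hold in the same sample group, can be sketched in base R. This is an illustration rather than the package's implementation. The matrices grpCounts, grpTPM, and pctMax are invented toy inputs that are not derived from one another, and the thresholds are the ones quoted in the list above.

```r
txNames <- c("txA1", "txA2", "txB1", "txB2")
grpNames <- c("grp1", "grp2")

# Toy group-level pseudocounts in log2(1 + x) units
grpCounts <- matrix(c(12, 6, 4, 6,  11, 3, 7, 2), nrow = 4,
  dimnames = list(txNames, grpNames))
# Toy group-level TPM values in log2(1 + x) units
grpTPM <- matrix(c(6, 3.2, 0.5, 0.5,  5, 0.5, 3, 2), nrow = 4,
  dimnames = list(txNames, grpNames))
# Toy percent-of-max-isoform values on a 0 to 100 scale
pctMax <- matrix(c(100, 13, 100, 20,  100, 4, 100, 30), nrow = 4,
  dimnames = list(txNames, grpNames))

# Each cell is TRUE only when all three criteria hold in that group
meetsAll <- (pctMax >= 10) &
  (grpCounts >= log2(32)) &
  (grpTPM >= log2(2))

# A transcript is called detected when at least one group passes everything
detected <- rownames(meetsAll)[rowSums(meetsAll) > 0]
detected
```

In this toy data txB2 passes the count threshold only in grp1 and the TPM threshold only in grp2, so it is not called detected even though each criterion is met somewhere.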

Each experiment is likely to be different in terms of total sequenced reads, quality of read alignment or quantitation to the transcriptome, etc. We suggest observing MA-plots for the counts and TPM values, for the point at which the signal substantially increases from baseline zero. We also plotted the TPM versus count per sample, and noted the point at which the two signals began to correlate. These observations, along with careful review of numerous gene model transcript isoforms, supported our selection of these criteria.

Lastly, the requirement for 10 percent of max isoform expression was motivated by observing highly expressed genes, which sometimes had alternative isoforms whose abundance was extremely low compared to the most abundant isoform, but notably higher than the minimum for detection. For example, Gapdh with expression above 100,000 pseudocounts may have an isoform with 120 pseudocounts. When we reviewed the sequence coverage, we could find no compelling evidence to support the minor isoform, and theorized that the pseudocounts arose from the stochastic nature of rebalancing relative expression among isoforms.
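The arithmetic behind the Gapdh example can be sketched as follows. The sketch assumes the percentage is computed on linear-scale values and uses a small floor in the spirit of floorCounts; the transcript names and the Actb entry are hypothetical, and the package's own calculation may differ in detail.

```r
# Hypothetical linear-scale group values for three transcripts
expr <- c(Gapdh_major = 100000, Gapdh_minor = 120, Actb_only = 5000)
gene <- c("Gapdh", "Gapdh", "Actb")

# A small floor keeps the ratio defined when every isoform of a gene is zero
floorValue <- 0.001
exprFloored <- pmax(expr, floorValue)

# Highest isoform value per gene, repeated for each transcript of that gene
geneMax <- ave(exprFloored, gene, FUN = max)

# Percent of the most abundant isoform, on a 0 to 100 scale
pctMax <- 100 * exprFloored / geneMax
round(pctMax, 2)
```

Here the minor Gapdh isoform sits at 0.12 percent of the major isoform, far below a cutoff of 10, even though 120 pseudocounts would exceed an absolute count threshold of 32.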

Note the argument zeroAsNA=TRUE, which by default treats any expression value of zero (or less than zero) as NA, thus removing them from group mean calculations. When iMatrixTxGrp and iMatrixTxTPMGrp are not supplied, this option is helpful in calculating a more appropriate group mean expression value, notably when a value of zero represents absence of data. Any group mean that is NA as a result is converted to zero for the purpose of applying filters.
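The effect of treating zeros as NA before averaging can be sketched in base R. The documentation states that group means are calculated with jamba::rowGroupMeans(); the version below is a simplified stand-in written only to show the idea, with invented sample and group names.

```r
# Toy expression matrix: two transcripts, four samples
x <- matrix(c(4, 0, 0, 6,
              0, 0, 5, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("tx1", "tx2"), c("s1", "s2", "s3", "s4")))
# Group labels named by colnames(x)
groups <- c(s1 = "ctl", s2 = "ctl", s3 = "trt", s4 = "trt")

# Treat zero (or negative) values as missing
xNA <- x
xNA[xNA <= 0] <- NA

# Mean per group, ignoring NA values
colsByGroup <- split(colnames(xNA), groups[colnames(xNA)])
grpMeans <- sapply(colsByGroup, function(cols) {
  rowMeans(xNA[, cols, drop = FALSE], na.rm = TRUE)
})

# A group with no observed values yields NaN; convert it to zero for filtering
grpMeans[is.na(grpMeans)] <- 0
grpMeans
```

With zeros kept, tx1 would have a "ctl" mean of 2; with zeros treated as NA its mean is 4, which better reflects the one sample where the isoform was actually quantified.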

See also