Define detected transcripts

defineDetectedTx(
  iMatrixTx = NULL,
  iMatrixTxGrp = NULL,
  iMatrixTxTPM = NULL,
  iMatrixTxTPMGrp = NULL,
  groups = NULL,
  tx2geneDF = NULL,
  cutoffTxPctMax = 10,
  cutoffTxExpr = 5,
  cutoffTxTPMExpr = 0.1,
  txColname = "transcript_id",
  geneColname = "gene_name",
  zeroAsNA = TRUE,
  applyTxPctTo = c("TPM", "counts", "both", "either"),
  useMedian = FALSE,
  floorTPM = 0.001,
  floorCounts = 0.001,
  verbose = FALSE,
  ...
)

Arguments

iMatrixTx	numeric matrix of read counts (or pseudocounts) with transcript rows and sample columns. This data is assumed to be log2-transformed, and if any value is higher than 50, it will be log2-transformed with `log2(x+1)`.
iMatrixTxGrp	numeric matrix of read counts averaged by sample group. If this matrix is not provided, it will be calculated from `iMatrixTx` using `jamba::rowGroupMeans()` and the `groups` parameter. This data is assumed to be log2-transformed, and if any value is higher than 50, it will be log2-transformed with `log2(x+1)`.
iMatrixTxTPM	numeric matrix of TPM values, with sample columns and transcript rows. Note that if this parameter is not supplied, the counts in `iMatrixTx` will be used to determine the percent max isoform expression.
iMatrixTxTPMGrp	numeric matrix of TPM values averaged by sample group. If this matrix is not provided, it will be calculated from `iMatrixTxTPM` using `jamba::rowGroupMeans()` and the `groups` parameter.
groups	vector of group labels, either as character vector or factor. It should be named by `colnames(iMatrixTx)`.
tx2geneDF	data.frame with colnames including `c("transcript_id","gene_name")`, where the values in the `"transcript_id"` column must match the `rownames(iMatrixTx)`.
cutoffTxPctMax	numeric value scaled from 0 to 100 indicating the percentage of the maximum isoform expression per gene, for an alternate isoform to be considered for detection.
cutoffTxExpr	numeric value indicating the minimum group mean counts in `iMatrixTxGrp` for a transcript to be considered for detection.
cutoffTxTPMExpr	numeric value indicating the minimum group mean TPM in `iMatrixTxTPMGrp` for a transcript to be considered for detection.
txColname, geneColname	the `colnames(tx2geneDF)` representing the `rownames(iMatrixTx)` matched by `tx2geneDF[,txColname]`, and the associated genes given by `tx2geneDF[,geneColname]`. Note that `detectedTx` must also contain values in `rownames(iMatrixTx)` and `tx2geneDF[,txColname]`.
zeroAsNA	logical indicating whether values of zero (or less than zero) should be treated as `NA` values, thus removing them from mean calculations. This argument is only relevant when `iMatrixTxGrp` and `iMatrixTxTPMGrp` are not supplied. Argument `zeroAsNA=TRUE` is recommended when using kmer quantitation tools such as Salmon or Kallisto, which can sometimes allocate all expression to one or another transcript isoform when two isoforms are nearly identical. Also use `TRUE` when a value of zero represents the absence of data.
applyTxPctTo	`character` string indicating how to apply the `cutoffTxPctMax` threshold. This argument is only used when both `iMatrixTx` and `iMatrixTxTPM` are supplied, as follows: `"TPM"` uses only `iMatrixTxTPM` data; `"counts"` uses only `iMatrixTx` data; `"both"` requires that both `iMatrixTx` and `iMatrixTxTPM` data meet the threshold; `"either"` requires that one or both of `iMatrixTx` and `iMatrixTxTPM` meet the threshold.
useMedian	logical indicating whether to use group median values instead of group mean values.
floorTPM, floorCounts	`numeric` value indicating the floor value to use when isoform TPM or counts are below `1`, used to prevent divide-by-zero when all isoforms are `0`.
verbose	logical indicating whether to print verbose output.
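
As an illustration of the `applyTxPctTo` options described above, the following is a minimal base R sketch, not the package's own code. The helper `combinePctMax()` and the toy percent-max values are hypothetical. It relies only on the fact that requiring both values to meet a threshold is the same as testing the lower of the two, and requiring either is the same as testing the higher.

```r
# Hypothetical helper (not part of the package): reduce two percent-max
# vectors to the single value that is compared against the threshold.
combinePctMax <- function(pctCounts, pctTPM,
    applyTxPctTo = c("TPM", "counts", "both", "either")) {
  applyTxPctTo <- match.arg(applyTxPctTo)
  switch(applyTxPctTo,
    TPM = pctTPM,
    counts = pctCounts,
    both = pmin(pctCounts, pctTPM),    # both must pass: test the lower value
    either = pmax(pctCounts, pctTPM))  # one may pass: test the higher value
}

# Toy percent-of-max values for three isoforms (made up for illustration)
pctCounts <- c(tx1 = 100, tx2 = 6, tx3 = 15)
pctTPM <- c(tx1 = 100, tx2 = 12, tx3 = 4)
cutoff <- 10

passBoth <- combinePctMax(pctCounts, pctTPM, "both") >= cutoff
passEither <- combinePctMax(pctCounts, pctTPM, "either") >= cutoff
passTPM <- combinePctMax(pctCounts, pctTPM, "TPM") >= cutoff
```

In this toy data `tx2` passes only by TPM and `tx3` only by counts, so both are kept under `"either"` and dropped under `"both"`.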

Value

List with the following elements:

txExprGrpTx: Numeric matrix representing the expression counts per transcript, grouped by "gene_name".
txPctMaxTxGrpAll: Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxGrp data. (New to version 0.0.61.900.)
txPctMaxTxTPMGrpAll: Numeric matrix representing the percent expression of each transcript isoform per gene, as compared to the highest expression of isoforms for that gene, using iMatrixTxTPMGrp data. This data is returned only if iMatrixTxTPM or iMatrixTxTPMGrp were supplied. (New to version 0.0.61.900.)
txPctMaxGrpAll: Numeric matrix representing the percent max expression used for filtering, after applying applyTxPctTo: "counts" uses txPctMaxTxGrpAll; "TPM" uses txPctMaxTxTPMGrpAll; "both" uses the lower of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll, so that both must meet the threshold; "either" uses the higher of txPctMaxTxGrpAll and txPctMaxTxTPMGrpAll, so that at least one must meet the threshold.
txExprGrpAll: Numeric matrix of sample group counts, exponentiated and rounded to integer values.
txTPMExprGrpAll: Numeric matrix of sample group TPM values, exponentiated and rounded to integer values.
txFilterM: Numeric matrix indicating whether each isoform met the criteria to be considered detected. The criteria must be met in the same group for an isoform to be considered detected.
detectedTx: Character vector of transcripts, as defined by the rownames(iMatrixTx).

Details

This function aims to combine evidence from RNA-seq sequence read counts (or pseudocounts from a kmer tool such as Salmon or Kallisto), along with alternative TPM quantitation, to determine the observed "detected" transcript space for a given experiment.

Each input data matrix is assumed to be appropriately log-transformed, typically using log2(1+x). If any value is >= 50 then the data matrix will be log2-transformed using log2(1+x).
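
The transformation rule above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the helper name `ensureLog2()` and the toy matrix are hypothetical.

```r
# Hypothetical helper (not part of the package): apply log2(1 + x) only when
# the matrix looks like it is still on the linear scale (any value >= 50).
ensureLog2 <- function(x, ceiling = 50) {
  if (any(x >= ceiling, na.rm = TRUE)) {
    x <- log2(1 + x)
  }
  x
}

# Toy linear-scale counts: transcript rows, sample columns
counts <- matrix(c(0, 31, 1023, 7),
  nrow = 2,
  dimnames = list(c("tx1", "tx2"), c("s1", "s2")))

logCounts <- ensureLog2(counts)     # transformed, since 1023 >= 50
logAgain <- ensureLog2(logCounts)   # unchanged, since all values are < 50
```

Note that a linear-scale matrix whose values all fall below 50 would be left untransformed by this rule, so it is safest to supply data that is already log2-transformed.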

The criteria must be met in at least one sample group, but all criteria must be met in the same sample group for an isoform to be considered "detected".

In our experience the use of TPM values appears more robust and is conceptually the best approach for comparing the relative quantity of one transcript isoform to another. Our reasoning is that TPM is intended to be roughly a molar quantity of transcript molecules, independent of the transcript length, and the potential for overlapping regions between isoforms. We also recommend the use of a kmer quantitation method, such as Salmon or Kallisto, which estimates isoform abundances not by specific read counts, but by quantifying kmers unique to particular isoforms for a given gene.

In all cases, the thresholds for detection can be modified. In our experience thus far, however, the default values perform reasonably well at identifying expressed isoforms, while filtering out isoforms that we considered to be spuriously expressed.

There are three default requirements for a transcript to be considered "detected".

An isoform must be expressed at least 10% of the max isoform for a given gene, using TPM values.
An isoform must have at least log2(32) pseudocounts to be considered detected, based upon our view of Salmon pseudocount data using MA-plots.
An isoform must have at least log2(2) TPM units to be considered detected, based upon our view of Salmon TPM values using MA-plots.
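
The three requirements listed above, together with the rule that all of them must hold in the same sample group, can be sketched in base R. This is a simplified illustration and not the package implementation: the helper `pctMax()`, the threshold variable names, and all data values are hypothetical, and the thresholds are the log2-scale values from the list above.

```r
# Toy group-mean matrices on the log2(1 + x) scale; all values are made up.
tx <- c("txA1", "txA2", "txB1")
gene <- c(txA1 = "GeneA", txA2 = "GeneA", txB1 = "GeneB")
countsGrp <- matrix(c(10, 6, 6,
                      10, 6, 2),
  nrow = 3, dimnames = list(tx, c("grp1", "grp2")))
tpmGrp <- matrix(c(8, 3, 0.5,
                   8, 3, 3),
  nrow = 3, dimnames = list(tx, c("grp1", "grp2")))

# Hypothetical helper: percent of the highest isoform per gene, per group,
# computed on the linear scale with a small floor to avoid dividing by zero.
pctMax <- function(logM, gene, floorValue = 0.001) {
  linM <- pmax(2^logM - 1, floorValue)
  maxM <- apply(linM, 2, function(v) ave(v, gene, FUN = max))
  100 * linM / maxM
}
pctTPM <- pctMax(tpmGrp, gene)

# Thresholds taken from the three requirements above (log2 scale)
cutoffPct <- 10
cutoffCounts <- log2(32)
cutoffTPM <- log2(2)

# All three criteria are tested cell by cell, so they must hold together
# in the same group; a transcript is detected if any group passes.
filterM <- (pctTPM >= cutoffPct) &
  (countsGrp >= cutoffCounts) &
  (tpmGrp >= cutoffTPM)
detected <- rownames(filterM)[rowSums(filterM) > 0]
```

Here `txA2` fails only the percent-max filter, while `txB1` meets the count threshold in `grp1` and the TPM threshold in `grp2` but never both in one group, so only `txA1` is reported as detected.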

Each experiment is likely to be different in terms of total sequenced reads, quality of read alignment or quantitation to the transcriptome, etc. We suggest observing MA-plots for the counts and TPM values, and looking for the point at which the signal substantially increases from baseline zero. We also plotted the TPM versus count per sample, and noted the point at which the two signals began to correlate. These observations, along with careful review of numerous gene model transcript isoforms, supported our selection of these criteria.

Lastly, the requirement for 10 percent of max isoform expression was motivated by observing highly expressed genes, which sometimes had alternative isoforms whose abundance was extremely low compared to the most abundant isoform, yet notably higher than the minimum for detection. For example, Gapdh with expression above 100,000 pseudocounts may have an isoform with 120 pseudocounts. When we reviewed the sequence coverage, we could find no compelling evidence to support the minor isoform, and theorized that the pseudocounts arose from the stochastic nature of rebalancing relative expression among isoforms.
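
The arithmetic behind the Gapdh example can be checked with a few lines of R. The value 100,000 is used here as an illustrative stand-in for "above 100,000", and the comparison is made on counts for simplicity.

```r
majorCounts <- 100000  # most abundant isoform (illustrative value)
minorCounts <- 120     # minor isoform from the example

# Percent of the most abundant isoform
pctOfMax <- 100 * minorCounts / majorCounts

# The minor isoform clears the log2(32) count threshold on its own...
passesCountFloor <- log2(1 + minorCounts) >= log2(32)
# ...but falls far short of 10 percent of the max isoform.
passesPctMax <- pctOfMax >= 10
```

The minor isoform sits at 0.12 percent of the major isoform, so it is removed by the percent-max filter even though its counts alone would pass.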

Note the argument zeroAsNA=TRUE, which by default treats any expression value of zero (or less than zero) as NA, thus removing them from group mean calculations. When iMatrixTxGrp and iMatrixTxTPMGrp are not supplied, this option is helpful in calculating a more appropriate group mean expression value, notably when a value of zero represents absence of data. Any group mean that is NA as a result is converted to zero for the purpose of applying filters.
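
The effect of `zeroAsNA` on group means can be sketched in base R. The function uses `jamba::rowGroupMeans()` for this step; the helper `groupMeans()` below is a hypothetical stand-in written only to show the idea, and the data are made up.

```r
# Hypothetical stand-in for the group-mean step (not the package code).
groupMeans <- function(x, groups, zeroAsNA = TRUE) {
  if (zeroAsNA) {
    x[x <= 0] <- NA   # zeros are treated as missing, not as measurements
  }
  grpNames <- unique(as.character(groups))
  out <- do.call(cbind, lapply(grpNames, function(g) {
    rowMeans(x[, groups == g, drop = FALSE], na.rm = TRUE)
  }))
  colnames(out) <- grpNames
  out[is.na(out)] <- 0  # groups with no usable values become zero
  out
}

# Toy log2-scale matrix: two transcripts, two samples per group
x <- matrix(c(0, 6, 4, 4,
              0, 0, 2, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("tx1", "tx2"), c("s1", "s2", "s3", "s4")))
groups <- c("A", "A", "B", "B")

meansNA <- groupMeans(x, groups, zeroAsNA = TRUE)
meansZero <- groupMeans(x, groups, zeroAsNA = FALSE)
```

With `zeroAsNA = TRUE`, the group A mean for `tx1` is 6 rather than 3, because the zero is excluded; `tx2` has no non-zero values in group A, so its mean is set to zero before filters are applied.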
