Make tx2gene data.frame from a GTF file

makeTx2geneFromGtf(
  GTF,
  geneAttrNames = c("gene_id", "gene_name", "gene_type"),
  txAttrNames = c("transcript_id", "transcript_type"),
  geneFeatureType = "gene",
  txFeatureType = c("transcript", "mRNA"),
  nrows = -1L,
  verbose = FALSE,
  ...
)

Arguments

GTF

character file name passed to data.table::fread(). When the file name ends with ".gz", installing the R.utils package is recommended; otherwise the fallback is a system call to gzcat to decompress the file during import. This fallback fails when gzcat is not available on the PATH of the user environment, so the R.utils package is generally the most reliable option.

geneAttrNames

character vector of recognized attribute names as they appear in column 9 of the GTF file, for gene rows.

txAttrNames

character vector of recognized attribute names as they appear in column 9 of the GTF file, for transcript rows.

geneFeatureType

character value to match column 3 of the GTF file, used to define gene rows, by default "gene".

txFeatureType

character vector of values to match column 3 of the GTF file, used to define transcript rows, by default "transcript". In some GTF files, "mRNA" is used, so either is accepted by default.

nrows

integer number of rows to read from the GTF file; the default -1 imports all rows. This argument is useful for checking results on a large GTF file using only a subset of its rows.

verbose

logical whether to print verbose output during processing.

Value

data.frame with colnames indicated by the values in geneAttrNames and txAttrNames.
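A minimal usage sketch follows. It writes two made-up GTF records to a temporary file and calls the function only if it is available in the session. The records, the temporary file, and the nrows value are illustrative assumptions, and no particular output is implied.

```r
# Two made-up GTF records: one gene row and one transcript row.
# Columns are tab-delimited; column 9 holds the attributes.
gtf_lines <- c(
  paste("chr1", "example", "gene", "100", "900", ".", "+", ".",
    'gene_id "G1"; gene_name "GeneA"; gene_type "protein_coding";',
    sep = "\t"),
  paste("chr1", "example", "transcript", "100", "900", ".", "+", ".",
    paste('gene_id "G1"; transcript_id "T1"; gene_name "GeneA";',
      'gene_type "protein_coding"; transcript_type "protein_coding";'),
    sep = "\t")
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

# Assumes the package providing makeTx2geneFromGtf() is already attached.
# nrows limits the import, which helps when testing a large GTF file.
if (exists("makeTx2geneFromGtf", mode = "function")) {
  tx2gene <- makeTx2geneFromGtf(gtf_file, nrows = 1000L, verbose = TRUE)
  print(head(tx2gene))
}
```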

Details

Create a transcript-to-gene data.frame from a GTF file, which is required by a number of transcriptome analysis methods such as those in the DEXSeq package, and limma package functions such as diffSplice().

This function uses only data.table::fread() and does not import the full GTF file with a package such as Bioconductor GenomicFeatures, because the data.table method is markedly faster when importing only the transcript-to-gene relationship. This method also allows the import of more annotations than are supported by the typical Bioconductor rtracklayer::import() for GTF data.
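The general idea of reading only the feature type (column 3) and attributes (column 9) can be sketched as below. This is not the package's implementation: it uses base R read.delim() instead of data.table::fread() to stay dependency-free, takes every attribute from the transcript rows rather than combining gene and transcript rows, and relies on a hypothetical helper get_attr() and made-up records.

```r
# Hypothetical helper to build one tab-delimited GTF record.
make_row <- function(type, attrs) {
  paste("chr1", "example", type, "100", "900", ".", "+", ".", attrs,
    sep = "\t")
}
gtf_lines <- c(
  make_row("gene", 'gene_id "G1"; gene_name "GeneA"; gene_type "protein_coding";'),
  make_row("transcript", paste('gene_id "G1"; transcript_id "T1"; gene_name "GeneA";',
    'gene_type "protein_coding"; transcript_type "protein_coding";')),
  make_row("mRNA", paste('gene_id "G1"; transcript_id "T2"; gene_name "GeneA";',
    'gene_type "protein_coding"; transcript_type "retained_intron";')),
  make_row("exon", 'gene_id "G1"; transcript_id "T1"; exon_number "1";')
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

# Read the nine GTF columns as character, with quote handling disabled
# so the quoted attribute values in column 9 stay intact.
gtf <- read.delim(gtf_file, header = FALSE, sep = "\t", quote = "",
  comment.char = "#", colClasses = "character")

# Hypothetical helper: extract the quoted value of one attribute name.
get_attr <- function(attrs, name) {
  pattern <- paste0('(^|; *)', name, ' "([^"]*)"')
  m <- regmatches(attrs, regexec(pattern, attrs))
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
    character(1))
}

# Keep rows whose feature type marks a transcript, then collect attributes.
tx_rows <- gtf[gtf$V3 %in% c("transcript", "mRNA"), ]
attr_names <- c("gene_id", "gene_name", "gene_type",
  "transcript_id", "transcript_type")
tx2gene <- as.data.frame(
  lapply(setNames(attr_names, attr_names),
    function(a) get_attr(tx_rows$V9, a)),
  stringsAsFactors = FALSE)
print(tx2gene)
```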

This function is intended to help keep all transcript data consistent by using the same GTF file that is also used by other analysis tools, whether those tools run in R or, more likely, outside R in a terminal environment.

For example, the GTF file could be used:

  • to run STAR sequence alignment then Rsubread::featureCounts() to generate a matrix of read counts per gene, transcript, or exon; or

  • to generate a transcript FASTA sequence file, run a kmer quantitation tool such as Salmon or Kallisto, then use tximport::tximport() to import results into R for downstream processing.
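For the second workflow, a hedged sketch of passing the result to tximport::tximport() is shown below. The mock tx2gene table stands in for the output of this function, and the Salmon file paths and sample names are hypothetical. tximport() expects the transcript identifier in the first column and the gene identifier in the second, so the columns are reordered explicitly.

```r
# Mock table standing in for the output of makeTx2geneFromGtf().
tx2gene <- data.frame(
  gene_id = c("G1", "G1", "G2"),
  gene_name = c("GeneA", "GeneA", "GeneB"),
  transcript_id = c("T1", "T2", "T3"),
  stringsAsFactors = FALSE
)

# tximport() expects two columns: transcript ID first, gene ID second.
tx2gene_min <- unique(tx2gene[, c("transcript_id", "gene_id")])

# Hypothetical Salmon output locations, one quant.sf file per sample.
quant_files <- file.path("salmon", c("sampleA", "sampleB"), "quant.sf")
names(quant_files) <- c("sampleA", "sampleB")

# Run the import only when tximport is installed and the files exist.
if (requireNamespace("tximport", quietly = TRUE) &&
    all(file.exists(quant_files))) {
  txi <- tximport::tximport(quant_files, type = "salmon",
    tx2gene = tx2gene_min)
}
```

Using the same GTF file for the transcript FASTA and for the tx2gene table keeps the transcript identifiers in the quantitation output consistent with those in the table.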

See also