Multienrichment folio of summary plots

mem_plot_folio(
  mem,
  do_which = NULL,
  p_cutoff = NULL,
  p_floor = 1e-10,
  main = "",
  use_raster = TRUE,
  min_gene_ct = 1,
  min_set_ct = 1,
  min_set_ct_each = 4,
  column_method = "euclidean",
  row_method = "euclidean",
  exemplar_range = c(1, 2, 3),
  pathway_column_split = NULL,
  pathway_column_title = LETTERS,
  gene_row_split = NULL,
  gene_row_title = letters,
  edge_color = NULL,
  cex.main = 2,
  cex.sub = 1.5,
  row_cex = 1,
  column_cex = 1,
  max_labels = 4,
  max_nchar_labels = 25,
  include_cluster_title = TRUE,
  repulse = 4,
  use_shadowText = FALSE,
  color_by_column = FALSE,
  style = "dotplot",
  enrich_im_weight = 0.3,
  gene_im_weight = 0.5,
  colorize_by_gene = TRUE,
  cluster_color_min_fraction = 0.4,
  byCols = c("composite_rank", "minp_rank", "gene_count_rank"),
  edge_bundling = "connections",
  apply_direction = NULL,
  do_plot = TRUE,
  verbose = TRUE,
  ...
)

Arguments

mem: list object created by multiEnrichMap(). Specifically the object is expected to contain colorV, enrichIM, memIM, geneIM.
do_which: integer vector of plots to produce. When do_which is NULL, all plots are produced. This argument is intended to help produce one plot from a folio, so each plot is referred to by its number, in order.
p_cutoff: numeric value indicating the enrichment P-value threshold used for multiEnrichMap(), but when NULL this value is taken from the mem input, or 0.05 is used by default.
p_floor: numeric value indicating the lowest enrichment P-value used in the color gradient on the Enrichment Heatmap.
main: character string used as a title on Cnet plots.
use_raster: logical indicating whether to use raster heatmaps, passed to ComplexHeatmap::Heatmap().
min_gene_ct, min_set_ct: integer values passed to mem_gene_path_heatmap(). The min_set_ct requires each set to contain at least min_set_ct genes, and min_gene_ct requires each gene to be present in at least min_gene_ct sets.
min_set_ct_each: minimum number of genes required for each set, required for at least one enrichment test.
column_method, row_method: arguments passed to ComplexHeatmap::Heatmap() which indicate the distance method used to cluster columns and rows, respectively.
exemplar_range: integer vector (or NULL) used to create Cnet exemplar plots, using this many exemplars per cluster.
pathway_column_split, gene_row_split: integer value passed as column_split and row_split, respectively, to mem_gene_path_heatmap(), indicating the number of pathway clusters, and gene clusters, to create in the gene-pathway heatmap. When either value is NULL then auto-split logic is used.
pathway_column_title, gene_row_title: character vectors passed to mem_gene_path_heatmap() as column_title and row_title, respectively. When one value is supplied, it is displayed and centered across all the respective splits. When multiple values are supplied, values are used up to the number of splits, and recycled as needed. In that case, repeated values are made unique by jamba::makeNames().
cex.main, cex.sub: numeric values passed to title() which size the default title and sub-title in Cnet plots.
row_cex, column_cex: numeric character expansion factor, used to adjust the relative size of row and column labels, respectively. A value of 1.1 will make row font size 10% larger.
color_by_column: logical indicating whether to colorize the enrichment heatmap columns using colorV in the input mem. This argument is only relevant when do_which includes 1.
enrich_im_weight, gene_im_weight: numeric value between 0 and 1, passed to mem_gene_path_heatmap(), used to apply relative weight to clustering columns and rows, respectively, when combining the gene-pathway incidence matrix with either column enrichment P-values, or row gene incidence matrix data.
colorize_by_gene: logical passed to mem_gene_path_heatmap() indicating whether the heatmap body for the gene-pathway heatmap will be colorized using the enrichment colors for each gene.
cluster_color_min_fraction: numeric value passed to collapse_mem_clusters() used to determine which enrichment colors to associate with each Cnet cluster.
byCols: character vector describing how to sort the pathways within Cnet clusters. This argument is passed to rank_mem_clusters().
edge_bundling: character string passed to jam_igraph() to control edge bundling. The default edge_bundling="connections" will bundle Cnet plot edges for genes that share the same pathway connections.
apply_direction: logical or NULL indicating whether to indicate directionality in the mem_enrichment_heatmap() which is the first plot in the series. The default apply_direction=NULL will auto-detect whether there is directionality present in the data, and will set apply_direction=TRUE only when there are non-NA values that differ from zero.
do_plot: logical indicating whether to render each plot. When do_plot=FALSE the plot objects will be created and returned, but the plot itself will not be rendered. This option may be useful to generate the full set of figures in one set, then review each figure one by one in an interactive session.
verbose: logical indicating whether to print verbose output.
...: additional arguments are passed to downstream functions. Notably, sets is passed to mem_gene_path_heatmap() which allows one to define a specific subset of sets to use in the gene-pathway heatmap.

Value

list is returned via invisible(), which contains each plot object enabled by the argument do_which:

enrichment_hm is a Heatmap object from ComplexHeatmap that contains the enrichment P-value heatmap. Note that this data is not used directly in subsequent plots: the pathway clusters shown here are based upon -log10(Pvalue) and not the underlying gene content of each pathway. This plot is a useful overview that answers the question "How many pathways are significantly enriched across the different enrichment tests?"
gp_hm is a Heatmap object from ComplexHeatmap with the gene-pathway incidence matrix heatmap. This heatmap and the column/pathway clusters are the subject of subsequent Cnet plots.
gp_hm_caption is a text caption that describes the gene and set filter criteria, and the row and column distance methods used for clustering. Because the filtering and clustering options have substantial impact on clustering, and the pathway clusters are the key for all subsequent plots, these values are important to keep associated with the output of this function.
clusters_mem is a list with the pathways contained in each pathway cluster shown by the gene-pathway heatmap, obtained by heatmap_column_order(gp_hm). The pathway names should also be present in colnames(mem$memIM) and rownames(mem$enrichIM), for follow-up inspection.
cnet_collapsed is an igraph object with Cnet plot data, where the pathways have been collapsed by cluster, using the gene-pathway heatmap clusters defined in clusters_mem. Each pathway cluster is labeled by cluster name, and the first few pathway names. This data can be plotted using jam_igraph(cnet_collapsed).
cnet_collapsed_set is the same as cnet_collapsed except the pathways are labeled by the cluster name only, for example c("A", "B", "C", "D"). This data can be plotted using jam_igraph(cnet_collapsed_set).
cnet_collapsed_set2 is the same as cnet_collapsed_set except the gene labels are hidden, useful when there are too many genes to label clearly. The gene symbols are still stored in V(g)$name but the labels in V(g)$label are updated to hide the genes. This data can be plotted using jam_igraph(cnet_collapsed_set2).
cnet_exemplars is a list of igraph Cnet objects, each one contains only the number of exemplar pathways from each cluster defined by argument exemplar_range. By default it uses 1 exemplar per cluster, then 2 exemplars per cluster, then 3 exemplars per cluster. A number of published figures use 1 exemplar per pathway cluster. This data can be plotted using jam_igraph(cnet_exemplars[[1]]), which will plot only the first igraph object from the list.
cnet_clusters is a list of igraph Cnet objects, each one contains all the pathways in one pathway cluster. This data can be plotted using jam_igraph(cnet_clusters[[1]]), or by calling a specific cluster jam_igraph(cnet_clusters[["A"]]).

Details

This function is intended to create multiple summary plots using the output data from multiEnrichMap(). By default it creates all plots one by one, sufficient for including in a multi-page PDF document with cairo_pdf(..., onefile=TRUE) or pdf(..., onefile=TRUE).

The data for each plot object can be created and visualized later with argument do_plot=FALSE.
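The two usage patterns above can be sketched as follows. This is a minimal illustration, not code from the package: the helper names save_folio_pdf() and build_folio(), the file name, and the page size are hypothetical, and mem is assumed to be the list returned by multiEnrichMap().

```r
# Hypothetical helper: write every folio plot to one multi-page PDF.
# Assumes the multienrichjam package is installed and `mem` is the
# list returned by multiEnrichMap().
save_folio_pdf <- function(mem, file = "mem_folio.pdf", ...) {
  grDevices::pdf(file, width = 10, height = 10, onefile = TRUE)
  # close the graphics device even if one of the plots fails
  on.exit(grDevices::dev.off(), add = TRUE)
  multienrichjam::mem_plot_folio(mem, do_plot = TRUE, ...)
}

# Hypothetical helper: create the plot objects without rendering them,
# so they can be reviewed one by one in an interactive session.
build_folio <- function(mem, do_which = c(2, 4), ...) {
  multienrichjam::mem_plot_folio(mem,
    do_which = do_which,
    do_plot = FALSE,
    ...)
}
```

After folio <- build_folio(mem), names(folio) shows which of the objects described under Value were created.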

Note: Since version 0.0.76.900 the first step in the workflow is to cluster the underlying gene-pathway incidence matrix. This step defines a consistent dendrogram driven by underlying gene content in each pathway. The dendrogram is used by each subsequent plot including the enrichment heatmap.

There are two recommended strategies for visualizing multienrichment results:

Pathway clusters viewed as a concept network (Cnet) plot.
- Given numerous statistically enriched pathways, this process defines pathway clusters using the underlying gene-pathway incidence matrix.
- Within each pathway cluster, the pathways typically share a high proportion of the same genes, and therefore are expected to represent very similar functions. Ideally, each cluster represents some distinct biological function, or a functional theme.
- Benefit: Reducing a large number of pathways to a small number of clusters greatly improves the options for visualization, while retaining a comprehensive view of all genes and pathways involved.
- Benefit: This option is recommended when there are numerous pathways, and when including more pathways is beneficial to understanding the overall functional effects of the experimental study.
- Limitation: The downside with this approach is that sometimes this comprehensive content can be too much detail to interpret in one figure, overshadowing individual pathways in each cluster.
- Limitation: It may be difficult to recognize a functional theme for each pathway cluster. Unfortunately, that process is not (yet) automated and requires some domain expertise of the pathways and functions involved.
- Limitation: It may not be possible for one Cnet plot to represent all functional effects of an experimental study.
Exemplar pathways are viewed as a Cnet plot.
- As described above, given numerous statistically enriched pathways, pathways are clustered using the gene-pathway incidence matrix. One "exemplar" pathway is selected from each cluster to represent the typical pathway content in each cluster, usually the most significant pathway in the cluster, but optionally the pathway containing the most total genes.
- Benefit: This process can produce a cleaner figure than the pathway cluster strategy described above, because fewer pathways and their associated genes are included in the figure.
- Limitation: This cleaner figure is understandably somewhat less comprehensive, and may be subject to bias when selecting exemplar pathways. However the selection of relevant pathways may be very effective within the context of the experimental study.
- Benefit: The resulting Cnet plot can often improve focus on specific genes and pathways, which can be advantageous when including numerous "synonyms" for the same or similar pathways is not beneficial.
- Benefit: This strategy also works particularly well when there are relatively few enriched pathways, or when argument topEnrichN used with multiEnrichMap() was relatively small.
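The exemplar idea can be sketched in base R. This is only an illustration under stated assumptions: the inputs clusters_mem and min_p are made-up data, pick_exemplars() is a hypothetical helper, and ranking uses the lowest P-value alone, whereas the package ranks pathways with rank_mem_clusters() and the byCols argument.

```r
# Hypothetical inputs: pathway clusters, and the lowest enrichment
# P-value observed for each pathway across enrichment tests.
clusters_mem <- list(
  A = c("pathway1", "pathway2", "pathway3"),
  B = c("pathway4", "pathway5"))
min_p <- c(pathway1 = 1e-4, pathway2 = 1e-8, pathway3 = 1e-3,
  pathway4 = 1e-2, pathway5 = 1e-6)

# Keep the n most significant pathways from each cluster.
# Assumption: significance alone decides the rank.
pick_exemplars <- function(clusters, min_p, n = 1) {
  lapply(clusters, function(paths) {
    ranked <- paths[order(min_p[paths])]
    head(ranked, n)
  })
}

exemplars_1 <- pick_exemplars(clusters_mem, min_p, n = 1)
exemplars_2 <- pick_exemplars(clusters_mem, min_p, n = 2)
```

Increasing n mirrors the effect of exemplar_range: each step adds one more pathway per cluster, trading a cleaner figure for more comprehensive content.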

The folio of plots includes:

Enrichment Heatmap, using enrichment P-values via mem_enrichment_heatmap(). Plot #1.
Gene-Pathway Incidence Matrix Heatmap using mem_gene_path_heatmap(). This step visualizes the pathway clustering to be used by all other plots in the folio. Plot #2.
Cnet Cluster Plot representing Gene-Pathway clusters as a network, created using collapse_mem_clusters(), then plotted with jam_igraph(). Plots #3, #4, and #5.
Cnet Exemplar Plots using exemplar pathways from each gene-pathway cluster, with an increasing number of exemplars included from each cluster (n per cluster). Cnet igraph objects are created using subsetCnetIgraph(), then plotted with jam_igraph(). Plots #6, #7, and #8.
Cnet Individual Cluster Plots with one plot for each gene-pathway cluster defined above, including all pathways within the cluster. These plots are mostly useful when a particular cluster may have multiple sub-clusters included together. The plots can be useful to understand the relationship between pathways in each cluster. Plots #9, #10, and so on, length equal to pathway_column_split.

The specific plots to be created are controlled with do_which:

do_which=1 will create the enrichment heatmap.
do_which=2 will create the gene-pathway heatmap.
do_which=3 will create the Cnet Cluster Plot using pathway cluster labels for each pathway node. By default it uses LETTERS: "A", "B", "C", "D", etc.
do_which=4 will create the Cnet Cluster Plot using abbreviated pathway labels for each pathway cluster node.
do_which=5 will create the Cnet Cluster Plot with no node labels.
do_which=6 begins the series of Cnet Exemplar Plots for each value in argument exemplar_range, whose default is c(1, 2, 3).
do_which=9 (by default) begins the series of Cnet individual cluster plots, which includes all pathways from each cluster.
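The numbering can be worked out with a small base R helper. This is a sketch, not package code: folio_plot_numbers() is a hypothetical function, and it assumes the individual cluster plots are numbered immediately after the exemplar series, as the "(by default)" wording above suggests.

```r
# Hypothetical helper: map each plot type to its do_which numbers.
# Assumption: individual cluster plots follow directly after the
# exemplar plots, one per pathway cluster.
folio_plot_numbers <- function(exemplar_range = c(1, 2, 3),
                               n_clusters = 4) {
  n_exemplar <- length(exemplar_range)
  list(
    enrichment_heatmap = 1,
    gene_pathway_heatmap = 2,
    cnet_cluster_plots = c(3, 4, 5),
    cnet_exemplar_plots = 5 + seq_len(n_exemplar),
    cnet_individual_clusters = 5 + n_exemplar + seq_len(n_clusters))
}

default_numbers <- folio_plot_numbers()
one_exemplar_numbers <- folio_plot_numbers(exemplar_range = 1)
```

With the defaults, the exemplar plots are 6, 7, and 8, and the individual cluster plots begin at 9. Under the same assumption, exemplar_range=1 would shift the first individual cluster plot to 7.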

The most frequently used plots are do_which=2 for the gene-pathway heatmap, and do_which=4 for the collapsed Cnet plot, where Cnet clusters are based upon the gene-pathway heatmap.

Arguments p_cutoff and min_set_ct_each can be used to apply more stringent thresholds than were used to create the original mem data. For example, applying p_cutoff=0.05 during multiEnrichMap() will colorize pathways in mem$enrichIMcolors; however, calling mem_plot_folio() with p_cutoff=0.001 will use a blank color in the color gradient for pathways that do not have a mem$enrichIM value at or below 0.001.
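The effect of a stricter p_cutoff, together with p_floor, can be illustrated in base R. The matrix below is made-up data, and using zero to stand for a blank cell is an assumption for illustration; how the package represents blank cells internally is not described here.

```r
# Made-up enrichment P-values: pathways in rows, enrichment tests in columns
enrichIM <- matrix(
  c(0.04,  0.0005,
    1e-12, 0.2),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("pathway1", "pathway2"), c("testA", "testB")))

p_cutoff <- 0.001  # stricter than the 0.05 assumed for multiEnrichMap()
p_floor <- 1e-10   # lowest P-value represented in the color gradient

# Color gradient score: -log10(P), capped by p_floor
score <- -log10(pmax(enrichIM, p_floor))
# Assumption: cells that fail p_cutoff are shown blank, represented as 0
score[enrichIM > p_cutoff] <- 0
```

Here pathway1 in testA (P=0.04) passes a 0.05 threshold but is blank at p_cutoff=0.001, and pathway2 in testA (P=1e-12) is capped at the p_floor score of 10.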

Our experience is that the pathway clustering does not need to be perfect to be useful and valid. The pathway clusters are valid based upon the parameters used for clustering, and provide insight into the genes that help define each cluster distinct from other clusters. Sometimes the clustering results are more or less effective based upon the type of pattern observed in the data, so it can be helpful to adjust parameters to drill down to the most effective patterns.

Gene-Pathway clustering

The clustering is performed by combining the gene-pathway incidence matrix mem$memIM with the -log10(mem$enrichIM) enrichment P-values. The relative weight of each matrix is controlled by enrich_im_weight, where enrich_im_weight=0 assigns weight=0 to the enrichment P-values, and thus clusters only using the gene-pathway matrix. Similarly, enrich_im_weight=1 will assign full weight to the enrichment P-value matrix, and will ignore the gene-pathway matrix data.

The corresponding weight for genes (rows) is controlled by gene_im_weight, which balances row clustering between the mem$geneIM matrix and the gene-pathway matrix mem$memIM.
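The weighting idea for columns can be sketched in base R. This is an illustration only: the data are made up, combine_for_columns() is a hypothetical helper, and the simple linear weighting and rescaling shown here are assumptions, so the scaling used inside mem_gene_path_heatmap() may differ. The same idea would apply to rows with gene_im_weight and mem$geneIM.

```r
# Made-up gene-pathway incidence matrix: genes in rows, pathways in columns
memIM <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1,
    0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("pathway", 1:4)))
# Made-up enrichment P-values: pathways in rows, enrichment tests in columns
enrichIM <- matrix(
  c(1e-5, 1,
    1e-4, 1,
    1,    1e-3,
    1,    1e-6),
  ncol = 2, byrow = TRUE,
  dimnames = list(colnames(memIM), c("testA", "testB")))

# Hypothetical helper: stack the two matrices with relative weights.
# Assumption: -log10(P) is rescaled to a maximum of 1 before weighting.
combine_for_columns <- function(memIM, enrichIM,
                                enrich_im_weight = 0.3, p_floor = 1e-10) {
  logp <- -log10(pmax(enrichIM, p_floor))
  logp <- logp / max(logp)
  rbind(
    (1 - enrich_im_weight) * memIM,
    enrich_im_weight * t(logp[colnames(memIM), , drop = FALSE]))
}

combined <- combine_for_columns(memIM, enrichIM)
# Cluster pathways (columns) and split into two pathway clusters
hc <- hclust(dist(t(combined), method = "euclidean"))
clusters <- cutree(hc, k = 2)
```

With enrich_im_weight=0 the P-value rows contribute nothing, so the pathway distances are identical to those from the incidence matrix alone.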

The argument column_method defines the distance method, for example "euclidean" and "binary" are two immediate choices. The method also adds "correlation" from amap::hcluster() which can be very useful especially with large datasets.

The number of pathway clusters is controlled by pathway_column_split. By default, when pathway_column_split=NULL and auto_cluster=TRUE, the number of clusters is defined based upon the total number of pathways. In practice, pathway_column_split=4 or pathway_column_split=3 is recommended, as this number of clusters is most convenient to visualize as a Cnet plot.

To define your own pathway cluster labels, define pathway_column_title as a vector with length equal to pathway_column_split. These labels become network node labels in subsequent plots, and in the resulting igraph object.
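How labels are matched to splits can be sketched in base R. The helper label_clusters() is hypothetical, and make.unique() is used here as a stand-in for jamba::makeNames(), whose suffix format may differ.

```r
# Hypothetical helper: assign one title per split.
# A single title is returned unchanged, since it is displayed once,
# centered across all splits.
label_clusters <- function(titles, n_splits) {
  if (length(titles) == 1) {
    return(titles)
  }
  # Use titles up to the number of splits, recycling as needed, then
  # make repeated values unique (stand-in for jamba::makeNames()).
  make.unique(rep_len(titles, n_splits))
}

default_labels <- label_clusters(LETTERS, 4)
recycled_labels <- label_clusters(c("Immune", "Metabolism"), 3)
```

Supplying exactly pathway_column_split labels avoids recycling, so each network node carries the label that was intended for it.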

The pathway clusters are dependent upon the genes and pathways used during clustering, which are also controlled by min_set_ct and min_gene_ct.

min_set_ct filters the matrix by the number of times a Set is represented in the matrix, which can be helpful when some pathways contain a large number of genes and others contain very few.
min_gene_ct filters the matrix by the number of times a gene is represented in the matrix. It can be helpful for requiring a gene be represented in more than one enriched pathway.
min_set_ct_each filters the matrix to require each Set to contain at least this many entries from one enrichment result, rather than using the combined incidence matrix. It is mostly helpful to increase the value used in multiEnrichMap() argument min_count, which already filters pathways for minimum number of genes involved.
Note: These filters are only recommended when the gene-pathway matrix is very large, perhaps 100 pathways, or 500 genes.
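The two incidence matrix filters can be sketched in base R. The data are made up and filter_incidence() is a hypothetical helper; filtering sets first and genes second, in a single pass, is an assumption, and the package may apply the filters differently.

```r
# Made-up gene-pathway incidence matrix: genes in rows, pathways in columns
memIM <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("pathway", 1:3)))

# Hypothetical helper. Assumption: sets are filtered first, then genes.
filter_incidence <- function(memIM, min_gene_ct = 1, min_set_ct = 1) {
  # keep sets (columns) represented by at least min_set_ct genes
  keep_sets <- colSums(memIM != 0) >= min_set_ct
  memIM <- memIM[, keep_sets, drop = FALSE]
  # keep genes (rows) represented in at least min_gene_ct remaining sets
  keep_genes <- rowSums(memIM != 0) >= min_gene_ct
  memIM[keep_genes, , drop = FALSE]
}

unfiltered <- filter_incidence(memIM)
filtered <- filter_incidence(memIM, min_gene_ct = 2, min_set_ct = 2)
```

With min_set_ct=2 and min_gene_ct=2, pathway3 is dropped because it contains one gene, and only gene1 and gene2 remain because they appear in both remaining pathways.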

Cnet pathway clusters

The resulting Cnet pathway clusters are single nodes in the network, and these nodes are colorized based upon the enrichment tests involved. The threshold for including the color for each enrichment test is defined by cluster_color_min_fraction, which requires that at least this fraction of pathways in a pathway cluster meet the significance criteria for that enrichment test.

To adjust the coloration filter to include any enrichment test with at least one significant result, use cluster_color_min_fraction=0.01. In the gene-pathway heatmap, these colors are shown across the top of the heatmap. The default cluster_color_min_fraction=0.4 requires at least 40% of pathways in a cluster to be significant for a given enrichment test before its color is included.
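The fraction rule can be illustrated in base R for one pathway cluster. The significance matrix is made-up data and cluster_test_colors() is a hypothetical helper, not the logic of collapse_mem_clusters() itself.

```r
# Made-up significance calls for the five pathways in one cluster:
# TRUE when the pathway meets the significance criteria for that test
cluster_sig <- matrix(
  c(TRUE,  FALSE,
    TRUE,  FALSE,
    TRUE,  TRUE,
    FALSE, FALSE,
    TRUE,  FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("pathway", 1:5), c("testA", "testB")))

# Hypothetical helper: enrichment tests whose color is kept for the cluster
cluster_test_colors <- function(sig, min_fraction = 0.4) {
  # fraction of pathways in the cluster that are significant per test
  frac <- colMeans(sig)
  names(frac)[frac >= min_fraction]
}

default_tests <- cluster_test_colors(cluster_sig)
lenient_tests <- cluster_test_colors(cluster_sig, min_fraction = 0.01)
```

Here testA is significant in 4 of 5 pathways (0.8) and testB in 1 of 5 (0.2), so the default keeps only testA, while min_fraction=0.01 keeps both.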

Note: Prior to version 0.0.76.900 the enrichment heatmap was clustered using only enrichment P-values, transformed with -log10(Pvalue). That clustering was inconsistent with other plots in the folio, and was not effective at clustering pathways based upon similar content, which is the primary goal of the multienrichjam R package.