How to import and use Ingenuity IPA enrichment data • multienrichjam

library(multienrichjam);
library(jamba);
library(colorjam);
suppressPackageStartupMessages(library(ComplexHeatmap));
options("stringsAsFactors"=FALSE, "warn"=-1);
knitr::opts_chunk$set(
   fig.height=10,
   fig.width=10,
   fig.align="center"
)
ragg_png = function(..., res = 192) {
  ragg::agg_png(..., res = res, units = "in")
}
knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")

Import and use Ingenuity IPA enrichment data

This document describes steps recommended for using Ingenuity IPA enrichment data.

Ingenuity IPA enrichment data can be exported using a function "Export All" which by default creates one text file, concatenating each enrichment table into one large file.

This workflow demonstrates the import process using two IPA enrichment files used by Reese et al 2019 https://doi.org/10.1016/j.jaci.2018.11.043 to compare enrichment results in newborns to older children.

It therefore requires IPA enrichment results have already been exported in text format from IPA.

Import IPA data from text files

To import an IPA text file, use importIPAenrichment():

newborn_txt <- system.file("extdata",
   "Newborns-IPA.txt",
   package="multienrichjam");
newborn_dfl <- importIPAenrichment(newborn_txt);
#> ##  (11:47:56) 19Aug2024:   IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE

The result is a list of data.frame objects, where each data.frame represents one enrichment test. A convenient way to see the dimensions of each data.frame is with the function jamba::sdim():

sdim(newborn_dfl);
#>                            rows cols      class
#> Canonical Pathways          113    7 data.frame
#> Upstream Regulators         117    6 data.frame
#> Diseases and Bio Functions  444    7 data.frame
#> Tox Functions                15    7 data.frame
#> Networks                      8    7 data.frame
#> Tox Lists                    19    6 data.frame

For MultiEnrichMap, we typically want to analyze multiple IPA enrichment files, so we can wrap the call in an lapply() function:

newborn_txt <- system.file("extdata",
   "Newborns-IPA.txt",
   package="multienrichjam");
olderchildren_txt <- system.file("extdata",
   "OlderChildren-IPA.txt",
   package="multienrichjam");
ipa_files <- c(Newborns=newborn_txt,
   OlderChildren=olderchildren_txt)

ipa_l <- lapply(ipa_files, importIPAenrichment);
#> ##  (11:47:56) 19Aug2024:   IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE 
#> ##  (11:47:56) 19Aug2024:   IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE

Now we can check the dimensions within each list using jamba::ssdim():

ssdim(ipa_l);
#> $Newborns
#>                            rows cols      class
#> Canonical Pathways          113    7 data.frame
#> Upstream Regulators         117    6 data.frame
#> Diseases and Bio Functions  444    7 data.frame
#> Tox Functions                15    7 data.frame
#> Networks                      8    7 data.frame
#> Tox Lists                    19    6 data.frame
#> 
#> $OlderChildren
#>                            rows cols      class
#> Canonical Pathways          237    7 data.frame
#> Upstream Regulators         338    7 data.frame
#> Diseases and Bio Functions  500    7 data.frame
#> Tox Functions               118    7 data.frame
#> Networks                     10    7 data.frame
#> Tox Lists                    36    6 data.frame

In most cases, each IPA file should contain the same enrichment tests, for example "Canonical Pathways", "Upstream Regulators", "Diseases and Bio Functions", etc. However, it is not always the case, so it is recommended to check and verify each IPA file contains at least the enrichment tests needed for downstream analysis.

Analyze IPA enrichments from one enrichment test

IPA performs multiple enrichment tests, which are done independently and with unique assumptions and caveats. Therefore, I recommend using one enrichment test at a time in MultiEnrichMap.

Extract one data.frame from each result:

library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
## Take only the Ingenuity Canonical Pathways
enrichList_canonical <- lapply(ipa_l, function(i){
   i[["Canonical Pathways"]];
});
sdim(enrichList_canonical);
#>               rows cols      class
#> Newborns       113    7 data.frame
#> OlderChildren  237    7 data.frame

## Convert data.frame to enrichResult
## multienrichjam::enrichDF2enrichResult
er_canonical <- lapply(enrichList_canonical, function(i){
   enrichDF2enrichResult(i,
      keyColname="Name",
      pvalueColname="P-value",
      geneColname="geneNames",
      geneRatioColname="Ratio",
      pvalueCutoff=1)
});
sdim(er_canonical);
#>               rows cols        class
#> Newborns       113   11 enrichResult
#> OlderChildren  237   11 enrichResult
kable_coloring(
   head(as.data.frame(er_canonical[[1]])),
   caption="Top 10 rows of enrichment data",
   row.names=FALSE) %>%
   kableExtra::column_spec(column=seq_len(ncol(er_canonical[[1]])),
      border_left="1px solid #DDDDDD",
      extra_css="white-space: nowrap;")

Top 10 rows of enrichment data
ID	Ingenuity Canonical Pathways	-log(p-value)	zScore	GeneRatio	geneID	pvalue	Description	p.adjust	Count	setSize
Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis	Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis	0.405	NaN	0.00321	TNFSF13B	0.3935501	Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis	0.3935501	1	312
Neuroinflammation Signaling Pathway	Neuroinflammation Signaling Pathway	0.406	NaN	0.00322	CASP8	0.3926449	Neuroinflammation Signaling Pathway	0.3926449	1	311
Sirtuin Signaling Pathway	Sirtuin Signaling Pathway	0.428	NaN	0.00344	HIST1H1D	0.3732502	Sirtuin Signaling Pathway	0.3732502	1	291
G-Protein Coupled Receptor Signaling	G-Protein Coupled Receptor Signaling	0.447	NaN	0.00362	PRKAR2B	0.3572728	G-Protein Coupled Receptor Signaling	0.3572728	1	276
Protein Ubiquitination Pathway	Protein Ubiquitination Pathway	0.461	NaN	0.00377	TAP2	0.3459394	Protein Ubiquitination Pathway	0.3459394	1	265
Signaling by Rho Family GTPases	Signaling by Rho Family GTPases	0.478	NaN	0.00397	RDX	0.3326596	Signaling by Rho Family GTPases	0.3326596	1	252

run multiEnrichMap()

Now given a list of data.frame results, we can run multiEnrichMap():

mem_canonical <- multiEnrichMap(er_canonical,
   enrichBaseline=1,
   cutoffRowMinP=0.05,
   colorV=c("purple", "orange"),
   topEnrichN=20)

The output mem_canonical is a list containing various results.

kable_coloring(
   sdim(mem_canonical),
   caption="sdim(mem_canonical)") %>%
   kableExtra::column_spec(column=seq_len(4),
      border_left="1px solid #DDDDDD",
      extra_css="white-space: nowrap;")

sdim(mem_canonical)
	rows	cols	class	class_v2
enrichLabels	2		character	NA
colorV	2		character	NA
geneHitList	2		list	NA
geneHitIM	68	2	matrix	array
geneIM	34	2	matrix	array
geneIMcolors	34	2	matrix	array
enrichIMgeneCount	30	2	matrix	array
enrichIMdirection	30	2	matrix	array
enrichList	2		list	NA
enrichIM	30	2	matrix	array
enrichIMcolors	30	2	matrix	array
multiEnrichDF	30	12	data.frame	NA
multiEnrichResult	30	14	enrichResult	NA
memIM	34	30	matrix	array
multiEnrichMap	30	220	igraph	NA
multiEnrichMap2	30	220	igraph	NA
multiCnetPlot	64	110	igraph	NA
multiCnetPlot1	64	110	igraph	NA
multiCnetPlot1b	64	110	igraph	NA
multiCnetPlot2	64	110	igraph	NA
colnames	5		list	NA
p_cutoff	1		numeric	NA

Create a “Mem Plot Folio” (new Apr2020)

The other analysis steps describe in this vignette are valid options, however in recent work with multienrichjam we have developed a new function to create a small folio of relates plots that we call a “mem plot folio”, and the function is mem_plot_folio().

The mem_plot_folio() function creates the following plots by default:

Enrichment P-value heatmap (optionally colored by group)
Gene-pathway incidence matrix, clustered by column and by row
Cnet plot including all collapsed pathway clusters
- The first plot labels clusters by LETTERS, “A”, “B”, “C”, “D”, etc.
- The second plot labels clusters by the top n pathway names
- The third plot labels clusters by the top n pathway names, but hides gene labels. This plot is best when there are too many genes to label.
Cnet plot using 1,2,3 exemplars per pathway cluster (configurable)
- The first plot uses 1 exemplar per cluster.
- The second plot uses 2 exemplars per cluster.
- The third plot uses 3 exemplars per cluster.
Cnet plot of each pathway cluster individually
- The number of plots is the number of heatmap pathway_clusters.

mem_canonical_plots <- multienrichjam::mem_plot_folio(mem_canonical,
   pathway_column_split=4,
   node_factor=1,
   use_shadowText=TRUE,
   label_factor=1.2,
   do_which=c(1:5),
   verbose=TRUE,
   main="Canonical Pathways");
#> ##  (11:48:00) 19Aug2024:   mem_plot_folio(): Gene-pathway heatmap (pre-emptive) 
#> ##  (11:48:00) 19Aug2024:   mem_plot_folio(): plot_num 1: Enrichment P-value Heatmap

#> ##  (11:48:00) 19Aug2024:   mem_plot_folio(): Gene-pathway heatmap 
#> ##  (11:48:00) 19Aug2024:   mem_plot_folio(): plot_num 2: Gene-Pathway Heatmap

#> ##  (11:48:02) 19Aug2024:   mem_plot_folio(): Defined 4 pathway clusters. 
#> ##  (11:48:02) 19Aug2024:   mem_plot_folio(): Preparing Cnet collapsed 
#> ##  (11:48:02) 19Aug2024:   mem_plot_folio(): subsetCnetIgraph() 
#> ##  (11:48:02) 19Aug2024:   mem_plot_folio(): plot_num 3: Cnet collapsed with gene and cluster labels

#> ##  (11:48:02) 19Aug2024:   mem_plot_folio(): plot_num 4: Cnet collapsed with gene and set labels

#> ##  (11:48:03) 19Aug2024:   mem_plot_folio(): plot_num 5: Cnet collapsed with set labels, without gene labels

The object returned mem_canonical_plots will contain a list of the graphical objects for each plot requested. For example, each Cnet plot returns an igraph object which can be customized.

Commentary on the “Plot Folio”

In practice, we rarely found much benefit from the multi-enrichment map, which shows pathways connected to pathways based upon Jaccard overlap. This network view was often disorganized, not clearly clustered, and lacked ability to see which genes drive the overlaps between pathways. Perhaps it works best for the highly structured Gene Ontology (GO), but in our hands GO was just not insightful for the broad range of experiments we were analyzing.

Instead, we almost always gravitated toward the Cnet plot, a clever idea by Dr. Guangchang Yu that connects pathways to genes, where genes also connect naturally to other pathways. This plot for many of our collaborators has seemed more intuitive, and answers the next question people often have: “What are the shared genes?” Our subtle addition to this plot is to color-code genes by the enrichment experiment, so you can see which genes are shared or unique by experiment.

Since most enrichment results have far more pathways than can be shown in one Cnet plot, we tried to prioritize which pathways to display. The logic in Dr. Yu’s enrichplot is to take the top n pathways by P-value, however we noticed for our data this logic often shows redundant pathways and misses orthologous biological functions, which might be enriched to a lesser degree. So we started clustering pathways by P-value, then by using the pathway-gene incidence matrix.

The pathway-gene incidence matrix worked best in our hands for this clustering, partly because it clusters pathways by the genes involved, and not by the pattern of enrichment across the enrichment tests (which may not be related to the underlying genes.) So we cluster pathways, then choose an arbitrary number of clusters, and took one “exemplar” from each cluster to represent each cluster. This practice is mostly effective, especially when the clusters are well-defined, and when members of each cluster have relatively similar function. We can increase the exemplars per cluster as needed. This strategy is helpful because each pathway is very clearly labeled.

Most recently, we experimented with collapsing each pathway_cluster, so each pathway_cluster includes all genes implicated by pathways contained in the pathway_cluster. The benefit here is that no genes are lost by selecting a subset of “exemplar” pathways – all genes are retained per cluster. Then instead of plotting pathway-gene network, we plot a pathway_cluster-gene network. Each pathway_cluster is labeled using the top n pathway names – which is configurable, and is only a visible label. Each pathway_cluster contains all the pathways in the cluster.

Directly create each individual plot

The plots in mem_plot_folio() are also available as individual functions.

In general, it is most convenient to call mem_plot_folio() and use the output list to create custom plots, however the functions are also available to be called independently.

Plot the MultiEnrichMap network

We can view the “Multi Enrichment Map” itself with mem_multienrichplot().

This network connects pathways when they meet a Jaccard overlap coefficient threshold based upon the shared genes between the pathways. The default overlap is stored by multiEnrichMap() in the output object mem_canonical.

g <- mem_multienrichplot(mem_canonical,
   do_plot=FALSE,
   overlap=0.3,
   node_factor=2,
   repulse=3.5)
jam_igraph(g,
   node_factor=2,
   use_shadowText=TRUE)

The overlap threshold can be customized manually, or through the helper function mem_find_overlap(). This function attempts to guess a reasonable overlap threshold based upon the network connectivity, by optimizing the cluster size and total nodes involved.

g <- mem_multienrichplot(mem_canonical,
   overlap=mem_find_overlap(mem_canonical),
   do_plot=FALSE,
   node_factor=3,
   repulse=3.5)
jam_igraph(g,
   node_factor=3,
   use_shadowText=TRUE)

Notice there are distinct subnetworks, called “components”, which are not connected to each other. A useful follow-up technique is to graph one component at a time, which we do using the helper function subset_igraph_components(). Components are ordered by size, largest to smallest, so you can keep the largest using argument keep=1, or the second largest with keep=2, and so on.

We also call two other helper functions:

removeIgraphBlanks() removes blank colors from multi-color nodes, such as pie nodes, or colored rectangle nodes. It helps show only the remaining colors without the whitespace.
relayout_with_qfr() applies Fruchterman-Reingold layout to an igraph object. It’s convenient to adjust repulse here, which is the recommended method for adjusting the spacing between nodes. This function also updates other useful attributes such as the node label angle, which makes labels appear offset away from the majority of edges. (It helps more with Cnet plots below.)

## You can alternatively pull out any other component
g_sub <- subset_igraph_components(g, keep=1);

## Re-apply network layout, and remove blank colors
g_sub <- relayout_with_qfr(repulse=3.5,
   removeIgraphBlanks(g_sub))

## Plot
jam_igraph(g_sub,
   node_factor=3,
   use_shadowText=TRUE)

Plot the enrichment P-value matrix

A useful overview of the enrichment results, including the overlaps across datasets, is shown in a heatmap of -log10(Pvalue). The helper function mem_enrichment_heatmap() is used here.

The argument p_cutoff is used to set the Pvalue below which cells are colorized – every P-value above this threshold is not colored, and displayed as white, even when the P-value is less than 1.

mem_enrichment_heatmap(mem_canonical,
   p_cutoff=0.05);

The same data can be plotted as a heatmap.

mem_enrichment_heatmap(mem_canonical,
   style="heatmap",
   p_cutoff=0.05);

The argument color_by_column=TRUE will optionally apply a color gradient to each column, using the same P-value threshold defined by p_cutoff. In this case, cells are labeled to indicate the actual enrichment P-value.

mem_enrichment_heatmap(mem_canonical,
   style="heatmap",
   color_by_column=TRUE);

Plot the pathway-gene incidence matrix

We can view the pathway-gene matrix using the function mem_gene_path_heatmap(). This function uses data in the mem object: "memIM" which contains the gene-pathway incidence matrix, "geneIM" which contains the gene-dataset incidence matrix, and "enrichIM" which contains the pathway-dataset matrix.

This function attempts to “guess” a reasonable number of column and row splits – which breaks the column and row dendrograms into separate sub-clusters. These breaks are visual only, but can be useful in follow-up analyses shown later. You can override these numbers with arguments column_split and row_split.

Also, the colors used across the top of the heatmap indicate the enrichment P-value, and are intentionally scaled so all P-values above the P-value threshold are not colored, and are therefore white.

This function uses the amazing ComplexHeatmap::Heatmap() function, which enables many awesome customizable features. The ... argument is passed to ComplexHeatmap::Heatmap(), so you can use those features as needed.

hm <- mem_gene_path_heatmap(mem_canonical,
   column_cex=0.5,
   row_cex=0.6);
hm;

As a follow-up analysis, you can pull out each pathway cluster from the heatmap itself, using heatmap_column_order():

hm_sets <- heatmap_column_order(hm);
hm_sets;
#> $A
#>                                eNOS Signaling 
#>                              "eNOS Signaling" 
#>                                RAR Activation 
#>                              "RAR Activation" 
#>   Dopamine-DARPP32 Feedback in cAMP Signaling 
#> "Dopamine-DARPP32 Feedback in cAMP Signaling" 
#>                           Hepatic Cholestasis 
#>                         "Hepatic Cholestasis" 
#> 
#> $B
#>   Synaptic Long Term Depression                Glioma Signaling 
#> "Synaptic Long Term Depression"              "Glioma Signaling" 
#>        Growth Hormone Signaling 
#>      "Growth Hormone Signaling" 
#> 
#> $C
#>            mTOR Signaling           HIPPO signaling   Fc Epsilon RI Signaling 
#>          "mTOR Signaling"         "HIPPO signaling" "Fc Epsilon RI Signaling" 
#>          p70S6K Signaling 
#>        "p70S6K Signaling"

Plot the Pathway-Gene Concept network (Cnet plot)

We can view the Cnet plot (concept network plot) which shows the pathway-gene relationship.

The helper function memIM2cnet() creates a Cnet plot from the mem_canonical output. Here, we also pipe the result through other helper functions:

fixSetLabels() applies pathway label word wrap
relayout_with_qfr() applies network layout and adjusts node labels
removeIgraphBlanks() removes blank colors from the igraph nodes

#cnet <- mem_canonical$multiCnetPlot1b;
cnet <- mem_canonical %>% 
   memIM2cnet() %>%
   fixSetLabels() %>%
   removeIgraphBlanks() %>%
   relayout_with_qfr(repulse=4);
#plot(cnet, vertex.label.cex=0.6);
par("mar"=c(5,4,4,2)+0.1)
jam_igraph(cnet,
   use_shadowText=TRUE,
   node_factor=0.5,
   vertex.label.cex=0.6);
mem_legend(mem_canonical);

You can extract the largest connected subnetwork to plot, as before.

g2 <- cnet;
g2_sub <- subset_igraph_components(cnet, keep=1)

#plot(g2_sub);
jam_igraph(g2_sub,
   use_shadowText=TRUE,
   label_factor=0.5,
   node_factor=0.5);

Customizing the Cnet plot

The most useful customization is to subset the pathway nodes.

One useful method to choose a subset of pathways is to view one heatmap cluster at a time. Here we will use hm_sets made previously.

cnet_sub <- subsetCnetIgraph(cnet,
   repulse=3.5,
   includeSets=unlist(hm_sets[c("A")]));
jam_igraph(cnet_sub,
   node_factor=1,
   use_shadowText=TRUE,
   label_dist_factor=3,
   label_factor=1.3);
mem_legend(mem_canonical);

Subset the Pathway-Gene Cnet plot

You can also subset the Pathway-Gene Cnet plot using specific names, or by some network criteria, using the function subsetCnetIgraph().

Pathways are subsetted to require at least 6 genes, using the argument minSetDegree, which can be useful to help simplify the graph structure.

Similarly, the minGeneDegree=2 argument can be used to display only genes present in 2 or more pathways, but for now we will not use that argument.

During the subset operation, some convenient default operations are also performed, controlled by function arguments:

singlet nodes (with no connection to any other) are removed, remove_singlets=TRUE
the layout is re-calculated, force_relayout=TRUE
nodes are re-ordered by color, do_reorder=TRUE
labels are re-positioned to have proper angle offset away from the edges entering each node, spread_labels=TRUE
blank colors are not removed from pie and colored rectangle nodes, remove_blanks=FALSE, but can be during this step by remove_blanks=TRUE.

cnet3 <- multienrichjam::subsetCnetIgraph(cnet,
   repulse=5,
   minSetDegree=6,
   minGeneDegree=1);
jam_igraph(cnet3,
   node_factor=0.7,
   use_shadowText=TRUE);
mem_legend(mem_canonical);

Custom igraph plot function jam_igraph()

The jam_igraph() function is a custom version of igraph::plot() intended to enhance the base function in a few ways:

edge_bundling="connections" improves the rendering of edges by bundling edges from node clusters, so they are drawn with a bezier curve
use_shadowText=TRUE will draw labels with a contrasting border to improve legibility of text labels
rescale=FALSE keeps the network layout aspect ratio instead of scaling the coordinates to fit the size of the plot window. It also properly scales the node and edge sizes.
new arguments to help resize existing values:
- label_factor re-scales the label.cex values by a multiplier
- node_factor, edge_factor re-scales the node.size and edge.width by a multiplier
- label_dist_factor re-scales the label.dist values by a multiplier

The function below replicates the previous Cnet plot, with these changes:

node_factor=2 makes the nodes 2x larger
label_factor=1.2 makes the label.cex values 1.2x larger
label_dist_factor=3 makes the label distance from the node center 3x larger, which emphasizes the label placement at an angle away from the edges of each node.

jam_igraph(cnet3,
   edge_factor=2,
   node_factor=1.2,
   label_factor=1.2, 
   use_shadowText=TRUE,
   label_dist_factor=3);

One more fancy alternative, you can colorize the edges based upon the node colors, a great idea inspired by the Gephi netowrk visualization tool.

jam_igraph(color_edges_by_nodes(cnet3, alpha=0.7),
   edge_bundling="connections",
   edge_factor=2,
   node_factor=1,
   label_factor=1.2, 
   use_shadowText=TRUE,
   label_dist_factor=3)