library(multienrichjam);
library(jamba);
library(colorjam);
suppressPackageStartupMessages(library(ComplexHeatmap));
options("stringsAsFactors"=FALSE, "warn"=-1);
knitr::opts_chunk$set(
fig.height=10,
fig.width=10,
fig.align="center"
)
ragg_png = function(..., res = 192) {
ragg::agg_png(..., res = res, units = "in")
}
knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")
This document describes steps recommended for using Ingenuity IPA enrichment data.
Ingenuity IPA enrichment data can be exported using a function
"Export All"
which by default creates one text file,
concatenating each enrichment table into one large file.
This workflow demonstrates the import process using two IPA enrichment files used by Reese et al 2019 https://doi.org/10.1016/j.jaci.2018.11.043 to compare enrichment results in newborns to older children.
It therefore requires IPA enrichment results have already been exported in text format from IPA.
To import an IPA text file, use
importIPAenrichment()
:
newborn_txt <- system.file("extdata",
"Newborns-IPA.txt",
package="multienrichjam");
newborn_dfl <- importIPAenrichment(newborn_txt);
#> ## (11:47:56) 19Aug2024: IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE
The result is a list of data.frame
objects, where each
data.frame
represents one enrichment test. A convenient way
to see the dimensions of each data.frame
is with the
function jamba::sdim()
:
sdim(newborn_dfl);
#> rows cols class
#> Canonical Pathways 113 7 data.frame
#> Upstream Regulators 117 6 data.frame
#> Diseases and Bio Functions 444 7 data.frame
#> Tox Functions 15 7 data.frame
#> Networks 8 7 data.frame
#> Tox Lists 19 6 data.frame
For MultiEnrichMap, we typically want to analyze multiple IPA
enrichment files, so we can wrap the call in an lapply()
function:
newborn_txt <- system.file("extdata",
"Newborns-IPA.txt",
package="multienrichjam");
olderchildren_txt <- system.file("extdata",
"OlderChildren-IPA.txt",
package="multienrichjam");
ipa_files <- c(Newborns=newborn_txt,
OlderChildren=olderchildren_txt)
ipa_l <- lapply(ipa_files, importIPAenrichment);
#> ## (11:47:56) 19Aug2024: IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE
#> ## (11:47:56) 19Aug2024: IPA xref data was not found, expecting section: 'Analysis Ready Molecules', handling as: revert_ipa_xref=FALSE
Now we can check the dimensions within each list using
jamba::ssdim()
:
ssdim(ipa_l);
#> $Newborns
#> rows cols class
#> Canonical Pathways 113 7 data.frame
#> Upstream Regulators 117 6 data.frame
#> Diseases and Bio Functions 444 7 data.frame
#> Tox Functions 15 7 data.frame
#> Networks 8 7 data.frame
#> Tox Lists 19 6 data.frame
#>
#> $OlderChildren
#> rows cols class
#> Canonical Pathways 237 7 data.frame
#> Upstream Regulators 338 7 data.frame
#> Diseases and Bio Functions 500 7 data.frame
#> Tox Functions 118 7 data.frame
#> Networks 10 7 data.frame
#> Tox Lists 36 6 data.frame
In most cases, each IPA file should contain the same enrichment
tests, for example "Canonical Pathways"
,
"Upstream Regulators"
,
"Diseases and Bio Functions"
, etc. However, it is not
always the case, so it is recommended to check and verify each IPA file
contains at least the enrichment tests needed for downstream
analysis.
IPA performs multiple enrichment tests, which are done independently and with unique assumptions and caveats. Therefore, I recommend using one enrichment test at a time in MultiEnrichMap.
Extract one data.frame
from each result:
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
## Take only the Ingenuity Canonical Pathways
enrichList_canonical <- lapply(ipa_l, function(i){
i[["Canonical Pathways"]];
});
sdim(enrichList_canonical);
#> rows cols class
#> Newborns 113 7 data.frame
#> OlderChildren 237 7 data.frame
## Convert data.frame to enrichResult
## multienrichjam::enrichDF2enrichResult
er_canonical <- lapply(enrichList_canonical, function(i){
enrichDF2enrichResult(i,
keyColname="Name",
pvalueColname="P-value",
geneColname="geneNames",
geneRatioColname="Ratio",
pvalueCutoff=1)
});
sdim(er_canonical);
#> rows cols class
#> Newborns 113 11 enrichResult
#> OlderChildren 237 11 enrichResult
kable_coloring(
head(as.data.frame(er_canonical[[1]])),
caption="Top 10 rows of enrichment data",
row.names=FALSE) %>%
kableExtra::column_spec(column=seq_len(ncol(er_canonical[[1]])),
border_left="1px solid #DDDDDD",
extra_css="white-space: nowrap;")
ID | Ingenuity Canonical Pathways | -log(p-value) | zScore | GeneRatio | geneID | pvalue | Description | p.adjust | Count | setSize |
---|---|---|---|---|---|---|---|---|---|---|
Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis | Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis | 0.405 | NaN | 0.00321 | TNFSF13B | 0.3935501 | Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis | 0.3935501 | 1 | 312 |
Neuroinflammation Signaling Pathway | Neuroinflammation Signaling Pathway | 0.406 | NaN | 0.00322 | CASP8 | 0.3926449 | Neuroinflammation Signaling Pathway | 0.3926449 | 1 | 311 |
Sirtuin Signaling Pathway | Sirtuin Signaling Pathway | 0.428 | NaN | 0.00344 | HIST1H1D | 0.3732502 | Sirtuin Signaling Pathway | 0.3732502 | 1 | 291 |
G-Protein Coupled Receptor Signaling | G-Protein Coupled Receptor Signaling | 0.447 | NaN | 0.00362 | PRKAR2B | 0.3572728 | G-Protein Coupled Receptor Signaling | 0.3572728 | 1 | 276 |
Protein Ubiquitination Pathway | Protein Ubiquitination Pathway | 0.461 | NaN | 0.00377 | TAP2 | 0.3459394 | Protein Ubiquitination Pathway | 0.3459394 | 1 | 265 |
Signaling by Rho Family GTPases | Signaling by Rho Family GTPases | 0.478 | NaN | 0.00397 | RDX | 0.3326596 | Signaling by Rho Family GTPases | 0.3326596 | 1 | 252 |
Now given a list of data.frame
results, we can run
multiEnrichMap()
:
mem_canonical <- multiEnrichMap(er_canonical,
enrichBaseline=1,
cutoffRowMinP=0.05,
colorV=c("purple", "orange"),
topEnrichN=20)
The output mem_canonical
is a list containing various
results.
kable_coloring(
sdim(mem_canonical),
caption="sdim(mem_canonical)") %>%
kableExtra::column_spec(column=seq_len(4),
border_left="1px solid #DDDDDD",
extra_css="white-space: nowrap;")
rows | cols | class | class_v2 | |
---|---|---|---|---|
enrichLabels | 2 | character | NA | |
colorV | 2 | character | NA | |
geneHitList | 2 | list | NA | |
geneHitIM | 68 | 2 | matrix | array |
geneIM | 34 | 2 | matrix | array |
geneIMcolors | 34 | 2 | matrix | array |
enrichIMgeneCount | 30 | 2 | matrix | array |
enrichIMdirection | 30 | 2 | matrix | array |
enrichList | 2 | list | NA | |
enrichIM | 30 | 2 | matrix | array |
enrichIMcolors | 30 | 2 | matrix | array |
multiEnrichDF | 30 | 12 | data.frame | NA |
multiEnrichResult | 30 | 14 | enrichResult | NA |
memIM | 34 | 30 | matrix | array |
multiEnrichMap | 30 | 220 | igraph | NA |
multiEnrichMap2 | 30 | 220 | igraph | NA |
multiCnetPlot | 64 | 110 | igraph | NA |
multiCnetPlot1 | 64 | 110 | igraph | NA |
multiCnetPlot1b | 64 | 110 | igraph | NA |
multiCnetPlot2 | 64 | 110 | igraph | NA |
colnames | 5 | list | NA | |
p_cutoff | 1 | numeric | NA |
The other analysis steps describe in this vignette are valid options,
however in recent work with multienrichjam
we have
developed a new function to create a small folio of relates plots that
we call a “mem plot folio”, and the function is
mem_plot_folio()
.
The mem_plot_folio()
function creates the following
plots by default:
Enrichment P-value heatmap (optionally colored by group)
Gene-pathway incidence matrix, clustered by column and by row
Cnet plot including all collapsed pathway clusters
LETTERS
, “A”, “B”,
“C”, “D”, etc.n
pathway
namesn
pathway
names, but hides gene labels. This plot is best when there are too many
genes to label.Cnet plot using 1,2,3 exemplars per pathway cluster (configurable)
Cnet plot of each pathway cluster individually
mem_canonical_plots <- multienrichjam::mem_plot_folio(mem_canonical,
pathway_column_split=4,
node_factor=1,
use_shadowText=TRUE,
label_factor=1.2,
do_which=c(1:5),
verbose=TRUE,
main="Canonical Pathways");
#> ## (11:48:00) 19Aug2024: mem_plot_folio(): Gene-pathway heatmap (pre-emptive)
#> ## (11:48:00) 19Aug2024: mem_plot_folio(): plot_num 1: Enrichment P-value Heatmap
#> ## (11:48:00) 19Aug2024: mem_plot_folio(): Gene-pathway heatmap
#> ## (11:48:00) 19Aug2024: mem_plot_folio(): plot_num 2: Gene-Pathway Heatmap
#> ## (11:48:02) 19Aug2024: mem_plot_folio(): Defined 4 pathway clusters.
#> ## (11:48:02) 19Aug2024: mem_plot_folio(): Preparing Cnet collapsed
#> ## (11:48:02) 19Aug2024: mem_plot_folio(): subsetCnetIgraph()
#> ## (11:48:02) 19Aug2024: mem_plot_folio(): plot_num 3: Cnet collapsed with gene and cluster labels
#> ## (11:48:02) 19Aug2024: mem_plot_folio(): plot_num 4: Cnet collapsed with gene and set labels
#> ## (11:48:03) 19Aug2024: mem_plot_folio(): plot_num 5: Cnet collapsed with set labels, without gene labels
The object returned mem_canonical_plots
will contain a
list of the graphical objects for each plot requested. For example, each
Cnet plot returns an igraph
object which can be
customized.
In practice, we rarely found much benefit from the multi-enrichment map, which shows pathways connected to pathways based upon Jaccard overlap. This network view was often disorganized, not clearly clustered, and lacked ability to see which genes drive the overlaps between pathways. Perhaps it works best for the highly structured Gene Ontology (GO), but in our hands GO was just not insightful for the broad range of experiments we were analyzing.
Instead, we almost always gravitated toward the Cnet plot, a clever idea by Dr. Guangchang Yu that connects pathways to genes, where genes also connect naturally to other pathways. This plot for many of our collaborators has seemed more intuitive, and answers the next question people often have: “What are the shared genes?” Our subtle addition to this plot is to color-code genes by the enrichment experiment, so you can see which genes are shared or unique by experiment.
Since most enrichment results have far more pathways than can be
shown in one Cnet plot, we tried to prioritize which pathways to
display. The logic in Dr. Yu’s enrichplot
is to take the
top n
pathways by P-value, however we noticed for our data
this logic often shows redundant pathways and misses orthologous
biological functions, which might be enriched to a lesser degree. So we
started clustering pathways by P-value, then by using the pathway-gene
incidence matrix.
The pathway-gene incidence matrix worked best in our hands for this clustering, partly because it clusters pathways by the genes involved, and not by the pattern of enrichment across the enrichment tests (which may not be related to the underlying genes.) So we cluster pathways, then choose an arbitrary number of clusters, and took one “exemplar” from each cluster to represent each cluster. This practice is mostly effective, especially when the clusters are well-defined, and when members of each cluster have relatively similar function. We can increase the exemplars per cluster as needed. This strategy is helpful because each pathway is very clearly labeled.
Most recently, we experimented with collapsing each pathway_cluster,
so each pathway_cluster includes all genes implicated by pathways
contained in the pathway_cluster. The benefit here is that no genes are
lost by selecting a subset of “exemplar” pathways – all genes are
retained per cluster. Then instead of plotting pathway-gene network, we
plot a pathway_cluster-gene network. Each pathway_cluster is labeled
using the top n
pathway names – which is configurable, and
is only a visible label. Each pathway_cluster contains all the pathways
in the cluster.
The plots in mem_plot_folio()
are also available as
individual functions.
In general, it is most convenient to call
mem_plot_folio()
and use the output list
to
create custom plots, however the functions are also available to be
called independently.
We can view the “Multi Enrichment Map” itself with
mem_multienrichplot()
.
This network connects pathways when they meet a Jaccard overlap
coefficient threshold based upon the shared genes between the pathways.
The default overlap is stored by multiEnrichMap()
in the
output object mem_canonical
.
g <- mem_multienrichplot(mem_canonical,
do_plot=FALSE,
overlap=0.3,
node_factor=2,
repulse=3.5)
jam_igraph(g,
node_factor=2,
use_shadowText=TRUE)
The overlap threshold can be customized manually, or through the
helper function mem_find_overlap()
. This function attempts
to guess a reasonable overlap threshold based upon the network
connectivity, by optimizing the cluster size and total nodes
involved.
g <- mem_multienrichplot(mem_canonical,
overlap=mem_find_overlap(mem_canonical),
do_plot=FALSE,
node_factor=3,
repulse=3.5)
jam_igraph(g,
node_factor=3,
use_shadowText=TRUE)
Notice there are distinct subnetworks, called “components”, which are
not connected to each other. A useful follow-up technique is to graph
one component at a time, which we do using the helper function
subset_igraph_components()
. Components are ordered by size,
largest to smallest, so you can keep the largest using argument
keep=1
, or the second largest with keep=2
, and
so on.
We also call two other helper functions:
removeIgraphBlanks()
removes blank colors from
multi-color nodes, such as pie nodes, or colored rectangle nodes. It
helps show only the remaining colors without the whitespace.relayout_with_qfr()
applies Fruchterman-Reingold layout
to an igraph
object. It’s convenient to adjust
repulse
here, which is the recommended method for adjusting
the spacing between nodes. This function also updates other useful
attributes such as the node label angle, which makes labels appear
offset away from the majority of edges. (It helps more with Cnet plots
below.)
## You can alternatively pull out any other component
g_sub <- subset_igraph_components(g, keep=1);
## Re-apply network layout, and remove blank colors
g_sub <- relayout_with_qfr(repulse=3.5,
removeIgraphBlanks(g_sub))
## Plot
jam_igraph(g_sub,
node_factor=3,
use_shadowText=TRUE)
A useful overview of the enrichment results, including the overlaps
across datasets, is shown in a heatmap of -log10(Pvalue)
.
The helper function mem_enrichment_heatmap()
is used
here.
The argument p_cutoff
is used to set the Pvalue below
which cells are colorized – every P-value above this threshold is not
colored, and displayed as white, even when the P-value is less than
1.
mem_enrichment_heatmap(mem_canonical,
p_cutoff=0.05);
The same data can be plotted as a heatmap.
mem_enrichment_heatmap(mem_canonical,
style="heatmap",
p_cutoff=0.05);
The argument color_by_column=TRUE
will optionally apply
a color gradient to each column, using the same P-value threshold
defined by p_cutoff
. In this case, cells are labeled to
indicate the actual enrichment P-value.
mem_enrichment_heatmap(mem_canonical,
style="heatmap",
color_by_column=TRUE);
We can view the pathway-gene matrix using the function
mem_gene_path_heatmap()
. This function uses data in the
mem
object: "memIM"
which contains the
gene-pathway incidence matrix, "geneIM"
which contains the
gene-dataset incidence matrix, and "enrichIM"
which
contains the pathway-dataset matrix.
This function attempts to “guess” a reasonable number of column and
row splits – which breaks the column and row dendrograms into separate
sub-clusters. These breaks are visual only, but can be useful in
follow-up analyses shown later. You can override these numbers with
arguments column_split
and row_split
.
Also, the colors used across the top of the heatmap indicate the enrichment P-value, and are intentionally scaled so all P-values above the P-value threshold are not colored, and are therefore white.
This function uses the amazing ComplexHeatmap::Heatmap()
function, which enables many awesome customizable features. The
...
argument is passed to
ComplexHeatmap::Heatmap()
, so you can use those features as
needed.
hm <- mem_gene_path_heatmap(mem_canonical,
column_cex=0.5,
row_cex=0.6);
hm;
As a follow-up analysis, you can pull out each pathway cluster from
the heatmap itself, using heatmap_column_order()
:
hm_sets <- heatmap_column_order(hm);
hm_sets;
#> $A
#> eNOS Signaling
#> "eNOS Signaling"
#> RAR Activation
#> "RAR Activation"
#> Dopamine-DARPP32 Feedback in cAMP Signaling
#> "Dopamine-DARPP32 Feedback in cAMP Signaling"
#> Hepatic Cholestasis
#> "Hepatic Cholestasis"
#>
#> $B
#> Synaptic Long Term Depression Glioma Signaling
#> "Synaptic Long Term Depression" "Glioma Signaling"
#> Growth Hormone Signaling
#> "Growth Hormone Signaling"
#>
#> $C
#> mTOR Signaling HIPPO signaling Fc Epsilon RI Signaling
#> "mTOR Signaling" "HIPPO signaling" "Fc Epsilon RI Signaling"
#> p70S6K Signaling
#> "p70S6K Signaling"
We can view the Cnet plot (concept network plot) which shows the pathway-gene relationship.
The helper function memIM2cnet()
creates a Cnet plot
from the mem_canonical
output. Here, we also pipe the
result through other helper functions:
fixSetLabels()
applies pathway label word wraprelayout_with_qfr()
applies network layout and adjusts
node labelsremoveIgraphBlanks()
removes blank colors from the
igraph
nodes
#cnet <- mem_canonical$multiCnetPlot1b;
cnet <- mem_canonical %>%
memIM2cnet() %>%
fixSetLabels() %>%
removeIgraphBlanks() %>%
relayout_with_qfr(repulse=4);
#plot(cnet, vertex.label.cex=0.6);
par("mar"=c(5,4,4,2)+0.1)
jam_igraph(cnet,
use_shadowText=TRUE,
node_factor=0.5,
vertex.label.cex=0.6);
mem_legend(mem_canonical);
You can extract the largest connected subnetwork to plot, as before.
g2 <- cnet;
g2_sub <- subset_igraph_components(cnet, keep=1)
#plot(g2_sub);
jam_igraph(g2_sub,
use_shadowText=TRUE,
label_factor=0.5,
node_factor=0.5);
The most useful customization is to subset the pathway nodes.
One useful method to choose a subset of pathways is to view one
heatmap cluster at a time. Here we will use hm_sets
made
previously.
cnet_sub <- subsetCnetIgraph(cnet,
repulse=3.5,
includeSets=unlist(hm_sets[c("A")]));
jam_igraph(cnet_sub,
node_factor=1,
use_shadowText=TRUE,
label_dist_factor=3,
label_factor=1.3);
mem_legend(mem_canonical);
You can also subset the Pathway-Gene Cnet plot using specific names,
or by some network criteria, using the function
subsetCnetIgraph()
.
Pathways are subsetted to require at least 6 genes, using the
argument minSetDegree
, which can be useful to help simplify
the graph structure.
Similarly, the minGeneDegree=2
argument can be used to
display only genes present in 2 or more pathways, but for now we will
not use that argument.
During the subset operation, some convenient default operations are also performed, controlled by function arguments:
remove_singlets=TRUE
force_relayout=TRUE
do_reorder=TRUE
spread_labels=TRUE
remove_blanks=FALSE
, but can be during this step by
remove_blanks=TRUE
.
cnet3 <- multienrichjam::subsetCnetIgraph(cnet,
repulse=5,
minSetDegree=6,
minGeneDegree=1);
jam_igraph(cnet3,
node_factor=0.7,
use_shadowText=TRUE);
mem_legend(mem_canonical);
The jam_igraph()
function is a custom version of
igraph::plot()
intended to enhance the base function in a
few ways:
edge_bundling="connections"
improves the rendering
of edges by bundling edges from node clusters, so they are drawn with a
bezier curve
use_shadowText=TRUE
will draw labels with a
contrasting border to improve legibility of text labels
rescale=FALSE
keeps the network layout aspect ratio
instead of scaling the coordinates to fit the size of the plot window.
It also properly scales the node and edge sizes.
new arguments to help resize existing values:
label_factor
re-scales the label.cex
values by a multipliernode_factor
, edge_factor
re-scales the
node.size
and edge.width
by a multiplierlabel_dist_factor
re-scales the label.dist
values by a multiplierThe function below replicates the previous Cnet plot, with these changes:
node_factor=2
makes the nodes 2x largerlabel_factor=1.2
makes the label.cex
values 1.2x largerlabel_dist_factor=3
makes the label distance from the
node center 3x larger, which emphasizes the label placement at an angle
away from the edges of each node.
jam_igraph(cnet3,
edge_factor=2,
node_factor=1.2,
label_factor=1.2,
use_shadowText=TRUE,
label_dist_factor=3);
One more fancy alternative, you can colorize the edges based upon the node colors, a great idea inspired by the Gephi netowrk visualization tool.
jam_igraph(color_edges_by_nodes(cnet3, alpha=0.7),
edge_bundling="connections",
edge_factor=2,
node_factor=1,
label_factor=1.2,
use_shadowText=TRUE,
label_dist_factor=3)