MultiEnrichment Heatmap of Genes and Pathways
Usage
mem_gene_path_heatmap(
mem,
genes = NULL,
sets = NULL,
min_gene_ct = 1,
min_set_ct = 1,
min_set_ct_each = 4,
column_fontsize = NULL,
column_cex = 1,
row_fontsize = NULL,
row_cex = 1,
row_method = "binary",
column_method = "binary",
enrich_im_weight = 0.3,
gene_im_weight = 0.5,
gene_annotations = c("im", "direction", "default"),
annotation_suffix = c(im = "hit", direction = "dir"),
simple_anno_size = grid::unit(6, "mm"),
cluster_columns = NULL,
cluster_rows = NULL,
cluster_row_slices = FALSE,
cluster_column_slices = FALSE,
name = NULL,
p_cutoff = NULL,
p_floor = 1e-10,
row_split = NULL,
column_split = NULL,
auto_split = TRUE,
column_title = LETTERS,
row_title = letters,
row_title_rot = 0,
colorize_by_gene = TRUE,
na_col = "white",
rotate_heatmap = FALSE,
colramp = "Reds",
column_names_max_height = grid::unit(8, "cm"),
column_names_rot = 90,
show_gene_legend = FALSE,
show_pathway_legend = TRUE,
show_heatmap_legend = 8,
use_raster = FALSE,
seed = 123,
verbose = FALSE,
...
)
Arguments
- mem
Mem or list object created by multiEnrichMap().
- genes
character vector of genes to include in the heatmap, all other genes will be excluded.
- sets
character vector of sets (pathways) to include in the heatmap, all other sets will be excluded.
- min_gene_ct
integer minimum number of occurrences of each gene across the pathways, all other genes are excluded.
- min_set_ct
integer minimum number of genes required for each set, all other sets are excluded.
- min_set_ct_each
integer minimum number of genes required for each set, required for at least one enrichment test.
- column_fontsize, row_fontsize
numeric passed as fontsize to ComplexHeatmap::Heatmap() to define a specific fontsize for column and row labels. When NULL the nrow/ncol of the heatmap are used to infer a reasonable starting point fontsize, which can be adjusted with column_cex and row_cex.
- row_method, column_method
character string with distance method, default is 'binary' which considers all non-zero values to be 1. It is used for row and column hierarchical clustering by amap::hcluster(), which is used here because it offers more methods than hclust().
- enrich_im_weight
numeric value between 0 and 1 (default 0.3), the relative weight of the enrichment -log10 P-value matrix and the overall gene-pathway incidence matrix when clustering pathways.
When enrich_im_weight=0 then only the gene-pathway incidence matrix is used for pathway clustering.
When enrich_im_weight=1 then only the pathway significance (-log10 P-value) is used for pathway clustering.
The default enrich_im_weight=0.3 balances the combination of the enrichment P-value matrix with the gene-pathway incidence matrix.
- gene_im_weight
numeric value between 0 and 1 (default 0.5), the relative weight of the mem$geneIM gene incidence matrix and the overall gene-pathway incidence matrix when clustering genes.
When gene_im_weight=0 then only the gene-pathway incidence matrix is used for gene clustering.
When gene_im_weight=1 then only the gene incidence matrix (mem$geneIM) is used for gene clustering.
The default gene_im_weight=0.5 balances the gene incidence matrix with the gene-pathway incidence matrix, giving each matrix equal weight (since values are typically all 0 or 1).
- gene_annotations
character string indicating which annotation(s) to display alongside the gene axis of the heatmap.
Default is "im", "direction", and "default", which will hide the "direction" annotation if all non-zero values are positive.
When "im" is present, the colored incidence matrix is displayed.
When "direction" is present, the directional matrix is displayed using colors defined by colorjam::col_div_xf(1.2).
When both "im" and "direction" are present, they are displayed in the order defined.
When no values are given, the gene annotation is not displayed.
- annotation_suffix
character vector named by values permitted by gene_annotations, with optional suffix to add to the annotation labels. For example, as by default, it may be helpful to add "hit" or "dir" to distinguish the enrichment labels.
- name
character value passed to ComplexHeatmap::Heatmap(), used as a label above the heatmap color legend. Default NULL uses "Gene Hits by Enrichment".
- p_cutoff
numeric value of the enrichment P-value cutoff, above which P-values are not colored, and are therefore white. The enrichment P-values are displayed as an annotated heatmap at the top of the main heatmap. Any cell that has a color meets at least the minimum P-value threshold. The default is taken from input mem for consistency with the input multienrichment analysis.
- column_split, row_split
row and column split, default NULL, detects an appropriate number of clusters. Passed to ComplexHeatmap::Heatmap().
To turn off split use 1.
To specify a fixed number of clusters, use an integer value. The value may be changed if the underlying data does not support that number of clusters.
To specify fixed clusters by name, use an atomic vector named by the rownames genes(Mem) or colnames sets(Mem) with values which will become the cluster names. When supplying a factor the factor level order will be maintained.
Alternatively, supply a data.frame whose rownames match the genes(Mem) rows or sets(Mem) columns, respectively. In this case, column values are combined using jamba::pasteByRowOrdered() which also maintains factor level order.
When supplying either an atomic vector or data.frame the actual names will be used to subset the resulting Mem data if there are fewer names provided than exist in Mem. Note that the Mem data are not subset again by other filter criteria, for example min_set_ct, min_gene_ct, etc.
The column_title argument is used when
- auto_split
logical whether to determine clusters when column_split or row_split is NULL, default TRUE.
- column_title
optional character string or vector to display above the column splits. Default uses LETTERS with length to match the number of clusters. When there is only one column cluster, it is not named unless column_title also has length 1.
- row_title
optional character string or vector to display to the left side of the heatmap row splits. The default uses letters to match the number of row clusters. When there is only one row cluster, it is not named unless row_title also has length 1.
- row_title_rot
numeric value, default 0, the rotation of row_title text, where 0 is not rotated, and 90 is rotated 90 degrees.
- colorize_by_gene
logical default TRUE, whether to color the main heatmap body using the colors from geneIM to indicate each enrichment in which a given gene is involved. Colors are blended using colorjam::blend_colors(), using colors from colorV in the geneIMcolors(Mem).
- na_col
character string indicating the color to use for NA or missing values. Typically this argument is only used when colorize_by_gene=TRUE, where entries with no color are recognized as NA by ComplexHeatmap::Heatmap().
- rotate_heatmap
logical indicating whether the entire heatmap should be rotated so that pathway names are displayed as rows, and genes as columns. When enabled, arguments referring to columns and rows are flipped, so "column" arguments will continue to affect pathways/sets, and "row" arguments will continue to affect genes. This includes column_method and row_method as of 0.0.90.900.
Exceptions:
row_title_rot is only applied to rows, due to its purpose.
column_names_rot is only applied to columns, also due to its purpose.
- colramp
character, default "Reds", with name of color, color gradient, or a vector of colors, anything that can be converted to a color gradient by jamba::getColorRamp().
- column_names_max_height
grid::unit passed to ComplexHeatmap::Heatmap(). When supplied as numeric it is converted to units in "mm". Default 8 cm (80 mm).
- column_names_rot
numeric passed to ComplexHeatmap::Heatmap().
- show_gene_legend, show_pathway_legend
logical, whether to show the gene IM and pathway IM legends, respectively.
The gene IM legend is FALSE by default, since it only describes the color used for each column, and is somewhat redundant with the pathway IM legend.
The pathway IM legend is TRUE by default; it displays the color scale including the range of enrichment P-values colorized.
- show_heatmap_legend
numeric or logical (default 8), the maximum number of labels to use for the heatmap color legend. When 'colorize_by_gene' is TRUE, the heatmap legend would include all possible blended colors using gene IM data, which frankly can become too much.
When logical, TRUE is converted to 8 by default.
When there are more legend items than show_heatmap_legend the color legend will only display singlet colors, which means only one color per individual set defined in colorV.
This legend can be created and extracted from the output Heatmap object to be displayed independently for publishable figures if necessary.
- use_raster
logical passed to ComplexHeatmap::Heatmap(), default FALSE, whether to rasterize the heatmap body. FALSE is recommended when 'colorize_by_gene' is TRUE, due to the way the rasterization is handled at matrix level. For very large heatmaps you may try 'colorize_by_gene=FALSE' and 'use_raster=TRUE', which reduces the figure size with PDF and SVG output and may improve visual fidelity in some cases.
- seed
numeric value passed to set.seed() to define a reproducible random seed during row and column clustering.
- verbose
logical indicating whether to print verbose output.
- ...
additional arguments are passed to ComplexHeatmap::Heatmap() for customization. However, if ... causes an error, the same ComplexHeatmap::Heatmap() function is called without ..., which is intended to allow overloading ... for different functions.
Value
Heatmap object defined in ComplexHeatmap::Heatmap() with
custom attributes that include the method caption:
- "caption": character string with method details.
- "caption_legendlist": ComplexHeatmap::Legends object suitable to be included with Heatmap legends by draw(hm, annotation_legend_list=caption_legendlist), or drawn directly with grid::grid.draw(caption_legendlist).
- "draw_caption": function that will draw the caption in the bottom-right corner of the device by default, to be called with attr(hm, "draw_caption")() or draw_caption().
The Heatmap row and column order can be retrieved:
- jamba::heatmap_row_order() - returns a list of vectors of rownames in the order they appear in the heatmap, with list names defined by row split.
- jamba::heatmap_column_order() - returns a list of vectors of colnames in the order they appear in the heatmap, with list names defined by column split.
Details
This function takes the Mem output from
multiEnrichMap() and creates a gene-by-pathway incidence
matrix heatmap, using ComplexHeatmap::Heatmap().
The major output of this function is to define pathway
clusters which influences other figures produced by
mem_plot_folio(): the enrichment heatmap; Cnet cluster plots;
Cnet exemplars.
It uses three basic sources of data to annotate the heatmap:
- memIM: the gene-set incidence matrix
- geneIM: the gene incidence matrix by dataset
- enrichIM: the pathway enrichment P-value matrix by dataset
Pathway and gene clusters
When column_split (pathways) or row_split (genes) are
not defined, a reasonable number of clusters is chosen to split
columns and rows, respectively. A specific number can be requested
by supplying an integer value. The splits are named using
column_title and row_title, which default to LETTERS and
letters, respectively.
Custom pathway clusters (columns) can be defined by supplying
column_split as a vector named by values in sets(Mem),
with values used to define the name for each cluster.
When provided as a factor it will honor the factor level order.
When a subset of sets(Mem) are provided, it will also subset
the Mem object to show only the matching sets, however it does
not re-apply row and column filtering described above.
It is recommended to use argument 'sets' to subset the pathway
gene sets upfront, then use column_split with names that
match 'sets'.
Alternatively, column_split can be supplied as a data.frame
with rownames that match sets(Mem), in which case column values
are combined using jamba::pasteByRowOrdered() which also keeps
factor level order.
Custom gene clusters (rows) can be defined with row_split using
the same mechanism as with column_split described above,
with names that match genes(Mem) as appropriate.
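The two custom forms described above can be sketched in base R. This is only an illustration of the input shapes: the pathway names and the "source" column are hypothetical, and in practice the names would be taken from sets(Mem).

```r
# Hypothetical pathway names; in practice these would come from sets(Mem)
set_names <- c("Apoptosis", "MAPK signaling", "Cell cycle", "TNF signaling")

# Named factor: names are pathways, values are cluster names.
# The factor levels fix the order in which clusters are shown.
column_split <- factor(
   c("Signaling", "Signaling", "Death"),
   levels = c("Signaling", "Death"))
names(column_split) <- c("MAPK signaling", "TNF signaling", "Apoptosis")

# Only three of the four pathways are named, so only those three
# would be retained in the heatmap.
kept_sets <- intersect(set_names, names(column_split))

# data.frame form: rownames are pathways, with one or more grouping columns
# ("source" is a hypothetical second grouping column)
column_split_df <- data.frame(
   row.names = names(column_split),
   group = unname(column_split),
   source = c("KEGG", "KEGG", "Reactome"))
```

Following the recommendation above, such a vector would typically be paired with sets=names(column_split) so that the subset and the split refer to the same pathways.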
Column pathway clustering
Columns (pathways) are clustered using a combination of the gene-pathway
incidence matrix, and the -log10 P-values shown along the
pathway axis (usually columns). The purpose is to allow the
enrichment P-value to influence the clustering together with
the gene content. The balance is adjusted using
enrich_im_weight, default 0.3, and where 0.0 will ignore the
enrichment P-values during clustering. Higher values will tend
to create clusters that represent shared/unique significant
pathways, and less determined based solely on the gene content.
Note that the enrichment P-value matrix used during clustering
is adjusted by using p_cutoff and p_floor.
Values above the p_cutoff are converted to 1 so they do not
influence clustering. Values below p_floor are converted to
p_floor so they influence clustering only at the level of
other values at p_floor.
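A minimal base R sketch of this adjustment, using a hypothetical P-value matrix with pathways as rows and enrichment tests as columns. The package may apply these steps differently in detail; the sketch only restates the rule described above.

```r
# Hypothetical enrichment P-values: pathways (rows) by enrichment tests (columns)
p <- matrix(
   c(1e-15, 0.20,
     0.001, 0.04,
     0.60,  1e-4),
   nrow = 3, byrow = TRUE,
   dimnames = list(c("setA", "setB", "setC"), c("test1", "test2")))

p_cutoff <- 0.05
p_floor <- 1e-10

p_adj <- p
# Values above p_cutoff become 1, which is 0 after -log10
p_adj[p_adj > p_cutoff] <- 1
# Values below p_floor are raised to p_floor, capping their influence
p_adj[p_adj < p_floor] <- p_floor

# Matrix of -log10 P-values as it would enter the clustering step
score <- -log10(p_adj)
```

With these values, the extreme entry for setA in test1 is capped at 10, and the two entries above p_cutoff contribute 0.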
The default cluster_columns=TRUE will employ amap::hcluster()
as a convenient and efficient one-step distance-hierarchical clustering
approach, using column_method as the distance method. A custom
function can be supplied with cluster_columns as long as the
output is 'hclust' or can be coerced to 'hclust'.
The clustering distance method column_method uses default 'binary',
which treats any non-zero value as 1.
In this case, the relative weight is not important since all non-zero
values are equivalent.
However, column_method='euclidean', the current default in mem_plot_folio(),
may improve the output when adjusting the relative weight of
enrichment P-values and the incidence matrix.
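The effect of the distance method on the weighting can be checked with a small base R sketch. Two assumptions are made here and may not match the internal implementation: the matrices are combined by scaling each by its weight and joining columns, and stats::dist() stands in for the distance step of amap::hcluster(). The data are hypothetical.

```r
# Hypothetical pathway-by-gene incidence matrix for three pathways
im <- matrix(
   c(1, 1, 0, 0,
     1, 0, 1, 0,
     0, 0, 1, 1),
   nrow = 3, byrow = TRUE,
   dimnames = list(c("setA", "setB", "setC"), paste0("gene", 1:4)))

# Hypothetical -log10 P-values per enrichment test, already thresholded
score <- matrix(
   c(10, 0,
     3,  1.4,
     0,  4),
   nrow = 3, byrow = TRUE,
   dimnames = list(rownames(im), c("test1", "test2")))

# Assumed combination: scale each matrix by its weight, then join columns
combine <- function(w) cbind(score * w, im * (1 - w))

d_binary_03 <- dist(combine(0.3), method = "binary")
d_binary_08 <- dist(combine(0.8), method = "binary")
d_euclid_03 <- dist(combine(0.3), method = "euclidean")
d_euclid_08 <- dist(combine(0.8), method = "euclidean")

# Binary distance only distinguishes zero from non-zero,
# so changing the weight leaves it unchanged
binary_same <- isTRUE(all.equal(as.vector(d_binary_03), as.vector(d_binary_08)))

# Euclidean distance depends on magnitudes, so the weight matters
euclid_same <- isTRUE(all.equal(as.vector(d_euclid_03), as.vector(d_euclid_08)))
```

This holds for weights strictly between 0 and 1; at exactly 0 or 1 one matrix is zeroed out entirely, which also changes the binary distance.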
Gene clustering
Rows (genes) are clustered using a combination of the gene-pathway
incidence matrix, and the 'geneIM' or 'geneIMdirection' matrix data
shown, as defined with gene_annotations. The relative weight of
these matrices is controlled with gene_im_weight with default 0.5,
which gives equal weight to each matrix. By default, when directional
gene values are shown the directional matrix is used with clustering.
The default cluster_rows=TRUE will employ amap::hcluster() as
described for Pathway clustering, or cluster_rows can be a custom
function that produces 'hclust' or can be coerced to 'hclust'.
When row_method='binary' the relative weight of gene incidence matrix
and the pathway-gene matrix is not important, since all non-zero
values are treated as 1 during clustering. The mem_plot_folio()
default uses 'euclidean' to improve the effect of the relative
weight of these two matrices.
Gene clusters are not often used for downstream analysis, for example they do not form clusters in the Cnet plots, and are not (yet) used for other analysis in multienrichjam. However, gene clusters are quite useful when interpreting pathway-gene data.
It is helpful in practice to refer to a gene cluster by name: "The genes in cluster 'd' all appear to be cytokines." (Also maybe there should be better names, but that's for another time.)
Gene clusters may form what we call "hot spots", where most of a gene cluster is colorized and associated with one or more pathway clusters. A hot spot indicates a set of genes shared across multiple pathways, a "core" set of genes which may serve as an important functional basis in several pathways.
An example might be mitogen-activated protein kinase (MAPK) genes, which typically involve a multi-step kinase signaling cascade. Pathways which involve one MAPK would very often involve each MAPK at subsequent steps in the cascade. In fact, MAPK genes are often the core signaling mechanism of numerous apparently unrelated pathways - we refer to it as the "internal wiring" of the signaling in a cell, as an analogy to a building which may use wiring to pass along any number of messages. Different cell types may employ the MAPK cascade to send a message, and this message may have different meaning across cell types, and indeed may have meaning based upon the cell state.
Filtering
A subset of pathways can be defined with argument sets,
which may be useful to prepare this plot for a set of
"exemplar pathways" which represent key pathways of interest
for a study.
Genes can be subset similarly with argument genes, however this
option is less commonly used.
Pathways and genes can be subset by number of occurrences of each, for example:
- min_gene_ct: minimum occurrences of a gene across pathways, which also means the number of pathways in which a gene occurs.
- min_set_ct: minimum occurrences of a set (pathway) across genes, which means the number of genes in the set across all enrichments.
- min_set_ct_each: minimum occurrences of a set (pathway) across genes. It requires at least min_set_ct_each genes in at least one enrichment, which is consistent with min_count in multiEnrichMap().
When pathways are filtered by min_gene_ct, min_set_ct,
and min_set_ct_each, the order of operations is as follows:
- min_set_ct_each, min_set_ct - these filters are applied before filtering genes, in order to ensure all genes are present from the start.
- min_gene_ct - genes are filtered after pathway filtering, in order to remove pathways which were not deemed "significant" based upon the required number of genes. Only after those pathways are removed can the number of occurrences of each gene be judged appropriately.
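The order of operations can be illustrated with a small base R sketch on a hypothetical gene-by-pathway incidence matrix. It covers only min_set_ct and min_gene_ct; min_set_ct_each is omitted because it needs per-enrichment gene counts, which this toy matrix does not carry.

```r
# Hypothetical gene-by-pathway incidence matrix (1 = gene is in the pathway)
im <- matrix(
   c(1, 1, 0,
     1, 1, 0,
     1, 0, 0,
     0, 1, 1),
   nrow = 4, byrow = TRUE,
   dimnames = list(paste0("gene", 1:4), c("setA", "setB", "setC")))

min_set_ct <- 2
min_gene_ct <- 2

# Step 1: drop pathways with fewer than min_set_ct genes
im1 <- im[, colSums(im != 0) >= min_set_ct, drop = FALSE]

# Step 2: count gene occurrences only among the remaining pathways
im2 <- im1[rowSums(im1 != 0) >= min_gene_ct, , drop = FALSE]

# For contrast: genes that would pass if they were counted before
# pathway filtering
genes_first <- rownames(im)[rowSums(im != 0) >= min_gene_ct]
```

Here gene4 occurs in setB and setC, so it would pass a gene filter applied first. Once setC is removed for having a single gene, gene4 occurs in only one remaining pathway and is dropped, which is the behavior the ordering above is meant to produce.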
See also
Other custom plot functions:
mem_enrichment_heatmap(),
mem_legend()