Find recommended overlap threshold for EnrichMap, experimental

Usage

mem_find_overlap(
  mem,
  overlap_range = c(0.1, 0.99),
  max_cutoff = 0.4,
  adjust = -0.01,
  debug = FALSE,
  ...
)

Arguments

mem

list output from multiEnrichMap()

overlap_range

numeric range of Jaccard overlap values to test, default c(0.1, 0.99), evaluated in steps of 0.01.

max_cutoff

numeric value between 0 and 1 that defines the maximum allowed fraction of nodes in the largest connected component, relative to the total number of non-singlet nodes.

adjust

numeric value used to adjust the final overlap; the default -0.01 uses the overlap one step before the max O score.

debug

logical indicating whether to return full debug data, which is used internally to determine the best overlap cutoff to use.

...

additional arguments are passed to mem2emap().

Value

numeric value with recommended Jaccard overlap coefficient.

Details

This function implements a straightforward approach to determine a reasonable Jaccard overlap threshold for Enrichment Map data. It is experimental and remains open to improvement as it is applied to more varied datasets.

The premise is that two pathways that have Jaccard overlap above a threshold are connected by a network "edge".

  • With an extremely low threshold, most pathways would be connected, even if they have only one gene in common.

  • With an extremely high threshold, pathways would only be connected if nearly all genes were in common.

  • A moderate threshold is intended to balance the two extremes.

  • The aesthetically and biologically interesting threshold appears to depend upon the type and number of pathways returned from enrichment analysis. For example, immunology pathways may favor a different threshold than metabolic pathways. (Purely hypothetical.)

  • As a result, this function is intended to find a middle ground based upon the pathway data used for analysis at the time, where some but not all pathways are connected.
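The effect of the threshold on connectivity can be illustrated with a minimal sketch in base R. The gene sets, the helper names (jaccard, jaccard_matrix, count_edges) and the inclusive comparison (>=) are assumptions made for illustration; they are not taken from the package internals.

```r
# Hypothetical toy gene sets; names and contents are illustrative only.
pathways <- list(
  A = c("g1", "g2", "g3", "g4"),
  B = c("g1", "g2", "g3", "g5"),
  C = c("g4", "g6", "g7", "g8"),
  D = c("g9", "g10")
)

# Jaccard overlap: size of the intersection divided by size of the union.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Pairwise Jaccard matrix for a named list of gene sets.
jaccard_matrix <- function(sets) {
  n <- length(sets)
  m <- matrix(0, n, n, dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      m[i, j] <- jaccard(sets[[i]], sets[[j]])
    }
  }
  m
}

jm <- jaccard_matrix(pathways)

# Count edges: pathway pairs whose overlap meets the threshold.
# Assumption: the comparison is inclusive (>=).
count_edges <- function(jm, threshold) {
  sum(jm[upper.tri(jm)] >= threshold)
}

test_thresholds <- c(low = 0.1, moderate = 0.5, high = 0.9)
edge_counts <- vapply(
  test_thresholds,
  function(th) count_edges(jm, th),
  numeric(1)
)
print(edge_counts)
```

In this toy example, A and C share a single gene and are connected only at the low threshold, while A and B share three of five genes and remain connected at the moderate threshold. No pair is connected at the high threshold.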

The method finds the overlap threshold at which the largest connected component is no more than max_cutoff fraction of the whole network. This fraction is defined as the number of nodes in the largest connected component, divided by the total number of non-singlet nodes.

We found that max_cutoff=0.4, the point at which the largest connected component contains no more than 40% of all non-singlet nodes, seems to be a reasonably good threshold.
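The scan described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package implementation: the function names (largest_component_fraction, find_overlap_sketch), the hand-built Jaccard matrix, the inclusive comparison (>=), and the choice to return the lowest threshold that satisfies max_cutoff are all hypothetical, and the adjust step is omitted.

```r
# Fraction of non-singlet nodes that fall in the largest connected component
# when edges are drawn between pathways with Jaccard overlap >= threshold.
largest_component_fraction <- function(jm, threshold) {
  adj <- jm >= threshold
  diag(adj) <- FALSE
  n <- nrow(adj)
  component <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(component[start])) next
    # Breadth-first search labels every node reachable from 'start'.
    current <- current + 1L
    component[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      neighbors <- which(adj[node, ] & is.na(component))
      component[neighbors] <- current
      queue <- c(queue, neighbors)
    }
  }
  sizes <- tabulate(component)
  non_singlet <- sum(sizes[sizes > 1])
  if (non_singlet == 0) return(0)  # no edges at this threshold
  max(sizes) / non_singlet
}

# Assumption: return the lowest tested threshold whose largest component
# holds no more than max_cutoff of the non-singlet nodes.
find_overlap_sketch <- function(jm, overlap_range = c(0.1, 0.99),
                                max_cutoff = 0.4, step = 0.01) {
  thresholds <- round(seq(overlap_range[1], overlap_range[2], by = step), 2)
  fractions <- vapply(
    thresholds,
    function(th) largest_component_fraction(jm, th),
    numeric(1)
  )
  ok <- which(fractions <= max_cutoff)
  if (length(ok) == 0) return(NA_real_)
  thresholds[ok[1]]
}

# Hypothetical Jaccard matrix: three tight pathway pairs (0.605)
# joined into one chain by weaker links (0.305).
node_names <- paste0("P", 1:6)
jm <- matrix(0, 6, 6, dimnames = list(node_names, node_names))
set_pair <- function(m, i, j, value) {
  m[i, j] <- value
  m[j, i] <- value
  m
}
for (p in list(c(1, 2), c(3, 4), c(5, 6))) jm <- set_pair(jm, p[1], p[2], 0.605)
for (p in list(c(2, 3), c(4, 5))) jm <- set_pair(jm, p[1], p[2], 0.305)

recommended <- find_overlap_sketch(jm)
print(recommended)
```

With this toy matrix, all six nodes form one component up to a threshold of 0.30, so the fraction is 1. Once the weaker links drop out, three pairs remain and the largest component holds one third of the non-singlet nodes, which satisfies max_cutoff=0.4.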