Venndir with Gene Expression Data • venndir

library(venndir)

Venndir was driven by gene expression analysis, which contains up/down directionality. Here are some examples showing how gene expression data can be used in venndir.

The examples include limma and DESeq2 as two common example workflows.

Generic example

The venndir::make_venn_test() function is useful to generate test data, just to create a Venn diagram.

setlist <- make_venn_test(2500, 3,
  sizes=c(400, 500, 200),
  do_signed=TRUE)
venndir(setlist, overlap_type="agreement")

Venndir default example showing signed counts, usingoverlap type 'agreement'.

Venndir with limma results

Illumina Case Study

The limma package provides an extensive set of gene expression data analysis tools. The steps below reproduce the Limma Users Guide example "Illumina Use Case" in Section 17.3. This example requires data available at https://bioinf.wehi.edu.au/marray/IlluminaCaseStudy copied to the current working directory.

For the sake of this example, the steps assume the data is located in a folder [HOME]/IlluminaCaseStudy, where [HOME] is the user home directory.

illuminadir <- "~/IlluminaCaseStudy";
do_limma <- FALSE;
do_deseq2 <- FALSE;
if (suppressPackageStartupMessages(require(limma)) && dir.exists(illuminadir)) {
   do_limma <- TRUE;
}
if (do_limma && suppressPackageStartupMessages(require(DESeq2))) {
   do_deseq2 <- TRUE;
}
if (do_limma) {
   targets <- readTargets(path=illuminadir)
   # import Illumina data
   x <- read.ilmn(files="probe profile.txt.gz",
      ctrlfiles="control probe profile.txt.gz",
      other.columns="Detection",
      path=illuminadir)
   # define expressed probes
   y <- neqc(x)
   expressed <- rowSums(y$other$Detection < 0.05) >= 3;
   # subset for expressed probes
   y <- y[expressed,]
   # calculate within-patient correlations
   ct <- factor(targets$CellType)
   design <- model.matrix(~0 + ct);
   colnames(design) <- levels(ct);
   dupcor <- duplicateCorrelation(y,
      design,
      block=targets$Donor);
   dupcor$consensus.correlation;
   
   # fit linear model, paired by patient, with correlations
   fit <- lmFit(y,
      design,
      block=targets$Donor,
      correlation=dupcor$consensus.correlation);
   # define contrasts
   contrasts <- makeContrasts(ML-MS, 
      LP-MS, 
      ML-LP, 
      levels=design)
   
   # fit contrasts
   fit2 <- contrasts.fit(fit, contrasts);
   
   # eBayes
   fit2 <- eBayes(fit2, trend=TRUE)
}
#> Reading file ~/IlluminaCaseStudy/probe profile.txt.gz ... ...
#> Reading file ~/IlluminaCaseStudy/control probe profile.txt.gz ... ...

lmFit() or eBayes() to incidence matrix

The function limma::decideTests() will convert the output MArrayLM object (from limma::lmFit() or limma::eBayes()), first applying thresholds for adjusted P-value and log2 fold change.

The output has class TestResults and is a signed incidence matrix! The values are already: c(-1, 0, 1), which is perfect!

Take a look at the output:

if (do_limma) {
   # decideTests() creates a signed incidence matrix
   fit2decide <- decideTests(fit2,
      p.value=0.01, lfc=log2(1),
      method="global");
   print(fit2decide)
}
#> TestResults matrix
#>               Contrasts
#>                ML - MS LP - MS ML - LP
#>   ILMN_2055271       0       0       0
#>   ILMN_1653355       0       0       0
#>   ILMN_1787689       0       0       0
#>   ILMN_1745607      -1      -1       0
#>   ILMN_1735045       0       0       0
#> 24686 more rows ...

Examples by Input Type

Incidence matrix

A signed incidence matrix can be used directly, it is automatically converted to a signed list.

For simplicity, we will show only the first two contrasts, using argument: sets=c(1, 2)

if (do_limma) {
   venndir(fit2decide, sets=c(1, 2));
}

Venndir using input in the form of a signed incidence matrix, values are -1, 0, or +1.

Signed set list

The signed incidence matrix can be converted to a signed set list using im_value2list().

if (do_limma) {
   limmalist <- im_value2list(fit2decide);
   venndir(limmalist, sets=c(1, 2));
}

Venndir using a signed setlist as input, after converting the signed incidence matrix to a signed setlist.

A few features are noticeable:

The limma contrasts "ML - MS" and "LP - MS" are shown, both involve a comparison “versus MS”. It compares the ML differences from MS, to the LP differences from MS.
The numbers are statistically significant changes given the statistical thresholds used by limma::decideTests().
There is fairly high overlap between these gene lists, 1,582 are shared.
All of the shared changes have the same direction of change in ML-MS compared with LP-MS.
- 740 are up in both comparisons.
- 842 are down in both comparisons.

Exercise for the reader: Consider using more lenient statistical thresholds, and test the overlap and directional concordance. Would you expect more lenient results to have the same concordance, or less concordance?

Customizing the Diagram

Proportional Venn (Euler)

The data can be drawn with proportional circles, which is a Euler diagram.

For this diagram, we plot sets=c(1, 3) mostly because these are more visually interesting.

if (do_limma) {
  venndir(limmalist, sets=c(1, 3), proportional=TRUE);
}

Venndir figure with 'proportional=TRUE' which produces a proportional Euler diagram. Signed counts are displayed beside the main overlap counts by default.

Customizing the labels

Sometimes proportional circles have tall-skinny overlap regions, so you can arrange the signed labels using argument: template="tall".

For kicks, we will also enlarge the main count labels using the second value in font_cex=c(1.2, 1.5, 0.8). The first value controls the set label, the second controls the main count label, the third controls the signed count labels.

if (do_limma) {
   venndir(limmalist, sets=c(1, 3), proportional=TRUE,
      font_cex=c(1, 1.5, 0.8),
      template="tall");
}

overlap_type=“agreement”

The argument overlap_type="agreement" will summarize counts by agreement "=", or disagreement "X".

Notice that sets 2 and 3 are predominantly discordant. It may be confusing, but for this type of analysis, and these comparisons, this result is expected. The group "LP" is the test group in the first comparison, and "LP" is the control group in the second comparison.

if (do_limma) {
   venndir(limmalist, sets=c(2, 3), proportional=TRUE,
      font_cex=c(1, 1.5, 0.8),
      template="tall",
      overlap_type="agreement");
}

overlap_type=“each”

The argument overlap_type="each" will display counts for each combination of up and down. This option gives the most detail, but can sometimes be challenging to label clearly.

if (do_limma) {
   venndir(limmalist, sets=c(2, 3),
      font_cex=c(1, 1.5, 0.8),
      template="tall",
      overlap_type="each");
}

Item labels

It is sometimes useful to show item labels inside the Venn diagram. However please don’t try it with 3,000+ labels!

For this example, we apply a stringent statistical filter to reduce the number of hits.

Sign Item label

if (do_limma) {
   # increase stringency of statistical filtering
   fit2decide2 <- decideTests(fit2,
      p.value=1e-4,
      lfc=5);
   # increase stringency of statistical filtering
   limmalist2 <- im_value2list(fit2decide2);
   # display item labels
   venndir(fit2decide2,
      sets=c(1, 2),
      show_labels="Ni",
      item_degrees=2,
      fontfamily="ArialNarrow",
      show_items="sign item")
}

Venndir figure showing item labels filled inside each corresponding overlap region. Labels include the arrow indication direction, beside each item label. Labels are also colored based upon direction. The labels indicate Illumina probe names, since the statistical analysis involved probes, and not specific genes measured by the probes.

Gene labels

This step “gets into the weeds” so to speak, but is more interesting than printing Illumina probe names. There is a little helper function collapse_im() intended to collapse an incidence matrix by row groups, which is a fancy way of saying “Make it per-gene”.

Convert Illumina probe names to Illumina gene symbols.
Summarize the direction per gene with collapse_im().

The method prioritizes non-zero values, so that any evidence of a hit will be retained, using the majority direction.

if (do_limma) {
   # filter for non-zero values
   fit2decide2g <- subset(fit2decide2, rowSums(abs(fit2decide2)) > 0)
   match2g <- match(rownames(fit2decide2g), rownames(y))
   y_symbol <- ifelse(y$genes$SYMBOL %in% "",
      rownames(y),
      y$genes$SYMBOL)
   row_genes <- y_symbol[match2g]
   fit2gene <- collapse_im(fit2decide2g, row_genes)
   
   # display item labels
   venndir(fit2gene,
      xyratio=2.2,
      # fontfamily="ArialNarrow",
      fontfamily="Avenir Next Condensed",
      sets=c(1, 2),
      show_labels="Ni",
      show_items="sign item")
}

Venndir figure showing item labels which now use the gene symbol, after converting Illumina probe names to the measured genes. Because multiple probes may measure the same gene, the conversion utilizes a helper function: 'im_collapse()'.

Arrow labels

When there are many items, sometimes it’s nice to see just the arrows. It saves space, while also filling the polygon with the color associated with each direction.

Use argument: show_items="sign"

if (do_limma) {
   vo6s <- venndir(fit2decide2,
      sets=c(1, 2),
      label_style="shaded box",
      proportional=TRUE,
      show_labels="NcSi",
      show_items="sign",
      item_cex=2,
      item_degrees=7,
      max_items=1000);
}

Venndir figure showing a proportional Euler diagram, including item labels in each overlap region. The item labels now only show the directional arrow, not the label itself, in order to indicate the proportion of directions observed. The count labels are placed outside the overlap regions, connected by a line segment.

Three-way directional Venn

The examples above purposefully used only two contrasts, because those two contrasts show very high concordance.

Three-group analysis

One common type of analysis involves three experimental groups. For example:

Control
Treatment A
Treatment B

The typical contrasts are:

A - Control
B - Control
A - B

When creating a three-way Venn diagram, the results are also fairly straightforward. When the directionality is included, it can be more confusing, but also potentially much more informative. Note that the contrast (3) may be written "A - B" or "B - A", and the order affects the directionality.

if (do_limma) {
   fit2decide1 <- decideTests(fit2,
      p.value=1e-3,
      lfc=log2(1.5));
   colnames(fit2decide1) <- c("A - C",
      "B - C",
      "A - B")
   v <- venndir(fit2decide1,
      sets=c(1, 2, 3),
      overlap_type="agreement");
}

There are a few potential scenarios:

Treatment A does not affect the same genes as B. Few overlaps.
Treatment A affects similar genes as B, but not with high concordance.
Treatment A affects similar genes as B, with high concordance.

An interesting question is whether A and B are different from each other. This is contrast (3) above: "A - B"

We’ll use the Venndir highlight feature to walk through the logic.

if (do_limma) {
   v2 <- venndir::highlight_venndir_overlap(v,
      overlap_set="A - B")
   plot(v2)
}

Venndir figure shown as in the previous figure, except with one overlap region highlighted using yellow fill color, and a wide orange inner border. The highlighted region represents contrast 'A - B', and the hits which are unique to this comparison. The highlighting helps emphasize this region for discussion.

These 46 hits are different between A and B, but are not independently different when comparing A to C, nor B to C. What does that mean?

It suggests that these are moderate treatment effects, because they aren’t statistically significant when comparing to Control.
It also suggests that the moderate effects are in opposite direction in A and B compared to Control. (It is possible that Control has higher variability for these genes, but for now let’s assume variability is reasonably consistent for all groups.)

Taken together, one might expect the number here (46) to be lower than those for "A - C" or "B - C" simply because they are only hits when comparing A and B directly. If they were changes also seen in "A - C" they would appear in the section 214.

if (do_limma) {
   v3 <- venndir::highlight_venndir_overlap(v,
      overlap_set="A - C&A - B")
   plot(v3)
}

Venndir figure as in the previous figure, this time highlighting one overlap region between 'A - C' and 'A - B', with 214 shared hits which are all concordant in direction. The other set represents the contrast 'B - C', and therefore hits in the highlighted region can only be concordant in direction.

These 214 hits are “distinctively A”, in that they are present "A - C" and "A - B". There are two explanations for concordance or discordance here:
- Note these hits are not present in "B - C" therefore, they could be considered “uniquely A”.
- High concordance is expected, and suggests that these “A” changes are unique to “A”, and have significantly stronger effect in “A” than in “B”.
- Discordance really should be extremely unlikely, since it would require changes between "A - B" which are also not "B - C" changes.

if (do_limma) {
   v4 <- venndir::highlight_venndir_overlap(v,
      overlap_set="B - C&A - B")
   plot(v4)
}

Venndir figure as in the previous figure, highlighting another overlap region between 'B - C' and 'A - B', with 126 hits which are all discordant in direction. Again, due to the organization of this three-way Venn diagram, hits in this overlap region can only be discordant in direction.

These 126 hits are “distinctively B”, completely analogous to the “distinctively A” example above.

However, note that all hits are discordant. It may be confusing at first, however, this effect is due to the order of the comparisons. “B” is the test group in "B - C", but “B” is the control group in "A - B".

if (do_limma) {
   v5 <- venndir::highlight_venndir_overlap(v,
      overlap_set="A - C&B - C")
   plot(v5)
}

Venndir figure as in the previous figure, showing the overlap between 'A - C' and 'B - C', with 726 genes which are all concordant in direction. In this overlap region, hits may be concordant or discordant, and the observation of entirely concordant hits is a finding.

These 726 hits are shared by A and B compared with Control. The key question is whether the hits are concordant. This example represents the driving reason Venndir was created.

There are two conceptual results:

Overlap
- High overlap suggests “the same components are involved”.
- Low overlap suggests “not many of the same components are involved.”
Directional concordance
1. High concordance suggests the components are affected in the same way. This outcome suggests ‘A’ and ‘B’ have similar biological effect.
2. Mixed concordance suggests that components are not affected similarly. In fact, this outcome argues against ‘A’ and ‘B’ having similar biological effects.
3. Low concordance suggests the components are affected in the opposite way. (This outcome may be equally compelling.) This outcome suggests ‘A’ and ‘B’ have opposite biological effects, for example disease and treatment.

if (do_limma) {
   v6 <- venndir::highlight_venndir_overlap(v,
      overlap_set="A - C&B - C&A - B")
   plot(v6)
}

Venndir figure as in the previous figure, highlighting the three-way overlap between 'A - C', 'B - C', and 'A - B', and showing 13 shared hits in all three contrasts. This overlap region is not expected to have any particular concordance, and is difficult to interpret without reviewing the underlying data.

In my opinion, the concordance here is not useful to interpret. It may be interesting to display overlap_type="each".

if (do_limma) {
   veach <- venndir(fit2decide1,
      do_plot=FALSE,
      sets=c(1, 2, 3),
      overlap_type="each");
   v6each <- venndir::highlight_venndir_overlap(veach,
      overlap_set="A - C&B - C&A - B")
   plot(v6each)
}

This plot is mainly useful when A and B have concordant effects compared with Control.
- When A and B are concordant, results should typically begin ^^ or vv.
- The third arrow indicates the contrast with higher magnitude.
- For example ^^^ is up in A-C, up in B-C, and higher in A-B. Therefore A has a greater effect (up) than B.
- For example vvv is down in A-C, down in B-C, and even lower in A-B. Therefore A has a greater effect (down) than B.
- However ^^v is up in A-C, up in B-C, and lower in A-B. Therefore B has a greater effect (up) than A.
- Similarly vv^ is down in A-C, down in B-C, and higher in A-B. Therefore B has a greater effect (down) than A.
When A and B are discordant, the third arrow should match the first arrow, and is therefore not informative.
- ^v^ is up in A-C, down in B-C. Therefore when comparing ‘A-B’ the direction must also be up.
- v^v is down in A-C, up in B-C. Therefore when comparing ‘A-B’ the direction must also be down.
To state it clearly, there should never be a scenario with ^vv or v^^ for the contrasts A-C, B-C, A-B. If this pattern occurs, it suggests two possibilities:
1. The contrasts are reversed, for example 'B-A' instead of 'A-B'. This is fine, but adjust the expectations accordingly.
2. There may be high variability in the data, causing the estimated directions to be unstable. It may occur when using imputed data in place of missing or sparse data, or when there are outlier measurements which are order of magnitude (10-fold or more) different than typical expected results.

Venndir with DESeq2 results

The same data is processed using DESeq2, with the caveat that this is Illumina microarray data - not RNA-seq data as intended for DESeq2. This example is purely to demonstrate the mechanics of the process.

if (do_limma) {
   # DESeq2 DataSet
   targets$DonorNum <- factor(paste0("E", targets$Donor))
   targets$CellType <- factor(targets$CellType)
   countData <- round(2^(y$E) - 1);
   countData[] <- as.integer(countData)
   dds <- suppressMessages(
      DESeq2::DESeqDataSetFromMatrix(
         countData=countData,
         rowData=y$genes,
         colData=targets,
         design=~0 + CellType + DonorNum)
      )
   # run DESeq
   dds <- DESeq2::DESeq(dds)
   # calculate contrasts
   use_contrasts <- contrasts[c(1, 2, 3, 4, 4, 4), ];
   rownames(use_contrasts) <- c(paste0("CellType", rownames(contrasts)),
      tail(resultsNames(dds), 2))
   
   # iterate each contrast to produce DESeqResults
   res_list <- lapply(jamba::nameVector(colnames(use_contrasts)),
      function(i){
      DESeq2::results(dds,
         saveCols="SYMBOL",
         contrast=use_contrasts[, i])
   })
   # filter for statistical hits
   res_hits <- lapply(res_list, function(ires){
      isub <- subset(ires, !is.na(padj) &
            padj <= 0.01 &
            abs(log2FoldChange) >= log2(2));
      jamba::nameVector(sign(isub$log2FoldChange),
         rownames(isub))
   })
   names(res_hits) <- paste0(names(res_hits), "\nDESeq2");
}
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> -- note: fitType='parametric', but the dispersion trend was not well captured by the
#>    function: y = a/x + b, and a local regression fit was automatically substituted.
#>    specify fitType='local' or 'mean' to avoid this message next time.
#> final dispersion estimates
#> fitting model and testing

DESeq2 hits

if (do_deseq2) {
   venndir(res_hits,
      proportional=TRUE,
      template="tall",
      sets=c(1, 2));
}

Venndir proportional Euler diagram using statistical hits from DESeq2 analysis. The vast majority of hits shared in these two contrasts are concordant, 1358 hits are up in both, 1360 hits are down in both, and only 25 hits have discordant direction.

Compare limma to DESeq2

It’s only natural to want to compare limma and DESeq2… Remember that DESeq2 is used here with microarray data only for educational purposes, and is not generally recommended for array data otherwise. That said, it does perform fairly well.

if (do_deseq2) {
   use_list <- c(
      limmalist[1:2],
      res_hits[1:2])[c(1, 3, 4, 2)];
   venndir(use_list,
      font_cex=c(1, 1, 0.5),
      overlap_type="agreement",
      # alpha_by_counts=TRUE,
      template="tall");
}

Four-way Venndir diagram comparing limma results to DESeq2 results. Hits from the same contrast are shared at a high level across the two methods, as anticipated, in fact the region with the most probes is the four-way overlap, which suggests great agreement in the two contrasts, and the two methodologies.

Quick notes:

Notice there are more similarities within the same contrast (1,223 and 827). These are likely to be robust hits, detected by both limma and DESeq2.
There is a much lower overlap within each platform (519, 117) across the two contrasts. These are likely to be specific to each method, owing to limma’s specialization in microarrays.
There are no shared hits across methods for different contrasts, which is also encouraging.
Almost all shared hits are concordant.