MA-plots using ggplot2

ggjammaplot(
  x,
  nbin_factor = 1,
  bw_factor = 1,
  assay_name = 1,
  useMedian = FALSE,
  controlSamples = NULL,
  centerGroups = NULL,
  colramp = c("transparent", "lightblue", "blue", "navy", "orange", "orangered2"),
  groupedX = TRUE,
  grouped_mad = TRUE,
  outlierMAD = 5,
  mad_row_min = 4,
  displayMAD = FALSE,
  noise_floor = 0,
  noise_floor_value = NA,
  naValue = NA,
  centerFunc = centerGeneData,
  whichSamples = NULL,
  useRank = FALSE,
  titleBoxColor = "lightgoldenrod1",
  outlierColor = "lemonchiffon",
  fillBackground = TRUE,
  maintitle = NULL,
  subtitle = NULL,
  summary = "mean",
  difference = "difference",
  transFactor = 0.2,
  doPlot = TRUE,
  highlightPoints = NULL,
  highlightPch = 21,
  highlightCex = 1.5,
  highlightColor = NULL,
  doHighlightLegend = TRUE,
  ablineH = c(-2, 0, 2),
  base_size = 12,
  panel.grid.major.colour = "grey90",
  panel.grid.minor.colour = "grey95",
  return_type = c("ggplot", "data"),
  xlim = NULL,
  ylim = c(-6, 6),
  ncol = NULL,
  nrow = NULL,
  blankPlotPos = NULL,
  verbose = FALSE,
  ...
)

Arguments

x

one of the following inputs:

  • numeric matrix

  • SummarizedExperiment object, where the assay data is defined using assays(x)[[assay_name]]. Accordingly, assay_name can be either an integer index or character string matching the names(assays(x)).

  • list output from jammacalc() or jammaplot(), where each element in the list is a two-column matrix with colnames c("x", "y").

nbin_factor

numeric value used to adjust the number of bins used to display the MA-plots, where values higher than 1 increase the resolution and level of detail, and values below 1 decrease the resolution. Note that the number of bins is already adjusted based upon the square root of the number of plot panels, and nbin_factor is applied to that adjusted value.

bw_factor

numeric used to adjust the resolution of the 2-dimensional bandwidth calculation, where higher values create more detailed density, and lower values create a smoother density across the range of data.

assay_name

relevant only when x is SummarizedExperiment, one of these input types:

  • character string that matches names(assays(x))

  • integer index for assays(x), where any value higher than length(assays(x)) is adjusted to length(assays(x)), which makes it convenient to select the last element in the list of assays(x) by using assay_name = Inf.
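The index adjustment described above can be sketched in base R. This is only an illustration: the helper resolve_assay_index() is hypothetical and not part of the package, and a plain character vector stands in for names(assays(x)).

```r
# Hypothetical helper (not part of the package): resolve assay_name to an
# integer position, lowering any index beyond the last assay to the last one.
resolve_assay_index <- function(assay_name, assay_names) {
   if (is.character(assay_name)) {
      # character input must match one of the assay names
      return(match(assay_name, assay_names))
   }
   # numeric input above the number of assays is adjusted to the last assay
   as.integer(min(assay_name, length(assay_names)))
}

# stand-in for names(assays(x))
assay_names <- c("raw_counts", "counts")

resolve_assay_index(Inf, assay_names)           # selects the last assay
resolve_assay_index("raw_counts", assay_names)  # selects by name
```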

useMedian

logical indicating whether calculations should use the median (useMedian=TRUE) or the mean (useMedian=FALSE). The median has the benefit of reducing the effect of outliers, however the mean has the advantage that it represents data consistent with most parametric statistical analyses.
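A small base R illustration of the trade-off, using made-up values for one measurement across five samples where one sample is an outlier:

```r
# One measurement (row) across five samples; the last sample is an outlier.
values <- c(8.1, 7.9, 8.0, 8.2, 12.0)

row_mean <- mean(values)      # pulled upward by the outlier
row_median <- median(values)  # stays near the bulk of the samples

# difference of each sample from the summary value (the MA-plot y-axis)
values - row_mean
values - row_median
```

With the mean, every non-outlier sample appears shifted below zero. With the median, only the outlier sample deviates noticeably.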

controlSamples

character vector of colnames(x) to use as the control when calculating centered data. By default, all samples are used, so the classic MA-plot shows the value of each sample after subtracting the median or mean value calculated across all samples. It is sometimes useful to define a subset of known samples for this calculation, which can be beneficial in avoiding outliers, or for consistency by selecting high quality control samples.
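The calculation can be sketched in base R as follows. This is a minimal illustration with made-up data, not the package implementation; it assumes the x-axis value is the row mean across the control samples, and the names ma_x and ma_y are chosen here for illustration.

```r
# Small log2-scale matrix: 4 measurements (rows) by 3 samples (columns).
x <- matrix(
   c(10, 11, 12,
      5,  5,  8,
      7,  6,  8,
      3,  3,  3),
   ncol = 3, byrow = TRUE,
   dimnames = list(paste0("row", 1:4), c("ctrl1", "ctrl2", "treated")))

controlSamples <- c("ctrl1", "ctrl2")

# x-axis: summary value per row, using only the control samples
ma_x <- rowMeans(x[, controlSamples, drop = FALSE])

# y-axis: each sample minus the control summary value;
# the row vector is recycled down each column of the matrix
ma_y <- x - ma_x

ma_y[, "treated"]
```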

centerGroups

character vector with length equal to ncol(x), which defines subgroups of colnames(x) to be treated independently during the MA-plot calculation.
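A base R sketch of what treating subgroups independently means for centering. The data and group labels are made up, and the loop is an illustration rather than the package's own centering function.

```r
# Two measurements (rows) across four samples in two groups.
x <- matrix(
   c(10, 12, 20, 22,
      4,  6,  9, 11),
   nrow = 2, byrow = TRUE,
   dimnames = list(c("row1", "row2"), c("A1", "A2", "B1", "B2")))

# one group label per column of x
centerGroups <- c("A", "A", "B", "B")

# center each group against its own row means
centered <- x
for (grp in unique(centerGroups)) {
   cols <- which(centerGroups == grp)
   grp_x <- x[, cols, drop = FALSE]
   centered[, cols] <- grp_x - rowMeans(grp_x)
}
centered
```

Group B has much higher values than group A, yet after group-wise centering both groups show the same within-group differences.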

colramp

one of several inputs recognized by jamba::getColorRamp(). It typically recognizes either the name of a color ramp from RColorBrewer, the name of functions from the viridis package such as viridis::viridis(), or single R colors, or a vector of R colors. When a single color is supplied, a gradient is created from white to that color, where the default base color can be customized with defaultBaseColor="black" for example.

groupedX

logical indicating whether the x-axis value, which represents the median or mean value, should be calculated independently for each group when centerGroups is used with multiple groups. Typically groupedX=TRUE is recommended, however it can be beneficial to share an overall x-axis value in specific circumstances.

grouped_mad

logical indicating whether the MAD factor calculation of variability among samples should be performed independently for each group when centerGroups is used with multiple groups. Typically grouped_mad=TRUE is recommended, however it can be beneficial to share an overall MAD factor threshold across all samples in specific circumstances.

outlierMAD

numeric indicating the MAD factor threshold above which a particular sample is considered an outlier.

mad_row_min

numeric value indicating the minimum x-axis value, calculated using either median or mean as defined by argument useMedian, at or above which a measurement is used in the MAD factor calculation. This threshold is useful to restrict the MAD variability calculation to measurements (rows in x) with signal that meets a minimum noise threshold.
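The text above does not spell out the MAD factor formula, so the following base R sketch rests on stated assumptions: each sample's MAD is taken as the median absolute y-axis value over rows whose x-axis value is at or above mad_row_min, and the MAD factor is that value divided by the median MAD across samples. The data are simulated.

```r
set.seed(1)
n <- 200

# simulated x-axis values (row summary signal)
ma_x <- runif(n, min = 0, max = 12)

# simulated y-axis values: three well-behaved samples and one noisy sample
ma_y <- cbind(
   s1 = rnorm(n, sd = 0.2),
   s2 = rnorm(n, sd = 0.2),
   s3 = rnorm(n, sd = 0.2),
   s4 = rnorm(n, sd = 2))

mad_row_min <- 4
outlierMAD <- 5

# use only rows with signal at or above mad_row_min
keep <- ma_x >= mad_row_min

# assumed definition: median absolute deviation from zero, per sample
sample_mad <- apply(ma_y[keep, , drop = FALSE], 2,
   function(y) median(abs(y)))

# assumed definition: ratio to the median MAD across samples
mad_factor <- sample_mad / median(sample_mad)

# samples above the threshold are flagged as outliers
is_outlier <- mad_factor > outlierMAD
is_outlier
```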

displayMAD

logical indicating whether to display the MAD factor in the bottom right corner of each MA-plot panel.

noise_floor, noise_floor_value

numeric to define a numeric floor, or NULL for no numeric floor. Values at or below noise_floor are set to noise_floor_value, intended for two potential uses:

  1. Filter out values below a threshold, so they do not affect centering.

    • This option is valuable to remove zeros when a zero 0 is considered "no measurement observed", typically for count data such as RNA-seq, NanoString, and especially single-cell protocols or other protocols that produce a large number of missing values.

    • One can typically tell whether input data includes zero 0 values by the presence of characteristic 45-degree angle lines originating from x=0 angled toward the right. The points along this line are rows with more measurements of zero than non-zero, where this sample has a non-zero value.

  2. Set values at or below a noise floor to the noise floor, to retain the measurement but minimize its effect during centering by limiting it to the lowest reliable measurement for the platform technology.

    • This value may be set to a platform noise floor for something like microarray data where the intensity may be unreliable below a threshold; or

    • for quantitative PCR measurements where cycle threshold (Ct) values may become unreliable, for example above Ct=40 or Ct=35. Data is often transformed to abundance with 2 ^ (40 - Ct) then log2-transformed for analysis. In this case, to apply a noise_floor effective for Ct=35, one would use noise_floor=5, since log2(2 ^ (40 - 35)) is 5.
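Both uses can be sketched in base R. This is an illustration of the rule "values at or below noise_floor are set to noise_floor_value" on a made-up matrix, not the package code.

```r
x <- matrix(
   c(0, 3, 7,
     2, 0, 9),
   nrow = 2, byrow = TRUE)

# Use 1: treat zeros as "no measurement" so they do not affect centering.
noise_floor <- 0
noise_floor_value <- NA
x_filtered <- x
x_filtered[!is.na(x) & x <= noise_floor] <- noise_floor_value
rowMeans(x_filtered, na.rm = TRUE)   # zeros no longer pull the mean down

# Use 2: retain the measurement, but raise it to the noise floor.
noise_floor <- 5
noise_floor_value <- 5
x_floored <- x
x_floored[!is.na(x) & x <= noise_floor] <- noise_floor_value
x_floored

# qPCR arithmetic: abundance transform followed by log2, for Ct = 35
ct <- 35
log2(2 ^ (40 - ct))
```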

naValue

value used to replace NA entries with something else. This argument is useful when a numeric matrix may contain NA values that should instead be treated as, for example, 0.

centerFunc

function used to supply a custom data centering function. In practice this argument should rarely be changed.

whichSamples

integer index of samples in colnames(x) to be plotted, however all samples in colnames(x) will be used for the MA-plot calculations and data centering. This argument is intended to help zoom in to inspect a specific subset of samples, without having to plot all samples in x.

useRank

logical indicating whether to plot rank on the x-axis, rank-difference on the y-axis for each sample. This transformation is rather useful, especially when downstream analysis tools may also refer to the rank value of particular measurements.
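A base R sketch of a rank-based MA calculation. It assumes values are ranked within each sample and that the x-axis is the mean rank per row; the package may differ in details such as tie handling or rank direction.

```r
x <- matrix(
   c(10, 12,  9,
      5,  4,  6,
      7,  8, 11),
   nrow = 3, byrow = TRUE,
   dimnames = list(paste0("row", 1:3), paste0("s", 1:3)))

# rank within each sample (column); rank 1 is the lowest value
x_rank <- apply(x, 2, rank)

# x-axis: mean rank per row; y-axis: rank difference from that mean
rank_x <- rowMeans(x_rank)
rank_diff <- x_rank - rank_x

x_rank
rank_diff
```

Because the y-axis is in rank units, its range scales with the number of rows, which is why the example further below uses a much wider ylim with useRank=TRUE.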

titleBoxColor

character vector of R colors, where length(titleBoxColor) is equal to ncol(x), or where names(titleBoxColor) matches colnames(x). When supplied, each plot panel strip background will be colored accordingly.

outlierColor

character string representing one R color, used when colrampOutlier is NULL and when outlierMAD is defined. This color is used for MA-plot outlier panels by substituting the first color from the colramp color ramp, to act as a visual cue that the panel represents an outlier.

fillBackground

logical currently used for base R graphics output, and passed to jamba::plotSmoothScatter(), indicating whether to fill the plot panel using the first color in the color ramp for each MA-plot panel, or when a plot panel is an outlier, it uses outlierColor. This argument is mainly useful to highlight outlier panels, although it is also useful when the color ramp has non-white base color, for example viridis::viridis().

maintitle

character string with the title displayed above all individual MA-plot panels. It will appear in the top outer margin.

subtitle

NULL or character vector to be drawn at the bottom left corner of each plot panel. The location is defined by subtitlePreset.

transFactor

numeric adjustment to the visual density of smooth scatter points. For base R graphics, this argument is passed to jamba::plotSmoothScatter(). The argument value is based upon graphics::smoothScatter() argument transformation, which uses default function(x)x^0.25. The transFactor is equivalent to the exponent in the form: function(x)x^transFactor. Lower values make the point density more visually intense, higher values make the point density less visually intense.
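The effect of the exponent can be checked directly in base R. This sketch assumes density values scaled between 0 and 1, which is an assumption made here for illustration.

```r
# illustrative point densities, assumed to be scaled between 0 and 1
density <- c(0.01, 0.1, 0.5, 1)

# default transformation used by graphics::smoothScatter()
trans_default <- function(x) x^0.25

# transformation in the form described above, with transFactor = 0.2
transFactor <- 0.2
trans_custom <- function(x) x^transFactor

round(trans_default(density), 3)
round(trans_custom(density), 3)
```

For densities below 1, the lower exponent returns larger values, so sparse regions are drawn with more intense color.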

doPlot

logical indicating whether to create plots. When doPlot=FALSE only the MA-plot panel data is returned.

highlightPoints

NULL, or character vector, or a list of character vectors indicating rownames(x) to highlight in each MA-plot panel. When NULL, no points are highlighted; when character vector, points are highlighted in all MA-plot panels; when list of character vectors, each character vector in the list is highlighted using a unique color in highlightColor. Points are drawn using graphics::points() and colored using highlightColor, which can be time-consuming for a large number of highlight points.

highlightCex

numeric value recycled to length(highlightPoints) indicating the highlight point size.

highlightColor

character vector used when highlightPoints is defined. It is recycled to length(highlightPoints) and is applied either to

doHighlightLegend

logical indicating whether to print a color legend when highlightPoints is defined. The legend is displayed in the bottom outer margin of the page using outer_legend(), and the page is adjusted to add bottom outer margin.

ablineH

numeric vector indicating the position of horizontal lines in each MA-plot panel.

xlim, ylim

NULL or numeric vector length=2 indicating the x-axis and y-axis ranges, respectively. The values are useful to define consistent dimensions across all panels. The default ylim=c(-6, 6) represents a 64-fold up and down range in normal space, assuming log2-scaled data. A narrower range such as ylim=c(-4, 4), which represents a 16-fold up and down range, is typically a reasonable starting point for most purposes. Even if numeric values are all between -1.5 and 1.5, it is still recommended to keep a range in context of c(-4, 4), to indicate that the observed values are lower than typically observed. The c(-4, 4) may be adjusted relative to the typical ranges expected for the data. It is sometimes helpful to define xlim slightly above zero for datasets that have an extremely large proportion of zeros, in order to reduce the visual effect of having that much point density at zero, for example with xlim=c(0.001, 20) and applyRangeCeiling=FALSE.
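The fold-change arithmetic behind these y-axis limits, assuming the data are on a log2 scale:

```r
# y-axis limits in log2 units, converted to fold change in normal space
ylim_upper <- c(4, 6)
fold_range <- 2 ^ ylim_upper
fold_range

# observed differences within c(-1.5, 1.5) correspond to less than 3-fold
2 ^ 1.5
```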

ncol, nrow

integer number of MA-plot panel columns and rows passed to graphics::par("mfrow") when doPar=TRUE. When only one value is supplied, nrow or ncol, the other value is defined by ncol(x) and blankPlotPos so all panels can be contained on one page. When nrow and ncol are defined such that multiple pages are produced, each page will be annotated with maintitle and doHighlightLegend as relevant.

blankPlotPos

NULL or integer vector indicating plot panel positions to be drawn blank, and therefore skipped. Plot panels are drawn in the exact order of colnames(x) received. Blank panel positions are intended to help customize the visual alignment of MA-plot panels. The mechanism is similar to ggplot2::facet_wrap() except that blank positions can be manually defined by what makes sense for the experiment design.
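The panel arithmetic can be sketched in base R. The variable names n_samples, n_col, and n_row are chosen here for illustration, and the numbers mirror the example below that plots 23 samples with one blank position.

```r
n_samples <- 23       # number of samples to plot
blankPlotPos <- 22    # panel position left blank
n_col <- 6            # number of panel columns

# total panel positions, including blank ones
n_panels <- n_samples + length(blankPlotPos)

# positions occupied by samples, in the order received
sample_positions <- setdiff(seq_len(n_panels), blankPlotPos)

# rows needed to contain all panels on one page
n_row <- ceiling(n_panels / n_col)

sample_positions
n_row
```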

verbose

logical indicating whether to print verbose output.

...

additional parameters sent to downstream functions, such as jamba::plotSmoothScatter() and centerGeneData().

Details

This method is under active development and may change as features are implemented.

It is currently fully functional and is being documented.

See also

Other jam plot functions: jammaplot()

Examples

if (jamba::check_pkg_installed("SummarizedExperiment") &&
   jamba::check_pkg_installed("farrisdata")) {
   suppressPackageStartupMessages(require(SummarizedExperiment));

   GeneSE <- farrisdata::farrisGeneSE;

   titleBoxColor <- jamba::nameVector(
      farrisdata::colorSub[as.character(colData(GeneSE)$groupName)],
      colnames(GeneSE));
   options("warn"=FALSE);

   # MA-plots of the raw counts, six panels per row
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      base_size=12,
      assay_name="raw_counts")

   # rank on the x-axis and rank difference on the y-axis,
   # with ylim widened to suit rank units
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      useRank=TRUE,
      ylim=c(-11000, 11000),
      maintitle="MA-plots by rank and rank difference",
      titleBoxColor=titleBoxColor)

   # display the MAD factor in each panel
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      titleBoxColor=titleBoxColor,
      base_size=10,
      maintitle="MA-plots showing MAD factor",
      displayMAD=TRUE)

   # omit sample 22 from the plot and leave its panel position blank
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      titleBoxColor=titleBoxColor,
      maintitle="MA-plot omitting one panel, then using blankPlotPos",
      whichSamples=colnames(GeneSE)[c(1:21, 23:24)],
      blankPlotPos=22,
      displayMAD=TRUE)

   if (FALSE) {
   ggdf <- ggjammaplot(GeneSE,
      assay_name="counts",
      whichSamples=c(1:3, 7:9),
      return_type="data",
      titleBoxColor=titleBoxColor)
   highlightPoints1 <- names(jamba::tcount(subset(ggdf, mean > 15 & difference < -1)$item, 2))
   highlightPoints2 <- subset(ggdf, name %in% "CA1CB492" &
      difference < -4.5)$item;
   highlightPoints <- list(
      divergent=highlightPoints1,
      low_CA1CB492=highlightPoints2);

   ggdf_h <- ggjammaplot(GeneSE,
      assay_name="counts",
      highlightPoints=highlightPoints,
      whichSamples=c(1:3, 7:9),
      return_type="data",
      titleBoxColor=titleBoxColor)

   # you can use output from `jammaplot()` as input to `ggjammaplot()`:
   jp2 <- jammaplot(GeneSE,
      outlierMAD=2,
      doPlot=FALSE,
      assay_name="raw_counts",
      filterFloor=1e-10,
      filterFloorReplacement=NA,
      centerGroups=colData(GeneSE)$Compartment,
      subtitleBoxColor=farrisdata::colorSub[as.character(colData(GeneSE)$Compartment)],
      useRank=FALSE);

   gg1 <- ggjammaplot(jp2,
      ncol=6,
      titleBoxColor=titleBoxColor);
   print(gg1);
   }
}
#> Warning: Ignoring unknown parameters: stat
#> Warning: Removed 534 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat

#> Warning: Removed 140 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat

#> Warning: Removed 322 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat

#> Warning: Removed 317 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7636 rows containing missing values (geom_raster).