MA-plots using ggplot2
ggjammaplot(
x,
nbin_factor = 1,
bw_factor = 1,
assay_name = 1,
useMedian = FALSE,
controlSamples = NULL,
centerGroups = NULL,
colramp = c("transparent", "lightblue", "blue", "navy", "orange", "orangered2"),
groupedX = TRUE,
grouped_mad = TRUE,
outlierMAD = 5,
mad_row_min = 4,
displayMAD = FALSE,
noise_floor = 0,
noise_floor_value = NA,
naValue = NA,
centerFunc = centerGeneData,
whichSamples = NULL,
useRank = FALSE,
titleBoxColor = "lightgoldenrod1",
outlierColor = "lemonchiffon",
fillBackground = TRUE,
maintitle = NULL,
subtitle = NULL,
summary = "mean",
difference = "difference",
transFactor = 0.2,
doPlot = TRUE,
highlightPoints = NULL,
highlightPch = 21,
highlightCex = 1.5,
highlightColor = NULL,
doHighlightLegend = TRUE,
ablineH = c(-2, 0, 2),
base_size = 12,
panel.grid.major.colour = "grey90",
panel.grid.minor.colour = "grey95",
return_type = c("ggplot", "data"),
xlim = NULL,
ylim = c(-6, 6),
ncol = NULL,
nrow = NULL,
blankPlotPos = NULL,
verbose = FALSE,
...
)
one of the following inputs:
numeric matrix
SummarizedExperiment object, where the
assay data is defined using assays(x)[[assay_name]]. Accordingly,
assay_name can be either an integer index or character string
matching the names(assays(x)).
list output from jammacalc() or jammaplot(), where each
element in the list is a two-column matrix with colnames c("x", "y").
numeric value used to adjust the number of bins
used to display the MA-plots, where values higher than 1 increase
the resolution and level of detail, and values below 1 decrease
the resolution. Note that the number of bins is already adjusted based
upon the square root of the number of plot panels, and nbin_factor
is applied to that value.
numeric used to adjust the resolution of the
2-dimensional bandwidth calculation, where higher values create
more detailed density, and lower values create a smoother density
across the range of data.
relevant only when x is SummarizedExperiment,
one of these input types:
character string that matches names(assays(x))
integer index for assays(x), where any value higher than
length(assays(x)) is adjusted to length(assays(x)), which makes
it convenient to select the last element in the list of assays(x)
by using assay_name = Inf.
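The index adjustment described above can be sketched in base R. This is a minimal illustration only: assay_list is a hypothetical named list standing in for assays(x), and the actual internal handling in ggjammaplot() may differ.

```r
# Hypothetical stand-in for assays(x): a named list with three entries.
assay_list <- list(raw_counts = "a", counts = "b", log_counts = "c")

# An index higher than the list length is capped at the last element.
assay_name <- Inf
assay_index <- min(assay_name, length(assay_list))

# The capped index selects the name of the last assay.
selected_name <- names(assay_list)[assay_index]
print(selected_name)
```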
logical indicating whether calculations should use
the median, or when useMedian=FALSE the mean is used. The median
has the benefit of reducing the effect of outliers, while the mean
has the advantage that it represents data consistent with most
parametric statistical analyses.
character vector of colnames(x) to use as
the control when calculating centered data. By default, all samples
are used, so the classic MA-plot is the value of each sample,
subtracting the median or mean value calculated across all samples.
It is sometimes useful to define a subset of known samples for this
calculation, which can be beneficial in avoiding outliers, or for
consistency by selecting high quality control samples.
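The centering described above can be sketched with base R on a toy matrix. The matrix values and sample names below are invented for illustration, and the sketch does not call centerGeneData(); it only shows the arithmetic of subtracting a per-row center calculated from the control samples.

```r
# Toy log2-scale matrix: 4 measurements (rows) by 4 samples (columns).
x <- matrix(
   c(8, 9, 10, 11,
     5, 5, 6, 6,
     12, 12, 12, 14,
     3, 4, 3, 4),
   nrow = 4, byrow = TRUE,
   dimnames = list(paste0("row", 1:4), paste0("sample", 1:4)))

useMedian <- FALSE
controlSamples <- c("sample1", "sample2")

# Per-row center calculated from the control samples only.
x_ctrl <- x[, controlSamples, drop = FALSE]
row_center <- if (useMedian) apply(x_ctrl, 1, median) else rowMeans(x_ctrl)

# y-axis value: each sample minus the per-row center.
# A vector of length nrow(x) recycles down each column, so rows stay aligned.
difference <- x - row_center
print(difference)
```

With controlSamples=NULL the same arithmetic would use all columns of x to calculate the per-row center.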
character vector with length equal to ncol(x),
which defines subgroups of colnames(x) to be treated independently
during the MA-plot calculation.
one of several inputs recognized by
jamba::getColorRamp(). It typically recognizes either the name of
a color ramp from RColorBrewer, the name of functions from the
viridis package such as viridis::viridis(), or single R colors, or
a vector of R colors. When a single color is supplied, a gradient
is created from white to that color, where the default base color
can be customized with defaultBaseColor="black" for example.
logical indicating whether the x-axis value, which
represents the median or mean value, should be calculated independently
for each group when centerGroups is used with multiple groups.
Typically groupedX=TRUE is recommended, however it can be beneficial
to share an overall x-axis value in specific circumstances.
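A minimal sketch of group-wise centering, assuming the mean is used and each group is centered only against its own columns. The matrix, group labels, and variable names are hypothetical; the sketch contrasts a per-group x-axis value with a shared overall value, which is the distinction groupedX controls.

```r
# Toy matrix: 2 measurements by 4 samples, in two groups of two samples.
x <- matrix(
   c(8, 9, 12, 13,
     5, 5, 7, 9),
   nrow = 2, byrow = TRUE,
   dimnames = list(c("row1", "row2"), paste0("sample", 1:4)))
centerGroups <- c("A", "A", "B", "B")

difference <- x
group_center <- x
for (grp in unique(centerGroups)) {
   cols <- which(centerGroups == grp)
   # Row mean within this group only.
   grp_mean <- rowMeans(x[, cols, drop = FALSE])
   # x-axis value per group (the groupedX=TRUE style).
   group_center[, cols] <- grp_mean
   # y-axis value: sample minus its own group center.
   difference[, cols] <- x[, cols, drop = FALSE] - grp_mean
}

# Shared x-axis value across all samples (the groupedX=FALSE style).
overall_center <- rowMeans(x)
print(difference)
```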
logical indicating whether the MAD factor calculation
of variability among samples should be performed independently
for each group when centerGroups is used with multiple groups.
Typically grouped_mad=TRUE is recommended, however it can be beneficial
to share an overall MAD factor threshold across all samples
in specific circumstances.
numeric indicating the MAD factor threshold above
which a particular sample is considered an outlier.
numeric value indicating the minimum x-axis
value, calculated using either median or mean as defined by
argument useMedian, at or above which a measurement is used in the
MAD factor calculation. This threshold is useful to restrict the
MAD variability calculation to measurements (rows in x) with
signal that meets a minimum noise threshold.
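This page does not spell out the MAD factor formula, so the following is only one plausible reading, stated as an assumption: the per-sample variability is taken as the median absolute y-axis value across rows that pass mad_row_min, and the MAD factor is that value divided by the median across samples. All data and variable names are hypothetical.

```r
# Hypothetical centered y-axis values: measurements in rows, samples in columns.
difference <- matrix(
   c(0.1, -0.2, 2.0,
     -0.1, 0.2, -4.0,
     0.2, -0.1, 6.0,
     -0.2, 0.1, -2.0),
   nrow = 4, byrow = TRUE,
   dimnames = list(paste0("row", 1:4), c("s1", "s2", "s3")))

# Hypothetical x-axis value (mean or median signal) for each row.
row_signal <- c(row1 = 6, row2 = 8, row3 = 2, row4 = 10)

mad_row_min <- 4
outlierMAD <- 5

# Keep only rows with signal at or above the minimum.
keep_rows <- row_signal >= mad_row_min

# Assumed per-sample variability: median absolute y-axis value.
sample_mad <- apply(abs(difference[keep_rows, , drop = FALSE]), 2, median)

# Assumed MAD factor: each sample relative to the median across samples.
mad_factor <- sample_mad / median(sample_mad)

# Samples above the threshold are flagged as outliers.
is_outlier <- mad_factor > outlierMAD
print(mad_factor)
```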
logical indicating whether to display the MAD factor
in the bottom right corner of each MA-plot panel.
numeric to define a numeric
floor, or NULL for no numeric floor. Values at or below
noise_floor are set to noise_floor_value, intended for two
potential uses:
Filter out values below a threshold, so they do not affect centering.
This option is valuable to remove zeros when a zero 0 is considered
"no measurement observed", typically for count data such as RNA-seq,
NanoString, and especially single-cell protocols or other protocols
that produce a large number of missing values.
One can typically tell whether input data includes zero 0
values by the presence of characteristic 45-degree angle lines
originating from x=0 angled toward the right. The points along
this line are rows with more measurements of zero than non-zero,
where this sample has a non-zero value.
Set values below a noise floor to the noise floor, to retain the
measurement but limit its effect during centering to that of the
lowest reliable measurement for the platform technology.
This value may be set to a platform noise floor for something like
microarray data where the intensity may be unreliable below a threshold; or
for quantitative PCR measurements where cycle threshold (Ct)
values may become unreliable, for example above CT=40 or CT=35.
Data is often transformed to abundance with 2 ^ (40 - CT) then
log2-transformed for analysis. In this case, to apply a noise_floor
effective for CT=35, one would use noise_floor=5.
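The CT arithmetic above can be checked with a short base R sketch. The CT values are invented for illustration, and the flooring step simply applies the stated rule that values at or below noise_floor are set to noise_floor_value; it does not call ggjammaplot().

```r
# Hypothetical qPCR cycle threshold values.
ct <- c(20, 30, 35, 38, 40)

# Transform to abundance, then log2-transform for analysis.
log2_abundance <- log2(2 ^ (40 - ct))

# A noise floor effective for CT=35 on the log2 abundance scale.
noise_floor <- 40 - 35
noise_floor_value <- NA

# Values at or below noise_floor are set to noise_floor_value.
floored <- ifelse(log2_abundance <= noise_floor,
   noise_floor_value,
   log2_abundance)
print(floored)
```

Setting noise_floor_value to the noise floor itself instead of NA would retain those measurements at the floor rather than removing them from centering.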
value used to replace NA values with
something else. This argument is useful when a numeric matrix may
contain NA values that should instead be treated as, for example, 0.
function used to supply a custom data centering
function. In practice this argument should rarely be changed.
integer index of samples in colnames(x) to be
plotted. All samples in colnames(x) are still used for the
MA-plot calculations and data centering. This argument is intended
to help zoom in to inspect a specific subset of samples, without
having to plot all samples in x.
logical indicating whether to plot rank on the x-axis,
rank-difference on the y-axis for each sample. This transformation
is rather useful, especially when downstream analysis tools may
also refer to the rank value of particular measurements.
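A minimal sketch of a rank-based MA calculation, under the assumption that each sample is ranked independently and the ranks are then centered in the same way as signal values. The page does not confirm that ggjammaplot() ranks data exactly this way, and the matrix is hypothetical.

```r
# Toy matrix: 4 measurements by 2 samples.
x <- matrix(
   c(10, 12,
     5, 4,
     8, 9,
     2, 6),
   nrow = 4, byrow = TRUE,
   dimnames = list(paste0("row", 1:4), c("s1", "s2")))

# Rank the measurements within each sample (column).
x_rank <- apply(x, 2, rank)

# x-axis: mean rank per row; y-axis: rank minus the mean rank.
rank_center <- rowMeans(x_rank)
rank_difference <- x_rank - rank_center
print(rank_difference)
```

Because the y-axis is in rank units, its range scales with the number of rows, which is why the rank example on this page uses a much wider ylim.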
character vector of R colors, where
length(titleBoxColor) is equal to ncol(x), or where
names(titleBoxColor) matches colnames(x). When supplied, each
plot panel strip background will be colored accordingly.
character string representing one R color,
used when colrampOutlier is NULL and when outlierMAD is
defined. This color is used for MA-plot outlier panels by
substituting the first color from the colramp color ramp,
to act as a visual cue that the panel represents an outlier.
logical currently used for base R graphics
output, and passed to jamba::plotSmoothScatter(),
indicating whether to fill the plot panel using the
first color in the color ramp for each MA-plot panel, or when
a plot panel is an outlier, it uses outlierColor.
This argument is mainly useful to highlight outlier panels,
although it is also useful when the color ramp has non-white
base color, for example viridis::viridis().
character string with the title displayed above
all individual MA-plot panels. It will appear in the top outer
margin.
NULL or character vector to be drawn at
the bottom left corner of each plot panel, the location
is defined by subtitlePreset.
numeric adjustment to the visual density of
smooth scatter points. For base R graphics, this argument is
passed to jamba::plotSmoothScatter(). The argument value is based upon
graphics::smoothScatter() argument transformation, which uses
default function(x)x^0.25. The transFactor is equivalent to the
exponential in the form: function(x)x^transFactor. Lower values
make the point density more visually intense, higher values make the
point density less visually intense.
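The effect of the exponent can be seen by applying both transformations to a few density values. This sketch assumes densities scaled between 0 and 1, and the variable names are illustrative only.

```r
# Hypothetical density values scaled between 0 and 1.
density <- c(0.001, 0.01, 0.1, 1)

# Default transformation used by graphics::smoothScatter().
trans_default <- function(x) x ^ 0.25

# Equivalent form using transFactor as the exponent.
transFactor <- 0.2
trans_custom <- function(x) x ^ transFactor

# A lower exponent raises low densities closer to 1,
# which makes sparse regions more visually intense.
result <- data.frame(
   density = density,
   default = trans_default(density),
   custom = trans_custom(density))
print(result)
```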
logical indicating whether to create plots. When
doPlot=FALSE only the MA-plot panel data is returned.
NULL, or character vector, or a
list of character vectors indicating rownames(x) to
highlight in each MA-plot panel. When NULL, no points are
highlighted; when character vector, points are highlighted in
all MA-plot panels; when list of character vectors, each
character vector in the list is highlighted using a unique
color in highlightColor. Points are drawn using
graphics::points() and colored using highlightColor,
which can be time-consuming for a large number of highlight
points.
numeric value recycled to length(highlightPoints)
indicating the highlight point size.
character vector used when highlightPoints
is defined. It is recycled to length(highlightPoints) and
is applied either to
logical indicating whether to print a
color legend when highlightPoints is defined. The legend is
displayed in the bottom outer margin of the page using
outer_legend(), and the page is adjusted to add bottom
outer margin.
numeric vector indicating position of
horizontal and vertical lines in each MA-plot panel.
NULL or numeric vector length=2 indicating
the y-axis and x-axis ranges, respectively. The values are useful
to define consistent dimensions across all panels. The usage above
shows the default ylim=c(-6, 6), which represents 64-fold up and down
range in normal space. The range c(-4, 4) represents 16-fold up and
down range, and is typically a reasonable starting point for
most purposes. Even if numeric values are all between
-1.5 and 1.5, it is still recommended to keep a range in
context of c(-4, 4), to indicate that the observed values
are lower than typically observed. The c(-4, 4) may be adjusted
relative to the typical ranges expected for the data.
It is sometimes helpful to define xlim slightly above zero for
datasets that have an extremely large proportion of zeros, in order
to reduce the visual effect of having that much point density at
zero, for example with xlim=c(0.001, 20) and
applyRangeCeiling=FALSE.
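For log2-scaled data, a y-axis limit converts to a fold change in normal space by exponentiation, which is a quick way to judge whether a ylim suits the data. The variable names below are illustrative.

```r
# y-axis limits in log2 units.
ylim_log2 <- c(narrow = 1.5, typical = 4, wide = 6)

# Equivalent fold change in normal space.
fold_change <- 2 ^ ylim_log2
print(fold_change)
```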
integer number of MA-plot panel columns and rows
passed to graphics::par("mfrow") when doPar=TRUE. When only one
value is supplied, nrow or ncol, the other value is defined
by ncol(x) and blankPlotPos so all panels can be contained on
one page. When nrow and ncol are defined such that multiple
pages are produced, each page will be annotated with maintitle
and doHighlightLegend as relevant.
NULL or integer vector indicating
plot panel positions to be drawn blank, and therefore skipped.
Plot panels are drawn in the exact order of colnames(x) received.
Blank panel positions are intended to help customize the visual
alignment of MA-plot panels. The mechanism is similar to
ggplot2::facet_wrap() except that blank positions can be manually
defined by what makes sense for the experiment design.
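The panel layout arithmetic can be sketched as follows, under the assumption that blank positions occupy panel slots alongside the plotted samples. The numbers mirror the blankPlotPos example on this page (23 plotted samples and one blank slot), and the variable names are hypothetical.

```r
# 23 plotted samples, with panel position 22 left blank.
n_samples <- 23
blankPlotPos <- 22
ncol_panels <- 6

# Total panel slots needed, and the number of rows to fit one page.
n_positions <- n_samples + length(blankPlotPos)
nrow_panels <- ceiling(n_positions / ncol_panels)

# Panel slots that receive samples, in order, skipping blank positions.
sample_positions <- setdiff(seq_len(n_positions), blankPlotPos)
print(nrow_panels)
print(sample_positions)
```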
logical indicating whether to print verbose output.
additional parameters sent to downstream functions,
jamba::plotSmoothScatter, centerGeneData.
This method is under active development and may change as features are implemented.
It is currently fully functional and is being documented.
Other jam plot functions:
jammaplot()
if (jamba::check_pkg_installed("SummarizedExperiment") &&
   jamba::check_pkg_installed("farrisdata")) {
   suppressPackageStartupMessages(require(SummarizedExperiment));
   GeneSE <- farrisdata::farrisGeneSE;
   titleBoxColor <- jamba::nameVector(
      farrisdata::colorSub[as.character(colData(GeneSE)$groupName)],
      colnames(GeneSE));
   options("warn"=FALSE);
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      base_size=12,
      assay_name="raw_counts")
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      useRank=TRUE,
      ylim=c(-11000, 11000),
      maintitle="MA-plots by rank and rank difference",
      titleBoxColor=titleBoxColor)
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      titleBoxColor=titleBoxColor,
      base_size=10,
      maintitle="MA-plots showing MAD factor",
      displayMAD=TRUE)
   gg <- ggjammaplot(GeneSE,
      ncol=6,
      assay_name="counts",
      titleBoxColor=titleBoxColor,
      maintitle="MA-plot omitting one panel, then using blankPlotPos",
      whichSamples=colnames(GeneSE)[c(1:21, 23:24)],
      blankPlotPos=22,
      displayMAD=TRUE)
   if (FALSE) {
      ggdf <- ggjammaplot(GeneSE,
         assay_name="counts",
         whichSamples=c(1:3, 7:9),
         return_type="data",
         titleBoxColor=titleBoxColor)
      highlightPoints1 <- names(jamba::tcount(subset(ggdf, mean > 15 & difference < -1)$item, 2))
      highlightPoints2 <- subset(ggdf, name %in% "CA1CB492" &
         difference < -4.5)$item;
      highlightPoints <- list(
         divergent=highlightPoints1,
         low_CA1CB492=highlightPoints2);
      ggdf_h <- ggjammaplot(GeneSE,
         assay_name="counts",
         highlightPoints=highlightPoints,
         whichSamples=c(1:3, 7:9),
         return_type="data",
         titleBoxColor=titleBoxColor)
      # you can use output from `jammaplot()` as input to `ggjammaplot()`:
      jp2 <- jammaplot(GeneSE,
         outlierMAD=2,
         doPlot=FALSE,
         assay_name="raw_counts",
         filterFloor=1e-10,
         filterFloorReplacement=NA,
         centerGroups=colData(GeneSE)$Compartment,
         subtitleBoxColor=farrisdata::colorSub[as.character(colData(GeneSE)$Compartment)],
         useRank=FALSE);
      gg1 <- ggjammaplot(jp2,
         ncol=6,
         titleBoxColor=titleBoxColor);
      print(gg1);
   }
}
#> Warning: Ignoring unknown parameters: stat
#> Warning: Removed 534 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat
#> Warning: Removed 140 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat
#> Warning: Removed 322 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7776 rows containing missing values (geom_raster).
#> Warning: Ignoring unknown parameters: stat
#> Warning: Removed 317 rows containing non-finite values (stat_density2d).
#> Warning: Removed 7636 rows containing missing values (geom_raster).