3.4 Box-and-Whisker Plots (ggplot2)

As much as we are lattice enthusiasts, we always end up drawing boxplots with ggplot2 because they look so much nicer out of the box: there is no need to tweak a long list of graphical parameter settings to get an acceptable result. You will see what we mean when we plot a ggplot2 version using the default settings.

bw_ggplot <- ggplot(diamonds, aes(x = color, y = price))

g_bw <- bw_ggplot + geom_boxplot()

print(g_bw)

Figure 3.15: A basic ggplot2 boxplot.

This is much better straight away! And, as we’ve already seen, faceting also requires just one more line…

bw_ggplot <- ggplot(diamonds, aes(x = color, y = price))

g_bw <- bw_ggplot + 
  geom_boxplot(fill = "grey90") +
  facet_wrap(~ cut)

print(g_bw)

Figure 3.16: A faceted ggplot2 boxplot.

So far, you may have gotten the impression that pretty much everything is a little bit easier the ggplot2 way. Well, a lot of things are, but some are not. If we want to highlight the relative sample sizes of the different color levels like we did earlier in lattice (using varwidth = TRUE), we have to put a little more effort into ggplot2: we have to calculate the box widths ourselves. There is no built-in functionality for this feature (yet), at least not to our knowledge.
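One caveat on that last point: the current ggplot2 documentation lists a varwidth argument for geom_boxplot(), which draws boxes with widths proportional to the square roots of the number of observations in the groups. The following is a minimal sketch, assuming an installed ggplot2 version that supports this argument; the object name g_bw_var is purely illustrative.

```r
library(ggplot2)  # also provides the diamonds data set

# Assumption: the installed ggplot2 version accepts varwidth in geom_boxplot()
g_bw_var <- ggplot(diamonds, aes(x = color, y = price)) +
  geom_boxplot(fill = "grey90", varwidth = TRUE) +
  facet_wrap(~ cut)

print(g_bw_var)
```

Note that the widths here are derived from the number of observations behind each individual box, so the picture need not match the manual approach that follows, which uses the overall counts per color.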

The manual calculation is not too complicated. The equation for this adjustment is rather straightforward: we take the square root of each color's share of the observations, that is, the count for each color divided by the overall number of observations. Then we standardize this relative to the maximum of this calculation, so that the most frequent color gets a width of 1. As a final step, we need to break this down to each of the panels of the plot. This is the toughest part. We won’t go into any detail here, but the llply() part of the following code chunk (llply() comes from the plyr package) is basically the equivalent of what is going on behind the scenes of lattice (though the latter most likely does not use llply()).

All in all, it does not take many lines of code to achieve the box width adjustment in ggplot2.

### relative box widths: square root of each color's share of all observations
w <- sqrt(table(diamonds$color) / nrow(diamonds))
### standardize w to maximum value
w <- w / max(w)

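### llply() returns a list of layers: one boxplot per color level,
### each drawn from its own data subset with its own fill and width
g_bw <- bw_ggplot + 
  facet_wrap(~ cut) +
  llply(unique(diamonds$color), 
        function(i) geom_boxplot(fill = clrs_hcl(7)[i],
                                 width = w[i], outlier.shape = "*",
                                 outlier.size = 3,
                                 data = subset(diamonds, color == i)))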

print(g_bw)

Figure 3.17: A faceted ggplot2 boxplot with colored boxes and box widths relative to number of observations.
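The llply() call above only needs to return a list of layers, so the same idea can be sketched with base R's lapply(). The following is a minimal, self-contained sketch and not the code used for Figure 3.17: hcl.colors() stands in for the clrs_hcl() palette function (an assumption, so the fill colors will differ), and the names w_vec, fills, box_layers and g_bw_base are illustrative.

```r
library(ggplot2)  # also provides the diamonds data set

# relative widths: square root of each color's share, scaled to a maximum of 1
w <- sqrt(table(diamonds$color) / nrow(diamonds))
w <- w / max(w)

# plain named numeric vector, so widths can be looked up by level name
w_vec <- setNames(as.numeric(w), names(w))
color_levels <- levels(diamonds$color)

# stand-in palette (assumption): one HCL color per color level
fills <- setNames(hcl.colors(length(color_levels)), color_levels)

# one boxplot layer per color level, each with its own subset, fill and width
box_layers <- lapply(color_levels, function(lv) {
  geom_boxplot(
    data = diamonds[diamonds$color == lv, ],
    fill = fills[[lv]],
    width = w_vec[[lv]],
    outlier.shape = "*",
    outlier.size = 3
  )
})

# adding a list of layers to a ggplot object adds each layer in turn
g_bw_base <- ggplot(diamonds, aes(x = color, y = price)) +
  facet_wrap(~ cut) +
  box_layers

print(g_bw_base)
```

Looking widths and fills up by level name rather than by position makes the pairing between a color level and its box explicit, at the cost of a few extra lines.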

The result in Figure 3.17 is very similar to what we achieved earlier with lattice. In summary, lattice needs a little more care to adjust the standard graphical parameters, whereas ggplot2, as used here, has us calculate the width of the boxes ourselves. We leave it up to you which way suits you better… the two of us made our choice a few years ago ;-)

Boxplots are, as mentioned above, a brilliant way to visualize data distribution(s). Their strength lies in the comparability of different classes, as they are plotted next to each other using a common scale. Another, more classical, way to visualize distributions is to use histograms and density plots.