Creating Publication Quality Graphics in R

3.2 Scatter plots (ggplot2)

Now let’s try to recreate our lattice-based achievements using ggplot2.

ggplot2 is radically different from the way that lattice works. lattice is much closer to the traditional way of plotting in R, with different functions for different types of plots. In ggplot2, every plot we want to draw is, at a fundamental level, created in exactly the same way. What differs are the subsequent calls that define how to represent the individual plot components (basically x and y). This makes for a much more consistent way of building visualizations, but it also means that things are rather different from what you might have learned about the syntax and structure of (plotting) objects in R. But don’t worry, even Tim managed to understand how things are done in ggplot2 (and prior to writing this he had almost never used it).

Before we get carried away too much, let’s jump right into our first plot using ggplot2.

scatter_ggplot <- ggplot(aes(x = carat, y = price), data = diamonds)

g_sc <- scatter_ggplot + geom_point()

print(g_sc)

Figure 3.7: A basic scatter plot created with ggplot2.

Similar to lattice, plots are (usually) stored in objects. But that is about all the similarity there is.

Let’s look at the above code in a little more detail. The first line is the fundamental definition of what we want to plot. We provide the ‘aesthetics’ for the plot via aes(). We state that we want the values on the x-axis to represent carat and those on the y-axis to represent price. Furthermore, we want to take these variables from the diamonds data set. That’s basically it, and this will not change a hell of a lot in the subsequent plotting routines.

What will change in the plotting code chunks that follow is how we want the relationship between these variables to be represented in our plot. This is done by defining so-called ‘geometries’ (geom_...()). In this first case, we stated that we want the relationship between x and y to be represented as points, hence we used geom_point().
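To make this separation a bit more tangible, here is a minimal sketch that reuses one base definition with two different geometries. Note that geom_bin2d() is not used anywhere else in this tutorial; it is only picked here as an arbitrary second geometry, and the object names g_points and g_bins are made up for illustration.

```r
library(ggplot2)

# Base definition: data and aesthetics only, no layers yet
scatter_ggplot <- ggplot(aes(x = carat, y = price), data = diamonds)

# Same base object, two different geometries
g_points <- scatter_ggplot + geom_point()
g_bins   <- scatter_ggplot + geom_bin2d(bins = 50)  # illustrative alternative

# Adding a layer returns a new object; the base object stays untouched
length(scatter_ggplot$layers)  # 0
length(g_points$layers)        # 1
```

Because the base object is never modified, we can keep adding different layers to scatter_ggplot throughout this section.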

If we want to show the relationship between price and carat in panels representing the quality of the diamonds, we need what in ggplot2 is called ‘faceting’ (the equivalent of panels in lattice). To achieve this, we simply repeat our plotting call from earlier and add another layer to the call which does the faceting.

g_sc <- scatter_ggplot + 
  geom_point() +
  facet_wrap(~ cut)

print(g_sc)

Figure 3.8: The ggplot2 version of a faceted plot.
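As a quick sanity check on what facet_wrap(~ cut) produces, we can look at the faceting variable itself. This is only a small sketch: cut is a factor, and we would expect one panel per factor level, arranged in the order of the levels.

```r
library(ggplot2)  # provides the diamonds data set

# One facet per level of the faceting variable, in level order
levels(diamonds$cut)
# "Fair" "Good" "Very Good" "Premium" "Ideal"

# Number of panels to expect
nlevels(diamonds$cut)  # 5
```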

By default, all plots created by ggplot() have a grey background. This comes in particularly handy as soon as colors are involved because it increases the contrast of the colors. However, quite a few people dislike the grey default theme when aiming for a simple black-and-white scatter plot like ours. Here, a white facet background seems more suitable, and luckily, ggplot2 lets us easily change the background color using a pre-defined theme called theme_bw() (make sure to check out ?theme_bw for a full list of available themes).

g_sc <- scatter_ggplot + 
  geom_point() +
  facet_wrap(~ cut) + 
  theme_bw()

print(g_sc)

Figure 3.9: The ggplot2 version of a faceted plot with white facet background.

In case you wanted to remove the horizontal and vertical grid lines as well, and hence, end up with a plain white background, an additional line + theme(panel.grid = element_blank()) would be required. For our purposes, however, the above plot version drawing a white instead of a grey background is totally sufficient.
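For completeness, a minimal sketch of the plain white variant could look like this (the object name g_plain is made up for illustration). The one thing to watch out for is the order of the two theme calls.

```r
library(ggplot2)

g_plain <- ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  facet_wrap(~ cut) +
  theme_bw() +                          # complete theme first ...
  theme(panel.grid = element_blank())   # ... then the tweak on top of it

print(g_plain)
```

theme_bw() is a complete theme, so adding it after theme() would overwrite the grid setting again. Tweaks via theme() should therefore come last.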

In order to provide the regression line for each panel like we did in lattice, we need a function called stat_smooth(). This is fundamentally the same function that we used earlier, as the panel.smoother() function we used there (it comes from the latticeExtra package) is based on stat_smooth().

Putting this together we could do something like this (note that we also change the number of rows and columns into which the facets should be arranged):

g_sc <- scatter_ggplot + 
  geom_point(color = "grey60") +
  facet_wrap(~ cut, nrow = 2, ncol = 3) +
  stat_smooth(method = "lm", se = TRUE, 
              fill = "black", color = "black") + 
  theme_bw()

print(g_sc)

Figure 3.10: A faceted ggplot2 plot with regression lines and confidence bands in each facet.

Simple and straightforward, and the result looks rather similar to the lattice version we created earlier.
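If you are curious what the lines in the individual facets actually represent, the following sketch fits the same kind of model by hand, assuming the default formula y ~ x that stat_smooth() uses for method = "lm". The names fits and coefs are made up for illustration.

```r
library(ggplot2)  # only needed for the diamonds data set

# One linear model per level of cut, mirroring the per-facet fits
fits <- lapply(split(diamonds, diamonds$cut),
               function(d) lm(price ~ carat, data = d))

# Intercept and slope for each panel, one row per cut level
coefs <- t(sapply(fits, coef))
print(round(coefs, 1))
```

Each row corresponds to one facet, so the slopes can be compared directly with the steepness of the lines in the plot.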

Creating a point density scatter plot in ggplot2 is actually a fair bit easier than in lattice, as ggplot2 provides several predefined stat_*() functions. One of these is designed to create 2-dimensional kernel density estimates, just what we want. However, this is where the syntax of ggplot2 really becomes a bit abstract. The fill aesthetic in this call is set to ..density.. which, at least at first glance, does not seem very intuitive. The surrounding dots tell ggplot2 that density is not a column of the data set but a variable computed by the stat itself (more recent ggplot2 versions spell this after_stat(density)).

Furthermore, it is not quite sufficient to supply the stat_*() function; we also need to state how to map the colors to that definition. Therefore, we need yet another layer which defines what color palette to use. As we want a continuous variable (density) to be filled with a gradient of n colors, we need to use scale_fill_gradientn() in which we can define the colors we want to be used.

g_sc <- scatter_ggplot + 
  geom_tile() +
  facet_wrap(~ cut, nrow = 3, ncol = 2) +
  stat_density2d(aes(fill = ..density..), n = 100,
                 geom = "tile", contour = FALSE) +
  scale_fill_gradientn(colors = c("white",
                                   rev(clrs_hcl(100)))) +
  stat_smooth(method = "lm", se = FALSE, color = "black") +
  coord_fixed(ratio = 5/30000) + 
  theme_bw()

print(g_sc)

Figure 3.11: The ggplot2 version of a panel plot showing point densities in each panel.
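In more recent ggplot2 versions the ..density.. notation is deprecated in favour of after_stat(), and stat_density2d() is nowadays spelled stat_density_2d(). A hedged sketch of the same plot in the newer syntax could look as follows. Please note the assumptions: hcl.colors() from base R (3.6.0 or later) merely stands in for our clrs_hcl() helper, so the colors will differ from the figure above; the "YlOrRd" palette and the names pal and g_dens are arbitrary choices; the additional geom_tile() layer and the coord_fixed() call are left out to keep the example short.

```r
library(ggplot2)

# Stand-in palette (assumption): white for the lowest densities,
# followed by a reversed sequential HCL palette
pal <- c("white", rev(hcl.colors(100, palette = "YlOrRd")))

g_dens <- ggplot(aes(x = carat, y = price), data = diamonds) +
  facet_wrap(~ cut, nrow = 3, ncol = 2) +
  # after_stat(density) refers to the density computed by the stat
  stat_density_2d(aes(fill = after_stat(density)), n = 100,
                  geom = "tile", contour = FALSE) +
  scale_fill_gradientn(colors = pal) +
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE, color = "black") +
  theme_bw()

print(g_dens)
```

Stating formula = y ~ x explicitly does not change the fit; it only silences the message that stat_smooth() otherwise prints about the formula it is using.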