2.1 Summarizing data
The dplyr package is a marvelous and very intuitive set of tools to perform fast and easy data manipulation and summarization in preparation e.g. for visualization. In the following, we’ll guide you through a selection of operations you may find helpful within the course of future R workflows on the basis of the diamonds dataset. Based on a very comprehensive RPubs tutorial, these include
- filter(),
- select(),
- arrange(),
- mutate(), and
- summarise()(incl.- group_by()),
each of which will be accompanied by an equivalent base-R approach.
Subsetting data via filter()
Just like base-R subset() or simply indexing via conditional expressions included in square brackets, filter() creates a subset of your data based on some user-defined criteria. Before we dive any deeper into this, let’s check your subsetting abilities.
Task: base-R subsetting
Suppose we want to create a subset of diamonds keeping all D-, E- or F-colored specimen with a ‘Premium’ or ‘Ideal’ cut quality and a weight of more than 3 carat. Try to figure out two different solutions, one with and one without using subset.
Using filter() is straightforward and very similar to subset(). The first argument represents the dataset under investigation, whereas any subsequent argument represents a logical expression to filter particular rows. Note that in contrast to the direct subsetting via square brackets, you are not required to repeat the name of the dataset with every single condition.
library(dplyr)
filter(diamonds, carat > 3 &
         cut %in% c("Premium", "Ideal") &
         color %in% c("D", "E", "F"))## # A tibble: 2 x 10
##   carat     cut color clarity depth table price     x     y     z
##   <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  3.01 Premium     F      I1  62.2    56  9925  9.24  9.13  5.73
## 2  3.05 Premium     E      I1  60.9    58 10453  9.26  9.25  5.66
Selecting columns via select()
Selecting specific columns from a dataset follows a similar syntax to filter() and works analogous to SQL standards. Again, the first argument represents the dataset under consideration followed by the desired column names (without double quotes).
dplyr::select(diamonds, carat, cut, color, clarity)## # A tibble: 53,940 x 4
##    carat       cut color clarity
##    <dbl>     <ord> <ord>   <ord>
##  1  0.23     Ideal     E     SI2
##  2  0.21   Premium     E     SI1
##  3  0.23      Good     E     VS1
##  4  0.29   Premium     I     VS2
##  5  0.31      Good     J     SI2
##  6  0.24 Very Good     J    VVS2
##  7  0.24 Very Good     I    VVS1
##  8  0.26 Very Good     H     SI1
##  9  0.22      Fair     E     VS2
## 10  0.23 Very Good     H     VS1
## # ... with 53,930 more rowsNote that both the use of c() to combine the single column names into a vector as well as the need for double quotes became obsolete. In addition to this, column names may be treated analogous to numeric indices, eliminating the need to count columns when desiring to extract several consecutive columns from a rather wide datasets.
dplyr::select(diamonds, carat:clarity, price)## # A tibble: 53,940 x 5
##    carat       cut color clarity price
##    <dbl>     <ord> <ord>   <ord> <int>
##  1  0.23     Ideal     E     SI2   326
##  2  0.21   Premium     E     SI1   326
##  3  0.23      Good     E     VS1   327
##  4  0.29   Premium     I     VS2   334
##  5  0.31      Good     J     SI2   335
##  6  0.24 Very Good     J    VVS2   336
##  7  0.24 Very Good     I    VVS1   336
##  8  0.26 Very Good     H     SI1   337
##  9  0.22      Fair     E     VS2   337
## 10  0.23 Very Good     H     VS1   338
## # ... with 53,930 more rowsThere’s also a set of additional helper functions, including starts_with(), ends_with(), matches() and contains(), which definitely go beyond the scope of this short introduction. Still, it’s good to know that such things existed in case you needed any of them.
A brief note on chaining (or pipelining)
Now suppose we wanted to select the same columns from the previously created subset of data. Traditionally, this would either require a 2-step approach, generating otherwise unnecessary intermediate results, via
## first, create subset
diamonds_sub <- filter(diamonds, carat > 3 &
                         cut %in% c("Premium", "Ideal") &
                         color %in% c("D", "E", "F"))
## second, select columns
select(diamonds_sub, carat:clarity, price)or nested function calls which are usually hard to read.
## all-in-one nested solution
select(
  filter(diamonds, carat > 3 &
           cut %in% c("Premium", "Ideal") &
           color %in% c("D", "E", "F")), 
  carat:clarity, price
)dplyr introduces the %>% operator which is meant to bridge a set of connected processing steps, thus eliminating the need for intermediary variables (e.g. diamonds_sub in the above example) or nested function calls. Just think of %>% as a “then” connecting two parts of a sentence.
diamonds %>%
  filter(carat > 3 &
           cut %in% c("Premium", "Ideal") &
           color %in% c("D", "E", "F")) %>%
  dplyr::select(carat:clarity, price)## # A tibble: 2 x 5
##   carat     cut color clarity price
##   <dbl>   <ord> <ord>   <ord> <int>
## 1  3.01 Premium     F      I1  9925
## 2  3.05 Premium     E      I1 10453Note that chaining comes in handy when performing multiple operations at once, rendering your code much more elegant and reducing the accumulated overhead significantly. The single worksteps can be read from left to right and from top to bottom, just like you would read the pages of a book.
Reordering rows via arrange()
arrange() ensures a fast and easy rearrangement of rows based on certain variables. Although this can be done using base-R, its necessity soon becomes clear when dealing with rearrangements based on multiple variables as the dplyr approach requires far less typing.
Task: base-R rearrangement of rows
Here’s a tricky one. Let’s assume we wanted to rearrange the rows of a subset of diamonds (color, price, carat) according to color, with the best color (D) on top. Since we’re also interested in a preferably low price, the diamonds of uniform color should be sorted according to their price, with cheapest ones on top. Finally, for each set of diamonds with a specified color and price, we want their weights arranged in descending order, with the heaviest specimen on top.
Again, the dplyr approach is much simpler and can be done in one go.
diamonds %>%
  dplyr::select(color, price, carat) %>%
  arrange(color, price, desc(carat))## # A tibble: 53,940 x 3
##    color price carat
##    <ord> <int> <dbl>
##  1     D   357  0.23
##  2     D   357  0.23
##  3     D   361  0.32
##  4     D   362  0.23
##  5     D   367  0.24
##  6     D   367  0.20
##  7     D   367  0.20
##  8     D   367  0.20
##  9     D   373  0.24
## 10     D   373  0.23
## # ... with 53,930 more rows
Adding new columns via mutate()
mutate() lets you add new variables to an existing data frame. It is basically an extended version of base-R transform() in the sense that it allows you to directly work with columns you’ve just created.
diamonds %>%
  dplyr::select(color, carat, price) %>%
  mutate(ppc = price / carat, 
         ppc_rnd = round(ppc, 2))## # A tibble: 53,940 x 5
##    color carat price      ppc ppc_rnd
##    <ord> <dbl> <int>    <dbl>   <dbl>
##  1     E  0.23   326 1417.391 1417.39
##  2     E  0.21   326 1552.381 1552.38
##  3     E  0.23   327 1421.739 1421.74
##  4     I  0.29   334 1151.724 1151.72
##  5     J  0.31   335 1080.645 1080.65
##  6     J  0.24   336 1400.000 1400.00
##  7     I  0.24   336 1400.000 1400.00
##  8     H  0.26   337 1296.154 1296.15
##  9     E  0.22   337 1531.818 1531.82
## 10     H  0.23   338 1469.565 1469.57
## # ... with 53,930 more rows
Summarise values via summarise()
Similar to plyr::summarise(), the dplyr version of summarise() lets you melt a dataset into a single row depending on the supplied function. The function works quite similar to base-R aggregate(). However, the ease of use is definitely on the side of summarise().
Task: base-R data aggregation
In order to demonstrate dplyr’s ease of use in terms of data summarization, try to aggregate() the diamonds dataset in such a fashion that you end up with a data frame showing the minimum, mean and maximum price per diamond color.
As for the dplyr solution, the group_by function comes in handy when trying to calculate metrics from sub-groups of data, e.g. depending on the diamonds’ color, rather than from the entire data frame.
diamonds %>% 
  group_by(color) %>%
  summarise(MIN = min(price), MEAN = mean(price), MAX = max(price))## # A tibble: 7 x 4
##   color   MIN     MEAN   MAX
##   <ord> <dbl>    <dbl> <dbl>
## 1     D   357 3169.954 18693
## 2     E   326 3076.752 18731
## 3     F   342 3724.886 18791
## 4     G   354 3999.136 18818
## 5     H   337 4486.669 18803
## 6     I   334 5091.875 18823
## 7     J   335 5323.818 18710In order to summarize an entire variable into one single value, just drop the group_by function.
summarise(diamonds, min = min(price), mean = mean(price), max = max(price))## # A tibble: 1 x 3
##     min   mean   max
##   <dbl>  <dbl> <dbl>
## 1   326 3932.8 18823Now that you’re familiar with the basics of data manipulation via dplyr and before moving on to actual data visualization via ggplot2, it’s time to have a brief look at data format conversion from wide to long format which is essential for displaying numerous groups of data in one single plot.