2 Data handling and visualization

As noted earlier in this guide, an R workflow essentially consists of performing calculations on and/or visualizing data that has previously been imported into R. In most cases, this workflow includes an intermediate step, referred to as data manipulation, which gets your data into shape for visualization. Note that some packages even require the data to be supplied in a particular format (more on this later).

Within the scope of this short course, we’ll simply assume that you have already heard a lot about importing your data into R. Even if this is not the case, there are plenty of tutorials on the internet demonstrating the use of read.table() and the like. Help on this topic can be found on Quick-R, CRAN, and R-Tutor, among others. Just remember, Google is your friend ☺
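
For readers who would like a quick reminder, here is a minimal sketch of a read.table() call. The file and its contents are made up for illustration and written to a temporary location so that the example runs on its own; your own files will of course look different.

```r
# Write a small, made-up table to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id value group",
             "1 2.5 a",
             "2 3.1 b",
             "3 4.8 a"), tmp)

# header = TRUE: the first line holds the column names
# stringsAsFactors = FALSE: keep 'group' as character rather than factor
dat <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)

# Inspect the structure of the imported data frame
str(dat)
```

Whitespace is the default field separator of read.table(); for comma-separated files, read.csv() is usually the more convenient wrapper.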


Data manipulation

An overview of typical data manipulation steps, including illustrative examples, is provided by the R Cookbook and includes, among others,

  • General operations
    • Sorting
    • Converting between vector types
    • Dealing with duplicate records and missing values
  • Factor operations
    • Renaming and re-computing factor levels
    • Reordering factor levels (which comes in quite handy when plotting data)
  • Data frame operations
    • Adding and removing columns
    • Reordering columns
    • Merging data frames
  • Restructuring data
    • Data format conversions
    • Summarizing data
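
To make some of these operations more tangible, here is a brief sketch in base R. The data frames df and labels, along with all their columns, are invented for this illustration and are not taken from the R Cookbook.

```r
# Small, made-up data frame (hypothetical data)
df <- data.frame(id = c(3, 1, 2, 2),
                 size = c("small", "large", "medium", "medium"),
                 value = c("1.5", "2.0", NA, NA),
                 stringsAsFactors = FALSE)

# Sorting: order the rows by 'id'
df <- df[order(df$id), ]

# Converting between vector types: character -> numeric
df$value <- as.numeric(df$value)

# Duplicate records and missing values
df <- df[!duplicated(df), ]   # drop the repeated row
df_complete <- na.omit(df)    # drop rows that contain NA

# Reordering factor levels (handy for controlling plot order)
df$size <- factor(df$size, levels = c("small", "medium", "large"))
levels(df$size)

# Merging data frames on a shared key column
labels <- data.frame(id = 1:3, label = c("A", "B", "C"))
merged <- merge(df, labels, by = "id")
merged
```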

Since I assume that you are already familiar with the basics of data manipulation, we will primarily focus on the last point, restructuring data, which represents an essential step towards conveying your results clearly and vividly to a broader audience.


The diamonds dataset

In the following subsections, I’ll introduce the essentials of data restructuring in R on the basis of the diamonds dataset, which is part of the ggplot2 package. In case you are not already familiar with the dataset, just take a minute and have a look at ?diamonds to get a more detailed description of the individual variables.

Now, it is important to realize that diamonds is no longer a plain R data.frame (although it used to be one), but an object of class tbl_df, which still inherits from data.frame.

library(ggplot2)
class(diamonds)
## [1] "tbl_df"     "tbl"        "data.frame"
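
Since "data.frame" is still listed among the classes, functions written for ordinary data frames should continue to work on diamonds. The following short sketch checks this; it assumes only that ggplot2 is installed.

```r
library(ggplot2)

# "data.frame" is the last entry of class(diamonds), so both checks succeed
inherits(diamonds, "data.frame")
is.data.frame(diamonds)

# Base R functions written for data frames therefore still apply
nrow(diamonds)
summary(diamonds$price)
```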

One good thing about such objects is that their print() method largely renders the ritual head(), tail(), or str() calls unnecessary.

diamonds
## # A tibble: 53,940 x 10
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows

Notice that the output shows not only the class and the dimensions of the dataset, but also the type of each column. Furthermore, the console output is truncated after the first 10 rows (the same would apply to the displayed columns if we were dealing with a wider dataset). In practice, this means that using a tbl_df lets you kill two birds with one stone, compared with the two-step approach via head() and str() that is usually required for inspecting standard data frames.
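
For comparison, the two-step approach on a standard data frame might look as follows. This is only a sketch: the name diamonds_df is chosen for illustration, and the conversion via as.data.frame() simply strips the tbl_df and tbl classes.

```r
library(ggplot2)

# Convert the tibble to a plain data frame (illustrative object name)
diamonds_df <- as.data.frame(diamonds)
class(diamonds_df)

# Step 1: look at the first few rows
# (printing diamonds_df directly would flood the console with all rows)
head(diamonds_df)

# Step 2: inspect the dimensions and the column types
str(diamonds_df)
```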