Advanced Programming in R

2.2 Data format conversions

Usually, your data is arranged in a matrix-like (wide) format, with rows representing observations and columns representing variables. This is the common layout not only in R, but also in many other data analysis tools, including Excel and SPSS. However, certain R packages, ggplot2 for visualization among them, are easier to work with when the data is in long rather than wide format. Therefore, a “rearrangement of the form, but not the content, of the data” (Wickham 2007) according to so-called identifier (or ID) variables and measured variables becomes necessary. Here’s a simple example taken from the Cookbook for R to illustrate this point. Let’s start off with the wide format:

olddata_wide <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
')

Notice the strict structure of the data.frame: the observations of subjects 1-4 (two males and two females) are arranged in rows, and the measured variables (‘control’, ‘cond1’, ‘cond2’) in columns. Now, let’s see what the long version of the dataset looks like:

library(reshape2)
melt(olddata_wide, id.vars = c("subject", "sex"))

##    subject sex variable value
## 1        1   M  control   7.9
## 2        2   F  control   6.3
## 3        3   F  control   9.5
## 4        4   M  control  11.5
## 5        1   M    cond1  12.3
## 6        2   F    cond1  10.6
## 7        3   F    cond1  13.1
## 8        4   M    cond1  13.4
## 9        1   M    cond2  10.7
## 10       2   F    cond2  11.1
## 11       3   F    cond2  13.8
## 12       4   M    cond2  12.9

What just happened? melt() (or, more precisely, its data.frame method melt.data.frame()) from the reshape2 package was applied to the dataset, increasing the number of rows at the expense of columns. The columns ‘subject’ and ‘sex’ were specified as ID variables, so they are kept as columns and their values are repeated for every measurement. The remaining (measured) variables are now stacked below one another rather than placed side by side: their names go into the ‘variable’ column and their values into the ‘value’ column, so that each row represents a unique combination of ID variables and measured variable. You will notice later on that such a format is much easier to handle when it comes to visualizing grouped variables.
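
As a minimal sketch of why the long format is convenient, the snippet below melts the same dataset, maps the stacked columns directly to plot aesthetics, and then casts the result back to wide format. The column names ‘condition’ and ‘measurement’ and the choice of a dodged bar chart are assumptions made for illustration; they do not appear in the original example.

```r
library(reshape2)
library(ggplot2)

# Same wide dataset as above (taken from the Cookbook for R)
olddata_wide <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
')

# Melt, this time naming the two new columns explicitly
# ("condition" and "measurement" are illustrative names)
data_long <- melt(olddata_wide,
                  id.vars = c("subject", "sex"),
                  variable.name = "condition",
                  value.name = "measurement")

# In long format, a single column can be mapped to a grouping aesthetic:
# one bar per subject and condition, colored by condition
p <- ggplot(data_long,
            aes(x = factor(subject), y = measurement, fill = condition)) +
  geom_col(position = "dodge") +
  labs(x = "subject")
print(p)

# dcast() reverses the operation: ID variables go on the left-hand side
# of the formula, the column holding the former variable names on the right
data_wide <- dcast(data_long, subject + sex ~ condition,
                   value.var = "measurement")
```

Casting back with dcast() should recover the original columns, which illustrates that melting changes only the form of the data, not its content.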

Task: Melting diamonds

The diamonds dataset included in ggplot2 features quite a variety of variables per specimen. Let’s assume we want to convert the dataset from wide into long format, thus reducing the number of columns considerably. Since no ‘true’ ID column is included so far, create a new column ‘ID’ from rownames(diamonds) and subsequently melt() the dataset using all factor columns and the newly established ID column as id.vars.
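
One possible approach to this task is sketched below; it is not the only valid solution. The object names diamonds_df, factor_cols and diamonds_long are hypothetical, and the conversion via as.data.frame() is an added precaution (assuming diamonds ships as a tibble in recent ggplot2 versions) rather than something the task requires.

```r
library(ggplot2)   # provides the diamonds dataset
library(reshape2)

# Work on a plain data.frame copy (precaution, see above)
diamonds_df <- as.data.frame(diamonds)

# Create an ID column from the row names
diamonds_df$ID <- rownames(diamonds_df)

# Collect the names of all factor columns (ordered factors count as well)
factor_cols <- names(diamonds_df)[sapply(diamonds_df, is.factor)]

# Melt: ID and factor columns act as id.vars; all remaining columns
# are treated as measured variables and stacked into 'variable'/'value'
diamonds_long <- melt(diamonds_df, id.vars = c("ID", factor_cols))

# Compare dimensions before and after melting
dim(diamonds_df)
dim(diamonds_long)
```

Since measure.vars is not given, melt() treats every column not listed in id.vars as a measured variable, so the long dataset has one row per diamond and measured variable.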

References

Wickham, Hadley. 2007. “Reshaping Data with the Reshape Package.” Journal of Statistical Software 21 (12): 1–20. doi:10.18637/jss.v021.i12.