1.4 Merging Data

We often end up with several data sets on our hard drive that each contain useful data for the same analysis. In that case we may want to combine them so that all the data sits in a single data set.
R provides a function called merge() that does just that:

ave_n_cut_color_price <- merge(ave_n_cut_color, ave_price_cut, 
                               by.x = "cut", by.y = "Group.1")
ave_n_cut_color_price
##          cut color    n        x
## 1       Fair     J  119 4358.758
## 2       Fair     D  163 4358.758
## 3       Fair     I  175 4358.758
## 4       Fair     E  224 4358.758
## 5       Fair     H  303 4358.758
## 6       Fair     F  312 4358.758
## 7       Fair     G  314 4358.758
## 8       Good     J  307 3928.864
## 9       Good     I  522 3928.864
## 10      Good     D  662 3928.864
## 11      Good     H  702 3928.864
## 12      Good     G  871 3928.864
## 13      Good     F  909 3928.864
## 14      Good     E  933 3928.864
## 15     Ideal     J  896 3457.542
## 16     Ideal     I 2093 3457.542
## 17     Ideal     D 2834 3457.542
## 18     Ideal     H 3115 3457.542
## 19     Ideal     F 3826 3457.542
## 20     Ideal     E 3903 3457.542
## 21     Ideal     G 4884 3457.542
## 22   Premium     J  808 4584.258
## 23   Premium     I 1428 4584.258
## 24   Premium     D 1603 4584.258
## 25   Premium     F 2331 4584.258
## 26   Premium     E 2337 4584.258
## 27   Premium     H 2360 4584.258
## 28   Premium     G 2924 4584.258
## 29 Very Good     J  678 3981.760
## 30 Very Good     I 1204 3981.760
## 31 Very Good     D 1513 3981.760
## 32 Very Good     H 1824 3981.760
## 33 Very Good     F 2164 3981.760
## 34 Very Good     G 2299 3981.760
## 35 Very Good     E 2400 3981.760

Because the key variable is named differently in our two data sets ("cut" in the first, "Group.1" in the second), we need to state explicitly which variable to merge on in each of them (by.x and by.y). By default, merge() joins on all variable names that the two data sets have in common.
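To make the difference between the default and the by.x/by.y variant concrete, here is a minimal sketch. The data frames counts, prices and prices_agg are made-up toy data for illustration, not the objects used in this chapter.

```r
# Hypothetical toy data, not the data sets from this chapter
counts <- data.frame(cut = c("Fair", "Good", "Ideal"),
                     n   = c(10, 20, 30))
prices <- data.frame(cut   = c("Good", "Ideal", "Premium"),
                     price = c(3900, 3400, 4500))

# Both data frames share the column "cut", so no by argument is needed.
# By default only rows with a match in both data frames are kept.
inner <- merge(counts, prices)

# all = TRUE also keeps unmatched rows and fills the gaps with NA
outer <- merge(counts, prices, all = TRUE)

# Same prices, but with the key named as aggregate() would name it
prices_agg <- data.frame(Group.1 = c("Good", "Ideal", "Premium"),
                         x       = c(3900, 3400, 4500))

# The key names differ, so we have to spell them out;
# the key column in the result takes its name from the first data frame
named <- merge(counts, prices_agg, by.x = "cut", by.y = "Group.1")
```

Here inner only contains the cuts present in both data frames, while outer keeps every cut from either side.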

Note that merge() only takes two data frames at a time. To merge more than two, we can use a powerful higher-order function called Reduce(), which applies a two-argument function successively to the elements of a list. This is one mighty function for doing all sorts of things iteratively.
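Before applying it to our data, it may help to see what Reduce() actually does. The following is a small sketch with made-up objects (a, b and c3 are hypothetical toy data frames): Reduce() combines the first two list elements, then combines that result with the third element, and so on.

```r
# Reduce() folds a two-argument function over a list, left to right
total <- Reduce(`+`, list(1, 2, 3, 4))   # same as ((1 + 2) + 3) + 4

# Hypothetical toy data frames sharing the key column "id"
a  <- data.frame(id = 1:3, x = c(10, 20, 30))
b  <- data.frame(id = 1:3, y = c("p", "q", "r"))
c3 <- data.frame(id = 1:3, z = c(TRUE, FALSE, TRUE))

# Folding merge() over the list ...
folded <- Reduce(function(left, right) merge(left, right, all = TRUE),
                 list(a, b, c3))

# ... is the same as nesting the merge() calls by hand
nested <- merge(merge(a, b, all = TRUE), c3, all = TRUE)

same_result <- identical(folded, nested)
```

The merge below applies the same fold to a list of three data frames.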

names(ave_price_cut) <- c("cut", "price")

set.seed(12)

df3 <- data.frame(cut = ave_price_cut$cut,
                  var1 = rnorm(nrow(ave_price_cut), 10, 2),
                  var2 = rnorm(nrow(ave_price_cut), 100, 20))

ave_n_cut_color_price <- Reduce(function(...) merge(..., all = TRUE),
                                list(ave_n_cut_color, 
                                     ave_price_cut,
                                     df3))
ave_n_cut_color_price
##          cut color    n    price      var1      var2
## 1       Fair     J  119 4358.758  7.038865  94.55408
## 2       Fair     D  163 4358.758  7.038865  94.55408
## 3       Fair     I  175 4358.758  7.038865  94.55408
## 4       Fair     E  224 4358.758  7.038865  94.55408
## 5       Fair     H  303 4358.758  7.038865  94.55408
## 6       Fair     F  312 4358.758  7.038865  94.55408
## 7       Fair     G  314 4358.758  7.038865  94.55408
## 8       Good     J  307 3928.864 13.154339  93.69303
## 9       Good     I  522 3928.864 13.154339  93.69303
## 10      Good     D  662 3928.864 13.154339  93.69303
## 11      Good     H  702 3928.864 13.154339  93.69303
## 12      Good     G  871 3928.864 13.154339  93.69303
## 13      Good     F  909 3928.864 13.154339  93.69303
## 14      Good     E  933 3928.864 13.154339  93.69303
## 15     Ideal     J  896 3457.542  6.004716 108.56030
## 16     Ideal     I 2093 3457.542  6.004716 108.56030
## 17     Ideal     D 2834 3457.542  6.004716 108.56030
## 18     Ideal     H 3115 3457.542  6.004716 108.56030
## 19     Ideal     F 3826 3457.542  6.004716 108.56030
## 20     Ideal     E 3903 3457.542  6.004716 108.56030
## 21     Ideal     G 4884 3457.542  6.004716 108.56030
## 22   Premium     J  808 4584.258  8.159990  97.87072
## 23   Premium     I 1428 4584.258  8.159990  97.87072
## 24   Premium     D 1603 4584.258  8.159990  97.87072
## 25   Premium     F 2331 4584.258  8.159990  97.87072
## 26   Premium     E 2337 4584.258  8.159990  97.87072
## 27   Premium     H 2360 4584.258  8.159990  97.87072
## 28   Premium     G 2924 4584.258  8.159990  97.87072
## 29 Very Good     J  678 3981.760  8.086511  87.43490
## 30 Very Good     I 1204 3981.760  8.086511  87.43490
## 31 Very Good     D 1513 3981.760  8.086511  87.43490
## 32 Very Good     H 1824 3981.760  8.086511  87.43490
## 33 Very Good     F 2164 3981.760  8.086511  87.43490
## 34 Very Good     G 2299 3981.760  8.086511  87.43490
## 35 Very Good     E 2400 3981.760  8.086511  87.43490

Obviously, giving the columns proper names would be the next step now…
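One possible way to do that is sketched below on a small stand-in data frame. Both the stand-in (merged) and the new names (count, mean_price) are assumptions made for illustration; pick whatever names describe your variables best.

```r
# Hypothetical stand-in for a merged data frame
merged <- data.frame(cut   = c("Fair", "Good"),
                     n     = c(119, 307),
                     price = c(4358.758, 3928.864))

# Rename by matching on the old name rather than by position,
# so the code keeps working if the column order changes
names(merged)[names(merged) == "n"]     <- "count"       # assumed new name
names(merged)[names(merged) == "price"] <- "mean_price"  # assumed new name
```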

Okay, so now we have a few tools at hand to manipulate our data so that we can produce meaningful graphs that tell the story we want to be heard, or better, seen…

So, let’s start plotting stuff!