1.1 Subsetting Data
The diamonds
data set from ggplot2 is structured as follows:
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The str()
command is probably the most useful command in all of R. It shows the complete structure of our data set and provides a ‘road map’ of how to access certain parts of the data.
For example, the console output tells us that diamonds$carat
is a numerical vector of length 53940, whereas diamonds$cut
is a factor with the ordered levels
levels(diamonds$cut)
## [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
Suppose we’re a stingy person and don’t want to spend too much money on the wedding ring for our loved one, we could create a data set only including diamonds that cost less than 1000 USD (though 1000 USD does still seem very generous to me).
diamonds_cheap <- subset(diamonds, price < 1000)
Then our new data set would look like this
str(diamonds_cheap)
## Classes 'tbl_df', 'tbl' and 'data.frame': 14499 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Now the new diamonds_cheap
subset is a reduced version of the original diamonds
data set only having 14499 entries instead of the original 53940 entries.
In case we were interested in a subset only including all diamonds of ‘Premium’ quality (column cut
), the command would be
diamonds_premium <- subset(diamonds, cut == "Premium")
str(diamonds_premium)
## Classes 'tbl_df', 'tbl' and 'data.frame': 13791 obs. of 10 variables:
## $ carat : num 0.21 0.29 0.22 0.2 0.32 0.24 0.29 0.22 0.22 0.3 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 6 3 2 2 6 3 2 1 7 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 3 4 3 2 1 5 3 4 4 2 ...
## $ depth : num 59.8 62.4 60.4 60.2 60.9 62.5 62.4 61.6 59.3 59.3 ...
## $ table : num 61 58 61 62 58 57 58 58 62 61 ...
## $ price : int 326 334 342 345 345 355 403 404 404 405 ...
## $ x : num 3.89 4.2 3.88 3.79 4.38 3.97 4.24 3.93 3.91 4.43 ...
## $ y : num 3.84 4.23 3.84 3.75 4.42 3.94 4.26 3.89 3.88 4.38 ...
## $ z : num 2.31 2.63 2.33 2.27 2.68 2.47 2.65 2.41 2.31 2.61 ...
Note the two equal signs in order to specify our selection. This stems from an effort to be consistent with selection criteria such as “smaller than” (<=
) or “not equal” (!=
) and basically translates to “is equal”!
Any combinations of these conditional statements are valid within one and the same subset
call, e.g.
diamonds_premium_and_cheap <- subset(diamonds, cut == "Premium" &
price <= 1000)
produces a rather strict subset only allowing diamonds of premium quality that cost less than 1000 USD.
In case we want ANY of these, meaning all diamonds of premium quality OR cheaper than 1000 USD, we would use the |
operator to combine the two specifications:
diamonds_premium_or_cheap <- subset(diamonds, cut == "Premium" |
price <= 1000)
The OR specification is much less rigid than the AND specification which will result in a larger data set:
diamonds_premium_and_cheap
has 3204 rows, whilediamonds_premium_or_cheap
has 25111 rows.
str(diamonds_premium_and_cheap)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3204 obs. of 10 variables:
## $ carat : num 0.21 0.29 0.22 0.2 0.32 0.24 0.29 0.22 0.22 0.3 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 6 3 2 2 6 3 2 1 7 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 3 4 3 2 1 5 3 4 4 2 ...
## $ depth : num 59.8 62.4 60.4 60.2 60.9 62.5 62.4 61.6 59.3 59.3 ...
## $ table : num 61 58 61 62 58 57 58 58 62 61 ...
## $ price : int 326 334 342 345 345 355 403 404 404 405 ...
## $ x : num 3.89 4.2 3.88 3.79 4.38 3.97 4.24 3.93 3.91 4.43 ...
## $ y : num 3.84 4.23 3.84 3.75 4.42 3.94 4.26 3.89 3.88 4.38 ...
## $ z : num 2.31 2.63 2.33 2.27 2.68 2.47 2.65 2.41 2.31 2.61 ...
str(diamonds_premium_or_cheap)
## Classes 'tbl_df', 'tbl' and 'data.frame': 25111 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
There is, in principle, no limitation to the combination of these so-called Boolean (or logical) operators
- and:
&
- or:
|
- equal to:
==
- not equal to:
!=
- greater than (or equal):
>
(>=
) - less than (or equal):
<
(<=
)
You probably get the idea…