4.2 C++ interconnectivity via Rcpp
Prerequisites
In order for Rcpp to work, we have to make sure your local system is capable of building packages. On Linux-based systems, this shouldn’t be much of a problem since things just generally tend to work whereas on Windows (or OS X), you will possible be required to install Rtools (or Xcode) to be able to compile C++ functions and make them available in R. Further information on package development prerequisites can be found here.
You may easily check if everything works by running the following code chunk.
## load 'Rcpp' package
library(Rcpp)
## try to evaluate c++ expression
evalCpp("1 + 1")
## [1] 2
If this expression does not evaluate to ‘2’, there’s something wrong with your local setup and you should possibly contact one of the lecturers for troubleshooting.
How to make your C++ code available in R
Rcpp offers two ways to import C++ functions into R, namely
cppFunction()
andsourceCpp()
.
While the former takes an entire C++ source code as input argument, the latter behaves very similar to base-R source()
in the sense that it sources a code file (.cpp
) and makes the functions included therein available in your
global R environment. During this short overview, however, we’ll primarily focus on the first approach while the latter is introduced only briefly.
- No input, scalar output
No matter which approach you will use in the end, Rcpp will take the C++ code, compile it and transform it into a proper R function. Imagine, for instance, the following code (which is heavily based on Hadley Wickham’s tutorial on High performance functions with Rcpp (Wickham 2014)).
## function that returns '1'
cppFunction('int one() {
return 1;
}')
one()
## [1] 1
Note that C++ requires you to specify the output type of one()
which, in this particular case, is obviously an integer (int
). Accordingly, R variables of type ‘numeric’, ‘character’ and ‘logical’ are referred to as double
, String
and bool
in C++ language. Let’s move on to our next example.
- Scalar input, scalar output
When working with C++ code, you are required to not only specify the output type of your function, but also the type(s) of the input argument(s). The next code chunk defines a function signC()
that takes an integer input x
and, depending on the arithmetic sign of x
, returns one of ‘1’, ‘0’, and ‘-1’.
## function that returns 1 if 'x' is positive, -1 if 'x' is negative, 0 otherwise
cppFunction('int signC(int x) {
if (x > 0) {
return 1;
} else if (x == 0) {
return 0;
} else {
return -1;
}
}')
signC(-10)
## [1] -1
Note also that if
statements in C++ look very similar to their R equivalent. The only obvious differences are
- the need to explicitly include a
return
statement and - the semicolons (
;
) terminating each line of code.
- Vector input, scalar output
Although this seems hardly necessary, let’s assume for now that we wanted to rewrite the base-R sum
function in C++. Rather than supplying a scalar input argument, we need the function to work with a vector of numbers for obvious reasons. Similar to the different scalar inputs depicted above, base-R ‘integer’, ‘numeric’, ‘character’ and ‘logical’ vectors are represented as IntegerVector
, NumericVector
, CharacterVector
and LogicalVector
in C++. This time, it also makes sense to define the input type as NumericVector
and, accordingly, the output type as double
since we’d possibly like to supply numbers with decimal places instead of raw integers.
cppFunction('double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
sumC(seq(0, 1, 0.1))
## [1] 5.5
- Matrix input, vector output
Rcpp also comes with a number of so-called ‘sugars’ that help newcomers to find their way by providing C++-equivalents of a number of built-in R functions. A short overview of featured functions is e.g. given by Eddelbuettel and Francois (2011). Among others, these include
- math functions, e.g.
abs
,ceiling
,floor
,exp
,log
; - scalar summaries, e.g.
mean
,min
,max
,sum
,sd
, `var; - vector summaries, e.g.
cumsum
,diff
; - finding utilities, e.g.
which_max
,which_min
,match
; - finding duplicates, e.g.
duplicated
,unique
.
In order to demonstrate the proper use of such ‘sugars’ and, at the same time, introduce NumericMatrix
(NumericVector
) as further input (output) variable types, let’s replicate the previous example on the use of apply
to calculate mean values from each single variable column of the diamonds
dataset using Rcpp functionality. For the sake of simplicity of this demonstration, let’s again focus on the numeric columns only (note also the use of sapply()
to create an index vector of (non-)numeric columns).
## subset with numeric columns only
num_cols <- sapply(1:ncol(diamonds), function(i) {
is.numeric(data.frame(diamonds)[, i])
})
diamonds_sub <- as.matrix(diamonds[, num_cols])
## c++-version of 'colMeans'
cppFunction("NumericVector colMeansC(NumericMatrix x) {
// number of rows and columns
int nCol = x.ncol();
int nRow = x.nrow();
// temporary variable of size nrow(x) to store column values in
NumericVector nVal(nRow);
// initialize output vector
NumericVector out(nCol);
// loop over each column
for (int i = 0; i < nCol; i++) {
// values in current column
nVal = x(_, i);
// store mean of current 'nVal' in 'out[i]'
out[i] = mean(nVal);
}
return out;
}")
means <- colMeansC(diamonds_sub)
names(means) <- colnames(diamonds_sub)
means
## carat depth table price x
## 0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572
## y z
## 5.7345260 3.5387338
## speed check
microbenchmark(
val_apply <- apply(diamonds_sub, 2, mean),
val_cpp <- colMeansC(diamonds_sub)
, times = 20L)
## Unit: milliseconds
## expr min lq mean
## val_apply <- apply(diamonds_sub, 2, mean) 5.142534 5.187900 5.385949
## val_cpp <- colMeansC(diamonds_sub) 1.158980 1.170475 1.242834
## median uq max neval cld
## 5.339393 5.476106 6.158233 20 b
## 1.248274 1.308215 1.320326 20 a
## similarity check
identical(val_apply, means)
## [1] TRUE
It’s milliseconds we are talking about here, but still - colMeansC()
runs more than 5 times faster as compared to the apply()
approach!
What’s the point of that?
You might guess that we did not decide to include this chapter on C++ interconnectivity just for fun. The actual reason is that C++ code performs much faster as compared to R when it comes to for()
loops. Without going too much into detail, one of the underlying reason is the very efficient memory management of the C++ language as compared to the massive overhead that R produces during each intermediary step. But find out for yourselves…
Task: sumR()
vs. sumC()
Write a function sumR()
(do not use the built-in sum
function) as an equivalent to the above sumC()
function and have a look at the time it takes to run sumR(1:1e4)
using system.time()
(or microbenchmark()
).
## speed check
microbenchmark(
sum(1:1e4), # built-in `sum` function
sumR(1:1e4), # base-R version
sumC(1:1e4) # rcpp version
, times = 100L)
## Unit: microseconds
## expr min lq mean median uq max neval
## sum(1:10000) 13.138 14.3700 15.60138 14.781 15.602 30.792 100
## sumR(1:10000) 557.936 560.4000 675.62845 562.247 580.106 6391.835 100
## sumC(1:10000) 29.560 32.8445 34.59753 34.076 35.308 51.319 100
## cld
## a
## b
## a
As you can see, sumC()
runs more than 40 times faster than sumR()
and, at the same time, takes only slightly longer as compared to the highly optimized (because vectorized) built-in sum()
function. You cannot imagine what’s possible with Rcpp when it comes to more complex operations!
A short note on the use of sourceCpp
For such short operations, the use of cppFunction()
seems reasonable. However, we recommend to use sourceCpp()
when it comes to more complex C++ functions. Take, for example, the following peace of C++ code that reproduces the built-in cor()
function.
cppFunction('double corC(NumericVector x, NumericVector y) {
int nx = x.size(), ny = y.size();
if (nx != ny) stop("Input vectors must have equal length!");
double sum_x = sum(x), sum_y = sum(y);
NumericVector xy = x * y;
NumericVector x_squ = x * x, y_squ = y * y;
double sum_xy = sum(xy);
double sum_x_squ = sum(x_squ), sum_y_squ = sum(y_squ);
double out = ((nx * sum_xy) - (sum_x * sum_y)) / sqrt((nx * sum_x_squ - pow(sum_x, 2.0)) * (nx * sum_y_squ - pow(sum_y, 2.0)));
return out;
}')
Quite confusing, isn’t it? Not the least because the inline C++ is not formatted properly. Luckily, RStudio comes with a C++ editor that allows you to write stand-alone .cpp
functions – including code formatting! For that purpose, select ‘C++ file’ from the top-left drop-down menu and paste the code that we initially passed as character
input to cppFunction()
.
In order to ensure compatibility with Rcpp and make the C++ function available in R, we need to add a header to our .cpp
file (see below). In the end, this should look as follows.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double corC(NumericVector x, NumericVector y) {
int nx = x.size(), ny = y.size();
if (nx != ny) stop("Input vectors must have equal length!");
double sum_x = sum(x), sum_y = sum(y);
NumericVector xy = x * y;
NumericVector x_squ = x * x, y_squ = y * y;
double sum_xy = sum(xy);
double sum_x_squ = sum(x_squ), sum_y_squ = sum(y_squ);
double out = ((nx * sum_xy) - (sum_x * sum_y)) / sqrt((nx * sum_x_squ - pow(sum_x, 2.0)) * (nx * sum_y_squ - pow(sum_y, 2.0)));
return out;
}
Save the file in src/corC.cpp
, for example, and go back to R. Then run
## source 'corC' function (remember to adjust the path)
sourceCpp("src/corC.cpp")
## correlation of 'carat' and 'price'
microbenchmark(
cor(diamonds$carat, diamonds$price),
corC(diamonds$carat, diamonds$price)
, times = 20L)
## Unit: microseconds
## expr min lq mean median
## cor(diamonds$carat, diamonds$price) 776.349 795.2335 889.1878 852.5055
## corC(diamonds$carat, diamonds$price) 720.924 787.0230 837.4791 809.8080
## uq max neval cld
## 918.8085 1271.881 20 a
## 851.8895 1225.078 20 a
Wow, corC()
performs even faster than the built-in and highly optimized cor()
function – at least on my machine. Just imagine the speed gain as compared to a self-written corR()
function!
References
Wickham, Hadley. 2014. Advanced R. Chapman & Hall/CRC the R Series. CRC Press.
Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. doi:10.18637/jss.v040.i08.