Advanced Programming in R

4.2 C++ interconnectivity via Rcpp

Prerequisites

In order for Rcpp to work, we have to make sure your local system is capable of building packages. On Linux-based systems, this shouldn’t be much of a problem since things just generally tend to work whereas on Windows (or OS X), you will possible be required to install Rtools (or Xcode) to be able to compile C++ functions and make them available in R. Further information on package development prerequisites can be found here.

You may easily check if everything works by running the following code chunk.

## load 'Rcpp' package
library(Rcpp)

## try to evaluate c++ expression
evalCpp("1 + 1")

## [1] 2

If this expression does not evaluate to ‘2’, there’s something wrong with your local setup and you should possibly contact one of the lecturers for troubleshooting.

How to make your C++ code available in R

Rcpp offers two ways to import C++ functions into R, namely

cppFunction() and
sourceCpp().

While the former takes an entire C++ source code as input argument, the latter behaves very similar to base-R source() in the sense that it sources a code file (.cpp) and makes the functions included therein available in your
global R environment. During this short overview, however, we’ll primarily focus on the first approach while the latter is introduced only briefly.

No input, scalar output

No matter which approach you will use in the end, Rcpp will take the C++ code, compile it and transform it into a proper R function. Imagine, for instance, the following code (which is heavily based on Hadley Wickham’s tutorial on High performance functions with Rcpp (Wickham 2014)).

## function that returns '1'
cppFunction('int one() {
  return 1;
}')

one()

## [1] 1

Note that C++ requires you to specify the output type of one() which, in this particular case, is obviously an integer (int). Accordingly, R variables of type ‘numeric’, ‘character’ and ‘logical’ are referred to as double, String and bool in C++ language. Let’s move on to our next example.

Scalar input, scalar output

When working with C++ code, you are required to not only specify the output type of your function, but also the type(s) of the input argument(s). The next code chunk defines a function signC() that takes an integer input x and, depending on the arithmetic sign of x, returns one of ‘1’, ‘0’, and ‘-1’.

## function that returns 1 if 'x' is positive, -1 if 'x' is negative, 0 otherwise
cppFunction('int signC(int x) {
  if (x > 0) {
    return 1;
  } else if (x == 0) {
    return 0;
  } else {
    return -1;
  }
}')

signC(-10)

## [1] -1

Note also that if statements in C++ look very similar to their R equivalent. The only obvious differences are

the need to explicitly include a return statement and
the semicolons (;) terminating each line of code.

Vector input, scalar output

Although this seems hardly necessary, let’s assume for now that we wanted to rewrite the base-R sum function in C++. Rather than supplying a scalar input argument, we need the function to work with a vector of numbers for obvious reasons. Similar to the different scalar inputs depicted above, base-R ‘integer’, ‘numeric’, ‘character’ and ‘logical’ vectors are represented as IntegerVector, NumericVector, CharacterVector and LogicalVector in C++. This time, it also makes sense to define the input type as NumericVector and, accordingly, the output type as double since we’d possibly like to supply numbers with decimal places instead of raw integers.

cppFunction('double sumC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')

sumC(seq(0, 1, 0.1))

## [1] 5.5

Matrix input, vector output

Rcpp also comes with a number of so-called ‘sugars’ that help newcomers to find their way by providing C++-equivalents of a number of built-in R functions. A short overview of featured functions is e.g. given by Eddelbuettel and Francois (2011). Among others, these include

math functions, e.g. abs, ceiling, floor, exp, log;
scalar summaries, e.g. mean, min, max, sum, sd, `var;
vector summaries, e.g. cumsum, diff;
finding utilities, e.g. which_max, which_min, match;
finding duplicates, e.g. duplicated, unique.

In order to demonstrate the proper use of such ‘sugars’ and, at the same time, introduce NumericMatrix (NumericVector) as further input (output) variable types, let’s replicate the previous example on the use of apply to calculate mean values from each single variable column of the diamonds dataset using Rcpp functionality. For the sake of simplicity of this demonstration, let’s again focus on the numeric columns only (note also the use of sapply() to create an index vector of (non-)numeric columns).

## subset with numeric columns only
num_cols <- sapply(1:ncol(diamonds), function(i) {
  is.numeric(data.frame(diamonds)[, i])
})
diamonds_sub <- as.matrix(diamonds[, num_cols])

## c++-version of 'colMeans'
cppFunction("NumericVector colMeansC(NumericMatrix x) {
  
  // number of rows and columns
  int nCol = x.ncol();
  int nRow = x.nrow();
  
  // temporary variable of size nrow(x) to store column values in
  NumericVector nVal(nRow);
  
  // initialize output vector
  NumericVector out(nCol);
  
  // loop over each column
  for (int i = 0; i < nCol; i++) {
    
    // values in current column
    nVal = x(_, i);
    
    // store mean of current 'nVal' in 'out[i]'
    out[i] = mean(nVal);
  }
  
  return out;
}")

means <- colMeansC(diamonds_sub)
names(means) <- colnames(diamonds_sub)
means

##        carat        depth        table        price            x 
##    0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572 
##            y            z 
##    5.7345260    3.5387338

## speed check
microbenchmark(
  val_apply <- apply(diamonds_sub, 2, mean), 
  val_cpp <- colMeansC(diamonds_sub)
, times = 20L)

## Unit: milliseconds
##                                       expr      min       lq     mean
##  val_apply <- apply(diamonds_sub, 2, mean) 5.142534 5.187900 5.385949
##         val_cpp <- colMeansC(diamonds_sub) 1.158980 1.170475 1.242834
##    median       uq      max neval cld
##  5.339393 5.476106 6.158233    20   b
##  1.248274 1.308215 1.320326    20  a

## similarity check
identical(val_apply, means)

## [1] TRUE

It’s milliseconds we are talking about here, but still - colMeansC() runs more than 5 times faster as compared to the apply() approach!

What’s the point of that?

You might guess that we did not decide to include this chapter on C++ interconnectivity just for fun. The actual reason is that C++ code performs much faster as compared to R when it comes to for() loops. Without going too much into detail, one of the underlying reason is the very efficient memory management of the C++ language as compared to the massive overhead that R produces during each intermediary step. But find out for yourselves…

Task: sumR() vs. sumC()

Write a function sumR() (do not use the built-in sum function) as an equivalent to the above sumC() function and have a look at the time it takes to run sumR(1:1e4) using system.time() (or microbenchmark()).

## speed check
microbenchmark(
  sum(1:1e4), # built-in `sum` function
  sumR(1:1e4), # base-R version
  sumC(1:1e4)  # rcpp version
, times = 100L)

## Unit: microseconds
##           expr     min       lq      mean  median      uq      max neval
##   sum(1:10000)  13.138  14.3700  15.60138  14.781  15.602   30.792   100
##  sumR(1:10000) 557.936 560.4000 675.62845 562.247 580.106 6391.835   100
##  sumC(1:10000)  29.560  32.8445  34.59753  34.076  35.308   51.319   100
##  cld
##   a 
##    b
##   a

As you can see, sumC() runs more than 40 times faster than sumR() and, at the same time, takes only slightly longer as compared to the highly optimized (because vectorized) built-in sum() function. You cannot imagine what’s possible with Rcpp when it comes to more complex operations!

A short note on the use of sourceCpp

For such short operations, the use of cppFunction() seems reasonable. However, we recommend to use sourceCpp() when it comes to more complex C++ functions. Take, for example, the following peace of C++ code that reproduces the built-in cor() function.

cppFunction('double corC(NumericVector x, NumericVector y) {
  int nx = x.size(), ny = y.size();
  
  if (nx != ny) stop("Input vectors must have equal length!");
  
  double sum_x = sum(x), sum_y = sum(y);
  
  NumericVector xy = x * y;
  NumericVector x_squ = x * x, y_squ = y * y;
  
  double sum_xy = sum(xy);
  double sum_x_squ = sum(x_squ), sum_y_squ = sum(y_squ);
  
  double out = ((nx * sum_xy) - (sum_x * sum_y)) / sqrt((nx * sum_x_squ - pow(sum_x, 2.0)) * (nx * sum_y_squ - pow(sum_y, 2.0)));
  
  return out;
}')

Quite confusing, isn’t it? Not the least because the inline C++ is not formatted properly. Luckily, RStudio comes with a C++ editor that allows you to write stand-alone .cpp functions – including code formatting! For that purpose, select ‘C++ file’ from the top-left drop-down menu and paste the code that we initially passed as character input to cppFunction().

In order to ensure compatibility with Rcpp and make the C++ function available in R, we need to add a header to our .cpp file (see below). In the end, this should look as follows.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double corC(NumericVector x, NumericVector y) {
  int nx = x.size(), ny = y.size();
  
  if (nx != ny) stop("Input vectors must have equal length!");
  
  double sum_x = sum(x), sum_y = sum(y);
  
  NumericVector xy = x * y;
  NumericVector x_squ = x * x, y_squ = y * y;
  
  double sum_xy = sum(xy);
  double sum_x_squ = sum(x_squ), sum_y_squ = sum(y_squ);
  
  double out = ((nx * sum_xy) - (sum_x * sum_y)) / sqrt((nx * sum_x_squ - pow(sum_x, 2.0)) * (nx * sum_y_squ - pow(sum_y, 2.0)));
  
  return out;
}

Save the file in src/corC.cpp, for example, and go back to R. Then run

## source 'corC' function (remember to adjust the path)
sourceCpp("src/corC.cpp")

## correlation of 'carat' and 'price'  
microbenchmark(
  cor(diamonds$carat, diamonds$price), 
  corC(diamonds$carat, diamonds$price) 
, times = 20L)

## Unit: microseconds
##                                  expr     min       lq     mean   median
##   cor(diamonds$carat, diamonds$price) 776.349 795.2335 889.1878 852.5055
##  corC(diamonds$carat, diamonds$price) 720.924 787.0230 837.4791 809.8080
##        uq      max neval cld
##  918.8085 1271.881    20   a
##  851.8895 1225.078    20   a

Wow, corC() performs even faster than the built-in and highly optimized cor() function – at least on my machine. Just imagine the speed gain as compared to a self-written corR() function!

References

Wickham, Hadley. 2014. Advanced R. Chapman & Hall/CRC the R Series. CRC Press.

Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. doi:10.18637/jss.v040.i08.