This post demonstrates how the Rcpp library can be leveraged to significantly improve the performance of CPU-bound computations in R programs. Rather than concocting a sample dataset with no real-world significance, I describe a problem I encountered in a project not too long ago, and show how rewriting a poorly-performing computational routine in C++ and calling it from R via Rcpp resulted in a massive reduction in the application's overall runtime. If you aren't interested in the problem domain and would rather skip ahead to the mechanics of building, compiling and calling C++ functions from R, jump to the Rcpp section.
In my work as an Actuarial Consultant at CNA Financial, I've been fortunate enough to contribute to many projects in both the Pricing and Reserving departments. Most recently, I worked as part of a team that created a framework capable of producing expectations and ranges for insurance losses exceeding a pre-specified threshold (>250K, >1M, >5M, etc.). The original implementation was written in R, and although results were produced as expected, generating them took much longer than initially anticipated. Profiling quickly identified the bottleneck: a single subroutine was responsible for over 90% of the application's runtime.
In this post, I'll present a simplified representation of the bottleneck, and we'll demonstrate how replacing the poorly-performing subroutine with an equivalent C++ implementation via Rcpp reduces processing time dramatically.
The examples that follow assume a relatively recent version of R (>=3.2.*) running on Linux with the data.table, Rcpp and foreach packages installed.
Problem Domain
Without delving into too much actuarial-specific detail, the final output is a table of values with thousands of simulated paths for a given accident year, projected from the current evaluation date to ultimate, which is some number of years in the future. The simulated output consists of random draws from a parameterized count distribution at annual increments. For example, if we're interested in projecting large loss count expectations and ranges for accident year 2017 over the next 5 years, assuming evaluation begins at 2018-01-01, the simulation step would generate 10000 random variates at each of 24, 36, 48, 60 and 72 months (since at 2018-01-01, accident year 2017 is already 12 months old, and the number of large losses is known and therefore doesn't require estimation). Setting aside key identifying fields and focusing only on development, the output is a data.table of 10000 rows by 6 columns: the first column ("12") holds development at 12 months, a known quantity, and as such contains 10000 identical values; the 5 remaining development columns contain random samples at 24, 36, 48, 60 and 72 months.
To begin, we create a dataset that resembles the project's simulated output, which will be used throughout the demonstration:
library("data.table")
library("foreach")
set.seed(20160516)
# Create dataset to resemble simulated output.
DF = data.table(
"12"=rep(43,10000),
"24"=runif(n=10000,min=30, max=50),
"36"=runif(n=10000,min=25, max=55),
"48"=runif(n=10000,min=20, max=60),
"60"=runif(n=10000,min=17, max=63),
"72"=runif(n=10000,min=15, max=65)
)
A quick glance at the dataset reveals:
> head(DF)
12 24 36 48 60 72
1: 43 40.84760 46.63734 28.69515 51.08826 46.37697
2: 43 45.65797 51.32795 47.70165 17.11326 36.15512
3: 43 44.87763 40.21962 41.18246 55.49747 22.66363
4: 43 48.73865 28.67403 26.18241 38.03940 52.66633
5: 43 47.95401 49.39142 21.79366 32.23425 41.80831
6: 43 39.66412 54.25155 49.20667 58.87800 19.76825
As just described, column 12 represents the number of large losses from claims occurring in the first twelve months of accident year 2017, assuming an evaluation date of 2018-01-01. Since this is a known quantity, column 12 contains the same value for all 10000 records in the dataset. Visually, the simulated paths can be seen to emanate from a single point of origin, the common origin being the realized number of large losses for accident year 2017 at 12 months of development.
Each of the 10000 paths represents a potential development outcome to ultimate. The intention is to calculate percentiles from the distribution of simulated outcomes at each development period in order to determine a range of possible outcomes for the number of large losses at any point in time after 12 months and up to ultimate (72 months in our example).
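Though not part of the original analysis, here's a minimal sketch of how a handful of these paths might be visualized, assuming the DF created above is in scope (matplot comes from base R's graphics package):
# Plot the first 50 simulated paths; all share the known 12-month
# value, so they fan out from a common origin.
devp  = as.integer(names(DF))     # development periods: 12, 24, ..., 72
paths = t(as.matrix(DF[1:50]))    # one column per simulated path
matplot(
  devp, paths, type="l", lty=1, col="grey40",
  xlab="Development (months)", ylab="Number of large losses",
  main="Simulated large loss development paths"
)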
For the purposes of reporting, the range estimates need to be presented at monthly increments: instead of distributing range estimates from 12 to 72 months at 12-month increments (a 10000x6 data.table), they'll be distributed from 12 to 72 months at 1-month increments (a 10000x61 data.table). Rather than simulating outcomes at each intermediate development period, we simulate outcomes at annual increments, then linearly interpolate at monthly intervals to satisfy the reporting requirement.
To simplify the demonstration a bit further, assume we need only interpolate between 12 and 24 months at 1-month increments, which, for accident year 2017 evaluated at 2018-01-01, represents expected large losses as of 1 month, 2 months, 3 months, … , all the way up to 12 months beyond the evaluation date.
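As a quick illustration of what a single interpolated path looks like, using the first row's 12- and 24-month values from the printout above:
# Linearly interpolate one path between 12 and 24 months at monthly increments.
approx(x=c(12, 24), y=c(43, 40.84760), xout=12:24)$y
# [1] 43.00000 42.82063 42.64127 ... 40.84760 (13 values in total)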
The first implementation relies on R's approx function within a foreach iteration construct. Input will be a 10000x2 data.table, where column 12 represents 2017 large losses as of 12 months (all observations will be identical in this column), and column 24 represents possible loss development outcomes 12 months hence. The task at hand is to transform the 10000x2 data.table into one with dimensions 10000x13, with intermediate columns representing the linearly-interpolated outcomes between 12 and 24 months. Note that the 12 and 24 columns will be identical in both tables:
# ======================================================================
# |
# Tabular interpolation in R => 1st Approach |
# (Pure R implementation) |
# |
# Linearly interpolate intermediate development periods |
# |
# ======================================================================
library("data.table")
library("foreach")
# Simplified data representation: two columns only.
DF = data.table(
"12"=rep(43,10000),"24"=runif(n=10000,min=30, max=50)
)
xvals = as.integer(names(DF)) # same for all 10000 records
evalp = seq(xvals[1], xvals[2]) # evaluation points
yvals = mapply( # a list containing 10000 2-element vectors
function(y0, y1) c(y0, y1),
y0=DF[[1]], y1=DF[[2]], SIMPLIFY=FALSE
)
# Iterate over yvals list, interpolating at each evalp.
interpd0 = foreach(
i=1:length(yvals),.errorhandling="stop",.combine="rbind",
.final=function(m) {
setNames(as.data.table(m,keep.rownames=FALSE),as.character(evalp))
}
) %do% {
approx(x=xvals, y=yvals[[i]], xout=evalp)$y
}
xvals contains the column names of DF as an integer vector, here 12 and 24. yvals is a list of length 10000, with each element containing a 2-element vector holding the values of the 12 and 24 columns for that row. For example, in DF, the first row's value for 12 is 43 and for 24 is 40.84760, so the first element of yvals will be c(43, 40.84760). evalp is a sequence specifying the evaluation points, in this case 12 through 24 inclusive; it gets passed to approx's xout argument.
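A quick sanity check on these objects from the console (output shown as comments):
xvals        # [1] 12 24
evalp        # [1] 12 13 14 15 16 17 18 19 20 21 22 23 24
yvals[[1]]   # [1] 43.0000 40.8476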
After the initial setup, our first implementation iterates over each vector pair in yvals and linearly interpolates between the pair at the specified evaluation points. The results are vertically concatenated (controlled by foreach's .combine option); as a final step, we coerce the resulting dataset into a data.table object and set the fieldnames in accordance with the points of evaluation.
We can time the first implementation using R's proc.time as follows:
# ======================================================================
# |
# Profiling runtime of first tabular interpolation implementation |
# |
# ======================================================================
t_init_0 = proc.time()
interpd0 = foreach(
i=1:length(yvals),.errorhandling="stop",.combine="rbind",
.final=function(m) {
setNames(as.data.table(m,keep.rownames=FALSE),as.character(evalp))
}
) %do% {
approx(x=xvals, y=yvals[[i]], xout=evalp)$y
}
t_total_0 = (proc.time()-t_init_0)[["elapsed"]]
message("Runtime using `approx`: ", t_total_0, " seconds".)
# Prints =>
# Runtime using `approx`: 7.687 seconds.
The first implementation requires almost 8 seconds to transform the 10000x2 data.table into a 10000x13 interpolated representation. If the problem had been limited to a single dataset with 10000 records, it wouldn't have been necessary to look for a more performant solution. However, within the context of the project outlined in the introduction, this procedure was to be carried out at 180 evaluation points, for hundreds of simulated datasets and for multiple large loss thresholds. ~8 seconds per dataset wasn't going to cut it. Enter Rcpp.
Rcpp
Rcpp is a third-party R library that greatly simplifies the process of extending R with compiled code extensions, namely of the C++ variety. A typical usage scenario is to write a function or other callable in C++; the C++ source file is then compiled into a shared library, which can be called from R just like any other function. One of the features that makes Rcpp so powerful is that all of R's data structures are available and can be leveraged in any user-defined extension module, without the need to manually allocate and deallocate memory.
Although Rcpp has several methods by which compiled C++ functions can be loaded into R, we focus on one: writing the C++ function in a separate file with extension .cpp, then loading it into R using sourceCpp (provided by Rcpp).
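As a minimal illustration of the workflow (a toy example, not part of the project code), suppose a file named times2.cpp contains:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector times2(NumericVector x) {
  // Rcpp "sugar" provides vectorized arithmetic on NumericVector.
  return x * 2;
}
Calling Rcpp::sourceCpp("times2.cpp") compiles the file and exports times2 into the global environment, where it can be invoked like any R function; for example, times2(c(1, 2, 3)) returns c(2, 4, 6).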
My first attempt at optimizing the interpolation routine consisted of replacing R's builtin approx with an equivalent version written in C++. The thinking was that the blame for the bottleneck could be placed squarely on the shoulders of approx, since the repeated call to approx was the only computationally intensive call at each iteration.
Recall that the mathematical expression for linear interpolation between the points $(x_0, y_0)$ and $(x_1, y_1)$ is given by:

$$y = y_0 + \frac{y_1 - y_0}{x_1 - x_0}(x - x_0)$$
The C++ interpolation routine was straightforward to implement, and was named approxc1 to distinguish it from the builtin. I assumed it would serve as a drop-in replacement for approx, but with improved performance. What follows is the definition of approxc1, found in approxc1.cpp:
#include <Rcpp.h>
using namespace Rcpp;
// =============================================================================
// approxc1.cpp |
// =============================================================================
// This is `approxc1`, C++-implementation of R's native approx function. |
// Note that x0, x1, y0 and y1 are scalars. |
// evalPts is a numeric vector which represents the points of evaluation. |
// Returns a vector the same length as evalPts representing the interpolated |
// values. |
// =============================================================================
// [[Rcpp::export]]
NumericVector approxc1(int x0, int x1, double y0, double y1, NumericVector evalPts) {
int n = evalPts.size();
double quotient = ((y1 - y0)/(x1 - x0));
NumericVector vout(n);
for(int i=0; i<n; i++) {
vout[i] = (quotient * (evalPts[i] - x0) + y0);
}
return(vout);
}
approxc1 interpolates a single row at a time, returning a vector of interpolated values at the points given by evalPts. approxc1 takes the x* and y* input arguments as scalars instead of vectors, but we could very easily have written approxc1 to have the same call signature as approx.
The next step was to test approxc1 on the same dataset to get an idea of the performance improvement. We load approxc1.cpp into the global environment using the sourceCpp function:
library("data.table")
library("foreach")
library("Rcpp")
# Build shared library containing approxc1 and load into global environment.
Rcpp::sourceCpp("approxc1.cpp", rebuild=TRUE)
t_init_1 = proc.time()
# Iterate over yvals list, interpolating at each evalp.
interpd1 = foreach(
i=1:length(yvals),.errorhandling="stop",.combine="rbind",
.final=function(m) {
setNames(as.data.table(m,keep.rownames=FALSE),as.character(evalp))
}
) %do% {
approxc1(
x0=xvals[1], x1=xvals[2], y0=yvals[[i]][1], y1=yvals[[i]][2],
evalPts=evalp
)
}
t_total_1 = (proc.time()-t_init_1)[["elapsed"]]
message("Runtime using `approxc1`: ", t_total_1, " seconds".)
# Prints =>
# Runtime using `approxc1`: 6.195. seconds.
This was definitely not on par with expectations! A reduction in runtime of only ~1.5 seconds, from 7.687 down to 6.195, seemed paltry, and more significantly, would do little to reduce the overall runtime of the process. But this first failed attempt was not a total loss, since we learned that approx wasn't the bottleneck after all; in fact, taken in isolation, it performs just as well as the C++ implementation because it is itself implemented in C/C++. If you type approx at R's interactive console without arguments or parens and press enter, the body of the function will be printed. Near the bottom, we find the following line starting with yout:
yout <- .Call(C_Approx, x, y, xout, method, yleft, yright,
Whenever approx is called from R, it invokes C_Approx, which we have reason to believe is the C-implemented interpolation routine for which approx functions as a wrapper.
If approx wasn't the bottleneck, the only other construct that could adversely affect processing time was the iteration scheme itself, which in this case is the foreach construct. The next implementation removed the call to foreach, and instead relied on the C++ function to handle both the column and row-wise iteration. Unlike approxc1, which returned a vector representing a single row of interpolated values at each evaluation point, approxc2 returns a matrix, which is coerced to a data.table upon returning to the caller. Another difference between approxc1 and approxc2 is that in approxc2, y0 and y1 are vectors, not scalars. The x parameter is a 2-element vector, which still represents the end-points for interpolation, but it is passed as a single entity instead of separate scalar values as in approxc1. Here is the definition of approxc2, found in approxc2.cpp:
#include <Rcpp.h>
using namespace Rcpp;
// =============================================================================
// approxc2.cpp |
// =============================================================================
// This is `approxc2`, a C++ routine that performs tabular interpolation. |
// Takes as input 2 vectors to interpolate between (y0, y1), a 2-element vector|
// containing the x-values (`x`), and the points at which the interpolation |
// should take place (`evalPts`). |
// Returns a matrix of dimension length(y0/y1) rows by length(evalPts) columns.|
// =============================================================================
// [[Rcpp::export]]
NumericMatrix approxc2(NumericVector x, NumericVector y0, NumericVector y1, NumericVector evalPts) {
  // approxc2 returns a matrix of y0.size() rows by evalPts.size()
  // columns containing interpolated values at each of evalPts
  // between y0[i] and y1[i]. `x` is simply a 2-element vector
  // containing the bounds for interpolation.
if(y0.size()!=y1.size()) stop("Vectors must have the same dimensionality.");
int x0 = x[0];
int x1 = x[1];
int nrows = y0.size();
int ncols = evalPts.size();
double quotient = 0.0;
NumericMatrix mat(nrows, ncols);
for(int i=0; i<nrows; i++) {
quotient = ((y1[i] - y0[i])/(x1 - x0));
for(int j=0; j<ncols; j++) {
mat(i, j) = (quotient * (evalPts[j] - x0) + y0[i]);
}
}
return(mat);
}
The implementation of approxc2 is straightforward: the outer loop iterates by row, the inner loop by column. We specify a return type of NumericMatrix, which we can do without having to concern ourselves with any of the details of memory management. Also note that there are more efficient approaches to populating a matrix via Rcpp (see, for example, the RcppArmadillo library), but for our purposes, approxc2 will suffice.
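As a quick sanity check before benchmarking (toy inputs, values chosen purely for illustration):
# Interpolate two short paths between 12 and 24 months.
approxc2(x=c(12, 24), y0=c(43, 43), y1=c(40, 50), evalPts=12:24)
# Returns a 2x13 matrix: row 1 steps down from 43 to 40 and row 2
# steps up from 43 to 50, both in equal monthly increments.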
We tested approxc2 on the same dataset used to benchmark the approx and approxc1 implementations:
library("data.table")
library("Rcpp")
# Build shared library containing approxc2 and load into global environment.
Rcpp::sourceCpp("approxc2.cpp", rebuild=TRUE)
t_init_2 = proc.time()
interpd2 = as.data.table(
approxc2(x=xvals, y0=DF[[1]], y1=DF[[2]], evalPts=evalp),
keep.rownames=FALSE
)
setnames(interpd2, as.character(evalp))
t_total_2 = (proc.time()-t_init_2)[["elapsed"]]
message("Runtime using `approxc2`: ", t_total_2, " seconds".)
# Prints =>
# Runtime using `approxc2`: 0.0150000000000006 seconds.
The performance improvement in going from approxc1 to approxc2 was quite astonishing: runtime dropped from 6.195 seconds down to 0.015 seconds, representing a roughly 400x speedup!
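Note that timings this small are near the resolution of proc.time. For a more reliable picture, a package such as microbenchmark could be used (a sketch, assuming microbenchmark is installed; not part of the original benchmark):
library("microbenchmark")
# Repeat the approxc2 call 25 times and summarize the timing distribution.
microbenchmark(
  rcpp = approxc2(x=xvals, y0=DF[[1]], y1=DF[[2]], evalPts=evalp),
  times = 25L
)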
To verify that interpd0, interpd1 and interpd2 all contain the same interpolated values, from the R interactive console we run:
> all(c(all.equal(interpd0, interpd1), all.equal(interpd1, interpd2)))
[1] TRUE
Then, by the transitive property, if interpd0==interpd1 and interpd1==interpd2, then interpd0==interpd2, and all implementations return an identically interpolated data.table.
We now combine the separate test blocks into a single consolidated view:
library("data.table")
library("foreach")
library("Rcpp")
# Build shared libraries and load into global environment.
Rcpp::sourceCpp("approxc1.cpp", rebuild=TRUE)
Rcpp::sourceCpp("approxc2.cpp", rebuild=TRUE)
# `approx` implementation =======================================>
t_init_0 = proc.time()
interpd0 = foreach(
i=1:length(yvals),.errorhandling="stop",.combine="rbind",
.final=function(m) {
setNames(as.data.table(m,keep.rownames=FALSE),as.character(evalp))
}
) %do% {
approx(x=xvals, y=yvals[[i]], xout=evalp)$y
}
t_total_0 = (proc.time()-t_init_0)[["elapsed"]]
# `approxc1` implementation =====================================>
t_init_1 = proc.time()
# Iterate over yvals list, interpolating at each evalp.
interpd1 = foreach(
i=1:length(yvals),.errorhandling="stop",.combine="rbind",
.final=function(m) {
setNames(as.data.table(m,keep.rownames=FALSE),as.character(evalp))
}
) %do% {
approxc1(
x0=xvals[1], x1=xvals[2], y0=yvals[[i]][1], y1=yvals[[i]][2],
evalPts=evalp
)
}
t_total_1 = (proc.time()-t_init_1)[["elapsed"]]
# `approxc2` implementation =====================================>
t_init_2 = proc.time()
interpd2 = as.data.table(
approxc2(x=xvals, y0=DF[[1]], y1=DF[[2]], evalPts=evalp),
keep.rownames=FALSE
); setnames(interpd2, as.character(evalp))
t_total_2 = (proc.time()-t_init_2)[["elapsed"]]
# Display results:
message("Runtime using `approx`: ", t_total_0, " seconds".)
message("Runtime using `approxc1`: ", t_total_1, " seconds".)
message("Runtime using `approxc2`: ", t_total_2, " seconds".)
# Prints =>
# Runtime using `approx`: 7.687 seconds.
# Runtime using `approxc1`: 6.195 seconds.
# Runtime using `approxc2`: 0.0150000000000006 seconds.
Conclusion
In this post, we learned that, at least in some cases, computational bottlenecks in R applications aren't necessarily due to R builtins themselves, but may instead be attributed to the iteration scheme implemented in R. By moving the iteration into a compiled code extension, we were able to drastically reduce the runtime required to interpolate a table of values. This is what makes Rcpp so powerful: you get the performance benefits of C++ without having to deal with many of the more challenging aspects of working with a compiled, statically-typed programming language. Until next time, happy coding!