Lets say you have a data.table with rating factors as columns associated with different coverages. For a simple example, let’s assume a 2-coverage, 2-factor rating plan with a single BI factor bi_occupancy_factor
and a single property factor prop_tiv_factor
. Also assume factors have been attached to a sample cohort of policies based on the risk characteristics of the exposure. We construct the table below:
library("data.table")
= data.table(
DF policyno=paste0("0000", 1:5), locationno=rep(1, 5),
prop_tiv_factor=c(1.112, 1.255, 1.3125, 1.075, 1.3125),
bi_occupancy_factor=c(1.015, 1.0675, 1.0925, 1.0675, 1.0475),
stringsAsFactors=FALSE
)
The resulting table looks like:
policyno locationno prop_tiv_factor bi_occupancy_factor
1: 00001 1 1.1120 1.0150
2: 00002 1 1.2550 1.0675
3: 00003 1 1.3125 1.0925
4: 00004 1 1.0750 1.0675
5: 00005 1 1.3125 1.0475
Our goal is to create a new column which represents a combined property factor. With only a single property factor, this is trivial, since we can simply set the combined property factor to prop_tiv_factor
:
:=prop_tiv_factor] DF[,prop_factor
Resulting in:
policyno locationno prop_tiv_factor bi_occupancy_factor prop_factor
1: 00001 1 1.1120 1.0150 1.1120
2: 00002 1 1.2550 1.0675 1.2550
3: 00003 1 1.3125 1.0925 1.3125
4: 00004 1 1.0750 1.0675 1.0750
5: 00005 1 1.3125 1.0475 1.3125
Assume the decision has been made to incorporate a new property rating factor into the plan, prop_age_bld_factor
. Our revised initial data.table becomes:
DF = data.table(
policyno=paste0("0000", 1:5), locationno=rep(1, 5),
prop_tiv_factor=c(1.112, 1.255, 1.3125, 1.075, 1.3125),
bi_occupancy_factor=c(1.015, 1.0675, 1.0925, 1.0675, 1.0475),
prop_age_bld_factor=c(1.173, 1.21235, 1.0935, 1.2815, 1.1115),
stringsAsFactors=FALSE
)
Creating our combined property rating factor is still straightforward: This time, we simply set prop_factor
to the product of prop_tiv_factor
and prop_age_bld_factor
for every observation in DF:
:=prop_tiv_factor * prop_age_bld_factor] DF[,prop_factor
Which yields:
policyno locationno prop_tiv_factor bi_occupancy_factor prop_age_bld_factor prop_factor
1: 00001 1 1.1120 1.0150 1.17300 1.304376
2: 00002 1 1.2550 1.0675 1.21235 1.521499
3: 00003 1 1.3125 1.0925 1.09350 1.435219
4: 00004 1 1.0750 1.0675 1.28150 1.377613
5: 00005 1 1.3125 1.0475 1.11150 1.458844
Next the request is made to incorporate another property rating factor, this time prop_deductible_factor
. Our dataset now becomes:
= data.table(
DF policyno=paste0("0000", 1:5), locationno=rep(1, 5),
prop_tiv_factor=c(1.112, 1.255, 1.3125, 1.075, 1.3125),
bi_occupancy_factor=c(1.015, 1.0675, 1.0925, 1.0675, 1.0475),
prop_age_bld_factor=c(1.173, 1.21235, 1.0935, 1.2815, 1.1115),
prop_deductible_factor=c(1.025, 1.025, 1.755, 1.025, 1.1665),
stringsAsFactors=FALSE
)
We could continue as before, updating prop_factor
to include prop_deductible_factor
in the product. However, this approach is not scalable, and each time you go into the code to make changes, you increase the likelihood of introducing errors. We need a more general solution to the problem, specifically a method which allows us to perform an operation on some logical grouping of columns, where the exact number (of columns) may not be known until the moment of execution.
Using Reduce
and %like%
The Reduce
function successively applies a function to the elements of an object from left to right or right to left, respectively. The simplest example is using Reduce
in place of sum
:
> Reduce(`+`, c(1, 2, 3, 4, 5))
15] [
We can also replicate prod
> Reduce(`*`, c(1, 2, 3, 4, 5))
1] 120 [
Reduce
is capable of working with more complex data structures as well.
The %like%
function comes from data.table, and works similar to SQL’s LIKE
operator. It is essentially shorthand for a regular expression pattern matching subroutine, which returns a TRUE/FALSE based on whether or not the target matches the pattern. To demonstrate, we identify which month names end in y
:
> month.name %like% "y$"
1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE [
To return the actual month names ending in y
, use the previous expression as a mask:
> month.name[month.name %like% "y$"]
1] "January" "February" "May" "July" [
Note that in regular expression parlance, $
means to match y
only when it occurs at the end of the string. Similarly, ^
indicates the match must occur at the beginning of the string. In addition, .+
matches one or more characters, which can be letters, numbers, punctuation or whitespace. We can leverage Reduce
and %like%
within the context of data.table to take the product of a potentially variable number of factor columns. Using our latest version of DF with 3 prop_*_factor
columns, we have:
:=
DF[,prop_factorReduce(`*`, .SD), .SDcols=names(DF)[names(DF) %like% "^prop.+factor$"]
]
To break this down a bit, note that the value assigned to .SDcols
is doing nothing more than filtering to only those columns starting with prop
and ending with factor
. We can see this by running the expression in isolation:
> names(DF)[names(DF) %like% "^prop.+factor$"
1] "prop_tiv_factor" "prop_age_bld_factor" "prop_deductible_factor" [
.SD
is one of data.table’s special symbols, which serve as shortcuts for frequently used operations. In this particular case, .SD
is a stand-in for the columns we’re interested in taking the product of, and .SDcols
represents the names of the columns over which the operation is to be applied.
In data.table, :=
represents the “assignment-by-reference” operator. Notice that when a column is defined via :=
, the update is made by reference, so it isn’t necessary to re-assign the column name to the table as would typically be necessary when working with standard data.frame objects.
Does our solution generalize to any number of columns? Let’s add a few more factors and test it out:
= data.table(
DF policyno=paste0("0000", 1:5), locationno=rep(1, 5),
prop_tiv_factor=c(1.112, 1.255, 1.3125, 1.075, 1.3125),
bi_occupancy_factor=c(1.015, 1.0675, 1.0925, 1.0675, 1.0475),
prop_age_bld_factor=c(1.173, 1.21235, 1.0935, 1.2815, 1.1115),
prop_deductible_factor=c(1.025, 1.025, 1.755, 1.025, 1.1665),
prop_nbr_stories_factor=c(1.015, 1.015, 1.015, 1.015, 1.1175),
prop_age_roof_factor=c(1.033, 1.0373, 1.3573, 1.033, 1.0373),
bi_protect_class_factor=c(1., 1., 1., 1.25, 1.25),
prop_construction_factor=c(1.0235, 1.0744, 1.1985, 1.0235, 1.0744),
stringsAsFactors=FALSE
)
Which gives us
policyno locationno prop_tiv_factor bi_occupancy_factor prop_age_bld_factor prop_deductible_factor prop_nbr_stories_factor
1: 00001 1 1.1120 1.0150 1.17300 1.0250 1.0150
2: 00002 1 1.2550 1.0675 1.21235 1.0250 1.0150
3: 00003 1 1.3125 1.0925 1.09350 1.7550 1.0150
4: 00004 1 1.0750 1.0675 1.28150 1.0250 1.0150
5: 00005 1 1.3125 1.0475 1.11150 1.1665 1.1175
prop_age_roof_factor bi_protect_class_factor prop_construction_factor
1: 1.0330 1.00 1.0235
2: 1.0373 1.00 1.0744
3: 1.3573 1.00 1.1985
4: 1.0330 1.25 1.0235
5: 1.0373 1.25 1.0744
Our logic should return the product of prop_tiv_factor, prop_age_bld_factor, prop_deductible_factor, prop_nbr_stories_factor, prop_age_roof_factor and prop_construction_factor without modifying our earlier implementation. Let’s compare the results of our dynamic, general expression vs. explicitly specifying the column names to multiply:
# Compare results from explicit and implicit column multiplication.
:=
DF[,prop_factorReduce(`*`, .SD), .SDcols=names(DF)[names(DF) %like% "^prop.+factor$"]
]= DF[,prop_factor]
prop_factor1
# Compute product specifying prop-prefixed columns explicitly.
= DF$prop_tiv_factor DF$prop_age_bld_factor * DF$prop_deductible_factor *
prop_factor2 $prop_nbr_stories_factor * DF$prop_age_roof_factor * DF$prop_construction_factor DF
Comparing results gives
> prop_factor1
1] 1.434765 1.764136 4.158868 1.515323 2.119393
[> prop_factor2
1] 1.434765 1.764136 4.158868 1.515323 2.119393 [