For the uninitiated, data.table is a third-party package for the R programming language that provides a high-performance version of base R’s data.frame, with syntax and feature enhancements for ease of use, convenience and programming speed [1]. I was first introduced to data.table when I began my career at CNA, and as a consequence of working with it on a daily basis for a few years, I have become well acquainted with the package and many of its offerings.

This post highlights aggregation and aggregate operations within the context of data.table. Although the syntax for the most common aggregate operations is straightforward, subtle syntactic variations in slightly more complex expressions can produce unexpected result sets. After working through the examples in this post, you should be in a good position to leverage data.table for all of your data aggregation needs. To follow along, you’ll need a relatively recent installation of R, in addition to the following third-party libraries:
data.table
CASdatasets (sample insurance datasets)

Aggregation

Our sample dataset, nzcathist, can be found in the CASdatasets package and represents 145 catastrophic events that occurred in New Zealand between 1968 and 2014, along with the associated cost in millions, both as originally reported (LOSS_AMT_ORIG) and adjusted to 2014 dollars (LOSS_AMT_2014). In the pre-processing step, we reduce the dataset down to only the requisite columns (an already pre-processed version of this dataset in .csv format can be downloaded here):

library("data.table")
library("CASdatasets")

data(nzcathist) # available in `CASdatasets` package

DFInit        = setDT(nzcathist)

# Retain the event fields plus the two cost columns (original and 2014-normalized).
# NOTE: "OriginalCost" is assumed to be the name of the unadjusted-cost column in
# nzcathist; check names(nzcathist) and substitute the actual name if it differs.
keepFields    = c("Year", "Quarter", "Type", "Location", "OriginalCost", "NormCost2014")
newFieldNames = c("YEAR", "QUARTER", "TYPE", "LOCATION", "LOSS_AMT_ORIG", "LOSS_AMT_2014")
DF            = DFInit[,..keepFields]
setnames(DF, newFieldNames)


Inspecting the first few records reveals:

> head(DF)

   YEAR QUARTER         TYPE                LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014       4 Power outage                Auckland           2.0          2.00
2: 2014       2        Flood          Nelson, Tasman           2.7          2.72
3: 2014       2      Weather North and South Islands          37.6         37.85
4: 2014       2 Flood, Storm                      NA          55.3         55.67
5: 2014       1      Cyclone                      NA           3.6          3.63
6: 2014       1        Storm              Canterbury           4.8          4.84


Note that in the pre-processing step, we coerced nzcathist, originally a data.frame, into a data.table by calling the setDT function. setDT coerces data.frames to data.tables by reference, as opposed to as.data.table, which creates an independent copy of the original data.frame in memory. For smaller datasets, the difference between the two approaches will be negligible, but as the size of the data begins to scale, calling as.data.table may result in serious performance degradation.
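A minimal sketch of the difference (df and dt1 are hypothetical names used only for illustration):

df  = data.frame(x = 1:3, y = letters[1:3])

dt1 = as.data.table(df)   # independent copy; df itself is unchanged
setDT(df)                 # converts df in place, by reference; no copy is made
class(df)                 # "data.table" "data.frame"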



Before jumping in, I thought it helpful to reproduce a diagram found in the official data.table documentation. It has been included as a way to introduce/clarify data.table’s general syntax:

            DT[ i, j, by ] # + extra arguments
                |  |  |
                |  |   -------> grouped by what?
                |   -------> what to do?
                 ---> on which rows?


In the official documentation, beneath the diagram reads: “Take DT, subset rows by i, then compute j grouped by by.” Throughout the rest of the post, any reference to i, j or by implicitly assumes this context. It should also be assumed that all datasets have been coerced to data.table objects, even when the conversion isn’t shown explicitly in a given code block.
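As a quick, concrete mapping of the three arguments onto our dataset (output omitted, since each piece is unpacked step by step below):

DF[YEAR>2010, sum(LOSS_AMT_2014), by="QUARTER"]   # i filters rows, j computes, by groups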

We start with the common task of calculating the sum of values held within a single column. Note that the following returns a scalar:

> DF[,sum(LOSS_AMT_2014)]
[1] 19977.66


To compute the sum of LOSS_AMT_ORIG and LOSS_AMT_2014 over all records, we run the following, which returns a single-record data.table with two columns:

> DF[,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
        LOSS_AMT_2014=sum(LOSS_AMT_2014))]

   LOSS_AMT_ORIG LOSS_AMT_2014
1:      18065.72      19977.66


.() is shorthand for list. We can replace the expression above with a call to list and an identical result will be obtained:

> DF[,list(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
           LOSS_AMT_2014=sum(LOSS_AMT_2014))]

   LOSS_AMT_ORIG LOSS_AMT_2014
1:      18065.72      19977.66        


It may be of interest to calculate the total amounts of LOSS_AMT_ORIG and LOSS_AMT_2014 by YEAR. This is accomplished by including by:

> DF[,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
        LOSS_AMT_2014=sum(LOSS_AMT_2014)), by="YEAR"]                   

    YEAR LOSS_AMT_ORIG LOSS_AMT_2014
 1: 2014        128.50        129.42
 2: 2013        206.40        210.01
 3: 2012          8.70          8.95
 4: 2011      16557.50      17380.70
 5: 2010         68.75         74.66
 6: 2009          6.01          6.64
 7: 2008         86.26         96.42
 8: 2007        125.80        147.40
 9: 2006         50.40         60.60
10: 2005         60.40         75.88
11: 2004        145.15        187.59
12: 2003          3.30          4.31
13: 2002         25.65         34.34
14: 2001          2.60          3.57
15: 2000         23.10         32.18
16: 1999         55.50         79.86
17: 1998         35.20         50.68
18: 1997         14.40         21.03
19: 1996         11.80         17.34
20: 1995         12.60         19.04
21: 1994         18.30         28.77
22: 1993          7.60         12.06
23: 1992          8.10         13.07
24: 1991          3.10          5.07
25: 1990          1.80          2.99
26: 1988         56.10        106.21
27: 1987        192.00        398.17
28: 1986         19.30         47.34
29: 1985         17.20         47.28
30: 1984         51.10        160.59
31: 1983          5.80         18.86
32: 1981          9.00         37.04
33: 1980         11.30         53.33
34: 1978         10.30         60.96
35: 1976          6.20         51.61
36: 1975          7.00         63.35
37: 1968         13.50        230.34
    YEAR LOSS_AMT_ORIG LOSS_AMT_2014


In order to limit the result to only YEAR>2010, include the condition in i:

> DF[YEAR>2010,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
                 LOSS_AMT_2014=sum(LOSS_AMT_2014)), by="YEAR"]           

   YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014         128.5        129.42
2: 2013         206.4        210.01
3: 2012           8.7          8.95
4: 2011       16557.5      17380.70


by can be substituted with keyby. keyby groups exactly as by does, but additionally runs setkey() on the result using the specified column(s), so the returned table comes back sorted by (and keyed on) those columns. keyby can therefore be used whenever the result requires ordering upon return. Notice that our previous result was displayed with YEAR in descending order. Using keyby instead arranges the output in ascending order with respect to YEAR:

> DF[YEAR>2010,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
                 LOSS_AMT_2014=sum(LOSS_AMT_2014)), keyby="YEAR"]     

   YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2011       16557.5      17380.70
2: 2012           8.7          8.95
3: 2013         206.4        210.01
4: 2014         128.5        129.42
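Because keyby also sets the key on the returned table, the ordering can be verified by inspecting the key of the result (res is a hypothetical name used only for illustration):

res = DF[YEAR>2010,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
                     LOSS_AMT_2014=sum(LOSS_AMT_2014)), keyby="YEAR"]
key(res)   # returns "YEAR"; with by= in place of keyby=, the result carries no key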


Note that if the fieldname rebinding is omitted and only the aggregate expressions over the target fields are included, the same result is obtained, but the aggregated fields are assigned default names, in this case V1 and V2:

> DF[YEAR>2010,.(sum(LOSS_AMT_ORIG),sum(LOSS_AMT_2014)),by="YEAR"]

   YEAR      V1       V2
1: 2014   128.5   129.42
2: 2013   206.4   210.01
3: 2012     8.7     8.95
4: 2011 16557.5 17380.70


Special Symbols in Aggregate Operations

In an earlier post, I introduced data.table’s special symbols, which are read-only symbols that can be used in j [2]. The .SD special symbol is particularly useful for aggregate operations, along with .SDcols, which is used in conjunction with .SD. .SD, .SDcols and by will be used in the sample expressions that follow, so it is instructive to provide a qualitative description of the role each plays within the context of data.table:


.SD Stands for Subset of Data.table. By default it holds all columns except those mentioned in by (or keyby). Can only be used in j.


.SDcols Specifies which columns .SD should contain, i.e., the columns to be aggregated (which, for the operations below, should be numeric). Fields may be given as a character vector of fieldnames or as column-positional indices.


by Similar to SQL’s GROUP BY. Specifies the columns to group on, i.e., the partitions within which the aggregate operation in j is carried out.
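To get a concrete sense of what .SD holds, it can simply be printed for each group, a trick borrowed from the package vignettes; here we only care about the printed subsets, not the returned value (output omitted):

DF[YEAR>2012, print(.SD), by="TYPE"]   # prints one subset per TYPE, minus the TYPE column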

Using the special symbol syntax, we can reproduce the output returned in our original examples. First, taking the sum of LOSS_AMT_ORIG and LOSS_AMT_2014 over all records without partitioning:

> DF[,lapply(.SD, sum, na.rm=TRUE),
      .SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014")]

   LOSS_AMT_ORIG LOSS_AMT_2014
1:      18065.72      19977.66


Partitioning LOSS_AMT_ORIG and LOSS_AMT_2014 by YEAR:

> DF[,lapply(.SD, sum, na.rm=TRUE),
      .SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014"),
          by="YEAR"]

    YEAR LOSS_AMT_ORIG LOSS_AMT_2014
 1: 2014        128.50        129.42
 2: 2013        206.40        210.01
 3: 2012          8.70          8.95
 4: 2011      16557.50      17380.70
                .
                .
                .


With only 6 columns in our sample dataset, the convenience of .SD may not be readily apparent. But as the number of columns grows, the ability to use .SD as a stand-in for all fields not mentioned in by (or for exactly those specified via .SDcols) becomes an incredibly useful abstraction.
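For wide tables, the columns handed to .SDcols can also be selected by pattern rather than enumerated explicitly; a minimal sketch, assuming a data.table release recent enough (1.12.0 or later) to support patterns():

DF[,lapply(.SD, sum, na.rm=TRUE),
     .SDcols=patterns("^LOSS_"),     # every column whose name begins with "LOSS_"
         by="YEAR"]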
We can perform other aggregate operations over LOSS_AMT_ORIG and LOSS_AMT_2014. Calculating the average loss amounts by TYPE is as simple as replacing sum and YEAR in the last expression:

> DF[,lapply(.SD, mean, na.rm=TRUE),
      .SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014"),
          by="TYPE"]

            TYPE LOSS_AMT_ORIG LOSS_AMT_2014
 1: Power outage      6.100000      8.350000
 2:        Flood      5.443103     12.133448
 3:      Weather     13.000000     14.680000
 4: Flood, Storm     18.511111     20.683333
 5:      Cyclone     11.350000     20.267500
 6:        Storm     17.932813     24.539062
 7:   Earthquake   2393.618571   2541.094286
 8:      Tornado      3.540000      4.425455
 9:    Hailstorm      5.437500      7.130000
10:        Other      6.220000     25.265000


.SD can also be used to return the first N records of each subgroup. To demonstrate, let’s say we’re interested in returning the 3 costliest catastrophic events by TYPE from our 145-record dataset. With 10 distinct event types, we expect 30 records to be returned. In addition, we specify that the result be returned in descending order with respect to LOSS_AMT_2014:

> DF[order(-LOSS_AMT_2014),.SD[1:3], by="TYPE"]

            TYPE YEAR QUARTER                          LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1:   Earthquake 2011       1                        Canterbury       16500.0      17320.68
 2:   Earthquake 1987       1                     Bay of Plenty         192.0        398.17
 3:   Earthquake 2007       4                          Gisborne          30.5         35.38
 4:        Other 1968       1                            Wahine          10.0        170.62
 5:        Other 2005       2             BOP Tauranga , Matata          28.5         35.65
 6:        Other 1985       1    Thames , Coromandel , Te Aroha           5.9         16.35
 7:        Storm 2004       1                Lower North Island         112.0        145.25
 8:        Storm 2013       3                        Nationwide          74.5         75.50
 9:        Storm 2007       3 Far North , Auckland , Coromandel          60.5         71.01
10:        Flood 1984       1          Invercargill , Southland          45.8        143.93
11:        Flood 1999       4         Queenstown Lakes District          46.1         66.26
12:        Flood 1978       4                             Otago          10.3         60.96
13:      Cyclone 1988       1                                NA          37.0         70.41
14:      Cyclone 1997       1                                NA           3.2          4.69
15:      Cyclone 2014       1                                NA           3.6          3.63
16: Flood, Storm 2014       2                                NA          55.3         55.67
17: Flood, Storm 2013       2              Nelson,Bay of Plenty          46.2         47.26
18: Flood, Storm 2002       2                      North Island          21.5         28.73
19:      Weather 2014       2           North and South Islands          37.6         37.85
20:      Weather 1996       3                                NA           8.1         11.93
21:      Weather 2000       3                      North Island           4.2          5.87
22:    Hailstorm 1994       1                          Hastings          10.8         17.13
23:    Hailstorm 2005       1                      Christchurch          13.0         16.40
24:    Hailstorm 2008       4                        Canterbury          11.2         12.57
25: Power outage 1998       2                           Mercury          10.2         14.70
26: Power outage 2014       4                          Auckland           2.0          2.00
27: Power outage   NA      NA                                NA            NA            NA
28:      Tornado 2005       1                         Greymouth           9.2         11.61
29:      Tornado 2007       3                    Tornado events           8.3          9.74
30:      Tornado 2012       4                          Auckland           8.7          8.95



.SD[1:3] specifies the number of rows to return within each subgroup (in this case 3), and the - preceding LOSS_AMT_2014 in i sorts the rows in descending order before grouping, which is why the costliest losses appear first within each subgroup. Notice record 27: there are only 2 Power outage events, so indexing with 1:3 pads that group with a row of NAs; the variant shown below avoids this.
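If the NA padding is unwanted, head() can be substituted for the explicit index, since head() returns at most the requested number of rows per subgroup:

DF[order(-LOSS_AMT_2014), head(.SD, 3), by="TYPE"]   # groups with fewer than 3 events are not padded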

Because DF is ordered roughly reverse-chronologically (most recent events first), selecting the first and last rows of each subgroup returns the most recent and the earliest recorded event of each TYPE:

> DF[,.SD[c(1,.N)],by="TYPE"]

            TYPE YEAR QUARTER                              LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1: Power outage 2014       4                              Auckland           2.0          2.00
 2: Power outage 1998       2                               Mercury          10.2         14.70
 3:        Flood 2014       2                        Nelson, Tasman           2.7          2.72
 4:        Flood 1976       1              Wellington , Hutt Valley           6.2         51.61
 5:      Weather 2014       2               North and South Islands          37.6         37.85
 6:      Weather 1996       4                                    NA           2.1          3.07
 7: Flood, Storm 2014       2                                    NA          55.3         55.67
 8: Flood, Storm 1994       4                         North , South           6.0          9.26
 9:      Cyclone 2014       1                                    NA           3.6          3.63
10:      Cyclone 1988       1                                    NA          37.0         70.41
11:        Storm 2014       1                            Canterbury           4.8          4.84
12:        Storm 1968       1                                Wahine           3.5         59.72
13:   Earthquake 2013       3                           Cook Strait          16.2         16.42
14:   Earthquake 1987       1                         Bay of Plenty         192.0        398.17
15:      Tornado 2012       4                              Auckland           8.7          8.95
16:      Tornado 1980       3                              Onehunga           1.0          4.53
17:    Hailstorm 2009       2                         Bay of Plenty           2.3          2.56
18:    Hailstorm 1994       1                              Hastings          10.8         17.13
19:        Other 2007       4 Central North and Lower South Islands           4.8          5.57
20:        Other 1968       1                                Wahine          10.0        170.62


We can further leverage the .N special symbol, which holds the number of rows in the current group, to obtain a count of catastrophic events by TYPE:

> DF[,.N, by="TYPE"]

            TYPE  N
 1: Power outage  2
 2:        Flood 58
 3:      Weather  4
 4: Flood, Storm  9
 5:      Cyclone  4
 6:        Storm 32
 7:   Earthquake  7
 8:      Tornado 11
 9:    Hailstorm  8
10:        Other 10


Or by TYPE and YEAR since 2010:

> DF[YEAR>=2010,.N,by=c("TYPE","YEAR")]

            TYPE YEAR N
 1: Power outage 2014 1
 2:        Flood 2014 1
 3:      Weather 2014 1
 4: Flood, Storm 2014 1
 5:      Cyclone 2014 1
 6:        Storm 2014 2
 7:        Storm 2013 3
 8:   Earthquake 2013 2
 9:        Flood 2013 1
10: Flood, Storm 2013 1
11:      Tornado 2012 1
12:   Earthquake 2011 1
13: Flood, Storm 2011 1
14:        Flood 2011 3
15:      Tornado 2011 2
16:        Flood 2010 3
17:        Storm 2010 2
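To rank event types by frequency, the count can be chained directly into a second pair of brackets (output omitted):

DF[,.N, by="TYPE"][order(-N)]   # most frequent TYPE first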


Assignment by Reference

New fields holding the result of an aggregate calculation can be appended to an existing data.table. For example, suppose we restrict the original catastrophe dataset to only those events occurring in 1997:

> DF97 = DF[YEAR==1997,]
> DF97

    YEAR QUARTER      TYPE             LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1: 1997       2     Flood            Northland           1.2          1.76
 2: 1997       3     Flood             Auckland           0.7          1.02
 3: 1997       3     Flood           Coromandel           0.5          0.73
 4: 1997       2     Flood             Auckland           3.7          5.42
 5: 1997       1     Storm         South Island           1.1          1.61
 6: 1997       4     Storm         South Island           0.2          0.29
 7: 1997       4 Hailstorm    Southland , Otago           0.4          0.58
 8: 1997       1   Cyclone                   NA           3.2          4.69
 9: 1997       2     Flood               Wairoa           0.5          0.73
10: 1997       4     Storm North , South Island           2.9          4.20


A new column can be added representing the proportion each event contributed to the total loss amount for YEAR=1997 (note that in this example, LOSS_AMT_2014 is targeted as the proportionality basis):

> DF97[,PROPORTION:=LOSS_AMT_2014/sum(LOSS_AMT_2014)][]

    YEAR QUARTER      TYPE             LOCATION LOSS_AMT_ORIG LOSS_AMT_2014 PROPORTION
 1: 1997       2     Flood            Northland           1.2          1.76 0.08368997
 2: 1997       3     Flood             Auckland           0.7          1.02 0.04850214
 3: 1997       3     Flood           Coromandel           0.5          0.73 0.03471232
 4: 1997       2     Flood             Auckland           3.7          5.42 0.25772706
 5: 1997       1     Storm         South Island           1.1          1.61 0.07655730
 6: 1997       4     Storm         South Island           0.2          0.29 0.01378982
 7: 1997       4 Hailstorm    Southland , Otago           0.4          0.58 0.02757965
 8: 1997       1   Cyclone                   NA           3.2          4.69 0.22301474
 9: 1997       2     Flood               Wairoa           0.5          0.73 0.03471232
10: 1997       4     Storm North , South Island           2.9          4.20 0.19971469


This expression contains the := operator, which, within the context of data.table, performs assignment-by-reference. The empty pair of brackets at the end is included because without it, no result is printed when columns are assigned by reference.
This example introduces a point worth remembering: when updating a data.table with assignment-by-reference (or its functional form equivalent), the dataset will not be collapsed. The result of the aggregate operation, computed per subgroup when by is supplied, is instead written back onto every record of the corresponding subgroup. Stated more succinctly, the assignment-by-reference operator never reduces the number of rows.
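For instance, a grouped assignment-by-reference writes each group’s total onto every row of that group rather than collapsing the table; a minimal sketch (YEAR_TOTAL_2014 is a hypothetical column name used only for illustration):

DF[, YEAR_TOTAL_2014:=sum(LOSS_AMT_2014), by="YEAR"]   # DF keeps all 145 rows; each row gains its year's total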

One final scenario I’d like to cover is updating data.table columns while retaining the original table shape (at least in one dimension). Although this isn’t considered an aggregate operation or function as such, it is somewhat related and will be included here nonetheless.

The following example is contrived, but nicely illustrates the point. Assume we’d like to add 5 to every cell in LOSS_AMT_ORIG and LOSS_AMT_2014 in the DF97 subset. The first approach will be to leverage .SD with lapply while reassigning LOSS_AMT_ORIG and LOSS_AMT_2014:

# Specify numeric fields =>
numFields = c("LOSS_AMT_ORIG", "LOSS_AMT_2014")

> DF97[,(numFields):=
        lapply(.SD, function(v) v+5),
            .SDcols=numFields][]

    YEAR QUARTER      TYPE             LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1: 1997       2     Flood            Northland           6.2          6.76
 2: 1997       3     Flood             Auckland           5.7          6.02
 3: 1997       3     Flood           Coromandel           5.5          5.73
 4: 1997       2     Flood             Auckland           8.7         10.42
 5: 1997       1     Storm         South Island           6.1          6.61
 6: 1997       4     Storm         South Island           5.2          5.29
 7: 1997       4 Hailstorm    Southland , Otago           5.4          5.58
 8: 1997       1   Cyclone                   NA           8.2          9.69
 9: 1997       2     Flood               Wairoa           5.5          5.73
10: 1997       4     Storm North , South Island           7.9          9.20


It is necessary to surround numFields with (), otherwise numFields itself would be treated as the name of a single new column to assign by reference. Notice that this has the desired effect of adding 5 to each cell in the target columns.
The second approach highlights what is referred to as the functional form in data.table parlance. It allows multiple columns to be assigned by reference in a single call using name=value pairs. When used this way, := must be surrounded by either backticks (`:=`) or quotes (":="). Starting again from a freshly created DF97, so that both approaches operate on the same values, we demonstrate:

> DF97[, `:=` 
      (LOSS_AMT_ORIG=LOSS_AMT_ORIG+5,
       LOSS_AMT_2014=LOSS_AMT_2014+5)][]

    YEAR QUARTER      TYPE             LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1: 1997       2     Flood            Northland           6.2          6.76
 2: 1997       3     Flood             Auckland           5.7          6.02
 3: 1997       3     Flood           Coromandel           5.5          5.73
 4: 1997       2     Flood             Auckland           8.7         10.42
 5: 1997       1     Storm         South Island           6.1          6.61
 6: 1997       4     Storm         South Island           5.2          5.29
 7: 1997       4 Hailstorm    Southland , Otago           5.4          5.58
 8: 1997       1   Cyclone                   NA           8.2          9.69
 9: 1997       2     Flood               Wairoa           5.5          5.73
10: 1997       4     Storm North , South Island           7.9          9.20


Finally, as a sort of anti-example, we demonstrate what happens when an aggregate operation is combined with the functional form of the assignment-by-reference operator and column rebinding. Suppose we entered the following against the full dataset:

> DF[, `:=` 
    (LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
     LOSS_AMT_2014=sum(LOSS_AMT_2014))]


One might assume this expression returns the sum of LOSS_AMT_ORIG and LOSS_AMT_2014 over all records, resulting in a 1-row by 2-column data.table. However, the expression uses the assignment-by-reference operator in functional form, and as mentioned above, assignment-by-reference never reduces the number of rows. The expression therefore overwrites every cell of LOSS_AMT_ORIG and LOSS_AMT_2014 with the corresponding non-partitioned column total. Here’s a look at the first 10 records:

> head(DF, 10)

    YEAR QUARTER         TYPE                          LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
 1: 2014       4 Power outage                          Auckland      18065.72      19977.66
 2: 2014       2        Flood                    Nelson, Tasman      18065.72      19977.66
 3: 2014       2      Weather           North and South Islands      18065.72      19977.66
 4: 2014       2 Flood, Storm                                NA      18065.72      19977.66
 5: 2014       1      Cyclone                                NA      18065.72      19977.66
 6: 2014       1        Storm                        Canterbury      18065.72      19977.66
 7: 2014       1        Storm Canterbury and Lower North Island      18065.72      19977.66
 8: 2013       4        Storm                        Nationwide      18065.72      19977.66
 9: 2013       3        Storm                        Nationwide      18065.72      19977.66
10: 2013       3   Earthquake                       Cook Strait      18065.72      19977.66


Depending on the context, this may or may not be the desired result of an aggregate operation. More importantly, being able to identify when the assignment-by-reference operator should and shouldn’t be used in a data.table expression is an important skill to develop, and it comes in especially handy when debugging data.table code that you didn’t author.
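One last practical note: because := updates by reference, running the anti-example above permanently overwrites the loss columns in DF. If the original values need to be preserved, the update should be applied to a copy instead; a minimal sketch (DFAgg is a hypothetical name):

DFAgg = copy(DF)                                   # deep copy; DF is left untouched
DFAgg[, `:=` (LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
              LOSS_AMT_2014=sum(LOSS_AMT_2014))]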

Conclusion

In this post, we introduced data aggregation within the context of the powerful data.table library. We touched on aggregation via data.table’s special symbols and presented the two ways in which the assignment-by-reference operator can be used. But this post serves only as an introduction: for more advanced use cases and many more examples, I suggest becoming familiar with the data.table reference manual. In my opinion, the developers, contributors and maintainers of data.table produce one of the best, most useful open-source library references I’ve seen. Like every other reference manual, it covers the library’s API in detail, but what really sets it apart are the examples: in addition to the minimal annotations that demonstrate a given callable’s most basic functionality, many sections include a wealth of additional examples and alternate approaches the user might not have even known were possible. Sections that come immediately to mind are the assignment-by-reference operator (identified as := in the table of contents), melt.data.table, dcast.data.table and special-symbols. Despite using data.table on a regular basis, I still peruse the manual after each release, and each time I take away something I hadn’t known before. I encourage you to get in the habit of doing the same.
Until next time, happy coding!

Footnotes:

  [1] https://github.com/Rdatatable/data.table/wiki
  [2] https://cran.r-project.org/web/packages/data.table/data.table.pdf, pg. 93