For the uninitiated, data.table is a third-party package for the R programming language that provides a high-performance version of base R's data.frame, with syntax and feature enhancements for ease of use, convenience and programming speed [1]. I was first introduced to data.table when I began my career at CNA, and as a consequence of working with it on a daily basis for a few years, I have become well acquainted with the package and many of its offerings.
This post highlights aggregation and aggregate operations within the context
of data.table. Even though the syntax is straightforward for the most common
aggregate operations, when working with slightly more complex expressions,
subtle syntactical variations can result in unexpected result sets. But after
working through the examples in this post, you’ll hopefully be in a position
to leverage data.table to tackle all of your data aggregation needs.
To follow along with the examples, you should have a relatively recent
installation of R, in addition to the following third-party libraries:
data.table
CASdatasets (sample insurance datasets)
Aggregation
Our sample dataset, nzcathist, can be found in the CASdatasets package and
represents 145 catastrophic events that occurred in New Zealand between
1968 and 2014, along with the associated cost in millions, both as originally
reported (LOSS_AMT_ORIG) and adjusted to 2014 dollars (LOSS_AMT_2014). In the
pre-processing step, we reduce the dataset down to only the requisite columns
(an already pre-processed version of this dataset in .csv format can be
downloaded here):
library("data.table")
library("CASdatasets")
data(nzcathist) # available in `CASdatasets` package
DFInit = setDT(nzcathist)
keepFields = c("Year", "Quarter", "Type", "Location", "OriginalCost", "NormCost2014")
newFieldNames = c("YEAR", "QUARTER", "TYPE", "LOCATION", "LOSS_AMT_ORIG", "LOSS_AMT_2014")
DF = DFInit[, ..keepFields]
setnames(DF, newFieldNames)
Inspecting the first few records reveals:
> head(DF)
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014 4 Power outage Auckland 2.0 2.00
2: 2014 2 Flood Nelson, Tasman 2.7 2.72
3: 2014 2 Weather North and South Islands 37.6 37.85
4: 2014 2 Flood, Storm NA 55.3 55.67
5: 2014 1 Cyclone NA 3.6 3.63
6: 2014 1 Storm Canterbury 4.8 4.84
Note that in the pre-processing step, we coerced nzcathist, originally a
data.frame, into a data.table by calling the setDT function. setDT
coerces data.frames to data.tables by reference, as opposed to
as.data.table, which creates an independent copy of the original
data.frame in memory. For smaller datasets, the difference between the two
approaches will be negligible, but as the size of the data begins to scale,
calling as.data.table may result in serious performance degradation.
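To see the distinction concretely, data.table exports an address function that reports the memory location of an object. The following is a minimal sketch on a throwaway data.frame:

```r
library("data.table")

df = data.frame(x = 1:5)
address(df)             # memory address of the original data.frame

DT1 = as.data.table(df) # deep copy: DT1 lives at its own address
address(DT1)            # differs from address(df)

setDT(df)               # converts df itself; no deep copy of the columns
class(df)               # now "data.table" "data.frame"
```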
Before jumping in, I thought it helpful to reproduce a diagram found in the
official data.table documentation. It has been included as a way to
introduce/clarify data.table’s general syntax:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
In the official documentation, beneath the diagram reads Take DT, subset rows
by i, then compute j grouped by by. Throughout the rest of the post, any
reference to i, j or by implicitly assumes this context. Also, it should
be assumed that all datasets have been coerced to data.table objects, even if
not explicitly converted as such within the sample code block.
We start with the common task of calculating the sum of values held within a single column. Note that the following returns a scalar:
> DF[,sum(LOSS_AMT_2014)]
[1] 19977.66
To compute the sum of LOSS_AMT_ORIG and LOSS_AMT_2014 over all records, we
run the following, which returns a single-record data.table with two columns:
> DF[,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
LOSS_AMT_2014=sum(LOSS_AMT_2014))]
LOSS_AMT_ORIG LOSS_AMT_2014
1: 18065.72 19977.66
.() is shorthand for list. We can replace the expression above with a
call to list, and an identical result will be obtained:
> DF[,list(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
LOSS_AMT_2014=sum(LOSS_AMT_2014))]
LOSS_AMT_ORIG LOSS_AMT_2014
1: 18065.72 19977.66
It may be of interest to calculate the total amounts for LOSS_AMT_ORIG and
LOSS_AMT_2014 by YEAR. This is accomplished with the inclusion of by:
> DF[,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
LOSS_AMT_2014=sum(LOSS_AMT_2014)), by="YEAR"]
YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014 128.50 129.42
2: 2013 206.40 210.01
3: 2012 8.70 8.95
4: 2011 16557.50 17380.70
5: 2010 68.75 74.66
6: 2009 6.01 6.64
7: 2008 86.26 96.42
8: 2007 125.80 147.40
9: 2006 50.40 60.60
10: 2005 60.40 75.88
11: 2004 145.15 187.59
12: 2003 3.30 4.31
13: 2002 25.65 34.34
14: 2001 2.60 3.57
15: 2000 23.10 32.18
16: 1999 55.50 79.86
17: 1998 35.20 50.68
18: 1997 14.40 21.03
19: 1996 11.80 17.34
20: 1995 12.60 19.04
21: 1994 18.30 28.77
22: 1993 7.60 12.06
23: 1992 8.10 13.07
24: 1991 3.10 5.07
25: 1990 1.80 2.99
26: 1988 56.10 106.21
27: 1987 192.00 398.17
28: 1986 19.30 47.34
29: 1985 17.20 47.28
30: 1984 51.10 160.59
31: 1983 5.80 18.86
32: 1981 9.00 37.04
33: 1980 11.30 53.33
34: 1978 10.30 60.96
35: 1976 6.20 51.61
36: 1975 7.00 63.35
37: 1968 13.50 230.34
YEAR LOSS_AMT_ORIG LOSS_AMT_2014
In order to limit the result to only YEAR>2010, include the condition in i:
> DF[YEAR>2010,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
LOSS_AMT_2014=sum(LOSS_AMT_2014)), by="YEAR"]
YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014 128.5 129.42
2: 2013 206.4 210.01
3: 2012 8.7 8.95
4: 2011 16557.5 17380.70
by can be substituted with keyby, which runs setkey() on the
specified column(s). Since calling setkey on a data.table results in the
data.table being sorted by the specified column(s), keyby can be used when
the dataset requires ordering upon return. Notice that our previous result was
displayed with YEAR in descending order. Using keyby instead arranges
the output in ascending order with respect to YEAR:
> DF[YEAR>2010,.(LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
LOSS_AMT_2014=sum(LOSS_AMT_2014)), keyby="YEAR"]
YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2011 16557.5 17380.70
2: 2012 8.7 8.95
3: 2013 206.4 210.01
4: 2014 128.5 129.42
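The same ordered result can be produced in two steps, which makes keyby's behavior explicit: aggregate with by, then apply setkey to the result.

```r
# Aggregate first, then sort and key the result by YEAR
res = DF[YEAR > 2010,
         .(LOSS_AMT_ORIG = sum(LOSS_AMT_ORIG),
           LOSS_AMT_2014 = sum(LOSS_AMT_2014)),
         by = "YEAR"]
setkey(res, YEAR)  # sorts res by YEAR (ascending) and marks YEAR as the key
key(res)           # "YEAR"
```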
Note that if the fieldname rebinding is omitted and only the aggregate
operator over the target fields is included, the same result will be obtained,
but the aggregated fields will be assigned arbitrary names, in this case V1
and V2:
> DF[YEAR>2010,.(sum(LOSS_AMT_ORIG),sum(LOSS_AMT_2014)),by="YEAR"]
YEAR V1 V2
1: 2014 128.5 129.42
2: 2013 206.4 210.01
3: 2012 8.7 8.95
4: 2011 16557.5 17380.70
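If the default names are not wanted, they can be corrected after the fact with setnames; a minimal sketch:

```r
# Rename the auto-generated V1/V2 columns on the aggregated result
res = DF[YEAR > 2010, .(sum(LOSS_AMT_ORIG), sum(LOSS_AMT_2014)), by = "YEAR"]
setnames(res, c("V1", "V2"), c("LOSS_AMT_ORIG", "LOSS_AMT_2014"))
```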
Special Symbols in Aggregate Operations
In an earlier post, I introduced data.table's special symbols, which are
read-only symbols that can be used in j [2]. The .SD special symbol is
particularly useful for aggregate operations, along with .SDcols, which is
used in conjunction with .SD. .SD, .SDcols and by will be used
in the sample expressions that follow, and it is therefore instructive to
provide a qualitative description of the role each plays within the context of data.table:
.SD
Stands for subset of data.table. By default it gets assigned all columns
except those mentioned in by (or keyby). Can only be used in j.
.SDcols
Represents the columns that are to be aggregated, which must be numeric.
Fields may be given as a character vector of fieldnames or as
column-positional indices.
by
Similar to SQL's GROUP BY. Specifies the columns to group by, i.e., the
partitions over which the aggregate operation in j is performed.
Using the special symbol syntax, we can reproduce the output returned in our original examples. First, taking the sum of LOSS_AMT_ORIG and LOSS_AMT_2014 over all records without partitioning:
> DF[,lapply(.SD, sum, na.rm=TRUE),
.SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014")]
LOSS_AMT_ORIG LOSS_AMT_2014
1: 18065.72 19977.66
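As mentioned above, .SDcols also accepts column positions, and newer versions of data.table additionally accept a patterns() helper. Assuming LOSS_AMT_ORIG and LOSS_AMT_2014 occupy the fifth and sixth columns of DF, the following sketches are equivalent to the character-vector form:

```r
# Positional indices in place of fieldnames
DF[, lapply(.SD, sum, na.rm = TRUE), .SDcols = 5:6]

# Regular-expression selection (data.table >= 1.12.0)
DF[, lapply(.SD, sum, na.rm = TRUE), .SDcols = patterns("^LOSS_AMT")]
```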
Partitioning LOSS_AMT_ORIG and LOSS_AMT_2014 by YEAR:
> DF[,lapply(.SD, sum, na.rm=TRUE),
.SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014"),
by="YEAR"]
YEAR LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014 128.50 129.42
2: 2013 206.40 210.01
3: 2012 8.70 8.95
4: 2011 16557.50 17380.70
.
.
.
With only 6 columns in our sample dataset, the convenience of .SD may not
be readily apparent. But as the number of columns in our dataset grows, the
ability to use .SD as a stand-in for all fields not targeted in the
aggregate operation becomes an incredibly useful abstraction.
We can perform other aggregate operations over LOSS_AMT_ORIG and
LOSS_AMT_2014. Calculating the average loss amounts by TYPE is as simple
as replacing sum and YEAR in the last expression:
> DF[,lapply(.SD, mean, na.rm=TRUE),
.SDcols=c("LOSS_AMT_ORIG", "LOSS_AMT_2014"),
by="TYPE"]
TYPE LOSS_AMT_ORIG LOSS_AMT_2014
1: Power outage 6.100000 8.350000
2: Flood 5.443103 12.133448
3: Weather 13.000000 14.680000
4: Flood, Storm 18.511111 20.683333
5: Cyclone 11.350000 20.267500
6: Storm 17.932813 24.539062
7: Earthquake 2393.618571 2541.094286
8: Tornado 3.540000 4.425455
9: Hailstorm 5.437500 7.130000
10: Other 6.220000 25.265000
.SD can also be used to return the first N records of a particular
subgroup. To demonstrate, let's say we're interested in returning the 3
costliest catastrophic events by TYPE from our 145-record dataset. With 10
distinct event types, we expect 30 records to be returned. In addition, we
specify that the result must be returned in descending order with respect to
LOSS_AMT_2014:
> DF[order(-LOSS_AMT_2014),.SD[1:3], by="TYPE"]
TYPE YEAR QUARTER LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: Earthquake 2011 1 Canterbury 16500.0 17320.68
2: Earthquake 1987 1 Bay of Plenty 192.0 398.17
3: Earthquake 2007 4 Gisborne 30.5 35.38
4: Other 1968 1 Wahine 10.0 170.62
5: Other 2005 2 BOP Tauranga , Matata 28.5 35.65
6: Other 1985 1 Thames , Coromandel , Te Aroha 5.9 16.35
7: Storm 2004 1 Lower North Island 112.0 145.25
8: Storm 2013 3 Nationwide 74.5 75.50
9: Storm 2007 3 Far North , Auckland , Coromandel 60.5 71.01
10: Flood 1984 1 Invercargill , Southland 45.8 143.93
11: Flood 1999 4 Queenstown Lakes District 46.1 66.26
12: Flood 1978 4 Otago 10.3 60.96
13: Cyclone 1988 1 NA 37.0 70.41
14: Cyclone 1997 1 NA 3.2 4.69
15: Cyclone 2014 1 NA 3.6 3.63
16: Flood, Storm 2014 2 NA 55.3 55.67
17: Flood, Storm 2013 2 Nelson,Bay of Plenty 46.2 47.26
18: Flood, Storm 2002 2 North Island 21.5 28.73
19: Weather 2014 2 North and South Islands 37.6 37.85
20: Weather 1996 3 NA 8.1 11.93
21: Weather 2000 3 North Island 4.2 5.87
22: Hailstorm 1994 1 Hastings 10.8 17.13
23: Hailstorm 2005 1 Christchurch 13.0 16.40
24: Hailstorm 2008 4 Canterbury 11.2 12.57
25: Power outage 1998 2 Mercury 10.2 14.70
26: Power outage 2014 4 Auckland 2.0 2.00
27: Power outage NA NA NA NA NA
28: Tornado 2005 1 Greymouth 9.2 11.61
29: Tornado 2007 3 Tornado events 8.3 9.74
30: Tornado 2012 4 Auckland 8.7 8.95
.SD[1:3] specifies the number of rows that should be returned within each
subgroup (in this case 3). The - preceding LOSS_AMT_2014 results in a
descending-order sort within each subgroup. Note that because Power outage
contains only 2 events, indexing with .SD[1:3] pads that group with a row of
NAs (record 27 above).
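If rows of NAs for undersized groups are not wanted, head can be used in place of positional indexing, since head stops at the group size:

```r
# head(.SD, 3) returns at most 3 rows per group, with no NA padding
DF[order(-LOSS_AMT_2014), head(.SD, 3), by = "TYPE"]
```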
To return the least costly and most costly events by TYPE, we sort in ascending order and select the first and last record of each subgroup:
> DF[order(LOSS_AMT_2014),.SD[c(1,.N)],by="TYPE"]
TYPE YEAR QUARTER LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: Power outage 2014 4 Auckland 2.0 2.00
2: Power outage 1998 2 Mercury 10.2 14.70
3: Flood 2014 2 Nelson, Tasman 2.7 2.72
4: Flood 1976 1 Wellington , Hutt Valley 6.2 51.61
5: Weather 2014 2 North and South Islands 37.6 37.85
6: Weather 1996 4 NA 2.1 3.07
7: Flood, Storm 2014 2 NA 55.3 55.67
8: Flood, Storm 1994 4 North , South 6.0 9.26
9: Cyclone 2014 1 NA 3.6 3.63
10: Cyclone 1988 1 NA 37.0 70.41
11: Storm 2014 1 Canterbury 4.8 4.84
12: Storm 1968 1 Wahine 3.5 59.72
13: Earthquake 2013 3 Cook Strait 16.2 16.42
14: Earthquake 1987 1 Bay of Plenty 192.0 398.17
15: Tornado 2012 4 Auckland 8.7 8.95
16: Tornado 1980 3 Onehunga 1.0 4.53
17: Hailstorm 2009 2 Bay of Plenty 2.3 2.56
18: Hailstorm 1994 1 Hastings 10.8 17.13
19: Other 2007 4 Central North and Lower South Islands 4.8 5.57
20: Other 1968 1 Wahine 10.0 170.62
We can further leverage the .N special symbol to obtain a count of
catastrophic events by TYPE:
> DF[,.N, by="TYPE"]
TYPE N
1: Power outage 2
2: Flood 58
3: Weather 4
4: Flood, Storm 9
5: Cyclone 4
6: Storm 32
7: Earthquake 7
8: Tornado 11
9: Hailstorm 8
10: Other 10
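Since the result of a data.table expression is itself a data.table, a second set of brackets can be chained on to sort the counts, e.g. in descending order:

```r
# Count by TYPE, then chain a second [] to order by N descending
DF[, .N, by = "TYPE"][order(-N)]
```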
Or by TYPE and YEAR since 2010:
> DF[YEAR>=2010,.N,by=c("TYPE","YEAR")]
TYPE YEAR N
1: Power outage 2014 1
2: Flood 2014 1
3: Weather 2014 1
4: Flood, Storm 2014 1
5: Cyclone 2014 1
6: Storm 2014 2
7: Storm 2013 3
8: Earthquake 2013 2
9: Flood 2013 1
10: Flood, Storm 2013 1
11: Tornado 2012 1
12: Earthquake 2011 1
13: Flood, Storm 2011 1
14: Flood 2011 3
15: Tornado 2011 2
16: Flood 2010 3
17: Storm 2010 2
Assignment by Reference
Fields can be appended to an existing data.table with data representing a specific categorical aggregation. For example, consider only those records from the original catastrophe dataset occurring in 1997:
> DF97 = DF[YEAR==1997,]
> DF97
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 1997 2 Flood Northland 1.2 1.76
2: 1997 3 Flood Auckland 0.7 1.02
3: 1997 3 Flood Coromandel 0.5 0.73
4: 1997 2 Flood Auckland 3.7 5.42
5: 1997 1 Storm South Island 1.1 1.61
6: 1997 4 Storm South Island 0.2 0.29
7: 1997 4 Hailstorm Southland , Otago 0.4 0.58
8: 1997 1 Cyclone NA 3.2 4.69
9: 1997 2 Flood Wairoa 0.5 0.73
10: 1997 4 Storm North , South Island 2.9 4.20
A new column can be added representing the proportion each event contributed
to the total loss amount for YEAR=1997 (note that in this example,
LOSS_AMT_2014 is targeted as the proportionality basis):
> DF97[,PROPORTION:=LOSS_AMT_2014/sum(LOSS_AMT_2014)][]
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014 PROPORTION
1: 1997 2 Flood Northland 1.2 1.76 0.08368997
2: 1997 3 Flood Auckland 0.7 1.02 0.04850214
3: 1997 3 Flood Coromandel 0.5 0.73 0.03471232
4: 1997 2 Flood Auckland 3.7 5.42 0.25772706
5: 1997 1 Storm South Island 1.1 1.61 0.07655730
6: 1997 4 Storm South Island 0.2 0.29 0.01378982
7: 1997 4 Hailstorm Southland , Otago 0.4 0.58 0.02757965
8: 1997 1 Cyclone NA 3.2 4.69 0.22301474
9: 1997 2 Flood Wairoa 0.5 0.73 0.03471232
10: 1997 4 Storm North , South Island 2.9 4.20 0.19971469
This expression contains the := operator, which, within the context of
data.table, facilitates assignment-by-reference. We include the empty set of
brackets since without them, no result is printed when columns are assigned by
reference.
This example introduces a point worth remembering: When updating data.tables
with assignment-by-reference or with the functional form equivalent, in
general the dataset will not be aggregated. Instead, the result of the
aggregate operation will be appended to each record (and by each sub-group if
applicable) in the original data.table. Stated more succinctly, the
assignment-by-reference operator cannot perform grouping operations.
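To see this behavior on the full dataset, the proportion calculation can be combined with by: the group-level denominator is computed per YEAR and recycled onto every record in its group, and DF retains all 145 rows.

```r
# Each event's share of its own year's total; sum(LOSS_AMT_2014) is
# evaluated once per YEAR group and broadcast back to that group's rows
DF[, PROPORTION := LOSS_AMT_2014 / sum(LOSS_AMT_2014), by = "YEAR"]
```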
One final scenario I'd like to cover is updating data.table columns while retaining the original table shape (at least in one dimension). Although this isn't considered an aggregate operation as such, it is closely related and will be included here nonetheless.
The following example is contrived, but nicely illustrates the point. Assume
we’d like to add 5 to every cell in LOSS_AMT_ORIG and LOSS_AMT_2014 in the
DF97 subset. The first approach will be to leverage .SD
with lapply
while reassigning LOSS_AMT_ORIG and LOSS_AMT_2014:
# Specify numeric fields =>
> numFields = c("LOSS_AMT_ORIG", "LOSS_AMT_2014")
> DF97[,(numFields):=
     lapply(.SD, function(v) v+5),
     .SDcols=numFields][]
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 1997 2 Flood Northland 6.2 6.76
2: 1997 3 Flood Auckland 5.7 6.02
3: 1997 3 Flood Coromandel 5.5 5.73
4: 1997 2 Flood Auckland 8.7 10.42
5: 1997 1 Storm South Island 6.1 6.61
6: 1997 4 Storm South Island 5.2 5.29
7: 1997 4 Hailstorm Southland , Otago 5.4 5.58
8: 1997 1 Cyclone NA 8.2 9.69
9: 1997 2 Flood Wairoa 5.5 5.73
10: 1997 4 Storm North , South Island 7.9 9.20
It is necessary to surround the vector numFields with (), otherwise the
vector will be mistaken as a column name for a new field assigned by
reference. Notice that this has the desired effect of adding 5 to each cell in
the target columns.
The second approach highlights what is referred to as the functional form in
data.table parlance. Essentially, it serves as a multi-column analog of the
assignment-by-reference operator, :=. When used in functional form
expressions, := must be surrounded by either backticks (`:=`) or quotes
(":="). We demonstrate, starting again from the original DF97 subset:
> DF97[, `:=`
(LOSS_AMT_ORIG=LOSS_AMT_ORIG+5,
LOSS_AMT_2014=LOSS_AMT_2014+5)][]
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 1997 2 Flood Northland 6.2 6.76
2: 1997 3 Flood Auckland 5.7 6.02
3: 1997 3 Flood Coromandel 5.5 5.73
4: 1997 2 Flood Auckland 8.7 10.42
5: 1997 1 Storm South Island 6.1 6.61
6: 1997 4 Storm South Island 5.2 5.29
7: 1997 4 Hailstorm Southland , Otago 5.4 5.58
8: 1997 1 Cyclone NA 8.2 9.69
9: 1997 2 Flood Wairoa 5.5 5.73
10: 1997 4 Storm North , South Island 7.9 9.20
Finally, as a sort-of anti-example, we demonstrate the behavior of applying an
aggregate operation using the Functional form of the assignment-by-reference
operator and column rebinding. If we entered the following:
> DF[, `:=`
    (LOSS_AMT_ORIG=sum(LOSS_AMT_ORIG),
     LOSS_AMT_2014=sum(LOSS_AMT_2014))]
One might assume this expression returns the summation of LOSS_AMT_ORIG and
LOSS_AMT_2014 over all records without partitioning, resulting in a 1-row by
2-column data.table. However, the expression contains the
assignment-by-reference operator in functional form, and as we mentioned
before, the assignment-by-reference operator cannot perform grouping
operations. Therefore, the above expression instead overwrites each cell of
LOSS_AMT_ORIG and LOSS_AMT_2014 with the non-partitioned aggregate sum of
each column respectively. Here’s a look at the first 10 records:
> head(DF, 10)
YEAR QUARTER TYPE LOCATION LOSS_AMT_ORIG LOSS_AMT_2014
1: 2014 4 Power outage Auckland 18065.72 19977.66
2: 2014 2 Flood Nelson, Tasman 18065.72 19977.66
3: 2014 2 Weather North and South Islands 18065.72 19977.66
4: 2014 2 Flood, Storm NA 18065.72 19977.66
5: 2014 1 Cyclone NA 18065.72 19977.66
6: 2014 1 Storm Canterbury 18065.72 19977.66
7: 2014 1 Storm Canterbury and Lower North Island 18065.72 19977.66
8: 2013 4 Storm Nationwide 18065.72 19977.66
9: 2013 3 Storm Nationwide 18065.72 19977.66
10: 2013 3 Earthquake Cook Strait 18065.72 19977.66
Depending on the context, this may or may not be the desired result of an
aggregate operation. But more importantly, being able to identify when the
assignment-by-reference operator should be used in a data.table expression and
when it shouldn’t is an important skill to develop, and comes in especially
handy when debugging data.table code that you didn’t author.
Conclusion
In this post, we introduced data aggregation within the context of the
powerful data.table library. We touched on aggregation via data.table’s
special symbols, and presented the two ways in which the
assignment-by-reference operator can be used. But this post serves only as an
introduction. For many more advanced use cases and examples, I suggest
becoming familiar with the data.table reference manual.
In my opinion, the developers, contributors and maintainers of data.table
produce one of the best, most useful open source library references I’ve seen.
Like every other reference manual, the library’s API is covered in detail, but
what really sets it apart is the examples: In addition to the minimally useful
annotations that demonstrate a given callable’s most basic functionality, many
sections include additional detail with a wealth of useful examples and
alternate approaches the user might not have even known were possible.
Sections that come immediately to mind are the assignment-by-reference
operator (identified as :=
in the table of contents), melt.data.table,
dcast.data.table and special-symbols. Despite using data.table on a
regular basis, I still peruse the manual after each release, and each time end
up taking something away that I hadn’t known before. I encourage you to get in
the habit of doing the same.
Until next time, happy coding!
Footnotes:
[1] https://github.com/Rdatatable/data.table/wiki
[2] https://cran.r-project.org/web/packages/data.table/data.table.pdf, pg. 93