The data.table library is an R package that
provides an enhanced version of R’s in-built data.frame. There are many available resources that introduce data.table’s
syntax and provide examples of typical usage scenarios. In this post, we’ll cover an aspect of data.table that doesn’t
seem to get as much coverage: data.table’s special symbols.
Before proceeding, we should highlight data.table’s general syntactical form, that when understood, can be used to translate
data.table expressions of arbitrary complexity into 3 steps:
DT[i, j, by]
Which translates to:
“Take DT
, subset rows using i
, then calculate j
grouped by by
“
The following annotated diagram is included as part of data.table’s documentation:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
For those familiar with SQL, the three expressions that describe the general form can be thought of as components
of a SQL query: “SELECT from j
, WHERE i
, GROUP BY by
“. Although the SQL analog of the original
interpretation may not hold for every possible data.table operation, it remains a valid proxy for many scenarios
encountered in practice.
Although this applies to data.table in general, a thorough understanding of the general form will be especially
useful when exploring ways to leverage special symbols in your own code. The examples that follow serve
only as an introduction: The usefulness and applicable range of data.table’s special symbols are in no way
limited to what’s included here - we’re just getting started!
The data.table special symbols are defined as follows:
* .N is an integer, length 1, containing the number of rows in the group. This may be useful when the column
names are not known in advance and for convenience generally. When grouping by i
, .N is the number of rows in x
matched to, for each row of i
, regardless of whether nomatch is NA
or 0. It is renamed to N (no dot) in the
result (otherwise a column called “.N” could conflict with the .N variable, see FAQ 4.6 for more details and example),
unless it is explicity named; e.g., DT[,list(total=.N),by=a]
.
* .SD is a data.table containing the Subset of x’s Data for each group, excluding any columns used in by
(or keyby
).
* .BY is a list containing a length 1 vector for each item in by
. This can be useful when by
is not known in
advance. The by
variables are also available to j
directly by name; useful for example for titles of graphs if j
is a
plot command, or to branch with if()
depending on the value of a group variable.
* .I is an integer vector equal to seq_len(nrow(x))
. While grouping, it holds for each item in the group, it’s row
location in x. This is useful to subset in j
; e.g. DT[, .I[which.max(somecol)], by=grp]
.
* .GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.
* .EACHI when i is a data.table, DT[i,j,by=.EACHI]
evaluates j
for the groups of DT that each row in i
joins to.
We call this grouping by each i.
.N
The special symbol .N
can be used to return the last row or number of rows of a data.table. Referring to the R sample
dataset CO2
, what follows are a few examples using .N
:
library("data.table")
DT = CO2; setDT(DT)
> head(DT)
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
5: Qn1 Quebec nonchilled 500 35.3
6: Qn1 Quebec nonchilled 675 39.2
# Return last record of CO2 dataset =>
> DT[.N]
Plant Type Treatment conc uptake
1: Mc3 Mississippi chilled 1000 19.9
# Return the second-to-last record of CO2 dataset =>
> DT[.N-1]
Plant Type Treatment conc uptake
1: Mc3 Mississippi chilled 675 18.9
# Return the number of records in CO2 dataset =>
> DT[,.N]
[1] 84
# Return the number of records for each unique value of `Treatment` =>
> DT[, .N, by=Treatment]
Treatment N
1: nonchilled 42
2: chilled 42
# Return the number of records for each unique combination of
# `Treatment` & `Plant` =>
> DT[, .N, keyby="Treatment,Plant"]
Treatment Plant N
1: nonchilled Qn1 7
2: nonchilled Qn2 7
3: nonchilled Qn3 7
4: nonchilled Mn3 7
5: nonchilled Mn2 7
6: nonchilled Mn1 7
7: chilled Qc1 7
8: chilled Qc3 7
9: chilled Qc2 7
10: chilled Mc2 7
11: chilled Mc3 7
12: chilled Mc1 7
.N
can be used to enumerate records, creating a row counter field:
> (DT[,CNTR:=1:.N])
Plant Type Treatment conc uptake CNTR
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
.
.
.
79: Mc3 Mississippi chilled 175 18.0 79
80: Mc3 Mississippi chilled 250 17.9 80
81: Mc3 Mississippi chilled 350 17.9 81
82: Mc3 Mississippi chilled 500 17.9 82
83: Mc3 Mississippi chilled 675 18.9 83
84: Mc3 Mississippi chilled 1000 19.9 84
Similiarly, enumeration can be applied by group:
> (DT[,CNTR:=1:.N, by=Plant])
Plant Type Treatment conc uptake CNTR
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
6: Qn1 Quebec nonchilled 675 39.2 6
7: Qn1 Quebec nonchilled 1000 39.7 7
8: Qn2 Quebec nonchilled 95 13.6 1
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 3
11: Qn2 Quebec nonchilled 350 41.8 4
12: Qn2 Quebec nonchilled 500 40.6 5
13: Qn2 Quebec nonchilled 675 41.4 6
14: Qn2 Quebec nonchilled 1000 44.3 7
.
.
.
.SD
.SD
(Subset of x’s Data) contains all columns of the data.table except the grouping columns
by default. Referring again to the CO2
dataset, we can use .SD
to print the contents of a
data.table:
library("data.table")
DT = CO2; setDT(DT)
> DT[,.SD]
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
.
.
.
83: Mc3 Mississippi chilled 675 18.9
84: Mc3 Mississippi chilled 1000 19.9
Although uncommon, .SD
can be used for subsetting columns. Next, we retrieve Treatment
and conc
from
the CO2
dataset using three approaches, then demonstrating each method returns an identical subset:
# Using .SD to retrieve "Treatment" & "conc" =>
> DT1 = DT[, .SD, .SDcols=c("Treatment", "conc")]
> head(DT1)
Treatment conc
1: nonchilled 95
2: nonchilled 175
3: nonchilled 250
4: nonchilled 350
5: nonchilled 500
6: nonchilled 675
# Can also be accomplished using the following. Note that
# `.(...)` is shorthand for `list(...)` =>
> DT2 = DT[, .(Treatment, conc)]
# If column names are available only as a character vector,
# i.e., c("Treatment", "conc") include `with=FALSE` =>
> DT3 = DT[, c("Treatment", "conc"), with=FALSE]
# Test for equality amongst DT1, DT2 & DT3 =>
> identical(DT1, DT3); identical(DT2, DT3); identical(DT1, DT2)
[1] TRUE
[1] TRUE
[1] TRUE
We can retrieve the first 3 rows for all fields partitioned by Plant
:
> (DT1 = DT[,.SD[1:3], by=Plant])
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn2 Quebec nonchilled 95 13.6
5: Qn2 Quebec nonchilled 175 27.3
6: Qn2 Quebec nonchilled 250 37.1
7: Qn3 Quebec nonchilled 95 16.2
8: Qn3 Quebec nonchilled 175 32.4
9: Qn3 Quebec nonchilled 250 40.3
.
.
.
28: Mc1 Mississippi chilled 95 10.5
29: Mc1 Mississippi chilled 175 14.9
30: Mc1 Mississippi chilled 250 18.1
31: Mc2 Mississippi chilled 95 7.7
32: Mc2 Mississippi chilled 175 11.4
33: Mc2 Mississippi chilled 250 12.3
34: Mc3 Mississippi chilled 95 10.6
35: Mc3 Mississippi chilled 175 18.0
36: Mc3 Mississippi chilled 250 17.9
It is possible to retrieve the 1st, 2nd, second-to-last and last records for each unique value
present in Type
:
> DT[, .SD[c(1, 2, .N-1, .N)], by=Type]
Type Plant Treatment conc uptake
1: Quebec Qn1 nonchilled 95 16.0
2: Quebec Qn1 nonchilled 175 30.4
3: Quebec Qc3 chilled 675 39.6
4: Quebec Qc3 chilled 1000 41.4
5: Mississippi Mn1 nonchilled 95 10.6
6: Mississippi Mn1 nonchilled 175 19.2
7: Mississippi Mc3 chilled 675 18.9
8: Mississippi Mc3 chilled 1000 19.9
.SD
can be used in conjunction with .SDcols
to convert fields with a factor datatype to character
type. .SDcols
is frequently used with .SD
to specify a subset of the columns of .SD
to be used in
j
. If .SDcols
is present, it’s generally bound to a vector of fieldnames, and the operation in j
will be applied only to the fields associated with the names specified in .SDcols
:
# First, list original field datatypes =>
> lapply(DT, class)
$Plant
[1] "ordered" "factor"
$Type
[1] "factor"
$Treatment
[1] "factor"
$conc
[1] "numeric"
$uptake
[1] "numeric"
# Convert fields with factor datatype (`Plant`, `Type` & `Treatment`)
# to character datatype. Isolate factor fieldnames =>
> factorFieldnames = names(Filter(is.factor, DT))
# Convert factor datatypes to character datatypes =>
> DT[,(factorFieldnames):=lapply(.SD, as.character), .SDcols=factorFieldnames]
# Verify datatypes have been successfully coerced =>
> lapply(DT, class)
$Plant
[1] "character"
$Type
[1] "character"
$Treatment
[1] "character"
$conc
[1] "numeric"
$uptake
[1] "numeric"
Two things to note: First, because we used the assignment by reference operator, :=
, to recast
datatypes, it is not necessary to re-assign the result of the type-coercion back to DT. This is because
when using :=
, operations are performed in-place and by reference, therefore the result of a
particular action are visible immediately in DT
. Second, since factorFieldnames
is enclosed by
parentheses preceeding :=
, the result is assigned to the columns specified in factorFieldnames
.
If parentheses we left out, a new field named factorFieldnames
would be added to DT
, which is not
the desired behavior.
.SD
can be used to calculate the average of any fields of interest overall or by group:
# Calculate the overall average of `conc` & `uptake` =>
> (avgDT1 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake")])
conc uptake
1: 435 27.2131
# Calculate the average of `conc` & `uptake` by `Plant` =>
> (avgDT2 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake"), by=Plant])
Plant conc uptake
1: Qn1 435 33.22857
2: Qn2 435 35.15714
3: Qn3 435 37.61429
4: Qc1 435 29.97143
5: Qc2 435 32.70000
6: Qc3 435 32.58571
7: Mn1 435 26.40000
8: Mn2 435 27.34286
9: Mn3 435 24.11429
10: Mc1 435 18.00000
11: Mc2 435 12.14286
12: Mc3 435 17.30000
One final example using .SD
: As mentioned earlier in the post, .SDcols
is used to specify
the columns of the data.table that .SD
will operate on. Alternatively, we can use .SDcols
to exclude fields from the operation performed by .SD
. Referring to the example in which we
calculated the average of conc
& uptake
in the CO2
dataset, we can alternatively specify
the fields .SD
should not average over. Here’s an alternative approach that results in excluding
Plant
, Type
and Treatment
from the average cacluation:
DT = CO2; setDT(DT)
# Calculate the average of `conc` & `uptake` by excluding
# `Plant`, `Type` & `Treatment` =>
> (avgDT = DT[, lapply(.SD, mean), .SDcols=!c("Plant","Type","Treatment")]))
conc uptake
1: 435 27.2131
Clearly in this example it is not advantageous to specify the fields not included in the average
(the need to specify 3 excluded fieldnames as opposed to 2 included fieldnames), but situations do arise
in which this functionality turns out to be very convenient.
.BY
.BY
is a list containing a length 1 vector for each item in by
. There are few examples in which
the .BY
special symbol is used, so we’ll reproduce the example included in the data.table vignette:
library("data.table")
> DT = data.table(
x=rep(c("b","a","c"),each=3),
v=c(1,1,1,2,2,1,1,2,2),
y=c(1,3,6),
a=1:9,
b=9:1
)
> head(DT)
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
> X = data.table(
x=c("c","b"),
v=8:7,
foo=c(4,2)
)
> head(X)
x v foo
1: c 8 4
2: b 7 2
# Join within each group =>
> X[, DT[.BY, y, on="x"], by=x]
x V1
1: c 1
2: c 3
3: c 6
4: b 1
5: b 3
6: b 6
.I
.I
is an integer vector equal to seq_len(nrow(x)). But we can instead use .I
to return a vector
containing the row indicies of the records in a data.table that meet the condition specified in i
.
In the next example, we demonstrate how to return the row indicies from CO2
having conc>500
using .I
:
library("data.table")
DT = CO2; setDT(DT)
# Return row indicies in which conc > 500 =>
> DT[,.I[conc>500]]
[1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84
Using .I
after the condition specified in i
returns a vector of length equal to the number of
records meeting the i
‘s criteria:
> DT[conc>500, .I]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Note that use of .I
may have been supplanted in favor of data.table’s which
parameter. Instead
of using .I
, we can use which=TRUE
to return the row indicies of records meeting the condition
specified in i
:
# Using `which=TRUE` instead of `.I` =>
> DT[conc>500, which=TRUE]
[1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84
We obtain the same result as using DT[,.I[conc>500]]
, but DT[conc>500, which=TRUE]
is, in my opinion,
preferable based on the improved readability.
.GRP
.GRP
is an integer, length 1, containing a simple group counter. If used without assignment to a new
field, will return an enumerated data.table containing the unique combination of fields specified in
by
. If we specify the key of the CO2
dataset as "Plant", "Type", "Treatment"
, we can return the
unique combinations of these fields by referencing the data.table’s key in the by
clause:
library("data.table")
DT = CO2; setDT(DT)
# Set key on DT =>
> setkeyv(DT, c("Plant", "Type", "Treatment"))
# Return enumerated data.table listing unique
# combinations of fields specified in `by` =>
> DT[, .GRP, by=key(DT)])
Plant Type Treatment GRP
1: Qn1 Quebec nonchilled 1
2: Qn2 Quebec nonchilled 2
3: Qn3 Quebec nonchilled 3
4: Qc1 Quebec chilled 4
5: Qc3 Quebec chilled 5
6: Qc2 Quebec chilled 6
7: Mn3 Mississippi nonchilled 7
8: Mn2 Mississippi nonchilled 8
9: Mn1 Mississippi nonchilled 9
10: Mc2 Mississippi chilled 10
11: Mc3 Mississippi chilled 11
12: Mc1 Mississippi chilled 12
Notice that fields not specified in by
are omitted from the output.
We can also use .GRP
to indicate which unique group a record belongs to for each record in the
parent data.table. We use the :=
operator to define a new column (by reference) indicating
each record’s associated group:
> (DT[,GROUP_ID:=.GRP, by=key(DT)])
Plant Type Treatment conc uptake GROUP_ID
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 1
3: Qn1 Quebec nonchilled 250 34.8 1
4: Qn1 Quebec nonchilled 350 37.2 1
5: Qn1 Quebec nonchilled 500 35.3 1
6: Qn1 Quebec nonchilled 675 39.2 1
7: Qn1 Quebec nonchilled 1000 39.7 1
8: Qn2 Quebec nonchilled 95 13.6 2
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 2
.
.
.
79: Mc1 Mississippi chilled 175 14.9 12
80: Mc1 Mississippi chilled 250 18.1 12
81: Mc1 Mississippi chilled 350 18.9 12
82: Mc1 Mississippi chilled 500 19.5 12
83: Mc1 Mississippi chilled 675 22.2 12
84: Mc1 Mississippi chilled 1000 21.9 12
.EACHI
.EACHI
is not documented along with the other special symbols, but it is mentioned in the Arguments
section
of the data.table vignette. What follows is the
description for .EACHI
found there:
When i
is a list (or data.frame or data.table), DT[i, j, by=.EACHI]
evaluates j
for the groups in DT that each row in i
joins to. That is, you can
join (in i
) and aggregate (in j
) simultaneously. We call this grouping by each i
.
A question on stackoverflow requested additional
information on how to properly use .EACHI
, and one of data.table’s authors provided an excellent response with clear
examples and straightforward use cases. As with .BY
, we proceed by referring to examples from data.table’s documentation:
library("data.table")
> (DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9))
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 4
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> (X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)))
x v foo
1: c 8 4
2: b 7 2
# join and eval j for each row in i
> (DT1 = DT[X, sum(v), by=.EACHI, on="x"])
x V1
1: c 24
2: b 6
First, an inner join on the field x
with DT
and X
is performed. Then calculate the sum of v
from
DT
for each x
in X
. X
has two unique values in x
: c
& b
. Referencing DT
, when x=="c"
,
sum(v) = 24
. When x=="b"
, sum(v) = 6
, which matches the result using .EACHI
above.
We can remove a value or range of values from inclusion of the j
operation by specifying those values to
exclude:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> setkey(DT, x)
> (DT1 = DT[!"a", sum(v), by=.EACHI])
x V1
1: b 6
2: c 24
For each unique key value (records in x
) that isn’t "a"
, calculate the sum of column v
.
Conclusion
This post has been a quick introduction to data.table’s special symbols. Once sufficiently grokked, using data.table’s special symbols will simplify your data.table expressions, allowing complex operations to be expressed in a very natural way. Until next time, happy coding!