The data.table library is an R package that provides an enhanced version of R’s in-built data.frame. There are many available resources that introduce data.table’s syntax and provide examples of typical usage scenarios. In this post, we’ll cover an aspect of data.table that doesn’t seem to get as much coverage: data.table’s special symbols.
Before proceeding, we should highlight data.table’s general syntactical form, that when understood, can be used to translate data.table expressions of arbitrary complexity into 3 steps:

DT[i, j, by]


Which translates to:

“Take DT, subset rows using i, then calculate j grouped by by

The following annotated diagram is included as part of data.table’s documentation:

DT[ i, j, by ] # + extra arguments
    |  |  |
    |  |   -------> grouped by what?
    |   -------> what to do?
     ---> on which rows?


For those familiar with SQL, the three expressions that describe the general form can be thought of as components of a SQL query: “SELECT from j, WHERE i, GROUP BY by“. Although the SQL analog of the original interpretation may not hold for every possible data.table operation, it remains a valid proxy for many scenarios encountered in practice.
Although this applies to data.table in general, a thorough understanding of the general form will be especially useful when exploring ways to leverage special symbols in your own code. The examples that follow serve only as an introduction: The usefulness and applicable range of data.table’s special symbols are in no way limited to what’s included here - we’re just getting started!

The data.table special symbols are defined as follows:



* .N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called “.N” could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g., DT[,list(total=.N),by=a].

* .SD is a data.table containing the Subset of x’s Data for each group, excluding any columns used in by (or keyby).

* .BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable.

* .I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, it’s row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].

* .GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.

* .EACHI when i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i.

.N

The special symbol .N can be used to return the last row or number of rows of a data.table. Referring to the R sample dataset CO2, what follows are a few examples using .N:

library("data.table")

DT = CO2; setDT(DT)

> head(DT)
   Plant   Type  Treatment conc uptake
1:   Qn1 Quebec nonchilled   95   16.0
2:   Qn1 Quebec nonchilled  175   30.4
3:   Qn1 Quebec nonchilled  250   34.8
4:   Qn1 Quebec nonchilled  350   37.2
5:   Qn1 Quebec nonchilled  500   35.3
6:   Qn1 Quebec nonchilled  675   39.2


# Return last record of CO2 dataset =>
> DT[.N]
   Plant        Type Treatment conc uptake
1:   Mc3 Mississippi   chilled 1000   19.9


# Return the second-to-last record of CO2 dataset =>
> DT[.N-1]
   Plant        Type Treatment conc uptake
1:   Mc3 Mississippi   chilled  675   18.9


# Return the number of records in CO2 dataset =>
> DT[,.N]
[1] 84


# Return the number of records for each unique value of `Treatment` =>
> DT[, .N, by=Treatment]
    Treatment  N
1: nonchilled 42
2:    chilled 42


# Return the number of records for each unique combination of
# `Treatment` & `Plant` =>
> DT[, .N, keyby="Treatment,Plant"]

     Treatment Plant N
 1: nonchilled   Qn1 7
 2: nonchilled   Qn2 7
 3: nonchilled   Qn3 7
 4: nonchilled   Mn3 7
 5: nonchilled   Mn2 7
 6: nonchilled   Mn1 7
 7:    chilled   Qc1 7
 8:    chilled   Qc3 7
 9:    chilled   Qc2 7
10:    chilled   Mc2 7
11:    chilled   Mc3 7
12:    chilled   Mc1 7


.N can be used to enumerate records, creating a row counter field:

> (DT[,CNTR:=1:.N])

    Plant        Type  Treatment conc uptake CNTR
 1:   Qn1      Quebec nonchilled   95   16.0    1
 2:   Qn1      Quebec nonchilled  175   30.4    2
 3:   Qn1      Quebec nonchilled  250   34.8    3
 4:   Qn1      Quebec nonchilled  350   37.2    4
 5:   Qn1      Quebec nonchilled  500   35.3    5
                          .
                          .
                          .
79:   Mc3 Mississippi    chilled  175   18.0   79
80:   Mc3 Mississippi    chilled  250   17.9   80
81:   Mc3 Mississippi    chilled  350   17.9   81
82:   Mc3 Mississippi    chilled  500   17.9   82
83:   Mc3 Mississippi    chilled  675   18.9   83
84:   Mc3 Mississippi    chilled 1000   19.9   84


Similiarly, enumeration can be applied by group:

> (DT[,CNTR:=1:.N, by=Plant])

    Plant        Type  Treatment conc uptake CNTR
 1:   Qn1      Quebec nonchilled   95   16.0    1
 2:   Qn1      Quebec nonchilled  175   30.4    2
 3:   Qn1      Quebec nonchilled  250   34.8    3
 4:   Qn1      Quebec nonchilled  350   37.2    4
 5:   Qn1      Quebec nonchilled  500   35.3    5
 6:   Qn1      Quebec nonchilled  675   39.2    6
 7:   Qn1      Quebec nonchilled 1000   39.7    7
 8:   Qn2      Quebec nonchilled   95   13.6    1
 9:   Qn2      Quebec nonchilled  175   27.3    2
10:   Qn2      Quebec nonchilled  250   37.1    3
11:   Qn2      Quebec nonchilled  350   41.8    4
12:   Qn2      Quebec nonchilled  500   40.6    5
13:   Qn2      Quebec nonchilled  675   41.4    6
14:   Qn2      Quebec nonchilled 1000   44.3    7
                      .
                      .
                      .


.SD

.SD (Subset of x’s Data) contains all columns of the data.table except the grouping columns by default. Referring again to the CO2 dataset, we can use .SD to print the contents of a data.table:

library("data.table")

DT = CO2; setDT(DT)

> DT[,.SD]

    Plant        Type  Treatment conc uptake
 1:   Qn1      Quebec nonchilled   95   16.0
 2:   Qn1      Quebec nonchilled  175   30.4
 3:   Qn1      Quebec nonchilled  250   34.8
 4:   Qn1      Quebec nonchilled  350   37.2
                    .
                    .
                    .
83:   Mc3 Mississippi    chilled  675   18.9
84:   Mc3 Mississippi    chilled 1000   19.9


Although uncommon, .SD can be used for subsetting columns. Next, we retrieve Treatment and conc from the CO2 dataset using three approaches, then demonstrating each method returns an identical subset:

# Using .SD to retrieve "Treatment" & "conc" =>
> DT1 = DT[, .SD, .SDcols=c("Treatment", "conc")]

> head(DT1)
Treatment conc
1: nonchilled   95
2: nonchilled  175
3: nonchilled  250
4: nonchilled  350
5: nonchilled  500
6: nonchilled  675


# Can also be accomplished using the following. Note that
# `.(...)` is shorthand for `list(...)` => 
> DT2 = DT[, .(Treatment, conc)]

# If column names are available only as a character vector,
# i.e., c("Treatment", "conc") include `with=FALSE` =>
> DT3 = DT[, c("Treatment", "conc"), with=FALSE]

# Test for equality amongst DT1, DT2 & DT3 =>
> identical(DT1, DT3); identical(DT2, DT3); identical(DT1, DT2)

[1] TRUE
[1] TRUE
[1] TRUE


We can retrieve the first 3 rows for all fields partitioned by Plant:

> (DT1 = DT[,.SD[1:3], by=Plant])

    Plant        Type  Treatment conc uptake
 1:   Qn1      Quebec nonchilled   95   16.0
 2:   Qn1      Quebec nonchilled  175   30.4
 3:   Qn1      Quebec nonchilled  250   34.8
 4:   Qn2      Quebec nonchilled   95   13.6
 5:   Qn2      Quebec nonchilled  175   27.3
 6:   Qn2      Quebec nonchilled  250   37.1
 7:   Qn3      Quebec nonchilled   95   16.2
 8:   Qn3      Quebec nonchilled  175   32.4
 9:   Qn3      Quebec nonchilled  250   40.3
                      .
                      .
                      .
28:   Mc1 Mississippi    chilled   95   10.5
29:   Mc1 Mississippi    chilled  175   14.9
30:   Mc1 Mississippi    chilled  250   18.1
31:   Mc2 Mississippi    chilled   95    7.7
32:   Mc2 Mississippi    chilled  175   11.4
33:   Mc2 Mississippi    chilled  250   12.3
34:   Mc3 Mississippi    chilled   95   10.6
35:   Mc3 Mississippi    chilled  175   18.0
36:   Mc3 Mississippi    chilled  250   17.9    


It is possible to retrieve the 1st, 2nd, second-to-last and last records for each unique value present in Type:

> DT[, .SD[c(1, 2, .N-1, .N)], by=Type]

          Type Plant  Treatment conc uptake
1:      Quebec   Qn1 nonchilled   95   16.0
2:      Quebec   Qn1 nonchilled  175   30.4
3:      Quebec   Qc3    chilled  675   39.6
4:      Quebec   Qc3    chilled 1000   41.4
5: Mississippi   Mn1 nonchilled   95   10.6
6: Mississippi   Mn1 nonchilled  175   19.2
7: Mississippi   Mc3    chilled  675   18.9
8: Mississippi   Mc3    chilled 1000   19.9


.SD can be used in conjunction with .SDcols to convert fields with a factor datatype to character type. .SDcols is frequently used with .SD to specify a subset of the columns of .SD to be used in j. If .SDcols is present, it’s generally bound to a vector of fieldnames, and the operation in j will be applied only to the fields associated with the names specified in .SDcols:

# First, list original field datatypes =>
> lapply(DT, class)

$Plant
[1] "ordered" "factor" 

$Type
[1] "factor"

$Treatment
[1] "factor"

$conc
[1] "numeric"

$uptake
[1] "numeric"


# Convert fields with factor datatype (`Plant`, `Type` & `Treatment`) 
# to character datatype. Isolate factor fieldnames =>
> factorFieldnames = names(Filter(is.factor, DT))


# Convert factor datatypes to character datatypes =>
> DT[,(factorFieldnames):=lapply(.SD, as.character), .SDcols=factorFieldnames]


# Verify datatypes have been successfully coerced =>
> lapply(DT, class)

$Plant
[1] "character"

$Type
[1] "character"

$Treatment
[1] "character"

$conc
[1] "numeric"

$uptake
[1] "numeric"


Two things to note: First, because we used the assignment by reference operator, :=, to recast datatypes, it is not necessary to re-assign the result of the type-coercion back to DT. This is because when using :=, operations are performed in-place and by reference, therefore the result of a particular action are visible immediately in DT. Second, since factorFieldnames is enclosed by parentheses preceeding :=, the result is assigned to the columns specified in factorFieldnames. If parentheses we left out, a new field named factorFieldnames would be added to DT, which is not the desired behavior.


.SD can be used to calculate the average of any fields of interest overall or by group:

# Calculate the overall average of `conc` & `uptake` =>
> (avgDT1 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake")])

   conc  uptake
1:  435 27.2131


# Calculate the average of `conc` & `uptake` by `Plant` =>
> (avgDT2 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake"), by=Plant])

    Plant conc   uptake
 1:   Qn1  435 33.22857
 2:   Qn2  435 35.15714
 3:   Qn3  435 37.61429
 4:   Qc1  435 29.97143
 5:   Qc2  435 32.70000
 6:   Qc3  435 32.58571
 7:   Mn1  435 26.40000
 8:   Mn2  435 27.34286
 9:   Mn3  435 24.11429
10:   Mc1  435 18.00000
11:   Mc2  435 12.14286
12:   Mc3  435 17.30000


One final example using .SD: As mentioned earlier in the post, .SDcols is used to specify the columns of the data.table that .SD will operate on. Alternatively, we can use .SDcols to exclude fields from the operation performed by .SD. Referring to the example in which we calculated the average of conc & uptake in the CO2 dataset, we can alternatively specify the fields .SD should not average over. Here’s an alternative approach that results in excluding Plant, Type and Treatment from the average cacluation:

DT = CO2; setDT(DT)

# Calculate the average of `conc` & `uptake` by excluding
# `Plant`, `Type` & `Treatment` =>
> (avgDT = DT[, lapply(.SD, mean), .SDcols=!c("Plant","Type","Treatment")]))

   conc  uptake
1:  435 27.2131


Clearly in this example it is not advantageous to specify the fields not included in the average (the need to specify 3 excluded fieldnames as opposed to 2 included fieldnames), but situations do arise in which this functionality turns out to be very convenient.

.BY

.BY is a list containing a length 1 vector for each item in by. There are few examples in which the .BY special symbol is used, so we’ll reproduce the example included in the data.table vignette:

library("data.table")

> DT = data.table(
           x=rep(c("b","a","c"),each=3), 
           v=c(1,1,1,2,2,1,1,2,2), 
           y=c(1,3,6), 
           a=1:9, 
           b=9:1
           )

> head(DT)

   x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4


> X = data.table(
          x=c("c","b"), 
          v=8:7, 
          foo=c(4,2)
          )

> head(X)

   x v foo
1: c 8   4
2: b 7   2


# Join within each group =>
> X[, DT[.BY, y, on="x"], by=x]

   x V1
1: c  1
2: c  3
3: c  6
4: b  1
5: b  3
6: b  6


.I

.I is an integer vector equal to seq_len(nrow(x)). But we can instead use .I to return a vector containing the row indicies of the records in a data.table that meet the condition specified in i. In the next example, we demonstrate how to return the row indicies from CO2 having conc>500 using .I:

library("data.table")

DT = CO2; setDT(DT)

# Return row indicies in which conc > 500 =>
> DT[,.I[conc>500]]

[1]  6  7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84


Using .I after the condition specified in i returns a vector of length equal to the number of records meeting the is criteria:

> DT[conc>500, .I]

[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24


Note that use of .I may have been supplanted in favor of data.table’s which parameter. Instead of using .I, we can use which=TRUE to return the row indicies of records meeting the condition specified in i:

# Using `which=TRUE` instead of `.I` =>
> DT[conc>500, which=TRUE]

[1]  6  7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84


We obtain the same result as using DT[,.I[conc>500]], but DT[conc>500, which=TRUE] is, in my opinion, preferable based on the improved readability.

.GRP

.GRP is an integer, length 1, containing a simple group counter. If used without assignment to a new field, will return an enumerated data.table containing the unique combination of fields specified in by. If we specify the key of the CO2 dataset as "Plant", "Type", "Treatment", we can return the unique combinations of these fields by referencing the data.table’s key in the by clause:

library("data.table")

DT = CO2; setDT(DT)

# Set key on DT =>
> setkeyv(DT, c("Plant", "Type", "Treatment"))

# Return enumerated data.table listing unique 
# combinations of fields specified in `by` =>
> DT[, .GRP, by=key(DT)])

    Plant        Type  Treatment GRP
 1:   Qn1      Quebec nonchilled   1
 2:   Qn2      Quebec nonchilled   2
 3:   Qn3      Quebec nonchilled   3
 4:   Qc1      Quebec    chilled   4
 5:   Qc3      Quebec    chilled   5
 6:   Qc2      Quebec    chilled   6
 7:   Mn3 Mississippi nonchilled   7
 8:   Mn2 Mississippi nonchilled   8
 9:   Mn1 Mississippi nonchilled   9
10:   Mc2 Mississippi    chilled  10
11:   Mc3 Mississippi    chilled  11
12:   Mc1 Mississippi    chilled  12


Notice that fields not specified in by are omitted from the output.
We can also use .GRP to indicate which unique group a record belongs to for each record in the parent data.table. We use the := operator to define a new column (by reference) indicating each record’s associated group:

> (DT[,GROUP_ID:=.GRP, by=key(DT)])

    Plant        Type  Treatment conc uptake GROUP_ID
 1:   Qn1      Quebec nonchilled   95   16.0        1
 2:   Qn1      Quebec nonchilled  175   30.4        1
 3:   Qn1      Quebec nonchilled  250   34.8        1
 4:   Qn1      Quebec nonchilled  350   37.2        1
 5:   Qn1      Quebec nonchilled  500   35.3        1
 6:   Qn1      Quebec nonchilled  675   39.2        1
 7:   Qn1      Quebec nonchilled 1000   39.7        1
 8:   Qn2      Quebec nonchilled   95   13.6        2
 9:   Qn2      Quebec nonchilled  175   27.3        2
10:   Qn2      Quebec nonchilled  250   37.1        2
                          .
                          .
                          .
79:   Mc1 Mississippi    chilled  175   14.9       12
80:   Mc1 Mississippi    chilled  250   18.1       12
81:   Mc1 Mississippi    chilled  350   18.9       12
82:   Mc1 Mississippi    chilled  500   19.5       12
83:   Mc1 Mississippi    chilled  675   22.2       12
84:   Mc1 Mississippi    chilled 1000   21.9       12


.EACHI

.EACHI is not documented along with the other special symbols, but it is mentioned in the Arguments section of the data.table vignette. What follows is the description for .EACHI found there:

When i is a list (or data.frame or data.table), DT[i, j, by=.EACHI] evaluates j for the groups in DT that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.

A question on stackoverflow requested additional information on how to properly use .EACHI, and one of data.table’s authors provided an excellent response with clear examples and straightforward use cases. As with .BY, we proceed by referring to examples from data.table’s documentation:

library("data.table")

> (DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9))

   x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 4
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9

> (X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)))

   x v foo
1: c 8   4
2: b 7   2


# join and eval j for each row in i
> (DT1 = DT[X, sum(v), by=.EACHI, on="x"])

   x V1
1: c 24
2: b  6


First, an inner join on the field x with DT and X is performed. Then calculate the sum of v from DT for each x in X. X has two unique values in x: c & b. Referencing DT, when x=="c", sum(v) = 24. When x=="b", sum(v) = 6, which matches the result using .EACHI above.
We can remove a value or range of values from inclusion of the j operation by specifying those values to exclude:

> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)

> setkey(DT, x)
> (DT1 = DT[!"a", sum(v), by=.EACHI])
   x V1
1: b  6
2: c 24


For each unique key value (records in x) that isn’t "a", calculate the sum of column v.

Conclusion

This post has been a quick introduction to data.table’s special symbols. Once sufficiently grokked, using data.table’s special symbols will simplify your data.table expressions, allowing complex operations to be expressed in a very natural way. Until next time, happy coding!