The data.table library is an R package that
provides an enhanced version of R’s in-built data.frame. There are many available resources that introduce data.table’s
syntax and provide examples of typical usage scenarios. In this post, we’ll cover an aspect of data.table that doesn’t
seem to get as much coverage: data.table’s *special symbols*.

Before proceeding, we should highlight data.table’s general syntactical form, that when understood, can be used to translate
data.table expressions of arbitrary complexity into 3 steps:

`DT[i, j, by]`

Which translates to:

*“Take *

`DT`

, subset rows using `i`

, then calculate `j`

grouped by `by`

“The following annotated diagram is included as part of data.table’s documentation:

```
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
```

For those familiar with SQL, the three expressions that describe the general form can be thought of as components
of a SQL query: “SELECT from ** j**, WHERE

**, GROUP BY**

`i`

**“. Although the SQL analog of the original interpretation may not hold for every possible data.table operation, it remains a valid proxy for many scenarios encountered in practice.**

`by`

Although this applies to data.table in general, a thorough understanding of the general form will be especially useful when exploring ways to leverage special symbols in your own code. The examples that follow serve only as an introduction: The usefulness and applicable range of data.table’s special symbols are in no way limited to what’s included here - we’re just getting started!

The data.table special symbols are defined as follows:

* **.N** is an integer, length 1, containing the number of rows in the group. This may be useful when the column
names are not known in advance and for convenience generally. When grouping by `i`

, .N is the number of rows in x
matched to, for each row of `i`

, regardless of whether nomatch is `NA`

or 0. It is renamed to N (no dot) in the
result (otherwise a column called “.N” could conflict with the .N variable, see FAQ 4.6 for more details and example),
unless it is explicity named; e.g., `DT[,list(total=.N),by=a]`

.

* **.SD** is a data.table containing the Subset of x’s Data for each group, excluding any columns used in `by`

(or `keyby`

).

* **.BY** is a list containing a length 1 vector for each item in `by`

. This can be useful when `by`

is not known in
advance. The `by`

variables are also available to `j`

directly by name; useful for example for titles of graphs if `j`

is a
plot command, or to branch with `if()`

depending on the value of a group variable.

* **.I** is an integer vector equal to `seq_len(nrow(x))`

. While grouping, it holds for each item in the group, it’s row
location in x. This is useful to subset in `j`

; e.g. `DT[, .I[which.max(somecol)], by=grp]`

.

* **.GRP** is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.

* **.EACHI** when i is a data.table, `DT[i,j,by=.EACHI]`

evaluates `j`

for the groups of DT that each row in `i`

joins to.
We call this *grouping by each i*.

`.N`

`.N`

The special symbol `.N`

can be used to return the last row or number of rows of a data.table. Referring to the R sample
dataset `CO2`

, what follows are a few examples using `.N`

:

```
library("data.table")
DT = CO2; setDT(DT)
> head(DT)
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
5: Qn1 Quebec nonchilled 500 35.3
6: Qn1 Quebec nonchilled 675 39.2
# Return last record of CO2 dataset =>
> DT[.N]
Plant Type Treatment conc uptake
1: Mc3 Mississippi chilled 1000 19.9
# Return the second-to-last record of CO2 dataset =>
> DT[.N-1]
Plant Type Treatment conc uptake
1: Mc3 Mississippi chilled 675 18.9
# Return the number of records in CO2 dataset =>
> DT[,.N]
[1] 84
# Return the number of records for each unique value of `Treatment` =>
> DT[, .N, by=Treatment]
Treatment N
1: nonchilled 42
2: chilled 42
# Return the number of records for each unique combination of
# `Treatment` & `Plant` =>
> DT[, .N, keyby="Treatment,Plant"]
Treatment Plant N
1: nonchilled Qn1 7
2: nonchilled Qn2 7
3: nonchilled Qn3 7
4: nonchilled Mn3 7
5: nonchilled Mn2 7
6: nonchilled Mn1 7
7: chilled Qc1 7
8: chilled Qc3 7
9: chilled Qc2 7
10: chilled Mc2 7
11: chilled Mc3 7
12: chilled Mc1 7
```

`.N`

can be used to enumerate records, creating a row counter field:

```
> (DT[,CNTR:=1:.N])
Plant Type Treatment conc uptake CNTR
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
.
.
.
79: Mc3 Mississippi chilled 175 18.0 79
80: Mc3 Mississippi chilled 250 17.9 80
81: Mc3 Mississippi chilled 350 17.9 81
82: Mc3 Mississippi chilled 500 17.9 82
83: Mc3 Mississippi chilled 675 18.9 83
84: Mc3 Mississippi chilled 1000 19.9 84
```

Similiarly, enumeration can be applied by group:

```
> (DT[,CNTR:=1:.N, by=Plant])
Plant Type Treatment conc uptake CNTR
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
6: Qn1 Quebec nonchilled 675 39.2 6
7: Qn1 Quebec nonchilled 1000 39.7 7
8: Qn2 Quebec nonchilled 95 13.6 1
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 3
11: Qn2 Quebec nonchilled 350 41.8 4
12: Qn2 Quebec nonchilled 500 40.6 5
13: Qn2 Quebec nonchilled 675 41.4 6
14: Qn2 Quebec nonchilled 1000 44.3 7
.
.
.
```

`.SD`

`.SD`

`.SD`

(*Subset of x’s Data*) contains all columns of the data.table except the grouping columns
by default. Referring again to the `CO2`

dataset, we can use `.SD`

to print the contents of a
data.table:

```
library("data.table")
DT = CO2; setDT(DT)
> DT[,.SD]
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
.
.
.
83: Mc3 Mississippi chilled 675 18.9
84: Mc3 Mississippi chilled 1000 19.9
```

Although uncommon, `.SD`

can be used for subsetting columns. Next, we retrieve `Treatment`

and `conc`

from
the `CO2`

dataset using three approaches, then demonstrating each method returns an identical subset:

```
# Using .SD to retrieve "Treatment" & "conc" =>
> DT1 = DT[, .SD, .SDcols=c("Treatment", "conc")]
> head(DT1)
Treatment conc
1: nonchilled 95
2: nonchilled 175
3: nonchilled 250
4: nonchilled 350
5: nonchilled 500
6: nonchilled 675
# Can also be accomplished using the following. Note that
# `.(...)` is shorthand for `list(...)` =>
> DT2 = DT[, .(Treatment, conc)]
# If column names are available only as a character vector,
# i.e., c("Treatment", "conc") include `with=FALSE` =>
> DT3 = DT[, c("Treatment", "conc"), with=FALSE]
# Test for equality amongst DT1, DT2 & DT3 =>
> identical(DT1, DT3); identical(DT2, DT3); identical(DT1, DT2)
[1] TRUE
[1] TRUE
[1] TRUE
```

We can retrieve the first 3 rows for all fields partitioned by `Plant`

:

```
> (DT1 = DT[,.SD[1:3], by=Plant])
Plant Type Treatment conc uptake
1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn2 Quebec nonchilled 95 13.6
5: Qn2 Quebec nonchilled 175 27.3
6: Qn2 Quebec nonchilled 250 37.1
7: Qn3 Quebec nonchilled 95 16.2
8: Qn3 Quebec nonchilled 175 32.4
9: Qn3 Quebec nonchilled 250 40.3
.
.
.
28: Mc1 Mississippi chilled 95 10.5
29: Mc1 Mississippi chilled 175 14.9
30: Mc1 Mississippi chilled 250 18.1
31: Mc2 Mississippi chilled 95 7.7
32: Mc2 Mississippi chilled 175 11.4
33: Mc2 Mississippi chilled 250 12.3
34: Mc3 Mississippi chilled 95 10.6
35: Mc3 Mississippi chilled 175 18.0
36: Mc3 Mississippi chilled 250 17.9
```

It is possible to retrieve the 1st, 2nd, second-to-last and last records for each unique value
present in `Type`

:

```
> DT[, .SD[c(1, 2, .N-1, .N)], by=Type]
Type Plant Treatment conc uptake
1: Quebec Qn1 nonchilled 95 16.0
2: Quebec Qn1 nonchilled 175 30.4
3: Quebec Qc3 chilled 675 39.6
4: Quebec Qc3 chilled 1000 41.4
5: Mississippi Mn1 nonchilled 95 10.6
6: Mississippi Mn1 nonchilled 175 19.2
7: Mississippi Mc3 chilled 675 18.9
8: Mississippi Mc3 chilled 1000 19.9
```

`.SD`

can be used in conjunction with `.SDcols`

to convert fields with a factor datatype to character
type. `.SDcols`

is frequently used with `.SD`

to specify a subset of the columns of `.SD`

to be used in
`j`

. If `.SDcols`

is present, it’s generally bound to a vector of fieldnames, and the operation in `j`

will be applied only to the fields associated with the names specified in `.SDcols`

:

```
# First, list original field datatypes =>
> lapply(DT, class)
$Plant
[1] "ordered" "factor"
$Type
[1] "factor"
$Treatment
[1] "factor"
$conc
[1] "numeric"
$uptake
[1] "numeric"
# Convert fields with factor datatype (`Plant`, `Type` & `Treatment`)
# to character datatype. Isolate factor fieldnames =>
> factorFieldnames = names(Filter(is.factor, DT))
# Convert factor datatypes to character datatypes =>
> DT[,(factorFieldnames):=lapply(.SD, as.character), .SDcols=factorFieldnames]
# Verify datatypes have been successfully coerced =>
> lapply(DT, class)
$Plant
[1] "character"
$Type
[1] "character"
$Treatment
[1] "character"
$conc
[1] "numeric"
$uptake
[1] "numeric"
```

Two things to note: First, because we used the assignment by reference operator, `:=`

, to recast
datatypes, it is not necessary to re-assign the result of the type-coercion back to DT. This is because
when using `:=`

, operations are performed *in-place* and by reference, therefore the result of a
particular action are visible immediately in `DT`

. Second, since `factorFieldnames`

is enclosed by
parentheses preceeding `:=`

, the result is assigned to the columns specified in `factorFieldnames`

.
If parentheses we left out, a new field named `factorFieldnames`

would be added to `DT`

, which is not
the desired behavior.

`.SD`

can be used to calculate the average of any fields of interest overall or by group:

```
# Calculate the overall average of `conc` & `uptake` =>
> (avgDT1 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake")])
conc uptake
1: 435 27.2131
# Calculate the average of `conc` & `uptake` by `Plant` =>
> (avgDT2 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake"), by=Plant])
Plant conc uptake
1: Qn1 435 33.22857
2: Qn2 435 35.15714
3: Qn3 435 37.61429
4: Qc1 435 29.97143
5: Qc2 435 32.70000
6: Qc3 435 32.58571
7: Mn1 435 26.40000
8: Mn2 435 27.34286
9: Mn3 435 24.11429
10: Mc1 435 18.00000
11: Mc2 435 12.14286
12: Mc3 435 17.30000
```

One final example using `.SD`

: As mentioned earlier in the post, `.SDcols`

is used to specify
the columns of the data.table that `.SD`

will operate on. Alternatively, we can use `.SDcols`

to *exclude* fields from the operation performed by `.SD`

. Referring to the example in which we
calculated the average of `conc`

& `uptake`

in the `CO2`

dataset, we can alternatively specify
the fields `.SD`

*should not* average over. Here’s an alternative approach that results in excluding
`Plant`

, `Type`

and `Treatment`

from the average cacluation:

```
DT = CO2; setDT(DT)
# Calculate the average of `conc` & `uptake` by excluding
# `Plant`, `Type` & `Treatment` =>
> (avgDT = DT[, lapply(.SD, mean), .SDcols=!c("Plant","Type","Treatment")]))
conc uptake
1: 435 27.2131
```

Clearly in this example it is not advantageous to specify the fields *not* included in the average
(the need to specify 3 excluded fieldnames as opposed to 2 included fieldnames), but situations do arise
in which this functionality turns out to be very convenient.

`.BY`

`.BY`

`.BY`

is a list containing a length 1 vector for each item in `by`

. There are few examples in which
the `.BY`

special symbol is used, so we’ll reproduce the example included in the data.table vignette:

```
library("data.table")
> DT = data.table(
x=rep(c("b","a","c"),each=3),
v=c(1,1,1,2,2,1,1,2,2),
y=c(1,3,6),
a=1:9,
b=9:1
)
> head(DT)
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
> X = data.table(
x=c("c","b"),
v=8:7,
foo=c(4,2)
)
> head(X)
x v foo
1: c 8 4
2: b 7 2
# Join within each group =>
> X[, DT[.BY, y, on="x"], by=x]
x V1
1: c 1
2: c 3
3: c 6
4: b 1
5: b 3
6: b 6
```

`.I`

`.I`

`.I`

is an integer vector equal to seq_len(nrow(x)). But we can instead use `.I`

to return a vector
containing the row indicies of the records in a data.table that meet the condition specified in `i`

.
In the next example, we demonstrate how to return the row indicies from `CO2`

having `conc>500`

using `.I`

:

```
library("data.table")
DT = CO2; setDT(DT)
# Return row indicies in which conc > 500 =>
> DT[,.I[conc>500]]
[1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84
```

Using `.I`

after the condition specified in `i`

returns a vector of length equal to the number of
records meeting the `i`

‘s criteria:

```
> DT[conc>500, .I]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
```

Note that use of `.I`

may have been supplanted in favor of data.table’s `which`

parameter. Instead
of using `.I`

, we can use `which=TRUE`

to return the row indicies of records meeting the condition
specified in `i`

:

```
# Using `which=TRUE` instead of `.I` =>
> DT[conc>500, which=TRUE]
[1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84
```

We obtain the same result as using `DT[,.I[conc>500]]`

, but `DT[conc>500, which=TRUE]`

is, in my opinion,
preferable based on the improved readability.

`.GRP`

`.GRP`

`.GRP`

is an integer, length 1, containing a simple group counter. If used without assignment to a new
field, will return an enumerated data.table containing the unique combination of fields specified in
`by`

. If we specify the key of the `CO2`

dataset as `"Plant", "Type", "Treatment"`

, we can return the
unique combinations of these fields by referencing the data.table’s key in the `by`

clause:

```
library("data.table")
DT = CO2; setDT(DT)
# Set key on DT =>
> setkeyv(DT, c("Plant", "Type", "Treatment"))
# Return enumerated data.table listing unique
# combinations of fields specified in `by` =>
> DT[, .GRP, by=key(DT)])
Plant Type Treatment GRP
1: Qn1 Quebec nonchilled 1
2: Qn2 Quebec nonchilled 2
3: Qn3 Quebec nonchilled 3
4: Qc1 Quebec chilled 4
5: Qc3 Quebec chilled 5
6: Qc2 Quebec chilled 6
7: Mn3 Mississippi nonchilled 7
8: Mn2 Mississippi nonchilled 8
9: Mn1 Mississippi nonchilled 9
10: Mc2 Mississippi chilled 10
11: Mc3 Mississippi chilled 11
12: Mc1 Mississippi chilled 12
```

Notice that fields not specified in `by`

are omitted from the output.

We can also use `.GRP`

to indicate which unique group a record belongs to for each record in the
parent data.table. We use the `:=`

operator to define a new column (by reference) indicating
each record’s associated group:

```
> (DT[,GROUP_ID:=.GRP, by=key(DT)])
Plant Type Treatment conc uptake GROUP_ID
1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 1
3: Qn1 Quebec nonchilled 250 34.8 1
4: Qn1 Quebec nonchilled 350 37.2 1
5: Qn1 Quebec nonchilled 500 35.3 1
6: Qn1 Quebec nonchilled 675 39.2 1
7: Qn1 Quebec nonchilled 1000 39.7 1
8: Qn2 Quebec nonchilled 95 13.6 2
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 2
.
.
.
79: Mc1 Mississippi chilled 175 14.9 12
80: Mc1 Mississippi chilled 250 18.1 12
81: Mc1 Mississippi chilled 350 18.9 12
82: Mc1 Mississippi chilled 500 19.5 12
83: Mc1 Mississippi chilled 675 22.2 12
84: Mc1 Mississippi chilled 1000 21.9 12
```

`.EACHI`

`.EACHI`

`.EACHI`

is not documented along with the other special symbols, but it is mentioned in the `Arguments`

section
of the data.table vignette. What follows is the
description for `.EACHI`

found there:

*When i is a list (or data.frame or data.table), DT[i, j, by=.EACHI]
evaluates j for the groups in DT that each row in i joins to. That is, you can
join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.*

A question on stackoverflow requested additional information on how to properly use

`.EACHI`

, and one of data.table’s authors provided an excellent response with clear
examples and straightforward use cases. As with `.BY`

, we proceed by referring to examples from data.table’s documentation:
```
library("data.table")
> (DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9))
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 4
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> (X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)))
x v foo
1: c 8 4
2: b 7 2
# join and eval j for each row in i
> (DT1 = DT[X, sum(v), by=.EACHI, on="x"])
x V1
1: c 24
2: b 6
```

First, an inner join on the field `x`

with `DT`

and `X`

is performed. Then calculate the sum of `v`

from
`DT`

*for each* `x`

in `X`

. `X`

has two unique values in `x`

: `c`

& `b`

. Referencing `DT`

, when `x=="c"`

,
`sum(v) = 24`

. When `x=="b"`

, `sum(v) = 6`

, which matches the result using `.EACHI`

above.

We can remove a value or range of values from inclusion of the `j`

operation by specifying those values to
exclude:

```
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> setkey(DT, x)
> (DT1 = DT[!"a", sum(v), by=.EACHI])
x V1
1: b 6
2: c 24
```

For each unique key value (records in `x`

) that isn’t `"a"`

, calculate the sum of column `v`

.

## Conclusion

This post has been a quick introduction to data.table’s special symbols. Once sufficiently grokked, using data.table’s special symbols will simplify your data.table expressions, allowing complex operations to be expressed in a very natural way. Until next time, happy coding!