Date Tags R

The data.table library is an R package that provides an enhanced version of R’s in-built data.frame. There are many available resources that introduce data.table’s syntax and provide examples of typical usage scenarios. In this post, we’ll cover one aspect of the data.table API: Special symbols.

Before proceeding, we should highlight data.table’s general syntactical form, that when understood, can be used to translate data.table expressions of arbitrary complexity into 3 steps:

DT[i, j, by]

Which translates to:

Take DT, subset rows using i, then calculate j grouped by by

The following annotated diagram is included as part of data.table’s documentation:

DT[ i, j, by ] # + extra arguments
    |  |  |
    |  |   -------> grouped by what?
    |   -------> what to do?
     ---> on which rows?

For those familiar with SQL, the three expressions that describe the general form can be thought of as components of a SQL query: “SELECT from j, WHERE i, GROUP BY by”. Although the SQL analog of the original interpretation may not hold for every possible data.table operation, it remains a valid proxy for many scenarios encountered in practice. Although this applies to data.table in general, a thorough understanding of the general form will be especially useful when exploring ways to leverage special symbols in your own code.

The data.table special symbols are defined as follows:

  • .N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ‘.N’ could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g., DT[,list(total=.N),by=a].

  • .SD is a data.table containing the Subset of x’s data for each group, excluding any columns used in by (or keyby).

  • .BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable.

  • .I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, it’s row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].

  • .GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.

.N

The special symbol .N can be used to return the last row or number of rows of a data.table. Referring to the R sample dataset CO2, what follows are a few examples using .N:

> library("data.table")

> DT = CO2; setDT(DT)

> head(DT)
   Plant   Type  Treatment conc uptake
1:   Qn1 Quebec nonchilled   95   16.0
2:   Qn1 Quebec nonchilled  175   30.4
3:   Qn1 Quebec nonchilled  250   34.8
4:   Qn1 Quebec nonchilled  350   37.2
5:   Qn1 Quebec nonchilled  500   35.3
6:   Qn1 Quebec nonchilled  675   39.2

### Return last record of CO2 dataset.
> DT[.N]
   Plant        Type Treatment conc uptake
1:   Mc3 Mississippi   chilled 1000   19.9


### Return the second-to-last record of CO2 dataset.
> DT[.N-1]
   Plant        Type Treatment conc uptake
1:   Mc3 Mississippi   chilled  675   18.9


### Return the number of records in CO2 dataset.
> DT[,.N]
[1] 84


### Return the number of records for each unique value of Treatment.
> DT[, .N, by=Treatment]
    Treatment  N
1: nonchilled 42
2:    chilled 42


### Return the number of records for each unique combination of Treatment` & Plant.
> DT[, .N, keyby="Treatment,Plant"]

     Treatment Plant N
 1: nonchilled   Qn1 7
 2: nonchilled   Qn2 7
 3: nonchilled   Qn3 7
 4: nonchilled   Mn3 7
 5: nonchilled   Mn2 7
 6: nonchilled   Mn1 7
 7:    chilled   Qc1 7
 8:    chilled   Qc3 7
 9:    chilled   Qc2 7
10:    chilled   Mc2 7
11:    chilled   Mc3 7
12:    chilled   Mc1 7

.N can be used to enumerate records, creating a row counter field:

> (DT[,CNTR:=1:.N])

    Plant        Type  Treatment conc uptake CNTR
 1:   Qn1      Quebec nonchilled   95   16.0    1
 2:   Qn1      Quebec nonchilled  175   30.4    2
 3:   Qn1      Quebec nonchilled  250   34.8    3
 4:   Qn1      Quebec nonchilled  350   37.2    4
 5:   Qn1      Quebec nonchilled  500   35.3    5
                          .
                          .
                          .
79:   Mc3 Mississippi    chilled  175   18.0   79
80:   Mc3 Mississippi    chilled  250   17.9   80
81:   Mc3 Mississippi    chilled  350   17.9   81
82:   Mc3 Mississippi    chilled  500   17.9   82
83:   Mc3 Mississippi    chilled  675   18.9   83
84:   Mc3 Mississippi    chilled 1000   19.9   84

Similiarly, enumeration can be applied by group:

(DT[,CNTR:=1:.N, by=Plant])

    Plant        Type  Treatment conc uptake CNTR
 1:   Qn1      Quebec nonchilled   95   16.0    1
 2:   Qn1      Quebec nonchilled  175   30.4    2
 3:   Qn1      Quebec nonchilled  250   34.8    3
 4:   Qn1      Quebec nonchilled  350   37.2    4
 5:   Qn1      Quebec nonchilled  500   35.3    5
 6:   Qn1      Quebec nonchilled  675   39.2    6
 7:   Qn1      Quebec nonchilled 1000   39.7    7
 8:   Qn2      Quebec nonchilled   95   13.6    1
 9:   Qn2      Quebec nonchilled  175   27.3    2
10:   Qn2      Quebec nonchilled  250   37.1    3
11:   Qn2      Quebec nonchilled  350   41.8    4
12:   Qn2      Quebec nonchilled  500   40.6    5
13:   Qn2      Quebec nonchilled  675   41.4    6
14:   Qn2      Quebec nonchilled 1000   44.3    7

.SD

.SD (subset of x’s data) contains all columns of the data.table except the grouping columns by default. Referring again to the CO2 dataset, we can use .SD to print the contents of a data.table:

> library("data.table")

> DT = CO2; setDT(DT)

> DT[,.SD]

    Plant        Type  Treatment conc uptake
 1:   Qn1      Quebec nonchilled   95   16.0
 2:   Qn1      Quebec nonchilled  175   30.4
 3:   Qn1      Quebec nonchilled  250   34.8
 4:   Qn1      Quebec nonchilled  350   37.2
                    .
                    .
                    .
83:   Mc3 Mississippi    chilled  675   18.9
84:   Mc3 Mississippi    chilled 1000   19.9

Although uncommon, .SD can be used for subsetting columns. Next, we retrieve Treatment and conc from the CO2 dataset using three approaches, demonstrating each method returns an identical subset:

# Using .SD to retrieve Treatment & conc.
> DT1 = DT[, .SD, .SDcols=c("Treatment", "conc")]

> head(DT1)
    Treatment conc
1: nonchilled   95
2: nonchilled  175
3: nonchilled  250
4: nonchilled  350
5: nonchilled  500
6: nonchilled  675

> DT2 = DT[, .(Treatment, conc)]

# If column names are available only as a character vector,
# i.e., c("Treatment", "conc") include `with=FALSE`.
> DT3 = DT[, c("Treatment", "conc"), with=FALSE]

# Test for equality amongst DT1, DT2 & DT3.
> identical(DT1, DT3); identical(DT2, DT3)
[1] TRUE
[1] TRUE

.SD can be used to retrieve the first 3 rows for all fields partitioned by Plant:

> (DT1 = DT[,.SD[1:3], by=Plant])

    Plant        Type  Treatment conc uptake
 1:   Qn1      Quebec nonchilled   95   16.0
 2:   Qn1      Quebec nonchilled  175   30.4
 3:   Qn1      Quebec nonchilled  250   34.8
 4:   Qn2      Quebec nonchilled   95   13.6
 5:   Qn2      Quebec nonchilled  175   27.3
 6:   Qn2      Quebec nonchilled  250   37.1
 7:   Qn3      Quebec nonchilled   95   16.2
 8:   Qn3      Quebec nonchilled  175   32.4
 9:   Qn3      Quebec nonchilled  250   40.3
                      .
                      .
                      .
28:   Mc1 Mississippi    chilled   95   10.5
29:   Mc1 Mississippi    chilled  175   14.9
30:   Mc1 Mississippi    chilled  250   18.1
31:   Mc2 Mississippi    chilled   95    7.7
32:   Mc2 Mississippi    chilled  175   11.4
33:   Mc2 Mississippi    chilled  250   12.3
34:   Mc3 Mississippi    chilled   95   10.6
35:   Mc3 Mississippi    chilled  175   18.0
36:   Mc3 Mississippi    chilled  250   17.9    

It is possible to retrieve the 1st, 2nd, second-to-last and last records for each unique value present in Type:

> DT[, .SD[c(1, 2, .N-1, .N)], by=Type]

          Type Plant  Treatment conc uptake
1:      Quebec   Qn1 nonchilled   95   16.0
2:      Quebec   Qn1 nonchilled  175   30.4
3:      Quebec   Qc3    chilled  675   39.6
4:      Quebec   Qc3    chilled 1000   41.4
5: Mississippi   Mn1 nonchilled   95   10.6
6: Mississippi   Mn1 nonchilled  175   19.2
7: Mississippi   Mc3    chilled  675   18.9
8: Mississippi   Mc3    chilled 1000   19.9

.SD can be used along with .SDcols to convert fields with a factor datatype to character type. .SDcols is frequently used with .SD to specify a subset of the columns of .SD to be used in j. If .SDcols is present, it is generally bound to a vector of fieldnames, and the operation in j will be applied only to the fields associated with the names specified in .SDcols:

### First, list original field datatypes.
> lapply(DT, class)

$Plant
[1] "ordered" "factor" 

$Type
[1] "factor"

$Treatment
[1] "factor"

$conc
[1] "numeric"

$uptake
[1] "numeric"


### Convert fields with factor datatype (Plant, Type & Treatment) 
### to character type. Isolate factor fieldnames.
> factorFieldnames = names(Filter(is.factor, DT))

### Convert factor datatypes to character types.
> DT[,(factorFieldnames):=lapply(.SD, as.character), .SDcols=factorFieldnames]

### Verify datatypes have been successfully coerced.
> lapply(DT, class)

$Plant
[1] "character"

$Type
[1] "character"

$Treatment
[1] "character"

$conc
[1] "numeric"

$uptake
[1] "numeric"

Two things to note: First, because we used the assignment by reference operator, :=, to recast datatypes, it is not necessary to re-assign the result of the type-coercion back to DT. This is because when using :=, operations are performed in-place and by reference, therefore the result of a particular action are visible immediately in DT. Second, since factorFieldnames is enclosed by parentheses preceeding :=, the result is assigned to the columns specified in factorFieldnames. If parentheses we left out, a new field named factorFieldnames would be added to DT, which is not the desired behavior.

.SD can be used to calculate the average of any fields of interest overall or by group:

### Calculate the overall average of conc & uptake.
> (avgDT1 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake")])

   conc  uptake
1:  435 27.2131


### Calculate the average of conc & uptake by Plant.
> (avgDT2 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake"), by=Plant])

    Plant conc   uptake
 1:   Qn1  435 33.22857
 2:   Qn2  435 35.15714
 3:   Qn3  435 37.61429
 4:   Qc1  435 29.97143
 5:   Qc2  435 32.70000
 6:   Qc3  435 32.58571
 7:   Mn1  435 26.40000
 8:   Mn2  435 27.34286
 9:   Mn3  435 24.11429
10:   Mc1  435 18.00000
11:   Mc2  435 12.14286
12:   Mc3  435 17.30000

One final example using .SD: As mentioned earlier in the post, .SDcols is used to specify the columns of the data.table that .SD will operate on. Alternatively, we can use .SDcols to exclude fields from the operation performed by .SD. Referring to the example in which we calculated the average of conc & uptake in the CO2 dataset, we can instead specify the fields .SD should not average over. Here’s an alternative approach that results in excluding Plant, Type and Treatment from the average cacluation:

DT = CO2; setDT(DT)

# Calculate the average of conc & uptake by excluding Plant, Type & Treatment.
> (avgDT = DT[, lapply(.SD, mean), .SDcols=!c("Plant", "Type", "Treatment")]))

   conc  uptake
1:  435 27.2131

Clearly in this example it is not advantageous to specify the fields not included in the average, but situations do arise in which this functionality turns out to be very convenient.

.BY

.BY is a list containing a length 1 vector for each item in by. There are few examples in which the .BY special symbol is used, so we’ll reproduce the example included in the data.table vignette:

library("data.table")

> DT = data.table(
           x=rep(c("b", "a", "c"), each=3), 
           v=c(1, 1, 1, 2, 2, 1, 1, 2, 2), 
           y=c(1, 3, 6), 
           a=1:9, 
           b=9:1
           )

> head(DT)

   x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4


> X = data.table(
      x=c("c","b"), 
      v=8:7, 
      foo=c(4,2)
      )

> head(X)

   x v foo
1: c 8   4
2: b 7   2


# Join within each group.
> X[, DT[.BY, y, on="x"], by=x]

   x V1
1: c  1
2: c  3
3: c  6
4: b  1
5: b  3
6: b  6

.I

.I is an integer vector equal to seq_len(nrow(x)). But we can instead use .I to return a vector containing the row indicies of the records in a data.table that meet the condition specified in i. In the next example, we demonstrate how to return the row indicies from CO2 having conc>500 using .I:

### Return row indicies in which conc > 500.
> DT[,.I[conc > 500]]
[1]  6  7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84

Using .I after the condition specified in i returns a vector of length equal to the number of records meeting the i’s criteria:

> DT[conc > 500, .I]

[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Note that use of .I may have been supplanted in favor of data.table’s which parameter. Instead of using .I, we can use which=TRUE to return the row indicies of records meeting the condition specified in i:

> DT[conc > 500, which=TRUE]
[1]  6  7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84

.GRP

.GRP is an integer, length 1, containing a simple group counter. If used without assignment to a new field, .GRP will return an enumerated data.table containing the unique combination of fields specified in by. If we specify the key of the CO2 dataset as “Plant”, “Type”, “Treatment”, we can return the unique combinations of these fields by referencing the data.table’s key in the by clause:

### Set key on DT.
> setkeyv(DT, c("Plant", "Type", "Treatment"))

### Return enumerated data.table listing unique combinations of fields specified in by.
> DT[, .GRP, by=key(DT)])

    Plant        Type  Treatment GRP
 1:   Qn1      Quebec nonchilled   1
 2:   Qn2      Quebec nonchilled   2
 3:   Qn3      Quebec nonchilled   3
 4:   Qc1      Quebec    chilled   4
 5:   Qc3      Quebec    chilled   5
 6:   Qc2      Quebec    chilled   6
 7:   Mn3 Mississippi nonchilled   7
 8:   Mn2 Mississippi nonchilled   8
 9:   Mn1 Mississippi nonchilled   9
10:   Mc2 Mississippi    chilled  10
11:   Mc3 Mississippi    chilled  11
12:   Mc1 Mississippi    chilled  12

Notice that fields not specified in by are omitted from the output.

We can also use .GRP to indicate which unique group a record belongs to for each record in the parent data.table. We use the := operator to define a new column (by reference) indicating each record’s associated group:

> (DT[,GROUP_ID:=.GRP, by=key(DT)])

    Plant        Type  Treatment conc uptake GROUP_ID
 1:   Qn1      Quebec nonchilled   95   16.0        1
 2:   Qn1      Quebec nonchilled  175   30.4        1
 3:   Qn1      Quebec nonchilled  250   34.8        1
 4:   Qn1      Quebec nonchilled  350   37.2        1
 5:   Qn1      Quebec nonchilled  500   35.3        1
 6:   Qn1      Quebec nonchilled  675   39.2        1
 7:   Qn1      Quebec nonchilled 1000   39.7        1
 8:   Qn2      Quebec nonchilled   95   13.6        2
 9:   Qn2      Quebec nonchilled  175   27.3        2
10:   Qn2      Quebec nonchilled  250   37.1        2
                          .
                          .
                          .
79:   Mc1 Mississippi    chilled  175   14.9       12
80:   Mc1 Mississippi    chilled  250   18.1       12
81:   Mc1 Mississippi    chilled  350   18.9       12
82:   Mc1 Mississippi    chilled  500   19.5       12
83:   Mc1 Mississippi    chilled  675   22.2       12
84:   Mc1 Mississippi    chilled 1000   21.9       12