In a previous post, we highlighted common methods for interacting with SQLite databases from within the Python programming language. Python exposes the sqlite3 module as part of its standard library, allowing for tight integration between the two projects.

Although RSQLite is not included with the standard R distribution, the interface is familiar and straightforward to use, especially if you have experience interfacing with other relational database management systems from R: RSQLite is DBI-compatible, and leverages all of the familiar database interaction functionality also found in 3rd-party packages such as RMySQL, ROracle, etc…

In this post, we’ll demonstrate how to get up and running with RSQLite. We’ll walk through creating a new SQLite database, creating and populating tables in the database, and finally how to query data from the tables we created.

SQLite Database Creation

Creating a new database is straightforward: Specify the type of database to create (the SQLite variety for this example), along with a filepath to which the database will be saved. One of the major benefits SQLite is that it is a disk-based database which doesn’t require a separate server process:

library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), "/path/to/database/file.db")



The database file extension can be either .db or .sqlite.


Note that it’s possible to explicity prefix the source library for objects in R with libname::. This removes any ambiguity regarding the package of origin for an object of interest. In the source above, it is clear that SQLite() originates from RSQLite.


If the database is required only temporarily and it is preferable not to save the database to file, it’s possible to create an in-memory database. The initialization is the same as for the persisted database, except the filepath is replaced with “:memory:”:

library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), ":memory:")


Then, to disconnect from the database, call dbDisconnect(<databasename>):


dbDisconnect(conn)



With respect to the in-memory database, when dbDisconnect is called, the database will be purged from memory.

Table Creation


One of the advantages of interfacing with relational database management systems in R is that it is not necessary to explicitly create and execute the DDL associated with the table to be written. The structure of an R data.frame is such that all datatypes and additional table specifications can be inferred from the data.frame, and the DDL in turn is then compiled and executed on the fly. This is hugely convenient, especially for workflows that rely on a large number table creation and population routines.


In the next example, we load the trees dataset (which is included as one of R’s sample datasets) to an SQLite database sample.db into a table named trees.


A look at the first few records of trees:


  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7


In addition, we’ll include a TIMESTAMP field with Girth, Height and Volume. Note that TIMESTAMP needs to be coerced to a character datatype since SQLite doesn’t support date datatypes:

library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), "sample.db")

df = trees
df = cbind(df, Timestamp=c(toString(Sys.time())))

# change fieldnames to all caps =>
names(df) = toupper(names(df))

dbWriteTable(conn, "trees", df)



To verify that trees was created in sample.db, call dbListTables:

# List tables contained in active database =>
> dbListTables(conn)
[1] "trees"


Querying Tables

There are two common approaches with respect to data retrevial: First, pass a valid SQL statement to dbGetQuery. The SQL statement gets passed along to the SQLite parser and the corresponding dataset is returned as an R data.frame. Second, provide the tablename to dbReadTable, and the table will be returned as contained in the database in its entirety. Users of the latter approach may experience severe performance degradation upon querying very large datasets. We’ll demonstrate a work-around to this problem in the next section.



To demonstrate dbGetQuery, we return all records from the “trees” table with \(HEIGHT>80\):

library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), "sample.db")

SQLstr = "SELECT * FROM trees WHERE HEIGHT>80"
treesDF = dbGetQuery(conn, SQLstr)

dbDisconnect(conn)

# print `treesDF` results in =>
  GIRTH HEIGHT VOLUME           TIMESTAMP
1  10.7     81   18.8 2017-09-02 21:59:26
2  10.8     83   19.7 2017-09-02 21:59:26
3  12.9     85   33.8 2017-09-02 21:59:26
4  13.3     86   27.4 2017-09-02 21:59:26
5  17.3     81   55.4 2017-09-02 21:59:26
6  17.5     82   55.7 2017-09-02 21:59:26
7  20.6     87   77.0 2017-09-02 21:59:26



If at any point it is necessary to drop a table from a database, simply call dbRemoveTable. To remove “trees”, we’d call:

dbRemoveTable(conn, "trees")

print(dbListTables(conn))
[1] character(0)



dbReadTable requires only the database connection and the tablename of interest. Assuming we haven’t removed “trees”, to return the table in it’s entirety, run:

library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), "sample.db")
treesDF = dbReadTable(conn, "trees")

print(dim(treesDF))
[1] 31  4



This matches with the original dimensions of “trees” after appending TIMESTAMP.

RSQLite has support parameterized queries, where a value is provided which fully specifies the SQL statement at runtime. To demonstrate, consider the SQL statement which retrieved the records from “trees” with HEIGHT in excess of \(80\). Instead of hardcoding \(80\), we specify the threshold at runtime. This requires a slight modification to the SQL statement, as well as the inclusion of an additional argument to the dbGetQuery call. Here we return records with \(HEIGHT>82\):

library(DBI)
library(RSQLite)

conn = DBI::dbConnect(RSQLite::SQLite(), "sample.db")

# change threshold from query to query =>
thrsh = 82
SQLstr = "SELECT * FROM trees WHERE HEIGHT>:thrsh"
treesDF = DBI::dbGetQuery(conn, SQLstr, params=list(thrsh=thrsh))

print(treesDF)

# returns =>
  GIRTH HEIGHT VOLUME           TIMESTAMP
1  10.8     83   19.7 2017-09-02 21:59:26
2  12.9     85   33.8 2017-09-02 21:59:26
3  13.3     86   27.4 2017-09-02 21:59:26
4  20.6     87   77.0 2017-09-02 21:59:26


Iterative Retrievial for Large Datasets

As mentioned in the last section, users of dbGetTable may experience severe performance degradation when retrieving very large datasets. We can instead retrieve the data iteratively, using a combination of dbSendQuery and dbFetch.
The call to dbSendQuery is identical to dbGetQuery, except dbSendQuery initializes a cursor associated with the dataset of interest as opposed to retrieving the dataset of interest outright (as dbGetQuery does). Think of the variable bound to the result of dbSendQuery as a pointer to the row currently being processed, and as each record is retrieved, the pointer moves to the next row, on and on until the entire result set has been traversed.

dbFetch takes for arguments a cursor, as well as a number which determines how many records to retrieve at each iteration. If n is not specified, it defaults to 500. If n is set to -1, the entire datset will be retrieved at once, exhibiting behavior similiar to dbGetQuery.

We next demonstrate iterative retrieval using dbSendQuery and dbFetch to query “trees” in groups of 5 records. Each data.frame is written to a list, then combined upon completion. Once iteration has ceased, calling dbClearResult(<cursor>) closes the result set:


library(DBI)
library(RSQLite)

conn = dbConnect(RSQLite::SQLite(), "sample.db")

# `resList` contains data.frame from each iteration =>
resList = list()
cursor = dbSendQuery(conn, "SELECT * FROM trees")

# commence iteration =>
while (!dbHasCompleted(cursor)) {
    iterDF = dbFetch(cursor, n=5)
    resList[[length(resList)+1]] <- iterDF
}

dbClearResult(cursor)
treesDF = do.call("rbind", resList)
print(dim(treesDF))
[1] 31  4



Conclusion

One of the benefits of leveraging SQLite for data management has to do with interoperability across programming languages. While there exist plenty of 3rd-party libraries that facilitate communication between, say, R and Python, SQLite databases can be used as a common middle ground or pipeline between Python and R that doesn’t require the maintenance or management of external dependencies. This is preferable to simply writing an R data.frame to .csv then reading it back into Python, since with SQLite, the datatypes are necessarily preserved.
SQLite usage is widespread, and after sufficient familiarity with it has been gained, you’ll find yourself using it all the time is all kinds of projects. Go forth and experiment! Until next time, happy coding!