Databases are ubiquitous. Any non-trivial application intended for use within the enterprise will almost certainly read from or write to a database at some point. In fact, part of the reason analysts have turned to programming languages like Python (and R) for data manipulation tasks is the higher level of abstraction they provide over database interfaces. In R, the general database interface is DBI. In Python, it is the Database API Specification v2.0 (see PEP 249 for more information). The purpose of these interfaces is to expose a set of methods for database interaction that are consistent across vendors (for example, the method used to retrieve a table from an Oracle database is the same method used to retrieve a table from a Postgres or Microsoft SQL Server database). The remainder of this post focuses specifically on interacting with data in an Oracle database from Pandas.

In the examples that follow, we'll demonstrate how to create a database connection, retrieve tables, manipulate the dataset, and finally export the data back to the database under a different table name. To follow along with the examples, the following third-party packages need to be installed:

  • cx_Oracle
  • Pandas
  • sqlalchemy

Initializing a Database Connection

Once cx_Oracle has been installed, we need to create a database connection. Note that in a previous post, we covered how to retrieve Oracle table data using cx_Oracle directly. Although that approach works, accessing Oracle table data via Pandas is preferred, as it simplifies the most commonly performed actions.

A native cx_Oracle database connection supports one direction only when working through Pandas: table data can be retrieved into a DataFrame, but Pandas cannot write results back through it. If we instead initialize our connection using SQLAlchemy, we allow for bi-directional I/O from the start, which can be very useful. This requires the SQLAlchemy library. The syntax of a SQLAlchemy database connection string is:

"oracle://<USER>:<PASSWORD>@<DATABASE>"

For example, to connect to the DEV schema in database DB with password password, the connection would be specified as:

import cx_Oracle
import sqlalchemy
import pandas as pd

DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)

Once the database connection has been established, we can retrieve datasets using the read_sql_query method. It takes any valid SQL statement along with a connection object referencing the target database. For example, assume we have a table named SEVERITY_CDFS in the DB schema. We can retrieve it as follows (we repeat the earlier connection setup for convenience):

import cx_Oracle
import sqlalchemy
import pandas as pd

pd.options.display.float_format = '{:.5f}'.format

DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)

TABLENAME = "SEVERITY_CDFS"
SQL = f"SELECT * FROM {TABLENAME}"
cdfs = pd.read_sql_query(SQL, con=conn)
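If the query needs a runtime filter, read_sql_query also accepts a params argument, which passes bind variables to the driver rather than interpolating values into the SQL string. The sketch below illustrates the pattern using an in-memory SQLite database and synthetic data as a stand-in for the Oracle source; with cx_Oracle, named binds such as :id would be used in place of SQLite's ? placeholders:

```python
import sqlite3

import pandas as pd

# Build a small in-memory SQLite table with synthetic data to stand in
# for the Oracle source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SEVERITY_CDFS (ID TEXT, CDF REAL)")
conn.executemany(
    "INSERT INTO SEVERITY_CDFS VALUES (?, ?)",
    [("000001", 0.006667), ("000001", 0.013333), ("000002", 0.02)],
)

# The filter value is bound at execution time instead of being pasted
# into the SQL string, which also guards against SQL injection.
df = pd.read_sql_query(
    "SELECT * FROM SEVERITY_CDFS WHERE ID = ?", conn, params=("000001",)
)
print(df.shape)  # (2, 2)
```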

The object returned by read_sql_query is a Pandas DataFrame. If you need to retrieve an entire table without filtering conditions specified in SQL, Pandas offers the read_sql_table method, which takes as its first argument a table name residing in the target schema rather than a SQL statement. Note that read_sql_table is only valid for SQLAlchemy connection objects and will not work with a standard cx_Oracle connection. In the next example, we retrieve the same table as before, but replace read_sql_query with read_sql_table:

import cx_Oracle
import sqlalchemy
import pandas as pd

pd.options.display.float_format = '{:.5f}'.format

# Specify database connection details and establish connection.
DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)

TABLENAME = "SEVERITY_CDFS"
cdfs = pd.read_sql_table(TABLENAME, con=conn)

Iterative Data Retrieval

For very large datasets, it may be inefficient to return all records in a single round trip. Pandas allows datasets to be queried incrementally by providing a chunksize parameter to read_sql_query (also an option for read_sql_table). chunksize specifies how many records should be retrieved at each iteration. The call is identical to the original, except that when chunksize is specified, an iterator over the dataset is returned instead of the dataset itself.

To demonstrate, we again retrieve the SEVERITY_CDFS table but this time in 1000 record blocks:

In [2]: TABLENAME = "SEVERITY_CDFS"
In [3]: SQL = f"SELECT * FROM {TABLENAME}"
In [4]: ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)
In [5]: type(ref)
Out[5]:
generator

Here's an approach that retrieves and compiles the dataset into a single DataFrame by exhausting the generator returned from the call to read_sql_query with chunksize=1000:

TABLENAME = "SEVERITY_CDFS"
SQL = f"SELECT * FROM {TABLENAME}"
ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)
cdfs = pd.concat(ref).reset_index(drop=True)
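When the full dataset never needs to be in memory at once, the chunks can instead be processed one at a time. The sketch below uses a synthetic tuple of DataFrames to stand in for the generator returned by read_sql_query, accumulating a per-ID total chunk by chunk:

```python
import pandas as pd

# Synthetic stand-in for the generator returned by
# read_sql_query(..., chunksize=1000): an iterable of DataFrame chunks.
chunks = (
    pd.DataFrame({"ID": ["000001"] * 3, "LOSS": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"ID": ["000002"] * 2, "LOSS": [4.0, 5.0]}),
)

# Accumulate a running per-ID total so only one chunk is ever in memory.
totals = pd.Series(dtype="float64")
for chunk in chunks:
    totals = totals.add(chunk.groupby("ID")["LOSS"].sum(), fill_value=0)

print(totals.to_dict())  # {'000001': 6.0, '000002': 9.0}
```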

We next describe a few attributes and methods that can be applied to DataFrames. The examples that follow refer to the cdfs dataset retrieved above.
To obtain the field names of the queried dataset, use the columns attribute:

cdfs.columns
Index(['ID', 'CDF_ENUM', 'CDF', 'LOSS'], dtype='object')

To obtain the dimensions of the retrieved dataset, use the shape attribute. The result is a tuple consisting of (nrows, ncols):

>>> cdfs.shape
(16650, 4)

To determine the datatypes associated with each field in cdfs, access the dtypes attribute:

>>> cdfs.dtypes
ID        object
CDF_ENUM  int64
CDF       float64
LOSS      float64
dtype: object
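When a column arrives with an unwanted dtype, it can be recast with astype. Below is a minimal sketch on a hypothetical two-column frame mirroring the dtypes shown above (the "string" extension dtype assumes pandas 1.0 or later):

```python
import pandas as pd

# Hypothetical frame mirroring the dtypes shown above: ID arrives as a
# generic object column.
cdfs = pd.DataFrame({"ID": ["000001", "000002"], "CDF_ENUM": [1, 2]})

# astype takes a column-to-dtype mapping and returns a new DataFrame.
cdfs = cdfs.astype({"ID": "string"})
print(cdfs.dtypes["ID"])  # string
```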

We can sort cdfs by ID and CDF_ENUM, then review the first few records using head:

In [5]: cdfs = cdfs.sort_values(["ID", "CDF_ENUM"]).reset_index(drop=True)
In [6]: cdfs.head()
Out[6]:
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191

Subsetting DataFrames

This section contains a few examples of subsetting DataFrames by index.

Indexing on the DataFrame directly:

In [7]: cdfs[:5]
Out[7]:
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191

Pandas DataFrames can be indexed using .iloc, which provides integer-location based indexing for selection by position. Keep in mind that Python uses 0-based indexing. Passing : to .iloc serves as a stand-in for "return all": when passed as the first argument, it returns all rows regardless of any column (axis 1) subsetting; when passed as the second argument, it returns all columns regardless of any row (axis 0) subsetting:

In [8]: cdfs.iloc[:5, :]
Out[8]:
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191
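Row and column subsetting can be combined in a single .iloc call, and lists of positions select non-contiguous entries. Here is a small sketch on a synthetic frame with the same column layout as cdfs:

```python
import pandas as pd

# Synthetic frame with the same column layout as cdfs.
df = pd.DataFrame({
    "ID": ["000001"] * 3,
    "CDF_ENUM": [1, 2, 3],
    "CDF": [0.006667, 0.013333, 0.02],
    "LOSS": [69077.55, 76009.02, 80063.68],
})

# First two rows and first two columns by position.
sub = df.iloc[:2, :2]
print(sub.shape)  # (2, 2)

# Lists of positions select non-contiguous rows and columns:
# rows 0 and 2, columns 1 (CDF_ENUM) and 3 (LOSS).
sub2 = df.iloc[[0, 2], [1, 3]]
print(list(sub2.columns))  # ['CDF_ENUM', 'LOSS']
```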

Subsetting Columns

A subset of DataFrame columns can be obtained by passing a list of column names, which results in the double-bracket syntax below. A subset of cdfs consisting of ID, CDF_ENUM and CDF, which we refer to as df, is obtained as follows:

In [9]: df = cdfs[["ID", "CDF_ENUM", "CDF"]]
In [10]: df.head(5)
Out[10]:
    ID  CDF_ENUM      CDF
000001         1 0.006667
000001         2 0.013333
000001         3     0.02
000001         4 0.026667
000001         5 0.033333

Individual fields can be retrieved using two approaches. Keep in mind that Pandas DataFrames are composed of one or more homogeneously typed Series objects, each similar to a named vector in R. We can use bracket notation, specifying the field of interest as a string, or dot notation, where the field is treated as an attribute of the parent DataFrame:

In [11]: id1 = cdfs["ID"]
In [12]: id2 = cdfs.ID
In [13]: type(id1)
Out[13]:
pandas.core.series.Series

In [14]: id1.equals(id2)
Out[14]:
True

To determine the unique elements comprising id1 (or id2), call the unique method:

In [15]: id1.unique()
Out[15]:
array(['000001', '000002', '000003'], dtype=object)

If a count of the number of unique elements is needed, call nunique:

In [16]: id1.nunique()
Out[16]:
3
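Relatedly, the value_counts method reports how many times each distinct value occurs, sorted in descending order of frequency. A quick sketch on a synthetic stand-in for the ID Series:

```python
import pandas as pd

# Synthetic stand-in for the ID Series pulled from cdfs.
id1 = pd.Series(["000001", "000001", "000002", "000003"], name="ID")

# value_counts returns the per-value frequencies as a Series indexed
# by the distinct values.
counts = id1.value_counts()
print(counts["000001"])  # 2
```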

Note that if desired, ID can be transformed into a single-column DataFrame by specifying double instead of single brackets:

In [17]: df = cdfs[["ID"]]
In [18]: type(df)
Out[18]:
pandas.core.frame.DataFrame

A Pandas DataFrame or Series object can be converted into a numpy ndarray by accessing the object's values attribute:

In [19]: mgarr = id1.values
In [20]: type(mgarr)
Out[20]:
numpy.ndarray
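Newer versions of pandas also provide a to_numpy method, which the pandas documentation recommends over values for this conversion. A brief sketch on a synthetic Series:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ID Series.
s = pd.Series(["000001", "000002", "000003"], name="ID")

# to_numpy performs the same Series-to-ndarray conversion as values.
arr = s.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```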

Leveraging the query Method

Pandas DataFrames expose a query method, which takes a SQL-like expression used to subset the target DataFrame. To demonstrate, we retrieve records from cdfs where ID=='000001' and CDF<0.1:

In [2]: cdfs.query("ID=='000001' and CDF<.1")
Out[2]:
    ID  CDF_ENUM       CDF         LOSS
000001  1        0.006667   69077.55279
000001  2        0.013333   76009.0246
000001  3        0.02       80063.67568
000001  4        0.026667   82940.4964
000001  5        0.033333   85171.93191
000001  6        0.04       86995.14748
000001  7        0.046667   88536.65428
000001  8        0.053333   89871.96821
000001  9        0.06       91049.79856
000001  10       0.066667   92103.40372
000001  11       0.073333   93056.50552
000001  12       0.08       93926.61929
000001  13       0.086667   94727.04636
000001  14       0.093333   95468.12609

In expressions parsed by query, == tests for equality with the right-hand side of the expression, whereas SQL dialects use a single = for the same purpose.

You may not always know the values with which to subset in advance; the subset of interest may not be known until runtime. In those instances, we can reference a variable by prefixing its name with @, which prevents the identifier from being interpreted as a string literal. Next we return the same subset as before, but using variables rather than string literals:

In [2]: GROUP = "000001"
In [3]: CDF_THRESH = 0.1
In [4]: cdfs.query("ID==@GROUP and CDF<@CDF_THRESH")
Out[4]:
ID      CDF_ENUM       CDF         LOSS
000001  1        0.006667   69077.55279
000001  2        0.013333   76009.0246
000001  3        0.02       80063.67568
000001  4        0.026667   82940.4964
000001  5        0.033333   85171.93191
000001  6        0.04       86995.14748
000001  7        0.046667   88536.65428
000001  8        0.053333   89871.96821
000001  9        0.06       91049.79856
000001  10       0.066667   92103.40372
000001  11       0.073333   93056.50552
000001  12       0.08       93926.61929
000001  13       0.086667   94727.04636
000001  14       0.093333   95468.12609

Appending a Timestamp to a DataFrame

We can use the standard library datetime module to append a timestamp to cdfs:

In [2]: import datetime
In [3]: tstmp = datetime.datetime.now().strftime("%c")
In [4]: cdfs["TIMESTAMP"] = tstmp
In [5]: cdfs.head()
Out[5]:
    ID  CDF_ENUM      CDF        LOSS                TIMESTAMP
000001         1 0.006667 69077.55279 Fri Dec 21 09:52:11 2018
000001         2 0.013333 76009.02460 Fri Dec 21 09:52:11 2018
000001         3     0.02 80063.67568 Fri Dec 21 09:52:11 2018
000001         4 0.026667 82940.49640 Fri Dec 21 09:52:11 2018
000001         5 0.033333 85171.93191 Fri Dec 21 09:52:11 2018

Uploading DataFrames to Oracle

Finally, we demonstrate how to export the updated cdfs table back to Oracle. To accomplish this, we use the to_sql method. It takes a table name as a string, a connection object, and a parameter (if_exists) that specifies how to proceed if the table name already exists in the target schema. Options are "fail", "replace" or "append" ("fail" is the default). Note also that to_sql writes the DataFrame's index as a column by default; pass index=False to suppress it.

To replace the existing SEVERITY_CDFS table with our updated version containing the TIMESTAMP column, we call to_sql (assuming our connection still exists):

In [2]: cdfs.to_sql("SEVERITY_CDFS", con=conn, if_exists="replace")
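For readers without an Oracle instance handy, the if_exists behavior can be illustrated with the standard library's SQLite driver, which Pandas also accepts as a connection. The data below is synthetic; note the use of index=False to keep the DataFrame index out of the stored table:

```python
import sqlite3

import pandas as pd

# In-memory SQLite connection standing in for the Oracle engine.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"ID": ["000001", "000002"], "LOSS": [1.0, 2.0]})

# "replace" drops and recreates the table if it already exists;
# index=False keeps the DataFrame's integer index out of the table.
df.to_sql("SEVERITY_CDFS", con=conn, if_exists="replace", index=False)

# "append" adds rows to the now-existing table.
df.to_sql("SEVERITY_CDFS", con=conn, if_exists="append", index=False)

out = pd.read_sql_query("SELECT COUNT(*) AS N FROM SEVERITY_CDFS", conn)
print(out["N"][0])  # 4
```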