Databases are ubiquitous. Any non-trivial application intended for use within the enterprise will almost certainly read from or write to a database at some point. In fact, part of the reason analysts have turned to programming languages like Python (and R) for data manipulation tasks is the higher level of abstraction with respect to database interfaces. In R, the general database interface is DBI. In Python it is referred to as the Database API-2.0 (see PEP249 for more information). The purpose of these interfaces is to expose a set of methods for database interaction that are consistent across vendors (for example, a method used to retrieve a table from an Oracle database is the same method used to retrieve a table from a Postgres or Microsoft SQLServer database). The remainder of this post focuses specifically on interacting with data in an Oracle database from Pandas.

The canonical Python-Oracle interface is cx_Oracle. cx_Oracle isn’t distributed as part of the Python Standard Library, nor is it included as part of the Anaconda distribution, and therefore needs to be installed separately via pip or conda. To verify that the package has been installed properly, open a command prompt (Press Windows key + R, then type “cmd”), and enter the following:

C:\> python -c "import cx_Oracle; print(cx_Oracle.__version__)"
7.0.0

If, instead of a version number, you see a message indicating that ‘python’ is not recognized as an internal or external command, operable program or batch file, verify that the directory containing the Python executable is included in your system or user PATH environment variable.

In the examples that follow, we’ll demonstrate how to create a database connection, retrieve tables, manipulate the dataset and finally export the data back to the database. In order to follow along with the examples, the following 3rd-party packages need to be installed:

  • cx_Oracle
  • Pandas
  • sqlalchemy

Initializing a Database Connection

Once cx_Oracle has been installed, we need to create a database connection. Note that in a previous post, we covered how to retrieve Oracle table data using cx_Oracle directly. Although this approach is possible, accessing Oracle table data via Pandas is much preferred as it simplifies the most commonly performed actions.

When paired with Pandas, cx_Oracle’s native database connection supports reads only: table data can be retrieved with read_sql_query, but DataFrames cannot be written back with to_sql. If we instead initialize our connection using SQLAlchemy, we allow for bi-directional i/o from the start, which can be very useful. This requires the SQLAlchemy library (which is included as part of the Anaconda distribution). The syntax to create a database connection with SQLAlchemy is:

"oracle://<USER>:<PASSWORD>@<DATABASE>"

For example, to connect to the “DEV” schema/user in database “DB” with password “password”, the connection would be specified as:

import cx_Oracle
import sqlalchemy
import pandas as pd

DATABASE = "DB"
SCHEMA   = "DEV"
PASSWORD = "password"
connstr  = "oracle://{}:{}@{}".format(SCHEMA, PASSWORD, DATABASE)
conn     = sqlalchemy.create_engine(connstr)

If you’re using Python 3.6 or later, you can use f-strings:

connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"

Once the database connection has been established, we can retrieve datasets using the Pandas read_sql_query function. It takes any valid SQL statement along with a connection object referencing the target database. For example, assume we have a table named “SEVERITY_CDFS” in the “DB” schema containing 150-point discretized severity distributions for various lines of business. A sample of what the data looks like can be obtained here. We retrieve the table as follows (we repeat the earlier code for convenience):

"""
Demonstrate how to query Oracle datasets using Python
and Pandas.
Required 3rd-party Packages: cx_Oracle, sqlalchemy, pandas
"""
import cx_Oracle
import sqlalchemy
import pandas as pd

# Suppress scientific notation when printing DataFrame objects.
# We choose to display floats with 5 decimals. Modify as needed.
pd.options.display.float_format = '{:.5f}'.format

# Specify database connection details and establish connection.
DATABASE = "DB"
SCHEMA   = "DEV"
PASSWORD = "password"
connstr  = "oracle://{}:{}@{}".format(SCHEMA, PASSWORD, DATABASE)
conn     = sqlalchemy.create_engine(connstr)

# Specify the table to query. SQL can be any valid
# SQL statement.
TABLENAME = "SEVERITY_CDFS"
SQL  = "SELECT * FROM {}".format(TABLENAME)
cdfs = pd.read_sql_query(SQL, con=conn)

The object returned by read_sql_query is a Pandas DataFrame. This is a 2-dimensional tabular data structure with a large number of useful attributes and methods. We can verify cdfs is a DataFrame by calling Python’s type function on cdfs:

>>> type(cdfs)
pandas.core.frame.DataFrame
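The examples in this post build SQL with str.format for brevity. When query values originate from user input, a safer pattern is to bind parameters so the driver handles quoting. Here’s a minimal sketch of the pattern; it uses an in-memory SQLite database as a stand-in for Oracle (the table contents are purely illustrative), but the same read_sql_query call works against an Oracle SQLAlchemy engine:

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite stands in for the Oracle database in this sketch.
engine = sqlalchemy.create_engine("sqlite://")

# Create a small illustrative table.
pd.DataFrame({"ID": ["000001", "000002"], "CDF": [0.00667, 0.01333]}).to_sql(
    "SEVERITY_CDFS", con=engine, index=False)

# Bind the value via a named parameter instead of formatting it into
# the SQL string; the driver performs the quoting.
SQL = sqlalchemy.text("SELECT * FROM SEVERITY_CDFS WHERE ID = :id")
df = pd.read_sql_query(SQL, con=engine, params={"id": "000001"})
print(df.shape)   # (1, 2)
```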

If you need to retrieve an entire table without filtering conditions specified in SQL, Pandas offers the read_sql_table function, which takes as its first argument a tablename residing in the target schema rather than a SQL statement. Note that read_sql_table is only valid for SQLAlchemy connection objects, and won’t work with a standard cx_Oracle connection. In the next example, we retrieve the same table as before, but replace read_sql_query with read_sql_table:

"""
Demonstrate how to query Oracle datasets using Python
and Pandas.
Required 3rd-party Packages: cx_Oracle, sqlalchemy, pandas
"""
import cx_Oracle
import sqlalchemy
import pandas as pd

# Suppress scientific notation when printing DataFrame objects.
# We choose to display floats with 5 decimals. Modify as needed.
pd.options.display.float_format = '{:.5f}'.format

# Specify database connection details and establish connection.
DATABASE = "DB"
SCHEMA   = "DEV"
PASSWORD = "password"
connstr  = "oracle://{}:{}@{}".format(SCHEMA, PASSWORD, DATABASE)
conn     = sqlalchemy.create_engine(connstr)

# Specify the table to query and retrieve it directly
# by name.
TABLENAME = "SEVERITY_CDFS"
cdfs = pd.read_sql_table(TABLENAME, con=conn)

The only difference is the elimination of the SQL statement. In this version, the target tablename is passed to the function directly.
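read_sql_table also accepts optional columns and schema arguments, for restricting the fields returned and for reading from a schema other than the connection’s default. A sketch of the columns argument, using an in-memory SQLite database as a stand-in for Oracle with illustrative table data:

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite stands in for the Oracle database in this sketch.
engine = sqlalchemy.create_engine("sqlite://")
pd.DataFrame(
    {"ID": ["000001"], "CDF_ENUM": [1], "CDF": [0.00667], "LOSS": [69077.55]}
).to_sql("SEVERITY_CDFS", con=engine, index=False)

# columns= limits retrieval to the listed fields; schema= (omitted here)
# would target a non-default schema.
cdfs = pd.read_sql_table("SEVERITY_CDFS", con=engine, columns=["ID", "CDF"])
print(list(cdfs.columns))   # ['ID', 'CDF']
```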

Iterative Data Retrieval

For very large datasets, it may be inefficient to return all records in a single round trip. Pandas allows datasets to be queried incrementally by providing a chunksize parameter within read_sql_query (also applies to read_sql_table). chunksize specifies how many records should be retrieved at each iteration. The call is identical to the original, except when chunksize is specified, an iterator referencing the dataset is returned instead of the dataset itself.

To demonstrate, we read “SEVERITY_CDFS” as before, but this time in 1000 record blocks:

>>> TABLENAME = "SEVERITY_CDFS"
>>> SQL = "SELECT * FROM {}".format(TABLENAME)
>>> ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)
>>> type(ref)
generator

Notice that ref is a generator.

Here’s an approach that can be used to retrieve and compile the dataset into a single DataFrame by exhausting the generator returned from the call to read_sql_query with chunksize=1000:

TABLENAME = "SEVERITY_CDFS"
SQL = "SELECT * FROM {}".format(TABLENAME)
ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)

# Process ref generator using a list comprehension.
cdfs_list = [i for i in ref]

# Vertically concatenate DataFrames in cdfs_list into a single
# DataFrame via pd.concat.
cdfs = pd.concat(cdfs_list).reset_index(drop=True)
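Note that materializing every chunk into a single DataFrame, as above, still ends with the full dataset in memory. When only an aggregate is needed, each chunk can be processed and discarded as it arrives. A sketch of that pattern, again using an in-memory SQLite database as a stand-in for Oracle with illustrative data:

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite stands in for the Oracle database in this sketch.
engine = sqlalchemy.create_engine("sqlite://")
pd.DataFrame({"ID": ["000001"] * 5000, "CDF": [0.5] * 5000}).to_sql(
    "SEVERITY_CDFS", con=engine, index=False)

# Accumulate a running row count chunk-by-chunk, so that at most
# `chunksize` records are held in memory at any point.
nbr_rows = 0
for chunk in pd.read_sql_query(
        "SELECT * FROM SEVERITY_CDFS", con=engine, chunksize=1000):
    nbr_rows += len(chunk)

print(nbr_rows)   # 5000
```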

Another motivation for interacting with Oracle through Pandas is the fact that the result is returned as a DataFrame object, which is a powerful, versatile data structure with a wide range of potential uses. We next describe a few useful routines that can be applied to DataFrames. Assume that the cdfs table has been retrieved.

To obtain the fieldnames of the queried dataset, use the columns attribute:

>>> cdfs.columns
Index(['ID', 'CDF_ENUM', 'CDF', 'LOSS'], dtype='object')

To obtain the dimensions of the retrieved dataset, use the shape attribute. The result is a tuple consisting of (nrows, ncols):

>>> cdfs.shape
(16650, 4)

To determine the datatypes associated with each field in cdfs, call the dtypes attribute:

>>> cdfs.dtypes
ID        object
CDF_ENUM  int64
CDF       float64
LOSS      float64
dtype: object

We can sort cdfs by “ID” and “CDF_ENUM”, then review the first few records using head:

>>> cdfs = cdfs.sort_values(["ID", "CDF_ENUM"]).reset_index(drop=True)
>>> cdfs.head()
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191

Two other approaches can be used to retrieve the first 5 records of cdfs.

Subsetting DataFrames

This section contains a few examples of subsetting DataFrames by index.

Indexing on the DataFrame directly:

>>> cdfs[:5]
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191

Indexing with the integer-location indexer .iloc: Pandas DataFrames can be indexed using .iloc, which facilitates purely integer-location based selection by position. Keep in mind that Python uses 0-based indexing. Passing : to .iloc serves as a stand-in for “return all”: as the first argument, it returns all rows independent of any column-level (axis 1) subsetting; as the second argument, it returns all columns independent of any row-level (axis 0) subsetting:

>>> cdfs.iloc[:5, :]
    ID  CDF_ENUM      CDF        LOSS
000001         1 0.006667 69077.55279
000001         2 0.013333 76009.02460
000001         3     0.02 80063.67568
000001         4 0.026667 82940.49640
000001         5 0.033333 85171.93191

Subsetting Columns

A subset of DataFrame columns can be obtained by passing a list of the desired column names (note the double brackets). A subset of cdfs consisting of “ID”, “CDF_ENUM” and “CDF”, which we refer to as df, is obtained as follows:

>>> df = cdfs[["ID", "CDF_ENUM", "CDF"]]
>>> df.head(5)
    ID  CDF_ENUM      CDF
000001         1 0.006667
000001         2 0.013333
000001         3     0.02
000001         4 0.026667
000001         5 0.033333

Individual fields can be retrieved using two approaches. Keep in mind that Pandas DataFrames are composed of one or more homogeneously typed Series objects, each similar to a named vector in R. We can use bracket notation, specifying the field of interest as a string, or dot notation, where the field is treated as an attribute of the parent DataFrame:

>>> id1 = cdfs["ID"]
>>> id2 = cdfs.ID
>>> type(id1)
pandas.core.series.Series
>>> id1.equals(id2)
True

To determine the unique elements comprising id1 (or id2), call the unique method:

>>> id1.unique()
array(['000001', '000002', '000003'], dtype=object)

If a count of the number of unique elements is needed, call nunique:

>>> id1.nunique()
3

Note that if desired, “ID” can be transformed into a single-column DataFrame by specifying double instead of single brackets:

>>> df = cdfs[["ID"]]
>>> type(df)
pandas.core.frame.DataFrame

At any time, a pandas DataFrame or Series object can be converted into a numpy ndarray by calling the object’s values attribute:

>>> mgarr = id1.values
>>> type(mgarr)
numpy.ndarray
>>> mgarr.ndim
1
>>> mgarr = cdfs[["CDF_ENUM", "CDF"]].values
>>> type(mgarr)
numpy.ndarray
>>> mgarr.ndim
2
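As an aside, Pandas 0.24 introduced the to_numpy method as the recommended alternative to the values attribute; both return the same ndarray. A quick sketch with illustrative data:

```python
import numpy as np
import pandas as pd

cdfs = pd.DataFrame({"CDF_ENUM": [1, 2], "CDF": [0.00667, 0.01333]})

# to_numpy (Pandas 0.24+) is the recommended equivalent of .values.
arr = cdfs[["CDF_ENUM", "CDF"]].to_numpy()
print(type(arr).__name__, arr.ndim)   # ndarray 2
```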

Leveraging the query Method

Pandas DataFrames expose a query method, which takes a SQL-like expression used to subset the target DataFrame. To demonstrate, we retrieve records from cdfs where “ID”==‘000001’ and CDF<0.1:

>>> cdfs.query("ID=='000001' and CDF<.1")
    ID  CDF_ENUM       CDF         LOSS
000001  1        0.006667   69077.55279
000001  2        0.013333   76009.0246
000001  3        0.02       80063.67568
000001  4        0.026667   82940.4964
000001  5        0.033333   85171.93191
000001  6        0.04       86995.14748
000001  7        0.046667   88536.65428
000001  8        0.053333   89871.96821
000001  9        0.06       91049.79856
000001  10       0.066667   92103.40372
000001  11       0.073333   93056.50552
000001  12       0.08       93926.61929
000001  13       0.086667   94727.04636
000001  14       0.093333   95468.12609

For expressions parsed by query, == is used to obtain records equal to the RHS of the expression, whereas SQL variants use a single = to represent the same idea.

You may not always know the values with which to subset in advance. It may be that the subset of interest isn’t known until runtime. In those instances, we can reference a variable within query by prefixing it with @, which tells query to resolve it as a Python variable rather than as a column name or string literal. Here’s how to return the same subset as above, but using variables as opposed to hard-coded string literals:

>>> GROUP = "000001"
>>> CDF_THRESH  = 0.1
>>> cdfs.query("ID==@GROUP and CDF<@CDF_THRESH")
ID      CDF_ENUM       CDF         LOSS
000001  1        0.006667   69077.55279
000001  2        0.013333   76009.0246
000001  3        0.02       80063.67568
000001  4        0.026667   82940.4964
000001  5        0.033333   85171.93191
000001  6        0.04       86995.14748
000001  7        0.046667   88536.65428
000001  8        0.053333   89871.96821
000001  9        0.06       91049.79856
000001  10       0.066667   92103.40372
000001  11       0.073333   93056.50552
000001  12       0.08       93926.61929
000001  13       0.086667   94727.04636
000001  14       0.093333   95468.12609

Appending a Timestamp to a DataFrame

We can use the standard library’s datetime module to append a timestamp to cdfs:

>>> import datetime
>>> tstmp = datetime.datetime.now().strftime("%c")
>>> cdfs["TIMESTAMP"] = tstmp
>>> cdfs.head()
    ID  CDF_ENUM      CDF        LOSS                TIMESTAMP
000001         1 0.006667 69077.55279 Fri Dec 21 09:52:11 2018
000001         2 0.013333 76009.02460 Fri Dec 21 09:52:11 2018
000001         3     0.02 80063.67568 Fri Dec 21 09:52:11 2018
000001         4 0.026667 82940.49640 Fri Dec 21 09:52:11 2018
000001         5 0.033333 85171.93191 Fri Dec 21 09:52:11 2018

Exporting DataFrames to Oracle

Finally, we demonstrate how to export the updated cdfs table back to Oracle. To accomplish this, we use the DataFrame’s to_sql method. It takes a table name as a string, a connection object and an if_exists parameter that specifies how to proceed if the given tablename already exists in the target schema. Options are “fail”, “replace” or “append” (“fail” by default).

To replace the existing “SEVERITY_CDFS” table with our updated version including “TIMESTAMP”, we’d call to_sql as follows (assume our connection, conn, still exists and references the same database and schema):

>>> cdfs.to_sql("SEVERITY_CDFS", con=conn, if_exists="replace")
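One caveat: by default, to_sql also writes the DataFrame’s index as an additional column in the exported table. Pass index=False to keep the export limited to the data columns. A minimal sketch, once more using an in-memory SQLite database as a stand-in for Oracle with illustrative data:

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite stands in for the Oracle database in this sketch.
engine = sqlalchemy.create_engine("sqlite://")
cdfs = pd.DataFrame({"ID": ["000001"], "CDF": [0.00667]})

# index=False prevents the DataFrame index from being written as an
# extra column in the exported table.
cdfs.to_sql("SEVERITY_CDFS", con=engine, if_exists="replace", index=False)

exported = pd.read_sql_table("SEVERITY_CDFS", con=engine)
print(list(exported.columns))   # ['ID', 'CDF']
```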

Conclusion

In this post, we demonstrated a simplified approach to interacting with Oracle database data using standard Pandas DataFrame methods. By leveraging Pandas, there’s no need to iterate over cursor objects, and large datasets can still be loaded lazily (as demonstrated by the chunksize parameter in one of the earlier examples). Pandas DataFrames expose a ton of functionality, and there are always new things to discover, so start exploring. Until next time, happy coding!