Databases are ubiquitous. Any non-trivial application intended for use within the enterprise will almost certainly read from or write to a database at some point. In fact, part of the reason analysts have turned to programming languages like Python (and R) for data manipulation tasks is the higher level of abstraction their database interfaces provide. In R, the general database interface is DBI. In Python, it is the Database API 2.0 (see PEP 249 for more information). The purpose of these interfaces is to expose a set of methods for database interaction that are consistent across vendors (for example, the method used to retrieve a table from an Oracle database is the same method used to retrieve a table from a Postgres or Microsoft SQL Server database). The remainder of this post focuses specifically on interacting with data in an Oracle database from Pandas.
In the examples that follow, we demonstrate how to create a database connection, retrieve tables, manipulate the resulting dataset and finally export the modified data back to the database. In order to follow along with the examples, the following third-party packages need to be installed:
- cx_Oracle
- Pandas
- sqlalchemy
Initializing a Database Connection
Once cx_Oracle has been installed, we need to create a database connection. Note that in a previous post, we covered how to retrieve Oracle table data using cx_Oracle directly. Although that approach works, accessing Oracle table data through Pandas is preferred, as it simplifies the most commonly performed actions.
With a native cx_Oracle connection, Pandas supports one-way interaction only: table data can be retrieved, but DataFrames cannot be written back to the database. If we instead initialize our connection with SQLAlchemy, we allow for bi-directional I/O from the start, which can be very useful. The SQLAlchemy connection string for Oracle takes the form:
"oracle://<USER>:<PASSWORD>@<DATABASE>"
For example, to connect to the DEV schema in database DB with password password, the connection would be specified as:
import cx_Oracle
import sqlalchemy
import pandas as pd
DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)
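If the target database is not defined as a local TNS alias, the engine can instead be created from a host/port/service-name URL. A minimal sketch, using hypothetical connection details:
# Hypothetical host, port and service name; replace with your own.
HOST, PORT, SERVICE_NAME = "dbhost.example.com", 1521, "ORCLPDB1"
connstr = f"oracle+cx_oracle://{SCHEMA}:{PASSWORD}@{HOST}:{PORT}/?service_name={SERVICE_NAME}"
conn = sqlalchemy.create_engine(connstr)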
Once the database connection has been established, we can retrieve datasets using the read_sql_query method. It takes as arguments any valid SQL statement along with a connection object referencing the target database. For example, assume we have a table named SEVERITY_CDFS in the DEV schema. We can retrieve the table as follows (we repeat the earlier setup code for convenience):
import cx_Oracle
import sqlalchemy
import pandas as pd
pd.options.display.float_format = '{:.5f}'.format
DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)
TABLENAME = "SEVERITY_CDFS"
SQL = "SELECT * FROM {}".format(TABLENAME)
cdfs = pd.read_sql_query(SQL, con=conn)
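read_sql_query also accepts an optional params argument, which lets filtering values be passed separately from the SQL text rather than interpolated into it. A minimal sketch, assuming a hypothetical filter on ID (cx_Oracle uses named bind parameters of the form :name):
# Retrieve only the records associated with a single ID via a bind parameter.
SQL = "SELECT * FROM SEVERITY_CDFS WHERE ID = :id"
cdfs_sub = pd.read_sql_query(SQL, con=conn, params={"id": "000001"})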
The object returned by read_sql_query is a Pandas DataFrame. If you need to retrieve an entire table without specifying filtering conditions in SQL, Pandas offers the read_sql_table method, which takes as its first argument the name of a table residing in the target schema rather than a SQL statement. Note that read_sql_table is only valid for SQLAlchemy connection objects and will not work with a standard cx_Oracle connection. In the next example, we retrieve the same table as before, but replace read_sql_query with read_sql_table:
import cx_Oracle
import sqlalchemy
import pandas as pd
pd.options.display.float_format = '{:.5f}'.format
# Specify database connection details and establish connection.
DATABASE = "DB"
SCHEMA = "DEV"
PASSWORD = "password"
connstr = f"oracle://{SCHEMA}:{PASSWORD}@{DATABASE}"
conn = sqlalchemy.create_engine(connstr)
TABLENAME = "SEVERITY_CDFS"
cdfs = pd.read_sql_table(TABLENAME, con=conn)
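read_sql_table also accepts an optional columns argument, which restricts the retrieval to a subset of the table's fields. A minimal sketch against the same table (column-name case handling can vary with the Oracle dialect, so treat this as illustrative):
# Retrieve only the ID, CDF_ENUM and CDF columns of SEVERITY_CDFS.
cdfs_sub = pd.read_sql_table(TABLENAME, con=conn, columns=["ID", "CDF_ENUM", "CDF"])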
Iterative Data Retrieval
For very large datasets, it may be inefficient to return all records in a single round trip. Pandas allows datasets to be retrieved incrementally by passing a chunksize parameter to read_sql_query (also available in read_sql_table). chunksize specifies how many records should be retrieved at each iteration. The call is identical to the original, except that when chunksize is specified, an iterator referencing the dataset is returned instead of the dataset itself.
To demonstrate, we again retrieve the SEVERITY_CDFS table, but this time in 1000-record blocks:
In [2]: TABLENAME = "SEVERITY_CDFS"
In [3]: SQL = "SELECT * FROM {}".format(TABLENAME)
In [4]: ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)
In [5]: type(ref)
Out[1]:
generator
Here's an approach that can be used to retrieve and compile the dataset into a
single DataFrame after exhausting the generator from the call to
read_sql_query
with chunksize=1000
:
TABLENAME = "SEVERITY_CDFS"
SQL = "SELECT * FROM {}".format(TABLENAME)
ref = pd.read_sql_query(SQL, con=conn, chunksize=1000)
cdfs = pd.concat([rr for rr in ref]).reset_index(drop=True)
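If the full result set doesn't need to be held in memory at once, each chunk can instead be processed as it arrives. A minimal sketch that tallies the number of records and tracks the maximum LOSS without materializing the complete table:
nbr_records, max_loss = 0, float("-inf")
for chunk in pd.read_sql_query(SQL, con=conn, chunksize=1000):
    # Each chunk is a DataFrame of at most 1000 records.
    nbr_records += chunk.shape[0]
    max_loss = max(max_loss, chunk["LOSS"].max())
print(f"records={nbr_records}, max LOSS={max_loss}")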
We next demonstrate a few routines that can be applied to DataFrames. The examples that follow refer to the cdfs dataset retrieved above.
To obtain the fieldnames of the queried dataset, use the columns attribute:
cdfs.columns
Index(['ID', 'CDF_ENUM', 'CDF', 'LOSS'], dtype='object')
To obtain the dimensions of the retrieved dataset, use the shape attribute. The result is a tuple consisting of (nrows, ncols):
>>> cdfs.shape
(16650, 4)
To determine the datatypes associated with each field in cdfs, call the
dtypes
attribute:
>>> cdfs.dtypes
ID object
CDF_ENUM int64
CDF float64
LOSS float64
dtype: object
We can sort cdfs
by ID and CDF_ENUM, then review the first few records using
head
:
In [5]: cdfs = cdfs.sort_values(["ID", "CDF_ENUM"]).reset_index(drop=True)
In [6]: cdfs.head()
Out[2]:
ID CDF_ENUM CDF LOSS
000001 1 0.006667 69077.55279
000001 2 0.013333 76009.02460
000001 3 0.02 80063.67568
000001 4 0.026667 82940.49640
000001 5 0.033333 85171.93191
Subsetting DataFrames
This section contains a few examples of subsetting DataFrames by index.
Indexing on the DataFrame directly:
In [7]: cdfs[:5]
Out[3]:
ID CDF_ENUM CDF LOSS
000001 1 0.006667 69077.55279
000001 2 0.013333 76009.02460
000001 3 0.02 80063.67568
000001 4 0.026667 82940.49640
000001 5 0.033333 85171.93191
Pandas DataFrames can also be indexed using .iloc, which facilitates integer-location based indexing for selection by position. Keep in mind that Python uses 0-based indexing. Passing : to .iloc serves as a stand-in for "return all": when passed as the first argument, it returns all rows regardless of any column (axis 1) subsetting; when passed as the second argument, it returns all columns regardless of any row (axis 0) subsetting:
In [8]: cdfs.iloc[:5, :]
Out[4]:
ID CDF_ENUM CDF LOSS
000001 1 0.006667 69077.55279
000001 2 0.013333 76009.02460
000001 3 0.02 80063.67568
000001 4 0.026667 82940.49640
000001 5 0.033333 85171.93191
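The second position of .iloc subsets columns in the same way. A minimal sketch against the same cdfs table:
# First 5 rows, first 2 columns (ID and CDF_ENUM).
cdfs.iloc[:5, :2]
# All rows of the last column only (LOSS).
cdfs.iloc[:, -1]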
Subsetting Columns
A subset of DataFrame columns can be obtained by passing a list of column names, i.e., by surrounding the comma-delimited names with double brackets. A subset of cdfs consisting of ID, CDF_ENUM and CDF, which we refer to as df, is obtained as follows:
In [9]: df = cdfs[["ID", "CDF_ENUM", "CDF"]]
In[10]: df.head(5)
Out[5]:
ID CDF_ENUM CDF
000001 1 0.006667
000001 2 0.013333
000001 3 0.02
000001 4 0.026667
000001 5 0.033333
Individual fields can be retrieved using either of two approaches. Keep in mind that Pandas DataFrames are composed of one or more homogeneously typed Series objects, each of which is similar to a named vector in R. We can use bracket notation, specifying the field of interest as a string, or dot notation, where the field is treated as an attribute of the parent DataFrame:
In[11]: id1 = cdfs["ID"]
In[12]: id2 = cdfs.ID
In[13]: type(id1)
Out[6]:
pandas.core.series.Series
In[14]: id1.equals(id2)
Out[7]:
True
To determine the unique elements comprising id1 (or id2), call the unique
method:
In[14]: id1.unique()
Out[8]:
array(['000001', '000002', '000003'], dtype=object)
If a count of the number of unique elements is needed, call nunique:
In[15]: id1.nunique()
Out[9]:
3
Note that if desired, ID can be transformed into a single-column DataFrame by specifying double instead of single brackets:
In[16]: df = cdfs[["ID"]]
In[17]: type(df)
Out[10]:
pandas.core.frame.DataFrame
A Pandas DataFrame or Series object can be converted into a numpy ndarray by accessing the object's values attribute:
In[18]: mgarr = id1.values
In[19]: type(mgarr)
Out[11]:
numpy.ndarray
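In more recent Pandas releases, the to_numpy method performs the same conversion and is the approach recommended by the Pandas documentation:
mgarr = id1.to_numpy()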
Leveraging the query Method
Pandas DataFrames expose a query method, which takes a SQL-like expression that is then used to subset the target DataFrame. To demonstrate, we retrieve records from cdfs where ID=='000001' and CDF<0.1:
In [2]: cdfs.query("ID=='000001' and CDF<.1")
Out[1]:
ID CDF_ENUM CDF LOSS
000001 1 0.006667 69077.55279
000001 2 0.013333 76009.0246
000001 3 0.02 80063.67568
000001 4 0.026667 82940.4964
000001 5 0.033333 85171.93191
000001 6 0.04 86995.14748
000001 7 0.046667 88536.65428
000001 8 0.053333 89871.96821
000001 9 0.06 91049.79856
000001 10 0.066667 92103.40372
000001 11 0.073333 93056.50552
000001 12 0.08 93926.61929
000001 13 0.086667 94727.04636
000001 14 0.093333 95468.12609
Note that in expressions parsed by query, == is used to obtain records equal to the RHS of the expression, whereas SQL dialects use a single = for the same purpose.
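As a point of comparison, the same subset could be obtained by pushing the filter down to the database rather than filtering locally. A sketch of the equivalent retrieval in SQL (note the single =):
SQL = "SELECT * FROM SEVERITY_CDFS WHERE ID = '000001' AND CDF < 0.1"
cdfs_sub = pd.read_sql_query(SQL, con=conn)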
The values used for subsetting may not always be known in advance; the subset of interest might only be determined at runtime. In those instances, we can reference a variable within the expression by prefixing its name with @, which prevents it from being interpreted as a string literal. Next we return the same subset as before, but using variables in place of string literals:
In [2]: GROUP = "000001"
In [3]: CDF_THRESH = 0.1
In [4]: cdfs.query("ID==@GROUP and CDF<@CDF_THRESH")
Out[1]:
ID CDF_ENUM CDF LOSS
000001 1 0.006667 69077.55279
000001 2 0.013333 76009.0246
000001 3 0.02 80063.67568
000001 4 0.026667 82940.4964
000001 5 0.033333 85171.93191
000001 6 0.04 86995.14748
000001 7 0.046667 88536.65428
000001 8 0.053333 89871.96821
000001 9 0.06 91049.79856
000001 10 0.066667 92103.40372
000001 11 0.073333 93056.50552
000001 12 0.08 93926.61929
000001 13 0.086667 94727.04636
000001 14 0.093333 95468.12609
Appending a Timestamp to a DataFrame
We can use the standard library datetime module to append a timestamp to cdfs:
In [2]: import datetime
In [3]: tstmp = datetime.datetime.now().strftime("%c")
In [4]: cdfs["TIMESTAMP"] = tstmp
In [5]: cdfs.head()
Out[1]:
ID CDF_ENUM CDF LOSS TIMESTAMP
000001 1 0.006667 69077.55279 Fri Dec 21 09:52:11 2018
000001 2 0.013333 76009.02460 Fri Dec 21 09:52:11 2018
000001 3 0.02 80063.67568 Fri Dec 21 09:52:11 2018
000001 4 0.026667 82940.49640 Fri Dec 21 09:52:11 2018
000001 5 0.033333 85171.93191 Fri Dec 21 09:52:11 2018
Uploading DataFrames to Oracle
Finally, we demonstrate how to export the updated cdfs table back to Oracle. To accomplish this, we use the to_sql method. It takes a table name as a string, a connection object and an if_exists parameter that specifies how to proceed if the table already exists in the target schema. Options are "fail", "replace" or "append" ("fail" is the default).
To replace the existing SEVERITY_CDFS table with our updated version containing TIMESTAMP, we call to_sql (assuming our connection still exists):
In [2]: cdfs.to_sql("SEVERITY_CDFS", con=conn, if_exists="replace")
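Note that by default, to_sql also writes the DataFrame's index as an additional column. Since the index of cdfs is just the default integer range and carries no information, it can be excluded by passing index=False:
In [3]: cdfs.to_sql("SEVERITY_CDFS", con=conn, if_exists="replace", index=False)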