Once Quarto has been installed, create a new project using the Quarto CLI. For the purposes of demonstration, we will refer to this as *dat303-blog*, but you can name it anything (just don’t include whitespace). A folder with that name will be created in the Git client’s current working directory. I like to keep all my repositories in a *Repos* folder, so I’ll first navigate to *Repos* using the `cd` command (note that in the examples that follow, lines starting with `#` are comments and should not be run. Lines starting with `$` represent the command line prompt and should be run from the Git client):

```
# Switch into desired directory to create dat303-blog folder.
$ cd T:/Repos
```

Next, we run the quarto `create-project` command, which will create a folder named *dat303-blog* located at *T:/Repos/dat303-blog*:

`$ quarto create-project dat303-blog --type website:blog`

This folder will contain a *posts* folder which will eventually contain our blog content, and a number of additional files:

`_quarto.yaml`

: Contains the title of our blog, links to our GitHub/social media accounts and styling options. By default, `_quarto.yaml` looks like:

```
# contents of _quarto.yaml.
project:
  type: website

website:
  title: "dat303-blog"
  navbar:
    right:
      - about.qmd
      - icon: github
        href: https://github.com/
      - icon: twitter
        href: https://twitter.com

format:
  html:
    theme: cosmo
    css: styles.css
```

The theme is initially set to “cosmo”. It can be changed to any valid bootswatch theme. The full list of available themes can be found here.
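For example, switching to the “darkly” bootswatch theme would require updating only the `theme` key under `format: html:` in `_quarto.yaml` (a sketch of the relevant fragment; the rest of the file is unchanged):

```
format:
  html:
    theme: darkly
    css: styles.css
```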

`about.qmd`

: File to provide information about yourself.

`profile.jpg`

: Replace this with a personal photo with the same name (`profile.jpg`), or update the name of the photo in `about.qmd` for the image key-value pair.

A `.gitignore` file is used in Git to specify which files or directories should be ignored by Git when you make a commit. This means that any files or directories listed in the `.gitignore` file won’t be tracked by Git, which is useful for excluding files that are not necessary for version control. From VSCode, create a file named *.gitignore* and save it to the *dat303-blog* folder. Add the following lines to the *.gitignore*:

```
/.quarto/
/_site/
.*
```

Save your changes.

Ensure that the current working directory of the Git client is *dat303-blog*, then run `git init`:

```
$ cd T:/Repos/dat303-blog
$ git init
```

In the *posts* directory, we will create a new subdirectory for each post. We use all lowercase with words separated by dashes to make it easy to navigate between pages.

For example, I might create a *solving-normal-equations* directory under *posts*. Within the directory, I would create a new jupyter notebook named *solving-normal-equations.ipynb*.

In the very first cell of *solving-normal-equations.ipynb*, change the cell type to raw (click on the lower right of the cell and change “Python” to “raw”), and add the following header detail:

```
---
title: Solving the Normal Equations
date: 2024-09-02
description: An investigation into solving the normal equations with Python
categories: [Python]
---
```

Be sure to include the three dashes at the top and bottom of the cell as in the example above.

Populate the remaining cells of your notebook with your inline commentary, code and plots, etc. Be sure to save your changes.

Saving our changes with Git is a two-step process: We first stage any changes via `git add`, then commit them using `git commit`. Whenever running `git commit`, you are required to include a commit message, which comes after the `-m` flag. Assuming the notebook has been saved locally, run:

```
$ git add --all
$ git commit -m "Added solving-normal-equations article."
```

From GitHub, click on the `+` and select *New Repository*. In the Repository name field, enter *[username].github.io*. In this example, it would be *jtrive.github.io*:

Add a description and be sure to keep the repository Public. Do not add a README or a *.gitignore* (we already created this). Click on *Create Repository*.

In the next window, be sure to click on SSH at the top. You’ll see something similar to:

Since we already created our repository locally and committed our first change, we are going to focus on the second box, *…or push an existing repository from the command line*. Copy the first line starting with `git remote add origin ...` and paste it into the Git client and hit enter. In my case, it looked like:

`$ git remote add origin git@github.com:jtrive/jtrive.github.io.git`

Don’t worry about the other two commands: We have to do things a little differently since we’re using Quarto.

Verify that no changes are pending in your blog directory by running `git status`:

```
$ git status
On branch master
nothing to commit, working tree clean
```

We need to create a separate `gh-pages` branch to host our blog. Note that this is a one-time action. From the *dat303-blog* directory, run the following commands (**make sure all changes are committed before running this!**):

```
$ git checkout --orphan gh-pages
$ git reset --hard
$ git commit --allow-empty -m "Initializing gh-pages branch."
$ git push origin gh-pages
```

From the Git client, checkout the master branch, then run `quarto publish gh-pages`:

```
$ git checkout master
$ quarto publish gh-pages
```

Type `Y` when prompted:

```
$ quarto publish gh-pages
? Update site at git@github.com-jtrive:jtrive/jtrive.github.io.git? (Y/n) » Y
```

Upon completion, navigate to `jtrive.github.io`. You’ll see something like:

You can remove the *Post With Code* and *Welcome to My Blog* subdirectories under *posts* to drop those entries. Navigating to *Solving the Normal Equations*, we see:

Looks pretty good!

Many of the initial configuration steps are one-time actions. Once you’ve set up your blog as described, the typical workflow will be the following:

- Create a new folder in the *posts* directory, using lowercase letters/numbers with words separated by dashes.

- Create a Jupyter notebook in this directory with the same name and .ipynb extension.

- Change the first cell of the notebook to raw, and add title information. Be sure the first and last lines are three dashes, `---`:

  ```
  ---
  title: Solving the Normal Equations
  date: 2024-02-09
  description: An investigation into solving the normal equations with Python
  categories: [Python]
  ---
  ```

- Create your blog post (narrative text, code, plots, equations, etc.). Save your changes.

- From the Git client, navigate to the blog directory, then add and commit your changes:

  ```
  $ cd /path/to/blog
  $ git add --all
  $ git commit -m "Added second blog post."
  ```

- Ensure the master branch is checked out (it should already be), then run the following two commands:

  ```
  $ git checkout master
  $ quarto publish gh-pages
  ```

- View your published content at *[username].github.io*. If my username is jtrive, my content will be available at *jtrive.github.io*.

As the availability and complexity of geospatial data continue to grow with advancements in technology and data collection methods, the demand for skilled geospatial data scientists is expected to rise. Therefore, investing in learning geospatial data science equips individuals with valuable skills that are not only relevant today but also increasingly essential for future career success.

Folium is a Python library used for visualizing geospatial data interactively on web maps. Leveraging the capabilities of Leaflet.js, Folium allows users to create maps directly within Python code, making it an accessible and powerful tool for geospatial visualization and analysis.

With Folium, users can create various types of interactive maps, including point maps, choropleth maps, heatmaps, and vector overlays, by simply specifying geographic coordinates and map styling options. The library provides intuitive APIs for customizing map features such as markers, popups, tooltips, legends, and map layers, enabling users to create visually appealing and informative maps with ease.

Folium integrates with other popular Python libraries such as Pandas and Matplotlib, allowing users to visualize geospatial data stored in DataFrame objects or plot data directly onto Folium maps. It also supports various tile providers and basemaps, enabling users to choose from a wide range of map styles and sources.

Creating maps with folium is straightforward. We simply pass the latitude and longitude of the point of interest (POI) and specify a zoom level. We can then drop a marker on the point of interest, and interact with the map however we’d like.

We can get the latitude and longitude for a given POI by performing a Google search. Latitude ranges from -90 to 90 degrees, longitude from -180 to 180 degrees. The latitude and longitude for the DMACC Ankeny campus are **(41.5996, -93.6276)**, which is **(latitude, longitude)**. Note that for coordinates in the contiguous US, the longitude will always be negative. An illustration is provided below:
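As a quick sanity check, these ranges can be verified in plain Python (a minimal sketch; the `is_valid_coordinate` helper is hypothetical, not part of folium):

```python
def is_valid_coordinate(lat, lon):
    """Return True if lat/lon fall within valid geographic ranges."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

# DMACC Ankeny campus: valid, with negative (western) longitude.
print(is_valid_coordinate(41.5996, -93.6276))  # True

# A latitude of 100 degrees is out of range.
print(is_valid_coordinate(100, -93.6276))  # False
```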

To illustrate, let’s render a map over the park I used to play at as a child (Durkin Park on the southwest side of Chicago). Note that zoom level provides more detail as the number gets larger. A zoom level of 4 would show the entire US; a zoom level of 17 would render roughly a city block:

```
import folium
# Latitude and longitude for Durkin Park, 84th & Kolin Ave, Chicago IL.
lat = 41.739
lon = -87.729
zoom = 18
m = folium.Map(location=[lat, lon], zoom_start=zoom)
folium.Marker(location=[lat, lon]).add_to(m)
m
```


A few things to note about the code used to render the map:

- We start by importing the folium library.
- The lat/lon for Durkin Park was obtained by a simple Google search.
- I used a level 18 zoom, but this is not necessary since the map is dynamic and can be resized.
- To add the marker to the map, we call `.add_to(m)`.
- We included `m` by itself in the last line of the cell in order for the map to render. Without doing this, the map would not display.

We can change the color of the marker by passing an additional argument into `folium.Marker`. I’ll place a second marker in another park I used to visit when I was younger, Scottsdale Park. I’ll make this second marker red.

```
# Durkin Park coordinates.
lat0 = 41.739
lon0 = -87.729

# Scottsdale Park coordinates.
lat1 = 41.7416
lon1 = -87.7356

# Center map at midway point between parks.
mid_lat = (lat0 + lat1) / 2
mid_lon = (lon0 + lon1) / 2

# Specify zoom level.
zoom = 16

# Initialize map.
m = folium.Map(location=[mid_lat, mid_lon], zoom_start=zoom)

# Add Durkin Park marker.
folium.Marker(
    location=[lat0, lon0],
    popup="Durkin Park",
).add_to(m)

# Add Scottsdale Park marker.
folium.Marker(
    location=[lat1, lon1],
    popup="Scottsdale Park",
    icon=folium.Icon(color="red")
).add_to(m)

m
```


Notice that the `popup` argument was supplied to `folium.Marker`. Now when we click on the markers, whatever text we supply to `popup` will be shown on the map.

We can connect the markers in the map by using `folium.PolyLine`. We pass it a list of lat/lon pairs, and it draws a line connecting the points. Let’s connect the two parks with a green line:

```
# Durkin Park coordinates.
lat0 = 41.739
lon0 = -87.729

# Scottsdale Park coordinates.
lat1 = 41.7416
lon1 = -87.7356

# Center map at midway point between parks.
mid_lat = (lat0 + lat1) / 2
mid_lon = (lon0 + lon1) / 2

# Specify zoom level.
zoom = 16

# Initialize map.
m = folium.Map(location=[mid_lat, mid_lon], zoom_start=zoom)

# Add Durkin Park marker.
folium.Marker(
    location=[lat0, lon0],
    popup="Durkin Park",
).add_to(m)

# Add Scottsdale Park marker.
folium.Marker(
    location=[lat1, lon1],
    popup="Scottsdale Park",
    icon=folium.Icon(color="red")
).add_to(m)

# Connect parks with green line.
points = [(lat0, lon0), (lat1, lon1)]
folium.PolyLine(points, color="green").add_to(m)

m
```


One final point: We can replace the standard markers with circle markers by using `folium.CircleMarker`. `radius` controls the size of the markers and `color`/`fill_color` set the color of the marker:

```
m = folium.Map(location=[mid_lat, mid_lon], zoom_start=zoom)

# Add Durkin Park circle marker.
folium.CircleMarker(
    location=[lat0, lon0],
    radius=7,
    popup="Durkin Park",
    color="red",
    fill_color="red",
    fill=True,
    fill_opacity=1
).add_to(m)

# Add Scottsdale Park marker.
folium.CircleMarker(
    location=[lat1, lon1],
    radius=7,
    popup="Scottsdale Park",
    color="red",
    fill_color="red",
    fill=True,
    fill_opacity=1
).add_to(m)

# Connect parks with green line.
points = [(lat0, lon0), (lat1, lon1)]
folium.PolyLine(points, color="green").add_to(m)

m
```


The International Space Station (ISS) is a collaborative effort among multiple nations, serving as a hub for scientific research and international cooperation in space exploration. The ISS orbits the Earth at an astonishing speed of approximately 17,500 miles per hour, completing an orbit around the planet approximately every 90 minutes.

The `coords` list in the next cell represents the position as latitude-longitude pairs of the ISS sampled every minute for 20 minutes. We can render each of the 20 points as red circle markers connected by a red dashed line. Note that it is not necessary to call `folium.CircleMarker` 20 times: Use a for loop to iterate over the `coords` list.

```
coords = [
    (50.4183, -35.337),
    (49.3934, -29.7562),
    (48.0881, -24.4462),
    (46.5282, -19.4374),
    (44.7411, -14.743),
    (42.7364, -10.3267),
    (40.5727, -6.2481),
    (38.2576, -2.4505),
    (35.8123, 1.0896),
    (33.2554, 4.3975),
    (30.6031, 7.4986),
    (27.8697, 10.4178),
    (25.0674, 13.1786),
    (22.197, 15.8122),
    (19.2887, 18.3195),
    (16.3407, 20.7295),
    (13.3611, 23.059),
    (10.3562, 25.325),
    (7.3323, 27.5427),
    (4.2953, 29.7267)
]

lats, lons = zip(*coords)
mid_lat = sum(lats) / len(lats)
mid_lon = sum(lons) / len(lons)

m = folium.Map(location=[mid_lat, mid_lon], zoom_start=4)

for lat, lon in coords:
    folium.CircleMarker(
        location=[lat, lon],
        radius=5,
        color="red",
        fill_color="red",
        fill=True,
        fill_opacity=1
    ).add_to(m)

# Connect coords with red dashed line.
folium.PolyLine(coords, color="red", dash_array="5").add_to(m)

m
```


- **Points**: Represent discrete locations such as cities, landmarks, or individual trees. Each point has a specific location defined by coordinates (e.g., latitude and longitude).
- **Lines** (or polylines): Represent linear features such as roads, rivers, or boundaries. Lines are composed of a series of connected points.
- **Polygons** (or multipolygons): Represent areas or shapes such as lakes, parks, or country borders. Polygons are defined by a series of points that create a closed shape.
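A minimal sketch (plain Python, no GIS library assumed) of how these three vector types might be represented as coordinate data:

```python
# A point: a single (longitude, latitude) pair.
point = (-87.729, 41.739)

# A line: an ordered sequence of connected points.
line = [(-87.729, 41.739), (-87.7356, 41.7416)]

# A polygon: a sequence of points forming a closed shape,
# i.e., the first and last vertices coincide.
polygon = [
    (-87.73, 41.73),
    (-87.72, 41.73),
    (-87.72, 41.74),
    (-87.73, 41.74),
    (-87.73, 41.73),  # closes the ring
]

def is_closed(ring):
    """A polygon ring is closed when its first and last vertices match."""
    return len(ring) >= 4 and ring[0] == ring[-1]

print(is_closed(polygon))  # True
print(is_closed(line))     # False
```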

Shapefiles are one of the most common formats for vector data. They store points, lines, and polygons along with attribute information. The US Census Bureau makes a number of shapefiles available here. In this notebook, we’ll walk through how to load shapefiles into GeoPandas, plot the boundaries, and create a choropleth map based on a second dataset (choropleth maps are those where the color of each shape is based on the value of an associated variable).

To start, download US state shapefile *cb_2018_us_state_500k.zip* from the United States Census Bureau boundary files page. Under the *State* subheader, you will see three files:

- *cb_2018_us_state_500k.zip*
- *cb_2018_us_state_5m.zip*
- *cb_2018_us_state_20m.zip*

The 500k files are the most detailed, but also the largest. The 20m files are the smallest, but at the cost of some dramatic simplification. The 5m files fall somewhere between the other two. We will work with the 500k files.

Once downloaded, the shapefile can be loaded into a GeoPandas DataFrame as follows:

```
import numpy as np
import pandas as pd
import geopandas as gpd
shp_path = "cb_2018_us_state_500k.zip"
dfshp = gpd.read_file(shp_path)
dfshp.head(5)
```

| | STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | 01779790 | 0400000US28 | 28 | MS | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
| 1 | 37 | 01027616 | 0400000US37 | 37 | NC | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
| 2 | 40 | 01102857 | 0400000US40 | 40 | OK | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
| 3 | 51 | 01779803 | 0400000US51 | 51 | VA | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
| 4 | 54 | 01779805 | 0400000US54 | 54 | WV | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |

The geometry column is a special column in a GeoDataFrame that stores the geometric shapes associated with each row (in this case, the shapes in latitude-longitude pairs that define the boundary of each state). This column contains the vector data that defines the spatial features in the dataset. Some states have boundaries defined by a MULTIPOLYGON, such as Hawaii, whose boundary consists of multiple closed POLYGONS. If it isn’t already present, the geometry column needs to be defined.

We can plot the data present in the shapefile by calling the GeoDataFrame’s `plot` method:

```
dfshp.plot()
```

Let’s zoom in and focus on a map of the lower 48 states only:

```
exclude = ["American Samoa", "Alaska", "Hawaii", "Guam", "United States Virgin Islands",
"Commonwealth of the Northern Mariana Islands", "Puerto Rico"]
dfshp48 = dfshp[~dfshp.NAME.isin(exclude)].reset_index(drop=True)
dfshp48.plot()
```

We can get a better view of the boundaries of each state by calling `boundary.plot`:

```
dfshp48.boundary.plot()
```

By default, the plots rendered via GeoPandas are smaller than we might like. We can increase the size of the rendered map, suppress ticklabels, change the boundary color and add a title as follows:

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. Boundaries - Lower 48 States")
dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50)
ax.axis("off")
plt.show()
```

To overlay the state name at the center of each state, use:

```
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. Boundaries - Lower 48 States")
dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50)
dfshp48.apply(lambda x: ax.annotate(x.NAME, xy=x.geometry.centroid.coords[0], ha='center', fontsize=6), axis=1)
ax.axis("off")
plt.show()
```

In the shapefile, ALAND and AWATER represent the land and water area of each state in square meters. To create a choropleth map based on the natural log of AWATER, include the `column` argument to the `plot` method:

```
# Compute natural log of AWATER to get better separation by state.
dfshp48["log_AWATER"] = np.log(dfshp48["AWATER"])
dfshp48.plot(column="log_AWATER", cmap="plasma")
```

We can reformat the map as before, while also adding a legend to give context to the difference in colors by state. Options for colormaps are available here:

```
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("Ln(AWATER) - Lower 48 States")
dfshp48.plot(
ax=ax, column="log_AWATER", edgecolor="gray", linewidth=.50,
cmap="gist_rainbow", alpha=.750, legend=True,
legend_kwds={"label": "Ln(AWATER)", "orientation": "vertical", "shrink": .35}
)
ax.axis("off")
plt.show()
```

For variety, let’s download the Congressional District shapefile and plot the boundaries. It is available at the same link as above, and is identified as *cb_2018_us_cd116_500k.zip*. Reading the file into GeoPandas and displaying the first 5 rows yields:

```
dfc = gpd.read_file("cb_2018_us_cd116_500k.zip")
print(f"dfc.shape: {dfc.shape}")
dfc.head(5)
```

`dfc.shape: (441, 9)`

| | STATEFP | CD116FP | AFFGEOID | GEOID | LSAD | CDSESSN | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 10 | 5001600US1710 | 1710 | C2 | 116 | 777404163 | 31605644 | POLYGON ((-88.19882 42.41557, -88.19860 42.415... |
| 1 | 47 | 06 | 5001600US4706 | 4706 | C2 | 116 | 16770155959 | 324676580 | POLYGON ((-87.15023 36.56770, -87.14962 36.568... |
| 2 | 48 | 06 | 5001600US4806 | 4806 | C2 | 116 | 5564805243 | 255530191 | POLYGON ((-97.38860 32.61731, -97.38856 32.618... |
| 3 | 48 | 07 | 5001600US4807 | 4807 | C2 | 116 | 419784487 | 3069802 | POLYGON ((-95.77383 29.87515, -95.76962 29.875... |
| 4 | 48 | 26 | 5001600US4826 | 4826 | C2 | 116 | 2349987793 | 191353567 | POLYGON ((-97.39826 32.99996, -97.39792 33.013... |

We again display the boundaries:

```
dfc.boundary.plot()
```

We’d like to focus on the lower 48 states again, but this time the shapefile doesn’t have a NAME column. How should we proceed?

One approach is to define a bounding box that encloses the lower 48 states, then filter the shapefile to retain only those congressional districts whose geometry intersects the bounding box. GeoPandas provides coordinate-based indexing with the `cx` indexer, which slices using a bounding box. Geometries in the GeoSeries or GeoDataFrame that intersect the bounding box will be returned.

For the lower 48 states bounding box, we’ll use **(-125, 24.6), (-65, 50)**, southwest to northeast. We also include a circle marker at the center of each congressional district:

```
xmin, ymin, xmax, ymax = -125, 24.6, -65, 50
dfc48 = dfc.cx[xmin:xmax, ymin:ymax]
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("US Congressional Districts, 116th Congress - Lower 48 States", fontsize=11)
dfc48.boundary.plot(ax=ax, edgecolor="black", linewidth=.50)
dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")
ax.axis("off")
plt.show()
```

```
C:\Users\jtriv\AppData\Local\Temp\ipykernel_8996\3296541533.py:9: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")
```
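The idea behind the `cx` indexer can be sketched in plain Python. Here we filter hypothetical district centroids (made-up coordinates, not the actual district geometries) with a point-in-box test; note that `cx` itself is more permissive, since it keeps any geometry that *intersects* the box:

```python
# Lower-48 bounding box, (xmin, ymin, xmax, ymax) in lon/lat.
xmin, ymin, xmax, ymax = -125, 24.6, -65, 50

# Hypothetical district centroids as (lon, lat) pairs.
centroids = [
    (-88.0, 42.0),    # inside the box (midwest)
    (-150.0, 61.0),   # outside (Alaska)
    (-66.1, 18.4),    # outside (Puerto Rico)
]

# Retain only those points falling within the bounding box.
inside = [
    (lon, lat) for lon, lat in centroids
    if xmin <= lon <= xmax and ymin <= lat <= ymax
]
print(inside)  # [(-88.0, 42.0)]
```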

Working with GeoJSON is much the same as working with shapefiles, one difference being that with GeoJSON, vector data is contained within a single file as opposed to an archive of multiple file types. See here for an example.

But once read into GeoPandas, we work with it the same way. We can load US state boundary files as GeoJSON from GitHub via:

```
dfstate = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json")
dfstate.head()
```

| | id | name | density | geometry |
|---|---|---|---|---|
| 0 | 01 | Alabama | 94.650 | POLYGON ((-87.35930 35.00118, -85.60667 34.984... |
| 1 | 02 | Alaska | 1.264 | MULTIPOLYGON (((-131.60202 55.11798, -131.5691... |
| 2 | 04 | Arizona | 57.050 | POLYGON ((-109.04250 37.00026, -109.04798 31.3... |
| 3 | 05 | Arkansas | 56.430 | POLYGON ((-94.47384 36.50186, -90.15254 36.496... |
| 4 | 06 | California | 241.700 | POLYGON ((-123.23326 42.00619, -122.37885 42.0... |

```
dfstate.plot()
```

```
"""
Creating a database connection with sqlalchemy.
"""
import pandas as pd
import sqlalchemy
DRIVER = "SQL Server"
SERVER = "SERVER"
DATABASE = "DATABASE"
# Create connection uri.
conn_uri = f"mssql+pyodbc://{SERVER}/{DATABASE}?driver={DRIVER}".replace(" ", "+")
# Initialize connection.
conn = sqlalchemy.create_engine(conn_uri)
```

A few points to highlight:

- `conn_uri` is a string that contains information needed to connect to our database. The prefix `mssql+pyodbc://` indicates that we’re targeting a SQL Server database via the pyodbc connector. Also, if we weren’t using Windows authentication, or were working with a different RDBMS, it would be necessary to change `conn_uri`. For example, an Oracle connection uri would be specified as `oracle://[USERNAME]:[PASSWORD]@[DATABASE]`.
- Also in `conn_uri`, within the format substitution, whitespace in `DRIVER` is replaced with `+`. This is consistent with how whitespace is encoded for web addresses.

Next, to query the French Motor Third-Party Liability Claims sample dataset in the table *SAMPLE_TABLE*, use the `read_sql` function. I’ve included the connection initialization logic for convenience:

```
"""
Reading database data into Pandas DataFrame.
"""
import pandas as pd
import sqlalchemy
DRIVER = "SQL Server"
SERVER = "SERVER"
DATABASE = "DATABASE"
# Create connection uri.
conn_uri = f"mssql+pyodbc://{SERVER}/{DATABASE}?driver={DRIVER}".replace(" ", "+")
# Initialize connection.
conn = sqlalchemy.create_engine(conn_uri)
# Create query.
SQL = "SELECT * FROM SAMPLE_TABLE"
df = pd.read_sql(SQL, con=conn)
```

Instead of passing a query to `pd.read_sql`, the tablename could have been provided. `pd.read_sql` is a convenience wrapper around `read_sql_table` and `read_sql_query` which will delegate to the specific function depending on the input (dispatches to `read_sql_table` if the input is a tablename, `read_sql_query` if the input is a query). Refer to the documentation for more information.

Let’s assume SAMPLE_TABLE represents the French Motor Third-Party Liability Claims dataset available here. Inspecting the first 10 records of the dataset yields:

```
IDPOL CLAIMNB EXPOSURE AREA VEHPOWER VEHAGE DRIVAGE BONUSMALUS VEHBRAND VEHGAS DENSITY REGION
0 1290 1 0.66000 'B' 7 0 28 60 'B12' 'Regular' 52 'R72'
1 1292 1 0.12000 'B' 7 0 28 60 'B12' 'Regular' 52 'R72'
2 1295 1 0.08000 'E' 5 0 36 50 'B12' 'Regular' 3274 'R11'
3 1296 1 0.50000 'E' 5 0 36 50 'B12' 'Regular' 3274 'R11'
4 1297 1 0.20000 'E' 5 0 36 50 'B12' 'Regular' 3274 'R11'
5 1299 1 0.74000 'D' 6 0 76 50 'B12' 'Regular' 543 'R91'
6 1301 1 0.05000 'D' 6 0 76 50 'B12' 'Regular' 543 'R91'
7 1303 1 0.03000 'B' 11 0 39 50 'B12' 'Diesel' 55 'R52'
8 1304 1 0.76000 'B' 11 0 39 50 'B12' 'Diesel' 55 'R52'
9 1306 1 0.49000 'E' 10 0 38 50 'B12' 'Regular' 2715 'R93'
```

When working with large datasets, it may be inefficient to retrieve the entire dataset in a single pass. Pandas provides functionality to retrieve data in `chunksize`-record blocks, which can result in significant speedups. In the following example, the same French Motor Third-Party Liability Claims sample dataset is retrieved in 20,000-record blocks. The only change in the call to `read_sql` is the inclusion of `chunksize`, which specifies the maximum number of records to retrieve for a given iteration. We assume `conn` has already been initialized:

```
"""
Using `read_sql`'s *chunksize* parameter for iterative retrieval.
"""
CHUNKSIZE = 20000
SQL = "SELECT * FROM SAMPLE_TABLE"
dfiter = pd.read_sql(SQL, con=conn, chunksize=CHUNKSIZE)
df = pd.concat([dd for dd in dfiter])
```

- `CHUNKSIZE` specifies the maximum number of records to retrieve at each iteration.
- `dfiter` is a reference to the data targeted in our query. `dfiter` is not a DataFrame; rather, it is a generator, a Python object which makes it easy to create iterators. Generators yield values lazily, so they are particularly memory efficient.
- `df = pd.concat([dd for dd in dfiter])` can be decomposed into two parts: First, `[dd for dd in dfiter]` is a *list comprehension*, a very powerful tool that works similar to a flattened for loop. If we bound `[dd for dd in dfiter]` to a variable directly, the result would be a list of 34 DataFrames, each having no more than 20,000 records. Second, `pd.concat` takes the list of DataFrames and performs a row-wise concatenation of each DataFrame, resulting in a single DataFrame with 678,013 records. `pd.concat` is akin to the SQL `UNION` operator. The final result, `df`, is a DataFrame having 678,013 rows and 12 columns.
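The same pattern can be demonstrated without a database. Here is a self-contained sketch that swaps the `read_sql` iterator for an in-memory generator of DataFrames (hypothetical data):

```python
import pandas as pd

def chunk_generator(n_rows, chunksize):
    """Yield DataFrames of at most `chunksize` rows, mimicking
    the iterator returned by read_sql(..., chunksize=...)."""
    for start in range(0, n_rows, chunksize):
        stop = min(start + chunksize, n_rows)
        yield pd.DataFrame({"id": range(start, stop)})

dfiter = chunk_generator(n_rows=45, chunksize=20)

# Row-wise concatenation of the chunks into a single DataFrame.
df = pd.concat([dd for dd in dfiter], ignore_index=True)
print(df.shape)  # (45, 1)
```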

Instead of reading the data into memory, it may be necessary to retrieve the dataset, then write the results to file for later analysis. This can be accomplished in an iterative fashion so that no more than `CHUNKSIZE` records are in-memory at any point in time. Results will be saved to .csv in a file named `"FREMTPL.csv"` in 100,000-record blocks:

```
"""
Writing queried results to file.
"""
import time

CHUNKSIZE = 100000
CSV_PATH = "FREMTPL.csv"
SQL = "SELECT * FROM SAMPLE_TABLE"
dfiter = pd.read_sql(SQL, conn, chunksize=CHUNKSIZE)

t_i = time.time()
trkr, nbrrecs = 0, 0

with open(CSV_PATH, "w", encoding="utf-8", newline="") as fcsv:
    for df in dfiter:
        fcsv.write(df.to_csv(header=nbrrecs == 0, index=False))
        nbrrecs += df.shape[0]
        print("Retrieved records {}-{}".format((trkr * CHUNKSIZE) + 1, nbrrecs))
        trkr += 1

t_tot = time.time() - t_i
retrieval_rate = nbrrecs / t_tot
print(
    f"Retrieved {nbrrecs} records in {t_tot:.0f} seconds ({retrieval_rate:.0f} recs/sec.)."
)
```

Executing the code above produces the following output:

```
Retrieved records 1-100000
Retrieved records 100001-200000
Retrieved records 200001-300000
Retrieved records 300001-400000
Retrieved records 400001-500000
Retrieved records 500001-600000
Retrieved records 600001-678013
Retrieved 678013 records in 20 seconds (33370 recs/sec.).
```

In order to export a DataFrame into a database, we leverage the DataFrame’s `to_sql` method. We provide the name of the table we wish to upload data into, along with a connection object, and what action to take if the table already exists. `if_exists` can be one of:

- “fail”: Raise a `ValueError`.
- “replace”: Drop the table before inserting new values.
- “append”: Insert new values to the existing table.

As a simple transformation, we determine aggregate EXPOSURE by AREA, append a timestamp, then export the result as “SAMPLE_AREA_SUMM”. If the table exists, we want the query to fail:

```
"""
Summary of aggregate EXPOSURE by AREA based on the French Motor Third-Party
Liability Claims sample dataset.
"""
import datetime
# Compute aggregate EXPOSURE by AREA.
dfsumm = df.groupby("AREA", as_index=False)["EXPOSURE"].sum()
# Append timestamp.
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
dfsumm["TIMESTAMP"] = timestamp
# Export results.
dfsumm.to_sql("SAMPLE_AREA_SUMM", con=conn, if_exists="fail")
```

If the table already exists, an error like the following will be generated:

`ValueError: Table 'SAMPLE_AREA_SUMM' already exists.`

Otherwise, no output will be generated.
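The `if_exists` behavior can be demonstrated end-to-end with an in-memory SQLite database (a self-contained sketch with hypothetical data; the production code above targets SQL Server instead):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
dfsumm = pd.DataFrame({"AREA": ["A", "B"], "EXPOSURE": [1.5, 2.5]})

# First export succeeds: the table does not yet exist.
dfsumm.to_sql("SAMPLE_AREA_SUMM", con=conn, if_exists="fail", index=False)

# A second export with if_exists="fail" raises ValueError.
raised = False
try:
    dfsumm.to_sql("SAMPLE_AREA_SUMM", con=conn, if_exists="fail", index=False)
except ValueError:
    raised = True
print(raised)  # True

# "append" inserts the rows into the existing table instead.
dfsumm.to_sql("SAMPLE_AREA_SUMM", con=conn, if_exists="append", index=False)
n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM SAMPLE_AREA_SUMM", conn)["n"].iloc[0]
print(n_rows)  # 4
```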

Next we demonstrate how data can be queried iteratively and written directly to a compressed file format. This is especially useful when working with very large datasets, or when the data exceeds available system resources. Another reason to save datasets in compressed format is that Pandas can read compressed files just as easily as CSVs. Once read into memory, the dataset will expand to the full uncompressed size, but by writing data to compressed format we reduce our overall storage footprint. Here’s the code to do it:

```
import gzip
import time
import pandas as pd
import sqlalchemy

DRIVER = "SQL Server"
SERVER = "SERVER"
DATABASE = "DATABASE"
CHUNKSIZE = 100000
DATA_PATH = "COMPRESSED-SAMPLE-TABLE.csv.gz"

# Create connection uri.
conn_uri = f"mssql+pyodbc://{SERVER}/{DATABASE}?driver={DRIVER}".replace(" ", "+")

# Initialize connection.
conn = sqlalchemy.create_engine(conn_uri)

SQL = "SELECT * FROM SAMPLE_TABLE"
dfiter = pd.read_sql(SQL, con=conn, chunksize=CHUNKSIZE)

t_i = time.time()
trkr, nbrrecs = 0, 0

with gzip.open(DATA_PATH, "wb") as fgz:
    for df in dfiter:
        fgz.write(df.to_csv(header=nbrrecs == 0, index=False).encode("utf-8"))
        nbrrecs += df.shape[0]
        print("Retrieved records {}-{}".format((trkr * CHUNKSIZE) + 1, nbrrecs))
        trkr += 1

t_tot = time.time() - t_i
retrieval_rate = nbrrecs / t_tot
print(
    "Retrieved {} records in {:.0f} seconds ({:.0f} recs/sec.).".format(
        nbrrecs, t_tot, retrieval_rate
    )
)
```

The only expression requiring explanation is within `df.to_csv`, where `header=nbrrecs==0` is specified. This ensures that headers are written for the first batch of records only, and skipped for subsequent batches (100,000-record chunks are read in at each iteration).
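The effect of the `header` flag can be seen in a toy example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# First chunk: nbrrecs == 0, so column headers are included.
first_chunk = df.to_csv(header=True, index=False)

# Subsequent chunks: headers suppressed, only data rows are written.
next_chunk = df.to_csv(header=False, index=False)

print(first_chunk)
print(next_chunk)
```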

To read the compressed file back into Pandas, use the `pd.read_csv` function, specifying the compression type (in this example we used “gzip”; other options are “zip”, “bz2” and “xz”):

```
In [1]: df = pd.read_csv(DATA_PATH, compression="gzip")
In [2]: df.shape
Out[2]: (678013, 12)
```

In this post, we demonstrate how to use `GridSearchCV` to identify optimal hyperparameters for a given model and metric, along with alternatives for selecting a classifier threshold in scikit-learn.

First we load the breast cancer dataset. We will forgo any pre-processing, but create separate train and validation sets:

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
np.set_printoptions(suppress=True, precision=8, linewidth=1000)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
data = load_breast_cancer()
X = data["data"]
y = data["target"]
# Create train and validation splits.
Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=.20, random_state=516)
print(f"Xtrain.shape: {Xtrain.shape}")
print(f"Xvalid.shape: {Xvalid.shape}")
```

```
Xtrain.shape: (455, 30)
Xvalid.shape: (114, 30)
```

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting (see the documentation here).

The `RandomForestClassifier` takes a number of hyperparameters. It can be difficult to determine which values to set manually, so instead we can perform a cross-validated grid search over a number of candidate values to determine which hyperparameter combination is best for our data and specified metric. `GridSearchCV` is part of scikit-learn, and is a method used to find the best possible configuration of hyperparameters for optimal performance. It works as follows:

1. **Define a parameter grid**: The grid is a dictionary that maps parameter names to the values that should be tested. These parameters are specific to the model you are working to optimize.
2. **Specify a model**: Choose a model that you want to optimize using `GridSearchCV`. This model is not trained yet; it’s just passed in with its default parameters.
3. **Cross-validation setup**: `GridSearchCV` uses cross-validation to evaluate each combination of parameter values provided in the grid. You need to specify the number of folds (splits) for the cross-validation process (this is the `cv` parameter). Common choices are 5 or 10 folds, depending on the size of your dataset and how thorough you want the search to be.
4. **Search execution**: With the parameter grid, model and cross-validation setup, `GridSearchCV` systematically works through multiple combinations of parameter sets, cross-validating as it goes to determine which configuration gives the best performance based on a score function. Performance is often measured using metrics like accuracy, precision or recall for classification problems, or mean squared error for regression problems.
5. **Results**: Finally, `GridSearchCV` provides the best parameters, allowing you to understand which values work best for your model. It can also provide the score for each parameter combination, allowing for deeper analysis of how different parameter values impact model performance.

The documentation for `GridSearchCV` is available here.

In the next cell, we assess the following `RandomForestClassifier` hyperparameters:

- `n_estimators`: [100, 150, 250]
- `min_samples_leaf`: [2, 3, 4]
- `ccp_alpha`: [0, .1, .2, .3]

For the metric, recall is used since the cost of a false negative is high (not detecting breast cancer). This means the hyperparameter combination with the maximum average recall over the k-folds will be selected as the best parameter set.

```
"""
Example using GridSearchCV to identify optimal hyperparameters w.r.t. recall.
Note that within GridSearchCV, cv represents the number of folds for
k-Fold cross validation.
"""
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Create parameter grid as dictionary.
param_grid = {
    "n_estimators": [100, 150, 250],
    "min_samples_leaf": [2, 3, 4],
    "ccp_alpha": [0, .1, .2, .3]
}

# Pass model and param_grid into GridSearchCV.
mdl = GridSearchCV(
    RandomForestClassifier(random_state=516),
    param_grid,
    scoring="recall",
    cv=5
)
# Fit model on training set. This can take a while depending on the number of
# hyperparameter combinations in param_grid.
mdl.fit(Xtrain, ytrain)
# Print optimal parameters.
print(f"best parameters: {mdl.best_params_}")
```

`best parameters: {'ccp_alpha': 0, 'min_samples_leaf': 4, 'n_estimators': 100}`

For random forests, boosting models and other tree-based ensemble methods, we can obtain a summary of the relative importance of each input feature. This is available in the `mdl.best_estimator_.feature_importances_` attribute. We can plot feature importances in decreasing order as follows:

```
imp = mdl.best_estimator_.feature_importances_
rf_imp = pd.Series(imp, index=data["feature_names"]).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(9, 5), tight_layout=True)
rf_imp.plot.bar(ax=ax)
ax.set_title("RandomForestClassifier feature importances")
ax.set_ylabel("mean decrease in impurity")
plt.show()
```

In terms of mean decrease in impurity, the top 7 features are assigned the highest importance, with the remaining features deemed not as relevant. For more information on how feature importance is calculated, see here.

The resulting `mdl` object can be used to make predictions on the validation set (`mdl` exposes the `RandomForestClassifier` with optimal hyperparameters set). We use `mdl.predict_proba` to get probabilities on [0, 1], with values closer to 1 representing positive predicted instances of breast cancer on the validation set:

```
ypred = mdl.predict_proba(Xvalid)[:,1]
ypred
```

```
array([0.005 , 0.82743637, 0.97088095, 0. , 0. , 1. , 0.98020202, 0.67380556, 0. , 0.99333333, 0.9975 , 0.30048576, 0.9528113 , 0.99666667, 0.04102381, 0.99444444, 1. , 0.828226 , 0. , 0. , 0.97916667, 1. , 0.99607143, 0.90425163, 0. , 0.02844156, 0.99333333, 0.98183333, 0.9975 , 0.08869769, 0.97369841, 0. , 1. , 0.71100866, 0.96022727, 0. , 0.71200885, 0.06103175, 0.005 , 0.99490476, 0.1644127 , 0. , 0.23646934, 1. , 0.57680164, 0.64901715, 0.9975 , 0.61790818, 0.95509668, 0.99383333, 0.04570455, 0.97575758, 1. , 0.47115815, 0.92422619, 0.77371415, 0. , 1. , 0.26198657, 0. , 0.28206638, 0.95216162, 0.98761905, 0.99464286, 0.98704762, 0.85579351, 0.10036905, 0.00222222, 0.98011905, 0.99857143, 0.92285967, 0.95180556, 0.97546947, 0.84433189, 0.005 , 0.99833333, 0.83616339, 1. , 0.9955 , 1. , 0.99833333, 1. ,
0.86399315, 0.9807381 , 0. , 0.99833333, 0.9975 , 0. , 0.98733333, 0.96822727, 0.23980827, 0.7914127 , 0. , 0.98133333, 1. , 1. , 0.89251019, 0.9498226 , 0.18943254, 0.83494391, 0.9975 , 1. , 0.77079113, 0.99722222, 0.30208297, 1. , 0.92111977, 0.99428571, 0.91936508, 0.47118074, 0.98467172, 0.006 , 0.05750305, 0.96954978])
```

Note that scikit-learn’s `predict_proba` outputs an n x 2 array, where the first column represents the probability of class 0 and the second column the probability of class 1 (has breast cancer). Each row sums to 1. Since we are interested in analyzing the positive class, we extract only the second column, which is why we call `mdl.predict_proba(Xvalid)[:,1]`.

In order to master machine learning, it is necessary to learn a variety of minor concepts that underpin these systems. One such concept is setting the optimal classification threshold.

By default, probabilistic classifiers in scikit-learn use a threshold of .50 to distinguish between positive and negative class instances. The predicted classes are obtained by calling `mdl.predict`. Here’s a side-by-side comparison of the model’s predicted probabilities and predicted classes:

```
# Predicted probabilities.
ypred = mdl.predict_proba(Xvalid)[:,1].reshape(-1, 1)
# Predicted classes.
yhat = mdl.predict(Xvalid).reshape(-1, 1)
# Combine probabilities and predicted class labels.
preds = np.concatenate([ypred, yhat], axis=1)
preds
```

```
array([[0.005 , 0. ],
[0.82743637, 1. ],
[0.97088095, 1. ],
[0. , 0. ],
[0. , 0. ],
[1. , 1. ],
[0.98020202, 1. ],
[0.67380556, 1. ],
[0. , 0. ],
[0.99333333, 1. ],
[0.9975 , 1. ],
[0.30048576, 0. ],
[0.9528113 , 1. ],
[0.99666667, 1. ],
[0.04102381, 0. ],
[0.99444444, 1. ],
[1. , 1. ],
[0.828226 , 1. ],
[0. , 0. ],
[0. , 0. ],
[0.97916667, 1. ],
[1. , 1. ],
[0.99607143, 1. ],
[0.90425163, 1. ],
[0. , 0. ],
[0.02844156, 0. ],
[0.99333333, 1. ],
[0.98183333, 1. ],
[0.9975 , 1. ],
[0.08869769, 0. ],
[0.97369841, 1. ],
[0. , 0. ],
[1. , 1. ],
[0.71100866, 1. ],
[0.96022727, 1. ],
[0. , 0. ],
[0.71200885, 1. ],
[0.06103175, 0. ],
[0.005 , 0. ],
[0.99490476, 1. ],
[0.1644127 , 0. ],
[0. , 0. ],
[0.23646934, 0. ],
[1. , 1. ],
[0.57680164, 1. ],
[0.64901715, 1. ],
[0.9975 , 1. ],
[0.61790818, 1. ],
[0.95509668, 1. ],
[0.99383333, 1. ],
[0.04570455, 0. ],
[0.97575758, 1. ],
[1. , 1. ],
[0.47115815, 0. ],
[0.92422619, 1. ],
[0.77371415, 1. ],
[0. , 0. ],
[1. , 1. ],
[0.26198657, 0. ],
[0. , 0. ],
[0.28206638, 0. ],
[0.95216162, 1. ],
[0.98761905, 1. ],
[0.99464286, 1. ],
[0.98704762, 1. ],
[0.85579351, 1. ],
[0.10036905, 0. ],
[0.00222222, 0. ],
[0.98011905, 1. ],
[0.99857143, 1. ],
[0.92285967, 1. ],
[0.95180556, 1. ],
[0.97546947, 1. ],
[0.84433189, 1. ],
[0.005 , 0. ],
[0.99833333, 1. ],
[0.83616339, 1. ],
[1. , 1. ],
[0.9955 , 1. ],
[1. , 1. ],
[0.99833333, 1. ],
[1. , 1. ],
[0.86399315, 1. ],
[0.9807381 , 1. ],
[0. , 0. ],
[0.99833333, 1. ],
[0.9975 , 1. ],
[0. , 0. ],
[0.98733333, 1. ],
[0.96822727, 1. ],
[0.23980827, 0. ],
[0.7914127 , 1. ],
[0. , 0. ],
[0.98133333, 1. ],
[1. , 1. ],
[1. , 1. ],
[0.89251019, 1. ],
[0.9498226 , 1. ],
[0.18943254, 0. ],
[0.83494391, 1. ],
[0.9975 , 1. ],
[1. , 1. ],
[0.77079113, 1. ],
[0.99722222, 1. ],
[0.30208297, 0. ],
[1. , 1. ],
[0.92111977, 1. ],
[0.99428571, 1. ],
[0.91936508, 1. ],
[0.47118074, 0. ],
[0.98467172, 1. ],
[0.006 , 0. ],
[0.05750305, 0. ],
[0.96954978, 1. ]])
```

Notice that when the probability is less than 0.50, the predicted class is 0. When the predicted probability is greater than 0.50, the predicted class is 1. For certain applications, the 0.50 threshold might make sense, for example when your target is balanced or close to balanced (when the number of 0s and 1s in the training set is approximately equal). But for unbalanced datasets, using the default threshold can give misleading results. In what follows, we walk through a few approaches that can be used to assess the optimal discrimination threshold for a classifier.

The first approach is the most straightforward: Just use the default scikit-learn threshold of .50. This makes sense when your classes are balanced, but will give misleading results when classes are imbalanced.

If we look at the number of positives (1s) vs. total samples in our training set, we have:

```
print(f"Proportion of positives in training set: {ytrain.sum() / ytrain.shape[0]:.2f}")
```

`Proportion of positives in training set: 0.62`

We see that 62% of the samples belong to class 1. This is usually not the case. In many classification scenarios, we’re dealing with 10%, 5% or even less than 1% of samples belonging to the positive class.

To illustrate the approach: since 62% of the observations belong to the positive class, we would use a threshold of **1 - .62 = .38**. The predicted class labels are then created using the following code:

```
# Creating predicted classes based on adjusted classifier threshold.
thresh = .38
yhat = np.where(ypred <= thresh, 0, 1)
```

Now any sample with a predicted probability less than or equal to .38 will be assigned to class 0, and samples with predicted probability greater than .38 are assigned to the positive class.

If we’re dealing with a highly imbalanced dataset with only 1% positive instances, we would use **1 - .01 = .99** as the threshold using this method.

The f1-score is the harmonic mean of precision and recall. We can compute precision and recall for a number of different thresholds, then select the threshold that maximizes the f1-score. This is a suitable approach if your classification task weighs precision and recall equally. Although this isn’t the case for our breast cancer classifier (we want to maximize recall since the cost of a false negative is high), the approach is demonstrated in the next cell:

```
from sklearn.metrics import precision_recall_curve
# Get precision and recall for various thresholds.
p, r, thresh = precision_recall_curve(yvalid, ypred)
# Compute f1-score for each threshold (drop the final precision/recall pair,
# which has no associated threshold).
f1 = 2 * (p[:-1] * r[:-1]) / (p[:-1] + r[:-1])
# Identify threshold that maximizes f1-score.
best_thresh = thresh[np.argmax(f1)]
print(f"Threshold using optimal f1-score: {best_thresh:,.3f}.")
```

`Threshold using optimal f1-score: 0.471.`

Using this method, we would set the discrimination threshold to **.471**, and would obtain the predicted class labels the same way as before:

```
thresh = .471
yhat = np.where(ypred <= thresh, 0, 1)
```

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. Typically we’re interested in using a threshold that maximizes TPR while minimizing FPR, which is the point (0, 1). The curve starts with a threshold of 1 at the far left and decreases towards 0 as the x-axis increases.
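One common way to pick the ROC-optimal threshold programmatically is Youden's J statistic, which selects the threshold maximizing TPR - FPR. A minimal sketch on a tiny synthetic example (not the breast cancer data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Tiny synthetic example: actual labels and predicted probabilities.
yvalid_demo = np.array([0, 0, 1, 1])
ypred_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(yvalid_demo, ypred_demo)

# Youden's J: the threshold at which TPR - FPR is largest.
best_thresh = thresholds[np.argmax(tpr - fpr)]
print(best_thresh)  # 0.8
```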

We can plot the ROC curve in scikit-learn using the code below. Note that `ypred` contains predicted probabilities and `yvalid` the actual class labels (1s and 0s).

```
from sklearn.metrics import RocCurveDisplay
roc_disp = RocCurveDisplay.from_predictions(
yvalid, ypred, name="RandomForestClassifier", color="#191964"
)
roc_disp.ax_.set_title("ROC curve", fontsize=9)
roc_disp.ax_.grid(True)
plt.show()
```

Using approach 4, the optimal threshold would be somewhere between .70-.80, which is much higher than what is indicated using the other methods so far. Ultimately it is up to you to determine which threshold makes the most sense, but intuitively, a threshold of .70-.80 seems too high when the prevalence of the positive class in the training data is 62%.

The precision-recall curve is a graphical representation used in binary classification to evaluate the performance of a classification model at different probability thresholds. This curve shows the trade-off between precision and recall for a number of different thresholds. The curve plots recall on the x-axis and precision on the y-axis.

The curve starts from the rightmost part of the graph. As the threshold for classifying positive instances decreases, recall increases, and precision can either increase or decrease, but typically it decreases because the model starts to classify more instances as positive, including both true positives and false positives.

The top-right corner of the graph (high precision, high recall) represents the ideal point, where the classifier perfectly identifies all positive cases with no false positives. Generally, we’d like to select a threshold that corresponds to a point closest to top-right corner of the graph.

We can plot the precision-recall curve in scikit-learn using the code below. Note that `ypred` contains predicted probabilities and `yvalid` the actual class labels (1s and 0s).

```
from sklearn.metrics import PrecisionRecallDisplay
pr_disp = PrecisionRecallDisplay.from_predictions(
yvalid, ypred, name="RandomForestClassifier", color="#CD0066"
)
pr_disp.ax_.set_title("Precision-Recall curve", fontsize=9)
pr_disp.ax_.grid(True)
plt.show()
```

Based on the plot, we would want to select the threshold that corresponds to a recall of about .95, since this is close to the point (1, 1). This can be determined using the following code:

```
from sklearn.metrics import precision_recall_curve
p, r, thresh = precision_recall_curve(yvalid, ypred)
best_thresh = thresh[np.where(r >= .95)[-1][-1]]
print(f"Selected threshold using precision-recall curve: {best_thresh:,.3f}.")
```

`Selected threshold using precision-recall curve: 0.674.`

It is also possible to plot precision and recall as two separate series against threshold on the x-axis. The goal is to identify a point where precision and recall intersect. Using this approach may be suitable in some scenarios.

```
from sklearn.metrics import precision_recall_curve
p, r, thresh = precision_recall_curve(yvalid, ypred)
p, r = p[:-1], r[:-1]
fig, ax = plt.subplots(1, 1, figsize=(6.5, 4), tight_layout=True)
ax.set_title("precision & recall vs. threshold", fontsize=10)
ax.plot(thresh, p, color="red", linewidth=1.25, label="precision")
ax.plot(thresh, r, color="blue", linewidth=1.25, label="recall")
ax.set_xlabel("threshold", fontsize=8)
ax.tick_params(axis="x", which="major", direction="in", labelsize=8)
ax.tick_params(axis="y", which="major", direction="in", labelsize=8)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)
ax.legend(loc="upper right", fancybox=True, framealpha=1, fontsize="medium")
plt.show()
```

The precision and recall series intersect right after .60, therefore method #5 would set the threshold to roughly .60.

Once a threshold has been selected, the predictive power of the classifier can be assessed. To do this, we will look at the confusion matrix as well as `sklearn.metrics.classification_report`. Both diagnostics require actual and predicted labels: once we’ve settled on a threshold, model assessment is performed by comparing actual vs. predicted labels. In what follows, the 0.471 threshold obtained from method #3 is used as the classification threshold.

Technically, once we’ve decided on a threshold, we should then assess the performance of the model on a separate test set. However, for the purposes of demonstration, we are going to re-use the validation set.

We start by creating the confusion matrix:

```
from sklearn.metrics import ConfusionMatrixDisplay
# Determine predicted classes using the .471 threshold.
thresh = .471
yhat = np.where(ypred <= thresh, 0, 1)
cm_disp = ConfusionMatrixDisplay.from_predictions(yvalid, yhat, colorbar=False)
cm_disp.ax_.set_title(f"confusion matrix (thresh={thresh:.3f})", fontsize=9)
plt.show()
```

The output indicates:

- There are 76 True Positives (TP).
- There are 34 True Negatives (TN).
- There are 4 False Positives (FP).
- There are 0 False Negatives (FN).

Next we inspect the classification report. This also takes actual and predicted labels, and returns a summary of common classifier metrics:

```
from sklearn.metrics import classification_report
print(classification_report(yvalid, yhat))
```

```
precision recall f1-score support
0 1.00 0.89 0.94 38
1 0.95 1.00 0.97 76
accuracy 0.96 114
macro avg 0.97 0.95 0.96 114
weighted avg 0.97 0.96 0.96 114
```

Overall this is very good performance.

The SVD decomposes a matrix $X$ as $X = U \Sigma V^{*}$, where:

- $X$ is $m \times n$.
- $U$ is $m \times m$ (unitary with orthonormal columns; columns = *left singular vectors*).
- $V$ is $n \times n$ (unitary with orthonormal columns; columns = *right singular vectors*).
- $\Sigma$ is $m \times n$ with real, non-negative entries along the diagonal (*singular values*). The singular values are the square roots of the eigenvalues of $X^{*}X$ or $XX^{*}$.
- When $m \geq n$, $\Sigma$ has at most $n$ non-zero elements on the diagonal.
- Rank of $X$ = number of non-zero singular values.

**In numpy:**

- The rows of `Vt` represent the eigenvectors of $X^{T}X$.
- The columns of `U` represent the eigenvectors of $XX^{T}$.
- The eigenvalues are `S**2`.
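These relationships can be verified numerically; a quick sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X^T X (sorted descending) equal the squared singular values.
evals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(evals, S**2))  # True

# The first row of Vt is an eigenvector of X^T X with eigenvalue S[0]**2.
v0 = Vt[0]
print(np.allclose((X.T @ X) @ v0, S[0]**2 * v0))  # True
```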

The SVD provides a systematic way to determine a low-dimensional approximation to high-dimensional data in terms of dominant patterns. This technique is data-driven in that patterns are discovered purely from data, without the addition of expert knowledge or intuition.

If $X$ is self-adjoint ($X = X^{*}$), then the singular values of $X$ are equal to the absolute values of the eigenvalues of $X$. In NumPy, we compute the SVD as follows:

```
import numpy as np
X = np.random.rand(5, 3)
U, S, Vt = np.linalg.svd(X, full_matrices=True)
Uhat, Shat, Vhatt = np.linalg.svd(X, full_matrices=False)
print("\nfull_matrices=True:")
print(f"U.shape: {U.shape}.")
print(f"S.shape: {S.shape}.")
print(f"Vt.shape: {Vt.shape}.")
print("\nfull_matrices=False:")
print(f"Uhat.shape: {Uhat.shape}.")
print(f"Shat.shape: {Shat.shape}.")
print(f"Vhatt.shape: {Vhatt.shape}.")
print(f"\nS:\n{S}.\n")
print(f"Shat:\n{Shat}.\n")
```

```
full_matrices=True:
U.shape: (5, 5).
S.shape: (3,).
Vt.shape: (3, 3).
full_matrices=False:
Uhat.shape: (5, 3).
Shat.shape: (3,).
Vhatt.shape: (3, 3).
S:
[2.13628638 0.91901978 0.39330927].
Shat:
[2.13628638 0.91901978 0.39330927].
```

Perhaps the most useful and defining property of the SVD is that it provides an optimal low-rank approximation to a matrix $X$. The Eckart-Young theorem states that the optimal rank-$r$ approximation to $X$ in a least-squares sense is given by the rank-$r$ SVD truncation $\tilde{X}$:

$$
\underset{\tilde{X}, \; \text{rank}(\tilde{X}) = r}{\text{argmin}} \lVert X - \tilde{X} \rVert_{F} = \tilde{U}\tilde{\Sigma}\tilde{V}^{*}
$$

where:

- $\tilde{U}$ and $\tilde{V}$ represent the first $r$ leading columns of $U$ and $V$.
- $\tilde{\Sigma}$ represents the leading $r \times r$ sub-block of $\Sigma$.
- $\lVert \cdot \rVert_{F}$ represents the Frobenius norm.

Because $\tilde{\Sigma}$ is diagonal, the rank-$r$ SVD approximation is given by the sum of $r$ distinct rank-1 matrices:

$$
\tilde{X} = \sum_{k=1}^{r} \sigma_{k} u_{k} v_{k}^{*}
$$

The truncated SVD basis provides a coordinate transformation from the high-dimensional original matrix into a lower-dimensional representation.

For truncation values $r$ that are smaller than the number of non-zero singular values (i.e., the rank of $X$), the truncated SVD only approximates $X$:

$$
X \approx \tilde{U}\tilde{\Sigma}\tilde{V}^{*}
$$

If we choose the truncation value to keep all non-zero singular values, then $\tilde{X} = X$ is exact.
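A quick numerical check of this property: the Frobenius error of the rank-r truncation equals the square root of the sum of the discarded squared singular values (a sketch with a random matrix):

```python
import numpy as np

rng = np.random.default_rng(516)
X = rng.random((8, 5))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
Xr = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Frobenius norm of the residual...
err = np.linalg.norm(X - Xr, ord="fro")
# ...equals sqrt of the sum of squared discarded singular values.
expected = np.sqrt(np.sum(S[r:] ** 2))
print(np.isclose(err, expected))  # True

# Keeping all singular values recovers X exactly.
Xfull = U @ np.diag(S) @ Vt
print(np.allclose(X, Xfull))  # True
```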

For the next example, we use an alternate cover photo from the Allman Brothers 1971 release *At the Fillmore East*, shown in color and grayscale side-by-side. We’ll work with the grayscale image going forward since it limits us to two dimensions:

```
from skimage import io
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
import numpy as np
# 3-D RGB image.
imgrgb = io.imread("fillmore.jpg")
# 2-D grayscale image.
img = rgb2gray(imgrgb)
# Make grayscale image symmetric.
img = img[:800, :800]
print(f"img.shape: {img.shape}")
fig, ax = plt.subplots(1, 2, figsize=(10, 5), tight_layout=True)
ax[0].imshow(imgrgb)
ax[0].set_title("original", fontsize=9)
ax[0].set_axis_off()
ax[1].imshow(img, cmap=plt.cm.gray)
ax[1].set_title("grayscale", fontsize=9)
ax[1].set_axis_off()
plt.show()
```

`img.shape: (800, 800)`

Next we generate successive rank-$r$ approximations of the original image, showing the storage requirement of each approximation.

```
# Grayscale image.
X = img
# Run SVD on grayscale image X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
# Convert singular values array to full matrix.
S = np.diag(S)
# Rank-r approximations to evaluate.
ranks = [1, 20, 100, 200]
# Matplotlib indices.
indices = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Number of values associated with original image.
total_nbr_vals = np.prod(X.shape)
fig, ax = plt.subplots(2, 2, tight_layout=True, figsize=(8, 8))
for r, (ii, jj) in zip(ranks, indices):
# Compute rank-r approximation of X.
Xr = U[:, :r] @ S[:r, :r] @ Vt[:r, :]
# Compute storage of rank-r approximation vs. full image.
rank_r_nbr_vals = np.prod(U[:, :r].shape) + r + np.prod(Vt[:r, :].shape)
rank_r_storage = rank_r_nbr_vals / total_nbr_vals
# Display rank-r approximation.
ax[ii, jj].imshow(Xr, cmap=plt.cm.gray)
ax[ii, jj].set_title(f"r={r:,.0f} (storage={rank_r_storage:.2%})", fontsize=9)
ax[ii, jj].set_axis_off()
plt.show()
```

A rank-100 approximation provides a decent representation of the original. At rank-200, there is virtually no difference between the original and the approximation. In practice, we could store `U[:, :200]`, `S[:200, :200]` and `Vt[:200, :]` separately, then compute the matrix product prior to rendering the image. Doing so reduces the storage requirements by roughly a factor of 2.
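The storage figure can be checked directly: for the 800 x 800 image, a rank-200 truncation stores two 800 x 200 factors plus 200 singular values:

```python
m = n = 800
r = 200

vals_full = m * n                 # values in the original image
vals_rank_r = m * r + r + r * n   # U[:, :r], singular values, Vt[:r, :]

ratio = vals_rank_r / vals_full
print(f"{ratio:.2%}")  # ~50.03% of the original storage
```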

We can plot the magnitude of the singular values along with the cumulative proportion to assess how much variation in the original image is captured for a given rank-$r$ approximation:

```
s = np.diag(S)
ranks = ranks + [400, 800]
fig, ax = plt.subplots(1, 2, tight_layout=True, figsize=(8, 4))
ax[0].semilogy(s, color="#000000", linewidth=1)
ax[0].set_ylabel(r"Singular value, $\sigma_{r}$")
ax[0].set_xlabel(r"$r$")
ax[0].tick_params(axis="x", which="major", direction='in', labelsize=8)
ax[0].tick_params(axis="x", which="minor", direction='in', labelsize=8)
ax[0].tick_params(axis="y", which="major", direction='in', labelsize=8)
ax[0].tick_params(axis="y", which="minor", direction='in', labelsize=8)
ax[0].xaxis.set_ticks_position("none")
ax[0].yaxis.set_ticks_position("none")
ax[0].grid(True)
ax[1].plot(np.cumsum(s) / np.sum(s), color="#000000", linewidth=1)
ax[1].set_ylabel(r"cumulative sum")
ax[1].set_xlabel(r"$r$")
ax[1].tick_params(axis="x", which="major", direction='in', labelsize=8)
ax[1].tick_params(axis="x", which="minor", direction='in', labelsize=8)
ax[1].tick_params(axis="y", which="major", direction='in', labelsize=8)
ax[1].tick_params(axis="y", which="minor", direction='in', labelsize=8)
ax[1].xaxis.set_ticks_position("none")
ax[1].yaxis.set_ticks_position("none")
ax[1].grid(True)
for r in ranks:
y = np.sum(s[:r]) / np.sum(s)
ax[1].scatter(r, y, s=50, color="red")
ax[1].annotate(
r"$r=$" + "{:,.0f}".format(r), xycoords="data", xy=(r, y),
xytext=(10, 0), textcoords="offset points", ha="left", va="center",
fontsize=8, rotation=0, weight="normal", color="#000000",
)
plt.show()
```

The rank-100 approximation accounts for ~60% of the cumulative sum of singular values. By rank-200, the approximation is closer to 80%. For completeness, we also show that a rank-800 approximation is able to recover the original image fully, since it is using all singular values and vectors (the original grayscale image was 800 x 800). The benefit of using SVD for image compression lies in its ability to prioritize and retain the most significant features of the image data, while excluding less significant features.

Note that much of this analysis is based on Chapter 1 of Steve Brunton’s *Data-Driven Science and Engineering*, which is an excellent resource for practicing Data Scientists. Be sure to pick up your own copy, as the second edition was recently released.

Assuming that humans are a random sample from all humans that will ever exist, the probability that any individual is born at a particular time in human history is proportional to the population size at that time.

Based on historical population data, one can estimate the total number of humans who have ever lived up to the present.

Under the assumption that humans will continue to reproduce at a roughly constant rate until extinction, the argument suggests that since you are observing humanity at a random point in its history, it is statistically more likely that you are living closer to the midpoint of human existence rather than at the beginning or end.

In the podcast, Bostrom suggests that we have systematically underestimated the probability that humanity will go extinct soon, and uses sampling from two urns and observing the results as an analogy:

Imagine we have two urns: the first (urn A) has 10 balls numbered 1 through 10, and the second (urn B) has 1,000,000 balls numbered 1 through 1,000,000. Someone puts one urn in front of you and asks: what is the probability that it is the 10-ball urn? With no other information, as a rational participant you might say 50%.

But you are then allowed to reach in and draw a ball at random: suppose it has a 7 on it. Drawing a 7 is strong evidence for the 10-ball urn, since the probability of drawing a 7 from the 10-ball urn is 10%, while the probability of drawing a 7 from the 1,000,000-ball urn is .0001%.

You then perform a Bayesian update: If your prior was 50/50 for the 10 ball urn, you become virtually certain after finding a randomly sampled 7 that it only has 10 balls in it.

Let:

- $P(A)$ = prior probability that the unknown urn is urn A = .50.
- $P(B)$ = prior probability that the unknown urn is urn B = .50.
- $P(A \mid x)$ = given a randomly drawn ball $x$, the probability that it came from urn $A$.
- $P(x \mid A)$ = given that a random draw originates from urn $A$, the probability of observing the drawn $x$.

The expression for the Bayesian update is then:

$$
P(A \mid x) = \frac{P(x \mid A)P(A)}{P(x \mid A)P(A) + P(x \mid B)P(B)}
$$

We know $P(x \mid A) = 1/10$ and $P(x \mid B) = 1/1{,}000{,}000$, since there are 10 balls in urn $A$ and 1,000,000 balls in urn $B$, and each ball has an equally likely chance of being drawn.

After observing a 7, for urn A we have:

$$
P(A \mid 7) = \frac{(1/10)(.50)}{(1/10)(.50) + (1/1{,}000{,}000)(.50)} \approx .99999
$$

After observing a 7, for urn B we have:

$$
P(B \mid 7) = \frac{(1/1{,}000{,}000)(.50)}{(1/10)(.50) + (1/1{,}000{,}000)(.50)} \approx .00001
$$

Therefore, there is a greater than 99.99% probability that the unknown urn is urn $A$ given the observed 7.
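The update is easy to verify numerically (a minimal sketch of the calculation above):

```python
# Priors and likelihoods of drawing a 7 from each urn.
prior_a, prior_b = 0.5, 0.5
like_a = 1 / 10
like_b = 1 / 1_000_000

# Bayesian update after observing a 7.
post_a = (like_a * prior_a) / (like_a * prior_a + like_b * prior_b)
print(post_a)  # ~0.99999
```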

The Doomsday argument asks how many humans will have lived by the time the species goes extinct.

Suppose we consider 2 hypotheses:

- There will be **200,000,000,000** humans (200 billion) in total.
- There will be **200,000,000,000,000** humans (200 trillion) in total.

Take your own birth rank (your position in the sequence of all humans who have ever lived) as a random sample from the set of all humans who will ever exist. It turns out your birth rank is roughly 100 billion. If there are only going to be 200 billion humans in total, 100 billion is a perfectly unremarkable number, with your birth rank falling somewhere in the middle.

But if there are going to be 200 trillion humans, then a birth rank of 100 billion is remarkably early (100,000,000,000 / 200,000,000,000,000 = 0.0005).

When considering these two hypotheses, you should update in favor of the human species having a lower total number of members (what Bostrom calls “doom soon”).
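The same Bayesian machinery from the urn example applies to the two population hypotheses. Under self-sampling, the likelihood of observing birth rank 100 billion is $1/N$ under each hypothesis; with an equal prior, the posterior for the 200 billion hypothesis works out as follows (a sketch of the reasoning, not a calculation from the podcast):

```python
# Likelihood of observing birth rank 100 billion under each hypothesis.
like_small = 1 / 200e9    # N = 200 billion total humans
like_large = 1 / 200e12   # N = 200 trillion total humans

# Equal priors; posterior probability of "doom soon" (200 billion total).
post_small = (like_small * 0.5) / (like_small * 0.5 + like_large * 0.5)
print(post_small)  # ~0.999
```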

He goes on to say that there has to be something “fishy” with this argument, because from very weak premises it arrives at a very striking implication: that we have almost no chance of reaching 200 trillion humans in the future. How can we get there simply by reflecting on when we were born?

In a generalized linear model (GLM), the response may have any distribution from the exponential family. Rather than assuming the mean is a linear function of the explanatory variables, we assume that a function of the mean, or the link function, is a linear function of the explanatory variables.

Logistic regression is used for modeling data with a categorical response. Although it’s possible to model multinomial data using logistic regression, in this post our analysis will be limited to models targeting a dichotomous response, where the outcome can be classified as ‘Yes/No’ or ‘1/0’.

The logistic regression model is a GLM whose canonical link is the logit, or log-odds:

$$
\log\left(\frac{p_{i}}{1 - p_{i}}\right) = \beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}
$$

for $i = 1, \dots, n$.

Solving the logit for $p_{i}$, which is a stand-in for the predicted probability associated with observation $i$, yields

$$
p_{i} = \frac{1}{1 + e^{-x_{i}^{T}\beta}}
$$

where $x_{i} = (1, x_{i1}, \dots, x_{ip})^{T}$ and $\beta = (\beta_{0}, \beta_{1}, \dots, \beta_{p})^{T}$.
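The logit and its inverse can be sketched in a few lines, confirming that the two functions round-trip:

```python
import numpy as np

def logit(p):
    """Log-odds of probability p."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse logit: maps the linear predictor to (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
print(np.allclose(sigmoid(logit(p)), p))  # True
```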

Maximum Likelihood Estimation can be used to determine the parameters of a logistic regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the unknown $\beta$.

Let $y_{1}, \dots, y_{n}$ represent independent, dichotomous response values for each of $n$ observations, where $y_{i} = 1$ denotes a success and $y_{i} = 0$ denotes a failure. The density function of a single observation $y_{i}$ is given by

$$
f(y_{i}; p_{i}) = p_{i}^{y_{i}}(1 - p_{i})^{1 - y_{i}}
$$

and the corresponding likelihood function is

$$
L(\beta) = \prod_{i=1}^{n} p_{i}^{y_{i}}(1 - p_{i})^{1 - y_{i}}
$$

Taking the natural log of the likelihood results in the log-likelihood function:

$$
\ell(\beta) = \sum_{i=1}^{n} \left[y_{i}\log(p_{i}) + (1 - y_{i})\log(1 - p_{i})\right]
$$

The first-order partial derivatives of the log-likelihood are calculated and set to zero for each $\beta_{j}$, $j = 0, 1, \dots, p$:

$$
\frac{\partial \ell(\beta)}{\partial \beta_{j}} = \sum_{i=1}^{n} (y_{i} - p_{i})x_{ij} = 0
$$

which can be represented in matrix notation as

$$
X^{T}(y - p) = 0
$$

where $X^{T}$ is a (p + 1)-by-n matrix and $(y - p)$ an n-by-1 vector.

The vector of first-order partial derivatives of the log-likelihood function is referred to as the score function, and is typically represented as $U$.

These (p + 1) equations are solved simultaneously to obtain the parameter estimates $\hat{\beta}$.

Each solution specifies a critical point, which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite, i.e. $v^{T} H v < 0$ for every nonzero vector $v$ (a consequence of which is that every element on the diagonal of the matrix is negative).

The matrix of second partial derivatives is given by

$$\frac{\partial^{2} \ell(\beta)}{\partial \beta_j \, \partial \beta_k} = -\sum_{i=1}^{n} x_{ij} x_{ik} \pi_i (1 - \pi_i),$$

represented in matrix form as

$$H = -X^{T} W X,$$

where $W$ is an n-by-n diagonal matrix of weights with each element equal to $\pi_i(1 - \pi_i)$ for logistic regression models (in general, the weights matrix will have entries inversely proportional to the variance of the response).

Since no closed-form solution exists for determining logistic regression model coefficients, iterative techniques must be employed.

Two distinct but related iterative methods can be utilized in determining model coefficients: the Newton-Raphson method and Fisher Scoring. The Newton-Raphson method relies on the matrix of second partial derivatives, also known as the Hessian. The Newton-Raphson update expression is:

$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} U,$$

where:

- $\beta^{(t+1)}$ = the vector of updated coefficient estimates.
- $\beta^{(t)}$ = the vector of coefficient estimates from the previous iteration.
- $H^{-1}$ = the inverse of the Hessian, $(-X^{T} W X)^{-1}$.
- $U$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

The Newton-Raphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial, and then finding the location of that polynomial's maximum value. This process continues until it converges to the actual solution. The convergence of $\beta^{(t)}$ to $\hat{\beta}$ is usually fast, with adequate convergence frequently realized after fewer than 50 iterations.
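To make the update concrete, here is a minimal one-dimensional sketch (a hypothetical single-coefficient model with no intercept, using made-up data) that iterates the Newton-Raphson step on a logistic log-likelihood:

```python
import numpy as np

# Hypothetical one-predictor, no-intercept logistic model.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

b = 0.0  # initial guess
for _ in range(25):
    p = sigmoid(b * x)
    score = np.sum((y - p) * x)             # first derivative of log-likelihood
    hessian = -np.sum(x**2 * p * (1 - p))   # second derivative (negative)
    b_new = b - score / hessian             # Newton-Raphson update
    if abs(b_new - b) < 1e-8:
        b = b_new
        break
    b = b_new
```

At convergence the score is (numerically) zero, which is exactly the first-order condition derived above.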

An alternative method, *Fisher Scoring*, utilizes the expected information $E[-H]$. Let $\mathcal{I}$ serve as a stand-in for the expected value of the information:

$$\mathcal{I} = E\left[-\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}}\right] = X^{T} W X.$$

The Fisher scoring update step replaces $-H^{-1}$ from Newton-Raphson with $\mathcal{I}^{-1}$:

$$\beta^{(t+1)} = \beta^{(t)} + \mathcal{I}^{-1} U,$$

where:

- $\beta^{(t+1)}$ = the vector of updated coefficient estimates.
- $\beta^{(t)}$ = the vector of coefficient estimates from the previous iteration.
- $\mathcal{I}^{-1}$ = the inverse of the expected information matrix, $(X^{T} W X)^{-1}$.
- $U$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

For GLMs with a canonical link (of which the logit for logistic regression is an example), the observed and expected information coincide, so Fisher scoring and Newton-Raphson produce identical estimates.

When the canonical link is used, the second partial derivatives of the log-likelihood do not depend on the observations $y_i$, and therefore

$$\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}} = E\left[\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}}\right].$$

Fisher scoring has the advantage that it produces the asymptotic covariance matrix $(X^{T} W X)^{-1}$ as a by-product.

To summarize:

- The *Hessian* is the matrix of second partial derivatives of the log-likelihood with respect to the parameters, $H = -X^{T} W X$.
- The *observed information* is $-H = X^{T} W X$.
- The *expected information* is $\mathcal{I} = E[-H] = X^{T} W X$.
- The *asymptotic covariance matrix* is $\mathcal{I}^{-1} = (X^{T} W X)^{-1}$.

For models employing a canonical link function:

- The observed and expected information are the same, $-H = \mathcal{I}$.
- $\beta^{(t+1)} = \beta^{(t)} - H^{-1} U$, or equivalently $\beta^{(t+1)} = \beta^{(t)} + \mathcal{I}^{-1} U$.
- The Newton-Raphson and Fisher Scoring algorithms yield identical results.

The data used for our sample calculation can be obtained here. The data represents O-Ring failures in the 23 pre-Challenger space shuttle missions. TEMPERATURE will serve as the single explanatory variable which will be used to predict O_RING_FAILURE, which is 1 if a failure occurred, 0 otherwise.

Once the parameters have been determined, the model estimate of the probability of success for a given observation $x_i$ can be calculated with:

$$\hat{\pi}_i = \frac{e^{x_i^{T}\hat{\beta}}}{1 + e^{x_i^{T}\hat{\beta}}}$$

In the following code, we define a single function, `estimate_lr_params`, which returns the estimated model coefficients as a (p+1)-by-1 array. In addition, the function returns the number of scoring iterations, the fitted values and the variance-covariance matrix for the estimated parameters.

```
import numpy as np

def estimate_lr_params(X, y, epsilon=.001):
    """
    Estimate logistic regression coefficients using Fisher scoring. Iteration
    ceases once the change in each coefficient across consecutive iterations
    is less than epsilon.

    - design_matrix `X`      : n-by-(p+1)
    - response_vector `y`    : n-by-1
    - probability_vector `p` : n-by-1
    - weights_matrix `W`     : n-by-n
    - epsilon                : threshold above which iteration continues
    - n                      : number of observations
    - (p + 1)                : number of parameters (+1 for intercept term)

    - U : first derivative of log-likelihood with respect to each beta_i,
          i.e. the "score function" = X^T * (y - p)
    - I : the "information matrix" = X^T * W * X (the negative of the matrix
          of second derivatives of the log-likelihood)

    - X^T*W*X results in a (p + 1)-by-(p + 1) matrix.
    - X^T(y - p) results in a (p + 1)-by-1 matrix.
    - (X^T*W*X)^-1 * X^T(y - p) results in a (p + 1)-by-1 matrix.

    Returns
    -------
    dict of model results
    """
    def sigmoid(v):
        return 1 / (1 + np.exp(-v))

    # Initialize coefficients to zero, then compute p, W, I and U.
    betas0 = np.zeros(X.shape[1]).reshape(-1, 1)
    p = sigmoid(X @ betas0)
    W = np.diag((p * (1 - p)).ravel())
    I = X.T @ W @ X
    U = X.T @ (y - p)
    n_iter = 0

    while True:
        n_iter += 1
        # Fisher scoring update: beta_new = beta_old + I^-1 * U.
        betas = betas0 + np.linalg.inv(I) @ U
        betas = betas.reshape(-1, 1)
        if np.all(np.abs(betas - betas0) < epsilon):
            break
        else:
            p = sigmoid(X @ betas)
            W = np.diag((p * (1 - p)).ravel())
            I = X.T @ W @ X
            U = X.T @ (y - p)
            betas0 = betas

    dresults = {
        "params": betas.ravel(),
        "ypred": sigmoid(X @ betas),
        "V": np.linalg.inv(I),
        "n_iter": n_iter
    }
    return dresults
```

We read in the Challenger dataset and partition it into the design matrix and response vector, which are then passed to `estimate_lr_params`:

```
import pandas as pd
df = pd.read_csv("https://gist.githubusercontent.com/jtrive84/835514a76f7afd552c999e4d9134baa8/raw/6dac51b80f892ef051174a46766eb53c7b609ebd/Challenger.csv")
X0 = df[["TEMPERATURE"]].values
X = np.concatenate([np.ones(X0.shape[0]).reshape(-1, 1), X0], axis=1)
y = df[["O_RING_FAILURE"]].values
dresults = estimate_lr_params(X, y)
dresults
```

```
{'params': array([15.04290163, -0.23216274]),
'ypred': array([[0.43049313],
[0.22996826],
[0.27362106],
[0.32209405],
[0.37472428],
[0.1580491 ],
[0.12954602],
[0.22996826],
[0.85931657],
[0.60268105],
[0.22996826],
[0.04454055],
[0.37472428],
[0.93924781],
[0.37472428],
[0.08554356],
[0.22996826],
[0.02270329],
[0.06904407],
[0.03564141],
[0.08554356],
[0.06904407],
[0.82884484]]),
'V': array([[ 5.44406534e+01, -7.96333573e-01],
[-7.96333573e-01, 1.17143602e-02]]),
'n_iter': 5}
```

`estimate_lr_params` returns a dictionary consisting of the following keys:

- `"params"`: Estimated parameters.
- `"ypred"`: Fitted values.
- `"V"`: Variance-covariance matrix of the parameter estimates.
- `"n_iter"`: Number of Fisher scoring iterations.

For the Challenger dataset, our implementation of Fisher scoring results in a model with $\hat{\beta}_0 = 15.0429$ and $\hat{\beta}_1 = -0.2322$. In order to predict new probabilities of O-ring failure based on temperature, we use:

$$\hat{\pi} = \frac{e^{15.0429 - 0.2322 \cdot \mathrm{temperature}}}{1 + e^{15.0429 - 0.2322 \cdot \mathrm{temperature}}}$$

Negative coefficients correspond to features that are negatively associated with the probability of a positive outcome, with the reverse being true for positive coefficients.
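As a quick illustration (plugging in the coefficients reported in the output above, so the exact probabilities depend on those estimates), the fitted model's failure probability drops sharply as temperature increases:

```python
import numpy as np

# Coefficients as reported by estimate_lr_params above.
b0, b1 = 15.04290163, -0.23216274

def predicted_failure_probability(temperature):
    # Inverse logit of the linear predictor b0 + b1 * temperature.
    z = b0 + b1 * temperature
    return 1 / (1 + np.exp(-z))

for temperature in (55, 65, 75):
    prob = predicted_failure_probability(temperature)
    print(f"temperature={temperature}F: P(O-ring failure)={prob:.4f}")
```

Consistent with the negative coefficient on TEMPERATURE, the estimated failure probability decreases monotonically as temperature rises.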

Let's compare the results of our implementation against the estimates produced by statsmodels:

```
import statsmodels.formula.api as smf
mdl = smf.logit("O_RING_FAILURE ~ TEMPERATURE", data=df).fit()
print(f"\nmdl.params:\n{mdl.params}\n")
print(f"mdl.cov_params():\n{mdl.cov_params()}\n")
print(f"mdl.predict(df):\n{mdl.predict(df)}\n")
```

```
Optimization terminated successfully.
Current function value: 0.441635
Iterations 7
mdl.params:
Intercept 15.042902
TEMPERATURE -0.232163
dtype: float64
mdl.cov_params():
Intercept TEMPERATURE
Intercept 54.444275 -0.796387
TEMPERATURE -0.796387 0.011715
mdl.predict(df):
0 0.430493
1 0.229968
2 0.273621
3 0.322094
4 0.374724
5 0.158049
6 0.129546
7 0.229968
8 0.859317
9 0.602681
10 0.229968
11 0.044541
12 0.374724
13 0.939248
14 0.374724
15 0.085544
16 0.229968
17 0.022703
18 0.069044
19 0.035641
20 0.085544
21 0.069044
22 0.828845
dtype: float64
```

The values produced using statsmodels align closely with the results from `estimate_lr_params`.

A feature of logistic regression models is that the predictions preserve the data’s marginal probabilities. If you aggregate the fitted values from the model, the total will equal the number of positive outcomes in the original target vector:

```
# estimate_lr_params.
dresults["ypred"].sum()
```

`7.000000000274647`

```
# statsmodels.
mdl.predict(df).sum()
```

`7.0000000000000036`

We have 7 positive instances in our dataset, and the total probability aggregates to 7 in both sets of predictions.

- Q-Q Plot: Compares two probability distributions by plotting their quantiles against each other.

- P-P Plot: Compares two cumulative distribution functions against each other.

- Histogram: Plot density histogram with parametric distribution overlay.

In addition, the following tests will be introduced:

- Kolmogorov-Smirnov: Test the equality of continuous, one-dimensional probability distributions.

- Anderson-Darling: Test whether a given sample is drawn from a given probability distribution.

- Shapiro-Wilk: Test the null hypothesis that the data is drawn from a normal distribution.

The same dataset will be used throughout the post, provided below:

```
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.set_printoptions(suppress=True, precision=8)
dat = np.asarray([
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
])
```

The task is to assess how well our data fits a normal distribution parameterized with mean and variance computed using:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^{2} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^{2}$$

Keep in mind that although we're testing how well the data can be approximated by a normal distribution, many of the tests we highlight (with the exception of Shapiro-Wilk) can assess the quality of fit for many different parametric models.

We begin with visual assessments of goodness-of-fit.

The Q-Q plot compares two probability distributions by plotting their quantiles against each other. We compare standard normal quantiles (x-axis) against the empirical quantiles from the dataset of interest (y-axis). If the two distributions are similar, the points in the Q-Q plot will approximately lie on a straight line. There isn’t a hard and fast rule to determine how much deviation from the straight line is too much, but if the distributions are very different, it will be readily apparent in the Q-Q plot. We can construct a Q-Q plot from scratch using matplotlib as follows:

```
"""
Generate qq-plot comparing data against standard normal distribution.
"""
line_color = "#E02C70"
dat = np.sort(dat)
cdf = np.arange(1, dat.size + 1) / dat.size
ndist = stats.norm(loc=0, scale=1)
theo = ndist.ppf(cdf)
# Remove observations containing Inf (np.Inf was removed in NumPy 2.0; use np.inf).
x, y = zip(*[tt for tt in zip(theo, dat) if np.inf not in tt and -np.inf not in tt])
# Obtain coefficients for best fit regression line.
b1, b0, _, _, _ = stats.linregress(x, y)
yhat = [b1 * ii + b0 for ii in x]
# Determine upper and lower axis bounds.
xmin, xmax = min(x), max(x)
ymin, ymax = int(min(y) - 1), int(max(y) + 1)
fig, ax = plt.subplots(1, 1, figsize=(8, 5), tight_layout=True)
ax.set_title(
"Q-Q Plot: Data vs. Standard Normal Distribution",
color="#000000", loc="center", fontsize=10, weight="bold"
)
ax.scatter(x, y, color="#FFFFFF", edgecolor="#000000", linewidth=.75, s=45)
ax.plot(x, yhat, color=line_color, linewidth=1.25)
ax.set_xlim(left=xmin, right=xmax)
ax.set_ylim(bottom=ymin, top=ymax)
ax.set_ylabel("Empirical Quantiles", fontsize=9, color="#000000")
ax.set_xlabel("Standard Normal Quantiles", fontsize=9, color="#000000")
ax.tick_params(axis="x", which="major", labelsize=8)
ax.tick_params(axis="y", which="major", labelsize=8)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)
ax.set_axisbelow(True)
plt.show()
```

The points seem to mostly follow a straight line, though a few observations deviate from strict linearity. However, there's nothing here that disqualifies our dataset from being modeled with a normal distribution.

The P-P plot compares two cumulative distribution functions against each other. To produce a P-P plot, we plot the theoretical percentiles (x-axis) against empirical percentiles (y-axis), so that each axis ranges from 0-1. The line of comparison is the 45 degree line running from (0,0) to (1,1). The distributions are equal if and only if the plot falls on this line: any deviation indicates a difference between the distributions. The code to generate a P-P plot is provided below:

```
"""
Create P-P plot, which compares theoretical normal percentiles (x-axis) against
empirical percentiles (y-axis).
"""
dat = np.sort(dat)
dat_mean = dat.mean()
dat_std = dat.std(ddof=1)
# Standardize dat.
sdat = (dat - dat_mean) / dat_std
cdf = np.arange(1, dat.size + 1) / dat.size
ndist = stats.norm(loc=0, scale=1)
theo = ndist.cdf(sdat)
x, y = theo, cdf
fig, ax = plt.subplots(1, 1, figsize=(8, 5), tight_layout=True)
ax.set_title(
"P-P Plot: Empirical CDF vs. Standard Normal CDF",
color="#000000", loc="center", fontsize=10, weight="bold"
)
ax.scatter(x, y, color="#FFFFFF", edgecolor="#000000", linewidth=.75, s=45)
ax.plot([0, 1], [0, 1], color=line_color, linewidth=1.25)
ax.set_xlim(left=0, right=1)
ax.set_ylim(bottom=0, top=1)
ax.set_ylabel("Empirical Cumulative Distribution", fontsize=9,color="#000000")
ax.set_xlabel("Standard Normal Distribution", fontsize=9, color="#000000")
ax.tick_params(axis="x", which="major", labelsize=8)
ax.tick_params(axis="y", which="major", labelsize=8)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)
ax.set_axisbelow(True)
plt.show()
```

Although the observations follow the linear trend in general, the data overall appear somewhat above the reference line. This may be attributable to the mean of `dat` being greater than 0. However, this doesn't eliminate the possibility of our data representing a sample from a normal population. We expect some deviation from the expected normal percentiles, which we see in the P-P plot.

For the next diagnostic we create a histogram which represents the density of the empirical data overlaid with a parameterized normal distribution.

```
"""
Plot histogram with best-fit normal distribution overlay.
"""
dist = stats.norm(loc=dat_mean, scale=dat_std)
xdist = np.arange(dat.min(), dat.max(), .01)
ydist = dist.pdf(xdist)
fig, ax = plt.subplots(1, 1, figsize=(8, 5), tight_layout=True)
ax.set_title(
"Empirical Data w/ Parametric Overlay", color="#000000",
loc="center", fontsize=10, weight="bold"
)
ax.hist(
dat, 13, density=True, alpha=1, color="#E02C70",
edgecolor="#FFFFFF", linewidth=1.0
)
ax.plot(xdist, ydist, color="#000000", linewidth=1.75, linestyle="--")
ax.set_xlabel("")
ax.set_ylabel("")
ax.tick_params(axis="x", which="major", labelsize=7)
ax.tick_params(axis="y", which="major", labelsize=7)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)
ax.set_axisbelow(True)
plt.show()
```

The data appear to roughly follow the shape outlined by the best-fit normal density.

The Kolmogorov-Smirnov test differs from the previous visualizations in that it produces a statistic used to assess the level of agreement between target and reference distributions, although a visual diagnostic can be obtained as well.

Suppose that we have a set of empirical data that we assume originates from some distribution $F$. The Kolmogorov-Smirnov statistic is used to test

$H_0$: the samples come from $F$

against:

$H_1$: the samples do not come from $F$.

The test compares the empirical distribution function of the data, $F_n$, with the cumulative distribution function associated with the null hypothesis, $F$ (the expected CDF).

The Kolmogorov-Smirnov statistic is given by

$$D_n = \sup_x \lvert F_n(x) - F(x) \rvert.$$

Assuming the data are ordered such that $x_{(1)}$ represents the minimum value in the dataset and $x_{(n)}$ the maximum value, the empirical CDF can be represented as

$$F_n(x) = \frac{\#\{x_i \le x\}}{n},$$

where $n$ is the number of observations in the dataset.

For each observation, compute the absolute difference between $F_n(x_i)$ and $F(x_i)$. The Kolmogorov-Smirnov statistic $D_n$ is the maximum value from the vector of absolute differences. This value represents the maximum absolute distance between the expected and observed distribution functions. $D_n$ is then compared to a table of critical values to assess whether to reject or fail to reject $H_0$.

Before computing the statistic, we first demonstrate how to generate the one-sample Kolmogorov-Smirnov comparison plot:

```
"""
Kolmogorov-Smirnov test visualization.
"""
dat = np.sort(dat)
dat_mean = dat.mean()
dat_std = dat.std(ddof=1)
sdat = (dat - dat_mean) / dat_std
cdf = np.arange(1, dat.size + 1) / dat.size
dist = stats.norm(loc=0, scale=1)
# Generate Kolmogorov-Smirnov comparison plot.
# y0 : Values from reference distribution.
# y1 : Values from empirical distribution.
ecdfpairs = zip(sdat, cdf)
ecdfpairs = [ii for ii in ecdfpairs if np.inf not in ii and -np.inf not in ii]
x, y1 = zip(*ecdfpairs)
x = np.asarray(x, dtype=float)
y0 = dist.cdf(x)
y1 = np.asarray(y1, dtype=float)
absdiffs = np.abs(y1 - y0)
indx = np.argwhere(absdiffs == absdiffs.max()).ravel()[0]
xann, y0ann, y1ann = x[indx], y0[indx], y1[indx]
ypoint = (y1ann + y0ann) / 2
xy, xyp = (xann, ypoint), (xann + .3, ypoint - .1)
fig, ax = plt.subplots(1, 1, figsize=(8, 5), tight_layout=True)
xmin, xmax, ymin, ymax = min(x), max(x), 0, 1
ax.set_title(
"Kolmogorov-Smirnov Illustration", fontsize=10,
loc="center", color="#000000", weight="bold"
)
ax.set_xlim(left=xmin, right=xmax)
ax.set_ylim(bottom=ymin, top=ymax)
ax.plot(x, y0, color="#000000", linewidth=1.5, linestyle="--", label="Reference CDF")
ax.step(x, y1, color="#f33455", linewidth=1.5, where="pre", label="Empirical CDF")
ax.tick_params(axis="x", which="major", labelsize=8)
ax.tick_params(axis="y", which="major", labelsize=8)
ax.set_ylabel("CDF", fontsize=9)
ax.set_xlabel("z", fontsize=9)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)
ax.set_axisbelow(True)
plt.annotate(
"Maximum Absolute Distance", xy=xy, xytext=xyp,
arrowprops=dict(facecolor="black", width=1.5, shrink=0.025, headwidth=6.5)
)
ax.legend(
frameon=1, loc="upper left", fontsize="medium", fancybox=True, framealpha=1
)
plt.show()
```

The Kolmogorov-Smirnov statistic is computed as the greatest absolute distance between the empirical and expected CDFs. Computing the statistic is straightforward:

```
dat = np.sort(dat)
cdf = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_std = dat.std(ddof=1)
# Parameterized expected normal distribution.
expnorm = stats.norm(loc=dat_mean, scale=dat_std)
expcdf = expnorm.cdf(dat)
# Compute difference between datcdf and expcdf.
absdiffs = np.abs(cdf - expcdf)
D0 = absdiffs.max()
D0
```

`0.07194182492411011`

We can compare our value of $D$ with the value obtained from `scipy.stats.kstest`, which takes as arguments the empirical dataset and a callable representing the CDF of the expected distribution, and returns the D-statistic as well as the p-value associated with the computed D-statistic (note that critical values depend on the number of observations). The manually computed result is given by `D0`, the result returned from `scipy.stats.kstest` by `D1`:

```
dat = np.sort(dat)
cdf = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_std = dat.std(ddof=1)
# Parameterized expected normal distribution.
expnorm = stats.norm(loc=dat_mean, scale=dat_std)
expcdf = expnorm.cdf(dat)
absdiffs = np.abs(cdf - expcdf)
D0 = absdiffs.max()
D1, p1 = stats.kstest(dat, expnorm.cdf)
print(f"Our D : {D0:.8}")
print(f"Scipy kstest D: {D1:.8}")
print(f"kstest p-value: {p1:.8}")
```

```
Our D : 0.071941825
Scipy kstest D: 0.071941825
kstest p-value: 0.92828027
```

The p-value (the second element of the 2-tuple returned by `scipy.stats.kstest`) is 0.9283. How should this result be interpreted?

For the one-sample Kolmogorov-Smirnov test, the null hypothesis is that the distributions are the same. Thus, the lower the p-value, the greater the statistical evidence you have to reject the null hypothesis and conclude the distributions are different. *The test only lets you speak of your confidence that the distributions are different, not the same, since the test is designed to find the probability of Type I error*. Therefore, if $D$ is less than the critical value, we do not reject the null hypothesis (this corresponds to a large p-value). If $D$ is greater than the critical value, we reject the null hypothesis (this corresponds to a small p-value).

Given our p-value of 0.9283, we do not have sufficient evidence to reject the null hypothesis that the distributions are the same.

The Anderson-Darling test assesses the null hypothesis that a sample is drawn from a population that follows a particular distribution. It makes use of the fact that, given a hypothesized underlying distribution and assuming the data is a sample from this distribution, the CDF of the data can be assumed to follow a uniform distribution. For ordered observations $x_{(1)} \le \cdots \le x_{(n)}$, the statistic itself can be expressed as:

$$A^{2} = -n - \frac{1}{n}\sum_{i=1}^{n}(2i - 1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right]$$

The function `scipy.stats.anderson` takes as arguments the empirical dataset and a distribution to test against (one of "norm", "expon", "logistic", "gumbel", "gumbel_l" or "gumbel_r"), and returns the Anderson-Darling test statistic, the critical values for the specified distribution and the significance levels associated with the critical values. For example, to test whether our dataset follows a normal distribution, we run the following:

```
# Perform Anderson-Darling test.
A, crit, sig = stats.anderson(dat, dist="norm")
print(f"A : {A}")
print(f"crit: {crit}")
print(f"sig : {sig}")
```

```
A : 0.22442637404651578
crit: [0.54 0.615 0.738 0.861 1.024]
sig : [15. 10. 5. 2.5 1. ]
```

According to the `scipy.stats.anderson` documentation, if the returned statistic is larger than the critical value at a given significance level, then the null hypothesis (that the data come from the chosen distribution) can be rejected at that level. Since our statistic 0.224426 is smaller than all critical values, we do not have sufficient evidence to reject the null hypothesis that the data come from a normal distribution.
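A short sketch (using a synthetic normal sample for self-containment; substitute `dat` to reproduce the result above) shows how the statistic-versus-critical-value comparison can be scripted:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in sample, assumed normal by construction.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=25, size=53)

# anderson returns the statistic, critical values and significance levels.
A, crit, sig = stats.anderson(sample, dist="norm")
for level, cv in zip(sig, crit):
    decision = "reject H0" if A > cv else "fail to reject H0"
    print(f"{level:>4}% significance level (critical value {cv:.3f}): {decision}")
```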

A table of Anderson-Darling critical values can be found here.

The Shapiro-Wilk test's null hypothesis is that the sample is drawn from a normally distributed population. The function `scipy.stats.shapiro` takes the empirical dataset as its sole argument and, similar to `scipy.stats.kstest`, returns a 2-tuple containing the test statistic and p-value.

```
# Perform Shapiro-Wilk test.
W, p = stats.shapiro(dat)
print(f"W: {W}")
print(f"p: {p}")
```

```
W: 0.9804516434669495
p: 0.5324737429618835
```

If the p-value is less than the chosen alpha level, then the null hypothesis is rejected, and there is evidence that the data tested are not normally distributed. If the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. Our p-value is 0.532, so we cannot reject the null hypothesis of normality.

A table of Shapiro-Wilk critical values can be downloaded here.

R’s glm function is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution. This function conceals a good deal of the complexity behind a simple interface, making it easy to overlook the calculations that estimate a model’s coefficients. The goal of this post is to shed some light on the mechanics of those calculations.

In a generalized linear model the response may follow any distribution from the exponential family, and rather than assuming the mean is a linear function of the explanatory variables, we assume that a function of the mean (the link function) is a linear function of the explanatory variables.

Logistic regression is used for modeling data with a categorical response. Although it’s possible to model multinomial data using logistic regression, this article focuses only on fitting data having a dichotomous response (‘Yes/No’, ‘True/False’, ‘1/0’, ‘Good/Bad’).

The logistic regression model is a generalized linear model whose canonical link is the logit, or log-odds:

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$$

Solving the logit for $\pi_i$, which represents the predicted probability for a set of features $x_i$, yields

$$\pi_i = \frac{e^{x_i^{T}\beta}}{1 + e^{x_i^{T}\beta}},$$

where $x_i^{T} = (1, x_{i1}, \dots, x_{ip})$ and $\beta = (\beta_0, \beta_1, \dots, \beta_p)^{T}$.

In other words, the expression for $\pi_i$ maps any real-valued linear predictor $x_i^{T}\beta$ to a probability between 0 and 1.

Maximum Likelihood Estimation can be used to determine the parameters of a logistic regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the unknown $\beta$.

Let $y_1, \dots, y_n$ represent independent, dichotomous response values for each of $n$ observations, where $y_i = 1$ denotes a success and $y_i = 0$ a failure. The density function of a single observation $y_i$ can be expressed as

$$f(y_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i},$$

from which we obtain the likelihood function:

$$L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}.$$

Taking the natural log of the likelihood yields the log-likelihood function:

$$\ell(\beta) = \sum_{i=1}^{n}\left[ y_i \ln(\pi_i) + (1 - y_i)\ln(1 - \pi_i)\right].$$

The first-order partial derivatives of the log-likelihood are calculated and set to zero for each $\beta_j$, $j = 0, 1, \dots, p$:

$$\frac{\partial \ell(\beta)}{\partial \beta_j} = \sum_{i=1}^{n}(y_i - \pi_i)x_{ij} = 0,$$

which can be represented in matrix form as

$$X^{T}(y - \pi) = 0,$$

where $X^{T}$ is a (p+1)-by-n matrix and $(y - \pi)$ an n-by-1 vector.

The vector of first-order partial derivatives of the log-likelihood function is referred to as the score function, and is typically represented as $U = X^{T}(y - \pi)$.

These equations are solved simultaneously to obtain the parameter estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$. Each solution specifies a critical point, which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite, i.e. $v^{T} H v < 0$ for every nonzero vector $v$ (a consequence of which is that every element on the diagonal of the matrix is negative).

The matrix of second partial derivatives can be expressed as

$$\frac{\partial^{2} \ell(\beta)}{\partial \beta_j \, \partial \beta_k} = -\sum_{i=1}^{n} x_{ij} x_{ik} \pi_i(1 - \pi_i),$$

which can be represented in matrix form as

$$H = -X^{T} W X,$$

where $W$ is an n-by-n diagonal matrix of weights with each element equal to $\pi_i(1 - \pi_i)$ for logistic regression models. In general, the weight matrix will have entries inversely proportional to the variance of the response.

Since no closed-form solution exists for determining logistic regression coefficients, iterative techniques must be employed.

Two distinct but related iterative methods can be utilized in determining model coefficients: the Newton-Raphson method and Fisher Scoring. The Newton-Raphson method relies on the matrix of second partial derivatives, also known as the Hessian. The Newton-Raphson update expression is given by:

$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} U,$$

where:

- $\beta^{(t+1)}$ = the vector of updated coefficient estimates.
- $\beta^{(t)}$ = the vector of coefficient estimates from the previous iteration.
- $H^{-1}$ = the inverse of the Hessian, $(-X^{T} W X)^{-1}$.
- $U$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

The Newton-Raphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial, and then finding the location of that polynomial's maximum value. This process continues until it converges to the actual solution. The convergence of $\beta^{(t)}$ to $\hat{\beta}$ is usually fast, with adequate convergence usually realized in fewer than 20 iterations.

Fisher Scoring utilizes the expected information, $E[-H]$. Let $\mathcal{I}$ serve as a stand-in for the expected value of the information:

$$\mathcal{I} = E\left[-\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}}\right] = X^{T} W X.$$

The Fisher Scoring update step replaces $-H^{-1}$ from Newton-Raphson with $\mathcal{I}^{-1}$:

$$\beta^{(t+1)} = \beta^{(t)} + \mathcal{I}^{-1} U,$$

where:

- $\beta^{(t+1)}$ = the vector of updated coefficient estimates.
- $\beta^{(t)}$ = the vector of coefficient estimates from the previous iteration.
- $\mathcal{I}^{-1}$ = the inverse of the expected information matrix, $(X^{T} W X)^{-1}$.
- $U$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

For GLMs with a canonical link, the observed and expected information coincide, so Fisher Scoring produces the same estimates as Newton-Raphson.

When the canonical link is used, the second partial derivatives of the log-likelihood do not depend on the observations $y_i$, and therefore

$$\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}} = E\left[\frac{\partial^{2} \ell(\beta)}{\partial \beta \, \partial \beta^{T}}\right].$$

Fisher scoring has the advantage that it produces the asymptotic covariance matrix $(X^{T} W X)^{-1}$ as a by-product. To summarize:

- The Hessian is the matrix of second partial derivatives of the log-likelihood with respect to the parameters: $H = -X^{T} W X$.
- The observed information is $-H = X^{T} W X$.
- The expected information is $\mathcal{I} = E[-H] = X^{T} W X$.
- The asymptotic covariance matrix is $\mathcal{I}^{-1} = (X^{T} W X)^{-1}$.

For models employing a canonical link function:

- The observed and expected information are the same: $-H = \mathcal{I}$.
- $\beta^{(t+1)} = \beta^{(t)} - H^{-1} U$, or equivalently $\beta^{(t+1)} = \beta^{(t)} + \mathcal{I}^{-1} U$.
- The Newton-Raphson and Fisher Scoring algorithms yield identical results.

The data used for our sample calculation can be obtained here. This data represents O-ring failures in the 23 pre-Challenger space shuttle missions. In this dataset, TEMPERATURE serves as the single explanatory variable which will be used to predict “O_RING_FAILURE”, which is 1 if a failure occurred, 0 otherwise.

Once the parameters have been determined, the model estimate of the probability of success for a given observation $x_i$ can be calculated via

$$\hat{\pi}_i = \frac{e^{x_i^{T}\hat{\beta}}}{1 + e^{x_i^{T}\hat{\beta}}}.$$

`getCoefficients` returns the estimated model coefficients as a (p+1)-by-1 matrix. In addition, the function returns the number of scoring iterations, the fitted values and the resulting variance-covariance matrix.

```
getCoefficients = function(design_matrix, response_vector, epsilon=.0001) {
    # =========================================================================
    # design_matrix `X`      => n-by-(p+1)                                    |
    # response_vector `y`    => n-by-1                                        |
    # probability_vector `p` => n-by-1                                        |
    # weights_matrix `W`     => n-by-n                                        |
    # epsilon                => threshold above which iteration continues     |
    # =========================================================================
    # n       => # of observations                                            |
    # (p + 1) => # of parameters, +1 for intercept term                       |
    # =========================================================================
    # U => First derivative of Log-Likelihood with respect to                 |
    #      each beta_i, i.e. `Score Function`: X_transpose * (y - p)          |
    #                                                                         |
    # I => `Information Matrix`: (X_transpose * W * X)                        |
    #                                                                         |
    # X^T*W*X results in a (p+1)-by-(p+1) matrix                              |
    # X^T(y - p) results in a (p+1)-by-1 matrix                               |
    # (X^T*W*X)^-1 * X^T(y - p) results in a (p+1)-by-1 matrix                |
    # ========================================================================|
    X = as.matrix(design_matrix)
    y = as.matrix(response_vector)

    # Initialize logistic function used for Scoring calculations.
    pi_i = function(v) return(exp(v) / (1 + exp(v)))

    # Initialize beta_0, p_0, W_0, I_0 & U_0.
    beta_0 = matrix(rep(0, ncol(X)), nrow=ncol(X), ncol=1, byrow=FALSE, dimnames=NULL)
    p_0 = pi_i(X %*% beta_0)
    W_0 = diag(as.vector(p_0 * (1 - p_0)))
    I_0 = t(X) %*% W_0 %*% X
    U_0 = t(X) %*% (y - p_0)

    # Initialize variables for iteration.
    beta_old = beta_0
    iter_I = I_0
    iter_U = U_0
    iter_p = p_0
    iter_W = W_0
    fisher_scoring_iterations = 0

    # Iterate until abs(beta_new - beta_old) < epsilon for all coefficients.
    while (TRUE) {
        fisher_scoring_iterations = fisher_scoring_iterations + 1
        beta_new = beta_old + solve(iter_I) %*% iter_U
        if (all(abs(beta_new - beta_old) < epsilon)) {
            model_parameters = beta_new
            fitted_values = pi_i(X %*% model_parameters)
            covariance_matrix = solve(iter_I)
            break
        } else {
            iter_p = pi_i(X %*% beta_new)
            iter_W = diag(as.vector(iter_p * (1 - iter_p)))
            iter_I = t(X) %*% iter_W %*% X
            iter_U = t(X) %*% (y - iter_p)
            beta_old = beta_new
        }
    }

    results = list(
        'model_parameters'=model_parameters,
        'covariance_matrix'=covariance_matrix,
        'fitted_values'=fitted_values,
        'number_iterations'=fisher_scoring_iterations
    )
    return(results)
}
```

A quick summary of R’s matrix operators:

- `%*%`: matrix multiplication.
- `diag`: returns a matrix with the provided vector as the diagonal and zero off-diagonal entries.
- `t`: returns the transpose of the provided matrix.
- `solve`: returns the inverse of the provided matrix (if it exists).
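For readers more comfortable with numpy, these R operators map to familiar equivalents: `@` for `%*%`, `np.diag` for `diag`, `.T` for `t` and `np.linalg.inv` for `solve`. A quick sketch (the matrices here are purely illustrative):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
v = np.array([0.5, 0.25])

# R's `%*%` (matrix multiplication) is numpy's `@` operator.
prod = A @ A

# R's `diag(v)` builds a matrix with v on the diagonal, zeros elsewhere.
D = np.diag(v)

# R's `t(A)` is the transpose.
At = A.T

# R's `solve(A)` (matrix inverse) is np.linalg.inv(A).
Ainv = np.linalg.inv(A)

# A matrix times its inverse recovers the identity.
print(np.round(A @ Ainv, 6))
```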

Note that in our implementation, we solve the normal equations by explicitly inverting the information matrix. You wouldn’t see this in practice or in optimized numerical software packages: when confronted with an ill-conditioned system of equations, forming $X^{T}WX$ effectively squares the condition number, which results in an answer with diminished accuracy. Optimized statistical computing packages instead leverage more stable methods such as the QR decomposition or the SVD. But this suffices for our purposes.
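To illustrate the numerical point (a numpy sketch, in Python rather than R): even without a full QR decomposition, the scoring step can be computed by solving the linear system directly instead of forming the inverse, which is both cheaper and numerically safer. The matrices below are stand-ins, not the Challenger data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric positive definite stand-in for the information matrix I,
# and a stand-in for the score vector U.
M = rng.normal(size=(5, 5))
I_mat = M @ M.T + 5 * np.eye(5)
U = rng.normal(size=5)

# Explicit inversion, as in our R implementation.
delta_inv = np.linalg.inv(I_mat) @ U

# Preferred: solve the linear system I * delta = U directly.
delta_solve = np.linalg.solve(I_mat, U)

# Both agree on this well-conditioned system, but solve() never forms the inverse.
print(np.max(np.abs(delta_inv - delta_solve)))
```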

We load the Challenger dataset and partition it into the design matrix and response vector, which are then passed to `getCoefficients`:

```
df = read.table(
    file="Challenger.csv", header=TRUE, sep=",",
    stringsAsFactors=FALSE
)
X = as.matrix(cbind(1, df['TEMPERATURE']))  # design matrix
y = as.matrix(df['O_RING_FAILURE'])         # response vector
colnames(X) = NULL
colnames(y) = NULL

# Call `getCoefficients`, keeping epsilon at .0001.
results = getCoefficients(X, y, epsilon=.0001)
```

Printing `results` displays the model’s estimated coefficients (*model_parameters*), the variance-covariance matrix of the coefficient estimates (*covariance_matrix*), fitted values (*fitted_values*) and the number of Fisher Scoring iterations (*number_iterations*):

```
> print(results)
$model_parameters
[,1]
[1,] 15.0429016
[2,] -0.2321627
$covariance_matrix
[,1] [,2]
[1,] 54.4442748 -0.79638682
[2,] -0.7963868 0.01171514
$fitted_values
[,1]
[1,] 0.43049313
[2,] 0.22996826
[3,] 0.27362105
[4,] 0.32209405
[5,] 0.37472428
[6,] 0.15804910
[7,] 0.12954602
[8,] 0.22996826
[9,] 0.85931657
[10,] 0.60268105
[11,] 0.22996826
[12,] 0.04454055
[13,] 0.37472428
[14,] 0.93924781
[15,] 0.37472428
[16,] 0.08554356
[17,] 0.22996826
[18,] 0.02270329
[19,] 0.06904407
[20,] 0.03564141
[21,] 0.08554356
[22,] 0.06904407
[23,] 0.82884484
$number_iterations
[1] 6
```

For the Challenger dataset, our implementation of Fisher Scoring yields $\hat{\beta}_0 = 15.0429$ and $\hat{\beta}_1 = -0.2322$. In order to predict new probabilities of O-ring failure based on temperature, our model relies on the following formula:

$$
\hat{p}(\text{temperature}) = \frac{1}{1 + e^{-(15.0429 \; - \; 0.2322 \, \cdot \, \text{temperature})}}
$$

Negative coefficients correspond to variables that are negatively correlated with the probability of a positive outcome, the reverse being true for positive coefficients.
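The prediction formula can be sketched in a few lines of Python, plugging in the coefficient estimates from our Fisher Scoring run:

```python
import numpy as np

# Coefficient estimates from the Fisher Scoring output above.
b0, b1 = 15.0429016, -0.2321627

def predicted_prob(temperature):
    """Predicted probability of O-ring failure at the given temperature."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * temperature)))

# Lower temperatures imply higher failure probabilities (negative coefficient).
print(float(predicted_prob(31)), float(predicted_prob(75)))
```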

Let’s compare the results of our implementation with the output of `glm` on the same dataset, specifying `family=binomial(link=logit)`:

```
df = read.table(
    file="Challenger.csv", header=TRUE, sep=",",
    stringsAsFactors=FALSE
)
logistic.fit = glm(
    formula=O_RING_FAILURE ~ TEMPERATURE,
    family=binomial(link=logit), data=df
)
```

From `logistic.fit`, we’ll extract `coefficients`, `fitted.values` and `iter`, and call `vcov(logistic.fit)` to obtain the variance-covariance matrix of the estimated coefficients:

```
> logistic.fit$coefficients
(Intercept) TEMPERATURE
15.0429016 -0.2321627
> matrix(logistic.fit$fitted.values)
[,1]
[1,] 0.43049313
[2,] 0.22996826
[3,] 0.27362105
[4,] 0.32209405
[5,] 0.37472428
[6,] 0.15804910
[7,] 0.12954602
[8,] 0.22996826
[9,] 0.85931657
[10,] 0.60268105
[11,] 0.22996826
[12,] 0.04454055
[13,] 0.37472428
[14,] 0.93924781
[15,] 0.37472428
[16,] 0.08554356
[17,] 0.22996826
[18,] 0.02270329
[19,] 0.06904407
[20,] 0.03564141
[21,] 0.08554356
[22,] 0.06904407
[23,] 0.82884484
> logistic.fit$iter
5
> vcov(logistic.fit)
(Intercept) TEMPERATURE
(Intercept) 54.4441826 -0.79638547
TEMPERATURE -0.7963855 0.01171512
```

Our coefficients match exactly those generated by `glm`, and, as expected, the fitted values are also identical.

Notice there’s some discrepancy in the estimate of the variance-covariance matrix beginning at the 4th decimal place (54.4442748 in our algorithm vs. 54.4441826 for the variance of the intercept term from `glm`). This may be due to rounding, or to the loss of floating point precision incurred when inverting matrices. Also notice that our implementation required one more Fisher Scoring iteration than `glm` (6 vs. 5). Decreasing epsilon would force additional iterations and a tighter convergence, which in turn may lead to better agreement between the variance-covariance matrices.

Calling `summary(logistic.fit)` prints, among other things, the standard error of the coefficient estimates:

```
> summary(logistic.fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.0429 7.3786 2.039 0.0415 *
TEMPERATURE -0.2322 0.1082 -2.145 0.0320 *
```

The *Std. Error* values are the square roots of the diagonal elements of the variance-covariance matrix: $\sqrt{54.4443} \approx 7.3786$ and $\sqrt{0.0117} \approx 0.1082$.

*z value* is the estimated coefficient divided by its *Std. Error*. In our example, $15.0429 / 7.3786 \approx 2.039$ and $-0.2322 / 0.1082 \approx -2.145$. *Pr(>|z|)* is the p-value, which indicates whether the estimated coefficient differs significantly from zero. The standard rule of thumb is that coefficients with p-values less than 0.05 are considered statistically significant, although some applications require stricter thresholds.
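These quantities can be reproduced from the coefficient estimates and the variance-covariance matrix alone. A Python sketch using the values printed above (`scipy.stats.norm` supplies the standard normal CDF for the two-sided p-values):

```python
import numpy as np
from scipy.stats import norm

# Coefficient estimates and variance-covariance matrix from our run.
beta = np.array([15.0429016, -0.2321627])
vcov = np.array([[54.4442748, -0.79638682],
                 [-0.7963868,  0.01171514]])

std_err = np.sqrt(np.diag(vcov))    # standard errors
z = beta / std_err                  # z values
p = 2 * (1 - norm.cdf(np.abs(z)))   # two-sided p-values

print(np.round(std_err, 4))
print(np.round(z, 3))
print(np.round(p, 4))
```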

A feature of Logistic Regression is that the training data’s marginal probabilities are preserved: if you aggregate the fitted values over the training set, that quantity will equal the number of positive outcomes in the response vector (this is true for all exponential family GLMs employing a canonical link function):

```
> sum(y)
[1] 7
# Checking sum for our algorithm.
> sum(results$fitted_values)
[1] 7
# Checking sum for glm.
> sum(logistic.fit$fitted.values)
[1] 7
```
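The same property can be checked with scikit-learn on synthetic data (a sketch: the data and coefficients below are illustrative, and regularization is effectively disabled via a very large `C` so the fit approximates the unpenalized MLE):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(516)

# Synthetic design matrix and binary response (illustrative values).
X = rng.normal(size=(200, 2))
p_true = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.75 * X[:, 1])))
y = rng.binomial(1, p_true)

# Effectively unpenalized fit with a tight tolerance.
clf = LogisticRegression(C=1e8, tol=1e-8, max_iter=10000).fit(X, y)
fitted = clf.predict_proba(X)[:, 1]

# Sum of fitted probabilities matches the count of positive outcomes.
print(int(y.sum()), round(float(fitted.sum()), 4))
```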

To apply the model generated by `glm` to a new set of explanatory variables, use the `predict` function. Pass a list or data.frame of explanatory variables to `predict`, and for logistic regression models, be sure to set `type="response"` to ensure probabilities are returned. For example:

```
# New inputs for Logistic Regression model.
> tempsDF <- data.frame(TEMPERATURE=c(24, 41, 46, 47, 61))
> predict(logistic.fit, tempsDF, type="response")
1 2 3 4 5
0.9999230 0.9960269 0.9874253 0.9841912 0.7070241
```

The process starts by comparing the network’s output to the desired output, calculating the error. Then, starting from the output layer and moving backward, the algorithm computes the gradients of the error with respect to each weight in the network using the chain rule of calculus. These gradients indicate how much each weight contributes to the error.

Next, the weights are updated using gradient descent, where they are adjusted in the direction that minimizes the error. This adjustment is proportional to the gradient and a predefined learning rate, ensuring the network converges towards a solution. Backpropagation continues iteratively over the training data until the network’s performance reaches a satisfactory level or a predetermined number of iterations is reached.
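The gradient descent update described above can be sketched in a few lines. This is a minimal toy example, minimizing the hypothetical loss $f(w) = (w - 3)^2$ rather than a network loss:

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
def grad(w):
    # Gradient of the toy loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
alpha = 0.1    # learning rate

for _ in range(100):
    # Step in the direction that minimizes the error, scaled by alpha.
    w = w - alpha * grad(w)

print(round(w, 4))  # converges toward the minimizer w = 3
```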

Overall, backpropagation efficiently adjusts the weights of a fully connected network, enabling it to learn complex relationships between input and output data through iterative optimization of the network’s parameters.

In what follows, we walk through the mathematics and pseudocode required to train a 2-layer fully connected network for a classification task.

In the following, superscripts represent the layer associated with each variable:

- $X$: Input data of dimension n-by-f, where n is the number of samples and f the number of features. For a batch of 32 MNIST samples, $X$ would have dimension (32, 784).
- $y$: Target variable. When classifying a single digit from MNIST, a vector populated with 0s and 1s indicating the ground truth label for each sample (8 or not 8). Has the same length as the first dimension of $X$.
- $W^{[l]}$: Trainable weights. Projects the previous layer’s activations to a lower-dimensional representation. Referring to the first set of weights for a batch of 32 MNIST samples, $W^{[1]}$’s first dimension will match the second dimension of the activations from the previous layer (784), and $W^{[1]}$’s second dimension will be some lower dimension, say 256. $W^{[1]}$ will therefore have dimension (784, 256).
- $b^{[l]}$: Bias term, a one-dimensional vector associated with each hidden layer having length equal to the second dimension of the hidden layer. $b^{[1]}$ will have dimension (256,).
- $Z^{[l]}$: Output of layer $l$, which is the matrix product of the previous layer’s activations and the current layer’s weights (plus the bias term): $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$.
- $A^{[l]}$: Activations associated with layer $l$: $Z^{[l]}$ passed through a non-linearity such as sigmoid or ReLU.

More concretely, assume a 2-layer fully-connected neural network with one hidden layer of size 256, through which a dataset of dimension 32-by-784 is passed to predict whether each of the 32 images is an 8 or not. The forward pass looks like:

- Randomly initialize $W^{[1]}$ (784x256), $b^{[1]}$ (256x1), $W^{[2]}$ (256x1) and $b^{[2]}$ (1x1)
- $X$ (32x784)
- $Z^{[1]} = X W^{[1]} + b^{[1]}$ (32x256)
- $A^{[1]} = \sigma(Z^{[1]})$ (32x256)
- $Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}$ (32x1)
- $A^{[2]} = \sigma(Z^{[2]})$ (32x1)

The final output, $A^{[2]}$, represents the probability that each sample is the number 8 or not.

With the actual labels $y$ and our predicted probabilities $A^{[2]}$, we can define our loss function, the cross-entropy loss for binary classification:

$$
L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log A^{[2]}_i + (1 - y_i) \log\left(1 - A^{[2]}_i\right) \right]
$$
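As a quick numerical check of the cross-entropy loss (the labels and probabilities here are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, a):
    """Cross-entropy loss for binary labels y and predicted probabilities a."""
    y, a = np.asarray(y, dtype=float), np.asarray(a, dtype=float)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Confident, correct predictions give a small loss ...
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while confident, wrong predictions are penalized heavily.
loss_bad = binary_cross_entropy([1, 0], [0.1, 0.9])

print(round(float(loss_good), 5), round(float(loss_bad), 5))
```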

The goal of backpropagation is to compute the partial derivatives of the loss function $L$ with respect to any weight $W^{[l]}$ or bias $b^{[l]}$ in the network. In order to update our weights, we need to take derivatives of $L$ w.r.t. $W^{[2]}$ and $b^{[2]}$, then update $W^{[2]}$ and $b^{[2]}$ using those derivatives. Backpropagation starts by taking the derivative of the loss function. We first compute the derivatives of the loss function w.r.t. $W^{[2]}$ and $b^{[2]}$, working with a single sample for clarity (the final matrix expressions average over the batch). Here we make use of the chain rule:

$$
\frac{\partial L}{\partial W^{[2]}} =
\frac{\partial L}{\partial A^{[2]}} \cdot
\frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot
\frac{\partial Z^{[2]}}{\partial W^{[2]}},
\qquad
\frac{\partial L}{\partial b^{[2]}} =
\frac{\partial L}{\partial A^{[2]}} \cdot
\frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot
\frac{\partial Z^{[2]}}{\partial b^{[2]}}
$$

Once we have $\partial L / \partial W^{[2]}$ and $\partial L / \partial b^{[2]}$, $W^{[2]}$ and $b^{[2]}$ are updated as follows:

$$
W^{[2]} := W^{[2]} - \alpha \frac{\partial L}{\partial W^{[2]}},
\qquad
b^{[2]} := b^{[2]} - \alpha \frac{\partial L}{\partial b^{[2]}}
$$

for some learning rate $\alpha$. This holds for all layers. For a given layer $l$, the update rule for $W^{[l]}$ and $b^{[l]}$ is:

$$
W^{[l]} := W^{[l]} - \alpha \frac{\partial L}{\partial W^{[l]}},
\qquad
b^{[l]} := b^{[l]} - \alpha \frac{\partial L}{\partial b^{[l]}}
$$

Let’s start with unpacking $\partial L / \partial W^{[2]}$. The first entry on the r.h.s., $\partial L / \partial A^{[2]}$, represents the derivative of the loss function w.r.t. $A^{[2]}$, which is

$$
\frac{\partial L}{\partial A^{[2]}} = -\frac{y}{A^{[2]}} + \frac{1 - y}{1 - A^{[2]}}
$$

The second term on the r.h.s., $\partial A^{[2]} / \partial Z^{[2]}$, is the derivative of the sigmoid activation ($A^{[2]} = \sigma(Z^{[2]})$). The derivative of the sigmoid function is given by

$$
\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right),
$$

therefore $\partial A^{[2]} / \partial Z^{[2]}$ is given by

$$
\frac{\partial A^{[2]}}{\partial Z^{[2]}} = A^{[2]}\left(1 - A^{[2]}\right)
$$

For the third term on the r.h.s., $\partial Z^{[2]} / \partial W^{[2]}$, recall that $Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}$. Therefore

$$
\frac{\partial Z^{[2]}}{\partial W^{[2]}} = A^{[1]}
$$

Finally, we have

$$
\frac{\partial L}{\partial W^{[2]}}
= \left(-\frac{y}{A^{[2]}} + \frac{1 - y}{1 - A^{[2]}}\right) \cdot A^{[2]}\left(1 - A^{[2]}\right) \cdot A^{[1]}
= \left(A^{[2]} - y\right) \cdot A^{[1]}
$$

As a notational convenience, we define $dZ^{[2]}$:

$$
dZ^{[2]} = \frac{\partial L}{\partial Z^{[2]}} = A^{[2]} - y
$$

This way, $\partial L / \partial W^{[2]}$ can be expressed (in matrix form, averaged over the batch) as

$$
\frac{\partial L}{\partial W^{[2]}} = \frac{1}{n} \left(A^{[1]}\right)^{T} dZ^{[2]}
$$

We proceed in a similar fashion for $\partial L / \partial b^{[2]}$:

$$
\frac{\partial L}{\partial b^{[2]}} = \frac{1}{n} \sum_{i=1}^{n} dZ^{[2]}_{i},
$$

since $\partial Z^{[2]} / \partial b^{[2]} = 1$.
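The identity $\partial L / \partial Z^{[2]} = A^{[2]} - y$ can be verified numerically with a finite-difference check. A scalar sketch (one sample, one logit):

```python
import math

def loss_at(z, y):
    """Binary cross-entropy of sigmoid(z) against label y."""
    a = 1 / (1 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6

# Analytic gradient: dL/dz = sigmoid(z) - y.
analytic = 1 / (1 + math.exp(-z)) - y

# Central finite-difference approximation of the same derivative.
numeric = (loss_at(z + h, y) - loss_at(z - h, y)) / (2 * h)

print(round(analytic, 6), round(numeric, 6))
```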

For the first layer we re-use many of these calculations, but for new terms on the r.h.s., we employ the chain rule in the same way. For reference, we restate the terms from the forward pass:

$$
Z^{[1]} = X W^{[1]} + b^{[1]}, \quad
A^{[1]} = \sigma(Z^{[1]}), \quad
Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}, \quad
A^{[2]} = \sigma(Z^{[2]})
$$

We next consider $\partial L / \partial W^{[1]}$:

$$
\frac{\partial L}{\partial W^{[1]}} =
\frac{\partial L}{\partial Z^{[2]}} \cdot
\frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot
\frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot
\frac{\partial Z^{[1]}}{\partial W^{[1]}}
$$

Considering each term on the r.h.s.:

$$
\frac{\partial L}{\partial Z^{[2]}} = dZ^{[2]}, \quad
\frac{\partial Z^{[2]}}{\partial A^{[1]}} = W^{[2]}, \quad
\frac{\partial A^{[1]}}{\partial Z^{[1]}} = \sigma'(Z^{[1]}), \quad
\frac{\partial Z^{[1]}}{\partial W^{[1]}} = X
$$

Resulting in (matrix form, with $\odot$ denoting element-wise multiplication):

$$
\frac{\partial L}{\partial W^{[1]}} = \frac{1}{n} X^{T} \left[ dZ^{[2]} \left(W^{[2]}\right)^{T} \odot \sigma'(Z^{[1]}) \right]
$$

As before, we define $dZ^{[1]}$ as

$$
dZ^{[1]} = dZ^{[2]} \left(W^{[2]}\right)^{T} \odot \sigma'(Z^{[1]}),
$$

which allows us to write $\partial L / \partial W^{[1]}$ as

$$
\frac{\partial L}{\partial W^{[1]}} = \frac{1}{n} X^{T} dZ^{[1]}
$$

Similarly for $\partial L / \partial b^{[1]}$:

$$
\frac{\partial L}{\partial b^{[1]}} =
\frac{\partial L}{\partial Z^{[1]}} \cdot
\frac{\partial Z^{[1]}}{\partial b^{[1]}}
$$

Considering each term on the r.h.s.:

$$
\frac{\partial L}{\partial Z^{[1]}} = dZ^{[1]}, \qquad
\frac{\partial Z^{[1]}}{\partial b^{[1]}} = 1
$$

Therefore

$$
\frac{\partial L}{\partial b^{[1]}} = \frac{1}{n} \sum_{i=1}^{n} dZ^{[1]}_{i}
$$

To complete the backpropagation algorithm, it is necessary to define $\sigma'(Z^{[1]})$:

$$
\sigma'(Z^{[1]}) = \sigma(Z^{[1]}) \odot \left(1 - \sigma(Z^{[1]})\right)
$$

Assume $X$ is a 32x784 batch of MNIST images, and our network has one hidden layer of size 256. Our task is to identify which digit 0-9 each sample most closely resembles. We first declare a number of functions, then implement the forward and backward passes along with the weights update.

```
import numpy as np

def sigmoid(X):
    """
    Compute the sigmoid activation for the input.
    """
    return 1 / (1 + np.exp(-X))

def sigmoid_dev(X):
    """
    The analytical derivative of the sigmoid function at X.
    """
    return sigmoid(X) * (1 - sigmoid(X))

def softmax(scores):
    """
    Compute softmax scores given the raw output from the model.
    Returns softmax probabilities (N, num_classes).
    """
    numer = np.exp(scores - scores.max(axis=1, keepdims=True))
    denom = numer.sum(axis=1, keepdims=True)
    return np.divide(numer, denom)

def cross_entropy_loss(ypred, yactual):
    """
    Compute cross-entropy loss based on the network's predictions and labels.
    """
    yactual = np.asarray(yactual)
    ypred = ypred[np.arange(len(yactual)), yactual]
    return -np.mean(np.log(ypred))

def compute_accuracy(ypred, yactual):
    """
    Compute the accuracy of the current batch.
    """
    yactual = np.asarray(yactual)
    yhat = np.argmax(ypred, axis=1)
    return (yactual == yhat).sum() / yactual.shape[0]
```

```
# Stand in for batch of 32 MNIST images.
X = np.random.randint(0, 256, size=(32, 784))
y = np.random.randint(0, 10, size=32)
# Reshape labels to 32 x 10.
Y = np.zeros((32, 10))
Y[np.arange(X.shape[0]), y] = 1 # (32, 10)
# Learning rate.
alpha = .05
# Initialize weights.
b1 = np.zeros(256)
b2 = np.zeros(10)
W1 = 0.001 * np.random.randn(784, 256)
W2 = 0.001 * np.random.randn(256, 10)
# Forward pass.
Z1 = X @ W1 + b1 # (32, 256)
A1 = sigmoid(Z1) # (32, 256)
Z2 = A1 @ W2 + b2 # (32, 10)
A2 = softmax(Z2) # (32, 10)
# Compute loss and accuracy.
loss = cross_entropy_loss(A2, y)
accuracy = compute_accuracy(A2, y)
# Backward pass.
dZ2 = A2 - Y # (32, 10)
dW2 = (A1.T @ dZ2) / 32 # (256, 10)
db2 = np.sum(dZ2, axis=0) / 32 # (10,)
dA1 = dZ2 @ W2.T # (32, 256)
dZ1 = np.multiply(dA1, sigmoid_dev(Z1)) # (32, 256)
dW1 = (X.T @ dZ1) / 32 # (784, 256)
db1 = np.sum(dZ1, axis=0) / 32 # (256,)
# Update weights.
W2 = W2 - alpha * dW2
b2 = b2 - alpha * db2
W1 = W1 - alpha * dW1
b1 = b1 - alpha * db1
```

The code starting with the forward pass would be iterated over a set of batches for a pre-determined number of epochs. The final weights would then be used for inference.
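A sketch of how that iteration might look, with the helper functions inlined so the snippet stands alone. The dataset, batch size, and epoch count here are placeholders, not real MNIST data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

def softmax(scores):
    numer = np.exp(scores - scores.max(axis=1, keepdims=True))
    return numer / numer.sum(axis=1, keepdims=True)

# Placeholder dataset: 320 fake "images", labels 0-9.
Xall = rng.random((320, 784))
yall = rng.integers(0, 10, size=320)

alpha, batch_size, n_epochs = 0.05, 32, 2

# Initialize weights once, before the epoch loop.
b1, b2 = np.zeros(256), np.zeros(10)
W1 = 0.001 * rng.standard_normal((784, 256))
W2 = 0.001 * rng.standard_normal((256, 10))

for epoch in range(n_epochs):
    for start in range(0, Xall.shape[0], batch_size):
        X = Xall[start:start + batch_size]
        y = yall[start:start + batch_size]
        Y = np.zeros((X.shape[0], 10))
        Y[np.arange(X.shape[0]), y] = 1

        # Forward pass.
        Z1 = X @ W1 + b1
        A1 = sigmoid(Z1)
        Z2 = A1 @ W2 + b2
        A2 = softmax(Z2)

        # Backward pass.
        n = X.shape[0]
        dZ2 = A2 - Y
        dW2 = (A1.T @ dZ2) / n
        db2 = dZ2.sum(axis=0) / n
        dZ1 = (dZ2 @ W2.T) * sigmoid(Z1) * (1 - sigmoid(Z1))
        dW1 = (X.T @ dZ1) / n
        db1 = dZ1.sum(axis=0) / n

        # Weights update.
        W2 -= alpha * dW2
        b2 -= alpha * db2
        W1 -= alpha * dW1
        b1 -= alpha * db1

# Loss on the final batch.
loss = -np.mean(np.log(A2[np.arange(len(y)), y]))
print(round(float(loss), 4))
```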

```
from itertools import zip_longest
import os
import sys
import time
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import numpy as np
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.metrics import (
accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
confusion_matrix, precision_recall_curve, roc_curve
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
np.set_printoptions(suppress=True, precision=8)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 10000)
train_path = "https://gist.githubusercontent.com/jtrive84/13d05ace37948cac9583a9ab1f2def31/raw/3dc5bc9e0b573c1039abc20f816321e570aae69c/adult.csv"
dftrain = pd.read_csv(train_path)
print(dftrain.head())
```

```
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K
4 18 ? 103497 Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States <=50K
```

After loading the dataset, the first task is to get an idea of the frequency of different groups within categorical features. In the next cell, a dictionary is created for each categorical feature which remaps groups to ensure a reasonable number of observations in each:

```
dworkclass = {
"Federal-gov": "gov",
"Local-gov": "gov",
"Never-worked": "other",
"Private": "private",
"Self-emp-inc": "other",
"Self-emp-not-inc": "other",
"State-gov": "gov",
"Without-pay": "other",
"missing": "missing"
}
deducation = {
"Preschool": "no-hs",
"1st-4th": "no-hs",
"5th-6th": "no-hs",
"7th-8th": "no-hs",
"9th": "hs",
"10th": "hs",
"11th": "hs",
"12th": "hs",
"HS-grad": "hs-grad",
"Prof-school": "some-college",
"Some-college": "some-college",
"Assoc-acdm": "some-college",
"Assoc-voc": "some-college",
"Bachelors": "bachelors",
"Masters": "masters",
"Doctorate": "phd",
"missing": "missing"
}
dmarital = {
"Divorced": "divorced",
"Married-AF-spouse": "married",
"Married-civ-spouse": "married",
"Married-spouse-absent": "married",
"Never-married": "not-married",
"Separated": "divorced",
"Widowed": "widowed",
"missing": "missing"
}
doccupation = {
"Adm-clerical": "clerical",
"Armed-Forces": "other",
"Craft-repair": "repair",
"Exec-managerial": "managerial",
"Farming-fishing": "farming",
"Handlers-cleaners": "cleaners",
"Machine-op-inspct": "repair",
"Other-service": "service",
"Priv-house-serv": "other",
"Prof-specialty": "specialty",
"Protective-serv": "other",
"Sales": "sales",
"Tech-support": "tech",
"Transport-moving": "moving",
"missing": "missing"
}
doccupation2 = {
"Adm-clerical": "white",
"Armed-Forces": "other",
"Craft-repair": "blue",
"Exec-managerial": "white",
"Farming-fishing": "blue",
"Handlers-cleaners": "blue",
"Machine-op-inspct": "blue",
"Other-service": "blue",
"Priv-house-serv": "other",
"Prof-specialty": "other",
"Protective-serv": "blue",
"Sales": "white",
"Tech-support": "white",
"Transport-moving": "blue",
"missing": "missing"
}
drelationship = {
"Husband": "husband",
"Not-in-family": "no-family",
"Other-relative": "other",
"Own-child": "child",
"Unmarried": "unmarried",
"Wife": "wife",
"missing": "missing"
}
drace = {
"Amer-Indian-Eskimo": "eskimo",
"Asian-Pac-Islander": "asian",
"Black": "black",
"Other": "other",
"White": "white",
"missing": "missing"
}
dgender = {
"Female": "F",
"Male": "M",
"missing": "missing"
}
```

Next we distinguish between categorical and continuous features. Categorical features are re-mapped to align with the groups defined above. For categorical features, we assign null values to a “missing” category instead of relying on an imputation rule; this allows us to check for possible patterns in the missing data later on. `capital-gain` and `capital-loss` are converted into binary indicators and `native-country` into US vs. non-US. Finally, we split the data into training and validation sets, ensuring the same proportion of positive instances in each cut:

```
categorical = [
"workclass", "marital-status", "occupation", "relationship",
"race", "gender", "capital-gain", "capital-loss", "native-country"
]
continuous = [
"fnlwgt", "hours-per-week", "age", "educational-num"
]
# workclass.
dftrain["workclass"] = dftrain["workclass"].fillna("missing")
dftrain["workclass"] = dftrain["workclass"].map(dworkclass)
# marital-status.
dftrain["marital-status"] = dftrain["marital-status"].fillna("missing")
dftrain["marital-status"] = dftrain["marital-status"].map(dmarital)
# occupation.
dftrain["occupation"] = dftrain["occupation"].fillna("missing")
dftrain["occupation"] = dftrain["occupation"].map(doccupation)
# relationship.
dftrain["relationship"] = dftrain["relationship"].fillna("missing")
dftrain["relationship"] = dftrain["relationship"].map(drelationship)
# race.
dftrain["race"] = dftrain["race"].fillna("missing")
dftrain["race"] = dftrain["race"].map(drace)
# sex.
dftrain["gender"] = dftrain["gender"].fillna("missing")
dftrain["gender"] = dftrain["gender"].map(dgender)
# capital-gain: Convert to binary indicator.
dftrain["capital-gain"] = dftrain["capital-gain"].map(lambda v: 1 if v > 0 else 0)
# capital-loss: Convert to binary indicator.
dftrain["capital-loss"] = dftrain["capital-loss"].map(lambda v: 1 if v > 0 else 0)
# Encode native-country.
dftrain["native-country"] = dftrain["native-country"].map(lambda v: "US" if v == "United-States" else "other")
# Encode response.
dftrain["income"] = dftrain["income"].map(lambda v: 1 if v == ">50K" else 0)
# Create train and validation sets.
y = dftrain["income"]
dft, dfv, yt, yv = train_test_split(dftrain, y, test_size=.125, stratify=y)
print(f"dft.shape: {dft.shape}")
print(f"dfv.shape: {dfv.shape}")
print(f"prop. yt : {yt.sum() / dft.shape[0]:.4f}")
print(f"prop. yv : {yv.sum() / dfv.shape[0]:.4f}")
```

```
dft.shape: (42736, 15)
dfv.shape: (6106, 15)
prop. yt : 0.2393
prop. yv : 0.2393
```
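The `stratify` argument is what keeps the positive-class proportion identical across the two cuts. A toy illustration on synthetic labels (the imbalance ratio mirrors the income target, but the data is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(516)

# Imbalanced toy labels: roughly 24% positive, as with the income target.
y = (rng.random(8000) < 0.24).astype(int)
X = rng.normal(size=(8000, 3))

# Stratified split: the positive proportion is preserved in both cuts.
Xt, Xv, yt_, yv_ = train_test_split(
    X, y, test_size=.125, stratify=y, random_state=516
)

print(round(float(yt_.mean()), 4), round(float(yv_.mean()), 4))
```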

With categorical features re-mapped, it is useful to look at the proportion of positive instances in each group per feature:

```
indices = [
    (0, 0), (0, 1), (0, 2),
    (1, 0), (1, 1), (1, 2),
    (2, 0), (2, 1), (2, 2),
]

fig, ax = plt.subplots(3, 3, figsize=(9, 7), tight_layout=True)

for (ii, jj), col in zip_longest(indices, categorical):
    if col is None:
        ax[ii, jj].remove()
    else:
        gg = dftrain.groupby(col, as_index=False).agg(
            leq50k=("income", lambda v: v[v==0].size),
            gt50k=("income", "sum")
        ).sort_values(col, ascending=True)
        if col in ["educational-num", "capital-gain", "capital-loss"]:
            gg[col] = gg[col].astype(str)
        if col == "occupation":
            rot = 25
        else:
            rot = 0
        gg.plot.bar(ax=ax[ii, jj])
        ax[ii, jj].set_title(col, fontsize=8, weight="bold")
        ax[ii, jj].set_xticklabels(gg[col].values, rotation=rot)
        ax[ii, jj].yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
        ax[ii, jj].tick_params(axis="x", which="major", direction="in", labelsize=6)
        ax[ii, jj].tick_params(axis="x", which="minor", direction="in", labelsize=6)
        ax[ii, jj].tick_params(axis="y", which="major", direction="in", labelsize=6)
        ax[ii, jj].tick_params(axis="y", which="minor", direction="in", labelsize=6)
        ax[ii, jj].xaxis.set_ticks_position("none")
        ax[ii, jj].yaxis.set_ticks_position("none")
        ax[ii, jj].legend(loc="best", fancybox=True, framealpha=1, fontsize="x-small")
        ax[ii, jj].grid(True)
        ax[ii, jj].set_axisbelow(True)

plt.show()
```

From the generated plot, we take away the following:

- `educational-num`: Higher percentage of “>50K” for levels >= 13.
- `marital-status`: Higher proportion of “>50K” for married vs. all other groups.
- `gender`: Higher proportion of “>50K” for males vs. females.
- `occupation`: Higher proportion of “>50K” for managerial and specialty.

A similar exhibit for continuous features gives us an idea of the distribution of values in each:

```
indices = [0, 1, 2, 3]

fig, ax = plt.subplots(1, 4, figsize=(10, 3), tight_layout=True)

for ii, col in zip_longest(indices, continuous):
    ax[ii].set_title(col, fontsize=8, weight="bold")
    ax[ii].hist(
        dft[col], 16, density=True, alpha=1, color="#E02C70",
        edgecolor="#FFFFFF", linewidth=1.0
    )
    ax[ii].tick_params(axis="x", which="major", direction="in", labelsize=6)
    ax[ii].tick_params(axis="x", which="minor", direction="in", labelsize=6)
    ax[ii].tick_params(axis="y", which="major", direction="in", labelsize=6)
    ax[ii].tick_params(axis="y", which="minor", direction="in", labelsize=6)
    ax[ii].xaxis.set_ticks_position("none")
    ax[ii].yaxis.set_ticks_position("none")
    ax[ii].grid(True)
    ax[ii].set_axisbelow(True)

plt.show()
```

We are now in a position to create our pipelines. The first pipeline supports a logistic regression classifier. We initialize a `ColumnTransformer` instance, which gives us the ability to define separate preprocessing steps for different groups of columns (in our case, categorical vs. continuous). As the logistic regression classifier doesn’t support categorical features, we one-hot encode them. In addition, since the logistic regression classifier relies on gradient descent to estimate coefficients, continuous features are scaled using `RobustScaler` to help with convergence, and missing values are imputed using `IterativeImputer`. For the classifier, we use the elasticnet penalty, which is a blend of lasso and ridge penalties. We’ll determine the optimal weighting via hyperparameter search.

```
from sklearn.linear_model import LogisticRegression

# Data pre-processing for LogisticRegression model.
lr = LogisticRegression(
    penalty="elasticnet", solver="saga", max_iter=5000
)

continuous_transformer1 = Pipeline(steps=[
    ("imputer", IterativeImputer()),
    ("scaler",  RobustScaler())
])

categorical_transformer1 = Pipeline(steps=[
    ("onehot", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="error"))
])

preprocessor1 = ColumnTransformer(transformers=[
    ("continuous",  continuous_transformer1, continuous),
    ("categorical", categorical_transformer1, categorical)
], remainder="drop")

pipeline1 = Pipeline(steps=[
    ("preprocessor", preprocessor1),
    ("classifier", lr)
]).set_output(transform="pandas")
```

Notice that `set_output` is affixed to `pipeline1` by specifying `transform="pandas"`. This was added in scikit-learn version 1.2, and allows intermediate and final datasets to be represented as Pandas DataFrames instead of Numpy arrays. I’ve found this to be particularly convenient, especially when inspecting the results of a transformation.

A different set of preprocessing steps is carried out for the `HistGradientBoostingClassifier` instance, which is functionally equivalent to lightgbm. Since `HistGradientBoostingClassifier` supports categorical features, it isn’t necessary to one-hot encode: we pass a list of columns that should be treated as nominal categorical features to the `categorical_features` parameter. Coming out of `ColumnTransformer`, categorical features are renamed with a leading `categorical__`, so it is easy to identify which columns to pass. As before, `IterativeImputer` is used to impute missing continuous values. Within `categorical_transformer2`, we use `OrdinalEncoder` to convert non-numeric categories to integers, which can then be processed by `HistGradientBoostingClassifier`. Since `HistGradientBoostingClassifier` doesn’t rely on gradient descent, it isn’t necessary to include `RobustScaler` in `continuous_transformer2`.
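To see what `OrdinalEncoder` does in isolation, here is a small sketch on a toy nominal feature (the category values are illustrative; codes are assigned in sorted category order):

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy nominal feature; categories are encoded in lexicographic order.
data = [["private"], ["gov"], ["other"], ["private"]]

enc = OrdinalEncoder()
codes = enc.fit_transform(data)

print(list(enc.categories_[0]))
print(codes.ravel())
```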

```
from sklearn.ensemble import HistGradientBoostingClassifier

# Data pre-processing for HistGradientBoostingClassifier model. Uses OrdinalEncoder
# instead of OneHotEncoder since categorical features are supported.
gb = HistGradientBoostingClassifier(
    categorical_features=[f"categorical__{ii}" for ii in categorical]
)

continuous_transformer2 = Pipeline(steps=[
    ("imputer", IterativeImputer())
])

categorical_transformer2 = Pipeline(steps=[
    ("encoder", OrdinalEncoder())
])

preprocessor2 = ColumnTransformer(transformers=[
    ("continuous",  continuous_transformer2, continuous),
    ("categorical", categorical_transformer2, categorical),
], remainder="drop")

pipeline2 = Pipeline(steps=[
    ("preprocessor", preprocessor2),
    ("classifier", gb)
]).set_output(transform="pandas")
```

Instead of using `GridSearchCV`, we leverage `RandomizedSearchCV`. `GridSearchCV` exhaustively evaluates a multi-dimensional grid of hyperparameters, whereas `RandomizedSearchCV` draws a fixed number of candidates from pre-specified distributions. For our logistic regression classifier, we sample uniformly from [0, 1] for `l1_ratio` and from [0, 10] for the regularization parameter `C`.
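Note that `scipy.stats.uniform(loc, scale)` is uniform on [loc, loc + scale], not [loc, scale]. A quick check of the distributions being sampled (sample size and seed are arbitrary):

```python
from scipy.stats import uniform

# Distributions matching param_grid1: l1_ratio ~ U[0, 1], C ~ U[0, 10].
l1_dist = uniform(loc=0, scale=1)
C_dist = uniform(loc=0, scale=10)

l1_samples = l1_dist.rvs(size=1000, random_state=516)
C_samples = C_dist.rvs(size=1000, random_state=516)

print(float(l1_samples.min()), float(l1_samples.max()))
print(float(C_samples.min()), float(C_samples.max()))
```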

```
from scipy.stats import uniform

RANDOM_STATE = 516
verbosity = 3
n_iter = 3
scoring = "accuracy"
cv = 5

param_grid1 = {
    "classifier__l1_ratio": uniform(loc=0, scale=1),
    "classifier__C": uniform(loc=0, scale=10)
}

mdl1 = RandomizedSearchCV(
    pipeline1, param_grid1, scoring=scoring, cv=cv, verbose=verbosity,
    random_state=RANDOM_STATE, n_iter=n_iter
)

mdl1.fit(dft.drop("income", axis=1), yt)

print(f"\nbest parameters: {mdl1.best_params_}")

# Get holdout scores for each fold to compare against other model.
best_rank1 = np.argmin(mdl1.cv_results_["rank_test_score"])
best_mdl_cv_scores1 = [
    mdl1.cv_results_[f"split{ii}_test_score"][best_rank1] for ii in range(cv)
]

dfv["ypred1"] = mdl1.predict_proba(dfv.drop("income", axis=1))[:, 1]
dfv["yhat1"] = dfv["ypred1"].map(lambda v: 1 if v >= .50 else 0)

mdl1_acc = accuracy_score(dfv["income"], dfv["yhat1"])
mdl1_precision = precision_score(dfv["income"], dfv["yhat1"])
mdl1_recall = recall_score(dfv["income"], dfv["yhat1"])

print(f"\nmdl1_acc      : {mdl1_acc}")
print(f"mdl1_precision: {mdl1_precision}")
print(f"mdl1_recall   : {mdl1_recall}")
```

```
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915;, score=0.841 total time= 2.3s
[CV 2/5] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915;, score=0.840 total time= 1.7s
[CV 3/5] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915;, score=0.843 total time= 2.6s
[CV 4/5] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915;, score=0.843 total time= 1.8s
[CV 5/5] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915;, score=0.850 total time= 1.7s
[CV 1/5] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359;, score=0.841 total time= 1.6s
[CV 2/5] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359;, score=0.840 total time= 1.5s
[CV 3/5] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359;, score=0.843 total time= 3.4s
[CV 4/5] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359;, score=0.843 total time= 1.6s
[CV 5/5] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359;, score=0.850 total time= 1.3s
[CV 1/5] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002;, score=0.841 total time= 1.7s
[CV 2/5] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002;, score=0.840 total time= 1.7s
[CV 3/5] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002;, score=0.843 total time= 2.3s
[CV 4/5] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002;, score=0.843 total time= 1.9s
[CV 5/5] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002;, score=0.850 total time= 1.7s
best parameters: {'classifier__C': 1.115284252761577, 'classifier__l1_ratio': 0.5667878644753359}
mdl1_acc : 0.8435964624959057
mdl1_precision: 0.7184801381692574
mdl1_recall : 0.5694729637234771
```

We proceed analogously for `HistGradientBoostingClassifier`, but sample from a different set of hyperparameters.

```
RANDOM_STATE = 516
scoring = "accuracy"
verbosity = 3
n_iter = 3
cv = 5

param_grid2 = {
    "classifier__max_iter": [100, 250, 500],
    "classifier__min_samples_leaf": [10, 20, 50, 100],
    "classifier__l2_regularization": uniform(loc=0, scale=1000),
    "classifier__learning_rate": [.01, .05, .1, .25, .5],
    "classifier__max_leaf_nodes": [None, 20, 31, 40, 50]
}

mdl2 = RandomizedSearchCV(
    pipeline2, param_grid2, scoring=scoring, cv=cv, verbose=verbosity,
    random_state=RANDOM_STATE, n_iter=n_iter
)

mdl2.fit(dft.drop("income", axis=1), yt)

print(f"\nbest parameters: {mdl2.best_params_}")

# Get holdout scores for each fold to compare against other model.
best_rank2 = np.argmin(mdl2.cv_results_["rank_test_score"])
best_mdl_cv_scores2 = [
    mdl2.cv_results_[f"split{ii}_test_score"][best_rank2] for ii in range(cv)
]

dfv["ypred2"] = mdl2.predict_proba(dfv.drop("income", axis=1))[:, 1]
dfv["yhat2"] = dfv["ypred2"].map(lambda v: 1 if v >= .50 else 0)

mdl2_acc = accuracy_score(dfv["income"], dfv["yhat2"])
mdl2_precision = precision_score(dfv["income"], dfv["yhat2"])
mdl2_recall = recall_score(dfv["income"], dfv["yhat2"])

print(f"\nmdl2_acc      : {mdl2_acc}")
print(f"mdl2_precision: {mdl2_precision}")
print(f"mdl2_recall   : {mdl2_recall}")
```

```
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END classifier__l2_regularization=811.5660497752214, classifier__learning_rate=0.25, classifier__max_iter=500, classifier__max_leaf_nodes=None, classifier__min_samples_leaf=50;, score=0.846 total time= 1.1s
[CV 2/5] END classifier__l2_regularization=811.5660497752214, classifier__learning_rate=0.25, classifier__max_iter=500, classifier__max_leaf_nodes=None, classifier__min_samples_leaf=50;, score=0.848 total time= 1.5s
[CV 3/5] END classifier__l2_regularization=811.5660497752214, classifier__learning_rate=0.25, classifier__max_iter=500, classifier__max_leaf_nodes=None, classifier__min_samples_leaf=50;, score=0.849 total time= 1.1s
[CV 4/5] END classifier__l2_regularization=811.5660497752214, classifier__learning_rate=0.25, classifier__max_iter=500, classifier__max_leaf_nodes=None, classifier__min_samples_leaf=50;, score=0.849 total time= 0.9s
[CV 5/5] END classifier__l2_regularization=811.5660497752214, classifier__learning_rate=0.25, classifier__max_iter=500, classifier__max_leaf_nodes=None, classifier__min_samples_leaf=50;, score=0.852 total time= 1.0s
[CV 1/5] END classifier__l2_regularization=138.5495352566758, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=50, classifier__min_samples_leaf=10;, score=0.845 total time= 0.6s
[CV 2/5] END classifier__l2_regularization=138.5495352566758, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=50, classifier__min_samples_leaf=10;, score=0.846 total time= 0.6s
[CV 3/5] END classifier__l2_regularization=138.5495352566758, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=50, classifier__min_samples_leaf=10;, score=0.849 total time= 0.6s
[CV 4/5] END classifier__l2_regularization=138.5495352566758, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=50, classifier__min_samples_leaf=10;, score=0.849 total time= 0.6s
[CV 5/5] END classifier__l2_regularization=138.5495352566758, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=50, classifier__min_samples_leaf=10;, score=0.854 total time= 0.6s
[CV 1/5] END classifier__l2_regularization=189.1538419557398, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=20, classifier__min_samples_leaf=20;, score=0.846 total time= 0.4s
[CV 2/5] END classifier__l2_regularization=189.1538419557398, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=20, classifier__min_samples_leaf=20;, score=0.848 total time= 0.5s
[CV 3/5] END classifier__l2_regularization=189.1538419557398, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=20, classifier__min_samples_leaf=20;, score=0.852 total time= 0.4s
[CV 4/5] END classifier__l2_regularization=189.1538419557398, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=20, classifier__min_samples_leaf=20;, score=0.850 total time= 0.4s
[CV 5/5] END classifier__l2_regularization=189.1538419557398, classifier__learning_rate=0.1, classifier__max_iter=100, classifier__max_leaf_nodes=20, classifier__min_samples_leaf=20;, score=0.855 total time= 0.6s
best parameters: {'classifier__l2_regularization': 189.1538419557398, 'classifier__learning_rate': 0.1, 'classifier__max_iter': 100, 'classifier__max_leaf_nodes': 20, 'classifier__min_samples_leaf': 20}
mdl2_acc : 0.8524402227317392
mdl2_precision: 0.7348993288590604
mdl2_recall : 0.5995893223819302
```

Notice that `mdl1` and `mdl2` expose `predict`/`predict_proba` methods, so we can generate predictions using the resulting `RandomizedSearchCV` object directly, and it will dispatch the call to the estimator associated with the hyperparameters that maximize accuracy.
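To illustrate this dispatch behavior, here is a minimal sketch on synthetic data (not the income pipeline used above; the dataset, estimator, and parameter grid are placeholders chosen for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0]},
    n_iter=3, cv=3, random_state=0
)
search.fit(X, y)

# Predictions from the search object match those of its refit best estimator.
p1 = search.predict_proba(X)
p2 = search.best_estimator_.predict_proba(X)
assert np.allclose(p1, p2)
```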

Precision, recall and accuracy are close for each model. We can check if the difference between models is significant using the approach outlined here:

```
from scipy.stats import t

def corrected_std(differences, n_train, n_test):
    """
    Corrects standard deviation using Nadeau and Bengio's approach.
    """
    kr = len(differences)
    corrected_var = np.var(differences, ddof=1) * (1 / kr + n_test / n_train)
    return np.sqrt(corrected_var)

def compute_corrected_ttest(differences, df, n_train, n_test):
    """
    Computes right-tailed paired t-test with corrected variance.
    """
    mean = np.mean(differences)
    std = corrected_std(differences, n_train, n_test)
    t_stat = mean / std
    p_val = t.sf(np.abs(t_stat), df)  # right-tailed t-test
    return t_stat, p_val

differences = np.asarray(best_mdl_cv_scores2) - np.asarray(best_mdl_cv_scores1)
n = len(differences)
df = n - 1
n_train = 4 * (dft.shape[0] // 5)
n_test = dft.shape[0] // 5
t_stat, p_val = compute_corrected_ttest(differences, df, n_train, n_test)
print(f"t-value: {t_stat:.3f}")
print(f"p-value: {p_val:.3f}")
```

```
t-value: 5.231
p-value: 0.003
```

At a significance level of α = 0.05, the test concludes that the HistGradientBoostingClassifier is significantly better than the LogisticRegression model.

Finally, we can overlay the histograms of model predictions by true class:

```
color0 = "#E02C70"
color1 = "#6EA1D5"
alpha = .65
n_bins = 12
fig, ax = plt.subplots(1, 2, figsize=(9, 4), tight_layout=True)
# LogisticRegression.
yy0 = dfv[dfv.income==0]["ypred1"].values
yy1 = dfv[dfv.income==1]["ypred1"].values
ax[0].set_title(
f"LogisticRegression (acc={mdl1_acc:.3f})",
fontsize=9, weight="normal"
)
ax[0].hist(
yy0, n_bins, density=True, alpha=alpha, color=color0,
edgecolor="#000000", linewidth=1.0, label="<=50K"
)
ax[0].hist(
yy1, n_bins, density=True, alpha=alpha, color=color1,
edgecolor="#000000", linewidth=1.0, label=">50K"
)
ax[0].tick_params(axis="x", which="major", direction='in', labelsize=6)
ax[0].tick_params(axis="x", which="minor", direction='in', labelsize=6)
ax[0].tick_params(axis="y", which="major", direction='in', labelsize=6)
ax[0].tick_params(axis="y", which="minor", direction='in', labelsize=6)
ax[0].xaxis.set_ticks_position("none")
ax[0].yaxis.set_ticks_position("none")
ax[0].set_yticklabels([])
ax[0].legend(loc="best", fancybox=True, framealpha=1, fontsize="small")
ax[0].grid(True)
ax[0].set_axisbelow(True)
# HistGradientBoostingClassifier.
yy0 = dfv[dfv.income==0]["ypred2"].values
yy1 = dfv[dfv.income==1]["ypred2"].values
ax[1].set_title(
f"HistGradientBoostingClassifier (acc={mdl2_acc:.3f})",
fontsize=9, weight="normal"
)
ax[1].hist(
yy0, n_bins, density=True, alpha=alpha, color=color0,
edgecolor="#000000", linewidth=1.0, label="<=50K"
)
ax[1].hist(
yy1, n_bins, density=True, alpha=alpha, color=color1,
edgecolor="#000000", linewidth=1.0, label=">50K"
)
ax[1].tick_params(axis="x", which="major", direction='in', labelsize=6)
ax[1].tick_params(axis="x", which="minor", direction='in', labelsize=6)
ax[1].tick_params(axis="y", which="major", direction='in', labelsize=6)
ax[1].tick_params(axis="y", which="minor", direction='in', labelsize=6)
ax[1].xaxis.set_ticks_position("none")
ax[1].yaxis.set_ticks_position("none")
ax[1].set_yticklabels([])
ax[1].legend(loc="best", fancybox=True, framealpha=1, fontsize="small")
ax[1].grid(True)
ax[1].set_axisbelow(True)
plt.show()
```

The `multiprocessing.Pool` class provides access to a pool of worker processes to which jobs can be submitted. It supports asynchronous results with timeouts and callbacks, and has a parallel map implementation. Leveraging `multiprocessing.Pool` is straightforward. To demonstrate, we will solve Project Euler Problem #14 in a distributed fashion. The problem states:
```
The following iterative sequence is defined for the set of positive integers:
n -> n/2 (n is even)
n -> 3n + 1 (n is odd)
Using the rule above and starting with 13, we generate the following sequence:
13 -> 40 -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
It can be seen that this sequence (starting at 13 and finishing at 1) contains
10 terms. Although it has not been proved yet (Collatz Problem), it is thought
that all starting numbers finish at 1.
Which starting number, under one million, produces the longest chain?
NOTE: Once the chain starts the terms are allowed to go above one million.
```

To start, we define two functions: `collatz_test` and `chain_length`. `collatz_test` contains the logic that either divides the input by 2 (if even) or multiplies it by 3 and adds 1 (if odd). `chain_length` returns a tuple consisting of the initial integer along with the length of the collatz chain:

```
def collatz_test(n):
    """
    If n is even, return (n/2), else return (3n+1).
    """
    return n // 2 if n % 2 == 0 else 3 * n + 1

def chain_length(n):
    """
    Return the length of the collatz chain along
    with the input value n.
    """
    if n <= 0:
        return None
    cntr, tstint = 0, n
    while tstint != 1:
        cntr += 1
        tstint = collatz_test(tstint)
    return n, cntr
```

One thing to keep in mind when using the multiprocessing library is that instances of the Pool and Process classes should only be initialized after the `if __name__ == "__main__"` guard. As a consequence, when the spawn start method is used (the default on Windows), Pool cannot be called from within an interactive Python session.

Next we present our declarations from earlier along with the distributed logic, which sets up the parallel dispatch of `chain_length`:

```
"""
Parallel solution to Project Euler Problem # 14.
"""
import multiprocessing
def collatz_test(n):
"""
If n is even, return (n/2), else return (3n+1).
"""
return((n / 2) if n % 2 == 0 else (3 * n + 1))
def chain_length(n):
"""
Return the length of the collatz chain along
with the input value `n`.
"""
if n <= 0:
return(None)
cntr, tstint = 0, n
while tstint!=1:
cntr+=1
tstint = collatz_test(tstint)
return(n, cntr)
if __name__ == "__main__":
# Initialize array of values to test.
arr = multiprocessing.Array('L', range(1, 1000000))
pool = multiprocessing.Pool()
all_lengths = pool.map(chain_length, arr, chunksize=1000)
pool.close()
pool.join()
# Search for longest chain.
longest_chain = max((i for i in all_lengths), key=lambda x: x[1])
```

We first declare our sequence of test values as a `multiprocessing.Array`, which prevents the same 1,000,000-element sequence from being replicated in each process (only an issue on Windows, where there is no fork system call). Instead, the array is created once, and all processes have access to it. The "L" typecode comes from the array module in the Python Standard Library and indicates the datatype of the elements contained in the sequence (unsigned long). We initialize the Pool instance, then call its map method, which works like the builtin map function, only in parallel. Within `pool.map`, we set `chunksize=1000` due to the following commentary in multiprocessing's documentation:

> For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

Upon execution, we find that 837,799 produces the longest sequence, with a length of 524. By distributing the tasks across four cores, the script completes in 25 seconds, whereas the sequential implementation requires approximately 55 seconds. This disparity only grows as the range of evaluation increases from 1M to 5M or 10M.
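For comparison, the sequential version can be sketched as follows (same helpers as above; note that I use integer division to keep the arithmetic in ints, and scan a reduced range here for brevity):

```python
def collatz_test(n):
    # If n is even, return n/2, else 3n + 1 (integer arithmetic).
    return n // 2 if n % 2 == 0 else 3 * n + 1

def chain_length(n):
    # Return (n, number of Collatz steps required to reach 1).
    cntr, tstint = 0, n
    while tstint != 1:
        cntr += 1
        tstint = collatz_test(tstint)
    return n, cntr

# Over the full range(1, 1_000_000) this reproduces the 837,799 result
# discussed above; the range is shortened here to keep the demo quick.
longest_chain = max(
    (chain_length(n) for n in range(1, 100_000)), key=lambda t: t[1]
)
print(longest_chain)
```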

For more information on the multiprocessing module, be sure to check out the documentation. In addition, the Python Standard Library includes the `concurrent.futures` module, which exposes an even higher-level interface facilitating both thread- and process-based parallelism via Executor objects.
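As a sketch, the same solution expressed with `concurrent.futures` might look like this (same helper functions assumed; range reduced for brevity):

```python
from concurrent.futures import ProcessPoolExecutor

def collatz_test(n):
    # If n is even, return n/2, else 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1

def chain_length(n):
    # Return (n, number of Collatz steps required to reach 1).
    cntr, tstint = 0, n
    while tstint != 1:
        cntr += 1
        tstint = collatz_test(tstint)
    return n, cntr

if __name__ == "__main__":
    # Executor.map mirrors pool.map, including the chunksize parameter.
    with ProcessPoolExecutor() as executor:
        results = executor.map(chain_length, range(1, 100_000), chunksize=1000)
        longest = max(results, key=lambda t: t[1])
    print(longest)
```

Like `multiprocessing.Pool`, the executor must be created under the `if __name__ == "__main__"` guard when the spawn start method is in effect.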

Assume a set of observations representing ground-up property losses in dollars:

`19999 19974 5051 7179 34416 56840 4420 6558`

Our task is to fit a Weibull distribution to the loss data in order to produce a severity curve. The Weibull density is given by:

$$f(x; k, \lambda) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$$

where:

- $k > 0$ is the shape parameter,
- $\lambda > 0$ is the scale parameter,
- $x \geq 0$.

The expected value of the Weibull distribution is

$$\mathrm{E}[X] = \lambda\,\Gamma(1 + 1/k)$$

and the median is given by

$$\tilde{x} = \lambda\,(\ln 2)^{1/k}.$$

The variance is given by

$$\mathrm{Var}[X] = \lambda^2\left[\Gamma(1 + 2/k) - \Gamma(1 + 1/k)^2\right].$$

In the expressions above, $\Gamma$ represents the gamma function, a generalization of the factorial expressed as

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt.$$

The fitdistrplus library calculates parameter estimates given data and a hypothesized distribution. The `fitdist` function takes an optional `start` parameter, which represents initial parameter values associated with the hypothesized distribution. The Weibull distribution has two parameters that require estimation: $k$, the shape parameter, and $\lambda$, the scale parameter. How can we come up with reasonable initial estimates of $k$ and $\lambda$?

First, notice that if the mean is divided by the median, $\lambda$ cancels, leaving a function of $k$ only. Setting what remains equal to the ratio of the empirical mean to the empirical median yields an expression we can use to obtain an initial estimate of $k$:

$$\frac{\bar{x}}{\tilde{x}} = \frac{\Gamma(1 + 1/k)}{(\ln 2)^{1/k}}$$

As a consequence of the gamma function in the right-hand-side numerator, we cannot solve for $k$ using direct methods. In R, we use `uniroot` to estimate roots of univariate functions numerically. In the code that follows, we implement a closure which returns a function of `k`, its sole argument, which `uniroot` will use to zero in on a solution:

```
# Example solving for Weibull shape parameter using uniroot.
lossData = c(19999, 19974, 5051, 7179, 34416, 56840, 4420, 6558)

# Calling shapeFunc returns a function, which can then be used by uniroot
# to find a solution.
shapeFunc = function(v) {
  # Compute ratio of empirical mean to median.
  ratio = mean(v) / median(v)
  function(k) {
    return((gamma(1 + (1 / k)) / (log(2)^(1 / k))) - ratio)
  }
}

# Evaluate shapeFunc. ff is a function which takes a single argument `k`.
ff = shapeFunc(lossData)
```

The body of `shapeFunc` is a straightforward implementation of our ratio expression above; the only difference is that the expression is set equal to 0 by subtracting the ratio (1.421915) from both sides. We now have our function `ff` and need an interval over which to search for a solution. The call to `uniroot` is made below:

`shape = uniroot(ff, interval=c(.Machine$double.eps, max(lossData)))$root`

Since $k$ is strictly greater than 0, we set the search interval lower bound to `.Machine$double.eps`, which represents the smallest positive floating-point value $\epsilon$ such that $1 + \epsilon \neq 1$. The root returned by `uniroot` serves as our initial shape estimate, $k_0$. To determine an initial estimate for the scale parameter, we can use the fact that

$$\mathrm{E}[X] = \lambda\,\Gamma(1 + 1/k),$$

resulting in $\lambda_0 = \bar{x}\,/\,\Gamma(1 + 1/k_0)$.
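As an aside, the same initial-estimate procedure can be sketched in Python, with `scipy.optimize.brentq` playing the role of `uniroot` (the bracket endpoints here are my own choice, not part of the R workflow):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gamma

loss = np.array([19999, 19974, 5051, 7179, 34416, 56840, 4420, 6558])
ratio = loss.mean() / np.median(loss)  # empirical mean / median

# Mean-to-median ratio as a function of the shape parameter k, set to zero.
f = lambda k: gamma(1 + 1 / k) / np.log(2) ** (1 / k) - ratio

k0 = brentq(f, 0.1, 100)                  # initial shape estimate
scale0 = loss.mean() / gamma(1 + 1 / k0)  # initial scale estimate
print(k0, scale0)
```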

With our hypothesized distribution and initial parameters, obtaining maximum likelihood estimates is straightforward. The initial parameter estimation code is included again for convenience:

```
# Computing maximum likelihood estimates using fitdistrplus.
library("fitdistrplus")

lossData = c(19999, 19974, 5051, 7179, 34416, 56840, 4420, 6558)

shapeFunc = function(v) {
  # Compute ratio of empirical mean to median.
  ratio = mean(v) / median(v)
  function(k) {
    return((gamma(1 + (1 / k)) / (log(2)^(1 / k))) - ratio)
  }
}

# Evaluate shapeFunc. ff is a function which takes a single argument `k`.
ff = shapeFunc(lossData)

# Initial shape parameter estimate.
shape0 = uniroot(ff, interval=c(.Machine$double.eps, max(lossData)))$root

# Initial scale parameter estimate.
scale0 = mean(lossData) / gamma(1 + (1 / shape0))

# Obtain mle parameter estimates.
mleFit = fitdistrplus::fitdist(
  lossData, distr="weibull", method="mle", start=list(shape=shape0, scale=scale0)
)
```

Accessing `mleFit`'s `estimate` attribute, the parameter estimates are:

```
> mleFit$estimate
shape scale
1.177033 20525.761478
```

These are close to our initial starting parameter estimates.

R provides general-purpose optimization via the `optim` function. In the example that follows, I'll demonstrate how to find the shape and scale parameters of a Gamma distribution via maximum likelihood using synthetically generated data. The available parameters for `optim` are given below:

```
optim(par, fn, gr=NULL, ...,
      method=c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"),
      lower=-Inf, upper=Inf,
      control=list(), hessian=FALSE)

par
    Initial values for the parameters to be optimized over.

fn
    A function to be minimized (or maximized), with first argument the vector of
    parameters over which minimization is to take place. It should return a
    scalar result.

gr
    A function to return the gradient for the "BFGS", "CG" and "L-BFGS-B"
    methods. If it is NULL, a finite-difference approximation will be used.
    For the "SANN" method it specifies a function to generate a new candidate
    point. If it is NULL a default Gaussian Markov kernel is used.

...
    Further arguments to be passed to fn and gr.

method
    The method to be used. See 'Details'. Can be abbreviated.

lower, upper
    Bounds on the variables for the "L-BFGS-B" method, or bounds in which to
    search for method "Brent".

control
    A list of control parameters. See 'Details'.

hessian
    Logical. Should a numerically differentiated Hessian matrix be returned?
```

`optim` is passed a function to be minimized or maximized (`fn`), whose first argument is a vector of the parameters we hope to find, along with a starting point from which the selected optimization routine begins searching (`par`), with one starting value per parameter.

The gamma distribution can be parameterized in a number of ways, but for this demonstration, we use the shape-scale parameterization, with density given by:

$$f(x; \alpha, \theta) = \frac{x^{\alpha - 1} e^{-x/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}, \quad x > 0,$$

where $\alpha$ represents the shape parameter and $\theta$ the scale parameter. The Gamma distribution has variance proportional to the mean, which differs from the normal distribution's constant variance across observations. Specifically, for a gamma distributed random variable $X$, the mean and variance are:

$$\mathrm{E}[X] = \alpha\theta, \qquad \mathrm{Var}[X] = \alpha\theta^2.$$

A useful feature of the gamma distribution is that it has a constant coefficient of variation:

$$c_v = \frac{\sqrt{\alpha\theta^2}}{\alpha\theta} = \frac{1}{\sqrt{\alpha}}.$$

Thus, for any values fit to a gamma distribution with parameters $\alpha$ and $\theta$, the ratio of the standard deviation to the mean will always equal $1/\sqrt{\alpha}$.
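A quick numerical check of the constant coefficient of variation, sketched in Python (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a, theta = 5, 10000

# Draw a large gamma sample and compare its empirical CV to 1/sqrt(alpha).
x = rng.gamma(shape=a, scale=theta, size=200_000)
cv = x.std() / x.mean()
print(cv, 1 / np.sqrt(a))  # the two values should nearly agree
```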

Maximum likelihood estimation (MLE) is a technique used to estimate parameters for a candidate model or distributional form. Essentially, MLE aims to identify which parameter(s) make the observed data most likely, given the specified model. In practice, we do not know the values of the proposed model parameters but we do know the data. We use the likelihood function to observe how the function changes for different parameter values while holding the data fixed. This can be used to judge which parameter values lead to greater likelihood of the sample occurring.

The joint density of $n$ independently distributed observations $x_1, \ldots, x_n$ is given by:

$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

When this expression is interpreted as a function of the unknown parameter(s) $\theta$ given known data $x_1, \ldots, x_n$, we obtain the likelihood function:

$$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Solving the likelihood equation can be difficult. This can be partially alleviated by logging the likelihood expression, which results in the loglikelihood. When solving the likelihood, it is often necessary to take the derivative of the expression to find the optimum (although not all optimizers are gradient based). It is much more computationally stable and conceptually straightforward to take the derivative of an additive function of independent observations (the loglikelihood) than of a multiplicative function of independent observations (the likelihood). The loglikelihood is given by:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Referring back to the shape-scale parameterized gamma distribution, the joint density is represented as

$$L(\alpha, \theta) = \prod_{i=1}^{n} \frac{x_i^{\alpha - 1} e^{-x_i/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}$$

Taking the natural log of the joint density results in the loglikelihood for our proposed distributional form:

$$\ell(\alpha, \theta) = \sum_{i=1}^{n} \log\left(\frac{x_i^{\alpha - 1} e^{-x_i/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}\right)$$

Expanding the loglikelihood and focusing on a single observation yields:

$$(\alpha - 1)\log x_i - \frac{x_i}{\theta} - \alpha \log\theta - \log\Gamma(\alpha)$$

Now considering all $n$ observations, after a bit of rearranging and simplification we obtain

$$\ell(\alpha, \theta) = (\alpha - 1)\sum_{i=1}^{n}\log x_i - \frac{1}{\theta}\sum_{i=1}^{n} x_i - n\alpha\log\theta - n\log\Gamma(\alpha)$$

Next, the partial derivatives w.r.t. $\alpha$ and $\theta$ would be obtained and set equal to zero in order to find the solutions directly or in an iterative fashion. But when using `optim`, we only need to go as far as producing an expression for the joint loglikelihood over the set of observations.
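The final loglikelihood expression can also be checked numerically; here's a sketch in Python that minimizes the negative loglikelihood with scipy (`gammaln` is the log-gamma function, used for numerical stability; the data, seed, and starting values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(516)
x = rng.gamma(shape=5, scale=10000, size=50)
n, s, l = x.size, x.sum(), np.log(x).sum()

def negll(v):
    # Negative of (a-1)*sum(log x) - sum(x)/theta - n*a*log(theta) - n*log(Gamma(a)).
    a, theta = v
    return -((a - 1) * l - s / theta - n * a * np.log(theta) - n * gammaln(a))

# Moment-based starting values, as in the text.
a0 = (x.mean() / x.std(ddof=1)) ** 2
res = minimize(negll, x0=[a0, x.mean() / a0], method="Nelder-Mead")
print(res.x)  # estimates should be in the vicinity of the true (5, 10000)
```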

The function eventually passed along to `optim` will be implemented as a closure, a function which returns another function. A trivial example would be one in which an outer function wraps an inner function that computes the product of two numbers, which is then raised to a power specified as an argument accepted by the outer function. You'd be correct to think this problem could just as easily be solved with a 3-parameter function. However, you'd then be required to pass the third argument, representing the power to which the product should be raised, every time the function is called. An advantage of closures is that any parameters associated with the outer function are global variables from the perspective of the inner function, which is useful in many scenarios.

Next we implement a closure as described in the previous paragraph: the inner function takes two numeric values, `a` and `b`, and returns the product `a * b`. The outer function takes a single numeric argument, `pow`, which determines the power to which the product should be raised. The final value returned by the function will be `(a * b)^pow`:

```
# Closure in which a power is specified upon initialization. Then, for each subsequent
# invocation, the product a * b is raised to pow.
prodPow = function(pow) {
  function(a, b) {
    return((a * b)^pow)
  }
}
```

To initialize the closure, we call `prodPow`, passing only the argument for `pow`. Inspecting the result shows that it is a function:

```
> func = prodPow(pow=3)
> class(func)
[1] "function"
> names(formals(func))
[1] "a" "b"
```

The object bound to `func` is a function with two arguments, `a` and `b`. If we invoke `func` with arguments for `a` and `b`, we expect a numeric value representing the product raised to the 3rd power:

```
> func(2, 4)
[1] 512
> func(7, 3)
[1] 9261
```

We next implement the gamma loglikelihood as a closure. The reason for doing this has to do with the way `optim` works: the function passed to `optim` should accept a single vector of the parameters of interest, which in our case is a vector representing $(\alpha, \theta)$. By implementing the function as a closure, we can reference the set of observations (`vals`) from the scope of the inner function without having to pass the data as an argument to the function being optimized. This is very useful since, as our final expression for the loglikelihood shows, $\sum_i x_i$ and $\sum_i \log x_i$ are needed at each evaluation. It is far more efficient to compute these quantities once and reference them from the inner function as necessary, without requiring recomputation.

The Gamma loglikelihood closure is provided below:

```
# Loglikelihood for the Gamma distribution. The outer function accepts the set of observations.
# The inner function takes a vector of parameters (alpha, theta), and returns the loglikelihood.
gammaLLOuter = function(vals) {
  # -------------------------------------------------------------
  # Return a function representing the Gamma loglikelihood.     |
  # `n` represents the number of observations in vals.           |
  # `s` represents the sum of vals with NAs removed.             |
  # `l` represents the sum of log(vals) with NAs removed.        |
  # -------------------------------------------------------------
  n = length(vals)
  s = sum(vals, na.rm=TRUE)
  l = sum(log(vals), na.rm=TRUE)
  function(v) {
    # ---------------------------------------------------------
    # v represents a length-2 vector of parameters alpha      |
    # (shape) and theta (scale).                               |
    # Returns the loglikelihood.                               |
    # ---------------------------------------------------------
    a = v[1]
    theta = v[2]
    return((a - 1) * l - (theta^-1) * s - n * a * log(theta) - n * log(gamma(a)))
  }
}
```

Next, 50 random observations are generated from a gamma distribution with Gaussian noise, using $\alpha = 5$ and $\theta = 10000$. We know beforehand that the data originate from a gamma distribution with noise; we want to verify that the optimization routine can recover these parameters given only the data:

```
a = 5 # shape
theta = 10000 # scale
n = 50
vals = rgamma(n=n, shape=a, scale=theta) + rnorm(n=n, 0, theta / 100)
```

Referring back to the call signature for `optim`, we need to provide initial parameter values for the optimization routine (the `par` argument). Using 0 can sometimes suffice, but since $\alpha > 0$ and $\theta > 0$, different values must be provided. We can leverage the gamma distribution's constant coefficient of variation to get an initial estimate of the shape parameter, $\alpha_0 = (\bar{x}/s)^2$, then use $\alpha_0$ along with the empirical mean to back out an initial scale parameter estimate, $\theta_0 = \bar{x}/\alpha_0$. In R, this is accomplished as follows:

```
# Get initial estimates for a, theta.
valsMean = mean(vals, na.rm=TRUE) # empirical mean of vals.
valsStd = sd(vals, na.rm=TRUE) # empirical standard deviation of vals.
a0 = (valsMean / valsStd)^2 # initial shape parameter estimate.
theta0 = valsMean / a0 # initial scale parameter estimate.
```

We have everything we need to compute maximum likelihood estimates. Two final points: by default, `optim` minimizes the function passed into it. Since we're looking to maximize the loglikelihood, we need to pass `control=list(fnscale=-1)` to ensure `optim` returns the maximum. Second, there are a number of different optimizers from which to choose; a generally good choice is "BFGS", which I use here.

We initialize `gammaLLOuter` with `vals`, then pass the initial parameter values along with `method="BFGS"` and `control=list(fnscale=-1)` into `optim`:

```
# Generating maximum likelihood estimates from data assumed to follow a gamma distribution.
options(scipen=-9999)
set.seed(516)

gammaLLOuter = function(vals) {
  # -------------------------------------------------------------
  # Return a function representing the Gamma loglikelihood.     |
  # `n` represents the number of observations in vals.           |
  # `s` represents the sum of vals with NAs removed.             |
  # `l` represents the sum of log(vals) with NAs removed.        |
  # -------------------------------------------------------------
  n = length(vals)
  s = sum(vals, na.rm=TRUE)
  l = sum(log(vals), na.rm=TRUE)
  function(v) {
    # ---------------------------------------------------------
    # `v` represents a length-2 vector of parameters alpha    |
    # (shape) and theta (scale).                               |
    # Returns the loglikelihood.                               |
    # ---------------------------------------------------------
    a = v[1]
    theta = v[2]
    return((a - 1) * l - (theta^-1) * s - n * a * log(theta) - n * log(gamma(a)))
  }
}

# Generate 50 random Gamma observations with Gaussian noise.
a = 5         # shape
theta = 10000 # scale
n = 50
vals = rgamma(n=n, shape=a, scale=theta) + rnorm(n=n, 0, theta / 100)

# Initialize gammaLL.
gammaLL = gammaLLOuter(vals)

# Determine initial estimates for shape and scale.
valsMean = mean(vals, na.rm=TRUE)
valsStd = sd(vals, na.rm=TRUE)
a0 = (valsMean / valsStd)^2
theta0 = valsMean / a0
paramsInit = c(a0, theta0)

# Dispatch arguments to optim.
paramsMLE = optim(
  par=paramsInit, fn=gammaLL, method="BFGS", control=list(fnscale=-1)
)
```

`optim` returns a list with a number of elements. We are concerned primarily with three:

- `convergence`: Specifies whether the optimization converged to a solution; 0 means it converged, any other value means it did not.
- `par`: A vector of parameter estimates.
- `value`: The maximized loglikelihood.

If you plug the values for $\alpha$ and $\theta$ from `paramsMLE$par` into the loglikelihood, you would get a value equal to `paramsMLE$value`. Checking the values returned by the call to `optim` above, we have:

```
> paramsMLE$convergence
[1] 0
> paramsMLE$par
[1] 5.234362 9515.717485
> paramsMLE$value
[1] -566.4171
```

Let's compare our estimates against those produced by `fitdistrplus`. We need to scale our data by 100, otherwise `fitdist` throws an error:

```
> fitdistrplus::fitdist(vals/100, "gamma")
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
shape 5.35617640 0.983440540
rate 0.01077757 0.002056568
```

If we take the reciprocal of the estimate for "rate" and multiply by 100 (the amount we divided `vals` by in the call to `fitdist`), we get a scale estimate of 9278.53. Note that dividing by 100 works for the Gamma density, since it is a scale family distribution, but this will not hold for distributions generally. In summary:

```
shape scale
Our estimate : 5.234362 9515.71748
fitdistrplus estimate: 5.356176 9278.52939
% difference : 2.22743% 2.55631%
```

We find the estimates for each parameter are within 3% of one another.

- Retrieve HTML from a webpage.

- Parse the HTML and extract all references to embedded PDF links.

- For each PDF link, download the document and save it locally.

Plenty of 3rd-party libraries can query and retrieve a webpage’s links. However, the purpose of this post is to highlight the fact that by combining elements of the Python Standard Library with the Requests package, we can roll our own, and learn something while we’re at it.

This is straightforward using requests. Let’s query the Singular Value Decomposition page on Wikipedia:

```
import requests
url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
# instruct requests object to return HTML as plain text.
html = requests.get(url).text
html[:50]
```

`'<!DOCTYPE html>\n<html class="client-nojs vector-fe'`

The HTML has been obtained. Next we’ll identify and extract references to all embedded PDF links.

A cursory review of the HTML from webpages with embedded PDF links revealed the following:

- Valid PDF URLs will almost always be embedded within an `href` attribute.

- Valid PDF URLs will in all cases be preceded by `http` or `https`.

- Valid PDF URLs will in all cases be enclosed by a trailing `>`.

- Valid PDF URLs cannot contain whitespace.

After some trial and error, the following regular expression was found to have acceptable performance for our test cases:

`"(?=href=).*(https?://\S+.pdf).*?>"`

An excellent site to practice building and testing regular expressions is Pythex. The app allows you to construct regular expressions and determine how they match against the target text. I find myself using it on a regular basis.

Here is the logic associated with steps I and II combined:

```
import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
# Instruct requests object to return HTML as plain text.
html = requests.get(url).text
# Search html and compile PDF URLs in a list.
pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)
for link in pdf_links:
    print(link)
```

```
http://www.wou.edu/~beavers/Talks/Willamette1106.pdf
http://www.alterlab.org/research/highlights/pone.0078913_Highlight.pdf
http://math.mit.edu/~edelman/publications/distribution_of_a_scaled.pdf
http://files.grouplens.org/papers/webKDD00.pdf
https://stanford.edu/~rezab/papers/dimsum.pdf
http://faculty.missouri.edu/uhlmannj/UC-SIMAX-Final.pdf
```

Note that the regular expression is prefixed with an `r` when passed to `re.findall`. This instructs Python to interpret what follows as a raw string, so backslashes are not treated as escape characters. `re.findall` returns a list of matches extracted from the source text; in our case, a list of URLs referencing the PDF documents found on the page.

For the last step, we need to retrieve the documents associated with our collection of links and write them to file locally. We introduce another module from the Python Standard Library, `os.path`, which facilitates partitioning absolute filepaths into components so we can retain filenames when saving documents to file.

For example, consider the following url:

`https://stanford.edu/~rezab/papers/dimsum.pdf`

To capture *dimsum.pdf*, we pass the absolute URL to `os.path.split`, which returns a tuple containing everything preceding the filename as the first element, along with the filename and extension as the second element:

```
import os
url = "https://stanford.edu/~rezab/papers/dimsum.pdf"
os.path.split(url)
```

`('https://stanford.edu/~rezab/papers', 'dimsum.pdf')`

This will be used to preserve the filename of the documents we save locally.
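For what it's worth, the same filename extraction can also be accomplished with `pathlib` and `urllib.parse`, both likewise in the Python Standard Library:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "https://stanford.edu/~rezab/papers/dimsum.pdf"
# urlparse isolates the path component; PurePosixPath exposes its final segment.
name = PurePosixPath(urlparse(url).path).name
print(name)  # dimsum.pdf
```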

This step differs from the initial HTML retrieval in that we need to request the content as bytes, not text. By calling `requests.get(url).content`, we access the raw bytes that comprise the PDF, then write those bytes to file. Here's the logic for the third and final step:

```
import os
import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
html = requests.get(url).text
pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

# Request PDF content and write to file for all entries.
for pdf in pdf_links:
    # Get filename from url for naming file locally.
    pdf_name = os.path.split(pdf)[1].strip()
    try:
        r = requests.get(pdf).content
        with open(pdf_name, "wb") as f:
            f.write(r)
    except Exception:
        print(f"Unable to download {pdf_name}.")
    else:
        print(f"Saved {pdf_name}.")
```

```
Saved Willamette1106.pdf.
Saved pone.0078913_Highlight.pdf.
Saved distribution_of_a_scaled.pdf.
Saved webKDD00.pdf.
Saved dimsum.pdf.
Unable to download UC-SIMAX-Final.pdf.
```

Notice that we surround `with open(pdf_name, "wb")...` in a try-except block: this handles situations that would prevent our code from downloading a document, such as broken redirects or invalid links.

All in, we end up with 16 lines of code excluding comments. We next present the full implementation of the PDF Harvester after a little reorganization:

```
import os.path
import re
import requests

def pdf_harvester(url):
    """
    Retrieve URL's html and extract references to PDFs. Download PDFs,
    writing to current working directory.

    Parameters
    ----------
    url: str
        Web address to search for PDF links.
    """
    html = requests.get(url).text
    pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)
    for pdf in pdf_links:
        # Get filename from url for naming file locally.
        pdf_name = os.path.split(pdf)[1].strip()
        try:
            r = requests.get(pdf).content
            with open(pdf_name, "wb") as f:
                f.write(r)
        except Exception:
            print(f"Unable to download {pdf_name}.")
        else:
            print(f"Saved {pdf_name}.")
```

Fortunately it isn’t necessary to perform a full recalculation of mean and variance when accounting for new observations. Recall that for a sequence of observations $x_1, \ldots, x_n$, the sample mean and (population) variance are given by:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}_n\right)^2.$$

Suppose a new observation $x_{n+1}$ becomes available. To calculate the updated mean and variance in light of this new observation without requiring full recalculation, we can use the following:

$$\bar{x}_{n+1} = \frac{n\bar{x}_n + x_{n+1}}{n+1}, \qquad \sigma_{n+1}^2 = \frac{n\sigma_n^2}{n+1} + \frac{\left(x_{n+1} - \bar{x}_{n+1}\right)^2}{n}.$$

Consider the following values:

$$x = (1154,\ 717,\ 958,\ 1476,\ 889,\ 1414,\ 1364,\ 1047).$$

The mean and population variance for these observations are $\bar{x}_8 = 1127.375$ and $\sigma_8^2 = 65096.48$.

A new value, $x_9 = 1251$, becomes available. Full recalculation of the mean yields $\bar{x}_9 = 1141.111$. The mean calculated using the online update results in $(8 \times 1127.375 + 1251) / 9 = 1141.111$, confirming agreement between the two approaches.

Note that the variance returned using the online update formula is the population variance. In order to return the updated unbiased sample variance, we need to multiply the variance returned by the online update formula by $(n+1)/n$, where $n$ represents the length of the original array excluding the new observation. Thus, the updated sample variance after accounting for the new value is:

$$s_9^2 = \frac{9}{8} \times 59372.99 = 66794.61.$$

A straightforward implementation in Python to handle online mean and variance updates, incorporating Bessel’s correction to return the unbiased sample variance, is provided below:

```
import numpy as np

def online_mean(mean_init, n, new_obs):
    """
    Return updated mean in light of new observation without
    full recalculation.
    """
    return (n * mean_init + new_obs) / (n + 1)

def online_variance(var_init, mean_new, n, new_obs):
    """
    Return updated variance in light of new observation without
    full recalculation. Includes Bessel's correction to return
    unbiased sample variance.
    """
    return ((n + 1) / n) * (((n * var_init) / (n + 1)) + (((new_obs - mean_new)**2) / n))

a0 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047])
a1 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251])

# Original mean and population variance.
mean0 = a0.mean()                                            # 1127.38
variance0 = a0.var()                                         # 65096.48

# Full recalculation of mean and sample variance with new observation.
mean1 = a1.mean()                                            # 1141.11
variance1 = a1.var(ddof=1)                                   # 66794.61

# Online update of mean and variance with bias correction.
mean2 = online_mean(mean0, a0.size, 1251)                    # 1141.11
variance2 = online_variance(variance0, mean2, a0.size, 1251) # 66794.61

print(f"Full recalculation mean    : {mean1:,.5f}")
print(f"Full recalculation variance: {variance1:,.5f}")
print(f"Online calculation mean    : {mean2:,.5f}")
print(f"Online calculation variance: {variance2:,.5f}")
```

```
Full recalculation mean : 1,141.11111
Full recalculation variance: 66,794.61111
Online calculation mean : 1,141.11111
Online calculation variance: 66,794.61111
```
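The update formulas above handle one new observation at a time. For streams with many updates, a common alternative (not used in this article) is Welford's algorithm, which maintains a running mean and sum of squared deviations in a single pass and tends to be more numerically stable. A minimal sketch:

```python
def welford(values):
    """Single-pass mean and unbiased sample variance (Welford's algorithm)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        # m2 accumulates the sum of squared deviations from the running mean.
        m2 += delta * (x - mean)
    return mean, m2 / (n - 1)

obs = [1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251]
mean, var = welford(obs)
print(f"{mean:,.5f}, {var:,.5f}")  # matches the full-recalculation results above
```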

In this article, a technique to estimate total reserves accounting for correlation between lines of business is introduced. We focus on workers compensation, commercial auto, product liability and other liability data sourced from the CAS Loss Reserves Database. We’ll demonstrate how to account for correlation between lines, and show how changes to the correlation assumption affect the total reserve estimate.

The data can be downloaded from the CAS website directly using data.table’s `fread`. We perform some preprocessing to normalize column names and assign like columns the same name in each table.

```
library("data.table")
library("foreach")
library("ChainLadder")
library("ggplot2")

DF1 = fread("https://www.casact.org/sites/default/files/2021-04/wkcomp_pos.csv")   # workers compensation
DF2 = fread("https://www.casact.org/sites/default/files/2021-04/comauto_pos.csv")  # commercial auto
DF3 = fread("https://www.casact.org/sites/default/files/2021-04/prodliab_pos.csv") # product liability
DF4 = fread("https://www.casact.org/sites/default/files/2021-04/othliab_pos.csv")  # other liability

names(DF1) = tolower(names(DF1))
names(DF2) = tolower(names(DF2))
names(DF3) = tolower(names(DF3))
names(DF4) = tolower(names(DF4))

setnames(
    DF1, c("incurloss_d", "bulkloss_d", "accidentyear", "developmentlag"),
    c("incurloss", "bulkloss", "origin", "dev")
)
setnames(
    DF2, c("incurloss_c", "bulkloss_c", "accidentyear", "developmentlag"),
    c("incurloss", "bulkloss", "origin", "dev")
)
setnames(
    DF3, c("incurloss_r1", "bulkloss_r1", "accidentyear", "developmentlag"),
    c("incurloss", "bulkloss", "origin", "dev")
)
setnames(
    DF4, c("incurloss_h1", "bulkloss_h1", "accidentyear", "developmentlag"),
    c("incurloss", "bulkloss", "origin", "dev")
)

dfList = list(wkcomp=DF1, comauto=DF2, prodliab=DF3, othliab=DF4)
```

Each dataset contains loss data indexed by `grcode`, a company identifier. We need to find a company with losses in DF1, DF2, DF3 and DF4. This can be accomplished with the following:

```
grcodes = Reduce(
    function(v1, v2) intersect(v1, v2),
    lapply(dfList, function(DF) unique(DF[,grcode]))
)
grnamesDF = unique(
    DF1[grcode %in% grcodes, .(grcode, grname)]
)
setorderv(grnamesDF, c("grcode"), c(1))
```

Which yields:

```
grcode grname
1: 337 California Cas Grp
2: 715 West Bend Mut Ins Grp
3: 1066 Island Ins Cos Grp
4: 1538 Farmers Automobile Grp
5: 1767 State Farm Mut Grp
6: 2143 Farmers Alliance Mut & Affiliates
7: 5185 Grinnell Mut Grp
8: 7080 New Jersey Manufacturers Grp
9: 9466 Lumber Ins Cos
10: 10048 Hyundai Marine & Fire Ins Co Ltd
11: 11126 Yasuda Fire & Marine Ins Co Of Amer
12: 13439 Partners Mut Ins Co
13: 13528 Brotherhood Mut Ins Co
14: 13587 Chicago Mut Ins Co
15: 14044 Goodville Mut Cas Co
16: 14257 IMT Ins Co Mut
17: 14370 Lebanon Mut Ins Co
18: 14508 Michigan Millers Mut Ins Co
19: 15024 Preferred Mut Ins Co
20: 18791 Virginia Mut Ins Co
21: 23663 National American Ins Co
22: 26433 Harco Natl Ins Co
23: 28258 Continental Natl Ind Co
24: 35408 Sirius Amer Ins Co
25: 38300 Samsung Fire & Marine Ins Co Ltd
26: 38733 Alaska Nat Ins Co
27: 44091 Dowa Fire & Marine Ins Co Ltd Us Br
grcode grname
```

Let’s go with **1767**, which represents State Farm. In the next code block, we subset each data.table to only those records with `grcode==1767`, then create runoff triangles for each line of business:

```
GRCODE = 1767
grList = lapply(dfList, function(DF) DF[grcode==GRCODE,])

triData = foreach(
    ii=1:length(grList), .inorder=TRUE, .errorhandling="stop",
    .final=function(ll) setNames(ll, names(grList))
) %do% {
    currLOB = names(grList)[[ii]]
    DFInit = grList[[ii]]
    DF = DFInit[dev<=max(origin) - origin + 1,]
    DF[,value:=incurloss - bulkloss]
    as.triangle(DF[,.(origin, dev, value)])
}
```

Triangles for each lob are presented below:

```
> triData
$wkcomp
dev
origin 1 2 3 4 5 6 7 8 9 10
1988 50758 94150 106804 113733 120148 123986 127650 128622 129791 130625
1989 65423 110204 131509 140383 147011 150266 152264 155017 155979 NA
1990 68719 141501 165694 181789 189149 194315 196897 201780 NA NA
1991 82409 165813 199016 213698 222994 229774 232413 NA NA NA
1992 97138 183451 208163 220275 227404 234320 NA NA NA NA
1993 106508 167688 195533 212777 220063 NA NA NA NA NA
1994 93736 141067 160848 173457 NA NA NA NA NA NA
1995 81309 116739 135447 NA NA NA NA NA NA NA
1996 66073 92365 NA NA NA NA NA NA NA NA
1997 56003 NA NA NA NA NA NA NA NA NA
$prodliab
dev
origin 1 2 3 4 5 6 7 8 9 10
1988 696 737 881 1002 1379 1451 1741 1814 1818 1850
1989 428 351 617 718 761 788 797 802 804 NA
1990 57 77 92 135 197 235 250 263 NA NA
1991 23 121 140 141 172 189 190 NA NA NA
1992 48 109 101 107 131 130 NA NA NA NA
1993 119 133 150 211 278 NA NA NA NA NA
1994 21 60 59 100 NA NA NA NA NA NA
1995 57 53 54 NA NA NA NA NA NA NA
1996 10 11 NA NA NA NA NA NA NA NA
1997 20 NA NA NA NA NA NA NA NA NA
$comauto
dev
origin 1 2 3 4 5 6 7 8 9 10
1988 110231 152848 168137 180062 186150 188142 189352 191307 191867 194000
1989 121678 158218 176744 188127 192966 196104 199178 199655 200949 NA
1990 123376 175239 201955 214113 219988 223308 225841 226373 NA NA
1991 117457 162601 183338 198607 203398 205870 206957 NA NA NA
1992 124611 166788 189771 201033 206826 212361 NA NA NA NA
1993 137902 185952 209357 220428 226541 NA NA NA NA NA
1994 150582 194528 216205 231077 NA NA NA NA NA NA
1995 150511 194730 215037 NA NA NA NA NA NA NA
1996 142301 184283 NA NA NA NA NA NA NA NA
1997 143970 NA NA NA NA NA NA NA NA NA
$othliab
dev
origin 1 2 3 4 5 6 7 8 9 10
1988 22417 58806 77536 103003 112976 120070 124641 126954 127444 128036
1989 24740 55381 76543 97608 113777 124341 126171 128952 132618 NA
1990 19432 63891 94243 119678 124938 129990 133964 133949 NA NA
1991 25821 84453 136275 159204 169820 172446 181744 NA NA NA
1992 38377 98045 138205 154554 171701 177467 NA NA NA NA
1993 53001 150478 196273 224523 232681 NA NA NA NA NA
1994 50848 127767 187297 233255 NA NA NA NA NA NA
1995 59140 149648 215701 NA NA NA NA NA NA NA
1996 71637 159561 NA NA NA NA NA NA NA NA
1997 82937 NA NA NA NA NA NA NA NA NA
```

Next, for each triangle, we call the `BootChainLadder` function (available in the ChainLadder library), running 5000 iterations and retaining only the total IBNR samples from each invocation (discarding IBNR simulations by accident year). We replace simulated values less than 1 with 1:

```
ibnrSimsDF = foreach(
    ii=1:length(triData), .inorder=TRUE, .errorhandling="stop",
    .combine="cbind.data.frame", .final=setDT
) %do% {
    tri = triData[[ii]]
    bcl = BootChainLadder(tri, R=5000, process.distr="gamma")
    lobSims = bcl$IBNR.Totals
    lobSims[lobSims<1] = 1
    lobSims
}

# Set names of each column in ibnrSimsDF to associated LOB.
names(ibnrSimsDF) = names(triData)
```

Inspecting the first 6 records of `ibnrSimsDF` yields:

```
> head(ibnrSimsDF)
wkcomp prodliab comauto othliab
1: 213282.5 309.9531 207524.1 836339.0
2: 185281.3 453.1356 228032.9 876116.3
3: 178462.7 263.7076 246759.9 633045.5
4: 204928.1 169.7184 246953.0 641145.2
5: 168382.3 408.6908 213764.4 717701.9
6: 158486.8 194.0509 227606.5 711641.2
```

`ibnrSimsDF` contains 5000 rows, with the value in each row representing the total simulated reserve need across all accident years for the lob in question. It is possible to produce histograms of the simulated total IBNR using ggplot2. The code that follows generates a faceted quad-plot of the sampling distribution of total IBNR for each lob, with a vertical dashed red line marking the location of the distribution mean. We first transform `ibnrSimsDF` into a ggplot2-compatible format, `ggDF`:

```
# Create faceted quad-plot representing sampling distribution of total IBNR.
ggDF = data.table::melt(
    ibnrSimsDF, measure.vars=names(ibnrSimsDF), value.name="ibnr",
    variable.name="lob", variable.factor=FALSE
)

# Add mean.ibnr for histogram overlay.
ggDF[,mean.ibnr:=mean(ibnr, na.rm=TRUE), by="lob"]

ggplot(ggDF, aes(x=ibnr)) +
    geom_histogram(bins=35, color="black", fill="white") +
    geom_vline(
        aes(xintercept=mean.ibnr), color="red", linetype="dashed", size=1
    ) +
    theme(
        axis.title.y=element_blank(), axis.text.y=element_blank(),
        axis.ticks.y=element_blank(), axis.title.x=element_blank()
    ) +
    scale_x_continuous(
        labels=function(x) format(x, big.mark=",", scientific=FALSE)
    ) +
    facet_wrap(~lob, scales="free")
```

Running the code above produces the following exhibit:

If all we are trying to do is determine the expected value of the reserve run-off, we can calculate the expected value for each lob separately and add all the expectations together. However, if we are trying to quantify a value other than the mean (such as the 75th percentile), we cannot simply sum across lines of business. If we do so, we will overstate the aggregate reserve need. The only time the sum of each lob’s 75th percentile would be appropriate for the aggregate reserve indication is when all lines are fully correlated with each other, which is highly unlikely.
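This can be illustrated numerically. A quick sketch in Python, with made-up gamma distributions standing in for two independent reserve distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent simulated reserve distributions (illustrative parameters).
lob_a = rng.gamma(shape=4.0, scale=50_000.0, size=100_000)
lob_b = rng.gamma(shape=4.0, scale=50_000.0, size=100_000)

sum_of_percentiles = np.percentile(lob_a, 75) + np.percentile(lob_b, 75)
percentile_of_sum = np.percentile(lob_a + lob_b, 75)
# With less-than-full correlation, summing per-line 75th percentiles
# overstates the aggregate 75th percentile.
print(sum_of_percentiles > percentile_of_sum)  # True
```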

To account for correlation between lobs, we rely on the rank correlation methodology described in *Two Approaches to Calculating Correlated Reserve Indications Across Multiple Lines of Business*. The methodology is carried out through a two-step process:

In the first step, a stochastic reserving technique is used to generate N possible reserve runoffs from each data triangle being analyzed (this is what we have in `ibnrSimsDF`). In the second step, a correlation matrix is specified, where individual elements of the correlation matrix describe the association between different pairs of lobs. With the correlation matrix $\Sigma$, carry out the following steps:

1. Compute the Cholesky decomposition of $\Sigma$, that is, find the unique lower triangular matrix $C$ such that $CC^{T} = \Sigma$.

2. Compute $Z = (z_1, \ldots, z_n)$, a vector whose components are $n$ independent standard normal variates (for our example, $n = 4$).

3. Let $X = CZ$. Since $Z$ represents independent draws from the standard normal distribution, the mean vector is $0$; correlated random draws are therefore obtained by matrix multiplying $C$ with $Z$.
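These Cholesky steps can be sketched in Python/NumPy (the article's implementation below is in R; the 0.50 off-diagonal correlation here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative correlation matrix with 0.50 off-diagonal correlation.
sigma = np.full((4, 4), 0.5)
np.fill_diagonal(sigma, 1.0)

C = np.linalg.cholesky(sigma)       # step 1: lower triangular, C @ C.T == sigma
Z = rng.standard_normal((5000, 4))  # step 2: independent standard normal draws
X = Z @ C.T                         # step 3: each row is a correlated draw
# The empirical correlation of X should be close to sigma.
print(np.round(np.corrcoef(X, rowvar=False), 2))
```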

For the correlation matrix, we’ll initially assume no correlation between lobs (all off-diagonal elements = 0). Later we’ll compare estimated reserve need as a function of changing correlation.

The correlation matrix can be initialized as follows:

```
sigma = matrix(
    c(c(1, 0, 0, 0),
      c(0, 1, 0, 0),
      c(0, 0, 1, 0),
      c(0, 0, 0, 1)),
    nrow=4,
    dimnames=list(names(ibnrSimsDF), names(ibnrSimsDF))
)
```

Which looks like the following:

```
wkcomp prodliab comauto othliab
wkcomp 1 0 0 0
prodliab 0 1 0 0
comauto 0 0 1 0
othliab 0 0 0 1
```

The next code block implements steps 1-3:

```
A = t(chol(sigma))
Z = matrix(rnorm(ncol(A) * 5000), nrow=5000, ncol=ncol(A))
X = Z %*% A
```

Checking out the first few records of `X` yields:

```
> head(X)
wkcomp prodliab comauto othliab
[1,] 0.2256225 0.66492692 0.8239846 -1.5497317
[2,] 0.1101583 0.60652201 -0.9572046 -0.5200923
[3,] -0.5961369 0.13732270 -1.5355783 1.0622470
[4,] 0.6863108 -1.02719480 0.1086142 -0.4941367
[5,] 1.3918400 0.09805293 0.3412182 -0.1409186
[6,] 0.5547157 1.57012447 0.1263973 0.7135559
```

For each column in X, we need to obtain the rank of each correlated random draw. This can be accomplished by running:

```
rankX = foreach(ii=1:ncol(X), .combine="cbind") %do% { rank(X[,ii]) }
colnames(rankX) = colnames(sigma)
```

Inspecting the first few records from `rankX` yields:

```
> head(rankX)
wkcomp prodliab comauto othliab
[1,] 2971 3758 3975 293
[2,] 2751 3658 856 1493
[3,] 1393 2759 288 4335
[4,] 3785 782 2746 1544
[5,] 4619 2684 3178 2221
[6,] 3569 4687 2784 3866
```

To prepare for the rank correlation step, we need to order our total IBNR simulations from smallest to largest within each lob column:

```
# Order total bootstrapped ibnr samples from smallest to largest.
orderedSimsDF = foreach(
    ii=1:length(names(ibnrSimsDF)), .combine="cbind.data.frame",
    .final=setDT
) %do% {
    currLOB = names(ibnrSimsDF)[[ii]]
    sort(ibnrSimsDF[[currLOB]])
}
names(orderedSimsDF) = names(ibnrSimsDF)
```

Then, for each rank in `rankX`, we look up the corresponding position-wise element from orderedSimsDF. This ensures that the rank order correlations between lobs are the same as the correlations imposed on the random normal samples. For example, the first row of rankX is:

```
wkcomp prodliab comauto othliab
2971 3758 3975 293
```

Then using orderedSimsDF, we lookup the 2971st element under wkcomp, the 3758th element under prodliab, the 3975th element under comauto and the 293rd element under othliab. This can be accomplished as follows:

```
# Get correlated IBNR samples.
corrIBNR = foreach(
    ii=1:length(names(orderedSimsDF)), .combine="cbind"
) %do% {
    currLOB = names(orderedSimsDF)[[ii]]
    lobIndx = rankX[,currLOB]
    orderedSimsDF[lobIndx, get(currLOB)]
}
colnames(corrIBNR) = names(orderedSimsDF)
```
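The rank-matching lookup can be illustrated in Python for a single lob, with made-up gamma samples standing in for bootstrapped IBNR:

```python
import numpy as np

rng = np.random.default_rng(2)
sims = rng.gamma(4.0, 50_000.0, size=5000)  # bootstrapped IBNR samples for one lob
z = rng.standard_normal(5000)               # correlated normal draws for the same lob

# Ranks analogous to R's rank(): 1 = smallest value.
ranks = z.argsort().argsort() + 1
ordered = np.sort(sims)
# Position-wise lookup: the i-th correlated sample gets the ordered
# IBNR value at the rank of the i-th normal draw, so the output is a
# permutation of sims sharing the rank order of z.
correlated = ordered[ranks - 1]
```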

Finally, we sum the correlated samples across lobs, resulting in a vector of values representing the aggregate reserve distribution:

`totalIBNR = apply(corrIBNR, MARGIN=1, sum)`

Percentiles of the aggregate IBNR distribution can be obtained by calling:

```
> quantile(totalIBNR, c(.01, .25, .50, .75, .99))
1% 25% 50% 75% 99%
962340.6 1107900.3 1171348.8 1241553.0 1428743.0
```

We’ve re-run the procedure described in the previous section for 5 different correlation matrices, assuming 0, .25, .50, .75 and .99 off-diagonal correlation, and combined the results into a single data.table `qqDF`. I then estimated the 1st, 25th, 50th, 75th and 99th percentile of each aggregate reserve distribution and created an exhibit comparing the distribution of each as a function of percentile. The code used to create the exhibit is given below:

```
# ------------------------------------------------------------------------
# Assume qqDF contains 1st, 25th, 50th, 75th and 99th percentile of the
# aggregate IBNR distribution for off-diagonal correlation values of
# 0, .25, .50, .75 and .99. The first few records of qqDF look like:
#
#    rho    x         y
# 1:  0% 0.00  871243.8
# 2:  0% 0.25 1107900.3
# 3:  0% 0.50 1171348.8
# 4:  0% 0.75 1241553.0
# 5:  0% 0.99 1428743.0
#
# ------------------------------------------------------------------------
ggplot(qqDF, aes(x=x, y=y)) + geom_line(aes(color=rho), size=.5) +
    scale_y_continuous(
        labels=function(x) format(x, big.mark=",", scientific=FALSE)
    ) +
    theme(
        axis.title.y=element_blank(), axis.title.x=element_blank()
    ) + xlim(0, 1) +
    ggtitle("Aggregate reserve distribution by correlation")
```

Which produces the following:

By changing to `xlim(.50, 1)` in the code above, we can zoom in on the right-hand side of the distribution:

We see that around the .50 mark on the x-axis, there is essentially no difference between the 0% and 25% off-diagonal correlation assumptions. However, as we move right along the x-axis, the discrepancy grows. When x=.99, the difference in the estimated total needed reserve is ~50,000, which represents approximately a 5% difference.

A few take-aways:

If all we are trying to do is determine the expected value of the reserve run-off, we can calculate the expected value for each lob separately then add the expectations together.

If the aim is to quantify a value other than the mean, such as the 75th percentile, we cannot simply sum across the lines of business, as this is akin to assuming full correlation between lines of business, which is unlikely and will overstate the aggregate reserve need.

Off-diagonal correlation values do not need to be the same, but the matrix does need to be symmetric (identical values at $(i, j)$ and $(j, i)$).

Assume we have a lookup table DF1, which maps loss thresholds to groups:

```
library("data.table")

DF1 = data.table(
    group=c("A", "B", "C", "D", "E"),
    loss=c(0, 10000, 20000, 30000, 40000),
    stringsAsFactors=FALSE
)
```

Reviewing the contents of DF1:

```
group loss
1: A 0
2: B 10000
3: C 20000
4: D 30000
5: E 40000
```

Let’s also assume we have a table of claims, DF2:

```
DF2 = data.table(
claimno=paste0("000", 10:20),
loss=c(8101, 15700, 64140, 20000, 11655, 31850, 23680, 41440, 16161, 77000, 4564),
stringsAsFactors=FALSE
)
```

Reviewing the contents of DF2:

```
claimno loss
1: 00010 8101
2: 00011 15700
3: 00012 64140
4: 00013 20000
5: 00014 11655
6: 00015 31850
7: 00016 23680
8: 00017 41440
9: 00018 16161
10: 00019 77000
11: 00020 4564
```

The goal is, for each claimno in DF2, to assign the corresponding group from `DF1` such that the loss threshold (from `DF1`) is the maximum value less than or equal to loss from DF2. For example, a loss amount of 15700 should be assigned to group B, since 10000 is the maximum loss threshold less than or equal to 15700.
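This "largest threshold less than or equal to loss" lookup is a general pattern; for intuition, here is how it might be sketched in Python with the standard library's `bisect` module (thresholds and groups taken from DF1):

```python
from bisect import bisect_right

thresholds = [0, 10000, 20000, 30000, 40000]
groups = ["A", "B", "C", "D", "E"]

def assign_group(loss):
    # bisect_right returns the insertion point to the right of any equal
    # values, so subtracting 1 lands on the largest threshold <= loss.
    return groups[bisect_right(thresholds, loss) - 1]

print(assign_group(15700))  # B
print(assign_group(20000))  # C
```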

The way many people first attack this problem is to use a deeply nested sequence of `ifelse` statements. Something akin to:

```
DF2[,
    group:=
        ifelse(loss>=0 & loss<10000, "A",
            ifelse(loss>=10000 & loss<20000, "B",
                ifelse(loss>=20000 & loss<30000, "C",
                    ifelse(loss>=30000 & loss<40000, "D",
                        "E"
                    )
                )
            )
        )
]
```

Which results in:

```
claimno loss group
1: 00010 8101 A
2: 00011 15700 B
3: 00012 64140 E
4: 00013 20000 C
5: 00014 11655 B
6: 00015 31850 D
7: 00016 23680 C
8: 00017 41440 E
9: 00018 16161 B
10: 00019 77000 E
11: 00020 4564 A
```

This solution works, but is suboptimal for a number of reasons. First, it’s overly verbose and brittle: if the number of groups grows from 5 to 10 or 15, it becomes necessary to extend the `ifelse` nesting by the number of new groups. One should always try to avoid writing code that requires updates in proportion to the size of the input. Perhaps more importantly, this approach has poor runtime performance, which we demonstrate later on.

Performing a rolling join in data.table is straightforward: simply add the `roll` modifier within the join expression, specifying either `+Inf` (or `TRUE`) or `-Inf` to indicate the direction in which to roll. Sticking with the same DF1 and DF2 from before, we create a new table DF, which represents DF2 along with the target group associated with each claimno:

`DF = DF1[DF2, on="loss", roll=+Inf]`

Resulting in:

```
group loss claimno
1: A 8101 00010
2: B 15700 00011
3: E 64140 00012
4: C 20000 00013
5: B 11655 00014
6: D 31850 00015
7: C 23680 00016
8: E 41440 00017
9: B 16161 00018
10: E 77000 00019
11: A 4564 00020
```

Note that in this example, the key column loss is the same in both tables. If this were not the case (say, threshold in DF1 vs. loss in DF2), one would specify `on=c("threshold"="loss")`.

For completeness, let’s see what happens if we switch to `roll=-Inf` (assume we changed loss to threshold in DF1):

`DF = DF1[DF2, on=c("threshold"="loss"), roll=-Inf]`

Resulting in:

```
group threshold claimno
1: B 8101 00010
2: C 15700 00011
3: <NA> 64140 00012
4: C 20000 00013
5: C 11655 00014
6: E 31850 00015
7: D 23680 00016
8: <NA> 41440 00017
9: C 16161 00018
10: <NA> 77000 00019
11: B 4564 00020
```

Any value in excess of the largest threshold gets set to `NA`, and all other claims get assigned the minimum threshold from DF1 greater than or equal to loss in DF2.
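The `roll=-Inf` behavior can be mimicked in the earlier Python sketch style by switching to `bisect_left`, which finds the smallest threshold greater than or equal to loss:

```python
from bisect import bisect_left

thresholds = [0, 10000, 20000, 30000, 40000]
groups = ["A", "B", "C", "D", "E"]

def assign_group_backward(loss):
    # Smallest threshold >= loss; None when loss exceeds the largest threshold.
    i = bisect_left(thresholds, loss)
    return groups[i] if i < len(thresholds) else None

print(assign_group_backward(8101))   # B
print(assign_group_backward(64140))  # None
```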

To demonstrate the difference in performance, we generate a new DF2 with one million random claim amounts, then compare the naive nested-`ifelse` implementation against the rolling join implementation. To make profiling with microbenchmark easier, each implementation is encapsulated within a separate function:

```
library("data.table")
library("microbenchmark")

DF1 = data.table(
    group=c("A", "B", "C", "D", "E"),
    loss=c(0, 10000, 20000, 30000, 40000),
    stringsAsFactors=FALSE
)
DF2 = data.table(
    claimno=formatC(1:1000000, format="d", width=7, flag="0"),
    loss=rgamma(n=1000000, shape=1, scale=25000),
    stringsAsFactors=FALSE
)

# Create copies to operate on for each implementation.
method1DF = data.table::copy(DF2)
method2DF = data.table::copy(DF2)

fmethod1 = function() {
    # First method: nested ifelse.
    method1DF[,
        group:=
            ifelse(loss>=0 & loss<10000, "A",
                ifelse(loss>=10000 & loss<20000, "B",
                    ifelse(loss>=20000 & loss<30000, "C",
                        ifelse(loss>=30000 & loss<40000, "D",
                            "E"
                        )
                    )
                )
            )
    ]
}

fmethod2 = function() {
    # Second method: rolling join.
    DF = DF1[method2DF, on=c("loss"), roll=+Inf]
}

# Run comparison 10 times.
microbenchmark(
    fmethod1(),
    fmethod2(),
    times=10
)
```

The results from microbenchmark are provided below:

```
Unit: milliseconds
expr min lq mean median uq max neval
fmethod1() 2116.3529 2212.5053 2518.0355 2558.7205 2779.6253 3061.8588 10
fmethod2() 494.8094 536.9963 622.7095 586.1551 677.1586 825.6939 10
```

In the worst case, the rolling join approach is almost 4 times faster, and as the number of records increases, so does the relative performance improvement between the two methods.

In what follows, we smooth loss development patterns using three techniques:

- standard polynomial regression
- cubic B-spline regression
- smoothing splines

The data will be a set of loss development factors (LDFs) associated with an unidentified line of business. Instead of smoothing LDF patterns directly, we first compute the cumulative loss development factors (CLDFs), then take the reciprocal to obtain percent-of-ultimate factors. Doing so will generally (though not always) result in factors that increase monotonically as a function of time. The code that follows prepares our data:

```
library("data.table")
library("ggplot2")
options(scipen=9999)

ldfs = c(
    2.85637, 1.58402, 1.37531, 1.3001, 1.21469, 1.28128, 1.15415, 1.09783, 1.09302,
    1.06395, 1.04992, 1.04659, 1.05164, 1.03117, 1.0236, 1.06338, 1.03234, 1.0172,
    1.01795, 1.01813, 1.01413, 1.00863, 1.01346, 1.00372, 1.00423, 1.00683, 1.04633,
    1.01796, 1.02279, 1.00629, 1.00205, 1.00316, 1.007, 1.02828, 1.00117, 1.00303,
    1.00055, 1.02272, 1.00678, 1.00152, 1.00013, 1.01347, 1, 1.00071, 1.00136
)

# Compute cumulative development factors.
cldfs = rev(cumprod(rev(ldfs)))

# Compute percent-of-ultimate factors.
pous = 1 / cldfs

DF = data.table(
    xinit=1:length(pous), ldf0=ldfs, cldf0=cldfs, y=pous,
    stringsAsFactors=FALSE
)

# Rescale `xinit` to fall between 0-1.
DF[,x:=seq(0, 1, length.out=nrow(DF))]
setcolorder(
    DF, c("xinit", "x", "y", "ldf0", "cldf0")
)
```
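The reversed-cumulative-product step is the heart of the preparation; a minimal Python sketch with made-up LDFs:

```python
import numpy as np

ldfs = np.array([2.0, 1.5, 1.2, 1.1])  # hypothetical age-to-age factors
cldfs = np.cumprod(ldfs[::-1])[::-1]   # age-to-ultimate (cumulative) factors
pous = 1.0 / cldfs                     # percent-of-ultimate factors
print(np.round(pous, 4))               # monotonically increasing toward 1.0
```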

Fields in `DF` are defined as follows:

- `xinit`: Original development period, 1 <= `xinit` <= 45.
- `x`: `xinit` rescaled to [0, 1].
- `y`: Percent-of-ultimate factors.
- `ldf0`: Original unsmoothed loss development factors.
- `cldf0`: Original unsmoothed cumulative loss development factors.

Inspecting our data yields:

```
xinit x y ldf0 cldf0
1: 1 0.00000000 0.03022475 2.85637 33.085464
2: 2 0.02272727 0.08633308 1.58402 11.583045
3: 3 0.04545455 0.13675333 1.37531 7.312436
4: 4 0.06818182 0.18807822 1.30010 5.316937
5: 5 0.09090909 0.24452049 1.21469 4.089637
6: 6 0.11363636 0.29701659 1.28128 3.366815
7: 7 0.13636364 0.38056142 1.15415 2.627697
8: 8 0.15909091 0.43922497 1.09783 2.276738
9: 9 0.18181818 0.48219434 1.09302 2.073853
10: 10 0.20454545 0.52704806 1.06395 1.897360
11: 11 0.22727273 0.56075279 1.04992 1.783317
12: 12 0.25000000 0.58874556 1.04659 1.698527
13: 13 0.27272727 0.61617522 1.05164 1.622915
14: 14 0.29545455 0.64799451 1.03117 1.543223
15: 15 0.31818182 0.66819250 1.02360 1.496575
16: 16 0.34090909 0.68396184 1.06338 1.462070
17: 17 0.36363636 0.72731134 1.03234 1.374927
18: 18 0.38636364 0.75083259 1.01720 1.331855
19: 19 0.40909091 0.76374691 1.01795 1.309334
20: 20 0.43181818 0.77745617 1.01813 1.286246
21: 21 0.45454545 0.79155145 1.01413 1.263342
22: 22 0.47727273 0.80273607 1.00863 1.245739
23: 23 0.50000000 0.80966368 1.01346 1.235081
24: 24 0.52272727 0.82056176 1.00372 1.218677
25: 25 0.54545455 0.82361425 1.00423 1.214161
26: 26 0.56818182 0.82709813 1.00683 1.209046
27: 27 0.59090909 0.83274721 1.04633 1.200845
28: 28 0.61363636 0.87132839 1.01796 1.147673
29: 29 0.63636364 0.88697745 1.02279 1.127424
30: 30 0.65909091 0.90719167 1.00629 1.102303
31: 31 0.68181818 0.91289790 1.00205 1.095413
32: 32 0.70454545 0.91476934 1.00316 1.093172
33: 33 0.72727273 0.91766001 1.00700 1.089728
34: 34 0.75000000 0.92408363 1.02828 1.082153
35: 35 0.77272727 0.95021672 1.00117 1.052392
36: 36 0.79545455 0.95132847 1.00303 1.051162
37: 37 0.81818182 0.95421100 1.00055 1.047986
38: 38 0.84090909 0.95473581 1.02272 1.047410
39: 39 0.86363636 0.97642741 1.00678 1.024142
40: 40 0.88636364 0.98304759 1.00152 1.017245
41: 41 0.90909091 0.98454182 1.00013 1.015701
42: 42 0.93181818 0.98466981 1.01347 1.015569
43: 43 0.95454545 0.99793331 1.00000 1.002071
44: 44 0.97727273 0.99793331 1.00071 1.002071
45: 45 1.00000000 0.99864185 1.00136 1.001360
xinit x y ldf0 cldf0
```

Polynomial regression is similar to standard linear regression, except the design matrix contains `x` raised to the desired power in each column. For example, assume we have an independent variable `x` given by:

`x = c(2, 4, 7, 5, 2)`

Instead of regressing a response `y` on `x` alone, polynomial regression fits `y` using the matrix `X`:

```
1 2 3
[1,] 2 4 8
[2,] 4 16 64
[3,] 7 49 343
[4,] 5 25 125
[5,] 2 4 8
```

Notice each column represents `x` raised to the power in the column header: the first column is $x$, the second $x^2$ and the third $x^3$. Creating the design matrix in R can be accomplished using the `poly` function. Next we create the design matrix `X` and fit a polynomial regression model of degree 3 to our data:

```
X = poly(DF$x, degree=3, raw=TRUE)
y = DF$y
# Combine design matrix with target response y (pous).
DF1 = setDT(cbind.data.frame(X, y))
# Call lm function. On RHS of formula, `.` specifies all columns in DF1 are to be used.
mdl = lm(y ~ ., data=DF1)
# Bind reference to fitted values as yhat1.
DF[,yhat1:=unname(predict(mdl))]
```
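The raw design matrix itself is easy to construct by hand; a Python sketch mirroring the 5-row example above:

```python
import numpy as np

x = np.array([2, 4, 7, 5, 2])
# Raw (non-orthogonal) polynomial design matrix of degree 3: columns x, x^2, x^3.
X = np.column_stack([x, x**2, x**3])
print(X[2])  # row for x = 7
```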

A visualization overlaying polynomial regression estimates with original percent-of-ultimate factors is presented below (this code will be reused for all exhibits that follow, with inputs updated as necessary):

```
exhibitTitle = "Polynomial Regression: Percent-of-Ultimate Modeled vs. Actual"

ggplot(DF) +
    geom_point(aes(x=xinit, y=y, color="Actual"), size=2) +
    geom_line(aes(x=xinit, y=yhat1, color="Predicted"), size=1.0) +
    guides(color=guide_legend(override.aes=list(shape=c(16, NA), linetype=c(0, 1)))) +
    scale_color_manual("", values=c("Actual"="#758585", "Predicted"="#E02C70")) +
    scale_x_continuous(breaks=seq(min(DF$xinit), max(DF$xinit), 2)) +
    scale_y_continuous(breaks=seq(0, 1, .1)) + ggtitle(exhibitTitle) +
    theme(
        plot.title=element_text(size=10, color="#E02C70"),
        axis.title.x=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_text(angle=0, vjust=0.5, size=8),
        axis.text.y=element_text(size=8)
    )
```

There is a generally good fit to actuals, but notice that for later development periods, estimates continue increasing rather than leveling off asymptotically toward 1.0. This is one of the drawbacks of polynomial regression: the bases are non-local, meaning that the fitted value $\hat{f}(x_0)$ at a given point $x_0$ depends strongly on data values far from $x_0$. In modern statistical modeling applications, polynomial basis functions have largely given way to newer basis functions such as splines, introduced next.

B-spline regression remedies the shortcomings of polynomial regression, namely the issue of non-locality. I’m going to demonstrate the usage of B-splines within the context of R rather than delve into the mathematical details. For an in-depth overview of B-splines, refer to *Elements of Statistical Learning*, specifically chapter 5.

To perform B-spline regression in R, the `bs` function is used to generate the B-spline basis matrix for a polynomial spline. The number of spline knots is specified, along with the degree of polynomial to use (defaulting to 3). We then generate the knot locations from the range of our independent variable (`x`) and the number of knots using the `seq` function. Techniques such as LOOCV can be used to minimize a cost function (e.g., average MSE) in order to determine the optimal number of knots, but we omit this step and instead arbitrarily choose 5 knots for the purposes of demonstration. As a rule-of-thumb, more knots leads to higher variance and lower bias.

```
library("splines")
library("data.table")
y = DF$y
nbrSplineKnots = 5
# Knot locations using data min/max and nbrSplineKnots.
knotsSeq = seq(min(DF$x), max(DF$x), length.out=nbrSplineKnots)
# Create basis matrix using splines::bs. degree=3 represents cubic spline.
Bbasis = bs(DF$x, knots=knotsSeq, degree=3)
# Drop columns containing only a single value.
Bbasis = as.matrix(Filter(function(v) uniqueN(v)>1, as.data.table(Bbasis)))
# Combine design matrix with target y (percent-of-ultimates).
DF2 = setDT(cbind.data.frame(Bbasis, y))
# Fit B-spline regression model.
mdl2 = lm(y ~ ., data=DF2)
# Bind reference to fitted values as yhat2.
DF[,yhat2:=unname(predict(mdl2))]
```

Running the same ggplot2 code from before, replacing `yhat1` with `yhat2`, we obtain:

Polynomial regression and B-spline estimates are similar, but B-spline estimates exhibit better behavior in the later periods, with estimates far less influenced by erratic observations, while also approaching 1.0 on the right. The B-spline fit exhibits a good trade-off between bias and variance.

An alternative approach is R's built-in `smooth.spline`. Assume we observe data $(x_i, y_i)$, $i = 1, \dots, n$, which we model by $y_i = f(x_i) + \varepsilon_i$, where the $\varepsilon_i$ are zero-mean random errors. The cubic smoothing spline estimate $\hat{f}$ of the function $f$ is defined to be the minimizer of:

$$
\sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2 + \lambda \int \hat{f}''(t)^2 \, dt,
$$

where $\lambda \geq 0$ is a smoothing parameter which controls the bias-variance trade-off. With respect to the `smooth.spline` function, $\lambda$ is controlled through `spar`, where `0 <= spar <= 1`. As `spar` approaches 1, the fit resembles linear regression (low variance / high bias). As `spar` approaches 0, the fit resembles interpolation (high variance / low bias). For the purposes of demonstration, we set `spar=.70` and `df` (degrees of freedom) to the number of records in the data:

```
smoothParam = .70
yhat3 = smooth.spline(DF$x, DF$y, df=nrow(DF), spar=smoothParam)$y
DF[,yhat3:=yhat3]
```

Plotting `yhat3` vs. percent-of-ultimates:

It comes as no surprise that `smooth.spline` predictions are similar to B-spline estimates. To demonstrate how changing `spar` modifies the nature of the curve, we present the next code block, which fits the original percent-of-ultimate data using `smooth.spline` for 6 values of `spar`:

```
library("foreach")
targetSpar = c(.01, .15, .40, .60, .85, 1.0)
DF3 = foreach(
    i=1:length(targetSpar), .inorder=TRUE, .errorhandling="stop",
    .final=function(ll) rbindlist(ll, fill=TRUE)
    ) %do% {
    currSpar = targetSpar[[i]]
    currDF = DF[,.(xinit, x, y)]
    sparID = paste0("spar=", round(currSpar, 2))
    mdl = smooth.spline(currDF[,x], currDF[,y], df=nrow(currDF), spar=currSpar)
    currDF[,`:=`(yhat=mdl$y, id=sparID, spar=currSpar)]
}

ggplot(DF3) +
    geom_point(aes(x=xinit, y=y, color="Actual"), size=1.5) +
    geom_line(aes(x=xinit, y=yhat, color="Predicted"), size=.75) +
    scale_color_manual("", values=c("Actual"="#758585", "Predicted"="#E02C70")) +
    scale_x_continuous(breaks=seq(min(DF3$xinit), max(DF3$xinit), 2)) +
    scale_y_continuous(breaks=seq(0, 1, .1)) +
    facet_wrap(facets=vars(id), nrow=2, scales="free", shrink=FALSE) +
    ggtitle("smooth.spline Regression: Percent-of-Ultimate Modeled vs. Actual") +
    theme(
        plot.title=element_text(size=10, color="#E02C70"),
        axis.title.x=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank(),
        legend.position="none", panel.grid.major=element_blank(),
        axis.ticks=element_blank()
        )
```

`spar=1` is the lower-right facet (lowest variance/highest bias), and `spar=.01` the upper-left facet (highest variance/lowest bias):
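As an aside, `spar` need not be chosen by hand: when it is omitted, `smooth.spline` selects the smoothing parameter itself, using ordinary leave-one-out cross-validation when `cv=TRUE` (generalized cross-validation is the default). A minimal sketch on simulated data (not the percent-of-ultimate dataset):

```r
set.seed(516)
x = seq(1, 20, length.out=60)
y = 1 - exp(-.25 * x) + rnorm(60, sd=.02)
# Omit spar/df: smooth.spline chooses the smoothing parameter via LOOCV.
mdlCV = smooth.spline(x, y, cv=TRUE)
mdlCV$spar    # the data-driven smoothing parameter
```

The resulting `spar` can then serve as a starting point, to be adjusted up or down depending on how much smoothness the application demands.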