Hyperparameter Search and Classifier Threshold Selection

Machine Learning

April 28, 2024

This notebook demonstrates how to use scikit-learn's GridSearchCV to identify optimal hyperparameters for a given model and metric, and presents alternatives for selecting a classifier threshold.

First we load the breast cancer dataset. We will forgo any pre-processing, but create separate train and validation sets:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True, precision=8, linewidth=1000)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

data = load_breast_cancer()
X = data["data"]
y = data["target"]

# Create train and validation splits (80% / 20%); no separate test set is held out here.
Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=.20, random_state=516)

print(f"Xtrain.shape: {Xtrain.shape}")
print(f"Xvalid.shape: {Xvalid.shape}")

Xtrain.shape: (455, 30)
Xvalid.shape: (114, 30)
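Because threshold selection is sensitive to the balance between classes, it can be worth checking that the train and validation sets have similar class proportions. The following is a minimal sketch, not part of the original notebook: it repeats the split above and compares it with a stratified variant that uses the `stratify` argument of `train_test_split`. The helper `class_proportions` and the `_s` suffixed names are hypothetical and introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data["data"], data["target"]

# Same split as above (no stratification).
Xtrain, Xvalid, ytrain, yvalid = train_test_split(
    X, y, test_size=.20, random_state=516
)

# Assumed alternative: stratify on y so that both splits keep
# approximately the overall class proportions.
Xtrain_s, Xvalid_s, ytrain_s, yvalid_s = train_test_split(
    X, y, test_size=.20, random_state=516, stratify=y
)


def class_proportions(labels):
    """Return the fraction of samples in class 0 and class 1 (hypothetical helper)."""
    counts = np.bincount(labels, minlength=2)
    return counts / counts.sum()


# In this dataset, target 0 is malignant and target 1 is benign.
print(f"classes             : {list(data['target_names'])}")
print(f"overall             : {class_proportions(y)}")
print(f"train (unstratified): {class_proportions(ytrain)}")
print(f"valid (unstratified): {class_proportions(yvalid)}")
print(f"train (stratified)  : {class_proportions(ytrain_s)}")
print(f"valid (stratified)  : {class_proportions(yvalid_s)}")
```

If the unstratified proportions differ noticeably between the two sets, a threshold tuned on one may not transfer cleanly to the other; whether the difference matters here depends on the metric being optimized.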