When implementing a machine learning classifier, best practices dictate training the model on 60-70% of the dataset of interest, and preserving 30-40% of the data for adequacy testing after the model has been fit. While there are many variations of this theme, such as splitting the dataset into train, evaluation and test sets, or using k-fold cross-validation, the core tenet is the same: Don’t evaluate your model on the same data used to fit it. I believe the scikit-learn documentation sums it up best in stating:

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.
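The k-fold variation mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the data is synthetic (generated with make_classification, not mw.csv), and the names X_demo, y_demo and scores are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical; not the mw.csv dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    random_state=16)

# 5-fold cross-validation: each fold is held out once for scoring while
# the model is fit on the remaining four folds.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy    :", scores.mean())
```

Every observation is used for evaluation exactly once, yet no model is ever scored on data it was fit to.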

scikit-learn makes splitting datasets into training and test sets painless by exposing the train_test_split function from the model_selection sub-module. We’ll demonstrate how it’s used shortly.

For the remainder of the post, we’ll be referring to mw.csv, which can be found here. Our goal will be to correctly categorize each individual as male or female (1=male, 2=female) based on a combination of the individual’s height, hand length and foot length.

Next, we read in the dataset using pandas, and clean it up using scikit-learn’s pre-processing utilities. Then, using train_test_split, we split the dataset into train and test sets in the proportion desired (here, we selected .33 for test_size).

# Dataset Preprocessing                      |
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# read in dataset, separating explanatory variables from response =>
fpath = "E:\\Datasets\\Binary_Response\\Men_Women\\mw.csv"
df  = pd.read_table(fpath, sep=",")
X = df.drop(['GENDER'], axis=1)
y = df['GENDER'].map(lambda x: 0 if x==2 else x).values

# split data into training and test sets =>
X_train, X_test, y_train, y_test = train_test_split(
                                X, y, test_size=.33, random_state=16)

# scale explanatory variables =>
sclr = StandardScaler()
X_train = sclr.fit_transform(X_train)
X_test  = sclr.transform(X_test)

In the pre-processing step, we scaled HEIGHT, FOOT_LENGTH and HAND_LENGTH to be on the same basis using StandardScaler. Note that we called fit_transform on X_train, but called transform on X_test. This is because we’re transforming the test dataset to be on the same basis as the training dataset, but not letting it influence the scale of the transformation.
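The effect of fitting the scaler on the training data only can be checked directly. The sketch below uses made-up height-like values (the arrays train and test are hypothetical, not drawn from mw.csv) and shows that the test set is standardized with the training set's mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical height-like measurements; not the mw.csv data.
rng = np.random.default_rng(16)
train = rng.normal(loc=170.0, scale=10.0, size=(100, 1))
test = rng.normal(loc=170.0, scale=10.0, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the train mean/std

# The training data ends up with mean 0 and standard deviation 1.
print(train_scaled.mean(), train_scaled.std())

# The test data is shifted and scaled by the *training* statistics,
# so its mean and standard deviation are only approximately 0 and 1.
print(test_scaled.mean(), test_scaled.std())
manual = (test - train.mean()) / train.std()
print(np.allclose(test_scaled, manual))
```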

Next, we fit a collection of binary classifiers to the training dataset, then demonstrate the tools and techniques we can leverage to evaluate model performance. In the code that follows, we fit Gaussian Naive Bayes, k-Nearest Neighbors, Logistic Regression and Random Forest classifiers:

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Gaussian Naive Bayes Classifier             |
nb_clf = GaussianNB().fit(X_train, y_train)

# get estimated class predictions and probabilities =>
nb_y_hat = nb_clf.predict(X_test)
nb_p_hat = nb_clf.predict_proba(X_test)[:,1]

# kNN Classifier                             |
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# get estimated class predictions and probabilities =>
knn_y_hat = knn_clf.predict(X_test)
knn_p_hat = knn_clf.predict_proba(X_test)[:,1]

# Logistic Regression Classifier             |
lr_clf = LogisticRegression(C=1e10).fit(X_train, y_train)

# get estimated class predictions and probabilities =>
lr_y_hat = lr_clf.predict(X_test)
lr_p_hat = lr_clf.predict_proba(X_test)[:,1]

# =============================================
# Random Forest Classifier                    |
# =============================================
rf_clf = RandomForestClassifier().fit(X_train, y_train)

# pass test set to classifier to evaluate model fit =>
rf_y_hat = rf_clf.predict(X_test)
rf_p_hat  = rf_clf.predict_proba(X_test)[:,1]

For each classifier, the first step is fitting the model to the training data. Once the model is fit, we pass the test dataset to it and obtain two results. Vectors ending with ‘_y_hat’ hold the class prediction for each individual: for our sample dataset, this is 1 if the prediction is ‘Male’ and 0 if the prediction is ‘Female’ (we changed ‘Female’ from 2 to 0 in the pre-processing step above). Vectors ending with ‘_p_hat’ hold the estimated probability that each individual belongs to the positive class (‘Male’), taken from the second column of predict_proba. This probability lends insight into the model’s confidence in a given classification: a model that classifies an individual as ‘Male’ with 98% probability is far more confident than one that does so with 51%. These probabilities are also used for model assessment via ROC curves, which will be demonstrated shortly. But first, let’s review some of the metrics used to assess the quality of our machine learning classifiers.
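The relationship between the two kinds of output can be illustrated with a small sketch. It assumes synthetic data and a default LogisticRegression (the names X_demo, y_demo, clf and y_from_p are hypothetical): for a binary model like this one, the class prediction corresponds to thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical; not the mw.csv dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    random_state=16)

clf = LogisticRegression().fit(X_demo, y_demo)

# Columns of predict_proba follow clf.classes_, so column 1 holds the
# estimated probability of class 1.
p_hat = clf.predict_proba(X_demo)[:, 1]
y_hat = clf.predict(X_demo)

# Thresholding the probability at 0.5 reproduces the class predictions.
y_from_p = (p_hat > 0.5).astype(int)
print(clf.classes_)
print(np.array_equal(y_hat, y_from_p))
```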

Precision and Recall

Precision is a measure of result relevancy, while Recall is a measure of how many relevant results are returned. Precision measures the fraction of positive predictions that are correct, whereas Recall measures the fraction of all positive instances the classifier correctly identifies as positive.

Before presenting formulaic representations of precision and recall, some clarification of the terminology will help with interpreting the quantitative representation:

  • True/False refers to whether the model’s prediction is correct or incorrect.
  • Positive/Negative refers to whether the model predicted the positive or negative class.
  • P is the number of positive instances in the actual dataset.
  • N is the number of negative instances in the actual dataset.

It immediately follows that:

  • True Positive (TP) - The model of interest correctly predicts the positive class
  • True Negative (TN) - The model of interest correctly predicts the negative class
  • False Positive (FP) - The model of interest incorrectly predicts the positive class (Type I error)
  • False Negative (FN) - The model of interest incorrectly predicts the negative class (Type II error)

Now for some definitions:

Accuracy is defined as:

$$ \begin{aligned} ACC = \frac {TP+TN}{P+N} = \frac {TP+TN}{TP+TN+FP+FN} \end{aligned} $$

The accuracy measure in and of itself is not sufficient for assessing the quality of a classifier. Consider a dataset in which the occurrence of the positive class is a rare event: assume only 3% of observations belong to the positive class. If a model naively predicts the negative class for every observation, it achieves 97% accuracy without identifying a single positive instance! This is why it is important to consider additional metrics, such as the following:

Precision is the fraction of positive predictions that are correct:

$$ \begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned} $$

Recall is the fraction of all positive instances a classifier correctly predicts as positive:

$$ \begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned} $$
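To see why recall complements accuracy, the rare-event scenario from the accuracy discussion can be reproduced with a handful of made-up labels (the arrays y_true and y_pred below are hypothetical): 3 positives out of 100 observations, and a "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 3 positive and 97 negative observations (made-up labels).
y_true = np.array([1] * 3 + [0] * 97)

# A naive model that predicts the negative class for everyone.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # (TP + TN) / (P + N) = 97 / 100
rec = recall_score(y_true, y_pred)    # TP / (TP + FN) = 0 / 3

print("Accuracy:", acc)
print("Recall  :", rec)
```

Accuracy looks excellent, while recall makes it obvious that none of the positive instances were found.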

False Positive Rate is the fraction of all negative instances the classifier incorrectly identifies as positive:

$$ \begin{aligned} FPR = \frac{FP}{TN+FP} \end{aligned} $$

The \(F_{1}\) score combines precision and recall into a single number by taking their harmonic mean:

$$ \begin{aligned} F_{1} = 2 \frac{Precision * Recall}{Precision + Recall} \end{aligned} $$
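As a quick check of these formulas, the sketch below compares scikit-learn's precision_score, recall_score and f1_score against the definitions above. The label vectors are made up for illustration and are not taken from mw.csv.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: TP = 3, FN = 1, FP = 2, TN = 4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)

# Harmonic mean computed by hand from the formula above.
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1, f1_manual)
```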

A given classifier’s true positive, true negative, false positive and false negative predictions can be combined and represented together concisely in a confusion matrix, which we discuss next.

Confusion Matrix

The confusion matrix summarizes a classifier’s correct and incorrect predictions by class, and can be used to evaluate its accuracy. From the scikit-learn docs:

By definition a confusion matrix \(C\) is such that \(C_{i, j}\) is equal to the number of observations known to be in group i but predicted to be in group j. Thus in binary classification, the count of true negatives is \(C_{0,0}\), false negatives is \(C_{1,0}\), true positives is \(C_{1,1}\) and false positives is \(C_{0,1}\).
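The layout described in that definition can be verified on a toy example. The labels below are made up; with the binary labels 0 and 1, flattening the matrix row by row yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# C[0,0] = TN, C[0,1] = FP, C[1,0] = FN, C[1,1] = TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```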

In the following code segment, we demonstrate how to generate the confusion matrix for our k-Nearest Neighbors classifier above:

# ==========================================
#  Plot Confusion Matrix                   |
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

actual_response = y_test
predicted_response = knn_y_hat
cm = confusion_matrix(actual_response, predicted_response)

sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel('Predicted Value')
plt.ylabel('Actual Value')

Running the code above generates the following:


Therefore, the k-Nearest Neighbors classifier predicted 22 true negatives, 4 false negatives, 2 false positives and 24 true positives.
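Plugging the four counts reported above into the earlier definitions gives the headline metrics for the k-Nearest Neighbors classifier. This is plain arithmetic on those counts; the variable names are only for illustration.

```python
# Counts reported for the k-Nearest Neighbors classifier above.
tn, fn, fp, tp = 22, 4, 2, 24

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fpr = fp / (fp + tn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy :", round(accuracy, 4))   # 0.8846
print("Precision:", round(precision, 4))  # 0.9231
print("Recall   :", round(recall, 4))     # 0.8571
print("FPR      :", round(fpr, 4))        # 0.0833
print("F1       :", round(f1, 4))         # 0.8889
```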

ROC Curve

The ROC curve (Receiver Operating Characteristic) is a plot often used to assess the quality of a binary classifier as the discrimination threshold is varied. Let’s define some additional metrics, some of which are used directly in generating the ROC curve:

The True Positive Rate (same as recall, a.k.a. sensitivity, hit rate):

$$ \begin{aligned} TPR = \frac{TP}{TP+FN} = \frac {TP}{P} \end{aligned} $$

The False Positive Rate (a.k.a. Type-I error, fall-out):

$$ \begin{aligned} FPR = \frac {FP}{FP+TN} = 1 - TNR \end{aligned} $$

The True Negative Rate (a.k.a. specificity):

$$ \begin{aligned} TNR = \frac {TN}{TN+FP} = \frac {TN}{N} \end{aligned} $$

The False Negative Rate (a.k.a. miss rate, Type-II error):

$$ \begin{aligned} FNR = \frac {FN}{FN+TP} = 1 - TPR \end{aligned} $$

The ROC curve leverages two of these metrics: \(TPR\) and \(FPR\). Convention dictates plotting \(TPR\) as a function of \(FPR\), with \(TPR\) on the y-axis and \(FPR\) along the x-axis.
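Each point on the curve corresponds to one threshold. The sketch below, which uses made-up labels and scores and a hypothetical helper roc_point, computes the (FPR, TPR) pair by hand for each threshold returned by roc_curve and confirms that the two agree.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and positive-class scores for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6])

def roc_point(threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Recompute every point by hand and compare with roc_curve's output.
manual = np.array([roc_point(t) for t in thresholds])
print(np.allclose(manual[:, 0], fpr), np.allclose(manual[:, 1], tpr))
```

Lowering the threshold moves along the curve from the lower left (nothing predicted positive) to the upper right (everything predicted positive).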

The area under the ROC curve (AUC) is used to assess model quality. The higher the AUC, the better the model. A high AUC means the classifier can reach a high true positive rate while keeping the false positive rate low across a range of thresholds. Equivalently, the AUC is the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative one. Note that the ROC curve is built from TPR (recall) and FPR; precision does not enter into it directly.
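One way to build intuition for the AUC is to compute it as a ranking statistic: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below uses made-up labels and scores and compares that fraction with roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class scores for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair.
diffs = pos[:, None] - neg[None, :]

# Fraction of pairs ranked correctly; ties count as half.
auc_rank = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size
auc_sklearn = roc_auc_score(y_true, scores)

print(auc_rank, auc_sklearn)  # both 0.875: 14 of 16 pairs ranked correctly
```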

Next we demonstrate how to plot ROC curves for each of our 4 classifiers and calculate their respective AUC scores. AUC scores generally fall between 0.5 and 1.0, with 0.5 representing random guessing and 1.0 representing perfect classification for all observations. As with everything else, scikit-learn makes obtaining these metrics as straightforward as possible:

# ==========================================
#  Plot ROC curve for each classifier      |
# ==========================================
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

fpr1, tpr1, thresholds1 = roc_curve(y_test, rf_p_hat,  pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y_test, lr_p_hat,  pos_label=1)
fpr3, tpr3, thresholds3 = roc_curve(y_test, nb_p_hat,  pos_label=1)
fpr4, tpr4, thresholds4 = roc_curve(y_test, knn_p_hat, pos_label=1)

plt.plot(fpr1, tpr1, linewidth=2, label='Random Forest')
plt.plot(fpr2, tpr2, linewidth=2, label='Logistic Regression')
plt.plot(fpr3, tpr3, linewidth=2, label='Naive Bayes')
plt.plot(fpr4, tpr4, linewidth=2, label='kNN')
plt.plot([0,1], [0,1], '--')
plt.axis([-.05, 1.05, -.05, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.legend(loc='lower right',prop={'size':20}, frameon=True)

Running this generates the following plot:


Then, to obtain each classifier’s AUC score, we run:

# ==========================================
#  Print AUC score for each classifier     |
# ==========================================
from sklearn.metrics import roc_auc_score

print("Random Forest AUC Score      : {}".format(roc_auc_score(y_test, rf_p_hat)))
print("Logistic Regression AUC Score: {}".format(roc_auc_score(y_test, lr_p_hat)))
print("Naive Bayes AUC Score        : {}".format(roc_auc_score(y_test, nb_p_hat)))
print("kNN AUC Score                : {}".format(roc_auc_score(y_test, knn_p_hat)))

Random Forest AUC Score      : 0.9174107142857144
Logistic Regression AUC Score: 0.9241071428571428
Naive Bayes AUC Score        : 0.9508928571428571
kNN AUC Score                : 0.9419642857142858

In terms of AUC score, the Naive Bayes classifier performs best.

A convenient way to obtain several classifier metrics at once is metrics.classification_report. It takes as arguments the test set labels and the classifier’s class predictions for those observations, along with optional display names for the classes (target_names, given in sorted label order). classification_report then generates a table with precision, recall, f1-score and support:

# ==========================================
#  Print classification_report             |
# ==========================================
from sklearn.metrics import classification_report

# target_names must follow the sorted label order: 0 = Female, 1 = Male
>>> print(classification_report(y_test, nb_y_hat, target_names=['Female', 'Male']))

             precision    recall  f1-score   support

     Female       0.88      0.88      0.88        24
       Male       0.89      0.89      0.89        28

avg / total       0.88      0.88      0.88        52

Once a model has been fit…the work has only just begun! In order to implement and productionize well-tuned, fully-functional machine learning models, a good deal of time needs to be spent not only in determining optimal hyperparameter settings, but also in considering the precision/recall tradeoff that makes the most sense for the problem domain in question.
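The precision/recall tradeoff can be explored by moving the decision threshold applied to the predicted probabilities. The sketch below uses made-up labels and probabilities (y_true and p_hat are hypothetical, not the mw.csv results), and the thresholds 0.3, 0.5 and 0.7 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and positive-class probabilities for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6])

results = []
for threshold in (0.3, 0.5, 0.7):
    # Predict the positive class when the probability meets the threshold.
    y_pred = (p_hat >= threshold).astype(int)
    results.append((threshold,
                    precision_score(y_true, y_pred),
                    recall_score(y_true, y_pred)))

for threshold, precision, recall in results:
    print("threshold={:.1f}  precision={:.3f}  recall={:.3f}".format(
        threshold, precision, recall))
```

On this toy data, raising the threshold increases precision while recall drops or stays flat; which end of that tradeoff is preferable depends on the relative cost of false positives and false negatives in the problem at hand.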

We’ve only scratched the surface of what scikit-learn offers with respect to model assessment utilities. Make an effort to explore the documentation (which is first rate), and make it a goal to internalize a new aspect of the API you weren’t previously familiar with every time you visit. In no time at all, you’ll have gained familiarity with the most important utilities, which you’ll be able to leverage to your advantage when faced with a problem that has a non-obvious solution.
Until next time, happy coding!