When implementing a machine learning classifier, best practices dictate training the model on 60-70% of the dataset of interest, and preserving 30-40% of the data for adequacy testing after the model has been fit. While there are many variations of this theme, such as splitting the dataset into train, validation and test sets, or using k-fold cross-validation, the core tenet is the same: don't evaluate your model on the same data used to fit it.
The scikit-learn documentation sums it up best:

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word "experiment" is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.
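
The mistake described in the quote can be sketched on synthetic data. Everything below is an assumption made for illustration and is not part of the post's dataset: the data come from make_classification, the labels are deliberately noisy, and a 1-nearest-neighbor model stands in for "a model that would just repeat the labels". Scored on its own training data it looks perfect, while the held-out score is lower:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data; flip_y assigns the class of 20% of samples at random.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# With one neighbor, each training point is its own nearest neighbor,
# so the model simply repeats the training labels.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = memorizer.score(X_tr, y_tr)
test_acc = memorizer.score(X_te, y_te)

print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```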

scikit-learn makes splitting datasets into training and test sets painless by exposing the train_test_split function from the model_selection sub-module. We'll demonstrate how it's used shortly.

For the remainder of the post, we'll be referring to mw.csv, which is available for download here. The goal is to correctly categorize each individual as male or female (1=male, 0=female) based on a combination of the individual's height, hand length and foot length.

We start by reading in the dataset using Pandas and splitting it into train and test cohorts in the desired proportion. We then use scikit-learn's StandardScaler to put the explanatory variables on a common scale.

# Dataset Preprocessing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read in dataset, separating explanatory variables from response.
fpath = "E:\\Datasets\\Binary_Response\\Men_Women\\mw.csv"
df = pd.read_csv(fpath)
X = df.drop(['GENDER'], axis=1)
# Recode the response so that female is 0 instead of 2 (1=male, 0=female).
y = df['GENDER'].map(lambda x: 0 if x==2 else x).values

# Partition data into training and test cohorts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=16
    )

# Scale continuous explanatory variables.
sclr = StandardScaler()
X_train = sclr.fit_transform(X_train)
X_test  = sclr.transform(X_test)

We scaled HEIGHT, FOOT_LENGTH and HAND_LENGTH to be on the same relative scale using StandardScaler. We called fit_transform on X_train, but only called transform on X_test. This is because we're transforming the test data to be on the same basis as the training data, but not letting it influence the scale of the transformation.
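
To make the fit_transform/transform distinction concrete, here is a small sketch on made-up numbers (the simulated "heights" and the variable names are assumptions, not values from mw.csv). It checks that transform applied to the test data uses the mean and standard deviation learned from the training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=170.0, scale=10.0, size=(100, 1))  # hypothetical heights
test = rng.normal(loc=170.0, scale=10.0, size=(40, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses those training statistics

# The same transformation written out by hand with the training statistics.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
matches_manual = np.allclose(test_scaled, manual)
print("test set scaled with training statistics:", matches_manual)
```

Because the test data play no part in computing the statistics, the scaled training column has mean 0 by construction, whereas the scaled test column generally does not.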

Each model is fit to the training dataset. In the next code block, we fit four binary classifiers in turn (Gaussian Naive Bayes, k-Nearest Neighbors, Logistic Regression and Random Forest), then demonstrate the methods commonly used to assess model performance:

Fitting a number of binary classifiers using scikit-learn.
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Gaussian Naive Bayes classifier.
nb_clf = GaussianNB().fit(X_train, y_train)
nb_y_hat = nb_clf.predict(X_test)
nb_p_hat = nb_clf.predict_proba(X_test)[:,1]

# kNN classifier.
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_y_hat = knn_clf.predict(X_test)
knn_p_hat = knn_clf.predict_proba(X_test)[:,1]

# Logistic Regression classifier.
lr_clf = LogisticRegression(C=1e10).fit(X_train, y_train)
lr_y_hat = lr_clf.predict(X_test)
lr_p_hat = lr_clf.predict_proba(X_test)[:,1]

# Random Forest classifier.
rf_clf = RandomForestClassifier().fit(X_train, y_train)
rf_y_hat = rf_clf.predict(X_test)
rf_p_hat = rf_clf.predict_proba(X_test)[:,1]

For each classifier, the first step is fitting the model to the training data. Once the model is fit, we pass the test dataset to it and obtain two vectors. Vectors ending with _y_hat hold the class prediction for each individual: for our sample dataset, 1 if the prediction is Male or 0 if the prediction is Female (we changed Female from 2 to 0 in the pre-processing step above). Vectors ending with _p_hat hold the predicted probability that each individual belongs to class 1 (Male), taken from the second column of predict_proba. This probability lends insight into the model's confidence in a given classification: a model that classifies an individual as Male with 98% probability is far more confident than one that assigns 51%. These probabilities are used for model assessment via ROC curves, which will be demonstrated later in the post. But first we summarize important metrics used to assess the quality of machine learning classifiers.
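
The relationship between the two kinds of output can be sketched as follows. The data are synthetic (make_classification) and the variable names are illustrative assumptions. Only the link between predict and predict_proba matters here, so no train/test split is made. For a binary logistic regression, thresholding the class-1 probability at 0.5 recovers the class predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression().fit(X_demo, y_demo)

y_hat = clf.predict(X_demo)              # hard class labels (0 or 1)
p_hat = clf.predict_proba(X_demo)[:, 1]  # estimated probability of class 1

# Thresholding the probability at 0.5 reproduces the class predictions.
same = np.array_equal(y_hat, (p_hat > 0.5).astype(int))
print("predict matches thresholded predict_proba:", same)
```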

Precision and Recall

Precision is a measure of result relevancy, whereas recall is a measure of how many relevant results are returned. Put differently, precision measures the fraction of positive predictions that are correct, whereas recall measures the fraction of all positive instances the classifier correctly identifies as positive. Before presenting formulas for precision and recall, some clarification of the terminology will help with interpreting the quantitative representation:

  • True/False refers to whether the model's prediction is correct or incorrect.
  • Positive/Negative refers to whether the model predicted the positive or negative class.
  • P is the number of positive instances in the dataset of interest.
  • N is the number of negative instances in the dataset of interest.
  • True Positive (TP) = The model of interest correctly predicts the positive class.
  • True Negative (TN) = The model of interest correctly predicts the negative class.
  • False Positive (FP) = The model of interest incorrectly predicts the positive class (Type I error).
  • False Negative (FN) = The model of interest incorrectly predicts the negative class (Type II error).

Accuracy is defined as:

$$ \begin{aligned} ACC = \frac {TP+TN}{P+N} = \frac {TP+TN}{TP+TN+FP+FN} \end{aligned} $$

The accuracy measure in and of itself is not sufficient for assessing the quality of a classifier. Consider a dataset in which the occurrence of the positive class is a rare event: assume only 3% of observations belong to the positive class. If a model naively predicts the negative class for every observation, it achieves 97% accuracy without identifying a single positive instance. Since many real-world classification problems involve highly unbalanced classes (insurance or credit card fraud, for example), it is important to look at other metrics when evaluating model quality.
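
The 3% scenario can be checked with a few lines of NumPy. The counts (30 positives out of 1,000 observations) are made up to match the percentage in the text. The sketch also computes the fraction of positives the model finds, which is the recall metric defined below:

```python
import numpy as np

y_true = np.array([1] * 30 + [0] * 970)  # 3% of observations are positive
y_pred = np.zeros_like(y_true)           # naive model: always predict negative

accuracy = float(np.mean(y_pred == y_true))
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
recall = tp / (tp + fn)

print("accuracy:", accuracy)  # 0.97
print("recall  :", recall)    # 0.0
```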

Precision is the fraction of positive predictions that are correct:

$$ \begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned} $$

Precision measures the accuracy of the classifier's positive identification mechanism. It answers the question: "In aggregate, when the classifier identifies an observation as positive, how often is it correct?"

Recall is the fraction of all positive instances a classifier correctly predicts as positive:

$$ \begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned} $$

False Positive Rate is the fraction of all negative instances the classifier incorrectly identifies as positive:

$$ \begin{aligned} FPR = \frac{FP}{TN+FP} \end{aligned} $$

The $F_{1}$ score calculates the harmonic mean of precision and recall:

$$ \begin{aligned} F_{1} = 2 \frac{Precision * Recall}{Precision + Recall} \end{aligned} $$
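
As a quick check of the formulas above, the sketch below computes each metric by hand on a small made-up set of labels (the two lists are illustrative, not drawn from mw.csv) and compares the results with the corresponding functions in sklearn.metrics:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 3
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 1
tn = sum(t == 0 and p == 0 for t, p in pairs)  # 5

precision = tp / (tp + fp)                          # 3/4
recall = tp / (tp + fn)                             # 3/4
fpr = fp / (fp + tn)                                # 1/6
f1 = 2 * precision * recall / (precision + recall)  # 3/4

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```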

A classifier's true positive, true negative, false positive and false negative predictions can be combined and represented together concisely in a confusion matrix, which we discuss next.

Confusion Matrix

The confusion matrix can be used to evaluate the accuracy of a classifier. From the scikit-learn documentation:

By definition a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$. Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
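
The indexing convention in the quote can be sketched on a small made-up example (the label lists are illustrative). Rows are actual classes and columns are predicted classes, so flattening the 2x2 matrix row by row yields the counts in the order TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# Row-major order: C[0,0]=TN, C[0,1]=FP, C[1,0]=FN, C[1,1]=TP.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```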

We next demonstrate how to generate a confusion matrix for the k-Nearest Neighbors classifier from earlier:

Plot Confusion Matrix.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

actual_response = y_test
predicted_response = knn_y_hat
cm = confusion_matrix(actual_response, predicted_response)
sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel("Predicted Value")
plt.ylabel("Actual Value")

This code produces the following:


To summarize, the k-Nearest Neighbors classifier produced 22 true negatives, 4 false negatives, 2 false positives and 24 true positives.
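
Plugging these four counts into the formulas from the previous section gives the kNN classifier's summary metrics. This is a small worked example using only the counts quoted above:

```python
# Counts reported for the k-Nearest Neighbors classifier.
tn, fn, fp, tp = 22, 4, 2, 24

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 46/52
precision = tp / (tp + fp)                          # 24/26
recall = tp / (tp + fn)                             # 24/28
fpr = fp / (fp + tn)                                # 2/24
f1 = 2 * precision * recall / (precision + recall)  # 48/54

print(f"accuracy ={accuracy:.3f}")   # 0.885
print(f"precision={precision:.3f}")  # 0.923
print(f"recall   ={recall:.3f}")     # 0.857
print(f"FPR      ={fpr:.3f}")        # 0.083
print(f"F1       ={f1:.3f}")         # 0.889
```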

ROC Curve

The ROC curve (Receiver Operating Characteristic) is a plot often used in assessing the quality of a binary classifier as the discrimination threshold is varied. Let's define some additional metrics, some of which are used directly in generating the ROC curve:

The True Positive Rate (same as recall, a.k.a. sensitivity, hit rate):

$$ \begin{aligned} TPR = \frac{TP}{TP+FN} = \frac {TP}{P} \end{aligned} $$

The False Positive Rate (a.k.a. Type I error rate, fall-out):

$$ \begin{aligned} FPR = \frac {FP}{FP+TN} = 1 - TNR \end{aligned} $$

The True Negative Rate (a.k.a. specificity):

$$ \begin{aligned} TNR = \frac {TN}{TN+FP} = \frac {TN}{N} \end{aligned} $$

The False Negative Rate (a.k.a. miss rate, Type II error rate):

$$ \begin{aligned} FNR = \frac {FN}{FN+TP} = 1 - TPR \end{aligned} $$

The ROC curve incorporates two of these metrics: TPR and FPR. Convention dictates plotting TPR as a function of FPR, with TPR on the y-axis and FPR along the x-axis.
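
To see where the points on the curve come from, the sketch below sweeps the decision threshold by hand over four made-up scores (the labels, scores and thresholds are illustrative assumptions). Each threshold turns the scores into class predictions, and each set of predictions yields one (FPR, TPR) point:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical class-1 probabilities

roc_points = []
for threshold in [0.8, 0.4, 0.35, 0.1]:
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    roc_points.append((float(fpr), float(tpr)))

# Lowering the threshold moves the point up and to the right.
print(roc_points)  # [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```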

The area under the ROC curve (AUC) is used to assess model quality. The higher the AUC, the better the model. A high area under the curve means the classifier attains a high true positive rate while keeping the false positive rate low across the range of thresholds, so its scores tend to rank positive instances above negative ones. Note that the ROC curve is built from TPR (recall) and FPR; precision does not enter into it directly.
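
One way to read an AUC value is as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores (illustrative values, not taken from the post's models) by counting pairs directly and comparing against roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs in which the positive scores higher.
# Ties, absent in this example, would count as one half.
pairwise = float(np.mean(pos[:, None] > neg[None, :]))
auc_value = roc_auc_score(y_true, scores)

print(pairwise, auc_value)  # 0.75 0.75
```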

Next we demonstrate how to plot ROC curves for each of our four classifiers and calculate their respective AUC scores. AUC ranges from 0 to 1: .5 corresponds to random guessing and 1.0 to perfect separation of the two classes, so useful classifiers score between .5 and 1.0:

Plot ROC curve for each classifier.
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

fpr1, tpr1, thresholds1 = roc_curve(y_test, rf_p_hat,  pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y_test, lr_p_hat,  pos_label=1)
fpr3, tpr3, thresholds3 = roc_curve(y_test, nb_p_hat,  pos_label=1)
fpr4, tpr4, thresholds4 = roc_curve(y_test, knn_p_hat, pos_label=1)

plt.plot(fpr1, tpr1, linewidth=2, label='Random Forest')
plt.plot(fpr2, tpr2, linewidth=2, label='Logistic Regression')
plt.plot(fpr3, tpr3, linewidth=2, label='Naive Bayes')
plt.plot(fpr4, tpr4, linewidth=2, label='kNN')
plt.plot([0,1], [0,1], '--')
plt.axis([-.05, 1.05, -.05, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.legend(loc='lower right', prop={'size':20}, frameon=True)

Running the code above produces the following:


We can compute the AUC score for each classifier as well:

Compute AUC score for each classifier.
from sklearn.metrics import roc_auc_score

print("Random Forest AUC Score      : {}".format(roc_auc_score(y_test, rf_p_hat)))
print("Logistic Regression AUC Score: {}".format(roc_auc_score(y_test, lr_p_hat)))
print("Naive Bayes AUC Score        : {}".format(roc_auc_score(y_test, nb_p_hat)))
print("kNN AUC Score                : {}".format(roc_auc_score(y_test, knn_p_hat)))
Random Forest AUC Score      : 0.9174107142857144
Logistic Regression AUC Score: 0.9241071428571428
Naive Bayes AUC Score        : 0.9508928571428571
kNN AUC Score                : 0.9419642857142858
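
The AUC score is literally the area under the plotted curve. As a sketch on made-up scores (illustrative values only), integrating the (FPR, TPR) points returned by roc_curve with sklearn.metrics.auc, which applies the trapezoidal rule, gives the same number as roc_auc_score:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, scores)
area = auc(fpr, tpr)                      # trapezoidal area under the curve
direct = roc_auc_score(y_true, scores)

print(area, direct)  # 0.75 0.75
```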

A convenient method of obtaining relevant classifier metrics is provided in metrics.classification_report. It takes as arguments the test set labels and the classifier's class predictions for those observations, along with optional display names for the classes. classification_report then generates a table with precision, recall, f1-score and support (the number of test observations in each class):

Print classification_report.
from sklearn.metrics import classification_report

# target_names follow the sorted label order: 0=Female, 1=Male.
print(classification_report(y_test, nb_y_hat, target_names=['Female', 'Male']))

             precision    recall  f1-score   support

     Female       0.88      0.88      0.88        24
       Male       0.89      0.89      0.89        28
avg / total       0.88      0.88      0.88        52
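
When the numbers are needed programmatically rather than as printed text, classification_report can return a dictionary instead. The sketch below uses made-up labels and assumes a scikit-learn version that supports the output_dict argument (0.20 or later). target_names are matched to the sorted class labels, so the name for class 0 comes first:

```python
from sklearn.metrics import classification_report

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

report = classification_report(
    y_true, y_pred, target_names=["Female", "Male"], output_dict=True
)

male_precision = report["Male"]["precision"]  # 3 of 4 positive predictions correct
female_support = report["Female"]["support"]  # 6 observations with label 0

print(male_precision, female_support)
```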