When implementing a machine learning classifier, best practices dictate training the model on 60-70% of the
dataset of interest, and preserving 30-40% of the data for adequacy testing after the model has been fit.
While there are many variations of this theme, such as splitting the dataset into train, evaluation and test
sets, or using k-fold cross-validation, the core tenet is the same: *Don’t evaluate your model on the same
data used to fit it*. I believe the scikit-learn documentation
sums it up best in stating:

*Learning the parameters of a prediction function and testing it on the same data is a methodological
mistake: a model that would just repeat the labels of the samples that it has just seen would have a
perfect score but would fail to predict anything useful on yet-unseen data. This situation is called
overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment
to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is
not intended to denote academic use only, because even in commercial settings machine learning usually starts
out experimentally.*
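To make the quoted point concrete, here is a minimal sketch on a synthetic dataset (an assumption for illustration, generated with `make_classification`, not the `mw.csv` data used later). A 1-nearest-neighbor model effectively memorizes its training samples, so its training score says little about how it will do on unseen data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy binary dataset (illustrative assumption, not the post's data).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# With k=1, each training sample is its own nearest neighbor,
# so the model simply repeats the labels it has already seen.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = memorizer.score(X_tr, y_tr)  # evaluated on the data used to fit
test_acc = memorizer.score(X_te, y_te)   # evaluated on held-out data

print("Accuracy on training data: {:.3f}".format(train_acc))
print("Accuracy on held-out data: {:.3f}".format(test_acc))
```

The gap between the two scores is the reason only the held-out figure is treated as an estimate of real-world performance.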

scikit-learn makes splitting datasets into training and test sets painless by exposing the `train_test_split` function from the `model_selection` sub-module. We’ll demonstrate how it’s used shortly.

For the remainder of the post, we’ll be referring to `mw.csv`, which can be found here. Our goal will be to correctly categorize each individual as male or female (1=male, 2=female) based on a combination of the individual’s height, hand length and foot length.

Next, we read in the dataset using pandas and, using `train_test_split`, split it into train and test sets in the proportion desired (here, we selected .33 for `test_size`). We then standardize the explanatory variables using scikit-learn’s pre-processing utilities.

```
#=============================================
# Dataset Preprocessing |
#=============================================
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# read in dataset, separating explanatory variables from response =>
fpath = "E:\\Datasets\\Binary_Response\\Men_Women\\mw.csv"
df = pd.read_table(fpath, sep=",")
X = df.drop(['GENDER'], axis=1)
y = df['GENDER'].map(lambda x: 0 if x==2 else x).values
# split data into training and test sets =>
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=16)
# scale explanatory variables =>
sclr = StandardScaler()
X_train = sclr.fit_transform(X_train)
X_test = sclr.transform(X_test)
```

In the pre-processing step, we scaled HEIGHT, FOOT_LENGTH and HAND_LENGTH to be on the same basis using `StandardScaler`. Note that we called `fit_transform` on *X_train*, but called `transform` on *X_test*. This is because we want the test dataset transformed onto the same basis as the training dataset, without letting it influence the parameters (mean and standard deviation) of the transformation.
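The following sketch illustrates the same idea on randomly generated arrays (an assumption for illustration; the names `X_tr` and `X_te` are hypothetical stand-ins for the train and test features). The scaler's learned statistics come from the training data only, and the test data is shifted and scaled with those same statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for train/test features (not the post's data).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=170.0, scale=10.0, size=(100, 3))
X_te = rng.normal(loc=170.0, scale=10.0, size=(40, 3))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from X_tr, then scales it
X_te_scaled = scaler.transform(X_te)      # reuses the training mean/std

# The learned parameters match the training set's statistics.
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print(np.allclose(scaler.scale_, X_tr.std(axis=0)))

# The scaled test set is centered near, but not exactly at, zero.
print(X_te_scaled.mean(axis=0).round(3))
```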

Next, we fit the models to the training dataset. We fit a collection of binary classifiers, then demonstrate the tools and techniques we can leverage to evaluate model performance. In the code that follows, we fit Gaussian Naive Bayes, k-Nearest Neighbors, Logistic Regression and Random Forest classifiers:

```
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#==============================================
# Gaussian Naive Bayes Classifier |
#==============================================
nb_clf = GaussianNB().fit(X_train, y_train)
# get estimated class predictions and probabilities =>
nb_y_hat = nb_clf.predict(X_test)
nb_p_hat = nb_clf.predict_proba(X_test)[:,1]
#=============================================
# kNN Classifier |
#=============================================
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# get estimated class predictions and probabilities =>
knn_y_hat = knn_clf.predict(X_test)
knn_p_hat = knn_clf.predict_proba(X_test)[:,1]
#=============================================
# Logistic Regression Classifier |
#=============================================
lr_clf = LogisticRegression(C=1e10).fit(X_train, y_train)
# pass test set to classifier to evaluate model fit =>
lr_y_hat = lr_clf.predict(X_test)
lr_p_hat = lr_clf.predict_proba(X_test)[:,1]
# =============================================
# Random Forest Classifier |
# =============================================
rf_clf = RandomForestClassifier().fit(X_train, y_train)
# pass test set to classifier to evaluate model fit =>
rf_y_hat = rf_clf.predict(X_test)
rf_p_hat = rf_clf.predict_proba(X_test)[:,1]
```

For each classifier, the first step is fitting the model to the training data. After the model is fit, we pass the test dataset to each classifier. For each model, we have two resulting evaluations. Vectors ending with ‘_y_hat’ represent the class prediction for each individual for that particular model: for our sample dataset, this will either be `1` if the prediction is ‘Male’ or `0` if the prediction is ‘Female’ (we changed ‘Female’ from `2` to `0` in the pre-processing step above). The vectors ending with ‘_p_hat’ give the estimated probability that each individual belongs to the positive class (‘Male’), since we keep only column 1 of the `predict_proba` output. This probability lends insight into the model’s confidence in a given classification: a model that classifies an individual as ‘Male’ with 98% probability is much more confident in that classification than one that does so with 51% probability. These probabilities are used for model assessment via *ROC Curves*, which will be demonstrated shortly. But first, let’s cover some of the metrics used to assess the quality of our machine learning classifiers.
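As a small illustration of how the two kinds of output relate, the sketch below fits a logistic regression to synthetic data (an assumption for illustration, not `mw.csv`) and checks that the class predictions agree with thresholding the positive-class probability at 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (illustrative assumption, not the post's data).
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression().fit(X, y)

y_hat = clf.predict(X)              # hard class labels (0 or 1)
p_hat = clf.predict_proba(X)[:, 1]  # estimated P(class == 1)

# Columns of predict_proba follow clf.classes_, so column 1 is class 1.
print(clf.classes_)

# For a binary problem, the predicted label is 1 when P(class == 1) exceeds 0.5.
agree = np.array_equal(y_hat, (p_hat > 0.5).astype(int))
print(agree)
```

Keeping the probabilities around, rather than only the labels, is what later allows the decision threshold to be varied.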

## Precision and Recall

*Precision* is a measure of result relevancy, while *Recall* is a measure of how many relevant results are returned.
Precision measures the fraction of positive predictions that are correct, whereas *Recall* measures the fraction
of all positive instances the classifier correctly identifies as positive.

Before presenting formulaic representations of precision and recall, some clarification of the terminology will help with interpreting the quantitative representation:

- *True/False* refers to whether the model’s prediction is correct or incorrect.
- *Positive/Negative* refers to whether the model predicted the positive or negative class.
- *P* is the number of positive instances in the actual dataset.
- *N* is the number of negative instances in the actual dataset.

It immediately follows that:

- **True Positive** (TP) - The model of interest correctly predicts the positive class
- **True Negative** (TN) - The model of interest correctly predicts the negative class
- **False Positive** (FP) - The model of interest incorrectly predicts the positive class (Type I error)
- **False Negative** (FN) - The model of interest incorrectly predicts the negative class (Type II error)

Now for some definitions:

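*Accuracy* is the fraction of all predictions that are correct:

\[ \text{Accuracy} = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN} \]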

The accuracy measure in and of itself is not sufficient for assessing the quality of a classifier.
Consider a dataset in which the occurrence of the positive class is a rare event: assume only 3% of
observations belong to the positive class. If a model naively predicts the negative class for
each observation, it would achieve 97% accuracy while never identifying a single positive instance!
This is why it is important to consider additional metrics, such as the following:
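The 3% scenario is easy to reproduce. In the sketch below, the label vector is constructed by hand to match that scenario (an illustrative assumption), and the "model" is simply a vector of all-negative predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 observations, only 3 of which belong to the positive class.
y_true = np.array([1] * 3 + [0] * 97)

# A naive "classifier" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("Accuracy: {:.2f}".format(acc))  # 0.97
print("Recall  : {:.2f}".format(rec))  # 0.00
```

Accuracy looks excellent, yet the recall of zero shows that none of the positive instances were found.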

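*Precision* is the fraction of positive predictions that are correct:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

*Recall* is the fraction of all positive instances a classifier correctly predicts as positive:

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{P} \]

*False Positive Rate* is the fraction of all negative instances the classifier incorrectly identifies as positive:

\[ \text{FPR} = \frac{FP}{FP + TN} = \frac{FP}{N} \]

The \(F_{1}\) score combines precision and recall into a single number by taking their harmonic mean:

\[ F_{1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]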

A given classifier’s true positive, true negative, false positive and false negative predictions can be
combined and represented together concisely in a *confusion matrix*, which we discuss next.
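Before moving on, here is a short sketch that computes these metrics by hand on a tiny, made-up pair of label vectors (an assumption for illustration) and checks them against scikit-learn's `precision_score`, `recall_score` and `f1_score`:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions (illustrative, not the post's data).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Count each outcome directly from its definition.
tp = np.sum((y_pred == 1) & (y_true == 1))  # 2
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # 2
tn = np.sum((y_pred == 0) & (y_true == 0))  # 5

precision = tp / (tp + fp)                           # 2/3
recall = tp / (tp + fn)                              # 1/2
fpr = fp / (fp + tn)                                 # 1/6
f1 = 2 * precision * recall / (precision + recall)   # 4/7

print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1 score here (about 0.57) sits closer to the recall of 0.50 than a simple average would.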

## Confusion Matrix

The *confusion matrix* can be used to evaluate the accuracy of a classifier. From the
scikit-learn docs:

*By definition a confusion matrix \(C\) is such that \(C_{i, j}\) is equal to the number of observations known to be in
group i but predicted to be in group j. Thus in binary classification, the count of true negatives is \(C_{0,0}\),
false negatives is \(C_{1,0}\), true positives is \(C_{1,1}\) and false positives is \(C_{0,1}\).*
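The indexing convention in that definition can be checked on a tiny, made-up example (the label vectors below are assumptions for illustration). For a binary problem, flattening the matrix with `ravel()` yields the four counts in the order TN, FP, FN, TP:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (illustrative, not the post's data).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row-major flattening gives C[0,0], C[0,1], C[1,0], C[1,1].
tn, fp, fn, tp = cm.ravel()
print("TN={}, FP={}, FN={}, TP={}".format(tn, fp, fn, tp))
```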

In the following code segment, we demonstrate how to generate the confusion matrix for our k-Nearest Neighbors classifier above:

```
# ==========================================
# Plot Confusion Matrix |
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
from sklearn.metrics import confusion_matrix
actual_response = y_test
predicted_response = knn_y_hat
cm = confusion_matrix(actual_response, predicted_response)
sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel('Predicted Value')
plt.ylabel('Actual Value')
plt.show()
```

Running the code above generates the following:

Therefore, the k-Nearest Neighbors classifier predicted 22 true negatives,
4 false negatives, 2 false positives and 24 true positives.
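Plugging these four counts into the metric definitions from the previous section gives the following. This is plain arithmetic on the counts reported above; the variable names are only illustrative:

```python
# Counts reported for the k-Nearest Neighbors classifier.
tn, fn, fp, tp = 22, 4, 2, 24

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 46 / 52
precision = tp / (tp + fp)                           # 24 / 26
recall = tp / (tp + fn)                              # 24 / 28
fpr = fp / (fp + tn)                                 # 2 / 24
f1 = 2 * precision * recall / (precision + recall)   # 48 / 54

print("accuracy={:.3f} precision={:.3f} recall={:.3f} fpr={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, fpr, f1))
# accuracy=0.885 precision=0.923 recall=0.857 fpr=0.083 f1=0.889
```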

## ROC Curve

The ROC curve (Receiver Operating Characteristic) is a
plot often used to assess the quality of a binary classifier as the discrimination threshold is varied. Let’s define
some additional metrics, some of which are used directly in generating the ROC curve:

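The True Positive Rate (same as recall, a.k.a. sensitivity, hit rate):

\[ TPR = \frac{TP}{TP + FN} = \frac{TP}{P} \]

The False Positive Rate (a.k.a. Type-I error rate, fall-out):

\[ FPR = \frac{FP}{FP + TN} = \frac{FP}{N} \]

The True Negative Rate (a.k.a. specificity):

\[ TNR = \frac{TN}{TN + FP} = 1 - FPR \]

The False Negative Rate (a.k.a. miss rate, Type-II error rate):

\[ FNR = \frac{FN}{FN + TP} = 1 - TPR \]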

The ROC curve leverages two of these metrics: \(TPR\) and \(FPR\). Convention dictates plotting \(TPR\) as a function of \(FPR\), with \(TPR\) on the y-axis and \(FPR\) along the x-axis.
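To see where the points on the curve come from, the sketch below recomputes \(FPR\) and \(TPR\) by hand at each threshold returned by `roc_curve`. The four labels and scores are made up for illustration, and `roc_point` is a hypothetical helper written for this example, not part of scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores (illustrative, not the post's data).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(y, s, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    y_pred = s >= threshold
    tp = np.sum(y_pred & (y == 1))
    fp = np.sum(y_pred & (y == 0))
    return fp / np.sum(y == 0), tp / np.sum(y == 1)

fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label=1)

# Each threshold yields one (FPR, TPR) point on the curve.
manual = [roc_point(y_true, scores, t) for t in thresholds]
for t, (f, r) in zip(thresholds, manual):
    print("threshold={:.2f}  FPR={:.2f}  TPR={:.2f}".format(t, f, r))
```

Lowering the threshold can only add positive predictions, so both rates move from 0 toward 1 as the threshold decreases.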

The area under the ROC curve (AUC) is used to assess model quality. The higher the AUC, the better the model.
A high area under the curve means the classifier attains a high true positive rate (high recall) while keeping
the false positive rate low across the range of thresholds. Equivalently, the AUC is the probability that the
classifier assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative one.
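As a quick sanity check on what the AUC measures, the sketch below computes it three ways on four made-up scores (an assumption for illustration): as the trapezoidal area under the ROC points via `auc`, directly via `roc_auc_score`, and as the fraction of positive/negative pairs in which the positive instance receives the higher score:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and scores (illustrative, not the post's data).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# 1) Trapezoidal area under the (FPR, TPR) points.
fpr, tpr, _ = roc_curve(y_true, scores)
area = auc(fpr, tpr)

# 2) The convenience function used later in this post.
direct = roc_auc_score(y_true, scores)

# 3) Fraction of (positive, negative) pairs ranked correctly; ties count half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

print(area, direct, pairwise)  # all three equal 0.75
```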

Next we demonstrate how to plot ROC curves for each of our 4 classifiers and calculate their respective AUC scores.
AUC scores are generally between .5 and 1.0, with .5 representing random guessing, and 1.0 representing perfect classification
for all observations. As with everything else, scikit-learn makes obtaining these metrics as straightforward as possible:

```
# ==========================================
# Plot ROC curve for each classifier |
# ==========================================
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
fpr1, tpr1, thresholds1 = roc_curve(y_test, rf_p_hat, pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y_test, lr_p_hat, pos_label=1)
fpr3, tpr3, thresholds3 = roc_curve(y_test, nb_p_hat, pos_label=1)
fpr4, tpr4, thresholds4 = roc_curve(y_test, knn_p_hat, pos_label=1)
plt.plot(fpr1, tpr1, linewidth=2, label='Random Forest')
plt.plot(fpr2, tpr2, linewidth=2, label='Logistic Regression')
plt.plot(fpr3, tpr3, linewidth=2, label='Naive Bayes')
plt.plot(fpr4, tpr4, linewidth=2, label='kNN')
plt.plot([0,1], [0,1], '--')
plt.axis([-.05, 1.05, -.05, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.grid(True)
plt.legend(loc='lower right',prop={'size':20}, frameon=True)
plt.show()
```

Running this generates the following plot:

Then, to obtain each classifier’s AUC score, we run:

```
# ==========================================
# Print AUC score for each classifier |
# ==========================================
from sklearn.metrics import roc_auc_score
print("Random Forest AUC Score : {}".format(roc_auc_score(y_test, rf_p_hat)))
print("Logistic Regression AUC Score: {}".format(roc_auc_score(y_test, lr_p_hat)))
print("Naive Bayes AUC Score : {}".format(roc_auc_score(y_test, nb_p_hat)))
print("kNN AUC Score : {}".format(roc_auc_score(y_test, knn_p_hat)))
# Output:
# Random Forest AUC Score : 0.9174107142857144
# Logistic Regression AUC Score: 0.9241071428571428
# Naive Bayes AUC Score : 0.9508928571428571
# kNN AUC Score : 0.9419642857142858
```

In terms of AUC score, the Naive Bayes classifier performs best.

A convenient method of obtaining important classifier metrics is provided in `metrics.classification_report`. It takes as arguments the test set labels and the classifier’s class predictions for those observations, along with optional display names for the classes. `classification_report` then generates a table with precision, recall, f1-score and support:

```
# ==========================================
# Print classification_report |
# ==========================================
from sklearn.metrics import classification_report
# target_names follow the sorted label order: 0='Female', 1='Male'
print(classification_report(y_test, nb_y_hat, target_names=['Female', 'Male']))
# Output:
#              precision    recall  f1-score   support
#      Female       0.88      0.88      0.88        24
#        Male       0.89      0.89      0.89        28
# avg / total       0.88      0.88      0.88        52
```

Once a model has been fit… the work has only just begun! In order to implement and productionize well-tuned, fully functional machine learning models, a good deal of time needs to be spent not only in determining optimal hyperparameter settings, but also in considering the precision/recall tradeoff that makes the most sense for the problem domain in question.
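One way to explore that tradeoff is to vary the probability threshold at which an observation is labeled positive. The sketch below uses ten made-up labels and probabilities (an assumption for illustration, standing in for `y_test` and one of the `_p_hat` vectors) and reports precision and recall at three thresholds:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and positive-class probabilities (illustrative only).
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
p_hat = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Label an observation positive only if its probability clears the threshold.
    y_hat = (p_hat >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_hat),
                          recall_score(y_true, y_hat))
    print("threshold={:.1f}  precision={:.3f}  recall={:.3f}".format(
        threshold, *results[threshold]))
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision from about 0.71 to 1.0 while recall falls from 1.0 to 0.6. Which end of that range is preferable depends on whether false positives or false negatives are more costly in the problem at hand.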

We’ve only scratched the surface of what scikit-learn offers with respect to model assessment utilities.
Make an effort to explore the documentation (which is first rate), and make it a goal to internalize a new
aspect of the API you weren’t previously familiar with every time you visit. In no time at all, you’ll have
gained familiarity with the most important utilities, which you’ll be able to leverage to your advantage when
faced with the prospect of having to solve a problem with a non-obvious solution.

Until next time, happy coding!