Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the naive assumption of conditional independence between every pair of features given the class. The Gaussian Naive Bayes classifier makes two strong assumptions:
1) The value of a particular feature is independent of the value of any other feature, given the class variable.
2) Within each class, each continuous feature is assumed to follow a normal distribution.
What follows are the steps required to implement Gaussian Naive Bayes from scratch (a minimal code sketch appears after the list):
1) Ensure all explanatory variables are continuous: If the dataset contains categorical features, look into the Bernoulli or Multinomial form of Naive Bayes.
2) For each explanatory variable, estimate the mean and variance separately for each class.
3) To classify a new instance, calculate the posterior probability for each class. There will be as many posterior probabilities per unclassified instance as there are distinct classes.
4) The new instance will be classified based on the class having the greatest posterior probability.
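As an illustration of these steps, here is one minimal from-scratch sketch in Python/NumPy. The function names fit_gaussian_nb and predict_class are our own (not from any library), and the sample variance (ddof=1) is used to match the summary table presented in the next section.
# ===========================================================
# from-scratch Gaussian Naive Bayes (illustrative sketch)
# ===========================================================
import numpy as np

def fit_gaussian_nb(X, y):
    # Step 2: estimate per-class prior, mean and variance for each feature.
    params = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        params[cls] = {'prior': Xc.shape[0] / X.shape[0],
                       'mean': Xc.mean(axis=0),
                       'var': Xc.var(axis=0, ddof=1)}
    return params

def predict_class(x, params):
    # Steps 3-4: compute a posterior score per class, return the largest.
    scores = {}
    for cls, p in params.items():
        dens = np.exp(-(x - p['mean'])**2 / (2 * p['var'])) / np.sqrt(2 * np.pi * p['var'])
        scores[cls] = p['prior'] * np.prod(dens)
    return max(scores, key=scores.get)

# Training data from the admissions example below.
X = np.array([[3.14, 473], [3.22, 482], [2.96, 596], [3.28, 523],
              [2.72, 399], [2.85, 381], [2.51, 458], [2.36, 399]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

params = fit_gaussian_nb(X, y)
print(predict_class(np.array([2.90, 384]), params))   # expected: 0
print(predict_class(np.array([3.40, 431]), params))   # expected: 1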
Example Classification
Consider the following dataset representing business school admission decisions for a collection of applicants:
ID GPA GMAT ADMITTED_IND
000000001 3.14 473 1
000000002 3.22 482 1
000000003 2.96 596 1
000000004 3.28 523 1
000000005 2.72 399 0
000000006 2.85 381 0
000000007 2.51 458 0
000000008 2.36 399 0
In addition, we have two records that will be used to test the classifier:
ID GPA GMAT ADMITTED_IND
000000009 2.90 384 0
000000010 3.40 431 1
For each feature, we calculate the mean and variance for each level of the response (admitted/not admitted):
ADMITTED_IND GPA-mean GPA-variance GMAT-mean GMAT-variance
1 3.150 0.0193 518.50 3143.00
0 2.610 0.0474 409.25 1128.25
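As a quick check (a minimal sketch using pandas; the DataFrame simply mirrors the training data above), these per-class statistics can be reproduced with a groupby:
import pandas as pd

train = pd.DataFrame({
    'GPA': [3.14, 3.22, 2.96, 3.28, 2.72, 2.85, 2.51, 2.36],
    'GMAT': [473, 482, 596, 523, 399, 381, 458, 399],
    'ADMITTED_IND': [1, 1, 1, 1, 0, 0, 0, 0]
    })

# Per-class mean and sample variance for each explanatory variable.
print(train.groupby('ADMITTED_IND').agg(['mean', 'var']))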
In the sample dataset, the classes are balanced (four admitted and four not admitted), so we choose equiprobable priors: \(P(admit)=P(!admit)=0.50\). However, the prior probabilities need not be derived from the dataset of interest. They can be based on external data sources, such as admissions from prior years or from other institutions.
We now present the Naive Bayes classifier. Recall the general form of Bayes' theorem:
\[
P(A \mid B)=\frac{P(B \mid A)\,P(A)}{P(B)}
\]
The posterior probability for admitted is given by:
\[
P(admit \mid data)=\frac{P(admit)\,P(GPA \mid admit)\,P(GMAT \mid admit)}{P(data)}
\]
The posterior probability for non-admitted is given by:
\[
P(!admit \mid data)=\frac{P(!admit)\,P(GPA \mid !admit)\,P(GMAT \mid !admit)}{P(data)}
\]
Where:
- \(P(admit)\) and \(P(!admit)\) represent the prior probabilities, each 0.50 in this example.
- \(P(GPA|admit)P(GMAT|admit)\) represents the likelihood. The first assumption of Naive Bayes lets us treat GPA and GMAT as conditionally independent given the class, so their densities are simply multiplied.
- \(data\) is a stand-in for the GMAT and GPA values of a given instance; \(P(data)\) is the same for both classes, so it serves only as a normalizing constant.
The second assumption, which is specific to the Gaussian variant of Naive Bayes, is that each explanatory variable follows a normal distribution within each class. Thus, \(P(GMAT|admit)\) and \(P(GPA|admit)\) are calculated by passing the observation's GMAT score and GPA into the corresponding normal density functions, parameterized by the class-specific estimates of mean and variance determined above.
For the admitted class (response level 1):
\[
P(GMAT \mid admit)=\frac{1}{\sqrt{2\pi\sigma^{2}_{GMAT,1}}}\exp\!\left(-\frac{(GMAT-\mu_{GMAT,1})^{2}}{2\sigma^{2}_{GMAT,1}}\right),\qquad
P(GPA \mid admit)=\frac{1}{\sqrt{2\pi\sigma^{2}_{GPA,1}}}\exp\!\left(-\frac{(GPA-\mu_{GPA,1})^{2}}{2\sigma^{2}_{GPA,1}}\right)
\]
For the non-admitted class (response level 0):
\[
P(GMAT \mid !admit)=\frac{1}{\sqrt{2\pi\sigma^{2}_{GMAT,0}}}\exp\!\left(-\frac{(GMAT-\mu_{GMAT,0})^{2}}{2\sigma^{2}_{GMAT,0}}\right),\qquad
P(GPA \mid !admit)=\frac{1}{\sqrt{2\pi\sigma^{2}_{GPA,0}}}\exp\!\left(-\frac{(GPA-\mu_{GPA,0})^{2}}{2\sigma^{2}_{GPA,0}}\right)
\]
Calculation
We used the original eight observations (our training data) to compute \(\mu_{GMAT}\), \(\sigma^{2}_{GMAT}\), \(\mu_{GPA}\) and \(\sigma^{2}_{GPA}\) for each level of the response. Using these values, computing the probability of each response level for each variable is straightforward. First, consider the probability of admitted w.r.t. GMAT for ID=000000009:
\[
P(GMAT=384 \mid admit)=\frac{1}{\sqrt{2\pi(3143.00)}}\exp\!\left(-\frac{(384-518.50)^{2}}{2(3143.00)}\right)\approx 0.00040
\]
The probability of not admitted w.r.t. GMAT for ID=000000009:
\[
P(GMAT=384 \mid !admit)=\frac{1}{\sqrt{2\pi(1128.25)}}\exp\!\left(-\frac{(384-409.25)^{2}}{2(1128.25)}\right)\approx 0.00895
\]
Similarly for GPA, admitted:
\[
P(GPA=2.90 \mid admit)=\frac{1}{\sqrt{2\pi(0.0193)}}\exp\!\left(-\frac{(2.90-3.15)^{2}}{2(0.0193)}\right)\approx 0.569
\]
Not admitted:
\[
P(GPA=2.90 \mid !admit)=\frac{1}{\sqrt{2\pi(0.0474)}}\exp\!\left(-\frac{(2.90-2.61)^{2}}{2(0.0474)}\right)\approx 0.755
\]
Then, plugging values into the posterior expressions, the unnormalized class scores for ID=000000009 are given by:
\[
P(admit \mid data)\propto 0.50\times 0.569\times 0.00040\approx 0.000114
\]
\[
P(!admit \mid data)\propto 0.50\times 0.755\times 0.00895\approx 0.003379
\]
Normalizing so the two values sum to one gives \(P(admit \mid data)\approx 0.03\) and \(P(!admit \mid data)\approx 0.97\).
Thus, an individual with GPA=2.90 and GMAT=384 would very likely not be admitted according to the Gaussian Naive Bayes classifier.
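The same arithmetic can be verified in a few lines of Python (a minimal sketch using scipy.stats.norm; the variable names are ours):
from scipy.stats import norm

# Per-class priors, means and variances taken from the summary table above.
params = {1: {'prior': 0.50, 'gpa': (3.150, 0.0193), 'gmat': (518.50, 3143.00)},
          0: {'prior': 0.50, 'gpa': (2.610, 0.0474), 'gmat': (409.25, 1128.25)}}

gpa, gmat = 2.90, 384   # ID=000000009

# Unnormalized posterior score per class: prior * P(GPA|class) * P(GMAT|class).
# Note: norm.pdf expects the standard deviation, hence the square roots.
scores = {cls: p['prior']
               * norm.pdf(gpa, p['gpa'][0], p['gpa'][1] ** 0.5)
               * norm.pdf(gmat, p['gmat'][0], p['gmat'][1] ** 0.5)
          for cls, p in params.items()}

total = sum(scores.values())
print({cls: round(s / total, 4) for cls, s in scores.items()})
# roughly {1: 0.03, 0: 0.97}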
Gaussian Naive Bayes in scikit-learn
Implementing Gaussian Naive Bayes in scikit-learn is straightforward. Generally, after deciding on a particular model to use for a classification task, the next steps would be:
- Pre-process explanatory data (scale, impute, encode)
- Instantiate model
- Fit model to training data
- Predict classes on test/holdout data
Using the sample admissions data, we demonstrate how to carry out these steps with scikit-learn:
# ===========================================================
# scikit-learn implementation of Naive Bayes classifier
# ===========================================================
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
# Read dataset into pandas DataFrame.
df = pd.DataFrame({
'ID':['000000001','000000002','000000003','000000004',
'000000005','000000006','000000007','000000008'],
'GPA':[3.14,3.22,2.96,3.28,2.72,2.85,2.51,2.36],
'GMAT':[473,482,596,523,399,381,458,399],
'ADMITTED_IND':[1,1,1,1,0,0,0,0]
})
# Split data into design matrix (X) and response (y).
Xinit = df[['GPA', 'GMAT']].values
y = df['ADMITTED_IND'].values
# [1] Preprocess explanatory data - we use StandardScaler, which returns the
# features scaled with 0 mean and unit variance.
scl = StandardScaler()
X = scl.fit_transform(Xinit)
# [2] Fit model to training data - Instantiate model and call `fit` method.
clf = GaussianNB()
clf.fit(X, y)
# [3] Predict classes on holdout data.
# scale test data by calling scaler's `transform` method (not fit!).
pre_000000009 = scl.transform([[2.90, 384]])
pre_000000010 = scl.transform([[3.40, 431]])
obs_000000009 = clf.predict(pre_000000009)
obs_000000010 = clf.predict(pre_000000010)
print("000000009 actual admission status: 0; predicited status: {}".format(obs_000000009))
print("000000009 actual admission status: 1; predicited status: {}".format(obs_000000010))
# returns:
# 000000009 actual admission status: 0; predicted status: [0]
# 000000010 actual admission status: 1; predicted status: [1]
We see that the class labels predicted by our model agree with the actual labels in both cases.
From the Gaussian Naive Bayes classifier, we can access both the label predictions and the posterior probabilities for each class. For demonstration purposes, we generate the class predictions and probabilities associated with the training set of eight instances, but in practice you'll be interested in determining these metrics for the holdout dataset:
"""
Continuing with clf object from above.
"""
# Get model predicted classes.
y_hat = clf.predict(X) # array([1, 1, 1, 1, 0, 0, 0, 0], dtype=int64)
# Get model predicted probabilities.
p_hat = clf.predict_proba(X)[:,[1]]
Examining p_hat in the interpreter yields:
>>> p_hat
array([[ 0.997114],
[ 0.999608],
[ 1. ],
[ 0.999998],
[ 0.000098],
[ 0.002745],
[ 0.000001],
[ 0. ]])
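The same method applies to the holdout records from earlier (continuing with the clf and scl objects above); class 0 should dominate for ID=000000009 and class 1 for ID=000000010:
# Posterior probabilities for the two holdout applicants.
# Columns are ordered by clf.classes_, i.e. [0, 1].
print(clf.predict_proba(scl.transform([[2.90, 384]])))
print(clf.predict_proba(scl.transform([[3.40, 431]])))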