The Pleasure of Finding Things Out: A blog by James Triveri - Gradient Descent for Logistic Regression

Within the GLM framework, model coefficients are estimated using iterative reweighted least squares (IRLS), sometimes referred to as Fisher Scoring. This works well, but becomes inefficient as the size of the dataset increases: IRLS relies on the matrix of second partial derivatives which is expensive to compute. Instead, we can estimate logistic regression coefficients using gradient descent, which only relies on the first derivative of the cost function. This is much more efficient to compute, and generally provides good estimates once features have been standardized.

Expression for linear predictions:

\[ z = \boldsymbol{X} \boldsymbol{\theta} \]

Expression for predicted probabilities:

\[ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \]

Gradient descent:

\[ \theta_{t + 1} := \theta_{t} - \eta \nabla L(\hat{y}, y) \]

\(\eta\) is the learning rate.
\(\nabla L\) represents the gradients of the loss function.

Optimal parameters:

\[ \boldsymbol{\hat{\theta}} = \operatorname*{argmin}_\theta \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \]

The loss function is used to determine how much predictions differs from labels. Here we use binary cross entropy loss:

\[ L(\hat{y}, y) = -\frac{1}{m} \sum_{i=1}^{m} y \cdot \mathrm{log}(\hat{y}) + (1 - y) \cdot \mathrm{log}(1 - \hat{y}), \]

which we obtain from the log-likelihood of a single observation assumed to follow a binomial distribution. We flip the sign (the leading -) in order to minimize the loss. Note that \(\hat{y} = \sigma(z)\).

Expression for gradient of loss function:

\[ \nabla L(\hat{y}, y) = \frac{1}{m}(\hat{y} - y)x^{T} \]

We choose \(\boldsymbol{\theta}\) that maximizes the log-probability of the true y labels in the training data given the observations \(X\). The goal is to find the set of weights which minimize the loss function, averaged over all examples. For each variable \(\theta_{j}\) in \(\boldsymbol{\theta}\), the gradient will have a component that tells us the slope w.r.t. that variable.

The breast cancer dataset is loaded from scikit-learn, the features are standardized and model coefficients estimated using gradient descent. Results are then compared with statsmodels coefficient estimates using the same data to ensure correctness.

"""
Load breast cancer dataset from scikit-learn. Standardize features.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

data_loader = load_breast_cancer()
X_data, y_data = data_loader.data, data_loader.target
column_names = data_loader.feature_names
df = pd.DataFrame(X_data, columns=column_names)

keep_features = [
    'mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness'
    ]

X = df[keep_features].values
y = y_data[:,None]

# Standardize features. 
sclr = StandardScaler()
X = sclr.fit_transform(X)

# Add bias term.
bias = np.ones(X.shape[0])[:, None]
X = np.concatenate([bias, X], axis=1)

eta represents the learning rate and is a hyperparameter. num_epochs can be set as desired. For this example, we haven’t incorporated early stopping logic, but this would be straightforward to implement. We initialize the weight vector w to all 0s. w has the same length as the number of features + 1 for the intercept term.


eta = .50
num_epochs = 50000
w = np.zeros((X.shape[1], 1))
loss = []

for _ in range(num_epochs):
    z = X @ w
    ypred = 1. / (1. + np.exp(-z))
    L = -np.mean(y * np.log(ypred) + (1 - y) * np.log(1 - ypred))
    dL = np.mean(((ypred - y) * X), axis=0)[:, None]
    loss.append(L)        
    w-=eta * dL

print(f"w: {w.ravel()}")

w: [ -0.4592425   22.09479813  -1.56463209 -14.7402888  -14.68918055
  -1.66460167]

Estimating the coefficients using statsmodels yields:

import statsmodels.formula.api as smf

df2 = pd.DataFrame(X[:,1:], columns=keep_features)
df2["target"] = y
df2.columns = [ii.replace(" ", "_") for ii in df2.columns]
features = " + ".join([ii for ii in df2.columns if ii!="target"])
mdl_expr = "target ~ " + features
mdl = smf.logit(mdl_expr, data=df2).fit()
mdl.params

Optimization terminated successfully.
         Current function value: 0.148702
         Iterations 10

Intercept          -0.459246
mean_radius        22.094852
mean_texture       -1.564632
mean_perimeter    -14.740317
mean_area         -14.689213
mean_smoothness    -1.664601
dtype: float64

Which matches closely with the results produced via gradient descent.