# Gradient Descent for Logistic Regression

Implementing gradient descent to estimate logistic regression coefficients
Machine Learning
Python
Published

February 1, 2024

Within the GLM framework, model coefficients are estimated using iterative reweighted least squares (IRLS), sometimes referred to as Fisher Scoring. This works well, but becomes inefficient as the size of the dataset increases: IRLS relies on the matrix of second partial derivatives which is expensive to compute. Instead, we can estimate logistic regression coefficients using gradient descent, which only relies on the first derivative of the cost function. This is much more efficient to compute, and generally provides good estimates once features have been standardized.

Expression for linear predictions:

$z = \boldsymbol{X} \boldsymbol{\theta}$

Expression for predicted probabilities:

$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

$\theta_{t + 1} := \theta_{t} - \eta \nabla L(\hat{y}, y)$

• $$\eta$$ is the learning rate.
• $$\nabla L$$ represents the gradients of the loss function.

Optimal parameters:

$\boldsymbol{\hat{\theta}} = \operatorname*{argmin}_\theta \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$

The loss function is used to determine how much predictions differs from labels. Here we use binary cross entropy loss:

$L(\hat{y}, y) = -\frac{1}{m} \sum_{i=1}^{m} y \cdot \mathrm{log}(\hat{y}) + (1 - y) \cdot \mathrm{log}(1 - \hat{y}),$

which we obtain from the log-likelihood of a single observation assumed to follow a binomial distribution. We flip the sign (the leading -) in order to minimize the loss. Note that $$\hat{y} = \sigma(z)$$.

Expression for gradient of loss function:

$\nabla L(\hat{y}, y) = \frac{1}{m}(\hat{y} - y)x^{T}$

We choose $$\boldsymbol{\theta}$$ that maximizes the log-probability of the true y labels in the training data given the observations $$X$$. The goal is to find the set of weights which minimize the loss function, averaged over all examples. For each variable $$\theta_{j}$$ in $$\boldsymbol{\theta}$$, the gradient will have a component that tells us the slope w.r.t. that variable.

The breast cancer dataset is loaded from scikit-learn, the features are standardized and model coefficients estimated using gradient descent. Results are then compared with statsmodels coefficient estimates using the same data to ensure correctness.

"""
Load breast cancer dataset from scikit-learn. Standardize features.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(X_data, columns=column_names)

keep_features = [
'mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness'
]

X = df[keep_features].values
y = y_data[:,None]

# Standardize features.
sclr = StandardScaler()
X = sclr.fit_transform(X)

bias = np.ones(X.shape[0])[:, None]
X = np.concatenate([bias, X], axis=1)

eta represents the learning rate and is a hyperparameter. num_epochs can be set as desired. For this example, we haven’t incorporated early stopping logic, but this would be straightforward to implement. We initialize the weight vector w to all 0s. w has the same length as the number of features + 1 for the intercept term.


eta = .50
num_epochs = 50000
w = np.zeros((X.shape[1], 1))
loss = []

for _ in range(num_epochs):
z = X @ w
ypred = 1. / (1. + np.exp(-z))
L = -np.mean(y * np.log(ypred) + (1 - y) * np.log(1 - ypred))
dL = np.mean(((ypred - y) * X), axis=0)[:, None]
loss.append(L)
w-=eta * dL

print(f"w: {w.ravel()}")
w: [ -0.4592425   22.09479813  -1.56463209 -14.7402888  -14.68918055
-1.66460167]

Estimating the coefficients using statsmodels yields:

import statsmodels.formula.api as smf

df2 = pd.DataFrame(X[:,1:], columns=keep_features)
df2["target"] = y
df2.columns = [ii.replace(" ", "_") for ii in df2.columns]
features = " + ".join([ii for ii in df2.columns if ii!="target"])
mdl_expr = "target ~ " + features
mdl = smf.logit(mdl_expr, data=df2).fit()
mdl.params
Optimization terminated successfully.
Current function value: 0.148702
Iterations 10
Intercept          -0.459246
dtype: float64