In an earlier post, we covered methods used for determining parameter estimates for a Logistic Regression model. We explored an implementation of the Fisher Scoring algorithm, which, when used in concert with an exponential family distribution and its associated canonical link, yields identical results as Newton-Raphson. We then compared coefficient estimates produced by scikit-learn’s linear_model.LogisticRegression classifier, and after accounting regularization, were shown to be equal.
However, the dataset from which the coefficient estimates were derived had few explanatory variables. As the number of features in a dataset increases, the amount of time required to carry out a single step of the iteration will grow in proportion to a 3rd-degree polynomial. Recall that the Normal Equations associated with Generalized Linear Models are the same as the Normal Equations with the addition of a weighting term, which varies at each iteration. For Generalized Linear Models, model coefficients are obtained using

$$\hat{\beta} = (X^{T}WX)^{-1}X^{T}y$$

where $$X$$ is the design matrix, $$y$$ the response vector, and $$(X^{T}WX)^{-1}$$ the Information Matrix (which is used for statistical inference). Notice that the coefficient estimate requires inverting an n-by-n matrix at each iteration (where $$n$$ is the number of observations), which has $$O(n^{3})$$ complexity (see here for information regarding the complexity of common matrix operations). For datasets with many thousands of features, the Fisher/Newton-Raphson methods become prohibitively slow.
The method of gradient descent is also an iterative solver, but is far more scalable than Newton-Raphson. When using gradient descent, the learning rate, generally represented by $$\alpha$$, must be set not so large as to overshoot the optimum, but large enough so that the algorithm doesn’t take prohibitively long to locate it. The learning rate can be set to an optimal value using grid search, but for this post, we’ll assume that the selected learning rate is sufficient for the examples demonstrated.

The objective of gradient descent is to minimize a cost function, $$W(\theta)$$, by taking steps in proportion to the negative of the direction of the greatest rate of increase, since moving in the same direction as the gradient would result in identifying a local maximum. Let

$$W(\theta) = \frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2},$$

where $$n$$ is the number of observations, $$y$$ is the $$nx1$$ response vector and when the model of interest is linear, $$f(x) = \theta_{0} + x_{1}\theta_{1} + \cdots + x_{p-1}\theta_{p-1}$$. In matrix form, $$f(x) = X\theta$$, where $$X$$ is an $$nxp$$ matrix (with $$p$$ representing the number of model parameters including the intercept) and $$theta$$ a $$px1$$ vector, the product of which yields an $$n$$-length column vector. The expression for $$W(\theta)$$ is the Mean-Squared Error.

The update expression in matrix form can be written as

$$\theta^{(i+1)} = \theta^{(i)} - \alpha \nabla{W(\theta)},$$

where \nabla{W(\theta)} represents the gradient of $$W(\theta)$$, which is the multivariate generalization of the derivative. $$\nabla{W(\theta)}$$ is given by

$$\nabla{W(\theta)} = \frac{2}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})x_{i},$$

which is nothing more than the chain rule applied to $$W(\theta)$$. The final update expression becomes

$$\theta^{(i+1)} = \theta^{(i)} - \alpha \frac{2}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})x_{i},$$

where $$\alpha$$ represents the learning rate which controls the step size at each reevaluation. As mentioned previously, great care must go into selecting an appropriate learning rate for the model of interest. It is not uncommon for the difference between a learning rate that leads to convergence and a learning rate resulting in divergence to be as little as .01, where $$\alpha=.02$$ results in locating the optimum, but $$\alpha=.03$$ does not.

Also note that at each iteration, calculation occurs for all $$n$$ explanatory variables. The gradient descent algorithm described here is known as batch gradient descent, since the entire batch of data is reevaluated at each step. This can be inefficient for very large datasets, so if the dataset’s record count is exceedingly large, alternative implementations such as stochastic gradient descent will exhibit superior runtime performance over batch gardient descent (for this post we assume all datasets are reasonably sized, such that evaluation at each update is not computationally prohibitive).
Unless there’s a compelling reason to commence iteration elsewhere, $$\theta_{0}$$, the $$p$$ length column vector of model parameter coefficients the algorithm is attempting to estimate, will be initialized to 0, $$\theta^{T} = [0, 0, \cdots, 0]$$. This is akin to using a non-informative prior.

Evaluation and Convergence Diagnostics

One method used to determine if the algorithm is on the ‘right track’ is to evaluate $$W(\theta)$$ at each iteration.
$$W(\theta)$$ should decrease at each successive update, so if $$W(\theta)$$ is increasing, something has gone awry and the learning rate, $$\alpha$$ should be reevaluated.

In our implementation of batch gradient descent, we incorporate two mechanisms that serve as stopping criteria (at least one of which will always be triggered). The first terminates iteration in the event of a learning rate overspecification resulting in divergence. The second evaluates the absolute difference between $$\theta_{k+1}$$ and $$\theta_{k}$$: If $$|\theta_{k+1}-\theta_{k}|$$ is less than $$\epsilon$$ for $$[\theta_{0}, \theta_{1}, \cdots, \theta_{p-1}]$$, then cease iteration; otherwise proceed to the next evaluation.

In what follows, we demonstrate how to go about obtaining linear regression coefficient estimates using the gradient descent algorithm. We can compare the coefficient estimates returned by the gradient descent algorithm to those obtained from the Normal Equations, $$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$.

Implementation

To demonstrate, we estimate model parameters for the Boston Housing Datset using number of rooms (rm) to predict median home price (medv). The dataset can be obtained here.

"""
"""
import numpy as np

def gradient_descent(X, y, max_iters, learning_rate, epsilon):
"""
Determine regression coefficient estimates via gradient descent.

Function Parameters:
X             => nxp design matrix
y             => nx1 response vector
max_iters     => maximum number of gradient evaluations
learning_rate => rate of convergence
epsilon       => threshold below which iteration ceases

Returns:
cntr          => number of iterations carried out
'coeffs        => regression coefficient estimates
learning_rate => pass-thru of input parameter
epsilon       => pass-thru of input parameter
max_iters     => pass-thru of input parameter
"""
# Coerce X & y to matrix types (make y n-by-1) =>
Xm, ym = np.matrix(X), np.matrix(y).T

# Initialize initial coefficient vector to 0 =>
b_init = np.reshape(np.zeros(Xm.shape), (2,1))

# Initialize counter and coefficient =>
cntr  = 0
coeff = learning_rate * (2 / Xm.shape)

while cntr < max_iters:
# Evaluate b using b_init =>
b = b_init - coeff * (Xm.T * (Xm * b_init) - ym)
if all(np.abs(np.ravel(b) - np.ravel(b_init)) <= epsilon): break
b_init = b
cntr += 1

return({'cntr'         :cntr,
'coeffs'       :np.ravel(b),
'learning_rate':learning_rate,
'epsilon'      :epsilon,
'max_iters'    :max_iters})

As described in the docstring, the formal parameters to gradient_descent are:
X is the n-by-p design matrix.

y n-by-1 response vector.

max_iters is the maximum number of gradient evaluations.

learning_rate sets the rate of convergence.

epsilon is the threshold below which iteration ceases.

The function returns a dict, which contains max_iters, learning_rate and epsilon as pass-thru parameters, in addition to:
cntr - The number of iterations/evaluations performed/carried out

* coeffs - A numpy array which contains the estimated model coefficients

To begin, read the Boston Housing dataset into a pandas DataFrame, and pass the requisite arguments to gradient_descent. X and y will be obtained from the dataset. Set max_iters=50000, learning_rate=.02 and epsilon=.00001:

import numpy as np
import pandas as pd

# Read-in Housing dataset and extract rm for X and medv as y =>
X = dataset[['rm']]
y = dataset['medv'].values

# Insert intercept term =>
X.insert(0, 'INTERCEPT', 1)

# Call gradient_descent to estimate regression parameters =>
theta_hat = gradient_descent(X, y, max_iters=50000,learning_rate=.02, epsilon=.00001)

Printing the contents of theta_hat yields:

>>> print(theta_hat)
{'cntr': 15456,
'coeffs': array([-34.64984206,   9.09884246]),
'epsilon': 1e-05,
'learning_rate': 0.02,
'max_iters': 50000}

To determine the accuracy of the estimated coefficients returned by gradient_descent, we turn to the Normal Equations. Recall that a Linear Regression model can be represented using matrix notation as

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y,$$

where $$X$$ is the design matrix and $$y$$ the response vector. Calculating $$\hat{\beta}$$ is straightforward using numpy:

"""
Determining regression coefficients via the Normal Equations
"""
import numpy as np
import pandas as pd

# Read-in Housing dataset and extract rm for X and medv` as y =>
X = dataset[['rm']]
y = dataset['medv'].values
X.insert(0, 'INTERCEPT', 1)
Xm = np.matrix(X)
ym = np.matrix(y)

# Determine coefficients =>
beta_hat = np.ravel((Xm.T * Xm).I * X.T * y)

# Printing beta_hat from Python interpreter =>
>>> print(beta_hat)
array([-34.67062078,   9.10210898])

We see that out initial estimates ([-34.64984206, 9.09884246]) are pretty close to the parameter estimates obtained directly. However, by reducing epsilon, we can get closer to the directly-calculated solution:

"""
Demonstration of gradient descent coefficient estimates
converging to the solution obtained from the direct method
by epsilon reduction =>
"""

benchmark = np.ravel((Xm.T * Xm).I * X.T * y)   # array([-34.67062078,   9.10210898])

# epsilon=.00001
max_iters=50000,
learning_rate=.02,
epsilon=.00001)    # array([-34.64984206,   9.09884246])

# epsilon=.000001
max_iters=50000,
learning_rate=.02,
epsilon=.000001)   # array([-34.66854342,   9.10178241])

# epsilon=.0000001
max_iters=50000,
learning_rate=.02,
epsilon=.0000001)  # array([-34.67041299,   9.10207632])

# epsilon=.00000001
model_func=f1,
max_iters=50000,
learning_rate=.02,
epsilon=.00000001)  # array([-34.6706    ,   9.10210572])

# epsilon=.000000001
max_iters=50000,
learning_rate=.02,
epsilon=.000000001)  # array([-34.6706187 ,   9.10210865])

# Summary =>
# benchmark           array([-34.67062078,   9.10210898])
# epsilon=.00001      array([-34.64984206,   9.09884246])
# epsilon=.000001     array([-34.66854342,   9.10178241])
# epsilon=.0000001    array([-34.67041299,   9.10207632])
# epsilon=.00000001   array([-34.6706    ,   9.10210572])
# epsilon=.000000001  array([-34.6706187 ,   9.10210865])

For this example, the coefficient estimates were especially sensitive to the learning rate. With all inputs specified as before, increasing the learning_rate from .02 to .025 results in divergence:

>>> theta_hat = gradient_descent(X, y, model_func=f1, max_iters=50000,
learning_rate=.025, epsilon=.00001)
>>> print(theta_hat)
{'cntr': 50000,
'coeffs': array([ nan,  nan]),
'epsilon': 1e-05,
'learning_rate': 0.025,
'max_iters': 50000}

learning_rate and epsilon aren’t your typical ‘set-and-forget’ function parameters. Determine which values of learning_rate results in divergence, and scale back from there.