In an earlier post, we covered methods for determining parameter estimates for a Logistic Regression model. We explored an implementation of the Fisher Scoring algorithm, which, when used in concert with an exponential family distribution and its associated canonical link, yields results identical to Newton-Raphson. We then compared our coefficient estimates to those produced by scikit-learn’s *linear_model.LogisticRegression* classifier and, after accounting for regularization, found them to be equal.

However, the dataset from which the coefficient estimates were derived had few explanatory variables. As the number of features in a dataset increases, the time required to carry out a single step of the iteration grows cubically. Recall that the Normal Equations associated with Generalized Linear Models are the ordinary Normal Equations with the addition of a weighting term, which varies at each iteration. For Generalized Linear Models, model coefficients are obtained using

\[\theta_{k+1} = \theta_{k} + (X^{T}WX)^{-1}X^{T}(y - \mu)\]

where \(X\) is the design matrix, \(y\) the response vector, \(\mu\) the vector of fitted means, and \((X^{T}WX)^{-1}\) the *Information Matrix* (which is used for statistical inference). Notice that the coefficient estimate requires inverting a \(p \times p\) matrix at each iteration (where \(p\) is the number of model parameters), which has \(O(p^{3})\) complexity (see here for information regarding the complexity of common matrix operations). For datasets with many thousands of features, the Fisher/Newton-Raphson methods become prohibitively slow.

The method of gradient descent is also an iterative solver, but it is far more scalable than Newton-Raphson. When using gradient descent, the learning rate, generally represented by \(\alpha\), must be set not so large as to overshoot the optimum, but large enough that the algorithm doesn’t take prohibitively long to locate it. The learning rate can be set to an optimal value using grid search, but for this post, we’ll assume that the selected learning rate is sufficient for the examples demonstrated.

## Gradient Descent Algorithm

The objective of gradient descent is to minimize a cost function, \(W(\theta)\), by taking steps proportional to the *negative* of the gradient, the direction of the greatest rate of increase, since moving in the same direction as the gradient would instead lead toward a local maximum. Let

\[W(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big(y_{i} - f(x_{i})\big)^{2}\]

where \(n\) is the number of observations, \(y\) is the \(n \times 1\) response vector and, when the model of interest is linear, \(f(x) = \theta_{0} + x_{1}\theta_{1} + \cdots + x_{p-1}\theta_{p-1}\). In matrix form, \(f(x) = X\theta\), where \(X\) is an \(n \times p\) matrix (with \(p\) representing the number of model parameters, including the intercept) and \(\theta\) a \(p \times 1\) vector, the product of which yields an \(n\)-length column vector. The expression for \(W(\theta)\) is the *Mean-Squared Error*.
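As a quick numeric check, \(W(\theta)\) can be evaluated directly with NumPy. The snippet below is a standalone sketch with made-up toy values (the matrix `X`, response `y`, and candidate `theta` are illustrative, not taken from the housing data):

```python
import numpy as np

# Toy design matrix: intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.array([1.0, 1.0])  # candidate parameter vector

# W(theta) = (1/n) * sum((y_i - f(x_i))^2), with f(x) = X @ theta
W = np.mean((y - X @ theta) ** 2)
print(W)  # 0.0, since y = 1 + x exactly for this toy data
```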

The update expression in matrix form can be written as

\[\theta_{k+1} = \theta_{k} - \alpha\nabla{W(\theta_{k})}\]

where \(\nabla{W(\theta)}\) represents the *gradient* of \(W(\theta)\), which is the multivariate generalization of the derivative. \(\nabla{W(\theta)}\) is given by

\[\nabla{W(\theta)} = \frac{2}{n}X^{T}(X\theta - y)\]

which is nothing more than the chain rule applied to \(W(\theta)\). The final update expression becomes

\[\theta_{k+1} = \theta_{k} - \alpha\frac{2}{n}X^{T}(X\theta_{k} - y)\]

where \(\alpha\) represents the learning rate, which controls the step size at each reevaluation. As mentioned previously, great care must go into selecting an appropriate learning rate for the model of interest. It is not uncommon for the difference between a learning rate that leads to convergence and one resulting in divergence to be as little as .01: \(\alpha=.02\) may locate the optimum, while \(\alpha=.03\) does not.
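To see this sensitivity concretely, consider a toy one-parameter cost \(W(\theta) = \theta^{2}\) (an illustrative example, not the housing model): the gradient is \(2\theta\), so each update multiplies \(\theta\) by \(1 - 2\alpha\), and iteration converges only when that factor is less than 1 in absolute value:

```python
def run(alpha, theta=1.0, iters=50):
    # Gradient descent on W(theta) = theta**2; the gradient is 2*theta
    for _ in range(iters):
        theta = theta - alpha * 2 * theta
    return theta

print(abs(run(0.4)))  # tiny: |1 - 2*0.4| = 0.2 < 1, so iterates shrink toward 0
print(abs(run(1.1)))  # huge: |1 - 2*1.1| = 1.2 > 1, so iterates diverge
```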

Also note that at each iteration, the calculation runs over all \(n\) observations. The gradient descent algorithm described here is known as *batch gradient descent*, since the entire batch of data is reevaluated at each step. This can be inefficient for very large datasets, so if the dataset’s record count is exceedingly large, alternative implementations such as stochastic gradient descent will exhibit superior runtime performance over batch gradient descent (for this post we assume all datasets are reasonably sized, such that evaluation at each update is not computationally prohibitive).
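For illustration, here is a minimal sketch of the stochastic variant on synthetic data (the toy generating model and variable names are hypothetical, not part of this post’s implementation): each step updates the parameters using one randomly chosen observation rather than the full batch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from y = 1 + 2x plus a little noise
X = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.01, 200)

theta = np.zeros(2)
alpha = 0.1
for _ in range(20000):
    i = rng.integers(len(y))                 # pick a single random observation
    grad = 2 * X[i] * (X[i] @ theta - y[i])  # per-observation gradient
    theta = theta - alpha * grad

print(theta)  # approximately [1, 2], the true generating coefficients
```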

Unless there’s a compelling reason to commence iteration elsewhere, the initial estimate of \(\theta\), the \(p\)-length column vector of model coefficients the algorithm is attempting to estimate, will be initialized to 0, \(\theta^{T} = [0, 0, \cdots, 0]\). This is akin to using a non-informative prior.

### Evaluation and Convergence Diagnostics

One method used to determine if the algorithm is on the ‘right track’ is to evaluate \(W(\theta)\) at each iteration.

\(W(\theta)\) should decrease at each successive update, so if \(W(\theta)\) is increasing, something has gone awry and the learning rate, \(\alpha\), should be reevaluated.

In our implementation of batch gradient descent, we incorporate two mechanisms that serve as stopping criteria (at least one of which will always be triggered). The first terminates iteration in the event that an overspecified learning rate results in divergence. The second evaluates the absolute difference between \(\theta_{k+1}\) and \(\theta_{k}\): if \(|\theta_{k+1}-\theta_{k}|\) is less than \(\epsilon\) for every component of \([\theta_{0}, \theta_{1}, \cdots, \theta_{p-1}]\), iteration ceases; otherwise we proceed to the next evaluation.
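The second criterion is straightforward to express in NumPy. The snippet below is a standalone illustration with made-up coefficient values:

```python
import numpy as np

epsilon = 1e-5
theta_k  = np.array([-34.649842, 9.098842])  # previous iterate (made-up values)
theta_k1 = np.array([-34.649845, 9.098840])  # current iterate

# Stop when every |theta_{k+1} - theta_k| falls below epsilon
converged = np.all(np.abs(theta_k1 - theta_k) <= epsilon)
print(converged)  # True: both componentwise differences are below 1e-5
```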

In what follows, we demonstrate how to obtain linear regression coefficient estimates using the gradient descent algorithm. We can then compare the coefficient estimates returned by gradient descent to those obtained from the Normal Equations, \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\).

## Implementation

To demonstrate, we estimate model parameters for the Boston Housing Dataset using number of rooms (`rm`) to predict median home price (`medv`). The dataset can be obtained here.

```
"""
Gradient Descent Algorithm
"""
import numpy as np

def gradient_descent(X, y, max_iters, learning_rate, epsilon):
    """
    Determine regression coefficient estimates via gradient descent.
    The coefficient vector will start with all values initialized to 0.

    Function Parameters:
    `X`             => nxp design matrix
    `y`             => nx1 response vector
    `max_iters`     => maximum number of gradient evaluations
    `learning_rate` => rate of convergence
    `epsilon`       => threshold below which iteration ceases

    Returns:
    `cntr`          => number of iterations carried out
    `coeffs`        => regression coefficient estimates
    `learning_rate` => pass-thru of input parameter
    `epsilon`       => pass-thru of input parameter
    `max_iters`     => pass-thru of input parameter
    """
    # Coerce X & y to matrix types (make y n-by-1) =>
    Xm, ym = np.matrix(X), np.matrix(y).T

    # Initialize coefficient vector to 0 (one entry per column of X) =>
    b_init = np.zeros((Xm.shape[1], 1))
    b = b_init

    # Initialize counter and the constant multiplier alpha * (2 / n) =>
    cntr = 0
    coeff = learning_rate * (2 / Xm.shape[0])

    while cntr < max_iters:
        # Evaluate b using b_init: b = b_init - alpha * (2/n) * X'(Xb - y) =>
        b = b_init - coeff * Xm.T * (Xm * b_init - ym)
        if np.all(np.abs(np.ravel(b) - np.ravel(b_init)) <= epsilon):
            break
        b_init = b
        cntr += 1

    return({'cntr'         :cntr,
            'coeffs'       :np.ravel(b),
            'learning_rate':learning_rate,
            'epsilon'      :epsilon,
            'max_iters'    :max_iters})
```

As described in the docstring, the formal parameters to `gradient_descent` are:

* `X` is the n-by-p design matrix.
* `y` is the n-by-1 response vector.
* `max_iters` is the maximum number of gradient evaluations.
* `learning_rate` sets the rate of convergence.
* `epsilon` is the threshold below which iteration ceases.

The function returns a dict, which contains `max_iters`, `learning_rate` and `epsilon` as pass-thru parameters, in addition to:

* `cntr` - the number of iterations carried out
* `coeffs` - a numpy array which contains the estimated model coefficients

To begin, read the Boston Housing dataset into a pandas DataFrame, and pass the requisite arguments to `gradient_descent`. `X` and `y` will be obtained from the dataset. Set `max_iters=50000`, `learning_rate=.02` and `epsilon=.00001`:

```
import numpy as np
import pandas as pd

# Read-in Housing dataset and extract `rm` as X and `medv` as y =>
dataset = pd.read_csv("Boston.csv")
X = dataset[['rm']].copy()
y = dataset['medv'].values

# Insert intercept term =>
X.insert(0, 'INTERCEPT', 1)

# Call `gradient_descent` to estimate regression parameters =>
theta_hat = gradient_descent(X, y, max_iters=50000, learning_rate=.02, epsilon=.00001)
```

Printing the contents of `theta_hat` yields:

```
>>> print(theta_hat)
{'cntr': 15456,
'coeffs': array([-34.64984206, 9.09884246]),
'epsilon': 1e-05,
'learning_rate': 0.02,
'max_iters': 50000}
```

To determine the accuracy of the estimated coefficients returned by `gradient_descent`, we turn to the Normal Equations. Recall that a Linear Regression model can be represented using matrix notation as

\[y = X\beta + \epsilon\]

with least-squares solution

\[\hat{\beta} = (X^{T}X)^{-1}X^{T}y\]

where \(X\) is the design matrix and \(y\) the response vector. Calculating \(\hat{\beta}\) is straightforward using numpy:

```
"""
Determining regression coefficients via the Normal Equations
"""
import numpy as np
import pandas as pd

# Read-in Housing dataset and extract `rm` as X and `medv` as y =>
dataset = pd.read_csv("Boston.csv")
X = dataset[['rm']].copy()
y = dataset['medv'].values
X.insert(0, 'INTERCEPT', 1)
Xm = np.matrix(X)
ym = np.matrix(y).T

# Determine coefficients =>
beta_hat = np.ravel((Xm.T * Xm).I * Xm.T * ym)

# Printing beta_hat from Python interpreter =>
>>> print(beta_hat)
array([-34.67062078, 9.10210898])
```

We see that our initial estimates ([-34.64984206, 9.09884246]) are quite close to the parameter estimates obtained directly. However, by reducing `epsilon`, we can get closer to the directly-calculated solution:

```
"""
Demonstration of gradient descent coefficient estimates
converging to the solution obtained from the direct method
by epsilon reduction. Comments show the 'coeffs' entry of
the returned dict =>
"""
benchmark = np.ravel((Xm.T * Xm).I * Xm.T * ym)  # array([-34.67062078, 9.10210898])

# epsilon=.00001
beta_init = gradient_descent(X, y,
                             max_iters=50000,
                             learning_rate=.02,
                             epsilon=.00001)       # array([-34.64984206, 9.09884246])

# epsilon=.000001
beta_init = gradient_descent(X, y,
                             max_iters=50000,
                             learning_rate=.02,
                             epsilon=.000001)      # array([-34.66854342, 9.10178241])

# epsilon=.0000001
beta_init = gradient_descent(X, y,
                             max_iters=50000,
                             learning_rate=.02,
                             epsilon=.0000001)     # array([-34.67041299, 9.10207632])

# epsilon=.00000001
beta_init = gradient_descent(X, y,
                             max_iters=50000,
                             learning_rate=.02,
                             epsilon=.00000001)    # array([-34.6706    , 9.10210572])

# epsilon=.000000001
beta_init = gradient_descent(X, y,
                             max_iters=50000,
                             learning_rate=.02,
                             epsilon=.000000001)   # array([-34.6706187 , 9.10210865])

# Summary =>
# benchmark           array([-34.67062078, 9.10210898])
# epsilon=.00001      array([-34.64984206, 9.09884246])
# epsilon=.000001     array([-34.66854342, 9.10178241])
# epsilon=.0000001    array([-34.67041299, 9.10207632])
# epsilon=.00000001   array([-34.6706    , 9.10210572])
# epsilon=.000000001  array([-34.6706187 , 9.10210865])
```

For this example, the coefficient estimates were especially sensitive to the learning rate. With all inputs specified as before, increasing the `learning_rate` from .02 to .025 results in divergence:

```
>>> theta_hat = gradient_descent(X, y, max_iters=50000,
                                 learning_rate=.025, epsilon=.00001)
>>> print(theta_hat)
{'cntr': 50000,
 'coeffs': array([ nan, nan]),
 'epsilon': 1e-05,
 'learning_rate': 0.025,
 'max_iters': 50000}
```

`learning_rate` and `epsilon` aren’t your typical ‘set-and-forget’ function parameters. Determine which values of `learning_rate` result in divergence, and scale back from there.
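One way to automate that search, sketched here on a toy quadratic cost rather than the full housing example, is to scan a grid of candidate rates and keep the largest that still converges:

```python
import math

def final_theta(alpha, iters=200):
    # Gradient descent on the toy cost W(theta) = theta**2, starting at theta = 1
    theta = 1.0
    for _ in range(iters):
        theta -= alpha * 2 * theta
    return theta

# Scan candidate learning rates; a rate is stable if the final iterate
# is finite and has collapsed toward the optimum at 0
candidates = [0.1, 0.25, 0.5, 0.75, 0.9, 1.1, 1.5]
stable = [a for a in candidates
          if math.isfinite(final_theta(a)) and abs(final_theta(a)) < 1e-6]
print(max(stable))  # 0.9: the largest rate on this grid that converged
```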