The Pleasure of Finding Things Out: A blog by James Triveri

The Normal Equations, represented in matrix form as

\[ (X^{T}X)\hat{\beta} = X^{T}y \]

are used to determine the coefficient estimates of a linear regression model. The matrix form is a compact representation of the model specification, commonly written as

\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} + \varepsilon \]

where \(\varepsilon\) represents the error term. When the model includes an intercept, the fitted residuals \(\hat{\varepsilon}_{i}\) satisfy

\[ \sum_{i=1}^{n} \hat{\varepsilon}_{i} = 0. \]

For a dataset with \(n\) records and \(k\) explanatory variables per record, the components of the Normal Equations are:

\(\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)
\(X\), an \(n\) by \((k+1)\) matrix of explanatory variables, with the first column consisting entirely of 1's
\(y = (y_{1}, y_{2},\cdots,y_{n})^{T}\), the vector of \(n\) responses
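
To make the shapes concrete, here is a minimal sketch of how these components might be assembled with NumPy. The data values and the names `features`, `X` and `y` are made up for illustration and are not from any real dataset.

```python
import numpy as np

# Hypothetical toy data: n = 5 records, k = 2 explanatory variables.
features = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = np.array([6.1, 6.9, 13.2, 13.8, 18.1])  # response vector of length n

n, k = features.shape

# Prepend a column of ones so that the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), features])

print(X.shape)  # n by (k + 1)
print(y.shape)  # n responses
```

The column of ones is what lets the intercept be estimated together with the other coefficients in a single matrix equation.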

The task is to find the \((k+1)\) estimates \(\hat{\beta}_{0}, \hat{\beta}_{1},\cdots,\hat{\beta}_{k}\) that minimize

\[ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. \]

The Normal Equations can be derived using either Least-Squares or Maximum Likelihood Estimation.

Least-Squares Derivation

Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumption. For \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots ,\hat{\beta}_{k}\), we seek estimators that minimize the sum of squared deviations between the \(n\) response values and the predicted values, \(\hat{y}\). The objective is to minimize

\[ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. \]

Using matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Isolating the error term, replacing \(\beta\) with its estimate \(\hat{\beta}\), and squaring yields

\[ \hat \varepsilon^T \hat \varepsilon = (y - X\hat{\beta})^{T}(y - X\hat{\beta}). \]

Expanding the right-hand side and combining terms results in

\[ \hat \varepsilon^T \hat \varepsilon = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}. \]
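
For readers who want the intermediate step, the expansion can be written out term by term. The only fact used is that \(y^{T}X\hat{\beta}\) is a scalar and therefore equals its own transpose, \(\hat{\beta}^{T}X^{T}y\), so the two cross terms combine:

\[
\begin{aligned}
(y - X\hat{\beta})^{T}(y - X\hat{\beta}) &= y^{T}y - y^{T}X\hat{\beta} - \hat{\beta}^{T}X^{T}y + \hat{\beta}^{T}X^{T}X\hat{\beta} \\
&= y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}.
\end{aligned}
\]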

To find the value of \(\hat{\beta}\) that minimizes \(\hat \varepsilon^T \hat \varepsilon\), we differentiate \(\hat \varepsilon^T \hat \varepsilon\) with respect to \(\hat{\beta}\), and set the result to zero:

\[ \frac{\partial \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0 \]

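Rearranging gives the Normal Equations, \(X^{T}X\hat{\beta} = X^{T}y\), which can be solved for \(\hat{\beta}\) provided \(X^{T}X\) is invertible: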

\[ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y \]
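
The gradient expression used in this derivation can be sanity-checked numerically. The sketch below compares \(-2X^{T}y + 2X^{T}X\hat{\beta}\) against a central finite-difference approximation at an arbitrary point. The random data, the seed and the helper name `sse` are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

# Hypothetical random design matrix (with intercept column) and response.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta = rng.normal(size=k + 1)  # arbitrary point at which to evaluate the gradient


def sse(b):
    """Sum of squared errors (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r


# Analytic gradient from the derivation: -2 X^T y + 2 X^T X beta.
analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(k + 1)
])

grad_matches = np.allclose(analytic, numeric, rtol=1e-5, atol=1e-5)
print(grad_matches)
```

Because the objective is quadratic in the coefficients, central differences agree with the analytic gradient up to floating-point rounding.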

Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the Least-Squares Estimator.
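
As a hedged illustration, the estimator can be computed by solving the linear system \((X^{T}X)\hat{\beta} = X^{T}y\) directly, which avoids forming the inverse explicitly. The simulated data, the seed and the coefficient vector `true_beta` below are assumptions chosen only for the demonstration. NumPy's `np.linalg.lstsq` is used as an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 2

# Simulated data: intercept column plus k explanatory variables.
features = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), features])
true_beta = np.array([1.5, -2.0, 0.5])  # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve the Normal Equations (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are orthogonal to every column of X; the column of ones
# in particular implies that the residuals sum to zero.
residuals = y - X @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

In practice, routines based on QR or SVD factorizations (such as `np.linalg.lstsq`) are usually preferred when \(X^{T}X\) is ill-conditioned, but for a well-behaved design matrix the two approaches agree.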

Maximum Likelihood Derivation

For the Maximum Likelihood derivation, \(X\), \(y\) and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form

\[ y = X\beta + \varepsilon \]

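but now we assume the \(\varepsilon_{i}\) are i.i.d. and follow a zero-mean normal distribution. Writing \(x_{i}^{T}\) for the \(i\)-th row of \(X\), so that \(\varepsilon_{i} = y_{i} - x_{i}^{T}\beta\), the density is

\[ N(\varepsilon_{i}; 0, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{- \frac{(y_{i}-x_{i}^{T}\beta)^{2}}{2\sigma^{2}}}. \]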

Consequently, each response \(y_{i}\) is itself normally distributed, with mean equal to the \(i\)-th entry of \(X\beta\) and variance \(\sigma^{2}\). For \(n\) observations, the likelihood function is

\[ L(\beta) = \Big(\frac{1}{\sqrt{2\pi\sigma^{2}}}\Big)^{n} e^{-\frac{(y-X\beta)^{T}(y-X\beta)}{2\sigma^{2}}}. \]

The Log-Likelihood is then

\[ \mathrm{Ln}(L(\beta)) = -\frac{n}{2}\mathrm{Ln}(2\pi) -\frac{n}{2}\mathrm{Ln}(\sigma^{2})-\frac{1}{2\sigma^{2}}(y-X\beta)^{T}(y-X\beta). \]

Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields

\[ \frac{\partial \mathrm{Ln}(L(\beta))}{\partial \beta} = \frac{1}{\sigma^{2}}(X^{T}y - X^{T}X\beta) = 0. \]

Rearranging and solving for \(\beta\), we obtain

\[ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y, \]

which is the same result obtained via Least Squares.
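
The equivalence can also be checked numerically. The sketch below evaluates the log-likelihood at the Normal Equations solution and at randomly perturbed coefficient vectors, and confirms that the solution attains the larger value and that the derivative of the log-likelihood vanishes there. The simulated data, the seed and the choice to treat `sigma2` as known are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Simulated data with hypothetical coefficients and noise level.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=0.3, size=n)
sigma2 = 0.09  # error variance, treated as known here


def log_likelihood(beta):
    """Gaussian log-likelihood of beta for fixed sigma2."""
    r = y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - (r @ r) / (2 * sigma2))


# Candidate maximizer: the solution of the Normal Equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Derivative of the log-likelihood at beta_hat; should be numerically zero.
score = (X.T @ y - X.T @ X @ beta_hat) / sigma2

# Any perturbation of beta_hat should lower the log-likelihood.
perturbed = [beta_hat + 0.05 * rng.normal(size=3) for _ in range(20)]
is_maximum = all(log_likelihood(b) < log_likelihood(beta_hat) for b in perturbed)

print(np.allclose(score, 0.0, atol=1e-8), is_maximum)
```

Since \(\sigma^{2}\) only scales the quadratic term, the maximizing \(\beta\) does not depend on its value, which is why the Least-Squares and Maximum Likelihood estimates coincide.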