The Normal Equations, represented in matrix form as

$$ (X^{T}X)\hat{\beta} = X^{T}y $$

are used to determine the coefficient values associated with regression models. The matrix form is a compact representation of the model specification, which is commonly written as

$$ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} + \varepsilon $$

where \(\varepsilon\) represents the error term, assumed to have zero mean:

$$ E[\varepsilon_{i}] = 0. $$

(For a fitted model that includes an intercept term, the least-squares residuals \(\hat{\varepsilon}_{i}\) also sum to zero.)

For a dataset with \(n\) records and \(k\) explanatory variables per record, the components of the Normal Equations are:

  • \(\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)
  • \(X\), an \(n \times (k+1)\) matrix of explanatory variables, with the first column consisting entirely of 1’s
  • \(y = (y_{1}, y_{2},\ldots,y_{n})^{T}\), the vector of responses

The task is to solve for the \((k+1)\) coefficients \(\hat{\beta}_{0}, \hat{\beta}_{1},\ldots,\hat{\beta}_{k}\) that minimize

$$ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. $$
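
To make these components concrete, the sketch below builds a design matrix with a leading column of 1’s from synthetic data and evaluates the sum-of-squares objective for a candidate coefficient vector. The data, coefficient values, and variable names are arbitrary choices for the example.

```python
import numpy as np

# Synthetic data for illustration: n records, k explanatory variables.
rng = np.random.default_rng(0)
n, k = 100, 3
X_raw = rng.normal(size=(n, k))                    # raw explanatory variables
beta_true = np.array([2.0, -1.0, 0.5, 3.0])        # hypothetical intercept + k slopes

# Design matrix X: an n-by-(k+1) matrix whose first column is all 1's.
X = np.column_stack([np.ones(n), X_raw])
y = X @ beta_true + rng.normal(scale=0.5, size=n)  # response vector of length n

def sse(beta_hat, X, y):
    """Sum of squared residuals -- the objective to be minimized."""
    residuals = y - X @ beta_hat
    return np.sum(residuals ** 2)

print(sse(beta_true, X, y))  # objective evaluated at one candidate coefficient vector
```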

The Normal Equations can be derived using either Least-Squares or Maximum Likelihood Estimation. We’ll demonstrate both approaches.

Least-Squares Derivation

Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumptions. We seek estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots ,\hat{\beta}_{k}\) that minimize the sum of squared deviations between the \(n\) observed responses and the predicted values, \(\hat{y}_{i}\). The objective is to minimize

$$ \sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2. $$

Using the more compact matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Isolating the error term and forming its sum of squares yields

$$ \hat \varepsilon^T \hat \varepsilon = (y - X\hat{\beta})^{T}(y - X\hat{\beta}). $$

Expanding the right-hand side, and noting that the scalar \(\hat{\beta}^{T}X^{T}y\) equals its transpose \(y^{T}X\hat{\beta}\) so the two cross terms combine, results in

$$ \hat \varepsilon^T \hat \varepsilon = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}. $$
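
The expansion can be verified numerically for arbitrary inputs; here is a minimal sketch using synthetic values of \(X\), \(y\), and \(\hat{\beta}\):

```python
import numpy as np

# Check that the expanded quadratic form equals the compact residual form.
rng = np.random.default_rng(1)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta_hat = rng.normal(size=k + 1)

compact = (y - X @ beta_hat) @ (y - X @ beta_hat)
expanded = y @ y - 2 * (y @ X @ beta_hat) + beta_hat @ X.T @ X @ beta_hat

assert np.isclose(compact, expanded)
```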

To find the value of \(\hat{\beta}\) that minimizes \(\hat \varepsilon^T \hat \varepsilon\), we differentiate \(\hat \varepsilon^T \hat \varepsilon\) with respect to \(\hat{\beta}\), and set the result to zero:

$$ \frac{\partial \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0 $$

which can then be solved for \(\hat{\beta}\), provided \(X^{T}X\) is invertible:

$$ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y $$

Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the Least-Squares Estimator.
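
In code, \(\hat{\beta}\) is best obtained by solving the linear system \((X^{T}X)\hat{\beta} = X^{T}y\) rather than forming the inverse explicitly, which is the numerically preferable route. Below is a minimal NumPy sketch on synthetic data, cross-checked against NumPy's built-in least-squares routine:

```python
import numpy as np

# Solve the Normal Equations (X^T X) beta_hat = X^T y on synthetic data.
rng = np.random.default_rng(2)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(scale=0.3, size=n)

# Solve the linear system instead of computing (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# With an intercept column, the fitted residuals sum to zero.
residuals = y - X @ beta_hat
assert np.isclose(residuals.sum(), 0.0)
```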

Maximum Likelihood Derivation

For the Maximum Likelihood derivation, \(X\), \(y\) and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form

$$ y = X\beta + \varepsilon $$

but now we assume the \(\varepsilon_{i}\) are \(iid\) and follow a zero-mean normal distribution. Writing \(x_{i}^{T}\) for the \(i\)th row of \(X\), so that \(\varepsilon_{i} = y_{i} - x_{i}^{T}\beta\), each error has density

$$ N(\varepsilon_{i}; 0, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{- \frac{(y_{i}-x_{i}^{T}\beta)^{2}}{2\sigma^{2}}}. $$

It follows that each response \(y_{i}\) is normally distributed with mean \(x_{i}^{T}\beta\) and variance \(\sigma^{2}\). For the \(n\) independent observations, the likelihood function is

$$ L(\beta) = \Big(\frac{1}{\sqrt{2\pi\sigma^{2}}}\Big)^{n} e^{-(y-X\beta)^{T}(y-X\beta)/2\sigma^{2}}. $$

The log-likelihood is then

$$ \ln L(\beta) = -\frac{n}{2}\ln(2\pi) -\frac{n}{2}\ln(\sigma^{2})-\frac{1}{2\sigma^{2}}(y-X\beta)^{T}(y-X\beta). $$
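
This expression can be checked against a direct sum of the logs of the \(n\) individual normal densities; here is a minimal sketch with synthetic values of \(X\), \(y\), \(\beta\), and \(\sigma^{2}\):

```python
import numpy as np

# Check the closed-form log-likelihood against a sum of log-densities.
rng = np.random.default_rng(3)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = rng.normal(size=k + 1)
sigma2 = 0.8
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

resid = y - X @ beta
closed_form = (-0.5 * n * np.log(2 * np.pi)
               - 0.5 * n * np.log(sigma2)
               - resid @ resid / (2 * sigma2))

# Sum of the logs of the n individual N(0, sigma^2) error densities.
log_densities = -0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2)
assert np.isclose(closed_form, log_densities.sum())
```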

Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields

$$ \frac{\partial \ln L(\beta)}{\partial \beta} = -\frac{1}{2\sigma^{2}}\left(-2X^{T}y + 2X^{T}X\beta\right) = 0. $$

Rearranging and solving for \(\beta\), we obtain

$$ \hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y, $$

which is identical to the result obtained using the Least-Squares approach.
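
As a final check, the log-likelihood can be maximized numerically and compared with the closed-form solution. The sketch below uses scipy.optimize.minimize on the negative log-likelihood (SciPy is assumed to be available; \(\sigma^{2}\) is held fixed because the maximizing \(\beta\) does not depend on it):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the log-likelihood numerically and compare with the closed form.
rng = np.random.default_rng(4)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -2.0, 1.0]) + rng.normal(size=n)
sigma2 = 1.0  # held fixed; the maximizing beta is the same for any sigma^2 > 0

def neg_log_likelihood(beta):
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

closed_form = np.linalg.solve(X.T @ X, X.T @ y)
numeric = minimize(neg_log_likelihood, x0=np.zeros(k + 1)).x

assert np.allclose(closed_form, numeric, atol=1e-4)
```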