The Normal Equations, represented in matrix form as

\[X^{T}X\hat{\beta} = X^{T}y,\]

are used to determine the coefficient values associated with regression models. The matrix representation is a compact form of the model specification, which is commonly written as

\[y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{k}x_{ik} + \varepsilon_{i},\]

where \(\varepsilon_{i}\) represents the error term, and

\[
\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1k} \\
1 & x_{21} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}.
\]

For a dataset with \(n\) records by \(k\) explanatory variables per record, the components of the Normal Equations are:

- \(\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)
- \(X\) , an \(n\) by \((k+1)\)-dimensional matrix of explanatory variables, with the first column consisting entirely of 1’s
- \(y = (y_{1}, y_{2},...,y_{n})^{T}\), the vector of responses
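These components can be assembled directly in code. A minimal NumPy sketch, assuming a small synthetic dataset (the array names and dimensions are illustrative, not from the original text):

```python
import numpy as np

# Assumed synthetic dataset for illustration: n = 5 records,
# k = 2 explanatory variables per record.
n, k = 5, 2
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(n, k))   # n-by-k explanatory variables
y = rng.normal(size=n)            # response vector of length n

# Prepend a column of 1's for the intercept term, giving the
# n-by-(k+1) matrix X described above.
X = np.column_stack([np.ones(n), X_raw])

print(X.shape)   # (5, 3), i.e. n by (k + 1)
print(X[:, 0])   # first column is all 1's
```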

The task is to solve for the \((k+1)\) \(\hat{\beta}_{j}\)'s such that \(\hat{\beta}_{0}, \hat{\beta}_{1},...,\hat{\beta}_{k}\) minimize

\[\sum_{i=1}^{n}\left(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \cdots - \hat{\beta}_{k}x_{ik}\right)^{2}.\]

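The quantity being minimized can be written as a small helper function. A sketch, assuming \(X\) already carries the leading column of 1's (the function name `sse` and the tiny dataset are my own, for illustration):

```python
import numpy as np

# Sum-of-squared-errors objective: the quantity the coefficients
# must minimize. X is assumed to include the intercept column of 1's.
def sse(beta_hat, X, y):
    residuals = y - X @ beta_hat
    return float(residuals @ residuals)

# Tiny worked example: the line y = 2 + 3x fits these points exactly,
# so the SSE is 0 at (2, 3) and positive elsewhere.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([2.0, 5.0, 8.0])
print(sse(np.array([2.0, 3.0]), X, y))   # 0.0
print(sse(np.array([0.0, 0.0]), X, y))   # 2^2 + 5^2 + 8^2 = 93.0
```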
The Normal Equations can be derived using both Least-Squares and Maximum Likelihood Estimation. We'll demonstrate both approaches.

## Least-Squares Derivation

Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumption. We seek estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots ,\hat{\beta}_{k}\) that minimize the sum of squared deviations between the \(n\) observed responses and the predicted values, \(\hat{y}\). The objective is to minimize

\[\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}.\]

Using the more-compact matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Isolating and squaring the error term yields

\[\hat{\varepsilon}^{T}\hat{\varepsilon} = (y - X\hat{\beta})^{T}(y - X\hat{\beta}).\]

Expanding the right-hand side and combining terms (the cross terms \(y^{T}X\hat{\beta}\) and \(\hat{\beta}^{T}X^{T}y\) are equal, since each is a scalar and the transpose of the other) results in

\[\hat{\varepsilon}^{T}\hat{\varepsilon} = y^{T}y - 2\hat{\beta}^{T}X^{T}y + \hat{\beta}^{T}X^{T}X\hat{\beta}.\]

To find the value of \(\hat{\beta}\) that minimizes \(\hat{\varepsilon}^{T}\hat{\varepsilon}\), we differentiate \(\hat{\varepsilon}^{T}\hat{\varepsilon}\) with respect to \(\hat{\beta}\) and set the result to zero:

\[\frac{\partial\, \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0.\]

This yields the Normal Equations, \(X^{T}X\hat{\beta} = X^{T}y\), which can then be solved for \(\hat{\beta}\) (assuming \(X^{T}X\) is invertible):

\[\hat{\beta} = (X^{T}X)^{-1}X^{T}y.\]

Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the
*Least-Squares Estimator*.
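As a sanity check, the closed-form estimator can be compared against NumPy's own least-squares solver, which minimizes the same objective. A sketch on synthetic data (the dataset and names are assumed for illustration):

```python
import numpy as np

# Assumed synthetic dataset: n = 50 records, k = 3 explanatory variables.
rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

# Least-Squares Estimator via the Normal Equations: solve
# (X^T X) beta = X^T y as a linear system.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes ||y - X beta||^2 directly and should agree.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True
```

In practice, solving the linear system (or using a QR-based routine such as `np.linalg.lstsq`) is numerically safer than forming \((X^{T}X)^{-1}\) explicitly.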

## Maximum Likelihood Derivation

For the Maximum Likelihood derivation, \(X\), \(y\), and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form

\[y = X\beta + \varepsilon,\]

but now we assume the \(\varepsilon_{i}\) are \(iid\) and follow a zero-mean normal distribution:

\[\varepsilon_{i} \sim N(0, \sigma^{2}).\]

In addition, the responses, \(y_{i}\), are each assumed to follow a normal distribution with mean \(x_{i}^{T}\beta\) and variance \(\sigma^{2}\), where \(x_{i}^{T}\) denotes the \(i\)-th row of \(X\). For \(n\) observations, the likelihood function is

\[L(\beta, \sigma^{2}) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(y_{i} - x_{i}^{T}\beta)^{2}}{2\sigma^{2}}\right) = (2\pi\sigma^{2})^{-n/2}\exp\!\left(-\frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}}\right).\]

The Log-Likelihood is then

\[\ell(\beta, \sigma^{2}) = -\frac{n}{2}\ln(2\pi\sigma^{2}) - \frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}}.\]

Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields

\[\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^{2}}\left(X^{T}y - X^{T}X\beta\right) = 0.\]

Rearranging and solving for \(\beta\), we obtain

\[\hat{\beta} = (X^{T}X)^{-1}X^{T}y,\]

which is identical to the result obtained using the Least-Squares approach.
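The agreement can be checked numerically: at the closed-form \(\hat{\beta}\), the score \(X^{T}(y - X\beta)/\sigma^{2}\) vanishes for any value of \(\sigma^{2}\). A sketch on synthetic data (the dataset, noise level, and names are assumed for illustration):

```python
import numpy as np

# Assumed synthetic data generated from a known coefficient vector.
rng = np.random.default_rng(2)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form estimator from the Normal Equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The score is X^T(y - X beta) / sigma^2, so it vanishes at beta_hat
# regardless of sigma^2; checking X^T(y - X beta_hat) = 0 suffices.
score = X.T @ (y - X @ beta_hat)
print(np.allclose(score, 0.0, atol=1e-8))   # True
```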