The Normal Equations, represented in matrix form as

\[
(X^{T}X)\hat{\beta} = X^{T}y,
\]
are utilized in determining the coefficient values associated with regression models. The matrix representation is a compact form of the model specification, which is commonly represented as

\[
y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{k}x_{ik} + \varepsilon_{i},
\]

where \(\varepsilon_{i}\) represents the error term, and the \(\beta_{j}\) are the coefficients to be estimated.
For a dataset with \(n\) records and \(k\) explanatory variables per record, the components of the Normal Equations are:
- \(\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)
- \(X\), an \(n \times (k+1)\) matrix of explanatory variables, with the first column consisting entirely of 1's (corresponding to the intercept)
- \(y = (y_{1}, y_{2}, \ldots, y_{n})^{T}\), the vector of \(n\) observed responses
The task is to solve for the \((k+1)\) \(\beta_{j}\)'s such that \(\hat{\beta}_{0}, \hat{\beta}_{1}, \ldots, \hat{\beta}_{k}\) minimize the sum of squared errors

\[
SSE = \sum_{i=1}^{n}\big(y_{i} - \hat{y}_{i}\big)^{2},
\]

where \(\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i1} + \cdots + \hat{\beta}_{k}x_{ik}\) is the predicted value for record \(i\).
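As a concrete (and entirely hypothetical) illustration, the NumPy sketch below simulates a small dataset, builds the design matrix \(X\) with a leading column of 1's, and evaluates the sum-of-squares objective for a candidate coefficient vector. The variable names and simulated values are assumptions made for this example only.

```python
import numpy as np

# Hypothetical example: n = 50 records, k = 2 explanatory variables.
rng = np.random.default_rng(0)
n, k = 50, 2
explanatory = rng.normal(size=(n, k))               # raw explanatory variables
X = np.column_stack([np.ones(n), explanatory])      # n x (k+1); first column all 1's
beta_true = np.array([1.0, 2.0, -0.5])              # made-up intercept plus k slopes
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # responses with Gaussian noise

def sse(beta_hat, X, y):
    """Sum of squared deviations between observed and predicted responses."""
    residuals = y - X @ beta_hat
    return residuals @ residuals

print(sse(beta_true, X, y))  # SSE evaluated at the data-generating coefficients
```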
The Normal Equations can be derived using Least-Squares and Maximum Likelihood Estimation. We'll demonstrate both approaches.
Least-Squares Derivation
Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumptions. For \(\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots, \hat{\beta}_{k}\), we seek estimators that minimize the sum of squared deviations between the \(n\) observed responses and the predicted values, \(\hat{y}_{i}\). The objective is to minimize

\[
\sum_{i=1}^{n}\big(y_{i} - \hat{y}_{i}\big)^{2} = \sum_{i=1}^{n}\big(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \cdots - \hat{\beta}_{k}x_{ik}\big)^{2}.
\]
Using the more-compact matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Isolating and squaring the error term yields

\[
\hat{\varepsilon}^{T}\hat{\varepsilon} = (y - X\hat{\beta})^{T}(y - X\hat{\beta}).
\]
Expanding the right-hand side and combining terms (noting that \(y^{T}X\hat{\beta} = \hat{\beta}^{T}X^{T}y\), since each is a scalar) results in

\[
\hat{\varepsilon}^{T}\hat{\varepsilon} = y^{T}y - 2\hat{\beta}^{T}X^{T}y + \hat{\beta}^{T}X^{T}X\hat{\beta}.
\]
To find the value of \(\hat{\beta}\) that minimizes \(\hat{\varepsilon}^{T}\hat{\varepsilon}\), we differentiate \(\hat{\varepsilon}^{T}\hat{\varepsilon}\) with respect to \(\hat{\beta}\) and set the result to zero:

\[
\frac{\partial\, \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0,
\]
which can then be solved for \(\hat{\beta}\):

\[
X^{T}X\hat{\beta} = X^{T}y \quad\Rightarrow\quad \hat{\beta} = (X^{T}X)^{-1}X^{T}y,
\]

provided \(X^{T}X\) is invertible.
Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the Least-Squares Estimator.
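As a quick numerical check, here is a minimal NumPy sketch, again on simulated data (everything outside the two linear-algebra calls is an assumption of the example), that solves the Normal Equations directly and compares the result with NumPy's built-in least-squares solver.

```python
import numpy as np

# Simulated data, as in the earlier sketch.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Least-Squares estimator via the Normal Equations: solve (X^T X) beta = X^T y.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))  # True
```

In practice, solving the linear system (or calling a least-squares routine directly) is preferred over explicitly forming \((X^{T}X)^{-1}\), both for numerical stability and efficiency.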
Maximum Likelihood Derivation
For the Maximum Likelihood derivation, \(X\), \(y\) and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form

\[
y = X\beta + \varepsilon,
\]
but now we assume the \(\varepsilon_{i}\) are \(iid\) and follow a zero-mean normal distribution:

\[
\varepsilon_{i} \overset{iid}{\sim} N(0, \sigma^{2}).
\]
As a consequence, the responses \(y_{i}\) are each normally distributed with mean \(x_{i}^{T}\beta\) and variance \(\sigma^{2}\), where \(x_{i}^{T}\) denotes the \(i\)-th row of \(X\). For \(n\) observations, the likelihood function is

\[
L(\beta, \sigma^{2}) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(y_{i} - x_{i}^{T}\beta)^{2}}{2\sigma^{2}}\right) = (2\pi\sigma^{2})^{-n/2}\exp\!\left(-\frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}}\right).
\]
The Log-Likelihood is then

\[
\ell(\beta, \sigma^{2}) = \ln L(\beta, \sigma^{2}) = -\frac{n}{2}\ln\!\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}(y - X\beta)^{T}(y - X\beta).
\]
Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields

\[
\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^{2}}\left(X^{T}y - X^{T}X\beta\right) = 0.
\]
Rearranging and solving for \(\beta\), we obtain

\[
\hat{\beta} = (X^{T}X)^{-1}X^{T}y,
\]
which is identical to the result obtained using the Least-Squares approach.
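To see the equivalence numerically, the sketch below maximizes the Gaussian log-likelihood with a general-purpose optimizer (SciPy's minimize, with \(\sigma\) parameterized on the log scale so it stays positive) and compares the resulting coefficients with the Normal-Equations solution. The data, starting values, and tolerances are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data, as in the earlier sketches.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

def neg_log_likelihood(params, X, y):
    """Negative Gaussian log-likelihood; params = (beta_0, ..., beta_k, log_sigma)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2.0 * np.pi * sigma2) + (resid @ resid) / (2.0 * sigma2)

# Maximize the likelihood by minimizing its negative.
start = np.zeros(X.shape[1] + 1)
mle = minimize(neg_log_likelihood, start, args=(X, y), method="BFGS")

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)          # Least-Squares / Normal Equations
print(np.allclose(mle.x[:-1], beta_ls, atol=1e-4))   # True: the estimates coincide
```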