The Normal Equations, represented in matrix form as
\[(X^{T}X)\hat{\beta} = X^{T}y,\]
are used to determine the coefficient values of multiple linear regression models. The matrix representation is a compact form of the full model specification, which is commonly written as
\[y = \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{k}x_{k} + \varepsilon,\]
where \(\varepsilon\) represents the error term, and the errors are assumed to have mean zero:
\[E[\varepsilon] = 0.\]
For a dataset with \(n\) records by \(k\) explanatory variables per record, the components of the Normal Equations are:

\(\hat{\beta} = (\hat{\beta}_{0}, \hat{\beta}_{1},\dots,\hat{\beta}_{k})^{T}\), a vector of \((k+1)\) coefficients (one for each of the \(k\) explanatory variables plus one for the intercept term)

\(X\), an \(n\) by \((k+1)\) matrix of explanatory variables, with the first column consisting entirely of 1’s

\(y = (y_{1}, y_{2},\dots,y_{n})^{T}\), the vector of \(n\) observed responses
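To make the shapes of these components concrete, here is a minimal sketch that builds \(X\) and \(y\) with NumPy. The data values and the names `features`, `X`, and `y` are assumptions made up for illustration; they are not taken from any particular dataset.

```python
import numpy as np

# Hypothetical toy data: n = 5 records, k = 2 explanatory variables.
features = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])
y = np.array([3.1, 3.9, 6.2, 9.0, 9.8])

n, k = features.shape

# Prepend a column of ones so that the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), features])

print(X.shape)  # n by (k + 1)
print(y.shape)  # n responses
```

With this layout, the coefficient vector has \(k + 1 = 3\) entries, matching the number of columns of `X`.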
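The task is to find the \((k+1)\) estimates \(\hat{\beta}_{0}, \hat{\beta}_{1},\dots,\hat{\beta}_{k}\) that minimize
\[\sum_{i=1}^{n}\hat{\varepsilon}_{i}^{2} = \sum_{i=1}^{n}\left(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \cdots - \hat{\beta}_{k}x_{ik}\right)^{2},\]
where \(x_{ij}\) is the value of the \(j\)-th explanatory variable for record \(i\).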
The Normal Equations can be derived using either Least-Squares or Maximum Likelihood Estimation. We’ll demonstrate both approaches.
Least-Squares Derivation
An advantage of the Least-Squares approach is that no distributional assumption is necessary (unlike Maximum Likelihood Estimation). For \(\hat{\beta}_{0}, \hat{\beta}_{1},\dots,\hat{\beta}_{k}\), we seek estimators that minimize the sum of squared deviations between the \(n\) observed responses, \(y_{i}\), and the predicted values, \(\hat{y}_{i}\). The objective is to minimize
\[\sum_{i=1}^{n}\hat{\varepsilon}_{i}^{2} = \sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}.\]
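Using the more-compact matrix notation, our model can be represented as \(y = X\beta + \varepsilon\). Replacing \(\beta\) with its estimate \(\hat{\beta}\), isolating the residual vector \(\hat{\varepsilon} = y - X\hat{\beta}\), and squaring it yields
\[\hat{\varepsilon}^{T}\hat{\varepsilon} = (y - X\hat{\beta})^{T}(y - X\hat{\beta}).\]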
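Expanding the right-hand side and combining terms (the two cross terms \(y^{T}X\hat{\beta}\) and \(\hat{\beta}^{T}X^{T}y\) are scalars and therefore equal) results in
\[\hat{\varepsilon}^{T}\hat{\varepsilon} = y^{T}y - 2\hat{\beta}^{T}X^{T}y + \hat{\beta}^{T}X^{T}X\hat{\beta}.\]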
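To find the value of \(\hat{\beta}\) that minimizes \(\hat \varepsilon^T \hat \varepsilon\), we differentiate \(\hat \varepsilon^T \hat \varepsilon\) with respect to \(\hat{\beta}\), and set the result to zero:
\[\frac{\partial\, \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0.\]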
Provided \(X^{T}X\) is invertible, this can then be solved for \(\hat{\beta}\):
\[\hat{\beta} = (X^{T}X)^{-1}X^{T}y.\]
Since \(\hat{\beta}\) minimizes the sum of squares, \(\hat{\beta}\) is called the Least-Squares Estimator.
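As a numerical check on the Least-Squares result, the sketch below solves \((X^{T}X)\hat{\beta} = X^{T}y\) with NumPy on synthetic data. The data, the seed, and the values in `true_beta` are assumptions chosen purely for illustration. The system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: n = 200 records, k = 3 explanatory variables.
n, k = 200, 3
features = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), features])   # n by (k + 1), first column all ones
true_beta = np.array([1.5, -2.0, 0.5, 3.0])   # assumed values, for illustration only
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve (X^T X) beta_hat = X^T y directly instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of the sum of squares, -2 X^T y + 2 X^T X beta_hat,
# should vanish (up to floating-point error) at the solution.
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(gradient, 0.0, atol=1e-8))
print(np.allclose(beta_hat, beta_lstsq))
```

The estimates should land close to the assumed `true_beta`, with the gap driven by the simulated noise.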
Maximum Likelihood Derivation
For the Maximum Likelihood derivation, \(X\), \(y\) and \(\hat{\beta}\) are the same as described in the Least-Squares derivation, and the model still follows the form
\[y = X\beta + \varepsilon,\]
but here we assume the \(\varepsilon_{i}\) are independent and identically distributed (iid) and follow a zero-mean normal distribution:
\[\varepsilon_{i} \sim N(0, \sigma^{2}).\]
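It follows that each response \(y_{i}\) is normally distributed with mean \(x_{i}^{T}\beta\) and variance \(\sigma^{2}\) (the error variance), where \(x_{i}^{T}\) denotes the \(i\)-th row of \(X\). For \(n\) independent observations, the likelihood function is
\[L(\beta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(y_{i} - x_{i}^{T}\beta)^{2}}{2\sigma^{2}}\right) = (2\pi\sigma^{2})^{-n/2}\exp\left(-\frac{1}{2\sigma^{2}}(y - X\beta)^{T}(y - X\beta)\right).\]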
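\(\ln L(\beta)\), the log-likelihood, is therefore
\[\ln L(\beta) = -\frac{n}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}(y - X\beta)^{T}(y - X\beta).\]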
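Taking derivatives with respect to \(\beta\) and setting the result equal to zero yields
\[\frac{\partial \ln L(\beta)}{\partial \beta} = \frac{1}{\sigma^{2}}\left(X^{T}y - X^{T}X\beta\right) = 0.\]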
Upon rearranging and solving for \(\beta\), we obtain
\[\hat{\beta} = (X^{T}X)^{-1}X^{T}y,\]
which is identical to the result obtained from the LeastSquares approach.
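The agreement between the two approaches can also be checked numerically. The sketch below maximizes the normal log-likelihood by plain gradient ascent and compares the result with the closed-form solution. It assumes the error variance `sigma2` is known, and the synthetic data, seed, step-size rule, and iteration count are illustrative choices rather than part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data: n = 200 records, k = 2 explanatory variables.
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
true_beta = np.array([0.5, 2.0, -1.0])   # assumed values, for illustration only
sigma2 = 0.25                            # error variance, treated as known here
y = X @ true_beta + rng.normal(scale=np.sqrt(sigma2), size=n)


def log_likelihood(beta):
    """Normal log-likelihood of beta for fixed sigma2."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)


def score(beta):
    """Gradient of the log-likelihood with respect to beta."""
    return X.T @ (y - X @ beta) / sigma2


# Gradient ascent. The step size is scaled by the largest eigenvalue of X^T X
# so that each update stays stable.
step = sigma2 / np.linalg.eigvalsh(X.T @ X).max()
beta_mle = np.zeros(k + 1)
for _ in range(2000):
    beta_mle = beta_mle + step * score(beta_mle)

# Closed-form solution of the Normal Equations for comparison.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_mle)
print(beta_ols)
print(np.allclose(beta_mle, beta_ols, atol=1e-6))
```

Because \(\sigma^{2}\) only rescales the gradient, the maximizer over \(\beta\) does not depend on its value, which is why the two estimates coincide.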