# Derivation of the Normal Equations

Derivation of the Normal Equations via least squares and maximum likelihood
Statistical Modeling
Published

February 1, 2024

The Normal Equations, represented in matrix form as

$(X^{T}X)\hat{\beta} = X^{T}y$

are used to determine the coefficient estimates of linear regression models. The matrix form is a compact representation of the model specification, commonly written as

$y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} + \varepsilon$

where $$\varepsilon$$ represents the error term, which is assumed to have zero mean:

$E[\varepsilon_{i}] = 0.$

For a dataset with $$n$$ records and $$k$$ explanatory variables per record, the components of the Normal Equations are:

• $$\hat{\beta} = (\hat{\beta}_{0},\hat{\beta}_{1},\cdots,\hat{\beta}_{k})^{T}$$, a vector of $$(k+1)$$ coefficients (one for each of the $$k$$ explanatory variables plus one for the intercept term)
• $$X$$, an $$n \times (k+1)$$ matrix of explanatory variables, with the first column consisting entirely of 1’s
• $$y = (y_{1}, y_{2},\cdots,y_{n})^{T}$$, the vector of $$n$$ responses

The task is to solve for the $$(k+1)$$ $$\beta_{j}$$’s such that $$\hat{\beta}_{0}, \hat{\beta}_{1},...,\hat{\beta}_{k}$$ minimize

$\sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2.$
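To make the components concrete, here is a minimal NumPy sketch that builds $$X$$ with its leading column of 1’s and evaluates the sum of squared residuals for a candidate coefficient vector. The helper names `design_matrix` and `sum_squared_residuals` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def design_matrix(features):
    """Prepend a column of ones (the intercept) to an n-by-k feature array."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])


def sum_squared_residuals(X, y, beta):
    """Sum of squared residuals for a candidate coefficient vector beta."""
    residuals = y - X @ beta
    return float(residuals @ residuals)


# Hypothetical toy data: n = 4 records, k = 2 explanatory variables.
# The responses were constructed as y = 1 + x1 + 2 * x2 with no noise.
features = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

X = design_matrix(features)
beta_candidate = np.array([1.0, 1.0, 2.0])

print(X.shape)                                       # prints (4, 3)
print(sum_squared_residuals(X, y, beta_candidate))   # prints 0.0
```

Because the toy responses contain no noise, the candidate that generated them gives a sum of exactly zero. Any other candidate gives a strictly larger value.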

The Normal Equations can be derived using either Least-Squares or Maximum Likelihood estimation.

### Least-Squares Derivation

Unlike the Maximum Likelihood derivation, the Least-Squares approach requires no distributional assumption. For $$\hat{\beta}_{0}, \hat{\beta}_{1}, \cdots ,\hat{\beta}_{k}$$, we seek estimators that minimize the sum of squared deviations between the $$n$$ response values and the predicted values, $$\hat{y} = X\hat{\beta}$$. The objective is to minimize

$\sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2.$

Using matrix notation, our model can be represented as $$y = X\beta + \varepsilon$$. Isolating the error term, replacing $$\beta$$ with its estimate $$\hat{\beta}$$, and squaring yields

$\hat \varepsilon^T \hat \varepsilon = (y - X\hat{\beta})^{T}(y - X\hat{\beta}).$

Expanding the right-hand side and combining terms results in

$\hat \varepsilon^T \hat \varepsilon = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}.$
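As a supplementary step, the expansion can be written out term by term. Because $$\hat{\beta}^{T}X^{T}y$$ is a scalar, it equals its own transpose $$y^{T}X\hat{\beta}$$, so the two cross terms combine:

$(y - X\hat{\beta})^{T}(y - X\hat{\beta}) = y^{T}y - y^{T}X\hat{\beta} - \hat{\beta}^{T}X^{T}y + \hat{\beta}^{T}X^{T}X\hat{\beta} = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}.$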

To find the value of $$\hat{\beta}$$ that minimizes $$\hat \varepsilon^T \hat \varepsilon$$, we differentiate $$\hat \varepsilon^T \hat \varepsilon$$ with respect to $$\hat{\beta}$$, and set the result to zero:

$\frac{\partial \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0.$
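One way to gain confidence in the gradient expression is to compare it against a finite-difference approximation. The sketch below assumes randomly generated data and an arbitrary evaluation point. The seed, the sizes, and the step size `h` are illustrative choices and are not part of the derivation.

```python
import numpy as np

# Hypothetical random data: n = 20 records, k = 3 explanatory variables.
rng = np.random.default_rng(0)
n, k = 20, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta = rng.normal(size=k + 1)  # arbitrary point at which to check the gradient


def objective(b):
    """Sum of squared residuals (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r


# Analytic gradient from the derivation: -2 X^T y + 2 X^T X beta.
analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.zeros_like(beta)
for j in range(beta.size):
    step = np.zeros_like(beta)
    step[j] = h
    numeric[j] = (objective(beta + step) - objective(beta - step)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # expected to be tiny, limited only by rounding error
```

Since the objective is quadratic in the coefficients, central differences carry no truncation error, so any remaining discrepancy comes from floating-point rounding.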

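Dividing by 2 and rearranging recovers the Normal Equations, $$X^{T}X\hat{\beta} = X^{T}y$$, which can be solved for $$\hat{\beta}$$ provided $$X^{T}X$$ is invertible: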

$\hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y$

Since $$\hat{\beta}$$ minimizes the sum of squares, $$\hat{\beta}$$ is called the Least-Squares Estimator.
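The closed-form solution can be sketched in NumPy as follows. The simulated data, the "true" coefficients, and the noise scale are assumptions made for illustration. The system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is a common numerical preference and not something the derivation requires. `np.linalg.lstsq` serves as an independent reference.

```python
import numpy as np

# Hypothetical simulated data: n = 50 records, k = 2 explanatory variables.
rng = np.random.default_rng(42)
n, k = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
true_beta = np.array([1.5, -2.0, 0.5])           # assumed for the simulation
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve (X^T X) beta_hat = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Independent reference: NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the solution the residuals are orthogonal to every column of X.
residuals = y - X @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
print(np.allclose(X.T @ residuals, 0.0, atol=1e-8))
```

Because the first column of $$X$$ is all 1’s, orthogonality of the residuals to that column means the residuals sum to zero, up to rounding.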

### Maximum Likelihood Derivation

For the Maximum Likelihood derivation, $$X$$, $$y$$ and $$\hat{\beta}$$ are the same as described in the Least-Squares derivation, and the model still follows the form

$y = X\beta + \varepsilon$

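but now we assume the $$\varepsilon_{i}$$ are independent and identically distributed (i.i.d.) and follow a zero-mean normal distribution with variance $$\sigma^{2}$$. Writing $$x_{i}^{T}$$ for the $$i$$-th row of $$X$$, so that $$\varepsilon_{i} = y_{i} - x_{i}^{T}\beta$$, the density is

$N(\varepsilon_{i}; 0, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{- \frac{(y_{i}-x_{i}^{T}\beta)^{2}}{2\sigma^{2}}}.$

Consequently, each response $$y_{i}$$ is normally distributed with mean $$x_{i}^{T}\beta$$ and variance $$\sigma^{2}$$. For $$n$$ independent observations, the likelihood function is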

$L(\beta) = \Big(\frac{1}{\sqrt{2\pi\sigma^{2}}}\Big)^{n} e^{-(y-X\beta)^{T}(y-X\beta)/2\sigma^{2}}.$

The Log-Likelihood is then

$\mathrm{Ln}(L(\beta)) = -\frac{n}{2}\mathrm{Ln}(2\pi) -\frac{n}{2}\mathrm{Ln}(\sigma^{2})-\frac{1}{2\sigma^{2}}(y-X\beta)^{T}(y-X\beta).$
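As a sanity check on the log-likelihood expression, the sketch below evaluates it in matrix form and compares it with the sum of the logs of the individual normal densities. The function names, the simulated data, and the fixed value of `sigma2` are hypothetical and serve only as an illustration.

```python
import numpy as np


def log_likelihood(beta, sigma2, X, y):
    """Matrix form: -n/2 ln(2 pi) - n/2 ln(sigma^2) - (y-Xb)^T (y-Xb) / (2 sigma^2)."""
    n = y.shape[0]
    r = y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - (r @ r) / (2 * sigma2))


def log_likelihood_by_row(beta, sigma2, X, y):
    """Sum of the log normal densities, one observation (row of X) at a time."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        density = (np.exp(-(y_i - x_i @ beta) ** 2 / (2 * sigma2))
                   / np.sqrt(2 * np.pi * sigma2))
        total += np.log(density)
    return total


# Hypothetical simulated data: n = 30 records, k = 2 explanatory variables.
rng = np.random.default_rng(1)
n, k = 30, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
beta = np.array([0.5, 1.0, -1.0])
sigma2 = 1.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

matrix_value = log_likelihood(beta, sigma2, X, y)
row_value = log_likelihood_by_row(beta, sigma2, X, y)
print(np.isclose(matrix_value, row_value))
```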

Taking derivatives with respect to $$\beta$$ and setting the result equal to zero yields

$\frac{\partial \mathrm{Ln}(L(\beta))}{\partial \beta} = \frac{1}{\sigma^{2}}(X^{T}y - X^{T}X\beta) = 0.$

Rearranging and solving for $$\beta$$ we obtain

$\hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y,$

which is the same result obtained via Least Squares.
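The agreement between the two derivations can also be checked numerically. In the sketch below, the Least-Squares solution is computed from the Normal Equations, and the Gaussian log-likelihood is evaluated at that solution and at randomly perturbed coefficient vectors. The data, the fixed `sigma2`, and the perturbation scale are assumptions made for illustration.

```python
import numpy as np

# Hypothetical simulated data: n = 40 records, k = 2 explanatory variables.
rng = np.random.default_rng(7)
n, k = 40, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
sigma2 = 1.0  # treated as known and fixed for this check


def log_likelihood(beta):
    """Gaussian log-likelihood in matrix form for a given beta."""
    r = y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - (r @ r) / (2 * sigma2))


# Least-Squares estimate from the Normal Equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
best = log_likelihood(beta_hat)

# Evaluate the log-likelihood at 100 randomly perturbed coefficient vectors.
perturbed = [
    log_likelihood(beta_hat + rng.normal(scale=0.05, size=beta_hat.size))
    for _ in range(100)
]

is_maximum = all(best > value for value in perturbed)
print(is_maximum)
```

A random search like this does not prove optimality. It only illustrates that moving away from the Least-Squares solution lowers the log-likelihood, as the derivation predicts.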