The Normal Equations, represented in matrix form as

$$(X^{T}X)\hat{\beta} = X^{T}y$$

are used to determine the coefficient values of a multiple linear regression model. The matrix representation is a compact form of the full model specification, which is commonly written as

$$y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} + \varepsilon$$

where $$\varepsilon$$ represents the error term, assumed to have zero expectation:

$$E[\varepsilon_{i}] = 0.$$

(For the fitted model, the residuals themselves sum to zero whenever an intercept is included.)

For a dataset of $$n$$ records, each with $$k$$ explanatory variables, the components of the Normal Equations are:

• $$\hat{\beta} = (\hat{\beta}_{0}, \hat{\beta}_{1},...,\hat{\beta}_{k})^{T}$$, a vector of $$(k+1)$$ coefficients (one for each of the $$k$$ explanatory variables plus one for the intercept term)

• $${X}$$, an $$n$$ by $$(k+1)$$-dimensional matrix of explanatory variables, with the first column consisting entirely of 1’s

• $${y} = (y_{1}, y_{2},...,y_{n})^{T}$$, the vector of responses
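As a brief sketch of how these components might be assembled in practice (the dataset here is simulated purely for illustration), the design matrix $$X$$ gets a leading column of ones so that $$\hat{\beta}_{0}$$ acts as the intercept:

```python
import numpy as np

# Hypothetical dataset: n = 5 records, k = 2 explanatory variables.
rng = np.random.default_rng(0)
n, k = 5, 2
predictors = rng.normal(size=(n, k))

# Prepend a column of ones so beta_0 serves as the intercept term.
X = np.column_stack([np.ones(n), predictors])  # shape (n, k + 1)
print(X.shape)  # (5, 3)
```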

The task is to find the $$(k+1)$$ coefficient estimates $$\hat{\beta}_{0}, \hat{\beta}_{1},...,\hat{\beta}_{k}$$ that minimize

$$\sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2.$$

The Normal Equations can be derived using both Least-Squares and Maximum Likelihood Estimation. We’ll demonstrate both approaches.

## Least-Squares Derivation

An advantage of the Least-Squares approach is that no distributional assumption is necessary (unlike Maximum Likelihood Estimation). We seek the estimators $$\hat{\beta}_{0}, \hat{\beta}_{1},...,\hat{\beta}_{k}$$ that minimize the sum of squared deviations between the $$n$$ observed responses and the fitted values, $$\hat{y}$$. The objective is to minimize

$$\sum_{i=1}^{n} \hat{\varepsilon}^{2}_{i} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \hat{\beta}_{2}x_{i2} - \cdots - \hat{\beta}_{k}x_{ik})^2.$$

Using the more compact matrix notation, the model can be written as $$y = X\beta + \varepsilon$$. Isolating the residual vector and taking its inner product with itself yields

$$\hat \varepsilon^T \hat \varepsilon = (y - X\hat{\beta})^{T}(y - X\hat{\beta}).$$

Expanding the right-hand side and combining terms results in

$$\hat \varepsilon^T \hat \varepsilon = y^{T}y - 2y^{T}X\hat{\beta} + \hat{\beta}^{T}X^{T}X\hat{\beta}.$$

To find the value of $$\hat{\beta}$$ that minimizes $$\hat \varepsilon^T \hat \varepsilon$$, we differentiate $$\hat \varepsilon^T \hat \varepsilon$$ with respect to $$\hat{\beta}$$, and set the result to zero:

$$\frac{\partial \hat{\varepsilon}^{T}\hat{\varepsilon}}{\partial \hat{\beta}} = -2X^{T}y + 2X^{T}X\hat{\beta} = 0$$

Provided $$X^{T}X$$ is invertible (i.e., $$X$$ has full column rank), this can be solved for $$\hat{\beta}$$:

$$\hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y$$

Since $$\hat{\beta}$$ minimizes the sum of squares, $$\hat{\beta}$$ is called the Least-Squares Estimator.
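A minimal numerical sketch of the estimator, using NumPy on simulated data (the true coefficients below are assumed for illustration). Note that solving the linear system $$(X^{T}X)\hat{\beta} = X^{T}y$$ directly is preferred over forming the explicit inverse:

```python
import numpy as np

# Simulated data under an assumed true coefficient vector.
rng = np.random.default_rng(42)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the Normal Equations (X^T X) beta_hat = X^T y.
# np.linalg.solve is more stable than computing (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```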

## Maximum Likelihood Derivation

For the Maximum Likelihood derivation, $$X$$, $$y$$ and $$\hat{\beta}$$ are the same as described in the Least-Squares derivation, and the model still follows the form

$$y = X\beta + \varepsilon,$$

but here we assume the $$\varepsilon_{i}$$ are $$iid$$ and follow a zero-mean normal distribution:

$$N(\varepsilon_{i}; 0, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{- \frac{(y_{i}-x_{i}^{T}\beta)^{2}}{2\sigma^{2}}},$$

where $$x_{i}^{T}$$ denotes the $$i$$-th row of $$X$$. It follows that each response, $$y_{i}$$, is also normally distributed, with mean $$x_{i}^{T}\beta$$ and variance $$\sigma^{2}$$. For $$n$$ independent observations, the likelihood function is

$$L(\beta) = \Big(\frac{1}{\sqrt{2\pi\sigma^{2}}}\Big)^{n} e^{-(y-X\beta)^{T}(y-X\beta)/2\sigma^{2}}.$$

The log-likelihood, $$\ln L(\beta)$$, is therefore

$$\ln L(\beta) = -\frac{n}{2}\ln(2\pi) -\frac{n}{2}\ln(\sigma^{2})-\frac{1}{2\sigma^{2}}(y-X\beta)^{T}(y-X\beta).$$

Taking the derivative with respect to $$\beta$$ and setting the result equal to zero yields

$$\frac{\partial \ln L(\beta)}{\partial \beta} = \frac{1}{\sigma^{2}}\left(X^{T}y - X^{T}X\beta\right) = 0.$$

Upon rearranging and solving for $$\beta$$, we obtain

$$\hat{\beta} = {(X^{T}X)}^{-1}{X}^{T}y,$$

which is identical to the result obtained from the Least-Squares approach.
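As a quick numerical check (again on simulated data, with coefficients assumed purely for illustration), the score, i.e., the gradient of the log-likelihood with respect to $$\beta$$, vanishes at $$\hat{\beta}$$, and the fitted residuals sum to zero because the model includes an intercept:

```python
import numpy as np

# Simulated data with an assumed coefficient vector.
rng = np.random.default_rng(7)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The score is (X^T y - X^T X beta) / sigma^2; it is zero at the MLE
# (sigma^2 only scales it, so we check the bracketed term).
score = X.T @ y - X.T @ X @ beta_hat
print(np.allclose(score, 0.0, atol=1e-6))  # True

# With an intercept column, the fitted residuals sum to zero.
residuals = y - X @ beta_hat
print(abs(residuals.sum()) < 1e-6)  # True
```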