Deviance is used as a goodness-of-fit measure for generalized linear models, and more generally whenever parameters are estimated by maximum likelihood. It generalizes the residual sum of squares from ordinary least squares regression. The deviance of a fitted model compares the log-likelihood of the model of interest to the log-likelihood of the *saturated model*: the model with \(n\) parameters that fits the \(n\) observations perfectly, but has no ability to generalize.

In large samples, the deviance approximately follows a chi-square distribution with \(n - p\) degrees of freedom, where \(n\) is the number of observations and \(p\) the number of parameters in the model. The null hypothesis, \(H_{0}\), is that the model fits; the alternative, \(H_{1}\), is that it does not. A deviance much larger than \(n - p\) indicates a poor fit, and in general the smaller the deviance, the better the fit of the model. Let:

- \(\hat{L}\) = the likelihood of the model of interest.
- \(\hat{L}_{max}\) = the likelihood of the saturated model, the model having \(n\) parameters which fit the \(n\) observations perfectly.
- \(\hat{l}\) = the log-likelihood for the model of interest.
- \(\hat{l}_{max}\) = the log-likelihood of the saturated model.

The general expression for deviance is given by

\[
D = -2\,(\hat{l} - \hat{l}_{max}) = -2\,\big(\log \hat{L} - \log \hat{L}_{max}\big).
\]

The binomial log-likelihood function is given by

\[
\hat{l} = \sum_{i=1}^{n} \big[\, y_{i} \log(\pi_{i}) + (1 - y_{i}) \log(1 - \pi_{i}) \,\big].
\]

For the saturated model, \(\pi_{i} = y_{i}\), and for the model of interest, \(\pi_{i} = \hat y_{i}\). Thus, the deviance for the binomial log-likelihood becomes:

\[
D = 2 \sum_{i=1}^{n} \left[\, y_{i} \log\!\left(\frac{y_{i}}{\hat y_{i}}\right) + (1 - y_{i}) \log\!\left(\frac{1 - y_{i}}{1 - \hat y_{i}}\right) \right].
\]

### Calculation

The computations in the remainder of the post are based on PlacekickerData.csv, a data set with a dichotomous response indicating the success or failure of every placekick from the 1995 NFL season. What follows is a brief description of the fields contained in *PlacekickerData.csv*:

- **WEEK**: Week of the NFL season
- **DISTANCE**: Distance in yards of the placekick
- **CHANGE**: Whether success of the placekick would have resulted in a lead change (1) or not (0)
- **ELAP30**: Minutes remaining before the end of the half
- **PAT**: Point-after attempt (1) or field goal (0)
- **TYPE**: Outdoor (1) or indoor (0)
- **FIELD**: Grass (1) or artificial turf (0)
- **WIND**: Whether the placekick was attempted in windy conditions (1) or not (0)
- **GOOD**: The response variable, (1) if the placekick was successful, otherwise (0)

We will model GOOD as a function of DISTANCE and WIND. First read the dataset into R, then pass the model specification to `glm` to generate the model summary statistics:

```
library("data.table")

# Read the placekick data, then fit a binomial GLM with a logit link.
df = fread(
    "PlacekickerData.csv", header=TRUE, sep=",", stringsAsFactors=FALSE
)
pkfit = glm(
    formula=GOOD ~ DISTANCE + WIND, family=binomial(link=logit), data=df
)
```

Calling `summary(pkfit)` returns:

```
> summary(pkfit)

Call:
glm(formula = GOOD ~ DISTANCE + WIND, family = binomial(link = logit),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7662   0.2353   0.2353   0.3706   1.5914

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.884528   0.331916  17.729   <2e-16 ***
DISTANCE    -0.115588   0.008396 -13.767   <2e-16 ***
WIND        -0.582124   0.314051  -1.854   0.0638 .
---

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  772.53  on 1422  degrees of freedom
AIC: 778.53
```

The remaining sections detail the calculations used in generating a selection of the model summary statistics.

### Residual Deviance

The *residual deviance* (or simply the deviance) for the model in question
has a value of 772.53. To reproduce this value, we make use of the fact that
for logistic regression models with a binary (0/1) response, the likelihood of the saturated
model is 1, implying the saturated model’s log-likelihood is 0. As a result,
the original deviance expression, \(-2 \times (\hat l - \hat l_{\mathrm{max}})\) reduces to
\(D = -2 \hat l\). The deviance is computed by multiplying the model of
interest’s log-likelihood by -2:

```
# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# Function to compute the log-likelihood.
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

D = -2 * sum(ll(y=actual, pi=fitted))
print(D)
# yields 772.5335
```

Comparing with the residual deviance output from the original model summary,
we find the results to be identical (*772.53* from the summary vs. *772.5335* here).
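
As a quick cross-check (a sketch, not part of the original walkthrough), the same value can be read directly off the fitted object: `logLik` returns the model's log-likelihood, and `deviance` returns the residual deviance, which for an ungrouped 0/1 response equals \(-2 \hat l\):

```
# Sketch: verify the hand-computed value against what glm stores.
-2 * as.numeric(logLik(pkfit))  # -2 times the model log-likelihood
deviance(pkfit)                 # residual deviance reported by glm
# both should agree with the 772.5335 computed above
```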

To test the goodness of fit of the model, recall that the null hypothesis is
that the model is correctly specified. Pass the residual deviance, 772.5335,
along with the model degrees of freedom to the `pchisq` function to determine
whether there is strong evidence to reject the null hypothesis:

```
dof = 1422 # Degrees of freedom n - p
deviance = 772.5335
p_val = pchisq(deviance, df=dof, lower.tail=FALSE)
print(p_val)
# returns 1
```

There is no evidence to reject the null hypothesis that the model fits. Also note that the model passes the rule-of-thumb goodness-of-fit check, since the deviance is less than \(n - p\): 772.5335 < 1425 - 3 = 1422.
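
The same test can be run without counting degrees of freedom by hand, using the accessors on the fitted object (a sketch using base R's `deviance` and `df.residual`):

```
# Sketch: goodness-of-fit test using quantities stored on the fitted model.
pchisq(deviance(pkfit), df=df.residual(pkfit), lower.tail=FALSE)
# ~1, matching the result above
```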

### Null Deviance

The *null deviance* shows how well the model with nothing but an intercept
predicts the response. If the null deviance is small, it means that the
intercept-only model sufficiently explains the data. To calculate null
deviance, the same approach used to calculate residual deviance can be
leveraged by simply replacing the fitted values with the mean over all values
of the response:

```
# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
actual = df$GOOD
ravg = mean(df$GOOD)  # 0.885614

# Function to compute the log-likelihood
# y : actual values (1/0)
# pi: fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

nullDeviance = -2 * sum(ll(y=actual, pi=ravg))
print(nullDeviance)
# yields 1013.426
```

The output matches the null deviance from the model summary (1013.43).
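
The value is also stored on the fitted object, which offers a quick one-line check (a sketch):

```
# Sketch: null deviance as stored by glm.
pkfit$null.deviance  # should agree with the 1013.426 computed above
```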

### Comparison of Null and Residual Deviance

The difference between the log-likelihoods of two models, one a proper subset of the other (the model with more parameters is the *full* model, the other the *reduced* model), has an approximate chi-square distribution in large samples, with degrees of freedom equal to the difference in the number of parameters between the full and reduced models. It can also be shown that the difference in deviances between the reduced and full models follows the same chi-square distribution:

\[
-2\,(\hat l_{\mathrm{reduced}} - \hat l_{\mathrm{full}}) = D_{\mathrm{reduced}} - D_{\mathrm{full}} \sim \chi^{2}_{q},
\]

where \(q\) represents the degrees of freedom. The statistic on the left of the expression is the *likelihood ratio statistic*. By comparing the residual and null deviances, we can assess the significance of the parameters in the full model that are not present in the null model. Large values of the chi-square statistic are taken as evidence that the null hypothesis is implausible.

For the reduced model (the null model in this example), the degrees of freedom are the number of observations minus one for the intercept term, \(df_{\mathrm{reduced}} = n - 1\). For the full model, the degrees of freedom are the number of observations minus the number of parameters including the intercept, \(df_{\mathrm{full}} = n - 3\). Thus \(q = df_{\mathrm{reduced}} - df_{\mathrm{full}} = (n - 1) - (n - 3) = 2\).

The null hypothesis, \(H_{0}\), states that \(\beta_{\mathrm{DISTANCE}} = \beta_{\mathrm{WIND}} = 0\). The alternative hypothesis, \(H_{1}\), states that \(\beta_{\mathrm{DISTANCE}}\) or \(\beta_{\mathrm{WIND}}\) (or both) are non-zero.

Note that you can test each explanatory variable separately against the
intercept-only model in order to isolate its respective contribution to the
residual deviance, but in this example we’re testing their combined
significance. If the difference between null and residual deviances is close to
zero, it’s an indication that **DISTANCE** and **WIND** are not significant.
Alternatively, a large difference in deviance indicates that either
**DISTANCE** or **WIND** (or both) are significant, and it’s unlikely that we
would have seen as much reduction in deviance by chance alone.

Next we calculate the significance of the fit parameters:

```
# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
fitted = unname(pkfit$fitted.values)
actual = df$GOOD
ravg = mean(df$GOOD)  # 0.885614

# Function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# Renaming `D` to `residualDeviance`.
residualDeviance = -2 * sum(ll(y=actual, pi=fitted))
nullDeviance = -2 * sum(ll(y=actual, pi=ravg))
devianceDiff = nullDeviance - residualDeviance

# Determine degrees of freedom.
rd_dof = length(actual) - length(pkfit$coefficients)  # n - 3 = 1422
nd_dof = length(actual) - 1                           # n - 1 = 1424
dof = nd_dof - rd_dof                                 # 2

# Pass devianceDiff and dof to the pchisq function.
p_val = pchisq(devianceDiff, df=dof, lower.tail=FALSE)
print(p_val)
# returns 4.889108e-53
```

Such a small p-value indicates that either **DISTANCE** or **WIND** (or both) is
highly significant. Further analysis can be carried out to determine the source
of the deviance reduction.
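
One way to carry out that further analysis, shown below as a sketch using base R's `anova` method for glm objects (not part of the original walkthrough), is to add the terms sequentially and test the drop in deviance attributable to each:

```
# Sketch: sequential analysis of deviance. Each row tests the reduction in
# deviance obtained by adding one term to the model above it.
anova(pkfit, test="Chisq")
```

Given the coefficient z-values in the model summary, we would expect **DISTANCE** to account for the bulk of the reduction, with **WIND** contributing only a marginal amount.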

### Deviance Residuals

The *deviance residuals* are analogous to the residuals of a linear regression
model. In linear regression, the residuals are the differences between the true
outcome and the predicted output, whereas the deviance residuals are related to
the log-likelihoods of having observed the true outcome, given the predicted
probability of that outcome. The deviance residuals are computed by taking the square
root of the \(i^{th}\) term of the deviance, with the sign selected to be the sign of
(actual - fitted):

\[
d_{i} = \mathrm{sign}(y_{i} - \hat y_{i}) \sqrt{2 \left[\, y_{i} \log\!\left(\frac{y_{i}}{\hat y_{i}}\right) + (1 - y_{i}) \log\!\left(\frac{1 - y_{i}}{1 - \hat y_{i}}\right) \right]}.
\]

We can take advantage of a simplification to the deviance residuals expression
the same way we did with residual deviance: multiply the sign of the difference
between the actual and fitted values by the square root of \(-2\) times each
observation's log-likelihood, then compare with the quantiles reported for `pkfit`:

```
# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# Function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# Deviance residuals definition.
devianceResiduals = function(y, pi) {
    return(sign(y - pi) * sqrt(-2 * ll(y, pi)))
}

devRes = devianceResiduals(y=actual, pi=fitted)

# Call `summary` on devRes.
summary(devRes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.7660  0.2353  0.2353  0.1490  0.3706  1.5910
```

For comparison, the deviance residuals for `pkfit` were:

```
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7662   0.2353   0.2353   0.3706   1.5914
```
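
The built-in residual extractor provides another check (a sketch, not part of the original output): `residuals` with `type="deviance"` returns the same quantities R used to produce the summary above.

```
# Sketch: deviance residuals as computed by R itself.
builtinDevRes = residuals(pkfit, type="deviance")
summary(builtinDevRes)  # should match the quantiles above
```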

### Akaike information criterion (AIC)

A collection of candidate models can be compared, and the selection criteria
may be to choose the model with the highest log-likelihood. However, the
log-likelihood of a model will always increase with the addition of more
variables, even if those variables are insignificant and do little to increase
the model’s predictive power. The *Akaike information criterion*, or AIC, is a
penalized log-likelihood test that charges a penalty for additional variables.
It can be thought of as a measure of the relative quality of a model. When
considering one or more models fit to the same dataset, the preferred model is
the one with the minimum AIC. AIC is computed as:

\[
AIC = -2\,l + 2q,
\]

where \(l\) is the log-likelihood of the model of interest and \(q\) the number of parameters including the intercept term. The lower the AIC, the better the model:

```
# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# Function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# Capture the number of model coefficients.
nbr_coeffs = length(pkfit$coefficients)  # 3
AIC = -2 * sum(ll(y=actual, pi=fitted)) + 2 * nbr_coeffs
print(AIC)
# returns 778.5335
```

AIC for `pkfit` was 778.53.
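
R also reports this value directly, which offers a quick check (a sketch):

```
# Sketch: AIC as computed by R.
AIC(pkfit)   # generic AIC extractor
pkfit$aic    # value stored on the fitted object; both should match 778.5335
```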

### Pseudo R-squared

The *pseudo R-squared*, or *generalized R-squared* measures how much deviance
is explained by the model. It ranges between 0 and 1, with models having a
value closer to 1 explaining more of the deviance than a model having a value
closer to 0. The expression for pseudo R-squared is:

\[
R^{2}_{\mathrm{pseudo}} = 1 - \frac{l}{l_{min}},
\]

where \(l_{min}\) is the log-likelihood for the intercept-only model, and \(l\) the log-likelihood of the model of interest. Since the deviances here are \(D = -2l\) and \(D_{null} = -2\,l_{min}\), this equals one minus the ratio of the residual deviance to the null deviance:

```
nullDeviance = pkfit$null.deviance # 1013.426
residualDeviance = pkfit$deviance # 772.5335
pseudoRSquared = 1 - (residualDeviance/nullDeviance)
print(pseudoRSquared)
# returns 0.2377013
```

Our `GOOD ~ DISTANCE + WIND` model explains only ~24% of the deviance, which
means that we haven’t yet identified all the factors that predict whether or
not a placekick is made. We should now go back and test the quality of the
other explanatory variables, and check whether another combination of
explanatory variables retains acceptable goodness-of-fit characteristics while
improving the pseudo R-squared.
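
As one possible next step, here is a sketch (not part of the original analysis) that fits a larger candidate model using two additional fields from the data set, then compares AIC and pseudo R-squared against the original fit. The particular choice of **PAT** and **CHANGE** is illustrative only:

```
# Sketch: a larger candidate model; the added variables are illustrative.
pkfit2 = glm(
    formula=GOOD ~ DISTANCE + WIND + PAT + CHANGE,
    family=binomial(link=logit), data=df
)
AIC(pkfit2)                                   # compare against AIC(pkfit)
1 - (pkfit2$deviance / pkfit2$null.deviance)  # pseudo R-squared for pkfit2
```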