Deviance is used as goodness of fit measure for generalized linear models and in cases in which parameters are estimated using maximum likelihood. Deviance is a generalization of the residual sum of squares from ordinary least squares regression. The deviance of a fitted model compares the log-likelihood of the model of interest to the log-likelihood of the saturated model, representing the model with n parameters that fits the n observations perfectly, but with no ability to generalize.

In large samples, deviance follows a chi-square distribution with \(n - p\) degrees of freedom, where \(n\) is the number of observations and \(p\) the number of parameters in the model. The null hypothesis, \(H_{0}\), is that the model fits. The alternative hypothesis, \(H_{1}\), is that the model does not fit. A deviance much higher than \(n-p\) indicates the model is a poor fit to the data. In terms of magnitude, smaller is always better: The smaller the deviance, the better the fit of the model. Let:

  • \(\hat{L}\) = the likelihood of the model of interest.
  • \(\hat{L}_{max}\) = the likelihood of the saturated model, the model having \(n\) parameters which fit the \(n\) observations perfectly.
  • \(\hat{l}\) = the log-likelihood for the model of interest.
  • \(\hat{l}_{max}\) = the likelihood of the saturated model, the model having \(n\) parameters which fit the \(n\) observations perfectly.

The general expression for deviance is given by

$$ D = -2 \times \mathrm{Ln} \frac{\hat L}{\hat L_{\mathrm{max}}} = -2 \times (\hat l - \hat l_{\mathrm{max}}) $$

The binomial log-likelihood function is given by

$$ l(\hat{\beta}) = \sum_{i=1}^n \big(y_{i}Ln(\pi_{i}) + (1-y_{i})Ln(1 - \pi_{i})\big) $$

For the saturated model, \(\pi_{i} = y_{i}\), and for the model of interest, \(\pi_{i} = \hat y_{i}\). Thus, the the deviance for the binomial log-likelihood becomes:

$$ D = 2 \times \sum_{i=1}^n y_{i} \cdot \mathrm{Ln} \frac{y_{i}}{\hat y_{i}} + (1-y_{i}) \cdot \mathrm{Ln} \Big(\frac{1 - y_{i}}{1 - \hat{y}_{i}}\Big) $$

Calcuation

The computations in the remainder of the post will be based on PlacekickerData.csv, a data set having dichotomous response which indicates the success or failure of all placekicks from the 1995 NFL season. What follows is a brief description of the fields contained in PlacekickerData.csv:

  • WEEK: Week of the NFL season
  • DISTANCE: Distance in yards of the placekick
  • CHANGE: Whether success of the placekick would have resulted in a lead-change (1) or not (0)
  • ELAP30: Minutes remaining before trhe end of the half
  • PAT: Point After Attempt (1) or field goal (0)
  • TYPE: Outdoor (1) or indoor (0)
  • FIELD: Grass (1) or artificial turf (0)
  • WIND:Whether placekick was attepted in windy conditions (1) or not (0)
  • GOOD: The response variable, (1) if placekick is successful, otherwise (0)

We will model GOOD as a function of DISTANCE and WIND. First read the dataset into R then pass the model specification to glm in order to generate the model summary statistics:

library("data.table")

df = fread(
    "PlacekickerData.csv", header=TRUE, sep=",", stringsAsFactors=FALSE
    )

pkfit = glm(
    formula="GOOD ~ DISTANCE + WIND", family=binomial(link=logit),
    data=df
    )

Calling summary(pkfit) returns:

> summary(pkfit)
glm(formula = GOOD ~ DISTANCE + WIND, family = binomial(link = logit), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7662   0.2353   0.2353   0.3706   1.5914  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.884528   0.331916  17.729   <2e-16 ***
DISTANCE    -0.115588   0.008396 -13.767   <2e-16 ***
WIND        -0.582124   0.314051  -1.854   0.0638 .  
---

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  772.53  on 1422  degrees of freedom
AIC: 778.53

The remaining sections detail the calculations used in generating a selection of the model summary statistics.

Residual Deviance

The residual deviance (or simply the deviance) for the model in question has a value of 772.53. To reproduce this value, we make use of the fact that for logistic regression models in particular, the likelihood of the saturated model is 1, implying the saturated model’s log-likelihood is 0. As a result, the original deviance expression, \(-2 \times (\hat l - \hat l_{\mathrm{max}})\) reduces to \(D = -2 \hat l\). The deviance is computed by multiplying the model of interest’s log-likelihood by -2:

# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# Function to compute the log-likelihood.
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

D = -2 * sum(ll(y=actual, pi=fitted))

print(D)
# yields 772.5335

Comparing the residual deviance output from the original model summary, we find the results to be identical (772.53 from the summary vrs. 772.5335 here).

To test the goodness of fit of the model, recall that the null hypothesis is that the model is correctly specified. Pass the residual deviance, 772.5335 along with the model degrees of freedom to the pchisq function to determine whether there is strong evidence to reject the null hypothesis:

dof = 1422 # Degrees of freedom n - p
deviance = 772.5335
p_val = pchisq(deviance, df=dof, lower.tail=FALSE)
print(p_val)
# returns 1

There is no evidence to reject the null hypothesis that the model fits. Also note that the model passes the “rule-of_thumb” goodness of fit test since the deviance is less than \(n-p\), 772.5335 < 1425 - 3 = 1422.

Null Deviance

The null deviance shows how well the model with nothing but an intercept predicts the response. If the null deviance is small, it means that the intercept-only model sufficiently explains the data. To calculate null deviance, the same approach used to calculate residual deviance can be leveraged by simply replacing the fitted values with the mean over all values of the response:

# Recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification `GOOD ~ DISTANCE + WIND`.
actual = df$GOOD
ravg = mean(df$GOOD) # 0.885614

# Function to compute the log-likelihood
# y : actual values (1/0)
# pi: fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

nullDeviance = -2 * sum(ll(y=actual, pi=ravg))

print(nullDeviance)
# yields 1013.426

The output matches the null deviance from the model summary (1013.43).

Comparison of Null and Residual Deviance

The difference between log-likelihoods for two models (the model with more parameters identified as the full model, the other the reduced model), one being a proper subset of the other, has an approximate chi-square distribution in large samples with degrees of freedom equal to the difference in the number of parameters between the full and reduced models. It can also be shown that the difference in deviances between the full and reduced model also follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the full and reduced models:

$$ -2 \cdot \mathrm{Ln}(L_{\mathrm{reduced}})--2 \cdot \mathrm{Ln}(L_{\mathrm{full}}) = -2 \cdot \mathrm{Ln}\Big(\frac{L_{\mathrm{reduced}}}{L_{\mathrm{full}}}\Big) = D_{\mathrm{reduced}} - D_{\mathrm{full}} \sim \chi^{2}(q), $$

where \(q\) represents the degrees of freedom. The statistic on the left in the expression is the likelihood ratio statistic. By comparing the residual and null deviances, we can assess the significance of the parameters in the full model that are not present in the null model. Large values of the chi-square statistic are taken as evidence that the the null hypothesis is implausible.

For the reduced model (same as the null model in this example), the degrees of freedom are the number of observations minus one for the intercept term, \(df_{\mathrm{reduced}} = n - 1\). For the full model, the degrees of freedom are the number of observations minus the number of parameters including the intercept, or \(df_{\mathrm{full}} = n - 3\). Thus \(df = df_{\mathrm{full}} - df_{\mathrm{reduced}} = n - 1 - n + 3 = 2\).

The null hypothesis, \(H_{0}\) states \(\beta_{\mathrm{DISTANCE}} = \beta_{\mathrm{WIND}} = 0\). The alternative hypothesis, \(H_{1}\), states \(\beta_{\mathrm{DISTANCE}}\) or \(\beta_{\mathrm{WIND}}\) (or both) are non-zero.

Note that you can test each explanatory variable separately against the intercept-only model in order to isolate their respective contribution to the residual deviance, but in this example we’re testing their combined significance. If the difference between null and residual deviances is close to zero, it’s an indication that DISTANCE and WIND are not significant. Alternatively, a large difference in deviance indicates that either DISTANCE or WIND (or both) are significant, and it’s unlikely that we would have seen as much reduction in deviance by chance alone.

Next we calculate the significance of the fit parameters:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted = unname(pkfit$fitted.values)
actual = df$GOOD
ravg = mean(df$GOOD)  # 0.885614


# Function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# renaming `D` to  `residualDeviance` =>
residualDeviance = -2 * sum(ll(y=actual, pi=fitted))
nullDeviance = -2 * sum(ll(y=actual, pi=ravg))
devianceDiff = nullDeviance - residualDeviance

# Determine degrees of freedom.
rd_dof = length(actual) - length(pkfit$coefficients)
dd_dof = length(actual) - 1
dof = rd_dof - nd_dof # 2

# Pass devianceDiff and df to  pchisq function.
p_val = pchisq(devianceDiff, df=dof, lower.tail=FALSE)

print(p_val)
# returns 4.889108e-53

Such a small p-value indicates either DISTANCE or WIND (or both) are highly significant: Further analysis can be carried out to determine the source of the deviance reduction.

Deviance Residuals

The deviance residuals are analogous to the residuals of a linear regression model. In linear regression, the residuals are the differences between the true outcome and the predicted output, whereas the deviance residuals are related to the log-likelihoods of having observed the true outcome, given the predicted probability of that outcome. The deviance residuals are computed by taking the square root of the \(i^{th}\) term of the deviance, with the sign selected to be the sign of (actual - fitted):

$$ d_{i} = \mathrm{sign}(y_{i} - \hat y_{i}) \sqrt{2 \cdot \Big(y_{i} \cdot \mathrm{Ln} \Big(\frac{y_{i}}{\hat y_{i}}\Big) + (1 - y_{i})\cdot \mathrm{Ln}\Big(\frac{1-y_{i}}{1-\hat y_{i}}\Big)\Big)} $$

We can take advantage of a simplification to the deviance residuals expression the same way we did with residual deviance: Multiply the sign of the difference between actual and fitted values by the square root of 2 times the log-likelihood of all observations, then compare with quantiles from pkfit:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# Deviance residuals definition.
devianceResiduals = function(y, pi) {
    return(sign(y - pi) * sqrt(-2 * ll(y, pi)))
}

devRes = devianceResiduals(y=actual, pi=fitted)

# Call `summary` on devRes.
summary(devRes)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.7660  0.2353  0.2353  0.1490  0.3706  1.5910 

For comparison, the deviance residuals for pkfit were:

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7662   0.2353   0.2353   0.3706   1.5914  

Akaike information criterion (AIC)

A collection of candidate models can be compared, and the selection criteria may be to choose the model with the highest log-likelihood. However, the log-likelihood of a model will always increase with the addition of more variables, even if those variables are insignificant and do little to increase the model’s predictive power. The Akaike information criterion, or AIC, is a penalized log-likelihood test that charges a penalty for additional variables. It can be thought of as a measure of the relative quality of a model. When considering one or more models fit to the same dataset, the preferred model is the one with the minimum AIC. AIC is computed as:

$$ AIC = -2l + 2q, $$

where \(l\) is the log-likelihood of the model of interest and \(q\) the number of parameters including the intercept term. The lower the AIC, the better the model:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted = unname(pkfit$fitted.values)
actual = df$GOOD

# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll = function(y, pi) {
    return(y * log(pi) + (1 - y) * log(1 - pi))
}

# capture number of model coefficients =>
nbr_coeffs = length(pkfit$coefficients) # 3

AIC = -2*ll(y=actual, pi=fitted) + 2 * nbr_coeffs

print(AIC)
# returns 778.5335

AIC for pkfit was 778.53.

Pseudo R-squared

The pseudo R-squared, or generalized R-squared measures how much deviance is explained by the model. It ranges between 0 and 1, with models having a value closer to 1 explaining more of the deviance than a model having a value closer to 0. The expression for pseudo R-squared is:

$$ R^2_{\mathrm{pseudo}} = \frac {l_{min} - l}{l_{min}} = 1 - \frac {deviance_{\mathrm{residual}}}{deviance_{\mathrm{null}}} $$

where \(l_{min}\) is the log-likelihood for the intercept-only model, and \(l\) the log-likelihood of the model of interest:

nullDeviance = pkfit$null.deviance     # 1013.426
residualDeviance = pkfit$deviance      # 772.5335
pseudoRSquared = 1 - (residualDeviance/nullDeviance)

print(pseudoRSquared)
# returns 0.2377013

Our GOOD ~ DISTANCE + WIND model only explains ~24% of the deviance, which means that we haven’t yet identified all the factors that predict whether or not a placekick is made. We should now go back and test the quality of the other explanatory variables, and check if another combination of explanatory variables has preferable goodness of fit characteristics but with improved pseudo R-squared.