In an earlier post, we implemented the Fisher Scoring algorithm, which we then used to estimate the coefficients for a Logistic Regression model. We demonstrated that our algorithm’s coefficient estimates were identical to the coefficients produced by R’s glm function, but we didn’t formally assess the quality of the model in any way. In this post, we’ll determine the goodness of fit of our Logistic Regression model and test the significance of our coefficient estimates.

Deviance

Deviance is used as a goodness-of-fit measure for Generalized Linear Models and, when parameters are estimated by maximum likelihood, it generalizes the residual sum of squares from Ordinary Least Squares Regression. The deviance of a fitted model compares the log-likelihood of the model of interest to the log-likelihood of the saturated model: the model with n parameters that fits the n observations perfectly, but is overparameterized to the point that it essentially just interpolates the data.

In large samples, the deviance follows a chi-square distribution with \(n-p\) degrees of freedom, where \(n\) is the number of observations and \(p\) is the number of parameters in the model. The null hypothesis, \(H_{0}\), is that the model fits; the alternative hypothesis, \(H_{1}\), is that the model does not fit. A deviance much larger than \(n-p\) indicates the model is a poor fit to the data: the smaller the deviance, the better the fit of the model.

Let:

  • \(\hat{L}\) = the likelihood for the current model of interest
  • \(\hat{L}_{max}\) = the likelihood of the saturated model, the model having n parameters which fit the n observations perfectly
  • \(\hat{l}\) = the log-likelihood for the current model of interest
  • \(\hat{l}_{max}\) = the log-likelihood of the saturated model, the model having n parameters which fit the n observations perfectly

Then the deviance can be represented as:

$$ D = -2Ln \Big(\frac{\hat{L}}{\hat{L}_{max}}\Big) = -2(\hat{l} - \hat{l}_{max}) $$

As demonstrated [here]({filename}/articles/Statistical_Modeling/Logistic_Estimation.md), the binomial log-likelihood function is:

$$ l(\hat{\beta}) = \sum_{i=1}^n \big(y_{i}Ln(\pi_{i}) + (1-y_{i})Ln(1 - \pi_{i})\big) $$

For the saturated model, \(\pi_{i}=y_{i}\), and for the model of interest, \(\pi_{i} = \hat{y}_{i}\). Thus, the deviance is represented as:

$$ D = 2\sum_{i=1}^n \Big(y_{i} Ln\frac{y_{i}}{\hat{y}_{i}}+(1-y_{i})Ln \frac{1 - y_{i}}{1 - \hat{y}_{i}}\Big) $$
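As a quick illustration, the deviance can be computed directly from this expression by adopting the convention that \(0 \cdot Ln(0) = 0\) (its limiting value). The sketch below assumes hypothetical vectors y (observed 0/1 outcomes) and y_hat (fitted probabilities):

# sketch: binomial deviance computed directly from the expression above
# y     => hypothetical vector of observed outcomes (1/0)
# y_hat => hypothetical vector of fitted probabilities
# terms with a zero numerator are taken as 0, their limiting value
binomDeviance <- function(y, y_hat) {
    term <- function(num, den) ifelse(num==0, 0, num*log(num/den))
    return(2 * sum(term(y, y_hat) + term(1-y, 1-y_hat)))
}

Later sections use a simplification of this expression that applies when the response is dichotomous.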

Calculation

For the example in this post, we’ll be using PlacekickerData.csv, a data set with a dichotomous response indicating the success or failure of all placekicks in the 1995 NFL season [1]. What follows is a brief description of the fields in PlacekickerData.csv:

  • WEEK: Week of the NFL season
  • DISTANCE: Distance in yards of the placekick
  • CHANGE: Whether success of the placekick would have resulted in a lead-change (1) or not (0)
  • ELAP30: Minutes remaining before the end of the half
  • PAT: Point After Attempt (1) or field goal (0)
  • TYPE: Outdoor (1) or indoor (0)
  • FIELD: Grass (1) or artificial turf (0)
  • WIND: Whether placekick was attempted in windy conditions (1) or not (0)
  • GOOD: The response variable, (1) if placekick is successful, otherwise (0)

We will model GOOD as a function of DISTANCE and WIND. First, read the dataset into R, then pass the model specification to glm in order to generate the model summary statistics:

df <- read.table(
             file="PlacekickerData.csv",
             header=TRUE, 
             sep=",", 
             stringsAsFactors=FALSE
             )


pkfit <- glm(
          formula=GOOD ~ DISTANCE + WIND,
          family=binomial(link=logit),
          data=df
          )

Calling summary(pkfit) returns:

> summary(pkfit)

Call:
glm(formula = GOOD ~ DISTANCE + WIND, family = binomial(link = logit), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7662   0.2353   0.2353   0.3706   1.5914  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.884528   0.331916  17.729   <2e-16 ***
DISTANCE    -0.115588   0.008396 -13.767   <2e-16 ***
WIND        -0.582124   0.314051  -1.854   0.0638 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  772.53  on 1422  degrees of freedom
AIC: 778.53

The following sections highlight the calculations used in generating the model summary statistics.

Residual Deviance

We’ll begin by reproducing the Residual Deviance (or simply the deviance) output, which, for the model in question, has a value of \(772.53\). To do so, we make use of the fact that for Logistic Regression models with a dichotomous response, the likelihood of the saturated model is 1, which means the saturated model’s log-likelihood is 0. Thus the original deviance expression, \(D = 2(\hat{l}_{max} - \hat{l})\), reduces to \(D = -2\hat{l}\). The deviance is computed by multiplying the model of interest’s log-likelihood by \(-2\):

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted <- unname(pkfit$fitted.values)
actual <- df$GOOD


# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll <- function(y, pi) return(y*log(pi)+(1-y)*log(1-pi))

D <- -2 * sum(ll(y=actual, pi=fitted))

print(D)
# yields 772.5335

Looking back at Residual deviance in the original model summary, we note that the results are identical (\(772.53\) from the summary vs. \(772.5335\) here).

To test the goodness of fit of the model, recall that the null hypothesis is that the model is correctly specified. Pass the residual deviance, \(772.5335\), along with the model degrees of freedom to pchisq to determine whether there is strong evidence to reject the null hypothesis:

# use a name other than `df` so the data frame isn't overwritten =>
dof      <- 1422      # model degrees of freedom (n - p)
deviance <- 772.5335
p_val    <- pchisq(deviance, df=dof, lower.tail=FALSE)

print(p_val)
# returns 1

There is no evidence to reject the null hypothesis that the model fits. Also note that the model passes the ‘off-the-cuff’ goodness of fit test since the deviance is less than \(n-p\), \(772.5335 < 1425-3 = 1422\).

Null Deviance

The null deviance shows how well a model with nothing but an intercept predicts the response. If the null deviance is very small, the intercept-only model already explains the data well.
To calculate the null deviance, we can reuse the approach used for the residual deviance, simply replacing the fitted values with the mean of the response:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
actual <- df$GOOD
ravg   <- mean(df$GOOD) # 0.885614

# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll <- function(y, pi) return(y*log(pi)+(1-y)*log(1-pi))

nullDeviance <- -2 * sum(ll(y=actual, pi=ravg))

print(nullDeviance)
# yields 1013.426

The output matches the Null deviance from the model summary, which was \(1013.43\).

Comparison of Null and Residual Deviance

The difference between the log-likelihoods of two models, one a proper subset of the other (the model with more parameters identified as the full model, the other as the reduced model), has an approximate chi-square distribution in large samples, with degrees of freedom equal to the difference in the number of parameters between the full and reduced models. It can also be shown that the difference in deviances between the full and reduced models follows the same chi-square distribution:

$$ \big(-2Ln(L_{reduced})\big) - \big(-2Ln(L_{full})\big) = -2Ln\Big(\frac{L_{reduced}}{L_{full}}\Big) = D_{reduced} - D_{full} \sim \chi^{2}(q), $$

where \(q\) represents the degrees of freedom. The statistic on the left in the expression is the likelihood ratio statistic. By comparing the residual and null deviances, we can assess the significance of the parameters in the full model that are not present in the null model. Large values of the chi-square statistic are taken as evidence that the the null hypothesis is implausible.

For the reduced model (the same as the null model in this example), the degrees of freedom are the number of observations minus one for the intercept term, \(df_{reduced} = n - 1\). For the full model, the degrees of freedom are the number of observations minus the number of parameters including the intercept, \(df_{full} = n - 3\). Thus, for the significance test, \(df = df_{reduced} - df_{full} = (n - 1) - (n - 3) = 2\).

The null hypothesis, \(H_{0}\), states that \(\beta_{DISTANCE} = \beta_{WIND} = 0\). The alternative hypothesis, \(H_{1}\), states that either \(\beta_{DISTANCE}\) or \(\beta_{WIND}\) (or both) are non-zero.

(Note that you can test each explanatory variable (DISTANCE and WIND here) separately against the intercept-only model in order to isolate its respective contribution to the residual deviance, but in this example we’re testing their combined significance.)

If the difference between null and residual deviances is close to zero, it’s an indication that DISTANCE and WIND are not significant. Alternatively, a large difference in deviance indicates that either DISTANCE or WIND (or both) are significant, and it’s unlikely that we would have seen as much reduction in deviance by chance alone. Now we calculate the significance of the fit parameters:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted <- unname(pkfit$fitted.values)
actual <- df$GOOD
ravg   <- mean(df$GOOD)  # 0.885614


# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll <- function(y, pi) return(y*log(pi)+(1-y)*log(1-pi))

# renaming `D` to  `residualDeviance` =>
residualDeviance <- -2 * sum(ll(y=actual, pi=fitted))
nullDeviance     <- -2 * sum(ll(y=actual, pi=ravg))
devianceDiff     <- nullDeviance - residualDeviance

# get degrees of freedom (avoid reusing the name `df`, which is the data frame) =>
rd_df <- length(actual) - length(pkfit$coefficients)  # residual deviance df: n - 3
nd_df <- length(actual) - 1                           # null deviance df: n - 1
dof   <- nd_df - rd_df                                # 2

# pass devianceDiff and dof to the pchisq function =>
p_val <- pchisq(devianceDiff, df=dof, lower.tail=FALSE)

print(p_val)
# returns 4.889108e-53

Such a small p-value indicates that either DISTANCE or WIND (or both) is highly significant; further analysis can be carried out to determine the source of the deviance reduction.
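One way to carry out that further analysis is an analysis of deviance, which adds the explanatory variables to the model one at a time and reports the drop in deviance attributable to each. The sketch below uses R’s built-in anova function and, alternatively, fits hypothetical single-variable models (fit_dist, fit_wind) to isolate each variable’s deviance reduction against the intercept-only model; the resulting tables aren’t reproduced here:

# sequential analysis of deviance: each row reports the reduction in
# residual deviance from adding that term, with a chi-square test =>
anova(pkfit, test="Chisq")

# alternatively, fit each single-variable model and test its deviance
# reduction against the intercept-only (null) model =>
fit_dist <- glm(GOOD ~ DISTANCE, family=binomial(link=logit), data=df)
fit_wind <- glm(GOOD ~ WIND, family=binomial(link=logit), data=df)

pchisq(fit_dist$null.deviance - fit_dist$deviance, df=1, lower.tail=FALSE)
pchisq(fit_wind$null.deviance - fit_wind$deviance, df=1, lower.tail=FALSE)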

Deviance Residuals

The deviance residuals are analogous to the residuals of a linear regression model. In linear regression, the residuals are the differences between the true outcome and the predicted output, whereas the deviance residuals are related to the log-likelihoods of having observed the true outcome, given the predicted probability of that outcome [2]. The deviance residuals are obtained by taking the square root of the \(i^{th}\) term of the deviance, with the sign selected to be the sign of \((actual - fitted)\):

$$ d_{i} = sign(y_{i} - \hat{y}_{i}) \sqrt{2\Big(y_{i}Ln\Big(\frac{y_{i}}{\hat{y}_{i}}\Big) + (1 - y_{i})Ln\Big(\frac{1-y_{i}}{1-\hat{y}_{i}}\Big)\Big)} $$

We can take advantage of the same simplification used for the residual deviance: multiply the sign of the difference between the actual and fitted values by the square root of \(-2\) times the log-likelihood of each data point, then return the quantile summary to compare with pkfit’s quantiles:

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted <- unname(pkfit$fitted.values)
actual <- df$GOOD

# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll <- function(y, pi) return(y*log(pi)+(1-y)*log(1-pi))

# deviance residuals definition =>
devianceResiduals <- function(y, pi) {
    return(sign(y-pi)*sqrt(-2*ll(y, pi)))
}


devRes <- devianceResiduals(y=actual, pi=fitted)

# call `summary` on devRes =>
summary(devRes)

# returns:
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.7660  0.2353  0.2353  0.1490  0.3706  1.5910 

For comparison, the deviance residuals for pkfit were:

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7662   0.2353   0.2353   0.3706   1.5914  
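
As a sanity check, R can return the deviance residuals directly from the fitted model via residuals with type="deviance"; summarizing that vector should agree with the quantiles above:

# deviance residuals straight from the fitted glm object, for comparison
# with the hand-rolled devianceResiduals function =>
summary(residuals(pkfit, type="deviance"))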

Akaike information criterion (AIC)

A collection of candidate models can be compared, and one selection criterion might be to choose the model with the highest log-likelihood. However, the log-likelihood of a model will almost always increase with the addition of more variables, even if those variables are insignificant and do little to increase the model’s predictive power. The Akaike information criterion, or AIC, is a penalized log-likelihood that charges a penalty for each additional variable. It can be thought of as a measure of the relative quality of a model: when comparing models fit to the same dataset, the preferred model is the one with the minimum AIC value. The AIC formula is:

$$ AIC = -2l + 2q, $$

where \(l\) is the log-likelihood of the model of interest and \(q\) is the number of parameters including the intercept term. The lower the AIC, the better the model.

# recall that pkfit resulted from calling glm on `PlacekickerData.csv`
# with model specification GOOD ~ DISTANCE + WIND =>
fitted <- unname(pkfit$fitted.values)
actual <- df$GOOD

# function to compute the log-likelihood
# y  => actual values (1/0)
# pi => fitted probabilities
ll <- function(y, pi) return(y*log(pi)+(1-y)*log(1-pi))

# capture number of model coefficients =>
nbr_coeffs <- length(pkfit$coefficients) # 3

AIC <- -2*sum(ll(y=actual, pi=fitted)) + 2*nbr_coeffs

print(AIC)

# returns 778.5335

AIC for pkfit was \(778.53\).
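
For reference, R exposes the same quantity directly, both as a component of the fitted object and via the built-in AIC function:

# AIC as reported by R itself, for comparison with the manual calculation =>
pkfit$aic   # aic component of the glm object
AIC(pkfit)  # generic AIC extractor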

Pseudo R-squared

The pseudo R-squared, or generalized R-squared, measures how much of the deviance is explained by the model. Its range is between 0 and 1, with values closer to 1 indicating that more of the deviance is explained. The formula for pseudo R-squared is:

$$ R^{2}_{pseudo} = \frac {l_{min} - l}{l_{min}} = 1 - \frac {deviance_{residual}}{deviance_{null}} $$

where \(l_{min}\) is the log-likelihood for the intercept-only model, and \(l\) is the log-likelihood associated with the model in question. Using our null and residual deviances from earlier:

nullDeviance     <- pkfit$null.deviance # 1013.426
residualDeviance <- pkfit$deviance      # 772.5335
pseudoRSquared   <- 1 - (residualDeviance/nullDeviance)

print(pseudoRSquared)

# returns 0.2377013

Our GOOD ~ DISTANCE + WIND model only explains ~24% of the deviance, which means that we haven’t yet identified all the factors that predict whether or not a placekick is made. We should now go back and test the quality of the other explanatory variables, and check whether another combination of explanatory variables has similar goodness-of-fit characteristics but with a better pseudo R-squared.
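
As a starting point for that follow-up analysis, we might fit a candidate model with additional explanatory variables and compare its AIC and pseudo R-squared against pkfit. The variables added below (CHANGE and PAT) are just an example, and the resulting values aren’t reproduced here:

# sketch: a candidate model with additional explanatory variables =>
pkfit2 <- glm(
           formula=GOOD ~ DISTANCE + WIND + CHANGE + PAT,
           family=binomial(link=logit),
           data=df
           )

AIC(pkfit, pkfit2)                            # lower AIC is preferred
1 - (pkfit2$deviance/pkfit2$null.deviance)    # pseudo R-squared for pkfit2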

Conclusion

In this post, we walked through some of the methods used to assess goodness of fit and the significance of explanatory variables for a Logistic Regression model, and demonstrated how to calculate many of the important summary statistics output by R’s glm function. We also discussed some of the simplifications that can be made in calculating deviance and deviance residuals when modeling a dataset with a dichotomous response. In future posts, I plan to cover model assessment techniques for other members of the exponential family. Until then, happy coding!

Footnotes:

  1. Bilder, Christopher, Thomas M. Loughin (2015). Analysis of Categorical Data with R. Taylor & Francis Group, New York.
  2. Mount, John, Nina Zumel (2014). Practical Data Science with R. Manning Publications, New York.