(Note: The R implementation of Estimating Logistic Regression Coefficents From Scratch can be found here.)
In this post, I’ll demonstrate how to estimate the coefficents of a Logistic Regression model using the Fisher Scoring algorithm. We will then compare our estimates to those generated by scikitlearn’s linear_model.LogisticRegression class when exposed to the same dataset. Specifically, we’ll focus on how parameters of a Logistic Regression model are calculated when fit to data with a dicotomous response.
Background
In a Generalized Linear Model, the response may have any distribution from the exponential family, and rather than assuming the mean is a linear function of the explnatory variables, we assume that a function of the mean, or the link function, is a linear function of the explnatory variables.
Logistic Regression is used for modeling data with a categorical response. Although it’s possible to model multinomial data using Logistic Regression, in this post we’ll limit our analysis to models having a dichotomous response, where the outcome can be classified as ‘Yes/No’, ‘True/False’, ‘1/0’, ‘Good/Bad’, etc…
The Logistic Regression model is a Generalized Linear Model whose canonical link is the logit, or logodds:
for \(i = (1, \cdots , n)\).
Solving the logit for \(\pi_{i}\), which is a standin for the predicted probability associated with \(x_{i}\), yields
where \(\infty<x_{i}<+\infty\) and \(0<\pi_{i}<1\).
In other words, the expression for \(\pi_{i}\) maps any realvalued \(x_{i}\) to a positive probability between 0 and 1.
Parameter Estimation
Maximum Likelihood Estimation can be used to determine the parameters of a Logistic Regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the \((p+1)\) unknown \(\beta_{0}, \cdots ,\beta_{p}\).
Let \(Y_{i}\) represent independent, dicotomous response values for each of \(n\) observations, where \(Y_{i}=1\) denotes a success and \(Y_{i}=0\) denotes a failure. The density function of a single observation \(Y_{i}\) is given by
and the corresponding Likelihood function is
Taking the natural log of the Maximum Likelihood Estimate results in the LogLikelihood function:
The firstorder partial derivatives of the LogLikelihood are calculated and set to zero for each \(\beta_{k}\), \(k = 0, 1, \cdots, p\)
which can be represented in matrix notation as
where \(X^{T}\) is a \((p+1)\)by\(n\) matrix, and \((y  \pi)\) is a \(n\)by\(1\) vector.
The vector of firstorder partial derivatives of the LogLikelihood function is referred to as the score function in statistical literature, and is typically represented as \(U\).
These \((p+1)\) equations are solved simultaneously to obtain the parameter estimates \(\hat\beta_{0}, \cdots ,\hat\beta_{p}\). Each solution specifies a criticalpoint which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite (which means every element on the diagonal of the matrix is less than zero).
The matrix of second partial derivatives is given by
represented in matrix form as:
where \(W\) is an \(n\)by\(n\) diagonal matrix of weights with each element equal to \(\pi_{i}(1  \pi_{i})\) for Logistic Regression models (in general, the weights matrix \(W\) will have entries inversely proportional to the variance of the response).
Since no closedform solution exists for determining Logistic Regression model coefficents (as exists for Linear Regression models), iterative techniques must be employed.
Fitting the Model
Two distinct but related iterative methods can be utilized in determining model coefficents: the NewtonRaphson method and Fisher Scoring. The NewtonRaphson method relies on the matrix of second partial derivatives, also known as the Hessian. The NewtonRaphson update formula is:
where:
 \(\beta^{(t+1)}\) = the vector of updated coefficent estimates
 \(\beta^{(t)}\) = the vector of coefficent estimates from the previous iteration
 \((H^{(t)})^{1}\) = the inverse of the Hessian, \(\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{1}\)
 \(U^{(t)}\) = the vector of firstorder partial derivatives of the LogLikelihood function, \(\frac {\partial l(\beta)}{\partial \beta}\) = \(X^{T}(y  \pi)\)
The NewtonRaphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a seconddegree polynomial, and then finding the location of that polynomial’s maximum value. This process continues until it converges to the actual solution. The convergence of \(\beta^{t}\) to \(\hat{\beta}\) is usually fast, with adequate convergence realized after 45 iterations 1.
Fisher Scoring utilizes the expected information, \(E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)\). Let \(\mathcal{I}\) serve as a standin for the expected value of the information:
Then, the Fisher Scoring update step replaces \(H^{(t)}\) from NewtonRaphson with \(\mathcal{I}^{(t)}\):
where:
 \(\beta^{(t+1)}\) = the vector of updated coefficent estimates
 \(\beta^{(t)}\) = the vector of coefficent estimates from the previous iteration
 \((\mathcal{I}^{(t)})^{1}\) = the inverse of the expected information matrix, \(E \Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{1}\)
 \(U^{(t)}\) = the vector of firstorder partial derivatives of the LogLikelihood function, \(\frac {\partial l(\beta)}{\partial \beta}\) = \(X^{T}(y  \pi)\)
Iteration continues until \(\beta^{(t)}\) stabilizes.
For GLM’s with a canonical link (of which employing the logit for Logistic Regression is an example), the observed and expected information are the same. When the response follows an exponential family distribution, and the canonical link function is employed, observed and expected Information coincide so that Fisher Scoring is the same as NewtonRaphson.
When the canonical link is used, the second partial derivatives of the LogLikelihood do not depend on the observation \(y_{i}\), and therefore
Fisher scoring has the advantage that it produces the asymptotic covariance matrix as a byproduct.
To clarify:

The Hessian is the matrix of second partial derivatives of the LogLikelihood with respect to the parameters, or \(H = \frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\).

The observed information is \(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\).

The expected information is \(\mathcal{I} = E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)\).

The asymptotic covariance matrix is \(Var(\hat{\beta}) = E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{1} = (X^{T}WX)^{1}\).
For models employing a canonical link function:

The observed and expected information are the same, \(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)\).

\(H = \mathcal{I}\), or \(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)\).

The NewtonRaphson and Fisher Scoring algorithms yield identical results.
Pure Python Fisher Scoring Implementation
The data used for our sample calculation can be obtained here. This data represents ORing failures in the 23 preChallenger space shuttle missions. In this dataset, TEMPERATURE
will serve as the single explnatory variable which will be used to predict O_RING_FAILURE
, which is 1
if a failure occurred, 0
otherwise.
Once the parameters have been determined, the model estimate of the probability of success for a given observation can be calculated with:
In the following code segment, we define a single function, get_coefficients
, which returns the estimated model coefficents as a \((p+1)\)by\(1\) array. In addition, the function returns the number of scoring iterations, fitted values and the variancecovariance matrix for the estimated coefficients:
def get_coefficients(design_matrix, response_vector, epsilon=.001):
"""
Determine Logistic Regression coefficents using Fisher Scoring algorithm.
Iteration ceases once changes between elements in coefficent matrix across
consecutive iterations is less than epsilon.
# =========================================================================
# design_matrix `X` => nby(p+1) 
# response_vector `y` => nby1 
# probability_vector `p` => nby1 
# weights_matrix `W` => nbyn 
# epsilon => threshold above which iteration continues 
# =========================================================================
# n => # of observations 
# (p + 1) => # of parameterss, +1 for intercept term 
# =========================================================================
# U => First derivative of LogLikelihood with respect to 
# each beta_i, i.e. `Score Function`: X_transpose * (y  p) 
# 
# I => Second derivative of LogLikelihood with respect to 
# each beta_i. The `Information Matrix`: (X_transpose * W * X) 
# 
# X^T*W*X results in a (p+1)by(p+1) matrix 
# X^T(y  p) results in a (p+1)by1 matrix 
# (X^T*W*X)^1 * X^T(y  p) results in a (p+1)by1 matrix 
# ========================================================================
"""
X = np.matrix(design_matrix)
y = np.matrix(response_vector)
# initialize logistic function used for Scoring calculations =>
def pi_i(v): return (np.exp(v) / (1 + np.exp(v)))
# initialize beta_0, p_0, W_0, I_0 & U_0 =>
beta_0 = np.matrix(np.zeros(np.shape(X)[1])).T
p_0 = pi_i(X * beta_0)
W_pre = (np.array(p_0) * np.array(1  p_0))
W_0 = np.matrix(np.diag(W_pre[:, 0]))
I_0 = X.T * W_0 * X
U_0 = X.T * (y  p_0)
# initialize variables for iteration =>
beta_old = beta_0
iter_I = I_0
iter_U = U_0
iter_p = p_0
iter_W = W_0
fisher_scoring_iterations = 0
# iterate until abs(beta_new  beta_old) < epsilon =>
while True:
# Fisher Scoring Update Step =>
coeffs.append(np.array(beta_old))
fisher_scoring_iterations += 1
beta_new = beta_old + iter_I.I * iter_U
if all(np.abs(np.array(beta_new)np.array(beta_old)) < epsilon):
model_parameters = beta_new
fitted_values = pi_i(X * model_parameters)
covariance_matrix = iter_I.I
break
else:
iter_p = pi_i(X * beta_new)
iter_W_pre = (np.array(iter_p) * np.array(1  iter_p))
iter_W = np.matrix(np.diag(iter_W_pre[:, 0]))
iter_I = X.T * iter_W * X
iter_U = X.T * (y  iter_p)
beta_old = beta_new
summary = {
'model_parameters' : np.array(model_parameters),
'fitted_values' : np.array(fitted_values),
'covariance_matrix': np.array(covariance_matrix),
'number_iterations': fisher_scoring_iterations
}
return (summary)
We read in the Challenger dataset and partition it into the design matrix and response vector, which are then passed to get_coefficients
:
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)
# read dataset with pandas =>
df = pd.read_csv("Challenger.csv")
# add intercept field =>
df['INTERCEPT'] = 1
X = df[['INTERCEPT', 'TEMPERATURE']].values
y = df[['O_RING_FAILURE']].values
# call `get_coefficients`, keeping epsilon at .001 =>
my_summary < get_coefficients(X, y, epsilon=.001)
get_coefficients
returns a dictionary with the following keyvalue pairs:
 ‘model_parameters’: The model’s estimated coefficents
 ‘covariance matrix’: The variancecovariance matrix of the coefficent estimates
 ‘fitted_values’: The fitted explanatory variables
 ‘number_iterations’: The number of Fisher Scoring iterations
>>> my_summary['model_parameters']
array([[ 15.04290165],
[ 0.23216274]])
>>> my_summary['covariance_matrix']
array([[ 54.4442749 , 0.79638683],
[ 0.79638683, 0.01171514]])
>>> my_summary['fitted_values']
array([[ 0.43049313],
[ 0.22996826],
[ 0.27362105],
[ 0.32209405],
[ 0.37472428],
[ 0.1580491 ],
[ 0.12954602],
[ 0.22996826],
[ 0.85931657],
[ 0.60268105],
[ 0.22996826],
[ 0.04454055],
[ 0.37472428],
[ 0.93924781],
[ 0.37472428],
[ 0.08554356],
[ 0.22996826],
[ 0.02270329],
[ 0.06904407],
[ 0.03564141],
[ 0.08554356],
[ 0.06904407],
[ 0.82884484]])
>>> my_summary['number_iterations']
5
So for the Challenger dataset, our implementation of the Fisher Scoring algorithm yields a \(\hat{\beta}_{0} = 15.0429016\) and \(\hat{\beta}_{1} = 0.2321627\). In order to predict new probabilities of ORing Failure based on temperature, our model implies the following formula:
Negative coefficents correspond to variables that are negatively correlated to the probability of a positive outcome, with the reverse being true for positive coefficents.
Lets compare the results of our Fisher Scoring algorithm with the output of scikitlearn’s linear_model.LogisticRegression class using the same dataset:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
np.set_printoptions(suppress=True)
# read dataset with pandas =>
df = pd.read_csv("Challenger.csv")
# design matrix must be 2dimensional, so we call reshape =>
X = df[['TEMPERATURE']].values.reshape(len(df[['TEMPERATURE']]), 1)
# `np.ravel` flattens a nested sequence =>
y = np.ravel(df[['O_RING_FAILURE']].values)
lr = LogisticRegression(C=1e10, intercept_scaling=200)
lr.fit(X, y)
params = np.array([lr.intercept_, lr.coef_])
print(params)
# returns:
array([[ 15.04289075], [ 0.23216258]])
The coefficent estimates generated by scikitlearn’s linear_model.LogisticRegression are very close to those returned by our Fisher Scoring implementation (\(\hat{\beta}_0 = 15.0429016\) and \(\hat{\beta}_1 =0.2321627\)).
A quick note about the arguments passed to the LogisticRegression class:
scikitlearn’s linear models include regularization by default. The regularization parameter allows for controlling the tradeoff between the fit to the data and generalization to future
unknown data, and in the LogisticRegression class, C
is this parameter. Since we want the standard error function to serve as our cost function, setting C
to a very large value essentially negates the influence of the regularization parameter on our coefficient estimates. intercept_scaling
serves a similiar purpose. To learn more about the LogisticRegression class initialization parameters, check out the linear_model.LogisticRegression documentation page.
To create a list of explanatory variablefitted probability pairs, run the following:
# predicted probabilities for `X` =>
fitted_probs = np.ravel(lr.predict_proba(X)[:,[1]]).tolist()
# flattened Xvalues =>
xvals = np.ravel(X).tolist()
fitted_pairs = list(zip(xvals, fitted_probs))
print(fitted_pairs)
[(66, 0.4304930563884877),
(70, 0.22996831652689573),
(69, 0.27362108897825554),
(68, 0.32209405642318906),
(67, 0.3747242419645074),
(72, 0.15804918927261258),
(73, 0.12954611461881435),
(70, 0.22996831652689573),
(57, 0.8593163615383961),
(63, 0.60268086115142),
(70, 0.22996831652689573),
(78, 0.04454061498466786),
(67, 0.3747242419645074),
(53, 0.9392476723456893),
(67, 0.3747242419645074),
(75, 0.08554364438644661),
(70, 0.22996831652689573),
(81, 0.02270333247612705),
(76, 0.06904415511953753),
(79, 0.03564146745997936),
(75, 0.08554364438644661),
(76, 0.06904415511953753),
(58, 0.8288446174645678)]
A feature of Logistic Regression is that the training data’s marginal probabilities are preserved. If you aggregate the fitted values from the training set, that quanity will equal the number of positive outcomes in the response vector:
# checking marginal sum for sklearn Logistic Regression model =>
>>> sum(i[1] for i in fitted_pairs)
7.00000002411528
# checking sum for Fisher Scoring algorithm =>
>>> sum(my_summary['fitted_values'])
7
Conslusion
This post shed some light on the estimation routines that can be utilized in determining Logistic Regression coefficients, and walked through the mathematical justification. In addition, we briefly explored scikitlearn’s linear_model API, and demonstrated two distinct estimation approaches which arrived at the same set of coefficients. For those who are interested, scikitlearn exposes a huge amount of functionality with thorough, easytofollow documentation that’s full of straightforward examples. Get started here. Until next time, happy coding!
Footnotes:
[1]  Agresti, A. (2002). Categorical Data Analysis (2nd Ed.)