Composite Estimators in scikit-learn

Combining preprocessing and a classifier within a single pipeline
Machine Learning
Python
Published: August 1, 2024

To build a composite estimator in scikit-learn, transformers are usually combined with other transformers and/or predictors (such as classifiers or regressors). The most common tool used for composing estimators is a Pipeline. The Pipeline is often used in combination with ColumnTransformer or FeatureUnion which concatenate the output of transformers into a composite feature space.

In this notebook, I demonstrate how to create a composite estimator based on a synthetic dataset.
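While this post focuses on Pipeline and ColumnTransformer, FeatureUnion plays a similar composing role: every transformer in the union receives the full input, and their outputs are concatenated column-wise into a single feature space. A minimal sketch (the choice of PCA and SelectKBest here is purely illustrative):

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Both transformers see the full feature matrix; their outputs are stacked
# side by side into a composite feature space.
union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=1))
    ])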

"""
Create synthetic dataset for composite estimator demo.
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True, precision=8)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

rng = np.random.default_rng(516)

n = 1000

df = pd.DataFrame({
    "A": rng.gamma(shape=2, scale=50000, size=n),
    "B": rng.normal(loc=1000, scale=250, size=n),
    "C": rng.choice(["red", "green", "blue"], p=[.7, .2, .1], size=n),
    "D": rng.choice(["left", "right", None], p=[.475, .475, .05], size=n),
    "E": rng.poisson(17, size=n),
    "target": rng.choice([0., 1.], p=[.8, .2], size=n)
})

# Set a random selection of samples to NaN in columns A, B and E.
df.loc[rng.choice(n, size=10), "A"] = np.nan
df.loc[rng.choice(n, size=17), "B"] = np.nan
df.loc[rng.choice(n, size=5), "E"] = np.nan

# Create train-validation split. 
y = df["target"]
dftrain, dfvalid, ytrain, yvalid = train_test_split(df, y, test_size=.05, stratify=y)

print(f"dftrain.shape: {dftrain.shape}")
print(f"dfvalid.shape: {dfvalid.shape}")
print(f"prop. ytrain : {ytrain.sum() / dftrain.shape[0]:.4f}")
print(f"prop. yvalid : {yvalid.sum() / dfvalid.shape[0]:.4f}")
dftrain.shape: (950, 6)
dfvalid.shape: (50, 6)
prop. ytrain : 0.2389
prop. yvalid : 0.2400
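Before assembling the pipeline, it can be worth confirming which columns actually contain missing values; a quick check on the training split:

# Count missing values per column in the training split.
print(dftrain.isna().sum())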


For this dataset, we'll use ColumnTransformer to create separate pre-processing pipelines for continuous and categorical features. For continuous features, we impute missing values and standardize each feature so that all are on the same scale. For categorical features, we one-hot encode, creating k-1 indicator columns for a variable with k distinct levels. As the last step, a LogisticRegression classifier with an elastic net penalty is included. The code to accomplish this is given below:

from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression


# LogisticRegression classifier used as the final step of the pipeline.
lr = LogisticRegression(
    penalty="elasticnet", solver="saga", max_iter=5000
    )

# Identify continuous and categorical features. 
continuous = ["A", "B", "E"]
categorical = ["C", "D"]

continuous_transformer = Pipeline(steps=[
    ("imputer", IterativeImputer()),
    ("scaler" , StandardScaler())
    ])
categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="error"))
    ])

preprocessor = ColumnTransformer(transformers=[
    ("continuous" , continuous_transformer, continuous),  
    ("categorical", categorical_transformer, categorical)
    ], remainder="drop"
    )

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", lr)
    ]).set_output(transform="pandas")

pipeline
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('continuous',
                                                  Pipeline(steps=[('imputer',
                                                                   IterativeImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['A', 'B', 'E']),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(drop='first',
                                                                                 sparse_output=False))]),
                                                  ['C', 'D'])])),
                ('classifier',
                 LogisticRegression(max_iter=5000, penalty='elasticnet',
                                    solver='saga'))])
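Because set_output(transform="pandas") is applied, the transformers return DataFrames rather than NumPy arrays. As a quick sanity check (not part of the modeling workflow, and RandomizedSearchCV clones the pipeline anyway), the preprocessor can be fit on its own to preview the transformed features:

# Fit the preprocessing step alone and preview the transformed training data.
Xprep = preprocessor.set_output(transform="pandas").fit_transform(dftrain.drop("target", axis=1))
print(Xprep.head())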


In the next cell, RandomizedSearchCV is run over two hyperparameters: l1_ratio and C. Notice that we only call mdl.fit on the full pipeline: within each of the k folds, the preprocessing steps are fit on that fold's training samples only, so no information from the held-out samples leaks into the transforms.
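The classifier__ prefix follows scikit-learn's step__parameter convention for addressing parameters of nested estimators; if the exact names are unclear, they can be listed from the pipeline itself:

# List the tunable parameters exposed by the classifier step.
print([name for name in pipeline.get_params() if name.startswith("classifier__")])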


from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Hyperparameters to search over. 
param_distributions = {
    "classifier__l1_ratio": uniform(loc=0, scale=1),
    "classifier__C": uniform(loc=0, scale=10)
    }

mdl = RandomizedSearchCV(
    pipeline, param_distributions, scoring="accuracy", cv=5, verbose=2, 
    n_iter=3, random_state=516
    )

mdl.fit(dftrain.drop("target", axis=1), ytrain)

print(f"\nbest parameters: {mdl.best_params_}")
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time=   0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time=   0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time=   0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time=   0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time=   0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time=   0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time=   0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time=   0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time=   0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time=   0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time=   0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time=   0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time=   0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time=   0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time=   0.0s

best parameters: {'classifier__C': 8.115660497752215, 'classifier__l1_ratio': 0.7084090612742915}


When an estimator is included within a scikit-learn pipeline and a search is performed with RandomizedSearchCV, the pipeline is automatically refit on the full training set using the best parameters found during the search (refit=True is the default). The best_estimator_ attribute of the RandomizedSearchCV object reflects the best parameters for the estimator within the pipeline in terms of the scoring measure:


mdl.best_estimator_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('continuous',
                                                  Pipeline(steps=[('imputer',
                                                                   IterativeImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['A', 'B', 'E']),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(drop='first',
                                                                                 sparse_output=False))]),
                                                  ['C', 'D'])])),
                ('classifier',
                 LogisticRegression(C=8.115660497752215,
                                    l1_ratio=0.7084090612742915, max_iter=5000,
                                    penalty='elasticnet', solver='saga'))])
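Beyond the refit pipeline itself, the search object exposes the best cross-validated score, and the fitted ColumnTransformer can report the names of the features it produces; for example:

# Best mean cross-validated accuracy from the search.
print(f"best CV accuracy: {mdl.best_score_:.4f}")

# Feature names generated by the fitted preprocessor.
best_preprocessor = mdl.best_estimator_.named_steps["preprocessor"]
print(best_preprocessor.get_feature_names_out())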


Once the optimal model has been determined, we can pass our validation/test data into the pipeline to generate predicted probabilities for unseen data:


# Assess model performance on unseen data (drop the target before predicting).
ypred = mdl.predict_proba(dfvalid.drop("target", axis=1))[:,1]

ypred
array([0.23803061, 0.23987571, 0.22497394, 0.2360284 , 0.21692351,
       0.24979123, 0.22930123, 0.23805811, 0.18848299, 0.2269307 ,
       0.18739627, 0.21963412, 0.24601412, 0.24592807, 0.26313459,
       0.19509853, 0.22403892, 0.2644474 , 0.25217899, 0.25114582,
       0.25275472, 0.25602435, 0.23526247, 0.22682578, 0.21364797,
       0.31097165, 0.25706994, 0.26917858, 0.21912074, 0.14953379,
       0.2521859 , 0.19803027, 0.23446292, 0.20239688, 0.22329016,
       0.23452063, 0.19225738, 0.1971433 , 0.32557197, 0.2366244 ,
       0.21352434, 0.27294373, 0.25589429, 0.23278834, 0.24858346,
       0.2058699 , 0.17559173, 0.24556249, 0.22534097, 0.22728177])
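Since the validation labels are available, the predicted probabilities can be scored directly; a quick check using standard metrics from sklearn.metrics:

from sklearn.metrics import log_loss, roc_auc_score

# Score predicted probabilities against the held-out labels.
print(f"validation AUC     : {roc_auc_score(yvalid, ypred):.4f}")
print(f"validation log loss: {log_loss(yvalid, ypred):.4f}")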

In some cases, we may want to pickle our model to share with a third party for some downstream task. This is straightforward:

import pickle

with open("my-model.pkl", "wb") as fpkl:
    pickle.dump(mdl, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
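
On the receiving end, the pickled search object can be loaded and used for prediction directly (assuming compatible scikit-learn and Python versions on both sides):

# Reload the serialized model and confirm it reproduces the same predictions.
with open("my-model.pkl", "rb") as fpkl:
    mdl2 = pickle.load(fpkl)

assert np.allclose(mdl2.predict_proba(dfvalid.drop("target", axis=1))[:,1], ypred)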