Combining preprocessing and classifier within a single pipeline
Machine Learning
Python
Published
August 1, 2024
To build a composite estimator in scikit-learn, transformers are usually combined with other transformers and/or predictors (such as classifiers or regressors). The most common tool used for composing estimators is a Pipeline. The Pipeline is often used in combination with ColumnTransformer or FeatureUnion which concatenate the output of transformers into a composite feature space.
In this notebook, I demonstrate how to create a composite estimator based on a synthetic dataset.
"""Create synthetic dataset for composite estimator demo."""import numpy as npimport pandas as pdfrom sklearn.model_selection import train_test_splitnp.set_printoptions(suppress=True, precision=8)pd.options.mode.chained_assignment =Nonepd.set_option('display.max_columns', None)pd.set_option('display.width', None)rng = np.random.default_rng(516)n =1000df = pd.DataFrame({"A": rng.gamma(shape=2, scale=50000, size=n),"B": rng.normal(loc=1000, scale=250, size=n),"C": rng.choice(["red", "green", "blue"], p=[.7, .2, .1], size=n),"D": rng.choice(["left", "right", None], p=[.475, .475, .05], size=n),"E": rng.poisson(17, size=n),"target": rng.choice([0., 1.], p=[.8, .2], size=n)})# Set a selected samples to NaN in A, B and C. df.loc[rng.choice(n, size=10),"A"] = np.NaNdf.loc[rng.choice(n, size=17),"B"] = np.NaNdf.loc[rng.choice(n, size=5),"E"] = np.NaN# Create train-validation split. y = df["target"]dftrain, dfvalid, ytrain, yvalid = train_test_split(df, y, test_size=.05, stratify=y)print(f"dftrain.shape: {dftrain.shape}")print(f"dfvalid.shape: {dfvalid.shape}")print(f"prop. ytrain : {ytrain.sum() / dftrain.shape[0]:.4f}")print(f"prop. yvalid : {yvalid.sum() / dfvalid.shape[0]:.4f}")
For this dataset, we’ll use ColumnTransformer to create separate preprocessing pipelines for continuous and categorical features. For continuous features, we impute missing values and standardize each feature so all are on the same scale. For categorical features, we impute missing values and one-hot encode, creating k-1 indicator features for a variable with k distinct levels. As the final step, a LogisticRegression classifier with an elastic net penalty is included. The code to accomplish this is sketched below:
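The original post rendered this cell as a notebook widget, so the exact code isn't recoverable from the export. The following is a minimal sketch of the pipeline described above. The step name `classifier` is confirmed by the `classifier__*` parameter names in the search log further down; the variable name `pipe`, the remaining step names, and the imputation/encoding settings are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

continuous_features = ["A", "B", "E"]
categorical_features = ["C", "D"]

# Continuous branch: impute missing values, then standardize.
continuous_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # strategy is an assumption
    ("scaler", StandardScaler())
])

# Categorical branch: impute missing values, then one-hot encode with k-1
# indicator columns per k-level feature (drop="first"). Column D encodes
# missing values as Python None, hence missing_values=None.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent", missing_values=None)),
    ("onehot", OneHotEncoder(drop="first", handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("continuous", continuous_transformer, continuous_features),
    ("categorical", categorical_transformer, categorical_features)
])

# The step name "classifier" matches the classifier__* names in the search log.
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=.5, max_iter=5000
    ))
])
```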
In the next cell, RandomizedSearchCV is run against two hyperparameters: l1_ratio and C. Notice that we only call mdl.fit once on the composite estimator: the preprocessing steps are refit on the training samples of each of the k folds separately, which prevents information from the held-out fold leaking into the transform.
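The search cell itself was also lost in the export; here is a hedged reconstruction. The parameter names and the candidate/fold counts (3 candidates, 5 folds, 15 fits) match the log below, while the sampling distributions, scoring metric, and seed are assumptions:

```python
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV

# Uniform distributions chosen to be consistent with the sampled values in the log.
param_distributions = {
    "classifier__C": stats.uniform(loc=0, scale=10),
    "classifier__l1_ratio": stats.uniform(loc=0, scale=1)
}

mdl = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=3,            # 3 candidates x 5 folds = 15 fits, matching the log
    cv=5,
    scoring="roc_auc",   # scoring choice is an assumption
    verbose=2,
    random_state=516     # seed choice is an assumption
)
mdl.fit(dftrain, ytrain)

print(f"best parameters: {mdl.best_params_}")
```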
```
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
[CV] END classifier__C=8.115660497752215, classifier__l1_ratio=0.7084090612742915; total time= 0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
[CV] END classifier__C=1.115284252761577, classifier__l1_ratio=0.5667878644753359; total time= 0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
[CV] END classifier__C=7.927782545875722, classifier__l1_ratio=0.8376069301429002; total time= 0.0s
best parameters: {'classifier__C': 8.115660497752215, 'classifier__l1_ratio': 0.7084090612742915}
```
When an estimator is included within a scikit-learn pipeline and a randomized search is performed using RandomizedSearchCV, the pipeline is automatically refit on the full training set with the best parameters found during the search. The best_estimator_ attribute of the RandomizedSearchCV object will reflect the best parameters for the estimator within the pipeline in terms of the scoring measure:
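The rendered pipeline widget didn't survive the export, but a minimal sketch of inspecting the refit pipeline looks like this (assuming `mdl` is the fitted RandomizedSearchCV object from the previous cell):

```python
# The search object was refit on the full training set with the best
# parameters, so best_estimator_ is a ready-to-use pipeline.
best_pipeline = mdl.best_estimator_

# The "classifier" step reflects the best C and l1_ratio from the search.
print(best_pipeline.named_steps["classifier"])
```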
Once the optimal model has been determined, we can pass our validation/test data into the pipeline to generate predicted probabilities for unseen data:
```python
# Assessing model performance on unseen data.
ypred = mdl.predict_proba(dfvalid)[:, 1]
ypred
```
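As an illustrative extension (not part of the original output), the predicted probabilities can be scored against the held-out labels, for example with ROC AUC:

```python
from sklearn.metrics import roc_auc_score

# yvalid holds the true labels for the 5% validation split created earlier.
print(f"validation AUC: {roc_auc_score(yvalid, ypred):.4f}")
```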