Version: v1.4.1

Preprocessing

xplainable-preprocessing

Preprocessing is handled by the xplainable-preprocessing package. It provides a spec-driven pipeline system where you define preprocessing steps declaratively using PipelineSpec and StepSpec, then compile them into an executable pipeline.

Overview

The xplainable-preprocessing package uses a declarative, spec-based approach:

  1. Define your pipeline as a PipelineSpec containing ordered StepSpec steps
  2. Compile the spec into an executable DataFramePipeline using compile_spec()
  3. Fit and transform your data using the compiled pipeline

This design separates the pipeline definition (serialisable, inspectable) from the pipeline execution, making it easy to persist, version, and share pipelines.
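
To make the idea of "a pipeline as data" concrete, here is a minimal sketch that uses only the standard library. `MiniStepSpec`, `MiniPipelineSpec`, `to_json`, and `from_json` are hypothetical stand-ins written for this illustration; they are not the package's classes or its serialisation format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MiniStepSpec:
    """Hypothetical stand-in for StepSpec: plain data, no behaviour."""
    id: str
    type: str
    columns: Optional[list] = None
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class MiniPipelineSpec:
    """Hypothetical stand-in for PipelineSpec: an ordered collection of steps."""
    steps: tuple = ()


def to_json(spec: MiniPipelineSpec) -> str:
    # asdict() recurses into nested dataclasses, so each step becomes a plain dict.
    return json.dumps(asdict(spec), sort_keys=True)


def from_json(text: str) -> MiniPipelineSpec:
    raw = json.loads(text)
    return MiniPipelineSpec(steps=tuple(MiniStepSpec(**s) for s in raw["steps"]))


spec = MiniPipelineSpec(steps=(
    MiniStepSpec(id="fill_missing", type="FillMissingTransformer",
                 params={"strategy": "median"}),
    MiniStepSpec(id="scale", type="StandardScaler", columns=["age", "income"]),
))

# The definition survives a round trip through text, with no fitted state involved.
restored = from_json(to_json(spec))
```

Because the definition is only data, it can be stored in version control or compared between runs without touching any fitted transformer.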

Key Features

Declarative Specs

Define pipelines as data (PipelineSpec / StepSpec) that can be serialised, versioned, and inspected.

Transformer Registry

Built-in and sklearn transformers available through a unified registry with automatic parameter coercion.

Cloud Persistence

Save pipeline specs to the xplainable platform for team sharing and deployment.

Mutation Analysis

Analyze the impact of adding, removing, or reordering steps before recompiling.

Installation

pip install xplainable-preprocessing

Quick Start

from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

# Define pipeline spec
spec = PipelineSpec(steps=[
    StepSpec(
        id="fill_missing",
        type="FillMissingTransformer",
        params={"strategy": "median"}
    ),
    StepSpec(
        id="drop_ids",
        type="DropColumnsTransformer",
        columns=["id", "timestamp"]
    ),
    StepSpec(
        id="scale",
        type="StandardScaler",
        columns=["age", "income", "balance"]
    ),
])

# Compile to executable pipeline
pipeline = compile_spec(spec)

# Fit and transform
X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)

Core Classes

PipelineSpec

The top-level specification containing an ordered list of steps.

from xplainable_preprocessing import PipelineSpec, StepSpec

spec = PipelineSpec(
    version="2.0",
    steps=[...]  # list of StepSpec instances
)

Key methods:

| Method | Description |
| --- | --- |
| get_step(step_id) | Get a step by its ID |
| step_index(step_id) | Get the index of a step |
| remove_step(step_id, cascade=True) | Return a new spec with the step removed |
| insert_step(step, after=None) | Return a new spec with a step inserted |
| reorder_step(step_id, new_index) | Return a new spec with a step moved |
| update_step_params(step_id, params) | Return a new spec with updated params |
| analyze_removal(step_id) | Analyze the impact of removing a step |
| enrich_from_deltas(step_deltas) | Populate column contracts from fit deltas |
| optimize() | Return a topologically sorted spec |

All mutation methods return new PipelineSpec instances (immutable pattern).
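
The immutable pattern can be sketched with frozen dataclasses. The `Step` and `Spec` classes below are hypothetical and only illustrate the idea that each mutation builds a new object; how `after=None` is treated here (append at the end) is an assumption of this sketch, not a statement about the package.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    id: str
    type: str


@dataclass(frozen=True)
class Spec:
    steps: tuple = ()

    def remove_step(self, step_id: str) -> "Spec":
        # Build a new tuple; the original instance is left untouched.
        kept = tuple(s for s in self.steps if s.id != step_id)
        if len(kept) == len(self.steps):
            raise KeyError(f"No step with id {step_id!r}")
        return replace(self, steps=kept)

    def insert_step(self, step: Step, after=None) -> "Spec":
        # Assumption of this sketch: after=None appends to the end.
        if after is None:
            return replace(self, steps=self.steps + (step,))
        ids = [s.id for s in self.steps]
        pos = ids.index(after) + 1  # raises ValueError if `after` is unknown
        return replace(self, steps=self.steps[:pos] + (step,) + self.steps[pos:])


original = Spec(steps=(Step("fill", "FillMissingTransformer"),
                       Step("scale", "StandardScaler")))
without_fill = original.remove_step("fill")
with_extra = original.insert_step(Step("drop", "DropColumnsTransformer"), after="fill")
```

Since `original` never changes, you can keep several candidate specs side by side and compile whichever one you settle on.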

StepSpec

A single preprocessing step.

StepSpec(
    id="step_1",                        # unique identifier (required)
    type="FillMissingTransformer",      # transformer type from registry (required)
    columns=["col1", "col2"],           # columns to apply to (optional, None = all)
    params={"strategy": "median"},      # transformer constructor params (optional)
    description="Fill missing values"   # human-readable description (optional)
)

When columns is specified, the transformer is wrapped in a DataFrameColumnTransformer that applies it only to the listed columns.
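
As a rough illustration of what column wrapping means, the sketch below applies an inner transformer to a subset of columns and passes the rest through unchanged. `ColumnSubsetWrapper` and `MeanFill` are hypothetical, and a dict of lists stands in for a DataFrame so the example needs no third-party packages; the actual DataFrameColumnTransformer may behave differently.

```python
class ColumnSubsetWrapper:
    """Hypothetical wrapper: apply `inner` only to `columns`, pass the rest through."""

    def __init__(self, inner, columns):
        self.inner = inner
        self.columns = list(columns)

    def fit(self, table):
        self.inner.fit({c: table[c] for c in self.columns})
        return self

    def transform(self, table):
        changed = self.inner.transform({c: table[c] for c in self.columns})
        # Untouched columns are copied through; listed ones are replaced.
        return {**table, **changed}


class MeanFill:
    """Toy transformer: replace None with the column mean learned in fit()."""

    def fit(self, table):
        self.means_ = {}
        for name, values in table.items():
            present = [v for v in values if v is not None]
            self.means_[name] = sum(present) / len(present)
        return self

    def transform(self, table):
        return {name: [self.means_[name] if v is None else v for v in values]
                for name, values in table.items()}


data = {"age": [20, None, 40], "city": ["a", None, "c"]}
step = ColumnSubsetWrapper(MeanFill(), columns=["age"]).fit(data)
result = step.transform(data)  # "age" is filled, "city" is left as-is
```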

compile_spec()

Converts a PipelineSpec into an executable DataFramePipeline.

from xplainable_preprocessing import compile_spec

pipeline = compile_spec(spec)

# Use like any sklearn pipeline
pipeline.fit(X_train)
X_transformed = pipeline.transform(X_test)
X_transformed = pipeline.fit_transform(X_train)
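
Conceptually, compiling a spec means looking up each step's type, instantiating it with its params, and chaining the results. The sketch below shows that idea on plain Python lists; `compile_steps`, `ChainedPipeline`, `FillNone`, and `Center` are hypothetical names, and this is not how compile_spec() is implemented.

```python
class FillNone:
    def __init__(self, value=0):
        self.value = value

    def fit(self, values):
        return self

    def transform(self, values):
        return [self.value if v is None else v for v in values]


class Center:
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class ChainedPipeline:
    """Run steps in order, feeding each step the previous step's output."""

    def __init__(self, steps):
        self.steps = steps  # list of (step_id, transformer) pairs

    def fit_transform(self, values):
        for _, transformer in self.steps:
            values = transformer.fit(values).transform(values)
        return values

    def transform(self, values):
        for _, transformer in self.steps:
            values = transformer.transform(values)
        return values


def compile_steps(step_dicts, registry):
    # Resolve each declarative step to a concrete transformer instance.
    return ChainedPipeline([
        (s["id"], registry[s["type"]](**s.get("params", {})))
        for s in step_dicts
    ])


registry = {"FillNone": FillNone, "Center": Center}
steps = [
    {"id": "fill", "type": "FillNone", "params": {"value": 0}},
    {"id": "center", "type": "Center"},
]
pipeline = compile_steps(steps, registry)
train_out = pipeline.fit_transform([1, None, 5])
test_out = pipeline.transform([None, 2])  # reuses the mean learned on the training values
```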

Available Transformers

Custom xplainable Transformers

| Type | Description |
| --- | --- |
| TextCleanTransformer | Clean and normalise text columns |
| DropColumnsTransformer | Remove specified columns |
| FillMissingTransformer | Fill missing values with a given strategy |
| TypeCastTransformer | Cast column data types |
| CategoryCondenseTransformer | Condense low-frequency categories into an "other" bucket |
| ExpressionTransformer | Create new columns via pandas expressions |
| DateTimeExtractTransformer | Extract date/time components (year, month, day, etc.) |
| RenameColumnsTransformer | Rename columns |
| GroupByAggTransformer | Grouped aggregation features |
| GroupedLagTransformer | Lag features grouped by a key column |
| RollingAggTransformer | Rolling window aggregation features |

sklearn Transformers

The following sklearn transformers are available in the registry out of the box:

| Type | Description |
| --- | --- |
| SimpleImputer | Impute missing values (mean, median, most_frequent, constant) |
| StandardScaler | Standardise features (zero mean, unit variance) |
| MinMaxScaler | Scale features to a given range |
| RobustScaler | Scale features using statistics robust to outliers |
| OneHotEncoder | Encode categorical features as one-hot |
| OrdinalEncoder | Encode categorical features as ordinal integers |
| PowerTransformer | Apply power transforms (Box-Cox, Yeo-Johnson) |
| QuantileTransformer | Transform features to follow a uniform or normal distribution |
| KBinsDiscretizer | Bin continuous features into discrete intervals |
| Binarizer | Threshold features to binary values |

Custom Transformers

You can write custom sklearn-compatible transformers inline:

StepSpec(
    id="custom_step",
    type="custom",
    params={
        "code": """
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.clip(lower=0)
""",
        "class_name": "MyTransformer",
        "description": "Clip negative values to zero"
    }
)
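
One way a source string plus a class name could be turned into a usable class is to execute the code in a fresh namespace and look the class up by name. The helper below, `load_class_from_source`, is a hypothetical sketch of that technique and says nothing about how the package actually handles the "code" and "class_name" params. The toy class avoids sklearn so the example runs with the standard library alone.

```python
def load_class_from_source(code: str, class_name: str):
    """Execute `code` in a fresh namespace and return the class named `class_name`."""
    namespace = {}
    exec(code, namespace)  # only do this with code you trust
    try:
        cls = namespace[class_name]
    except KeyError:
        raise ValueError(f"{class_name!r} is not defined in the supplied code") from None
    if not isinstance(cls, type):
        raise TypeError(f"{class_name!r} is not a class")
    return cls


source = """
class ClipNegatives:
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [max(x, 0) for x in X]
"""

ClipNegatives = load_class_from_source(source, "ClipNegatives")
output = ClipNegatives().fit([-1, 2]).transform([-1, 2])
```

Executing stored code is powerful but risky, so inline code of this kind should only come from sources you control.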

Registering New Transformers

from xplainable_preprocessing import register

# MyCustomTransformerClass is your own sklearn-compatible transformer class
register("MyCustomTransformer", MyCustomTransformerClass)
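
A registry of this kind is essentially a mapping from type names to classes. The sketch below is a hypothetical, simplified version: `_REGISTRY`, `build`, and `ClipTransformer` are illustrative names, and refusing to overwrite an existing name is a design choice of this sketch rather than documented behaviour of register().

```python
_REGISTRY = {}  # hypothetical: maps type name -> transformer class


def register(name, cls):
    """Add a transformer class under `name`, refusing silent overwrites."""
    if name in _REGISTRY:
        raise ValueError(f"{name!r} is already registered")
    _REGISTRY[name] = cls


def build(step_type, params=None):
    """Look up `step_type` and instantiate it with `params`."""
    try:
        cls = _REGISTRY[step_type]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise KeyError(
            f"Unknown transformer type {step_type!r}; known: {known}"
        ) from None
    return cls(**(params or {}))


class ClipTransformer:
    def __init__(self, lower=0):
        self.lower = lower

    def transform(self, values):
        return [max(v, self.lower) for v in values]


register("ClipTransformer", ClipTransformer)
clipper = build("ClipTransformer", {"lower": 0})
clipped = clipper.transform([-2, 3, -1])
```

Listing the known names in the error message makes a misspelled type easy to spot.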

Examples

Data Cleaning Pipeline

from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

spec = PipelineSpec(steps=[
    StepSpec(
        id="drop_cols",
        type="DropColumnsTransformer",
        params={"columns": ["id", "created_at"]}
    ),
    StepSpec(
        id="fill_numeric",
        type="FillMissingTransformer",
        columns=["age", "income"],
        params={"strategy": "median"}
    ),
    StepSpec(
        id="fill_categorical",
        type="SimpleImputer",
        columns=["category", "region"],
        params={"strategy": "most_frequent"}
    ),
    StepSpec(
        id="condense_cats",
        type="CategoryCondenseTransformer",
        columns=["category"],
        params={"min_frequency": 0.01}
    ),
    StepSpec(
        id="scale",
        type="StandardScaler",
        columns=["age", "income"]
    ),
])

pipeline = compile_spec(spec)
X_processed = pipeline.fit_transform(X_train)

Feature Engineering Pipeline

from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

spec = PipelineSpec(steps=[
    StepSpec(
        id="extract_date",
        type="DateTimeExtractTransformer",
        params={
            "column": "order_date",
            "components": ["year", "month", "dayofweek"]
        }
    ),
    StepSpec(
        id="rolling_avg",
        type="RollingAggTransformer",
        params={
            "column": "sales",
            "window": 7,
            "agg_func": "mean",
            "new_column": "sales_7d_avg"
        }
    ),
    StepSpec(
        id="lag_features",
        type="GroupedLagTransformer",
        params={
            "group_col": "customer_id",
            "value_col": "purchase_amount",
            "lags": [1, 7, 30]
        }
    ),
    StepSpec(
        id="expression",
        type="ExpressionTransformer",
        params={
            "expressions": {
                "revenue_per_unit": "revenue / quantity"
            }
        }
    ),
])

pipeline = compile_spec(spec)
X_enriched = pipeline.fit_transform(X_train)

Using with XClassifier

from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec
from xplainable.core.models import XClassifier
from sklearn.model_selection import train_test_split

# Split the data (X and y are assumed to be your feature DataFrame and target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing
spec = PipelineSpec(steps=[
    StepSpec(id="fill", type="FillMissingTransformer", params={"strategy": "median"}),
    StepSpec(id="drop", type="DropColumnsTransformer", params={"columns": ["id"]}),
])

pipeline = compile_spec(spec)

# Preprocess
X_train_proc = pipeline.fit_transform(X_train)
X_test_proc = pipeline.transform(X_test)

# Train model
model = XClassifier()
model.fit(X_train_proc, y_train)

# Evaluate
metrics = model.evaluate(X_test_proc, y_test)
print(metrics)

Cloud Persistence

For uploading and managing preprocessing pipelines on the Xplainable Cloud platform, see the REST API documentation.

Mutation Analysis

Analyze the impact of removing a step before modifying the pipeline:

# `spec` is the PipelineSpec from the Data Cleaning Pipeline example above

# Analyze what happens if we remove a step
analysis = spec.analyze_removal("fill_numeric")
print(f"Safe to remove: {analysis.safe}")
print(f"Columns lost: {analysis.columns_lost}")
print(f"Cascade removals: {analysis.cascade_remove}")
print(f"Columns restored: {analysis.columns_restored}")

# Remove step (returns new spec)
new_spec = spec.remove_step("fill_numeric", cascade=True)

# Insert a new step
new_spec = spec.insert_step(
    StepSpec(id="new_step", type="StandardScaler"),
    after="fill_numeric"
)

# Reorder a step
new_spec = spec.reorder_step("scale", new_index=0)
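
To give a feel for what a removal analysis has to work out, the sketch below tracks which columns each step reads and creates, then walks forward from the removed step to find dependants. `StepContract` and this standalone `analyze_removal` function are hypothetical; the package's own analysis, including fields such as columns_restored, is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepContract:
    """Hypothetical column contract: what a step reads and what it creates."""
    id: str
    reads: frozenset = frozenset()
    creates: frozenset = frozenset()


def analyze_removal(steps, step_id):
    """Return (columns_lost, cascade_ids) for removing `step_id`."""
    ids = [s.id for s in steps]
    start = ids.index(step_id)  # raises ValueError for an unknown step
    lost = set(steps[start].creates)
    cascade = []
    for later in steps[start + 1:]:
        if later.reads & lost:
            # This step depends on a lost column, so it must go too,
            # and everything it creates is lost as well.
            cascade.append(later.id)
            lost |= later.creates
    return lost, cascade


steps = [
    StepContract("extract_date", reads=frozenset({"order_date"}),
                 creates=frozenset({"month"})),
    StepContract("month_flag", reads=frozenset({"month"}),
                 creates=frozenset({"is_december"})),
    StepContract("scale", reads=frozenset({"income"})),
]
lost, cascade = analyze_removal(steps, "extract_date")
```

Here removing "extract_date" also invalidates "month_flag", while "scale" is unaffected because it reads a column that no removed step created.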

Local Persistence

from xplainable_preprocessing import save_pipeline, load_pipeline

# Save a fitted pipeline
save_pipeline(pipeline, "my_pipeline.pkl")

# Load it back
pipeline = load_pipeline("my_pipeline.pkl")
X_transformed = pipeline.transform(X_new)

Generating a Transformer Catalog

Get a prompt-ready listing of all available transformers with their parameters:

from xplainable_preprocessing import generate_catalog

print(generate_catalog())

Best Practices

Pipeline Design
  1. Separate definition from execution -- define specs, then compile
  2. Use unique step IDs -- makes mutation analysis and debugging easier
  3. Specify columns when possible -- avoids unintended transformations
  4. Fit on training data only -- avoid data leakage
  5. Handle missing values early -- before other transformations
  6. Version your specs -- specs are serialisable Pydantic models
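
The "fit on training data only" rule is easiest to see with numbers. The toy scaler below is a hypothetical, standard-library-only stand-in for a real scaler; it uses the population standard deviation, which is also what sklearn's StandardScaler uses.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Toy scaler: learns mean and population std in fit(), reuses them in transform()."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

# Correct: statistics come from the training data only.
scaler = SimpleStandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Anti-pattern: fitting on train + test lets test data shift the learned mean.
leaky = SimpleStandardScaler().fit(train + test)
```

In the leaky version the learned mean moves from 20.0 to 25.0, so information from the test set has already influenced preprocessing before any evaluation takes place.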

Troubleshooting

Unknown transformer type

Cause: The step type is not in the registry.

Solution: List the available types by importing REGISTRY from xplainable_preprocessing and printing sorted(REGISTRY.keys()). Register custom transformers with register().

Pipeline fails during transform

Possible causes:

  • New categories in test data not seen during training
  • Missing columns in new data
  • Data type mismatches

Solutions:

  • Use handle_unknown='ignore' in OneHotEncoder params
  • Ensure consistent column names between train and test
  • Use TypeCastTransformer to enforce types
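
A quick column check before calling transform can surface naming problems early. `check_columns` below is a hypothetical helper written for this illustration, not part of the package; with pandas you would pass it the column lists of your training and new DataFrames.

```python
def check_columns(expected, actual):
    """Compare the columns seen at fit time with those in new data."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


train_columns = ["age", "income", "region"]
new_columns = ["age", "Income", "region"]  # note the capitalised "Income"

report = check_columns(train_columns, new_columns)
if report["missing"] or report["unexpected"]:
    print(f"Column mismatch: {report}")
```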

Next Steps

Ready to Build Models?

Now that you understand preprocessing, explore: