Version: v1.4.1

Preprocessing

xplainable-preprocessing

Preprocessing is handled by the xplainable-preprocessing package. It provides a spec-driven pipeline system where you define preprocessing steps declaratively using PipelineSpec and StepSpec, then compile them into an executable pipeline.

Overview

The xplainable-preprocessing package uses a declarative, spec-based approach:

1. Define your pipeline as a PipelineSpec containing ordered StepSpec steps.
2. Compile the spec into an executable DataFramePipeline using compile_spec().
3. Fit and transform your data using the compiled pipeline.

This design separates the pipeline definition (serialisable, inspectable) from the pipeline execution, making it easy to persist, version, and share pipelines.
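
The separation described above can be sketched with standard-library Python alone. This is a toy illustration, not the package's implementation: `ToyStep`, `ToySpec`, `TOY_OPERATIONS`, and `toy_compile` are hypothetical names invented for this example, and the "transformers" are simple arithmetic functions.

```python
import json
from dataclasses import dataclass, field, asdict


# Toy stand-ins for illustration only; these are NOT the package's classes.
@dataclass(frozen=True)
class ToyStep:
    id: str
    type: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ToySpec:
    steps: tuple


# Hypothetical mapping from a step type to a factory that builds a callable.
TOY_OPERATIONS = {
    "add": lambda params: (lambda x: x + params["value"]),
    "multiply": lambda params: (lambda x: x * params["value"]),
}


def toy_compile(spec):
    """Turn the declarative spec into one executable function."""
    funcs = [TOY_OPERATIONS[step.type](step.params) for step in spec.steps]

    def run(x):
        for func in funcs:
            x = func(x)
        return x

    return run


spec = ToySpec(steps=(
    ToyStep(id="shift", type="add", params={"value": 1}),
    ToyStep(id="scale", type="multiply", params={"value": 10}),
))

# The definition is plain data, so it round-trips through JSON unchanged.
payload = json.dumps(asdict(spec))
restored = ToySpec(steps=tuple(ToyStep(**s) for s in json.loads(payload)["steps"]))

# Only the compiled object can execute anything.
pipeline = toy_compile(restored)
result = pipeline(2)  # (2 + 1) * 10
```

Because the spec holds no fitted state or executable objects, it can be stored, diffed, and reviewed like any other configuration file, and compiled only when it is time to run.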

Key Features

Declarative Specs

Define pipelines as data (PipelineSpec / StepSpec) that can be serialised, versioned, and inspected.

Transformer Registry

Built-in and sklearn transformers available through a unified registry with automatic parameter coercion.

Cloud Persistence

Save pipeline specs to the xplainable platform for team sharing and deployment.

Mutation Analysis

Analyze the impact of adding, removing, or reordering steps before recompiling.

Installation

    pip install xplainable-preprocessing

Quick Start

    from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

    # Define pipeline spec
    spec = PipelineSpec(steps=[
        StepSpec(
            id="fill_missing",
            type="FillMissingTransformer",
            params={"strategy": "median"}
        ),
        StepSpec(
            id="drop_ids",
            type="DropColumnsTransformer",
            columns=["id", "timestamp"]
        ),
        StepSpec(
            id="scale",
            type="StandardScaler",
            columns=["age", "income", "balance"]
        ),
    ])

    # Compile to executable pipeline
    pipeline = compile_spec(spec)

    # Fit and transform
    X_train_processed = pipeline.fit_transform(X_train)
    X_test_processed = pipeline.transform(X_test)

Core Classes

PipelineSpec

The top-level specification containing an ordered list of steps.

    from xplainable_preprocessing import PipelineSpec, StepSpec

    spec = PipelineSpec(
        version="2.0",
        steps=[...]  # list of StepSpec instances
    )

Key methods:

Method	Description
`get_step(step_id)`	Get a step by its ID
`step_index(step_id)`	Get the index of a step
`remove_step(step_id, cascade=True)`	Return a new spec with the step removed
`insert_step(step, after=None)`	Return a new spec with a step inserted
`reorder_step(step_id, new_index)`	Return a new spec with a step moved
`update_step_params(step_id, params)`	Return a new spec with updated params
`analyze_removal(step_id)`	Analyze the impact of removing a step
`enrich_from_deltas(step_deltas)`	Populate column contracts from fit deltas
`optimize()`	Return a topologically sorted spec

All mutation methods return new PipelineSpec instances (immutable pattern).
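
The practical consequence of the immutable pattern is that the return value must be captured, because the original spec is never changed. The sketch below shows the idea with a frozen dataclass. `FrozenSpec` is a hypothetical stand-in written for this example, not the package's PipelineSpec.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FrozenSpec:
    """Hypothetical immutable spec that stores only ordered step IDs."""
    step_ids: tuple

    def remove_step(self, step_id):
        # Build a new tuple and a new spec; the current object is not modified.
        remaining = tuple(s for s in self.step_ids if s != step_id)
        return replace(self, step_ids=remaining)


original = FrozenSpec(step_ids=("fill_missing", "drop_ids", "scale"))

# The result must be assigned; calling remove_step() alone changes nothing.
trimmed = original.remove_step("drop_ids")
```

Here `original` still lists all three steps after the call, so earlier versions of a spec stay available for comparison or rollback.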

StepSpec

A single preprocessing step.

    StepSpec(
        id="step_1",                        # unique identifier (required)
        type="FillMissingTransformer",      # transformer type from registry (required)
        columns=["col1", "col2"],           # columns to apply to (optional, None = all)
        params={"strategy": "median"},      # transformer constructor params (optional)
        description="Fill missing values"   # human-readable description (optional)
    )

When columns is specified, the transformer is wrapped in a DataFrameColumnTransformer that applies it only to the listed columns.
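
The effect of column scoping can be pictured with a small standard-library sketch. It assumes a table represented as a dict of lists instead of a DataFrame, and `apply_to_columns` is a hypothetical helper for this example. It does not reproduce how DataFrameColumnTransformer is implemented.

```python
def apply_to_columns(table, func, columns=None):
    """Apply func to selected columns of a dict-of-lists table.

    columns=None selects every column; unlisted columns pass through unchanged.
    """
    selected = set(table) if columns is None else set(columns)
    return {
        name: [func(v) for v in values] if name in selected else list(values)
        for name, values in table.items()
    }


table = {"age": [20, 40], "income": [100, 300], "visits": [2, 4]}

# Only the listed columns are transformed.
partial = apply_to_columns(table, lambda v: v * 10, columns=["age", "income"])

# With columns=None, every column is transformed.
everything = apply_to_columns(table, lambda v: v * 10)
```

In `partial`, the `visits` column comes back untouched, which is the behaviour to expect when a step lists only some of the columns.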

compile_spec()

Converts a PipelineSpec into an executable DataFramePipeline.

    from xplainable_preprocessing import compile_spec

    pipeline = compile_spec(spec)

    # Use like any sklearn pipeline
    pipeline.fit(X_train)
    X_transformed = pipeline.transform(X_test)
    X_transformed = pipeline.fit_transform(X_train)
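
For readers new to the fit/transform convention, the following toy sketch shows what a sequential pipeline does with its steps. The classes `Center`, `MaxAbsScale`, and `ToyPipeline` are hypothetical and operate on plain lists. They illustrate the general sklearn-style contract and are not the package's DataFramePipeline.

```python
class Center:
    """Toy step: learns the mean in fit, subtracts it in transform."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class MaxAbsScale:
    """Toy step: learns the largest absolute value, then divides by it."""

    def fit(self, values):
        self.scale_ = max(abs(v) for v in values)
        return self

    def transform(self, values):
        return [v / self.scale_ for v in values]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        # Each step is fitted on the output of the step before it.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        # No fitting here: the state learned earlier is reused as-is.
        for step in self.steps:
            values = step.transform(values)
        return values


toy = ToyPipeline([Center(), MaxAbsScale()])
train_out = toy.fit_transform([0, 10, 20])
test_out = toy.transform([30])
```

The training values become `[-1.0, 0.0, 1.0]`, and the new value `30` maps to `2.0` because it is shifted and scaled with the statistics learned from the training data.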

Available Transformers

Custom xplainable Transformers

Type	Description
`TextCleanTransformer`	Clean and normalise text columns
`DropColumnsTransformer`	Remove specified columns
`FillMissingTransformer`	Fill missing values with a given strategy
`TypeCastTransformer`	Cast column data types
`CategoryCondenseTransformer`	Condense low-frequency categories into an "other" bucket
`ExpressionTransformer`	Create new columns via pandas expressions
`DateTimeExtractTransformer`	Extract date/time components (year, month, day, etc.)
`RenameColumnsTransformer`	Rename columns
`GroupByAggTransformer`	Grouped aggregation features
`GroupedLagTransformer`	Lag features grouped by a key column
`RollingAggTransformer`	Rolling window aggregation features

sklearn Transformers

The following sklearn transformers are available in the registry out of the box:

Type	Description
`SimpleImputer`	Impute missing values (mean, median, most_frequent, constant)
`StandardScaler`	Standardise features (zero mean, unit variance)
`MinMaxScaler`	Scale features to a given range
`RobustScaler`	Scale features using statistics robust to outliers
`OneHotEncoder`	Encode categorical features as one-hot
`OrdinalEncoder`	Encode categorical features as ordinal integers
`PowerTransformer`	Apply power transforms (Box-Cox, Yeo-Johnson)
`QuantileTransformer`	Transform features to follow a uniform or normal distribution
`KBinsDiscretizer`	Bin continuous features into discrete intervals
`Binarizer`	Threshold features to binary values

Custom Transformers

You can write custom sklearn-compatible transformers inline:

    StepSpec(
        id="custom_step",
        type="custom",
        params={
            "code": """
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class MyTransformer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X.clip(lower=0)
    """,
            "class_name": "MyTransformer",
            "description": "Clip negative values to zero"
        }
    )

Registering New Transformers

    from xplainable_preprocessing import register

    register("MyCustomTransformer", MyCustomTransformerClass)
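
As background, a name-to-class registry can be sketched in a few lines. Everything below is hypothetical (`TOY_REGISTRY`, `toy_register`, `toy_build`, `ClipTransformer`) and only illustrates the general pattern. Whether the real register() rejects duplicate names, for instance, is an assumption made here and is not documented above.

```python
TOY_REGISTRY = {}


def toy_register(name, cls):
    """Make cls available under name; this sketch refuses silent overwrites."""
    if name in TOY_REGISTRY:
        raise ValueError(f"{name!r} is already registered")
    TOY_REGISTRY[name] = cls


def toy_build(step_type, params=None):
    """Look up a registered class and construct it from params."""
    try:
        cls = TOY_REGISTRY[step_type]
    except KeyError:
        known = ", ".join(sorted(TOY_REGISTRY)) or "none"
        raise KeyError(
            f"Unknown transformer type {step_type!r}; known types: {known}"
        ) from None
    return cls(**(params or {}))


class ClipTransformer:
    """Toy transformer that raises values below a lower bound to that bound."""

    def __init__(self, lower=0):
        self.lower = lower

    def transform(self, values):
        return [max(v, self.lower) for v in values]


toy_register("ClipTransformer", ClipTransformer)
clipper = toy_build("ClipTransformer", {"lower": 0})
clipped = clipper.transform([-5, 3])
```

Resolving the step type by name is what lets a spec stay plain data: the spec stores the string, and the class is looked up only when the spec is compiled.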

Examples

Data Cleaning Pipeline

    from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

    spec = PipelineSpec(steps=[
        StepSpec(
            id="drop_cols",
            type="DropColumnsTransformer",
            params={"columns": ["id", "created_at"]}
        ),
        StepSpec(
            id="fill_numeric",
            type="FillMissingTransformer",
            columns=["age", "income"],
            params={"strategy": "median"}
        ),
        StepSpec(
            id="fill_categorical",
            type="SimpleImputer",
            columns=["category", "region"],
            params={"strategy": "most_frequent"}
        ),
        StepSpec(
            id="condense_cats",
            type="CategoryCondenseTransformer",
            columns=["category"],
            params={"min_frequency": 0.01}
        ),
        StepSpec(
            id="scale",
            type="StandardScaler",
            columns=["age", "income"]
        ),
    ])

    pipeline = compile_spec(spec)
    X_processed = pipeline.fit_transform(X_train)

Feature Engineering Pipeline

    from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec

    spec = PipelineSpec(steps=[
        StepSpec(
            id="extract_date",
            type="DateTimeExtractTransformer",
            params={
                "column": "order_date",
                "components": ["year", "month", "dayofweek"]
            }
        ),
        StepSpec(
            id="rolling_avg",
            type="RollingAggTransformer",
            params={
                "column": "sales",
                "window": 7,
                "agg_func": "mean",
                "new_column": "sales_7d_avg"
            }
        ),
        StepSpec(
            id="lag_features",
            type="GroupedLagTransformer",
            params={
                "group_col": "customer_id",
                "value_col": "purchase_amount",
                "lags": [1, 7, 30]
            }
        ),
        StepSpec(
            id="expression",
            type="ExpressionTransformer",
            params={
                "expressions": {
                    "revenue_per_unit": "revenue / quantity"
                }
            }
        ),
    ])

    pipeline = compile_spec(spec)
    X_enriched = pipeline.fit_transform(X_train)

Using with XClassifier

    from xplainable_preprocessing import PipelineSpec, StepSpec, compile_spec
    from xplainable.core.models import XClassifier
    from sklearn.model_selection import train_test_split

    # Split the data (X and y are assumed to be a feature DataFrame and a
    # target Series loaded elsewhere)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Define preprocessing
    spec = PipelineSpec(steps=[
        StepSpec(id="fill", type="FillMissingTransformer", params={"strategy": "median"}),
        StepSpec(id="drop", type="DropColumnsTransformer", params={"columns": ["id"]}),
    ])

    pipeline = compile_spec(spec)

    # Preprocess: fit on the training split only, then reuse on the test split
    X_train_proc = pipeline.fit_transform(X_train)
    X_test_proc = pipeline.transform(X_test)

    # Train model
    model = XClassifier()
    model.fit(X_train_proc, y_train)

    # Evaluate
    metrics = model.evaluate(X_test_proc, y_test)
    print(metrics)

Cloud Persistence

For uploading and managing preprocessing pipelines on the Xplainable Cloud platform, see the REST API documentation.

Mutation Analysis

Analyze the impact of removing a step before modifying the pipeline:

    # Analyze what happens if we remove a step
    analysis = spec.analyze_removal("fill_numeric")
    print(f"Safe to remove: {analysis.safe}")
    print(f"Columns lost: {analysis.columns_lost}")
    print(f"Cascade removals: {analysis.cascade_remove}")
    print(f"Columns restored: {analysis.columns_restored}")

    # Remove step (returns new spec)
    new_spec = spec.remove_step("fill_numeric", cascade=True)

    # Insert a new step
    new_spec = spec.insert_step(
        StepSpec(id="new_step", type="StandardScaler"),
        after="fill_numeric"
    )

    # Reorder a step
    new_spec = spec.reorder_step("scale", new_index=0)
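
To build intuition for what a removal analysis has to work out, here is a minimal sketch under a simplifying assumption: each step declares the columns it requires and the columns it produces. The function `toy_analyze_removal` and the step dictionaries are hypothetical, and the logic is not taken from the package. It also leaves out the `columns_restored` case, where removing a step brings columns back.

```python
def toy_analyze_removal(steps, step_id):
    """Report which columns and later steps are affected by removing a step.

    steps is an ordered list of dicts with 'id', 'requires', and 'produces'.
    """
    ids = [s["id"] for s in steps]
    start = ids.index(step_id)

    lost = set(steps[start]["produces"])
    cascade = []
    for step in steps[start + 1:]:
        # A later step breaks if it needs a column that no longer exists;
        # anything it produced is then lost as well.
        if lost & set(step["requires"]):
            cascade.append(step["id"])
            lost |= set(step["produces"])

    return {
        "safe": not cascade,
        "columns_lost": sorted(lost),
        "cascade_remove": cascade,
    }


toy_steps = [
    {"id": "extract_date", "requires": ["order_date"], "produces": ["order_month"]},
    {"id": "month_flag", "requires": ["order_month"], "produces": ["is_december"]},
    {"id": "scale", "requires": ["income"], "produces": []},
]

risky = toy_analyze_removal(toy_steps, "extract_date")
harmless = toy_analyze_removal(toy_steps, "scale")
```

Removing `extract_date` takes `month_flag` with it because that step depends on `order_month`, while removing `scale` affects nothing downstream.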

Local Persistence

    from xplainable_preprocessing import save_pipeline, load_pipeline

    # Save a fitted pipeline
    save_pipeline(pipeline, "my_pipeline.pkl")

    # Load it back
    pipeline = load_pipeline("my_pipeline.pkl")
    X_transformed = pipeline.transform(X_new)

Generating a Transformer Catalog

Get a prompt-ready listing of all available transformers with their parameters:

    from xplainable_preprocessing import generate_catalog

    print(generate_catalog())
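
The output format of generate_catalog() is not shown here. As a rough idea of how such a listing could be assembled, the sketch below reads constructor parameters with the standard `inspect` module. `ToyFill`, `ToyScaler`, and `toy_catalog` are hypothetical names, and the one-line-per-transformer format is an assumption made for this example.

```python
import inspect


class ToyFill:
    def __init__(self, strategy="median"):
        self.strategy = strategy


class ToyScaler:
    def __init__(self, with_mean=True, with_std=True):
        self.with_mean = with_mean
        self.with_std = with_std


def toy_catalog(registry):
    """Return one line per registered class with its constructor parameters."""
    lines = []
    for name in sorted(registry):
        # inspect.signature on a class reports __init__ without 'self'.
        signature = inspect.signature(registry[name])
        params = [
            p.name if p.default is inspect.Parameter.empty
            else f"{p.name}={p.default!r}"
            for p in signature.parameters.values()
        ]
        lines.append(f"{name}({', '.join(params)})")
    return "\n".join(lines)


catalog = toy_catalog({"ToyScaler": ToyScaler, "ToyFill": ToyFill})
print(catalog)
```

A listing built this way stays in sync with the code, since parameter names and defaults are read from the classes themselves.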

Best Practices

Pipeline Design

Separate definition from execution -- define specs, then compile
Use unique step IDs -- makes mutation analysis and debugging easier
Specify columns when possible -- avoids unintended transformations
Fit on training data only -- avoid data leakage
Handle missing values early -- before other transformations
Version your specs -- specs are serialisable Pydantic models
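
The "fit on training data only" guideline in the list above can be made concrete with a small numeric sketch. It uses only the standard library and made-up numbers, and standardises by hand instead of using a pipeline.

```python
from statistics import mean, pstdev

train = [10.0, 20.0, 30.0]
test = [100.0]

# Correct: the statistics come from the training rows only.
mu = mean(train)
sigma = pstdev(train)
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

# Leaky: including the test row shifts the statistics that the training
# data is scaled with, so information about the test set reaches training.
leaky_mu = mean(train + test)
```

With the training rows alone the mean is 20.0. Adding the single test row moves it to 40.0, which would change every scaled training value and make evaluation results look better or worse than they really are.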

Troubleshooting

Unknown transformer type

Cause: The step type is not in the registry.

Solution: List the registered types with the snippet below, then register any missing custom transformer with register().

    from xplainable_preprocessing import REGISTRY
    print(sorted(REGISTRY.keys()))

Pipeline fails during transform

Possible causes:

New categories in test data not seen during training
Missing columns in new data
Data type mismatches

Solutions:

Use handle_unknown='ignore' in OneHotEncoder params
Ensure consistent column names between train and test
Use TypeCastTransformer to enforce types
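
The first cause and its fix can be illustrated with a toy encoder. `ToyOneHot` is a hypothetical class written for this example. It mimics the behaviour that `handle_unknown='ignore'` has in sklearn's OneHotEncoder, where a category not seen during fit is encoded as a row of zeros instead of raising an error.

```python
class ToyOneHot:
    """Toy one-hot encoder for a single list of category values."""

    def __init__(self, handle_unknown="error"):
        self.handle_unknown = handle_unknown

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        rows = []
        for value in values:
            if value not in self.categories_ and self.handle_unknown == "error":
                raise ValueError(f"Unknown category: {value!r}")
            # An unknown value matches no category, so its row is all zeros.
            rows.append([int(value == c) for c in self.categories_])
        return rows


train_regions = ["north", "south"]
new_regions = ["south", "west"]  # "west" was never seen during fit

tolerant = ToyOneHot(handle_unknown="ignore").fit(train_regions)
encoded = tolerant.transform(new_regions)
```

In a spec, the same setting would be passed through the step's params, for example `params={"handle_unknown": "ignore"}` on a step of type `OneHotEncoder`.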

Next Steps

Ready to Build Models?

Now that you understand preprocessing, explore:

Binary Classification with preprocessed data
Regression for continuous targets
REST API -- manage pipelines via the API
