Preprocessing
Preprocessing is handled by the xplainable-preprocessing package. It provides a spec-driven pipeline system where you define preprocessing steps declaratively using PipelineSpec and StepSpec, then compile them into an executable pipeline.
Overview
The xplainable-preprocessing package uses a declarative, spec-based approach:
- Define your pipeline as a PipelineSpec containing ordered StepSpec steps
- Compile the spec into an executable DataFramePipeline using compile_spec()
- Fit and transform your data using the compiled pipeline
This design separates the pipeline definition (serialisable, inspectable) from the pipeline execution, making it easy to persist, version, and share pipelines.
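The separation of definition from execution can be sketched with the standard library alone. Everything in the block below (MiniStep, TOY_REGISTRY, compile_steps, the two factory functions) is a hypothetical stand-in written for illustration, not the xplainable-preprocessing API; it only shows why a pipeline described as plain data is easy to serialise and then compile later.

```python
import json
from dataclasses import dataclass, field, asdict


# Hypothetical stand-in for a declarative step; NOT the package's StepSpec.
@dataclass(frozen=True)
class MiniStep:
    id: str
    type: str
    params: dict = field(default_factory=dict)


def make_fill_missing(column, value):
    """Build a row function that replaces None in `column` with `value`."""
    def apply(row):
        filled = dict(row)
        if filled.get(column) is None:
            filled[column] = value
        return filled
    return apply


def make_drop_column(column):
    """Build a row function that removes `column` from a row."""
    def apply(row):
        return {k: v for k, v in row.items() if k != column}
    return apply


# Toy registry: type name -> factory for an executable row function.
TOY_REGISTRY = {
    "fill_missing": make_fill_missing,
    "drop_column": make_drop_column,
}


def compile_steps(steps):
    """Turn declarative steps into one callable that processes a list of rows."""
    funcs = [TOY_REGISTRY[s.type](**s.params) for s in steps]

    def run(rows):
        out = []
        for row in rows:
            for func in funcs:
                row = func(row)
            out.append(row)
        return out

    return run


steps = [
    MiniStep("fill_age", "fill_missing", {"column": "age", "value": 0}),
    MiniStep("drop_id", "drop_column", {"column": "id"}),
]

# The definition is plain data, so it round-trips through JSON unchanged.
payload = json.dumps([asdict(s) for s in steps])
restored = [MiniStep(**d) for d in json.loads(payload)]

# Execution only exists after compiling the restored definition.
pipeline = compile_steps(restored)
result = pipeline([{"id": 1, "age": None}, {"id": 2, "age": 31}])
```

Because the steps hold only names and parameters, the JSON payload can be stored or versioned without pickling any executable objects.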
Key Features
Define pipelines as data (PipelineSpec / StepSpec) that can be serialised, versioned, and inspected.
Built-in and sklearn transformers available through a unified registry with automatic parameter coercion.
Save pipeline specs to the xplainable platform for team sharing and deployment.
Analyze the impact of adding, removing, or reordering steps before recompiling.
Installation
Quick Start
Core Classes
PipelineSpec
The top-level specification containing an ordered list of steps.
Key methods:
| Method | Description |
|---|---|
| get_step(step_id) | Get a step by its ID |
| step_index(step_id) | Get the index of a step |
| remove_step(step_id, cascade=True) | Return a new spec with the step removed |
| insert_step(step, after=None) | Return a new spec with a step inserted |
| reorder_step(step_id, new_index) | Return a new spec with a step moved |
| update_step_params(step_id, params) | Return a new spec with updated params |
| analyze_removal(step_id) | Analyze the impact of removing a step |
| enrich_from_deltas(step_deltas) | Populate column contracts from fit deltas |
| optimize() | Return a topologically sorted spec |
All mutation methods return new PipelineSpec instances (immutable pattern).
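The immutable pattern can be illustrated with frozen dataclasses. ToyStep and ToySpec below are hypothetical stand-ins, and their method bodies are assumptions about how such a pattern might look; they are not the package's implementation. The point is only that each "mutation" builds a new object and leaves the original untouched.

```python
from dataclasses import dataclass, replace
from typing import Tuple


# Hypothetical stand-ins; NOT the package's StepSpec / PipelineSpec.
@dataclass(frozen=True)
class ToyStep:
    id: str
    type: str


@dataclass(frozen=True)
class ToySpec:
    steps: Tuple[ToyStep, ...] = ()

    def remove_step(self, step_id):
        """Return a new spec without the given step."""
        kept = tuple(s for s in self.steps if s.id != step_id)
        return replace(self, steps=kept)

    def reorder_step(self, step_id, new_index):
        """Return a new spec with the given step moved to new_index."""
        steps = list(self.steps)
        position = [s.id for s in steps].index(step_id)
        step = steps.pop(position)
        steps.insert(new_index, step)
        return replace(self, steps=tuple(steps))


original = ToySpec((
    ToyStep("clean", "TextCleanTransformer"),
    ToyStep("impute", "SimpleImputer"),
    ToyStep("scale", "StandardScaler"),
))

trimmed = original.remove_step("impute")
moved = original.reorder_step("scale", 0)

original_ids = [s.id for s in original.steps]
trimmed_ids = [s.id for s in trimmed.steps]
moved_ids = [s.id for s in moved.steps]
```

Since the original spec is never changed, earlier versions can be kept around for comparison or rollback.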
StepSpec
A single preprocessing step.
When columns is specified, the transformer is wrapped in a DataFrameColumnTransformer that applies it only to the listed columns.
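The idea of column scoping can be sketched as a small wrapper. ToyMinMax and ToyColumnWrapper are hypothetical, and tables are plain dicts of lists rather than DataFrames; this is not how DataFrameColumnTransformer is implemented, only an illustration of applying a transformer to selected columns and passing the rest through.

```python
class ToyMinMax:
    """Min-max scale every column it receives (dict: column -> list of numbers)."""

    def fit(self, table):
        self.ranges_ = {c: (min(v), max(v)) for c, v in table.items()}
        return self

    def transform(self, table):
        out = {}
        for column, values in table.items():
            low, high = self.ranges_[column]
            span = (high - low) or 1.0  # avoid dividing by zero on constant columns
            out[column] = [(x - low) / span for x in values]
        return out


class ToyColumnWrapper:
    """Apply an inner transformer to the listed columns only."""

    def __init__(self, inner, columns):
        self.inner = inner
        self.columns = list(columns)

    def _subset(self, table):
        return {c: table[c] for c in self.columns}

    def fit(self, table):
        self.inner.fit(self._subset(table))
        return self

    def transform(self, table):
        changed = self.inner.transform(self._subset(table))
        # Keep the original column order; untouched columns pass through.
        return {c: changed.get(c, v) for c, v in table.items()}


table = {"age": [20, 30, 40], "city": ["a", "b", "c"]}
wrapped = ToyColumnWrapper(ToyMinMax(), columns=["age"]).fit(table)
scaled = wrapped.transform(table)
```

Here only age is scaled; city never reaches the inner transformer, which is the behaviour the columns field is meant to guarantee.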
compile_spec()
Converts a PipelineSpec into an executable DataFramePipeline.
Available Transformers
Custom xplainable Transformers
| Type | Description |
|---|---|
| TextCleanTransformer | Clean and normalise text columns |
| DropColumnsTransformer | Remove specified columns |
| FillMissingTransformer | Fill missing values with a given strategy |
| TypeCastTransformer | Cast column data types |
| CategoryCondenseTransformer | Condense low-frequency categories into an "other" bucket |
| ExpressionTransformer | Create new columns via pandas expressions |
| DateTimeExtractTransformer | Extract date/time components (year, month, day, etc.) |
| RenameColumnsTransformer | Rename columns |
| GroupByAggTransformer | Grouped aggregation features |
| GroupedLagTransformer | Lag features grouped by a key column |
| RollingAggTransformer | Rolling window aggregation features |
sklearn Transformers
The following sklearn transformers are available in the registry out of the box:
| Type | Description |
|---|---|
| SimpleImputer | Impute missing values (mean, median, most_frequent, constant) |
| StandardScaler | Standardise features (zero mean, unit variance) |
| MinMaxScaler | Scale features to a given range |
| RobustScaler | Scale features using statistics robust to outliers |
| OneHotEncoder | Encode categorical features as one-hot |
| OrdinalEncoder | Encode categorical features as ordinal integers |
| PowerTransformer | Apply power transforms (Box-Cox, Yeo-Johnson) |
| QuantileTransformer | Transform features to follow a uniform or normal distribution |
| KBinsDiscretizer | Bin continuous features into discrete intervals |
| Binarizer | Threshold features to binary values |
Custom Transformers
You can write custom sklearn-compatible transformers inline.
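As a rough sketch of the fit/transform protocol such a transformer follows, the class below is written in plain Python so it runs without dependencies. ClipToTrainingRange and its margin parameter are hypothetical. In real use you would usually subclass sklearn.base.BaseEstimator and TransformerMixin, which supply get_params, set_params, and fit_transform; how the package accepts inline classes is not shown here.

```python
class ClipToTrainingRange:
    """Duck-typed transformer: learn per-column bounds in fit, clip in transform.

    X is a list of rows, each row a list of numbers.
    """

    def __init__(self, margin=0.0):
        # Constructor only stores parameters, following the sklearn convention.
        self.margin = margin

    def get_params(self, deep=True):
        return {"margin": self.margin}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        columns = list(zip(*X))
        # Learned state uses a trailing underscore, as sklearn estimators do.
        self.lower_ = [min(c) - self.margin for c in columns]
        self.upper_ = [max(c) + self.margin for c in columns]
        return self

    def transform(self, X):
        return [
            [min(max(v, lo), hi) for v, lo, hi in zip(row, self.lower_, self.upper_)]
            for row in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


train = [[1.0, 10.0], [3.0, 30.0]]
new_data = [[0.0, 20.0], [5.0, 40.0]]

clipper = ClipToTrainingRange().fit(train)
clipped = clipper.transform(new_data)
```

Values outside the range seen during fit are pulled back to the nearest training bound, while in-range values are left alone.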
Registering New Transformers
Examples
Data Cleaning Pipeline
Feature Engineering Pipeline
Using with XClassifier
Cloud Persistence
For uploading and managing preprocessing pipelines on the Xplainable Cloud platform, see the REST API documentation.
Mutation Analysis
Analyze the impact of removing a step before modifying the pipeline.
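To make the idea concrete, here is a minimal sketch of what a removal analysis could compute. It assumes each step declares the columns it reads and creates; the dictionary layout, the function name analyze_removal_sketch, and the returned keys are all hypothetical and do not describe the output of the package's analyze_removal().

```python
def analyze_removal_sketch(steps, step_id):
    """Report which later steps would break if `step_id` were removed.

    steps: ordered list of dicts with keys 'id', 'reads', 'creates',
    where 'reads' and 'creates' are sets of column names.
    """
    ids = [s["id"] for s in steps]
    start = ids.index(step_id)

    # Columns that disappear once the step is gone.
    lost = set(steps[start]["creates"])
    broken = []

    for step in steps[start + 1:]:
        if step["reads"] & lost:
            broken.append(step["id"])
            # A broken step cannot produce its outputs either, so the loss cascades.
            lost |= step["creates"]

    return {
        "removed": step_id,
        "broken_steps": broken,
        "lost_columns": sorted(lost),
    }


steps = [
    {"id": "parse_date", "reads": {"signup"}, "creates": {"signup_year"}},
    {"id": "tenure", "reads": {"signup_year"}, "creates": {"tenure"}},
    {"id": "scale", "reads": {"tenure", "income"}, "creates": set()},
    {"id": "encode", "reads": {"city"}, "creates": {"city_code"}},
]

report = analyze_removal_sketch(steps, "parse_date")
```

In this toy pipeline, removing parse_date breaks tenure directly and scale indirectly, while encode is unaffected because it reads an unrelated column.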
Local Persistence
Generating a Transformer Catalog
Get a prompt-ready listing of all available transformers with their parameters.
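One way such a listing could be produced is by inspecting constructor signatures. The sketch below uses only the standard library; DemoImputer, DemoBinner, describe_params, and build_catalog are hypothetical names, and the output format is an assumption rather than the package's actual catalog format.

```python
import inspect


# Hypothetical transformer classes used only to populate the demo registry.
class DemoImputer:
    def __init__(self, strategy="mean", fill_value=None):
        self.strategy = strategy
        self.fill_value = fill_value


class DemoBinner:
    def __init__(self, n_bins, strategy="quantile"):
        self.n_bins = n_bins
        self.strategy = strategy


def describe_params(cls):
    """Render constructor parameters as 'name' or 'name=default'."""
    parts = []
    for param in inspect.signature(cls).parameters.values():
        if param.default is inspect.Parameter.empty:
            parts.append(param.name)
        else:
            parts.append(f"{param.name}={param.default!r}")
    return ", ".join(parts)


def build_catalog(registry):
    """Return one line per registered type, sorted by name."""
    lines = [f"{name}({describe_params(registry[name])})" for name in sorted(registry)]
    return "\n".join(lines)


catalog = build_catalog({"DemoImputer": DemoImputer, "DemoBinner": DemoBinner})
```

A plain-text listing like this is compact enough to paste into a prompt or a README alongside the spec it describes.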
Best Practices
- Separate definition from execution -- define specs, then compile
- Use unique step IDs -- makes mutation analysis and debugging easier
- Specify columns when possible -- avoids unintended transformations
- Fit on training data only -- avoid data leakage
- Handle missing values early -- before other transformations
- Version your specs -- specs are serialisable Pydantic models
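The "fit on training data only" rule can be illustrated with a toy standardiser. ToyStandardizer is a hypothetical class built on the statistics module, not part of the package; the sketch shows that statistics are learned once from the training split and then reused unchanged on the test split.

```python
from statistics import mean, pstdev


class ToyStandardizer:
    """Learn mean and standard deviation in fit, apply them in transform."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against constant input
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

# Fit on the training split only, then reuse the learned statistics.
scaler = ToyStandardizer().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Fitting on train and test together would shift mean_ and std_ towards the test value, letting information from the held-out data leak into the features.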
Troubleshooting
Unknown transformer type
Cause: The step type is not in the registry.
Solution: Check available types with from xplainable_preprocessing import REGISTRY; print(sorted(REGISTRY.keys())). Register custom transformers with register().
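For intuition, a name-based registry with a helpful lookup error might look like the sketch below. TOY_REGISTRY, register_toy, and resolve are hypothetical and do not reflect the signature of the package's register() or the contents of REGISTRY; the suggestion logic uses difflib from the standard library.

```python
import difflib

# Hypothetical registry: type name -> transformer class.
TOY_REGISTRY = {}


def register_toy(name, cls):
    """Make a transformer class available under a type name."""
    TOY_REGISTRY[name] = cls


def resolve(type_name):
    """Look up a type name, suggesting close matches when it is unknown."""
    try:
        return TOY_REGISTRY[type_name]
    except KeyError:
        hints = difflib.get_close_matches(type_name, list(TOY_REGISTRY), n=3)
        hint = f" Did you mean {hints}?" if hints else ""
        raise KeyError(f"Unknown transformer type {type_name!r}.{hint}") from None


class ToyScaler:
    pass


register_toy("StandardScaler", ToyScaler)

try:
    resolve("StandardScalar")  # misspelled on purpose
except KeyError as err:
    message = err.args[0]
```

Most "unknown type" errors come from a misspelled name or a custom class that was never registered, so checking the registry keys first is usually the quickest fix.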
Pipeline fails during transform
Possible causes:
- New categories in test data not seen during training
- Missing columns in new data
- Data type mismatches
Solutions:
- Use handle_unknown='ignore' in OneHotEncoder params
- Ensure consistent column names between train and test
- Use TypeCastTransformer to enforce types
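Missing columns and type mismatches can often be caught before transform is called. The helper below is a hypothetical pre-flight check written for illustration, using dicts of lists in place of DataFrames; it is not a feature of the package.

```python
def check_schema(expected, table):
    """Compare incoming data with the schema seen at training time.

    expected: dict mapping column name -> Python type
    table: dict mapping column name -> list of values (None means missing)
    Returns a list of human-readable problems; empty means the data matches.
    """
    problems = []
    for column, typ in expected.items():
        if column not in table:
            problems.append(f"missing column: {column}")
        elif not all(isinstance(v, typ) for v in table[column] if v is not None):
            problems.append(f"type mismatch in {column}: expected {typ.__name__}")
    for column in table:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


expected = {"age": int, "city": str}
new_data = {"age": ["31", 42], "town": ["x", "y"]}

problems = check_schema(expected, new_data)
```

Running a check like this on new data gives a readable list of differences instead of a failure deep inside a transformer.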
Next Steps
Now that you understand preprocessing, explore:
- Binary Classification with preprocessed data
- Regression for continuous targets
- REST API — Manage pipelines via the API