Version: v1.4.1

Custom Transformers

Build Your Own Preprocessing Steps

Custom transformers allow you to write sklearn-compatible transformer classes and use them within xplainable-preprocessing pipelines. This enables domain-specific preprocessing while leveraging the full pipeline compilation and validation system.

Overview

The xplainable-preprocessing package provides a declarative pipeline system where preprocessing steps are defined as StepSpec objects and compiled into executable pipelines. When no built-in transformer type fits your needs, you can write a custom transformer using type="custom".

Custom transformers are:

  • Validated at compile time via AST-based code analysis to prevent unsafe operations
  • Executed in a restricted namespace for security
  • Fully compatible with the pipeline specification, column contracts, and mutation system

How Custom Transformers Work

Custom transformers use the StepSpec with type="custom". The params dictionary must contain:

  • code (str): The Python source code defining the transformer class
  • class_name (str): The name of the class to instantiate from the code

Any additional keys in params are passed as constructor keyword arguments to the class.

from xplainable_preprocessing.schema import StepSpec, PipelineSpec

custom_step = StepSpec(
    id="my_custom_step",
    type="custom",
    columns=["col_a", "col_b"],  # Optional: apply only to specific columns
    params={
        "code": """
import numpy as np

class MyTransformer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.copy()
""",
        "class_name": "MyTransformer",
        "threshold": 0.75  # Passed to the constructor
    }
)
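How the extra keys in params reach the constructor can be pictured with a small sketch. This is not the library's implementation: build_from_params is a hypothetical helper, and a plain exec stands in for the restricted sandbox.

```python
def build_from_params(params):
    """Hypothetical helper: split params into code, class name, and kwargs."""
    params = dict(params)              # copy so the caller's dict is untouched
    code = params.pop("code")
    class_name = params.pop("class_name")

    namespace = {}
    exec(code, namespace)              # illustration only; no sandboxing here
    cls = namespace[class_name]
    return cls(**params)               # remaining keys become constructor kwargs


params = {
    "code": """
class MyTransformer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
""",
    "class_name": "MyTransformer",
    "threshold": 0.75,
}

transformer = build_from_params(params)
print(transformer.threshold)
```

Under this reading, a key in params that the constructor does not accept would surface as a TypeError when the class is instantiated, so constructor signatures and params keys are best kept in sync.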

Writing Custom Transformer Classes

Required Interface

Custom transformer classes must implement the sklearn transformer interface:

class MyTransformer:
    def __init__(self, **kwargs):
        # Store configuration parameters
        pass

    def fit(self, X, y=None):
        # Learn any state from training data
        # Must return self
        return self

    def transform(self, X):
        # Apply the transformation
        # Must return a DataFrame or array
        return X
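The interface can be exercised outside any pipeline. The sketch below uses a hypothetical MeanCenterer class (not part of the package) to show state being learned in fit() and reused in transform(), assuming pandas DataFrames as input.

```python
import pandas as pd


class MeanCenterer:
    """Illustrative transformer: subtracts the per-column training mean."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        cols = self.columns or list(X.columns)
        # State learned from the training data only
        self.means_ = {col: float(X[col].mean()) for col in cols}
        return self

    def transform(self, X):
        X = X.copy()                   # leave the caller's frame untouched
        for col, mean in self.means_.items():
            X[col] = X[col] - mean
        return X


train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"a": [4.0], "b": [40.0]})

centerer = MeanCenterer().fit(train)   # chaining works because fit() returns self
result = centerer.transform(test)
print(result)
```

The test frame is centered with the training means (2.0 and 20.0), which is the behaviour a fitted pipeline step is expected to have on unseen data.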

Code Validation Rules

The xplainable-preprocessing sandbox validates custom code using AST analysis before execution. The following restrictions apply:

Forbidden imports -- these modules cannot be imported: os, sys, subprocess, socket, shutil, pathlib, http, urllib, requests, importlib, ctypes, signal, multiprocessing, threading, asyncio, pickle, shelve, tempfile, glob, io, builtins, code, codeop

Forbidden function calls -- these functions cannot be called: exec, eval, __import__, open, compile, globals, locals, breakpoint, exit, quit, getattr, setattr, delattr

Allowed imports include standard data science libraries (numpy, pandas, sklearn, scipy) and standard-library modules such as math, re, json, collections, itertools, functools, datetime, decimal, and statistics, along with any other module not on the forbidden list.
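To make the import and call rules concrete, here is a minimal sketch of an AST-based check built on Python's standard ast module. It is an assumption about how such a validator could work, not the package's actual code; find_violations is a hypothetical name and the two sets are abbreviated copies of the documented lists.

```python
import ast

# Abbreviated copies of the documented lists, for illustration only.
FORBIDDEN_IMPORTS = {"os", "sys", "subprocess", "socket", "pickle"}
FORBIDDEN_CALLS = {"exec", "eval", "__import__", "open", "getattr"}


def find_violations(source):
    """Return a list of human-readable violations found in the source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            root = module.split(".")[0]    # "os.path" counts as "os"
            if root in FORBIDDEN_IMPORTS:
                violations.append(f"forbidden import: {module}")
        # Only direct calls by name are checked in this sketch
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"forbidden call: {node.func.id}")
    return violations


safe = "import numpy as np\n\nclass T:\n    def fit(self, X, y=None):\n        return self\n"
unsafe = "import os.path\n\ndata = open('secrets.txt').read()\n"

print(find_violations(safe))
print(find_violations(unsafe))
```

Because the check works on the parsed syntax tree, the code is never executed during validation; the source is only inspected.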

Available builtins in the restricted namespace include: True, False, None, int, float, str, bool, list, dict, tuple, set, frozenset, range, enumerate, zip, map, filter, sorted, reversed, len, min, max, sum, abs, round, isinstance, issubclass, type, super, property, staticmethod, classmethod, print, repr, hasattr, any, all, and common exception types.
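The effect of a restricted namespace can be sketched with plain exec and a whitelisted builtins dictionary. This is an illustrative assumption rather than the sandbox's real implementation: run_restricted is a hypothetical helper, the whitelist is abbreviated, and import handling is left out entirely.

```python
import builtins

# Abbreviated whitelist; the documented list is longer.
ALLOWED_NAMES = ["int", "float", "len", "range", "sum", "isinstance", "ValueError"]
safe_builtins = {name: getattr(builtins, name) for name in ALLOWED_NAMES}
# Class statements look up __build_class__ in the builtins mapping.
safe_builtins["__build_class__"] = builtins.__build_class__


def run_restricted(code):
    """Execute code with only the whitelisted builtins visible."""
    # __name__ must exist because class bodies read it to set __module__.
    namespace = {"__builtins__": safe_builtins, "__name__": "custom_code"}
    exec(code, namespace)
    return namespace


allowed_code = """
class Doubler:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * 2 for x in X]

total = sum(range(4))
"""
ns = run_restricted(allowed_code)
print(ns["total"])
print(ns["Doubler"]().fit([1]).transform([1, 2]))

blocked = False
try:
    run_restricted("handle = open('data.csv')")
except NameError as exc:
    blocked = True                     # open is simply not defined in the namespace
    print("blocked:", exc)
```

A name that is missing from the whitelist fails with an ordinary NameError at run time, which complements the earlier static check rather than replacing it.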

Examples

Outlier Capper

Cap values at specified percentiles to handle outliers:

from xplainable_preprocessing.schema import StepSpec, PipelineSpec
from xplainable_preprocessing.compiler import compile_spec

spec = PipelineSpec(
    steps=[
        StepSpec(
            id="cap_outliers",
            type="custom",
            columns=["revenue", "cost"],
            params={
                "code": """
import numpy as np

class OutlierCapper:
    def __init__(self, lower_pct=0.01, upper_pct=0.99):
        self.lower_pct = lower_pct
        self.upper_pct = upper_pct
        self.bounds = {}

    def fit(self, X, y=None):
        for col in X.columns:
            self.bounds[col] = (
                float(np.nanpercentile(X[col], self.lower_pct * 100)),
                float(np.nanpercentile(X[col], self.upper_pct * 100))
            )
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            if col in self.bounds:
                lower, upper = self.bounds[col]
                X[col] = X[col].clip(lower, upper)
        return X
""",
                "class_name": "OutlierCapper",
                "lower_pct": 0.02,
                "upper_pct": 0.98
            }
        )
    ]
)

# Compile and use the pipeline
pipeline = compile_spec(spec)
pipeline.fit(df)
df_transformed = pipeline.transform(df)

Feature Interaction Creator

Create interaction features from existing columns:

from xplainable_preprocessing.schema import StepSpec

interaction_step = StepSpec(
    id="create_interactions",
    type="custom",
    params={
        "code": """
import numpy as np

class InteractionCreator:
    def __init__(self, pairs=None):
        self.pairs = pairs or []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col_a, col_b in self.pairs:
            if col_a in X.columns and col_b in X.columns:
                X[f'{col_a}_x_{col_b}'] = X[col_a] * X[col_b]
                X[f'{col_a}_div_{col_b}'] = np.where(
                    X[col_b] != 0, X[col_a] / X[col_b], np.nan
                )
        return X
""",
        "class_name": "InteractionCreator",
        "pairs": [["revenue", "employees"], ["cost", "units"]]
    }
)

Cyclic Time Encoder

Encode cyclical time features (e.g., hour of day, day of week) using sine and cosine:

from xplainable_preprocessing.schema import StepSpec

time_encoder_step = StepSpec(
    id="encode_time",
    type="custom",
    params={
        "code": """
import numpy as np

class CyclicTimeEncoder:
    def __init__(self, columns=None, max_values=None):
        self.columns = columns or []
        self.max_values = max_values or {}

    def fit(self, X, y=None):
        for col in self.columns:
            if col not in self.max_values:
                self.max_values[col] = int(X[col].max()) + 1
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            max_val = self.max_values[col]
            X[f'{col}_sin'] = np.sin(2 * np.pi * X[col] / max_val)
            X[f'{col}_cos'] = np.cos(2 * np.pi * X[col] / max_val)
        return X
""",
        "class_name": "CyclicTimeEncoder",
        "columns": ["hour", "day_of_week", "month"],
        "max_values": {"hour": 24, "day_of_week": 7, "month": 12}
    }
)
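A short worked example may help show why the sine and cosine pair is used. The sketch below relies only on the standard math module; encode and distance are hypothetical helper names. It measures how far apart two hours are once each is mapped to a point on the unit circle.

```python
import math


def encode(value, period):
    """Map a cyclic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b, period):
    """Euclidean distance between the encoded points for a and b."""
    return math.dist(encode(a, period), encode(b, period))


print(round(distance(23, 0, 24), 4))   # 23:00 vs 00:00, one hour apart
print(round(distance(0, 1, 24), 4))    # 00:00 vs 01:00, also one hour apart
print(round(distance(0, 12, 24), 4))   # midnight vs noon, opposite sides
```

With the raw integer feature, hours 23 and 0 differ by 23; after encoding they are exactly as close as hours 0 and 1, while midnight and noon sit at the maximum distance.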

Combining Custom and Built-in Steps

Custom transformers work alongside the built-in transformer registry. A pipeline can mix both:

from xplainable_preprocessing.schema import StepSpec, PipelineSpec
from xplainable_preprocessing.compiler import compile_spec

spec = PipelineSpec(
    steps=[
        # Built-in: standard scaling
        StepSpec(
            id="scale_numerics",
            type="StandardScaler",
            columns=["age", "income", "balance"]
        ),

        # Custom: domain-specific ratio creation
        StepSpec(
            id="financial_ratios",
            type="custom",
            params={
                "code": """
import numpy as np

class FinancialRatios:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['debt_to_income'] = np.where(
            X['income'] != 0, X['balance'] / X['income'], np.nan
        )
        return X
""",
                "class_name": "FinancialRatios"
            }
        ),

        # Built-in: drop columns
        StepSpec(
            id="drop_ids",
            type="DropColumnsTransformer",
            params={"columns": ["customer_id", "account_id"]}
        )
    ]
)

pipeline = compile_spec(spec)
pipeline.fit(train_df)
train_transformed = pipeline.transform(train_df)
test_transformed = pipeline.transform(test_df)

Column Scoping

When columns is specified on a StepSpec, the transformer is wrapped in a DataFrameColumnTransformer that applies the transformation only to those columns, leaving others untouched:

from xplainable_preprocessing.schema import StepSpec

# This step only applies to the specified columns
StepSpec(
    id="log_transform",
    type="custom",
    columns=["revenue", "cost", "profit"],  # Only these columns are transformed
    params={
        "code": """
import numpy as np

class LogTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)
""",
        "class_name": "LogTransformer"
    }
)
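What column scoping amounts to can be sketched with a small wrapper. ColumnScopedWrapper below is a hypothetical stand-in written for illustration; it is not the package's DataFrameColumnTransformer, whose internals are not described here, and it assumes the wrapped transformer returns the same columns it receives.

```python
import numpy as np
import pandas as pd


class ColumnScopedWrapper:
    """Hypothetical stand-in: apply a transformer to selected columns only."""

    def __init__(self, transformer, columns):
        self.transformer = transformer
        self.columns = columns

    def fit(self, X, y=None):
        self.transformer.fit(X[self.columns], y)
        return self

    def transform(self, X):
        X = X.copy()
        # Only the scoped columns are passed in and written back
        X[self.columns] = self.transformer.transform(X[self.columns])
        return X


class LogTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)


df = pd.DataFrame(
    {"revenue": [0.0, 9.0], "cost": [0.0, 99.0], "region": ["north", "south"]}
)
scoped = ColumnScopedWrapper(LogTransformer(), ["revenue", "cost"]).fit(df)
out = scoped.transform(df)
print(out)
```

The region column passes through unchanged, which is why LogTransformer can call np.log1p on its whole input without checking for non-numeric columns.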

Pipeline Validation

Before compilation, you can validate your pipeline specification:

from xplainable_preprocessing.schema import PipelineSpec, validate_spec

spec = PipelineSpec(steps=[...])

# Validates:
# - All non-custom types exist in the registry
# - Custom steps have 'code' and 'class_name' in params
# - Step IDs are unique
validate_spec(spec)

Custom code is further validated via AST analysis during compile_spec() to ensure no forbidden imports or calls are present.
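The three structural checks listed for validate_spec can be pictured on plain dictionaries. The sketch below is a hypothetical re-creation for illustration: check_spec, the dictionary layout, and the choice to return a list of messages are all assumptions, and the real function may report problems differently.

```python
def check_spec(steps, registry):
    """Hypothetical re-creation of the three documented checks on plain dicts."""
    errors = []
    seen_ids = set()
    for step in steps:
        step_id = step["id"]
        if step_id in seen_ids:
            errors.append(f"duplicate step id: {step_id}")
        seen_ids.add(step_id)

        if step["type"] == "custom":
            params = step.get("params", {})
            for key in ("code", "class_name"):
                if key not in params:
                    errors.append(f"step {step_id}: missing params[{key!r}]")
        elif step["type"] not in registry:
            errors.append(f"step {step_id}: unknown type {step['type']!r}")
    return errors


registry = {"StandardScaler", "DropColumnsTransformer"}
steps = [
    {"id": "scale", "type": "StandardScaler"},
    {"id": "scale", "type": "custom", "params": {"code": "class T: pass"}},
    {"id": "mystery", "type": "NotARealStep"},
]
errors = check_spec(steps, registry)
for error in errors:
    print(error)
```

None of these checks look inside the code string itself, which matches the split described above: structure is validated first, and the code is analysed later during compilation.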

Best Practices

Writing Custom Transformers

  1. Always return self from fit() to support method chaining and pipeline compatibility.
  2. Always copy the input DataFrame in transform() with X.copy() to avoid modifying the original data.
  3. Handle missing values -- check for NaN/None in your transformation logic.
  4. Keep code self-contained -- avoid relying on external state or files. Only import from allowed modules.
  5. Use constructor parameters for configuration rather than hardcoding values, so they can be adjusted via params.
  6. Test your transformer independently before embedding it in a pipeline specification.

Security Restrictions

Custom transformer code runs in a restricted sandbox. Do not attempt to use file I/O, network access, subprocess execution, or other system-level operations. The AST validator will reject code containing forbidden imports or function calls.
