Custom Transformers

Build Your Own Preprocessing Steps

Custom transformers allow you to write sklearn-compatible transformer classes and use them within xplainable-preprocessing pipelines. This enables domain-specific preprocessing while leveraging the full pipeline compilation and validation system.

Overview

The xplainable-preprocessing package provides a declarative pipeline system where preprocessing steps are defined as StepSpec objects and compiled into executable pipelines. When no built-in transformer type fits your needs, you can write a custom transformer using type="custom".

Custom transformers are:

Validated at compile time via AST-based code analysis to prevent unsafe operations
Executed in a restricted namespace for security
Fully compatible with the pipeline specification, column contracts, and mutation system

How Custom Transformers Work

A custom transformer is declared as a StepSpec with type="custom". Its params dictionary must contain:

code (str): The Python source code defining the transformer class
class_name (str): The name of the class to instantiate from the code

Any additional keys in params are passed as constructor keyword arguments to the class.

from xplainable_preprocessing.schema import StepSpec, PipelineSpec

custom_step = StepSpec(
    id="my_custom_step",
    type="custom",
    columns=["col_a", "col_b"],  # Optional: apply only to specific columns
    params={
        "code": """
import numpy as np

class MyTransformer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.copy()
""",
        "class_name": "MyTransformer",
        "threshold": 0.75  # Passed to the constructor
    }
)
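To make the split between the two reserved keys and the constructor arguments concrete, here is a minimal sketch. The helper `split_custom_params` is hypothetical and written only for illustration; it is not part of xplainable-preprocessing, and the library may handle this step differently internally.

```python
RESERVED_KEYS = ("code", "class_name")


def split_custom_params(params):
    """Hypothetical helper: separate the reserved keys from constructor kwargs."""
    missing = [key for key in RESERVED_KEYS if key not in params]
    if missing:
        raise ValueError(f"custom step params missing required keys: {missing}")
    # Everything that is not a reserved key is forwarded to the constructor.
    kwargs = {k: v for k, v in params.items() if k not in RESERVED_KEYS}
    return params["code"], params["class_name"], kwargs


code, class_name, kwargs = split_custom_params(
    {
        "code": "class MyTransformer: ...",
        "class_name": "MyTransformer",
        "threshold": 0.75,
    }
)
print(class_name, kwargs)
```

Under this reading, the class would be instantiated roughly as `MyTransformer(**kwargs)`, which is why every extra key must match a constructor parameter name.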

Writing Custom Transformer Classes

Required Interface

Custom transformer classes must implement the sklearn transformer interface:

class MyTransformer:
    def __init__(self, **kwargs):
        # Store configuration parameters
        pass

    def fit(self, X, y=None):
        # Learn any state from training data
        # Must return self
        return self

    def transform(self, X):
        # Apply the transformation
        # Must return a DataFrame or array
        return X
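A quick way to confirm that a class honours this contract before embedding it in a spec is a small smoke test run in ordinary Python, outside the sandbox. The sketch below is an assumption-laden illustration: `check_transformer_interface` and the toy `Doubler` class are hypothetical names, and plain nested lists stand in for a DataFrame to keep the example dependency-free.

```python
def check_transformer_interface(cls, X, **kwargs):
    """Hypothetical smoke test for the fit/transform contract."""
    instance = cls(**kwargs)
    for method in ("fit", "transform"):
        if not callable(getattr(instance, method, None)):
            raise TypeError(f"{cls.__name__} is missing a callable {method}()")
    # fit() must return the instance itself so calls can be chained.
    if instance.fit(X) is not instance:
        raise ValueError(f"{cls.__name__}.fit() must return self")
    return instance.transform(X)


class Doubler:
    """Toy transformer used only to exercise the check."""

    def __init__(self, factor=2):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[value * self.factor for value in row] for row in X]


result = check_transformer_interface(Doubler, [[1, 2], [3, 4]], factor=3)
print(result)
```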

Code Validation Rules

The xplainable-preprocessing sandbox validates custom code using AST analysis before execution. The following restrictions apply:

Forbidden imports -- these modules cannot be imported: os, sys, subprocess, socket, shutil, pathlib, http, urllib, requests, importlib, ctypes, signal, multiprocessing, threading, asyncio, pickle, shelve, tempfile, glob, io, builtins, code, codeop

Forbidden function calls -- these functions cannot be called: exec, eval, __import__, open, compile, globals, locals, breakpoint, exit, quit, getattr, setattr, delattr

Allowed imports include standard data science libraries: numpy, pandas, sklearn, scipy, math, re, json, collections, itertools, functools, datetime, decimal, statistics, and others not in the forbidden list.

Available builtins in the restricted namespace include: True, False, None, int, float, str, bool, list, dict, tuple, set, frozenset, range, enumerate, zip, map, filter, sorted, reversed, len, min, max, sum, abs, round, isinstance, issubclass, type, super, property, staticmethod, classmethod, print, repr, hasattr, any, all, and common exception types.
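The import and call rules above can be illustrated with Python's standard `ast` module. This is a minimal sketch of the general technique, not the library's actual validator: the function name `find_violations` is hypothetical, the two sets are abbreviated subsets of the lists above, and a production check would need to cover more cases (for example, calls reached through aliases).

```python
import ast

# Abbreviated subsets of the forbidden lists, for illustration only.
FORBIDDEN_IMPORTS = {"os", "sys", "subprocess", "socket", "pickle"}
FORBIDDEN_CALLS = {"exec", "eval", "__import__", "open", "getattr", "setattr"}


def find_violations(code):
    """Return human-readable violations found in a source string."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            # "os.path" is rejected because its top-level package is "os".
            if module.split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append(f"line {node.lineno}: forbidden import '{module}'")
        # Only direct calls by bare name are detected in this sketch.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FORBIDDEN_CALLS
        ):
            violations.append(f"line {node.lineno}: forbidden call '{node.func.id}()'")
    return violations


safe_code = "import numpy as np\n\nclass T:\n    def fit(self, X, y=None):\n        return self\n"
unsafe_code = "import os.path\n\ndef leak():\n    return open('/etc/passwd').read()\n"

print(find_violations(safe_code))
print(find_violations(unsafe_code))
```

Because the source is only parsed and never executed, a check of this kind can run safely at compile time, before any custom code is loaded.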

Examples

Outlier Capper

Cap values at specified percentiles to handle outliers:

from xplainable_preprocessing.schema import StepSpec, PipelineSpec
from xplainable_preprocessing.compiler import compile_spec

spec = PipelineSpec(
    steps=[
        StepSpec(
            id="cap_outliers",
            type="custom",
            columns=["revenue", "cost"],
            params={
                "code": """
import numpy as np

class OutlierCapper:
    def __init__(self, lower_pct=0.01, upper_pct=0.99):
        self.lower_pct = lower_pct
        self.upper_pct = upper_pct
        self.bounds = {}

    def fit(self, X, y=None):
        for col in X.columns:
            self.bounds[col] = (
                float(np.nanpercentile(X[col], self.lower_pct * 100)),
                float(np.nanpercentile(X[col], self.upper_pct * 100))
            )
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            if col in self.bounds:
                lower, upper = self.bounds[col]
                X[col] = X[col].clip(lower, upper)
        return X
""",
                "class_name": "OutlierCapper",
                "lower_pct": 0.02,
                "upper_pct": 0.98
            }
        )
    ]
)

# Compile and use the pipeline
pipeline = compile_spec(spec)
pipeline.fit(df)
df_transformed = pipeline.transform(df)

Feature Interaction Creator

Create interaction features from existing columns:

interaction_step = StepSpec(
    id="create_interactions",
    type="custom",
    params={
        "code": """
import numpy as np

class InteractionCreator:
    def __init__(self, pairs=None):
        self.pairs = pairs or []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col_a, col_b in self.pairs:
            if col_a in X.columns and col_b in X.columns:
                X[f'{col_a}_x_{col_b}'] = X[col_a] * X[col_b]
                X[f'{col_a}_div_{col_b}'] = np.where(
                    X[col_b] != 0, X[col_a] / X[col_b], np.nan
                )
        return X
""",
        "class_name": "InteractionCreator",
        "pairs": [["revenue", "employees"], ["cost", "units"]]
    }
)

Cyclic Time Encoder

Encode cyclical time features (e.g., hour of day, day of week) using sine and cosine:

time_encoder_step = StepSpec(
    id="encode_time",
    type="custom",
    params={
        "code": """
import numpy as np

class CyclicTimeEncoder:
    def __init__(self, columns=None, max_values=None):
        self.columns = columns or []
        # Copy so that fit() never mutates the dict passed in via params
        self.max_values = dict(max_values or {})

    def fit(self, X, y=None):
        # Infer the period for any column without an explicit max value
        for col in self.columns:
            if col not in self.max_values:
                self.max_values[col] = int(X[col].max()) + 1
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            max_val = self.max_values[col]
            X[f'{col}_sin'] = np.sin(2 * np.pi * X[col] / max_val)
            X[f'{col}_cos'] = np.cos(2 * np.pi * X[col] / max_val)
        return X
""",
        "class_name": "CyclicTimeEncoder",
        "columns": ["hour", "day_of_week", "month"],
        "max_values": {"hour": 24, "day_of_week": 7, "month": 12}
    }
)
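The reason for encoding with both sine and cosine is that values at opposite ends of the raw range (hour 23 and hour 0, say) end up next to each other on the unit circle. The following standard-library sketch illustrates that property; the helper names `encode_cyclic` and `encoded_distance` are hypothetical and exist only for this illustration.

```python
import math


def encode_cyclic(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    return math.dist(encode_cyclic(a, period), encode_cyclic(b, period))


# Hours 23 and 0 are adjacent on the clock, although their raw difference is 23.
print(round(encoded_distance(23, 0, 24), 3))
# Hours 0 and 1 are also adjacent, so the encoded distance is the same.
print(round(encoded_distance(0, 1, 24), 3))
# Hours 0 and 12 sit opposite each other, which is the largest possible distance.
print(round(encoded_distance(0, 12, 24), 3))
```

Using a single sine column would not be enough, because two different hours can share the same sine value; the cosine column disambiguates them.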

Combining Custom and Built-in Steps

Custom transformers work alongside the built-in transformer registry. A pipeline can mix both:

from xplainable_preprocessing.schema import StepSpec, PipelineSpec
from xplainable_preprocessing.compiler import compile_spec

spec = PipelineSpec(
    steps=[
        # Built-in: standard scaling
        StepSpec(
            id="scale_numerics",
            type="StandardScaler",
            columns=["age", "income", "balance"]
        ),

        # Custom: domain-specific ratio creation
        StepSpec(
            id="financial_ratios",
            type="custom",
            params={
                "code": """
import numpy as np

class FinancialRatios:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['debt_to_income'] = np.where(
            X['income'] != 0, X['balance'] / X['income'], np.nan
        )
        return X
""",
                "class_name": "FinancialRatios"
            }
        ),

        # Built-in: drop columns
        StepSpec(
            id="drop_ids",
            type="DropColumnsTransformer",
            params={"columns": ["customer_id", "account_id"]}
        )
    ]
)

pipeline = compile_spec(spec)
pipeline.fit(train_df)
train_transformed = pipeline.transform(train_df)
test_transformed = pipeline.transform(test_df)

Column Scoping

When columns is specified on a StepSpec, the transformer is wrapped in a DataFrameColumnTransformer that applies the transformation only to those columns, leaving others untouched:

# This step only applies to the specified columns
StepSpec(
    id="log_transform",
    type="custom",
    columns=["revenue", "cost", "profit"],  # Only these columns are transformed
    params={
        "code": """
import numpy as np

class LogTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)
""",
        "class_name": "LogTransformer"
    }
)
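Conceptually, column scoping means the inner transformer only ever sees the selected columns, and its output is written back into a copy of the full frame. The sketch below shows that idea with pandas. `ColumnScopedWrapper` is a hypothetical stand-in written for this illustration; it is not the library's DataFrameColumnTransformer, whose implementation may differ.

```python
import numpy as np
import pandas as pd


class ColumnScopedWrapper:
    """Hypothetical wrapper: apply a transformer to a subset of columns."""

    def __init__(self, transformer, columns):
        self.transformer = transformer
        self.columns = list(columns)

    def fit(self, X, y=None):
        # The inner transformer is fitted on the scoped columns only.
        self.transformer.fit(X[self.columns], y)
        return self

    def transform(self, X):
        out = X.copy()
        # Write the transformed columns back; all other columns pass through.
        out[self.columns] = self.transformer.transform(X[self.columns])
        return out


class LogTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)


df = pd.DataFrame(
    {"revenue": [0.0, 9.0], "cost": [0.0, 99.0], "region": ["north", "south"]}
)
scoped = ColumnScopedWrapper(LogTransformer(), ["revenue", "cost"]).fit(df)
result = scoped.transform(df)
print(result)
```

Note that `LogTransformer` would fail on the string column `region` if it were applied to the whole frame, which is one practical reason to scope a numeric transformation.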

Pipeline Validation

Before compilation, you can validate your pipeline specification:

from xplainable_preprocessing.schema import PipelineSpec, validate_spec

spec = PipelineSpec(steps=[...])

# Validates:
# - All non-custom types exist in the registry
# - Custom steps have 'code' and 'class_name' in params
# - Step IDs are unique
validate_spec(spec)
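The three checks named in the comments above can be pictured with a small stand-alone function that works on plain dictionaries. This is only a sketch under stated assumptions: `check_spec` and the two-entry `registry` are hypothetical, and whether the real validate_spec raises an exception or returns a list of problems is not specified here.

```python
def check_spec(steps, registry):
    """Hypothetical re-implementation of the three checks, on plain dicts."""
    errors = []
    seen_ids = set()
    for step in steps:
        step_id, step_type = step["id"], step["type"]
        # Step IDs must be unique across the pipeline.
        if step_id in seen_ids:
            errors.append(f"duplicate step id: {step_id}")
        seen_ids.add(step_id)
        if step_type == "custom":
            # Custom steps need both reserved keys in params.
            params = step.get("params", {})
            for key in ("code", "class_name"):
                if key not in params:
                    errors.append(f"{step_id}: custom step is missing '{key}'")
        elif step_type not in registry:
            # Non-custom types must be known to the registry.
            errors.append(f"{step_id}: unknown type '{step_type}'")
    return errors


registry = {"StandardScaler", "DropColumnsTransformer"}  # illustrative subset
steps = [
    {"id": "scale", "type": "StandardScaler"},
    {"id": "scale", "type": "custom", "params": {"code": "class T: ..."}},
    {"id": "mystery", "type": "NotARealTransformer"},
]
errors = check_spec(steps, registry)
for error in errors:
    print(error)
```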

Custom code is further validated via AST analysis during compile_spec() to ensure no forbidden imports or calls are present.

Best Practices

Writing Custom Transformers

Always return self from fit() to support method chaining and pipeline compatibility.
Always copy the input DataFrame in transform() with X.copy() to avoid modifying the original data.
Handle missing values -- check for NaN/None in your transformation logic.
Keep code self-contained -- avoid relying on external state or files. Only import from allowed modules.
Use constructor parameters for configuration rather than hardcoding values, so they can be adjusted via params.
Test your transformer independently before embedding it in a pipeline specification.
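Testing a transformer on its own can be as simple as a handful of assertions on a small DataFrame. The sketch below exercises the OutlierCapper class from the example above in plain Python, with no pipeline involved; the test data (the values 0 to 100) is an assumption chosen so that the 2nd and 98th percentiles are easy to predict.

```python
import numpy as np
import pandas as pd


class OutlierCapper:
    def __init__(self, lower_pct=0.01, upper_pct=0.99):
        self.lower_pct = lower_pct
        self.upper_pct = upper_pct
        self.bounds = {}

    def fit(self, X, y=None):
        for col in X.columns:
            self.bounds[col] = (
                float(np.nanpercentile(X[col], self.lower_pct * 100)),
                float(np.nanpercentile(X[col], self.upper_pct * 100)),
            )
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            if col in self.bounds:
                lower, upper = self.bounds[col]
                X[col] = X[col].clip(lower, upper)
        return X


# Values 0..100, so the 2nd and 98th percentiles fall at about 2 and 98.
df = pd.DataFrame({"revenue": np.arange(101, dtype=float)})
capper = OutlierCapper(lower_pct=0.02, upper_pct=0.98)

assert capper.fit(df) is capper                    # fit() returns self
capped = capper.transform(df)
assert df["revenue"].max() == 100.0                # the input was not modified
assert np.isclose(capped["revenue"].min(), 2.0)    # low tail capped
assert np.isclose(capped["revenue"].max(), 98.0)   # high tail capped
print("all checks passed")
```

Once the class passes checks like these, its source can be pasted into the "code" string of a StepSpec unchanged.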

Security Restrictions

Custom transformer code runs in a restricted sandbox. Do not attempt to use file I/O, network access, subprocess execution, or other system-level operations. The AST validator will reject code containing forbidden imports or function calls.
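For intuition about what a restricted namespace means in Python, the sketch below executes source code with a hand-picked set of builtins and a guarded import hook. It illustrates the general technique only and is not the library's sandbox: `load_class`, `guarded_import`, and the abbreviated allow and deny lists are all hypothetical. Restricting builtins by itself is not a robust security boundary in Python, which is presumably why the library pairs it with AST validation.

```python
import builtins

# Abbreviated lists, for illustration only.
ALLOWED_BUILTIN_NAMES = (
    "int", "float", "str", "bool", "list", "dict", "tuple", "set", "range",
    "enumerate", "zip", "len", "min", "max", "sum", "abs", "round",
    "isinstance", "super", "print", "ValueError", "TypeError",
)
FORBIDDEN_ROOTS = {"os", "sys", "subprocess", "socket"}


def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Import hook that refuses modules whose top-level package is forbidden."""
    if name.split(".")[0] in FORBIDDEN_ROOTS:
        raise ImportError(f"import of {name!r} is not allowed")
    return builtins.__import__(name, globals, locals, fromlist, level)


def load_class(code, class_name):
    """Execute source with limited builtins and return the named class."""
    safe_builtins = {name: getattr(builtins, name) for name in ALLOWED_BUILTIN_NAMES}
    safe_builtins["__build_class__"] = builtins.__build_class__  # needed by `class`
    safe_builtins["__import__"] = guarded_import                 # needed by `import`
    namespace = {"__builtins__": safe_builtins, "__name__": "custom_transformer"}
    exec(code, namespace)
    return namespace[class_name]


SOURCE = '''
import math

class RootTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [math.sqrt(value) for value in X]

    def leak(self):
        return open("secrets.txt")
'''

cls = load_class(SOURCE, "RootTransformer")
print(cls().fit([4.0]).transform([4.0, 9.0]))
try:
    cls().leak()
except NameError as exc:
    # open() is simply not defined inside the restricted namespace.
    print("blocked:", exc)
```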

Next Steps

Ready for More Advanced Topics?

Explore rapid refitting for real-time model updates
Learn about partitioned models for segment-specific modeling
Check out XEvolutionaryNetwork for advanced weight optimization
