Version: v1.4.1

Classification - Customer Repurchase Window Prediction

Predicting customer repurchase behavior and timing using historical transaction data from an online retail business.

Dataset Source: Online Retail II UCI Dataset

Problem Type: Classification

Target Variable: Customer repurchase probability within specific time windows

Use Case: Customer retention strategies, inventory management, targeted marketing campaigns

Package Imports

import pandas as pd
import xplainable as xp
from xplainable.core.models import XClassifier
from xplainable.core.optimisation.bayesian import XParamOptimiser
from sklearn.model_selection import train_test_split
import requests
import json

# Additional imports specific to this example
import numpy as np
import datetime as dt
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# New refactored client import
from xplainable_client.client.client import XplainableClient
from xplainable_client.client.base import XplainableAPIError

!pip install xplainable
!pip install xplainable-client

Xplainable Cloud Setup

# Initialize Xplainable Cloud client using new refactored client
client = XplainableClient(
    api_key="",  # Add your API key from https://platform.xplainable.io/
    hostname="https://platform.xplainable.io"  # Optional, defaults to production
)

Data Loading and Exploration

Load the Online Retail II dataset and perform basic data exploration.

import pandas as pd
import requests
from io import BytesIO

def load_online_retail_ii():
    """
    Downloads the Online Retail II dataset directly from the UCI repository
    and returns a single DataFrame combining both sheets.
    """
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx"
    r = requests.get(url)
    r.raise_for_status()  # fail early if we got a bad status

    # Read both year sheets and concatenate them
    xls = pd.ExcelFile(BytesIO(r.content))
    df1 = pd.read_excel(xls, sheet_name="Year 2009-2010", parse_dates=["InvoiceDate"])
    df2 = pd.read_excel(xls, sheet_name="Year 2010-2011", parse_dates=["InvoiceDate"])
    df = pd.concat([df1, df2], ignore_index=True)

    # Cleanup: drop rows without a customer ID, keep only positive prices
    # and quantities, then compute the line amount
    df = df.dropna(subset=["Customer ID"])
    df = df[(df.Price > 0) & (df.Quantity > 0)].copy()
    df["Amount"] = df.Price * df.Quantity
    return df

# Usage
df = load_online_retail_ii()
df.head()
| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 30 |

The timeline below illustrates the core question the model answers: will a customer place another order within 30 days of a given purchase? Each row represents an individual customer (C1 to C4), and every blue dot marks one of their historical purchases. From each purchase, a magenta line extends 30 days, the evaluation window used to create the training label. When a follow-up order arrives inside that window, it is highlighted with a pink star. Purchases followed by a star are the positive cases (“repurchased”), while those without a star are negative. Stepping through these tracks shows how the dataset converts raw transactions into a binary outcome that the model can learn to predict.
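The labelling rule described above can be sketched on a toy example. This is only an illustration using the standard library: the customers, dates, and the `label_purchases` helper are hypothetical and are not part of the tutorial's pipeline, which builds the label with pandas further down.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=30)  # evaluation window, matching the 30-day label

# Hypothetical purchase dates for two illustrative customers
purchases = {
    "C1": [date(2011, 1, 5), date(2011, 1, 20), date(2011, 4, 1)],
    "C2": [date(2011, 2, 10)],
}

def label_purchases(dates):
    """Return (purchase_date, repurchased) pairs for one customer."""
    dates = sorted(dates)
    labelled = []
    for i, d in enumerate(dates):
        # Positive when any later purchase falls inside (d, d + WINDOW]
        repurchased = any(d < later <= d + WINDOW for later in dates[i + 1:])
        labelled.append((d, int(repurchased)))
    return labelled

labels = {cid: label_purchases(ds) for cid, ds in purchases.items()}
for cid, rows in labels.items():
    print(cid, rows)
```

Here only C1's first purchase is a positive case, because the follow-up order arrives 15 days later; every other purchase has no order inside its 30-day window.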

# --- 1. LOAD & CLEAN -------------------------------------------------
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], dayfirst=True, errors="coerce")

df = df.dropna(subset=["Customer ID", "InvoiceDate"])
df = df[(df["Price"] > 0) & (df["Quantity"] > 0)].copy()

df["Amount"] = df["Price"] * df["Quantity"]
df["InvoiceMonth"] = df["InvoiceDate"].dt.to_period("M")

1. Data Preprocessing

Data Preview and Initial Exploration

df.head()

| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount | InvoiceMonth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 83.4 | 2009-12 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81 | 2009-12 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81 | 2009-12 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085 | United Kingdom | 100.8 | 2009-12 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 30 | 2009-12 |

RFM Feature Engineering

# Sort by customer and invoice date
df_sorted = df.sort_values(["Customer ID", "InvoiceDate"])

# For each row, take the invoice date of the previous row for the same customer
# (rows are line items, so this can be an earlier line of the same invoice)
df_sorted["LastPurchase"] = (
    df_sorted.groupby("Customer ID")["InvoiceDate"].shift()
)

# Add InvoiceMonth and MonthEnd again (safe even if already set)
df_sorted["InvoiceMonth"] = df_sorted["InvoiceDate"].dt.to_period("M")
df_sorted["MonthEnd"] = df_sorted["InvoiceMonth"].dt.to_timestamp("M")

# Keep only the last purchase as of each month
last_purchase = (
    df_sorted.dropna(subset=["LastPurchase"])
    .groupby(["Customer ID", "InvoiceMonth"])["LastPurchase"]
    .max()
    .reset_index()
)

# Create the monthly feature matrix (grp)
grp = (
    df.groupby(["Customer ID", "InvoiceMonth"])
    .agg({
        "Invoice": "nunique",  # Frequency: number of distinct invoices
        "Quantity": "sum",     # Total quantity (renamed to DistinctItems below)
        "Amount": "sum",       # Monetary: total spend
        "Country": "first",    # Keep Country
    })
    .rename(columns={
        "Invoice": "Frequency",
        "Quantity": "DistinctItems",
        "Amount": "Monetary"
    })
    .reset_index()
)

# Ensure InvoiceMonth is period type
grp["InvoiceMonth"] = grp["InvoiceMonth"].astype("period[M]")

# Merge last purchase dates and calculate Recency
grp = grp.merge(last_purchase, on=["Customer ID", "InvoiceMonth"], how="left")
grp["MonthEnd"] = grp["InvoiceMonth"].dt.to_timestamp("M")
grp["Recency"] = (grp["MonthEnd"] - grp["LastPurchase"]).dt.days
grp.drop(columns=["LastPurchase"], inplace=True)

# Add Month and Quarter for time-based grouping or encoding
grp["Month"] = grp["InvoiceMonth"].dt.month
grp["Quarter"] = grp["InvoiceMonth"].dt.quarter
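To make the monthly aggregation concrete, here is a small sketch on made-up line items. It assumes pandas is installed; the toy rows are hypothetical, and `TotalQuantity` is an illustrative name for the summed quantity that the tutorial's feature matrix calls `DistinctItems`.

```python
import pandas as pd

# Hypothetical line items: two invoices in January, one in February
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 1],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-03", "2011-01-03", "2011-01-20", "2011-02-14"]
    ),
    "Quantity": [2, 1, 5, 3],
    "Amount": [10.0, 4.0, 25.0, 9.0],
})
toy["InvoiceMonth"] = toy["InvoiceDate"].dt.to_period("M")

# One row per customer-month, using named aggregation
monthly = (
    toy.groupby(["Customer ID", "InvoiceMonth"])
    .agg(
        Frequency=("Invoice", "nunique"),   # distinct invoices in the month
        TotalQuantity=("Quantity", "sum"),  # units purchased in the month
        Monetary=("Amount", "sum"),         # total spend in the month
    )
    .reset_index()
)
print(monthly)
```

The two January line items on invoice `A1` count once towards Frequency, which is why `nunique` is used on the invoice number rather than a row count.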

Build 30-day Repurchase Label

from pandas.tseries.offsets import Day

# Set the window size
DAYS = 30  # Window length in days; change to e.g. 60 or 90 if needed

# Step 1: Unique (Customer ID, InvoiceDate) combinations
invoice_dates = df[["Customer ID", "InvoiceDate"]].drop_duplicates().copy()

# Step 2: Function to check if there's a purchase within N days
def has_purchase_within_n_days(row):
    cid, date = row["Customer ID"], row["InvoiceDate"]
    future_txns = invoice_dates[
        (invoice_dates["Customer ID"] == cid) &
        (invoice_dates["InvoiceDate"] > date) &
        (invoice_dates["InvoiceDate"] <= date + Day(DAYS))
    ]
    return 1 if len(future_txns) > 0 else 0

# Step 3: Apply the function row-wise (can take 30s+ on large data)
invoice_dates[f"rebuy_{DAYS}d"] = invoice_dates.apply(has_purchase_within_n_days, axis=1)

# Step 4: Convert to monthly and aggregate to get the label
invoice_dates["InvoiceMonth"] = invoice_dates["InvoiceDate"].dt.to_period("M")

label = (
    invoice_dates.groupby(["Customer ID", "InvoiceMonth"])
    [f"rebuy_{DAYS}d"].max()
    .reset_index()
    .rename(columns={f"rebuy_{DAYS}d": f"will_rebuy_{DAYS}d"})
)

# Step 5: Merge with feature matrix
data = grp.merge(label, on=["Customer ID", "InvoiceMonth"], how="left")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data[f"will_rebuy_{DAYS}d"] = data[f"will_rebuy_{DAYS}d"].fillna(0).astype(int)
data[f"will_rebuy_{DAYS}d"].value_counts()

Out:

0    16500
1     9095
Name: will_rebuy_30d, dtype: int64
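The row-wise `apply` in Step 3 rescans the whole table for every purchase. One possible alternative, sketched here under the assumption that only the nearest future purchase matters (if the next purchase is outside the window, all later ones are too), is to shift the sorted dates within each customer. The `label_rebuy` helper and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

DAYS = 30

def label_rebuy(invoice_dates, days=DAYS):
    """Flag each unique (customer, date) that has a later purchase within `days`."""
    out = (
        invoice_dates[["Customer ID", "InvoiceDate"]]
        .drop_duplicates()
        .sort_values(["Customer ID", "InvoiceDate"])
        .reset_index(drop=True)
    )
    # Next distinct purchase timestamp for the same customer (NaT if none)
    next_date = out.groupby("Customer ID")["InvoiceDate"].shift(-1)
    gap = next_date - out["InvoiceDate"]
    # Comparisons against NaT evaluate to False, so last purchases get 0
    out[f"rebuy_{days}d"] = (gap <= pd.Timedelta(days=days)).astype(int)
    return out

# Hypothetical purchases for two customers
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-04-01", "2011-02-10"]
    ),
})
labelled = label_rebuy(toy)
print(labelled)
```

Because the dates are de-duplicated and sorted first, the gap to the next row is always strictly positive, which mirrors the `> date` and `<= date + Day(DAYS)` conditions used above.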

Train/Test Time-based Split

# --- 4. TIME-BASED SPLIT & MODEL (DYNAMIC DAYS, NO ONE-HOT) --------

data["Date"] = data["InvoiceMonth"].dt.to_timestamp()

train = data[data["Date"] < "2011-07-01"]
test = data[data["Date"] >= "2011-07-01"]

label_col = f"will_rebuy_{DAYS}d"

X_train = train.drop(columns=[label_col, "InvoiceMonth", "Date", "MonthEnd", "Customer ID"])
y_train = train[label_col]
X_test = test.drop(columns=[label_col, "InvoiceMonth", "Date", "MonthEnd", "Customer ID"])
y_test = test[label_col]
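X_train

| | Frequency | DistinctItems | Monetary | Country | Recency | Month | Quarter |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 26 | 113.50 | United Kingdom | 12.0 | 12 | 4 |
| 1 | 4 | 20 | 90.00 | United Kingdom | 16.0 | 1 | 1 |
| 2 | 1 | 5 | 27.05 | United Kingdom | 28.0 | 3 | 1 |
| 3 | 1 | 19 | 142.31 | United Kingdom | 1.0 | 6 | 2 |
| 4 | 1 | 74215 | 77183.60 | United Kingdom | 216.0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 25589 | 1 | 494 | 833.48 | United Kingdom | 10.0 | 8 | 3 |
| 25590 | 1 | 732 | 1071.61 | United Kingdom | 13.0 | 5 | 2 |
| 25591 | 2 | 508 | 892.60 | United Kingdom | 8.0 | 9 | 3 |
| 25592 | 1 | 187 | 381.50 | United Kingdom | 7.0 | 11 | 4 |
| 25593 | 1 | 488 | 765.28 | United Kingdom | 8.0 | 5 | 2 |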

2. Model Optimization

opt = XParamOptimiser()
params = opt.optimise(X_train, y_train)
Out:

100%|██████████| 30/30 [00:08<00:00, 3.60trial/s, best loss: -0.8776764727397712]

3. Model Training

model = XClassifier(**params)
model.fit(X_train, y_train)
Out:

<xplainable.core.ml.classification.XClassifier at 0x2a566e140>

4. Model Interpretability and Explainability

model.explain()

7. Model Testing

Hold-out Evaluation

model.evaluate(X_test, y_test)
Out:

{'confusion_matrix': [[4333, 25], [823, 1612]],
 'classification_report': {'0': {'precision': 0.8403801396431342,
   'recall': 0.9942634235888022,
   'f1-score': 0.9108681942400673,
   'support': 4358.0},
  '1': {'precision': 0.984728161270617,
   'recall': 0.6620123203285421,
   'f1-score': 0.7917485265225933,
   'support': 2435.0},
  'accuracy': 0.8751656116590608,
  'macro avg': {'precision': 0.9125541504568756,
   'recall': 0.8281378719586721,
   'f1-score': 0.8513083603813303,
   'support': 6793.0},
  'weighted avg': {'precision': 0.892122732409647,
   'recall': 0.8751656116590608,
   'f1-score': 0.868168887469561,
   'support': 6793.0}},
 'roc_auc': 0.8699849600395034,
 'neg_brier_loss': 0.8800487167761432,
 'log_loss': 0.4091278987464692,
 'cohen_kappa': 0.7074258976095472}
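As a sanity check, the headline metrics can be recomputed by hand from the confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, which is consistent with the reported supports (4333 + 25 = 4358 and 823 + 1612 = 2435).

```python
# Confusion matrix from the evaluation output above
# Assumed layout: rows = actual class, columns = predicted class
(tn, fp), (fn, tp) = [[4333, 25], [823, 1612]]

precision_1 = tp / (tp + fp)          # of predicted repurchasers, share that did
recall_1 = tp / (tp + fn)             # of actual repurchasers, share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision(1) = {precision_1:.4f}")
print(f"recall(1)    = {recall_1:.4f}")
print(f"f1(1)        = {f1_1:.4f}")
print(f"accuracy     = {accuracy:.4f}")
```

Read this way, the model rarely raises a false alarm (25 false positives) but misses roughly a third of the customers who do repurchase (823 of 2435), which is worth keeping in mind when the predictions drive retention campaigns.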

5. Model Persistence

# Create model using the new client's models service
try:
    model_id, version_id = client.models.create_model(
        model=model,
        model_name="Customer Repurchase - 30 Day Forecast",
        model_description="Predicts whether a customer will make another purchase within 30 days based on their recent order behaviour and RFM features.",
        x=X_train,
        y=y_train
    )
    print("Model created successfully!")
    print(f"Model ID: {model_id}")
    print(f"Version ID: {version_id}")
except XplainableAPIError as e:
    print(f"Error creating model: {e.message}")
    model_id, version_id = None, None

6. Model Deployment

# Deploy model using the new client's deployments service
try:
    deployment_response = client.deployments.deploy(model_version_id=version_id)
    deployment_id = deployment_response.deployment_id
    print("Model deployed successfully!")
    print(f"Deployment ID: {deployment_id}")
except XplainableAPIError as e:
    print(f"Error deploying model: {e.message}")
    deployment_id = None

# Activate deployment using the new client
try:
    client.deployments.activate_deployment(deployment_id)
    print("Deployment activated successfully!")
except XplainableAPIError as e:
    print(f"Error activating deployment: {e.message}")

# Generate deployment key using the new client
try:
    deploy_key = client.deployments.generate_deploy_key(
        deployment_id=deployment_id,
        description='Deployment API for Purchase Prediction',
        days_until_expiry=7
    )
    print(f"Deployment key generated: {str(deploy_key)[:20]}...")
except XplainableAPIError as e:
    print(f"Error generating deploy key: {e.message}")
    deploy_key = None

Generate Example Payload

# Choose how the example payload is built:
# 1 = ask the client to generate it, 2 = sample a row from the training data
option = 2

if option == 1:
    # Generate example payload using the new client
    try:
        body = client.deployments.generate_example_deployment_payload(deployment_id)
    except Exception:
        # Fall back to a sampled training row if the client call fails
        body = json.loads(train.drop(columns=[label_col, "InvoiceMonth", "Date", "MonthEnd", "Customer ID"]).sample(1).to_json(orient="records"))
else:
    body = json.loads(train.drop(columns=[label_col, "InvoiceMonth", "Date", "MonthEnd", "Customer ID"]).sample(1).to_json(orient="records"))

body
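For orientation, a payload built from a sampled training row is a JSON list holding one record whose keys are the `X_train` columns. The sketch below builds such a record by hand; the field values are invented for illustration, and `example_body` is a hypothetical name chosen so it does not overwrite the tutorial's `body`.

```python
import json

# Field names follow the X_train columns; the values are made up for illustration
example_record = {
    "Frequency": 2,
    "DistinctItems": 40,
    "Monetary": 250.0,
    "Country": "United Kingdom",
    "Recency": 12.0,
    "Month": 3,
    "Quarter": 1,
}

# to_json(orient="records") yields a list of records, so mirror that shape
example_body = [example_record]
payload = json.dumps(example_body)
print(payload)
```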

Call Inference Endpoint

# Make prediction request
if deploy_key:
    response = requests.post(
        url="https://inference.xplainable.io/v1/predict",
        headers={'api_key': str(deploy_key)},  # Convert deploy_key to string
        json=body
    )

    value = response.json()
    print("Prediction response:")
    print(value)
else:
    print("Deploy key not available, skipping prediction test")