
Instacart Market Basket Analysis: Leveraging Xplainable Classifier for Transparent Recommendations

Introduction

Welcome to our comprehensive walkthrough of the Instacart Market Basket Analysis Challenge on Kaggle. This challenge presents an opportunity for data enthusiasts to dive deep into the world of grocery shopping and unravel the patterns behind consumer purchase behaviour. Instacart, a prominent online grocery delivery platform, has provided a rich dataset of customer orders over time. The objective? To predict which previously purchased products will be in a user's next order.

As data scientists and technical experts, we understand that it is critical not just to make accurate predictions but also to be able to interpret and explain our models. This is where our approach comes in: the Xplainable algorithm, a machine learning algorithm designed to enhance the transparency and interpretability of its recommendations.

Enhancing Brand Trust through Two-Way Transparency

An integral aspect of our approach with the Xplainable algorithm is fostering two-way transparency between the recommender system and the customer. In traditional recommender systems, users are often left wondering why certain products are recommended to them. Our method provides the opportunity to bridge this gap by incorporating an explanatory feature, such as "users like yourself also purchase."

The Importance of Relatable Recommendations

By providing context such as "users like yourself also purchase," we achieve several key objectives:

  1. Enhanced User Engagement: When users understand the rationale behind recommendations, they are more likely to explore and accept these suggestions.
  2. Increased Personalisation: This approach reflects a deeper understanding of user behaviour and preferences, leading to more personalised shopping experiences.
  3. Greater Brand Trust and Loyalty: Transparency in recommendations fosters trust. When users feel that their needs are understood and catered to, it enhances their loyalty to the brand.
  4. Feedback Loop for Continuous Improvement: Such transparent systems encourage user feedback, providing valuable insights that can be used to further refine and improve the recommender system.

By implementing a two-way transparent model, we not only elevate the accuracy of our predictions but also enrich the user experience, instilling a sense of trust and reliability in the brand. In this walkthrough, we will explore how the Xplainable algorithm not only achieves high accuracy in predicting the next set of products for Instacart users but also provides clear insights into the 'why' behind its predictions.
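To make this concrete, below is a minimal sketch of how per-product explanations might be surfaced to a shopper. The explain_recommendation helper and the contribution values passed to it are hypothetical placeholders for illustration; in practice the contributions would come from the fitted model's explanation output.

# A minimal sketch: turning (hypothetical) per-feature contributions for a
# recommended product into a customer-facing explanation string.
def explain_recommendation(product_name, contributions, top_n=2):
    # Keep only the features that pushed the prediction up, strongest first
    positive = sorted(
        (item for item in contributions.items() if item[1] > 0),
        key=lambda item: item[1],
        reverse=True,
    )[:top_n]
    reasons = ', '.join(f"{name} ({value:+.2f})" for name, value in positive)
    return f"Users like yourself also purchase {product_name} (driven by: {reasons})"

# Hypothetical contribution values, for illustration only
print(explain_recommendation(
    "Organic Bananas",
    {"uxp_reorder_ratio": 0.31, "product_reorder_ratio": 0.12, "hod_with_most_orders": -0.04},
))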

import xplainable as xp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
import warnings
warnings.filterwarnings('ignore')

# Garbage Collector to free up memory
import gc
gc.enable()
print(f"This notebook was created using Xplainable version {xp.__version__}")
Out:

This notebook was created using Xplainable version 1.1.1

Load the datasets

You can download the Instacart Market Basket Analysis dataset at the following link: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data

After extracting the .zip file, load the datasets as below:

# Load the datasets
orders = pd.read_csv('./dataset/orders.csv')
order_products_train = pd.read_csv('./dataset/order_products__train.csv')
order_products_prior = pd.read_csv('./dataset/order_products__prior.csv')
products = pd.read_csv('./dataset/products.csv')
aisles = pd.read_csv('./dataset/aisles.csv')
departments = pd.read_csv('./dataset/departments.csv')

Inspecting orders dataset

# Get heads for the orders dataset
orders.head()
   order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0  2539329   1        prior     1             2          8                  nan
1  2398795   1        prior     2             3          7                  15
2  473747    1        prior     3             3          12                 21
3  2254736   1        prior     4             4          7                  29
4  431534    1        prior     5             4          15                 28
# Get shape of orders dataset
orders.shape
Out:

(3421083, 7)

# Get a brief descriptive info on orders
orders.info()
Out:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3421083 entries, 0 to 3421082

Data columns (total 7 columns):

# Column Dtype

--- ------ -----

0 order_id int64

1 user_id int64

2 eval_set object

3 order_number int64

4 order_dow int64

5 order_hour_of_day int64

6 days_since_prior_order float64

dtypes: float64(1), int64(5), object(1)

memory usage: 182.7+ MB

# Get missing values in orders dataset
orders.isnull().sum()
Out:

order_id 0

user_id 0

eval_set 0

order_number 0

order_dow 0

order_hour_of_day 0

days_since_prior_order 206209

dtype: int64

There are 206,209 missing values in the days_since_prior_order column. This matches the number of unique users in the dataset: each user's first order has no prior order, so the value is undefined there.
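As a quick sanity check (a sketch, assuming orders is loaded as above), we can confirm that every missing value comes from a user's first order:

# Every row with a missing days_since_prior_order should be a user's first order
first_orders = orders[orders['days_since_prior_order'].isnull()]
print((first_orders['order_number'] == 1).all())  # expected: True
print(first_orders['user_id'].nunique())          # expected: 206209, one per user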

Inspecting order_products_train dataset

# Get heads for the order_products_train dataset
order_products_train.head()
   order_id  product_id  add_to_cart_order  reordered
0  1         49302       1                  1
1  1         11109       2                  1
2  1         10246       3                  0
3  1         49683       4                  0
4  1         43633       5                  1
# Get shape of order_products_train dataset
order_products_train.shape
Out:

(1384617, 4)

# Get missing values in order_products_train dataset
order_products_train.isnull().sum()
Out:

order_id 0

product_id 0

add_to_cart_order 0

reordered 0

dtype: int64

Inspecting order_products_prior dataset

# Get head for order_products_prior
order_products_prior.head()
   order_id  product_id  add_to_cart_order  reordered
0  2         33120       1                  1
1  2         28985       2                  1
2  2         9327        3                  0
3  2         45918       4                  1
4  2         30035       5                  0
# Get shape for order_products_prior
order_products_prior.shape
Out:

(32434489, 4)

# Get missing value for order_products_prior
order_products_prior.isnull().sum()
Out:

order_id 0

product_id 0

add_to_cart_order 0

reordered 0

dtype: int64

Inspecting products dataset

# Get heads for the products dataset
products.head()
   product_id  product_name                                        aisle_id  department_id
0  1           Chocolate Sandwich Cookies                          61        19
1  2           All-Seasons Salt                                    104       13
2  3           Robust Golden Unsweetened Oolong Tea                94        7
3  4           Smart Ones Classic Favorites Mini Rigatoni Wit...   38        1
4  5           Green Chile Anytime Sauce                           5         13
#  Get shape for products
products.shape
Out:

(49688, 4)

# Get missing value for products
products.isnull().sum()
Out:

product_id 0

product_name 0

aisle_id 0

department_id 0

dtype: int64

Inspecting aisles dataset

# Get head for aisles
aisles.head()
   aisle_id  aisle
0  1         prepared soups salads
1  2         specialty cheeses
2  3         energy granola bars
3  4         instant foods
4  5         marinades meat preparation
# Get shape for aisles
aisles.shape
Out:

(134, 2)

# Check missing values in aisle
aisles.isnull().sum()
Out:

aisle_id 0

aisle 0

dtype: int64

Inspecting departments dataset

# Get head for departments
departments.head()
   department_id  department
0  1              frozen
1  2              other
2  3              bakery
3  4              produce
4  5              alcohol
# Get shape for departments
departments.shape
Out:

(21, 2)

# Check missing values in departments
departments.isnull().sum()
Out:

department_id 0

department 0

dtype: int64

Exploratory Data Analysis (EDA)

# Get the number of orders on each day of the week
plt.figure(figsize=(6,4))
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()

Order counts are highest on days 0 and 1 (presumably the weekend, since the day encoding is not documented), when people are at home and have more time to shop.

# Get the number of orders for each hour in a day
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()

Peak ordering hours are between 9 AM and 5 PM; far fewer orders are placed before 7 AM or after 11 PM.

# Analyse how long users wait before placing their next order
plt.figure(figsize=(10,6))
sns.countplot(x='days_since_prior_order', data=orders)
plt.xticks(rotation=90)
plt.show()

The largest spike is at 30 days, so a monthly cycle is the most common reorder habit (the column is capped at 30, so longer gaps also land there). The second-largest spike is at 7 days, reflecting weekly shoppers.
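As a quick check of this reading (a sketch, assuming orders as loaded above), the spikes can be read straight off the column:

# The most frequent gap values should be 30 days and 7 days
print(orders['days_since_prior_order'].value_counts().head(3))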

# Merge products and departments dataframes and then merging with aisles
products_details = pd.merge(left=products,right=departments,how="left")
products_details = pd.merge(left=products_details,right=aisles,how="left")
products_details.head()
   product_id  product_name                                        aisle_id  department_id  department  aisle
0  1           Chocolate Sandwich Cookies                          61        19             snacks      cookies cakes
1  2           All-Seasons Salt                                    104       13             pantry      spices seasonings
2  3           Robust Golden Unsweetened Oolong Tea                94        7              beverages   tea
3  4           Smart Ones Classic Favorites Mini Rigatoni Wit...   38        1              frozen      frozen meals
4  5           Green Chile Anytime Sauce                           5         13             pantry      marinades meat preparation
# Get the number of products in each department
plt.figure(figsize=(10,6))
g=sns.countplot(x="department",data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()

Personal care is the department with the most distinct products, followed by snacks.

# Get the top 10 aisles with the most products
plt.figure(figsize=(10,6))
top10_aisle=products_details["aisle"].value_counts()[:10].plot(kind="bar",title='Aisles')

The missing aisle has the most products; it acts as the catch-all for products without an assigned aisle.

# Merge order_products_train and products dataframes
order_products_name_train = pd.merge(left=order_products_train,right=products.loc[:,["product_id","product_name"]],on="product_id",how="left")
# Get the top 10 products most frequently reordered by customers
common_Products=order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="product_name", y="index", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Banana is the most frequently reordered product, followed by Bag of Organic Bananas.

# Merge order_products_name_train and products_details dataframes
order_products_name_train = pd.merge(left=order_products_name_train,right=products_details.loc[:,["product_id","aisle","department"]],on="product_id",how="left")
# Get the aisles with the highest number of sales
common_aisle=order_products_name_train["aisle"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="aisle", y="index", data=common_aisle.head(10),palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

The fresh vegetables aisle has the highest number of sales, followed by fresh fruits.

# Get the departments with the highest number of sales
common_department = order_products_name_train["department"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="department", y="index", data=common_department, palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Produce and dairy eggs are the top two departments by number of sales.

# Get the products which were reordered in each order from the train data.
train_data_reordered = order_products_train.groupby(["order_id","reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()
   order_id  product_id
0  1         [49302, 11109, 43633, 22035]
1  36        [19660, 43086, 46620, 34497, 48679, 46979]
2  38        [21616]
3  96        [20574, 40706, 27966, 24489, 39275]
4  98        [8859, 19731, 43654, 13176, 4357, 37664, 34065...

Feature Engineering

# Delete all unnecessary dataframes
del products_details
del order_products_name_train
del common_Products
del common_aisle
del train_data_reordered
gc.collect()
Out:

0

# Optionally keep only 15% of users, as this is a huge dataset and training on all of it takes a long time.
#orders = orders.loc[orders.user_id.isin(orders.user_id.drop_duplicates().sample(frac=0.15, random_state=101))]
# Convert string columns into the category dtype.
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')
# Merge orders and order_products_prior datasets to get prior order dataset
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()
   order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  product_id  add_to_cart_order  reordered
0  2539329   1        prior     1             2          8                  nan                     196         1                  0
1  2539329   1        prior     1             2          8                  nan                     14084       2                  0
2  2539329   1        prior     1             2          8                  nan                     12427       3                  0
3  2539329   1        prior     1             2          8                  nan                     26088       4                  0
4  2539329   1        prior     1             2          8                  nan                     26405       5                  0

Create features using user_id

# Create a feature based on number of orders placed by each user.
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()
   user_id  num_of_orders_for_each_user
0  1        10
1  2        14
2  3        12
3  4        5
4  5        4
# Get the average number of products per order placed by each user.

# Get the total number of products in each order.
total_products_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()

# Create a dataframe with the average number of products purchased per order by each user
avg_number_of_products_per_order = total_products_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()

# Delete the unnecessary total_products_per_order dataframe
del [total_products_per_order]
gc.collect()

# Get head of avg_number_of_products_per_order
avg_number_of_products_per_order.head()
   user_id  avg_no_prd_per_order
0  1        5.9
1  2        13.9286
2  3        7.33333
3  4        3.6
4  5        9.25
from scipy import stats
import pandas as pd
import numpy as np

def calculate_mode(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Check if mode_result.mode is an array with at least one element
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

# Create a dataframe for the day of the week where users order most
order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()

# Get head of the dataset
order_most_dow.head()
   user_id  dow_with_most_orders
0  1        4
1  2        2
2  3        0
3  4        4
4  5        3
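As an aside, the same aggregation can be written without scipy. Here is an equivalent pandas-only sketch; value_counts().idxmax() returns the most frequent value in a Series:

# Equivalent pandas-only aggregation of each user's most frequent order day
order_most_dow_alt = (
    prior_orders.groupby('user_id')['order_dow']
    .agg(lambda s: s.value_counts().idxmax())
    .to_frame('dow_with_most_orders')
    .reset_index()
)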
# calculate_mode_hour is identical to calculate_mode; it is kept as a
# separate name for the hour-of-day aggregation below.
def calculate_mode_hour(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Check if mode_result.mode is an array with at least one element
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

# Create a dataframe for the hour of day where users have ordered most
order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode_hour).to_frame('hod_with_most_orders').reset_index()

# Display the first few rows of the dataframe
order_most_hod.head()

   user_id  hod_with_most_orders
0  1        7
1  2        9
2  3        16
3  4        15
4  5        18
# Get a dataframe with the reorder ratio of each user, stored as float16
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()
   user_id  reorder_ratio
0  1        0.694824
1  2        0.476807
2  3        0.625
3  4        0.055542
4  5        0.378418
# Merging all the created user based features into the users dataset one by one.
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')

users.head()
   user_id  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio
0  1        10                           5.9                     5.9                     4                     7                     0.694824
1  2        14                           13.9286                 13.9286                 2                     9                     0.476807
2  3        12                           7.33333                 7.33333                 0                     16                    0.625
3  4        5                            3.6                     3.6                     4                     15                    0.055542
4  5        4                            9.25                    9.25                    3                     18                    0.378418

(The duplicated avg_no_prd_per_order_x/_y columns carry identical values; they are an artifact of the merge cell having been run twice in this session.)
# Delete unnecessary dataframes
del [avg_number_of_products_per_order,order_most_dow,order_most_hod,user_reorder_ratio]
gc.collect()
Out:

0

Create features using product_id

#Get a dataframe to show the number of times a product has been purchased.
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()

   product_id  purchased_num_of_times
0  1           1852
1  2           90
2  3           277
3  4           329
4  5           15
#Get a dataframe for the reordered ratio for each product
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()
   product_id  product_reorder_ratio
0  1           0.613391
1  2           0.133333
2  3           0.732852
3  4           0.446809
4  5           0.6
# Get the average add-to-cart position for each product.
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()
   product_id  product_avg_cart_addition
0  1           5.80184
1  2           9.88889
2  3           6.41516
3  4           9.5076
4  5           6.46667
# Merge all the created features based on product_id into the purchased_num_of_times dataset.
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')

#Delete unwanted dataframes.
del [product_reorder_ratio, add_to_cart]
gc.collect()
Out:

0

# Get head of purchased_num_of_times
purchased_num_of_times.head()
   product_id  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition
0  1           1852                    0.613391               5.80184
1  2           90                      0.133333               9.88889
2  3           277                     0.732852               6.41516
3  4           329                     0.446809               9.5076
4  5           15                      0.6                    6.46667

Create features using user_id and product_id

# Create a user-product dataframe showing the number of times each user has bought each product.
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()
   user_id  product_id  uxp_times_bought
0  1        196         10
1  1        10258       9
2  1        10326       1
3  1        12427       10
4  1        13032       3
# Create a dataframe with the order number in which each user bought a product for the first time.
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()
   user_id  product_id  first_order_number
0  1        196         1
1  1        10258       2
2  1        10326       5
3  1        12427       1
4  1        13032       2
#Get total number of orders by each user
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()
   user_id  total_orders
0  1        10
1  2        14
2  3        12
3  4        5
4  5        4
# Merge total_orders and user_product_data dataframes to create a new dataframe user_product_df
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()
   user_id  total_orders  product_id  first_order_number
0  1        10            196         1
1  1        10            10258       2
2  1        10            10326       5
3  1        10            12427       1
4  1        10            13032       2
# Calculate the order range.
# The +1 makes the range include the order in which the product was first purchased
# (e.g. 10 total orders with a first purchase in order 2 gives 10 - 2 + 1 = 9).
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()
   user_id  total_orders  product_id  first_order_number  order_range
0  1        10            196         1                   10
1  1        10            10258       2                   9
2  1        10            10326       5                   6
3  1        10            12427       1                   10
4  1        10            13032       2                   9
# Create a dataframe showing the number of times each user has bought each product.
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()
   user_id  product_id  times_bought
0  1        196         10
1  1        10258       9
2  1        10326       1
3  1        12427       10
4  1        13032       3
# Merging number_of_times with user_product_df
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()
   user_id  product_id  times_bought  total_orders  first_order_number  order_range
0  1        196         10            10            1                   10
1  1        10258       9             10            2                   9
2  1        10326       1             10            5                   6
3  1        12427       10            10            1                   10
4  1        13032       3             10            2                   9
# Calculate the reorder ratio for each user-product pair
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()
   user_id  product_id  times_bought  total_orders  first_order_number  order_range  uxp_reorder_ratio
0  1        196         10            10            1                   10           1
1  1        10258       9             10            2                   9            1
2  1        10326       1             10            5                   6            0.166667
3  1        12427       10            10            1                   10           1
4  1        13032       3             10            2                   9            0.333333
# Drop the intermediate columns from the uxp_ratio dataframe.
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()
   user_id  product_id  uxp_reorder_ratio
0  1        196         1
1  1        10258       1
2  1        10326       0.166667
3  1        12427       1
4  1        13032       0.333333
#Merge uxp_ratio with user_product_data.
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')

# Delete all unnecessary datasets
del [product_first_order_num, number_of_times,user_product_df,total_orders, uxp_ratio]
gc.collect()
Out:

0

# Get head for user_product_data
user_product_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio
0  1        196         10                1
1  1        10258       9                 1
2  1        10326       1                 0.166667
3  1        12427       10                1
4  1        13032       3                 0.333333
# Create a column order_number_back that numbers each user's orders backwards from their most recent order
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform('max') - prior_orders.order_number + 1
prior_orders.head()
   order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  product_id  add_to_cart_order  reordered  order_number_back
0  2539329   1        prior     1             2          8                  nan                     196         1                  0          10
1  2539329   1        prior     1             2          8                  nan                     14084       2                  0          10
2  2539329   1        prior     1             2          8                  nan                     12427       3                  0          10
3  2539329   1        prior     1             2          8                  nan                     26088       4                  0          10
4  2539329   1        prior     1             2          8                  nan                     26405       5                  0          10
# Keep only each user's last three orders (order_number_back <= 3).
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()
    order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  product_id  add_to_cart_order  reordered  order_number_back
38  3108588   1        prior     8             1          14                 14                      12427       1                  1          3
39  3108588   1        prior     8             1          14                 14                      196         2                  1          3
40  3108588   1        prior     8             1          14                 14                      10258       3                  1          3
41  3108588   1        prior     8             1          14                 14                      25133       4                  1          3
42  3108588   1        prior     8             1          14                 14                      46149       5                  0          3
#Get the products bought by users in the last 3 orders.
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()
   user_id  product_id  uxp_last_three
0  1        196         3
1  1        10258       3
2  1        12427       3
3  1        13032       1
4  1        25133       3
#Get the ratio of the products bought in the last 3 orders.
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()
   user_id  product_id  uxp_last_three  uxp_ratio_last_three
0  1        196         3               1
1  1        10258       3               1
2  1        12427       3               1
3  1        13032       1               0.333333
4  1        25133       3               1
# Merge the last-three-orders features into user_product_data.
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')

# Delete unwanted dataframes
del [last_three_order, temp_df]
gc.collect()

Out:

0

# Get the head of the updated user_product_data dataframe
user_product_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three
0  1        196         10                1                  3               1
1  1        10258       9                 1                  3               1
2  1        10326       1                 0.166667           nan             nan
3  1        12427       10                1                  3               1
4  1        13032       3                 0.333333           1               0.333333
# Check any missing values in user_product_data columns
user_product_data.isnull().sum()
Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 8382738

uxp_ratio_last_three 8382738

dtype: int64

# Fill the NaN values with 0 in user_product_data.
user_product_data.fillna(0, inplace=True)
# Confirm filling of missing values in user_product_data columns
user_product_data.isnull().sum()
Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

dtype: int64

Create final dataframe for engineered features

# Merge user_product_data with users, then merge the result with purchased_num_of_times to build the featured_engineered_data dataset
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')

# Delete unnecessary dataframes.
del [users, user_product_data, purchased_num_of_times]
gc.collect()

# Get head of featured_engineered_data
featured_engineered_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition
0  1        196         10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177
1  1        10258       9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749
2  1        10326       1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911
3  1        12427       10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004
4  1        13032       3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277
# Check if any missing values found in featured_engineered_data dataframe columns
featured_engineered_data.isnull().sum()
Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

num_of_orders_for_each_user 0

avg_no_prd_per_order_x 0

avg_no_prd_per_order_y 0

dow_with_most_orders 0

hod_with_most_orders 0

reorder_ratio 0

purchased_num_of_times 0

product_reorder_ratio 0

product_avg_cart_addition 0

dtype: int64

Creating Train and Test datasets

Create training dataset

# Keep only the future orders from all customers, i.e. the train and test orders
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]
# merge the orders_future with featured_engineered_data to create a final dataframe.
final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  eval_set  order_id
0  1        196         10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177                    train     1187899
1  1        10258       9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749                    train     1187899
2  1        10326       1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911                     train     1187899
3  1        12427       10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004                    train     1187899
4  1        13032       3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277                    train     1187899
# Create the training dataset.
train_data = final_data[final_data.eval_set=='train']
train_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  eval_set  order_id
0  1        196         10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177                    train     1187899
1  1        10258       9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749                    train     1187899
2  1        10326       1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911                     train     1187899
3  1        12427       10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004                    train     1187899
4  1        13032       3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277                    train     1187899
# Merge order_products_train into the train_data dataframe.
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  eval_set  order_id  reordered
0  1        196         10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177                    train     1187899   1
1  1        10258       9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749                    train     1187899   1
2  1        10326       1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911                     train     1187899   nan
3  1        12427       10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004                    train     1187899   nan
4  1        13032       3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277                    train     1187899   1
# Check if any missing values found in train_data dataframe columns
train_data.isnull().sum()
Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

num_of_orders_for_each_user 0

avg_no_prd_per_order_x 0

avg_no_prd_per_order_y 0

dow_with_most_orders 0

hod_with_most_orders 0

reorder_ratio 0

purchased_num_of_times 0

product_reorder_ratio 0

product_avg_cart_addition 0

eval_set 0

order_id 0

reordered 7645837

dtype: int64

# Fill the missing values in reordered column with 0
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()
   user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  eval_set  order_id  reordered
0  1        196         10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177                    train     1187899   1
1  1        10258       9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749                    train     1187899   1
2  1        10326       1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911                     train     1187899   0
3  1        12427       10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004                    train     1187899   0
4  1        13032       3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277                    train     1187899   1
# Set user_id and product_id as the index of the train_data
train_data = train_data.set_index(['user_id', 'product_id'])
# Drop unwanted columns from train_data dataframe
train_data = train_data.drop(['eval_set', 'order_id'], axis=1)
# Get head of train_data
train_data.head()
(index: user_id, product_id)  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  reordered
(1, 196)                      10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       35791                   0.77648                3.72177                    1
(1, 10258)                    9                 1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       1946                    0.713772               4.27749                    1
(1, 10326)                    1                 0.166667           0               0                     10                           5.9                     5.9                     4                     7                     0.694824       5526                    0.652009               4.1911                     0
(1, 12427)                    10                1                  3               1                     10                           5.9                     5.9                     4                     7                     0.694824       6476                    0.740735               4.76004                    0
(1, 13032)                    3                 0.333333           1               0.333333              10                           5.9                     5.9                     4                     7                     0.694824       3751                    0.657158               5.62277                    1

Create testing dataset

#Keep only the future orders labelled as test
test_data = final_data[final_data.eval_set=='test']
test_data.head()
     user_id  product_id  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  eval_set  order_id
120  3        248         1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          6371                    0.400251               10.6208                    test      2774568
121  3        1005        1                 0.333333           1               0.333333              12                           7.33333                 7.33333                 0                     16                    0.625          463                     0.440605               9.49892                    test      2774568
122  3        1819        3                 0.333333           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          2424                    0.492162               9.28754                    test      2774568
123  3        7503        1                 0.1                0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          12474                   0.553551               9.54738                    test      2774568
124  3        8021        1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          27864                   0.591157               8.82285                    test      2774568
# Set user_id and product_id as the index of test_data
test_data = test_data.set_index(['user_id', 'product_id'])

# Drop unwanted columns from the test_data dataframe
test_data = test_data.drop(['eval_set', 'order_id'], axis=1)

# Get head of test_data
test_data.head()
(index: user_id, product_id)  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition
(3, 248)                      1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          6371                    0.400251               10.6208
(3, 1005)                     1                 0.333333           1               0.333333              12                           7.33333                 7.33333                 0                     16                    0.625          463                     0.440605               9.49892
(3, 1819)                     3                 0.333333           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          2424                    0.492162               9.28754
(3, 7503)                     1                 0.1                0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          12474                   0.553551               9.54738
(3, 8021)                     1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          27864                   0.591157               8.82285
# Delete unnecessary dataframes
del [final_data, orders_future, products, order_products_train]
gc.collect()
Out:

0

Building model using Xplainable Classifier

Build the X_train and y_train datasets

# Define X_train and y_train
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered

Optimise on 1,000,000 rows

from xplainable.core.optimisation.bayesian import XParamOptimiser
opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])
Out:

100%|████████| 30/30 [00:45<00:00, 1.53s/trial, best loss: -0.8107380819617527]

Use the params from XParamOptimiser to fit the Xplainable classifier

from xplainable.core.models import XClassifier

model = XClassifier(**params)
model.fit(X_train, y_train)
Out:

<xplainable.core.ml.classification.XClassifier at 0x2a426f760>

Model Explanations for Item Recommender

Plot model explanations using the .explain() method

model.explain()

In the Feature Importances section, we see each feature with its corresponding importance value. The feature uxp_reorder_ratio has the highest importance, indicating that it is the most influential factor in the model's predictions.

On the Contributions side, uxp_reorder_ratio also makes a notable contribution to the model's output. Green bars represent positive contributions and red bars negative ones; the length and colour of the bars show that uxp_reorder_ratio has a strong positive influence on the model's predictions.

Create model predictions using a threshold cutoff

NOTE: Adjust the threshold cutoff to see its impact on the result.

# Get model predictions with threshold of 0.21 probability
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]
Out:

array([0, 0, 0, 0, 0])

train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]
Out:

array([1, 1, 0, 1, 0])

# Import evaluation metrics
from sklearn.metrics import f1_score, classification_report
# Get f1 score and classification report.
# Note: sklearn expects (y_true, y_pred); the arguments below are swapped.
# Binary F1 is symmetric, so the score is unchanged, but in the report the
# precision and recall columns are interchanged relative to convention.
print(f'f1 Score: {f1_score(train_prediction, y_train)}')
print(classification_report(train_prediction, y_train))
Out:

f1 Score: 0.41808008442097677

              precision    recall  f1-score   support

           0       0.92      0.94      0.93   7520042
           1       0.45      0.39      0.42    954619

    accuracy                           0.88   8474661
   macro avg       0.69      0.66      0.67   8474661
weighted avg       0.87      0.88      0.87   8474661
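The 0.21 cutoff above was chosen by inspection. A quick sweep (a sketch, reusing the fitted model, X_train and y_train from above) shows how the training F1 responds to the cutoff:

import numpy as np
from sklearn.metrics import f1_score

# Score once, then reuse the probabilities across candidate cutoffs
probs = model.predict_proba(X_train)
for threshold in np.arange(0.15, 0.35, 0.05):
    preds = (probs >= threshold).astype(int)
    print(f'threshold={threshold:.2f}  f1={f1_score(y_train, preds):.4f}')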

#Create the prediction as a new column in test_data
test_data['prediction'] = test_prediction
test_data.head()
(index: user_id, product_id)  uxp_times_bought  uxp_reorder_ratio  uxp_last_three  uxp_ratio_last_three  num_of_orders_for_each_user  avg_no_prd_per_order_x  avg_no_prd_per_order_y  dow_with_most_orders  hod_with_most_orders  reorder_ratio  purchased_num_of_times  product_reorder_ratio  product_avg_cart_addition  prediction
(3, 248)                      1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          6371                    0.400251               10.6208                    0
(3, 1005)                     1                 0.333333           1               0.333333              12                           7.33333                 7.33333                 0                     16                    0.625          463                     0.440605               9.49892                    0
(3, 1819)                     3                 0.333333           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          2424                    0.492162               9.28754                    0
(3, 7503)                     1                 0.1                0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          12474                   0.553551               9.54738                    0
(3, 8021)                     1                 0.090909           0               0                     12                           7.33333                 7.33333                 0                     16                    0.625          27864                   0.591157               8.82285                    0
# Reset the index and create a dataset called final_df
final_df = test_data.reset_index()

# Keeping only the required columns to create submission file
final_df = final_df[['product_id', 'user_id', 'prediction']]

# Collect garbage and show head of final_df
gc.collect()
final_df.head()
   product_id  user_id  prediction
0  248         3        0
1  1005        3        0
2  1819        3        0
3  7503        3        0
4  8021        3        0

Creating the Kaggle submission file (optional)

After developing a robust model and ensuring its performance on our validation set, the next step is to prepare our submission for Kaggle. Although this step is optional, it is a good practice to understand how to create a submission file that adheres to the competition's requirements.

To create a submission file, you typically need to:

  1. Ensure that your model has been trained with the full training set or with an appropriate cross-validation strategy.
  2. Generate predictions for the test set provided by Kaggle.
  3. Format these predictions into a CSV file that matches the submission format of the competition, which usually involves setting the index to an id column and including a column with your predictions.
  4. Use the to_csv() function from pandas with the appropriate parameters, such as index=False if the index should not be included in the submission file, to save your dataframe to a CSV file.
  5. Upload this CSV file to the Kaggle competition's submission page to see how your model performs on the unseen test set.

See the specific steps for the Kaggle upload below:

# Create a new dataframe orders_test
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()
     user_id  order_id
38   3        2.77457e+06
44   4        329954
53   6        1.52801e+06
96   11       1.37694e+06
102  12       1.35684e+06
# Merge final_df with the orders_test dataframe
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()
   product_id  user_id  prediction  order_id
0  248         3        0           2.77457e+06
1  1005        3        0           2.77457e+06
2  1819        3        0           2.77457e+06
3  7503        3        0           2.77457e+06
4  8021        3        0           2.77457e+06
# Remove the user_id column and convert product_id to integer
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()
   product_id  prediction  order_id
0  248         0           2.77457e+06
1  1005        0           2.77457e+06
2  1819        0           2.77457e+06
3  7503        0           2.77457e+06
4  8021        0           2.77457e+06
# Create a dictionary mapping each order to the product IDs whose reordered prediction is 1
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction == 1:
        try:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        except KeyError:
            final_dict[row.order_id] = str(row.product_id)

# Orders with no predicted reorders get the literal string 'None'
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'

# Collect garbage
gc.collect()
Out:

31
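The same mapping can also be built without an explicit loop; here is an equivalent groupby-based sketch (assuming final_df as above):

# Group the predicted products per order and join them into space-separated strings
predicted = final_df[final_df['prediction'] == 1]
final_dict_alt = (
    predicted.groupby('order_id')['product_id']
    .apply(lambda s: ' '.join(s.astype(str)))
    .to_dict()
)
# Orders with no predicted reorders map to the literal string 'None'
for order in final_df['order_id'].unique():
    final_dict_alt.setdefault(order, 'None')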

# Convert the final_dict dictionary into a final submission dataframe called submission_df
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')

# Reset index
submission_df.reset_index(inplace=True)

#Set column names
submission_df.columns = ['order_id', 'products']

# Get head
submission_df.head()
   order_id  products
0  2774568   17668 18599 21903 39190 43961 47766
1  1528013   21903 38293
2  1376945   8309 13176 14947 20383 27959 33572 35948 44632
3  1356845   5746 7076 8239 10863 11520 13176 14992
4  2161313   196 10441 11266 12427 14715 27839 37710
# Create the final submission file
submission_df.to_csv('sub.csv', index=False, header=True)
submission_df.shape
Out:

(75000, 2)