
Retail Customer Analysis


Model Overview

We will perform an exploratory analysis of an online retail store's data in order to understand its customers. Assume we own a retail store that has been doing incredibly well and we want to scale the business efficiently and effectively. To do this, we need to understand our customers and tailor our marketing or expansion efforts to specific subsets of them. The main business question here is: "How can I scale my current business, which is already doing well, in the most effective way?" A supporting sub-question is: "What type of marketing initiative should we run for each customer in order to get the best ROI?"



About the Dataset:

This is a transactional data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer.


Business Problem:

Problem Statement: The goal is to answer the following questions:


  1. Can we categorize customers into segments based on their buying patterns? (Customer Segmentation)

  2. Can we predict which kinds of items they will buy in the future based on their segment? (Prediction)



ML Problem Mapping:


  1. Given a dataset of transactions (the Online Retail dataset from the UCI Machine Learning Repository), find the customer segments, i.e. clusters. (Find common patterns and group customers accordingly.)

  2. Predict what to display to which group of users.



Input: We will be using e-commerce data containing one year of purchases for about 4,000 customers.

Output:
 The first goal is to categorize our consumer base into appropriate customer segments. The second goal is to predict the purchases a customer will make in the current and following year based on their first purchases.



Importing libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import itertools
import nltk
import warnings
warnings.filterwarnings('ignore')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import wordcloud
%matplotlib inline
plt.style.use('fivethirtyeight')

 


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!

The following code connects Google Colab to Google Drive and navigates to the folder where the dataset is stored.


import os
os.getcwd()

 


'/content'

# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')#,force_remount=True)
root_path = 'gdrive/My Drive/Customer-Analytics-master/'

os.chdir(root_path)
os.getcwd()

'/content/gdrive/My Drive/Customer-Analytics-master'


Loading the Dataset:


data = pd.read_excel('Online Retail.xlsx', dtype={'StockCode':str})
data.head(3)
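If the default Excel engine is unavailable (newer pandas versions no longer read .xlsx through xlrd), installing openpyxl and naming it explicitly usually works; this is an environment assumption, not part of the original run:

# Explicitly use the openpyxl engine for .xlsx files.
data = pd.read_excel('Online Retail.xlsx', dtype={'StockCode': str}, engine='openpyxl')
data.head(3)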




data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

data.shape

(541909, 8)


Data Preprocessing:


# Checking for null values.
info = pd.DataFrame(data=data.isnull().sum()).T.rename(index={0:'Null values'})
info = info.append(pd.DataFrame(data=data.isnull().sum()/data.shape[0] * 100).T.rename(index={0:'% Null values'}))
info
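Note (an environment assumption, not part of the original run): DataFrame.append was removed in pandas 2.0, so on a recent pandas the same summary can be built with pd.concat:

# Equivalent null-value summary without DataFrame.append (removed in pandas 2.0).
info = pd.concat([
    pd.DataFrame(data.isnull().sum()).T.rename(index={0: 'Null values'}),
    pd.DataFrame(data.isnull().sum() / data.shape[0] * 100).T.rename(index={0: '% Null values'}),
])
info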


Since we don't have a CustomerID for about 25% of the rows, and we cannot assign them an arbitrary ID, we will remove them.


# Removing null values
data.dropna(axis=0, subset = ['CustomerID'], inplace=True)
info = pd.DataFrame(data=data.isnull().sum()).T.rename(index={0:'Null values'})
info = info.append(pd.DataFrame(data=data.isnull().sum()/data.shape[0] * 100).T.rename(index={0:'% Null values'}))
info


# Checking for Duplicates :
data.duplicated().sum()

5225

# Removing duplicate entries :
data.drop_duplicates(inplace=True)
data.duplicated().sum()

0


Exploratory Data Analysis:


plt.figure(figsize=(14,6))
plt.bar(list(data.groupby(['Country']).groups.keys()), data.groupby(['Country'])['CustomerID'].count())
plt.xticks(rotation = 90, fontsize = 14)
plt.title("Number of transactions per country")
plt.ylabel("No. of transactions")
plt.xlabel("Country")
plt.show()




info = pd.DataFrame(data = data.groupby(['Country'])['InvoiceNo'].nunique(), index=data.groupby(['Country']).groups.keys()).T
info



Observations:



  1. The UK accounts for the vast majority of transactions (19,857).

  2. The fewest transactions (only 1 each) come from countries such as Brazil and RSA.




# StockCode feature ->
# We will see how many different products were sold during the period covered by the data.
print(len(data['StockCode'].value_counts()))

3684

# Transaction feature
# We will see how many different transactions were made.
print(len(data['InvoiceNo'].value_counts()))

22190

# CustomerID feature
# We will see how many different customers there are.
print(len(data['CustomerID'].value_counts()))

4372

pd.DataFrame({'products': len(data['StockCode'].value_counts()),
              'transactions': len(data['InvoiceNo'].value_counts()),
              'customers': len(data['CustomerID'].value_counts())},
             index = ['Quantity'])



There are roughly 22k transactions but only about 4.4k customers and 3.7k products. It seems that some orders were placed and then cancelled, that some customers bought items multiple times, or that multiple items were bought in a single transaction.


Checking the number of items bought in a single transaction:

df = data.groupby(['CustomerID', 'InvoiceNo'], as_index=False)['InvoiceDate'].count()
df = df.rename(columns = {'InvoiceDate':'Number of products'})
df[:10].sort_values('CustomerID')



There are customers who purchase only one item per transaction and others who purchase many items per transaction. There are also some cancelled orders; these are marked with a 'C' at the beginning of the invoice number.


Counting the number of cancelled transactions:


df['orders cancelled'] = df['InvoiceNo'].apply(lambda x: int('C' in str(x)))
df.head()


# Printing number of orders cancelled ->
print("Number of orders cancelled {}/{} ({:.2f}%)".format(df['orders cancelled'].sum(), df.shape[0], df['orders cancelled'].sum()/ df.shape[0] * 100))

Number of orders cancelled 3654/22190 (16.47%)


Handling Cancelled Values:


# Looking at cancelled transactions in the original data.
data.sort_values('CustomerID')[:5]



We see that when an order is cancelled, a new transaction is created with a different InvoiceNo and a negative Quantity, while every other field stays the same. We can use this to remove the cancelled orders.


Checking for discounted products:


df = data[data['Description'] == 'Discount']
df.head()


So there are some discount transactions too, but they appear to be cancellations.


Checking whether every order that has been cancelled has a counterpart:


df = data[(data['Quantity']<0) & (data['Description']!='Discount')][['CustomerID','Quantity','StockCode','Description','UnitPrice']]
df.head()




for index, col in df.iterrows():
    if data[(data['CustomerID'] == col[0]) & (data['Quantity'] == -col[1]) & (data['Description'] == col[2])].shape[0] == 0:
        print(index, df.loc[index])
        print("There are some transanctions for which counterpart does not exist")
        break

154 CustomerID                              15311
Quantity                                       -1
StockCode                                  35004C
Description        SET OF 3 COLOURED FLYING DUCKS
UnitPrice                                    4.65
Name: 154, dtype: object
There are some transanctions for which counterpart does not exist

We found that there are some cancellations for which no counterpart exists.
Possible reasons: the original order was placed before the period covered by the dataset, the cancellation happened to match a counterpart that was already consumed, or the entry is simply an error.


Removing cancelled orders:


df_cleaned = data.copy(deep=True)
df_cleaned['QuantityCancelled'] = 0
entry_to_remove = []; doubtfull_entry = []

for index, col in data.iterrows():
    if (col['Quantity'] > 0) or (col['Description'] == 'Discount'): continue
    df_test = data[(data['CustomerID'] == col['CustomerID']) & (data['StockCode'] == col['StockCode']) &
                   (data['InvoiceDate'] < col['InvoiceDate']) & (data['Quantity'] > 0)].copy()

    # Cancellation without a counterpart: doubtful, since it may be an error
    # or the original order may predate the dataset.
    if (df_test.shape[0] == 0):
        doubtfull_entry.append(index)

    # Cancellation with a single counterpart.
    elif (df_test.shape[0] == 1):
        index_order = df_test.index[0]
        df_cleaned.loc[index_order, 'QuantityCancelled'] = -col['Quantity']
        entry_to_remove.append(index)

    # Several counterparts exist: take the latest one large enough to absorb the cancelled quantity.
    elif (df_test.shape[0] > 1):
        df_test.sort_index(axis = 0, ascending=False, inplace=True)
        for ind, val in df_test.iterrows():
            if val['Quantity'] < -col['Quantity']: continue
            df_cleaned.loc[ind, 'QuantityCancelled'] = -col['Quantity']
            entry_to_remove.append(index)
            break

print("Entry to remove {}".format(len(entry_to_remove)))
print("Doubtfull Entry {}".format(len(doubtfull_entry)))

print("Entry to remove {}".format(len(entry_to_remove)))
print("Doubtfull Entry {}".format(len(doubtfull_entry)))

Entry to remove 7521
Doubtfull Entry 1226

# Deleting these entries :
df_cleaned.drop(entry_to_remove, axis=0, inplace=True)
df_cleaned.drop(doubtfull_entry, axis=0, inplace=True)



We will now look at the StockCode feature, in particular the special transaction codes:


list_special_codes = df_cleaned[df_cleaned['StockCode'].str.contains('^[a-zA-Z]+', regex = True)]['StockCode'].unique()
list_special_codes

array(['POST', 'D', 'C2', 'M', 'BANK CHARGES', 'PADS', 'DOT'],
dtype=object)

for code in list_special_codes:
    print("{:<17} -> {:<35}".format(code, df_cleaned[df_cleaned['StockCode'] == code]['Description'].values[0]))

POST              -> POSTAGE
D                 -> Discount
C2                -> CARRIAGE
M                 -> Manual
BANK CHARGES      -> Bank Charges
PADS              -> PADS TO MATCH ALL CUSHIONS
DOT               -> DOTCOM POSTAGE

df_cleaned['QuantityCancelled'] = np.nan_to_num(df_cleaned['QuantityCancelled'])
df_cleaned.head()


We see that the same transaction is split across multiple rows, one per item. For example, for invoice number 536365 above, the customer bought several different items and each item got its own row. We will need to merge these, so we add a TotalPrice feature for each row.


Adding the TotalPrice feature:


df_cleaned['TotalPrice'] = df_cleaned['UnitPrice'] * (df_cleaned['Quantity'] - df_cleaned['QuantityCancelled'])
df_cleaned.sort_values('CustomerID')[:5]




Now we sum the individual line items and group them by invoice number, removing the problem of duplicate rows for the same order:


temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index=False)['TotalPrice'].sum()
basket_price = temp.rename(columns = {'TotalPrice': 'Basket Price'})

df_cleaned['InvoiceDate_int'] = df_cleaned['InvoiceDate'].astype('int64')
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index=False)['InvoiceDate_int'].mean()
df_cleaned.drop('InvoiceDate_int', axis = 1, inplace=True)
basket_price.loc[:, 'InvoiceDate'] = pd.to_datetime(temp['InvoiceDate_int'])

basket_price = basket_price[basket_price['Basket Price'] > 0]
basket_price.sort_values('CustomerID')[:6]

 

Plotting the purchases made:


price_range = [0, 50, 100, 200, 500, 1000, 5000, 50000]
count_price = []
for i, price in enumerate(price_range):
    if i == 0: continue
    val = basket_price[(basket_price['Basket Price'] < price) &
                       (basket_price['Basket Price'] > price_range[i-1])]['Basket Price'].count()
    count_price.append(val)

plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(11, 6))
colors = ['yellowgreen', 'gold', 'wheat', 'c', 'violet', 'royalblue', 'firebrick']
labels = ["{}<.<{}".format(price_range[i-1], s) for i, s in enumerate(price_range) if i != 0]
sizes = count_price
explode = [0.0 if sizes[i] < 100 else 0.0 for i in range(len(sizes))]
ax.pie(sizes, explode = explode, labels = labels, colors = colors,
       autopct = lambda x: '{:1.0f}%'.format(x) if x > 1 else '',
       shadow = False, startangle = 0)
ax.axis('equal')
f.text(0.5, 1.01, "Distribution of order amounts", ha = 'center', fontsize = 18)
plt.show()




Analyzing product Description:


is_noun = lambda pos: pos[:2] == 'NN'

def keywords_inventory(dataframe, colonne = 'Description'):
    import nltk
    stemmer = nltk.stem.SnowballStemmer("english")
    keywords_roots = dict()    # root -> set of words sharing that root
    keywords_select = dict()   # root -> shortest associated keyword
    category_keys = []
    count_keywords = dict()
    icount = 0

    for s in dataframe[colonne]:
        if pd.isnull(s): continue
        lines = s.lower()
        tokenized = nltk.word_tokenize(lines)
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

        for t in nouns:
            t = t.lower(); racine = stemmer.stem(t)
            if racine in keywords_roots:
                keywords_roots[racine].add(t)
                count_keywords[racine] += 1
            else:
                keywords_roots[racine] = {t}
                count_keywords[racine] = 1

    # For each root, keep the shortest word that maps to it.
    for s in keywords_roots.keys():
        if len(keywords_roots[s]) > 1:
            min_length = 1000
            for k in keywords_roots[s]:
                if len(k) < min_length:
                    clef = k; min_length = len(k)

            category_keys.append(clef)
            keywords_select[s] = clef
        else:
            category_keys.append(list(keywords_roots[s])[0])
            keywords_select[s] = list(keywords_roots[s])[0]

    print("Number of keywords in the variable '{}': {}".format(colonne, len(category_keys)))
    return category_keys, keywords_roots, keywords_select, count_keywords

df_produits = pd.DataFrame(data['Description'].unique()).rename(columns = {0:"Description"})
keywords, keywords_roots, keywords_select, count_keywords = keywords_inventory(df_produits)

Number of keywords in the variable 'Description': 1483

# Plotting keywords vs frequency graph :
list_products = []
for k, v in count_keywords.items():
    word = keywords_select[k]
    list_products.append([word, v])

liste = sorted(list_products, key = lambda x: x[1], reverse=True)

plt.rc('font', weight='normal')
fig, ax = plt.subplots(figsize=(7, 25))
y_axis = [i[1] for i in liste[:125]]
x_axis = [k for k, i in enumerate(liste[:125])]
x_label = [i[0] for i in liste[:125]]
plt.xticks(fontsize=15)
plt.yticks(fontsize=13)
plt.yticks(x_axis, x_label)
plt.xlabel("Number of occurrences", fontsize = 18, labelpad = 10)
ax.barh(x_axis, y_axis, align='center')
ax = plt.gca()
ax.invert_yaxis()

plt.title("Word Occurrence", bbox={'facecolor':'k', 'pad':5}, color='w', fontsize = 25)
plt.show()




# Preserving important words :
list_products = []
for k, v in count_keywords.items():
    word = keywords_select[k]
    if word in ['pink', 'blue', 'tag', 'green', 'orange']: continue
    if len(word) < 3 or v < 13: continue
    list_products.append([word, v])

list_products.sort(key = lambda x: x[1], reverse=True)
print("Number of preserved words : ", len(list_products))

Number of preserved words :  193



Describing every product in terms of words present in the description:



  1. We will only use the preserved words; this is essentially a binary bag of words.

  2. We convert this into a product matrix with products as rows and words as columns. A cell contains 1 if the product's description contains that word, and 0 otherwise.

  3. We will use this matrix to categorize the products.

  4. We will add a mean-price feature so that the groups stay balanced.




threshold = [0, 1, 2, 3, 5, 10]

# Getting the descriptions.
liste_produits = df_cleaned['Description'].unique()

# Creating the product x word matrix.
X = pd.DataFrame()
for key, occurence in list_products:
    X.loc[:, key] = list(map(lambda x: int(key.upper() in x), liste_produits))

# Adding one column per mean-price range.
label_col = []
for i in range(len(threshold)):
    if i == len(threshold) - 1:
        col = '.>{}'.format(threshold[i])
    else:
        col = '{}<.<{}'.format(threshold[i], threshold[i+1])
    label_col.append(col)
    X.loc[:, col] = 0

for i, prod in enumerate(liste_produits):
    prix = df_cleaned[df_cleaned['Description'] == prod]['UnitPrice'].mean()
    j = 0
    while prix > threshold[j]:
        j += 1
        if j == len(threshold):
            break
    X.loc[i, label_col[j-1]] = 1

print("{:<8} {:<20} \n".format('range', 'number of products') + 20*'-')
for i in range(len(threshold)):
if i == len(threshold)-1:
col = '.>{}'.format(threshold[i])
else:
col = '{}<.<{}'.format(threshold[i],threshold[i+1])
print("{:<10} {:<20}".format(col, X.loc[:, col].sum()))

range    number of products
--------------------
0<.<1      964
1<.<2      1009
2<.<3      673
3<.<5      606
5<.<10     470
.>10       156



Clustering:
K-means:


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
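As a quick illustration of the metric (a toy example on synthetic data, not part of the retail pipeline; make_blobs and the chosen cluster counts are assumptions for demonstration only):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs: clustering with a sensible k gives a
# silhouette score close to 1, while a poor choice of k gives a lower score.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    print(k, round(silhouette_score(X_toy, labels), 3))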


matrix = X.values

# Tuning the number of clusters with the average silhouette score:
for n_clusters in range(3, 10):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    sil_avg = silhouette_score(matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  3 The average silhouette_score is :  0.10158702596012364
For n_clusters : 4 The average silhouette_score is : 0.1268004588393788
For n_clusters : 5 The average silhouette_score is : 0.14708700459493795
For n_clusters : 6 The average silhouette_score is : 0.14329241182453895
For n_clusters : 7 The average silhouette_score is : 0.15026667240832906
For n_clusters : 8 The average silhouette_score is : 0.16136085168920045
For n_clusters : 9 The average silhouette_score is : 0.12901394677787018

# Choosing the number of clusters as 5 and re-fitting
# until the silhouette score exceeds 0.145:
n_clusters = 5
sil_avg = -1
while sil_avg < 0.145:
    kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    sil_avg = silhouette_score(matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  5 The average silhouette_score is :  0.14740815062347604

# Printing number of elements in each cluster :
pd.Series(clusters).value_counts()

2    1009
4     964
1     673
0     626
3     606
dtype: int64



Analyzing the 5 clusters :


def graph_component_silhouette(n_clusters, lim_x, mat_size, sample_silhouette_values, clusters):
    import matplotlib as mpl
    mpl.rc('patch', edgecolor = 'dimgray', linewidth = 1)

    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(8, 8)
    ax1.set_xlim([lim_x[0], lim_x[1]])
    ax1.set_ylim([0, mat_size + (n_clusters + 1) * 10])
    y_lower = 10

    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[clusters == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, alpha = 0.8)

        ax1.text(-0.03, y_lower + 0.5 * size_cluster_i, str(i), color = 'red', fontweight = 'bold',
                 bbox = dict(facecolor = 'white', edgecolor = 'black', boxstyle = 'round, pad = 0.3'))

        y_lower = y_upper + 10

# Plotting the intra-cluster silhouette distances.
from sklearn.metrics import silhouette_samples
sample_silhouette_values = silhouette_samples(matrix, clusters)
graph_component_silhouette(n_clusters, [-0.07, 0.33], len(X), sample_silhouette_values, clusters)




Analysis using wordcloud:

Checking which words are most common in the clusters.


liste = pd.DataFrame(liste_produits)
liste_words = [word for (word, occurance) in list_products]

occurance = [dict() for _ in range(n_clusters)]

# Creating the data for the word clouds.
for i in range(n_clusters):
    liste_cluster = liste.loc[clusters == i]
    for word in liste_words:
        if word in ['art', 'set', 'heart', 'pink', 'blue', 'tag']: continue
        occurance[i][word] = sum(liste_cluster.loc[:, 0].str.contains(word.upper()))

# Code for printing the word clouds.
from random import randint
import random

def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    # 'tone' is a global set in the plotting loop below.
    h = int(360.0 * tone / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(70, 120)) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)

def make_wordcloud(liste, increment):
    ax1 = fig.add_subplot(4, 2, increment)
    words = dict()
    trunc_occurances = liste[0:150]
    for s in trunc_occurances:
        words[s[0]] = s[1]

    wc = wordcloud.WordCloud(width=1000, height=400, background_color='lightgrey', max_words=1628, relative_scaling=1,
                             color_func = random_color_func, normalize_plurals=False)
    wc.generate_from_frequencies(words)
    ax1.imshow(wc, interpolation="bilinear")
    ax1.axis('off')
    plt.title('cluster n{}'.format(increment-1))

fig = plt.figure(1, figsize=(14, 14))
color = [0, 160, 130, 95, 280, 40, 330, 110, 25]
for i in range(n_clusters):
    list_cluster_occurences = occurance[i]
    tone = color[i]
    liste = []
    for key, value in list_cluster_occurences.items():
        liste.append([key, value])
    liste.sort(key = lambda x: x[1], reverse = True)
    make_wordcloud(liste, i+1)



Observations:



  1. Cluster number 2 contains the items related to decoration and gifts.

  2. Cluster number 4 contains luxury items.

  3. Words like "vintage" are common to most of the clusters.



Dimensionality Reduction:
PCA:

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(matrix)
pca_samples = pca.transform(matrix)

# Checking the amount of variance explained :
fig, ax = plt.subplots(figsize=(14, 5))
sns.set(font_scale=1)
plt.step(range(matrix.shape[1]), pca.explained_variance_ratio_.cumsum(), where = 'mid', label = 'Cumulative Variance Explained')
sns.barplot(np.arange(1, matrix.shape[1] + 1), pca.explained_variance_ratio_, alpha = 0.5, color = 'g',
            label = 'Individual Variance Explained')
plt.xlim(0, 100)
plt.xticks(rotation = 45, fontsize = 14)
ax.set_xticklabels([s if int(s.get_text())%2 == 0 else '' for s in ax.get_xticklabels()])

plt.ylabel("Explained Variance", fontsize = 14)
plt.xlabel("Principal Components", fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()


We need more than 100 Principal Components to explain more than 90 % of the variance.
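To make that claim concrete, a short check (a sketch, assuming `pca` is the object fitted in the cell above):

# Smallest number of principal components whose cumulative explained
# variance ratio reaches 90%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Components needed for 90% of the variance:", n_components_90)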


Generating Customer Segments/Categories:
We will use the already generated product categories and create a new feature indicating which category each product belongs to.


corresp = dict()
for key, val in zip(liste_produits, clusters):
    corresp[key] = val

df_cleaned['categ_product'] = df_cleaned.loc[:, 'Description'].map(corresp)
df_cleaned[['InvoiceNo', 'Description', 'categ_product']][:10]




# Creating 5 new features containing the amount spent in a single transaction on each category of product.
for i in range(5):
    col = 'categ_{}'.format(i)
    df_temp = df_cleaned[df_cleaned['categ_product'] == i]
    price_temp = df_temp['UnitPrice'] * (df_temp['Quantity'] - df_temp['QuantityCancelled'])
    price_temp = price_temp.apply(lambda x: x if x > 0 else 0)
    df_cleaned.loc[:, col] = price_temp
    df_cleaned[col].fillna(0, inplace = True)

df_cleaned[['InvoiceNo', 'Description', 'categ_product', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']][:10]




A single order is split across multiple entries, so we aggregate them into baskets:


# Sum of purchases per user and order.
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)['TotalPrice'].sum()
basket_price = temp.rename(columns={'TotalPrice': 'Basket Price'})

# Amount spent on each product category in each basket.
for i in range(5):
    col = "categ_{}".format(i)
    temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)[col].sum()
    basket_price.loc[:, col] = temp[col]

# Date of the order.
df_cleaned['InvoiceDate_int'] = df_cleaned['InvoiceDate'].astype('int64')
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)['InvoiceDate_int'].mean()
df_cleaned.drop('InvoiceDate_int', axis = 1, inplace=True)
basket_price.loc[:, 'InvoiceDate'] = pd.to_datetime(temp['InvoiceDate_int'])

# Selecting entries with basket price > 0.
basket_price = basket_price[basket_price['Basket Price'] > 0]
basket_price.sort_values('CustomerID', ascending=True)[:5]


basket_price['InvoiceDate'].min()

Timestamp('2010-12-01 08:26:00')

basket_price['InvoiceDate'].max()

Timestamp('2011-12-09 12:50:00')

basket_price['InvoiceDate'].mean()

Timestamp('2011-07-01 17:32:29.703417600')



Time Based Splitting:


import datetime
pd.to_datetime('2011-10-1')

Timestamp('2011-10-01 00:00:00')

# 'set_entrainment' (French for training) holds everything before 2011-10-01;
# the remaining ~2 months are kept aside as the test window.
set_entrainment = basket_price[basket_price['InvoiceDate'] < pd.to_datetime('2011-10-1')]
set_test = basket_price[basket_price['InvoiceDate'] >= pd.to_datetime('2011-10-1')]
basket_price = set_entrainment.copy(deep = True)



Grouping Orders:

We will gather information about every customer: how much they purchase, their total number of orders, etc.


transanctions_per_user = basket_price.groupby(by=['CustomerID'])['Basket Price'].agg(['count', 'min', 'max', 'mean', 'sum'])

# Percentage of each customer's total spend that falls in each product category.
for i in range(5):
    col = 'categ_{}'.format(i)
    transanctions_per_user.loc[:, col] = basket_price.groupby(by=['CustomerID'])[col].sum() / transanctions_per_user['sum'] * 100

transanctions_per_user.reset_index(drop = False, inplace = True)
basket_price.groupby(by=['CustomerID'])['categ_0'].sum()
transanctions_per_user.sort_values('CustomerID', ascending = True)[:5]




# Generating two new variables - days since first purchase and days since last purchase.
last_date = basket_price['InvoiceDate'].max().date()

first_registration = pd.DataFrame(basket_price.groupby(by=['CustomerID'])['InvoiceDate'].min())
last_purchase = pd.DataFrame(basket_price.groupby(by=['CustomerID'])['InvoiceDate'].max())

test = first_registration.applymap(lambda x:(last_date - x.date()).days)
test2 = last_purchase.applymap(lambda x:(last_date - x.date()).days)

transanctions_per_user.loc[:, 'LastPurchase'] = test2.reset_index(drop = False)['InvoiceDate']
transanctions_per_user.loc[:, 'FirstPurchase'] = test.reset_index(drop = False)['InvoiceDate']

transanctions_per_user[:5]



We need to focus on customers who placed only one order; our objective is to target these customers in a way that retains them.


n1 = transanctions_per_user[transanctions_per_user['count'] == 1].shape[0]
n2 = transanctions_per_user.shape[0]
print("No. of Customers with single purchase : {:<2}/{:<5} ({:<2.2f}%)".format(n1, n2, n1/n2*100))

No. of Customers with single purchase : 1445/3608  (40.05%)



Building Customer Segments:


list_cols = ['count', 'min', 'max', 'mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
selected_customers = transanctions_per_user.copy(deep=True)
matrix = selected_customers[list_cols].values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(matrix)
print("Variable Mean Values: \n" + 90*'-' + '\n', scaler.mean_)
scaled_matrix = scaler.transform(matrix)

Variable Mean Values: 
------------------------------------------------------------------------------------------
[ 3.62305987 259.93189634 556.26687999 377.06036244 23.21847344
21.19884856 25.22916919 16.37327913 13.98907929]

pca = PCA()
pca.fit(scaled_matrix)
pca_samples = pca.transform(scaled_matrix)

# Checking the amount of variance explained :
fig, ax = plt.subplots(figsize=(14, 5))
sns.set(font_scale=1)
plt.step(range(matrix.shape[1]), pca.explained_variance_ratio_.cumsum(), where = 'mid', label = 'Cumulative Variance Explained')
sns.barplot(np.arange(1, matrix.shape[1] + 1), pca.explained_variance_ratio_, alpha = 0.5, color = 'g',
            label = 'Individual Variance Explained')
plt.xlim(0, 10)
plt.xticks(rotation = 45, fontsize = 14)
ax.set_xticklabels([s if int(s.get_text())%2 == 0 else '' for s in ax.get_xticklabels()])

plt.ylabel("Explained Variance", fontsize = 14)
plt.xlabel("Principal Components", fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()


# Tuning the number of clusters with the average silhouette score:
for n_clusters in range(3, 21):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    sil_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  3 The average silhouette_score is :  0.16032080193530113
For n_clusters : 4 The average silhouette_score is : 0.15228322459107096
For n_clusters : 5 The average silhouette_score is : 0.16383272664364962
For n_clusters : 6 The average silhouette_score is : 0.17381160467028156
For n_clusters : 7 The average silhouette_score is : 0.18797211677153328
For n_clusters : 8 The average silhouette_score is : 0.19882636788256774
For n_clusters : 9 The average silhouette_score is : 0.20525645349822272
For n_clusters : 10 The average silhouette_score is : 0.21189996039374637
For n_clusters : 11 The average silhouette_score is : 0.21620594900368645
For n_clusters : 12 The average silhouette_score is : 0.18508142436046335
For n_clusters : 13 The average silhouette_score is : 0.1865944849916417
For n_clusters : 14 The average silhouette_score is : 0.18784125711480013
For n_clusters : 15 The average silhouette_score is : 0.19036282087608622
For n_clusters : 16 The average silhouette_score is : 0.19244359111998907
For n_clusters : 17 The average silhouette_score is : 0.1848920893685197
For n_clusters : 18 The average silhouette_score is : 0.1844801089644844
For n_clusters : 19 The average silhouette_score is : 0.18214979474920676
For n_clusters : 20 The average silhouette_score is : 0.18787461874056635

# Choosing the number of clusters as 10 and re-fitting
# until the silhouette score exceeds 0.208:
n_clusters = 10
sil_avg = -1
while sil_avg < 0.208:
    kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    sil_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  10 The average silhouette_score is :  0.21107665673451478

n_clusters = 10
kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 100)
kmeans.fit(scaled_matrix)
clusters_clients = kmeans.predict(scaled_matrix)
silhouette_avg = silhouette_score(scaled_matrix, clusters_clients)
print("Silhouette Score : {:<.3f}".format(silhouette_avg))

Silhouette Score : 0.212

# Looking at clusters :
pd.DataFrame(pd.Series(clusters_clients).value_counts(), columns=['Number of Clients']).T


There is a large difference in the cluster sizes; we will analyze these clusters further.


sample_silhouette_values = silhouette_samples(scaled_matrix, clusters_clients)

graph_component_silhouette(n_clusters, [-0.15, 0.55], len(scaled_matrix), sample_silhouette_values, clusters_clients)



From the graph above we can be reasonably confident that the clusters are well separated.


Now we need to learn the customers' habits. To do that, we add a variable indicating the cluster each customer belongs to:


selected_customers.loc[:, 'cluster'] = clusters_clients
merged_df = pd.DataFrame()
for i in range(n_clusters):
    test = pd.DataFrame(selected_customers[selected_customers['cluster'] == i].mean())
    test = test.T.set_index('cluster', drop = True)
    test['size'] = selected_customers[selected_customers['cluster'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])

merged_df.drop('CustomerID', axis = 1, inplace = True)
print('Number of customers : ', merged_df['size'].sum())

merged_df = merged_df.sort_values('sum')

Number of customers :  3608

# Reorganizing the content of the dataframe:
# first the clusters dominated by a single product category, then the rest.
liste_index = []
for i in range(5):
    column = 'categ_{}'.format(i)
    liste_index.append(merged_df[merged_df[column] > 45].index.values[0])

liste_index_reordered = liste_index
liste_index_reordered += [s for s in merged_df.index if s not in liste_index]

merged_df = merged_df.reindex(index = liste_index_reordered)
merged_df = merged_df.reset_index(drop = False)
merged_df.head()




Saving the selected_customers dataframe and the dataframe above to CSV so that we do not need to redo all of this:


selected_customers.to_csv("selected_customers.csv")
merged_df.to_csv("merged_df.csv")
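A small caveat (an observation, not from the original run): to_csv writes the DataFrame index as an extra column by default, so reading the files back adds an Unnamed: 0 column. Passing index=False avoids that:

# Write without the index so that read_csv does not pick up an extra column.
selected_customers.to_csv("selected_customers.csv", index=False)
merged_df.to_csv("merged_df.csv", index=False)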


Classifying the Customers:


selected_customers = pd.read_csv('selected_customers.csv')
merged_df = pd.read_csv('merged_df.csv')


Defining Helper Functions:


from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

class Class_Fit(object):
    def __init__(self, clf, params = None):
        if params:
            self.clf = clf(**params)
        else:
            self.clf = clf()

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def grid_search(self, parameters, Kfold):
        self.grid = GridSearchCV(estimator = self.clf, param_grid = parameters, cv = Kfold)

    def grid_fit(self, X, Y):
        self.grid.fit(X, Y)

    def grid_predict(self, X, Y):
        self.predictions = self.grid.predict(X)
        # Note: despite the "Precision" label, this reports accuracy_score.
        print("Precision: {:.2f} %".format(100 * accuracy_score(Y, self.predictions)))

selected_customers.head()


Since we are trying to predict the customer segment/cluster, we will choose the cluster column as the target.


columns = ['mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
X = selected_customers[columns]
Y = selected_customers['cluster']

 


Train Test splitting:


from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.8)
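Optionally (an assumption, not in the original run), the split can be stratified on the cluster label so that the smaller clusters are represented in both halves, and seeded for reproducibility:

# Stratified, reproducible variant of the split above
# (requires every cluster to have at least a couple of members).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size = 0.8, stratify = Y, random_state = 42)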



Training Models:


from sklearn.svm import LinearSVC
svc = Class_Fit(clf=LinearSVC)
svc.grid_search(parameters = [{'C':np.logspace(-2,2,10)}], Kfold = 5)
svc.grid_fit(X=X_train, Y=Y_train)
svc.grid_predict(X_test, Y_test)

Precision: 81.02 %

from sklearn.metrics import confusion_matrix

# Code adapted from the scikit-learn documentation.
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

class_names = [i for i in range(1,11)]
cnf = confusion_matrix(Y_test, svc.predictions)
np.set_printoptions(precision=2)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 0 0 0 0 0 0 0 0 36 1]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 32 0 0 0 0 0 20 0]
[ 0 0 0 62 0 0 0 0 25 1]
[ 0 0 0 0 54 0 0 0 25 1]
[ 0 0 0 0 0 0 0 0 1 0]
[ 0 0 0 0 0 0 0 0 2 0]
[ 0 0 0 1 0 0 0 38 6 0]
[ 0 0 1 0 0 0 0 0 314 0]
[ 0 0 0 0 0 0 0 0 15 85]]
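As an aside (a sketch, not part of the original notebook): recent scikit-learn releases (1.0 and later) ship ConfusionMatrixDisplay, which can replace the hand-rolled plotting function above, assuming svc.predictions has been populated by grid_predict:

from sklearn.metrics import ConfusionMatrixDisplay

# Plot the same confusion matrix directly from the true and predicted labels.
ConfusionMatrixDisplay.from_predictions(Y_test, svc.predictions)
plt.show()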




# Code adapted from the scikit-learn documentation.
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(svc.grid.best_estimator_, "SVC Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
                        train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

 


Logistic Regression:


from sklearn.linear_model import LogisticRegression
lr = Class_Fit(clf = LogisticRegression)
lr.grid_search(parameters = [{'C':np.logspace(-1,2,10)}], Kfold = 5)
lr.grid_fit(X_train, Y_train)
lr.grid_predict(X_test, Y_test)

Precision: 94.74 %

cnf = confusion_matrix(Y_test, lr.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 32 0 0 0 0 0 0 0 5 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 52 0 0 0 0 0 0 0]
[ 0 0 0 84 0 0 0 0 3 1]
[ 0 0 0 0 78 0 0 0 2 0]
[ 0 0 0 0 0 0 1 0 0 0]
[ 2 0 0 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 42 1 0]
[ 5 0 3 2 1 0 0 1 300 3]
[ 0 0 0 0 0 0 0 0 4 96]]


g = plot_learning_curve(lr.grid.best_estimator_, "LogisticRegression Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




K-Nearest Neighbours:


from sklearn.neighbors import KNeighborsClassifier
knn = Class_Fit(clf = KNeighborsClassifier)
knn.grid_search(parameters = [{'n_neighbors':np.arange(1,50,1)}], Kfold = 5)
knn.grid_fit(X_train, Y_train)
knn.grid_predict(X_test, Y_test)

Precision: 83.52 %

cnf = confusion_matrix(Y_test, knn.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 33 0 0 0 0 0 0 0 4 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 3 0 40 1 0 0 0 1 7 0]
[ 0 1 1 68 1 0 0 0 16 1]
[ 4 1 0 2 53 0 0 0 13 7]
[ 0 0 0 0 0 0 1 0 0 0]
[ 2 0 0 0 0 0 0 0 0 0]
[ 2 0 0 2 0 0 0 36 5 0]
[ 4 1 5 12 4 0 0 2 282 5]
[ 1 0 0 0 2 0 0 0 6 91]]




g = plot_learning_curve(knn.grid.best_estimator_, "KNearestNEighbors Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Decision Trees:


from sklearn.tree import DecisionTreeClassifier
tr = Class_Fit(clf = DecisionTreeClassifier)
tr.grid_search(parameters = [{'criterion':['entropy', 'gini'], 'max_features':['sqrt', 'log2']}], Kfold = 5)
tr.grid_fit(X_train, Y_train)
tr.grid_predict(X_test, Y_test)

Precision: 91.14 %

cnf = confusion_matrix(Y_test, tr.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 27 0 1 0 3 0 1 0 4 1]
[ 0 0 0 0 0 0 0 0 1 1]
[ 1 0 48 0 2 0 0 0 1 0]
[ 0 0 0 83 0 0 0 0 5 0]
[ 1 0 1 0 73 0 0 0 5 0]
[ 1 0 0 0 0 0 0 0 0 0]
[ 1 0 1 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 42 0 1]
[ 5 2 3 10 2 0 0 1 288 4]
[ 1 0 0 0 0 0 0 0 2 97]]




g = plot_learning_curve(tr.grid.best_estimator_, "DecisionTree Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Random Forests:


from sklearn.ensemble import RandomForestClassifier
rf = Class_Fit(clf = RandomForestClassifier)
rf.grid_search(parameters = [{'criterion':['entropy', 'gini'],
'max_features':['sqrt', 'log2'], 'n_estimators':[20, 40, 60, 80, 100]}], Kfold = 5)
rf.grid_fit(X_train, Y_train)
rf.grid_predict(X_test, Y_test)

Precision: 93.35 %

cnf = confusion_matrix(Y_test, rf.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 29 0 1 0 3 1 0 0 3 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 50 0 1 0 0 0 1 0]
[ 0 0 1 82 0 0 0 0 5 0]
[ 0 0 0 0 76 0 0 0 4 0]
[ 1 0 0 0 0 0 0 0 0 0]
[ 1 0 1 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 43 0 0]
[ 6 0 1 2 2 0 0 2 298 4]
[ 0 0 1 0 0 0 0 0 3 96]]




g = plot_learning_curve(rf.grid.best_estimator_, "Random Forest Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




AdaBoost Classifier:


from sklearn.ensemble import AdaBoostClassifier
ada = Class_Fit(clf = AdaBoostClassifier)
ada.grid_search(parameters = [{'n_estimators':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}], Kfold = 5)
ada.grid_fit(X_train, Y_train)
ada.grid_predict(X_test, Y_test)

Precision: 57.06 %

cnf = confusion_matrix(Y_test, ada.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 0 0 0 2 0 0 1 0 31 3]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 0 1 0 0 0 0 51 0]
[ 0 0 0 6 0 0 0 0 81 1]
[ 0 0 0 1 0 0 0 0 71 8]
[ 0 0 0 0 0 1 0 0 0 0]
[ 0 0 0 0 0 0 1 0 1 0]
[ 0 0 0 45 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 0 311 2]
[ 0 0 0 0 0 0 0 0 7 93]]




g = plot_learning_curve(ada.grid.best_estimator_, "AdaBoost Learning Curve", X_train, Y_train, ylim=[1.01, 0.4], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Gradient Boosted Decision Trees:


import xgboost
gbdt = Class_Fit(clf = xgboost.XGBClassifier)
gbdt.grid_search(parameters = [{'n_estimators':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}], Kfold = 5)
gbdt.grid_fit(X_train, Y_train)
gbdt.grid_predict(X_test, Y_test)

Precision: 94.04 %

cnf = confusion_matrix(Y_test, gbdt.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 32 0 0 0 1 0 0 0 4 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 50 1 1 0 0 0 0 0]
[ 0 0 1 82 0 0 0 0 5 0]
[ 1 0 0 0 73 0 0 0 6 0]
[ 0 0 0 0 0 0 1 0 0 0]
[ 0 0 1 0 0 0 1 0 0 0]
[ 0 0 0 2 0 0 0 43 0 0]
[ 4 0 1 2 2 0 0 1 300 5]
[ 0 0 0 0 0 0 0 0 2 98]]


g = plot_learning_curve(gbdt.grid.best_estimator_, "GBDT Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Voting Classifier:

A Voting Classifier is an ensemble model that combines the predictions of several fitted models and outputs either the class chosen by a majority vote (hard voting, scikit-learn's default) or, with soft voting, the class with the highest average predicted probability.


rf_best = RandomForestClassifier(**rf.grid.best_params_)
gbdt_best = xgboost.XGBClassifier(**gbdt.grid.best_params_)
svc_best = LinearSVC(**svc.grid.best_params_)
tr_best = DecisionTreeClassifier(**tr.grid.best_params_)
knn_best = KNeighborsClassifier(**knn.grid.best_params_)
lr_best = LogisticRegression(**lr.grid.best_params_)

from sklearn.ensemble import VotingClassifier
votingC = VotingClassifier(estimators=[('rf', rf_best), ('gb', gbdt_best), ('knn', knn_best), ('lr', lr_best)])
votingC = votingC.fit(X_train, Y_train)
predictions = votingC.predict(X_test)
print("Precision : {:.2f}%".format(100 * accuracy_score(Y_test, predictions)))

Precision : 94.88%

This is the highest precision that we have obtained.
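As an aside (a sketch, not part of the original run): the ensemble above uses the default hard (majority) voting. Since the four estimators actually included all expose predict_proba, a soft-voting variant that averages class probabilities could also be tried:

from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of the base models.
votingC_soft = VotingClassifier(
    estimators=[('rf', rf_best), ('gb', gbdt_best), ('knn', knn_best), ('lr', lr_best)],
    voting='soft')
votingC_soft = votingC_soft.fit(X_train, Y_train)
print("Precision : {:.2f}%".format(100 * accuracy_score(Y_test, votingC_soft.predict(X_test))))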


Testing the model:


basket_price = set_test.copy(deep=True)

transanctions_per_user = basket_price.groupby(by=['CustomerID'])['Basket Price'].agg(['count', 'min', 'max', 'mean', 'sum'])

for i in range(5):
    col = 'categ_{}'.format(i)
    transanctions_per_user.loc[:, col] = basket_price.groupby(by=['CustomerID'])[col].sum() / transanctions_per_user['sum'] * 100

transanctions_per_user.reset_index(drop = False, inplace = True)
basket_price.groupby(by=['CustomerID'])['categ_0'].sum()

# The test window covers roughly one fifth of the training window, so 'count'
# and 'sum' are scaled by 5 to make them comparable.
transanctions_per_user['count'] = 5 * transanctions_per_user['count']
transanctions_per_user['sum'] = transanctions_per_user['count'] * transanctions_per_user['mean']

transanctions_per_user.sort_values('CustomerID', ascending = True)[:5]


list_cols = ['count', 'min', 'max', 'mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
matrix_test = transanctions_per_user[list_cols].values
scaled_test_matrix = scaler.transform(matrix_test)

# The k-means cluster assignments act as the reference labels for the test customers.
Y = kmeans.predict(scaled_test_matrix)
columns = ['mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4' ]
X = transanctions_per_user[columns]
predictions = votingC.predict(X)

print("Precision : {:.2f}%".format(100 * accuracy_score(Y, predictions)))

Precision : 89.18%

Accuracy on the test set is good, considering that we trained on data up to 10 months old and predicted on new data.


Conclusion:



  1. We are able to separate customers into different segments, based on the type of products that they buy.

  2. Using a Voting Classifier and a combination of multiple machine learning models, such as Random Forest, Gradient Boosted Decision Trees, K-Nearest Neighbours, and Logistic Regression, we are able to predict what type of product a user will buy, with a precision of 94.88%.

  3. We can use this information to target selected customers with promotional offers for their desired products, which increases the likelihood of more sales in the future.

