
Retail Customer Analysis


Model Overview

We will perform an exploratory analysis of an online retail store's data in order to understand its customers. Assume we own a retail store that has been doing incredibly well and we want to scale the business efficiently and effectively. To do this, we need to understand our customers and tailor our marketing or expansion efforts to specific subsets of them. The main business question here is: "How can I scale my current business, which is already doing well, in the most effective way?" A supporting sub-question is: "What type of marketing initiative should we run for each customer in order to get the best ROI?"



About the Dataset:

This is a transactional data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer.


Business Problem:

Problem Statement: The goal is to answer the following questions:


  1. Can we categorize customers into segments based on their buying patterns? (Customer Segmentation)

  2. Can we predict which kinds of items they will buy in the future based on their segment? (Prediction)



ML Problem Mapping:


  1. Given a dataset of transactions (the Online Retail dataset from the UCI Machine Learning Repository), find the customer segments, i.e. clusters. (Find common patterns and group customers accordingly.)

  2. Predict what to display to which group of users.



Input: We will be using e-commerce data containing one year of purchases for about 4,000 customers.

Output:
 The first goal is to categorize our consumer base into appropriate customer segments. The second goal is to predict the purchases a customer will make in the current and following year based on their first purchases.



Importing libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import itertools
import nltk
import warnings
warnings.filterwarnings('ignore')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import wordcloud
%matplotlib inline
plt.style.use('fivethirtyeight')

 


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!

The following code connects Google Colab to Google Drive and navigates to the folder where the dataset is stored.


import os
os.getcwd()

 


'/content'

# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')#,force_remount=True)
root_path = 'gdrive/My Drive/Customer-Analytics-master/'

os.chdir(root_path)
os.getcwd()

'/content/gdrive/My Drive/Customer-Analytics-master'


Loading the Dataset:


data = pd.read_excel('Online Retail.xlsx', dtype={'StockCode':str})
data.head(3)
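If the default Excel engine is unavailable (newer pandas versions no longer read .xlsx through xlrd), installing openpyxl and naming it explicitly usually works; this is an environment assumption, not part of the original run:

# Explicitly use the openpyxl engine for .xlsx files.
data = pd.read_excel('Online Retail.xlsx', dtype={'StockCode': str}, engine='openpyxl')
data.head(3)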




data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

data.shape

(541909, 8)


Data Preprocessing:


# Checking for null values.
info = pd.DataFrame(data=data.isnull().sum()).T.rename(index={0:'Null values'})
info = info.append(pd.DataFrame(data=data.isnull().sum()/data.shape[0] * 100).T.rename(index={0:'% Null values'}))
info
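Note (an environment assumption, not part of the original run): DataFrame.append was removed in pandas 2.0, so on a recent pandas the same summary can be built with pd.concat:

# Equivalent null-value summary without DataFrame.append (removed in pandas 2.0).
info = pd.concat([
    pd.DataFrame(data.isnull().sum()).T.rename(index={0: 'Null values'}),
    pd.DataFrame(data.isnull().sum() / data.shape[0] * 100).T.rename(index={0: '% Null values'}),
])
info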


Since we don't have a CustomerID for about 25% of the rows, and we cannot assign them an arbitrary ID, we will remove them.


# Removing null values
data.dropna(axis=0, subset = ['CustomerID'], inplace=True)
info = pd.DataFrame(data=data.isnull().sum()).T.rename(index={0:'Null values'})
info = info.append(pd.DataFrame(data=data.isnull().sum()/data.shape[0] * 100).T.rename(index={0:'% Null values'}))
info


# Checking for Duplicates :
data.duplicated().sum()

5225

# Removing duplicate entries :
data.drop_duplicates(inplace=True)
data.duplicated().sum()

0


Exploratory Data Analysis:


plt.figure(figsize=(14,6))
plt.bar(list(data.groupby(['Country']).groups.keys()), data.groupby(['Country'])['CustomerID'].count())
plt.xticks(rotation = 90, fontsize = 14)
plt.title("Number of transactions per country")
plt.ylabel("No. of transactions")
plt.xlabel("Country")
plt.show()




info = pd.DataFrame(data = data.groupby(['Country'])['InvoiceNo'].nunique(), index=data.groupby(['Country']).groups.keys()).T
info



Observations:



  1. The UK accounts for the vast majority of transactions (19,857).

  2. The fewest transactions (only 1 each) come from countries such as Brazil and RSA.




# StockCode feature ->
# We will see how many different products were sold during the period covered by the data.
print(len(data['StockCode'].value_counts()))

3684

# Transaction feature
# We will see how many different transactions were made.
print(len(data['InvoiceNo'].value_counts()))

22190

# CustomerID feature
# We will see how many different customers there are.
print(len(data['CustomerID'].value_counts()))

4372

pd.DataFrame({'products': len(data['StockCode'].value_counts()),
              'transactions': len(data['InvoiceNo'].value_counts()),
              'customers': len(data['CustomerID'].value_counts())},
             index = ['Quantity'])



There are roughly 22k transactions but only about 4.4k customers and 3.7k products. It seems that some orders were placed and then cancelled, that some customers bought items multiple times, or that multiple items were bought in a single transaction.


Checking the number of items bought in a single transaction:

df = data.groupby(['CustomerID', 'InvoiceNo'], as_index=False)['InvoiceDate'].count()
df = df.rename(columns = {'InvoiceDate':'Number of products'})
df[:10].sort_values('CustomerID')



There are customers who purchase only one item per transaction and others who purchase many items per transaction. There are also some cancelled orders; these are marked with a 'C' at the beginning of the invoice number.


Counting the number of cancelled transactions:


df['orders cancelled'] = df['InvoiceNo'].apply(lambda x: int('C' in str(x)))
df.head()


# Printing number of orders cancelled ->
print("Number of orders cancelled {}/{} ({:.2f}%)".format(df['orders cancelled'].sum(), df.shape[0], df['orders cancelled'].sum()/ df.shape[0] * 100))

Number of orders cancelled 3654/22190 (16.47%)


Handling Cancelled Values:


# Looking at cancelled transactions in the original data.
data.sort_values('CustomerID')[:5]



We see that when an order is cancelled, a new transaction is created with a different InvoiceNo and a negative Quantity, while every other field stays the same. We can use this to remove the cancelled orders.


Checking for discounted products:


df = data[data['Description'] == 'Discount']
df.head()


So there are some discount transactions too, but they appear to be cancellations.


Checking whether every order that has been cancelled has a counterpart:


df = data[(data['Quantity']<0) & (data['Description']!='Discount')][['CustomerID','Quantity','StockCode','Description','UnitPrice']]
df.head()




for index, col in df.iterrows():
    if data[(data['CustomerID'] == col[0]) & (data['Quantity'] == -col[1]) & (data['Description'] == col[2])].shape[0] == 0:
        print(index, df.loc[index])
        print("There are some transanctions for which counterpart does not exist")
        break

154 CustomerID                              15311
Quantity                                       -1
StockCode                                  35004C
Description        SET OF 3 COLOURED FLYING DUCKS
UnitPrice                                    4.65
Name: 154, dtype: object
There are some transanctions for which counterpart does not exist

We found that there are some cancellations for which no counterpart exists.
Possible reasons: the original order was placed before the period covered by the dataset, the cancellation happened to match a counterpart that was already consumed, or the entry is simply an error.


Removing cancelled orders:


df_cleaned = data.copy(deep=True)
df_cleaned['QuantityCancelled'] = 0
entry_to_remove = []; doubtfull_entry = []

for index, col in data.iterrows():
    if (col['Quantity'] > 0) or (col['Description'] == 'Discount'): continue
    df_test = data[(data['CustomerID'] == col['CustomerID']) & (data['StockCode'] == col['StockCode']) &
                   (data['InvoiceDate'] < col['InvoiceDate']) & (data['Quantity'] > 0)].copy()

    # Cancellation without a counterpart: doubtful, since it may be an error
    # or the original order may predate the dataset.
    if (df_test.shape[0] == 0):
        doubtfull_entry.append(index)

    # Cancellation with a single counterpart.
    elif (df_test.shape[0] == 1):
        index_order = df_test.index[0]
        df_cleaned.loc[index_order, 'QuantityCancelled'] = -col['Quantity']
        entry_to_remove.append(index)

    # Several counterparts exist: take the latest one large enough to absorb the cancelled quantity.
    elif (df_test.shape[0] > 1):
        df_test.sort_index(axis = 0, ascending=False, inplace=True)
        for ind, val in df_test.iterrows():
            if val['Quantity'] < -col['Quantity']: continue
            df_cleaned.loc[ind, 'QuantityCancelled'] = -col['Quantity']
            entry_to_remove.append(index)
            break

print("Entry to remove {}".format(len(entry_to_remove)))
print("Doubtfull Entry {}".format(len(doubtfull_entry)))

print("Entry to remove {}".format(len(entry_to_remove)))
print("Doubtfull Entry {}".format(len(doubtfull_entry)))

Entry to remove 7521
Doubtfull Entry 1226

# Deleting these entries :
df_cleaned.drop(entry_to_remove, axis=0, inplace=True)
df_cleaned.drop(doubtfull_entry, axis=0, inplace=True)



We will now look at the StockCode feature, in particular the special transaction codes:


list_special_codes = df_cleaned[df_cleaned['StockCode'].str.contains('^[a-zA-Z]+', regex = True)]['StockCode'].unique()
list_special_codes

array(['POST', 'D', 'C2', 'M', 'BANK CHARGES', 'PADS', 'DOT'],
dtype=object)

for code in list_special_codes:
    print("{:<17} -> {:<35}".format(code, df_cleaned[df_cleaned['StockCode'] == code]['Description'].values[0]))

POST              -> POSTAGE
D                 -> Discount
C2                -> CARRIAGE
M                 -> Manual
BANK CHARGES      -> Bank Charges
PADS              -> PADS TO MATCH ALL CUSHIONS
DOT               -> DOTCOM POSTAGE

df_cleaned['QuantityCancelled'] = np.nan_to_num(df_cleaned['QuantityCancelled'])
df_cleaned.head()


We see that the same transaction is split across multiple rows, one per item. For example, for invoice number 536365 above, the customer bought several different items and each item got its own row. We will need to merge these, so we add a TotalPrice feature for each row.


Adding the TotalPrice feature:


df_cleaned['TotalPrice'] = df_cleaned['UnitPrice'] * (df_cleaned['Quantity'] - df_cleaned['QuantityCancelled'])
df_cleaned.sort_values('CustomerID')[:5]




Now we sum the individual line items and group them by invoice number, removing the problem of duplicate rows for the same order:


temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index=False)['TotalPrice'].sum()
basket_price = temp.rename(columns = {'TotalPrice': 'Basket Price'})

df_cleaned['InvoiceDate_int'] = df_cleaned['InvoiceDate'].astype('int64')
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index=False)['InvoiceDate_int'].mean()
df_cleaned.drop('InvoiceDate_int', axis = 1, inplace=True)
basket_price.loc[:, 'InvoiceDate'] = pd.to_datetime(temp['InvoiceDate_int'])

basket_price = basket_price[basket_price['Basket Price'] > 0]
basket_price.sort_values('CustomerID')[:6]

 

Plotting the purchases made:


price_range = [0, 50, 100, 200, 500, 1000, 5000, 50000]
count_price = []
for i, price in enumerate(price_range):
    if i == 0: continue
    val = basket_price[(basket_price['Basket Price'] < price) &
                       (basket_price['Basket Price'] > price_range[i-1])]['Basket Price'].count()
    count_price.append(val)

plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(11, 6))
colors = ['yellowgreen', 'gold', 'wheat', 'c', 'violet', 'royalblue', 'firebrick']
labels = ["{}<.<{}".format(price_range[i-1], s) for i, s in enumerate(price_range) if i != 0]
sizes = count_price
explode = [0.0 if sizes[i] < 100 else 0.0 for i in range(len(sizes))]
ax.pie(sizes, explode = explode, labels = labels, colors = colors,
       autopct = lambda x: '{:1.0f}%'.format(x) if x > 1 else '',
       shadow = False, startangle = 0)
ax.axis('equal')
f.text(0.5, 1.01, "Distribution of order amounts", ha = 'center', fontsize = 18)
plt.show()




Analyzing product Description:


is_noun = lambda pos: pos[:2] == 'NN'

def keywords_inventory(dataframe, colonne = 'Description'):
    import nltk
    stemmer = nltk.stem.SnowballStemmer("english")
    keywords_roots = dict()    # root -> set of words sharing that root
    keywords_select = dict()   # root -> shortest associated keyword
    category_keys = []
    count_keywords = dict()
    icount = 0

    for s in dataframe[colonne]:
        if pd.isnull(s): continue
        lines = s.lower()
        tokenized = nltk.word_tokenize(lines)
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

        for t in nouns:
            t = t.lower(); racine = stemmer.stem(t)
            if racine in keywords_roots:
                keywords_roots[racine].add(t)
                count_keywords[racine] += 1
            else:
                keywords_roots[racine] = {t}
                count_keywords[racine] = 1

    # For each root, keep the shortest word that maps to it.
    for s in keywords_roots.keys():
        if len(keywords_roots[s]) > 1:
            min_length = 1000
            for k in keywords_roots[s]:
                if len(k) < min_length:
                    clef = k; min_length = len(k)

            category_keys.append(clef)
            keywords_select[s] = clef
        else:
            category_keys.append(list(keywords_roots[s])[0])
            keywords_select[s] = list(keywords_roots[s])[0]

    print("Number of keywords in the variable '{}': {}".format(colonne, len(category_keys)))
    return category_keys, keywords_roots, keywords_select, count_keywords

df_produits = pd.DataFrame(data['Description'].unique()).rename(columns = {0:"Description"})
keywords, keywords_roots, keywords_select, count_keywords = keywords_inventory(df_produits)

Number of keywords in the variable 'Description': 1483

# Plotting keywords vs frequency graph :
list_products = []
for k, v in count_keywords.items():
    word = keywords_select[k]
    list_products.append([word, v])

liste = sorted(list_products, key = lambda x: x[1], reverse=True)

plt.rc('font', weight='normal')
fig, ax = plt.subplots(figsize=(7, 25))
y_axis = [i[1] for i in liste[:125]]
x_axis = [k for k, i in enumerate(liste[:125])]
x_label = [i[0] for i in liste[:125]]
plt.xticks(fontsize=15)
plt.yticks(fontsize=13)
plt.yticks(x_axis, x_label)
plt.xlabel("Number of occurrences", fontsize = 18, labelpad = 10)
ax.barh(x_axis, y_axis, align='center')
ax = plt.gca()
ax.invert_yaxis()

plt.title("Word Occurrence", bbox={'facecolor':'k', 'pad':5}, color='w', fontsize = 25)
plt.show()




# Preserving important words :
list_products = []
for k, v in count_keywords.items():
    word = keywords_select[k]
    if word in ['pink', 'blue', 'tag', 'green', 'orange']: continue
    if len(word) < 3 or v < 13: continue
    list_products.append([word, v])

list_products.sort(key = lambda x: x[1], reverse=True)
print("Number of preserved words : ", len(list_products))

Number of preserved words :  193



Describing every product in terms of words present in the description:



  1. We will only use the preserved words; this is essentially a binary bag of words.

  2. We convert this into a product matrix with products as rows and words as columns. A cell contains 1 if the product's description contains that word, and 0 otherwise.

  3. We will use this matrix to categorize the products.

  4. We will add a mean-price feature so that the groups stay balanced.




threshold = [0, 1, 2, 3, 5, 10]

# Getting the descriptions.
liste_produits = df_cleaned['Description'].unique()

# Creating the product x word matrix.
X = pd.DataFrame()
for key, occurence in list_products:
    X.loc[:, key] = list(map(lambda x: int(key.upper() in x), liste_produits))

# Adding one column per mean-price range.
label_col = []
for i in range(len(threshold)):
    if i == len(threshold) - 1:
        col = '.>{}'.format(threshold[i])
    else:
        col = '{}<.<{}'.format(threshold[i], threshold[i+1])
    label_col.append(col)
    X.loc[:, col] = 0

for i, prod in enumerate(liste_produits):
    prix = df_cleaned[df_cleaned['Description'] == prod]['UnitPrice'].mean()
    j = 0
    while prix > threshold[j]:
        j += 1
        if j == len(threshold):
            break
    X.loc[i, label_col[j-1]] = 1

print("{:<8} {:<20} \n".format('range', 'number of products') + 20*'-')
for i in range(len(threshold)):
if i == len(threshold)-1:
col = '.>{}'.format(threshold[i])
else:
col = '{}<.<{}'.format(threshold[i],threshold[i+1])
print("{:<10} {:<20}".format(col, X.loc[:, col].sum()))

range    number of products
--------------------
0<.<1      964
1<.<2      1009
2<.<3      673
3<.<5      606
5<.<10     470
.>10       156



Clustering:
K-means:


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
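As a quick illustration of the metric (a toy example on synthetic data, not part of the retail pipeline; make_blobs and the chosen cluster counts are assumptions for demonstration only):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs: clustering with a sensible k gives a
# silhouette score close to 1, while a poor choice of k gives a lower score.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    print(k, round(silhouette_score(X_toy, labels), 3))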


matrix = X.values

# Tuning the number of clusters with the average silhouette score:
for n_clusters in range(3, 10):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    sil_avg = silhouette_score(matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  3 The average silhouette_score is :  0.10158702596012364
For n_clusters : 4 The average silhouette_score is : 0.1268004588393788
For n_clusters : 5 The average silhouette_score is : 0.14708700459493795
For n_clusters : 6 The average silhouette_score is : 0.14329241182453895
For n_clusters : 7 The average silhouette_score is : 0.15026667240832906
For n_clusters : 8 The average silhouette_score is : 0.16136085168920045
For n_clusters : 9 The average silhouette_score is : 0.12901394677787018

# Choosing the number of clusters as 5 and re-fitting
# until the silhouette score exceeds 0.145:
n_clusters = 5
sil_avg = -1
while sil_avg < 0.145:
    kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    sil_avg = silhouette_score(matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  5 The average silhouette_score is :  0.14740815062347604

# Printing number of elements in each cluster :
pd.Series(clusters).value_counts()

2    1009
4     964
1     673
0     626
3     606
dtype: int64



Analyzing the 5 clusters :


def graph_component_silhouette(n_clusters, lim_x, mat_size, sample_silhouette_values, clusters):
    import matplotlib as mpl
    mpl.rc('patch', edgecolor = 'dimgray', linewidth = 1)

    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(8, 8)
    ax1.set_xlim([lim_x[0], lim_x[1]])
    ax1.set_ylim([0, mat_size + (n_clusters + 1) * 10])
    y_lower = 10

    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[clusters == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, alpha = 0.8)

        ax1.text(-0.03, y_lower + 0.5 * size_cluster_i, str(i), color = 'red', fontweight = 'bold',
                 bbox = dict(facecolor = 'white', edgecolor = 'black', boxstyle = 'round, pad = 0.3'))

        y_lower = y_upper + 10

# Plotting the intra-cluster silhouette distances.
from sklearn.metrics import silhouette_samples
sample_silhouette_values = silhouette_samples(matrix, clusters)
graph_component_silhouette(n_clusters, [-0.07, 0.33], len(X), sample_silhouette_values, clusters)




Analysis using wordcloud:

Checking which words are most common in the clusters.


liste = pd.DataFrame(liste_produits)
liste_words = [word for (word, occurance) in list_products]

occurance = [dict() for _ in range(n_clusters)]

# Creating the data for the word clouds.
for i in range(n_clusters):
    liste_cluster = liste.loc[clusters == i]
    for word in liste_words:
        if word in ['art', 'set', 'heart', 'pink', 'blue', 'tag']: continue
        occurance[i][word] = sum(liste_cluster.loc[:, 0].str.contains(word.upper()))

# Code for printing the word clouds.
from random import randint
import random

def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    # 'tone' is a global set in the plotting loop below.
    h = int(360.0 * tone / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(70, 120)) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)

def make_wordcloud(liste, increment):
    ax1 = fig.add_subplot(4, 2, increment)
    words = dict()
    trunc_occurances = liste[0:150]
    for s in trunc_occurances:
        words[s[0]] = s[1]

    wc = wordcloud.WordCloud(width=1000, height=400, background_color='lightgrey', max_words=1628, relative_scaling=1,
                             color_func = random_color_func, normalize_plurals=False)
    wc.generate_from_frequencies(words)
    ax1.imshow(wc, interpolation="bilinear")
    ax1.axis('off')
    plt.title('cluster n{}'.format(increment-1))

fig = plt.figure(1, figsize=(14, 14))
color = [0, 160, 130, 95, 280, 40, 330, 110, 25]
for i in range(n_clusters):
    list_cluster_occurences = occurance[i]
    tone = color[i]
    liste = []
    for key, value in list_cluster_occurences.items():
        liste.append([key, value])
    liste.sort(key = lambda x: x[1], reverse = True)
    make_wordcloud(liste, i+1)



Observations:



  1. Cluster number 2 contains the items related to decoration and gifts.

  2. Cluster number 4 contains luxury items.

  3. Words like "vintage" are common to most of the clusters.



Dimensionality Reduction:
PCA:

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(matrix)
pca_samples = pca.transform(matrix)

# Checking the amount of variance explained :
fig, ax = plt.subplots(figsize=(14, 5))
sns.set(font_scale=1)
plt.step(range(matrix.shape[1]), pca.explained_variance_ratio_.cumsum(), where = 'mid', label = 'Cumulative Variance Explained')
sns.barplot(np.arange(1, matrix.shape[1] + 1), pca.explained_variance_ratio_, alpha = 0.5, color = 'g',
            label = 'Individual Variance Explained')
plt.xlim(0, 100)
plt.xticks(rotation = 45, fontsize = 14)
ax.set_xticklabels([s if int(s.get_text())%2 == 0 else '' for s in ax.get_xticklabels()])

plt.ylabel("Explained Variance", fontsize = 14)
plt.xlabel("Principal Components", fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()


We need more than 100 Principal Components to explain more than 90 % of the variance.
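To make that claim concrete, a short check (a sketch, assuming `pca` is the object fitted in the cell above):

# Smallest number of principal components whose cumulative explained
# variance ratio reaches 90%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Components needed for 90% of the variance:", n_components_90)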


Generating Customer Segments/Categories:
We will use the already generated product categories and create a new feature indicating which category each product belongs to.


corresp = dict()
for key, val in zip(liste_produits, clusters):
    corresp[key] = val

df_cleaned['categ_product'] = df_cleaned.loc[:, 'Description'].map(corresp)
df_cleaned[['InvoiceNo', 'Description', 'categ_product']][:10]




# Creating 5 new features containing the amount spent in a single transaction on each category of product.
for i in range(5):
    col = 'categ_{}'.format(i)
    df_temp = df_cleaned[df_cleaned['categ_product'] == i]
    price_temp = df_temp['UnitPrice'] * (df_temp['Quantity'] - df_temp['QuantityCancelled'])
    price_temp = price_temp.apply(lambda x: x if x > 0 else 0)
    df_cleaned.loc[:, col] = price_temp
    df_cleaned[col].fillna(0, inplace = True)

df_cleaned[['InvoiceNo', 'Description', 'categ_product', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']][:10]




A single order is split across multiple entries, so we aggregate them into baskets:


# Sum of purchases per user and order.
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)['TotalPrice'].sum()
basket_price = temp.rename(columns={'TotalPrice': 'Basket Price'})

# Amount spent on each product category in each basket.
for i in range(5):
    col = "categ_{}".format(i)
    temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)[col].sum()
    basket_price.loc[:, col] = temp[col]

# Date of the order.
df_cleaned['InvoiceDate_int'] = df_cleaned['InvoiceDate'].astype('int64')
temp = df_cleaned.groupby(by=['CustomerID', 'InvoiceNo'], as_index = False)['InvoiceDate_int'].mean()
df_cleaned.drop('InvoiceDate_int', axis = 1, inplace=True)
basket_price.loc[:, 'InvoiceDate'] = pd.to_datetime(temp['InvoiceDate_int'])

# Selecting entries with basket price > 0.
basket_price = basket_price[basket_price['Basket Price'] > 0]
basket_price.sort_values('CustomerID', ascending=True)[:5]


basket_price['InvoiceDate'].min()

Timestamp('2010-12-01 08:26:00')

basket_price['InvoiceDate'].max()

Timestamp('2011-12-09 12:50:00')

basket_price['InvoiceDate'].mean()

Timestamp('2011-07-01 17:32:29.703417600')



Time Based Splitting:


import datetime
pd.to_datetime('2011-10-1')

Timestamp('2011-10-01 00:00:00')

# 'set_entrainment' (French for training) holds everything before 2011-10-01;
# the remaining ~2 months are kept aside as the test window.
set_entrainment = basket_price[basket_price['InvoiceDate'] < pd.to_datetime('2011-10-1')]
set_test = basket_price[basket_price['InvoiceDate'] >= pd.to_datetime('2011-10-1')]
basket_price = set_entrainment.copy(deep = True)



Grouping Orders:

We will gather information about every customer: how much they purchase, their total number of orders, etc.


transanctions_per_user = basket_price.groupby(by=['CustomerID'])['Basket Price'].agg(['count', 'min', 'max', 'mean', 'sum'])

# Percentage of each customer's total spend that falls in each product category.
for i in range(5):
    col = 'categ_{}'.format(i)
    transanctions_per_user.loc[:, col] = basket_price.groupby(by=['CustomerID'])[col].sum() / transanctions_per_user['sum'] * 100

transanctions_per_user.reset_index(drop = False, inplace = True)
basket_price.groupby(by=['CustomerID'])['categ_0'].sum()
transanctions_per_user.sort_values('CustomerID', ascending = True)[:5]




# Generating two new variables - days since first purchase and days since last purchase.
last_date = basket_price['InvoiceDate'].max().date()

first_registration = pd.DataFrame(basket_price.groupby(by=['CustomerID'])['InvoiceDate'].min())
last_purchase = pd.DataFrame(basket_price.groupby(by=['CustomerID'])['InvoiceDate'].max())

test = first_registration.applymap(lambda x:(last_date - x.date()).days)
test2 = last_purchase.applymap(lambda x:(last_date - x.date()).days)

transanctions_per_user.loc[:, 'LastPurchase'] = test2.reset_index(drop = False)['InvoiceDate']
transanctions_per_user.loc[:, 'FirstPurchase'] = test.reset_index(drop = False)['InvoiceDate']

transanctions_per_user[:5]



We need to focus on customers who placed only one order; our objective is to target these customers in a way that retains them.


n1 = transanctions_per_user[transanctions_per_user['count'] == 1].shape[0]
n2 = transanctions_per_user.shape[0]
print("No. of Customers with single purchase : {:<2}/{:<5} ({:<2.2f}%)".format(n1, n2, n1/n2*100))

No. of Customers with single purchase : 1445/3608  (40.05%)



Building Customer Segments:


list_cols = ['count', 'min', 'max', 'mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
selected_customers = transanctions_per_user.copy(deep=True)
matrix = selected_customers[list_cols].values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(matrix)
print("Variable Mean Values: \n" + 90*'-' + '\n', scaler.mean_)
scaled_matrix = scaler.transform(matrix)

Variable Mean Values: 
------------------------------------------------------------------------------------------
[ 3.62305987 259.93189634 556.26687999 377.06036244 23.21847344
21.19884856 25.22916919 16.37327913 13.98907929]

pca = PCA()
pca.fit(scaled_matrix)
pca_samples = pca.transform(scaled_matrix)

# Checking the amount of variance explained :
fig, ax = plt.subplots(figsize=(14, 5))
sns.set(font_scale=1)
plt.step(range(matrix.shape[1]), pca.explained_variance_ratio_.cumsum(), where = 'mid', label = 'Cumulative Variance Explained')
sns.barplot(np.arange(1, matrix.shape[1] + 1), pca.explained_variance_ratio_, alpha = 0.5, color = 'g',
            label = 'Individual Variance Explained')
plt.xlim(0, 10)
plt.xticks(rotation = 45, fontsize = 14)
ax.set_xticklabels([s if int(s.get_text())%2 == 0 else '' for s in ax.get_xticklabels()])

plt.ylabel("Explained Variance", fontsize = 14)
plt.xlabel("Principal Components", fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()


# Tuning the number of clusters with the average silhouette score:
for n_clusters in range(3, 21):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    sil_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  3 The average silhouette_score is :  0.16032080193530113
For n_clusters : 4 The average silhouette_score is : 0.15228322459107096
For n_clusters : 5 The average silhouette_score is : 0.16383272664364962
For n_clusters : 6 The average silhouette_score is : 0.17381160467028156
For n_clusters : 7 The average silhouette_score is : 0.18797211677153328
For n_clusters : 8 The average silhouette_score is : 0.19882636788256774
For n_clusters : 9 The average silhouette_score is : 0.20525645349822272
For n_clusters : 10 The average silhouette_score is : 0.21189996039374637
For n_clusters : 11 The average silhouette_score is : 0.21620594900368645
For n_clusters : 12 The average silhouette_score is : 0.18508142436046335
For n_clusters : 13 The average silhouette_score is : 0.1865944849916417
For n_clusters : 14 The average silhouette_score is : 0.18784125711480013
For n_clusters : 15 The average silhouette_score is : 0.19036282087608622
For n_clusters : 16 The average silhouette_score is : 0.19244359111998907
For n_clusters : 17 The average silhouette_score is : 0.1848920893685197
For n_clusters : 18 The average silhouette_score is : 0.1844801089644844
For n_clusters : 19 The average silhouette_score is : 0.18214979474920676
For n_clusters : 20 The average silhouette_score is : 0.18787461874056635

# Choosing the number of clusters as 10 and re-fitting
# until the silhouette score exceeds 0.208:
n_clusters = 10
sil_avg = -1
while sil_avg < 0.208:
    kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    sil_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters : ", n_clusters, "The average silhouette_score is : ", sil_avg)

For n_clusters :  10 The average silhouette_score is :  0.21107665673451478

n_clusters = 10
kmeans = KMeans(init = 'k-means++', n_clusters = n_clusters, n_init = 100)
kmeans.fit(scaled_matrix)
clusters_clients = kmeans.predict(scaled_matrix)
silhouette_avg = silhouette_score(scaled_matrix, clusters_clients)
print("Silhouette Score : {:<.3f}".format(silhouette_avg))

Silhouette Score : 0.212

# Looking at clusters :
pd.DataFrame(pd.Series(clusters_clients).value_counts(), columns=['Number of Clients']).T


There is a large difference in the cluster sizes; we will analyze these clusters further.


sample_silhouette_values = silhouette_samples(scaled_matrix, clusters_clients)

graph_component_silhouette(n_clusters, [-0.15, 0.55], len(scaled_matrix), sample_silhouette_values, clusters_clients)



From the graph above we can be reasonably confident that the clusters are well separated.


Now we need to learn the customers' habits. To do that, we add a variable indicating the cluster each customer belongs to:


selected_customers.loc[:, 'cluster'] = clusters_clients
merged_df = pd.DataFrame()
for i in range(n_clusters):
    test = pd.DataFrame(selected_customers[selected_customers['cluster'] == i].mean())
    test = test.T.set_index('cluster', drop = True)
    test['size'] = selected_customers[selected_customers['cluster'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])

merged_df.drop('CustomerID', axis = 1, inplace = True)
print('Number of customers : ', merged_df['size'].sum())

merged_df = merged_df.sort_values('sum')

Number of customers :  3608

# Reorganizing the content of the dataframe:
# first the clusters dominated by a single product category, then the rest.
liste_index = []
for i in range(5):
    column = 'categ_{}'.format(i)
    liste_index.append(merged_df[merged_df[column] > 45].index.values[0])

liste_index_reordered = liste_index
liste_index_reordered += [s for s in merged_df.index if s not in liste_index]

merged_df = merged_df.reindex(index = liste_index_reordered)
merged_df = merged_df.reset_index(drop = False)
merged_df.head()




Saving the selected_customers dataframe and the dataframe above to CSV so that we do not need to redo all of this:


selected_customers.to_csv("selected_customers.csv")
merged_df.to_csv("merged_df.csv")
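A small caveat (an observation, not from the original run): to_csv writes the DataFrame index as an extra column by default, so reading the files back adds an Unnamed: 0 column. Passing index=False avoids that:

# Write without the index so that read_csv does not pick up an extra column.
selected_customers.to_csv("selected_customers.csv", index=False)
merged_df.to_csv("merged_df.csv", index=False)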


Classifying the Customers:


selected_customers = pd.read_csv('selected_customers.csv')
merged_df = pd.read_csv('merged_df.csv')


Defining Helper Functions:


from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

class Class_Fit(object):
    def __init__(self, clf, params = None):
        if params:
            self.clf = clf(**params)
        else:
            self.clf = clf()

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def grid_search(self, parameters, Kfold):
        self.grid = GridSearchCV(estimator = self.clf, param_grid = parameters, cv = Kfold)

    def grid_fit(self, X, Y):
        self.grid.fit(X, Y)

    def grid_predict(self, X, Y):
        self.predictions = self.grid.predict(X)
        # Note: despite the "Precision" label, this reports accuracy_score.
        print("Precision: {:.2f} %".format(100 * accuracy_score(Y, self.predictions)))

selected_customers.head()


Since we are trying to predict the customer segment/cluster, we will choose the cluster column as the target.


columns = ['mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
X = selected_customers[columns]
Y = selected_customers['cluster']

 


Train Test splitting:


from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.8)
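Optionally (an assumption, not in the original run), the split can be stratified on the cluster label so that the smaller clusters are represented in both halves, and seeded for reproducibility:

# Stratified, reproducible variant of the split above
# (requires every cluster to have at least a couple of members).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size = 0.8, stratify = Y, random_state = 42)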



Training Models:


from sklearn.svm import LinearSVC
svc = Class_Fit(clf=LinearSVC)
svc.grid_search(parameters = [{'C':np.logspace(-2,2,10)}], Kfold = 5)
svc.grid_fit(X=X_train, Y=Y_train)
svc.grid_predict(X_test, Y_test)

Precision: 81.02 %

from sklearn.metrics import confusion_matrix

# Code adapted from the scikit-learn documentation.
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

class_names = [i for i in range(1,11)]
cnf = confusion_matrix(Y_test, svc.predictions)
np.set_printoptions(precision=2)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 0 0 0 0 0 0 0 0 36 1]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 32 0 0 0 0 0 20 0]
[ 0 0 0 62 0 0 0 0 25 1]
[ 0 0 0 0 54 0 0 0 25 1]
[ 0 0 0 0 0 0 0 0 1 0]
[ 0 0 0 0 0 0 0 0 2 0]
[ 0 0 0 1 0 0 0 38 6 0]
[ 0 0 1 0 0 0 0 0 314 0]
[ 0 0 0 0 0 0 0 0 15 85]]
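As an aside (a sketch, not part of the original notebook): recent scikit-learn releases (1.0 and later) ship ConfusionMatrixDisplay, which can replace the hand-rolled plotting function above, assuming svc.predictions has been populated by grid_predict:

from sklearn.metrics import ConfusionMatrixDisplay

# Plot the same confusion matrix directly from the true and predicted labels.
ConfusionMatrixDisplay.from_predictions(Y_test, svc.predictions)
plt.show()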




# Code adapted from the scikit-learn documentation.
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(svc.grid.best_estimator_, "SVC Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
                        train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

 


Logistic Regression:


from sklearn.linear_model import LogisticRegression
lr = Class_Fit(clf = LogisticRegression)
lr.grid_search(parameters = [{'C':np.logspace(-1,2,10)}], Kfold = 5)
lr.grid_fit(X_train, Y_train)
lr.grid_predict(X_test, Y_test)

Precision: 94.74 %

cnf = confusion_matrix(Y_test, lr.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 32 0 0 0 0 0 0 0 5 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 52 0 0 0 0 0 0 0]
[ 0 0 0 84 0 0 0 0 3 1]
[ 0 0 0 0 78 0 0 0 2 0]
[ 0 0 0 0 0 0 1 0 0 0]
[ 2 0 0 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 42 1 0]
[ 5 0 3 2 1 0 0 1 300 3]
[ 0 0 0 0 0 0 0 0 4 96]]


g = plot_learning_curve(lr.grid.best_estimator_, "LogisticRegression Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




K-Nearest Neighbours:


from sklearn.neighbors import KNeighborsClassifier
knn = Class_Fit(clf = KNeighborsClassifier)
knn.grid_search(parameters = [{'n_neighbors':np.arange(1,50,1)}], Kfold = 5)
knn.grid_fit(X_train, Y_train)
knn.grid_predict(X_test, Y_test)

Precision: 83.52 %

cnf = confusion_matrix(Y_test, knn.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 33 0 0 0 0 0 0 0 4 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 3 0 40 1 0 0 0 1 7 0]
[ 0 1 1 68 1 0 0 0 16 1]
[ 4 1 0 2 53 0 0 0 13 7]
[ 0 0 0 0 0 0 1 0 0 0]
[ 2 0 0 0 0 0 0 0 0 0]
[ 2 0 0 2 0 0 0 36 5 0]
[ 4 1 5 12 4 0 0 2 282 5]
[ 1 0 0 0 2 0 0 0 6 91]]




g = plot_learning_curve(knn.grid.best_estimator_, "KNearestNEighbors Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Decision Trees:


from sklearn.tree import DecisionTreeClassifier
tr = Class_Fit(clf = DecisionTreeClassifier)
tr.grid_search(parameters = [{'criterion':['entropy', 'gini'], 'max_features':['sqrt', 'log2']}], Kfold = 5)
tr.grid_fit(X_train, Y_train)
tr.grid_predict(X_test, Y_test)

Precision: 91.14 %

cnf = confusion_matrix(Y_test, tr.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 27 0 1 0 3 0 1 0 4 1]
[ 0 0 0 0 0 0 0 0 1 1]
[ 1 0 48 0 2 0 0 0 1 0]
[ 0 0 0 83 0 0 0 0 5 0]
[ 1 0 1 0 73 0 0 0 5 0]
[ 1 0 0 0 0 0 0 0 0 0]
[ 1 0 1 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 42 0 1]
[ 5 2 3 10 2 0 0 1 288 4]
[ 1 0 0 0 0 0 0 0 2 97]]




g = plot_learning_curve(tr.grid.best_estimator_, "DecisionTree Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Random Forests:


from sklearn.ensemble import RandomForestClassifier
rf = Class_Fit(clf = RandomForestClassifier)
rf.grid_search(parameters = [{'criterion':['entropy', 'gini'],
'max_features':['sqrt', 'log2'], 'n_estimators':[20, 40, 60, 80, 100]}], Kfold = 5)
rf.grid_fit(X_train, Y_train)
rf.grid_predict(X_test, Y_test)

Precision: 93.35 %

cnf = confusion_matrix(Y_test, rf.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 29 0 1 0 3 1 0 0 3 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 50 0 1 0 0 0 1 0]
[ 0 0 1 82 0 0 0 0 5 0]
[ 0 0 0 0 76 0 0 0 4 0]
[ 1 0 0 0 0 0 0 0 0 0]
[ 1 0 1 0 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 43 0 0]
[ 6 0 1 2 2 0 0 2 298 4]
[ 0 0 1 0 0 0 0 0 3 96]]




g = plot_learning_curve(rf.grid.best_estimator_, "Random Forest Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




AdaBoost Classifier:


from sklearn.ensemble import AdaBoostClassifier
ada = Class_Fit(clf = AdaBoostClassifier)
ada.grid_search(parameters = [{'n_estimators':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}], Kfold = 5)
ada.grid_fit(X_train, Y_train)
ada.grid_predict(X_test, Y_test)

Precision: 57.06 %

cnf = confusion_matrix(Y_test, ada.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 0 0 0 2 0 0 1 0 31 3]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 0 1 0 0 0 0 51 0]
[ 0 0 0 6 0 0 0 0 81 1]
[ 0 0 0 1 0 0 0 0 71 8]
[ 0 0 0 0 0 1 0 0 0 0]
[ 0 0 0 0 0 0 1 0 1 0]
[ 0 0 0 45 0 0 0 0 0 0]
[ 0 0 0 2 0 0 0 0 311 2]
[ 0 0 0 0 0 0 0 0 7 93]]




g = plot_learning_curve(ada.grid.best_estimator_, "AdaBoost Learning Curve", X_train, Y_train, ylim=[1.01, 0.4], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Gradient Boosted Decision Trees:


import xgboost
gbdt = Class_Fit(clf = xgboost.XGBClassifier)
gbdt.grid_search(parameters = [{'n_estimators':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}], Kfold = 5)
gbdt.grid_fit(X_train, Y_train)
gbdt.grid_predict(X_test, Y_test)

Precision: 94.04 %

cnf = confusion_matrix(Y_test, gbdt.predictions)
plt.figure(figsize=(8,8))
plot_confusion_matrix(cnf, class_names)

Confusion matrix, without normalization
[[ 32 0 0 0 1 0 0 0 4 0]
[ 0 0 0 0 0 0 0 0 1 1]
[ 0 0 50 1 1 0 0 0 0 0]
[ 0 0 1 82 0 0 0 0 5 0]
[ 1 0 0 0 73 0 0 0 6 0]
[ 0 0 0 0 0 0 1 0 0 0]
[ 0 0 1 0 0 0 1 0 0 0]
[ 0 0 0 2 0 0 0 43 0 0]
[ 4 0 1 2 2 0 0 1 300 5]
[ 0 0 0 0 0 0 0 0 2 98]]


g = plot_learning_curve(gbdt.grid.best_estimator_, "GBDT Learning Curve", X_train, Y_train, ylim=[1.01, 0.6], cv = 5,
train_sizes = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])




Voting Classifier:

A Voting Classifier is an ensemble model that combines the predictions of several fitted models and outputs either the class chosen by a majority vote (hard voting, scikit-learn's default) or, with soft voting, the class with the highest average predicted probability.


rf_best = RandomForestClassifier(**rf.grid.best_params_)
gbdt_best = xgboost.XGBClassifier(**gbdt.grid.best_params_)
svc_best = LinearSVC(**svc.grid.best_params_)
tr_best = DecisionTreeClassifier(**tr.grid.best_params_)
knn_best = KNeighborsClassifier(**knn.grid.best_params_)
lr_best = LogisticRegression(**lr.grid.best_params_)

from sklearn.ensemble import VotingClassifier
votingC = VotingClassifier(estimators=[('rf', rf_best), ('gb', gbdt_best), ('knn', knn_best), ('lr', lr_best)])
votingC = votingC.fit(X_train, Y_train)
predictions = votingC.predict(X_test)
print("Precision : {:.2f}%".format(100 * accuracy_score(Y_test, predictions)))

Precision : 94.88%

This is the highest precision that we have obtained.
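As an aside (a sketch, not part of the original run): the ensemble above uses the default hard (majority) voting. Since the four estimators actually included all expose predict_proba, a soft-voting variant that averages class probabilities could also be tried:

from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of the base models.
votingC_soft = VotingClassifier(
    estimators=[('rf', rf_best), ('gb', gbdt_best), ('knn', knn_best), ('lr', lr_best)],
    voting='soft')
votingC_soft = votingC_soft.fit(X_train, Y_train)
print("Precision : {:.2f}%".format(100 * accuracy_score(Y_test, votingC_soft.predict(X_test))))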


Testing the model:


basket_price = set_test.copy(deep=True)

transanctions_per_user = basket_price.groupby(by=['CustomerID'])['Basket Price'].agg(['count', 'min', 'max', 'mean', 'sum'])

for i in range(5):
    col = 'categ_{}'.format(i)
    transanctions_per_user.loc[:, col] = basket_price.groupby(by=['CustomerID'])[col].sum() / transanctions_per_user['sum'] * 100

transanctions_per_user.reset_index(drop = False, inplace = True)
basket_price.groupby(by=['CustomerID'])['categ_0'].sum()

# The test window covers roughly one fifth of the training window, so 'count'
# and 'sum' are scaled by 5 to make them comparable.
transanctions_per_user['count'] = 5 * transanctions_per_user['count']
transanctions_per_user['sum'] = transanctions_per_user['count'] * transanctions_per_user['mean']

transanctions_per_user.sort_values('CustomerID', ascending = True)[:5]


list_cols = ['count', 'min', 'max', 'mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4']
matrix_test = transanctions_per_user[list_cols].values
scaled_test_matrix = scaler.transform(matrix_test)

# The k-means cluster assignments act as the reference labels for the test customers.
Y = kmeans.predict(scaled_test_matrix)
columns = ['mean', 'categ_0', 'categ_1', 'categ_2', 'categ_3', 'categ_4' ]
X = transanctions_per_user[columns]
predictions = votingC.predict(X)

print("Precision : {:.2f}%".format(100 * accuracy_score(Y, predictions)))

Precision : 89.18%

Accuracy on the test set is good, considering that we trained on data up to 10 months old and predicted on new data.


Conclusion:



  1. We are able to separate customers into different segments, based on the type of products that they buy.

  2. Using a Voting Classifier and a combination of multiple machine learning models, such as Random Forest, Gradient Boosted Decision Trees, K-Nearest Neighbours, and Logistic Regression, we are able to predict what type of product a user will buy, with a precision of 94.88%.

  3. We can use this information to target selected customers with promotional offers for their desired products, which increases the likelihood of more sales in the future.

