Sales Data Analysis With Python

Tarun Reddy


Model Overview

Sales analysis is mining your data to evaluate the performance of your sales team against its goals. It provides insights about the top-performing and underperforming products/services, the problems in selling and market opportunities, sales forecasting, and sales activities that generate revenue. In this use case the dataset has 8523 rows of 12 variables, which are listed under Dataset Details below. Item_Outlet_Sales, the sales of a product in a particular store, is the outcome variable to be predicted.

OBJECTIVE

This use case is about Big Mart sales prediction using machine learning with Python. In this project, an XGBoost regressor (XGBRegressor) is used for prediction.
The dataset used in this use case has 8523 rows of 12 variables.

Dataset Details:

Item_Identifier - Unique product ID

Item_Weight - Weight of the product

Item_Fat_Content - Whether the product is low fat or not

Item_Visibility - The % of the total display area of all products in a store allocated to the particular product

Item_Type - The category to which the product belongs

Item_MRP - Maximum Retail Price (list price) of the product

Outlet_Identifier - Unique store ID

Outlet_Establishment_Year - The year in which the store was established

Outlet_Size - The size of the store in terms of ground area covered

Outlet_Location_Type - The type of city in which the store is located

Outlet_Type - Whether the outlet is just a grocery store or some sort of supermarket

Item_Outlet_Sales - Sales of the product in the particular store. This is the outcome variable to be predicted.

IMPORTING REQUIRED LIBRARIES

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor

from sklearn import metrics

import joblib

Now, let's import the dataset.

# Importing dataset

salesdata = pd.read_excel('superstore_sales.xlsx')

DATA AUDIT

You can’t make your data work for you until you know what data you’re talking about.

To get a quick idea of what the data looks like, we can call the head method on the data frame. By default it returns the first five rows, but it accepts a parameter for the number of rows to return.

Let us check the first five rows of the dataset.

# first 5 rows of the dataframe

print(salesdata.head())

Now let us check the last five rows of the data using the tail method.

# last 5 rows of data

salesdata.tail()

Now let us check the size of the dataset, that is, how many rows and columns it has, using the shape attribute.

# number of data points & number of features

salesdata.shape
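As a quick illustration of these audit calls, here is a minimal sketch on a made-up data frame. The frame toy and its values are hypothetical and only stand in for salesdata; the point is that head and tail are methods that accept a row count, while shape is an attribute.

```python
import pandas as pd

# Hypothetical toy frame with 10 rows; the values are made up for illustration.
toy = pd.DataFrame({"Item_MRP": [float(i) for i in range(10)]})

# head(n) and tail(n) are methods and take the number of rows to return.
first_three = toy.head(3)
last_two = toy.tail(2)

# shape is an attribute (no parentheses) holding (rows, columns).
rows, cols = toy.shape

print(len(first_three), len(last_two), rows, cols)
```

Calling toy.tail without parentheses would only display the bound method, not the rows, which is an easy slip to make in a notebook.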

In the next step, we check the columns present in the dataset, along with their data types and non-null counts.

# getting some information about the dataset

salesdata.info()

Before analyzing the data further, we should check whether the data set has any missing values. Calling isnull() on the data frame and then sum() gives the number of missing values in each column.

In this data set, the Item_Weight and Outlet_Size columns contain missing values, so we fill them before encoding the data and training a model. Cleaning (or wrangling) the data like this is where a data scientist spends much of their time.

# checking for missing values

salesdata.isnull().sum()

# mean value of "Item_Weight" column

salesdata['Item_Weight'].mean()

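# filling the missing values in the "Item_Weight" column with the mean value
# (assigning the result back avoids the chained inplace=True pattern, which recent pandas versions warn about)

salesdata['Item_Weight'] = salesdata['Item_Weight'].fillna(salesdata['Item_Weight'].mean())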
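To make the effect of mean imputation concrete, the following sketch runs the same steps on a small made-up frame. The frame toy and its numbers are hypothetical; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.2],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

# Count missing values per column before imputation.
missing_before = toy.isnull().sum()

# mean() skips NaN by default, so it averages the observed weights only.
mean_weight = toy["Item_Weight"].mean()

# Fill the gaps and assign the filled column back to the frame.
toy["Item_Weight"] = toy["Item_Weight"].fillna(mean_weight)

missing_after = toy.isnull().sum()
print(missing_before["Item_Weight"], "->", missing_after["Item_Weight"])
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind when a large share of the values is missing.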

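Let us now replace the missing values in "Outlet_Size" with the mode. We compute the mode of Outlet_Size separately for each Outlet_Type, so that each missing value can be filled with the most common size among outlets of the same type.

# overall mode of "Outlet_Size" column

salesdata['Outlet_Size'].mode()

# mode of "Outlet_Size" for each "Outlet_Type"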

outlet_size_mode = salesdata.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=(lambda x: x.mode()[0]))

print(outlet_size_mode)

Let us now create a boolean mask, missing_values, that marks the rows where Outlet_Size is missing.

missing_values = salesdata['Outlet_Size'].isnull()

print(missing_values)

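Let us now fill the missing values in the Outlet_Size column with the mode for the matching Outlet_Type.

# outlet_size_mode has a single row labelled 'Outlet_Size' and one column per Outlet_Type

salesdata.loc[missing_values, 'Outlet_Size'] = salesdata.loc[missing_values,'Outlet_Type'].apply(lambda x: outlet_size_mode.loc['Outlet_Size', x])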

salesdata.head()
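The group-wise mode imputation can also be sketched with groupby and map instead of a pivot table. This is an alternative formulation, not the code used in this use case, and the frame toy, the name mode_by_type and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", np.nan, "Medium"],
})

# Most frequent Outlet_Size within each Outlet_Type (mode() ignores NaN).
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode()[0])

# Fill only the missing rows, looking up the mode for that row's Outlet_Type.
mask = toy["Outlet_Size"].isnull()
toy.loc[mask, "Outlet_Size"] = toy.loc[mask, "Outlet_Type"].map(mode_by_type)

print(toy)
```

Both formulations assume every Outlet_Type has at least one known Outlet_Size; a type with only missing sizes would have no mode to look up.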

Next, the Item_Fat_Content column uses several spellings for the same category. We need to change 'low fat' and 'LF' to 'Low Fat', and 'reg' to 'Regular'.

salesdata['Item_Fat_Content'].value_counts()

salesdata.replace({'Item_Fat_Content': {'low fat':'Low Fat','LF':'Low Fat', 'reg':'Regular'}}, inplace=True)

Let's check whether the labels have been merged.

salesdata['Item_Fat_Content'].value_counts()
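The merge of inconsistent labels can be checked on a small example. The series below is hypothetical and only reuses the spellings mentioned above; the name canonical for the mapping is also an assumption.

```python
import pandas as pd

# Hypothetical labels with the inconsistent spellings described above.
labels = pd.Series(
    ["Low Fat", "low fat", "LF", "Regular", "reg", "Low Fat"],
    name="Item_Fat_Content",
)

# Map every variant spelling to one canonical label.
canonical = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
cleaned = labels.replace(canonical)

# After cleaning, only the two canonical categories remain.
counts = cleaned.value_counts()
print(counts)
```

Comparing nunique() before and after the replacement is a compact way to confirm that five spellings collapsed into two categories.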

Now let us do encoding by changing all categorical values to numerical values

print("before",salesdata.head())

# label encoding: convert each categorical column to integer codes
# and keep the class labels of every column in the dictionary d

d = {}

encoder = LabelEncoder()

categorical_columns = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

for column in categorical_columns:
    salesdata[column] = encoder.fit_transform(salesdata[column])
    d[column] = encoder.classes_

Now let us save the class labels of each encoded column in a file named enc.sav, so that the same encoding can be reproduced later.

joblib.dump(d, "enc.sav")

print("after", salesdata.head())
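A saved dictionary of class labels is only useful if it can be turned back into an encoding. The sketch below shows one way this might be done, assuming the file holds a dict of column name to classes_ array as above. The labels, the temporary path and the name label_to_code are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit on hypothetical labels and store the classes per column.
encoder = LabelEncoder()
codes = encoder.fit_transform(["Small", "Medium", "High", "Medium"])
saved = {"Outlet_Size": encoder.classes_}

# Write to a temporary directory so the example leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "enc.sav")
joblib.dump(saved, path)

# Later, for example at prediction time: reload and rebuild label -> code.
loaded = joblib.load(path)
classes = loaded["Outlet_Size"]
label_to_code = {label: code for code, label in enumerate(classes)}

# Encode new values with the stored mapping, and decode the original codes.
new_codes = [label_to_code[label] for label in ["High", "Small"]]
decoded = classes[codes]

print(new_codes, list(decoded))
```

LabelEncoder sorts the class labels, so the integer code of a label is its position in the stored array. A label that was not seen during fitting raises a KeyError here, which makes unseen categories visible instead of silently mis-encoding them.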

Now let us split the data into features and target using two variables, X and Y. X contains all the feature columns and Y contains the target column.

# Splitting features and target

X = salesdata.drop(columns='Item_Outlet_Sales')

Y = salesdata['Item_Outlet_Sales']

Let us check the data

print(X)

Now check the other variable Y

print(Y)

Now let us split the data into training and testing data

#Splitting the data into Training data & Testing Data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

print(X.shape, X_train.shape, X_test.shape)
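To see what sizes an 80/20 split produces for a dataset of this shape, the sketch below uses placeholder arrays with 8523 rows and 11 feature columns (12 variables minus the target). The arrays X_demo and y_demo are hypothetical stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's dimensions: 8523 rows, 11 features.
X_demo = np.zeros((8523, 11))
y_demo = np.arange(8523)

# Same settings as above: 20% held out for testing, fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

print(X_demo.shape, X_tr.shape, X_te.shape)
# prints (8523, 11) (6818, 11) (1705, 11)
```

The test set size is rounded up, so 20% of 8523 rows gives 1705 test rows and leaves 6818 for training. Fixing random_state makes the split reproducible between runs.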

Now let us train our machine learning model and evaluate it.

#model training

regressor = XGBRegressor()

regressor.fit(X_train, Y_train)

The model is now trained. In the next step, let us predict on the training data and compute the R squared value.

# prediction on training data

training_data_prediction = regressor.predict(X_train)

# R squared Value

r2_train = metrics.r2_score(Y_train, training_data_prediction)

print('R Squared value = ', r2_train)

Let us now predict on the testing data.

# prediction on test data

test_data_prediction = regressor.predict(X_test)

print(test_data_prediction)
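Printing the predictions alone does not say how good they are, so a natural follow-up is to score the test predictions the same way as the training predictions. The helper below is a minimal sketch of that step; the function name regression_report and the sample numbers are hypothetical, and RMSE is an additional metric not used in the original code.

```python
import numpy as np
from sklearn import metrics


def regression_report(y_true, y_pred):
    """Return R squared and RMSE for a set of predictions."""
    r2 = metrics.r2_score(y_true, y_pred)
    # RMSE is in the same unit as the target (here: sales).
    rmse = float(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
    return {"r2": r2, "rmse": rmse}


# Hypothetical true and predicted sales values, made up for illustration.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

report = regression_report(y_true, y_pred)
print(report)
```

In this use case the call would be regression_report(Y_test, test_data_prediction). A test R squared that is much lower than the training value suggests that the model is overfitting.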
