Product quality matters because it affects a company's performance and shapes its reputation in the marketplace. Businesses that consistently deliver high-quality products that meet customer expectations can reduce production costs, improve return on investment, and increase revenue.
Customers value product quality because they rely on a company's attention to detail and its responsiveness to demand. Companies manufacture items to meet market demand, and customers expect those products to perform as advertised. Customers also want products that help them build trust in a brand. High-quality products, in turn, let customers solve their problems safely and effectively.
The roasting machine consists of five chambers of equal size, and each chamber has three temperature sensors. In addition, data have been collected on the height of the raw material layer and its moisture content. Layer height and humidity are measured when the raw material enters the machine. Raw material takes one hour to pass through the kiln.
The project aims to predict product quality from the measurements taken as the raw material enters and passes through the roasting machine. This use case can help factories identify good-quality raw material and discard low-quality batches so that the final product meets the quality standard.
The dataset includes 'Layer Height', 'Humidity', and the data acquired by the sensors in the roasting machine.
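To make the layout described above concrete, here is a minimal sketch of what such a table could look like. The column names (`T_data_1_1`, `H_data`, `AH_data`) and all values are assumptions made for illustration only; the real file may label its columns differently.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real file may label its columns differently.
N_CHAMBERS = 5
SENSORS_PER_CHAMBER = 3
sensor_cols = [
    f"T_data_{chamber}_{sensor}"
    for chamber in range(1, N_CHAMBERS + 1)
    for sensor in range(1, SENSORS_PER_CHAMBER + 1)
]
# Assumed names for layer height and humidity measured at the machine inlet
feature_cols = sensor_cols + ["H_data", "AH_data"]

# Synthetic placeholder values, only to illustrate the shape of the table
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.normal(size=(4, len(feature_cols))), columns=feature_cols
)
demo_df.insert(
    0,
    "date_time",
    pd.date_range("2020-01-01", periods=4, freq=pd.Timedelta(hours=1)),
)

# 15 sensor columns + 2 inlet measurements + 1 timestamp column
print(demo_df.shape)
```

With five chambers and three sensors each, the model sees 15 temperature readings plus the two inlet measurements per row.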
The XGBoost algorithm was created as part of a University of Washington research effort.
In 2016, Tianqi Chen and Carlos Guestrin presented their paper, "XGBoost: A Scalable Tree Boosting System", at the SIGKDD conference, where it attracted wide attention in the machine learning community.
Since then, the algorithm has been credited with winning many Kaggle competitions and with powering several industrial applications.
As a result, the XGBoost open-source project has a strong community of data scientists contributing to it, with about 350 contributors and 3,600 contributions on GitHub.
The following are some of the ways in which the algorithm distinguishes itself:
Understanding Code
First, let us import the necessary libraries for the project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
import joblib
import pickle
Now, let us load the required data into the system.
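df = pd.read_csv('data_X.csv', sep=',')
# The rest of the code refers to the data as train_df and expects it to
# contain the 'quality' and 'date_time' columns
train_df = df.copy()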
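Before we start preprocessing our data, let us explore it using a few visualizations.
plt.figure(figsize=(12, 5))
# distplot is deprecated in recent seaborn releases; histplot on a density scale replaces it
sns.histplot(train_df['quality'], stat='density', kde=True)
# Fit a normal distribution to the target and get its parameters
(mu, sigma) = norm.fit(train_df['quality'])
# Overlay the fitted normal curve on the histogram
x = np.linspace(train_df['quality'].min(), train_df['quality'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color='black',
         label=r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma))
plt.legend(loc='best')
plt.ylabel('Density')
plt.title('Quality distribution')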
sns.pairplot(data=train_df);
Coming to the 'Data Preprocessing' part, let us search for missing values in the data.
df.isnull().sum()
As you can see, no missing values exist in our data.
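The data here has no gaps, but sensor logs often do. The sketch below shows one possible way to count and fill missing readings on a small made-up frame. The column names and the choice of linear interpolation are assumptions for illustration, not part of the original project.

```python
import numpy as np
import pandas as pd

# Made-up readings with two gaps; names are hypothetical
demo = pd.DataFrame(
    {
        "sensor_a": [1.0, np.nan, 3.0, 4.0],
        "sensor_b": [10.0, 11.0, np.nan, 13.0],
    }
)

# Same check as in the article: number of missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# For time-ordered sensor data, linear interpolation between neighbouring
# readings is one simple option for filling gaps
filled = demo.interpolate(method="linear", limit_direction="both")
print(filled)
```

Interpolation suits slowly changing signals such as temperature; for other kinds of columns, dropping the affected rows may be the safer choice.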
# Separate the target from the features
Y = train_df['quality']
train_df.drop(['quality', 'date_time'], axis=1, inplace=True)
# Hold out 30% of the rows for testing; random_state (an arbitrary value) makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(train_df, Y, test_size=0.3, random_state=42)
As you can see, I used the "train_test_split" function to split our dataframe into training and testing sets. I also removed 'quality' from the features because it is the target, and dropped the 'date_time' feature because the model does not use it.
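The following sketch shows how the split behaves on synthetic data. The frame, its column names, and the `random_state` value are assumptions for illustration. It checks that features and target stay aligned and that fixing `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(rng.normal(size=100), name="quality")

# Two splits with the same random_state select the same rows
split_a = train_test_split(features, target, test_size=0.3, random_state=42)
split_b = train_test_split(features, target, test_size=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = split_a

print(len(X_tr), len(X_te))                  # sizes of the two parts
print(X_tr.index.equals(y_tr.index))         # features and target stay aligned
print(X_te.index.equals(split_b[1].index))   # the split is repeatable
```

Without a fixed `random_state`, every run produces a different split, so the reported metrics vary from run to run.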
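Finally, we scale our data before feeding it into a model. Tree-based models such as XGBoost do not strictly require feature scaling, but it keeps the features on a common range if other models are tried later.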
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
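# Fit the scaler on the training data only, then apply the same transformation to the test data;
# keep the original column names and row index so the features stay aligned with the targets
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)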
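I used the "MinMaxScaler" class to scale the data. It is fitted on the training set only, and the fitted minimum and maximum are then reused to transform the test set.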
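As a small worked example of what the scaler does, the sketch below uses made-up numbers and compares MinMaxScaler with the formula (x - min) / (max - min), where min and max come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for two features
X_train_demo = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
X_test_demo = np.array([[25.0, 500.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns min and max per column
test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training min and max

# The same computation by hand, using the training statistics
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)
manual_test = (X_test_demo - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
print(np.allclose(test_scaled, manual_test))
```

The second test value (500) lies above the training maximum (400), so it scales to a value above 1. This is expected when the scaler is fitted on the training data alone.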
Now, let us dive into the modelling part of the project.
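from xgboost import XGBRegressor
# Train an XGBoost regressor with its default hyperparameters
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)
# Predict quality for the held-out test set
y_pred = xgb_model.predict(X_test)
# R² on the training data, expressed as a percentage
xgb_model.score(X_train, y_train) * 100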
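As you can see, I used the "XGBoost" algorithm, through the "XGBRegressor" class from the xgboost package, to predict quality from the sensor data.
Note that the score above is the R² on the training set, so it shows how well the model fits the data it has already seen rather than how well it generalizes.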
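To give some intuition for what happens inside a boosted model, here is a simplified sketch of plain gradient boosting for squared error, built from scikit-learn decision trees on synthetic data. It is not XGBoost's actual implementation, which adds regularisation, second-order gradient information, and many engineering optimisations. The function names and hyperparameter values are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees one after another, each on the current residuals."""
    base = float(np.mean(y))               # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction          # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return {"base": base, "trees": trees, "learning_rate": learning_rate}


def predict_boosted_trees(model, X):
    """Sum the constant start value and the scaled output of every tree."""
    prediction = np.full(X.shape[0], model["base"])
    for tree in model["trees"]:
        prediction = prediction + model["learning_rate"] * tree.predict(X)
    return prediction


# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

model = fit_boosted_trees(X_demo, y_demo)
boosted_mse = float(np.mean((y_demo - predict_boosted_trees(model, X_demo)) ** 2))
baseline_mse = float(np.mean((y_demo - y_demo.mean()) ** 2))
print(baseline_mse, boosted_mse)
```

Each round corrects part of the remaining error, which is why the boosted training error ends up below that of the constant baseline.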
Finally, let us check the model's performance using a few metrics.
print('r2 score', r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
An R² score close to 1 together with low MAE, MSE, and RMSE values on the test set indicates that the model's predictions closely match the measured quality.
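For reference, the four metrics can be computed by hand. The sketch below uses made-up values (not results from this project) to show the arithmetic behind each one.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                      # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))                # mean absolute error
mse = np.mean(errors ** 2)                   # mean squared error
rmse = np.sqrt(mse)                          # same unit as the target

ss_res = np.sum(errors ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                     # share of variance explained

print('r2 score', r2)
print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

MAE and RMSE are expressed in the units of the target, which makes them easier to interpret than MSE. R² compares the model against always predicting the mean.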
Thank you for your time.