
Hashwanth Gogineni's other Models Reports


Baseball Sports Analysis


Model Overview


Baseball is a bat-and-ball sport in which two opposing teams, each consisting of nine players, take turns batting and fielding. Play begins when a pitcher from the fielding team throws a ball that a batter from the batting team attempts to hit with a bat. The batting (offensive) team's goal is to hit the ball into the field of play, allowing its players to advance counter-clockwise around four bases and score "runs." The fielding (defensive) team's goal is to keep batters from becoming runners and runners from advancing around the bases. A run is scored when a runner legally advances around the bases in order and touches home plate (the place where the player started as a batter). The winning team is the one that has scored the most runs by the end of the game.

The batting team's first goal is to have a player safely reach first base.
If a batter reaches first base without being ruled "out," he or she can attempt to advance to the next base as a runner, either immediately or during teammates' turns at bat. The fielding team seeks to prevent runs by getting batters or runners "out," which removes them from the field of play. Both the pitcher and the fielders have ways of getting the batting team's players out. The opposing teams alternate between batting and fielding, with the batting team's turn ending once the fielding team records three outs.
An inning consists of one turn at bat for each team. A game normally consists of nine innings, and the team with the most runs at the end of the game wins. If the score is tied after nine innings, extra innings are played. Baseball has no game clock, although most games end in the ninth inning.

Why Baseball Sports Analytics Project?

The Project can be used to analyse baseball players' performance data.



The discipline of sports analytics is growing rapidly. To study the performance of players and teams, owners, coaches, and fans use a variety of statistical metrics and models. The examination of annual batting-average statistics for individual baseball players provides a basic example. The sample used here comes from the Lahman Baseball Database and contains 4535 rows of data for a selection of players from 1960 to 2004.


Gradient Boosting is a machine learning approach that is commonly used for classification and regression problems. It is simple to use and works well with both heterogeneous and small data sets. It effectively turns a group of many weak learners into a strong learner. CatBoost, short for Categorical Boosting, is an open-source boosting library developed by Yandex. In addition to regression and classification, CatBoost can be used in ranking, recommendation systems, forecasting, and even personal assistants.
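To make the "many weak learners into a strong learner" idea concrete, here is a minimal sketch of gradient boosting for squared error, built from one-split decision trees in scikit-learn. It is an illustration only and not CatBoost's actual algorithm; the function names, the synthetic data, and the settings (50 rounds, learning rate 0.1) are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_stumps(X, y, n_rounds=50, learning_rate=0.1):
    """Fit one-split trees in sequence, each on the current residuals."""
    base = float(np.mean(y))            # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred            # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_boosted(base, trees, X, learning_rate=0.1):
    """Sum the constant base and the scaled output of every weak learner."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic data (made up for this sketch): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees = fit_boosted_stumps(X_toy, y_toy)
mse_mean_only = float(np.mean((y_toy - base) ** 2))
mse_boosted = float(np.mean((y_toy - predict_boosted(base, trees, X_toy)) ** 2))
print("MSE of constant prediction:", mse_mean_only)
print("MSE after boosting:", mse_boosted)
```

Each one-split tree is a poor model on its own, but because every new tree is fitted to what the previous ones got wrong, the combined training error drops below that of the constant prediction.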

Advantages of CatBoost:

  • On several datasets, superior quality when compared to other GBDT libraries.

  • Prediction speed that its developers describe as best in class.

  • Both numerical and categorical features are supported.

  • Out-of-the-box GPU and multi-GPU support for training.

  • Tools for visualisation are offered.

  • With Apache Spark and the CLI, distributed training can be done quickly and consistently.

Understanding Code

First, let us import the necessary libraries for the project.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import pickle
from sklearn.model_selection import train_test_split
import joblib
from sklearn import metrics
from sklearn.metrics import r2_score

Now, let us load the required data into the system.

df = pd.read_excel('data.xlsx', engine='openpyxl')

Before we start preprocessing our data, let us explore it using a few visualizations.

plt.figure(figsize=(15, 10))
sns.barplot(x="League code", y="Runs", data=df)

runs_by_team = df.groupby('TEAM')['Runs'].sum()
runs_by_team.plot(kind='bar')
plt.ylabel('Total Runs')

Coming to the 'Data Preprocessing' part, let us search for missing values in the data.
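The check itself is not shown here. A common way to do it with pandas is `df.isnull().sum()`, which counts the missing entries in each column. The sketch below runs that call on a tiny made-up frame (the team codes and run totals are hypothetical, not taken from the dataset) so that the output is easy to read; on the project data the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
demo = pd.DataFrame({
    "TEAM": ["BOS", "NYA", None],
    "Runs": [55.0, np.nan, 72.0],
})

# Count missing entries per column; a column with no gaps reports 0
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print("Total missing:", total_missing)
```

A result of zero for every column is what "no missing values" means in practice.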


As you can see, no missing values exist in our data.

Now let us encode the categorical values to feed the data into the model.

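# Create one LabelEncoder per categorical column
NAMElast_encoder = LabelEncoder()
NAMEfirst_encoder = LabelEncoder()
TEAM_encoder = LabelEncoder()
League_encoder = LabelEncoder()

# Encode each column and save its fitted encoder for reuse at prediction time
df['NAMElast'] = NAMElast_encoder.fit_transform(df['NAMElast'])
pickle.dump(NAMElast_encoder, open('NAMElast_encoder.pkl','wb'))

df['NAMEfirst'] = NAMEfirst_encoder.fit_transform(df['NAMEfirst'])
pickle.dump(NAMEfirst_encoder, open('NAMEfirst_encoder.pkl','wb'))

df['TEAM'] = TEAM_encoder.fit_transform(df['TEAM'])
pickle.dump(TEAM_encoder, open('TEAM_encoder.pkl','wb'))

df['League'] = League_encoder.fit_transform(df['League'])
pickle.dump(League_encoder, open('League_encoder.pkl','wb'))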

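As you can see, I used scikit-learn's 'LabelEncoder' class to turn each categorical column into integer codes, and saved every fitted encoder so that the same mapping can be reused later.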
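To see what the encoder does and why it is worth saving, here is a small sketch on made-up team codes (hypothetical values, not read from the dataset). It also round-trips the encoder through pickle in memory, which mirrors loading a saved `.pkl` file at prediction time.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Hypothetical team codes used only for illustration
teams = ["NYA", "BOS", "NYA", "CHN"]

encoder = LabelEncoder()
codes = encoder.fit_transform(teams)
# Classes are sorted alphabetically, so BOS -> 0, CHN -> 1, NYA -> 2
print(list(encoder.classes_), list(codes))

# Serialize and restore the fitted encoder, as one would with a .pkl file
restored = pickle.loads(pickle.dumps(encoder))

# The restored encoder applies the same mapping to new rows
new_codes = restored.transform(["BOS", "CHN"])

# inverse_transform recovers the original labels from the codes
labels = restored.inverse_transform(codes)
print(list(new_codes), list(labels))
```

Note that a fitted encoder raises an error for a label it did not see during fitting, so new data must use values that were present in the training data.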

Let us split the data using the "train_test_split" function into training and testing sets.

X = df.drop(columns=['Runs', 'PLAYERID', 'YRINDEX'])
Y = df['Runs']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
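One optional refinement: `train_test_split` shuffles the rows randomly, so the split (and the scores that follow) change from run to run. Passing `random_state` makes the split repeatable. The sketch below uses a small made-up array, and the seed value 42 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (values are arbitrary)
X_small = np.arange(20).reshape(10, 2)
y_small = np.arange(10)

# With test_size=0.2, 8 rows go to training and 2 to testing
split_a = train_test_split(X_small, y_small, test_size=0.2, random_state=42)
split_b = train_test_split(X_small, y_small, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same seed gives the same test rows on every run
same_split = np.array_equal(split_a[1], split_b[1])
print("Repeatable:", same_split)
```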

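Finally, let us scale the features before feeding them into a model. Tree-based models such as CatBoost do not strictly require scaling, but it keeps the saved preprocessing pipeline consistent.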

scaler = MinMaxScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

pickle.dump(scaler, open('scaler.pkl','wb'))

As you can see, I used the "MinMaxScaler" class to scale the data, fitting it on the training set only and reusing the fitted scaler on the test set.
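As a small illustration of what the scaler computes, the sketch below applies MinMaxScaler to made-up numbers and compares the result with the formula (x - min) / (max - min), where min and max come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same result by hand: (x - min) / (max - min)
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())  # training values land in [0, 1]
print(test_scaled.ravel())   # test values can fall outside [0, 1]
```

Because the test set is scaled with the training minimum and maximum, a test value larger than anything seen in training (40 here) maps above 1. That is expected and avoids leaking test information into the preprocessing step.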

Now, let us dive deep into the modelling part of the project.

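from catboost import CatBoostRegressor

# Train a CatBoost regressor with default settings
cb_model = CatBoostRegressor()
cb_model.fit(X_train, y_train)

# Predict on the held-out test set, then report R² on the training set (in %)
y_pred = cb_model.predict(X_test)
cb_model.score(X_train, y_train)*100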

As you can see, I used the "CatBoostRegressor" class, which applies the CatBoost algorithm to regression problems.

Now let us have a look at the model's performance report.

print('r2 score', r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

As you can see, the model performed well on the test data.
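For readers who want to see what these four numbers measure, here is a short sketch that computes them by hand with NumPy and checks them against scikit-learn. The true and predicted values are made up for the example and are not results from this project.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # same units as the target

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "r2:", r2)

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, metrics.r2_score(y_true, y_hat)))
```

An R² close to 1 together with low MAE and RMSE on the test set indicates a good fit; comparing the test R² with the training score computed earlier is a quick way to spot overfitting.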

Thank you for your time.