Mental illness, also called a mental health disorder, refers to a wide range of conditions that affect your emotions, thoughts, and behaviour. Examples include depression, anxiety disorders, schizophrenia, eating disorders, and addictive behaviours.
Many people have mental health concerns from time to time. A mental health concern becomes a mental illness when persistent signs and symptoms cause frequent stress and impair your ability to function.
Mental illness can make you miserable and cause problems in your daily life, such as at school, at work, or in relationships. Symptoms can often be managed with a combination of medication and talk therapy (psychotherapy).
Mental illness can manifest in a variety of ways. Its symptoms can affect emotions, thoughts, and behaviours.
Here are some examples of warning signs and symptoms:
Physical problems, such as stomach pain, back pain, headaches, or other unexplained aches and pains, can sometimes be signs of a mental health condition.
This project can help tech companies identify and address employees' mental health issues.
The data comes from a 2014 survey that measured attitudes towards mental health in the workplace and the prevalence of mental health conditions.
This dataset contains the following data:
A random forest is a machine learning algorithm for solving classification and regression problems.
It uses ensemble learning, a technique that tackles complex problems by combining many models.
A random forest is made up of many decision trees.
The 'forest' is trained using bagging, also known as bootstrap aggregation.
Bagging is an ensemble technique that trains each model on a random sample of the training data, drawn with replacement, and then combines the models' predictions to improve accuracy.
The random forest algorithm determines its output from the predictions of the individual decision trees.
For regression it averages the outputs of the trees; for classification it takes a majority vote.
Adding more trees generally makes the predictions more stable, although the gains level off beyond a certain number of trees.
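The bagging and voting steps described above can be sketched by hand. This is a minimal illustration on synthetic data (not the survey dataset); the names `n_trees`, `votes`, and `majority` are made up for this example. As far as I know, scikit-learn's own `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so treat this as a simplified picture of the idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" lets each split consider only a random subset of features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: every tree votes, and the majority class wins.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble accuracy on the training data:", (majority == y).mean())
```

Each tree sees a slightly different version of the data, so their individual errors tend to differ, and the vote smooths them out.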
The random forest method overcomes some of the drawbacks of a single decision tree.
It reduces overfitting and typically improves accuracy.
It produces reasonable predictions without requiring much hyperparameter tuning in packages such as scikit-learn.
The following are a few reasons why we should utilize the Random Forest algorithm:
First, let us import the required libraries for the project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
import joblib
import pickle
And now load the data into the system.
df=pd.read_csv("data.csv")
Also, let us have a look at a few important visualizations of our data.
from collections import Counter
country_count = Counter(df['Country'].dropna().tolist()).most_common(10)
country_idx = [country[0] for country in country_count]
country_val = [country[1] for country in country_count]
fig,ax = plt.subplots(figsize=(8,6))
sns.barplot(x = country_idx,y=country_val ,ax =ax)
plt.title('Top ten countries')
plt.xlabel('Country')
plt.ylabel('Count')
ticks = plt.setp(ax.get_xticklabels(),rotation=90)
# Pass the column as x explicitly; recent seaborn versions expect keyword arguments here.
sns.countplot(x=df['treatment'])
plt.title('Treatment Distribution')
Coming to the 'Data Preprocessing' part, let us search for missing values in the data.
df.isnull().sum()
As you can see, missing values exist in our data.
df['work_interfere'] = df['work_interfere'].fillna('Don\'t know' )
print(df['work_interfere'].unique())
df['self_employed'] = df['self_employed'].fillna('No')
print(df['self_employed'].unique())
df.drop(["Timestamp", "comments", "state"], axis = 1, inplace = True)
As you can see, I dropped the 'comments' and 'state' features because they have a lot of missing values, and 'Timestamp' because it is not useful for predicting treatment.
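One way to decide which columns to drop is to look at the share of missing values per column. The sketch below uses a small hypothetical frame (`toy`) in place of the survey data, and the 50% cut-off (`threshold`) is an assumption for illustration, not a rule from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
toy = pd.DataFrame({
    "Age": [25, 31, 40, 28],
    "state": ["CA", np.nan, np.nan, np.nan],
    "comments": [np.nan, np.nan, "Some remark", np.nan],
    "self_employed": ["No", np.nan, "Yes", "No"],
})

# Fraction of missing values in each column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

threshold = 0.5  # assumed cut-off: drop columns that are more than half empty
to_drop = missing_share[missing_share > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)
print(toy_clean.columns.tolist())
```

Columns that stay but still have gaps, such as `self_employed` here, can then be filled with `fillna` as shown earlier.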
Now let us encode the categorical values to feed the data into the model.
from sklearn import preprocessing
categorical = ['Gender', 'Country', 'self_employed', 'family_history',
               'treatment', 'work_interfere', 'no_employees', 'remote_work',
               'tech_company', 'benefits', 'care_options', 'wellness_program',
               'seek_help', 'anonymity', 'leave', 'mental_health_consequence',
               'phys_health_consequence', 'coworkers', 'supervisor',
               'mental_health_interview', 'phys_health_interview',
               'mental_vs_physical', 'obs_consequence']
for feature in categorical:
    le = preprocessing.LabelEncoder()
    df[feature] = le.fit_transform(df[feature])
As you can see, I used scikit-learn's LabelEncoder class to encode each categorical column as integers.
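To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a single column. The `answers` list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up answers, similar in spirit to a survey column.
answers = ["Yes", "No", "Don't know", "No", "Yes"]

le = LabelEncoder()
codes = le.fit_transform(answers)

print(list(le.classes_))                  # categories, sorted alphabetically
print(list(codes))                        # integer code for each answer
print(list(le.inverse_transform(codes)))  # decode back to the original strings
```

The integer codes follow alphabetical order, so they imply an ordering that may not mean anything. Tree-based models usually cope with this, but for other models one-hot encoding may be a safer choice. Note also that the loop above overwrites `le` on each pass, so you would need to keep one encoder per column if you want to decode values later.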
Let us split the data using the "train_test_split" function into training and testing sets.
Y = df['treatment']
X = df.drop('treatment', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
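The split above gives different results on every run. A hedged variant, shown here on made-up arrays (`X_demo`, `y_demo`), fixes `random_state` for reproducibility and uses `stratify` to keep the class balance the same in both sets. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, two features, perfectly balanced labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    stratify=y_demo,    # preserve the class proportions in both sets
    random_state=42,    # arbitrary seed so the split is repeatable
)
print(X_tr.shape, X_te.shape)
print("Share of class 1 in the test set:", y_te.mean())
```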
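Finally, we scale the features before feeding them into the model. Tree-based models such as random forests do not strictly require scaling, but it does no harm and keeps the preprocessing reusable for other models.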
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
As you can see, I used the StandardScaler class, fitting it on the training set only and reusing the same statistics to transform the test set.
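An alternative to scaling by hand is to bundle the scaler and the model in a scikit-learn `Pipeline`, which fits the scaler on the training data only and applies it automatically at prediction time. The sketch below runs on synthetic data, and the step names `"scale"` and `"forest"` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # fitted on the training data only
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics, then predicts.
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A fitted pipeline can also be saved as a single object, for example with `joblib.dump`, so the scaler and model stay together.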
Now, let us dive deep into the modelling part of the project.
from sklearn.ensemble import RandomForestClassifier
rf_model= RandomForestClassifier()
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
rf_model.score(X_test, y_test)  # accuracy on the held-out test set; scoring on the training set would overstate performance
I used scikit-learn's RandomForestClassifier class, which implements the random forest algorithm, with its default hyperparameters.
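Once a forest is trained, its `feature_importances_` attribute gives a rough idea of which inputs drive the predictions. The sketch below uses synthetic data with placeholder column names (`feature_0` to `feature_4`); on the survey data you would pass the real training frame instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names, for illustration only.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X_df = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_df, y_arr)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=X_df.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour features with many distinct values, so treat the ranking as a hint, not a firm conclusion.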
Now let us have a look at the model's performance report.
from sklearn.metrics import classification_report
class_names = ['Mental illness Treatment is not required', 'Mental illness Treatment is required']
print(classification_report(y_test, y_pred, target_names=class_names))
The report shows precision, recall, and F1-score for each class. In my run, the model performed well.
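The numbers in a classification report all come from the confusion matrix. Here is a small worked sketch on made-up labels (`y_true_demo`, `y_pred_demo`), not on the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 0 = treatment not required, 1 = treatment required.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()  # order for binary labels: TN, FP, FN, TP

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints precision=0.75 recall=0.75 accuracy=0.75
```

For this problem, recall on the "treatment required" class is worth watching, since a false negative means missing someone who may need help.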
Thank you for your time.