
Hashwanth Gogineni's other Models Reports

Major Concepts

 


Census Income Prediction

Models Status

Model Overview

Adult Census Income:


A census is an enumeration of the people, houses, offices, and other important things in a country or region at a specific time. The term most commonly refers to a population census.


Censuses, being expensive, are taken only at infrequent intervals: every ten years in many countries, and every five years or at irregular intervals in others.


The Adult Census Income data were extracted from the 1994 Census Bureau database by Barry Becker and Ronny Kohavi.


The task is to determine whether a person makes over $50K a year or not.




Problem Statement


The introduction above aims to raise awareness of how income affects not only people's personal lives but also the nation and its development. Here we look at data extracted from the 1994 Census Bureau database and try to find insights into how different features affect an individual's income. Although the data is quite old and the insights cannot be applied directly to the modern world, it still helps us analyse the role different features play in predicting an individual's income.

Why Income Prediction?


This project paves the way for predicting income groups within a country. Governments and organizations can use such a model to assess the economic conditions of a country or region, and organizations can identify likely customers for their products or services by income class.


 

The Dataset


The dataset contains 32560 rows and 14 independent features. We aim to predict whether a person earns more than $50K per year. Since the target takes one of two values (>50K or <=50K), this is a binary classification problem, and we will train classification models to predict it.


The features we will feed to the classification model are listed below.
1. Age: The age of an individual, ranging from 17 to 90.
2. Workclass: The class of work to which an individual belongs.
3. Fnlwgt: The final weight assigned to the combination of features (an estimate of how many people in the population share this combination).
4. Education: Highest level of education.
5. Education_num: Number of years of education.
6. Marital_Status: The category assigned on the basis of a person's marital status.
7. Occupation: Profession of a person.
8. Relationship: The person's relationship within their family.
9. Race: Racial background of a person.
10. Sex: Gender of a person.
11. Capital_gain: Capital gained by a person.
12. Capital_loss: Capital lost by a person.
13. Hours_per_week: Number of hours an individual works per week.
14. Native_Country: Country of origin of a person.

Output:
1. Income: The target variable, indicating whether the income is above $50K or not.


I have attempted to fit the best machine learning model using Python, supported by several visualizations.




Decision tree:


The decision tree is one of the most popular machine learning algorithms.


A decision tree uses a splitting criterion to decide how to split a node into two or more sub-nodes. Each split is chosen so that the purity of the resulting sub-nodes increases with respect to the target variable: the tree evaluates candidate splits on all available variables and selects the one that yields the most homogeneous sub-nodes.
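To make the idea of "purity" concrete, here is a minimal sketch of how one split can be scored with Gini impurity, which is one common criterion (scikit-learn's DecisionTreeClassifier uses it by default). The toy values and the helper names `gini` and `weighted_gini` are made up for illustration and are not part of the original project.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data (made up): hours worked per week and the income class.
hours = [20, 25, 30, 40, 45, 50, 60, 65]
income = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"]

parent_impurity = gini(income)

# Candidate thresholds: midpoints between consecutive sorted values.
candidates = [(a + b) / 2 for a, b in zip(hours, hours[1:])]

scores = {}
for t in candidates:
    left = [y for x, y in zip(hours, income) if x <= t]
    right = [y for x, y in zip(hours, income) if x > t]
    scores[t] = weighted_gini(left, right)

# The best split is the one with the lowest weighted impurity.
best_threshold = min(scores, key=scores.get)
print("parent impurity:", round(parent_impurity, 3))
print("best threshold:", best_threshold, "impurity:", round(scores[best_threshold], 3))
```

A real tree repeats this search over every feature at every node, and then recurses into the resulting sub-nodes.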


A few algorithms used in decision trees:


ID3 → (extension of D3)


C4.5 → (successor of ID3)


CART → (Classification And Regression Tree)


CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)


MARS → (Multivariate Adaptive Regression Splines)


Dataset Link: https://archive-beta.ics.uci.edu/ml/datasets/adult


Understanding Code:


Let us import the required libraries for the project.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB



Let us now load the data into the system.


# Assumption: missing entries are encoded as '?' in this CSV, so read them as NaN
df = pd.read_csv('adult.csv', na_values='?')
df.head()


As you can see, the dataset has multiple features, from 'Age' through to 'Income'.



Now let us check for missing values.


df.isnull().sum()

 




Missing values exist in 'workclass', 'occupation' and 'native.country'.


So let us fill the missing values with the mode (the most frequent value) of each column.


for col in ['workclass', 'occupation', 'native.country']:
    df[col] = df[col].fillna(df[col].mode()[0])


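Now, let us look at how the numeric features correlate with one another using a heat map.


# Select the numeric columns; the categorical ones cannot be correlated directly
numeric_features = df.select_dtypes(include='number').columns
g = sns.heatmap(df[numeric_features].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()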





Now, let us drop the 'fnlwgt' feature, as it is of very little importance for this task.

Let us take a deeper look at the data using a few visualizations.
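The drop itself, and the split into the features `X` and target `Y` that the later steps rely on, is not shown in the write-up. A minimal sketch could look like the following. It uses a tiny made-up DataFrame in place of the real one, and it assumes the target column is named 'income'; adjust the name to match your CSV.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame (values are made up).
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "fnlwgt": [77516, 83311, 215646, 234721],
    "hours.per.week": [40, 13, 40, 40],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],  # assumed target column name
})

# Drop the sampling-weight column.
df = df.drop(columns=["fnlwgt"])

# Features are every column except the target; the target is 'income'.
X = df.drop(columns=["income"])
Y = df["income"]

print(X.columns.tolist())
print(Y.value_counts().to_dict())
```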



As the data is imbalanced, I used random undersampling to balance the two classes.


from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Randomly sample the majority class down to the size of the minority class
rus = RandomUnderSampler(random_state=42, replacement=True)
x_undersample, y_undersample = rus.fit_resample(X, Y)

print('Original dataset shape:', Counter(Y))
print('Resampled dataset shape:', Counter(y_undersample))
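To see what random undersampling does, here is a rough pandas-only sketch of the same idea on made-up data. It is an illustration of the technique, not imbalanced-learn's implementation, and unlike the call above it samples without replacement.

```python
import pandas as pd

# Made-up imbalanced data: 6 rows of the majority class, 2 of the minority.
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 60],
    "income": ["<=50K"] * 6 + [">50K"] * 2,
})

# Size of the smallest class; every class is cut down to this size.
minority_size = data["income"].value_counts().min()

# Draw the same number of rows from each class and stack them back together.
parts = [
    group.sample(n=minority_size, random_state=42)
    for _, group in data.groupby("income")
]
balanced = pd.concat(parts).reset_index(drop=True)

print(data["income"].value_counts().to_dict())
print(balanced["income"].value_counts().to_dict())
```

The trade-off is that rows from the majority class are thrown away, so the model trains on less data in exchange for balanced classes.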


Before we feed the data into the model, we first need to split it into training and test sets.


from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(x_undersample, y_undersample, test_size = 0.3, random_state = 0)


More preprocessing is needed before modelling: the categorical features are label-encoded here, and the data is scaled afterwards.


from sklearn import preprocessing

categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
    # Fit the encoder on the training data only, then reuse it on the test data
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])
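One caveat with this loop: LabelEncoder raises an error in transform() if the test set contains a category that never appeared in the training set, which can happen with rare values such as some countries. A possible alternative, sketched here on made-up data and not what the original project used, is scikit-learn's OrdinalEncoder with an explicit code for unknown categories (this option needs scikit-learn 0.24 or newer).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up example: the test set contains a country not seen during training.
train = pd.DataFrame({"native.country": ["United-States", "Mexico", "India"]})
test = pd.DataFrame({"native.country": ["Mexico", "Holand-Netherlands"]})

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["native.country"]])
test_codes = encoder.transform(test[["native.country"]])

print(encoder.categories_[0].tolist())
print(test_codes.ravel().tolist())
```

Categories are numbered in sorted order, so 'Mexico' keeps the same code in both sets and the unseen country gets -1.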

 


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)


For the modelling part of the project, I used the decision tree algorithm.
The model is created with scikit-learn's "DecisionTreeClassifier()" and fitted on the training data.
Finally, the "predict" function is used to make predictions on the test set.


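decision_tree_model = DecisionTreeClassifier()

# Fit the tree on the training data
model_dt = decision_tree_model.fit(X_train, Y_train)

# Predict income classes for the held-out test set
Y_pred = model_dt.predict(X_test)

# Accuracy on the training data (an optimistic figure for an unpruned tree)
model_dt.score(X_train, Y_train)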
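Note that score(X_train, Y_train) measures accuracy on data the tree has already seen, and an unconstrained tree can memorise its training set. The sketch below uses synthetic data from make_classification, not the census data, and a hypothetical max_depth of 4 to show why the test score is the more informative number.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unconstrained tree keeps splitting until every training leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A depth-limited tree cannot memorise the noise as easily.
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

full_train_acc = full_tree.score(X_tr, y_tr)
full_test_acc = full_tree.score(X_te, y_te)
shallow_train_acc = shallow_tree.score(X_tr, y_tr)
shallow_test_acc = shallow_tree.score(X_te, y_te)

print("full tree    train/test:", full_train_acc, full_test_acc)
print("shallow tree train/test:", shallow_train_acc, shallow_test_acc)
```

Comparing the two pairs of numbers on your own data is a quick way to check whether limiting the depth helps.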

Let us generate a performance report for the model using the 'classification_report' function.


from sklearn.metrics import classification_report
# Order follows the sorted class labels: '<=50K' first, then '>50K'
class_names = ['Income is $50,000 or less', 'Income is more than $50,000']
print(classification_report(Y_test, Y_pred, target_names=class_names))
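The precision and recall figures in such a report come from the confusion matrix. As a small worked example on made-up labels (1 standing for '>50K'), they can be checked by hand like this:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels: 0 = '<=50K', 1 = '>50K'.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: of the rows predicted '>50K', how many really are.
precision = tp / (tp + fp)
# Recall: of the rows that really are '>50K', how many were found.
recall = tp / (tp + fn)

print(cm.tolist())
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Both hand-computed values match what precision_score and recall_score return for the positive class.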




As the report shows, the model performs well on the test set.



Thank you for your time.


 


 

