Decision Trees
Example:
Consider the scenario of a factory described above. The management needs to decide whether or not to expand, based on the above data:
Net(Expand) = (0.4 * 6 + 0.6 * 2) - 1.5 = $2.1M
Net(No Expand) = (0.4 * 3 + 0.6 * 1) - 0 = $1.8M
$2.1M > $1.8M, therefore the factory should be expanded
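The same expected-value comparison can be scripted; a minimal sketch in Python, using only the probabilities, payoffs, and expansion cost quoted in the example above:
# Expected monetary value (in $M) of each decision, figures from the example above
p_good, p_poor = 0.4, 0.6
net_expand = (p_good * 6 + p_poor * 2) - 1.5     # payoffs minus the $1.5M expansion cost
net_no_expand = (p_good * 3 + p_poor * 1) - 0    # no expansion cost
print(round(net_expand, 2), round(net_no_expand, 2))   # 2.1 1.8 -> expand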
Applications:
Types of decision trees:
Regression vs. Decision Trees
Regression Methods
Decision Trees
Finally, the accuracy of the regression methods and decision trees can be compared to decide which model to use.
Structure of a decision tree:
Types of decision tree structures:
Which node to choose for splitting?
The best split at the root (or at a child node) is the one that does the best job of separating the data into groups in which a single class (either 0 or 1) predominates.
The measure used to evaluate a potential split is purity. The best split is the one that increases the purity of the subsets by the greatest amount. There are different indicators of purity:
Example of purity calculation:
In the above scenario, the purity of Node N3 can be calculated as follows:
The probabilities of Class = 0 and Class = 1 are equal, i.e., 3/6 each.
Node N3:
Gini = 1 - ((3/6)² + (3/6)²) = 0.5
Entropy = -(3/6) log₂(3/6) - (3/6) log₂(3/6) = 1
Error = 1 - max[(3/6), (3/6)] = 0.5
Similarly, any one of the above indicators can be calculated for nodes N1 and N2, and the split that produces the largest increase in purity (equivalently, the largest decrease in Gini, entropy, or error) is chosen for further splitting.
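These purity indicators are straightforward to compute; a minimal sketch in Python, checked against the Node N3 counts (3 records of each class) from the example above:
from math import log2

def purity_measures(counts):
    # Gini, entropy, and classification error for a list of class counts
    total = sum(counts)
    probs = [c / total for c in counts]
    gini = 1 - sum(p ** 2 for p in probs)
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    error = 1 - max(probs)
    return gini, entropy, error

print(purity_measures([3, 3]))   # Node N3: (0.5, 1.0, 0.5)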
Example (Transportation Study):
Consider the following data, which is part of a transportation study carried out by a government to understand the travel preferences of its citizens.
The prediction (target) variable is the preferred mode of transportation among commuters along a major route in a city: Bus, Car, or Train.
The data has 4 variables.
Calculate the entropy before the split:
P(Bus) = P(B) = 4/10 = 0.4
P(Car) = P(C) = 3/10 = 0.3
P(Train) = P(T) = 3/10 = 0.3
Entropy = -0.4 log₂(0.4) - 0.3 log₂(0.3) - 0.3 log₂(0.3) = 1.57
Round 1:
Calculate the entropy of a split based on Gender.
P(Female) = 5/10 = 0.5
P(Male) = 5/10 = 0.5
Entropy(Gender) = 1.52 * 0.5 + 1.37 * 0.5 = 1.45, where 1.52 and 1.37 are the entropies of the two gender subgroups.
Entropy before this split = 1.57
Gender Entropy Gain = 1.57 - 1.45 = 0.12
Entropy of a split based on Car Ownership:
P(ownership = 0) = 3/10 = 0.3
P(ownership = 1) = 5/10 = 0.5
P(ownership = 2) = 2/10 = 0.2
Entropy(Ownership) = 0.92 * 0.3 + 1.52 * 0.5 + 0 * 0.2 = 1.04, where 0.92, 1.52, and 0 are the entropies of the three ownership subgroups.
Entropy before this split = 1.57
Car Ownership Entropy Gain = 1.57 - 1.04 = 0.53
Similarly,
Income Level Entropy Gain = 0.695
Travel Cost/Km Entropy Gain = 1.210
The entropy gain for Travel Cost/Km is the highest, so the decision tree should be split with Travel Cost/Km as the root node.
After splitting, the data is as follows,
Data when Travel Cost/Km is Cheap:
P(Bus) = P(B) = ( 4/5 ) = 0.8
P(Train) = P(T) = ( 1/5 ) = 0.2
P(Car) = P(C) = 0
Entropy = -0.8 log₂(0.8) - 0.2 log₂(0.2) = 0.72
Now repeat the above process within each branch, calculating the entropy gain for each remaining attribute (Gender, Car Ownership, Income Level), until the final decision tree is obtained.
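The entropy and gain arithmetic above can be verified with a short sketch. The subgroup entropies and weights used here are the ones quoted in the example (the full 10-record table is not reproduced):
from math import log2

def entropy(probs):
    # entropy in bits of a class-probability distribution
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(parent_entropy, weights, child_entropies):
    # parent entropy minus the weighted average entropy of the child subgroups
    return parent_entropy - sum(w * e for w, e in zip(weights, child_entropies))

root = entropy([0.4, 0.3, 0.3])                    # Bus, Car, Train
print(round(root, 2))                              # 1.57
# Gender split: subgroup entropies 1.52 and 1.37, 5 records each
print(round(info_gain(root, [0.5, 0.5], [1.52, 1.37]), 2))   # 0.13 (0.12 above, from rounded intermediates)
# "Cheap" Travel Cost/Km subgroup: 4 Bus, 1 Train
print(round(entropy([0.8, 0.2]), 2))               # 0.72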
Confusion Matrix:
It is a tabular representation of actual vs. predicted values, which helps in assessing the accuracy of the model and in diagnosing problems such as over-fitting.
Example:
The above table can be interpreted as:
True Positives (Predicted 1 & Actual 1): 393
True Negatives (Predicted 0 & Actual 0): 380
False Positives (Predicted 1 & Actual 0): 125
False Negatives (Predicted 0 & Actual 1): 198
Accuracy, sensitivity, specificity:
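As a quick check, these metrics can be computed from the example counts above using their standard definitions; a minimal sketch:
# Counts from the confusion-matrix example above
TP, TN, FP, FN = 393, 380, 125, 198
accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
sensitivity = TP / (TP + FN)                 # true-positive rate (recall)
specificity = TN / (TN + FP)                 # true-negative rate
print(round(accuracy, 2), round(sensitivity, 2), round(specificity, 2))   # 0.71 0.66 0.75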
Overfitting and Pruning:
Over-fitting happens when the model learns the training data too closely, fitting noise rather than the underlying pattern.
Over-fitting results in decision trees that are more complex than necessary, and the training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
Over-fitting can be avoided by pruning, i.e., stopping the tree from splitting further (pre-pruning) or trimming it back after it is fully grown (post-pruning).
Pre-Pruning (Early stopping rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
More restrictive conditions:
Post-Pruning:
Random Forests
A random forest is an ensemble classifier that consists of many decision trees and outputs the class that receives the most votes from the individual trees.
The advantages of Random Forest are:
Disadvantages:
Random Forest with Cross-Validation
The objective of cross-validation is to evaluate the model on several different partitions of the data into training and validation sets and then average the results, so that the estimate is not biased by any single partition.
Classification Algorithms in Python
Consider the Carseats dataset, on which we apply the above classification algorithms to predict car-seat sales (converted below into a Yes/No label).
Import the libraries and load the dataset with pandas
import pandas as pd
import graphviz
from subprocess import call
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
Path = "D:\\DSA Course\\Datasets\\R Inbuilt Datasets\\Carseats.csv"
data = pd.read_csv(Path)
Replace all values above 4 in the Sales column with 'Yes' and the remaining values with 'No', as this will be our target column to predict. Then separate the feature and label columns and factorize the categorical columns.
data.loc[data.Sales > 4, 'Sale'] = 'Yes'
data.loc[data.Sales <= 4, 'Sale'] = 'No'   # <= so that Sales equal to 4 is also labelled
class_names = data['Sale']
data = data.loc[:,data.columns != 'Sales']
data['Sale'],_ = pd.factorize(data['Sale'])
data['ShelveLoc'],_ = pd.factorize(data['ShelveLoc'])
data['Urban'],_= pd.factorize(data['Urban'])
data['US'],_ = pd.factorize(data['US'])
data.info()
X = data.loc[:,data.columns != 'Sale']
Y = data.Sale
feature_names = X.columns
Divide the data into train and test sets, create a decision tree classifier, and check its accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
############## Decision Tree ###########################################
dtree = tree.DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Accuracy: {:.2f}'.format(DT_accuracy))
Output:
[[100 14]
[ 5 1]]
Misclassified samples: 19
DT_Accuracy: 0.84
Decision Tree with Information Gain
dtree = tree.DecisionTreeClassifier(criterion='entropy',random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Entropy_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Entropy_Accuracy: {:.2f}'.format(DT_Entropy_accuracy))
Output:
[[102 12]
[ 4 2]]
Misclassified samples: 16
DT_Entropy_Accuracy: 0.87
Decision Tree Post-Pruning
dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,max_depth=3)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Post_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Post_Accuracy: {:.2f}'.format(DT_Post_accuracy))
Output:
[[110 4]
[ 6 0]]
Misclassified samples: 10
DT_Post_Accuracy: 0.92
Decision Tree Pre-Pruning
dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,min_samples_split=10,min_samples_leaf=5)
# min_samples_split corresponds to minsplit in R's rpart
# min_samples_leaf corresponds to minbucket in R's rpart
# max_depth corresponds to maxdepth in R's rpart
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Pre_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_pre_Accuracy: {:.2f}'.format(DT_Pre_accuracy))
Output:
[[109 5]
[ 5 1]]
Misclassified samples: 10
DT_pre_Accuracy: 0.92
Random Forest Classifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train,y_train)
predicted = model.predict(X_test)
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
RF_accuracy = metrics.accuracy_score(y_test, predicted)
print('RF_Accuracy: {:.2f}'.format(RF_accuracy))
Output:
[[113 1]
[ 6 0]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       114
           1       0.00      0.00      0.00         6

    accuracy                           0.94       120
   macro avg       0.47      0.50      0.48       120
weighted avg       0.90      0.94      0.92       120
RF_Accuracy: 0.94
Random Forest with Cross-Validation
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, Y, cv=5)
print("RF_CV_Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
RF_CV_Accuracy = scores.mean()
print('RF_CV_Accuracy: {:.2f}'.format(RF_CV_Accuracy))
Output:
RF_CV_Accuracy: 0.91 (+/- 0.02)
RF_CV_Accuracy: 0.91
For the code file, refer here: https://www.cluzters.ai/vault/274/1029/classification-algorithms-code