

Decision Trees and Random Forests

Decision Trees



  • A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules.

  • A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) segments.

  • The attributes can be variables of any type (binary, nominal, ordinal, or quantitative), while the classes must be qualitative (categorical, binary, or ordinal).

  • In short, given data of attributes together with their classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class; a toy illustration follows this list.

  • One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.

  • With each successive division, the members of the resulting sets become more and more similar to each other.

  • Hence, the algorithm used to construct decision trees is referred to as recursive partitioning.

  • The algorithm is popularly known as CART - Classification And Regression Trees.
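As a toy illustration only (the attributes and thresholds below are invented, not learned from data), such a sequence of rules can be read as nested if/else questions that end in a class label:

import pandas  # not required for this toy example; shown later for the real workflow

# Hypothetical hand-written rules, just to show what a learned tree amounts to:
# each question splits the records further, and the leaves assign a class.
def classify_transaction(amount, country_matches_billing, num_tx_last_hour):
    """Toy fraud-detection rules shaped like a small decision tree (illustrative only)."""
    if amount > 1000:
        if not country_matches_billing:
            return "fraudulent"
        return "legitimate"
    else:
        if num_tx_last_hour > 5:
            return "fraudulent"
        return "legitimate"

print(classify_transaction(1500, False, 1))  # fraudulent
print(classify_transaction(40, True, 2))     # legitimate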


Example:

[Figure: decision tree for the factory expansion decision, with Expand / Do not expand branches and good / bad economy outcomes]


Consider the above scenario of a factory where



  • Expanding the factory costs $1.5 million; the probability of a good economy is 0.4 (40%), which leads to $6 million profit, and the probability of a bad economy is 0.6 (60%), which leads to $2 million profit.

  • Not expanding the factory costs $0; the probability of a good economy is 0.4, which leads to $3M profit, and the probability of a bad economy is 0.6, which leads to $1M profit.


Management needs to decide whether or not to expand based on the above data:


Net(Expand) = (0.4 * 6 + 0.6 * 2) - 1.5 = $2.1M


Net(No Expand) = (0.4 * 3 + 0.6 * 1) - 0 = $1.8M


Since $2.1M > $1.8M, the factory should be expanded.
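The same expected-value arithmetic can be checked with a few lines of Python, using exactly the probabilities, profits, and costs listed above:

# Expected net payoff of each decision branch: sum(probability * profit) - cost
# All numbers come from the factory example above (profits and costs in $M).
branches = {
    "Expand":    {"cost": 1.5, "outcomes": [(0.4, 6.0), (0.6, 2.0)]},
    "No Expand": {"cost": 0.0, "outcomes": [(0.4, 3.0), (0.6, 1.0)]},
}

nets = {}
for name, branch in branches.items():
    expected_profit = sum(p * profit for p, profit in branch["outcomes"])
    nets[name] = expected_profit - branch["cost"]
    print(f"Net({name}) = ${nets[name]:.1f}M")

print("Decision:", max(nets, key=nets.get))  # Expand ($2.1M > $1.8M)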


Applications:



  • Predicting tumor cells as benign or malignant

  • Classifying credit card transactions as legitimate or fraudulent

  • Classifying buyers from non-buyers

  • Decision on whether or not to approve a loan

  • Diagnosis of various diseases based on symptoms and profile.


Types of decision trees:



  • The target variable is usually categorical, and the decision tree is used either to calculate the probability that a given record belongs to each of the categories or to classify the record by assigning it to the most likely class (or category).

  • Decision trees can also be used to estimate the value of a continuous target variable. However, regression models and neural networks are generally more appropriate for such estimations. (A brief scikit-learn sketch of both variants follows this list.)
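For concreteness, scikit-learn exposes both variants; this is a minimal sketch with a made-up one-feature dataset used purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny made-up dataset: one feature, four records.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_class = np.array([0, 0, 1, 1])          # categorical target -> classification tree
y_value = np.array([1.2, 1.9, 3.1, 3.9])  # continuous target  -> regression tree

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)

print(clf.predict([[2.5]]), clf.predict_proba([[2.5]]))  # class label and class probabilities
print(reg.predict([[2.5]]))                              # estimated continuous value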


Regression Vs. Decision Trees


Regression Methods                                                                                                                         



  • If the relation between Y and X1, X2,....Xn is determined by a linear relationship                            

  • When data/sample is limited

  • Extrapolation is possible even for scenarios not represented in the training dataset


Decision Trees



  • If there is a high nonlinearity & complex relationship between Y and X1, X2,....Xn

  • Decision tree models are even simpler to interpret than regression!

  • When a large amount of data is available to cover all scenarios

  • Scenarios not present in the training data are handled poorly (decision trees do not extrapolate)


Finally, the accuracy of the regression methods and decision trees can be compared to decide which model to use.


Structure of a decision tree:



[Figure: structure of a decision tree, with a root node, internal decision nodes, and leaf (terminal) nodes]


Types of decision tree structures:



  • Binary trees - only 2 choices in each split. Can be non-uniform (uneven) in depth

  • N-way trees or ternary trees - 3 or more choices in at least one of its splits (3-way, 4-way, etc.)


Which node to choose for splitting?


The best split at a root (or child) node is defined as the one that does the best job of separating the data into groups where a single class (either 0 or 1) predominates in each group.


The measure used to evaluate a potential split is purity. The best split is the one that increases the purity of the subsets by the greatest amount. There are different indicators of purity:


Gini = 1 - Σ (p_i)²

Entropy = -Σ p_i log2(p_i)

Classification error = 1 - max(p_i)

where p_i is the fraction of records at the node belonging to class i.


Example of purity calculation:



[Figure: candidate child nodes N1, N2, and N3 with their class counts; node N3 contains 3 records of Class 0 and 3 of Class 1]


In the above scenario, the purity of Node N3 can be calculated as follows:


The probabilities of Class = 0 and Class = 1 are equal, i.e. 3/6 each.


Node N3:


Gini = 1 - ((3/6)² + (3/6)²) = 0.5


Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1


Error = 1 - max[(3/6), (3/6)] = 0.5


Similarly, any one of the above indicators can be calculated for nodes N1 and N2, and the split that yields the greatest improvement in purity (the largest reduction in impurity) is chosen for further splitting.
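A small helper makes these purity calculations easy to reproduce; the 3/3 class counts of node N3 from the example are used below, and any other node's counts can be substituted:

import math

def purity_measures(class_counts):
    """Gini, entropy (base 2), and classification error for one node's class counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    gini = 1 - sum(p ** 2 for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    error = 1 - max(probs)
    return gini, entropy, error

# Node N3 from the example: 3 records of Class 0 and 3 of Class 1
print(purity_measures([3, 3]))   # (0.5, 1.0, 0.5)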


Example ( Transportation study):


Consider the following data, which is part of a transportation study conducted by a government to understand the travel preferences of citizens.


The prediction (target) variable is the preferred mode of transportation: Bus, Car, or Train, among commuters along a major route in a city.


The data has 4 variables.



  1. Gender is a binary type

  2. Car ownership is the nominal type (0 - No car, 1 - Diesel, 2 - Petrol)

  3. Travel cost/km is a quantitative (ratio) variable converted into an ordinal type (Cheap, Standard, Expensive)

  4. Income type is an ordinal type


[Table: 10 commuter records with Gender, Car Ownership, Travel Cost/Km, Income Level, and preferred Transportation mode (4 Bus, 3 Car, 3 Train)]


Calculate the entropy before the split:


P(Bus) = P(B) = 4/10 = 0.4


P(Car) = P(C) = 3/10 = 0.3


P(Train) = P(T) = 3/10 = 0.3


Entropy = -0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.57
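This pre-split entropy can be checked directly from the class counts stated above (4 Bus, 3 Car, 3 Train):

import math

# Class distribution before any split: 4 Bus, 3 Car, 3 Train out of 10 records
counts = {"Bus": 4, "Car": 3, "Train": 3}
total = sum(counts.values())

entropy_before = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy_before, 2))   # 1.57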



Round 1:


Calculate the entropy of split based on gender.


[Table: class counts and entropy for the Female and Male subgroups]


P(Female) = 5/10 = 0.5


P(Male) = 5/10 = 0.5


Entropy(Gender) = 0.5 * 1.52 + 0.5 * 1.37 = 1.45 (the weighted average of the Female and Male subgroup entropies)


Entropy before this split = 1.57


Gender Entropy Gain = 1.57 - 1.45 = 0.12


Entropy of split based on Car Ownership:


[Table: class counts and entropy for each Car Ownership subgroup]


P(ownership = 0) = 3/10 = 0.3


P(ownership = 1) = 5/10 = 0.5


P(ownership = 2) = 2/10 = 0.2


Entropy(Ownership) = 0.3 * 0.92 + 0.5 * 1.52 + 0.2 * 0 = 1.04


Entropy before this split = 1.57


Car Ownership Entropy Gain = 1.57 - 1.04 = 0.53


Similarly,


Income Level Entropy Gain = 0.695


Travel Cost/Km Entropy Gain = 1.210



The entropy gain of Travel Cost/Km is the highest. So, the decision tree should be split with Travel Cost/Km as the root node.
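The selection logic amounts to "compute each attribute's weighted child entropy and keep the largest gain". A minimal sketch using the subgroup proportions and entropies quoted above (the Income Level and Travel Cost/Km gains are taken as stated rather than recomputed):

def information_gain(parent_entropy, weighted_children):
    """weighted_children: list of (proportion_of_records, child_entropy) pairs."""
    return parent_entropy - sum(w * e for w, e in weighted_children)

parent = 1.57  # entropy before the split, computed above

gains = {
    "Gender":         information_gain(parent, [(0.5, 1.52), (0.5, 1.37)]),
    "Car Ownership":  information_gain(parent, [(0.3, 0.92), (0.5, 1.52), (0.2, 0.0)]),
    "Income Level":   0.695,   # gain as stated in the text
    "Travel Cost/Km": 1.210,   # gain as stated in the text
}

print({name: round(g, 3) for name, g in gains.items()})
print("Split on:", max(gains, key=gains.get))   # Travel Cost/Km has the largest gain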


After splitting, the data is as follows,

[Tables: the records grouped by Travel Cost/Km (Expensive, Standard, and Cheap subsets)]



  • When the Travel Cost/Km is Expensive, Car is the preferred mode of transportation

  • When it is Standard, Train is the preferred mode

  • When it is Cheap, the data should be further split as it contains both Bus and Train


Data when Travel Cost/Km is Cheap:

[Table: the 5 records where Travel Cost/Km is Cheap (4 Bus, 1 Train)]


P(Bus) = P(B) = ( 4/5 ) = 0.8


P(Train) = P(T) = ( 1/5 ) = 0.2


P(Car) = P(C) = 0


Entropy = -0.8 log2(0.8) - 0.2 log2(0.2) = 0.72


Now repeat the above process and calculate the entropy gain for each remaining attribute (Gender, Car Ownership, Income) within this subset until the final decision tree is obtained.

[Figure: the final decision tree for the transportation study, with Travel Cost/Km at the root]


Confusion Matrix: 


It is a tabular representation of actual vs. predicted values. It helps measure the accuracy of the model and detect overfitting (for example, when training accuracy is far higher than test accuracy).


Example:


             Predicted 0    Predicted 1
Actual 0         380            125
Actual 1         198            393


The above table can be interpreted as:



True Positives    (Predicted 1 & Actual 1): 393


True Negatives  (Predicted 0 & Actual 0): 380


False Positives   (Predicted 1 & Actual 0): 125


False Negatives (Predicted 0 & Actual 1): 198





Accuracy, sensitivity, specificity:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity (recall) = TP / (TP + FN)

Specificity = TN / (TN + FP)
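Plugging in the counts from the confusion matrix example above (TP = 393, TN = 380, FP = 125, FN = 198):

TP, TN, FP, FN = 393, 380, 125, 198  # counts from the confusion matrix above

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)   # true negative rate

print(f"Accuracy:    {accuracy:.3f}")     # ~0.705
print(f"Sensitivity: {sensitivity:.3f}")  # ~0.665
print(f"Specificity: {specificity:.3f}")  # ~0.752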
Overfitting and Pruning:


Over-fitting happens when



  • The model is too complicated

  • The model works well on training data but performs very badly on test data


Over-fitting results in decision trees that are more complex than necessary, and the training error no longer provides a good estimate of how well the tree will perform on previously unseen records.


[Figure: training and test error versus tree complexity, illustrating over-fitting]

Over-fitting can be avoided by pruning, i.e. preventing the tree from splitting further.


Pre-Pruning (Early stopping rule)


Stop the algorithm before it becomes a fully-grown tree


Typical stopping conditions for a node:



  • Stop if all instances belong to the same class

  • Stop if all the attribute values are the same


More restrictive conditions:



  • Stop if the number of instances is less than some user-specified threshold

  • Stop if expanding the current node does not improve impurity measures (e.g. Gini or information gain)


Post-Pruning:



  • Grow decision tree to its entirety

  • Trim the nodes of the decision tree in a bottom-up fashion

  • If generalization error improves after trimming, replace the sub-tree with a leaf node (a scikit-learn sketch of this idea follows below)
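As a hedged sketch of this idea, scikit-learn offers minimal cost-complexity pruning via the ccp_alpha parameter: grow the full tree, then trim sub-trees at increasing pruning strengths and keep the one that generalizes best. The train/test arrays X_train, X_test, y_train, y_test are assumed to exist already (for instance, from the Carseats split later in this article); in practice the pruning strength should be chosen on a validation set or by cross-validation rather than on the test set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test already exist (see the Carseats example below).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths derived from the fully grown tree
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = accuracy_score(y_test, pruned.predict(X_test))
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha: {best_alpha:.4f}, held-out accuracy: {best_score:.2f}")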






Random Forests


Random forest is an ensemble classifier that consists of many decision trees and outputs the class that receives the most votes from the individual trees.



  • Assume the number of cases in the training set is N. A sample of N cases is taken at random, but with replacement; this sample is the training set for growing a decision tree.

  • If there are Z input variables, a number m < Z is specified such that at each node, m variables are selected at random out of the Z. The best split on these m variables is used to split the node. The value of m is held constant while we grow the forest.

  • Each tree is grown to the largest extent possible and there is no pruning.

  • The forest predicts new data by aggregating the predictions of all the trees (majority vote for classification); these choices map onto RandomForestClassifier parameters, as sketched below.
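These three ingredients (bootstrap samples, a random subset of m features per split, fully grown unpruned trees) correspond directly to standard RandomForestClassifier parameters in scikit-learn. The sketch below is illustrative and assumes a feature matrix X and target Y, such as the Carseats data prepared later in this article:

from sklearn.ensemble import RandomForestClassifier

# Assumes a feature matrix X and target Y are already defined (see the Carseats example below).
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown from a bootstrap sample
    bootstrap=True,        # each tree sees N cases sampled with replacement
    max_features="sqrt",   # m = sqrt(Z) variables considered at each split
    max_depth=None,        # trees are grown to the largest extent possible (no pruning)
    random_state=0,
)
forest.fit(X, Y)

# Prediction aggregates the votes of all trees (majority class wins)
print(forest.predict(X[:5]))
print(forest.feature_importances_)  # estimate of which variables matter most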


The advantages of Random Forest are:



  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.

  • It runs efficiently on large databases.

  • It can handle thousands of input variables without variable deletion.

  • It gives estimates of what variables are important in the classification.

  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.

  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.


Disadvantages:



  • Difficult to interpret the logic for prediction

  • Acts like a black-box


Random Forest with Cross-Validation


The objective of cross-validation is to choose different partitions of the training set and validation set, and then average the result so that the result will not be biased by any single partition.


[Figure: k-fold cross-validation, with each fold used once as the test set]




  • Randomly divide your data into K pieces/folds

  • Treat the 1st fold as the test dataset. Fit the model to the other folds (training data).

  • Apply the model to the test data, and repeat K times so that each fold serves once as the test set.

  • Calculate statistics of model accuracy and fit from the test data only. (A hand-written K-fold sketch follows this list; the cross_val_score shortcut appears at the end of the article.)
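Written out by hand with scikit-learn's KFold, the procedure looks like this; a pandas feature matrix X and label series Y (such as the Carseats data below) are assumed:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumes a pandas DataFrame X (features) and Series Y (labels) are already defined.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []

for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X.iloc[train_idx], Y.iloc[train_idx])           # fit on K-1 folds
    preds = model.predict(X.iloc[test_idx])                   # apply to the held-out fold
    scores.append(accuracy_score(Y.iloc[test_idx], preds))    # score the test fold only

print("Fold accuracies:", np.round(scores, 2))
print("Mean accuracy:  ", round(float(np.mean(scores)), 2))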


 


Classification Algorithms in Python


Consider the Carseats data, on which we apply the above classification algorithms to predict whether carseat sales are high.


Import the libraries and load the dataset with pandas


import pandas as pd
import graphviz
from subprocess import call
from sklearn import tree, metrics

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

Path = "D:\\DSA Course\\Datasets\\R Inbuilt Datasets\\Carseats.csv"
data = pd.read_csv(Path)

Replace all values above 4 in the Sales column with ‘Yes’ and the rest with ‘No’, as this will be our target column to predict. Then separate the feature and label columns, and factorize the categorical columns.


data.loc[data.Sales > 4, 'Sale'] = 'Yes'
data.loc[data.Sales <= 4, 'Sale'] = 'No'   # <= so that values of exactly 4 are also labeled

class_names = data['Sale']
data = data.loc[:,data.columns != 'Sales']
data['Sale'],_ = pd.factorize(data['Sale'])
data['ShelveLoc'],_ = pd.factorize(data['ShelveLoc'])
data['Urban'],_= pd.factorize(data['Urban'])
data['US'],_ = pd.factorize(data['US'])
data.info()

X = data.loc[:,data.columns != 'Sale']
Y = data.Sale

feature_names = X.columns

Divide the data into training and test sets, create a decision tree classifier, and check its accuracy.


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
############## Decision Tree ###########################################
dtree = tree.DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Accuracy: {:.2f}'.format(DT_accuracy))

Output:


[[100  14]
 [  5   1]]

Misclassified samples: 19
DT_Accuracy: 0.84

Decision Tree with Information Gain


dtree = tree.DecisionTreeClassifier(criterion='entropy',random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))

DT_Entropy_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Entropy_Accuracy: {:.2f}'.format(DT_Entropy_accuracy))

Output:


[[102  12]
 [  4   2]]
Misclassified samples: 16
DT_Entropy_Accuracy: 0.87

Decision Tree Post-Pruning


dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,max_depth=3)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Post_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Post_Accuracy: {:.2f}'.format(DT_Post_accuracy))

Output:


[[110   4]
 [  6   0]]
Misclassified samples: 10
DT_Post_Accuracy: 0.92

 


Decision Tree Pre-Pruning


dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,min_samples_split=10,min_samples_leaf=5)
#min_samples_split === minsplit
#min_samples_leaf === minbucket
#max_depth = depth

dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Pre_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_pre_Accuracy: {:.2f}'.format(DT_Pre_accuracy))

Output:


[[109   5]
 [  5   1]]
Misclassified samples: 10
DT_pre_Accuracy: 0.92

Random Forest Classifier


model = RandomForestClassifier(n_estimators=100)
model.fit(X_train,y_train)

predicted = model.predict(X_test)
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
RF_accuracy = metrics.accuracy_score(y_test, predicted)  # use the forest's predictions, not the earlier y_pred
print('RF_Accuracy: {:.2f}'.format(RF_accuracy))

Output:


[[113   1]
 [  6   0]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       114
           1       0.00      0.00      0.00         6

    accuracy                           0.94       120
   macro avg       0.47      0.50      0.48       120
weighted avg       0.90      0.94      0.92       120

RF_Accuracy: 0.94

Random Forest with Cross-Validation


model = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(model, X, Y, cv=5)

print("RF_CV_Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
RF_CV_Accuracy = scores.mean()
print('RF_CV_Accuracy: {:.2f}'.format(RF_CV_Accuracy))

Output:


RF_CV_Accuracy: 0.91 (+/- 0.02)
RF_CV_Accuracy: 0.91

For the code file, refer here: https://www.cluzters.ai/vault/274/1029/classification-algorithms-code
