

Decision Trees and Random Forests

Decision Trees



  • A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules.

  • A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) segments.

  • The attributes can be variables of any type (binary, nominal, ordinal, or quantitative), while the classes must be qualitative (categorical, binary, or ordinal).

  • In short, given data of attributes together with their classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class; a toy illustration follows this list.

  • One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node.

  • With each successive division, the members of the resulting sets become more and more similar to each other.

  • Hence, the algorithm used to construct decision trees is referred to as recursive partitioning.

  • The algorithm is popularly known as CART - Classification And Regression Trees.
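As a toy illustration only (the attributes and thresholds below are invented, not learned from data), such a sequence of rules can be read as nested if/else questions that end in a class label:

import pandas  # not required for this toy example; shown later for the real workflow

# Hypothetical hand-written rules, just to show what a learned tree amounts to:
# each question splits the records further, and the leaves assign a class.
def classify_transaction(amount, country_matches_billing, num_tx_last_hour):
    """Toy fraud-detection rules shaped like a small decision tree (illustrative only)."""
    if amount > 1000:
        if not country_matches_billing:
            return "fraudulent"
        return "legitimate"
    else:
        if num_tx_last_hour > 5:
            return "fraudulent"
        return "legitimate"

print(classify_transaction(1500, False, 1))  # fraudulent
print(classify_transaction(40, True, 2))     # legitimate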


Example:

[Figure: decision tree for the factory expansion decision, with Expand / Do not expand branches and good / bad economy outcomes]


Consider the above scenario of a factory where



  • Expanding the factory costs $1.5 million; the probability of a good economy is 0.4 (40%), which leads to $6 million profit, and the probability of a bad economy is 0.6 (60%), which leads to $2 million profit.

  • Not expanding the factory costs $0; the probability of a good economy is 0.4, which leads to $3M profit, and the probability of a bad economy is 0.6, which leads to $1M profit.


Management needs to decide whether or not to expand based on the above data:


Net(Expand) = (0.4 * 6 + 0.6 * 2) - 1.5 = $2.1M


Net(No Expand) = (0.4 * 3 + 0.6 * 1) - 0 = $1.8M


Since $2.1M > $1.8M, the factory should be expanded.
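The same expected-value arithmetic can be checked with a few lines of Python, using exactly the probabilities, profits, and costs listed above:

# Expected net payoff of each decision branch: sum(probability * profit) - cost
# All numbers come from the factory example above (profits and costs in $M).
branches = {
    "Expand":    {"cost": 1.5, "outcomes": [(0.4, 6.0), (0.6, 2.0)]},
    "No Expand": {"cost": 0.0, "outcomes": [(0.4, 3.0), (0.6, 1.0)]},
}

nets = {}
for name, branch in branches.items():
    expected_profit = sum(p * profit for p, profit in branch["outcomes"])
    nets[name] = expected_profit - branch["cost"]
    print(f"Net({name}) = ${nets[name]:.1f}M")

print("Decision:", max(nets, key=nets.get))  # Expand ($2.1M > $1.8M)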


Applications:



  • Predicting tumor cells as benign or malignant

  • Classifying credit card transactions as legitimate or fraudulent

  • Classifying buyers from non-buyers

  • Decision on whether or not to approve a loan

  • Diagnosis of various diseases based on symptoms and profile.


Types of decision trees:



  • The target variable is usually categorical, and the decision tree is used either to calculate the probability that a given record belongs to each of the categories or to classify the record by assigning it to the most likely class (or category).

  • Decision trees can also be used to estimate the value of a continuous target variable. However, regression models and neural networks are generally more appropriate for such estimations. (A brief scikit-learn sketch of both variants follows this list.)
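For concreteness, scikit-learn exposes both variants; this is a minimal sketch with a made-up one-feature dataset used purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny made-up dataset: one feature, four records.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_class = np.array([0, 0, 1, 1])          # categorical target -> classification tree
y_value = np.array([1.2, 1.9, 3.1, 3.9])  # continuous target  -> regression tree

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)

print(clf.predict([[2.5]]), clf.predict_proba([[2.5]]))  # class label and class probabilities
print(reg.predict([[2.5]]))                              # estimated continuous value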


Regression Vs. Decision Trees


Regression Methods                                                                                                                         



  • If the relation between Y and X1, X2,....Xn is determined by a linear relationship                            

  • When data/sample is limited

  • Extrapolation is possible even for scenarios not represented in the training dataset


Decision Trees



  • If there is a high nonlinearity & complex relationship between Y and X1, X2,....Xn

  • Decision tree models are even simpler to interpret than regression!

  • When a large amount of data is available to cover all scenarios

  • Scenarios not present in the training data are handled poorly (decision trees do not extrapolate)


Finally, the accuracy of the regression methods and decision trees can be compared to decide which model to use.


Structure of a decision tree:



[Figure: structure of a decision tree, with a root node, internal decision nodes, and leaf (terminal) nodes]


Types of decision tree structures:



  • Binary trees - only 2 choices in each split. Can be non-uniform (uneven) in depth

  • N-way trees or ternary trees - 3 or more choices in at least one of its splits (3-way, 4-way, etc.)


Which node to choose for splitting?


The best split at a root (or child) node is defined as the one that does the best job of separating the data into groups where a single class (either 0 or 1) predominates in each group.


The measure used to evaluate a potential split is purity. The best split is the one that increases the purity of the subsets by the greatest amount. There are different indicators of purity:


Gini = 1 - Σ (p_i)²

Entropy = -Σ p_i log2(p_i)

Classification error = 1 - max(p_i)

where p_i is the fraction of records at the node belonging to class i.


Example of purity calculation:



[Figure: candidate child nodes N1, N2, and N3 with their class counts; node N3 contains 3 records of Class 0 and 3 of Class 1]


In the above scenario, the purity of Node N3 can be calculated as follows:


The probabilities of Class = 0 and Class = 1 are equal, i.e. 3/6 each.


Node N3:


Gini = 1 - ((3/6)² + (3/6)²) = 0.5


Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1


Error = 1 - max[(3/6), (3/6)] = 0.5


Similarly, any one of the above indicators can be calculated for nodes N1 and N2, and the split that yields the greatest improvement in purity (the largest reduction in impurity) is chosen for further splitting.
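A small helper makes these purity calculations easy to reproduce; the 3/3 class counts of node N3 from the example are used below, and any other node's counts can be substituted:

import math

def purity_measures(class_counts):
    """Gini, entropy (base 2), and classification error for one node's class counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    gini = 1 - sum(p ** 2 for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    error = 1 - max(probs)
    return gini, entropy, error

# Node N3 from the example: 3 records of Class 0 and 3 of Class 1
print(purity_measures([3, 3]))   # (0.5, 1.0, 0.5)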


Example ( Transportation study):


Consider the following data, which is part of a transportation study conducted by a government to understand the travel preferences of citizens.


The prediction (target) variable is the preferred mode of transportation: Bus, Car, or Train, among commuters along a major route in a city.


The data has 4 variables.



  1. Gender is a binary type

  2. Car ownership is the nominal type (0 - No car, 1 - Diesel, 2 - Petrol)

  3. Travel cost/km is a quantitative (ratio) variable converted into an ordinal type (Cheap, Standard, Expensive)

  4. Income type is an ordinal type


[Table: 10 commuter records with Gender, Car Ownership, Travel Cost/Km, Income Level, and preferred Transportation mode (4 Bus, 3 Car, 3 Train)]


Calculate the entropy before the split:


P(Bus) = P(B) = 4/10 = 0.4


P(Car) = P(C) = 3/10 = 0.3


P(Train) = P(T) = 3/10 = 0.3


Entropy = -0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.57
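This pre-split entropy can be checked directly from the class counts stated above (4 Bus, 3 Car, 3 Train):

import math

# Class distribution before any split: 4 Bus, 3 Car, 3 Train out of 10 records
counts = {"Bus": 4, "Car": 3, "Train": 3}
total = sum(counts.values())

entropy_before = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy_before, 2))   # 1.57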



Round 1:


Calculate the entropy of split based on gender.


[Table: class counts and entropy for the Female and Male subgroups]


P(Female) = 5/10 = 0.5


P(Male) = 5/10 = 0.5


Entropy(Gender) = 0.5 * 1.52 + 0.5 * 1.37 = 1.45 (the weighted average of the Female and Male subgroup entropies)


Entropy before this split = 1.57


Gender Entropy Gain = 1.57 - 1.45 = 0.12


Entropy of split based on Car Ownership:


[Table: class counts and entropy for each Car Ownership subgroup]


P(ownership = 0) = 3/10 = 0.3


P(ownership = 1) = 5/10 = 0.5


P(ownership = 2) = 2/10 = 0.2


Entropy(Ownership) = 0.3 * 0.92 + 0.5 * 1.52 + 0.2 * 0 = 1.04


Entropy before this split = 1.57


Car Ownership Entropy Gain = 1.57 - 1.04 = 0.53


Similarly,


Income Level Entropy Gain = 0.695


Travel Cost/Km Entropy Gain = 1.210



The entropy gain of Travel Cost/Km is the highest. So, the decision tree should be split with Travel Cost/Km as the root node.
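The selection logic amounts to "compute each attribute's weighted child entropy and keep the largest gain". A minimal sketch using the subgroup proportions and entropies quoted above (the Income Level and Travel Cost/Km gains are taken as stated rather than recomputed):

def information_gain(parent_entropy, weighted_children):
    """weighted_children: list of (proportion_of_records, child_entropy) pairs."""
    return parent_entropy - sum(w * e for w, e in weighted_children)

parent = 1.57  # entropy before the split, computed above

gains = {
    "Gender":         information_gain(parent, [(0.5, 1.52), (0.5, 1.37)]),
    "Car Ownership":  information_gain(parent, [(0.3, 0.92), (0.5, 1.52), (0.2, 0.0)]),
    "Income Level":   0.695,   # gain as stated in the text
    "Travel Cost/Km": 1.210,   # gain as stated in the text
}

print({name: round(g, 3) for name, g in gains.items()})
print("Split on:", max(gains, key=gains.get))   # Travel Cost/Km has the largest gain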


After splitting, the data is as follows,

[Tables: the records grouped by Travel Cost/Km (Expensive, Standard, and Cheap subsets)]



  • When the Travel Cost/Km is Expensive, Car is the preferred mode of transportation

  • When it is Standard, Train is the preferred mode

  • When it is Cheap, the data should be further split as it contains both Bus and Train


Data when Travel Cost/Km is Cheap:

[Table: the 5 records where Travel Cost/Km is Cheap (4 Bus, 1 Train)]


P(Bus) = P(B) = ( 4/5 ) = 0.8


P(Train) = P(T) = ( 1/5 ) = 0.2


P(Car) = P(C) = 0


Entropy = -0.8 log2(0.8) - 0.2 log2(0.2) = 0.72


Now repeat the above process and calculate the entropy gain for each remaining attribute (Gender, Car Ownership, Income) within this subset until the final decision tree is obtained.

[Figure: the final decision tree for the transportation study, with Travel Cost/Km at the root]


Confusion Matrix: 


It is a tabular representation of actual vs. predicted values. It helps measure the accuracy of the model and detect overfitting (for example, when training accuracy is far higher than test accuracy).


Example:


             Predicted 0    Predicted 1
Actual 0         380            125
Actual 1         198            393


The above table can be interpreted as:



True Positives    (Predicted 1 & Actual 1): 393


True Negatives  (Predicted 0 & Actual 0): 380


False Positives   (Predicted 1 & Actual 0): 125


False Negatives (Predicted 0 & Actual 1): 198





Accuracy, sensitivity, specificity:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity (recall) = TP / (TP + FN)

Specificity = TN / (TN + FP)
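Plugging in the counts from the confusion matrix example above (TP = 393, TN = 380, FP = 125, FN = 198):

TP, TN, FP, FN = 393, 380, 125, 198  # counts from the confusion matrix above

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)   # true negative rate

print(f"Accuracy:    {accuracy:.3f}")     # ~0.705
print(f"Sensitivity: {sensitivity:.3f}")  # ~0.665
print(f"Specificity: {specificity:.3f}")  # ~0.752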
Overfitting and Pruning:


Over-fitting happens when



  • The model is too complicated

  • The model works well on training data but performs very badly on test data


Over-fitting results in decision trees that are more complex than necessary, and the training error no longer provides a good estimate of how well the tree will perform on previously unseen records.


[Figure: training and test error versus tree complexity, illustrating over-fitting]

Over-fitting can be avoided by pruning, i.e. preventing the tree from splitting further.


Pre-Pruning (Early stopping rule)


Stop the algorithm before it becomes a fully-grown tree


Typical stopping conditions for a node:



  • Stop if all instances belong to the same class

  • Stop if all the attribute values are the same


More restrictive conditions:



  • Stop if the number of instances is less than some user-specified threshold

  • Stop if expanding the current node does not improve impurity measures (e.g. Gini or information gain)


Post-Pruning:



  • Grow decision tree to its entirety

  • Trim the nodes of the decision tree in a bottom-up fashion

  • If generalization error improves after trimming, replace the sub-tree with a leaf node (a scikit-learn sketch of this idea follows below)
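As a hedged sketch of this idea, scikit-learn offers minimal cost-complexity pruning via the ccp_alpha parameter: grow the full tree, then trim sub-trees at increasing pruning strengths and keep the one that generalizes best. The train/test arrays X_train, X_test, y_train, y_test are assumed to exist already (for instance, from the Carseats split later in this article); in practice the pruning strength should be chosen on a validation set or by cross-validation rather than on the test set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test already exist (see the Carseats example below).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths derived from the fully grown tree
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = accuracy_score(y_test, pruned.predict(X_test))
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha: {best_alpha:.4f}, held-out accuracy: {best_score:.2f}")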






Random Forests


Random forest is an ensemble classifier that consists of many decision trees and outputs the class that receives the most votes from the individual trees.



  • Assume the number of cases in the training set is N. A sample of N cases is taken at random, but with replacement; this sample is the training set for growing a decision tree.

  • If there are Z input variables, a number m < Z is specified such that at each node, m variables are selected at random out of the Z. The best split on these m variables is used to split the node. The value of m is held constant while we grow the forest.

  • Each tree is grown to the largest extent possible and there is no pruning.

  • The forest predicts new data by aggregating the predictions of all the trees (majority vote for classification); these choices map onto RandomForestClassifier parameters, as sketched below.
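These three ingredients (bootstrap samples, a random subset of m features per split, fully grown unpruned trees) correspond directly to standard RandomForestClassifier parameters in scikit-learn. The sketch below is illustrative and assumes a feature matrix X and target Y, such as the Carseats data prepared later in this article:

from sklearn.ensemble import RandomForestClassifier

# Assumes a feature matrix X and target Y are already defined (see the Carseats example below).
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown from a bootstrap sample
    bootstrap=True,        # each tree sees N cases sampled with replacement
    max_features="sqrt",   # m = sqrt(Z) variables considered at each split
    max_depth=None,        # trees are grown to the largest extent possible (no pruning)
    random_state=0,
)
forest.fit(X, Y)

# Prediction aggregates the votes of all trees (majority class wins)
print(forest.predict(X[:5]))
print(forest.feature_importances_)  # estimate of which variables matter most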


The advantages of Random Forest are:



  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.

  • It runs efficiently on large databases.

  • It can handle thousands of input variables without variable deletion.

  • It gives estimates of what variables are important in the classification.

  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.

  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.


Disadvantages:



  • Difficult to interpret the logic for prediction

  • Acts like a black-box


Random Forest with Cross-Validation


The objective of cross-validation is to choose different partitions of the training set and validation set, and then average the result so that the result will not be biased by any single partition.


[Figure: k-fold cross-validation, with each fold used once as the test set]




  • Randomly divide your data into K pieces/folds

  • Treat the 1st fold as the test dataset. Fit the model to the other folds (training data).

  • Apply the model to the test data, and repeat K times so that each fold serves once as the test set.

  • Calculate statistics of model accuracy and fit from the test data only. (A hand-written K-fold sketch follows this list; the cross_val_score shortcut appears at the end of the article.)
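Written out by hand with scikit-learn's KFold, the procedure looks like this; a pandas feature matrix X and label series Y (such as the Carseats data below) are assumed:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumes a pandas DataFrame X (features) and Series Y (labels) are already defined.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []

for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X.iloc[train_idx], Y.iloc[train_idx])           # fit on K-1 folds
    preds = model.predict(X.iloc[test_idx])                   # apply to the held-out fold
    scores.append(accuracy_score(Y.iloc[test_idx], preds))    # score the test fold only

print("Fold accuracies:", np.round(scores, 2))
print("Mean accuracy:  ", round(float(np.mean(scores)), 2))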


 


Classification Algorithms in Python


Consider the Carseats data, on which we apply the above classification algorithms to predict whether carseat sales are high.


Import the libraries and load the dataset with pandas


import pandas as pd
import graphviz
from subprocess import call
from sklearn import tree, metrics

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

Path = "D:\\DSA Course\\Datasets\\R Inbuilt Datasets\\Carseats.csv"
data = pd.read_csv(Path)

Replace all values above 4 in the Sales column with ‘Yes’ and the rest with ‘No’, as this will be our target column to predict. Then separate the feature and label columns, and factorize the categorical columns.


data.loc[data.Sales > 4, 'Sale'] = 'Yes'
data.loc[data.Sales <= 4, 'Sale'] = 'No'   # <= so that values of exactly 4 are also labeled

class_names = data['Sale']
data = data.loc[:,data.columns != 'Sales']
data['Sale'],_ = pd.factorize(data['Sale'])
data['ShelveLoc'],_ = pd.factorize(data['ShelveLoc'])
data['Urban'],_= pd.factorize(data['Urban'])
data['US'],_ = pd.factorize(data['US'])
data.info()

X = data.loc[:,data.columns != 'Sale']
Y = data.Sale

feature_names = X.columns

Divide the data into training and test sets, create a decision tree classifier, and check its accuracy.


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
############## Decision Tree ###########################################
dtree = tree.DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Accuracy: {:.2f}'.format(DT_accuracy))

Output:


[[100  14]
 [  5   1]]

Misclassified samples: 19
DT_Accuracy: 0.84

Decision Tree with Information Gain


dtree = tree.DecisionTreeClassifier(criterion='entropy',random_state=0)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))

DT_Entropy_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Entropy_Accuracy: {:.2f}'.format(DT_Entropy_accuracy))

Output:


[[102  12]
 [  4   2]]
Misclassified samples: 16
DT_Entropy_Accuracy: 0.87

Decision Tree Post-Pruning


dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,max_depth=3)
dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Post_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_Post_Accuracy: {:.2f}'.format(DT_Post_accuracy))

Output:


[[110   4]
 [  6   0]]
Misclassified samples: 10
DT_Post_Accuracy: 0.92

 


Decision Tree Pre-Pruning


dtree = tree.DecisionTreeClassifier(criterion='gini',random_state=0,min_samples_split=10,min_samples_leaf=5)
#min_samples_split === minsplit
#min_samples_leaf === minbucket
#max_depth = depth

dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(metrics.confusion_matrix(y_test, y_pred))
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
DT_Pre_accuracy = metrics.accuracy_score(y_test, y_pred)
print('DT_pre_Accuracy: {:.2f}'.format(DT_Pre_accuracy))

Output:


[[109   5]
 [  5   1]]
Misclassified samples: 10
DT_pre_Accuracy: 0.92

Random Forest Classifier


model = RandomForestClassifier(n_estimators=100)
model.fit(X_train,y_train)

predicted = model.predict(X_test)
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
RF_accuracy = metrics.accuracy_score(y_test, predicted)  # use the forest's predictions, not the earlier y_pred
print('RF_Accuracy: {:.2f}'.format(RF_accuracy))

Output:


[[113   1]
 [  6   0]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       114
           1       0.00      0.00      0.00         6

    accuracy                           0.94       120
   macro avg       0.47      0.50      0.48       120
weighted avg       0.90      0.94      0.92       120

RF_Accuracy: 0.94

Random Forest with Cross-Validation


model = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(model, X, Y, cv=5)

print("RF_CV_Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
RF_CV_Accuracy = scores.mean()
print('RF_CV_Accuracy: {:.2f}'.format(RF_CV_Accuracy))

Output:


RF_CV_Accuracy: 0.91 (+/- 0.02)
RF_CV_Accuracy: 0.91

For the code file, refer here: https://www.cluzters.ai/vault/274/1029/classification-algorithms-code
