Decision Tree Algorithm in Machine Learning

Decision Tree in ML

Decision Tree algorithm belongs to the family of supervised machine learning algorithm. It is simple and popular and is based on choices as suggested by its name. This algorithm works both for continuous as well as choice variables i.e. it can be used for both classification as well as regression problems. The article unfolds in three sections, commencing with a general introduction of the algorithm, followed by the mathematical concepts that builds the blocks for the same and we shall conclude with a code walk through on building a Decision Tree model in Python using Loan approval data set.

1. Introduction

Decision Tree learns decision rules inferred from prior data that we use to train it. The motive is to predict the target variables or predict the class based on these decision rules. As the name suggests, the algorithm solves problems by using a tree representation. Each internal node of the tree are attributes and the leaf nodes corresponds to the class labels.

1.1. Terminology

Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
Leaf / Terminal Node: Nodes do not split is called Leaf or Terminal node.
Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.

Figure 1. A sample Decision Tree

2. Pseudocode and Assumptions
The best attribute should be set a:s the root of the tree.
Split the data set into subsets in such a way that the subset contains the same value as an attribute.
Repeat the above two steps till the leaf nodes of all the branches of the tree is found out.
For predicting a class label for a record we start from the root of the tree. We compare the values of the root attribute with the record's attribute. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node. The process is continued till the leaf node is reached.

We must first assume the whole training set is first taken as the root in the beginning, and the feature values are preferred to be categorical in nature. In case the values are continuous, it is discretized prior to model training. All the entries/records are distributed recursively in the sense that, the distribution will give similar information repeatedly for similar types of records. Decision Trees follow Sum of Product (SOP) representation. The Sum of product (SOP) is also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node having the same class is conjunction (product) of values, different branches ending in that class form a dis-junction (sum).

3. Attribute Selection and Mathematical Elucidations.

Let us suppose that there are “k” number of attributes in the data set and our first step, as discussed above would be to select an attribute which will be set as the root attribute of tree. Random selection of the root and subsequent nodes will be a great idea since the decision of making strategic splits heavily affects a tree’s accuracy. For the same purpose, researchers have come up with several criteria of attribute selection, to name a few:

Entropy
Information Gain
Gini Index

These criterions will calculate values for every attribute. The values are sorted, and attributes are placed in the tree by following the order i.e, the attribute with a high value(in case of information gain) is placed at the root.

3.1. Entropy
The degree of randomness in the information given by an attribute is measured by entropy. More the entropy, it becomes more ambiguous to draw any concrete conclusion from the information. Generally, in the tree a branch with zero entropy is placed at the terminal node and branches with more than zero entropy needs to be split further. Mathematically,

Where, S is the current state, p is the probability of an event i in the state S. For multiple attributes,

3.2. Information Gain

Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. The building of a Decision Tree is based on finding the attribute that has the highest information gain. It computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values.

Where T is the state before split and X is the state after split.

3.3. Gini Index

Gini Index is used to evaluate the splits of the dataset by using a cost function. It is calculated by subtracting the sum of the squared probabilities of each class from 1. Unlike information gain, it favours large partitions and is simpler to implement and to be noted that it performs binary splits only. Higher the value of Gini index higher the homogeneity. Mathematically,

4. Decision Tree using Python

As discussed previously, here we shall build a simple decision tree model in this section. For the same we have used the loan approval dataset. The dataset could be used to do Classification as well as regression, although however we shall use Decision Tree classifier to predict whether loan should be approved or not.

###Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

Read the data:

#import the data

df = pd.read_csv("train.csv")
df.info()

Python gives output as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID 614 non-null object
Gender 601 non-null object
Married 611 non-null object
Dependents 599 non-null object
Education 614 non-null object
Self_Employed 582 non-null object
ApplicantIncome 614 non-null int64
CoapplicantIncome 614 non-null float64
LoanAmount 592 non-null float64
Loan_Amount_Term 600 non-null float64
Credit_History 564 non-null float64
Property_Area 614 non-null object
Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB

The data has attributes as shown above and is pretty much self explanatory in nature and with the help of which we will be predicting the Loan_Status, whether loan will be approved or not.
As we see, there is missing data in the set. One way is to either drop the variable but that comes with a cost of losing a lot of degrees of freedom. Here, we will use simplistic methods to impute the missing values in the data. It is not necessary that we go by such preprocessing methods as shown here. For simplicity let us do the following steps to modify the data.

#Data Preprocessing

df["Gender"].fillna("unknown",inplace = True)
df["Married"].fillna("unknown",inplace = True)
df["Credit_History"].fillna(df.Credit_History.mean(),inplace = True)
df["Dependents"].fillna(0,inplace = True)
df["LoanAmount"].fillna(df.LoanAmount.mean(),inplace = True)
df["Self_Employed"].fillna("No",inplace = True)
df["Loan_Amount_Term"].fillna(round(df.Loan_Amount_Term.mean()),inplace=True)

As shown above, missing gender and marital status is filled with “unknown”, Credit History and loan amount and the loan amount terms is filled in with the mean of the variables. Whether the person is self employed is given “No” and number of dependents has been given 0 if the data was missing. After the preprocessing, we have:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID 614 non-null object
Gender 614 non-null object
Married 614 non-null object
Dependents 614 non-null object
Education 614 non-null object
Self_Employed 614 non-null object
ApplicantIncome 614 non-null int64
CoapplicantIncome 614 non-null float64
LoanAmount 614 non-null float64
Loan_Amount_Term 614 non-null float64
Credit_History 614 non-null float64
Property_Area 614 non-null object
Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB

All the fields are non-null , we will go ahead with some more feature extraction.

In[]: df["Dependents"]= df["Dependents"].map({"0":int(0),"1":int(1),"2":int(2),"3+":int(3)})

Dependents is the number of people dependent on the loan applicant. Since they are ordinal in nature we shall just map it to integer numbers.

Let us get the summary statistics of the data.

df.describe()

A simple does that for you in python, its given as follows:

	Dependents	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Loan_Status
count	599	614	614	614	614	614	614
mean	1	5403	1621	146	342	1	0
std	1	6109	2926	84	64	0	0
min	0	150	0	9	12	0	0
0	0	2878	0	100	360	1	0
1	0	3813	1189	129	360	1	0
1	2	5795	2297	165	360	1	1
max	3	81000	41667	700	480	1	1

Loan status as discussed before is the target variable and is labelled in 0s and 1s.

For training and validating the model let’s split the data into training and testing set.

X_train , X_test , y_train , y_test =train_test_split(X,y,test_size=0.33, random_state=42)

We are keeping the test size as 1/3rd of the total size of the whole data set. As a next step we will create the model, first with the “Gini” criterion, then with the Entropy criterion. Then we will fit the model to the training data and then predict for the test data. The code is simple and is as follows:

Model using Gini Criterion

clf_gini = DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=3, min_samples_leaf=5)

In[] : clf_gini.fit(X_train, y_train)

Out[]: DecisionTreeClassifier(class_weight=None, criterion='gini'

,max_depth=3,

max_features=None, max_leaf_nodes=None,

min_impurity_decrease=0.0, min_impurity_split=None,

min_samples_leaf=5, min_samples_split=2,

min_weight_fraction_leaf=0.0, presort=False, random_state=100,

splitter='best')

y_pred = clf_gini.predict(X_test)

y_pred in the above code is the predicted values of the target.

Let’s evaluate the model.

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ",accuracy_score(y_test,y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))

Gives the following output:

Confusion Matrix: [[129 2]
[ 48 24]]

Accuracy : 75.36945812807882
Report : precision recall f1-score support

0 0.73 0.98 0.84 131
1 0.92 0.33 0.49 72

avg / total 0.80 0.75 0.71 203

The accuracy with the model is 75%. Let’s check what happens if we use the entropy criterion.

Model with Entropy Criterion

clf_ent = DecisionTreeClassifier(criterion = "entropy",random_state = 42,
max_depth=3, min_samples_leaf=10)

clf_ent.fit(X_train, y_train)

Out[] : DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=10, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=42,
splitter='best')

y_pred = clf_ent.predict(X_test)

y_pred is defined as before. The evaluation report is given as below:

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ",accuracy_score(y_test,y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))

The reports are given as below:

Confusion Matrix: [[129 2]
[ 39 33]]
Accuracy : 79.80295566502463
Report : precision recall f1-score support

0 0.77 0.98 0.86 131
1 0.94 0.46 0.62 72

avg / total 0.83 0.80 0.78 203

As we see that, using “Entropy” gives a better accuracy than the “Gini” model.

Comments

Unknown27 February 2020 at 15:30
As reported by Stanford Medical, It's indeed the ONLY reason this country's women get to live 10 years longer and weigh 19 KG less than us.

(And actually, it is not related to genetics or some secret exercise and absolutely EVERYTHING to do with "HOW" they eat.)

P.S, What I said is "HOW", not "what"...

Click on this link to uncover if this short questionnaire can help you release your true weight loss potential
360DigiTMGAurangabad29 July 2021 at 02:25
This is my first time visit here. From the tremendous measures of comments on your articles.I deduce I am not only one having all the fulfillment legitimately here!
artificial intelligence training aurangabad
Ramesh Sampangi22 October 2021 at 14:51
Join now for the intense Machine Learning Training in Hyderabad program at AI Patasala to become an early leader in this trending platform.
Machine Learning Institute in Hyderabad

Search This Blog

RamaKavanan