How to build a decision tree

Raghavi_bala
6 min read · Mar 20, 2022

How do you build a tree? Wait, don't trees grow on their own? Well, decision trees don't. What sort of trees are these? And what kind of fruits or flowers do they produce?

Wait, it's not the kind of tree you think. It's a classification Machine Learning model that uses decisions and their possible consequences to arrive at a conclusion.

Simply put, at its base it's a nested if-else model. "If it is Saturday, Ria will turn up late" is a simple conditional statement. You can use an if statement to check whether it is Saturday or not, and conclude whether she'll turn up late or not. There you go, you just built a basic decision tree.
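
As a minimal sketch (the day check and the name Ria are just the illustrative example above), that single condition looks like this in Python:

```python
def will_ria_be_late(day: str) -> bool:
    # A one-condition "decision tree": a single if-else on the day of the week.
    if day == "Saturday":
        return True   # "If it is Saturday, Ria will turn up late"
    else:
        return False  # Any other day, we predict she is on time

print(will_ria_be_late("Saturday"))  # True
print(will_ria_be_late("Monday"))    # False
```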

But Machine Learning involves using multiple such attributes/conditions (features) to classify new data points. We're going to dive deep into how that is done.

There are a few terms you should be familiar with before learning to build decision trees.

  1. Entropy
  2. Information Gain
  3. Kullback-Leibler Divergence
  4. Gini Impurity

Okay now don’t freak out, I’ll explain each of these properly in detail.

This article will be in two parts: the first will explain all the necessary terms, and in the second part we will look deeper into actually constructing a decision tree.

Let’s dive in with no further delay.

If you're a science student, you'd have heard the word Entropy in physics; this is similar to that. Entropy in Information Theory is derived from Entropy in physics.

In simple words, Entropy is defined as the uncertainty or unpredictability in predicting the output of a variable.

Imagine we're tossing a coin 4 times, and the output is [H, T, T, H]. The number of times heads and tails have appeared is equal. If I ask you to predict what the outcome will be if I toss the coin again, you won't be able to predict it. A distribution like this has maximum entropy, that is, the uncertainty of the next outcome is high. Whereas if I toss a coin 4 times and the output is [H, T, H, H], we can now see that there are more heads than tails, as this coin is biased. So when I toss the coin again, it's most likely to be a head. Hence the unpredictability or uncertainty of this distribution is very low.

We have a formula to calculate entropy.

Let Y => y1, y2, y3, …, yk be a random variable.

The Entropy of Y is denoted as H(Y), and the formula is:

H(Y) = - Σ P(yi) · log2(P(yi)), where the sum runs from i = 1 to k,

and P(yi) is the probability of yi occurring.
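
As a rough sketch of this formula, here is a small Python helper that computes the entropy of the two coin-toss sequences above (using a base-2 log):

```python
from collections import Counter
from math import log2

def entropy(outcomes):
    """H(Y) = -sum(P(yi) * log2(P(yi))) over the observed values of Y."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(entropy(["H", "T", "T", "H"]))  # 1.0   -> fair-looking coin, maximum uncertainty
print(entropy(["H", "T", "H", "H"]))  # ~0.81 -> biased towards heads, less uncertainty
```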

Our main goal here is to construct a decision tree, so how is this relevant to that?

Well, we calculate the Entropy of each feature and decide which feature to use so that the Entropy is less. We want minimum Entropy so that we have maximum Information Gain.

Now, what is Information Gain, and why do we desire high Information Gain?

Like we saw in the coin toss example, when the coin produced an equal number of heads and tails, the entropy, that is, the unpredictability, was high. When the unpredictability of a random variable is high, its information gain will be less. Information gain can simply be defined as the amount of information you get from a variable: with an unbiased coin, the information gain will be less, as we can't predict the next outcome.

The Information Gain is calculated for each split using the Entropy of the parent; the parent Entropy is simply the Entropy of the target variable before the split.

To choose the feature on which the split should occur, we calculate the Information Gain for each candidate feature and pick the one with the maximum Information Gain.

I will give an example and cover all this in the next part. For now let’s just focus on understanding the meaning of these terms.

Let Y => y1, y2, y3, …, yk be the target variable; in binary classification it'll be Yes/No for each event.

The formula of Information Gain also includes a weighted sum over the child nodes: for each child node produced by the split, we multiply its Entropy by the fraction of the parent's samples that fall into that child, and add these products up. The Information Gain is then the parent's Entropy minus this weighted sum: IG = H(parent) - Σ (n_child / n_parent) · H(child).

So, we need to find a feature that, when we split on it, produces less Entropy, or in other words, high Information Gain.
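
To make that weighted sum concrete, here is a minimal sketch reusing the entropy helper from the earlier snippet; the split on "Is it Saturday?" and the labels are made up purely for illustration:

```python
def information_gain(parent_labels, children_labels):
    """IG = H(parent) - sum over children of (|child| / |parent|) * H(child)."""
    n = len(parent_labels)
    weighted = sum((len(child) / n) * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

# Hypothetical split on "Is it Saturday?" for the target "late" (Yes/No)
parent = ["Yes", "Yes", "Yes", "No", "No", "No"]
saturday = ["Yes", "Yes", "Yes"]      # rows where the day is Saturday
not_saturday = ["No", "No", "No"]     # rows where it is not
print(information_gain(parent, [saturday, not_saturday]))  # 1.0 -> a perfect split
```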

Now that we have learnt two terms, let's move on to the third.

Kullback-Leibler divergence, hard to pronounce? We call it KL divergence.

This topic might seem a little irrelevant, but stay with me.

The KL divergence score quantifies the difference between two given probability distributions. If two Probability Density Functions are given, KL divergence is used to compute the difference between them.

This is what the KS Statistic does too, so why do we need KL divergence? Because unlike the KS Statistic, KL divergence is differentiable. In ML, a function needs to be differentiable, or else we won't be able to optimize it.

KL Divergence is essentially the same idea as Information Gain, and that is why we use it in decision trees. It is the amount of information you can gain about a particular random variable by observing another random variable. The Math behind KL Divergence is not of our concern here.

But in the case of decision trees, we can understand KL Divergence as a synonym for Mutual Information, which is the conditional expected value of the KL Divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other.

Complicated sentence, right? Classic Wikipedia.

To put it simply, KL divergence here is the mutual information, that is, the amount of information obtained about one variable by observing the other one.

That is, imagine I have given you a conditional distribution; from this you'll get the probabilities of Y occurring given that X has occurred. From this you can now predict what the next expected output of the probability distribution of Y will be. Thus KL divergence helps us predict the conditional expected output of Y.

The formula or Math behind KL Divergence and KS Statistics is not going to be used here. If you're into Math, you can go ahead and learn it for sure.

But remember that the main reason for using KL Divergence over KS Stats is that KL Divergence is differentiable and hence can be used for optimization.
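
Purely for intuition, here is a tiny sketch of the standard discrete KL divergence formula, D(P || Q) = Σ P(i) · log2(P(i) / Q(i)), on two made-up coin distributions:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair_coin   = [0.5, 0.5]    # P
biased_coin = [0.75, 0.25]  # Q
print(kl_divergence(fair_coin, biased_coin))  # ~0.21 bits -> the distributions differ
print(kl_divergence(fair_coin, fair_coin))    # 0.0 -> identical distributions
```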

Gini Impurity, last but not the least. It's so important that Sklearn actually uses Gini Impurity as the default splitting criterion instead of Entropy, due to its computational efficiency.
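
For instance, in scikit-learn the splitting criterion is just a parameter of DecisionTreeClassifier; a minimal sketch on the toy Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "gini" is the default criterion; "entropy" is also available if you prefer Information Gain.
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_tree.score(X, y), entropy_tree.score(X, y))
```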

Gini Impurity is essentially the same as Entropy, except for the difference in scale. The maximum value of Gini Impurity is 0.5 (for two classes), while the maximum value of Entropy is log n, where n is the cardinality of the random variable.

As I mentioned before, Gini Impurity is all about efficiency.

This is the formula of Gini Impurity: Gini(Y) = 1 - Σ P(yi)^2, where the sum runs from i = 1 to k. Notice the difference between the notation of Information Gain and Gini Impurity (Gini Impurity is often written I_G); it's not the same IG.

If you notice, the Entropy formula has a log function in it, which is a bit computationally expensive, especially when k is large. Hence we use Gini Impurity for computational efficiency.
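
Here is a rough sketch of that formula in plain Python, using the same coin-toss style examples as before:

```python
from collections import Counter

def gini_impurity(outcomes):
    """Gini(Y) = 1 - sum(P(yi)^2) over the observed values of Y."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["H", "T", "T", "H"]))       # 0.5 -> maximum for two classes
print(gini_impurity(["H", "H", "H", "H", "H"]))  # 0.0 -> a pure node
```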

Side note: Gini Impurity and Entropy for a pure node are both 0.

A pure node is like when we toss a coin 5 times and it comes up heads all five times, that is, when every outcome is the same. In simple terms, it's not corrupted; as the name suggests, it's pure.

If you're having a hard time connecting all this and making sense of how these are useful in building a decision tree,

Just sit back and relax!

We'll find out in PART 2.
