Demystifying ID3: Entropy and Information Gain in Machine Learning
Decision trees are among the most intuitive algorithms in machine learning. They mimic human decision-making by breaking down complex datasets into a series of simple, logical choices.
At the heart of early decision tree learning is the ID3 (Iterative Dichotomiser 3) algorithm, developed by Ross Quinlan in 1986. To understand how ID3 builds a tree, you must understand two core concepts from information theory: Entropy and Information Gain. The Core Challenge: How to Split?
Imagine you are building a decision tree to predict whether a person will play tennis based on weather conditions (Outlook, Temperature, Humidity, and Wind).
Your root node contains the entire dataset. To build an efficient tree, you need to choose the single best attribute to split your data first. If you choose poorly, your tree will become deep, complex, and inefficient.
ID3 solves this problem mathematically by measuring the “purity” of the resulting splits. It uses Entropy to quantify disorder and Information Gain to select the best feature. 1. Entropy: Measuring Disorder
In information theory, Entropy is a metric that measures the amount of uncertainty, randomness, or impurity in a dataset. The Intuition
High Entropy: If a dataset contains a ⁄50 split of target classes (e.g., 5 “Play Tennis” and 5 “Don’t Play”), it is completely impure and unpredictable. Entropy is at its maximum value of 1.
Low Entropy: If a dataset contains only one class (e.g., 10 “Play Tennis” and 0 “Don’t Play”), it is perfectly pure. There is no uncertainty. Entropy is 0. The Formula
For a binary classification problem, the mathematical formula for Entropy of a dataset
H(S)=−p+log2(p+)−p−log2(p−)cap H open paren cap S close paren equals negative p sub positive end-sub log base 2 of open paren p sub positive end-sub close paren minus p sub negative end-sub log base 2 of open paren p sub negative end-sub close paren p+p sub positive end-sub is the proportion of positive examples. p−p sub negative end-sub is the proportion of negative examples.
The algorithm uses a base-2 logarithm because it measures information in “bits.” 2. Information Gain: Calculating the Payoff
While Entropy tells us how messy our current data is, Information Gain (IG) tells us how much cleaner the data will get if we split it using a specific attribute.
Information Gain measures the reduction in entropy after a dataset is split. ID3 calculates the Information Gain for every available attribute, and the attribute with the highest Information Gain is chosen as the splitting node. The Formula The Information Gain of an attribute relative to a dataset is calculated as:
IG(S,A)=H(S)−∑v∈Values(A)|Sv||S|H(Sv)cap I cap G open paren cap S comma cap A close paren equals cap H open paren cap S close paren minus sum over v is an element of Values open paren cap A close paren of the fraction with numerator the absolute value of cap S sub v end-absolute-value and denominator the absolute value of cap S end-absolute-value end-fraction cap H open paren cap S sub v close paren is the entropy of the current dataset. represents all possible distinct values of attribute (e.g., Sunny, Overcast, Rainy). Svcap S sub v is the subset of where attribute has the value
|Sv||S|the fraction with numerator the absolute value of cap S sub v end-absolute-value and denominator the absolute value of cap S end-absolute-value end-fraction
acts as a weight, representing the proportion of items in the subset relative to the original dataset.
In simple terms: Information Gain = (Entropy Before Split) – (Weighted Entropy After Split). Step-by-Step ID3 Walkthrough
The ID3 algorithm builds the decision tree using a greedy, top-down approach through the following steps:
Calculate the Total Entropy of the target variable for the current dataset. Split the Dataset using every available attribute.
Calculate the Weighted Entropy for each attribute’s resulting branches.
Subtract the branch entropy from the total entropy to find the Information Gain for each attribute.
Pick the Winner: Choose the attribute with the highest Information Gain to become the decision node.
Repeat Recursively: Create sub-nodes with the remaining attributes and repeat the process for branches that are not yet “pure.”
The algorithm stops when all samples in a branch belong to the same class, or when there are no more attributes left to split on. Limitations of ID3
While ID3 revolutionized early machine learning, it has distinct limitations that led to the development of newer algorithms like C4.5 and CART:
Bias Toward High-Cardinality Attributes: Information Gain naturally favors attributes with a large number of unique values. For example, if your dataset includes a “Customer ID” column, splitting on it creates many pure, single-item branches. The Information Gain will be maximum, but the resulting tree is completely useless for predicting unseen data.
No Support for Continuous Data: ID3 natively handles only categorical data (like “Sunny” or “Rainy”). It cannot directly process continuous numerical values (like a temperature of 23.5°C) without prior binning.
Overfitting: Because ID3 grows the tree until branches are perfectly pure, it often creates overly complex trees that memorize the training data but fail to generalize to new data.
The ID3 algorithm demystifies how machines can learn rules from raw data. By combining the concepts of Entropy (identifying chaos) and Information Gain (seeking clarity), ID3 provides a mathematically elegant framework for structured decision-making.
Understanding these foundational concepts is essential, as they remain the core building blocks for advanced tree-based models used today, including Random Forests and Gradient Boosted Trees.
If you would like to explore this topic further, tell me if you want to: See a coded Python example implementing ID3 from scratch.
Learn how the C4.5 algorithm fixes the high-cardinality bias using Gain Ratio.
Work through a complete numerical example with a sample dataset.