Introduction: Decision Trees and the Need for Purity

Decision trees are one of the most intuitive and widely used supervised learning algorithms in machine learning. They model decisions as a tree structure, where internal nodes represent tests on features, branches represent outcomes of those tests, and leaf nodes represent final predictions. Whether you are classifying whether an email is spam or predicting house prices, decision trees offer a transparent, human-readable approach.

The core challenge in building a decision tree is deciding where to split the data at each node. The algorithm must choose the feature and split value that best separates the target classes. This is where entropy comes in. Entropy, borrowed from information theory, provides a mathematical measure of uncertainty or impurity in a dataset. By minimizing entropy after each split, decision trees create increasingly homogeneous subsets, leading to accurate and efficient models.

What Is Entropy? A Measure of Disorder

In everyday language, entropy refers to randomness or chaos. In the context of decision trees, entropy quantifies the amount of unpredictability in a dataset with respect to the target variable. If all examples in a node belong to the same class, the node is pure and its entropy is zero. Conversely, if the classes are evenly mixed, entropy reaches its maximum.

For a binary classification problem (e.g., positive vs. negative), entropy is defined as:

Entropy = –p⁺ log₂(p⁺) – p⁻ log₂(p⁻)

where p⁺ is the proportion of positive examples and p⁻ = 1 – p⁺. The logarithm base 2 is used because information in bits is measured in binary. When there are more than two classes, the formula generalizes to:

Entropy = –Σ pᵢ log₂(pᵢ) for all classes i.

The resulting value ranges from 0 (perfectly pure) to log₂(k) for k classes (maximum impurity). For a binary case, maximum entropy is 1.0 when p⁺ = p⁻ = 0.5.

A Quick Example

Consider a dataset of 10 samples with 5 positives and 5 negatives. Entropy = –0.5 log₂(0.5) – 0.5 log₂(0.5) = –0.5 * (–1) – 0.5 * (–1) = 0.5 + 0.5 = 1.0. Now consider a dataset with 9 positives and 1 negative: entropy = –0.9 log₂(0.9) – 0.1 log₂(0.1) ≈ –0.9 * (–0.152) – 0.1 * (–3.322) ≈ 0.137 + 0.332 = 0.469. The second dataset is much more predictable.

Why Base 2?

The choice of base 2 is rooted in Claude Shannon’s information theory. A bit is the fundamental unit of information, representing a binary choice. Using base 2 means entropy gives the average number of bits needed to encode the class of a random sample. If you already know the distribution, lower entropy means fewer bits are required to communicate the outcome.

Information Gain: How Entropy Guides Splits

Simply calculating entropy is not enough; the goal is to reduce it after splitting. Information gain (IG) measures the expected reduction in entropy caused by partitioning the data according to a feature. The feature and split value that yield the highest information gain are chosen for the node.

The formula for information gain is:

Information Gain = Entropy(parent) – Σ (|Sᵢ| / |S|) * Entropy(Sᵢ)

where S is the parent dataset, Sᵢ are the child subsets after the split, and |.| denotes the number of samples. The sum is a weighted average of the children’s entropies.

Worked Example

Imagine a parent node with 30 samples: 16 class A and 14 class B. Entropy(parent) = –(16/30) log₂(16/30) – (14/30) log₂(14/30) ≈ 0.996.

Now consider a split on Feature X that creates two children: Child1 has 20 samples (15 A, 5 B) → entropy = –0.75 log₂(0.75) – 0.25 log₂(0.25) ≈ 0.811; Child2 has 10 samples (1 A, 9 B) → entropy = –0.1 log₂(0.1) – 0.9 log₂(0.9) ≈ 0.469. Weighted child entropy = (20/30)*0.811 + (10/30)*0.469 ≈ 0.541 + 0.156 = 0.697. Information gain = 0.996 – 0.697 = 0.299.

If another split yields higher IG, that split is preferred. The algorithm evaluates all features and possible split thresholds to find the best one.

Limitations of Information Gain

Information gain tends to favor features with many distinct values (e.g., a unique ID column) because splitting on such a feature creates many pure children, yielding high IG. This can lead to overfitting. To counter that, variants like Gain Ratio (used in C4.5) normalize IG by the intrinsic information of the split. Another approach is to use the Gini impurity, which is computationally cheaper and often yields similar results.

Comparing Entropy with Gini Impurity

Gini impurity is an alternative splitting criterion used in the CART algorithm (Classification and Regression Trees). It measures the probability of misclassifying a randomly chosen sample if it were labeled randomly according to the class distribution in the node. The formula:

Gini = 1 – Σ pᵢ²

For a binary case, Gini = 2p⁺(1 – p⁺). Maximum Gini is 0.5 (balanced classes) and minimum is 0 (pure).

Both entropy and Gini impurity are convex functions, meaning they behave similarly in practice. The choice between them often comes down to computational efficiency: Gini does not require logarithms, so it can be slightly faster. However, entropy has a stronger information-theoretic justification. Many libraries, including scikit-learn, you can choose either; empirically, differences are small.

Entropy in Regression Trees

Decision trees can also solve regression problems (predicting continuous values). In regression, entropy is not appropriate because the target is not categorical. Instead, the algorithm uses variance reduction or mean squared error (MSE) as the splitting criterion. The idea is analogous: at each node, we split to minimize the weighted sum of variances of the child nodes. This maximizes the homogeneity of target values in each region.

For regression, the quantity is often called mean squared error reduction or total variance reduction. The principle is exactly the same as information gain: measure the impurity (variance) of the parent, then the weighted average of children, and maximize the difference.

Building a Complete Decision Tree: From Root to Leaf

Now that we understand entropy and information gain, let’s walk through how a typical decision tree learning algorithm (like ID3, C4.5, or CART) builds a tree:

  1. Start with the entire dataset at the root node.
  2. Calculate the impurity of the root using entropy (for classification) or variance (for regression).
  3. For each feature, evaluate every possible split point (for numerical features, sort values and consider midpoints between consecutive distinct values; for categorical features, consider subsets or one-hot encoding).
  4. Calculate information gain (or gain ratio, Gini reduction, etc.) for each split.
  5. Choose the split that yields the highest gain.
  6. Partition the data and recursively repeat steps 2–5 for each child node.
  7. Stopping criteria prevent infinite growth: maximum depth, minimum samples per leaf, minimum impurity decrease, or when all samples in a node belong to one class.
  8. Prune the tree (either pre-pruning via hyperparameters or post-pruning by cutting back branches that don’t improve performance on a validation set) to combat overfitting.

Handling Categorical and Numerical Features

Entropy-based splitting works for both types of features, but the approach differs:

  • Numerical features: The algorithm sorts the unique values and tests each possible threshold. For efficiency, it often only considers thresholds between consecutive sorted values where the class label changes.
  • Categorical features: For binary splits, the algorithm may consider grouping categories into two subsets. For multi-way splits (as in ID3), each category becomes a branch. However, multi-way splits fragment data quickly and are prone to overfitting, so most modern implementations use binary splits even for categorical features.

Handling Missing Values

Real-world datasets often contain missing values. Decision trees can handle them in several ways:

  • Surrogate splits: When splitting on a feature, a backup feature that best mimics the split is used for samples missing the primary feature.
  • Fractional instances: Assign a sample to multiple children with weights proportional to the probability of each child based on non-missing data.
  • Simple imputation: Replace missing values with the mode or median before building the tree.

Many libraries, such as scikit-learn, do not handle missing values internally and expect them to be imputed beforehand. XGBoost and LightGBM, however, learn the best direction for missing values during training.

Overfitting and Pruning

A decision tree grown to maximum depth will perfectly memorize the training data, including noise, leading to poor generalization. Entropy reduction continues until each leaf is pure, but this rarely benefits test performance. Two main strategies control overfitting:

Pre-pruning (Early Stopping)

Stop tree growth before it overfits by applying constraints: limit maximum depth, require a minimum number of samples per leaf, or require a minimum reduction in impurity (e.g., entropy decrease must be > 0.01). These hyperparameters are tuned using cross-validation.

Post-pruning (Cost-Complexity Pruning)

Grow the tree fully, then remove branches that add little value. The algorithm considers a trade-off between tree complexity (number of leaves) and training error. A complexity parameter (alpha) penalizes additional leaves. Scikit-learn’s DecisionTreeClassifier offers cost-complexity pruning via ccp_alpha.

Both pruning techniques help ensure that entropy-driven splits are not too granular and that the tree remains interpretable while generalizing well.

Entropy in Ensemble Methods

While a single decision tree can be unstable (small changes in data can result in a very different tree), entropy remains a foundational concept in ensemble methods:

  • Random Forests: Build many trees using bootstrap samples and random feature subsets. Each tree typically uses entropy or Gini to split. The forest averages predictions, reducing variance.
  • Gradient Boosting: Trees are built sequentially to correct errors of previous trees. Entropy is used as the objective (via cross-entropy loss) for classification forests in libraries like XGBoost.

Understanding entropy helps interpret why a particular split was chosen in any individual tree, which is essential for model debugging and feature importance analysis.

Practical Considerations When Using Entropy

First, compute entropy using logarithms carefully — avoid undefined log(0) by defining 0 log₂(0) as 0. Second, be aware that entropy calculations are sensitive to class imbalance; a node with 99% one class and 1% another has low entropy but may not indicate a good split if the minority class is important. In that case, weighting classes or using alternative metrics (e.g., F1) for evaluation is advisable.

Also, decision trees with entropy can be memory-intensive for large datasets because they evaluate all features and split points. Libraries use algorithms like sort-and-scan to compute entropy for numerical features in O(n log n) time.

External references for deeper reading:

Beyond Classification: Entropy and Information Gain in Feature Selection

Entropy is not only used inside decision trees — it also powers feature selection techniques. Mutual information between feature and target is directly related to information gain. You can rank features by their mutual information to reduce dimensionality before training other models. This is a non-linear alternative to correlation analysis.

For example, if feature X has high mutual information with target Y, then knowing X substantially reduces the uncertainty about Y. This is exactly the reduction in entropy achieved by splitting on X. Libraries like scikit-learn provide mutual_info_classif and mutual_info_regression.

Limitations of Entropy-Based Decision Trees

Despite their power, decision trees built with entropy have some drawbacks:

  • Instability: Small dataset variations can drastically change the tree structure. Ensembles mitigate this.
  • Bias toward features with many levels: Information gain favors high-cardinality features. Gain ratio or using only binary splits helps.
  • Poor handling of additive structure: Trees are piecewise constant models, so they struggle to learn linear relationships.
  • Greedy nature: The algorithm makes locally optimal splits, which may not be globally optimal.

In practice, combining entropy-based decision trees with proper hyperparameter tuning and ensemble methods yields robust models for many tabular datasets.

Conclusion: Entropy as a Foundation for Insightful Splits

Entropy provides a principled, information-theoretic way to evaluate the quality of a split when constructing a decision tree. By measuring the disorder in a dataset and aiming to reduce it at each step, we can build trees that efficiently and accurately partition the feature space. Whether you are a student learning machine learning or a practitioner deploying models, understanding entropy deepens your grasp of how decision trees “think.” It also connects to broader concepts like mutual information, feature selection, and even data compression.

As you apply decision trees, remember that entropy is a tool — not an end. Pair it with proper validation, pruning, and ensemble techniques to unlock its full potential. And if you are managing data pipelines for machine learning, tools like Directus can help you collect, organize, and serve the high-quality datasets that decision trees depend on.