Decision Tree Entropy

These are some basic notes on how entropy is used when making decision trees. The examples are taken from Process Mining: Data Science In Action.


The equation for entropy is \(E = - \sum\limits_{i=1}^n p_i \log_2 p_i\), where \(p_i\) is the probability of variable \(i\). In other words, for variable \(i\), \(p_i\) is the count of instances for that variable divided by the total number of instances for all variables. Well, I'm probably not saying this clearly enough. A concrete example might be better.

Dying Young

This example uses a data set that contains various attributes that might predict if someone died 'young' (less than 70) or 'not young' (70 or older). There are 860 entries with 546 dying young and 314 dying old. We can calculate the entropy for the root node using the proportions of \(young\) (died young) and \(\lnot young\) (didn't die young).

\[ E = -(E_{young} + E_{\lnot young})\\ = -(\frac{546}{860} \log_2 \frac{546}{860} + \frac{314}{860} \log_2 \frac{314}{860})\\ \approx 0.9468 \]