Consider the decision trees shown in Figure 43 Assume they a
Consider the decision trees shown in Figure 4.3. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length principle.
• The total description length of a tree is given by: Cost(tree, data) = Cost(tree) + Cost(data|tree).
Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.
Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.
Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
Cost(data|tree) is encoded using the classification errors the tree com- mits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.
Which decision tree is better, according to the MDL principle?
C. (a) Decision tree with 7 errors (b) Decision tree with 4 errors Figure 4.3. Decision trees for Exercise 9.Solution
Because there are 16 attributes, the cost for each internal node in the decision tree is:
log2(m) = log2(16) = 4.
Futhermore, because there are 3 classes, the cost for each leaf node is:
[log2(k)] = [log2(3)] = 2.
The cost for each misclassification error is log2(n).
The overall cost for the decision tree (a) is
2*4 + 3*2 + 7*log2n = 14+7log2n and the overall cost for the decision tree (b) is
4*4 + 5*2 + 4*5 = 26 + 4 log2n. According to the MDL principle, tree (a) is better than (b)
if n < 16 and is worse than (b) if n >16.
