At any point in the recursive process, a partition is defined by two
quantities: the split variable $j$ ($1 \le j \le J$) and the split
point $s$ ($-\infty < s < \infty$). For any given partition $R$ of a
tree $T$, there are at most $N \times J$ possible binary
subpartitions, which can be exhaustively searched. We choose the one
that minimizes the ``node impurity'' $Q_m(T)$. Let $\hat{p}_{mk}$ be
the proportion of class $k$ observations in node $m$; then
\[
  Q_m(T) = \sum_{k} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right).
\]
This particular form of $Q_m(T)$ is called the ``Gini index of
diversity'', but alternatives exist.
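As a concrete illustration, the following minimal NumPy sketch (the
helpers \texttt{gini} and \texttt{best\_split} are our own naming,
not taken from any particular library) implements both the Gini index
and the exhaustive search over the at most $N \times J$ candidate
splits, keeping the split that minimizes the size-weighted impurity
of the two child nodes:
\begin{verbatim}
import numpy as np

def gini(y):
    """Gini index of diversity, sum_k p_mk * (1 - p_mk), for the
    class labels y that fall in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def best_split(X, y):
    """Exhaustively search the binary subpartitions of the N x J
    design matrix X; return the (j, s) pair minimizing the
    size-weighted Gini index of the two child nodes."""
    N, J = X.shape
    best = (None, None, np.inf)      # (j, s, weighted impurity)
    for j in range(J):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            if left.all():           # degenerate split, skip
                continue
            q = (left.sum() * gini(y[left])
                 + (~left).sum() * gini(y[~left])) / N
            if q < best[2]:
                best = (j, s, q)
    return best
\end{verbatim}
With $N$ observations and $J$ variables this amounts to $O(NJ)$
candidate splits per node, which is what makes the exhaustive search
feasible.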
Note that if we simply used the ``misclassification error'' (i.e.,
the fraction of misclassified data points resulting from a candidate
split) as the measure of node impurity, it would be impossible to
partition the data set shown in Figure 1, because any initial split
would misclassify all the triangles and therefore yield the same
misclassification error ($=23/63$ in this case). The recursive
partitioning process continues until all the end-nodes are ``pure'',
i.e., until all observations in a node belong to the same class.
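This failure mode of the misclassification error is easy to reproduce
numerically. Below is a toy comparison in the same style as the
sketch above, reusing \texttt{gini} from it (the class counts are a
standard textbook example, not the data of Figure 1): two candidate
splits that tie under the misclassification error are cleanly
separated by the Gini index.
\begin{verbatim}
def misclassification_error(y):
    """Fraction of observations outside the node's majority class."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def weighted(impurity, children):
    """Size-weighted impurity of a candidate split."""
    n = sum(len(c) for c in children)
    return sum(len(c) * impurity(c) for c in children) / n

# Parent node: 400 observations of each of two classes (0 and 1).
# Split A -> children (300, 100) and (100, 300);
# split B -> children (200, 400) and (200, 0).
split_A = [np.r_[np.zeros(300), np.ones(100)],
           np.r_[np.zeros(100), np.ones(300)]]
split_B = [np.r_[np.zeros(200), np.ones(400)], np.zeros(200)]

for name, split in [("A", split_A), ("B", split_B)]:
    print(name, weighted(misclassification_error, split),
          weighted(gini, split))
# Both splits yield misclassification error 0.25, but the Gini index
# (A: 0.375, B: ~0.333) prefers split B, whose second child is pure.
\end{verbatim}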
The maximum-sized tree thus obtained perfectly describes the training
data. In other words, it has zero bias. However, for the purpose of
prediction, this tree is not optimal, because it overfits the training
data, causing high variance. The tree with optimal predictive power
will be smaller than the largest possible tree, and can be found by
``cost-complexity pruning''.
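Cost-complexity pruning, in its standard formulation, selects the
subtree $T \subseteq T_{\max}$ minimizing
\[
  C_\alpha(T) = \sum_{m=1}^{|T|} N_m\, Q_m(T) + \alpha\, |T|,
\]
where $|T|$ is the number of terminal nodes, $N_m$ the number of
observations in node $m$, and $\alpha \ge 0$ controls the trade-off
between tree size and fit to the training data; $\alpha$ is typically
chosen by cross-validation. The sketch below shows one concrete
realization using scikit-learn (the iris data merely stand in for an
actual training set):
\begin{verbatim}
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The critical alphas at which the optimal pruned subtree changes.
tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

# Keep the alpha whose pruned tree cross-validates best.
scores = [cross_val_score(
              DecisionTreeClassifier(random_state=0, ccp_alpha=a),
              X, y).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=best_alpha).fit(X, y)
\end{verbatim}
The cross-validated score is usually maximized at an intermediate
value of $\alpha$, i.e., by a tree strictly smaller than the maximal
one, in line with the bias/variance argument above.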