Define the ``cost-complexity criterion'' of a tree $T$ as
\[
cp_{\alpha}(T) = \sum_{m=1}^{|T|} N_m \, Q_m(T) + \alpha \, |T|,
\]
with $|T|$ the number of terminal nodes in $T$, $N_m$ the number of
observations in the $m$th terminal node, $Q_m(T)$ the ``node
impurity'' defined by Equation 1, and $\alpha$ a tuning
parameter. For a given $\alpha \geq 0$, it is possible to find the
subtree $T_{\alpha} \subseteq T_0$ that minimizes $cp_{\alpha}(T)$ over
all possible subtrees of the largest possible tree $T_0$.
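As a purely illustrative aside, the following minimal sketch (in
Python) transcribes the criterion directly; the terminal-node counts,
impurity values, and the value of $\alpha$ are hypothetical
placeholders, with the impurities assumed to be computed as in
Equation 1.
\begin{verbatim}
# Direct transcription of cp_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|,
# where the sum runs over the |T| terminal nodes of the tree T.
def cost_complexity(node_sizes, node_impurities, alpha):
    """node_sizes[m] = N_m; node_impurities[m] = Q_m(T) (e.g., from Equation 1)."""
    risk = sum(n * q for n, q in zip(node_sizes, node_impurities))
    return risk + alpha * len(node_sizes)

# Hypothetical tree with three terminal nodes.
print(cost_complexity([40, 35, 25], [0.10, 0.05, 0.20], alpha=2.0))
\end{verbatim}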
Repeating this procedure for a range of values $0 \leq \alpha < \infty$
produces a finite nested sequence of trees
$\{T_0, T_1, \ldots, T_K\}$. Except for $T_0$,
these trees will no longer have only pure end-nodes. Impure end-nodes
are assigned the class that dominates in them. We then choose the
value $\alpha$ that minimizes an estimate of future prediction
error, for example by ``$V$-fold cross-validation''. The training data
are randomly divided into $V$ (e.g., ten) fractions of equal size. We
then grow $V$ overly large trees $T_0^{(v)}$, each time using all but
the $v$th sample fraction. Each of these trees then goes through
the pruning procedure described before, yielding $V$ nested sequences
of trees $\{T_0^{(v)}, T_1^{(v)}, \ldots, T_K^{(v)}\}$, $1 \leq v \leq V$.
The trees $T_k^{(v)}$ were constructed without
ever seeing the cases in the $v$th fraction. Sending the $v$th
fraction down $T_k^{(v)}$ for each $v = 1, \ldots, V$ thus yields an
independent estimate of the misclassification error.
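As a concrete but merely illustrative sketch of this procedure, the
code below uses scikit-learn's implementation of minimal
cost-complexity pruning (its \texttt{ccp\_alpha} plays the role of
$\alpha$ above) rather than the software used in this article; the
data set and the choice $V = 10$ are assumptions made only for the
example.
\begin{verbatim}
# Illustrative sketch: nested pruning sequence plus 10-fold CV error per alpha.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data set

# The pruning path yields one alpha per subtree in the nested sequence
# {T_0, T_1, ..., T_K} of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 10-fold CV: each pruned tree is scored on observations that were not
# used to grow it, giving an independent misclassification estimate.
cv_error = np.array([
    1.0 - cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10
    ).mean()
    for a in path.ccp_alphas
])
best_alpha = path.ccp_alphas[cv_error.argmin()]
print(f"alpha minimizing the CV misclassification error: {best_alpha:.5f}")
\end{verbatim}
Plotting these CV errors against the size of the corresponding pruned
trees gives the kind of curve discussed next.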
A plot of these cross-validated (CV) prediction errors versus the number of nodes in each of the nested subtrees shows a minimum at some point. As discussed before, trees with fewer nodes tend to have large bias, while the increased CV-cost of overly large trees is caused by their inflated variance. There typically exist several trees with CV-costs close to the minimum. Therefore, a ``1-SE rule'' is used, i.e., choosing the smallest tree whose CV misclassification cost does not exceed the minimum CV-cost plus one standard error of the CV-cost for the minimum CV-cost tree. An example of this procedure will be given in Section 3.
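Independently of the worked example in Section 3, a minimal sketch of
the 1-SE rule under the same illustrative setup as above (placeholder
data set, $V = 10$) might look as follows: keep the per-fold errors so
that a standard error is available for the minimum-CV-cost tree, then
take the largest $\alpha$, i.e., the smallest tree, whose CV error
stays within one standard error of the minimum.
\begin{verbatim}
# Illustrative sketch of the 1-SE rule on top of 10-fold cross-validation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data set
alphas = DecisionTreeClassifier(random_state=0) \
    .cost_complexity_pruning_path(X, y).ccp_alphas

# Rows: candidate alphas; columns: the V = 10 folds.
fold_err = np.array([
    1.0 - cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10
    )
    for a in alphas
])
mean_err = fold_err.mean(axis=1)
se_err = fold_err.std(axis=1, ddof=1) / np.sqrt(fold_err.shape[1])

# Smallest tree (largest alpha) whose CV error is within minimum + 1 SE.
i_min = mean_err.argmin()
alpha_1se = alphas[mean_err <= mean_err[i_min] + se_err[i_min]].max()

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha_1se).fit(X, y)
print(f"1-SE alpha: {alpha_1se:.5f}; terminal nodes: {final_tree.get_n_leaves()}")
\end{verbatim}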