
Pruning a Tree

Define the ``cost-complexity criterion'' of a tree $T$ as


\begin{displaymath}
cp_{\alpha}(T) = \sum_{m=1}^{\vert T\vert} N_m Q_m(T) + \alpha\vert T\vert
\end{displaymath} (2)

with $\vert T\vert$ the number of terminal nodes in $T$, $N_m$ the number of observations in the $m^{th}$ terminal node, $Q_m(T)$ the ``node impurity'' defined by Equation 1, and $\alpha$ a tuning parameter. For a given $\alpha \geq 0$, it is possible to find the subtree $T_{\alpha} \subset T_0$ that minimizes $cp_{\alpha}(T)$ over all possible subtrees of the largest possible tree $T_0$:


\begin{displaymath}
T_{\alpha} = \mathop{\mbox{argmin}}_{T \subset T_0} \; cp_{\alpha}(T)
\end{displaymath} (3)
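
As an illustration (not part of the original implementation), the following Python sketch evaluates the cost-complexity criterion of Equation 2 for a single candidate subtree, assuming that the terminal-node counts $N_m$ and impurities $Q_m(T)$ have already been computed (e.g., from Equation 1); the numbers in the example call are hypothetical.

\begin{verbatim}
# Minimal sketch (not the paper's code): evaluating the cost-complexity
# criterion cp_alpha(T) of Equation 2 for one candidate subtree.
# N and Q list, for each terminal node m, the count N_m and impurity Q_m(T).

def cost_complexity(N, Q, alpha):
    """cp_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|."""
    assert len(N) == len(Q)
    return sum(n * q for n, q in zip(N, Q)) + alpha * len(N)

# Hypothetical example: a subtree with three terminal nodes.
print(cost_complexity(N=[40, 35, 25], Q=[0.05, 0.0, 0.12], alpha=2.0))
\end{verbatim}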

Repeating this procedure for a range of values $0 \leq \alpha < \infty$ produces a finite nested sequence of trees $\{T_0, T_{\alpha_1}, \ldots, T_{\alpha_{max}}\}$. Except for $T_0$, these trees no longer have only pure end-nodes. Impure end-nodes are assigned the class that dominates in them. We then choose the value $\alpha^*$ that minimizes an estimate of future prediction error, for example by ``V-fold cross-validation''. The training data are randomly divided into V (e.g., ten) fractions of equal size. We then grow V overly large trees $T_0^{(v)}$, each time using all but the $v^{th}$ sample fraction. Each of these trees then goes through the pruning procedure described above, yielding V nested sequences of trees $\{T_0^{(v)}, T_{\alpha_1}^{(v)}, \ldots, T_{\alpha_{max}}^{(v)}\}$, $1 \leq v \leq V$. The trees $T_{\alpha}^{(v)}$ are constructed without ever seeing the cases in the $v^{th}$ fraction. Sending the $v^{th}$ fraction down $T_{\alpha}^{(v)}$ for each $v = 1, \ldots, V$ thus yields an independent estimate of the misclassification error.
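
The cross-validation step can be sketched in Python with scikit-learn, which implements the same minimal cost-complexity pruning through its ccp_alpha parameter and cost_complexity_pruning_path method; this is only an illustrative sketch, not the software used for this study, and X and y stand for the training features and class labels.

\begin{verbatim}
# Minimal sketch of the alpha-selection step by V-fold cross-validation,
# using scikit-learn's minimal cost-complexity pruning (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def cv_error_per_alpha(X, y, V=10):
    # Candidate alpha values come from the pruning sequence of the full tree T_0.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    results = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        scores = cross_val_score(tree, X, y, cv=V)   # accuracy on each held-out fold
        errors = 1.0 - scores                        # misclassification error per fold
        results.append((alpha, errors.mean(), errors.std(ddof=1) / np.sqrt(V)))
    return results                                   # (alpha, mean CV error, standard error)
\end{verbatim}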

A plot of these cross-validated (CV) prediction errors versus the number of nodes in each of the nested subtrees shows a minimum at some point. As discussed before, trees with fewer nodes tend to have large bias, while the increased CV-cost of overly large trees is caused by their inflated variance. There typically exist several trees with CV-costs close to the minimum. Therefore, a ``1-SE rule'' is used: we choose the smallest tree whose CV misclassification cost does not exceed the minimum CV-cost plus one standard error of the CV-cost of the minimum-CV-cost tree. An example of this procedure will be given in Section 3.
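
The 1-SE rule itself can be sketched as follows, operating on the (alpha, mean CV error, standard error) triples produced by the hypothetical cv_error_per_alpha helper above; since larger values of $\alpha$ correspond to smaller trees, the smallest tree within one standard error of the minimum corresponds to the largest admissible $\alpha$.

\begin{verbatim}
# Minimal sketch of the ``1-SE rule'' (illustrative only).
def one_se_rule(results):
    # results: list of (alpha, mean CV error, standard error) triples
    best_alpha, best_err, best_se = min(results, key=lambda r: r[1])
    threshold = best_err + best_se
    within = [r for r in results if r[1] <= threshold]
    # Largest alpha within one SE of the minimum, i.e. the smallest such tree.
    return max(within, key=lambda r: r[0])[0]
\end{verbatim}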

