Handling Missing Data

Next: APPLICATION TO THE TECTONIC Up: METHOD Previous: Pruning a Tree

Handling Missing Data

One of the greatest advantages of classification trees is their ability to handle missing data. For example, in a dataset of geochemical analyses, some samples might have been analysed for major and trace elements, while others were only analysed for trace elements and stable isotopes. Yet another set of samples might have been analysed for all elements except Zr, etc. Methods like discriminant analysis cannot easily handle these situations, severely restricting their applicability and power. Both for training and prediction, trees solve the missing data problem by ``surrogate splits''. Having chosen the best primary predictor and split point (disregarding the missing data), the first surrogate is the predictor and corresponding split point that has the highest correlation with the primary predictor in R (Figure 2). The second surrogate is the predictor that shows the second highest correlation with the primary split variable and so forth.

**Figure 2:** Primary split variable X and split point s and surrogate split variable X and split point s. Surrogates answer the question: ``which other splits would classify the same objects in the same way as the primary split?''

Next: APPLICATION TO THE TECTONIC Up: METHOD Previous: Pruning a Tree

Pieter Vermeesch 2005-12-14