Pieter Vermeesch
Agrawal and Verma (2007) allege six problems with the use of
classification trees for the tectonic discrimination of oceanic
basalts, as proposed by Vermeesch (2006). In the following, I will
demonstrate that the first five of their points are false, whereas the
sixth is partially correct but can easily be fixed.
The results of Vermeesch (2006) are said to be irreproducible because
a large number of data points are ``unclassifiable'' due to the
absence of all the primary and surrogate variables. However, Section
4.2 of Vermeesch (2006) explains that in such cases a ``follow the
majority'' decision should be made. For example, in the absence of
TiO$_2$, P$_2$O$_5$, and Zr (the primary and surrogate variables for
the first split of the full tree; Figure 4 of Vermeesch, 2006), the
sample should be sent to the ``Yes''-side of the split, because this
is where the majority (520/756) of the training data go. The ``follow
the majority'' rule is an integral part of the classification tree
method and all but obviates points (i)-(v) of the Comment by Agrawal
and Verma (2007).
Thanks to the simple but effective way of dealing with missing data,
the sparseness of the training data used by Vermeesch (2006) is not a
problem. Agrawal and Verma (2007) remark that not even a single
sample in the training set was analyzed for all the variables, and
that only one MORB sample was analyzed for Sn. It is important to
note that the variable Sn was used in neither of the two
classification trees presented by Vermeesch (2006). The fact that
classification trees are not hurt by sparse datasets should be seen as
a positive feature and can hardly be considered a criticism of the
method.
The sixth and final point of Agrawal and Verma (2007) is that
geochemical analysis should consider only the relative and not the
absolute values of its components. I welcome the opportunity to
elaborate on this point here. It is true that the constant-sum
constraint (``closure'') of compositional data implies some degree of
correlation between the components. Loss or gain of any component
causes a change in the concentration of all the other components.
This problem is well known in geochemistry and is generally solved by
taking ratios. For a parametric method such as discriminant analysis,
Aitchison (1986) advocates taking log-ratios. However, taking
logarithms is not necessary for non-parametric tools such as the
classification tree. The latter only considers the rank order of the
values of each split variable, which is not affected by taking
logarithms.
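
The invariance of a tree split under a logarithmic transformation can be checked with a small sketch. The data below are invented for illustration only, and the helper names (`gini`, `best_split`) are hypothetical; the Gini impurity is assumed as the splitting criterion, which may differ from the exact settings used for the published trees.

```python
import math

def gini(idx, labels):
    """Gini impurity of the samples whose indices are in idx."""
    n = len(idx)
    counts = {}
    for i in idx:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return the set of sample indices sent to the low side of the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    best = None
    for k in range(1, n):
        if values[order[k - 1]] == values[order[k]]:
            continue  # cannot split between tied values
        low, high = order[:k], order[k:]
        score = (len(low) * gini(low, labels) + len(high) * gini(high, labels)) / n
        if best is None or score < best[0]:
            best = (score, frozenset(low))
    return best[1]

# Invented toy ratios and affinities, NOT taken from the training data.
ratios = [0.002, 0.005, 0.011, 0.03, 0.08, 0.2]
labels = ["MORB", "MORB", "MORB", "OIB", "OIB", "OIB"]

low_raw = best_split(ratios, labels)
low_log = best_split([math.log(v) for v in ratios], labels)
print(low_raw == low_log)  # True: the same samples fall on each side
```

Because the logarithm is strictly increasing, the sorted order of the samples is unchanged, so every candidate partition and its impurity score are identical before and after the transformation.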
For the sake of illustration, a ratio-based tree was built using the
same dataset as Vermeesch (2006), but converting major oxide
concentrations (in weight percent) to elemental concentrations (in
parts per million). The following variables were used: La/Ti, Ce/Ti,
Nd/Ti, Sm/Ti, Eu/Ti, Gd/Ti, Tb/Ti, Dy/Ti, Ho/Ti, Er/Ti, Tm/Ti, Yb/Ti,
Lu/Ti, Sc/Ti, V/Ti, Sr/Ti, Y/Ti, Zr/Ti, Nb/Ti, Hf/Ti, Ta/Ti, Th/Ti,
U/Ti, Sr/Zr, Zr/Nb, Nb/Th, La/Sm, La/Yb, Gd/Yb, Th/Ta, Nb/La, Th/Yb,
Th/U, Nb/U and Nb/Ta (Figure 1). Only 751 of the
original 756 data were used for the tree construction, because one IAB
and four OIBs lacked all the necessary variables. Using the entire
dataset of 756 training data, the substitution-error of the
ratio-based tree is 14% and its 10-fold cross-validation error is
18%. Because the surrogate variables are also composed of ratios
(Table 1), they are subject to some degree of spurious
correlation (Chayes, 1971). There is no way around this, but the
cross-validation error estimate suggests that it only affects the
performance of the tree to a minor degree.
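
As a rough illustration of the unit conversion that precedes the ratio calculation, the sketch below converts TiO$_2$ in weight percent to Ti in parts per million and then forms a Sr/Ti ratio. The function names and the sample values are hypothetical, and the atomic masses are standard rounded values rather than numbers taken from the original dataset.

```python
# Standard atomic masses (rounded); an assumption of this sketch.
TI_ATOMIC_MASS = 47.867
O_ATOMIC_MASS = 15.999

def tio2_wt_percent_to_ti_ppm(tio2_wt_percent):
    """Convert TiO2 (weight percent) to elemental Ti (parts per million)."""
    tio2_mass = TI_ATOMIC_MASS + 2 * O_ATOMIC_MASS
    ti_fraction = TI_ATOMIC_MASS / tio2_mass  # mass fraction of Ti in TiO2
    return tio2_wt_percent * ti_fraction * 1e4  # 1 wt% = 10,000 ppm

def ratio(numerator_ppm, denominator_ppm):
    """Return the ratio, or None if either component was not analyzed."""
    if numerator_ppm is None or denominator_ppm is None:
        return None
    return numerator_ppm / denominator_ppm

# Invented sample values, for illustration only.
ti_ppm = tio2_wt_percent_to_ti_ppm(1.5)
sr_over_ti = ratio(120.0, ti_ppm)
missing = ratio(None, ti_ppm)  # stays missing; handled later by the tree
```

Returning `None` for an unanalyzed component keeps the missing ratio explicit, so that it can be passed on to the surrogate variables or the ``follow the majority'' rule instead of being silently replaced by a number.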
To illustrate once again the use of surrogate variables and the
``follow the majority'' rule, consider a sample with a Sr/Ti ratio of
0.01 and lacking all other variables. The primary split variable
(Sr/Zr) is missing, so the first surrogate variable must be used
(Table 1). Because Sr/Ti = 0.01 $<$ 0.02056053, the sample is sent to
the right side of the first node. We have now arrived at the third
node, where the primary and surrogate variables are La/Yb, La/Sm and
Yb/Ti, respectively. All these variables are missing, so we must use
the ``follow the majority'' rule. Because 197 out of the 223 training
data that arrived at node 3 were sent to the left, our sample is
classified as MORB. The misclassification rate of samples with missing
data is worse than that of samples which were analyzed for all the
components. However, provided that the sample of unknown tectonic
affinity and the training data are comparably sparse, the
cross-validation error of 18% should be a reasonably accurate estimate
of the true misclassification rate. For this reason, comparing the
performance of a classification tree with that of a discriminant
analysis lacking any missing data, as done by Verma et al. (2006), is
fundamentally unfair.
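
The routing logic of this worked example can be sketched in a few lines. This is a minimal illustration and not the software used to build the tree: the `route` function and the node dictionaries are hypothetical, and every threshold or direction that is not quoted in the text is left as `None` rather than guessed.

```python
def route(node, sample):
    """Return 'left' or 'right' for one node of the tree.

    Tries the primary split, then each surrogate in order. If all of
    these variables are missing, the majority direction is followed.
    """
    for variable, threshold, side_if_below in node["splits"]:
        value = sample.get(variable)
        if value is None:
            continue  # variable not analyzed: try the next surrogate
        if threshold is None or side_if_below is None:
            raise ValueError(f"split on {variable} is not specified in this sketch")
        other = "left" if side_if_below == "right" else "right"
        return side_if_below if value < threshold else other
    return node["majority"]

# Node 1: only the first surrogate (Sr/Ti) is quoted in the text.
node1 = {
    "splits": [
        ("Sr/Zr", None, None),           # primary split; details not quoted here
        ("Sr/Ti", 0.02056053, "right"),  # first surrogate: values below go right
    ],
    "majority": None,  # not needed for this example
}

# Node 3: thresholds are not quoted; only the majority direction is known.
node3 = {
    "splits": [("La/Yb", None, None), ("La/Sm", None, None), ("Yb/Ti", None, None)],
    "majority": "left",  # 197 of the 223 training data went left (MORB)
}

sample = {"Sr/Ti": 0.01}  # all other variables are missing
path = [route(node1, sample), route(node3, sample)]
print(path)  # ['right', 'left']
```

The sample is sent right at the first node by the surrogate variable and left at the third node by the majority rule, which reproduces the MORB classification described above.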