
Aitchison's solution to the compositional data problem

Although Chayes (1949, 1960, 1971) made significant contributions to the compositional data problem, the real breakthrough was made by Aitchison (1982, 1986). Aitchison argues that N-variate data constrained to a constant sum form an (N-1)-dimensional sample space, or simplex. An example of a simplex for N=3 is the ternary diagram (e.g., Weltje, 2002). The very fact that ternary data can be plotted on a two-dimensional sheet of paper tells us that the sample space really has only two dimensions, not three. The ``traditional'' statistics of real space ( $\mathbb{R}^N$ ) no longer work on the simplex ( $\Delta_{N-1}$ ). Figure 5 shows how the calculation of 100(1- $\alpha$ )% confidence intervals breaks down on $\Delta_{2}$ . Treating $\Delta_{2}$ the same way as $\mathbb{R}^3$ yields 95% confidence polygons that partly fall outside the ternary diagram, corresponding to meaningless negative values of x, y and z.
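The breakdown described above can be reproduced with a few numbers. The following is a minimal sketch, not the calculation behind Figure 5: the compositions in `comp` are made-up values, and a simple mean plus or minus two standard deviations on a single component stands in for the confidence polygons. It only illustrates that treating a component as an unconstrained real number can push the lower bound below zero.

```python
import numpy as np

# Hypothetical ternary compositions (each row is x, y, z and sums to 1),
# clustered near the x = 0 edge of the ternary diagram.
comp = np.array([
    [0.01, 0.49, 0.50],
    [0.02, 0.38, 0.60],
    [0.05, 0.45, 0.50],
    [0.10, 0.40, 0.50],
    [0.30, 0.30, 0.40],
])

x = comp[:, 0]
mean_x = x.mean()
sd_x = x.std(ddof=1)  # sample standard deviation

# Naive two-sigma interval, treating x as if it lived in unconstrained real space.
naive_lower = mean_x - 2.0 * sd_x
naive_upper = mean_x + 2.0 * sd_x

print(f"naive interval for x: [{naive_lower:.3f}, {naive_upper:.3f}]")
print("lower bound is negative:", naive_lower < 0)
```

A negative lower bound for a concentration has no physical meaning, which is the same symptom as the confidence polygons that spill outside the ternary diagram.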

As a solution to this problem, Aitchison suggested transforming the data from $\Delta_{N-1}$ to $\mathbb{R}^{N-1}$ using the log-ratio transformation (Figure 6). After the desired (``traditional'') statistical analysis has been performed on the transformed data in $\mathbb{R}^{N-1}$ , the results can be transformed back to $\Delta_{N-1}$ using the inverse log-ratio transformation. For example, in the ternary system (X+Y+Z=1), we could use the transformed values V = log(X/Z) and W = log(Y/Z). Alternatively, we could use V = log(X/Y) and W = log(Z/Y), or V = log(Y/X) and W = log(Z/X). For the first choice, V = log(X/Z) and W = log(Y/Z), the inverse log-ratio transformation is given by:

$\displaystyle X = \frac{e^V}{e^V+e^W+1}, Y = \frac{e^W}{e^V+e^W+1}, Z = \frac{1}{e^V+e^W+1}$

(6)
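The forward transformation V = log(X/Z), W = log(Y/Z) and its inverse (Equation 6) can be sketched in a few lines of Python. The function names `logratio` and `inverse_logratio` are illustrative choices, not taken from any particular library, and the input composition is an arbitrary example.

```python
import numpy as np

def logratio(x, y, z):
    """Map a ternary composition (x + y + z = 1) to V = log(x/z), W = log(y/z)."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    return np.log(x / z), np.log(y / z)

def inverse_logratio(v, w):
    """Map (V, W) back to the ternary diagram, following Equation 6."""
    ev, ew = np.exp(v), np.exp(w)
    denom = ev + ew + 1.0
    return ev / denom, ew / denom, 1.0 / denom

# Round trip for an arbitrary example composition.
v, w = logratio(0.2, 0.3, 0.5)
x, y, z = inverse_logratio(v, w)

# Any point of log-ratio space maps to a valid composition:
# all parts are positive and they sum to one.
x2, y2, z2 = inverse_logratio(-5.0, 3.0)
print(x, y, z)
print(x2 + y2 + z2)
```

Because the inverse transformation always returns positive parts that sum to one, any region computed in log-ratio space is guaranteed to map back inside the ternary diagram.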

The back-transformed confidence regions of Figure 6 are no longer elliptical, but they fall completely within the ternary diagram, as they should. Figure 7 shows an LDA of the synthetic data of Figures 5 and 6, done the ``wrong'' way (i.e., treating the simplex as a regular data space). As explained in the previous section, such an analysis yields linear decision boundaries. In this case, 10% of the training data were misclassified. Figure 8 shows an LDA done the ``correct'' way (i.e., after mapping the data to log-ratio space). The decision boundaries are still linear, but this time only $\sim$ 3% of the training data were misclassified. Because log(Y/Z) and log(X/Z) are rather hard to interpret, it is a good idea to map the results back to the ternary diagram using the inverse log-ratio transformation (Figure 9). The back-transformed decision boundaries are no longer linear, but curved. The misclassification rate is unaffected by the back-transformation and remains $\sim$ 3%.
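The curvature of the back-transformed boundaries follows directly from the definition of the log-ratios. As a short worked example, assume a linear decision boundary in log-ratio space of the form $W = a + bV$ , where the coefficients a and b are hypothetical and not taken from Figures 8 or 9:

```latex
\begin{align*}
W &= a + bV \\
\log\left(\frac{Y}{Z}\right) &= a + b \log\left(\frac{X}{Z}\right) \\
\frac{Y}{Z} &= e^{a} \left(\frac{X}{Z}\right)^{b} \\
Y &= e^{a} X^{b} Z^{1-b}
\end{align*}
```

This is a power-law relation between the three components and therefore generally plots as a curve on the ternary diagram. Only in special cases such as b = 0 or b = 1 does it reduce to a constant ratio between two components, which is a straight line through one of the vertices.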

Note that there are two different kinds of constant-sum constraint. The first is a physical one, resulting from the fact that all chemical concentrations add up to 100%. The second is a diagrammatic constraint caused by renormalizing three chosen elements to 100% on a ternary plot. Aitchison's log-ratio transformation adequately deals with both types of constant-sum constraint. The first type is discussed in Sections 5.1 and 5.3, the second in Section 5.2.


Pieter Vermeesch 2005-11-21