

Discriminant analysis

Consider a dataset X of N-dimensional observations, each of which belongs to one of K classes. For example, X might be a set of geochemical data (e.g., SiO$_2$, Al$_2$O$_3$, etc.) from basaltic rocks of K tectonic affinities (e.g., mid-ocean ridge, ocean island, island arc, ...). We might ask ourselves which of these classes an unknown sample x belongs to. This question is answered by Bayes' Rule: the decision d is the class G ($1 \leq G \leq K$) that has the highest posterior probability given the data x:

$$d = \underset{k=1,\dots,K}{\operatorname{argmax}}\; \Pr(G=k \mid X=x) \qquad (1)$$

where argmax stands for ``argument of the maximum'', i.e., if f(k) reaches its maximum at k=d, then $\underset{k=1,\dots,K}{\operatorname{argmax}}\, f(k) = d$. This posterior distribution can be calculated according to Bayes' Theorem:

$$\Pr(G \mid X) \propto \Pr(X \mid G)\,\Pr(G) \qquad (2)$$

where $\Pr(X \mid G)$ is the probability density of the data in a given class, and $\Pr(G)$ the prior probability of the class, which we will consider uniformly distributed (i.e., $\Pr(G=1) = \Pr(G=2) = \dots = \Pr(G=K) = 1/K$) in this paper. Therefore, plugging Equation 2 into Equation 1 reduces Bayes' Rule to a comparison of probability density estimates. We now make the simplifying assumption of multivariate normality:

$$\Pr(X=x \mid G=k) = \frac{\exp\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)}{(2\pi)^{N/2}\sqrt{|\Sigma_k|}} \qquad (3)$$

where $\mu_k$ and $\Sigma_k$ are the mean and covariance of the k$^{th}$ class and $(x-\mu_k)^T$ indicates the transpose of the vector $(x-\mu_k)$. Using Equation 3, taking logarithms, and dropping the term $-\frac{N}{2}\log(2\pi)$, which is the same for all classes, Equation 1 becomes:

$$d = \underset{k=1,\dots,K}{\operatorname{argmax}}\; -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) \qquad (4)$$
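To make this decision rule concrete, the following minimal Python sketch evaluates Equation 4 for an unknown sample, with $\mu_k$ and $\Sigma_k$ estimated from training data. The function name, the NumPy-based implementation and the synthetic two-class training set are illustrative assumptions, not part of the method itself.

\begin{verbatim}
import numpy as np

def qda_classify(x, training_data):
    """Classify sample x with the quadratic discriminant rule of Equation 4.

    training_data: dict mapping each class label to an (n_k, N) array of
    training samples; mu_k and Sigma_k are estimated from these arrays.
    """
    best_class, best_score = None, -np.inf
    for label, X_k in training_data.items():
        mu_k = X_k.mean(axis=0)              # class mean
        Sigma_k = np.cov(X_k, rowvar=False)  # class covariance
        diff = x - mu_k
        # Equation 4: -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)
        score = (-0.5 * np.linalg.slogdet(Sigma_k)[1]
                 - 0.5 * diff @ np.linalg.solve(Sigma_k, diff))
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# hypothetical two-class example with synthetic training data
rng = np.random.default_rng(0)
training = {"MORB": rng.normal([50, 15], 1.0, size=(100, 2)),
            "OIB":  rng.normal([48, 12], 1.5, size=(100, 2))}
print(qda_classify(np.array([49.5, 14.0]), training))
\end{verbatim}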

Equation 4 is the basis for quadratic discriminant analysis (QDA). Usually, $\mu_k$ and $\Sigma_k$ are not known, and must be estimated from the training data. If we make the additional assumption that all the classes share the same covariance structure (i.e., $\Sigma_k = \Sigma\ \forall\, k$), then the terms of Equation 4 that do not depend on k can be dropped, and Equation 1 simplifies to:

$$d = \underset{k=1,\dots,K}{\operatorname{argmax}}\; x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k \qquad (5)$$

This is the basis of linear discriminant analysis (LDA), which has some desirable properties. For example, because Equation 5 is linear in x, the decision boundaries between the different classes are straight lines (Figure 8). Furthermore, LDA can lead to a significant reduction in dimensionality, in a similar way to principal component analysis (PCA). PCA finds an orthogonal transformation B (i.e., a rotation) of the centered data X such that the elements of the vector BX are uncorrelated. B can be calculated by an eigenvalue decomposition of the covariance matrix $\Sigma$: the eigenvectors define orthogonal linear combinations of the original variables, and the eigenvalues give the variances of those combinations. The first few principal components generally account for most of the variability in the data, constituting a significant reduction of dimensionality (Figure 2).
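As an illustration of this procedure, the following Python sketch computes principal components by an eigenvalue decomposition of the covariance matrix of the centered data; the function name, the NumPy routines and the number of retained components are illustrative assumptions.

\begin{verbatim}
import numpy as np

def pca(X, n_components=2):
    """PCA by eigendecomposition of the covariance matrix of the data.

    X: (n_samples, N) array. Returns the uncorrelated scores BX and the
    fraction of total variance carried by each retained component.
    """
    Xc = X - X.mean(axis=0)                   # center the data
    Sigma = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigen-decomposition (symmetric)
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B = eigvecs[:, :n_components]             # orthogonal transformation B
    scores = Xc @ B                           # uncorrelated principal components
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained
\end{verbatim}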

Like PCA, LDA also finds linear combinations of the original variables. However, this time we do not want to maximize the overall variability, but rather to find the orthogonal transformation Z = BX that maximizes the between-class variance $S_b$ relative to the within-class variance $S_w$, where $S_b$ is the variance of the class means of Z, and $S_w$ is the pooled variance about those means (Figure 2).
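A minimal sketch of one common way to compute such a transformation is given below: it forms the between-class and within-class scatter matrices and takes the leading eigenvectors of $S_w^{-1} S_b$ as the discriminant directions. The NumPy-based implementation and variable names are illustrative, and rows of X are treated as samples, so the scores are computed as XB rather than BX.

\begin{verbatim}
import numpy as np

def lda_directions(X, labels, n_components=2):
    """Find the linear combinations that maximize between-class variance (S_b)
    relative to within-class variance (S_w), via the eigenvectors of
    S_w^{-1} S_b.

    X: (n_samples, N) array; labels: length-n_samples array of class labels.
    """
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    N = X.shape[1]
    S_w = np.zeros((N, N))                    # pooled (within-class) scatter
    S_b = np.zeros((N, N))                    # between-class scatter
    for k in np.unique(labels):
        X_k = X[labels == k]
        mu_k = X_k.mean(axis=0)
        S_w += (X_k - mu_k).T @ (X_k - mu_k)
        diff = (mu_k - overall_mean)[:, None]
        S_b += X_k.shape[0] * diff @ diff.T
    # eigenvectors of S_w^{-1} S_b give the discriminant directions
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]
    B = eigvecs[:, order[:n_components]].real
    return X @ B                              # discriminant scores
\end{verbatim}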

