Compositional Data Analysis. Sample and parameter space structure.

  • Vera Pawlowsky-Glahn
    22 April 2016 - updated 4 years ago

Many of the previous comments address the need for novel mathematical tools in the Biosciences, Environmental Studies, Geosciences, in Industrial, Political and Social Studies, and in interdisciplinary fields in general. Compositional Data Analysis (CoDA) is one such new tool. Its roots go back to a paper on spurious correlation by K. Pearson, published in 1897, and it has expanded remarkably since J. Aitchison introduced the logratio approach in 1982. Nowadays it builds on the fact that the sample space of compositional data, i.e. of data in proportions, percentages, or, in general, parts of a whole that carry relative information, is a constrained subset of real space with its own algebraic-geometric structure. More precisely, it has a Euclidean vector space structure, which has been extended to a Hilbert space structure for functions with finite or infinite support. Recently, the importance and necessity of these tools in microbiomics and other "omics" fields has been recognised, and the problems that arise in these fields, such as fat data, represent new challenges.
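
To make the "own Euclidean vector space structure" of the simplex concrete, here is a minimal Python sketch of the usual operations of the logratio approach: closure, perturbation (which plays the role of addition), powering (scalar multiplication) and the centred logratio (clr) transform. The function names and the example compositions are illustrative choices, not taken from any particular library.

```python
import math

def closure(parts):
    """Rescale positive parts so that they sum to 1 (a composition)."""
    total = sum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    """Perturbation: component-wise product followed by closure.
    It plays the role of vector addition on the simplex."""
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(x, a):
    """Powering: component-wise power followed by closure.
    It plays the role of multiplication by the scalar a."""
    return closure([xi ** a for xi in x])

def clr(x):
    """Centred logratio transform: log parts minus their mean log."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of x and y."""
    return math.dist(clr(x), clr(y))

# Illustrative three-part compositions (made-up numbers).
x = closure([10.0, 30.0, 60.0])
y = closure([2.0, 3.0, 5.0])
neutral = closure([1.0, 1.0, 1.0])   # neutral element of perturbation

x_plus_y = perturb(x, y)             # "sum" of x and y on the simplex
x_minus_x = perturb(x, power(x, -1)) # x perturbed by its inverse
d_xy = aitchison_distance(x, y)
```

In this sketch `perturb(x, neutral)` returns `x` again and `x_minus_x` equals `neutral` up to rounding, which is what makes perturbation a group operation. The clr transform turns perturbation into ordinary addition of coordinates, so distances computed in clr coordinates do not depend on the units in which the parts were recorded.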

This type of problem is close to the fundamentals of statistics, which rests on the interplay between observations and parameters. Both observations and parameters live in measurable spaces: the sample space and the parameter space. Measurability is the minimal structure required of these spaces, but richer structures are generally convenient, if not necessary. For instance, vector space operations have to be defined to obtain central limit results; distances are required for straightforward definitions of mean values and variability (see Fréchet, 1948). At present, most contributions to statistics assume that both sample and parameter spaces are subsets of real space endowed with the sum as group operation and with the Euclidean metric. At most, in functional data analysis or stochastic processes, the sample space is an $L^p$ space (a Hilbert space when $p = 2$) with its metric. However, these structures inherited from real or $L^p$ spaces may fail to model the main features of observations and parameters. Some examples: compositional data, whose sample space can be represented as a simplex with its own Euclidean structure, where the sum is not the group operation; directional data, whose sample space is the sphere, which cannot be equipped with its own Euclidean structure; random sets, which need special operations and metrics; and random positive measures, represented by densities, which can be included in vector spaces (Bayes spaces) where the group operation is Bayes updating.
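
The last example, Bayes updating as a group operation, can be illustrated in the simplest setting of a finite set of hypotheses, where a distribution is a composition and updating is a perturbation by the likelihood. The following Python sketch assumes three hypothetical hypotheses and two made-up likelihood vectors; the names `closure` and `bayes_update` are illustrative only.

```python
def closure(weights):
    """Rescale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def bayes_update(prior, likelihood):
    """Posterior proportional to prior times likelihood.
    On a finite support this is a perturbation of the prior."""
    return closure([p * l for p, l in zip(prior, likelihood)])

# Hypothetical prior over three hypotheses and two made-up
# likelihood vectors from two independent observations.
prior = [0.5, 0.3, 0.2]
lik_a = [0.1, 0.4, 0.7]
lik_b = [0.6, 0.5, 0.2]

# The order in which the evidence arrives does not matter.
post_ab = bayes_update(bayes_update(prior, lik_a), lik_b)
post_ba = bayes_update(bayes_update(prior, lik_b), lik_a)

# A constant likelihood carries no information: neutral element.
post_flat = bayes_update(prior, [1.0, 1.0, 1.0])

# Perturbing by the reciprocal likelihood undoes the update: inverse.
undone = bayes_update(bayes_update(prior, lik_a), [1.0 / l for l in lik_a])
```

Under these assumptions `post_ab` and `post_ba` coincide, `post_flat` equals the prior, and `undone` recovers the prior up to rounding, so updating behaves like a commutative group operation on the set of strictly positive discrete distributions. The Bayes spaces mentioned above generalise this idea to densities; the sketch covers only the finite case.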

It would be a great opportunity for European science to have this topic included in the program of Horizon 2020.