INTRODUCTION
Human microbiome research, focused on the study of the microorganisms that live throughout the human body and their role in health and disease, has experienced significant growth in the last few years. High-throughput sequencing technologies have revolutionized this field by allowing the quantification of microbiome composition and function in different environments. Large-scale projects, such as the Human Microbiome Project (
1,
2) or MetaHIT (
metagenomics of the
human
intestinal
tract), have established standardized protocols for creating, processing, and interpreting metagenomics data (
3). However, analysis of microbiome data is still challenging due to, among other reasons, their inherently compositional nature.
High-throughput DNA sequencing generates thousands of sequence reads that, after bioinformatics preprocessing and quality control, are annotated to different microbial species or taxa. An abundance table of counts summarizes the number of sequences per sample of each taxon. The total number of counts per sample, also known as sequencing depth or library size, is highly variable and constrained by the maximum number of sequence reads of the instrument. This total count constraint induces strong dependencies among the abundances of the different taxa; an increase in the abundance of one taxon requires the decrease of the observed number of counts for some of the other taxa so that the total number of counts does not exceed the specified sequencing depth. Moreover, observed raw abundances and the total number of reads per sample are noninformative since they represent only a fraction or random sample of the original DNA content in the environment. These characteristics of microbiome abundance data clearly fall into the notion of compositional data. Compositional data are defined as a vector of strictly positive real numbers with an unknown or uninformative total. Compositional data carry relative information, i.e., information contained in the ratios between components, and the numerical value of each component by itself is irrelevant (
4). Except for the fact that microbiome abundance tables contain many zeros, microbiome data fit the definition of compositional data and, as already acknowledged by many authors (
5,
6), their analysis requires the use of a proper mathematical theory (
7).
Most of the methods proposed for microbiome analysis are intended to address two main issues: first, whether there is a global association between the microbiome and a phenotype of interest; second, which specific taxa are associated with the outcome. Multivariate methods such as PERMANOVA (
8,
9), implemented in the
Adonis() function of the R package
vegan (
10), and MiRKAT (
11) address the first issue. The second issue is approached with univariate methods where each taxon is tested for association with the outcome. When the response variable is dichotomous, the testing method is known as differential abundance testing and methods specifically developed for transcriptome sequencing (RNA-Seq) data, such as
DESeq2 (
12) and
edgeR (
13), are commonly used. Other methods, such as
ANCOM (
14) and
ALDEx2 (
15), have been proposed that acknowledge the compositionality of microbiome data. See a previous report by Weiss et al. for a comprehensive comparison of methods for microbiome differential abundance testing (
16).
In this paper we focus on a different issue. The goal of the proposed methodology is to identify microbial signatures, that is, groups of microbial taxa that are predictive of a phenotype of interest. These microbial signatures can be used for diagnosis, prognosis, or prediction of therapeutic response based on an individual’s specific microbiota (
17). The identification of microbial signatures involves both modeling and variable selection: modeling the response variable and identifying the smallest number of taxa with the highest prediction or classification accuracy. We present
selbal, a model selection procedure that searches for a sparse model that adequately explains the response variable of interest. Similarly to forward stepwise linear regression,
selbal performs multiple regressions a number of times, each time adding a new taxon to the model. Unlike linear regression, the raw variables in
selbal are not included in a linear equation in real space but are added as part of what is called a “balance” in the compositional data analysis literature, i.e., as part of a particular type of log-contrast.
Mathematically, a compositional balance is defined as follows. Let
X = (
X1,
X2,
…,
Xk) be a composition with
k components or parts. Given two disjoint subsets of components in
X, denoted by
X+ and
X−, indexed by
I+ and
I−, and composed of
k+ and
k− parts, respectively, the balance between
X+ and
X− is defined as the normalized log ratio of the geometric mean of the values for the two groups of components as follows:
Expanding the logarithm, we obtain a more usual expression of a balance that is proportional to the difference between the means of the log-transformed variables of the two groups of components as follows:
A compositional balance is a special kind of a log-contrast, defined as a linear combination of the log-transformed components of a composition with the restriction that the coefficients of the linear function add up to zero (
4). The importance of working with log-contrast functions or, in particular, with balances, in analyzing compositional data is that this kind of function preserves scale invariance, one of the principles that should be fulfilled in compositional data analysis (
4,
7).
Our algorithm for balance selection, selbal, starts with a first thorough search of the two taxa whose balance, or log ratio, is most closely associated with the response. Once the first two-taxon balance is selected, the algorithm performs a forward selection process where, at each step, a new taxon is added to the existing balance such that the specified criterion is improved (area under the receiver operating characteristic [ROC] curve [AUC] or mean squared error [MSE]). The algorithm stops when there is no additional variable that improves the current optimization parameter or when the maximum number of components to be included in the balance is achieved. This number is established with a cross-validation (CV) procedure, which is also used to explore the robustness of the identified balance. A more detailed description of the algorithm is given in Materials and Methods.
selbal is different from other modeling approaches for microbiome analysis such as MiRKAT (
11), which performs an overall association test comparing the microbiome and the phenotype but does not perform model selection.
Model selection for microbial signature identification can also be performed in two separate steps: first, variable selection; second, model building with the selected variables. When the outcome variable is dichotomous, variable selection can be obtained with methods for microbiome differential abundance testing mentioned before, such as
DESeq2 (
12),
edgeR (
13), or, in the context of compositional data analysis,
ANCOM (
14) or
ALDEx2 (
15). However, it is not clear how to combine the selected variables to obtain the best joint sparse model. This is specially challenging for microbiome analysis, where the compositional nature of microbiome data induces spurious correlations among the variables. We think that a joint procedure that involves both modeling and variable selection, as performed in
selbal, is more appropriate in this context.
Other authors (
18–20) have previously proposed the use of balances for microbiome analysis regarding the construction of an isometric log ratio (ILR) transformation (
21), which allows compositional data to be represented in a real Euclidean space, where standard statistical methods can be applied. Silverman et al. (
18) and Washburne et al. (
19) proposed methods that use microbial phylogenetic information to guide the sequential binary partition in the construction of a particular ILR transformation. This phylogenetically driven ILR transformation would help to detect relevant evolutionary factors or phylogenetically related bacterial groups (clades) related to host-microbiome interactions (
18,
19). In the method proposed by Morton et al. (
20), instead of using phylogenetic information, they use the response variable to define the binary sequential partitions of the ILR transformation.
selbal is different from these methods in the following ways: first, only one balance is considered and not a sequence of balances in
selbal; second, the purpose of the selected balance is classification or prediction and not a new representation of the data.
As with any other compositional data method, one important issue that
selbal addresses before the searching algorithm can be applied is that of how to deal with the large amount of zeros that are typically present in microbiome data sets. Their treatment is different depending on whether they represent essential or rounded zeros (
22). In microbiome analysis, an essential or structural zero represents the absolute absence of a taxon in a sample, e.g., because the microbe is unable to live in that environment. In dealing with essential zeros, samples are considered to belong to two distinct subpopulations according to the presence or absence of a zero. On the other hand, rounded zeros arise because of insufficient sampling depth: they correspond to taxa in such a small proportion in the sample that they were not picked during the sequencing process. The common practice for the treatment of rounded zeros is their replacement by a small positive value. This replacement can be implemented in different ways, including replacing the zeros by a constant, adding a constant to all values in the data set, and using more-sophisticated methods for zero replacement that are designed to preserve as much as possible the covariance structure of the initial data. Though the user can perform other zero replacements before using
selbal (
18,
20), the default option in
selbal is
geometric
Bayesian
multiplicative replacement (GBM) (
23) (described in more detail in Materials and Methods).
Another important issue in microbiome analysis is sampling variance. As discussed by Gloor et al. (
24), microbiome data, as with any other kind of high-throughput sequencing data, are subject to high levels of variability or uncertainty that should be conveniently treated. The number of reads obtained in an experiment represents just an instance from a random sample of the true microbial composition in the environment of interest. The same protocol applied twice to the same sample would provide different microbial compositions. This imprecision, which arises during the library preparation and sequencing process, is larger for taxa of low abundance and should be taken into account when dealing with observed zeros. One way to handle this variability is by modeling the read counts per sample as multinomial or negative binomial, which implies that the vectors of relative abundances of the different taxa follow approximately a Dirichlet distribution. Then, using a Dirichlet Monte Carlo sampling approach, we can obtain multiple instances of the posterior distribution of the relative abundances determined for each sample (
25). By multiplying the relative abundances of each sample by the observed total number of counts, we obtain multiple abundance tables which can be analyzed with
selbal. The comparison of the microbial signatures obtained from these new Monte Carlo abundance tables to the microbial signature obtained with the initial table can be used to evaluate the robustness of the initial result.
The remainder of this paper is organized as follows. In the next section, we illustrate the proposed algorithm with a Crohn’s disease (CD) microbiome study and an HIV microbiome study. Some discussions and suggestions for future work are provided in the Discussion. In Materials and Methods, we present the detailed description of the algorithm.
selbal is accessible as an R package in GitHub (
https://github.com/UVic-omics/selbal), where the data sets and scripts to reproduce this work are also available.
DISCUSSION
The identification of microbial signatures that are predictive of a variable of interest is an essential step toward the translation of microbiome research to clinical practice. In this work, we present selbal, a greedy stepwise algorithm for the identification of microbial signatures consisting of two groups of taxa whose relative abundances, or balance, are predictive of the outcome. Working with balances and, in general, with log-contrast functions preserves the scale-invariant principle for compositional data analysis.
In the Crohn’s disease study considered in this work, selbal outperformed methods commonly used in microbiome analysis, such as DESeq2 and edgeR, in terms of discrimination accuracy. With respect to methods for compositional data, selbal performs much better than ANCOM and similarly to ALDEx2 in terms of classification accuracy, but selbal is more parsimonious.
selbal overcomes the problem of differences in sample size that is usually accommodated with different methods based on count normalization, rarefaction, or transformation into proportions. These normalization techniques are controversial since they may have an important impact on the analysis (
16,
33,
34). The only way in which data are altered in
selbal is at the zero imputation stage, which is required because of the use of logarithms and ratios in the definition of balances. This replacement of zeros by positive numbers is performed under the assumption that the observed zeros represent rounded zeros, that is, that all taxa are present in all the samples but some of them are not detected because of low abundance and insufficient sampling depth. However, it is not clear how the imputation method and the presence of structural zeros (absence of the taxa in the sample) may influence the results. Future research will focus on the treatment of zeros, with the aim of more precisely evaluating if zeros are rounded or structural, and on selecting the best replacement method.
The technical variability of microbiome sequencing data should be taken in consideration. The effects of this uncertainty on the results of
selbal can be explored through Monte Carlo sampling from a Dirichlet distribution (
25) prior to microbiome signature identification with
selbal.
A limitation of selbal is that the greedy algorithm does not guarantee that the global optimum will be found. Due to the computational cost, selbal does not explore the whole balance space; the method for selecting the optimal balance is suboptimal and may be improved. In this respect, the iterative CV process included in the selbal algorithm is useful for exploring the robustness of the result. The degree of concordance between the balances obtained in the CV process and the global balance can provide reasonable evidence to support the optimality of the global balance. Exploring possible alternative approaches in the search of the optimal balance is another topic of future research.
MATERIALS AND METHODS
Let
X = (
X1,
X2,
…,
Xk) be a composition, that is, a vector of strictly positive real numbers that carry relative information. Given two disjoint subsets of components in
X, denoted by
X+ and
X−, indexed by
I+ and
I−, and composed of
k+ and
k− features, respectively, the balance between
X+ and
X− is defined as the normalized log ratio of the geometric mean of the two groups of components as follows:
Expanding the logarithm, we obtain the result that a balance is proportional to the difference between the arithmetic means of the log-transformed variables of the two groups of components as follows:
The second expression is preferable from a computational point of view and is the one implemented in the proposed algorithm.
Given Y, a response variable, which can be either numeric or dichotomous, a composition X = (X1, X2, …, Xk), and additional covariates Z = (Z1, Z2, …, Zr), the goal of the algorithm is to determine two subcompositions of X, X+ and X−, indexed by I+ ⊂ {1, 2,…, k} and I− ⊂ {1, 2,…, k}, respectively, so that the balance B(X+, X−) between X+ and X− is highly associated with Y after adjustment for covariates Z. Depending on the nature of the dependent variable, the association can be defined in several ways.
For a continuous variable
Y, the optimization criterion is defined as minimization of the MSE of the linear regression model as follows:
For a dichotomous variable
Y, we fit the logistic regression model as follows:
In this case, we consider three possible optimization criteria corresponding to the maximization of the area under the ROC curve (default option), the maximization of the explained variance (
35), or the discrimination coefficient (
36).
Main function: selbal().
The main function of the proposed algorithm to detect the most closely associated balance is called selbal() and employs the following three steps.
Step 0: zero replacement.
The initial matrix of counts in a microbiome study, denoted
X̃, typically contains many zeros that must be treated prior to using
selbal algorithm. Though the user can perform other zero replacements before using
selbal (
18,
20), the default option in
selbal is
geometric
Bayesian
multiplicative replacement (GBM) (
23) as implemented in the
cmultRepl() function of the R package
zCompositions (
37). GBM performs Bayesian estimation of the zero values, assuming a Dirichlet model and a multiplicative modification of the nonzero values, so that both the ratios between parts and the total sum of the initial vector before the replacement are preserved. GBM performs better than other Bayesian multiplicative replacements assuming a Dirichlet distribution (
23).
selbal also provides the option of adding a value of 1 to all values in the data set. The resulting matrix with zeros replaced by positive values is denoted by
X.
Step 1: optimal balance between two components.
The algorithm evaluates exhaustively all possible balances composed of only two components, that is, all balances of the following form:
Each two-component balance (
Bij) is tested for association with the response variable
Y with one of these regression models as follows:
The balance that maximizes the optimization criteria (MSE or AUC) is selected and denoted by B1.
Note that in defining a balance for a pair of components (Xi, Xj), there are two options that differ only in their signs but provide the same association with the response:
selbal returns the balance whose regression coefficient is positive.
Step s: optimal balance—adding a new component.
For s = >1 and until the stop criterion is fulfilled, let B(s − 1) be the balance defined in the previous step:
where
and
are two disjoint subsets of indices in (1,
.
.
.,
k), with
and
elements, respectively.
For each of the remaining variables, Xp, not yet included in the balance, p ∉ , the algorithm considers the balance that is obtained by adding log(Xp) to the positive part of the previous balance B(s−1)
and the balance that is obtained by adding log(
Xp) to the negative part of
B(s−1)Each of these pairs of balances,
and
, for each of the remaining variables,
Xp, is tested for association with the response variable through one of these two regression models, where
B denotes the balance tested:
The balance that maximizes the optimization criterion defines the new balance B(s).
Stop criteria.
There are two stopping rules: the iterative algorithm stops when value corresponding to the the improvement of the optimization parameter is lower than a specified threshold th.imp (default th.imp = 0) or when the specified maximum number of components, C, has been included in the balance (default C = 20).
Iterative cross-validation: selbal.cv().
An iterative cross-validation procedure is implemented in selbal.cv() function with two goals: (i) to identify the optimal number of components to be included in the balance and (ii) to explore the robustness of the global balance identified with the whole data set.
Let M be the number of iterations (default M = 10), K the number of folds in the cross-validation (default K = 5), and C the maximum number of variables or components included in a balance (default C = 20).
At each iteration of m ∈ {1, …, M}, the data are divided into K folds, .
For each k ∈ {1, …, K}, selbal() is applied to the training data set, , and the optimal balance with C = 20 variables is obtained, . Since selbal() is a forward selection process where variables are included sequentially at every step, we have a sequence of balances, including from C = 2 to C = 20 variables:
The classification accuracy (MSE or AUC) of these balances is measured on the test data set,
, giving a sequence of accuracy measures for each number of variables included in the balance:
and similarly for AUC.
Optimal number of components.
For each number of components
we have
K ×
M measures of accuracy, MSE or AUC. The mean and the standard error are computed and are represented in a plot (
Fig. 1).
Similarly to the cross-validation process in LASSO for finding the optimal penalization parameter lambda (
38), we follow the "1se strategy" and define the optimal number of variables included in the balance (
kopt) as the lowest number whose mean MSE is within 1 standard error of the minimum mean MSE (or whose mean AUC is within 1 standard error of the maximum mean AUC). Usually, the 1se strategy provides sparser models than taking the minimum mean MSE (or maximum mean AUC), with very similar accuracy. This 1se strategy is the default option in
selbal, but there is also the possibility of determining the optimal number of variables as the value reaching the optimum (minimum mean MSE or maximum mean AUC).
Global balance.
Once the optimal number of components kopt has been determined, we apply the main function selbal() to the whole dataset, with the number of taxa C = kopt, and obtain what we call the global balance.
Robustness of the result.
Any method that requires variable selection may result in overfitting. In order to explore the robustness of the global balance and the variables that form it, we retrieve all the balances with
kopt components obtained in the cross-validation process
and compare them with the global balance. We summarize these cross-validation balances in two different ways: per balance and per variable. We provide the relative frequencies of the different balances and the proportion of times that each taxon has been included into a balance. This information, available in the output of
selbal.cv(), is summarized in a table such as that shown in
Fig. 3.
This cross-validation process can also be used to obtain the cross-validation accuracy, defined as the mean MSE or mean AUC of the balances obtained in the CV process that have the same number of variables as the global balance: or