INTRODUCTION
Over the last 20 years, the development of low-cost sequencing technologies has led to the creation of a large number of microbiome data sets, mainly generated using metataxonomic analyses based on 16S rRNA metabarcoding technology. For example, the number of papers using metataxonomic or metagenomic approaches to study the microbial communities of food increased sixfold between 2015 and 2021 and currently exceeds 600 (1); similarly, within the NCBI database, the Taxonomy ID “Food metagenome” (NCBI: Tax id 870726) is associated with 770 BioProjects. In keeping with the principles of Open Science, most of these publication-associated data sets are available in public repositories such as SRA (the Sequence Read Archive of NCBI), ENA (the European Nucleotide Archive of EBI), or the DNA Data Bank of Japan. To promote the reuse of certain kinds of data sets, specialized databases have been developed, such as MGnify for microbiome data (2). Here, we focused on metataxonomic studies, whose integration through a robust and efficient approach allows the key taxa of a specific ecosystem to be identified. The availability of a vast number of metataxonomic data sets provides an unprecedented opportunity to develop new integrative tools for comparing and better understanding taxa associations in various closely related microbial ecosystems. However, these efforts face numerous challenges related to data reusability (e.g., data availability, metadata quality, and data preprocessing) and to the choice of the most appropriate ways of identifying biologically informative features in a collection of metataxonomic studies. In this work, we address these challenges by designing a workflow for exploring public data sets related to the microbiota of fermented vegetables and performing a meta-analysis (i.e., reusing independent data sets and integrating them into a larger analysis to generate new knowledge).
Our choice of ecosystem was motivated by current interest in the bacterial communities involved in the fermentation of vegetables (3–5). Plant-based fermented foods diversify human diets and possess interesting properties in terms of sustainability and nutritional quality. These products require little energy to produce and preserve, and their consumption confers several benefits to human health (6, 7). With this study, we wanted to assess whether public data sets that are already available for fermented vegetables could help to improve our knowledge of the ecological dynamics taking place in these products. Fermented vegetables are created through the (usually spontaneous) activity of heterofermentative and homofermentative lactic acid bacteria (LAB) naturally present in the raw material (8). In Europe, the most popular example of this kind of food is sauerkraut, for which the use of pre-selected starter strains remains uncommon even for large-scale production (9). Low pH and the anaerobic conditions resulting from the fermentation process are the main factors that select for the beneficial anaerobic LAB essential in the production of good-quality fermented vegetables (3). These bacteria are a broad and diverse group of species classified in the phylum Firmicutes, class Bacilli, and order Lactobacillales and include representatives from the families Lactobacillaceae, Streptococcaceae, Enterococcaceae, Carnobacteriaceae, and Aerococcaceae (10).
It should be noted that, to date, most studies have focused on describing the microbial communities present at the end of the fermentation process (4, 5), while the dynamic succession of various microbial populations during fermentation has received little attention. This represents an important gap in knowledge, especially when compared, for example, to research on cheese microbial communities, which has revealed that the proper succession of microbial populations is important to the quality of the final product (11, 12). Two separate metataxonomic analyses have revealed important changes in microbial dynamics during vegetable fermentation. A study on carrot juice reported a succession process involving Enterobacteriaceae, Leuconostoc, and Lactobacillus, while work on Suan Cai (Chinese pickles) showed that the dominant species changed from early stages of fermentation (Leuconostoc mesenteroides) to later ones (Lactiplantibacillus plantarum) (13, 14). The little information that can be gathered on the subject does not allow us to identify species or consortia that might be responsible for controlling various stages of fermentation among different vegetables. In this context, the use of metataxonomic data to carry out meta-analysis could prove illuminating.
The use and comparison of amplicon data (such as the 16S-based data considered in the present work) raise certain difficulties. First, sequencing technology may vary among studies, as may the region amplified or PCR primers employed. Second, taxonomic assignment based on the 16S variable region is considered valid only at the genus level, limiting species-level interpretations (4). There are, therefore, two possibilities for carrying out a comparative study of multiple data sets: comparing genus-level taxonomic profiles or comparing exact sequences, specifically, amplicon sequence variants (ASVs). The advantages of the first approach include the ability to compare different sequenced regions and to reduce the sparsity of the count matrices, while the use of ASVs enables intra-genus diversity to be taken into account (15, 16). In both cases, the aim of this type of meta-analysis is often to identify core taxa based on criteria of abundance and prevalence (17).
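As an illustration of what "core taxa based on criteria of abundance and prevalence" can mean in practice, the following minimal Python sketch flags a taxon as core when it reaches a minimum relative abundance in a minimum fraction of samples. The function name, the thresholds, and the toy counts are hypothetical and chosen only for illustration; they are not taken from the studies cited here.

```python
def core_taxa(counts, min_prevalence=0.5, min_rel_abundance=0.001):
    """Return taxa that are 'core' under simple, assumed criteria.

    counts: dict mapping sample name -> dict mapping taxon -> read count.
    A taxon counts as present in a sample when its relative abundance is
    at least min_rel_abundance; it is core when it is present in at least
    min_prevalence of all samples.
    """
    n_samples = len(counts)
    presence = {}
    for taxa in counts.values():
        total = sum(taxa.values())
        if total == 0:
            continue  # an empty sample contributes no presence
        for taxon, reads in taxa.items():
            if reads / total >= min_rel_abundance:
                presence[taxon] = presence.get(taxon, 0) + 1
    return sorted(t for t, n in presence.items() if n / n_samples >= min_prevalence)


# Toy genus-level count table (hypothetical values).
toy_counts = {
    "s1": {"Leuconostoc": 50, "Lactobacillus": 50, "Erwinia": 0},
    "s2": {"Leuconostoc": 90, "Erwinia": 10},
    "s3": {"Leuconostoc": 99, "Lactobacillus": 1},
}
strict_core = core_taxa(toy_counts, min_prevalence=0.6, min_rel_abundance=0.05)
loose_core = core_taxa(toy_counts, min_prevalence=0.3, min_rel_abundance=0.05)
```

With the strict thresholds only the taxon present in every sample is retained, whereas the looser prevalence threshold also keeps taxa that are abundant in a single sample, which shows how sensitive a core definition is to these two parameters.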
The analytical design of such a study is also important. One promising approach for meta-analysis is the construction of microbial association networks, which provide additional and complementary information to classic analyses of alpha- and beta-diversity (18). Association networks enable the identification of hub species (19, 20), taxa clusters (21), and core networks, the last of which corresponds to the intersection of several microbial association networks and can be used to identify taxa and associations shared by most networks (22). Association networks were originally designed for macroscopic ecosystems and have only recently been adapted for the investigation of interactions within microbial assemblages (21). They are constructed using count data from the sequenced environment, which are compositional (23), high-dimensional, and in the form of sparse matrices, thus increasing the difficulty of analysis (21). However, compared to networks from other assemblages, the association networks in fermented ecosystems appear to be significantly smaller (16) (due to the decrease in microbial diversity over time), making them easier to construct, visualize, and compare. According to Chen et al. (24), association networks can be divided into four categories, which are built using different approaches: correlation networks [CoNet (25) and SparCC (26)], conditional correlation networks [SPIEC-EASI (27)], mixture networks [MixMPLN (28)], and differential networks (DCDTr). Due to the complexity of microbial interactions, all these approaches have important limitations, and no method has yet managed to capture all of the aspects of interest. Indeed, studies have even shown that classical measures such as Pearson and Spearman correlations can perform just as well as computationally time-consuming methods based on more sophisticated statistical models (29, 30). Integrating independent studies through network analysis offers a comprehensive view of microbial communities and facilitates the discovery of core associations among taxa (22). By modeling sample composition as a function of covariates, linear mixed models serve as an alternative method for integrating metataxonomic data sets. However, they primarily focus on differential analysis and the discovery of population structures within samples, as exemplified in the investigation of conditions such as inflammatory bowel disease (31). Conversely, meta-analyses employing a network approach center on elucidating microbial associations and are intrinsically integrative. They can be used to visualize associations between taxa and eventually highlight those that are conserved [for instance, in a study revealing the stability of gut microbiota across diverse geographic populations (32)].
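To make the notions of a correlation network and a core network concrete, the sketch below builds one Spearman-based association network per study and then keeps the edges found in every network. It is a simplified illustration under stated assumptions, not the workflow used in this study: the function names, the correlation threshold of 0.8, and the toy abundance series are all hypothetical, and raw correlations on compositional counts would normally call for an appropriate transformation first.

```python
from itertools import combinations


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def association_network(abundances, threshold=0.8):
    """Edges (a, b), with a < b, whose Spearman correlation reaches threshold.

    abundances: dict mapping ASV identifier -> abundances over ordered samples.
    """
    edges = set()
    for a, b in combinations(sorted(abundances), 2):
        if spearman(abundances[a], abundances[b]) >= threshold:
            edges.add((a, b))
    return edges


def core_network(networks, min_fraction=1.0):
    """Edges present in at least min_fraction of the networks."""
    seen = {}
    for net in networks:
        for edge in net:
            seen[edge] = seen.get(edge, 0) + 1
    return {e for e, n in seen.items() if n / len(networks) >= min_fraction}


# Two toy studies sharing three ASVs (hypothetical abundances over time).
study_a = {"asv1": [1, 2, 3, 4], "asv2": [2, 4, 6, 9], "asv3": [4, 3, 2, 1]}
study_b = {"asv1": [5, 6, 7, 8, 9], "asv2": [1, 1, 2, 3, 5], "asv3": [1, 5, 2, 9, 3]}
nets = [association_network(study_a), association_network(study_b)]
core = core_network(nets)
```

Setting `min_fraction` to 1.0 gives the strict intersection of all networks; lowering it retains associations shared by most, rather than all, networks.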
This study presents an integrative bioinformatics workflow based on a network approach for the meta-analysis of public amplicon data sets. The workflow includes steps designed to search for and select public time-series data sets and construct ASV association networks based on co-abundance metrics. Microbial communities are then analyzed by comparing and clustering the ASV networks. We applied this workflow to 10 publicly available data sets on the microbial assemblages of fermented vegetables, comprising 931 samples of different fermented vegetables collected at specific sampling times. Here, we describe the value of this approach for discovering core bacterial taxa and core associations shared by different vegetables during the process of fermentation.
DISCUSSION
This work presents an integrative bioinformatics approach that uses association networks to combine different independent data sets on the microbial dynamics of different vegetable fermentations. By using relevant network metrics and integration methods on metabarcoding data, we obtained valuable insights into bacterial community structure during different phases of fermentation. Historically, association networks have been used to detect potential inter-species interactions; here, we adapted this strategy to identify and visualize ASVs with similar temporal dynamics. To our knowledge, this work is the first to construct a core network representing the fermentation of different vegetables over time based on sequence data from multiple independent data sets. By integrating several public data sets, we were able to characterize two successional shifts that were conserved among different fermentation ecosystems: the first from the initial microbial population of vegetables to Enterobacterales, and the second to an assemblage dominated by Lactobacillales. To test the significance of the core network we obtained, we used an approach based on comparison to a null model, which was similar to that developed by Röttjers et al. (22), with a sampling of random graphs similar to that of Doane et al. (43). Indeed, the identification of core networks is a more challenging task than the computation of the global intersection network (21). With these tests, we determined that some intersections between networks would not be expected by random chance, and thus that some edges may correspond to genuine ASV dynamics shared among several studies. Finally, we complemented this approach by using the SBM method for ASV clustering, which is a technique applicable to multiplexes (a type of multi-layer network) that does not require any a priori assumptions regarding connectivity patterns. The SBM has been used for community detection in various fields, such as sociology. More recently, it has been applied to taxonomic profiling of the human microbiome in order to uncover patterns of community structure. Specifically, it was used as a bipartite model for clustering samples and taxa (44). In another study, the simple SBM enabled the detection of OTU clusters based on their connectivity patterns in a co-occurrence network (45). In the present work, we applied the multiplex version of this model to a collection of networks in order to identify clusters of ASVs that share similar patterns of associations across the different networks. We were able to identify 10 clusters of ASVs, which could be used to guide the exploration and delineation of new bacterial consortia in fermented vegetables (46).
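The logic of testing a network intersection against a null model can be sketched as follows. This is a deliberately simple stand-in and not the procedure of Röttjers et al. or Doane et al.: it assumes a null in which each random graph has the same number of edges as the corresponding observed network, drawn uniformly over the same set of nodes, and it reports an empirical p-value for the number of edges shared by all networks. The function name, the number of random draws, and the toy networks are hypothetical.

```python
import random
from itertools import combinations


def shared_edge_pvalue(networks, nodes, n_random=1000, seed=0):
    """Empirical p-value for the number of edges shared by all networks.

    networks: list of sets of edges, each edge a tuple (a, b) with a < b.
    nodes: identifiers of all nodes (for example, ASVs) that may be connected.
    Assumed null model: every random graph keeps the edge count of its
    observed counterpart, with edges drawn uniformly without replacement.
    """
    rng = random.Random(seed)
    possible = list(combinations(sorted(nodes), 2))
    observed = len(set.intersection(*[set(net) for net in networks]))
    at_least_as_large = 0
    for _ in range(n_random):
        random_nets = [set(rng.sample(possible, len(net))) for net in networks]
        if len(set.intersection(*random_nets)) >= observed:
            at_least_as_large += 1
    # Add-one correction so that the estimate is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_random + 1)


# Toy example: two identical five-edge networks over ten ASVs.
asvs = [f"asv{i}" for i in range(10)]
chain = {("asv0", "asv1"), ("asv1", "asv2"), ("asv2", "asv3"),
         ("asv3", "asv4"), ("asv4", "asv5")}
n_shared, p_value = shared_edge_pvalue([chain, chain], asvs)

# Two networks with no edge in common: any random overlap is at least zero.
other = {("asv5", "asv6"), ("asv6", "asv7")}
n_disjoint, p_disjoint = shared_edge_pvalue([chain, other], asvs)
```

A more realistic null model would typically preserve further properties of the observed networks, such as their degree distributions, which this uniform edge sampling ignores.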
With respect to the microbial ecology of fermented vegetables, our methodology not only confirms previously proposed hypotheses on bacterial succession from individual studies but also brings to light novel insights. Two expected successional shifts are observed, one from the initial very rich and diverse microbiota to Enterobacterales, and the second to an assemblage dominated by a few abundant Lactobacillales specific to the fermented product. However, contrary to expectations, we did not detect the anticipated succession between heterofermentative and homofermentative genera in our study. Our study also showed that Enterobacterales ASVs were more widely shared than Lactobacillales ASVs. Finally, our most important finding was the recurring and transient appearance, at the beginning of fermentation, of ASVs belonging to Enterobacterales and their association with ASVs affiliated with Lactobacillales. This raises the question of the ecological function of Enterobacterales in vegetable fermentation and their impact on the properties of the final product. Due to the small number of studies carried out on the subject and the extensive variability in the methodologies used, most reports have not generated convincing conclusions on the impact of Enterobacterales and their possible interactions with LAB. Nevertheless, based on the existing literature, several hypotheses can be put forward. Enterobacterales may have fermentative properties or they may participate in nutritional mutualism that is beneficial to the development of LAB. Indeed, certain trophic relationships between LAB and Enterobacteriaceae have already been described. For example, some LAB generate metabolic energy using an agmatine deiminase pathway that relies on agmatine produced by Enterobacteriaceae (47). In the wet coffee fermentation process, the first phase involves interactions between Enterobacteriaceae (with pectinolytic activity), acetic acid bacteria, and some yeasts (48). Enterobacteriaceae have also been found in two other studies on fermented vegetables (49, 50), of which the former hypothesizes that the presence of Erwinia sp. may reflect its ability to invade compromised plant tissues or its potential ability to ferment sugar.
This meta-analysis demonstrates the value of using public data sets in an Open Data and Open Science framework. The approach we designed is particularly well suited to fermented vegetable ecosystems: since these ecosystems are closed, contain relatively few taxa, and undergo a temporal succession of communities, the ASV association networks are fairly easy to visualize and interpret. This approach could be easily applied to amplicon or shotgun metagenomic data on closed ecosystems with community shifts. One limitation of the present meta-analysis is that it was carried out on a relatively small scale (on 10 independent data sets including a total of 931 samples), due to the small number of reusable public metabarcoding data sets on fermented vegetables. Hence, the biological results would benefit from confirmation with additional data sets. The scarcity of reusable data sets is mainly due to difficulties in accessing raw data (some samples are missing, some data are available only in pre-processed form, etc.) and metadata (sometimes incomplete or inconsistent, requiring manual extraction from the associated papers). These limitations were highlighted in a recent article (51), which recommended that data be deposited in public repositories together with assay metadata (technical features of the experiment) and biological metadata (environmental conditions of the biosamples). This, along with the adoption of other best practices, will enable wider reuse and integration of microbiome data sets on a broader scale.
This study is based on 16S metataxonomic data, specifically on the V4 hypervariable region, which was used in the majority of the data sets found. This region is the most frequent target of studies focused on food ecosystems, along with the V3–V4 region of the 16S rRNA gene (1). Unfortunately, this gene region has poor discriminatory power; it is able to provide reliable taxonomic assignment at the genus level only and cannot be used to study species-level diversity [unlike, for instance, the V1–V3 region (52)]. Therefore, although it is interesting to discover ASVs that are shared between different studies, this approach is ill-suited for characterizing the species- and strain-level diversity of Lactobacillales and Enterobacterales. Furthermore, the read count tables obtained for the different studies can be shaped by many biases, including differences in sample collection and storage, DNA extraction method and primer choice, variation in the number of rRNA operons (52, 53), amplification of extracellular DNA, and errors in taxonomic affiliations. Therefore, the results of any individual ASV count table must be interpreted cautiously. However, in the context of our study, the use of ASVs enabled direct comparison of sequences between studies and reduced the influence of taxonomic misclassifications (54, 55). In addition, integrating ASVs into association networks allowed comparisons of similar dynamics between ASVs in different studies and limited the biases that might arise from direct comparison of relative abundances.
This work demonstrates the effectiveness of using association networks for temporal meta-analysis. The approach we developed could easily be applied to new data sets or extended to incorporate new tools for association network inference, core network detection, and clustering. In the future, it could be interesting to integrate additional sample metadata (such as temperature, lactic acid concentration, pH, and/or salinity) if they were available in a standardized format and could be easily integrated into an association network. Indeed, many factors can influence the composition and dynamics of the fermenting microbial community, as shown previously for salinity or temperature (3). This approach could lead to the design of ideal consortia that could make vegetable fermentation safer (56), more reproducible, and exploitable on a large scale (57).
Finally, the taxonomic profile inferred from 16S rRNA is not able to provide insights into the functional profile of bacterial communities or into the part(s) played by other microorganisms [even if their presence is minor, e.g., less than 5% relative abundance for fungi and Archaea in brine food according to Leech et al. (4)]. Ultimately, there is a need for complementary functional studies (shotgun metagenomics and metatranscriptomics) to improve our understanding of vegetable fermentation and assess the functional interactions taking place during this process between all microorganisms.
ACKNOWLEDGMENTS
We are grateful to the INRAE MICA Division and ENS Paris-Saclay for funding this PhD, to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale bioinformatics Facility, doi:10.15454/1.5572390655343293E12) for providing computing and storage resources, and to Lindsay Higgins for proofreading the manuscript.
R.J., Conceptualization, Methodology, Writing – original draft, Writing – review and editing; F.V., Supervision, Writing – original draft, Writing – review and editing; M-Y.M., Supervision, Writing – original draft, Writing – review and editing; S.C., Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review and editing; H.C., Conceptualization, Funding acquisition, Methodology, Supervision, Writing – original draft, Writing – review and editing.