INTRODUCTION
Accurate identification of distinct functional units in natural bacterial communities is crucial in understanding their ecological roles, interactions within the network, as well as the fine-scale composition and dynamic changes within the whole community. As a rule of thumb, a bacterial phylotype is often defined by grouping strains that share a sequence identify greater than 97% for a selected fragment of the small subunit (16S) rRNA gene (
1). However, increasing evidence has indicated that a bacterial phylotype may contain multiple finer lineages, each showing distinct biological traits. For example, closely related enterotoxigenic
Escherichia coli (ETEC) isolates form discrete lineages with consistently definable variations in virulence profiles (
2). Such intraphylotype lineages could be delineated based on divergence in genomic sequences and phylogenetic inferences. These finer subdivisions of phylotypes are called sequence-discrete populations (SDPs), which typified by genetic and genealogical discontinuity from the rest of the community and are delineated by overall sequence divergence at the whole-genome level (
3–5). A broad comparison of 90,000 bacterial genomic sequences, with a close examination of pairwise genomic similarities in natural bacterial communities, has proved the pervasive discontinuity in genetic similarity below and above SDPs (
3). Bacteria in the same SDP normally show less than ca. 5% variation in whole-genome sequences. This genetic divergence is much less than those among strains of the same phylotype (ca. 30%) (
6). With respect to habitats, specific SDPs are likely ubiquitous in various environments, such as human and animal guts (
5,
7,
8), freshwater (
9), ocean (
10) and soil (
11). Based on large-scale whole-genome and metagenomics investigations, these genetically and ecologically discernible SDPs have been identified within multiple phylotypes,
e.g., for intestinal
Prevotella copri (
8,
12) and
Eubacterium rectale (
13), four to five geographically stratified SDPs consist in human (or mouse) spanning geography and lifestyle. Therefore, SDPs are probably better than phylotypes, as taxonomic units that represent functional entities in bacterial communities, which are likely shaped by ecological pressure and evolutionary selection. As such, SDPs are important units of microbial diversity and should be considered baseline information for investing crucial questions, such as “how do bacterial populations interact and evolve within communities?” (
4).
Despite the essential nature of accurate SDP identification, a rapid and accurate method that can trace SDP boundaries is still lacking, especially with regard to the selection of proper markers for evaluating sequence divergence. It is obvious that genetic divergence among bacterial strains is dependent on which genes are compared. We now understand that the commonly used 16S gene cannot generally provide sufficient resolution to characterize SDP diversity (
14,
15). For example, in cases where the SDPs show an ∼5–10% genome-wide divergence, they varied mostly merely < 1% in the 16S sequences (
16). Moreover, the copy number of the 16S gene may vary significantly among phylotypes or even among strains of the same phylotype, making quantitative characterization of bacterial community a challenging, if not impossible, task (
17,
18). The 16S was selected for phylotype delineation years ago because it has conserved primer sites that flank relatively variable regions that made it easy to sequence with Sanger technology. Currently, much effort has been put into developing genes or gene segments that can be easily sequenced, and that vary enough to serve as practical proxies for SDP delineation (
19–21). However, a systematic evaluation of the validity and performance of such genes in SDP delineation, which includes the rapidly increasing but heterogeneously sampled database, has not been carried out.
Fortunately, recent developments in microbial genomics show a promising solution to complement the coverage of bacterial genomes. The number of sequenced genomes of various bacterial lineages has been growing rapidly. For example, the Genomes OnLine Database (GOLD) now contains 437,099 bacterial genomes, the majority of which (397,945) are uncultured, representing host-associated, environmental and engineered ecosystems (
22). The ever-growing bacterial genome data set offers a great opportunity to screen phylogenetically informative genes that show good performance in taxonomic delineation, including those capable of quantitatively characterizing bacterial communities at the SDP level (
23,
24). For instance, Wu and colleagues identified 114 PhyEco universal markers for all bacteria (
25). From these universal markers, 15 single-copy protein-coding genes were successfully applied in estimating species abundances using shotgun metagenomic data (
26). On the other hand, growing numbers of genomes and metagenomes produced for particular bacterial communities or taxonomic groups allow for comprehensive characterization of SDP diversity within focal environments and bacterial groups. Taking social bee gut microbiota as an example, diverse strains derived from the major honeybee hosts have been isolated and deep-sequenced (
27), including well-covered SDPs of nearly all core gut bacterial phylotypes (
5,
28,
29). Thus, the relatively complete genome data set provides a genome-wide-based gold standard for defining SDPs for the honeybee core bacteria.
Honeybees are important pollinators and play vital roles in the sustainability of ecosystem and agriculture (
30). Honeybee has a simple and specialized gut microbial community, consisting of 5–9 core bacterial genera (
31,
32), which benefit the host in multiple aspects, including diet digestion (
33,
34), nutrient provision (
35), pathogen resistance (
36), immune modulation (
37) and endocrine signaling (
35). Despite of its relatively simple diversity at the generic level, the gut bacterial community of the honeybee contains extensive genetic divergences, where high degrees of gene content diversity have been described within each phylotype (
38–40). Strains belonging to the same phylotype form distinct phylogenetic clusters according to host species,
e.g., Lactobacillus Firm5 (a dominant gut symbiont of honeybees) strains isolated from honeybees and bumble bees are grouped into 4 and 2 SDPs, respectively, showing distinct host specificity (
38,
40). Shotgun metagenomics further revealed that SDPs generally co-occurred in individual honeybees with a relatively stable abundance (
5). As gene repertoires can differ markedly among SDPs (
38,
40), SDP profiling is essential for resolving the fine-scale diversity and the organization of ecological function in bee gut microbial community, as well as for understanding their impacts on the hosts.
In the present study, we developed a pipeline to screen potential marker genes capable of accurate identification and quantification of SDP diversity. Here, we have comprehensively collected core bacterial strains from the Asian honeybee
Apis cerana across China. Among these core bacterial phylotypes, all known SDPs are well represented by numerous strains, and at least one strain has been genome-sequenced for each SDP. Therefore, we used
Gilliamella as a proof of concept to examine the efficacy of the selected marker gene. In particular, we applied the available genome sequences as a leverage to delineate
Gilliamella SDPs, which serves as the reference for marker evaluation. We further screened from a 15 single-copy protein-coding gene set leveraged from MIDAS software (
26), which had been broadly applied for identification and quantification bacterial species from shotgun metagenomic data, to identify candidate marker genes capable of differentiating the defined
Gilliamella SDPs. Important characteristics such as the level of sequence divergence, phylogenetic robustness, and the presence of conservative primer regions, are considered in marker gene screening. Finally, we applied the candidate markers in amplicon sequencing of both bacterial mock samples and real honeybee guts to verify their efficiency in SDP profiling (
Fig. 1). The markers we identified could accurately, consistently and quantitatively capture SDP diversity.
SUMMARY AND DISCUSSION
We developed a pipeline to identify reliable marker genes for accurate identification and quantification of SDPs from bacterial communities. Three important criteria were applied in the assessment: the extent of sequence divergence, phylogenetic accuracy, and the presence of flanking conservative primer regions. Single-copy protein-coding genes identified by our pipeline were applied as marker genes in SDP quantification of honeybee gut microbiota, successfully producing results consistent with those from metagenomics, which were used as the gold standard. Conversely, we showed that the widely used 16S V4 region contained limited sequence divergence within phylotypes, failing to provide sufficient resolution in differentiating SDPs. As a result, 16S V4 amplicon sequencing cannot reflect fine scale bacterial diversity for the community. Consequently, dominant OTUs delineated by 16S V4 at 97% or 99% thresholds significantly differed from the defined SDPs. On the other hand, the OTUs of single-copy protein-coding genes screened out by our pipeline were successfully assigned to the correct SDPs, and the numbers of dominant OTUs showed more congruent results to those from metagenomics.
Compared with whole-genome shotgun sequencing, amplicon sequencing of single-copy protein-coding genes provides an alternative solution to characterize SDP diversity in an accurate, quantitative and economical way. We address that not every single copy protein-coding gene is efficacious in SDP quantification. The candidate gene must meet all three criteria integrated in our pipeline to be a good marker gene. Notably, the three marker genes identified in this study were screened from a small set of candidates containing only 15 genes recommended for bacterial species identification. We emphasize that these genes are not meant to be exhaustive for bacterial SDP profiling. A larger candidate gene set (
e.g., the 114 PhyEco bacterial universal genes reported in Wu et al. [
25]) or even the whole set of single-copy genes that are conservative across bacterial lineages) will likely result in a lot more marker genes suitable for identifying and quantifying bacterial SDPs. In this case, we expect dozens to hundreds of proper marker genes to be filtered out. On the other hand, a small set of core single-copy protein-copy genes that are determined to be universally present among known bacteria, such as the 15 marker genes tested in this study, will likely provide candidate genes suitable for accurate characterization of SDP diversity for less known bacterial taxa.
Accurate identification of the SDP composition will also facilitate the prediction of the functional capacity of microbial communities. Functional attributes of a given bacterial lineage are strongly correlated with its phylogenetic position (
43). Therefore, various approaches,
e.g., PICRUTs (
44), have been developed to predict potential functions of a given microbial community based on phylogenetic profiles of bacterial members. As demonstrated in this study, single-copy protein-coding genes identified by our pipeline show better fidelity in revealing phylogenetic relationships for the focal phylotype. Therefore, we anticipate that function prediction for microbial communities will be further improved by integrating single-copy protein-coding genes and the screening pipeline described here.
MATERIALS AND METHODS
Genome references of core gut bacteria of honeybees.
A total of 242 bacterial genomes associated with A. mellifera and A. cerana were downloaded from the NCBI genome database (Table S1). These 242 genomes were used as the reference database of honeybee gut bacteria, which comprised the 6 major phylotypes: Apibacter (n = 16), Bifidobacterium (n = 28), Lactobacillus Firm4 (n = 2), Lactobacillus Firm5 (n = 24), Gilliamella (n = 130) and Snodgrassella (n = 42).
SDP delineation for honeybee core phylotypes.
Protein-coding genes of all sequenced genomes were annotated using Prokka (
https://github.com/tseemann/prokka) (
45). Core genes, which were defined as being shared by > 99% strains of a given phylotype, were identified using Roary (version 3.13.0) (
46) with the parameter -blastp 75. Multiple sequence alignments were carried out using MAFFT (version v7.467,
https://github.com/The-Bioinformatics-Group/Albiorix/wiki/mafft) (
47). Phylogenetic trees were constructed using core single-copy genes of each phylotype by RAxML (version 8.2.12, -x 12345 -N 1000 -p 12345 -f a -m GTRGAMMA) (
48). Phylogenies were visualized in R (version 3.6.0) using the package ggtree_v2.4.1 (
49) or iTOL (version 6.1.1) (
50). Pairwise genome-wide average nucleotide identity (gANI) values were calculated using pyani (version 0.2.10;
https://github.com/widdowquinn/pyani) (
51). A clade with a gANI ≥ 95% from its closest clade was defined as an SDP.
Screening for candidate marker genes capable of discriminating Gilliamella SDPs.
The 15 universal single-copy maker genes (
frr, nusA, pth, rbfA, recR, rnhB, ribF, rimM, rsfS, RuvA, smpB, truB, miaA, murB, and
yebY, listed in Table S2) (
26) were evaluated as candidate genes. The sequences of candidate marker genes were retrieved by MIDAS (version 1.3.2) (
26), whereas the 16S genes were retrieved from the reference genomes using an in-house script. The average Shannon entropy (ASE) of the full gene length was used to assess sequence variation between strains of inter- and intra-SDPs for all phylotypes, where the Shannon entropy for each nucleotide site across genomes in comparison was calculated using oligotyping (version 2.1) (
52).
The phylotype
Gilliamella, which contains the most genomes available for this study, was used as a proof of concept to examine the efficacy of marker genes in SDP differentiation. For each SDPs in phylotype
Gilliamella, the Shannon entropy values were subsequently averaged for each 20-bp slide-window with a 5-bp step to evaluate the regional genetic divergence along the full length of the marker genes. Pairwise sequence similarities were determined by Clustal Omega (
53).
From the candidate genes, potential marker genes that may efficiently distinguish all known SDPs of the Gilliamella phylotype were screened. The following criteria were followed: (i) the marker genes should contain conservative regions flanking the hyper-variable region for designing primers enabling recovery target phylotype; (ii) the amplicon length is between ∼150-550 bp; (iii) the amplified region is sufficiently variable to allow the discrimination of SDPs; and (iv) the primers are specific to the focal phylotype to avoid off-target amplifications. The aforementioned 15 marker genes were subject to these criteria, and 5 of them (ffr, nusA, pth, truB, and smpB) were selected as potential markers for identifying SDPs of A. cerana Gilliamella. Among these, three genes (ffr, nusA, and pth) were subjected to further testing as a proof of concept, because their amplicon lengths were 206, 206 and 230 bp, respectively, which were ideal for current shotgun sequencing platforms. To increase the throughput and cost efficiency, 24 amplicons were pooled for one sequencing run. The 5′ end of both forward and reverse primers were tagged with 6-bp unique barcode sequences (see Table S3) to distinguish positive and negative DNA strains, and to differentiate samples.
Bacterial mock samples.
One representative strain from each of the five
Gilliamella SDPs associated with
A. cerana was cultured at 35°C and 5% CO
2 for 48 h, on heart infusion agar (HIA) medium containing 5% sheep’s blood (
54). To screen potential contaminations, the full-length 16S gene was amplified for each bacterial culture using universal primers 27F and 1492R (
54) and was subject to Sanger sequencing. 16S sequences were checked against those of the reference strains for identification, before strains were mixed for mock samples. Each
Gilliamella culture was adjusted to OD600 = 0.5. Twenty-four mock SDP communities were prepared by mixing up 2–5 of the representative strains at varied proportions. The compositions of the mock samples were set as: equal proportion of each of the five strains, equal proportion of four strains with the absence of one strain at a time, equal proportion of three strains with the absence of two randomly selected strains, and a series of varied compositions with relative abundances ranging from ca. 0.02% to 50%. DNA of the bacterial mixtures were extracted using a CTAB-based DNA extraction protocol followed by recovery in 10 mM Tris-EDTA buffer (1×TE, pH 7.4) and quantified using the Qubit DNA assay kit on a Qubit 3.0 Fluorometer (Life Technologies, CA, USA). Alternatively, genomic DNA of each of the five representative strain cultures was extracted separately and the mixed at varied compositions and proportions (see Table S4).
SDP identification and quantification for mock samples using amplicon sequencing of the three marker genes.
PCR amplification was performed for frr (frr-F 5′-GCTGAAGATGCAAGAAC and frr-R 5′-GCATCACGACGAATATT), nusA (nusA-F 5′-CTTGAAATTGAAGAACT and nusA-R 5′-GTACCTTGTTCAGCTAA), and pth (pTH-F 5′-AAACTTATTGTAGG and pTH-R 5′-CCACTTAAATTCATAAA) for each mock sample with three replicates. Triplicate 50-μl reactions were carried out with 25 μl of 2 × Phanta Max Master Mix (Vazyme Biotech, Nanjing, China), 2 μl (each) of 10 μM primer, 19 μl of ddH2O, and 2 μl of template DNA. The thermocycling profile consisted of an initial 3-min denaturation at 95°C, 35 cycles of 15 s at 95°C, 15 s at 52°C for nusA and frr or at 42°C for pth and 20 s at 72°C and a final 10-min extension step at 72°C. After being visualized on 2% agarose gels, DNA was purified using a gel extraction kit (Qiagen, Germany) and quantified using the Qubit DNA assay kit on a Qubit 3.0 Fluorometer. Barcoded amplicons of up to 24 mock samples were pooled and subject to Illumina sequencing using a NovaSeq 6000 platform (PCR-free library, 150 PE) at Novogene (Beijing, China). Approximately 1 Gb of raw data were obtained from each pooled library (Table S5).
The program fastq-multx (version 1.3.1.
https://github.com/brwnj/fastq-multx) was employed to demultiplex sequencing reads based on barcode sequences. The 6-bp barcodes in reverse sequences were trimmed using Seqtk (
https://github.com/lh3/seqtk). The demultiplexed paired-end reads were then analyzed in QIIME2 (version 2020.2.
https://qiime2.org) (
55). A plugin DATA2 (
56) was used to denoise reads and to group sequences into amplicon sequence variants (ASVs). Individual ASVs were then taxonomically classified using blast (classify-consensus-blast) at a 97% identity threshold (Fig. S3) against the 3 marker genes (
ffr,
nusA, and
pth) derived from the customized bee gut bacterial data set. The relative abundance of each SDP (RA
SDP) was calculated as: RA
SDP = (NR
SDP)/(NR
Gillia)*100, where NR
SDP represents the number of reads mapped to the focal SDP and NR
Gillia represents the number of reads mapped to all
Gilliamella SDPs. These estimated abundances were then compared to those of the mock samples. The performance of SDP profiling of the 3 marker genes was evaluated on the basis of accuracy, sensitivity and repeatability. Intraclass correlation coefficient (ICC) with a two way random/mixed (ICC[C,1]) model was used to assess the repeatability of this method using SPSS (version 20.1) (
57).
Rarefaction curves were plotted using identified SDP numbers against read numbers, which were used to infer the minimum read number required to detect strains at varied proportions. For each sample, ASVs with a depth <100 were filtered out. Rarefaction was performed using QIIME2 with the plugin alpha-rarefaction and a sampling depth of 40,000 reads per sample and default parameters. Minimum read numbers for identifying SDPs with relative abundances of 0.02%, 1% and 20% were chosen manually.
SDP identification and quantification for A. cerana gut microbiota using 16S V4, marker genes and metagenome sequencing.
Adult worker bees collected in Sichuan were used to quantify
Gilliamella SDP diversity using three different methods (16S V4 region amplicon sequencing, MGAS and metagenomics sequencing). Bees were first cooled at 4°C for 10 min. Then the entire guts were dissected from the abdomen using sterile forceps and DNA was extracted using a CTAB bead-beating protocol described previously (
29).
First, the 16S V4 region was amplified for six bee guts from Sichuan and sequenced using an Illumina Hiseq X 10 platform (250–300 bp insert size, 250 PE) at BGI-Shenzhen (Shenzhen, China). Raw reads obtained for each sample were summarized in Table S6. Data quality control was performed using fastp (version 0.13.1, -q 20 -u 10 -w 16) (
58). The demultiplexed sequences were denoised and grouped into ASVs using an open reference method VSEARCH (
59) embedded in QIIME 2. The taxonomic identification for ASVs was subsequently performed using the naive-Bayesian classifier trained on the BGM-Db, a curated 16S reference database for the classification of honeybee and bumblebee gut bacteria (
60). A feature table and ASVs consisting of filtered 16S reads pertaining to
Gilliamella was constructed. OTU clustering was performed at both 97% and 99% identity thresholds, respectively, using VSEARCH with cluster-features-de-novo method. Additionally, low-abundant OTUs comprising of <100 reads were removed. Taxonomic assignments for OTUs were performed using blast against the BGM-Db with SDP-level taxonomy. OTU composition heatmaps were generated based on relative abundances and visualized in R.
Second, for each sample, the marker genes frr and pth, which demonstrated the best and worst performances in accuracy and sensitivity, respectively, among the 3 marker genes, were applied following the same pipeline used in the mock samples. ASVs of the six sample from Sichuan were clustered into OTUs and filtered following the above-mentioned 16S V4 pipeline. Taxonomic assignments for OTUs were performed by blast against frr sequences derived from the customized bee gut bacterial genome sequence database.
Finally, metagenome sequencing of four bee (B0108, B0120, B0154, and B0174) guts was performed using an Illumina Hiseq X 10 platform (300–400 bp insert size, 150 PE) at BGI-Shenzhen. Additional metagenomes of eight worker bee guts (BioProject PRJNA705951) were download from NCBI (Table S6). The metagenome sequencing was used as the gold standard for
Gilliamella diversity distributed in the honeybee guts. Shotgun reads mapped to the
A. cerana genome (GCF_001442555.1) using BWA aln (version 0.7.16a-r1181, -n 1) (
61) were identified as host reads and subsequently excluded. We used the ‘run_midas.py species’ script in MIDAS with default parameters to estimate the relative abundances of SDPs for each sample. Finally, the results from MGAS were compared to those from metagenome sequencing to assess the performance of the marker genes.
Data availability.
Raw data from MGAS, 16S V4 amplicon and metagenomics sequencing have been submitted to NCBI under BioProject
PRJNA772085.