Metagenomic analyses.
(i) Sequence preprocessing and assembly. Sequencing fastq files were quality checked with FastQC (
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We removed human and technical contaminant DNA by aligning reads to the PhiX genome and the human genome (hg19) with Bowtie2 (
60), and used the iu-filter-quality-minoche script of the illumina-utils program with default parameters to filter the reads (
61).
(ii) Taxonomic assignment. Processed paired-end metagenomic sequences were classified using two taxonomic profilers: Kraken2 v.2.0.8_beta (a k-mer matching algorithm) (
62) and MIDAS v.1.3.0 (a read mapping algorithm) (
63). Kraken 2 examines the k-mers within a query sequence, uses the information within those k-mers to query a database, and then maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer. Kraken2 was run against a reference database containing all RefSeq viral, bacterial, and archaeal genomes (built in May 2019), with default parameters. MIDAS uses a panel of 15 single-copy marker genes present in all ∼31,000 bacterial species included in its database to perform taxonomic classification and maps metagenomic reads to this database to estimate the read depth and relative abundance of 5,952 bacterial species. We identified metagenomic samples containing
V. cholerae and vibriophage reads and computed the mean depth of coverage (number of reads per base pair) of the
V. cholerae pangenome in the MIDAS database (
Table 1).
(iii) Assembly and binning of Vibrio cholerae genomes. To recover good-quality metagenome-assembled genomes (MAGs) of
V. cholerae, we selected metagenomic samples with coverage of >10× against the
V. cholerae pangenome in the MIDAS database and used MEGAHIT v.1.2.9 (
64) to perform
de novo assembly. For 9 of the 11 selected samples, we independently assembled the genome of each sample and coassembled the two remaining samples, which belonged to the same patient (a symptomatic infected contact on days 9 and 10). Contigs of <1.5 kb were discarded.
We extracted MAGs by binning our metagenomic assemblies. Because no single binning approach is superior in every case, with performance of the algorithms varying across samples, we used different binning tools to recover MAGs. The quality of a metagenomic bin is evaluated by its completeness (the level of coverage of a population genome) and the contamination level (the amount of sequence that does not belong to this population from another genome). These metrics can be estimated by counting the frequency of single-copy marker genes within each bin (
65). We inferred bins using CONCOCT v.1.1.0 (
66), MaxBin 2 v.2.2.7 (
67), and MetaBAT 2 v.2.12.1 (
68), with default parameters. We then used DAS_Tool v.1.1.1 on the results of these three methods to select a single set of nonredundant, high-quality bins per sample (
69). DAS_Tool is a bin consolidation tool which predicts single-copy genes in all the provided bin sets, aggregates bins from the different binning predictions, and extracts a more complete consensus bin from each aggregate such that the resulting bin has the most single-copy genes while having a reasonably low number of duplicate genes (
69). We then used Anvi’o v.6.1 (
70) to manually refine the bins with contamination higher than 10% and Centrifuge v.1.0.4_beta (
71) to determine the taxonomy of all bins in each sample, in order to identify
V. cholerae MAGs.
Bins with completeness of >60% and contamination of <10% were first selected, and those assigned to V. cholerae were further filtered (completeness of >90% and contamination of <1% for the V. cholerae bins). We dereplicated the entire set of bins with dRep v.2.2.3 using a minimum completeness of 60%, the ANImf algorithm, 99% secondary clustering threshold, a maximum contamination of 10%, and a 25% minimum coverage overlap and obtained 79 MAGs displaying the best quality and representing individual metagenomic species (MGS).
(iv) Detection of Vibrio cholerae genetic diversity within and between metagenomic samples. We created a Bowtie2 index of the 79 representative genomes from the dereplicated set, including a single high-quality
Vibrio cholerae MAG, and mapped reads from each sample to this set. By including many diverse microbial genomes in the Bowtie2 index, we aimed to avoid the mismapping of reads from other species to the
V. cholerae genome and to reduce potential false-positive intrahost single nucleotide variant (iSNV) calls. As recommended, we used
Vibrio cholerae MAGs from the samples under study rather than a genetically distant reference, as read mapping to the most closely related genome available is expected to reduce the rate of false-positive iSNV calls (
72). We mapped the metagenomics reads of each sample with a
V. cholerae coverage value of >5× (obtained with MIDAS) against the set of 79 MAGs, using Bowtie2 (
60) with the –very-sensitive parameters. We also used Prodigal (
73) on the concatenated MAGs, in order to predict open reading frames using default metagenomic settings.
We then used InStrain on the 15 selected samples (
https://instrain.readthedocs.io/en/latest/index.html). This program aims to identify and compare the genetic heterogeneities of microbial populations within and between metagenomic samples (
27). “InStrain profile” was run on the mapping results, with the minimum percent identity of read pairs to consensus set to 99%, the minimum depth of coverage to call a variant of 5×, and the minimum allele frequency to confirm a SNV equal to 0.05. All nonpaired reads were filtered out, as well as reads with an identity value below 0.99. Coverage and breadth of coverage (percentage of reference base pairs covered by at least one read) were computed for each genome. InStrain identified both biallelic and multiallelic SNV frequencies at positions where phred30 quality-filtered reads differ from the reference genome and at positions where multiple bases were simultaneously detected at levels above the expected sequencing error rate. SNVs were classified as nonsynonymous, synonymous, or intergenic based on gene annotations, and gene functions were recovered from the UniProt database (
74) and BLAST (
75). Then, filters similar to those described in reference
29 were applied to the detected SNVs. We excluded from the analysis positions with a very low or high coverage value,
C, compared to the median coverage,
, and positions within 100 bp of contig extremities. As sites with very low coverage could result from a bias in sequencing or library preparation and sites with higher coverage could arise from mapping errors or be the result of repetitive region or multicopy genes not well assembled, we masked sites in all the samples if
C was <0.3
and if
C was >3
in at least two samples.
(v) Mutation spectrum of hypermutator and nonmutator samples. For each sample, iSNVs were categorized into six mutation types based on the chemical nature of the nucleotide changes (transitions or transversions). We combined all the samples with hypermutators and compared them to the mutation spectrum of the nonmutators. The mutation spectrum was significantly different between the hypermutator samples and the nonhypermutator samples (chi-square test,
P < 0.01). We then computed the mutation mean and standard error of each of the six mutation types and compared the two groups (
Fig. 2C).
(vi) Bacterial replication rate estimation. Replication rates were estimated with the metric iRep (index of replication), which is based on the measurement of the rate of decrease in average sequence coverage from the origin to the terminus of replication. iRep values (
39) were calculated by mapping the sequencing reads of each sample to the
V. cholerae MAG assembled from that sample.
(vii) Tests for natural selection. First, we identified signals of convergent evolution in the form of nonsynonymous iSNVs occurring independently in the same gene in multiple patients. To assess the significance of convergent mutations, we compared their observed frequencies to expected frequencies in a simple permutation model. We ran separate permutations for nonmutators (two genes with convergent mutations in at least two out of eight nonmutator samples, including only one time point from the patient sampled twice and excluding the outlier patient A with a large number of intergenic iSNVs) and possible hypermutators (five genes with convergent mutations in at least two out of five possible hypermutator samples). In each permutation, we randomized the locations of the nonsynonymous mutations, preserving the observed number of nonsynonymous mutations in each sample and the observed distribution of gene lengths. For simplicity, we assumed that two of three nucleotide sites in coding regions were nonsynonymous. We repeated the permutations 1,000 times and estimated a P value as the fraction of permutations yielding greater than or equal to the observed number of genes mutated in two or more samples.
Second, we compared natural selection at the protein level within versus between patients, using the McDonald-Kreitman test (
41). We again considered hypermutators separately. Briefly, the four counts (
Pn,
Ps,
Dn,
Ds) of between-patient divergence (
D) versus within-patient polymorphism (
P), and nonsynonymous (n) versus synonymous (s) mutations were computed and tested for neutrality using a Fisher exact test (false discovery rate [FDR] corrected
P values of <0.05).
Whole-genome sequencing analyses.
(i) Culture of Vibrio cholerae isolates. We selected three of the households with asymptomatic infected contacts (households 56, 57, and 58) for within-patient diversity analysis using multiple V. cholerae colonies per individual. Each index case was sampled on the day of presentation to the icddr,b, and asymptomatic contacts positive for V. cholerae were sampled on the following day, except for one contact (household 58, contact 02). This individual was positive only on day 4 following presentation of the index case, and we collected samples and cultured isolates from day 4 to day 8. Stool samples collected from three index cases and their respective infected contacts were streaked onto thiosulfate-citrate-bile salts-sucrose (TCBS) agar, a medium selective for V. cholerae. After overnight incubation, individual colonies were inoculated into 5 ml Luria-Bertani broth and grown at 37°C overnight. For each colony, 1 ml of broth culture was stored at −80°C with 30 % glycerol until DNA extraction. We used the Qiagen DNeasy blood and tissue kit, using 1.5 ml bacteria grown in LB medium, to extract the genomic DNA. In order to obtain pure genomic DNA (gDNA) templates, we performed an RNase treatment, followed by purification with the MoBio PowerClean pro DNA cleanup kit.
(ii) Whole-genome sequencing and preprocessing. We prepared 48 sequencing libraries using the NEBNext Ultra II DNA library prep kit (New England Biolabs) and sequenced them on the Illumina HiSeq 2500 (paired-end 125 bp) platform at the Genome Québec sequencing platform (McGill University). Sequencing fastq files were quality checked with FastQC, and Kraken2 was used to test for potential contamination with other bacterial species (
62).
(iii) Variant calling and phylogeny. We mapped the reads for each sample to the MJ-1236 reference genome and called single nucleotide polymorphisms (SNPs; fixed within patients) and single nucleotide variants (SNVs; variable within patients) using Snippy v.4.6.0 (
76), with default parameters. A concatenated alignment of these core variants was generated, and an unrooted phylogenic tree was inferred using maximum parsimony (MP) in MEGA X (
77). The percentages of replicate trees in which the associated taxa clustered together in the bootstrap test (1,000 replicates) are shown next to the branches. The MP tree was obtained using the subtree-pruning-regrafting (SPR) algorithm with search level 1, in which the initial trees were obtained by the random addition of sequences (10 replicates).
(iv) De novo assembly and pangenome analyses. We
de novo assembled genomes from each isolate using SPAdes v.3.14 on the short reads, with default parameters (
78), and used Prokka v1.14.6 (
79) to annotate them. We constructed a pangenome from the resulting annotated assemblies by combining Roary v.3.13.0 (
80) and GenAPI (
81), identifying genes present in all isolates (core genome) and genes present only in some isolates (flexible genome). The flexible genome and the phylogenetic tree were visualized with Phandango v.1.1.0 (
82).