Nonpareil curves are plots of abundance-weighted average coverage (
Ĉ) per sequencing effort (
LR) fitted to the cumulative probability function of the gamma distribution (
3) with parameters α and β as follows:
Ĉ = γ[α, β × log(
LR + 1)]/Γ(α), where Γ is the gamma function and γ is the lower incomplete gamma function. Hence, we can use the mode of the corresponding gamma distribution to identify the value of log(
LR + 1) corresponding to the inflection point of the curve, which we propose as a measurement of sequence diversity as follows:
Nd = (α − 1)/β.
To evaluate the correlation between
Nd and traditional measurements of alpha diversity derived from taxonomic affiliation, we compiled three collections of metagenomic data sets. Set I consisted of 90 samples from multiple environments, set II consisted of 54 human-associated microbiome metagenomic samples, and set III consisted of 292 marine samples. Set I was used to evaluate the general agreement between traditional taxonomy-based alpha diversity and
Nd. It consisted of 84 metagenomic data sets from multiple environments divided into six distinct biomes plus six mock data sets (
Table S3). Subsamples of processed nucleotide reads (between one and seven file chunks of 0.5 gigabyte zipped, as necessary to surpass 60% coverage or as many files as available) and OTU tables based on metagenome-derived 16S rRNA gene-containing reads were obtained from EBI Metagenomics (
11). The Shannon index (
H′) of each OTU table was estimated on the basis of Bayesian estimates of frequencies by using the Dirichlet multinomial pseudocount model with Laplace prior, as implemented in the R package entropy (
22), with the exception of mock samples, for which maximum-likelihood entropy of input concentrations was used. Note that the 16S rRNA gene-containing reads derived from metagenomes do not necessarily overlap; hence, the EBI Metagenomics Pipeline uses closed-reference OTU picking (
23), potentially biasing the results by database completeness. Therefore, we extended this analysis by using set II derived from the HMP (
24). This collection consisted of samples from 13 body sites including 54 metagenomic data sets and 3,613 16S rRNA gene amplicon data sets. Fifty-three samples included both types of data sets. An OTU table including all of the processed samples was obtained from HMP Qiime Community Profiling (
23,
24), from which
H′ values per sample were estimated. Values were compared against
Nd values derived from the metagenomic data sets by body site (medians per site) and by sample (intersection samples). Finally, we evaluated the resolution of
Nd compared to
H′ by using set III, a collection of marine samples derived from two global sampling projects, 228 samples from the Tara Oceans expedition between September 2009 and March 2012 (
12) and 64 samples from the Global Ocean Sampling expedition between March 2009 and December 2010 funded by the Beyster Family Fund and the Life Technology Foundation (SRA BioProject accession no. PRJEB10418). Sample metadata, as well as preprocessed reads and 16S rRNA gene OTU tables, were obtained from EBI Metagenomics (
11). ANOVA was performed to evaluate the effects of different metadata variables on both
Nd and Bayesian analysis-corrected
H′ of extracted 16S rRNA genes. The variables considered were size fraction (categorical), absolute value of latitude (i.e., degrees from the equator), latitude (degrees), geographic location (Mediterranean Sea, North Atlantic Ocean, North Pacific Ocean, Indian Ocean, Red Sea, South Atlantic Ocean, South Pacific Ocean, or Southern Sea), sampling date, and two decomposed seasonal components of sampling date. The decomposed seasonal components were estimated as the sine (vernality) and cosine (wintriness) of the date in radians (1 year = 2π) for samples in the Northern Hemisphere or the negative sine and negative cosine, respectively, for samples in the Southern Hemisphere.