Here we demonstrate improvements in both sampling depth and sequence quality by use of an inexpensive and rapid sequencing methodology. An advantage of this technique over current high-throughput methods is the assembly of paired-end reads, which greatly reduces the number of erroneous sequences included in downstream analyses. Importantly, as the read lengths for the Illumina platform increase (currently at ∼125 bases), so too will the quality of the libraries generated with this technique. Additionally, the use of index sequences enables many samples to be sequenced in parallel. We have tested 24 indexed primers in our laboratory (data not shown), and additional index sequences have been provided that can further increase sample throughput (see Table S1 in the supplemental material). Further improvements to this method can be introduced, such as the addition of a highly diverse series of bases adjacent to the forward sequencing primer binding area (see Table S1). This addition improves Illumina base calling because the algorithm identifies clusters optimally on the flow cell when maximum nucleotide diversity is present across the first four bases sequenced in the forward read. In addition, the long oligonucleotide primers used here were purified commercially by PAGE for an additional cost (IDT, Coralville, IA). Future research will determine if standard desalting of primers will be sufficient to generate Illumina data sets, which would reduce the start-up cost for this new technology.
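The base-diversity requirement described above can be checked before primers are ordered. The following is a minimal sketch, not part of the published protocol: it tabulates the base composition at each of the first four positions of the forward reads expected from a pool of indexed primers. The function names, the example prefixes, and the 10% minimum fraction are illustrative assumptions, not values from this study or from Illumina documentation.

```python
from collections import Counter


def first_bases_composition(read_starts, n_positions=4):
    """Fraction of A, C, G, and T at each of the first n_positions cycles."""
    composition = []
    for pos in range(n_positions):
        counts = Counter(seq[pos] for seq in read_starts if len(seq) > pos)
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"no sequence covers position {pos}")
        composition.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return composition


def is_balanced(composition, min_fraction=0.1):
    """True if every base reaches min_fraction at every position.

    min_fraction is an arbitrary illustrative threshold (an assumption).
    """
    return all(
        fraction >= min_fraction
        for position in composition
        for fraction in position.values()
    )


# Hypothetical read starts: four staggered prefixes versus one shared prefix.
diverse_pool = ["ACGT", "CGTA", "GTAC", "TACG"]
uniform_pool = ["CCTA", "CCTA", "CCTA", "CCTA"]

print(is_balanced(first_bases_composition(diverse_pool)))
print(is_balanced(first_bases_composition(uniform_pool)))
```

A pool that fails such a check would be a candidate for the diverse-base addition mentioned above.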
With the increase in recovered sequences, there is a corresponding increase in artifactual sequences. The capacity of the Illumina platform to generate enormous data sets is undoubtedly an advantage; however, if low-abundance phylotype discovery and accurate measurements of alpha diversity are desired, errors must be managed effectively. Otherwise, community characterization is only useful at a coarse level. In this study, assembly was accomplished by the use of overlapping paired-end reads, and a modified single-linkage clustering protocol was applied at 97% sequence identity. Future work will identify effective clustering algorithms that adequately reduce data sets to the expected phylotype diversity, as shown recently for 454 pyrosequencing data (13), and that are scalable to sequence libraries possessing many millions of sequences and hundreds (or thousands) of samples. Additionally, problems resulting from the sensitivity of the technology (e.g., sequencing of low-abundance sequence contamination in laboratory growth medium) would be bypassed by multiplex PCR amplifications directly from environmental samples as outlined in this protocol.
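The idea of assembling overlapping paired-end reads can be illustrated with a short sketch. This is not the assembly software used in this study. It assumes that the reverse read is reverse complemented, that the longest overlap of at least `min_overlap` bases is taken, and that the overlap must match perfectly for a pair to be kept. The function names, the `min_overlap` default, and the example amplicon are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assemble_pair(forward, reverse, min_overlap=10):
    """Merge a read pair through its overlap, or return None.

    The 3' end of the forward read must match the start of the
    reverse-complemented reverse read exactly (an assumed, strict rule);
    the longest qualifying overlap is used.
    """
    rev_rc = reverse_complement(reverse)
    max_overlap = min(len(forward), len(rev_rc))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if forward[-overlap:] == rev_rc[:overlap]:
            return forward + rev_rc[overlap:]
    return None


# Hypothetical 32-base amplicon read from both ends with a 12-base overlap.
amplicon = "ACGTTGCAAGGCTTACCGATAGCTAGGATCCA"
forward_read = amplicon[:22]
reverse_read = reverse_complement(amplicon[10:])

print(assemble_pair(forward_read, reverse_read) == amplicon)
```

Under this strict rule, a pair with a base-calling error inside the overlap is discarded rather than corrected, which is one simple way in which assembly can exclude erroneous sequences from downstream analyses.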
Regardless of sequencing artifacts, advances in sequencing technologies are paralleled by increased magnitudes of phylotype diversity surveyed from microbial communities. Although a small number of sequences may be sufficient to detect underlying patterns differentiating highly divergent communities (15), larger data sets are required to identify more subtle responses to environmental factors among less predominant populations and for increased sequence coverage of the rare biosphere (13, 31). Rare microbial taxa likely represent microorganisms that (i) are adapted to life at low relative abundance, (ii) have not been discovered previously, and (iii) possess abundance distributions with important correlates to measured physicochemical parameters. In this study, the Illumina sequencing platform provided access to low-abundance phylotypes from soil with adequate coverage (Fig. 4) and combined library sizes greater than those reported previously (3, 4, 29). The main limitation of recent iterations of the Illumina platform has been the reduced taxonomic resolution of short sequence reads (3, 8, 20). With the introduction of 125-base paired-end reads reported here, this sequencing methodology can now span the taxonomically informative V3 variable region of the 16S rRNA gene and will soon generate two-fold coverage of complete PCR amplicons as sequence length continues to increase. Note that the V3 region chosen here was selected because the primers used are the same as those used for DGGE of bacterial communities (22) and that this region is longer (∼170 to 190 bases) than the V6 region, which was sequenced elsewhere (∼105 to 120 bases) (8). Although base-calling accuracy decreases markedly toward the 3′ end, the sequence read overlap of 66 ± 11 nucleotides (ATCL library) greatly increased data quality in this region (Fig. 2). The primers and adaptors are modular, so this sequencing methodology can readily be modified to target other genes or regions of interest. This versatile, affordable, and powerful methodology greatly increases the depth at which low-abundance organisms can now be probed, as noted by high Good's coverage estimates (Fig. 4), high levels of similarity between replicates (see Fig. S3 in the supplemental material), and the number of unclassified or unique taxa in low-abundance groups (Fig. 5), suggesting that we are now able to comprehensively and reproducibly characterize and compare abundant and rare populations across multiple samples derived from complex microbial communities.
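For readers unfamiliar with Good's coverage, the estimator is commonly written as C = 1 - (n1/N), where n1 is the number of phylotypes observed exactly once and N is the total number of sequences in the library. The sketch below computes it from a list of per-phylotype sequence counts. The function name and the example counts are hypothetical and are not data from this study.

```python
def goods_coverage(phylotype_counts):
    """Good's coverage, C = 1 - n1 / N.

    phylotype_counts holds the number of sequences assigned to each
    phylotype; n1 is the number of singletons and N the library size.
    """
    total = sum(phylotype_counts)
    if total == 0:
        raise ValueError("library contains no sequences")
    singletons = sum(1 for count in phylotype_counts if count == 1)
    return 1.0 - singletons / total


# Hypothetical library: 20 sequences in 5 phylotypes, 2 of them singletons.
print(goods_coverage([10, 5, 3, 1, 1]))
```

Values close to 1 indicate that few of the sequences in a library belong to phylotypes seen only once.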