INTRODUCTION
The last 10 years of microbial ecology research have involved a profound shift in focus from observational phylogenetic analyses of as-yet uncultured novel taxa (
1) to experimental characterization of the taxonomic shifts in communities through the use of complex experimental designs (
2). This shift in focus has been driven by the advent of relatively inexpensive next-generation sequencing approaches and the development of robust bioinformatic tools. The most commonly used sequencing platform has been from 454 (
3); however, there is growing interest in using the IonTorrent (
4), PacBio (
5), and Illumina (
6) platforms. This transition between platforms has not been painless, as there has been a steady stream of concerns raised regarding the quality and meaning of sequence data generated by the 454 sequencers (
7,
8). Numerous groups have worked to develop bioinformatic solutions that make the 454 platform a robust approach to characterizing microbial communities (
9,
10). As other sequencing platforms mature and perhaps replace 454 as the platform of choice for 16S rRNA gene sequencing, it is critical to develop similar solutions so that the field does not sacrifice high data quality for increased sequencing throughput.
A number of considerations must be made when selecting a platform for sequencing the 16S rRNA gene. We contend that sequence quality is the most important consideration, as studies that are built upon data that are unreliable are themselves unreliable (
10). A second important consideration is the number of reads that one can obtain per run and per dollar. This is significant, because investigators can titrate the number of samples and reads per sample using multiplexing strategies to fit the number of overall reads. A third consideration for 16S rRNA studies is the length of the sequences, as longer sequences are easier to assign to a taxonomic group using a classifier (
11). Finally, the customer and technical support of the companies that manufacture the reagents and instrumentation and the availability of their platform and reagents are significant factors. The goal of this study was to assess the quality of MiSeq-generated data and to determine its advantages and disadvantages compared to 454. Further studies are necessary to make similar comparisons to other platforms.
Illumina-based strategies are able to generate the largest amount of sequence data per dollar, using a chip-based bridge amplification procedure followed by sequencing by synthesis using reversible terminator dye nucleotides (
12). Depending on the platform and reagents, one can currently obtain up to 300 and 500 cycles (i.e., nucleotides [nt]) of sequence data on HiSeq and MiSeq platforms, respectively. These cycles are commonly split into two reads, providing paired reads of the same DNA fragment. The platforms also vary in their sequencing throughput, with HiSeq 2500 being capable of generating 600 Gbp using paired 100-nt reads (i.e., 3 billion pairs of reads) or 180 Gbp using paired 150-nt reads (i.e., 1.2 billion pairs of reads), and MiSeq was capable of generating 8.5 Gbp using paired 250-nt reads (i.e., 17 million pairs of reads). Because the HiSeq platform requires one to fill 16 sequencing lanes with the same reagents, logistically it is more difficult for an individual to fill a complete run with a 300-cycle kit when the 200-cycle kit is more commonly used within the genomics field. Reagents for the HiSeq (300 cycle) are approximately $1,500 per lane, and for the MiSeq they are approximately $1,000 per lane. The HiSeq 2500 and MiSeq instruments currently retail for $740,000 and $125,000, respectively. The HiSeq platform has become the standard approach for shotgun metagenomic sequencing because of its increased read depth; however, the MiSeq has greater potential for use with 16S rRNA gene sequence studies, because it generates longer sequence reads, and its performance and cost are tractable to the needs of individual investigators (
13).
Until recently, the most significant problem with the Illumina platforms has been the ability to sequence samples with low genetic diversity, such as that commonly found with 16S rRNA gene amplicons. To artificially increase the genetic diversity, it has been common to mix in a control library of genomic DNA from the phage PhiX, such that 50% of the DNA was from PhiX. During the course of this study, Illumina upgraded their image analysis software to overcome this challenge, such that only 5 to 10% PhiX is needed to sufficiently increase the genetic diversity. Another factor that can affect data quality is the amount of DNA loaded onto the flow cell, as this affects the cluster density and the ability of the image analysis software to discriminate between clusters. In this study, we evaluate the effect of the new software on sequencing error rates with various cluster densities.
Previous studies involving 16S rRNA gene sequencing on the Illumina platforms have utilized two approaches. The first approach involves two PCR steps with different primer pairs (for an example, see reference
6). In the first PCR step, the two primers used contain an Illumina sequencing primer, an index (i.e., barcode) sequence, and the gene-specific primer. In the second PCR step, two primers are used that contain the Illumina adapter and sequencing primer sequence. Paired-end sequencing is performed using the built-in Illumina sequencing primers. This approach is limited, because it requires two rounds of PCR, increasing the risks of artifacts commonly observed with large numbers of PCR rounds, and because one must devote 20 to 25 nt to sequencing the index and gene-specific primer. The second approach emulates Illumina's TruSeq genomic library construction protocol, in which a single PCR is used (
14). In this approach, the primers contain the Illumina adapter sequence, an index sequence (only for the reverse primer), a 10-nt pad to prevent hairpin formation, a 2-nt linker that is noncomplementary to the 16S rRNA gene, and a gene-specific primer. Sequencing proceeds by (i) using the combined pad-linker-primer as the sequencing primer at the 5′ end to obtain a long read, (ii) using the reverse complement of the combined pad-linker-primer as the sequencing primer at the 3′ end to sequence the index region, and (iii) using the combined pad-linker-primer as the sequencing primer at the 3′ end to obtain a long read. With the 500-cycle reagents, this results in an index sequence and two 250-nt reads. A collection of 2,168 reverse primers with different indices has been published for the V4 region of the 16S rRNA gene (
14). Considering these methods were developed using previous Illumina platforms that could only generate 200 or 300 nt and current technology can obtain 500 nt, with anticipated further expansions in sequencing lengths, we sought to develop a dual-index, paired-read approach that could easily be adapted to other regions of the 16S rRNA gene or other genes. The advantage of such an approach is that with dual indices, one could replace the 2,168 previously proposed V4 primers with a total of 94 primers.
The development of bioinformatic solutions for curating sequences generated on the Illumina platforms has been limited. Several studies have insisted that extensive sequence curation and contig formation is unnecessary (
14,
15); however, these were largely focused on analyzing the beta-diversity between communities and taxonomic classification to the genus level. These approaches are limited because of the limitations of existing databases and the various levels of diversity across taxonomic lineages. Caporaso and colleagues (
14) have utilized a mapping procedure where reads are mapped to a reference database of V4 reads that are not more than 97% similar to each other; if a read is not more than 97% similar to a sequence in the database, it is culled. Although this is clearly a fast approach, a significant number of good reads may be rejected, and it requires the creation of very specialized databases for each region being sequenced. Such a strategy can be impossible should researchers attempt to adapt the sequencing strategy to poorly characterized genes. Others have attempted to use the Phred/Phrap quality scores associated with each base to trim sequence reads in combination with removing rare taxa (
16). Unfortunately, no error rates are provided following their sequence-trimming procedure, and removal of rare taxa could be problematic if one is interested in tracking rare populations. Finally, the only published attempt to develop a method of curating paired sequence reads has suggested allowing various numbers of mismatches between the overlapping sequence reads; however, again, final error rates were not provided (
6,
17). In the current study, we resequence a mock treatment community where we know the true 16S rRNA gene sequence to assess the effect of various trimming and sequence assembly methods on the overall error rates.
Here, we address several technical and bioinformatic challenges related to employing the MiSeq platform for sequencing of the 16S rRNA gene. First, the recent release of the Illumina MiSeq v. 2.0 platform provides 500 cycles that are typically applied by obtaining paired 250-nt sequences per fragment. This allowed us to determine whether the additional sequence length would allow one to sequence longer regions of the 16S rRNA gene fragment either as a single read or as paired reads. Furthermore, because the sequencing platform is constantly evolving to provide more and longer sequence reads, we developed a sequencing strategy that could easily be adapted when longer reads are possible and reduce the investment in buying large numbers of indexed primers. Second, we evaluated the prospects of sequencing metagenomic shotgun libraries in parallel to 16S rRNA gene amplicons for situations where deep sequencing coverage is not necessary. Third, we developed a sequence curation pipeline that results in a minimal number of sequence reads while producing sequences with error rates comparable to those we have previously observed with 454 data (
10). Finally, we reanalyzed a large set of samples that we previously analyzed using the 454 platform using our MiSeq-based approach and observed similar results (
18).
MATERIALS AND METHODS
Overall strategy and primer design.
Our dual-index paired-end sequencing approach is analogous to the single-index approach described elsewhere (
13,
14). As shown in
Fig. 1, each primer consists of the appropriate Illumina adapter, an 8-nt index sequence, a 10-nt pad sequence, a 2-nt linker, and the gene-specific primer. The index sequences were selected to be at least 2 nt different from all other indices in use, and when combined, they provide an equal intensity in the two light channels used by the sequencer (i.e., green channel [G/T] and red channel [A/C]). The index sequences were also at least 2 nt different from the indices that accompany the Nextera library construction kit. The 2-nt linker sequence was selected to not be complementary to the homologous positions in a large collection of 16S rRNA gene sequences (
19). Finally, the pad sequence was selected so that the combined pad, linker, and gene-specific primer sequences had an estimated melting temperature between 60 and 65°C. For the development of our method, we used gene-specific primers to amplify the V34, V4, and V45 regions from the bacterial 16S rRNA gene that have been described elsewhere (
14,
20). The complete primers were each 63 to 68 bp long.
Two sequence reads, two index reads, and three sequence primers were necessary to sequence each DNA fragment. The first sequence read (250 nt) was obtained using the Read 1 primer, which was the same as the sequence of the combined pad, linker, and gene-specific primer at the 5′ end of the region. The first index, located at the 3′ end of the fragment, was sequenced using the Index primer. The Index primer was the reverse complement of the combined pad linker and gene-specific primer sequence at the 3′ end of the region. After this index read, the platform flips the fragment. The second index read then was performed to obtain the index sequence at the 5′ end of the fragment using the adapter lawn on the surface of the sequencing flow cells. Finally, the second sequence read (250 nt) was obtained using the Read 2 primer, which was the same as the sequence of the combined pad linker, and gene-specific primer sequence at the 3′ end of the region. The overall process of cluster generation, sequencing, image processing, demultiplexing, and quality score calculation was performed on the MiSeq in approximately 40 h.
Community DNA.
In the initial studies to develop our sequencing approach, we utilized genomic DNA isolated from four communities that were then sequenced as three technical replicates. The first was termed a “mock community” composed of genomic DNA from 21 bacterial isolates. This mock community is similar to the one that we used previously to assess error rates in 454-generated sequence data: Acinetobacter baumannii ATCC 17978, Actinomyces odontolyticus ATCC 17982, Bacillus cereus ATCC 10987, Bacteroides vulgatus ATCC 8482, Clostridium beijerinckii ATCC 51743, Deinococcus radiodurans ATCC 13939, Enterococcus faecalis ATCC 47077, Escherichia coli ATCC 70096, Helicobacter pylori ATCC 700392, Lactobacillus gasseri ATCC 33323, Listeria monocytogenes ATCC BAA-679, Neisseria meningitidis ATCC BAA-335, Porphyromonas gingivalis ATCC 33277, Propionibacterium acnes DSM 16379, Pseudomonas aeruginosa ATCC 47085, Rhodobacter sphaeroides ATCC 17023, Staphylococcus aureus ATCC BAA-1718, Staphylococcus epidermidis ATCC 12228, Streptococcus agalactiae ATCC BAA-611, Streptococcus mutans ATCC 700610, and Streptococcus pneumoniae ATCC BAA-334. The genomic DNAs were pooled to have an equimolar concentration of 16S rRNA gene copies per genome with a final concentration of 5 ng/μl. Mock community DNA is available through BEI Resources (HM-278D v3.1). Genomic DNAs from the three other communities were obtained using the MO BIO PowerSoil DNA extraction kit with material from mouse and human feces and soil from a residential area. To demonstrate the ability to scale up our method and recapitulate previous results, we reused the DNA from a previous study in which DNA was isolated from mouse feces using the Roche MagnaPure DNA extraction kit. All fecal samples were obtained using protocols that were reviewed and approved by the University Committee on Use and Care of Animals and the Institutional Review Board at the University of Michigan.
Amplicon library construction and sequencing.
16S rRNA gene libraries were constructed using the primers described above to amplify the V34, V4, and V45 regions. Amplicons were generated using a high-fidelity polymerase (AccuPrime; Invitrogen) and then were purified using a magnetic bead capture kit (Ampure; Agencourt) and quantified using a fluorometric kit (QuantIT PicoGreen; Invitrogen). The purified amplicons were then pooled in equimolar concentrations using a SequalPrep plate normalization kit (Invitrogen), and the final concentration of the library was determined using a SYBR green quantitative PCR (qPCR) assay with primers specific to the Illumina adapters (Kappa). Libraries were mixed with Illumina-generated PhiX control libraries and our own genomic libraries and denatured using fresh NaOH. A detailed protocol with primer and index sequences is provided in the supplemental material. We performed 11 sequencing runs for this study. Tables S1 and S2 in the supplemental material provide results from seven sequencing runs performed using Real Time Analysis software (RTA), v. 1.16.18 and 1.17.22, MiSeq Control software (MCS), v. 2.0.5 and 2.1.13, various amounts of a PhiX genomic library control, and various cluster densities. Four sequencing runs were performed with RTA v. 1.17.28, MCS v. 2.2.0, a target of 5% PhiX, and various cluster densities (
Table 1).
Shotgun library construction.
DNA (2.5 ng) from the mock and human fecal communities used in our amplicon experiments, plus genomic DNA from two
Clostridium clostridioforme strains (D4 and CIP110249), were used to generate four shotgun libraries using a customized Nextera XT genomic library construction protocol (Illumina). Our customization involved increasing the amount of input DNA and the amount of transposon-based tagmentation. This modified protocol allowed us to reduce the number of PCR cycles to 10 instead of the recommended 12 while still obtaining a sufficient yield for sequencing. DNA concentration and the length of the fragments were assessed using an Agilent BioAnalyzer. The metagenomes were pooled in equal molar quantities, and the two genomes were pooled at one-half the concentration of the metagenomes. Metagenomic and genomic libraries were quality trimmed using Sickle (
https://github.com/najoshi/sickle), assembled using ABySS (
21), and filtered to remove contigs smaller than 500 bp.
Bioinformatic analysis and data availability.
All development was carried out using custom Perl and C++ software. The functions required to implement the overall analysis pipeline are available within the mothur software package (v. 1.30) and are illustrated on the mothur website (
http://www.mothur.org/wiki/MiSeq_SOP) (
22). Contigs between read pairs were assembled as described in this study, and any contigs with an ambiguous base (i.e., N) were culled, as were those sequences where there was no meaningful overlap between sequences. Sequences then were aligned to a reference alignment, and those sequences that did not align to the correct region were culled (
23–25). The ends of the sequences were trimmed so that the sequences all started and ended at the same alignment coordinates (
25). After identifying the unique sequences and their frequency in each sample, we utilized a preclustering algorithm to further denoise sequences within each sample (
10). The resulting sequences were screened for chimeras using UCHIME (
26). We then used a naive Bayesian classifier to classify each sequence against the Ribosomal Database Project (RDP) 16S rRNA gene training set (version 9) that was customized to include rRNA gene sequences from mitochondria and
Eukaryota. We required an 80% pseudobootstrap confidence score (
11). Those sequences that either did not classify to the level of kingdom or that classified as
Archaea,
Eukaryota, chloroplasts, or mitochondria were culled. Finally, sequences were split into groups corresponding to their taxonomy at the level of order and then assigned to operational taxonomic units (OTUs) at a 3% dissimilarity level; previous experience indicates that the OTU assignments by this approach are equivalent to not splitting the sequences by taxonomic order and has the advantage of parallelization and reduced memory usage (
27). Calculations of sequencing error rates and identification of chimeras based on the mock community data were performed as previously described (
10). The sequence data used in this study are available at
http://www.mothur.org/MiSeqDevelopmentData.html.