NGS.
Next-generation sequencing (NGS) refers to high-throughput sequencing methods that parallelize the sequencing process, producing thousands or millions of sequences at once. The term is intentionally broad and encompasses several different sequencing technologies that have been adapted to high-throughput, low-cost sequencing. A thorough review and comparison of these methods has been published elsewhere (
150), but we summarize the key differences and applications of the major NGS approaches (
Table 3).
Pyrosequencing, licensed by 454 Life Sciences and later purchased by Roche, was the first next-generation sequencing method commercially marketed. Pyrosequencing employs a “sequence-by-synthesis” approach, meaning that it generates sequence data during DNA synthesis rather than analyzing nucleic acid amplicons postsynthesis as is the case with Sanger sequencing (
151,
152) (
Fig. 7). Amplified or chromosomal target nucleic acid is fragmented, and synthetic nucleic acid adaptors are enzymatically ligated to each end of the product. One adaptor mediates hybridization of the nucleic acid product to a microbead, and the other provides the sequencing primer site. Following PCR amplification of the target sequence, microbeads coated with amplicon are segregated into microwells. Each well contains all the reagents required for sequencing, including DNA polymerase, luciferase, ATP sulfurylase, and apyrase. Each of the four dNTPs is individually added and washed away from the wells in repeating cycles. When a complementary dNTP is added, it is incorporated by DNA polymerase, with the concomitant release of pyrophosphate as a by-product of DNA synthesis. ATP sulfurylase converts the released pyrophosphate to ATP, which is used to drive luciferase activity, resulting in the production of light (
153). Sequence data are generated by monitoring the microwell reactions for a pulse of light following addition of each dNTP. Since each microwell contains a single microbead harboring a unique region of chromosomal DNA, parallel sequencing of hundreds of regions of the chromosome achieves high sequence coverage in a single run. Additionally, because sequencing reactions are carried out in picoliter-volume reaction wells, this technology is capable of sequencing 400 to 600 megabases of DNA per 10-h run at a price per base up to 100-fold lower than that for Sanger sequencing (
154) (
Table 3). Pyrosequencing was initially capable of generating accurate reads of approximately 100 bases, with the limiting factor related to decreasing efficiency of apyrase in degrading unincorporated nucleotides in each successive cycle (
155). Replacement of apyrase with thorough washing to remove unused nucleotides can extend the effective read length to approximately 400 bases. This is still a relatively short read in comparison to that with the Sanger method, but it is significantly longer than those of other NGS methods (
155). An extended read length can be advantageous when attempting rapid whole-genome sequencing (WGS), especially when coupled with the speed of pyrosequencing technology and sophisticated software capable of assembling short individual reads into a confluent genome sequence. The overall accuracy of the sequence data generated is 99.51% to 99.96% (
156,
157). A potential drawback to pyrosequencing is the inability to generate reliable sequences of homopolymers of >4 bases in length (
156). In a study assessing the accuracy of sequences generated by pyrosequencing, 39% of errors were attributable to homopolymer sequences (
156).
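The relationship between light pulses and sequence described above can be illustrated with a minimal sketch. It assumes that each light signal has already been normalized so that a value near 1.0 corresponds to one incorporated nucleotide; the flow order and signal values are hypothetical, and real base callers use considerably more elaborate signal models.

```python
# Hypothetical nucleotide flow order; not taken from any instrument manual.
FLOW_ORDER = "TACG"

def decode_flowgram(signals, flow_order=FLOW_ORDER):
    """Convert per-flow light intensities into a base sequence.

    Assumes each signal is normalized so that 1.0 equals one incorporated
    nucleotide. Rounding the signal gives the homopolymer length.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]  # nucleotide added in this flow
        count = int(round(signal))              # number of bases incorporated
        bases.append(base * count)
    return "".join(bases)

# Illustrative signals for eight consecutive flows (T, A, C, G, T, A, C, G).
signals = [1.02, 0.05, 2.10, 0.96, 0.03, 1.08, 0.00, 3.05]
print(decode_flowgram(signals))  # TCCGAGGG
```

Because the homopolymer length is inferred from signal intensity alone, a small amount of noise on a long run changes the call: in this sketch a signal of 5.45 is read as five bases and 5.55 as six, which is consistent with the homopolymer errors noted above.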
Semiconductor sequencing, typified by the Ion Torrent system (ABI), is a similar “sequence-by-synthesis” technology. Parallel sequencing reactions are carried out in 1.2 million microwells on the surface of a low-cost semiconductor chip (
158). Each picoliter well contains template and DNA polymerase, to which each of the four nucleotides is added in sequential order; however, Ion Torrent sequencing differs from pyrosequencing in that it uses the release of hydrogen ions as the sole marker for determining the sequence (
Fig. 7) (
158). Release of hydrogen ions following incorporation of a complementary nucleotide is detected by a miniaturized ion sensor integrated into each reaction well. This technology is capable of generating up to 25 Mb of sequence data in a single run with a 2-h run time (
158). Independence from the use of multiple enzymes, sensitive optics, or modified nucleotides dramatically reduces the cost of reagents and equipment compared to those with Sanger or other NGS methods. The reported cost of an Ion Torrent instrument is approximately US$50,000, excluding sample preparation equipment and a server for data analysis (
159). The reported accuracy of semiconductor sequencing systems, including Ion Torrent, ranges from 98.4% to 98.9% (
158,
160) (
Table 3). The major limitations of this system are that it has difficulty enumerating long homopolymers (>6 nt in length) and has a read length of 50 to 100 nt, which is short relative to that of Sanger sequencing or pyrosequencing (
158).
Applications of pyrosequencing and semiconductor sequencing include whole-genome sequencing (WGS), amplicon sequencing, transcriptome sequencing, and metagenomics. The strength of pyrosequencing for WGS was demonstrated by Margulies et al., who sequenced the entire genome of
M. genitalium (580,096 bp) with >99.9% accuracy and 96% genome coverage in a single run (
157). More impressively, pyrosequencing was utilized to sequence the entire 6-gigabase human genome with 7.4× coverage in just 2 months (
154). Similarly, the whole genomes of
Escherichia coli and
Vibrio fischeri were sequenced with 96.8 to 99.9% coverage with 98.9% accuracy in a single run using Ion Torrent (
158). While the sequencing and assembly of an entire genome in days to months are remarkable, the most immediate use of NGS in clinical microbiology is likely amplicon sequencing. Amplicon sequencing is targeted to full sequencing of one or more genetic loci concurrently. This method is valuable when identification of multiple mutations or SNPs in a genetic locus is required to predict oncogenic potential or antimicrobial resistance. In addition to detection of multiple SNPs in a single locus, parallel sequencing offers the ability to generate sequences for multiple loci simultaneously. Next-generation sequencing is among the molecular technologies that can be applied to the identification of mycobacteria, including the prediction of resistance to antituberculosis therapies (
64). Determination of resistance to first-line antituberculosis drugs (rifampin [RIF], isoniazid [INH], pyrazinamide [PZA], and ethambutol [EMB]) requires the analysis of several SNPs located in 5 different genes (
161). SNPs associated with resistance to rifampin are relatively conserved, with 3 mutations accounting for up to 75% of resistance (
161). In this instance, routine probe-based amplification tests can be up to 98% sensitive (
162). However, SNPs resulting in resistance to other first-line antituberculosis drugs are considerably less conserved, rendering detection by a limited number of probes impractical. Pyrosequencing has been exploited for the simultaneous detection of resistance mutations in multiple genes to rapidly identify multidrug-resistant (MDR) strains of
M. tuberculosis (
163,
164). Resistance to rifampin, isoniazid, and fluoroquinolones was determined using 4 sequencing primers to identify multiple point mutations in
rpoB,
katG, and
gyrA, with sensitivities of 96.7%, 63.8%, and 70%, respectively. The specificity of the pyrosequencing reaction was reported to be 97.3% to 100% (
163). Variable sensitivity for predicting susceptibility to the 3 drugs reflects the lack of knowledge regarding the mutations and mechanisms which contribute to a resistant phenotype. This limitation is inherent to all molecular testing strategies and will be overcome only through continued research to characterize mutations conferring resistance and development of more complete reference libraries for sequence comparison.
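The dependence of amplicon-based resistance prediction on a reference library can be sketched as follows. This is an illustration only: the locus names, sequences, and catalog entries are hypothetical placeholders, and the sketch assumes reads that are already aligned to the reference and of equal length.

```python
def find_substitutions(reference, read):
    """Return (1-based position, reference base, observed base) for each mismatch."""
    if len(reference) != len(read):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return [
        (pos + 1, ref_base, obs_base)
        for pos, (ref_base, obs_base) in enumerate(zip(reference, read))
        if ref_base != obs_base
    ]

def flag_known_mutations(substitutions_by_locus, catalog):
    """Split observed substitutions into catalogued and uncharacterized ones."""
    known, unknown = [], []
    for locus, substitutions in substitutions_by_locus.items():
        for sub in substitutions:
            target = known if sub in catalog.get(locus, set()) else unknown
            target.append((locus, sub))
    return known, unknown

# Hypothetical reference amplicons, reads, and resistance catalog.
references = {"locus_1": "ACGTCGA", "locus_2": "GGCATTC"}
reads = {"locus_1": "ACGTTGA", "locus_2": "GGCATAC"}
catalog = {"locus_1": {(5, "C", "T")}}

observed = {name: find_substitutions(references[name], reads[name]) for name in references}
known, unknown = flag_known_mutations(observed, catalog)
print(known)    # [('locus_1', (5, 'C', 'T'))]
print(unknown)  # [('locus_2', (6, 'T', 'A'))]
```

A substitution that is absent from the catalog cannot be interpreted, which mirrors the point above that sensitivity is limited by how completely resistance mutations have been characterized.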
Analogous to the use of NGS methods to sequence multiple targets in a single organism is the utility of NGS to simultaneously sequence and identify multiple organisms in a single specimen. Many of these studies, known as metagenomics, have been conducted to characterize complex bacterial communities in environmental specimens. Clinically, NGS has been used to characterize the microbial community present in the airways of patients with cystic fibrosis (CF) using sputum specimens (
149,
165). An advantage of NGS is the detection of nonculturable or fastidious organisms that may be outcompeted and overlooked in routine CF cultures. In a cohort of 66 sputum specimens from CF patients, NGS identified 122 different microbial species, compared to only 18 identified by culture (
149). In an analytic study, organisms representing as little as 0.25% of the total nucleic acid template in a specimen were reproducibly identified (
149). This ability to better define the microbiological components of the CF lung could aid in a better understanding of the associated illness and inform therapeutic strategies. A potential drawback to this type of metagenomic study is the semiquantitative nature of NGS, which prevents an accurate assessment of the proportion of each organism present at a single time point or of changes in the composition of microorganisms in serial specimens. Similarly, the presence of nucleic acid is not necessarily indicative of a viable organism and may represent residual nucleic acid from flora or exogenous sources entering the upper respiratory tract.
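As a rough illustration of how read counts relate to the 0.25% detection level mentioned above, the sketch below converts per-taxon read counts into fractions of the total and lists the taxa at or above a chosen threshold. The read counts are hypothetical, and, given the semiquantitative nature of NGS, the fractions should be read as approximate indicators rather than true organism proportions.

```python
def relative_abundance(read_counts):
    """Return each taxon's share of all classified reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no classified reads")
    return {taxon: count / total for taxon, count in read_counts.items()}

def taxa_above_threshold(read_counts, min_fraction=0.0025):
    """List taxa whose read fraction is at least min_fraction, most abundant first.

    The default of 0.0025 corresponds to the 0.25% level cited in the text.
    """
    fractions = relative_abundance(read_counts)
    detected = [taxon for taxon, frac in fractions.items() if frac >= min_fraction]
    return sorted(detected, key=lambda taxon: fractions[taxon], reverse=True)

# Hypothetical read counts for a single sputum specimen (10,000 reads total).
counts = {
    "Pseudomonas aeruginosa": 7200,
    "Staphylococcus aureus": 2500,
    "Prevotella sp.": 280,
    "Rare taxon": 20,
}
print(taxa_above_threshold(counts))
# ['Pseudomonas aeruginosa', 'Staphylococcus aureus', 'Prevotella sp.']
```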
A final application of pyrosequencing and semiconductor sequencing is in epidemiological investigation of outbreaks. Most notably, Mellmann et al. used the Ion Torrent NGS to identify and characterize a novel strain of enterohemorrhagic
E. coli (EHEC) responsible for a large outbreak in Germany in 2011 (
166). Whole-genome sequencing of 4 isolates from geographically distinct cities along with relevant historical reference strains was conducted. Sequencing and analysis of the strains were completed in 2 to 3 days and enabled near-real-time phylogenetic linkage of these strains (
166). Investigators were also able to propose a likely evolutionary pathway linking the outbreak strain to a progenitor strain identified 20 years earlier. In a smaller study, investigators examined 33 multidrug-resistant isolates of
E. coli obtained from patients in a neonatal intensive care unit using Ion Torrent NGS (
167). The authors reported a 5-day turnaround and a cost of US$300 per isolate for whole-genome sequencing. Sequencing resulted in 88% to 89% genome coverage, which was sufficient to link all strains phylogenetically and identify them as most closely related to multiresistant strains of the ST-131 multilocus sequence type (MLST). While approximately twice the cost of traditional strain typing using pulsed-field gel electrophoresis (PFGE) or MLST, NGS provided additional useful information, including the specific identification of the
blaCTX-M-15 extended-spectrum beta-lactamase (ESBL) gene and the presence of other genes and point mutations associated with resistance to several classes of antimicrobials (
167).
The term “ultradeep sequencing” (UDS) refers to amplicon sequencing designed to allow mutations to be detected at extremely low levels in a population. Initial PCR amplification of a genetic region of interest followed by segregation of each amplicon into a separate reaction well allows sequencing and identification of rare sequence variants. For example, ultradeep sequencing has been successfully used to detect HIV quasispecies and the emergence of resistant subpopulations. Analysis of blood samples from HIV-infected patients using pyrosequencing identified strains with mutations in the viral reverse transcriptase gene at levels of <0.1% of the total viral population (
168,
169). Similarly, Ion Torrent sequencing was utilized to identify the emergence of mutations conferring resistance to nonnucleoside reverse transcriptase inhibitors (NNRTIs) and protease inhibitors (PIs) at a level of <1% of the total population through an average of 13,700× coverage of the
gag-pol loci, though it was noted that coverage decreased significantly in a homopolymeric region containing five consecutive guanine residues (
170). By comparison, routine Sanger methods have a limit of detection of approximately 20 to 35% of the population (
171). Accurate sequence data with error rates of <0.05% are easily achieved due to the high number of parallel reads, which provide highly redundant coverage of the target sequence (
168,
169). Pretherapy resistance testing has been recommended to identify quasispecies with mutations known to confer resistance to antiretrovirals and is also recommended following a rise in HIV load attributed to therapy failure (
172). Early detection of mutant alleles present at a low frequency is key in selection of antiretroviral therapy, since discontinuation of a specific antiviral can result in reversion of mutant populations to a susceptible, pretherapy genotype (
168,
173). Future clinical applications of pyrosequencing include transcriptome sequencing, which aims to efficiently create RNA profiles and examine the effects of mRNA transcript expression. The majority of research using NGS for transcriptome analysis has involved the basic sciences; however, recent studies have utilized this method for comparison of mRNA expression in normal and malignant cell populations and for discovery of latent or cryptic viruses whose presence and expression may be associated with malignancies (
174–176).
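The link between coverage depth and detection of low-frequency variants can be made concrete with a simple binomial sketch. It assumes that each read samples the viral population independently and ignores sequencing error, so it is an idealized estimate; the minimum read count used to call a variant is an arbitrary illustrative choice.

```python
from math import comb

def expected_variant_reads(depth, variant_fraction):
    """Mean number of reads carrying the variant at a given coverage depth."""
    return depth * variant_fraction

def prob_at_least(depth, variant_fraction, min_reads):
    """Binomial probability of observing at least min_reads variant reads.

    Assumes independent sampling of reads and no sequencing error.
    """
    p_fewer = sum(
        comb(depth, k) * variant_fraction**k * (1 - variant_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer

# A variant present in 1% of the population at the 13,700x coverage cited above.
print(expected_variant_reads(13700, 0.01))   # about 137 variant reads expected
print(prob_at_least(13700, 0.01, 10))        # very close to 1

# The same variant at only 100x coverage is missed entirely about a third of the time.
print(prob_at_least(100, 0.01, 1))           # about 0.63
```

In practice the calling threshold also has to sit above the background of sequencing errors, which is why the redundancy provided by many parallel reads matters.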
Other NGS platforms such as Illumina and SOLiD are capable of generating 1.5 to 4.0 Gb of data per single run at a cost of less than $0.10 per kilobase, which is significantly less expensive than Sanger or other NGS methods (
177,
178). Illumina (Solexa) sequencing is based on reversible dye terminators. DNA molecules are first attached to primers on a slide and amplified so that local clonal colonies are formed. Four types of reversible terminator bases are added, and nonincorporated nucleotides are washed away. A camera takes images of the fluorescently labeled nucleotides, and then the dye along with the terminal 3′ blocker is chemically removed from the DNA, allowing the next cycle. In contrast, SOLiD (supported oligonucleotide ligation and detection) is a method of sequencing by ligation. A target-specific sequencing primer is used to initiate sequencing by the sequential addition of octamer probes, each containing 2 specific nucleotides at the 5′ terminus followed by 6 degenerate nucleotides. Each of the 16 possible combinations of two nucleotides is represented, and octamers are fluorescently labeled with one of 4 fluorophores. The 16 octamers are then grouped into 4 sets (each containing one each of the 4 fluorophores) and are added sequentially to the sequencing reaction mixture for 7 full cycles of the 4 groups. Fluorescence is measured after addition of each 4-member group of probes, and the 2-base sequence is determined by the fluorophore detected. Gaps in the sequence corresponding to the 6 degenerate nucleotides in each probe are filled in by repeating the reaction using additional sequencing primers, each offset by one nucleotide (
n − 1,
n − 2, etc.) from the initial primer (
177). This results in short reads (26 nucleotides); however, the sequencing error rate is reduced to 0.001 because each nucleotide in the template is read twice (
177). The disadvantage of this technology is turnaround time. The run time for a single sequencing reaction is 2.5 to 6 days, resulting in turnaround time for a full genome sequence of up to 2 weeks (
178). Because of the large amount of sequence data generated per run, low cost per base sequenced, and extended turnaround time, these platforms are currently best suited to whole-genome sequencing projects rather than rapid identification of microorganisms or SNPs in a clinical laboratory. Most recently, Illumina has begun offering full genome sequencing through its reference laboratory at a reported cost of $4,000 per genome.
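The two-base encoding used in sequencing by ligation, in which 16 dinucleotides share 4 fluorophores and each template nucleotide is read twice, can be sketched as below. The specific mapping of dinucleotides to colors used here (an exclusive-or of 2-bit base codes) is the commonly described color-space scheme and is an assumption of this sketch, not something specified in the text; decoding also assumes that the first base is known.

```python
# Assumed 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}

def encode_colors(sequence):
    """Translate a base sequence into colors, one per overlapping dinucleotide."""
    return [
        BASE_TO_CODE[first] ^ BASE_TO_CODE[second]
        for first, second in zip(sequence, sequence[1:])
    ]

def decode_colors(first_base, colors):
    """Recover the base sequence from a known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        previous = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[previous ^ color])
    return "".join(bases)

reference = "ATGGC"
colors = encode_colors(reference)
print(colors)                       # [3, 1, 0, 3]
print(decode_colors("A", colors))   # ATGGC

# A true single-base substitution (G to C at position 3) changes two adjacent colors.
print(encode_colors("ATCGC"))       # [3, 2, 3, 3]
```

Because every base contributes to two overlapping dinucleotides, a real substitution alters two neighboring colors, whereas an isolated single-color change is more likely a measurement error. This is one way to understand the low error rate attributed above to reading each nucleotide twice.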