INTRODUCTION
Streptococcus pneumoniae is a common Gram-positive microorganism of the human nasopharynx that can cause severe invasive pneumococcal diseases (IPDs), such as bacteremia and meningitis. In this environment, exposure to other commensal bacteria, host immune responses, and sublethal antimicrobial levels increase selective pressures and favor the exchange of genetic material (
1–4). The pneumococcal genome also contains numerous insertion sequences and other mobile genetic elements that facilitate the uptake of antimicrobial resistance determinants through horizontal recombination events (
2,
5). Due to the ease with which
S. pneumoniae can acquire novel antimicrobial resistance determinants, prompt detection of their emergence, dissemination, and dynamics through surveillance systems is essential to public health. The implementation of pediatric vaccination programs has been successful not only in lowering general incidence of IPD, but also by targeting serotypes associated with antimicrobial resistance, achieving a concurrent decrease in overall antimicrobial resistance (
6,
7). The proliferation of antimicrobial-resistant isolates of nonvaccine serotypes, however, remains a major global health concern (
8,
9).
The antimicrobial resistance mechanisms of
S. pneumoniae have been extensively documented and with few exceptions can fully explain the observed antimicrobial phenotypes (
10,
11). β-Lactam resistance has been attributed to genetic changes in the transpeptidase domains of penicillin binding proteins (encoded by
pbp1a,
pbp2b, and
pbp2x), macrolide and lincosamide resistance to the presence of 23S rRNA methyltransferases (encoded by
ermB and
ermTR), the ABC-efflux pump (encoded by
mefAE), or the number of 23S rRNA alleles with point mutations in the peptidyltransferase loop of domain V, fluoroquinolone resistance due to genetic changes in quinolone resistance-determining regions (QRDR) of
gyrA and/or
parC, tetracycline resistance due to the presence of
tetM or
tetO, chloramphenicol resistance due to the presence of
cat, and trimethoprim-sulfamethoxazole resistance caused by genetic mutations in
folA and
folP (
10,
12–24).
Monitoring the dissemination and dynamics of antimicrobial resistance of
S. pneumoniae has traditionally relied upon
in vitro phenotypic susceptibility testing of bacterial cultures; however, the development of whole-genome sequencing and novel molecular-based techniques to determine antimicrobial resistance can increase efficiency and decrease the labor associated with screening and monitoring strains for surveillance purposes. A EUCAST subcommittee reviewed the existing data on the use of whole-genome sequencing for the prediction of antimicrobial resistance for a number of bacterial species (
25). The committee concluded that there was a lack of studies on the utility of whole-genome sequencing for the prediction of antimicrobial resistance in antimicrobials used to treat
S. pneumoniae infections. Multiple-variable linear regression modeling has been successfully used to accurately predict MICs of a variety of agents for
Neisseria gonorrhoeae (
26–28). In this study, we employ a computational statistical approach not only to directly determine antimicrobial MICs from molecular determinants, but also to determine the relative contribution of each determinant to the overall MIC and present simple mathematical equations that can be applied to determine penicillin, ceftriaxone, erythromycin, clarithromycin, clindamycin, levofloxacin, and trimethoprim-sulfamethoxazole MIC values for
S. pneumoniae.
RESULTS
Regression analysis indicated that the predicted MIC (MIC
pred) for penicillin depended upon any alteration of the PBP1a-STMK, PBP1a-TSQF, PBP2b-SSNT, PBP2b-QLQPT, and/or PBP2x-LKSG motifs and specific alterations of PBP2x-STMK→SAFK and PBP2x-KDA→EDT or KEA. No isolates were found in the training or validation data sets to have modifications of the PBP1a-SSN, PBP1a-KTG, PBP2b-SVVK, PBP2b-KTG, or PBP2x-SSN motifs (
Table 1).
Equation 1 shows a penicillin MIC regression model:
where each molecular determinant has a value of 1 if present or 0 if absent.
The molecular determinants having the largest effect on penicillin MIC were modifications to the PBP2x-STMK→SAFK, PBP2x-KDA→EDT, and PBP1a-STMK→SAMK/SSMK amino acid motifs producing an adjusted
R2 value of 0.893 (see Table S1 in the supplemental material). The PBP2x-KDA→EDT motif with two amino acid changes had a regression coefficient over two times that of the PBP2x-KDA→KEA motif, corresponding to a 2-fold increased contribution to the MIC
pred increment value. The accuracy of the resultant penicillin MIC
pred values calculated from the regression equation within 1 doubling dilution to the overall phenotypically determined MIC (MIC
pheno) values of the Canadian and U.S. data sets was 97.4% (1,748/1,794), with a sensitivity and specificity of 92.8% and 96.5%, respectively (
Table 2). There were four (4/1,354 [0.3%]) very major interpretative errors (VMEs; i.e., predicted susceptible but phenotypically resistant) seen in the U.S. data sets, with MIC
pred values of ≤0.03 mg/liter but MIC
pheno values ranging from 2 to 8 mg/liter. Alignments of
pbp1a,
pbp2b, and
pbp2x genes from these isolates did not reveal any other major nucleotide differences from other susceptible strains to explain the discrepancy.
Ceftriaxone MIC regression modeling resulted in fewer molecular determinants where there were no contributions to MICpred from PBP2b motif changes, with only the PBP1a-STMK, PBP2x-STMK, PBP2x-KDA, and PBP2x-LKSG motifs having an influence, with an adjusted R2 value of 0.72 (see Table S3 in the supplemental material). The PBP2x-STMK→SAFK motif had the greatest magnitude, with a regression coefficient of 2.7, twice that of modifications to next most influential motif, PBP1a-STMK, which had a regression coefficient of 1.3.
Equation 2 shows the ceftriaxone MIC regression model:
where each molecular determinant has a value of 1 if present or 0 if absent.
The MICpred for ceftriaxone had an overall accuracy of 98.2% to 1 doubling dilution of the MICpheno of the combined validation data sets. The specificity (measure of susceptibility) was 96.7%, and sensitivity (measure of resistance) was 97.2%. The relatively large number of minor (MI) errors for both penicillin (n = 121 [6.7%]) and ceftriaxone (n = 124 [6.9%]) were due to a large number of MICpred values within 1 doubling dilution of the intermediate resistance interpretative breakpoints and the very broad CLSI intermediate resistance interpretative breakpoint range for penicillin, covering 3 doubling dilutions from 0.125 to 1 mg/liter.
Equation 3 shows the erythromycin MIC regression model:
where
ermB and
mefAE molecular determinants have a value of 1 or 0 if present or absent, respectively, “23S rRNA-A2059G” and “23S rRNA-C2611T” are the number of alleles with the point mutation present, and “
mefAE promoter” and “
mefAE intergenic” have a value of 1 or 0, corresponding to the presence or absence of the −364T substitution or the 99-bp deletion in the intergenic region between
mefE and
mel, respectively.
Seven isolates within the validation data sets possessed a predicted dysfunctional ermB gene and were considered ermB negative for regression analysis calculations. Specifically, an ErmB-G41E amino acid substitution was found in five isolates, ErmB-G41K was detected in one isolate, and an ermB adenosine nucleotide deletion at position 629 creating a pseudogene was observed in two isolates (see supplemental validation Data Set S12 in the supplemental material).
The MICpred for erythromycin was greatly influenced by the presence of ermB (coefficient = 9.5), mefAE (coefficient = 5.5), and the A2059G point mutation of 23S rRNA (coefficient = 2.9 for each mutated allele), with lesser contributions from the C2611T 23S rRNA mutation, mefAE-346T mutation, and 99-bp intergenic deletion, to achieve an adjusted R2 value of 0.96 (see Table S4 in the supplemental material). Although the coefficient value for the 23S rRNA-A2059G mutation is relatively low compared to those of the ermB or mefAE determinants, when all four alleles carry the mutation, the magnitude of the coefficient increases considerably to 12. The erythromycin MICpred had an overall accuracy of 94.8% (1,215/1,281) within 1 doubling dilution of the erythromycin MICpheno of the validation data sets and 98% sensitivity and specificity.
Equation 4 shows the clarithromycin MIC regression model:
where
ermB and
mefAE molecular determinants have a value of 1 or 0 if present or absent, respectively, “23S rRNA-A2059G” and “23S rRNA-C2611T” are the number of alleles with the point mutation present, and “
mefAE promoter” and “
mefAE intergenic” have a value of 1 or 0, corresponding to the presence or absence of the −364T substitution or the 99-bp deletion in the intergenic region between
mefE and
mel, respectively.
The regression equation for the clarithromycin MICpred had determinant coefficients similar to those for erythromycin, with ermB, mefAE, and the A2059G 23S rRNA point mutations contributing the most to the overall MIC, with values of 10.8, 5.6, and 1.8, respectively, resulting in an adjusted R2 value of 0.98 (see Table S5 in the supplemental material). Overall MIC accuracy within 1 doubling dilution was 96.6% (424/439), with 100% sensitivity and specificity.
Equation 5 shows the clindamycin MIC regression model:
where the
ermB molecular determinant has a value of 1 if present or 0 if absent, and the value of “23S rRNA-A2059G” is the number of alleles with the point mutation present.
The regression model for clindamycin resistance included primarily the presence of ermB and a minor contribution from the 23S rRNA-A2059G mutation and had an adjusted R2 value of 0.97 (see Table S6 in the supplemental material). The C2611T 23S rRNA mutation did not contribute to the model, as only one isolate was present in the training data carrying the mutation in all four alleles, yet only having a MICpheno value of ≤0.125 mg/liter. This determinant was also rare in the validation data, present in two isolates of the USA-1 data set, for which the MICpheno values were not available. This minimal complement of resistance determinants for clindamycin reflected the distribution of MICpheno values observed in the training (see Table S12 in the supplemental material) and validation data sets (see Tables S26 and S27 in the supplemental material), where MICs are polarized with extremely low or very high values. In the USA-2 data set, which had a maximum clindamycin MICpheno value of 2 mg/liter, 0.74% of the isolates had MICpheno values between 0.25 and 2 mg/liter, and of the 439 isolates in the Canadian validation data set, where testing included dilutions up to ≥64 mg/liter, there were no MICpheno values in the range from 0.5 to 16 mg/liter observed. There was a 98.2% (1,164/1,186) overall accuracy between MICpred and MICpheno values, with over 98% specificity and sensitivity.
Equation 6 shows the levofloxacin MIC regression model:
where each molecular determinant has a value of 1 if present or 0 if absent.
The MICpred for levofloxacin was predominantly dependent upon GyrA amino acid mutations S81L (n = 2) and S81F (n = 16), having regression coefficients of 3.6 and 2.0, respectively. Other determinants contributing to a lesser extent to the overall MICpred values included the GyrA-S81Y mutations (n = 1) and any mutations at ParC-S79 or ParC-D83. The adjusted R2 value of 0.460 (see Table S7 in the supplemental material) was very low for the model, likely due to very small number of isolates with resistance determinants and corresponding MICpheno values of ≥8 mg/liter (n = 11) compared to the very large number of isolates with no determinants and MICpheno values of ≤2 mg/liter (n = 942) within the training data. Despite the low adjusted R2 value and only two phenotypically levofloxacin-resistant isolates in the validation data, the accuracy of MICpred values compared to MICpheno values within 1 doubling dilution approached 100% (1,185/1,186), with 100% sensitivity and specificity. One isolate was missing the open reading frame corresponding to the gyrA gene in the genome assembly.
Equation 7 shows the trimethoprim-sulfamethoxaxole MIC regression model:
where each molecular determinant has a value of 1 if present or 0 if absent.
Regression modeling for trimethoprim-sulfamethoxazole resistance indicated that disruption of FolP had a regression coefficient of 2.5, suggesting a larger influence on overall MICpred than the I100L FolA determinant, which had a coefficient of 1.6. The adjusted R2 value was 0.798 (see Table S8 in the supplemental material) for the model, and there was 98.8% overall accuracy of MICpred and MICpheno, with a specificity of 97.5% and sensitivity of 96.5%.
The absence or presence of a single molecular determinant for chloramphenicol, doxycycline, and tetracycline resistance was used to assign a MICpred value as less than or greater than the susceptible or resistant interpretation breakpoint, rather than using multiple-variable linear regression analysis. The presence of the cat gene was associated with a MICpred value of ≥8 mg/liter (see Table S15 in the supplemental material), which resulted in 99.8% accuracy with corresponding MICpheno values, with sensitivity, specificity, positive predictive value (PPV), and negative predicted value (NPV) of 53.3%, 100%, 100%, and 98.5%, respectively. The low sensitivity can be attributed to the very abrupt resistance breakpoint between susceptible and resistant at ≤4 mg/liter and ≥8 mg/liter resulting in a relatively large number of possible phenotyping errors that were phenotypically resistant strains (n = 21) without the cat gene among a relatively low overall number of phenotypically resistant isolates (n = 45). Similarly, tetracycline and doxycycline MICs were determined solely by the presence of tetM (tetO was not detected in this study), giving a MICpred value for tetracycline of ≥8 mg/liter and a value for doxycycline of ≥4 mg/liter (see Tables S16 and S17 in the supplemental material). The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for tetracycline MICpred were 97.6%, 96.2%, 98.1%, 94.1% and 98.8%, those for doxycycline were 98.4%, 95.1%, 99.5%, 95.1% and 99.5%, respectively.
DISCUSSION
Multiple-variable linear regression analysis is a relatively simple, yet powerful tool to determine the dynamics of a specific variables among a complex series of other factors. Determinants of interest affecting the MIC were identified for each antimicrobial by summarizing the phenotypic MIC values and molecular determinant profiles (
Table 1; see Tables S9 to S17 in the supplemental material), and the regression model was optimized by removing, combining, and adding back individual factors while examining the effect on the regression model metrics. This analytical strategy has been successfully used to validate predicted MICs from whole-genome sequence data of
N. gonorrhoeae (
26–28). In this study, multiple-variable linear regression modeling of molecular antimicrobial resistance determinants accurately predicted the MIC values for the β-lactam, macrolide, lincosamide, fluoroquinolone, and folate pathway inhibitor antimicrobials investigated. There was 98% (range, 94.2 to 100%) overall accuracy between the predicted and phenotypically derived MIC values, with 96% (range, 87.8 to 100%) sensitivity and 98% (range, 87.7 to 100%) specificity.
Resistance to β-lactam antimicrobials in
S. pneumoniae is associated with changes to the transpeptidase domains of penicillin binding proteins PBP1a, PBP2b, and PBP2x, with particular focus on three amino acid motifs in each protein (
24). It has been suggested that changes in PBP2b and PBP2x provide low-level resistance, while high level resistance is achieved with additional changes to PBP1a (
29). Although increased resistance to β-lactams caused by altered PBP amino acid motifs has been extensively reported, multiple-variable linear regression analysis identified another possible amino acid motif in each protein that may contribute to overall MIC levels and was able to predict the relative contribution of each mutation to the overall MIC. Multiple-variable linear regression modeling for predicting penicillin MICs identified any changes to two motifs of PBP1a and PBP2b and specific changes to three motifs of PBP2x as significantly contributing to the MIC. The ceftriaxone MIC regression model was simpler, lacking the PBP2b motifs as contributing factors. Penicillin and ceftriaxone models included any changes to the PBP1a-STMK motif and PBP2x-STMK→SAFK, PBP2x-KDA→EDT, and LKSG→VKSG specific motif changes. The regression coefficients for these shared determinants were similar for both penicillin and ceftriaxone MICs, except for PBP2x-STMK→SAFK, which had a 2-fold greater effect on ceftriaxone MICs, reflecting the importance of this mutation to overall resistance reported in other studies (
10,
24). The regression models for predicting β-lactam MICs had 98% accuracy to those derived phenotypically, with 0.1% major interpretative errors and 0.2% very major errors. The relatively large number of minor interpretative errors in both penicillin and ceftriaxone MICs could be due to a large number of MIC
pred values within 1 doubling dilution of the intermediate CLSI resistance interpretative breakpoints, which are very broad for penicillin, covering 3 doubling dilutions from 0.125 to 1 mg/liter. These findings are similar to those from a previous study, which used PBP allelic profiles as a PBP type library to associate phenotypic MICs with specific alleles, giving a similar 98% accuracy within 1 doubling dilution of the phenotypic MIC, with major and very major interpretative errors slightly larger at 3% and 2%, respectively (
18).
The greatest contributor to the macrolide and lincosamide MICs was the presence of
ermB, which had similar regression coefficients for erythromycin, clarithromycin, and clindamycin, corresponding to 9 to 11 doubling MIC increments. The
mefAE coefficients for erythromycin and clarithromycin were also similar, with increment values of about 5 for each antimicrobial, contributing about half as much as
ermB to the MIC
pred values. The 23S rRNA-A2059G point mutation had regression coefficient values of 3 for each mutated allele for erythromycin and 2 for clarithromycin, but contributed much less to the clindamycin MIC, with a coefficient of only 0.5. The C2611T 23S rRNA resistance determinant contributed less than the A2059G determinant, with a value of about 1 for clarithromycin and erythromycin MICs; however, the C2611T mutation was not identified as a significantly contributing factor to increased clindamycin MICs. The G761T mutation, 364 nucleotides upstream (−364T) from the
mefE start codon (
Fig. 1), had a 2-fold greater influence upon the erythromycin predicted MIC than that for clarithromycin and had a similar influence to the 99-bp intergenic deletion between
mefE and
mel. Although the −364T mutation may be a considerable distance from the
mefE start codon in the macrolide efflux genetic assembly to be located in the promoter region for
mefE, there are a number of ATG start codons upstream before
mefE, which may suggest that a small regulatory protein is located in this region. Accuracy with phenotypic MIC was best with clindamycin, with 98% accuracy, and both erythromycin and clarithromycin had accuracies of 95% and 97%, respectively, with sensitivities and specificities over 98% for all three antimicrobials. The USA-1 validation data set had an accuracy of 94% and a relatively low sensitivity of 91%, primarily due to 11 isolates with erythromycin MIC
pheno values of ≤0.5 mg/liter, despite having
mefAE or an intact
ermB as the sole resistance determinant, which should result in a MIC
pred value of ≥8 mg/liter, suggesting possible phenotyping errors. Conversely, there were 5 isolates in this data set that lacked any known molecular determinants but were phenotypically erythromycin resistant, having MICs of ≥1 mg/liter. Screening the discrepant genomes with additional molecular antimicrobial resistance determinant query tools ResFinder, ARG-ANNOT, and CARD (
30–32) confirmed the genotypes. The observed discrepancies between molecular determinant profiles and expected resistance phenotypes may be due phenotypic reading errors, contamination, mislabeling, DNA sequencing errors, or possibly novel resistance mechanisms.
A single isolate in the training data set, and no isolates in the validation data sets, possessed the ermTR resistance determinant combined with mefAE. The single ermTR-positive isolate had MICpheno values for erythromycin and clarithromycin of ≥256 mg/liter and ≥16 mg/liter, respectively, similar to other isolates having an ermB mefAE genotype. Additional data are required to perform adequate regression analysis for the ermTR resistance determinant; however, speculatively its contribution to overall MICpred may be similar to that of ermB.
Fluoroquinolone resistance has been attributed to the GyrA-S81 and ParC-S79, -D83, and -N91 amino acid substitutions (
10,
21). Regression modeling indicated that each of the three GyrA-S81F, -Y, and -L mutations had different contributions to the overall levofloxacin MIC, with the S81L mutation having about twice the effect of S81Y. Any mutation at ParC-D79 or -D83 significantly contributed the overall levofloxacin MIC
pred; however, a D91 mutation was found in only a single isolate of the training data set, with a MIC
pheno value of 0.5 mg/liter (susceptible interpretation) and therefore did not contribute significantly during the modeling process. Despite have the lowest adjusted
R2 value, the MIC
pred for levofloxacin had the best accuracy to MIC
pheno of all the antimicrobials analyzed, with accuracy to within 1 doubling dilution, all percentages of sensitivity and specificity of 100%, and no interpretive errors.
Molecular determinants for sulfamethoxazole-trimethoprim resistance include a simple and well-documented mechanism involving a FolA-I100L amino acid substitution and disruptions of
folP, with a resistant phenotype observed when both determinants are present and intermediate when only one is present (
10,
33,
34). MICs could be assigned simply as ≥4/76 mg/liter (resistant) when both determinants are present, 1/19 mg/liter or 2/38 mg/liter (intermediate) when only one is present, or ≤0.5/9.5 mg/liter (susceptible) when both determinants are wild type. A similar strategy can be used where the presence of
cat in an isolate correlates with a chloramphenicol MIC
pred value of ≥8 mg/liter (resistant CLSI breakpoint), or
tetM or
tetO with a tetracycline MIC
pred value of ≥16 mg/liter. Regression analysis, however, offers insight into the relative magnitudes of each determinant, and in the future, the specific nucleotide insertions of
folP may each be investigated to add further refinement to the predictive models to enhance the accuracy and precision of the MIC predictions. Comparison of the magnitudes of the regression coefficients indicates that the FolP disruptions (
Fig. 2; see Table S8 in the supplemental material) have a greater influence on MIC than the FolA-I100L mutation.
There was low variability of MIC accuracy values between the validation data sets, suggesting the regression equations are robust and may be applied broadly across testing sites. A regression model developed for penicillin MICs from PBP types described by Metcalf et al. (
10,
18,
35) as a simulated training data set (
http://www.cdc.gov/streplab/mic-tables.html) generated a regression equation very similar to that attained using the Canadian training data (see Tables S1 and S2 in the supplemental material). The accuracy and precision of the predicted MICs may continually be improved over time, with larger, broader, and more current training data to address consistency of sampling, culturing methods, laboratory testing procedures, interpretation of phenotypic results, geographical variation, and the discovery of novel resistance determinants. Despite some discrepancies, the comparison of MIC
pred to MIC
pheno compares favorably to comparison studies of purely phenotypic results. A summary of an interlaboratory quality control program for pneumococcal serotyping and antimicrobial susceptibility testing involving reference laboratories participating in the International Circumpolar Surveillance program had an 97% overall accuracy of tests within 1 doubling dilution of the modal MIC, with erythromycin and clindamycin accuracies of 92% and 89%, respectively (
36). Other quality assurance programs that have collated accuracy for phenotypic antimicrobial susceptibility testing included the Canadian National Gonococcal Antimicrobial Susceptibility Comparison Program (
37), where the average MIC accuracy ranged from 85.6% to 98.9%, and a 2018 comparison of international antimicrobial proficiency panel results from various Caribbean and South American countries (
38) reported an overall accuracy of >90% for some participants, while accuracy among other laboratories ranged from 60.0% to 82.4%.
Limitations of the study include that the accuracy and precision of the MIC prediction based on molecular determinants are largely limited by the training data used to generate the regression equations. The training data may include variability due to the subjective nature of phenotypic testing, where the same phenotypes may not always be observed on repeat testing, molecular resistance profile errors, and the possible presence of as-yet-unidentified resistance factors. Rare resistance determinants need to be present in the training data in sufficient quantities to generate meaningful statistics. While using a large training data set to develop the regression model can resolve some discrepancies, some rare resistance patterns, such as very high β-lactam resistance, are reliant on the availability of a relatively small number of isolates with this phenotype. Furthermore, there may also be some rare resistance determinants that were not present or were present in insufficient numbers to significantly influence the regression model, such as some of the reported PBP motifs, ermTR, or the 23S rRNA point mutations. These limitations can be reduced by increasing the size of the training data with isolates from varied regions of the world and regularly updating the regression models with newly discovered factors and updated coefficient values for currently identified factors. The MIC prediction models described here can be easily regenerated using the molecular markers discussed in this study with local training phenotypic data sets, which may be more applicable to individual laboratory testing environments. This approach also directly identifies the magnitude of antimicrobial resistance determinants specifically contributing to overall MIC without the need for continual curation of allelic databases that infer MIC values.
There is a need for surveillance systems that not only closely track the dissemination of known resistant strains, but also promptly detect novel antimicrobial resistant clones as they emerge to limit their expansion. Over the short term, molecular-based methods may primarily be used for surveillance purposes. As molecular-based genomic techniques become more comprehensive and broadly available to track lineages, antibiotic resistance, and virulence and fitness determinants, the MIC predicting strategy described here may provide a powerful tool to replace traditional phenotypic testing in clinical settings. Mathematical modeling to describe biological systems can fill a surveillance gap in an era of increased reliance on nucleic acid assay diagnostics to monitor the dynamics of S. pneumoniae, and the ability to acquire detailed antimicrobial resistance information directly from molecular information will enhance the monitoring of the dynamics of S. pneumoniae to effectively inform public health interventions to reduce the burden of disease.