A TraDIS gene essentiality compendium encompassing five genera of the Enterobacteriaceae
To investigate gene essentiality in the Enterobacteriaceae, we assembled a collection of new and previously published data that cover five medically and scientifically important genera of the Enterobacteriaceae:
Escherichia,
Salmonella,
Citrobacter,
Klebsiella, and
Enterobacter (
Fig. 1B and C). This includes eight TraDIS libraries that have not been previously reported (see Methods), and all have been constructed in rich media with growth at 37°C. The selected
Escherichia strains include the parent strain of the Keio collection (
16),
Escherichia coli BW25113 (
31), as well as two sequence type 131 (ST131) strains associated with urinary tract infections, NCTC 13441 (
32) and EC958 (
33). Our
Salmonella data include data from the CVD908-
htrA vaccine derivative of
Salmonella enterica serovar Typhi Ty2 (
25), the common
S. Typhimurium lab strain SL1344 and its vaccine derivative SL3261 (
18), the
S. Enteritidis PT4 strain P125109 (
34) representative of the global gastroenteritis epidemic, and
S. Typhimurium D23580 (
35) and A130, representatives of the two major clades involved in the HIV-associated invasive non-typhoidal salmonellosis epidemic in sub-Saharan Africa (
36). We also have included
Citrobacter rodentium ICC168 (
37), a model mouse pathogen that carries the type III secretion system encoding locus of enterocyte effacement (LEE) pathogenicity island in common with the enteropathogenic (EPEC) and enterohemorrhagic (EHEC) pathovars of
E. coli (
38). We have also included two strains of
Klebsiella pneumoniae, a major cause of multi-drug-resistant nosocomial infections. These are Ecl8, a genetically manipulable model strain (
39), and RH201207, a multi-drug-resistant clinical isolate (
40). Finally, we included a representative of the opportunistic pathogen
Enterobacter cloacae, strain NCTC 9394.
Assessing essentiality classification
Several methods have been used for evaluating the essentiality of genes using transposon insertion data. These use different features that report on gene essentiality, which can include features of the sequenced libraries themselves, such as the number of insertion sites divided by gene length [the so-called insertion index (
25)], or comparisons of experimentally determined transposon read counts against synthetic reference libraries (
41). Freed and colleagues assessed 11 of these features, finding the insertion index, the mean distance between insertions, and the largest uninterrupted fraction of gene length were most predictive of essentiality (
23).
Here we evaluated four of these measures: the average distance between insertion sites in a gene, the largest uninterrupted fraction, the insertion index, and the Monte Carlo method combined with DESeq proposed by Turner and colleagues (
41) (
Fig. 2A). In addition, we constructed a predictor using all four measures of gene essentiality using principal component analysis (PCA), taking the weighted sum given by the first principal component as our combined measure. To evaluate the accuracy, we compared essential genes predicted from an
E. coli K-12 BW25113 transposon library (
31) with each method to curate essential genes in the EcoGene database (
42). EcoGene is a refinement of essential genes determined by directed single-gene knockouts in the Keio collection (
16). The four measures performed similarly, as evaluated by the area under the receiver operating characteristic curve (AUC). Our PCA-based predictors lead to a modest increase in performance, indicating that these measures are not completely redundant. However, on average, this improvement only led to an average increase of approximately 1.2 correctly determined essential genes across our strain collection at a false-positive rate of 0.05 (Table S1). Given this modest improvement and the relative ease of calculation, and particularly the straightforward interpretability of the insertion index compared to PCA, we have used the insertion index for further work.
A second issue in building an essential gene classifier is choosing a threshold for essentiality. In previous work, gamma distributions have been fit to the bimodal distribution observed in the insertion index (
Fig. 2B), and a log-likelihood ratio was used to determine the essentiality cut-off (
25,
26). This method is heuristic, based on the separation of the two-component distributions and maximum likelihood fits. We have observed that this, on occasion, can lead to poor fits and instability in the classification, which can require manual intervention to correct. To make this procedure more robust, we have adopted a non-parametric clustering approach using DBSCAN (
27). This leads to consistent improvements in classification accuracy compared to automated gamma fitting (
26) as measured by the Matthews correlation coefficient across all our data sets using EcoGene essential genes as a gold standard (Fig. S1).
The final component in determining gene essentiality is the transposon library itself; insufficient density will lead to genes being devoid of insertion by chance. Often, density is reported in terms of the number of insertions divided by genome length. However, this can be misleading, as insertion density is clearly non-uniform. To address this, we have constructed a series of five increasingly dense
Tn5 libraries in
S. Typhimurium SL1344, with an estimated 50K, 100K, 250K, 500K, and 1,000K mutants based on transformation efficiency control plates. After sequencing and read-mapping, we recovered 85,477, 92,145, 295,854, 344,380, and 550,657 unique insertion sites, respectively. We then computationally subsampled these insertion libraries and predicted gene essentiality using the insertion index and DBSCAN, to plot both false-positive and true-positive rates for a full range of library complexities (
Fig. 2C and D). These curves show that classifier performance stabilizes at around one transposon mutant per 35 nucleotides of genomic sequence, providing a clear minimum density for accurate prediction of gene essentiality with
Tn5 mutagenesis. All of the data sets included in our study meet this threshold.
Evaluating and correcting for biases in Tn5 insertion
A number of studies have reported genomic biases in Tn
5 insertion (
17,
18,
25,
43–47). However, the findings of different studies are contradictory, and it remains unclear whether these biases impact the quality of gene essentiality predictions. Using our combined data set, we examined the evidence and impact of four potential biases: positional bias within genes and genomes, and compositional bias in the form of preferred sequence motifs and G + C content.
Within genes, it has been noted that the effect of transposon insertions at the start and end of the sequence can diverge from that observed across the entire gene (
15). To investigate this in our own data, we calculated local averaged insertion indices across percentiles of essential genes (
Fig. 3A). There was a clear enrichment in transposon insertions in the first 5% and last 20% of the essential gene sequences. Insertions early in essential genes may be due to either biological alternative start codons, or possibly more likely, the bias in gene prediction algorithms toward longer coding sequences resulting in incorrect start-codon annotations (
48). The enrichment of insertions at the 3′ end of essential genes may indicate that functional proteins can frequently be produced from truncated transcripts, as can be seen in the specific case of RNase E in
K. pneumoniae RH201207 (
Fig. 3B).
Within genomes, positional effects in transposon density have been observed with differences in density between the origin and terminus of replication (
44). These differences are presumably due to the dosage effect of multiple copies of DNA near the origin of replication during exponential growth (
18,
50). We observed this phenomenon to varying degrees in our data (
Fig. 3C; Fig. S2); this variation is likely due in part to individual optimization of the transformation protocol during library creation, leading to different bacterial libraries being created in slightly different growth phases. In some genomes (e.g.,
E. cloacae NCTC9394), positional bias in transposon density did not clearly correlate with distance to the origin (Fig. S2). This may be indicative of errors in genome scaffolding and contiguation, as the genome sequence remains unclosed.
We next examined the existence of preferred sequence motifs for Tn
5. When the Tn
5 transposon inserts into DNA, a region of 9 nucleotides is duplicated on each side of the transposon (
43). An early study of Tn
5 insertion based on circa 100 example insertions suggested a preferred A-GNTYWRANC-T motif (
43), and this was later supported and extended by transposon sequencing in
Salmonella (
17). Other higher-throughput studies have failed to find a clear motif but have suggested a preference for elevated G + C content regions (
45,
47). To address the presence of a preferred motif, we generated a sequence logo from the 10 nucleotides flanking each insertion site for the 100 most frequent insertion sites in each genome (
Fig. 3D). The nucleotide frequency observed is broadly consistent with the sequence motif proposed by Goryshin and Canals; however, the information content across the duplicated region was consistently low, suggesting this motif is not a major constraint on transposon insertion. Similarly, examining the relationship between G + C content and insertion density failed to give consistent results across genomes. Lower G + C genes tended to have slightly elevated insertion densities (
Fig. 3E), but this may be an artifact of the small number of data points below the ~50% to 60% G + C range and the association of low G + C content with horizontally acquired DNA in enterobacterial species (
27).
Correcting for the three biases most easily adjusted for (positional bias within genes, positional bias within the genome, and G + C content bias) each led to modest improvements in classifier performance. Combining the three corrections improved performance as measured by the AUC by ~1%. However, this did lead to a substantial increase in the true-positive rate in the critical low false positive range of the ROC curve (Fig. S3).
Core and ancestral gene essentiality in the Enterobacteriaceae
Having investigated the factors affecting the accuracy of gene essentiality calls, we turned to the investigation of essential genes in our TraDIS collection. We called essential genes in all 13 strains using the insertion index after correcting for positional bias within genes, positional bias within genomes, and G + C content bias using our DBSCAN-based classification. We then defined an “essentiality score” as the log2 transformation of the ratio of the insertion index to the DBSCAN cut-off. Intuitively, this results in a score where values less than or equal to 0 indicate essentiality and the values indicate twofold changes from this threshold. Genes with an insertion index of 0 were arbitrarily recoded to −6.5, as there were no observed scores lower than this number, and log2(0) is undefined.
This resulted in between ~300 and ~440 genes classified as essential in each strain (
Fig. 4A). This is comparable to previous results in the literature applying transposon insertion sequencing to the study of gene essentiality (
51), with larger essential gene sets likely reflecting the difficulty in segregating truly essential genes from those with strong growth effects in high-throughput studies. To compare essential genes between strains, we defined groups of orthologous proteins using Hieranoid (
52) and a phylogenetic tree constructed using RaXML (
28) from concatenated core gene alignments produced by PhyloSift (
29) (see Methods). On manual examination, we found that a number of essential genes were absent in the genome sequence of
Enterobacter cloacae NCTC 9394 due to an incomplete assembly, and this strain was excluded from further analyses.
As a gold standard comparator, we included a curated collection of
E. coli K-12 essential genes from the EcoGene database (
42). Unfortunately, this resource is no longer maintained, but an archival copy is available in the Internet Archive [
https://web.archive.org/web/20170228010715/http://www.ecogene.org/?q=topic/5], and we additionally provide the gene list in
Table S2. This list curates the Keio collection of directed gene deletions (
16) with observations from later studies that have investigated technical artifacts in the Keio collection and contains 299 essential genes. Major differences from the Keio collection are the inclusion of 15 genes not considered essential by Keio which were subsequently shown to be artifacts of gene duplication, namely
glyS,
ileS,
alaS,
coaA,
coaE,
dnaG,
glmM,
groL,
parC,
prfB,
polA,
rho,
rpoD,
rpsU, and
lptB. Genes considered essential in the original Keio paper but later shown to be non-essential include the ribosomal proteins
rpsI,
rpsM, and
rpsQ, mutants of which grow extremely slowly (
53), and
rnc encoding RNase III, whose deletion is thought to have a polar effect on the downstream essential gene
era with which it is co-transcribed (
54).
Our initial expectation was that differences in gene essentiality would largely be concordant with the phylogeny, that is, we expected to find groups of genus-specific essential genes. However, this was not the case (
Fig. 4A). To investigate further, we took multiple approaches to exploring the contents of our essential gene collection. As a first approach, we defined the core essential genes in our genomes as the intersection of essential genes across all TraDIS libraries and the EcoGene list. This resulted in a set of 201 genes essential across all strains in our collection. As this gene set excluded a number of essential functions, we also applied Fitch’s algorithm (
55) as a less conservative approach to reconstruct the set of genes essential in the ancestor of all strains included in our collection resulting in 296 ancestrally essential genes (
Fig. 4A and B). Fitch’s algorithm is based on parsimony and attempts to reconstruct the ancestral essentiality state by minimizing the number of transitions between essentiality and non-essentiality given the observed data. To check that our choice of reconstruction method did not majorly bias our results, we also used PastML, a likelihood-based method, to infer ancestral essentiality (
56). This yielded similar but slightly more conservative results with fewer genes classified as ancestrally essential. All genes except for one (272 out of 273) that were classified as ancestrally essential using PastML were also considered ancestrally essential using Fitch’s reconstruction. Conversely, 24 genes that were not classified as ancestral using PastML were considered ancestral using Fitch’s method (272 out of 296, Fig. S4 and S5). Core and ancestrally essential genes are summarized in
Table 1.
We found it somewhat surprising that despite the range in the number of genes predicted to be essential across our collection, the ancestral essential gene set was of similar size to that in the EcoGene gold standard. We examined the intersection of the core, ancestral, and EcoGene gene sets in an attempt to understand the differences between them (
Fig. 4B). By design, the core essential genes are a proper subset of the ancestral and EcoGene sets, and contain ~2/3rds of the genes essential in either of these sets. Fifty-five genes occur in both the ancestral and EcoGene sets but are not core. Manual inspection led us to conclude that two of these were excluded from the core gene set due to annotation errors: strains D23580 and A130 lacked annotations for
rpmC, though the gene sequence appears to exist at the same locus as other
Salmonella strains. Similarly, the A130 annotation appears to contain a truncated annotation for
accB, which was not properly clustered by Hieranoid. The gene
cohE, encoding a phage repressor protein, was essential and present in
E. coli and
K. pneumoniae strains but not others, suggesting independent horizontal acquisition of similar prophage in the two lineages and misclassification as ancestral. Twenty-seven of the remaining genes were predicted as essential in most strains (
Table 1), with only one or two strains scoring slightly above (<twofold higher) the essentiality threshold, indicating that most of these genes are likely either core essential or at least cause severe fitness defects in all strains surveyed.
This analysis left 24 genes that were ancestrally essential and essential in EcoGene, but not classified as core essential, where at least one strain had a score greater than twofold higher than the essentiality threshold (
Fig. 4C). Some of these had clear-cut explanations. For instance,
folA, encoding dihydrofolate reductase, is the target of the antibiotic trimethoprim commonly used in the treatment of urinary tract infections.
folA non-essentiality could be explained in all cases by the acquisition of alternative trimethoprim-resistant dihydrofolate reductase enzymes either in the plasmid or chromosome. Another explained example is
cysS, encoding a cysteine-tRNA ligase uniquely non-essential in
S. Typhimurium D23580, which was previously shown to be displaced by a plasmid-borne copy in this strain (
57). Most others were less easily explained. For instance,
ispD,
ispA, and
hemL in the isoprenoid and heme biosynthesis pathways were called as clearly non-essential in multiple strains. The gene
ispD in particular, encoding a key cytidylyltransferase in isoprenoid biosynthesis and a target of fosmidomycin (
58), was non-essential in both ST131 UPEC strains. These are particularly interesting as the methyl-D-erythritol (MEP) pathway is not present in metazoans and hence has served as a target for therapeutic intervention (
59).
Evidence for differences in essentiality among the Enterobacteriaceae
Forty-three genes were essential in the EcoGene list, but are neither core nor ancestral (
Fig. 4B), constituting almost 15% of essential genes in
E. coli. This is of particular importance, as the essential genes of
E. coli are often taken as representative of related species. Fourteen were variably present in our strain collection, and a third of these encoded the antitoxin components of toxin-antitoxin systems (
chpS,
mazE,
mqsA,
yafN,
yefM). Phage repressors were also in this set of essential, variably present genes as seen in previous comparisons of gene essentiality between strains (
18), notably including
dicA encoding the c2 repressor in the cryptic Qin prophage (
60) that controls expression of the small protein DicB and small RNA DicF that affect cell division and phage susceptibility (
61). Others include the
trpS tryptophan tRNA-ligase, of which an ancestral second copy appears to have been lost in the lineage leading to
E. coli and
C. rodentium.
Approximately 60% (27/43) of the genes essential in the EcoGene list but not core or ancestrally essential were also non-essential in our
E. coli BW25113 data set. We cross-compared our results to an independently generated and analyzed
E. coli BW25113 TraDIS library (
62) and found that with the exception of two genes very close to the essentiality decision boundary (
yrfF and
can), these results agreed with ours in all cases. Some of these disagreements with EcoGene could be explained by domain essentiality within a gene otherwise tolerant of insertions on visual examination. For instance,
rne,
polA,
yejM, and
spoT all appeared to contain essential domains across multiple strains and were additionally found to be essential across a collection of
E. coli strains in a recent CRISPRi screen (
21). However, the reasons behind the majority of these disagreements between EcoGene and TraDIS data remain unclear, and many of these genes remain poorly characterized.
One example of a set of genes essential in EcoGene and not core or ancestrally essential is associated with
rpoE, which encodes the extracytoplasmic stress-responsive sigma factor σ
E (
Fig. 4C). While σ
E has long been known to be essential in
E. coli (
63), and suppressor mutants are easily obtained (
64), the exact reason why it is essential remains unclear, though may have to do with the accumulation of misfolded outer membrane proteins (OMPs) in the deletion strain (
65). For activation, σ
E requires release from the anti-sigma factor RseA
via the activity of the DegS and RseP proteases. All three genes
rpoE,
degS, and
rseP were essential in our BW25113 data in agreement with EcoGene; however, none of these genes were essential in either of the ST131
E. coli strains included in our collection. These genes have also previously been reported to be essential in
S. Typhi CVD908-htrA (
18), confirmed both here and by an independent study in
S. Typhi Ty2 (
17). We also find all three to be essential in our clinical isolate of
K. pneumoniae RH201207. Why this system is essential in three phylogenetically separated strains and not others is unclear. It has been speculated that σ
E essentiality may depend on the presence of
ydcQ/hicB (
64), though this was later shown to be the result of HicB toxin activation (
66) and neither
S. Typhi CVD908-htrA nor
K. pneumoniae RH201207 encode the HicAB toxin-antitoxin system. The OMP content of Enterobacterial membranes can vary widely between strains, so this could possibly be a driving factor in this differential essentiality if the accumulation of misfolded OMPs plays a role in the differential requirement for σ
E. Whatever the ultimate reasons for σ
E essentiality in these strains, it is clear that this trait has clearly arisen independently multiple times within the Enterobacteriaceae and can occur in strains of clinical relevance.
To further test our observation that changes in essentiality frequently happen between related strains in a genus, we developed a permutation test to identify genes more likely to be essential within a genus than without (see Methods). We found only 10 genes that were significantly specific to either
Salmonella or
Escherichia, the two genera where we had the most samples (
P < 0.05,
Fig. 4C, right panel). Of these, only one (
trpS, discussed previously) was fully concordant with the phylogeny, being essential only in
Citrobacter and
Escherichia. All others were either sporadically non-essential within the genus (e.g.,
fepC,
fepG,
ftsX,
ftsE in
Salmonella), or also sporadically essential in other strains outside the genus (e.g.,
rsgA,
ubiF,
lipA,
ubiH,
wzyE in
Salmonella). Our inability to identify genes whose essentiality is concordant with the phylogeny suggests that changes in essentiality status for non-core essential genes are rapid, occurring frequently within genera.
Analysis of essential genes conserved in reduced genomes identifies essential stress responses
Two major approaches have been taken to determining gene essentiality in the literature: experimental approaches, like the TraDIS technique used here, and comparative genomics based on the conservation of individual genes across bacterial phylogeny. Bacteria with reduced genomes, such as obligate pathogens and endosymbionts, are particularly informative for this purpose as they can survive with just hundreds of genes (
11). The family Enterobacteriaceae contains a number of obligate insect endosymbionts, including the model symbionts
Buchnera aphidicola in aphids and
Wigglesworthia glossinidia in the tsetse fly (
67). These bacteria typically reside in specialized fat cells termed bacteriocytes, which provide nutrients to the endosymbiont in exchange for essential amino acids or vitamins absent from the insect diet. While phylogenetic inference on highly reduced genomes is challenging, it does appear the endosymbiont lifestyle has emerged independently from free-living or facultative intracellular ancestors several times within the Enterobacteriaceae (
68). Investigating gene conservation in these “natural experiments” provides an orthogonal approach to our TraDIS screens for determining core gene essentiality.
We collected a set of 34 endosymbiont genomes from within the Enterobacteriaceae on which to conduct conservation analysis. Half (17) were
Buchnera aphidicola strains but also included representatives of the
Sodalis,
Wigglesworthia,
Blochmannia,
Baumannia genera; two secondary symbionts of psyllids; and
Moranella endobia, a symbiont which lives within the betaproteobacterium
Tremblaya princeps, itself a symbiont of mealybugs (
69). Using the same Hieranoid-based method as we used for our essentiality analysis, we identified orthologous genes between these strains and our collection of strains used for TraDIS analysis. This allowed us to identify genes universally conserved among endosymbionts, as well as compare them to our essential gene sets for free-living Enterobacteriaceae on rich media.
We identified a remarkably small set of 120 genes universally conserved in the endosymbionts (
Fig. 5A), and 73% (88 of 120) overlapped with either our core or ancestral essential gene sets determined by TraDIS. Nearly half of these (41 genes) are ribosomal proteins, comprising all but eight of the essential ribosomal proteins in our ancestral essential gene set (
Table 1;
Table S2). Other large classes of genes essential in free-living Enterobacteriaceae and conserved in endosymbionts included a subset of aminoacyl-tRNA biogenesis (nine genes, including
trpS), genes involved in translation (
der,
frr,
fmt,
infA, infB,
mnmA,
tadA,
tsaD,
pth,
rnc,
ybeY,
ygfZ), DNA replication (
dnaB,
dnaN,
dnaQ,
dnaX,
gyrB,
holB,
ssb), and fatty acid biosynthesis (
acpS,
fabB,
fabG,
fabI).
While endosymbiont lineages largely conserved a core of genes involved in information transfer, few essential biosynthetic pathways were universally conserved (
Fig. 5B). Loss of genes involved in ubiquinone synthesis has been noted since the sequencing of the first
Buchnera genome (
70); we find that these and other associated isoprenoid and heme biosynthesis genes involved in electron transport have been lost independently multiple times. Similarly, essential genes associated with the synthesis of the outer membrane, peptidoglycan, and lipopolysaccharides have been lost multiple times, including the genes responsible for the synthesis of phospholipids. It has been speculated that the host may provide phospholipids for the symbiont membrane (
71), though the details of how this may occur remain unclear. In the case of
Moranella, it has been shown that host-encoded peptidoglycan synthesis genes of diverse bacterial origin can complement the absence of these genes in the endosymbiont (
72,
73).
Thirty-two genes were universally conserved in endosymbionts despite not being essential in free-living species (
Fig. 5A;
Table 2). Some of these could be explained by their relationship to core information transfer processes: for instance, six encoding non-essential ribosomal proteins (
rplI,
rpmEFGJ,
rpsI), and four encoding ribosomal- and tRNA-modifying enzymes (
mnmE,
gidA/mnmG,
rluD, and
rsmH). Others appeared to be associated with stress responses in free-living bacteria, which was unexpected as the host is generally thought to maintain a stable environment for its symbionts (
71). These included, for instance,
lepA, encoding a non-essential translation factor that contributes to survival of stress conditions (
74) and whose deletion has previously been shown to negatively interact with the deletion of ubiquinone biosynthesis genes in
E. coli (
75), and
ahpC, encoding alkyl hydroperoxide reductase required to scavenge hydrogen peroxide in the absence of catalase (
76).
The requirement for stress response genes extends to entire systems, including for instance the trans-translation system comprising SsrA (the transfer-messenger RNA), SmpB, and the ClpPX protease that collectively rescue stalled ribosomes (
77). While generally thought to be important for the survival of stress conditions in free-living bacteria (
78), tmRNA has been found in some rare mitochondria (
79) suggesting it may also play an important role in reduced genomes. A final striking example is given by the
cspE, encoding the cold shock protein CspE. As the name suggests, cold shock proteins were discovered as regulators of the cold shock response (
80), and the best characterized, CspA, acts to melt RNA structures to ease translation at low temperatures (
81). Free-living Enterobacteriaceae genomes encode multiple cold shock protein homologs, with
E. coli carrying nine, for example. CspE in particular has recently been characterized as playing a key role in the virulence of
S. enterica (
82), possibly through stabilizing mRNA transcripts (
83), though the nature of this role is unclear and appears to be largely redundant with the paralogous protein CspC. A recent study of
Buchnera genomes found that
cspC has been repeatedly independently lost while
cspE has been maintained in different lineages (
84), a finding in keeping with our expanded analysis of other endosymbiont genomes. This suggests some level of non-redundancy in CspC and E function and further suggests cold shock proteins may collectively provide some essential function during normal growth masked by their high degree of redundancy in free-living Enterobacteriaceae.