Comparative analysis.
The genome of
C. acetobutylicum provides us with at least two unique opportunities: (i) to compare, for the first time, two large and moderately related gram-positive bacterial genomes, those of
C. acetobutylicum and
B. subtilis (
41); and (ii) to investigate the genes that underlie the diverse set of metabolic capabilities so far not represented in the collection of complete genomes.
The median level of sequence similarity (
26) between probable orthologs in
C. acetobutylicum and
B. subtilis was greater than between
C. acetobutylicum and any other bacterium, but only by a rather small margin, indicating significant divergence (Table
1). Compared to the other pairs of evolutionarily relatively close genomes, the
Clostridium-Bacillus pair is more distant than the species within the gamma-proteobacterial lineage (
Escherichia coli, Haemophilus influenzae, Vibrio cholerae, and
Pseudomonas aeruginosa) or
Helicobacter pylori and
Campylobacter jejuni; in contrast, the level of divergence between
C. acetobutylicum and
B. subtilis is comparable to that between the two spirochetes,
Treponema pallidum and
Borrelia burgdorferi (Table
1). The comparative analysis of the spirochete genomes has proved to be highly informative for elucidating the functions of many of their genes and predicting previously undetected aspects of the physiology of these pathogens (
76).
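The median-based comparison of genome pairs described above can be illustrated with a minimal sketch. It assumes that a similarity score is already available for each pair of probable orthologs; the function name and the numbers are hypothetical and serve only to show the calculation, not to reproduce the values in Table 1.

```python
from statistics import median

def median_similarity(ortholog_scores):
    """Return the median ortholog similarity for each genome pair.

    ortholog_scores maps a (genome_a, genome_b) tuple to a list of
    similarity scores, one per pair of probable orthologs.
    """
    return {pair: median(scores) for pair, scores in ortholog_scores.items()}

# Hypothetical scores, for illustration only.
scores = {
    ("C. acetobutylicum", "B. subtilis"): [0.52, 0.47, 0.61],
    ("C. acetobutylicum", "E. coli"): [0.44, 0.40, 0.49, 0.43],
}
medians = median_similarity(scores)
# Genome pair with the highest median similarity.
closest = max(medians, key=medians.get)
```

The median is used rather than the mean because a few exceptionally conserved proteins would otherwise dominate the comparison.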
A taxonomic breakdown of the closest homologs for the
C. acetobutylicum proteins immediately reveals the specific relationship with the low-GC gram-positive bacteria, with the reliable best hits for 31% of the
C. acetobutylicum protein sequences being to this bacterial lineage (Fig.
2). However, nearly as many proteins produced clear best hits to homologs from other taxa (Fig.
2), which emphasizes the likely major role for lateral gene transfer, a hallmark of microbial evolution.
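A taxonomic breakdown of this kind can be sketched as a simple tally of best hits per taxon. The sketch below assumes that each protein has already been assigned the taxon of its reliable best hit (or None when no reliable hit exists); the function name and the example data are hypothetical.

```python
from collections import Counter

def best_hit_breakdown(best_hits):
    """Return the fraction of all proteins whose best hit falls in each taxon.

    best_hits maps a protein identifier to the taxon of its reliable
    best hit, or to None when no reliable best hit was found.
    """
    counts = Counter(t for t in best_hits.values() if t is not None)
    total = len(best_hits)
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical assignments, for illustration only.
hits = {
    "p1": "low-GC gram-positive",
    "p2": "low-GC gram-positive",
    "p3": "Archaea",
    "p4": None,
}
fractions = best_hit_breakdown(hits)
```

Fractions are taken over all proteins, including those without a reliable hit, which matches the way the 31% figure is expressed relative to the full protein set.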
The same trends appear even more notable when the genome organizations of
C. acetobutylicum and other bacteria are compared. Gene order is, in general, poorly conserved in the bacteria, with no extended synteny detected even among relatively close genomes, such as those of
E. coli and
P. aeruginosa or
H. influenzae. In contrast, a genomic dot plot comparison
of C. acetobutylicum with
B. subtilis revealed several regions of colinearity (Fig.
3A and B). Thus, at least some bacterial genomes separated by a moderate evolutionary distance, as exemplified by
C. acetobutylicum and
B. subtilis, appear to retain the memory of parts of the ancestral gene order. A systematic mapping of conserved gene strings (many of which form known or predicted operons) on the
C. acetobutylicum genome shows the clear preponderance of gene clusters shared with
B. subtilis but also considerable complementary coverage by conserved operons from other bacterial and even archaeal genomes (Fig.
3C; see supplementary material at
ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium). Altogether, 1,243
Clostridium genes (32% of the total predicted number of genes and 40% of the genes with detectable homologs) belong to conserved gene strings; 779 of these are in 271 predicted operons shared with
B. subtilis (Fig.
3C; see supplementary material at
ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium).
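The idea of a conserved gene string can be made concrete with a short sketch. It is not the procedure used for Fig. 3C, which is not specified here; it assumes a simple criterion under which consecutive genes in one genome form a string if their orthologs lie next to each other in the second genome. All names and the toy data are hypothetical.

```python
def conserved_gene_strings(order_a, order_b, orthologs, min_len=2, max_gap=1):
    """Find runs of consecutive genes in genome A whose orthologs are
    neighbors (within max_gap positions) in genome B.

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: dict mapping a gene of A to its ortholog in B.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    strings, current = [], []

    def flush():
        # Keep the current run only if it is long enough.
        if len(current) >= min_len:
            strings.append(list(current))

    for gene in order_a:
        partner = orthologs.get(gene)
        if partner not in pos_b:
            # A gene without an ortholog in B interrupts the run.
            flush()
            current = []
            continue
        if current:
            prev_partner = orthologs[current[-1]]
            if abs(pos_b[partner] - pos_b[prev_partner]) <= max_gap:
                current.append(gene)
                continue
            flush()
        current = [gene]
    flush()
    return strings

# Toy genomes: a1-a3 map to an inverted but contiguous block in B.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b3", "b2", "b1", "x", "b5", "b4"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}
strings = conserved_gene_strings(order_a, order_b, orthologs)
```

Using the absolute difference of positions lets the sketch detect strings regardless of orientation, so an inverted block still counts as conserved.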
The genome region that shows the greatest level of gene order conservation between
C. acetobutylicum and
B. subtilis spans ∼200 genes, primarily in (predicted) operons encoding central cellular functions, such as translation and transcription (Fig.
3C). The multiple genome alignment for this region clearly shows numerous rearrangements of gene clusters, with large-scale colinearity seen only between
C. acetobutylicum and
B. subtilis. The intermediate conservation of gene order seen between
C. acetobutylicum and
B. subtilis is likely to be particularly informative in terms of complementing functional predictions based on direct sequence conservation. For example, the predicted large “superoperon,” which contains genes for several components of the translation machinery (
def, encoding
N-formylmethionyl-tRNA deformylase;
fmt, encoding methionyl-tRNA formyl transferase; and
fmu, encoding a predicted rRNA methylase), transcription, and replication, additionally includes the genes
yloO (CAC1727),
yloP (CAC1728), and
yloQ (CAC1729). These genes encode a predicted protein phosphatase, a serine-threonine protein kinase, and a GTPase, respectively. Based on the operon context, readily testable predictions can be made that yloQ is a previously uncharacterized translation factor, whereas yloO and yloP are likely to play a role in the regulation of translation and/or transcription.
The mosaic picture of operon conservation can be explained by a combination of the processes of horizontal operon transfer, gene (operon) loss, and operon disruption (rearrangement). Distinguishing between these phenomena is, in many cases, difficult, but in certain extreme situations, one of the evolutionary routes is clearly preferable. A striking example is the conservation of the nitrogen fixation operon (six genes in a row) between
C. acetobutylicum and another nitrogen fixator, the archaeon
Methanobacterium thermoautotrophicum (Fig.
4A). This particular gene organization so far has not been seen in any other genome except for that of another clostridial species,
C. pasteurianum, in which, interestingly, two genes of the operon are deleted (Fig.
4A). Similarly, the aromatic amino acid biosynthesis operon is conserved, albeit with local rearrangements,
in C. acetobutylicum, Thermotoga maritima, and partially in
Chlamydia (Fig.
4B). In these and similar cases, it is hard to imagine an evolutionary scenario that does not involve horizontal mobility of these operons, along with operon disruption in some of the bacterial and archaeal lineages.
In general, C. acetobutylicum carries the typical complement of genes that are conserved in most bacteria. The only gene that is present in all other bacteria (and, in fact, in all genomes sequenced to date) but is missing in C. acetobutylicum is that for thymidylate kinase.
A differential genome display analysis for
C. acetobutylicum and
B. subtilis, which was performed using the COG system (
78), revealed 186 conserved protein families (COGs) that are represented in
C. acetobutylicum but not in
B. subtilis. Many of these proteins are involved in redox chains that are characteristic of the anaerobic metabolism of
Clostridia as opposed to the aerobic metabolism of
B. subtilis, as well as oxidation and reduction that are required for assimilation of nitrogen and hydrogen. Another group of enzymes belongs to biosynthetic pathways that are present in
C. acetobutylicum but not in
B. subtilis, primarily those for certain coenzymes, for example, cyanocobalamin (see supplementary material at
ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium). Conversely, 335 COGs were detected in which
B. subtilis was represented, whereas
C. acetobutylicum was not. An obvious part of this set consists of genes coding for components of aerobic redox chains, such as cytochromes and proteins involved in the assembly of cytochrome complexes. Also missing are a variety of membrane transporters and the glycine cleavage system, which is present in the majority of bacteria. Several metabolic pathways are incomplete; for example, considerable parts of the tricarboxylic acid (TCA) cycle and of molybdopterin biosynthesis are missing. The TCA cycle is incomplete in many prokaryotes, but in most of these cases, the chain of reactions producing three key precursors, 2-oxoglutarate, succinyl-CoA, and fumarate, can proceed in either the oxidative or the reductive direction (
30). In
C. acetobutylicum, citrate synthase, aconitase, and isocitrate dehydrogenase are missing. It appears, however, that what remains of the TCA cycle could function in the reductive (counterclockwise in Fig.
5) direction. The counterparts of enzymes involved in succinyl-CoA and 2-oxoglutarate formation in other organisms are missing in C.
acetobutylicum. However, the genome encodes an acetoacetyl-CoA:acyl-CoA transferase that catalyzes butyryl-CoA formation in solventogenesis (CAP0163-0164) and might also utilize succinate for the synthesis of succinyl-CoA, as well as a 2-oxoacid:ferredoxin oxidoreductase (CAC2458-2459) that could catalyze 2-oxoglutarate formation from succinyl-CoA (Fig.
5). Succinate dehydrogenase/fumarate reductase, the enzyme that normally catalyzes the reduction of fumarate to succinate, seems to be missing in
C. acetobutylicum. However, this reaction is linked to the electron transfer chain and might be supported by another dehydrogenase whose identity could not be easily determined.
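The differential genome display that opens this paragraph amounts to a pair of set differences over the COGs represented in each genome. The sketch below assumes that COG membership has already been assigned to each genome; the function name and the identifiers are hypothetical placeholders, not real COG numbers.

```python
def differential_display(cogs_a, cogs_b):
    """Partition COGs into those found only in genome A, only in
    genome B, or in both."""
    a, b = set(cogs_a), set(cogs_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}

# Placeholder COG labels, for illustration only.
clostridium_cogs = ["COG_x1", "COG_x2", "COG_x3"]
bacillus_cogs = ["COG_x2", "COG_x3", "COG_x4", "COG_x5"]
result = differential_display(clostridium_cogs, bacillus_cogs)
```

Applied to the real data, the "only_a" and "only_b" sets would correspond to the 186 and 335 COGs reported above.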
The repertoires of transcriptional regulators in
B. subtilis (
27) and
C. acetobutylicum are very similar. In particular, of the 17 sigma factors predicted in
C. acetobutylicum, 11 have readily detectable orthologs in
B. subtilis. C. acetobutylicum also encodes numerous predicted specific transcriptional regulators, including 28 members of the AcrR/TetR family, 22 members of the MarR/EmrRs family, 14 members of the LysR family, 14 members of the Xre family, 9 members of the LacI family, and several smaller sets of paralogous regulators. One-to-one orthologous relationships could be established only for a minority of these proteins (data not shown), and in some cases, such as that of the MarR/EmrRs family, part of the observed diversity seems to be due to independent family expansion.
The set of sporulation genes in
C. acetobutylicum surprisingly differs from the set that has been well studied in
B. subtilis (
75). The number and diversity of detectable sporulation genes in
Clostridium are much smaller. The most dramatic difference was observed among the SpoV genes.
C. acetobutylicum does not have orthologs of the
spoVF, spoVK, and spoVM genes, the disruption of which in
B. subtilis leads to formation of immature spores that are sensitive to heat, organic solvents, and lysozyme (
75). The phosphorelay system that functions in phase 0 of sporulation in
B. subtilis (
7,
31) appears to be missing in
C. acetobutylicum, as indicated by the absence of orthologs of Spo0B (phosphotransferase B) and Spo0F (a response regulator). In contrast,
C. acetobutylicum encodes an apparent ortholog of the Spo0A signaling protein (CAC2071), which consists of a CheY domain and a DNA-binding HTH domain, as well as three proteins homologous to the ambiactive transcription repressors and activators AbrB and Abh (CAC1941, CAC0310, and CAC3647), which are also involved in phase 0 in
B. subtilis. Interestingly, the spo0A gene has been shown to control solventogenesis in solvent-forming
Clostridia (
60). In
B. subtilis, sporulation is regulated by opposing activities of a distinct family of histidine kinases, KinA to KinE, and the Rap family phosphatases; orthologs of these genes were not detected in
C. acetobutylicum.
B. subtilis has 22
cot genes that are responsible for coat biosynthesis; only 14 of these genes are conserved in
C. acetobutylicum. Similarly,
B. subtilis has 21
ger genes, 7 of which are represented by orthologs in
Clostridium. Many of the missing GER genes encode various receptors of germination, which appear to be different in these bacteria. Furthermore,
C. acetobutylicum does not have an ortholog of the cell-division-initiation gene
divIC (
75), which is essential in
B. subtilis, suggesting differences in the mechanism of septum formation.
B. subtilis has a large set of competence genes which are involved in DNA uptake (
12). The majority of these genes are represented by orthologs in
C. acetobutylicum, but the proteins encoded by these genes in
B. subtilis and
C. acetobutylicum typically are not the most closely related members of the respective clusters of orthologs (data not shown). Operon disruption and rearrangements are also observed, suggesting a significant functional difference between the two gram-positive bacteria.
Many of the clostridial genes that are missing in
B. subtilis seem to show distinct evolutionary affinities and probably have been acquired via horizontal transfer. In particular, a significant number of clostridial genes are conserved in all archaea whose genomes have been sequenced to date but are present in bacteria only sporadically (Table
2). Many of these genes encode various redox proteins, which reflects the similarity between the anaerobic redox chains in archaea and clostridia. For most of these “archaeal” genes found in bacteria, the probable evolutionary model is a single entry into the bacterial world by horizontal transfer from the
Archaea, followed by dissemination among the
Bacteria. In several cases, however, direct gene transfer from archaea into the clostridial genome seems likely; examples include the genes for a metal-dependent hydrolase of the metallo-beta-lactamase superfamily (CAC0535), a calcineurin-like phosphatase which has undergone duplication in
C. acetobutylicum, probably subsequent to the acquisition of an archaeal gene (CAC1010 and CAC1078), and a predicted DNA-binding protein (CAC3166). Another group of clostridial genes includes probable eukaryotic acquisitions (Table
2). As with archaeal genes, the scenario of a single entry into the bacterial world followed by horizontal dissemination is likely for many of these genes, for example, that for the FHA domain discussed below. However, about 50 genes in
C. acetobutylicum could have been directly hijacked from eukaryotes (Table
2). An interesting example is the nucleotide pyrophosphatase, which is encoded within one of the gene clusters including genes for FHA-containing proteins (Fig.
6) and therefore may be also implicated in signaling. As noticed previously, lateral acquisition of some of the aminoacyl-tRNA synthetases from eukaryotes, accompanied by displacement of the original copies, seems to have occurred repeatedly in bacterial evolution (
85).
C. acetobutylicum is no exception, with its arginyl-tRNA synthetase showing a clear eukaryotic affinity. In these cases, horizontal gene transfer from eukaryotes to specific bacterial lineages appears more likely than horizontal transfer in the opposite direction, bacteria to eukaryotes. The latter interpretation would require independent gene loss in multiple bacterial lineages accompanied by multiple instances of nonorthologous displacement.
Most of the essential functions
in C. acetobutylicum and
B. subtilis are associated with readily detectable orthologs, but there are also notable cases of nonorthologous gene displacement (Table
3). Examples include glycyl-tRNA synthetase, which is represented by the typical bacterial, two-subunit form in
B. subtilis and by the one-subunit archaeal-eukaryotic version in
C. acetobutylicum, and uracil-DNA glycosylase, similarly represented by the classical bacterial enzyme (ortholog of
E. coli Ung) in B. subtilis and by the archaeal version in
C. acetobutylicum (Table
3). In many cases, while an apparent orthologous relationship was detected between a clostridial protein and its counterpart from
B. subtilis, there was nevertheless a clear difference in the domain architectures (Table
2). Notable examples of unusual domain organizations from
C. acetobutylicum include the FtsK ATPase, which is fused to the FHA domain (see below), a Pkn2 family protein kinase fused to tetratricopeptide repeats (CAC0404), and another ATPase fused to a LexA-like DNA-binding domain (CAC1793). The evolution of another set of genes seems to have involved xenologous gene displacement whereby a gene in one of the compared genomes (
C. acetobutylicum or
B. subtilis) is displaced by the ortholog from a distant branch of the phylogenetic tree, e.g., eukaryotes (Table
3). Characteristically, this evolutionary pattern was detected for three aminoacyl-tRNA synthetases, those for isoleucine, arginine, and histidine; in each of these cases,
C. acetobutylicum possesses the archaeal-eukaryotic version as opposed to the typical bacterial versions found in
B. subtilis. Another interesting example of xenologous displacement involves the two forms of clostridial ribonucleotide reductase, neither of which groups with the counterparts from
B. subtilis in phylogenetic trees. One of the ribonucleotide reductase genes in
B. subtilis contains the single intein in that organism;
C. acetobutylicum has no inteins, however. These observations show that there had been a significant horizontal exchange of genes between the
Clostridium lineage and certain archaea and/or eukaryotes subsequent to its divergence from the
Bacillus lineage.
The results of a systematic analysis of protein families that are specifically expanded in
C. acetobutylicum are largely compatible with the current knowledge of the physiology of the bacterium (Table
4). For example, distinct families of proteins involved in sporulation, anaerobic energy conversion, and carbohydrate degradation were identified (Table
4). A so far unique feature is the presence of four diverged copies of the single-stranded DNA-binding proteins, an essential component of the replication machinery that is present in one or two copies in all other sequenced bacterial genomes. In addition, this analysis revealed remarkable aspects of the signal transduction system in this bacterium. Of particular interest is the proliferation of the phosphopeptide-specific, protein-protein interaction module, the FHA domain, which is generally rare in the
Bacteria (
44).
C. acetobutylicum encodes five FHA-domain-containing proteins, which is comparable to the number of these domains in other bacteria with versatile Ser/Thr-phosphorylation-based signaling, namely
Mycobacterium tuberculosis (
10) and
Synechocystis sp. (
7); most of the other bacteria do not encode FHA domains or possess just one copy (
58). Four of the genes coding for FHA-domain-containing proteins in
C. acetobutylicum belong to two partially similar gene clusters that are unique to
C. acetobutylicum and additionally include genes for other phosphorylation-dependent signaling proteins, namely predicted protein kinases and phosphatases (Fig.
6). The fusion of the FHA domain with the FtsK ATPase, which is involved in chromosome segregation, and the presence, in one of the clusters, of an ATPase of the MinD family, also involved in chromosome partitioning, suggest previously unsuspected regulation of cell division in
C. acetobutylicum via reversible protein phosphorylation. The fifth FHA-domain-containing protein seems to belong to yet another predicted operon that is potentially involved in cell division, as indicated by the presence of genes for a penicillin-binding protein and another membrane protein implicated in cell division in other bacteria (Fig.
6). These observations are compatible with the hypothesis on the role of phosphorylation in the regulation of this process in
C. acetobutylicum. Another signaling system that is predicted to play a prominent role in
C. acetobutylicum on the basis of protein family expansion analysis includes the so-called HD-GYP domains (name based on the one-letter code for characteristic amino acids) that are suspected to possess cyclic diguanylate phosphoesterase activity (Table
4); the only comparable expansion of the HD-GYP domain is seen in
T. maritima. The HD-GYP proteins could play a major role in sensing the redox state of the environment in
C. acetobutylicum (M. Y. Galperin, D. A. Natale, L. Aravind, and E. V. Koonin, Letter, J. Mol. Microbiol. Biotechnol.
1:303–305, 1999).
The solventogenesis pathways of
C. acetobutylicum involve the formation of acetone, acetate, butanol, butyrate, and ethanol from acetyl-CoA (
52). Two mechanisms of butanol formation have been identified in
C. acetobutylicum, one of which is associated with solventogenesis (production of butanol, ethanol, and acetone) and the other with alcohologenesis (production of butanol and ethanol only). The genes involved in solventogenesis have been previously identified on the megaplasmid and sequenced (Galperin et al., letter), but the genes responsible for alcohologenesis were unknown. The genome sequence allows the identification of a second alcohol-aldehyde dehydrogenase (CAP0035), a pyruvate decarboxylase (CAP0025), and an ethanol dehydrogenase (CAP0059) that are probably involved in this alcohologenic metabolism (Fig.
5) and interestingly are also carried by the megaplasmid. The enzymes involved in the final steps of solvent formation show variable phylogenetic profiles, and in particular, several of them appear to be specifically related to the homologs from the archaeon
Archaeoglobus fulgidus (Fig.
5). In contrast, the genes for the two subunits of another key enzyme of the acetone pathway, acetoacetyl-CoA:acyl-CoA transferase, show a clear proteobacterial affinity. Together with the fact that a significant subset of the solventogenesis enzymes is encoded on the clostridial megaplasmid, these observations suggest that these pathways could have evolved via a complex sequence of gene/operon acquisition events. The megaplasmid also carries second copies of genes involved in PTS-type sugar transport (CAP0066-68), glycolysis (aldolase, CAP0064), and central metabolism (thiolase, CAP0078). It would be interesting to determine the expression profiles of the plasmid-encoded and chromosomal copies of these genes to investigate (i) whether these genes and the solventogenic genes are regulated or coregulated and (ii) whether metabolic complementarity exists between the chromosome and the plasmid in
C. acetobutylicum.
The cellulosome, the macromolecular complex for cellulose degradation, has been genetically and biochemically characterized in four
Clostridium species (
C. thermocellum, C. cellulovorans, C. cellulolyticum, and
C. josui) but not in
C. acetobutylicum, which is able to hydrolyze carboxymethyl cellulose but not amorphous or crystalline cellulose (
68). The proteins of the cellulosome contain a C-terminal Ca2+-binding dockerin domain, which is required for the binding to the cohesin domains of a scaffolding protein (
36,
40). Genome sequence analysis revealed at least 11 proteins that are confidently identified as cellulosome components (Fig.
5A). Most of these genes are organized in an operon-like cluster (CAC910 to CAC919) with a gene order similar to that of those in mesophilic
C. cellulolyticum and
C. cellulovorans, as distinct from the more dispersed organization in the thermophile
C. thermocellum (
4,
77). The large glycohydrolase CAC3469 is the homolog of EngE of
C. cellulovorans, which is also encoded away from the main cellulosome gene cluster. Unlike EngE, CAC3469 possesses an additional cell adhesion domain (Fig.
5A). This protein contains S-layer homology domains and cell adhesion domains similar to those of SlpA, one of the anchoring proteins of
C. thermocellum. The presence of the short cohesin domain protein CAC914 suggests a role in cellulosome function related to that of the HbpA protein of
C. cellulovorans (
77). The other dockerin-domain containing proteins, those of the GH48, GH5, and GH9 families, might interact with CAC910, the ortholog of the scaffolding protein CbpA. Generally, although the cellulosome has not been detected in
C. acetobutylicum, the number of relevant proteins and domains would seem sufficient to encode the various combinations of cellulose-binding and hydrolytic proteins found in this complex. An interaction between CAC3469 and CAC910 could be speculatively proposed as a means of anchoring a potential cellulosome-like structure to the peptidoglycan.
In work analyzing the cellulolytic activities of
C. acetobutylicum strains, it was found that NRRL B 527 could hydrolyze Avicel and acid-swollen cellulose but
C. acetobutylicum ATCC 824 could not (
42). The subsequent taxonomic and historical analyses of these strains (
32,
33) indicate a close relationship and suggest that further investigation of the cluster from strain B 527 would be informative in elucidating the reason for the different cellulolytic activities of the two strains. Further work is required to resolve these issues and to determine the exact functions of the cellulosome subunits in
C. acetobutylicum.
In addition to the known cellulosome components,
C. acetobutylicum encodes numerous other proteins that are predicted to be involved in the degradation of xylan, levan, pectin, starch, and other polysaccharides. Altogether, there seem to be over 90 genes encoding proteins implicated in these processes, including representatives of at least 14 distinct families of glycosyl hydrolases. In particular, a predicted operon located on the
C. acetobutylicum megaplasmid (CAP0114 to CAP0120) consists mostly of genes encoding xylan degradation enzymes. Similarly to the cellulosome components, these enzymes possess complex domain architectures, with the oligosaccharide-binding ricin domain (
74) typically present at the C terminus; the addition of the ricin domain is (so far) a unique feature of this postulated novel system for xylan degradation in
Clostridium (Fig.
5B). Two of the putative xylanases presumably correspond to previously reported enzymes of xylan degradation isolated from
C. acetobutylicum ATCC 824 (
43).
A number of sugar PTS transport system genes, as well as the corresponding regulatory system analogs (e.g., Hpr, ptsK, and CcpA) that couple transport signals to genetic regulation of degradative operons, have been found (
61,
63). Non-PTS-mediated uptake of certain sugars, especially pentoses, has been found in several clostridial species (
52). Many primary active transporters, including ABC-type transporters and P-type ATPases, electrochemical potential-driven transporters, channels and pores, and uncharacterized transporters were detected among the gene products of
C. acetobutylicum (Fig.
5; see details in the figure legend). There is, however, no ortholog of the glucose facilitator of
B. subtilis (
17).
Along with previously characterized molecular complexes involved in extracellular hydrolysis of organic polymers, a novel system possibly related to these processes was detected. The signature of this system is a previously undetected domain with a distinct repetitive structure, which we designated as “ChW repeats” (clostridial hydrophobic, with a conserved W, tryptophan) (Fig.
7B). So far, the only nonclostridial protein containing similar repeats was detected in
Streptomyces coelicolor (Fig.
7B). All proteins containing ChW repeats contain confidently predicted signal peptides at their N termini and do not contain predicted transmembrane helices, which suggests that all of them are secreted (Fig.
7A). Some of the ChW-repeat proteins contain additional enzymatic domains, such as glycosyl hydrolases or proteases, which implicates them in the degradation of polysaccharides and proteins. Several ChW-repeat proteins also contain domains that are involved in cell interactions, such as the cell adhesion domain (
39) and the leucine-rich repeat (internalin) domain (
46) (Fig.
7A). The internalin domain has been shown to play a critical role in the host cell invasion by the bacterial pathogen
Listeria monocytogenes (
46). In
C. acetobutylicum, these domains might be responsible for interactions with plant cells. ChW repeats also could function in either substrate-binding or protein-protein interactions. The specific expansion of this domain in
C. acetobutylicum suggests the existence of a novel molecular system, which partially resembles the cellulosome and could also form structurally distinct multisubunit complexes involved in polymer degradation and interaction with the environment. Elucidation of the function of this system is expected to shed light on the unique physiology of
C. acetobutylicum.
The extreme diversity of the domain architectures of the proteins that comprise the cellulosome and other predicted polymer degradation systems suggests that such complexes are highly dynamic not only in terms of the subunit stoichiometry (
68) but also with respect to the genetic organization, with horizontal gene transfer, domain shuffling, and nonorthologous gene displacement playing pivotal roles in their evolution.
C. acetobutylicum is the first sequenced bacterial genome with such a remarkable abundance of polymer degradation systems, which makes it a model for future studies on other bacteria with similar lifestyles. In addition, the sequencing of the
C. acetobutylicum genome will offer perspectives in future comparative genomic studies concerning pathogenic bacteria, e.g.,
C. difficile,
C. tetani, and
C. perfringens, which are currently being sequenced by other groups.