INTRODUCTION
Synthetic genomics is an emerging field of synthetic biology combining different approaches and technologies to chemically synthesize sections of chromosomes or even entire genomes (
1,
2), thus enabling the generation of engineered organisms that significantly differ from those found in nature. Given sufficient knowledge and proper execution, this could lead to the rational design of organisms built to accomplish specific tasks (
3). However, the complexity of current model organisms is overwhelming and outstrips our ability to understand how cells operate on a global scale (
4). Minimal genomes, in addition to providing invaluable information about the essential genes and fundamental principles required to sustain life, would therefore facilitate systematic investigations toward a global understanding of cell functioning. Minimal cells could also become interesting platforms for rapid and affordable prototyping of engineered cells, further helping to uncover underlying genome design rules. So far, three main approaches have been used to determine the minimal gene set in various organisms: comparative genomic analyses, gene inactivation studies, and progressive genome reduction.
Comparative genomics uses sequence-based strategies to identify conserved genes, which are hypothesized to be maintained throughout evolution and shared across different organisms because of their contribution to cell fitness (
5). The exact number and nature of conserved genes have been found to vary considerably between studies, depending on the phylogenetic distribution (
6) and number of genomes analyzed (
5,
7). For example, Land and colleagues reported that 3,188 genes were always detected in
Escherichia coli (
8) while only 38 genes were found to be shared by 147 different species of bacteria and archaea (
6). Some conserved gene sets are thus certainly too large to reveal the minimal genome and rather correspond to important functions that are not necessarily essential but likely contribute to the fitness of an organism in its natural habitat (
9). Other gene sets are simply too small to support basic functions like replication, transcription, and translation. The results obtained through comparative genomics approaches are thus highly dependent on the set of organisms analyzed.
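The dependence of a conserved gene set on the genomes sampled can be illustrated with a minimal sketch. The strain names and gene families below are hypothetical toy data, not results from any study cited here; the core genome is taken simply as the intersection of the gene families found in each genome, and the pangenome as their union.

```python
def core_genome(gene_sets):
    """Return the gene families present in every genome of the collection."""
    sets = [set(genes) for genes in gene_sets.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


def pangenome(gene_sets):
    """Return every gene family found in at least one genome."""
    return set().union(*gene_sets.values())


# Hypothetical toy data: two closely related strains and one distant species.
genomes = {
    "strain_A": {"dnaA", "rpoB", "ftsZ", "lacZ"},
    "strain_B": {"dnaA", "rpoB", "ftsZ", "glpK"},
    "distant_species": {"dnaA", "rpoB", "nifH"},
}
close_relatives = {name: genomes[name] for name in ("strain_A", "strain_B")}

# Adding a phylogenetically distant genome can only shrink the intersection.
print(sorted(core_genome(close_relatives)))
print(sorted(core_genome(genomes)))
print(len(pangenome(genomes)))
```

In this toy case, the core shared by the two close relatives loses one family once the distant species is added, which mirrors how broad comparisons yield very small conserved sets.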
Genes that are conserved within a species are thought to be important or essential in their natural environment. However, laboratory and environmental conditions can greatly differ, resulting in different genetic requirements. Experimental assessment of essential genes can be achieved through individual gene inactivation studies. In this regard, gene deletion (
9,
10), transposon mutagenesis (
11,
12), and transcriptional interference (
13,
14) were used to identify dispensable genes. The results obtained from such experiments depend largely on growth conditions since, for example, cells may need certain metabolic pathways unless their products are already available in the medium (
15). Certain genes are essential only in the presence of another gene, typically because they balance or counteract the activity of its product. For instance, the antitoxin of a toxin/antitoxin system is essential only while the toxin is also present (
9). At lower insertion densities, transposon mutagenesis is likely to overestimate the number of essential genes because of the higher probability of missing genes simply by chance (
16). On the other hand, gene inactivation strategies can overestimate the number of dispensable genes since duplicated sequences or alternate metabolic pathways may be interrupted individually but not simultaneously, a phenomenon called synthetic lethality (
17). Overall, these phenomena can lead to biases and uncertainties in the estimation of the number of essential genes.
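The way synthetic lethality inflates the apparent number of dispensable genes can be sketched with a toy model. The functions and gene names below are hypothetical, and viability is reduced to a deliberately simple rule (an assumption): a cell survives as long as every required function retains at least one intact gene.

```python
# Hypothetical mapping of required functions to the genes able to provide them.
FUNCTIONS = {
    "replication": {"geneA"},            # single provider
    "sugar_import": {"geneB", "geneC"},  # redundant pair
}


def is_viable(deleted, functions=FUNCTIONS):
    """A cell is viable if each function keeps at least one undeleted gene."""
    deleted = set(deleted)
    return all(providers - deleted for providers in functions.values())


def dispensable_by_single_knockout(functions=FUNCTIONS):
    """Genes scored as dispensable when inactivated one at a time."""
    genes = set().union(*functions.values())
    return {gene for gene in genes if is_viable({gene}, functions)}


singles = dispensable_by_single_knockout()
print(sorted(singles))
# Deleting every "dispensable" gene at once removes both redundant providers.
print(is_viable(singles))
```

Each member of the redundant pair is scored as dispensable individually, yet removing both together abolishes the function, so a design that drops all individually dispensable genes is not viable in this model.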
Cumulative gene deletions that result in genome reduction provide a more accurate picture of possible minimal genome compositions of a given organism. However, this approach requires considerable effort and well-developed genetic tools. So far, this strategy has only been applied to a few organisms, including
E. coli (
18–21),
Bacillus subtilis (
22),
Streptomyces avermitilis (
23),
Pseudomonas putida (
24), and
Mycoplasma mycoides subsp.
capri (
2). The latter organism has undergone the most drastic streamlining, with the removal of ~50% of its original genome, resulting in the creation of
M. mycoides JCVI-syn3.0 containing a single chromosome of 531 kbp.
M. mycoides JCVI-syn3.0 was described as the first “working approximation of a minimal cell” (
2) and is currently the simplest organism capable of autonomous growth in axenic culture. Interestingly, minimal genome designs initially proposed for
M. mycoides based on single gene inactivation by transposon mutagenesis and other literature-based knowledge were not viable (
2). Many optimization and debugging steps were required to obtain
M. mycoides JCVI-syn3.0, highlighting the difficulty of identifying and understanding the roles of essential genes, even in the simplest cells.
Mesoplasma florum is a bacterium first isolated from a lemon tree flower in 1984 (
25). Unlike many other members of the class
Mollicutes,
M. florum shows a short doubling time of <40 min, requires no sterol for growth, and has no known pathogenic potential. The genomes of two
M. florum strains, L1 and W37 (
26), have been completely sequenced, revealing a single circular chromosome of ~800 kbp and positioning this species among the simplest free-living organisms. Basic genetic manipulation tools comprising antibiotic resistance genes, plasmids, and transformation methods have recently been developed for
M. florum (
27). Furthermore, the complete genome of
M. florum L1 has also been cloned in yeast and transplanted into a recipient
Mycoplasma capricolum subsp.
capricolum strain (
28,
29), which will enable sophisticated modifications and reengineering of the
M. florum chromosome. This combination of low cell complexity, ease of manipulation, and the availability of genome engineering methods makes
M. florum an interesting model for systems biology and synthetic genomics.
Here, we report a comparative genomic analysis of 13 M. florum strains. These data were investigated in conjunction with transposon mutagenesis to identify conserved, accessory, and essential genes in this species. We also discuss different scenarios for eventual M. florum genome reduction efforts according to results presented here and using comparisons with the phylogenetically related strain M. mycoides JCVI-syn3.0.
DISCUSSION
Minimal cells constitute powerful tools to better understand the fundamental components and the basic mechanisms that support life. The first approximation of a minimal gene set was recently provided with the creation of
M. mycoides JCVI-syn3.0 (
2). Technical advances now also enable the exploration of the
M. florum minimal genome. The development of
oriC-based plasmids and antibiotic selection markers (
27) constituted the basic steps that led to the whole-genome cloning of
M. florum in yeast (
29). This was followed by the establishment of a genome transplantation protocol for
M. florum and by the investigation of the impact of phylogenetic distance on this procedure (
28).
M. florum is therefore a bona fide candidate for genome reduction. However, this raises a few questions. Which genes should be removed to obtain a minimal
M. florum genome? Given their phylogenetic proximity, would a minimal
M. florum genome differ from or be equivalent to the minimal
M. mycoides JCVI-syn3.0 genome? What could be learned by creating minimal genomes based on different cell chassis?
Two different approaches, comparative genomics and random transposon mutagenesis, were used to determine the gene composition of a putative minimal
M. florum genome. The former exposed genes important for the survival of
M. florum in its natural habitat, whereas the latter revealed the genes likely to be essential under laboratory conditions. Through the analysis of 13 different strains (
Fig. 1), we determined the composition of the
M. florum core genome and explored the diversity of its pangenome (
Fig. 2). Although some strains were isolated from distant sites (
Fig. 1A) and from different plants or insects (see
Table 1), a total of 546 different protein-coding gene clusters, out of an average of 688 ± 23 per strain, were found to compose the core
M. florum genome. Random transposon mutagenesis of strain L1 predicted a total of ~430 dispensable and ~290 putatively essential genes under laboratory conditions. It is possible that the relatively low transposon insertion density (on average, one insertion every ~280 bp) spared a small number of genes simply by chance, which would result in the inclusion of a few dispensable genes in the minimal genome. However, this is unlikely to significantly affect our general conclusions about which genes should be deleted first during an eventual reduction of the
M. florum genome. Generating additional transposon insertion mutants would, however, increase the precision and confidence level of these predictions, especially for small genes.
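The effect of insertion density on small genes can be approximated with a short calculation. This is a rough sketch under a strong assumption that is not part of our analysis: insertions are treated as uniformly random along the chromosome (a Poisson process) with a mean spacing of ~280 bp, so that a dispensable gene of length L escapes insertion by chance with probability exp(-L/280). Real transposons show insertion biases, so these values are indicative only.

```python
import math

MEAN_SPACING_BP = 280  # average distance between insertions quoted in the text


def prob_no_insertion(gene_length_bp, mean_spacing_bp=MEAN_SPACING_BP):
    """Probability that a gene receives zero insertions under a Poisson model.

    Assumes uniformly random insertion sites, which is a simplification.
    """
    if gene_length_bp < 0 or mean_spacing_bp <= 0:
        raise ValueError("lengths must be non-negative and spacing positive")
    return math.exp(-gene_length_bp / mean_spacing_bp)


# Illustrative gene lengths (hypothetical, chosen to span small to large genes).
for length in (150, 300, 1000, 2000):
    print(f"{length:>5} bp: P(no insertion by chance) = "
          f"{prob_no_insertion(length):.3f}")
```

Under this model, the probability of a gene being missed by chance falls off exponentially with gene length, which is why additional insertion mutants would mostly benefit predictions for small genes.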
Combining comparative genomics and transposon mutagenesis data can provide contrasting perspectives on which genes should be included in a minimal
M. florum genome. While the 585 core genes could be expected to be sufficient for the survival of
M. florum L1, ~25 noncore genes are expected to be essential according to our transposon mutagenesis of
M. florum L1 (
Fig. 4B). A minimal genome design based on conserved genes only is thus highly unlikely to produce a viable cell. This can be explained by the differences in the growth conditions and evolutionary pressures experienced by
M. florum in the environment compared to laboratory settings. In fact, a majority (320, or 55%) of the 585
M. florum core genes are not essential in rich medium (
Fig. 4B). An alternative scenario that includes only the 290 putatively essential genes is also questionable, as synthetic lethality is likely to occur and result in a nonviable minimal
M. florum genome. This interpretation is supported by the fact that initially proposed minimal
M. mycoides genome designs based on transposon mutagenesis and other literature-based knowledge were not viable (
2). Preserving both the core and the putatively essential genes would remove only 110 genes; such a design has a reasonable chance of being viable but would most probably remain far from a minimal genome composition.
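The gene counts behind these three scenarios can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this section; values given as approximate (~) are treated as exact, which is an assumption, and the total gene count is taken as the sum of the dispensable and putatively essential genes.

```python
# Counts quoted in the text; "~" values are treated as exact (an assumption).
CORE_GENES = 585
ESSENTIAL_GENES = 290
NONCORE_ESSENTIAL = 25
DISPENSABLE_GENES = 430
TOTAL_GENES = ESSENTIAL_GENES + DISPENSABLE_GENES  # ~720 genes in strain L1


def scenario(kept):
    """Summarize a design by the number of genes kept and removed."""
    return {"kept": kept, "removed": TOTAL_GENES - kept}


core_and_essential = ESSENTIAL_GENES - NONCORE_ESSENTIAL  # essential core genes
core_not_essential = CORE_GENES - core_and_essential      # dispensable core genes

scenarios = {
    "core_only": scenario(CORE_GENES),
    "essential_only": scenario(ESSENTIAL_GENES),
    # Union of the two sets: core genes plus the noncore essential genes.
    "core_or_essential": scenario(CORE_GENES + NONCORE_ESSENTIAL),
}

for name, counts in scenarios.items():
    print(f"{name}: keep {counts['kept']}, remove {counts['removed']}")
print(f"core but not essential: {core_not_essential} "
      f"({100 * core_not_essential / CORE_GENES:.0f}% of core)")
```

With these inputs, the union scenario keeps 610 genes and removes 110, and 320 of the 585 core genes fall outside the putatively essential set, in agreement with the figures given above.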
Another possibility is to infer the minimal
M. florum L1 genome on the basis of
M. mycoides JCVI-syn3.0. A total of 409
M. florum L1 genes have homologs in
M. mycoides JCVI-syn3.0. Of these, 404 are part of the
M. florum L1 “core or putatively essential” gene set. Since all of the genes present in
M. mycoides JCVI-syn3.0 are essential or have a strong impact on cell fitness, this reveals interesting differences between these organisms. Despite their phylogenetic relatedness, 69 gene families are found only in
M. mycoides JCVI-syn3.0. Conversely, 57 putatively essential
M. florum L1 genes have no homolog in
M. mycoides JCVI-syn3.0 (
Data Set S1). It is possible that some of these genes perform equivalent functions although their sequences differ significantly. However, a majority of these
M. florum L1 (~61%) and
M. mycoides JCVI-syn3.0 (~54%) genes are annotated as encoding putative or hypothetical proteins with no clear function, making further investigations more difficult. This highlights our current inability to unambiguously assign functions to a large number of genes and to analyze cell physiology from a truly functional perspective, which constitutes a major challenge for biology. Genome-scale
in silico models (
45) would constitute an attractive tool to help organize, refine, and compare the available information on minimal genomes. Nevertheless, a scenario emerging from this comparison would be to combine the 57 putatively essential genes found only in
M. florum L1 with the 409 genes that have a homolog in
M. mycoides JCVI-syn3.0. This would likely represent a better approximation of a minimal
M. florum genome, given the data currently available. This also implies that the genome-reduced versions of these two organisms would, in large part, be similar but still differ despite their phylogenetic relatedness.
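The composition of this combined scenario can be expressed as a set union. The sketch below is illustrative only: the gene identifiers are hypothetical placeholders, and the function name is introduced here for convenience. The final size estimate uses the two counts quoted above, which are disjoint by definition since the 57 genes have no homolog in JCVI-syn3.0.

```python
def propose_minimal_set(syn3_homologs, essential_genes):
    """Combine genes with a JCVI-syn3.0 homolog and essential genes lacking one.

    Returns the proposed gene set and the subset of essential genes
    that have no homolog.
    """
    homologs = set(syn3_homologs)
    essential_without_homolog = set(essential_genes) - homologs
    return homologs | essential_without_homolog, essential_without_homolog


# Hypothetical toy identifiers, for illustration only.
toy_homologs = {"g001", "g002", "g003", "g004"}
toy_essential = {"g002", "g003", "g005"}
design, extra = propose_minimal_set(toy_homologs, toy_essential)
print(sorted(design), sorted(extra))

# Counts quoted in the text.
HOMOLOGS_IN_SYN3 = 409
ESSENTIAL_WITHOUT_HOMOLOG = 57
proposed_size = HOMOLOGS_IN_SYN3 + ESSENTIAL_WITHOUT_HOMOLOG
print(proposed_size)
```

Applied to the counts reported here, the combined design would contain 466 genes.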
What could be the conceptual nature of the differences observed between
M. mycoides JCVI-syn3.0 and the proposed
M. florum minimal genome? In principle, the minimal genome can be divided into three categories: a hard, a semihard, and a soft minimal genome. The hard minimal genome includes genes encoding functions that are essential and performed in a similar fashion across different strains or species (e.g., genome replication and protein synthesis). The semihard category contains functions essential for any organism but for which alternative genes or strategies are possible to fulfill the same requirement. For instance, different gene families can ensure the same functions, as exemplified by nonorthologous gene displacement (
46). The soft minimal genome is, on the other hand, composed of genes that are crucial in a given organism or environment but not necessarily in others. The availability of particular nutrients in the environment and the presence of a particular gene that affects the essentiality of other genes are factors that can shape the soft minimal genome. The differences between
M. florum L1 and
M. mycoides JCVI-syn3.0 should largely reside in either the semihard or soft minimal genome category. Since the semihard minimal genome of phylogenetically closely related bacteria is expected to be relatively small, the soft minimal genome is more likely to explain the distinctions between minimal
M. florum L1 and
M. mycoides JCVI-syn3.0 genomes. Indeed, the gene composition of these strains derives from data obtained in rich but slightly different media. Transposon mutagenesis of both strains in a set of different media would presumably lead to the identification of many environment-specific essential genes.
In conclusion, although the technology needed to build entire genomes is now accessible, synthetic genomics is increasingly limited by our incomplete understanding of cell functioning. A significant fraction of genetic components are still poorly characterized, even in the most thoroughly studied organisms. Because of their lower complexity, minimal genomes offer a remarkable opportunity to investigate the most fundamental cellular functions that support life. Furthermore, the construction of minimal synthetic chromosomes will facilitate the generation of several genome versions that could help better define the rules governing genome organization. The use of minimal cells will also facilitate the establishment of comprehensive whole-cell models, which is currently hindered by excessive biological complexity. These models could become powerful tools to predict cell behavior and to create synthetic genomes (
47). Overcoming these important challenges will constitute a stepping stone toward the rational design and programming of complete genomes.