EcoCyc (pronounced “eeko-sike,” as in “ecology” and “encyclopedia”) is a bioinformatics database that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The project’s long-term goal is describing the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for all researchers who work with E. coli and related microorganisms. In addition to the database, a steady-state metabolic flux model is available, generated from each new version of EcoCyc.
This review provides an overview of EcoCyc’s data content and the procedures by which these data enter EcoCyc.
EcoCyc accelerates science. It is designed for several modes of interactive use, both through the EcoCyc.org website and in conjunction with the downloadable Pathway Tools (1) software (the resources available to assist users in learning the website and software are listed in “How to Learn More,” below):
• EcoCyc is an encyclopedic reference providing information about the biological roles of E. coli genes, metabolites, and pathways. Visualization tools, such as a genome browser, metabolic map display, and regulatory network diagram, aid in the comprehension of these complex data.
• EcoCyc facilitates the analysis of high-throughput data such as gene-expression and metabolomics data via tools for enrichment analysis, and for visualizing omics data on a metabolic map diagram, complete genome diagram, or regulatory network diagram.
• The EcoCyc metabolic flux model can predict growth or no-growth of wild-type and knockout E. coli strains under different nutrient conditions.
Users of EcoCyc fall into several different groups. Experimental biologists use EcoCyc as an encyclopedic reference on genes, pathways, and regulation, and they use its omics-data analysis tools to analyze gene-expression and metabolomics data. Examples of papers citing EcoCyc in the analysis of functional genomics data include references 2, 3, 4, 5, and 6.
Because the EcoCyc data are structured within a sophisticated ontology that is amenable to computational analyses, EcoCyc enables scientists to ask computational questions spanning the entire genome of E. coli, the known metabolic network of E. coli, the known transport complement of E. coli, the known genetic regulatory network of E. coli, and combinations thereof. Past work includes the use of EcoCyc to develop methods for studying path lengths within metabolic networks (7, 8, 9), in studies relating protein structure to the metabolic network (10, 11), and in analysis of the E. coli regulatory network (12, 13).
The development of many new bioinformatics methods requires high-quality, gold-standard data sets for the training and validation of those methods. EcoCyc has been used as a gold-standard data set for the development of genome-context methods for predicting gene function (14, 15), operon-prediction methods (16, 17), the prediction of promoters and transcription start sites (18, 19), regulatory network reconstruction (20), and the prediction of functional and direct protein-protein interactions (21, 22, 23). The EcoCyc metabolic data have been used for studies concerning predicted metabolic networks and growth prediction (24, 25), and for model checking of a symbiotic bacterium’s metabolic network (26).
Metabolic engineers alter microbes to produce biofuels, industrial chemicals, and pharmaceuticals; to degrade toxic pollutants; and to sequester carbon (27, 28, 29). Metabolic engineers who use E. coli as their host organism consult EcoCyc to aid in optimizing the production of an end product through a better understanding of the metabolic network and its regulation and to predict undesirable side effects of a metabolic alteration. Metabolic engineering studies using EcoCyc include references 30, 31, and 32.
According to the Thomson Reuters Web of Knowledge citation index, as of August 2013, the 23 EcoCyc and RegulonDB papers authored since 1997 had been cited by 2,395 publications from 1997 to 2013. According to Google Analytics, approximately 100,000 visitors query the EcoCyc website each year, generating 177,000 object page views per month on average in 2012.
The Pathway Tools software that underlies EcoCyc (1) is not specific to E. coli, but rather has been applied to manage genomic and biochemical data for thousands of organisms.
OVERVIEW OF EcoCyc DATA CONTENT
EcoCyc covers a broad array of data types. Key to understanding the EcoCyc data and their presentation within the EcoCyc website and Pathway Tools is the notion of a database class, which describes a specific type of data. For example, the class Genes provides the database definition of a gene, including the attributes (e.g., starting nucleotide position within the genome) and relationships (e.g., the linkage between a gene and gene product) of the class. Each specific gene within EcoCyc is stored in a single database object, or frame, that is an instance of the class Genes.
No one-to-one mapping exists between EcoCyc classes and the data pages within the EcoCyc website, because one data page typically integrates information from multiple classes. For example, the pathway data page integrates information from objects in the classes Pathways, Reactions, Genes, Proteins, and Chemicals.
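The relationship between classes, frames, and data pages can be illustrated with a small Python sketch. This is not the Pathway Tools API or schema: the `Frame` type, the frame IDs, and the slot names (`left-end-position`, `product`, `gene`) are hypothetical and chosen only to show how one page could draw on objects from more than one class.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A database object (frame): one instance of one database class."""
    frame_id: str
    db_class: str                               # e.g. "Genes" or "Polypeptides"
    slots: dict = field(default_factory=dict)   # attributes and relationships


# Hypothetical frames; IDs, slot names, and values are illustrative only.
gene = Frame("GENE-1", "Genes",
             {"left-end-position": 1000, "product": "PROTEIN-1"})
protein = Frame("PROTEIN-1", "Polypeptides", {"gene": "GENE-1"})
frames_by_id = {f.frame_id: f for f in (gene, protein)}


def linked_frames(start, frames_by_id):
    """Return the start frame plus every frame directly referenced by its slots."""
    result = [start]
    for value in start.slots.values():
        # A slot value that is the ID of another frame is treated as a relationship.
        if isinstance(value, str) and value in frames_by_id:
            result.append(frames_by_id[value])
    return result


def classes_on_page(start, frames_by_id):
    """List the classes whose instances would contribute to one data page."""
    return sorted({f.db_class for f in linked_frames(start, frames_by_id)})


print(classes_on_page(gene, frames_by_id))
```

In this toy example, a page built around the gene frame pulls in its product as well, so it integrates instances of both Genes and Polypeptides.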
Genome
EcoCyc contains the complete genome sequence of E. coli and describes the nucleotide position and function of all known protein-coding and RNA-coding E. coli genes. Genome-related classes that are populated within EcoCyc include Genes, Pseudo-Genes, Promoters, DNA-Binding-Sites, and REP-Elements. Gene Ontology (GO) terms are assigned to genes both by EcoCyc curators and by import of GO terms from UniProt (33). EcoCyc data on the essentiality of E. coli genes are described in “Essential Gene Information” (see below).
Proteome
EcoCyc describes all known monomers and multimeric protein complexes of E. coli. EcoCyc contains extensive annotation of the features of E. coli proteins, such as phosphorylation sites, metal ion binding sites, and enzyme active sites, assigned by EcoCyc curators and imported from UniProt. Relevant classes within EcoCyc include Polypeptides and Protein-Complexes.
RNAome
EcoCyc describes all known RNAs and protein-RNA complexes of E. coli. Relevant classes within EcoCyc include RNAs, rRNAs, and regulatory RNAs. Note that EcoCyc does not explicitly represent messenger RNAs.
Regulation
EcoCyc contains the most complete description of the regulatory network of any organism. It covers E. coli operons, promoters, transcription factors, transcription factor binding sites, attenuators, and small-RNA regulators, as well as substrate-level regulation of E. coli enzymes. Each molecular regulatory interaction is described as an instance of class regulation, whose subclasses describe different types of regulation.
Metabolism
EcoCyc describes all known metabolic and signal-transduction pathways of E. coli. It describes each metabolic enzyme of E. coli including its cofactors, activators, inhibitors, and subunit structure.
Membrane Transporters
EcoCyc annotates E. coli transport proteins and the associated transport reactions that they mediate.
Growth Observations
EcoCyc integrates data on the growth of E. coli under many different growth conditions, as described in “Conditions of E. coli Growth and Nongrowth.”
Database Links
EcoCyc is linked to other biological databases containing protein and nucleic acid sequence data, bibliographic data, protein structures, and descriptions of different E. coli strains.
LITERATURE-BASED CURATION
Curation is the process of manually refining and updating a bioinformatics database. The EcoCyc project uses a literature-based curation approach in which database updates are based on evidence in the experimental literature. EcoCyc is largely up to date with respect to its curation activities. As of October 2013, EcoCyc encodes information from more than 25,000 publications. A staff of four full-time curators updates the annotation of the E. coli genome on an ongoing basis.
The transcriptional regulatory information in EcoCyc and RegulonDB is curated by the group of Dr. Julio Collado-Vides at the Universidad Nacional Autónoma de México (UNAM); therefore, both databases include the same data content on transcriptional regulation of gene expression. The actual data curation occurs within EcoCyc, and the information is periodically propagated to RegulonDB.
Curators collect gene, protein, pathway, and compound names and synonyms. They classify genes and gene products by using the Gene Ontology (34) and MultiFun (35) ontologies, and they classify pathways within the Pathway Tools pathway ontology. Protein complex components and the stoichiometry of these subunits are captured; cellular localization of polypeptides and protein complexes is entered, as are experimentally determined protein molecular weights; enzyme activities and any enzyme prosthetic groups, cofactors, activators, or inhibitors are captured. Operon structure and gene regulation information are encoded.
Curators author textual summaries with extensive citations. Within the summaries for proteins, RNAs, pathways, and operons, curators capture additional information not otherwise captured in the highly structured database fields of EcoCyc. For example, curators use the free-text summary sections to describe the overall function of a gene product, the phenotypes caused by mutation, depletion, or overproduction of each gene product; any known genetic interactions; protein domain architecture and structural studies; the similarity to other proteins; or any functional complementation experiments that have been described. Summaries can also be used to note cases in which the published reports present contradictory results. In such cases, both viewpoints will be presented with proper attribution. This approach strives to ensure that no information is lost.
EcoCyc entries are generally updated when new literature becomes available. Regular PubMed searches are used to generate lists of potentially curatable publications, which are then evaluated and prioritized for curation. Papers containing newly identified functions of gene products, as well as substantial advances in understanding the functions of known gene products, are given the highest priority for curation. Because the Pathway Tools software continues to evolve and to enable the addition of new data types, older entries are also being updated in a systematic fashion (e.g., each enzyme in a metabolic pathway) as time allows.
STATISTICS ON EcoCyc CONTENT
Tables 1, 2, 3, and 4 present statistics on EcoCyc content. The listed numbers are current as of version 17.5, released in October 2013.
CONDITIONS OF E. COLI GROWTH AND NONGROWTH
As of 2011, EcoCyc incorporates media that have been shown experimentally to support or not support growth of both wild-type and knockout strains of E. coli K-12. This work has two goals. First, a comprehensive encyclopedia of E. coli growth conditions will be assembled for experimentalists. The spectrum of environmental conditions supporting the growth of a bacterium is among its most important phenotypic traits. We cannot expect to understand the functions of all genes in an organism unless we understand the full range of the environments in which the cell can grow. Second, a comprehensive collection of E. coli growth media will drive more accurate systems biology modeling of E. coli. The larger the set of growth media against which these computational models are validated, the more accurate and comprehensive the models will be.
EcoCyc captures approximately 20 media that are commonly used by E. coli laboratories; growth data are provided for some of these media. EcoCyc also records the results of high-throughput experiments using Biolog Phenotype Microarrays (PMs) that measure cell respiration as a sensitive indicator of microbial growth (36). The commercially available PM system for microorganisms provides a comprehensive set of phenotype tests including information on the ability to metabolize 190 carbon (C) compounds, 95 nitrogen (N) compounds, 59 phosphorus (P) compounds, and 35 sulfur (S) compounds. EcoCyc currently documents five sets of PM data from the following sources:
• B. Bochner and X. Lei, personal communication, 2012.
Strain: E. coli K-12 BW30270 (rph+ [RNase PH] derivative of MG1655; the strains also show a PyrE deficiency. Found to be fnr+ as well, according to K. A. Datsenko and B. L. Wanner, unpublished results.)
This data set includes aerobic growth observations for the full complement of C, N, P, and S compounds that are included in the PM system plus growth observations for 95 C sources under anaerobic conditions.
• “Genome Scale Reconstruction of a Salmonella Metabolic Model,” AbuOun et al., 2009 (37).
Strain: E. coli K-12 MG1655 (American Type Culture Collection 700926)
This data set includes growth observations for the full complement of C, N, P, and S compounds under aerobic conditions. Bacteria were pregrown on LB agar before the inoculation of Biolog plates and incubation at 37°C for 26 hours. The Omnilog instrument (a specialized incubator plus reader) was used for data collection and analysis.
• “The Evolution of Metabolic Networks of E. coli,” Baumler et al., 2011 (38).
Strain: E. coli K-12 MG1655
This data set consists of growth observations for 95 C compounds under aerobic and anaerobic conditions. Bacteria were pregrown on Biolog Universal Growth Agar plus sheep blood (BUG-S) before the inoculation of Biolog plates and incubation at 37°C. Growth was monitored by measuring optical density at 600 nm with readings taken at 3, 6, 12, 24, and 48 h (D. Baumler, personal communication).
• Mackie et al., 2013 (39).
Strain: E. coli K-12 MG1655 (Coli Genetic Stock Center 7740).
This data set consists of growth observations for the full complement of C, N, P, and S compounds under aerobic conditions. Bacteria were pregrown on either LB or R2A agar before inoculation of Biolog plates and incubation at 37°C for 48 h. The Omnilog instrument was used for data collection and analysis.
• “Comparative Multi-Omics Systems Analysis of Escherichia coli strains B and K-12,” Yoon et al., 2012 (40).
Strain: E. coli K-12 MG1655
This data set consists of growth observations for the full complement of C, N, P, and S compounds under aerobic conditions. Bacteria were pregrown on BUG-S agar before the inoculation of Biolog plates and incubation at 37°C for 48 hours. The Omnilog instrument was used for data collection and analysis.
Data on growth conditions can be accessed from the EcoCyc website by invoking the menu command Search → Growth Media and then clicking on the button “All Growth Media for this Organism.” Individual media are shown in the initial table; PM data are shown in the following tables. The coloring of each box indicates the degree of growth observed under that condition. Three levels of growth are recorded: no growth, low growth, and growth (see legend that indicates the colors associated with each level of growth). Click on any growth medium to request a page describing its composition and to see genes that are essential or not essential for growth under that condition.
ESSENTIAL GENE INFORMATION
As of 2011, EcoCyc incorporates several large-scale data sets on gene essentiality in E. coli. Gene essentiality information is useful for:
• Predicting antibiotic targets for pathogenic bacteria.
• Guiding the design of minimal genomes.
• Validating genome-scale metabolic flux models. Model predictions can be compared with the experimental data recorded in EcoCyc to assess model accuracy.
• Providing clues regarding the functions of genes of unknown function, when essentiality varies depending on conditions of growth.
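As an illustration of the model-validation use listed above, the following Python sketch tallies agreement between predicted and experimentally observed essentiality calls. It is a minimal example under stated assumptions: the gene names and the calls are made up, the function `compare_essentiality` is hypothetical, and it does not reproduce any particular EcoCyc or MetaFlux procedure.

```python
def compare_essentiality(predicted, observed):
    """Tally agreement between predicted and observed essentiality.

    Both arguments map a gene name to True (essential) or False (not essential).
    Only genes present in both mappings are compared.
    """
    counts = {
        "true_essential": 0,       # predicted essential, observed essential
        "true_nonessential": 0,    # predicted dispensable, observed dispensable
        "false_essential": 0,      # predicted essential, observed dispensable
        "false_nonessential": 0,   # predicted dispensable, observed essential
    }
    for gene in predicted.keys() & observed.keys():
        p, o = predicted[gene], observed[gene]
        if p and o:
            counts["true_essential"] += 1
        elif not p and not o:
            counts["true_nonessential"] += 1
        elif p:
            counts["false_essential"] += 1
        else:
            counts["false_nonessential"] += 1
    total = sum(counts.values())
    correct = counts["true_essential"] + counts["true_nonessential"]
    accuracy = correct / total if total else None
    return counts, accuracy


# Made-up calls for hypothetical genes; not real EcoCyc data.
predicted = {"geneA": True, "geneB": False, "geneC": True, "geneD": False}
observed = {"geneA": True, "geneB": False, "geneC": False,
            "geneD": True, "geneE": True}

counts, accuracy = compare_essentiality(predicted, observed)
print(counts, accuracy)
```

Keeping the four outcome counts separate, rather than reporting a single accuracy figure, shows whether a model errs toward calling too many or too few genes essential.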
EcoCyc incorporates data on essentiality from the following publications:
• “Experimental Determination and System Level Analysis of Essential Genes in Escherichia coli, MG1655,” Gerdes et al. (41).
Strain: E. coli K-12 MG1655 (F− λ− ilvG rfb-50 rph-1)
This study used a genetic footprinting technique with a Tn5-based transposome system and reported unambiguous assessment of approximately 87% of E. coli open reading frames (ORFs) for essentiality. Six hundred twenty-six genes were identified as essential for aerobic growth in rich media, while 3,126 genes were dispensable. Note that the inability to obtain an insertion mutant by using this system may in some cases be a reflection of the nontargeted nature of transposon insertion rather than a reflection of gene essentiality. For this and other technical reasons, 327 genes were classified in this study as ambiguous with regard to essentiality.
• “Construction of Escherichia coli K-12 In-Frame, Single-Gene Knockout Mutants: The Keio Collection,” Baba et al. (42) and corrections (43)
Strain: E. coli K-12 BW25113 [rpoS(Am) rph-1 λ− rrnB3 ΔlacZ4787 hsdR514 Δ(araBAD)567 Δ(rhaBAD)568 rph-1]
This study created 3,985 in-frame, single-gene deletion mutants by using the lambda RED recombinase system. Three hundred three genes were unable to be disrupted and were predicted to be essential for growth in rich media at 37°C. Note that, in some cases, there were secondary impacts from single-gene deletions, such as compensating suppressor mutations. There were also errors in some of the mutants described in this paper, which were later corrected (43). This study also profiled the growth of the mutants in minimal glucose MOPS (morpholinepropanesulfonic acid) media to identify genes that are conditionally essential under these conditions.
• “Experimental and Computational Assessment of Conditionally Essential Genes in Escherichia coli,” Joyce et al. (44)
Strain: E. coli K-12 BW25113 [rpoS(Am) rph-1 λ− rrnB3 ΔlacZ4787 hsdR514 Δ(araBAD)567 Δ(rhaBAD)568 rph-1] (the same as in reference 42)
This study used the Keio collection of single-gene knockout mutants and profiled them for growth on glycerol-supplemented minimal medium. One hundred nineteen genes were identified as essential for growth on glycerol. They also combined these observations with those made by Baba et al. (42) regarding the conditional essentiality of the mutants when grown on glucose-supplemented minimal media and were thus able to identify a conserved conditionally essential core of 94 genes that are required for E. coli K-12 to grow under minimal nutritional supplementation but are not essential for growth under rich conditions.
• “A Genome-Scale Metabolic Reconstruction for Escherichia coli K-12 MG1655 that Accounts for 1260 ORFs and Thermodynamic Information,” Feist et al. (45)
This publication used the experimental data regarding conditional gene essentiality from Joyce et al. (44) and from Baba et al. (42) and compared these data with the computationally predicted essential genes in their genome-scale metabolic reconstruction of E. coli. This data set is included in EcoCyc to facilitate the benchmarking of computational predictions of essentiality from the EcoCyc model with computations from the model of Feist et al. (45).
• “Multicopy Suppression Underpins Metabolic Evolvability,” Patrick et al. 2007 (46)
Strain: E. coli BW25113 [rpoS(Am) rph-1 λ− rrnB3 ΔlacZ4787 hsdR514 Δ(araBAD)567 Δ(rhaBAD)568 rph-1]
This study used the conditionally essential gene sets identified by Baba et al. (42) and Joyce et al. (44) and tested them for their ability to form colonies on glucose M9 agar. They identified 107 genes that were conditionally essential under these conditions.
When essentiality data are available for a given gene, the EcoCyc gene page includes a table of the conditions under which that gene has been found to be either essential or not essential for growth. Clicking on the condition will navigate to a growth-medium page that lists all essentiality information under that growth condition.
EcoCyc METABOLIC FLUX MODEL
A quantitative steady-state metabolic flux model has been derived from EcoCyc by using flux balance analysis (FBA) (47, 48). By running this model with different parameters, scientists can model the growth of E. coli under different nutrient conditions and for different gene knockouts. Every time the model is executed, it is freshly generated from EcoCyc, meaning that, as the reactions in EcoCyc are updated because of curation, the model automatically reflects those changes.
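For readers unfamiliar with FBA, the optimization problem behind a steady-state flux model can be written compactly. What follows is the generic textbook formulation, given only as a sketch: the symbols S, v, c, l, and u are introduced here for illustration, and the exact objective and constraints used by MetaFlux are not specified in this text.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& l \le v \le u .
\end{aligned}
```

Here S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c selects the objective (typically the biomass reaction), and l and u are lower and upper flux bounds. In this formulation, a nutrient condition corresponds to the bounds placed on uptake fluxes, and a gene knockout is commonly represented by fixing to zero the bounds of reactions that no longer have a catalyst.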
The EcoCyc FBA model is distinct from the E. coli FBA models derived by the Palsson group (45, 49, 50), but these models have much in common because EcoCyc and the iAF1260 model were partially unified in 2007 (45), and both groups consult the other’s work when updating their models.
The Supplementary Information provided separately details the E. coli biomass metabolite set used to model biomass production metabolite requirements in EcoCyc FBA. This metabolite set is derived from the iJO1366 model WT biomass reaction of Orth et al. (50). The Supplementary Information also contains a description of the nutrient and secretion metabolite sets that supply inputs and outputs to the FBA model, as well as a description of differences between the EcoCyc FBA biomass metabolite set and the iJO1366 WT biomass reaction.
To run the EcoCyc FBA model, download and install a Pathway Tools software configuration that includes EcoCyc, and invoke the MetaFlux modeling component of Pathway Tools (see Chapter 8 of the Pathway Tools User’s Guide).
EcoCyc provides several example files describing invocations of the FBA model under different nutrient conditions. Those files are found within the installed Pathway Tools directory tree at pathway-tools/aic-export/pgdbs/biocyc/ecocyc/VERSION/data/fba/. Output files produced as a result of successful FBA runs on the supplied .fba files are also included. The supplied input files (where CDW is cell dry weight) are:
1. GlucoseAer.fba : 10 mmol/g CDW/h glucose uptake, minimal media, aerobic conditions
2. GlucoseAnaer.fba : 10 mmol/g CDW/h glucose uptake, minimal media, anaerobic conditions
3. GlycerolAer.fba : 10 mmol/g CDW/h glycerol uptake, minimal media, aerobic conditions
4. GlycerolAnaer.fba : 10 mmol/g CDW/h glycerol uptake, minimal media, anaerobic conditions
External Flux Predictions
MetaFlux metabolic flux predictions from EcoCyc version 17.5 for aerobic growth on glucose and glycerol are given in Tables 5 and 6. Model predictions for anaerobic growth on glucose and glycerol are given in Tables 7 and 8. In all cases, the uptake rate of the carbon source is set to an upper bound reflecting experimental uptake rates in mmol/g CDW/h. O2 uptake rates are set to an upper bound of 0.00 mmol/g CDW/h under anaerobic conditions. All other nutrient sources are left free.
Improvement of the Metabolic Model
With each EcoCyc release, we plan to include an improved version of the EcoCyc metabolic flux model that reflects recent improvements to our knowledge of the E. coli metabolic network.
Model predictions can differ from experimental measurements owing to a number of reasons including the operation of additional, unmodeled reactions and metabolites; existing reactions operating in a different fashion from the model (e.g., the model contains a “perfect” respiratory electron-transfer chain without the possibility of reactive oxygen-species generation); the presence of regulation or of product inhibition that deactivates reactions or limits their throughput; and differences in optimization objective functions depending on the specified feed source.
UPDATE FREQUENCY
The EcoCyc.org and BioCyc.org websites and downloadable files are updated three to four times per year. A faster, more powerful version of EcoCyc that you can install locally on your computer (Macintosh, PC/Windows, PC/Linux) is released semiannually.
DATA SOURCES INCORPORATED INTO EcoCyc
EcoCyc includes data imported from the following bioinformatics databases. In most cases, the data are reimported once or twice per year. We note that many literature references within EcoCyc were obtained from PubMed.
UniProt Features
UniProt protein features (the UniProt KB term is sequence annotations) from the complete proteome of E. coli K-12 MG1655 in SwissProt are imported into EcoCyc for every EcoCyc release. We import all protein features with experimental or nonexperimental evidence qualifiers except for the following types: turn, helix, beta strand, and coiled-coil. The chain type is only imported if it does not span the entire length of the protein. Examples of imported feature types include catalytic domains, phosphorylation sites, and metal ion binding sites. We import citations associated with UniProt protein features if they include an associated PubMed ID.
The import of protein features into EcoCyc is done via the UniProt Feature Importer tool within the Pathway Tools software.
Gene Ontology
For several years, EcoCyc and EcoliWiki/PortEco have been collaborating on improving and maintaining the GO annotations for E. coli. GO and its applications are described in more detail in reference 56. Since the summer of 2008, we have been periodically generating a file containing all E. coli K-12 GO term annotations, called gene_association.ecocyc, that may be obtained from the Gene Ontology Consortium.
GO annotation is a standard part of EcoCyc’s manual literature-based curation process. The GO annotations are added to the database objects that represent the functional gene products or multimers, not directly to the gene objects. This approach models the biology more accurately because it indicates exactly which form of the gene product has the specified GO function. In parallel, manual annotation of E. coli genes with GO is ongoing at EcoliWiki. On a regular basis, the GO annotations are merged. The latest UniProt and EcoliWiki annotations are imported into EcoCyc. Because the GO Consortium does not accept electronic annotations as part of the gene association file if the annotations are more than 1 year old, these UniProt annotations are reimported into EcoCyc on a regular basis.
EcoCyc incorporates many electronic and experimental GO term annotations of E. coli K-12 gene products obtained from the “UniProt [multispecies] GO Annotations @ EBI” file downloaded from the Gene Ontology Consortium. When this import was first performed in 2007, approximately 30,000 new IEA (“Inferred from Electronic Annotation”) GO term assignments were added to EcoCyc, along with approximately 1,000 assignments with experimental evidence codes including assignments from high-throughput protein-interaction studies. During the import of GO terms from UniProt into EcoCyc, a filtering operation is applied to prune GO term annotations based solely on computational (IEA) evidence if the EcoCyc gene product already has more specific GO annotations (in other words, GO terms that are children of the GO term being imported) that have experimental evidence available. For example, if a gene product already contained an experimental annotation of the term “galactose kinase,” the software would not add the computational annotation “carbohydrate kinase.” This filtering leads to the removal of approximately 1,000 of these less specific and redundant annotations.
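The pruning rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual import code: the functions `ancestors` and `filter_iea` are hypothetical, term names stand in for GO identifiers, the tiny hierarchy is made up around the "galactose kinase" example, and a descendant at any depth is treated as "more specific" (the text above speaks of children). Whether an IEA annotation identical to an existing experimental term is also dropped is not stated, so the sketch keeps it.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_iea(existing_experimental, incoming, parents):
    """Drop incoming IEA annotations that are less specific than an existing
    experimentally supported term on the same gene product.

    existing_experimental: set of terms already annotated with experimental evidence
    incoming: list of (term, evidence_code) pairs proposed for import
    parents: mapping from a term to its direct parent terms
    """
    covered = set()
    for term in existing_experimental:
        covered |= ancestors(term, parents)
    return [(term, code) for term, code in incoming
            if not (code == "IEA" and term in covered)]


# Illustrative hierarchy only; real GO terms use identifiers and a larger graph.
parents = {
    "galactose kinase": ["carbohydrate kinase"],
    "carbohydrate kinase": ["kinase"],
}
existing = {"galactose kinase"}
incoming = [
    ("carbohydrate kinase", "IEA"),   # redundant: ancestor of an experimental term
    ("ATP binding", "IEA"),           # unrelated term: kept
    ("kinase", "IDA"),                # experimental evidence: never pruned here
]

print(filter_iea(existing, incoming, parents))
```

Only annotations with the IEA evidence code are candidates for removal, so an incoming experimental annotation survives even when it is less specific than an existing one.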
A gene association file is generated from the quarterly EcoCyc releases. This file is sent to the EcoliWiki team at Texas A&M for further processing. At EcoliWiki, annotations made in the wiki-based community annotation system since the last EcoCyc update are added to the file, along with annotations containing qualifiers (mainly contributes_to) not yet supported by EcoCyc. Only those annotations that are complete by GO Consortium standards are extracted from EcoliWiki; incomplete annotations are left with the hope that community members will eventually complete them. EcoliWiki runs the GO Consortium validation scripts and deposits the file with the GO Consortium via their Concurrent Versioning System.
GenBank
The GenBank record U00096, produced by the Blattner laboratory in October 1997, was the source of the original E. coli MG1655 genome sequence and annotation incorporated by EcoCyc. A corrected nucleotide sequence was deposited in GenBank as U00096.2 in 2004, and the revised sequence was incorporated into EcoCyc as of version 8.6 (November 2004). The revised genome annotation published in reference 57 was incorporated into EcoCyc in version 10.0 (March 2006).
RefSeq Collaboration
EcoCyc is involved in a collaboration to update the genome annotation of the GenBank (U00096.2) and RefSeq (NC_000913.2) entries for E. coli K-12 MG1655 on an ongoing basis. The primary collaborators include EcoCyc, EcoGene, UniProtKB/Swiss-Prot, and NCBI. The collaborators routinely share their data and resolve data conflicts. The updates of gene names, gene positions, and gene product names are shared among all partners.
MetaCyc
The EcoCyc and MetaCyc databases exchange data as part of the release processes for both databases. The updates that have occurred to enzymes, genes, pathways, reactions, and metabolites are exchanged between the databases based on automated comparisons of update dates to ensure that the latest information and corrections are propagated between the databases.
EcoCyc ACCESSION NUMBERS
Gene Accession Numbers
Three systems of accession numbers are typically available for genes within EcoCyc. Any of these accession numbers may be used when querying EcoCyc genes “by name,” and in the website Quick Search.
• EcoCyc ID: The EcoCyc project assigns unique identifiers to each gene that for historical reasons are of variable syntax, and are of the form “Gnnnn,” “EGnnnnn,” or “G0-nnnnn.” EcoCyc IDs are stored as the frame id of the EcoCyc gene object.
• B-numbers: Originally assigned by the Blattner laboratory as part of the E. coli genome project, the b-number identifiers are of the form “bnnnn.” B-numbers were originally assigned sequentially along the genome. When a gene object is removed from the genome because of a decision that insufficient evidence for the existence of that gene is available, then that b-number is retired and is not reused. When new genes are added to the genome, they are assigned the next highest available b-number. Thus, b-numbers are no longer purely sequential along the genome. B-numbers are stored in the EcoCyc slot Accession-1.
• ECK numbers: ECK numbers were assigned to the E. coli K-12 MG1655 and W3110 genomes in 2005 in an attempt to provide shared accession numbers for genes common to the two genomes (57). ECK numbers are stored in the EcoCyc slot Accession-2. For only the first 18 or so genes in the E. coli K-12 MG1655 genome are the b-number and ECK number the same number; for subsequent genes the numbers have diverged.
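The three accession systems above can be told apart by their syntax. The Python sketch below shows one way to do so with regular expressions. It rests on assumptions that go beyond the text: the function `classify_accession` is hypothetical, the number of digits is not enforced (each "n" is read as "one or more digits"), and the ECK form "ECK" followed by digits is assumed because the text does not spell it out.

```python
import re

# Patterns are checked in order; digit counts are deliberately not enforced.
PATTERNS = [
    ("EcoCyc ID", re.compile(r"G0-\d+|EG\d+|G\d+")),
    ("b-number", re.compile(r"b\d+")),
    ("ECK number", re.compile(r"ECK\d+")),   # assumed form, not stated in the text
]


def classify_accession(text):
    """Return the accession system whose syntax matches `text`, or None."""
    candidate = text.strip()
    for label, pattern in PATTERNS:
        # fullmatch ensures the whole string, not just a prefix, fits the pattern.
        if pattern.fullmatch(candidate):
            return label
    return None


for example in ["b0001", "EG10001", "G0-10441", "ECK0001", "thrA"]:
    print(example, "->", classify_accession(example))
```

A gene name such as "thrA" matches none of the patterns and returns None, which mirrors the distinction between querying by accession number and querying by name.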
OTHER E. COLI AND SHIGELLA PATHWAY/GENOME DATABASES IN BioCyc
EcoCyc is part of the larger BioCyc collection of Pathway/Genome Databases (PGDBs). BioCyc version 17.5 (2013) includes 160 E. coli and Shigella PGDBs. Most of these PGDBs were generated computationally and lack the extensive manual literature-based curation of the EcoCyc K-12 database. The E. coli genomes in BioCyc are focused on complete genomes and do not include draft genomes.
Two of these PGDBs have undergone additional curation: the BioCyc PGDBs for strains E. coli W3110 and for E. coli B str. REL606. Both strains underwent a computational annotation-normalization procedure in which gene names, product names, heteromultimeric protein complexes, and Gene Ontology terms were propagated from EcoCyc to their orthologous genes in these other two strains (the orthologs were computed by SRI as bidirectional best-BLAST hits with additional manual review and curation). This procedure was performed under the assumption that genome-annotation pipelines typically introduce syntactically large but semantically insignificant variation in the naming of genes and gene products. In addition, E. coli B str. REL606 underwent literature-based curation by SRI to incorporate experimental information regarding the genes and pathways present in this strain but not in the EcoCyc strain MG1655. This curation is supported by the PortEco project.
To select a given genome for querying in the BioCyc website, click on the words “change organism database” under the Quick Search and Gene Search buttons in the upper right corner of most EcoCyc web pages.
WE ENCOURAGE YOUR FEEDBACK
Feedback from the scientific community has proved invaluable to improving EcoCyc during its many years of development. We strongly encourage your comments and suggestions for improvements in all areas, including:
•
The database content of EcoCyc
•
The presentation of information within the EcoCyc website
•
The analysis tools provided in conjunction with EcoCyc
•
The performance of the EcoCyc website
If you see an error or omission within EcoCyc, please report it by using the “Report Errors or Provide Feedback” link at the bottom of every data page. Please email suggestions or questions to biocyc support at [email protected].
During every EcoCyc release, we email a summary of new developments to our biocyc users mailing list. To subscribe to this mailing list, please see http://biocyc.org/subscribe.shtml.
HOW TO LEARN MORE
• Publications on EcoCyc (58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70)
HOW TO CITE EcoCyc
Please cite EcoCyc in publications that benefit from the use of the EcoCyc database or website. Please cite EcoCyc as the most recent Nucleic Acids Research Database issue article, currently: Keseler et al. 2013, Nucleic Acids Res 41:D605–D612.
ACKNOWLEDGMENTS
Monica Riley led the curation of EcoCyc for many years, from its inception. Her efforts created the content for the first organism-scale metabolic database. John Ingraham was a valued advisor to EcoCyc for many years. We thank the scientists who have contributed corrections and suggestions to EcoCyc over the years, and we thank the scientists who have served on the EcoCyc Steering Committee. Many contributors to EcoCyc are listed on the EcoCyc credits page.
The development of EcoCyc is funded by NIH grants GM77678 and GM71962 from the NIH National Institute of General Medical Sciences.
Conflicts of interest: We disclose no conflicts.