Expected observation data generation.
Expected observation data, representing the known composition of an MC, are provided in mockrobiota in two forms: source data and expected composition (taxonomy or gene annotation) data. Source data provide a record of the original inputs to the MC as a list of microbial strains and their relative abundances. Ideally, a strain ID should be provided to identify a retrievable source strain, allowing accurate tracking and revision of taxonomic information. These data are generally created by the developer of the MC, and taxonomic groups are not necessarily annotated with respect to any specific taxonomic reference database. An example of source data is given in Table 2, and a template example is provided in the mockrobiota data directory. These files consist of two or more columns. The first column (Taxonomy) lists the taxonomy of each MC member in as much detail as can be provided by the MC developer. In Table 2, this column lists the genus, species, and strain ID of each strain added to the MC, with each strain on a separate line. The remaining columns each represent an individual MC sample contained within the data set. The column heads contain the names of the samples and must correspond to the sample names listed in the sample-metadata.tsv file for that data set. The values in each column are the relative abundances at which each taxon is present in that sample.
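The source file layout described above can be checked with a short script. The following is a minimal sketch, not part of mockrobiota itself: it assumes that the files are tab separated, that the first column of sample-metadata.tsv holds the sample names, and that the metadata header shown here is representative. The function names, strain labels, and abundance values are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_source_table(text):
    """Parse a tab-separated source table.

    Returns (sample_names, {taxonomy: [abundance per sample]}).
    Assumes the first column header is 'Taxonomy'.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    if header[0] != "Taxonomy":
        raise ValueError("first column must be named 'Taxonomy'")
    samples = header[1:]
    abundances = {row[0]: [float(v) for v in row[1:]] for row in body}
    return samples, abundances


def samples_missing_from_metadata(samples, metadata_text):
    """Return sample names absent from the first column of the metadata."""
    rows = [r for r in csv.reader(io.StringIO(metadata_text), delimiter="\t") if r]
    known = {row[0] for row in rows[1:]}  # skip the header row
    return [s for s in samples if s not in known]


# Hypothetical file contents (strain labels and values are made up).
SOURCE = (
    "Taxonomy\tsample-1\tsample-2\n"
    "Streptococcus mutans strain-A\t0.25\t0.5\n"
    "Streptococcus pneumoniae strain-B\t0.75\t0.5\n"
)
METADATA = (
    "sample-id\tdescription\n"
    "sample-1\teven mixture\n"
    "sample-2\tstaggered mixture\n"
)

samples, abundances = read_source_table(SOURCE)
missing = samples_missing_from_metadata(samples, METADATA)
print(samples, missing)
```

An empty list of missing names indicates that every sample column in the source file has a matching entry in the metadata.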
Expected composition data represent the known composition of the MC (e.g., taxonomies or KEGG pathways) annotated according to a specific reference database. Like source data files, expected composition data files are created and carefully reviewed by contributors to mockrobiota; the automatic integrity checks employed by mockrobiota cannot ensure that expected observation annotations are accurate. It is in the interest of contributors to ensure the accuracy of their data sets, as poor curation will degrade the quality of results obtained when using a given MC, decreasing the likelihood that the MC will be used and cited by other researchers. Compilation of expected composition data is not a trivial task and requires careful review of database annotations to ensure that accurate annotations are applied to source data. An example of expected composition data is shown in Table 3, corresponding to the source data example shown in Table 2. In these files, column layout and header naming follow the same conventions as described above for source files. The first column (Taxonomy) lists taxonomic descriptions or other annotations associated with each species added to the MC. These taxonomic descriptions (or other annotations) are drawn from an appropriate reference database, e.g., the Greengenes (17) or SILVA (18) rRNA gene sequence database. The taxonomic description should be copied directly from the reference database. If an MC is used to compare expected and observed taxonomy assignments, the same reference database must be used for taxonomy assignment of the MC sequences during analysis to allow for direct comparison between expected and observed results. The expected composition data are deposited in mockrobiota in a directory structure that indicates the reference database name and version used for annotation. For example, expected composition data that list taxonomy strings from the Greengenes 13_8 release (17) are deposited in mockrobiota/data/mock-X/greengenes/13_8/expected-taxonomy.tsv, where X is the number assigned to that MC.
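The directory convention in the example above can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and it assumes that other databases and versions follow the same database/version layout as the Greengenes 13_8 example.

```python
from pathlib import PurePosixPath


def expected_taxonomy_path(mock_id, database, version):
    """Build the repository path of an expected taxonomy file.

    Assumes the layout mockrobiota/data/<mock_id>/<database>/<version>/.
    """
    return PurePosixPath(
        "mockrobiota", "data", mock_id, database, version,
        "expected-taxonomy.tsv",
    )


path = expected_taxonomy_path("mock-3", "greengenes", "13_8")
print(path)
```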
Several issues that require careful attention may arise during database annotation; hence, careful manual curation of expected composition files is important. Specific taxa may not be represented in a reference taxonomy at the species level and must be annotated to the nearest common lineage. For example, Streptococcus mutans and Streptococcus pneumoniae in the source data (Table 2) are annotated as g__Streptococcus;s__ in the expected composition example above (Table 3).
Multiple input strains, listed as separate entities in the source files, may need to be combined under common annotations in the expected composition files if they are not listed in the reference database. The relative abundance of an expected taxonomy will be equal to the sum of the relative abundances of all members matching that taxonomy. For example, multiple strains may be combined as a single species, or species not listed in the reference database may be combined under a single genus; note the relative abundance of g__Streptococcus;s__ listed in the example above.
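The summation rule described above can be sketched as follows. The mapping from source strains to reference annotations must still be curated by hand; here it is supplied as a plain dictionary. The function name, the abbreviated lineage strings, and the abundance values are illustrative assumptions and are not taken from Table 2 or Table 3.

```python
from collections import defaultdict


def collapse_to_expected(source_abundance, annotation):
    """Sum source abundances that share the same expected annotation.

    source_abundance: {source taxon: relative abundance in one sample}
    annotation:       {source taxon: annotation from the reference database}
    """
    expected = defaultdict(float)
    for taxon, abundance in source_abundance.items():
        expected[annotation[taxon]] += abundance
    return dict(expected)


# Hypothetical single-sample example with abbreviated lineages.
source = {
    "Streptococcus mutans": 0.25,
    "Streptococcus pneumoniae": 0.25,
    "Escherichia coli": 0.5,
}
annotation = {
    "Streptococcus mutans": "g__Streptococcus;s__",
    "Streptococcus pneumoniae": "g__Streptococcus;s__",
    "Escherichia coli": "g__Escherichia;s__coli",
}

expected = collapse_to_expected(source, annotation)
print(expected)
```

Both Streptococcus species fall under the same genus-level annotation, so their abundances are added together in the expected composition.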
Reference databases may contain quirks that complicate the annotation of expected composition files, such as listing strain IDs or different taxonomic lineages for multiple entries of the same species. MC developers should carefully inspect reference database annotations and all expected composition files. The accuracy of taxonomic descriptions cannot be checked (i) by mockrobiota’s automatic integrity checks, because not all databases that could be used for annotation will be available to the testing system, or (ii) by mockrobiota’s developers during pull request reviews. Ultimately, the integrity of each data set is the responsibility of the contributor.
Expected composition data will consist of one of two types. The first is a marker gene MC (expected taxonomic composition of a mixture of microbial cells). The taxonomic annotations present in the expected data will be specific to the database version that is used for analysis and will be meaningless if used for different database versions. Likewise, they may not match the source annotation, i.e., the taxonomy of each strain to the best knowledge of the MC’s creator, if taxonomic annotations have been revised or if the reference database being used does not contain a given taxonomy. The second is a metagenome MC (expected gene composition of a mixture of microbial cells/genomes). Gene annotations will be reference database specific, as for the marker gene MCs described above.
Other MC data types are theoretically possible and could be included in mockrobiota, which only defines required information, files, and file formats. Expected data definitions can expand as other MC data types are contributed to mockrobiota.
The MCs currently deposited in mockrobiota are all marker gene MCs representing known compositions of microbial species analyzed by marker gene sequencing methods. Taxonomy strings for 16S rRNA gene MCs (mock-1 through mock-8) were generated with the Greengenes 13_8 release (17) and the SILVA 119 release (18) sequence reference databases, both prefiltered to 97% sequence identity. Taxonomy strings for fungal internal transcribed spacer MCs (mock-9 and mock-10) were generated with the UNITE+INSD database (9/24/12 release) (19), which was prefiltered at 97% identity and from which sequences with incomplete taxonomy strings and empty taxonomy annotations (e.g., “uncultured fungus”) were removed as described previously (20). Taxonomy strings for the 18S rRNA MC (mock-11) were generated with the SILVA 119 release (18).
Raw data generation.
Raw data for MCs fall into different types, corresponding to the MC types and expected composition data defined above, i.e., marker gene MCs (raw data consisting of marker gene sequences) and metagenome MCs (raw data consisting of shotgun metagenome sequences).
All raw sequence data are currently linked in fastq format. mockrobiota does not host raw data files and only ensures that valid, accessible links are provided in the data set metadata. MC data sets that contain multiple samples are provided in nondemultiplexed files, i.e., one file per sequencing run, containing multiple uniquely barcoded samples. All raw data files are compressed in the standard gzip format, and index/barcode sequences are provided as a separate fastq file. Reverse sequencing reads are accepted but not required. All submissions must conform to the standard file names mock-forward-read.fastq.gz, mock-reverse-read.fastq.gz, and mock-index-read.fastq.gz.
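The naming and compression conventions above lend themselves to a simple local check before submission. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the check is not mockrobiota's own integrity test, and the record count assumes unwrapped four-line fastq records.

```python
import gzip

# Standard file names required for all submissions.
STANDARD_NAMES = {
    "mock-forward-read.fastq.gz",
    "mock-reverse-read.fastq.gz",
    "mock-index-read.fastq.gz",
}


def nonstandard_names(filenames):
    """Return, sorted, the file names that are not standard names."""
    return sorted(set(filenames) - STANDARD_NAMES)


def count_fastq_records(path):
    """Count records in a gzip-compressed fastq file.

    Assumes each record spans exactly four lines (no wrapped sequences).
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    return n_lines // 4
```

For a nondemultiplexed run, the record count of the index read file would be expected to equal that of the forward read file, because each read carries one barcode.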
The raw data for the marker gene MCs currently available in the repository were generated in 11 separate sequencing runs on the Illumina GAIIx (n = 1), HiSeq 2000 (n = 6), and MiSeq (n = 4), as described in Table 1 and in the dataset-metadata.tsv files associated with each data set in mockrobiota. These MCs consisted of genomic DNA from known species isolates deliberately combined at defined rRNA copy number ratios.