GSR-DB: a manually curated and optimized taxonomical database for 16S rRNA amplicon analysis
ABSTRACT
IMPORTANCE
INTRODUCTION
MATERIALS AND METHODS
Creation of GSR database
Creation of the GSR full-16S database
Manual curation of the GSR-DB
Merging algorithm
Creation of 16S variable region databases
Variable region extraction
Clustering
Phylogeny construction
In silico mock community data sets
Validation
Validation data sets
Classifier training and taxonomy assignment
Parameter comparison
Database benchmarking
Assigned taxonomy (A) | Expected taxon (E): Lactobacillus iners | Expected taxon (E): Lactobacillus jensenii | Expected taxon (E): Lactobacillus crispatus |
---|---|---|---|
Lactobacillus iners | TP | FP | FP |
Lactobacillus jensenii | FN | TN | TN |
Lactobacillus crispatus | FN | TN | TN |
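The confusion matrix above scores one target taxon (here Lactobacillus iners) against all others. A minimal sketch of that one-vs-rest counting follows. It assumes each sequence yields an (expected, assigned) pair, and that a mix-up between two non-target taxa counts as TN for the target, as the table layout suggests. The function name `one_vs_rest_counts` and the example pairs are hypothetical illustrations, not the authors' benchmarking code or data.

```python
def one_vs_rest_counts(pairs, target):
    """Count TP/FP/FN/TN for `target` from (expected, assigned) taxon pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for expected, assigned in pairs:
        if assigned == target and expected == target:
            counts["TP"] += 1  # target correctly assigned
        elif assigned == target:
            counts["FP"] += 1  # target assigned, another taxon expected
        elif expected == target:
            counts["FN"] += 1  # target expected, another taxon assigned
        else:
            counts["TN"] += 1  # target neither expected nor assigned
    return counts


# Made-up (expected, assigned) pairs, for illustration only.
pairs = [
    ("Lactobacillus iners", "Lactobacillus iners"),
    ("Lactobacillus jensenii", "Lactobacillus iners"),
    ("Lactobacillus iners", "Lactobacillus crispatus"),
    ("Lactobacillus crispatus", "Lactobacillus crispatus"),
    ("Lactobacillus jensenii", "Lactobacillus crispatus"),
]

result = one_vs_rest_counts(pairs, "Lactobacillus iners")
print(result)
```

Repeating the call with each taxon as `target` gives per-taxon counts, from which summary metrics can be derived.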
Tenfold cross-validation
Gut and vaginal microbial data sets
Computational benchmarking
RESULTS
GSR database
Database | Region | Cluster | Source database: RDP | Source database: SILVA | Source database: Greengenes | Source database: NCBI | Total |
---|---|---|---|---|---|---|---|
GSR | V1–V3 | 100% | 6,707 (24.96%) | 14,595 (54.31%) | 5,521 (20.54%) | 51 (0.19%) | 26,874 |
GSR | V3–V4 | 100% | 15,401 (30.79%) | 22,621 (45.23%) | 11,949 (23.89%) | 45 (0.09%) | 50,016 |
GSR | V3–V5 | 100% | 16,659 (29.43%) | 26,239 (46.35%) | 13,655 (24.12%) | 53 (0.09%) | 56,606 |
GSR | V4 | 100% | 12,670 (32.65%) | 16,186 (41.72%) | 9,916 (25.56%) | 29 (0.07%) | 38,801 |
GSR | Full-16S | None | 20,151 (22.29%) | 52,570 (58.15%) | 17,548 (19.41%) | 139 (0.15%) | 90,408 |
QIIME2 parameters impact taxonomic assignment performance
GSR outperforms most existing databases across all tested regions
Case study: vaginal and gut data sets
GSR annotation enhances taxonomic nomenclature consistency
Computational benchmarking
Database | Elapsed time | Classifier size (MB) | Memory usage peak (GB) | Memory usage mean (GB) |
---|---|---|---|---|
RDP | 0:01:23 | 20.85 | 4.46 | 2.6 |
GSR | 0:02:28 | 25.61 | 6.3 | 4.06 |
Greengenes | 0:02:51 | 28.14 | 3.52 | 2.43 |
ITGDB | 0:06:38 | 44.92 | 14.88 | 8.79 |
GTDB | 0:06:42 | 49.0 | 14.6 | 9.24 |
Greengenes2 | 0:11:11 | 47.77 | 15.2 | 9.53 |
SILVA | 0:40:07 | 106.55 | 23.54 | 16.75 |
Metasquare | 1 day 21:27:39 | 589.99 | 175.92 | 125.98 |
Data set | Database | Elapsed time | Memory usage peak (GB) | Memory usage mean (GB) |
---|---|---|---|---|
Gut | Greengenes | 0:00:17 | 7.09 | 3.11 |
Gut | RDP | 0:00:46 | 18.21 | 6.97 |
Gut | GSR | 0:01:08 | 18.55 | 5.11 |
Gut | ITGDB | 0:01:28 | 33.62 | 11.74 |
Gut | Greengenes2 | 0:01:31 | 49.46 | 16.72 |
Gut | GTDB | 0:01:44 | 37.47 | 13.46 |
Gut | SILVA | 0:02:17 | 48.93 | 16.36 |
Vagina | Greengenes | 0:00:09 | 5.64 | 2.55 |
Vagina | GSR | 0:00:29 | 14.85 | 4.33 |
Vagina | RDP | 0:00:29 | 12.56 | 4.28 |
Vagina | Greengenes2 | 0:00:53 | 42.86 | 9.2 |
Vagina | ITGDB | 0:01:00 | 26.59 | 6.83 |
Vagina | GTDB | 0:01:07 | 30.12 | 7.83 |
Vagina | SILVA | 0:01:30 | 37.86 | 9.28 |
DISCUSSION
ACKNOWLEDGMENTS
SUPPLEMENTAL MATERIAL
REFERENCES