Editor's Pick
Virology
Letter
14 August 2023

Virus world database (VWdb), an API-enabled database of virus taxonomy

LETTER

The number of viruses being identified and characterized is growing by the day. However, gathering taxonomy information on these viruses is not easy: the related data at NCBI (https://www.ncbi.nlm.nih.gov/) (1) are very large (hundreds of thousands of lines) and loaded into a single webpage, while the data at the ICTV (https://ictv.global/) (2) are not up to date. Hence, there is a need for easy, user-friendly access to virus taxonomy information.
In this work, we created an easy-to-use web interface, named Virus World database (VWdb) and accessible at https://viperdb.org/vw, that can be used to select viruses according to various aspects of taxonomy and the related information available in other relevant databases: UniProtKB (3), PDB (4), VIPERdb (5, 6), and EMDB (7). In addition, we created an API (Application Programming Interface) that provides programmatic access to the metadata available in VWdb.
We downloaded the taxonomy data for all the kingdoms (e.g., bacteria, invertebrates, and viruses) from NCBI in the new format (https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/). We then parsed and stored these data in a temporary MySQL database (Taxo), and searched for, extracted, and uploaded the virus-specific information into a virus entry-specific database, VWdb. Next, we searched for and stored virus-related data from various databases (e.g., UniProtKB, PDB, EMDB, AlphaFoldDB) by accessing their APIs via multi-threaded Python scripts, which sped up the search significantly. We used the taxonomy-ID as the primary key in VWdb and to connect to the relevant databases. Additionally, we included the Baltimore classification of genomes for virus families, available at the ViralZone web resource (8). The details of the VWdb schema are given in the Supplementary Information (Fig. S1). We then created multiple PHP web services to query and access the data from a web interface, which selects and displays the taxonomy data associated with each virus genome/family/genus in table form. A flow chart depicting the steps involved in data acquisition and in setting up the MySQL database at VWdb is shown in Fig. S2. Furthermore, we have included a glossary of technical terms (e.g., JSON, MySQL) with descriptions of their intended function (Table 1).
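The parse, filter, and store steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the rankedlineage.dmp layout from NCBI's new-format taxonomy dump (fields separated by a tab, a pipe, and a tab, with the top-level group in the last column), it substitutes Python's built-in sqlite3 for MySQL so that it runs without a server, and the table name, column subset, and sample rows are hypothetical.

```python
import sqlite3

# Column order assumed from the rankedlineage.dmp file in NCBI's new-format
# taxonomy dump; newer releases may name or order trailing columns differently.
COLUMNS = ["tax_id", "tax_name", "species", "genus", "family",
           "order", "class", "phylum", "kingdom", "superkingdom"]


def parse_ranked_lineage(line):
    """Split one dump row into a dict keyed by COLUMNS."""
    body = line.rstrip("\n")
    if body.endswith("\t|"):          # each row ends with a tab and a pipe
        body = body[:-2]
    fields = [field.strip() for field in body.split("\t|\t")]
    fields += [""] * (len(COLUMNS) - len(fields))   # pad short rows
    row = dict(zip(COLUMNS, fields))
    row["tax_id"] = int(row["tax_id"])
    return row


def load_viruses(lines, conn):
    """Store virus rows only, using the taxonomy ID as the primary key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS virus ("
        "tax_id INTEGER PRIMARY KEY, name TEXT, genus TEXT, family TEXT)"
    )
    kept = 0
    for line in lines:
        row = parse_ranked_lineage(line)
        if row["superkingdom"] != "Viruses":
            continue                  # skip bacteria, invertebrates, etc.
        conn.execute(
            "INSERT OR REPLACE INTO virus VALUES (?, ?, ?, ?)",
            (row["tax_id"], row["tax_name"], row["genus"], row["family"]),
        )
        kept += 1
    conn.commit()
    return kept


def make_line(fields):
    """Build a row in the assumed dump format (for the made-up samples)."""
    return "\t|\t".join(fields) + "\t|\n"


# Made-up rows for illustration; these are not real NCBI entries.
SAMPLE = [
    make_line(["900001", "Example virus A", "", "Examplevirus",
               "Exampleviridae", "", "", "", "", "Viruses"]),
    make_line(["900002", "Example bacterium B", "", "Examplebacter",
               "Examplebacteraceae", "", "", "", "", "Bacteria"]),
]

conn = sqlite3.connect(":memory:")
kept = load_viruses(SAMPLE, conn)
rows = conn.execute("SELECT tax_id, name, family FROM virus").fetchall()
print(kept, rows)
```

Keying the table on the taxonomy ID mirrors the design choice stated in the text, since the same ID can then be used to join against entries retrieved from other databases.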
TABLE 1 Glossary of technical terms and the description of their utility
Term | Description
API (Application Programming Interface) | Used for programmatic access of data from an online database, bypassing the web interface.
Excel format | Microsoft spreadsheet format.
JSON format | A language-independent data (text) format for data exchange between different resources.
MySQL (structured query language) database | A relational database management system.
PHP (hypertext preprocessor) services | PHP is a scripting language geared toward web development. The PHP services involve query requests between the user interface and the behind-the-scenes MySQL database.
At the time of writing, VWdb contains 231,437 viruses from 233 virus families and 2,578 genera. As categorized by NCBI, these viruses fall into eight genome types (e.g., riboviria, monodnaviria, duplodnaviria), including an unclassified class, and infect 13 different hosts (Table 2). Significantly, we have also included the Baltimore classification of viral genomes (e.g., ssRNA, dsDNA), gathered from the ViralZone resource (8), which is not available at NCBI. One can choose to display all the viruses that infect a particular host (e.g., humans, bacteria, and plants), that contain a particular type of genome, or that belong to a particular family or genus. With the exception of the host category, a particular selection (e.g., genome-type) will automatically populate the “downstream” associated categories (e.g., family or genus) in the windows to the right of the chosen category. Currently, there are 163,512, 13,718, and 4,818 viruses that infect humans, bacteria, and plants, respectively. The data displayed in the table according to the selection include taxonomy-ID, virus name, species name, genus, family, genome type, type of host, and a column indicating whether any relevant structures are available (Fig. 1A). Furthermore, one can choose to display only the entries having structural details by clicking on the “with structural information” box. Additionally, a search function in the top menu bar finds a particular virus by its name, family name, taxonomy-ID, or structural information (e.g., PDB-ID or EMDB-ID). An advanced search option narrows down the results by accepting additional keywords in multiple categories. Of note, the data listed in the table can be downloaded in Excel or JSON formats.
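The selection and download behavior described above can be mimicked offline on a downloaded table. The sketch below is only an illustration under stated assumptions: the record field names (tax_id, virus, host, genome, family, genus, has_structure) and the sample records are hypothetical and do not reflect the actual VWdb JSON layout.

```python
import json

# Hypothetical records loosely mirroring the columns listed in the text;
# the keys and values are made up for illustration.
RECORDS = [
    {"tax_id": 900001, "virus": "Example virus A", "host": "human",
     "genome": "ssDNA", "family": "Exampleviridae",
     "genus": "Examplevirus", "has_structure": True},
    {"tax_id": 900002, "virus": "Example virus B", "host": "human",
     "genome": "ssRNA (+)", "family": "Otherviridae",
     "genus": "Othervirus", "has_structure": False},
    {"tax_id": 900003, "virus": "Example phage C", "host": "bacteria",
     "genome": "dsDNA", "family": "Phageviridae",
     "genus": "Phagevirus", "has_structure": True},
]


def select_viruses(records, host=None, genome=None, family=None,
                   genus=None, with_structure=False):
    """Return records matching every criterion that was supplied."""
    criteria = {"host": host, "genome": genome,
                "family": family, "genus": genus}
    selected = []
    for rec in records:
        # A criterion left as None places no constraint on that column.
        if any(value is not None and rec[key] != value
               for key, value in criteria.items()):
            continue
        if with_structure and not rec["has_structure"]:
            continue
        selected.append(rec)
    return selected


def to_json(records):
    """Serialize the selection as JSON text, as for a table download."""
    return json.dumps(records, indent=2, sort_keys=True)


human = select_viruses(RECORDS, host="human")
human_with_structure = select_viruses(RECORDS, host="human",
                                      with_structure=True)
print(to_json(human_with_structure))
```

Treating unset criteria as unconstrained keeps the function close to the interface described in the text, where any combination of host, genome type, family, and genus may be chosen.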
Fig 1 User interface of VWdb and an example info_page. (A) User interface of Taxonomy Explorer showing the number of individual viruses, virus families, and genera that are available at VWdb. Specific host, genome-type, virus family or genus can be selected from the corresponding lists. Subsequently, clicking on the <Display> button provides the list of viruses according to the selection. (B) An example info_page of adeno-associated virus 2 showing the taxonomy details as well as the links to the sequence and structural information in corresponding databases.
TABLE 2 List of various hosts and genomes available at Virus World db
Category | Types
Hosts | Algae, archaea, bacteria, eukaryotic algae, fungi, human, human stool, insects, invertebrates, land plants, plants, protozoa, vertebrates
Genomes (NCBI) | Adnaviria, duplodnaviria, monodnaviria, riboviria, ribozyviria, satellites, varidnaviria, and unclassified
Genomes (Baltimore classification) | dsDNA, dsRNA, rtRNA, ssDNA, ssRNA (+), ssRNA (−), circular ssRNA
Clicking on a row in the table immediately generates an info_page that contains various relevant information on the virus (Fig. 1B). In addition to the taxonomy information, the info_page (e.g., https://viperdb.org/vw/info_page.php?taxid=12081) contains UniProt-IDs, with hyperlinks, corresponding to the various genes in the viral genome (3), and the structural information that may be available at different structural databases: PDB (4), VIPERdb (5, 6), EMDB (7), and AlphaFoldDB (9).
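Because each info_page is addressed by its taxonomy-ID, links to these pages can be generated from a list of IDs. The following is a small sketch that assumes only the URL pattern shown in the example above (a single taxid query parameter); the helper names are hypothetical, and no claim is made about other parameters the page may accept.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address taken from the example URL in the text.
INFO_PAGE = "https://viperdb.org/vw/info_page.php"


def info_page_url(tax_id):
    """Build the info_page address for one taxonomy ID."""
    tax_id = int(tax_id)
    if tax_id <= 0:
        raise ValueError("taxonomy IDs are positive integers")
    return f"{INFO_PAGE}?{urlencode({'taxid': tax_id})}"


def taxid_from_url(url):
    """Recover the taxonomy ID from an info_page address."""
    query = parse_qs(urlparse(url).query)
    return int(query["taxid"][0])


url = info_page_url(12081)
print(url)
print(taxid_from_url(url))
```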
To our knowledge, VWdb is the only resource where integrated information on virus taxonomy, susceptible hosts, genome type, sequence, and structure is available for all the known viruses in an easy-to-use fashion. The viruses that belong to a particular family or contain a specific genome type can be selected and listed in table form, which in turn can be downloaded in Excel or JSON formats. One can also search for a particular virus based on its name, family name, taxonomy-ID, or structural information (e.g., PDB-ID, EMDB-ID). In addition to the details on virus taxonomy, the info_page of each virus contains hyperlinks to the relevant access codes in UniProtKB, PDB, VIPERdb, EMDB, and AlphaFoldDB. Furthermore, we provide an API to programmatically access the metadata stored at VWdb in JSON format; its description can be found in the Supplementary Information (Fig. S3). We plan to update VWdb at regular intervals, as and when new data become available at NCBI. VWdb can be accessed at https://viperdb.org/vw.

ACKNOWLEDGMENTS

We thank Nelly Santoyo-Rivera for helpful discussions on implementing some of the Python scripts and web-services on VWdb. We also would like to thank Dr. Jeffrey McDonald, the Director of Information Technology and HPC computing at the Hormel Institute, for his assistance in setting up the VIPERdb web server.
This work was supported by the startup funds from the Hormel Institute to V.S.R.
The authors declare that there is no conflict of interest.

SUPPLEMENTAL MATERIAL

Fig. S1, Fig. S2, and Fig. S3 - jvi.00620-23-s0001.pdf
This supplemental file provides the details of VWdb schema, flow chart of data acquisition and the description of VirusWorldDB REST API on how to access the metadata.
ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

REFERENCES

1. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, Comeau DC, Funk K, Kim S, Klimke W, Marchler-Bauer A, Landrum M, Lathrop S, Lu Z, Madden TL, O’Leary N, Phan L, Rangwala SH, Schneider VA, Skripchenko Y, Wang J, Ye J, Trawick BW, Pruitt KD, Sherry ST. 2021. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 49:D10–D17.
2. Kuhn JH, Adkins S, Alkhovsky SV, Avšič-Županc T, Ayllón MA, Bahl J, Balkema-Buschmann A, Ballinger MJ, Bandte M, Beer M, Bejerman N, Bergeron É, Biedenkopf N, Bigarré L, Blair CD, Blasdell KR, Bradfute SB, Briese T, Brown PA, Bruggmann R, Buchholz UJ, Buchmeier MJ, Bukreyev A, Burt F, Büttner C, Calisher CH, Candresse T, Carson J, Casas I, Chandran K, Charrel RN, Chiaki Y, Crane A, Crane M, Dacheux L, Bó ED, de la Torre JC, de Lamballerie X, de Souza WM, de Swart RL, Dheilly NM, Di Paola N, Di Serio F, Dietzgen RG, Digiaro M, Drexler JF, Duprex WP, Dürrwald R, Easton AJ, Elbeaino T, Ergünay K, Feng G, Feuvrier C, Firth AE, Fooks AR, Formenty PBH, Freitas-Astúa J, Gago-Zachert S, García ML, García-Sastre A, Garrison AR, Godwin SE, Gonzalez J-P, de Bellocq JG, Griffiths A, Groschup MH, Günther S, Hammond J, Hepojoki J, Hierweger MM, Hongō S, Horie M, Horikawa H, Hughes HR, Hume AJ, Hyndman TH, Jiāng D, Jonson GB, Junglen S, Kadono F, Karlin DG, Klempa B, Klingström J, Koch MC, Kondō H, Koonin EV, Krásová J, Krupovic M, Kubota K, Kuzmin IV, Laenen L, Lambert AJ, Lǐ J, Li J-M, Lieffrig F, Lukashevich IS, Luo D, Maes P, Marklewitz M, Marshall SH, Marzano S-Y, McCauley JW, Mirazimi A, Mohr PG, Moody NJG, Morita Y, Morrison RN, Mühlberger E, Naidu R, Natsuaki T, Navarro JA, Neriya Y, Netesov SV, Neumann G, Nowotny N, Ochoa-Corona FM, Palacios G, Pallandre L, Pallás V, Papa A, Paraskevopoulou S, Parrish CR, Pauvolid-Corrêa A, Pawęska JT, Pérez DR, Pfaff F, Plemper RK, Postler TS, Pozet F, Radoshitzky SR, Ramos-González PL, Rehanek M, Resende RO, Reyes CA, Romanowski V, Rubbenstroth D, Rubino L, Rumbou A, Runstadler JA, Rupp M, Sabanadzovic S, Sasaya T, Schmidt-Posthaus H, Schwemmle M, Seuberlich T, Sharpe SR, Shi M, Sironi M, Smither S, Song J-W, Spann KM, Spengler JR, Stenglein MD, Takada A, Tesh RB, Těšíková J, Thornburg NJ, Tischler ND, Tomitaka Y, Tomonaga K, Tordo N, Tsunekawa K, Turina M, Tzanetakis IE, Vaira AM, van den Hoogen B, Vanmechelen B, Vasilakis N, Verbeek M, von Bargen S, Wada J, Wahl V, Walker PJ, Whitfield AE, Williams JV, Wolf YI, Yamasaki J, Yanagisawa H, Ye G, Zhang Y-Z, Økland AL. 2022. Recent changes to virus taxonomy ratified by the International Committee on Taxonomy of Viruses. Arch Virol 167:2429–2440.
3. UniProt Consortium. 2023. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res 51:D523–D531.
4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The protein data bank. Nucleic Acids Res 28:235–242.
5. Montiel-Garcia D, Santoyo-Rivera N, Ho P, Carrillo-Tripp M, Brooks CL III, Johnson JE, Reddy VS. 2021. VIPERdb V3.0: a structure-based data analytics platform for viral capsids. Nucleic Acids Res 49:D809–D816.
6. Carrillo-Tripp M, Shepherd CM, Borelli IA, Venkataraman S, Lander G, Natarajan P, Johnson JE, Brooks CL, Reddy VS. 2009. VIPERdb2: an enhanced and web API enabled relational database for structural virology. Nucleic Acids Res 37:D436–D442.
7. Lawson CL, Patwardhan A, Baker ML, Hryc C, Garcia ES, Hudson BP, Lagerstedt I, Ludtke SJ, Pintilie G, Sala R, Westbrook JD, Berman HM, Kleywegt GJ, Chiu W. 2016. EMDataBank unified data resource for 3DEM. Nucleic Acids Res 44:D396–D403.
8. Hulo C, de Castro E, Masson P, Bougueleret L, Bairoch A, Xenarios I, Le Mercier P. 2011. ViralZone: a knowledge resource to understand virus diversity. Nucleic Acids Res 39:D576–D582.
9. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Žídek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D, Velankar S. 2022. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444.

Information & Contributors

Information

Published In

Journal of Virology
Volume 97, Number 8, 31 August 2023
eLocator: e00620-23
Editor: Felicia Goodrum, The University of Arizona, Tucson, Arizona, USA
PubMed: 37578228

History

Received: 24 April 2023
Accepted: 13 June 2023
Published online: 14 August 2023

Keywords

  1. virus taxonomy
  2. family
  3. genus
  4. genome type
  5. Baltimore classification
  6. host type
  7. MySQL database

Data Availability

The data and associated APIs underlying this article are available freely at VWdb. The data sets were derived from sources in the public domain: NCBI, UniProtKB, RCSB PDB, VIPERdb, EMDB, and AlphaFoldDB.

Contributors

Authors

Oscar Rojas Labra
The Hormel Institute, University of Minnesota, Austin, Minnesota, USA
Author Contributions: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, and Writing – review and editing.
Daniel Montiel-Garcia
Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
Author Contributions: Data curation, Formal analysis, Investigation, Project administration, and Validation.
The Hormel Institute, University of Minnesota, Austin, Minnesota, USA
Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
Author Contributions: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, and Writing – review and editing.
