INTRODUCTION
High-throughput sequencing (HTS) has become increasingly important for virus diagnostics in human and veterinary clinical settings and for disease outbreak investigations (1–3). Since the introduction of the first HTS platform only about a decade ago, sequencing quality and output have increased exponentially while costs per base have decreased, and HTS has therefore become a standard method for molecular diagnostics in many virological laboratories. The relatively unbiased approach of HTS not only enables the screening of clinical samples for common and expected viruses but also allows an open view, without preconceptions about which virus might be present. This approach has led to the discovery of novel viruses in clinical samples, such as Bas-Congo virus, associated with hemorrhagic fever outbreaks in Central Africa (2); Lujo arenavirus in southern Africa (3); and a bornavirus-like virus, the causative agent of several fatal cases of encephalitis in Germany (4). Considering the potential of HTS to complement or even replace existing “gold-standard” diagnostic approaches such as PCR and quantitative PCR (qPCR), quality assessment (QA) and accreditation processes need to be established to ensure the quality, harmonization, comparability, and reproducibility of diagnostic results. The computational analysis of the immense amount of data produced requires dedicated computational infrastructure as well as bioinformatics expertise or software developed by (bio)informaticians, and the interpretation of the results additionally requires evaluation by an experienced virologist or physician. In many cases, true-positive results may be difficult to discern among large numbers of false-positive results or may be missing entirely from result sets because of false-negative results. Interpretation of results also requires knowledge of anomalies that may arise through sequencing artifacts or contamination.
Proficiency testing (PT) is an external quality assessment (EQA) tool for evaluating and verifying sequencing quality and reliability in HTS analyses. The pioneer in EQA and PT for infectious disease applications of HTS has been the Global Microbial Identifier (GMI) initiative, which has been organizing annual PTs since 2015, focusing on sequencing quality parameters as well as the detection of antimicrobial resistance genes, multilocus sequence typing, and phylogenetic analysis of defined bacterial strains (https://www.globalmicrobialidentifier.org/workgroups/about-the-gmi-proficiency-tests) (5). Subsequently, the concept was similarly established regionally for U.S. FDA field laboratories (6, 7).
COMPARE (Collaborative Management Platform for Detection and Analyses of (Re-)emerging and Foodborne Outbreaks in Europe; http://www.compare-europe.eu/) is a European Union-funded program with the vision of improving the identification of (novel) emerging diseases through HTS technologies. Participating institutions have hands-on experience in viral outbreak investigation. One of the program's ambitious goals is to establish and enhance quality management and quality assurance in HTS, including external assessment and interlaboratory comparison.
In this study, we present the results of the first global PT offered by the COMPARE network to assess bioinformatics analyses of simulated in silico clinical HTS virus data. The viral sequence data set was accompanied by a fictitious case report providing a realistic scenario to support the identification of the simulated virus included in the data set.
Tools and programs for bioinformatics analysis.
In recent years, numerous tools, programs, and ready-to-use workflows have been established, making metagenomic sequence analyses accessible to scientists from all research fields. Workflows for the typical analysis of HTS data and for the identification of viral sequences are based on the same general tasks and tools, including quality trimming, background/host subtraction, de novo assembly, and sequence alignment and annotation. Sequence processing usually starts with obligatory quality assessment and trimming, using programs such as FastQC or Trimmomatic, including the removal of technical and low-complexity sequences and the filtering of poor-quality reads (8, 9). Following these initial steps, many workflows include the subtraction of background reads, e.g., host and bacterial reads, to reduce the total amount of data and increase specificity, using tools such as BWA (Burrows-Wheeler Alignment Tool) or Bowtie 2 (10, 11).
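For illustration only, a minimal Python sketch of these two preprocessing steps is given below; it is not taken from any participant's workflow, and the file names, index name, and trimming parameters are assumptions chosen simply to show the pattern.

```python
# Minimal preprocessing sketch (illustrative only): quality trimming followed
# by host-read subtraction. Assumes a "trimmomatic" wrapper script and
# "bowtie2" are on PATH and that a Bowtie 2 index of the host genome exists.
import subprocess

RAW_READS = "sample.fastq"        # hypothetical single-end input reads
HOST_INDEX = "host_genome_index"  # hypothetical prebuilt Bowtie 2 index

# 1) Trim low-quality bases and discard short reads (parameters are examples).
subprocess.run(
    ["trimmomatic", "SE", "-phred33", RAW_READS, "trimmed.fastq",
     "SLIDINGWINDOW:4:20", "MINLEN:50"],
    check=True,
)

# 2) Map the trimmed reads against the host genome and keep only the reads
#    that do NOT align (--un writes unaligned reads to a separate file).
subprocess.run(
    ["bowtie2", "-x", HOST_INDEX, "-U", "trimmed.fastq",
     "--un", "nonhost.fastq", "-S", "host_alignments.sam"],
    check=True,
)
# nonhost.fastq would then be passed on to assembly and classification.
```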
De novo assembly of HTS reads into longer, contiguous sequences (contigs), followed by reference-based identification, has been shown to improve the sensitivity of pathogen identification. Such analyses depend heavily on assemblers, such as SPAdes or Velvet, which implement specific assembly strategies, such as overlap-layout-consensus or de Bruijn graph algorithms (12, 13). Alignment and classification tools such as BLAST, DIAMOND (double-index alignment of next-generation sequencing [NGS] data), Kraken, and USEARCH are among the most important components in bioinformatics workflows for pathogen identification and taxonomic assignment of viral sequences (14–17).
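As a toy illustration of the de Bruijn graph idea (our simplification, not the algorithm of any specific assembler), the sketch below breaks reads into k-mers, connects overlapping (k-1)-mers, and walks unambiguous edges to reconstruct a contig; real assemblers such as SPAdes additionally handle sequencing errors, branching, repeats, and coverage information.

```python
# Toy de Bruijn graph assembly: overlapping (k-1)-mers become nodes, each k-mer
# contributes one edge, and a walk along unambiguous edges yields a contig.
from collections import defaultdict

def build_graph(reads, k):
    graph = defaultdict(set)                  # (k-1)-mer -> successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    contig, node = start, start
    while len(graph[node]) == 1:              # follow only unambiguous edges
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig

# Three short, overlapping "reads" tiling one hypothetical 15-bp sequence.
reads = ["ATGGCGTGC", "GCGTGCATT", "GCATTACG"]
graph = build_graph(reads, k=5)
print(walk(graph, "ATGG"))                    # -> ATGGCGTGCATTACG
```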
Because command-line tools for HTS require specific bioinformatics knowledge, complete workflows and pipeline approaches have been developed, including ready-to-use Web-based tools such as RIEMS (reliable information extraction from metagenomic sequence data sets), PAIPline (pipeline for the automatic identification of pathogens), Genome Detective, and others (18–20). Since the COMPARE in silico PT focuses on comparing different tools and software programs for bioinformatics analyses, an overview of frequently used programs is given in Table 1. A more extensive overview of virus metagenomics classification tools and pipelines published between 2010 and 2017 can be found at https://compare.cbs.dtu.dk/inventory#pipeline.
DISCUSSION
HTS-based virus diagnostics requires complex multistep processing, including laboratory preparation, assessment of the quality of the sequences produced, computationally challenging analytic validation of sequence reads, and postanalytic interpretation of the results. Proper analysis of HTS data for virus diagnostics therefore demands not only comprehensive technical skills but also bioinformatic, biological, and medical knowledge.
HTS data can comprise several hundred thousand to many millions of reads from a single sequenced sample. Handling and analyzing such amounts of data pose computational challenges and currently require expertise in bioinformatics. Depending on the laboratory procedure, the identification of viral reads from clinical metagenomic data is hampered by low virus-to-host sequence ratios and high viral mutation rates, which make reference-based sequence assignment for highly divergent viruses challenging (24).
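As a simple, purely hypothetical calculation of what a low virus-to-host ratio implies (the numbers below are assumptions, not values from the PT data set):

```python
# Hypothetical back-of-the-envelope estimate: with a tiny viral fraction in the
# library, even a deep sequencing run yields only a handful of viral reads.
total_reads = 10_000_000     # reads produced for one clinical sample (assumed)
viral_fraction = 1e-5        # 0.001% of the library is viral (assumed)

expected_viral_reads = total_reads * viral_fraction
print(f"expected viral reads: {expected_viral_reads:.0f} of {total_reads:,}")
```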
In silico bioinformatics analysis of HTS data can be separated into an analytic and a postanalytic step. The analytic step includes the processing of sequence reads with software tools or scripts assembled into workflows and pipelines. The postanalytic step is evaluation of the results obtained from the bioinformatics analysis with regard to pathogen identification, often involving interpretation by an experienced, qualified health professional to correlate bioinformatics results with clinical and epidemiological patient information.
The bioinformatics analysis and the technical identification of viral reads from the HTS data set showed decreasing success as sequences became more divergent from reference strains, as exemplified by MeV, with 82% nucleotide identity to its closest relative, and by nABV, with just 52% nucleotide identity to other bornaviruses; the latter was identified by only 4 of the 13 participants. MeV and TTV were missed by participant 4, whose analysis was based on the Kraken tool and an in-house workflow. Kraken is known to assign sequence reads to reference sequences with high specificity but limited sensitivity, which makes the assignment of mutated and divergent virus reads difficult (15). Since Kraken employs a user-specific reference database, TTV may simply have been absent from the custom database; Kraken was also used by participant 7, which was able to identify both MeV and TTV. More generally, the use of different databases is an obstacle in bioinformatics analysis of HTS data. To date, unified, curated virus reference databases exist only for influenza viruses (EpiFlu) (25), HIV (26), and human-pathogenic viruses (ViPR) (27). Recently, viral reference databases for bioinformatics analysis of HTS data have been developed (https://hive.biochemistry.gwu.edu/rvdb, https://rvdb-prot.pasteur.fr/) (28). NCBI offers the most extensive collection of viral genomes, but the lack of curation and verification of submitted sequences often leads to false-positive and false-negative results. To overcome such problems, reference-independent tools for virus detection in HTS data have been developed, making the discovery of novel viruses feasible without any prior knowledge of a reference genome (29).
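To make this database dependence concrete, the following sketch outlines how a custom database might be built and queried with Kraken 2 (the successor of the Kraken version cited above); the database name and input files are hypothetical, and the point is simply that a genome never added to the library cannot be reported.

```python
# Illustrative sketch of building and querying a custom Kraken 2 database.
# A virus whose genome is never added via --add-to-library cannot be reported,
# however many of its reads are present in the sample.
import subprocess

DB = "custom_viral_db"                       # hypothetical database name

for cmd in (
    ["kraken2-build", "--download-taxonomy", "--db", DB],
    ["kraken2-build", "--add-to-library", "measles_refseq.fna", "--db", DB],
    # note: no TTV genome is added here, so TTV reads would go unclassified
    ["kraken2-build", "--build", "--db", DB],
):
    subprocess.run(cmd, check=True)

subprocess.run(
    ["kraken2", "--db", DB, "--report", "report.txt",
     "--output", "classified.txt", "nonhost.fastq"],
    check=True,
)
```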
All of the participants that were able to identify the divergent nABV used workflows based on protein alignment approaches, including BLASTx/BLASTp, USEARCH, and DIAMOND, which are known to be highly sensitive (14, 17). The identification of such highly divergent viruses remains challenging and cannot be accomplished by workflows relying solely on nucleotide-level reference-based alignment. DIAMOND, which became available in 2015, was specifically designed for such sensitive analysis of HTS data at the protein level and is as much as 20,000 times faster than the BLAST programs. Compared with other alignment tools, which typically trade speed against sensitivity, DIAMOND offers superior sensitivity for the detection of mutated and divergent viral sequences (14). However, the detection of such highly divergent viral sequences in patient samples is rare, and virus discovery is not a routine part of clinical virus diagnostics.
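A toy example (sequences invented for illustration) of why protein-level comparison tolerates nucleotide divergence: synonymous substitutions change the nucleotide sequence but leave the encoded peptide, and hence a BLASTx/DIAMOND-style match, intact.

```python
# Synonymous third-codon-position changes lower nucleotide identity without
# altering the encoded peptide. Partial codon table, invented sequences.
CODONS = {"ATG": "M", "GCT": "A", "GCA": "A", "AGT": "S", "AGC": "S",
          "CTT": "L", "CTA": "L", "GGA": "G", "GGC": "G"}

def translate(nt):
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, len(nt), 3))

ref   = "ATGGCTAGTCTTGGA"   # hypothetical reference fragment
query = "ATGGCAAGCCTAGGC"   # differs at 4 of 15 nucleotide positions

nt_identity = sum(a == b for a, b in zip(ref, query)) / len(ref)
print(f"nucleotide identity: {nt_identity:.0%}")                      # ~73%
print(f"identical peptides:  {translate(ref) == translate(query)}")   # True
```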
In terms of specificity, all workflows performed well; only workflow 6 reported a chordopoxvirus that was not present in the data set. Such false-positive results, as well as the excessive numbers of HSV-1 and MeV reads reported by participant 7 (8,361 reads reported versus 2,000 present and 1,411 versus 1,000, respectively), can derive, for example, from low-complexity reads in the data set that align to low-complexity or repetitive regions of viral reference genomes, from inappropriate matching-score limits during filtering, or from inappropriate algorithm parameters. Furthermore, custom databases and viral reference sets from NCBI can include sequences of human origin that lead to false-positive results and, in some cases, to the nonreporting of other matches because of default algorithm reporting limits.
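One simple precaution against such spurious hits is to flag low-complexity reads before alignment, as sketched below with an assumed Shannon-entropy cutoff (the participants' actual filtering criteria were not reported).

```python
# Toy low-complexity filter: reads whose base composition has very low Shannon
# entropy (e.g., homopolymers or simple repeats) are prone to spurious
# alignments and can be flagged before classification. Cutoff is illustrative.
from collections import Counter
from math import log2

def shannon_entropy(seq):
    counts = Counter(seq)
    n = len(seq)
    return max(0.0, -sum((c / n) * log2(c / n) for c in counts.values()))

reads = {
    "read_1": "AAAAAAAAAAAAAAAAAAAAAAAA",      # homopolymer
    "read_2": "ATATATATATATATATATATATAT",      # simple repeat
    "read_3": "ATGGCGTGCATTACGGATCCTAGC",      # normal complexity
}

THRESHOLD = 1.5  # bits per base, assumed cutoff
for name, seq in reads.items():
    h = shannon_entropy(seq)
    status = "keep" if h >= THRESHOLD else "flag as low complexity"
    print(f"{name}: entropy={h:.2f} -> {status}")
```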
The total analysis times of the workflows differed widely, from only 3 h to 216 h (15 h of analysis plus 201 h of waiting for available servers). One of the fastest was participant 1, which needed only 3 h to perform the calculations on a scalable, high-performance national virtual machine, whereas the slowest workflow (participant 4; 216 h) involved calculation on a personal computer through an external public server on which bioinformatics jobs are queued among those of many other users (Fig. 1; Table 5). Participant 5 likewise performed the analysis on a notebook computer but required far less time (26 h). Overall, workflows designed exclusively for virus detection or using only a viral or RefSeq database were not clearly faster than workflows performing full metagenomic analyses; however, the specific composition of each database was not provided. To evaluate the performance of each bioinformatics workflow conclusively with regard to analysis time, all workflows would have to be run on the same computer system, but such standardization was not practical for this PT evaluation.
The COMPARE virus PT has further shown that both the analytic work and the postanalytic evaluation are important, since similar analytic results can be interpreted very differently depending on the analyzing participant. Unlike standard routine virus diagnostic approaches such as PCR, in which a medically relevant hypothesis tests either positive or negative, HTS delivers an extensive and largely unbiased catalogue of results. The etiological agent in a patient sample can be masked by false-positive results, sequencing contaminants, commensal viruses of the human virome, or viruses of yet unknown importance. Furthermore, the causative viral agent of a disease may be represented by very few reads, because viral loads may be low depending on the timing of sampling and the sample matrix. RNA viruses, which include many of the most pathogenic human viruses, usually have smaller genomes than DNA viruses (30, 31). Low read numbers from an RNA virus might therefore be dismissed, resulting in a false-negative result. To reduce false-positive results, some workflows and pipelines apply read-number cutoffs, but in doing so they make the detection of low-read-number matches less likely.
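The effect of such cutoffs can be illustrated with invented read counts (not values from this PT): a fixed threshold removes a probable artifact but can also discard a genuine low-abundance hit.

```python
# Invented example: a fixed read-number cutoff removes a likely artifact but
# also discards a true pathogen present at very low read numbers.
hits = {
    "virus_A": 1523,   # abundant, clearly reportable
    "virus_B": 41,     # low-abundance commensal
    "virus_C": 7,      # true pathogen present at very low read numbers
    "artifact": 2,     # probable false positive
}

CUTOFF = 10  # assumed minimum number of reads required for reporting
reported = {name: n for name, n in hits.items() if n >= CUTOFF}
print(reported)        # virus_C is lost together with the artifact
```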
Since the analysis of HTS data for virus diagnostics requires bioinformatics as well as virological knowledge, collaboration between the two disciplines has been emphasized (32). Furthermore, automated pipelines for HTS-based virus diagnostics with unbiased evaluation of the pathogenicity and relevance of the pathogen detected have been implemented; these can help harmonize the analysis and interpretation of HTS sequence results (33).
A robust approach to viral diagnostics using HTS requires further refinement and validation. The COMPARE in silico PT is limited by the low complexity of the simulated data set; real-world (in vivo) sequence data sets can comprise a highly diverse host background and microbiome, further increasing the difficulty of identifying viral reads. Further proficiency schemes with in vivo data sets and samples, and wider collaboration, are required to make progress. A second in silico PT organized by the COMPARE network focused on interpreting the significance of foodborne pathogens in a simulated data set (unpublished data); again, the interpretation of results proved to be one of the most variable and critical points in HTS data analysis. Furthermore, third-generation sequencing technologies, such as the MinION from Oxford Nanopore Technologies, are becoming available in many laboratories and field settings because of their low cost and short sequencing times (34–36). However, analysis tools developed for second-generation sequencing technologies, such as the Illumina systems, may not be applicable to third-generation sequencing data, owing to the lower sequencing accuracy (approximately 85%) and to read lengths that can reach 2 Mbp (37–39). Consequently, future PTs should also cover third-generation sequencing technologies, since these are likely to become part of routine laboratory diagnostics.
Conclusion.
External quality assessment for HTS-based virus identification is currently available only to a limited extent. The COMPARE in silico virus PT has shown that numerous tools and different workflows are used for virus analysis of HTS data and that the results of these workflows differ in sensitivity and specificity. At present, there are no standard procedures for virome analyses, and the reliable production, comparison, and sharing of the results of such analyses are therefore difficult.
Finally, there is a clear need for updated, highly curated, freely and publicly available databases for the harmonized identification of viruses in virome data sets, as well as for mechanisms to conduct continuous ring trials that ensure the quality of virus diagnostics and characterization in clinical diagnostic and public and veterinary health laboratories.