Open access
Computational Biology
Announcement
7 June 2024

Setu: a pipeline for the robust assembly of SARS-CoV-2 genomes

ABSTRACT

Setu is an efficient pipeline that integrates currently available open-source bioinformatic tools to perform rapid de novo assembly, assisting the tracking of severe acute respiratory syndrome coronavirus 2 genome evolution in clinical data. It is particularly useful for institutions with limited computing resources or personnel unfamiliar with bioinformatic pipelines.

ANNOUNCEMENT

RNA virus assembly is challenging (1) because high error rates during RNA replication produce a large number of mutations and, in turn, enormous viral genetic diversity (2). Accurate haplotype reconstruction therefore relies on both robust error correction and read assembly methods (3). The assembly of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a positive-sense single-stranded RNA virus, faces the same challenges. While genome sequencing and bioinformatics have played an important role in the coronavirus disease 2019 pandemic, aiding in viral identification (4) and in tracking the transmission and evolution of the virus, a high-quality genome assembly is crucial for effective surveillance and identification of novel lineages.
We present Setu, a pipeline (5) that efficiently performs pre-alignment quality control using Trimmomatic v.0.39 (6), followed by read selection through read mapping against the SARS-CoV-2 reference using BWA-MEM v.0.7.17 (7). The resulting files are processed using SAMtools v.1.18 (8) to extract mapped reads and BEDTools v.2.30.0 (9) to convert the BAM file into FASTQ. De novo assembly is performed using coronaSPAdes v.3.15.5 (10), followed by reference-assisted scaffolding through Ragout v.2.3 (11), resulting in a single contiguous sequence. Assembly statistics are calculated using MetaQUAST v.5.2.0 (12). Setu is currently optimized for Illumina paired-end sequence data.
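The order of the steps described above can be sketched as a list of command lines. This is only an illustration of the workflow, not Setu's actual implementation: the function name, file names, output layout, thread count, and the Trimmomatic thresholds (`SLIDINGWINDOW:4:20 MINLEN:36`) are assumptions, as are the Ragout recipe file and the name of the scaffold file passed to MetaQUAST. The commands are built but not executed.

```python
from pathlib import Path


def build_setu_like_commands(r1, r2, reference, outdir, threads=8):
    """Return (label, command) pairs for a Setu-like workflow.

    Illustrative only: all file names and trimming thresholds are
    assumptions, not the settings used by Setu itself.
    """
    out = Path(outdir)
    trim_r1, trim_r2 = out / "trim_R1.fq.gz", out / "trim_R2.fq.gz"
    unp_r1, unp_r2 = out / "unpaired_R1.fq.gz", out / "unpaired_R2.fq.gz"
    mapped_bam = out / "mapped.bam"
    sorted_bam = out / "mapped.namesorted.bam"
    sel_r1, sel_r2 = out / "mapped_R1.fq", out / "mapped_R2.fq"
    asm_dir = out / "coronaspades"
    ragout_dir = out / "ragout"
    recipe = out / "ragout_recipe.rcp"  # hypothetical recipe file
    scaffolds = ragout_dir / "target_scaffolds.fasta"  # hypothetical name
    quast_dir = out / "metaquast"

    return [
        ("quality control",
         f"trimmomatic PE -threads {threads} {r1} {r2} "
         f"{trim_r1} {unp_r1} {trim_r2} {unp_r2} "
         "SLIDINGWINDOW:4:20 MINLEN:36"),
        ("index reference", f"bwa index {reference}"),
        # -F 4 drops unmapped reads, keeping only reads that hit the reference
        ("map and select reads",
         f"bwa mem -t {threads} {reference} {trim_r1} {trim_r2} "
         f"| samtools view -b -F 4 -o {mapped_bam} -"),
        # name-sorting keeps mates adjacent for paired FASTQ export
        ("sort by read name", f"samtools sort -n -o {sorted_bam} {mapped_bam}"),
        ("BAM to FASTQ",
         f"bedtools bamtofastq -i {sorted_bam} -fq {sel_r1} -fq2 {sel_r2}"),
        ("de novo assembly",
         f"coronaspades.py -1 {sel_r1} -2 {sel_r2} -t {threads} -o {asm_dir}"),
        ("reference-assisted scaffolding", f"ragout {recipe} -o {ragout_dir}"),
        ("assembly statistics",
         f"metaquast.py {scaffolds} -r {reference} -o {quast_dir}"),
    ]


if __name__ == "__main__":
    steps = build_setu_like_commands(
        "sample_1.fastq.gz", "sample_2.fastq.gz", "reference.fasta", "setu_out"
    )
    for label, command in steps:
        print(f"# {label}\n{command}\n")
```

Keeping the steps as data rather than running them directly makes the order easy to inspect; the Setu repository remains the authoritative source for the actual commands and parameters.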
Setu was evaluated against the de novo assemblers MEGAHIT v.1.2.9 (13), ABySS v.2.3.5 (14), and IDBA-UD v.1.1.3 (15), as well as against the targeted SARS-CoV-2 pipelines TAR-VIR (16) and HAVoC (17), using 125 SARS-CoV-2 paired-end Illumina read sets retrieved from NCBI BioProject PRJNA639066 and 79 read sets from PRJNA746690 (5). Assembly statistics generated through MetaQUAST v.5.2.0 were used for evaluation. All assemblies were run at a k-mer value of 33, where applicable. To demonstrate Setu's efficiency, all evaluations were performed on an HP laptop computer with an Intel Core i5-9300H processor running at 2.4 GHz with eight threads and 24 GB of RAM.
Setu outperformed all other pipelines (Table 1) (5) in largest contig size, NA50, and NGA50 (Fig. 1A), indicating the highest-quality assemblies. It also had the highest mean genome fraction, covering most of the reference genome (Fig. 1B). HAVoC and Setu produced the most contiguous assemblies, with a mean of one contig each, and the two highest N50 values, with HAVoC slightly ahead (Table 1, Fig. 1B). MEGAHIT was fastest, completing assembly in 43 minutes, followed by Setu at 70 minutes. None of the pipelines had extensive memory requirements. It is important to note that, of all the pipelines, only Setu and HAVoC perform QC steps before assembly.
TABLE 1 Mean statistics of the performance evaluation data set (a)

Pipeline   Time (m:s)   # Contigs   Largest contig size (bp)   N50 (bp)   NA50 (bp)   NGA50 (bp)   Genome fraction (%)
ABySS      82:20        6.7         13,019                     11,176     11,173      13,017       92.18
HAVoC      98:13        1           29,838                     29,838     28,443      28,436       95.21
IDBA-UD    105:32       5.89        15,225                     13,814     13,808      15,220       94.37
MEGAHIT    43:52        3.47        21,714                     21,123     21,069      23,741       96.22
Setu       70:05        1           29,682                     29,669     28,717      28,737       96.40
TAR-VIR    192:11       74.42       11,594                     9,066      8,461       11,240       93.52
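(a) Statistics shown are Time, run time in minutes:seconds; # contigs, total number of contigs; Largest contig size, the size of the largest contig; N50, the length of the shortest contig among the longest contigs that together cover at least 50% of the assembly length; NA50, the equivalent of N50 calculated on aligned blocks rather than whole contigs; NGA50, the equivalent of NA50 calculated relative to the reference genome length; Genome fraction, percentage of reference genome bases covered by aligned assembly bases; and the average GC content of each genome assembly.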
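As a worked illustration of the N50 statistic reported in Table 1, the following minimal sketch computes N50 from a list of contig lengths, and an NG50-style value when a reference genome length is supplied. The function name and the example lengths are hypothetical. NA50 and NGA50 are not reproduced here because they require the alignment blocks that MetaQUAST derives by aligning contigs to the reference.

```python
def n50(contig_lengths, genome_length=None):
    """Return N50 of the contigs, or NG50 if genome_length is given.

    N50 is the length of the contig at which the cumulative length of
    contigs, sorted from longest to shortest, first reaches half of the
    total assembly length. With genome_length set, the threshold is half
    of the reference length instead, and None is returned if the
    assembly is too short to reach it.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = genome_length if genome_length is not None else sum(lengths)
    threshold = total / 2

    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= threshold:
            return length
    return None  # assembly covers less than half of the reference length


if __name__ == "__main__":
    example = [100, 80, 60, 40, 20]          # hypothetical contig lengths
    print(n50(example))                      # 80: 100 + 80 >= 300 / 2
    print(n50(example, genome_length=400))   # 60: 100 + 80 + 60 >= 400 / 2
```

A single-contig assembly, such as the mean of one contig reported for Setu and HAVoC, has an N50 equal to the length of that contig.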
Fig 1 (A) Radar plot of the evaluation metrics (best values at 100%). Setu (blue) had the best performance across all metrics except N50, where HAVoC (gold) performed better. (B) Boxplots of genome fraction (above) and N50 values (below) across the different pipelines.

ACKNOWLEDGMENTS

This work was supported by the Rockefeller Foundation, Grant Number: 2021 HTH 018. This grant was given to the CSIR Institute of Genomics and Integrative Biology, which allowed them to carry out the research and complete the study.

REFERENCES

1. Marz M, Beerenwinkel N, Drosten C, Fricke M, Frishman D, Hofacker IL, Hoffmann D, Middendorf M, Rattei T, Stadler PF, Töpfer A. 2014. Challenges in RNA virus bioinformatics. Bioinformatics 30:1793–1799.
2. Bull JJ, Meyers LA, Lachmann M. 2005. Quasispecies made simple. PLoS Comput Biol 1:e61.
3. Zagordi O, Däumer M, Beisel C, Beerenwinkel N. 2012. Read length versus depth of coverage for viral quasispecies reconstruction. PLoS One 7:e47046.
4. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ. 2020. A new coronavirus associated with human respiratory disease in China. Nature 579:265–269.
5. Shukla N, Narayan J. 2024. Setu supplementary data. Zenodo. https://doi.org/10.5281/zenodo.11108539.
6. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.
7. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. https://doi.org/10.48550/arXiv.1303.3997.
8. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. 2021. Twelve years of SAMtools and BCFtools. Gigascience 10:giab008.
9. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
10. Meleshko D, Hajirasouliha I, Korobeynikov A. 2021. coronaSPAdes: from biosynthetic gene clusters to RNA viral assemblies. Bioinformatics 38:1–8.
11. Kolmogorov M, Raney B, Paten B, Pham S. 2014. Ragout-a reference-assisted assembly tool for bacterial genomes. Bioinformatics 30:i302–9.
12. Mikheenko A, Saveliev V, Gurevich A. 2016. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32:1088–1090.
13. Li D, Liu CM, Luo R, Sadakane K, Lam TW. 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674–1676.
14. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res 19:1117–1123.
15. Peng Y, Leung HCM, Yiu SM, Chin FYL. 2012. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420–1428.
16. Chen J, Huang J, Sun Y. 2019. TAR-VIR: a pipeline for tARgeted VIRal strain reconstruction from metagenomic data. BMC Bioinformatics 20:305.
17. Truong Nguyen PT, Plyusnin I, Sironen T, Vapalahti O, Kant R, Smura T. 2021. HAVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences. BMC Bioinformatics 22:373.

Information & Contributors

Information

Published In

Microbiology Resource Announcements
Online First
eLocator: e00237-24
Editor: Simon Roux, DOE Joint Genome Institute, Berkeley, California, USA
PubMed: 38847537

History

Received: 15 March 2024
Accepted: 10 May 2024
Published online: 7 June 2024

Keywords

  1. SARS-CoV-2
  2. COVID-19
  3. genome assembly
  4. viral evolution
  5. genome surveillance

Data Availability

The source code and detailed instructions for installation and use are available on GitHub (https://github.com/jnarayan81/setu). We recommend installing dependencies through the Conda package manager. Setu will remain freely available for the next 10 years alongside instructions for use and any applicable updates. The data used for performance evaluation are publicly available through the NCBI BioProject database under PRJNA639066 and PRJNA746690.

Contributors

Authors

Nityendra Shukla
CSIR-Institute of Genomics & Integrative Biology, New Delhi, Delhi, India
Author Contributions: Data curation, Formal analysis, Methodology, Software, Validation, and Visualization.
Neha Srivastava
Institute of Biotechnology, Amity University, Lucknow, India
Author Contributions: Data curation, Methodology, and Validation.
Institute of Biotechnology, Amity University, Lucknow, India
Author Contributions: Conceptualization, Formal analysis, Writing – review and editing, and Supervision.
CSIR-Institute of Genomics & Integrative Biology, New Delhi, Delhi, India
Author Contributions: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, and Writing – review and editing.

Notes

The authors declare no conflict of interest.
