INTRODUCTION
Until now, bacterial transcriptome studies have mainly relied on bulk RNA sequencing (RNA-seq) (
1). This approach provides averaged gene expression values across an entire cell population and therefore does not allow conclusions regarding transcriptional heterogeneity between individual bacteria. Yet, such phenotypic heterogeneity is a common microbial phenomenon (
2). It is important for bacterial survival strategies such as bet hedging, which allows fast adaptations to changing environments (
3,
4), or biofilm formation, in which individual cells take on highly specific roles within a community (
5).
Dating to 2009, pioneering work established single-cell RNA-seq (scRNA-seq) in eukaryotes (
6). While this field rapidly evolved (
7), the development of scRNA-seq in bacteria was slow to progress due to several challenges (
8). Prokaryotic cells are much smaller than eukaryotic cells, leading to less input material per cell. Single bacteria contain RNA in the femtogram range (
9) and the average mRNA copy number is low, at only 0.4 copies/cell (
10). Further challenges include efficient cell lysis, which is hampered by the bacterial cell wall, and capture of nonpolyadenylated bacterial transcripts. These differences prevent a direct adaptation of most eukaryotic single-cell transcriptomic workflows.
Nevertheless, thanks to technical advances, bacterial single-cell transcriptomics has recently become a reality (
8). Three general types of approaches are currently available. Bacterial multiple annealing and deoxycytidine (dC) tailing-based quantitative scRNA-seq (MATQ-seq) (
11) is a workflow originally developed for eukaryotes (
12) that relies on cell isolation by fluorescence-activated cell sorting (FACS) and random priming of cellular transcripts. A second type, also previously established for eukaryotes and termed split-pool ligation transcriptomics sequencing (SPLiT-seq) (
13), is based on combinatorial barcoding. It was adapted for bacterial scRNA-seq in two independent studies introducing the so-called prokaryotic expression profiling by tagging RNA
in situ and sequencing (PETRI-seq) and microbial SPLiT-seq (microSPLiT) protocols (
14,
15). In comparison to MATQ-seq, bacterial split-pool barcoding workflows enable the analysis of thousands of cells instead of a few hundred, offsetting the lower transcript capture rate and higher rate of cell loss in these protocols. The third type is a microscopy- and probe-based approach that does not employ RNA-seq. It is called parallel sequential fluorescence
in situ hybridization (par-seqFISH) and allows spatial transcriptomics on the level of single bacteria (
16).
Despite these recent advances, challenges remain. These include a high frequency of cell loss and problems with robustness, coverage and prevalence of redundant rRNA. In addition, short transcripts, such as small regulatory RNAs (sRNAs), show poor coverage or are not measurable at all. Importantly, transcript detection is currently limited to ~200 genes per single cell, which is far below the average bacterial transcriptome. We reasoned that targeted improvements of our previous MATQ-seq protocol would address some of these challenges.
In this work, we present the next version of bacterial MATQ-seq. While the original protocol (
11) has a high transcript capture rate, including low-abundance transcripts, it is also limited in throughput and robustness. Through the integration of automation, we have now achieved increased cell throughput. In addition, we improved robustness through the selection of a more efficient reverse transcriptase (RT), which also led to a reduced transcript dropout rate. Finally, given that in our previous protocol the vast majority of reads represented rRNA, we integrated a Cas9-mediated targeted rRNA depletion protocol, called depletion of abundant sequences by hybridization (DASH) (
17). This allowed us to obtain more gene expression information per single cell with decreased sequencing costs.
DISCUSSION
Here, we report substantial improvements to our previously established bacterial MATQ-seq protocol (
11). Specifically, we focused on three elements of the workflow: (i) integration of automation and minimization of reaction volumes during different steps of the protocol, as well as optimization of the bioinformatic pipeline for data analysis; (ii) selection of a more efficient RT; and (iii) implementation of an rRNA depletion step by integrating DASH into the library preparation. We validated our improved MATQ-seq protocol by generating a large data set of single
Salmonella cells sampled over different growth conditions. Overall, our data show that the changes we implemented increased the cell throughput and robustness of the protocol while reducing cell loss. In addition, we were able to improve gene coverage and the gene detection limits. We were even able to detect sRNAs at the single-cell level, which previously had not been feasible. This will allow the exploration of the regulatory functions of sRNA at the single-cell level in future studies. Moreover, our data confirm previously described heterogeneity within the same cell population, especially regarding
Salmonella pathogenicity genes and genes encoding components of the flagellum (
32,
34).
The successful implementation of DASH to deplete rRNA-derived cDNA was instrumental in achieving these advances. We believe that DASH can be adapted to other single-cell approaches, which currently do not include a targeted rRNA depletion protocol (
11,
14,
15). Because depletion is performed at the cDNA level, DASH only needs to be customized to the library preparation step. For protocols that also use Nextera XT for library preparation, such as PETRI-seq (
15), a direct application without any further adjustments is possible. The implementation of DASH would reduce the required sequencing depth and overall sequencing costs.
A general limitation of current scRNA-seq workflows is efficient cell lysis, which might require species-specific customization. Consequently, analysis of mixed bacterial communities is a challenge. This is especially true if their cell wall compositions vary, as this necessitates different enzymatic disruption and poses the risk of introducing bias based on varying lysis efficiency. Combinatorial indexing-based protocols (
14,
15) can cope better with this limitation than MATQ-seq, because these protocols can process many more cells at once. Inefficient lysis is therefore compensated for by a higher number of input cells, although the danger of introducing bias remains. In contrast, MATQ-seq is more limited in throughput because of the cell sorting step and therefore inefficient lysis will lead to a high rate of failure. Nevertheless, MATQ-seq can in principle be applied to mixed cell populations if lysis conditions can be optimized.
Due to the high transcript capture rate of MATQ-seq, this method is particularly well suited for experimental settings in which the starting material is limited, such as the analysis of small subpopulations of bacterial cells in host niches or intracellular bacteria. In these settings, the application of dual RNA-seq allows the study of host-pathogen interactions through the simultaneous analysis of the transcriptomes of both the bacteria and the host (
1,
35). Single-cell dual RNA-seq (scDual-Seq) has been attempted, but so far, either bacterial gene detection has been inefficient (
36) or the experiments were performed under a high multiplicity of infection, which does not reflect physiological conditions (
37). Since MATQ-seq was initially developed for scRNA-seq in eukaryotes (
12), we see high potential in establishing scDual-Seq with MATQ-seq to capture both eukaryotic and prokaryotic transcripts. In this context, it is interesting that DASH has been shown to remove bacterial as well as eukaryotic rRNA at a single-cell level (
22). Nevertheless, establishment of an scDual-Seq protocol based on MATQ-Seq will require further testing and validation.
MATERIALS AND METHODS
Bacterial strains and growth conditions.
The bacterial strains used in this study are listed in
Table S1E.
Salmonella enterica serovar Typhimurium SL1344 (
38) and constitutively green fluorescent protein (GFP)-expressing strain SL1344 (
39) were grown in Lennox broth (LB) (tryptone [10 g/L], yeast extract [5 g/L], and sodium chloride [85.6 mM]) medium. The GFP-expressing
Salmonella strain was used only for FACS gating. Bacterial cultures were inoculated to an optical density (OD) at 600 nm of 0.01 and incubated at 37°C with agitation (220 rpm) until early exponential, mid-exponential, late exponential, and early stationary phases (ODs of 0.1, 0.3, 1.0, and 2.0; according to reference
31). A 1 mL volume of each culture was pelleted and washed twice with 1 mL of 1× Dulbecco’s phosphate-buffered saline (DPBS). Afterwards, the pellet was resuspended in 1 mL of 100% RNAlater/RNAprotect tissue reagent (Qiagen) and kept on ice. Right before sorting, samples were diluted 1:20 in 1× DPBS.
All pipetting steps in the sections below were automated using an I.DOT (Dispendix) dispensing robot except for cleanup and quality control steps.
Isolation of single cells.
Isolation of single cells was done as previously described by Imdahl et al. (
11). Briefly, single cells were isolated using a BD FACS Aria III for sorting individual cells into 96-well plates prefilled with lysis buffer (0.26 μL of 10× lysis buffer [TaKaRa], 0.03 μL of RNase inhibitor [100 U/μL; TaKaRa], 0.26 μL of DPBS [Gibco], 0.1 μL of lysozyme [50 U/μL; Epicentre], and 1.95 μL of nuclease-free water [Ambion]). After sorting, plates were kept on ice and stored at −80°C until further processing.
Improved MATQ-seq protocol.
The improved MATQ-seq protocol is based on the protocol described by Imdahl et al. (
11), with several modifications. Briefly, reverse transcription was performed using primers described by Sheng et al. (
12). Instead of SuperScript III, SuperScript IV was used (Invitrogen) as a reverse transcriptase without changing RT reaction volumes. As reaction buffer, SS IV buffer was used instead. Reverse transcription was followed by primer digestion, RNA digestion, and poly(C) tailing. Subsequent second-strand synthesis was performed before PCR amplification with only one-fourth of the reaction volume (40 μL instead of 160 μL). cDNA purification was performed using AMPure XP beads (Beckman Coulter) at a 1:1 (vol/vol) ratio. cDNA quality was checked by using Qubit Flex and the 2100 Bioanalyzer DNA High Sensitivity kit (Agilent Technologies). Oligonucleotides used for MATQ-seq are listed in
Table S1F.
DASH sgRNA pool generation.
The sgRNA pool for Cas9-based ribosomal depletion in
Salmonella was generated according to Prezza et al. (
21). Briefly, a double-stranded DNA (dsDNA) template for
in vitro transcription was generated with the oligonucleotide pool composed of 797 sequences targeting rRNA of
Salmonella (JVO-21893 [
Table S1F]) as well as the fill-in reaction oligonucleotide (JVO-21894 [
Table S1F]) using KAPA HiFi HotStart ReadyMix (Roche). After column-based cleanup and quality control on Nanodrop and the Bioanalyzer DNA 1000 kit (Agilent), the sgRNA pool was generated by
in vitro transcription using MEGAshortscript T7 transcription kit (Invitrogen). The final pool, consisting of 797 sgRNAs, was purified using the Monarch RNA cleanup kit (500 μg; New England BioLabs [NEB]). sgRNA quality control was performed with Qubit and the Bioanalyzer RNA 6000 Pico kit (Agilent).
Single-cell RNA-seq: library preparation, including ribosomal depletion.
cDNA obtained from single cells after MATQ-seq was further processed for library preparation, including a ribosomal depletion protocol. Library preparation was done using the Nextera XT DNA library preparation kit (Illumina) including the DASH protocol (according to reference
21, with several modifications) with only one-fourth of overall reaction volumes compared to the manufacturer’s recommendations. Tagmentation was performed with 5 μL of total reaction volume instead of 20 μL and 0.5 ng of cDNA input. Index PCR was performed with 13 cycles using Integrated DNA Technologies (IDT) for Illumina Nextera DNA unique dual indexes and a total reaction volume of 12.5 μL instead of 50 μL (Illumina). Obtained libraries were purified with AMPure XP beads (Beckman Coulter) at a 1:1 (vol/vol) ratio to ensure capture of sRNA-derived cDNA. After quality control, up to 12 samples were pooled equimolar for ribosomal depletion. sgRNA/Cas9 complex formation was followed by DASH using the appropriate ratio of cDNA, Cas9, and sgRNA. Cas9 enzyme was inactivated by proteinase K (15 min at 37°C). Afterwards, proteinase K was inactivated by adding PMSF (1 mM final concentration). Depleted cDNA was purified by another round of AMPure XP bead cleanup and used as the input for a second PCR amplification. Depleted cDNA (0.5 ng) was used as the input for Nextera XT reactions omitting the tagmentation steps. Index-independent primers i5 and i7 were used to amplify noncleaved cDNA products. PCR was done for 13 cycles, and cleanup was performed with AMPure XP beads at a 1:1 (vol/vol) ratio. Oligonucleotides used for the second PCR are listed in
Table S1F.
Total RNA extraction and library preparation.
Bacterial RNA was isolated from Salmonella strain SL1344 grown under the same conditions as for scRNA-seq experiments. RNA extraction was performed with 1.8 mL of each in vitro culture using the TRIzol reagent (Invitrogen) according to the manufacturer’s recommendation. RNA quality was checked using the Qubit RNA high-sensitivity assay kit (Invitrogen) and 2100 Bioanalyzer RNA 6000 Pico/Nano kit (Agilent Technologies). Prior to library preparation, DNase treatment was performed using a DNase I kit (Thermo Fisher), followed by rRNA depletion. rRNA was depleted using Lexogen’s RiboCop META rRNA depletion kit protocol according to the manufacturer’s recommendation using 100 ng of total RNA as the input per sample. DNA libraries suitable for sequencing were prepared using the CORALL total RNA-Seq library prep protocol (Lexogen) according to the manufacturer’s recommendation with 13 PCR cycles. Library quality was checked using a 2100 Bioanalyzer DNA High Sensitivity kit.
Sequencing.
Sequencing pools of single cells as well as total RNA-seq libraries were checked using the Qubit DNA High Sensitivity assay kit and a 2100 Bioanalyzer DNA High Sensitivity kit. Sequencing of library pools, spiked with 1% PhiX control library, was performed in single-end 100-cycle sequencing mode on the NextSeq 2000 or NovaSeq 6000 platform (Illumina). Demultiplexed FASTQ files were generated with bcl2fastq2 v2.20.0.422 (Illumina).
Bioinformatics.
(i) Preprocessing. Read trimming and quality control of MATQ-seq reads were executed using BBDuk (
40) and MultiQC (
41). To efficiently remove primer and adapter sequences located at both ends of a read, we ran BBDuk in a two-pass procedure using the default adapter sequence database augmented with MATQ-seq-specific sequences (
Table S1G). The first pass focused on the 5′ end, with parameters
minlen=18 qtrim=rl trimq=20 ktrim=l k=17 mink=11 hdist=1 trimpolya=30, while the second pass focused on the 3′ end, with parameters
minlen=18 qtrim=rl trimq=20 ktrim=r k=17 mink=11 hdist=1.
(ii) Read alignment and counting. Read alignment and counting were performed with Bowtie2 (
42) and featureCounts (
43), allowing a single mismatch and run-in --
local mode. BigWig files were generated using deepTools (
44), passing additional parameters --
binSize 5 --
normalizeUsing BPM. We employed the same gene detection method from the original MATQ-seq analyses (
11), requiring a detected gene to have >5 reads.
(iii) Normalization and differential expression. DESeq2 (
45) was used for normalization and differential expression analysis, using size factors calculated by the
computeSumFactors function in Scran (
46) and other recommended parameters for DESeq2 single-cell analysis, which included using the likelihood ratio test (LRT),
useT=TRUE,
minmu=1e-6, and
minReplicatesForReplace=Inf.
(iv) Identification of outlier cells. To identify outlier cells, we calculated the average number of detected genes per cell (>5 reads) for each condition. Cells were determined to be outliers if their detected gene number varied more than 2 standard deviations (SD) above or below the mean, which removed 14 cells. One additional cell was removed based on the PCA plot generated using PCAtools (
47).
(v) Comparison with existing bulk and single-cell RNA-seq data. To ensure fair comparisons between bulk RNA-seq and scRNA-seq data sets, we processed all FASTQ files using the same preprocessing, alignment, and counting approaches as described above. For our pseudo-bulk representation from the single cell data, for each condition, we summed the counts across all cells for each gene. Bulk RNA-seq data were generated in parallel with the single-cell data.
(vi) Small RNAs and highly variable genes. Additional small RNAs (sRNAs) were added to our annotation, giving us 172 sRNAs in total. To show the most abundant sRNA in the heat map for
Fig. 5, sRNAs were only shown if the row sums of Transcripts Per Million (TPM) normalized counts across all conditions was >100. The full list of expressed sRNAs is provided in
Table S1D. Highly variable genes were identified using Scran (
46), with the top 1% of HVG used for the heat map in
Fig. 6. The number of
Salmonella pathogenicity and flagellar genes used in the supplementary heat maps (
Fig. S7 and
S8) was reduced to show only genes expressed under the examined conditions.
(vii) Single-cell simulations and downsampling. Simulated data were generated using the sample function in R. All Salmonella genes (including rRNA) were sampled with replacement, with read count frequencies used as probability weights per cell for each condition. Different sample sizes were used to represent sequencing read depth, and detected genes were the resulting uniquely sampled genes.
ACKNOWLEDGMENTS
We thank Anke Sparman for helpful comments and editing of the manuscript and Gene-Wei Li for very helpful comments on the manuscript. We also acknowledge Esther Gottwald for great technical support on MATQ-seq workflow and library preparation, Laura Jenniches for discussions about simulating data, and Emmanuel Saliba and Fabian Imdahl for help with MATQ-seq pipeline. We further thank the Core Unit SysMed at the University of Würzburg and the sequencing facility at HZI Brunswick for RNA-seq data generation.
This work was supported by a DFG Leibniz Prize to J.V. and the DFG-funded Centre for Microbial Single-cell RNA-seq (MICROSEQ) at the University of Würzburg (grant DFG INST 93/1105-1).
We declare that no conflict of interest exists.
C.H. performed the experiments, C.H. and J.V. designed experiments, R.J.H. and L.B. performed bioinformatic analysis, C.H., R.J.H., L.B., and J.V. wrote and edited the manuscript, and L.B. and J.V. supervised the research project.