INTRODUCTION
Functional genomics technologies, such as transposon insertion sequencing (TIS) (
1) and RNA-sequencing (RNA-seq) (
2), have emerged as effective, high-throughput methods for investigating gene function. Most analysis of functional genomics data relies on quantifying and comparing sequencing read counts, with results often expressed as relative (log) ratios of read counts between an experimental condition and control. Critically, the calculation of these log read count ratios depends on accurate normalization. Common normalization methods correct for differences in sequencing depth between experiments and assume this normalization factor is constant for all genes assayed. Here, we identify a phenomenon that violates these normalization assumptions resulting from treatments that affect DNA replication in bacteria, which we have coined chromosomal location bias (CLB). We present a new normalization technique and associated tool, named “ChromoCorrect,” that can be easily applied to any functional genomics data to identify and correct for CLB.
During normal exponential growth in bacteria, the time between cell divisions is significantly shorter than the time needed to complete chromosomal replication. This leads to cells containing more copies of DNA around the origin of replication (
oriC) compared to the terminus (
ter) and is due to the firing of multiple simultaneous replication forks (
3) (
Fig. 1A). In many sequencing-based assays, this difference in DNA copy numbers translates into higher read counts around the origin as compared to the terminus due to a higher availability of template nucleic acids. Under most conditions, the ratio of
oriC-ter reads remains constant between a treatment and an untreated control and does not interfere with results. However, certain treatments specifically alter the
oriC-ter ratio, such as exposure to the DNA gyrase-targeting antibiotic ciprofloxacin (
Fig. 1B). This introduces large changes in read counts that primarily reflect changes in the DNA copy number near the origin rather than gene regulation (RNA-seq) or mutant fitness (TIS) (
Fig. 1C). These distortions, or CLB, in turn can lead to incorrect predictions of drug sensitivity or resistance.
A comprehensive study by Slager et al. (
4) showed that treatment with a range of antimicrobials modified both DNA and RNA copy numbers, resulting in altered
oriC-ter ratios across the genomes for multiple bacterial species. In a subsequent review, Slager et al. (
5) suggested that not normalizing for this effect in RNA-seq analyses may lead to an overestimation of differential gene expression. Previous studies have attempted to correct CLB in antibiotic-treated functional genomics data using local regression methods such as Lowess (
6–8). Typically, local regression is performed on the raw read counts, with normalized counts produced to replace the raw reads for the differential analysis. However, most differential analysis tools for sequencing data rely on count models that assume counts of similar magnitude have similar variance (
9,
10). Providing modified or transformed counts violates these assumptions and will lead to incorrect assessment of statistical significance. We often observe distortions in the local read count density spanning several orders of magnitude following antibiotic treatment, which would lead to concomitantly large distortions in resulting
P-values. Hence, there is a clear need for a statistically sound methodology to properly address CLB in functional genomics data.
Here, we develop a statistically principled approach for correcting CLB using a local normalization factor rather than directly providing normalized counts. These local normalization factors can then be provided to differential analysis tools such as edgeR (
9) or DESeq2 (
10,
11) as offsets alongside the raw counts to correct for CLB within the statistical model. This preserves important features of the data needed for accurate calculation of
P values, namely the mean-variance trend, while also producing fold-changes that have the CLB effect removed. Based on this method we have developed an application for identifying and correcting for CLB named “ChromoCorrect.” We have made our diagnostic and normalization procedure available as a graphical Shiny application that can be applied to any sequencing-based functional genomics assay. We apply ChromoCorrect to a data set we generated for this study that displays strong CLB: TIS output of an
Escherichia coli K12 library challenged with ciprofloxacin. We confirm that ciprofloxacin produces the predicted large local distortions in read counts around the origin of replication, confounding the TIS counts such that they no longer accurately reflect the fitness of individual mutants. We show, using minimum inhibitory concentration (MIC) assays, that these distortions lead to incorrect predictions of mutant ciprofloxacin sensitivity and resistance. We also demonstrate that our normalization approach, after processing with ChromoCorrect, corrects these, rendering accurate fold changes that align well with independently determined mutant phenotypes.
DISCUSSION
This study addresses the fundamental issue of CLB, which we show impacts a wide variety of functional genomics analyses, resulting in false positives and negatives or incorrect interpretation of data. CLB arises from nucleic acid copy number fluctuations along the chromosome, typically around the origin and/or terminus, which can be exacerbated by replication-disrupting events like replication-targeted antibiotic treatment. To solve this problem, we introduce ChromoCorrect, a normalization tool that effectively corrects for CLB producing accurate log2FCs and significance values for each locus. Using a ciprofloxacin-treated TraDIS data set in E. coli, we demonstrated that CLB leads to incorrect predictions of antibiotic resistance phenotypes that can be corrected using ChromoCorrect.
Over the past decade, functional genomics techniques, like TIS and RNA-Seq, have been employed to comprehensively assess the effects of various selection pressures, such as antibiotic exposure, on microbial fitness and cellular responses (
8,
13,
19,
24,
25). Here, we show that CLB can lead to incorrect prediction of phenotype using our own antibiotic-treated TIS data, as a cautionary tale for future studies. Our study retrospectively highlighted the prevalence of CLB in published functional genomics data sets across diverse species and under a range of conditions. These conditions include both treatment with direct DNA-targeting antimicrobials, such as fluoroquinolones, but also antimicrobials that have an indirect effect on DNA replication, like trimethoprim, HPUra, and tobramycin (
Fig. 2) (
4,
7,
12,
13). Trimethoprim targets the dihydrofolate reductase of the folate biosynthesis pathway and reduces the availability of tetrahydrofolate, a precursor to the essential DNA components thymidine and thymine (
26) eventually leading to “thymineless death” (
27). Thymine starvation stalls replication forks, initially leading to a transient increase in origin-proximal DNA before the replicating DNA is destabilized and degraded, leading to ultimate depletion of origin-proximal DNA (
28). Similarly, the uracil analogue HPUra has an indirect effect on DNA replication by stalling replication forks (
4). The mechanism by which tobramycin, a ribosome-targeting antibiotic, contributes to the observed CLB is not obvious but could represent another indirect or secondary effect. Our work suggests that CLB may be exerting a more significant influence on the interpretation of sequencing data than currently acknowledged and an important future study would be to comprehensively assess the prevalence of CLB across functional genomics data sets.
The four organisms we investigated span both Gram-negative and Gram-positive bacteria, indicating that CLB occurs in diverse species, and is likely relevant beyond these well-studied organisms. Our method bears conceptual similarity to peak-to-trough ratio (PTR) methods that use
oriC-ter ratios to determine bacterial growth rates from genomic or metagenomic DNA sequencing data (
29–31), which have been shown to accurately predict growth rates in a wide range of bacteria in pure culture. The success of these methods suggests that beyond antibiotic treatments that interfere with DNA replication, CLB may also affect the results of functional genomics comparison between bacterial cultures growing at different rates.
Biases arising from altered
oriC-ter ratios in TIS data have been recognized previously and corrected for using local regression methods (
6–8), though the origins of these biases have not been clearly described. Our contribution has been to highlight the prevalence of CLB in bacterial functional genomics data. A major advantage of ChromoCorrect is an ability to directly incorporate the local normalization factors as an offset into differential analysis tools such as edgeR (
9) and DESeq2 (
10). This preserves the mean-variance relationship in the underlying count data, which is important for accurate estimation of statistical significance. This approach was inspired by the transcript quantification package tximport that corrects for differences in isoform abundance during gene-level differential expression in eukaryotes (
32). By facilitating the import of offsets into established RNA-seq analysis tools, ChromoCorrect can be used seamlessly with existing pipelines.
Although clearly a critical step in the analysis of functional genomics data, the interpretation of correcting for CLB requires care and depends on the technology analyzed. In the case of RNA-seq, CLB is reflective of a genuine increase in RNA synthesized from the origin-proximal region, and the primary danger is that this could be interpreted as a specific regulatory response rather than a direct result of antibiotic activity on DNA replication dynamics. In contrast for TIS experiments, CLB introduces artifacts that can lead to false predictions of phenotype. Since TIS uses transposon-flanking reads as a proxy for mutant abundance, and local distortions in DNA copy number will lead to a local distortion in template DNA abundance, mutants containing transposon insertions in the vicinity of the origin will appear to be more frequent in the population than they really are in data affected by CLB.
Our study highlights the importance of scrutinizing data for CLB to improve the reliability of conclusions drawn from functional genomics data. We recommend that future microbial functional genomic data sets with read counts produced, especially those that involve antibiotic exposure, be screened for the presence of CLB, and if so, to correct the data using ChromoCorrect before proceeding with time and labor-intensive biological interpretation and laboratory experiments.
MATERIALS AND METHODS
Software
Analyses were performed using R (version 4.0.3), R Studio (version 2022.07.2). The application was developed using RShiny (version 1.7.3).
TraDIS library ciprofloxacin challenging and sequencing
An
E. coli K12 TraDIS library was generated as previously described (
16) and challenged with subinhibitory ciprofloxacin (40 µg/mL) in 10 mL of Mueller Hinton cation-adjusted media. Genomic DNA was extracted using the DNeasy UltraClean Microbial Kit (Qiagen) according to the manufacturer’s instructions and was sequenced on an Illumina HiSeq2500 platform at the Wellcome Sanger Institute.
Identifying chromosomal location bias
The data were run through the Bio::TraDIS pipeline (
17) using SMALT mapping and a minimum read count of 10. To identify whether CLB was present, the log
2FCs were plotted in genome order, with locus on the
x-axis and log
2FC on the
y-axis.
Generating normalized read counts and offsets with ChromoCorrect
For each condition, the read counts are obtained, and the first 1,000 genes are appended to the end of the file and the last 1,000 to the beginning to mimic a circular genome for the median sliding window function. Read counts of zero are excluded from the data set to remove their influence on the median. The medianFilter function from package FBN (version 1.5.1) is used to calculate a median for each locus based on an adjustable window size. We have found that a default window size of 500 is sufficient for many smoother trends, whilst sharps trends need to be computed with a smaller window size. A ratio for each point is calculated by dividing the locus’ median by the mean median of all loci. The normalized read count is obtained by dividing the raw read count by the ratio computed for each locus. The normalized counts are not used as a replacement for the raw read counts, instead, an offset data set is created. The offset is computed as the natural logarithm of the raw count, subtracted from the natural logarithm of the normalized count. An arbitrary pseudocount of 0.1 is added to both raw and normalized counts before log transforming to prevent the undefined log of 0 occurring. The effective library size is calculated using the edgeR package (version 3.32.1) calcNormFactors function applied to the normalized counts, multiplied by the column sums of the normalized counts. Following this, the offset values are adjusted to account for library size differences by subtracting the logarithm of the effective library size.
Comparisons using edgeR within ChromoCorrect
ChromoCorrect incorporates the offset along with the raw counts in edgeR to perform differential analysis and assess the normalization. Genes are filtered based on a minimum count threshold (default value of 10). Raw counts are put into a DGEList with groupings and scaleOffset() is used to offset the raw read counts. Library size normalization is not computed due to the inclusion of the offset, which produces a custom normalization factor per gene. Common negative binomial (estimateGLMCommonDisp()) and Bayes tagwise (estimateGLMTagwiseDisp()) dispersions for general linear models are calculated. A gene-wise negative binomial for general linear models (glmFit() and glmLRT()) is fit with contrasts to produce likelihood ratio tests per gene between the control and conditions, producing the log2 fold changes and adjusted P values. Following the analysis, ChromoCorrect generates summary statistics and diagnostic plots, automatically calculating the adjustment of the window size if bias has not been mitigated. The process repeats until a satisfactory result is achieved or a minimum window size of 200 is reached. The code produces scatterplots of the data before and after bias is removed for a visual reference.
Minimum inhibitory concentration assays
MIC assays were performed for the single gene knockouts to determine their breakpoint compared to WT cultures. Mutants were steaked from frozen onto Mueller Hinton (MH) agar plates and incubated overnight at 37˚C. Three single colonies of each mutant were inoculated into 5 mL of cation-adjusted MH broth (CAMHB) and grown overnight at 37˚C and 200 rpm shaking. A 1/100 dilution of the overnight cultures was made in 5 mL of fresh CAMHB and grown for 2.5 h until the exponential phase. MICs were performed with triplicate technical replicates in a 96-well plate with approximately 1 × 105 cells per 150 µL well and grown overnight at 37˚C with 200 rpm shaking. Cells were imaged after 16 h at OD600 on a PHERAstar plate reader (BMG Labtech). Wells were blanked and averaged within triplicates. MIC was determined as the lowest concentration that inhibited at least 50% of growth compared to the untreated mutant positive control.