T. brucei has a life cycle that alternates between the tsetse fly and the mammal. In the bloodstream of their mammalian hosts, the parasites evade the immune response by antigenic variation, a continual switching of the variant surface glycoprotein (VSG) that constitutes the surface coat. Although each bloodstream trypanosome has a single VSG species on its surface, the parasite genome has a repertoire of several hundred to 1,000 different
VSG genes that are expressed in a mutually exclusive manner from about 20 potential bloodstream form expression sites (B-ESs), invariantly located near telomeres (see references
5,
10,
19,
53, and
66 for recent reviews). Only one B-ES at a time is activated by an unknown mechanism. These expression sites are long polycistronic transcription units in which the
VSG is cotranscribed with several intervening expression site-associated genes (
ESAGs) from a promoter located about 45 to 60 kb upstream (
34,
54) and are separated from the rest of the chromosome by a 10- to 40-kb region of 50-bp repeats. VSGs are also expressed during the metacyclic stage of the life cycle in the salivary glands of the tsetse fly as a preadaptation to life in the mammal. The genome contains 20 to 30 telomere-linked metacyclic expression sites (M-ESs) containing
VSGs that are transcribed into monocistronic precursor RNAs from a proximal promoter located within 2 kb upstream (see references
4 and
22 for recent reviews).
The genome is highly plastic, as revealed by pulsed-field gel electrophoresis (PFGE) and analysis of the recombination events associated with
VSG switching. It also contains a large number of putative non-long terminal repeat (LTR) retrotransposons:
ingi's and ribosomal inserted mobile elements (RIMEs) (
29,
33,
47). Non-LTR retrotransposons, exemplified by the human short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), are replicating retroelements of a type that is ubiquitous in nature and may constitute as much as 14% of host genomes (
60). Retroelements replicate by copying their RNA transcript into DNA by using a reverse transcriptase. The DNA copy then integrates into the genome (
35). All the non-LTR retroelements are flanked by target site duplications of variable length, have variable length poly(A) or A-rich 3′ tails, and are devoid of the LTRs present in retroviruses and LTR retrotransposons. As reported for mammals (
60) and plants (
58), the non-LTR retrotransposons constitute the most abundant repeat elements described for the genome of
T. brucei (
ingi, RIME, and SLACS) (
3,
47). The
ingi elements (5.2 kb) have the characteristics of LINE elements, while the RIME (500-bp) elements are similar to the nonautonomous SINE elements.
ingi's are composed of a 4.7-kb fragment bordered by two separate halves of RIME and, if their reading frames are not mutated to possess termination codons, they may encode a single large protein containing a central reverse transcriptase domain, a C-terminal DNA-binding domain (
52), and an N-terminal apurinic-apyrimidinic-like endonuclease domain (
48). SLACS are site-specific retroelements found only in the spliced leader RNA genes (
3), but
ingi's and RIMEs were previously thought to be randomly distributed in the host genome (
47). Individual
ingi and RIME are associated with rRNA genes (
29) and tubulin gene arrays (
1) and precede or are within most of the B-ESs and M-ESs characterized so far (
4,
7,
13,
36,
39,
43,
54,
55,
59).
Recently, Melville et al. showed that a large region (about 200 kb) of uncharacterized repeated sequences is present upstream of the 50-bp repeats preceding the B-ES of chromosome I (
ChrI) (
43). Interestingly, this region also contains a high number of
ingi's and RIMEs and is very size polymorphic between strains, and similar sequences are present in many of the megabase chromosomes of
T. brucei (
43; unpublished data). We have characterized a novel, large multigene family (about 128 copies per haploid nonminichromosomal genome) encoding mainly nuclear proteins, multiple copies of which are also located in the RIME/
ingi-rich region. Approximately 60% of the identified members of this gene family are pseudogenes. The gene family can be divided into six subfamilies, called
RHS1 through
RHS6 (for retrotransposon hot spot), based on deduced amino acid sequences. About one-third of the
RHS (pseudo)genes contain RIME and/or
ingi retroelement(s) inserted in frame and at exactly the same relative nucleotide position. Analysis of the
ChrIa sequence indicates that the
RHS genes are clustered upstream of the 50-bp repeats preceding the bloodstream expression site (B-ES). They account for most of the unknown sequences present in the RIME/
ingi-rich repeated region described previously (
43).
MATERIALS AND METHODS
Trypanosomes.
Cells of the bloodstream form of
T.
brucei AnTat1 were used to infect rats and then were isolated by ion exchange chromatography (
37). Procyclic forms of
T. brucei EATRO1125, TREU927/4, and 427 were cultured at 27°C in SDM-79 medium (
15) containing 10% fetal calf serum and 5 mg of hemin liter−1.
Construction and screening of genomic and cDNA libraries.
λZAP II clones containing cDNA-004, cDNA-005, cDNA-040, and cDNA-132 (accession numbers AF403385, AF403388, AF403386, and AF403387, respectively) were randomly isolated from a
T. brucei AnTat1 cDNA library (derived from the bloodstream form). The cDNA was synthesized from poly(A)+ mRNA as described previously (
12) and was inserted into the
EcoRI site of λZAP II cloning vector (Stratagene). Recombinant pBluescript II plasmids containing cDNA fragments were excised from the λZAP II clones according to the manufacturer's instructions (Stratagene). The genomic DNA library of the
T. brucei AnTat1 strain was constructed in the c2X75 cosmid vector (
17). Large DNA fragments generated by
Sau3A partial digestion of genomic DNA were inserted into the
BamHI site of the vector as previously described (
14), and the cosmid library (20,000 clones) was screened with α-32P-labeled cDNA-132. We have selected and partially sequenced five cosmid clones, three containing a full-length and apparently functional
RHS1 gene (Cos-02, Cos-03, and Cos-17 with the accession numbers AY046893, AY046894, and AY046895, respectively) and two containing one
RHS1 pseudogene inactivated by a RIME/
ingi insertion (Cos-12 [accession number AY046896] and Cos-23 [accession numbers AY046897S1 and AY046897S2]).
DNA sequencing, alignments, and phylogenetic analysis.
Inserts of recombinant pBluescript II plasmids and c2X75 cosmids were sequenced by the dideoxynucleotide chain termination method, using AmpliTaq DNA polymerase, as described by the manufacturer (ABI PRISM, Perkin-Elmer). DNA and amino acid sequences were analyzed using the DNA STRIDER and Artemis programs (The Wellcome Trust Sanger Institute), and database searches were done with BLAST. Multiple alignments of amino acid sequences were obtained using MacVector 6.0.1. For the phylogenetic analysis, multiple alignments of DNA and amino acid sequences were obtained using CLUSTAL W version 1.6 (
64). For DNA alignments, all the available full-length
RHS1 or
RHS2 (pseudo)gene sequences located downstream of the RIME/
ingi insertion site were used. For amino acid alignments, the full-length protein sequences encoded by functional genes and pseudogenes (φ
RHS1c, φ
RHS3b, and φ
RHS3c), corrected to remove frame shifts or premature stop codons, were analyzed. The phylogenetic trees were constructed using version 3.5c of the PHYLIP program package of J. Felsenstein (CLUSTAL W and PHYLIP were obtained through Bisance and Infobiogen facilities). The matrix of pairwise sequence distances was calculated by Dayhoff's method using DNADIST or PROTDIST. The unrooted phylogenetic trees were constructed from the distance matrix using the neighbor or Fitch methods and were drawn with TREEVIEW version 1.3 (
50). The statistical robustness of the resulting phylogenetic trees was assessed with the SEQBOOT program by bootstrap resampling analysis generating 100 reiterated data sets. The resulting bootstrap values were added manually at each corresponding node.
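The bootstrap step can be illustrated with a minimal Python sketch. It is not the SEQBOOT implementation; it only shows the underlying idea of resampling alignment columns with replacement to produce 100 pseudo-replicate data sets. The toy alignment, the function names, and the fixed random seed are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    n_cols = len(next(iter(alignment.values())))
    if any(len(seq) != n_cols for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same column sample to every sequence so columns stay intact.
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate n_replicates resampled alignments (seed is an illustrative choice)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment, not taken from the RHS data.
toy = {"seqA": "ATGCATGC", "seqB": "ATGAATGC", "seqC": "TTGCATGA"}
replicates = bootstrap_replicates(toy)
print(len(replicates), "replicates of length", len(replicates[0]["seqA"]))
```

Each replicate would then be passed through the same distance and tree-building steps, and the proportion of replicate trees that recover a given grouping gives the bootstrap value for the corresponding node.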
Southern blot analysis.
Approximately 2.5 μg of genomic DNA from
T. brucei TREU927/4 and 427, extracted as described elsewhere (
8), was subjected to endonuclease digestion (
HincII for RHS1 and RHS5,
ClaI for RHS2 and RHS6,
AseI for RHS3, and
HpaII and
KpnI for RHS4), electrophoresed in a 0.6% agarose gel, blotted onto a neutral membrane (Quantum-Appligene), and hybridized with α-32P-labeled RHS-specific probes at 65°C in 6× SSPE (1× SSPE is 0.18 M NaCl, 10 mM NaH2PO4, 1 mM EDTA, pH 7.0)-0.1% sodium dodecyl sulfate (SDS). The probes specific for each
RHS1 to
RHS6 multigene subfamily were obtained by PCR from the most divergent 3′ region of the (pseudo)genes, which corresponds to box 2 in Fig.
2. The membranes were washed at 65°C using 0.1× SSPE-0.1% SDS, before autoradiography. Probes were removed by boiling in a solution of 0.5% SDS, before rehybridizing blots.
Estimation of RHS1 to RHS6 (pseudo)gene and RIME/ingi copy numbers.
The copy numbers per haploid nonminichromosomal genome (T. brucei TREU927/4) of each RHS subfamily and of the RIME and ingi retrotransposons were estimated by BLAST analysis of a Genome Survey Sequence (GSS) database and by hybridization to a P1 genomic DNA library. For the BLAST analysis, the copy number per haploid nonminichromosomal genome (CN) was calculated from the number of GSSs homologous to the probe (GSS), the size of the probe (GS), the size of the haploid nonminichromosomal genome (HGS = 30 Mb), and the number of GSSs contained in the library (TGSS), using the following equation: CN = (GSS × HGS)/(TGSS × GS). For P1 library hybridization, the copy number per haploid nonminichromosomal genome (CN) was calculated from the number of positive P1 clones (PC), the total number of P1 clones (TC = 1,819 clones), the average size of the P1 DNA inserts (PIS = 65 kb), and the size of the haploid genome without mini and intermediate chromosomes (HGS = 26.7 Mb): CN = (PC × HGS)/(TC × PIS). Since all complete ingi retroelements contain a full-length RIME sequence, the ingi copy number was deduced from the RIME copy number.
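The two copy-number equations can be written out as a short Python sketch. The function and parameter names are assumptions introduced here; the default genome sizes, P1 clone count, and insert size are the values stated in the text. The hit counts, library size, and probe length in the usage lines are hypothetical and are not data from this study.

```python
def copy_number_from_gss(gss_hits, total_gss, probe_size_bp, genome_size_bp=30_000_000):
    """CN = (GSS x HGS) / (TGSS x GS), with HGS = 30 Mb as stated for the BLAST analysis."""
    return (gss_hits * genome_size_bp) / (total_gss * probe_size_bp)


def copy_number_from_p1(positive_clones, total_clones=1819,
                        insert_size_bp=65_000, genome_size_bp=26_700_000):
    """CN = (PC x HGS) / (TC x PIS), with TC, PIS, and HGS as stated for the P1 library."""
    return (positive_clones * genome_size_bp) / (total_clones * insert_size_bp)


# Hypothetical inputs, for illustration only.
example_gss_cn = copy_number_from_gss(gss_hits=50, total_gss=10_000, probe_size_bp=1_000)
example_p1_cn = copy_number_from_p1(positive_clones=100)
print(f"GSS-based estimate: {example_gss_cn:.1f} copies")
print(f"P1-based estimate:  {example_p1_cn:.1f} copies")
```

Note that the two equations use different genome sizes (30 Mb and 26.7 Mb), exactly as given in the text. One caveat of the P1 calculation, as sketched here, is that a clone carrying several clustered copies is still counted once, so the estimate may be conservative for clustered families.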
Production of recombinant proteins in Escherichia coli and antibody production.
PCR fragments encoding the C-terminal subfamily-specific domain of RHS1 (372 amino acids [aa]), RHS2 (260 aa), RHS4 (289 aa), RHS5 (285 aa), and RHS6 (286 aa), preceded by a methionine and six histidine residues, were obtained using the respective 5′ primers (5′-GCCTCACATATGcaccatcaccatcaccatTTGAAGGATTTGGAAGCCA-3′, 5′-AATTTACATATGcatcaccatcaccatcacGAAGAATGCAGAAACAGAGC-3′, 5′-TATTTACATATGcatcaccatcaccatcacCGAGATGCCGGAGAGAGCGT-3′, 5′-AATTTACATATGcatcaccatcaccatcacAAAGCTCGAGAAGGAAACT-3′, and 5′-AATTTACATATGcatcaccatcaccatcacGTACCTCACTCTGAATCCAT-3′) and 3′ primers (5′-TCCTTCGGATCCCTATGCATTGTTACCACC-3′, 5′-TTTATTGGATCCTCAGTCAGCGGGGCCACCAG-3′, 5′-AATTAAGGATCCTCACCCTCCTTGCGCTCCCG-3′, 5′-TTTAAAGGATCCTTACCTTCGGCCCGCAGCAG-3′, and 5′-TTTAAAGGATCCTTATTCGTTATTCGCCACTT-3′). The 5′ primers contain an NdeI restriction site (CATATG), a start codon (the ATG within the NdeI site), and six histidine codons (lower case). The 3′ primers contain a BamHI restriction site (GGATCC) and a stop codon (the three bases immediately following the BamHI site, read on the complementary strand). DNA isolated from Cos-02 (RHS1) and BAC-25N24 (RHS2, RHS4, RHS5, and RHS6) clones was used as template for PCR. The resulting DNA fragments were cloned into the pET3a expression vector (Novagen) and expressed in E. coli BL21 cells. Expression and affinity purification of the recombinant proteins were performed as described by the manufacturer (Novagen). The affinity-purified recombinant proteins were separated by SDS-polyacrylamide gel electrophoresis (PAGE), electroeluted, and emulsified with complete (first injection) or incomplete Freund's adjuvant. Antisera were raised in rabbits (RHS1) or rats (RHS2, RHS4, RHS5, and RHS6) by five injections at 2-week intervals by using 100 or 30 μg of protein per injection, respectively.
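The primer design described above can be checked with a small Python sketch. It takes the ten primer sequences as printed (with internal spaces removed) and tests for the stated features: an NdeI site followed by six histidine codons in each 5′ primer, and a BamHI site followed by the reverse complement of a stop codon in each 3′ primer. The function names and the dictionary layout are assumptions made for illustration, and the assignment of primers to subfamilies follows the order in which they are listed.

```python
NDEI_SITE = "CATATG"    # NdeI site; its last three bases are the ATG start codon
BAMHI_SITE = "GGATCC"   # BamHI site
HIS_CODONS = {"CAC", "CAT"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

FORWARD = {
    "RHS1": "GCCTCACATATGcaccatcaccatcaccatTTGAAGGATTTGGAAGCCA",
    "RHS2": "AATTTACATATGcatcaccatcaccatcacGAAGAATGCAGAAACAGAGC",
    "RHS4": "TATTTACATATGcatcaccatcaccatcacCGAGATGCCGGAGAGAGCGT",
    "RHS5": "AATTTACATATGcatcaccatcaccatcacAAAGCTCGAGAAGGAAACT",
    "RHS6": "AATTTACATATGcatcaccatcaccatcacGTACCTCACTCTGAATCCAT",
}
REVERSE = {
    "RHS1": "TCCTTCGGATCCCTATGCATTGTTACCACC",
    "RHS2": "TTTATTGGATCCTCAGTCAGCGGGGCCACCAG",
    "RHS4": "AATTAAGGATCCTCACCCTCCTTGCGCTCCCG",
    "RHS5": "TTTAAAGGATCCTTACCTTCGGCCCGCAGCAG",
    "RHS6": "TTTAAAGGATCCTTATTCGTTATTCGCCACTT",
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def check_forward_primer(primer):
    """True if an NdeI site is directly followed by six histidine codons."""
    seq = primer.upper()
    site = seq.find(NDEI_SITE)
    if site < 0:
        return False
    start = site + len(NDEI_SITE)
    tag = seq[start:start + 18]
    codons = [tag[i:i + 3] for i in range(0, 18, 3)]
    return len(tag) == 18 and all(codon in HIS_CODONS for codon in codons)


def check_reverse_primer(primer):
    """True if a BamHI site is directly followed by a stop codon on the coding strand."""
    seq = primer.upper()
    site = seq.find(BAMHI_SITE)
    if site < 0:
        return False
    start = site + len(BAMHI_SITE)
    triplet = seq[start:start + 3]
    return len(triplet) == 3 and reverse_complement(triplet) in STOP_CODONS


forward_ok = {name: check_forward_primer(p) for name, p in FORWARD.items()}
reverse_ok = {name: check_reverse_primer(p) for name, p in REVERSE.items()}
print(forward_ok)
print(reverse_ok)
```

The check is deliberately narrow: it confirms only the restriction sites, the histidine tag, and the stop codon, and says nothing about the gene-specific 3′ portion of each primer.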
Western blot analysis.
Total extracts of trypanosomes were boiled for 5 min in 2% (wt/vol) SDS. Sample preparation, migration in SDS-8% PAGE, immunoblotting on Immobilon-P membranes (Millipore), and immunodetection using goat anti-rabbit or anti-rat antibody conjugated to horseradish peroxidase (SIGMA) as the secondary antibody were performed as previously described (
28,
57). The antisera were diluted 1:100 in phosphate-buffered saline (PBS)-0.05% (vol/vol) Tween 20 containing 5% (wt/vol) nonfat milk, and blots were developed with 3,3′-diaminobenzidine.
Immunolocalization of RHS proteins.
For immunofluorescence microscopy, trypanosomes were fixed in PBS-1% (vol/vol) formaldehyde for 30 min and permeabilized for 10 min by adjusting the solution to 0.1% (vol/vol) Triton X-100; 0.1 M glycine was then added for 10 min to neutralize active aldehyde groups. Cells were washed once in PBS, resuspended in PBS, and allowed to adhere to glass slides until completely dry before incubation with antibodies. Rabbit or rat antisera raised against the RHS recombinant proteins were diluted 1:100, whereas secondary goat anti-rabbit fluorescein isothiocyanate (FITC) or anti-rat FITC were used at a 1:10,000 or 1:1,000 dilution, respectively. All incubations were carried out for 30 min at room temperature, and all dilutions were performed with PBS containing 0.1% (vol/vol) Triton X-100 and 0.1% (wt/vol) bovine serum albumin. At the end of the immunofluorescence assay, cells were incubated for 5 min with PBS containing 1 μg of the fluorescent DNA dye DAPI (4′,6-diamidino-2-phenylindole; SIGMA) ml−1. Observations were made after mounting in Vectashield (Valbiotech) mounting medium using a Zeiss epifluorescence microscope fitted with FITC and UV filters. Images were captured by camera (Princeton) and MetaView software (Universal Imaging Corporation) and were processed in Adobe Photoshop (Adobe Systems, Mountain View, Calif.) on a Macintosh iMac computer.
Nucleotide sequence accession numbers.
The sequences have been deposited in GenBank and assigned accession numbers as follows: cDNA-004, AF403385; cDNA-005, AF403388; cDNA-040, AF403386; cDNA-132, AF403387; Cos-02, AY046893; Cos-03, AY046894; Cos-12, AY046896; Cos-17, AY046895; Cos-23, AY046897S1 and AY046897S2; RHS1a, AY046887; RHS2a, AY046888; RHS3a, AY046889; RHS4a, AY046890; RHS5a, AY046891; RHS6a, AY046892.
DISCUSSION
We have characterized a new, large multigene family encoding nuclear and perinuclear proteins in T. brucei. We analyzed a total of 61 different RHS genes and pseudogenes detected in four cDNA clones, two BAC clones from ChrII, the contiguous sequence of ChrIa, and three BACs and five cosmids of unknown genomic location. Analysis of the C-terminal DNA sequence allowed us to subdivide the family into six multigene subfamilies, RHS1 to RHS6. More than half of the RHS copies described here are pseudogenes. To estimate the number of RHS (pseudo)genes in the nuclear genome of strain TREU927/4, we took advantage of the T. brucei GSS databases at TIGR and The Wellcome Trust Sanger Institute, which provide about 1.8-fold coverage of the haploid DNA (excluding minichromosomes). We estimate that there are 128 RHS (pseudo)gene fragments per nonminichromosomal haploid genome. RHS (pseudo)genes also appear to be present in a subset of minichromosomes (hybridization data not shown).
The computational analysis of DNA sequences from TIGR and The Wellcome Trust Sanger Institute, selected cosmids and cDNAs revealed that this multigene family contains a hot spot for insertion of the RIME and
ingi retrotransposons: (i) approximately one-third of the
RHS (pseudo)genes contain RIME and/or
ingi retrotransposons (16 out of 51 copies), (ii) the retroelements are always inserted at exactly the same relative position in the
RHS pseudogenes, even though these genes display up to 50% variation in nucleotide sequence in the vicinity of the insertion site (data not shown), (iii) of the 16
RHS pseudogenes containing RIME/
ingi element(s), 25% contain two or three retroelements while only 1 of the 10 non-
RHS sequences in the databases containing RIME/
ingi retroelements has tandemly arranged elements (data not shown), (iv) a phylogenetic analysis shows that most were generated by independent insertion events, and (v) among the 10 RIME/
ingi retroelements present in the sequenced
ChrIa of strain TREU927/4, 7 are inserted into
RHS pseudogenes. Many eukaryotes contain site-specific non-LTR retrotransposons (
3,
9,
16,
26,
38,
63,
67). Also, non-LTR retrotransposons that appear to be randomly distributed in the host genome in fact show a bias of recognition for insertion sites, as exemplified by the TTAAAA sequence of human LINEs (
31). The exact site specificity of retroelement insertion into
RHS genes leads to the observed tandem arrays of elements. Interestingly, the tandem arrangement of the
T. brucei (RIME and
ingi) and
T. cruzi (L1Tc) non-LTR retrotransposons is unique since, to our knowledge, none of the site-specific or randomly distributed retroelements show this organization in other organisms.
It appears that all the RIME/
ingi elements present in
RHS genes are inserted in frame with the
RHS gene. When the retroelement is unmutated, this results in the generation of long open reading frames encoding putative chimeric proteins composed of the RHS N-terminal half followed by a peptide encoded by the retroelement. However, it is noteworthy that only a few
ingi elements contain a single long open reading frame encoding a putative multifunctional protein (data not shown). Most, including those originally described (
33,
47), are probably not able to encode functional mRNAs due to the presence of frame shifts or premature stop codons. Consequently, the putative
RHS/ingi chimeric proteins may exhibit substantial size and sequence polymorphism due to the
ingi polymorphism. At least seven different chimeric proteins formed between cellular and mobile element genes are expressed in humans (
24,
56,
60,
65). Thus, it is tempting to consider that some of the
RHS/ingi chimeric proteins may be expressed and that the proteins may have a cellular role. This would provide a functional
raison d'être for the presence and conservation of a RIME/
ingi insertion hot spot within the
RHS genes. This hypothesis is supported by the characterization of
RHS/retrotransposon chimeric cDNA molecules in which the boundary between the
RHS pseudogene and the RIME sequence corresponds exactly to the conserved RIME/
ingi insertion site observed in genomic DNA. Production of antibodies against the N-terminal region of the
ingi products will allow us to determine if the RHS/
ingi chimeric proteins are expressed.
Analysis of the
T. cruzi databases revealed that the genome of
T. cruzi also contains polymorphic repeated sequences that potentially code for proteins homologous to the
T. brucei RHS proteins. Interestingly, these DNA sequences were initially characterized as non-LTR retrotransposon (L1Tc) flanking sequences (
49), suggesting that such elements also frequently insert into the putative
T. cruzi RHS-like genes. In contrast, a BLAST analysis of the
Leishmania GSS and cosmid sequence databases, which contain at least as many sequences as the
T. brucei databases, does not reveal the presence of any
RHS homologue. The absence of these sequences is probably correlated with the apparent absence of mobile elements, including retrotransposons, as revealed by the ongoing sequence analysis of this highly related genome (http://www.ebi.ac.uk/parasites/leish.html).
Comparison of
ChrI homologues in different
T. brucei strains indicates that the large RIME/
ingi-rich repetitive region shows substantial size polymorphism (
43). The RIME/
ingi richness observed for this large section of
ChrIa in TREU927/4 (
43), but also in
ChrII and BAC-26P8, is entirely due to insertion into the clustered
RHS (pseudo)genes. Detailed analysis of the
RHS multigene family shows that its members are subject to frequent homologous recombination. Such recombination, occurring within and between nonhomologous chromosomes, may explain not only the size polymorphism of the RIME/
ingi-rich repetitive area (
43) but also the variation in number and location of B-ESs observed in different strains (
43,
44,
45). Our analysis reveals that among 23 retroelements present in 14
RHS pseudogenes within the large clusters described here, 8 are flanked by one
RHS sequence and one unknown sequence. The latter were probably generated by homologous recombination between two retroelements, one inserted in an
RHS pseudogene and another inserted into the unknown sequence. In addition, approximately one-third of the
RHS (pseudo)genes studied are chimeric, and we suggest that these probably result from homologous recombination in conserved regions of
RHS copies belonging to different subfamilies. These suspected homologous recombination events are probably the tip of the iceberg, since numerous undetectable events probably occur between the abundant homologous sequences clustered in large sections of multiple chromosomes.
The 52
RHS (pseudo)genes identified so far in the
T. brucei (TREU927/4) databases are located in five different clusters that are almost exclusively composed of
RHS copies and their large conserved flanking regions: 28 copies (15 genes, 13 pseudogenes) in a 250-kb area of
ChrII (unpublished data), 15 copies (5 genes, 10 pseudogenes) in the 150-kb RIME/
ingi-rich region in
ChrIa (
42,
43), five pseudogenes in BAC-26P8 (
36), two pseudogenes in BAC-45I2 (
36), and two pseudogenes in BAC-30P15 (unpublished data). The three largest
RHS clusters are located upstream of the TTAGGG telomere repeats (
ChrII) or upstream of a 45- to 60-kb B-ES that is adjacent to the telomere repeats (
ChrIa and BAC-26P8). The tandemly arranged
RHS pseudogenes in BAC-45I2 are located 30 kb upstream of a region with the characteristics of a telomeric M-ES. Similarly, the M-ES active in
T. brucei rhodesiense WRATat1.1-MVAT5 (
41) and present in
T. brucei AnTat1 (
13) is preceded by an
RHS1 pseudogene (Cos-12) located 10 kb upstream of the telomere repeats (unpublished data). Although the chromosomal positions of the DNA sequences derived from the other BACs and the cosmids are not known, it appears from this analysis that the
RHS (pseudo)genes are located in subtelomeric regions of chromosomes, upstream of ESs (B-ESs or M-ESs) or directly adjacent to the telomere repeats. However, in the fully sequenced
ChrI and
ChrII, the large clusters are found only at one end, indicating that not all telomeres are separated from the central coding regions by
RHS clusters. Nevertheless, it is interesting that the P1 genomic library analysis showed that most of the B-ESs, maybe all of them, are flanked by
RHS (pseudo)genes.
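The per-cluster counts given at the start of this paragraph can be tallied with a few lines of Python, as a quick consistency check against the stated total of 52 (pseudo)genes. The dictionary layout and variable names are assumptions for illustration; the counts are those listed in the text, with the three BAC clusters containing pseudogenes only.

```python
# (genes, pseudogenes) per cluster, as listed in the text.
clusters = {
    "ChrII": (15, 13),
    "ChrIa": (5, 10),
    "BAC-26P8": (0, 5),
    "BAC-45I2": (0, 2),
    "BAC-30P15": (0, 2),
}

total_genes = sum(genes for genes, _ in clusters.values())
total_pseudogenes = sum(pseudo for _, pseudo in clusters.values())
total_copies = total_genes + total_pseudogenes
pseudogene_fraction = total_pseudogenes / total_copies

print(f"{total_copies} copies: {total_genes} genes, {total_pseudogenes} pseudogenes")
print(f"pseudogene fraction: {pseudogene_fraction:.1%}")
```

The tally reproduces the total of 52 copies, and the resulting pseudogene fraction is in line with the statement that more than half of the RHS copies are pseudogenes.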
The subtelomeric localization of the
RHS (pseudo)genes may be related to their function. In most eukaryotes, subtelomeric regions are large, repetitive, and poorly transcribed sequences located at both ends of chromosomes, directly adjacent to the short telomere repeats (
69). Although subtelomeres are essentially composed of noncoding sequences, expressed genes are found embedded in subtelomeric repeats, such as the
PAU,
SUC,
MAL, and
MEL multigene families in yeast (
40), and surface antigen gene families in
Plasmodium (
6,
18,
20,
61,
62). Apparently there is a selective advantage for the
Plasmodium surface antigen genes, which are involved in antigenic variation, to be located within subtelomeric regions. The high recombination frequencies in subtelomeric domains seem to create a favorable environment for the rapid generation of novel genes encoding surface proteins (
25). Interestingly, in
Plasmodium vivax, a large cluster of 35
vir genes and pseudogenes encoding immunovariant surface proteins is located directly upstream of the telomere repeats (
20), exactly as observed for the
RHS cluster in
ChrII. In addition,
T. brucei VSGs are expressed in the telomeric ESs (B-ESs and M-ESs) and homologous recombination is required to mediate antigenic variation. These observations suggest that the diversity observed for the
RHS multigene family, probably generated by the high rate of recombination in subtelomeric regions, may be advantageous for the parasite. Our experiments indicate that the RHS proteins are located inside the cell, not on the cell surface, and it is now a priority to investigate the function of this diverse and potentially rapidly evolving gene family.
In summary, we describe for the first time a gene family with conserved flanking regions that constitutes about 5% of the T. brucei genome. This multigene family is associated with the most abundant putative mobile elements (about 5% of the genome content) and may be undergoing rapid evolution by recombination and sequence divergence. The RHS genes are clustered in defined regions of chromosomes in T. brucei and are probably always found upstream of B-ESs, although also present on chromosomes not carrying B-ESs. A homologous family is present in T. cruzi, and for both of these organisms the data presented here will be very significant to the finishing stages of the genome sequencing projects.