The Ab repertoire is not uniform. Some variable, diversity, and joining genes are used more frequently than others. Nonuniform usage can result from the rearrangement process, or from selection. To study how the Ab repertoire is selected, we analyzed one part of diversity generation that cannot be driven by the rearrangement mechanism: the reading frame usage of DH genes. We have used two high-throughput sequencing methodologies, multiple subjects and advanced algorithms to measure the DH reading frame usage in the human Ab repertoire. In most DH genes, a single reading frame is used predominantly, and inverted reading frames are practically never observed. The choice of a single DH reading frame is not limited to a single position of the DH gene. Rather, each DH gene participates in rearrangements of differing CDR3 lengths, restricted to multiples of three. In nonproductive rearrangements, there is practically no reading frame bias, but there is still a striking absence of inversions. Biases in DH reading frame usage are more pronounced, but also exhibit greater interindividual variation, in IgG+ and IgA+ than in IgM+ B cells. These results suggest that there are two developmental checkpoints of DH reading frame selection. The first occurs during VDJ recombination, when inverted DH genes are usually avoided. The second checkpoint occurs after rearrangement, once the BCR is expressed. The second checkpoint implies that DH reading frames are subjected to differential selection. Following these checkpoints, clonal selection induces a host-specific DH reading frame usage bias.
Antibodies are proteins produced by B cells. Abs consist of two H chains and two L chains. Ab H chains and L chains consist of C and V regions, which are so named because they vary in their level of sequence diversity when different Ab molecules are compared. The V regions that encode the greatest diversity correspond to the regions in the Ab that are important for binding to a vast array of Ags. Ab diversity is achieved through a series of somatic mechanisms that produce many different sequences, but conserve genetic space. These diversification mechanisms include recombination of V, D, and J gene segments [V(D)J recombination], junctional modifications between the rearranged gene segments, pairing of different H and L chains, and somatic hypermutation (SHM) [reviewed in (1, 2)]. V(D)J recombination occurs in an ordered and stage-specific fashion in the bone marrow, with H chain rearrangement producing a V region that contains juxtaposed VH, DH, and JH gene segments.
The fate of the B cell depends in large part on the specificity of its BCR. Self-reactive B cells must be edited, killed, or inactivated to maintain self-tolerance (3, 4). B cells that respond to pathogens are, instead, activated, clonally expanded, and undergo differentiation into specialized effector cells. During an immune response, Abs can undergo further sequence modification to optimize their effector functions (isotype switching from IgM to a different H chain C region such as IgG or IgA, while keeping the same VH region). Abs of mature B cells can also undergo SHM. SHM, coupled with selection for the B cells that bind to a particular Ag with the highest affinity, results over time in the successive improvement in affinity for the Ag, a process termed affinity maturation.
Within the Ab V region, there are more conserved and more variable sequences referred to as framework regions and CDRs, respectively (5). The CDRs form loops that are important for Ag binding. Among the CDRs, CDR3 is the most hypervariable in sequence because it encompasses the junctions between the recombining VH, DH, and JH gene segments. The position of DH in the sequence often brings it to the center of the Ab combining site. DH gene usage has been proposed to be different in autoimmunity (6). The addition and deletion of nontemplated nucleotides at the junctions between the recombining gene segments allow the DH segment (which is flanked on one side by the VH gene segment and on the other by the JH gene segment) to be read in one of three forward-facing reading frames (RFs). As long as the junctional modifications at the D-J side of the rearrangement return to the +1 RF in the JH gene segment, the rearrangement is potentially functional. Despite the availability of three forward-facing RFs, the usage of RFs is biased. For example, studies in mice (7, 8) have shown that there is selection against RFs containing a stop codon or charged residues (7–11). Other mechanisms, such as a Dμ-mediated suppression, have also been proposed to explain the preference for a single RF in mice (10–12).
In addition to the usage of up to three different forward RFs, DH rearrangement can occur by deletion or by inversion. Inversional rearrangements increase the number of potential RFs to six (three forward-facing RFs and three reverse). Rearrangements to all six possible DH gene RFs (DRF) have been reported in mice, and the use of inverted DRF has been described in autoimmune-prone strains of mice (13–16).
Whereas the joining probability of VH, DH, and JH may be affected by the structure of the locus and properties of the RAG complex, as well as by the dynamics of the rearrangement mechanism [e.g., editing (3, 17–19)], the DRF probability distribution appears to be purely driven by selection, as we will show further in this work. The usage of forward and inverted DRF, in contrast, can be based on selection, but also on possible limitations imposed by the recombination signal (20). In humans, previous reports have claimed that essentially only forward RFs are used (21) and that there are practically no D-D fusions, findings that we will confirm more definitively in this work (22–24).
Up until now, the analysis of DH gene sequences in humans has been limited to relatively few sequences. Because selection of Abs is influenced by the VH gene used (25, 26), earlier studies in which only a few dozen rearrangements from different VH were analyzed were insufficient to rigorously evaluate DH selection (which requires controlling for the VH gene used). A second difficulty is to precisely identify DH segments in rearranged Ab genes, which are often truncated due to exonucleolytic nibbling by nonhomologous end-joining machinery that can ultimately remove practically the entire DH gene. Sometimes it is difficult to discern which nucleotides in the sequence are derived from the DH segment as opposed to those that result from junctional modifications or SHMs that target the CDR3 [e.g., (27, 28)]. The confounding effects introduced by SHMs may reduce the reliability of DH segment identification in mature B cell populations. Some DH genes are more similar to each other than others, which means that DH identification fidelity depends not only upon the length of the DH germline sequence (11–37 bp), but also upon its degree of similarity to the other germline DH gene alleles. Finally, some DH genes are much more frequently used than others [see for example (29)]. Thus, in a limited number of sequences, only the most abundantly used DH genes will be sampled, as the precise distribution in relatively rarely used DH genes is hard to measure.
With the advent of high-throughput sequencing, it has become possible to sample a large enough number of naive BCR sequences to capture a large enough number of DH with a limited number of SHMs, even those with short DH gene sequences. In this study, we present the DRF distribution from 12 human subjects. We describe a new computational method for defining DH segments and provide information on which DH segments are most versus least reliably identified, which is of interest to those using high-throughput sequencing to analyze the Ab repertoire. In agreement with previous publications, we find that inverted DRF are practically never used (22, 24). However, even in the forward RF, we show that there is very strong selection of DRFs that eventually leads to the selection of one or two dominant RFs of the three possible forward RFs in the expressed Ab repertoire. These results have implications for the overall level of IgH diversity and the timing of H chain selection checkpoints in humans. An understanding of these selection checkpoints may be useful in future studies of how B cell selection is altered in disease.
Materials and Methods
Twelve apparently healthy adult subjects (see Table I for demographic characteristics) were recruited for high-throughput sequencing using the 454 platform. Two 45-ml blood draws were collected in heparin tubes from each subject at a single time point. Mononuclear cells were isolated using Ficoll-Paque Plus (GE Healthcare), and then sorted by flow cytometry into naive (CD20+, CD27−) and memory (CD20+, CD27+) populations. Informed consent was obtained from all donors. This work was performed in accordance with an Institutional Review Board–approved protocol at Pfizer.
|Sequencing Technology .||ID .||Gender .||Ethnographics .||Age .|
|Sequencing Technology .||ID .||Gender .||Ethnographics .||Age .|
Two apparently healthy adult subjects were also recruited for high-throughput sequencing using the Illumina platform at a single time point (Table I). A total of 25 cc venous peripheral blood was drawn in sodium EDTA tubes from each of the two subjects and sequenced using the Illumina platform. Informed consent was obtained, and subjects were asked to fill out a medical questionnaire. This work was performed in accordance with an Institutional Review Board–approved protocol at the University of Pennsylvania.
Target amplification and 454 sequencing
Unbiased amplification of repertoires was performed by 25 cycles of 5′RACE, using individual isotype-specific reverse primers. Primers were optimized for efficiency, fidelity, and completeness of repertoire recovery by informatic screening, gel analysis, and high-throughput sequencing of recovered products. The degree of germline-dependent amplification bias was assessed by comparing amplified products of stimulated naive B cell pools to direct sequencing of the same pools. Cycle-dependent effects on diversity estimates were evaluated by high-throughput sequencing. All products received multiplex identifiers (barcodes) to allow unambiguous identification of all products by sequence analysis in subsequent processing steps. Multiplex identifiers differed by at least 3 bp from any other multiplex identifier sequence, and only reads with exact matches were included in the analysis. Products were sequenced with 454 titanium. Sequencing quality was assessed by keypass control. Sample quality control was confirmed by demultiplexing and VH segment genotype. Sequencing depth was determined by diversity estimate rarefaction and simulations of germline-profile stabilization as a function of sequencing depth. A detailed discussion of the sequencing methodology has been described previously (30).
Ab DNA CDR3 analysis by Illumina sequencing
PBLs were enriched over Ficoll-Hypaque, and CD19+ cells were isolated by magnetic bead separation (Miltenyi Biotec). Genomic DNA was extracted using a Puregene Kit (Qiagen, Valencia, CA), and 800 ng was used for IgH CDR3 amplification and library construction. IgH CDR3 V regions were amplified and sequenced, as described in Larimore et al. (31). Briefly, a multiplexed PCR method was employed to amplify rearranged IgH sequences at the genomic level, using seven VH segment primers (one specific to each VH segment family) and six JH segment primers (one for each functional JH segment). Reads of 110 bp extending from the JH segment, across the NDN junction, and into the VH segment were obtained using the Illumina HiSeq sequencing platform. The immunoSEQ assay was used for the sequencing. The BCR CDR3 region was defined according to the IMGT convention (32), beginning with the second conserved cysteine encoded by the 3′ portion of the VH gene segment and ending with the conserved phenylalanine encoded by the 5′ portion of the JH gene segment. The resulting sequences were analyzed for DH usage, orientation, and RF, as described in the section on DH detection. A total of 12,948,437 sequence reads representing 47,358 unique sequences, of which 82% were productive, was analyzed for subject 1, and 12,851,153 sequence reads representing 20,374 unique sequences, of which 84% were productive, were analyzed for subject 2.
Germline VH, DH, and JH genes used for sequence composition
Human Ig germline sequences for VH, DH, and JH genes of B cells were extracted from the IMGT database (33). In our study, we used 34 germline sequences of DH, categorized under functional genes that included open RF, 13 sequences of JH genes, and 188 sequences of VH genes.
The 454 sequence analysis
We used two different approaches to analyze the DRF usage. In the first approach, we used all unique sequences. In the second approach, we attempted to detect clones by clustering together sequences with similar CDR3 sequences, to minimize the effect of potential biases in the sequence copy numbers. Both approaches led to similar results.
Sequences were grouped into clones using a two-step approach. First, the germline VH and JH of each sequence were determined by aligning all possible germline VH and JH (based on the IMGT germline library) (33) against the sequence, finding the highest number of overlapping nucleotides, and assuming that no deletions or insertions occurred. A full alignment using BLAST produced similar VH and JH assignments for all tested sequences that were classified clearly enough in the alignment.
Next, to count the clones, we grouped all sequences according to their VH and JH usage as well as the distance between VH and JH, because SHMs usually do not produce additions or deletions of nucleotides (a detailed example of clone detection can be found at: http://peptibase.cs.biu.ac.il/homepage/Lymphocyte_clone_detection.htm). Thus, every clone emerging from the same founder cell should have the same distance between VH and JH. We then took all of the sequences with the same VH, JH, and distance between VH and JH and grouped them using a phylogenic approach. The distance between VH and JH was computed by positioning the IMGT germline VH and JH genes on the observed sequence and determining the distance between the last nucleotide of VH and the first nucleotide of JH. All the sequences with equal VH, JH, and distance were aligned together with an artificial sequence composed of the germline and gaps between them. Within each group, the sequences were aligned (using MUSCLE 3.6) (34), and a phylogenetic tree was built using maximum parsimony (35) and/or neighbor-joining (36) methods (from the PHYLIP 3.69 program package). We then parsed this tree with a cutoff distance of four mutations into clones. Thus, a clone was defined as a set of sequences that are similar one to each other, up to a distance of four mutations.
We have computed the maximal similarity between each DH germline gene segment in each RF and all the possible subsequences of same length in the region spanning the last 30 nt before the 3′ end of the VH gene to the end of the JH gene. The similarity of a given sequence to a germline DH has been defined as the fraction of nucleotides equal to the one of the germline DH. We only further analyzed sequences with a similarity of at least 0.75.
Validation set production
In silico, we generated a set of 10,000 sequences by random V-D-J joining (with equal probability for each DH gene). In this dataset, the DH gene could be inserted in forward or backward orientation. We then added mutations between each pair of genes (V-D and D-J) to create the following: 1) replacements of 1–3 nt at each inner edge of the DH gene, and 2) replacement of 1–3 nt in random positions in the DH gene. In both methods, purines and pyrimidines had a two-thirds probability of being mutated into a base in the same group (transition) and a one-third probability to be mutated to the opposite group (transversion). We then computed the best fit DH and its orientation for each of these known rearrangements, using the same methodology as was used for the real (unknown) sequences. The fraction of high precision fits was affected by the DH gene length: as expected, longer DH genes were properly classified more often.
Whereas the RF usage of VH and JH is constrained, the RF of DH is flexible because nucleotide additions or deletions are present at both ends of the rearranged DH gene segment. Furthermore, DH genes can potentially undergo inversion. Thus, DH segments can be read in up to six RFs. The presence of multiple possible RFs increases the variability of the CDR3. In this study, we show that the DH gene actually uses a limited number of RFs.
The analysis of DH gene sequences in the Ab repertoire is not easy because some DH genes are similar to each other, and nibbling of the DH genes by the nonhomologous end-joining machinery can result in very short DH gene sequences. Thus, the robust identification of DH gene segments from CDR3 sequences involves a trade-off between sensitivity and specificity.
This analysis required the development and validation of a robust computational method for the identification of DH gene segments and their RFs from high-throughput sequencing data. We have performed several analyses to rule out certain technical artifacts in the sequencing or sampling methodology. We have further validated the precision of the DH gene determination using computer simulations of DH rearrangements and found the results to be highly precise for long DH genes. We have tested our computational methodology on artificial datasets, to ensure that the results are not due to artifacts of the computation. The code for V(D)J detection and the code to analyze and detect DH genes can be requested from the authors.
To rule out potential biases that could have been introduced as a result of the sample type, IgH amplification process, or sequencing platform, we analyzed the DRF usage using two completely different methodologies. The first method used RNA extracted from sorted B cell subsets, amplification of cDNA by 5′RACE, and pyrosequencing (by 454). The second method employed genomic DNA from CD19+-enriched PBLs, amplification using mixtures of VH and JH primers, and Illumina sequencing. The results with both approaches were very similar for the in-frame (IF) sequences. In this analysis, we have sequenced the IgH repertoire of multiple human subjects to ensure that the observed DH selection is reproducible and robust.
Subjects and 454 IgH sequences
We have sequenced the Ab H chain repertoire from 12 healthy human adult volunteers (Table I). For each subject, we sorted CD20+ lymphocytes into naive (CD27−) and memory (CD27+) subsets. We then applied a RACE protocol to separately amplify IgM, IgG, and IgA BCRs from each sample, using isotype-specific primers at the C region. The resulting sequences were compared with all germline VH genes at all possible positions. Sequences were discarded from the analysis if a VH gene could not be detected with a high enough accuracy (at least 70% overlap with germline VH and at least 120 nt of VH sequence length were required). Nearly all sequences had a much better fit than 70%. VH genes that were highly related (e.g., alleles of the same VH gene) were grouped into a single gene. We then compared the sequences downstream of the identified VH with all possible germline JH genes, and assumed that the JH gene could not start >50 nt before the 3′ end of the VH gene. We found that practically all sequences for which a VH gene could be identified, the JH gene could also be detected with a high enough accuracy (>70% similarity to germline VH and JH). Note that even if the precise VH gene could not be identified, its final position could almost always be known precisely, because similar VH genes have similar lengths. Given the position of VH and JH, we extracted the sequence regions that included the DH gene segment(s), starting with the sequence that was 30 nt before the end of the VH germline gene and ending with the 3′ end of the germline JH sequence. We looked for the best fit to all germline DH genes in the IMGT database. The optimal fitting DH gene was given a score representing the fraction of the germline DH gene nucleotides fitting the observed sequence (see Fig. 1). The position of the DH gene was then determined relative to the end of the region where the sequence was identical to the corresponding germline sequence and relative to the beginning of the germline JH sequence. The position determines the DRF relative to the beginning of the VH germline sequence.
We first tested whether our DH detection methodology properly identifies the DH germline gene and the corresponding DRF. We have produced artificial V-D-J sequences and have introduced random mutations in these sequences (see 2Materials and Methods). We have then checked the fraction of cases in which the algorithm properly classified the results.
We have limited the analysis to contain up to three mutations in the body of the DH gene and three nucleotide changes/removed in the 3′ and 5′ ends of the DH gene, because sequences with more mutations than that had a too high error level and where not used, as will be further discussed. Within the misclassified ones, we checked the type of misclassification that occurred: either an error with respect to the RF or a misidentification of the DH, or both. Finally, we checked whether the error was due to failure to detect an inverted DH. We found that the error level is a function of the number of mutations in the DH segment (nucleotide differences compared with the germline DH sequence) and the number of nucleotides that are added and/or removed from the DH gene. At or above a level of 75% overlap (the percentage of nucleotides overlapping between the query sequence and the germline DH sequence), <10% of the query sequences are misclassified (Fig. 2). Note that even among these 10%, most errors are due to a misclassification of the DH gene as a similar gene, and not due to errors in the assignment of the DRF (compare Supplemental Fig. 1A with 1B). This misclassification occurs because some DH have highly similar nucleotide sequences (e.g., D-4-11 versus D-4-17). This precision is much better than currently used algorithms such as SoDA, JS, and V-Quest (50–73% precision for 2–6 mutations as used in the simulation). However, the comparison is not fair, because in this study we discard a large fraction of the sequences.
Even a low mutation frequency results in a nonnegligible error rate for highly homologous DH genes (Fig. 2). Therefore, we have restricted our analysis to DH genes that can be identified with a precision of at least 0.75. Thus, this analysis excludes some of the shorter DH genes that practically never reach this level of precision (see Supplemental Fig. 1C, 1D).
Absence of inverted DRF
We first checked the DRF usage in the total sample (without taking into account the specific DH used). We then computed how many unique clones (see 2Materials and Methods) used each RF for each DH. The inverted DRF usage frequency is lower than the precision level of the methodology (<1% of the unique clones contained a possible inverted DRF, compared with an error level of 5%). Thus, sequences that are identified as having potential DH inversions may not really contain inverted DH segments. Indeed, for each DH gene found to be in the inverted DRF, there was also a reasonable fit to a DH in a forward DRF (not always in the same DH gene). Thus, we conclude [as was previously suggested based upon smaller datasets (22)] that the use of inverted DH segments, if it occurs at all, is very infrequent in the IgH repertoire of healthy human adults.
The frequencies at which DH genes are used in our sample vary widely among DH genes. One caveat is that we are only analyzing DH genes in which the fraction of original nucleotides maintained in the rearranged sequence is >75%. This selection induces a bias toward long DH genes that can be classified properly with a better accuracy (Supplemental Fig. 1).
We have repeated the analysis separately for each DH. When averaging over each DH separately (instead of overall clones), some inverted DH genes are observed in rare short DH genes (Fig. 3A). However, in short DH genes, a small number of fitting nucleotides can be enough to ensure a 75% match. As mentioned above, for each candidate inverted DH, a reasonable candidate forward DH can be found. One can thus conclude that in very large cohorts, if inverted DH exist, they are very rare and limited to short DH genes. We repeated the analysis using all the sequences and not only clones, to make sure that this bias is not a result of artifacts from our clone detection methodology (Fig. 3B). When using all the sequences, there are again practically no inverted DH genes.
Note that most inverted DH do not contain stop codons. Thus, there is no a priori reason that these DRF should not be used. Their absence seems to highlight the presence of mechanistic differences in the inverted and forward rearrangements.
Forward DRF usage distribution
The absence of inverted DRF may be mediated by aspects of the rearrangement mechanism. A surprising feature of the DRF usage that cannot be explained by the rearrangement mechanism is the restricted usage of particular forward RFs. In the forward direction, the DRF usage is highly skewed toward the third forward RF for most clones, whereas a minority of the clones uses first and second forward DRF (Fig. 3A). Again, as was the case for the inverted DRF, most forward DRF do not contain stop codons, which would reduce their likelihood of usage in circulating B cells. Moreover, as will be further explained, there is no a priori advantage for one DRF over the others, and no element of the rearrangement process that could readily explain this preference. This unequal distribution thus seems to hint to the presence of a clear selection mechanism for one DRF for a given DH segment.
To check that this result is not the peculiarity of a single individual, we have compared the DRF usage among subjects and among Ab H chain isotypes (data not shown). In all cases, the DRF usage is skewed and the manner of skewing is very similar in different individuals. Thus, if specific DRF are indeed selected, this selection mechanism is similar in most individuals. Note also that there is no reason for the PCR amplification to overamplify one DRF rather than the other because the region of the amplification product that contains the DRF is internal to and nonoverlapping with the primers.
As mentioned above, the frequencies at which DH genes are used in our sample vary widely among DH genes. Thus, the average results could be the result of a small number of highly frequent DH genes. As was the case for the inverted DRF, we have repeated the analysis separating each DH. The DRF usage varies among DH genes, but it is again highly nonuniform for each DH gene (Fig. 3A). The nonuniform distribution is not the result of stop codons, because DRFs without stop codons are practically absent. Moreover, the DRF usage pattern is highly reproducible among samples from different individuals (Supplemental Fig. 2) and is the result of a large number of clones for each sample and each DH (data not shown). Supplemental Fig. 2 shows the DRF usage of clones and not of total sequences, but the results for total sequences are similar. Thus, the biased distribution cannot be the result of amplification errors or biases, because those would not affect the clone number (remember that similar sequences comprising each clone are only counted once). For the same reason, the biased distribution is not affected by very large clones, because again all sequences in the same clone are only counted once. Thus, if selection occurs, it is not the result of the amplification of some clones, but rather the selection of individual clones. Moreover, the RACE protocol minimizes artifactual skewing of VH or JH gene usage because the primers are neither VH nor JH specific.
To test that the total DRF usage is not the result of errors (that comprise <10% of classified sequences) in either the DH or DRF classification, we repeated the analysis and increased the precision level requested, until only sequences with a 100% precise classification were used (i.e., the full DH gene is observed in the original sequence). Note that in such a case, the error level is negligible, and a limited number of sequences are used (most of these sequences have only insertions and no deletions at the VD and DJ junctions). Even under these conditions, the average DRF usage did not change (Fig. 3C). Thus, the nonuniform DRF usage seems to be a real feature of the sequences, and not an artifact of the methodology. In contrast, one may worry that because we are looking only at sequences with a good enough classification of the DH gene, the results may be biased toward a given DRF. However, the comparison with the DH gene is only performed with the germline and is not affected by additions around the DH gene, and the DRF preference is also observed in CDR3 sequences where the distance between VH and DH, and between DH and JH is positive, where the majority of sequences are taken into account (Supplemental Fig. 3).
Comparison of DRF in IgM+, IgG+, and IgA+ B cell subsets
To determine whether DRF skewing could be mediated by selection in the periphery, we compared DRF usage in naive (CD27−) and memory (CD27+) sorted B cell subsets. We amplified IgM rearrangements (from cDNA) of the naive B cells, and, to further analyze the memory B cell pool, we amplified IgM, IgA, and IgG rearrangements from the CD27+ fraction (see 2Materials and Methods).
The relative DRF usage distribution can be treated as a 3 × 18 frequency matrix (3 DRF and 18 of 34 DH genes that had >100 sequences in most tested donors) in which the sum of each column is 1. We rearranged this matrix as a vector and computed the correlation of this vector between all samples (12 subjects, 4 compartments: naive [IgM] and memory [CD27+] IgM, IgG, or IgA). We observed a very similar distribution among all samples, with the maximal similarity being among technical repeats of the same sample, followed by different isotypes from the same donor (corr = 0.8) (Fig. 4A). But even between individuals, the correlation was typically >0.75 (Fig. 4A). The correlation between naive IgM repertoires was 0.8, whereas the correlations within and between memory isotypes were 0.7 and 0.77, respectively (t test, p < 1.e-10 for all comparisons). The higher similarity between naive IgM sequences than between IgA and IgG suggests that there is a shared bone marrow–based selection mechanism that drives the uniform selection in IgM sequences, followed by diversification in the memory compartment.
The analysis of selection can be taken one step further by comparing the variability of DRF usage in naive versus memory B cell IgH repertoires. Completely uniform DRF usage would lead to a SD of 0 in the DRF usage (per DH gene). Conversely, in the extreme case of usage of a single DRF, the SD would be 0.57 (the SD of [1,0,0]). The SD in the naive compartment is ∼0.3, and it rises progressively to 0.37 in the memory IgM repertoire and to 0.43, almost the maximal variability (i.e., most limited DRF usage), in the IgG compartment (Fig. 4B). The rise in the DRF variance may be partially due to clonotypic enlargement. However, there is typically much more than one clone per DH gene. Thus, clonotypic enlargement is far from being the only source of the high variance.
Among the memory subsets, we found the highest degree of variance in the IgG+ sequences, where DRF usage is limited to practically a single DRF (t test, p < 0.01 versus naive IgM cells). As expected, memory IgM has a variance between the naive and IgG variances. The DRF usage is less skewed in memory IgA than in IgG sequences for reasons that are not clear.
The average overall DH gene DRF distribution was practically equal among all subjects and compartments (Fig. 4C), which again favors a two (or more)-stage selection model: a common selection stage among naive cells or their precursors, and further selection/diversification in the IgG/IgA pools, which is subject specific.
Distance between VH, DH, and JH
One possible explanation for the skewed usage of DRFs in the IgM compartment is a rearrangement bias. Such a bias could arise if there were preferred rearrangement distances between VH and DH or between DH and JH. If biased rearrangement had been the source of the skewed DRF usage, we would have expected to observe a very narrow distribution of the distances between VH and DH (or DH and JH). We have measured this distance for each DH. For most DH genes, one observes a very wide distribution covering >30 nt, with jumps of 3 nt (Fig. 5). Thus, for example, the frequency of a distance between VH and DH of 3 nt is much higher than 2 or 1 nt, but is similar to a distance of 0 nt. The same can be observed in the distance between DH and JH (data not shown). Although the JH segment used influences the CDR3 length, the range of rearrangement lengths cannot be readily ascribed to skewed JH usage for a particular DH gene segment. Furthermore, the wide range of CDR3 lengths that occur at 3-nt intervals in the preferred DRF exists even at the level of specific V-D-J combinations (Fig. 6); note again that the RACE protocol is not expected to produce any length or VH-JH bias. Such a bias in intervals of three over a wide range of rearrangement lengths is therefore highly unlikely to be the result of biased rearrangement. One is thus left with positive or negative selection as the most likely potential explanations for the common bias in DRF usage.
DRF usage is uniform in out-of-frame IgH rearrangements
To further demonstrate that the DRF bias is likely to be due to selection rather than biased recombination, we have analyzed out-of-frame (OF) rearrangements in B cell genomic DNA. If biased rearrangement contributes to the DRF bias, then this should also be observed in OF rearrangements. We thus compared the DRF bias in rearrangements that are IF with those that are OF. To obtain sufficient numbers of OF sequences, we sequenced genomic DNA using the Illumina platform (Table I, see 2Materials and Methods). First, we checked whether we observe the same average DRF usage using a different sequencing technology (Fig. 7A). Indeed, one can clearly see the similar biased DRF usage between the cDNA 454 samples and the DNA Illumina samples. The fit shows again that the DRF usage is not an artifact of the sequencing methodology.
Consistent with selection, the OF DRF usage was practically uniform for all DH genes in the forward direction. Indeed, the OF variance is not very different from what was expected due to random chance in most DH genes. Conversely, in the IF rearrangements, the observed bias is similar to the one observed in the cDNA sequence analysis described using 454 sequencing (Fig. 7B), and significantly different from random for most alleles (p < 1.e-100). Because the number of OF sequences was much smaller than the number of IF sequences, the expected variance in the OF DRF was much larger than the expected variance in the IF DRF.
The junction between DH and JH exhibits a small degree of reading frame bias compared with DRF bias
We wondered whether counterselection for the Dμ protein could account for the DRF bias. The Dμ protein is generated by DH to JH rearrangement prior to complete VDJ rearrangement in pro-B cells. The Dμ protein is often encoded in RF2, and, based upon the analysis of Dμ transgenic mice, Dμ signaling can inhibit VH to DJ rearrangement. The Dμ protein can associate with the surrogate L chain as well as Igβ (11), and, intriguingly, the signaling adaptor and tumor suppressor protein called SLP-65 (39) is required for Dμ selection (40).
Because the Dμ protein is generated by rearrangement between DH and JH only at the N2 junction, it cannot account for skewing in the DRF of the fully rearranged VDJ unless the N1 junction is somehow constrained. To test this idea, we analyzed the RF usage of N2 in isolation (without regard for N1). If the Dμ protein is counterselected and contributes significantly to the final VDJ DRF usage, then we expect to find counterselection of N2 RF2. Indeed this is the case (Fig. 8A). RF2 is underrepresented. Intriguingly, N2 RF3 is more frequently used than either N1 or N2. But the effect sizes are small compared with the skewing observed when both the N1 and N2 junctions are taken into account (Fig. 8B ). These data indicate that the skewing introduced by the Dμ RF is insufficient to fully account for the DRF bias.
Stop codon usage is insufficient to account for DRF bias
We next wondered whether negative selection for disfavored DRFs arose due to higher frequencies of stop codon usage in the disfavored DRFs. This possibility seems unlikely because stop codons are only found in 30% of the forward DRF (a table of human DH genes and amino acid conversion in each reading frame can be found at http://peptibase.cs.biu.ac.il/homepage/Lymphocyte_trans_aa.htm). Moreover, stop codons often reside in the extremities of the DH gene and tend to be eliminated through nucleotide deletion and addition. DRF usage exhibited a low correlation with the presence or absence of a stop codon in a given DRF for a given DH gene. Thus, stop codons are unlikely to be significant drivers of DRF selection. Furthermore, there are alternative DRFs that contain no stop codon (for example, DRF 2 and 3 of IGHD1-20*01), but that nonetheless display a very clear bias (0.8 versus 0.2 in the case of IGHD1-20*01).
This leaves two final scenarios that are not mutually exclusive. Either specific DRFs are selected at the amino acid level in the germline through evolution (41), or specific properties of DH genes are selected somatically (either positively or negatively) during the life of the B cell (9–12, 16, 42–44).
Correlation of amino acid with DRF frequency
To test whether skewed DRF usage correlates with the presence (or absence) of particular amino acids, we computed a DRF usage probability matrix (3 DRF *18 DH genes, represented as a single 54-position vector) and compared it with an occupancy matrix for each amino acid in each DH in each DRF. Thus, each amino acid was assigned a (3*18) matrix representing the number of times it appears in each DH and DRF (represented again as a single 54-position vector). We then computed for each sample and each amino acid the Pearson correlation between the DRF usage probability matrix and the amino acid occupancy matrix. The result was a correlation value for each amino acid for each sample. For most amino acids, this correlation is highly significant (p < 0.01 for 16 of 20 aa using a t test over the correlation in all samples versus 0). Moreover, the correlation is highly consistent among all samples. Whereas the overall average correlation in the DRF usage among all samples is 0.75–0.8, the list of the 10 most highly correlated amino acids (both positive and negative) is actually conserved in 95% of the samples. Strikingly, some amino acids are highly positively correlated with DRF usage, such as threonine, tyrosine, or serine, and some are highly negatively correlated (leucine, glutamine, and others; Fig. 9). Moreover, except for methionine, asparagine, and valine, all amino acids are either always positively correlated with expression or negatively correlated in the vast majority of samples.
Effect of DRF and DH gene usage on DH gene position
A possible explanation for the DRF bias may be that each DH gene and DRF uses a different part of the DH gene. For example, for a given gene, the end of the gene may be typically used, whereas for other genes the initial part may be used, and the end deleted in the junction production. Interestingly, there are clear differences between the different DH genes (Fig. 10). Fig. 10 shows for each position along the DH gene, for each DH gene and each RF, how often this position is used starting at the beginning of the DH in the rearranged gene. Each triplet of rows is one DH gene in three different forward RFs (DRF1, DRF2, and DRF3), and each position along the x-axis represents the nucleotide position along the DH germline gene sequence. The colors represent the relative frequency at which a position is used (the number of times that that nucleotide is used divided by the number of rearrangements involving that particular DH in that particular DRF). As one can clearly see, there are preferred positions used by each DH gene. However, these preferred positions are quite similar among the different DRF for the same gene. Still, some specific differences exist, and some DH genes tend to use preferred positions in some specific RFs (the red boxes in Fig. 10). We have thus analyzed the amino acid sequences that are encoded by nucleotides in these positions, and there seems to be a preferred usage of a QLV sequence for some of the DH genes (see Table II). This preference is unlikely to be explained by frequent usage of Q and L among germline DH genes, because RFs individually containing Q and L individually are actually infrequently used (Fig. 9). Rather, this specific combination of amino acids seems to be favored. Thus, whereas selection for the starting point of a DH gene is probably based on the rearrangement mechanism for most DH genes, there are some specific sequences that are highly positively selected.
|DH Gene .||DRF .||Starting Position (Nucleotide) .||Nucleotide Sequence (5′→3′) .||Amino Acid Sequence .|
|DH Gene .||DRF .||Starting Position (Nucleotide) .||Nucleotide Sequence (5′→3′) .||Amino Acid Sequence .|
The first column is the DH gene; the second column the DRF; the third column is the position in the DH gene used as a beginning of the DH gene. The last columns represent the nucleotide and amino acids of the resulting DH genes. An interesting QLV motif appears in many of the sequences. The nucleotide sequences are not the full germline sequences, but the relative regions translated, taking into account the DRF and the starting position in each case.
We show in this work in human peripheral blood B cells that a single DRF predominates for most DH. Thus, in reality there is much less freedom in DRF usage, and the expressed IgH repertoire is therefore not as diversified by alternative DRF as is theoretically possible. The skewing toward particular DRFs mapped not only to individual DH genes, but tended to be similar in members of the same DH gene family. One of the most striking findings was that the skewing toward a particular DRF in the rearrangements of a given DH gene was preserved over a wide range of rearrangement lengths. This finding suggests that the DRF bias reflects a selective process that is intrinsic to the DH gene sequence itself. Furthermore, the DRF bias was highly similar in different individuals and was most marked in naive IgM+ B cells. Collectively, these findings indicate that there is something about the DRF bias that is hardwired into the preimmune repertoire. But how and why does this happen?
There are two general ways in which DRF bias could arise. The first is that bias could be introduced at the time of rearrangement. The second possibility is that rearrangements are random, but that selection (operating either negatively or positively) favors certain DRFs. Biased rearrangement would predict that OF rearrangements should also exhibit biases in DRF, yet this is not observed: instead, the usage of DRF is fairly uniform among OF rearrangements. As naive IgM+ B cells exhibited the most striking and consistent DRF bias, we reasoned that the major checkpoint for DRF selection must occur early during B cell development. The earliest stage in which this could occur is in pro-B cells (when H chain rearrangement is occurring) and could involve the Dμ protein. However, when we analyzed the RF skewing in the N2 junction between DH and JH, the effect sizes were small compared with the level of skewing when N1 and N2 junctions were viewed in aggregate in the completed VDJ rearrangement. Furthermore, there was no specific distance between the N1 and N2 rearrangement junctions that was favored. Instead, the DRF bias existed over a wide range of rearrangement lengths. Thus, counterselection for the Dμ protein appears to be insufficient to account for the DRF bias. Furthermore, the frequency of stop codon usage in the different DRFs was also insufficient to account for the DRF bias. Collectively, these findings favor the remaining alternative, namely that selection for particular amino acid motifs within the DH gene sequence is occurring. Consistent with amino acid–based selection, the usage of particular amino acids within particular DRFs is highly nonrandom.
Currently, we have no clear explanation for the mechanism relating the amino acid usage with the DRF usage. The simplest explanation would be that some target Ags bind to specific amino acids. Thus, the main element that shapes the DRF bias could be simply the need to express or avoid certain specific amino acids in the CDR3 region. As such, these results may provide a map of permissive CDR3 amino acid motifs that could be helpful in the analysis of Ab and autoantibody repertoires. The DRF bias could be altered if there is relaxed selection stringency, as could occur in autoimmunity.
In parallel to the current analysis, DRF bias in human B cells has been studied by Larimore et al. (31), which confirms our observation on the larger DRF bias in productive than in nonproductive rearrangements, the authors also observed biases against longer CDR3 sequences with higher hydrophobicity (GRAVY) scores among productive rearrangements. Counterselection against longer CDR3 sequences has also been documented during early B cell development in humans (45) and is observed in our sequences. However, we did not observe the skewing toward hydrophobic residues. In our analysis, we computed the correlation between the DRF usage with the presence or absence of the different amino acids (Fig. 9). Therefore, we should expect a negative correlation with hydrophobic amino acids. Instead, we show that hydrophobic amino acids either positively (phenylalanine, isoleucine) or negatively (leucine, tryptophan) correlated with the DRF usage, and some did not correlate at all (methionine, valine).
Our analysis shows that, in addition to major shifts during early B cell development, DRF usage appears to be further shaped by peripheral selection, as could arise through exposure and immune responses to foreign Ags produced by pathogens. We have compared the naive and memory B cell repertoires. Biases in DRF usage are most pronounced and consistent between different individuals in the naive B cell repertoire. Among the CD27+ B cell pools there are greater interindividual differences. Among IgG+ CD27+ B cells the DRF bias is less pronounced, and overall the repertoire appears to contract, with a smaller and shorter range of CDR3 lengths. Additionally, there are differences in DRF usage between IgA+ and IgG+ subsets. Collectively, these data suggest that DRF is subjected to more than one selection checkpoint, the first occurring in the bone marrow, followed by a second occurring peripherally. We have no clear explanation why IgG selection appears to be more stringent than IgA selection. It does not appear to be due to differences in sample size and because unique sequences rather than total copies were analyzed; thus, these data are not likely to be significantly skewed by clonal expansion.
The conclusion from all these observations is that the repertoire is shaped by a common and seemingly quite stringent selection mechanism that operates primarily in the naive pool, followed by further selection in the periphery. This interpretation is reinforced by the very clear correlation between the presence of specific amino acids in the CDR3 with the usage of a given DRF. This correlation is very similar in samples from different individuals, again highlighting a common selection mechanism. However, it is not clear how this selection operates. Selection could be influenced by self-Ags that are highly expressed in the bone marrow or in the periphery, as proposed in the context of natural Abs. Alternatively or in addition, the selection could be negative, disfavoring Abs with certain biochemical properties in their CDR3s. In mice, we and others (7–11, 46) have shown that charged amino acids induce negative selection, which constrains the DRF (8).
This study advances our understanding of B cell repertoire selection in two basic ways, as follows:
First, this study shows that DH gene segments do not afford maximal diversity, due to the biased usage of particular DRFs. This restriction in diversity arises after V(D)J recombination because OF rearrangements do not exhibit a similar bias. This implies that selection for particular RFs occurs after their generation. Intriguingly, the restricted usage of a subset of amino acids in a particular DH is seen over a large range of rearrangement lengths. This suggests that there is an important region at the core of the DH gene sequence that may be getting selected. We show that the Dμ protein and the frequency of stop codons in the different DRFs are insufficient to account for the degree of DRF skewing. Finally, we show that most of the DRF skewing occurs in naive IgM+ B cells and that DRF usage shifts slightly in the more mature (CD27+) B cell pools. Taken together, these findings indicate that there are two windows for DRF selection during B cell development. By defining the skewed DRF usage in healthy individuals, we can use this information to study how B cell selection may be altered in B cell disorders such as autoimmunity or cancer.
Second, we show that the bias in DRF varies by individual DH gene segment and is reproducible in different individuals. This implies that the selection mechanism leading to DRF bias may be evolutionarily conserved. Analyzing the DRF distribution in different species may provide insight into the potential (self) Ags that drive this process.
There are three potential limitations to this analysis. First, the amino acid: DRF correlation is biased by the presence of stop codons. For example, leucine is often present near a stop codon. However, even when only sequences without stop codons are used, clear correlations are observed (Supplemental Fig. 4). Second, this analysis utilizes the entire germline DH sequence. Because DH genes are frequently nibbled from both ends during nonhomologous end joining, the amino acid occupancy matrix may not adequately reflect the true amino acid usage. Indeed, one can, for example, observe a very limited negative correlation of the DRF frequency with the presence of stop codons. The correlation is limited, because stop codons are often near the borders of the DH gene and are nibbled away. Note that in this work we have not analyzed the effect of somatic hypermutation on DRF. This is a fascinating issue and one that requires far more analysis in the future.
This work was supported by the Lupus Research Institute and National Institutes of Health Grant R56-AI-090842 (to E.T.L.P.).
The online version of this article contains supplemental material.
Abbreviations used in this article:
DH gene reading frame
The authors have no financial conflicts of interest.