Abstract
The accurate partitioning of Ig H chain VHDJH junctions and L chain VLJL junctions is problematic. We have developed a statistical approach for the partitioning of such sequences, by analyzing the distribution of point mutations between a determined V gene segment and putative Ig regions. The establishment of objective criteria for the partitioning of sequences between VH, D, and JH gene segments has allowed us to more carefully analyze intervening putative nontemplated (N) nucleotides. An analysis of 225 IgM H chain sequences, with five or fewer V mutations, led to the alignment of 199 sequences. Only 5.0% of sequences lacked N nucleotides at the VHD junction (N1), and 10.6% at the DJH junction (N2). Long N regions (>9 nt) were seen in 20.6% of N1 regions and 17.1% of N2 regions. Using a statistical analysis based upon known features of N addition, and mutation analysis, two of these N regions aligned with D gene segments, and a third aligned with an inverted D gene segment. Nine additional sequences included possible alignments with a second D segment. Four of the remaining 40 long N1 regions included 5′ sequences having six or more matches to V gene end motifs, which may be the result of V gene replacement. Such sequences were not seen in long N2 regions. The long N regions frequently seen in the expressed repertoire of human Ig gene rearrangements can therefore only partly be explained by V gene replacement and D-D fusion.
All Ig H chain V regions are encoded by gene rearrangements that take place early in B cell ontogeny. The rearranged H chain V gene is the outcome of the recombination of gene segments selected from each of three sets of germline gene segments: VH, D, and JH (1). Each rearranged H chain V gene encodes a polypeptide composed of supporting β-pleated sheet scaffold or framework regions (FR) and three hypervariable loops, termed complementarity determining regions (CDRs). These CDRs are the principal points of interaction between the H chain and its specific Ag. The CDR3 of the H chain is encoded by the VH-D-JH junction, and is the most diverse of all the hypervariable loops (2). However, this diversity is not generated solely by gene segment rearrangements. Diversity is enhanced by processing of the ends of the recombining elements through exonuclease activity and by nontemplated (N) (3) and palindromic (P) (4) nucleotide addition. Following clonal selection, point mutations introduced by the process of somatic hypermutation within the germinal center reaction introduce an additional level of diversity (5).
Sequencing of junction regions, and the identification of their embedded D genes, has been critical to the discovery of the processes contributing to the diversity of the CDR3 regions. In addition to the well-known processes of exonuclease activity, and the addition of N and P nucleotides, other processes have been reported, including the use of inverted D gene segments and gene conversion (6), the insertion and deletion of trinucleotides (7), the use of D genes with irregular recombination sequences (8), and the use of more than one D gene segment in a single rearrangement (9). However, the veracity of some of these processes remains in dispute. For example, although some studies have presented evidence against the possibility of D-D recombination (10), other studies have reported such recombination events to be relatively common (11, 12). The failure to resolve this and other controversies can in part be explained by the frequent difficulties involved in arriving at an unambiguous identification of the various elements within junction regions. These alignment problems have seemed to be particularly difficult because of the effects of somatic point mutations.
Following a recent report that the extent of somatic hypermutation decays exponentially, downstream of the promoter region (13), the distribution of mutations has become amenable to new kinds of analysis, which can be applied to improve the partitioning of rearranged genes. The likelihood of a mutation occurring in any part of a sequence can now be calculated from its position within the sequence, as well as by reference to known mutational hot spots and cold spots within the sequence (14).
We have developed a statistical analysis based upon elements of the hypermutation process, to objectively evaluate any proposed partitioning of an Ig gene sequence between two elements. The analysis uses trinucleotide mutability scores that we have determined after analyzing the frequency and 5′ to 3′ distribution, within germline sequences, of each trinucleotide. The mutability scores for the different trinucleotides can be used to determine the mutability of Ig sequences, and of parts of sequences. These sequence mutability scores can then be used to predict the likely distribution of somatic point mutations within V segments, and between V segments and putative N regions and D segments.
The predictability of mutations within different regions of rearranged V genes makes it possible to develop new, objective criteria with which to identify D gene segments within the VHDJH junction. These criteria exclude matches where apparent identity with D regions is likely to occur as a consequence of the known nucleotide preferences of TdT (15).
Such an analysis was applied to a set of relatively unmutated IgM H chain sequences. Improved identification of the ends of rearranged gene segments led to the development of a dataset of N nucleotides that could be identified with little ambiguity. The study of this N nucleotide dataset showed that human Ig junction regions frequently include long N regions. These regions were investigated for the presence of additional D gene segments. The reality of D-D fusions involving both orientations of D gene segments was confirmed, and evidence of V gene replacement was also seen. Analysis of N nucleotides also revealed nucleotide patterns that cannot presently be explained, which may represent the molecular signatures of presently unknown aspects of Ig gene rearrangement.
Materials and Methods
Trinucleotide mutability scores
Forty-nine nonproductively rearranged Ig H chain V gene sequences, obtained from the GenBank database, were aligned using the VQUEST software (16), and somatic point mutations in each V segment sequence were noted by reference to the ImMunoGeneTics (IMGT) Ig gene database (16). Trinucleotide mutability scores were determined by first counting the number of occurrences of each trinucleotide in the germline V gene segments from which each of the 49 sequences was derived. For each overlapping trinucleotide, the sequence position (in the germline sequence) of the first nucleotide of the trinucleotide was recorded. The trinucleotide counts were then adjusted to account for the position of each trinucleotide within the sequence, given the exponential decay of the mutation rate along the V gene sequences, as determined by Rada and Milstein (13): an adjusted trinucleotide count C′NNN,S was calculated, for each trinucleotide NNN in the sequence S, using the formula:
where N is any nucleotide A, G, T, or C, and P(NNN,S) is the set of sequence positions where the trinucleotide NNN is observed in sequence S.
This adjusted trinucleotide count was used to determine the expected number of mutations of each trinucleotide, in each of the nonproductively rearranged sequences, assuming that all mutations were randomly distributed through the sequences. The expected mutation frequency for trinucleotide NNN in sequence S (Fe(NNN,S)) allows for the fact that most observed mutations will affect three overlapping trinucleotides, and was calculated as follows:
where T(NNN,S) is the total number of mutated NNN trinucleotides in S, and A is the set of all possible trinucleotides nnn, where n is A, G, C, or T.
The number of times each trinucleotide was expected to mutate (Fe(NNN,S)) and the number of observed mutations (Fo(NNN,S)) were determined, and summed for each of the 49 sequences, and the trinucleotide mutability scores (MNNN) were finally calculated as follows:
where D is the set of 49 sequences.
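The position weighting and final scoring described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. The decay constant `decay`, the 1-based position convention, and the reading of the final score as a ratio of summed observed to summed expected mutations are all assumptions, because the corresponding formulas and constants are not reproduced in this text.

```python
import math


def trinucleotide_positions(seq, trinuc):
    """Return 1-based start positions of every overlapping occurrence of trinuc."""
    return [i + 1 for i in range(len(seq) - 2) if seq[i:i + 3] == trinuc]


def adjusted_count(seq, trinuc, decay):
    """Position-weighted count of trinuc in seq.

    Each occurrence contributes exp(-decay * position), so 5' occurrences
    weigh more than 3' occurrences. `decay` is a hypothetical parameter.
    """
    return sum(math.exp(-decay * p) for p in trinucleotide_positions(seq, trinuc))


def mutability_score(observed, expected):
    """Assumed form of the score: summed observed over summed expected mutations,
    with one entry per analyzed sequence."""
    total_expected = sum(expected)
    if total_expected == 0:
        raise ValueError("expected counts sum to zero")
    return sum(observed) / total_expected
```

With `decay` set to zero the adjusted count reduces to a plain count of overlapping occurrences, which is a convenient check on the weighting.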
V gene segment mutability scores
A representative allele from each of the Ig H chain V gene segments in the IMGT database (17) was analyzed to calculate its relative tendency to mutate, as a consequence of the frequencies of hot spots and cold spots within its sequence. The mutability score (MV) of any V gene sequence was calculated as the sum of the trinucleotide mutability scores (MNNN) calculated for the overlapping trinucleotides that make up the sequence, after adjustment for the position (P(t)) of each of those trinucleotides within the sequence, as follows:
where B(V) is the set of all overlapping trinucleotides in sequence V.
Mutability scores for sequence regions including CDRs (MCDR) and FRs (MFR) were also calculated using Equation 4.
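A sequence or region score of this kind might be computed as in the following sketch. The exponential form of the positional adjustment, the `decay` parameter, and the `offset` argument (used to place a region such as a CDR at its position within the whole sequence) are assumptions made for illustration. The `scores` dictionary stands in for the trinucleotide scores of Table I.

```python
import math


def sequence_mutability(seq, scores, decay, offset=0):
    """Sum position-weighted trinucleotide scores over all overlapping trinucleotides.

    seq    -- nucleotide sequence of the segment or region
    scores -- dict mapping each trinucleotide to its mutability score
    decay  -- hypothetical exponential decay constant
    offset -- number of bases preceding seq in the full sequence
    """
    total = 0.0
    for i in range(len(seq) - 2):
        trinuc = seq[i:i + 3]
        position = offset + i + 1  # 1-based position of the first nucleotide
        total += scores[trinuc] * math.exp(-decay * position)
    return total
```

Calling the function with a nonzero `offset` scores a CDR or FR at its true distance from the start of the sequence, rather than as if the region began at position 1.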
Ag selection
The expected numbers of mutations in the CDR1 and CDR2 regions of 69 mutated H chain sequences (18, 19), obtained from the GenBank database, were determined by reference to the mutability scores (MCDR and MFR) of the combined CDR1 and CDR2 sequences and the combined FR1, FR2, and FR3 sequences. These two mutability scores were then used to determine the probability (pMut) that any mutation will occur in either the FR or CDR regions:
The probabilities of the distributions of mutations in the FR and CDR regions of each of the 69 gene sequences, containing known numbers of mutations, were then calculated using the binomial distribution. The Ag selection factor was finally calculated as the mean ratio of observed V segment CDR mutations over expected CDR mutations, for the 69 sequences:
where NS is the number of mutations observed in sequence S.
This selection factor R was then used to calculate an adjusted mutability score (M′S) for any sequence S containing framework sequences and CDR sequences by the following modification of Equation 4:
where B(FR) is the set of all overlapping trinucleotides in framework sequences and B(CDR) the set of all overlapping trinucleotides in CDR sequences. Note that P(t) is the position of the trinucleotide relative to the first base of the whole Ig sequence.
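The selection calculation can be sketched as follows, under stated assumptions. The probability of a CDR mutation is taken to be the CDR share of the combined mutability, the expected CDR count is that probability multiplied by the number of mutations in the sequence, and the adjusted score is read as the FR contribution plus R times the CDR contribution. These forms are inferred from the surrounding definitions and are not confirmed by the text.

```python
def cdr_probability(m_cdr, m_fr):
    """Assumed probability that a single mutation falls in the CDRs."""
    return m_cdr / (m_cdr + m_fr)


def selection_factor(records):
    """Mean ratio of observed to expected CDR mutations.

    records -- iterable of (observed_cdr, total_mutations, m_cdr, m_fr),
               one tuple per mutated sequence
    """
    ratios = []
    for observed_cdr, total_mutations, m_cdr, m_fr in records:
        expected_cdr = total_mutations * cdr_probability(m_cdr, m_fr)
        ratios.append(observed_cdr / expected_cdr)
    return sum(ratios) / len(ratios)


def adjusted_mutability(m_fr, m_cdr, r):
    """Assumed adjusted score: CDR mutability is scaled by the selection factor r."""
    return m_fr + r * m_cdr
```

The inputs `m_fr` and `m_cdr` are expected to be position-weighted region scores, with positions counted from the first base of the whole sequence.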
Partitioning of rearranged H chain V genes
Two hundred twenty-five full-length IgM sequences, with five or fewer mutations within their V gene segments, were identified among the 729 IgM sequences in the IMGT database (17). The sequences were carefully aligned, with particular focus upon each boundary within the CDR3. To partition any junction sequence, the putative VH, D, and JH gene segments were first determined using the VQUEST program (17). Partitioning of any sequence (S) was then arrived at by determining the mutability (MS) of the germline sequence from which it was derived, and determining the probability that any number of somatic point mutations could occur in that sequence, by reference to the mutability score of its associated germline V gene segment (MV), the number of mutations seen in that V sequence, and the binomial distribution. Consideration of the distribution of mutations within the ends of V, D, and J gene segments, in the light of these probabilities, clarified whether or not some apparent mutations were more likely to be evidence of exonuclease activity and N nucleotide addition.
In this way, the most likely 3′ ends of the V gene segments, and then the 5′ ends of the putative D gene segments, were determined. The intervening sequences were designated the putative N1 regions of N nucleotide addition. The 5′ ends of the J gene segments and the 3′ ends of the D gene segments were then similarly determined, and the intervening N2 regions were identified. Finally, where there was no evidence of exonuclease activity at a gene segment end, putative N1 and N2 regions were examined for palindromic sequences, and such nucleotides were designated as P nucleotides. Alignment of a first D gene segment was only accepted if a minimum of 8 consecutive matching nucleotides were seen, or 9 matches in a 10-nt sequence.
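The acceptance rule for a first D gene segment (8 consecutive matches, or 9 matches in a 10-nt sequence) can be expressed as a short check. The sketch assumes that the junction fragment and the germline D fragment have already been aligned without gaps and are of equal length. The function names are illustrative only.

```python
def longest_match_run(query, germline):
    """Length of the longest run of consecutive matching positions."""
    best = run = 0
    for q, g in zip(query, germline):
        run = run + 1 if q == g else 0
        best = max(best, run)
    return best


def best_window_matches(query, germline, window=10):
    """Largest number of matches found in any window of the given length."""
    flags = [q == g for q, g in zip(query, germline)]
    if len(flags) < window:
        return 0
    return max(sum(flags[i:i + window]) for i in range(len(flags) - window + 1))


def accept_first_d(query, germline):
    """True if the alignment shows 8 consecutive matches or 9 matches in 10 nt."""
    return (longest_match_run(query, germline) >= 8
            or best_window_matches(query, germline) >= 9)
```

A perfect 7-nt match is rejected by this rule.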
The acceptance criteria for a second D gene alignment depended upon the length of the putative N(D2)N region under investigation. The probabilities that N nucleotide addition could give rise to apparent D segments were determined by first identifying all unique 6-, 7-, 8-, 9-, 10-, 11-, and 12-nt sequences that can be produced from known D gene segments, and determining their probabilities using probabilities of N addition based upon the known TdT nucleotide preferences (15), as follows: p(A) = 0.15, p(T) = 0.15, p(G) = 0.6, and p(C) = 0.1. Calculations, which are more fully described below, included the equal likelihood that G nucleotides in N regions could arise from G addition to the sense strand, or from C addition to the antisense strand (20). The analysis was performed for complete matches, as well as for sequences with one or two mismatches. The resulting probabilities were used to determine the likelihood that matches of varying lengths would be seen, given the length of the N(D2)N region under investigation. For example, if the probability that 8 N nucleotides will align perfectly with a D segment is 0.005, then the probability that such an alignment might be seen in an N(D2)N region of 18 nt is 11 × 0.005, or 0.055, because an 18-nt sequence contains 11 overlapping sequences of 8 nt. These calculations were used to determine the acceptance criteria for second D gene segments in each N region, by setting 95% confidence limits based on the cumulative distribution function. These confidence limits describe the least likely sequences, which individually were highly improbable, and which together might be seen with a probability of 0.05. Identical acceptance criteria were developed for inverted D gene sequences. In practice, most long N regions were between 10 and 16 nt in length, and required matches of 8 consecutive nt, or 1 mismatch in 10 or 11 nt. Alignments including two or more mismatches could be excluded on the basis of the mutation analysis.
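The worked example of 11 × 0.005 can be reproduced with a brief sketch, which also shows one way of scoring a candidate sequence under the stated TdT preferences. Treating a whole N region as arising with equal likelihood from either strand is an assumption about how strand symmetry was handled. The cap at 1.0 is added only to keep the approximation a valid probability.

```python
from math import prod

# TdT nucleotide preferences as stated in the text.
TDT = {"A": 0.15, "T": 0.15, "G": 0.6, "C": 0.1}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def n_addition_probability(seq):
    """Probability that N addition yields seq, averaging over the two strands.

    The antisense term uses complemented bases; order does not matter
    because the probability is a product over positions.
    """
    sense = prod(TDT[n] for n in seq)
    antisense = prod(TDT[COMPLEMENT[n]] for n in seq)
    return 0.5 * sense + 0.5 * antisense


def windows(region_length, match_length):
    """Number of overlapping windows of match_length in a region."""
    return max(region_length - match_length + 1, 0)


def chance_match_probability(p_match, region_length, match_length):
    """Approximate chance of seeing a match somewhere in the region."""
    return min(1.0, windows(region_length, match_length) * p_match)
```

For an 18-nt region and an 8-nt match, `windows(18, 8)` gives the 11 overlapping positions used in the example above.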
Identification of a second D gene in the N1 region was accepted only if that D gene segment was located 5′ of the primary D gene segment, within the germline. Similarly, a second D gene in the N2 region was only accepted if that D gene segment was located 3′ of the primary D gene segment, within the germline.
Analysis of the GC content of putative N regions
To analyze whether putative N regions carried the G/C features of N regions, including high G or C content, and homogeneity of either G or C (20), the probability of the particular frequencies of G, C, and A/T nucleotides within the N regions was determined as follows:
where w is the number of A or T (ambiguity code W), g is the number of G, and c is the number of C nucleotides.
The probability of an A or T insertion (pW) was set to 0.3, the probability of a G insertion (pG) to 0.6, and the probability of a C insertion (pC) to 0.1, as determined by Basu et al. (15). The equation assumes that concatenation of strands does not occur (20), and allows for the fact that, for example, C additions may appear in the sense strand through addition of G nucleotides to the antisense strand.
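One plausible reading of this calculation is a multinomial probability averaged over the two strands, sketched here. The multinomial coefficient and the equal strand weighting are assumptions, since the equation itself is not reproduced in this text.

```python
from math import factorial


def composition_probability(w, g, c, p_w=0.3, p_g=0.6, p_c=0.1):
    """Probability of an N region with w A/T, g G, and c C nucleotides.

    The region is assumed to come entirely from one strand, chosen with
    equal likelihood; on the antisense strand the roles of G and C swap.
    """
    n = w + g + c
    coefficient = factorial(n) // (factorial(w) * factorial(g) * factorial(c))
    sense = p_w ** w * p_g ** g * p_c ** c
    antisense = p_w ** w * p_g ** c * p_c ** g
    return coefficient * 0.5 * (sense + antisense)
```

Under these assumptions the function is symmetric in `g` and `c`, so a region homogeneous for C is as probable as one homogeneous for G, whereas an even mixture of G and C is penalized.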
The possible influence of base stacking was investigated, in sets of the various putative N nucleotide regions. The expected frequency and the cumulative distribution function were determined for each dinucleotide, using probabilities calculated from the actual nucleotide frequencies seen in the N regions, and the binomial distribution. The cumulative distribution function was then used to determine the 95% confidence limits for the observed dinucleotide frequencies.
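The dinucleotide analysis can be sketched as follows. The sketch assumes that the expected probability of a dinucleotide is the product of the observed frequencies of its two nucleotides, and that counts of overlapping dinucleotides can be treated as binomial, which is an approximation. The function names are illustrative.

```python
from collections import Counter
from math import comb


def nucleotide_frequencies(regions):
    """Observed frequency of each nucleotide across all N regions."""
    counts = Counter("".join(regions))
    total = sum(counts.values())
    return {n: counts[n] / total for n in "ACGT"}


def dinucleotide_counts(regions):
    """Counts of overlapping dinucleotides, never spanning two regions."""
    counts = Counter()
    for region in regions:
        for i in range(len(region) - 1):
            counts[region[i:i + 2]] += 1
    return counts


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def dinucleotide_summary(regions, dinuc):
    """Return (observed count, expected count, cumulative probability)."""
    freqs = nucleotide_frequencies(regions)
    counts = dinucleotide_counts(regions)
    n = sum(counts.values())
    p = freqs[dinuc[0]] * freqs[dinuc[1]]
    observed = counts[dinuc]
    return observed, n * p, binomial_cdf(observed, n, p)
```

A cumulative probability above 0.975 or below 0.025 would place the observed count outside a two-sided 95% limit. Whether the study used one-sided or two-sided limits is not stated here.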
Results
Trinucleotide mutability
Trinucleotide mutability scores were determined by analysis of the distribution of mutations in VH gene segments of 49 nonproductively rearranged H chain genes. By analyzing nonproductive rearrangements, any contribution of Ag selection to the observed patterns of mutation could be avoided. The results are presented as Table I. Mutability scores ranged from 0.28 for TTG, to 3.06 for GTA. The mean mutability of the 8 trinucleotides encompassed by the 4-nt hot spot RGYW was 1.65, and of the 8 trinucleotides in the complementary WRCY sequence was 1.37. The mean mutability of the 8 WAN trinucleotides was 1.58. Together, RGYW/WRCY and WAN trinucleotides accounted for 9 of the 10 most highly mutable trinucleotides. Biased distribution of trinucleotides within germline sequences led to significant adjustments to some scores. Nineteen mutability scores were adjusted by 10% or more, and 5′ bias in the distribution of the ACG and ATT trinucleotides led to upward adjustments of ∼25%.
Ag selection
To investigate the effect of Ag selection upon mutation patterns, productively rearranged, class-switched sequences were analyzed. Analysis of 69 mutated H chain sequences confirmed a tendency for mutations to accumulate in CDR1 and CDR2, at rates higher than expected solely on the basis of the nucleotide composition of the sequences, and their distance from the Ig promoter. This tendency is most evident in sequences with higher total V gene mutations, as seen in Fig. 1. The Ag selection factor (R), the mean ratio of observed V segment CDR mutations over expected mutations, was calculated to be 1.54.
VH gene segment mutability
Mutability scores were calculated, for full-length germline VH gene segments, from codon 1 to codon 104, the last codon of FR3, before the start of the CDR3. A single allele was selected for analysis of each gene, and the results are presented for each gene family as Table II. The mean mutability MV of the segments was 225.8 (SD = 7.2), and scores ranged from 212.6 for IGHV2-70*01 to 240.3 for the most mutable V gene segment, IGHV6-1*01. Considerable allelic variation was also seen. For example, the 11 allelic variants of IGHV1-69 had a mean mutability of 223.9 (SD = 9.2) and ranged from 197.6 to 228.7. A few alleles with truncated FR3 regions had significantly lower mutability scores. For example, MV for IGHV2-5*03 was 174.6.
These V gene segment mutability scores were subsequently used in the partitioning of rearranged IgM sequences, and this partitioning then allowed all the calculations that are shown below. As a guide to the process, a typical rearranged D segment with a length of ∼15 nt would have a score of ∼10–20, and the probability that any mutation would occur in the D segment rather than the V segment would be 0.05–0.10. Even in sequences where the V segments had five mutations, the probabilities that the D segments would include more than one mutation were almost always <0.05.
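The reasoning in this guide can be illustrated numerically. The sketch assumes that each mutation falls in the D segment with probability MD/(MD + MV), and that the number of D segment mutations among n mutations is binomial. The precise conditioning used in the study is not stated here, so the choice of n is an assumption.

```python
from math import comb


def probability_in_d(m_d, m_v):
    """Probability that a mutation falls in the D segment rather than the V segment."""
    return m_d / (m_d + m_v)


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))


# Illustrative values: a D segment score of 15 against a V segment score of 225.
p_d = probability_in_d(15, 225)
p_two_or_more = prob_at_least(2, 5, p_d)
```

With these illustrative scores, `p_d` falls within the 0.05 to 0.10 range quoted above, and the probability of two or more D segment mutations among five is below 0.05.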
Analysis of N nucleotide addition
Two hundred twenty-four IgM H chain rearranged V genes were analyzed, and satisfactory VHDJH alignments could be determined for 199 sequences (89%). Many of the sequences that could not be aligned included seven consecutive nucleotide matches to germline D sequences, but this was not considered sufficient to allow confident alignment. This introduced a systematic bias against very short D segments, but ensured the reliability of the 199 alignments that were then subjected to further analysis. The lengths of putative N1 regions, at the VHD junctions of the 199 sequences, and of putative N2 regions at the DJH junction were determined, and the results are presented as Fig. 2. N nucleotide addition was absent from only 10 (5.0%) of the VHD junctions, and from 21 (10.6%) of the DJH junctions. Eleven N1 sequences were identified with ≥15 nt, including 1 sequence of 20 and 1 sequence of 21 nt. Thirteen N2 sequences were identified with ≥15 nt, including 1 sequence of 22 and 2 sequences of 23 nt.
Identification of second D gene segments in long N regions
The probabilities that N region addition would give rise to apparent D segments of varying lengths and identity were calculated, and are presented as Table III. These probabilities were used to calculate the degree of identity that was required, between long N regions and D segments, in order for a second D gene segment to be accepted as part of an alignment. The required matches for various N lengths are presented as Table IV, which shows, for example, that whereas 7 consecutive matches were required for identification of a D segment in an N region of 7 or 8 nt, 9 consecutive matches, or a single mismatch in 11 nt, were required for any N region of ≥17 nt in length. The likelihood of alignments containing mismatches was assessed by reference to mutation analysis. Seventy-five sequences containing long putative N regions (length, >9 nt) were reanalyzed for the presence of additional D segments. Details of three sequences that appear to include second D segments are shown in Table V, including the probabilities that the degree of identity seen in the sequences could arise by random N nucleotide addition. Interestingly, the sequence X94053 includes two relatively long D segment alignments: 29 consecutive nucleotide matches to IGHD2-2*02, and 15 consecutive matches to IGHD5-12*01. An additional N2 sequence was found with 8 consecutive matches to a second D gene, but this D gene was located 5′ of the primary D gene segment, in the germline. For this reason, the alignment was not accepted as an example of D-D fusion. A number of alignments that just failed to meet the criteria for second D segments are shown as Table VI.
N1 sequences were also analyzed for identity with the 3′ ends of germline V genes, which is considered to be evidence of V gene replacement (21) (Table VII). One sequence was found with seven consecutive nucleotide matches, 3 with six matches, 7 with five matches, and 14 with four matches. Based upon the known nucleotide preferences of TdT, and the repertoire of V ends, alignments with six or more consecutive matches are unlikely to result from TdT activity, and more likely represent V gene replacement. Although many of the shorter alignments could be expected to arise by chance in a series of long N regions, the number of such alignments is noteworthy, as is the GA-rich nature of the sequences.
Analysis of the GC content of long N regions
The GC content and G or C homogeneity was determined for the 72 remaining long N sequences. Among the 40 N1 sequences, there were 12 sequences for which the probabilities that such sequences had arisen by conventional TdT activity were <0.05. Similarly, among the 32 N2 sequences, there were 9 sequences for which the p values were <0.05, 2 sequences with p values of <0.01, and 2 sequences with p values of <0.001.
The G/(G+C) proportions of the 72 sequences were plotted against N length for N1 (Fig. 3 a) and N2 (b) sequences. As expected, the longer N1 sequences showed a tendency toward homogeneity for either G or C. This was not the case for N2 sequences, and the difference between the homogeneity of N1 and N2 sequences, among sequences of ≥15 nt in length, was statistically significant (p < 0.05).
To investigate possible concatenation of N nucleotides from both strands as an explanation of long N regions, the proportions of G among GC nucleotides (G/(G+C)) were determined for the 5′ and 3′ ends of the sequences. Surprisingly, the mean proportion of 5′ G among N2 GC nucleotides was 0.36, and of 3′ G was 0.57. The corresponding values for N1 sequences were 0.56 and 0.57, respectively. The values seen for the N2 sequences represent a significant overrepresentation of 5′ C (p < 0.001) and of 3′ G (p < 0.05). Both ends of the N1 sequences were significantly enriched for G (p < 0.05). Nucleotide frequencies were therefore more fully considered, by an examination of dinucleotide frequencies.
The dinucleotide frequencies in the long N regions were analyzed for evidence of base stacking, and the results are presented as Table VIII. Among long N1 sequences, there was a significant overrepresentation of the homodimers CC (p < 0.001) and GG (p < 0.05); however, there was no general overrepresentation of purine or pyrimidine homodimers that would indicate base stacking. There was also a marked underrepresentation of GC heterodimers (p < 0.01). GA was also overrepresented (p < 0.05), which may be a consequence of inclusion of GAGA and GAGG motifs within the N1 sequences, as a result of V gene replacement. The GA dinucleotides were particularly overrepresented in the 5′ portions of the N1 sequences (data not shown). In the long N2 sequences, GG was also overrepresented (p < 0.001), and GC was underrepresented (p < 0.001). Although CC was overrepresented, this result did not reach significance.
Discussion
Despite considerable recent progress in our understanding of the molecular processes at the heart of the somatic hypermutation process (22, 23), and despite our long-standing and detailed understanding of the process of Ig gene rearrangement, the analysis of rearranged Ig genes has seen little advancement since these processes were first described. As alignment algorithms inevitably involve consideration of patterns of matching and mismatching nucleotides, between a sequence under consideration and a germline sequence, an understanding of the process of somatic point mutation should assist the alignment process.
An important advance in our understanding of somatic point mutation came with the report that the probability that the hypermutation process will introduce a mutation within a sequence decays exponentially, from 5′ to 3′, downstream of the promoter region (13). As a consequence of this finding, the likelihood of a mutation occurring in any sequence can be determined by the following four factors: 1) the position of the sequence within the rearranged gene; 2) the presence of mutational hot spots within the sequence, because the mutation mechanism preferentially targets particular nucleotide sequences; 3) selection processes that may favor mutations occurring in the Ag-binding CDRs, but resist mutations in the FRs; and 4) the extent to which the hypermutation process has acted upon the sequence.
Of these factors, only the fourth is difficult to estimate, but the observations of Rada and Milstein (13) now provide a means of doing so, for the number of mutations in the long V gene segments of the H and L chains is easily determined, and can be determined without ambiguity. The extent of mutation in these sequences can serve as a predictor of the level of mutation throughout a rearranged gene.
It is clear that the mutability of a particular nucleotide is influenced by upstream and/or downstream nucleotides, and a number of studies have therefore reported the mutability of dinucleotides and trinucleotides (24, 25). Other studies have examined the influence of nucleotides as many as three positions upstream or downstream of the target nucleotide (26). These studies have led to the description of major hot spots at RGYW/WRCY and WAN motifs (27, 28, 29), where R is a purine, Y is a pyrimidine, W is A or T, N is any nucleotide, and the underlined nucleotide is preferentially but not absolutely targeted. In this study, we have focused on trinucleotide mutability, acknowledging the influence of nucleotides two positions upstream and two positions downstream of the target nucleotide. We have re-examined the issue of trinucleotide mutability in the light of the report of Rada and Milstein (13), because improved mutability scores should result when due regard is paid to the 5′ to 3′ distribution of nucleotide motifs within Ig sequences.
The mutability scores derived in this study can be most directly compared with two studies, one of murine sequences (24) and one of human sequences (30). The scores of all three studies are in broad agreement, but perhaps as a consequence of the positional analysis described in this study, important differences are also seen. A measure of the ability of the trinucleotide scores to describe the mutation process is the extent to which previously reported hot-spot motifs can be predicted from the trinucleotide scores. The high scores seen for the RGYW, WRCY, and WAN trinucleotides in this study suggest that the scores are appropriate and can be usefully applied to mutation analysis.
The effect of Ag selection upon the likely frequency of mutations within CDRs remains the most uncertain of the factors influencing analysis of mutations. During an ongoing immune response, replacement mutations within the Ag-binding CDRs may be selected during rounds of replication (31). We estimated the contribution of Ag selection by calculating the mean extent to which mutations accumulate above expectations in the CDR1 and CDR2 regions of V gene segments. This Ag selection factor (R) was estimated to be 1.54. This is in close agreement with a previous report of CDR mutations, where the enhancement of CDR mutations was calculated to be 1.58 (32). These relatively low figures likely reflect the 5′ engagement of the mutator mechanism (13), and the consequent tendency for mutations in the CDR to be accompanied by mutations in upstream FRs. Regardless, we believe the figure is a suitable one for use in mutation analysis. There may be circumstances in which a greater weighting should be given to CDR mutations, and Ag selection; however, such a weighting would not have substantially changed the outcome of the analysis reported here.
In this study, we have applied mutation analysis to an investigation of N nucleotide addition in human H chain sequences. Our results support early observations that human Ig sequences include long N regions (6). This is in contrast to the situation in mice, where the mean length of N regions has been reported to be 3.0 (20). Interestingly, studies using transgenic human Ig minilocus mice appear to show a more murine pattern of N addition (33).
Perhaps as a result of the different approach that we have taken to the determination of the ends of gene segments, the results of our study also challenge the high frequency of sequences that have been reported to lack N addition in adult rearrangements. Although ∼30% of rearrangements in transgenic human minilocus mice appear to lack N addition (33), in our study, only 7.8% of rearrangements lacked N addition. Even this figure is likely to be an overestimate, for ∼25% of rearrangements in which a single N nucleotide is added will appear to lack N addition. Inevitably, the added nucleotide will often return a germline nucleotide that had been removed by exonuclease activity. Some systematic bias in the analysis, favoring low levels of N addition, is therefore unavoidable. Only 2 of the 33 junctions that lacked N addition were fetal sequences. In contrast, many of the sequences appear to be derived from unusual B cell populations, including 14 sequences derived from subepithelial tonsillar B lymphocytes (34). Therefore, it may be that the lack of N addition in conventional B cell populations is even rarer than indicated in this study.
Long putative N regions could arise from the presence of a second D gene segment in a rearranged V gene, but many have queried the very existence of D-D fusions since the process was first proposed (35). As part of the study that first fully described the human D segment locus, randomly generated mock sequences were aligned with D segments and used to define an alignment algorithm (10). Only 4 of 821 sequences that were then aligned met the 99% confidence limit for D-D fusion. As more than 4 sequences in 821 could be expected by chance to meet the 99% confidence limit, this was taken as evidence against the existence of D-D fusion. Nevertheless, studies have continued to report D-D fusions (33), and such fusions have sometimes been reported to be exceedingly common (11).
A similar strategy was used to investigate the use of inverted D gene segments (10), and the identification of just 7 sequences in 127 that met the 99% confidence limit was considered to provide little evidence for their use. Despite this study, the use of inverted D segments also continues to be reported.
In the present study, the predictability of mutations within CDR3 regions has allowed us to apply strict criteria to the investigation of D-D fusions, including D-D fusions involving inverted D gene segments, by calculating the probabilities that random N nucleotide addition would give rise to such sequences. It has been recognized for many years that D segments are G-C rich, and that apparent alignments could therefore arise from random N nucleotide addition (6). Perhaps as a consequence of the uncertainty surrounding point mutations within the CDR3, no systematic analysis has been developed before to resolve this issue.
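The effect of base composition on chance alignments can be sketched as follows. This is a simplified illustration, not the probability model used in this study: it assumes that N nucleotides are added independently, and both the base frequencies and the two motifs are hypothetical values chosen only to show the trend.

```python
from math import prod

# Hypothetical G/C-rich composition of N nucleotides (not measured values).
N_BASE_FREQ = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def chance_match_probability(motif, base_freq=N_BASE_FREQ):
    """Probability that independent random addition reproduces `motif`
    exactly at one fixed position."""
    return prod(base_freq[base] for base in motif.upper())

# Two illustrative 6-nt motifs, one G/C rich and one A/T rich.
gc_rich_motif = "GGCGGC"
at_rich_motif = "ATTATA"
p_gc = chance_match_probability(gc_rich_motif)
p_at = chance_match_probability(at_rich_motif)
print(f"G/C-rich motif: {p_gc:.6f}")
print(f"A/T-rich motif: {p_at:.6f}")
```

With these assumed frequencies, a G/C-rich motif is reproduced by chance about an order of magnitude more often than an A/T-rich motif of the same length, which is why a match of a given length carries less weight when it is G/C rich.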
This study clearly demonstrates the reality of D-D fusion in the expressed repertoire, with the identification of 3 sequences among 75 that met the 95% confidence limits. In 75 sequences, a small number of relatively short alignments could still be expected to arise by chance. One of the 3 sequences (AF174050) could perhaps be attributed to random N nucleotide addition. In fact, AF174050 can be aligned to a single D gene segment (IGHD3-22*01), if three central, consecutive mismatches arose from point mutation. We consider such a pattern of mutation to be improbable (p < 0.002). The sequence could alternatively be an example of an insertion/deletion event of the kind that has been reported in CDR1 and CDR2 regions (7, 36). Such insertion/deletion events have been associated with repetitive sequences (37), and such repetitive sequences are to be found on either side of the trinucleotide mismatch. However, insertion/deletion events have been linked to the hypermutation process (36), yet there were no mutations in the V segment of this sequence.
Two alignments remain as unequivocal evidence of D-D fusions, including one alignment with an inverted IGHD2-2 allele. It is possible that mutations within the second D segments, or unreported D segment polymorphisms, could have prevented the identification of additional D-D sequences. Six sequences failed to meet the alignment criteria because of a single mismatch. Given the low level of mutation seen in the six associated V gene segments, it is most unlikely that more than one of these sequences represents a mutated D segment. Three other sequences were 1 nt short of the acceptance criteria, but the sequences were strikingly AT rich, which makes a chance match arising from G-C-rich N addition less likely. These sequences may therefore be additional examples of D-D fusion. We consider the number of examples of D-D fusion in the 398 VHD and DJH junction sequences examined to be between 2 (0.5%) and 7 (1.8%). D-D fusion is thus a rare event in humans.
Many long putative N regions with unexpected features remain unexplained in this study. As probabilities were calculated on the basis that strand concatenation does not occur during N nucleotide addition, we investigated the possibility that in fact strand concatenation is responsible for long N regions. If this is the case, the 5′ halves of long N regions should be rich in G and 3′ halves should be rich in C. Surprisingly, among long N2 regions, the opposite situation was seen. This anti-concatenation has been observed previously (20), although, in that study, the anti-concatenation did not reach statistical significance. We are unable to propose a model of N addition that can satisfactorily explain such observations.
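The concatenation prediction can be checked by comparing G and C content in the two halves of each long N region. The sketch below is one simple way to do this and is not the procedure used in this study; the function names and the example sequence are hypothetical, and the middle nucleotide of an odd-length region is assigned to neither half.

```python
def base_fraction(seq, base):
    """Fraction of `seq` made up of `base` (0.0 for an empty sequence)."""
    return seq.count(base) / len(seq) if seq else 0.0

def half_composition(n_region):
    """G and C fractions in the 5' and 3' halves of an N region."""
    seq = n_region.upper()
    half = len(seq) // 2
    five_prime = seq[:half]
    three_prime = seq[len(seq) - half:]
    return {
        "G_5prime": base_fraction(five_prime, "G"),
        "C_5prime": base_fraction(five_prime, "C"),
        "G_3prime": base_fraction(three_prime, "G"),
        "C_3prime": base_fraction(three_prime, "C"),
    }

# Hypothetical N region showing the pattern predicted by strand concatenation.
example = half_composition("GGGGGACCCCC")
print(example)
```

Under strand concatenation, `G_5prime` and `C_3prime` would tend to exceed `C_5prime` and `G_3prime` across a set of long N regions; the reverse pattern corresponds to the anti-concatenation described above.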
Analysis of homodimer frequencies in N1 and N2 regions did not show the significant and general overrepresentation of RR and YY dinucleotides that has been previously reported (38). Therefore, this study does not support a role for base stacking in N nucleotide addition. However, a highly significant overrepresentation of GG and CC was seen, with an underrepresentation of GC dimers. GA heterodimers were also overrepresented in the N1 regions.
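Over- and underrepresentation of dinucleotides of this kind can be expressed as the ratio of observed counts to the counts expected from mononucleotide frequencies. The following is a minimal sketch of that calculation, assuming independence between adjacent positions as the null model; it is not the statistical test used in this study, and the input sequence is hypothetical.

```python
from collections import Counter

def dinucleotide_ratios(sequences):
    """Observed/expected ratio for each dinucleotide seen in `sequences`.

    Expected counts assume adjacent bases are independent and use the
    pooled mononucleotide frequencies of the same sequences.
    """
    mono = Counter()
    di = Counter()
    for seq in sequences:
        seq = seq.upper()
        mono.update(seq)
        # Overlapping dinucleotides within each sequence only.
        di.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total_mono = sum(mono.values())
    total_di = sum(di.values())
    ratios = {}
    for pair, observed in di.items():
        freq_first = mono[pair[0]] / total_mono
        freq_second = mono[pair[1]] / total_mono
        expected = freq_first * freq_second * total_di
        ratios[pair] = observed / expected
    return ratios

# Hypothetical N region in which GG and CC dominate and GC is rare.
ratios = dinucleotide_ratios(["GGGGCCCC"])
for pair in sorted(ratios):
    print(pair, round(ratios[pair], 2))
```

A ratio above 1 indicates overrepresentation relative to the independence model and a ratio below 1 indicates underrepresentation; a formal significance test would still be needed to support statements such as those above.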
A number of long N1 regions are likely to have arisen by V gene replacement. Four of 40 long N1 regions included six or more consecutive matches with the 3′ ends of V genes. Although these alignments are short, and could reflect random TdT activity, the location of these motifs at the 5′ end of long N1 regions is striking, and such sequences were not seen at the 3′ end of N1 sequences. Only two such motifs were identified in the long N2 sequences, and both of these were located 3′ in the sequence. As many other long N1 sequences include 4- and 5-nt matches, it may be that >4 of the long N1 regions are the result of V gene replacement. Nevertheless, most long N1 regions lack the motifs, and the origin of these long sequences and the long N2 sequences cannot presently be explained.
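A search for V gene end motifs of this kind can be sketched as a longest-shared-run comparison between an N region and the 3′ end of a V gene. The code below is an illustrative sketch rather than the method used in this study; the function name, the V gene end sequence, and the N1 region are all assumptions chosen for the example.

```python
def longest_shared_run(n_region, v_end):
    """Return (length, start) of the longest stretch of `n_region` that also
    occurs as a consecutive stretch in `v_end`; start is 0-based within
    `n_region`, and (0, -1) is returned when nothing is shared."""
    n_region, v_end = n_region.upper(), v_end.upper()
    best_len, best_start = 0, -1
    for start in range(len(n_region)):
        # Only test stretches longer than the best found so far.
        for end in range(start + best_len + 1, len(n_region) + 1):
            if n_region[start:end] in v_end:
                best_len, best_start = end - start, start
            else:
                break
    return best_len, best_start

# Hypothetical inputs: an illustrative V gene 3' end and a long N1 region.
v_gene_end = "TGTGCGAGAGA"
n1_region = "GCGAGATCCCGGGTT"
length, start = longest_shared_run(n1_region, v_gene_end)
print(length, start)
```

A run of six or more nucleotides with a start position near 0 corresponds to the 5′ location described above, whereas a comparable run near the end of the region would correspond to a 3′ location.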
Patterns in short nucleotide sequences can provide vital clues to processes that may contribute to the generation of Ab diversity. We have described the development of mutation analysis, which allows us to more reliably partition rearranged Ig genes. Mutation analysis also provides a guide for the interpretation of sequences between the rearranged VH, D, and JH segments. As a consequence, we have been able to develop objective criteria for the acceptance or rejection of putative alignments to multiple D gene segments within a VHDJH rearrangement. Together, these techniques have allowed us to identify N nucleotides with greater certainty, and to develop a reliable dataset for the study of the human CDR3. We have been able to confirm the reality of D-D fusion, and we have highlighted features of the human CDR3 sequences that remain unexplained. Further investigation of patterns of nucleotides within the human CDR3 region is likely to uncover additional processes that contribute to repertoire diversity, and that are responsible for these molecular signatures, provided that such investigations are conducted using the kinds of objective partitioning algorithms that are described in this work.
Acknowledgements
We thank Dr. Dan Conrad (Medical College of Virginia, Richmond, VA) and Dr. William Sewell (Garvan Institute of Medical Research, Sydney, Australia) for their assistance with this work and for their comments on the manuscript.
Footnotes
This study was supported by a grant from the National Health and Medical Research Council of Australia.
Abbreviations used in this paper: FR, framework region; CDR, complementarity determining region; N, nontemplated; P, palindromic.