Abstract
Analyses of somatic hypermutation (SHM) patterns in B cell Ig sequences have important basic science and clinical applications, but they are often confounded by the intrinsic biases of SHM targeting on specific DNA motifs (i.e., hot and cold spots). Modeling these biases has been hindered by the difficulty in identifying mutated Ig sequences in vivo in the absence of selection pressures, which skew the observed mutation patterns. To generate a large number of unselected mutations, we immunized B1-8 H chain transgenic mice with nitrophenyl to stimulate nitrophenyl-specific λ+ germinal center B cells and sequenced the unexpressed κ L chains using next-generation methods. Most of these κ sequences had out-of-frame junctions and were presumably uninfluenced by selection. Despite being nonfunctionally rearranged, they were targeted by SHM and displayed a higher mutation frequency than functional sequences. We used 39,173 mutations to construct a quantitative SHM targeting model. The model showed targeting biases that were consistent with classic hot and cold spots, yet revealed additional highly mutable motifs. We observed comparable targeting for functional and nonfunctional sequences, suggesting similar biological processes operate at both loci. However, we observed species- and chain-specific targeting patterns, demonstrating the need for multiple SHM targeting models. Interestingly, the targeting of C/G bases and the frequency of transition mutations at C/G bases was higher in mice compared with humans, suggesting lower levels of DNA repair activity in mice. Our models of SHM targeting provide insights into the SHM process and support future analyses of mutation patterns.
Introduction
Somatic hypermutation (SHM) is a process that diversifies BCRs by introducing point mutations into Ig genes at a high rate (1). SHM is initiated when activation-induced cytidine deaminase (AID) is recruited to the Ig locus and converts cytosine (C) to uracil (U). Error-prone DNA repair pathways are then activated, resulting in somatic mutations either at the AID-targeted C/G base pair (phase I) or at neighboring base pairs (phase II) (2). Although stochastic, SHM is biased by the local DNA sequence context and preferentially introduces mutations at specific DNA motifs (hot spots) while avoiding others (cold spots) (3–5). SHM plays a crucial role in the B cell immune response and immune-mediated disorders. The analysis of mutation patterns and distributions has been widely used to infer selective processes involved in such responses (6). However, the analysis of SHM patterns can be confounded by the intrinsic biases of SHM targeting, driving the need for accurate characterization of targeting patterns that reflect inherent SHM properties in the absence of Ag-driven selection (7, 8).
The SHM process can be quantitatively characterized by a targeting model, consisting of a mutability model, which specifies the relative mutation frequency of DNA microsequence motifs, and a substitution model, which describes the specific nucleotide substitution frequencies at the mutated sites (9–13). These models can serve as a background distribution for statistical analysis of mutation patterns in Ig sequences, improving the ability to detect deviations in SHM pathways related to diseases or identify selected mutations that drive Ag specificity and affinity maturation (7, 8). However, modeling these intrinsic biases has been limited by the lack of large sets of Ig sequences that have undergone SHM in vivo in the absence of selection pressures. Previous work has focused on studying mutations in intronic regions or in Ig sequences that were determined to be nonfunctional (e.g., because of an out-of-frame junction) (9–11, 13). However, intronic regions have limited diversity and a different base composition from exonic V(D)J regions, and some mutations in nonfunctional sequences may be subject to selection pressures if the sequences were rendered nonfunctional during the affinity maturation process. Another strategy to determine targeting models involves using mutations that do not alter the amino acid sequence (i.e., silent or synonymous mutations), which presumably are not subject to selection pressures. We previously used this strategy to construct the human silent, 5-mer, functional (S5F) SHM targeting model from >800,000 mutations in functional Ig sequences (12). Despite the high resolution of this S5F model, the mutability of some DNA motifs could not be estimated directly because they do not yield silent mutations.
Modeling and analysis of SHM would also benefit from a clear understanding of whether equivalent models can be used across chains (L and H) and species (mouse and human). L and H chain genes are located on different chromosomes; thus, different regulatory elements and epigenetic effects may impact microsequence targeting specificity (14–17). Previously, Shapiro et al. (9) reported comparable dinucleotide and trinucleotide mutabilities between L and H chains, and between mouse and human sequences. However, the small sequence database and short motif comparisons limited the resolution of this analysis.
In this study, we use a novel experimental strategy based on the (4-hydroxy-3-nitrophenyl)acetyl (NP) immunization system in mice, combined with next-generation sequencing technologies, to generate a large set of unselected mutations from nonfunctionally rearranged Ig κ-chain sequences. For comparison, we also sequenced L and H chains from healthy human subjects. Using a modified version of the 5-mer motif framework previously developed (12), we have used this novel database of mutations spanning the full V gene to develop a next-generation targeting model. We found that targeting patterns on nonfunctional sequences resemble those on functional sequences, but revealed unexpected chain- and species-specific differences in SHM targeting. These models provide insights into the SHM process and support future analyses of SHM patterns.
Materials and Methods
mIgM Tg mouse data set: cell sorting
Four mIgM Tg JHD−/− BALB/c strain female mice (18), 6–10 wk old, were immunized with NP-CGG in alum adjuvant. Twenty-eight days after immunization the mice were sacrificed and splenocytes were harvested. Germinal center (GC) cells (live, B220+, NIP+, CD95+, CD38−) were FACS-sorted, leaving 64,000, 87,000, 120,000, and 107,000 cells, respectively. As a control, non-GC B cells expressing κ L chain with a B220+, CD95−, CD38+, λ− phenotype were FACS-sorted from two nonimmunized mice, leaving 500,000 and 400,000 cells, respectively. RNA was isolated from sorted cells using the RNeasy Mini kit (Qiagen) per the manufacturer’s protocol and stored at −80°C.
Human data set: cell sorting
Four healthy donors were recruited for the assessment of B cell Ab repertoire analysis. Specimens were collected after informed written consent was obtained, under a protocol approved by the Human Research Protection Program at Yale School of Medicine. PBMCs were isolated from whole blood by Ficoll-Paque gradient centrifugation. B cells were first enriched with CD20 MicroBeads (Miltenyi Biotec); then memory (CD27+) B cells were isolated through FACS on a FACSAria flow cytometer (BD) into bulk collections before RNA isolation and sequencing. For cytometric analysis and sorting, enriched B cells were stained with PerCP-Cy5.5 anti-human CD27 (clone O323), Pacific Blue anti-human CD19 (clone HIB 19) (both from BioLegend), and with FITC anti-human IgM (clone G20-127; BD). For this study, memory and bulk unsorted B cell populations for subjects HD09, HD10, and HD13 were analyzed; only bulk unsorted B cells were analyzed for subject HD07.
Sequencing
RNA from sorted memory B cells was extracted via Qiagen RNeasy kit as per manufacturer’s recommendation. A total of 250 ng of RNA input was used for each sample. Immune sequencing library construction for human samples was performed as previously described using human V region H, V region L-λ and V region L-κ C region primer mix in each reaction (19, 20), and for mouse samples was performed using mouse V region L-λ and V region L-κ primer mix (21). In brief, RNA was reverse transcribed using biotin-dT oligonucleotide and barcoded on the 5′ end in a template switchlike reaction using an oligonucleotide harboring a semidegenerate 17-nt-long unique identifier (UID) barcode and a universal flanking sequence. The cDNA was purified using streptavidin bead pulldown and then subjected to a first round of 12 cycles of PCR. The human samples used a human C region primer mix composed of Ig H chain C region, Ig κ L chain C region, and Ig λ L chain C region flanked by another universal sequence (19, 20), and the mouse samples used a mouse primer mix composed of Ig κ L chain C region and Ig λ L chain C region (21). A second round of 12 cycles of PCR was then performed using universal overhang from both ends to add the required Illumina sequencing and clustering adapters. The samples were then pooled at equimolar ratio and subjected to sequencing on the Illumina MiSeq platform at 325 × 275 bp. Raw mouse and human sequences are available via the Sequence Read Archive under BioProject accession numbers PRJNA283640 and PRJNA338795, respectively. Human samples included in this study were restricted to BioSample accession numbers SAMN05570585, SAMN05570589, SAMN05570593, SAMN05570596, SAMN05570598, SAMN05570602, SAMN05570604.
Sequencing data preprocessing
Raw sequences were processed using the Repertoire Sequencing Toolkit (pRESTO) (22) as previously described (12). The script containing detailed procedures for Ig sequence preprocessing is included in the supplemental material. In summary, we removed low-quality sequences, annotated sequences by UID (corresponding to the same mRNA molecule) and isotype-specific primer, generated a consensus sequence for each UID, and assembled pair-end reads. To obtain high-fidelity sequences, we constrained the reads being used for mutation analysis to appear at least twice. Furthermore, a sliding window approach was used to eliminate sequences with ≥6 mutations in 10 consecutive nucleotides. IMGT was used to assign germline V(D)J segments and determine functionality (23). Mutations were defined as nucleotides that were different from the inferred germline sequence. Clustering of sequences into clonal groups was carried out using the Change-O command line tool (24). H chain sequences were assigned into clonal groups on the basis of identical V gene, J gene, and junction length, with a weighted intraclonal distance threshold of 10 using the substitution probabilities previously described (25). L chain sequences were assigned into clonal groups on the basis of identical V gene, J gene, and junction sequence.
Estimation of the sequencing error rate after quality control
The empirical sequencing error rate for each sequence was computed by comparing the number of mismatched nucleotides with the consensus sequence in each UID group. The UID consensus sequence was obtained by majority rule at each nucleotide base in each UID group. The error rate of UID sequences was estimated by subtracting the probability that each base is correct (defined by more than half of the sequences at each position being correct) from 1. The actual error is expected to be lower because sequencing may result in distinct errors, in which case less than half of correct bases at a position would result in the correct UID consensus.
Mutability calculation for replacement and silent models
The construction of silent (S) 5-mer models was previously described (12). The replacement and silent (RS) model was constructed using the same method except that both silent and replacement mutations were included.
Inference for substitution and mutability frequencies
The SHazaM R package (version 0.1.1) was used to compute the targeting models (12, 24). For substitution models, the values of 5-mers with <20 mutations were inferred. Inference was performed by first averaging substitution frequencies of all mutations in 5-mers matching the same inner 3-mers. If the number of mutations was still insufficient (<20), substitution frequencies of all mutations in 5-mers matching the same middle base were averaged. In the mutability models, 5-mers with <500 mutations in sequences containing the mutated 5-mer were inferred. If the mutated base was A or T, inference was done by averaging the mutability of all 5-mers sharing the same inner 3-mer. For the base C, inference was based on 5-mers sharing the same upstream 3-mer (the mutated base and two nucleotides upstream of it), whereas for the base G, inference was based on 5-mers sharing the same downstream 3-mer (the mutated base and two nucleotides downstream of it).
Construction of sequencing-error models
Sequencing-error models were constructed for the purposes of comparing with targeting models. Combined PCR and MiSeq sequencing error (hereafter referred to as sequencing error) profiles were generated using sequences from large UID groups (>20 sequences) in mice. A consensus sequence was constructed per group through majority rule at each nucleotide position. Within each UID group, the deviations from consensus sequences were counted as sequencing errors and used to construct the error models. Targeting models for these errors were built the same way as the 5-mer RS targeting model, except that all sequences were treated independently (instead of collapsing mutations for each clone). 5-mer motifs with <20 mutations in the center base or <500 mutations in any base in sequences containing the mutated motif were excluded. Samples that did not have any 5-mers matching the criteria were excluded from the study.
Principal component analysis on targeting models
For each pair of samples, the distance was defined as follows: 1 − Spearman correlation coefficient between any two targeting models, considering only the motifs observed in both models. The R function prcomp was used to project the distance matrix onto the first two principal components.
Mutation simulation
Ten simulated repertoires were generated for each SHM model (mouse L chain, human L chain, and human H chain). For each repertoire, 2000 sequences were simulated with five mutations per sequence, roughly the same mutation levels as observed in the GC cells from mice analyzed in this study. The mutations were added to the sequences stochastically using a mutation algorithm previously described (8). All the simulations were initiated with the germline Mus mus IGKV1-135*01, the most abundant V segment among nonfunctional chains in the NP-immunized mice. The simulation results were compared across different SHM targeting models by calculating the number of different mutations (defined as mutations at different positions or mutations to different bases) between repertoires divided by the total number of mutations.
Results
Nonfunctional rearrangements of the κ L chain accumulate SHMs
To generate a large number of unselected mutations necessary to precisely model SHM targeting, we carried out next-generation repertoire sequencing (Rep-Seq) (26) in NP-immunized mIgM Tg mice (18). These mice express a transgenic H chain (B1-8) that binds the NP Ag when paired with λ, but not κ, L chains. We reasoned that NP-binding B cells in these mice would likely carry nonfunctionally rearranged κ-chains, because λ-chains are generally used only after failure to generate a productive κ-chain rearrangement during B cell development (27). Next-generation sequencing of κ L chains of sorted GC NP-binding B cells from four mice 28 d postimmunization with NP produced 25,341 distinct κ L chain sequences (Table I). Consistent with our expectations, 20,194 sequences (80%) were identified as nonfunctional because they contained an out-of-frame junction (92%) or a stop codon (8%) (Supplemental Fig. 1). The small fraction of in-frame κ L chain sequences may be derived from B cells reacting to the NP carrier protein (CGG) that contaminated the sort, or may represent cells carrying both a productive κ- and λ-chain (28, 29).
. | . | No. of Unique Sequences . | No. of Clones . | No. of Mutations . | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Sample . | Chain . | Processed . | Functional . | Total . | Functional . | Nonfunctionala . | Functional . | Nonfunctionala . | Functional Silent . | Nonfunctional Silenta . |
NP1 | κ | 2,526 | 534 | 1,163 | 232 | 942 | 739 | 4,225 | 183 | 1,081 |
λ | 25,755 | 21,332 | 1,298 | 516 | 794 | 2,728 | 3,691 | 718 | 703 | |
NP2 | κ | 8,703 | 1,555 | 2,671 | 464 | 2,233 | 1,546 | 12,489 | 370 | 3,347 |
λ | 84,141 | 70,990 | 2,552 | 930 | 1,653 | 4,923 | 7,874 | 1,440 | 1,566 | |
NP3 | κ | 10,006 | 2,226 | 3,245 | 656 | 2,626 | 2,305 | 14,992 | 565 | 3,758 |
λ | 89,976 | 76,406 | 2,720 | 1,045 | 1,711 | 5,528 | 8,508 | 1,590 | 1,767 | |
NP4 | κ | 4,106 | 832 | 1,813 | 373 | 1,459 | 1,095 | 7,467 | 252 | 1,964 |
λ | 50,587 | 42,523 | 1,923 | 658 | 1,281 | 3,639 | 6,214 | 1,017 | 1,230 | |
Total | κ | 25,341 | 5,147 | 8,892 | 1,725 | 7,260 | 5,685 | 39,173 | 1,370 | 10,150 |
λ | 250,459 | 211,251 | 8,493 | 3,149 | 5,439 | 16,818 | 26,287 | 4,765 | 5,266 | |
C1 | κ | 89,707 | 76,245 | 15,688 | 9,905 | 5,989 | 18,984 | 7,843 | 3,871 | 879 |
λ | 3,844 | 590 | 798 | 138 | 662 | 202 | 1,293 | 32 | 178 | |
C2 | κ | 26,041 | 22,814 | 5,797 | 3,853 | 2,026 | 6,900 | 2,539 | 1,331 | 266 |
λ | 1,240 | 183 | 305 | 44 | 262 | 62 | 493 | 9 | 65 | |
Total | κ | 115,748 | 99,059 | 21,485 | 13,758 | 8,015 | 25,884 | 10,382 | 5,202 | 1,145 |
λ | 5,084 | 773 | 1,103 | 182 | 924 | 264 | 1,786 | 41 | 243 |
. | . | No. of Unique Sequences . | No. of Clones . | No. of Mutations . | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Sample . | Chain . | Processed . | Functional . | Total . | Functional . | Nonfunctionala . | Functional . | Nonfunctionala . | Functional Silent . | Nonfunctional Silenta . |
NP1 | κ | 2,526 | 534 | 1,163 | 232 | 942 | 739 | 4,225 | 183 | 1,081 |
λ | 25,755 | 21,332 | 1,298 | 516 | 794 | 2,728 | 3,691 | 718 | 703 | |
NP2 | κ | 8,703 | 1,555 | 2,671 | 464 | 2,233 | 1,546 | 12,489 | 370 | 3,347 |
λ | 84,141 | 70,990 | 2,552 | 930 | 1,653 | 4,923 | 7,874 | 1,440 | 1,566 | |
NP3 | κ | 10,006 | 2,226 | 3,245 | 656 | 2,626 | 2,305 | 14,992 | 565 | 3,758 |
λ | 89,976 | 76,406 | 2,720 | 1,045 | 1,711 | 5,528 | 8,508 | 1,590 | 1,767 | |
NP4 | κ | 4,106 | 832 | 1,813 | 373 | 1,459 | 1,095 | 7,467 | 252 | 1,964 |
λ | 50,587 | 42,523 | 1,923 | 658 | 1,281 | 3,639 | 6,214 | 1,017 | 1,230 | |
Total | κ | 25,341 | 5,147 | 8,892 | 1,725 | 7,260 | 5,685 | 39,173 | 1,370 | 10,150 |
λ | 250,459 | 211,251 | 8,493 | 3,149 | 5,439 | 16,818 | 26,287 | 4,765 | 5,266 | |
C1 | κ | 89,707 | 76,245 | 15,688 | 9,905 | 5,989 | 18,984 | 7,843 | 3,871 | 879 |
λ | 3,844 | 590 | 798 | 138 | 662 | 202 | 1,293 | 32 | 178 | |
C2 | κ | 26,041 | 22,814 | 5,797 | 3,853 | 2,026 | 6,900 | 2,539 | 1,331 | 266 |
λ | 1,240 | 183 | 305 | 44 | 262 | 62 | 493 | 9 | 65 | |
Total | κ | 115,748 | 99,059 | 21,485 | 13,758 | 8,015 | 25,884 | 10,382 | 5,202 | 1,145 |
λ | 5,084 | 773 | 1,103 | 182 | 924 | 264 | 1,786 | 41 | 243 |
The number of nonfunctional sequences is the number of processed sequences minus the number of functional sequences.
Previous data suggested that the nonfunctional κ sequences would accumulate SHMs because they were derived from GC B cells, where the SHM machinery is active, and where Ig is clearly being transcribed (because the sequencing was carried out from an mRNA template) (9, 13, 30). To determine the extent of SHM, we used IMGT/HighV-QUEST to identify and align the germline V and J gene segments for each of the observed sequences. We found that the nonfunctional κ L chain sequences in NP-binding GC B cells had an average mutation frequency of 1.49% in the V region. Interestingly, the nonfunctional κ sequences were consistently more mutated than functional κ sequences, which had a mutation frequency of 0.95% (p < 1e−15) (Fig. 1). The decreased mutation frequency observed in functional sequences was not due to the use of V segments with fewer hot-spot motifs, because a similar pattern was observed when comparing mutation frequencies in functional and nonfunctional sequences using the same V segment. The λ L chains sequenced from the same B cell subsets had a similar mutation frequency as the functional κ-chain sequences. In contrast, κ-chain sequences from non-GC cells sorted from two unimmunized mice were significantly less mutated than κ-chain sequences from GC cells (p < 1e−15). The mutation frequencies in all of these populations were higher than the estimated sequencing error rate after correction using the molecular barcodes (estimated sequencing error rate <1e−9). To identify independent SHM events, we partitioned the sequences into groups that were likely to be clonally related (see 2Materials and Methods), and identified the set of distinct mutations from each clone. In total, we obtained 39,173 independent and unselected mutations from NP-binding nonfunctional κ-chain sequences.
SHM targeting at the nonfunctional locus resembles the functional locus
Similar to previous models (12), we defined the mutability of a DNA motif as the probability of the central base in the motif being targeted by SHM relative to all other motifs (described in 2Materials and Methods). To capture the classic SHM hot spots (WRC/GYW and WA/TW, where W = [A, T], R = [G, A], Y = [C, T]) (2) on both the forward and reverse strands, we modeled targeting patterns based on 5-mer motifs. Furthermore, we used both replacement (R) and silent (S) mutations because neither type of mutation should be subject to selection pressures in nonfunctional κ-chain sequences in λ+ cells. Using the notation developed in our previous work, we refer to this model as RS, 5-mer, nonfunctional (RS5NF) (Fig. 2). The RS5NF targeting model contained 825 directly observed motifs (Fig. 2A) and 199 unobserved motifs, whose values were inferred from similar motifs (Fig. 2B). We built a separate targeting model using data from each mouse and confirmed that the mutabilities were highly similar among them (Supplemental Fig. 2A). As expected, classic hot spots generally had higher mutabilities, whereas classic cold spots had lower mutabilities. Furthermore, we observed potential new strand-symmetric hot-spot motifs, CRCY/RGYG (Fig. 2, orange bars) and ATCT/AGAT (Fig. 2, magenta bars), whose mean mutability levels were more than twice as large as the mean mutability level of the classic WA/TW hot spot. These SHM targeting patterns remained stable over the course of B cell maturation because models constructed from sequences with <1% mutation frequency were well correlated with those constructed from sequences with ≥1% mutation frequency; the strength of the correlations between highly and lowly mutated sequences (average pairwise Pearson ρ = 0.61) was comparable with pairwise correlations for sequences with <1% mutation frequency (average Pearson ρ = 0.58).
To test whether SHM targeting at the nonfunctional locus was similar to the functional locus, we compared targeting models constructed from functional and nonfunctional sequences. Because selection pressures may influence mutation patterns in functional sequences, we constructed a model based only on silent mutations. For this comparison, the model for nonfunctional sequences was also based on silent mutations to ensure comparability. Specifically, we built a silent, 5-mer, nonfunctional (S5NF) model using silent mutations from nonfunctional κ sequences from GC cells in immunized mice, and an S5F model using silent mutations from functional κ sequences from both GC B cells in immunized mice and non-GC B cells in control mice. The mutability models for nonfunctional (S5NF) and functional (S5F) sequences were highly correlated (Pearson ρ = 0.86 and Spearman ρ = 0.58 over 250 common 5-mer motifs), consistent with the notion that the same mutational pathways are operating at both loci in this murine system (Fig. 3A).
. | . | No. of Unique Sequences . | No. of Clones . | No. of Mutations . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Sample . | Chain . | Processed . | Functional . | Nonfunctionala . | Total . | Functional . | Nonfunctionala . | Functional . | Nonfunctionala . | Functional Silent . | Nonfunctional Silenta . |
HD07 | H | 5,697 | 5,570 | 58 | 2,870 | 2,789 | 57 | 24,032 | 616 | 8,650 | 189 |
L | 21,011 | 20,362 | 454 | 6,455 | 5,944 | 418 | 41,135 | 2,077 | 14,280 | 479 | |
HD09 | H | 8,628 | 8,362 | 199 | 6,426 | 6,182 | 190 | 48,479 | 1,632 | 16,831 | 469 |
L | 36,596 | 34,846 | 1,514 | 16,319 | 14,909 | 1,266 | 113,235 | 7,909 | 40,399 | 1,837 | |
HD10 | H | 8,742 | 8,462 | 233 | 7,333 | 7,064 | 217 | 49,327 | 1,627 | 16,908 | 354 |
L | 26,089 | 24,553 | 1,338 | 14,611 | 13,322 | 1,144 | 89,687 | 6,243 | 31,415 | 1,299 | |
HD13 | H | 9,193 | 8,900 | 218 | 5,903 | 5,647 | 200 | 56,551 | 2,157 | 20,200 | 552 |
L | 27,929 | 26,505 | 1,191 | 12,508 | 11,490 | 901 | 87,735 | 6,532 | 31,694 | 1,484 | |
Total | H | 32,260 | 32,294 | 708 | 22,532 | 21,682 | 664 | 178,389 | 6,032 | 62,589 | 1,564 |
L | 111,625 | 106,266 | 4,497 | 49,893 | 45,665 | 3,729 | 331,792 | 22,761 | 117,788 | 5,099 |
. | . | No. of Unique Sequences . | No. of Clones . | No. of Mutations . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Sample . | Chain . | Processed . | Functional . | Nonfunctionala . | Total . | Functional . | Nonfunctionala . | Functional . | Nonfunctionala . | Functional Silent . | Nonfunctional Silenta . |
HD07 | H | 5,697 | 5,570 | 58 | 2,870 | 2,789 | 57 | 24,032 | 616 | 8,650 | 189 |
L | 21,011 | 20,362 | 454 | 6,455 | 5,944 | 418 | 41,135 | 2,077 | 14,280 | 479 | |
HD09 | H | 8,628 | 8,362 | 199 | 6,426 | 6,182 | 190 | 48,479 | 1,632 | 16,831 | 469 |
L | 36,596 | 34,846 | 1,514 | 16,319 | 14,909 | 1,266 | 113,235 | 7,909 | 40,399 | 1,837 | |
HD10 | H | 8,742 | 8,462 | 233 | 7,333 | 7,064 | 217 | 49,327 | 1,627 | 16,908 | 354 |
L | 26,089 | 24,553 | 1,338 | 14,611 | 13,322 | 1,144 | 89,687 | 6,243 | 31,415 | 1,299 | |
HD13 | H | 9,193 | 8,900 | 218 | 5,903 | 5,647 | 200 | 56,551 | 2,157 | 20,200 | 552 |
L | 27,929 | 26,505 | 1,191 | 12,508 | 11,490 | 901 | 87,735 | 6,532 | 31,694 | 1,484 | |
Total | H | 32,260 | 32,294 | 708 | 22,532 | 21,682 | 664 | 178,389 | 6,032 | 62,589 | 1,564 |
L | 111,625 | 106,266 | 4,497 | 49,893 | 45,665 | 3,729 | 331,792 | 22,761 | 117,788 | 5,099 |
Nonfunctional sequences are defined as sequences with out-of-frame junctions.
To test whether SHM targeting was also consistent across functional and nonfunctional loci in humans, we recruited four healthy human subjects and sequenced both their Ig H and L chains. Overall, we collected 32,260 unique H chain sequences and 111,625 unique L chain sequences (Table II). Comparison of S5F targeting models constructed independently for each subject showed that targeting patterns were highly conserved across individuals (Supplemental Fig. 2B, 2C), consistent with previous observations (12). We thus agglomerated mutations from all individuals to build L chain S5F and S5NF targeting models, using functional and nonfunctional (defined by out-of-frame junctions) sequences, respectively. The relative mutability of DNA motifs in the nonfunctional and functional sequences was highly correlated (Pearson ρ = 0.93, Spearman ρ = 0.88; Fig. 3B). Similar results were obtained for the human H chain sequences (Pearson ρ = 0.93, Spearman ρ = 0.86; Fig. 3C). Overall, these results suggest the SHM targeting mechanism is highly conserved at the functional and nonfunctional loci in both the human and the mouse systems.
C/G bases are targeted more frequently and exhibit a higher transition frequency in mice
To understand whether SHM targeting was species specific, we compared the mouse and human L chain mutability models. Overall, there was a high correlation between the mouse S5NF model and the human S5F model (Pearson ρ = 0.63, Spearman ρ = 0.63; Fig. 4A). In both species, classic hot spots generally exhibited higher mutabilities, whereas classic cold spots exhibited lower mutabilities. Despite this similarity, we observed a global shift in mutabilities at certain bases. In the mouse model, 5-mer motifs with C or G (hereafter written as C/G) as the central base had much higher mutabilities compared with the same motifs in the human model across the 299 motifs observed in both models, whereas 5-mer motifs centered at A or T (hereafter written as A/T) had much lower mutabilities (Fig. 4B). When considering the mutability of motifs centered on each base separately, the mutabilities were highly consistent (Supplemental Fig. 3), suggesting that the difference between mouse and human SHM targeting exists mainly in the overall targeting frequencies of C/G verses A/T bases. The models adjust for differences in CG contents between mice and humans by normalizing mutabilities using frequencies of microsequence motifs in the germline sequences. To ensure that the observed species-specific difference was not due to tissue-specific differences (mouse B cells were collected from the spleen, whereas human B cells were from PBMCs), we examined an independent data set derived from lymph node B cells from four multiple sclerosis subjects (20). The largest mutability difference between species (i.e., mouse versus human) was five times greater than the largest difference between tissues (i.e., blood versus lymph node) in the same species (data not shown), suggesting that the large increase observed in C/G targeting in mice compared with humans was likely due to a species-dependent mechanistic difference in SHM.
The increase in C/G (verses A/T) targeting in the mouse model suggested a relative decrease in the recruitment of the MSH2/MSH6 DNA mismatch repair pathway that drives mutation spreading from the original AID-induced lesion, thus leading to a shift in the balance toward phase I (versus phase II) SHM. To test whether there was also a decreased efficiency of recruiting uracil-DNA glycosylase (UNG), the base excision repair enzyme associated with phase Ib of SHM, we analyzed the pattern of nucleotide substitution frequencies computed for each base by aggregating all mutations at positions where only silent mutations were possible (Table III). We reasoned that decreased recruitment of UNG would lead to an increase in AID-induced lesions being resolved through simple replication events, and consequently an increase in the frequency of transition mutations (versus transversion mutations) at C/G bases. Consistent with this hypothesis, and with previously reported transition frequencies in mice (4, 31, 32) and humans (30), the transition frequencies from C-to-T and G-to-A in mice were significantly higher than those in humans (Table III) (p < 1e−15, Pearson’s χ2 test). This species-specific difference in transition frequency was found only at C/G bases, whereas the substitution profiles at A/T bases were similar between mice and humans. This is consistent with a shift in the balance toward phase Ia compared with phase Ib SHM in mice. Overall, these observations suggest a general decrease in the efficiency of recruiting DNA error-prone repair pathways normally associated with SHM to the sites of AID-induced lesions in mice compared with humans.
. | Mouse L Chain . | Human L Chain . | Human H Chain . | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
From\To . | A . | C . | G . | T . | A . | C . | G . | T . | A . | C . | G . | T . |
A | — | 0.16 | 0.54 | 0.29 | — | 0.26 | 0.50 | 0.24 | — | 0.28 | 0.49 | 0.23 |
C | 0.13 | — | 0.09 | 0.78 | 0.25 | — | 0.30 | 0.45 | 0.21 | — | 0.35 | 0.44 |
G | 0.77 | 0.14 | — | 0.09 | 0.53 | 0.31 | — | 0.16 | 0.49 | 0.35 | — | 0.16 |
T | 0.25 | 0.57 | 0.18 | — | 0.20 | 0.57 | 0.23 | — | 0.19 | 0.44 | 0.37 | — |
. | Mouse L Chain . | Human L Chain . | Human H Chain . | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
From\To . | A . | C . | G . | T . | A . | C . | G . | T . | A . | C . | G . | T . |
A | — | 0.16 | 0.54 | 0.29 | — | 0.26 | 0.50 | 0.24 | — | 0.28 | 0.49 | 0.23 |
C | 0.13 | — | 0.09 | 0.78 | 0.25 | — | 0.30 | 0.45 | 0.21 | — | 0.35 | 0.44 |
G | 0.77 | 0.14 | — | 0.09 | 0.53 | 0.31 | — | 0.16 | 0.49 | 0.35 | — | 0.16 |
T | 0.25 | 0.57 | 0.18 | — | 0.20 | 0.57 | 0.23 | — | 0.19 | 0.44 | 0.37 | — |
The substitution frequency from each base to every other base is computed using the set of mutations observed at positions where all base changes are silent.
Targeting in human H chains is distinct from L chains
Because of different chromosomal contexts, it is possible that the SHM process for H and L chains is subject to distinct regulatory machineries. To examine this, we compared the human L chain S5F model with the human H chain S5F model previously built. The L chain and the H chain models were well correlated (Pearson ρ = 0.66, Spearman ρ = 0.75; Fig. 5A), and the classic hot and cold spots showed the expected deviations from neutral motifs. However, the cross-chain correlations were significantly weaker than the correlations for same-chain comparisons across individuals (p < 1e−15; Fig. 5B). Furthermore, >30% of the 5-mer motifs had mutabilities that were >2-fold different between H and L chains (100/323 motifs observed in both models), and >7% of the motifs were at least 4-fold different (25/323 motifs), whereas <17% of the motifs were >2-fold different between the same type of chains (when comparing separate models built for each individual). The targeting model built solely on human κ L chains (instead of combining κ and λ) did not yield higher correlation with the mouse κ-chain model, possibly because a smaller database is more prone to noise. Overall, these results indicate that H and L chains exhibit similar SHM targeting patterns, but distinct mechanisms are likely acting at each locus, and analysis of H and L chain SHM patterns is likely to require separate targeting models.
Species-specific and chain-specific targeting
To explore the global relationships between the species- and chain-specific SHM targeting models, we built separate targeting models for each individual mouse and human of the following categories: 1) κ L chain sequences from four NP-immunized mice (RS5NF models), 2) κ L chain sequences from two control mice (S5F models), 3) κ and λ L chain sequences from four healthy human subjects (S5F models), 4) H chain sequences from four healthy human subjects (S5F models), 5) H chain sequences from 11 human subjects used in a previous study by Yaari et al. (12) (S5F models), and 6) sequencing errors from three mouse samples (described in 2Materials and Methods). The pairwise distance between each of these models was defined as follows: 1 − Spearman correlation of the mutability estimates. Principal component analysis was then used to project this distance matrix onto the first two principal components, which together account for most of the variance in the distance matrix (95%) (see 2Materials and Methods) (Fig. 6A). We found that mouse L chains, human L chains, and human H chains clustered into distinct groups. The high consistency of samples within each group, combined with the disparity across the groups, indicates that different species and chains have distinct targeting patterns. These differences were not completely due to the increased targeting of C/G bases in mice, because renormalizing the models to have the same mutability for all 5-mers with the same central base produced a similar result. The differences observed across chains and between species were also not likely due to usage of distinct germline sequences because mutabilities are normalized based on the frequency of occurrence of each motif in the germline sequences. To confirm this, the analyses were repeated using separate models constructed for each V gene family independently. These results showed that the variation between V gene families at the same locus was small compared with the differences observed across chains and between species. All of the models were clustered far from the sequencing error model, providing reassurance that the SHM targeting models were not significantly influenced by potential sequencing errors. Interestingly, the human L chain models were intermediate between the murine L chain and the human H chain models, suggesting some conservation of SHM mechanisms at this locus. To better understand how these differences in SHM targeting could impact affinity maturation, we designed a simulation-based analysis to quantify how often the models lead to different mutations being introduced in a repertoire of 2000 Mus mus IGKV1-135*01 sequences (the most frequent segment among mouse nonfunctional sequences), each carrying five mutations. Ten simulated repertoires were generated for each SHM model (mouse L chain, human L chain, and human H chain) using a mutation algorithm previously described (8). We found that two repertoires that responded independently under the same mutability model (mouse L chain) would differ in ∼13% of mutations. In contrast, responses that evolved under the mouse L chain model differed from those using the human H chain model by >38% (Fig. 6B). In contrast with the PCA in Fig. 6A, the human L chain model was not closer to the mouse L chain than the human H chain. This is because the V segment chosen for the simulation lacks the sequence motifs exhibiting similar mutabilities between the mouse L chain model and the human L chain model. However, in simulations with other V segment sequences, we observed that the human L chain model rendered more similar mutations with the mouse L chain model. Overall, these analyses provide strong evidence for species- and chain-specific targeting in the SHM process, and highlight the need to use the proper targeting models for the analysis of SHM patterns.
Discussion
Models of SHM microsequence targeting and substitution bias can provide mechanistic insights into the underlying mutation and DNA repair pathways, and serve as critical background models for statistical analyses of mutation patterns. The development of SHM targeting models has been hindered by the difficulty in identifying large numbers of somatic mutations that have not been influenced by selection. To overcome this limitation, we used high-throughput sequencing to generate a large number of κ L chains from NP-binding GC B cells. Because B cells specific for the NP hapten use λ L chains, many of these cells also carry a nonfunctional κ L chain sequence. These nonfunctional κ L chains accumulated significant numbers of somatic mutations after NP immunization. To model SHM targeting, we adapted the 5-mer microsequence targeting framework (12), which estimates the mutability of each base as a function of the surrounding two bases upstream and two bases downstream. This approach captured SHM targeting patterns that were highly reproducible across individual mice. The patterns were also similar in sequences with low (<1%) and high (≥1%) mutation frequencies, suggesting that the SHM process is stable over time. This contrasts with the hierarchical model proposed by Yeap et al. (33). Although our data also show that early mutations are targeted preferentially to hot-spot motifs, this is due to their higher overall mutability. Hot- and cold-spot mutations are both observed in sequences with few mutations (<1%) and appear at their expected frequencies. Comparative analysis of this mouse L chain model with SHM targeting models for human L and H chains identified both species- and chain-specific differences.
Although the mutability of classic hot- and cold-spot motifs was broadly similar in mice and humans, SHM targeting in mice was characterized by increased C/G targeting. Previous studies suggested that SHM targeting in mice and humans was similar, and that there were no chain-specific differences (9). However, the number of sequences and mutations in these studies were significantly smaller compared with this study. Although the gross patterns are similar with the previous finding, the novel observations of both species- and locus-specific patterns of mutation imply that there are unique molecular mechanisms operating and/or that the extent of activity of certain mechanisms varies across species and loci. The increased targeting of C/G (versus A/T) bases combined with the increased transition frequency specifically at C/G bases in mice is consistent with less involvement of the DNA error repair pathways normally associated with SHM relative to humans. Thus, in this mouse model, more of the AID-induced lesions appear to be resolved by a simple replication event without the involvement of base excision or mismatch repair. L and H chains display overall similar targeting patterns, but the detailed mutability patterns within each hot or cold spot displayed large discrepancy. One explanation for locus-specific targeting is that differences in the regulatory elements (e.g., promoters and enhancers) in H and L chain genes control recruitment of AID and DNA repair factors like UNG and error-prone polymerases. Each repair mechanism may have distinct preference in targeting motifs, and the shift in the contribution of repair mechanisms may be reflected in microsequence targeting specificity. Overall, these species- and chain-specific differences highlight the importance of using an SHM targeting model that is appropriately matched for the data being analyzed.
A potential limitation of this study concerns the comparison of functional versus nonfunctional sequences. In the mouse system, the total number of mutations in functional sequences was low, and thus may be more influenced by sequencing errors. However, the confirmation that SHM targeting in functional and nonfunctional sequences in human subjects was also similar supports our conclusion that data from nonfunctional sequences can be used as a model for mutation patterns in functional sequences. It is possible that the 5-mer model does not fully capture the influence of microsequence context on mutability and that targeting may be influenced by bases that are further upstream or downstream. However, extending the analysis to longer motifs (e.g., 7-mers) would require much higher sequence coverage to account for the large number of possible nucleotide combinations (e.g., 16,384 combinations of 7-mers). It is also important to acknowledge that no motif-based model may be sufficient to capture the full complexity of SHM targeting, which may be influenced uniquely by each V gene context. However, the observation that the same conclusions hold when running the analysis independently on each V gene family suggests that this is unlikely. There is also a caveat in the comparison between species. The human samples in the study were from a diverse population, whereas the mouse models were based on a single inbred laboratory strain. It is possible that the patterns observed are specific to this strain. Finally, it is important to note that the sequencing error models were built from sequences with identical UIDs, which captures the PCR and sequencing errors, but not errors introduced in the reverse transcription step prior to the UID-tagging process.
The models developed in this study quantitatively characterize SHM targeting without the biasing influence of selection pressures. In addition to revealing previously unrecognized hot and cold spots as well as the influence of species and locus, these models provide important background distributions for statistical analysis of affinity maturation in experimentally derived Ig sequencing data. Specifically, we developed two new models: the mouse RS5NF L chain and the human S5F L chain models. These models complement the previously published human S5F H chain model (12) and are available in the SHazaM R package (24) (http://shazam.readthedocs.io/).
Acknowledgements
We thank Dr. Gur Yaari and Dr. Mohamed Uduman for feedback and technical assistance and the Yale University Biomedical High Performance Computing Center for use of computing resources.
Footnotes
This work was supported by the National Institutes of Health (Grants R03AI092379 and R01AI104739 to S.H.K., Grant R01AI43603 to M.J.S.), the Myasthenia Gravis Foundation of America, the National Institute of Allergy and Infectious Diseases (Awards R01AI114780 and U19 AI056363 to K.C.O.), the Natural Sciences and Engineering Research Council of Canada (NSERC PGS-M to A.C.), and the National Library of Medicine (Grant T15LM07056 to J.A.V.H.). The Yale University Biomedical High Performance Computing Center is supported by National Institutes of Health Grants RR19895 and RR029676-01.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The online version of this article contains supplemental material.
References
Disclosures
The authors have no financial conflicts of interest.