BCR sequences diversify through mutations introduced by purpose-built cellular machinery. A recent paper has concluded that a “templated mutagenesis” process is a major contributor to somatic hypermutation and therefore Ig diversification in mice and humans. In this proposed process, mutations in the Ig locus are introduced by copying short segments from other Ig genes. If true, this would overturn decades of research on B cell diversification and would require a complete rewrite of computational methods to analyze B cell data for these species. In this paper, we re-evaluate the templated mutagenesis hypothesis. By applying the original inferential method using potential donor templates absent from B cell genomes, we obtain estimates of the methods’ false positive rates. We find false positive rates of templated mutagenesis in murine and human Ig loci that are similar to or even higher than the original rate inferences, and by considering the bases used in substitution, we find evidence that if templated mutagenesis occurs, it is at a low rate. We also show that the statistically significant results in the original paper can easily result from a slight misspecification of the null model.

Our immune systems generate a highly diverse set of Abs to protect us from pathogens. An important part of this process is affinity maturation, which generates high-affinity Abs for Ags encountered by the immune system. Affinity maturation is the result of multiple rounds of mutation and selection; mutations are introduced into the rearranged Ab gene by enzymatic processes, and mutations leading to higher-affinity Abs are selected.

Two major processes are believed to underlie the mutation processes in B cells: classical somatic hypermutation (SHM) and gene conversion (GCV). Both processes depend on activation-induced cytidine deaminase (AID) (1), which creates U:G lesions in the DNA by deaminating deoxycytidine to deoxyuridine. In SHM, the lesion is resolved by recruiting error-prone repair machinery that can introduce nontemplated point mutations at and around the AID-induced lesion. In GCV, the lesion is repaired using an homologous segment elsewhere in the genome as a donor template, resulting in the homologous tract being copied into the rearranged Ab gene.

In a recent paper, Dale et al. (2) propose that new mutation process, called “templated mutagenesis,” is also an important contributor to BCR diversification in mice and humans. In this process, an incompletely understood mechanism uses the sequence of other germline genes to guide the mutation process at a rearranged germline gene. One candidate mechanism for this process is GCV. However, templated mutagenesis differs from previous descriptions of GCV (in species such as chicken) in that it does not require long stretches of homology between donor and recipient sequences. Indeed, Dale et al. (2) find that templated mutagenesis “extends into the somatically mutated non-Ig sequences, LAIR1, gpt, and β-globin, despite the lack of overt homology between these genes and the IgHV repertoire.” For this paper, we will simply refer to this newly hypothesized process as templated mutagenesis.

Although much of the evidence presented by Dale et al. (2) was in the form of statistically significant deviations from a simplified null model, the authors suggest that ∼50–65% of mutations in IgH from a collection of human and murine datasets are consistent with templated mutagenesis. If over half of mutations truly come from templated mutagenesis, B cell repertoire analysis methods will need to be rebuilt. For example, to estimate the likelihood of a group of mutations that match a template elsewhere in the Ig locus, one must incorporate the respective likelihoods that the group occurred from classical SHM or from templated mutagenesis. Therefore, any method that relies on estimating mutation probabilities would need to be updated, including all core methods for B cell sequence analysis: germline annotation, lineage tree estimation, selection strength estimation, and validation techniques such as repertoire simulation. In light of this dependence, accurate rate estimates of templated mutagenesis are crucial.

In this paper, we show that data from human samples (3) and data from a transgenic mouse model (4) analyzed by Dale et al. (2) do not support a high rate of templated mutagenesis. We do so by reimplementing the software described in Dale et al. (2) and called PolyMotifFinder (PolyMF), which identifies potential templated mutagenesis events by comparing mutated sequences to a pool of potential donor genes. Using this software, we run a control experiment absent from the original analysis: we calculate the rate of templated mutagenesis using a donor set of simulated genes not present in the organism from which the mutated sequences derived. In this way, we show that the PolyMF strategy for detecting templated mutagenesis via microhomology has a false positive rate (FPR) very close to and in some cases above the reported positive rate (PR). This yields an upper bound on the range of the true templated mutagenesis rate; this range is often zero. We also describe how Dale et al. (2) conflate a nontrivial rate of templated mutagenesis with significance estimates for a simplified null model. In addition, we argue that clustering of mutations is compatible with the classical Neuberger model of SHM, thus evidence of such clustering is not prima facie evidence of templated mutagenesis. We conclude that more evidence is needed if templated mutagenesis should be accepted as an important part of BCR diversification.

Because the original software in Dale et al. (2) was not made available as part of the publication, nor was it available upon personal request without a materials transfer agreement restricting its use, we reimplemented the algorithms described in Dale et al. (2) as open-source software in Python, a widely available and free programming language.

The PolyMF algorithm relies on the creation of two matrices. Given a set of n mutated sequences, all deriving from the same germline sequence, the length of which is p, and a window size k corresponding to the minimum allowable donor tract length for templated mutagenesis, we create matrices M and S, each having n rows and p columns. Sij = 1 if 1) there is a mutation at position j in sequence i and 2) there exists a window of size k around position j in the ith sequence that contains at least two mutations and is represented in the donor set. That is, Sij is an indicator of position j in sequence i belonging to a pair of mutations consistent with templated mutagenesis. For M, we take Mij = 1 if 1) there is a mutation at position j in sequence i, 2) there is at least one other mutation in sequence i whose distance from j is k − 1 or less, and 3) that the mutation is not part of a pair of mutations that was seen in one of the previous sequences. In other words, M identifies the set of unique mutations within a length-k window from other mutations. In Dale et al. (2). S is referred to as the scoring matrix and M as the mutation matrix. The templated mutagenesis coverage is then computed as i=1nj=1pMijSij/i=1nj=1pMij.

We note in passing that calculation of the templated mutagenesis coverage depends on the order in which the sequences are processed. As an example, consider the very abbreviated case in which the germline sequence is AAA, the donor set is the single sequence CT (so that k = 2), and the mutated sequences are ATT and CTT. If the sequences are processed as ATT followed by CTT, we will have

then i,jMijSij=1, i,jMij=3 for a templated mutagenesis coverage of 1/3.

If the sequences are processed in the opposite order, CTT followed by ATT, we have

Therefore, i,jMijSij=2, i,jMij=3 for a templated mutagenesis coverage of 2/3. Nevertheless, we reimplemented this order-dependent procedure.

PyMotifFinder (PyMF), our package, including a reimplementation of PolyMF, is available on GitHub (https://github.com/matsengrp/PyMotifFinder), and our analysis scripts are available at https://github.com/matsengrp/TemplatedMutagenesis-1. Our implementation takes, as input pairs of naive and mutated sequences along with a donor gene set, the set of potential templates for templated mutagenesis. It then computes a templated mutagenesis coverage according to the strategy defined above. Our implementation contains unit tests to verify the accuracy of the algorithm. The sequence data used in our analyses are available on Zenodo (https://doi.org/10.5281/zenodo.3572361). The complete analysis starting from preprocessed data, including generating the figures and tables included in this article, can be reproduced by copying and pasting a handful of commands into a provided Docker container (5) as described in the GitHub repository. All preprocessing scripts are included with the data on Zenodo.

We analyzed three sets of mutated sequences: one from human subjects, and two from a transgenic mouse model. The first set of sequences, described in (3), corresponds to Abs to the membrane-anchored Ebola virus glycoprotein trimer. They were collected from the peripheral B cells of a convalescent donor who survived the 2014 Ebola outbreak in Zaire and will be referred to as the anti-Ebola sequences. The sequences were downloaded from GenBank using the accession numbers corresponding to the H chain sequences (taken from the supplemental material of Ref. 3), and both the accession numbers and sequences used are available on Zenodo.

The second and third sets of mutated sequences come from a transgenic mouse model described in (4), which was also investigated by Dale et al. (2). Briefly, these sequences come from the B cells of mice that have been genetically engineered with a modified H chain locus. One chromosome, with the “productive” allele, contains the pre-rearranged V region of the 4-hydroxy-3-nitrophenylacetyl–binding B1-8 Ab (VB1-8). The other chromosome, with the “passenger” allele, contains a sequence consisting of a VB1-8 promoter and leader, followed by a stop codon, followed by either the Escherichia coli gpt gene or another copy of VB1-8. The sequences on both alleles accumulate mutations by SHM following the immunization of the mice with 4-hydroxy-3-nitrophenylacetyl–chicken γ globulin. Because the sequences on the passenger allele cannot be expressed, the SHM patterns on these sequences are unaffected by natural selection, making the system particularly useful for studying SHM. The analyses presented in this study use only the passenger gpt or VB1-8 sequences from this system. The sequences come from B cells collected from either the Peyer patches or the spleens of vaccinated mice (six samples from each), and we included sequences taken from both tissue types in our analysis. These sequences will be referred to as the gpt sequences and the VB1-8 sequences, respectively.

The gpt and VB1-8 sequences were downloaded from the Sequence Read Archive (SRP061422). Presto (6) was used to assemble the raw paired-end reads, filter reads to those with an average quality score of at least 20, remove PhiX contamination, and filter to sequences that were seen at least twice. The script used for this process is available with the data on Zenodo.

For each mutation tract, PolyMF looks for templated mutagenesis from a provided donor gene set that contains potential templated mutagenesis donors; if a mutation tract in a mature sequence matches exactly to a region in the donor gene set, the mutation is explainable by templated mutagenesis from that donor set. We prepared five donor gene sets: a set of human IGHV genes, two sets of mouse IGHV genes, and two mock sets containing simulated genes homologous to the E. coli gpt gene. Because these simulated gpt homologs are not present in mouse, we use them as controls, as described below. We refer to these sets as the human IGHV gene set, the mouse ImMunoGeneTics (IMGT) IGHV gene set, the mouse 129S1 IGHV gene set, and the mock gpt gene sets.

The human IGHV donor set was used to obtain the PyMF rate estimate of templated mutagenesis in the anti-Ebola sequences. This set consists of human IGHV gene segments downloaded from IMGT (http://www.imgt.org/vquest/refseqh.html) in September 2017. The IMGT label for this set was “F+ORF+all P,” corresponding to all functional genes, all open reading frames (ORF), and all pseudogene alleles, yielding 466 total segments. A file containing these segments is available on Zenodo.

The mouse IMGT IGHV gene set was used to obtain the PyMF rate estimate of templated mutagenesis in the VB1-8 and gpt sequences. This set consists of mouse IGHV gene segments downloaded from IMGT (http://www.imgt.org/vquest/refseqh.html) in September 2017. The IMGT label for this set was “F+ORF+all P,” corresponding to all functional genes, all ORFs, and all pseudogene alleles, yielding 499 total segments. A file containing these segments is available on Zenodo.

The mouse 129S1 IGHV donor set was used primarily to describe what the mutation spectrum would look like under a templated mutagenesis model in the transgenic mouse system described in (4). Although these mice belonged to strain 129P2, the 129P2 H chain locus has not been sequenced; instead, we used the set of IGHV genes present in the closely related strain 129S1, published in (7). We note that the gene set is incomplete, with only the 3′ half of the IGHV locus sequenced. Because this gene set was used primarily to describe the likelihood of mutations in a model of templated mutagenesis and not to get at the rate of templated mutagenesis, we decided that a partial set of genes that match closely those found in in the actual system was an appropriate choice. In particular, we believe it is better than the alternative of using the mouse IGHV genes taken from IMGT, which include several times as many genes as are present the 129P2 genome. The sequenced and annotated region of the 129S1 genome was downloaded from GenBank (https://www.ncbi.nlm.nih.gov/nuccore/126349412). The V gene segments were extracted using a custom Python script available on Zenodo.

The two mock gpt donor sets were used to estimate the FPR of the PolyMF strategy with the human IMGT donor set and the FPR of the PolyMF strategy with the mouse IMGT donor set. For a mock donor set to give a good estimate of the FPR of the PolyMF strategy, the mock set should have approximately the same homology structure as the donor set used by PolyMF. We created one such set for the human IMGT IGHV donor set and one for the mouse IMGT IGHV donor set. In each case, we aligned the sequences in the gene set using MUSCLE version 3.8.31 (8) and inferred a phylogenetic tree on the sequences using FastTree version 2.1.7 (9). We then used Pyvolve (10) to simulate a new set of sequences from the estimated phylogenetic tree. In each case, the gpt sequence was the root, and the sequences were simulated according to a continuous-time Markov process along the estimated phylogeny. We used the GY94 codon model (11) with parameters α = 0.98, β = 0.65.

All of the donor sets are available on Zenodo, along with the scripts used to extract the IGHV gene segments from the 129S1 genome, the scripts to align and create trees from the human IGHV genes and gpt genes, and the script to create the two mock gpt donor sets.

To use PyMF on the anti-Ebola, VB1-8, and gpt sequences, we needed to identify the mutations and the naive sequences. For the anti-Ebola and VB1-8 sequences, germline sequences and mutations were identified using partis version 0.13.0 (12) with default germline V, D, and J gene sets. These sets comprise curated subsets of the germline genes in IMGT; excluded are genes that are biologically implausible (e.g., on the wrong chromosome, nonfunctional, lacking the conserved cysteine) or otherwise considered inaccurate (13).

For the gpt sequences, partis was run using a modified set of germline genes. The reference sequence for the passenger gpt gene (obtained via personal communication with Dr. L.-S. Yeap) was as follows:

5′-CTTTCTCTCCACAGGTGTCCACTCCCAGGTCCAACTGTAGTAGATGAGCGAAAAATACATCGTCACCTGGGACAT GTTGCAGATCCATGCACGTAAACTCGCAAGCCGACTGATGCCTTCTGAACAATGGAAAGGCATTATTGCCGTAAG CCGTGGCGGTCTGGTACCGGGTGCGTTACTGGCGCGTGAACTGGGTATTCGTCATGTCGATACCGTTTGTATTTC CAGCTACGATCACGACAACCAGCGCGAGCTTAAAGTGCTGAAACGCGCAGAAGGCGATGGCGAAGGCTTCATCGT TATTGATGACCTGGTGGATACCGGTGGTACTGCGGTTGCGATTCGTGAAATCTGCAGTGACGCGCCCACTCTCAC AGTCTCCTCAGGTGAGTCCTTACAACCTCTCTCTT-3′.

In the reference sequence, positions 44 through 351 correspond to the first 308 nt of the gpt gene, and the remainder are linkers. To align and call mutations from the germline sequence, partis requires a set of germline V, D, and J gene segments. For the V gene segment, we used the first 308 nt of the gpt gene (i.e., the portion of the gpt gene inserted into the mouse germline). For the D and J gene segments, we used arbitrary “fake” gene segments: AAAAAAAAAA for the D gene segment and GGGGGGGGGG for the J gene segment. We appended these segments to the end of each sequence so that each input sequence to partis ended with AAAAAAAAAAGGGGGGGGGG. When used this way, partis aligns the gpt portion of the sequence to the portion of the gpt gene included in the reference, aligns the added suffix to the fake D and J gene segments, and treats the linker region following the end of the gpt gene as a VD insertion. We validated the results by checking that each inferred V sequence length was correct, that the inferred mutation rate in the V gene region was not too high, that the inferred VD insertion lengths were correct, and that the VD insertion sequences corresponded to the linker portion of the naive gpt sequence (positions 352 through 410 in the sequence above). Therefore, all mutations identified by partis are located on the gpt portion of the sequence, with none in the linker sequence. This is the desired behavior for our analysis because we are looking for regions of microhomology in the gpt sequence, and we do not wish to analyze mutations that occurred in the linkers.

To investigate whether templated mutagenesis could explain the observed mutations, we constructed a simple statistical model for templated mutagenesis. The model assumes a uniform probability distribution over the possible templated mutagenesis donors, so that each donor is equally likely to have provided the template for a given templated mutation event. For each mutated site we identified the three possible mutations: the mutation that actually occurred and the two mutations that did not occur. For each of the three mutations, we identified all ways of aligning a donor gene to a mutation-containing region so that the two match in at least k bases around the mutation. We defined this set as the set of potential templated mutagenesis donors, and we modeled mutation because of templated mutagenesis as a uniform draw from this set of donors. In this model, the probability of seeing the observed mutation from a templated mutagenesis event is the number of donors containing the observed mutation divided by the total number of donors. That is,

(1)

where nobs is the number of donors matching the observed mutation and nuobs is the number of donors matching one of the two possible unobserved mutations. We can compute these probabilities for any donor gene set to evaluate how well it explains the observed pattern of mutations.

Given a bound on the FPR of PyMF and an estimate by PyMF of the rate of templated mutagenesis, we can compute an upper bound on the true rate of templated mutagenesis. To do this, we use a simple mixture model in which we assume that mutations arise either by templated mutagenesis or by classical SHM. We first define the following quantities, all of which take values in [0,1]: 1) The FPR is the probability that PyMF classifies a mutation because of classical SHM as explainable by templated mutagenesis, 2) the true positive rate (TPR) is the probability that PyMF classifies a mutation because of templated mutagenesis as explainable by templated mutagenesis, 3) The PR is the overall probability that PyMF classifies a mutation as explainable by templated mutagenesis, and 4) pshm is the true proportion of classical SHM events. Because a mutation can be due either to SHM or templated mutagenesis but not both, 1 − pshm is the true proportion of templated mutagenesis events. Therefore, the overall probability that PyMF classifies a mutation as explainable by templated mutagenesis is

If we assume that TPR = 1, so that all true templated mutagenesis events are correctly classified by PyMF as because of templated mutagenesis (i.e., PyMF has sensitivity 1) and that FPR ≥ b for some lower bound b, we can rearrange the expression above to conclude that pshm1PR1b. Equivalently, the rate of templated mutagenesis would be at most 11PR1b. If we do not specify a TPR, the upper bound for the rate of templated mutagenesis becomes

(2)

assuming that TPR > FPR. If TPR < FPR, we cannot obtain an upper bound. We apply (2) to obtain upper bounds on the rate of templated mutagenesis in mice and humans.

The gpt sequences have a grouped structure; each mutated sequence comes from one of 12 tissue samples from six different organisms, so it is inappropriate to model them as independent and identically distributed. Hypothesis testing and confidence interval construction for the gpt sequences was therefore performed using a mixed-effects model fit with the lme4 package (14) in R (15). For each value of k (the minimum donor tract length) and each reference sequence, we modeled the probability of a mutation, given templated mutagenesis using a mixed model with a random effect for the tissue sample. Confidence intervals were plotted as the fitted value in the mixed model plus or minus two SEs. To test whether templating from the IGHV genes could explain the observed mutations better than templating from the gpt genes, we computed the probability of each mutation under the model of templated mutagenesis by IGHV genes and templated mutagenesis by gpt genes. We then computed the difference between the probability of the mutation under the IGHV templating model and the gpt templating model. Under the null hypothesis that the two models are equally good at explaining the observed mutations, these differences should have a mean of zero. As before, because the mutations have a grouped structure, we tested this null hypothesis using a mixed model with a random effect for tissue sample.

To verify that our reimplementation of PolyMF was comparable to the version presented by Dale et al. (2), we ran PyMF on the gpt and VB1-8 sequences described in (4) with the mouse IGHV gene set as well as the anti-Ebola sequences described in (3) with the human IGHV gene donor set. We found slightly higher but comparable rates of mutations explainable by templated mutagenesis in the gpt sequences: of the mutations within 8 nt of each other, 60–75% had 8-mer templates in the mouse V genes (Fig. 1), consistent with the initial report. We found a higher rate of mutations explainable by templated mutagenesis in the anti-Ebola and VB1-8 sequences; of the mutations within 8 nt of each other, 73% of the anti-Ebola sequences had 8-mer templates in the human V genes and 75% of the VB1-8 sequences had 8-mer templates in the mouse V genes. The discrepancy is likely due to differences in gene filtering, germline annotation, and mutation calling between our respective pipelines. Indeed, when we perform stricter filtering to our donor gene sets by removing ORF and pseudogene sequences, we obtain rate estimates of 42% for humans and 62% for mice, which are very close to the Dale et al. (2) estimates. We discuss this in detail in 16Consistent results using filtered donor gene sets.

FIGURE 1.

Red points represent the fraction of mutations in the gpt gene explainable by templates in the mock donor set of simulated gpt homologs, which are not present in the mouse germline and are not available as templates for templated mutagenesis (i.e., the FPR). Blue points represent the fraction of mutations in the gpt gene explainable by templates in the mouse IMGT IGHV gene donor set, which are potentially present in the mouse germline and available as templates. For each tract length and donor set, the filled circle and error bar represent the overall estimate of the probability of a mutation being explainable by templated mutagenesis plus or minus two SEs. We see that the FPR of PyMF is larger on average than the PyMF estimate of the fraction of mutations explainable by templated mutagenesis. Points corresponding to samples from Peyer patches and spleen are offset slightly to the left and right, respectively, to facilitate comparison and to avoid overplotting. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

FIGURE 1.

Red points represent the fraction of mutations in the gpt gene explainable by templates in the mock donor set of simulated gpt homologs, which are not present in the mouse germline and are not available as templates for templated mutagenesis (i.e., the FPR). Blue points represent the fraction of mutations in the gpt gene explainable by templates in the mouse IMGT IGHV gene donor set, which are potentially present in the mouse germline and available as templates. For each tract length and donor set, the filled circle and error bar represent the overall estimate of the probability of a mutation being explainable by templated mutagenesis plus or minus two SEs. We see that the FPR of PyMF is larger on average than the PyMF estimate of the fraction of mutations explainable by templated mutagenesis. Points corresponding to samples from Peyer patches and spleen are offset slightly to the left and right, respectively, to facilitate comparison and to avoid overplotting. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

Close modal

To investigate the FPR of the PolyMF strategy, we ran PyMF on the set of somatically mutated gpt sequences described in 3Sequence data sets section, using a mock donor set of simulated gpt homologs that are not present in B cells. Because mutations could not have arisen from copying over templates from our mock set, any inference of templated mutagenesis events identified by the method must be a false positive. For the rate of false positives provided by the mock donor gene set to provide a good estimate of the true FPR, the mock donor gene set should be constructed so that the probability that an SHM-induced mutation in a gpt sequence matches a member of the mock donor gene set is close to the probability that an SHM-induced mutation in a real Ab sequence matches one of the IGHV genes. For this to hold, the distribution of molecular divergences among the genes in the mock donor set should match the distribution of divergences among the real donor gene set.

Our mock donor gene sets were created to have these properties. To verify this, we estimated phylogenies for the two mock donor gene sets and the two IMGT IGHV gene sets and computed the divergences between the roots and the leaves in each. The divergence distribution in the mock gpt set based on the mouse IMGT gene set resembled the divergence distribution in the mouse IMGT gene set (mean divergence 0.48 and 0.48, respectively). The same holds with the mock gpt set based on the human IMGT gene set and the human IMGT gene set (mean divergence 0.41 and 0.4, respectively). For a more complete description of the divergences, five-number summaries of the divergences in each of the four gene sets are given in Supplemental Table I.

Finally, we note that in the gpt system, we expect all of the mutations to be introduced by SHM. Although Dale et al. (2) analyze these sequences and suggest that templated mutagenesis could be occurring, the lack of homology between the gpt gene and the V genes makes it a priori unlikely that these mutations are introduced by templated mutagenesis. We address the potential contribution from small microhomologies (matches of fewer than 10 bases) between the gpt gene and the IGHV genes because of chance sequence similarity in 13Evidence that mutations in gpt sequences are not due to templating from V genes.

To estimate the FPR of PyMF with the human IMGT IGHV gene set and the FPR of PyMF with the mouse IMGT IGHV gene set, we ran the algorithm on the gpt sequences with the corresponding mock gpt donor gene set. Because the donor set of simulated gpt homologs is not present in the mouse, they cannot have been used as templated mutagenesis donors, and any mutation PyMF identifies as explainable by templated mutagenesis from this set is a false positive. We ran PyMF with a minimum donor tract length ranging from 8 to 14, and we found FPRs that were on the same order as the PolyMF rates obtained when mutated sequences were run against real donor sets. The absolute FPRs were particularly high for minimum donor tracts of 8 and 9 (Fig. 1). The average FPR was 83% for a donor tract of size 8, 50% for donor tracts of size 9, and 25% for donor tracts of size 10. This rate falls dramatically as the minimum donor tract size increases, dropping to 5% for donor tracts of size 14. This suggests that the PolyMF strategy has a large FPR for small values of k, classifying more than 50% of mutations as explainable by templated mutagenesis from genes that were not present in the mouse when k = 8 or k = 9.

One might explain the high FPR of PolyMF by saying that, despite the overall lack of homology between gpt genes and V genes, the mutations in the gpt sequence were actually introduced by templating from very small homologous tracts in the mouse V genes. To check this possibility, we ran PyMF on the gpt sequences with the mouse IMGT IGHV genes as a donor set. We found that many of the mutations could be explained by templating from very small homologous tracts in the mouse V genes, corresponding to the findings of Dale et al. (2). For example, nearly 60% of the mutations had a template of size 8 in the V gene set and ∼40% of the mutations had a template of size 9. However, the percentage approaches zero as template size increases, and for every template size, the average proportion of mutations explainable by templating from gpt sequences is higher than the average proportion of mutations explainable by templating from V genes (Fig. 1). In fact, the proportion of mutations explainable by templating from the V genes is very close to zero for templating by tracts of size 11 or greater, whereas the proportion explainable by templating from the mock gpt donor set remains nonneglibible.

Next, we formally evaluated the plausibility that the mutations in the gpt sequences were introduced by templated mutagenesis from the IGHV genes present in the mouse. To do so, we computed the probabilities of the gpt mutations via templated mutagenesis from the 129S1 IGHV donor gene set and the probabilities of the gpt mutations via templated mutagenesis from the mock donor set of simulated gpt homologs using the uniform-across-donors probability model specified by Eq. 1 for both cases. These models encode our intuition that if the mutations really were templated from a donor gene set, the observed mutation spectrum should be biased toward bases that are represented more frequently in potential donors from that gene set. As an example, suppose that we are considering one mutation in the gpt sequence from A to T. If all of the potential templated mutagenesis donors in the V gene set would lead to a mutation from A to T and all of the templated mutagenesis donors in the gpt gene set would lead to a mutation from A to C, the observed A to T mutation is explained better by templating from the V genes than by templating from the gpt genes. We can fit one model using donors from the mouse V genes and another model using donors from the gpt genes and compare how well each model explains the data; if templated mutagenesis from mouse V genes were really occurring, we would expect the V gene model to fit the data better than the gpt model. If this is not true, it suggests that both the V gene and the gpt inferences are spurious, as the gpt donor genes are not actually present in the mouse.

For each sample, we computed the average probability of the mutations in the gpt sequences given templated mutagenesis from the mouse IGHV gene donor set and the gpt donor set for tract sizes ranging from 8 to 14. We found that these numbers were comparable for the gpt donor set and the mouse IGHV gene donor set, as shown in Fig. 2. As described in 2Materials and Methods, we used a mixed-effects model to test for a difference in the expected probabilities of mutation because of templating from the gpt donor set and the mouse IGHV gene donor set. The resulting p values were as follows: p = 0.059, 0.054, 0.115, 0.202, 0.001, 0.213, and 0.249 for k = 8 through 14, respectively. This indicates that the mutations in the gpt sequences do not tend to look any more like the mouse IGHV gene donor set than they do like the gpt donor set. Because the gpt donor set was not present in the mouse, we believe that it is unlikely that the mutations in the gpt sequences were introduced by templating from the mouse IGHV genes.

FIGURE 2.

Average probability of the observed mutations under a templated mutagenesis model, either templating from gpt genes or templating from the set of 129S1 V genes. Each point corresponds to one sample taken from either spleen or Peyer patches, so that the average is computed over all sequences in a given sample. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

FIGURE 2.

Average probability of the observed mutations under a templated mutagenesis model, either templating from gpt genes or templating from the set of 129S1 V genes. Each point corresponds to one sample taken from either spleen or Peyer patches, so that the average is computed over all sequences in a given sample. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

Close modal

To further investigate whether the mutations could have arisen because of templated mutagenesis from the mouse V genes, we asked whether mutations that had a higher probability under the templated mutagenesis model were observed more frequently. For each mutation from germline base b1, we computed the probability of mutation from b1 to any of the other three bases at that position under the templated mutagenesis model. We then asked whether target bases that had a higher probability under the templated mutagenesis model were observed more frequently.

We found that mutations with higher probabilities under the templated mutagenesis model were not observed any more frequently than mutations with low probabilities under the model (Fig. 3). This finding held true both for the model of templated mutagenesis from gpt genes and from the 129S1 IGHV genes. To formally test whether mutations with high probabilities occurred more frequently, we performed independent logistic regressions for each pair of germline and target base. The response variable was an indicator of whether the observed mutation was the target base, and the predictor variable was the probability of mutation to the target base under the templated mutagenesis model. In each case, we found that the slope in the model was nonsignificant at the 0.05 level, indicating that the templated mutagenesis model did not help to explain the observed pattern of mutations. This analysis provides further evidence that the mutations in the gpt sequences did not arise by templated mutagenesis from the IGHV genes present in the mouse.

FIGURE 3.

Each subplot displays whether a mutation was observed (on the y-axis) versus its probability under the templated mutagenesis model (on the x-axis). A y value of one means the mutation was observed, and a y value of zero means the mutation was not observed. For each mutation, the germline base is indicated by the row name, and the target base is indicated by the column name. The lines are linear smoothers. We do not observe any consistent and significant trend to these lines, indicating that templated mutagenesis has not contributed to the observed sequence changes in the gpt sequence data set. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

FIGURE 3.

Each subplot displays whether a mutation was observed (on the y-axis) versus its probability under the templated mutagenesis model (on the x-axis). A y value of one means the mutation was observed, and a y value of zero means the mutation was not observed. For each mutation, the germline base is indicated by the row name, and the target base is indicated by the column name. The lines are linear smoothers. We do not observe any consistent and significant trend to these lines, indicating that templated mutagenesis has not contributed to the observed sequence changes in the gpt sequence data set. This analysis was performed once on data from six individual mice, with two replicates per mouse corresponding to samples from Peyer patches and spleen, yielding 12 total samples.

Close modal

We combined PyMF’s estimate of the rate of templated mutagenesis with our estimate of PyMF’s FPR to obtain an approximate upper bound on the true rate of templated mutagenesis in mice (using the VB1-8 sequences) and in humans (using the anti-Ebola sequences). By plugging in PyMF’s estimates of the rate of templated mutagenesis in the VB1-8 sequences and our estimates of PyMF’s FPRs from the gpt sequences to Eq. 2, we found upper bounds on the rate of templated mutagenesis in this system ranging from 0 (in cases that our estimate of the FPR exceeds the rate at which PyMF identified templated mutations) to 0.1, depending on the value of k and the assumed TPR (Table I, top panel). The largest upper bounds were obtained at k = 8. For the human anti-Ebola sequences, we found upper bounds on the rate of templated mutagenesis ranging from 0 to 0.12, with the numbers again varying based on the value of k and the assumed TPR. In this case, the largest upper bounds is obtained at the largest value of k, k = 14, and in general, the larger values of k correspond to larger upper bounds. However, note that because these estimates are upper bounds of the true rates in both humans and mice they are consistent with a rate of zero.

Table I.
Upper bounds on the rate of templated mutagenesis in the VB1-8 (top) and the anti-Ebola sequences (bottom) computed for a range of tract lengths (k) and sensitivities
Mice
kPyMF RatePyMF FPRUB (1)UB (0.99)UB (0.95)UB (0.9)
0.79 0.83 
0.54 0.5 0.08 0.08 0.09 0.1 
10 0.28 0.25 0.05 0.05 0.05 0.05 
11 0.15 0.14 0.01 0.01 0.01 0.02 
12 0.11 0.09 0.01 0.01 0.01 0.02 
13 0.07 0.07 
14 0.06 0.05 0.01 0.01 0.01 0.01 

 

 

 
Humans
 

 

 

 
0.73 0.78 
0.44 0.43 0.02 0.02 0.02 0.02 
10 0.26 0.2 0.07 0.08 0.08 0.09 
11 0.18 0.11 0.09 0.09 0.09 0.1 
12 0.15 0.07 0.09 0.1 0.1 0.11 
13 0.15 0.05 0.1 0.1 0.11 0.11 
14 0.13 0.03 0.1 0.11 0.11 0.12 
Mice
kPyMF RatePyMF FPRUB (1)UB (0.99)UB (0.95)UB (0.9)
0.79 0.83 
0.54 0.5 0.08 0.08 0.09 0.1 
10 0.28 0.25 0.05 0.05 0.05 0.05 
11 0.15 0.14 0.01 0.01 0.01 0.02 
12 0.11 0.09 0.01 0.01 0.01 0.02 
13 0.07 0.07 
14 0.06 0.05 0.01 0.01 0.01 0.01 

 

 

 
Humans
 

 

 

 
0.73 0.78 
0.44 0.43 0.02 0.02 0.02 0.02 
10 0.26 0.2 0.07 0.08 0.08 0.09 
11 0.18 0.11 0.09 0.09 0.09 0.1 
12 0.15 0.07 0.09 0.1 0.1 0.11 
13 0.15 0.05 0.1 0.1 0.11 0.11 
14 0.13 0.03 0.1 0.11 0.11 0.12 

The number in parentheses denotes the assumed sensitivity (TPR) of PyMF.

PyMF rate, naive PyMF estimate of the rate of templated mutagenesis; UB, upper bound.

We caution against taking these numbers as definitive, as we do not know the TPR of PyMF, and they require that our estimate of the FPR of PyMF is a lower bound. However, they attempt to correct the observed rates using FPR estimates and, in particular, show that templated mutagenesis does not occur at a high rate unless PyMF misses many true templated mutagenesis events.

We also tested whether templated mutagenesis could be occurring from the reverse complementary strand. To this end, we repeated all the analyses with the reverse complements added to the donor gene sets. The results are shown in Supplemental Figs. 1 and 2 and Supplemental Table II and were qualitatively similar to those with the original gene sets. The rate estimates were slightly higher because of the larger size of the donor sets (Supplemental Fig. 1). The average probability of the observed mutations given templated mutagenesis from the gpt genes and their reverse complements remained about the same as the average probability of the observed mutations given templated mutagenesis from the IGHV genes and their reverse complements (Supplemental Fig. 2). The upper bounds on the rate of templated mutagenesis also remained low when the reverse complements were included in the donor gene sets (Supplemental Table II).

We obtained PR estimates of 73% for humans and 79% for mice (Table I) when applying the PolyMF strategy with k = 8, using our chosen donor gene sets and BCR sequence datasets as discussed in 2Materials and Methods. In their analysis, Dale et al. (2) estimate this PR to range to be ∼5065% when applying the same strategy to their chosen donor gene sets and BCR sequence datasets. Although they describe the five different BCR sequence datasets used to obtain these estimates, two things remain unclear. First, it is not obvious how they extracted the 5065% range from the data displayed in their Figure 5(I), which seems to show PR estimates roughly ranging from 30 to 90% and whose rate estimates seem to depend on the dataset in question. Second, they do not describe the exact donor gene sets obtained from IMGT. The construction of the donor sets is crucial because the number of genes in the donor set influence the PR estimates; adding more templates to the donor set can only increase the number of PolyMF hits because there will be more chances to observe a match.

To address these discrepancies, we reran both of the VB1-8 and anti-Ebola analyses using a more restricted donor gene set in each case. Specifically, we filtered out all ORF and pseudogene (P) sequence reads from the respective IMGT sets, which led to a 31.7% decrease in potential donors for the VB1-8 sequences and a 44.9% decrease in potential donors for the anti-Ebola sequences. We obtained PR estimates of 42% for humans and 62% for mice for k = 8. Detailed tables of rate estimates using the filtered donor sets, analogous to Table I, can be found in the main GitHub repository (https://git.io/Jfl6t). Between the collective full and restricted analyses for mice and humans, our PR estimates range from 4279%, which contains the 5065% range proposed by Dale et al. (2). More importantly, our estimates on the upper bound of templated mutagenesis events remain highly similar between the full and restricted analyses, demonstrating the robustness of our methodology to the particular choice of donor gene set.

Finally, we point out that our estimates of templated mutagenesis occurring at a low rate are in fact compatible with the large values of Stouffer Z and the correspondingly small p values obtained in Dale et al. (2). These authors compare the rate of templated mutagenesis to the rate obtained using a simplified null model (called RandomCheck) in which, conditional on the locations of the mutations, the mutation identity at each location is independent of the other locations and follows a fixed distribution taken from previous studies. This model is a simplification of the classical Neuberger model of SHM in many ways. In the Neuberger model, lesions introduced by AID can be resolved by one of three pathways, each of which leads to a repair by a different set of enzymes. The likelihood of each pathway being recruited to repair the lesion depends on nucleotide context, and each pathway is assumed to have its own unique, context-dependent mutation profile (16). The result is that the mutations are not independent and identically distributed conditional on the germline base, in contrast with the assumption of RandomCheck, which was used to compute p values and Stouffer Z in Dale et al. (2). Aside from issues of independence, the overall mutation profile taken from the literature is exceedingly unlikely to be exactly correct, and given enough samples, any consistent hypothesis-testing framework will confidently identify even small differences between the true mutation profile and the one drawn from the literature.

To demonstrate that even a slightly misspecified model can lead to extreme values of Stouffer Z and highly significant p values, we performed a small simulation study. We suppose that the fraction of mutations explainable by templated mutagenesis in the “true” model is drawn from a β distribution with mean 0.518 and variance 0.048, shown as a dashed line in Fig. 4. In the null model, the fraction of mutations explainable by templated mutagenesis is drawn from a β distribution with mean 0.5 and variance 0.05, shown as a solid line in Fig. 4. We simulate 2000 values (corresponding to the 2000 gpt sequences analyzed) for the fraction of mutations explainable by templated mutagenesis, construct Z values from the hypothesis test that these values come from the null distribution, and finally, compute Stouffer Z from the collection of 2000 Z values. We performed this procedure 10,000 times, yielding a distribution of 10,000 Stouffer Z values.

FIGURE 4.

Top, Densities of the distributions used in the simulations of Stouffer Z. Samples coming from the “true” distribution (dashed line) are tested against the hypothesis that they come from the null distribution (solid line). Bottom, Distributions of Stouffer Z statistics (left) and p values (right) for the true and null distributions in the top panel for 10,000 simulation trials. In each trial, the Stouffer Z value is aggregated over 2000 independent tests, which is about the same as the number of trials aggregated by Dale et al. (2).

FIGURE 4.

Top, Densities of the distributions used in the simulations of Stouffer Z. Samples coming from the “true” distribution (dashed line) are tested against the hypothesis that they come from the null distribution (solid line). Bottom, Distributions of Stouffer Z statistics (left) and p values (right) for the true and null distributions in the top panel for 10,000 simulation trials. In each trial, the Stouffer Z value is aggregated over 2000 independent tests, which is about the same as the number of trials aggregated by Dale et al. (2).

Close modal

In this simulation, the values of Stouffer Z were centered around 3.65 with an SD of 0.97. The corresponding p values had a median value of 1.3 × 10−4. Ten percent of the p values were smaller than 4.2 × 10−7, and ninety percent were smaller than 7.9 × 10−3. The full distributions of both the p values and Z statistics are shown in Fig. 4. These numbers are comparable to those reported in Dale et al. (2) and they show that even a very small amount of misspecification in the null model could lead to very small p values in the hypothesis-testing framework.

Species rely on a variety of pathways for secondary Ab diversification, and the reasons for this variety remain an immunological puzzle. The current understanding is that chickens, rabbits, and some other species use a combination of GCV and SMH during affinity maturation, whereas humans and mice use only SMH. A recent paper by Dale et al. (2) suggests that humans and mice also use extensive templated mutagenesis to diversify their repertoires, which may happen by a mechanism similar to GCV. This finding was based on a novel method, PolyMF, for identifying templated mutagenesis via microhomology, and in this article we studied its properties.

We were interested in the FPR of the PolyMF strategy and developed a novel way of estimating this rate. We ran the algorithm on two sets of mutation observations, derived from mouse and human, respectively, using two corresponding sets of simulated donor genes not present in the subject in question; any inferences of templated mutagenesis in this case must be spurious. The homology structure of these mock donor genes mimicked that of the set of potential templated mutagenesis donors present in the subject. Using this method, we found that although the PolyMF strategy is quite sensitive to templated mutagenesis, it also has an FPR exceeding 50% for the donor tract sizes considered in Dale et al. (2). We used our estimates of the FPRs of the PolyMF strategy along with the naive PolyMF estimates of the rate of templated mutagenesis to obtain upper bounds on the true rate in mice and humans. In each case, we obtain upper bounds ranging from zero to around 10%, although because these are upper bounds, the true rate may also be zero.

Many of the results in Dale et al. (2) were based on findings of a statistically significant deviation from a null model instead of an estimate of the rate of templated mutagenesis. The results of the PolyMF/RandomCheck strategy were presented in terms of a Stouffer Z score, which describes deviation from a simplified null hypothesis about the way the mutations arise. We showed that the observed Stouffer Z values and p values in Dale et al. (2) are not proof of templated mutagenesis, but merely reflect the fact that the specified null model is incorrect, and given thousands of samples, we have enough power to detect even small departures from it.

The same considerations apply to the findings of linkage disequilibrium in the mutated sequences; a statistically significant amount of linkage disequilibrium does not imply templated mutagenesis, and is in fact entirely consistent with the Neuberger model. In particular, if a mutation-generating process satisfies mutation at one site and implies a higher probability of mutation at nearby sites, and not every base has an equal probability of being chosen as the new base for mutation, then sites that are close together will be in linkage disequilibrium, although the mutations are not introduced by templated mutagenesis. One of the potential pathways posited by the Neuberger model to resolve AID lesions has exactly the properties described above. In that pathway, an exonuclease strips out several nucleotides around the AID-induced lesion, and the resulting single-stranded sequence is patched by Pol η, an error-prone polymerase. Thus, a mutation at one position is likely to be accompanied by mutations at neighboring positions because Pol η might have introduced multiple errors in the same patch of nucleotides. In addition, we do not expect Pol η to replace nucleotides uniformly at random because we expect bias in the nucleotide misincorporation rate (16). Accordingly, we expect this pathway to cause linkage disequilibrium between sites, particularly those that are close together. Therefore, the observed significant linkage disequilibrium is not prima facie evidence of templated mutagenesis.

Next, we describe several limitations of the analysis presented in this study to be considered when interpreting the results. First, our bounds depend on our estimate of the FPR being an underestimate of the true FPR. We have two main reasons for believing that this is true, particularly for the human sequences. The first is that our mock donor sets of simulated gpt homologs are slightly smaller than the corresponding IMGT donor gene sets. The mock gpt set based on the mouse IMGT IGHV genes has 462 unique genes, compared with 499 in the mouse IMGT IGHV gene set. The corresponding numbers for the mock gpt set based on the human IMGT IGHV genes and the human IMGT IGHV genes are 404 and 466. The mock gene sets have slightly smaller numbers of genes than the gene sets they were based on because of the simulation method; not all of the branches in the inferred tree actually lead to a mutation in the simulations, and so there are fewer unique genes than leaves in the tree. Our second reason for believing our estimate of the FPR is conservative involves the correspondence between diversity in variable regions and mutation hotspots; in real Ab sequences, mutations are more likely to occur in the CDRs, and there is also more variability in the IGHV genes in the CDRs. This is not the case for the gpt sequences; as demonstrated in (4), there are mutation hotspots in the gpt genes as well, but these hotspots do not correspond to regions of higher variability in the mock gpt gene sets. Because mutations are more likely to occur in regions with more templated mutagenesis templates in the Ab gene sequences than in the gpt sequences, we believe that the FPR estimate based on the gpt sequences is lower than the true FPR.

We emphasize that we have obtained bounds on, not estimates of, the rate of templated mutagenesis, and that these bounds depend on assumptions about the sensitivity of PyMF and on our estimate of the FPR being conservative. For humans, the quality of the bound also depends on how well our estimate of the PyMF FPR translates from mice to humans. We were only able to estimate the FPR of PyMF in the mouse because of the transgenic system set up in (4), and that estimate translates to humans to the extent that the SMH processes of the two species coincide. We expect the processes to be similar enough that the FPR is valid for both species, but any differences that do exist mean that the bounds for mice are more reliable than those for humans.

It is still possible that templated mutagenesis occurs at a low rate. If so, characterizing its properties is important because even if templated mutagenesis events occur infrequently, they could increase the rate of certain mutation patterns immensely. This has important implications for estimation procedures (phylogenetic estimation, germline annotation, etc.) as well as translational applications such as rational vaccine design. Thus, we do not view our work as closing the book on the interesting possibility that templated mutagenesis could play a role in B cell diversification.

We are grateful to Trevor Bedford, Christian Busse, Gordon Dale, and Joshy Jacob for discussions that improved this analysis and to Leng-Siew Yeap and Duncan Ralph for help analyzing the gpt data. Christian Busse provided helpful comments on the manuscript.

This work was supported by National Institutes of Health Grants R01 GM113246, R01 AI120961, U19 AI117891, and R01 AI146028. The research of F.A.M. was supported in part by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation.

The online version of this article contains supplemental material.

Abbreviations used in this article:

AID

activation-induced cytidine deaminase

FPR

false positive rate

GCV

gene conversion

IMGT

ImMunoGeneTics

ORF

open reading frame

PolyMF

PolyMotifFinder

PR

positive rate

PyMF

PyMotifFinder

SHM

somatic hypermutation

TPR

true positive rate

VB1-8

V region of the 4-hydroxy-3-nitrophenylacetyl–binding B1-8 Ab.

1
Methot
,
S. P.
,
J. M.
Di Noia
.
2017
.
Molecular mechanisms of somatic hypermutation and class switch recombination.
Adv. Immunol.
133
:
37
87
.
2
Dale
,
G. A.
,
D. J.
Wilkins
,
C. D.
Bohannon
,
D.
Dilernia
,
E.
Hunter
,
T.
Bedford
,
R.
Antia
,
I.
Sanz
,
J.
Jacob
.
2019
.
Clustered mutations at the murine and human IgH locus exhibit significant linkage consistent with templated mutagenesis.
J. Immunol.
203
:
1252
1264
.
3
Bornholdt
,
Z. A.
,
H. L.
Turner
,
C. D.
Murin
,
W.
Li
,
D.
Sok
,
C. A.
Souders
,
A. E.
Piper
,
A.
Goff
,
J. D.
Shamblin
,
S. E.
Wollen
, et al
.
2016
.
Isolation of potent neutralizing antibodies from a survivor of the 2014 Ebola virus outbreak.
Science
351
:
1078
1083
.
4
Yeap
,
L.-S.
,
J. K.
Hwang
,
Z.
Du
,
R. M.
Meyers
,
F.-L.
Meng
,
A.
Jakubauskaitė
,
M.
Liu
,
V.
Mani
,
D.
Neuberg
,
T. B.
Kepler
, et al
.
2015
.
Sequence-Intrinsic mechanisms that target AID mutational outcomes on antibody genes.
Cell
163
:
1124
1137
.
5
Boettiger
,
C.
2015
.
An introduction to Docker for reproducible research.
Oper. Syst. Rev.
49
:
71
79
.
6
Vander Heiden
,
J. A.
,
G.
Yaari
,
M.
Uduman
,
J. N. H.
Stern
,
K. C.
O’Connor
,
D. A.
Hafler
,
F.
Vigneault
,
S. H.
Kleinstein
.
2014
.
pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires.
Bioinformatics
30
:
1930
1932
.
7
Retter
,
I.
,
C.
Chevillard
,
M.
Scharfe
,
A.
Conrad
,
M.
Hafner
,
T.-H.
Im
,
M.
Ludewig
,
G.
Nordsiek
,
S.
Severitt
,
S.
Thies
, et al
.
2007
.
Sequence and characterization of the Ig heavy chain constant and partial variable region of the mouse strain 129S1.
J. Immunol.
179
:
2419
2427
.
8
Edgar
,
R. C.
2004
.
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res.
32
:
1792
1797
.
9
Price
,
M. N.
,
P. S.
Dehal
,
A. P.
Arkin
.
2010
.
FastTree 2--approximately maximum-likelihood trees for large alignments.
PLoS One
5
: e9490.
10
Spielman
,
S. J.
,
C. O.
Wilke
.
2015
.
Pyvolve: a flexible python module for simulating sequences along phylogenies.
PLoS One
10
: e0139047.
11
Goldman
,
N.
,
Z.
Yang
.
1994
.
A codon-based model of nucleotide substitution for protein-coding DNA sequences.
Mol. Biol. Evol.
11
:
725
736
.
12
Ralph
,
D. K.
,
F. A.
Matsen
IV
.
2016
.
Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation.
PLOS Comput. Biol.
12
: e1004409.
13
Wang
,
Y.
,
K. J. L.
Jackson
,
W. A.
Sewell
,
A. M.
Collins
.
2008
.
Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error.
Immunol. Cell Biol.
86
:
111
115
.
14
Bates
,
D.
,
M.
Mächler
,
B.
Bolker
,
S.
Walker
.
2015
.
Fitting linear mixed-effects models using lme4.
J. Stat. Softw.
67
:
1
48
.
15
R Core Team
.
2017
.
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing
,
Vienna, Austria
.
16
Rogozin
,
I. B.
,
Y. I.
Pavlov
,
K.
Bebenek
,
T.
Matsuda
,
T. A.
Kunkel
.
2001
.
Somatic mutation hotspots correlate with DNA polymerase eta error spectrum.
Nat. Immunol.
2
:
530
536
.

The authors have no financial conflicts of interest.

Supplementary data