Aberrant targeting of the enzyme activation-induced cytidine deaminase (AID) results in the accumulation of somatic mutations in ∼25% of expressed genes in germinal center B cells. Observations in Ung−/−Msh2−/− mice suggest that many other genes efficiently repair AID-induced lesions, so that up to 45% of genes may actually be targeted by AID. It is important to understand the mechanisms that recruit AID to certain genes, because this mistargeting represents an important risk for genome instability. We hypothesize that several mechanisms combine to target AID to each locus. To resolve which mechanisms affect AID targeting, we analyzed 7.3 Mb of sequence data, along with the regulatory context, from 83 genes in Ung−/−Msh2−/− mice to identify common properties of AID targets. This analysis identifies three transcription factor binding sites (E-box motifs, along with YY1 and C/EBP-β binding sites) that may work together to recruit AID. Based on previous knowledge and these newly discovered features, a classification tree model was built to predict genome-wide AID targeting. Using this predictive model, we were able to identify a set of 101 high-interest genes that are likely targets of AID.

Somatic hypermutation (SHM) occurs in germinal center (GC) B cells, resulting in the introduction of point mutations into Ig genes. Although SHM provides an important source of genetic diversity, capable of producing specific Abs for quickly evolving pathogens, the process also poses a severe threat to genomic stability. Activation-induced cytidine deaminase (AID), the enzyme that deaminates cytosines to initiate SHM, can act outside of the Ig locus. In a previous sequencing study, we showed that >45% of expressed genes in GC B cells are targeted by AID in Ung−/−Msh2−/− double-knockout (dKO) mice, where the absence of DNA repair reveals the “footprint” of AID. Even among genes that were targeted by AID, this study revealed a wide range of mutation frequencies observed across 83 genes (1). In this study, we seek to address two basic questions that are raised by the former study: Why are some genes targeted by AID, whereas others are not? and How do the genes targeted by AID accumulate different levels of mutation? The main hypothesis we pursue is that sequence features of each gene are responsible for this differential targeting.

The current model of SHM proposes two phases (2). In the first phase, AID converts a cytosine (C) residue to a uracil (U) in ssDNA created during the process of transcription, which, if left unrepaired, leads to a C to T (thymine) transition mutation when the DNA is replicated for cell division (3). The second phase of SHM begins when DNA repair mechanisms attempt to remove the uracil lesion from the DNA. The repair of the uracil happens via two pathways: base excision repair with UNG and mismatch repair facilitated by the MSH2/MSH6 complex, both of which are capable of working in an error-prone fashion and contributing to the observed mutation frequency (4). In the dKO setting, the second phase of SHM is unavailable, thus revealing the underlying “footprint” of AID, where the expectation is primarily C → T transition mutations. We previously sequenced 83 non-Ig genes from dKO mice on average 70 times per gene over a 1-kb region downstream of the transcription start site (TSS) (1). Mutation frequencies varied widely, ranging from <1 × 10−5 to 116.1 × 10−5 mutations/bp, but they were highly predictable for the same gene across samples from multiple mice. In the same system, sequencing of an IgH positive control, specifically the VhJ558-Jh4 intron 3′ flanking region (hereafter referred to as the Jh4 intron), found a mutation frequency of 9.96 × 10−3 mutations/bp. Each gene represents a unique genomic context in which to explore the various properties associated with AID targeting.

Differential AID activity in non-Ig genes may be influenced by multiple underlying mechanisms. A higher transcription rate may be associated with an increased mutation frequency. Genes with a higher mutation frequency may contain a large number of AID hotspots, such as WRC (W = A/T; R = A/G), and/or few AID coldspots, such as SYC (S = C/G; Y = C/T), where the C is the mutated position (5, 6). Clonal recruitment of AID to certain genes may lead to an increased mutation frequency (7). Finally, the genes for which high mutation frequencies are observed may share functional elements, like transcription factor binding sites, which recruit AID to the locus for mutation. In this study, we first examine each of the possible mechanisms independently and then develop an integrated model to predict targeting of AID in the non-Ig genes.

Genes were selected for sequencing in our previous study based on multiple criteria (1): expression of the gene determined through microarray studies in both mouse and human B cells (811), because it is expected that expressed genes are undergoing transcription, a requirement for AID targeting (3, 12); a well-defined TSS and chromosomal location; and a high level of homology between the mouse and human genes for the first exon. Genes known to be involved in the immune response, tumorigenesis, cell proliferation, or apoptosis or known to undergo chromosomal translocations or deletions, especially in B cell tumors, were given preference. Finally, the set of genes chosen was also selected to provide good coverage across the set of mouse chromosomes. A total of 83 non-Ig genes was sequenced from Peyer’s patches of Ung−/−Msh2−/− dKO mice, in addition to the positive control of the Ig Jh4 intron. Each gene was sequenced, as previously described, in a 1-kb region directly downstream of the TSS (1). Mutations were determined through use of the neighborhood quality standard algorithm (13), using previously defined criteria (1). The resulting mutation frequency of each gene in the dKO setting was compared with the background mutation rate determined by sequencing 31 genes from Aicda−/− knockout (KO) mice using the Fisher exact test. False-discovery rate (FDR) q-values were determined for each gene (λ = 0) and used as the basis for ranking the genes and determining group assignments: group A = q-value < 10−6; group B = 10−6 ≤ q-value < 10−2; and group C = q-value ≥ 10−2.

BALB/c background mice carrying IgH-transgenic alleles and targeted deletions of the endogenous JH locus (VH186.2-Tg JH KO and V23-Tg JH KO) were immunized i.p. with 50 μg NP25-chicken γ globulin precipitated in alum, as described (9). Fourteen days later, splenocytes were isolated and stained with fluorescently labeled peanut agglutinin (Vector Laboratories), anti-λ (goat polyclonal; SouthernBiotech), and anti-CD45R/B220 (RA3-6B2) to identify λ+ PNA+ B220+ GC B cells. Cells were purified on a FACSAria cell sorter (BD), as previously described (14). Live/dead discrimination was accomplished based on forward/side scatter and propidium iodide exclusion. Cells were kept at 4°C in buffers containing 0.05% sodium azide to minimize alterations in gene expression. All animal immunizations and experiments were approved by the Yale Institutional Animal Care and Use Committee.

The mRNA expression levels among sorted cells were determined using Affymetrix Mouse 430 2.0 microarray and previously described methods (14, 15). Annotation files were obtained directly from Affymetrix (v. 31) and were used to associate the gene symbols and National Center for Biotechnology Information Reference Sequence (RefSeq) (16) identifiers with each probe set. The samples were normalized using GC Robust Multi-array Average algorithm. The average expression value reported for each gene was determined by taking the log2 of the average expression values across all probes associated with each RefSeq identifier for all samples. These microarray data are available through the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260.

Mice carrying an mVH186.2-Tg IgH Tg allele and targeted deletions of the JH locus (17) were immunized with NP25-chicken γ globulin, and spleens were removed 12 d postimmunization. B cells were enriched by an EasySep customized B cell negative selection kit (without anti-CD43 Ab; Stem Cell Technologies). Cells were stained with anti-mouse B220–Cy-Chrome, anti-mouse CD19-PE, and anti-mouse GL7-FITC (all from Becton Dickinson Biosciences) Abs and sorted on a DakoCytomation MoFlo cell sorter at the Yale Immunobiology Cell Sorting Facility.

Dynabeads Protein A (Invitrogen) were incubated with RNA polymerase II Ab N20 (Santa Cruz Biotechnologies) or normal rabbit serum. Excess Ab was washed away. Then Ab-bound beads were incubated with chromatin from 20 million sorted spleen GC B cells (previously cross-linked with 1% HCHO and then sonicated to shear the DNA fragments to 100–300 bp) at 4°C overnight. Beads were washed, chromatin was eluted, and the cross-linking was reversed. DNA was purified, precipitated, and redissolved in TE buffer. Precipitated DNA was quantified using a PicoGreen dsDNA quantification kit (Molecular Probe). A total of 200 ng chromatin immunoprecipitation (ChIP) DNA (from 40 million cells) ends was repaired using polynucleotide kinase and Klenow enzyme, followed by treatment with Taq polymerase to generate a protruding 3′ A base used for adaptor ligation. Following ligation of a pair of Solexa adaptors to the repaired ends, the ChIP DNA was amplified using the adaptor primers for 17 cycles, and the fragments around 220 bp (mononucleosome + adaptors) were isolated from agarose gel. The purified DNA was used directly for cluster generation and sequencing analysis using the Solexa 1G Genome Analyzer, following the manufacturer’s protocols.

The resulting 25-bp reads were aligned against the mouse genome (mm8) using Efficient Local Alignment of Nucleotide Data (Illumina), allowing up to two mismatches against the reference. The reads kept for further analysis had to map uniquely to the genome, and a maximum of three copies of the same read was kept to reduce PCR artifact. The reads were then converted to browser extensible data format. The TSSs for all genes were identified using the RefSeq identifier (16) and were obtained through the UCSC Genome Table Browser (18) for the mouse mm8 genome build. For each group, genes were aligned by the TSS, and the maximum peak of overlapping reads was determined for each gene in the region 100 bases around the TSS. Differences among the three groups were determined using the Kruskal–Wallis test (KW), and the Mann–Whitney U test (MWU) was used for determining differences among pairs of groups.

Mutations and all positions sequenced that occurred at C on either the top or bottom strands were considered for analysis if the residue and the two residues directly upstream passed the neighborhood quality standard criteria of Liu et al. (1). Both mutations and positions that did not meet this criterion were excluded from the analysis, resulting in an observed mutation frequency for each gene that is slightly different from the previously published figures. Mutability was calculated in a manner similar to that of Shapiro et al. (19), restricted to the case when the mutation is a C residue in the third position (or G in the first position for the reverse complement). To do so, the frequency of mutations in each of the 16 possible dinucleotides upstream of the mutated C residue was tabulated for each sequence and normalized by the total number of mutations in that sequence. The same was done for all sequenced C residues to determine the background. The mutability for each sequence was calculated by dividing by the normalized frequency of the dinucleotide motifs for the mutated residues by the background of the sequenced region. The overall mutability for a set of sequences was calculated as the mean mutability in individual sequences weighted by the total number of mutations in each sequence. Error bars were calculated by bootstrapping the original set of sequences 10,000 times. The p values were calculated for each motif by comparing the bootstrapped mutability values and the observed mutability values for each sample. An aggregated p value for all motifs was computed using the Fisher method for combining p values of individual tests.

For each gene, the region sequenced was analyzed for occurrences of the AID hotspot WRC and the AID coldspot SYC on both the top and bottom strands. The total number of hotspots found in the sequenced region was normalized by the number of C and G residues in the sequence to account for the unique composition of each gene.

The number of mutations from each sequence was determined for each of the 15 genes in group A and for the Jh4 intron. The negative binomial (NB) distribution was fit to each gene using the fitdistr method available through the MASS package for R (20), and a χ2 goodness-of-fit test was applied. In the cases in which the NB did not fit the data (p < 0.05), a zero-inflated NB (ZI-NB) distribution was fit as defined through the emdbook package for R (21) and tested using a χ2 goodness-of-fit test.

The region 2 kb around the TSS for each sequenced gene was calculated based on RefSeq gene definitions (accessed July 2009). The corresponding aligned sequences for both mouse (mm9) and human (hg18) were obtained from the multiple sequence alignment, multiz30way (22), for the mouse genome available from the UCSC Genome Bioinformatics Site (23). Repetitive sequences were masked from both the mouse and human sequences using RepeatMasker (24). With regard to the identification of E-box core sequences, two consensus sequences defined the site: CASSTG (where S = C or G) and CANNTG (where n = A, C, G or T). In addition, the MATCH (25) program was used to scan the sequences for hits of TRANSFAC-defined binding sites (26) (including all high-quality vertebrate binding sites used for v. 2010.3) using the cutoff values defining a hit to minimize the sum of false positives and false negatives. Only hits that occurred in the same position in the mouse and human aligned sequences were retained for analysis. Note that this step of evolutionary conservation does not impose upon the model an assumption that mutation spectrum from AID targeting in the species used for the conservation will correlate perfectly, or at all, with mouse.

Gene set enrichment analysis (GSEA) was performed using the GSEA Preranked tool with classic weighting within the GSEA tool (27). Genes from the dKO setting were ranked according to the FDR q-value, comparing the dKO mutation frequency with the background mutation frequency from Liu et al. (1). A total of 10,000 permutations was used to determine the null distribution of enrichment scores for each gene set and to calculate the p value. The FDR q-value corrects for multiple hypothesis testing and is influenced by the total number of gene sets tested. In the case of E-boxes and YY1 sites, the total number of gene sets tested was 6; in the general test of all other transcription factor binding sites (including C/EBP-β), the total number of gene sets tested was 705.

Binding site locations were compared between sets of genes using a Bonferroni-corrected one-tailed MWU. For each gene, the binding site closest to the TSS (in the region ± 7.5 kb) was selected for analysis. Two sets of control genes were included. The first set of control genes was defined as the top 10% of genes with the highest average expression (across all probe sets and samples) in the GC B cell microarray data previously described. The second set of control genes included the set of 17 nonexpressed genes that was sequenced in the wild-type setting (1).

For the pair-wise analysis of transcription factor binding site locations performed, a score was computed for each possible pair of binding sites between two transcription factors (Eq. 1),

Colocation score=d(S)+d(T),
(1)

where d(S) is the distance between the binding sites for each factor, and d(T) is the distance of the binding site furthest from the TSS. Genes without a predicted binding site in the analyzed region defined were assigned a distance equal to the maximal distance from the TSS. The genes were again split into groups A, B, and C and compared against the two sets of control genes from above. In computing the trilocation score of three factors, d(S) was defined as the maximal distance between all three sites. A Bonferroni-corrected one-tailed MWU was used to determine the differences between the groups of genes and the control sets.

To determine the significance of the variables that were used in this analysis, the measurements were obtained for all group A, B, and C genes for all independent attributes measured across all genes. In addition, the total number of binding sites for each significant factor within the region 2 kb around the TSS was included for analysis. The cforest algorithm (28) (available through the party package for R) was used to generate a random forest of classification trees to determine the variable importance using an unbiased approach. A forest of 1000 trees was built with the following parameters specified: minimum samples in a node for splitting = 15, minimum number of samples in a leaf = 5, and number of variables randomly selected at each node = 5. Variable importance was calculated from the random forest, defined by the mean decrease in prediction accuracy when values are permuted for each variable. Important variables were defined as having a variable importance score greater than the absolute value of the least important variable (29).

The classification tree model was generated using the rpart package for R (30). The model was constructed using the default parameters, with the exception of the following parameters: the minimum number of samples in each node to allow a split was set to 9, the splitting index used was “information,” and the loss matrix used was:

[0315100102010],

where the columns correspond to the true group A, B, or C and the rows correspond to predicting group A, B, or C. Thus, a gene from group B predicted to be in group A results in a loss (or penalty) for misclassification of 10. Ten-fold cross-validation was performed during the construction of the model and used to prune the full tree by choosing the tree that minimized the cross-validated error to create the final model (31).

Transcription is required for SHM at the Ig locus, and we previously observed that nontranscribed genes did not accumulate mutations in the genome-wide setting (1). We asked whether, for expressed genes, there is a correlation with mutation frequency. We previously determined the mutation frequency for 83 non-Ig genes isolated from Ung−/−Msh2−/− dKO Peyer’s patch B cells and classified these into three groups: A, B, and C (1) (see 2Materials and Methods). In this setting, group A genes had the highest mutation frequency, percentage of C → T transition mutations, and hotspot focusing, followed by genes in groups B and C, with decreases in each subsequent group, respectively. The relative mRNA expression of these genes was compared in GC B cells; we found a minimal correlation between a gene’s log2 mRNA expression level, as measured by Affymetrix microarray, and its mutation frequency (r = +0.079) (Supplemental Fig. 1A). When genes were grouped based on their p value for AID targeting (group A and B genes accumulate mutations that are significantly different from the background and show strong signs of AID targeting, whereas genes in group C are minimally mutated compared with the background), we observed a positive trend between average log2 expression and mutation frequency (Fig. 1A). However, there is not a significant difference between these groups (p = 0.064, KW). Thus, although AID-targeted genes seem to have higher average log2 mRNA expression at the group level, this was not statistically significant, and the correlation between mutation and steady-state transcript levels is very weak.

FIGURE 1.

Transcription is correlated weakly with mutation frequency. (A) Normalized mRNA expression levels (averaged over all probe sets and microarray samples) for genes in groups A, B, and C. A set of 17 nonexpressed genes (NE) is included as a control. (B) The maximum peak for RNA polymerase II binding based on ChIP-Seq of GC B cells for each group of genes, including the control of nonexpressed genes (NE). (C) Comparison of the maximum peak of RNA polymerase II and the observed mutation frequency for the 83 individual genes sequenced in the dKO setting.

FIGURE 1.

Transcription is correlated weakly with mutation frequency. (A) Normalized mRNA expression levels (averaged over all probe sets and microarray samples) for genes in groups A, B, and C. A set of 17 nonexpressed genes (NE) is included as a control. (B) The maximum peak for RNA polymerase II binding based on ChIP-Seq of GC B cells for each group of genes, including the control of nonexpressed genes (NE). (C) Comparison of the maximum peak of RNA polymerase II and the observed mutation frequency for the 83 individual genes sequenced in the dKO setting.

Close modal

Because steady-state mRNA levels are not a direct measure of the rate of transcription and performing a nuclear run-on assay would be difficult with GC B cells (32), RNA polymerase II ChIP followed by massively parallel sequencing (ChIP-Seq) was performed on GC B cells to better address the relationship between transcription rate and mutation frequency. For each gene, transcription was quantified by the maximum peak of overlapping polymerase II tags in the region 100 bp around the TSS. A statistically significant difference was found among the genes in groups A, B, and C (p = 0.04, KW), indicating an unequal distribution of polymerase II (Fig. 1B). A further examination of the polymerase II levels among the groups revealed that group A had a significantly higher amount of polymerase II compared with both group B (p = 0.0045, MWU) and group C (p = 4.6 × 10−4). As a control, group A, B, and C genes all had significantly higher polymerase II levels compared with the set of 17 nonexpressed genes that were also analyzed in the previous study (1) (p values: group A = 8.2 × 10−7; group B = 9.0 × 10−5; group C = 5.1 × 10−4, MWU). Thus, AID targeting is correlated positively with higher levels of polymerase II near the TSS of a gene. However, as with mRNA expression, polymerase II levels are only weakly predictive of mutation frequency at the individual gene level (Fig. 1C).

AID preferentially targets WRC/GYW motifs, so it is possible that genes with more of these motifs may accumulate a greater number of mutations independent of AID recruitment to the locus. To test this hypothesis, we first confirmed that the hot/cold-spot targeting of AID, which was previously defined at the Ig locus, was conserved in the genome-wide setting. It was shown previously that the observed mutations, especially in the group A genes, were biased toward the SHM hotspot WRCY/RGYW motif (1). To extend this analysis to cover the full spectrum of DNA motifs, focusing on the AID hotspot, and to increase coverage across all motifs given the relatively low number of mutations in the genome-wide dataset, the relative mutability of each of the 16 possible dinucleotide combinations in the motif NNC/GNN was calculated. Under the null hypothesis of no hot/cold-spot targeting, the mutability for each motif will be focused around a value of 1. However, we observed significant deviations from this null hypothesis for both the group A genes and the positive control Jh4 intron. In each case, the same general trend is observed: trinucleotide motifs representing the classic hotspot WRC have a higher mutability than random targeting, whereas those representing the classic coldspot SYC have a lower mutability than random targeting (Fig. 2A). In addition, inspection of the bootstrap interval for the mutability of each motif shows a high overlap between the two datasets. Thus, we conclude that AID displays a range of hot/cold-spot preferences in the genome-wide setting that is consistent with those at the Ig locus.

FIGURE 2.

Genome-wide AID hot/cold-spots mirror the Ig locus but do not influence overall targeting. (A) The relative mutability of each DNA triplet motif ending in C based on group A genes and the Jh4 intron. The error bars represent the 95% confidence interval, as determined through bootstrapping of the original sequences. (B) Normalized WRC/GYW hotspot frequency for each group of genes. (C) Normalized SYC/GRS coldspot frequency for each group of genes.

FIGURE 2.

Genome-wide AID hot/cold-spots mirror the Ig locus but do not influence overall targeting. (A) The relative mutability of each DNA triplet motif ending in C based on group A genes and the Jh4 intron. The error bars represent the 95% confidence interval, as determined through bootstrapping of the original sequences. (B) Normalized WRC/GYW hotspot frequency for each group of genes. (C) Normalized SYC/GRS coldspot frequency for each group of genes.

Close modal

To test whether differences in the mutability of each gene could account for differential mutation accumulation, we compared the hot-/coldspot densities to the mutation frequency across the set of 83 genes from the dKO setting. Supplemental Fig. 1B shows a minimal negative relationship between hotspot frequency and mutation frequency (r = −0.16), and this was not significant when the genes were split into groups A/B/C (p = 0.33, KW) (Fig. 2B). The AID coldspot SYC/GRS displayed a weak positive relationship with the mutation frequency (r = +0.2118, Supplemental Fig. 1C), but once again this was not statistically significant (p = 0.31) for groups A/B/C (Fig. 2C). We conclude that the density of AID hot-/coldspots does not account for the differential targeting of AID to group A/B/C genes.

Clonal recruitment of SHM to the Ig locus was suggested based on transgene studies (7). According to this model, once a particular gene in a cell is targeted, the gene remains marked as accessible to SHM in the cell’s progeny. If true, the differential ability of genes to initiate clonal recruitment could contribute to the wide range of observed mutation frequencies in the genome-wide setting. We first sought to confirm the idea of clonal recruitment at the Ig locus by analyzing the distribution of mutation counts in the Jh4 intronic region. If mutation acts randomly among all of the cells in a population, the number of mutations per sequence should fit an NB distribution (33, 34). In contrast, the process of clonal recruitment implies that only some of the cells will accumulate mutations efficiently, whereas others that fail to recruit AID will not accumulate mutations at all. In this case, we propose that the number of mutations per sequence should fit a ZI-NB distribution, which has the capability of modeling the distribution of the number of mutations per sequence for cells that have recruited AID separately from those with no mutations present. Fig. 3A clearly shows that the NB distribution is a poor model for the observed mutations per sequence in the Jh4 intron (p = 2.6 × 10−8, χ2 test), whereas the ZI-NB distribution is appropriate (p > 0.05) (35, 36). Thus, these data support the idea of clonal recruitment of AID at the Ig locus. In contrast, the NB distribution was an appropriate fit to all genes in group A (p > 0.05) (Fig. 3, Supplemental Fig. 2). This was true for both genes with a low mutation frequency, like Eif4a2 (Fig. 3E), as well as the most mutated genes, like Myc, Bcl6, and H2afx (Fig. 3B–D). Thus, unlike the Ig locus, we find no evidence for clonal recruitment of AID in the genome-wide setting, and this process is unlikely to play a role in the differential accumulation of mutations.

FIGURE 3.

AID is clonally recruited to the Ig locus but not to non-Ig genes. (A) The frequency distribution of the total number of mutations per Jh4 intron sequence was fit to both the NB (○) and the ZI-NB (▪). A similar analysis was carried out for group A genes, including Myc (B), Bcl6 (C), H2afx (D), and Eif4a2 (E).

FIGURE 3.

AID is clonally recruited to the Ig locus but not to non-Ig genes. (A) The frequency distribution of the total number of mutations per Jh4 intron sequence was fit to both the NB (○) and the ZI-NB (▪). A similar analysis was carried out for group A genes, including Myc (B), Bcl6 (C), H2afx (D), and Eif4a2 (E).

Close modal

We previously demonstrated that evolutionarily conserved E-box motifs (CASSTG), which bind various E-proteins, including E12 and E47, were enriched in the region 2 kb around the TSS for highly mutated genes (1). Extending these results, we also found moderate enrichment for a more general form of the motif (CANNTG) (GSEA p value = 0.006; FDR q-value = 0.061). This association between E-box motifs and highly mutated genes is further supported by a shift in the location of the motif in group A genes toward the TSS (CASSTG, p = 0.0089; CANNTG, p = 0.0092, MWU). Previous studies showed that binding sites located closer to the TSS have a higher probability of being true regulatory sites (37, 38). These results strongly support a role for E-boxes in recruiting AID.

We next tested whether YY1 binding sites were associated with AID targeting, because this transcription factor was recently shown to be a regulator of the GC gene-expression program (39). Four separate definitions exist for a YY1 binding site in the TRANSFAC database, where each defined YY1 binding site has a different length and slightly different consensus motif. Two of these four binding site definitions (YY1_Q6 and YY1_Q6_02) were significantly enriched in the promoter regions of group A genes (GSEA p < 0.05; FDR q-value < 0.1). Further supporting the involvement of YY1, these binding sites tended to be farther from the TSS in group C genes (Fig. 4). This shift in location was significant for the YY1_Q6_02 binding site definition (p = 0.014), and we refer to this definition as a YY1 binding site throughout the rest of this article. Thus, along with E-box motifs, YY1 motifs are associated with the recruitment of AID.

FIGURE 4.

E-box, YY1, and C/EBP-β binding sites exhibit a nonrandom location distribution with respect to the TSS of group A genes. Each gene is represented by the binding site that is closest to the TSS within the window 7.5 kb in either direction. For groups A, B and C, comparisons were made against two control groups: a set of highly expressed genes in GC B cells (HC) and a set of nonexpressed genes (NE). A statistically significant shift in binding site locations is indicated by the group’s name at the top of the panel (p < 0.05, versus HC, Bonferroni-corrected MWU). No significant difference was detected for any group compared with the set of NE genes.

FIGURE 4.

E-box, YY1, and C/EBP-β binding sites exhibit a nonrandom location distribution with respect to the TSS of group A genes. Each gene is represented by the binding site that is closest to the TSS within the window 7.5 kb in either direction. For groups A, B and C, comparisons were made against two control groups: a set of highly expressed genes in GC B cells (HC) and a set of nonexpressed genes (NE). A statistically significant shift in binding site locations is indicated by the group’s name at the top of the panel (p < 0.05, versus HC, Bonferroni-corrected MWU). No significant difference was detected for any group compared with the set of NE genes.

Close modal

To identify additional transcription factors influencing AID targeting, we screened the remaining set of 705 high-quality vertebrate TRANSFAC transcription factor binding sites. This analysis identified an additional factor associated with AID targeting: evolutionarily conserved binding sites for C/EBP-β (CCAAT-enhancer binding protein, β), also known as NF-IL6 (p < 0.001; FDR q-value = 0.094). As with both E-box and YY1 sites, the location of C/EBP-β binding sites tends to drift away from the TSS with the decrease in mutation frequency (Fig. 4), and this is significant for group C genes (p = 0.0086). At no point in our analysis did we find enrichment for binding sites inhibiting AID targeting. Altogether, our transcription factor binding site screen and location analysis identified three binding sites that could be involved in the recruitment of AID: E-boxes, YY1, and C/EBP-β.

Because we showed that evolutionarily conserved binding sites for E-boxes, YY1, and C/EBP-β have a nonrandom location distribution, we set out to test whether these sites are colocated near one another. A colocation score was defined that combines the distance between pairs of binding sites with the distance of the pair from the TSS of the gene, such that a low colocation score reflects a pair of binding sites that are located close to one another and close to the TSS (see 2Materials and Methods). Using this scoring metric, we found that CASSTG E-box motifs were colocated both with YY1 and C/EBP-β binding sites in group A genes at a statistically significant level compared with control genes (p < 0.01, Fig. 5). In addition, these two pairs of sites are significantly colocated in group A compared with either group B genes or group C genes (p < 6.5 × 10−4 and 6.0 × 10−3, respectively). Although a similar trend was observed in the colocation scores of the CANNTG E-box with both YY1 and C/EBP-β, the colocation signal is much stronger for the CASSTG E-box motif.

FIGURE 5.

E-box, YY1, and C/EBP-β binding sites are colocated in group A genes. For each pair of enriched transcription factors, the colocation score combining the distance between factors and the distance to the TSS was calculated for each gene. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in colocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.

FIGURE 5.

E-box, YY1, and C/EBP-β binding sites are colocated in group A genes. For each pair of enriched transcription factors, the colocation score combining the distance between factors and the distance to the TSS was calculated for each gene. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in colocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.

Close modal

In addition to being colocated with E-box motifs, YY1 and C/EBP-β sites were found to be colocated with one another in both group A and B genes (Fig. 5, far right panel), and both groups have significantly lower colocation scores compared with group C genes (p = 5.7 × 10−4 and 5.0 × 10−4, respectively). Interestingly, the colocation scores from genes in groups A and B are indistinguishable from one another (p > 0.8). Because the strong colocation of YY1 and C/EBP-β binding sites in groups A and B can distinguish genes that mutate from nonmutating genes (group C), it seems that more information is required to explain the differential level of mutation observed between the two groups.

The observation that both YY1 and C/EBP-β binding sites are colocated with E-box motifs suggests that the three binding sites may be located together, perhaps forming a cis-regulatory module located close to the TSS of the gene. To characterize the organization of the three sites together, we developed a trilocation score that incorporates the maximal distances between the three binding sites and the location from the TSS of the gene in a manner similar to the colocation score (see 2Materials and Methods). We find that both group A and group B have significantly lower trilocation scores compared with control genes (p < 0.05, Fig. 6). In addition, both groups A and B have significantly lower trilocation scores compared with group C (p = 1.7 × 10−4 and p = 9.3 × 10−4, respectively) but are not distinguishable from one another (p = 0.07). These results suggest that E-box motifs, YY1, and C/EBP-β binding sites form a regulatory module that is capable of recruiting AID to target the gene for mutation.

FIGURE 6.

E-boxes (CASSTG), C/EBP-β and YY1 binding sites trilocate in AID targets. For each gene, the trilocation score was calculated by combining the distance between all three sites and the distance from the TSS. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in trilocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.

FIGURE 6.

E-boxes (CASSTG), C/EBP-β and YY1 binding sites trilocate in AID targets. For each gene, the trilocation score was calculated by combining the distance between all three sites and the distance from the TSS. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in trilocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.

Close modal

Although none of the features investigated above could individually separate genes in groups A, B, and C (note the high overlap in box plots for each group), we hypothesized that a combination of these features could accurately predict AID targeting. As a first step, importance analysis based on random forests was used to select a subset of variables to be included in the final model. A list of variables included in this analysis is shown in Supplemental Fig. 3. Note that some colocation scores were excluded to reduce redundancy. Variable importance analysis (see 2Materials and Methods) identified seven features as significant for separating group A, B, and C genes: CASSTG E-box and YY1 colocation, CASSTG E-box and C/EBP-β colocation, YY1 and C/EBP-β colocation, number of C/EBP-β sites (± 2 kb around TSS), location of the C/EBP-β site closest to the TSS, and the maximum polymerase II peak (± 100 bp around the TSS) (Supplemental Fig. 3).

To predict whether each gene belongs to group A, B, or C, the seven variables identified as important were used to create a classification tree. In generating this model, we strongly penalized the misclassification between groups A and C. The penalty for misclassification of group B genes was deemed less significant for two reasons. First, the original classification of a gene in this group could change based on where the initial cut points between groups was drawn by Liu et al. (1). Second, group B genes are still considered to be targeted by AID, albeit at a lower level than group A genes; therefore, misclassification would not be as severe for this middle group. The full model generated from the data was pruned to an appropriate size based on the 10-fold cross-validation error, producing a final model containing five terminal nodes with four decision points (Fig. 7A). For the 83 genes used to build the model, the overall misclassification rate was 25%, with the majority of these misclassified genes from group B (as expected by the penalty setup). Importantly, the misclassification rate between groups A and C was only 8%. It was determined through cross-validation that the sensitivity of the model for correctly predicting a group A gene was 55%, with a specificity of 84% in predicting non–group A genes. Consequently, this model allows for the accurate separation of genome-wide AID targets (groups A and B) compared with nontargeted genes (group C).

FIGURE 7.

Classification tree model to predict genome-wide targets of AID. (A) Classification tree to predict AID targeting for individual genes. For each decision node in the tree (gray circle), the gene proceeds down the left branch if it satisfies the decision condition. The leaves of the tree indicate the predicted group (A, B, or C), along with the number of genes falling into this group from the 83 genes sequenced in the dKO setting (group A/group B/group C). For example, genes in leaf #1 are predicted to be in group A, and this leaf has a total of nine genes from the training data: six group A genes, three group B genes, and zero group C genes. (B) AID-mediated translocated genes (43) are enriched among genes predicted by the model to be AID targets (group A) compared with the set of all genes (22,492 genes) and the set of highly expressed genes comprised of the top 25% of all genes by mRNA expression levels (4,403 genes). (C) Based on the ChIP-Seq study by Yamane et al. (44), group A and predicted group A genes show greater recruitment of AID than do either group B or C. The mean and 95% confidence intervals of AID tags per gene are shown for each group of genes. The set of 17 nonexpressed genes is shown as a negative control. Also shown separately is the set of 101 high-interest predicted group A genes. ANOVA was used to compare the number of AID tags per gene for the group A/B/C genes used to build the model, as well as for the genes predicted by the model to be in groups A/B/C, with the p value shown above each comparison.

FIGURE 7.

Classification tree model to predict genome-wide targets of AID. (A) Classification tree to predict AID targeting for individual genes. For each decision node in the tree (gray circle), the gene proceeds down the left branch if it satisfies the decision condition. The leaves of the tree indicate the predicted group (A, B, or C), along with the number of genes falling into this group from the 83 genes sequenced in the dKO setting (group A/group B/group C). For example, genes in leaf #1 are predicted to be in group A, and this leaf has a total of nine genes from the training data: six group A genes, three group B genes, and zero group C genes. (B) AID-mediated translocated genes (43) are enriched among genes predicted by the model to be AID targets (group A) compared with the set of all genes (22,492 genes) and the set of highly expressed genes comprised of the top 25% of all genes by mRNA expression levels (4,403 genes). (C) Based on the ChIP-Seq study by Yamane et al. (44), group A and predicted group A genes show greater recruitment of AID than do either group B or C. The mean and 95% confidence intervals of AID tags per gene are shown for each group of genes. The set of 17 nonexpressed genes is shown as a negative control. Also shown separately is the set of 101 high-interest predicted group A genes. ANOVA was used to compare the number of AID tags per gene for the group A/B/C genes used to build the model, as well as for the genes predicted by the model to be in groups A/B/C, with the p value shown above each comparison.

Close modal

The classification tree model was applied genome wide to predict additional AID targets. Of the 22,492 RefSeq-defined mouse genes that were not used to build the model, 3,648 (16%) were predicted to be in group A, 1,949 (9%) were predicted to be in group B, and the remaining 16,895 (75%) were predicted to be group C (Supplemental Table I). To generate a list of high-quality AID targets that would be of interest for experimental validation, predicted group A genes were filtered to include only those that have an average gene expression in the top 25% of all genes in GC B cells, as determined through mRNA expression; have human homologs [defined by the Mouse Genome Database (40)] annotated as either oncogenes or tumor suppressors [identified by an Entrez query through the CancerGenes tool (41)], or are found within the COSMIC Cancer Gene Census (42) version 56. This filtering produced a set of 101 high-interest genes shown in Table I. Some notable members of this list are immune-related genes: Bcl7a, Btg1, Cd82, Cdk4, Cxcr4, Foxo1, Foxp1, Irf1, Irf8, Raf1, Rela, Stat3, and Tcf4.

Table I.
High-interest predicted group A genes from classification tree model
Entrez IdentifierModel LeafGenes
Oncogene Golga5a, Lmo4, Rab18, Rab30, Rab5c, Snrpe, Ubtf 
 Actb, Fusa 
 Cblbb, Mybl1, Nae1, Pfdn5, Rapgef1, Sh3kbp1, Tpd52, Vav1 
Tumor suppressor Anp32b, Ccna2, Cd82, Cdk2ap2, Cdkn2cb, Cfl1, Eif2s1, Flna, Foxp1a, G3bp2, Gabarap, Gtf2e1, Ing3, Ltf, Mef2c, Nbr1, Nfkbia, Rbbp5, Tcf4, Uhrf1, Zbtb33 
 Btg2, Ddx3x, Ddx5a, Dhx9, G3bp1, Irf1, Irf8, Klf10, Tnfaip3b, Uvrag 
 Anxa7, Chfr, Foxo1, Msh2b, Ndufa13, Raf1a, Rnf129, Rnf40, Rtn4, Sdhbb, Sept9, Spint2, Stk4, Tes, Tusc2, Ywhah, Ywhaq, Zfp238 
Tumor suppressor and oncogene Ccdn3a, Ptenb, Rela, Rhoa 
 Ctnnb1a, Cxcr4, Hif1a, Lyn, Stat3 
 Mapk1, Prkar1aa,b, Smarcb1 
Neither Bcl7aa, Brd4a, Ep300a,b, Eps15a, Gnasb, Herpud1a, Ikzf1b, Lasp1a, Myh9a, Ncoa1a, Nina, Nsd1a, Thrap3a 
 Btg1a 
 Cdk4b, Chic2a, Crebbpa,b, Ncoa2a, Nfe2l2b, Numa1a, Sept6a, Sh3gl1a, Tpm3a 
Entrez IdentifierModel LeafGenes
Oncogene Golga5a, Lmo4, Rab18, Rab30, Rab5c, Snrpe, Ubtf 
 Actb, Fusa 
 Cblbb, Mybl1, Nae1, Pfdn5, Rapgef1, Sh3kbp1, Tpd52, Vav1 
Tumor suppressor Anp32b, Ccna2, Cd82, Cdk2ap2, Cdkn2cb, Cfl1, Eif2s1, Flna, Foxp1a, G3bp2, Gabarap, Gtf2e1, Ing3, Ltf, Mef2c, Nbr1, Nfkbia, Rbbp5, Tcf4, Uhrf1, Zbtb33 
 Btg2, Ddx3x, Ddx5a, Dhx9, G3bp1, Irf1, Irf8, Klf10, Tnfaip3b, Uvrag 
 Anxa7, Chfr, Foxo1, Msh2b, Ndufa13, Raf1a, Rnf129, Rnf40, Rtn4, Sdhbb, Sept9, Spint2, Stk4, Tes, Tusc2, Ywhah, Ywhaq, Zfp238 
Tumor suppressor and oncogene Ccdn3a, Ptenb, Rela, Rhoa 
 Ctnnb1a, Cxcr4, Hif1a, Lyn, Stat3 
 Mapk1, Prkar1aa,b, Smarcb1 
Neither Bcl7aa, Brd4a, Ep300a,b, Eps15a, Gnasb, Herpud1a, Ikzf1b, Lasp1a, Myh9a, Ncoa1a, Nina, Nsd1a, Thrap3a 
 Btg1a 
 Cdk4b, Chic2a, Crebbpa,b, Ncoa2a, Nfe2l2b, Numa1a, Sept6a, Sh3gl1a, Tpm3a 

Genes are listed according to the Entrez Query Identifier and by the leaf of the model which led to the gene being predicted as a strong AID target.

a

Identified in the COSMIC database as translocation type.

b

Identified in the COSMIC database as other mutation type.

We carried out two validations of the model by comparing the predicted AID targets with sets of experimentally identified genes that should be enriched for actual AID targets. In the first validation, the set of actual AID targets was based on results from a translocation-capture sequencing (TC-Seq) study (43). This study identified 92 genes that could translocate with either IgH or Myc in an AID-dependent manner in ex vivo B cells. Of the 92 TC-Seq genes, 10 were excluded from the validation because they had been used to generate the model, 11 were not in our RefSeq database, and 1 was translocated to Myc in the AID−/− control. Of the remaining 70 TC-Seq genes, the model predicted 19 (27%) as group A, 13 (19%) as group B, and 38 (54%) as group C (Supplemental Table I). When compared with the set of predictions from all mouse RefSeq-defined genes, the frequency of predicted group A and B genes was significantly higher among the TC-Seq genes than expected by chance (p = 0.00019, χ2 test) (Fig. 7B). This overrepresentation of predicted group A and B genes remained significant, even when the background was restricted to genes that were highly expressed in GC B cells and, therefore, were more likely to accumulate AID-induced mutations (p = 0.0031, χ2 test). Overall, these results demonstrate that the classification model predictions are enriched for actual AID targets.

In the second validation, the set of actual AID targets was defined by AID binding, according to a ChIP-Seq study (44). Although the B cells used in this study undergo class-switch recombination (CSR) and not SHM, we expect that because the two processes are linked in their use of AID we would still observe enrichment of AID-binding genes for SHM. This assumption is supported by the observation that group A, B, and C genes differed in the mean number of AID sequence tags that were bound (p = 0.013, ANOVA), with the highest frequency of AID tags found in group A genes (Fig. 7C). Differential AID binding was also observed among highly expressed genes that were predicted to be in groups A, B, and C, with predicted group A genes exhibiting the highest frequency of AID tags (p = 1.9 × 10−9, ANOVA) (Fig. 7C). An even higher frequency of AID binding was found among the 101 high-interest predicted group A genes. The significantly higher AID binding among targets predicted by the classification model may be even more significant because CSR and SHM are dependent on different regions of the AID protein and may involve process-specific cofactors that can change the targeting (3, 45), so that complete overlap with the model predictions was not expected. Taken together, these results further support the validity of the classification tree model to predict genome-wide targets of AID and highlight the rationale for choosing the set of 101 high-interest genes put forth as candidates for experimental validation.

The mechanisms that underlie AID targeting to the Ig locus and mistargeting to non-Ig loci are poorly understood. We analyzed the sequence context of 83 non-Ig genes that are targeted by AID to varying degrees in dKO mice to define several mechanisms responsible for genome-wide AID targeting. Our data confirm the association of AID targeting with transcription and the presence of E-box motifs. E-boxes were found to increase the frequency of SHM in several systems (12, 46, 47). Our data also implicate several new factors as relevant to the targeting of AID. In particular, YY1 and C/EBP-β binding sites are associated with increased mutation accumulation, and these sites, together with E-box motifs, are colocated near the TSS of AID targets, perhaps forming a cis-regulatory module. YY1 is of high interest because it is active in GC B cells and was recently identified as a regulator for the GC program (39). Intriguingly, YY1 was recently shown to interact directly with AID, to influence levels of AID in the nucleus, and to enhance CSR (48). C/EBP-β mRNA and protein levels were reported to increase in abundance as B cells mature, and C/EBP-γ, the negative regulator of C/EBP-β, decreases with B cell maturity (49). In addition, a C/EBP-β site was recently identified in a mutational enhancer element in the Ig L chain of DT40 cells that is conserved in the condor and zebra finch (50). These results support a role for shared functional elements in explaining differential AID activity in non-Ig genes.

The contribution of AID hot/cold-spots and gene transcription to mutation accumulation was also evaluated. Although known AID hotspots are specifically targeted in non-Ig genes (1), the frequency of hotspots (or coldspots) in the sequenced region did not correlate with the overall gene-mutation frequency. This suggests that AID recruitment to the locus is a rate-limiting step. Gene transcription is also important, and some transcription is required for mutation accumulation in non-Ig genes, because nonexpressed genes do not accumulate mutations (1). Indeed, there was a small, but significant, increase in polymerase II binding among genes with the highest mutation frequencies (group A). However, there was only a weak correlation at the level of individual genes, and highly mutated genes could be identified with widely varying levels of polymerase II binding. Notably, the polymerase II–associated pause/elongation factor Spt5, a component of the DRB-sensitivity inducing complex, interacts with AID and with Ig and non-Ig targets of AID (51). Although our data indicate that total polymerase II levels do not correlate strongly with AID targeting, a better correlation might be seen upon examination of stalled polymerase II and Spt5 in GC B cells. A recent study found that levels of stalled polymerase II, but not Spt5, correlated with AID-mediated mutation of an artificial mutation reporter (52). Overall, although our analysis identifies several associations that are statistically significant when comparing group A, B, and C genes, none of these features alone is sufficient to predict the differential accumulation of mutations by individual genes.

We hypothesized that several mechanisms work together to promote AID targeting to individual genes. By combining the features that were each associated with AID targeting, we were able to construct a classification tree model that separated highly mutating genes (group A) from genes that do not mutate (group C) with an accuracy of 92%. It is instructive to consider the genes for which AID targeting is incorrectly predicted. In the training data, there are five genes that are considered major misclassifications: Pax5 exon 1b, Il21r, Fas, B2m, and Mll1. For the group A genes misclassified as group C genes (Pax5 exon 1b, Il21r, and Fas) the lack of strong evolutionary conservation between mouse and human near the TSS diminishes the ability of the model to accurately predict these genes as true group A genes. Pax5 exon 1b presents unique challenges to this analysis because the sequencing was focused on an alternative promoter, which may not be adhering to the same conventions of the other genes within the group. The two group C genes were predicted as group A genes for different reasons: B2m had a high peak of polymerase II near the TSS (max peak = 20), and Mll1 had a conserved YY1 and E-box that had a very low colocation score (score = 11), reflecting two sites that were overlapping and located at the TSS of the gene. Thus, many features of each gene contribute in a small way to AID targeting. In predicting AID targets, there was not a single reason underlying false positive and negative predictions, and improvements in accuracy are likely to come from the inclusion of additional features in the model.

The model was applied genome wide to predict AID targets. We identified a set of 101 high-interest genes that fulfilled three criteria: they are predicted to be in group A, they are highly expressed in GC B cells, and they have a known association with cancer. These targets are significantly enriched for genes identified through TC-Seq of B cells (43). They also tend to bind higher levels of AID in an ex vivo model of CSR (44). About 25% of the high-interest genes are known to undergo translocations in oncogenic settings as identified through the Cancer Gene Census. Several of these genes undergo translocations with other genes known to sustain high levels of AID targeting: Ccnd3 translocates with IgH, Bcl7a and Btg1 each translocate with Myc, and Foxp1 translocates with Pax5. Additionally, two genes, Btg1 and Raf1, were previously identified as sustaining high levels of SHM in the wild-type setting (1) but were not sequenced in dKO mice. Thus, we have high confidence in the validity of our model to predict AID-induced mutations and suggest these high-interest genes as targets for future experiments. Although it is not clear whether a common mechanism underlies genome-wide targeting of AID and targeting of AID at the Ig locus, at least some features seem to be shared. Several of the transcription factors that we implicate in genome-wide targeting have been associated with targeting at the Ig locus (46, 47, 53). In addition, we find that AID hot/cold-spots for mutation are generally shared between the Ig and non-Ig genes. Interestingly, there is a significant difference in the relative mutability of the four individual trinucleotide motifs that form the WRC hotspot (Fig. 2A). This may be the result of varying cofactors present in each unique genomic setting that slightly alters the specificity of AID to its preferred nucleotide target.

One significant difference found between AID targeting at Ig and non-Ig loci is the occurrence of clonal recruitment of AID. For a random mutation process with hot-/coldspots, the distribution of the number of mutations per sequence should fit an NB distribution (33). Indeed, we find this model to be an appropriate fit for the non-Ig genes. In contrast, the NB is a poor fit for the Jh4 intron mutation data, because this model significantly underestimates the frequency of unmutated sequences, given the mutation distribution among sequences that do acquire mutations. Instead, these data can be fit by a ZI-NB distribution, which separately models the process of mutation accumulation from the induction of AID activity. This suggests that these sequences from the Jh4 intron were derived from two subsets of cells: one subset in which AID had been effectively recruited to the gene being sequenced, resulting in a high degree of mutation, and a second subset in which AID had not (yet) been recruited. One caveat of this analysis is that we may not observe enough mutations in the non-Ig genes to be able to detect whether clonal recruitment of AID is occurring, and further sequencing is necessary to increase the number of observed mutations in each non-Ig gene. Overall, these results provide additional support for the idea that AID is clonally recruited to the Ig loci but that clonal recruitment is not a property of genome-wide AID targeting.

The evidence presented in this article supporting clonal recruitment of AID to the Ig genes has potential implications with respect to Ab diversity during the affinity-maturation process. Current thought predisposes B cells with receptors of initial high affinity to the Ag as having a competitive advantage for expansion within GCs (54). The addition of clonal recruitment may allow for the B cells with receptors of lower initial affinity, if able to recruit AID early, to go through enough rounds of mutation, selection, and expansion to yield mutated receptors with equal or higher affinity than the unmutated higher-affinity cells. These lower initial affinity, but early AID-recruiting, B cells could then be competitive in the GC environment. Thus, depending on the time of clonal recruitment of AID in B cells, the expected competitive advantage of B cells with high-affinity receptors may be diminished enough to allow for a broader spectrum of Abs to be produced from B cells with a wider range of initial affinities to the Ag.

A limitation of this study is the focus on the promoter region when analyzing transcription factor binding site associations. At the Ig locus, enhancer regions were shown to play a role in SHM that is separate from influencing the rate of transcription (12, 50, 55). Nevertheless, through this approach we identified three important DNA elements that can effectively differentiate group A and C genes. It is likely that future work incorporating additional elements in the enhancer regions will improve these predictions and may also distinguish group B genes, which were not a major focus of this work. However, there are still many problems to be worked out in the identification of enhancer boundaries and their association(s) with particular genes.

In conclusion, we identified several mechanisms that are significantly associated with AID targeting to non-Ig genes, including polymerase II binding along with the presence of E-boxes, YY1, and C/EBP-β binding sites. A classification model integrating these features captures ∼55% of highly targeted (group A) genes with the benefit of being specific for AID targets. This model was used to predict AID targeting genome wide, and a set of 101 additional high-interest AID targets was identified for further experimental study.

We thank Dustin Schones (Department of Cancer Biology, Beckman Research Institute, City of Hope, Duarte, CA), Kairong Cui (Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD), and Keji Zhao (Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health) for expertise, sequencing, and read mapping of the RNA polymerase II ChIP-Seq experiment and Annette Molinaro (Departments of Neurological Surgery and of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA) for guidance in developing the classification tree model.

J.L.D. was supported in part by the Pharmaceutical Research and Manufacturers of America Foundation and National Institutes of Health Grant T15 LM07056 from the National Library of Medicine. D.G.S. is an investigator of the Howard Hughes Medical Institute. Computational resources were provided by the Yale University Biomedical High Performance Computing Center (National Institutes of Health Grant RR19895).

The microarray data presented in this article have been submitted to the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260) under accession number GSE44260.

The online version of this article contains supplemental material.

Abbreviations used in this article:

AID

activation-induced cytidine deaminase

ChIP

chromatin immunoprecipitation

ChIP-Seq

chromatin immunoprecipitation followed by massively parallel sequencing

CSR

class switch recombination

dKO

double knockout

FDR

false-discovery rate

GC

germinal center

GSEA

gene set enrichment analysis

KO

knockout

KW

Kruskal–Wallis test

MWU

Mann–Whitney U test

NB

negative binomial

RefSeq

National Center for Biotechnology Information Reference Sequence

SHM

somatic hypermutation

TC-Seq

translocation-capture sequencing

TSS

transcription start site

ZI-NB

zero-inflated negative binomial.

1
Liu
M.
,
Duke
J. L.
,
Richter
D. J.
,
Vinuesa
C. G.
,
Goodnow
C. C.
,
Kleinstein
S. H.
,
Schatz
D. G.
.
2008
.
Two levels of protection for the B cell genome during somatic hypermutation.
Nature
451
:
841
845
.
2
Rada
C.
,
Ehrenstein
M. R.
,
Neuberger
M. S.
,
Milstein
C.
.
1998
.
Hot spot focusing of somatic hypermutation in MSH2-deficient mice suggests two stages of mutational targeting.
Immunity
9
:
135
141
.
3
Peled
J. U.
,
Kuang
F. L.
,
Iglesias-Ussel
M. D.
,
Roa
S.
,
Kalis
S. L.
,
Goodman
M. F.
,
Scharff
M. D.
.
2008
.
The biochemistry of somatic hypermutation.
Annu. Rev. Immunol.
26
:
481
511
.
4
Rada
C.
,
Di Noia
J. M.
,
Neuberger
M. S.
.
2004
.
Mismatch recognition and uracil excision provide complementary paths to both Ig switching and the A/T-focused phase of somatic mutation.
Mol. Cell
16
:
163
171
.
5
Pham
P.
,
Bransteitter
R.
,
Petruska
J.
,
Goodman
M. F.
.
2003
.
Processive AID-catalysed cytosine deamination on single-stranded DNA simulates somatic hypermutation.
Nature
424
:
103
107
.
6
Bransteitter
R.
,
Pham
P.
,
Calabrese
P.
,
Goodman
M. F.
.
2004
.
Biochemical analysis of hypermutational targeting by wild type and mutant activation-induced cytidine deaminase.
J. Biol. Chem.
279
:
51612
51621
.
7
Goyenechea
B.
,
Klix
N.
,
Yélamos
J.
,
Williams
G. T.
,
Riddell
A.
,
Neuberger
M. S.
,
Milstein
C.
.
1997
.
Cells strongly expressing Ig(kappa) transgenes show clonal recruitment of hypermutation: a role for both MAR and the enhancers.
EMBO J.
16
:
3987
3994
.
8
Yu
D.
,
Cook
M. C.
,
Shin
D.-M.
,
Silva
D. G.
,
Marshall
J.
,
Toellner
K.-M.
,
Havran
W. L.
,
Caroni
P.
,
Cooke
M. P.
,
Morse
H. C.
, et al
.
2008
.
Axon growth and guidance genes identify T-dependent germinal centre B cells.
Immunol. Cell Biol.
86
:
3
14
.
9
Anderson
S. M.
,
Khalil
A.
,
Uduman
M.
,
Hershberg
U.
,
Louzoun
Y.
,
Haberman
A. M.
,
Kleinstein
S. H.
,
Shlomchik
M. J.
.
2009
.
Taking advantage: high-affinity B cells in the germinal center have lower death rates, but similar rates of division, compared to low-affinity cells.
J. Immunol.
183
:
7314
7325
.
10
Klein
U.
,
Tu
Y.
,
Stolovitzky
G. A.
,
Keller
J. L.
,
Haddad
J.
 Jr.
,
Miljkovic
V.
,
Cattoretti
G.
,
Califano
A.
,
Dalla-Favera
R.
.
2003
.
Transcriptional analysis of the B cell germinal center reaction.
Proc. Natl. Acad. Sci. USA
100
:
2639
2644
.
11
Alizadeh
A.
,
Eisen
M.
,
Davis
R. E.
,
Ma
C.
,
Sabet
H.
,
Tran
T.
,
Powell
J. I.
,
Yang
L.
,
Marti
G. E.
,
Moore
D. T.
, et al
.
1999
.
The lymphochip: a specialized cDNA microarray for the genomic-scale analysis of gene expression in normal and malignant lymphocytes.
Cold Spring Harb. Symp. Quant. Biol.
64
:
71
78
.
12
Odegard
V. H.
,
Schatz
D. G.
.
2006
.
Targeting of somatic hypermutation.
Nat. Rev. Immunol.
6
:
573
583
.
13
Altshuler
D.
,
Pollara
V. J.
,
Cowles
C. R.
,
Van Etten
W. J.
,
Baldwin
J.
,
Linton
L.
,
Lander
E. S.
.
2000
.
An SNP map of the human genome generated by reduced representation shotgun sequencing.
Nature
407
:
513
516
.
14
Tomayko
M. M.
,
Anderson
S. M.
,
Brayton
C. E.
,
Sadanand
S.
,
Steinel
N. C.
,
Behrens
T. W.
,
Shlomchik
M. J.
.
2008
.
Systematic comparison of gene expression between murine memory and naive B cells demonstrates that memory B cells have unique signaling capabilities.
J. Immunol.
181
:
27
38
.
15
Affymetrix, Inc. 2004. Eukaryotic sample and array processing. In GeneChip Expression Analysis Technical Manual, 701021 Rev. 5. Affymetrix, Inc., Santa Clara, CA, 2.1.3–2.3.18
.
16
Pruitt
K. D.
2004
.
NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.
Nucleic Acids Res.
33
:
D501
D504
.
17
Hannum
L. G.
,
Haberman
A. M.
,
Anderson
S. M.
,
Shlomchik
M. J.
.
2000
.
Germinal center initiation, variable gene region hypermutation, and mutant B cell selection without detectable immune complexes on follicular dendritic cells.
J. Exp. Med.
192
:
931
942
.
18
Karolchik
D.
,
Hinrichs
A. S.
,
Furey
T. S.
,
Roskin
K. M.
,
Sugnet
C. W.
,
Haussler
D.
,
Kent
W. J.
.
2004
.
The UCSC Table Browser data retrieval tool.
Nucleic Acids Res.
32
:
D493
D496
.
19
Shapiro
G. S.
,
Aviszus
K.
,
Murphy
J.
,
Wysocki
L. J.
.
2002
.
Evolution of Ig DNA sequence to target specific base positions within codons for somatic hypermutation.
J. Immunol.
168
:
2302
2306
.
20
Venables
W. N.
,
Ripley
B. D.
.
2002
.
Modern Applied Statistics with S
, 4th Ed.
Springer
,
New York
.
21
Bolker
B. M.
2008
.
Ecological Models and Data in R.
Princeton University Press
,
Princeton, NJ
.
22
Blanchette
M.
,
Kent
W. J.
,
Riemer
C.
,
Elnitski
L.
,
Smit
A. F.
,
Roskin
K. M.
,
Baertsch
R.
,
Rosenbloom
K.
,
Clawson
H.
,
Green
E. D.
, et al
.
2004
.
Aligning multiple genomic sequences with the threaded blockset aligner.
Genome Res.
14
:
708
715
.
23
Kent
W. J.
,
Sugnet
C. W.
,
Furey
T. S.
,
Roskin
K. M.
,
Pringle
T. H.
,
Zahler
A. M.
,
Haussler
D.
.
2002
.
The human genome browser at UCSC.
Genome Res.
12
:
996
1006
.
24
Smit, A. F. A., R. Hubley, and P. Green. 2011. RepeatMasker. Available at: http://repeatmasker.org. Accessed: March 20, 2011.
25
Kel
A. E.
,
Gössling
E.
,
Reuter
I.
,
Cheremushkin
E.
,
Kel-Margoulis
O. V.
,
Wingender
E.
.
2003
.
MATCH: A tool for searching transcription factor binding sites in DNA sequences.
Nucleic Acids Res.
31
:
3576
3579
.
26
Matys
V.
,
Fricke
E.
,
Geffers
R.
,
Gössling
E.
,
Haubrock
M.
,
Hehl
R.
,
Hornischer
K.
,
Karas
D.
,
Kel
A. E.
,
Kel-Margoulis
O. V.
, et al
.
2003
.
TRANSFAC: transcriptional regulation, from patterns to profiles.
Nucleic Acids Res.
31
:
374
378
.
27
Subramanian
A.
,
Tamayo
P.
,
Mootha
V. K.
,
Mukherjee
S.
,
Ebert
B. L.
,
Gillette
M. A.
,
Paulovich
A.
,
Pomeroy
S. L.
,
Golub
T. R.
,
Lander
E. S.
,
Mesirov
J. P.
.
2005
.
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
Proc. Natl. Acad. Sci. USA
102
:
15545
15550
.
28
Strobl
C.
,
Boulesteix
A.-L.
,
Zeileis
A.
,
Hothorn
T.
.
2007
.
Bias in random forest variable importance measures: illustrations, sources and a solution.
BMC Bioinformatics
8
:
25
.
29
Strobl
C.
,
Malley
J.
,
Tutz
G.
.
2009
.
An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.
Psychol. Methods
14
:
323
348
.
30
Therneau, T. M., and E. J. Atkinson. 1997. Mayo Foundation. An introduction to recursive partitioning using the RPART routines. Available at: http://www.mayo.edu/hsr/techrpt/61.pdf. Accessed: January 4, 2011
.
31
Breiman
L.
,
Friedman
J.
,
Stone
C. J.
,
Olshen
R. A.
.
1984
.
Classification and Regression Trees.
Wadsworth International Group
,
Belmont, CA
.
32
García-Martínez
J.
,
Aranda
A.
,
Pérez-Ortín
J. E.
.
2004
.
Genomic run-on evaluates transcription rates for all yeast genes and identifies gene regulatory mechanisms.
Mol. Cell
15
:
303
313
.
33
Uzzell
T.
,
Corbin
K. W.
.
1971
.
Fitting discrete probability distributions to evolutionary events.
Science
172
:
1089
1096
.
34
Storb
U.
,
Klotz
E. L.
,
Hackett
J.
,
Kage
K.
,
Bozek
G.
,
Martin
T. E.
.
1998
.
A hypermutable insert in an immunoglobulin transgene contains hotspots of somatic mutation and sequences predicting highly stable structures in the RNA transcript.
J. Exp. Med.
188
:
689
698
.
35
Zuur
A. F.
,
Ieno
E. N.
,
Walker
N.
,
Saneliev
A. A.
,
Smith
G. M.
.
2009
.
Mixed Effects Models and Extension in Ecology with R (Statistics for Biology and Health)
, 1st Ed.
Springer
,
New York
.
36
Lord
D.
,
Washington
S.
,
Ivan
J.
.
2005
.
Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory.
Accid. Anal. Prev.
37
:
35
46
.
37
Liu
R.
,
McEachin
R. C.
,
States
D. J.
.
2003
.
Computationally identifying novel NF-kappa B-regulated immune genes in the human genome.
Genome Res.
13
:
654
661
.
38
Tabach
Y.
,
Brosh
R.
,
Buganim
Y.
,
Reiner
A.
,
Zuk
O.
,
Yitzhaky
A.
,
Koudritsky
M.
,
Rotter
V.
,
Domany
E.
.
2007
.
Wide-scale analysis of human functional transcription factor binding reveals a strong bias towards the transcription start site.
PLoS One
2
:
e807
.
39
Green
M. R.
,
Monti
S.
,
Dalla-Favera
R.
,
Pasqualucci
L.
,
Walsh
N. C.
,
Schmidt-Supprian
M.
,
Kutok
J. L.
,
Rodig
S. J.
,
Neuberg
D. S.
,
Rajewsky
K.
, et al
.
2011
.
Signatures of murine B-cell development implicate Yy1 as a regulator of the germinal center-specific program.
Proc. Natl. Acad. Sci. USA
108
:
2873
2878
.
40
Blake
J. A.
,
Bult
C. J.
,
Kadin
J. A.
,
Richardson
J. E.
,
Eppig
J. T.
.
2011
.
The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics.
Nucleic Acids Res.
39
:
D842
D848
.
41
Higgins
M. E.
,
Claremont
M.
,
Major
J. E.
,
Sander
C.
,
Lash
A. E.
.
2007
.
CancerGenes: a gene selection resource for cancer genome projects.
Nucleic Acids Res.
35
:
D721
D726
.
42
Santarius
T.
,
Shipley
J.
,
Brewer
D.
,
Stratton
M. R.
,
Cooper
C. S.
.
2010
.
A census of amplified and overexpressed human cancer genes.
Nat. Rev. Cancer
10
:
59
64
.
43
Klein
I. A.
,
Resch
W.
,
Jankovic
M.
,
Oliveira
T.
,
Yamane
A.
,
Nakahashi
H.
,
Di Virgilio
M.
,
Bothmer
A.
,
Nussenzweig
A.
,
Robbiani
D. F.
, et al
.
2011
.
Translocation-capture sequencing reveals the extent and nature of chromosomal rearrangements in B lymphocytes.
Cell
147
:
95
106
.
44
Yamane
A.
,
Resch
W.
,
Kuo
N.
,
Kuchen
S.
,
Li
Z.
,
Sun
H.
,
Robbiani
D. F.
,
McBride
K.
,
Nussenzweig
M. C.
,
Casellas
R.
.
2011
.
Deep-sequencing identification of the genomic targets of the cytidine deaminase AID and its cofactor RPA in B lymphocytes.
Nat. Immunol.
12
:
62
69
.
45
Teng
G.
,
Papavasiliou
F. N.
.
2007
.
Immunoglobulin somatic hypermutation.
Annu. Rev. Genet.
41
:
107
120
.
46
Michael
N.
,
Shen
H. M.
,
Longerich
S.
,
Kim
N.
,
Longacre
A.
,
Storb
U.
.
2003
.
The E box motif CAGGTG enhances somatic hypermutation without enhancing transcription.
Immunity
19
:
235
242
.
47
Tanaka
A.
,
Shen
H. M.
,
Ratnam
S.
,
Kodgire
P.
,
Storb
U.
.
2010
.
Attracting AID to targets of somatic hypermutation.
J. Exp. Med.
207
:
405
415
.
48
Zaprazna
K.
,
Atchison
M. L.
.
2012
.
YY1 controls immunoglobulin class switch recombination and nuclear activation-induced deaminase levels.
Mol. Cell. Biol.
32
:
1542
1554
.
49
Cooper
C. L.
,
Berrier
A. L.
,
Roman
C.
,
Calame
K. L.
.
1994
.
Limited expression of C/EBP family proteins during B lymphocyte development. Negative regulator Ig/EBP predominates early and activator NF-IL-6 is induced later.
J. Immunol.
153
:
5049
5058
.
50
Kothapalli
N. R.
,
Collura
K. M.
,
Norton
D. D.
,
Fugmann
S. D.
.
2011
.
Separation of mutational and transcriptional enhancers in Ig genes.
J. Immunol.
187
:
3247
3255
.
51
Pavri
R.
,
Gazumyan
A.
,
Jankovic
M.
,
Di Virgilio
M.
,
Klein
I.
,
Ansarah-Sobrinho
C.
,
Resch
W.
,
Yamane
A.
,
San-Martin
B. R.
,
Barreto
V.
, et al
.
2010
.
Activation-induced cytidine deaminase targets DNA at sites of RNA polymerase II stalling by interaction with Spt5.
Cell
143
:
122
133
.
52
Kohler
K. M.
,
McDonald
J. J.
,
Duke
J. L.
,
Arakawa
H.
,
Tan
S.
,
Kleinstein
S. H.
,
Buerstedde
J.-M.
,
Schatz
D. G.
.
2012
.
Identification of core DNA elements that target somatic hypermutation.
J. Immunol.
189
:
5314
5326
.
53
Liu
M.
,
Schatz
D. G.
.
2009
.
Balancing AID and DNA repair during somatic hypermutation.
Trends Immunol.
30
:
173
181
.
54
Shlomchik
M. J.
,
Weisel
F.
.
2012
.
Germinal center selection and the development of memory B and plasma cells.
Immunol. Rev.
247
:
52
63
.
55
Blagodatski
A.
,
Batrak
V.
,
Schmidl
S.
,
Schoetz
U.
,
Caldwell
R. B.
,
Arakawa
H.
,
Buerstedde
J.-M.
.
2009
.
A cis-acting diversification activator both necessary and sufficient for AID-mediated hypermutation.
PLoS Genet.
5
:
e1000332
.

The authors have no financial conflicts of interest.