Next-generation sequencing of the Ig gene repertoire (Ig-seq) produces large volumes of information at the nucleotide sequence level. Such data have improved our understanding of immune systems across numerous species and have already been successfully applied in vaccine development and drug discovery. However, the high-throughput nature of Ig-seq means that it is afflicted by high error rates. This has led to the development of error-correction approaches. Computational error-correction methods use sequence information alone, primarily designating sequences as likely to be correct if they are observed frequently. In this work, we describe an orthogonal method for filtering Ig-seq data, which considers the structural viability of each sequence. A typical natural Ab structure requires the presence of a disulfide bridge within each of its variable chains to maintain the fold. Our Ab Sequence Selector (ABOSS) uses the presence/absence of this bridge as a way of both identifying structurally viable sequences and estimating the sequencing error rate. On simulated Ig-seq datasets, ABOSS is able to identify more than 99% of structurally viable sequences. Applying our method to six independent Ig-seq datasets (one mouse and five human), we show that our error calculations are in line with previous experimental and computational error estimates. We also show how ABOSS is able to identify structurally impossible sequences missed by other error-correction methods.

Effective recognition and elimination of noxious molecules from jawed vertebrates relies on the versatility of their immune systems. Abs, secreted products of B cells, play a key role in recognizing Ags: structural motifs on pathogenic molecules. Abs can be raised against potentially any Ag (1). As a result of this binding plasticity, Abs are currently the most successful class of biotherapeutics (2, 3).

Next-generation sequencing of the Ig gene repertoire (Ig-seq) produces large volumes of information at the nucleotide sequence level, allowing interrogation of snapshots of Ab diversity. Such data have improved our understanding of immune systems across numerous species and have already been successfully applied in vaccine development and drug discovery [e.g., (4, 5)]. However, the high-throughput nature of Ig-seq means that it is afflicted by high error rates, which makes it difficult to distinguish between Ig-seq artifacts and true nucleotide alterations introduced by the somatic hypermutation (SHM) machinery of B cells.

Several experimental Ig-seq error-correction approaches have been proposed; however, an agreed standard does not yet exist (6). Existing experimental approaches for error correction include taking invariant sequence portions as a proxy for estimating error or barcoding sequences that should be identical. For instance, Galson et al. (7) performed sequencing of the constant portions of the Ab H chain. As this region is typically sequence invariant, it offered an estimated error rate on the variable (V) portions sequenced in the course of the same study. Khan et al. (8) barcoded individual Ab cDNA transcripts with unique molecular identifiers (UMI) prior to PCR. The resultant pool of genetic data was sequenced, and identically barcoded sequences were put into separate clusters in which a consensus sequence was devised. All other members of the cluster were corrected with respect to this consensus sequence. Error can be introduced even in this method in the early steps of sequencing sample preparation, such as reverse transcription and PCR (9, 10). Devising a correct sequence within the clusters is heavily dependent on sequence redundancies, which precludes correction of singleton clusters using the barcode approach (9, 10).

Techniques such as barcoding or sequencing constant portions are time consuming and require specialized experimental setups. To address such issues, several computational error-correction tools have been developed (6). These applications all operate by building consensus sequences using homology clustering. The majority of these tools work only in the remit of CDR-3 of the V–H chain (VH) domain (CDR-H3) (11, 12), largely ignoring the rest of the sequence. MiXCR is the most commonly used Ig-seq error-correction tool to date (13). It supports the analysis of entire VH or V–L chains (VL) and performs sequencing error correction. MiXCR works by aligning sequences from an Ig-seq dataset to reference V, J, and C genes followed by identifying gene feature sequences. This is a k-mer of residues identical across multiple sequences and is found in CDR-H3 by default. These gene feature sequences are then used to sort Ab sequences into sets of separate clonotypes. The number of unique clonotypes is always overestimated because of PCR and sequencing errors. To overcome this, “correct” sequences are found by performing heuristic multilayer clustering on these clonotypes, in which the most redundant clonotypes are treated as correct. A more recently developed Ab repertoire construction tool, IgReC (14), takes a different approach. It uses Hamming graphs to identify correct sequences. Benchmark analysis on barcoded Ig-seq data shows that the IgReC pipeline is as accurate as experimental error-correction approaches (14). This suggests that advances in algorithm development can potentially alleviate the need for experimental Ig-seq correction. All currently available computational methods consider sequence information alone. In this paper, we consider how knowledge of an Ab structure may help to identify sequencing errors by finding sequences that are not structurally viable, as structural viability is crucial for the correct functioning of an Ab. We then use this structural information to estimate sequencing error rates.

A typical Ab structure requires the presence of a disulfide bridge within each of the V chains. This bridge helps to maintain the Ig fold. Cysteines at positions 23 and 104 [ImMunoGeneTics (IMGT) numbering (15)] must be present in structurally viable natural Ab sequences (1619). There is evidence that some Abs can still fold when the disulfide bond is ablated (2022). However, such Abs have been found via rational protein engineering, in which the conserved cysteines are mutated alongside further modifications to the rest of the Ab sequence that stabilize the overall structure (2022). ABPC48 is the only example of an Ab that naturally lacks cysteine at position 104 (18). APBC48 is a mouse Ab derived from plasmacytoma (23). Although ABPC48 Ab is able to fold, restoration of cysteine at position 104 significantly improves its stability (22).

In this article, we describe a novel computational tool, Ab Sequence Selector (ABOSS) that uses the presence/absence of the conserved cysteines in an Ab sequence to create a structure-based estimate of the sequencing error rate in Ig-seq data. As opposed to other error-correction tools that operate at the nucleotide sequence level, ABOSS uses amino acid sequences, as they relate directly to protein structure. ABOSS both filters out amino acid sequences that are not structurally viable as well as those likely to contain erroneous residue/positions. Because of its use of structural information rather than homology clustering, ABOSS is orthogonal to all other computational methods for error estimation.

Examining ABOSS performance on simulated Ig-seq datasets indicated that ABOSS successfully isolates ∼99% of structurally viable sequences while preserving most of the SHM-generated diversity. We tested ABOSS on six separate Ig-seq datasets and found that our error calculations based on structural viability were in line with error estimates declared in other recently published studies.

ABOSS supports several input formats. These can be amino acid sequences in the FASTA file or raw IgBlastn outputs (24).

The first step of ABOSS is to parse every sequence through ANARCI (25), an Ab-numbering program. ANARCI parsing acts as a prefiltering step, removing sequences that 1) contain unusual insertions/deletions in the framework and canonical CDR regions, 2) do not align to the respective species Hidden Markov Models of the IMGT germline, and 3) have a J gene sequence identity of <50% to the IMGT germline (of the respective species) or truncated framework 4 region. Calculation of the J gene sequence identity allows us to remove sequences in which indels have occurred in CDR-3 and framework 4. At this point, ABOSS also removes sequences in human and mouse datasets that have CDR-H3s longer than 37 aa. This cutoff is in place to remove sequences with erroneously long CDR-H3s (26, 27). These are chimeric sequences that arise as a result of PCR error. Sequences that pass these initial tests are numbered using the IMGT scheme (15). This provides a consistent frame of reference for sequences and defines CDR and framework regions. We employ the IMGT numbering scheme in ABOSS because it assigns length-mismatched CDRs located in roughly structurally equivalent space to identical residue numbers (15, 28).

IMGT numbering enables the calculation of amino acid distributions by position. IMGT positions 23 and 104 are used to estimate the error rate in the data. In all naturally occurring Abs, both these positions are always a cysteine residue (16, 18). Some Ab pseudo V genes encode for a noncysteine residue at position 23 and 104, but these Abs are not structurally viable (17). Therefore, under a conservative model, any amino acid divergence from cysteine at these positions can be treated as error. We define the error to be equal to the largest noncysteine amino acid proportion found at either of the two positions (Fig. 1). The dataset used in Fig. 1 would be ascribed an error rate of 0.0027 (the occurrence of glycine at position 104). Thus, all amino acids at a position that occurs with proportion of <0.0027 across the dataset are considered erroneous. This will remove all the noncysteine residue types at position 23 and 104 in the data as they all occur <0.0027, but will also indicate several other positions in which residues may be erroneous. In this fashion, we provide an error estimate for individual residues, which can be extrapolated to the entire sequence.

The next stage is ABOSS filtering. In this step, if the proportion of an amino acid at a position is below the residue error rate, amino acids of that type at that position are flagged as potentially erroneous.

ABOSS creates a reference matrix, which contains the “allowed” amino acids at each IMGT position. The allowed amino acids are those whose proportion in the Ig-seq dataset are greater than the residues error rate at the respective position (see previous section). The reference matrix also contains the amino acids from IMGT germline sequences as they represent structurally viable Abs. If <20 entries are used to calculate amino acid proportions at a position, this position is not included in the reference matrix.

Once the reference matrix is calculated, every sequence from the Ig-seq dataset is compared with it. For a given position in a sequence from the Ig-seq dataset, a flag is placed if the amino acid in the sequence is not present in the reference matrix. ABOSS outputs a comma-separated values file of the ANARCI-parsed sequences, their redundancies, CDR-H3 regions, flagged residue/positions, V and J genes, and query names of the original raw nucleotide sequences from the IgBlastn output. The ABOSS filtered dataset refers to the set of sequences with zero flagged residue/positions.

We have tested ABOSS on six Ig-seq datasets (Table I). Two datasets are from Khan et al. (8), the raw sequences (Khan_R) and the error-corrected sequences (Khan_C); each of these datasets was composed of three immunized datasets of a single mouse that were pooled together (2.4 million [m] sequences). The Galson et al. (7) dataset (HEPB) consists of sequences from hepatitis B studies (29, 30) at a time point before the 11 participants were vaccinated (9.9 m sequences). The third and fourth datasets are proprietary UCB Pharma datasets of 5.6-m VH (UCB_H) and 9.3-m VL (UCB_L) chain sequences (31). The UCB data were generated from the nonantigen-challenged B cells of 494 pooled participants. The Vander Heiden et al. (32) datasets (Healthy_H and Healthy_L) include sequences from four healthy B cell donors. A mixture of VH and VL gene primers were used in sequencing material preparation, which produced pooled VH/VL Ig-seq datasets. Healthy_H and Healthy_L are the sorted H and L chain sequences, respectively. This plethora of diversity of Ig-seq datasets was employed to test ABOSS across heterogeneous sequencing setups.

We performed in silico error simulation on two Ig-seq datasets, UCB_H and Khan_R. The simulations were performed at the nucleotide level. The nucleotide sequences that corresponded to the amino acid sequences that passed ABOSS with zero flagged residue/positions (see ABOSS-filtered dataset in Table II) acted as starting points for our simulation. During the simulation, each sequence in the starting dataset was subjected to randomized nucleotide mutations. The distribution of the number of nucleotide mutations was proportional to the distribution of flagged residue/positions in the respective redundant Ig-seq data determined by ABOSS analysis (see 11ABOSS analysis on raw Ig-seq data), whereas mutation positions were stochastically selected along the VH. Only sequences in which random mutations were introduced were added to the final simulation dataset. As UCB_H and Khan_R datasets were generated using Illumina sequencing technology, only nucleotide substitutions were considered in the error simulations.

To assess the robustness of ABOSS, we varied both residue error rate and dataset size in our error simulations. To increase the residue error rate, every entry from the originally calculated distribution of flagged residue/positions was amplified by an error multiplier. Separate simulations were carried out for individual values of the error rate multiplier that ranged between one and eight. The simulation final dataset sizes were equivalent to the size of the respective ANARCI-parsed Ig-seq dataset (see ANARCI-parsed dataset in Table II). Separate simulations were also performed in which the size of the simulation final dataset was varied to be between one and eight times smaller than the respective ANARCI-parsed Ig-seq dataset.

We carried out two separate SHM simulations on the nucleotide sequences of the Healthy_H and UCB_H datasets. Nucleotide sequences, whose translated amino acids had zero ABOSS-flagged residue/positions in Healthy_H, were assigned as the most-recent common ancestors (MRCA) in the simulations. Two different clonal lineage trees (Lineage_A and Lineage_B) were employed for the number of progenitor sequences and SHM substitutions. We used the human HH_S5F–targeting model (33) from the SHazaM package (http://shazam.readthedocs.io/en/version-0.1.9—baseline-fixes/) to perform SHM substitutions in the lineage trees. All MRCA and progeny sequences were added to the final SHM simulation datasets.

In the Lineage_A simulation experiment, two progenitor sequences originated from a single MRCA. Both of these sequences harbored 2-nt SHM substitutions. In the Lineage_B SHM experiment, two progenitors were produced by MRCA with 2- and 4-nt substitutions, respectively. The former progenitor formed a further four offspring sequences with one, one, three, and six SHM substitutions. Finally, the offspring with three SHM substitutions produced another progeny, also with 3-nt substitutions.

ABOSS is a computational method that leverages structural Ab information to calculate the sequencing error rate and flag potentially erroneous residue/positions in Ig-seq sequences. Specifically, we exploit the knowledge of the conserved cysteines at positions 23 and 104, which shape and stabilize the conformation of the Ab V chains. The presence of these conserved cysteines can be used as a way of both identifying structurally viable sequences and estimating the sequencing error rate.

ABPC48 is the only characterized natural Ab that lacks either cysteine at either position (20). A small number of structurally stable Abs with pairwise substitutions of the conserved cysteines based on the ABPC48 Ab scaffold have been engineered (21, 22, 34). These pairwise substitutions require further stabilizing mutations to the Ab structure, often to the opposite V chain (21, 22, 34). The known structurally viable noncysteine pairs seen at positions 23 and 104 are summarized in (18). In our Ig-seq datasets, we rarely observe the pairwise substitution of cysteines. For instance, the total number of instances when the substitution of both cysteines was observed in the UCB_H data were 811, which corresponded to ∼0.015% of UCB_H. Of these 811 pairwise substitutions, the potentially viable substitutions as described in (18) were serine–serine, serine–alanine, alanine–serine, and tyrosine–valine, which appeared 24, 2, 1, and 1 times, respectively. The 6 aa that constitute the largest proportions of noncysteine residue types at positions 23 and 104 in our six raw Ig-seq datasets are always the amino acids 1-nt edit distance from the cysteine codons (Fig. 1). The top noncysteine residue type at positions 23 and 104 varies across our Ig-seq datasets, demonstrating the stochastic nature of this amino acid substitution. It was previously demonstrated that SHM substitutions are significantly reduced at positions 23 and 104 in gene-specific amino acid substitution profiles of SHM (35). This must be due to negative structural selection, as SHM substitution still takes place at these positions in passenger alleles and using the HH_S5F computational model (35). This evidence suggests that the substitutions in conserved cysteines seen in Ig-seq datasets are highly likely to be sequencing errors.

FIGURE 1.

Calculation of the residue error rate in terms of structural viability. (A) Three-dimensional structure of the VH (Protein Data Bank [PDB] code: 5WUV) with the conserved disulfide bridge shown. Framework, gray); CDR regions, red; cysteine bond between positions 23 and 104, yellow. (B) The distribution of amino acid types found at positions 23 and 104 for an Ig-seq dataset. Because both positions in natural Abs should be cysteines, the noncysteine occurrence indicates possible sequencing error.

FIGURE 1.

Calculation of the residue error rate in terms of structural viability. (A) Three-dimensional structure of the VH (Protein Data Bank [PDB] code: 5WUV) with the conserved disulfide bridge shown. Framework, gray); CDR regions, red; cysteine bond between positions 23 and 104, yellow. (B) The distribution of amino acid types found at positions 23 and 104 for an Ig-seq dataset. Because both positions in natural Abs should be cysteines, the noncysteine occurrence indicates possible sequencing error.

Close modal

In the first step of the ABOSS protocol, all sequences are parsed using ANARCI (25), which numbers the sequences in accordance with the IMGT scheme (15). Ab sequences with low sequence identities to ANARCI Hidden Markov Model profiles; unusual insertions/deletions along the Ab chain are discarded. Next, ABOSS calculates the residue error rate using the ANARCI-parsed sequences. The residue error rate is taken as the largest noncysteine amino acid proportion found at position 23 or 104 (Fig. 1). The residue error rate is then used to flag specific residue/positions in individual sequences.

The workflow of the algorithm is summarized in Fig. 2. ABOSS analysis takes <10 h wall-clock time for 5-m unique Ab amino acid sequences on a standard eight-core desktop computer (intel i7-6700). ABOSS is parallelized, allowing for shorter runtimes on more powerful machines. ABOSS is available via http://opig.stats.ox.ac.uk//resources.

FIGURE 2.

Workflow of ABOSS. ABOSS input is Ab amino acid sequences in the FASTA format. Every sequence from the input file is IMGT numbered with ANARCI (ANARCI parsing). The amino acid distribution by IMGT position is calculated for successfully ANARCI-parsed sequences. The residue error rate is estimated based on the amino acid distributions at positions 23 and 104 (see Fig. 1 for more details). The estimated residue error rate together with the ANARCI-numbered IMGT germline genes are used to flag potentially erroneous residue/positions in individual Ig-seq sequences. Filtered Ig-seq dataset refers to a collection of sequences that pass ABOSS analysis with zero flagged residues/positions. The solid bubble shows the stages in the ABOSS algorithm; the dashed bubbles list the reasons why Ig-seq sequences failed the ANARCI preprocessing step.

FIGURE 2.

Workflow of ABOSS. ABOSS input is Ab amino acid sequences in the FASTA format. Every sequence from the input file is IMGT numbered with ANARCI (ANARCI parsing). The amino acid distribution by IMGT position is calculated for successfully ANARCI-parsed sequences. The residue error rate is estimated based on the amino acid distributions at positions 23 and 104 (see Fig. 1 for more details). The estimated residue error rate together with the ANARCI-numbered IMGT germline genes are used to flag potentially erroneous residue/positions in individual Ig-seq sequences. Filtered Ig-seq dataset refers to a collection of sequences that pass ABOSS analysis with zero flagged residues/positions. The solid bubble shows the stages in the ABOSS algorithm; the dashed bubbles list the reasons why Ig-seq sequences failed the ANARCI preprocessing step.

Close modal

We ran ABOSS on six Ig-seq datasets (Tables I, II). We consider two sequences redundant if they have identical length and identical amino acid compositions. ANARCI parsing removed between 3–23% of sequences in the Ig-seq samples (Table II). The ANARCI parsing step removed the largest proportion of sequences from Healthy_L, followed by the Healthy_H, UCB_L, Khan_R, UCB_H, and HEPB datasets, respectively. In the second step, ABOSS filtering and residue/position in the sequences are flagged as potential errors. In the Khan_R was the dataset with the smallest proportion of sequences with zero ABOSS-flagged residue/positions (26.6%) (Table II). The HEPB dataset had the highest proportion of zero ABOSS-flagged residue/positions (65.9%), followed by Health_L, UCB_L (37.3%), Healthy_H, and UCB_H (33.7%).

Table I.
Summary of the datasets used
Dataset NameStudy DescriptionTotal Dataset Size (m)Ab ChainDataset Average RedundancyParticipants
Khan_R Raw sequences of immunized mouse 1 from Khan et al. (82.4 3.74 1 (mouse) 
Khan_C Barcode-corrected sequences of immunized mouse 1 from Khan et al. (82.4 45.3 1 (mouse) 
HEPB Human hepatitis B vaccination from Galson et al. (79.9 1.93 11 
UCB_H Proprietary UCB Ig-seq of the VH 5.6 1.15 494 
UCB_L Proprietary UCB Ig-seq of the VL 9.3 1.12 494 
Healthy_H VH from healthy human B cell donors from Vander Heiden et al. (321.4 1.9 
Healthy_L VL from healthy human B cell donors from Vander Heiden et al. (326.3 2.96 
Dataset NameStudy DescriptionTotal Dataset Size (m)Ab ChainDataset Average RedundancyParticipants
Khan_R Raw sequences of immunized mouse 1 from Khan et al. (82.4 3.74 1 (mouse) 
Khan_C Barcode-corrected sequences of immunized mouse 1 from Khan et al. (82.4 45.3 1 (mouse) 
HEPB Human hepatitis B vaccination from Galson et al. (79.9 1.93 11 
UCB_H Proprietary UCB Ig-seq of the VH 5.6 1.15 494 
UCB_L Proprietary UCB Ig-seq of the VL 9.3 1.12 494 
Healthy_H VH from healthy human B cell donors from Vander Heiden et al. (321.4 1.9 
Healthy_L VL from healthy human B cell donors from Vander Heiden et al. (326.3 2.96 

The seven datasets (Khan_R, Khan_C, HEPB, UCB_H, UCB_L, Healthy_H, and Healthy_L) were obtained from different sequencing methodologies, organisms, and immunization protocols. The Khan_R and Khan_C datasets are the immunized mouse 1 dataset of the Khan et al. (8) study before and after the barcode-correction approach. These datasets are from repeated Ig-seq of the same mouse. The majority of sequences in this Ig-seq dataset start at position 8. The Khan_R and Khan_C datasets consist of Ab amino acid and corresponding nucleotide sequences. The Khan_R dataset has the highest redundancy among the interrogated noncorrected datasets. We have removed the roughly 10% synthetic spike-ins in the Khan_R and Khan_C datasets. The HEPB dataset from Galson et al. (7) is from 11 participants. Standard Illumina Ig-seq was performed. The reads were gene aligned and processed using IMGT/HighV-Quest. Because of selection of PCR primers, most of the sequences start at position 17. This dataset contains amino acid sequences only. The dataset’s redundancy is almost two times lower than the Khan_R data. The UCB proprietary Ig-seq datasets were obtained from 494 participants. The UCB_H and UCB_L datasets are composed of 5.6- and 9.3-m sequences, respectively. The UCB_H and UCB_L datasets contain both Ab amino acid and corresponding nucleotide sequences. The UCB datasets were aligned with IgBlast (24), and V and J genes were identified and prefiltered for stop codons; they contain full-length V chain sequences as described in Krawczyk et al. (31). The UCB_H and UCB_L datasets are the least redundant among the datasets. The Healthy_H and Healthy_L datasets come from four healthy human B cell donors from the Vander Heiden et al. (32) study. In this study, sequencing primers for both H and L chain genes were used at the same time, forming pooled raw nucleotide samples. The raw nucleotide Ig-seq datasets were obtained from the Observed Antibody Space (40), followed by translating sequences into amino acids and Ab chain separation using IgBlastn (24).

Table II.
ABOSS analysis of six Ig-seq datasets
Data SourceStarting DatasetANARCI-Parsed DatasetABOSS-Filtered DatasetSequence Percentage withZero Flags (%)ABOSS Residue Error Rate (%)
HEPB 9,985,575 (5,175,036) 9,698,098 (4,931,263) 6,580,526 (3,227,724) 65.9 0.2276 
Khan_R 2,445,354 (653,520) 2,247,758 (521,672) 649,685 (47,593) 26.6 1.5741 
UCB_L 9,371,465 (8,380,540) 8,021,407 (7,120,100) 3,494,319 (2,983,103) 37.3 0.4674 
UCB_H 5,645,304 (4,925,532) 5,277,305 (4,587,918) 1,903,703 (1,561,082) 33.7 0.5892 
Healthy_H 1,422,405 (745,276) 1,135,185 (558,171) 486,437 (176,012) 34.2 0.5427 
Healthy_L 6,317,736 (2,135,745) 4,860,389 (1,372,804) 2,667,263 (386,165) 42.2 0.4121 
Data SourceStarting DatasetANARCI-Parsed DatasetABOSS-Filtered DatasetSequence Percentage withZero Flags (%)ABOSS Residue Error Rate (%)
HEPB 9,985,575 (5,175,036) 9,698,098 (4,931,263) 6,580,526 (3,227,724) 65.9 0.2276 
Khan_R 2,445,354 (653,520) 2,247,758 (521,672) 649,685 (47,593) 26.6 1.5741 
UCB_L 9,371,465 (8,380,540) 8,021,407 (7,120,100) 3,494,319 (2,983,103) 37.3 0.4674 
UCB_H 5,645,304 (4,925,532) 5,277,305 (4,587,918) 1,903,703 (1,561,082) 33.7 0.5892 
Healthy_H 1,422,405 (745,276) 1,135,185 (558,171) 486,437 (176,012) 34.2 0.5427 
Healthy_L 6,317,736 (2,135,745) 4,860,389 (1,372,804) 2,667,263 (386,165) 42.2 0.4121 

In the table, dataset sizes are given as the number of redundant sequences; the number of nonredundant sequences are shown in parentheses. Starting datasets are the inputs for ABOSS. ANARCI-parsed datasets contain sequences that are successfully IMGT numbered. ABOSS-filtered datasets are the number of sequences that contain zero flagged residues. The percentage of sequences with zero flags are calculated as a percentage of redundant ABOSS-passed sequences over the total number of starting redundant sequences. Residue error rates are calculated as described in Fig. 1.

Current Ig-seq error-correction pipelines assign greater confidence to highly redundant sequences and manipulate the nucleotide sequences of rare sequences (6, 8, 13, 14). In contrast, ABOSS does not have a direct link between sequence redundancy and correct sequences. To examine the performance overlap of ABOSS and redundancy based Ig-seq error-correction tools, we compared the number of ABOSS-flagged residue/positions with the sequence redundancy for our six datasets (Fig. 3). In every dataset, sequences that are more redundant tend to have fewer ABOSS-flagged residue/positions. This suggests that even though ABOSS is not a redundancy-based technique, its results are still in line with the widely adopted methodology based on sequence redundancy. ABOSS does flag residues as erroneous in a number of highly redundant clones, which might be flagged as correct by redundancy-reliant methods. If a sequence was highly redundant, it could, in theory, avoid any of its residues being flagged by ABOSS, as every residue/position in these sequences would be present more times than the residue error rate. The horizontal dashed lines in Fig. 3 shows the redundancy necessary to achieve this. Only a single sequence from the Healthy_L dataset reached such a level of redundancy (Fig. 3F).

FIGURE 3.

Sequence redundancy relative to the number of ABOSS-flagged residue/positions in the sequences of our six datasets (see Table I for more details): UCB_H (A), UCB_L (B), Khan_R (C), HEPB (D), Healthy_H (E), and Healthy_L (F). The ABOSS-filtering step outputs the number of flagged residue/positions for every sequence in the ANARCI-parsed Ig-seq dataset. Zero flagged residue/positions indicates that the sequence is structurally viable. The general trend in each Ig-seq dataset is that the more redundant the sequence, the fewer ABOSS-flagged residue/positions it has. The horizontal dashed line represents the residue error rate in terms of the number of entries required for a residue/position to be identified as structurally viable.

FIGURE 3.

Sequence redundancy relative to the number of ABOSS-flagged residue/positions in the sequences of our six datasets (see Table I for more details): UCB_H (A), UCB_L (B), Khan_R (C), HEPB (D), Healthy_H (E), and Healthy_L (F). The ABOSS-filtering step outputs the number of flagged residue/positions for every sequence in the ANARCI-parsed Ig-seq dataset. Zero flagged residue/positions indicates that the sequence is structurally viable. The general trend in each Ig-seq dataset is that the more redundant the sequence, the fewer ABOSS-flagged residue/positions it has. The horizontal dashed line represents the residue error rate in terms of the number of entries required for a residue/position to be identified as structurally viable.

Close modal

To investigate the types of Ig-seq datasets that ABOSS can successfully analyze, we benchmarked ABOSS with respect to dataset redundancies, sequencing error rates, and input sequence volumes. We tested ABOSS on two datasets with contrasting depth and breadth of coverage: the UCB_H and Khan_R datasets (see Table I). The starting datasets for the simulation consisted of sequences that passed ABOSS analysis with zero flagged residue/positions. The sizes of the simulation final datasets (UCB_H_Sim and Khan_R_Sim) were based on the number of sequences that passed the ANARCI-parsing step (see Table II). We used the distribution of the number of flagged residue/positions in the UCB_H and Khan_R datasets as calculated by ABOSS (see Fig. 3) to introduce erroneous residue/positions into our simulation-starting datasets. The mutation substitution positions were stochastically selected along the VH. From these starting points, the simulations were performed as described in 2Materials and Methods (see 7In silico simulation).

The simulation results are shown in Fig. 4 and Supplemental Fig. 1. Using sizes and error rates that match the original data ABOSS recovered, 99.6 and 99% of the correct sequences incorporated into the UCB_H_Sim and Khan_R_Sim datasets, respectively. Reducing the UCB_H_Sim and Khan_R_Sim dataset sizes does not appear to influence the percentage of correct sequences recovered by ABOSS analysis. Increasing the error rates has a minor effect on the recovered number of correct sequences from the Khan_R_Sim dataset and a much larger effect on the recovery of correct sequences from the UCB_H_Sim dataset. This difference is due to the far lower initial redundancy of the UCB_H data.

FIGURE 4.

Examination of ABOSS performance on Ig-seq error-simulation datasets. Ig-seq data simulation was carried out based on the previously calculated numbers of ABOSS-flagged residue/positions in two Ig-seq datasets, UCB_H and Khan_R (see Fig. 3). The x-axis corresponds to UCB_H_Sim and Khan_R_Sim dataset sizes used for simulation (the percentages relative to the sizes of respective datasets that passed ANARCI are shown in parentheses). The y-axis shows the multiplier of the original distribution of erroneous residue/positions in the Ig-seq datasets (see Fig. 3). The total number of ABOSS-filtered sequences (black) and the number of the correct sequences (gray) are pictured (the percentages are given in Supplemental Fig. 1). The overlapping region indicates the proportion of the correct sequences that passed ABOSS relative to the total number of ABOSS-filtered sequences.

FIGURE 4.

Examination of ABOSS performance on Ig-seq error-simulation datasets. Ig-seq data simulation was carried out based on the previously calculated numbers of ABOSS-flagged residue/positions in two Ig-seq datasets, UCB_H and Khan_R (see Fig. 3). The x-axis corresponds to UCB_H_Sim and Khan_R_Sim dataset sizes used for simulation (the percentages relative to the sizes of respective datasets that passed ANARCI are shown in parentheses). The y-axis shows the multiplier of the original distribution of erroneous residue/positions in the Ig-seq datasets (see Fig. 3). The total number of ABOSS-filtered sequences (black) and the number of the correct sequences (gray) are pictured (the percentages are given in Supplemental Fig. 1). The overlapping region indicates the proportion of the correct sequences that passed ABOSS relative to the total number of ABOSS-filtered sequences.

Close modal

ABOSS also retained small numbers of sequences from UCB_H_Sim and Khan_R_Sim that were not present in the simulation-starting datasets (Fig. 4). These sequences are still structurally viable. The number of these sequences was larger in the UCB_H_Sim dataset (∼30%) than in the Khan_R_Sim dataset (∼17%) (Supplemental Fig. 1). As the residue error rate was increased, the simulation-starting datasets constituted larger proportions of the ABOSS-filtered UCB_H_Sim and Khan_R_Sim datasets (Fig. 4, Supplemental Fig. 1).

The outputs from these error simulations suggest that ABOSS performance becomes robust when either the Ig-seq data are redundant or more than ∼600 k of nonredundant sequences are available.

The SHM machinery of B cells increases Ab diversity by introducing nucleotide substitutions in the V region (36). SHM helps to fine-tune an Ab to its cognate epitope (37). These substitutions are known to exhibit uneven frequencies along the V region (33, 35). We exploited the 5-mer nucleotide, HH_S5F-targeting model of SHM (33) to examine the ability of ABOSS to flag errors while preserving SHM-generated diversity in Ig-seq datasets.

The model requires a clonal tree reference to estimate rates of substitutions. We used two distinct architectures of Ab clonal lineage trees (Lineage_A and Lineage_B) to construct such substitution matrixes. We used these two lineages to have coverage of the spectrum of SHM mutations, as Lineage_A has a low substitution rate, whereas Lineage_B has a high one. Using the HH_S5F model combined with either of the Lineage_A or Lineage_B SHM references, the simulations were performed on Healthy_H and UCB_H. These two datasets were selected to test ABOSS performance on low- (UCB_H) and high- (Healthy_H) redundancy data. The sequences with zero ABOSS-flagged residue/positions were used as the MRCA that were then employed as templates to which SHM mutations were introduced. The HH_S5F-targeting model (33) introduced roughly the same SHM substitution ratios along the VH region in the two lineage trees (Fig. 5, Supplemental Fig. 2). There was a biased increase in SHM substitutions in framework 3 and CDR regions and positions flanking the CDRs, similar to previous results (33, 35). As the HH_S5F model does not consider structural selection pressure on the H chain positions, the conserved cysteines were mutated, which resulted in the residue error rates of 0.002567 (Lineage_A) and 0.008 (Lineage_B) in UCB_H, and 0.002513 (Lineage_A) and 0.0076 (Lineage_B) in Healthy_H simulation datasets, respectively. The Lineage_A-produced residue error rates were within the observed range of human Ig-seq data, whereas the Lineage_B-generated residue error rates exceeded this range (Table II). ABOSS exhibited no preferential selection of unmutated germline V gene sequences over sequences that harbor SHM mutations in the Ig-seq data (Supplemental Fig. 3).

FIGURE 5.

ABOSS performance on SHM-simulated Ig-seq data diversity. Two Ab clonal lineage trees (Lineage_A and Lineage_B) were employed to provide the background mutational reference to introduce SHM substitutions into the ABOSS-filtered UCB_H dataset using the human HH_S5F–targeting model (33). The x-axis shows positions along the VH, and the y-axis shows the proportions of residue/positions in the simulation datasets. The figures on the left depict the proportion of SHM substitutions introduced at positions in the VH. The figures on the right represent the proportions of ABOSS-flagged residue/positions in the simulation datasets.

FIGURE 5.

ABOSS performance on SHM-simulated Ig-seq data diversity. Two Ab clonal lineage trees (Lineage_A and Lineage_B) were employed to provide the background mutational reference to introduce SHM substitutions into the ABOSS-filtered UCB_H dataset using the human HH_S5F–targeting model (33). The x-axis shows positions along the VH, and the y-axis shows the proportions of residue/positions in the simulation datasets. The figures on the left depict the proportion of SHM substitutions introduced at positions in the VH. The figures on the right represent the proportions of ABOSS-flagged residue/positions in the simulation datasets.

Close modal

Most of the HH_S5F-generated diversity was preserved by ABOSS analysis (Fig. 5). ABOSS flagged residue/positions uniformly along the VH, with the exception of CDR-H3, in which fewer residue/positions were flagged. These proportions of ABOSS flags are unrelated to the pattern of generated SHM substitutions, which has a strong bias toward framework 3 and CDR loops.

These results demonstrate that ABOSS is able to flag structurally nonviable residue/positions while preserving the majority of SHM substitutions. However, some true rare SHM substitutions may still be removed by ABOSS, as their positional presence in the V region is below the residue error rate. Therefore, highly SHM-altered Ig-seq datasets may have a higher proportion of true mutations incorrectly flagged as erroneous.

We compared ABOSS to IgReC, a computational Ig-seq error-correction tool. IgReC clusters and corrects PCR and sequencing errors in Ig-seq databased on sequence redundancy and homology. IgReC was recently benchmarked alongside other commonly used tools to error correct Ig-seq data (14). Its performance was considered comparable, if not better, than all other tools tested. IgReC relies on identification of clonotypes and sequence clustering.

We ran IgReC on the UCB_H and Healthy_H datasets, as IgReC requires full-length VH or VL sequences, and the HEPB and Khan_R datasets have truncated framework 1 regions. IgReC modifies sequences of Ig-seq datasets, making it difficult to carry out an overlap comparison with ABOSS. IgReC removed ∼1.5% of UCB_H and 8% of Healthy_H, but modified nearly 50 and 30% of the sequences to ones not seen in the original UCB_H and Healthy_H datasets, respectively (Table III).

Table III.
Interrogation of IgReC performance on UCB_H and Healthy_H
OutputsSequences Found in the Original Dataset (%)
ABOSS-filtered UCB_H 1,903,703 (1,561,082) 100 (100) 
IgReC-corrected UCB_H 5,572,963 (4,069,318) 51.5 (43.3) 
IgReC on ABOSS-filtered UCB_H 1,894,860 (1,320,392) 57.6 (48.1) 
ABOSS on Healthy_H 486,437 (176,012) 100 (100) 
IgReC-corrected Healthy_H 1,303,131 (367,238) 71.6 (60.4) 
IgReC on ABOSS-filtered Healthy_H 476,072 (61,281) 90.7 (87.9) 
OutputsSequences Found in the Original Dataset (%)
ABOSS-filtered UCB_H 1,903,703 (1,561,082) 100 (100) 
IgReC-corrected UCB_H 5,572,963 (4,069,318) 51.5 (43.3) 
IgReC on ABOSS-filtered UCB_H 1,894,860 (1,320,392) 57.6 (48.1) 
ABOSS on Healthy_H 486,437 (176,012) 100 (100) 
IgReC-corrected Healthy_H 1,303,131 (367,238) 71.6 (60.4) 
IgReC on ABOSS-filtered Healthy_H 476,072 (61,281) 90.7 (87.9) 

IgReC was run on the raw nucleotide UCB_H and Healthy_H datasets as well as the ABOSS-filtered data. IgReC-constructed datasets derived from the raw data contained roughly 50 and 30% of sequences that were different to ones found in the UCB_H and Healthy_H datasets, respectively. When tested on the ABOSS-filtered datasets, IgReC was unable to find V and J germline references for 8,843 sequences in ABOSS-filtered UCB_H and 10,365 in ABOSS-filtered Healthy_H. IgReC also generated ∼42 and ∼9% of sequences that were not present in the original UCB_H and Healthy_H datasets. The default parameters were used to run IgReC: ./igrec.py –s 〈reads.fasta〉 –l 〈IGH〉 –o 〈output_dir〉. The numbers of nonredundant values are shown in parentheses.

For both datasets, roughly 30% of the sequences in the IgReC-corrected set contained ABOSS-flagged residue/positions. The redundancy of sequences that did not pass ABOSS but are found in the IgReC-corrected data are lower than the average of the IgReC-corrected data (Supplemental Table I).

As the data above suggest that IgReC and ABOSS remove different sequences, ABOSS was run on the IgReC-corrected UCB_H and Healthy_H datasets. ABOSS filtered out 3,327,793 sequences (59.7%) from the IgReC-corrected UCB_H with a residue error rate of 0.0055. This error rate was very similar to that given by ABOSS for the original UCB_H dataset (see Table II). Among the IgReC-corrected UCB_H sequences filtered out by ABOSS, 37,671 (1.1%) sequences failed to pass ANARCI, whereas the rest contained ABOSS-flagged residue/positions, of which 120,264 (3.6%) sequences lacked conserved cysteines. Applying ABOSS to the IgReC-corrected Healthy_H dataset yielded a residue error rate of 0.0041, which filtered 685,536 sequences (52.6%). Of these filtered sequences, 143,862 (21%) sequences failed ANARCI parsing, and of the rest with flagged residue/positions, 33,529 (4.9%) lacked cysteines at positions 23 and/or 104. IgReC analysis does not appear to correct stop codons, as at least one was identified in ∼85.7% of the sequences that failed to pass ANARCI parsing from the IgRec-corrected Healthy_H dataset.

We then tested the reverse protocol, running IgReC on the ABOSS-filtered UCB_H and Healthy_H datasets. IgReC generates structurally incorrect sequences (∼0.01%) when it is run on the ABOSS filtered UCB_H dataset. Many of these sequences had a stop codon introduced by IgReC. IgReC also altered the sequences of over 40 and 9% of the data to ones that were absent in the original UCB_H and Healthy_H datasets, respectively (Table III).

Given this data, we would suggest that IgReC analysis can be enhanced by first using ABOSS to filter out structurally impossible Ig-seq data.

We also compared the results of ABOSS to two different experimental approaches, first to the work of Galson et al. (7). Their methodology of residue error estimation employs an analogous approach to ours. It is based on the proportion of nucleotide mismatches to the germline in the sequence-invariant C region, which was adjacent to the framework 4 region of the H chain (FW-H4). ABOSS analysis on their HEPB dataset estimated the residue error rate to be 0.2276%. This is in the agreement with the residue error rates estimated by Galson et al. (7), which ranged between 0.19 and 0.79%.

Secondly, we contrasted ABOSS with the experimental/computational error-correction protocol of Khan et al. (8). This method considers the entirety of the VH domain by applying barcodes to cDNA prior to sequencing, followed by clustering of identically barcoded sequences and error correction. The Khan_C dataset is the experimentally corrected version of the Khan_R dataset. In the process of this error-correction protocol, sequences are computationally modified (∼34% of Khan_C sequences have been altered from the sequences experimentally determined in the Khan_R dataset). This may be due to another sequence present in the Khan_R dataset (increasing redundancy from 3.7 in Khan_R to 45.3 in Khan_C) or to a new sequence altogether. This modification means that the redundancy of sequence changes and that 0.5% of nonredundant (0.02% of redundant) sequences in the Khan_C dataset are not present in the original Khan_R dataset. These sequence changes make comparison with ABOSS difficult, as within the ABOSS protocol,no sequences are altered.

ABOSS analysis on Khan_R selects a similar number of nonredundant sequences to Khan_C (∼50,000), but only ∼6000 of these sequences are directly observed in the Khan_C dataset (Table IV). In terms of redundant sequences, ABOSS selects a far smaller set. This reflects the fact that sequences have been modified to increase the redundancy of specific sequences in the Khan_C dataset. The redundant overlaps between the ABOSS-filtered Khan_R dataset and the Khan_C dataset are 36.8 and 89.6%, respectively (Table IV). Around 60% of Khan_C sequences are not seen in the ABOSS-filtered Khan_R dataset of these ∼1% that fail the ANARCI-parsing step (suggesting they would not produce viable Abs), 0.04% are not found in Khan_R, the others contain residue/positions that are ABOSS flagged as below the residue error rate. Those flagged by ABOSS include ∼0.3% redundant and ∼8% nonredundant sequences that lack a cysteine at either position 23 or 104.

Table IV.
Comparison analysis of ABOSS and the barcode approach of Khan et al. (8)
Data SourceDataset SizeRedundancyOverlap (%)
Khan_C 2,385,080 (52,623) 45.3 36.8 
ABOSS-filtered Khan_R 649,685 (47,593) 13.7 89.6 
Data SourceDataset SizeRedundancyOverlap (%)
Khan_C 2,385,080 (52,623) 45.3 36.8 
ABOSS-filtered Khan_R 649,685 (47,593) 13.7 89.6 

ABOSS was run on the Khan_R dataset. The ABOSS outputs were contrasted with the Khan_C dataset (see Table I for dataset information). The overlap represents the percentage of total sequences that are shared between the Khan_C and ABOSS-filtered Khan_R datasets. ABOSS appears to be more conservative than the barcode approach. The numbers of nonredundant sequences are shown in parentheses.

ABOSS provides orthogonal functionality to Ig-seq data error correction and can be used to complement the UMI barcode approach, an increasingly common practice in Ig-seq data analysis (38). Performance of the barcode approach is heavily dependent on drawing a consensus sequence from a pool of identically barcoded sequences. Two common problems in the barcode approach are when a large number of the barcoded sequences are singletons or several identically barcoded sequences share the highest redundancies in a cluster. These problems hamper the ability of the approach to correct data efficiently. ABOSS can be used prior to clustering to prevent all structurally nonviable sequences from becoming consensus sequences.

UMI barcodes are also used for accurate detection of template amplification and quantification biases in Ig-seq datasets (810, 39). This allows for the precise calculation of the amount and diversity of sequencing templates (8). In this scenario, ABOSS should not be run prior to the barcode-correction approach, as it is a conservative tool that always reduces the dataset size and never alters Ab sequences.

As an example of how ABOSS identifies potentially not structurally viable sequences that are not picked up by other techniques, Fig. 6 shows an example of an Ab sequence from the Khan_C dataset (8). This sequence is translated into amino acids in the first reading frame. This sequence cannot be structurally viable, as FW-H4 and the distal end of CDR-H3 do not align to the known IMGT amino acid germline. Translating this sequence into the second reading frame reveals a FW-H4 and a distal end of CDR-H3 that now align to the IMGT amino acid germline. This suggests that a single nucleotide insertion was introduced into CDR-H3.

FIGURE 6.

ABOSS flags structurally nonviable Ab sequences. (A) The distal part of the VH sequence of an Ab that is selected as correct in the Khan_C dataset. The closest germline matches of the V and J genes for this sequence are IGHV5-4*02 and IGHJ3*01, respectively. This sequence is shown in the first reading frame. The FW-H4 region and the distal end of CDR-H3 of this sequence do not align to an IMGT amino acid germline. (B) Translating this Ab sequence into the second reading frame creates FW-H4 and the distal end of CDR-H3 that align to the IMGT amino acid germline. (C) An arbitrary deletion of a nucleotide from the middle of the CDR-H3 region generates a structurally viable Ab sequence when it is translated in the first reading frame.

FIGURE 6.

ABOSS flags structurally nonviable Ab sequences. (A) The distal part of the VH sequence of an Ab that is selected as correct in the Khan_C dataset. The closest germline matches of the V and J genes for this sequence are IGHV5-4*02 and IGHJ3*01, respectively. This sequence is shown in the first reading frame. The FW-H4 region and the distal end of CDR-H3 of this sequence do not align to an IMGT amino acid germline. (B) Translating this Ab sequence into the second reading frame creates FW-H4 and the distal end of CDR-H3 that align to the IMGT amino acid germline. (C) An arbitrary deletion of a nucleotide from the middle of the CDR-H3 region generates a structurally viable Ab sequence when it is translated in the first reading frame.

Close modal

If we run ABOSS on Khan_C (the experimentally/computationally error-corrected set of Khan_R), the ANARCI-parsing step in conjunction with the check for conserved cysteines at positions 23 and 104 removed 11.5% of the unique sequences. These structurally impossible sequences correspond to 0.9% of the total redundant dataset (Fig. 7). The inability of ANARCI to align the full-length FW-H4 to the IMGT germline was the main cause for sequences from the Khan_C dataset to fail ANARCI parsing, as these sequences were considered to have a truncated FW-H4 region.

FIGURE 7.

Identification of structurally nonviable Ab sequences using first steps of ABOSS on the Khan_C datasets. Each sequence from the Khan_C (Table I) dataset is examined for structural viability using ABOSS. (A) The tabulated outputs gives the total number sequences that did not pass ANARCI. (B) The pie chart shows the percentage of sequences that fail the ANARCI step and those that lack a cysteine at position 23 and 104. The reasons include the following: 1) a sequence lacks a cysteine at position 23 or 104, 2) an indel is present in the framework or canonical CDR regions, 3) a non–amino acid residue is present in a sequence, or 4) a sequence does not align to the IMGT amino acid germlines of V or J genes.

FIGURE 7.

Identification of structurally nonviable Ab sequences using first steps of ABOSS on the Khan_C datasets. Each sequence from the Khan_C (Table I) dataset is examined for structural viability using ABOSS. (A) The tabulated outputs gives the total number sequences that did not pass ANARCI. (B) The pie chart shows the percentage of sequences that fail the ANARCI step and those that lack a cysteine at position 23 and 104. The reasons include the following: 1) a sequence lacks a cysteine at position 23 or 104, 2) an indel is present in the framework or canonical CDR regions, 3) a non–amino acid residue is present in a sequence, or 4) a sequence does not align to the IMGT amino acid germlines of V or J genes.

Close modal

These results demonstrate how leverage of our knowledge of Ig folding can help to filter data, even those that had been generated by a barcoding approach.

We have examined the robustness of the ABOSS protocol by running it on a dataset parsed by either ANARCI or IgBlastn (24), a sequence-centered Ig-seq data-processing tool. ANARCI parsing removed ∼9% more sequences than IgBlastn (Table V). However, if ABOSS is run on the ANARCI-parsed data or on data already passed by IgBlastn approximately, the same number of sequences are obtained. Examination of the sequences ANARCI removes and IgBlastn does not reveals that these sequences tend to not have a full-length framework 4 region or nothing at position 23 or had unusual indels in canonical CDR and framework regions. The ANARCI-parsed Healthy_H and Healthy_L datasets contained almost all (>99.98%) sequences that IgBlastn called productive. ABOSS analysis generated almost identical outputs for Healthy_H and Healthy_H_IgBlastn, whereas there was an increase (∼4%) in the number of Healthy_L_IgBlastn sequences compared with Healthy_L as a result of the slightly smaller residue error rate.

Table V.
IgBlastn and ANARCI parsing
Dataset NameStarting DatasetANARCI-Parsed DatasetABOSS-Filtered DatasetABOSS Residue Error Rate (%)
Healthy_H 1,422,405 (745,276) 1,135,185 (558,171) 486,437 (176,012) 0.5427 
Healthy_H_IgBlastn 1,228,129 (597,976) 1,135,116 (558,129) 486,429 (176,008) 0.5427 
Healthy_L 6,317,736 (2,135,745) 4,860,389 (1,372,804) 2,667,263 (386,165) 0.4121 
Healthy_L_IgBlastn 5,361,955 (1,539,964) 4,859,848 (1,372,519) 2,776,176 (386,146) 0.4118 
Dataset NameStarting DatasetANARCI-Parsed DatasetABOSS-Filtered DatasetABOSS Residue Error Rate (%)
Healthy_H 1,422,405 (745,276) 1,135,185 (558,171) 486,437 (176,012) 0.5427 
Healthy_H_IgBlastn 1,228,129 (597,976) 1,135,116 (558,129) 486,429 (176,008) 0.5427 
Healthy_L 6,317,736 (2,135,745) 4,860,389 (1,372,804) 2,667,263 (386,165) 0.4121 
Healthy_L_IgBlastn 5,361,955 (1,539,964) 4,859,848 (1,372,519) 2,776,176 (386,146) 0.4118 

Performance of IgBlastn and ANARCI parsing was investigated on two datasets, Healthy_H and Healthy_L (see Table I for more details). IgBlastn analysis was performed on the nucleotide sequences that were downloaded from the Observed Antibody Space (40). IgBlastn productive–called sequences were put into Healthy_H_IgBlastn and Healthy_L_IgBlastn datasets, respectively. ANARCI parsing was performed on the translated amino acid version of Healthy_H and Healthy_L (Table I). All four datasets were then subjected to ABOSS analysis.

ABOSS is an orthogonal redundancy–neutral method that uses structural information to calculate sequencing error rate estimates for Ig-seq datasets. The novelty of our approach is founded in the application of current knowledge of Ig folding to identify and flag potential errors in Ig-seq sequences.

ABOSS has been tested on six different Ig-seq datasets, ranging from 1,422,405 sequences to 9,985,575 sequences, which were generated by a variety of sequencing methods. The protocol is rapid and takes 10 h to analyze 5-m unique Ab sequences on a standard desktop computer. ABOSS calculated residue error rates agree well with experimental error rates that were available as in Galson et al. (7).

ABOSS identified 99% of correct Ab sequences in the simulated Ig-seq data when using dataset sizes and error rates matching those in the experiments. Decreasing the size of the simulation Ig-seq data did not affect the percentage of correct sequences recovered. Our simulation results suggest that even at far higher error rates, ABOSS performs well as long as either the redundancy is high or the dataset size is large enough (∼600-k unique sequences). The model selected to introduce in silico sequencing errors was based on Illumina technologies, in which nucleotide substitutions can happen stochastically along the VH. For Roche 454 datasets [now one of the most common sources of Ig-seq data (40)], nucleotide indel introduction along the chain is the main origin of sequencing errors, so further simulations of error might be necessary to understand the behavior of ABOSS.

We also ran ABOSS on Ig-seq data with computationally simulated SHM diversity to assess its ability to preserve true mutations. These simulations indicated that ABOSS was able to spot structurally incorrect residue/positions while preserving the SHM-generated diversity. It is hard to assess the accuracy of SHM substitutions introduced by the HH_S5F model in the structural context. In the functional Ab repertoire, there are a number of positions where SHM substitutions are not observed, in particular, positions 23 and 104, but are seen in the HH_S5F model (35). Therefore, SHM in functional genes has a negative reinforcement effect on the residue error rate, which will mean ABOSS is less likely to flag positions that harbor SHM substitutions.

The nature of ABOSS analysis is orthogonal to current Ig-seq-correction techniques; in particular, it does not alter sequences but rather removes those it considers to contain impossible structural features. Comparison to leading experimental and computational Ig-seq error-correction methods that do alter sequences shows that these approaches retain as well as create Ab sequences that are structurally nonviable (i.e., lack of cysteines at conserved position 23 and 104 or Ab regions that are out of the correct reading frame). These results suggest that ABOSS should be used alongside current state-of-art error-correction protocols to increase confidence of structural viability of Ig-seq sequences.

We thank Dominic F. Kelly, Jacob D. Galson, Simon Friedensohn, and Sai Reddy for providing the Ig-seq datasets as well as Jinwoo Leem from Oxford Protein Informatics Groups, whose comments have significantly improved the quality of our work. We thank Michael Wright for generating UCB_H and UCB_L Ig-seq data.

This work was supported by funding from the Biotechnology and Biological Sciences Research Council (BB/M011224/1) and UCB Pharma Ltd., awarded to A.K.

The online version of this article contains supplemental material.

Abbreviations used in this article:

ABOSS

Ab Sequence Selector

CDR-H3

CDR-3 of the V–H chain domain

FW-H4

framework 4 region of the H chain

Ig-seq

next-generation sequencing of the Ig gene repertoire

IMGT

ImMunoGeneTics

m

million

MRCA

most-recent common ancestor

SHM

somatic hypermutation

UMI

unique molecular identifier

V

variable

VH

V–H chain

VL

V–L chain.

1
Collis
,
A. V. J.
,
A. P.
Brouwer
,
A. C. R.
Martin
.
2003
.
Analysis of the antigen combining site: correlations between length and sequence composition of the hypervariable loops and the nature of the antigen.
J. Mol. Biol.
325
:
337
354
.
2
Reichert
,
J. M.
2017
.
Antibodies to watch in 2017.
MAbs
9
:
167
181
.
3
Strohl
,
W. R.
2018
.
Current progress in innovative engineered antibodies.
Protein Cell
9
:
86
120
.
4
Georgiou
,
G.
,
G. C.
Ippolito
,
J.
Beausang
,
C. E.
Busse
,
H.
Wardemann
,
S. R.
Quake
.
2014
.
The promise and challenge of high-throughput sequencing of the antibody repertoire.
Nat. Biotechnol.
32
:
158
168
.
5
Parola
,
C.
,
D.
Neumeier
,
S. T.
Reddy
.
2018
.
Integrating high-throughput screening and sequencing for monoclonal antibody discovery and engineering.
Immunology.
153
:
31
41
.
6
Friedensohn
,
S.
,
T. A.
Khan
,
S. T.
Reddy
.
2017
.
Advanced methodologies in high-throughput sequencing of immune repertoires.
Trends Biotechnol.
35
:
203
214
.
7
Galson
,
J. D.
,
J.
Trück
,
A.
Fowler
,
M.
Münz
,
V.
Cerundolo
,
A. J.
Pollard
,
G.
Lunter
,
D. F.
Kelly
.
2015
.
In-depth assessment of within-individual and inter-individual variation in the B cell receptor repertoire.
Front. Immunol.
6
:
531
.
8
Khan
,
T. A.
,
S.
Friedensohn
,
A. R.
Gorter de Vries
,
J.
Straszewski
,
H.-J.
Ruscheweyh
,
S. T.
Reddy
.
2016
.
Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting.
Sci. Adv.
2
:
e1501371
.
9
Turchaninova
,
M. A.
,
A.
Davydov
,
O. V.
Britanova
,
M.
Shugay
,
V.
Bikos
,
E. S.
Egorov
,
V. I.
Kirgizova
,
E. M.
Merzlyak
,
D. B.
Staroverov
,
D. A.
Bolotin
, et al
.
2016
.
High-quality full-length immunoglobulin profiling with unique molecular barcoding.
Nat. Protoc.
11
:
1599
1616
.
10
Shugay
,
M.
,
O. V.
Britanova
,
E. M.
Merzlyak
,
M. A.
Turchaninova
,
I. Z.
Mamedov
,
T. R.
Tuganbaev
,
D. A.
Bolotin
,
D. B.
Staroverov
,
E. V.
Putintseva
,
K.
Plevova
, et al
.
2014
.
Towards error-free profiling of immune repertoires.
Nat. Methods
11
:
653
655
.
11
Kuchenbecker
,
L.
,
M.
Nienen
,
J.
Hecht
,
A. U.
Neumann
,
N.
Babel
,
K.
Reinert
,
P. N.
Robinson
.
2015
.
IMSEQ--a fast and error aware approach to immunogenetic sequence analysis.
Bioinformatics
31
:
2963
2971
.
12
Yu
,
Y.
,
R.
Ceredig
,
C.
Seoighe
.
2016
.
LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins.
Nucleic Acids Res.
44
:
e31
.
13
Bolotin
,
D. A.
,
S.
Poslavsky
,
I.
Mitrophanov
,
M.
Shugay
,
I. Z.
Mamedov
,
E. V.
Putintseva
,
D. M.
Chudakov
.
2015
.
MiXCR: software for comprehensive adaptive immunity profiling.
Nat. Methods
12
:
380
381
.
14
Shlemov
,
A.
,
S.
Bankevich
,
A.
Bzikadze
,
M. A.
Turchaninova
,
Y.
Safonova
,
P. A.
Pevzner
.
2017
.
Reconstructing antibody repertoires from error-prone immunosequencing reads.
J. Immunol.
199
:
3369
3380
.
15
Lefranc
,
M.-P.
,
C.
Pommié
,
M.
Ruiz
,
V.
Giudicelli
,
E.
Foulquier
,
L.
Truong
,
V.
Thouvenin-Contet
,
G.
Lefranc
.
2003
.
IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains.
Dev. Comp. Immunol.
27
:
55
77
.
16
Glockshuber
,
R.
,
T.
Schmidt
,
A.
Plückthun
.
1992
.
The disulfide bonds in antibody variable domains: effects on stability, folding in vitro, and functional expression in Escherichia coli.
Biochemistry
31
:
1270
1279
.
17
Lefranc
,
M. P.
2003
.
IMGT, the international ImMunoGeneTics database.
Nucleic Acids Res.
31
:
307
310
.
18
Hagihara
,
Y.
,
D.
Saerens
.
2014
.
Engineering disulfide bonds within an antibody.
Biochim. Biophys. Acta
1844
:
2016
2023
.
19
Koenig
,
P.
,
C. V.
Lee
,
B. T.
Walters
,
V.
Janakiraman
,
J.
Stinson
,
T. W.
Patapoff
,
G.
Fuh
.
2017
.
Mutational landscape of antibody variable domains reveals a switch modulating the interdomain conformational dynamics and antigen binding.
Proc. Natl. Acad. Sci. USA
114
:
E486
E495
.
20
Rudikoff
,
S.
,
J. G.
Pumphrey
.
1986
.
Functional antibody lacking a variable-region disulfide bridge.
Proc. Natl. Acad. Sci. USA
83
:
7875
7878
.
21
Wörn
,
A.
,
A.
Plückthun
.
1998
.
Mutual stabilization of VL and VH in single-chain antibody fragments, investigated with mutants engineered for stability.
Biochemistry
37
:
13120
13127
.
22
Proba
,
K.
,
A.
Honegger
,
A.
Plückthun
.
1997
.
A natural antibody missing a cysteine in VH: consequences for thermodynamic stability and folding.
J. Mol. Biol.
265
:
161
172
.
23
Auffray
,
C.
,
J. L.
Sikorav
,
R.
Ollo
,
F.
Rougeon
.
1981
.
Correlation between D region structure and antigen-binding specificity: evidences from the comparison of closely related immunoglobulin VH sequences.
Ann. Immunol. (Paris)
132D
:
77
88
.
24
Ye
,
J.
,
N.
Ma
,
T. L.
Madden
,
J. M.
Ostell
.
2013
.
IgBLAST: an immunoglobulin variable domain sequence analysis tool.
Nucleic Acids Res.
41
(
Web server issue
):
W34
W40
.
25
Dunbar
,
J.
,
C. M.
Deane
.
2016
.
ANARCI: antigen receptor numbering and receptor classification.
Bioinformatics
32
:
298
300
.
26
He
,
L.
,
D.
Sok
,
P.
Azadnia
,
J.
Hsueh
,
E.
Landais
,
M.
Simek
,
W. C.
Koff
,
P.
Poignard
,
D. R.
Burton
,
J.
Zhu
.
2014
.
Toward a more accurate view of human B-cell repertoire by next-generation sequencing, unbiased repertoire capture and single-molecule barcoding.
Sci. Rep.
4
:
6778
.
27
Zemlin
,
M.
,
M.
Klinger
,
J.
Link
,
C.
Zemlin
,
K.
Bauer
,
J. A.
Engler
,
H. W.
Schroeder
Jr.
,
P. M.
Kirkham
.
2003
.
Expressed murine and human CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their amino acid composition and predicted range of structures.
J. Mol. Biol.
334
:
733
749
.
28
Kovaltsuk
,
A.
,
K.
Krawczyk
,
J. D.
Galson
,
D. F.
Kelly
,
C. M.
Deane
,
J.
Trück
.
2017
.
How B-cell receptor repertoire sequencing can Be enriched with structural antibody data.
Front. Immunol.
8
:
1753
.
29
Galson
,
J. D.
,
J.
Trück
,
A.
Fowler
,
E. A.
Clutterbuck
,
M.
Münz
,
V.
Cerundolo
,
C.
Reinhard
,
R.
van der Most
,
A. J.
Pollard
,
G.
Lunter
,
D. F.
Kelly
.
2015
.
Analysis of B cell repertoire dynamics following hepatitis B vaccination in humans, and enrichment of vaccine-specific antibody sequences.
EBioMedicine
2
:
2070
2079
.
30
Galson
,
J. D.
,
J.
Trück
,
E. A.
Clutterbuck
,
A.
Fowler
,
V.
Cerundolo
,
A. J.
Pollard
,
G.
Lunter
,
D. F.
Kelly
.
2016
.
B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation. [Published erratum appears in 2016 Genome Med. 8: 81.]
Genome Med.
8
:
68
.
31
Krawczyk
,
K.
,
S.
Kelm
,
A.
Kovaltsuk
,
J. D.
Galson
,
D.
Kelly
,
J.
Trück
,
C.
Regep
,
J.
Leem
,
W. K.
Wong
,
J.
Nowak
, et al
.
2018
.
Structurally mapping antibody repertoires.
Front. Immunol.
9
:
1698
.
32
Vander Heiden
,
J. A.
,
P.
Stathopoulos
,
J. Q.
Zhou
,
L.
Chen
,
T. J.
Gilbert
,
C. R.
Bolen
,
R. J.
Barohn
,
M. M.
Dimachkie
,
E.
Ciafaloni
,
T. J.
Broering
, et al
.
2017
.
Dysregulation of B cell repertoire formation in myasthenia gravis patients revealed through deep sequencing.
J. Immunol.
198
:
1460
1473
.
33
Yaari
,
G.
,
J. A.
Vander Heiden
,
M.
Uduman
,
D.
Gadala-Maria
,
N.
Gupta
,
J. N.
Stern
,
K. C.
O’Connor
,
D. A.
Hafler
,
U.
Laserson
,
F.
Vigneault
,
S. H.
Kleinstein
.
2013
.
Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data.
Front. Immunol.
4
:
358
.
34
Proba
,
K.
,
A.
Wörn
,
A.
Honegger
,
A.
Plückthun
.
1998
.
Antibody scFv fragments without disulfide bonds made by molecular evolution.
J. Mol. Biol.
275
:
245
253
.
35
Sheng
,
Z.
,
C. A.
Schramm
,
R.
Kong
,
J. C.
Mullikin
,
J. R.
Mascola
,
P. D.
Kwong
,
L.
Shapiro
;
NISC Comparative Sequencing Program
.
2017
.
Gene-specific substitution profiles describe the types and frequencies of amino acid changes during antibody somatic hypermutation.
Front. Immunol.
8
:
537
.
36
Peled
,
J. U.
,
F. L.
Kuang
,
M. D.
Iglesias-Ussel
,
S.
Roa
,
S. L.
Kalis
,
M. F.
Goodman
,
M. D.
Scharff
.
2008
.
The biochemistry of somatic hypermutation.
Annu. Rev. Immunol.
26
:
481
511
.
37
Di Noia
,
J. M.
,
M. S.
Neuberger
.
2007
.
Molecular mechanisms of antibody somatic hypermutation.
Annu. Rev. Biochem.
76
:
1
22
.
38
Miho
,
E.
,
A.
Yermanos
,
C. R.
Weber
,
C. T.
Berger
,
S. T.
Reddy
,
V.
Greiff
.
2018
.
Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires.
Front. Immunol.
9
:
224
.
39
Friedensohn
,
S.
,
J. M.
Lindner
,
V.
Cornacchione
,
M.
Iazeolla
,
E.
Miho
,
A.
Zingg
,
S.
Meng
,
E.
Traggiai
,
S. T.
Reddy
.
2018
.
Synthetic standards combined with error and bias correction improve the accuracy and quantitative resolution of antibody repertoire sequencing in human naïve and memory B cells.
Front. Immunol.
9
:
1401
.
40
Kovaltsuk
,
A.
,
J.
Leem
,
S.
Kelm
,
J.
Snowden
,
C. M.
Deane
,
K.
Krawczyk
.
2018
.
Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires.
J. Immunol.
201
:
2502
2509
.

The authors have no financial conflicts of interest.

Supplementary data