Transforming error-prone immunosequencing datasets into Ab repertoires is a fundamental problem in immunogenomics, and a prerequisite for studies of immune responses. Although various repertoire reconstruction algorithms were released in the last 3 y, it remains unclear how to benchmark them and how to assess the accuracy of the reconstructed repertoires. We describe an accurate IgReC algorithm for constructing Ab repertoires from high-throughput immunosequencing datasets and a new framework for assessing the quality of reconstructed repertoires. Surprisingly, Ab repertoires constructed by IgReC from barcoded immunosequencing datasets in the blind mode (without using information about unique molecular identifiers) improved upon the repertoires constructed by the state-of-the-art tools that use barcoding. This finding suggests that IgReC may alleviate the need to generate repertoires using the barcoding technology (the workhorse of current immunogenomics efforts) because our computational approach to error correction of immunosequencing data is nearly as powerful as the experimental approach based on barcoding.
Recent progress in sequencing technologies enabled high-throughput generation of full-length Ab sequences and opened a possibility to assess the diversity of Ab repertoires in various species (1–4). However, transforming an error-prone immunosequencing dataset into an accurate Ab repertoire is challenge that is a prerequisite for a multitude of downstream studies of adaptive immune systems (Fig. 1), including reconstruction of clonal lineages (5, 6), analysis of immune response dynamics (7–9), analysis of recombination events and secondary diversification (10, 11), immunoproteogenomics (12–15), and population analysis of Ig germline segments (16).
The standard immunosequencing (Rep-seq) protocols include an amplification step that introduces amplification errors that can be propagated by consecutive amplification cycles, thus generating pseudo-diversity of an Ab repertoire (17, 18). These errors are further compounded by sequencing errors in reads (estimated at 0.5% per base on average in Illumina reads), leading to error-prone Rep-seq datasets, which trigger errors in the constructed repertoires and complicate downstream analysis of Abs. Thus, error correction with follow-up repertoire construction is a prerequisite for any downstream analysis of immunosequencing data.
Construction of full-length Ab repertoires is a more difficult computational problem than those of the well-studied TCR repertoire construction (19, 20), V(D)J classification (11, 21, 22), and CDR3 classification (23–26). In fact, V(D)J classification, CDR3 classification, and full-length repertoire construction represent three different decompositions of immunosequencing reads with gradually increasing granularity. V(D)J classification decomposes reads according to their V, D, and J segments, and provides basic information about the diversity of a repertoire. CDR3 classification decomposes reads based on their CDR3s but ignores hypermutations outside CDR3s. In contrast, full-length repertoire construction takes into account both CDR3 and all hypermutations, and attempts to remove artificial diversity caused by sequencing and amplification errors.
Recently, several tools for constructing and annotating full-length Ab repertoires have been developed, including MiGEC (27), pRESTO (28), MiXCR (29), and IgRepertoireConstructor (14). However, these tools all have limitations, making it difficult to benchmark them. For example, pRESTO groups together identical reads and removes low-abundance clusters in the constructed repertoire without performing their error-correction. As a result, although pRESTO usually reports accurate high-abundance clusters, it underestimates the diversity of the repertoires. MiXCR also groups together identical reads but instead of removing low-abundance clusters, it error-corrects them using high-abundance clusters. As a result, MiXCR reports more diverse repertoires than pRESTO, but may fail to construct repertoires from Rep-seq libraries with high error rates.
Benchmarking various repertoire construction algorithms remains a challenge because it is unclear how to construct the reference Ab repertoire and evaluate the quality of the constructed repertoire. Benchmarking various computational tools in genomics would not be possible without the development of specialized quality assessment tools aimed at various applications. For example, benchmarking of assembly tools would not be possible without the development of such tools for evaluating genomics [QUAST in (30)], metagenomics [metaQUAST in (31)], and transcriptomics [rnaQUAST in (32)] assemblies. Similar to other areas of genomics, developing a benchmarking framework for repertoire reconstruction is a prerequisite for objective evaluation of the state-of-art immunoinformatics algorithms. However, assessing the quality of Ab repertoires remains a poorly addressed challenge, making it difficult to compare various repertoire construction tools.
Unique molecular identifier barcoding technology (33, 34) allows one to analyze low-abundant receptor sequences and error-correct amplification errors (27, 35–38). Barcoding and nonbarcoding protocols have different requirements with respect to repertoire construction algorithms, e.g., although aggressive error-correction works well for barcoded Rep-seq datasets, it results in overcorrection and loss of natural diversity for nonbarcoded datasets.
We present the IgReC tool for Ab repertoire construction from both barcoded and nonbarcoded immunosequencing reads and the IgQUAST tool for quality assessment of Ab repertoires. We demonstrate that accurate repertoires constructed by IgReC from barcoded Rep-seq datasets in the blind mode (without using barcoding information) improved on the repertoires constructed by the state-of-the-art tools that use barcoding information. This surprising finding suggests that advanced repertoire construction algorithms may reduce experimental effort by alleviating the need to generate barcoded repertoires.
Materials and Methods
The IgReC tool addresses the deficiencies of IgRepertoireConstructor (14), which is limited to nonbarcoded data and is prohibitively time and memory consuming for large immunosequencing datasets. IgReC generates an Ab repertoire by partitioning error-prone immunosequencing reads (covering the entire variable regions of Igs) into clusters (Fig. 2). The goal is to place reads from the same Ab into the same cluster, while placing reads from different Abs into different clusters. This results in a difficult clustering problem because reads from the same Ab differ by sequencing and amplification errors, and the number of these errors often exceeds the number of differences between Abs from different clusters. We define the Ab sequence as the consensus of reads in a cluster, and its abundance as the number of reads in a cluster.
We define the distance between two aligned reads (of possibly different lengths) as the distance between the shorter read and the prefix of the longer read of the same length. Two reads are called similar if the distance between them does not exceed a distance threshold, τ. IgReC constructs the Hamming graph on the set of all reads as vertices and connects two vertices (reads) in the Hamming graph by an edge if they are similar (14, 39, 40).
Analysis of connected components of the Hamming graphs of various immunosequencing datasets revealed that they typically consist of dense (complete or nearly complete) subgraphs connected by very few edges (14). It further revealed that most vertices in each dense subgraph correspond to reads derived from a single Ab (simple dense subgraphs) or from similar Abs differing from each other by a small number of somatic hypermutations (SHMs) (composite dense subgraphs). Our goal is to generate a list of dense subgraphs in the Hamming graph and to further break composite dense subgraphs into subgraphs corresponding to single Abs (Fig. 2).
As a successor of IgRepertoireConstructor, IgReC also builds the Hamming graph (HG Constructor module), partitions its vertices into clusters corresponding to dense subgraphs, and further breaks down composite dense subgraphs as described in (14). IgReC includes three new tools (BarcodedIgReC, IgQUAST, and IgDiversityAnalyzer) and the following new steps that were missing in IgRepertoireConstructor: 1) fast V(D)J classification of reads to filter out contaminated reads and crop the remaining reads by the start of their V segment, and 2) HG Constructor, a novel approach for fast Hamming graph construction based on the minimizers algorithm described below.
Fast V(D)J classification
Because immunosequencing datasets often contain various transcripts from non-Ig genes, IgReC first aligns all reads against the database of Ig germline segments and filters out the unaligned reads. Instead of performing full-scale V(D)J labeling (which requires time-consuming identification of recombination events), IgReC performs V and J labeling based on a fast algorithm for finding the longest subsequence of k-mers between reads and immune segments (the D segment is assigned based on the best match to the database of the germline D segments). After filtering contaminated reads, the remaining reads are realigned to the starting position of their corresponding V segment to bypass the time-consuming computation of extended Hamming distances implemented in IgRepertoireConstructor.
Limitations of previous approaches for Hamming graph construction
The time- and space-efficient construction of large Hamming graphs is a challenging problem that was addressed in Hammer (40) and BayesHammer (41). However, these approaches construct Hamming graphs on the set of all k-mers from reads (for small k), rather than the set of all reads as required for solving the repertoire construction problem. IgRepertoireConstructor (14) is the first read-based algorithm for constructing the Hamming graph, but it becomes too slow in the case of large clonally expanded Rep-seq datasets. Below we describe a novel HG Constructor algorithm for efficient construction of the read-based Hamming graphs.
Because brute force computing of distances between all members of a large set of strings is prohibitively time consuming, bioinformaticians use various filtration techniques to reduce the number of comparisons. A popular filtration approach is based on the fact that two strings of length L differing in most τ positions share at least one l-mer of length, L/(τ + 1) (42). Thus, one can index reads by their l-mers, find all pairs of reads that share an l-mer, and filter out all other pairs of reads (43, 44). This filtration step is followed by a verification step to find the distances between all pairs of reads sharing an l-mer.
Although this filtration approach works well for genome comparison (44) and read mapping (45), it becomes prohibitively time consuming in the case of immunosequencing reads, as different Abs often share l-mers, thus reducing the efficiency of the filtration step. Because all Abs originating from the same V segment often share an l-mer from this segment, the filtration approach will be forced to conduct the verification step for nearly all pairs of reads arising from the same V segment. As a result, although this filtration approach was implemented in IgRepertoireConstructor and applied to small immunosequencing datasets, analyzing large hypermutated repertoires turned out to be very time consuming.
To address this problem, we have developed a new filtration strategy based on the following idea: instead of using all l-mers for filtering [for l = L/(τ + 1)], HG Constructor uses some carefully selected k-mers [for k ≤ L/(τ + 1)], with the goal of removing k-mers shared by many Abs from the filtering process. The minimizer approach is based on the following observation about two strings of length L differing in at mostτ positions: given a set of τ + 1 nonoverlapping k-mers in the first string, at least one of them appears in the second string. Specifically, we select τ + 1 rarest and nonoverlapping k-mers (called minimizers) in each read, find all reads that contain one of the selected minimizers, and further apply the verification step to the found reads (Supplemental Fig. 1). The notion of a rare k-mer is defined with respect to its multiplicity, i.e., the number of reads containing this k-mer. To find all neighbors of a read R in the Hamming graph, we compute the distance between R and all reads containing at least one minimizer of R (IgReC uses the default values k = 10 and τ = 4). Note that even for a small k, the minimizer approach provides the exact solution of the Hamming graph construction problem. The number of computations of the Hamming distance for a given read is equal to the total multiplicity of minimizers for this read. See Supplemental Fig. 2 for the running time of HG Constructor.
The histogram of positions of rare k-mers reveals that the minimizers algorithm often selects k-mers from the complementary determining regions (CDR1, CDR2, and CDR3), which represent the most diverged parts of the variable regions of Abs (Supplemental Fig. 2). However, simply selecting three k-mers from CDR1, CDR2, and CDR3 for τ = 2 is not a good filtration approach because using small values of τ often results in breaking the connected components of the Hamming graph corresponding to a single Ab. Also, sequencing errors may turn CDRs into unique sequences, even for reads arising from the same Ab.
Molecular barcoding and amplification artifacts
As an alternative to computational error correction of immunosequencing datasets, biologists use unique molecular barcoding (33, 34), an experimental approach that allows one to correct most amplification and sequencing errors, thus enabling analysis of low-abundance receptor sequences (9, 27, 35–38, 46). This approach is based on attaching a short unique barcode (12−17 nt) to each RNA molecule, so that all amplified copies of this molecule contain the same barcode. Because all reads corresponding to the same RNA molecule contain the same barcode, errors in reads can be corrected by constructing the consensus of all error-prone reads with the same barcode. However, barcode errors, barcode collisions, and chimeric reads lead to new computational challenges addressed by the MiGEC (27) and pRESTO (28) tools. These challenges are particularly pronounced in the case of overamplified immunosequencing datasets resulting from multiple cycles of amplification that are used for sequencing low-abundance Abs.
Barcode errors, barcode collisions, and chimeric reads
Some reads in barcoded Rep-seq datasets have amplification errors within barcodes. Our analysis revealed that up to 10% of the clusters in the constructed repertoire of overamplified Rep-seq datasets contain erroneous barcodes (Supplemental Fig. 3). Ignoring barcoding errors leads to an explosion of small (erroneous) clusters that should be error-corrected and merged with larger (correct) ones.
In an ideal experiment, barcodes of length k are randomly and uniformly generated from the set of all 4k k-mers. Thus, if barcodes are applied to a set of n Ig molecules, we expect only [n × (n − 1)]/(2 × 4k) pairs of them to have barcode collisions, i.e., share a barcode (47). However, because there are biases in barcode generation, barcodes are not uniformly sampled from the set of all k-mers, thus increasing the number of barcode collisions. Although the barcoded datasets analyzed in this paper have a rather low number of collisions (typically, <10 for a Rep-seq library consisting of a million reads), some barcoded datasets have an order of magnitude more collisions. Ignoring barcode collisions may lead to a loss of natural diversity and corruption of the consensus sequences.
Some immunosequencing sample preparation protocols produce up to ∼8% of chimeric reads (Supplemental Fig. 3). Failure to detect chimeric reads reduces the accuracy of the consensus sequences in the constructed repertoire and may lead to artificially inflated diversity. Most chimeric reads (resulting from gluing two distinct Ig molecules during amplification) are filtered out after V(D)J alignment because they contain only partial V or J segments. However, some chimeric reads look like legitimate Ab sequences and thus may artificially inflate the diversity of the constructed repertoire. Although most chimeric reads form small clusters, simply discarding small clusters does not address the problem because some chimeric reads form clusters with hundreds of reads. Although it is difficult to recognize chimeric reads in nonbarcoded datasets, they can be identified using information about barcodes.
pRESTO does not filter chimeric reads and does not correct for errors in the barcodes, which can lead to abundant erroneous barcodes being assigned into distinct clusters. In contrast, MiGEC (in the default setting) aggressively discards barcodes with a low-quality consensus. As the result, repertoires constructed by pRESTO often retain false clusters for barcode consensus sequence generation, whereas repertoires constructed by MiGEC often have reduced diversity. Below we describe how IgReC addresses these challenges using the BarcodedIgReC tool.
BarcodedIgReC consists of the following steps (Fig. 2). First, to resolve barcode collisions, all reads are grouped by barcodes and clustered inside each group (first column). Second, to resolve barcode errors, clusters for similar barcodes are merged if their consensus sequences are similar (second column). Third, chimeric reads are detected and removed (third column). Finally, the standard IgReC pipeline (for nonbarcoded data) is applied to merge clusters with similar sequences (fourth column). Because IgReC at the final step is applied to highly accurate sequences, we use a small distance threshold equal to two for constructing the Hamming graph.
BarcodedIgReC clusters reads sharing the same barcode and computes the consensus for each constructed cluster. We use a large edit distance threshold for clustering to correct most amplification and sequencing errors and, at the same time, separate barcode collisions (see Supplemental Fig. 3 for details). We use the edit distance (rather than the Hamming distance), because barcoded Rep-seq datasets have amplification errors that include both indels and mismatches. To correct barcode errors, we construct the Hamming graph on the barcodes and merge clusters with similar consensus sequences for barcodes from the same connected components in the Hamming graph. If there are several clusters for a single connected component of a Hamming graph on barcodes, we also check them for chimerism (Fig. 3C).
Tools like QUAST (30) benchmark various assembly algorithms against the reference genomes. However, in many areas of genomics, reference genomes remain unknown, e.g., metaQUAST (31) evaluates metagenomics assemblies without references. Similar to metagenomics, reference Ab repertoires for Rep-seq datasets are not available.
To assess the quality of the reconstructed repertoires, we developed the IgQUAST tool that works in the unknown reference and the known reference modes (Fig. 2). If the reference repertoire is unknown (usual situation), IgQUAST reports various statistics of the cluster abundances. In the rare cases when the reference repertoire is known, it evaluates how the constructed repertoire differs from the reference repertoire and analyzes overcorrected clusters. More details of IgQUAST are provided at http://www.cab.spbu.ru/software/ig-repertoire-constructor.
To construct reference repertoires, we use simulated Rep-seq datasets generated by IgSimulator (48). However, because IgSimulator does not adequately model some intricate properties of real Rep-seq dаtasets, we complemented simulated datasets by the repertoires derived from barcoded Rep-seq data in the blind mode, i.e., without using the barcoding information.
Sensitivity and precision of a repertoire
A cluster in the constructed repertoire is correct if its consensus sequence is present in the reference repertoire. Given the constructed and the reference repertoires of sizes Ncon and Nref, we define the sensitivity as Ncorr/Nref and the precision as Ncorr/Ncon, where Ncorr is the number of correct clusters (Fig. 2). Low sensitivity often results from overcorrection: combining multiple clusters from the reference repertoire into a single cluster in the constructed repertoire. Low precision often results from undercorrection: splitting a single cluster from the reference repertoire into multiple clusters in the constructed repertoire.
Because cluster abundances in a repertoire can often be approximated by the power law distribution, there are typically a small number of large clusters and many small clusters (49). Although the consensus sequences of large clusters are typically accurate, the consensus sequences of small clusters often have errors (even in the case of correctly constructed clusters), e.g., the consensus sequence of a cluster of size two is often corrupted if the reads contributing to this cluster have errors. Thus, small clusters should be excluded from computing sensitivity and precision because these parameters are computed for exact matches between consensus sequences in the constructed and reference repertoires. Also, the constructed repertoires are often more diverse than the reference repertoire because a large cluster in the reference repertoire typically corresponds to a large cluster with the correct consensus sequence and multiple small clusters with erroneous consensus sequences caused by sequencing and amplification errors.
To exclude small clusters from computing sensitivity and precision, we consider only clusters of size at least minsizeref in the reference repertoire and of size at least minsizecon in the constructed repertoire (the default value minsizeref = 5). Because sensitivity + precision is optimized at minsizecon = 5 for nearly all datasets we analyzed (Fig. 3, left), we set the default value at minsizecon = 5. We classify a cluster in the constructed repertoire as large if it contains at least minsizecon reads, and as small otherwise.
Analyzing abundances of Abs with IgQUAST
To compare various repertoire construction tools, IgQUAST compares abundances of all correct clusters in the reference and constructed repertoires (Fig. 3, right). It turned out that most abundances are underestimated (i.e., the abundance of the constructed cluster is smaller than the abundance of the reference cluster) because repertoire construction tools often split a reference cluster into a large cluster with the correct consensus and multiple small erroneous clusters (Supplemental Fig. 4).
In addition to reporting statistics of SHMs and analyzing nucleotide and amino acid content of CDRs, IgDiversityAnalyzer (Fig. 2) complements IMGT/HighV-QUEST (50), MiXCR (29), VDJtools (51), and Alakazam (5) tools by reporting the Simpson index (SI) and the clonal SI (CSI) of the repertoire.
The SI, originally proposed for analyzing diversity of populations (52), is the probability that two randomly selected individuals belong to the same population. Given a probability distribution (p1,…, pn), where pi represents the fraction of the i-th population, the SI is defined as . IgDiversityAnalyzer uses CDR3 abundances (the sum of abundances of clusters with the same CDR3) and normalizes them to generate the probability distribution for computing the SI. Diverse repertoires are characterized by a low SI, whereas repertoires dominated by a few abundant receptor sequences are characterized by a high SI.
IgDiversityAnalyzer also computes the CSI to reveal similarities between CDR3s in a repertoire. It constructs the Hamming graph on CDR3 sequences with a small distance threshold (the default value τ = 3) and computes abundance of each connected component as a sum of its CDR3 abundances. The CSI is defined as the SI on abundances of the connected components in the constructed Hamming graph. The CSI can be interpreted as a probability that two randomly selected Ab sequences belong to the same clonal lineage. The higher the CSI/SI ratio, the higher the clonal diversity of a repertoire, as compared with its recombination diversity.
We characterize each repertoire by its recombination (two parameters), expression (three parameters), and clonality levels (two parameters), using the following statistics: recombination level: number of unique CDR3s and VJ pairs; expression level: max cluster size, max CDR3 abundance, and SI; clonality level: CSI and the ratio CSI/SI.
We benchmarked IgReC (Figs. 1, 2), MiXCR and pRESTO using: 1) reference repertoires and corresponding Rep-seq libraries simulated by IgSimulator (48); 2) reference repertoires derived from barcoded Rep-seq repertoires with follow-up benchmarking in the blind mode; i.e., without using information about barcodes; and 3) multiple immunosequencing datasets from various B cell subtypes using a reference-free approach for evaluating repertoires. Because these datasets have widely varying levels of diversity, we analyzed how various repertoire construction tools preserve natural diversity.
IgSimulator uses the ART Illumina read simulator (53) to generate a sequencing library along with a simulated repertoire. However, this and other read simulators create reads with typical error rates in Illumina reads (∼1% of incorrect bases per read) and do not account for elevated rates of amplification errors in Rep-seq datasets. In contrast, some Rep-seq datasets have error rates as low as 0.25% (14). We thus simulated Rep-seq datasets with both low and high error rates by implanting random mismatches into randomly chosen positions in sequences from the reference repertoire.
Top, Properties of PBMC, ASC−, and ASC+ cells. Rows corresponding to recombination, expression, and clonality levels are colored by various shades of blue, green, and red, respectively. High and low levels of these metrics are presented in dark and light shades. Classification into low, medium, and high serves for comparative analysis of various features low < medium < high rather than exact numerical bounds. Middle, Diversity metrics for PBMC, ASC−, and ASC+ datasets. The shown percentages represent percentage of clusters with unique CDR3s among all large clusters, percentage of detected VJ pairs among all possible VJ pairs, percentage of reads in the largest cluster among all reads in large clusters, and percentage of reads with the most abundant CDR3 among all reads in large clusters. Bottom, Diversity metrics for simulated simple, synthetic and real datasets. The simulated simple dataset was generated by IgSimulator with the following parameters: chain type = IGH, number of base sequences = 1000, expected number of mutated sequences = 10,000, and expected repertoire size = 50,000. Number of base sequences, number of mutated sequences, and repertoire size refer to the number of generated VDJ recombinations, number of unique sequences with SHMs derived from VDJ recombinations, and number of sequences in a repertoire, respectively. Because we simulated the Rep-seq library with coverage 1, the number of reads is expected to be close to the size of the simulated repertoire.
The simulated datasets were generated by IgSimulator (48) with parameters specified in Table I. Although the simulated datasets do not adequately reflect some features of real Ab repertoires, they provide a way to evaluate the constructed repertoires for various error rates. We selected 12 values for the average number of errors in reads ranging from unrealistically low to rather high (0, 0.0625, 0.125, 0.25, 0.375, 0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2) and simulated a Rep-seq library, where the number of mismatches in reads followed the Poisson distribution with the parameter λ equal to the average number of mismatches in the simulated Rep-seq library. We refer to them as simulated simple datasets because they often contain error-free reads, making it easier to reconstruct a repertoire. As a result, for each input repertoire we generated 12 simulated simple and 12 synthetic simple datasets. Each of 24 simulated datasets consists of 10 libraries with randomly generated errors. We also simulated 12 more difficult-to-analyze simulated complex datasets (modeling highly corrupted clusters) where each input read contains at least one sequencing error.
The simulated barcoded datasets include two barcoded Rep-seq datasets (478,250 and 481,245 reads) with realistic barcode errors, barcode collisions, and chimeric reads. To generate these datasets, we extended IgSimulator by adding a capability to generate barcoding artifacts.
The simulated TCR dataset, generated by IgSimulator, has 21,013 clusters and 412 large clusters with at least five reads (maximal cluster size 62). For this repertoire, we simulated two Rep-seq libraries (each consisting of 28,175 reads) with an average number of errors per read equal to 0.5 and 1.0.
The synthetic dataset is a Rep-seq dataset of bulk unsorted B cells from peripheral blood. The synthetic dataset is challenging because it represents a clonally expanded repertoire containing extensively hypermutated Abs. For benchmarking purposes, we constructed a reference repertoire of the synthetic dataset using BarcodedIgReC. The synthetic repertoire was used as a base for simulating Rep-seq libraries with various error rates in the same way as the simulated dataset.
The real dataset is a barcoded Rep-seq dataset of bulk unsorted B cells from peripheral blood of a healthy donor generated at the Institute for Bioorganic Chemistry in Moscow, Russia. For benchmarking purposes, we compared repertoires constructed by IgReC, MiXCR, and pRESTO tools in a blind mode with the reference repertoire obtained using BarcodedIgReC. The real dataset does not contain highly abundant clonal families, but reflects the complexity of real data.
The PBMC dataset was derived from PBM B cells. The PBMCs were taken right after flu vaccination (day 0), whereas ASCs were sequenced and sorted at the moment of the highest immune response to the flu vaccine (day 7). Because the cells in the PBMC dataset include various B cell subtypes (naive, memory, and plasma), it is characterized by diverse recombination events and large variations in Ab abundances.
The ASC+ dataset was derived from ASCs positive to hemagglutinin. In contrast to the PBMC dataset, the ASCs generate highly abundant Abs. Because the B cells from the ASC+ dataset share specificity to hemagglutinin, they form abundant clonal lineages.
The ASC− dataset was derived from ASCs negative to hemagglutinin. Because the B cells from the ASC− dataset showed negative specificity to hemagglutinin, the most abundant clonal lineages from this dataset are less abundant than the most abundant clonal lineages from the ASC+ dataset.
Analyzing PBMC, ASC+, and ASC− datasets of sorted B cells
We evaluated how the results of various repertoire construction tools correlate with the specific features of sorted B cells, e.g., for the PBMC dataset (high recombination, low expression, and low clonality levels), we expect to observe many dissimilar CDR3s corresponding to unrelated V(D)J recombinations; whereas for the ASC+ dataset (medium recombination, high expression, and high clonality levels), we expect to observe large groups of similar CDR3s that came from the same Ab lineage and differ by SHM. We also expect that various metrics for the ASC− dataset (medium recombination, medium expression, and medium clonality levels) have smaller values than those of the ASC+ repertoire, but larger values than for the PBMC repertoire.
We analyzed PBMC, ASC−, and ASC+ datasets using IgReC, and selected large clusters in the constructed repertoires to minimize the effect of amplification artifacts on our analysis (Table I). We extracted the CDR3 sequence for each constructed cluster using IgDiversityAnalyzer and grouped identical CDR3s. Because CDR3 is the most divergent part of Abs, the set of all CDR3s is a good (albeit insufficient) representation of the diversity of the entire repertoire. Table I (middle) illustrates that the IgReC results for the PBMC, ASC−, and ASC+ repertoires are consistent with the expectations summarized in Table I (top).
The PBMC repertoire contains more CDR3s (6255 versus 5008 and 949), and more VJ pairs (885 versus 537 and 156) than the ASC− and ASC+ repertoires, respectively.
Although the PBMC repertoire contains some highly abundant clusters (the largest is formed by 3834 reads), its CDR3 abundances are rather low (maximum CDR3 abundance is 20) and its SI is only 0.00024. The ASC− and ASC+ repertoires contain more abundant clusters, many of which share CDR3. The largest cluster in the ASC+ repertoire has 47,750 reads, reflecting an extensive clonal expansion.
The ratio of the SI and the CSI revealed that the PBMC repertoire contains a very limited number of clonal lineages (CSI/SI = 1.36). In contrast, the ASC− and ASC+ repertoires have large clonal lineages (CSI/SI is 7.15 and 11.96 for ASC− and ASC+, respectively).
Benchmarking on the simulated datasets
We ran MiXCR, pRESTO, and IgReC on each simulated dataset. Parameters of MiXCR and pRESTO are given at the Github; IgReC was launched with the default parameters. For each tool, we selected the parameter minsizecon that maximizes sensitivity + precision for each constructed repertoire (Fig. 3). Fig. 4 (top) shows the sensitivity, precision, and sensitivity + precision plots for 120 simulated simple datasets (12 error rate values × 10 libraries for each error rate), and illustrates that all tools deteriorate with the increase in the error rate. However, IgReC reconstructs ∼73% of reference clusters, even in the case of the high number of errors equal to two, whereas the corresponding values of sensitivity for MiXCR and pRESTO fall to 20%. Fig. 4 (top) also demonstrates that the variance of sensitivity and precision is small for all tools, suggesting that our approach to estimating the quality of constructed repertoires is robust.
We also benchmarked IgReC, MiXCR, and pRESTO for typical values of error rates in real Rep-seq datasets corresponding to 0.5 and 1 errors per read on average (Fig. 5A, 5B). IgReC improved on MiXCR and pRESTO for 0.5 and 1 errors, except for a small region in the sensitivity-precision plot for 0.5 errors (corresponding to high precision). pRESTO improved on MiXCR in the case of 0.5 errors, whereas MiXCR improved on pRESTO in the case of one error.
In contrast to IgReC, MiXCR and pRESTO failed to construct correct cluster sequences for the simulated complex datasets. Surprisingly, the repertoires constructed by IgReC for the simulated complex dataset are even more accurate than repertoires constructed from simulated simple datasets (Fig. 5C).
Benchmarking on the simulated TCR dataset
Fig. 5D and 5E illustrate that IgReC improves on pRESTO and MiXCR for the simulated TCR datasets. For the dataset with 0.5 errors per read, IgReC (sensitivity = 0.99, precision = 1.00) improves on pRESTO (sensitivity = 0.58, precision = 0.86), and MiXCR (sensitivity = 0.66, precision = 1.00). For the dataset with 1.0 errors per read, IgReC (sensitivity = 0.96, precision = 1.00) also improves on pRESTO (sensitivity = 0.25, precision = 0.96) and MiXCR (sensitivity = 0.33, precision = 1.00). An absence of SHMs in TCR repertoires makes distinguishing between different TCR sequences easy. As a result, all benchmarked tools demonstrate high precision. However, due to the limited diversity of TCR repertoires and thus a low potential number of distinct clusters, errors in reads would corrupt sensitivity dramatically. pRESTO and MiXCR use soft approaches for error-correction and thus report many erroneous clusters for the TCR simulated dataset. IgReC uses more aggressive approach for error-correction that results in mild overcorrection of Ab repertoires (Fig. 4), but works well for TCR repertoires.
Benchmarking on the synthetic dataset
Because the synthetic dataset is very complex, we limited benchmarking on it to simple libraries only. Fig. 4 (bottom) illustrates that all tools resulted in high precision on the synthetic dataset for all values of error rates. The sensitivity values for the synthetic datasets are lower as compared with the simulated simple datasets, reflecting the fact that there are many similar clusters in the reference repertoire for the simulated simple dataset. MiXCR and pRESTO showed significantly lower sensitivity than IgReC for high error rates. However, for small number of errors (<0.8), pRESTO showed higher sensitivity than IgReC. In general, IgReC maintained high values of sensitivity for the entire range of error rates (93% for zero errors and 56% for two errors).
Benchmarking on the real dataset
Fig. 5F illustrates that MiXCR slightly improves on pRESTO and IgReC slightly improves on MiXCR, but all tools feature high sensitivity and precision on the real dataset. It turned out that IgReC works well with error-prone and overamplified data: its sensitivity (98%) and precision (96%) for the real dataset are even higher than the corresponding values for the simulated dataset with zero errors (sensitivity 97%, precision 88%).
Benchmarking on the barcoded datasets
Using barcodes improves the performance of BarcodedIgReC, MiGEC, and pRESTO as compared with applying these tools in the blind mode, i.e., ignoring barcodes. Therefore, repertoires constructed using information about barcodes represent excellent references for repertoires constructed in the blind mode. However, it is important to evaluate how repertoire construction tools perform on barcoded datasets to measure how close the resulting repertoires are to the ideal repertoires.
To simulate a realistic barcoded Rep-seq dataset, we estimated the average rate of both sequencing and amplification errors using the real dataset. For each barcode, we computed the largest group of identical reads (up to small shifts) and classified it as dominant if it 1) contains at least five reads; and 2) contains at least 50% more reads than the second-largest group. We expect that dominant groups contain error-free reads and thus represent true receptor sequences. For dominant groups, we computed pairwise alignments between the largest group sequence and the remaining reads with the same barcode. According to the distribution of the computed Hamming distances, the average number of mismatches in the computed alignment is close to two. We thus simulated reads with low (0.5 errors per read) and medium (two errors per read) error rates, which resulted in two simulated barcoded libraries with 478,250 and 481,245 reads, respectively. We further compared repertoires constructed by IgReC (in the blind mode), BarcodedIgReC, MiGEC+MiXCR, and pRESTO. We note that MiGEC was used for clustering of reads with the same barcodes and correction of barcode errors. To construct the full-length repertoire, we launched MiXCR on the consensus sequences reported by MiGEC as proposed previously (37).
All tools demonstrated high sensitivity for the library with the low error rate (Fig. 5G), but BarcodedIgReC and MiGEC+MiXCR outperformed IgReC and pRESTO with respect to precision. Specifically, BarcodedIgReC and MiGEC+MiXCR correctly reconstructed 85 and 83% of large clusters (more than five reads), respectively (compared with 61% for IgReC and 73% for pRESTO). Surprisingly, IgReC launched in the blind mode on a library with high error rates improved on the specialized tools utilizing barcoding information (Fig. 5H).
Abundance analysis of the barcoded datasets
For each cluster in the constructed repertoire we compute its barcode abundance (the number of RNA molecules contributing to this cluster) and use the barcode abundances as references to evaluate the standard abundances (the number of reads in the cluster). Fig. 6 (left) compares barcode abundance and read abundance for the real repertoire computed by BarcodedIgReC, and reveals that although a large barcode abundance typically corresponds to a large read abundance, the variance of read abundances for a given barcode abundance is very high. For example, for the barcode abundance 10, the read abundances vary from 10 to 1000. Thus, barcode abundances represent the method of choice in immunogenomics studies aimed at accurate quantitation of RNA molecules.
We also analyzed barcode abundance reconstructed by BarcodedIgReC using the abundance plots with the standard abundance replaced by the barcode abundance. Fig. 6 (right) shows the barcode abundance plot for the simulated barcoded dataset (two errors per read). We computed the ratio of the abundance in the constructed repertoire to the abundance in the reference repertoire for each cluster and further computed the median ratio across all clusters. Because the median ratio is high (0.9 for the barcode abundance versus 0.81 for the read abundance), we concluded that BarcodedIgReC accurately reconstructs repertoires in terms of both sensitivity and precision and barcode abundance.
Does IgReC adequately represent the diversity of repertoires?
Adequately representing the diversity of Ab repertoires is an important goal of IgReC. We assume that a large cluster in the (unknown) reference repertoire is typically represented by a large cluster (with correct consensus) and multiple small clusters (with incorrect consensuses) in the constructed repertoire. As a result, information about all large reference clusters is preserved in the constructed repertoire even after filtering low-abundance clusters. If all small clusters in the constructed repertoire represent sequencing and amplification errors, the constructed repertoire preserves the diversity of the real repertoire even after removing small clusters.
This observation suggests the following approach to analyzing how the minsizecon parameter affects the diversity of the constructed repertoires (Fig. 7). For each CDR3 from small filtered clusters, we check if a similar CDR3 is present in large nonfiltered clusters (with a Hamming distance up to 3). If such similar CDR3 is found, we classify the CDR3 from the small cluster as confirmed and compute the percentage of confirmed CDR3s versus the minsizecon threshold. Fig. 7 illustrates that the default value minsizecon = 5 is an excellent threshold for high-abundance clonally expanded B cells (ASC+ and, to a smaller extent, ASC− datasets) because ∼90% of CDR3 in ASC+ and ASC− are confirmed for this threshold. In contrast, because repertoires of peripheral blood B cells (real and PBMC datasets) contain a mixture of B cells with widely varying abundances, discarding small clusters leads to a significant loss of diversity for such datasets: only 44 and 7% of CDR3s are confirmed (for minsizecon = 5) for the real and PBMC datasets, respectively. However, reducing minsizecon below the default value for such datasets leads to artificially inflated diversity because IgReC starts reporting erroneous clusters (with many amplification and sequencing errors) in the constructed repertoire. Surprisingly, for the synthetic dataset, even filtering singleton clusters significantly deflates diversity (41 and 38% of confirmed CDR3s for minsizecon = 1 and minsizecon = 5, respectively).
Correction of errors in barcoded datasets using BarcodedIgReC
The quality assessment approach in IgQUAST (based on computing sensitivity and precision) does not reveal the quality of the repertoires constructed from the barcoded Rep-seq datasets with respect to their artifacts (barcode errors, barcode collisions, and chimeric reads). To evaluate how BarcodedIgReC deals with these artifacts, we launched it on the simulated barcoded dataset (480,078 reads with two errors per reads on average) with the default parameters. This library contains 43,193 distinct barcodes (∼10% of reads had erroneous barcodes). BarcodedIgReC identified and corrected 68% of reads with erroneous barcodes by combining them with a cluster with a correct barcode. The remaining reads with erroneous barcodes (32%) form singleton clusters (Supplemental Fig. 3).
To estimate the accuracy of the barcode collision procedure, we simulated a Rep-seq library with nine nt-long barcodes. For a repertoire with 43,193 Ab molecules, we generated 39,888 distinct barcodes corresponding to 3305 barcode collisions. BarcodedIgReC resolved 96% of them. Simulation with more realistic collision rates resulted in resolving all barcode collisions.
BarcodedIgReC identified ∼30% of chimeric reads (rate of simulated chimeric reads is ∼2.5%). The remaining ∼70% of simulated chimeric reads form singleton clusters that are discarded after applying the minsizecon threshold.
Although it is difficult to draw a fair comparison of various repertoire construction tools (they all have slightly different goals), a comprehensive assessment of various tools is crucial for all areas of modern genomics and immunogenomics is not an exception. Because such benchmarking effort in immunogenomics is still missing, we developed the IgQUAST quality assessment tool that, to our knowledge, enabled the first benchmarking study of various repertoire reconstruction approaches across diverse Rep-seq datasets. We also proposed a reference-free approach to quality assessment of the Ab repertoires using immunosequencing datasets from sorted B cells with varying diversity and expression levels. Although IgReC, MiXCR, and pRESTO showed different results in terms of sensitivity and precision, our analysis revealed that all tools accurately reflect important biological features of Rep-seq data.
In difference from other repertoire construction tools, IgReC reconstructs repertoires based on the concept of the Hamming graphs (40, 41), the workhorse of the error-correction approach in the SPAdes genome assembler (55). To address the computational bottlenecks of constructing the immunosequencing Hamming graph, we developed the minimizer approach that resulted in three orders of magnitude increase in speed compared to IgRepertoireConstructor. As a result, IgReC analyzes large highly hypermutated Rep-seq datasets in hours compared with days required by IgRepertoireConstructor (14).
Our benchmarking revealed that no repertoire construction tool works better than others across the diverse types of immunosequencing datasets. For example, repertoires constructed by IgReC featured better sensitivity than those constructed by MiXCR and pRESTO, but had lower precision in some cases. However, as our benchmarking demonstrated, IgReC is currently a tool of choice for analyzing hypermutated repertoires. In such studies, it is important to extract detailed information about expanded clonal lineages because erroneous clusters can be filtered at a later stage using the clonal analysis (7). In contrast, high precision is critical for diversity analysis of repertoires from healthy individuals; because such repertoires typically do not contain expanded clonal lineages, large erroneous clusters may negatively affect the estimates of the diversity of the constructed repertoire.
We compared BarcodedIgReC, MiGEC, and pRESTO tools that construct repertoires from barcoded immunosequencing datasets. Although barcoding simplifies repertoire reconstruction and allows one to recover low-abundance Abs from extensively amplified Rep-seq libraries, it requires additional experimental protocols that lead to various artifacts triggering errors in the reconstructed Ab repertoires. Despite the fact that the problem of mitigating barcoding artifacts in genome sequencing was extensively studied (56–58), the efforts to address these artifacts in immunogenomics are still in their infancy. Our attempt to address them led to the surprising conclusion that in many immunogenomics studies, barcoding may not be necessary because our computational approach to error correction of immunosequencing data (Hamming graphs) is nearly as powerful as the experimental approach to error correction (barcoding).
Our benchmarking revealed that although various repertoire construction tools result in high sensitivity, their precision varies and deteriorates in the case of immunosequencing datasets with many amplification errors. Surprisingly, repertoires reported by IgReC (in the blind mode without using barcoding information) on barcoded datasets improved on the repertoires constructed by the specialized tools that use barcoding information. This finding suggests that advanced error-correction algorithms may alleviate the need to generate barcoded repertoires, thus reducing experimental effort. We showed that repertoires constructed by IgReC from various nonbarcoded Rep-seq datasets [e.g., datasets with a high expression level (59)] are well suited for various downstream applications such as clonal analysis or immunoproteogenomics. In contrast, barcoded Rep-seq datasets may still be needed for specialized studies aimed at accurate quantification and detection of low-abundance receptor sequences such as studies of naive and memory cells requiring extensive amplification as in studies of long-term immune response (54, 60).
All datasets analyzed in this paper are publicly available under the following accession numbers or DOIs. Simulated: doi.org/10.5281/zenodo.823351 (https://zenodo.org/record/823351#.WYN34dOGPBI); simulated TCR: doi.org/10.5281/zenodo.823347 (https://zenodo.org/record/823347#.WYN3-NOGPBI); simulated barcoded: doi.org/10.5281/zenodo.826956 (https://zenodo.org/record/826956#.WYN4B9OGPBI); synthetic: SRR4431793 (https://www.ncbi.nlm.nih.gov/sra/SRX2251687); real: SRR5851422 (https://www.ncbi.nlm.nih.gov/sra/SRX3021078); ASC+: SRR3620074 (https://www.ncbi.nlm.nih.gov/sra/SRX1816116); ASC−: SRR3620075 (https://www.ncbi.nlm.nih.gov/sra/SRX1816117); PBMC: SRR3620098 (https://www.ncbi.nlm.nih.gov/sra/SRX1816140). IgReC, BarcodedIgReC, IgDiversityAnalyzer and IgQUAST are available from http://www.cab.spbu.ru/software/ig-repertoire-constructor.
We are indebted to Dmitry Bolotin, Mikhail Shugay, Dmitriy Chudakov, Jason Vander Heiden, and Steven Kleinstein for productive discussions and assistance in benchmarking MiXCR, MiGEC, and pRESTO tools. We thank Sonya Timberlake for providing the synthetic dataset and Anton Bankevich for insightful comments and help in preparation of this paper.
This work was supported by the Russian Science Foundation (Grant 14-50-00069) and the National Institutes of Health (Grant 2-P41-GM103484). Ig library preparation was supported by Russian Science Foundation Grant 14-14-00533 (to M.A.T.).
The datasets presented in this article have been submitted to the National Center for Biotechnology Information under accession number SRR5851422.
The online version of this article contains supplemental material.
Abbreviations used in this article:
clonal Simpson index
The authors have no financial conflicts of interest.