High-throughput sequencing data from TCRs and Igs can provide valuable insights into the adaptive immune response, but bioinformatics pipelines for analysis of these data are constrained by the availability of accurate and comprehensive repositories of TCR and Ig alleles. We have created an analytical pipeline to recover immune receptor alleles from genome sequencing data. Applying this pipeline to data from the 1000 Genomes Project we have created Lym1K, a collection of immune receptor alleles that combines known, well-supported alleles with novel alleles found in the 1000 Genomes Project data. We show that Lym1K leads to a significant improvement in the alignment of short read sequences from immune receptors and that the addition of novel alleles discovered from genome sequence data are likely to be particularly significant for comprehensive analysis of populations that are not currently well represented in existing repositories of immune alleles.

This article is featured in In This Issue, p.1763

The adaptive immune system provides protection from disease by initiating specific immune responses to specific foreign Ags following their recognition by the clonally distributed surface-bound receptors of T and B lymphocytes, namely TCRs for T cells and Igs for B cells. In its secreted form, the B cell’s Ig molecule is called an Ab. The variability in molecular shapes of foreign Ags represents a significant challenge for the immune system, requiring development of an extensive receptor repertoire for Ag recognition. In response to this challenge, TCRs and Igs achieve vast diversity by a unique mechanism called VDJ recombination, which is a largely stochastic process of gene rearrangement encoding the variable parts of TCRs and Igs, respectively. Additionally, genes encoding Igs undergo somatic hypermutation during T cell–dependent B cell responses, resulting in further expansion in Ab diversity and subsequent selection of B cell clones generating Igs with higher affinity for Ag. Quantitative analysis of the diversity of TCR and Ig repertoires can shed light on receptor–Ag interactions and reveal insights into the state of the immune system under diverse conditions, such as aging (13), chronic viral infections (46), and autoimmune diseases (710).

Next-generation sequencing technology revolutionized the analysis of TCR/Ig diversity by providing vast amounts of data at much higher resolution, leading also to a need for effective and robust bioinformatics pipelines. Generally, these pipelines align the input sequences with the reference V, D (for Ig H chain and TCR β-chain), and J genes, determine the CDR3 region, and estimate the repertoire diversity. Two main factors that directly influence the quality of these analyses are the alignment algorithm and the completeness and accuracy of the reference database. In the past 3 y, several software packages have been developed with different alignment algorithms for analyzing next-generation sequencing sequence data of TCR and/or Ig (1119). These studies all focused on improving the accuracy, effectiveness, and completeness of their alignment algorithms. Most of the software packages use the reference V(D)J sequences provided by the international ImMunoGeneTics information system (IMGT) (20), which is currently the most comprehensive collection of TCR and Ig germline gene sequences available.

However, previous studies have reported numerous novel V alleles that are not found in the IMGT database (2125), suggesting that this database is incomplete. Additionally, a study of 226 IGHV alleles in IMGT reported a high rate of false positives (of 226 alleles, 104 were classified as very likely to contain sequencing errors) (26). This is because most of the IMGT TRBV and IGHV alleles were included from early studies, conducted between 1984 and 1996 (27, 28). After that, there were only two major updates on IGHV alleles in 2002 and 2009 introducing in total 36 alleles, and no updates on TRBV alleles have been submitted in the past 20 y. Furthermore, many alleles were reported in only one study, and multiple allelic variants of the same gene were derived from the same individual. For instance, an individual can have at most two alleles of any gene; however, seven IGHV3-15 alleles were derived from one individual. This was pointed out by Wang et al. (26), but the issue remains unresolved in the current version of IMGT. Additionally, many TRBV and TRAV alleles in IMGT are annotated as having come from rearranged sequences, which may contain somatic mutations and partially truncated 3′ ends resulting from VDJ recombination.

Since the development of IMGT, a large and ever-increasing number of full human genomes have been sequenced. The chromosomal regions containing the immune receptor alleles have both a high level of intrachromosomal duplication and a high level of diversity and thus present challenges for genome assembly. However, improvement of the reference genome over time has the potential to enable recovery of a comprehensive set of immune gene alleles found in human populations from population resequencing studies. We have developed a bioinformatics pipeline that can be used to generate a TCR/Ig reference database using data from population resequencing studies. We applied this pipeline to data from the 1000 Genomes Project (G1K) (29), which contains the most comprehensive collection of global human variation currently available in the public domain, to create the Lym1K database. We aligned short read sequence data from human immune receptors to Lym1K and obtained significantly improved alignments compared with what could be achieved using the IMGT database as reference. We also report evidence that the diversity of immune receptor alleles observed in non-African populations is particularly underrepresented in the existing database.

We obtained two datasets from the National Center for Biotechnology Information Sequence Read Archive (SRA). The first dataset was for the TCR β-chain (SRA index: PRJNA310731) and consisted of 34 samples. The number of reads ranged from 4,571,596 to 13,742,272. Each read was 200-bp long and included part of the V, all of the D, and part of the J region. The second dataset contained both Ig H and L chain sequences (SRA index: PRJNA179099). We used four samples from this dataset (all samples that included >1 million reads). The number of reads in the samples ranged from 1,046,575 to 23,191,224. Each read was 250 bp long and included part of V (all of the D for the H chain) and part of the J region.

Each simulated sequence was created based on the process of VDJ recombination of TCR and Ig and included mismatches to stimulate the combined effects of sequencing errors and mutations/polymorphisms. First, we randomly chose a V, D (only for the H chain of Igs and the β-chain of TCRs), and J gene assuming uniform gene usage. We then decided the particular allele of the chosen V gene based on the frequency of its occurrence in the chosen population. Second, we “mutated” the chosen V and J segments by introducing 0–10 mismatches on the V region and 0–5 mismatches on the J region. To simulate junction diversity, we inserted zero to six randomly generated nucleotides into the V-D and D-J junction (V-J junction only for the Ig L chain and TCR α-chain).

We applied our simulation pipeline for TCRB, TCRA, IGH, and IGL genes separately. In all, we produced 27 datasets, with separate datasets corresponding to each of the 26 populations included in G1K and 1 dataset derived from the pool of all populations. Each dataset consisted of 20 simulated samples, and each sample contained 200,000 simulated sequences.

We retrieved variant cell format (VCF) files corresponding to the TCRB, TCRA, IGH, and IGL regions from phase 3 of G1K. We used the R package SNPRelate (30) to perform principal component analysis separately on variant data from each of the four regions.

The goal of AlleleMiner is to infer all the possible alleles of the target genes from the input VCF files. For the Lym1K database, we first retrieved the location information of TCR and Ig genes from BioMart (GRCh38.p2) and phased VCF files from G1K (GRCh38; phase 3). Note that the current phase 3 VCF files from G1K were generated based on the GRCh37 human genome assembly; however, VCF files in the coordinates of the GRCh38 assembly are also provided.

For each gene, AlleleMiner extracts the corresponding reference genome sequence from the University of California Santa Cruz database and single nucleotide polymorphisms (SNPs) from the input VCF file. Subsequently, AlleleMiner retrieves all alleles of each gene from the haplotypes spanning the gene of interest. Identical haplotypes are merged, and the number of times a particular haplotype appears is counted. Users can define a threshold for the haplotype occurrence times to eliminate rare haplotypes, some of which may be the result of sequencing error; only the haplotypes that occur more than the threshold number of times are stored as potential alleles in the new TCR/Ig database.

The pipelines were implemented in JAVA and are freely available at https://sourceforge.net/projects/alleleminer/. The Lym1K database of immune receptor alleles can be obtained from http://maths.nuigalway.ie/biocluster/database/.

The sequencing of increasing numbers of whole human genomes and improvements in the reference human genome assembly give rise to opportunities to expand substantially the reference database of human immune receptor alleles and to profile the diversity of immune receptor alleles across global populations. Currently, G1K represents the most extensive collection of whole genomes sampled from global human populations in the public domain. Focusing specifically on TCRs and Igs, most human immune receptor genes have been sequenced to sufficiently high coverage to recover novel alleles; the median coverage of IGHV, IGLV, TRBV, and TRAV are accordingly 121, 120.5, 76, and 53 (Supplemental Fig. 1). We retrieved known and novel TCR/Ig alleles from 2504 humans in the phase 3 cohort of G1K. We then tested the performance of the reference database we derived from population resequencing data and compared it to the existing IMGT reference database of human immune receptors using simulated data as well as short read receptor sequencing datasets (Fig. 1).

FIGURE 1.

Study outline. AlleleMiner retrieves all immune receptor haplotypes in data from G1K, merging it with alleles from IMGT to produce the Lym1K reference database. We compared the performance of the Lym1K and IMGT reference datasets using real Ig and TCR short read sequence data. Finally, we carried out a simulation study to investigate the extent of underrepresentation of alleles found in diverse human populations in IMGT.

FIGURE 1.

Study outline. AlleleMiner retrieves all immune receptor haplotypes in data from G1K, merging it with alleles from IMGT to produce the Lym1K reference database. We compared the performance of the Lym1K and IMGT reference datasets using real Ig and TCR short read sequence data. Finally, we carried out a simulation study to investigate the extent of underrepresentation of alleles found in diverse human populations in IMGT.

Close modal

AlleleMiner is the core component of the pipeline and is used to infer immune receptor alleles from genomic data. The algorithm used in AlleleMiner is described in detail in 2Materials and Methods. In brief, AlleleMiner first enumerates all possible haplotypes of each TCR/Ig gene, using as input genome-wide sequence variants in VCF. Subsequently, the inferred haplotypes that occur in more than a user-defined minimum number of individuals are merged with known alleles (here we merged discovered variants with alleles in the IMGT database to yield the Lym1K database of known and inferred immune receptor alleles). A filtering step is then applied to assess the support for each nonreference SNP allele found among the observed haplotypes (Fig. 2).

FIGURE 2.

Workflow for database construction. AlleleMiner is used to infer all the possible alleles from VCF data. The Lym1K database is constructed by merging the inferred alleles with the original IMGT alleles. For alleles not found in IMGT, we apply a filtering step that takes account of the validation status of all nonreference SNP alleles contained on the haplotypes corresponding to the putative novel allele.

FIGURE 2.

Workflow for database construction. AlleleMiner is used to infer all the possible alleles from VCF data. The Lym1K database is constructed by merging the inferred alleles with the original IMGT alleles. For alleles not found in IMGT, we apply a filtering step that takes account of the validation status of all nonreference SNP alleles contained on the haplotypes corresponding to the putative novel allele.

Close modal

To assess the extent to which known immune receptor alleles could be recovered from population resequencing data, we set the initial inclusion threshold to 1 (i.e., we considered all alleles found at least once in the G1K data). For IGL, TCRA, and TCRB most alleles in IMGT could be recovered (Table I, Supplemental Table I). We then examined the alleles that were not recovered. In the case of TRAV and TRBV, many of these alleles were annotated as originating from rearranged genomic DNA or cDNA, rather than originating from unrearranged germline DNA. Rearranged sequences may have additional somatic variants that we do not expect to find among the alleles we recovered from the genomic sequence. Many of the remaining alleles are from the 22 TRBV genes that are not included in either the GRCh37 or GRCh38 human genome assemblies (GRCh37 was used in G1K, and the current human genome assembly, GRCh38, is used as a reference genome by AlleleMiner). Excluding these genes and the alleles corresponding to rearranged sequences, we were able to retrieve all of the remaining TRAV and all but one allele (TRBV19*02) of the remaining TRBV alleles in IMGT (Table I).

Table I.
Shared allele counts between IMGT database and inferred alleles with minimum repeat threshold of one
IGHVIGLVTRBVTRAVIGHJIGLJTRBJTRAJIGHDTRBD
Number of alleles in IMGT 208 133 88 56 13 12 14 50 34 
Number of alleles inferred from G1K 3746 3917 370 568 100 160 22 162 249 
Number of putative novel alleles 3609 3740 318 511 90 141 114 219 
Number of alleles shared 83 99 54 46 12 13 48 30 
Proportion of IMGT alleles found in G1K, % 39.9 74.4 98.1 100 46.15 100 92.86 96 88.24 66.67 
Proportion of IMGT alleles not found in G1K, % 60.1 25.6 1.9 53.85 7.14 11.76 33.33 
IGHVIGLVTRBVTRAVIGHJIGLJTRBJTRAJIGHDTRBD
Number of alleles in IMGT 208 133 88 56 13 12 14 50 34 
Number of alleles inferred from G1K 3746 3917 370 568 100 160 22 162 249 
Number of putative novel alleles 3609 3740 318 511 90 141 114 219 
Number of alleles shared 83 99 54 46 12 13 48 30 
Proportion of IMGT alleles found in G1K, % 39.9 74.4 98.1 100 46.15 100 92.86 96 88.24 66.67 
Proportion of IMGT alleles not found in G1K, % 60.1 25.6 1.9 53.85 7.14 11.76 33.33 

The compared alleles were restricted to the genes that are GRCh37 and GRCh38. The IMGT alleles retrieved from non-germline sequences were not included in the comparison.

In contrast, although most IGLV alleles were recovered, fewer than half of the IGHV and IGHJ alleles from IMGT were found among the inferred alleles (Table I). This could reflect the greater diversity of Ig genes (Supplemental Fig. 2) or the presence of false-positive alleles in IMGT, particularly in IGHV genes. Indeed, a high rate of false-positive IGHV alleles in IMGT has previously been reported (26). We found evidence for both of these effects. The greater diversity of Ig genes is apparent from the relationship between the number of inferred alleles and the inclusion threshold used in AlleleMiner (Supplemental Fig. 3A). The number of shared IGHV alleles dropped significantly more rapidly than the other three gene groups, suggesting that IMGT includes IGHV alleles that are found at relatively low frequencies in the G1K populations (Supplemental Fig. 3B). To explore the impact of false-positive alleles in IMGT on the extent to which known alleles could be recovered from genomic data, we used the validation classes from Wang et al. (26). We found that most of the IMGT alleles that were classified as high confidence (levels 1 and 2) were successfully recovered. In contrast, there were only a few IMGT alleles labeled as low confidence (levels 3, 4, and 5) found among the inferred alleles (Supplemental Fig. 3D).

In addition to the known alleles, we recovered large numbers of novel putative immune receptor alleles from the population sequence data not found in IMGT. These numbered in the hundreds for TRVA and TRBV and in the thousands for IGLV and IGHV, when we applied the minimal threshold of a single occurrence (Table I). The number of putative novel alleles decreased rapidly, as this threshold was increased in the case of Ig, but far more slowly in the TCR case, suggesting saturation with respect to the number of samples in G1K of the TCR but not of the Ig allelic diversity (Supplemental Fig. 1).

We compared the performance of the IMGT and Lym1K databases on real TCR and Ig short read sequence datasets. Because there are no high-throughput sequencing datasets for TCRA currently in the public domain (to our knowledge), this comparison was restricted to IGH, IGL, and TCRB. For each input sequence, when the alignment score of the best match of the sequence is higher using Lym1K rather than IMGT alone as the reference database, it demonstrates that the Lym1K database contains additional allele(s) that are more similar to the input read than the most similar allele from IMGT. Consequently, this corresponds to a sequence for which an improved match can be found when alleles recovered from genomic sequence data are included in the reference database. The median length of the IGHD genes is only 19 nucleotides, and the entire D gene is within the CDR3 region with high mutation rates. During alignment, the median of the perfect matching nucleotides of the D gene is only six, which hampered the accuracy of the assessments on IGHD genes. Therefore we focused our comparison on V and J gene alleles.

We calculated the portions of the input sequences with improved alignment performances (Fig. 3). To assess the differences in alignment performance in greater detail, we classified the aligned reads based on the number of mismatches with the best matching reference allele. Furthermore, to rule out the possibilities that Lym1K achieved improved alignment performance due to the larger number of included alleles, in each individual, we only kept the top two best aligned alleles (corresponding to the number of haplotypes per diploid individual) for each gene in the following comparison. We then compared the proportion of reads within each class between the two reference databases (Fig. 4). The proportions of unaligned reads were consistently lower using Lym1K whereas the proportions of reads with zero or one to two mismatches were consistently higher. This demonstrates that the Lym1K database contains alleles that are more similar to the repertoire of the input sample. For instance, IGHV1-2*75p is a novel allele that we inferred from the G1K data and included in Lym1K. For sample SRR611358 there were 67 input sequences aligned with IGHV1-2*75p with an improved alignment score compared with the best matching alleles (IGHV1-2*02 or IGHV1-2*03) in IMGT (Fig. 5).

FIGURE 3.

Improvements in alignment performance using the Lym1K database.

FIGURE 3.

Improvements in alignment performance using the Lym1K database.

Close modal
FIGURE 4.

Alignment performance difference between IMGT and Lym1K. The aligned reads were classified into 13 categories starting from 0 mismatches with increments of 2 mismatches, to 20 mismatches and >20 mismatches. For example, for IGHV in sample SRR1056423 (top left panel) the number of alignments with zero mismatches increased by 33.2% using Lym1K, relative to IMGT.

FIGURE 4.

Alignment performance difference between IMGT and Lym1K. The aligned reads were classified into 13 categories starting from 0 mismatches with increments of 2 mismatches, to 20 mismatches and >20 mismatches. For example, for IGHV in sample SRR1056423 (top left panel) the number of alignments with zero mismatches increased by 33.2% using Lym1K, relative to IMGT.

Close modal
FIGURE 5.

Example of a novel allele (IGHV1-2*75p) in Lym1K that matches a short read sequence from a real IGHV sequencing dataset (SRR611538). IGHV1-2*02 and IGHV1-2*03 are both found in the IMGT database, and the sequence (SRR611538.578794) was aligned to these alleles in the original study. These two alleles both have two mismatches with the observed sequence. The inferred allele differed from the two most similar alleles found in IMGT, but showed a perfect match with the input sequence read.

FIGURE 5.

Example of a novel allele (IGHV1-2*75p) in Lym1K that matches a short read sequence from a real IGHV sequencing dataset (SRR611538). IGHV1-2*02 and IGHV1-2*03 are both found in the IMGT database, and the sequence (SRR611538.578794) was aligned to these alleles in the original study. These two alleles both have two mismatches with the observed sequence. The inferred allele differed from the two most similar alleles found in IMGT, but showed a perfect match with the input sequence read.

Close modal

In the case of TCRB we introduced 318 putative novel TRBV alleles in Lym1K (Table I). However, the improvements of the alignment performance of the TCRB sequences were small (Fig. 3). Among 34 samples, the highest proportion of reads with improved alignment was 4.1% (in sample SRR1033671) and the average proportion was only 2.4%. Therefore, it appears likely that IMGT contains adequate representation of the TCR diversity in the individuals from whom these samples were derived.

The Ig dataset consisted only of samples collected from the South African population (31). The TCRB samples were from whites and African Americans. This may have affected the results of the performance comparisons due to the under- or overrepresentation of certain populations in the IMGT database. For instance, if the IMGT database mainly consists of alleles from the white population, this could result in better alignment results for datasets that originate from white populations compared with datasets from African populations. In general, African populations display greater genetic diversity than do non-African populations (32), and this greater genetic diversity can be also be seen in the immune receptor regions (Fig. 6). This suggests that a comprehensive collection of human immune alleles requires sampling from diverse global populations, particularly those from the African continent. We carried out simulations to assess the potential for differential performance of the existing IMGT database on datasets derived from different global populations. The simulation pipeline is described in detail in 2Materials and Methods. For each chain, we simulated separate datasets for each of the populations in G1K as well as one pooled dataset corresponding to all populations.

FIGURE 6.

Principal component analysis of TCR/IG gene variation in human populations.

FIGURE 6.

Principal component analysis of TCR/IG gene variation in human populations.

Close modal

We used LymAnalyzer (19) to map simulated reads to reference alleles contained in IMGT and determined the proportion of reads that were mapped to the correct allele. Note that, in this case, our objective was not to compare IMGT to Lym1K, as the simulation is based on the same haplotypes used to construct Lym1K. Instead, we wanted to compare the accuracy of the alignments obtained when we used the existing IMGT database to map simulated reads from different human populations.

For all but one of the genes (TCRA), a significantly lower mapping accuracy was achieved for African than for non-African populations, indicating that the additional variation in immune genes found in African populations has not been captured adequately in IMGT (Fig. 7). For the IGH datasets, the African populations (LWK, GWD, MSL, YRI, ESN, ASW, and ACB) contained significantly more incorrectly mapped reads than in non-African populations (p < 2.2 × 10−16). The IGL datasets showed the largest portions of reads that were mapped inaccurately using IMGT. Again, using the IMGT database, there were noticeably larger portions of reads incorrectly mapped in the African populations than in non-African populations (p < 2.2 × 10−16). The portions of incorrectly mapped reads in TCRB datasets ranged from 4.9% (KHV) to 9.9% (LWK), and there were also more incorrectly mapped reads in African populations compared with non-African populations (p < 2.2 × 10−16). The TCRA datasets had the highest accuracy among the four chains using the IMGT database as the reference, with the smallest variation across populations. In contrast to other chains, the top three most incorrectly mapped populations in TCRA datasets are three East Asian populations (CDX, KHV, and CHS), which were among the least incorrectly mapped populations in other chains (Fig. 7).

FIGURE 7.

Comparison of the proportions of incorrectly mapped simulated reads among different populations using the IMGT database as the reference. Error bars corresponding to two SEs are shown in each plot. Population codes: ACB, African Caribbeans in Barbados; ASW, Americans of African ancestry in SW; BEB, Bengali from Bangladesh; CDX, Chinese Dai in Xishuangbanna; CEU, Utah residents (CEPH) with Northern and Western ancestry; CHB, Han Chinese in Bejing; CHS, Southern Han Chinese; CLM, Colombians from Medellin; ESN, Esan in Nigeria; FIN, Finnish in Finland; GBR, British in U.K. and Scotland; GIH, Gujarati Indian from Houston; GWD, Gambian in western divisions in the Gambia; IBS, Iberian population in Spain; ITU, Indian Telugu from the U.K.; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; MXL, Mexican ancestry from Los Angeles; PJL, Punjabi from Lahore; PEL, Peruvians from Lima; PUR, Puerto Ricans from Puerto Rico; STU, Sri Lankan Tamil from the U.K.; TSI, Toscani in Italia; YRI, Yoruba in Ibadan, Nigeria.

FIGURE 7.

Comparison of the proportions of incorrectly mapped simulated reads among different populations using the IMGT database as the reference. Error bars corresponding to two SEs are shown in each plot. Population codes: ACB, African Caribbeans in Barbados; ASW, Americans of African ancestry in SW; BEB, Bengali from Bangladesh; CDX, Chinese Dai in Xishuangbanna; CEU, Utah residents (CEPH) with Northern and Western ancestry; CHB, Han Chinese in Bejing; CHS, Southern Han Chinese; CLM, Colombians from Medellin; ESN, Esan in Nigeria; FIN, Finnish in Finland; GBR, British in U.K. and Scotland; GIH, Gujarati Indian from Houston; GWD, Gambian in western divisions in the Gambia; IBS, Iberian population in Spain; ITU, Indian Telugu from the U.K.; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; MXL, Mexican ancestry from Los Angeles; PJL, Punjabi from Lahore; PEL, Peruvians from Lima; PUR, Puerto Ricans from Puerto Rico; STU, Sri Lankan Tamil from the U.K.; TSI, Toscani in Italia; YRI, Yoruba in Ibadan, Nigeria.

Close modal

The advent of high-throughput sequencing technologies has enabled detailed studies of immune receptor repertoire diversity and changes in repertoire diversity over time, with many applications, including assessing the efficacy of leukemia treatment, understanding immune responses to disease, and determining the causes and consequences of changes in receptor diversity with age. Recently, a number of dedicated algorithms and bioinformatics tools have been developed to support such studies (1119); however, in addition to accurate alignment algorithms, the results of bioinformatics pipelines for the analysis of receptor diversity depend on the accuracy and completeness of the reference database of immune receptors and their alleles.

The IMGT database is currently the most widely used global reference in TCR and Ig diversity analysis. However, multiple previous studies have shown that IMGT is incomplete (2125), and one study suggested that some reported IGHV alleles might contain sequencing errors (26). This is mainly due to quality control issues arising from the inclusion of alleles from early studies and inadequate updating since the first release of the IMGT database. In this study, we provide evidence that alignment results obtained using IMGT as the reference source for TRBV, IGH, and IGL datasets derived from African populations are likely to be worse than those obtained for non-African populations, suggesting that IMGT does not adequately capture the global diversity of human immune receptors and that certain global populations are particularly underrepresented. This supports the value of incorporating immune alleles inferred from population sequencing studies that were specifically designed to profile the genetic diversity of global human populations.

To create a more comprehensive TCR/Ig reference database, we developed a novel bioinformatics pipeline that can be used to infer alleles from variant calling information, obtained from population sequencing projects. We applied this pipeline to data from G1K to create the Lym1K database and have made both Lym1K as well as the bioinformatics pipeline used to create it freely available (see “Software resource” above). Using real TCRB, IGH, and IGL datasets, we have shown that the incorporation of alleles inferred from population sequencing studies leads to improvements in alignment performance. These improvements are substantial in the case of Ig and result from novel alleles that provide a better match to the receptor repertoire of the input dataset than can be obtained using the alleles in the existing reference database.

There were in total 22 genes missing either from the current genome build (GRCh38) or GRCh37, and the depth of sequencing was unsatisfactory on several TRBV genes from G1K (Supplemental Fig. 3). This represents a limit to the comprehensiveness of receptor alleles inferred from the currently available population sequencing data; however, further improvements in the human reference genome and future population sequencing studies will enable alleles of these genes to be discovered.

Our study suggests that the diversity of Ig genes may not be profiled completely with existing sample numbers (Supplemental Fig. 1). The availability of genomic variants from larger numbers of individuals sampled from global populations could lead to further improvements in the comprehensiveness of immune receptor alleles recovered from genomic data. Our bioinformatics pipeline is freely available and can be applied to recover additional immune alleles from further public or restricted access VCF files and updated genome builds, as they become available. Furthermore, we have designed our short read mapping package, LymAnalyzer (19), so that Lym1K or any other collections of immune alleles produced using our pipeline or by other means can be substituted seamlessly as the reference allele database.

Two use cases are envisaged for the resources described in the present study (Fig. 8). First, users can apply Lym1K directly as their TCR/Ig reference database for their analysis. Currently, G1K represents the largest public catalog of human genome variation. However, as new public as well as restricted-access population sequencing datasets become available, and with further improvements in the human genome assembly, users may want to replace or extend the reference dataset with their own database of reference alleles inferred from genomic variant data. Such users can apply AlleleMiner to their own VCF files to generate a customized TCR/Ig reference database, which can then be selected as the reference database for LymAnalyzer. Indeed, depending on the specific data access restrictions that apply, such users may be able to contribute alleles discovered by applying AlleleMiner to restricted-access data into public repositories (such as IMGT), without violating data protection rules.

FIGURE 8.

Two use cases. Lym1K enables user 1 to perform a more comprehensive analysis of immune receptor diversity by combining alleles inferred from G1K with known alleles in IMGT. User 2 applies AlleleMiner to recover additional receptor alleles from VCF data to generate a user-defined database, which is subsequently used as the reference database for receptor repertoire analysis.

FIGURE 8.

Two use cases. Lym1K enables user 1 to perform a more comprehensive analysis of immune receptor diversity by combining alleles inferred from G1K with known alleles in IMGT. User 2 applies AlleleMiner to recover additional receptor alleles from VCF data to generate a user-defined database, which is subsequently used as the reference database for receptor repertoire analysis.

Close modal

This work was supported by the European Union Seventh Framework Programme (FP7/2007–2013) under Grant 317040 (QuanTI). Funding for open access was by the European Union Seventh Framework Programme.

The online version of this article contains supplemental material.

Abbreviations used in this article:

G1K

1000 Genomes Project

IMGT

international ImMunoGeneTics information system

SNP

single nucleotide polymorphism

SRA

National Center for Biotechnology Information Sequence Read Archive

VCF

variant cell format.

1
Gibson
K. L.
,
Wu
Y. C.
,
Barnett
Y.
,
Duggan
O.
,
Vaughan
R.
,
Kondeatis
E.
,
Nilsson
B. O.
,
Wikby
A.
,
Kipling
D.
,
Dunn-Walters
D. K.
.
2009
.
B-cell diversity decreases in old age and is correlated with poor health status.
Aging Cell
8
:
18
25
.
2
Boyd
S. D.
,
Liu
Y.
,
Wang
C.
,
Martin
V.
,
Dunn-Walters
D. K.
.
2013
.
Human lymphocyte repertoires in ageing.
Curr. Opin. Immunol.
25
:
511
515
.
3
Dunn-Walters
D. K.
,
Ademokun
A. A.
.
2010
.
B cell repertoire and ageing.
Curr. Opin. Immunol.
22
:
514
520
.
4
Wu
X.
,
Zhou
T.
,
Zhu
J.
,
Zhang
B.
,
Georgiev
I.
,
Wang
C.
,
Chen
X.
,
Longo
N. S.
,
Louder
M.
,
McKee
K.
, et al
NISC Comparative Sequencing Program
.
2011
.
Focused evolution of HIV-1 neutralizing antibodies revealed by structures and deep sequencing.
Science
333
:
1593
1602
.
5
Wang
C.
,
Liu
Y.
,
Xu
L. T.
,
Jackson
K. J.
,
Roskin
K. M.
,
Pham
T. D.
,
Laserson
J.
,
Marshall
E. L.
,
Seo
K.
,
Lee
J. Y.
, et al
.
2014
.
Effects of aging, cytomegalovirus infection, and EBV infection on human B cell repertoires.
J. Immunol.
192
:
603
611
.
6
Wu
X.
,
Zhang
Z.
,
Schramm
C. A.
,
Joyce
M. G.
,
Kwon
Y. D.
,
Zhou
T.
,
Sheng
Z.
,
Zhang
B.
,
O’Dell
S.
,
McKee
K.
, et al
NISC Comparative Sequencing Program
.
2015
.
Maturation and diversity of the VRC01-antibody lineage over 15 years of chronic HIV-1 infection.
Cell
161
:
470
485
.
7
Madi
A.
,
Shifrut
E.
,
Reich-Zeliger
S.
,
Gal
H.
,
Best
K.
,
Ndifon
W.
,
Chain
B.
,
Cohen
I. R.
,
Friedman
N.
.
2014
.
T-cell receptor repertoires share a restricted set of public and abundant CDR3 sequences that are associated with self-related immunity.
Genome Res.
24
:
1603
1612
.
8
Tipton
C. M.
,
Fucile
C. F.
,
Darce
J.
,
Chida
A.
,
Ichikawa
T.
,
Gregoretti
I.
,
Schieferl
S.
,
Hom
J.
,
Jenks
S.
,
Feldman
R. J.
, et al
.
2015
.
Diversity, cellular origin and autoreactivity of antibody-secreting cell population expansions in acute systemic lupus erythematosus.
Nat. Immunol.
16
:
755
765
.
9
Hehle
V.
,
Fraser
L. D.
,
Tahir
R.
,
Kipling
D.
,
Wu
Y. C.
,
Lutalo
P. M.
,
Cason
J.
,
Choong
L.
,
D’Cruz
D. P.
,
Cope
A. P.
, et al
.
2015
.
Immunoglobulin kappa variable region gene selection during early human B cell development in health and systemic lupus erythematosus.
Mol. Immunol.
65
:
215
223
.
10
Muraro
P. A.
,
Robins
H.
,
Malhotra
S.
,
Howell
M.
,
Phippard
D.
,
Desmarais
C.
,
de Paula Alves Sousa
A.
,
Griffith
L. M.
,
Lim
N.
,
Nash
R. A.
,
Turka
L. A.
.
2014
.
T cell repertoire following autologous stem cell transplantation for multiple sclerosis.
J. Clin. Invest.
124
:
1168
1172
.
11
Thomas
N.
,
Heather
J.
,
Ndifon
W.
,
Shawe-Taylor
J.
,
Chain
B.
.
2013
.
Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine.
Bioinformatics
29
:
542
550
.
12
Bolotin
D. A.
,
Shugay
M.
,
Mamedov
I. Z.
,
Putintseva
E. V.
,
Turchaninova
M. A.
,
Zvyagin
I. V.
,
Britanova
O. V.
,
Chudakov
D. M.
.
2013
.
MiTCR: software for T-cell receptor sequencing data analysis.
Nat. Methods
10
:
813
814
.
13
Bolotin
D. A.
,
Poslavsky
S.
,
Mitrophanov
I.
,
Shugay
M.
,
Mamedov
I. Z.
,
Putintseva
E. V.
,
Chudakov
D. M.
.
2015
.
MiXCR: software for comprehensive adaptive immunity profiling.
Nat. Methods
12
:
380
381
.
14
Yang
X.
,
Liu
D.
,
Lv
N.
,
Zhao
F.
,
Liu
F.
,
Zou
J.
,
Chen
Y.
,
Xiao
X.
,
Wu
J.
,
Liu
P.
, et al
.
2015
.
TCRklass: a new K-string-based algorithm for human and mouse TCR repertoire characterization.
J. Immunol.
194
:
446
454
.
15
Kuchenbecker
L.
,
Nienen
M.
,
Hecht
J.
,
Neumann
A. U.
,
Babel
N.
,
Reinert
K.
,
Robinson
P. N.
.
2015
.
IMSEQ—a fast and error aware approach to immunogenetic sequence analysis.
Bioinformatics
31
:
2963
2971
.
16
Gerritsen
B.
,
Pandit
A.
,
Andeweg
A. C.
,
de Boer
R. J.
.
2016
.
RTCR: a pipeline for complete and accurate recovery of T cell repertoires from high throughput sequencing data.
Bioinformatics
32
:
3098
3106
.
17
Nazarov
V. I.
,
Pogorelyy
M. V.
,
Komech
E. A.
,
Zvyagin
I. V.
,
Bolotin
D. A.
,
Shugay
M.
,
Chudakov
D. M.
,
Lebedev
Y. B.
,
Mamedov
I. Z.
.
2015
.
tcR: an R package for T cell receptor repertoire advanced data analysis.
BMC Bioinformatics
16
:
175
.
18
Zhang
W.
,
Du
Y.
,
Su
Z.
,
Wang
C.
,
Zeng
X.
,
Zhang
R.
,
Hong
X.
,
Nie
C.
,
Wu
J.
,
Cao
H.
, et al
.
2015
.
IMonitor: a robust pipeline for TCR and BCR repertoire analysis.
Genetics
201
:
459
472
.
19
Yu
Y.
,
Ceredig
R.
,
Seoighe
C.
.
2015
.
LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins.
Nucleic Acids Res.
44
:
e31
.
20
Lefranc
M. P.
,
Giudicelli
V.
,
Ginestoux
C.
,
Bodmer
J.
,
Müller
W.
,
Bontrop
R.
,
Lemaitre
M.
,
Malik
A.
,
Barbié
V.
,
Chaume
D.
.
1999
.
IMGT, the international ImMunoGeneTics database.
Nucleic Acids Res.
27
:
209
212
.
21
Kidd
M. J.
,
Chen
Z.
,
Wang
Y.
,
Jackson
K. J.
,
Zhang
L.
,
Boyd
S. D.
,
Fire
A. Z.
,
Tanaka
M. M.
,
Gaëta
B. A.
,
Collins
A. M.
.
2012
.
The inference of phased haplotypes for the immunoglobulin H chain V region gene loci by analysis of VDJ gene rearrangements.
J. Immunol.
188
:
1333
1340
.
22
Boyd
S. D.
,
Gaëta
B. A.
,
Jackson
K. J.
,
Fire
A. Z.
,
Marshall
E. L.
,
Merker
J. D.
,
Maniar
J. M.
,
Zhang
L. N.
,
Sahaf
B.
,
Jones
C. D.
, et al
.
2010
.
Individual variation in the germline Ig gene repertoire inferred from variable region gene rearrangements.
J. Immunol.
184
:
6986
6992
.
23
Wang
Y.
,
Jackson
K. J.
,
Gäeta
B.
,
Pomat
W.
,
Siba
P.
,
Sewell
W. A.
,
Collins
A. M.
.
2011
.
Genomic screening by 454 pyrosequencing identifies a new human IGHV gene and sixteen other new IGHV allelic variants.
Immunogenetics
63
:
259
265
.
24
Watson
C. T.
,
Steinberg
K. M.
,
Huddleston
J.
,
Warren
R. L.
,
Malig
M.
,
Schein
J.
,
Willsey
A. J.
,
Joy
J. B.
,
Scott
J. K.
,
Graves
T. A.
, et al
.
2013
.
Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation.
Am. J. Hum. Genet.
92
:
530
546
.
25
Scheepers
C.
,
Shrestha
R. K.
,
Lambson
B. E.
,
Jackson
K. J.
,
Wright
I. A.
,
Naicker
D.
,
Goosen
M.
,
Berrie
L.
,
Ismail
A.
,
Garrett
N.
, et al
.
2015
.
Ability to develop broadly neutralizing HIV-1 antibodies is not restricted by the germline Ig gene repertoire.
J. Immunol.
194
:
4371
4378
.
26
Wang
Y.
,
Jackson
K. J.
,
Sewell
W. A.
,
Collins
A. M.
.
2008
.
Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error.
Immunol. Cell Biol.
86
:
111
115
.
27
Pallarès
N.
,
Lefebvre
S.
,
Contet
V.
,
Matsuda
F.
,
Lefranc
M. P.
.
1999
.
The human immunoglobulin heavy variable genes.
Exp. Clin. Immunogenet.
16
:
36
60
.
28
Folch
G.
,
Lefranc
M. P.
.
2000
.
The human T cell receptor beta variable (TRBV) genes.
Exp. Clin. Immunogenet.
17
:
42
54
.
29
1000 Genomes Project Consortium
Auton
A.
,
Brooks
L. D.
,
Durbin
R. M.
,
Garrison
E. P.
,
Kang
H. M.
,
Korbel
J. O.
,
Marchini
J. L.
,
McCarthy
S.
,
McVean
G. A.
,
Abecasis
G. R.
.
2015
.
A global reference for human genetic variation.
Nature
526
(
7571
):
68
74
.
30
Zheng
X.
,
Levien
D.
,
Shen
J.
,
Gogarten
S. M.
,
Laurie
C.
,
Weir
B. S.
.
2012
.
A high-performance computing toolset for relatedness and principal component analysis of SNP data.
Bioinformatics
28
:
3326
3328
.
31
Doria-Rose
N. A.
,
Schramm
C. A.
,
Gorman
J.
,
Moore
P. L.
,
Bhiman
J. N.
,
DeKosky
B. J.
,
Ernandes
M. J.
,
Georgiev
I. S.
,
Kim
H. J.
,
Pancera
M.
, et al
NISC Comparative Sequencing Program
.
2014
.
Developmental pathway for potent V1V2-directed HIV-neutralizing antibodies.
Nature
509
:
55
62
.
32
Tishkoff
S. A.
,
Reed
F. A.
,
Friedlaender
F. R.
,
Ehret
C.
,
Ranciaro
A.
,
Froment
A.
,
Hirbo
J. B.
,
Awomoyi
A. A.
,
Bodo
J. M.
,
Doumbo
O.
, et al
.
2009
.
The genetic structure and history of Africans and African Americans.
Science
324
:
1035
1044
.

The authors have no financial conflicts of interest.

Supplementary data