We share the view of Dr. Watson and colleagues on the importance of germline databases (GLDB) of immune receptor genes and thank them for their interest in our work (1). Whereas we fully agree with them on many of the caveats that apply to alleles inferred from genome sequence data (discussed further below), we feel that it is important for us first to clarify the objectives of our article, to avoid any misinterpretation. Watson et al. state that we propose a solution to the incompleteness and inaccuracy of existing GLDBs; however, we made no claim that inferring putative germline alleles from genome sequence data represents a complete solution to this problem. We showed evidence in our article that inferred germline alleles can contribute to the solution, particularly concerning incompleteness, and that this can lead to improved alignment performance. But we consider the alleles inferred in this way to be putative, and they are generally referred to as such in our article.

In particular, Watson et al. state that we report 8750 novel alleles. This number is derived from Table I in our article, which was constructed using the most permissive inclusion threshold (each allele that was found even once in genome sequence data was included) to maximize completeness. It represents an upper bound on the number of novel alleles that could be detected in the genome sequence data. In the table we refer to these as “putative novel alleles.” Elsewhere in the article we note that the impact of sequencing errors can be reduced by increasing the inclusion threshold (see the Materials and Methods), and we show how the number of putative novel alleles per gene decreases as this threshold is increased (Supplemental Fig. 3). By providing an adjustable parameter for the inclusion threshold, we allow users of our pipeline to decide the balance they would like to strike between maximizing inclusiveness of rare alleles and minimizing false positives due to sequencing errors.
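The effect of the inclusion threshold can be illustrated with a minimal Python sketch. This is not the code of the pipeline described in the article: the function names, sample identifiers, and toy sequences are hypothetical, and the threshold is assumed here to be the minimum number of individuals in which an allele must be observed before it is retained.

```python
from collections import defaultdict


def count_carriers(observations):
    """Count the distinct individuals in which each allele was observed.

    `observations` is an iterable of (individual_id, allele_sequence)
    pairs. Both the input layout and the names are hypothetical.
    """
    carriers = defaultdict(set)
    for individual_id, allele_sequence in observations:
        carriers[allele_sequence].add(individual_id)
    return {allele: len(ids) for allele, ids in carriers.items()}


def filter_putative_alleles(observations, min_individuals=1):
    """Keep alleles seen in at least `min_individuals` individuals.

    min_individuals=1 is the most permissive setting: every allele
    observed even once is retained.
    """
    counts = count_carriers(observations)
    return {allele for allele, n in counts.items() if n >= min_individuals}


# Toy data (entirely made up): three alleles with 3, 2 and 1 carriers.
observations = [
    ("sample_A", "ACGT"), ("sample_B", "ACGT"), ("sample_C", "ACGT"),
    ("sample_A", "ACGA"), ("sample_B", "ACGA"),
    ("sample_C", "ACTT"),
]

# Number of putative alleles retained at each threshold.
retained_counts = {
    threshold: len(filter_putative_alleles(observations, threshold))
    for threshold in (1, 2, 3)
}
print(retained_counts)
```

In this toy example the singleton allele is dropped as soon as the threshold exceeds one, which mirrors the trade-off described above: a higher threshold removes alleles that may be sequencing errors, at the cost of also discarding genuinely rare alleles.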

Watson et al. list three potential caveats of inferring germline alleles from genome sequence data. The first relates to sequencing and mapping errors in the data from the 1000 Genomes Project (G1K). We agree with the authors that sequencing errors can be mistaken for rare alleles and we acknowledged this in our article. The impact of sequencing errors can be reduced by increasing the inclusion threshold, as discussed above. We agree with Watson et al. that mapping errors, resulting from the repetitive nature of these genomic loci, may also cause false positive alleles to be inferred and acknowledge that this caveat should have been included in the Discussion section of our article. We note that in some cases mapping errors may cause true novel alleles to be attributed to the wrong gene, rather than the inference of false positive alleles. Unlike sequencing errors, the impact of mapping errors is not necessarily reduced by increasing the inclusion threshold.

The second issue relates to limitations imposed by the reference genome to which short read sequences were mapped by G1K for variant calling. Clearly, our method will not recover alleles of genes not included in the reference genome or not sequenced to sufficient depth, and this is noted in our article. However, we are grateful to Watson et al. for pointing out that the incompleteness of the reference genome may also exacerbate mapping errors. This underscores the importance of an accurate reference genome (possibly augmented to take account of copy number variants) for the task of inferring germline alleles from genomic sequence data.

The last caveat listed by Watson et al. relates to the fact that the G1K data were generated from lymphoblastoid cell lines that may have already undergone rearrangement and somatic hypermutation. We do not find evidence that this has had a major impact on our results. There was no obvious relationship between gene position (distal/proximal) and coverage, which would be expected if a large fraction of the cells had undergone V(D)J recombination (see Supplemental Fig. 1), and we observed no obvious excess of variants within the CDR3 region. This is probably because many reads overlapping the recombined V(D)J locus fail to map to the reference genome, owing to gaps caused by recombination and mismatches from somatic mutation. Somatic mutations on the rearranged locus are likely either not to be called as variants (because too few reads confirm the variant) or not to be found in multiple individuals (in which case they can be removed by increasing the inclusion threshold).

In summary, germline alleles inferred from genomic sequencing data should be considered putative and require validation. The incompleteness of the human reference genome sequence places a clear limitation on the completeness of the set of germline alleles that can be inferred from genome sequence data. Inferred alleles are likely to include false positives arising from at least two sources: sequencing errors (which can be reduced by thresholding on the number of individuals in which an allele is found before it is included in the database) and mapping errors, including those resulting from the limitations of the human reference genome. Nonetheless, we showed in our article that a high proportion of known alleles (or, in the case of IGHV, of the alleles at higher validation levels) can be recovered from genome sequencing data and that including putative alleles inferred from genome sequence data leads to improved mapping performance with real data. This improvement likely reflects a proportion of true novel germline alleles, recovered from the genomic sequence data, that contributes to the completeness of the GLDB.

Finally, we wholeheartedly concur with Watson et al. on the need for a collaborative effort, making use of multiple data types, for the inference of a complete and accurate GLDB. We hope that the methods and software made available through our article will be of value in this regard.

Abbreviations used in this article: G1K, 1000 Genomes Project; GLDB, germline database.

1. A database of human immune receptor alleles recovered from population sequencing data. J. Immunol.