Abstract
Abs are immune system proteins that recognize noxious molecules for elimination. Their sequence diversity and binding versatility have made Abs the primary class of biopharmaceuticals. Recently, it has become possible to query their immense natural diversity using next-generation sequencing of Ig gene repertoires (Ig-seq). However, Ig-seq outputs are currently fragmented across repositories and tend to be presented as raw nucleotide reads, which means nontrivial effort is required to reuse the data for analysis. To address this issue, we have collected Ig-seq outputs from 55 studies, covering more than half a billion Ab sequences across diverse immune states, organisms (primarily human and mouse), and individuals. We have sorted, cleaned, annotated, translated, and numbered these sequences and make the data available via our Observed Antibody Space (OAS) resource at http://antibodymap.org. The data within OAS will be regularly updated with newly released Ig-seq datasets. We believe OAS will facilitate data mining of immune repertoires for improved understanding of the immune system and development of better biotherapeutics.
Introduction
Antibodies (or BCR) are protein products of B cells and primary actors of adaptive immunity in jawed vertebrates (1). They are highly malleable molecules that can bind to virtually any Ag. An organism holds a great variety of these molecules, increasing the probability that an arbitrary Ag can be recognized by an Ab, initiating an immune response (2). Owing to their binding malleability, they are the most prominent class of reagents and biotherapeutics (Refs. 3 and 4 and M. Raybould, C. Marks, K. Krawczyk, B. Taddese, J. Nowak, A.P. Lewis, A. Bujotzek, J. Shi, and C.M. Deane, manuscript posted on bioRxiv). Continued successful exploitation of these molecules relies on our ability to discern the functional diversity of Ab repertoires (5–7).
Next-generation sequencing of Ig gene repertoires (Ig-seq) has enabled researchers to take snapshots of millions of sequences at a time across individuals, diverse organisms, and different immune states (8, 9). The ability to sequence and analyze millions of Ab sequences has the potential to uncover the mechanics of the immune response to any Ag (10, 11) and dysfunctions of the immune system itself (12).
Many previous studies have addressed the issue of Ab diversity, contributing invaluable evidence to understanding the dynamics of human immune systems (13). Numerous analyses have focused on the frequencies of V(D)J gene usages, which can offer insights into creating biased therapeutic Ab libraries (14–16). Another therapeutic application of Ab repertoire analysis is advancing vaccine design by comparative longitudinal studies of pre- and postantigen challenge experiments (10, 11, 17–22). Such comparative studies have shown that different individuals can converge on the same Ab sequence against a given vaccine (11, 19). Because of sequencing limitations, these analyses have focused on H or L chains separately, whereas one ought to study the paired repertoire to obtain deeper insights into Ab diversity (23).
Technical advances in sequencing technology have outpaced storage and analysis pipelines (24, 25). This has meant that the outputs of Ig-seq studies are fragmented across repositories, making it difficult to perform large-scale data mining of Ab repertoires (25). Metadata, such as isotype, age, or subject identifiers, are not typically standardized; therefore, extraction of specific subsets of Ab repertoires for comparative analyses is challenging. Furthermore, the data are typically deposited as raw nucleotide reads. It requires nontrivial ad hoc effort to convert such raw reads to amino acid sequences that ultimately dictate the molecular structure and Ag recognition. Some of these issues are addressed by services that provide Ig-seq–specific data deposition and analysis pipelines such as the B-T.CR wiki (https://b-t.cr), ImmPort (http://immport.org) (26, 27), immunoSEQ Analyzer (http://clients.adaptivebiotech.com/), iReceptor (http://ireceptor.irmacs.sfu.ca/) (28), or VDJServer (http://vdjserver.org) (29). The iReceptor and the VDJServer are the main resources that fall under the umbrella of the organized effort of the Adaptive Immune Receptor Repertoire Community to provide standardized deposition and analysis pipelines for the Ig-seq outputs (24). These services chiefly focus on facilitating bulk deposition of raw data to perform standardized sequencing analyses. Ultimately, because immunoinformatics is not the chief focus of such services, bulk data download from such websites is limited, and converting the raw nucleotide data obtained into a format suitable for analysis still requires installing and running additional software packages. In this study, we identify, clean, annotate, and make the data available as a starting point for immunoinformatics analyses.
To address these issues, we have created the Observed Antibody Space (OAS) resource that allows large-scale data mining of Ab repertoires. We have, so far, collected the raw outputs of 55 Ig-seq experiments, covering over half a billion sequences. We have organized the sequences by metadata, such as organism, isotype, B cell type, and source, and the immune status of B cell donors to facilitate bulk retrieval of specific subsets for comparative analyses. We have converted all of the Ig-seq sequences to amino acids while preserving the link to the respective original raw nucleotide sequences and numbered them using the International ImMunoGeneTics information system (IMGT) scheme. The data are available for querying or bulk download at http://antibodymap.org. We believe that OAS will facilitate data-mining Ab repertoires for improved understanding of the dynamics of the immune system and, thus, better engineering of biotherapeutics.
Materials and Methods
A list of study accession codes of publicly available Ig-seq datasets was obtained via a literature review. The majority of raw reads were downloaded from the European Nucleotide Archive (30) and the National Center for Biotechnology Information websites (31). In a small number of cases, another public Ig-seq repository was specified [e.g., (14, 32–34)]. Metadata were manually extracted from the deposited datasets and arranged in a reproducible format.
The downloaded FASTQ files were processed depending on the sequencing platform. Paired raw Illumina reads were assembled with FLASH (35). The assembled Ab sequences were converted to the FASTA format using FASTX-Toolkit (36). As raw reads from Roche 454 are not paired, these FASTQ files were directly converted to the FASTA format with the FASTX-Toolkit.
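The FASTQ-to-FASTA conversion step can be illustrated with a short Python sketch. This is a stand-in for illustration only, not the FASTX-Toolkit itself: the function name `fastq_to_fasta` is hypothetical, and the code assumes well-formed four-line FASTQ records (header, sequence, separator, quality) with no line wrapping.

```python
def fastq_to_fasta(fastq_lines):
    """Convert four-line FASTQ records to FASTA text, dropping quality strings.

    Assumes each record is exactly: '@header', sequence, '+...', quality.
    """
    lines = [line.rstrip("\r\n") for line in fastq_lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        # FASTA keeps the read identifier and the sequence only.
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records)


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "IIII"]
    print(fastq_to_fasta(example))
```

Paired Illumina reads would first need to be merged (the study uses FLASH for this) before such a conversion is meaningful.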
The H chain sequences were automatically annotated with isotype information unless such data were given in the corresponding publication. Automatic isotype annotation was performed by aligning the constant heavy domain 1 (CH1) of any given Ab sequence against the IMGT isotype reference (37) of the respective species using the Smith–Waterman (38) algorithm. We assigned a score of 2 for a nucleotide match and a score of −1 for a nucleotide mismatch or a gap. The IMGT isotype references comprised 21-nt–long fragments of the CH1 domain of the Ab isotypes. To ensure a high confidence of correct isotype identification, we employed a conservative threshold of 30 in the Smith–Waterman algorithm scoring function. Sequences whose Smith–Waterman algorithm score was below the threshold for all isotypes were assigned as “bulk.” The robustness of this protocol was confirmed on the author-annotated Ig-seq datasets (18, 39, 40), in which it resulted in 99% accurate annotations. Around 1% of the Ig-seq data had a very short (or missing) CH1 domain sequence. Such sequences were also assigned as bulk.
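The scoring scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline's actual code: the function names are hypothetical, gaps are penalized linearly at −1 per position, a sequence is assigned when its score is at or above the threshold, and the reference strings in the usage example are arbitrary 21-nt placeholders, not IMGT isotype references.

```python
def smith_waterman_score(query, reference, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two strings (linear gap penalty)."""
    prev = [0] * (len(reference) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, r in enumerate(reference, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            # Local alignment: scores never drop below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def assign_isotype(ch1_fragment, references, threshold=30):
    """Return the best-scoring isotype name, or 'bulk' if no score reaches the threshold."""
    scores = {name: smith_waterman_score(ch1_fragment, ref)
              for name, ref in references.items()}
    if not scores:
        return "bulk"
    best_name = max(scores, key=scores.get)
    return best_name if scores[best_name] >= threshold else "bulk"


if __name__ == "__main__":
    # Placeholder 21-nt references (NOT real IMGT sequences).
    refs = {
        "isotype_A": "ACGTACGTACGTACGTACGTA",
        "isotype_B": "TTGACCAGGTCAAGCTTGCAT",
    }
    print(assign_isotype("AAAA" + refs["isotype_B"] + "CCCC", refs))
    print(assign_isotype(refs["isotype_B"][:10], refs))
```

Under this scoring, a perfect match to a 21-nt reference scores 42, and each mismatch in a full-length ungapped alignment lowers the score by 3, so a threshold of 30 tolerates at most four such mismatches. A fragment of 10 nt can score at most 20 and is therefore always labeled bulk.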
IgBLASTn (41) was used to convert the FASTA files of Ab nucleotide sequences to amino acids. The amino acid sequences were then parsed with ANARCI (42) using the IMGT scheme (43). In this step, every sequence is IMGT numbered and inspected for compliance with our knowledge of Ig folding. Amino acid sequences that harbor unusual indels in canonical CDR and framework regions or stop codons are removed as these are considered structurally nonviable. ANARCI does not number a sequence if its V and J genes do not align to a Hidden Markov Model (44) built on its respective species amino acid IMGT germlines (37). We also filter out potentially chimeric sequences by detecting duplicated CDR-H3 regions in every amino acid sequence, checking for the complete sequence residue annotation, checking for the full-length framework 4 region, and imposing the length cutoff of 37 residues for CDR-H3 in human, mouse, rat, rabbit, alpaca, and rhesus Abs. Because of technical limitations of sequencing platforms, certain reads were missing significant portions of the V region (e.g., portions of CDR1); sequences that did not have all three CDRs were discarded as incomplete.
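A subset of these filters can be sketched in Python. The sketch below is illustrative only and covers three of the checks (stop codons, presence of all three CDRs, and the CDR-H3 length cutoff). It assumes the numbering is supplied as a list of `((position, insertion_code), residue)` tuples with `-` marking gaps, treats the cutoff of 37 as inclusive, and uses the standard IMGT CDR boundaries (27–38, 56–65, 105–117), which are not stated in the text above. All function and constant names are hypothetical.

```python
# Standard IMGT CDR boundaries (inclusive); an assumption not given in the text.
IMGT_CDRS = {"cdr1": (27, 38), "cdr2": (56, 65), "cdr3": (105, 117)}
MAX_CDR3_LENGTH = 37  # cutoff from the text, assumed inclusive


def extract_region(numbering, start, end):
    """Concatenate non-gap residues whose IMGT position lies in [start, end]."""
    return "".join(
        res for (pos, _ins), res in numbering
        if start <= pos <= end and res != "-"
    )


def passes_filters(numbering):
    """Return True if the numbered sequence survives the sketched checks."""
    # Stop codons (written '*') mark the sequence as nonviable.
    if any(res == "*" for _key, res in numbering):
        return False
    cdrs = {name: extract_region(numbering, start, end)
            for name, (start, end) in IMGT_CDRS.items()}
    # Sequences lacking any of the three CDRs are treated as incomplete.
    if any(len(seq) == 0 for seq in cdrs.values()):
        return False
    return len(cdrs["cdr3"]) <= MAX_CDR3_LENGTH


if __name__ == "__main__":
    toy = [((27, " "), "G"), ((56, " "), "I"), ((105, " "), "A"), ((106, " "), "R")]
    print(passes_filters(toy))
```

The remaining checks described above (duplicated CDR-H3 regions, complete residue annotation, and a full-length framework 4 region) would be added as further predicates of the same shape.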
The V and J gene annotation available in OAS is obtained using ANARCI, which identifies the germline genes with the highest amino acid identity (37). As V and J genes of camels have not yet been well characterized (45), we employed the alpaca (the closest relative available) Ig genes in camel Ig-seq data interrogation, as these two species belong to the same biological family (Camelidae). If data from other poorly cataloged species are added to OAS, we will use the closest available relative for V and J gene annotation.
Using the protocol above, we annotated Ig-seq results of 55 independent studies. To streamline updating OAS with new data, we have generated a procedure to automatically identify Ig-seq datasets from raw sequence read archives. We apply our Ab annotation protocol to each raw nucleotide dataset deposited in the National Center for Biotechnology Information/European Nucleotide Archive repositories; if we find more than 10,000 Ab sequences in any given dataset, it is set aside for manual inspection. Manual inspection is still necessary to efficiently assign metadata, as these are currently deposited in a nonstandardized manner. This procedure allows for automatic identification of new Ig-seq datasets and semiautomatic updating of OAS.
Results
We collected raw sequencing outputs from 55 Ig-seq studies. All raw nucleotide reads were converted into amino acids using IgBLASTn (41). Within OAS, it is possible to link back from the translated amino acid sequences to the raw nucleotide data. The full amino acid sequences were then IMGT numbered using ANARCI (42). As well as providing IMGT and gene annotations, ANARCI acts as a broad-brush filter of Ab sequences that are likely to be erroneous (see Materials and Methods). For each Ig-seq dataset, we provide the total number of amino acid sequences that were retrieved from IgBLASTn outputs as well as the number remaining after ANARCI parsing. These numbers may be useful as proxies for dataset quality assessment. Applying the same retrieval, amino acid conversion, gene annotation, and numbering protocol to all sequences assures the same point of reference across the 55 heterogeneous Ig-seq datasets (46). This protocol produces the full IMGT-numbered sequences together with gene annotations for each of the 55 datasets.
The numbered amino acid sequences in each dataset are sorted by metadata (e.g., individual, age, vaccination regimen, and B cell type and source) (Fig. 1). Deposition of such metadata is currently not standardized and requires ad hoc manual curation for each dataset. In an effort to organize the Ab sequences using such metadata, we have grouped the sequences within each dataset into Data Units. Each Data Unit represents a group of sequences within a given dataset with a unique combination of metadata values. The metadata values are summarized in Table I.
Metadata Name | Metadata Description |
---|---|
Chain | H chain/L chain annotation |
Isotype | Identified or deposited isotype information |
Age | Information on the age of the human B cell donors |
Disease | Indication of whether the donor was sick at the time of B cell extraction |
Vaccine | Indication if the B cell donor was purposely immunized prior to B cell extraction |
B-cell subset | Indication if a particular B cell subset was sorted for Ig-seq |
Species | Organism of the B cell donor |
B-cell source | Organ/tissue from which the B cells were extracted |
Subject | Indication of a particular B cell donor from whom the B cells were sourced |
Longitudinal | If the study was longitudinal, an indicator of the time point |
Size | Number of redundant amino acid sequences in the Data Unit |
Size_igblastn | Number of redundant amino acid sequences extracted from IgBLASTn outputs prior to ANARCI parsing |
Link | Link to the source publication |
Each Data Unit is uniquely identified by the study and a collection of the metadata values.
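The grouping of sequences into Data Units can be sketched as a group-by over metadata. This is a minimal illustration, not the OAS implementation: the record layout, the key names in `METADATA_KEYS`, and the function name are hypothetical, chosen to mirror the descriptors in Table I.

```python
from collections import defaultdict

# Hypothetical field names mirroring the descriptors listed in Table I.
METADATA_KEYS = (
    "chain", "isotype", "age", "disease", "vaccine",
    "bcell_subset", "species", "bcell_source", "subject", "longitudinal",
)


def group_into_data_units(study, records):
    """Group sequence records by study plus their unique metadata combination."""
    units = defaultdict(list)
    for record in records:
        # Missing descriptors are represented as None in the key.
        key = (study,) + tuple(record.get(k) for k in METADATA_KEYS)
        units[key].append(record["sequence"])
    return dict(units)


if __name__ == "__main__":
    records = [
        {"sequence": "QVQLVQSG", "chain": "Heavy", "isotype": "IGHM", "species": "human"},
        {"sequence": "EVQLVESG", "chain": "Heavy", "isotype": "IGHM", "species": "human"},
        {"sequence": "QVQLQESG", "chain": "Heavy", "isotype": "IGHG", "species": "human"},
    ]
    for key, seqs in group_into_data_units("example_study", records).items():
        print(key, len(seqs))
```

In this toy input the two IgM records share all metadata values and fall into one Data Unit, whereas the IgG record forms its own.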
As of July 1, 2018, 55 Ig-seq studies are included in OAS, totaling 618,371,034 sequences (562,544,071 VH and 55,826,963 VL sequences), whereas the total number of translated amino acid sequences that were obtained from IgBLASTn outputs prior to ANARCI parsing is 803,508,673. The majority of the sequences deposited in OAS are murine (∼49.4%) and human (∼48.4%). Twenty-two of the Ig-seq studies interrogate the immune system of diseased individuals, the most common ailment being HIV (13 studies). The database also contains 24 Ig-seq studies of the naive Ab gene repertoires (the collection of B cells from donors who are healthy and not purposefully vaccinated). The main source of B cells in the OAS database is peripheral blood (∼241 million sequences), followed by spleen/splenocytes (∼198 million) and bone marrow (∼124 million). The database holds isotype information for each individual H chain sequence, and the two most common isotypes are IgM (∼316 million) and IgG (∼144 million). For ∼65 million sequences, we were not able to assign isotypes with high confidence. The median redundant size of the Ig-seq studies in the OAS database is 2,164,901 sequences, whereas the largest Ig-seq study was that by Greiff et al. (14) (246,449,120 redundant sequences). Two sequences are redundant if they are of identical length and identical amino acid composition. Detailed statistics on each dataset are given in Table II and summary statistics are located at http://antibodymap.org/oasstats. All the data may be bulk downloaded or individual Data Units queried at http://antibodymap.org.
Study | Species | Disease | Vaccine | B Cell Source | B Cell Subset | Total ANARCI Parsed Sequences |
---|---|---|---|---|---|---|
Banerjee et al. (47) | Rabbit | None | HIV | PBMC | Unsorted | 4,334,088 (2,926,727) |
Bashford-Rogers et al. (48) | Human | CLL/none | None | PBMC | Unsorted | 129,013 (86,166) |
Bhiman et al. (49) | Human | HIV | None | PBMC | Unsorted | 785,751 (187,067) |
Bonsignori et al. (50) | Human | HIV/none | None | PBMC | Memory/unsorted | 210,377 (57,374) |
Collins et al. (51) | Mouse | None | None | Splenocytes | Unsorted | 812,439 (194,752) |
Corcoran et al. (52) | Human/mouse/rhesus | None | None | PBMC | Unsorted | 5,307,880 (2,840,877) |
Cui et al. (53) | Mouse | None | NP-CGG/none | Splenocytes | Memory | 5,513,816 (935,646) |
Doria-Rose et al. (13) | Human | HIV | None | PBMC | Unsorted | 2,164,901 (549,544) |
Ellebedy et al. (54) | Human | None | Flu | PBMC | Naive/memory/ASC/ABC | 9,626,744 (4,807,583) |
Fisher et al. (55) | Mouse | None | Plasmodium | Spleen | Unsorted | 175,015 (113,594) |
Galson et al. (18) | Human | None | Hepatitis B | PBMC | Unsorted/plasma cells/Hepatitis B–specific | 21,755,739 (10,442,291) |
Galson et al. (39) | Human | None | Hepatitis B | PBMC | Unsorted/plasma cells/Hepatitis B–specific | 26,687,394 (14,343,236) |
Galson et al. (21) | Human | None | Meningitis | PBMC | Naive/plasma cells/memory/marginal zone | 7,918,197 (3,282,907) |
Galson et al. (17) | Human | None | Flu | PBMC | Plasma cells | 13,685,210 (5,065,786) |
Greiff et al. (40) | Mouse | None | NP-CGG | Bone marrow/spleen | Plasma cells/plasmablasts | 7,955,739 (2,891,649) |
Greiff et al. (34) | Mouse | None | NP-CGG | Spleen | ASCs/plasma cells/naive | 788,787 (523,716) |
Greiff et al. (14) | Mouse | None | OVA/Hepatitis B/NP-HEL/none | Spleen/bone marrow | Plasma cells/pre–B cells/naive | 246,449,120 (129,417,569) |
Gupta et al. (56) | Human | None | Flu/Hepatitis A/Hepatitis B | PBMC | Unsorted | 25,134,322 (9,966,175) |
Halliley et al. (57) | Human | None | Flu/tetanus | Bone marrow | Plasma cells | 2,348,164 (1,208,616) |
Huang et al. (58) | Human | HIV | None | PBMC | Memory | 11,693,783 (5,701,433) |
Jiang et al. (59) | Human | None | Flu | PBMC | Naive/plasmablasts | 3,199,271 (1,809,306) |
Joyce et al. (60) | Human | None | None | PBMC | Unsorted | 2,747,688 (1,463,421) |
Khan et al. (61) | Mouse | None | OVA | Spleen | Unsorted | 24,175,033 (7,113,411) |
Levin et al. (62) | Human | Allergy | None | PBMC/nasal biopsy | Unsorted | 528,173 (370,465) |
Levin et al. (63) | Human | Allergy | None | PBMC/bone marrow | Unsorted | 29,643,305 (9,557,586) |
Li et al. (64) | Camel | None | None | PBMC | Unsorted | 1,152,359 (1,127,651) |
Liao et al. (65) | Human | HIV | None | PBMC | Unsorted | 1,420,314 (619,492) |
Lindner et al. (66) | Mouse | None | Escherichia coli/Clostridia/Lactobacillus | Biopsy of small intestine | Unsorted | 1,686,350 (544,061) |
Meng et al. (67) | Human | CMV/EBV/none | None | PBMC/lung/spleen/bone marrow/colon/jejunum/lymph node/ileum | Unsorted | 45,576,606 (21,738,501) |
Menzel et al. (68) | Mouse | None | NP-CGG | Spleen/bone marrow | ASCs | 14,355,151 (6,058,480) |
Mroczek et al. (69) | Human | None | None | PBMC | Immature/transitional/mature/plasmacytes/memory | 104,154 (85,525) |
Ota et al. (70) | Mouse | None | None | Spleen/lymph | Unsorted | 21,505 (9,619) |
Palanichamy et al. (71) | Human | MS | None | Cerebrospinal fluid/PBMC | Unsorted | 776,895 (292,801) |
Parameswaran et al. (11) | Human | Dengue/none/nondengue febrile illness | None | PBMC | Unsorted | 26,584 (23,606) |
Prohaska et al. (72) | Mouse | None | None | Spleen/peritoneum | B-1/B-2/marginal zone/follicular | 336,723 (198,983) |
Rettig et al. (33) | Mouse | None | None | Spleen/splenocytes | Unsorted | 41,908 (24,908) |
Rubelt et al. (73) | Human | None | None | PBMC | Naive/memory | 2,320,947 (1,719,507) |
Schanz et al. (32) | Human | HIV/none | None | PBMC | Unsorted | 12,734,958 (5,412,549) |
Stern et al. (74) | Human | MS | None | Cervical lymph node/white matter lesion/pia mater/choroid plexus/cortex/spleen | Unsorted | 8,550,247 (3,321,530) |
Sundling et al. (75) | Rhesus | None | HIV | PBMC | Unsorted | 40,960 (26,298) |
Tipton et al. (76) | Human | SLE/none | Flu/tetanus | PBMC | Unsorted | 28,204,742 (13,301,396) |
Tong et al. (77) | Mouse | None | OVA | Bone marrow/spleen | Pro–B cells/follicular | 92,936 (56,878) |
Turchaninova et al. (78) | Human | None | None | PBMC | Memory/plasma cells/naive | 183,949 (176,441) |
Vander Heiden et al. (79) | Human | MG/none | None | PBMC | Memory/naive/unsorted | 13,939,166 (5,170,299) |
VanDuijn et al. (80) | Rat | None | DNP/HuD | Splenocytes | Unsorted | 6,359,396 (4,234,597) |
Vergani et al. (81) | Human | None | None | PBMC | Unsorted | 14,161,949 (5,987,086) |
Wesemann et al. (82) | Mouse | None | NP-CGG | Lamina propria/bone marrow/spleen | Unsorted | 146,370 (40,132) |
Wu et al. (83) | Human | HIV | None | PBMC | Unsorted | 394,144 (198,468) |
Wu et al. (84) | Human | HIV | None | PBMC | Unsorted | 5,545,910 (1,370,109) |
Wu et al. (85) | Human | Allergy/none | None | PBMC/nasal biopsy | Unsorted | 35,034 (23,923) |
Zhou et al. (22) | Human | HIV | None | PBMC | Unsorted | 1,541,645 (458,227) |
Zhou et al. (86) | Human | HIV | None | PBMC | Unsorted | 722,112 (291,670) |
Zhu et al. (87) | Human | HIV | None | PBMC | Unsorted | 874,930 (174,435) |
Zhu et al. (88) | Human | HIV | None | PBMC | Unsorted | 1,962,643 (532,350) |
Zhu et al. (89) | Human | HIV | None | PBMC | Unsorted | 1,290,499 (699,828) |
The datasets are organized into studies related to a given Ig-seq experiment. Each study in the OAS database is subdivided into Data Units. Each Data Unit is a collection of IMGT-numbered amino acid sequences uniquely identified by the metadata descriptors given in Table I; five of which (species, disease, vaccine, B cell source, and B cell type) are given in this table. The “Total ANARCI Parsed Sequences” column indicates the total number of redundant sequences in our database, with the nonredundant numbers in parentheses.
ABC, activated B cell; ASC, Ab secreting cell; CLL, chronic lymphocytic leukemia; DNP, keyhole limpet hemocyanin modified with dinitrophenyl; Flu, influenza; HuD, paraneoplastic encephalomyelitis Ag; MG, myasthenia gravis; MS, multiple sclerosis; NP-CGG, chicken γ globulin; NP-HEL, hen egg lysozyme; SLE, systemic lupus erythematosus.
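The distinction between redundant and nonredundant counts reported in Table II can be illustrated with a short sketch. It assumes that two sequences count as redundant copies of each other when their amino acid strings are identical, which is our reading of the definition given in the text; the function name is hypothetical.

```python
from collections import Counter


def redundancy_summary(sequences):
    """Count all sequences (redundant) and distinct sequences (nonredundant)."""
    counts = Counter(sequences)
    return {
        "redundant": sum(counts.values()),  # every observed copy
        "nonredundant": len(counts),        # unique amino acid strings
    }


if __name__ == "__main__":
    toy = ["QVQLVQSG", "QVQLVQSG", "EVQLVESG"]
    print(redundancy_summary(toy))
```

For a real Data Unit, the two values would correspond to the number outside and inside the parentheses in the "Total ANARCI Parsed Sequences" column.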
Discussion
In this study, we describe the OAS database, a unified repository to facilitate large-scale data mining of Ab repertoires in both their amino acid and nucleotide forms. Absence of well-established repositories in Ig-seq deposition space required us to perform a combination of literature search and manual curation of the datasets to organize the data into OAS. The current lack of widely adopted deposition standards hampers automatic updating of OAS, as datasets in which we find a large number of Abs still require manual curation to perform metadata annotation correctly. Hopefully, efforts such as those by the Adaptive Immune Receptor Repertoire Community will result in standardization of Ig-seq outputs and will further streamline deposition procedures facilitating large-scale data mining of Ab repertoires (24). Devising unified Ab repertoire repositories is challenging because of both the size of the datasets as well as the diverse data descriptors and analytical pipelines desired by bioinformaticians, wet lab scientists, and clinicians (34).
To our knowledge, OAS is the first organized collection of a large body of Ig-seq outputs that is designed for continuous expansion as more and more Ig-seq data become available. The basic data files are stored in an efficiently compressed format and are searchable by lightweight metadata entries. To allow comparative bioinformatics analyses across different subsets of Ab repertoires, we have annotated the datasets by commonly used metadata descriptions, such as organism, isotype, B cell type and source, and the immune state of B cell donors. To facilitate research about particular Ab sequences or regions, we make full IMGT-numbered, high-quality amino acid sequences available together with gene annotations as well as linked raw nucleotide data.
These data should aid in-depth comparative analyses across different studies to discern the commonalities observed between independent samples and help direct Ig-seq experiments toward not-yet interrogated Ab repertoires. Revealing shared preferences can be invaluable in identifying the portions of the theoretically allowed Ab space that are strategically used to start immune responses (6). Furthermore, such comparative studies can offer a way of deconvoluting the various degrees of freedom of immune repertoires, such as differences between diversity of isotypes (69) or organisms (90). Charting the differences between repertoires of human/mouse is of particular interest for engineering better humanized biotherapeutics (91). Paired Ab chain sequence information provides an enhanced view on Ab biology (92). However, current paired-sequencing approaches only allow for the delineation of CDR-H3 and CDR-L3 sequences (16); as sequencing read length increases to span all three CDR regions, these paired Ig-seq datasets will be incorporated into OAS.
Beyond identifying broad commonalities across repertoires, data mining Ig-seq outputs provides novel avenues for designing better Ab-based therapeutics. The plethora of currently available Ig-seq data offers a glimpse at a set of sequences that should be able to fold and function in an organism. Aligning therapeutic candidates to sequences in Ig-seq repertoires can reveal mutational choices that might be naturally acceptable, hence providing shortcuts for Ab engineering such as humanization (93). Furthermore, contrasting the naturally observed Abs with therapeutic ones can offer insight as to the naturally favored biophysical properties of these molecules (4). All such future applications rely on the availability of well-structured datasets that can offer a unified point of reference for bioinformatics analyses. We hope that OAS will aid data mining of Ab repertoires, help identify strategic preferences of our immune systems, and ultimately improve how we engineer Abs into better therapeutics.
Acknowledgements
We thank all members of Oxford Protein Informatics Group for testing our OAS resource. In particular, we are grateful to Garret M. Morris and Matthew Raybould for comments that significantly improved the quality of our work.
Footnotes
This work was supported by funding from the Biotechnology and Biological Sciences Research Council (Grant BB/M011224/1) and UCB Pharma Ltd., awarded to A.K.
References
Disclosures
The authors have no financial conflicts of interest.