Abstract
Current Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) using short-read sequencing strategies resolve expressed Ab transcripts with limited resolution of the C region. In this article, we present the near-full-length AIRR-seq (FLAIRR-seq) method that uses targeted amplification by 5′ RACE, combined with single-molecule, real-time sequencing to generate highly accurate (99.99%) human Ab H chain transcripts. FLAIRR-seq was benchmarked by comparing H chain V (IGHV), D (IGHD), and J (IGHJ) gene usage, complementarity-determining region 3 length, and somatic hypermutation to matched datasets generated with standard 5′ RACE AIRR-seq using short-read sequencing and full-length isoform sequencing. Together, these data demonstrate robust FLAIRR-seq performance using RNA samples derived from PBMCs, purified B cells, and whole blood, which recapitulated results generated by commonly used methods, while additionally resolving H chain gene features not documented in IMGT at the time of submission. FLAIRR-seq data provide, for the first time, to our knowledge, simultaneous single-molecule characterization of IGHV, IGHD, IGHJ, and IGHC region genes and alleles, allele-resolved subisotype definition, and high-resolution identification of class switch recombination within a clonal lineage. In conjunction with genomic sequencing and genotyping of IGHC genes, FLAIRR-seq of the IgM and IgG repertoires from 10 individuals resulted in the identification of 32 unique IGHC alleles, 28 (87%) of which were previously uncharacterized. Together, these data demonstrate the capabilities of FLAIRR-seq to characterize IGHV, IGHD, IGHJ, and IGHC gene diversity for the most comprehensive view of bulk-expressed Ab repertoires to date.
This article is featured in Top Reads, p.1461
Introduction
Antibodies or Igs are the primary effectors of humoral immunity and are found as both membrane-bound receptors on B cells and circulating, secreted proteins (1). Both membrane-bound BCRs and secreted Abs act to recognize and bind Ag. All Abs and BCRs are composed of two identical heavy (H) and light (L) chains that are posttranslationally associated. The H chain is composed of two distinct domains: (1) the variable domain (Fab), which allows for Ag binding; and (2) the constant domain (Fc), which modulates downstream effector functions (1, 2). The L chain also includes a variable domain that, once posttranslationally associated with the H chain variable domain, interacts with cognate Ag (3). In humans, Abs are grouped into discrete isotypes and subisotypes (i.e., IgM, IgD, IgG1, IgG2, IgG3, IgG4, IgA1, IgA2, and IgE), based on the expression of specific C genes within the genomic locus that encodes the IgH (IGH) locus. Each isotype and subisotype has unique effector properties that together represent the wide diversity of Ab-mediated functions, including binding of Fc receptors, activation of complement, opsonization, Ab-dependent cellular cytotoxicity, and Ab-dependent cellular phagocytosis (4, 5).
To facilitate the development of diverse Ab repertoires capable of recognizing the wide range of pathogens that humans encounter, the Ig genomic loci are highly polymorphic and harbor diverse and complex sets of genes that recombine in each B cell to encode up to 1013 unique specificities (6). B cells create this expansive catalog of specificities through somatic recombination of the V, D, and J genes in IGH, and V and J genes from the corresponding L chain loci, λ (IGL) and κ (IGK) (7). During VDJ recombination in IGH, a single D and J gene are first recombined, whereas the intervening and unselected D and J gene sequences are removed by RAG recombinase (8). After D and J genes are joined, further recombination of a specific V gene to the recombined DJ completes the formation of the full VDJ rearrangement. After transcription of the recombined VDJ, a single C region gene is spliced together with the VDJ cassette to generate the completed H chain transcript (7). Recombination at IGL and IGK occurs similarly, recombining V and J genes only. H and L chain transcripts are independently translated and linked via covalent cysteine bonds resulting in a fully functional protein before B cell cell-surface expression or secretion (Fig. 1A) (9). Naive B cells, which develop in the bone marrow from hematopoietic stem cell progenitors, have undergone VDJ recombination but have not yet encountered Ag and solely express IgM and IgD (10). These naive B cells then migrate to B cell zones in secondary lymphoid tissues, where they encounter Ag, driving further maturation and class switch recombination (CSR) to enable the most effective humoral responses (11). CSR mediates the excision of IGHC genes at the DNA level, which leads to the utilization and linkage of different IGHC genes to the same VDJ, ultimately resulting in class switching to alternate isotypes and subisotypes (11).
Overview of Ab structure and FLAIRR-seq molecular method. (A) Schematic representation of the IGH locus, H chain transcript structure, and functional Ig protein. (B) Comparative coverage across the H chain transcript of commonly used sr-AIRR-seq methods compared with Iso-Seq and FLAIRR-seq. (C) FLAIRR-seq molecular pipeline: RNA (green) was converted to first-strand cDNA (red) using the 5′ RACE method, incorporating a 5′ TSO-UMI (pink) via template switch. Second-strand amplification specifically targeted IgG and IgM molecules through priming of the 5′ TSO-UMI and the 3′ CH3 for IgG or CH4 for IgM. A 16-bp barcode was incorporated into the 3′ CH3/CH4 primers to enable sample multiplexing postamplification. (D) Screenshot showing near-full-length single-molecule structure of IgG4 FLAIRR-seq transcripts.
Overview of Ab structure and FLAIRR-seq molecular method. (A) Schematic representation of the IGH locus, H chain transcript structure, and functional Ig protein. (B) Comparative coverage across the H chain transcript of commonly used sr-AIRR-seq methods compared with Iso-Seq and FLAIRR-seq. (C) FLAIRR-seq molecular pipeline: RNA (green) was converted to first-strand cDNA (red) using the 5′ RACE method, incorporating a 5′ TSO-UMI (pink) via template switch. Second-strand amplification specifically targeted IgG and IgM molecules through priming of the 5′ TSO-UMI and the 3′ CH3 for IgG or CH4 for IgM. A 16-bp barcode was incorporated into the 3′ CH3/CH4 primers to enable sample multiplexing postamplification. (D) Screenshot showing near-full-length single-molecule structure of IgG4 FLAIRR-seq transcripts.
The IgG isotype class is represented by four subisotypes: IgG1, IgG2, IgG3, and IgG4. Each IgG subisotype circulates at varied frequencies and facilitates unique immune functions. For example, IgG1 is typically the most abundant circulating IgG and mediates proinflammatory responses; IgG2 targets bacterial polysaccharides, providing protection from bacterial pathogens; IgG3 confers protection against intracellular bacterial infections and enables clearing of parasites; and IgG4 contains exclusive structural and functional characteristics often resulting in anti-inflammatory and tolerance-inducing effects (5). Multiple studies have identified Ab-mediated subisotype-specific pathogenicity in the context of autoimmune diseases and cancer, highlighting the need for further investigation of subisotype-specific repertoires (12–15).
Current Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) methods aim to resolve V and C region transcripts to differing extents. Profiling of the V region, even in part, defines V, D, and J gene usage while also providing characterization of complementarity-determining regions (CDRs) 1, 2, and 3, which are hypervariable and directly interact with target Ag (16). CDR3-targeted profiling approaches, such as those used by Adaptive Biotechnologies (17), allow for V, D, and J gene assignments but do not provide complete resolution of the entire V region (17). The isoform sequencing (Iso-Seq) method captures full-length transcripts expressing a poly(A) tail through oligo dT-based priming. For characterization of AIRR-seq reads, this approach has lower throughput, lower depth, and increased cost, because Iso-Seq generates a complete transcriptome per sample without any enrichment of H chain sequences, which are then filtered out for analyses, resulting in a considerable amount of non-Ab repertoire data being discarded (Fig. 1B). Multiplexed primer-based sr-AIRR-seq strategies generate full V region content but require specific primers to known targets and therefore may miss novel genes and alleles. 5′ RACE sr-AIRR-seq methods capture the full-length VDJ exon without V region–targeted multiplexed primer pools, therefore limiting the impact of primer bias and enabling discovery of novel IGHV, IGHD, and IGHJ genes and alleles (18). 5′ RACE methods also often prime from the first IGHC exon (CH1), allowing for determination of isotype. Additional methods have been developed that shift amplification strategies by capturing additional IGHC region sequence to enable subisotype resolution; however, these methods sacrifice full and contiguous characterization of IGHV genes (19). All commonly used sr-AIRR-seq methods are further technically limited by the length restrictions (≤600 nt) of short-read next generation sequencing. As a result, no current sr-AIRR-seq strategy resolves the complete H chain transcript, including all IGHC exons alongside the recombined IGHV, IGHD, and IGHJ genes. Our team has recently shown that population-based polymorphisms within the IGHV, IGHD, IGHJ, IGK, and IGL loci are far more extensive than previously known; the IGHC region has also been shown to contain genomic diversity, although the extent of this diversity has likely not been fully explored (20–27). Although it is understood that the Fc domain mediates Ab effector functions, there is limited knowledge as to how genetic variation in this region may impact functional capabilities or posttranslational modifications (5, 28, 29). As such, there is a growing need to understand genomic variation across the complete Ab molecule. To address these limitations, we have developed an end-to-end pipeline to target, profile, and characterize the Ab H chain repertoire in the context of isotype (IgG, IgM) and subisotype (IgG1, IgG2, IgG3, IgG4).
In this article, we present full-length AIRR-seq (FLAIRR-seq), a novel targeted 5′ RACE-based amplification of near-full-length IgG and IgM H chain transcripts, paired with single-molecule real-time (SMRT) sequencing, resulting in highly accurate (mean read accuracy ∼Q60, 99.9999%), near-full-length Ab sequences from RNA derived from whole blood, isolated PBMCs, and purified B cells. When analyzed with the Immcantation sr-AIRR-seq tool suite (30, 31), we demonstrate that FLAIRR-seq performs comparably with standard 5′ RACE sr-AIRR-seq methods and single-molecule Iso-Seq strategies for characterizing the expressed Ab repertoire. We further highlight FLAIRR-seq data features, including phased identification of IGHV, IGHD, IGHJ, and IGHC genes, allowing for profiling of subisotype- and IGHC allele-specific repertoires and CSR characterization.
Materials and Methods
Sample collections
Experiments were conducted using healthy donor PBMCs, purified B cells from healthy donors, or whole blood collected from hospitalized COVID-19 patients (Supplemental Table I). Commercially available healthy donor PBMCs (STEMCELL Technologies) and a subset of matched purified B cells were used to generate sr-AIRR-seq and FLAIRR-seq validation datasets. Full-length Iso-Seq was performed using B cells isolated from the PBMCs of a healthy, consented 57-y-old male donor at the University of Louisville (UofL) School of Medicine. The UofL Institutional Review Board (IRB) approved sample collection (IRB 14.0661). For COVID-19–affected patient samples (n = 5), whole blood was collected from the Mount Sinai COVID-19 biobank cohort of hospitalized COVID-19 patients, approved by the IRB at the Icahn School of Medicine at Mount Sinai as previously described (32).
PBMC isolation and B cell purification
Frozen healthy donor PBMCs were purchased, thawed, and aliquoted for use in downstream experiments (STEMCELL Technologies). For Iso-Seq analyses, 175 ml of venous blood was collected in a final concentration of 6 mM K3EDTA using standard phlebotomy. PBMCs were isolated using Sepmate PBMC Isolation Tubes (STEMCELL Technologies) as previously described (33), with an additional granulocyte depletion step using the RosetteSep Human Granulocyte Depletion Cocktail (STEMCELL Technologies) as directed by the manufacturer. B cells from the freshly collected and frozen healthy donor PBMCs were isolated using the EasySep Human Pan-B Cell Enrichment Kit, as described by the manufacturer (STEMCELL Technologies). In brief, B cells, including plasma cells, were isolated by negative selection using coated magnetic particles. First, the B cell enrichment mixture was added to the sample and mixed for a 5-min incubation at room temperature, followed by addition of magnetic particles and further incubation for 5 min on the benchtop. The sample tube was then placed on an EasySep magnet (STEMCELL Technologies), and purified B cells were carefully eluted from the magnetic particles and immediately used for RNA extraction.
Genomic DNA and RNA extraction
For the healthy frozen PBMCs and matched purified B cells, genomic DNA (gDNA) and RNA were coextracted using the AllPrep DNA/RNA Mini Kit (Qiagen) according to the manufacturer’s instructions. For the freshly processed UofL healthy donor PBMCs, purified Pan-B cells were lysed in Buffer RLT Plus, and RNA was extracted with the RNeasy Plus Mini Kit (Qiagen) per the manufacturer’s protocol; no gDNA was collected from this sample. COVID-19 whole blood–derived RNA was extracted from samples collected in Tempus Blood RNA tubes using the Tempus Spin RNA Isolation Kit (ThermoFisher) as described by the manufacturer. For all samples, concentrations of RNA and gDNA (when appropriate) were assessed using the Qubit 4.0 fluorometer, with the RNA HS Assay Kit and Qubit DNA HS Assay Kit, respectively (ThermoFisher Scientific). RNA and gDNA integrity were evaluated using the Bioanalyzer RNA Nano Kit and DNA 12000 Kit, respectively (Agilent Technologies). Extracted RNA and gDNA were stored at −80°C and −20°C, respectively, until used.
FLAIRR-seq targeted amplification of H chain transcripts
Extracted RNA was thawed on ice and converted to first-strand cDNA using the SMARTer RACE 5′/3′ Kit (Takara Bio USA), as described by the manufacturer and a custom oligonucleotide that contained the template switch oligo and a unique molecular identifier (UMI; 5′ TSO-UMI) for template switch during first-strand cDNA synthesis. The following reaction conditions were used: (1) a primary master mix was prepared with 4.0 μl 5× First-Strand Buffer, 0.5 μl DTT (100 mM), and 1.0 μl dNTPs (20 mM) per reaction and set aside until needed; (2) in a separate 0.2-ml PCR tube, 10 μl of sample RNA and 1 μl 5′-CDS Primer A were combined and incubated in a thermal cycler at 72°C (lid temperature: 105°C) for 3 min, followed by cooling to 42°C for 2 min; (3) after cooling, tubes were spun briefly to collect contents, and 1 μl (12 μM) of the 5′ TSO-UMI was added to the RNA; and (4) 0.5 μl of RNase inhibitor and 2.0 μl of SMARTScribe Reverse Transcriptase were added to the primary master mix tube per sample, and 8 μl of the combined master mix was then added to each RNA-containing sample tube. First-strand cDNA synthesis reactions were incubated in a thermal cycler at 42°C (lid temperature: 105°C) for 90 min, followed by heat inactivation at 70°C for 10 min. Total first-strand cDNA generated in this reaction was diluted 1:2 with Tricine-EDTA Buffer before moving onto targeted H chain transcript amplification.
To specifically amplify H chain transcripts from total first-strand cDNA, we performed targeted IgG and IgM transcript amplification reactions using barcoded IgG (3′ primer binding in the C region exon 3 [CH3])-specific or IgM (3′ primer binding in the C region exon 4 [CH4])-specific primers (Supplemental Table II) and the following conditions: (1) 5 μl of diluted first-strand cDNA was added to 0.2-ml PCR tubes; (2) a master mix was generated using 10 μl 5× PrimeSTAR GXL Buffer, 4 μl GXL dNTP mixture, 28 μl PCR-grade water, 1 μl PrimeSTAR GXL Polymerase, and 1 μl 10× UPM form the SMARTer RACE 5′/3′ Kit per reaction; and (3) 44 μl master mix was added to each reaction tube followed by 1 μl of the appropriate barcoded IgG (CH3) or IgM (CH4) primer. Different temperatures were used for annealing of IgG- (63.5°C) and IgM-specific primers (60°C) to account for primer-specific melting temperatures and to enhance targeted amplification specificity. Amplification conditions for full-length IgG were 1 min at 95°C, followed by 35 amplification cycles of 95°C for 30 s, 63.5°C for 20 s, and 2 min at 68°C, followed by a final extension for 3 min at 68°C and hold at 4°C. Amplification conditions for full-length IgM were 1 min at 95°C, followed by 35 amplification cycles of 95°C for 30 s, 60°C for 20 s, and 2 min at 68°C, followed by a final extension for 3 min at 68°C and hold at 4°C. Final amplification reactions were purified using a 1.1× (v/v) cleanup with ProNex magnetic beads (Promega). Successfully amplified products were quantified with Qubit dsDNA HS assay (ThermoFisher Scientific), and length was evaluated with the Fragment Analyzer Genomic DNA HS assay (Agilent). Samples were equimolar pooled in 8-plexes for SMRTbell library preparation and sequencing.
FLAIRR-seq SMRTbell library preparation and sequencing
Eight-plex pools of targeted IgG or IgM amplicons were prepared into SMRTbell sequencing templates according to the “Procedure and Checklist for Iso-Seq Express Template for Sequel and Sequel II systems” protocol starting at the “DNA Damage Repair” step and using the SMRTbell Express Template Prep Kit 2.0, with some modifications (Pacific Biosciences). In brief, targeted IgG and IgM amplicons underwent enzymatic DNA damage and end repair, followed by ligation with overhang SMRTbell adapters as specified in the protocol. To increase consistency in SMRTbell loading on the Sequel IIe system, we further treated the SMRTbell libraries with a nuclease mixture to remove unligated amplified products, using the SMRTbell Enzyme Cleanup Kit, as recommended by the manufacturer (Pacific Biosciences). In brief, after heat-killing the ligase with an incubation at 65°C, samples were treated with a nuclease mixture at 37°C for 1 h and then purified with a 1.1× Pronex cleanup. Final SMRTbell libraries were evaluated for quantity and quality using the Qubit dsDNA HS assay and Fragment Analyzer Genomic DNA assay, respectively. Sequencing of each 8-plex, barcoded sample pool was performed on one SMRTcell 8M using primer v4 and polymerase v2.1 on the Sequel IIe system with 30-h movies. Demultiplexed, high-fidelity circular consensus sequence reads (“HiFi reads”) were generated on the instrument for downstream analyses.
sr-AIRR-seq SMARTer Human BCR IgG/IgM sequencing
Matched healthy donor RNA was used to generate targeted IgG and IgM sr-AIRR-seq libraries using the SMARTer Human BCR IgG IgM H/K/L Profiling Kit (Takara Bio USA) according to the manufacturer’s instructions with no modifications. In brief, for each sample, proprietary IgG and IgM primers were used to amplify H chain transcripts following a 5′ RACE reaction. sr-AIRR-seq libraries were then quality controlled using the 2100 Bioanalyzer High Sensitivity DNA Assay Kit (Agilent) and the Qubit 3.0 Fluorometer dsDNA High Sensitivity Assay Kit. Sequencing on the MiSeq platform using 300-bp paired-end reads was performed using the 600-cycle MiSeq Reagent Kit v3 (Illumina) according to the manufacturer’s instructions, and FASTQ reads were generated using the associated DRAGEN software package (Illumina).
B cell Iso-Seq
RNA extracted from healthy sorted B cells was used to generate Iso-Seq SMRTbell libraries following the “Procedure & Checklist Iso-Seq Express Template Preparation for the Sequel II System” with minor adaptations compared with the manufacturer’s instructions. In brief, Iso-Seq libraries were generated using 500 ng of high-quality (RNA integrity number > 8) RNA as input into oligo-dT primed cDNA synthesis (NEB). Barcoded primers were incorporated into the cDNA during second-strand synthesis. After double-stranded cDNA amplification, transcripts from two samples sourced from purified B cells and NK cells were equimolar pooled as previously described (33). SMRTbells were generated from the pooled cDNA as described earlier for the FLAIRR-seq amplification products, including the addition of a nuclease digestion step. Quantity and quality of the final Iso-Seq libraries were performed with the Qubit dsDNA High Sensitivity Assay Kit and Agilent Fragment Analyzer Genomic DNA assay, respectively. This 2-plex Iso-Seq pool was sequenced using primer v4 and polymerase v2.1 on the Sequel IIe system with a 30-h video. HiFi reads were generated on the instrument before analyses. Demultiplexing of barcoded samples and generation of full-length nonconcatemer predicted transcripts were performed using the Iso-Seq v3 pipeline available through SMRTLink (v.10.2). B cell–derived full-length nonconcatemer reads were mapped to the human genome using the Genomic Mapping and Alignment Program reference database, and reads derived from chromosome 14 were extracted for downstream IgH transcript characterization via Immcantation, as described later.
Immcantation analyses of IgG and IgM repertoires
Analyses of FLAIRR-seq, sr-AIRR-seq, and Iso-Seq datasets were performed using Immcantation tools (30, 31). Demultiplexed barcoded HiFi (for SMRT sequencing data) or FASTQ (for sr-AIRR-seq) reads were first processed using the pRESTO tool for quality control, UMI processing, and error profiling (30). For sr-AIRR-seq, pRESTO analysis data from paired-end reads (“R1” and “R2”) were trimmed to remove bases with <Q20 read quality and/or <125-bp length using the “FilterSeq trimqual” and “FilterSeq length,” respectively. IgG and IgM CH3 or CH4 primer sequences were identified with an error rate of 0.2, and primers identified were then noted in FASTQ headers using “MaskPrimers align.” Next, 12-bp UMIs were located and extracted using “Maskprimers extract.” Sequences found to have the same UMIs were grouped and aligned using “AlignSets muscle,” with a consensus sequence generated for each UMI using “BuildConsensus.” Mate pairing of sr-AIRR-seq reads was conducted using a reference-guided alignment requiring a minimum of a 5-bp overlap via “AssemblePairs sequential.” After collapsing consensus reads with the same UMI (“conscount”) using “CollapseSeq,” reads with fewer than two supporting sequences were removed from downstream analysis. For pRESTO processing of FLAIRR-seq, single HiFi reads (“R1”) did not require trimming because of >Q20 sequence quality across all bases. 5′ TSO-UMI and CH3 or CH4 region primers were identified along with a 22-bp UMI with an error rate of 0.3 using “MaskPrimers align.” Reads were then grouped and aligned using “AlignSets muscle.” Due to the single-molecule nature of FLAIRR-seq reads, no mate pairing was required. Consensus reads were then generated as described earlier, including removal of sequences with fewer than two supporting reads. Read counts following each step of data filtration for AIRR-seq and FLAIRR-seq are represented in Supplemental Table III.
pRESTO-filtered reads for both sr-AIRR-seq and FLAIRR-seq data were then input into the Change-O tool (Table I). Iso-Seq required no initial processing from pRESTO and was input into Change-O for Ig gene reference alignment along with sr-AIRR-seq and FLAIRR-seq data using “igblastn,” clonal clustering using “DefineClones,” and germline reconstruction and conversion using “CreateGermlines.py” and the GRCh38 chromosome 14 germline reference (31). Fully processed and annotated data were then converted into a TSV format for use in downstream analyses. The Alakazam Immcantation tool suite was then used to quantify gene usage, calculate CDR3 length, and assess somatic hypermutation frequencies (SHMs), which were compared by unpaired t tests with Bonferroni multiple testing correction. Alakazam was also used to analyze clonal diversity (31). SCOPer clonal assignment by spectral clustering was conducted for COVID-19 patient time-course samples (34, 35). For clonal lineage tree analysis, the Dowser tool was used to examine clonal diversity and CSR over time (36). After trimming of 3′ primers, FLAIRR-seq data were also processed using the MiXCR generic BCR amplicon pipeline with visualization using the immunarch R package (Supplemental Fig. 1) (37, 38).
Sample ID . | Isotype . | Method . | RNA Input (ng) . | Input Reads into Immcantationa . | Postassembly/Single Readsb . | Unique VDJc . | Unique Clonesd . | Unique CDR3e . |
---|---|---|---|---|---|---|---|---|
0007 | IgG | sr-AIRR | 100 | 841,627 | 15,217 | 3,164 | 1,008 | 1,187 |
FLAIRR | 335 | 181,109 | 17,881 | 4,880 | 1,358 | 1,520 | ||
201c | IgG | sr-AIRR | 100 | 1,509,063 | 25,015 | 4,568 | 1,221 | 1,371 |
FLAIRR | 296 | 221,301 | 36,053 | 10,819 | 2,145 | 2,388 | ||
203c | IgG | sr-AIRR | 100 | 1,006,965 | 23,085 | 5,793 | 1,436 | 1,705 |
FLAIRR | 106 | 47,5703 | 14,923 | 3,778 | 993 | 1,138 | ||
602c | IgG | sr-AIRR | 100 | 1,275,902 | 38,078 | 7,736 | 1,542 | 2,217 |
FLAIRR | 149 | 344,906 | 33,903 | 9,225 | 1,828 | 2,357 | ||
705c | IgG | sr-AIRR | 100 | 1,113,660 | 15,358 | 1,362 | 557 | 576 |
FLAIRR | 268 | 279,601 | 18,627 | 2,328 | 1,063 | 1,081 | ||
1008 | IgG | sr-AIRR | 100 | 1,310,350 | 56,848 | 12,217 | 3,005 | 3,307 |
FLAIRR | 335 | 238,820 | 57,020 | 18,235 | 3,505 | 3,813 | ||
1013 | IgG | sr-AIRR | 100 | 1,158,222 | 25,250 | 5,027 | 1,795 | 1,964 |
FLAIRR | 222 | 396,322 | 27,871 | 7,487 | 2,565 | 2,744 | ||
2008 | IgG | sr-AIRR | 100 | 835,928 | 21,681 | 3,900 | 1,235 | 1,368 |
FLAIRR | 240 | 418,906 | 16,441 | 4,258 | 1,308 | 1,380 | ||
4002 | IgG | sr-AIRR | 100 | 632,272 | 8,949 | 1,112 | 727 | 772 |
FLAIRR | 335 | 398,838 | 19,692 | 5,377 | 1,719 | 1,887 | ||
5001 | IgG | sr-AIRR | 100 | 695,884 | 13,930 | 1,885 | 1,134 | 1,244 |
FLAIRR | 335 | 313,843 | 42,845 | 12,146 | 2,226 | 2,381 | ||
0007 | IgM | sr-AIRR | 100 | 841,627 | 12,777 | 9,154 | 7,924 | 8,163 |
FLAIRR | 335 | 331,380 | 46,384 | 14,926 | 12,367 | 12,851 | ||
201c | IgM | sr-AIRR | 100 | 1,509,063 | 26,850 | 17,888 | 14,281 | 14,622 |
FLAIRR | 296 | 148,355 | 98,069 | 12,803 | 9,353 | 9,573 | ||
203c | IgM | sr-AIRR | 100 | 1,006,965 | 24,623 | 18,189 | 16,295 | 16,629 |
FLAIRR | 106 | 509,536 | 37,527 | 10,636 | 9,315 | 9,419 | ||
602c | IgM | sr-AIRR | 100 | 1,275,902 | 12,306 | 8,378 | 7,704 | 7,833 |
FLAIRR | 149 | 273,103 | 48,897 | 16,406 | 14,495 | 14,745 | ||
705c | IgM | sr-AIRR | 100 | 1,113,660 | 19,484 | 14,273 | 12,936 | 13,413 |
FLAIRR | 268 | 435,391 | 79,251 | 17,417 | 15,706 | 16,249 | ||
1008 | IgM | sr-AIRR | 100 | 1,310,350 | 20,264 | 12,710 | 7,878 | 8,368 |
FLAIRR | 335 | 181,063 | 47,531 | 15,284 | 8,745 | 9,276 | ||
1013 | IgM | sr-AIRR | 100 | 1,158,222 | 12,769 | 9,012 | 8,057 | 8,241 |
FLAIRR | 222 | 262,617 | 52,508 | 17,812 | 15,291 | 15,759 | ||
2008 | IgM | sr-AIRR | 100 | 835,928 | 11,608 | 8,551 | 5,838 | 6,206 |
FLAIRR | 240 | 502,354 | 47,596 | 14,132 | 10,225 | 10,674 | ||
4002 | IgM | sr-AIRR | 100 | 632,272 | 18,411 | 11,607 | 11,184 | 11,367 |
FLAIRR | 335 | 489,287 | 91,732 | 22,379 | 19,000 | 21,532 | ||
5001 | IgM | sr-AIRR | 100 | 313,843 | 10,018 | 12,146 | 2,226 | 2,381 |
FLAIRR | 355 | 171,688 | 18,944 | 6,164 | 4,866 | 5,011 |
Sample ID . | Isotype . | Method . | RNA Input (ng) . | Input Reads into Immcantationa . | Postassembly/Single Readsb . | Unique VDJc . | Unique Clonesd . | Unique CDR3e . |
---|---|---|---|---|---|---|---|---|
0007 | IgG | sr-AIRR | 100 | 841,627 | 15,217 | 3,164 | 1,008 | 1,187 |
FLAIRR | 335 | 181,109 | 17,881 | 4,880 | 1,358 | 1,520 | ||
201c | IgG | sr-AIRR | 100 | 1,509,063 | 25,015 | 4,568 | 1,221 | 1,371 |
FLAIRR | 296 | 221,301 | 36,053 | 10,819 | 2,145 | 2,388 | ||
203c | IgG | sr-AIRR | 100 | 1,006,965 | 23,085 | 5,793 | 1,436 | 1,705 |
FLAIRR | 106 | 47,5703 | 14,923 | 3,778 | 993 | 1,138 | ||
602c | IgG | sr-AIRR | 100 | 1,275,902 | 38,078 | 7,736 | 1,542 | 2,217 |
FLAIRR | 149 | 344,906 | 33,903 | 9,225 | 1,828 | 2,357 | ||
705c | IgG | sr-AIRR | 100 | 1,113,660 | 15,358 | 1,362 | 557 | 576 |
FLAIRR | 268 | 279,601 | 18,627 | 2,328 | 1,063 | 1,081 | ||
1008 | IgG | sr-AIRR | 100 | 1,310,350 | 56,848 | 12,217 | 3,005 | 3,307 |
FLAIRR | 335 | 238,820 | 57,020 | 18,235 | 3,505 | 3,813 | ||
1013 | IgG | sr-AIRR | 100 | 1,158,222 | 25,250 | 5,027 | 1,795 | 1,964 |
FLAIRR | 222 | 396,322 | 27,871 | 7,487 | 2,565 | 2,744 | ||
2008 | IgG | sr-AIRR | 100 | 835,928 | 21,681 | 3,900 | 1,235 | 1,368 |
FLAIRR | 240 | 418,906 | 16,441 | 4,258 | 1,308 | 1,380 | ||
4002 | IgG | sr-AIRR | 100 | 632,272 | 8,949 | 1,112 | 727 | 772 |
FLAIRR | 335 | 398,838 | 19,692 | 5,377 | 1,719 | 1,887 | ||
5001 | IgG | sr-AIRR | 100 | 695,884 | 13,930 | 1,885 | 1,134 | 1,244 |
FLAIRR | 335 | 313,843 | 42,845 | 12,146 | 2,226 | 2,381 | ||
0007 | IgM | sr-AIRR | 100 | 841,627 | 12,777 | 9,154 | 7,924 | 8,163 |
FLAIRR | 335 | 331,380 | 46,384 | 14,926 | 12,367 | 12,851 | ||
201c | IgM | sr-AIRR | 100 | 1,509,063 | 26,850 | 17,888 | 14,281 | 14,622 |
FLAIRR | 296 | 148,355 | 98,069 | 12,803 | 9,353 | 9,573 | ||
203c | IgM | sr-AIRR | 100 | 1,006,965 | 24,623 | 18,189 | 16,295 | 16,629 |
FLAIRR | 106 | 509,536 | 37,527 | 10,636 | 9,315 | 9,419 | ||
602c | IgM | sr-AIRR | 100 | 1,275,902 | 12,306 | 8,378 | 7,704 | 7,833 |
FLAIRR | 149 | 273,103 | 48,897 | 16,406 | 14,495 | 14,745 | ||
705c | IgM | sr-AIRR | 100 | 1,113,660 | 19,484 | 14,273 | 12,936 | 13,413 |
FLAIRR | 268 | 435,391 | 79,251 | 17,417 | 15,706 | 16,249 | ||
1008 | IgM | sr-AIRR | 100 | 1,310,350 | 20,264 | 12,710 | 7,878 | 8,368 |
FLAIRR | 335 | 181,063 | 47,531 | 15,284 | 8,745 | 9,276 | ||
1013 | IgM | sr-AIRR | 100 | 1,158,222 | 12,769 | 9,012 | 8,057 | 8,241 |
FLAIRR | 222 | 262,617 | 52,508 | 17,812 | 15,291 | 15,759 | ||
2008 | IgM | sr-AIRR | 100 | 835,928 | 11,608 | 8,551 | 5,838 | 6,206 |
FLAIRR | 240 | 502,354 | 47,596 | 14,132 | 10,225 | 10,674 | ||
4002 | IgM | sr-AIRR | 100 | 632,272 | 18,411 | 11,607 | 11,184 | 11,367 |
FLAIRR | 335 | 489,287 | 91,732 | 22,379 | 19,000 | 21,532 | ||
5001 | IgM | sr-AIRR | 100 | 313,843 | 10,018 | 12,146 | 2,226 | 2,381 |
FLAIRR | 355 | 171,688 | 18,944 | 6,164 | 4,866 | 5,011 |
sr-AIRR-seq data reflects total number of assembled paired end reads (bold) used as input into analyses.
Counts of unique single molecules or assembled templates.
Counts of unique sequences after Immcantation annotation.
Counts of unique clones after clonal clustering.
Counts of sequences containing unique CDR3 sequences.
Targeted gDNA capture, long-read sequencing, and IGenotyper analyses
FLAIRR-seq validation samples also underwent IGHC targeted enrichment and long-read sequencing as previously described. In brief, gDNA was mechanically sheared and size selected to include 5- to 9-kb fragments using the BluePippin (Sage Science). Samples were then end repaired and A-tailed using the standard KAPA library preparation protocol (Roche). Universal priming sequences and barcodes were then ligated onto samples for multiplexing (Pacific Biosciences). Barcoded gDNA libraries were captured using IGH-specific probes following a SeqCap protocol (Roche). Twenty-six IGH-enriched samples were purified and pooled together for SMRTbell library prep as described earlier, including the final nuclease digestion step. Pooled SMRTbells were annealed to primer v4, bound to polymerase v2.0, and sequenced on the Sequel IIe system with 30-h movies. After sequencing, HiFi reads were generated and analyzed by the IGenotyper pipeline (39). In brief, IGenotyper was used to detect single-nucleotide variants (SNVs) and assemble sequences into haplotype-specific assemblies for downstream IGHC gene genotyping. Alleles were then extracted from assemblies using a bed file containing coordinates for each IGHC gene exon. After sequences were extracted, reads were then aligned to the international ImMunoGeneTics information system (IMGT) database (downloaded on 2/21/22) and assigned as an exact match to IMGT, “novel” if there was no match to the IMGT database, or “novel, extended” if a match was detected to a partial allele found in IMGT but the IMGT allele was a substring of the IGenotyper-identified allele (40). This set of alleles was then used as a ground-truth dataset.
IGHC gene genotyping with FLAIRR-seq
To genotype IGHC genes and alleles from FLAIRR-seq data, we filtered productive reads by IGHC length (900–1100 bp) for IgG and aligned to the chromosome 14 hg38 reference using minimap2 along with SAMtools to generate sorted and indexed bam files (41, 42). WhatsHap was used to identify, genotype, and phase SNVs (M. Martin, M. Patterson, S. Garg, S. O. Fischer, N. Pisanti, G. W. Klau, A. Schöenhuth, and T. Marschall, manuscript posted on bioRxiv, DOI: 10.1101/085050). Phased SNVs were used to assign each read to a haplotype using MsPAC. Reads from each haplotype and gene were clustered using CD-HIT using a 100% identity clustering threshold parameter, and a single representative read from the largest cluster was aligned to the IGenotyper curated alleles using BLAST (43) to determine the closest matching IGHC gene and allele (40, 44). The representative read was selected based on 100% identity to all other sequences in that cluster.
Inference of IGHV, IGHD, and IGHJ gene haplotypes from FLAIRR-seq data using IGHC gene anchors
To test the ability of IGHC genes to be used for the inference of IGHV, IGHD, and IGHJ haplotypes from FLAIRR-seq data, we chose one sample that was heterozygous for both IGHM and IGHJ6 (IGHJ6 use is standard for sr-AIRR-seq haplotype inference). For this sample, we used TIgGER (31, 45–47) to infer novel IGHV alleles and generate sample-level IGHV genotypes. Rearranged sequences within the Change-O table were then reannotated, taking into account sample genotype and detected novel alleles. Updated annotations were then used to infer haplotypes using RAbHIT version 0.2.4 (48). Both IGHJ6 and IGHM were used as anchor points for haplotyping, and the resulting haplotypes were compared.
Data availability
All IGenotyper, FLAIRR-seq, sr-AIRR-seq, and Iso-Seq data generated for this study are available in the Sequence Read Archive using the identifier BioProject ID PRNJA922682 (https://www.ncbi.nlm.nih.gov/bioproject/?term=922682).
Results
Gene usage, CDR3, and SHM profiles characterized from FLAIRR-seq data are comparable with AIRR-seq and Iso-Seq
Current methods for commercially available 5′ RACE sr-AIRR-seq use targeted amplification of the V region and, in some cases, a small portion of CH1, in conjunction with short-read sequencing to characterize Ig repertoires. However, this minimal examination of the IGHC gene sequence is primarily used to define isotypes. No current method defines both the H chain V and C regions allowing for both subisotype classification and IGHC allele-level resolution. To address these technical limitations, we developed the FLAIRR-seq method (Fig. 1C), a targeted 5′ RACE approach combined with SMRT sequencing, to generate highly accurate, near-full-length IgG (∼1500 bp) and/or IgM (2000 bp) sequences, allowing for direct, simultaneous analysis of the H chain V and C regions (Fig. 1D), including gene/allele identification for IGHV, IGHD, IGHJ, and IGHC segments and isotype- and subisotype-specific repertoire profiling.
To evaluate and validate the capabilities of FLAIRR-seq and sr-AIRR-seq, we performed analyses on 10 healthy donor PBMC samples to compare both library preparation methods. FLAIRR-seq data were filtered from the initial HiFi reads (>Q20) to include only >Q40 reads. The average read quality of these filtered reads was >Q60 (99.9999%), with a pass filter rate ranging from 88 to 93% of total reads. sr-AIRR-seq FASTQ bases were trimmed to retain sequences with an average quality of Q20 (99%). These filtered reads were used as input into the Immcantation suite, specifically the pREST-O and Change-O tools, for IGHV, IGHD, and IGHJ gene assignment, and repertoire feature analyses, including identification of clones, extent of somatic hypermutation, and evaluation of CDR3 lengths. As shown in Table I, fewer overall FLAIRR-seq reads were used as input into the Immcantation analyses, after required filtration and read assembly steps (which were not needed for the high-quality single-molecule FLAIRR-seq reads). FLAIRR-seq resulted in comparable or, in many cases, increased number of unique VDJ sequences, clones, and CDR3 sequences when compared with matched sr-AIRR-seq–derived samples in both the IgG and IgM repertoires. These basic sequencing and initial analysis metrics demonstrated that FLAIRR-seq produced high-quality V region data for detailed Ab repertoire analyses and are amenable to analysis using existing sr-AIRR-seq analysis tools. Comparative costs were calculated based on the cost per “actionable read,” defined as the read number per sample and per method after quality filtration and assembly, but before cluster consensus. This calculation represented the total unique single molecule or assembled templates captured by either method that passed all necessary quality control irrespective of biologic repertoire diversity or clonality, removing technology/platform biases. To remove the impact of pooling differences, we used this “per actionable read” price to calculate the cost for 15,000 “actionable reads” as our standard sample. sr-AIRR-seq costs $25.50 per sample, whereas the FLAIRR-seq cost was $33.57 per sample. These costs reflect reagents and consumables only; instrumentation and labor are not included.
While optimizing FLAIRR-seq sample preparation, we examined whether upstream isolation of B cells before FLAIRR-seq molecular preparation would enhance the ability to detect IGHV, IGHD, and IGHJ gene usage. To do this, we split aliquots of PBMCs (n = 4) into two groups for RNA extraction: (1) RNA derived from bulk PBMCs; and (2) RNA isolated from purified pan B-cells, followed by FLAIRR-seq preparation, SMRT sequencing, and Immcantation analysis of both groups. IGHV, IGHD, and IGHJ gene usage correlations between groups are shown in Fig. 2A and demonstrate a significant association (p values ranging from 0.033 to 4.1e−16) strongly supporting the conclusion that B cell isolation before RNA extraction was not necessary to achieve comparable gene usage metrics. The limited differences that were observed could be explained by template sampling differences between the two experiments. Due to the strong associations observed and the ease of processing PBMCs in bulk, we moved forward with RNA derived directly from PBMC aliquots for the remainder of our analyses.
FLAIRR-seq shows robust characterization of V, D, and J genes. (A) Spearman ranked correlations and p values of V, D, and J gene usage frequencies identified by FLAIRR-seq performed on matched total PBMCs and purified B cells. (B) Heatmap of Spearman ranked correlations and p values of V, D, and J gene usage frequencies between FLAIRR-seq and sr-AIRR-seq processed samples. (C) Violin plots of SHMs defined by FLAIRR-seq or sr-AIRR-seq in IgM- and IgG-specific repertoires. (D) Boxplots of CDR3 length defined by FLAIRR-seq or sr-AIRR-seq analysis in both IgM- and IgG-specific repertoires. Significant differences of CDR3 length and SHM were measured by unpaired t test with Bonferroni correction between FLAIRR-seq and NGS-AIRR-seq data indicated by *p < 0.05 or **p < 0.01. (E) Spearman ranked correlation of V, D, and J gene usage frequencies between FLAIRR-seq–based and Iso-Seq–based repertoire profiling.
FLAIRR-seq shows robust characterization of V, D, and J genes. (A) Spearman ranked correlations and p values of V, D, and J gene usage frequencies identified by FLAIRR-seq performed on matched total PBMCs and purified B cells. (B) Heatmap of Spearman ranked correlations and p values of V, D, and J gene usage frequencies between FLAIRR-seq and sr-AIRR-seq processed samples. (C) Violin plots of SHMs defined by FLAIRR-seq or sr-AIRR-seq in IgM- and IgG-specific repertoires. (D) Boxplots of CDR3 length defined by FLAIRR-seq or sr-AIRR-seq analysis in both IgM- and IgG-specific repertoires. Significant differences of CDR3 length and SHM were measured by unpaired t test with Bonferroni correction between FLAIRR-seq and NGS-AIRR-seq data indicated by *p < 0.05 or **p < 0.01. (E) Spearman ranked correlation of V, D, and J gene usage frequencies between FLAIRR-seq–based and Iso-Seq–based repertoire profiling.
We established FLAIRR-seq performance by comparing its output with the commonly used 5′ RACE sr-AIRR-seq method. 5′ RACE sr-AIRR-seq was chosen because it provides resolution of the complete V region and a small portion of IGHC, allowing for isotype differentiation. After preparation of both FLAIRR-seq and sr-AIRR-seq libraries from healthy donor PBMC samples (n = 10), we compared multiple repertoire features with benchmark FLAIRR-seq performance. First, we evaluated IGHV, IGHD, and IGHJ gene usage frequencies. We observed significant correlations between FLAIRR-seq and sr-AIRR-seq datasets in IGHV, IGHD and IGHJ gene usage for both IgM (V genes: r = 0.93–0.97, p ≤ 2.2e−16; D genes: r = 0.98–0.99, p ≤ 2.2e−16; J genes: r = 0.94–1.0, p = 0.0028–0.017) and IgG isotypes (V genes: r = 0.90–0.96, p ≤ 2.2e−16; D genes: r = 0.87–0.99, p ≤ 2.2e−16–6.1e−14; J genes: r = 0.89–1.0, p = 0.0028–0.033) (Fig. 2B), indicating that FLAIRR-seq comparably resolves IGHV, IGHD, and IGHJ gene usage profiles. Notably, IGHJ genes showed lower levels of significance (larger p values) across all comparisons because of the relatively few genes that make up the IGHJ gene family compared with the more diverse IGHV and IGHD families. We next investigated the performance of both methods in terms of resolving SHM frequencies (Fig. 2C) and CDR3 lengths (Fig. 2D), which are often used as measures of evaluating B cell affinity maturation. Although we did observe occasional statistically significant differences in the SHM frequency between sr-AIRR-seq and FLAIRR-seq data using the same samples, these differences were not seen across all samples, suggesting that sample-to-sample variation may drive this observation rather than technology-based discrepancies. We found that CDR3 lengths were consistently longer in the FLAIRR-seq datasets for both the IgM and IgG isotypes in most donors. The characterization of unusually long CDR3 regions (>40 nt) in the IgG sequences with FLAIRR-seq is likely due to the higher contiguity and quality afforded by the longer read lengths, which are less likely to be spanned by short-read 2 × 300-bp paired-end sequencing strategies. Together, these data demonstrate that FLAIRR-seq achieves comparable gene usage profiles and improved resolution of long CDR3 sequences.
Others have recognized the power of long-read sequencing to resolve B cell repertoires using bulk Iso-Seq methods, allowing for the examination of full-length transcripts from isolated B cells (49). As mentioned earlier, the Iso-Seq method captures full-length transcripts expressing a poly(A) tail without bias, although a considerable amount of nonrepertoire data are discarded. To investigate whether the untargeted transcriptome-wide Iso-Seq method would resolve a qualitatively different repertoire than FLAIRR-seq, which would have indicated FLAIRR-seq–driven primer bias, we performed matched Iso-Seq and FLAIRR-seq on purified B cell–derived RNA. IGHV, IGHD, and IGHJ gene usage frequencies were compared between Iso-Seq and FLAIRR-seq datasets (Spearman rank correlation), revealing significant correlations between usage profiles (V genes: r = 0.94, p = 2.2e−16; D genes: r = 0.92, p = 1.4e−13; J genes: r = 1.0, p = 0.0028; Fig. 2E). These data strongly suggest that FLAIRR-seq has very limited to no primer-driven bias compared with whole-transcriptome data. Collectively, this benchmarking dataset confirmed that FLAIRR-seq is comparable with other state-of-the-art methods, providing robust characterization of commonly used repertoire metrics, with limited increases in per-sample cost.
IGenotyper and FLAIRR-seq provide C region gene allele identification and allow for haplotyping of variable genes
FLAIRR-seq data provides improved resolution of IGHC, including estimation of IGHC gene and allele usage, subisotype identification, and phasing of V and C regions for comprehensive repertoire analysis. To evaluate the capabilities and accuracy of IGHC gene and allele identification with FLAIRR-seq, we first established a ground-truth dataset of IGHC alleles for all 10 samples by targeted sequencing of the germline IGH locus (Fig. 3A) using IGenotyper, as previously described (39). IGHG1, IGHG2, IGHG3, IGHG4, and IGHM alleles called by IGenotyper (see Materials and Methods) were assigned to one of three categories, schematized in Fig. 3B: (1) “exact match,” alleles documented in the IMGT database; (2) “novel not in IMGT,” alleles not documented in the IMGT database; or (3) “extended,” alleles that matched partial alleles in the IMGT database (i.e., those spanning only a subset of exons) but were extended by sequences in our dataset. IGenotyper identified a total of 32 unique IGHG1, IGHG2, IGHG3, IGHG4, and IGHM alleles across all individuals, as schematized in Fig. 3C. Among these 32 alleles, only 4 were documented in IMGT; the remaining represented novel alleles (n = 11) or extensions (n = 17) of known alleles. In aggregate, we observed a greater number of IGHG4 alleles than for any of the IGHG genes. Among these alleles were four sequences represented by suspected duplications of IGHG4. In fact, we observed three IGHG4 gene alleles in 4/10 samples, indicating the presence of gene duplications in these donors; in all cases, these alleles were also identified in the FLAIRR-seq data (see later). Given the relatively small size of this proof-of-concept healthy donor cohort, the identification of 28 (87%) novel or extended alleles underscores the extensive polymorphism in this region and reflects the paucity of information regarding this locus in existing immunogenomics databases.
FLAIRR-seq provides IGHC resolution for allelic discovery and allows variable gene haplotyping. (A) Overview of experimental design and pipeline overviews of genotyping by IGenotyper (gDNA) and FLAIRR-seq (RNA). (B) Schematic depicting IGHC alleles identified by IGenotyper, partitioned by identification as (i) exact matches to documented IMGT alleles, (ii) novel alleles that are not in IMGT, or (iii) extended alleles. Red bars indicate SNVs. (C) Pie chart and stacked bar graph representing the total number of alleles and fraction of each category identified per IGHC gene as identified by either IGenotyper or FLAIRR-seq. Bar charts showing number of IGHC alleles from FLAIRR-seq that were resolved, ambiguous, or unresolved when compared with IGenotyper alleles. Asterisks (*) indicate additional allele found as a result of IGHG4 duplication. (D) Table summarizing novel and extended alleles resolved by IGenotyper data. Extended alleles are denoted by *(allele number)-FL, and novel alleles are denoted by FL_(number) alleles resolved by FLAIRR-seq and are marked with a dot (•). (E) Venn diagrams showing number of IGHV haplotype allele/deletion calls when using IGHJ6 or IGHM anchors for each haplotype. Tile plots showing IGHV gene haplotypes inferred using either IGHJ6 anchors or IGHM anchors for one sample. Dark gray represents a deletion (DEL), off-white represents a nonreliable allele annotation (NRA), and light gray represents an unknown allele (Unk).
FLAIRR-seq provides IGHC resolution for allelic discovery and allows variable gene haplotyping. (A) Overview of experimental design and pipeline overviews of genotyping by IGenotyper (gDNA) and FLAIRR-seq (RNA). (B) Schematic depicting IGHC alleles identified by IGenotyper, partitioned by identification as (i) exact matches to documented IMGT alleles, (ii) novel alleles that are not in IMGT, or (iii) extended alleles. Red bars indicate SNVs. (C) Pie chart and stacked bar graph representing the total number of alleles and fraction of each category identified per IGHC gene as identified by either IGenotyper or FLAIRR-seq. Bar charts showing number of IGHC alleles from FLAIRR-seq that were resolved, ambiguous, or unresolved when compared with IGenotyper alleles. Asterisks (*) indicate additional allele found as a result of IGHG4 duplication. (D) Table summarizing novel and extended alleles resolved by IGenotyper data. Extended alleles are denoted by *(allele number)-FL, and novel alleles are denoted by FL_(number) alleles resolved by FLAIRR-seq and are marked with a dot (•). (E) Venn diagrams showing number of IGHV haplotype allele/deletion calls when using IGHJ6 or IGHM anchors for each haplotype. Tile plots showing IGHV gene haplotypes inferred using either IGHJ6 anchors or IGHM anchors for one sample. Dark gray represents a deletion (DEL), off-white represents a nonreliable allele annotation (NRA), and light gray represents an unknown allele (Unk).
We next used the IGenotyper-derived IGHC gene database as the ground truth for evaluating the capability of FLAIRR-seq to identify and resolve IGHC gene alleles. Based on our analysis workflow for identifying IGHC alleles from FLAIRR-seq data, we resolved 19/32 (59%) IGenotyper alleles at 100% identity; no additional false-positive alleles were identified. Of the alleles that were not unambiguously resolved by our FLAIRR-seq pipeline, eight had allele-defining SNVs 3′ of the FLAIRR-seq primers. The rate of true-positive allele calls using FLAIRR-seq across all 10 samples ranged from 5% for IGHG1 to 90% for IGHM (Fig. 3C, 3D). As a result, the IGHC genotypes inferred by FLAIRR-seq have some limitations, but on the whole allow for much greater resolution of IGHC variation in the expressed repertoire than currently used methods. Future iterations of FLAIRR-seq will include primer optimization to facilitate better sequence coverage in the 3′ regions of the IGHC alleles and improve the direct genotyping capabilities of the FLAIRR-seq method.
Previous studies have demonstrated the use of IGHJ6 heterozygosity to infer haplotypes of V and D genes from sr-AIRR-seq data (47, 50). However, the frequency of IGHJ6 heterozygotes in the population can vary. Therefore, we wanted to assess the utility of leveraging IGHC polymorphism resolved by FLAIRR-seq for haplotyping IGHV alleles with the publicly available tool RAbHIT (47, 50, 51). We selected a single donor (“1013”) from our cohort that was heterozygous for both IGHJ6 (*02 and *03) and IGHM (FL_2 and FL_4). Importantly, we were able to associate each IGHJ6 allele to the respective IGHM allele from the corresponding haplotype (Fig. 3E). After defining germline IGHV alleles using TIgGER (31), we generated and compared IGHV haplotype inferences using either IGHJ6 or IGHM alleles as anchor genes using RAbHIT (48) (Fig. 3E). Although some allele assignments were ambiguous (“unknown”) using both methods, we observed a strong consensus between haplotype inferences using the two anchor genes. For haplotype 1, represented by IGHJ6*03 and IGHM_FL_4, the IGHJ6*03-derived haplotype had 39 IGHV genes for which either an allele or deletion call was made. When using IGHM_FL_4, the same allele/deletion calls were made for 39 of these genes; in addition, using IGHM as the anchor gene, assignments were made for an additional five IGHV genes that had “unknown” designations using IGHJ6. Similarly, of the allele/deletion calls made for 41 IGHV genes on haplotype 2 assigned to IGHJ6*02, 40 of these gene had identical assignments to IGHM_FL_2. Together, these results indicate that IGHC variants can be used for haplotype inference from repertoire data when commonly used IGHJ or IGHD genes are homozygous in individuals of interest.
FLAIRR-seq enables isotype-, subisotype-, and allele-specific repertoire analyses
IGHG and IGHM alleles identified in each sample were used to annotate reads in each respective repertoire. These assignments allowed for partitioning of the repertoire by isotype, subisotype, and IGHC allele (Fig. 4). To demonstrate this, we used the same representative sample (“1013”) that was heterozygous for all IGHC genes. As shown in Fig. 4A, IGHC gene assignments allow for subisotype and allele-level frequencies to be estimated as a proportion of the overall IgG and IgM repertoires. In addition, detailed analyses of the repertoire can be conducted within each of these compartments. For example, Fig. 4B shows the frequencies of IGHV gene subfamilies for each IGHG and IGHM allele identified in this sample. Importantly, the average subisotype distribution of unique sequences resolved with FLAIRR-seq across our 10 subjects were in line with what has been reported for healthy individuals, with the most prevalent being IGHG1 (52%) and IGHG2 (36%), whereas IGHG3 (6%) and IGHG4 (5%) were seen at lower levels (52). Using standard sr-AIRR-seq analyses, we would not be able to identify allele-resolved V gene usage or enrichment within subisotype populations, which is important for linking subisotype functionality to particular Ag-specific VDJ clones.
FLAIRR-seq resolves subisotype-specific repertoire diversity. (A) Bar plots showing distribution of unique VDJ sequences across isotypes, subisotypes, and subisotype alleles in one representative sample, 1013, characterized by FLAIRR-seq. (B) Circos plots showing V family gene usage frequency within each subisotype allele for sample 1013. (C) Left, Boxplots of V gene family usage frequencies within IGHG1, IGHG2, IGHG3, and IGHG4 repertoires across all 10 individuals. Top right, Principal-component analysis of V gene family usage by subisotype; plot includes the first two principal components, and individual repertoires are colored by IGHG subisotype. Bottom right, Boxplots showing sequence frequency of IGHV1 and IGHV3 family genes by subisotype across all 10 samples.
FLAIRR-seq resolves subisotype-specific repertoire diversity. (A) Bar plots showing distribution of unique VDJ sequences across isotypes, subisotypes, and subisotype alleles in one representative sample, 1013, characterized by FLAIRR-seq. (B) Circos plots showing V family gene usage frequency within each subisotype allele for sample 1013. (C) Left, Boxplots of V gene family usage frequencies within IGHG1, IGHG2, IGHG3, and IGHG4 repertoires across all 10 individuals. Top right, Principal-component analysis of V gene family usage by subisotype; plot includes the first two principal components, and individual repertoires are colored by IGHG subisotype. Bottom right, Boxplots showing sequence frequency of IGHV1 and IGHV3 family genes by subisotype across all 10 samples.
Through the partitioning of repertoire sequences by subisotype and IGHC allele, we found that FLAIRR-seq also allowed for trends to be assessed in aggregate across donors. To demonstrate this, we further examined V gene family usage partitioned by IgG subisotype across all 10 healthy donors. This analysis revealed expected patterns, in that IGHV1, IGHV3 and IGHV4 subfamily genes were dominant across the four subisotypes (Fig. 4C). However, we did observe significant variation in subfamily proportions between subisotypes, associated with distinct profiles in specific subisotypes (Fig. 4C). Specifically, the estimated frequencies of IGHV1 and IGHV3 were statistically different between subisotypes (p < 0.01, ANOVA); IGHV1 usage was elevated in IGHG1 and IGHG4, whereas IGHV3 was elevated in IGHG2 and IGHG3. These analyses demonstrate the unique capability of FLAIRR-seq to examine variation in the expressed repertoire at the level of isotype, subisotype, and IGHC allele. As samples sizes increase, we expect that a multitude of additional repertoire features will become accessible to this kind of analysis leading to novel discoveries linking VDJ and IGHC genetic signatures.
FLAIRR-seq identifies subisotype-specific clonal expansion and CSR in longitudinal samples
We wanted to investigate the utility of FLAIRR-seq in clinical samples, particularly to observe changes in immune repertoires over time. Ab responses are highly dynamic, with specific Ab clones expanding upon activation by Ag. We were interested to know whether CSR could be captured by FLAIRR-seq, given the capability to identify clones with the subisotype resolved. We had the opportunity to evaluate FLAIRR-seq resolved repertoires over time in four samples collected from one individual over their >13 d hospitalization for severe COVID-19. Blood draws were taken on days 1, 4, 8, and 13 posthospitalization (Fig. 5A) and analyzed with FLAIRR-seq across all time points. After initial FLAIRR-seq processing and analysis, we defined unique clones using SCOPer, which clusters sequences based on CDR3 similarity and mutations in IGHV and IGHJ genes (34). This analysis allowed for the estimation of subisotype-specific clone counts across the four time points examined. Overall, we observed IgG1 dominated the repertoire at all four time points, but the proportion of subisotypes fluctuated over time (Fig. 5B). Specifically, the IGHG2- and IGHG3-specific repertoires expanded from day 1 to day 4, but then contracted in overall frequency from day 8 to day 13 (Fig. 5B). To assess clonal diversity within each subisotype repertoire across time, we calculated the Simpson’s diversity index (q = 2) using Alakazam (31). All subisotype-specific repertoires became less diverse from day 1 to day 4, suggesting clonal expansion across the IGHG repertoire (Fig. 5C). Notably, IGHG4 was not included in diversity calculations because IGHG4 would have required higher sequencing depth to ascertain diversity, given the overall lower expression of IgG4 transcripts in this individual. Subisotype-specific repertoire polarity was also assessed by calculating the fraction of clones needed to represent 80% of the total repertoire (Fig. 5D), with lower fractions representing more polarized and clonally expended repertoires. Results of this analysis were consistent with the diversity index, demonstrating an increase in clonal expansion (i.e., decreased polarity) at day 4 across IGHG1, IGHG2, and IGHG3 compartments, which returned to baseline at later time points.
FLAIRR-seq resolves subisotype-specific clonal expansion and facilitates haplotype analysis of CSR in a patient hospitalized for COVID-19. (A) Overview of experimental design: whole blood–derived RNA was collected on days 1, 4, 8, and 13 posthospitalization and used for FLAIRR-seq profiling. (B) Bar plot showing the percentage of clones represented by each subisotype across time points. (C) Simpson’s diversity index (q = 2) for all clones in each subisotype across four time points; IgG4 was not included because of low sequence counts. (D) Polarity, or the fraction of clones needed to comprise 80% of the repertoire, reported as fraction of total subisotype-specific repertoires across time. (E) Distribution of a single clone 9900 across isotypes and subisotypes over time, suggesting CSR of this clone. (F) Phylogenetic tree constructed from sequences/members of clone 9900, with the inferred germline sequence as the outgroup (star). Shapes and colors of tips (sequences) indicate time point and isotype/subisotype. The scale bar represents the number of mutations between each node in the tree. The subclade within the red box is represented by multiple time points and subisotypes, providing evidence of CSR. (G) Tile plot showing the assignment of IGHC alleles to their respective haplotypes, based on the frequency of observations in which each IGHC allele was linked to each respective allele of heterozygous V genes, IGHV3–7 and IGHV3–48; light gray denotes IGHC alleles for which haplotype assignment was not possible. Analysis of sequences in (F) revealed that the IGHG1 and IGHG2 alleles represented in the phylogeny came from the same haplotype (IGHG1*02/*07, IGHG2*08, IGHM_FL_2).
FLAIRR-seq resolves subisotype-specific clonal expansion and facilitates haplotype analysis of CSR in a patient hospitalized for COVID-19. (A) Overview of experimental design: whole blood–derived RNA was collected on days 1, 4, 8, and 13 posthospitalization and used for FLAIRR-seq profiling. (B) Bar plot showing the percentage of clones represented by each subisotype across time points. (C) Simpson’s diversity index (q = 2) for all clones in each subisotype across four time points; IgG4 was not included because of low sequence counts. (D) Polarity, or the fraction of clones needed to comprise 80% of the repertoire, reported as fraction of total subisotype-specific repertoires across time. (E) Distribution of a single clone 9900 across isotypes and subisotypes over time, suggesting CSR of this clone. (F) Phylogenetic tree constructed from sequences/members of clone 9900, with the inferred germline sequence as the outgroup (star). Shapes and colors of tips (sequences) indicate time point and isotype/subisotype. The scale bar represents the number of mutations between each node in the tree. The subclade within the red box is represented by multiple time points and subisotypes, providing evidence of CSR. (G) Tile plot showing the assignment of IGHC alleles to their respective haplotypes, based on the frequency of observations in which each IGHC allele was linked to each respective allele of heterozygous V genes, IGHV3–7 and IGHV3–48; light gray denotes IGHC alleles for which haplotype assignment was not possible. Analysis of sequences in (F) revealed that the IGHG1 and IGHG2 alleles represented in the phylogeny came from the same haplotype (IGHG1*02/*07, IGHG2*08, IGHM_FL_2).
CSR mediates the switching of Abs from one subisotype/isotype to another. This occurs through the somatic recombination of IGHC genes, which brings the switched/selected IGHC genes adjacent to the recombined IGHV, IGHD, and IGHJ segments, facilitating transcription. The switching of isotypes and subisotypes can result in changes to associated effector functions of the Ab while maintaining Ag-specific V regions. Given the ability of FLAIRR-seq to resolve clones with subisotype and IGHC allele resolution, as proof-of-concept, we sought to assess whether FLAIRR-seq could allow for more detailed haplotype-level analysis of CSR through the course of infection. To do this, we identified the largest clones in our dataset that were both represented by multiple isotypes and found across time points. In total, using SCOPer (34), we identified 19 unique clonal lineages that met these criteria. We focused our detailed analysis on one of the largest clones, 9900 (Fig. 5E), comprising IGHM, IGHG1, and IGHG2 sequences. On day 1 posthospitalization, clone 9900 sequences were represented by both IGHM and IGHG2. At day 4, IGHG2 was the only isotype observed, whereas again at day 8, both IGHM and IGHG2 were observed, as well as IGHG1. To visualize CSR, we built a phylogeny using Dowser (36). Highlighted in the red box on the phylogenetic tree shown in Fig. 5F, we observe a single subclade that is represented by IGHM, IGHG1, and IGHG2.
We were also able to resolve IGHC alleles from this individual, with the exception of IGHG1 alleles, which were ambiguous. Critically, both IGHG1 and IGHG2 were heterozygous (Fig. 5G). Through the assignment of IGHG alleles to haplotypes within this individual using heterozygous V genes (IGHV3–7 and IGHV3–48), we were able to determine that sequences from clone 9900 (Fig. 5F) used IGHG1 and IGHG2 alleles from the same haplotype, associated with IGHV3–7*07 and IGHV3–48*03 (31). This observation offers direct characterization of CSR events occurring on the same chromosome. When we looked across the remaining clones (n = 8) in this dataset that spanned time points and were represented by IGHG1 and IGHG2 subisotypes, we were able to confirm that all of these clones used IGHG alleles from the same haplotype.
Together, these data provide demonstrative proof-of-concept evidence that FLAIRR-seq profiling performs robustly on clinical samples, including RNA directly extracted from whole blood. In addition, these data provide near full-length repertoire resolution extending what would have been possible with standard sr-AIRR-seq methods, including analysis of subisotype-specific repertories, evaluation of clonal expansion, and characterization of CSR in the IgG and IgM repertoires.
Discussion
In this article, we present the development, validation, and application of FLAIRR-seq, a method designed to resolve near-full-length Ab transcripts from bulk PBMC-derived, isolated B cell–derived, and whole blood–derived total RNA. FLAIRR-seq enables highly accurate, simultaneous resolution of V and C regions and suggests that IGHC polymorphism is far more extensive than previously assumed. FLAIRR-seq performed equivalent to or with increased resolution compared with existing standard 5′ RACE AIRR-seq methods when resolving V, D, and J gene calls, CDR3 lengths, and SHM signatures, suggesting that our CH3/CH4 targeting strategies did not compromise V region characterization while simultaneously adding the capability to resolve IGHC variation. Little to no primer bias was observed when compared with Ab repertoire profiling from total mRNA Iso-Seq methods. FLAIRR-seq provides the ability to use IGHC gene usage to identify subisotypes and genotype H chain transcripts, linking these data back to evaluate subisotype-specific repertoires, clonal expansion, and CSR. Underscoring the underappreciated extent of IGHC variation, our profiling of a restricted cohort of only 10 individuals from relatively homogenous backgrounds still identified 4 and 7 previously undocumented IGHC alleles in IgM and IgG, respectively, and extended an additional 17 alleles beyond which had been available in the IMGT database.
The unique capabilities of FLAIRR-seq will allow higher resolution examination of Ab repertoires, including the characterization of variable gene usage and clonotype distribution within unique subisotype subsets. This perspective has the potential to provide key insights into dynamic Ab responses in diseases known to be mediated by subisotype-specific processes or have skewed subisotype distribution as predictive markers of disease, including myasthenia gravis (mediated by pathogenic autoantibodies across subisotypes that give rise to varied disease pathologies), acute rheumatic fever (associated with elevated IgG3), and melanoma (skewing toward IgG4 in late-stage disease thought to be indicative of tolerogenic responses and poor prognosis) (12–14). These subisotype-specific repertoire profiling approaches have the potential to enable identification of unique clones that mediate disease pathogenicity or serve as high-resolution biomarkers to disease progression, as well as open the door for potential functional experiments on subisotype clones of interest, including examining the functional impact of the, to our knowledge, novel IGHC alleles identified in this study. We propose that expanded population-based FLAIRR-seq profiling and curation of novel IGHC alleles, particularly in conjunction with IGenotyper targeted genomic assembly efforts in IGH, will be a significant first step in defining the full extent of variation in a region too long assumed to be relatively invariant.
The Fc region is known to be critical for modulating differential Ab effector functions. These differential functionalities are currently understood to be regulated by differential posttranslational modifications, such as variable glycosylation (53–56). Future FLAIRR-seq profiling will be a valuable tool to investigate how genomic variation across IGHC genes impacts residue usage and resultant Fc receptor binding, signaling potential, crosslinking, and potential for posttranslational modification, all of which would be expected to alter downstream effector functions, such as Ab-dependent cellular cytotoxicity, Ab-dependent cellular phagocytosis, and complement fixation.
We further demonstrate that FLAIRR-seq can effectively examine clonal expansion and CSR in longitudinal samples, demonstrating the feasibility of using FLAIRR-seq to resolve Ab repertoire dynamics. This increased resolution will further our understanding of Ab repertoire evolution in the transition of acute to chronic disease states, many of which are associated with overall IgG subisotype distribution changes that are thought to reflect the inflammatory milieu (14, 57). One example is advanced melanoma, where late-stage disease is characterized by elevated IgG4 compared with IgG1, which is believed to reflect a more tolerizing, protumor environment (12). As suggested by the data presented in this article, FLAIRR-seq could also provide insights into longitudinal Ab repertoire dynamics in COVID-19 infection after exposure to different viral variants of concern or assess Ab responses to viral vaccines. FLAIRR-seq examination of these samples may identify specific repertoire distribution patterns that could act as biomarkers of disease progression. Moving forward it is critical to account for all variability within the Ab repertoire for the most comprehensive understanding of repertoire dynamics and the myriad factors impacting Ab effector function. Future efforts will expand FLAIRR-seq methods to target IGHA and IGHE repertoires, as well as implement multiplexed arrays sequencing to considerably increase throughput and lower cost (A. M. Al’Khafaji, J. T. Smith, K. V. Garimella, M. Babadi, M. Sade-Feldman, M. Gatzen, S. Sarkizova, M. A. Schwartz, V. Popic, E. M. Blaum, et al., manuscript posted on bioRxiv, DOI: 2021.10.01.462818). Together, the data presented in this article demonstrate that the FLAIRR-seq method provides a comprehensive characterization of allele-resolved IgG and IgM repertoires, detailing V region gene usage and measurements of maturation, isotype, and subisotype identification, and the unappreciated extent of C region variation, which will be necessary to fully appreciate the impact of Ig genomic variation in health and disease.
Disclosures
The authors have no financial conflicts of interest.
Acknowledgments
We thank Kamille Rasche, Kaitlyin Shields, Uddalok Jana, and the staff of the UofL Sequencing Technology Center for assistance in the sequencing and analysis of FLAIRR-seq samples.
Footnotes
This work was supported by the National Institutes of Health Grant P20 GM135004-02 (to E.E.F., D.T., C.T.W., and M.L.S.).
The online version of this article contains supplemental material.
- AIRR-seq
Adaptive Immune Receptor Repertoire sequencing
- CDR
complementarity-determining region
- CH1
first IGHC exon
- CH3
C region exon 3
- CH4
C region exon 4
- CSR
class switch recombination
- FLAIRR-seq
full-length AIRR-seq
- gDNA
genomic DNA
- IGH
genomic locus that encodes the IgH
- IMGT
international ImMunoGeneTics information system
- IRB
Institutional Review Board
- Iso-Seq
isoform sequencing
- SHM
somatic hypermutation frequency
- SMRT
single-molecule real-time
- SNV
single-nucleotide variant
- sr-AIRR-seq
Adaptive Immune Receptor Repertoire sequencing using short-read sequencing
- TSO-UMI
template switch oligo and a unique molecular identifier
- UMI
unique molecular identifier
- UofL
University of Louisville