Emerging high-throughput sequencing methods for the analyses of complex structure of TCR and BCR repertoires give a powerful impulse to adaptive immunity studies. However, there are still essential technical obstacles for performing a truly quantitative analysis. Specifically, it remains challenging to obtain comprehensive information on the clonal composition of small lymphocyte populations, such as Ag-specific, functional, or tissue-resident cell subsets isolated by sorting, microdissection, or fine needle aspirates. In this study, we report a robust approach based on unique molecular identifiers that allows profiling Ag receptors for several hundred to thousand lymphocytes while preserving qualitative and quantitative information on clonal composition of the sample. We also describe several general features regarding the data analysis with unique molecular identifiers that are critical for accurate counting of starting molecules in high-throughput sequencing applications.
Recently developed high-throughput sequencing (HTS) methods for adaptive immunity studies allowed for the deep and, to a certain extent, quantitative analysis of clonotypic composition of TCR (1–4) and BCR (5–9) diversity.
These methods enabled researchers to decipher the intimate architecture of immune repertoires in health and disease, both at the general scale (1–4, 10–17) and in diverse lymphocyte subpopulations, such as naive and memory T cells (10, 11, 18), functional subsets of CD4+ T cells (19, 20) or B cells (5, 8), NKT and mucosal-associated invariant T cells subsets (21–23), tumor-infiltrating lymphocytes (24, 25), tissue-specific lymphocytes (26), or Ag-specific T cells (20).
Different technical approaches have been developed to prepare immune repertoire libraries that are utilizing genomic DNA (1) or RNA (3) material, commonly based on multiplex PCR or RACE methods, respectively. Upon the availability of sufficient quantity and quality of the starting material, these techniques usually perform well, albeit with some library amplification biases (e.g., unequal efficiency of V gene segment amplification in multiplex PCR) and accumulated PCR and sequencing errors that should be additionally corrected (27–30). However, there are many cases when the population of interest, such as sorted lymphocytes or lymphocytes in an available tissue sample, is represented by a limited number of cells, thus requiring a robust high-yield HTS approach that would preserve the original clonotypic composition of immune repertoire.
Such approach should guarantee that 1) nearly all cells of interest are processed during sample preparation, 2) DNA or RNA material of most of those cells is preserved in the course of preliminary purification procedures, 3) DNA or RNA material of most cells enters the PCR amplification, 4) amplification is uniform across all combinations of immune receptor gene rearrangements, and 5) quantitative readout on relative frequencies of identified lymphocyte clones is provided.
In the present study, we sought to develop such a reliable solution that would allow quantitative profiling of immune repertoires in those common cases when several hundred to thousand of cells are available as a starting material, thus making the RNA/DNA/cell losses and library preparation biases critical. We based our approach on cDNA library preparation with a template switch (31), which forms universal primer annealing sites on both 3′ and 5′ ends of the library. This approach minimizes quantitative biases associated with uneven amplification efficiency for different TCR and BCR genes (3, 32). Additionally, we employed barcoding of each cDNA molecule using unique molecular identifiers (UMIs) (5, 33–36), which provides several important advantages.
First, UMI barcoding controls the number of molecules that have successfully entered cDNA synthesis and passed through amplification, which were then analyzed in the sequencing output. Such quantification is crucially important to understand the extent and value of performed analysis and to be able to perform normalized comparison of repertoires. For example, in an experiment that started from a sample containing 1000 lymphocytes and yielded 106 sequencing reads, there could be from 1 to 1000 unique TCR or BCR sequences observed, depending on clonality (after appropriate error correction, see below). However, observation of, for example, 100 clonotypes within these 106 sequencing reads does not provide sufficient information on existing bottlenecks. Indeed, these 100 clonotypes could contain diversity of all 1000 or only 100 lymphocytes. Alternatively, observation of only 100 unique UMI sequences among those reads would clearly indicate that a downsampling has occurred and only 100 starting cDNA molecules were analyzed. This would mean that receptors of not >100 lymphocytes were analyzed, and this could not be compensated by increasing the depth of sequencing.
Second, UMI-based analysis eliminates quantitative biases that occur due to the PCR stochastics and preferences, as well as due to the uneven sequencing efficiency for different templates (33–35, 37). For example, a pair of identical starting molecules ultimately can be read 1000 and 100 times. However, they both will convert to just two equal cDNA molecules after grouping of reads carrying the same UMI. In further analysis of two or more immune repertoires, equal numbers of cDNA molecules from each dataset can be used, which provides rational normalized comparison.
Third, with sufficient over-sequencing, molecular identifiers provide ways for efficient correction of PCR and sequencing errors using Safe-Sequencing (33) or Molecular Identifier Groups–Based Error Correction (MIGEC) (36) algorithms. Both algorithms group those reads that carry the same UMI, that is, cover the same starting DNA or cDNA molecule. This information is averaged, which efficiently eliminates most PCR and sequencing errors. As an additional step, MIGEC filters the hot-spot errors, which is critical to achieve deep error-free sequencing (36). In contrast to the frequency-based grouping of homologous reads (27, 28), selective grouping of reads basing on UMI information does not result in the loss of true sample diversity. This advantage is important in deep profiling, as well as in the analysis of minor B cell or T cell subsets with limited receptor diversity, where highly homologous variants can be present in different proportions, thus being hardly distinguishable from accumulated errors.
In this study, we demonstrate dramatic improvements achieved with UMI-based analysis of immune repertoires for limited cell numbers, and we provide important clues for reliable interpretation of UMI information in HTS analysis.
Materials and Methods
Blood samples from a 50-y-old male were collected into Vacutainer tubes with EDTA (BD Biosciences). PBMCs were isolated by Ficoll-Paque (Paneco) density gradient centrifugation and frozen in freezing medium containing 8% DMSO. Before the experiment, cells were thawed, washed twice with PBS, and cultured overnight in RPMI 1640 supplemented with 10% human serum (First Link) and 5 g/ml IL-2 (Roche) at 37°C. Furthermore, we used different protocols for preparing cell suspensions for cDNA synthesis.
Experiment 1: 4000 PBMCs, eight replicas.
Cells were concentrated in 50 μl 150 mM NaCl and treated with 300 μl 0.025% trypsin (Paneco, 334 U/mg) for 2 min at 37°C for destruction of cell aggregates. To remove clumps of dead cells, PBMCs were washed with 150 mM NaCl, passed through the cell strainer (partition size 40 μm, BD Falcon), and transferred to the 9 mM Tris-HCl (pH 8.0) buffer containing 4.5 mM DTT, 135 mM NaCl, and 2 U/μl rRNasin (according to Ref. 38). Live cell count was estimated using trypan blue exclusion protocol with a hemocytometer. Eight 2-μl replicas of ∼4000 PBMCs each were aliquoted. PBMCs were frozen at −70°C for 10 min for cell membranes breaching.
Experiment 2: sorted subpopulation of TRBV9low T cells, eight replicas.
PBMCs were washed with 120 mM KCl and stained with Abs specific to human CD3 and TRBV9. T cells (2 × 104) of the CD3+/TRBV9low subpopulation were sorted on a FACSAria III (BD Biosciences) into 350 μl 120 mM KCl. Forward and side scatter were used to exclude dead cells, cell aggregates, and cell debris. Cells were concentrated to 100 cells/μl, eight replicas of 5 μl (∼500 T cells) were aliquoted, and 1 μl rRNAsin was added to each replica. Cells were permeabilized using 2 min heating at 50°C, then cooling at 20°C and repeated heating at 50°C for 2 min.
Experiment 3: 500 and 1000 sorted CD8 memory T cells, four plus four replicas.
PBMCs were washed with 120 mM KCl and stained with Abs specific to human CD8 and CD45RA. CD8 memory (CD45RA−/CD8+) T cell populations of 500 and 1000 T cells in four replicas were sorted into 20 μl SMARTScribe buffer (Clontech), supplemented with 2 mM DTT, 1 mM of each dNTP, 2 U/μl rRNasin (Promega), and 0.05% Tween 20. Sorted cells were incubated for 10 min at 4°C to allow RNA leakage. Note that sorting to the limited volume requires thorough control for drop deposition direct into the liquid. This includes minimization of stream fanning and accounting for PCR plate or strip backlash in the holder. In control experiments, we analyzed cell recovery by counting BCECF-AM (Invitrogen)–labeled cells with fluorescence microscopy, which demonstrated that at least ∼400 live T cells of 500 sorted were recovered after sorting into 20 μl PBS.
For experiments 1 and 2, we added premix containing SMARTScribe buffer, 1 mM DTT, 0.5 mM each dNTP, 2 U/μl rRNasin, 5 U/μl SMARTScribe reverse transcriptase, 0.5 μM 5′-template switch adapter SmartNNNa (Supplemental Table I), and 0.5 μM each of the reverse primers for TCR α- and β-chain synthesis. In experiment 3, cells were sorted to buffer for reverse transcription, and, after 10 min incubation at 4°C, specific primers for TCR α and β cDNA synthesis, SMARTScribe reverse transcriptase, and template switch adapter were added. Water was added to a final volume of 20 μl for experiments 1 and 2, and to 40 μl for experiment 3. The solution was incubated for 60 min at 42°C for cDNA priming, synthesis, and template switch. To remove the deoxyuridine-containing template switch adapter, we treated cDNA with uracil–DNA glycosylase (New England BioLabs, 5 U/μl) for 40 min at 37°C.
For all three experiments we used the same two-step PCR amplification protocol.
PCR amplification 1.
Each cDNA synthesis reaction was divided into several reactions, and all obtained cDNA was amplified. TCR α and β libraries were amplified together. Reaction mix included 5× buffer for Q5 DNA polymerase (New England BioLabs), 0.2 mM of each dNTP, Q5 DNA polymerase (New England BioLabs), and 0.2 μM forward primer annealing to SmartNNNa adapter and reverse primers for TCR α- and β-chain amplification. The amplification program was as follows: 94°C for 2 min, then 94°C for 10 s, 55°C for 10 s, 72°C for 50 s (21 cycles), with final elongation at 72°C for 2 min. PCR products from the same cDNA synthesis were joined and then purified using a QIAquick PCR purification kit (Qiagen) and eluted in 20 μl.
PCR amplification 2.
Two microliters PCR amplification 1 product was used in a second step PCR. TCR α and β libraries from each sample were amplified in two different tubes. To combine PCR libraries from all replicas in a single Illumina MiSeq run, we used primers with preintroduced sample barcodes (from the 5′ end of each library in experiment 1 and from both ends in experiments 2 and 3). Reaction mix included buffer for DNA polymerase, dNTP, 0.2 μM forward primer annealing to adapter, and a reverse primer for TCR α- or β-chain constant gene segment. The amplification program was as follows: 94°C for 2 min, then 94°C for 10 s, 55°C for 10 s, 72°C for 50 s (15–18 cycles), with final elongation at 72°C for 2 min. When considering dilution of each library in the second PCR, this results in ∼33–36 PCR cycles. The samples were combined together and purified using QIAquick PCR purification kit (Qiagen) the same day.
PCR product concentration in each library was determined using a Qubit fluorometer (Invitrogen). PCR products of replicate samples were mixed together in an equal ratio. Illumina adapters were ligated according to the manufacturer’s protocol. The libraries were sequenced on Illumina MiSeq using 300 cycle reagent kit in paired-end 150 plus 150 nt mode.
Raw TCR sequencing data were processed using the MIGEC pipeline (36) (http://milaboratory.com/projects/migec-project/). Briefly, the Checkout utility was used for data demultiplexing and UMI sequence extraction, yielding >90% sample barcode matching, and the data were then assembled using the Assemble utility with erroneous UMI filtering option and an over-sequencing threshold of at least four or five reads per UMI. Both raw and assembled data were then subject to CdrBlast for CDR3 region extraction and identification of variable/joining segments, yielding 80–90% CDR3-containing reads in most samples. To filter off hot-spot PCR errors, the second-level correction of the MIGEC pipeline was applied, and on this stage 5–9% of MIGs were filtered. MIG over-sequencing and erroneous UMI histograms were calculated using the Histogram utility. The analysis was complemented by running MiTCR software (29) on raw (unassembled) reads. MiTCR was executed with options enabling low-quality read rescue and frequency-based error correction. CDR3 extraction efficiency was nearly identical for MIGEC and MiTCR in all samples, which allowed us to perform an unbiased comparison between conventional and UMI-based analyses.
Optical duplicates on the flow cell
The analysis was performed by random sampling of 107 sequencing read pairs from the sets of reads containing same and distinct UMI sequences. The information on DNA cluster positioning (tile ID, x and y cluster coordinates) was extracted from sequence headers according to manufacturer information (http://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm) and further analyzed using a custom script (https://github.com/mikessh/duplicates). DNA cluster distance was calculated only for read pairs that belong to the same flow cell tile. Presence of optical duplicates (i.e., the events where multiple clusters are identified on a position where only one true cluster exists) was resolved by plotting the distribution of read pair distance on the flow cell.
The background frequency of same-tile reads calculated for the read pairs with distinct UMI randomly drawn from the whole lane (139,204 and 360,873) was inversely proportional to the number of tiles (96 and 38) for the corresponding HiSeq and MiSeq instruments, which confirms uniformity of sampling. Alternatively, higher values were obtained when considering read pairs with identical UMI sequences: 144,062 (3% higher) and 536,557 (49% higher), correspondingly, indicating the presence of optical duplicates.
Revising UMI-based analysis
With the clear advantages of UMI-based profiling of immune repertoires summarized in the 1Introduction, there are specific pitfalls that have to be recognized and accounted for.
In particular, with high over-sequencing, PCR errors within the UMI sequence itself may dramatically overestimate the true number of starting molecular events. For example, if one UMI of 12 nt length is amplified and sequenced 104 times, it may routinely produce ∼10–20 erroneous UMI subvariants generated during PCR amplification, which may be represented altogether by >100–200 sequencing reads, according to our experience with spike-in CDR3 sequence controls in BCR and TCR libraries (28, 29, 36). In this situation, straightforward data interpretation could be that there were 11–21 TCR or BCR cDNA molecules at the start, whereas there was only 1. Even higher numbers of error-containing UMI subvariants were recently reported in a model system with 16 nt length UMI (39).
However, appropriate control for the presence of nearly identical UMIs along with the extent of UMI over-sequencing allows elimination of such erroneous subvariants quite efficiently. In the present study, we had to estimate the average number of TCR cDNA molecules that were successfully captured per T cell in the HTS analysis. Although we initially observed high numbers of distinct UMI, many of them, mainly covered by one to two sequencing reads, represented single mismatch subvariants of the highly over-sequenced molecular identifiers (Fig. 1A). The latter situation was reproducible for multiple analyzed datasets with template switch–introduced UMIs that passed 27–35 cycles of PCR amplification (data not shown). Thus, the abundant pool of UMI subvariants with low sequence coverage apparently represents artificial UMI diversity that results from PCR errors at the late cycles of amplification when maximal numbers of molecules are being replicated.
Based on these observations, we applied two-stage filtering to obtain the true numbers of distinct cDNA molecules in the experiments described below. First, we followed straightforward frequency-based logic and filtered minor UMI subvariants that differed by a single nucleotide mismatch from the major ones. Second, we set a threshold of at least several sequencing reads per UMI as shown in Fig. 1, which allowed us to filter the remaining erroneous UMI subvariants (including those carrying two errors) based on their intrinsically low over-sequencing. In this logic, the optimal threshold is set according to the comparative analysis of the over-sequencing curves for all UMIs for the single-mismatch UMI subvariants. The optimal threshold may vary depending on the size of the starting library, achieved over-sequencing, and shape of the over-sequencing peak. The higher the general over-sequencing and the narrower the over-sequencing peak, the higher the threshold can be set to eliminate artificial UMI diversity, still being confident that most true starting cDNA or DNA events are preserved. Interestingly, in this approach, the width of the over-sequencing peak depends on efficiency of the starting PCR cycles that essentially determines the resulting uniformity of molecule amplification (data not shown).
Another important point refers to the natural coincidence of randomly generated UMI sequences (so-called collisions, see Refs. 39, 40). For the experiments reported below, the case when two true identical UMIs or two UMIs with a single mismatch difference were generated by sampling 500–5000 UMIs from a theoretical pool of ∼1.7 × 107 unique 12 bp length variants was extremely rare. However, note that real diversity of 12-bp UMIs was lower than this theoretical number and was estimated as ∼1.4 × 107 (Fig. 1B). This could result from uneven primer synthesis and, hypothetically, from the preferences of template switch and amplification, as this loss of degeneracy depends on the nucleotide position and the extent to which the UMI was amplified (Fig. 1B). Apparently, natural collisions of 12-bp UMIs are not rare in deep sequencing experiments with high numbers of starting molecules. For example, in the case of 106 starting molecules, one would expect >3 × 104 randomly generated 12-bp UMI pairs with exactly the same sequence. Therefore, larger UMI diversity is recommended for comprehensive sample analysis in deep sequencing. At the same time, too short (i.e., low diversity) UMIs, as reported in Grün et al. (41), cannot accurately count even limited numbers of starting molecules both owing to the natural collisions of true UMI variants and to accumulated PCR and sequencing errors that may hardly be corrected. These critical points that determine the accuracy of input molecule quantification are not always recognized.
In the present study, we also observed that UMI logic eliminates the so-called optical duplicates (42), that is, cases when a single DNA cluster (raw Illumina instrument signal that is processed into a sequencing read) is misidentified as a group of several identical DNA clusters. Such duplicates could be clearly detected as a peak of read pairs that carry the same UMI with the intercluster distance <100 pixels (Fig. 1C). This effect is especially prominent for our MiSeq experiments that have higher DNA cluster density. In standard read-based analysis, optical duplicates may alter the interpreted frequency for the rare clonotypes, whereas safe-sequencing system equations or MIGEC algorithm group reads with identical UMI sequences safely eliminate such sequencing biases, along with the routine biases arising from nonuniformity and stochastics of PCR amplification.
Library preparation method
We have optimized primer sets and protocols used for human TCR libraries preparation (Supplemental Table I). In this design, priming of cDNA and then the first PCR are performed for the TCR α- and β-chain genes together to use all starting material. All RNA goes in single synthesis reaction, and all cDNA goes in the single first PCR. TCR α and β libraries are then separately amplified in the second nested PCR with primers specific to the TCR α and β constant gene segments, respectively.
To avoid the loss of starting material in the course of RNA purification, which is critical when working with small cell numbers, we performed cDNA synthesis with template switch directly in cell lysis solution. To this end, we sorted out several feasible ways to uncage undamaged RNA from cells, including short freezing at −70оC, repeated heating at 50оC, or incubation with 0.05% Tween 20. Hereafter, all obtained cDNA was used in further PCR amplification bypassing any purification to preserve maximal numbers of input molecules. However, note that using the MinElute PCR purification kit (Qiagen) to clean cDNA was generally safe in our hands for the TCR and BCR libraries preparation, at least for cell numbers >103, and improved further PCR amplification (data not shown).
With this protocol, both TCR β and TCR α cDNA libraries are reproducibly amplified within 30–36 PCR cycles starting from 1000 human PBMCs (∼500 T cells). Notably, 2-fold larger cell numbers were required to obtain libraries of comparable quality after TRIzol-based RNA purification, and reproducibility in replicate experiments remained low, whereas RNA purification with RNeasy Micro Kit (Qiagen) showed almost the same efficiency as direct cDNA synthesis from permeabilized cells (data not shown).
As demonstrated below, UMI-based analysis yields reliable TCR profiling results starting from several hundred to several thousand T cells, providing accurate quantification of TCR α- and β-chain clonotypes and high reproducibility between replicas.
Experiment 1: eight replicas of 4000 PBMCs
In the first model experiment, we prepared and sequenced UMI-barcoded TCR α- and β-chain libraries coming in eight replicas of ∼4000 PBMCs (∼2000 T cells) each. Cells were obtained from peripheral blood of a 50-y-old male donor. Libraries were prepared as described in 2Materials and Methods, pooled in equal proportions and sequenced on an Illumina MiSeq 2 × 150 nt paired end run, yielding information on sample barcode, UMI sequence, and CDR3 sequence for each sequencing read. For each library, preliminary MiTCR (29) analysis without using UMI information identified from 61,860 to 702,195 CDR3-containing sequencing reads (Supplemental Table IIa). In MIGEC analysis (36), reads that carried the same molecular identifier and thus covered the same starting cDNA molecule were grouped. Erroneous UMI subvariants were filtered according to the logic described above. Analysis of the over-sequencing histograms revealed that the threshold of four reads per molecular identifier was optimal to collect most CDR3-containing sequencing reads and uniquely labeled cDNA molecules but to remove most low-count errors and artifacts. Therefore, only those molecular barcodes that were covered with at least four sequencing reads were used in further analyses as shown on Fig. 1A. As a final step, we have filtered the out-of-frame and stop codon–containing CDR3 variants to analyze only functional TCR diversity.
This rigorous reads grouping and filtration yielded from 1520 to 6395 reliably analyzed in-frame cDNA molecules per replica of TCR β (3302 ± 1522) libraries, and from 489 to 1830 molecules per replica of TCR α (1035 ± 406) libraries; that is, ∼1.7 TCR β and 0.5 TCR α mRNA molecules were successfully analyzed per each of the ∼2000 T cells in a replica. This sensitivity corresponded well with expectations (43), and it is probably limited by the template switch efficiency. The number of analyzed TCR α and TCR β cDNA molecules correlated accurately along replicas (R = 0.96, Fig. 2A), indicating that the variation between replicas mainly followed the stochasticity of cell sampling, whereas library preparation protocol was highly reproducible. Apparently, the number of sampled cells determined the depth of analysis. Detected diversity of CDR3 sequences contained from 690 to 2558 variants for TCR β and from 276 to 812 variants for TCR α libraries per replica and correlated with the number of analyzed cDNA molecules (Fig. 2B).
Importantly, although standard data analysis with frequency-based error correction but without using UMI information yielded higher clonotype diversity, manual verification revealed that many additional variants resulted from the undercorrected errors within CDR3 (Supplemental Table III). This was expected because intense over-sequencing of the small number of starting molecules leads to the dominance of accumulated errors diversity over the real repertoire diversity (27, 28). Albeit strict, UMI-based analysis successfully employed >85% of CDR3-containing sequencing reads (Supplemental Table IIa). With obtained over-sequencing (Fig. 1A) we were confident that we extracted the majority of valid cDNA events and obtained representative data on sample composition, which were now free of errors and sequencing artifacts.
Comparison of two data analysis approaches revealed that MIGEC substantially improves quantification of clonotypes and gene segments. Frequencies of identified clonotypes correlated better in pairwise replica comparisons in the molecular barcoded approach compared with the standard data analysis (Fig. 3A, 3B). Molecular barcoding analysis also yielded more accurate frequencies of TCR gene segments usage with minimal dispersion between the replicas (Fig. 3C, 3D).
To determine the reproducible sensitivity of the assay, we plotted the number of replicas in which a clonotype was detected versus its average concentration (Fig. 4). This analysis revealed that TCR α and β clonotypes with average concentration >0.3% (i.e., represented on average by at least approximately five to seven T cells) were routinely identified in all eight replicas. The physical probability of failing to sample a clonotype in eight of eight samples for a clonotype represented by three T cells is 33% in this case, indicating that we have closely approached the sampling limit.
In standard data analysis, relative concentration of major clonotypes was identified with substantial dispersion, whereas counting cDNA molecules with molecular identifiers yielded reproducible data on concentration of each clonotype (Fig. 3E). Correspondingly, UMI-analyzed data showed higher stability in ranks of the top clonotypes between the samples (Supplemental Table IV).
For a number of stably expanded T cell clones of this individual, we have previously identified TCR α- and β-chains (44). Using this information, we have verified how accurate the rank-based pairing of chains in UMI-based TCR profiling of minor cell numbers could be. As shown in Fig. 5, native TCR α- and β-chain pairs could be unambiguously identified for the top clones in a complex mixture of peripheral blood T cells. This indicates that for the low-complexity subsets, such as a sorted MHC multimer–stained Ag-specific population of T cells, straightforward identification of native TCR α and β pairs is feasible with the UMI-based data analysis starting from minor amounts of T cells. In contrast, standard analysis without using molecular identifiers yielded poorly matched results on relative concentrations of known TCR α-and β-chain pairs (Fig. 5).
Our previous studies revealed that the CD3+/TRBV9low subpopulation of cells for this individual is dominated by a single T cell clone, named Slec1, which can therefore be specifically visualized in flow cytometry (45). Slec1 carries one functional TCR α-chain (CAFMKPREGNTDKLIF/TRAV38-1/TRAJ34), one functional TCR β-chain (CASSVALGLNYEQYF/TRBV9/TRBJ2-7), and one nonfunctional but expressed TCR β-chain (CASSKARWDFTANVLTF/TRBV21-1/TRBJ2-6) (45). According to flow cytometry, Slec1 constituted ∼2.5% of all T cells in peripheral blood at the time of analysis. In UMI-based data, its CDR3 sequences were represented by 2.4 ± 0.2% in TCR β libraries (considering both expressed TCR β mRNA of this clone) and 3.4 ± 1% in TCR α libraries, respectively, thus demonstrating good agreement with cytometry data. In standard analysis, these values constituted 1.3 ± 0.3% for TCR β and 2.6 ± 1.3% for TCR α sequences, demonstrating weaker correlation both with flow cytometry and between TCR α- and β-chain frequencies.
Experiment 2: sorted subpopulation of TRBV9low T cells, 500 T cells, eight replicas
Next, we asked whether it is possible to apply a similar approach to the low counts of FACS-sorted T cells. We sorted a TRBV9low subpopulation of CD3+ T cells of the same individual into 350 μl 120 mM KCl, concentrated them, and then split them into eight replicas of ∼500 T cells each. Cells were then permeabilized using brief heating at 50°C, then cooling at 20°C and repeated heating at 50°C. Further cDNA synthesis, amplification, sequencing, and analysis were performed as in the previous experiment.
As we expected, most TCR β and TCR α cDNA events (85–97% for different replicas) belonged to a single T cell clone, Slec1, with two TCR β-chains presented in nearly equal frequencies. Additionally, in minor counts, TCR α clonotype CAFVGGPQGGSEKLVF/TRAV38-1/TRAJ57 was identified in three replicas, and TCR β clonotype CASSVETGDLAFETQYF/TRBV9/TRBJ2-5 was detected in the same three and one additional replica, suggesting that they constitute TCR of a single T cell clone with a TRBV9low phenotype. This example demonstrates the additional possibility of native TCR α- and β-chain pairs identification based on their co-occurrence in the same small sample replicas.
Although reproducibility in replicas was high, we were unsatisfied with the average number of cDNA copies analyzed per ∼500 sorted T cells (∼50 molecules for TCR α and ∼120 molecules for TCR β). This could result from degradation of RNA in cells affected by sorting, or loss of cells initially sorted in a relatively big volume in the course of their concentration and buffer exchange.
Experiment 3: sorted replicas of 500 and 1000 CD8 memory T cells
In the next experiment, we sorted T cells directly to 20 μl cDNA synthesis buffer with 0.05% Tween 20 and rRNAsin to provide immediate cell permeabilization and conservation of RNA. Accuracy of drop deposition was thoroughly controlled (see 2Materials and Methods). We sorted four replicas of 1000 and four replicas of 500 CD8+/CD27− memory T cells. Sorted cells were incubated for 10 min at 4°C to allow uniform RNA leakage from early and late sorted cells. Then, specific primers for TCR α and β cDNA synthesis, reverse transcriptase, and template switch adapter were added and the solution was incubated for 60 min at 42°C for cDNA priming, synthesis, and template switch. Further amplification, sequencing, and data analysis were performed as in previous experiments. Results of UMI-based analysis with a threshold of five reads per UMI are summarized in Supplemental Table IIb.
Reproducibility between replicas in terms of the number of obtained in-frame cDNA events was rather high. We obtained 522 (±61) and 171 (±34) in-frame cDNA events per 1000 sorted T cells for TCR β and TCR α libraries, respectively. Similarly, we obtained 278 (±18) and 92 (±18) in-frame cDNA events per 500 sorted T cells for TCR β and TCR α libraries, respectively. In eight replicas this constitutes 0.54 (±0.05) TCR β and 0.17 (±0.03) TCR α cDNA molecules analyzed per each T cell. Therefore, immediate permeabilization of sorted cells resulted in improvement of cDNA synthesis and template switch efficiency compared with the sorting to PBS with further cell concentration and buffer exchange. In both the 500 and 1000 cell sorting experiments, TCR β clonotypes representing on average at least five cells were identified in four of four replicas. For TCR α, this threshold constituted at least ∼15 T cells per replica. Accuracy of clonotype quantification remained high for the major clonotypes (geometric SD in four replicas of <1.3 for the TCR β clonotypes represented on average by at least 20 T cells) and resembled cell sampling bias rather than limitation of the method.
UMI-based analysis represents a powerful tool for normalized, error-free sequencing in a variety of HTS applications, including genome, transcriptome (33, 34, 41), and microbiome (46) analysis, studying polymerase (33), transcription (47), primer synthesis (33), or sequencing (39) error rates, and profiling of immune repertoires (5, 35, 36, 48). However, this technique should be used with due caution to achieve accurate quantitative results. In particular, straightforward counting of unique UMIs faces the problem of artificial error-based subvariants that mimic distinct starting molecules. We show that this problem can be essentially solved by combining frequency-based filtering of single mismatch UMI and a threshold of several reads per UMI. The latter is also a prerequisite for implementation of Safe-Sequencing/MIGEC algorithms of error correction. Additionally, control for sufficient diversity of UMI variants in relationship to the size of the studied library is important.
Qualified analysis of Ag receptors in minor lymphocyte subsets is vital both for the progress in deciphering the adaptive immunity puzzles and future development of diagnostic techniques based on TCR and BCR repertoire analysis for the particular functional or tissue-resident subsets of lymphocytes (49). In present study we report a methodical approach that allows performance of quantitative profiling of immune repertoires for samples containing as few as several hundred lymphocytes. Those cell numbers are typical for the material available after sorting of Ag-specific lymphocytes, minor size–functional subsets, and extraction of tissue-resident or tumor-infiltrating lymphocytes from biopsy samples or microdissected tissues.
Unique molecular barcoding followed by MIGEC analysis of sequencing data efficiently protects from artifacts and eliminates PCR and sequencing errors while preserving the natural sample diversity. The latter is crucial for samples containing homologous immune clonotypes, such as Ag-specific T cells, low-diversity T cell subsets, or hypermutated BCR subvariants. At the same time, molecular barcoding substantially increases accuracy of clonotype quantification based on the counts of verified input cDNA molecules instead of output sequencing reads. With the confirmed efficiency of 0.5–1.7 TCR β cDNA analyzed per T cell, all those clones represented by at least approximately five cells in a sample are reliably detected in replicate experiments. Those clones that are represented by lower numbers of cells are also captured based on the physical probability of their sampling, therefore preserving and revealing quantitative information on the repertoire diversity of the studied sample. This efficiency is thus quite sufficient for the comprehensive analysis of small lymphocyte subsets of interest.
Although TCR/BCR sequencing of single cells sorted to the 96-well plate remains a feasible alternative (50), it is expensive and time consuming to proceed with multiple samples of interest for which a rapid snapshot of immune receptor sequences is required. In contrast, our method is robust, easy to implement, and is highly protected from sequencing artifacts and hidden bottlenecks that could lead to data misinterpretation. These features allow reliable and cost-effective comparison of immune repertoires for the tens and hundreds of distinct functional subsets, patients, tissues, model animals, or time points.
The work was carried out in part using equipment provided by the Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry Core Facility.
This work was supported by Russian Science Foundation Grant 14-14-00533.
The online version of this article contains supplemental material.
Abbreviations used in this article:
Molecular Identifier Groups–Based Error Correction
unique molecular identifier.
The author has no financial conflicts of interest.