HLA-I molecules bind short peptides and present them for recognition by CD8+ T cells. The length of HLA-I ligands typically ranges from 8 to 12 aa, but variability is observed across different HLA-I alleles. In this study we collected recent in-depth HLA peptidomics data, including 12 newly generated HLA peptidomes (31,896 unique peptides) from human meningioma samples, to analyze the peptide length distribution and multiple specificity across 84 different HLA-I alleles. We observed a clear clustering of HLA-I alleles with distinct peptide length distributions, which enabled us to study the structural basis of peptide length distributions and predict peptide length distributions from HLA-I sequences. We further identified multiple specificity in several HLA-I molecules and validated these observations with binding assays. Explicitly modeling peptide length distribution and multiple specificity improved predictions of naturally presented HLA-I ligands, as demonstrated in an independent benchmarking based on the new human meningioma samples.
The binding of short intracellular peptides to HLA-I molecules is a key step for immune recognition, and prediction of naturally presented HLA-I ligands is currently the most important and most selective feature used to predict T cell epitopes. Peptide HLA-I interactions have been extensively studied either with binding assays or by mass spectrometry (MS). In recent years, improvements in sensitivity of MS instruments has led to the identification of hundreds of thousands naturally presented peptides, which include cancer testis Ags and neoantigens (1, 2). Computational tools developed by ourselves and others have shown that HLA-I restriction can be predicted in large HLA-I peptidomics datasets, even in the absence of a priori information about HLA-I binding specificity using motif deconvolution (3, 4) or peptide clustering approaches (1, 5, 6). The use of MS data combined with these novel algorithms has led to refinement of known HLA-I motifs, identification of novel motifs, characterization of N- or C-terminal extensions, and improved HLA-I ligand predictions (3, 4, 7–15).
In addition, HLA peptidomics data provide information about the length distribution of naturally presented HLA-I ligands. Interestingly, in a recent study, clear discrepancies were observed between length distributions of eluted peptides and length distributions predicted based on binding affinity for five soluble HLA-I alleles transfected into HeLa cells (16). This work indicated that the peptide length distribution observed in naturally presented HLA-I ligands is the result of both distinct affinity values of HLA-I alleles for peptides of different lengths as well as length biases in the pool of peptides available for loading onto HLA-I molecules in the endoplasmic reticulum (ER). However, most studies investigating peptide length distributions have been restricted to a few alleles (2, 5, 7, 8, 16, 17). Moreover, most existing HLA-I predictors, with the exception of recent versions of NetMHCpan (17) and NetMHC (18), do not include length distribution corrections based on MS data.
Another important feature characterizing peptides interacting with HLA-I molecules is the potential influence of a given amino acid at one position on the amino acid preferences at other positions. This type of co-operative effects, which often reflect multiple specificity patterns (19), can be captured with neural networks, support vector machines or mixture models. Such models have been widely used to predict the binding of peptides to HLA-I molecules (18, 20) or to peptide recognition domains (19, 21, 22). However few examples of actual amino acid correlations and multiple specificity have been documented among HLA-I ligands (3, 23). It therefore remains unclear how prevalent these correlations are in HLA-I alleles.
In this work, we set out to analyze the peptide length distribution and multiple specificity of more than 80 HLA-I molecules using high quality HLA peptidomics data generated by ourselves and others. To this end, we expanded our recent motif deconvolution algorithm MixMHCp (3) to better handle peptides of different lengths. Clear clustering of HLA-I alleles, based on their peptide length distribution similarity, was observed. This enabled us to investigate the structural basis of peptide length distributions and predict peptide length distributions based on HLA-I sequences. We then used MixMHCp to investigate multiple specificity in HLA-I molecules and validated these observations with binding assays. Both peptide length distribution and multiple specificity were then incorporated into our HLA-I ligand predictor MixMHCpred (4), leading to improved prediction accuracy of naturally presented peptides in 10 meningioma samples. Finally, this work expands significantly the allelic coverage of MixMHCpred (122 HLA-I alleles in total), making it suitable to run predictions of naturally presented HLA-I ligands for most individuals of Caucasian origin.
Materials and Methods
Expanding motif deconvolution to peptides of different lengths
Binding motif deconvolution in HLA-I peptidomes with the previous version of the mixture model (MixMHCp1.0) performs well in many samples for 9- and 10-mers (3, 4). However, it often fails for 8-mers and for longer (>10-mers) peptides because much less data are available for them in MS samples. To overcome this limitation, we introduce in this study a new approach to better handle peptides of different lengths. Motif deconvolution is first carried out on 9-mers as previously done in MixMHCp1.0 (3, 4). For peptides of other lengths, the motifs used in the mixture model are fixed, based on the first three and last two positions of the 9-mer motifs, and only the weights of the different motifs are learned from the data. The underlying justification for this approach is that HLA-I motifs, and especially positions close to the anchor residues, differ only slightly between 9-mers and other peptide lengths. However, the proportion of peptides from each allele in HLA-I peptidomics samples is likely to differ substantially for different peptide lengths, because of the different length distributions of HLA-I alleles. In addition, we integrated a flat motif (fixed values at all positions) in the mixture model to which peptides that do not match any motif inferred from the data can be assigned. This is useful to remove potential contaminants in MS data (6), and the number of peptides assigned to the flat motif is listed for each sample analyzed in this work in Supplemental Table I (see also Fig. 1).
In mathematical terms, the new log-likelihood function for the peptides of length equal to core length (hereafter core length equals to nine) is defined as:
where Mk represents the kth motif, represents the weight of this motif in the 9-mer peptides, the number of 9-mer ligands, the nth ligand of length 9, and the prior on the position weight matrix (PWM) entries as defined in (19, 22). The Pi (i = 1,…,20) values are taken as the amino acid frequencies in the human proteome and represent a flat motif to which peptides that do not match any of the other motifs can be assigned, similar to the trash cluster in GibbsCluster (5, 24). This log-likelihood can be efficiently maximized used expectation–maximization (19, 22). For 9-mers, the only difference with the previous version of MixMHCp comes from the inclusion of the flat motif.
For peptides of length L < 9, the log-likelihood function reads:
In this case, the matrices Mk are fixed based on the motifs deconvoluted in the 9-mer peptides and only the weights of the different motifs are learned from the data, using again the expectation–maximization algorithm. For example, for 8-mers, it means that positions 1–3 of the 9-mer motifs are used to score positions 1–3 of the peptides, and positions 8 and 9 of the 9-mer motifs are used to score positions 7 and 8 of the peptides.
For peptides of length L > 9, the log-likelihood function is given by:
Here again the motifs themselves (Mk, k = 1–3, 8, 9) are fixed based on those inferred from the 9-mer peptides, and the weights () of the different motifs are learned separately for each length. The use of the “max” function is meant to consider all possible cases of N- and C-terminal extensions (9). Information about which starting position of the N-terminal motif (i.e., S1+1) and end position of the C-terminal motif (i.e., S2+2) gave rise to the highest value for each cluster is given in output of MixMHCp2.0 (responsibility files), thereby providing predictions about potential N- and C-terminal extensions. T1 = 0.05 and T2 = 0.2 correspond to penalties assigned to extensions and represent the fact that on average, cases in which a motif can be found both at the terminus and within the peptide sequence correspond to bulges (9). The optimal number of motifs and the assignment of the different motifs to different alleles in each sample was determined as described in our previous work based on 9-mers (3, 4), and all motif annotations were manually reviewed.
Collection of publicly available HLA peptidomics data
Publicly available HLA peptidomics datasets from monoallelic studies (7, 8, 25, 26) and pooled studies (1, 2, 4, 14, 27–30) were considered in this work (Supplemental Table II). Almost all these samples were measured with 1% false discovery rate (FDR) [the main exception being the transfected HLA-C alleles from (8)] and none of them were filtered by existing predictors. For the samples from (29), raw MS data were reprocessed and the list of peptides is available in (4). For the samples from (8), the list of peptides before filtering was kindly provided to us by the authors of this study. Finally, Immune Epitope Database (IEDB) (31) data (as of March 2018) were also collected, with those annotated as “ligand presentation” (MS data) and those not (non-MS data) treated separately.
Motif deconvolution in HLA peptidomes
All HLA peptidomics datasets were analyzed with the new version of MixMHCp (see above), and all motifs were manually verified. These include monoallelic samples, in which the flat cluster was used to remove potential contaminants and motifs from endogenous alleles in samples transfected with soluble alleles [e.g., HLA-B35:03 and HLA-C04:01 in the data from (8)]. Peptides were assigned to specific alleles based on the highest responsibility value observed for the corresponding motif (3, 4). For each allele, a unique list of peptides was derived by pooling together all peptides corresponding to motifs annotated to this allele across all samples, as previously described (3, 4). Peptides assigned to the flat motif in any sample were also excluded from downstream analyses.
Length distribution in HLA-I ligands
The number of ligands of each length L = 8,…,14 for a given allele h was computed based on available MS data. When MS data from our curated set of HLA peptidomics studies were not available for a given allele, we used MS data from IEDB. Although allele restriction was often predicted with former versions of HLA-I ligand predictors in IEDB MS data, the utility of such data to estimate peptide length distributions and improve predictors has been recently demonstrated (17). Only HLA-I alleles with at least 200 peptides were considered in this analysis (84 alleles in total). Low-dimensionality representation of the alleles based on their peptide length distribution similarity was carried out with standard t-distributed stochastic neighbor embedding (t-SNE), as implemented in R (initial_dims = 5 and perplexity = 25) (32). Of note, the exact positioning of the points in the t-SNE plot was not used in downstream analyses and is only meant to help visualization. Clustering of HLA-I alleles was performed in R (hclust package in R) based on the Euclidean distance between length distributions for all pairs of alleles (k = 3, method = “ward.D2”). The average length distribution of each cluster was then computed (Fig. 2B).
When incorporating peptide length distributions into our predictor (i.e., to compute the shifts, see below), a smoothing procedure was applied based on the average distribution for each cluster c (c = 1,2,3):
with β = 100, γ = 10, and c = 1, 2 or 3, depending on the cluster of allele h and Sc as the set of alleles in cluster c. This will be especially useful for alleles without MS data (), see below.
Prediction of peptide length distributions
To predict peptide length distribution directly from HLA-I sequences, we aligned the amino acid sequences of all alleles with at least 200 ligands found by MS. We then trained a multinomial logistic regression to predict the clusters observed in Fig. 2A (cv.glmnet, as implemented in R), using as input the amino acids at each position in the HLA-I binding site (i.e., positions 5, 7, 9, 33, 45, 59, 63, 66, 67, 69, 70, 73, 76, 77, 80, 81, 84, 97, 99, 114, 116, 123, 142, 143, 146, 147, 152, 156, 159, 163, 167, and 171, following residue numbering in x-ray structures), based on standard 20-dimensional encoding of amino acids. Cross-validation was carried out by randomly splitting into five groups the alleles shown in Fig. 2A and iteratively training the model (including both the clustering and the logistic regression) on four groups and testing the predictions on the fifth group. In the testing set, the predicted length distribution of a given allele was taken as the average length distribution over the alleles of the cluster to which this allele was assigned, and the correlation between the predicted length distribution and the actual one was computed. Average correlation reported in the article corresponds to the average over all alleles and over 10 different random choices of the groups used in the cross-validation (distinct random seeds). For comparison, correlations obtained by random cluster assignment of alleles in the testing sets were computed. This is important because peptide length distributions of all HLA-I molecules share common features, so we always expect some level of correlation between predicted and actual peptide length distributions, even with a random predictor.
Finally, for HLA-I alleles without MS data, the clusters were predicted with the logistic regression trained on the full dataset of alleles with MS data (84 alleles in total), and the predicted peptide length distributions for these alleles correspond to the average length distribution of the alleles belonging to the cluster where it was assigned, plus a flat pseudocount given by γ (Eq. 4).
Amino acid frequencies at nonanchor positions
Alleles with at least 20 ligands for all lengths L = 8,…14 were selected (i.e., A01:01, A02:01, A03:01, A11:01, A24:02, A29:02, A31:01, A68:01, B07:02, B15:01, B27:05, B44:02, B57:01), and the frequency of amino acids (fr(x), x = 1,…20) at positions P5 to PΩ-2 was computed for each peptide length and each allele. For each amino acid, the average frequency (<fr(x)>) divided by the amino acid frequency in the human proteome (Px), and SD across the 20 alleles are shown in Fig. 3. Amino acids have been ranked based on the slope of the linear regression (red line in Fig. 3). The p values in Fig. 3 correspond to the p value of the Pearson correlation.
The training set of our HLA-I ligand predictor MixMHCpred was built as previously described (4), adding two new meningioma samples measured in this work (3911-ME_I and 4021_I, representing 4,719 unique peptides of length L = 8,…,14) and several recent HLA peptidomics studies (8, 25, 26, 30). For alleles with <200 peptides, MS data from IEDB (31) were further included. For alleles that still had <20 peptides, in vitro binding data from IEDB were used (treating as positives all entries not annotated as “Negatives”). This led to a total of 122 alleles for which we could obtain ligands and build PWMs for length L = 8 to L = 14. To this end, all peptides assigned to the same allele across all studies were pooled together, and each peptide was considered only once. Single PWMs for each allele h and each peptide length L () were built, including pseudocounts based on BLOSUM correction and renormalization by the expected background amino acid frequencies (4, 33). In this framework, the score of an L-mer peptide with allele h was computed as: . In a few cases, no ligand was available for L = 8, and we used the 9-mer PWMs, removing the middle position with the lowest binding specificity. In a few other cases, ligands were not available for longer peptides (L = 10,…,14). In these cases, a nonspecific position was added to right after position L-4. Of note, these cases contribute only marginally to the predictions when incorporating peptide length distribution in the predictor because by construction the corresponding lengths are given a very low weight (see below).
Explicitly modeling length distribution in HLA-I ligand predictors
The scores given by standard PWMs result in peptide length distributions that do not follow those observed in MS data. To explicitly model the peptide length distributions of naturally presented ligands, we added a correction factor to the scoring of peptides of different lengths:
The correction factors were computed as the score corresponding to percentile rank for 100,000 L-mers (randomly selected from the human proteome) without the correction factor. In this way, the top 1% peptides predicted from a set of 700,000 8- to 14-mers (100,000 of each length) follows by construction the distribution derived from MS data, and positive scores indicate peptides in the top 1% of the predictions, whereas negative scores indicate peptides in the low 99% of the predictions for a single allele.
Incorporating multiple specificity in HLA-I ligand predictors
All peptides annotated to the same allele were reanalyzed with our mixture model to identify multiple motifs. Peptides of different lengths were treated separately, with the core length always set equal to the peptide length in MixMHCp. Treating peptides of different lengths separately is important for multiple specificity analysis because peptides of different length may not display the same multiple specificity patterns (see example of C-terminal extensions in Fig. 4). All cases were then manually reviewed to determine whether the multiple motifs differed significantly from the single motif (Fig. 4).
To investigate whether multiple specificity could improve HLA-I ligand predictions, we modeled these cases with multiple PWMs using the framework introduced previously by ourselves for peptide recognition domains (19, 22) and recently applied to model C-terminal extensions in 10-mer HLA-I ligands (9). Peptides were assigned to each motif identified by the mixture model based on the highest responsibility values and multiple PWMs () were built as described above for each allele h, each group of peptides k and each length L. Each motif also includes a weight corresponding to the fraction of peptides assigned to it in the training set. The score of a new peptide of length L with allele h is then computed as:
The new version of the HLA-I ligand predictor (MixMHCpred2.0.1) is available at https://github.com/GfellerLab/MixMHCpred.
HLA peptidome profiling of meningioma samples with MS-based immunopeptidomics
Snap-frozen meningioma tissues from patients were obtained from the Centre Hospitalier Universitaire Vaudois (Lausanne, Switzerland). HLA-I typing was performed for all samples. Informed consent of the participants was obtained following requirements of the institutional review board (ethics commission, Centre Hospitalier Universitaire Vaudois). Protocol F-25/99 has been approved by the local ethics committee and the biobank of the Lab of Brain Tumor Biology and Genetics.
We extracted HLA-I peptides from the meningioma samples as previously described (1). Briefly, snap-frozen meningioma tissue samples were placed in tubes containing ice-cold PBS lysis buffer composed of 0.25% sodium deoxycholate (Sigma-Aldrich), 0.2 mM iodoacetamide (Sigma-Aldrich), 1 mM EDTA, 1:200 Protease Inhibitors Cocktail (Sigma-Aldrich), 1 mM Phenylmethylsulfonylfluoride (Roche, Basel, Switzerland), and 1% octyl-β-D glucopyranoside (Sigma-Aldrich) and homogenized on ice in three to five short intervals of 5 s each using an Ultra Turrax homogenizer (IKA, Staufen, Germany) at maximum speed. Cell lysis was performed at 4°C for 1 h. Lysates were cleared by centrifugation at 20,000 rpm in a high speed centrifuge (JSS15314; Beckman Coulter, Nyon, Switzerland) at 4°C for 50 min. The lysates were loaded on stacked 96-well single-use microplates (3-μm glass fiber, 10-μm polypropylene membranes; reference number 360063; Seahorse Bioscience, North Billerica, MA). The first plate contained protein-A Sepharose 4B beads (Invitrogen, Carlsbad, CA) for depletion of Abs, and the second plate contained the same beads cross-linked to W6/32 anti-HLA mAbs as previously described (1). We then employed the Waters Positive Pressure-96 Processor (Waters, Milford, MA). We washed the second plate with four times 2 ml of 150 mM sodium chloride (NaCl) (CARLO ERBA, Val de Reuil, France) in 20 mM Tris-HCl (pH 8), four times 2 ml of 400 mM NaCl in 20 mM Tris-HCl (pH 8), and again with four times 2 ml of 150 mM NaCl in 20 mM Tris-HCl (pH 8). Finally, we washed the beads twice with 2 ml of 20 mM Tris-HCl (pH 8). We stacked the affinity plate on top of a Sep-Pak tC18 100 mg Sorbent 96-well plate (reference number: 186002321; Waters) already equilibrated with 1 ml of 80% acetonitrile (ACN) in 0.1% trifluoroacetic acid (TFA) and with 2 ml of 0.1% TFA. We eluted the HLA and peptides with 500 μl 1% TFA into the Sep-Pak plate and then we washed this plate with 2 ml of 0.1% TFA. Thereafter, we eluted the HLA-I peptides with 500 μl of 28% ACN in 0.1% TFA into a collection plate. Recovered HLA-I peptides were dried using vacuum centrifugation (Concentrator plus; Eppendorf) and stored at −20°C.
Prior to MS analysis, HLA-I peptide samples were resuspended in 10 μl of 2% ACN/0.1% formic acid, and aliquots of 3 μl for each MS run were placed in the Ultra HPLC autosampler. All samples were acquired using the nanoflow Ultra-HPLC Easy-nLC 1200 (LC140; Thermo Fisher Scientific) coupled online to a Q Exactive HF Orbitrap mass spectrometer (Thermo Fischer Scientific) with a nanoelectrospray ion source (PRSO-V1; Sonation, Baden-Württemberg, Germany) as previously described (1). We packed the uncoated PicoTip 8-μm tip opening with 75 μm inner diameter × 50-cm long analytical columns with ReproSil-Pur C18 (1.9-μm particles, 120 Å pore size, Dr. Maisch HPLC, Ammerbuch, Germany). Mounted analytical columns were kept at 50°C using a column oven. Data was acquired with data-dependent “top10” method. The MS scan range was set to 300 to 1650 m/z with a resolution of 60,000 (200 m/z) and an automated gain control target value of 3 × 106 ions. For MS/MS, automated gain control target values of 1 × 105 were used with a maximum injection time of 120 ms at set resolution of 15,000 (200 m/z). In cases of assigned precursor ion charge states of four and above, no fragmentation was performed.
We employed the MaxQuant computational proteomics platform version 188.8.131.52 (34) to search the peak lists against the UniProt databases (human 42,148 entries, March 2017) and a file containing 247 frequently observed contaminants. We used default settings unless otherwise mentioned. Methionine oxidation (15.994915 Da), cysteine carbamidomethylation (57.021463 Da), and cysteinylation (119.004099) were set as variable modifications. The second peptide identification option in Andromeda was enabled. A peptide spectrum match FDR of 1% and no protein FDR were set. The enzyme specificity was set as unspecific, and the “match between runs” option was enabled across different replicates of the same biological sample in a time window of 0.5 min and an initial alignment time window of 20 min. We used the “peptides” MaxQuant output table, from which peptides matching to reverse sequences and contaminants were filtered out (Supplemental Table III). MS raw data, MaxQuant parameters, and selected MaxQuant output tables used for analyses have been deposited to the ProteomeXchange Consortium (35) via the PRIDE partner repository with the dataset identifier PXD009925.
Two of these samples (3911-ME_I and 4021_I) were included in the peptide length distribution and multiple specificity analysis as well as the training of the new version of the HLA-I ligand predictor (MixMHCpred2.0.1) because they contained alleles (HLA-B49:01 and HLA-C15:05) whose motifs were not detected in our collection of publicly available HLA peptidomics studies and therefore enabled us to expand the allelic coverage of MixMHCpred. The other 10 datasets were kept for benchmarking of the new version of MixMHCpred (Fig. 6) and were not included in any training procedures presented in this work.
To benchmark our new algorithm (MixMHCpred2.0.1), we used the 10 HLA peptidomes from meningioma samples that were measured in this study. Negative data were chosen by randomly selecting peptides from the human proteome to have four times more negative than positive peptides. Peptides of lengths 8 to 14 were uniformly selected in the negative data. When attempting to predict peptides naturally presented in meningioma samples with many (i.e., up to six) different HLA-I alleles, we used for each peptide the highest score of MixMHCpred2.0.1 across all alleles. Precision in the top 20% of the predicted peptides (which is also equal to recall in the case of 4-fold excess of negative data) and area under the curve (AUC) were computed for each sample (fourth bars in Fig. 6A, 6B). AUC values were also computed with 99-fold excess of random negative data (Fig. 6C). MixMHCpred2.0.1 scores were then computed in the absence of peptide length distribution corrections and multiple specificity (first bars in Fig. 6), absence of peptide length distribution corrections (second bars in Fig. 6), or absence of multiple specificity (third bar in Fig. 6). When benchmarking with NetMHCpan4.0 (17), we used the percentile rank of the elution scores and chose the lowest percentile rank across all alleles for each peptide (fifth bar in Fig. 6).
X-ray structures of HLA-I alleles
X-ray structures of HLA-I alleles shown in Fig. 2A were retrieved from the Protein Data Bank (PDB). A representative structure for each allele was selected based on 1) the fact that the ligand is a 9-mer, 2) the absence of other receptors binding to the peptide-HLA-β2m complex (e.g., TCR or KIR, the only exception is HLA-C03:04 for which a crystal structure was only available in complex with KIR2DL2), and 3) the lowest R value. This led to a total of 30 alleles with x-ray structures in complex with 9-mers (Supplemental Table IV).
Euclidean distances between the peptides and the β sheet of the HLA-I molecules were computed as the smallest distance of any heavy atom of the peptide (P3 – PΩ-1) with any heavy atom of residues in the β sheet (positions 5, 6, 7, 8, 9, 10, 11, 12, 23, 24, 25, 26, 27, 28, 95, 96, 97, 98, 99, 100, 101, 113, 114, 115, 116, 117, and 118). Most of the time, the residue closest to the peptide in the β sheet is residue 97, which is located in the center of the binding site and points toward the nonanchor residues of the ligand. In each structure, B factors for all Cα atoms of the peptides (P3 – PΩ-1) were averaged and this value was normalized (i.e., divided) by the average of B factors of all Cα atoms of the HLA-I molecule to enable meaningful comparison between different structures.
In vitro refolding assays
All peptides used in refolding assays with HLA-A03:01, HLA-B07:02, HLA-B15:01, and HLA-C06:02 were synthesized with free N and C termini (1 mg of each peptide, >80% purity). Peptides were incubated separately with denaturated HLA-I alleles refolded by dilution in the presence of biotinylated β-2 microglobulin proteins at temperature T = 4°C for 48 h. The solution was then incubated at 37°C. Samples were retrieved at time t = 0, 8, 24, 48, and 72 h. Stable complexes indicating interactions between HLA-I molecules, and the peptides were detected by ELISA. Two independent replicates were performed for each measurement in Figs. 2C and 5B and 5C. Positive controls were used to renormalize the ELISA absorbance values between the different replicates, and the y-axis shows the mean and SD of these normalized values. Values in Fig. 2C were further renormalized by the 9-mer at time t = 0 h.
Peptide length distributions in HLA-I ligands
To study the length distribution of naturally presented HLA-I ligands, we used our collection of curated HLA peptidomics datasets, including two novel meningioma samples (4719 unique 8- to 14-mer peptides, see 2Materials and Methods and Supplemental Table III), to reach a total of 227,510 unique peptides. We applied our recent motif deconvolution algorithm MixMHCp (3) to assign allelic restriction in pooled HLA peptidomics samples. The approach was expanded to better handle peptides of different lengths and to allow for a flat motif that can identify contaminants that do not fit any of the motifs learned by the algorithm, akin to the trash cluster in GibbsCluster (5, 6, 24) (Eqs. 1–3 and 2Materials and Methods). Supplemental Table I shows the number of predicted contaminants for each sample analyzed in this work and each peptide length. Overall, we observed a median 3.7% of predicted contaminants across samples, with some samples displaying over 25% of predicted contaminants [e.g., HLA-A68:02 in (7), as also reported in (6)]. The highest fraction of predicted contaminants was observed in samples obtained by transfecting soluble HLA-I alleles, followed by monoallelic samples and then pooled HLA peptidomics samples (Fig. 1A), although this may also reflect different FDR thresholds used in these studies. We also observed a much higher fraction of predicted contaminants among longer peptides (median of 58.8% among 14-mers versus 0.66% among 9-mers, see Fig. 1B).
Focusing on alleles with enough MS data (>200 ligands after removing peptides assigned to the flat motif), we could estimate peptide length distributions for a total of 84 alleles (Supplemental Fig. 1). As previously reported (7, 16–18), we observed significant differences in peptide length distributions between different alleles. To have a global view of the diversity of peptide length distributions across HLA-I alleles, we used t-SNE to represent in two dimensions the different HLA-I alleles based on the similarity of their peptide length distributions and proposed three distinct clusters of HLA-I alleles (Fig. 2A, 2B) (see 2Materials and Methods). Cluster 1 is composed of HLA-I alleles that accommodate many longer ligands (e.g., HLA-A01:01 or HLA-03:01). Cluster 2 contains alleles that can bind 10- and 11-mers, in addition to canonical 9-mers, but much less longer peptides (e.g., HLA-A02:01). Finally, cluster 3 contains alleles with peptide length distributions peaked on 9-mers (e.g., HLA-C06:02) or 8- and 9-mers (e.g., HLA-B51:01) and much fewer longer peptides.
Some patterns emerged from this analysis in terms of which HLA-I alleles fall into which cluster. First, we observed a strong clustering of HLA-C alleles (mainly in cluster 3), whereas HLA-A and HLA-B alleles were more mixed, with few HLA-A alleles in cluster 3. As expected, HLA-B alleles found in cluster 3 also displayed peptide length distributions peaked at 8- and 9-mers and include alleles with anchor residues at P5 (e.g., HLA-B08:01, HLA-B14:02, HLA-B37:01) (3). To test whether the peptide length distributions peaked at 9-mers for HLA-C alleles reflected differences in binding affinity, we tested six different ligands of HLA-C06:02 (2Materials and Methods and Fig. 2C). In all cases we observed lower stability of both the 10- and 11-mers (orange and cyan, respectively) compared with the 9-mers (purple). These results suggest that the preference of HLA-C06:02 for 9-mers compared with 10- or 11-mers is also due to lower binding affinity of longer peptides, in addition to the expected higher frequency of 9-mer peptides available for loading in the ER (16) Conversely, 8-mers (black) displayed only slightly lower stability compared with 9-mers, suggesting that their lower frequency in MS data are mainly due to lower frequency of 8-mers in the ER (16). Similar results were previously observed for HLA-B51:01 (16).
To explore the structural basis of peptide length distributions, we surveyed existing x-ray structures of HLA-I alleles in complex with 9-mer peptides (Supplemental Table IV) and measured the distance between the ligand and the β sheet of the HLA-I binding site (see 2Materials and Methods). Interestingly, we observed that the smallest distance between residues in the β sheet of the HLA-I binding site and residues at nonanchor positions (P3 to P8) in the peptide is significantly lower for alleles in cluster 2 and 3 compared with alleles in cluster 1 (Fig. 2D, see example in Fig. 2E). In addition, the average normalized B-factor of Cα atoms of residues P3 to P8 (see 2Materials and Methods), which represents how flexible these atoms are in x-ray structures, is significantly lower for alleles found in cluster 2 and 3 compared with alleles in cluster 1 (Fig. 2F). The lower distance between the residues in the β sheet and those in the peptide together with the lower B-factors of ligands for alleles in cluster 2 and 3 indicate that 9-mer ligands of these alleles are more constrained and less floppy. This provides a potential structural explanation for the observed differences in peptide length distributions between different alleles and suggests that bulging out of the peptides in the presence of additional amino acids (i.e., longer peptides) is less favorable in alleles from clusters 2 and 3.
To further support this hypothesis, we note that HLA-B57:01 and HLA-B58:01 differ by only two residues in the HLA-I binding site (47 and 97) and have very similar binding motifs (Fig. 2E). Position 97 is located close to nonanchor residues of the peptide (P3 to P8), whereas position 45 is buried in the B pocket (Fig. 2E). In HLA-B57:01, valine is found at position 97 and does not interact directly with residues P3–P8 in the peptide [PDB: 2RFX (36), minimal distance equal to 5.4 Å], whereas in HLA-B58:01, arginine is found at position 97 and makes direct interactions with the peptide [PDB: 5IND (37), minimal distance equal to 3.5 Å] (Fig. 2E). Consistent with the mechanism proposed above, HLA-B57:01 has a broader peptide length distribution than HLA-B58:01 (Fig. 2A). These data suggest that position 97 in the HLA-I binding site may play a role in determining length preference, and Fig. 2B indicates that on average, smaller residues are found in alleles of cluster 1. This is also consistent with the increased interactions previously observed between W97 and the peptide in HLA-B14:02 compared with most B27 alleles which have N at position 97 (38).
Predicting peptide length distributions
The clustering observed in Fig. 2A suggests that it may be possible to predict peptide length distribution for a new HLA-I allele based on its sequence only, by predicting to which cluster it belongs and then mapping the average peptide length distribution of the cluster to this new allele. To this end, we trained a multinomial logistic regression, using as features all the HLA-I residues found in the binding site and as output the three clusters identified in Fig. 2A (2Materials and Methods). Five-fold cross-validation was carried out, including both the clustering and the training of the multinomial logistic regression. The average correlation between observed and predicted peptide length distributions was 0.972. As a comparison, a random assignment of HLA-I alleles to the clusters during the cross-validation would result in an average correlation of 0.930. This clearly indicates that our approach has predictive power for peptide length distributions based on HLA-I sequences, beyond what is conserved across all HLA-I alleles (e.g., the higher frequency of 9-mers).
Amino acid frequencies at nonanchor positions
We next took advantage of our large collection of unbiased HLA-I ligands of different lengths to investigate the changes in amino acid preferences at nonanchor positions (P5 to PΩ-2) for alleles displaying at least 20 ligands for all peptide lengths L = 8,…14 (13 alleles in total, see 2Materials and Methods). Fig. 3 shows the changes in amino acid frequencies with respect to peptide lengths. The first pattern that emerges from this analysis is the very clear increase in frequency of glycine in longer peptides. Glycine is known to destabilize secondary structures and induce protein disorder (39). We anticipate that the higher frequency of glycine in longer HLA-I ligands enables them to adopt bulging conformations more easily. We also observed increased frequency of negatively charged residues (aspartate and glutamate), serine, and to a lower extent, alanine. Glycine, serine, and alanine have small sidechains, which confirms at a larger scale the observation made for HLA-C04:01 ligands (40). Disfavored amino acids in longer peptides include leucine, tryptophan, arginine, valine, phenylalanine, histidine, and isoleucine. Many of them tend to induce structured conformations of polypeptides, further indicating that on average, longer HLA-I ligands tend to be less structured at nonanchor positions. As noted in previous studies (4, 7), we also observed a very low frequency of cysteine at all peptide lengths, which is likely due to the fact that cysteine modifications are not included in standard MS spectral searches.
Multiple specificity in HLA-I ligands
To study cases of multiple specificity, we applied the new version of the MixMHCp algorithm to each dataset of HLA-I ligands, treating separately peptides of different lengths (2Materials and Methods). All cases were manually reviewed and our results revealed several instances of multiple specificity (Fig. 4). These include the previously reported multiple specificity of HLA-B07:02 (3), as well as cases of C-terminal extensions in 10-mers for HLA-A03:01 and HLA-A68:01 (orange boxes in Fig. 4) (9). For HLA-B07:02, the multiple specificity predicts mutual exclusivity of positively charged residues at P3 and P6. In our previous work, we had proposed a structural explanation for this observation, involving the presence of aspartate in the HLA binding site at position 114, which can interact with one, but not two, positively charged residues in the peptide (3). A similar multiple specificity pattern is observed for HLA-C16:01 (Fig. 4, red boxes), and this allele also has aspartate at position 114. For HLA-B35:08, existing structures indicate that R156 can interact with residues at P3 [PDB: 2AXF (41)] or P5 [PDB: 2FZ3 (42)] (Fig. 5A), which is consistent with the mutual exclusivity of negatively charged residues observed at these two positions in MS data (Fig. 4, cyan boxes).
For two alleles (HLA-A03:01 and HLA-B51:01), we set out to validate these predictions. For HLA-A03:01, the multiple specificity model predicted that the binding of peptides with R/K at P9 should not be much influenced by the amino acid at P7, whereas peptides with Y/H at P9 should show additional specificity for leucine at P7. These predictions could be validated by observing that L7A mutation of peptides with K at P9 did not affect the binding, whereas L7A mutation decreased the stability of peptides with Y or H at P9 (Fig. 5B). For HLA-B51:01, peptides with sequences predicted to follow the two binding specificities (i.e., L1P2 and D1A2), as well as the other possibilities (i.e., D1P2 and L1A2), were tested. The latter are predicted to display weaker binding, although the single 9-mer logo of HLA-B51:01 would naively predict D1P2 to show the highest stability (Fig. 4). For both 9-mers and 8-mers, we could confirm the predictions of the multiple specificity analysis by observing that peptides with either L1P2 or D1A2 displayed higher stability than ligands with D1P2 or L1A2 (Fig. 5C). Comparison of HLA-B51:01 crystal structures in complex with two different ligands (L1P2 in blue and T1A2 in orange in Fig. 5D) (43) indicates that the polar group of T1 interacts with R62, which is consistent with the specificity for negatively charged or polar residues seen in MS data at P1. Conversely, in the presence of L1, the sidechain of R62 points toward the solvent. In addition, superposition of the two structures suggest that proline at P2 may not be compatible with the position of the threonine sidechain at P1. This suggests a possible explanation for the multiple specificity of HLA-B51:01. If alanine is found at P2, then negatively charged or polar amino acids are preferred at P1 and can make interactions with R62. Conversely, if proline is found at P2, amino acids at P1 cannot make direct interactions with R62, which is then pointing toward the solvent, and hydrophobic sidechains are preferred at P1. It is also interesting to note that one of the first neoantigens identified by MS (DANSFLQSV) in melanoma samples (44) matches very precisely the second motif of HLA-B51:01 identified in this work.
Incorporating peptide length distribution and multiple specificity in HLA-I ligand predictors
To incorporate peptide length distribution and multiple specificity in HLA-I ligand predictors, and explore whether this would lead to improved predictions of naturally presented HLA-I ligands, we developed a new version of MixMHCpred (v2.0.1). Peptide length distribution information was incorporated by adding a penalty so that the length distribution of the top 1% of peptides predicted from a pool of 700,000 random peptides (100,000 of each length L = 8,...,14) follows by construction the distribution observed in MS data (Eq. 5 and 2Materials and Methods). For all alleles displaying multiple specificity, their binding specificity was modeled with multiple PWMs (Eq. 6 and 2Materials and Methods).
To validate this new version of our predictor in an independent dataset, we measured the HLA peptidome of 10 new meningioma samples, resulting in 27,882 unique peptides of length 8 to 14 (2Materials and Methods and Supplemental Table III). We then attempted to predict these peptides by adding four times more negative data consisting of 8- to 14-mer peptides randomly sampled from the human proteome (2Materials and Methods). Fig. 6 shows the results in terms of precision in the top 20% of the predictions (which is equal to recall because we have a 4-fold excess of negative data) and AUC values. In all cases, we saw clear improvement by adding correction for peptide length distributions (compare first and third bars, as well as second and fourth bars in Fig. 6A, 6B). Moreover, in all but two samples containing alleles that were shown to display multiple specificity (stars in Fig. 6), we observed that including multiple specificity led to equal or better results (compare first and second bars, as well as third and fourth bars in Fig. 6). When comparing with the other predictor of naturally presented HLA-I ligands that includes MS data in its training [i.e., NetMHCpan4.0 (17)], we could observe better performance in precision for all samples and in AUC values for all but one sample (compare fourth and fifth bars in Fig. 6). We repeated the AUC analysis in the presence of 99-fold excess of randomly selected negatives (Fig. 6C) and observed that here as well our predictor displayed higher accuracy compared with other predictors trained on MS data. However, it should be noted that by construction, AUC values do not depend on the size of the negative dataset, which explains the high similarity between Fig. 6B and 6C.
In-depth and unbiased sampling of naturally presented HLA-I ligands is a powerful approach to unravel new properties of HLA-I molecules (3, 4, 6–9, 16, 40). In this work, we capitalized on these data to explore peptide length distribution and multiple specificity of HLA-I molecules at a much larger scale than previous studies. Our work revealed clear clustering of HLA-I alleles based on their peptide length distribution, which could guide investigation of the structural properties underlying the preference for peptides of different lengths. One clear pattern that emerged from our analysis is the restriction of HLA-C alleles to mainly 8- and 9-mers (8), likely reflecting both the biased distribution toward 9-mers of peptides in the ER (16) and the lower binding stability observed in Fig. 2C for longer peptides. Conversely, the similar binding stability observed between 8- and 9-mers suggests that the lower frequency of 8-mers among naturally presented HLA-C ligands is primarily due to their lower frequency in the pool of peptides available for loading in the ER (16), at least for HLA-C06:02. The restriction of HLA-C alleles to short peptides may also have functional implications. In addition to TCRs that can somatically rearrange their sequences to bind a wide range of HLA-I–restricted targets, HLA-C alleles are the main targets of NK receptors that are germline encoded. Numerous crystal structures indicate that the NK receptors bind to HLA-I alleles on top of the F pocket (12, 45). Longer peptides bulging out or extending at the C terminus (9, 10, 13) may therefore interfere with the binding of NK receptors (12). This could explain why HLA-C alleles have evolved to preferentially bind to 8- and 9-mers.
Our work further enabled us to demonstrate, for the first time to our knowledge, that peptide length distributions can be predicted directly from HLA-I sequences even in the absence of known ligands, as shown in our cross-validation study.
The multiple specificity observed in this work suggests that for some alleles, correlations between amino acid positions play an important role in binding to the HLA-I molecules (e.g., L1P2 and D1A2 being much more frequent to D1P2 or L1A2 in B51:01). These types of correlations, which we implicitly modeled with multiple PWMs, can also be captured with machine learning frameworks such as neural networks. It is worth noting that from a mathematical point of view, multiple PWMs share many similarities with hidden nodes in neural networks, with the main difference being that the inputs of nodes in standard neural networks are combined with a logistic regression, whereas we use linear combinations in Eq. 6. As already observed for peptide recognition domains (19, 21, 22), an important advantage of the multiple PWMs framework is the straightforward visualization, which is useful to guide structural interpretation and experimental validation of the predictions. Nevertheless, the fact that <15% of HLA-I alleles displayed clear multiple specificity and the modest improvement in prediction obtained by incorporating multiple specificity (Fig. 6) suggest that the assumption of positional independence is still a good first approximation for modeling the specificity of many HLA-I alleles (23).
Our performance estimates in Fig. 6 are only meant to compare between different methods and not to provide actual estimates of the success rate. The latter is unfortunately more difficult to assess in the absence of experimentally validated large negative datasets. In particular, in our benchmarking with 99-fold excess of randomly generated negatives (Fig. 6C), we expect roughly 1% of them to be binders. This corresponds to the same number as the actual positives used in this benchmark and would make precision values in the top 1% more difficult to interpret.
Altogether, our work provides, to our knowledge, the first large-scale analysis of peptide length distributions and multiple specificity in naturally presented HLA-I ligands across a large panel of alleles. Including these two features into our predictor led to improved predictions of peptides presented on the surface of meningioma samples and may therefore contribute to accelerating (neo)antigen discovery in cancer immunotherapy.
The computations were performed at the Vital-IT (http://www.vital-it.ch) Center for high-performance computing of the SIB Swiss Institute of Bioinformatics. We are thankful to Raphael Genolet for carrying out the HLA-I typing of the meningioma samples. We are thankful to Julien Schmidt and the Protein and Peptide Chemistry Facility from University of Lausanne for synthesizing the peptides.
This work was supported by the Swiss Cancer League (Grant KFS-4104-02-2017-R), the Ludwig Institute for Cancer Research, and by the ISREC Foundation thanks to a donation from the Biltema Foundation.
The online version of this article contains supplemental material.
The authors have no financial conflicts of interest.