Abstract
Recent advances in single-cell RNA–sequencing (scRNA-seq) technology increase the understanding of immune differentiation and activation processes, as well as the heterogeneity of immune cell types. Although the number of available immune-related scRNA-seq datasets increases rapidly, their large size and various formats render them hard for the wider immunology community to use, and read-level data are practically inaccessible to the non-computational immunologist. To facilitate datasets reuse, we created the JingleBells repository for immune-related scRNA-seq datasets ready for analysis and visualization of reads at the single-cell level (http://jinglebells.bgu.ac.il/). To this end, we collected the raw data of publicly available immune-related scRNA-seq datasets, aligned the reads to the relevant genome, and saved aligned reads in a uniform format, annotated for cell of origin. We also added scripts and a step-by-step tutorial for visualizing each dataset at the single-cell level, through the commonly used Integrated Genome Viewer (www.broadinstitute.org/igv/). The uniform scRNA-seq format used in JingleBells can facilitate reuse of scRNA-seq data by computational biologists. It also enables immunologists who are interested in a specific gene to visualize the reads aligned to this gene to estimate cell-specific preferences for splicing, mutation load, or alleles. Thus JingleBells is a resource that will extend the usefulness of scRNA-seq datasets outside the programming aficionado realm.
Introduction
Single-cell RNA–sequencing (scRNA-seq) is a powerful tool for profiling whole-transcriptomes at the single-cell level, which can reveal the variability between cells. Unlike conventional RNA-seq protocols, scRNA-seq can provide important information on rare cell types without requiring a large number of cells, and can help to define homogenous subpopulations within larger, heterogeneous populations without requiring a priori separating markers (1, 2).
Similar to bulk RNA-sequencing, the read-level data of scRNA-seq datasets contains a considerable amount of information that can be used to infer splicing, allele usage, point mutations, and indels. Visualization of the reads by cell can show these phenomena, and indicate how heterogeneous cells are in relation to each phenomenon. However, although there are several methods to visualize the cells from scRNA-seq datasets [e.g., SPADE (3), TSCAN (4), and ZIFA (5)], to the best of our knowledge there is no current method for visualizing reads from scRNA-seq at the single-cell level.
As scRNA-seq is becoming increasingly popular, dozens of scRNA-seq datasets have been generated and released to the public domain. These datasets are available from several online repositories, such as Gene Expression Omnibus [GEO (6)], ArrayExpress (7), and the European Nucleotide Archive (8). Such repositories typically store estimated expression levels of each gene in each cell (termed the expression table), and a reference to a sequence read archive (9), which stores raw sequence read data, but no aligned read data.
The transition from the raw sequence read data to the gene expression table is done by different pipelines in each laboratory. To compare or combine different datasets for any type of meta-analysis, datasets need to be reprocessed by the same pipeline (as much as possible—clearly estimating expression level of 3′ sequencing cannot be done in the same way as full transcript sequencing). Reprocessing a dataset requires the following steps: 1) finding the dataset and following the link to the reads repository; 2) downloading all read files; 3) understanding how the association of sequence reads with cells (read-to-cell association) is stored; 4) finding out which cell belongs to which experimental group; 5) aligning the reads to the genome; and 6) counting and normalizing the expression of each gene in each cell. Steps 1–5 need to be tailored to each dataset separately. Thus, those steps cannot be automated, and require a considerable amount of manual effort for each dataset.
In this study, we report the formatting of publicly available scRNA-seq datasets into the standard binary sequence alignment/map (BAM) format (10), extended by the read-to-cell association, and grouped by experimental condition (performing the above preprocessing steps 1–5). The formatted datasets are deposited in the JingleBells repository. The repository and the uniform extended BAM format enable users without any programming skills to download datasets and visualize the reads aligned to specific genes of interest (or genomic regions) in Integrative Genomics Viewer (IGV), grouped and colored by their cell of origin. This visualization enables appreciation of the uniformity of coverage across the gene and the splicing patterns and genetic variation displayed by each cell. Moreover, computational biologists can easily reuse JingleBells’ formatted scRNA-seq datasets for further analysis, without the laborious preprocessing of each dataset.
Materials and Methods
Datasets
PubMed (11), GEO (6), ArrayExpress (7), and the European Nucleotide Archive (8) were searched with the term “single-cell RNA-seq.” All results (published prior to the end of 2016) were manually curated to verify that they describe scRNA-seq datasets, resulting in 186 datasets.
To maximize the usefulness of JingleBells for the scientific community, we have focused, at this stage, on the 42 immune-related datasets that (preferably) include at least 90 single cells. At the time of writing, JingleBells contained 42 immune-related datasets from three organisms (34 mouse, six human, and two zebrafish). This decision was motivated by the fact that the immunology field was the first to adopt and benefit from the scRNA-seq technology, and because we predict that immunologists will most likely be interested in studying the reads from a single gene at the single-cell level. However, the repository holds a list of the 144 scRNA-seq datasets that are not immune related, and we will format those upon request.
Extended BAM format
In this study we propose an extension to the sequence alignment/map (SAM)/BAM format (10), by adding two new tags to each read, to note the index and color of the read’s cell of origin. The first tag is the “xi” user-defined field, which is a numeric cell-specific index that is unique per dataset (i.e., no other cell in this dataset will have that index, but cells in other datasets may have the same index). This tag gives the information of read-to-cell association, and makes it possible to group the reads by their cell of origin in IGV and avoid the need for dataset-specific decoding of read-to-cell association in downstream analysis. The second tag is the University of California, Santa Cruz genome browser–defined “YC” tag (https://genome.ucsc.edu/goldenpath/help/hgBamTrackHelp.html), which is the color to be used for the read in University of California, Santa Cruz genome browser or IGV. Although this tag is not crucial for visualization, as one can color the reads by any other tag in IGV, we found it is useful for two reasons. First, it saves the effort needed to manually select this coloring option in IGV for each sample upon each opening of the files (coloring by tag is currently not implemented in IGV command line options). Although this seems negligible, we anticipate that one of the more common use cases is of users who wish to see the reads aligned to their gene of interest in many datasets, and those users will have to repeat this selection for each experimental condition (represented in an IGV track). Second, the “YC” tag sets the color of each cell irrespective of the other cells presented, and thus enables one to easily identify specific cells of interest, as they are represented by the same color in every genomic region displayed. For example, one may notice that the purple and blue cells tend to have more mutations than other cells, in many of the genes visualized, highlighting them as potential premalignant cells.
Dataset processing pipeline
For each dataset, FASTQ or sequence-read archive read files were downloaded. Reads were aligned to the appropriate genome by using HISAT, a fast and memory-efficient spliced alignment method, matching or exceeding the accuracy of other commonly used spliced aligners (12). The default HISAT parameters and reference genomes were used. Specifically, for downloading SRR files and aligning the reads in data file (e.g., NAME) to a specific genome version GV (e.g., mm10 or hg19), the command used was: hisat -x GV_hisat--sra-acc NAME -S NAME.sam. For aligning paired FASTQ files (e.g., NAME1.fastq and NAME2.fastq), the command used was: hisat -x GV_hisat -1 NAME1.fastq -2 NAME2.fastq -S NAME.sam.
After alignment, the read-to-cell association was added to each read in the format of the “xi” and “YC” BAM tags described above. The encoding of the read-to-cell association was done per dataset, as this association may be encoded in the read files (13, 14) by storing the reads of each cell in a separate file (15–19) or files (20–22). The encoding process is detailed in the README file associated with each dataset. As detailed above, the cell identity is crucial for visualization of the reads by cell, and for association of the reads with experimental condition.
Finally, reads from all cells of the same experimental group were grouped into one file. This makes it easier to download the files, to upload the files into IGV and visualize the reads (one track for experimental condition, instead of one track per cell), and to perform downstream analyses, as there is no need to decode the cell-to-experimental condition association for each dataset. The encoding of the cell-to-experimental condition is different between different datasets (and thus cannot be automated) and is detailed in the README file associated with each dataset.
The extended SAM files were converted to BAM files, sorted, and indexed by SAMTOOLS (10). As a result, after loading the files into IGV, when grouping the reads by the “xi” tag the reads from each cell are grouped together and marked by a different color (Fig. 1); all cells of the same experimental group are in one track.
Cd34 read-level data reveal isoform preference and cell-specific mutations. Visualization of the reads that are aligned to the mouse gene Cd34 from scRNA-seq reads of hematopoietic stem cells. Shown are the total number of reads that are aligned to each genomic position [(A), top, grey bars. Numbers on the left state the scale of read numbers.], the reads that are aligned to the region [(B), middle, colored lines], and the known transcripts from the National Center for Biotechnology Information Reference Sequence Database (RefSeq, https://www.ncbi.nlm.nih.gov/refseq/) of the gene [(C), bottom, blue schematics]. Exon numbers are marked. Reads from different cells are separated by a gray line and by different colors. The thin blue line connects the junction spanning reads. The black box marks the reads from one cell that is mutated. The genomic region displayed is chr1:194,958,200–194,961,500. Data are from GSE61533 (19). Read visualization was done by IGV (www.broadinstitute.org/igv/).
Cd34 read-level data reveal isoform preference and cell-specific mutations. Visualization of the reads that are aligned to the mouse gene Cd34 from scRNA-seq reads of hematopoietic stem cells. Shown are the total number of reads that are aligned to each genomic position [(A), top, grey bars. Numbers on the left state the scale of read numbers.], the reads that are aligned to the region [(B), middle, colored lines], and the known transcripts from the National Center for Biotechnology Information Reference Sequence Database (RefSeq, https://www.ncbi.nlm.nih.gov/refseq/) of the gene [(C), bottom, blue schematics]. Exon numbers are marked. Reads from different cells are separated by a gray line and by different colors. The thin blue line connects the junction spanning reads. The black box marks the reads from one cell that is mutated. The genomic region displayed is chr1:194,958,200–194,961,500. Data are from GSE61533 (19). Read visualization was done by IGV (www.broadinstitute.org/igv/).
Of note, the repository includes all reads and cells that the dataset creators deemed good enough to deposit. We chose not to apply any filtering, and make all reads and cells easily accessible for visualization, reuse, and meta-analysis. That choice was motivated by the fact that common filtering practices in the field may limit the usefulness of the data. For example, removing cells with low read numbers may prevent identification of cell states with a lower transcriptional level. Similarly, removing mitochondrial reads will prevent the study of the effect of mitochondrial variation on cell characteristics. Thus, no filtering steps were taken, and no quality control was applied for reads or cells. Users who wish to do any further filtering at their discretion can do it using the formatted files, bypassing the need for manual preprocessing.
Dataset repository
The web portal at http://jinglebells.bgu.ac.il contains scRNA-seq datasets in the easy-to-visualize and analyze extended BAM format described above. The information on each dataset includes the repository in which it is stored (e.g., GEO), its identification number in that repository, publication details, the number of profiled cells, and experimental conditions. There are two download options. The first is aimed at users interested in the data of one experimental condition only. In that case, the user clicks on links for the BAM and BAM index files (BAI) associated with a specific experimental condition, and downloads only those files. In the second option, meant for users who are interested in complete datasets, the user clicks the link for the dataset directory, which opens an interface by which the user can download the BAM file(s) and BAI file(s) associated with all or some experimental conditions, a “README” file describing the dataset, and a sample script for uploading the BAM files to IGV. The site enables searching the dataset information by key words, and sorting the datasets by several criteria. JingleBells also includes a tutorial that details how to visualize the data at the single-cell level in IGV, and a step-by-step visual tutorial that shows the visualization process.
The data files are stored on a Google Drive with a public status, so they can be downloaded without requiring a log in. As the files are big (currently the size of the largest dataset, GSE48968, is 1.3TB), and likely to increase, using FTP or HTML from the servers at our institute was too slow. Google Drive provides much faster download times. As users may be interested in only some of the experimental groups, and the BAM format itself is compressed, there is no benefit in zipping experiments.
Discussion
In this study we present the JingleBells repository that stores immune-related scRNA-seq datasets in a uniform extended-BAM format that includes read-to-cell association and cell-to-experimental condition association. The uniform format of the datasets enables visualization of the reads and reuse of the data to perform various analyses without the need to locate and download the datasets themselves, align the reads to the genome, decipher the association between each read and the cell from which it comes, and assign the cells to experimental conditions.
Visualizing the reads of single cells allows the identification of novel phenomena, including cell-specific splicing variant usage, mutation load, or allele preference, which can lead to specific testable hypotheses. For example, Cd34 is an eight-exon gene that has two isoforms with an alternative 3′ splice site of the last exon. Cd34 encodes a cell surface marker expressed on hematopoietic stem and progenitor cells. Fig. 1 displays reads from 17 individual mouse hematopoietic stem and progenitor cells from GSE61533 (19) that are aligned to Cd34. Visualization of the reads by individual cells reveals details that are not reflected at the gene expression level. First, cells seem to prefer expressing one of the two isoforms. Only one of the 17 cells shown in Fig. 1 expresses both isoforms (eighth from the top, light green reads). Of the 48 cells in this experiment that have reads that span the junction between exon 7 and 8, 36 cells use the short last exon exclusively, eight use the long exon exclusively, and only four use both isoforms. Thus, 36 cells have significant preference for expressing one of the isoforms (hypergeometric distribution, false discovery rate = 5%). As scRNA-seq protocols may involve steps that might cause an apparent preference where none exists, such findings should be validated by other experimental methods. However, this preference is qualitatively similar in the other six datasets of hematopoietic stem cells in JingleBells. Although looking !at the aggregated reads will show that both isoforms are expressed, it will not show that cells have a preference. In addition to the isoform preference, it can be seen that one cell has a cell specific mutation in exon 5 (bottom cell, pink reads, mutations appear in black). In addition, this same cell and two other cells that are not shown have reads aligned to the intron between exons 6 and 7, indicating a rare cell-specific tendency to retain this intron.
One can hypothesize how the expression of one isoform affects the cell function and whether a specific isoform is likely to have an upstream cause (e.g., the presence of a specific splicing factor) or a downstream effect on the expression level or splicing of other genes. Such hypotheses can be tested on all cells in this and other relevant scRNA-seq datasets, which JingleBells stores in an easy-to-use format.
Comparison of the raw reads between the datasets in JingleBells (or any scRNA-seq datasets) should be done carefully, as there are many different protocols for producing, normalizing, and quantifying scRNA-seq data. Those protocols result in differences in coverage of the distinct parts of the transcript, range of expression values, noise level, and more. Thus, the information that can be extracted from each dataset may be different. For example, datasets produced by protocols that only sequence the 3′ end of the transcripts will not be useful for differential promoter usage analysis. We have noted the differences in the part of the transcript covered in the README file of each dataset, but made no attempt to correct for technical differences between the datasets. To the best of our knowledge, there is currently no way to correct for such differences in the read-level data. We hope that this repository will facilitate attempts to compare and combine scRNA-seq datasets by reducing the technical effort required to bring all datasets to the same format. End-users are referred to files associated with each dataset and the original publications for further details concerning dataset specific issues.
To maintain up-to-date data and facilitate its reuse by the immunology community, we will repeat the search for immune related scRNA-seq datasets in the public repositories (see Datasets in the 1Materials and Methods section) every month over the next 2 years, and add new datasets to the repository. In addition, users can submit entries for newly published scRNA-seq datasets in the format detailed above, or request new datasets to be added. We hope that our format, or an equivalent, will become a standard requirement for depositing new scRNA-seq datasets. Importantly, although currently limited to immune-related datasets, JingleBells can provide a platform for any type of scRNA-seq data, and we will add other datasets based on interest and resources.
The added value of our repository over the public repositories of RNA-sequencing data are that the data are ready for visualization and analysis. Users do not need to find where the data are deposited or know which read comes from which cell or which cell comes from which experimental condition. In addition, as a collection of all publicly available scRNA-seq datasets, this resource listing 186 scRNA-seq is almost four times bigger than other scRNA-seq datasets lists (http://single-cell.clst.riken.jp/).
Conclusions
JingleBells is a valuable source for immunologists and computational biologists alike. Immunologists can download relevant datasets, upload them to IGV and study the reads aligned to their gene of interest, comparing cells and experimental conditions. Computational biologists can download datasets for reanalysis or meta-analysis, without spending time searching datasets, downloading them, mapping the reads, deciphering the read-to-cell association and the cell-to-experimental condition association. This dramatic reduction of the amount of effort needed to reuse scRNA-seq datasets will increase the usefulness of scRNA-seq datasets to the non-computational immunology community, and facilitate comparisons of scRNA-seq protocols and datasets and their meta-analysis by computational biologists.
Acknowledgements
We thank Yehuda Baruch for help in creating the website and Dr. Roi Gazit for helpful comments on the manuscript. The choice of read colors in JingleBells is done using the function “distinguishable_colors.m” by Timothy E. Holy.
Footnotes
JingleBells is freely available on the web at http://jinglebells.bgu.ac.il/. The data files are stored on Google Drive (https://drive.google.com/open?id=0BxSFjdiDhUI1amNoSks0SmpMdE0), but do not require a Google log in.
This work was supported by Israel Science Foundation Grant 500/15 and by Broad-Israel Science Foundation Grant 1644/15.
References
Disclosures
The authors have no financial conflicts of interest.