Abstract
Sepsis develops after a dysregulated host inflammatory response to a systemic infection. Identification of sepsis biomarkers has been challenging because of the multifactorial causes of disease susceptibility and progression. Public transcriptomic data are a valuable resource for mechanistic discoveries and cross-study concordance of heterogeneous diseases. Nonetheless, the approach requires structured methodologies and effective visualization tools for meaningful data interpretation. Currently, no such database exists for sepsis or systemic inflammatory diseases in humans. Hence we curated SysInflam HuDB (http://sepsis.gxbsidra.org/dm3/geneBrowser/list), a unique collection of human blood transcriptomic datasets associated with systemic inflammatory responses to sepsis. The transcriptome collection and the associated clinical metadata are integrated onto a user-friendly and Web-based interface that allows the simultaneous exploration, visualization, and interpretation of multiple datasets stemming from different study designs. To date, the collection encompasses 62 datasets and 5719 individual profiles. Concordance of gene expression changes with the associated literature was assessed, and additional analyses are presented to showcase database utility. Combined with custom data visualization at the group and individual levels, SysInflam HuDB facilitates the identification of specific human blood gene signatures in response to infection (e.g., patients with sepsis versus healthy control subjects) and the delineation of major genetic drivers associated with inflammation onset and progression under various conditions.
Visual Abstract
Introduction
Sepsis and systemic inflammatory response syndrome (SIRS) are defined by systemic dysregulated host responses; the latter differs by having a noninfectious origin. These diseases are among the leading causes of morbidity and mortality, especially in pediatric and neonatal intensive care units. Sepsis is a serious clinical condition characterized by sequential organ dysfunction, after a dysregulated host response to a systemic infection (1). The relative contributions of clinical, genetic, and environmental factors toward sepsis susceptibility and outcomes remain unclear (2).
Current sepsis management relies on the prompt recognition and subsequent administration of broad-spectrum antibiotics, fluids, and vasopressors in case of life-threatening hypotension (shock). More than 19 million patients are affected by sepsis each year, resulting in around 5 million sepsis-related deaths, which occur predominately in low- and middle-income countries (3). Overall, the global sepsis mortality rate is between 25 and 30%, which increases to between 40 and 50% when shock occurs (4, 5).
By comparing transcriptomes between healthy and disease states, we can identify differentially expressed genes (DEGs) and the associated biological processes to evaluate potential gene biomarkers of specific phenotypes (6). Advances in high-throughput transcriptomic platforms have driven a huge increase in the number of publicly available transcriptomic datasets in repositories, such as the NCBI Gene Expression Omnibus (GEO) (7, 8) and ArrayExpress (9, 10). These public resources represent an opportunity for mechanistic discoveries and confirmation of complex disease signatures across different studies. Different analytical approaches, such as GEO2R (7), ScanGEO (11), ImaGEO (12), BioJupies (13), and PulmonDB (14), have been used for large-scale omics investigations of human diseases. Nonetheless, heterogeneity between platforms and experimental conditions, together with the limited availability of clinical information, restricts effective exploitation of these resources.
Having access to a curated collection of specific and relevant datasets would help to circumvent these hurdles and facilitate comparative and exploratory analyses. Although microarray remains the most abundant omics data platform in public repositories, the amount of information it can generate, in terms of the dynamic range and coverage, is not comparable with next-generation sequencing (NGS). With advances in NGS and its growing affordability, public RNA sequencing (RNA-seq) datasets are becoming increasingly available, and it becomes necessary to be able to compare diverse microarray and RNA-seq datasets conveniently within the same analytical platform.
We constructed a (to our knowledge) new human blood gene expression database, SysInflam HuDB, for sepsis and systemic inflammatory diseases, which is accessible via the Web-based Gene Expression Browser (GXB) (15). The platform has been a useful tool for (to our knowledge) novel gene/function discoveries (Refs. 16–18 and M. Garand et al., manuscript posted on bioRxiv, DOI: 10.1101/490565, and J. Roelands et al., manuscript posted on bioRxiv, DOI: 10.1101/529446) and in system reanalysis approaches (Refs. 19–21 and D. Rinchai et al., manuscript posted on bioRxiv, DOI: 10.1101/527812). Given the heterogeneous clinical presentation of sepsis, explorations for and confirmation of potential gene biomarker(s) from various independent studies may lead to important mechanistic insights into the pathogenesis of the disease. In creating SysInflam HuDB, we aimed to provide a unique translational resource for researchers and clinicians and to facilitate risk prediction and disease interception of sepsis and systemic inflammatory diseases in the future.
Materials and Methods
Data selection
The GEO dataset selection was based on whether the dataset involved either sepsis or systemic inflammation, and if the pathology studied was implicated directly or indirectly. To ensure a broad search approach, we conducted two independent search strategies using NCBI GEO, with 163 and 28 datasets returned, respectively (Fig. 1A): search strategy 1, (“sepsis”[MeSH Terms] OR (“sepsis”[MeSH Terms] OR “sepsis”[All Fields]) OR “Systemic Inflammatory Response Syndrome” [MeSH Terms] OR “Systemic Inflammatory Response Syndrome”[All Fields]) AND “Homo sapiens”[porgn] AND “gse”[Filter] AND (“Expression profiling by array”[Filter] OR “Expression profiling by high throughput sequencing”[Filter]); and search strategy 2, (systemic inflammation [All Fields] AND “Homo sapiens”[porgn] AND “gse”[Filter]).
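The two search strategies can also be expressed programmatically. The following is a minimal sketch, assuming the queries are submitted to the NCBI E-utilities ESearch endpoint against the GEO DataSets database (db=gds); the text does not state how the searches were run, so the helper name esearch_url and the retmax value are illustrative assumptions, and the number of hits returned today will differ from the 163 and 28 reported here.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint; "gds" is the GEO DataSets database.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search strategy 1, transcribed from the text.
STRATEGY_1 = (
    '("sepsis"[MeSH Terms] OR ("sepsis"[MeSH Terms] OR "sepsis"[All Fields]) '
    'OR "Systemic Inflammatory Response Syndrome"[MeSH Terms] '
    'OR "Systemic Inflammatory Response Syndrome"[All Fields]) '
    'AND "Homo sapiens"[porgn] AND "gse"[Filter] '
    'AND ("Expression profiling by array"[Filter] '
    'OR "Expression profiling by high throughput sequencing"[Filter])'
)

# Search strategy 2, transcribed from the text.
STRATEGY_2 = (
    'systemic inflammation[All Fields] AND "Homo sapiens"[porgn] AND "gse"[Filter]'
)


def esearch_url(term, retmax=500):
    """Build (but do not send) an ESearch request URL for a GEO query.

    retmax is an illustrative assumption, not a value from the paper.
    """
    return ESEARCH + "?" + urlencode({"db": "gds", "term": term, "retmax": retmax})


if __name__ == "__main__":
    for name, term in (("strategy 1", STRATEGY_1), ("strategy 2", STRATEGY_2)):
        print(name, esearch_url(term))
```

Keeping the query strings as constants makes it straightforward to rerun the same searches at each database update and to compare the returned series against the curated list.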
Dataset curation workflow and collection characteristics. (A) The original GEO queries resulted in 191 results. These results were manually curated using the specified inclusion and exclusion criteria. As time and resources allowed, technical issues were resolved and newly identified series were added; hence the total number of datasets in SysInflam HuDB is increasing over time. (B and C) The frequencies of sample type, age group, study design (B), and disease type (C) deposited on SysInflam HuDB. The collection includes 10 types of sample, three age groups, two types of experimental design, and 10 major disease groups. Extensive details about the datasets can be found and explored online at: https://gxb-sidra.github.io/SysInflam-HuDB/.
The combined results were manually filtered to restrict datasets to human samples, expression profiling by any microarrays or NGS, and relevance to sepsis and systemic inflammatory diseases (Fig. 1A). The inclusion criteria were as follows: (1) the dataset must be original; (2) the samples must be unique, and raw data available; (3) sepsis or systemic inflammatory diseases must be included; (4) whole blood or purified leukocyte populations or primary immune cells must be profiled; (5) healthy subjects must be included to serve as controls; and (6) datasets must pass normalization and quality checks for sample quality. The exclusion criteria were as follows: (1) studies containing a small number of genes (e.g., qPCR), (2) variant analysis studies, and (3) studies analyzing only a single gene or pathway. In addition, some studies relevant to sepsis-related diseases (e.g., systemic inflammation or bacterial infection) or investigating other tissues were also included because they were deemed valuable to (1) discovering a putative novel gene–disease association, (2) improving our knowledge of inflammation, and/or (3) increasing our knowledge about factors that affect sepsis biology. Most of the metadata were obtained from GEO, but additional information (e.g., clinical annotations, experimental study design) was manually retrieved from the original articles. To further facilitate interpretation and assessment, we manually added the 2021 ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification) codes pertaining to the disease study; the level of details provided on GEO determined the hierarchy level of the code we indicated; i.e., more general disease classification was used when the precise disease definition was unclear or not mentioned along with the deposited data or associated publication. 
Fifty-two datasets were initially curated, and the two subsequent database updates (see later) bring the total number of datasets to 62 to date (Fig. 1A).
The database/collection is updated biannually to include newly deposited GEO datasets that fit our inclusion criteria. Changes and updates to the database are recorded on the platform landing page (http://sepsis.gxbsidra.org/dm3/landing.gsp) under the “Announcements” icon. Furthermore, a GitHub page has been configured to map the content of SysInflam HuDB (https://gxb-sidra.github.io/SysInflam-HuDB/), which is synchronized with the database releases and updates. In a near-future release, we plan to add datasets from mouse model organisms to SysInflam.
Web site implementation of SysInflam HuDB platform
SysInflam HuDB is a Web service for transcriptome dataset hosting and interactive data analysis that is accessible on all major operating systems and Web browsers. It can be deployed on any cloud-based solution and is currently hosted on Microsoft Azure (Fig. 2). The platform construction was previously described in detail (15), and the associated source code and R scripts are available on GitHub: https://github.com/BenaroyaResearch/gxbrowser and https://github.com/BenaroyaResearch/gxrscripts.
Organization of SysInflam HuDB. Microsoft Azure (integrated server-database model accessible to admin only) (top left panel) hosts the collection. SysInflam HuDB was created using different software dependencies and libraries (Components, top right panel). In brief, the public transcriptomics datasets were downloaded, parsed, and stored in a MySQL database (1, 2). After this, microarray probes were remapped to establish a uniform gene annotation (3), and a controlled vocabulary was used for clinical and biological annotations for each dataset. Sample groups were created based on the original hypothesis in the associated publications (4), and rank lists were generated based on manually selecting specific samples as control versus test conditions (5). Finally, the data were marked for appropriate normalization and then subjected to a quality check before visualization and interpretation (6, 7).
The compendium construction process was performed as previously described (22). In brief, the selected datasets were downloaded from GEO in a SOFT/Series matrix file format for microarray or as raw/normalized data inputs for RNA-seq. For RNA-seq datasets, the downloaded data were either directly uploaded or, if necessary, normalized. Tables were created in the MySQL database to describe each feature and to connect the information across experiments, samples, measurements, platforms, genes, and annotated information. Microarray probes were mapped to official gene symbols. If the number of probes in the dataset matched the gene annotations added to the MySQL database, all the probes/genes were imported. In case of a mismatch, the unmapped probes were dropped. A controlled vocabulary table was developed that organized the terms to annotate the datasets in a hierarchical structure (Supplemental Table I). Classes for the main categories and the associated terms were defined, with some of them as mandatory features (i.e., species, sample type, disease, and platform). Various disease types were also included in the controlled vocabulary because they were used in either the original experiments or the associated published studies.
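The probe-mapping step can be illustrated with a minimal sketch. It assumes an in-memory dictionary of probe annotations rather than the MySQL tables used by the platform; the function name map_probes and the probe identifiers are hypothetical.

```python
def map_probes(expression, probe_to_symbol):
    """Attach gene symbols to probes and drop probes without an annotation.

    expression: dict mapping probe ID -> list of expression values per sample.
    probe_to_symbol: dict mapping probe ID -> official gene symbol.
    Returns (mapped, dropped): mapped is a dict probe ID -> (symbol, values);
    dropped lists the probe IDs that had no symbol.
    """
    mapped = {}
    dropped = []
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:
            mapped[probe] = (symbol, values)
        else:
            dropped.append(probe)
    return mapped, dropped


if __name__ == "__main__":
    # Hypothetical probe IDs and values, for illustration only.
    expression = {"probe_1": [7.1, 8.3], "probe_2": [5.0, 5.2]}
    annotation = {"probe_1": "ACSL1"}
    mapped, dropped = map_probes(expression, annotation)
    print(mapped)
    print(dropped)
```

Returning the dropped probes alongside the mapped ones makes a mismatch between the dataset and the annotation table visible instead of silently discarding data.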
The group comparisons were performed using linear models for microarray data (limma) (23) to test for statistically significant differences between the groups at a false discovery rate (FDR) < 0.05. The genes were then ranked based on fold changes of the specified default two-group comparison (i.e., disease versus healthy). Color hue denotes percent of probes significantly upregulated or downregulated (red and blue, respectively).
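To make the FDR filtering and ranking concrete, the following is a minimal sketch in Python. It is not the limma implementation used by the platform: it takes per-gene p-values as given (limma derives them from moderated t-statistics), applies the Benjamini-Hochberg adjustment, and ranks the surviving genes by absolute log2 fold change. Ranking by absolute value, and the function names, are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_genes(results, fdr_cutoff=0.05):
    """Filter by FDR, then rank by absolute log2 fold change (largest first).

    results: list of (gene, log2_fold_change, p_value) tuples.
    Returns a list of (gene, log2_fold_change, adjusted_p) tuples.
    """
    adjusted = benjamini_hochberg([r[2] for r in results])
    kept = [
        (gene, lfc, q)
        for (gene, lfc, _), q in zip(results, adjusted)
        if q < fdr_cutoff
    ]
    return sorted(kept, key=lambda r: abs(r[1]), reverse=True)


if __name__ == "__main__":
    # Toy values, for illustration only.
    toy = [("A", 2.0, 0.005), ("B", -3.0, 0.01), ("C", 0.5, 0.9)]
    for row in rank_genes(toy):
        print(row)
```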
Visualization and graphics features
SysInflam HuDB contains different data visualization features to enhance interpretation, with a particular focus on experimental quality and reliability. The platform allows for customizable data plots with overlapping metadata information, changeable sample order, as well as generation of sharable mini-URLs. Genes across one or more samples can be represented in bar plot and boxplot displaying either signal values or fold changes compared with a control group. Data quality control across samples can be assessed using these diagnostic plots. In addition, data displays are integrated with different normalization and processing types. The following display options are available: Raw Signal, Log2 Transformed, Background Subtracted, Average Normalized, Quantile Normalized, Fold Change, and Log2 Fold Change. A detailed description on the additional GXB features can be found in Huang et al. (22).
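Two of the display options listed above, Quantile Normalized and Log2 Fold Change, can be sketched as follows. This is an illustrative sketch of the standard definitions, not the code used by GXB; ties are not averaged, and the fold change is computed from group means, both of which are simplifying assumptions.

```python
import math


def quantile_normalize(columns):
    """Quantile-normalize samples so that they share one value distribution.

    columns: list of equal-length lists, one list of gene values per sample.
    Each value is replaced by the mean, across samples, of the values that
    hold the same rank. Ties are broken by position (a simplification).
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_genes)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


def log2_fold_change(test_values, control_values):
    """Log2 ratio of the test group mean to the control group mean."""
    test_mean = sum(test_values) / len(test_values)
    control_mean = sum(control_values) / len(control_values)
    return math.log2(test_mean / control_mean)


if __name__ == "__main__":
    # Toy values, for illustration only.
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
    print(log2_fold_change([8, 8], [2, 2]))
```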
We provide a tutorial available from the main page of our GXB platform (Supplemental Fig. 1A). In addition, we have created screen recordings that are publicly accessible via YouTube (https://youtu.be/T51-vPbQ-c4 and https://youtu.be/alOQlSeDqfM). Text overlays were added to the same videos, which were uploaded as supplemental multimedia files. To retrieve a specific subset of datasets, we recommend that users use the keyword search box located at the top of the dataset browser page. We have harmonized the dataset titles to represent the disease groups that are compared within each dataset. This harmonization ensures that condition- or disease-specific queries capture all relevant datasets (see Limitations and advantages in the Discussion section).
Differential gene expression and Gene Ontology analysis
For preprocessing, a count per million threshold (values > 10) was applied to remove all lowly expressed genes in each dataset. In addition, only genes having expression in the minimal number of biological replicates were kept. Samples with total reads <2 SDs from the dataset means were considered outliers and removed. For accession number GSE63311, differential gene expression analysis was then performed between Sepsis T0 (n = 36) and Controls T0 (n = 11) using the following criteria: absolute log2 fold change > 1.0 and FDR < 0.05 in DESeq2 (24). For accession numbers GSE33341, GSE30119, and GSE25504, the DEGs were determined by Student t tests with the same thresholds as described earlier. Ensembl gene IDs of the DEGs were then converted to gene symbols and queried for Homo sapiens Gene Ontology (GO) terms associated with biological processes using ClusterProfiler (25). The minimal gene set size was set at 3, and FDR < 0.05 was set for statistical significance.
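The preprocessing and DEG thresholds described above can be sketched as a set of small filters. This is an illustrative sketch, not the DESeq2-based pipeline itself. Two points are assumptions: the outlier rule is read as "total reads more than 2 SDs below the dataset mean," and the minimal number of replicates (min_samples) is not given in the text, so the default used here is hypothetical.

```python
import statistics


def cpm(counts, library_size):
    """Convert raw counts for one sample to counts per million."""
    return [c * 1e6 / library_size for c in counts]


def keep_gene(cpm_values, cpm_cutoff=10.0, min_samples=3):
    """Keep a gene if enough samples exceed the CPM cutoff.

    min_samples stands in for the 'minimal number of biological
    replicates'; its value is hypothetical.
    """
    return sum(v > cpm_cutoff for v in cpm_values) >= min_samples


def low_depth_outliers(library_sizes, n_sd=2.0):
    """Flag samples whose total reads fall more than n_sd SDs below the mean.

    Assumed reading of the outlier rule stated in the text.
    """
    mean = statistics.mean(library_sizes)
    sd = statistics.stdev(library_sizes)
    return [size < mean - n_sd * sd for size in library_sizes]


def is_deg(log2_fc, fdr, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Apply the stated thresholds: |log2 fold change| > 1.0 and FDR < 0.05."""
    return abs(log2_fc) > lfc_cutoff and fdr < fdr_cutoff


if __name__ == "__main__":
    # Toy library sizes, for illustration only.
    sizes = [100, 100, 100, 100, 100, 100, 100, 100, 100, 10]
    print(low_depth_outliers(sizes))
    print(is_deg(-1.5, 0.01), is_deg(0.5, 0.01))
```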
Statistical analyses
Numerical data were processed and analyzed using R (26) unless stated otherwise. A Student t test was used to determine significant differences between two groups. Unless otherwise indicated, p < 0.05 was considered statistically significant.
Results
Data description
The SysInflam HuDB collection has a total of 62 GSE IDs (Fig. 1A) corresponding to 5719 human blood transcriptome profiles. This collection includes a wide range of studies, sample types (blood cells), and sepsis-associated diseases (Fig. 1B, 1C). Sepsis and septic shock make up 50 and 18% of the datasets, respectively; however, there are 33 types of diseases, exposures, or treatments represented in the collection; details are available on our GitHub page, which is synchronized with SysInflam HuDB (https://gxb-sidra.github.io/SysInflam-HuDB/). When multiple conditions were represented within a dataset, the focus (i.e., the default view) was set using the following priority order: sepsis, septic shock, SIRS, Staphylococcus aureus infection, bacterial infection, and LPS exposure. Nine sample types are represented; the major types are whole blood (66%), followed by PBMCs (11%), neutrophils (8%), and monocytes (5%). In total, three RNA-seq and 12 different microarray platforms are included, with most of the microarrays being Affymetrix Human Array chips (various versions). In terms of the study types, 79% of the datasets were generated from ex vivo studies and 21% from in vitro studies. The sample size ranged from 6 to 531, with most studies having >100 samples. Adults constituted the main age group (73%), followed by pediatrics (19%), neonates (6%), and mixed cohort (2%).
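The priority rule for the default view can be written as a simple ordered lookup. This is a minimal sketch; the function name default_focus and the exact label strings are assumptions, and the platform's controlled vocabulary may use different terms.

```python
# Priority order for the default view, as listed in the text.
PRIORITY = [
    "sepsis",
    "septic shock",
    "SIRS",
    "Staphylococcus aureus infection",
    "bacterial infection",
    "LPS exposure",
]


def default_focus(conditions):
    """Return the highest-priority condition present in a dataset, or None."""
    present = set(conditions)
    for condition in PRIORITY:
        if condition in present:
            return condition
    return None


if __name__ == "__main__":
    print(default_focus(["LPS exposure", "SIRS"]))
```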
Presentation of datasets
Instructions on how to access the Sidra GXB landing page and the available tutorials are presented in Supplemental Fig. 1A. Each dataset is associated with a GEO accession number, and the expression value and annotation files are available for direct download from SysInflam HuDB. The total number of datasets shown will be 69 instead of 62, because five datasets have multiple platforms: accession numbers GSE72829 (n = 3), GSE25504 (n = 2), GSE16129 (n = 3), GSE6269 (n = 2), and GSE13015 (n = 2). The SysInflam HuDB interface allows visualization of gene expression trends and fold changes between groups (Fig. 2). Interactive plots can be created onsite and, importantly, with overlaid clinical/phenotypic information for in-depth data explorations. The dataset-specific fold changes for the indicated comparisons are shown below the dataset titles. In addition, the interface has a cross-project view function that allows single gene expression comparisons across multiple datasets.
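The difference between 62 series and 69 displayed entries follows from counting each additional platform as one extra entry. A short check of that arithmetic, using the platform counts given above (variable names are illustrative):

```python
# Number of platforms per multi-platform series, as listed in the text.
multi_platform = {
    "GSE72829": 3,
    "GSE25504": 2,
    "GSE16129": 3,
    "GSE6269": 2,
    "GSE13015": 2,
}

n_series = 62
# Each series already counts once; every extra platform adds one entry.
extra_entries = sum(n - 1 for n in multi_platform.values())
n_displayed = n_series + extra_entries

print(extra_entries, n_displayed)  # 7 69
```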
To demonstrate the utility of the cross-project function, we assessed the fold changes of ACSL1 expression among the 15 datasets that met the default fold change threshold ≥ 2 (Supplemental Fig. 1B). We have recently described a putative role for ACSL1 in sepsis pathogenesis (16). We observed an increase in ACSL1 transcript abundance greater than 2-fold in studies comparing LPS, septic infection, and severe septic state with unstimulated, uninfected, and nonsevere septic state, respectively.
The GXB interface features have been previously described in detail (22). However, information can also be found on the SysInflam HuDB interface itself, by browsing through the “Gene,” “Study,” “Sample,” and “Downloads” tabs and via the “Rank Lists” and “Group Set” drop-down menus. Additional annotation features can also be used to query specific preloaded gene sets (found under “Rank Lists” and “Gene List Category”); here, users can select from either Pathways or Diseases gene sets [sourced from the Kyoto Encyclopedia of Genes and Genomes database (27)].
Dataset concordance
For each dataset, we assessed the concordance of the expression values reported in SysInflam HuDB by comparing the fold changes or trends for select referenced genes with those reported in the available associated publications (see details online at: https://gxb-sidra.github.io/SysInflam-HuDB/). We defined “strong concordance” as the fold change in SysInflam HuDB being within 1 SD of the published fold change or, when not available, within 1 decimal of the reported value. A concordance based on a “trend” was defined as “good.” Some discrepancies occur between the fold change values presented in the original study and those in SysInflam HuDB and are likely due to study-specific data postprocessing and the specific samples selected for analyses. Nonetheless, as the same data processing strategies and algorithms were applied, gene expression values and fold changes can be compared across the collection on SysInflam HuDB.
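The concordance definitions can be sketched as a small classifier. Several points are assumptions made for illustration: "within 1 decimal" is read as an absolute difference of at most 0.1, a "trend" is read as both fold changes lying on the same side of 1 (no change) on a linear scale, and the label "discordant" for the remaining cases does not appear in the text.

```python
def concordance(db_fc, published_fc, published_sd=None):
    """Classify agreement between a database and a published fold change.

    db_fc, published_fc: linear-scale fold changes (1.0 means no change).
    published_sd: SD of the published fold change, if reported.
    """
    difference = abs(db_fc - published_fc)
    if published_sd is not None:
        strong = difference <= published_sd
    else:
        # Assumed reading of "within 1 decimal of the reported value".
        strong = difference <= 0.1
    if strong:
        return "strong"
    # Same direction of change relative to 1.0 counts as a shared trend.
    if (db_fc - 1.0) * (published_fc - 1.0) > 0:
        return "good"
    return "discordant"  # hypothetical label, not used in the text


if __name__ == "__main__":
    # Toy values, for illustration only.
    print(concordance(2.05, 2.0))
    print(concordance(3.0, 2.0))
    print(concordance(0.5, 2.0))
```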
Utility of the resource
With SysInflam, the user can quickly visualize and compare the expression of genes among different cellular matrices, cohorts, or other filters of interest. For example, we investigated the expression levels of two recently reviewed putative sepsis biomarkers, ACSL1 (16) and NUDT16 (M. Garand et al., manuscript posted on bioRxiv, DOI: 10.1101/490565), in datasets from neutrophils (accession number GSE49755) and whole blood (GSE30119) (Fig. 3). In GSE49755, polymorphonuclear neutrophils were isolated from two healthy adult donors and exposed to plasma samples obtained from six patients with culture-confirmed sepsis (n = 12) and from six uninfected control subjects (n = 12). In GSE30119, total RNA was extracted from whole blood from pediatric patients with acute community-acquired Staphylococcus aureus infection (n = 99) and uninfected controls (n = 44). Regardless of the matrix type or cohort, we found that the two putative markers were expressed at significantly higher levels (p < 0.01) in the disease condition compared with healthy control subjects, providing added support for the putative role of the two genes as acute infection biomarkers.
Gene expression and Venn analysis of DEGs among different tissue types. ACSL1 and NUDT16 gene expression data from two publicly available datasets that used neutrophils (GSE49755) and whole blood (GSE30119). Control/uninfected, healthy individuals; medium, vehicle; septic/Staphylococcus aureus, individuals with infection. Additional definitions can be found in the original study description: for GSE49755 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49755) and for GSE30119 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse30119). Boxplots show means ± SD. The asterisk denotes *p < 0.01 obtained from a two-tailed t test considering unequal variance. For GSE49755, the size of each group was six uninfected, six with sepsis, two medium, and two LPS. For GSE30119, the size of each group was 99 with Staphylococcus aureus infection and 44 control subjects.
SysInflam can be used to investigate important genes in relation to clinical data annotations (e.g., using overlays, filtering, grouping, and ranking). As a case study, we performed DEG analysis (see Materials and Methods) on the RNA-seq dataset GSE63311, which compared 37 patients with sepsis with 11 control subjects (nonurgent surgical patients) (Supplemental Fig. 2A). We identified the significant clusters of the enriched GO terms (201 enriched GO terms with 146 having an FDR < 0.05) (Supplemental Fig. 2B) and mapped the representative DEGs with their fold changes to each GO term (Supplemental Fig. 2C). We were able to explore the important genes in each GO cluster, which can then be investigated at the individual gene level with available clinical data in SysInflam. Using ANXA3, CD177, and IL2R genes as examples, the results showed marked heterogeneity among gene expression values, blood culture results, and clinical diagnosis (Supplemental Fig. 3).
Insights about the age-dependent host defense are needed and can be extrapolated from transcriptomic datasets. To demonstrate the use of the database in this context, we compared whole blood transcriptome responses to Staphylococcus aureus infection in adult (GSE33341), pediatric (GSE30119), and neonate (GSE25504) cohorts against their respective controls (Supplemental Fig. 4A, 4B). We observed that the adult cohort had the greatest number of DEGs, and that only two DEGs were shared by all cohorts. Interestingly, there were more shared DEGs between the pediatric and neonate cohorts compared with all other combinations. Surprisingly, the number of enriched GOs was inversely related to the number of DEGs. Again, there were more similarly affected biological processes between the two nonadult cohorts compared with all other combinations. However, neutrophil activation/degranulation/mediated immunity were unanimously among the top GOs affected in all age groups.
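The overlap analysis across the three age cohorts amounts to computing the regions of a three-set Venn diagram. A minimal sketch follows; the function name and the gene identifiers in the usage example are placeholders, not results from the datasets.

```python
def venn3(adult, pediatric, neonate):
    """Return the seven regions of a three-set Venn diagram of DEG lists."""
    a, p, n = set(adult), set(pediatric), set(neonate)
    return {
        "all_three": a & p & n,
        "adult_pediatric_only": (a & p) - n,
        "adult_neonate_only": (a & n) - p,
        "pediatric_neonate_only": (p & n) - a,
        "adult_only": a - p - n,
        "pediatric_only": p - a - n,
        "neonate_only": n - a - p,
    }


if __name__ == "__main__":
    # Placeholder gene IDs, for illustration only.
    regions = venn3(
        adult={"g1", "g2", "g3", "g4"},
        pediatric={"g1", "g5", "g6"},
        neonate={"g1", "g5", "g6", "g7"},
    )
    for name, genes in regions.items():
        print(name, sorted(genes))
```

The same function applies unchanged to lists of enriched GO terms, which is how the overlap of affected biological processes between cohorts could be compared.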
Discussion
Currently, no databases are specialized for sepsis and/or SIRS. Although other Web interface-based resources are available to reanalyze NCBI GEO data (28), those existing tools are limited by the need to manually annotate group comparisons and the lack of multistudy comparisons at the gene level. By contrast, SysInflam HuDB is a disease-focused collection and enables simultaneous data exploration and visualization across studies that have been manually curated and annotated. In addition, the associated metadata can be overlaid on top of gene expression data, both of which can be downloaded directly from the database in Excel format for further analysis. We aim to help advance the development of a readily translatable biomarker/predictor of sepsis. Thus, for the initial release of the database, we focused on human blood cells, because whole blood is easily accessible in clinical, research, and even at-home settings and requires minimal ex vivo manipulation. For these reasons, blood biomarkers have a wide application potential and usability. RNA-seq analysis of specifically isolated cell subsets obtained after the acute phase (i.e., more representative of the immunosuppressive phase of sepsis) provided insights into the mechanism of immunosuppression mediated not only by T cells but also by monocytes (29). Hence there is comparative value in including purified cell populations alongside whole blood studies in our database. To demonstrate the potential usages of the dataset collection, we showcased analyses performed at multiple levels; however, these examples were not meant to support novel findings.
The abilities to quickly visualize gene expression, rank genes according to different criteria, overlay metadata, and look at specific subsets of genes (e.g., pathways, diseases) are key strengths of SysInflam HuDB. Study-specific protocols and/or the cause of sepsis might have also contributed to the observed variance and amplitude of the responses and would need to be considered for translational research. To showcase the overlay feature, we explored the RNA-seq dataset (GSE63311) with overlaid blood culture results, because a positive blood culture is essential for the diagnosis of sepsis but often not enough to accurately prognosticate sepsis outcome (30). The heterogeneity of the culture results against clinical diagnosis was clearly visible. These observations support the need for novel biomarker discovery through transcriptomic signature mining to improve on blood culture assessment (31). SysInflam HuDB can be used to confirm and interpret analyses of individual datasets. SysInflam was not envisioned as an exhaustive transcriptomic resource; however, it can be used in conjunction with others, such as ARCHS4 (32), to enhance the depth of reductionist investigations performed on a gene or gene sets.
SysInflam HuDB provides an organized structure and filtering options that facilitate comparisons and biological extrapolations between diverse cohort types that may be of clinical interest. For example, Staphylococcus aureus is among the most common Gram-positive infections and a major contributor to late-onset sepsis in neonates (33, 34). Better mechanistic insights about host defense are needed and can be extrapolated from transcriptomic datasets. To demonstrate the use of the database in this context, we used the keyword search function to identify datasets about whole blood transcriptome responses to Staphylococcus aureus infection in adult, pediatric, and neonate cohorts against their respective controls. We observed that the number of enriched GOs was inversely related to the number of DEGs, indicating that the changes in gene expression are more biologically “focused” in the nonadult cohorts. We also noted that immunity mediated by neutrophil activation/degranulation was the top common enriched GO in all age groups, which highlights the important role of innate immunity in the context of sepsis pathogenesis.
To ensure the best user experience, it is important to clarify the current limitations and highlight the specific advantages of the platform. Each dataset has inherited and acquired limitations that may require the user to refer to the data-associated publication; conveniently, links to PubMed and GEO entries can be found under the “Study” tab on SysInflam HuDB’s interface. Here, we enumerate five limitations to be considered by the users.
First, we focused the collection on human blood transcriptomics. One of the purposes of the collection is to help advance the development of a readily translatable biomarker/predictor of sepsis. We considered whole blood as a readily accessible tissue and requiring the least amount of ex vivo manipulation; therefore, it represents an attractive target for biomarker development. The deployment of a diagnostic/predictive test that can be complementary to routine blood collection is advantageous.
Second, because the collection is composed of varied cell types, measure of purity, study designs, clinical definitions, and methods, the user may find greater heterogeneity for the expression of specific genes across the datasets. These inherited limitations are due to the data or information that are intrinsic to the dataset deposited. However, to help navigate these differences, the filter menu allows the selection of a specific sample source, platform, and disease. In addition, the user can use the keyword search tool to compile a specific dataset (see the fifth limitation). Nonetheless, the heterogeneity is ideal. The works of Khatri and Sweeney (35, 36) from Stanford provide a strong conceptual support to the use of heterogeneous transcriptomic dataset collection. In a previous benchmarking report (35), three published sepsis gene classifiers were tested on 39 public human gene expression microarray datasets. They observed that biomarker discovery performed on a homogeneous cohort and using the same methods provides greater statistical power but lacks in generalizability. Hence the confidence of the robustness of a given finding/signature can be enhanced when employing collections of independent datasets, as imperfect as they might be. Thus, our approach not only can help to recapitulate previously published work but also to verify gene expression stability across experiments. Although public data are growing fast, it is still not often used or easily accepted as a validation tool of new diagnostics that “may reflect the difficulty and knowledge curve that some researchers face in accessing and using these data” (36). To this point, we believe SysInflam HuDB is a resource that contributes to addressing this gap and can be used to drive decision making and support new hypotheses for studying molecular and cellular mechanisms in experimental laboratories.
Third, despite our efforts, the database is not an exhaustive compilation of all available human blood transcriptomic datasets on sepsis (see Data selection in Materials and Methods). Single-cell RNA-seq analysis can help delineate cell-specific phenotypic states at high resolution and assess their relative abundance between disease groups. For example, a recent differential gene expression analysis between cell populations identified putatively useful clinical markers to discriminate critically ill patients with or without sepsis (37). We currently do not have the capacity to add data from single-cell RNA-seq platforms; however, we recognize their value and aim to support them in future releases of SysInflam HuDB.
Fourth, several analytical approaches have been used in the datasets/studies that compose our collection. The results presented in SysInflam HuDB have been generated using one uniform pipeline/algorithm for all datasets, which explains the small differences and/or discrepancies in gene expression fold changes compared with the original reports. On the other hand, this uniform analysis can help assess concordance across different types of experiments, compare study results directly, and highlight the implications of using different control groups. In this context, a high level of concordance is an indirect measure of reproducibility across studies. Another advantage is the ability to compare results from samples collected at different time points among studies sharing a similar experimental design. Indeed, the timing of sampling can have an important impact on the findings from transcriptomes, which, to some extent, are known to correlate with the functionality of the immune cells. The acute phase, up to 48 h, is typically characterized by hyperinflammation, whereas the later phase involves a continuous range of changes toward immunosuppression. As in any other repository, we are limited by the clinical information uploaded by the authors. To assist the users, we have assembled the dataset metadata so that it can be browsed/sorted in a single table (available on the GitHub page). However, specific temporal details (or any other details of interest) that are not tabulated will need to be identified by the user. Once a suitable set of datasets is identified, comparisons can be made within SysInflam HuDB in parallel windows and/or by downloading the expression values of specific gene markers from the SysInflam HuDB interface.
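As an illustration only, the kind of cross-dataset comparison described above can be sketched in a few lines of Python applied to expression values downloaded for a single gene. This is not the pipeline used by SysInflam HuDB: the dataset names, the expression values, and the helper functions are hypothetical, the values are assumed to be already log2-scaled, and "concordance" is taken here simply as the fraction of datasets whose fold change has the majority direction.

```python
from statistics import mean


def log2_fold_change(case, control):
    """Difference of group means; assumes values are already log2-scaled."""
    return mean(case) - mean(control)


def direction_concordance(fold_changes):
    """Fraction of datasets whose fold-change sign matches the majority sign.

    Fold changes equal to zero are ignored.
    """
    signs = [1 if fc > 0 else -1 for fc in fold_changes if fc != 0]
    if not signs:
        return 0.0
    up = signs.count(1)
    return max(up, len(signs) - up) / len(signs)


# Hypothetical log2 expression values for one gene in three datasets.
datasets = {
    "dataset_A": {"sepsis": [9.1, 9.4, 8.8], "control": [7.0, 7.2, 6.9]},
    "dataset_B": {"sepsis": [8.0, 8.5], "control": [7.5, 7.0]},
    "dataset_C": {"sepsis": [6.0, 6.2], "control": [6.5, 6.9]},
}

# Per-dataset fold change (sepsis versus control), then overall agreement.
fold_changes = {
    name: log2_fold_change(groups["sepsis"], groups["control"])
    for name, groups in datasets.items()
}
concordance = direction_concordance(list(fold_changes.values()))

for name, fc in fold_changes.items():
    print(f"{name}: log2FC = {fc:.2f}")
print(f"direction concordance = {concordance:.2f}")
```

A sign-based summary of this kind ignores effect size and sample size; a formal meta-analysis would weight each dataset accordingly.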
Fifth, the user interface has limited flexibility. The “Disease” filter function (left-hand side menu on the dataset browser interface) assigns a single tag to each combination of diseases, rather than creating multiple tags per dataset. As an efficient work-around, we recommend that the user use the keyword search bar at the top of the page. We harmonized the dataset titles to represent the disease groups in each dataset, allowing for intuitive keyword-based queries. We aim to enhance the filter functions in future releases.
In summary, to our knowledge, SysInflam HuDB is the first resource for transcriptomic analyses focused on sepsis and systemic inflammatory diseases, and the first to combine diverse microarray and RNA-seq datasets. We hope that SysInflam HuDB will help the scientific community to explore gene expression profiles in sepsis and other systemic inflammatory conditions across technologies and platforms, as well as to identify interesting expression trends across different diseases. Ultimately, new hypotheses can be generated and relationships among clinical or experimental variables can be investigated. Going forward, we hope to grow SysInflam HuDB by including more sepsis and systemic inflammatory datasets as they become available.
Data availability
The datasets reported in this study are all available in the public domain through NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/, accessed on July 30, 2021); accession numbers are mentioned in the text. Our in-house–developed GXB application, SysInflam HuDB, is publicly available at http://sepsis.gxbsidra.org/dm3/geneBrowser/list.
Acknowledgements
We thank all the investigators who made their datasets publicly available by depositing them into the NCBI GEO repository. We thank Dr. Alison Russell for valuable opinions and guidance regarding clinical definitions and translational inquiries. We thank Sukanya Dhansingh (Senior Software Engineer, Mindtree Ltd) for valuable guidance and comments regarding software engineering, and Mohammedhusen Khatib (Database Engineer, Sidra Medicine) for valuable comments regarding database architecture. We also thank Dr. Patrick Tang (Division Chief of Pathology, Sidra Medicine) for critical reading of the manuscript.
Footnotes
This work was supported by the Qatar Foundation and the Qatar National Research Fund Grant NPRP10-0205-170348 awarded to D.C.
M.T., D.C., and M.G. conceptualized the study. M.T., M.A., and S.B. performed software development and implementation. M.T. and M.G. performed data curation. M.T., S.S.Y.H., and M.G. performed analyses. M.T., S.S.Y.H., and M.G. wrote the manuscript, and D.R., L.R.S., and D.C. commented on it. All authors have read and approved the final manuscript.
The online version of this article contains supplemental material.
References
Disclosures
The authors have no financial conflicts of interest.