
Assessing concordance between RNA-Seq and NanoString technologies in Ebola-infected nonhuman primates using machine learning

Abstract

This study evaluates the concordance between RNA sequencing (RNA-Seq) and NanoString technologies for gene expression analysis in non-human primates (NHPs) infected with Ebola virus (EBOV). A detailed comparison of both platforms revealed a strong correlation, with Spearman coefficients for 56 out of 62 samples ranging from 0.78 to 0.88. The mean and median coefficients were 0.83 and 0.85, respectively. Bland-Altman analysis confirmed high consistency across most measurements, with values falling within the 95% limits of agreement. Using a machine learning approach with the Supervised Magnitude-Altitude Scoring (SMAS) method trained on NanoString data, OAS1 was identified as a key gene signature for distinguishing RT-qPCR positive from negative samples. Remarkably, when used as the sole predictor in a logistic regression model, OAS1 maintained its predictive power on RNA-Seq data from the same cohort of EBOV-infected NHPs, achieving 100% accuracy in distinguishing infected from non-infected samples. OAS1 was also tested in a completely independent held-out test set, consisting of human monocyte-derived dendritic cells (DC) isolated and infected with different strains of the Ebola virus: wild-type (wt), VP35m, VP24m, along with a double mutant VP35m & VP24m, and again demonstrated a 100% accuracy rate in differentiating EBOV-infected from mock-infected samples, confirming its effectiveness as a predictive marker across diverse experimental setups and virus strains. Further differential expression analysis across both platforms identified 12 common genes (including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL) that showed the highest levels of statistical significance and biological relevance. Gene Ontology (GO) analysis confirmed the involvement of these genes in key immune and viral infection pathways, highlighting their importance in EBOV infection. 
RNA-Seq uniquely identified genes such as CASP5, USP18, and DDX60, which are important in immune regulation and antiviral defense and were not detected by NanoString, demonstrating the broader detection capabilities of RNA-Seq. This study indicates a very strong agreement between RNA-Seq and NanoString platforms in gene expression analysis, with RNA-Seq displaying broader capabilities in identifying gene signatures.


Introduction

RNA sequencing (RNA-Seq) has transformed our capacity to understand gene expression, enabling simultaneous quantification and discovery of transcripts [1]. It is essential across various genomic applications including differential gene expression, alternative splicing, and eQTL mapping, among others [1]. For example, Bosworth et al. [2] employed RNA-Seq in A549 cells to identify key genes involved in the Ebola virus-host interaction, revealing significant gene expression changes. Liu et al. [3] used RNA sequencing to distinguish between fatal cases and survivors of Ebola by analyzing transcriptome data from peripheral blood, identifying upregulation of interferon signaling and liver pathology indicators such as albumin and fibrinogen.

On the other hand, NanoString technology is a precise digital quantification method without amplification [4]. In our recent application of NanoString, we employed the Supervised Magnitude-Altitude Scoring (SMAS) methodology, a machine learning-based approach, to analyze gene expression in non-human primates (NHPs) infected with Ebola virus (EBOV) [5]. This technique identified key genes for Ebola infection, such as IFI6 and IFI27, which provided perfect predictive performance with 100% accuracy in distinguishing stages of Ebola infection. Additional genes like MX1, OAS1, and ISG15 were significantly upregulated, playing key roles in the immune response to EBOV. These results demonstrate NanoString’s capability to offer precise insights into gene expression during viral infections, supporting the development of diagnostic and therapeutic strategies.

Building upon these insights, a question arises: How well do RNA-Seq and NanoString technologies concur in their assessments of gene expression within the context of viral infections? The distinct methodologies, RNA-Seq, with its expansive and detailed transcriptome profiling, and NanoString, known for its precise, targeted quantification, offer potentially complementary views of the same biological phenomena. Evaluating the concordance between RNA-Seq and NanoString technologies is important, particularly as they vary widely in accessibility due to cost and resource requirements. By determining the extent of agreement in gene expression assessments by RNA-Seq and NanoString during Ebola virus infections, we can depend on the more accessible technology in scenarios where the other may be impractical.

A few studies have evaluated the comparative performance of RNA-Seq and NanoString technologies in gene expression profiling across diverse biological contexts. Song et al. [6] focused on formalin-fixed paraffin-embedded (FFPE) samples of triple-negative breast cancer and gastric cancer tissues, comparing RNA-Seq using RNAaccess and NanoString. They demonstrated RNAaccess’s superior sequencing quality and higher concordance in gene expression between FFPE and fresh-frozen samples. They highlighted RNAaccess’s broad transcriptomic coverage, making it ideal for biomarker discovery, especially when sample integrity is compromised. Additionally, Song et al. [6] identified lower RNA input and DV200 thresholds than previously recommended, increasing the inclusivity of clinical FFPE samples. Despite these strengths, NanoString remained a robust alternative for targeted gene expression studies due to its amplification-free quantification [6].

Zhang et al. [7] conducted a comparison using 46 cancer cell lines across different cancer types to evaluate RNA-Seq and NanoString for gene and isoform expression quantification. Their findings indicated higher agreement at the gene level (median Spearman correlation 0.68–0.82) compared to isoform-level expression (median 0.55–0.63). Notably, RNA-Seq exhibited greater consistency with RT-qPCR validation in isoform detection, attributed to its broader transcriptome coverage. However, according to their study, NanoString maintained reliable performance in detecting lowly expressed genes without amplification bias [7]. Speranza et al. [8] compared RNA-Seq and NanoString nCounter platforms to profile host immune responses in Ebola virus-infected cynomolgus macaques. Correlation between platforms was assessed using Pearson’s correlation (cor() function in R) [9]. Principal component analysis (PCA) was used to evaluate the clustering ability of a 41-gene biomarker set, with clustering strength quantified by calculating Euclidean distances from cluster centers and comparing these to random gene sets via permutation testing. However, reliance on Pearson’s correlation without accounting for non-normality may introduce bias when comparing data from the two platforms.

In another study, our team compared RNA-Seq and NanoString for gene expression profiling in 3D airway organ tissue equivalents (OTEs) infected with Influenza A virus (IAV), human metapneumovirus (MPV), and parainfluenza virus 3 (PIV3) [10]. This analysis encompassed 19,671 genes via RNA-Seq and 773 immune-related genes via NanoString, focusing on 754 shared genes across 16 infection conditions. Spearman [11] and Distance [12] correlation analyses were used to assess agreement, while Bland-Altman [13] plots visualized biases. Generalized Linear Models (GLMs) [14] and Huber regression [15] accounted for outliers, and the Magnitude-Altitude Score (MAS) [16] algorithm ranked genes by statistical and biological relevance. Functional validation via Gene Ontology (GO) analysis [17] revealed strong platform concordance (correlations ranging from 0.86 to 0.90) and over 96.6% of measurements within Bland-Altman agreement limits. RNA-Seq and NanoString both highlighted antiviral response genes, such as ISG15, MX1, and RSAD2, with shared activation of “type I interferon signaling” pathways. However, RNA-Seq exhibited reduced sensitivity in detecting significant genes for MPV compared to NanoString, emphasizing differences in detection thresholds [10].

The current study builds upon the methodologies from our previous study on OTEs [10] by transitioning from in vitro OTEs to whole blood samples and expanding the machine learning analyses. We also address deficiencies in earlier studies [6,7,8], where the full potential of advanced statistical and machine learning techniques was not explored in their analyses. In this study, we use whole blood samples from NHPs infected with Ebola (GSE103825) [18], and identify and validate EBOV gene signatures using NanoString and RNA-Seq data from EBOV-infected NHPs as training and validation sets, respectively. We then assess the performance of these gene signatures on an independent held-out test set, which includes RNA-Seq data from human dendritic cells infected with various strains of the Ebola virus (GSE96590) [19]. To perform a comprehensive comparison between RNA-Seq and NanoString data, we apply a suite of statistical and machine learning methodologies as follows:

  1.

    Correlation analysis: We conduct a Spearman correlation analysis [11] to compare gene expression profiles across RNA-Seq and NanoString platforms, using a common set of 584 genes from 62 samples collected over the progression of Ebola virus disease (EVD) in NHPs. Unlike Speranza et al. [8], we choose Spearman over Pearson correlation because the negative binomial distribution of RNA-Seq data is incompatible with Pearson correlation assumptions. Moreover, we introduce the Reverse Gene Removal (RGR) method to examine whether a low Spearman correlation for specific samples is due to technical artifacts or true differences between the platforms in measuring gene expression.

  2.

    Bland-Altman analysis: We evaluate the agreement between the two platforms using Bland-Altman analysis [13], focusing exclusively on the common set of genes, to examine any systematic biases in gene expression measurements.

  3.

    Cross-modal gene signature: The concordance between RNA-Seq and NanoString platforms is further evaluated using machine learning methods. Initially, we apply the SMAS method [5] to NanoString data to identify key gene signatures for EBOV that are capable of differentiating between infected and uninfected samples within NanoString. Following this, we assess the concordance of the two platforms by applying logistic regression on RNA-Seq data, using the identified key gene signatures as predictors to assess whether the power of differentiating between uninfected and infected samples using the NanoString-identified gene signatures is consistent within RNA-Seq. This methodological approach allows us to explore the predictive reliability of the gene signatures across both technologies.

  4.

    Comparative differential expression analysis: We compare the differentially expressed genes between both platforms in two distinct ways:

    a.

      We first focus only on the genes common to the two platforms, identifying BH-significant genes and ranking them based on GLMQL-MAS (for RNA-Seq) [20,21,22] and MAS (for NanoString) [16]. When the top-ranked gene signatures from RNA-Seq align with those identified via NanoString, this supports relying solely on RNA-Seq when NanoString is not available. Additionally, we conduct GO analysis [17] for the top common genes to further explore their biological significance and validate our findings.

    b.

      We then examine all the genes from RNA-Seq using GLMQL-MAS [20,21,22] to identify gene signatures that appear in RNA-Seq but are not present in NanoString. This analysis highlights the broader transcriptomic coverage of RNA-Seq, which can capture gene expression changes that NanoString misses because of its more limited gene set.

  5.

    Assessing the efficacy of gene signatures on a held-out dataset: Finally, to demonstrate the efficacy of the gene signatures originally identified via NanoString and validated with RNA-Seq from NHPs infected with Ebola (GSE103825) [18], we use these signatures as predictors in a logistic regression model applied to a completely independent held-out test set from a publicly available dataset (GSE96590) [19], which includes RNA-Seq data from human dendritic cells infected with various strains of the Ebola virus. This evaluation underscores the robustness and generalizability of the identified gene signatures across different sample types and viral strains.

Materials and methods

Data

For correlation analysis, Bland-Altman analysis, cross-modal gene signature, and comparative differential expression analysis, where our main objective is to assess the concordance between RNA-seq and NanoString data, we use data from EBOV-infected NHPs (GSE103825) [18], referred to as training-validation data (see Sect. 2.1.1 for details). To evaluate the effectiveness of gene signatures identified during the training-validation stage, we use RNA-Seq data from human dendritic cells infected with various strains of Ebola (GSE96590) [19], referred to as the held-out dataset (see Sect. 2.1.2 for details).

Training-validation data

For concordance analysis and identifying gene signatures related to Ebola, this study used RNA-Seq and NanoString gene expression data derived from NHPs infected with the EBOV, as provided and detailed by Speranza et al. [18]. In their study, Speranza et al. [18] focused on enhancing the accuracy of NHP models to more closely reflect human EVD. Their study detailed the exposure of 12 cynomolgus macaques to the EBOV/Makona strain via intranasal routes using a target dose of 100 plaque-forming units (PFU) [18]. Administration methods varied between using a pipette or a mucosal atomization device, leading to diverse symptom onset and disease progression, culminating in four distinct response groups (see Fig. 1) with an overall fatality rate of 83% [18].

Fig. 1

Timeline of RT-qPCR results for NHPs exposed to EBOV. This figure shows a schedule of test results over various days post-exposure. Green cells indicate non-detectable viral RNA, suggesting negative results, while pink cells denote RT-qPCR positive results with corresponding Genome Equivalent (GE). The numbers displayed in the cells for positive cases represent the GE values. The box marked with an “X” indicates that although the RT-qPCR results are positive, the corresponding NanoString information is unavailable

These groups were categorized based on the timing and nature of symptom appearance, onset of viremia, and time to death, ranging from typical EVD courses to those not developing detectable viremia during the study period. Group 1 displayed typical symptoms of EVD with viremia measurable from day 6 and a mean survival time of 10.47 days. In Group 2, viremia was detectable later, between days 10 and 12, and the average survival extended to 13.31 days. Group 3 showed even further delayed signs of the disease, with viremia apparent after day 20 and an average survival time of 21.42 days. In contrast, Group 4 exhibited no detectable viremia throughout the study and survived until the end of the 41-day experiment [18].

In the current study, we focused exclusively on NHPs with both RNA-Seq and NanoString data available (see Fig. 1). For notation purposes, throughout this paper, “NHP-\(\:m\)\(\:n\)” refers to NHP \(\:\#m\) on day \(\:n\) post-infection. For the NanoString analysis, 769 specific NHP transcripts were targeted, offering rapid processing and lower RNA quality requirements than RNA-Seq, which makes it particularly effective for Ebola virus research [5, 18]. Standard normalization procedures were followed, involving background adjustments using negative controls and lane variation corrections with internal positive controls [18]. The most stable reference genes for these adjustments were identified using the NormFinder software package [18].

For RNA isolation from whole blood, samples were diluted with molecular biology-grade water and treated with TRIzol LS, followed by purification using the PureLink RNA Mini Kit (Thermo Fisher Scientific) and quality assessment on the Agilent 2200 TapeStation [18]. RNA-seq libraries were prepared using the TruSeq Stranded Total RNA Library Prep Kit (Illumina), with library quality assessed on the TapeStation and quantified via qPCR using the KAPA Complete (Universal) qPCR Kit (Kapa Biosystems) [18]. Sequencing was performed on the Illumina HiSeq 2500 in a paired end 2 \(\:\times\:\) 100-base pair, dual-index format [18]. Post-sequencing, the data were processed by trimming low-quality reads and filtering with the FASTX-Toolkit [18]. Alignment was conducted against the cynomolgus macaque genome using Bowtie2 and Tophat [18]. Throughout this study, we implemented Trimmed Mean of M-values (TMM) normalization [23] on RNA-Seq data for use in machine learning, correlation assessments, and various analyses, extending beyond differential expression analysis.

Held-out test data

To assess the efficacy of the gene signatures identified from the training-validation data described in Sect. 2.1.1, we used another publicly available dataset provided by Ilinykh et al. [19] (GSE96590). In their study, human monocyte-derived dendritic cells (DCs) were isolated and infected with wild-type (wt) Ebola virus, the mutants VP35m and VP24m, or the double mutant VP35m & VP24m. These infections were performed at a multiplicity of infection (MOI) of 2 plaque-forming units (PFU) per cell, that is, an average of two infectious units per cell. The infected cells were maintained in culture medium supplemented with 10% human serum and harvested [19].

RNA extraction from these cells was carried out using the TRIzol reagent, followed by additional purification steps involving NaCl and ethanol precipitation, phenol extraction, and a final ethanol wash [19]. The purified RNA was then used for library construction using the Illumina TruSeq RNA Sample Preparation Kit v2, which involved poly(A) mRNA isolation, fragmentation, cDNA synthesis, adapter ligation, and amplification [19]. The quality of the RNA-seq libraries was verified on an Agilent 2100 bioanalyzer before proceeding with high-throughput sequencing. Sequencing was conducted on an Illumina HiSeq 1000 system, using a 50-base paired-end protocol [19].

Correlation analysis

Spearman correlation

In this section, we conducted a Spearman correlation analysis [11] to compare the gene expression profiles obtained from RNA-Seq and NanoString technologies. The datasets included a common set of 584 genes for 62 samples extracted from 12 NHPs over the progression of EVD as detailed in Fig. 1. For the correlation analysis, the first step involved aligning the RNA-Seq and NanoString data by ensuring that the 584 genes were identically ordered in both datasets for all samples. This alignment allowed for accurate pairwise comparisons across the corresponding gene expression vectors from the two platforms.

Each sample was analyzed individually. For each sample (one NHP on one day post-infection), we constructed two 584-dimensional vectors, one from its RNA-Seq data and one from its NanoString data. We then calculated the Spearman correlation coefficient [11] for each pair of corresponding vectors. This non-parametric measure was selected because it evaluates the monotonic relationship between the ranked variables, providing insights into the consistency of gene expression patterns observed by the two methods regardless of the absolute expression levels, and it is compatible with the nature of RNA-Seq data distribution.
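
The per-sample comparison can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the study: the function names and the toy expression values are hypothetical, and ties are assumed to be handled with average ranks.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy values for one sample; genes are in the same order on both platforms.
rnaseq = [5.0, 120.0, 33.0, 880.0, 12.0]
nanostring = [14.0, 300.0, 90.0, 2500.0, 20.0]
rho = spearman(rnaseq, nanostring)
print(f"Spearman rho for this sample: {rho:.3f}")
```

Because only ranks enter the calculation, the different measurement scales of the two platforms do not affect the coefficient, provided the gene order is identical in both vectors.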

Reverse gene removal (RGR)

To examine the lower Spearman correlation coefficients observed in certain samples, we introduced a reverse gene removal (RGR) approach aimed at identifying whether these lower correlations were driven by specific gene expression patterns or potential technical artifacts. RGR began by selecting the sample of interest and extracting the gene expression data from two platforms focusing on the set of common genes. The initial Spearman correlation coefficient was computed using the full set of common genes to establish a baseline for the analysis.

To assess the influence of individual genes on the overall correlation, RGR iteratively removed one gene at a time and recalculated the Spearman correlation coefficient for the remaining set of genes. At each step, the gene whose removal led to the largest increase in correlation was excluded, and the process was repeated until only two genes remained. This iterative approach enabled the identification of genes that had the greatest negative impact on correlation. By monitoring the change in correlation during each iteration, we aimed to distinguish whether the observed low correlations were attributable to biological signal variations or to technical artifacts. In studies where the agreement between both platforms is strong for all but a few samples, the samples with low Spearman correlation may contain artifacts. The RGR method probes this possibility by investigating whether the disagreement is caused by a few biologically significant genes or whether many genes contribute to the observed low Spearman correlation. When many genes (e.g., more than half) must be removed to reach a Spearman correlation similar to the average of the other samples in the study (e.g., above 0.9), this may suggest the presence of technical artifacts rather than true inconsistencies between the platforms in measuring gene expression.
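
The greedy removal loop described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' code: the names `reverse_gene_removal` and `min_genes` are hypothetical, the gene labels and values are toy data, and expression values are assumed to vary enough that the correlation is always defined.

```python
def ranks(v):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman coefficient computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def reverse_gene_removal(genes, x, y, min_genes=2):
    """Greedy RGR: repeatedly drop the gene whose removal yields the highest rho.

    Returns the baseline rho and a list of (removed_gene, rho_after_removal).
    """
    genes, x, y = list(genes), list(x), list(y)
    baseline = spearman(x, y)
    trace = []
    while len(genes) > min_genes:
        best_i, best_rho = None, None
        for i in range(len(genes)):
            rho = spearman(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            if best_rho is None or rho > best_rho:
                best_i, best_rho = i, rho
        trace.append((genes[best_i], best_rho))
        del genes[best_i], x[best_i], y[best_i]
    return baseline, trace

# Toy sample: gene "F" is the only one on which the two platforms disagree.
genes = ["A", "B", "C", "D", "E", "F"]
rnaseq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
nanostring = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
baseline, trace = reverse_gene_removal(genes, rnaseq, nanostring)
print(baseline, trace[0])
```

In this toy case a single removal restores agreement, the pattern the text associates with a few genes driving the discrepancy. If many removals were needed before the correlation approached the study average, the text's interpretation would point toward a technical artifact instead.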

Bland-Altman analysis

In this section, the Bland-Altman analysis [13] was employed to assess the agreement between RNA-Seq and NanoString technologies. The mean and difference were calculated for each gene expression value across the two platforms for each sample. Specifically, for each paired gene expression value, the mean was determined as the average of the RNA-Seq and NanoString results, while the difference was calculated by subtracting the RNA-Seq value from the NanoString value.

Next, Bland-Altman plots were generated to visually inspect the agreement between the two platforms, plotting the differences against the means for each sample. The mean difference, indicating any systematic bias, and the limits of agreement (mean difference ± 1.96 times the standard deviation of the differences) were also calculated. These limits highlight the range where 95% of the differences are expected to fall. Finally, the proportion of data points lying within and outside these limits was evaluated to quantify the consistency between RNA-Seq and NanoString measurements across samples.
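
The quantities behind a Bland-Altman plot can be computed as in the following sketch. It assumes that both inputs are on a comparable (for example, log-transformed) scale and that the sample standard deviation (n − 1 denominator) is used. Neither detail is specified in the text, and the function name and toy values are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(nanostring, rnaseq):
    """Return bias, 95% limits of agreement, and the fraction of points within them."""
    # Difference is NanoString minus RNA-Seq, matching the convention in the text.
    diffs = [n - r for n, r in zip(nanostring, rnaseq)]
    means = [(n + r) / 2 for n, r in zip(nanostring, rnaseq)]
    bias = mean(diffs)                 # systematic offset between platforms
    sd = stdev(diffs)                  # sample SD of the differences (assumption)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {"means": means, "diffs": diffs, "bias": bias,
            "lower": lower, "upper": upper, "fraction_within": within}

# Hypothetical toy values for four genes in one sample.
result = bland_altman([3.0, 1.0, 5.0, 3.0], [2.0, 2.0, 4.0, 4.0])
print(result["bias"], result["lower"], result["upper"], result["fraction_within"])
```

Plotting `diffs` against `means` with horizontal lines at `bias`, `lower`, and `upper` gives the usual Bland-Altman figure for that sample.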

Concordance assessment through machine learning analysis

In this section, we aimed to assess the concordance between NanoString and RNA-Seq platforms by first using SMAS to identify key genes in NanoString data that differentiate RT-qPCR positive from negative samples with a logistic regression model, and then applying these genes as predictors in RNA-Seq to test their cross-platform utility. In machine learning terms, NanoString data serves as the training set for key gene identification, and RNA-Seq data is treated as the validation set. This approach allowed us to evaluate the efficacy of the selected genes in distinguishing positive from negative samples within the RNA-Seq dataset, testing whether the gene signatures identified in NanoString can be directly applied to RNA-Seq with comparable performance.

As shown in Fig. 1, there are a total of 12 RT-qPCR positive samples for which both NanoString and RNA-Seq data are available. To effectively measure the performance of the logistic regression model [24] using k-fold stratified cross-validation [25] with the selected genes as predictors, we need to ensure that we have a balanced dataset of positive and negative samples. Therefore, we use the samples from all 12 NHPs on day 0 of infection (DPI = 0) as the negative samples for key gene identification.

To identify the genes that are capable of separating positive from negative samples within NanoString and use them as predictors for logistic regression, we apply the SMAS [5]. For NanoString data, which often adheres to the assumptions required for parametric t-tests, the SMAS first employs two-sample independent t-tests [26] with Benjamini-Hochberg (BH) correction [27]. For genes that achieve significance according to the BH correction, SMAS ranks them based on their statistical significance and biological relevance using the Magnitude-Altitude Score (MAS). The MAS formula is defined as:

$$\text{MAS}_{\text{l}} = \left|\log_{2}\left(\text{FC}_{\text{l}}\right)\right|^{\text{M}} \left|\log_{10}\left(\text{p}_{\text{l}}^{\text{BH}}\right)\right|^{\text{A}},$$

where:

  • \(\:\text{l}\) indexes the genes, ranging from 1 to \(\:\text{s}\), where \(\:\text{s}\) is the number of rejected null hypotheses (i.e., \(\:{\text{p}}_{\text{l}}^{\text{B}\text{H}}<\:{\upalpha\:}=0.05\)),

  • \(\:{\text{p}}_{\text{l}}^{\text{B}\text{H}}\) is the BH-adjusted p-value for gene \(\:\text{l}\),

  • \(\:{\text{FC}}_{\text{l}}\) represents the fold change for gene \(\:\text{l}\),

  • M and A are hyperparameters set to 1 in this study, balancing the influence of the adjusted p-value and the log fold change.
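
The score defined above can be sketched in a few lines. The following is an illustration only: it takes log2 fold changes and raw p-values as given (the t-tests themselves are not shown), the gene labels and numbers are hypothetical, and `bh_adjust` and `mas_ranking` are names introduced here rather than functions from the SMAS software.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def mas_ranking(genes, log2_fc, pvalues, M=1.0, A=1.0, alpha=0.05):
    """Keep BH-significant genes and sort them by MAS = |log2 FC|^M * |log10 p_BH|^A."""
    p_bh = bh_adjust(pvalues)
    scored = [
        (g, abs(fc) ** M * abs(math.log10(p)) ** A)
        for g, fc, p in zip(genes, log2_fc, p_bh)
        if p < alpha                   # only rejected null hypotheses are scored
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical inputs: four genes with log2 fold changes and raw p-values.
genes = ["G1", "G2", "G3", "G4"]
log2_fc = [4.0, 1.0, 3.0, 0.5]
pvalues = [1e-6, 1e-8, 1e-3, 0.5]
ranking = mas_ranking(genes, log2_fc, pvalues)
print(ranking)
```

In this toy input G2 has the smallest p-value but a modest fold change, so G1 and G3 rank above it. This is the behaviour the score is meant to produce: neither significance nor effect size alone decides the order.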

Subsequently, SMAS uses these rankings to select the top gene as a predictor in logistic regression [24], implemented through k-fold stratified cross-validation (k = 6) [25]. This process helps validate the predictive power of the selected gene. Intuitively, within the context of a volcano plot, “Magnitude” refers to how far a point (gene) is from the vertical axis, which represents no change in expression. This measures the extent of gene regulation and is indicative of biological relevance, highlighting the impact of gene expression changes. “Altitude” measures the height of a point above the horizontal axis and reflects the statistical significance of these changes. This setup helps easily identify genes that are not only biologically important due to substantial changes in expression but also statistically significant, ensuring their reliability for further study.

Once we had identified the top MAS-selected genes within NanoString, we used them directly as predictors for logistic regression to evaluate how well the model separates positive from negative groups within RNA-Seq. Because the two technologies produce measurements on different scales, we apply k-fold stratified cross-validation within the RNA-Seq data so that the logistic regression parameters are refit on the RNA-Seq scale.
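
A single-predictor logistic regression evaluated with stratified k-fold cross-validation can be sketched as below. This is a self-contained stand-in, not the study's implementation: the model is fit by plain gradient descent on a standardized predictor, the learning rate and epoch count are arbitrary choices, and the expression values are hypothetical, well-separated toy data.

```python
import math
import random

def fit_logistic(x, y, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*z + b) by gradient descent, with z = standardized x."""
    mu = sum(x) / len(x)
    sd = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5 or 1.0
    z = [(v - mu) / sd for v in x]
    w = b = 0.0
    for _ in range(epochs):
        p = [1 / (1 + math.exp(-(w * zi + b))) for zi in z]
        gw = sum((pi - yi) * zi for pi, yi, zi in zip(p, y, z)) / len(z)
        gb = sum(pi - yi for pi, yi in zip(p, y)) / len(z)
        w, b = w - lr * gw, b - lr * gb
    # Predict class 1 when the fitted linear score is positive (probability > 0.5).
    return lambda v: int(w * (v - mu) / sd + b > 0)

def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds with the classes spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def cv_accuracy(x, y, k=6, seed=0):
    """Refit on each training split and score on the held-out fold."""
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held = set(test_idx)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit_logistic([x[i] for i in train], [y[i] for i in train])
        correct += sum(predict(x[i]) == y[i] for i in test_idx)
    return correct / len(y)

# Hypothetical expression of one gene: 12 negative (label 0) and 12 positive (label 1) samples.
x = [5.0 + 0.1 * i for i in range(12)] + [9.0 + 0.1 * i for i in range(12)]
y = [0] * 12 + [1] * 12
accuracy = cv_accuracy(x, y, k=6)
print(f"6-fold stratified CV accuracy: {accuracy:.2f}")
```

Refitting inside each fold is what allows the same gene to be used on both platforms: only the choice of predictor carries over from NanoString, while the coefficients are estimated on the scale of the data being evaluated.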

Since the viral load of NHPs on day 3 post-infection (DPI = 3) is also not detectable, we additionally use logistic regression with k-fold stratified cross-validation to test whether the MAS-selected genes can differentiate the DPI = 0 and DPI = 3 samples from positive samples within both NanoString and RNA-Seq. In this way, the selected genes were tested on new negative samples that were not part of the MAS gene identification process.

Note that in our previous study [5], where we analyzed 769 genes using NanoString, the MAS ranking system was validated: the top gene selected by our method differentiated positive from negative samples with 100% accuracy in a held-out test set, whereas the best performance by edgeR [28] or DESeq2 [29] was only 72%, using either p-value or logFC (see Table 2 in [5]).

Concordance assessment through differential expression analysis

In this section, we conducted a differential expression analysis on both the NanoString and RNA-Seq datasets, beginning with the 584 genes common to both. After multiple hypothesis testing, we ranked the significant genes by statistical and biological significance using the MAS ranking system. The analysis then extended to all RNA-Seq genes to identify broader significant patterns. Both datasets were contrasted against a control group consisting of samples from all 12 NHPs on day 0 of infection. The primary goal was to verify whether the top BH-significant genes, ranked by MAS, are consistent across the two platforms.

In our analysis of RNA-Seq data, we employed GLMQL-MAS to address the challenges of non-normal data distributions and overdispersion typical of these datasets. Traditional methods like the Student’s t-test often fail to effectively manage these complexities [30]. The GLMQL-MAS approach integrates the robustness of Generalized Linear Models (GLMs) [14] with the flexibility of quasi-likelihood estimations [31], making it highly effective in capturing the true biological variations among samples.

We used GLMs to model the relationship between gene expression and experimental conditions, focusing on changes from our day 0 baseline to subsequent time points where viral load is detectable in the 12 RT-qPCR–positive samples. This model enabled us to accurately quantify gene expression changes attributable to the infection.

After establishing the model, we conducted Quasi-Likelihood (QL) F-tests to assess the significance of the observed differences in gene expression between the baseline and infected samples. These tests, by adjusting for model dispersion, provide more reliable and robust statistical inferences than traditional methods. They are particularly advantageous in handling the inherent complexities of RNA-seq data.

Using these models and tests, we calculated metrics for each gene, e.g. LogFC and p-value. Additionally, the integration of MAS within this framework allowed for prioritizing genes based on both their statistical significance (BH adjusted p-value) and biological impact (LogFC). This dual focus ensures that the genes identified as significant are not only statistically validated but also biologically relevant, enhancing our ability to identify truly important biomarkers.

In this study, we conducted differential expression analysis using three distinct approaches to thoroughly examine gene expression changes:

  • First, within the NanoString data, we focused solely on the 584 genes common between the NanoString and RNA-Seq datasets and employed two-sample independent t-tests [26] with Benjamini-Hochberg (BH) correction [27], ranking genes via MAS.

  • Secondly, we performed a similar analysis within the RNA-Seq data, again limiting our focus to these 584 common genes, and applying GLMQL-MAS.

  • The third approach expanded the scope of our analysis within the RNA-Seq data to include the full set of available protein-coding genes, totaling 14,328, and applying GLMQL-MAS. This broader analysis allows us to capture a wider array of biological changes that may be specific to the RNA-Seq technology and not observable within the smaller common gene set.

Note that in the differential expression analysis of RNA-Seq data, the choice of gene set size, whether it is the 584 common genes or the available full set of 14,328 protein-coding genes, influences the outcomes of the analysis, particularly in terms of LogFC and adjusted p-values. This variance arises from differences in the statistical modeling and adjustment processes inherent in the Generalized Linear Models with Quasi-Likelihood F-tests (GLMQL) and the subsequent BH corrections for multiple testing:

  • When analyzing a smaller subset of genes (584 common genes), the statistical power of the model is focused but limited. This limitation can restrict the model’s ability to detect smaller yet potentially meaningful expression changes that might be more apparent in a larger dataset (14,328 genes). Conversely, a larger dataset provides a broader basis for the model, potentially capturing more subtle variations in gene expression.

  • GLMs fitted to a smaller subset of genes (584 common genes) might yield different parameter estimates, affecting LogFC calculations. These estimates can vary because a smaller number of genes may not provide a complete picture of the underlying biological system, possibly skewing the LogFC.

  • The inclusion of more genes in the GLMQL enhances the robustness of the quasi-likelihood estimation. With more genes, the model can better account for inter-gene variability and potential co-expression patterns, which might be obscured in smaller datasets. This modeling can lead to different significance levels and thus different p-values for the gene expression changes.

  • BH adjustments are affected by the number of tests (genes) analyzed. In a smaller dataset (584 genes), the correction for multiple comparisons is less stringent than in a larger dataset (14,328 genes). This difference means that for the same raw p-value, the adjusted p-value in a smaller dataset might be smaller (less penalized), suggesting higher statistical significance as compared to the same value in a larger dataset.
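The effect of the number of tests on the BH adjustment can be illustrated with a small sketch. It implements the standard Benjamini-Hochberg step-up adjustment in plain Python and applies it to a constructed case: one gene with a fixed raw p-value, tested alongside other genes that all have a raw p-value of 0.5. The helper names, the raw p-value of 1e-5, and the value 0.5 for the remaining genes are illustrative assumptions, not values from this study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjusted_for_gene(p_gene, n_tests, p_other=0.5):
    """Adjusted p-value of one gene tested with n_tests - 1 unremarkable genes.

    Assumption for illustration only: every other gene has raw p = p_other.
    """
    pvals = [p_gene] + [p_other] * (n_tests - 1)
    return bh_adjust(pvals)[0]


raw_p = 1e-5  # hypothetical raw p-value
adj_small = adjusted_for_gene(raw_p, 584)     # common-gene analysis
adj_large = adjusted_for_gene(raw_p, 14328)   # full protein-coding analysis
print(f"584 tests:    adjusted p = {adj_small:.4g}")
print(f"14,328 tests: adjusted p = {adj_large:.4g}")
```

In real data the adjusted value also depends on the gene's rank among all raw p-values, so the gap between the two analyses is usually smaller than in this constructed case.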

Assessing the efficacy of gene signatures on a held-out dataset

To test the efficacy of the gene signatures identified and validated in Sect. 2.4, we used these genes as predictors for logistic regression, with 1,000 bootstrap resamples, to assess their capability to differentiate mock-infected samples from samples infected with wild-type (wt), VP35m, VP24m, and double-mutant (VP35m&VP24m) EBOV on the held-out test set described in Sect. “Held-out test data”. Additionally, we used the RNA-Seq dataset from Ilinykh et al. [19] to assess any gene signatures identified within the RNA-Seq of EBOV-infected NHPs that were not available in the NanoString dataset, as detailed in Sect. “Assessing the efficacy of gene signatures on a held-out dataset”.

Note that during the training-validation phase, we intentionally focused on selecting only one gene as the predictor for a logistic regression model to minimize the risk of overfitting. Overfitting occurs when a model captures noise instead of the underlying pattern, typically when complex machine learning models with many parameters are used with too few samples [32]. In this context, logistic regression, a linear model, is chosen due to its simplicity and lower variance compared to more complex models. This approach is particularly important given our limited sample size, where ideally, there should be at least 10 samples per predictor to ensure robustness and effectiveness of the model. By using this methodology, we aimed to enhance the generalizability of the gene signature identified from the training-validation to the held-out test set.

Results

Correlation analysis

Figure 2(a-b) presents the results of the correlation analysis between RNA-Seq and NanoString gene expression data in NHPs infected with EBOV. The panel (a) shows the distribution of Spearman correlation coefficients across 62 samples, with a mean of 0.83, a standard deviation of 0.06, and a median of 0.85. The majority of the samples, 56, exhibit strong correlation coefficients between 0.78 and 0.88, indicating a high level of agreement between the two platforms, while a few samples fall below 0.78. The panel (b) displays the time course of Spearman correlation coefficients for each NHP in four distinct groups over the progression of EVD, measured by Days Post Infection (DPI).
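As a minimal sketch of the per-sample computation, the Spearman coefficient can be obtained as the Pearson correlation of the rank vectors of the two platforms over the shared genes. The function names and the toy expression values below are hypothetical; the software actually used for this step is not stated in this section.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one sample, five shared genes, two platforms.
rnaseq = {"sample_A": [12.0, 250.0, 3.0, 88.0, 40.0]}
nanostring = {"sample_A": [90.0, 1500.0, 20.0, 300.0, 410.0]}

per_sample_rho = {s: spearman(rnaseq[s], nanostring[s]) for s in rnaseq}
print(per_sample_rho)
```

Because only ranks enter the calculation, the coefficient is unaffected by the different measurement scales of the two platforms.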

Fig. 2

Histogram and time-course analysis of Spearman correlation coefficients between RNA-Seq and NanoString gene expression data. The panel (a) shows the distribution of Spearman correlation coefficients across 62 samples, with a mean of 0.83, a median of 0.85, and most values concentrated between 0.8 and 0.9. The panel (b) illustrates the Spearman correlation coefficients over Days Post Infection (DPI) for four distinct groups of NHPs, with each line representing an individual NHP. (c) Gradual increase in Spearman correlation coefficient for sample NHP-5-3 as genes are removed using the Reverse Gene Removal (RGR) method

Unlike other samples, it turns out that NHP-5-3 has a very low Spearman correlation. To determine whether this occurred due to specific genes or was the result of a technical artifact, we applied the RGR method as described in Sect. 2.2. Figure 2(c) shows the gradual increase in Spearman correlation after removing genes using the RGR method. Notably, it required the removal of 376 genes out of 548 to reach a Spearman correlation of 0.9.

Bland-Altman analysis

Figure 3 illustrates the agreement between RNA-Seq and NanoString platforms based on Bland-Altman analysis. Panel (a) shows Bland-Altman plots for NHP-1 and NHP-2 at DPI 0, highlighting the differences between the platforms relative to the mean of their measurements. The plots display the limits of agreement (LOA), with the majority of measurements falling within the 95% confidence limits, indicating strong consistency between the platforms.
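The quantities shown in a Bland-Altman plot can be sketched as follows, using the conventional definition in which the limits of agreement are the mean difference plus or minus 1.96 standard deviations of the differences. The direction of the difference (NanoString minus RNA-Seq), the function name, and the toy values are assumptions made for illustration.

```python
from statistics import mean, stdev


def bland_altman(platform_a, platform_b):
    """Return bias, 95% limits of agreement, and the share of points inside them."""
    diffs = [a - b for a, b in zip(platform_a, platform_b)]
    means = [(a + b) / 2 for a, b in zip(platform_a, platform_b)]  # x-axis of the plot
    bias = mean(diffs)                       # mean difference between platforms
    sd = stdev(diffs)                        # sample standard deviation of differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs)
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "pct_within": 100.0 * inside / len(diffs),
    }


# Hypothetical expression values for six genes in one sample.
nanostring = [110.0, 205.0, 290.0, 420.0, 500.0, 640.0]
rnaseq = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]

result = bland_altman(nanostring, rnaseq)
print(f"bias = {result['bias']:.2f}, "
      f"LOA = [{result['lower']:.2f}, {result['upper']:.2f}], "
      f"within = {result['pct_within']:.1f}%")
```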

Fig. 3

Bland-Altman analysis comparing RNA-Seq and NanoString platforms. (a) Bland-Altman plots for NHP-1 and NHP-2 at DPI 0, showing the mean differences and limits of agreement between the platforms. For NHP-1 at DPI 0, 97.95% of measurements fall within the 95% confidence limits (mean difference: 132.28), while for NHP-2 at DPI 0, 98.46% of measurements are within the limits (mean difference: 128.08). (b) The percentage of measurements within the 95% confidence limits ranges from 97.26% to 99.66% across all samples, with notable differences for NHP-4 at DPI 14 (mean difference: 527.83) and NHP-9 at DPI 21 (mean difference: 896.51), and smaller differences for NHP-2 at DPI 6 (mean difference: -31.86). “NHP-m-n” refers to NHP #m on day n post-infection

Panel (b) summarizes the percentage of measurements within the limits of agreement across all samples. The vast majority of gene expression measurements consistently fall within the 95% confidence limits for most samples, reflecting the overall robustness of the two platforms. While some samples exhibit larger differences, such as those at later time points, others show smaller differences, yet the data remain predominantly within the limits of agreement, underscoring the platforms’ reliability in capturing gene expression trends.

Overall, the Bland-Altman analysis confirms the robustness of these platforms in measuring gene expression with minimal bias, making them effective tools for this study. Even for minimal systematic bias, it is important to understand whether the bias correlates with the range of measurements across platforms. To investigate this, we compared gene expression values from RNA-Seq and NanoString by calculating the ratio of NanoString values to RNA-Seq values for each gene, ensuring that only non-zero values were considered. For each sample, we computed the average of these ratios across all genes and then took the mean across all samples to estimate the overall average ratio between the two platforms. The overall average ratio was 6.42, indicating a general trend where NanoString values tend to be higher compared to RNA-Seq.
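The ratio calculation described above can be sketched in a few lines. The sketch assumes that "non-zero" means both the NanoString and the RNA-Seq value of a gene must be non-zero for that gene to contribute; the function names and toy values are hypothetical.

```python
from statistics import mean


def sample_ratio(nanostring_values, rnaseq_values):
    """Mean NanoString/RNA-Seq ratio over genes where both values are non-zero."""
    ratios = [
        n / r
        for n, r in zip(nanostring_values, rnaseq_values)
        if n != 0 and r != 0  # assumption: skip a gene if either value is zero
    ]
    return mean(ratios)


def overall_ratio(nanostring, rnaseq):
    """Average the per-sample mean ratios across all samples."""
    return mean(sample_ratio(nanostring[s], rnaseq[s]) for s in nanostring)


# Hypothetical values: two samples, three genes each.
nanostring = {"s1": [60.0, 0.0, 20.0], "s2": [30.0, 80.0, 5.0]}
rnaseq = {"s1": [10.0, 5.0, 4.0], "s2": [10.0, 0.0, 1.0]}

print(overall_ratio(nanostring, rnaseq))
```

Averaging per sample first gives every sample equal weight, regardless of how many genes pass the non-zero filter in that sample.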

Concordance assessment through machine learning analysis

Figure 4 illustrates the top-10 MAS-selected genes within NanoString data, using only 584 common genes, after contrasting the positive group against the negative group of NHPs on Day 0 post infection (DPI = 0). Since we have only 12 positive and 12 negative samples in total, to avoid any overfitting, we used only the top gene as the single predictor. Using the top MAS-selected gene, OAS1, we applied logistic regression via 6-fold stratified cross-validation to predict whether a sample is from Day 0 (DPI = 0) or a positive sample within both NanoString and RNA-Seq data. It turns out that the single gene, OAS1, is capable of separating all positive samples from negative samples on day 0 post-infection with 100% accuracy (see Table 1; Fig. 5 (a)).
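A single-predictor logistic regression evaluated with 6-fold stratified cross-validation can be sketched as below. To stay self-contained, the sketch fits the model with plain batch gradient descent rather than a statistics library, so it is a stand-in for whatever implementation was actually used. The expression values are invented log2-scale numbers for 12 negative and 12 positive samples, and all function names are hypothetical.

```python
import math
import random


def fit_logistic_1d(x, y, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            grad_w += (p - yi) * xi
            grad_b += p - yi
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def stratified_folds(y, k, seed=0):
    """Split indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validated_accuracy(x, y, k=6):
    """Mean test accuracy over k stratified folds."""
    accuracies = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Standardise with training-fold statistics only, to avoid leakage.
        mu = sum(x[i] for i in train_idx) / len(train_idx)
        sd = (sum((x[i] - mu) ** 2 for i in train_idx) / len(train_idx)) ** 0.5
        w, b = fit_logistic_1d(
            [(x[i] - mu) / sd for i in train_idx], [y[i] for i in train_idx]
        )
        correct = sum(
            ((w * (x[i] - mu) / sd + b) > 0) == (y[i] == 1) for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Invented log2 expression of one gene: 12 negative (DPI 0) and 12 positive samples.
neg = [6.0, 6.2, 6.4, 6.1, 6.3, 6.5, 6.6, 6.05, 6.25, 6.45, 6.15, 6.35]
pos = [10.0, 10.5, 11.0, 10.2, 10.8, 11.2, 10.4, 10.6, 11.4, 10.1, 10.9, 11.1]
x, y = neg + pos, [0] * 12 + [1] * 12

accuracy = cross_validated_accuracy(x, y, k=6)
print(f"mean 6-fold accuracy: {accuracy:.2f}")
```

With 12 samples per class, each of the six folds holds out two negative and two positive samples, so every model is trained on 20 samples and a single predictor.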

Fig. 4

Top-10 MAS-selected genes from NanoString analysis contrasting positive and negative NHP groups on day 0 (DPI = 0), using 584 common genes

Fig. 5

(a) Performance of the OAS1 gene in differentiating Day 0 (DPI = 0) negative samples from positive samples within both NanoString and RNA-Seq data. (b) Performance of the OAS1 gene in differentiating Days 0 and 3 (DPI = 0 and 3) negative samples from positive samples within both NanoString and RNA-Seq data

Table 1 Performance comparison of logistic regression model using top genes selected via 6-fold stratified cross validation by MAS within NanoString and RNA-Seq. OAS1, identified as a top gene in NanoString, demonstrated its ability to differentiate between infected and uninfected samples with 100% accuracy when used as a sole predictor in the logistic regression model, both within the NanoString dataset and when applied to RNA-Seq data

We then added samples from Day 3 (DPI = 3) to the negative group to see whether OAS1 could differentiate NHPs on DPI 0 and 3 from positive samples. It turns out that OAS1 is indeed capable of this differentiation, achieving an average accuracy of 100% (see Table 1; Fig. 5(b)).

Table 2 This table summarizes the gene ontology (GO) biological process terms for 12 genes implicated in the immune response to viral infections, as identified through the MyGeneInfo API

Concordance assessment through differential expression analysis

Figure 6 illustrates the top 20 MAS-selected genes within three datasets: RNA-Seq (Common genes), NanoString (Common genes), and RNA-Seq (Full genes), where “Common genes” indicates that only the 584 genes common to the two platforms were used during the analysis. According to Fig. 6, genes unique to the RNA-Seq (Common genes) dataset within the top 20 MAS-selected genes include AHR, IFIT5, and TCF7. The NanoString (Common genes) dataset showed exclusivity for IL18RAP, S100A8, DHX58, SIGLEC1, S100A9, DYSF, and TWIST2. In contrast, the RNA-Seq (Full genes) dataset exclusively featured CASP5, USP18, DDX60, and PLA2G4C.

Fig. 6

Visualization of the top-20 MAS-selected genes within three datasets: RNA-Seq (Common genes), NanoString (Common genes), and RNA-Seq (Full genes). The graph displays genes that are unique to each dataset as well as those shared among them. Each gene is color-coded to indicate its uniqueness or commonality

Shared gene signatures were also observed, with ISG15, OAS1, IFI44L, IFIT2, OAS2, MX1, IFIT3, IFI44, OASL, IFI27, RSAD2, MX2, and CCL8 common between RNA-Seq (Common genes) and NanoString. Similarly, RNA-Seq (Common genes) and RNA-Seq (Full genes) shared ISG15, OAS1, IFI44L, IFIT2, OAS2, MX1, IFIT3, IFI44, OAS3, OASL, IFI27, RSAD2, DDX58, FCGR1A, MX2, and TLR3. Genes shared between NanoString and RNA-Seq (Full genes) included OAS1, ISG15, IFI27, MX1, IFI44, IFI44L, OAS2, IFIT3, MX2, OASL, IFIT2, and RSAD2. Notably, genes common across all datasets, emphasizing their robust cross-platform consistency, were IFIT2, IFI44, OASL, IFI27, IFIT3, IFI44L, MX1, OAS1, MX2, OAS2, RSAD2, and ISG15.

We proceed to test the effectiveness of these 12 common genes, IFIT2, IFI44, OASL, IFI27, IFIT3, IFI44L, MX1, OAS1, MX2, OAS2, RSAD2, and ISG15, in clustering RNA-Seq and NanoString samples. Hierarchical clustering [33] is performed using Ward’s method [34] with Euclidean distance as the metric. This analysis aims to validate whether these genes can effectively group the samples based on similarities in their expression profiles across platforms, confirming their discriminative power and relevance in broader genomic studies. Figures 7(a) and (b) illustrate the hierarchical clustering within RNA-Seq and NanoString datasets, respectively.
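Ward's criterion can be illustrated with a compact agglomerative sketch. At each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares, which for Euclidean distance equals n_a * n_b / (n_a + n_b) times the squared distance between the cluster centroids. This naive version returns a flat cut into a chosen number of clusters rather than a full dendrogram, and the function name and toy expression matrix are hypothetical.

```python
def ward_clusters(points, n_clusters):
    """Agglomerative clustering with Ward's minimum-variance criterion."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                sq = sum((u - v) ** 2 for u, v in zip(ca["centroid"], cb["centroid"]))
                cost = na * nb / (na + nb) * sq  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            # Size-weighted mean of the two centroids.
            "centroid": [
                (na * u + nb * v) / (na + nb)
                for u, v in zip(ca["centroid"], cb["centroid"])
            ],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(c["members"]) for c in clusters]


# Invented log2 expression of three genes in six samples:
# samples 0-2 mimic baseline, samples 3-5 mimic infected animals.
expression = [
    [5.0, 4.8, 5.2],
    [5.1, 5.0, 4.9],
    [4.9, 5.1, 5.0],
    [9.0, 9.5, 8.8],
    [9.2, 9.1, 9.0],
    [8.9, 9.3, 9.4],
]

groups = ward_clusters(expression, n_clusters=2)
print(sorted(groups))
```

For real analyses a library routine that also returns merge heights is preferable, since the dendrogram itself is what Fig. 7 displays.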

Fig. 7

Hierarchical clustering of samples using Ward’s method and Euclidean distance to highlight the discriminative power of 12 common genes in gene expression analysis across different platforms. (a) RNA-Seq dataset, demonstrating the grouping of samples based on their expression profiles. (b) NanoString dataset, validating the consistency and relevance of these genes by showing how they cluster the samples

To systematically identify the biological processes associated with these 12 common genes, we employed the MyGeneInfo API, an accessible resource for gene-related data provided by the MyGene.info web service. We used version 3 of the API, which offers extensive data access via Python through the mygene Python package. This package facilitates queries against gene symbols and retrieves data about associated GO terms, specifically focusing on biological processes. The comprehensive list of biological processes identified for each gene, as retrieved from the MyGeneInfo API, is detailed in Table 2.
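A query of this kind might look like the following sketch. It uses the `querymany` method of the `mygene` package with the `go.BP` field; the choice of `species="human"`, the helper names, and the hand-written mock response are assumptions for illustration, since the exact query parameters are not given here. The package is imported inside the fetch function so that the parsing helper can be run without network access.

```python
def biological_process_terms(hit):
    """Collect GO biological-process term names from one MyGene.info hit."""
    bp = hit.get("go", {}).get("BP", [])
    if isinstance(bp, dict):  # a single annotation is returned as a dict, not a list
        bp = [bp]
    return sorted({entry["term"] for entry in bp if "term" in entry})


def fetch_go_bp(symbols, species="human"):
    """Query MyGene.info for GO biological processes (requires network access)."""
    import mygene  # third-party package, imported lazily

    mg = mygene.MyGeneInfo()
    hits = mg.querymany(symbols, scopes="symbol", fields="go.BP", species=species)
    return {
        h["query"]: biological_process_terms(h)
        for h in hits
        if not h.get("notfound")
    }


# Hand-written mock of one hit, shaped like a MyGene.info response.
mock_hit = {
    "query": "OAS1",
    "go": {
        "BP": [
            {"term": "defense response to virus"},
            {"term": "innate immune response"},
            {"term": "defense response to virus"},  # duplicate evidence entry
        ]
    },
}
print(biological_process_terms(mock_hit))
```

A call such as `fetch_go_bp(["OAS1", "ISG15", "MX1"])` would then return one list of term names per gene symbol that MyGene.info recognises.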

Note that some genes in RNA-Seq (Full genes), such as CASP5, USP18, and DDX60, do not appear in the NanoString data. Figure 8 illustrates a 3-dimensional visualization of RNA-Seq samples using these three genes as coordinates after a log2 transformation. Employing the MyGeneInfo API, we found that CASP5 is involved in the positive regulation of the inflammatory response. USP18 participates in the negative regulation of type I interferon-mediated signaling pathways and the antiviral innate immune response. DDX60 is associated with the response to viruses, the innate immune response, defense against viruses, and the positive regulation of the MDA-5 and RIG-I signaling pathways.

Fig. 8

Three-dimensional visualization of RNA-Seq samples using CASP5, USP18, and DDX60 as coordinates following log2 transformation with a pseudocount of 1. The negative group includes all NHPs from DPI 0, while the positive group contains NHPs from multiple DPIs, as highlighted in pink in Fig. 1

Assessing the efficacy of gene signatures on a held-out dataset

Figure 9 illustrates the efficacy of the top-2 gene signatures identified in Sect. 3.3, OAS1 and ISG15, in separating mock-infected samples from (a) wild-type (wt), (b) VP35m, (c) VP24m, and (d) a double mutant, VP35m&VP24m EBOV, on the held-out test set. Table 3 shows that, using only OAS1 as the single predictor for logistic regression, we also achieved 100% accuracy in differentiating mock-infected samples from all EBOV-infected samples within the held-out test set. Figure 10(a) shows the separation of all infected samples from mock samples using the top three gene signatures identified in Sect. 3.3, OAS1, ISG15, and IFI44L, on the held-out test set. Figure 10(b) shows the separation of mock-infected samples from EBOV-infected samples using three gene signatures, CASP5, USP18, and DDX60, which were identified within RNA-Seq but not included in NanoString, as described in Sect. 3.4, on the held-out test set.

Fig. 9

Efficacy of top-2 gene signatures, OAS1 and ISG15, in differentiating mock-infected samples from various EBOV strains: (a) wild-type (wt), (b) VP35m, (c) VP24m, and (d) double mutant VP35m&VP24m, evaluated on the held-out test set

Fig. 10

(a) Separation of all infected samples from mock samples using the top three gene signatures, OAS1, ISG15, and IFI44L, identified in Sect. 3.3 on the held-out test set. (b) Differentiation of mock-infected samples from EBOV-infected samples using gene signatures CASP5, USP18, and DDX60, identified in RNA-Seq but not included in NanoString, as detailed in Sect. 3.4, on the held-out test set

Table 3 Performance of OAS1 as a sole predictor in logistic regression employing 1,000 bootstrapping, achieving 100% accuracy in distinguishing between mock-infected and all EBOV-infected samples on the held-out test set

Discussion

Correlation analysis

The correlation analysis illustrated in Fig. 2(a-b) indicates a strong agreement between the RNA-Seq and NanoString gene expression data for NHPs exposed to EBOV. In Fig. 2(a), the notable concentration of Spearman correlation coefficients in the range of 0.78 to 0.88, with 56 out of 62 samples falling within this interval, strongly suggests that both technologies provide highly consistent measurements of gene expression. This high degree of correlation, supported by a mean of 0.83 and a median of 0.85, confirms that both RNA-Seq and NanoString are effective in capturing the biological responses of NHPs to the virus, thus validating the use of these platforms in parallel for comprehensive gene expression analysis in virology research.

In Fig. 2(b), the Spearman correlation coefficients between RNA-Seq and NanoString gene expression data are shown for four distinct groups of NHPs over the course of EVD. Group 1 exhibits consistent correlations throughout the infection period, with values ranging from 0.8 to 0.9, indicating minimal fluctuations in agreement between the two platforms. Group 2 shows mostly stable correlations, but NHP-5 displays a significant dip at DPI 3 (NHP-5-3), suggesting variability in gene expression consistency at this stage, before returning to levels similar to other NHPs in the group. Group 3 demonstrates strong and stable correlations across all NHPs, with very little variation throughout the infection period. In Group 4, correlations remain stable overall, indicating general agreement between RNA-Seq and NanoString with minor fluctuations.

The NHP-5-3 sample showed the lowest Spearman correlation. However, after applying the RGR method (Fig. 2(c)), it became evident that this low correlation is highly likely due to technical artifacts. The Spearman correlation rose above 0.9 only after the removal of 376 out of 548 genes. Because such a large proportion of genes had to be removed, the discrepancy points to widespread noise rather than to a small subset of genes that the two platforms genuinely measure inconsistently.

Bland-Altman analysis

The Bland-Altman analysis shown in Fig. 3 demonstrates a strong agreement between RNA-Seq and NanoString technologies, with the majority of measurements falling within the 95% confidence limits across all samples. Despite some variations in the mean differences between the two platforms for certain samples, the consistently high percentage of measurements within the limits suggests that both technologies provide reliable and comparable gene expression data.

Despite this agreement, the observed average ratio of 6.42 highlights a notable trend where NanoString values are consistently higher than RNA-Seq values. This difference likely stems from key technological distinctions between the two platforms. NanoString’s direct digital counting method measures transcripts without amplification, reducing variability but potentially capturing more background signal, leading to elevated expression values. In contrast, RNA-Seq relies on amplification and sequence alignment, which may introduce biases, particularly for highly expressed or low-abundance genes. Additionally, NanoString’s targeted probe-based approach provides precise quantification of predefined genes, whereas RNA-Seq is more prone to technical variations due to library preparation and read mapping.

For both NHP-1-0 and NHP-2-0, PPBP, a gene involved in platelet aggregation [35], and ACTB, a commonly used housekeeping gene [36], show significant discrepancies between the two platforms (Fig. 3(a)). For instance, PPBP has much higher expression levels in NanoString compared to RNA-Seq, with differences of 7,643.78 and 9,526.46 for NHP-1-0 and NHP-2-0, respectively. Similarly, ACTB also exhibits large differences, with NanoString values substantially higher than RNA-Seq. These results suggest that certain highly expressed genes, like PPBP and ACTB, may be more prone to platform-specific biases, which is important when interpreting gene expression in the context of Ebola infection.

In summary, the Bland-Altman analysis demonstrates a very strong agreement between RNA-Seq and NanoString, reinforcing the reliability of both platforms for gene expression studies despite their methodological differences.

Concordance assessment through machine learning analysis

The machine learning analysis, employing the SMAS method, effectively used the strengths of both NanoString and RNA-Seq technologies to enhance our understanding of gene expression in NHPs infected with EBOV. Initially, we identified key antiviral genes using the precise quantification capabilities of NanoString. OAS1 was selected as a primary marker due to its significant ability to differentiate between positive and negative samples, as determined by the MAS ranking system (Fig. 4). Employing logistic regression with six-fold stratified cross-validation, we observed that OAS1 accurately differentiated between negative (DPI 0) and positive samples in the NanoString dataset, achieving 100% accuracy. Subsequently, OAS1 was used as the sole predictor in logistic regression within RNA-Seq to determine whether the top MAS-selected gene from NanoString could extend its capability to differentiate between positive and negative samples. Indeed, OAS1 demonstrated this capability (Table 1; Fig. 5(a)).

Furthermore, by applying the MAS-selected gene, OAS1, directly within the RNA-Seq analysis using the same stratified cross-validation method, we also achieved 100% accuracy in differentiating the extended negative group (DPI 0 and DPI 3) from positive samples (Table 1; Fig. 5(b)). The successful application of a NanoString-identified gene in RNA-Seq classification underscores the complementary potential of these platforms. It demonstrates how findings from one technology can be substantiated and extended in another, offering a methodological blueprint for future studies that aim to leverage multiple genomic technologies.

Concordance assessment through differential expression analysis

From Fig. 6, we observed a strong concordance between the NanoString and RNA-Seq platforms, as 12 out of the top 20 genes were common across all three datasets. These shared genes, including ISG15, OAS1, IFI44, and RSAD2, represent key antiviral responses that were consistently identified across both technologies. This overlap suggests that despite their technological differences, both RNA-Seq and NanoString are effective in capturing key gene expression changes during Ebola virus infection.

The hierarchical clustering shown in Fig. 7 highlights the effectiveness of the 12 common genes in differentiating samples based on their gene expression profiles. These genes successfully grouped RNA-Seq (Fig. 7(a)) and NanoString (Fig. 7(b)) samples into distinct clusters, indicating their strong discriminative power. In both datasets, the clustering distinctly separates RT-qPCR positive samples from negative ones (DPI 0). This clear division suggests that the selected genes, IFIT2, IFI44, OASL, IFI27, IFIT3, IFI44L, MX1, OAS1, MX2, OAS2, RSAD2, and ISG15, are highly responsive to viral infection, consistently reflecting changes in gene expression across both platforms. The comparable clustering patterns across RNA-Seq and NanoString datasets reinforce the concordance between the two technologies, validating these genes as reliable biomarkers for distinguishing between infected and uninfected samples in Ebola virus infection studies.

Table 2 provides a comprehensive overview of the biological processes associated with the 12 common genes that were consistently identified across both the RNA-Seq and NanoString platforms. The data retrieved through the MyGeneInfo API highlights the involvement of these genes in key immune responses, particularly antiviral defenses, which include processes like the innate immune response, regulation of interferon production, and negative regulation of viral genome replication. This overlap not only underscores the reliability of RNA-Seq and NanoString in capturing essential gene functions but also confirms the biological significance of the identified genes.

Using CASP5, USP18, and DDX60, the RT-qPCR negative (DPI 0) samples and the RT-qPCR positive samples are distinctly separated, as shown in Fig. 8. These genes, which did not appear in the NanoString data, highlight the broader gene detection capabilities of RNA-Seq. Their biological processes, previously discussed in Sect. 3, reflect important roles in immune regulation and antiviral defense mechanisms. The ability of these genes to clearly differentiate infected from uninfected samples underscores the importance of RNA-Seq in identifying additional biologically significant markers, which may be overlooked by more targeted platforms like NanoString. This further illustrates the complementary nature of the two platforms in gene expression analysis.

Assessing the efficacy of gene signatures on a held-out dataset

Figure 9 illustrates the generalizability of the identified gene signatures OAS1 and ISG15 in separating mock-infected and EBOV-infected samples on the held-out test set. Table 3 confirms that OAS1 is capable of differentiating between mock and infected samples when used as the sole predictor in logistic regression for different strains of EBOV. Figure 10 demonstrates that the gene signatures identified in Sect. 3.3 and 3.4 also possess the same power of separation on the held-out test set. These results collectively indicate that the gene signatures maintain consistent performance across various experimental conditions and EBOV strains.

In summary, our study provided a detailed comparison of RNA-Seq and NanoString technologies, demonstrating a high correlation and agreement between these platforms for analyzing gene expression in NHPs infected with the Ebola virus. The correlation analysis revealed strong agreement between these two platforms, highlighting their reliability for detailed gene expression studies. The Bland-Altman analysis supported this finding, showing strong agreement across all samples and underscoring the potential for these technologies to deliver consistent and comparable gene expression data despite their inherent methodological differences. The differential expression analysis and gene ranking through GLMQL-MAS and MAS consistently highlighted key genes across both platforms, including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL. This consistency reinforces the reliability of these technologies in capturing gene expression changes during viral infections. The concordance between RNA-Seq and NanoString platforms was further evaluated using machine learning methods. We applied MAS to identify top gene signatures from NanoString data, notably highlighting OAS1 as a key gene signature. We then assessed the concordance of the two platforms by applying logistic regression using OAS1 as the sole predictor on RNA-Seq data, which confirmed that this gene achieved 100% accuracy in differentiating infected from non-infected samples within RNA-Seq data. This high level of predictive accuracy reinforces the reliability of findings across both technologies. Additionally, we used OAS1 as a predictor in logistic regression to evaluate its effectiveness on a completely held-out test dataset from another study by Ilinykh et al. [19]. Impressively, it maintained a 100% accuracy rate, effectively distinguishing between infected and non-infected samples, thus confirming its potential as a robust biomarker for Ebola virus infections.

The findings from our study may have significant implications for future studies of viral infections and the development of therapeutics. Our research demonstrated high correlation and agreement between RNA-Seq and NanoString technologies, ensuring they can be reliably used to study gene expression dynamics in response to viral infections. This strong concordance provides a solid foundation for identifying potential therapeutic targets. Importantly, this high level of agreement between the platforms means that researchers can confidently use either technology, depending on availability or cost constraints, without compromising the integrity of their findings. Focusing on the gene OAS1, our study identified it as a robust gene signature capable of distinguishing between infected and non-infected samples with high precision. OAS1’s predictive power positions it as a key tool for further research into Ebola virus pathogenesis and potentially other viral infections. Investigating OAS1’s role in viral defense mechanisms could lead to the development of targeted therapies or vaccines. Its effectiveness across different technological platforms also highlights its utility in clinical settings, where developing accurate and robust biomarkers is essential for diagnosing and monitoring infections. This could accelerate the development of diagnostic assays that are both sensitive and specific to particular strains of viruses, aiding in the swift management and containment of outbreaks.

Conclusions

This study offers a comprehensive evaluation of the concordance between RNA-Seq and NanoString technologies for gene expression analysis in NHPs infected with EBOV. Our results demonstrate that both platforms provide highly consistent gene expression measurements, as evidenced by strong Spearman correlation coefficients and Bland-Altman analyses, confirming their reliability in capturing complex biological responses in viral infections.

Machine learning analysis using the SMAS method, trained on NanoString data, identified OAS1 as a key marker capable of distinguishing RT-qPCR positive from negative samples. When applied to RNA-Seq data, OAS1 achieved 100% accuracy using logistic regression, underscoring its robustness and cross-platform utility. This finding highlights the potential of NanoString-identified genes to be validated and extended in RNA-Seq datasets, reinforcing the complementary nature of these technologies. We also employed OAS1 as a predictor in a logistic regression model to assess its performance on a completely independent held-out test dataset. Remarkably, it achieved 100% accuracy in identifying infected versus non-infected samples, thereby validating its effectiveness as a strong biomarker for Ebola virus infections.

Additionally, differential expression analysis identified 12 common genes including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL that exhibited the greatest statistical significance and biological relevance across both platforms. These genes are primarily associated with antiviral immune responses and were shown to reliably differentiate infected from uninfected samples through hierarchical clustering. Gene Ontology analysis further confirmed their involvement in immune pathways, reinforcing their potential as key biomarkers for Ebola virus infection.

RNA-Seq also uniquely identified genes such as CASP5, USP18, and DDX60, which are involved in immune regulation and antiviral defense mechanisms. These genes, not captured in NanoString, emphasize the broader detection capabilities of RNA-Seq, making it particularly useful for discovering additional biologically relevant markers in complex infection scenarios.

In conclusion, this study demonstrates that RNA-Seq and NanoString technologies are both powerful tools for gene expression analysis in EBOV-infected NHPs. Their complementary strengths, NanoString’s precision in quantification and RNA-Seq’s broader gene detection, provide a comprehensive understanding of gene expression dynamics. This cross-platform approach enhances the reliability of identified biomarkers and offers valuable insights for future research on viral infections, therapeutic targets, and disease monitoring.

Limitations of the study

One limitation of this study is the relatively small sample size, which could affect the robustness of the statistical analyses and the ability to detect more subtle gene expression differences. While the study demonstrates concordance between the two technologies, validating the identified biomarkers in additional cohorts is necessary to confirm their broader applicability in Ebola virus research.

Data availability

The data supporting the findings of this study are openly available in the Gene Expression Omnibus (GEO) repository, hosted by the National Center for Biotechnology Information (NCBI). This includes data originally generated by Speranza et al. (GSE103825) [18] and Ilinykh et al. (GSE96590) [19]. The normalized NanoString count data can be found in Table S3 at https://www.science.org/doi/10.1126/scitranslmed.aaq1016.

Abbreviations

EBOV: Ebola Virus

NHP: Non-Human Primate

RNA-Seq: RNA Sequencing

SMAS: Supervised Magnitude-Altitude Scoring

RT-qPCR: Real-Time Quantitative Polymerase Chain Reaction

DC: Dendritic Cells

GO: Gene Ontology

DPI: Days Post Infection

GLMQL-MAS: Generalized Linear Models with Quasi-Likelihood and Magnitude-Altitude Scoring

BH: Benjamini-Hochberg

WT: Wild-type

References

  1. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:1–19.

  2. Bosworth A, Dowall SD, Garcia-Dorival I, Rickett NY, Bruce CB, Matthews DA, Fang Y, Aljabr W, Kenny J, Nelson C, et al. A comparison of host gene expression signatures associated with infection in vitro by the Makona and Ecran (Mayinga) variants of Ebola virus. Sci Rep. 2017;7(1):43144.

  3. Liu X, Speranza E, Muñoz-Fontela C, Haldenby S, Rickett NY, Garcia-Dorival I, Fang Y, et al. Transcriptomic signatures differentiate survival from fatal outcomes in humans infected with Ebola virus. Genome Biol. 2017;18:1–17.

  4. Geiss GK, Bumgarner RE, Birditt B, Dahl T, Dowidar N, Dunaway DL, Fell HP, Ferree S, George RD, Grogan T, et al. Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat Biotechnol. 2008;26(3):317–325.

  5. Rezapour M, Niazi MKK, Lu H, Narayanan A. Machine Learning-Based analysis of Ebola virus’ impact on gene expression in nonhuman primates. Front Artif Intell. 2024;7:1405332.

  6. Song K, Elboudwarej E, Zhao X, Zhuo L, Pan D, Liu J, Brachmann C, Patterson SD, Yoon OK, Zavodovskaya M. RNA-seq RNAaccess identified as the preferred method for gene expression analysis of low quality FFPE samples. PLoS ONE. 2023;18(10):e0293400.

  7. Zhang W, Petegrosso R, Chang JW, Sun J, Yong J, Chien J, Kuang R. A large-scale comparative study of isoform expressions measured on four platforms. BMC Genomics. 2020;21:1–14.

  8. Speranza E, Altamura LA, Kulcsar K, Bixler SL, Rossi CA, Schoepp RJ, Nagle E, Aguilar W, Douglas CE, Delp KL, et al. Comparison of transcriptomic platforms for analysis of whole blood from Ebola-infected cynomolgus macaques. Sci Rep. 2017;7(1):14756.

  9. Cohen I, Huang Y, Chen J, Benesty J. Pearson correlation coefficient. Noise Reduct Speech Process. 2009;1–4.

  10. Rezapour M, Walker SJ, Ornelles DA, Niazi MKK, McNutt PM, Atala A, Gurcan MN. A comparative analysis of RNA-Seq and NanoString technologies in deciphering viral infection response in upper airway lung organoids. Front Genet. 2024;15:1327984.

  11. Hauke J, Kossowski T. Comparison of values of Pearson’s and Spearman’s correlation coefficients on the same sets of data. Quaestiones Geographicae. 2011;30(2):87–93.

  12. Székely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat. 2009;1236–1265.

  13. Giavarina D. Understanding Bland Altman analysis. Biochem Med. 2015;25(2):141–151.

  14. Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc Ser A (General). 1972;135(3):370–84.

  15. Maronna RA, Martin RD, Yohai VJ, Salibián-Barrera M. Robust statistics: theory and methods (with R). Wiley; 2019.

  16. Rezapour M, Walker SJ, Ornelles DA, Niazi MKK, McNutt PM, Atala A, Gurcan MN. Exploring the host response in infected lung organoids using NanoString technology: a statistical analysis of gene expression data. PLoS ONE. 2024;19(11):e0308849.

  17. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11:1–12.

  18. Speranza E, Bixler SL, Altamura LA, Arnold CE, Pratt WD, Taylor-Howell C, Burrows C, et al. A conserved transcriptional response to intranasal Ebola virus exposure in nonhuman primates prior to onset of fever. Sci Transl Med. 2018;10(434):eaaq1016.

  19. Ilinykh PA, Lubaki NM, Widen SG, Renn LA, Theisen TC, Rabin RL, Wood TG, Bukreyev A. Different temporal effects of Ebola virus VP35 and VP24 proteins on global gene expression in human dendritic cells. J Virol. 2015;89(15):7567–83.

  20. Rezapour M, Walker SJ, Ornelles DA, McNutt PM, Atala A, Gurcan MN. Analysis of gene expression dynamics and differential expression in viral infections using generalized linear models and quasi-likelihood methods. Front Microbiol. 2024;15:1342328.

  21. Rezapour M, Narayanan A, Gurcan MN. Machine learning analysis of RNA-Seq data identifies key gene signatures and pathways in Mpox virus-induced gastrointestinal complications using colon organoid models. Int J Mol Sci. 2024;25(20):11142.

  22. Rezapour M, Wesolowski R, Gurcan MN. Identifying key genes involved in axillary lymph node metastasis in breast cancer using advanced RNA-Seq analysis: a methodological approach with GLMQL and MAS. Int J Mol Sci. 2024;25(13):7306.

  23. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:1–9.

  24. Kleinbaum DG, Dietz K, Gail M, Klein M. Logistic regression. Springer; 2002.

  25. Wong T-T, Yeh P-Y. Reliable accuracy estimates from k-fold cross validation. IEEE Trans Knowl Data Eng. 2019;32(8):1586–94.

  26. Kim TK. T test as a parametric statistic. Korean J Anesthesiology. 2015;68(6):540–6.

  27. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57(1):289–300.

  28. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140.

  29. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):1–21.

  30. Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S. Hypothesis testing, type I and type II errors. Industrial Psychiatry J. 2009;18(2):127–31.

  31. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61(3):439–447.

  32. Ying X. An overview of overfitting and its solutions. J Phys: Conf Ser. 2019;1168:022022.

  33. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(1):86–97.

  34. Saraçli S, Doğan N, Doğan İ. Comparison of hierarchical cluster analysis methods by cophenetic correlation. J Inequal Appl. 2013;1–8.

  35. El-Gedaily A, Schoedon G, Schneemann M, Schaffner A. Constitutive and regulated expression of platelet basic protein in human monocytes. J Leukoc Biol. 2004;75(3):495–503.

  36. de Jonge HJM, Fehrmann RSN, de Bont ESJM, Hofstra RMW, Gerbens F, Kamps WA, de Vries EGE, van der Zee AGJ, te Meerman GJ, ter Elst A. Evidence based selection of housekeeping genes. PLoS ONE. 2007;2(9):e898.

Acknowledgements

Effort sponsored by the U.S. Government under HDTRA 12310003, “Host signaling mechanisms contributing to endothelial damage in hemorrhagic fever virus infection,” PI: Narayanan. The US Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.

Funding

Effort sponsored by the U.S. Government under HDTRA 12310003, “Host signaling mechanisms contributing to endothelial damage in hemorrhagic fever virus infection,” PI: Narayanan. The US Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.

Author information

Contributions

Conceptualization, M.R., A.N., W.H.M., and M.N.G.; methodology, M.R., A.N., W.H.M., and M.N.G.; software, M.R.; validation, M.R., A.N., W.H.M., and M.N.G.; formal analysis, M.R., A.N., W.H.M., and M.N.G.; investigation, M.R., A.N., W.H.M., and M.N.G.; resources, A.N. and M.N.G.; data curation, M.R. and W.H.M.; writing-original draft preparation, M.R., A.N., W.H.M., and M.N.G.; writing-review and editing, M.R., A.N., W.H.M., and M.N.G.; visualization, M.R.; supervision, A.N. and M.N.G.; project administration, A.N. and M.N.G.; funding acquisition, A.N. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Mostafa Rezapour.

Ethics declarations

Ethics approval and consent to participate

Not applicable. This study did not involve any new experimental procedures on human or non-human subjects. All data used in this research were obtained from publicly available datasets originally generated and published by Speranza et al. (GSE103825) [18] and Ilinykh et al. (GSE96590) [19]. Ethical approval and consent were obtained by the original authors, who conducted their study in strict compliance with institutional guidelines and ethical standards governing non-human primate research. As our study solely involved secondary analysis of existing data, no further ethical review or approval was required.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Rezapour, M., Narayanan, A., Mowery, W.H. et al. Assessing concordance between RNA-Seq and NanoString technologies in Ebola-infected nonhuman primates using machine learning. BMC Genomics 26, 358 (2025). https://doi.org/10.1186/s12864-025-11553-6

  • DOI: https://doi.org/10.1186/s12864-025-11553-6

Keywords