TransGeneSelector: using a transformer approach to mine key genes from small transcriptomic datasets in plant responses to various environments
BMC Genomics volume 26, Article number: 259 (2025)
Abstract
Gene mining is crucial for understanding the regulatory mechanisms underlying complex biological processes, particularly in plants responding to environmental conditions. Traditional machine learning methods, while useful, often overlook important gene relationships due to their reliance on manual feature selection and limited ability to capture complex inter-gene regulatory dynamics. Deep learning approaches, while powerful, are often unsuitable for small sample sizes. This study introduces TransGeneSelector, the first deep learning framework specifically designed for mining key genes from small transcriptomic datasets. By integrating a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) for sample generation and a Transformer-based network for classification, TransGeneSelector efficiently addresses the challenges of small-sample transcriptomic data, capturing both global gene regulatory interactions and specific biological processes. Evaluated in Arabidopsis thaliana, the model achieved high classification accuracy in predicting seed germination and heat stress conditions, outperforming traditional methods like Random Forest and Support Vector Machines (SVM). Moreover, Shapley Additive Explanations (SHAP) analysis and gene regulatory network construction revealed that TransGeneSelector effectively identified genes that appear to have upstream regulatory functions based on our analyses, enriching them in multiple key pathways which are critical for seed germination and heat stress response. RT-qPCR validation further confirmed the model’s gene selection accuracy, demonstrating consistent expression patterns across varying germination conditions. The findings underscore the potential of TransGeneSelector as a robust tool for gene mining, offering deeper insights into gene regulation and organism adaptation under diverse environmental conditions. 
This work provides a framework that leverages deep learning for key gene identification in small transcriptomic datasets.
Introduction
Gene mining encompasses a series of methods that unveil genes playing critical roles in specific biological processes, a task of profound importance in the field of life sciences. In plants, many vital agronomic traits, such as yield, disease resistance and stress resistance, are manifested as complex quantitative characters, governed by multiple genes in conjunction with environmental interactions [1,2,3]. Moreover, the sensitivity of some endangered species, such as Cathaya argyrophylla [4], to environmental changes is also closely linked to gene expression. Enhancing these agronomic traits and improving the resilience of endangered plants for conservation purposes require the initial identification of their underlying genetic mechanisms and key regulatory genes responsive to various environmental conditions. In the realm of medical research, numerous diseases have genetic underpinnings, with an individual’s susceptibility often linked to variations in genes and gene expression patterns [5]. The pursuit of mining disease-related genes, and deciphering their roles in associated cellular processes and signaling pathways, serves to deepen our understanding of disease pathogenesis [5], pinpoint disease-related biomarkers [6], and potentially pave the way for identifying novel therapeutic targets [7].
Gene mining, especially for plant responses to the environment, often starts with small sample sizes because of the limited availability of samples and the high cost of sequencing [8]. Traditionally, gene mining has relied on methods that identify genes based on fold changes and functional enrichment analyses, which are not very efficient [9,10,11]. Currently, machine learning is a more effective and favored approach in key gene mining [5]. For instance, Li et al. utilized machine learning to identify key genes from gene expression data of COVID-19 patients to assess disease severity [12]. Yu et al. applied machine learning and transcriptome sequencing to discern 9 SNPs in the transcriptome data of Platycodon grandiflorus, aiding in flower color identification [13]. Additional works include the development of prediction tools for disease-resistant proteins in plants by Pal et al. based on support vector machines [14], and Chen et al.'s innovative machine learning method combinations to identify key genes in bovine multi-tissue transcriptome data for predicting feed efficiency [15].
Although the combination of transcriptome sequencing and machine learning shows promise in mining key genes, the extremely complex and precise mutual regulatory interactions among genes—which play a crucial role in regulating and maintaining life activities [16, 17]—pose significant challenges. Traditional machine learning algorithms often require manual feature engineering to improve performance, leading to the discarding of many genes that appear unimportant on the surface. However, considering the complex regulatory relationships and hidden functional patterns that genes often exhibit in organisms [16, 17], these discarded genes may actually be upstream or downstream interacting genes of critical value. Therefore, the inherent limitations of traditional machine learning highlight the importance of developing algorithms that can mine key genes from transcriptome data while fully capturing the complex regulatory relationships or global interactions among genes.
Deep learning, a subset of machine learning, employs artificial neural networks (ANNs) or deep neural networks (DNNs) to model complex problems. Unlike traditional techniques, deep learning can process unstructured data and automatically learn effective features from high-dimensional data without manual feature selection, making it well-suited for mining key genes in biological processes [18,19,20,21,22,23]. More importantly, certain algorithms within deep learning are powerful tools for capturing the interrelationships between genes. For example, in deep learning, natural language processing (NLP) models like RNN [24, 25] and LSTM [26] are notable for their ability to capture long-distance dependencies in sequences. This characteristic can be applied to unravel the complex regulatory relationships between genes. The Transformer model stands out as a significant advancement in this field. Introduced in 2017, this model is designed for NLP tasks and relies on an attention mechanism, surpassing traditional RNNs and LSTMs in capturing long-range dependencies between sequential elements [27]. The Transformer’s impact has been profound, revolutionizing the field of NLP [28,29,30]. Currently, research using Transformer architectures and transcriptome gene expression data is primarily focused on single-cell sequencing for tasks like cell type classification, as seen in studies such as TOSICA and STGRNS [31, 32], as well as a few cancer prediction studies [33, 34]. Although these studies are significant, they are not aimed at addressing the issue of key gene mining, nor do they analyze in detail the Transformer’s ability to capture intergene relationships. As a result, there is an urgent need for further research into the use of Transformers in the field of gene mining.
One challenge in integrating standard transcriptomics sequencing data with deep learning lies in the limited sample size of standard transcriptomics sequencing data, which contrasts with the high sample demands of deep learning [35, 36]. This limitation is why traditional machine learning is more commonly used, as it doesn’t require a large number of samples [37, 38]. With the growth of deep learning, data augmentation techniques such as GANs have been employed to enhance transcriptomics data [39,40,41]. WGAN and its improved version, WGAN-GP, have shown notable improvements in sample quality and training stability [42]. Recently, WGAN-GP’s application to augment transcriptomics data demonstrated that artificially generated samples could enhance classification model performance. Hence, enhancing small-scale transcriptomics sequencing data with WGAN-GP, followed by the utilization of Transformer models for classifying biological processes and mining key genes, presents a feasible and promising approach.
Based on the above analysis, we propose the following scientific questions:
a. Can WGAN-GP address the challenge of poor performance in deep learning models when applied to small-sample transcriptome data?
b. Can deep learning gene mining methods based on Transformer architectures match or even surpass traditional machine learning methods in terms of performance?
c. Can Transformer-based deep learning models outperform traditional methods in capturing gene associations?
To address the challenges and limitations of applying both traditional machine learning and deep learning to small transcriptome datasets, and to answer the above scientific questions, we propose TransGeneSelector. This novel approach represents the first deep learning method specifically designed for key gene mining in small transcriptome datasets, with a focus on plant responses to different environmental conditions and capturing intergene relationships. It combines a Transformer architecture with a sample generation network based on WGAN-GP and a sample filtering network. The process starts by employing WGAN-GP to generate transcriptomic samples, followed by a filtering stage to exclude low-quality samples. Subsequently, a Transformer is utilized to classify biological processes by capturing the complex global relationships between genes. The significance of each gene is further assessed using SHAP (SHapley Additive exPlanations), a method that provides interpretative insights into individual predictions from machine learning models [43, 44].
TransGeneSelector not only predicts the seed states (either dry or germinating) of Arabidopsis thaliana with performance comparable to the best traditional machine learning models, such as Random Forest and SVM algorithms, but it also significantly outperforms these traditional methods in the more challenging task of predicting heat stress conditions in Arabidopsis thaliana. Moreover, it identifies genes at higher regulatory levels and those that demonstrate stronger functional connectivity, which are more representative of seed germination and the heat stress responses in A. thaliana. TransGeneSelector thus offers the following advantages:
1. It has the ability to analyze transcriptomic data from small sample sizes, classify specific environment-related biological processes with high accuracy, and identify key genes involved.
2. It demonstrates the capability to detect vital regulatory relationships between genes, including upstream key genes and highly functionally connected genes, that govern specific environment-related biological processes, surpassing the capabilities of the best-performing traditional algorithms, such as the Random Forest method.
In essence, TransGeneSelector serves as a practical tool for life science researchers to mine key genes from transcriptomic data of plants and other organisms responding to diverse environmental conditions. It provides insights into specific environment-related biological processes, enhancing our understanding of gene regulation and organism adaptation in varying environments.
Methods
TransGeneSelector framework
TransGeneSelector comprises three neural networks: a sample generation network based on Wasserstein GAN with Gradient Penalty (WGAN-GP), an additional classifier network with a fully connected neural network architecture, and a classification network based on the Transformer architecture.
WGAN-GP, the sample generation network of TransGeneSelector, is an improvement over the original Wasserstein GAN (WGAN) [42, 45] that addresses the limitations of the original model by using a gradient penalty instead of weight clipping to enforce the Lipschitz constraint. This results in more stable training and better convergence properties. For the original Wasserstein GAN, the loss function of the discriminator (critic) is defined as:
Where \(f_w\) is the discriminator (critic), \(g_\theta\) is the generator, \(p_r\) is the real data distribution, and \(p_z\) is the noise distribution.
For the generator, the loss function is defined as:
The WGAN model is trained by solving the following optimization problem:
The WGAN-GP loss function is defined as:
Where \(V_{WGAN}(f_w, g_\theta)\) is the original WGAN loss function, which addresses the limitations of the standard GAN loss function by using the Wasserstein distance instead of the Jensen-Shannon divergence. It consists of two parts: one for the discriminator (critic) and one for the generator. \(f_w\) is the discriminator (also called the critic) in the WGAN-GP model, \(g_\theta\) is the generator in the WGAN-GP model, and \(\lambda\) is a hyperparameter that controls the strength of the gradient penalty term. \(\hat{x} \sim P_{\hat{x}}\) indicates that the expectation is taken over random samples \(\hat{x}\) drawn from the distribution \(P_{\hat{x}}\). In WGAN-GP, \(\hat{x}\) is a randomly weighted average between a real data point and a generated data point. \((\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1)^2\) is the gradient penalty term. It penalizes the squared difference between the gradient norm of the discriminator with respect to its input \(\hat{x}\) and the target norm value 1. The purpose of this term is to enforce the Lipschitz constraint on the discriminator, which helps to stabilize the training and improve convergence properties.
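The gradient penalty term described above can be illustrated with a small, dependency-free sketch. It is not the implementation used in this study: the toy critic `critic`, the finite-difference gradient, and the default `lam=10.0` (the value commonly used in the WGAN-GP literature) are all assumptions made for illustration. A real implementation would obtain the gradient by automatic differentiation.

```python
import math
import random

def critic(x, w):
    """Toy critic f_w(x) = tanh(w . x); a stand-in for the real critic network."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def grad_wrt_input(x, w, h=1e-5):
    """Central-difference estimate of the critic's gradient with respect to x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += h
        down = list(x)
        down[k] -= h
        grad.append((critic(up, w) - critic(down, w)) / (2 * h))
    return grad

def gradient_penalty(real, fake, w, lam=10.0, rng=random):
    """Return lam * (||grad f_w(x_hat)||_2 - 1)^2 at a random interpolate x_hat."""
    eps = rng.random()  # random mixing weight in [0, 1)
    x_hat = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]
    norm = math.sqrt(sum(g * g for g in grad_wrt_input(x_hat, w)))
    return lam * (norm - 1.0) ** 2
```

The penalty is zero only when the gradient norm at the interpolated point equals 1, which is how the term pushes the critic toward the Lipschitz constraint.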
The WGAN-GP model is trained by solving the following optimization problem:
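For reference, the WGAN and WGAN-GP objectives are commonly written as follows. This is a sketch based on the standard formulation in the WGAN and WGAN-GP literature [42, 45], using the symbols defined above; sign conventions differ between papers and implementations, so the signs shown here are an assumption. The names \(\mathcal{L}_{critic}\), \(\mathcal{L}_{gen}\), \(\mathcal{L}_{WGAN\text{-}GP}\) and \(\mathcal{W}\) (the set of admissible critic weights) are introduced only for this illustration.

```latex
\begin{aligned}
\mathcal{L}_{critic} &= \mathbb{E}_{z \sim p_z}\left[f_w(g_\theta(z))\right] - \mathbb{E}_{x \sim p_r}\left[f_w(x)\right] \\
\mathcal{L}_{gen} &= -\,\mathbb{E}_{z \sim p_z}\left[f_w(g_\theta(z))\right] \\
\min_{\theta}\,\max_{w \in \mathcal{W}}\; V_{WGAN}(f_w, g_\theta) &= \mathbb{E}_{x \sim p_r}\left[f_w(x)\right] - \mathbb{E}_{z \sim p_z}\left[f_w(g_\theta(z))\right] \\
\mathcal{L}_{WGAN\text{-}GP} &= \mathcal{L}_{critic} + \lambda\,\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1\right)^2\right]
\end{aligned}
```

Under this convention the critic minimizes \(\mathcal{L}_{WGAN\text{-}GP}\) while the generator minimizes \(\mathcal{L}_{gen}\), and \(\mathcal{L}_{critic} = -V_{WGAN}(f_w, g_\theta)\).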
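After generating the fake samples, the additional classifier network, which has a fully connected neural network architecture, is used to filter out low-quality fake samples and retain high-quality ones. The network architecture consists of several fully connected layers (also known as linear layers) with Rectified Linear Unit (ReLU) activation functions in between, followed by a final linear layer with a Sigmoid activation function that outputs a probability value between 0 and 1. For an input vector \(\boldsymbol{x}\) and a network of \(n\) layers with output \(\hat{y}\), the additional classifier network can be written as:
\[ \boldsymbol{h}_1 = \mathrm{ReLU}(\boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1), \quad \boldsymbol{h}_i = \mathrm{ReLU}(\boldsymbol{W}_i \boldsymbol{h}_{i-1} + \boldsymbol{b}_i), \quad \hat{y} = \mathrm{Sigmoid}(\boldsymbol{W}_n \boldsymbol{h}_{n-1} + \boldsymbol{b}_n) \]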
Where \(\boldsymbol{h}_i\) represents the output of the \(i\)-th hidden layer, and \(\boldsymbol{W}_i\) and \(\boldsymbol{b}_i\) are the weight matrix and bias vector for the \(i\)-th layer, respectively. ReLU is the Rectified Linear Unit activation function, defined as:
\[ \mathrm{ReLU}(x) = \max(0, x) \]
Sigmoid is the Sigmoid activation function. Given an input \(x\), the output \(\sigma(x)\) of the Sigmoid function is calculated as:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
The network takes an input vector and passes it through the layers to produce a single output probability value. The output value can be thresholded to obtain the high-quality generated samples.
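The filtering step can be sketched in plain Python as below. This is a minimal illustration rather than the trained network: the layer weights are supplied by the caller, and both the 0.5 threshold and the reading of the output as "probability that a sample is of acceptable quality" are assumptions, since the text states only that the output is thresholded.

```python
import math

def relu(v):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in v]

def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, b, v):
    """Compute W v + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def filter_score(x, layers):
    """Forward pass: ReLU after every layer except the last, then Sigmoid."""
    h = x
    for W, b in layers[:-1]:
        h = relu(linear(W, b, h))
    W, b = layers[-1]
    return sigmoid(linear(W, b, h)[0])

def keep_high_quality(samples, layers, threshold=0.5):
    """Keep generated samples whose score reaches the (assumed) threshold."""
    return [s for s in samples if filter_score(s, layers) >= threshold]

# Hypothetical two-layer network: an identity hidden layer and one output unit.
example_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
```

With `example_layers`, the sample `[2.0, 1.0]` is kept and `[-1.0, 3.0]` is discarded, because their output logits are 1 and -3 respectively.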
After the above sample-generating processes, the generated samples and real samples for seeds under both conditions (germinating or dry seeds) were input together into the Transformer network for biological process classification. The network starts by using a fully connected network to reduce the dimensionality of the gene expression data, which covers a large number of genes. The output of this step is a lower-dimensional representation of the input genes. Given an input expression vector \(\boldsymbol{x}\), the output \(\boldsymbol{y}\) of a fully connected layer with weights \(\boldsymbol{W}\) and biases \(\boldsymbol{b}\) is calculated as:
\[ \boldsymbol{y} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b} \]
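The lower-dimensional representation is then positionally encoded to provide the Transformer network with information about the order of the representation. The positional encoding values are calculated as:
\[ PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]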
where \(pos\) is the position of the word in the sequence, \(i\) is the index of the dimension pair, and \(d_{model}\) is the dimension of the input embeddings.
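As an illustration, the sinusoidal positional encoding introduced with the original Transformer [27] can be computed as follows. The function name `positional_encoding` and the handling of an odd `d_model` are assumptions made for this sketch.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # two_i runs over the even dimensions 0, 2, 4, ... (the "2i" in the formula)
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe
```

The resulting table is added element-wise to the lower-dimensional representation before it enters the encoder.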
Once the lower-dimensional representation of the gene expression data has been positionally encoded, it is fed into the Transformer Encoder. The Encoder processes the input sequence and produces a continuous representation, or embedding, of the input. The Transformer Encoder consists of multiple self-attention and feed-forward layers, allowing the model to process and understand the input sequence effectively. The multi-head self-attention mechanism in the encoder allows the model to attend to different parts of the input sequence simultaneously. It computes multiple attention outputs in parallel and then concatenates them before passing them through a linear transformation. For \(h\) attention heads, the multi-head attention can be represented as:
\[ \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\mathbf{W}^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}\mathbf{W}_i^{Q}, \mathbf{K}\mathbf{W}_i^{K}, \mathbf{V}\mathbf{W}_i^{V}) \]
Here, \(\mathbf{Q}\), \(\mathbf{K}\), and \(\mathbf{V}\) represent the query, key, and value matrices, respectively, and \(\mathbf{W}\) are the learnable parameter matrices. The scaled dot-product attention computes the attention scores by taking the dot product of the query and key matrices, dividing the result by the square root of the key vector dimension \(d_k\), and then applying a softmax function:
\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}}\right)\mathbf{V} \]
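A single attention head can be sketched without any libraries, as below. This is illustrative only: it covers one head, omits the learnable projection matrices, and represents matrices as lists of rows. The function names are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: dot product scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights
```

Each row of `weights` sums to 1 and indicates how strongly one position attends to every other position, which is the quantity that lets the encoder relate distant elements of the input.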
After the multi-head self-attention mechanism, the output is passed through a position-wise feed-forward network, which consists of two linear layers with a ReLU activation function in between. The position-wise feed-forward network can be represented as follows:
\[ \mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 \]
Where \(x\) is the input, \(W_1\) and \(W_2\) are the weight matrices, and \(b_1\) and \(b_2\) are the bias terms.
Finally, residual connections and layer normalization are applied after both the multi-head self-attention and position-wise feed-forward network to stabilize the training process and improve the model’s performance. Residual connections are used to allow gradients to flow through a network directly. The output \(\boldsymbol{y}\) of a residual connection is:
\[ \boldsymbol{y} = F(\boldsymbol{x}) + \boldsymbol{x} \]
where \(F(\boldsymbol{x})\) is the output of the previous layer and \(\boldsymbol{x}\) is the input.
Layer normalization is applied to stabilize the training process. For a layer input \(x\), the layer normalization formula is:
\[ \mathrm{LN}(x) = \gamma \cdot \frac{x - \mu^{l}}{\sigma^{l}} + \beta \]
where \(\mu^{l}\) and \(\sigma^{l}\) are the mean and standard deviation of the layer, respectively, and \(\gamma\) and \(\beta\) are learnable scale and shift parameters.
After processing the positionally encoded lower-dimensional representation of the gene expression data through the Transformer Encoder, we use the first token of the output for classification, which is considered to contain the most relevant information for classification. We applied the Sigmoid function to the first token of the Encoder output, which is defined as:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
The Binary Cross-Entropy (BCE) loss function is then used as the loss function for binary classification of dry and germinating seeds. It measures the dissimilarity between the predicted probability distribution and the true binary labels of a dataset, and is particularly useful when the output is a probability value between 0 and 1:
\[ \mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] \]
Where \(N\) is the number of samples in the dataset, \(y_i\) is the true binary label of the \(i\)-th sample (1 for the positive class and 0 for the negative class), and \(\hat{y}_i\) is the predicted probability of the \(i\)-th sample belonging to the positive class.
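The BCE loss can be computed directly from its definition, as in the sketch below. The clamping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of the definition above.

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

For example, predicting 0.5 for every sample gives a loss of log 2 (about 0.693), and the loss decreases as the predicted probabilities move toward the true labels.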
To take into account the non-linearity of the activation functions and maintain the standard deviation of the activations around 1, we used Kaiming Uniform Initialization [46] to initialize the weights of the Transformer network. Kaiming Uniform Initialization draws the weights from a uniform distribution \(U(-a, a)\), where:
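A common form of the bound for ReLU layers is \(a = \sqrt{6 / n_{in}}\), where \(n_{in}\) is the number of inputs to the layer (the fan-in). The sketch below assumes that form, that is, a gain of \(\sqrt{2}\) in fan-in mode; the exact variant used in this study may differ, and the function names are hypothetical.

```python
import math
import random

def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    """Bound a of U(-a, a): gain * sqrt(3 / fan_in); sqrt(6 / fan_in) for ReLU."""
    return gain * math.sqrt(3.0 / fan_in)

def kaiming_uniform(fan_out, fan_in, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-a, a)."""
    a = kaiming_uniform_bound(fan_in)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]
```

A uniform distribution on \((-a, a)\) has variance \(a^2/3\), so this bound gives each weight a variance of \(2 / n_{in}\), which is the scaling that keeps activation variance roughly constant across ReLU layers.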
For the input of our neural network models, we used gene TPM expression levels that had undergone distribution correction and standardization. A log1p transformation was first applied to the gene expression levels to stabilize variance and normalize the distribution of expression values, facilitating their use as input for neural networks. The log1p transformation is defined as:
\[ \mathrm{log1p}(x) = \log(1 + x) \]
Where \(x\) is the TPM value of a gene in a sample. This transformation helps mitigate the effect of highly expressed genes and, importantly, brings the distribution of gene expression values closer to a normal distribution, which is often desirable for the optimal performance of neural networks.
Subsequently, the transformed expression values were standardized using z-score normalization to ensure that each gene has a mean of zero and a standard deviation of one across all samples. The z-score for each gene expression value is calculated as:
\[ z = \frac{x - \mu}{\sigma} \]
Where \(x\) is the log1p-transformed TPM value of a gene in a sample, \(\mu\) is the mean expression level of the gene across all samples, and \(\sigma\) is the standard deviation of the gene’s expression levels across all samples. This standardization process facilitates the comparison of gene expression levels on a common scale, a crucial step given the sensitivity of neural network architectures to the scale of input features.
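The two preprocessing steps can be sketched together as below. The data layout (a list of samples, each a list of TPM values in the same gene order), the use of the population standard deviation, and the choice to map constant genes to zero are assumptions made for this illustration.

```python
import math

def log1p_zscore(tpm_by_sample):
    """Apply log1p, then z-score each gene across samples."""
    logged = [[math.log1p(v) for v in sample] for sample in tpm_by_sample]
    n_samples = len(logged)
    n_genes = len(logged[0])
    out = [[0.0] * n_genes for _ in range(n_samples)]
    for g in range(n_genes):
        col = [sample[g] for sample in logged]
        mu = sum(col) / n_samples
        # Population standard deviation (assumed); a sample SD would divide by n - 1
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / n_samples)
        for s in range(n_samples):
            # Genes with zero variance carry no signal and are set to 0
            out[s][g] = (col[s] - mu) / sigma if sigma > 0 else 0.0
    return out
```

In practice the mean and standard deviation would be estimated on the training set and reused for validation and test samples, so that no information leaks from held-out data.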
Data description and processing
Data for our study were extracted from the NCBI GEO (Gene Expression Omnibus) database (https://www.ncbi.nlm.nih.gov/gds/) and the Expression Atlas database (https://www.ebi.ac.uk/gxa/experiments). For the training set of TransGeneSelector and Random Forest models for seed germination state classification, we selected experiments GSE116069, GSE161704, GSE163057, GSE167244, and GSE179008 from NCBI. These contained raw counts of 79 samples derived from dry seeds and germinating seeds, comprising 43 positive samples and 36 negative samples. The test set for seed germination state classification was obtained from GSE167502, GSE230392, GSE94457, and GSE151223 with 42 samples, comprising 21 positive samples and 21 negative samples.
For the classification of heat stress states, we selected gene expression data from four experiments, namely GSE155710, GSE158444, GSE184983, and GSE200247, as the training set, comprising a total of 156 samples. Among these, there were 76 positive samples and 80 negative samples. For the testing set, we selected gene expression data from four other experiments, specifically GSE212019, GSE232094, GSE239833, and GSE244763, comprising a total of 53 samples. We selected only those samples explicitly identified as positive or negative, resulting in a testing set of 28 positive and 25 negative samples.
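The transcripts-per-million (TPM) value was calculated for each gene according to the method reported previously [47], utilizing its length and raw counts. TPM normalization accounts for both gene length and sequencing depth, allowing for meaningful comparisons of gene expression levels within and between samples. The TPM for each gene \(i\) was computed using the following formula:
\[ \mathrm{TPM}_i = \frac{\text{read counts}_i \,/\, \text{gene length}_i\ (\text{kb})}{\sum_{j} \text{read counts}_j \,/\, \text{gene length}_j\ (\text{kb})} \times 10^{6} \]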
Where read counts is the raw number of sequencing reads mapped to a particular gene, and gene length (kb) is the length of the gene in kilobases (kb).
The denominator represents the sum of all reads per kilobase for all genes in the sample, ensuring that the total TPM across all genes in a sample sums to one million. This scaling facilitates the comparison of gene expression levels across different samples by normalizing for sequencing depth.
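The TPM calculation for one sample can be written in a few lines. The function name `tpm` and the list-based inputs are assumptions made for this sketch.

```python
def tpm(read_counts, gene_lengths_kb):
    """Convert raw read counts for one sample into TPM values."""
    # Reads per kilobase: normalize each gene's count by its length
    rpk = [count / length for count, length in zip(read_counts, gene_lengths_kb)]
    total_rpk = sum(rpk)
    # Scale so that the values for the sample sum to one million
    return [value / total_rpk * 1e6 for value in rpk]
```

For two genes with 100 reads each and lengths of 1 kb and 4 kb, the reads per kilobase are 100 and 25, giving TPM values of 800,000 and 200,000.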
For each task, we retained only those genes present in all samples. For the seed germination-related task, samples from the germinating group were designated as positive (1), and those from the non-germinating group were marked as negative (0). For the heat stress-related task, samples under heat stress were labeled as 1, while samples from plants under normal conditions were labeled as 0.
For additional MERLIN network analysis, we included experiments E-CURD-1, E-GEOD-30720, E-GEOD-52806, E-GEOD-64740, E-MTAB-4202, E-MTAB-7933, E-MTAB-7978 from Expression Atlas and GSE199116 from NCBI. These experiments included 268 samples unrelated to seed germination and heat stress, and we proceeded with the same TPM calculations, retaining genes found across all samples.
Benchmarking and evaluation metrics
To optimize the hyperparameters of the neural networks in TransGeneSelector, a comprehensive grid search was performed on each part of TransGeneSelector using the dry seed and germinating seed training set.
WGAN-GP: The epochs were set from 200 to 6000 in steps of 200, with combinations of learning rates (0.1, 0.01, 0.001). Model performance was assessed using the loss curve, the Fréchet Inception Distance (FID), and Uniform Manifold Approximation and Projection (UMAP) visualization to determine the best parameters.
Additional Classifier: Epochs (100, 150) and learning rate (0.1, 0.01) combinations were evaluated.
Transformer Network: Various combinations of embedding dimension and head number (72/8, 240/8, 72/16, and 240/16), learning rates (0.1, 0.01, 0.001), and training periods (7, 21, and 35) were assessed. The best parameter combination was determined by the validation set loss values. To comprehensively compare the performance of optimized and unoptimized models, we trained TransGeneSelector with two additional model groups: one with early stopping criteria implemented for optimization, and one without early stopping, representing unoptimized models.
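The three search grids listed above can be enumerated as in the sketch below. The grid values are taken from the text; the inclusive reading of "200 to 6000 in steps of 200", the variable names, and the helper `best_by_validation_loss` (which takes a caller-supplied function returning a validation loss for one configuration) are assumptions.

```python
from itertools import product

# WGAN-GP: epochs 200..6000 in steps of 200, crossed with three learning rates
wgan_gp_grid = list(product(range(200, 6001, 200), (0.1, 0.01, 0.001)))

# Additional classifier: epochs crossed with learning rates
classifier_grid = list(product((100, 150), (0.1, 0.01)))

# Transformer: (embedding, heads) pairs, learning rates, training periods
transformer_grid = list(product(
    ((72, 8), (240, 8), (72, 16), (240, 16)),
    (0.1, 0.01, 0.001),
    (7, 21, 35),
))

def best_by_validation_loss(grid, evaluate):
    """Return the configuration with the lowest value of evaluate(config)."""
    return min(grid, key=evaluate)
```

This yields 90 WGAN-GP configurations, 4 classifier configurations, and 36 Transformer configurations. In a real search, `evaluate` would train the model with the given configuration and return its validation loss.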
The Random Forest model parameters were optimized on the training set using a grid search that combined n_estimators (10, 100) with 200 values of n_features_to_select, uniformly spaced between 1 and 500. The best parameter combination was chosen based on model accuracy.
Both TransGeneSelector and Random Forest models were evaluated using a 5-fold cross-validation approach. In TransGeneSelector, the training batch size for the Transformer architecture and the WGAN-GP component was 32. However, the additional classifier in TransGeneSelector utilized a full batch size for training. This was because, with a 1:1 ratio of real to generated samples, the effective sample size remained relatively small. For TransGeneSelector, a comprehensive assessment of performance was conducted using metrics including accuracy, precision, recall rate, and F1 score. In contrast, the evaluation of the Random Forest model focused solely on accuracy through cross-validation, with other metrics omitted as the Random Forest often achieved consistent 1.0 accuracy.
Additionally, as a benchmark for comparison, we adapted a previously proposed SNP mining method based on the Network-Regularized Logistic Regression model with Minimax Concave Penalty (NR-LR-MC) [48] in this study. Furthermore, we incorporated the classical K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) machine learning algorithms into our performance comparison. To optimize performance, we conducted a grid search on the training set to determine the optimal combination of hyperparameters for NR-LR-MC, including alpha (α) (0.1, 0.3, 0.5, 1.0), λ1 (0.01, 0.1, 0.5), λ2 (0.01, 0.1, 0.5), and learning rate (0.001, 0.01, 0.05). The features utilized by both the KNN and SVM algorithms were derived from the feature engineering process of the Random Forest model.
We utilized the test set data to evaluate the performance of all models by calculating the accuracy, precision, recall, F1 score, and AUC value. Through comparison of these metrics, the capabilities of different models were comprehensively assessed.
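The threshold-based metrics can be computed from the confusion-matrix counts, as sketched below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions. AUC is omitted because it requires the predicted scores rather than hard 0/1 labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For true labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 1, 0]`, the accuracy is 0.6 and precision, recall and F1 are all 2/3.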
Gene mining method
We applied SHAP (SHapley Additive exPlanations) [43] to mine important genes through the Transformer network of TransGeneSelector by calculating the contribution of each gene to the prediction. SHAP is based on the concepts of game theory and can be applied to any machine learning model. The method uses Shapley values, which are derived from cooperative game theory, to fairly distribute the “payout” (i.e., the prediction) among the features. The formula for the Shapley value of gene \(j\) is given by:
\[ \phi_j = \sum_{S \subseteq \{1, \ldots, p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!}\left[f(S \cup \{j\}) - f(S)\right] \]
Here, \(S\) represents a subset of genes excluding gene \(j\), \(p\) is the total number of genes, \(|S|\) is the number of genes in \(S\), \(S \cup \{j\}\) represents a new subset formed by adding gene \(j\) to the subset \(S\), and \(f\) is the prediction function of the model, evaluated on the given subset of genes. Genes with high Shapley values were considered important genes.
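For a handful of features, Shapley values can be computed exactly by enumerating every subset, as sketched below. This only illustrates the definition: the cost grows exponentially with the number of genes, so practical SHAP analyses rely on approximations. The function names and the toy value function `toy_model` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, p):
    """Exact Shapley values for p features; value_fn maps a set of indices to a number."""
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! shared by all subsets of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi

def toy_model(subset):
    """Toy prediction: gene 0 adds 2, gene 1 adds 1, both together add 3 more; gene 2 is inert."""
    value = 0.0
    if 0 in subset:
        value += 2.0
    if 1 in subset:
        value += 1.0
    if 0 in subset and 1 in subset:
        value += 3.0
    return value
```

For `toy_model` with three genes, the values are 3.5, 2.5 and 0: the interaction bonus is split equally between genes 0 and 1, the inert gene receives nothing, and the values sum to the full prediction of 6.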
We performed Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis on the genes mined from the TransGeneSelector and Random Forest, using the STRING v11.5 database (http://www.string-db.org). We focused on pathways with FDR < 0.05, considering them as significantly enriched.
To further analyze the function of the mined genes, we conducted Gene Ontology (GO) term matching analysis and network visualization. The GO annotations for Arabidopsis thaliana genes were obtained by downloading the gene annotation file from the TAIR database (https://www.arabidopsis.org/). Using these annotations, we determined whether the genes identified by both TransGeneSelector and Random Forest were related to seed germination and heat stress response by searching the GO term descriptions for the keywords ‘germinate’ and ‘heat’. When these keywords were present, and the corresponding gene was also present in our mined gene list, we linked the mined gene to that GO term. To visualize the functions of the identified genes, we utilized Cytoscape v3.9 for network visualization, enabling a graphical representation of the gene-GO term network. This network visualization illustrates whether the mined genes were annotated within seed germination or heat stress-related GO terms, and also highlights the differences in functional annotation between the genes mined by the two methods.
Network analysis
In this study, the Modular Regulatory Network Learning with Per Gene Information (MERLIN) algorithm [49] was employed to infer the regulatory network of genes mined using TransGeneSelector or Random Forest. The transcriptome data from the same dry seed and germinating seed datasets, as well as an additional dataset unrelated to seed or seed germination of A. thaliana, were prepared (as detailed in the Data Description and Processing section). Initially, the FPKM of the genes was transformed into Transcripts Per Million reads (TPM), and the mean expression level of each gene was calculated. Subsequently, the expression levels of the genes were zero-mean transformed. The MERLIN algorithm retained only those genes that (1) varied in expression value by at least ± 1 from the mean in at least five samples and (2) were included in the list of genes obtained in this study. All the genes were defined as both regulators and targets. A total of 10 sub-sets were created from the amended data matrix, with each sub-set containing 50% of the samples randomly selected from the complete matrix. Data from each sub-set were used to infer a MERLIN interaction. In the final MERLIN network, edges (which indicate the relationship between two genes) that appeared at least six times in the 10 sub-sets (confidence = 60%) were retained.
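The subsampling and edge-consensus steps can be sketched as follows. This does not run MERLIN itself: `edge_sets` stands for the edge lists that ten separate MERLIN runs would return, and the function names and fixed random seed are assumptions made for illustration.

```python
import random
from collections import Counter

def subsample_indices(n_samples, n_subsets=10, fraction=0.5, seed=0):
    """Draw n_subsets random subsets, each holding a fraction of the sample indices."""
    rng = random.Random(seed)
    k = int(n_samples * fraction)
    return [sorted(rng.sample(range(n_samples), k)) for _ in range(n_subsets)]

def consensus_edges(edge_sets, min_count=6):
    """Keep edges found in at least min_count runs; return their confidence."""
    counts = Counter(edge for edges in edge_sets for edge in set(edges))
    return {edge: count / len(edge_sets)
            for edge, count in counts.items() if count >= min_count}
```

With ten runs and `min_count=6`, an edge recovered in six runs is retained with a confidence of 0.6, while an edge recovered in only five runs is dropped.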
Plant germination and RT-qPCR test
We used the same collection of A. thaliana col-0 mature seeds for both the germination experiment and the RT-qPCR quantitative analysis experiment. The seeds were stored at the Plant Development and Molecular Laboratory of Hunan Normal University, China.
We first selected surface-sterilized A. thaliana seeds, sown on 9-cm plates containing solidified 0.5 MS medium (pH 5.9), stratified in the dark at 4 °C for 2 d, and then exposed to different light strengths and durations. The light strengths were set from weak to strong, corresponding to aluminum foil wrapped full-black germination conditions, 100 µmol photons photosynthetic light, and 200 µmol photons photosynthetic light. For germination time, we set 0 h (dry seeds), 12 h, 24 h, and 48 h. The temperature for germination was set to 25 ℃.
Seed samples were ground to a fine powder in liquid nitrogen. The ground tissue was transferred to a pre-chilled 1.5-mL Eppendorf tube, and total RNA was isolated using TRIzol® Reagent (Life Technologies, Carlsbad, CA, United States), according to the manufacturer’s instructions. The RT-qPCR assay was conducted as described previously [50]. Forward primers used for the genes are listed in Supplementary Table 1, and actin1 was used as the reference. The expected size of the amplified fragments varied from 80 to 200 bp. Three biological replicates were performed for each sample. Statistical analysis was performed using Piko Real Software 2.0. After the expression of each gene was standardized, heatmaps were drawn using Python.
Results
Overview of TransGeneSelector
TransGeneSelector is a composite deep learning method designed for classifying biological processes and identifying key genes. It integrates three distinct networks: a sample generation model using WGAN-GP, a Transformer-based classification model, and an additional fully connected classifier network (Fig. 1). The method unfolds in three main stages:
(1) Fake Sample Generation (Fig. 1a): The WGAN-GP model is first trained on the training set to generate fake transcriptomic gene expression samples (fake samples I). These are combined with real samples to train the additional classifier, improving its ability to distinguish fake from real samples. A second batch of generated samples (fake samples II) is then filtered through this classifier to weed out low-quality samples, yielding the high-quality final fake samples.
(2) Transformer Classification (Fig. 1b): The final fake samples are mixed with real training set samples to train a simplified Transformer classification model that retains only the encoder part. A fully connected (FC) layer reduces each sample’s gene expression profile to 72 dimensions, preserving global expression information. After positional encoding, these vectors are input to the 8-layer stacked attention heads of the Transformer encoder. The vector of the first output token is used to assess classification performance on the real samples of the validation set.
(3) SHAP Method for Key Gene Mining: The SHAP (SHapley Additive exPlanations) method evaluates each gene’s influence on the trained Transformer classification model. Based on Shapley values, the genes exerting the most significant impact on the classification are identified as key genes for specific biological processes.
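To make the idea behind stage (3) concrete, the sketch below computes exact Shapley values for a single sample by enumerating every gene coalition. It is an illustration under stated assumptions, not the authors' implementation: the SHAP library relies on approximations rather than full enumeration, the toy `predict` function in the usage note stands in for the trained Transformer, the baseline vector is a hypothetical reference expression profile, and ranking by absolute value is our choice for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2^n, so this is only feasible for a handful of features.
    """
    n = len(sample)

    def value(coalition):
        x = [sample[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def rank_genes(gene_names, phi, top_k):
    """Return the top_k gene names ordered by absolute Shapley value."""
    ranked = sorted(zip(gene_names, phi), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

For a linear toy model such as `predict = lambda x: 2.0 * x[0] - x[1]`, each gene's Shapley value reduces to its weight times its deviation from the baseline, which is a convenient sanity check before applying an approximate explainer to a real network.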
The overall frame and workflow of TransGeneSelector. a Network for Generating Synthetic Samples: Utilizing the WGAN-GP and additional classifier sample filtering networks, the WGAN-GP model is trained with real sample data to create synthetic gene expression samples, referred to as fake samples I. These are then amalgamated with the training set of real samples and passed through the additional classifier to enhance its discrimination capabilities between fake and real samples. Following the training of the additional classifier, fake samples II generated by WGAN-GP are processed to filter out substandard samples, resulting in high-quality, final fake samples. b Transformer Classification Model and SHAP Method for Key Gene Mining: The final fake samples are blended with the training set of real samples and used to train a specially simplified Transformer classification model suitable for small-sample classification tasks. The Sigmoid function is applied to the first token of the output, serving as an activation mechanism. Ultimately, the SHAP (SHapley Additive exPlanations) method is employed to evaluate the influence of each gene on the trained Transformer classification model. This results in the identification of genes with the most substantial impact on classification outcomes, facilitating the mining of key genes pertinent to a specific biological process
WGAN-GP enhances TransGeneSelector performance in training with small transcriptome datasets
In our study, TransGeneSelector was initially trained using two distinct datasets from A. thaliana, each representing a different classification task. The first dataset comprised 36 samples of dry seeds and 43 samples of germinating seeds. The second dataset included 76 samples of A. thaliana under heat stress and 80 samples of plants in normal conditions. For the seed germination dataset, the Generator and Discriminator losses of the WGAN-GP module converged after 3,800 training epochs; for the heat stress dataset, convergence was achieved after 3,600 epochs. In both scenarios, the Fréchet Inception Score (FIS) reached its minimum (0.36 for the germination-related dataset and 19.37 for the heat stress-related dataset), as depicted in Fig. 2a and c, demonstrating that the generated samples closely resembled the actual samples. However, the FIS values for the samples generated after training the WGAN-GP with the heat stress-related dataset were higher than those for the germination-related dataset. Additionally, analysis of the real samples from the heat stress dataset using UMAP (Uniform Manifold Approximation and Projection) revealed that the distributions of positive and negative samples partially overlapped (Fig. 2d), indicating indistinct boundaries between the classes. This overlap was not observed in the germination-related dataset (Fig. 2b), suggesting that the plant response to heat stress may be complex, making the classification of heat stress conditions a more challenging task.
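For reference, the Generator and Discriminator losses whose convergence is described above take the following form in the standard WGAN-GP formulation. The notation is an assumption on our part, since this section does not define symbols: $P_r$ and $P_g$ denote the real and generated sample distributions, $D$ the Discriminator (critic), $\hat{x}$ a random interpolate between a real and a generated sample, and $\lambda$ the gradient-penalty weight, whose value is not stated here.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x\sim P_r}\big[D(x)\big]
     + \lambda\,\mathbb{E}_{\hat{x}\sim P_{\hat{x}}}
       \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
L_G &= -\,\mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big], \\
\hat{x} &= \epsilon\,x + (1-\epsilon)\,\tilde{x}, \qquad \epsilon \sim U[0,1].
\end{aligned}
```

The penalty term pushes the gradient norm of the critic toward 1 along the interpolates, which is what distinguishes WGAN-GP from the weight-clipping variant of the Wasserstein GAN.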
Training process and sample quality evaluation of WGAN-GP. a Plots showcasing the relationship between the number of training epochs and the Fréchet Inception Score for the germination-related task, providing a measure of generated sample quality. b UMAP (Uniform Manifold Approximation and Projection) visualization for the germination-related task of both generated and real samples. Colors differentiate between positive and negative samples, while distinct shapes indicate whether samples are generated or real. c Plots showcasing the relationship between the number of training epochs and the Fréchet Inception Score for the heat stress-related task, providing a measure of generated sample quality. d UMAP visualization for the heat stress-related task of both generated and real samples. Colors are used to differentiate between positive and negative samples, and distinct shapes indicate whether samples are generated or real
Using the WGAN-GP trained in two distinct scenarios, we combined an equal number of real and generated samples to train an additional classifier for sample filtering in each case. Following this, we integrated varying quantities of high-quality, filtered generated samples with real samples for training TransGeneSelector’s Transformer. In each scenario, cross-validation results on the validation set (Figs. 3a and 4a) showed improved performance metrics when generated samples were added. Specifically, for the seed germination task, the best model, which used 2,200 generated samples, achieved an accuracy of 0.9875, precision of 1.0000, recall of 0.9818, and an F1 score of 0.9904. Additionally, models employing the additional classifier exhibited enhanced stability and reliability (Fig. 3a).
Comparison of classification performance using cross-validation of TransGeneSelector and Random Forest in germination-related task. a TransGeneSelector Performance Across Different Numbers of Generated Samples: This panel illustrates the classification performance of TransGeneSelector with varying numbers of synthetic samples. The blue line represents the cross-validation performance of the model without additional classifier filtering, and the orange line signifies the cross-validation performance with additional classifier filtering. The shaded regions around each line indicate the corresponding error bands, providing a visual representation of the uncertainty associated with each measurement. b Random Forest Classifier Performance with Varied Gene Selection: This part demonstrates the classification performance of the Random Forest classifier for different numbers of genes selected through the wrapper method. The parameter ‘n_features_to_select’ within the RandomForestClassifier module is set uniformly spaced between 1 and 500, with a specific value chosen as 200, allowing for a detailed exploration of the effect of gene selection on classifier performance
In the heat stress-related task, the peak cross-validation performance was also achieved with 2,200 generated samples, yielding an accuracy of 0.9678, precision of 1.0000, recall of 0.9119, and an F1 score of 0.9476. However, unlike the germination task, models performed better without the additional classifier (Fig. 4a), suggesting that the more complex and challenging classification of heat stress conditions may benefit from a more diverse set of samples.
We also applied Random Forest to classify the dataset. Using the Wrapper method, 200 parameters (gene numbers) were chosen for feature engineering and model evaluation. Cross-validation showed that, for the seed germination-related task (Fig. 3b), under certain gene features using the validation set, Random Forest achieved a maximum accuracy of 1.000. For the heat stress-related task, Random Forest achieved a maximum accuracy of 0.975 (Fig. 4b).
Comparison of classification performance using cross-validation of TransGeneSelector and Random Forest in heat stress-related task. a TransGeneSelector Performance Across Different Numbers of Generated Samples: This panel illustrates the classification performance of TransGeneSelector with varying numbers of synthetic samples. The blue line represents the cross-validation performance of the model without additional classifier filtering, and the orange line signifies the cross-validation performance with additional classifier filtering. The shaded regions around each line indicate the corresponding error bands, providing a visual representation of the uncertainty associated with each measurement. b Random Forest Classifier Performance with Varied Gene Selection: This part demonstrates the classification performance of the Random Forest classifier for different numbers of genes selected through the wrapper method. The parameter ‘n_features_to_select’ within the RandomForestClassifier module is set uniformly spaced between 1 and 500, with a specific value chosen as 200, allowing for a detailed exploration of the effect of gene selection on classifier performance
TransGeneSelector achieves high classification performance in small samples
In our comprehensive evaluation, we assessed the performance of TransGeneSelector, Random Forest, and a nonlinear SVM across two distinct tasks: germination-related classification and heat stress-related classification. We utilized two separate test sets for these evaluations, with results displayed in Tables 1 and 2, respectively. Notably, the features selected for both Random Forest and SVM were derived from the feature engineering phase of the Random Forest. These specific features were chosen because they achieved the maximum validation-set accuracy during cross-validation.
In the evaluation, TransGeneSelector demonstrated notable accuracy and precision in the seed germination classification task, achieving a peak accuracy of 0.9524 and a precision of 0.9524. In comparison, SVM models, particularly those utilizing 148 and 449 features (derived from feature engineering using the Random Forest model), excelled in accuracy, recall, and F1 score, reaching maximums of 0.9524, 1.0000, and 0.9545, respectively (Table 1). Random Forest models, in turn, performed strongly in the AUC metric, with the 8-feature model achieving the highest AUC of 0.9875 (Table 1). The remaining models, such as NR-LR-MCP, KNN, and the modified TransGeneSelector models, scored lower. Thus, TransGeneSelector remained competitive with the best traditional models in this task.
However, in the more challenging heat stress-related classification task, TransGeneSelector outperformed the other tested models in our specific experimental settings, leading with the highest scores in accuracy, precision, F1, and AUC, which were 0.9623, 0.9643, 0.9643, and 0.9871, respectively (Table 2). The best-performing traditional model, an SVM trained with 41 genes (derived from feature engineering using the Random Forest model), achieved scores of 0.9434 in accuracy, 0.9032 in precision, 1.0000 in recall, 0.9492 in F1, and 0.9743 in AUC, outperforming TransGeneSelector only in recall and trailing in the other metrics (Table 2). These results indicate that in complex tasks for classifying physiological states of plants under varying environmental conditions, TransGeneSelector’s performance surpasses that of traditional models. This is particularly significant considering it competed against conventional algorithms trained on features meticulously curated and proven optimal in previous Random Forest iterations, underscoring the robustness and potential of TransGeneSelector in handling small sample sizes. However, several traditional models outperformed TransGeneSelector in recall, all reaching 1.0000 (Table 2). This difference between traditional algorithms and TransGeneSelector merits further investigation.
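The relationship between the metrics reported above can be made explicit with a short sketch. The confusion-matrix counts in the usage note are hypothetical: they are not given in the text and were chosen only because, assuming a 53-sample test set, they reproduce the accuracy, precision, recall, and F1 values quoted for the 41-gene SVM.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (an assumption, not reported data).
svm_like = classification_metrics(tp=28, fp=3, fn=0, tn=22)
```

Because each metric is a ratio of counts, all four are bounded between 0 and 1, and a recall of 1.0000 simply means that no positive sample was missed, even if some negatives were misclassified.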
Furthermore, when the WGAN-GP network in TransGeneSelector was replaced with the mixup method, or the Transformer network was replaced with an MLP, model performance fell short of that of the full model (Tables 1 and 2). This suggests that the model design is well founded, with the WGAN-GP and Transformer modules both contributing to the classification of physiological states of plants under varying environmental conditions.
Given the outstanding performance of TransGeneSelector and Random Forest in the previous performance tests, particularly due to their feature selection capabilities, and considering that the features used in the high-performing SVM models were also derived from Random Forest feature engineering, the subsequent comparison of gene mining capabilities was focused primarily on TransGeneSelector and Random Forest.
TransGeneSelector demonstrates strong capability in key gene mining
We then evaluated the capability of TransGeneSelector’s Transformer network to mine key genes using the SHAP (SHapley Additive exPlanations) method. This method allowed us to analyze the importance of features, highlighting the influence of individual genes on the model’s predictions, and isolating the key genes involved in the processes of seed germination and heat stress response. Additionally, we employed the Wrapper method in Random Forest to identify crucial genes for both the seed germination and heat stress processes.
We compared the genes identified by these two methods. For the Random Forest models, we selected parameters (number of genes) corresponding to the maximum cross-validation accuracy in both tasks. We found that gene sets of 11, 51, 148, and 449 were optimal and contrasted them with an equivalent number of genes chosen by TransGeneSelector’s SHAP method.
First, we conducted Venn diagram analysis on the 449 genes identified by TransGeneSelector and by Random Forest in each of the two tasks (Fig. 5). We found that there was almost no overlap between the genes identified by the two methods. For the seed germination-related task, out of the 449 genes identified by TransGeneSelector and Random Forest separately, only 10 genes were common to the two methods. Similarly, for the heat stress-related task, each method identified 435 unique genes, with 14 genes being common to both. This outcome highlights a distinct difference in the methodologies and priorities of TransGeneSelector and Random Forest when it comes to pinpointing genes associated with plant responses to varying environments.
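The overlap counts behind such a Venn diagram follow from simple set arithmetic, sketched below. The gene identifiers in the usage example are placeholders constructed so that two 449-gene sets share 14 members, as in the heat stress task; they are not the genes reported in the study.

```python
def overlap_summary(genes_a, genes_b):
    """Summarize the overlap between two gene sets."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared genes relative to all distinct genes.
        "jaccard": len(shared) / len(a | b) if (a | b) else 0.0,
    }


# Placeholder IDs: g0..g448 and g435..g883 share exactly g435..g448.
method_a = [f"g{i}" for i in range(449)]
method_b = [f"g{i}" for i in range(435, 884)]
summary = overlap_summary(method_a, method_b)
```

With 14 shared genes out of 449 per method, each method contributes 435 unique genes, and the Jaccard index is 14/884, which quantifies how small the overlap is.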
Then we performed a comparative analysis of the expression patterns of genes (germination-related genes and heat stress-related genes) selected by both methods in the tasks related to seed germination and heat stress within all samples including the training sets and the test sets, as illustrated in Figs. 6 and 7. The results yielded several noteworthy observations. Initially, for both tasks, when the number of genes selected was relatively low, such as 11 and 51, the genes chosen by Random Forest displayed pronounced differences in expression between positive and negative samples (i.e., germinating versus dry seeds in Fig. 6, or heat-stressed versus healthy plants in Fig. 7). Moreover, the expression levels within each individual group (either dry or germinating seeds in Fig. 6, and either heat-stressed or healthy plants in Fig. 7) were highly consistent, enabling clear differentiation from the other group. This consistency was notably superior to that observed in genes selected by TransGeneSelector for the same number of genes in earlier examinations. However, as the number of targeted genes increased to 148, the expression patterns of genes identified by Random Forest began to show disarray, and the distinctions between positive and negative samples became more subdued (Figs. 6 and 7). This trend was further accentuated when the number of targeted genes reached 449.
Comparative expression patterns of germination-related genes selected by TransGeneSelector and Random Forest. This heatmap illustrates the expression levels of germination-related genes selected by TransGeneSelector and Random Forest with varying numbers of genes. The color intensity represents the expression level, with darker shades indicating higher expression. TransGeneSelector’s selection is based on descending Shapley values, whereas Random Forest’s selection relies on different settings for the ‘n_features_to_select’ parameter. Unoptimized means TransGeneSelector was trained without the implementation of ‘early stop’ optimization
Conversely, TransGeneSelector exhibited consistent expression patterns across all gene sets within the training set, regardless of the number of targeted genes (Figs. 6 and 7). Whether dealing with the top 11, 51, 148, or 449 genes as designated by SHAP values, the expression patterns of genes selected by TransGeneSelector remained orderly and homogeneous within each group (dry seeds versus germinating seeds in Fig. 6, and heat-stressed versus healthy in Fig. 7). Although in the more challenging heat stress classification task, some irregularities in gene expression patterns occurred due to sample variability, the overall consistency of gene expression mined by TransGeneSelector still surpassed that of the Random Forest Wrapper method (Fig. 7).
These findings highlight a distinct divergence in the characteristics of key gene mining between TransGeneSelector and Random Forest. When using expression patterns of the training set as the sole index for evaluating the efficacy of key gene selection, TransGeneSelector exhibits a clear advantage over Random Forest across a range of targeted gene numbers.
We further evaluated the expression patterns using the test set (Figs. 6 and 7). Contrary to the trends observed within the training set, the gene expression profiles generally did not exhibit as clear a distinction between the positive and negative sample groups for both tasks on the test set. For Random Forest, regardless of whether a small or large number of genes were selected, the genes showed some inconsistency within the positive and negative sample groups. The expression profiles were more muddled, and a clear demarcation between the groups became elusive (Figs. 6 and 7). This lack of distinction across varying gene quantities suggests possible overfitting to the training data or the complex nature of the underlying gene interactions which Random Forest might not effectively capture in this context.
Conversely, TransGeneSelector, while not achieving the same level of performance as observed in the training set, consistently showcased superior gene selection on the test set across all targeted gene numbers (Figs. 6 and 7). The genes chosen by TransGeneSelector not only maintained better internal group coherence but also succeeded in differentiating between the two groups in each task (Fig. 6 or 7), validating its robustness in gene mining and predictive capabilities.
Comparative expression patterns of heat stress-related genes selected by TransGeneSelector and Random Forest. This heatmap illustrates the expression levels of heat stress-related genes selected by TransGeneSelector and Random Forest with varying numbers of genes. The color intensity represents the expression level, with darker shades indicating higher expression. TransGeneSelector’s selection is based on descending Shapley values, whereas Random Forest’s selection relies on different settings for the ‘n_features_to_select’ parameter. Unoptimized means TransGeneSelector was trained without the implementation of ‘early stop’ optimization
An interesting phenomenon was observed in the seed germination-related task using the TransGeneSelector model that was not optimally tuned (i.e., without the implementation of early stopping for optimal model selection), as illustrated in Fig. 6. For this unoptimized version, when selecting a smaller gene count, such as 11 genes, there was a tendency for these genes to exhibit pronounced expression in specific samples within the training set. This focused expression implies a possible over-reliance on certain training data patterns. Conversely, with a higher gene number, the expression profiles across groups in the training set remained distinct and orderly. However, when these genes were evaluated on the test set, their expression patterns became more chaotic (Fig. 6), suggesting the potential pitfalls of operating with an unoptimized model. In conclusion, despite the challenges with the unoptimized model, TransGeneSelector generally performed better than Random Forest in our tests, underlining its superior ability to generalize across different data scenarios. This demonstrates TransGeneSelector’s robustness in gene mining and its potential in providing reliable predictions even for plants under varying conditions.
To dissect the regulatory relationships among the genes identified by both Random Forest and TransGeneSelector in the two classification tasks, we employed the MERLIN machine learning algorithm for gene regulatory network construction, as depicted in Fig. 8. For each task, we selected two types of gene expression datasets for network construction: one relevant to the task at hand (related to either seed germination or heat stress) and another unrelated to the task.
Comparative analysis of gene regulatory networks and functional enrichment for germination-related genes and heat stress-related genes identified by TransGeneSelector and random forest. a Parallel Plot Illustrating Regulatory Relationships between Germination-related Genes: This segment showcases the MERLIN-derived regulatory links between genes identified by TransGeneSelector and Random Forest. Networks were constructed using datasets unrelated to germination (top) and those related to germination (bottom). Varied line colors signify regulatory relationships stemming from different types of regulators. Relationships labeled with (new) indicate newly emerged regulatory interactions identified within the germination-related dataset-constructed network. b Stacked Percentage Bar Chart of Regulatory Relationships Between Germination-related Genes Identified by TransGeneSelector and Random Forest. c Comparative Results of KEGG Enrichment Analysis for Germination-related Genes Identified by TransGeneSelector and Random Forest. d Parallel Plot Illustrating Regulatory Relationships between Heat stress-related Genes: This segment showcases the MERLIN-derived regulatory links between genes identified by TransGeneSelector and Random Forest. Networks were constructed using datasets unrelated to heat stress (top) and those related to heat stress (bottom). Varied line colors signify regulatory relationships stemming from different types of regulators. Relationships labeled with (new) indicate newly emerged regulatory interactions identified within the heat stress-related dataset-constructed network. e Stacked Percentage Bar Chart of Regulatory Relationships Between Heat stress-related Genes Identified by TransGeneSelector and Random Forest. f Comparative Results of GO Enrichment Analysis for Heat stress-related Genes Identified by TransGeneSelector and Random Forest
The results of the network construction indicated that, when using larger datasets unrelated to our specific tasks, a majority of the regulatory relationships ran from genes identified by Random Forest to those identified by TransGeneSelector (68% in the seed germination task and 61% in the heat stress task, as shown in Fig. 8a, b, d and e). However, a pronounced shift occurred when we used task-specific datasets from our study to inform the network construction. Not only did the number of regulatory relationships increase in both scenarios, but the predominant direction also reversed (Fig. 8a, b, d and e). Specifically, 68% of the regulatory interactions in the seed germination task and 64% in the heat stress task ran from genes identified by TransGeneSelector to those identified by Random Forest, contrasting starkly with the 32% and 39% observed for this direction in the networks built from task-unrelated datasets (Fig. 8b and e).
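Directional percentages of this kind can be derived from a list of inferred (regulator, target) edges, as in the sketch below. It rests on two assumptions that the text does not spell out: only edges crossing between the two gene sets are counted, and genes selected by both methods are excluded because their direction would be ambiguous. The function and label names are illustrative.

```python
from collections import Counter


def direction_shares(edges, tgs_genes, rf_genes):
    """Share of cross-method edges in each direction.

    edges: iterable of (regulator, target) pairs.
    """
    tgs_only = set(tgs_genes) - set(rf_genes)
    rf_only = set(rf_genes) - set(tgs_genes)
    counts = Counter()
    for regulator, target in edges:
        if regulator in tgs_only and target in rf_only:
            counts["TGS->RF"] += 1
        elif regulator in rf_only and target in tgs_only:
            counts["RF->TGS"] += 1
        # Edges within one method's gene set are ignored here.
    total = sum(counts.values())
    if total == 0:
        return {"TGS->RF": 0.0, "RF->TGS": 0.0}
    return {key: counts[key] / total for key in ("TGS->RF", "RF->TGS")}
```

Running the function once on the network built from task-unrelated data and once on the task-related network would yield the two pairs of shares being compared.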
Further analysis revealed an intriguing detail: when leveraging task-related datasets, many of these regulatory relationships from TransGeneSelector to Random Forest were newly emergent, absent in the networks built from task-unrelated datasets (Fig. 8a and d). The networks constructed with task-related datasets reflect specific gene regulatory relationships in particular plant responses under unique conditions (such as related to germination or heat stress). Within these task-related datasets, genes mined by TransGeneSelector tended to regulate those identified by Random Forest more frequently, indicating that TransGeneSelector is particularly effective at uncovering upstream regulatory genes specific to certain physiological processes, demonstrating its superiority over Random Forest.
Further, for the germination-related task, we annotated and enriched each set of 449 genes identified by Random Forest and TransGeneSelector as related to seed germination using the Kyoto Encyclopedia of Genes and Genomes (KEGG). This analysis focused on understanding these gene-pathway relationships (Fig. 8c). The KEGG enrichment results revealed that the genes extracted by Random Forest were significantly enriched in only one pathway, Oxidative phosphorylation, while those extracted by TransGeneSelector were significantly enriched in seven pathways, namely, Ribosome, Biosynthesis of amino acids, Metabolic pathways, Biosynthesis of secondary metabolites, Lysine biosynthesis, Protein processing in endoplasmic reticulum, and Citrate cycle (TCA cycle) (Fig. 8c). The genes identified by the two methods exhibited a blend of similarities and distinctions. First, the enriched pathways of both methods included aspects of energy metabolism, namely Oxidative phosphorylation and Citrate cycle (TCA cycle). The genes mined by Random Forest were exclusively enriched in the Oxidative phosphorylation pathway (Fig. 8c). This pathway is chiefly involved in energy metabolism and ATP synthesis and represents a downstream stage in the energy generation process, utilizing the products of other metabolic pathways, such as NADH and FADH2, to produce ATP [51]. Conversely, the genes unearthed by TransGeneSelector were enriched in the Citrate cycle (TCA cycle) pathway, which is key to central carbon metabolism processes, including glycolysis and the pentose phosphate pathway. This pathway generates precursor metabolites and reducing equivalents (NADH and FADH2) that feed into the Oxidative phosphorylation pathway [52]. As such, it can be regarded as upstream of the Oxidative phosphorylation process.
In addition, genes significantly enriched in the pathways related to protein synthesis, such as Ribosome, Biosynthesis of amino acids, Lysine biosynthesis, and Protein processing in endoplasmic reticulum are of crucial importance for seed germination. During seed germination, which heralds the beginning of diverse plant life activities, a plethora of enzymes are required, along with various associated proteins to execute physiological functions transitioning from dormancy to an active state. Among these processes, protein synthesis relying on the mRNA stored within the seed is a crucial method to produce these key proteins [53,54,55,56,57,58]. Thus, in the germination-related task, genes identified by TransGeneSelector were notably enriched in several critical pathways associated with seed germination. In contrast, genes unearthed by Random Forest were predominantly concentrated in only a more downstream pathway. This not only demonstrates that TransGeneSelector is better at mining upstream genes related to seed germination than Random Forest, which is consistent with the gene regulatory network construction analysis, but also indicates that the genes identified by TransGeneSelector are more readily enriched in multiple pathways, suggesting stronger interconnectedness between these genes.
In the heat stress-related task, neither the sets of 449 genes identified by Random Forest nor those by TransGeneSelector showed significant enrichment in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Consequently, we utilized Gene Ontology (GO) enrichment analysis to compare the two gene sets (Fig. 8f). The genes uncovered by Random Forest were only enriched in four cellular component pathways: Extracellular region, Intracellular anatomical structure, Intracellular organelle, and Intracellular membrane-bounded organelle. These pathways are generally indicative of the cellular locations but do not provide detailed insights into the specific metabolic or regulatory roles the genes might play in response to heat stress.
Conversely, the genes identified by TransGeneSelector exhibited a much broader spectrum of enrichment, showing significant presence in 27 biological process pathways (Fig. 8f). The ten most notably enriched pathways included Cellular nitrogen compound metabolic process, Cellular macromolecule metabolic process, Nucleobase-containing compound metabolic process, Nucleic acid metabolic process, Cytoskeleton organization, Gene expression, Macromolecule biosynthetic process, Heterocycle metabolic process, Cell cycle, and Spindle organization. These pathways are crucial for various aspects of cellular function and stress response. In particular, those involving cellular and macromolecule metabolic processes are essential for the cellular restructuring and repair mechanisms that plants invoke in response to environmental stress [59,60,61]. Functions like gene expression regulation [62], cell cycle control [63], and spindle organization [64, 65] are particularly critical, especially for heat stress response, as they play direct roles in cell survival and adaptation under stress conditions. This indicates that TransGeneSelector can detect a more diverse and functionally relevant set of genes that contribute to a broader and more dynamic response to heat stress compared to Random Forest, which primarily identified genes associated with basic cellular structures and components. Furthermore, as in the germination-related task, the genes mined by TransGeneSelector are more readily enriched in a variety of pathways, suggesting that they have a higher degree of functional and regulatory interconnectedness.
Although, within each task, the respective 449 genes identified by TransGeneSelector and Random Forest were not significantly enriched in pathways directly related to germination or heat stress, such as ‘seed germination,’ ‘response to heat,’ ‘heat shock protein,’ etc., we still further analyzed how many genes among all the genes identified by the two methods were GO annotated to these pathways and compared the differences in the GO terms enriched by the genes identified by each method (Fig. 9). The results showed that in tasks related to germination or heat stress, both methods identified genes annotated with corresponding functions. For example, in tasks related to germination, both methods identified genes with the GO terms ‘pollen germination,’ ‘seed germination,’ and ‘negative regulation of seed germination.’ In tasks related to heat stress, both methods identified genes with the GO terms “heat acclimation,” “heat shock protein binding,” “regulation of seed germination,” “cellular response to heat,” and “response to heat.” Although there were very subtle differences, TransGeneSelector and Random Forest largely showed consistent tendencies in mining GO terms corresponding to these genes. This further confirms that both methods can effectively identify genes related to the target biological processes. However, it is worth noting that the Venn diagram in Fig. 5 shows that there are only 14 shared genes related to heat stress identified by the two methods, but among these 14 shared genes, 3 are precisely annotated with “response to heat.” This indicates that the functionally overlapping portion of TransGeneSelector and Random Forest, while not large in terms of gene number, focuses on crucial heat stress response mechanisms, demonstrating the rationality of both algorithms in gene mining, and further emphasizes the potential importance of these 14 shared genes in the heat stress response.
Genes selected by TransGeneSelector and Random Forest that are GO-annotated as related to seed germination or heat stress. In the figure, orange nodes represent genes selected by TransGeneSelector, green nodes represent genes selected by Random Forest, pink nodes represent genes selected by both methods, and blue nodes represent the names of GO terms.
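The overlap analysis described above (genes shared by both methods, and how many of those carry a given GO term) reduces to set operations. The sketch below shows the idea with placeholder data. The gene identifiers, the annotation table, and the function name are hypothetical and do not reproduce the study's gene lists.

```python
from typing import Dict, Iterable, Set, Tuple


def shared_and_annotated(genes_a: Iterable[str], genes_b: Iterable[str],
                         term_to_genes: Dict[str, Set[str]],
                         term: str) -> Tuple[Set[str], Set[str]]:
    """Return (genes selected by both methods, shared genes annotated to `term`)."""
    shared = set(genes_a) & set(genes_b)
    annotated = shared & term_to_genes.get(term, set())
    return shared, annotated


# Placeholder identifiers, not real Arabidopsis gene IDs.
tgs_genes = {"G1", "G2", "G3", "G4"}   # hypothetical TransGeneSelector picks
rf_genes = {"G3", "G4", "G5"}          # hypothetical Random Forest picks
go_table = {"response to heat": {"G4", "G5", "G9"}}

shared, heat_shared = shared_and_annotated(
    tgs_genes, rf_genes, go_table, "response to heat"
)
print(sorted(shared), sorted(heat_shared))
```

Applied to the real gene lists, the same two intersections would yield the 14 shared genes and the 3 annotated with ‘response to heat’ reported in the text.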
In summary, TransGeneSelector demonstrates superior performance in identifying a broad and functionally relevant spectrum of genes crucial for both seed germination and heat stress response. It effectively captures upstream metabolic processes and critical pathways essential for plant adaptation and survival under diverse environmental conditions. TransGeneSelector also shows the ability to mine genes that are highly functionally interconnected in both tasks: the number of pathways in which these genes are significantly enriched far exceeds that of Random Forest. Random Forest, by contrast, tends to focus on more limited and downstream aspects of gene function, and its ability to capture gene interconnections is far inferior to that of TransGeneSelector. This suggests that the incorporation of a transformer-based architecture has enhanced TransGeneSelector’s ability to mine functionally related genes. However, both methods identify genes related to the physiological processes corresponding to each task, which indicates that the two methods have different points of emphasis.
RT-qPCR test of genes identified by TransGeneSelector in response to germination
To elucidate the specific germination responses of genes identified by TransGeneSelector and Random Forest in A. thaliana, we conducted a total of 408 RT-qPCR tests across 12 A. thaliana samples, targeting 34 genes. Expression analyses were performed to assess the gene expression patterns under three distinct germination conditions: dark germination, low-light germination, and high-light germination. For each condition, expression profiles were captured at four critical germination time points: 0 h (dry seeds), 12 h, 24 h, and 48 h.
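The test count above can be checked with a short enumeration. This sketch assumes a full crossing of the three conditions with the four time points, which the text implies (12 samples) but does not state explicitly. It also assumes that the 408 tests count one test per gene per sample, excluding replicates.

```python
from itertools import product

# Conditions and time points as listed in the text.
CONDITIONS = ("dark", "low-light", "high-light")
TIME_POINTS_H = (0, 12, 24, 48)  # 0 h corresponds to dry seeds
N_GENES = 34

# Assumption: every condition is sampled at every time point.
samples = list(product(CONDITIONS, TIME_POINTS_H))
n_tests = len(samples) * N_GENES

print(len(samples), n_tests)
```

Under these assumptions, 3 conditions times 4 time points gives 12 samples, and 12 samples times 34 genes gives the 408 tests reported.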
The results (Fig. 10) indicated that genes identified by both methods exhibited increased expression levels concurrent with the progression of germination, reinforcing the proficiency of both TransGeneSelector and Random Forest in capturing genes intimately associated with the germination process. A closer inspection of the data revealed that genes selected by the optimized TransGeneSelector exhibited expression patterns closely mirroring those chosen by Random Forest across all germination conditions. Specifically, a subset of genes reached their maximum expression levels at the 24-hour mark under low-light germination conditions.
RT-qPCR quantitative analysis of the top 11 genes selected by each of TransGeneSelector and Random Forest in A. thaliana seeds under various germination conditions (n = 3). This figure presents the real-time quantitative polymerase chain reaction (RT-qPCR) results of the top 11 genes chosen by each of TransGeneSelector and Random Forest. Dark germination refers to seed germination conducted in complete darkness, with the seeds wrapped in aluminum foil; Low-light germination refers to seed germination under a subdued illumination of 100 µmol photons of photosynthetic light, reflecting a low-light environment; High-light germination refers to seed germination under a more intense illumination of 200 µmol photons of photosynthetic light. Unoptimized means TransGeneSelector was trained without the implementation of ‘early stop’ optimization.
Conversely, genes selected by the unoptimized TransGeneSelector exhibited distinct differences, most notably showing heightened expression at the 24-hour mark under dark germination conditions, which then declined (Fig. 10). Such a phenomenon was absent in genes identified by both the optimized TransGeneSelector and Random Forest (Fig. 10). Another notable difference was the absence of maximal expression levels at the 24-hour mark under low-light germination conditions for genes selected by the unoptimized TransGeneSelector (Fig. 10), and interestingly, the expression patterns of these genes also closely resembled those of two genes related to dark germination of A. thaliana within the KAI2 pathway (Fig. 10). Furthermore, analysis of expression patterns in the training data (Fig. 11) revealed that the unoptimized TransGeneSelector tended to select genes consistent with those related to dark germination of A. thaliana, such as KAI2 (Michael et al., 2022; Waters and Smith, 2013). This suggests that the unoptimized TransGeneSelector exhibits a tendency to selectively identify A. thaliana germination-related genes associated with seeds germinating under dark conditions.
Transcriptomic expression patterns of the top 11 genes identified by the unoptimized TransGeneSelector in the training set. This illustration displays the expression profiles of the top 11 genes chosen by the unoptimized TransGeneSelector from the training set. Expression levels are denoted by color intensity, with darker shades signifying greater expression. Blue bars represent the genes selected by TransGeneSelector, while orange bars highlight genes associated with the KAI2 pathway. The categories on the x-axis delineate specific germination conditions of A. thaliana: dry seeds, ABA-treated germinating seeds (ABA), germinating seeds activated with the KAI2 pathway in the dark (KAI2 dark), germinating seeds exposed to salt stress (NaCl), and germinating seeds under standard conditions (Normal).
Furthermore, the expression patterns of genes identified by the optimized TransGeneSelector exhibited higher consistency and uniformity compared to those selected by Random Forest. This finding is consistent with the trend observed in the construction of gene regulatory networks, where TransGeneSelector unearthed genes positioned further upstream than those identified by Random Forest. This observation further underscores the greater efficacy of TransGeneSelector over Random Forest in discovering upstream genes or functionally relevant genes related to environmental changes.
Overall, while both methods successfully identified genes intricately associated with germination, TransGeneSelector demonstrates enhanced potential over Random Forest, particularly in capturing more upstream gene expression dynamics with greater uniformity. However, a note of caution remains regarding the importance of meticulous model optimization to fully harness its capabilities.
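The optimized and unoptimized models differ in whether ‘early stop’ was applied during training. The exact stopping criterion is not described in this section, so the following is only a generic, patience-based sketch. The use of a validation loss, the `patience` value, and the loss sequence are all assumptions made for illustration.

```python
from typing import Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 3) -> int:
    """Return the epoch whose weights would be kept under patience-based
    early stopping: training halts once the validation loss has failed to
    improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; later epochs are never run
    return best_epoch


# Hypothetical validation losses per epoch.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5]
kept_epoch = early_stopping_epoch(losses, patience=3)
print(kept_epoch)  # 2
```

In this toy run, training stops after epoch 5 and the weights from epoch 2 are kept. A model trained without such a rule continues to fit the training set, which is one plausible route to the condition-specific bias seen in the unoptimized model.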
Discussion
In the field of life sciences, gene mining occupies a pivotal role, particularly in research related to plant responses to diverse environmental conditions [1, 2, 3]. Traditionally, this research starts with the analysis of limited sample data, presenting significant challenges. Conventional machine learning methods often fail to account for the complex interdependencies among genes, frequently overlooking crucial genes [5, 16, 17]. Additionally, existing deep learning techniques struggle to perform optimally in scenarios involving small sample sizes. To address these challenges, our study introduces TransGeneSelector, the first deep learning method specifically designed for key gene mining within small transcriptomic datasets. This method is particularly adept at identifying upstream key regulatory genes, as well as those interrelated genes involved in plant responses to various environmental factors. By overcoming the limitations inherent in both traditional machine learning and deep learning approaches, TransGeneSelector effectively mines essential upstream regulatory genes for critical processes such as seed germination and heat stress response in A. thaliana.
The development of TransGeneSelector was initiated with a strategic combination of data augmentation using the WGAN-GP network, quality control via an additional classifier, and the implementation of a simplified Transformer network structure. This methodological blend led to impressive results: high-performance classification was achieved between A. thaliana dry and germinating seeds within a small dataset comprising only 79 samples, and it effectively classified between normal and heat-stressed states of A. thaliana within another dataset of 156 samples.
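The quality-control step can be pictured as a threshold filter applied to generated samples before they are added to the training set. The sketch below is a simplified illustration under stated assumptions. `score_fn` stands in for the additional classifier, `toy_score` and the threshold of 0.6 are hypothetical, and the actual acceptance criterion used by TransGeneSelector is not specified in this section.

```python
from typing import Callable, List, Sequence

Sample = Sequence[float]


def filter_generated(samples: Sequence[Sample],
                     score_fn: Callable[[Sample], float],
                     threshold: float = 0.5) -> List[Sample]:
    """Keep only generated samples whose classifier score reaches `threshold`."""
    return [s for s in samples if score_fn(s) >= threshold]


def toy_score(sample: Sample) -> float:
    """Hypothetical stand-in for a trained classifier's confidence score."""
    return sum(sample) / len(sample)


# Hypothetical generated expression vectors (two genes each).
generated = [[0.9, 0.8], [0.2, 0.1], [0.6, 0.7]]
kept = filter_generated(generated, toy_score, threshold=0.6)
print(kept)
```

Raising the threshold keeps fewer but more confidently scored samples. Lowering or disabling it preserves more diversity among the generated samples.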
Previous studies have reported that GANs can enhance model performance in transcriptome data-related classification tasks [66, 67]. Our study corroborates these findings. Meticulous tuning of key hyperparameters of WGAN-GP, such as the number of augmented samples, allowed TransGeneSelector to outperform baseline Transformer models that did not utilize generative augmentation. This underscores the advantage of strategically expanding limited training data through generative adversarial networks to enrich data representation. Compared to conventional machine learning models like Random Forest and SVM, TransGeneSelector not only demonstrated competitive or superior performance in the seed germination task but also significantly outperformed these traditional methods in the more challenging task of classifying heat stress responses. This validates the effectiveness of TransGeneSelector for complex biological problems, particularly in scenarios where acquiring abundant training data is challenging.
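For reference, the critic objective that gives WGAN-GP its name can be written as follows. This is the standard formulation from the general WGAN-GP literature, not an equation taken from this paper. The symbols (critic $D$, real distribution $P_r$, generator distribution $P_g$, penalty weight $\lambda$) are introduced here for illustration, and the value of $\lambda$ used in TransGeneSelector is not stated in this section.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the Wasserstein distance between generated and real expression profiles. The third term penalizes critic gradients whose norm deviates from 1 at points interpolated between real and generated samples, which stabilizes training on small datasets.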
The Transformer architecture is recognized for its ability to capture semantic relationships between inputs [28, 29, 30], making it a promising model structure for gene mining, especially in its potential to capture the interrelationships between genes. However, prior studies have not reported on the application of Transformers in gene mining, and there has been no exploration of their capability to capture these inter-gene relationships. In our research, we confirmed this potential, demonstrating that the Transformer architecture can be effectively applied to gene mining while successfully identifying and analyzing the connections between genes.
Firstly, TransGeneSelector consistently selected genes with orderly expression patterns that distinctly differentiated between dry and germinating seeds, or between plants under heat stress and those under normal conditions, across various targeted gene counts. This stability was confirmed through test set evaluations, where TransGeneSelector maintained its performance, unlike Random Forest, which showed deteriorating results, particularly as the number of genes increased. Functional enrichment analysis further demonstrated that genes selected by TransGeneSelector covered a wider range of diverse upstream processes critical to both seed germination and heat stress. This indicates that there is a high degree of association among the genes identified by TransGeneSelector, which leads to their enrichment in multiple metabolic pathways. In contrast, Random Forest tended to focus on more downstream pathways, and the genes it identified could only be enriched in very few pathways, indicating that the association among them is low. This distinction highlights TransGeneSelector’s superior capability in capturing essential interrelationships among genes, making it a more effective tool for studying complex biological phenomena where understanding gene functionality and interactions is crucial.
Furthermore, the construction and analysis of the gene regulatory network using the MERLIN algorithm provided intriguing insights. When employing datasets unrelated to seed germination or heat stress for network construction, most regulatory relationships were observed flowing from genes identified by Random Forest to those identified by TransGeneSelector. However, this pattern reversed when using datasets specifically related to seed germination or heat stress, with the majority of regulatory interactions flowing from TransGeneSelector to Random Forest genes. Notably, many of these were new emergent relationships that were not previously evident. More strikingly, this pattern was consistently observed in both the seed germination and heat stress tasks. This suggests that TransGeneSelector is better equipped to capture the intricate regulatory dynamics specific to plant responses to specific environmental conditions. It further underscores that TransGeneSelector can reliably discover key upstream regulatory genes that exert significant influence over downstream genes critical to these specific plant responses. This capability highlights its effectiveness in understanding and mapping the complex gene interactions essential for plant adaptation to varying environmental challenges.
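The direction analysis described above amounts to counting, in an inferred network, how many regulator-to-target edges run from one gene set to the other. The sketch below assumes the network is available as a list of (regulator, target) pairs. The gene names and edges are hypothetical, and genes selected by both methods are excluded so that each edge has an unambiguous direction.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def count_directions(edges: Iterable[Tuple[str, str]],
                     tgs_genes: Set[str], rf_genes: Set[str]) -> Counter:
    """Count directed edges between method-specific gene sets."""
    tgs_only = tgs_genes - rf_genes  # drop genes shared by both methods
    rf_only = rf_genes - tgs_genes
    counts: Counter = Counter()
    for regulator, target in edges:
        if regulator in tgs_only and target in rf_only:
            counts["tgs_to_rf"] += 1
        elif regulator in rf_only and target in tgs_only:
            counts["rf_to_tgs"] += 1
    return counts


# Hypothetical network edges as (regulator, target) pairs.
edges = [("T1", "R1"), ("T1", "R2"), ("T2", "R1"), ("R2", "T2"), ("T1", "T2")]
counts = count_directions(edges, {"T1", "T2", "S1"}, {"R1", "R2", "S1"})
print(dict(counts))
```

A majority of `tgs_to_rf` edges, as in this toy example, corresponds to the pattern reported for the task-specific datasets, in which TransGeneSelector genes sit upstream of Random Forest genes.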
In summary, while many studies have employed machine learning methods to mine genes [12,13,14,15], they have not focused on the interconnections between genes. This research serves as a complement. Through the integration of the Transformer’s multi-head self-attention mechanism and the SHAP interpretable machine learning method, TransGeneSelector effectively identified pivotal upstream genes, as well as functionally interrelated genes involved in the complex processes of seed germination and heat stress. These genes could not be identified by Random Forest, nor did the genes discovered by Random Forest exhibit the same degree of functional interconnectedness. This distinction likely provides a reasonable explanation for the small overlap between the genes identified by TransGeneSelector and those identified by Random Forest. It suggests that the two methods focus on different aspects of gene mining, with TransGeneSelector placing more emphasis on the relationships between global features, while Random Forest tends to prioritize local or more functionally isolated genes. This suggests TransGeneSelector could be a useful tool for tasks requiring both gene mining and the capture of gene interconnections.
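To make the role of SHAP concrete, the following sketch computes exact Shapley values for a tiny model by enumerating all feature subsets, then ranks the features by absolute contribution. It is a didactic illustration only. The SHAP library relies on approximations for realistic gene counts, and the toy model, gene names, and all-zero baseline are hypothetical rather than taken from the study.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of `model` at `x`, with absent features set to
    `baseline`. Cost grows exponentially, so use only for a few features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without)
                with_i[i] = x[i]
                # Marginal contribution of feature i given this coalition.
                phi[i] += weight * (model(with_i) - model(without))
    return phi


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical model: one additive gene plus an interacting gene pair."""
    return 2.0 * v[0] + v[1] * v[2]


genes = ["geneA", "geneB", "geneC"]  # placeholder names
phi = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
ranking = sorted(zip(genes, phi), key=lambda item: abs(item[1]), reverse=True)
print(ranking)
```

In this toy case the interaction term is split equally between the two interacting genes. This illustrates how attribution-based ranking can credit genes whose effect depends on other genes, rather than only genes with strong individual effects.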
Moreover, we validated the gene mining capability of TransGeneSelector using seed germination experiments. Through large-scale RT-qPCR experiments, we confirmed that the genes identified by both TransGeneSelector and Random Forest are involved in seed germination, validating the utility of our developed method. Notably, TransGeneSelector demonstrated a superior ability compared to Random Forest in effectively capturing key dynamics associated with seed germination. In contrast to the higher variability observed in Random Forest selections, genes identified by TransGeneSelector exhibited highly consistent and uniform expression changes in response to major germination conditions and progression. This observation, in conjunction with gene regulatory network analysis, further confirms TransGeneSelector’s balanced and robust gene discovery capabilities focused on specific biological processes.
However, it is crucial to recognize that both TransGeneSelector and Random Forest demonstrate a strong ability to mine important genes involved in specific physiological processes, as evidenced by our GO annotation analyses and RT-qPCR validation. While TransGeneSelector excels at uncovering upstream regulators and functionally interconnected gene networks, Random Forest can effectively pinpoint genes exhibiting strong and direct expression changes related to the phenotype. The limited overlap in the gene sets identified by each method suggests that they capture different facets of the underlying biological mechanisms. Specifically, the shared genes, particularly those related to heat stress response, likely represent core components crucial for the observed phenotype, highlighting the importance of these overlapping genes. Therefore, rather than viewing these methods as mutually exclusive, a more holistic approach would be to consider them as complementary tools. Combining the insights from both TransGeneSelector and Random Forest could provide a more comprehensive understanding of the gene regulatory landscape underlying plant responses to environmental stimuli, leveraging the strengths of both deep learning’s ability to capture complex relationships and traditional machine learning’s efficiency in identifying direct phenotypic associations. However, in a scenario where a choice between the two methods is necessary, the decision hinges on the specific research question. If the primary focus is on unraveling the intricate interplay between genes and identifying crucial regulatory elements within a network, then TransGeneSelector emerges as the more appropriate choice due to its capacity to uncover upstream regulators and functionally interconnected genes. 
Conversely, when the objective leans towards the efficient identification of key marker genes directly associated with the phenotype, the simplicity and speed of Random Forest make it a preferable option for quickly pinpointing genes exhibiting significant expression changes.
Despite these successes, TransGeneSelector, which comprises a complex system of three separate neural networks, presents challenges in terms of operational complexity and offers scope for further improvement. For example, optimizing parameters in the WGAN-GP module and selecting suitable thresholds in the additional classifier require refinement, as improper tuning could lead to biased results, such as the gene expression patterns observed in unoptimized models. This underscores the importance of meticulous optimization to avoid overfitting to the training data. A significant direction for future research lies in enhancing the user-friendliness and simplifying the training process of TransGeneSelector. Developing more automated parameter tuning strategies and more intuitive interfaces would greatly improve its accessibility for a wider range of researchers. It is important to acknowledge that currently, Random Forest offers a significantly simpler and more straightforward user experience, requiring less parameter tuning and expertise. Therefore, future development of TransGeneSelector should prioritize closing this usability gap by focusing on streamlining the workflow and reducing the technical burden for users. Another limitation is that our validation was conducted exclusively in Arabidopsis thaliana, leaving the cross-species applicability of TransGeneSelector uncertain, as different organisms possess diverse gene regulatory mechanisms and expression patterns that might influence model performance. Therefore, future research should extend TransGeneSelector to a wider range of species to validate its broader biological applicability. Furthermore, for more complex classification tasks, it is advisable to disable the additional classifier to ensure the diversity of the generated samples.
TransGeneSelector is currently designed for binary classification tasks, so future research must also explore extending it to handle multi-class classification tasks, aligning with the diverse needs of small-sample data applications. This expansion would enhance its applicability across a broader range of biological phenomena, increasing its utility in the life sciences domain.
In conclusion, the advent of TransGeneSelector for key gene mining within small-sample transcriptomic datasets marks a significant advancement, offering a robust tool for mining key genes involved in vital life processes. By bridging the gaps of traditional methodologies and infusing the strengths of deep learning, this study not only furnishes a potent practical tool but also inaugurates a new perspective in gene mining research for plants and other organisms using Transformer networks.
Data availability
The data utilized in this study were sourced from publicly available repositories. The gene expression data were extracted from the NCBI GEO (Gene Expression Omnibus) database (https://www.ncbi.nlm.nih.gov/gds/) and the Expression Atlas database (https://www.ebi.ac.uk/gxa/experiments). Specifically, the experiments GSE116069, GSE161704, GSE163057, GSE167244, GSE179008, GSE155710, GSE158444, GSE184983, GSE200247, GSE212019, GSE232094, GSE239833, and GSE244763 from NCBI were selected for the training and testing of the TransGeneSelector and Random Forest models. For additional MERLIN network analysis, experiments E-CURD-1, E-GEOD-30720, E-GEOD-52806, E-GEOD-64740, E-MTAB-4202, E-MTAB-7933, E-MTAB-7978 from Expression Atlas and GSE199116 from NCBI were included. These datasets are publicly accessible and can be found in the respective databases.
References
Fregene M, Okogbenin E, Mba C, Angel F, Suarez MC, Janneth G, Chavarriaga P, Roca W, Bonierbale M, Tohme J. Genome mapping in cassava improvement: challenges, achievements and opportunities. Euphytica. 2001;120(1):159–65.
Wang Y, Yu H, Tian C, Sajjad M, Gao C, Tong Y, Wang X, Jiao Y. Transcriptome association identifies regulators of wheat Spike architecture. Plant Physiol. 2017;175(2):746–57.
Westerman KE, Majarian TD, Giulianini F, Jang D-K, Miao J, Florez JC, Chen H, Chasman DI, Udler MS, Manning AK, et al. Variance-quantitative trait loci enable systematic discovery of gene-environment interactions for cardiometabolic serum biomarkers. Nat Commun. 2022;13(1):3993.
Huang K, Mo P, Deng A, Xie P, Wang Y. Differences in the chloroplast genome and its regulatory network among Cathaya argyrophylla populations from different locations in China. Genes. 2022; 13.
Su C, Tong J, Wang F. Mining genetic and transcriptomic data using machine learning approaches in Parkinson’s disease. Npj Park Dis. 2020;6(1):24.
Wang H, Tian Q, Zhang J, Liu H, Zhang J, Cao W, Zhang X, Li X, Wu L, Song M, et al. Blood transcriptome profiling as potential biomarkers of suboptimal health status: potential utility of novel biomarkers for predictive, preventive, and personalized medicine strategy. EPMA J. 2021;12(2):103–15.
Florez JC. Mining the genome for therapeutic targets. Diabetes. 2017;66(7):1770–8.
Soltis PS, Soltis DE. Plant genomes: markers of evolutionary history and drivers of evolutionary change. Plants People Planet. 2021;3(1):74–82.
Mutz K-O, Heilkenbrinker A, Lönne M, Walter J-G, Stahl F. Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol. 2013;24(1):22–30.
Chen P, Chen T, Li Z, Jia R, Luo D, Tang M, Lu H, Hu Y, Yue J, Huang Z. Transcriptome analysis revealed key genes and pathways related to cadmium-stress tolerance in Kenaf (Hibiscus cannabinus L). Ind Crop Prod. 2020;158:112970.
Cao F, Chen F, Sun H, Zhang G, Chen Z-H, Wu F. Genome-wide transcriptome and functional analysis of two contrasting genotypes reveals key genes for cadmium tolerance in barley. BMC Genomics. 2014;15(1).
Li X, Zhou X, Ding S, Chen L, Feng K, Li H, Huang T, Cai Y-D. Identification of transcriptome biomarkers for severe COVID-19 with machine learning methods. Biomolecules. 2022;12:1735.
Yu G-E, Shin Y, Subramaniyam S, Kang S-H, Lee S-M, Cho C, Lee S-S, Kim C-K. Machine learning, transcriptome, and genotyping chip analyses provide insights into SNP markers identifying flower color in Platycodon grandiflorus. Sci Rep. 2021;11(1):8019.
Pal T, Jaiswal V, Chauhan RS. DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants. Comput Biol Med. 2016;78:42–8.
Chen W, Alexandre PA, Ribeiro G, Fukumasu H, Sun W, Reverter A, Li Y. Identification of predictor genes for feed efficiency in beef cattle by applying machine learning methods to multi-tissue transcriptome data. Front Genet. 2021;12.
Crombach A, Wotton KR, Cicin-Sain D, Ashyraliyev M, Jaeger J. Efficient reverse-engineering of a developmental gene regulatory network. PLoS Comput Biol. 2012;8(7):e1002589.
Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008;9(10):770–80.
Sakellaropoulos T, Vougas K, Narang S, Koinis F, Kotsinas A, Polyzos A, Moss TJ, Piha-Paul S, Zhou H, Kardala E, et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep. 2019;29(11):3367–3373.e4.
Sau BB, Balasubramanian VN. Deep model compression: distilling knowledge from noisy teachers. Preprint at arXiv; 2016.
Saxe A, Nelli S, Summerfield C. If deep learning is the answer, what is the question? Nat Rev Neurosci. 2020;22(1):55–67.
Pacal I, Karaboga D, Basturk A, Akay B, Nalbantoglu U. A comprehensive review of deep learning in colon cancer. Comput Biol Med. 2020;126:104003.
Wu M, Chen L. Image recognition based on deep learning. 2015 Chin Autom Congress (CAC). 2015:542–6.
Suryanarayana G, Lago J, Geysen D, Aleksiejuk P, Johansson C. Thermal load forecasting in district heating networks using deep learning and advanced feature selection methods. Energy. 2018;157:141–9.
Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint arXiv; 2014.
Shewalkar A, Nyavanandi D, Ludwig SA. Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU. J Artif Intell Soft Comput Res. 2019;9(4):235–45.
Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
Ma X, Zhang P, Zhang S, Duan N, Hou Y, Zhou M, Song D. A tensorized transformer for language modeling. Adv Neural Inf Process Syst. 2019;32.
Schrimpf M, Blank IA, Tuckute G, Kauf C, Hosseini EA, Kanwisher N, Tenenbaum JB, Fedorenko E. The neural architecture of Language: integrative modeling converges on predictive processing. Proc Natl Acad Sci USA. 2021;118(45):e2105646118.
Yan H, Deng B, Li X, Qiu X. TENER: adapting transformer encoder for named entity recognition. Preprint arXiv; 2019.
Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han J-DJ. Transformer for one stop interpretable cell type annotation. Nat Commun. 2023;14(1):223.
Xu J, Zhang A, Liu F, Zhang X. STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Bioinformatics. 2023;39(4).
Zhang T-H, Hasib MM, Chiu Y-C, Han Z-F, Jin Y-F, Flores M, Chen Y, Huang Y. Transformer for gene expression modeling (T-GEM): an interpretable deep learning model for gene expression-based phenotype predictions. Cancers. 2022;14(19):4763.
Khan A, Lee B. DeepGene transformer: transformer for the gene expression-based classification of cancer subtypes. Expert Syst Appl. 2023;226:120047.
Milicevic M, Zubrinic K, Obradovic I, Sjekavica T. Data augmentation and transfer learning for limited dataset ship classification. WSEAS Trans Syst Control. 2018;13(1):460–5.
Reyes-Nava A, Sánchez JS, Alejo R, Flores-Fuentes AA, Rendón-Lara E. Performance analysis of deep neural networks for classification of gene-expression microarrays. In: Pattern Recognition. Cham: Springer International Publishing; 2018. pp. 105–15.
Xiao WH, Qu XL, Li XM, Sun YL, Zhao HX, Wang S, Zhou X. Identification of commonly dysregulated genes in colorectal cancer by integrating analysis of RNA-Seq data and qRT-PCR validation. Cancer Gene Ther. 2015;22(5):278–84.
Rajput D, Wang W-J, Chen C-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics. 2023;24(1):48.
Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.
Liu Y, Zhou Y, Liu X, Dong F, Wang C, Wang Z. Wasserstein GAN-based small-sample augmentation for new-generation artificial intelligence: A case study of cancer-staging data in biology. Engineering. 2019;5(1):156–63.
Marouf M, Machart P, Bansal V, Kilian C, Magruder DS, Krebs CF, Bonn S. Realistic in Silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat Commun. 2020;11(1):166.
Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Doina P, Yee Whye T, editors. Proceedings of the 34th International Conference on Machine Learning, vol. 70. Proceedings of Machine Learning Research: PMLR; 2017. pp. 214–223.
Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.
Rodríguez-Pérez R, Bajorath J. Interpretation of compound activity predictions from complex machine learning models using local approximations and Shapley values. J Med Chem. 2019;63(16):8761–77.
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial Nets. Adv Neural Inf Process Syst. 2014;27.
He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on imagenet classification. Proc IEEE Int Conf Comput Vis 2015:1026–34.
Zhao S, Ye Z, Stanton R. Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA. 2020;26(8):903–9.
Ren J, He T, Li Y, Liu S, Du Y, Jiang Y, Wu C. Network-based regularization for high dimensional SNP data in the case–control study of type 2 diabetes. BMC Genet. 2017;18(1):44.
Roy S, Lagree S, Hou Z, Thomson JA, Stewart R, Gasch AP. Integrated module and gene-specific regulatory inference implicates upstream signaling networks. PLoS Comput Biol. 2013;9(10):e1003252.
Huang K, Zhou S, Shen K, Zhou Y, Wang F, Jiang X. Elucidation of the miR164c-guided gene/protein interaction network controlling seed Vigor in rice. Front Plant Sci. 2020;11.
van Waveren C, Moraes CT. Transcriptional co-expression and co-regulation of genes coding for components of the oxidative phosphorylation system. BMC Genomics. 2008;9(1):18.
Tian R, Xu S, Chai S, Yin D, Zakon H, Yang G. Stronger selective constraint on downstream genes in the oxidative phosphorylation pathway of cetaceans. J Evol Biol. 2018;31(2):217–28.
Shutov AD, Vaintraub IA. Degradation of storage proteins in germinating seeds. Phytochemistry. 1987;26(6):1557–66.
Oracz K, Stawska M. Cellular recycling of proteins in seed dormancy alleviation and germination. Front Plant Sci. 2016;7.
Müntz K, Belozersky MA, Dunaevsky YE, Schlereth A, Tiedemann J. Stored proteinases and the initiation of storage protein mobilization in seeds during germination and seedling growth. J Exp Bot. 2001;52(362):1741–52.
Fountain DW, Bewley JD. Lettuce seed germination: modulation of pregermination protein synthesis by gibberellic acid, abscisic acid, and cytokinin 1. Plant Physiol. 1976;58(4):530–6.
Galland M, Huguet R, Arc E, Cueff G, Job D, Rajjou L. Dynamic proteomics emphasizes the importance of selective mRNA translation and protein turnover during Arabidopsis seed germination. Mol Cell Proteomics. 2014;13(1):252–68.
Marcus A, Feeley J. Activation of protein synthesis in the imbibition phase of seed germination. 1964;51(6):1075–9.
Navrot N, Rouhier N, Gelhaye E, Jacquot J-P. Reactive oxygen species generation and antioxidant systems in plant mitochondria. Physiol Plant. 2007;129(1):185–95.
Liu M, Ju Y, Min Z, Fang Y, Meng J. Transcriptome analysis of grape leaves reveals insights into response to heat acclimation. Sci Hort. 2020;272:109554.
Prasad M, Kataria P, Ningaraju S, Buddidathi R, Bankapalli K, Swetha C, Susarla G, Venkatesan R, D’Silva P, Shivaprasad PV. Double DJ-1 domain containing Arabidopsis DJ-1D is a robust macromolecule deglycase. New Phytol. 2022;236(3):1061–74.
Baena-González E. Energy signaling in the regulation of gene expression during stress. Mol Plant. 2010;3(2):300–13.
Qi F, Zhang F. Cell cycle regulation in the plant response to stress. 2020;10.
Fábián A, Péntek BK, Soós V, Sági L. Heat stress during male meiosis impairs cytoskeletal organization, spindle assembly and tapetum degeneration in wheat. 2024;14.
Smertenko A, Dráber P, Viklický V, Opatrný Z. Heat stress affects the organization of microtubules and cell division in Nicotiana tabacum cells. Plant Cell Environ. 1997;20(12):1534–42.
Lacan A, Sebag M, Hanczar B. GAN-based data augmentation for transcriptomics: survey and comparative assessment. Bioinformatics. 2023;39(Supplement1):i111–20.
Guttà C, Morhard C, Rehm M. Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer. PLoS Comput Biol. 2023;19(4):e1011035.
Acknowledgements
We extend our gratitude to Li Zeng for his valuable advice and guidance on the paper. This work was supported by the Aid Program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province.
Funding
The study was supported by the Scientific Research Fund of Hunan Provincial Education Department [24B0620, 24B0619, 22A0487], the National Natural Science Foundation of China [32072125, 8230449], the Natural Science Foundation of Hunan Province [2023JJ30436, 2022JJ50249, 2024JJ4031, 2022JJ40291], the Central Guidance Fund for Science and Technology Development in Hunan [2023ZYC012], the Key Research Project of Hunan University of Arts and Science [E06022005], the Key Project of the Education Department of Hunan Province [23A0505], and the Science and Technology Innovation Guidance Project of Changde City [2023ZD03].
Author information
Contributions
X.J., Y.W., and P.X. conceptualized the study and reviewed and edited the manuscript. K.H. conceptualized the study, developed the software and methodology, conducted experiments, wrote the original draft, and reviewed and edited the manuscript. J.T. developed the methodology, conducted experiments, and reviewed and edited the manuscript. L.S., H.H., and X.H. conducted experiments and reviewed and edited the manuscript. S.Z. and A.D. developed the methodology and reviewed and edited the manuscript. Z.Z. and M.J. prepared the visualizations and reviewed and edited the manuscript. G.L. prepared the visualizations and reviewed and edited the manuscript. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Huang, K., Tian, J., Sun, L. et al. TransGeneSelector: using a transformer approach to mine key genes from small transcriptomic datasets in plant responses to various environments. BMC Genomics 26, 259 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-025-11434-y
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-025-11434-y