Table 2 Quantitative assessment of model performance for the heat stress-related task

From: TransGeneSelector: using a transformer approach to mine key genes from small transcriptomic datasets in plant responses to various environments

| Methods | Accuracy | Precision | Recall | F1 | AUC |
| --- | --- | --- | --- | --- | --- |
| TransGeneSelector | 0.9623 | 0.9643 | 0.9643 | 0.9643 | **0.9871** |
| TransGeneSelector (mix-up) | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9500 |
| TransGeneSelector (MLP) | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9629 |
| Random Forest with default parameters | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9586 |
| Random Forest with 8 genes | 0.9245 | 0.9000 | 0.9643 | 0.9310 | 0.9457 |
| Random Forest with 11 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9471 |
| Random Forest with 41 genes | 0.9245 | 0.9000 | 0.9643 | 0.9310 | 0.9464 |
| Random Forest with 51 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9500 |
| Random Forest with 148 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9557 |
| Random Forest with 449 genes | 0.8679 | 0.8889 | 0.8571 | 0.8727 | 0.9507 |
| NR-LR-MCP | 0.9245 | 0.9286 | 0.9286 | 0.9286 | 0.9443 |
| SVM with default parameters | 0.8302 | 0.8800 | 0.7857 | 0.8302 | 0.9429 |
| SVM with 8 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9471 |
| SVM with 11 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9186 |
| SVM with 41 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9743 |
| SVM with 51 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9271 |
| SVM with 148 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9271 |
| SVM with 449 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.9507 |
| KNN with 8 genes | 0.8679 | 0.8889 | 0.8571 | 0.8727 | 0.8143 |
| KNN with 11 genes | 0.9245 | 0.9000 | 0.9643 | 0.9310 | 0.8800 |
| KNN with 41 genes | 0.8868 | 0.8929 | 0.8929 | 0.8929 | 0.8643 |
| KNN with 51 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.8800 |
| KNN with 148 genes | 0.9434 | 0.9032 | 1.0000 | 0.9492 | 0.8800 |
| KNN with 449 genes | 0.9245 | 0.9000 | 0.9643 | 0.9310 | 0.8643 |

Note: This table reports the performance of the various models on a test set. Variants of TransGeneSelector that substituted the WGAN component with a mix-up component or replaced the Transformer component with an MLP are included, with the best-performing configuration of each variant reported. Random Forest models were trained with feature engineering on gene sets of 8, 11, 41, 51, 148, and 449 genes, chosen because these set sizes achieved the highest cross-validation accuracy. The SVM and KNN models used the genes selected by the Random Forest model. 'NR-LR-MCP' denotes the best-performing Network-Regularized Logistic Regression model with Minimax Concave Penalty. 'AUC' stands for Area Under the Curve, which measures the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. The highest AUC value is highlighted in bold.
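The note's probabilistic reading of AUC, together with the threshold-based metrics in the table header, translates directly into code. The sketch below is illustrative only and is not the authors' evaluation pipeline: the function names (`evaluate`, `pairwise_auc`), the variable names (`y_true`, `y_score`), the 0.5 decision threshold, and the use of scikit-learn are all assumptions introduced here for clarity.

```python
# Illustrative sketch (not the paper's code): computing the five metrics
# reported in Table 2 for a binary heat-stress classifier, assuming y_true
# holds 0/1 labels and y_score holds predicted probabilities for class 1.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Return Accuracy, Precision, Recall, F1, and AUC for one test set."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        # AUC uses the ranking scores directly, not thresholded predictions.
        "AUC": roc_auc_score(y_true, y_score),
    }

def pairwise_auc(y_true, y_score):
    """AUC computed from its definition in the table note: the probability
    that a random positive instance is ranked above a random negative one
    (ties counted as one half)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

The pairwise formulation and `roc_auc_score` give the same value; the former makes explicit why a model can reach a Recall of 1.0000 at the 0.5 threshold while its AUC, which depends on the full ranking of test samples, still differs between methods.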