Ge-SAND: an explainable deep learning-driven framework for disease risk prediction by uncovering complex genetic interactions in parallel

Ye, Lihang; Zhang, Liubin; Tang, Bin; Liang, Junhao; Tan, Ruijie; Jiang, Hui; Peng, Wenjie; Lin, Nan; Li, Kun; Xue, Chao; Li, Miaoxin

doi:10.1186/s12864-025-11588-9

Research
Open access
Published: 01 May 2025

BMC Genomics volume 26, Article number: 432 (2025)

Abstract

Background

Accurate genetic risk prediction and understanding the mechanisms underlying complex diseases are essential for effective intervention and precision medicine. However, current methods often struggle to capture the intricate and subtle genetic interactions contributing to disease risk. This challenge may be further exacerbated by the curse of dimensionality when considering large-scale pairwise genetic combinations with limited samples. Overcoming these limitations could transform biomedicine by providing deeper insights into disease mechanisms, moving beyond black-box models and single-locus analyses, and enabling a more comprehensive understanding of cross-disease patterns.

Results

We developed Ge-SAND (Genomic Embedding Self-Attention Neurodynamic Decoder), an explainable deep learning-driven framework designed to uncover complex genetic interactions at scales exceeding 10⁶ in parallel for accurate disease risk prediction. Ge-SAND leverages genotype and genomic positional information to identify both intra- and interchromosomal interactions associated with disease phenotypes, providing comprehensive insights into pathogenic mechanisms crucial for disease risk prediction. Applied to simulated datasets and UK Biobank cohorts for Crohn’s disease, schizophrenia, and Alzheimer’s disease, Ge-SAND achieved up to a 20% improvement in AUC-ROC compared to mainstream methods. Beyond its predictive accuracy, through self-attention-based interaction networks, Ge-SAND provided insights into large-scale genotype relationships and revealed genetic mechanisms underlying these complex diseases. For instance, Ge-SAND identified potential genetic interaction pairs, including novel relationships such as ISOC1 and HOMER2, potentially implicating the brain-gut axis in Crohn’s and Alzheimer’s diseases.

Conclusion

Ge-SAND is a novel deep-learning approach designed to address the challenges of capturing large-scale genetic interactions. By integrating disease risk prediction with interpretable insights into genetic mechanisms, Ge-SAND offers a valuable tool for advancing genomic research and precision medicine.

Background

Understanding genetic risk and mechanisms underlying complex diseases is critical for early intervention and precise treatment [1,2,3]. Genome-wide association studies (GWAS) have been instrumental in identifying genetic variants. However, the relationship between genotypes and most phenotypes is highly intricate [4,5,6]. Despite analyzing data from cohorts of over 100,000 individuals, GWAS typically explains only a small portion of heritability for most traits—a phenomenon known as missing heritability [7, 8]. A key contributor to this missing heritability is thought to be deep interactions between single nucleotide polymorphisms (SNPs) [9]. These interactions are often complex, implicit, and difficult to define mathematically, and remain largely unexplored for disease risk prediction. The challenge is further compounded by the curse of dimensionality when all possible pairwise genetic combinations are considered. Thus, to accurately predict disease risk and understand the mechanisms of complex diseases, we need to unravel these intricate interactions and their relationships with diseases.

Previous studies have developed numerous strategies for capturing complex relationships between genomic data and phenotypes to predict disease risk [10,11,12,13]. Among these, machine learning (ML) techniques have been widely applied and achieved some success [14,15,16]. For instance, Elgart et al. (2022) introduced a nonlinear ML model that improved polygenic predictions across diverse human populations using the XGBoost algorithm [17]. Similarly, Kafaie et al. employed support vector machines (SVM), logistic regression (LG), and random forests (RF) to explore the link between colorectal cancer and genetic markers, achieving promising results [18]. Although ML methods can uncover nonlinear associations, their interpretability remains limited. To address this, Qin et al. used XGBoost to predict asthma risk, identifying risk-related SNPs through feature importance analysis [19]. López et al. combined RF and SVM with SNP data to predict type 2 diabetes risk, leveraging feature importance and weight-based analyses to enhance model interpretability [20]. However, while such models can effectively explain the contributions of individual features, they often fall short of capturing the interactions between genotypes that are essential for a more comprehensive understanding of disease mechanisms. Traditional approaches, such as methods based on pairwise regression [21, 22] and chi-square tests [23, 24], have been widely employed to capture interactions between genotypes, but they are limited in scope: they test one pair at a time and often fail to capture multi-genotype interactions that may jointly influence phenotypes. As the number of candidate genotype combinations grows rapidly with the number of SNPs (quadratically for pairs alone) while sample size remains constrained, these approaches face significant challenges, including the curse of dimensionality and rapidly increasing computational complexity, which compromise their effectiveness. Moreover, such methods typically assume predefined forms of interaction effects. This assumption is limiting, as gene–gene interactions are often complex, implicit, and difficult to describe using explicit mathematical formulations. Addressing these limitations requires innovative approaches.

In recent years, the self-attention model, originally developed for natural language processing, has emerged as a powerful tool for capturing long-range dependencies between features in parallel [25]. The attention score matrices generated by this model offer insights into feature relationships, making it potentially well-suited for capturing the complex genotype interactions associated with phenotypes. However, further technical extensions are needed before this approach can be effectively applied to disease risk prediction. First, embedding methods, which convert raw genomic data into structured representations suitable for the self-attention model, need to be tailored [26, 27]: existing embedding methods were not designed for genomic data, and refining them may improve the integration of genomic information with the self-attention model. Second, the features captured by the self-attention model often contain valuable hidden information [28, 29], necessitating an optimized classification framework. Third, optimizing the training framework for models with many parameters, especially with small samples and high-dimensional features, is essential to enhance the model's generalization [30, 31].

In this study, we propose a genomic embedding self-attention neurodynamic decoder (Ge-SAND) shown in Fig. 1, an innovative deep learning-driven framework, designed for accurate disease risk prediction by capturing large-scale genetic interactions in parallel. Ge-SAND employs a novel genomic embedding strategy combined with a self-attention module, allowing for the integration of genomic loci and genotype data. This combined approach enables the identification of both intra- and interchromosomal interactions, unveiling potential pathogenic mechanisms crucial for accurate disease risk prediction. For precise prediction, Ge-SAND incorporates the Gemini neurodynamic learning network designed to detect deep feature-phenotype relationships while mitigating data leakage. For interpretability, we quantified the interaction strengths between genotypes using attention scores and further validated their significance through statistical analyses. We evaluated Ge-SAND across simulated scenarios and three genetically related diseases (Crohn’s disease, schizophrenia, and Alzheimer’s disease), demonstrating its superior performance compared to mainstream machine learning methods.

Methods

Simulation data

Subjects' genotypes and phenotypes were simulated using a liability-threshold model [32, 33]. Six quantitative relationships between multiple variant genotypes and liability are covered by Eq. (1), including three basic models and three pair-wise combination models:

$$\begin{array}{c}Z={b}_{1}{\sum }_{i=1}^{{k}_{1}}{\alpha }_{i}{X}_{i}+{b}_{2}{\sum }_{i=1}^{{k}_{2}-1}{\sum }_{j=i+1}^{{k}_{2}}{\beta }_{ij}{X}_{i}{X}_{j} \\ +{b}_{3}{\sum }_{i=1}^{{k}_{3}-2}{\sum }_{j=i+1}^{{k}_{3}-1}{\sum }_{g=j+1}^{{k}_{3}}{\gamma }_{ijg}{X}_{i}{X}_{j}{X}_{g}+\epsilon \end{array}$$

(1)

where $\alpha$, $\beta$, and $\gamma$ follow a uniform distribution between $0$ and $1$, while ${X}_{i}$ denotes the genotype of the $i$-th variant. ${k}_{1}$, ${k}_{2}$, and ${k}_{3}$ are the numbers of quantitative trait loci (QTLs) entering the linear, quadratic, and cubic terms, respectively. Moreover, $\epsilon$ represents the environmental factor and follows $N\left(0,{\sigma }^{2}\right)$. For the linear (LN) type, ${b}_{1}=1$, ${b}_{2}=0$, ${b}_{3}=0$, and ${k}_{1}=20$. For the quadratic (QD) type, ${b}_{1}=0$, ${b}_{2}=1$, ${b}_{3}=0$, and ${k}_{2}=20$. For the cubic (CB) type, ${b}_{1}=0$, ${b}_{2}=0$, ${b}_{3}=1$, and ${k}_{3}=20$. For LN + QD, ${b}_{1}=1$, ${b}_{2}=1$, ${b}_{3}=0$, ${k}_{1}=10$, and ${k}_{2}=10$. For LN + CB, ${b}_{1}=1$, ${b}_{2}=0$, ${b}_{3}=1$, ${k}_{1}=10$, and ${k}_{3}=10$. For QD + CB, ${b}_{1}=0$, ${b}_{2}=1$, ${b}_{3}=1$, ${k}_{2}=10$, and ${k}_{3}=10$.

Genotypes for 3,368 variants on chromosome 1 from the EUR panel of the 1000 Genomes Project [34] were generated using allele frequencies and linkage disequilibrium (LD) coefficients. Genotypes were encoded as 0, 1, or 2, corresponding to the number of alternative alleles. The sequence variants in the simulation met two criteria: pair-wise LD value (r²) of SNPs below 0.1 and a minor allele frequency (MAF) above 5%.

Given the effect sizes of alternative alleles for k QTL, environmental random effects were introduced for each individual to control the heritability (h²) of liability in the population. Each subject's liability score was calculated as the sum of the effects of all alternative alleles at QTLs and the environmental impact. A population of 10 million individuals was simulated. According to their liability scores, the top 1% of individuals were designated as cases to match a 1% disease prevalence, while the remaining individuals were labeled as controls. Case/control samples of 500, 1,000, and 2,500 were drawn from the simulated populations without replacement.
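The liability-threshold procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' simulation code: genotypes are drawn independently from binomial distributions (the study instead uses allele frequencies and LD from the 1000 Genomes EUR panel), the population is reduced to 100,000 individuals, the heritability `h2 = 0.5` is an arbitrary choice, and only the LN + QD setting of Eq. (1) is shown. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: independent SNPs with MAF in [0.05, 0.5]; the study uses LD-aware genotypes.
n_pop, n_var = 100_000, 40
maf = rng.uniform(0.05, 0.5, size=n_var)
X = rng.binomial(2, maf, size=(n_pop, n_var))  # genotypes coded 0, 1, 2


def genetic_liability(X, b1, b2, k1, k2, rng):
    """Genetic part of Eq. (1) for the linear and quadratic terms (cubic term omitted)."""
    g = np.zeros(X.shape[0])
    if b1:
        alpha = rng.uniform(0.0, 1.0, size=k1)        # alpha_i ~ U(0, 1)
        g += b1 * (X[:, :k1] @ alpha)
    if b2:
        i, j = np.triu_indices(k2, k=1)               # all pairs with i < j
        beta = rng.uniform(0.0, 1.0, size=i.size)     # beta_ij ~ U(0, 1)
        g += b2 * ((X[:, i] * X[:, j]) @ beta)
    return g


h2 = 0.5                                              # assumed heritability of liability
g = genetic_liability(X, b1=1, b2=1, k1=10, k2=10, rng=rng)   # LN + QD setting
sigma2 = g.var() * (1.0 - h2) / h2                    # environmental variance giving the target h2
z = g + rng.normal(0.0, np.sqrt(sigma2), size=n_pop)  # liability score

# The top 1% of liabilities are labeled as cases (1% prevalence).
n_cases = n_pop // 100
labels = np.zeros(n_pop, dtype=int)
labels[np.argsort(z)[-n_cases:]] = 1

# Draw a balanced case/control sample without replacement.
cases = rng.choice(np.flatnonzero(labels == 1), size=500, replace=False)
controls = rng.choice(np.flatnonzero(labels == 0), size=500, replace=False)
```

The environmental variance is set from the empirical variance of the genetic component, so the realized heritability matches the target only approximately in any finite population.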

Real data: UK Biobank

All genomic data and phenotypes in the real datasets are from the UK Biobank with the ICD-10 diagnoses [35, 36]. All individuals included in the study were born in the UK. The Crohn’s disease (CD), schizophrenia (SC), and Alzheimer’s disease (AD) sub-datasets have sample sizes of 1,194 (2,447,229 SNPs), 1,516 (93,095,623 SNPs), and 4,244 (2,456,888 SNPs), respectively. The case-to-control ratio is maintained at 1:1 in each of the three datasets.

SNPs were preprocessed and analyzed by PLINK 2.0 [37, 38] and KGGA (https://pmglab.top/kgga/) based on the following criteria: (1) subject call rate ≥ 98%; (2) variant call rate ≥ 98%; (3) Hardy–Weinberg equilibrium p-value ≥ $1\times {10}^{-10}$; (4) MAF ≥ 5%; (5) LD clumping threshold ${r}^{2}<0.1$; and (6) SNP p-value for disease association ≤ 0.05. The model focused on autosomal chromosomes. After applying quality control (QC) and LD clumping, the number of SNPs per sample is 2,187 for CD, 3,000 for SC (selected by the lowest p-values), and 1,945 for AD.

The proposed Ge-SAND method

Ge-SAND is a deep-learning framework designed for accurate disease risk prediction from genotype samples by simultaneously dissecting large-scale, intricate genetic interactions between sequence variants. It uses genotype data across multiple SNPs, encoded as 0, 1, or 2, to predict the subjects' phenotypes, i.e., to distinguish control (0) from disease (1) states. Driven by phenotype labels, the model can leverage genomic information to uncover potential interactions between SNPs, enhancing disease risk prediction. Ge-SAND imposes no upper limit on the number of input SNPs; the limit depends only on hardware capacity. As shown in Fig. 1, the Ge-SAND framework integrates three neural networks (the genomic embedding self-attention network, the binary classification network, and the Gemini neurodynamic learning network) to fulfill two core functions: interpretable feature extraction (achieved by the first two networks) and improved risk prediction (performed by the third network).

Genomic embedding self-attention network

The overview of the genomic embedding self-attention network

The genomic embedding self-attention network (GESAN), shown in Fig. 1A, forms the core structure of Ge-SAND, enabling the parallel capture of large-scale genotype interactions. To integrate genomic information, GESAN uses a novel Genomic Embedding method (Fig. 1B) that extends the traditional self-attention model by combining genotype values, chromosomal locations, and positional order information. This chromosome-aware embedding is supported by evidence suggesting that intrachromosomal interactions are distinct from interchromosomal ones [39, 40].

Process Overview (Fig. 1A):

1) Each genotype (represented as 0, 1, or 2) is initially embedded into a continuous space;
2) The embedding is passed through a linear layer (maintaining input–output size), followed by a tanh activation function and a LayerNorm layer, producing a final embedding that integrates positional and genotype features;
3) The final embedding is fed into the self-attention model, which can weigh the significance of each SNP relative to others, capturing intricate SNP interactions.

The processed embeddings, excluding special tokens (e.g., [CLS] and [SEP]), are then input into the binary classification network (BCN) (see Supplementary Figure S1 and Supplementary Note 1.1 for details). BCN (Fig. 1C) guides the training process, enabling the model to deeply learn pathogenic mechanisms and extract complex features critical for disease risk prediction. This lays a solid foundation for subsequent improved risk prediction.

In terms of interpretability, the self-attention matrix generated by GESAN can quantify the interactive effects of SNPs on disease risk. Higher attention values indicate stronger genetic interactions, providing insights into the significance of specific genomic relationships.

Genomic embedding

The Genomic Embedding (see Supplementary Figure S2 for details) combines the loci and genotype information of the SNP, allowing for a comprehensive representation of SNP characteristics. It has three components:

1) Chromosome Embedding: Encodes the chromosome where a SNP is located.
2) Position Order Embedding: Encodes the position order of a SNP on the chromosome in ascending order.
3) Genotype Embedding: Encodes the genotypes of an individual at the SNP.

The chromosome and position order embeddings are summed to form the loci embedding. Then, the loci embedding is concatenated with the genotype embedding to obtain the genomic embedding. Notably, chromosome embeddings, position order embeddings, and genotype embeddings are all learnable.

In particular, the [CLS] token (serving as an aggregate representation) and the [SEP] token (marking the sequence boundary) are inserted at the beginning and end of each individual's token sequence, respectively, to define the boundaries of the sequence and allow the model to focus on the complete set of genomic tokens. To distinguish them from SNPs during model training, we set their chromosome numbers to 0 and 23, respectively, and their position order to 0. The genotype embeddings of the [CLS] and [SEP] tokens are learned from their own dedicated tokens, distinct from the SNP genotype embeddings derived from the "0", "1", and "2" tokens.

Moreover, to optimize the above integration of genotype and loci embeddings, we introduced a new hyperparameter, $\varphi$, which controls the ratio between the two embeddings during concatenation in a genomic embedding. This approach allows the model to break the traditional constraint where the token and position embeddings must be of equal size. $\varphi$ and other adjustable hyperparameters, such as the hidden embedding size and the number of self-attention heads, enable the model to effectively learn and represent the complex structure of biological sequences and genomic architecture. This design provides a more informative feature space enhancing the model's ability to learn from genetic data.
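A minimal NumPy sketch of how the three embeddings and the ratio $\varphi$ could fit together is given below. It is illustrative only: the lookup tables are random rather than learned, `phi` is interpreted here as the fraction of the hidden size allocated to the loci embedding (the text only states that it controls the ratio between the two embeddings), position order is taken as the rank within each chromosome, and the LayerNorm is written without learnable scale and shift. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, phi = 16, 0.25               # assumption: phi = share of dimensions for the loci embedding
d_loci = int(round(hidden * phi))    # width of chromosome / position-order embeddings
d_geno = hidden - d_loci             # width of genotype embedding
CLS, SEP = 3, 4                      # genotype-token ids for the special tokens (0, 1, 2 are genotypes)

# Randomly initialised stand-ins for the learnable embedding tables.
chrom_table = rng.normal(size=(24, d_loci))   # ids 0..23: 0 = [CLS], 1..22 = autosomes, 23 = [SEP]
order_table = rng.normal(size=(64, d_loci))   # position order on the chromosome (0 for special tokens)
geno_table = rng.normal(size=(5, d_geno))     # tokens "0", "1", "2", [CLS], [SEP]


def genomic_embedding(chrom, order, geno):
    """Sum chromosome and position-order embeddings, then concatenate the genotype embedding."""
    loci = chrom_table[chrom] + order_table[order]
    return np.concatenate([loci, geno_table[geno]], axis=-1)


# One individual: [CLS], two SNPs on chromosome 1, one SNP on chromosome 5, [SEP].
chrom = np.array([0, 1, 1, 5, 23])
order = np.array([0, 1, 2, 1, 0])
geno = np.array([CLS, 0, 2, 1, SEP])
emb = genomic_embedding(chrom, order, geno)

# Step 2 of the process overview: linear layer (same size), tanh, LayerNorm.
w = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
h = np.tanh(emb @ w)
final = (h - h.mean(axis=-1, keepdims=True)) / np.sqrt(h.var(axis=-1, keepdims=True) + 1e-5)
```

Because the loci and genotype parts are concatenated rather than summed, their widths need not match, which is the decoupling that $\varphi$ controls.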

Conventional embedding methods, such as BERT-style absolute positional encoding [41], are not designed for genomic applications. These methods assign a unique positional embedding to each token in a sequence, which leads to two key limitations. First, they treat positions uniformly, ignoring important biological distinctions such as intra- versus inter-chromosomal relationships. This can bias the model toward artifactual patterns, as interactions within the same chromosome differ mechanistically from those across chromosomes. Second, traditional frameworks often rigidly require positional embedding dimensions to match token embedding sizes, limiting adaptability to genomic data where positional and sequence information may require distinct representational spaces. This inflexibility may hamper the model's ability to scale to long sequences or capture nuanced genotype–phenotype relationships. In contrast, the Genomic Embedding method in Ge-SAND differentiates intra- and inter-chromosomal positions and decouples positional and token embedding dimensions, enhancing both biological relevance and computational scalability. This approach may therefore address the limitations of existing methods in capturing intra-chromosomal, inter-chromosomal, and large-scale genomic relationships.

The self-attention model

The self-attention model in GESAN captures interactions between all SNPs simultaneously, enabling the model to weigh the relative importance of each SNP with others. This enhances the model's ability to classify phenotypes and uncover potential genotype interactions by analyzing the attention scores within self-attention heads. Below, we describe the specific implementation of the self-attention model [25] used in GESAN.

For a given input sequence of SNPs of individuals, represented as genomic embeddings, we construct three matrices: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. These matrices are derived from linear transformations of the genomic embeddings.

Let $X_\text{ge}$ denote the matrix of genomic embeddings for all SNPs in the sequence of individuals. The query, key, and value matrices are computed as:

$$Q=X_\text{ge}{W}_{q}$$

(2)

$$K=X_\text{ge}{W}_{k}$$

(3)

$$V=X_\text{ge}{W}_{v}$$

(4)

where ${W}_{q}$, ${W}_{k}$, and ${W}_{v}$ are learned weight matrices. Next, the attention scores between SNPs are computed by taking the scaled dot product of the query and key matrices:

$$Attention\left(Q,K,V\right)=Softmax(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}})V$$

(5)

where ${d}_{k}$ is the dimensionality of the key vectors, and the Softmax function is applied row-wise so that the attention scores of each SNP over all SNPs sum to 1, effectively highlighting the most relevant SNP interactions. This results in a weighted sum of the rows of the value matrix $V$, where the contribution of each SNP is scaled by its attention score relative to all other SNPs.

This mechanism allows the model to dynamically focus on key SNPs that contribute most significantly to the phenotype prediction, effectively capturing interactions between SNPs critical for the binary classification task.
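Equations (2) to (5) can be written as a short single-head NumPy sketch. The weight matrices below are random placeholders for the learned ${W}_{q}$, ${W}_{k}$, and ${W}_{v}$, and the sizes are arbitrary; the multi-head arrangement and any masking used in GESAN are not reproduced here.

```python
import numpy as np


def self_attention(x_ge, w_q, w_k, w_v):
    """Scaled dot-product self-attention (Eqs. 2-5) for one individual and one head."""
    q, k, v = x_ge @ w_q, x_ge @ w_k, x_ge @ w_v           # Eqs. (2)-(4)
    scores = q @ k.T / np.sqrt(k.shape[-1])                 # Q K^T / sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability only
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v, weights                             # Eq. (5) and the attention matrix


rng = np.random.default_rng(0)
n_snps, hidden, d_k = 6, 16, 8                 # hypothetical sizes
x_ge = rng.normal(size=(n_snps, hidden))       # genomic embeddings, one row per SNP
w_q, w_k, w_v = (rng.normal(size=(hidden, d_k)) for _ in range(3))
out, attn = self_attention(x_ge, w_q, w_k, w_v)
```

Row `i` of `attn` holds the weights that SNP `i` assigns to every SNP in the sequence, which is the quantity later inspected for interaction analysis.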

Gemini neurodynamic learning network

The overview of the Gemini neurodynamic learning network

The improved risk prediction is performed by a Gemini neurodynamic learning network (GNLN) with two main characteristics, shown in Fig. 1D. First, GNLN uses a novel neurodynamic optimization model to achieve stable convergence and strong predictive performance. The crux of this neurodynamic model lies in its ability to dynamically adjust and update parameters, including the weight parameters between the hidden and output layers, achieving monotonic and stable convergence of the absolute value of the error and thereby capturing the nuanced relationships between extracted features and individual phenotypes. Second, GNLN incorporates a Gemini structure with cross-validation to mitigate overfitting and data leakage from the validation set during LD clumping, while effectively utilizing both training and validation data to enhance model performance.

The overall structure of the neurodynamic learning network

The GNLN consists of two parallel dynamic learning networks, each with independent input, hidden, and output layers. The input to these networks consists of features extracted from previous models (GESAN and BCN). Each network independently processes the features and outputs binary predictions. The predictions of the two models are averaged to make a joint decision.

The output of the dynamic learning network can be represented in the $k$-th training round mathematically as follows:

$${F}_{q}(k)=G\left(P\left(X{U}_{q}\right){W}_{q}(k)\right)$$

(6)

where:

$X\in {\mathbb{R}}^{m\times n}$ is the input feature matrix, with $m$ samples and $n$ features.
${U}_{q}\in {\mathbb{R}}^{n\times v}$ is a weight matrix connecting the input and hidden layers, where $q\in \{1,2\}$ is the bag index and $v$ denotes the number of hidden neurons.
$P\left(\cdot \right):{\mathbb{R}}^{m\times v}\to {\mathbb{R}}^{m\times v}$ is an activation function array applied to the hidden layer.
${W}_{q}\in {\mathbb{R}}^{v\times 1}$ connects the hidden and output layers.
$G\left(\cdot \right):{\mathbb{R}}^{m\times 1}\to {\mathbb{R}}^{m\times 1}$ is an activation function array for binary classification in the output layer.

For activation functions, we use:

Softsign function in $P\left(\cdot \right)$ for the hidden layer:

$$p\left(u\right)=\frac{u}{1+\left|u\right|}$$

(7)

Modified softsign function in $G\left(\cdot \right)$ for the output layer:

$$g\left(u\right) = \frac{u}{2\left(1 + |u|\right)} + \frac{1}{2}$$

(8)

Neurodynamic optimization algorithm

Neurodynamic optimization algorithms, which have demonstrated strong convergence and robust performance in fields such as pattern recognition [42, 43] and automation [44, 45] in recent years, may have potential advantages in disease prediction. Thus, we proposed a novel neurodynamic learning design formula to converge the error between the output and the label to zero:

$$E\left(k+1\right)=E\left(k\right)-\eta E\left(k\right)\odot \Phi \left(E\left(k\right)\right)$$

(9)

where $E\left(k\right)\in {\mathbb{R}}^{m\times 1}$ and $E\left(k+1\right)\in {\mathbb{R}}^{m\times 1}$ represent the deviation between the network output and the array of labels in the $k$-th and ($k$+1)-th training rounds, the hyperparameter $\eta \in \left(0,1\right)$, $\odot$ is the Hadamard product, and $\Phi \left(E\left(k\right)\right)$ is defined as

$$\Phi \left(E\left(k\right)\right)={\left(\begin{array}{c}\dot{f}({e}_{1}(k){)}^{-1}\\ \dot{f}({e}_{2}(k){)}^{-1}\\ \vdots \\ \dot{f}({e}_{m-1}(k){)}^{-1}\\ \dot{f}({e}_{m}(k){)}^{-1}\end{array}\right)}_{m\times 1}$$

(10)

where $\dot{f}\left({e}_{i}\left(k\right)\right)=df({e}_{i}\left(k\right))/d{e}_{i}\left(k\right)$, $f\left(\cdot \right)$ is a monotonically increasing odd function, and ${e}_{i}\left(k\right)$ is the $i$-th element of $E\left(k\right)$. In Eq. (10), $\dot{f}\left({e}_{i}\left(k\right)\right)\in \left(1/2,+\infty \right)$. When Eq. (9) holds, $||E{\left(k\right)}^{\text{T}}{||}_{2}\to 0$ as $k\to +\infty$. Our design of Eq. (9) is motivated by the following neurodynamic method [45]:

$$\varrho {\int }_{0}^{t} E\left(\tau \right)d\tau =\upsilon -\zeta \mathcal{F}\left(E\left(t\right)\right)$$

(11)

where $t\in \mathbf{R}$ denotes continuous time and $E\left(t\right)\in {\mathbb{R}}^{m\times 1}$ represents the deviation between the network output and target at time $t$. $\varrho$ and $\zeta$ are positive constants, and the function array $\mathcal{F}\left(\cdot \right):{\mathbb{R}}^{m\times 1}\to {\mathbb{R}}^{m\times 1}$ is a monotonically increasing odd function array. Since ${\int }_{0}^{0}E\left(\tau \right){\text{d}}\tau =0$, we set $\upsilon =\zeta \mathcal{F}\left(E\left(0\right)\right)$ so that Eq. (11) holds at $t=0$. When Eq. (11) holds, $||E{\left(t\right)}^{\text{T}}{||}_{2}\to 0$ as $t\to +\infty$. Then, ${e}_{i}\left(t\right)$ is defined as the $i$-th element of $E\left(t\right)$, where $i\in \{1,2, \dots , m\}$. For each element, Eq. (11) can be converted to

$$\varrho {\int }_{0}^{t} {e}_{i}\left(\tau \right)d\tau ={\upsilon }_{i}-\zeta f\left({e}_{i}\left(t\right)\right)$$

(12)

where ${\upsilon }_{i}$ and $f\left(\cdot \right)$ are the $i$-th scalar units in $\upsilon$ and $\mathcal{F}\left(\cdot \right)$, respectively. Taking the derivative of both sides of Eq. (12) with respect to $t$ and rearranging gives

$${\dot{e}}_{i}\left(t\right)=-\frac{\varrho {e}_{i}\left(t\right)}{\zeta \dot{f}\left({e}_{i}\left(t\right)\right)}$$

(13)

where $\dot{f}\left({e}_{i}\left(t\right)\right)=\partial f\left({e}_{i}\left(t\right)\right)/\partial {e}_{i}\left(t\right)$ and $\dot{{e}_{i}}\left(t\right)={\text{d}}{e}_{i}\left(t\right)/{\text{d}}t$. Since the neurodynamic algorithm is executed on digital computers, it is necessary to convert the temporal expressions into discrete forms for computational operations. Employing the Euler discretization formula [46], Eq. (13) is transformed as

$$\frac{\left({e}_{i}\left(k+1\right)-{e}_{i}\left(k\right)\right)}{{\hslash }}=-\frac{\varrho {e}_{i}\left(k\right)}{\zeta \dot{f}\left({e}_{i}\left(k\right)\right)}$$

(14)

where the discrete step ${\hslash }>0$ and ${\hslash }\in \mathbf{R}$, while the discrete time $k>0$ and $k\in \mathbf{Z}$. Thus, we can obtain

$${e}_{i}\left(k+1\right)={e}_{i}\left(k\right)-\frac{\eta {e}_{i}\left(k\right)}{\dot{f}\left({e}_{i}\left(k\right)\right)}$$

(15)

where $\eta = {\hslash }\varrho /\zeta>0$ and $\eta \in {\mathbb{R}}$ is a design hyperparameter of the dynamic learning network.

Theorem 1.

When Eq. (15) holds, where $\eta \in \left(\text{0,1}\right)$ and $\dot{f}\left({e}_{i}\left(k\right)\right)\in \left(1/2,+\infty \right)$, ${e}_{i}\left(k\right)$ satisfies ${e}_{i}\left(k\right)\to 0$, when $k\to +\infty$.

Proof. The analysis is divided into three parts. $1)$ ${e}_{i}\left(k\right)=0$. For ${e}_{i}\left(k\right)=0$, ${e}_{i}\left(k+1\right)={e}_{i}\left(k\right)=0$. Thus, ${e}_{i}\left(k\right)$ remains at $0$ when $k\to \infty$. $2)$ ${e}_{i}\left(k\right)<0$. For $0<\eta <1$ and $\dot{f}\left({e}_{i}\left(k\right)\right)>1/2$ for any ${e}_{i}\left(k\right)$, we have $0<\eta /\dot{f}\left({e}_{i}\left(k\right)\right)<2$, so $-{e}_{i}\left(k\right)>{e}_{i}\left(k+1\right)>{e}_{i}\left(k\right)$. $3)$ ${e}_{i}\left(k\right)>0$. Similar to $2)$, ${e}_{i}\left(k\right)>{e}_{i}\left(k+1\right)>-{e}_{i}\left(k\right)$ can be obtained. To sum up, if ${e}_{i}\left(k\right)\ne 0$, $\left|{e}_{i}\left(k+1\right)\right|<\left|{e}_{i}\left(k\right)\right|$ and when $k\to \infty$, $\left|{e}_{i}\left(k\right)\right|\to 0$. Then, when $k\to +\infty$, $||E{\left(k\right)}^{\text{T}}{||}_{2}\to 0$ can be deduced. The proof is complete. ◻

Based on Theorem 1, when Eq. (9) holds, $||E{\left(k\right)}^{\text{T}}{||}_{2}\to 0$ as $k\to +\infty$. Twelve simulations were performed to verify Theorem 1 numerically (Supplementary Figure S3). In this paper, two activation functions are used for $f\left(\cdot \right)$.

Softsign-linear-type activation function (SL-AF):

$$f\left(u\right)=\left\{\begin{array}{c}\dfrac{2u}{\left|u\right|+1},\quad\quad\text{if } |u|<1,\\ u,\quad\quad\text{otherwise}.\end{array}\right.$$

(16)

Arcsine-linear-type activation function (ASL-AF):

$$f\left(u\right)=\left\{\begin{array}{c}\dfrac{2\text{arcsin}\left(u\right)}{\pi },\quad\text{if } |u|<1,\\ u,\quad\text{otherwise}.\end{array}\right.$$

(17)
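The element-wise recursion in Eq. (15) and the statement of Theorem 1 can be checked numerically with a few lines of standard-library Python. The sketch below assumes that the softsign branch of the SL-AF is $2u/(\left|u\right|+1)$, which is the form that makes $f$ odd, monotonically increasing, and continuous at $\left|u\right|=1$; the derivatives are written out by hand, and $\eta =0.5$ and the starting errors are arbitrary choices. Function names are hypothetical.

```python
import math


def sl_dot(e):
    """Derivative of the softsign-linear f: 2/(1+|e|)^2 for |e| < 1, else 1 (assumed form)."""
    return 2.0 / (1.0 + abs(e)) ** 2 if abs(e) < 1 else 1.0


def asl_dot(e):
    """Derivative of the arcsine-linear f: 2/(pi*sqrt(1-e^2)) for |e| < 1, else 1."""
    return 2.0 / (math.pi * math.sqrt(1.0 - e * e)) if abs(e) < 1 else 1.0


def iterate_error(e0, f_dot, eta, steps):
    """Apply Eq. (15): e(k+1) = e(k) - eta * e(k) / f_dot(e(k))."""
    trajectory = [e0]
    for _ in range(steps):
        e = trajectory[-1]
        trajectory.append(e - eta * e / f_dot(e))
    return trajectory


sl_traj = iterate_error(3.0, sl_dot, eta=0.5, steps=30)
asl_traj = iterate_error(-0.9, asl_dot, eta=0.5, steps=30)
```

In both runs the absolute error shrinks at every step, as Theorem 1 predicts, because $\eta /\dot{f}$ stays strictly between 0 and 2 whenever $\eta \in (0,1)$ and $\dot{f}>1/2$.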

Weight matrix update rule

During training, the weights between the hidden and output layers are iteratively updated based on the error between the output and the label at each step. Based on Eqs. (6) and (9), the update rule in each dynamic learning network can be derived as:

$${W}_{q}\left(k+1\right)={\left(P\left(X{U}_{q}\right)\right)}^{+}\mathcal{G}\left(G\left(P\left(X{U}_{q}\right){W}_{q}\left(k\right)\right)-\eta {E}_{q}\left(k\right)\odot \Phi \left({E}_{q}\left(k\right)\right)\right)$$

(18)

where ${E}_{q}\left(k\right)$ and ${E}_{q}\left(k+1\right)$ can be defined as below.

$${E}_{q}\left(k\right)={F}_{q}\left(k\right)-Y=G\left(P\left(X{U}_{q}\right){W}_{q}\left(k\right)\right)-Y$$

(19)

$${E}_{q}\left(k+1\right)={F}_{q}\left(k+1\right)-Y=G\left(P\left(X{U}_{q}\right){W}_{q}\left(k+1\right)\right)-Y$$

(20)

$\mathcal{G}\left(\cdot \right)$ is an activation function array consisting of the inverse function ${g}^{-1}\left(\cdot \right)$ of $g\left(\cdot \right)$, and ${\left(P\left(X{U}_{q}\right)\right)}^{+}$ is the pseudoinverse of $P\left(X{U}_{q}\right)$. $Y\in {\mathbb{R}}^{m\times 1}$ is the array of true labels. The approximate expression $g{\left(\cdot \right)}^{*}$ of ${g}^{-1}\left(\cdot \right)$ is

$$g^{-1}(u) \approx g(u)^* =\begin{cases}\displaystyle \frac{2u-1}{2-2u}, & \text{if } 0.5 \le u < 1, \\[1ex]\displaystyle \frac{2u-1}{2u}, & \text{if } 0 < u < 0.5.\end{cases}$$

(21)

Thus, throughout the training process, the weight matrix ${W}_{q}\left(k\right)$ is iteratively updated to reach an optimal value, enabling the model to predict individuals' phenotypes accurately.
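The update rule in Eq. (18) can be illustrated with a small NumPy sketch of one dynamic learning network. Several choices here are assumptions made for the sake of a runnable example rather than details confirmed by the text: the input-to-hidden weights ${U}_{q}$ are random and kept fixed, the number of hidden neurons exceeds the number of samples so that the pseudoinverse reproduces the target exactly, $f$ is the softsign-linear function in the form $2u/(\left|u\right|+1)$, and the inverse of $g$ is the one obtained algebraically from Eq. (8) on $(0,1)$. Names and sizes are hypothetical.

```python
import numpy as np


def softsign(u):                      # Eq. (7), hidden-layer activation
    return u / (1.0 + np.abs(u))


def g(u):                             # Eq. (8), output activation with range (0, 1)
    return u / (2.0 * (1.0 + np.abs(u))) + 0.5


def g_inv(y):
    """Inverse of Eq. (8) on (0, 1), derived by solving y = g(u) for u."""
    return np.where(y >= 0.5, (2 * y - 1) / (2 - 2 * y), (2 * y - 1) / (2 * y))


def f_dot(e):
    """Derivative of the assumed softsign-linear f: 2/(1+|e|)^2 for |e| < 1, else 1."""
    return np.where(np.abs(e) < 1, 2.0 / (1.0 + np.abs(e)) ** 2, 1.0)


rng = np.random.default_rng(0)
m, n, v, eta = 8, 5, 16, 0.3                        # samples, features, hidden neurons, step size
X = rng.normal(size=(m, n))                         # stand-in for extracted features
Y = rng.integers(0, 2, size=(m, 1)).astype(float)   # binary labels
U = rng.normal(size=(n, v))                         # assumption: fixed random input-to-hidden weights
W = 0.1 * rng.normal(size=(v, 1))                   # hidden-to-output weights W_q(0)

H = softsign(X @ U)                                 # P(X U_q)
H_pinv = np.linalg.pinv(H)                          # (P(X U_q))^+

error_norms = []
for k in range(20):
    F = g(H @ W)                                    # Eq. (6)
    E = F - Y                                       # Eq. (19)
    error_norms.append(float(np.linalg.norm(E)))
    target = F - eta * E / f_dot(E)                 # G(...) - eta * E ⊙ Phi(E)
    W = H_pinv @ g_inv(target)                      # Eq. (18)
```

With more hidden neurons than samples, each update realizes Eq. (9) up to floating-point error, so `error_norms` decreases at every round. When the hidden layer is narrower than the sample count, the pseudoinverse gives only a least-squares fit and the decrease is no longer guaranteed by this argument.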

Gemini structure

The Gemini structure consists of two dynamic learning networks in parallel, which effectively integrate and train both the training and validation data during the feature extraction process. Each of the networks uses the training data with half of the original validation data for training, while the remaining half of the validation data is used as the validation set for the respective sub-model. Once training is complete, test data is input into both models, and the final classification result is obtained by averaging the outputs of the two models.

This approach avoids validation data leakage caused by p-value filtering during preprocessing while fully utilizing the validation data.

Model training

The genomic data were randomly split into 70% training, 20% validation, and 10% independent testing sets. We maintained a 1:1 case-to-control ratio across the three datasets to ensure unbiased model development and evaluation. Ge-SAND and other methods including LASSO, Ridge, SVM, XGBoost, the multilayer perceptron (MLP), the convolutional neural network (CNN), and the long short-term memory (LSTM) model, were employed in these case/control samples for disease risk prediction across simulation and real datasets.

The hyperparameters (such as regularization strength for LASSO and Ridge, the number of estimators for Random Forest and XGBoost, the kernel coefficient and regularization strength for SVM, and variance smoothing and priors for Naive Bayes) were optimized using grid search with ten-fold cross-validation with scikit-learn (1.3.0). To facilitate this process, we combined the training and validation sets into a new training dataset, allowing ten-fold cross-validation to be applied efficiently. Neural network methods, including MLP, CNN, and LSTM, were implemented by PyTorch (2.0.1). The MLP model consists of three fully connected layers with 256, 64, and 1 output units, respectively. The CNN model includes a 1D convolutional layer with 128 output channels and a kernel size of 3, followed by a fully connected output layer with 1 unit. The LSTM model consists of an LSTM layer with 64 hidden units, followed by a fully connected output layer with 1 unit. The models achieving the highest AUC-ROC on the validation sets were selected.

Ge-SAND was also implemented in PyTorch (2.0.1). GESAN and BCN were trained on the training set, with AUC-ROC on the validation set as the model-selection criterion. We used AdamW as the optimizer and categorical cross-entropy as the loss function during the training of GESAN and BCN. Subsequently, for GNLN training, the validation set was divided into two halves, and each half was combined with the training set to form the training data of one dynamic learning network. The remaining half of the validation data in each group served as the sub-validation set. Each dynamic learning network was trained independently, with the highest AUC-ROC on its respective sub-validation set used as the optimization target. This method optimizes data usage while avoiding data leakage.

Supplementary Tables S1 to S3 show more details of the hyperparameters used by GESAN, BCN, and GNLN for simulated scenarios and three complex disease datasets.

Model interpretation

Analysis of attention scores

To interpret the interaction between genotypes, we averaged the attention scores from all self-attention heads as below:

$$S=\frac{1}{M}\sum_{i=1}^{M} \frac{{Q}_{i}{K}_{i}^{T}}{\sqrt{d_{a}}}$$

(22)

where $S\in {\mathbb{R}}^{\varsigma \times \varsigma }$, $\varsigma$ is the number of SNPs per individual, $M$ is the number of attention heads, and $d_{a}$ is the hidden embedding size divided by $M$. Here, ${Q}_{i}\in {\mathbb{R}}^{\varsigma\times d_{a}}$ denotes the query matrix of the $i$-th attention head and ${K}_{i}^{T}\in {\mathbb{R}}^{d_{a} \times \varsigma}$ denotes the transpose of the key matrix ${K}_{i}\in {\mathbb{R}}^{\varsigma\times d_{a}}$ of the $i$-th attention head. The resulting matrix $S$ was then normalized:

$${S}_{\text{norm}}=\frac{S-\mu }{\sigma }$$

(23)

where $\mu$ and $\sigma$ are the mean and standard deviation of all elements in $S$, respectively. Subsequently, the values in the upper and lower triangular portions of ${S}_{\text{norm}}$ were symmetrically averaged, which could provide a comprehensive assessment of the overall interactions in the data.

Finally, the values in the upper triangular portion of ${S}_{\text{norm}}$ were sorted in descending order. The genotype pairs with higher rankings are considered to have stronger interactions.
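The scoring and ranking steps in Eqs. (22) and (23) can be sketched with NumPy as below. This is an illustrative sketch, not the authors' code: the function name `rank_snp_pairs` is hypothetical, the per-head query and key matrices are assumed to be stacked into arrays of shape (M, ς, d_a), the standard deviation is taken as the population standard deviation, and the diagonal (self-pairs) is excluded from the ranking.

```python
import numpy as np

def rank_snp_pairs(Q, K):
    """Q, K: arrays of shape (M, n_snps, d_a), one query/key matrix per head."""
    M, n_snps, d_a = Q.shape
    # Eq. (22): average the scaled dot products Q_i K_i^T over the M heads.
    S = np.einsum("mid,mjd->ij", Q, K) / (M * np.sqrt(d_a))
    # Eq. (23): z-score over all elements of S.
    S_norm = (S - S.mean()) / S.std()
    # Average entries (i, j) and (j, i) so each unordered pair has one score.
    S_sym = (S_norm + S_norm.T) / 2
    # Rank the strictly upper-triangular entries in descending order.
    i, j = np.triu_indices(n_snps, k=1)
    order = np.argsort(-S_sym[i, j], kind="stable")
    return [(int(i[o]), int(j[o]), float(S_sym[i[o], j[o]])) for o in order]

# Toy example: 4 heads, 6 SNPs, head dimension 8 (random stand-in values).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 6, 8))
K = rng.normal(size=(4, 6, 8))
pairs = rank_snp_pairs(Q, K)  # list of (snp_i, snp_j, score), strongest first
```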

Permutation-based p-value calculation for AUC-ROC values

To compute the p-value across 1,000 permutations, we applied Gaussian kernel density estimation [47] to fit the distribution of AUC-ROC values. Using numerical integration (Simpson's rule), we calculated the area under the density curve up to the observed value. As a larger AUC-ROC is considered more meaningful, a right-tailed p-value was calculated. The p-value was determined by subtracting the proportion of the observed area under the curve from 1.
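A minimal NumPy sketch of this calculation follows. The bandwidth rule is not stated above, so Scott's rule is assumed here; the function name `kde_right_tail_p`, the grid size, and the lower integration limit are likewise illustrative choices.

```python
import numpy as np

def kde_right_tail_p(null_aucs, observed, grid_points=2001):
    """Right-tailed p-value: 1 minus the KDE area to the left of `observed`."""
    null_aucs = np.asarray(null_aucs, dtype=float)
    n = null_aucs.size
    # Assumption: Scott's rule bandwidth for 1-D data, h = sd * n**(-1/5).
    h = null_aucs.std(ddof=1) * n ** (-1 / 5)
    lower = null_aucs.min() - 10 * h  # far enough left that the density is ~0
    if observed <= lower:
        return 1.0
    # An odd number of grid points gives the even number of intervals
    # that composite Simpson's rule requires.
    x = np.linspace(lower, observed, grid_points)
    z = (x[:, None] - null_aucs[None, :]) / h
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    dx = x[1] - x[0]
    area = dx / 3 * (density[0] + density[-1]
                     + 4 * density[1:-1:2].sum()
                     + 2 * density[2:-1:2].sum())
    return max(0.0, 1.0 - area)

# Toy null distribution standing in for 1,000 permuted AUC-ROC values.
rng = np.random.default_rng(0)
null = rng.normal(0.5, 0.02, size=1000)
p_mid = kde_right_tail_p(null, 0.5)  # observed value at the centre of the null
p_far = kde_right_tail_p(null, 0.7)  # observed value far in the right tail
```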

Empirical p-value estimation for functional enrichment

To assess the biological relevance of top-prioritized gene pairs through Gene Ontology (GO) functional enrichment, we employed the Monte Carlo method to compute empirical p-values. Specifically, we conducted 1,000 iterations of random sampling, each generating a null distribution by selecting 30 gene pairs from all possible combinations. The empirical p-value was defined as (w + 1)/(1,000 + 1), where w denotes the number of iterations where the background gene pairs exhibited equal or stronger functional associations compared to the observed pairs.
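The formula can be written as a short helper. The function name `empirical_p_value` is hypothetical, and the null counts below are contrived placeholders chosen only to show how the value of w maps to a p-value.

```python
def empirical_p_value(null_counts, observed_count):
    """(w + 1) / (n_iter + 1), where w = number of null draws >= observed."""
    w = sum(1 for c in null_counts if c >= observed_count)
    return (w + 1) / (len(null_counts) + 1)

# 1,000 iterations in which no background draw reaches the observed count (w = 0).
p_none = empirical_p_value([10] * 1000, 27)
# 1,000 iterations in which 19 background draws reach the observed count (w = 19).
p_some = empirical_p_value([24] * 19 + [10] * 981, 24)
```

Adding 1 to both numerator and denominator means the smallest attainable p-value with 1,000 iterations is 1/1,001, roughly 9.99 × 10^–4, rather than zero.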

Conventional statistical method for comparative analysis

For comparison with Ge-SAND, we implemented a conventional statistical method using a regression-based approach for pairwise feature screening. This method exhaustively evaluates all possible feature pairs through linear modeling. Specifically, all feature columns from the training dataset were extracted, generating exhaustive combinations of feature pairs. For each pair (${x}_{i},{x}_{j}$), we constructed a linear regression model with an intercept, formulated as:

$$y={\varphi }_{0}+{\varphi }_{1}{x}_{i}{x}_{j}+\varepsilon$$

(24)

where ${\varphi }_{0}$ and ${\varphi }_{1}$ are the model coefficients, ${x}_{i}$ and ${x}_{j}$ represent the genotypes, and $\varepsilon$ is the error term. The regression coefficients, ${\varphi }_{0}$ and ${\varphi }_{1}$, were estimated using the ordinary least squares method. The statistical significance of each interaction term was assessed through the p-value of ${\varphi }_{1}$ derived from the regression output using a t-test. This p-value was used to evaluate the strength of the association between the interaction of the two features and the phenotype.
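The screening in Eq. (24) can be sketched as below, under stated assumptions: the function name `screen_pairs` is hypothetical, the data are toy values, and pairs are ranked by the absolute t statistic of ${\varphi }_{1}$ rather than by the p-value itself. Because every model has the same residual degrees of freedom (n − 2), the two rankings are identical.

```python
from itertools import combinations
import numpy as np

def screen_pairs(X, y):
    """Rank SNP pairs by |t| of phi_1 in y = phi_0 + phi_1 * x_i * x_j + eps."""
    n, p = X.shape
    results = []
    for i, j in combinations(range(p), 2):
        z = X[:, i] * X[:, j]          # interaction term x_i * x_j
        zc = z - z.mean()
        sxx = zc @ zc
        if sxx == 0:                   # constant product: slope not estimable
            continue
        phi1 = zc @ (y - y.mean()) / sxx       # OLS slope
        phi0 = y.mean() - phi1 * z.mean()      # OLS intercept
        resid = y - phi0 - phi1 * z
        se = np.sqrt(resid @ resid / (n - 2) / sxx)
        results.append((i, j, phi1, phi1 / se))
    return sorted(results, key=lambda r: -abs(r[3]))

# Toy data: 400 individuals, 6 SNPs coded 0/1/2, one true interaction (SNPs 1 and 4).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 6)).astype(float)
liability = X[:, 1] * X[:, 4] + rng.normal(0.0, 1.0, size=400)
y = (liability > np.median(liability)).astype(float)  # balanced case/control labels
ranked = screen_pairs(X, y)
```

Each fit uses only the product term, so the loop over all pairs grows quadratically with the number of SNPs.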

Odds ratio

The odds ratio (OR) and corresponding 95% confidence interval (CI) for individual genotypes were calculated with genotype 0 (homozygous for the reference allele) as the reference. The OR and 95% CI were also calculated for genotype pairs, with the genotype pair 0–0 as the reference.
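As an illustration, an OR and its 95% CI can be computed from a 2 × 2 table of counts as below. The CI method is not specified above, so the common Wald interval on the log-odds scale is assumed; the function name `odds_ratio_ci` and the counts are made up for the example.

```python
import math

def odds_ratio_ci(case_exposed, control_exposed, case_ref, control_ref, z=1.96):
    """OR and Wald 95% CI for one genotype (or genotype pair) vs the reference."""
    odds_ratio = (case_exposed * control_ref) / (control_exposed * case_ref)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / case_exposed + 1 / control_exposed
                   + 1 / case_ref + 1 / control_ref)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Made-up counts: 40 cases / 20 controls carry the genotype of interest,
# 200 cases / 200 controls carry the reference genotype.
or_value, ci_low, ci_high = odds_ratio_ci(40, 20, 200, 200)
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level under this approximation.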

Computational implementation

The computational experiments in this study were conducted on a personal workstation. The system configuration comprised 32 GB of dual-channel GLOWAY DDR5 memory operating at 6000 MHz, an Intel Core i7-13700K processor with a base clock of 3.4 GHz and maximum turbo frequency up to 5.4 GHz, a PCIe 4.0 NVMe solid-state drive (aigo P7000Z series, 2 TB capacity), and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. For the Alzheimer's disease dataset analysis, the model required 35.01 s per epoch for training with a batch size of 2, corresponding to 1,484 optimization steps per epoch. Inference operations took 0.00924 s per individual sample. During the experiments, peak memory consumption was recorded at 1,784 MB for GPU operations and 1,024 MB for system RAM. Additionally, the model's total storage size was less than 15 MB.

Results

The workflow of Ge-SAND

The workflow of Ge-SAND is depicted in Fig. 2 and encompasses three principal steps. Initially, individuals with genome-wide genetic and phenotype data are stratified into training, validation, and test subsets. During the training phase, the training subset undergoes LD clumping, and SNPs with p-values above a predetermined threshold are excluded to mitigate overfitting. The remaining SNPs are utilized to train GESAN in tandem with BCN for feature extraction, with the validation subset employed for model refinement. Post-training, the self-attention matrix is extracted to elucidate genetic interactions. Subsequently, the training and validation sets are amalgamated to form an enhanced training set, which is then used to train GNLN for risk prediction based on the derived embeddings. Finally, the test subset is employed to evaluate the power of Ge-SAND.

Benchmarking Ge-SAND: unveiling performance on simulations with key interaction insights

We conducted a series of simulations to evaluate the performance of Ge-SAND. Specifically, we constructed six quantitative relationships between genotypes and phenotypes under a liability threshold model with a given heritability value (0.1). These relationships include three basic models (LN, QD, CB) and three pairwise combinations of all basic models shown in Methods.

Additionally, we assessed the performance and interpretability of Ge-SAND across three different sample sizes (1,000, 2,000, and 5,000), with 3,368 SNPs per individual. All datasets were split into training (70%), validation (20%), and testing (10%) sets, with half cases and half controls. The testing set, consisting of an independent group of subjects, was employed for performance evaluation. The genotypes for reference homozygous, heterozygous, and alternative homozygous were encoded as 0, 1, and 2, respectively. For comparison, we used nine mainstream machine learning methods: LASSO, Ridge, RF, XGBoost, Naive Bayes, SVM, MLP, CNN, and LSTM. More details can be seen in Methods.

Ge-SAND’s performance across genotype–phenotype models

The simulation results demonstrate that Ge-SAND consistently outperformed other models. As illustrated in Fig. 3A-C and detailed in Table 1, under the QD genotype–phenotype model with a sample size of 2,000, Ge-SAND achieved a Matthews correlation coefficient (MCC) advantage ranging from 0.085 to 0.249 and an F1-score improvement of 0.016 to 0.527 compared to other models. Though LASSO and Ridge have similar designs aligned with the quantitative model used in the simulation, their performance still falls short compared to Ge-SAND.

Table 1 Performance of different models for the simulation (QD, N = 2,000), Crohn’s disease, Schizophrenia, and Alzheimer’s disease testing sets

This performance trend remains consistent across various genotype–phenotype models, though the magnitude of Ge-SAND’s advantage varies depending on model complexity. To further illustrate this, we focused on the sample size of 1,000, where performance differences were more pronounced. In the LN + CB model with this sample size (Fig. 4A-C), Ge-SAND shows significant improvements in AUC-ROC (0.063 to 0.282) and area under the precision-recall curve (AUC-PR: 0.052 to 0.246). Additionally, Ge-SAND exceeds other methods in the Kolmogorov–Smirnov (KS) metric by 0.16 to 0.34, outperforming the second-best method, Naive Bayes (KS = 0.26), by approximately 61.5%. In the QD + CB model, Ge-SAND maintains its performance advantage, with improvements of 0.032 in AUC-ROC, 0.007 in AUC-PR, and 0.04 in KS compared to XGBoost. The gap between Ge-SAND and the other methods is narrower here, suggesting that model complexity may influence the degree of its performance gains. Nonetheless, Ge-SAND consistently delivers the best overall performance.

Analysis of Ge-SAND’s advantage in different sample sizes

Across all cases, Ge-SAND's performance advantage becomes more evident as sample sizes decrease (Fig. 3A-I). We quantified this advantage by subtracting the mean performance metrics of other methods from those of Ge-SAND.

In particular, the difference in AUC-ROC between Ge-SAND and other models grows with smaller sample sizes. For example, with 5,000 samples, Ge-SAND outperforms other methods by 0.097 on average, and this advantage increases to 0.139 and 0.128 for sample sizes of 1,000 and 2,000, respectively. A similar trend can be observed in AUC-PR and KS, where the advantages increase from 0.0809 to 0.1017 and 0.1161, and from 0.135 to 0.194 and 0.164, respectively. This suggests that Ge-SAND may have a superior capability to uncover the relationship between SNPs and phenotypes, particularly in small sample sizes. More details are shown in Tables 2, 3 and 4.

Table 2 The performance of different methods in six models with 1,000 sample size

Table 3 The performance of different methods in six models with 2,000 sample size

Table 4 The performance of different methods in six models with 5,000 sample size

Attention scores and interaction analysis

To evaluate the significance of the genotypic interactions quantified by Ge-SAND, we performed 1,000 permutations of the attention score matrix (Fig. 3D) for the QD model (N = 2,000). The original attention scores yielded an AUC-ROC of 0.673, significantly greater than the AUC-ROC values from random permutations (mean = 0.483; P < 2.23 × 10^–16; Fig. 3E). P-values were calculated with the Gaussian kernel density-based method detailed in Methods.

We compared the performance of Ge-SAND in uncovering potential genotypic interactions with that of regression analysis (see Methods), based on the QD simulation (N = 2,000) with predefined ground-truth interacting SNP pairs. We extracted the top-ranked SNP pairs from Ge-SAND’s attention score matrix (ranked by scores) and from the regression analysis results (ranked by P-values). Using the predefined ground-truth interacting SNP pairs, we computed recall curves for the top 50 SNP pairs (Fig. 3G). Ge-SAND demonstrated superior recall of true interactions compared to the regression method, though the regression method retained advantages in detecting interactions under the QD model. The results indicate that Ge-SAND may improve disease phenotype prediction by capturing relationships between interactions of SNPs and the phenotype. Furthermore, we also explored the precision in identifying predefined effect SNPs using the top 50 SNP pairs with the highest attention scores (Fig. 3H). The top 10 pairs achieved a precision of 80%, while the top 20 pairs reached 65%. A heatmap of the top 15 genotypes based on attention scores (Fig. 3F) highlights key genetic interactions, suggesting that attention scores in Ge-SAND effectively prioritize influential SNP pairs for accurate predictions.
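Precision and recall among the top-ranked pairs can be computed as in the sketch below. It assumes that both metrics are evaluated at the level of unordered SNP pairs; the function name `precision_recall_at_k` and the toy pairs are hypothetical.

```python
def precision_recall_at_k(ranked_pairs, true_pairs, k):
    """Precision and recall of the top-k ranked pairs against ground truth."""
    truth = {frozenset(p) for p in true_pairs}      # unordered pairs
    top = [frozenset(p) for p in ranked_pairs[:k]]
    hits = sum(1 for p in top if p in truth)
    return hits / k, hits / len(truth)

# Toy ranking (strongest first) and toy ground truth.
ranked = [(1, 2), (3, 4), (5, 6), (2, 7), (8, 9)]
truth = [(2, 1), (5, 6), (8, 9), (10, 11)]
precision_3, recall_3 = precision_recall_at_k(ranked, truth, k=3)
```

Evaluating the function for k = 1, ..., 50 traces out a recall or precision curve over the top 50 pairs.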

Analysis of top SNP interactions

We compared the performance of Ge-SAND and the regression method in uncovering SNP interactions across five other genotype–phenotype models (LN, LN + QD, CB, LN + CB, and QD + CB, N = 2,000). Through the analysis of attention scores, predefined ground-truth interacting SNP pairs in Ge-SAND exhibited higher ranks and substantial attention weights, with overall superior performance to the regression method (Supplementary Figure S4). Notably, the performance gap between methods widened in specific models (e.g., CB), where Ge-SAND demonstrated stronger recall of true interactions within the top 50 SNP pairs. This comparative analysis demonstrates the method's adaptability across diverse genetic architectures. The regression method may be limited by predefined interaction models, whereas the architecture of Ge-SAND offers superior capacity in detecting complex, unspecified patterns. This flexibility may allow Ge-SAND to effectively handle both simulated and real-world genetic studies, where multifaceted interaction structures may emerge. We further analyzed the precision of the top 50 SNP pairs with the highest interaction scores calculated by Ge-SAND. The precision for identifying predefined SNP pairs consistently ranges from 0.6 to 1.0 across all cases in the top 10 pairs as shown in Supplementary Figure S5. In particular, the CB model shows higher precision with a slower rate of decline, as the top 10 pairs reach a precision of 100% and the top 50 pairs maintain a precision of 58%. Interestingly, in models involving linear and nonlinear relationships, Ge-SAND consistently prioritizes the identification of predefined nonlinear interaction pairs, particularly in the LN + QD and LN + CB models, where all identified pairs are nonlinear. With a total of 56,700,280 genotype pairs, these precision results may demonstrate Ge-SAND's ability to identify key interaction pairs, contributing to its superior classification performance.

Scalability analysis across sample sizes

To assess the scalability of Ge-SAND, we conducted simulations across a range of sample sizes—500, 1,000, 5,000, 10,000, and 50,000—to thoroughly evaluate its performance at varying scales. We generated a large-scale synthetic dataset with quadratic interaction effects, consisting of 2,596 loci. This dataset incorporates 20 quantitative trait loci (QTLs) involved in interactions, resulting in 190 total interaction pairs, modeled according to the quadratic interaction form in Eq. (1).

We compared Ge-SAND against Ridge regression, a method with stable performance characteristics, to establish a reliable baseline. From Table 5, the advantage of Ge-SAND is more pronounced in smaller datasets. For instance, at a sample size of 500, Ge-SAND outperforms Ridge by a relative 13.7% (AUC-ROC: 0.664 vs. 0.584). This suggests that Ge-SAND excels at extracting nuanced patterns from limited data, likely due to its ability to model complex interactions and learn hierarchical representations. As the sample size increases, the performance gap between Ge-SAND and Ridge narrows gradually. At 50,000 samples, the relative advantage reduces to 1.96% (AUC-ROC: 0.7441 vs. 0.7298). This trend aligns with expectations: traditional methods like Ridge can leverage larger datasets to improve their estimates, whereas Ge-SAND’s relative gains diminish once sufficient data becomes available. Nevertheless, Ge-SAND maintains superior performance even at scale, underscoring its robustness across data regimes.
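The percentages quoted above are relative gains over the Ridge baseline, which can be checked directly from the reported AUC-ROC values. The helper name `relative_gain` is illustrative.

```python
def relative_gain(model, baseline):
    """Relative improvement of `model` over `baseline`, in percent."""
    return (model - baseline) / baseline * 100

gain_500 = relative_gain(0.664, 0.584)       # N = 500
gain_50000 = relative_gain(0.7441, 0.7298)   # N = 50,000
print(f"{gain_500:.1f}% at N=500, {gain_50000:.2f}% at N=50,000")
```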

Table 5 The performance of Ge-SAND with sample sizes of 500 to 50,000

Regarding scalability challenges, we observed two critical considerations. First, while training time scales approximately linearly with sample size (e.g., 5.86 s for 500 samples vs. approximately 9.5 min for 50,000 samples), memory usage remains stable at 2.738 GB across all scenarios with a fixed batch size of 2. However, increasing the batch size to improve computational efficiency introduces significant memory demands: batch sizes of 4, 8, and 16 required 4.969 GB, 9.406 GB, and 18.305 GB of VRAM, respectively. This highlights a trade-off between training speed and hardware constraints, particularly for large-scale applications. Second, handling long-sequence genetic data with the self-attention mechanism poses potential memory challenges, because the attention matrix grows quadratically with the number of SNPs. Thus, the practical deployment of Ge-SAND at scale requires careful balancing of computational resources and algorithmic efficiency.
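The reported VRAM figures are close to linear in batch size, which a simple least-squares fit makes explicit. This is an illustrative reading of the four reported values rather than an analysis performed in the study, and the fitted coefficients apply only to the 2,596-locus dataset and hardware described here.

```python
batch_sizes = [2, 4, 8, 16]
vram_gb = [2.738, 4.969, 9.406, 18.305]   # reported VRAM per batch size

n = len(batch_sizes)
mean_x = sum(batch_sizes) / n
mean_y = sum(vram_gb) / n
sxx = sum((x - mean_x) ** 2 for x in batch_sizes)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(batch_sizes, vram_gb))

slope = sxy / sxx                      # GB of VRAM per additional sample in a batch
intercept = mean_y - slope * mean_x    # batch-independent overhead in GB
print(f"VRAM ~ {intercept:.2f} GB + {slope:.2f} GB x batch size")
```

Under this fit, each additional sample in a batch costs roughly 1.1 GB, so a 24 GB GPU would be expected to saturate at a batch size of about 20.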

Cross-disease analysis: superior risk prediction and interaction discovery in complex diseases

To further evaluate the proposed Ge-SAND, we applied it to three complex diseases: Crohn’s disease (N = 1,196), schizophrenia (N = 1,516), and Alzheimer’s disease (N = 4,244), with 2,187, 3,000, and 1,945 SNPs, respectively. All SNPs in the three datasets were significantly associated with the phenotype (P < 0.05) and were retained after LD clumping (pairwise LD ${r}^{2}$ < 0.1), as shown in Supplementary Figure S6. The case–control ratio for each dataset was balanced at 1:1. Two baseline models were designed for the ablation study, and nine mainstream machine learning methods were used for comparison. The genotype encoding and dataset usage follow the approach described in the previous section.

Ablation study

To test the roles of two important modules, Genomic Embedding and the Gemini dynamic learning network, we constructed two models for comparison (Table 6): one based on the traditional BERT-style embedding method (TBEM) (with the rest of the structure consistent with Ge-SAND) and another model without the Gemini learning network (WGNLN).

Table 6 Classification metric deltas (Δ%) for Ge-SAND and its ablated variants across three diseases

Firstly, the ablation study demonstrates that Ge-SAND's Genomic Embedding, specifically designed for genetic context modeling, achieves performance advantages over conventional BERT-style embeddings across three diseases. For CD, Ge-SAND shows a 27.04% higher AUC-ROC (0.700 vs 0.551), a 144.46% improvement in KS statistic (0.386 vs 0.158), and a 148.73% greater MCC compared to TBEM, providing empirical evidence for our domain-adapted embedding strategy. These advantages persist in SC, with Ge-SAND attaining an 8.21% AUC-ROC gain (0.646 vs 0.597), a 64.35% improvement in KS statistic (0.295 vs 0.180), and a 56.32% gain in MCC (0.282 vs 0.180). For AD, Ge-SAND with Genomic Embedding yields a 9.61% higher AUC-ROC (0.716 vs 0.653) and a 26.58% enhancement in MCC (0.381 vs 0.301). The consistent performance gaps across phenotypes (ΔAUC-ROC: 8.21–27.04%; ΔMCC: 26.58–148.73%) suggest that Genomic Embedding may be more suitable than generic language model adaptations for capturing biological relationships in disease risk prediction tasks. All percentages are relative gains over the ablated variant.

Secondly, the comparisons demonstrate that integrating the Gemini neurodynamic learning network contributes to improved prediction accuracy across three diseases. In SC, the complete Ge-SAND achieves a 7.38% higher AUC-ROC (0.646 vs 0.602), a 64.35% improvement in the KS statistic (0.295 vs 0.180), and a 55.80% increase in MCC (0.282 vs 0.181) compared to WGNLN. For AD, including the network yields an 8.21% AUC-PR gain (0.717 vs 0.663) and a 17.41% MCC improvement (0.381 vs 0.325). In CD, the complete model shows a 6.13% AUC-ROC increase (0.700 vs 0.660) and a 37.37% KS enhancement (0.386 vs 0.281), but also an 8.08% AUC-PR reduction (0.637 vs 0.693), indicating that the benefit depends on the metric and dataset. These improvements (ΔAUC-ROC: +6.13–7.38%; ΔMCC: +17.41–55.80%) suggest that the Gemini neurodynamic learning network may facilitate effective learning of feature-phenotype relationships.

In summary, the ablation studies and comparisons highlight the contributions of both the Genomic Embedding and Gemini learning network in enhancing Ge-SAND's predictive performance. These findings reinforce the potential of Ge-SAND in improving disease risk prediction.

Unified classification performance across Crohn’s disease, schizophrenia, and Alzheimer’s disease

Ge-SAND consistently outperforms nine mainstream machine learning models across three complex diseases—CD, SC, and AD—in terms of classification accuracy, discriminative ability, and balance between precision and recall (Fig. 5, Table 1). This superior performance is demonstrated by its higher AUC-ROC, AUC-PR, and KS statistics across all diseases, alongside strong overall performance metrics such as accuracy (ACC), F1-score, and MCC.

Across all diseases, Ge-SAND consistently achieves the highest AUC-ROC and AUC-PR, outperforming alternative models by significant margins. For instance, in CD, Ge-SAND achieves an AUC-ROC improvement of 0.115–0.204 over other models. Similar trends are observed in schizophrenia and Alzheimer’s disease, with Ge-SAND consistently leading in precision-recall trade-offs, as reflected in its superior AUC-PR and KS statistics across all datasets. Despite the potential limitations of imputation in the schizophrenia dataset, which may affect overall model performance, Ge-SAND still outperforms all other models. This highlights its robustness in handling imperfect data.

Furthermore, Ge-SAND exhibits strong overall classification performance, as demonstrated by higher ACC and F1-scores, particularly in CD (ACC = 0.684, F1-score = 0.731) and AD (ACC = 0.690, F1-score = 0.695), where it outperforms the second-best models by notable margins. Its ability to maintain a better balance between precision and recall across different diseases sets it apart from other models, which tend to overemphasize one aspect, such as recall in XGBoost or precision in Naive Bayes.

Ge-SAND’s MCC further confirms its superiority, reflecting balanced performance in all diseases. For instance, in Alzheimer’s disease, Ge-SAND's MCC outperforms the next-best method by 53%. This indicates that Ge-SAND not only excels in classification tasks but also provides stable and reliable predictions, critical for real-world applications in genomic-based disease prediction. Detailed comparisons of individual performance metrics can be found in Table 1 and Supplementary Note 1.2.

Permutation testing to interpret attention scores

To validate the reliability of the large-scale genotypic interactions quantified by Ge-SAND, we performed permutation tests on the three datasets (CD: 2,390,391 pairs, SC: 4,498,500 pairs, AD: 1,890,540 pairs). By shuffling the attention score matrices from all attention heads 1,000 times, we disrupted the learned relationships, producing a distribution of random baseline AUC-ROC scores. This comparison allows us to determine whether the observed model performance stems from meaningful biological patterns rather than chance.
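The pair counts quoted above are the numbers of unordered SNP pairs, n(n − 1)/2, for the SNP counts of the three datasets (2,187, 3,000, and 1,945), which can be checked in a few lines:

```python
from math import comb

snp_counts = {"CD": 2187, "SC": 3000, "AD": 1945}
# Number of unordered SNP pairs per dataset: n choose 2.
pair_counts = {disease: comb(n, 2) for disease, n in snp_counts.items()}
print(pair_counts)
```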

Across the three datasets, Ge-SAND consistently outperforms the permuted baselines. In CD, the observed AUC-ROC of 0.700 (Fig. 6A) is significantly higher than the permuted ones (mean: 0.636; P < 1.56 × 10^–15). Similarly, for schizophrenia, Ge-SAND achieves an AUC-ROC of 0.646 (Fig. 6D), exceeding the baseline average of 0.596 (P < 3.33 × 10^–16). In Alzheimer’s disease, the performance gap is even more pronounced, with an AUC-ROC of 0.716 (Fig. 6G) compared to a mean permuted score of 0.526 (P < 2.23 × 10^–16).

Because Ge-SAND consistently outperforms the randomized models across all datasets, its self-attention mechanism appears to capture genotypic interactions that are important for accurate disease risk prediction.

Top SNP interactions in three diseases

To investigate further, we analyzed the top 30 gene-mapped SNP pairs with the highest attention scores in each of the three diseases. Although each disease exhibits gene pairs that either increase or decrease disease risk, the patterns of interaction differ, suggesting distinct mechanisms for each condition.

In Crohn's disease, we statistically validated the biological relevance of the top 30 prioritized gene pairs based on GO analysis by calculating empirical p-values using the Monte Carlo method. Specifically, 1,000 iterations were performed, with each iteration involving the random selection of 30 gene pairs from the full set of possible combinations to establish a background distribution. For Molecular Function (MF) analysis, applying a co-occurrence threshold of at least one shared GO term, 26 out of the 30 prioritized gene pairs exhibited significant pathway overlap (P = 0.0019, computed as (w + 1)/(1,000 + 1) where w = 1 iteration showed background pairs ≥ 26). Similarly, under Cellular Component (CC) analysis using the same threshold, 27 out of 30 gene pairs showed significant pathway enrichment (P = 9.99 × 10^–4, computed as (w + 1)/(1,000 + 1) where no background iterations reached ≥ 27, thus w = 0), thereby providing statistical evidence of non-random biological relevance. Furthermore, numerous genotype pairs show strong interaction effects even though their individual genotypes exhibit no significant associations with disease risk. For example, while genotypes of ISOC1 (genotype 2: OR = 1.26, 95% CI: 0.90–1.75, P = 0.184) and HOMER2 (genotype 2: OR = 1.06, 95% CI: 0.77–1.47, P = 0.71) SNPs are not independently associated with CD, their 2–2 genotype combination significantly increases disease risk (OR = 2.50, 95% CI: 1.28–4.89, P = 0.0075). Previous studies have reported that ISOC1 is involved in colorectal cancer proliferation and migration [48, 49], while HOMER2 plays a role in synaptic plasticity [50]. Similarly, the 0–2 genotype pair of ICAM5’s and DNAAF4’s SNPs is associated with an increased risk of CD (OR = 2.32, 95% CI: 1.23–4.34, P = 0.0087), despite genotypes of ICAM5 (genotype 0 as the reference) and DNAAF4 (genotype 2: OR = 1.04, 95% CI: 0.74–1.46, P = 0.82) showing no significant associations individually. ICAM5 may be involved in colorectal cancer [51], while DNAAF4 is related to dyslexia [52]. Notably, through GO MF pathway analysis, both genes are enriched in nuclear estrogen receptor binding (P < 0.02). Estrogen receptors play an important role in gastrointestinal diseases, including Crohn's disease [53,54,55]. Several genotype pairs involving HOMER2 also show protective effects, suggesting that certain interactions may buffer against disease development. More pairs of genotypes are described in Supplementary Note 1.3.1.

For schizophrenia, the statistical analysis revealed that 25 out of the top 30 gene pairs shared at least one Biological Process (BP) pathway, and the observed enrichment is highly significant compared to random expectations (P = 9.99 × 10^–4, computed as (w + 1)/(1,000 + 1) where no background iterations reached ≥ 25, thus w = 0). Furthermore, the interactions in schizophrenia tend to align with individual genetic associations. For example, genotypes of AUTS2 (genotype 0 as the reference) and RTL6 (genotype 2: OR = 5.64, 95% CI: 1.63–19.46, P = 0.006) SNPs both show independent associations with SC, but their 0–2 genotype combination (OR = 13.05, 95% CI: 1.70–100.17, P = 0.013) further amplifies the disease risk. RTL6 may play a crucial role in the immune system by clearing harmful substances leaked from damaged neurons in the developing brain [56]. AUTS2 is known to be associated with a variety of psychiatric degenerative diseases, including SC [57, 58]. RTL6 may interact with AUTS2 under certain circumstances, further increasing the risk of SC. On the other hand, certain genotype combinations, such as those involving CCT6B (genotype 1: OR = 0.76, 95% CI: 0.58–0.99, P = 0.049) and LRP5L (genotype 2: OR = 0.55, 95% CI: 0.37–0.82, P = 0.0031), demonstrate a protective interaction effect (OR = 0.36, 95% CI: 0.14–0.97, P = 0.044), indicating that beneficial gene–gene interactions may mitigate disease risk more effectively than individual genotypes. CCT6B and LRP5L have been reported to be associated with SC [59, 60]. More pairs of genotypes related to brain functions are described in Supplementary Note 1.3.2.

For Alzheimer’s disease, the top 30 gene pairs were evaluated in the same way. In MF analysis, using a co-occurrence threshold of at least one shared GO term, 24 out of 30 prioritized pairs exhibited shared pathway associations (P = 0.02, computed as (w + 1)/(1,000 + 1) where w = 19 iterations showed background pairs ≥ 24), indicating significant biological relevance compared to random expectations. Notably, Alzheimer’s disease exhibits a hybrid interaction pattern, reflecting both independent and interaction-driven effects. For example, the 0–2 combination of CHD1L’s and VWA8’s SNPs (OR = 1.64, 95% CI: 1.04–2.58, P = 0.03) increases disease risk even though the individual genotypes show no significant associations (the genotype 2 of VWA8: OR = 1.37, 95% CI: 0.92–2.05, P = 0.12). Previous research has reported that CHD1L may play a role in the progression of glioma and is associated with attention-deficit/hyperactivity disorder [61] and multiple sclerosis [62], while VWA8 (KIAA0564) has been linked to autism [63] and corpus callosum size [64]. Notably, both CHD1L and VWA8 are enriched in several GO Molecular Function (MF) categories, including ATP hydrolysis activity (P < 0.002), ATP-dependent activity (P < 0.004), and ATP binding (P < 0.03). ATP (adenosine triphosphate) is a core molecule in cellular energy metabolism, and its homeostasis is crucial for neuronal function. The pathological mechanisms of Alzheimer's disease (AD) are closely related to ATP metabolic disturbances, involving mitochondrial dysfunction [65, 66]. This suggests that complex interactions in the nervous system may contribute to the progression of AD. At the same time, other pairs demonstrate interaction-driven effects, such as the 1–1 combination of MYRIP’s and KIF6’s SNPs (OR = 0.42, 95% CI: 0.22–0.80, P = 0.0078). MYRIP (genotype 1: OR = 0.73, 95% CI: 0.60–0.89, P = 0.0019) and KIF6 (genotype 2: OR = 0.77, 95% CI: 0.64–0.92, P = 0.0051) individually show protective effects, but their combined impact appears stronger, suggesting that coordinated gene interactions are essential for mitigating neurodegenerative risk. Previous research has reported that MYRIP may be associated with AD [67], while KIF6 is potentially related to neurodevelopment [68]. Similarly, both genes are significantly enriched in the cytoskeletal protein binding pathway in GO MF (P < 0.02). Cytoskeletal protein binding is an important pathway related to Alzheimer's disease pathogenesis and the primary behavioral symptoms of the disease [69,70,71]. More details are described in Supplementary Note 1.3.3.

Gene network construction and functional enrichment analysis

Based on the top 30 gene-mapped SNP pairs, we constructed gene networks for the three datasets. The networks differ in structure: unicentric (CD), bicentric (SC), and multicentric (AD), reflecting distinct biological mechanisms for each disease.
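One simple way to read hub structure from such a network is to count how many distinct partners each gene has. The sketch below assumes that node degree is an adequate proxy for centrality, which is a choice made here for illustration; the function name `gene_degrees` and the gene labels are placeholders, not genes from the study.

```python
from collections import Counter

def gene_degrees(gene_pairs):
    """Count distinct interaction partners per gene from a list of gene pairs."""
    # Treat pairs as unordered, drop duplicates and self-pairs.
    edges = {frozenset(p) for p in gene_pairs if p[0] != p[1]}
    degree = Counter()
    for edge in edges:
        degree.update(edge)
    return degree

# Placeholder network with a single hub.
toy_pairs = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3"),
             ("G4", "G5"), ("G1", "HUB")]
degree = gene_degrees(toy_pairs)
hub, hub_degree = degree.most_common(1)[0]
```

A network in which one gene has a much higher degree than all others would be read as unicentric under this proxy, whereas two or more comparably connected genes would suggest a bicentric or multicentric structure.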

In CD, the gene network includes 29 relevant genes (Fig. 6C), where HOMER2 is connected to multiple genes as a center in the network, indicating its potential function in CD. The heatmap of 29 relevant genes is shown in Fig. 6B. Some genes are linked to gut and brain functions. Notably, genes such as ISOC1, ABCC4, and KRI1 are associated with gastrointestinal processes, while DNAAF4 and HOMER2 play significant roles in the central nervous system [72,73,74]. The presence of pleiotropic genes, including ICAM5 and VAV2, highlights their involvement in both gastrointestinal and neurological disorders [75,76,77]. g:Profiler analysis further supports these findings, revealing significant enrichment in colon tissues (P < 9.658 × 10^–05) and neuronal cells of the cerebral cortex (P < 3.744 × 10^–03) (Supplementary Figure S7), which points to the involvement of the brain-gut axis in CD.

The SC network (Fig. 6F), which comprises 15 genes, emphasizes the importance of genes involved in neuronal development and differentiation. The heatmap of the 15 relevant genes is shown in Fig. 6E. Key genes such as CCT6B and AUTS2 act as central nodes, suggesting complex interactions regulating brain function. g:Profiler analysis reveals that the genes are significantly enriched (P < 0.05) in GO terms related to nervous system processes, emphasizing the role of disrupted neuronal pathways in SC pathology (Supplementary Figure S8).

In AD, the gene network (Fig. 6I), which comprises 18 genes, is multi-centered, with hubs such as CHD1L, MYRIP, and CMIP (associated with literacy skills [78]) suggesting involvement in cognitive pathways. The heatmap of the 18 relevant genes is shown in Fig. 6H. Similar to CD, Ge-SAND identified gene interactions reflecting both cognitive and intestinal processes, suggesting a potential shared mechanism through the brain-gut axis. Key genes such as GJA9 [79, 80], ZCCHC [81, 82], DCLK1 [83, 84], and GPR39 [85, 86] are implicated in both brain and gut functions. These genes have been linked to memory, neurodevelopment, and intestinal regulation, reinforcing the connection between neurodegenerative disorders and intestinal health. g:Profiler analysis identified significant expression in the hippocampus, a brain region critical for memory and cognition related to AD [87] (P < 2.479 × 10⁻³ for glial cells; P < 4.313 × 10⁻² overall).
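The unicentric, bicentric, and multicentric descriptions above amount to counting how many genes act as high-degree nodes. The following is a minimal sketch of that idea, not the authors' pipeline: the gene labels are placeholders, and the `min_degree` threshold used to call a gene a hub is an assumption made purely for illustration.

```python
from collections import defaultdict


def build_gene_network(gene_pairs):
    """Undirected adjacency map built from (gene, gene) pairs."""
    adjacency = defaultdict(set)
    for gene_a, gene_b in gene_pairs:
        if gene_a == gene_b:
            continue  # ignore self-pairs
        adjacency[gene_a].add(gene_b)
        adjacency[gene_b].add(gene_a)
    return dict(adjacency)


def hub_genes(adjacency, min_degree=3):
    """Genes with at least min_degree distinct partners, highest degree first."""
    hubs = [gene for gene, partners in adjacency.items() if len(partners) >= min_degree]
    return sorted(hubs, key=lambda gene: (-len(adjacency[gene]), gene))


# Placeholder pairs (hypothetical labels, not results from the paper).
toy_pairs = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3"), ("G4", "G5")]
network = build_gene_network(toy_pairs)
print(hub_genes(network))  # ['HUB'] -> a single hub, i.e. a unicentric layout
```

Under this reading, one hub corresponds to a unicentric network, two to a bicentric network, and several to a multicentric one.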

Discussion

In this study, we present Ge-SAND, a genomic embedding self-attention neurodynamic decoder that captures complex genotype interactions for accurate disease risk prediction, advancing the understanding of genotype–phenotype relationships. First, Ge-SAND introduces a novel embedding approach, Genomic Embedding, that integrates genomic loci and genotype information, distinguishing between intrachromosomal and interchromosomal interactions. Second, a self-attention mechanism enables the model to capture both linear and nonlinear interactions across genotype pairs comprehensively, enhancing predictive accuracy. Finally, the Gemini neurodynamic learning network models the relationship between extracted features and phenotypes, improving generalization while making full use of genomic data and reducing the risk of data leakage.

Ge-SAND demonstrates superior performance across both simulated and real-world datasets compared to mainstream machine learning methods. In simulations using the QD model with 2,000 samples, Ge-SAND improves AUC-ROC by 0.047–0.227. These gains are mirrored in real-world datasets, where AUC-ROC increases by 0.115–0.204 for Crohn’s disease, 0.063–0.142 for schizophrenia, and 0.058–0.252 for Alzheimer’s disease. Ge-SAND also outperforms in other key metrics, including AUC-PR, MCC, and KS, underscoring its superior ability to capture complex genotype–phenotype relationships and deliver more accurate classification performance.
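For readers less familiar with the metrics named above, the sketch below shows how two of them, AUC-ROC and MCC, can be computed from first principles. It is an illustration on toy numbers under standard textbook definitions; the evaluation code actually used in the study is not described in this section.

```python
import math


def auc_roc(labels, scores):
    """AUC-ROC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): two cases and two controls.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auc_roc(toy_labels, toy_scores))  # 0.75: three of four case/control pairs are ordered correctly
print(mcc(tp=2, fp=0, tn=2, fn=0))      # 1.0 for a perfect classification
```

An AUC-ROC gain such as 0.115 therefore means that roughly 11.5 percentage points more case/control pairs are ranked in the correct order.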

The self-attention model provides a powerful way of uncovering the relationships between genotypes and phenotypes, extending a model developed for natural language processing to genomics. Previous methods for phenotype prediction using genomic data typically fall into two categories: (1) polygenic risk score-based techniques [88, 89], which are interpretable but limited to linear relationships, and (2) machine learning approaches [19, 20], which capture nonlinear interactions but lack comprehensive interpretability at the genotype-pair level. Ge-SAND bridges this gap by leveraging self-attention to capture complex interactions at scales exceeding 10⁶ in parallel, offering both accuracy and interpretability through attention score analysis. This dual capability enhances understanding of how genetic variations contribute to disease risk, making Ge-SAND a valuable tool for both prediction and interpretation in genomic studies.
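To make the link between self-attention and pairwise interaction scores concrete, here is a minimal sketch of generic scaled dot-product attention [25] in plain Python. It is not the Ge-SAND architecture: the random toy embeddings, the absence of learned query/key projections, and the symmetrised score used to rank pairs are all assumptions made for illustration.

```python
import math
import random


def attention_scores(queries, keys):
    """Row-wise softmax of Q K^T / sqrt(d): one score per (query SNP, key SNP) pair."""
    dim = len(queries[0])
    rows = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def top_pairs(scores, top_n):
    """Rank unordered SNP pairs by the mean of their two directed attention scores."""
    n = len(scores)
    ranked = [((i, j), (scores[i][j] + scores[j][i]) / 2)
              for i in range(n) for j in range(i + 1, n)]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


random.seed(0)
n_snps, dim = 6, 4  # toy sizes; real inputs have thousands of SNPs
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_snps)]
scores = attention_scores(embeddings, embeddings)
print(len(scores), len(scores[0]))  # 6 6: every SNP is scored against every SNP
print([pair for pair, _ in top_pairs(scores, 3)])
```

Because the score matrix has one entry per ordered SNP pair, a few thousand input SNPs already yield millions of candidate pairs, which is also the source of the quadratic cost.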

Additionally, we enhanced the conventional self-attention model in Ge-SAND by incorporating permutation analysis for interpretability. By performing permutation tests on attention score matrices in simulations and real datasets, we demonstrate that Ge-SAND’s attention matrices can capture genotype interactions relevant to phenotypes, which are crucial for improving prediction accuracy. The AUC-ROC values derived from the attention matrices are significantly higher than the permuted averages, underscoring the pivotal role of the attention mechanism in driving prediction performance. This finding suggests that the model’s ability to weigh SNP interactions might reflect biologically meaningful relationships linked to phenotypic outcomes. Furthermore, our simulations confirm that Ge-SAND effectively identifies predefined genotype interaction pairs, whether linear or nonlinear, reinforcing its interpretative power. For instance, in the CB model, the precision of the top 10 genotype pairs reaches 100%, providing robust evidence for Ge-SAND’s capability to detect critical genetic interactions despite the large solution space (5,670,028 genotype pairs).
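The top-10 precision and the size of the search space mentioned above can be illustrated as follows. This is a minimal sketch on toy data; the observation that 5,670,028 equals the number of unordered pairs among 3,368 loci is an arithmetic inference and is not stated in this section.

```python
from math import comb


def precision_at_k(ranked_pairs, true_pairs, k):
    """Fraction of the k highest-ranked pairs that belong to the predefined set."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = {frozenset(pair) for pair in true_pairs}  # pair order is irrelevant
    hits = sum(frozenset(pair) in truth for pair in ranked_pairs[:k])
    return hits / k


# Toy ranking (hypothetical SNP indices), best-scoring pair first.
toy_ranked = [(1, 2), (3, 4), (5, 6), (7, 8)]
toy_truth = [(2, 1), (3, 4), (9, 10)]
print(precision_at_k(toy_ranked, toy_truth, k=2))  # 1.0
print(precision_at_k(toy_ranked, toy_truth, k=4))  # 0.5

# Number of unordered pairs among n loci is n * (n - 1) / 2.
print(comb(3368, 2))  # 5670028
```

A precision of 100% at k = 10 means that all ten top-ranked pairs were among the predefined interacting pairs, out of more than five million candidates.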

Ge-SAND uncovered distinct genotype interaction patterns across CD, SC, and AD by analyzing the odds ratios (ORs) of the SNP pairs with the highest attention scores. In CD, individual genotypes show no significant impact on risk, but specific combinations, such as HOMER2 (synaptic plasticity) and ICAM5 (intercellular adhesion function), significantly alter susceptibility. In SC, numerous genotype pairs, like AUTS2 (neurodevelopment) and RTL6 (immune regulation), demonstrate both individual and combinatory associations. AD exhibits a hybrid pattern: in some significant pairs the individual genotypes also show independent effects (e.g., MYRIP and KIF6), whereas in others the effect appears only through interaction (e.g., CHD1L and VWA8). Furthermore, gene pairs such as ICAM5-HOMER2 in CD and GJA9-ZCCHC in AD suggest a potential brain-gut axis, revealing similar mechanisms across neurodegenerative and intestinal disorders. These results demonstrate Ge-SAND’s ability to capture both disease-specific and cross-disease interactions through attention-based SNP pair analysis.
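As a reminder of how an odds ratio and its 95% confidence interval relate to genotype-combination counts, the sketch below uses the standard 2 × 2 table estimate with a Wald interval on the log scale. This is an assumption for illustration: this section does not state how the ORs were estimated (for example, whether covariate-adjusted regression was used), and the counts are hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2 x 2 table.

    a: cases carrying the genotype combination, b: controls carrying it,
    c: cases without it, d: controls without it.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical counts, not data from the study.
or_hat, ci_low, ci_high = odds_ratio_wald_ci(a=40, b=25, c=460, d=475)
print(f"OR = {or_hat:.2f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to a significant association at the 5% level, which is the criterion used when contrasting combined and individual genotype effects.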

From a broader perspective, understanding these multi-genotype interactions allows for a more comprehensive view of the complex genetic networks driving disease. Ge-SAND constructed unique interaction networks from the top 30 SNP pairs that map to genes in each real-world dataset, revealing unicentric (CD), bicentric (SC), and multicentric (AD) features. In the Crohn’s disease dataset, the network is centered around HOMER2, a gene involved in synaptic plasticity. In the schizophrenia dataset, AUTS2 and CCT6B emerged as central hubs, both of which have been implicated in schizophrenia, with AUTS2 also linked to broader neurodevelopmental disorders. The Alzheimer’s disease network is multi-centered, featuring genes such as CHD1L (related to multiple sclerosis), MYRIP (associated with Alzheimer’s disease), and CMIP (involved in cognitive abilities). These networks highlight the importance of specific genotype interactions in understanding the genetic architecture of complex diseases.

Despite the promising results demonstrated by Ge-SAND, future research should focus on validating these findings in larger, more diverse cohorts to ensure the model’s applicability across different populations. While our current architecture maintains predictive accuracy through its intrinsic attention mechanisms, potential limitations may emerge in scenarios requiring stricter interpretability guarantees. Future investigations will systematically quantify these interpretability-performance tradeoffs and develop dynamic balancing strategies tailored to specific biological discovery objectives. Furthermore, although the self-attention mechanism provides unique analytical capabilities, its quadratic computational complexity presents scalability challenges. To broaden Ge-SAND's applicability, we plan to develop optimized architectures and computation strategies that maintain predictive performance while enhancing computational efficiency. Notably, while our method represents a novel approach to capturing interaction effects for phenotype prediction, validated on both simulated and real-world datasets, the interacting genotypes reported here currently rest primarily on statistical validation and exploratory analysis of literature-supported biological functions. Their precise mechanistic roles require further biological validation through experimental approaches such as double-knockout studies.

Although the field of deep learning for genetic interaction discovery is still evolving, our findings underscore the transformative potential of models like Ge-SAND. Results from both simulations and real-world datasets highlight Ge-SAND as a potential tool for uncovering gene interactions. While further refinement and validation are necessary, this study marks a significant advancement in applying deep learning to genomics.

Data availability

In this study, partial genotype data from the UK Biobank (https://www.ukbiobank.ac.uk/) were accessed through a collaboration under application no. 86920. Whole Exome Sequencing (WES) data in VCF.GZ format were accessed under Field ID 23157, and imputed genotypes in PGEN format were accessed under Field ID 22828. Data are available to bona fide researchers upon application to the UK Biobank. The 1000 Genomes Project dataset is available at https://www.internationalgenome.org/ and does not require access rights. The source code of Ge-SAND is publicly available at https://github.com/LHDLHUB/Ge-SAND.

References

Claussnitzer M, Cho JH, Collins R, Cox NJ, Dermitzakis ET, Hurles ME, Kathiresan S, Kenny EE, Lindgren CM, MacArthur DG, et al. A brief history of human disease genetics. Nature. 2020;577(7789):179–89.
de la Torre-Ubieta L, Won H, Stein JL, Geschwind DH. Advancing the understanding of autism disease mechanisms through genetics. Nat Med. 2016;22(4):345–61.
Bluestone JA, Herold K, Eisenbarth G. Genetics, pathogenesis and clinical interventions in type 1 diabetes. Nature. 2010;464(7293):1293–300.
Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet. 2009;10(4):241–51.
Wagner GP, Zhang J. The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms. Nat Rev Genet. 2011;12(3):204–13.
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.
Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nat Rev Genet. 2019;20(8):467–84.
Kim H, Grueneberg A, Vazquez AI, Hsu S, de los Campos G. Will Big Data Close the Missing Heritability Gap? Genetics. 2017;207(3):1135–45.
Wei W-H, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet. 2014;15(11):722–33.
Müller B, Wilcke A, Boulesteix A-L, Brauer J, Passarge E, Boltze J, Kirsten H. Improved prediction of complex diseases by common genetic markers: state of the art and further perspectives. Hum Genet. 2016;135(3):259–72.
Patel AP, Wang M, Ruan Y, Koyama S, Clarke SL, Yang X, Tcheandjieu C, Agrawal S, Fahed AC, Ellinor PT, et al. A multi-ancestry polygenic risk score improves risk prediction for coronary artery disease. Nat Med. 2023;29(7):1793–803.
Enoma DO, Bishung J, Abiodun T, Ogunlana O, Osamor VC. Machine learning approaches to genome-wide association studies. Journal of King Saud University - Science. 2022;34(4):101847.
Ho DSW, Schierding W, Wake M, Saffery R, O’Sullivan J. Machine Learning SNP Based Prediction for Precision Medicine. Front Genet. 2019;10:267. https://doi.org/10.3389/fgene.2019.00267.
Ghafouri-Fard S, Taheri M, Omrani MD, Daaee A, Mohammad-Rahimi H. Application of Artificial Neural Network for Prediction of Risk of Multiple Sclerosis Based on Single Nucleotide Polymorphism Genotypes. J Mol Neurosci. 2020;70(7):1081–7.
Badré A, Zhang L, Muchero W, Reynolds JC, Pan C. Deep neural network improves the estimation of polygenic risk scores for breast cancer. J Hum Genet. 2021;66(4):359–69.
Abdulaimma B, Fergus P, Chalmers C, Montañez CC. Deep Learning and Genome-Wide Association Studies for the Classification of Type 2 Diabetes. In: 2020 International Joint Conference on Neural Networks (IJCNN); 19–24 July 2020. p. 1–8.
Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, Lin HJ, Raffield L, Gao Y, Chen H, et al. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Communications Biology. 2022;5(1):856.
Kafaie S, Chen Y, Hu T. A network approach to prioritizing susceptibility genes for genome-wide association studies. Genet Epidemiol. 2019;43(5):477–91.
Qin Z-M, Liang S-Q, Long J-X, Deng J-M, Wei X, Yang M-L, Tang S-J, Li H-L. Importance of GWAS Risk Loci and Clinical Data in Predicting Asthma Using Machine-learning Approaches. Comb Chem High Throughput Screening. 2024;27(3):400–7.
López B, Torrent-Fontbona F, Viñas R, Fernández-Real JM. Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction. Artif Intell Med. 2018;85:43–9.
Arkin Y, Rahmani E, Kleber ME, Laaksonen R, März W, Halperin E. EPIQ—efficient detection of SNP–SNP epistatic interactions for quantitative traits. Bioinformatics. 2014;30(12):i19–25.
Shen J, Li Z, Song Z, Chen J, Shi Y. Genome-wide two-locus interaction analysis identifies multiple epistatic SNP pairs that confer risk of prostate cancer: A cross-population study. Int J Cancer. 2017;140(9):2075–84.
Li P, Guo M, Wang C, Liu X, Zou Q. An overview of SNP interactions in genome-wide association studies. Brief Funct Genomics. 2015;14(2):143–55.
Lee KY, Leung KS, Ma SL, So HC, Huang D, Tang NL, Wong MH. Genome-Wide Search for SNP Interactions in GWAS Data: Algorithm, Feasibility, Replication Using Schizophrenia Datasets. Front Genet. 2020;11:1003. https://doi.org/10.3389/fgene.2020.01003.
Vaswani A: Attention is all you need. Advances in Neural Information Processing Systems 2017.
Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, Lu H, Yao J. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence. 2022;4(10):852–66.
Jiang TT, Fang L, Wang K. Deciphering the language of nature: A transformer-based language model for deleterious mutations in proteins. Innovation (Camb). 2023;4(5). https://doi.org/10.1016/j.xinn.2023.100487.
Wang X, Zhang M, Long C, Yao L, Zhu M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM Trans Comput Biol Bioinf. 2023;20(2):1469–79.
Yan W, Tang W, Wang L, Bin Y, Xia J. PrMFTP: Multi-functional therapeutic peptides prediction based on multi-head self-attention mechanism and class weight optimization. PLoS Comput Biol. 2022;18(9):e1010511.
Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, Liss J. Digital medicine and the curse of dimensionality. NPJ Digital Medicine. 2021;4(1):153.
Correia D, Wilke DN. Purposeful cross-validation: a novel cross-validation strategy for improved surrogate optimizability. Eng Optim. 2021;53(9):1558–73.
Miao L, Jiang L, Tang B, Sham PC, Li M. Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability. The American Journal of Human Genetics. 2023;110(9):1534–48.
Davies G, Tenesa A, Payton A, Yang J, Harris SE, Liewald D, Ke X, Le Hellard S, Christoforou A, Luciano M, et al. Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Mol Psychiatry. 2011;16(10):996–1005.
Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B, Preuss D, Leinonen R, Shumway M, et al. The 1000 Genomes Project: data management and community access. Nat Methods. 2012;9(5):459–62.
Collins R. What makes UK Biobank special? The Lancet. 2012;379(9822):1173–4.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9.
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1):s13742-015-0047-8.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics. 2007;81(3):559–75.
Maass PG, Barutcu AR, Rinn JL. Interchromosomal interactions: A genomic love story of kissing chromosomes. J Cell Biol. 2018;218(1):27–38.
Fitzpatrick DJ, Ryan CJ, Shah N, Greene D, Molony C, Shields DC. Genome-wide epistatic expression quantitative trait loci discovery in four human tissues reveals the importance of local chromosomal interactions governing gene expression. BMC Genomics. 2015;16(1):109.
Devlin J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
Zhang Z, Chen B, Luo Y. A Deep Ensemble Dynamic Learning Network for Corona Virus Disease 2019 Diagnosis. IEEE Transactions on Neural Networks and Learning Systems. 2024;35(3):3912–26.
Zhang Z, Chen G, Yang S. Ensemble Support Vector Recurrent Neural Network for Brain Signal Detection. IEEE Transactions on Neural Networks and Learning Systems. 2022;33(11):6856–66.
Zhang Z, Ye L, Chen B, Luo Y. An anti-interference dynamic integral neural network for solving the time-varying linear matrix equation with periodic noises. Neurocomputing. 2023;534:29–44.
Zhang Z, Ye L, Zheng L, Luo Y. A Novel Solution to the Time-Varying Lyapunov Equation: The Integral Dynamic Learning Network. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2023;53(11):6731–43.
Zhang Y, Zhu M, Hu C, Li J, Yang M: Euler-precision general-form of Zhang et al discretization (ZeaD) formulas, derivation, and numerical experiments. In: 2018: IEEE: 6262–6267.
Eidous OM, Ananbeh EA. Kernel Method for Estimating Matusita Overlapping Coefficient Using Numerical Approximations. Ann Data Sci. 2024. https://doi.org/10.1007/s40745-024-00563-y.
Gao B, Zhao L, Wang F, Bai H, Li J, Li M, Hu X, Cao J, Wang G. Knockdown of ISOC1 inhibits the proliferation and migration and induces the apoptosis of colon cancer cells through the AKT/GSK-3β pathway. Carcinogenesis. 2019;41(8):1123–33.
Li C, Gao X, Zhao Y, Chen X. High Expression of circ_0001821 Promoted Colorectal Cancer Progression Through miR-600/ISOC1 Axis. Biochem Genet. 2023;61(1):410–27.
dela Peña I, dela Peña IJ, de la Peña JB, Kim HJ, Shin CY, Han DH, Kim B-N, Ryu JH, Cheong JH. Methylphenidate and Atomoxetine-Responsive Prefrontal Cortical Genetic Overlaps in “Impulsive” SHR/NCrl and Wistar Rats. Behavior Genetics. 2017;47(5):564–80.
Mokarram P, Kumar K, Brim H, Naghibalhossaini F, Saberi-firoozi M, Nouraie M, Green R, Lee E, Smoot DT, Ashktorab H. Distinct High-Profile Methylated Genes in Colorectal Cancer. PLoS ONE. 2009;4(9):e7012.
Rinne N, Wikman P, Sahari E, Salmi J, Einarsdóttir E, Kere J, Alho K. Developmental dyslexia susceptibility genes DNAAF4, DCDC2, and NRSN1 are associated with brain function in fluently reading adolescents and young adults. Cereb Cortex. 2024;34(4):bhae144. https://doi.org/10.1093/cercor/bhae144.
Chen C, Gong X, Yang X, Shang X, Du Q, Liao Q, Xie R, Chen Y, Xu J. The roles of estrogen and estrogen receptors in gastrointestinal disease (Review). Oncol Lett. 2019;18(6):5673–80.
Jacenik D, Zielińska M, Mokrowiecka A, Michlewska S, Małecka-Panas E, Kordek R, Fichna J, Krajewska WM. G protein-coupled estrogen receptor mediates anti-inflammatory action in Crohn’s disease. Sci Rep. 2019;9(1):6749.
Jacenik D, Krajewska WM. Significance of G Protein-Coupled Estrogen Receptor in the Pathophysiology of Irritable Bowel Syndrome, Inflammatory Bowel Diseases and Colorectal Cancer. Front Endocrinol (Lausanne). 2020;11:390. https://doi.org/10.3389/fendo.2020.00390.
Irie M, Itoh J, Matsuzawa A, Ikawa M, Kiyonari H, Kihara M, Suzuki T, Hiraoka Y, Ishino F, Kaneko-Ishino T. Retrovirus-derived RTL5 and RTL6 genes are novel constituents of the innate immune system in the eutherian brain. Development. 2022;149(18):dev200976.
Zhang B, Xu Y-H, Wei S-G, Zhang H-B, Fu D-K, Feng Z-F, Guan F-L, Zhu Y-S, Li S-B. Association Study Identifying a New Susceptibility Gene (AUTS2) for Schizophrenia. Int J Mol Sci. 2014;15:19406–16.
Hori K, Nagai T, Shan W, Sakamoto A, Taya S, Hashimoto R, Hayashi T, Abe M, Yamazaki M, Nakao K, et al. Cytoskeletal Regulation by AUTS2 in Neuronal Migration and Neuritogenesis. Cell Rep. 2014;9(6):2166–79.
Wang Y, Yang Y, Jia X, Zhao C, Yang C, Fan J, Wu M, Yu M, Dong A, Wang N, et al. Identifying pleiotropic genes for major psychiatric disorders with GWAS summary statistics using multivariate adaptive association tests. J Psychiatr Res. 2022;155:471–82.
Gardiner EJ, Cairns MJ, Liu B, Beveridge NJ, Carr V, Kelly B, Scott RJ, Tooney PA. Gene expression analysis reveals schizophrenia-associated dysregulation of immune pathways in peripheral blood mononuclear cells. J Psychiatr Res. 2013;47(4):425–37.
Qi X, Wang S, Zhang L, Liu L, Wen Y, Ma M, Cheng S, Li P, Cheng B, Du Y, et al. An integrative analysis of transcriptome-wide association study and mRNA expression profile identified candidate genes for attention-deficit/hyperactivity disorder. Psychiatry Res. 2019;282:112639.
PahlevanKakhki M, Giordano A, StarvaggiCucuzza C, Venkata S, Badam T, Samudyata S, Lemée MV, Stridh P, Gkogka A, Shchetynsky K, Harroud A, et al. A genetic-epigenetic interplay at 1q21.1 locus underlies CHD1L-mediated vulnerability to primary progressive multiple sclerosis. Nature Communications. 2024;15(1):6419.
Cuscó I, Medrano A, Gener B, Vilardell M, Gallastegui F, Villa O, González E, Rodríguez-Santiago B, Vilella E, Del Campo M, et al. Autism-specific copy number variants further implicate the phosphatidylinositol signaling pathway and the glutamatergic synapse in the etiology of the disorder. Hum Mol Genet. 2009;18(10):1795–804.
Newbury AJ, Rosen GD. Genetic, morphometric, and behavioral factors linked to the midsagittal area of the corpus callosum. Front Genet. 2012;3:91. https://doi.org/10.3389/fgene.2012.00091.
Hauptmann S, Scherping I, Dröse S, Brandt U, Schulz KL, Jendrach M, Leuner K, Eckert A, Müller WE. Mitochondrial dysfunction: an early event in Alzheimer pathology accumulates with age in AD transgenic mice. Neurobiol Aging. 2009;30(10):1574–86.
Parihar MS, Brewer GJ: Mitoenergetic failure in Alzheimer disease. American Journal of Physiology-Cell Physiology 2007.
Zhang L, Ju X, Cheng Y, Guo X, Wen T. Identifying Tmem59 related gene regulatory network of mouse neural stem cell from a compendium of expression profiles. BMC Syst Biol. 2011;5(1):152.
Konjikusic MJ, Yeetong P, Boswell CW, Lee C, Roberson EC, Ittiwut R, Suphapeetiporn K, Ciruna B, Gurnett CA, Wallingford JB, et al. Mutations in Kinesin family member 6 reveal specific role in ependymal cell ciliogenesis and human neurological development. PLoS Genet. 2018;14(11):e1007817.
Bamburg JR, Bloom GS. Cytoskeletal pathologies of Alzheimer disease. Cell Motil. 2009;66(8):635–49.
Kang DE, Roh SE, Woo JA, Liu T, Bu JH, Jung AR, Lim Y. The interface between cytoskeletal aberrations and mitochondrial dysfunction in Alzheimer’s disease and related disorders. Experimental neurobiology. 2011;20(2):67.
Brandt R, Götz J. Special issue on “Cytoskeletal proteins in health and neurodegenerative disease: Concepts and methods.” Brain Res Bull. 2023;198:50–2.
Parisiadou L, Bethani I, Michaki V, Krousti K, Rapti G, Efthimiopoulos S. Homer2 and Homer3 interact with amyloid precursor protein and inhibit Aβ production. Neurobiol Dis. 2008;30(3):353–64.
Gilks WP, Allott EH, Donohoe G, Cummings E, Gill M, Corvin AP, Morris DW. Replicated genetic evidence supports a role for HOMER2 in schizophrenia. Neurosci Lett. 2010;468(3):229–33.
Azaiez H, Decker AR, Booth KT, Simpson AC, Shearer AE, Huygen PLM, Bu F, Hildebrand MS, Ranum PT, Shibata SB, et al. HOMER2, a Stereociliary Scaffolding Protein, Is Essential for Normal Hearing in Humans and Mice. PLoS Genet. 2015;11(3):e1005137.
Zhang Y, Yang X, Liu Y, Ge L, Wang J, Sun X, Wu B, Wang J. Vav2 is a novel APP-interacting protein that regulates APP protein level. Sci Rep. 2022;12(1):12752.
Birkner K, Loos J, Gollan R, Steffen F, Wasser B, Ruck T, Meuth SG, Zipp F, Bittner S. Neuronal ICAM-5 Plays a Neuroprotective Role in Progressive Neurodegeneration. Front Neurol. 2019;10:205. https://doi.org/10.3389/fneur.2019.00205.
Lindsberg PJ, Launes J, Tian L, Välimaa H, Subramanian V, Sirén J, Hokkanen L, Hyypiä T, Carpén O, Gahmberg CG. Release of soluble ICAM-5, a neuronal adhesion molecule, in acute encephalitis. Neurology. 2002;58(3):446–51.
Skeide MA, Kraft I, Müller B, Schaadt G, Neef NE, Brauer J, Wilcke A, Kirsten H, Boltze J, Friederici AD. NRSN1 associated grey matter volume of the visual word form area reveals dyslexia before school. Brain. 2016;139(10):2792–803.
Wong J, Chopra J, Chiang LLW, Liu T, Ho J, Wu WKK, Tse G, Wong SH. The Role of Connexins in Gastrointestinal Diseases. J Mol Biol. 2019;431(4):643–52.
Sánchez OF, Rodríguez AV, Velasco-España JM, Murillo LC, Sutachan J-J, Albarracin S-L. Role of connexins 30, 36, and 43 in brain tumors, neurodegenerative diseases, and neuroprotection. Cells. 2020;9(4):846.
Zayats T, Jacobsen KK, Kleppe R, Jacob CP, Kittel-Schneider S, Ribasés M, Ramos-Quiroga JA, Richarte V, Casas M, Mota NR, et al. Exome chip analyses in adult attention deficit hyperactivity disorder. Transl Psychiatry. 2016;6(10):e923–e923.
Chen K, Zhang J, Meng L, Kong L, Lu M, Wang Z, Wang W. The epigenetic downregulation of LncGHRLOS mediated by RNA m6A methylase ZCCHC4 promotes colorectal cancer tumorigenesis. J Exp Clin Cancer Res. 2024;43(1):44.
Zhang M, Bouland GA, Holstege H, Reinders MJT. Identifying Aging and Alzheimer Disease-Associated Somatic Variations in Excitatory Neurons From the Human Frontal Cortex. Neurol Genet. 2023;9(3):e200066. https://doi.org/10.1212/NXG.0000000000200066.
Roy BC, Ahmed I, Stubbs J, Zhang J, Attard T, Septer S, Welch D, Anant S, Sampath V, Umar S. DCLK1 isoforms and aberrant Notch signaling in the regulation of human and murine colitis. Cell Death Discovery. 2021;7(1):169.
Abramovitch-Dahan C, Asraf H, Bogdanovic M, Sekler I, Bush AI, Hershfinkel M. Amyloid β attenuates metabotropic zinc sensing receptor, mZnR/GPR39, dependent Ca2+, ERK1/2 and Clusterin signaling in neurons. J Neurochem. 2016;139(2):221–33.
Popovics P, Stewart AJ. GPR39: a Zn2+-activated G protein-coupled receptor that regulates pancreatic, gastrointestinal and neuronal functions. Cell Mol Life Sci. 2011;68(1):85–95.
Moreno-Jiménez EP, Flor-García M, Terreros-Roncal J, Rábano A, Cafini F, Pallas-Bazarra N, Ávila J, Llorens-Martín M. Adult hippocampal neurogenesis is abundant in neurologically healthy subjects and drops sharply in patients with Alzheimer’s disease. Nat Med. 2019;25(4):554–60.
Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park J-H. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 2013;45(4):400–5.
Ho W-K, Tan M-M, Mavaddat N, Tai M-C, Mariapun S, Li J, Ho P-J, Dennis J, Tyrer JP, Bolla MK, et al. European polygenic risk score for prediction of breast cancer shows similar performance in Asian women. Nat Commun. 2020;11(1):3833.

Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number 86920. We are grateful to UK Biobank participants and the UK Biobank team for their dedication and contributions.

Clinical trial

Not applicable.

Funding

This work was supported by the National Natural Science Foundation of China [32170637]; the Guangdong Project [2017GC010644]; and the Basic and Applied Basic Research Foundation of Guangdong Province [2022A1515110913].

Author information

Authors and Affiliations

Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, 510080, China
Lihang Ye, Liubin Zhang, Bin Tang, Junhao Liang, Ruijie Tan, Hui Jiang, Wenjie Peng, Nan Lin, Kun Li, Chao Xue & Miaoxin Li
Key Laboratory of Tropical Disease Control (Sun Yat-Sen University), Ministry of Education, Guangzhou, 510080, China
Lihang Ye, Liubin Zhang, Bin Tang, Junhao Liang, Ruijie Tan, Wenjie Peng, Nan Lin, Chao Xue & Miaoxin Li
Department of Medical Genetics and Prenatal Diagnosis, The Third Affiliated Hospital of Zhengzhou University, Zhengzhou, 450052, China
Hui Jiang
School of Medicine, Zhejiang University, Hangzhou, 310058, China
Kun Li

Contributions

LY: Conceptualization, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing, Investigation. ML: Project administration, Funding acquisition, Supervision, Writing—review & editing, Software, Validation. LZ: Methodology, Software, Writing—review & editing. BT: Methodology, Writing—review & editing. HJ: Funding acquisition. JL, RT, HJ, WP, NL, KL, and CX: Writing—review & editing.

Corresponding author

Correspondence to Miaoxin Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1: Information.

Supplementary Material 2: Notes, Figures S1 to S8.

Supplementary Material 3: Tables S1 to S3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ye, L., Zhang, L., Tang, B. et al. Ge-SAND: an explainable deep learning-driven framework for disease risk prediction by uncovering complex genetic interactions in parallel. BMC Genomics 26, 432 (2025). https://doi.org/10.1186/s12864-025-11588-9

Received: 24 December 2024
Accepted: 09 April 2025
Published: 01 May 2025
DOI: https://doi.org/10.1186/s12864-025-11588-9
