Protein structural domain-disease association prediction based on heterogeneous networks

Zhang, Jingpu; Deng, Lianping; Deng, Lei

doi:10.1186/s12864-024-11117-0

Volume 23 Supplement 6

Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA-20): genomics

Research
Open access
Published: 10 April 2025

BMC Genomics volume 23, Article number: 869 (2022)

Abstract

Background

Domains can be viewed as portable units of protein structure, folding, function, evolution, and design. Small proteins are often composed of a single domain, while most large proteins consist of multiple domains that together achieve composite cellular functions. A dysfunction in a domain may impair the function of the protein in some diseases. Inferring disease-related domains will help us understand the mechanisms of complex human diseases.

Results

In this study, we first build a global heterogeneous information network from structure-based domains, proteins, and diseases. We then extract topological features of the network according to the meta-paths between domain and disease nodes. Finally, we train a binary classifier based on the XGBOOST (eXtreme Gradient Boosting) algorithm to predict potential associations between domains and diseases. The results show that the XGBOOST-based binary classification model performs significantly better than models using other machine learning algorithms, achieving an AUC (area under the ROC curve) score of 0.94 in the leave-one-out cross-validation experiment.

Conclusions

We develop a method that builds a binary classifier from meta-path-based topological features and predicts potential associations between domains and diseases. Its predictive performance on independent test sets shows that the method is effective. Moreover, representing domains and diseases by integrating more multi-omic data should further improve predictive performance.

Background

Gene-mapping techniques such as linkage analysis and association studies have revolutionized modern human and medical genetics by enabling the search for genetic variations underlying complex human diseases, which has set off a wave of biomedical research [1, 2]. For example, the Human Genome Project and the HapMap Project, combined with GWAS and sequencing studies, have supported the identification of more than 60,000 genetic associations across thousands of human diseases and traits [3]. Even when disease-associated genes have been identified [4,5,6,7,8], narrowing an association down to a specific structure-based protein domain is very challenging, because a gene encodes a protein that may contain one or several domains [9, 10]. A protein domain is a conserved functional unit of a protein that can fold into a relatively independent spatial structure, and each structure-based domain has a distinct function [11]. Most proteins consist of one or several domains, and the same domain may be found in a variety of proteins, where it carries out a specific function [12, 13]. If a gene encodes a multi-domain protein that is associated with a disease, one of the domains of that protein may be responsible for the association. Identifying the domains related to complex human diseases will greatly improve our understanding of these diseases and promote drug discovery and personalized treatment [14,15,16].

Domains fall into two categories: sequence-based domains and structure-based domains. Sequence-based domains are usually derived from a given protein sequence by homology alone, and the relations between human proteins and sequence-based domains are readily available from the Pfam database [17,18,19,20]. For this reason, several studies have examined potential associations between sequence-based domains and diseases. For example, Wang et al. studied the relationships between human diseases and domains using the 'guilt-by-proximity' principle [20]. Zhang et al. developed a method named domainRBF to infer associations between domains and human diseases [21]. However, structure-based domains are more reliable than sequence-based domains for identifying protein function [22, 23]. Thus, in this work, we study structure-based domains related to diseases. To the best of our knowledge, no previous study has predicted the relationships between structure-based domains and diseases.

In this article, we build a heterogeneous information network containing domain-protein associations, disease similarities, domain-disease associations, protein-disease associations, and protein interactions. Inferring associations between domains and diseases then becomes a link prediction task between nodes of the heterogeneous information network. Unlike traditional homogeneous networks, heterogeneous information networks contain several types of nodes, which can be connected through different types of relationships [24].

To capture the topological characteristics of the heterogeneous information network that are needed to predict potential associations between structure-based protein domains and diseases, we adopt the meta-path concept proposed by Sun et al. [25]. A meta-path defines a type of path, that is, a sequence of node and link types connecting a start node to an end node, and each meta-path carries specific semantics. The number of path instances that follow a particular meta-path is an important topological feature for evaluating the strength of the association between the start node and the end node. For example, a domain and a disease can be connected by several meta-paths, including (A) domain $\rightarrow$ protein $\rightarrow$ protein $\rightarrow$ disease and (B) domain $\rightarrow$ protein $\rightarrow$ disease $\rightarrow$ disease. These two meta-paths, both of which start at a protein structural domain and end at a disease, have different semantics: (A) if a protein is involved in a disease, a domain of another protein that interacts with it might be associated with that disease; (B) if a protein is associated with a disease, a domain in that protein might be related to other diseases similar to that disease.

We extract the topological features of the global network by designing different meta-paths. From these topological features, we build a binary classifier based on the XGBOOST (eXtreme Gradient Boosting) algorithm to predict potential associations between structure-based domains and diseases.

Materials and methods

Datasets

To build the heterogeneous information network, we collected information about structure-based domains, proteins and diseases from different data sources. The data comprise five networks: a disease similarity network, a structure-based domain-protein association network, a protein interaction network, a protein-disease association network, and a structure-based domain-disease association network.

Domain-protein network

We downloaded structure-based domain-protein associations from the SDADB database [12]. SDADB is a database of functional annotations for structural (SCOP) domains predicted by an integrated method. We use the UniProt ID as the unique identifier of each protein, and convert each PDB_ID in the SDADB database to the corresponding UniProt ID through the relevant mapping file. After data processing, we obtained 1,756,074 domain-protein associations involving 140,189 domains and 35,866 proteins.

Protein-protein network

We collected protein interaction data from the STRING database [26]. Since this article studies only the associations between human diseases and protein domains, we selected only human protein interactions. Because interactions with combined_score $\ge$ 400 are widely treated as reliable, we kept only the interactions with combined_score $\ge$ 400. We then converted each String_ID in the STRING database to the corresponding UniProt ID through the relevant mapping file, so that every protein is identified by its UniProt ID. After removing redundant entries, we obtained 1,462,673 human protein interactions among 35,866 proteins.
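The filtering, ID mapping and de-duplication steps can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the record layout, the `string_to_uniprot` dictionary and every identifier are hypothetical, and only the `combined_score` threshold of 400 comes from the text.

```python
# Toy STRING-like records: (protein_a, protein_b, combined_score).
# All identifiers below are made up for illustration.
raw_interactions = [
    ("9606.ENSP0001", "9606.ENSP0002", 912),
    ("9606.ENSP0002", "9606.ENSP0001", 912),  # same pair, reversed order
    ("9606.ENSP0001", "9606.ENSP0003", 215),  # below the threshold
    ("9606.ENSP0003", "9606.ENSP0004", 400),
    ("9606.ENSP0004", "9606.ENSP0005", 650),  # ENSP0005 has no mapping
]

# Hypothetical STRING ID -> UniProt ID mapping, assumed to be already parsed.
string_to_uniprot = {
    "9606.ENSP0001": "P00001",
    "9606.ENSP0002": "P00002",
    "9606.ENSP0003": "P00003",
    "9606.ENSP0004": "P00004",
}

MIN_SCORE = 400  # threshold stated in the text


def filter_and_map(records, id_map, min_score=MIN_SCORE):
    """Keep confident interactions, map IDs, and drop duplicate pairs."""
    edges = set()
    for a, b, score in records:
        if score < min_score:
            continue
        if a not in id_map or b not in id_map:
            continue  # unmapped proteins cannot be placed in the network
        # Sort the pair so (a, b) and (b, a) collapse to one undirected edge.
        edges.add(tuple(sorted((id_map[a], id_map[b]))))
    return sorted(edges)


ppi_edges = filter_and_map(raw_interactions, string_to_uniprot)
print(ppi_edges)
```

Treating the interaction as undirected is an assumption of this sketch; it is what makes the reversed duplicate collapse into a single edge.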

Protein-disease network and disease-disease network

We obtained human disease phenotype data from the OMIM database [27]. In this paper, a disease phenotype is simply referred to as a disease. OMIM is one of the resources most commonly used by geneticists. It is a continuously updated catalog of human genes and genetic disorders [28, 29], with a focus on heritable diseases. After data processing, we obtained 5,802 protein-disease associations between the 35,866 proteins and 5,099 diseases. In addition, we downloaded 88,665 phenotypic similarities among the 5,099 diseases from the work of Reyes-Palomares et al. [30].

Disease-domain network

To train and evaluate our classifier, we need a reference data set of structure-based domain-disease associations. However, no manually assigned structure-based domain-disease association data are available yet. Instead, we derive the associations by transferring the disease associations of single-domain proteins to their component domains: when a protein consists of a single domain, its association with a disease is attributed to that domain. After data processing, we obtained 666 domain-disease associations between 140,189 domains and 5,099 diseases.
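The transfer rule for single-domain proteins can be illustrated with a short sketch. The edge lists and identifiers below are hypothetical toy data, and the function name `transfer_single_domain` is introduced here only for illustration.

```python
from collections import defaultdict

# Toy inputs; identifiers are hypothetical.
domain_protein = [
    ("d1", "P1"),                # P1 has a single domain
    ("d2", "P2"), ("d3", "P2"),  # P2 has two domains
    ("d4", "P3"),                # P3 has a single domain
]
protein_disease = [("P1", "S1"), ("P2", "S1"), ("P3", "S2"), ("P3", "S3")]


def transfer_single_domain(domain_protein, protein_disease):
    """Attribute a protein's diseases to its domain when it has exactly one domain."""
    domains_of = defaultdict(set)
    for domain, protein in domain_protein:
        domains_of[protein].add(domain)

    associations = set()
    for protein, disease in protein_disease:
        domains = domains_of.get(protein, set())
        if len(domains) == 1:  # multi-domain proteins are ambiguous and skipped
            (domain,) = domains
            associations.add((domain, disease))
    return sorted(associations)


domain_disease = transfer_single_domain(domain_protein, protein_disease)
print(domain_disease)
```

The two-domain protein `P2` contributes nothing, because its disease association cannot be attributed to either domain without further evidence.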

Building a global heterogeneous information network

Based on the data sources described above, we build a global heterogeneous network composed of the domain-protein association network, the protein interaction network, the protein-disease association network, the disease similarity network, and the domain-disease association network. The global heterogeneous network is denoted as $G = (H, F)$, where $H = D \cup P \cup S$; $D$, $P$ and $S$ are the sets of structure-based domain, protein and disease nodes in the network, respectively; and $\text {F} = \text {F}_{d,p}\cup \text {F}_{p,p} \cup \text {F}_{s,d} \cup \text {F}_{p,s} \cup \text {F}_{s,s} \cup \text {F}_{d,s} \cup \text {F}_{p,d} \cup \text {F}_{s,p}$ is the set of heterogeneous links in $G$ [31]. Based on the semantic links between node types, a total of five adjacency matrices are calculated. Each element of an adjacency matrix takes one of two values: '0' for an unobserved link and '1' for an observed link. It is worth noting that every association can be traversed in both directions: the matrix for the reverse direction is obtained by transposing the original matrix. The global heterogeneous information network is shown in Fig. 1.
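As a minimal sketch of this representation, the snippet below builds a binary adjacency matrix from an edge list and derives the reverse-direction matrix by transposition. The helper names `adjacency` and `transpose` and the toy node labels are assumptions made for illustration.

```python
def adjacency(rows, cols, edges):
    """Binary matrix A with A[i][j] = 1 if (rows[i], cols[j]) is an observed link."""
    r_index = {name: i for i, name in enumerate(rows)}
    c_index = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in edges:
        matrix[r_index[r]][c_index[c]] = 1
    return matrix


def transpose(matrix):
    """Reverse-direction matrix, e.g. protein-domain from domain-protein."""
    return [list(col) for col in zip(*matrix)]


domains = ["d1", "d2"]
proteins = ["P1", "P2", "P3"]
A_dp = adjacency(domains, proteins, [("d1", "P1"), ("d1", "P2"), ("d2", "P3")])
A_pd = transpose(A_dp)
print(A_dp)
print(A_pd)
```

At the scale reported in the paper (140,189 domains), dense lists would be impractical and a sparse matrix format would be the natural choice; dense lists are used here only to keep the example dependency-free.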

Extracting meta-path-based topological features

As described above, the total number of paths belonging to a particular meta-path is an important topological feature that can be extracted from a heterogeneous network to assess the strength of the association between the start and end nodes. Hence, we construct meta-paths to extract topological features of domain and disease nodes in the network [32]. The meta-path topological features are encoded in commuting matrices. Each commuting matrix corresponds to one meta-path of a given length, and each of its elements is the non-negative number of path instances that link one domain to one disease. Figure 2 uses one of the meta-paths (domain $\rightarrow$ protein $\rightarrow$ disease $\rightarrow$ disease) as an example to demonstrate the calculation of a commuting matrix. In this example, the commuting matrix is obtained by multiplying the domain-protein association adjacency matrix, the protein-disease association adjacency matrix and the disease-disease similarity adjacency matrix [25, 33].
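The matrix product behind a commuting matrix can be sketched on a toy network. The matrices below are invented for illustration (2 domains, 3 proteins, 2 diseases), and `matmul` and `commuting_matrix` are hypothetical helper names, not part of any published code.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def commuting_matrix(*adjacency_matrices):
    """Multiply adjacency matrices along a meta-path, left to right."""
    result = adjacency_matrices[0]
    for nxt in adjacency_matrices[1:]:
        result = matmul(result, nxt)
    return result


# Toy network; values are illustrative only.
A_dp = [[1, 1, 0],   # d1 occurs in P1 and P2
        [0, 0, 1]]   # d2 occurs in P3
A_ps = [[1, 0],      # P1 - S1
        [1, 1],      # P2 - S1, S2
        [0, 1]]      # P3 - S2
A_ss = [[0, 1],      # S1 and S2 are similar
        [1, 0]]

# Meta-path: domain -> protein -> disease -> disease
M = commuting_matrix(A_dp, A_ps, A_ss)
print(M)  # [[1, 2], [1, 0]]
```

Here `M[0][1] = 2` because two path instances connect `d1` to `S2`: `d1 → P1 → S1 → S2` and `d1 → P2 → S1 → S2`.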

To predict potential associations between structure-based domains and diseases, we list all possible meta-paths, producing a total of 27 meta-paths. The length of a meta-path equals the number of adjacency matrices multiplied. Of the 27 meta-paths, 2 have length 2, 4 have length 3, 11 have length 4, and 10 have length 5. All meta-paths are listed in Table 1.

Table 1 The meta-paths from domains to diseases in the heterogeneous network

Each element of a commuting matrix is denoted by $M_{i,j}$, which is the number of path instances between nodes $i$ and $j$. We normalize the path counts with the Random Walk (RW) algorithm: the normalized value is $M_{i,j} / M_{i}$, where $M_i = \sum_{j} M_{i,j}$ is the sum of row $i$.
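A minimal sketch of this row-wise normalization is given below. The handling of all-zero rows (left as zeros) is an assumption of this example, since the text does not say how rows without any path instance are treated.

```python
def rw_normalize(matrix):
    """Divide each entry by its row sum; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumption: avoid division by zero for nodes without any path.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


M = [[1, 3], [1, 0], [0, 0]]
print(rw_normalize(M))  # [[0.25, 0.75], [1.0, 0.0], [0.0, 0.0]]
```

After normalization, each non-empty row sums to 1, so an entry can be read as the probability that a random walker starting at domain $i$ and following the meta-path ends at disease $j$.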

Binary classification model based on XGBOOST algorithm

XGBOOST is an optimized implementation of the gradient boosting algorithm, designed for speed and performance. Gradient boosting is an ensemble technique in which new models are trained to predict the residuals of prior models, and the models are then added together to make the final prediction. The approach supports both regression and classification. In this work, XGBOOST is used to predict potential associations between domains and diseases.
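The residual-fitting idea can be shown with a deliberately small sketch. It implements plain gradient boosting for squared error with one-split regression stumps on a single feature; it is not XGBOOST itself, which additionally uses regularization and second-order gradient information. The function names, data and hyperparameter values are all illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additive model: start from the mean, then repeatedly fit the residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # Residuals of the current ensemble are the targets of the next stump.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)
```

Each round shrinks the remaining residual by the learning rate, which is why the training error on this toy problem decays geometrically with the number of rounds.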

To train an XGBOOST classifier, positive and negative samples are needed. We regard the 666 domain-disease associations described above as positive samples, and randomly select the same number of domain-disease pairs without a known association as negative samples (a ratio of 1:1). To evaluate the predictive performance without prior knowledge, the positive links assigned to the test set are removed from the heterogeneous information network and are then combined with negative samples to form the test set. The training set accounts for two-thirds of the data and the test set for one-third. The training set therefore contains 888 links (444 positive and 444 negative), and the test set contains 444 links (222 positive and 222 negative) [25].
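The sample bookkeeping can be sketched as follows. The node labels are hypothetical stand-ins sized to match the 666 positive pairs, and the sketch covers only negative sampling and the split; it does not model the removal of test links from the network before feature extraction.

```python
import random


def build_samples(positives, all_domains, all_diseases, seed=0):
    """Pair the positives with an equal number of random unobserved pairs (1:1)."""
    rng = random.Random(seed)
    known = set(positives)
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(all_domains), rng.choice(all_diseases))
        if pair not in known:
            negatives.add(pair)
    return [(p, 1) for p in positives] + [(n, 0) for n in sorted(negatives)]


def split_by_label(samples, train_fraction=2 / 3, seed=0):
    """Shuffle positives and negatives separately so both sets stay balanced."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (1, 0):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical stand-ins: 666 distinct positive (domain, disease) pairs.
domains = [f"d{i}" for i in range(700)]
diseases = [f"S{i}" for i in range(50)]
positives = [(domains[i], diseases[i % 50]) for i in range(666)]

samples = build_samples(positives, domains, diseases)
train, test = split_by_label(samples)
print(len(train), len(test))  # 888 444
```

Splitting each label group separately is a design choice of this sketch; it guarantees the 444/444 and 222/222 balance described in the text regardless of the random seed.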

Evaluation measures

To evaluate the prediction performance of the classifier, we employ two metrics: the area under the ROC curve (AUC) and the $F_1$ score. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR): for each classification threshold, the corresponding TPR and FPR are computed, and the curve is obtained by varying the threshold [34]. The AUC is then calculated to measure the overall performance.
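The threshold sweep and the area computation can be sketched in a few lines. The labels and scores are invented, and the trapezoidal rule used for the area is one common choice assumed here; the paper does not state how the AUC was computed.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

On this toy input the result equals 8/9, which matches the fraction of (positive, negative) pairs in which the positive sample receives the higher score.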

The $F_1$ score assesses the accuracy of a binary classification model and can be viewed as the harmonic mean of the model's precision and recall, so it takes both into account. With TP, FP and FN denoting the numbers of true positives, false positives and false negatives, it is defined as:

$$F_1 = \frac{2TP}{2TP + FN + FP} \tag{1}$$
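As a quick check of Eq. (1), the sketch below computes the $F_1$ score from the counts directly and also as the harmonic mean of precision and recall; the two forms agree. The counts used are arbitrary illustrative numbers, not results from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score as written in Eq. (1)."""
    return 2 * tp / (2 * tp + fn + fp)


def f1_from_precision_recall(tp, fp, fn):
    """F1 score as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Arbitrary counts for illustration.
print(f1_from_counts(tp=8, fp=2, fn=2))            # 0.8
print(f1_from_precision_recall(tp=8, fp=2, fn=2))
```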

Results

XGBOOST model

To achieve the best prediction performance, the parameters of the XGBOOST algorithm need to be tuned carefully, which can be difficult because the algorithm has many parameters. In this work, the parameters n_estimators, max_depth, min_child_weight, gamma, colsample_bytree and reg_alpha are tuned by grid search, and the other parameters are left at their default values. The performance of the XGBOOST model is evaluated by 3-fold cross-validation. The validation data set is randomly divided into three groups of equal size; one group is used as the test set and the other two as the training set. The experiment is repeated three times so that each group is held out once and every held-out domain-disease pair is used to test the model. The average over the three runs is taken as the final performance.
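The combination of grid search and 3-fold cross-validation can be sketched without any machine learning library. In the example below, `toy_score` is a placeholder that stands in for "train a classifier on the training indices and return its AUC on the held-out indices", and the candidate parameter values are invented; only the parameter names and the 3-fold scheme come from the text.

```python
import random
from itertools import product
from statistics import mean


def three_fold_indices(n_samples, n_folds=3, seed=0):
    """Yield (train_idx, test_idx) so that every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def grid_search(param_grid, score_fn, n_samples):
    """Return the parameter combination with the best mean cross-validated score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(score_fn(params, tr, te) for tr, te in three_fold_indices(n_samples))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid over two of the tuned parameters; candidate values are made up.
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 300]}


def toy_score(params, train_idx, test_idx):
    """Placeholder score; a real run would train and evaluate a classifier here."""
    return 1.0 - abs(params["max_depth"] - 5) * 0.01 - abs(params["n_estimators"] - 300) * 0.0001


best_params, best_score = grid_search(param_grid, toy_score, n_samples=1332)
print(best_params)
```

With 1,332 samples, each of the three folds holds out 444 samples, and the reported score for a parameter combination is the mean over the three held-out folds.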

Effect of RW normalization on performance

RW normalization [35] of the features may improve the predictive performance of the binary classifier. To verify this, we constructed two sets of topological features from the path counts. Feature-I contains the path counts encoded by the 27 commuting matrices and is provided in Additional files 1 and 3. Applying RW normalization to these path counts and combining the normalized values with the raw counts gives Feature-II, which contains 54 topological features and is provided in Additional files 2 and 4. We ran four classifiers (XGBOOST (eXtreme Gradient Boosting), RF (Random Forest), SVM (Support Vector Machine) and BYS (Naive Bayes)) on Feature-I and Feature-II. The performance of the four classifiers on the two topological feature sets is compared in Table 2 and Figs. 3, 4, 5, 6, 7, 8, 9, and 10.
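How a 27-dimensional Feature-I vector and a 54-dimensional Feature-II vector could be assembled for one domain-disease pair is sketched below. The 27 toy matrices are arbitrary, the function names are hypothetical, and the ordering of features (raw counts first, normalized values second) is an assumption.

```python
def pair_features(commuting_matrices, i, j):
    """Feature-I: raw path counts for domain i and disease j, one per meta-path."""
    return [matrix[i][j] for matrix in commuting_matrices]


def pair_features_with_rw(commuting_matrices, i, j):
    """Feature-II: raw counts followed by their row-normalized (RW) values."""
    counts = pair_features(commuting_matrices, i, j)
    normalized = []
    for matrix in commuting_matrices:
        row_sum = sum(matrix[i])
        normalized.append(matrix[i][j] / row_sum if row_sum else 0.0)
    return counts + normalized


# 27 toy commuting matrices (2 domains x 3 diseases each); values are arbitrary.
matrices = [[[k, 1, 1], [0, 0, k + 1]] for k in range(27)]
feature_1 = pair_features(matrices, 0, 0)
feature_2 = pair_features_with_rw(matrices, 0, 0)
print(len(feature_1), len(feature_2))  # 27 54
```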

Table 2 Performance comparison among different models on the two topological feature sets

The table and figures show that every model performs better on Feature-II than on Feature-I. For the XGBOOST model, both the AUC and the $F_1$ score improve by about 1%. The other three models also improve in AUC and $F_1$ score. These results indicate that RW normalization improves predictive performance by enriching the feature space.

Influence of the length of meta-paths on performance

Meta-paths of different lengths generally contribute differently to performance. To determine which lengths give the best performance, we build four sets of topological features (TF1, TF2, TF3 and TF4) by combining meta-paths of different lengths. TF1 contains the path counts of the meta-paths of length 2 and 3, encoded in 6 commuting matrices. TF2 contains the path counts of the meta-paths of length 2, 3 and 4, encoded in 17 commuting matrices. TF3 contains the path counts of the meta-paths of length 2, 3, 4 and 5, encoded in 27 commuting matrices. TF4 contains the path counts of the meta-paths of length 2, 3, 4, 5 and 6, encoded in 37 commuting matrices. The meta-paths of length 6 are listed in Table 3. Based on the four feature sets, we compare the predictive performance of the classifier. The results are shown in Fig. 11.
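The sizes of the four feature sets follow from the number of meta-paths per length, as the short check below shows. The counts for lengths 2 to 5 are those stated in the Methods; the count for length 6 is not stated explicitly and is inferred here from the totals (37 - 27 = 10).

```python
from itertools import accumulate

# Meta-paths per length; the value for length 6 is inferred, not quoted.
paths_per_length = {2: 2, 3: 4, 4: 11, 5: 10, 6: 10}

lengths = sorted(paths_per_length)
cumulative = dict(zip(lengths, accumulate(paths_per_length[n] for n in lengths)))

feature_sets = {"TF1": 3, "TF2": 4, "TF3": 5, "TF4": 6}  # maximum length per set
sizes = {name: cumulative[max_len] for name, max_len in feature_sets.items()}
print(sizes)  # {'TF1': 6, 'TF2': 17, 'TF3': 27, 'TF4': 37}
```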

Table 3 The meta-paths from domains to diseases in the heterogeneous network with length of 6

The results show that all evaluation metrics reach their maximum on TF3 and their minimum on TF1. From TF1 to TF3, the prediction performance of the XGBOOST model improves as more meta-paths are added. However, the performance of XGBOOST on TF4 is worse than that on TF3. Therefore, we only consider meta-paths of length at most five in this work.

Comparison with other machine learning models

To further validate the effectiveness of the chosen XGBOOST model, we compare it with other popular machine learning algorithms: RF, SVM and BYS. To make the comparison fair, all four methods are run on the same data set (the 666 experimentally validated domain-disease associations described in the Materials and methods section). The performance is evaluated by AUC and $F_1$ score, and the results are shown in Table 4 and Fig. 12.

Table 4 Performance comparison based on different models

The results show that all models achieve satisfactory results in predicting potential associations between structure-based domains and diseases. According to the table, the AUC scores of all four models are higher than 0.85 and the $F_1$ scores are all above 0.8. The XGBOOST model has the best prediction performance, with an AUC score of 0.9461 and an $F_1$ score of 0.8942. The RF model is slightly inferior to the XGBOOST model in AUC and $F_1$ score, while the BYS model performs worst on both metrics. These results indicate that the XGBOOST model performs better than the other machine learning algorithms. As Fig. 12 shows, the predictive power of the four models follows the order XGBOOST > RF > SVM > BYS.

Discussion

Recent studies on the modular nature of human genetic diseases have revealed that diseases sharing common clinical characteristics are often caused by functionally related genes. It has become increasingly important to locate the genes responsible for complex diseases, which facilitates the prevention, diagnosis and treatment of these diseases. Notably, a protein is typically composed of several structural domains, each closely related to a specific function of the protein. It is therefore reasonable to infer associations between protein domains and complex diseases. In this paper, we apply the 'guilt-by-association' principle to a global heterogeneous information network consisting of the domain-protein association network, the protein interaction network, the protein-disease association network, the disease similarity network, and the domain-disease association network. Protein structure is evolutionarily more conserved than sequence, and protein function depends on the global architecture, the inner dynamics of folds, and the subtle surface properties that give binding specificity. Hence, we predict structure-based domains for diseases. From the global heterogeneous network, we extract meta-path topological features of domain and disease nodes and encode them in commuting matrices. Furthermore, the commuting matrices are normalized with the RW algorithm to improve predictive performance. To verify this, we build two sets of topological features, Feature-I and Feature-II, the latter of which also contains the RW-normalized values of the commuting matrices. Different models (XGBOOST, RF, SVM and BYS) are employed to evaluate the performance on the two feature sets. The predictive performance of each model improves on Feature-II, as shown in Table 2.

To further investigate the influence of meta-path length on performance, multiple topological feature sets with different meta-path lengths are examined using the XGBOOST model. The results are illustrated in Fig. 11, and we finally use the meta-paths of length at most five in this study. Moreover, we compare the XGBOOST model with three baseline models, demonstrating its superior performance on the experimentally validated data set.

Our approach could be extended in the following directions. First, a new network of biological entities can be integrated into the global network as long as it is connected to two other networks already in the global network. For example, microRNAs could be integrated in this way. Second, we only use the topological features of the global network. Additional features, such as sequence and structural features of domains, could be explored and incorporated into the method.

Conclusions

With recent developments in human genetics and computational biology, it has become possible to identify many of the genes involved in complex diseases. However, narrowing an association down to a specific structure-based domain can be challenging. Structure-based domain-disease association prediction is a relatively new topic, and determining these associations has drawn extensive attention because it is beneficial for drug discovery and personalized treatment of complex diseases. Machine learning methods provide a new way to predict potential associations. Moreover, a semantic network integrating domain knowledge across chemical and biological space can contribute to this research. In this study, we develop a method that builds a binary classifier from meta-path-based topological features and predicts potential associations between domains and diseases. Its predictive performance on independent test sets shows that the method is effective. We also demonstrate that the XGBOOST-based model outperforms other machine learning models such as SVM, RF and Naive Bayes. In the future, we will collect more multi-omic data to represent domains and diseases and to further infer the associations between them.

Data availability

Topological features according to the meta-paths are available online.

Abbreviations

RF: Random Forest
SVM: Support Vector Machine
BYS: Naive Bayes
XGBOOST: eXtreme Gradient Boosting

References

Lander ES, Schork NJ. Genetic dissection of complex traits. Science. 1994;265(5181):2037–48.
Wang WY, Barratt BJ, Clayton DG, Todd JA. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet. 2005;6(2):109–18.
Claussnitzer M, Cho JH, Collins R, Cox NJ, Dermitzakis ET, Hurles ME, et al. A brief history of human disease genetics. Nature. 2020;577(7789):179–89.
Lage K, Hansen NT, Karlberg EO, Eklund AC, Roque FS, Donahoe PK, et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci. 2008;105(52):20870–5.
Oellrich A, Smedley D. Linking tissues to phenotypes using gene expression profiles. Database. 2014;2014:bau017.
Farh KKH, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518(7539):337–43.
Yang H, Robinson PN, Wang K. Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nat Methods. 2015;12(9):841–3.
Voskanian-Kordi A, Funai A, Kann MG. DomainScope: A disease network based on protein domain connections. 2019. arXiv preprint arXiv:1911.08676. https://doi.org/10.48550/arXiv.1911.08676.
Peterson TA, Park D, Kann MG. A protein domain-centric approach for the comparative analysis of human and yeast phenotypically relevant mutations. BMC Genomics. 2013;14(S3):S5.
Peterson TA, Adadey A, Santana-Cruz I, Sun Y, Winder A, Kann MG. DMDM: domain mapping of disease mutations. Bioinformatics. 2010;26(19):2458–9.
Gonzalez MW, Kann MG. Chapter 4: Protein interactions and disease. PLoS Comput Biol. 2012;8(12):e1002819.
Zeng C, Zhan W, Deng L. SDADB: a functional annotation database of protein structural domains. Database. 2018;2018:bay064.
Savojardo C, Babbi G, Martelli PL, Casadio R. Functional and Structural Features of Disease-Related Protein Variants. Int J Mol Sci. 2019;20(7):1530.
Zhang W, Coba MP, Sun F. Inference of domain-disease associations from domain-protein, protein-disease and disease-disease relationships. In: BMC systems biology. vol. 10. San Francisco: BioMed Central; 2016. p. S4.
Dudley JT, Deshpande T, Butte AJ. Exploiting drug-disease relationships for computational drug repositioning. Brief Bioinforma. 2011;12(4):303–11.
Krassowski M, Paczkowska M, Cullion K, Huang T, Dzneladze I, Ouellette BFF, et al. ActiveDriverDB: human disease mutations and genome variation in post-translational modification sites of proteins. Nucleic Acids Res. 2018;46(D1):D901–10.
Oliver S. Guilt-by-association goes global. Nature. 2000;403(6770):601–2.
Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24(5):537–44.
Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82(4):949–58.
Wang W, Zhang W, Jiang R, Luan Y. Prioritisation of associations between protein domains and complex diseases using domain-domain interaction networks. IET Systems Biol. 2010;4(3):212–22.
Zhang W, Chen Y, Sun F, Jiang R. DomainRBF: a Bayesian regression approach to the prioritization of candidate domains for complex diseases. BMC Syst Biol. 2011;5(1):55.
Deng L, Chen Z. An integrated framework for functional annotation of protein structural domains. IEEE/ACM Trans Comput Biol Bioinforma. 2015;12(4):902–13.
Rotman Y, Koh C, Zmuda JM, Kleiner DE, Liang TJ, NASH CRN. The association of genetic variability in patatin-like phospholipase domain-containing protein 3 (PNPLA3) with histological severity of nonalcoholic fatty liver disease. Hepatology. 2010;52(3):894–903.
Murzin AG, Brenner SE, Hubbard T, Chothia C, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247(4):536–40.
Fu G, Ding Y, Seal A, Chen B, Sun Y, Bolton E. Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinformatics. 2016;17(1):160.
Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2010;39(suppl_1):D561–D568.
Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015;43(D1):D789–98.
Rahimi K, Emdin CA, MacMahon S. The epidemiology of blood pressure and its worldwide management. Circ Res. 2015;116(6):925–36.
Dorius N. Understanding change in poor communities: What is it and how will we know when it happens? J Urban Affairs. 2009;31(1):97–109.
Reyes-Palomares A, Rodríguez-López R, Ranea JA, Jiménez FS, Medina MA. Global analysis of the human pathophenotypic similarity gene network merges disease module components. PloS ONE. 2013;8(2):e56653.
Wu G, Liu J, Yue X. Prediction of drug-disease associations based on ensemble meta paths and singular value decomposition. BMC Bioinformatics. 2019;20(3):134.
Shim JE, Kim JH, Shin J, Lee JE, Lee I. Pathway-specific protein domains are predictive for human diseases. PLoS Comput Biol. 2019;15(5):e1007052.
Ortiz FW, Sergeev YV. Global computational mutagenesis of domain structures associated with inherited eye disease. Sci Rep. 2019;9(1):1–12.
Zhang H, Li Y, Zhang Y, Shen Q. Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote Sens Lett. 2017;8(5):438–47.
Kearney PM, Whelton M, Reynolds K, Muntner P, Whelton PK, He J. Global burden of hypertension: analysis of worldwide data. Lancet. 2005;365(9455):217–23.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Genomics Volume 23 Supplement 6, 2022: Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA-20): genomics. The full contents of the supplement are available online at https://biomedcentral-bmcgenomics.publicaciones.saludcastillayleon.es/articles/supplements/volume-23-supplement-6.

Funding

Publication costs are funded by the National Natural Science Foundation of China (No. 62172140) and the Academic and Technical Leader project of HUUC (No. YCJXSJSDTR202206). The funders did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

School of Computer and Data Science, Henan University of Urban Construction, 467000, Pingdingshan, China
Jingpu Zhang
School of Computer Science and Engineering, Central South University, 410075, Changsha, China
Lianping Deng & Lei Deng

Authors

Jingpu Zhang
Lianping Deng
Lei Deng

Contributions

JPZ and LPD implemented the method, performed the experiments, and drafted the manuscript. JPZ and LD reviewed and revised the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lei Deng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: The topological feature test set.

Additional file 2: The topological feature test set through RW normalization.

Additional file 3: The topological feature train set.

Additional file 4: The topological feature train set through RW normalization.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, J., Deng, L. & Deng, L. Protein structural domain-disease association prediction based on heterogeneous networks. BMC Genomics 23 (Suppl 6), 869 (2022). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-024-11117-0

Received: 01 May 2021
Accepted: 02 December 2024
Published: 10 April 2025
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-024-11117-0
