{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T11:06:39Z","timestamp":1767870399293,"version":"3.49.0"},"reference-count":35,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:00:00Z","timestamp":1767830400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000052","name":"NIH Office of the Director","doi-asserted-by":"publisher","award":["3OT2OD030546-01S3"],"award-info":[{"award-number":["3OT2OD030546-01S3"]}],"id":[{"id":"10.13039\/100000052","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"publisher","award":["R01MH129764"],"award-info":[{"award-number":["R01MH129764"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Biomedical knowledge graphs (KGs), such as the Data Distillery Knowledge Graph (DDKG), capture known relationships among entities (e.g., genes, diseases, proteins), providing valuable insights for research. However, these relationships are typically derived from prior studies, leaving potential unknown associations unexplored. Identifying such unknown associations, including previously unknown disease-associated genes, remains a critical challenge in bioinformatics and is crucial for advancing biomedical knowledge.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Methods<\/jats:title>\n                    <jats:p>Traditional methods, such as linkage analysis and genome-wide association studies (GWAS), can be time-consuming and resource-intensive. This highlights the need for efficient computational approaches to identify or predict new genes using known disease-gene associations. Recently, network-based methods and KGs, enhanced by advances in machine learning (ML) frameworks, have emerged as promising tools for inferring these unexplored associations. Given the technical limitations of the Neo4j Graph Data Science (GDS) machine learning pipeline, we developed a novel machine learning pipeline called KG2ML (Knowledge Graph to Machine Learning). This pipeline utilizes our Positive and Unlabeled (PU) learning algorithm, PULSCAR (Positive Unlabeled Learning Selected Completely At Random), and incorporates path-based feature extraction from ProteinGraphML.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>KG2ML was applied to 12 diseases, including Bipolar Disorder, Coronary Artery Disease, and Parkinson\u2019s Disease, to infer disease-associated genes not explicitly recorded in DDKG. For several of these diseases, 14 out of the 15 top-ranked genes lacked prior explicit associations in the DDKG but were supported by literature and TINX (Target Importance and Novelty Explorer) evidence. Incorporating PULSCAR-imputed genes as positives enhanced XGBoost classification, demonstrating the potential of PU learning in identifying hidden gene-disease relationships.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusion<\/jats:title>\n                    <jats:p>The observed improvement in classification performance after the inclusion of PULSCAR-imputed genes as positive examples, along with the subject matter experts\u2019 (SME) evaluations of the top 15 imputed genes for 12 diseases, suggests that PU learning can effectively uncover disease-gene associations missing from existing knowledge graphs (KGs). By integrating KG data with ML-based inference, our KG2ML pipeline provides a scalable and interpretable framework to advance biomedical research while addressing the inherent limitations of current KGs.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.3389\/fbinf.2025.1727953","type":"journal-article","created":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:34:54Z","timestamp":1767854094000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["KG2ML: integrating knowledge graphs and positive unlabeled learning for identifying disease-associated genes"],"prefix":"10.3389","volume":"5","author":[{"given":"Praveen","family":"Kumar","sequence":"first","affiliation":[]},{"given":"Vincent T.","family":"Metzger","sequence":"additional","affiliation":[]},{"given":"Swastika T.","family":"Purushotham","sequence":"additional","affiliation":[]},{"given":"Priyansh","family":"Kedia","sequence":"additional","affiliation":[]},{"given":"Cristian G.","family":"Bologa","sequence":"additional","affiliation":[]},{"given":"Christophe G.","family":"Lambert","sequence":"additional","affiliation":[]},{"given":"Jeremy J.","family":"Yang","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2026,1,8]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11715","article-title":"Estimating the class prior in positive and unlabeled data through decision tree induction\u2019, proceedings of the AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","volume":"32","author":"Bekker","year":"2018"},{"key":"B2","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1038\/s42003-022-03068-7","article-title":"Machine learning prediction and tau-based screening identifies potential Alzheimer\u2019s disease genes relevant to immunity","volume":"5","author":"Binder","year":"2022","journal-title":"Commun. Biology"},{"key":"B3","doi-asserted-by":"publisher","first-page":"D267","DOI":"10.1093\/nar\/gkh061","article-title":"The Unified Medical Language System (UMLS): integrating biomedical terminology","volume":"32","author":"Bodenreider","year":"2004","journal-title":"Nucleic Acids Research"},{"key":"B4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41592-024-02563-5","article-title":"Human BioMolecular Atlas Program (HuBMAP): 3D Human Reference Atlas construction and usage","volume":"22","author":"B\u00f6rner","year":"2025","journal-title":"Nat. Methods"},{"key":"B5","doi-asserted-by":"crossref","DOI":"10.1145\/2939672.2939785","article-title":"XGBoost","volume-title":"Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD \u201916: the 22nd ACM SIGKDD international conference on knowledge discovery and data mining","author":"Chen","year":"2016"},{"key":"B6","doi-asserted-by":"publisher","first-page":"192435","DOI":"10.1109\/access.2020.3030076","article-title":"Knowledge graph completion: a review","volume":"8","author":"Chen","year":"2020","journal-title":"IEEE Access Practical Innovations, Open Solutions"},{"key":"B7","doi-asserted-by":"publisher","first-page":"750","DOI":"10.3390\/electronics9050750","article-title":"A survey on knowledge graph embedding: approaches, applications and benchmarks","volume":"9","author":"Dai","year":"2020","journal-title":"Electronics"},{"key":"B8","doi-asserted-by":"publisher","first-page":"7166","DOI":"10.3390\/s22197166","article-title":"Impact of label noise on the learning based models for a binary classification of physiological signal","volume":"22","author":"Ding","year":"2022","journal-title":"Sensors Basel, Switz."},{"key":"B9","doi-asserted-by":"publisher","first-page":"e1009909","DOI":"10.1371\/journal.pcbi.1009909","article-title":"Causal reasoning over knowledge graphs leveraging drug-perturbed and disease-specific transcriptomic signatures for drug discovery","volume":"18","author":"Domingo-Fern\u00e1ndez","year":"2022","journal-title":"PLoS Computational Biology"},{"key":"B10","doi-asserted-by":"crossref","DOI":"10.1145\/1401890.1401920","article-title":"Learning classifiers from only positive and unlabeled data","volume-title":"Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD08: the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Elkan","year":"2008"},{"key":"B11","first-page":"468","article-title":"A knowledge graph-based disease-gene prediction system using multi-relational graph convolution networks","volume":"2022","author":"Gao","year":"2022","journal-title":"Annu. Symp. Proceedings. AMIA Symp."},{"key":"B12","first-page":"191","article-title":"A review of challenges and opportunities in machine learning for health","volume":"2020","author":"Ghassemi","year":"2020","journal-title":"AMIA Jt. Summits Transl. Sci. Proceedings"},{"key":"B13","doi-asserted-by":"publisher","first-page":"lqae049","DOI":"10.1093\/nargab\/lqae049","article-title":"Predicting gene disease associations with knowledge graph embeddings for diseases with curtailed information","volume":"6","author":"Gualdi","year":"2024","journal-title":"NAR Genomics Bioinformatics"},{"key":"B14","doi-asserted-by":"publisher","first-page":"e1004259","DOI":"10.1371\/journal.pcbi.1004259","article-title":"Heterogeneous network edge prediction: a data integration approach to prioritize disease-associated genes","volume":"11","author":"Himmelstein","year":"2015","journal-title":"PLoS Computational Biology"},{"key":"B15","doi-asserted-by":"crossref","DOI":"10.1109\/ICMLA51294.2020.00128","article-title":"DEDPUL: Difference-of-estimated-densities-based positive-unlabeled learning","volume-title":"2020 19th IEEE international conference on machine learning and applications (ICMLA). 2020 19th IEEE international conference on machine learning and applications (ICMLA)","author":"Ivanov","year":"2020"},{"key":"B16","doi-asserted-by":"publisher","first-page":"e2451","DOI":"10.7717\/peerj-cs.2451","article-title":"Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation without the selected completely at random assumption\u2019, PeerJ","volume":"10","author":"Kumar","year":"2024","journal-title":"Comput. Science"},{"key":"B17","doi-asserted-by":"publisher","first-page":"750","DOI":"10.1109\/JBHI.2024.3515805","article-title":"Detecting opioid use disorder in health claims data with positive unlabeled learning","volume":"29","author":"Kumar","year":"2025","journal-title":"IEEE Journal Biomedical Health Informatics"},{"key":"B18","doi-asserted-by":"publisher","first-page":"bbab461","DOI":"10.1093\/bib\/bbab461","article-title":"Positive-unlabeled learning in bioinformatics and computational biology: a brief review","volume":"23","author":"Li","year":"2021","journal-title":"Briefings Bioinforma."},{"key":"B19","doi-asserted-by":"publisher","first-page":"bbad118","DOI":"10.1093\/bib\/bbad118","article-title":"End-to-end interpretable disease-gene association prediction","volume":"24","author":"Li","year":"2023","journal-title":"Briefings Bioinformatics"},{"key":"B20","doi-asserted-by":"publisher","first-page":"573","DOI":"10.1109\/JBHI.2022.3217433","article-title":"PPAEDTI: personalized propagation auto-encoder model for predicting drug-target interactions","volume":"27","author":"Li","year":"2023","journal-title":"IEEE Journal Biomedical Health Informatics"},{"key":"B21","doi-asserted-by":"publisher","first-page":"270","DOI":"10.3389\/fgene.2019.00270","article-title":"Identifying disease-gene associations with graph-regularized manifold learning","volume":"10","author":"Luo","year":"2019","journal-title":"Front. Genetics"},{"key":"B22","article-title":"Genetics, bethesda (MD): national Library of medicine (US)","year":"2020","journal-title":"Medlin. Genet"},{"key":"B23","doi-asserted-by":"publisher","first-page":"e17470","DOI":"10.7717\/peerj.17470","article-title":"TIN-X version 3: update with expanded dataset and modernized architecture for enhanced illumination of understudied targets","volume":"12","author":"Metzger","year":"2024","journal-title":"PeerJ"},{"key":"B24","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1186\/1471-2105-12-389","article-title":"ProDiGe: prioritization of Disease Genes with multitask machine learning from positive and unlabeled examples","volume":"12","author":"Mordelet","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"B25","doi-asserted-by":"publisher","first-page":"1057","DOI":"10.1093\/bioinformatics\/btq076","article-title":"The power of protein interaction networks for associating genes with diseases","volume":"26","author":"Navlakha","year":"2010","journal-title":"Bioinforma. Oxf. Engl."},{"key":"B26","doi-asserted-by":"publisher","first-page":"381","DOI":"10.3389\/fgene.2019.00381","article-title":"To embed or not: network embedding as a paradigm in computational biology","volume":"10","author":"Nelson","year":"2019","journal-title":"Front. Genetics"},{"key":"B27","unstructured":"Neo4j - the world\u2019s leading graph database\n          \n          \n          2012"},{"key":"B28","unstructured":"Neo4j Graph Platform, Neo4j\n          \n          \n          2018"},{"key":"B29","doi-asserted-by":"publisher","first-page":"578","DOI":"10.12688\/f1000research.10788.1","article-title":"Recent advances in predicting gene-disease associations","volume":"6","author":"Opap","year":"2017","journal-title":"F1000Research"},{"key":"B30","doi-asserted-by":"publisher","first-page":"19955","DOI":"10.1038\/s41598-022-24421-0","article-title":"GediNET for discovering gene associations across diseases using knowledge based machine learning approach","volume":"12","author":"Qumsiyeh","year":"2022","journal-title":"Sci. Reports"},{"key":"B31","doi-asserted-by":"publisher","first-page":"e20220067","DOI":"10.1002\/ntls.20220067","article-title":"Autophagy dark genes: can we find them with machine learning?","volume":"3","author":"Ranjbar","year":"2023","journal-title":"Nat. Sciences Weinh. Ger."},{"key":"B32","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1186\/s12859-023-05451-5","article-title":"A knowledge graph approach to predict and interpret disease-causing gene interactions","volume":"24","author":"Renaux","year":"2023","journal-title":"BMC Bioinformatics"},{"key":"B33","article-title":"Mixture proportion estimation via kernel embeddings distributions","volume-title":"Proceedings of the 33rd international conference on machine learning, PMLR 48. The 33rd international conference on machine learning","author":"Tewari","year":"2016"},{"key":"B34","unstructured":"The Unified Biomedical Knowledge Graph (UBKG) is a knowledge graph infrastructure that represents a set of interrelated concepts from biomedical ontologies and vocabularies\n          \n          \n          2024"},{"key":"B35","doi-asserted-by":"publisher","first-page":"108048","DOI":"10.1016\/j.compbiomed.2024.108048","article-title":"Predicting disease-gene associations through self-supervised mutual infomax graph convolution network","volume":"170","author":"Xie","year":"2024","journal-title":"Comput. Biology Medicine"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1727953\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:34:56Z","timestamp":1767854096000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1727953\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,8]]},"references-count":35,"alternative-id":["10.3389\/fbinf.2025.1727953"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2025.1727953","relation":{},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,8]]},"article-number":"1727953"}}