{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T21:49:02Z","timestamp":1767649742496,"version":"3.48.0"},"reference-count":40,"publisher":"SAGE Publications","issue":"3-4","license":[{"start":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T00:00:00Z","timestamp":1717027200000},"content-version":"vor","delay-in-days":366,"URL":"http:\/\/www.sagepub.com\/licence-information-for-chorus"},{"start":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T00:00:00Z","timestamp":1685404800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/100000015","name":"Department of Energy","doi-asserted-by":"crossref","award":["#DE-SC0022328"],"award-info":[{"award-number":["#DE-SC0022328"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2023,7]]},"abstract":"<jats:p>Identifying and addressing anomalies in complex, distributed systems can be challenging for reliable execution of scientific workflows. We model these workflows as directed acyclic graphs (DAGs), where the nodes and edges of the DAGs represent jobs and their dependencies, respectively. We develop graph neural networks (GNNs) to learn patterns in the DAGs and to detect anomalies at the node (job) and graph (workflow) levels. We investigate workflow-specific GNN models that are trained on a particular workflow and workflow-agnostic GNN models that are trained across the workflows. Our GNN models, which incorporate both individual job features and topological information from the workflow, show improved accuracy and efficiency compared to conventional learning methods for detecting anomalies. While joint trained with multiple scientific workflows, our GNN models reached an accuracy more than 80% for workflow level and 75% for job level anomalies. In addition, we illustrate the importance of hyperparameter tuning method in our study that can significantly improve the metric(s) measure of evaluating the GNN models. Finally, we integrate explainable GNN methods to provide insights on job features in the workflow that cause an anomaly.<\/jats:p>","DOI":"10.1177\/10943420231172140","type":"journal-article","created":{"date-parts":[[2023,5,31]],"date-time":"2023-05-31T00:42:30Z","timestamp":1685493750000},"page":"394-411","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":7,"title":["Graph neural networks for detecting anomalies in scientific workflows"],"prefix":"10.1177","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2851-595X","authenticated-orcid":false,"given":"Hongwei","family":"Jin","sequence":"first","affiliation":[{"name":"Argonne National Laboratory"}]},{"given":"Krishnan","family":"Raghavan","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9384-5034","authenticated-orcid":false,"given":"George","family":"Papadimitriou","sequence":"additional","affiliation":[{"name":"University of Southern California"}]},{"given":"Cong","family":"Wang","sequence":"additional","affiliation":[{"name":"Renaissance Computing Institute (RENCI)"}]},{"given":"Anirban","family":"Mandal","sequence":"additional","affiliation":[{"name":"Renaissance Computing Institute (RENCI)"}]},{"given":"Mariam","family":"Kiran","sequence":"additional","affiliation":[{"name":"Energy Sciences Network (ESnet)"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5106-503X","authenticated-orcid":false,"given":"Ewa","family":"Deelman","sequence":"additional","affiliation":[{"name":"University of Southern California"}]},{"given":"Prasanna","family":"Balaprakash","sequence":"additional","affiliation":[{"name":"Oak Ridge National Laboratory"}]}],"member":"179","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"key":"e_1_3_3_2_1","unstructured":"ELK stack (2018) https:\/\/www.elastic.co\/elk-stack"},{"key":"e_1_3_3_3_1","unstructured":"Collaborative Adaptive Sensing of the Atmosphere (2020). http:\/\/www.casa.umass.edu\/"},{"key":"e_1_3_3_4_1","unstructured":"National Energy Research Scientific Computing Center (NERSC) (2022) https:\/\/www.nersc.gov"},{"key":"e_1_3_3_5_1","unstructured":"Oak Ridge Leadership Computing Facility (OLCF) (2022) https:\/\/www.olcf.ornl.gov"},{"key":"e_1_3_3_6_1","doi-asserted-by":"publisher","DOI":"10.1038\/nature15393"},{"key":"e_1_3_3_7_1","doi-asserted-by":"crossref","unstructured":"Balaprakash P Salim M Uram TD et al. (2018) Deephyper: Asynchronous hyperparameter search for deep neural networks. In: 2018 IEEE 25th International Conference on High Performance Computing (HiPC). Bengaluru India 17-20 December 2018.","DOI":"10.1109\/HiPC.2018.00014"},{"key":"e_1_3_3_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-33769-2_13"},{"key":"e_1_3_3_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.5747"},{"key":"e_1_3_3_10_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342015594515"},{"key":"e_1_3_3_11_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342019852127"},{"key":"e_1_3_3_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2014.10.008"},{"key":"e_1_3_3_13_1","unstructured":"Docker (2022) Docker https:\/\/docs.docker.com\/"},{"key":"e_1_3_3_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2019.01.015"},{"key":"e_1_3_3_15_1","doi-asserted-by":"crossref","unstructured":"Gaikwad P Mandal A Ruth P et al. (2016) Anomaly detection for scientific workflow applications on networked clouds. In 2016 International Conference on High Performance Computing & Simulation (HPCS). Innsbruck Austria 18-22 July 2016 pp. 645\u2013652.","DOI":"10.1109\/HPCSim.2016.7568396"},{"key":"e_1_3_3_16_1","first-page":"11","article-title":"Adaptive sampling towards fast graph representation learning","volume":"31","author":"Huang W","year":"2018","unstructured":"Huang W, Zhang T, Rong Y, et al. (2018) Adaptive sampling towards fast graph representation learning. Advances in Neural Information Processing Systems 31: 11.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_17_1","volume-title":"Linux advanced routing & traffic control","author":"Hubert B","year":"2002","unstructured":"Hubert B, Graf T, Maxwell G, et al. (2002) Linux advanced routing & traffic control. Ottawa Linux Symposium, volume 213."},{"key":"e_1_3_3_18_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342022107976"},{"key":"e_1_3_3_19_1","first-page":"240","volume-title":"Artificial Intelligence and Statistics","author":"Jamieson K","year":"2016","unstructured":"Jamieson K, Talwalkar A (2016) Non-stochastic best arm identification and hyperparameter optimization. Artificial Intelligence and Statistics. PMLR, pp. 240\u2013248."},{"key":"e_1_3_3_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/WORKS56498.2022.00010"},{"key":"e_1_3_3_21_1","unstructured":"Keahey K Anderson J Zhen Z et al. (2020) Lessons learned from the chameleon testbed. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC \u201920). 2020 USENIX Association."},{"key":"e_1_3_3_22_1","unstructured":"Kingma DP Ba J (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations."},{"key":"e_1_3_3_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437359.3465597"},{"key":"e_1_3_3_24_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4703"},{"key":"e_1_3_3_25_1","volume-title":"Control groups v2","author":"Kernel Organization L","year":"2023","unstructured":"Kernel Organization L (2023) Control groups v2 https:\/\/docs.kernel.org\/admin-guide\/cgroup-v2.html"},{"key":"e_1_3_3_26_1","doi-asserted-by":"publisher","unstructured":"Lyons E Papadimitriou G Wang C et al. (2019) Toward a dynamic network-centric distributed cloud platform for scientific workflows: A case study for adaptive weather sensing. In 15th International Conference on eScience (eScience) San Diego CA 24-27 September 2019 pp. 67\u201376. DOI: 10.1109\/eScience.2019.00015","DOI":"10.1109\/eScience.2019.00015"},{"key":"e_1_3_3_27_1","volume-title":"Pegasus panorama","author":"Papadimitriou G","year":"2018","unstructured":"Papadimitriou G, Deelman E (2018) Pegasus panorama. https:\/\/github.com\/pegasus-isi\/pegasus\/tree\/panorama"},{"key":"e_1_3_3_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/INDIS49552.2019.00012"},{"key":"e_1_3_3_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.11.024"},{"key":"e_1_3_3_30_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/78\/1\/012057"},{"key":"e_1_3_3_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.05.014"},{"key":"e_1_3_3_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2014.2368076"},{"key":"e_1_3_3_33_1","doi-asserted-by":"crossref","unstructured":"Singh A Altintas I Schram M et al. (2018) Deep learning for enhancing fault tolerant capabilities of scientific workflows. In: 2018 IEEE International Conference on Big Data (Big Data). Seattle WA 10-13 December 2018 IEEE pp. 3905\u20133914.","DOI":"10.1109\/BigData.2018.8622509"},{"key":"e_1_3_3_34_1","volume-title":"AI for Science","author":"Stevens R","year":"2020","unstructured":"Stevens R, Taylor V, Nichols J, et al. (2020) AI for Science. Argonne, IL (United States): Argonne National Lab.(ANL). Technical report."},{"key":"e_1_3_3_35_1","doi-asserted-by":"crossref","unstructured":"Taufer M (2021) AI4IO: A suite of ai-based tools for io-aware HPC resource management. In 2021 IEEE 28th International Conference on High Performance Computing Data and Analytics (HiPC). Bengaluru India 17-20 December 2021.","DOI":"10.1109\/HiPC53243.2021.00012"},{"key":"e_1_3_3_36_1","volume-title":"Workflows for E-Science: Scientific Workflows for Grids","author":"Taylor IJ","year":"2014","unstructured":"Taylor IJ, Deelman E, Gannon DB, et al. (2014) Workflows for E-Science: Scientific Workflows for Grids. Springer Publishing Company"},{"key":"e_1_3_3_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2014.80"},{"key":"e_1_3_3_38_1","doi-asserted-by":"crossref","unstructured":"Wang C Papadimitriou G Kiran M et al. (2020) Identifying execution anomalies for data intensive workflows using lightweight ML techniques. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). Waltham MA 22-24 September 2020 pp. 1\u20137.","DOI":"10.1109\/HPEC43674.2020.9286139"},{"key":"e_1_3_3_39_1","volume-title":"POSIX Workload Generator","author":"Waterland A","year":"2013","unstructured":"Waterland A (2013) POSIX Workload Generator."},{"key":"e_1_3_3_40_1","unstructured":"Welling M Kipf TN (2016) Semi-supervised classification with graph convolutional networks. In J. International Conference on Learning Representations. ICLR 2017)."},{"key":"e_1_3_3_41_1","first-page":"9240","article-title":"GNNExplainer: Generating explanations for graph neural networks","volume":"32","author":"Ying R","year":"2019","unstructured":"Ying R, Bourgeois D, You J, et al. (2019) GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems 32: 9240\u20139251.","journal-title":"Advances in Neural Information Processing Systems"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420231172140","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420231172140","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420231172140","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420231172140","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T20:14:24Z","timestamp":1767644064000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420231172140"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,30]]},"references-count":40,"journal-issue":{"issue":"3-4","published-print":{"date-parts":[[2023,7]]}},"alternative-id":["10.1177\/10943420231172140"],"URL":"https:\/\/doi.org\/10.1177\/10943420231172140","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2023,5,30]]}}}