{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,2]],"date-time":"2025-04-02T18:42:19Z","timestamp":1743619339149,"version":"3.37.3"},"reference-count":21,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2023,10,5]],"date-time":"2023-10-05T00:00:00Z","timestamp":1696464000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,5]],"date-time":"2023-10-05T00:00:00Z","timestamp":1696464000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["JP21H03449"],"award-info":[{"award-number":["JP21H03449"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001700","name":"MEXT","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001700","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["22K19764"],"award-info":[{"award-number":["22K19764"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>NEC SX-Aurora TSUBASA\u00a0(SX-AT) is the latest vector supercomputer, consisting of host processors called Vector Hosts\u00a0(VHs) and vector processors called Vector Engines\u00a0(VEs). The goal of this work is to simultaneously use both VHs and VEs to increase the resource utilization and improve the system throughput by co-executing more workloads. One difficulty is that performance interferences among VH and VE workloads could occur because they share some computing resources and potentially compete to use the same resource at the same time, so-called resource conflicts. To achieve efficient workload co-execution, first, this paper experimentally investigates the performance interference between a VH and a VE, when each of the two processors executes a different workload. It is empirically shown that the frequency of system calls from the VE workload could be a good indicator to predict if the co-execution could cause severe performance interference, even though monitoring system calls requires a huge runtime overhead and it is impractical to simply use it for decision making of co-execution. Then, this paper proposes a workload co-execution strategy based on a practical approach to identifying a pair of VE and VH workloads that could cause severe performance interferences. Our evaluation results clearly demonstrate that the system call frequency can be used to predict if the workload can affect the performance of another co-executing workload, and VH\u2019s CPU load can be a good approximation of the system call frequency. The proposed approach based on the CPU loads could accurately identify a pair of workloads causing frequent resource conflicts, and thus reduce the risk of severe performance interferences between co-executing workloads on an SX-AT system, resulting in shorter makespan without significantly increasing the turn-around time.<\/jats:p>","DOI":"10.1007\/s42514-023-00171-x","type":"journal-article","created":{"date-parts":[[2023,10,5]],"date-time":"2023-10-05T08:01:38Z","timestamp":1696492898000},"page":"425-438","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Conflict-aware workload co-execution on SX-aurora TSUBASA"],"prefix":"10.1007","volume":"6","author":[{"given":"Riku","family":"Nunokawa","sequence":"first","affiliation":[]},{"given":"Yoichi","family":"Shimomura","sequence":"additional","affiliation":[]},{"given":"Mulya","family":"Agung","sequence":"additional","affiliation":[]},{"given":"Ryusuke","family":"Egawa","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2858-3140","authenticated-orcid":false,"given":"Hiroyuki","family":"Takizawa","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,10,5]]},"reference":[{"key":"171_CR1","doi-asserted-by":"crossref","unstructured":"Aceituno, J.M., Guasque, A., Balbastre, P., et\u00a0al.: Hardware resources contention-aware scheduling of hard real-time multiprocessor systems. Journal of Systems Architecture 118 (2021)","DOI":"10.1016\/j.sysarc.2021.102223"},{"key":"171_CR2","doi-asserted-by":"crossref","unstructured":"Alsubaihi, S., Gaudiot, J.L.: PETRAS: Performance, energy and thermal aware resource allocation and scheduling for heterogeneous systems. In: International Workshop on Programming Models and Applications for Multicores and Manycores, pp 29\u201338 (2017)","DOI":"10.1145\/3026937.3026944"},{"key":"171_CR3","doi-asserted-by":"crossref","unstructured":"Egawa, R., Fujimoto, S., Yamashita, T., et\u00a0al.: Exploiting the potentials of the second generation SX-Aurora TSUBASA. In: 2020 IEEE\/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp 39\u201349 (2020)","DOI":"10.1109\/PMBS51919.2020.00010"},{"key":"171_CR4","unstructured":"Himeno, R.: Himeno benchmark. https:\/\/i.riken.jp\/en\/supercom\/documents\/himenobmt\/ (2001)"},{"key":"171_CR5","unstructured":"Intel Corporation.: Introducing Intel MPI benchmarks. https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/articles\/intel-mpi-benchmarks.html (2018)"},{"key":"171_CR6","doi-asserted-by":"crossref","unstructured":"Kayiran, O., Nachiappan, N.C., Jog, A., et\u00a0al.: Managing GPU concurrency in heterogeneous architectures. In: IEEE\/ACM International Symposium on Microarchitecture (MICRO), pp 114\u2013126 (2014)","DOI":"10.1109\/MICRO.2014.62"},{"key":"171_CR7","doi-asserted-by":"crossref","unstructured":"Ke, Y., Agung, M., Takizawa, H.: neoSYCL: a SYCL implementation for SX-Aurora TSUBASA. In: International Conference on High Performance Computing in Asia-Pacific Region, pp 50\u201357 (2021)","DOI":"10.1145\/3432261.3432268"},{"key":"171_CR8","doi-asserted-by":"crossref","unstructured":"Komatsu, K., Momose, S., Isobe, Y., et\u00a0al.: Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In: The International Conference for High Performance Computing, Networking, Storage, and Analysis\u00a0(SC18), pp 685\u2013696 (2018)","DOI":"10.1109\/SC.2018.00057"},{"key":"171_CR9","unstructured":"McCalpin J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter pp 19\u201325 (1995)"},{"key":"171_CR10","doi-asserted-by":"crossref","unstructured":"Nunokawa, R., Shimomura, Y., Agung, M., et\u00a0al.: Towards conflict-aware workload co-execution on SX-Aurora TSUBASA. In: Parallel and Distributed Computing, Applications and Technologies (PDCAT 2021) (2022)","DOI":"10.1007\/s42514-023-00171-x"},{"key":"171_CR11","unstructured":"Petitet, A., Whaley, R.C., Dongarra, J., et\u00a0al.: HPL \u2013 a portable implementation of the high-performance Linpack benchmark for distributed-memory computers, version 2.3. (2018). https:\/\/www.netlib.org\/benchmark\/hpl\/"},{"key":"171_CR12","doi-asserted-by":"crossref","unstructured":"Rabenseifner, R., Koniges, A.E., Livermore, L.: The parallel communication and I\/O bandwidth benchmarks: b_eff and b_eff_io. In: Proc. of 43rd cray user group conference, indian wells, california, usa, Citeseer (2001)","DOI":"10.1007\/3-540-45417-9_9"},{"key":"171_CR13","doi-asserted-by":"crossref","unstructured":"Sasaki, Y., Ishizuka, A., Agung, M., et\u00a0al.: Evaluating I\/O acceleration mechanisms of SX-Aurora TSUBASA. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 752\u2013759 (2021)","DOI":"10.1109\/IPDPSW52791.2021.00113"},{"key":"171_CR14","volume-title":"MiniAMR - a miniapp for adaptive mesh refinement","author":"A Sasidharan","year":"2016","unstructured":"Sasidharan, A., Snir, M.: MiniAMR - a miniapp for adaptive mesh refinement. University of Illinois at Urbana-Champaign, Tech. rep. (2016)"},{"key":"171_CR15","doi-asserted-by":"crossref","unstructured":"Shan, H., Antypas, K., Shalf, J.: Characterizing and predicting the I\/O performance of HPC applications using a parameterized synthetic benchmark. In: SC\u201908: Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing, IEEE, pp 1\u201312 (2008)","DOI":"10.1109\/SC.2008.5222721"},{"key":"171_CR16","doi-asserted-by":"crossref","unstructured":"Takizawa, H., Shiotsuki, S., Ebata, N., et\u00a0al.: OpenCL-like offloading with metaprogramming for SX-Aurora TSUBASA. Parallel Computing 102 (2021)","DOI":"10.1016\/j.parco.2021.102754"},{"key":"171_CR01","doi-asserted-by":"crossref","unstructured":"Takizawa H, Takahash K, Shimomura Y, Egawa R, Oizumi K, Ono S, Yamashita T, Saito A.: \u201cAOBA: The Most Powerful Vector Supercomputer in the World,\u201d Sustained Simulation Performance 2022, Springer Nature (2023)","DOI":"10.1007\/978-3-031-41073-4_6"},{"key":"171_CR17","doi-asserted-by":"crossref","unstructured":"Wen, Y., O\u2019Boyle, M.F.P.: Merge or separate? multi-job scheduling for opencl kernels on CPU\/GPU platforms. In: GPGPU-10: Proceedings of the General Purpose GPUs, pp 22\u201331 (2017)","DOI":"10.1145\/3038228.3038235"},{"key":"171_CR18","doi-asserted-by":"crossref","unstructured":"Xiong, Q., Ates, E., Herbordt, M.C., et\u00a0al.: Tangram: Colocating HPC applications with oversubscription. In: IEEE High Performance Extreme Computing Conference, pp 1\u20137 (2018)","DOI":"10.1109\/HPEC.2018.8547644"},{"key":"171_CR19","unstructured":"Yamada, Y., Momose, S.: Vector engine processor of NEC\u2019s brand-new supercomputer SX-Aurora TSUBASA. In: Hot Chips: A Symposium on High Performance Chips (2018)"},{"key":"171_CR20","doi-asserted-by":"crossref","unstructured":"Zhu, Q., Wu, B., Shen, X., et\u00a0al.: Co-run scheduling with power cap on integrated CPU-GPU systems. In: International Symposium on Parallel and Distributed Processing, pp 967\u2013977 (2017)","DOI":"10.1109\/IPDPS.2017.124"}],"container-title":["CCF Transactions on High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-023-00171-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-023-00171-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-023-00171-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T13:06:17Z","timestamp":1724072777000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-023-00171-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,5]]},"references-count":21,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["171"],"URL":"https:\/\/doi.org\/10.1007\/s42514-023-00171-x","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"type":"print","value":"2524-4922"},{"type":"electronic","value":"2524-4930"}],"subject":[],"published":{"date-parts":[[2023,10,5]]},"assertion":[{"value":"19 April 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}