{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,5]],"date-time":"2025-04-05T09:10:04Z","timestamp":1743844204458,"version":"3.40.3"},"reference-count":85,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,4,3]],"date-time":"2025-04-03T00:00:00Z","timestamp":1743638400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,4,3]],"date-time":"2025-04-03T00:00:00Z","timestamp":1743638400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100012470","name":"CERN","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100012470","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003768","name":"Universit\u00e9 de Strasbourg","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100003768","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007370","name":"Universit\u00e9 Catholique de Louvain","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100007370","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100012470","name":"CERN","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100012470","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Comput Softw Big Sci"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Data acquisition (DAQ) networks, widely used in scientific research and industrial applications, are composed of numerous interconnected servers, exchanging substantial data volumes produced by large scientific instruments. 
One traffic matrix generally used in such networks is the all-to-all collective exchange, which demands substantial network resources, making network failures particularly challenging to mitigate. If not mitigated, the effects of network failures severely hamper the performance of the DAQ network, potentially leading to the loss of valuable experimental data. In the context of DAQ networks using a fat-tree topology, we propose FORS: a scheduling and associated routing solution to support the all-to-all collective exchange under network failures. FORS optimizes bandwidth utilization in the face of any failure scenarios, ensuring robust performance compared to the existing approaches. We propose an algorithm to solve the scheduling. For the routing, we design an algorithm for simple failure scenarios, along with a linear programming model to address more complex failure scenarios. We validate our proposed solution using a real-world DAQ network as a case study. Results demonstrate significant performance degradation in existing approaches and FORS\u2019 consistent ability to achieve higher throughput across various failure scenarios.<\/jats:p>","DOI":"10.1007\/s41781-024-00129-w","type":"journal-article","created":{"date-parts":[[2025,4,5]],"date-time":"2025-04-05T08:56:32Z","timestamp":1743843392000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["FORS: fault-adaptive optimized routing and scheduling for DAQ 
networks"],"prefix":"10.1007","volume":"9","author":[{"given":"Eloise","family":"Stein","sequence":"first","affiliation":[]},{"given":"Quentin","family":"Bramas","sequence":"additional","affiliation":[]},{"given":"Flavio","family":"Pisani","sequence":"additional","affiliation":[]},{"given":"Tommaso","family":"Colombo","sequence":"additional","affiliation":[]},{"given":"Cristel","family":"Pelsser","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,4,3]]},"reference":[{"issue":"3","key":"129_CR1","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1109\/MCSE.2015.1","volume":"17","author":"J Borrill","year":"2015","unstructured":"Borrill J, Keskitalo R, Kisner T (2015) Big bang, big data, big iron: fifteen years of cosmic microwave background data analysis at NERSC. Comput Sci Eng 17(3):22\u201329. https:\/\/doi.org\/10.1109\/MCSE.2015.1","journal-title":"Comput Sci Eng"},{"key":"129_CR2","unstructured":"Dorelli J, Bard C, Chen T, Silva D, Santos LF, Ireland J, Kirk M, McGranaghan R, Narock A, Nieves-Chinchilla T, Samara M, Sarantos M, Schuck P, Thompson B. Deep learning for space weather prediction: bridging the gap between heliophysics data and theory 2022"},{"key":"129_CR3","unstructured":"Update, P.: SIMULIA. [Online]. 2011. https:\/\/www.fsb.unizg.hr\/atlantis\/upload\/newsboard\/22 07 2011 15351 SIMULIA RSN-May2011.pdf. Accessed 24 Jun 2024."},{"key":"129_CR4","doi-asserted-by":"publisher","unstructured":"Liu X, Xu Z, Liu G, Liu L. Intelligent compound selection of anti-cancer drugs based on multi-objective optimization. In: 2023 International Conference on Intelligent Supercomputing and BioPharma (ISBP), IEEE, Zhuhai, China pp. 48\u201353, 2023. https:\/\/doi.org\/10.1109\/ISBP57705.2023.10061321","DOI":"10.1109\/ISBP57705.2023.10061321"},{"key":"129_CR5","doi-asserted-by":"publisher","unstructured":"Leung CK, Sarumi OA, Zhang CY. Predictive analytics on genomic data with high-performance computing. 
In: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, Seoul, Korea (South) pp. 2187\u20132194, 2020. https:\/\/doi.org\/10.1109\/BIBM49941.2020.9312982","DOI":"10.1109\/BIBM49941.2020.9312982"},{"key":"129_CR6","unstructured":"Belyaev N, Krasnopevtsev D, Konoplich R, Velikhov V, Klimentov A. High performance computing system in the framework of the Higgs boson studies. Technical report, ATL-COM-SOFT-2017-089 2017"},{"key":"129_CR7","doi-asserted-by":"publisher","unstructured":"Jereczek G, Lehmann Miotto G, Malone D. Analogues between tuning TCP for data acquisition and datacenter networks. In: 2015 IEEE International Conference on Communications (ICC). IEEE, London, UK pp. 6062\u20136067, 2015. https:\/\/doi.org\/10.1109\/ICC.2015.7249288","DOI":"10.1109\/ICC.2015.7249288"},{"issue":"3","key":"129_CR8","doi-asserted-by":"publisher","first-page":"1099","DOI":"10.1109\/TNS.2015.2426216","volume":"62","author":"T Bawej","year":"2015","unstructured":"Bawej T, Behrens U, Branson J, Chaze O, Cittolin S, Darlea G-L, Deldicque C, Dobson M, Dupont A, Erhan S, Forrest A, Gigi D, Glege F, Gomez-Ceballos G, Gomez-Reino R, Hegeman J, Holzner A, Masetti L, Meijers F, Meschi E, Mommsen RK, Morovic S, Nunez-Barranco-Fernandez C, O\u2019Dell V, Orsini L, Paus C, Petrucci A, Pieri M, Racz A, Sakulin H, Schwick C, Stieger B, Sumorok K, Veverka J, Zejdl P (2015) The new CMS DAQ system for run-2 of the LHC. IEEE Trans Nucl Sci 62(3):1099\u20131103. https:\/\/doi.org\/10.1109\/TNS.2015.2426216","journal-title":"IEEE Trans Nucl Sci"},{"key":"129_CR9","unstructured":"LHCb Collaboration: LHCb Trigger and Online Upgrade Technical Design Report. Technical report, CERN, Geneva 2014"},{"key":"129_CR10","unstructured":"LHCb Collaboration: LHCb Upgrade GPU High Level Trigger Technical Design Report. Technical report, CERN, Geneva 2020."},{"key":"129_CR11","unstructured":"CERN: Key Facts and Figures \u2013 CERN Data Centre. [Online]. 2019. 
https:\/\/information-technology.web.cern.ch\/sites\/default\/files\/ CERNDataCentre KeyInformation July2019V1.pdf. Accessed 24 Jun 2024."},{"key":"129_CR12","doi-asserted-by":"publisher","unstructured":"Wu S, Zhai Y, Liu J, Huang J, Jian Z, Wong B, Chen Z. Anatomy of high-performance GEMM with online fault tolerance on GPUs. In: Proceedings of the 37th International Conference on Supercomputing. ICS \u201923. Association for Computing Machinery, New York, NY, USA , pp. 360\u2013372, 2023. https:\/\/doi.org\/10.1145\/3577193.3593715","DOI":"10.1145\/3577193.3593715"},{"key":"129_CR13","doi-asserted-by":"publisher","unstructured":"Wang R, Dong D, Lei F, Ma J, Wu K, Lu K. Roar: a router microarchitecture for in-network allreduce. In: Proceedings of the 37th International Conference on Supercomputing. ICS \u201923,. Association for Computing Machinery, New York, NY, USA pp. 423\u2013436, 2023. https:\/\/doi.org\/10.1145\/3577193.3593711","DOI":"10.1145\/3577193.3593711"},{"key":"129_CR14","doi-asserted-by":"publisher","unstructured":"Chirkov G, Wentzlaff D. Seizing the bandwidth scaling of on-package interconnect in a Post-Moore\u2019s Law World. In: Proceedings of the 37th International Conference on Supercomputing. ICS \u201923, Association for Computing Machinery, New York, NY, USA pp. 410\u2013422, 2023. https:\/\/doi.org\/10.1145\/3577193.3593702","DOI":"10.1145\/3577193.3593702"},{"key":"129_CR15","doi-asserted-by":"crossref","unstructured":"Huang J, Di S, Yu X, Zhai Y, Liu J, Huang Y, Raffenetti K, Zhou H, Zhao K, Chen Z, et al.: gzccl: Compression-accelerated collective communication framework for gpu clusters. 2023. arXiv preprint arXiv:2308.05199","DOI":"10.1145\/3650200.3656636"},{"key":"129_CR16","doi-asserted-by":"publisher","unstructured":"Contini, N., Ramesh, B., Kandadi Suresh, K., Tran, T., Michalowicz, B., Abduljabbar, M., Subramoni, H., Panda, D.: Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication. 
In: Proceedings of the 37th International Conference on Supercomputing. ICS \u201923. Association for Computing Machinery, New York, NY, USA , pp. 477\u2013487, 2023. https:\/\/doi.org\/10.1145\/3577193.3593720","DOI":"10.1145\/3577193.3593720"},{"key":"129_CR17","doi-asserted-by":"crossref","unstructured":"Feng G, Dong D, Zhao S, Lu Y. GRAP: group-level resource allocation policy for reconfigurable dragonfly network in HPC. In: Proceedings of the 37th International Conference on Supercomputing. ICS \u201923, pp. 437\u2013449. Association for Computing Machinery, New York, NY, USA (2023).","DOI":"10.1145\/3577193.3593732"},{"key":"129_CR18","doi-asserted-by":"publisher","unstructured":"Prisacari B, Rodriguez G, Minkenberg C, Hoefler T. Bandwidth-optimal all-to-all exchanges in fat tree networks. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS \u201913, pp. 139\u2013148. Association for Computing Machinery, New York, NY, USA (2013). https:\/\/doi.org\/10.1145\/2464996.2465434","DOI":"10.1145\/2464996.2465434"},{"key":"129_CR19","doi-asserted-by":"publisher","unstructured":"Prisacari B, Rodriguez G, Minkenberg C. Generalized hierarchical all-to all exchange patterns. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing,. IEEE, Cambridge, MA, USA pp. 537\u2013547, 2013. https:\/\/doi.org\/10.1109\/IPDPS.2013.87","DOI":"10.1109\/IPDPS.2013.87"},{"key":"129_CR20","unstructured":"Al-Fares M, Radhakrishnan S, Raghavan B, Huang N, Vahdat A. Hedera: dynamic flow scheduling for data center networks. NSDI\u201910, p. 19. USENIX Association, USA (2010)"},{"key":"129_CR21","doi-asserted-by":"crossref","unstructured":"Izzi D, Massini A. Optimal all-to-all personalized communication on Butterfly networks through a reduced Latin square. 
In: 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS). IEEE, Yanuca Island, Cuvu, Fiji pp. 1065\u20131072, 2020.","DOI":"10.1109\/HPCC-SmartCity-DSS50907.2020.00195"},{"key":"129_CR22","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1002\/cpe.1527","volume":"22","author":"E Zahavi","year":"2009","unstructured":"Zahavi E, Johnson G, Kerbyson D, Lang M (2009) Optimized InfiniBandTM fat-tree routing for shift all-to-all communication patterns. Concurr Comput Pract Exp 22:217\u2013231. https:\/\/doi.org\/10.1002\/cpe.1527","journal-title":"Concurr Comput Pract Exp"},{"key":"129_CR23","doi-asserted-by":"crossref","unstructured":"Peng, J., Liu, J., Dai, Y., Xie, M., Gong, C.: Optimizing all-toall collective communication on tianhe supercomputer. In: 2022 IEEE Intl Conf 1on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA\/BDCloud\/SocialCom\/SustainCom),. IEEE, Melbourne, Australia pp. 402\u2013409, 2022.","DOI":"10.1109\/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00058"},{"key":"129_CR24","doi-asserted-by":"publisher","unstructured":"Izzi D, Massini A.: All-to-all personalized communication on fat-trees using latin squares. In: 2022 International Conference on Software, Telecommunications and Computer Networks (SoftCOMIEEE, Split, Croatia ), pp. 1\u20136, 2022. https:\/\/doi.org\/10.23919\/SoftCOM55329.2022.9911285","DOI":"10.23919\/SoftCOM55329.2022.9911285"},{"key":"129_CR25","doi-asserted-by":"publisher","first-page":"51064","DOI":"10.1109\/ACCESS.2023.3279494","volume":"11","author":"D Izzi","year":"2023","unstructured":"Izzi D, Massini A (2023) Realizing optimal all-to-all personalized communication using butterfly-based networks. IEEE Access 11:51064\u201351083. 
https:\/\/doi.org\/10.1109\/ACCESS.2023.3279494","journal-title":"IEEE Access"},{"issue":"2","key":"129_CR26","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1145\/3464994.3464996","volume":"51","author":"R Singh","year":"2021","unstructured":"Singh R, Mukhtar M, Krishna A, Parkhi A, Padhye J, Maltz D (2021) Surviving switch failures in cloud datacenters. SIGCOMM Comput Commun Rev 51(2):2\u20139. https:\/\/doi.org\/10.1145\/3464994.3464996","journal-title":"SIGCOMM Comput Commun Rev"},{"key":"129_CR27","unstructured":"Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow. 2018. arXiv preprint arXiv:1802.05799"},{"key":"129_CR28","unstructured":"Zhao L, Maleki S, Yang Z, Pourreza H, Shah A, Hwang C, Krishnamurthy A. ForestColl: efficient collective communications on heterogeneous network fabrics. 2024.arXiv preprint arXiv:2402.06787"},{"key":"129_CR29","doi-asserted-by":"publisher","unstructured":"Zhou Q, Anthony Q, Xu L, Shafi A, Abduljabbar M, Subramoni H, Panda DKD. Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication. In: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 134\u2013144 (2023). https:\/\/doi.org\/10.1109\/IPDPS54959.2023.00023","DOI":"10.1109\/IPDPS54959.2023.00023"},{"key":"129_CR30","unstructured":"Wang W, Ghobadi M, Shakeri K, Zhang Y, Hasani N. Optimized network architectures for large language model training with billions of parameters. 2023. arXiv preprint arXiv:2307.12169"},{"key":"129_CR31","unstructured":"Nvidia: Collective Operations. [Online]. 2020. https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/collectives.html. Accessed 24 Jun 2024."},{"key":"129_CR32","doi-asserted-by":"crossref","unstructured":"Petrini F, Vanneschi M. k-ary n-trees: high performance networks for massively parallel architectures. In: Proceedings 11th International Parallel Processing Symposium,. 
IEEE, Geneva, Switzerland pp. 87\u201393, 1997.","DOI":"10.1109\/IPPS.1997.580853"},{"key":"129_CR33","unstructured":"List, T.: TOP500 List. [Online]. 2024. https:\/\/www.top500.org\/lists\/top500\/2024\/06\/. Accessed 24 Jun 2024."},{"issue":"2","key":"129_CR34","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1016\/0022-0000(78)90001-6","volume":"17","author":"N Pippenger","year":"1978","unstructured":"Pippenger N (1978) On rearrangeable and non-blocking switching networks. J Comput Syst Sci 17(2):145\u2013162. https:\/\/doi.org\/10.1016\/0022-0000(78)90001-6","journal-title":"J Comput Syst Sci"},{"key":"129_CR35","doi-asserted-by":"publisher","unstructured":"Yao F, Wu J, Venkataramani G, Subramaniam S. A comparative analysis of data center network architectures. In: 2014 IEEE International Conference on Communications (ICC),. IEEE, Sydney, NSW, Australia pp. 3106\u20133111, 2014. https:\/\/doi.org\/10.1109\/ICC.2014.6883798","DOI":"10.1109\/ICC.2014.6883798"},{"key":"129_CR36","doi-asserted-by":"publisher","unstructured":"Sen A, Datta A, De M. Fault tolerant wormhole routing for complete exchange in multi-mesh. In: 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS),. IEEE, Belgaum, India pp. 415\u2013420, 2018. https:\/\/doi.org\/10.1109\/CTEMS.2018.8769323","DOI":"10.1109\/CTEMS.2018.8769323"},{"key":"129_CR37","doi-asserted-by":"crossref","unstructured":"Yazaki S, Takaue H, Ajima Y, Shimizu T, Ishihata H. An efficient allto-all communication algorithm for mesh\/torus networks. In: 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications,. IEEE, Leganes, Spain pp. 277\u2013284, 2012.","DOI":"10.1109\/ISPA.2012.44"},{"key":"129_CR38","doi-asserted-by":"publisher","unstructured":"Doi J, Negishi Y. Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers. 
In: SC \u201910: Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, LA, USA pp. 1\u20139, 2010. https:\/\/doi.org\/10.1109\/SC.2010.38","DOI":"10.1109\/SC.2010.38"},{"key":"129_CR39","doi-asserted-by":"publisher","unstructured":"Kim J, Dally WJ, Scott S, Abts D. Technology-driven, highly-scalable dragonfly topology. In: 2008 International Symposium on Computer Architecture, pp. 77\u201388 (2008). https:\/\/doi.org\/10.1109\/ISCA.2008.19","DOI":"10.1109\/ISCA.2008.19"},{"key":"129_CR40","doi-asserted-by":"publisher","unstructured":"Shpiner, A., Haramaty, Z., Eliad, S., Zdornov, V., Gafni, B., Zahavi, E.: Dragonfly+: Low Cost Topology for Scaling Datacenters. In: 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB). IEEE, Austin, TX, USA pp. 1\u20138, 2017. https:\/\/doi.org\/10.1109\/HiPINEB.2017.11","DOI":"10.1109\/HiPINEB.2017.11"},{"key":"129_CR41","doi-asserted-by":"publisher","unstructured":"Beni, M.S., Cosenza, B.: An Analysis of Performance Variability on Dragonfly+ topology. In: 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, Heidelberg, Germany pp. 500\u2013501, 2022. https:\/\/doi.org\/10.1109\/CLUSTER51413.2022.00061","DOI":"10.1109\/CLUSTER51413.2022.00061"},{"key":"129_CR42","unstructured":"MPI: MPI(3) man page. [Online]. 2021. https:\/\/www.open-mpi.org\/doc\/v4.0\/man3. Accessed 24 Jun 2024."},{"key":"129_CR43","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1016\/j.jpdc.2019.02.006","volume":"128","author":"L Dalcin","year":"2019","unstructured":"Dalcin L, Mortensen M, Keyes DE (2019) Fast parallel multidimensional FFT using advanced MPI. J Parallel Distrib Comput 128:137\u2013150.
https:\/\/doi.org\/10.1016\/j.jpdc.2019.02.006","journal-title":"J Parallel Distrib Comput"},{"key":"129_CR44","doi-asserted-by":"publisher","unstructured":"Czechowski K, Battaglino C, McClanahan C, Iyer K, Yeung P-K, Vuduc R. On the communication complexity of 3D FFTs and its implications for exascale. In: Proceedings of the 26th ACM International Conference on Supercomputing. ICS \u201912, Association for Computing Machinery, New York, NY, USA pp. 205\u2013214, 2012. https:\/\/doi.org\/10.1145\/2304576.2304604.","DOI":"10.1145\/2304576.2304604"},{"key":"129_CR45","first-page":"765","volume":"2","author":"T Sumanaweera","year":"2005","unstructured":"Sumanaweera T, Liu D (2005) Medical image reconstruction with the FFT. GPU gems 2:765\u2013784","journal-title":"GPU gems"},{"key":"129_CR46","unstructured":"M\u00fcller A, Deconinck W, K\u00fchnlein C, Mengaldo G, Lange M, Wedi N, Bauer P, Smolarkiewicz P, Diamantakis M, Lock S-J, Saarinen S."},{"key":"129_CR47","doi-asserted-by":"publisher","first-page":"4425","DOI":"10.5194\/gmd-12-4425-2019","volume":"12","author":"G Mozdzynski","year":"2019","unstructured":"Mozdzynski G, Thiemert D, Glinton M, Benard P, Voitus F, Colavolpe C, Marguinaud P, New N (2019) The ESCAPE project: energy-efficient scalable algorithms for weather prediction at exascale. Geosci Model Dev 12:4425\u20134441. https:\/\/doi.org\/10.5194\/gmd-12-4425-2019","journal-title":"Geosci Model Dev"},{"key":"129_CR48","doi-asserted-by":"publisher","unstructured":"Roy A, Zeng H, Bagga J, Porter G, Snoeren AC. Inside the social network\u2019s (Datacenter) network. In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. SIGCOMM \u201915. Association for Computing Machinery, New York, NY, USA pp. 123\u2013137, 2015. https:\/\/doi.org\/10.1145\/2785956.2787472","DOI":"10.1145\/2785956.2787472"},{"key":"129_CR49","unstructured":"Jereczek, G.: Software switching for high throughput data acquisition networks.
PhD thesis, National University of Ireland, Maynooth (Ireland) 2017"},{"key":"129_CR50","doi-asserted-by":"crossref","unstructured":"Amoiridis V, James TO, Rabady DS, Zogatova D, Gigi D, Racz A, Deldicque C, Cano E, Meschi E, Brummer PM, et al. The CMS Orbit Builder for the HL-LHC at CERN. Technical report, CERN 2023","DOI":"10.1051\/epjconf\/202429502011"},{"key":"129_CR51","doi-asserted-by":"publisher","unstructured":"Pisani F, Colombo T, Durante P, Frank M, Gaspar C, Cardoso LG, Neufeld N, Perro A. Design and Commissioning of the First 32-Tbit\/s EventBuilder. IEEE Transactions on Nuclear Science 70(6), 906\u2013913 (2023). https:\/\/doi.org\/10.1109\/TNS.2023.3240514","DOI":"10.1109\/TNS.2023.3240514"},{"issue":"08","key":"129_CR52","doi-asserted-by":"publisher","first-page":"08005","DOI":"10.1088\/1748-0221\/3\/08\/S08005","volume":"3","author":"The LHCb Collaboration","year":"2008","unstructured":"The LHCb Collaboration (2008) The LHCb Detector at the LHC. J Instrum 3(08):08005. https:\/\/doi.org\/10.1088\/1748-0221\/3\/08\/S08005","journal-title":"J Instrum"},{"key":"129_CR53","unstructured":"CERN: The Large Hadron Collider. [Online]. 2024. https:\/\/home.cern\/science\/accelerators\/large-hadron-collider.
Accessed 24 Jun 2024."},{"key":"129_CR54","doi-asserted-by":"publisher","unstructured":"Amoiridis, Vassileios, Behrens, Ulf, Bocci, Andrea, Branson, James, Brummer, Philipp, Cano, Eric, Cittolin, Sergio, Da Silva Almeida Da Quintanilha, Joao, Darlea, Georgiana-Lavinia, Deldicque, Christian, Dobson, Marc, Dvorak, Antonin, Gigi, Dominique, Glege, Frank, Gomez-Ceballos, Guillelmo, Gorniak, Patrycja, Guti\u0107, Neven, Hegeman, Jeroen, Izquierdo Moreno, Guillermo, James, Thomas Owen, Karimeh, Wassef, Kartalas, Miltiadis, Krawczyk, Rafal Dominik, Li, Wei, Long, Kenneth, Meijers, Frans, Meschi, Emilio, Morovi\u0107, Sre\u0107ko, Orsini, Luciano, Paus, Christoph, Petrucci, Andrea, Pieri, Marco, Rabady, Dinyar Sebastian, Racz, Attila, Rizopoulos, Theodoros, Sakulin, Hannes, Schwick, Christoph, \u0160imelevi\u010dius, Dainius, Tzanis, Polyneikis, Vazquez Velez, Cristina, \u017dejdl, Petr, Zhang, Yousen, Zogatova, Dominika: The CMS Orbit Builder for the HL-LHC at CERN. EPJ Web of Conf. 2024;295:02011. https:\/\/doi.org\/10.1051\/epjconf\/202429502011","DOI":"10.1051\/epjconf"},{"key":"129_CR55","unstructured":"Kopeliansky, R.: ATLAS Trigger and Data Acquisition upgrades for the High Luminosity LHC. Technical report, CERN, Geneva 2023. https:\/\/cds.cern.ch\/record\/2871280"},{"key":"129_CR56","unstructured":"Nvidia: OpenSM. [Online]. 2023. https:\/\/docs.nvidia.com\/networking\/ display\/MLNXOFEDv461000\/OpenSM. Accessed 24 Jun 2024."},{"key":"129_CR57","doi-asserted-by":"publisher","first-page":"1008737","DOI":"10.3389\/fdata.2022.1008737","volume":"5","author":"L Calefice","year":"2022","unstructured":"Calefice L, Hennequin A, Henry L, Jashal B, Mendoza D, Oyanguren A, Sanderswood I, Sierra C, Zhuo J, Collaboration P (2022) Effect of the high-level trigger for detecting long-lived particles at LHCb. Front Big Data 5:1008737.
https:\/\/doi.org\/10.3389\/fdata.2022.1008737","journal-title":"Front Big Data"},{"issue":"4","key":"129_CR58","doi-asserted-by":"publisher","first-page":"350","DOI":"10.1145\/2043164.2018477","volume":"41","author":"P Gill","year":"2011","unstructured":"Gill P, Jain N, Nagappan N (2011) Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. SIGCOMM Comput Commun Rev 41(4):350\u2013361. https:\/\/doi.org\/10.1145\/2043164.2018477","journal-title":"SIGCOMM Comput Commun Rev"},{"key":"129_CR59","doi-asserted-by":"publisher","unstructured":"Yigitbasi N, Gallet M, Kondo D, Iosup A, Epema D. Analysis and modeling of time-correlated failures in large-scale distributed systems. In: 2010 11th IEEE\/ACM International Conference on Grid Computing,. IEEE, Brussels, Belgium pp. 65\u201372, 2010. https:\/\/doi.org\/10.1109\/GRID.2010.5697961","DOI":"10.1109\/GRID.2010.5697961"},{"key":"129_CR60","doi-asserted-by":"publisher","DOI":"10.1109\/TNS.2024.3451177","author":"E Stein","year":"2024","unstructured":"Stein E, Pisani F, Colombo T, Pelsser C (2024) Measuring performance under failures in the lhcb data acquisition network. IEEE Trans Nuclear Sci. https:\/\/doi.org\/10.1109\/TNS.2024.3451177","journal-title":"IEEE Trans Nuclear Sci"},{"key":"129_CR61","doi-asserted-by":"publisher","unstructured":"Gill P, Jain N, Nagappan N. Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of the ACM SIGCOMM 2011 Conference. SIGCOMM \u201911, Association for Computing Machinery, New York, NY, USA pp. 350\u2013361, 2011. https:\/\/doi.org\/10.1145\/2018436.2018477","DOI":"10.1145\/2018436"},{"key":"129_CR62","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/BF01379320","volume":"17","author":"D Hensgen","year":"1988","unstructured":"Hensgen D, Finkel R, Manber U (1988) Two algorithms for barrier synchronization. Int J Parallel Progr 17:1\u201317. 
https:\/\/doi.org\/10.1007\/BF01379320","journal-title":"Int J Parallel Progr"},{"key":"129_CR63","unstructured":"Drung, B., Rosenstock, H.: Current OpenSM Routing. [Online]. 2017. https:\/\/github.com\/linux-rdma\/opensm\/blob\/master\/doc\/current-routing.txt. Accessed 24 Jun 2024."},{"key":"129_CR64","doi-asserted-by":"publisher","unstructured":"Bogdanski, B., Johnsen, B.D., Reinemo, S.-A., Sem-Jacobsen, F.O.: Discovery and routing of degraded fat-trees. In: 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 697\u2013702. IEEE, Beijing, China (2012). https:\/\/doi.org\/10.1109\/PDCAT.2012.67","DOI":"10.1109\/PDCAT.2012.67"},{"key":"129_CR65","doi-asserted-by":"publisher","unstructured":"Abdous S, Sharafzadeh E, Ghorbani S. Burst-tolerant datacenter networks with Vertigo. In: Proceedings of the 17th International Conference on Emerging Networking EXperiments and Technologies. CoNEXT \u201921. Association for Computing Machinery, New York, NY, USA pp. 1\u201315, 2021. https:\/\/doi.org\/10.1145\/3485983.3494873","DOI":"10.1145\/3485983.3494873"},{"key":"129_CR66","doi-asserted-by":"publisher","unstructured":"Zhang Q, Liu V, Zeng H, Krishnamurthy A. High-resolution measurement of data center microbursts. In: Proceedings of the 2017 Internet Measurement Conference. IMC \u201917. Association for Computing Machinery, New York, NY, USA pp. 78\u201385, 2017. https:\/\/doi.org\/10.1145\/3131365.3131375.","DOI":"10.1145\/3131365.3131375"},{"key":"129_CR67","unstructured":"Shin J, Pinkston TM. The Performance of Routing Algorithms under Bursty Traffic Loads. In: International Conference on Parallel and Distributed Processing Techniques and Applications. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA , pp. 737\u2013743, 2003. 
https:\/\/api.semanticscholar.org\/CorpusID:17559938"},{"key":"129_CR68","doi-asserted-by":"publisher","unstructured":"Stein E, Bramas Q, Colombo T, Pelsser C. Fault-adaptive scheduling for data acquisition networks. In: 2023 IEEE 48th Conference on Local Computer Networks (LCN). IEEE, Daytona Beach, FL, USA pp. 1\u20134, 2023. https:\/\/doi.org\/10.1109\/LCN58197.2023.10223324","DOI":"10.1109\/LCN58197.2023.10223324"},{"key":"129_CR69","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2541228.2555293","volume":"10","author":"B Prisacari","year":"2013","unstructured":"Prisacari B, Rodriguez G, Minkenberg C, Hoefler T (2013) Fast Pattern-Specific Routing for Fat Tree Networks. ACM Trans Arch Code Optim 10:1\u201325. https:\/\/doi.org\/10.1145\/2541228.2555293","journal-title":"ACM Trans Arch Code Optim"},{"key":"129_CR70","doi-asserted-by":"publisher","unstructured":"Schweissguth E, Danielis P, Timmermann D, Parzyjegla H, M\u00fchl G. ILP-based joint routing and scheduling for time-triggered networks. In: Proceedings of the 25th International Conference on Real-Time Networks and Systems. RTNS \u201917, pp. 8\u201317. Association for Computing Machinery, New York, NY, USA 2017. https:\/\/doi.org\/10.1145\/3139258.3139289","DOI":"10.1145\/3139258.3139289"},{"key":"129_CR71","unstructured":"experts, G.: Gurobi Optimization Inc. Gurobi optimizer reference manual. 2023. http:\/\/www.gurobi.com. Accessed 24 Jun 2024"},{"key":"129_CR72","doi-asserted-by":"publisher","unstructured":"Luppold A, Oehlert D, Falk H. Evaluating the performance of solvers for integer-linear programming. Technical report, Hamburg University of Technology, Hamburg, Germany 2018. https:\/\/doi.org\/10.15480\/882.1839","DOI":"10.15480\/882.1839"},{"key":"129_CR73","doi-asserted-by":"publisher","unstructured":"Subramoni H, Kandalla K, Jose J, Tomko K, Schulz K, Pekurovsky D, Panda DK.
Designing topology-aware communication schedules for alltoall operations in large InfiniBand clusters. In: 2014 43rd International Conference on Parallel Processing, pp. 231\u2013240. IEEE, Minneapolis, MN, USA (2014). https:\/\/doi.org\/10.1109\/ICPP.2014.32","DOI":"10.1109\/ICPP.2014.32"},{"key":"129_CR74","doi-asserted-by":"publisher","unstructured":"Subramoni H, Bureddy D, Kandalla K, Schulz K, Barth B, Perkins J, Arnold M, Panda DK. Design of network topology aware scheduling services for large InfiniBand clusters. In: 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1\u20138. IEEE, Indianapolis, IN, USA (2013). https:\/\/doi.org\/10.1109\/CLUSTER.2013.6702677","DOI":"10.1109\/CLUSTER.2013.6702677"},{"key":"129_CR75","doi-asserted-by":"publisher","unstructured":"Domke J, Hoefler T. Scheduling-aware routing for supercomputers. In: SC \u201916: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, UT, USA pp. 142\u2013153, 2016. https:\/\/doi.org\/10.1109\/SC.2016.12","DOI":"10.1109\/SC.2016.12"},{"key":"129_CR76","doi-asserted-by":"crossref","unstructured":"Rocher-Gonz\u00e1lez J, Gran EG, Reinemo S-A, Skeie T, Escudero-Sahuquillo J, Garc\u00eda PJ, Flor FJQ. Adaptive Routing in InfiniBand Hardware. In: 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, Taormina, Italy pp. 463\u2013472, 2022.","DOI":"10.1109\/CCGrid54584.2022.00056"},{"key":"129_CR77","doi-asserted-by":"publisher","unstructured":"Kasan H, Kim G, Yi Y, Kim J. Dynamic global adaptive routing in high-radix networks. In: Proceedings of the 49th Annual International Symposium on Computer Architecture. ISCA \u201922, Association for Computing Machinery, New York, NY, USA pp. 771\u2013783, 2022.
https:\/\/doi.org\/10.1145\/3470496.3527389","DOI":"10.1145\/3470496.3527389"},{"key":"129_CR78","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2020.07.009","author":"J Rocher-Gonzalez","year":"2020","unstructured":"Rocher-Gonzalez J, Escudero-Sahuquillo J, Garcia P, Quiles F, Mora G (2020) Towards an efficient combination of adaptive routing and queuing schemes in Fat-Tree topologies. J Parallel Distrib Comput. https:\/\/doi.org\/10.1016\/j.jpdc.2020.07.009","journal-title":"J Parallel Distrib Comput"},{"key":"129_CR79","doi-asserted-by":"publisher","unstructured":"Zahid F, Gran EG, Bogdanski B, Johnsen BD, Skeie T. A Weighted fat-tree routing algorithm for efficient load-balancing in infinity band enterprise clusters. In: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 35\u201342. IEEE, Turku, Finland 2015. https:\/\/doi.org\/10.1109\/PDP.2015.111","DOI":"10.1109\/PDP.2015.111"},{"key":"129_CR80","doi-asserted-by":"publisher","unstructured":"Bienkowski M, Korzeniowski M, R\u00a8acke H. A practical algorithm for constructing oblivious routing schemes. In: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA \u201903, pp. 24\u201333. Association for Computing Machinery, New York, NY, USA (2003). https:\/\/doi.org\/10.1145\/777412.777418","DOI":"10.1145\/777412.777418"},{"key":"129_CR81","doi-asserted-by":"publisher","DOI":"10.1145\/3491050","author":"C Griner","year":"2021","unstructured":"Griner C, Zerwas J, Blenk A, Ghobadi M, Schmid S, Avin C (2021) Cerberus: the power of choices in datacenter topology design\u2014a throughput perspective. Proc ACM Meas Anal Comput Syst. 
https:\/\/doi.org\/10.1145\/3491050","journal-title":"Proc ACM Meas Anal Comput Syst."},{"key":"129_CR82","doi-asserted-by":"publisher","DOI":"10.1145\/3579449","author":"J Zerwas","year":"2023","unstructured":"Zerwas J, Gyorgyi C, Blenk A, Schmid S, Avin C (2023) Duo: a high-throughput reconfigurable datacenter network using local routing and control. Proc ACM Meas Anal Comput Syst. https:\/\/doi.org\/10.1145\/3579449","journal-title":"Proc ACM Meas Anal Comput Syst"},{"key":"129_CR83","doi-asserted-by":"publisher","unstructured":"Domke J, Matsuoka S, Ivanov IR, Tsushima Y, Yuki T, Nomura A, Miura S, McDonald N, Floyd DL, Dub\u00e9 N. HyperX topology: first at-scale implementation and comparison to the fat-tree. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC \u201919. Association for Computing Machinery, New York, NY, USA. 2019. https:\/\/doi.org\/10.1145\/3295500.3356140.","DOI":"10.1145\/3295500.3356140"},{"key":"129_CR84","doi-asserted-by":"publisher","unstructured":"Domke J, Matsuoka S, Radanov I, Tsushima Y, Yuki T, Nomura A, Miura S, McDonald N, Floyd DL, Dub\u00e9 N. The first supercomputer with HyperX topology: a viable alternative to fat-trees? In: 2019 IEEE Symposium on High-Performance Interconnects (HOTI), pp. 1\u20134. IEEE, Santa Clara, CA, USA 2019. https:\/\/doi.org\/10.1109\/HOTI.2019.00013","DOI":"10.1109\/HOTI.2019.00013"},{"key":"129_CR85","unstructured":"Liu V, Halperin D, Krishnamurthy A, Anderson T. F10: a fault-tolerant engineered network. In: 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pp. 399\u2013412. USENIX Association, Lombard, IL. 
2013"}],"container-title":["Computing and Software for Big Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41781-024-00129-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41781-024-00129-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41781-024-00129-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,5]],"date-time":"2025-04-05T08:56:49Z","timestamp":1743843409000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41781-024-00129-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,3]]},"references-count":85,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["129"],"URL":"https:\/\/doi.org\/10.1007\/s41781-024-00129-w","relation":{},"ISSN":["2510-2036","2510-2044"],"issn-type":[{"value":"2510-2036","type":"print"},{"value":"2510-2044","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,3]]},"assertion":[{"value":"2 July 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 April 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"4"}}