{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T19:17:56Z","timestamp":1777058276486,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,9,14]],"date-time":"2022-09-14T00:00:00Z","timestamp":1663113600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100013020","name":"Compute Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100013020","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","award":["RGPIN-05389-2016"],"award-info":[{"award-number":["RGPIN-05389-2016"]}],"id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,9,14]]},"DOI":"10.1145\/3555819.3555857","type":"proceedings-article","created":{"date-parts":[[2022,9,14]],"date-time":"2022-09-14T22:06:08Z","timestamp":1663193168000},"page":"68-78","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning"],"prefix":"10.1145","author":[{"given":"Pedram","family":"Alizadeh","sequence":"first","affiliation":[{"name":"Queen's University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amirhossein","family":"Sojoodi","sequence":"additional","affiliation":[{"name":"Queen's University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yiltan","family":"Hassan Temucin","sequence":"additional","affiliation":[{"name":"Queen's University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ahmad","family":"Afsahi","sequence":"additional","affiliation":[{"name":"Queen's University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,9,14]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"(June 2022). CIFAR-10 and CIFAR-100 datasets. https:\/\/www.cs.toronto.edu\/\u00a0kriz\/cifar.html  (June 2022). CIFAR-10 and CIFAR-100 datasets. https:\/\/www.cs.toronto.edu\/\u00a0kriz\/cifar.html"},{"key":"e_1_3_2_1_2_1","unstructured":"(June 2022). Message Passing Interface (MPI 4.0). http:\/\/www.mpi-forum.org  (June 2022). Message Passing Interface (MPI 4.0). http:\/\/www.mpi-forum.org"},{"key":"e_1_3_2_1_3_1","unstructured":"M. Abadi A. Agarwal P. Barham E. Brevdo Z. Chen C. Citro G.\u00a0S. Corrado A. Davis J. Dean M. Devin S. Ghemawat I. Goodfellow A. Harp G. Irving M. Isard Y. Jia R. Jozefowicz L. Kaiser M. Kudlur J. Levenberg D. Man\u00e9 R. Monga S. Moore D. Murray C. Olah M. Schuster J. Shlens B. Steiner I. Sutskever K. Talwar P. Tucker V. Vanhoucke V. Vasudevan F. Vi\u00e9gas O. Vinyals P. Warden M. Wattenberg M. Wicke Y. Yu and X. Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/  M. Abadi A. Agarwal P. Barham E. Brevdo Z. Chen C. Citro G.\u00a0S. Corrado A. Davis J. Dean M. Devin S. Ghemawat I. Goodfellow A. Harp G. Irving M. Isard Y. Jia R. Jozefowicz L. Kaiser M. Kudlur J. Levenberg D. Man\u00e9 R. Monga S. Moore D. Murray C. Olah M. Schuster J. Shlens B. Steiner I. Sutskever K. Talwar P. Tucker V. Vanhoucke V. Vasudevan F. Vi\u00e9gas O. Vinyals P. Warden M. Wattenberg M. Wicke Y. Yu and X. Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542275.1542306"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3320060"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4851"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33518-1_16"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-021-03370-9"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3392717.3392771"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.25"},{"key":"e_1_3_2_1_11_1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 386\u2013400","author":"Chunduri S.","unstructured":"S. Chunduri , S. Parker , P. Balaji , K. Harms , and K. Kumaran . 2018. Characterization of MPI usage on a production supercomputer . In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 386\u2013400 . S. Chunduri, S. Parker, P. Balaji, K. Harms, and K. Kumaran. 2018. Characterization of MPI usage on a production supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 386\u2013400."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-008-0070-9"},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). 47\u201350","author":"Faraji I.","unstructured":"I. Faraji and A. Afsahi . 2015. Hyper-Q aware intranode MPI collectives on the GPU . In Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). 47\u201350 . I. Faraji and A. Afsahi. 2015. Hyper-Q aware intranode MPI collectives on the GPU. In Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). 47\u201350."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4667"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-016-1779-7"},{"key":"e_1_3_2_1_16_1","volume-title":"2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u20138.","author":"Hoefler T.","unstructured":"T. Hoefler , T. Schneider , and A. Lumsdaine . 2008. Accurately measuring collective operations at massive scale . In 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u20138. T. Hoefler, T. Schneider, and A. Lumsdaine. 2008. Accurately measuring collective operations at massive scale. In 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u20138."},{"key":"e_1_3_2_1_17_1","volume-title":"SC\u201910: Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201311","author":"Hoefler T.","unstructured":"T. Hoefler , T. Schneider , and A. Lumsdaine . 2010. Characterizing the influence of system noise on large-scale applications by simulation . In SC\u201910: Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201311 . T. Hoefler, T. Schneider, and A. Lumsdaine. 2010. Characterizing the influence of system noise on large-scale applications by simulation. In SC\u201910: Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201311."},{"key":"e_1_3_2_1_18_1","volume-title":"2012 IEEE International Conference on Cluster Computing (Cluster). IEEE, 477\u2013485","author":"Inozemtsev G.","unstructured":"G. Inozemtsev and A. Afsahi . 2012. Designing an offloaded nonblocking MPI_Allgather collective using CORE-Direct . In 2012 IEEE International Conference on Cluster Computing (Cluster). IEEE, 477\u2013485 . G. Inozemtsev and A. Afsahi. 2012. Designing an offloaded nonblocking MPI_Allgather collective using CORE-Direct. In 2012 IEEE International Conference on Cluster Computing (Cluster). IEEE, 477\u2013485."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"crossref","unstructured":"A. Jocksch N. Ohana E. Lanti E. Koutsaniti V. Karakasis and L. Villard. 2021. An optimisation of allreduce communication in message-passing systems. Parallel Comput. 107 102812 (2021).  A. Jocksch N. Ohana E. Lanti E. Koutsaniti V. Karakasis and L. Villard. 2021. An optimisation of allreduce communication in message-passing systems. Parallel Comput. 107 102812 (2021).","DOI":"10.1016\/j.parco.2021.102812"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI.2013.26"},{"key":"e_1_3_2_1_21_1","volume-title":"Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 45\u201361","author":"Li S.","unstructured":"S. Li , T. Ben-Nun , S.\u00a0 D. Girolamo , D. Alistarh , and T. Hoefler . 2020. Taming unbalanced training workloads in deep learning with partial collective operations . In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 45\u201361 . S. Li, T. Ben-Nun, S.\u00a0D. Girolamo, D. Alistarh, and T. Hoefler. 2020. Taming unbalanced training workloads in deep learning with partial collective operations. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 45\u201361."},{"key":"e_1_3_2_1_22_1","volume-title":"European Conference on Parallel Processing (Euro-Par). Springer, 439\u2013450","author":"Marendi\u0107 P.","unstructured":"P. Marendi\u0107 , J. Lemeire , T. Haber , D. Vu\u010dini\u0107 , and P. Schelkens . 2012. An investigation into the performance of reduction algorithms under load imbalance . In European Conference on Parallel Processing (Euro-Par). Springer, 439\u2013450 . P. Marendi\u0107, J. Lemeire, T. Haber, D. Vu\u010dini\u0107, and P. Schelkens. 2012. An investigation into the performance of reduction algorithms under load imbalance. In European Conference on Parallel Processing (Euro-Par). Springer, 439\u2013450."},{"key":"e_1_3_2_1_24_1","volume-title":"2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u201311","author":"Patarasuk P.","unstructured":"P. Patarasuk and X. Yuan . 2008. Efficient MPI Bcast across different process arrival patterns . In 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u201311 . P. Patarasuk and X. Yuan. 2008. Efficient MPI Bcast across different process arrival patterns. In 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 1\u201311."},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings IEEE International Symposium on Network Computing and Applications (NCA). IEEE, 24\u201335","author":"Petrini F.","unstructured":"F. Petrini , S. Coll , E. Frachtenberg , and A. Hoisie . 2001. Hardware-and software-based collective communication on the Quadrics network . In Proceedings IEEE International Symposium on Network Computing and Applications (NCA). IEEE, 24\u201335 . F. Petrini, S. Coll, E. Frachtenberg, and A. Hoisie. 2001. Hardware-and software-based collective communication on the Quadrics network. In Proceedings IEEE International Symposium on Network Computing and Applications (NCA). IEEE, 24\u201335."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-018-2356-z"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid49817.2020.00-67"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0152-3"},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of the message passing interface developer\u2019s and user\u2019s conference, Vol.\u00a01999","author":"Rabenseifner R.","year":"1999","unstructured":"R. Rabenseifner . 1999 . Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512 . In Proceedings of the message passing interface developer\u2019s and user\u2019s conference, Vol.\u00a01999 . 77\u201385. R. Rabenseifner. 1999. Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512. In Proceedings of the message passing interface developer\u2019s and user\u2019s conference, Vol.\u00a01999. 77\u201385."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24685-5_1"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ExaMPI52011.2020.00007"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2006.1639334"},{"key":"e_1_3_2_1_34_1","volume-title":"Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2135\u20132135","author":"Seide F.","unstructured":"F. Seide and A. Agarwal . 2016. CNTK: Microsoft\u2019s open-source deep-learning toolkit . In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2135\u20132135 . F. Seide and A. Agarwal. 2016. CNTK: Microsoft\u2019s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2135\u20132135."},{"key":"e_1_3_2_1_35_1","unstructured":"A. Sergeev and M. Del\u00a0Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799(2018).  A. Sergeev and M. Del\u00a0Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799(2018)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/11602569_19"},{"key":"e_1_3_2_1_37_1","volume-title":"28th Annual IEEE Symposium on High-Performance Interconnects (HotI). IEEE, 1\u201310","author":"Temucin H.","unstructured":"Y.\u00a0 H. Temucin , A. Sojoodi , P. Alizadeh , and A. Afsahi . 2021. Efficient Multi-Path NVLink\/PCIe-Aware UCX based Collective Communication for Deep Learning . In 28th Annual IEEE Symposium on High-Performance Interconnects (HotI). IEEE, 1\u201310 . Y.\u00a0H. Temucin, A. Sojoodi, P. Alizadeh, and A. Afsahi. 2021. Efficient Multi-Path NVLink\/PCIe-Aware UCX based Collective Communication for Deep Learning. In 28th Annual IEEE Symposium on High-Performance Interconnects (HotI). IEEE, 1\u201310."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2013.6702676"},{"key":"e_1_3_2_1_39_1","volume-title":"2005 International Conference on Parallel Processing (ICPP). IEEE, 399\u2013407","author":"Wu M.-S.","unstructured":"M.-S. Wu , R.\u00a0 A. Kendall , and K. Wright . 2005. Optimizing collective communications on SMP clusters . In 2005 International Conference on Parallel Processing (ICPP). IEEE, 399\u2013407 . M.-S. Wu, R.\u00a0A. Kendall, and K. Wright. 2005. Optimizing collective communications on SMP clusters. In 2005 International Conference on Parallel Processing (ICPP). IEEE, 399\u2013407."}],"event":{"name":"EuroMPI\/USA'22: 29th European MPI Users' Group Meeting","location":"Chattanooga TN USA","acronym":"EuroMPI\/USA'22"},"container-title":["Proceedings of the 29th European MPI Users' Group Meeting"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3555819.3555857","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3555819.3555857","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:36Z","timestamp":1750182696000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3555819.3555857"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,14]]},"references-count":37,"alternative-id":["10.1145\/3555819.3555857","10.1145\/3555819"],"URL":"https:\/\/doi.org\/10.1145\/3555819.3555857","relation":{},"subject":[],"published":{"date-parts":[[2022,9,14]]},"assertion":[{"value":"2022-09-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}