{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:03:05Z","timestamp":1755838985289,"version":"3.40.3"},"publisher-location":"Cham","reference-count":22,"publisher":"Springer International Publishing","isbn-type":[{"type":"print","value":"9783030507428"},{"type":"electronic","value":"9783030507435"}],"license":[{"start":{"date-parts":[[2020,1,1]],"date-time":"2020-01-01T00:00:00Z","timestamp":1577836800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,6,15]],"date-time":"2020-06-15T00:00:00Z","timestamp":1592179200000},"content-version":"vor","delay-in-days":166,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. 
In the case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI-OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.<\/jats:p>","DOI":"10.1007\/978-3-030-50743-5_20","type":"book-chapter","created":{"date-parts":[[2020,6,15]],"date-time":"2020-06-15T19:03:45Z","timestamp":1592247825000},"page":"391-411","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs"],"prefix":"10.1007","author":[{"given":"Ayesha","family":"Afzal","sequence":"first","affiliation":[]},{"given":"Georg","family":"Hager","sequence":"additional","affiliation":[]},{"given":"Gerhard","family":"Wellein","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,6,15]]},"reference":[{"key":"20_CR1","unstructured":"Afzal, A., Hager, G., Wellein, G.: Delay flow mechanisms on clusters. In: Poster at EuroMPI: 10\u201313 September 2019, Zurich, Switzerland (2019). https:\/\/hpc.fau.de\/files\/2019\/09\/EuroMPI2019_AHW-Poster.pdf"},{"key":"20_CR2","doi-asserted-by":"publisher","unstructured":"Afzal, A., Hager, G., Wellein, G.: Propagation and decay of injected one-off delays on clusters: a case study. In: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019, Albuquerque, NM, USA, 23\u201326 September 2019, pp. 1\u201310 (2019). 
https:\/\/doi.org\/10.1109\/CLUSTER.2019.8890995","DOI":"10.1109\/CLUSTER.2019.8890995"},{"key":"20_CR3","doi-asserted-by":"publisher","unstructured":"Bhatele, A., Mohror, K., Langer, S.H., Isaacs, K.E.: There goes the neighborhood: performance degradation due to nearby jobs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis SC 2013, pp. 1\u201312 (2013). https:\/\/doi.org\/10.1145\/2503210.2503247","DOI":"10.1145\/2503210.2503247"},{"issue":"2","key":"20_CR4","doi-asserted-by":"publisher","first-page":"11:1","DOI":"10.1145\/2934661","volume":"3","author":"D B\u00f6hme","year":"2016","unstructured":"B\u00f6hme, D., et al.: Identifying the root causes of wait states in large-scale parallel applications. ACM Trans. Parallel Comput. 3(2), 11:1\u201311:24 (2016). https:\/\/doi.org\/10.1145\/2934661. ISSN: 2329\u20134949","journal-title":"ACM Trans. Parallel Comput."},{"issue":"3","key":"20_CR5","doi-asserted-by":"publisher","first-page":"168","DOI":"10.1016\/j.jocs.2010.05.001","volume":"1","author":"MJ Chorley","year":"2010","unstructured":"Chorley, M.J., Walker, D.W.: Performance analysis of a hybrid MPI\/OpenMP application on multi-core clusters. J. Comput. Sci. 1(3), 168\u2013174 (2010). https:\/\/doi.org\/10.1016\/j.jocs.2010.05.001","journal-title":"J. Comput. Sci."},{"key":"20_CR6","doi-asserted-by":"publisher","unstructured":"Gamell, M., et al.: Local recovery and failure masking for stencil-based applications at extreme scales. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis SC 2015, pp. 1\u201312, November 2015. 
https:\/\/doi.org\/10.1145\/2807591.2807672","DOI":"10.1145\/2807591.2807672"},{"issue":"3","key":"20_CR7","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1016\/S0167-8191(06)80021-9","volume":"20","author":"RW Hockney","year":"1994","unstructured":"Hockney, R.W.: The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Comput. 20(3), 389\u2013398 (1994). https:\/\/doi.org\/10.1016\/S0167-8191(06)80021-9. ISSN: 0167\u20138191","journal-title":"Parallel Comput."},{"key":"20_CR8","doi-asserted-by":"publisher","unstructured":"Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim - simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 597\u2013604. ACM, Chicago, June 2010. https:\/\/doi.org\/10.1145\/1851476.1851564. ISBN: 978-1-60558-942-8","DOI":"10.1145\/1851476.1851564"},{"key":"20_CR9","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1007\/978-3-319-92040-5_2","volume-title":"High Performance Computing","author":"J Hofmann","year":"2018","unstructured":"Hofmann, J., Hager, G., Fey, D.: On the accuracy and usefulness of analytic energy models for contemporary multicore processors. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds.) ISC High Performance 2018. LNCS, vol. 10876, pp. 22\u201343. Springer, Cham (2018). https:\/\/doi.org\/10.1007\/978-3-319-92040-5_2"},{"key":"20_CR10","unstructured":"Hofmann, J., et al.: Bridging the architecture gap: abstracting performance-relevant properties of modern server processors. arXiv (2019, Submitted). 
arXiv:1907.00048 [cs.DC]"},{"key":"20_CR11","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"269","DOI":"10.1007\/978-3-319-92040-5_14","volume-title":"High Performance Computing","author":"JP Kenny","year":"2018","unstructured":"Kenny, J.P., Sargsyan, K., Knight, S., Michelogiannakis, G., Wilke, J.J.: The pitfalls of provisioning exascale networks: a trace replay analysis for understanding communication performance. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds.) ISC High Performance 2018. LNCS, vol. 10876, pp. 269\u2013288. Springer, Cham (2018). https:\/\/doi.org\/10.1007\/978-3-319-92040-5_14"},{"key":"20_CR12","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"329","DOI":"10.1007\/978-3-319-92040-5_17","volume-title":"High Performance Computing","author":"M Kreutzer","year":"2018","unstructured":"Kreutzer, M., et al.: Chebyshev filter diagonalization on modern manycore processors and GPGPUs. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds.) ISC High Performance 2018. LNCS, vol. 10876, pp. 329\u2013349. Springer, Cham (2018). https:\/\/doi.org\/10.1007\/978-3-319-92040-5_17"},{"key":"20_CR13","doi-asserted-by":"publisher","unstructured":"Kreutzer, M., et al.: Performance engineering of the Kernel Polynomial Method on large-scale CPU-GPU systems. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 417\u2013426, May 2015. https:\/\/doi.org\/10.1109\/IPDPS.2015.76","DOI":"10.1109\/IPDPS.2015.76"},{"key":"20_CR14","doi-asserted-by":"publisher","unstructured":"Le\u00f3n, E.A., Karlin, I., Moody, A.T.: System noise revisited: enabling application scalability and reproducibility with SMT. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 596\u2013607 (2016). 
https:\/\/doi.org\/10.1109\/IPDPS.2016.48","DOI":"10.1109\/IPDPS.2016.48"},{"issue":"1","key":"20_CR15","doi-asserted-by":"publisher","first-page":"013306","DOI":"10.1103\/PhysRevE.91.013306","volume":"91","author":"S Markidis","year":"2015","unstructured":"Markidis, S., et al.: Idle waves in high-performance computing. Phys. Rev. E 91(1), 013306 (2015). https:\/\/doi.org\/10.1103\/PhysRevE.91.013306","journal-title":"Phys. Rev. E"},{"key":"20_CR16","doi-asserted-by":"publisher","unstructured":"Petrini, F., Kerbyson, D.J., Pakin, S.: The case of the missing supercomputer performance: achieving optimal performance on the 8,192 processors of ASCI Q. In: 2003 ACM\/IEEE Conference on Supercomputing, pp. 55\u201355. IEEE (2003). https:\/\/doi.org\/10.1145\/1048935.1050204","DOI":"10.1145\/1048935.1050204"},{"key":"20_CR17","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1016\/j.jcp.2016.08.027","volume":"325","author":"A Pieper","year":"2016","unstructured":"Pieper, A., et al.: High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations. J. Comput. Phys. 325, 226\u2013243 (2016). https:\/\/doi.org\/10.1016\/j.jcp.2016.08.027","journal-title":"J. Comput. Phys."},{"key":"20_CR18","doi-asserted-by":"publisher","unstructured":"Rabenseifner, R., Hager, G., Jost, G.: Hybrid MPI\/OpenMP parallel programming on clusters of multi-core SMP nodes. In: 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, Los Alamitos, CA, USA, pp. 427\u2013436. IEEE Computer Society, February 2009. https:\/\/doi.org\/10.1109\/PDP.2009.43","DOI":"10.1109\/PDP.2009.43"},{"key":"20_CR19","doi-asserted-by":"publisher","unstructured":"Stengel, H., Treibig, J., Hager, G., Wellein, G.: Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model. In: Proceedings of the 29th ACM International Conference on Supercomputing, ICS 2015, Newport Beach, CA. ACM (2015). 
https:\/\/doi.org\/10.1145\/2751205.2751240","DOI":"10.1145\/2751205.2751240"},{"key":"20_CR20","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1007\/978-3-319-92040-5_13","volume-title":"High Performance Computing","author":"H Weisbach","year":"2018","unstructured":"Weisbach, H., Gerofi, B., Kocoloski, B., H\u00e4rtig, H., Ishikawa, Y.: Hardware performance variation: a comparative study using lightweight kernels. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds.) ISC High Performance 2018. LNCS, vol. 10876, pp. 246\u2013265. Springer, Cham (2018). https:\/\/doi.org\/10.1007\/978-3-319-92040-5_13"},{"issue":"4","key":"20_CR21","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1145\/1498765.1498785","volume":"52","author":"S Williams","year":"2009","unstructured":"Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65\u201376 (2009). https:\/\/doi.org\/10.1145\/1498765.1498785. ISSN: 0001\u20130782","journal-title":"Commun. ACM"},{"key":"20_CR22","unstructured":"Wu, X., Taylor, V.: Using processor partitioning to evaluate the performance of MPI, OpenMP and hybrid parallel applications on dual-and quad-core Cray XT4 systems. In: The 51st Cray User Group Conference (CUG2009), pp. 4\u20137 (2009). 
http:\/\/faculty.cse.tamu.edu\/wuxf\/papers\/cug09.pdf"}],"container-title":["Lecture Notes in Computer Science","High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-030-50743-5_20","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,18]],"date-time":"2023-12-18T20:05:33Z","timestamp":1702929933000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-030-50743-5_20"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020]]},"ISBN":["9783030507428","9783030507435"],"references-count":22,"URL":"https:\/\/doi.org\/10.1007\/978-3-030-50743-5_20","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2020]]},"assertion":[{"value":"15 June 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"ISC High Performance","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on High Performance Computing","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Frankfurt am Main","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Germany","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2020","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"22 June 
2020","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"25 June 2020","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"35","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"supercomputing2020","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/www.isc-hpc.com\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Double-blind","order":1,"name":"type","label":"Type","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"Linklings","order":2,"name":"conference_management_system","label":"Conference Management System","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"87","order":3,"name":"number_of_submissions_sent_for_review","label":"Number of Submissions Sent for Review","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"27","order":4,"name":"number_of_full_papers_accepted","label":"Number of Full Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"0","order":5,"name":"number_of_short_papers_accepted","label":"Number of Short Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"31% - The value is computed by the equation \"Number of Full Papers Accepted \/ 
Number of Submissions Sent for Review * 100\" and then rounded to a whole number.","order":6,"name":"acceptance_rate_of_full_papers","label":"Acceptance Rate of Full Papers","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"3.73","order":7,"name":"average_number_of_reviews_per_paper","label":"Average Number of Reviews per Paper","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"4.33","order":8,"name":"average_number_of_papers_per_reviewer","label":"Average Number of Papers per Reviewer","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"No","order":9,"name":"external_reviewers_involved","label":"External Reviewers Involved","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"The conference was held virtually due to the COVID-19 pandemic.","order":10,"name":"additional_info_on_review_process","label":"Additional Info on Review Process","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}}]}}