{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:14:55Z","timestamp":1750306495031,"version":"3.41.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,1,29]],"date-time":"2016-01-29T00:00:00Z","timestamp":1454025600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2016,3,15]]},"abstract":"<jats:p>\n            As the level of parallelism in manycore processors keeps increasing, providing efficient mechanisms for thread synchronization in concurrent programs is becoming a major concern. On cache-coherent shared-memory processors, synchronization efficiency is ultimately limited by the performance of the underlying cache coherence protocol. This article studies how hardware support for message passing can improve synchronization performance. Considering the ubiquitous problem of mutual exclusion, we devise novel algorithms for (i) classic locking, where application threads obtain exclusive access to a shared resource prior to executing their critical sections (CSes), and (ii) delegation, where CSes are executed by special threads. For classic locking, our\n            <jats:sc>HybLock<\/jats:sc>\n            algorithm uses a mix of shared memory and hardware message passing, which introduces the idea of hybrid synchronization algorithms. For delegation, we propose\n            <jats:sc>mp-server<\/jats:sc>\n            and\n            <jats:sc>HybComb<\/jats:sc>\n            : the former is a straightforward adaptation of the server approach to hardware message passing, whereas the latter is a novel hybrid combining algorithm. Evaluation on Tilera's TILE-Gx processor shows that\n            <jats:sc>HybLock<\/jats:sc>\n            outperforms the best known classic locks. Furthermore,\n            <jats:sc>mp-server<\/jats:sc>\n            can execute contended CSes with unprecedented throughput, as stalls related to cache coherence are removed from the critical path.\n            <jats:sc>HybComb<\/jats:sc>\n            can achieve comparable performance while avoiding the need to dedicate server cores. Consequently, our queue and stack implementations, based on the new synchronization algorithms, largely outperform their most efficient shared-memory-only counterparts.\n          <\/jats:p>","DOI":"10.1145\/2858652","type":"journal-article","created":{"date-parts":[[2016,2,1]],"date-time":"2016-02-01T20:37:54Z","timestamp":1454359074000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Leveraging Hardware Message Passing for Efficient Thread Synchronization"],"prefix":"10.1145","volume":"2","author":[{"given":"Darko","family":"Petrovi\u0107","sequence":"first","affiliation":[{"name":"Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Thomas","family":"Ropars","sequence":"additional","affiliation":[{"name":"Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andr\u00e9","family":"Schiper","sequence":"additional","affiliation":[{"name":"Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2016,1,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.87"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-32820-6_64"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629579"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IGCC.2011.6008565"},{"volume-title":"Proceedings of the 5th USENIX Workshop on Hot Topics in Parallelism. 6.","author":"Calciu I.","key":"e_1_2_1_5_1","unstructured":"I. Calciu , J. Gottschlich , and M. Herlihy . 2013. Using elimination and delegation to implement a scalable NUMA-friendly stack . In Proceedings of the 5th USENIX Workshop on Hot Topics in Parallelism. 6. I. Calciu, J. Gottschlich, and M. Herlihy. 2013. Using elimination and delegation to implement a scalable NUMA-friendly stack. In Proceedings of the 5th USENIX Workshop on Hot Topics in Parallelism. 6."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400686"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522714"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989549"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145849"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2168836.2168872"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810479.1810540"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/210223.210225"},{"key":"e_1_2_1_14_1","unstructured":"M. Herlihy and N. Shavit. 2008. The Art of Multiprocessor Programming. Morgan Kaufmann San Francisco CA.   M. Herlihy and N. Shavit. 2008. The Art of Multiprocessor Programming. Morgan Kaufmann San Francisco CA."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/78969.78972"},{"volume-title":"Proceedings of the International IEEE Solid-State Circuits Conference Digest of Technical Papers. 108--109","author":"Howard J.","key":"e_1_2_1_16_1","unstructured":"J. Howard , S. Dighe , Y. Hoskote , S. Vangal , D. Finan , G. Ruhl , D. Jenkins , H. Wilson , N. Borkar , G. Schrom , F. Pailet , S. Jain , T. Jacob , S. Yada , S. Marella , P. Salihundam , V. Erraguntla , M. Knonw , M. Riepen , G. Droege , J. Lindemann , M. Gries , T. Apel , K. Henriss , T. Lund-Larsen , S. Borkar , V. De , R. Van Der Wijngaart, and T. Mattson. 2010. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS . In Proceedings of the International IEEE Solid-State Circuits Conference Digest of Technical Papers. 108--109 . J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Knonw, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. 2010. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Proceedings of the International IEEE Solid-State Circuits Conference Digest of Technical Papers. 108--109."},{"key":"e_1_2_1_17_1","unstructured":"Kalray. 2014. Kalray Manycore Processors. Available at http:\/\/www.kalray.eu.  Kalray. 2014. Kalray Manycore Processors. Available at http:\/\/www.kalray.eu."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2612669.2612714"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1979.1675439"},{"volume-title":"Proceedings of the 2012 USENIX Annual Technical Conference. 1. http:\/\/dl.acm.org\/citation.cfm?id=2342821","author":"Lozi J.-P.","key":"e_1_2_1_20_1","unstructured":"J.-P. Lozi , F. David , G. Thomas , J. Lawall , and G. Muller . 2012. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications . In Proceedings of the 2012 USENIX Annual Technical Conference. 1. http:\/\/dl.acm.org\/citation.cfm?id=2342821 .2342827 J.-P. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller. 2012. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In Proceedings of the 2012 USENIX Annual Technical Conference. 1. http:\/\/dl.acm.org\/citation.cfm?id=2342821.2342827"},{"volume-title":"Proceedings of the 8th International Symposium on Parallel Processing. IEEE","author":"Magnusson P. S.","key":"e_1_2_1_21_1","unstructured":"P. S. Magnusson , A. Landin , and E. Hagersten . 1994. Queue locks on cache coherent multiprocessors . In Proceedings of the 8th International Symposium on Parallel Processing. IEEE , Los Alamitos, CA, 165--171. http:\/\/dl.acm.org\/citation.cfm?id=645604.662740 P. S. Magnusson, A. Landin, and E. Hagersten. 1994. Queue locks on cache coherent multiprocessors. In Proceedings of the 8th International Symposium on Parallel Processing. IEEE, Los Alamitos, CA, 165--171. http:\/\/dl.acm.org\/citation.cfm?id=645604.662740"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2209249.2209269"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/103727.103729"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145874"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/248052.248106"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442527"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1996.0041"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03359-9_27"},{"volume-title":"Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications. 16","author":"Oyama Y.","key":"e_1_2_1_29_1","unstructured":"Y. Oyama , K. Taura , and A. Yonezawa . 1999. Executing parallel programs with synchronization bottlenecks efficiently . In Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications. 16 . Y. Oyama, K. Taura, and A. Yonezawa. 1999. Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications. 16."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736055"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/215399.215419"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/2028905"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.7"},{"key":"e_1_2_1_34_1","unstructured":"Tilera. 2014. Tilera Manycore Processors. Available at http:\/\/www.tilera.com.  Tilera. 2014. Tilera Manycore Processors. Available at http:\/\/www.tilera.com."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1531793.1531805"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2858652","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2858652","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:48:43Z","timestamp":1750225723000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2858652"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,1,29]]},"references-count":34,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,3,15]]}},"alternative-id":["10.1145\/2858652"],"URL":"https:\/\/doi.org\/10.1145\/2858652","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2016,1,29]]},"assertion":[{"value":"2014-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-01-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}