{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:29:50Z","timestamp":1750220990299,"version":"3.41.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,2,13]],"date-time":"2019-02-13T00:00:00Z","timestamp":1550016000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000028","name":"Semiconductor Research Corporation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000028","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,3,31]]},"abstract":"<jats:p>\n            Thread synchronization using shared memory hardware cache coherence paradigm is prevalent in multicore processors. However, as the number of cores increase on a chip, cache line ping-pong prevents performance scaling for algorithms that deploy fine-grain synchronization. This article proposes an in-hardware moving computation to data model (MC) that pins shared data at dedicated cores. The critical code sections are serialized and executed at these cores in a spatial setting to enable data locality optimizations. In-hardware messages enable non-blocking and blocking communication between cores, without involving the cache coherence protocol. The in-hardware MC model is implemented on\n            <jats:italic>Tilera Tile-Gx72<\/jats:italic>\n            multicore platform to evaluate 8- to 64-core count scale. 
A simulated RISC-V multicore environment is built to further evaluate the performance scaling advantages of the MC model at 1,024-cores scale. The evaluation using graph and machine-learning benchmarks illustrates that atomic instructions based synchronization scales up to 512 cores, and the MC model at the same core count outperforms by 27% in completion time and 39% in dynamic energy consumption.\n          <\/jats:p>","DOI":"10.1145\/3300208","type":"journal-article","created":{"date-parts":[[2019,2,14]],"date-time":"2019-02-14T19:36:17Z","timestamp":1550172977000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Accelerating Synchronization Using Moving Compute to Data Model at 1,000-core Multicore Scale"],"prefix":"10.1145","volume":"16","author":[{"given":"Halit","family":"Dogan","sequence":"first","affiliation":[{"name":"University of Connecticut, Storrs, Connecticut, USA"}]},{"given":"Masab","family":"Ahmad","sequence":"additional","affiliation":[{"name":"University of Connecticut, Storrs, Connecticut, USA"}]},{"given":"Brian","family":"Kahne","sequence":"additional","affiliation":[{"name":"NXP Semiconductors, Austin, TX"}]},{"given":"Omer","family":"Khan","sequence":"additional","affiliation":[{"name":"University of Connecticut, Connecticut, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,2,13]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2015.11"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750386"},{"key":"e_1_2_1_3_1","unstructured":"R. Bayer and M. Schkolnick. 1988. Concurrency of Operations on B-trees. In Readings in Database Systems. Morgan Kaufmann Publishers Inc. 
San Francisco CA 129--139."},{"key":"e_1_2_1_4_1","volume-title":"Principles and Practices of Interconnection Networks","author":"Dally William J.","year":"2004","unstructured":"William J. Dally and Brian Towles. 2004. Principles and Practices of Interconnection Networks."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"volume-title":"Proceedings of the IEEE International Conference on Computer Design (ICCD\u201918)","author":"Dogan H.","key":"e_1_2_1_6_1","unstructured":"H. Dogan, M. Ahmad, J. Joao, and O. Khan. 2018. Accelerating synchronization in graph analytics using moving compute to data model on Tilera TILE-Gx72. In Proceedings of the IEEE International Conference on Computer Design (ICCD\u201918)."},{"volume-title":"Proceedings of the Annual International Parallel and Distributed Processing Symposium (IPDPS\u201917)","author":"Dogan H.","key":"e_1_2_1_7_1","unstructured":"H. Dogan, F. Hijaz, M. Ahmad, B. Kahne, P. Wilson, and O. Khan. 2017. Accelerating graph and machine-learning workloads using a shared memory multicore architecture with auxiliary support for in-hardware explicit messaging. 
In Proceedings of the Annual International Parallel and Distributed Processing Symposium (IPDPS\u201917)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/648139.749473"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.2307874"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.42"},{"key":"e_1_2_1_11_1","volume-title":"SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 preprint","author":"Iandola Forrest N.","year":"2016","unstructured":"Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 preprint (2016)."},{"key":"e_1_2_1_12_1","unstructured":"Brian Kahne. 2013. FreescaleADL: An Industrial-Strength Architectural Description Language For Programmable Cores. Retrieved from http:\/\/opensource.freescale.com\/fsl-oss-projects\/."},{"key":"e_1_2_1_13_1","unstructured":"A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. 
In Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/165939.165970"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854332"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.154"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1080\/15427951.2009.10129177"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2898361"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669172"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the USENIX Annual Technical Conference. 65--76","author":"Lozi Jean-Pierre","year":"2012","unstructured":"Jean-Pierre Lozi, Florian David, Ga\u00ebl Thomas, Julia L. Lawall, Gilles Muller, et al. 2012. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In Proceedings of the USENIX Annual Technical Conference. 65--76."},{"volume-title":"Proceedings of the IEEE Symposium on High Performance Computer Architecture (HPCA\u201910)","author":"Miller J. E.","key":"e_1_2_1_21_1","unstructured":"J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. 2010. Graphite: A distributed parallel simulator for multicores. 
In Proceedings of the IEEE Symposium on High Performance Computer Architecture (HPCA\u201910)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Samuel K. Moore. 2016. Breaking the Multicore Bottleneck. Retrieved from https:\/\/spectrum.ieee.org\/semiconductors\/processors\/breaking-the-multicore-bottleneck.","DOI":"10.1109\/MSPEC.2016.7607015"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1357010.1352618"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228431"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736055"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508244.1508274"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOCS.2012.31"},{"volume-title":"Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA\u201911)","author":"Tiwari D.","key":"e_1_2_1_28_1","unstructured":"D. Tiwari, J. Tuck, Y. Solihin, and S. Lee. 2011. HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. 
In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA\u201911)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/139669.140382"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.612254"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967954"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/1320302.1320834"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300208","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3300208","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:25:23Z","timestamp":1750206323000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300208"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,13]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,3,31]]}},"alternative-id":["10.1145\/3300208"],"URL":"https:\/\/doi.org\/10.1145\/3300208","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2019,2,13]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}