{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:21:30Z","timestamp":1750306890958,"version":"3.41.0"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2013,1,1]],"date-time":"2013-01-01T00:00:00Z","timestamp":1356998400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2013,1]]},"abstract":"<jats:p>Highly parallel systems are becoming mainstream in a wide range of sectors ranging from their traditional stronghold high-performance computing, to data centers and even embedded systems. However, despite the quantum leaps of improvements in cost and performance of individual components over the last decade (e.g., processor speeds, memory\/interconnection bandwidth, etc.), system manufacturers are still struggling to deliver low-latency, highly scalable solutions. One of the main reasons is that the intercommunication latency grows significantly with the number of processor nodes. This article presents a novel way to reduce this intercommunication delay by implementing, in custom hardware, certain communication tasks. In particular, the proposed novel device implements the two most widely used procedures of the most popular communication protocol in parallel systems the Message Passing Interface (MPI). Our novel approach has initially been simulated within a pioneering parallel systems simulation framework and then synthesized directly from a high-level description language (i.e., SystemC) using a state-of-the-art synthesis tool. To the best of our knowledge, this is the first article presenting the complete hardware implementation of such a system. The proposed novel approach triggers a speedup from one to four orders of magnitude when compared with conventional software-based solutions and from one to three orders of magnitude when compared with a sophisticated software-based approach. Moreover, the performance of our system is from one to two orders of magnitude higher than the simulated performance of a similar but, relatively simpler hardware architecture; at the same time the power consumption of our device is about two orders of magnitude lower than that of a low-power CPU when executing the exact same intercommunication tasks.<\/jats:p>","DOI":"10.1145\/2400682.2400710","type":"journal-article","created":{"date-parts":[[2013,1,22]],"date-time":"2013-01-22T15:28:56Z","timestamp":1358868536000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Significantly reducing MPI intercommunication latency and power overhead in both embedded and HPC systems"],"prefix":"10.1145","volume":"9","author":[{"given":"Pavlos M.","family":"Mattheakis","sequence":"first","affiliation":[{"name":"Telecommunication Systems Institute, Technical University of Crete, and University of Crete, Heraklion, Greece"}]},{"given":"Ioannis","family":"Papaefstathiou","sequence":"additional","affiliation":[{"name":"Synelixis Solutions Ltd, Chalkida, Greece"}]}],"member":"320","published-online":{"date-parts":[[2013,1,20]]},"reference":[{"volume-title":"Proceeding of the 10th International Euro-pav Conference. 833--845","author":"Alm\u00e1si G.","key":"e_1_2_2_1_1"},{"key":"e_1_2_2_2_1","unstructured":"ARM. 2012. http:\/\/www.arm.com\/products\/processors\/cortex-a\/cortex-a8.php.  ARM. 2012. http:\/\/www.arm.com\/products\/processors\/cortex-a\/cortex-a8.php."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/125826.125925"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1941487.1941507"},{"volume-title":"Proceeding of the 19th International Conference on Computer Communications and Networks (ICCCN)","year":"2010","author":"Brightwell R.","key":"e_1_2_2_5_1"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2005.13"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1504\/IJHPCN.2006.010633"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005054257"},{"volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium 9.","author":"Brightwell R.","key":"e_1_2_2_9_1"},{"key":"e_1_2_2_10_1","unstructured":"Cadence. 2012. http:\/\/www.cadence.com.  Cadence. 2012. http:\/\/www.cadence.com."},{"key":"e_1_2_2_11_1","unstructured":"Cypress. 2012. http:\/\/www.cypress.com\/&quest;mpn=CY7C1089DV33-12BAXI.  Cypress. 2012. http:\/\/www.cypress.com\/&quest;mpn=CY7C1089DV33-12BAXI."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/944645.944651"},{"key":"e_1_2_2_13_1","unstructured":"Fahey M. Alam S. Dunigan T. Vetter J. and Worley P. 2004. Early evaluation of the Better Cray XD1. Tech. rep. Oak Ridge National Laboratory.  Fahey M. Alam S. Dunigan T. Vetter J. and Worley P. 2004. Early evaluation of the Better Cray XD1. Tech. rep. Oak Ridge National Laboratory."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/898758"},{"volume-title":"Proceedings of the International Conference on Embedded Computer Systems (SAMOS),. 357--364","author":"Go","key":"e_1_2_2_15_1"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/11752578_29"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTR.2007.4629234"},{"key":"e_1_2_2_18_1","unstructured":"Imperas. 2012. http:\/\/www.imperas.com.  Imperas. 2012. http:\/\/www.imperas.com."},{"key":"e_1_2_2_19_1","unstructured":"Intel. 2008. http:\/\/www.intel.com.  Intel. 2008. http:\/\/www.intel.com."},{"key":"e_1_2_2_20_1","unstructured":"Intel. 2012. http:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe.  Intel. 2012. http:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe."},{"volume-title":"Proceedings of the 17th European MPI Users' Group Meeting Conference on Recent Advances in the Message Passing Interface. 179--188","author":"Keller R.","key":"e_1_2_2_21_1"},{"key":"e_1_2_2_22_1","unstructured":"Knuth D. E. 1998. The Art of Computer Programming vol. 1 (3rd ed.): Fundamental Algorithms. Addison-Wesley Longman Publishing Co. Inc. Boston MA.   Knuth D. E. 1998. The Art of Computer Programming vol. 1 (3rd ed.): Fundamental Algorithms. Addison-Wesley Longman Publishing Co. Inc. Boston MA."},{"key":"e_1_2_2_23_1","unstructured":"Myricom. 2009. http:\/\/www.myricom.com.  Myricom. 2009. http:\/\/www.myricom.com."},{"key":"e_1_2_2_24_1","unstructured":"Open Risc. 2012. http:\/\/opencores.org\/openrisc architecture.  Open Risc. 2012. http:\/\/opencores.org\/openrisc architecture."},{"key":"e_1_2_2_25_1","unstructured":"OVP. 2012. http:\/\/www.ovpworld.org.  OVP. 2012. http:\/\/www.ovpworld.org."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.988689"},{"volume-title":"Proceedings of the 15th International Conference on Field Programmable Logic and Applications (FPL'06)","author":"Saldana M.","key":"e_1_2_2_27_1"},{"key":"e_1_2_2_28_1","unstructured":"Synopsys. 2012. Synopsys high density leakage control SRAM 512K sync compiler GlobalFoundries 40LP P-optional Vt\/cell SVt S-bitCell. http:\/\/www.synopsys.com.  Synopsys. 2012. Synopsys high density leakage control SRAM 512K sync compiler GlobalFoundries 40LP P-optional Vt\/cell SVt S-bitCell. http:\/\/www.synopsys.com."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2009.140"},{"key":"e_1_2_2_30_1","unstructured":"TI. 2012. www.ti.com.  TI. 2012. www.ti.com."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/1018425.1020282"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2005.30"},{"volume-title":"Proceedings of the IEEE Symposium on Cluster Computing. 1--10","author":"Underwood K. D.","key":"e_1_2_2_33_1"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2400682.2400710","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2400682.2400710","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T08:18:52Z","timestamp":1750234732000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2400682.2400710"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,1]]},"references-count":33,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2013,1]]}},"alternative-id":["10.1145\/2400682.2400710"],"URL":"https:\/\/doi.org\/10.1145\/2400682.2400710","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2013,1]]},"assertion":[{"value":"2012-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-01-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}