{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:13:59Z","timestamp":1750306439601,"version":"3.41.0"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,9,11]],"date-time":"2015-09-11T00:00:00Z","timestamp":1441929600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["0844951"],"award-info":[{"award-number":["0844951"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2015,10]]},"abstract":"<jats:p>Massively parallel memory systems are designed to deliver high bandwidth at relatively low clock speed for memory-intensive applications implemented on programmable logic. For example, the Convey HC-1 provides 1,024 DRAM banks to each of four FPGAs through a full crossbar, presenting a peak bandwidth of 76.8GB\/s to the user logic. Such highly parallel memory systems suffer from high latency, and their effective bandwidth is highly sensitive to access ordering. To achieve high performance, the user must use a customized memory interface that combines scheduling, latency hiding, and data reuse. In this article, we describe the design of a custom memory interface for 3D stencil kernels on the Convey HC-1 that incorporates these features. 
Experimental results show that the proposed memory interface achieves a speedup in runtime of 2.2 for 6-point stencil and 9.5 for 27-point stencil when compared to a naive memory interface.<\/jats:p>","DOI":"10.1145\/2800788","type":"journal-article","created":{"date-parts":[[2015,9,15]],"date-time":"2015-09-15T12:09:15Z","timestamp":1442318955000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System"],"prefix":"10.1145","volume":"8","author":[{"given":"Zheming","family":"Jin","sequence":"first","affiliation":[{"name":"University of South Carolina, Columbia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jason D.","family":"Bakos","sequence":"additional","affiliation":[{"name":"University of South Carolina, Columbia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,9,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654102"},{"volume-title":"Proceedings of the Second International Workshop on New Frontiers in High-Performance and Hardware-Aware Computing (HipHaC'11)","author":"Augustin W.","key":"e_1_2_1_2_1","unstructured":"W. Augustin, J. Weiss, and V. Heuveline. 2011. Convey HC-1 Hybrid Core Computer-The Potential of FPGAs in numerical simulation. In Proceedings of the Second International Workshop on New Frontiers in High-Performance and Hardware-Aware Computing (HipHaC'11). 
San Antonio, Texas, USA."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/774789.774805"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1391962.1391969"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1878961.1878989"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/92.994985"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/368434.368769"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024724.2024937"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SASP.2011.5941081"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1929943.1929947"},{"volume-title":"Proceedings of the 2011 International Conference on Computer-Aided Design (ICCAD\u201911)","author":"Cong J.","key":"e_1_2_1_11_1","unstructured":"J. Cong, P. Zhang, and Y. Zou. 2011d. Combined loop transformation and hierarchy allocation in data reuse optimization. In Proceedings of the 2011 International Conference on Computer-Aided Design (ICCAD\u201911). 185--192."},{"key":"e_1_2_1_12_1","volume-title":"Retrieved","author":"Convey Corporation","year":"2012","unstructured":"Convey Corporation. 2012. Convey Personality Development Kit Reference Manual. Retrieved August 24, 2015, from http:\/\/www.conveysupport.com\/alldocs\/ConveyPDKReferenceManual.pdf."},{"volume-title":"Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing. IEEE","author":"Datta K.","key":"e_1_2_1_13_1","unstructured":"K. Datta, M. Murphy, V. Volkov, S. Williams, J. 
Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing. IEEE, Los Alamitos, CA, 1--12."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5161013"},{"volume-title":"Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM\u201904)","author":"He C.","key":"e_1_2_1_15_1","unstructured":"C. He, M. Lu, and C. Sun. 2004. Accelerating seismic migration using FPGA-based coprocessor platform. In Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM\u201904). IEEE, Los Alamitos, CA, 207--216."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2006.9"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2005.65"},{"key":"e_1_2_1_18_1","series-title":"Lecture Notes in Computer Science","volume-title":"Field Programmable Logic and Applications","author":"Ho W. K. C.","unstructured":"W. K. C. Ho and S. J. E. Wilton. 2004. 
Logical-to-physical memory mapping for FPGAs with dual-port embedded arrays. In Field Programmable Logic and Applications. Lecture Notes in Computer Science, Vol. 1673. Springer, 111--123."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1230800.1230807"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2013.55"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2003.822123"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1391469.1391691"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/996566.996596"},{"volume-title":"Proceedings of the 1997 European Conference on Design and Test (EDTC\u201997)","author":"Panda P. R.","key":"e_1_2_1_24_1","unstructured":"P. R. Panda, N. D. Dutt, and A. Nicolau. 1997. Efficient utilization of scratch-pad memory in embedded processor applications. In Proceedings of the 1997 European Conference on Design and Test (EDTC\u201997). 7."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339668"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1049\/el:19991511"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488748"},{"volume-title":"Proceedings of the 2012 17th Asia and South Pacific Design Automation Conference (ASP-DAC\u201912)","author":"Wang Y.","key":"e_1_2_1_28_1","unstructured":"Y. Wang, P. Zhang, X. Cheng, and J. Cong. 2012. 
An integrated and automated memory optimization flow for FPGA behavioral synthesis. In Proceedings of the 2012 17th Asia and South Pacific Design Automation Conference (ASP-DAC\u201912). 257--262."}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2800788","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2800788","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:42:44Z","timestamp":1750225364000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2800788"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,9,11]]},"references-count":28,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,10]]}},"alternative-id":["10.1145\/2800788"],"URL":"https:\/\/doi.org\/10.1145\/2800788","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2015,9,11]]},"assertion":[{"value":"2014-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-09-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}