{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T19:30:14Z","timestamp":1765827014314,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2012,1,1]],"date-time":"2012-01-01T00:00:00Z","timestamp":1325376000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2012,1]]},"abstract":"<jats:p>In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring the data from the off-chip slow memory to the local memory of the cores via a DMA (direct memory access) mechanism. Based on the computation time and size of elementary data items as well as DMA characteristics, we derive optimal and near optimal values for the number of blocks that should be clustered in a single DMA command. We then extend the results to the case where a computation for one data item needs some data in its neighborhood. In this setting we characterize the performance of several alternative mechanisms for data sharing. 
Our models are validated experimentally using a cycle-accurate simulator of the Cell Broadband Engine architecture.<\/jats:p>","DOI":"10.1145\/2086696.2086716","type":"journal-article","created":{"date-parts":[[2012,1,24]],"date-time":"2012-01-24T16:47:14Z","timestamp":1327423634000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Optimizing explicit data transfers for data parallel applications on the cell architecture"],"prefix":"10.1145","volume":"8","author":[{"given":"Selma","family":"Saidi","sequence":"first","affiliation":[{"name":"Verimag Lab, University of Grenoble and STMicroelectronics Grenoble, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pranav","family":"Tendulkar","sequence":"additional","affiliation":[{"name":"Verimag Lab, University of Grenoble, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Thierry","family":"Lepley","sequence":"additional","affiliation":[{"name":"STMicroelectronics Grenoble, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Oded","family":"Maler","sequence":"additional","affiliation":[{"name":"CNRS-Verimag Lab, Grenoble, France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2012,1,26]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1109\/71.466632"},{"doi-asserted-by":"publisher","key":"e_1_2_1_2_1","DOI":"10.1145\/28395.28428"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/1463768.1463771"},{"volume-title":"Proceedings of the 1st IEEE International Conference on Ubi-Media Computing. 155--158","author":"Bai S.","unstructured":"
Bai, S., Zhou, Q., Zhou, R., and Li, L. 2008. Barrier synchronization for cell multi-processor architecture. In Proceedings of the 1st IEEE International Conference on Ubi-Media Computing. 155--158.","key":"e_1_2_1_4_1"},{"volume-title":"Proceedings of the International Conference on High Performance Computing (HiPC). IEEE, 245--253","author":"Beltran V.","unstructured":"Beltran, V., Carrera, D., Torres, J., and Ayguad\u00e9, E. 2009. CellMT: A cooperative multithreading library for the Cell\/BE. In Proceedings of the International Conference on High Performance Computing (HiPC). IEEE, 245--253.","key":"e_1_2_1_5_1"},{"volume-title":"Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'08)","author":"Blagojevic F.","unstructured":"Blagojevic, F., Feng, X., Cameron, K. W., and Nikolopoulos, D. S. 2008. Modeling multigrain parallelism on heterogeneous multi-core processors: a case study of the cell be. In Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'08). 
Springer-Verlag, Berlin, 38--52.","key":"e_1_2_1_6_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1145\/106972.106979"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1007\/978-3-642-11515-8_9"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.5555\/1757112.1757144"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.1145\/192007.192030"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1145\/207110.207162"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1145\/155332.155333"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1109\/71.395402"},{"volume-title":"Department of Computer Science","author":"Esseghir K.","unstructured":"Esseghir, K. 1993. Improving data locality for caches. Master's thesis, Department of Computer Science, Rice University.","key":"e_1_2_1_14_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1145\/1188455.1188543"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1109\/ICME.2002.1035522"},{"doi-asserted-by":"publisher","key":"e_1_2_1_17_1","DOI":"10.1109\/12.381947"},{"doi-asserted-by":"publisher","key":"e_1_2_1_18_1","DOI":"10.1007\/s10766-007-0035-4"},{"unstructured":"IBM. 2008. Cell SDK 3.1. https:\/\/www.ibm.com\/developerworks\/power\/cell\/.","key":"e_1_2_1_19_1"},{"unstructured":"IBM. 2009. Cell Simulator. 
http:\/\/www.alphaworks.ibm.com\/tech\/cellsystemsim.","key":"e_1_2_1_20_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1109\/MM.2006.49"},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.1145\/106975.106981"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1016\/0743-7315(91)90014-Z"},{"volume-title":"Fast Fourier Transform and Convolution Algorithms","author":"Nussbaumer H. J.","unstructured":"Nussbaumer, H. J. 1981. Fast Fourier Transform and Convolution Algorithms. Springer-Verlag, Berlin.","key":"e_1_2_1_24_1"},{"volume-title":"Proceedings of IPDPS'07","author":"Petrini F.","unstructured":"Petrini, F., Fossum, G., Fernandez, J., Varbanescu, A., Kistler, M., and Perrone, M. 2007. Multicore surprises: Lessons learned from optimizing sweep3d on the cell broadband engine. In Proceedings of IPDPS'07. IEEE, 1--10.","key":"e_1_2_1_25_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.1145\/1118299.1118497"},{"volume-title":"IPDPS 2008. Proceedings of the IEEE International Symposium on Parallel and Distributed Processing. IEEE.","author":"Sancho J.","unstructured":"Sancho, J. and Kerbyson, D. 2008. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE. In IPDPS 2008. Proceedings of the IEEE International Symposium on Parallel and Distributed Processing. 
IEEE.","key":"e_1_2_1_27_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1145\/1188455.1188585"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1109\/MC.2009.407"},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1145\/1594835.1504197"},{"unstructured":"STMicroelectronics and CEA. 2010. Platform 2012: A many core programmable accelerator for ultra efficient embedded computing in nanometer technology.","key":"e_1_2_1_31_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1109\/IPDPS.2009.5161168"},{"doi-asserted-by":"publisher","key":"e_1_2_1_33_1","DOI":"10.1145\/79173.79181"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/871656.859663"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.1145\/76263.76337"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1109\/RTAS.2006.38"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2086696.2086716","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2086696.2086716","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T10:06:43Z","timestamp":1750241203000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2086696.2086716"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,1]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2012,1]]}},"alternative-id":["10.1145\/2086696.2086716"],"URL":"https:\/\/doi.org\/10.1145\/2086696.2086716","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2012,1]]},"assertion":[{"value":"2011-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2011-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-01-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}