{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T18:47:58Z","timestamp":1767638878126,"version":"3.48.0"},"reference-count":45,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2012,5,21]],"date-time":"2012-05-21T00:00:00Z","timestamp":1337558400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2013,5]]},"abstract":"<jats:p>\n                    Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM\n                    <jats:sup>\u00ae<\/jats:sup>\n                    Blue Gene\n                    <jats:sup>\u00ae<\/jats:sup>\n                    \/P supercomputer\u2019s PowerPC\n                    <jats:sup>\u00ae<\/jats:sup>\n                    450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU\u2019s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7\n                    <jats:inline-formula>\n                      <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\" overflow=\"scroll\">\n                        <mml:mo stretchy=\"false\">\u00d7<\/mml:mo>\n                      <\/mml:math>\n                    <\/jats:inline-formula>\n                    speedup over the best previously published results.\n                  <\/jats:p>","DOI":"10.1177\/1094342012444795","type":"journal-article","created":{"date-parts":[[2012,5,21]],"date-time":"2012-05-21T21:08:07Z","timestamp":1337634487000},"page":"193-209","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":4,"title":["Optimizing the performance of streaming numerical kernels on the IBM Blue Gene\/P PowerPC 450 processor"],"prefix":"10.1177","volume":"27","author":[{"given":"Tareq","family":"Malas","sequence":"first","affiliation":[{"name":"King Abdullah University of Science and Technology, Thuwal, Saudi Arabia"}]},{"given":"Aron J.","family":"Ahmadia","sequence":"additional","affiliation":[{"name":"King Abdullah University of Science and Technology, Thuwal, Saudi Arabia"}]},{"given":"Jed","family":"Brown","sequence":"additional","affiliation":[{"name":"Argonne National Laboratory, Argonne, IL, USA"}]},{"given":"John A.","family":"Gunnels","sequence":"additional","affiliation":[{"name":"IBM T.J. Watson Research Center, Yorktown Heights, NY, USA"}]},{"given":"David E.","family":"Keyes","sequence":"additional","affiliation":[{"name":"King Abdullah University of Science and Technology, Thuwal, Saudi Arabia"}]}],"member":"179","published-online":{"date-parts":[[2012,5,21]]},"reference":[{"key":"e_1_3_3_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654124"},{"key":"e_1_3_3_3_1","doi-asserted-by":"publisher","DOI":"10.1155\/2009\/382638"},{"issue":"3","key":"e_1_3_3_4_1","first-page":"63","article-title":"The NAS parallel benchmarks","volume":"5","author":"Bailey D","year":"1991","unstructured":"Bailey D, Barszcz E, Barton J, (1991) The NAS parallel benchmarks. International Journal of High Performance Computing Applications 5(3): 63.","journal-title":"International Journal of High Performance Computing Applications"},{"key":"e_1_3_3_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(84)90073-1"},{"key":"e_1_3_3_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(88)90002-0"},{"key":"e_1_3_3_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/197320.197366"},{"key":"e_1_3_3_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0898-1221(97)00184-3"},{"key":"e_1_3_3_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00450-011-0160-6"},{"key":"e_1_3_3_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5161031"},{"key":"e_1_3_3_11_1","unstructured":"Datta K (2009) Auto-tuning Stencil Codes for Cache-Based Multicore Platforms. PhD thesis EECS Department University of California Berkeley CA. http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2009\/EECS-2009-177.html."},{"key":"e_1_3_3_12_1","unstructured":"Datta K Williams S Volkov V (2009) Auto-tuning the 27-point Stencil for Multicore. In Proc. iWAPT2009: The Fourth International Workshop on Automatic Performance Tuning. http:\/\/crd.lbl.gov\/~oliker\/papers\/iwapt09.pdf."},{"key":"e_1_3_3_13_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342010391989"},{"key":"e_1_3_3_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03869-3_61"},{"key":"e_1_3_3_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/996893.996853"},{"key":"e_1_3_3_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2007.445"},{"key":"e_1_3_3_17_1","volume-title":"Proceedings of the First USENIX conference on Hot topics in parallelism","author":"Ganapathi A","year":"2009","unstructured":"Ganapathi A, Datta K, Fox A, Patterson D (2009) A case for machine learning to optimize multicore performance. In Proceedings of the First USENIX conference on Hot topics in parallelism. Berkeley, CA: USENIX Association."},{"key":"e_1_3_3_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654122"},{"key":"e_1_3_3_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2166.357217"},{"key":"e_1_3_3_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19861-8_13"},{"issue":"1","key":"e_1_3_3_21_1","first-page":"199","article-title":"Overview of the IBM Blue Gene\/P project","volume":"52","author":"IBM Blue Gene Team","year":"2008","unstructured":"IBM Blue Gene Team (2008) Overview of the IBM Blue Gene\/P project. IBM Journal of Research and Development 52(1\/2): 199\u2013220.","journal-title":"IBM Journal of Research and Development"},{"key":"e_1_3_3_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470421"},{"key":"e_1_3_3_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1178597.1178605"},{"key":"e_1_3_3_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1111583.1111589"},{"key":"e_1_3_3_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.crme.2010.11.002"},{"key":"e_1_3_3_26_1","unstructured":"Kogge P Bergman K Borkar S (2008) Exascale computing study: Technology challenges in achieving exascale systems. http:\/\/www.cse.nd.edu\/Reports\/2008\/TR-2008-13.pdf"},{"key":"e_1_3_3_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1273442.1250761"},{"key":"e_1_3_3_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1034774.1034777"},{"key":"e_1_3_3_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_3_3_30_1","unstructured":"Mueller C Martin B (2007) CorePy: high-productivity Cell\/BE programming. Applications for the Cell\/BE http:\/\/sti.cc.gatech.edu\/Slides\/Mueller-070619.pdf."},{"key":"e_1_3_3_31_1","doi-asserted-by":"crossref","unstructured":"Newburn C So B Liu Z (2011) Intel\u2019s Array Building Blocks: A Retargetable Dynamic Compiler and Embedded Language. Proceedings of Code Generation and Optimization http:\/\/software.intel.com\/en-us\/blogs\/wordpress\/wp-content\/uploads\/2011\/03\/ArBB-CGO2011-distr.pdf.","DOI":"10.1109\/CGO.2011.5764690"},{"key":"e_1_3_3_32_1","first-page":"1","article-title":"3.5-D blocking optimization for stencil computations on modern CPUs and GPUs","author":"Nguyen A","year":"2010","unstructured":"Nguyen A, Satish N, Chhugani J, Kim C, Dubey P (2010) 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. Proceedings of SuperComputing, pp. 1\u201313.","journal-title":"Proceedings of SuperComputing"},{"key":"e_1_3_3_33_1","first-page":"1","article-title":"High-order stencil computations on multicore clusters","author":"Peng L","year":"2009","unstructured":"Peng L, Seymour R, Nomura K-I, (2009) High-order stencil computations on multicore clusters. Proceedings of IPDPS, pp. 1\u201311.","journal-title":"Proceedings of IPDPS"},{"key":"e_1_3_3_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2005.859896"},{"key":"e_1_3_3_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654121"},{"key":"e_1_3_3_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2000.10015"},{"key":"e_1_3_3_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1399504.1360617"},{"key":"e_1_3_3_38_1","doi-asserted-by":"publisher","DOI":"10.1080\/1061856031000104851"},{"key":"e_1_3_3_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250754"},{"key":"e_1_3_3_40_1","volume-title":"IBM System Blue Gene Solution: Blue Gene\/P Application Development","author":"Sosa C and International Business Machines Corporation","year":"2008","unstructured":"Sosa C and International Business Machines Corporation (2008). IBM System Blue Gene Solution: Blue Gene\/P Application Development. IBM International Technical Support Organization."},{"key":"e_1_3_3_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989508"},{"key":"e_1_3_3_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14390-8_64"},{"key":"e_1_3_3_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMPSAC.2009.82"},{"key":"e_1_3_3_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/358438.349318"},{"key":"e_1_3_3_45_1","first-page":"1","article-title":"Lattice Boltzmann simulation optimization on leading multicore platforms","author":"Williams S","year":"2008","unstructured":"Williams S, Carter J, Oliker L, Shalf J, Yelick K (2008) Lattice Boltzmann simulation optimization on leading multicore platforms. 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1\u201314.","journal-title":"2008 IEEE International Symposium on Parallel and Distributed Processing"},{"key":"e_1_3_3_46_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626410000296"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342012444795","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342012444795","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342012444795","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T14:57:54Z","timestamp":1767625074000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342012444795"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,5,21]]},"references-count":45,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2013,5]]}},"alternative-id":["10.1177\/1094342012444795"],"URL":"https:\/\/doi.org\/10.1177\/1094342012444795","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2012,5,21]]}}}