{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:36:08Z","timestamp":1750221368159,"version":"3.41.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2017,9,30]],"date-time":"2017-09-30T00:00:00Z","timestamp":1506729600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Extreme Computing Research Center"},{"name":"ECRC"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2017,9,30]]},"abstract":"<jats:p>\n            Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only a few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multidimensional intratile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects from hardware prefetching or TLB shortage. Our\n            <jats:italic>Girih<\/jats:italic>\n            framework includes an autotuner to select optimal parameter configurations on the target hardware. We conduct performance experiments on two contemporary Intel processors and compare with the state-of-the-art stencil frameworks Pluto and Pochoir, using four corner-case stencil schemes and a wide range of problem sizes.\n            <jats:italic>Girih<\/jats:italic>\n            shows substantial performance advantages and best arithmetic intensity at almost all problem sizes, especially on low-intensity stencils with variable coefficients. We study in detail the performance behavior at varying grid sizes using phenomenological performance modeling. Our analysis of energy consumption reveals that our method can save energy through reduced DRAM bandwidth usage even at a marginal performance gain. It is thus well suited for future architectures that will be strongly challenged by the cost of data movement, be it in terms of performance or energy consumption.\n          <\/jats:p>","DOI":"10.1145\/3155290","type":"journal-article","created":{"date-parts":[[2017,12,20]],"date-time":"2017-12-20T14:54:00Z","timestamp":1513781640000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations"],"prefix":"10.1145","volume":"4","author":[{"given":"Tareq M.","family":"Malas","sequence":"first","affiliation":[{"name":"National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory"}]},{"given":"Georg","family":"Hager","sequence":"additional","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE), Friedrich-Alexander University of Erlangen-Nuremberg, Erlangen, Germany"}]},{"given":"Hatem","family":"Ltaief","sequence":"additional","affiliation":[{"name":"Extreme Computing Research Center (ECRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia"}]},{"given":"David E.","family":"Keyes","sequence":"additional","affiliation":[{"name":"ECRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia"}]}],"member":"320","published-online":{"date-parts":[[2017,12,18]]},"reference":[{"volume-title":"Technical Report UCB\/EECS-2006-183, EECS Department","year":"2006","author":"Asanovic K.","key":"e_1_2_2_1_1"},{"volume-title":"Proceedings of the 4th International Workshop on Performance Modeling, Benchmarking, and Simulation of HPC Systems.","author":"Balaprakash P.","key":"e_1_2_2_2_1"},{"volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--11","author":"Bandishti V.","key":"e_1_2_2_3_1"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1379022.1375595"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751224"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(88)90002-0"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.77"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.70"},{"key":"e_1_2_2_9_1","unstructured":"K. Datta. 2009. Auto-Tuning Stencil Codes for Cache-Based Multicore Platforms. Ph.D. Dissertation. EECS Department University of California Berkeley. Retrieved from http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2009\/EECS-2009-177.html.   K. Datta. 2009. Auto-Tuning Stencil Codes for Cache-Based Multicore Platforms. Ph.D. Dissertation. EECS Department University of California Berkeley. Retrieved from http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2009\/EECS-2009-177.html."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1137\/070693199"},{"volume-title":"Proceedings of the 40th Annual Symposium on Foundations of Computer Science","year":"1999","author":"Frigo M.","key":"e_1_2_2_11_1"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1088149.1088197"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1088149.1088197"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1137\/130935781"},{"key":"e_1_2_2_15_1","unstructured":"Girih. 2015. Girih stencil optimization framework. Retrieved from https:\/\/github.com\/tareqmalas\/girih.  Girih. 2015. Girih stencil optimization framework. Retrieved from https:\/\/github.com\/tareqmalas\/girih."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2544137.2544160"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626414410023"},{"key":"e_1_2_2_18_1","doi-asserted-by":"crossref","unstructured":"P. Gschwandtner J. J. Durillo and T. Fahringer. 2014. Multi-objective auto-tuning with insieme: Optimization and trade-off analysis for time energy and resource usage. In Euro-Par 2014 Parallel Processing. Vol. 8632. Springer International Publishing 87--98.  P. Gschwandtner J. J. Durillo and T. Fahringer. 2014. Multi-objective auto-tuning with insieme: Optimization and trade-off analysis for time energy and resource usage. In Euro-Par 2014 Parallel Processing. Vol. 8632. Springer International Publishing 87--98.","DOI":"10.1007\/978-3-319-09873-9_8"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751223"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2015.70"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3180"},{"volume-title":"Computer Architecture: A Quantitative Approach.","year":"2012","author":"Hennessy J. L.","key":"e_1_2_2_22_1"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2467268"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(89)90100-2"},{"volume-title":"11th International Conference on Parallel Processing and Applied Mathematics (PPAM\u201915)","author":"Hofmann J.","key":"e_1_2_2_25_1"},{"key":"e_1_2_2_26_1","unstructured":"Intel. 2015. Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. Retrieved from http:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/manuals\/64-ia-32-architectures-optimization-manual.pdf.  Intel. 2015. Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. Retrieved from http:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/manuals\/64-ia-32-architectures-optimization-manual.pdf."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2013.95"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/17407.17362"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/360827.360844"},{"key":"e_1_2_2_30_1","unstructured":"likwid. 2015. LIKWID performance tools. Retrieved from https:\/\/github.com\/rrze-likwid\/likwid\/.  likwid. 2015. LIKWID performance tools. Retrieved from https:\/\/github.com\/rrze-likwid\/likwid\/."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342012444795"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1137\/140991133"},{"volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society.","author":"Malas T.","key":"e_1_2_2_33_1"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.2"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2009.44"},{"key":"e_1_2_2_36_1","doi-asserted-by":"crossref","unstructured":"D. Orozco E. Garcia and G. Gao. 2011. Locality optimization of stencil applications using data dependency graphs. In Languages and Compilers for Parallel Computing. Springer Berlin 77--91.   D. Orozco E. Garcia and G. Gao. 2011. Locality optimization of stencil applications using data dependency graphs. In Languages and Compilers for Parallel Computing. Springer Berlin 77--91.","DOI":"10.1007\/978-3-642-19595-2_6"},{"volume-title":"Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition.","year":"2000","author":"Sch\u00f6nauer W.","key":"e_1_2_2_37_1"},{"volume-title":"Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201915)","author":"Shrestha S.","key":"e_1_2_2_38_1"},{"volume-title":"Proceedings of the 27th International Workshop on Languages and Compilers for Parallel Computing.","author":"Shrestha S.","key":"e_1_2_2_39_1"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751240"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810096"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2011.47"},{"volume-title":"Proceedings of the 2014 IEEE International Conference on Cluster Computing (CLUSTER\u201914)","author":"Suresh A.","key":"e_1_2_2_43_1"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989508"},{"key":"e_1_2_2_45_1","doi-asserted-by":"crossref","unstructured":"J. Treibig and G. Hager. 2010. Introducing a performance model for bandwidth-limited loop kernels. In Parallel Processing and Applied Mathematics Roman Wyrzykowski Jack Dongarra Konrad Karczewski and Jerzy Wasniewski (Eds.). Lecture Notes in Computer Science Vol. 6067. Springer Berlin 615--624.   J. Treibig and G. Hager. 2010. Introducing a performance model for bandwidth-limited loop kernels. In Parallel Processing and Applied Mathematics Roman Wyrzykowski Jack Dongarra Konrad Karczewski and Jerzy Wasniewski (Eds.). Lecture Notes in Computer Science Vol. 6067. Springer Berlin 615--624.","DOI":"10.1007\/978-3-642-14390-8_64"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2011.01.010"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMPSAC.2009.82"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626410000296"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.5555\/846234.849346"},{"volume-title":"Proceedings of the 3rd International Workshop on Polyhedral Compilation Techniques. 3--11","author":"Wonnacott D. G.","key":"e_1_2_2_51_1"},{"key":"e_1_2_2_52_1","unstructured":"X. Zhou. 2013. Tiling Optimizations for Stencil Computations. Ph.D. Dissertation. University of Illinois at Urbana-Champaign. Retrieved from http:\/\/polaris.cs.uiuc.edu\/ zhou53\/papers\/Xing_Zhou.pdf.  X. Zhou. 2013. Tiling Optimizations for Stencil Computations. Ph.D. Dissertation. University of Illinois at Urbana-Champaign. Retrieved from http:\/\/polaris.cs.uiuc.edu\/ zhou53\/papers\/Xing_Zhou.pdf."}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3155290","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3155290","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:26:28Z","timestamp":1750213588000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3155290"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,9,30]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2017,9,30]]}},"alternative-id":["10.1145\/3155290"],"URL":"https:\/\/doi.org\/10.1145\/3155290","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2017,9,30]]},"assertion":[{"value":"2016-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-12-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}