{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,2]],"date-time":"2026-07-02T23:41:10Z","timestamp":1783035670844,"version":"3.54.6"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2014,4,1]],"date-time":"2014-04-01T00:00:00Z","timestamp":1396310400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004837","name":"Ministerio de Ciencia e Innovaci\u00f3n","doi-asserted-by":"publisher","award":["TIN2007-60625"],"award-info":[{"award-number":["TIN2007-60625"]}],"id":[{"id":"10.13039\/501100004837","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004837","name":"Ministerio de Ciencia e Innovaci\u00f3n","doi-asserted-by":"publisher","award":["2009501052"],"award-info":[{"award-number":["2009501052"]}],"id":[{"id":"10.13039\/501100004837","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004963","name":"Seventh Framework Programme","doi-asserted-by":"publisher","award":["RI-211528"],"award-info":[{"award-number":["RI-211528"]}],"id":[{"id":"10.13039\/501100004963","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001943","name":"Partnership for Advanced Computing in Europe AISBL","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001943","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2014,4]]},"abstract":"<jats:p>Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). PDEs are the core of many simulations in different scientific fields, such as geophysics, astrophysics, etc. 
The typical FD solver performs stencil computations for the entire computational domain, thus solving the differential operators. In general terms, the stencil computation consists of a weighted accumulation of the contribution of neighbor points along the Cartesian axes. Therefore, optimizing stencil computations is crucial in reducing the application execution time.<\/jats:p>\n          <jats:p>\n            Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named\n            <jats:italic>Semi-stencil<\/jats:italic>\n            , that tackles these two problems. The main idea behind this algorithm is to change the way in which the stencil computation progresses within the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Then, each update gathers half of the axis neighbors, partially computing at the same time the stencil in a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations are completed on precomputed points. This computation strategy improves the memory access pattern and efficiently reuses the accessed data.\n          <\/jats:p>\n          <jats:p>Our initial target architecture was the Cell\/B.E., where the Semi-stencil in an SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with space- and time-blocking algorithms used in hierarchical memory architectures. 
Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG\/P) platforms are used as testbeds, where the best improvements for a 25-point stencil range from 1.27 to 1.76\u00d7 faster. The results show that this novel strategy is a feasible optimization method which may be integrated into auto-tuning frameworks. Also, since all current architectures are multicore based, we have introduced a brief section where scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems are presented. In a nutshell, the algorithm scales as well as or better than other stencil techniques. For instance, the scalability of Semi-stencil on MIC for a certain testcase reached 93.8 \u00d7 over 244 threads.<\/jats:p>","DOI":"10.1145\/2591006","type":"journal-article","created":{"date-parts":[[2014,4,22]],"date-time":"2014-04-22T13:37:45Z","timestamp":1398173865000},"page":"1-39","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["Algorithm 942"],"prefix":"10.1145","volume":"40","author":[{"given":"Ra\u00fal","family":"de la Cruz","sequence":"first","affiliation":[{"name":"Barcelona Supercomputing Center"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mauricio","family":"Araya-Polo","sequence":"additional","affiliation":[{"name":"Repsol USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2014,4]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/212094.212131"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevLett.101.096403"},{"key":"e_1_2_2_3_1","unstructured":"ANAG. 2012. Chombo software package for amr applications. 
Applied Numerical Algorithms Group (ANAG) Lawrence Berkeley National Laboratory Berkeley CA. http:\/\/seesar.lbl.gov\/anag\/software.html."},{"key":"e_1_2_2_4_1","first-page":"1","article-title":"3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors. Sci","volume":"17","author":"Araya-Polo Mauricio","year":"2008","journal-title":"Program. Cell Process."},{"key":"e_1_2_2_5_1","volume-title":"The Fluid Mechanics of Astrophysics and Geophysics","author":"Brandenburg Axel"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/106972.106979"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1002\/pssb.200642067"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/1413370.1413375"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1137\/070693199"},{"key":"e_1_2_2_10_1","volume-title":"Scientific Computing on Multicore and Accelerators","author":"Datta Kaushik"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2011.04.235"},{"key":"e_1_2_2_12_1","volume-title":"Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM\u201909)","volume":"6067","author":"de la Cruz 
Ra\u00fal","year":"2009"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1088149.1088197"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148109.1148157"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0218396X08003683"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1111583.1111589"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1178597.1178605"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jsv.2008.03.024"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/106975.106981"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.577265"},{"key":"e_1_2_2_21_1","unstructured":"John McCalpin and David Wonnacott. 1999. Time skewing: A value-based approach to optimizing for memory locality. Tech. rep. DCS-TR-379 Department of Computer Science Rutgers University. http:\/\/www.haverford.edu\/cmsc\/davew\/cache-opt\/cache-opt.html."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/233561.233564"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1002\/ima.1850010104"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(91)90014-Z"},{"key":"e_1_2_2_25_1","volume-title":"Proceedings of the Department of Defense HPCMP Users Group Conference. 7--10","author":"Mucci Philip J.","year":"1999"},{"key":"e_1_2_2_26_1","first-page":"2265","article-title":"3D frequency-domain finite-difference modeling of acoustic wave propagation using a massively parallel direct solver: A feasibility study","volume":"72","author":"Operto Stephane","year":"2006","journal-title":"SEG Tech. 
Program Expanded Abstracts"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/370049.370403"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/143371.143484"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/183018.183047"},{"key":"e_1_2_2_30_1","volume-title":"Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM\u201909)","volume":"6067","author":"Treibig Jan","year":"2009"},{"key":"e_1_2_2_31_1","volume-title":"High Performance Computing in Science and Engineering","author":"Treibig Jan"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2010.38"},{"key":"e_1_2_2_33_1","unstructured":"Samuel Webb Williams Andrew Waterman and David A. Patterson. 2008. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Tech. rep. UCB\/EECS-2008-134 EECS Department University of California Berkeley. 
http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2008\/EECS-2008-134.html."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/645677.663799"}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2591006","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2591006","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T20:01:13Z","timestamp":1750276873000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2591006"}},"subtitle":["Semi-Stencil"],"short-title":[],"issued":{"date-parts":[[2014,4]]},"references-count":34,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2014,4]]}},"alternative-id":["10.1145\/2591006"],"URL":"https:\/\/doi.org\/10.1145\/2591006","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"value":"0098-3500","type":"print"},{"value":"1557-7295","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,4]]},"assertion":[{"value":"2012-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-04-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}