{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,16]],"date-time":"2026-02-16T20:48:04Z","timestamp":1771274884909,"version":"3.50.1"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,5,26]],"date-time":"2022-05-26T00:00:00Z","timestamp":1653523200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100005416","name":"Research Council of Norway","doi-asserted-by":"crossref","award":["251186"],"award-info":[{"award-number":["251186"]}],"id":[{"id":"10.13039\/501100005416","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100005416","name":"Research Council of Norway","doi-asserted-by":"crossref","award":["270053"],"award-info":[{"award-number":["270053"]}],"id":[{"id":"10.13039\/501100005416","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2022,6,30]]},"abstract":"<jats:p>Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.<\/jats:p>","DOI":"10.1145\/3503925","type":"journal-article","created":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T22:28:24Z","timestamp":1646432904000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs"],"prefix":"10.1145","volume":"48","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4498-020X","authenticated-orcid":false,"given":"James D.","family":"Trotter","sequence":"first","affiliation":[{"name":"Simula Research Laboratory and University of Oslo, Blindern, Oslo, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xing","family":"Cai","sequence":"additional","affiliation":[{"name":"Simula Research Laboratory and University of Oslo, Blindern, Oslo, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Simon W.","family":"Funke","sequence":"additional","affiliation":[{"name":"Simula Research Laboratory, Simula Research Laboratory, Lysaker, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,5,26]]},"reference":[{"key":"e_1_3_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1644001.1644007"},{"key":"e_1_3_1_3_1","unstructured":"Aneurisk-Team. 2012. AneuriskWeb Project Website. (June 2012). Retrieved June 9 2020 from http:\/\/ecm2.mathcs.emory.edu\/aneuriskweb."},{"key":"e_1_3_1_4_1","doi-asserted-by":"publisher","DOI":"10.1002\/fld.3909"},{"key":"e_1_3_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10915-010-9396-8"},{"key":"e_1_3_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2010.08.012"},{"key":"e_1_3_1_7_1","doi-asserted-by":"publisher","DOI":"10.1002\/cnm.1630040303"},{"key":"e_1_3_1_8_1","doi-asserted-by":"publisher","DOI":"10.1002\/nme.2989"},{"key":"e_1_3_1_9_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898719208"},{"key":"e_1_3_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/800195.805928"},{"key":"e_1_3_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-4355-5"},{"key":"e_1_3_1_12_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342020915762"},{"key":"e_1_3_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2071379.2071383"},{"key":"e_1_3_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cam.2013.09.001"},{"key":"e_1_3_1_15_1","doi-asserted-by":"publisher","DOI":"10.1137\/0710032"},{"key":"e_1_3_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0064460"},{"key":"e_1_3_1_17_1","unstructured":"Xu Guo. 2019. Best Practice Guide - AMD EPYC. (Feb. 2019). Retrieved June 9 2020 from https:\/\/prace-ri.eu\/wp-content\/uploads\/Best-Practice-Guide_AMD.pdf."},{"key":"e_1_3_1_18_1","doi-asserted-by":"publisher","DOI":"10.1137\/17M1130642"},{"key":"e_1_3_1_19_1","unstructured":"Intel Corporation. 2018. Intel \u00ae 64 and IA-32 Architectures Optimization Reference Manual. (April 2018). Retrieved June 9 2020 from https:\/\/software.intel.com\/content\/dam\/develop\/public\/us\/en\/documents\/64-ia-32-architectures-optimization-manual.pdf."},{"key":"e_1_3_1_20_1","doi-asserted-by":"publisher","DOI":"10.1161\/JAHA.114.001399"},{"key":"e_1_3_1_21_1","doi-asserted-by":"publisher","DOI":"10.1137\/040607824"},{"key":"e_1_3_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1163641.1163644"},{"key":"e_1_3_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2427023.2427027"},{"key":"e_1_3_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2012.04.012"},{"key":"e_1_3_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3325864"},{"key":"e_1_3_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23099-8"},{"key":"e_1_3_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3054944"},{"key":"e_1_3_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2687415"},{"key":"e_1_3_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-52718-5_12"},{"key":"e_1_3_1_30_1","doi-asserted-by":"publisher","DOI":"10.1002\/fld.3648"},{"key":"e_1_3_1_31_1","unstructured":"John D. McCalpin. 2013. STREAM: Sustainable Memory Bandwidth in High Performance Computers. (Jan. 2013). Retrieved June 9 2020 from https:\/\/www.cs.virginia.edu\/stream\/."},{"key":"e_1_3_1_32_1","doi-asserted-by":"publisher","DOI":"10.1137\/19M1246523"},{"key":"e_1_3_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1644001.1644009"},{"key":"e_1_3_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2998441"},{"key":"e_1_3_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-013-0301-6"},{"key":"e_1_3_1_36_1","doi-asserted-by":"publisher","DOI":"10.1137\/08073901X"},{"key":"e_1_3_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491491.2491496"},{"key":"e_1_3_1_38_1","unstructured":"David Schor. 2018. A Look at Cavium\u2019s New High-Performance ARM Microprocessors and the Isambard Supercomputer. (June 2018). Retrieved May 22 2020 from https:\/\/fuse.wikichip.org\/news\/1316\/a-look-at-caviums-new-high-performance-arm-microprocessors-and-the-isambard-supercomputer\/."},{"key":"e_1_3_1_39_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342020945005"},{"key":"e_1_3_1_40_1","volume-title":"Proceedings of the International Workshop on Super Visualization","author":"Tchiboukdjian Marc","year":"2008","unstructured":"Marc Tchiboukdjian, Vincent Danjean, and Bruno Raffin. 2008. A fast cache-oblivious mesh layout with theoretical guarantees. In Proceedings of the International Workshop on Super Visualization. Kos, Greece. Retrieved from https:\/\/hal.inria.fr\/inria-00436053."},{"key":"e_1_3_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2010.19"},{"key":"e_1_3_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2010.03.031"},{"key":"e_1_3_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2006.162"}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503925","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503925","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:32Z","timestamp":1750188632000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503925"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,26]]},"references-count":42,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,6,30]]}},"alternative-id":["10.1145\/3503925"],"URL":"https:\/\/doi.org\/10.1145\/3503925","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"value":"0098-3500","type":"print"},{"value":"1557-7295","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,26]]},"assertion":[{"value":"2020-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-05-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}