{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T17:21:21Z","timestamp":1773336081745,"version":"3.50.1"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2016,8,16]],"date-time":"2016-08-16T00:00:00Z","timestamp":1471305600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2017,6,30]]},"abstract":"<jats:p>\n            The solution of a constant-coefficient elliptic Partial Differential Equation (PDE) can be computed using an integral transform: A convolution with the fundamental solution of the PDE, also known as a volume potential. We present a Fast Multipole Method (FMM) for computing volume potentials and use them to construct spatially adaptive solvers for the Poisson, Stokes, and low-frequency Helmholtz problems. Conventional N-body methods apply to discrete particle interactions. With volume potentials, one replaces the sums with volume integrals. Particle N-body methods can be used to accelerate such integrals. but it is more efficient to develop a special FMM. In this article, we discuss the efficient implementation of such an FMM. We use high-order piecewise Chebyshev polynomials and an octree data structure to represent the input and output fields and enable spectrally accurate approximation of the near-field and the Kernel Independent FMM (KIFMM) for the far-field approximation. For distributed-memory parallelism, we use space-filling curves, locally essential trees, and a hypercube-like communication scheme developed previously in our group. We present new near and far interaction traversals that optimize cache usage. Also, unlike particle N-body codes, we need a 2:1 balanced tree to allow for precomputations. We present a fast scheme for 2:1 balancing. Finally, we use vectorization, including the AVX instruction set on the Intel Sandy Bridge architecture to get better than 50% of peak floating-point performance. We use task parallelism to employ the Xeon Phi on the Stampede platform at the Texas Advanced Computing Center (TACC). We achieve about 600\n            <jats:sc>gflop<\/jats:sc>\n            \/s of double-precision performance on a single node. Our largest run on Stampede took 3.5s on 16K cores for a problem with 18\n            <jats:sc>e<\/jats:sc>\n            +9 unknowns for a highly nonuniform particle distribution (corresponding to an effective resolution exceeding 3\n            <jats:sc>e<\/jats:sc>\n            +23 unknowns since we used 23 levels in our octree).\n          <\/jats:p>","DOI":"10.1145\/2898349","type":"journal-article","created":{"date-parts":[[2016,8,16]],"date-time":"2016-08-16T12:14:20Z","timestamp":1471349660000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Algorithm 967"],"prefix":"10.1145","volume":"43","author":[{"given":"Dhairya","family":"Malhotra","sequence":"first","affiliation":[{"name":"The University of Texas at Austin, Austin, TX"}]},{"given":"George","family":"Biros","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, TX"}]}],"member":"320","published-online":{"date-parts":[[2016,8,16]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1086\/345538"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1093\/gji\/ggs070"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1137\/100791634"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.19"},{"key":"e_1_2_2_5_1","volume-title":"IEEE Conference Record-Abstracts of the IEEE International Conference on Plasma Science (ICOPS'05)","author":"Christlieb A. J."},{"key":"e_1_2_2_6_1","volume-title":"Inverse Acoustic and Electromagnetic Scattering Theory","author":"Colton David","edition":"2"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511526442"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.2001.6961"},{"key":"e_1_2_2_9_1","unstructured":"M. Doi and S. F. Edwards. 1988. The Theory of Polymer Dynamics. Vol. 73. Oxford University Press Oxford UK.  M. Doi and S. F. Edwards. 1988. The Theory of Polymer Dynamics. Vol. 73. Oxford University Press Oxford UK."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.23"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1137\/0719090"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1137\/S1064827500369967"},{"key":"e_1_2_2_13_1","volume-title":"Journal of Physics: Conference Series","volume":"78","author":"Fann G. I.","year":"2018"},{"key":"e_1_2_2_14_1","volume-title":"Parallel Computing: Design and Analysis of Algorithms","author":"Grama A.","year":"2003"},{"key":"e_1_2_2_15_1","volume-title":"Finite Element for Viscous Incompressible Flows","author":"Gunzburger Max D."},{"key":"e_1_2_2_16_1","volume-title":"Nicolaides","author":"Gunzburger Max D.","year":"1993"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654123"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1063\/1.1791051"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063432"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.47"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.49"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1086\/313015"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.2140\/camcos.2011.6.79"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2160718.2160740"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.2001.6862"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10915-004-4787-3"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.4208\/cicp.020215.150515sw"},{"key":"e_1_2_2_28_1","volume-title":"Baden","author":"McCorquodale Peter","year":"2006"},{"key":"e_1_2_2_29_1","volume-title":"Optimized M2L kernels for the Chebyshev interpolation based fast multipole method. arXiv preprint arXiv:1210.7292","author":"Messner Matthias","year":"2012"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389056"},{"key":"e_1_2_2_31_1","volume-title":"Proceedings of Supercomputing (The SCxy Conference Series). ACM\/IEEE","author":"Phillips James C."},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/43.662670"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.2000.6582"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2005.11.017"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1384-1076(01)00042-2"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389055"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465442"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1137\/070681727"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1002\/nme.3240"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/169627.169640"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2003.11.021"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2006.03.021"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011429952"},{"key":"e_1_2_2_44_1","doi-asserted-by":"crossref","unstructured":"R. Yokota J. P. Bardhan M. G. Knepley L. A. Barba and T. Hamada. 2011. Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUS and a billion unknowns. Computer Physics Communications (2011).  R. Yokota J. P. Bardhan M. G. Knepley L. A. Barba and T. Hamada. 2011. Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUS and a billion unknowns. Computer Physics Communications (2011).","DOI":"10.1016\/j.cpc.2011.02.013"}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2898349","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2898349","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:56:29Z","timestamp":1750222589000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2898349"}},"subtitle":["A Distributed-Memory Fast Multipole Method for Volume Potentials"],"short-title":[],"issued":{"date-parts":[[2016,8,16]]},"references-count":44,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2017,6,30]]}},"alternative-id":["10.1145\/2898349"],"URL":"https:\/\/doi.org\/10.1145\/2898349","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"value":"0098-3500","type":"print"},{"value":"1557-7295","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,8,16]]},"assertion":[{"value":"2014-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-08-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}