{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,9,14]],"date-time":"2023-09-14T06:40:44Z","timestamp":1694673644289},"reference-count":42,"publisher":"Wiley","issue":"10","license":[{"start":{"date-parts":[[2012,10,23]],"date-time":"2012-10-23T00:00:00Z","timestamp":1350950400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2013,7]]},"abstract":"<jats:title>SUMMARY<\/jats:title><jats:p>Performance improvements in biomolecular simulations based on molecular dynamics (MD) codes are widely desired. Unfortunately, the factors, which allowed past performance improvements, particularly the microprocessor clock frequencies, are no longer increasing. Hence, novel software and hardware solutions are being explored for accelerating performance of widely used MD codes. In this paper, we describe our efforts on porting, optimizing and tuning of Large\u2010scale Atomic\/Molecular Massively Parallel Simulator, a popular MD framework, on heterogeneous architectures: multi\u2010core processors with graphical processing unit (GPU) accelerators. Our implementation is based on accelerating the most computationally expensive non\u2010bonded interaction terms on the GPUs and overlapping the computation on the CPU and GPUs. This functionality is built on top of message passing interface that allows multi\u2010level parallelism to be extracted even at the workstation level with the multi\u2010core CPUs and allows extension of the implementation on GPU\u2010enabled clusters. We hypothesize that the optimal benefit of heterogeneous architectures for applications will come by utilizing all possible resources (for example, CPU\u2010cores and GPU devices on GPU\u2010enabled clusters). Benchmarks for a range of biomolecular system sizes are provided, and an analysis is performed on four generations of NVIDIA's GPU devices. On GPU\u2010enabled Linux clusters, by overlapping and pipelining computation and communication, we observe up to 10\u2010folds application acceleration in multi\u2010core and multi\u2010GPU environments illustrating significant performance improvements. Detailed analysis of the implementation is presented that allows identification of bottlenecks in algorithm, indicating that code optimization and improvements on GPUs could allow microsecond scale simulation throughput on workstations and inexpensive GPU clusters, putting widely desired biologically relevant simulation time\u2010scales within reach of a large user community. In order to systematically optimize simulation throughput and to enable performance prediction, we have developed a parameterized performance model that will allow developers and users to explore the performance potential of future heterogeneous systems for biological simulations. Copyright \u00a9 2012 John Wiley &amp; Sons, Ltd.<\/jats:p>","DOI":"10.1002\/cpe.2943","type":"journal-article","created":{"date-parts":[[2012,10,23]],"date-time":"2012-10-23T08:11:16Z","timestamp":1350979876000},"page":"1356-1375","source":"Crossref","is-referenced-by-count":15,"title":["Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures"],"prefix":"10.1002","volume":"25","author":[{"given":"Pratul K.","family":"Agarwal","sequence":"first","affiliation":[{"name":"Oak Ridge National Laboratory  Oak Ridge Tennessee USA"}]},{"given":"Scott","family":"Hampton","sequence":"additional","affiliation":[]},{"given":"Jeffrey","family":"Poznanovic","sequence":"additional","affiliation":[{"name":"Swiss National Supercomputing Center  Manno Switzerland"}]},{"given":"Arvind","family":"Ramanthan","sequence":"additional","affiliation":[{"name":"Oak Ridge National Laboratory  Oak Ridge Tennessee USA"}]},{"given":"Sadaf R.","family":"Alam","sequence":"additional","affiliation":[{"name":"Swiss National Supercomputing Center  Manno Switzerland"}]},{"given":"Paul S.","family":"Crozier","sequence":"additional","affiliation":[{"name":"Sandia National Laboratories  Albuquerque New Mexico USA"}]}],"member":"311","published-online":{"date-parts":[[2012,10,23]]},"reference":[{"key":"e_1_2_7_2_1","unstructured":"CAPS HMPP workbench. Available at:http:\/\/www.caps\u2010entreprise.com[10 October 2011]."},{"key":"e_1_2_7_3_1","unstructured":"GPGPU developer resources. Available at:http:\/\/gpgpu.org\/developer[10 October 2011]."},{"key":"e_1_2_7_4_1","unstructured":"LAMMPS molecular developer simulator. Available at:http:\/\/lammps.sandia.gov\/[10 October 2011]."},{"key":"e_1_2_7_5_1","unstructured":"NVIDIA Fermi architecture white paper. Available at:http:\/\/www.nvidia.com\/object\/fermi_architecture.html[10 October 2011]."},{"key":"e_1_2_7_6_1","unstructured":"NVIDIA GPU computing developer home page. Available at:http:\/\/developer.nvidia.com\/object\/gpucomputing.html[10 October 2011]."},{"key":"e_1_2_7_7_1","unstructured":"OpenCL\u2014The open standard for parallel programming of heterogeneous systems. Available at:http:\/\/www.khronos.org\/opencl\/[10 October 2011]."},{"key":"e_1_2_7_8_1","unstructured":"PGI accelerator compilers. Available at:http:\/\/www.pgroup.com\/resources\/accel.htm[10 October 2011]."},{"key":"e_1_2_7_9_1","unstructured":"Top 500 supercomputers list (Nov. 2010). Available at:http:\/\/www.top500.org\/[10 October 2011]."},{"key":"e_1_2_7_10_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/46\/1\/046"},{"key":"e_1_2_7_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2007.108"},{"key":"e_1_2_7_12_1","doi-asserted-by":"crossref","unstructured":"AlamSR KuehnJA BarrettRF LarkinJM FaheyMR SankaranR Worley PH. Cray XT4: an early evaluation for petascale scientific simulation. Proceedings of the 2007 ACM\/IEEE conference on Supercomputing (SC '07). ACM New York NY USA. DOI=10.1145\/1362622.1362675","DOI":"10.1145\/1362622.1362675"},{"key":"e_1_2_7_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2008.01.047"},{"key":"e_1_2_7_14_1","doi-asserted-by":"crossref","unstructured":"BaghsorkhiSS DelahayeM PatelSJ GroppWD HwuW.An adaptive performance modeling tool for GPU architectures.15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010) Bangalore India. DOI=10.1145\/1693453.1693470","DOI":"10.1145\/1693453.1693470"},{"key":"e_1_2_7_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-87744-8_21"},{"key":"e_1_2_7_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-01970-8_90"},{"key":"e_1_2_7_17_1","unstructured":"ChoiJW SinghA VuducRW.Model\u2010driven autotuning of sparse matrix\u2013vector multiply on GPUs. Proceedings of ACM SIGPLAN Symposium Principles and Practice of Parallel Programming (PPoPP) Bangalore India. DOI=10.1145\/1693453.1693471"},{"key":"e_1_2_7_18_1","unstructured":"ChowE RendlemanCA BowersKJ DrorRO HughesDH GullingsrudJ SacerdotiFD ShawDE.Desmond performance on a cluster of multicore processors. D. E. Shaw Research Technical Report DESRES\/TR\u2010\u20102008\u201001 2008. Available at:http:\/\/www.deshawresearch.com\/publications\/Desmond%20Performance%20on%20a%20Cluster%20of%20Multicore%20Processors.pdf. [10 October 2011]."},{"key":"e_1_2_7_19_1","doi-asserted-by":"publisher","DOI":"10.1063\/1.464397"},{"key":"e_1_2_7_20_1","doi-asserted-by":"publisher","DOI":"10.1002\/jcc.21209"},{"key":"e_1_2_7_21_1","doi-asserted-by":"crossref","unstructured":"GlosliJN RichardsDF CaspersenKJ RuddRE GunnelsJA StreitzFH.Extending stability beyond CPU millennium: a micron\u2010scale atomistic simulation of Kelvin\u2010Helmholtz instability. Proceedings of the 2007 ACM\/IEEE conference on Supercomputing (SC '07). ACM New York NY USA. DOI=10.1145\/1362622.1362700","DOI":"10.1145\/1362622.1362700"},{"key":"e_1_2_7_22_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2009.5429067"},{"key":"e_1_2_7_23_1","doi-asserted-by":"crossref","unstructured":"HamptonSS AgarwalPK AlamSR CrozierPS.Towards microsecond biological molecular dynamics simulations on hybrid processors. Proceedings of International Conference on High Performance Computing and Simulation (HPCS) 2010. DOI=10.1109\/HPCS.2010.5547149","DOI":"10.1109\/HPCS.2010.5547149"},{"key":"e_1_2_7_24_1","doi-asserted-by":"crossref","unstructured":"HamptonSS AlamSR CrozierPS AgarwalPK.Optimal utilization of heterogeneous resources for biomolecular simulations. Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing Networking Storage and Analysis (SC'10) 2010. ACM New York NY USA. DOI=10.1109\/SC.2010.37","DOI":"10.1109\/SC.2010.37"},{"key":"e_1_2_7_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.12.005"},{"key":"e_1_2_7_26_1","doi-asserted-by":"publisher","DOI":"10.1021\/ct900275y"},{"key":"e_1_2_7_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2008.143"},{"key":"e_1_2_7_28_1","doi-asserted-by":"crossref","unstructured":"HwuW. RyooS Sain\u2010ZeeU KelmJH GeladoI StoneSS KiddRE BaghsorkhiSS MahesriAA TsaoSC NavarroN LumettaSS FrankMI PatelSJ.Implicitly parallel programming models for thousand\u2010core microprocessors. Proceedings of the Design Automation Conference 2007. DAC '07. 44th ACM\/IEEE 754\u2010759","DOI":"10.1145\/1278480.1278669"},{"key":"e_1_2_7_29_1","unstructured":"JangB KaeliD SynhoDo PienH.Multi GPU implementation of iterative tomographic reconstruction algorithms. IEEE International Symposium on Biomedical Imaging: From Nano to Macro 2009. ISBI '09. DOI=10.1109\/ISBI.2009.5193014"},{"key":"e_1_2_7_30_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.521.0199"},{"key":"e_1_2_7_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_7_32_1","unstructured":"NallamuthuA SmithMC HamptonSS AgarwalPK AlamSR.Energy efficient biomolecular simulations with FPGA\u2010based reconfigurable computing. Proceedings of the 7th ACM international conference on Computing frontiers (CF '10). ACM New York NY USA 83\u201384. DOI=10.1145\/1787275.1787294"},{"key":"e_1_2_7_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2008.917757"},{"key":"e_1_2_7_34_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.1995.1039"},{"key":"e_1_2_7_35_1","unstructured":"ShawDE DeneroffMM DrorRO KuskinJS LarsonRH SalmonJK YoungC BatsonB BowersKJ ChaoJC EastwoodMP GagliardoJ GrossmanJP HoCR IerardiDJ KolossvaryI KlepeisJL LaymanT McLeaveyC MoraesMA MuellerR PriestEC ShanY SpenglerJ TheobaldM TowlesB WangSC.Anton a special\u2010purpose machine for molecular dynamics simulation. Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07). ACM New York NY USA. DOI=10.1145\/1250662.1250664"},{"key":"e_1_2_7_36_1","doi-asserted-by":"publisher","DOI":"10.1002\/jcc.20829"},{"key":"e_1_2_7_37_1","unstructured":"SumanthJV Swansom DR JiangH.Performance and cost effectiveness of a cluster of workstations and MD\u2010GRAPE 2 for MD simulations. Proceedings Second International Symposium on Parallel and Distributed Computing 2003. DOI=10.1109\/ISPDC.2003.1267670"},{"key":"e_1_2_7_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2009.06.009"},{"key":"e_1_2_7_39_1","doi-asserted-by":"crossref","first-page":"1317","DOI":"10.1002\/cpe.894","article-title":"Practical performance portability in the Parallel Ocean Program (POP)","volume":"17","author":"Jones PW","year":"2003","journal-title":"Concurrency and Computation: Practice and Experience"},{"key":"e_1_2_7_40_1","doi-asserted-by":"crossref","unstructured":"SnavelyA CarringtonL. WolterN LabartaJ BadiaR PurkayasthaA.A framework for performance modeling and prediction. Proceedings of the 2002 ACM\/IEEE International Conference for High Performance Computing Networking Storage and Analysis (SC'02) 2002. ACM New York NY USA. DOI=10.1109\/SC.2002.10004","DOI":"10.1109\/SC.2002.10004"},{"key":"e_1_2_7_41_1","doi-asserted-by":"crossref","unstructured":"KerbysonDJ AlmeHJ HoisieA PetriniF WassermanHJ GittingsM.Predictive performance and scalability modeling of a large\u2010scale application. Proceedings of the 2001 ACM\/IEEE International Conference for High Performance Computing Networking Storage and Analysis (SC'01) 2001. ACM New York NY USA. DOI=10.1145\/582034.582071","DOI":"10.1145\/582034.582071"},{"key":"e_1_2_7_42_1","unstructured":"ScottS AbtsD KimJ DallyWJ.The blackwidow high\u2010radix clos network. Proceedings 33rd International Symposium on Computer Architecture 2006. ISCA '06. DOI=10.1109\/ISCA.2006.40"},{"key":"e_1_2_7_43_1","unstructured":"YangLT MaX MuellerF.Cross\u2010platform performance prediction of parallel applications using partial execution. Proceedings of the 2005 ACM\/IEEE International Conference for High Performance Computing Networking Storage and Analysis (SC'05) 2005. ACM New York NY USA. DOI=10.1109\/SC.2005.20"}],"container-title":["Concurrency and Computation: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.2943","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.2943","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T22:45:51Z","timestamp":1694645151000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.2943"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,10,23]]},"references-count":42,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2013,7]]}},"alternative-id":["10.1002\/cpe.2943"],"URL":"http:\/\/dx.doi.org\/10.1002\/cpe.2943","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"value":"1532-0626","type":"print"},{"value":"1532-0634","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,10,23]]}}}