{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T01:40:10Z","timestamp":1778550010817,"version":"3.51.4"},"reference-count":60,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2017,11,30]],"date-time":"2017-11-30T00:00:00Z","timestamp":1512000000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computation"],"abstract":"<jats:p>Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utilize the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develop a holistic implementation for large-scale CPU\/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with an implementation using all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability behavior making it future-proof for heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. 
Parallel efficiencies of more than 90% are achieved leading to 2604.72 GLUPS utilizing 24,576 CPU cores and 2048 GPUs of the CPU\/GPU heterogeneous cluster Piz Daint and computing more than 6.8 \u00d7 10^9 lattice cells.<\/jats:p>","DOI":"10.3390\/computation5040048","type":"journal-article","created":{"date-parts":[[2017,11,30]],"date-time":"2017-11-30T12:02:51Z","timestamp":1512043371000},"page":"48","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":31,"title":["A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU\/GPU Heterogeneous Clusters"],"prefix":"10.3390","volume":"5","author":[{"given":"Christoph","family":"Riesinger","sequence":"first","affiliation":[{"name":"Department of Informatics, Technical University of Munich, 85748 Garching, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Arash","family":"Bakhtiari","sequence":"additional","affiliation":[{"name":"Department of Informatics, Technical University of Munich, 85748 Garching, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Martin","family":"Schreiber","sequence":"additional","affiliation":[{"name":"Department of Computer Science\/Mathematics, University of Exeter, Exeter EX4 4QF, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Philipp","family":"Neumann","sequence":"additional","affiliation":[{"name":"Scientific Computing, University of Hamburg, 20146 Hamburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hans-Joachim","family":"Bungartz","sequence":"additional","affiliation":[{"name":"Department of Informatics, Technical University of Munich, 85748 Garching, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2017,11,30]]},"reference":[{"key":"ref_1","unstructured":"(2017, October 16). PEZY Computing. 
Available online: http:\/\/pezy.jp\/."},{"key":"ref_2","unstructured":"TOP500.org. (2017, October 16). Top500 List\u2014November 2017. Available online: https:\/\/www.top500.org\/list\/2017\/11\/."},{"key":"ref_3","unstructured":"Riesinger, C., Bakhtiari, A., and Schreiber, M. (2017, October 16). Available online: https:\/\/gitlab.com\/christoph.riesinger\/lbm\/."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"910","DOI":"10.1016\/j.compfluid.2005.02.008","article-title":"On the single processor performance of simple lattice Boltzmann kernels","volume":"35","author":"Wellein","year":"2006","journal-title":"Comput. Fluids"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1080\/10618560802238275","article-title":"TeraFLOP computing on a desktop PC with GPUs for 3D CFD","volume":"22","author":"Krafczyk","year":"2008","journal-title":"Int. J. Comput. Fluid Dyn."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Bailey, P., Myre, J., Walsh, S.D.C., Lilja, D.J., and Saar, M.O. (2009, January 22\u201325). Accelerating lattice boltzmann fluid flow simulations using graphics processors. Proceedings of the International Conference on Parallel Processing, Vienna, Austria.","DOI":"10.1109\/ICPP.2009.38"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2380","DOI":"10.1016\/j.camwa.2009.08.052","article-title":"LBM based flow simulation using GPU computing processor","volume":"59","author":"Kuznik","year":"2010","journal-title":"Comput. Math. Appl."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"3628","DOI":"10.1016\/j.camwa.2010.01.054","article-title":"A new approach to the lattice Boltzmann method for graphics processing units","volume":"61","author":"Obrecht","year":"2011","journal-title":"Comput. Math. 
Appl."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1016\/j.simpat.2012.03.004","article-title":"A Lattice-Boltzmann solver for 3D fluid simulation on GPU","volume":"25","author":"Rinaldi","year":"2012","journal-title":"Simul. Model. Pract. Theory"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1016\/j.compfluid.2012.02.013","article-title":"Performance engineering for the lattice Boltzmann method on GPGPUs: Architectural requirements and performance results","volume":"80","author":"Habich","year":"2013","journal-title":"Comput. Fluids"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1016\/j.camwa.2011.02.020","article-title":"Multi-GPU implementation of the lattice Boltzmann method","volume":"65","author":"Obrecht","year":"2013","journal-title":"Comput. Math. Appl."},{"key":"ref_12","first-page":"521","article-title":"Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster","volume":"37","author":"Wang","year":"2011","journal-title":"Parallel Comput."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Calore, E., Marchi, D., Schifano, S.F., and Tripiccione, R. (2015, January 20\u201324). Optimizing communications in multi-GPU Lattice Boltzmann simulations. 
Proceedings of the 2015 International Conference on High Performance Computing & Simulation (HPCS), Amsterdam, The Netherlands.","DOI":"10.1109\/HPCSim.2015.7237021"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"536","DOI":"10.1016\/j.parco.2011.03.005","article-title":"A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters","volume":"37","author":"Feichtinger","year":"2011","journal-title":"Parallel Comput."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"114","DOI":"10.1016\/j.compfluid.2014.06.002","article-title":"Parallel computation of Entropic Lattice Boltzmann method on hybrid CPU\u2013GPU accelerated system","volume":"110","author":"Ye","year":"2015","journal-title":"Comput. Fluids"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Shimokawabe, T., Aoki, T., Takaki, T., Yamanaka, A., Nukada, A., Endo, T., Maruyama, N., and Matsuoka, S. (2011, January 12\u201318). Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis\u2014SC \u201911, Seattle, WA, USA.","DOI":"10.1145\/2063384.2063388"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"707","DOI":"10.1007\/s11434-011-4908-y","article-title":"Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units","volume":"57","author":"Xiong","year":"2012","journal-title":"Chin. Sci. Bull."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.parco.2016.08.005","article-title":"Massively parallel lattice\u2013Boltzmann codes on large GPU clusters","volume":"58","author":"Calore","year":"2016","journal-title":"Parallel Comput."},{"key":"ref_19","unstructured":"Riesinger, C. (2017). Scalable Scientific Computing Applications for GPU-Accelerated Heterogeneous Systems. [Ph.D. 
Thesis, Technische Universit\u00e4t M\u00fcnchen]."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"984","DOI":"10.1016\/j.procs.2011.04.104","article-title":"Free-Surface Lattice-Boltzmann Simulation on Many-Core Architectures","volume":"4","author":"Schreiber","year":"2011","journal-title":"Procedia Comput. Sci."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"444","DOI":"10.1007\/s00371-003-0210-6","article-title":"Implementing lattice Boltzmann computation on graphics hardware","volume":"19","author":"Li","year":"2003","journal-title":"Vis. Comput."},{"key":"ref_22","unstructured":"Zhe, F., Feng, Q., Kaufman, A., and Yoakum-Stover, S. (September, January 31). GPU Cluster for High Performance Computing. Proceedings of the ACM\/IEEE SC2004 Conference, New Orleans, LA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"354","DOI":"10.3390\/computation3030354","article-title":"Validation of the GPU-Accelerated CFD Solver ELBE for Free Surface Flow Problems in Civil and Environmental Engineering","volume":"3","author":"Mierke","year":"2015","journal-title":"Computation"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Di Martino, B., Kranzlm\u00fcller, D., and Dongarra, J.J. (2005, January 18\u201321). Nesting OpenMP in MPI to Implement a Hybrid Communication Method of Parallel Simulated Annealing on a Cluster of SMP Nodes. Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 12th European PVM\/MPI Users\u2019 Group Meeting, Sorrento, Italy.","DOI":"10.1007\/11557265"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Rabenseifner, R., Hager, G., and Jost, G. (2009, January 18\u201320). Hybrid MPI\/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes. Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based, Weimar, Germany.","DOI":"10.1109\/PDP.2009.43"},{"key":"ref_26","unstructured":"Linxweiler, J. 
(2011). Ein Integrierter Softwareansatz zur Interaktiven Exploration und Steuerung von Str\u00f6mungssimulationen auf Many-Core-Architekturen. [Ph.D. Thesis, Technische Universit\u00e4t Braunschweig]."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Valero-Lara, P., and Jansson, J. (2015, January 8\u201311). LBM-HPC - An Open-Source Tool for Fluid Simulations. Case Study: Unified Parallel C (UPC-PGAS). Proceedings of the 2015 IEEE International Conference on Cluster Computing, Chicago, IL, USA.","DOI":"10.1109\/CLUSTER.2015.52"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Calore, E., Gabbana, A., Schifano, S.F., and Tripiccione, R. (2017). Optimization of lattice Boltzmann simulations on heterogeneous computers. Int. J. High Perform. Comput. Appl.","DOI":"10.1177\/1094342017703771"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1016\/j.jocs.2015.07.002","article-title":"Accelerating fluid\u2013solid simulations (Lattice-Boltzmann & Immersed-Boundary) on heterogeneous architectures","volume":"10","author":"Igual","year":"2015","journal-title":"J. Comput. Sci."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"e3919","DOI":"10.1002\/cpe.3919","article-title":"Heterogeneous CPU+GPU approaches for mesh refinement over Lattice-Boltzmann simulations","volume":"29","author":"Jansson","year":"2017","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Shimokawabe, T., Endo, T., Onodera, N., and Aoki, T. (2017, January 5\u20138). A Stencil Framework to Realize Large-Scale Computations Beyond Device Memory Capacity on GPU Supercomputers. 
Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA.","DOI":"10.1109\/CLUSTER.2017.97"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"6811","DOI":"10.1103\/PhysRevE.56.6811","article-title":"Theory of the lattice Boltzmann method: From the Boltzmann equation to the lattice Boltzmann equation","volume":"56","author":"He","year":"1997","journal-title":"Phys. Rev. E"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"329","DOI":"10.1146\/annurev.fluid.30.1.329","article-title":"Lattice Boltzmann Method for Fluid Flows","volume":"30","author":"Chen","year":"1998","journal-title":"Annu. Rev. Fluid Mech."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Wolf-Gladrow, D.A. (2000). Lattice-Gas Cellular Automata and Lattice Boltzmann Models\u2014An Introduction, Springer.","DOI":"10.1007\/b72010"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"439","DOI":"10.1146\/annurev-fluid-121108-145519","article-title":"Lattice-Boltzmann Method for Complex Flows","volume":"42","author":"Aidun","year":"2010","journal-title":"Annu. Rev. Fluid Mech."},{"key":"ref_36","unstructured":"Succi, S. (2013). The Lattice Boltzmann Equation: for Fluid Dynamics and Beyond, Oxford University Press."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Kr\u00fcger, T., Kusumaatmaja, H., Kuzmin, A., Shardt, O., Silva, G., and Viggen, E.M. (2017). The Lattice Boltzmann Method: Principles and Practice; Graduate Texts in Physics, Springer International Publishing.","DOI":"10.1007\/978-3-319-44649-3"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"927","DOI":"10.1023\/B:JOSS.0000015179.12689.e4","article-title":"Lattice Boltzmann Model for the Incompressible Navier\u2013Stokes Equation","volume":"88","author":"He","year":"1997","journal-title":"J. Stat. 
Phys."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"798","DOI":"10.1209\/epl\/i2003-00496-6","article-title":"Minimal entropic kinetic models for hydrodynamics","volume":"63","author":"Ansumali","year":"2003","journal-title":"Europhys. Lett."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"511","DOI":"10.1103\/PhysRev.94.511","article-title":"A Model for Collision Processes in Gases","volume":"94","author":"Bhatnagar","year":"1954","journal-title":"Phys. Rev."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"437","DOI":"10.1098\/rsta.2001.0955","article-title":"Multiple-relaxation-time lattice Boltzmann models in three dimensions","volume":"360","author":"Ginzburg","year":"2002","journal-title":"Philos. Trans. R. Soc. A Math. Phys. Eng. Sci."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"717","DOI":"10.1098\/rspa.2000.0689","article-title":"Entropic lattice Boltzmann methods","volume":"457","author":"Boghosian","year":"2001","journal-title":"Proc. R. Soc. A Math. Phys. Eng. Sci."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"066705","DOI":"10.1103\/PhysRevE.73.066705","article-title":"Cascaded digital lattice Boltzmann automata for high Reynolds number flow","volume":"73","author":"Geier","year":"2006","journal-title":"Phys. Rev. E"},{"key":"ref_44","unstructured":"Wolfe, M. (2015). OpenACC for Multicore CPUs, PGI, NVIDIA Corporation."},{"key":"ref_45","unstructured":"Bailey, D.H. (1991). Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputing Review, MIT Press."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"H\u00f6fler, T., and Belli, R. (2015, January 15\u201320). Scientific benchmarking of parallel computing systems. 
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis\u2014SC \u201915, Austin, TX, USA.","DOI":"10.1145\/2807591.2807644"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"e4221","DOI":"10.1002\/cpe.4221","article-title":"Reducing memory requirements for large size LBM simulations on GPUs","volume":"29","year":"2017","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"924","DOI":"10.1016\/j.camwa.2012.05.002","article-title":"Comparison of different propagation steps for lattice Boltzmann methods","volume":"65","author":"Wittmann","year":"2013","journal-title":"Comput. Math. Appl."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"65","DOI":"10.4208\/cicp.210910.200611a","article-title":"A Coupled Approach for Fluid Dynamic Problems Using the PDE Framework Peano","volume":"12","author":"Neumann","year":"2012","journal-title":"Commun. Comput. Phys."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Geier, M., and Sch\u00f6nherr, M. (2017). Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware. Computation, 5.","DOI":"10.3390\/computation5020019"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Lam, M.D., Rothberg, E.E., and Wolf, M.E. (1991, January 8\u201311). The cache performance and optimizations of blocked algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems\u2014ASPLOS-IV, Santa Clara, CA, USA.","DOI":"10.1145\/106972.106981"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Chapman, B., Zheng, W., Gao, G.R., Sato, M., Ayguad\u00e9, E., and Wang, D. (2008). A Proposal for Task Parallelism in OpenMP. A Practical Programming Model for the Multi-Core Era, Springer.","DOI":"10.1007\/978-3-540-69303-1"},{"key":"ref_53","unstructured":"Schreiber, M. 
(2010). GPU Based Simulation and Visualization of Fluids with Free Surfaces. [Diploma Thesis, Technische Universit\u00e4t M\u00fcnchen]."},{"key":"ref_54","unstructured":"NVIDIA Corporation (2017, October 16). Tuning CUDA Applications for Kepler. Available online: http:\/\/docs.nvidia.com\/cuda\/kepler-tuning-guide\/."},{"key":"ref_55","unstructured":"NVIDIA Corporation (2017, October 16). Achieved Occupancy. Available online: https:\/\/docs.nvidia.com\/gameworks\/content\/developertools\/desktop\/analysis\/report\/cudaexperiments\/kernellevel\/achievedoccupancy.htm."},{"key":"ref_56","unstructured":"Bakhtiari, A. (2013). MPI Parallelization of GPU-Based Lattice Boltzmann Simulations. [Master\u2019s Thesis, Technische Universit\u00e4t M\u00fcnchen]."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"348","DOI":"10.1016\/0021-9991(73)90157-5","article-title":"Numerical study of viscous flow in a cavity","volume":"12","author":"Bozeman","year":"1973","journal-title":"J. Comput. Phys."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1016\/0021-9991(82)90058-4","article-title":"High-Re solutions for incompressible flow using the Navier\u2013Stokes equations and a multigrid method","volume":"48","author":"Ghia","year":"1982","journal-title":"J. Comput. Phys."},{"key":"ref_59","unstructured":"Intel Corporation (2017, October 16). Intel Xeon Processor E5-2690v3. Available online: https:\/\/ark.intel.com\/products\/81713\/."},{"key":"ref_60","unstructured":"Global Scientific Information and Computing Center (2013). TSUBAME2.5 Hardware Software Specifications, Tokyo Institute of Technology. 
Technical Report."}],"container-title":["Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-3197\/5\/4\/48\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T18:52:04Z","timestamp":1760208724000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-3197\/5\/4\/48"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,11,30]]},"references-count":60,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2017,12]]}},"alternative-id":["computation5040048"],"URL":"https:\/\/doi.org\/10.3390\/computation5040048","relation":{},"ISSN":["2079-3197"],"issn-type":[{"value":"2079-3197","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,11,30]]}}}