{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:54:08Z","timestamp":1756000448536,"version":"3.37.3"},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2024,2,2]],"date-time":"2024-02-02T00:00:00Z","timestamp":1706832000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,2]],"date-time":"2024-02-02T00:00:00Z","timestamp":1706832000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100031060","name":"EuroHPC-JU","doi-asserted-by":"crossref","award":["956137"],"award-info":[{"award-number":["956137"]}],"id":[{"id":"10.13039\/100031060","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100006690","name":"Politecnico di Milano","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006690","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2024,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Virtual screening is an early stage in the drug discovery process that selects the most promising candidates. In the urgent computing scenario, finding a solution in the shortest time frame is critical. Any improvement in the performance of a virtual screening application translates into an increase in the number of candidates evaluated, thereby raising the probability of finding a drug. In this paper, we show how we can improve application throughput using Out-of-kernel optimizations. They use input features, kernel requirements, and architectural features to rearrange the kernel inputs, executing them out of order, to improve the computation efficiency. 
These optimizations\u2019 implementations are designed on an extreme-scale virtual screening application, named LiGen, that can hinge on CUDA and SYCL kernels to carry out the computation on modern supercomputer nodes. Even if they are tailored to a single application, they might\u00a0also be of interest for applications that share a similar design pattern. The experimental results show how these optimizations can increase kernel performance by 2<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>, respectively,\u00a0up to 2.2<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> in CUDA\u00a0and\u00a0up to 1.9<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>,\u00a0in SYCL. 
Moreover, the reported speedup can be achieved with the best-proposed parameterization, as shown by the data we collected and reported in this manuscript.<\/jats:p>","DOI":"10.1007\/s11227-023-05884-y","type":"journal-article","created":{"date-parts":[[2024,2,2]],"date-time":"2024-02-02T10:02:32Z","timestamp":1706868152000},"page":"11798-11815","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs"],"prefix":"10.1007","volume":"80","author":[{"given":"Gianmarco","family":"Accordi","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Davide","family":"Gadioli","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Emanuele","family":"Vitali","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Luigi","family":"Crisci","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Biagio","family":"Cosenza","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrea","family":"Beccari","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gianluca","family":"Palermo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,2,2]]},"reference":[{"issue":"1","key":"5884_CR1","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1038\/s41418-021-00844-6","volume":"29","author":"M Allegretti","year":"2022","unstructured":"Allegretti M, Cesta MC, Zippoli M et al (2022) Repurposing the estrogen receptor modulator raloxifene to treat SARS-CoV-2 infection. 
Cell Death Differ 29(1):156\u2013166","journal-title":"Cell Death Differ"},{"issue":"2","key":"5884_CR2","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1145\/567806.567807","volume":"28","author":"LS Blackford","year":"2002","unstructured":"Blackford LS, Petitet A, Pozo R et al (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28(2):135\u2013151","journal-title":"ACM Trans Math Softw"},{"key":"5884_CR3","unstructured":"Crankshaw D, Wang X, Zhou G, et al (2017) Clipper: a low-latency online prediction serving system. In: NSDI, pp 613\u2013627"},{"key":"5884_CR4","doi-asserted-by":"crossref","unstructured":"Crisci L, Salimi\u00a0Beni M, Cosenza B, et al (2022) Towards a portable drug discovery pipeline with SYCL 2020. In: International workshop on OpenCL","DOI":"10.1145\/3529538.3529688"},{"key":"5884_CR5","doi-asserted-by":"crossref","unstructured":"Ding N, Williams S (2019) An instruction roofline model for gpus. In: 2019 IEEE\/ACM performance modeling, benchmarking and simulation of high performance computer systems (PMBS), pp 7\u201318","DOI":"10.1109\/PMBS49563.2019.00007"},{"key":"5884_CR6","doi-asserted-by":"crossref","unstructured":"Gadioli D, Vitali E, Ficarelli F, et\u00a0al (2022) Exscalate: an extreme-scale virtual screening platform for drug discovery targeting polypharmacology to fight SARS-CoV-2. IEEE Transactions on Emerging Topics in Computing pp 1\u201312","DOI":"10.1109\/TETC.2022.3187134"},{"issue":"10","key":"5884_CR7","doi-asserted-by":"publisher","first-page":"2757","DOI":"10.1021\/ci400391s","volume":"53","author":"H Ge","year":"2013","unstructured":"Ge H, Wang Y, Li C et al (2013) Molecular dynamics-based virtual screening: accelerating the drug discovery process by high-performance computing. 
J Chem Inf Model 53(10):2757\u20132764","journal-title":"J Chem Inf Model"},{"issue":"5","key":"5884_CR8","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1177\/10943420211001565","volume":"35","author":"J Glaser","year":"2021","unstructured":"Glaser J, Vermaas JV, Rogers DM et al (2021) High-throughput virtual laboratory for drug discovery using massive datasets. Int J High Perform Comput Appl 35(5):452\u2013468","journal-title":"Int J High Perform Comput Appl"},{"issue":"6","key":"5884_CR9","doi-asserted-by":"publisher","first-page":"630","DOI":"10.1093\/comjnl\/bxm099","volume":"51","author":"M Hassaballah","year":"2008","unstructured":"Hassaballah M, Omran S, Mahdy YB (2008) A review of SIMD multimedia extensions and their usage in scientific and engineering applications. Comput J 51(6):630\u2013649","journal-title":"Comput J"},{"key":"5884_CR10","doi-asserted-by":"crossref","unstructured":"Hijma P, Heldens S, Sclocco A et al (2023) Optimization techniques for GPU programming. ACM Comput Surv 55(11)","DOI":"10.1145\/3570638"},{"issue":"4","key":"5884_CR11","doi-asserted-by":"publisher","first-page":"865","DOI":"10.1021\/ci100459b","volume":"51","author":"O Korb","year":"2011","unstructured":"Korb O, St\u00fctzle T, Exner TE (2011) Accelerating molecular docking calculations using graphics processing units. J Chem Inf Model 51(4):865\u2013876","journal-title":"J Chem Inf Model"},{"key":"5884_CR12","doi-asserted-by":"crossref","unstructured":"Lemeire J, Cornelis JG, Segers L (2016) Microbenchmarks for GPU characteristics: the occupancy roofline and the pipeline model. 
In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp 456\u2013463","DOI":"10.1109\/PDP.2016.120"},{"issue":"1","key":"5884_CR13","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1093\/nsr\/nww003","volume":"3","author":"T Liu","year":"2016","unstructured":"Liu T, Lu D, Zhang H et al (2016) Applying high-performance computing in drug discovery and molecular simulation. Natl Sci Rev 3(1):49\u201363","journal-title":"Natl Sci Rev"},{"key":"5884_CR14","doi-asserted-by":"crossref","unstructured":"L\u00f3pez N, Debbio LD, Baaden M, et al (2021) Lessons learned from urgent computing in Europe: tackling the COVID-19 pandemic. In: Proceedings of the National Academy of Sciences, vol 118, pp 46","DOI":"10.1073\/pnas.2024891118"},{"key":"5884_CR15","unstructured":"Ma S, Belkin M (2019) Kernel machines that adapt to GPUS for effective large batch training. In: Talwalkar A, Smith V, Zaharia M (eds) Proceedings of Machine Learning and Systems, pp 360\u2013373"},{"key":"5884_CR16","doi-asserted-by":"crossref","unstructured":"Matter H, Sotriffer C (2011) Applications and success stories in virtual screening. Wiley, chap 12, pp 319\u2013358","DOI":"10.1002\/9783527633326.ch12"},{"key":"5884_CR17","doi-asserted-by":"crossref","unstructured":"Murugan NA, Podobas A, Gadioli D, et al (2022) A review on parallel virtual screening softwares for high-performance computers. Pharmaceuticals 15(1)","DOI":"10.3390\/ph15010063"},{"issue":"10","key":"5884_CR18","doi-asserted-by":"publisher","first-page":"2496","DOI":"10.1109\/TPDS.2022.3144614","volume":"33","author":"SM Nabavinejad","year":"2022","unstructured":"Nabavinejad SM, Reda S, Ebrahimi M (2022) Coordinated batching and DVFS for DNN inference on GPU accelerators. 
IEEE Trans Parallel Distrib Syst 33(10):2496\u20132508","journal-title":"IEEE Trans Parallel Distrib Syst"},{"issue":"2","key":"5884_CR19","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1007\/s12551-016-0247-1","volume":"9","author":"NS Pagadala","year":"2017","unstructured":"Pagadala NS, Syed K, Tuszynski J (2017) Software for molecular docking: a review. Biophys Rev 9(2):91\u2013102","journal-title":"Biophys Rev"},{"key":"5884_CR20","doi-asserted-by":"crossref","unstructured":"Palermo G, Accordi G, Gadioli D et al (2023) Tunable and portable extreme-scale drug discovery platform at exascale: the lIGATE approach. In: Proceedings of the 20th ACM International Conference on Computing Frontiers, pp 272\u2013278","DOI":"10.1145\/3587135.3592172"},{"key":"5884_CR21","unstructured":"Ruder S (2017) An overview of gradient descent optimization algorithms"},{"issue":"10","key":"5884_CR22","doi-asserted-by":"publisher","first-page":"1389","DOI":"10.1016\/j.jpdc.2008.05.011","volume":"68","author":"S Ryoo","year":"2008","unstructured":"Ryoo S, Rodrigues CI, Stone SS et al (2008) Program optimization carving for GPU computing. J Parallel Distrib Comput 68(10):1389\u20131401","journal-title":"J Parallel Distrib Comput"},{"key":"5884_CR23","doi-asserted-by":"crossref","unstructured":"Sethia A, Mahlke S (2014) Equalizer: dynamic tuning of GPU resources for efficient execution. In: 2014 47th Annual IEEE\/ACM International Symposium on Microarchitecture, pp 647\u2013658","DOI":"10.1109\/MICRO.2014.16"},{"issue":"9","key":"5884_CR24","doi-asserted-by":"publisher","first-page":"3041","DOI":"10.3390\/molecules27093041","volume":"27","author":"S Tang","year":"2022","unstructured":"Tang S, Chen R, Lin M et al (2022) Accelerating autodock vina with GPUS. Molecules 27(9):3041","journal-title":"Molecules"},{"key":"5884_CR25","unstructured":"Tillmann M, Karcher T, Dachsbacher C, et al (2014) Application-independent autotuning for GPUS. 
In: Parallel Computing: Accelerating Computational Science and Engineering (CSE). IOS Press, pp 626\u2013635"},{"key":"5884_CR26","doi-asserted-by":"crossref","unstructured":"Vitali E, Ficarelli F, Bisson M, et al (2024) GPU-optimized approaches to molecular docking-based virtual screening in drug discovery: a comparative analysis. J Parallel Distrib Comput 186(4)","DOI":"10.1016\/j.jpdc.2023.104819"},{"issue":"4","key":"5884_CR27","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1145\/1498765.1498785","volume":"52","author":"S Williams","year":"2009","unstructured":"Williams S, Waterman A, Patterson D (2009) Roofline. Commun ACM 52(4):65\u201376","journal-title":"Commun ACM"},{"key":"5884_CR28","doi-asserted-by":"crossref","unstructured":"Wu D, Zhang F, Ao N, et al (2009) A batched GPU algorithm for set intersection. In: 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pp 752\u2013756","DOI":"10.1109\/I-SPAN.2009.89"},{"key":"5884_CR29","doi-asserted-by":"crossref","unstructured":"Yu Y, Cai C, Zhu Z, et al (2022) Uni-dock: a GPU-accelerated docking program enables ultra-large virtual screening. American Chemical Society (ACS)","DOI":"10.26434\/chemrxiv-2022-5t5ts"},{"issue":"10","key":"5884_CR30","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1002\/jmr.2471","volume":"28","author":"E Yuriev","year":"2015","unstructured":"Yuriev E, Holien J, Ramsland PA (2015) Improvements, trends, and new ideas in molecular docking: 2012\u20132013 in review. J Mol Recognit 28(10):581\u2013604","journal-title":"J Mol Recognit"},{"issue":"3","key":"5884_CR31","doi-asserted-by":"publisher","first-page":"1406","DOI":"10.1109\/TSG.2016.2600587","volume":"8","author":"G Zhou","year":"2017","unstructured":"Zhou G, Feng Y, Bo R et al (2017) GPU-accelerated batch-ACPF solution for n-1 static security analysis. 
IEEE Trans Smart Grid 8(3):1406\u20131416","journal-title":"IEEE Trans Smart Grid"}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-023-05884-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-023-05884-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-023-05884-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,6]],"date-time":"2024-05-06T11:11:41Z","timestamp":1714993901000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-023-05884-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,2]]},"references-count":31,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,5]]}},"alternative-id":["5884"],"URL":"https:\/\/doi.org\/10.1007\/s11227-023-05884-y","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"type":"print","value":"0920-8542"},{"type":"electronic","value":"1573-0484"}],"subject":[],"published":{"date-parts":[[2024,2,2]]},"assertion":[{"value":"23 December 2023","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 February 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}}]}}