{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T09:43:33Z","timestamp":1775123013582,"version":"3.50.1"},"reference-count":49,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2020,6,1]],"date-time":"2020-06-01T00:00:00Z","timestamp":1590969600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,6,24]],"date-time":"2020-06-24T00:00:00Z","timestamp":1592956800000},"content-version":"vor","delay-in-days":23,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key R&D Program of China","award":["2016YFA0602100"],"award-info":[{"award-number":["2016YFA0602100"]}]},{"name":"National Key R&D Program of China","award":["2017YFA0604500"],"award-info":[{"award-number":["2017YFA0604500"]}]},{"name":"Center for High Performance Computing and System Simulation of Pilot National Laboratory for Marine Science and Technology (Qingdao).","award":["N\/A"],"award-info":[{"award-number":["N\/A"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2020,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The ever-growing complexity of HPC applications and the computer architectures cost more efforts than ever to learn application behaviors. In this paper, we propose the <jats:italic>APMT<\/jats:italic>, an Automatic Performance Modeling Tool, to understand and predict performance efficiently in the regimes of interest to developers and performance analysts while outperforming many traditional techniques. In APMT, we use hardware counter-assisted profiling to identify the key kernels and non-scalable kernels and build each kernel model according to our performance modeling framework. Meantime, we also provide an optional refinement modeling framework to further understand the key performance metric, cycles-per-instruction (CPI). Our evaluations show that by only performing a few small-scale profiling, APMT is able to keep the average error rate around 15% with average performance overheads of 3% in different scenarios, including NAS parallel benchmarks, dynamical core of atmosphere model of the Community Earth System Model (CESM), and the ice component of CESM on commodity clusters. APMT improve the model prediction accuracies by 25\u201352% in strong scaling tests comparing to the well-known analytical model and the empirical model.<\/jats:p>","DOI":"10.1007\/s42514-020-00035-8","type":"journal-article","created":{"date-parts":[[2020,6,24]],"date-time":"2020-06-24T07:03:37Z","timestamp":1592982217000},"page":"135-148","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["APMT: an automatic hardware counter-based performance modeling tool for HPC applications"],"prefix":"10.1007","volume":"2","author":[{"given":"Nan","family":"Ding","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Victor W.","family":"Lee","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9740-6581","authenticated-orcid":false,"given":"Wei","family":"Xue","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weimin","family":"Zheng","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,6,24]]},"reference":[{"key":"35_CR1","doi-asserted-by":"crossref","unstructured":"Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: Hpctoolkit: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685\u2013701 (2010)","DOI":"10.1002\/cpe.1553"},{"key":"35_CR2","doi-asserted-by":"crossref","unstructured":"Arenaz, M., Touri\u00f1o, J., Doallo, R.: Xark: an extensible framework for automatic recognition of computational kernels. ACM Trans. Program Langu. Syst. (TOPLAS) 30(6), 32 (2008)","DOI":"10.1145\/1391956.1391959"},{"key":"35_CR3","unstructured":"Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., et\u00a0al. The landscape of parallel computing research: a view from berkeley. Technical report, Technical Report UCB\/EECS-2006-183, EECS Department, University of California, Berkeley (2006)"},{"key":"35_CR4","doi-asserted-by":"crossref","unstructured":"Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., et al.: The nas parallel benchmarks. Int. J. High Perform. Comput. Appl. 5(3), 63\u201373 (1991))","DOI":"10.1177\/109434209100500306"},{"key":"35_CR5","doi-asserted-by":"crossref","unstructured":"Balaprakash, P., Tiwari, A., Wild, S.M., Carrington, L., Hovland, P.D.: Automomml: Automatic multi-objective modeling with machine learning. In International Conference on High Performance Computing, pp. 219\u2013239 (2016)","DOI":"10.1007\/978-3-319-41321-1_12"},{"key":"35_CR6","doi-asserted-by":"crossref","unstructured":"Barnes, B.J., Rountree, B., Lowenthal, D.K., Reeves, J., De\u00a0Supinski, B., Schulz, M.: A regression-based approach to scalability prediction. In Proceedings of the 22nd annual international conference on Supercomputing, pp. 368\u2013377. ACM (2008)","DOI":"10.1145\/1375527.1375580"},{"key":"35_CR7","doi-asserted-by":"crossref","unstructured":"Bauer, G., Gottlieb, S., Hoefler, T.: Performance modeling and comparative analysis of the milc lattice qcd application su3\\_rmd. In Cluster, Cloud and Grid Computing (CCGrid), pp. 652\u2013659. IEEE (2012)","DOI":"10.1109\/CCGrid.2012.123"},{"key":"35_CR8","doi-asserted-by":"crossref","unstructured":"Bhattacharyya, A., Hoefler, T.: Pemogen: automatic adaptive performance modeling during program runtime. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp. 393\u2013404. ACM, (2014)","DOI":"10.1145\/2628071.2628100"},{"key":"35_CR9","doi-asserted-by":"crossref","unstructured":"Bhattacharyya, A., Kwasniewski, G., Hoefler, T.: Using compiler techniques to improve automatic performance modeling. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation. ACM (2015)","DOI":"10.1109\/PACT.2015.39"},{"key":"35_CR10","unstructured":"Bitzes, G., Nowak, A.: The overhead of profiling using pmu hardware counters. CERN openlab report (2014)"},{"key":"35_CR11","doi-asserted-by":"crossref","unstructured":"Browne, S., Dongarra, J., Garner, N., London, K., Mucci, P.: A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Supercomputing, ACM\/IEEE 2000 Conference, IEEE, pp. 42\u201342 (2000)","DOI":"10.1109\/SC.2000.10029"},{"key":"35_CR12","doi-asserted-by":"crossref","unstructured":"Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.\u00a045. ACM (2013)","DOI":"10.1145\/2503210.2503277"},{"key":"35_CR13","doi-asserted-by":"crossref","unstructured":"Chou, C.-Y., Chang, H.-Y., Wang, S.-T., Huang, K.-C., Shen, C.-Y.: An improved model for predicting hpl performance. In International Conference on Grid and Pervasive Computing, pp. 158\u2013168. Springer (2007)","DOI":"10.1007\/978-3-540-72360-8_14"},{"key":"35_CR14","doi-asserted-by":"crossref","unstructured":"Clapp, R., Dimitrov, M., Kumar, K., Viswanathan, V., Willhalm, T.: Quantifying the performance impact of memory latency and bandwidth for big data workloads. In Workload Characterization (IISWC), pp. 213\u2013224. IEEE (2015)","DOI":"10.1109\/IISWC.2015.32"},{"key":"35_CR15","doi-asserted-by":"crossref","unstructured":"Craig, A.P., Mickelson, S.A., Hunke, E.C., Bailey, D.A.: Improved parallel performance of the cice model in cesm1. Int. J. High Perfor. Comput. Appl. 29(2), 154\u2013165 (2015)","DOI":"10.1177\/1094342014548771"},{"key":"35_CR16","doi-asserted-by":"crossref","unstructured":"Dennis, J.M., Edwards, J., Evans, K.J., Guba, O., Lauritzen, P.H., Mirin, A.A., St-Cyr, A., Taylor, M.A., Worley, P.H.: Cam-se: a scalable spectral element dynamical core for the community atmosphere model. Int. J. High Perform. Comput. Appl. 26(1), 74\u201389 (2012)","DOI":"10.1177\/1094342011428142"},{"key":"35_CR17","doi-asserted-by":"crossref","unstructured":"Doweck, J.: Inside intel\u00ae core microarchitecture. In Hot Chips 18 Symposium (HCS), pp. 1\u201335. IEEE (2006)","DOI":"10.1109\/HOTCHIPS.2006.7477876"},{"key":"35_CR18","unstructured":"Gamblin, T., Schulz, M., de\u00a0Supinski, B.R., Wolf, F., Wylie, B.J.N. et\u00a0al. Reconciling sampling and direct instrumentation for unintrusive call-path profiling of mpi programs. In Parallel & Distributed Processing Symposium (IPDPS), 2011. IEEE (2011)"},{"key":"35_CR19","doi-asserted-by":"crossref","unstructured":"Garcia, S., Jeon, D., Louie, Christopher\u00a0M., Taylor, Michael\u00a0B.: Kremlin: rethinking and rebooting gprof for the multicore age. In ACM SIGPLAN Notices, vol.\u00a046, pp. 458\u2013469. ACM (2011)","DOI":"10.1145\/1993316.1993553"},{"key":"35_CR20","doi-asserted-by":"crossref","unstructured":"Geimer, M., Wolf, F., Wylie, B.J.N., \u00c1brah\u00e1m, E., Becker, D., Mohr, B.: The scalasca performance toolset architecture. Concurr. Comput. Pract. Exp. 22(6), 702\u2013719 (2010)","DOI":"10.1002\/cpe.1556"},{"key":"35_CR21","doi-asserted-by":"crossref","unstructured":"Hong, S., Kim, H.: An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness. In ACM SIGARCH Computer Architecture News, vol.\u00a037, pp. 152\u2013163. ACM (2009)","DOI":"10.1145\/1555815.1555775"},{"key":"35_CR22","doi-asserted-by":"crossref","unstructured":"Hu, P.J., Chau, P.Y.K., Sheng, O.R.L., Tam, K.Y.: Examining the technology acceptance model using physician acceptance of telemedicine technology. J. Manag. Inf. Syst. 16(2), 91\u2013112 (1999)","DOI":"10.1080\/07421222.1999.11518247"},{"key":"35_CR23","unstructured":"Hunke, E.C., Lipscomb, W.H., Turner, A.K., et\u00a0al. Cice: the los alamos sea ice model documentation and software user\u2019s manual version 4.1 la-cc-06-012. T-3 Fluid Dynamics Group, Los Alamos National Laboratory, pp. 675 (2010)"},{"key":"35_CR24","doi-asserted-by":"crossref","unstructured":"Jayakumar, A., Murali, P., Vadhiyar, S.: Matching application signatures for performance predictions using a single execution. In Parallel and Distributed Processing Symposium (IPDPS), pp. 1161\u20131170. IEEE (2015)","DOI":"10.1109\/IPDPS.2015.20"},{"key":"35_CR25","doi-asserted-by":"crossref","unstructured":"Jones, P.W., Worley, P.H., Yoshida, Y., White, J.B., Levesque, J.: Practical performance portability in the parallel ocean program (pop). Concurr. Comput. Pract. Exp. 17(10), 1317\u20131327 (2005)","DOI":"10.1002\/cpe.894"},{"key":"35_CR26","doi-asserted-by":"crossref","unstructured":"Keller, R., Gabriel, E., Krammer, B., Mueller, M.S., Resch, M.M.: Towards efficient execution of mpi applications on the grid: porting and optimization issues. J. Grid Comput. 1(2), 133\u2013149 (2003)","DOI":"10.1023\/B:GRID.0000024071.12177.91"},{"key":"35_CR27","doi-asserted-by":"crossref","unstructured":"Kn\u00fcpfer, A., R\u00f6ssel, C., an\u00a0Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A. et\u00a0al.: Score-p: a joint performance measurement run-time infrastructure for periscope, scalasca, tau, and vampir. In Tools for High Performance Computing, pp. 79\u201391. Springer, Amsterdam (2012)","DOI":"10.1007\/978-3-642-31476-6_7"},{"key":"35_CR28","doi-asserted-by":"crossref","unstructured":"Lee, Benjamin\u00a0C., Brooks, David\u00a0M., de\u00a0Supinski, B.R., Schulz, M., Singh, K., McKee, S.A.: Methods of inference and learning for performance modeling of parallel applications. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 249\u2013258 (2007)","DOI":"10.1145\/1229428.1229479"},{"key":"35_CR29","unstructured":"Liang, Q.: Performance monitor counter data analysis using counter analyzer. IBM developerWorks (2009)"},{"key":"35_CR30","unstructured":"Malladi, R.K.: Using intel\u00ae vtune$$^{\\rm TM}$$ performance analyzer events\/ratios & optimizing applications. http:\/software.intel.com. (2009)"},{"key":"35_CR31","doi-asserted-by":"crossref","unstructured":"Malony, A.D., Shende, S.S.: Overhead compensation in performance profiling. In European Conference on Parallel Processing, Springer, pp. 119\u2013132 (2004)","DOI":"10.1007\/978-3-540-27866-5_16"},{"key":"35_CR32","doi-asserted-by":"crossref","unstructured":"Marathe, A., Anirudh, R., Jain, N., Bhatele, A., Thiagarajan, J., Kailkhura, B., Yeom, J.-S., Rountree, B., Gamblin, T.: Performance modeling under resource constraints using deep transfer learning. In Proceedings of the: ACM\/IEEE International Conference for High Performance Computing, p. 2017. Networking, Storage and Analysis (SC), Denver, Colorado (2017)","DOI":"10.1145\/3126908.3126969"},{"key":"35_CR33","doi-asserted-by":"crossref","unstructured":"Merten, M.C., Trick, A.R., George, C.N., Gyllenhaal, J.C., Hwu, W.W.: A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization. In ACM SIGARCH Computer Architecture News, vol.\u00a027, pp. 136\u2013147. IEEE Computer Society (1999)","DOI":"10.1145\/307338.300991"},{"key":"35_CR34","unstructured":"Mucci, P.J., Browne, S., Deane, C., Ho, G.: Papi: A portable interface to hardware performance counters. In Proceedings of the department of defense HPCMP users group conference, vol 710 (1999)"},{"key":"35_CR35","unstructured":"Nan, D., Wei, X., Xu, J., Haoyu, X., Zhenya, S.: Cesmtuner: an auto-tuning framework for the community earth system model. In High Performance Computing and Communications (HPCC), pp. 282\u2013289, Washington, DC. IEEE Computer Society (2014)"},{"key":"35_CR36","doi-asserted-by":"crossref","unstructured":"Pallipuram, V.K., Smith, M.C., Raut, N., Ren, X.: A regression-based performance prediction framework for synchronous iterative algorithms on general purpose graphical processing unit clusters. Concurr. Comput. Pract. Exp. 26(2), 532\u2013560 (2014)","DOI":"10.1002\/cpe.3017"},{"key":"35_CR37","doi-asserted-by":"crossref","unstructured":"Pallipuram, V., Smith, M., Sarma, N., Anand, R., Weill, E., Sapra, K.: Subjective versus objective: classifying analytical models for productive heterogeneous performance prediction. J. Supercomput. 71(1) (2015)","DOI":"10.1007\/s11227-014-1292-9"},{"key":"35_CR38","unstructured":"PMU Intel. Profiling tools. https:\/\/github.com\/andikleen\/pmu-tools"},{"key":"35_CR39","doi-asserted-by":"crossref","unstructured":"Spafford, K.L., Vetter, J.S.: Aspen: a domain specific language for performance modeling. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp\u00a084. IEEE Computer Society Press (2012)","DOI":"10.1109\/SC.2012.20"},{"key":"35_CR40","doi-asserted-by":"crossref","unstructured":"Sprunt, B.: Pentium 4 performance-monitoring features. Micro IEEE 22(4), 72\u201382 (2002)","DOI":"10.1109\/MM.2002.1028478"},{"key":"35_CR41","doi-asserted-by":"crossref","unstructured":"Stewart, A.: A programming model for bsp with partitioned synchronisation. Formal Aspects Comput. 23(4), 421\u2013432 (2011)","DOI":"10.1007\/s00165-010-0163-2"},{"key":"35_CR42","doi-asserted-by":"crossref","unstructured":"Treibig, J., Hager, G., Wellein, G.: Likwid: a lightweight performance-oriented tool suite for x86 multicore environments. In Parallel Processing Workshops (ICPPW), pp. 207\u2013216. IEEE (2010)","DOI":"10.1109\/ICPPW.2010.38"},{"key":"35_CR43","doi-asserted-by":"crossref","unstructured":"Van\u00a0den Steen, S., De\u00a0Pestel, S., Mechri, M., Eyerman, S., Carlson, T., Black-Schaffer, D., Hagersten, E., Eeckhout, L.: Micro-architecture independent analytical processor performance and power modeling. In Performance Analysis of Systems and Software (ISPASS), pp. 32\u201341. IEEE (2015)","DOI":"10.1109\/ISPASS.2015.7095782"},{"key":"35_CR44","unstructured":"Weaver, V.M.: Linux perf\\_event features and overhead. In: The 2nd International workshop on performance analysis of workload optimized systems, FastPath, vol\u00a013 (2013)"},{"key":"35_CR45","doi-asserted-by":"crossref","unstructured":"Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65\u201376 (2009)","DOI":"10.1145\/1498765.1498785"},{"key":"35_CR46","doi-asserted-by":"crossref","unstructured":"Worley, P.H., Craig, A.P., Dennis, J.M., Mirin, Arthur\u00a0A., Taylor, M.A., Vertenstein, M.: Performance of the community earth system model. In High Performance Computing, Networking, Storage and Analysis (SC), pp. 1\u201311. IEEE (2011)","DOI":"10.1145\/2063384.2063457"},{"key":"35_CR47","doi-asserted-by":"crossref","unstructured":"Wu, Q., Mencer, O.: Evaluating sampling based hotspot detection. In International Conference on Architecture of Computing Systems, Springer, pp. 28\u201339 (2009)","DOI":"10.1007\/978-3-642-00454-4_6"},{"key":"35_CR48","unstructured":"Xingfu, W., Lively, C., Taylor, V., Hung, C.C., Chun Y.S., Katherine, C., Steven, M., Dan, T., Vince, W.: Multiple metrics modeling infrastructure. Springer, MuMMI (2014)"},{"key":"35_CR49","doi-asserted-by":"crossref","unstructured":"Zaparanuks, D., Jovic, M., Hauswirth, M.: Accuracy of performance counter measurements. In Performance Analysis of Systems and Software, ISPASS (2009)","DOI":"10.1109\/ISPASS.2009.4919635"}],"container-title":["CCF Transactions on High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-020-00035-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-020-00035-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-020-00035-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,6,23]],"date-time":"2021-06-23T23:54:17Z","timestamp":1624492457000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-020-00035-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6]]},"references-count":49,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6]]}},"alternative-id":["35"],"URL":"https:\/\/doi.org\/10.1007\/s42514-020-00035-8","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"value":"2524-4922","type":"print"},{"value":"2524-4930","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,6]]},"assertion":[{"value":"2 November 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 April 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 June 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}