{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T03:33:59Z","timestamp":1773804839932,"version":"3.50.1"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2014,2,1]],"date-time":"2014-02-01T00:00:00Z","timestamp":1391212800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2014,2]]},"abstract":"<jats:p>Compiler-based auto-parallelization is a much-studied area but has yet to find widespread application. This is largely due to the poor identification and exploitation of application parallelism, resulting in disappointing performance far below that which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling the identification of more application parallelism, and only rely on the user for final approval. We then replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while automating adaptation to different target architectures. We have evaluated our parallelization strategy on the NAS and SPEC CPU2000 benchmarks and two different multicore platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). 
We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers but also comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning-based parallelization for complex multicore platforms.<\/jats:p>","DOI":"10.1145\/2579561","type":"journal-article","created":{"date-parts":[[2014,3,18]],"date-time":"2014-03-18T12:09:07Z","timestamp":1395144547000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":48,"title":["Integrating profile-driven parallelism detection and machine-learning-based mapping"],"prefix":"10.1145","volume":"11","author":[{"given":"Zheng","family":"Wang","sequence":"first","affiliation":[{"name":"Lancaster University, United Kingdom"}]},{"given":"Georgios","family":"Tournavitis","sequence":"additional","affiliation":[{"name":"Intel Barcelona Research Center, Spain"}]},{"given":"Bj\u00f6rn","family":"Franke","sequence":"additional","affiliation":[{"name":"University of Edinburgh, United Kingdom"}]},{"given":"Michael F. P.","family":"O'Boyle","sequence":"additional","affiliation":[{"name":"University of Edinburgh, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2014,2]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"NAS Parallel Benchmarks 2.3 OpenMP C version. (2004). 
http:\/\/www.hpcs.cs.tsukuba.ac.jp\/omni-compiler\/download\/download-benchmarks.html."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1562764.1562783"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing (LCPC'10)","author":"Aslam Amina","year":"2010","unstructured":"Amina Aslam and Laurie Hendren. 2010. McFLAT: A profile-based framework for MATLAB loop analysis and transformations. In Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing (LCPC'10). 1--15."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/647074.713908"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/125826.125925"},{"key":"e_1_2_1_6_1","volume-title":"Pattern Recognition and Machine Learning (Information Science and Statistics)","author":"Bishop Christopher M.","unstructured":"Christopher M. Bishop. 2007. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/130385.130401"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(96)00097-X"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.35"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/12276.13328"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859668"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Tong Chen Jin Lin Xiaoru Dai Wei-Chung Hsu and Pen-Chung Yew. 2004. 
Data dependence profiling for speculative optimizations. In Compiler Construction. 57--72.","DOI":"10.1007\/978-3-540-24723-4_5"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 4th Conference on Operating System Design and Implementation (OSDI'00)","author":"Corbal\u00e1n Julita","year":"2000","unstructured":"Julita Corbal\u00e1n, Xavier Martorell, and Jes\u00fas Labarta. 2000. Performance-driven processor allocation. In Proceedings of the 4th Conference on Operating System Design and Implementation (OSDI'00). 5--17."},{"key":"e_1_2_1_14_1","unstructured":"CoSy. 2009. CoSy compiler development system. Retrieved from http:\/\/www.ace.nl\/compiler\/."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-13374-9_9"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250760"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/781131.781159"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1026009"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/996841.996852"},{"key":"e_1_2_1_20_1","unstructured":"Vector Fabrics. 2013. Homepage. 
Retrieved from http:\/\/www.vectorfabrics.com\/."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/277650.277725"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993498.1993553"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/605397.605428"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS","author":"Ryan","year":"2007","unstructured":"Ryan E. Grant and Ahmad Afsahi. 2007. A comprehensive analysis of OpenMP applications on dual-core Intel Xeon SMPs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007). 1--8."},{"key":"e_1_2_1_26_1","volume-title":"CGO'13","author":"Grewe Dominik","unstructured":"Dominik Grewe, Zheng Wang, and Michael F.P. O'Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In CGO'13."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1944862.1944881"},{"key":"e_1_2_1_28_1","volume-title":"LCPC'13","author":"Grewe Dominik","unstructured":"Dominik Grewe, Zheng Wang, and Michael F. P. O'Boyle. 2013. OpenCL task partitioning in the presence of GPU contention. 
In LCPC'13."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.546613"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/782814.782825"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/109025.109086"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1093\/ietisy\/e89-d.2.399"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254107"},{"key":"e_1_2_1_34_1","volume-title":"Allen","author":"Kennedy Ken","year":"2002","unstructured":"Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.86108"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.49"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/567532.567555"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250759"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/360827.360844"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/301104.301108"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/263699.263719"},{"key":"e_1_2_1_42_1","unstructured":"Open64. 2013. Homepage. Retrieved from http:\/\/www.open64.net."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2005.13"},{"key":"e_1_2_1_44_1","volume-title":"Polaris: A New-Generation Parallelizing Compiler for MPPs. Technical Report","author":"Padua David A.","year":"1993","unstructured":"David A. Padua, Rudolf Eigenmann, Jay Hoeflinger, Paul Petersen, Peng Tu, Stephen Weatherford, and Keith Faigin. 1993. Polaris: A New-Generation Parallelizing Compiler for MPPs. 
Technical Report. University of Illinois at Urbana-Champaign."},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing. 64--81","author":"Peterson P.","unstructured":"P. Peterson and David A. Padua. 1993. Dynamic dependence analysis: A novel method for data dependence evaluation. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing. 64--81."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.14"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065944.1065964"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772954.1772961"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/76263.76335"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1026007"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/224538.224553"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/648048.745869"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/207110.207148"},{"key":"e_1_2_1_55_1","volume-title":"HiPEAC Industrial Workshop.","author":"Rul Sean","year":"2008","unstructured":"Sean Rul, Hans Vandierendonck, and Koen De Bosschere. 2008. 
A dynamic analysis tool for finding coarse-grain parallelism. In HiPEAC Industrial Workshop."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/1274971.1275008"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1024597010150"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1229428.1229483"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.7"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854321"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542476.1542496"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.5555\/1299042.1299110"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854322"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/1504176.1504189"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854313"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2512436"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-89740-8_16"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/1046192.1046216"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579561","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2579561","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:43:50Z","timestamp":1750290230000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579561"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,2]]},"references-count":66,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,2]]}},"alternative-id":["10.1145\/2579561"],"URL":"https:\/\/doi.org\/10.1145\/2579561","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,2]]},"assertion":[{"value":"2012-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-02-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}