{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:40:45Z","timestamp":1768030845051,"version":"3.49.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,12,8]],"date-time":"2018-12-08T00:00:00Z","timestamp":1544227200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National R&D Program of China","award":["2018YFB1004802"],"award-info":[{"award-number":["2018YFB1004802"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61602301, 61632017"],"award-info":[{"award-number":["61602301, 61632017"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,12,31]]},"abstract":"<jats:p>\n            Parallel computers now start to adopt Bandwidth-Asymmetric Memory architecture that consists of traditional DRAM memory and new High Bandwidth Memory (HBM) for high memory bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality problems in bandwidth-asymmetric memory architectures. To solve the two problems, we propose a Bandwidth and Locality Aware Task-stealing (BATS) system, which consists of an HBM-aware data allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler. Leveraging compile-time code transformation and run-time data distribution, the data allocator enables HBM usage automatically without user interference. According to data access hotness, the traffic balancer migrates data to balance memory traffic across memory nodes proportional to their bandwidth. 
The hierarchical scheduler improves data locality at runtime without\n            <jats:italic>a priori<\/jats:italic>\n            program knowledge. Experiments on an Intel Knights Landing server that adopts bandwidth-asymmetric memory show that BATS reduces the execution time of memory-bound programs up to 83.5% compared with traditional task-stealing schedulers.\n          <\/jats:p>","DOI":"10.1145\/3291058","type":"journal-article","created":{"date-parts":[[2018,12,10]],"date-time":"2018-12-10T13:09:16Z","timestamp":1544447356000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory"],"prefix":"10.1145","volume":"15","author":[{"given":"Han","family":"Zhao","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5832-0347","authenticated-orcid":false,"given":"Quan","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuxian","family":"Qiu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ming","family":"Wu","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yao","family":"Shen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingwen","family":"Leng","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Minyi","family":"Guo","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,12,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2017. Intel Memory Latency Checker. Retrieved from https:\/\/software.intel.com\/en-us\/articles\/intelr-memory-latency-checker."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1631"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2008.105"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism. Citeseer.","author":"Barik Rajkishore","year":"2006","unstructured":"Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra, Igor Peshansky, and Vivek Sarkar. 2006. Experiences with an SMP implementation for X10 based on the Java concurrency utilities. In Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism. 
Citeseer."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1996.0107"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.32"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2579674"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597652.2597665"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304599"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2011.32"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132402.3132404"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-09873-9_50"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451157"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967946"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.66"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693504"},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. 
Morgan Kaufmann.","DOI":"10.1016\/B978-0-12-809194-4.00002-8"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/167962.165874"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/977395.977673"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693459"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629911.1630048"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851151"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462193"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.75"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.2163"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2016.7753305"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.788643"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541987"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011434065"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.09.014"},{"key":"e_1_2_1_32_1","volume-title":"Daniel Cordeiro, Abhinav Bhatele, Philippe O. A. Navaux, Jean-Fran\u00e7ois M\u00e9haut, Laxmikant V. Kal\u00e9, et al.","author":"Pilla La\u00e9rcio L.","year":"2011","unstructured":"La\u00e9rcio L. Pilla, Christiane Pousa Ribeiro, Daniel Cordeiro, Abhinav Bhatele, Philippe O. A. Navaux, Jean-Fran\u00e7ois M\u00e9haut, Laxmikant V. Kal\u00e9, et al. 2011. Improving parallel system performance with a NUMA-aware load balancer. JLPC (2011). 
http:\/\/hdl.handle.net\/2142\/25911."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009106195"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/1887695.1887719"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.30"},{"key":"e_1_2_1_36_1","unstructured":"J. Reinders. 2007. Intel Threading Building Blocks. Intel."},{"key":"e_1_2_1_37_1","volume-title":"Product type (GPU, CPU, APU, FPGA, ASIC), Application, and Geography-Global Forecast to","author":"Markets Research","year":"2023","unstructured":"Research and Markets. 2018. Hybrid Memory Cube (HMC) and High-bandwidth Memory (HBM) Market by Memory Type (HMC and HBM), Product type (GPU, CPU, APU, FPGA, ASIC), Application, and Geography-Global Forecast to 2023. https:\/\/www.researchandmarkets.com\/research\/5cwh7b."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541971"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/356004.356006"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.50"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465016"},{"key":"e_1_2_1_42_1","volume-title":"MIT","author":"Supercomputing Technologies Group","year":"2001","unstructured":"Supercomputing Technologies Group, MIT 2001. Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT. 
Retrieved from http:\/\/supertech.lcs.mit.edu\/cilk\/manual-5.4.6.pdf."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.5555\/2391541.2391548"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.14"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080214"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2013.05.201"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-43659-3_39"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-67952-5_5"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2484425.2484427"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854306"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-13374-9_12"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2486159.2486175"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2012.6237047"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3291058","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3291058","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:01:52Z","timestamp":1750208512000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3291058"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,8]]},"references-count":52,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,12,31]]}},"alternative-id":["10.1145\/3291058"],"URL":"https:\/\/doi.org\/10.1145\/3291058","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,12,8]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-12-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}