{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T16:34:05Z","timestamp":1773246845874,"version":"3.50.1"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,10,31]],"date-time":"2019-10-31T00:00:00Z","timestamp":1572480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NRC Associate Fellowship Award"},{"DOI":"10.13039\/100000001","name":"U.S. National Science Foundation","doi-asserted-by":"crossref","award":["1725456 and 1615475"],"award-info":[{"award-number":["1725456 and 1615475"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"crossref","award":["SC0017030"],"award-info":[{"award-number":["SC0017030"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Emerg. Technol. Comput. Syst."],"published-print":{"date-parts":[[2019,10,31]]},"abstract":"<jats:p>\n            Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, memory becomes a bottleneck of GPU\u2019s performance and energy efficiency. In this article, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. First, we propose a\n            <jats:italic>thread batch enabled memory partitioning (TEMP)<\/jats:italic>\n            to improve GPU memory access parallelism. 
In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each\n            <jats:italic>stream multiprocessor (SM)<\/jats:italic>\n            to the dedicated memory banks. After that, TEMP dispatches the thread batch to an\n            <jats:italic>SM<\/jats:italic>\n            to ensure highly parallel memory-access streaming from the different thread blocks. Second, a\n            <jats:italic>thread batch-aware scheduling (TBAS)<\/jats:italic>\n            scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. 
Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.\n          <\/jats:p>","DOI":"10.1145\/3330152","type":"journal-article","created":{"date-parts":[[2019,12,16]],"date-time":"2019-12-16T13:12:30Z","timestamp":1576501950000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Thread Batching for High-performance Energy-efficient GPU Memory Design"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0732-2267","authenticated-orcid":false,"given":"Bing","family":"Li","sequence":"first","affiliation":[{"name":"Duke University, USA and Army Research Office, Research Triangle Park, USA"}]},{"given":"Mengjie","family":"Mao","sequence":"additional","affiliation":[{"name":"MathWorks Inc., USA"}]},{"given":"Xiaoxiao","family":"Liu","sequence":"additional","affiliation":[{"name":"AMD, USA"}]},{"given":"Tao","family":"Liu","sequence":"additional","affiliation":[{"name":"Florida International University, Miami, FL, USA"}]},{"given":"Zihao","family":"Liu","sequence":"additional","affiliation":[{"name":"Florida International University, Miami, FL, USA"}]},{"given":"Wujie","family":"Wen","sequence":"additional","affiliation":[{"name":"Florida International University, Miami, FL, USA"}]},{"given":"Yiran","family":"Chen","sequence":"additional","affiliation":[{"name":"Duke University, Durham, North Carolina, USA"}]},{"given":"Hai (Helen)","family":"Li","sequence":"additional","affiliation":[{"name":"Duke University, Durham, North Carolina, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,12,16]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522337"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201915)","author":"Agarwal 
Neha"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337207"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 43rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201910)","author":"Bakhoda Ali","year":"2010"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 163--174","author":"Bakhoda Ali","year":"2009"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2012.2"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the APU 13th Developer Summit. 11--13","author":"Chu Hanjin","year":"2013"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911)","author":"Ebrahimi Eiman"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.44"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454152"},{"key":"e_1_2_1_12_1","unstructured":"Advanced Micro Devices Inc. [n.d.]. AMD Quad-Core A10-Series APU for Desktops. Retrieved from http:\/\/products.amd.com\/en-us\/DesktopAPUDetail.aspx?id=100\/."},{"key":"e_1_2_1_13_1","volume-title":"Micron DDR3 SDRAM Part MT41J256M8","author":"Micron Technology Inc. [n.d.]."},{"key":"e_1_2_1_14_1","unstructured":"The Khronos Group Inc. [n.d.]. OpenCL. 
Retrieved from https:\/\/www.khronos.org\/opencl\/."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228513"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168944"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304582"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913)","author":"Jog Adwait"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA\u201913)","author":"Jog Adwait"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.62"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.51"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture. IEEE, 367--378","author":"Lin Jiang","year":"2008"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370869"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065010.1065034"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2593069.2593137"},{"key":"e_1_2_1_26_1","volume-title":"Network and Parallel Computing","author":"Mi Wei"},{"key":"e_1_2_1_27_1","unstructured":"Micron. [n.d.]. Micron system power calculators. Retrieved from http:\/\/www.micron.com\/products\/support\/power-calc\/"},{"key":"e_1_2_1_28_1","unstructured":"Micron. [n.d.]. Micron TN-ED-01: GDDR5 SGRAM Introduction. 
Retrieved from http:\/\/www.micron.com\/products\/dram\/gddr5\/"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.40"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1394608.1382128"},{"key":"e_1_2_1_31_1","unstructured":"NVIDIA. [n.d.]. CUDA. Retrieved from http:\/\/www.nvidia.com\/object\/cuda_home_new.html\/."},{"key":"e_1_2_1_32_1","unstructured":"NVIDIA. [n.d.]. CUDA SDK. Retrieved from https:\/\/developer.nvidia.com\/cuda-downloads\/."},{"key":"e_1_2_1_33_1","unstructured":"NVIDIA. 2009. Nvidia Fermi Architecture. Retrieved from http:\/\/www.nvidia.com\/object\/fermi-architecture.html."},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA\u201900)","author":"Owens John D."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2014.2299539"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201914)","author":"Power Jason","year":"2014"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 45th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912)","author":"Rogers Timothy G.","year":"2012"},{"key":"e_1_2_1_38_1","volume-title":"Retrieved on","author":"Shimpi Anand Lal","year":"2012"},{"key":"e_1_2_1_39_1","volume-title":"Geng Daniel Liu, and W. M. W. 
Hwu","author":"Stratton John A.","year":"2012"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201910)","author":"Sung Jui"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CSE.2013.106"},{"key":"e_1_2_1_42_1","first-page":"4","article-title":"DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators","volume":"12","author":"Usui Hiroyuki","year":"2016","journal-title":"ACM Trans. Architect. Code Optim."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835945"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830813"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806606"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Yuan George L."}],"container-title":["ACM Journal on Emerging Technologies in Computing 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3330152","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3330152","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:26:23Z","timestamp":1750206383000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3330152"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,31]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,10,31]]}},"alternative-id":["10.1145\/3330152"],"URL":"https:\/\/doi.org\/10.1145\/3330152","relation":{},"ISSN":["1550-4832","1550-4840"],"issn-type":[{"value":"1550-4832","type":"print"},{"value":"1550-4840","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,31]]},"assertion":[{"value":"2019-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}