{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T12:14:55Z","timestamp":1763468095011,"version":"3.41.0"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2013,1,1]],"date-time":"2013-01-01T00:00:00Z","timestamp":1356998400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000143","name":"Division of Computing and Communication Foundations","doi-asserted-by":"publisher","award":["CCF-0905509"],"award-info":[{"award-number":["CCF-0905509"]}],"id":[{"id":"10.13039\/100000143","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2013,1]]},"abstract":"<jats:p>Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems is in their infancy.<\/jats:p>\n          <jats:p>In this article, we propose a new scheduling and workload balancing scheme, HDSS, for execution of loops having dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. 
Different from previous studies, our scheme uniquely considers the runtime effects of block sizes on the performance for heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain balanced workload while keeping the accelerator utilization at maximum. Our algorithm does not require offline training or architecture-specific parameters.<\/jats:p>\n          <jats:p>We have evaluated our scheme on two different heterogeneous architectures: AMD 64-core Bulldozer system with nVidia Fermi C2050 GPU and Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex 4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements up to over 200% when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization with all processors completing at nearly the same time which is significantly better than alternative current approaches.<\/jats:p>","DOI":"10.1145\/2400682.2400716","type":"journal-article","created":{"date-parts":[[2013,1,22]],"date-time":"2013-01-22T15:28:56Z","timestamp":1358868536000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":74,"title":["A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures"],"prefix":"10.1145","volume":"9","author":[{"given":"Mehmet E.","family":"Belviranli","sequence":"first","affiliation":[{"name":"University of California, Riverside"}]},{"given":"Laxmi N.","family":"Bhuyan","sequence":"additional","affiliation":[{"name":"University of California, Riverside"}]},{"given":"Rajiv","family":"Gupta","sequence":"additional","affiliation":[{"name":"University of California, 
Riverside"}]}],"member":"320","published-online":{"date-parts":[[2013,1,20]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03869-3_80"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2005.31"},{"volume-title":"Proceedings of the 15th International Parallel and Distributed Processing Symposium. IEEE, 791--801","author":"Banicescu I.","key":"e_1_2_1_3_1","unstructured":"Banicescu, I. and Velusamy, V. 2001. Performance of scheduling scientific applications with adaptive weighted factoring. In Proceedings of the 15th International Parallel and Distributed Processing Symposium. IEEE, 791--801."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.05.014"},{"volume-title":"Proceedings of the International Conference on Cluster Computing. IEEE Computer Society, 282","author":"Chronopoulos A.","key":"e_1_2_1_5_1","unstructured":"Chronopoulos, A., Benche, M., Grosu, D., and Andonie, R. 2001. A class of loop self-scheduling for heterogeneous clusters. In Proceedings of the International Conference on Cluster Computing. IEEE Computer Society, 282."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.v18:7"},{"volume-title":"Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06)","author":"Ciorba F.","key":"e_1_2_1_7_1","unstructured":"Ciorba, F., Andronikos, T., Riakiotakis, I., Chronopoulos, A., and Papakonstantinou, G. 2006. 
Dynamic multi phase scheduling for heterogeneous clusters. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06). IEEE."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/1322109.1322131"},{"volume-title":"Proceedings of the Cray User Group Meeting. 12","author":"Fahey M.","key":"e_1_2_1_9_1","unstructured":"Fahey, M., Alam, S., Dunigan, T., Vetter, J., and Worley, P. 2005. Early evaluation of the cray xd1. In Proceedings of the Cray User Group Meeting. 12."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/DATE.2005.29"},{"volume-title":"Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'07)","author":"Harris B.","key":"e_1_2_1_11_1","unstructured":"Harris, B., Jacob, A., Lancaster, J., Buhler, J., and Chamberlain, R. 2007. A banded Smith-Waterman fpga accelerator for mercury BLASTP. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'07). IEEE, 765--769."},{"volume-title":"Proceedings of the International European Conference on Parallel Processing (Euro-Par'10)","author":"Hermann E.","key":"e_1_2_1_12_1","unstructured":"Hermann, E., Raffin, B., Faure, F., Gautier, T., and Allard, J. 2010. 
Multi-GPU and multi-cpu parallelization for interactive physics simulations. In Proceedings of the International European Conference on Parallel Processing (Euro-Par'10). 235--246."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/237502.237576"},{"volume-title":"Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA'10)","author":"Lee J.","key":"e_1_2_1_14_1","unstructured":"Lee, J., Lee, J., Seo, S., Kim, J., Kim, S., and Sura, Z. 2010. Comic++: A software svm system for heterogeneous multicore accelerator clusters. In Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA'10). IEEE, 1--12."},{"volume-title":"Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA'10)","author":"Li T.","key":"e_1_2_1_15_1","unstructured":"Li, T., Brett, P., Knauerhase, R., Koufaty, D., Reddy, D., and Hahn, S. 2010. Operating system support for overlapping-isa heterogeneous multi-core architectures. In Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA'10). 
IEEE."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669121"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1987.5009495"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2007.70777"},{"key":"e_1_2_1_20_1","unstructured":"SGI. 2008. Sgi altix 4700. Delivering new levels of performance and flexibility. http:\/\/www.sgi.com\/products\/servers\/altix\/4000\/index.html"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09)","author":"Smith R.","key":"e_1_2_1_21_1","unstructured":"Smith, R., Goyal, N., Ormont, J., Sankaralingam, K., and Estan, C. 2009. Evaluating gpus for network packet signature matching. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). IEEE, 175--184."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2005.54"},{"volume-title":"Proceedings of the International Conference on Field Programmable Technology (FPT'10)","author":"Tse A.","key":"e_1_2_1_23_1","unstructured":"Tse, A., Thomas, D., Tsoi, K., and Luk, W. 2010. 
Dynamic scheduling monte-carlo framework for multi-accelerator heterogeneous clusters. In Proceedings of the International Conference on Field Programmable Technology (FPT'10). IEEE, 233--240."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1723112.1723134"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1926367.1926377"},{"key":"e_1_2_1_26_1","first-page":"785","article-title":"AES encryption and decryption on the gpu","volume":"3","author":"Yamanouchi T.","year":"2007","unstructured":"Yamanouchi, T. 2007. AES encryption and decryption on the gpu. GPU Gems 3, 785--803.","journal-title":"GPU Gems"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-005-0787-9"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-007-0146-0"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2008.19"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2400682.2400716","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2400682.2400716","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T08:18:52Z","timestamp":1750234732000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2400682.2400716"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,1]]},"references-count":29,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2013,1]]}},"alternative-id":["10.1145\/2400682.2400716"],"URL":"https:\/\/doi.org\/10.1145\/2400682.2400716","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2013,1]]},"assertion":[{"value":"2012-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-01-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}