{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T13:05:01Z","timestamp":1758891901163,"version":"3.41.0"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2012,3,1]],"date-time":"2012-03-01T00:00:00Z","timestamp":1330560000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2012,3]]},"abstract":"<jats:p>Applications that use learning and classification algorithms operate on large amounts of unstructured data and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and the absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max\/min, and aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing, where on-chip memory blocks perform the secondary reduction operations. 
Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom-based energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application\u2019s performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations. We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38-84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.<\/jats:p>","DOI":"10.1145\/2133382.2133388","type":"journal-article","created":{"date-parts":[[2012,4,3]],"date-time":"2012-04-03T14:56:22Z","timestamp":1333464982000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":37,"title":["A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification"],"prefix":"10.1145","volume":"9","author":[{"given":"Abhinandan","family":"Majumdar","sequence":"first","affiliation":[{"name":"NEC Laboratories America, Inc."}]},{"given":"Srihari","family":"Cadambi","sequence":"additional","affiliation":[{"name":"NEC Laboratories America, Inc."}]},{"given":"Michela","family":"Becchi","sequence":"additional","affiliation":[{"name":"NEC Laboratories America, Inc."}]},{"given":"Srimat T.","family":"Chakradhar","sequence":"additional","affiliation":[{"name":"NEC Laboratories America, Inc."}]},{"given":"Hans Peter","family":"Graf","sequence":"additional","affiliation":[{"name":"NEC 
Laboratories America, Inc."}]}],"member":"320","published-online":{"date-parts":[[2012,3]]},"reference":[
{"key":"e_1_2_1_1_1","unstructured":"Alpha-Data. http:\/\/www.alpha-data.com\/products.php?product=adm-xrc-5t2."},
{"key":"e_1_2_1_2_1","unstructured":"AT3N7A-I. Specification. http:\/\/www.asus.com\/product.aspx?P_ID=xrR7wto9Z5BL42aU&templete=2."},
{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-009-9117-9"},
{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2004.65"},
{"key":"e_1_2_1_5_1","unstructured":"C2050\/C2070 Power. http:\/\/www.nvidia.com\/object\/product_tesla_C2050_C2070_us.html."},
{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2009.34"},
{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390170"},
{"volume-title":"Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition.","author":"Chellapilla K.","key":"e_1_2_1_8_1","unstructured":"Chellapilla, K., Puri, S., and Simard, P. 2006. High performance convolutional neural networks for document processing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition."},
{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390177"},
{"volume-title":"Proceedings of the International Conference on Pattern Recognition. 1--4.","author":"Cosatto E.","key":"e_1_2_1_10_1","unstructured":"Cosatto, E., Miller, M., Graf, H. P., and Meyer, J. 2008. Grading nuclear pleomorphism on histological micrographs. In Proceedings of the International Conference on Pattern Recognition. 1--4."},
{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1348246.1348248"},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345218"},
{"volume-title":"Proceedings of the Neural Information Processing Systems (NIPS). 529--536","author":"Graf H. P.","key":"e_1_2_1_13_1","unstructured":"Graf, H. P., Cadambi, S., Durdanovic, I., Jakkula, V., Sankaradass, M., Cosatto, E., and Chakradhar, S. T. 2008. A massively parallel digital learning processor. In Proceedings of the Neural Information Processing Systems (NIPS). 529--536."},
{"volume-title":"Proceedings of the ACM Workshop on General Purpose Computing on Graphics Processors (SIGGRAPH poster).","author":"Hall J. D.","key":"e_1_2_1_14_1","unstructured":"Hall, J. D., and Hart, J. C. 2004. GPU acceleration of iterative clustering. In Proceedings of the ACM Workshop on General Purpose Computing on Graphics Processors (SIGGRAPH poster)."},
{"key":"e_1_2_1_15_1","unstructured":"Intel Atom. http:\/\/ark.intel.com\/Product.aspx?id=35641."},
{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},
{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.37"},
{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},
{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the Berkeley Symposium on Mathematics Statistics and Probability. 281--297","author":"MacQueen J. B.","year":"1967","unstructured":"MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observation. In Proceedings of the Berkeley Symposium on Mathematics Statistics and Probability. 281--297."},
{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1291233.1291467"},
{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03767-2_10"},
{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1467-8659.2007.01012.x"},
{"volume-title":"Advances in Kernel Methods -- Support Vector Learning","author":"Platt J.","key":"e_1_2_1_23_1","unstructured":"Platt, J. 1999. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods -- Support Vector Learning. MIT Press, 997--1001."},
{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},
{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1816002"},
{"volume-title":"Proceedings of the 3rd Annual Reconfigurable Systems Summer Institute (RSSI\u201907)","author":"Rousseaux S.","key":"e_1_2_1_26_1","unstructured":"Rousseaux, S., Hubaux, D., Guisset, P., and Legat, J. 2007. A high performance FPGA-based accelerator for BLAS library implementation. In Proceedings of the 3rd Annual Reconfigurable Systems Summer Institute (RSSI\u201907)."},
{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2009.25"},
{"key":"e_1_2_1_28_1","unstructured":"Sato, A. and Yamada, K. 1995. Generalized learning vector quantization. In Neural Information Processing Systems. 423--429."},
{"key":"e_1_2_1_29_1","unstructured":"SeaMicro. http:\/\/gigaom.com\/2010\/01\/06\/seamicros-secret-server-changes-computing-economics\/."},
{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1399504.1360617"},
{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2002.997877"},
{"volume-title":"Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS XII). 1--5.","author":"Vasudevan V.","key":"e_1_2_1_32_1","unstructured":"Vasudevan, V., Franklin, J., Andersen, D., Phanishayee, A., Tan, L., Kaminsky, M., and Moraru, I. 2009. FAWNdamentally power-efficient clusters. In Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS XII). 1--5."},
{"key":"e_1_2_1_33_1","unstructured":"Watts-up Pro. https:\/\/www.wattsupmeters.com\/secure\/products.php?pn=0&wai=228&more=2."},
{"key":"e_1_2_1_34_1","unstructured":"X7460 Power. http:\/\/www.intel.com\/cd\/products\/services\/emea\/eng\/processors\/xeon7000\/343718.htm."},
{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2005.31"}
],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2133382.2133388","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2133382.2133388","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T10:06:05Z","timestamp":1750241165000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2133382.2133388"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,3]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2012,3]]}},"alternative-id":["10.1145\/2133382.2133388"],"URL":"https:\/\/doi.org\/10.1145\/2133382.2133388","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2012,3]]},"assertion":[{"value":"2010-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2011-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}