{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,4]],"date-time":"2025-10-04T00:39:30Z","timestamp":1759538370382,"version":"build-2065373602"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","funder":[{"name":"DOE award","award":["DESC0024458"],"award-info":[{"award-number":["DESC0024458"]}]},{"name":"Columbia Center of Artificial Intelligence Technology (CAIT) Award"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:p>Tiled accelerator architectures provide opportunities to optimize the performance of multi-model augmented and virtual reality (AR\/VR) applications through intra-layer parallelism and inter-layer pipelining. However, balancing these two strategies is a difficult task that demands a flexible architecture to deploy models and an optimization approach, that is, capable of selecting an optimal strategy from an enormous mapping space. This article presents FLIP2M, a holistic solution for mapping multi-model AR\/VR workloads on tiled architectures. FLIP2M consists of (1) FLIP, an acceleration fabric that supports a wide variety of optimizations through flexible on-chip communication, and (2) OASIS, an optimization framework based on dynamic and constraint programming, that is, capable of selecting an efficient strategy for mapping multi-model workloads onto FLIP. We demonstrate FLIP2M on an FPGA prototype of FLIP that features 36 accelerators and 7 DDR4 controllers. 
Using OASIS-generated mappings for three different multi-model AR\/VR workloads, FLIP2M achieves up to 1.94\u00d7 improvement in latency, 1.37\u00d7 in energy, and 2.59\u00d7 in energy-delay product relative to a FLIP baseline without intra-layer resource allocation flexibility and inter-layer pipelining.<\/jats:p>","DOI":"10.1145\/3762656","type":"journal-article","created":{"date-parts":[[2025,8,25]],"date-time":"2025-08-25T11:25:51Z","timestamp":1756121151000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["FLIP2M: Flexible Intra-layer Parallelism and Inter-layer Pipelining for Multi-model AR\/VR Workloads"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2590-0235","authenticated-orcid":false,"given":"Gabriele","family":"Tombesi","sequence":"first","affiliation":[{"name":"Computer Science, Columbia University","place":["New York, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2024-6542","authenticated-orcid":false,"given":"Je","family":"Yang","sequence":"additional","affiliation":[{"name":"Computer Science, Columbia University","place":["New York, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3081-1077","authenticated-orcid":false,"given":"Joseph","family":"Zuckerman","sequence":"additional","affiliation":[{"name":"Columbia University","place":["New York, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4101-4516","authenticated-orcid":false,"given":"Davide","family":"Giri","sequence":"additional","affiliation":[{"name":"Computer Science, Columbia University","place":["New York, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8161-0444","authenticated-orcid":false,"given":"William","family":"Baisi","sequence":"additional","affiliation":[{"name":"Columbia University","place":["New York, United 
States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5600-8931","authenticated-orcid":false,"given":"Luca","family":"Carloni","sequence":"additional","affiliation":[{"name":"Computer Science, Columbia University","place":["New York, United States"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,26]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00081"},{"key":"e_1_3_3_3_2","first-page":"1","volume-title":"Proceedings of the Design Automation Conf. (DAC)","author":"Blanco F.G.","year":"2024","unstructured":"F.G. Blanco, E. Russo, M. Palesi, D. Patti, G. Ascia, and V. Catania. 2024. A deep reinforcement learning based online scheduling policy for deep neural network multi-tenant multi-accelerator systems. In Proceedings of the Design Automation Conf. (DAC). 1\u20136."},{"key":"e_1_3_3_4_2","volume-title":"Proceedings of the Intl. Symp. on Computer Architecture (ISCA)","author":"Cai J.","year":"2023","unstructured":"J. Cai, Y. Wei, Z. Wu, S. Peng, and K. Ma. 2023. Inter-layer scheduling space definition and exploration for tiled accelerators. In Proceedings of the Intl. Symp. on Computer Architecture (ISCA). Article 13, 17 pages."},{"key":"e_1_3_3_5_2","first-page":"156","volume-title":"Proceedings of the 2024 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Cai J.","year":"2024","unstructured":"J. Cai, Z. Wu, S. Peng, Y. Wei, Z. Tan, G. Shi, M. Gao, and K. Ma. 2024. Gemini: Mapping and architecture co-exploration for large-scale DNN chiplet accelerators. In Proceedings of the 2024 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA). 156\u2013171."},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2015.2480849"},{"key":"e_1_3_3_7_2","first-page":"17:1\u201317:6","volume-title":"Proceedings of the Design Automation Conf. (DAC)","author":"Carloni L. P.","year":"2016","unstructured":"L. P. Carloni. 2016. 
The case for embedded scalable platforms. In Proceedings of the Design Automation Conf. (DAC). 17:1\u201317:6."},{"key":"e_1_3_3_8_2","first-page":"367","volume-title":"Proceedings of the 43rd Intl. Symp. on Computer Architecture (ISCA)","author":"Chen Y.","year":"2016","unstructured":"Y. Chen, J. Emer, and V. Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd Intl. Symp. on Computer Architecture (ISCA). 367\u2013379."},{"key":"e_1_3_3_9_2","first-page":"637","volume-title":"Proceedings of the Intl. Conf. on Computer Design (ICCD)","author":"Chiu K.-L.","year":"2024","unstructured":"K.-L. Chiu, G. Eichler, C.-T. Lin, G.-D. Guglielmo, and L.-P. Carloni. 2024. WOLT: Transparent deployment of ML workloads on lightweight many-accelerator architectures. In Proceedings of the Intl. Conf. on Computer Design (ICCD). 637\u2013644."},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2023.3299030"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","unstructured":"Yujeong Choi and Minsoo Rhu. 2020. PREMA: A predictive multi-task scheduling algorithm for preemptible neural processing units. In 2020 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA). 220\u2013233. DOI:10.1109\/HPCA47549.2020.00027","DOI":"10.1109\/HPCA47549.2020.00027"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","unstructured":"Steven Colleman and others. 2024. Optimizing layer-fused scheduling of transformer networks on multi-accelerator platforms. In 2024 25th Intl. Symp. on Quality Electronic Design (ISQED). 1\u20136. DOI:10.1109\/ISQED60706.2024.10528689","DOI":"10.1109\/ISQED60706.2024.10528689"},{"key":"e_1_3_3_13_2","volume-title":"Proceedings of the North American Chapter of the Association for Computational Linguistics","author":"Devlin J.","year":"2019","unstructured":"J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. 
BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614263"},{"key":"e_1_3_3_15_2","first-page":"751","volume-title":"Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Gao M.","year":"2017","unstructured":"M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis. 2017. Tetris: Scalable and efficient neural network acceleration with 3D memory. In Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 751\u2013764."},{"key":"e_1_3_3_16_2","first-page":"807","volume-title":"Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Gao M.","year":"2019","unstructured":"M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis. 2019. Tangram: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 807\u2013820."},{"key":"e_1_3_3_17_2","first-page":"681","volume-title":"Proceedings of the IEEE\/ACM Intl. Symp. on Microarchitecture (MICRO)","author":"Ghodrati S.","year":"2020","unstructured":"S. Ghodrati, B.-H. Ahn, J.-K. Kim, S. Kinzer, B.-R. Yatham, N. Alla, H. Sharma, M. Alian, E. Ebrahimi, N.-S. Kim, C. Young, and H. Esmaeilzadeh. 2020. Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks. In Proceedings of the IEEE\/ACM Intl. Symp. on Microarchitecture (MICRO). 681\u2013697."},{"key":"e_1_3_3_18_2","first-page":"1","volume-title":"Proceedings of the Design, Automation, and Test in Europe Conf. (DATE)","author":"Glint T.","year":"2024","unstructured":"T. Glint, M. Pechimuthu, and J. Mekie. 
2024. DeepFrack: A comprehensive framework for layer fusion, face tiling, and efficient mapping in DNN hardware accelerators. In Proceedings of the Design, Automation, and Test in Europe Conf. (DATE). 1\u20136."},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3643832.3661878"},{"key":"e_1_3_3_20_2","first-page":"770","volume-title":"Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)","author":"He K.","year":"2016","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 770\u2013778."},{"key":"e_1_3_3_21_2","unstructured":"F. N. Iandola and others. 2016. SqueezeNet: AlexNet-level accuracy with 50\u00d7 fewer parameters and <0.5MB model size. ArXiv abs\/1602.07360 (2016). Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:14136028"},{"key":"e_1_3_3_22_2","first-page":"814","volume-title":"Proceedings of the IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Kao S.","year":"2022","unstructured":"S. Kao and T. Krishna. 2022. MAGMA: An optimization framework for mapping multiple DNNs on multiple accelerator cores. In Proceedings of the IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA). 814\u2013830."},{"key":"e_1_3_3_23_2","first-page":"1","volume-title":"Proceedings of the Design, Automation & Test in Europe Conf. (DATE)","author":"Karl S.","year":"2023","unstructured":"S. Karl, A. Symons, N. Fasfous, and M. Verhelst. 2023. Genetic algorithm-based framework for layer-fused scheduling of multiple DNNs on multi-core systems. In Proceedings of the Design, Automation & Test in Europe Conf. (DATE). 1\u20136."},{"key":"e_1_3_3_24_2","first-page":"1","volume-title":"Proceedings of the Design Automation Conf. (DAC)","author":"Khailany B.","year":"2018","unstructured":"B. Khailany, E. Krimer, R. Venkatesan, J. Clemons, J.-S. Emer, M. Fojtik, A. 
Klinefelter, M. Pellauer, N. Pinckney, Y.-S. Shao, S. Srinath, C. Torng, S.-L. Xi, Y. Zhang, and B. Zimmer. 2018. INVITED: A modular digital VLSI flow for high-productivity SoC design. In Proceedings of the Design Automation Conf. (DAC). 1\u20136."},{"key":"e_1_3_3_25_2","first-page":"62","volume-title":"Proceedings of the Intl. Symp. on Microarchitecture (MICRO)","author":"Kim S.","year":"2023","unstructured":"S. Kim, J. Zhao, K. Asanovic, B. Nikolic, and Y.-S. Shao. 2023. AuRORA: Virtualized accelerator orchestration for multi-tenant workloads. In Proceedings of the Intl. Symp. on Microarchitecture (MICRO). 62\u201376."},{"key":"e_1_3_3_26_2","first-page":"73","volume-title":"Proceedings of the Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Kim S.","year":"2023","unstructured":"S. Kim, H. Kwon, J. Song, J. Jo, Y.-H. Chen, L. Lai, and V. Chandra. 2023. DREAM: A dynamic scheduler for dynamic real-time multi-model ML workloads. In Proceedings of the Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 73\u201386."},{"key":"e_1_3_3_27_2","first-page":"828","volume-title":"Proceedings of the Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Kim S.","year":"2023","unstructured":"S. Kim, H. Genc, V.-V. Nikiforov, K. Asanovic, B. Nikolic, and Y.-S. Shao. 2023. MoCA: Memory-centric, adaptive execution for multi-tenant deep neural networks. In Proceedings of the Intl. Symp. on High-Performance Computer Architecture (HPCA). 828\u2013841."},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2985963"},{"key":"e_1_3_3_29_2","first-page":"71","volume-title":"Proceedings of the Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Kwon H.","year":"2021","unstructured":"H. Kwon, L. Lai, M. Pellauer, T. Krishna, Y.-H. Chen, and V. Chandra. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. 
In Proceedings of the Intl. Symp. on High-Performance Computer Architecture (HPCA). 71\u201383."},{"key":"e_1_3_3_30_2","first-page":"1","article-title":"XRBench: An extended reality (XR) machine learning benchmark suite for the metaverse","volume":"5","author":"Kwon H.","year":"2023","unstructured":"H. Kwon, K. Nair, J. Seo, J. Yik, D. Mohapatra, D. Zhan, J. Song, P. Capak, P. Zhang, P. Vajda, C. Banbury, M. Mazumder, L. Lai, A. Sirasao, T. Krishna, H. Khaitan, V. Chandra, and V.-J. Reddi. 2023. XRBench: An extended reality (XR) machine learning benchmark suite for the metaverse. Proc. of Machine Learning and Systems 5 (2023), 1\u201320.","journal-title":"Proc. of Machine Learning and Systems"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2022.3197493"},{"key":"e_1_3_3_32_2","unstructured":"L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst. 2020. ZigZag: A memory-centric rapid DNN accelerator design space exploration framework. arXiv:2007.11360. Retrieved from https:\/\/arxiv.org\/abs\/2007.11360 (2020)."},{"key":"e_1_3_3_33_2","volume-title":"Proceedings of the Intl. Conf. on Computer-Aided Design","author":"Mantovani P.","year":"2020","unstructured":"P. Mantovani, D. Giri, G.-D. Guglielmo, L. Piccolboni, J. Zuckerman, E.-G. Cota, M. Petracca, C. Pilato, and L.-P. Carloni. 2020. Agile SoC development with open ESP. In Proceedings of the Intl. Conf. on Computer-Aided Design."},{"key":"e_1_3_3_34_2","first-page":"570","volume-title":"Proceedings of the 2023 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Mei L.","year":"2023","unstructured":"L. Mei, K. Goetschalckx, A. Symons, and M. Verhelst. 2023. DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling. In Proceedings of the 2023 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA). 570\u2013583."},{"key":"e_1_3_3_35_2","unstructured":"NVIDIA. 2022. 
Multi-Instance GPU User Guide. Retrieved from https:\/\/docs.nvidia.com\/datacenter\/tesla\/mig-user-guide\/index.html. (2022)."},{"key":"e_1_3_3_36_2","first-page":"565","volume-title":"Proceedings of the Intl. Symp. on Microarchitecture (MICRO)","author":"Odema M.","year":"2024","unstructured":"M. Odema, L. Chen, H. Kwon, and M.-A.-A. Faruque. 2024. SCAR: Scheduling multi-model AI workloads on heterogeneous multi-chiplet module accelerators. In Proceedings of the Intl. Symp. on Microarchitecture (MICRO). 565\u2013579."},{"key":"e_1_3_3_37_2","first-page":"584","volume-title":"Proceedings of the 2021 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA)","author":"Oh Y.","year":"2021","unstructured":"Y. Oh, S. Kim, Y. Jin, S. Son, J. Bae, J. Lee, Y. Park, D.-U. Kim, T.-J. Ham, and J.-W. Lee. 2021. Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling. In Proceedings of the 2021 IEEE Intl. Symp. on High-Performance Computer Architecture (HPCA). 584\u2013597."},{"key":"e_1_3_3_38_2","first-page":"304","volume-title":"Proceedings of the 2019 IEEE Intl. Symp. on Performance Analysis of Systems and Software (ISPASS)","author":"Parashar A.","year":"2019","unstructured":"A. Parashar, P. Raina, Y.-S. Shao, Y.-H. Chen, V.-A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S.-W. Keckler, and J. Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the 2019 IEEE Intl. Symp. on Performance Analysis of Systems and Software (ISPASS). 304\u2013315."},{"key":"e_1_3_3_39_2","first-page":"234","volume-title":"Proceedings of the Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI)","author":"Ronneberger O.","year":"2015","unstructured":"O. Ronneberger, P. Fischer, and T. Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI). 
234\u2013241."},{"key":"e_1_3_3_40_2","first-page":"4510","volume-title":"Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)","author":"Sandler M.","year":"2018","unstructured":"M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 4510\u20134520."},{"key":"e_1_3_3_41_2","unstructured":"K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd Intl. Conf. on Learning Representations ICLR 2015. http:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_3_42_2","first-page":"56","volume-title":"Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA)","author":"Song L.","year":"2019","unstructured":"L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen. 2019. Hypar: Towards hybrid parallelism for deep learning accelerator array. In Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA). 56\u201368."},{"key":"e_1_3_3_43_2","first-page":"342","volume-title":"Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA)","author":"Song L.","year":"2020","unstructured":"L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen. 2020. Accpar: Tensor partitioning for heterogeneous deep learning accelerators. In Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA). 342\u2013355."},{"key":"e_1_3_3_44_2","article-title":"MobileBERT: A compact task-agnostic BERT for resource-limited devices","author":"Sun Z.","year":"2020","unstructured":"Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. arXiv:2004.02984. Retrieved from https:\/\/arxiv.org\/abs\/2004.02984 (2020).","journal-title":"arXiv:2004.02984"},{"key":"e_1_3_3_45_2","doi-asserted-by":"crossref","unstructured":"A. 
Symons and others. 2025. Stream: Design space exploration of layer-fused DNNs on heterogeneous dataflow accelerators. IEEE Trans. on Computers 74 1 (2025) 237\u2013249.","DOI":"10.1109\/TC.2024.3477938"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1287\/mnsc.28.10.1197"},{"key":"e_1_3_3_47_2","first-page":"1","volume-title":"Proceedings of the Intl. Conf. on Computer-Aided Design (ICCAD)","author":"Venkatesan R.","year":"2019","unstructured":"R. Venkatesan, Y.-S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, Y. Zhang, B. Zimmer, W.-J. Dally, J. Emer, S.-W. Keckler, and B. Khailany. 2019. MAGNet: A modular accelerator generator for neural networks. In Proceedings of the Intl. Conf. on Computer-Aided Design (ICCAD). 1\u20138."},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMC.2024.3355764"},{"key":"e_1_3_3_49_2","first-page":"331","volume-title":"Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA)","author":"Wu C.J.","year":"2019","unstructured":"C.J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, and M. Dukhan. 2019. Machine learning at facebook: Understanding inference at the edge. In Proceedings of the Intl. Symp. on High Performance Computer Architecture (HPCA). 331\u2013344."},{"key":"e_1_3_3_50_2","first-page":"1","volume-title":"Proceedings of the 2019 IEEE\/ACM Intl. Conf. on Computer-Aided Design (ICCAD)","author":"Wu Y.","year":"2019","unstructured":"Y. Wu, J. S. Emer, and V. Sze. 2019. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In Proceedings of the 2019 IEEE\/ACM Intl. Conf. on Computer-Aided Design (ICCAD). 1\u20138."},{"key":"e_1_3_3_51_2","first-page":"349","volume-title":"Proceedings of the European Solid State Circuits Conf. (ESSERC)","author":"Yang J.","year":"2023","unstructured":"J. Yang, S. Lim, S. Lee, J.-Y. Kim, and J.-Y. Kim. 2023. 
JNPU: A 1.04 TFLOPS Joint-DNN training processor with speculative cyclic quantization and triple heterogeneity on microarchitecture\/precision\/dataflow. In Proceedings of the European Solid State Circuits Conf. (ESSERC). 349\u2013352."},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2013.2276399"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2019.2926114"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2022.3214113"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2024.3438549"},{"key":"e_1_3_3_56_2","first-page":"601","volume-title":"Proceedings of the Design Automation Conf. (DAC)","author":"Zhang X.","year":"2022","unstructured":"X. Zhang, C. Hao, P. Zhou, A. Jones, and J. Hu. 2022. H2H: Heterogeneous model to heterogeneous system mapping with computation and communication awareness. In Proceedings of the Design Automation Conf. (DAC). 601\u2013606."}],"container-title":["ACM Transactions on Embedded Computing 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3762656","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,3]],"date-time":"2025-10-03T14:05:47Z","timestamp":1759500347000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3762656"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,26]]},"references-count":55,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1145\/3762656"],"URL":"https:\/\/doi.org\/10.1145\/3762656","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2025,9,26]]},"assertion":[{"value":"2025-08-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-26","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}