{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:13:17Z","timestamp":1755839597389,"version":"3.41.0"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,11]],"date-time":"2024-05-11T00:00:00Z","timestamp":1715385600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSF CCRI","award":["2016727"],"award-info":[{"award-number":["2016727"]}]},{"name":"NSF RTML","award":["1937592"],"award-info":[{"award-number":["1937592"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2024,5,31]]},"abstract":"<jats:p>\n            Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks, and recently attracted a lot of research attention. Many recent works have focused on proposing new architectures to improve ViT and deploying it into real-world applications. However, little effort has been made to analyze and understand ViT\u2019s architecture design space and its implication for hardware costs on different devices. In this work, by simply scaling ViT\u2019s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve comparable or superior accuracy-efficiency trade-off than most of the latest ViT variants. Specifically, compared with DeiT-Tiny, our scaled model achieves a \u2191 1.9% higher ImageNet top-1 accuracy under the same FLOPs and a \u2191 3.7% better ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1)\n            <jats:italic>can these scaling strategies be transferred across different real hardware devices<\/jats:italic>\n            ? and (2)\n            <jats:italic>can these scaling strategies be transferred to different ViT variants and tasks<\/jats:italic>\n            ?. For (1), our exploration, based on various devices with different resource budgets, indicates that the transferability effectiveness depends on the underlying device together with its corresponding deployment tool. For (2), we validate the effective transferability of the aforementioned scaling strategies obtained from a vanilla ViT model on top of an image classification task to the PiT model, a strong ViT variant targeting efficiency as well as object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies lead to a boosted ImageNet top-1 accuracy of from 74.6% to 76.7% (\u2191 2.1%) under the same 0.7G FLOPs. 
When transferred to the COCO object detection task, the average precision is boosted by \u2191 0.7% under a similar throughput on a V100 GPU.\n          <\/jats:p>","DOI":"10.1145\/3611387","type":"journal-article","created":{"date-parts":[[2023,8,21]],"date-time":"2023-08-21T12:16:36Z","timestamp":1692620196000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["An Investigation on Hardware-Aware Vision Transformer Scaling"],"prefix":"10.1145","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4030-9777","authenticated-orcid":false,"given":"Chaojian","family":"Li","sequence":"first","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5167-0683","authenticated-orcid":false,"given":"Kyungmin","family":"Kim","sequence":"additional","affiliation":[{"name":"University of California, Irvine, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2649-5561","authenticated-orcid":false,"given":"Bichen","family":"Wu","sequence":"additional","affiliation":[{"name":"Meta, San Francisco, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7128-191X","authenticated-orcid":false,"given":"Peizhao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Meta, San Francisco, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7816-4238","authenticated-orcid":false,"given":"Hang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Cruise, San Francisco, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3098-2714","authenticated-orcid":false,"given":"Xiaoliang","family":"Dai","sequence":"additional","affiliation":[{"name":"Meta, San Francisco, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2031-4678","authenticated-orcid":false,"given":"Peter","family":"Vajda","sequence":"additional","affiliation":[{"name":"Meta, San Francisco, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5946-203X","authenticated-orcid":false,"given":"Yingyan (Celine)","family":"Lin","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, United States"}]}],"member":"320","published-online":{"date-parts":[[2024,5,11]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"508","article-title":"The Kendall rank correlation coefficient","author":"Abdi Herv\u00e9","year":"2007","unstructured":"Herv\u00e9 Abdi. 2007. The Kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA (2007), 508\u2013510.","journal-title":"Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA"},{"key":"e_1_3_2_3_2","unstructured":"Junjie Bai, Fang Lu, Ke Zhang, et\u00a0al. 2019. ONNX: Open Neural Network Exchange. https:\/\/github.com\/onnx\/onnx"},{"key":"e_1_3_2_4_2","article-title":"Revisiting ResNets: Improved training and scaling strategies","author":"Bello Irwan","year":"2021","unstructured":"Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. 2021. Revisiting ResNets: Improved training and scaling strategies. 
arXiv preprint arXiv:2103.07579 (2021).","journal-title":"arXiv preprint arXiv:2103.07579"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00338"},{"key":"e_1_3_2_6_2","article-title":"Is space-time attention all you need for video understanding?","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095 (2021).","journal-title":"arXiv preprint arXiv:2102.05095"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_8_2","article-title":"CrossViT: Cross-attention multi-scale vision transformer for image classification","author":"Chen Chun-Fu","year":"2021","unstructured":"Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. 2021. CrossViT: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899 (2021).","journal-title":"arXiv preprint arXiv:2103.14899"},{"key":"e_1_3_2_9_2","series-title":"Proceedings of Machine Learning Research","first-page":"1691","volume-title":"Proceedings of the 37th International Conference on Machine Learning","volume":"119","author":"Chen Mark","year":"2020","unstructured":"Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daum\u00e9 III and Aarti Singh (Eds.). PMLR, 1691\u20131703. http:\/\/proceedings.mlr.press\/v119\/chen20s.html"},{"key":"e_1_3_2_10_2","unstructured":"Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting Spatial Attention Design in Vision Transformers. arXiv:2104.13840 [cs.CV]"},{"key":"e_1_3_2_11_2","article-title":"FBNetV3: Joint architecture-recipe search using neural acquisition function","volume":"2006","author":"Dai Xiaoliang","year":"2020","unstructured":"Xiaoliang Dai, Alvin Wan, P. Zhang, B. Wu, Zijian He, Zhen Wei, K. Chen, Yuandong Tian, Matthew E. Yu, P\u00e9ter Vajda, and J. Gonzalez. 2020. FBNetV3: Joint architecture-recipe search using neural acquisition function. ArXiv abs\/2006.02049 (2020).","journal-title":"ArXiv"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_13_2","volume-title":"International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations."},{"key":"e_1_3_2_14_2","first-page":"10480","article-title":"BRP-NAS: Prediction-based NAS using GCNs","volume":"33","author":"Dudziak Lukasz","year":"2020","unstructured":"Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane. 2020. BRP-NAS: Prediction-based NAS using GCNs. 
Advances in Neural Information Processing Systems 33 (2020), 10480\u201310490.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_15_2","article-title":"Multiscale vision transformers","author":"Fan Haoqi","year":"2021","unstructured":"Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021).","journal-title":"arXiv preprint arXiv:2104.11227"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"e_1_3_2_17_2","unstructured":"Francisco Massa. 2021. Script to Calculate the Throughput for DeiT. Retrieved May 1 2021 from https:\/\/gist.github.com\/fmassa\/1f4edb34ca041634c9b730473753b8ad"},{"key":"e_1_3_2_18_2","unstructured":"Google LLC. 2020. Performance Measurement. Retrieved May 21 2021 from https:\/\/www.tensorflow.org\/lite\/performance\/measurement"},{"key":"e_1_3_2_19_2","unstructured":"Google LLC. 2020. Pixel3 Mobile Phone. Retrieved September 1 2020 from https:\/\/g.co\/kgs\/pVRc1Y"},{"key":"e_1_3_2_20_2","unstructured":"Google LLC. 2020. TensorFlow Lite: Deploy Machine Learning Models on Mobile and IoT Devices. Retrieved November 21 2019 from https:\/\/www.tensorflow.org\/lite"},{"key":"e_1_3_2_21_2","article-title":"LeViT: A vision transformer in ConvNet\u2019s clothing for faster inference","author":"Graham Ben","year":"2021","unstructured":"Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Herv\u00e9 J\u00e9gou, and Matthijs Douze. 2021. LeViT: A vision transformer in ConvNet\u2019s clothing for faster inference. arXiv preprint arXiv:2104.01136 (2021).","journal-title":"arXiv preprint arXiv:2104.01136"},{"key":"e_1_3_2_22_2","first-page":"1157","article-title":"An introduction to variable and feature selection","volume":"3","author":"Guyon Isabelle","year":"2003","unstructured":"Isabelle Guyon and Andre Elisseeff. 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3 (March 2003), 1157\u20131182.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_24_2","article-title":"Scaling laws for autoregressive generative modeling","author":"Henighan Tom","year":"2020","unstructured":"Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, et\u00a0al. 2020. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701 (2020).","journal-title":"arXiv preprint arXiv:2010.14701"},{"key":"e_1_3_2_25_2","article-title":"AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights","author":"Heo Byeongho","year":"2020","unstructured":"Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. 2020. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights. arXiv preprint arXiv:2006.08217 (2020).","journal-title":"arXiv preprint arXiv:2006.08217"},{"key":"e_1_3_2_26_2","article-title":"Rethinking spatial dimensions of vision transformers","author":"Heo Byeongho","year":"2021","unstructured":"Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. 2021. Rethinking spatial dimensions of vision transformers. 
arXiv preprint arXiv:2103.16302 (2021).","journal-title":"arXiv preprint arXiv:2103.16302"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_29_2","unstructured":"NVIDIA Inc. 2020. NVIDIA Jetson TX2. Retrieved September 1 2020 from https:\/\/www.nvidia.com\/en-us\/autonomous-machines\/embedded-systems\/jetson-tx2\/"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.574797"},{"key":"e_1_3_2_31_2","article-title":"Scaling laws for neural language models","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).","journal-title":"arXiv preprint arXiv:2001.08361"},{"key":"e_1_3_2_32_2","article-title":"The kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et\u00a0al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).","journal-title":"arXiv preprint arXiv:1705.06950"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_29"},{"key":"e_1_3_2_34_2","article-title":"HW-NAS-Bench: Hardware-aware neural architecture search benchmark","author":"Li Chaojian","year":"2021","unstructured":"Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, and Yingyan Lin. 2021. HW-NAS-Bench: Hardware-aware neural architecture search benchmark. arXiv preprint arXiv:2103.10584 (2021).","journal-title":"arXiv preprint arXiv:2103.10584"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00060"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_37_2","article-title":"Swin transformer: Hierarchical vision transformer using shifted windows","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021).","journal-title":"arXiv preprint arXiv:2103.14030"},{"key":"e_1_3_2_38_2","article-title":"Decoupled weight decay regularization","author":"Loshchilov Ilya","year":"2017","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).","journal-title":"arXiv preprint arXiv:1711.05101"},{"key":"e_1_3_2_39_2","unstructured":"Maxim Lukiyanov, Guoliang Hua, Geeta Chauhan, and Gisle Dankel. 2020. Introducing PyTorch Profiler \u2014 the New and Improved Performance Tool. Retrieved May 21 2022 from https:\/\/pytorch.org\/blog\/introducing-pytorch-profiler-the-new-and-improved-performance-tool\/"},{"key":"e_1_3_2_40_2","unstructured":"NVIDIA Inc. 2020. Performance Tuning - Maximizing Performance. Retrieved September 1 2020 from https:\/\/developer.ridgerun.com\/wiki\/index.php?title=Xavier\/JetPack_4.1\/Performance_Tuning\/Maximizing_Performance"},{"key":"e_1_3_2_41_2","unstructured":"NVIDIA Inc. 2020. TensorRT Command-Line Wrapper: Trtexec. 
Retrieved May 21 2021 from https:\/\/github.com\/NVIDIA\/TensorRT\/tree\/master\/samples\/opensource\/trtexec"},{"key":"e_1_3_2_42_2","unstructured":"NVIDIA Inc. 2020. TensorRT Open Source Software. Retrieved September 1 2020 from https:\/\/github.com\/NVIDIA\/TensorRT"},{"key":"e_1_3_2_43_2","unstructured":"NVIDIA LLC. 2020. NVIDIA V100 Tensor Core GPU. Retrieved September 1 2020 from https:\/\/www.nvidia.com\/en-us\/data-center\/v100\/"},{"key":"e_1_3_2_44_2","first-page":"8026","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et\u00a0al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8026\u20138037."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"e_1_3_2_46_2","first-page":"4079","article-title":"Speedy performance estimation for neural architecture search","volume":"34","author":"Ru Robin","year":"2021","unstructured":"Robin Ru, Clare Lyle, Lisa Schut, Miroslav Fil, Mark van der Wilk, and Yarin Gal. 2021. Speedy performance estimation for neural architecture search. Advances in Neural Information Processing Systems 34 (2021), 4079\u20134092.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00474"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2018.00101"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.97"},{"key":"e_1_3_2_50_2","first-page":"6105","volume-title":"International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105\u20136114."},{"key":"e_1_3_2_51_2","article-title":"Training data-efficient image transformers & distillation through attention","author":"Touvron Hugo","year":"2020","unstructured":"Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2020. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877 (2020).","journal-title":"arXiv preprint arXiv:2012.12877"},{"key":"e_1_3_2_52_2","article-title":"Going deeper with image transformers","author":"Touvron Hugo","year":"2021","unstructured":"Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Herv\u00e9 J\u00e9gou. 2021. Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021).","journal-title":"arXiv preprint arXiv:2103.17239"},{"key":"e_1_3_2_53_2","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
arXiv preprint arXiv:1706.03762 (2017).","journal-title":"arXiv preprint arXiv:1706.03762"},{"key":"e_1_3_2_54_2","article-title":"Pyramid vision transformer: A versatile backbone for dense prediction without convolutions","author":"Wang Wenhai","year":"2021","unstructured":"Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021).","journal-title":"arXiv preprint arXiv:2102.12122"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8794182"},{"key":"e_1_3_2_56_2","article-title":"Visual transformers: Token-based image representation and processing for computer vision","author":"Wu Bichen","year":"2020","unstructured":"Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 (2020).","journal-title":"arXiv preprint arXiv:2006.03677"},{"key":"e_1_3_2_57_2","article-title":"CvT: Introducing convolutions to vision transformers","author":"Wu Haiping","year":"2021","unstructured":"Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. 2021. CvT: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808 (2021).","journal-title":"arXiv preprint arXiv:2103.15808"},{"key":"e_1_3_2_58_2","unstructured":"Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https:\/\/github.com\/facebookresearch\/detectron2"},{"key":"e_1_3_2_59_2","article-title":"MobileDets: Searching for object detection architectures for mobile accelerators","author":"Xiong Yunyang","year":"2020","unstructured":"Yunyang Xiong, Hanxiao Liu, Suyog Gupta, Berkin Akin, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Vikas Singh, and Bo Chen. 2020. MobileDets: Searching for object detection architectures for mobile accelerators. arXiv preprint arXiv:2004.14525 (2020).","journal-title":"arXiv preprint arXiv:2004.14525"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00207"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58571-6_41"},{"key":"e_1_3_2_62_2","article-title":"Tokens-to-token ViT: Training vision transformers from scratch on ImageNet","author":"Yuan Li","year":"2021","unstructured":"Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986 (2021).","journal-title":"arXiv preprint arXiv:2101.11986"},{"key":"e_1_3_2_63_2","article-title":"Towards automated deep learning: Efficient joint neural architecture and hyperparameter search","author":"Zela Arber","year":"2018","unstructured":"Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. 2018. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906 (2018).","journal-title":"arXiv preprint arXiv:1807.06906"},{"key":"e_1_3_2_64_2","article-title":"Scaling vision transformers","author":"Zhai Xiaohua","year":"2021","unstructured":"Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2021. Scaling vision transformers. 
arXiv preprint arXiv:2106.04560 (2021).","journal-title":"arXiv preprint arXiv:2106.04560"},{"key":"e_1_3_2_65_2","article-title":"ResNeSt: Split-attention networks","author":"Zhang Hang","year":"2020","unstructured":"Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, et\u00a0al. 2020. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955 (2020).","journal-title":"arXiv preprint arXiv:2004.08955"},{"key":"e_1_3_2_66_2","article-title":"DeepViT: Towards deeper vision transformer","author":"Zhou Daquan","year":"2021","unstructured":"Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Qibin Hou, and Jiashi Feng. 2021. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021).","journal-title":"arXiv preprint arXiv:2103.11886"},{"key":"e_1_3_2_67_2","article-title":"Deformable DETR: Deformable transformers for end-to-end object detection","author":"Zhu Xizhou","year":"2020","unstructured":"Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).","journal-title":"arXiv preprint arXiv:2010.04159"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3611387","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3611387","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:09Z","timestamp":1750178229000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3611387"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,11]]},"references-count":66,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,31]]}},"alternative-id":["10.1145\/3611387"],"URL":"https:\/\/doi.org\/10.1145\/3611387","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2024,5,11]]},"assertion":[{"value":"2022-09-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-04","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
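The record above is bibliographic metadata, but the core recipe in its abstract (sweep a vanilla ViT's depth, width, and input size, then compare the variants under a shared FLOPs or latency budget) is simple enough to illustrate. Below is a minimal, hypothetical Python sketch of such a sweep using the standard analytic FLOPs estimate for Transformer blocks. The candidate grid, the budget, and every name in the code are illustrative assumptions rather than values from the paper; a real study, as the paper and its references describe, would train each variant and measure latency on the target device (e.g., with TensorRT's trtexec on a Jetson TX2) instead of relying on FLOPs alone.

```python
# Minimal sketch (not the authors' code): enumerate scaled vanilla-ViT configs
# over depth / width / input size and rank the ones that fit a FLOPs budget.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ViTConfig:
    depth: int        # number of Transformer blocks
    width: int        # embedding dimension d
    img_size: int     # input resolution (square images assumed)
    patch_size: int = 16

    @property
    def num_tokens(self) -> int:
        # (H/P) * (W/P) patch tokens plus one class token
        return (self.img_size // self.patch_size) ** 2 + 1

    def params(self) -> int:
        # Per block: ~4d^2 (QKV + output projections) + ~8d^2 (MLP, ratio 4)
        return self.depth * 12 * self.width ** 2

    def flops(self) -> int:
        # Per block, batch size 1: 12*N*d^2 for the linear layers plus
        # 2*N^2*d for the QK^T and attention-weighted-V matmuls.
        n, d = self.num_tokens, self.width
        return self.depth * (12 * n * d ** 2 + 2 * n ** 2 * d)

# Hypothetical candidate grid and budget (roughly DeiT-Tiny scale).
depths, widths, img_sizes = [12, 14, 16], [192, 256, 320], [160, 192, 224]
budget = 1.3e9

candidates = [ViTConfig(d, w, s) for d, w, s in product(depths, widths, img_sizes)]
feasible = sorted((c for c in candidates if c.flops() <= budget),
                  key=lambda c: c.flops(), reverse=True)
for c in feasible[:5]:
    print(f"depth={c.depth:2d} width={c.width:3d} img={c.img_size:3d} "
          f"GFLOPs={c.flops() / 1e9:.2f} params(M)={c.params() / 1e6:.1f}")
```

One design note consistent with the paper's findings: because the best-ranked configuration under a FLOPs budget need not be the fastest on a given device, the final ranking key would ideally be measured latency from the device's own deployment tool rather than the analytic estimate above.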