{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T16:50:30Z","timestamp":1774284630199,"version":"3.50.1"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["92367103"],"award-info":[{"award-number":["92367103"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"King Saud University, Riyadh, Saudi Arabia, for supporting this work through the ongoing research funding program","award":["ORF-2026-493"],"award-info":[{"award-number":["ORF-2026-493"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,4,30]]},"abstract":"<jats:p>3D object detection plays a pivotal role in facilitating comprehensive scene understanding in autonomous driving systems. One of its key challenges is to achieve accurate perception in complex environments. Compared with LiDAR systems and stereo-vision approaches, monocular camera-based solutions are more cost-effective and easier to deploy. However, the absence of depth in monocular images hinders the accurate localization of 3D bounding boxes when only monocular images are used. This work proposes MonoLS, a monocular 3D object detection framework that incorporates lightweight multi-scale feature fusion and spatially-aware attention. It aims to address the challenge of missing depth information while achieving precise object localization. First, lightweight multi-scale feature fusion combines deep and shallow features. This design allows for effective multi-scale feature extraction without compromising real-time detection capabilities. 
Second, spatially-aware attention employs a dual-branch structure, with the spatial branch using a triplet attention to capture spatial details, and the context branch aggregating global context information through global attention. These two branches are subsequently fused to produce enhanced feature representations that preserve spatial distribution and semantic richness. Finally, experiments on the KITTI dataset demonstrate that our method outperforms the baseline, achieving a real-time inference speed of up to 67 FPS.<\/jats:p>","DOI":"10.1145\/3797273","type":"journal-article","created":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T13:10:46Z","timestamp":1772457046000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["MonoLS: Multi-Scale Feature Fusion and Spatially-Aware Attention for Monocular 3D Object Detection"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6861-9684","authenticated-orcid":false,"given":"Honghao","family":"Gao","sequence":"first","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5785-140X","authenticated-orcid":false,"given":"Dubin","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1454-2161","authenticated-orcid":false,"given":"Ye","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-4334-6711","authenticated-orcid":false,"given":"Zhihao","family":"Pan","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7210-0543","authenticated-orcid":false,"given":"Yueshen","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7479-7102","authenticated-orcid":false,"given":"Bader Fahad","family":"Alkhamees","sequence":"additional","affiliation":[{"name":"Department of Information Systems, King Saud University, Riyadh, Saudi Arabia"}]}],"member":"320","published-online":{"date-parts":[[2026,3,23]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3674979"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"2122","DOI":"10.1007\/s11263-023-01784-z","article-title":"Multi-modal 3D object detection in autonomous driving","volume":"131","author":"Wang Yingjie","year":"2023","unstructured":"Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Jiajun Deng, Yu Zhang, Jianmin Ji, Houqiang Li, and Yanyong Zhang. 2023. Multi-modal 3D object detection in autonomous driving. International Journal of Computer Vision 131 (2023), 2122\u20132152.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547859"},{"key":"e_1_3_1_5_2","first-page":"3537","article-title":"3D object detection from images for autonomous driving: A survey","author":"Ma Xinzhu","year":"2024","unstructured":"Xinzhu Ma, Wanli Ouyang, Andrea Simonelli, and Elisa Ricci. 2024. 3D object detection from images for autonomous driving: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2024), 3537\u20133556.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1109\/TICPS.2024.3427060","article-title":"Toward effective 3D object detection via multimodal fusion to automatic driving for industrial cyber-physical systems","volume":"2","author":"Gao Honghao","year":"2024","unstructured":"Honghao Gao, Yan Sun, Junsheng Xiao, Danqing Fang, Yueshen Xu, and Wei Wei. 2024. Toward effective 3D object detection via multimodal fusion to automatic driving for industrial cyber-physical systems. IEEE Transactions on Industrial Cyber-Physical Systems 2 (2024), 281\u2013291.","journal-title":"IEEE Transactions on Industrial Cyber-Physical Systems"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681420"},{"key":"e_1_3_1_8_2","first-page":"15641","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Liu Zongdai","year":"2021","unstructured":"Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. 2021. AutoShape: Real-time shape-aware monocular 3D object detection. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 15641\u201315650."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2024.3411159"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01169"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.5555\/2354409.2354978"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913491297"},{"issue":"2","key":"e_1_3_1_13_2","first-page":"1922","article-title":"CFPC: The curbed fake point collector to pseudo-LiDAR-based 3D object detection for autonomous vehicles","volume":"74","author":"Gao Honghao","year":"2024","unstructured":"Honghao Gao, Jie Shao, Muddesar Iqbal, Ye Wang, and Zhengzhe Xiang. 2024. CFPC: The curbed fake point collector to pseudo-LiDAR-based 3D object detection for autonomous vehicles. IEEE Transactions on Vehicular Technology 74, 2 (2024), 1922\u20131934.","journal-title":"IEEE Transactions on Vehicular Technology"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00864"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00845"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3703458"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.597"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00938"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01489"},{"key":"e_1_3_1_20_2","first-page":"2791","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Zhuoling","year":"2022","unstructured":"Zhuoling Li, Zhan Qu, Yang Zhou, Jianzhuang Liu, Haoqian Wang, and Lihui Jiang. 2022. Diversity matters: Fully exploiting depth clues for reliable monocular 3D object detection. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2791\u20132800."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00310"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNSE.2025.3541138"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3419842"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00060"},{"issue":"8","key":"e_1_3_1_26_2","first-page":"8831","article-title":"CAMRL: A joint method of channel attention and multidimensional regression loss for 3D object detection in automated vehicles","volume":"24","author":"Gao Honghao","year":"2022","unstructured":"Honghao Gao, Danqing Fang, Junsheng Xiao, Walayat Hussain, and Jung Yoon Kim. 2022. CAMRL: A joint method of channel attention and multidimensional regression loss for 3D object detection in automated vehicles. IEEE Transactions on Intelligent Transportation Systems 24, 8 (2022), 8831\u20138845.","journal-title":"IEEE Transactions on Intelligent Transportation Systems"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCE.2024.3353530"},{"key":"e_1_3_1_28_2","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 
30, 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3657298"},{"key":"e_1_3_1_30_2","first-page":"9155","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhang Renrui","year":"2023","unstructured":"Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. 2023. MonoDETR: Depth-guided transformer for monocular 3D object detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 9155\u20139166."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3241056"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00255"},{"key":"e_1_3_1_33_2","first-page":"12021","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Jierun","year":"2023","unstructured":"Jierun Chen, Shiu-Hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S.-H. Gary Chan. 2023. Run, don\u2019t walk: Chasing higher flops for faster neural networks. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 12021\u201312031."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00953"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00318"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2025.3525772"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2025.3563757"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_3_1_40_2","first-page":"1","article-title":"Enhancing monocular 3-D object detection through data augmentation strategies","volume":"73","author":"Jia Yisong","year":"2024","unstructured":"Yisong Jia, Jue Wang, Huihui Pan, and Weichao Sun. 2024. Enhancing monocular 3-D object detection through data augmentation strategies. IEEE Transactions on Instrumentation and Measurement 73 (2024), 1\u201311.","journal-title":"IEEE Transactions on Instrumentation and Measurement"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3660347"},{"key":"e_1_3_1_42_2","first-page":"664","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Kumar Abhinav","year":"2022","unstructured":"Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. 2022. DEVIANT: Depth equivariant network for monocular 3D object detection. In Proceedings of the European Conference on Computer Vision. Springer, 664\u2013683."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00398"},{"key":"e_1_3_1_44_2","first-page":"71","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Peng Liang","year":"2022","unstructured":"Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, and Deng Cai. 2022. DID-M3D: Decoupling instance depth for monocular 3D object detection. 
In Proceedings of the European Conference on Computer Vision. Springer, 71\u201388."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01264"},{"key":"e_1_3_1_46_2","doi-asserted-by":"crossref","first-page":"11703","DOI":"10.52202\/075280-0514","article-title":"MonoUNI: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues","volume":"36","author":"Jinrang Jia","year":"2023","unstructured":"Jia Jinrang, Zhenjia Li, and Yifeng Shi. 2023. MonoUNI: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. In Advances in Neural Information Processing Systems, Vol. 36, 11703\u201311715.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2023.3238137"},{"key":"e_1_3_1_48_2","first-page":"4842","volume-title":"2023 IEEE International Conference on Robotics and Automation (ICRA)","author":"Wu Zizhang","year":"2023","unstructured":"Zizhang Wu, Yuanzhu Gan, Lei Wang, Guilian Chen, and Jian Pu. 2023. MonoPGC: Monocular 3D object detection with pixel geometry contexts. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4842\u20134849."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00976"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2025.3544487"},{"key":"e_1_3_1_51_2","doi-asserted-by":"crossref","first-page":"6189","DOI":"10.1609\/aaai.v38i6.28436","article-title":"FD3D: Exploiting foreground depth map for feature-supervised monocular 3D object detection","volume":"38","author":"Wu Zizhang","year":"2024","unstructured":"Zizhang Wu, Yuanzhu Gan, Yunzhe Wu, Ruihao Wang, Xiaoquan Wang, and Jian Pu. 2024. FD3D: Exploiting foreground depth map for feature-supervised monocular 3D object detection. 
In Proceedings of the AAAI Conference on Artificial Intelligence 38, 6189\u20136197.","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2025.3544880"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00469"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3674838"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01155"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3797273","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T15:51:13Z","timestamp":1774281073000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3797273"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,23]]},"references-count":55,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2026,4,30]]}},"alternative-id":["10.1145\/3797273"],"URL":"https:\/\/doi.org\/10.1145\/3797273","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,23]]},"assertion":[{"value":"2025-08-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-27","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-03-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}