{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T15:07:08Z","timestamp":1779030428167,"version":"3.51.4"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T00:00:00Z","timestamp":1778544000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62372188"],"award-info":[{"award-number":["62372188"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003453","name":"GuangDong Natural Science Foundation","doi-asserted-by":"crossref","award":["2024A1515010100"],"award-info":[{"award-number":["2024A1515010100"]}],"id":[{"id":"10.13039\/501100003453","id-type":"DOI","asserted-by":"crossref"}]},{"name":"China National Key R&D Program","award":["2023YFE0202700"],"award-info":[{"award-number":["2023YFE0202700"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Cyber-Phys. Syst."],"published-print":{"date-parts":[[2026,7,31]]},"abstract":"<jats:p>\n                    Roadside cameras effectively enhance the perception capabilities of embodied artificial intelligence systems such as vehicles by compensating for the limitations of vehicle-mounted cameras, which are prone to occlusion and have a limited sensing range, thereby improving the safety of autonomous vehicles. However, existing object detection systems often encounter perception errors when handling comprehensive viewpoint noise in roadside scenes, as well as variations in traffic flow, lighting conditions, and camera poses. This makes it challenging for them to perform robustly in complex road environments. 
To address these issues, we propose <jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\mathrm{R^{2}MOAG}\\)<\/jats:tex-math><\/jats:inline-formula>, a highly robust monocular 3D object detection method for roadside systems, based on ground perception embedding and heterogeneous visual tokens. The proposed method extracts detailed road information through ground plane equations and utilizes heterogeneous visual tokens to focus on foreground features. By integrating low-dimensional ground information with high-dimensional visual features, the model is provided with clear and rich cues for object detection, significantly enhancing its stability. We conducted extensive experiments on the widely recognized roadside datasets DAIR-V2X-I and Rope3D. The results show that, in terms of overall performance, the proposed model achieved a 4.65% and 4.26% improvement in the <jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(AP_{3D}|_{R40}\\)<\/jats:tex-math><\/jats:inline-formula> metric for the vehicle category on these two datasets, respectively. 
Moreover, the model maintained stable recognition performance across various road scenarios and camera poses, demonstrating exceptional robustness.<\/jats:p>","DOI":"10.1145\/3790253","type":"journal-article","created":{"date-parts":[[2026,2,23]],"date-time":"2026-02-23T14:06:10Z","timestamp":1771855570000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["R<sup>2<\/sup>MOAG: Robust Roadside Monocular 3D Object Detection with Adaptive Token and Ground Embedding"],"prefix":"10.1145","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8602-7754","authenticated-orcid":false,"given":"Jie","family":"Tang","sequence":"first","affiliation":[{"name":"South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-3515-6548","authenticated-orcid":false,"given":"Haoran","family":"Pan","sequence":"additional","affiliation":[{"name":"South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0139-3622","authenticated-orcid":false,"given":"Bo","family":"Yu","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5132-8351","authenticated-orcid":false,"given":"Shaoshan","family":"Liu","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,5,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00938"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58592-1_9"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSEN.2022.3189174"},{"key":"e_1_3_1_6_2","unstructured":"Xiahan Chen Mingjian Chen Sanli Tang Yi Niu and Jiang Zhu. 2024. MOSE: Boosting vision-based roadside 3D object detection with scene cues. arXiv:2404.05280. Retrieved from https:\/\/arxiv.org\/abs\/2404.05280"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.691"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00398"},{"key":"e_1_3_1_11_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:6628106"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01298"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i2.25233"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20077-9_1"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19812-0_31"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2021.3052442"},{"key":"e_1_3_1_18_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 
2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=Bkg6RiCqY7"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00469"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00377"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00780"},{"key":"e_1_3_1_22_2","first-page":"1","volume-title":"Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS \u201921)","volume":"1068","author":"Rao Yongming","year":"2024","unstructured":"Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2024. DynamicViT: Efficient vision transformers with dynamic token sparsification. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS \u201921). Curran Associates Inc., Red Hook, NY, Article 1068, 1\u201313."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00133"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2024.3463409"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00208"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s40747-022-00962-9"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00462"},{"key":"e_1_3_1_29_2","first-page":"180","volume-title":"Proceedings of the 5th Conference on Robot Learning (CoRL \u201921)","author":"Wang Yue","year":"2022","unstructured":"Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. 2022. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In Proceedings of the 5th Conference on Robot Learning (CoRL \u201921). 
Aleksandra Faust, David Hsu, and Gerhard Neumann (Eds.), PMLR, 180\u2013191. Retrieved from https:\/\/proceedings.mlr.press\/v164\/wang22b.html"},{"key":"e_1_3_1_30_2","unstructured":"Junjie Yan Yingfei Liu Jianjian Sun Fan Jia Shuailin Li Tiancai Wang and Xiangyu Zhang. 2023. Cross modal transformer: Towards fast and robust 3D object detection. arXiv:2301.01283. Retrieved from https:\/\/arxiv.org\/abs\/2301.01283"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.126894"},{"key":"e_1_3_1_32_2","unstructured":"Lei Yang Tao Tang Jun Li Peng Chen Kun Yuan Li Wang Yi Huang Xinyu Zhang and Kaicheng Yu. 2023. BEVHeight++: Toward robust visual centric 3D object detection. arXiv:2309.16179. Retrieved from https:\/\/arxiv.org\/abs\/2309.16179"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","unstructured":"Lei Yang Jiaxin Yu Xinyu Zhang Jun Li Li Wang Yi Huang Chuang Zhang Hong Wang and Yiming Li. 2023. MonoGAE: Roadside monocular 3D object detection with ground-aware embeddings. arXiv:2310.00400. Retrieved from https:\/\/arxiv.org\/abs\/2310.00400","DOI":"10.1109\/TITS.2024.3412759"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02070"},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","unstructured":"Lei Yang Xinyu Zhang Jun Li Li Wang Chuang Zhang Li Ju Zhiwei Li and Yang Shen. 2024. SGV3D: Towards scenario generalization for vision-based roadside 3D object detection. arXiv:2401.16110. 
Retrieved from https:\/\/arxiv.org\/abs\/2401.16110","DOI":"10.1109\/TITS.2025.3569399"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02065"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.3390\/app132011402"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01161"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02067"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00321"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01082"},{"key":"e_1_3_1_42_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Zhang Hao","year":"2023","unstructured":"Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel Ni, and Heung-Yeung Shum. 2023. DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection. In Proceedings of the 11th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=3mRwyG5one"},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","unstructured":"Renrui Zhang Han Qiu Tai Wang Ziyu Guo Xuanzhuo Xu Ziteng Cui Yu Qiao Peng Gao and Hongsheng Li. 2023. MonoDETR: Depth-guided transformer for monocular 3D object detection. arXiv:2203.13310. Retrieved from https:\/\/arxiv.org\/abs\/2203.13310","DOI":"10.1109\/ICCV51070.2023.00840"},{"key":"e_1_3_1_44_2","volume-title":"Proceedings of the 38th Annual Conference on Neural Information Processing Systems","author":"Zhang Yushun","year":"2024","unstructured":"Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. 2024. Why transformers need Adam: A hessian perspective. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems. 
Retrieved from https:\/\/openreview.net\/forum?id=X6rqEpbnj3"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00330"},{"key":"e_1_3_1_46_2","unstructured":"Xingyi Zhou Dequan Wang and Philipp Kr\u00e4henb\u00fchl. 2019. Objects as points. arXiv:1904.07850. Retrieved from https:\/\/arxiv.org\/abs\/1904.07850"},{"key":"e_1_3_1_47_2","first-page":"2033","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS \u201922)","author":"Zhou Yunsong","year":"2022","unstructured":"Yunsong Zhou, Quan Liu, Hongzi Zhu, Yunzhe Li, Shan Chang, and Minyi Guo. 2022. MoGDE: Boosting mobile monocular 3D object detection with ground depth estimation. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS \u201922), 2033\u20132045."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00472"},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","unstructured":"Yunsong Zhou Hongzi Zhu Quan Liu Shan Chang and Minyi Guo. 2023. MonoATT: Online monocular 3D object detection with adaptive token transformer. arXiv:2303.13018. Retrieved from https:\/\/arxiv.org\/abs\/2303.13018","DOI":"10.1109\/CVPR52729.2023.01678"},{"key":"e_1_3_1_50_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhu Xizhou","year":"2021","unstructured":"Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations. 
Retrieved from https:\/\/openreview.net\/forum?id=gZ9hCDWe6ke"}],"container-title":["ACM Transactions on Cyber-Physical Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3790253","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T14:36:34Z","timestamp":1779028594000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3790253"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,12]]},"references-count":49,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,7,31]]}},"alternative-id":["10.1145\/3790253"],"URL":"https:\/\/doi.org\/10.1145\/3790253","relation":{},"ISSN":["2378-962X","2378-9638"],"issn-type":[{"value":"2378-962X","type":"print"},{"value":"2378-9638","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,12]]},"assertion":[{"value":"2025-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-06","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-05-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}