{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T02:54:07Z","timestamp":1777085647367,"version":"3.51.4"},"reference-count":68,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T00:00:00Z","timestamp":1758153600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>In domains such as autonomous driving, 3D object detection is a key technology for environmental perception. By integrating multimodal information from sensors such as LiDAR and cameras, the detection accuracy can be significantly improved. However, the current multimodal fusion perception framework still suffers from two problems: first, due to the inherent physical limitations of LiDAR detection, the number of point clouds of distant objects is sparse, resulting in small target objects being easily overwhelmed by the background; second, the cross-modal information interaction is insufficient, and the complementarity and correlation between the LiDAR point cloud and the camera image are not fully exploited and utilized. Therefore, we propose a new multimodal detection strategy, Semantic-Enhanced and Temporally Refined Bidirectional BEV Fusion (SETR-Fusion). This method integrates three key components: the Discriminative Semantic Saliency Activation (DSSA) module, the Temporally Consistent Semantic Point Fusion (TCSP) module, and the Bilateral Cross-Attention Fusion (BCAF) module. 
The DSSA module fully utilizes image semantic features to capture more discriminative foreground and background cues; the TCSP module generates semantic LiDAR points and, after noise filtering, produces a more accurate semantic LiDAR point cloud; and the BCAF module\u2019s cross-attention to camera and LiDAR BEV features in both directions enables strong interaction between the two types of modal information. SETR-Fusion achieves 71.2% mAP and 73.3% NDS values on the nuScenes test set, outperforming several state-of-the-art methods.<\/jats:p>","DOI":"10.3390\/jimaging11090319","type":"journal-article","created":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T09:32:32Z","timestamp":1758187952000},"page":"319","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Semantic-Enhanced and Temporally Refined Bidirectional BEV Fusion for LiDAR\u2013Camera 3D Object Detection"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-9972-0815","authenticated-orcid":false,"given":"Xiangjun","family":"Qu","sequence":"first","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6271-3709","authenticated-orcid":false,"given":"Kai","family":"Qin","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Yaping","family":"Li","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, 
Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Shuaizhang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Yuchen","family":"Li","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Sizhe","family":"Shen","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Yun","family":"Gao","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Nagiub, A.S., Fayez, M., Khaled, H., and Ghoniemy, S. (2024, January 6\u20137). 3D object detection for autonomous driving: A comprehensive review. Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), Cairo, Egypt.","DOI":"10.1109\/ICCI61671.2024.10485120"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhou, Y., and Tuzel, O. (2018, January 18\u201323). 
Voxelnet: End-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00472"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Yan, Y., Mao, Y., and Li, B.J.S. (2018). Second: Sparsely embedded convolutional detection. Sensors, 18.","DOI":"10.3390\/s18103337"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., and Beijbom, O. (2019, January 15\u201320). Pointpillars: Fast encoders for object detection from point clouds. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01298"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chen, Y., Liu, S., Shen, X., and Jia, J. (2019, October 27\u2013November 2). Fast point r-cnn. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00987"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yang, Z., Sun, Y., Liu, S., and Jia, J. (2020, January 13\u201319). 3DSSD: Point-based 3D single stage object detector. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01105"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., and Li, H. (2021, January 2\u20139). Voxel r-cnn: Towards high performance voxel-based 3D object detection. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual.","DOI":"10.1609\/aaai.v35i2.16207"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, T., Zhu, X., Pang, J., and Lin, D. (2021, January 10\u201317). FCOS3D: Fully convolutional one-stage monocular 3D object detection. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00107"},{"key":"ref_9","unstructured":"Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., and Solomon, J. (2021, November 8\u201311). Detr3D: 3D object detection from multi-view images via 3D-to-2D queries. Proceedings of the Conference on Robot Learning, London, UK."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liu, Y., Wang, T., Zhang, X., and Sun, J. (2022, January 23\u201327). Petr: Position embedding transformation for multi-view 3D object detection. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19812-0_31"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Liu, Y., Yan, J., Jia, F., Li, S., Gao, A., Wang, T., and Zhang, X. (2023, October 2\u20136). Petrv2: A unified framework for 3D perception from multi-camera images. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00302"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Philion, J., and Fidler, S. (2020, January 23\u201328). Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58568-6_12"},{"key":"ref_13","unstructured":"Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., and Tai, C.-L. (2022, June 19\u201324). Transfusion: Robust LiDAR-camera fusion for 3D object detection with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Vora, S., Lang, A.H., Helou, B., and Beijbom, O. (2020, January 13\u201319). Pointpainting: Sequential fusion for 3D object detection. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00466"},{"key":"ref_15","unstructured":"Zhou, X., Wang, D., and Kr\u00e4henb\u00fchl, P. (2019). Objects as points. arXiv."},{"key":"ref_16","unstructured":"Liang, T., Xie, H., Yu, K., Xia, Z., Lin, Z., Wang, Y., Tang, T., Wang, B., and Tang, Z. (2022, November 28\u2013December 9). Bevfusion: A simple and robust LiDAR-camera fusion framework. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., and Han, S. (2022). Bevfusion: Multi-task multi-sensor fusion with unified bird\u2019s-eye view representation. arXiv.","DOI":"10.1109\/ICRA48891.2023.10160968"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Park, D., Ambrus, R., Guizilini, V., Li, J., and Gaidon, A. (2021, January 10\u201317). Is pseudo-LiDAR needed for monocular 3D object detection? Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00313"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Luo, S., Dai, H., Shao, L., and Ding, Y. (2021, January 20\u201325). M3DSSD: Monocular 3D single stage object detector. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00608"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhou, Y., He, Y., Zhu, H., Wang, C., Li, H., and Jiang, Q. (2021, January 20\u201325). Monocular 3D object detection: An extrinsic parameter free approach. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00747"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Kumar, A., Brazil, G., and Liu, X. 
(2021, January 20\u201325). Groomed-nms: Grouped mathematically differentiable nms for monocular 3D object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00886"},{"key":"ref_22","unstructured":"Liu, X., Xue, N., and Wu, T. (2022, February 22\u2013March 1). Learning auxiliary monocular contexts helps monocular 3D object detection. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"3962","DOI":"10.1109\/TCSVT.2023.3237579","article-title":"Pseudo-mono for monocular 3D object detection in autonomous driving","volume":"33","author":"Tao","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Beijbom, O. (2020, January 16\u201318). nuscenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., and Caine, B. (2020, January 16\u201318). Scalability in perception for autonomous driving: Waymo open dataset. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00252"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Rukhovich, D., Vorontsova, A., and Konushin, A. (2022, January 3\u20138). Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3D object detection. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00133"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., and Li, Z. (2023, January 7\u201314). Bevdepth: Acquisition of reliable depth for multi-view 3D object detection. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i2.25233"},{"key":"ref_28","unstructured":"Huang, J., Huang, G., Zhu, Z., Ye, Y., and Du, D. (2021). Bevdet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., and Dai, J. (2022). Bevformer: Learning bird\u2019s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv.","DOI":"10.1007\/978-3-031-20077-9_1"},{"key":"ref_30","unstructured":"Huang, J., and Huang, G. (2022). Bevdet4d: Exploit temporal cues in multi-camera 3D object detection. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, S., Liu, Y., Wang, T., Li, Y., and Zhang, X. (2023, January 1\u20136). Exploring object-centric temporal modeling for efficient multi-view 3D object detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00335"},{"key":"ref_32","unstructured":"Qi, C.R., Su, H., Mo, K., and Guibas, L.J. (2017, January 21\u201326). Pointnet: Deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_33","unstructured":"Qi, C.R., Yi, L., Su, H., and Guibas, L.J. (2017, January 4\u20139). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. 
Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Shi, S., Wang, X., and Li, H. (2019, January 16\u201320). Pointrcnn: 3D object proposal generation and detection from point cloud. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00086"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Chen, Y., Liu, J., Zhang, X., Qi, X., and Jia, J. (2023, January 17\u201324). LargeKernel3D: Scaling up kernels in 3D sparse cnns. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01296"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wang, Y., Fathi, A., Kundu, A., Ross, D.A., Pantofaru, C., Funkhouser, T., and Solomon, J. (2020, January 23\u201328). Pillar-based object detection for autonomous driving. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58542-6_2"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Yin, T., Zhou, X., and Krahenbuhl, P. (2021, January 20\u201325). Center-based 3D object detection and tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01161"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, C., Ma, C., Zhu, M., and Yang, X. (2021, January 20\u201325). Pointaugmenting: Cross-modal augmentation for 3D object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01162"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wu, X., Peng, L., Yang, H., Xie, L., Huang, C., Deng, C., Liu, H., and Cai, D. (2022, January 18\u201324). 
Sparse fuse dense: Towards high quality 3D detection with depth completion. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00534"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Jacobson, P., Zhou, Y., Zhan, W., Tomizuka, M., and Wu, M.C. (2022). Center feature fusion: Selective multi-sensor fusion of center-based objects. arXiv.","DOI":"10.1109\/ICRA48891.2023.10160616"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Li, H., Zhang, Z., Zhao, X., Wang, Y., Shen, Y., Pu, S., and Mao, H. (2022, January 23\u201327). Enhancing multi-modal features using local self-attention for 3D object detection. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20080-9_31"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Xu, J., Zhang, R., Dou, J., Zhu, Y., Sun, J., and Pu, S. (2021, January 10\u201317). Rpvnet: A deep and efficient range-point-voxel fusion network for LiDAR point cloud segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01572"},{"key":"ref_43","unstructured":"Song, Z., Yang, L., Xu, S., Liu, L., Xu, D., Jia, C., Jia, F., and Wang, L. (2024, September 29\u2013October 4). Graphbev: Towards robust bev feature alignment for multi-modal 3D object detection. Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"2619","DOI":"10.1109\/TCSVT.2023.3306361","article-title":"GraphAlign++: An accurate feature alignment by graph matching for multi-modal 3D object detection","volume":"34","author":"Song","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"5753","DOI":"10.1109\/TCSVT.2024.3366664","article-title":"Toward robust LiDAR-camera fusion in BEV space via mutual deformable attention and temporal aggregation","volume":"34","author":"Wang","year":"2024","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Li, X., Fan, B., Tian, J., and Fan, H. (2024, January 16\u201322). Gafusion: Adaptive fusing LiDAR and camera with multiple guidance for 3D object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02004"},{"key":"ref_47","unstructured":"Zhao, Y., Gong, Z., Zheng, P., Zhu, H., and Wu, S. (2024). Simplebev: Improved LiDAR-camera fusion architecture for 3D object detection. arXiv."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"10447","DOI":"10.1109\/ACCESS.2024.3518564","article-title":"Cross-Supervised LiDAR-Camera Fusion for 3D Object Detection","volume":"13","author":"Zuo","year":"2024","journal-title":"IEEE Access"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Yin, J., Shen, J., Chen, R., Li, W., Yang, R., Frossard, P., and Wang, W. (2024, January 16\u201322). Is-fusion: Instance-scene collaborative fusion for multimodal 3D object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01412"},{"key":"ref_50","unstructured":"Ji, M., Yang, J., and Zhang, S. (2025). DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection. arXiv."},{"key":"ref_51","unstructured":"Sadeghian, R., Hooshyaripour, N., Joslin, C., and Lee, W. (2025). Reliability-Driven LiDAR-Camera Fusion for Robust 3D Object Detection. 
arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1567","DOI":"10.1007\/s40747-024-01567-0","article-title":"CL-fusionBEV: 3D Object Detection with Camera\u2013LiDAR Fusion in Bird\u2019s-Eye View","volume":"10","author":"Shi","year":"2024","journal-title":"Complex Intell. Syst."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Xu, Y., Chen, L., Wang, P., and Li, B. (2024). TiGDistill-BEV: BEV 3D Detection via Inner-Geometry Learning Distillation. arXiv.","DOI":"10.1109\/TCSVT.2025.3596322"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Wang, J., Li, F., Zhang, X., and Sun, H. (2025). Attention-Based LiDAR\u2013Camera Fusion for 3D Object Detection. World Electr. Veh. J., 16.","DOI":"10.3390\/wevj16060306"},{"key":"ref_55","unstructured":"Wang, T., Zhao, H., Guo, Y., and Zhang, M. (2025). LDRFusion: LiDAR-Dominant Multimodal Refinement Framework for 3D Object Detection. arXiv."},{"key":"ref_56","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, October 5\u20139). U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Zhang, N., Nex, F., Vosselman, G., and Kerle, N. (2023, January 17\u201324). Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01778"},{"key":"ref_60","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv."},{"key":"ref_61","unstructured":"Yin, T., Zhou, X., and Kr\u00e4henb\u00fchl, P. (2021, January 6\u201314). Multimodal virtual point 3D detection. Proceedings of the 35th International Conference on Neural Information Processing Systems, Virtual."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_63","unstructured":"MMDetection3D Contributors. OpenMMLab\u2019s Next-Generation Platform for General 3D Object Detection. Available online: https:\/\/github.com\/open-mmlab\/mmdetection3d (accessed on 9 August 2025)."},{"key":"ref_64","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019, January 8\u201314). Pytorch: An imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_65","unstructured":"Loshchilov, I., and Hutter, F. (2017). 
Decoupled weight decay regularization. arXiv."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F., Zhou, B., and Zhao, H. (2022). Autoalign: Pixel-instance feature aggregation for multi-modal 3D object detection. arXiv.","DOI":"10.24963\/ijcai.2022\/116"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., and Zhao, F. (2022, January 23\u201327). Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3D object detection. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20074-8_36"},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"104134","DOI":"10.1016\/j.jnca.2025.104134","article-title":"Advanced Aerial Monitoring and Vehicle Classification for Intelligent Transportation Systems with YOLOv8 Variants","volume":"237","author":"Bakirci","year":"2025","journal-title":"J. Netw. Comput. Appl."}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/9\/319\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:47:31Z","timestamp":1760035651000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/9\/319"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,18]]},"references-count":68,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["jimaging11090319"],"URL":"https:\/\/doi.org\/10.3390\/jimaging11090319","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,18]]}}}