{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T16:06:45Z","timestamp":1781280405636,"version":"3.54.1"},"reference-count":36,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2020,12,17]],"date-time":"2020-12-17T00:00:00Z","timestamp":1608163200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key R&D Program of China","award":["2018AAA0103302"],"award-info":[{"award-number":["2018AAA0103302"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. 61902194"],"award-info":[{"award-number":["No. 61902194"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>We focus on exploring the LIDAR-RGB fusion-based 3D object detection in this paper. This task is still challenging in two aspects: (1) the difference of data formats and sensor positions contributes to the misalignment of reasoning between the semantic features of images and the geometric features of point clouds. (2) The optimization of traditional IoU is not equal to the regression loss of bounding boxes, resulting in biased back-propagation for non-overlapping cases. In this work, we propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion module (CMF) and a novel center 3D IoU loss to resolve these two issues. Our CMF module is developed to reinforce the discriminative representation of objects by reasoning the relation of corresponding LIDAR geometric capability and RGB semantic capability of the object from two modalities. 
Specifically, CMF is added in a cascaded way between the RGB and LIDAR streams, which selects salient points and transmits multi-scale point cloud features to each stage of RGB streams. Moreover, our center 3D IoU loss incorporates the distance between anchor centers to avoid the oversimple optimization for non-overlapping bounding boxes. Extensive experiments on the KITTI benchmark have demonstrated that our proposed approach performs better than the compared methods.<\/jats:p>","DOI":"10.3390\/s20247243","type":"journal-article","created":{"date-parts":[[2020,12,17]],"date-time":"2020-12-17T10:42:47Z","timestamp":1608201767000},"page":"7243","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Cascaded Cross-Modality Fusion Network for 3D Object Detection"],"prefix":"10.3390","volume":"20","author":[{"given":"Zhiyu","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qiong","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Automation, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jing","family":"Sun","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yujian","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 
9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8511-7544","authenticated-orcid":false,"given":"Shangdong","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qiang","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yimu","family":"Ji","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2809-2237","authenticated-orcid":false,"given":"He","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Yadong New District, Nanjing 210023, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2020,12,17]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jagersand, M., and EI-Sallab, A. (2018, January 4\u20137). MODNet: Motion and Appearance based Moving Object Detection Network for Autonomous Driving. Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA.","DOI":"10.1109\/ITSC.2018.8569744"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Cai, Z., Fan, Q., Feris, R.S., and Vasconcelos, N. 
(2016, January 11\u201314). A unified multi-scale deep convolutional neural network for fast object detection. Proceedings of the 14th European Conference, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46493-0_22"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Ku, J., Mozifian, M., Lee, J., Harakeh, A., and Waslander, S. (2018, January 1\u20135). Joint 3D proposal generation and object detection from view aggregation. Proceedings of the 2018 IEEE\/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain.","DOI":"10.1109\/IROS.2018.8594049"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1016\/j.neucom.2019.12.042","article-title":"Dynamic attention network for semantic segmentation","volume":"384","author":"Wu","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Vora, S., Lang, A., Helou, B., and Beijbom, O. (2020, January 16\u201318). Pointpainting: Sequential fusion for 3d object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00466"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Xie, L., Xiang, C., Yu, G., Yang, Z., Cai, D., and He, X. (2020). PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. arXiv.","DOI":"10.1609\/aaai.v34i07.6933"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017, January 21\u201326). Multi-view 3d object detection network for autonomous driving. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.691"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pang, S., Morris, D., and Radha, H. (2020). CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. 
arXiv.","DOI":"10.1109\/IROS45743.2020.9341791"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Yoo, J., Kim, Y., Kim, J., and Choi, J. (2020). 3D-CVF: Generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. arXiv.","DOI":"10.1007\/978-3-030-58583-9_43"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liang, M., Yang, B., Chen, Y., Hu, R., and Urtasun, R. (2019, January 15\u201321). Multi-Task multi-sensor fusion for 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00752"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Rezatofighi, H., Tsoi, N., Gwak, J.Y., Sadeghian, A., Reid, I., and Savarese, S. (2019, January 15\u201321). Generalized intersection over union: A metric and a loss for bounding box regression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00075"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., and Berg, A.C. (2016, January 11\u201314). Ssd: Single shot multibox detector. Proceedings of the 14th European Conference, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Song, S., and Xiao, J. (2016, January 27\u201330). Deep sliding shapes for amodal 3d object detection in rgb-d images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.94"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Gupta, S., and Xiao, J. (2014, January 6\u201312). Learning rich features from RGB-D images for object detection and segmentation. Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10584-0_23"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Tekin, B., Sinha, S.N., Fua, P., and Fua, P. (2018, January 18\u201322). Real-time seamless single shot 6D object pose prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00038"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Dhiman, V., Tran, Q.H., Corso, J., and Chandraker, M. (2016, January 27\u201330). A continuous occlusion model for road scene understanding. Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.469"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Laidlow, T., Czarnowski, J., and Leutenegger, S. (2019, January 20\u201324). DeepFusion: Real-time dense 3D reconstruction for monocular slam using single-view depth and gradient predictions. Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8793527"},{"key":"ref_20","unstructured":"Bo, L., Zhang, T., and Xia, T. (2016). Vehicle detection from 3d lidar using fully convolutional network. 
arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yang, B., Luo, W., and Urtasun, R. (2018, January 18\u201322). Pixor: Real-time 3d object detection from point clouds. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00798"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Shi, S., Wang, X., and Li, H. (2019, January 15\u201321). Pointrcnn: 3d object proposal generation and detection from point cloud. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00086"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"477","DOI":"10.1016\/j.neucom.2019.10.116","article-title":"Multi-view semantic learning network for point cloud based 3D object detection","volume":"397","author":"Yang","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_24","unstructured":"Qi, C.R., Su, H., Mo, K., and Guibas, L. (2017, January 21\u201326). PointNet: Deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_25","unstructured":"Qi, C.R., Yi, L., Su, H., and Guibas, L. (2017, January 4\u20139). PointNet++: Deep hierarchical feature learning on point sets in a metric space. Proceedings of the Advance in Neural Information Processing Systems 2017, Long Beach, CA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., and Wang, X. (2020, January 16\u201318). PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01054"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"He, C., Zeng, H., Huang, J., Hua, X., and Zhang, L. (2020, January 16\u201318). 
Structure Aware Single-stage 3D Object Detection from Point Cloud. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01189"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Yang, Z., Sun, Y., Liu, S., Shen, X., and Jia, J. (2019, January 15\u201321). Std: Sparse-to-dense 3d object detector for point cloud. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/ICCV.2019.00204"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhou, Y., and Tuzel, O. (2018, January 18\u201322). VoxelNet: End-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00472"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kuang, H., Wang, B., An, J., Zhang, M., and Zhang, Z. (2020). Voxel-FPN: Multi-scale voxel feature aggregation in 3D object detection from point clouds. Sensors, 20.","DOI":"10.3390\/s20030704"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Yan, Y., Mao, Y., and Li, B. (2018). SECOND: Sparsely embedded convolutional detection. Sensors, 18.","DOI":"10.3390\/s18103337"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lang, A.H., Vora, S., and Caesar, H. (2019, January 15\u201321). PointPillars: Fast encoders for object detection from point clouds. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01298"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Shi, S., Wang, Z., Shi, J., Wang, X., and Li, H. (2019). From Points to parts: 3D object detection from point cloud with part-aware and part-aggregation Network. 
arXiv.","DOI":"10.1109\/TPAMI.2020.2977026"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for autonomous driving? The kitti vision benchmarksuite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Qi, C.R., Liu, W., Wu, C., Su, H., and Guibas, L. (2018, January 18\u201322). Frustum pointnets for 3d object detection from rgb-d data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00102"},{"key":"ref_36","unstructured":"Liu, Z., Tang, H., Lin, Y., and Han, S. (2019, January 10\u201312). Point-Voxel CNN for efficient 3D deep learning. Proceedings of the Advances in Neural Information Processing Systems 2019, Vancouver, BC, Canada."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/24\/7243\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:46:30Z","timestamp":1760179590000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/24\/7243"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12,17]]},"references-count":36,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2020,12]]}},"alternative-id":["s20247243"],"URL":"https:\/\/doi.org\/10.3390\/s20247243","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,12,17]]}}}