{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:18:02Z","timestamp":1760145482999,"version":"build-2065373602"},"reference-count":42,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2024,7,22]],"date-time":"2024-07-22T00:00:00Z","timestamp":1721606400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science and Technology Major Project","award":["2022ZD0119502","62201018","62201017","L222052"],"award-info":[{"award-number":["2022ZD0119502","62201018","62201017","L222052"]}]},{"name":"National Natural Science Foundation of China","award":["2022ZD0119502","62201018","62201017","L222052"],"award-info":[{"award-number":["2022ZD0119502","62201018","62201017","L222052"]}]},{"name":"Beijing Natural Science Foundation and Haidian Original Innovation Joint Fund","award":["2022ZD0119502","62201018","62201017","L222052"],"award-info":[{"award-number":["2022ZD0119502","62201018","62201017","L222052"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Real-world understanding serves as a medium that bridges the information world and the physical world, enabling the realization of virtual\u2013real mapping and interaction. However, scene understanding based solely on 2D images faces problems such as a lack of geometric information and limited robustness against occlusion. The depth sensor brings new opportunities, but there are still challenges in fusing depth with geometric and semantic priors. To address these concerns, our method considers the repeatability of video stream data and the sparsity of newly generated data. We introduce a sparsely correlated network architecture (SCN) designed explicitly for online RGBD instance segmentation. 
Additionally, we leverage the power of object-level RGB-D SLAM systems, thereby transcending the limitations of conventional approaches that solely emphasize geometry or semantics. We establish correlation over time and leverage this correlation to develop rules and generate sparse data. We thoroughly evaluate the system\u2019s performance on the NYU Depth V2 and ScanNet V2 datasets, demonstrating that incorporating frame-to-frame correlation leads to significantly improved accuracy and consistency in instance segmentation compared to existing state-of-the-art alternatives. Moreover, using sparse data reduces data complexity while ensuring the real-time requirement of 18 fps. Furthermore, by utilizing prior knowledge of object layout understanding, we demonstrate a promising augmented reality application, showcasing its potential and practicality.<\/jats:p>","DOI":"10.3390\/s24144756","type":"journal-article","created":{"date-parts":[[2024,7,22]],"date-time":"2024-07-22T17:36:04Z","timestamp":1721669764000},"page":"4756","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Online Scene Semantic Understanding Based on Sparsely Correlated Network for AR"],"prefix":"10.3390","volume":"24","author":[{"given":"Qianqian","family":"Wang","sequence":"first","affiliation":[{"name":"The School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 102488, China"}]},{"given":"Junhao","family":"Song","sequence":"additional","affiliation":[{"name":"The School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 102488, China"}]},{"given":"Chenxi","family":"Du","sequence":"additional","affiliation":[{"name":"The School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 102488, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4334-6103","authenticated-orcid":false,"given":"Chen","family":"Wang","sequence":"additional","affiliation":[{"name":"The School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 102488, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,7,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wang, M., Ye, Z.-M., Shi, J.-C., and Yang, Y.-L. (April, January 27). Scene-Context-Aware Indoor Object Selection and Movement in VR. Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal.","DOI":"10.1109\/VR50410.2021.00045"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhang, A., Zhao, Y., Wang, S., and Wei, J. (2022, January 12\u201316). An improved augmented-reality method of inserting virtual objects into the scene with transparent objects. Proceedings of the 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Christchurch, New Zealand.","DOI":"10.1109\/VR51125.2022.00021"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Seichter, D., K\u00f6hler, M., Lewandowski, B., Wengefeld, T., and Gross, H.-M. (June, January 30). Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis. Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi\u2019an, China.","DOI":"10.1109\/ICRA48506.2021.9561675"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"3483","DOI":"10.1109\/TMM.2022.3161852","article-title":"PGDENet: Progressive Guided Fusion and Depth Enhancement Network for RGB-D Indoor Scene Parsing","volume":"25","author":"Zhou","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., and Fitzgibbon, A. (2011, January 26\u201329). 
KinectFusion: Real-time dense surface mapping and tracking. Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland.","DOI":"10.1109\/ISMAR.2011.6092378"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3054739","article-title":"Bundlefusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration","volume":"36","author":"Dai","year":"2017","journal-title":"ACM Trans. Graph. (TOG)"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Lin, W., Zheng, C., Yong, J.-H., and Xu, F. (2022, January 18\u201322). OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00178"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"108225","DOI":"10.1016\/j.patcog.2021.108225","article-title":"Blitz-SLAM: A semantic SLAM in dynamic environments","volume":"121","author":"Fan","year":"2022","journal-title":"Pattern Recognit."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"6011","DOI":"10.1007\/s00521-021-06764-3","article-title":"YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint","volume":"34","author":"Wu","year":"2022","journal-title":"Neural Comput. Appl."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Dai, A., Diller, C., and Niessner, M. (2020, January 13\u201319). SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00093"},{"key":"ref_11","first-page":"1","article-title":"Real-time 3D reconstruction at scale using voxel hashing","volume":"32","author":"Izadi","year":"2013","journal-title":"ACM Trans. Graph. 
(TOG)"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Whelan, T., Leutenegger, S., Salas-Moreno, R.F., Glocker, B., and Davison, A.J. (2015). ElasticFusion: Dense SLAM without a Pose Graph. Robotics: Science and Systems, Available online: https:\/\/roboticsproceedings.org\/rss11\/p01.pdf.","DOI":"10.15607\/RSS.2015.XI.001"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Font\u00e1n, A., Civera, J., and Triebel, R. (2020, January 13\u201319). Information-Driven Direct RGB-D Odometry. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00498"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/JPROC.2023.3238524","article-title":"Object Detection in 20 Years: A Survey","volume":"111","author":"Zou","year":"2023","journal-title":"Proc. IEEE"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"104401","DOI":"10.1016\/j.imavis.2022.104401","article-title":"A review on 2D instance segmentation based on deep neural networks","volume":"120","author":"Gu","year":"2022","journal-title":"Image Vis. Comput."},{"key":"ref_16","unstructured":"Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S., and Murphy, K.P. (2017). Semantic instance segmentation via deep metric learning. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Bolya, D., Zhou, C., Xiao, F., and Lee, Y.J. (November, January 27). YOLACT: Real-Time Instance Segmentation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00925"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hazirbas, C., Ma, L., Domokos, C., and Cremers, D. (2016). 
FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture, Springer.","DOI":"10.1007\/978-3-319-54181-5_14"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1109\/MIS.2020.2999462","article-title":"TSNet: Three-Stream Self-Attention Network for RGB-D Indoor Semantic Segmentation","volume":"36","author":"Zhou","year":"2021","journal-title":"IEEE Intell. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Shen, X., and Stamos, I. (2020, January 1\u20135). Frustum VoxNet for 3D object detection from RGB-D or Depth images. Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093276"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Charles, R.Q., Su, H., Kaichun, M., and Guibas, L.J. (2017, January 21\u201326). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.16"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1016\/j.isprsjprs.2021.03.001","article-title":"A point-based deep learning network for semantic segmentation of MLS point clouds","volume":"175","author":"Han","year":"2021","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., and Markham, A. (2020, January 13\u201319). RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01112"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Thomas, H., Qi, C.R., Deschaud, J.-E., Marcotegui, B., Goulette, F., and Guibas, L. (November, January 27). 
KPConv: Flexible and Deformable Convolution for Point Clouds. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00651"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask R-CNN. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., and Luo, P. (2021, January 19\u201325). Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01422"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.-W., and Jia, J. (2020, January 13\u201319). PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00492"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhong, M., Chen, X., Chen, X., Zeng, G., and Wang, Y. (2022, January 18\u201322). Maskgroup: Hierarchical point grouping and masking for 3d instance segmentation. 
Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan.","DOI":"10.1109\/ICME52920.2022.9859996"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"McCormac, J., Handa, A., Davison, A., and Leutenegger, S. (June, January 29). SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore.","DOI":"10.1109\/ICRA.2017.7989538"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Xiang, Y., and Fox, D. (2017). DA-RNN: Semantic mapping with data associated recurrent neural networks. arXiv.","DOI":"10.15607\/RSS.2017.XIII.013"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Salas-Moreno, R.F., Newcombe, R.A., Strasdat, H., Kelly, P.H.J., and Davison, A.J. (2013, January 23\u201328). SLAM++: Simultaneous Localisation and Mapping at the Level of Objects. Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.178"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Tateno, K., Tombari, F., and Navab, N. (2016, January 16\u201321). When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM. Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden.","DOI":"10.1109\/ICRA.2016.7487378"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Runz, M., Buffier, M., and Agapito, L. (2018, January 16\u201320). Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects. Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany.","DOI":"10.1109\/ISMAR.2018.00024"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"McCormac, J., Clark, R., Bloesch, M., Davison, A., and Leutenegger, S. (2018, January 5\u20138). 
Fusion++: Volumetric object-level slam. Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy.","DOI":"10.1109\/3DV.2018.00015"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Xu, B., Li, W., Tzoumanikas, D., Bloesch, M., Davison, A., and Leutenegger, S. (2019, January 20\u201324). MID-Fusion: Octree-based Object-Level Multi-Instance Dynamic SLAM. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794371"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Narita, G., Seno, T., Ishikawa, T., and Kaji, Y. (2019, January 3\u20138). Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. Proceedings of the 2019 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China.","DOI":"10.1109\/IROS40897.2019.8967890"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"2847","DOI":"10.1109\/TITS.2023.3284228","article-title":"RGBD-SLAM Based on Object Detection with Two-Stream YOLOv4-MobileNetv3 in Autonomous Driving","volume":"25","author":"Li","year":"2024","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wang, C., Oi, Y., and Yang, S. (2020, January 21\u201326). Recurrent R-CNN: Online Instance Mapping with context correlation. Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Atlanta, GA, USA.","DOI":"10.1109\/VRW50115.2020.00239"},{"key":"ref_40","unstructured":"Li, X., Guivant, J., Kwok, N., Xu, Y., Li, R., and Wu, H. (2019). Three-dimensional backbone network for 3D object detection in traffic scenes. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Dai, A., Chang, X., Savva, M., Halber, M., Funkhouser, T., and Nie\u00dfner, M. (2017, January 21\u201326). ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. 
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.261"},{"key":"ref_42","first-page":"746","article-title":"Indoor segmentation and support inference from rgbd images","volume":"7576","author":"Silberman","year":"2012","journal-title":"ECCV"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/14\/4756\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:21:15Z","timestamp":1760109675000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/14\/4756"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,22]]},"references-count":42,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2024,7]]}},"alternative-id":["s24144756"],"URL":"https:\/\/doi.org\/10.3390\/s24144756","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2024,7,22]]}}}