{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T01:29:16Z","timestamp":1778894956049,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,12,6]]},"DOI":"10.1145\/3595916.3626405","type":"proceedings-article","created":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T16:34:41Z","timestamp":1704126881000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0931-8985","authenticated-orcid":false,"given":"Zehan","family":"Tan","sequence":"first","affiliation":[{"name":"Fudan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6473-9272","authenticated-orcid":false,"given":"Weidong","family":"Yang","sequence":"additional","affiliation":[{"name":"Fudan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6994-4146","authenticated-orcid":false,"given":"Zhiwei","family":"Wang","sequence":"additional","affiliation":[{"name":"GREE ELECTRIC APPLIANCES, INC. 
OF ZHUHAI, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Proceedings, Part I 16","author":"Achlioptas Panos","year":"2020","unstructured":"Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. 2020. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part I 16. Springer, 422\u2013440."},{"key":"e_1_3_2_1_2_1","volume-title":"Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254","author":"Bao Hangbo","year":"2021","unstructured":"Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)."},{"key":"e_1_3_2_1_3_1","volume-title":"DynaSLAM II: Tightly-coupled multi-object tracking and SLAM","author":"Bescos Berta","year":"2021","unstructured":"Berta Bescos, Carlos Campos, Juan\u00a0D Tard\u00f3s, and Jos\u00e9 Neira. 2021. DynaSLAM II: Tightly-coupled multi-object tracking and SLAM. 
IEEE robotics and automation letters 6, 3 (2021), 5191\u20135198."},{"key":"e_1_3_2_1_4_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 16464\u201316473","author":"Cai Daigang","year":"2022","unstructured":"Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, and Dong Xu. 2022. 3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 16464\u201316473."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TRO.2021.3075644"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings, Part XX. Springer, 202\u2013221","author":"Chen Dave\u00a0Zhenyu","year":"2020","unstructured":"Dave\u00a0Zhenyu Chen, Angel\u00a0X Chang, and Matthias Nie\u00dfner. 2020. Scanrefer: 3d object localization in rgb-d scans using natural language. In Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XX. Springer, 202\u2013221."},{"key":"e_1_3_2_1_7_1","volume-title":"HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding. arXiv preprint arXiv:2210.12513","author":"Chen Jiaming","year":"2022","unstructured":"Jiaming Chen, Weixin Luo, Xiaolin Wei, Lin Ma, and Wei Zhang. 2022. 
HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding. arXiv preprint arXiv:2210.12513 (2022)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.261"},{"key":"e_1_3_2_1_9_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1769\u20131779","author":"Deng Jiajun","year":"2021","unstructured":"Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1769\u20131779."},{"key":"e_1_3_2_1_10_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_11_1","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_12_1","volume-title":"2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1\u20136.","author":"Du Ye","year":"2022","unstructured":"Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. 2022. Visual grounding with transformers. In 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1\u20136."},{"key":"e_1_3_2_1_13_1","volume-title":"Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277","author":"Fathi Alireza","year":"2017","unstructured":"Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun\u00a0Oh Song, Sergio Guadarrama, and Kevin\u00a0P Murphy. 2017. Semantic instance segmentation via deep metric learning. 
arXiv preprint arXiv:1703.10277 (2017)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_3_2_1_15_1","volume-title":"A review on 2D instance segmentation based on deep neural networks. Image and Vision Computing","author":"Gu Wenchao","year":"2022","unstructured":"Wenchao Gu, Shuang Bai, and Lingxing Kong. 2022. A review on 2D instance segmentation based on deep neural networks. Image and Vision Computing (2022), 104401."},{"key":"e_1_3_2_1_16_1","volume-title":"International journal of multimedia information retrieval 9, 3","author":"Hafiz Abdul\u00a0Mueed","year":"2020","unstructured":"Abdul\u00a0Mueed Hafiz and Ghulam\u00a0Mohiuddin Bhat. 2020. A survey on instance segmentation: state of the art. International journal of multimedia information retrieval 9, 3 (2020), 171\u2013189."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.470"},{"key":"e_1_3_2_1_19_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15524\u201315533","author":"Huang Shijia","year":"2022","unstructured":"Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. 2022. Multi-view transformer for 3d visual grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
15524\u201315533."},{"key":"e_1_3_2_1_20_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15513\u201315523","author":"Jiang Haojun","year":"2022","unstructured":"Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, and Gao Huang. 2022. Pseudo-q: Generating pseudo language queries for visual grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15513\u201315523."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00180"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_2_1_23_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6032\u20136041","author":"Liu Haolin","year":"2021","unstructured":"Haolin Liu, Anran Lin, Xiaoguang Han, Lei Yang, Yizhou Yu, and Shuguang Cui. 2021. Refer-it-in-rgbd: A bottom-up approach for 3d visual grounding in rgbd images. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6032\u20136041."},{"key":"e_1_3_2_1_24_1","volume-title":"Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499","author":"Liu Shilong","year":"2023","unstructured":"Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, 2023. 
Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"e_1_3_2_1_26_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-020-10073-7"},{"key":"e_1_3_2_1_28_1","volume-title":"Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans. arXiv preprint arXiv:2305.13876","author":"Miyanishi Taiki","year":"2023","unstructured":"Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, and Motoki Kawanabe. 2023. Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans. arXiv preprint arXiv:2305.13876 (2023)."},{"key":"e_1_3_2_1_29_1","volume-title":"International conference on machine learning. 
PMLR, 8748\u20138763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong\u00a0Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_1_30_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_3_2_1_31_1","volume-title":"Indoor segmentation and support inference from rgbd images.ECCV (5) 7576","author":"Silberman Nathan","year":"2012","unstructured":"Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. 
Indoor segmentation and support inference from rgbd images.ECCV (5) 7576 (2012), 746\u2013760."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2012.6385773"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2003.1248813"},{"key":"e_1_3_2_1_34_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1960\u20131968","author":"Wang Peng","year":"2019","unstructured":"Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van\u00a0den Hengel. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1960\u20131968."},{"key":"e_1_3_2_1_36_1","volume-title":"International Conference on Machine Learning. 
PMLR, 23318\u201323340","author":"Wang Peng","year":"2022","unstructured":"Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning. PMLR, 23318\u201323340."},{"key":"e_1_3_2_1_37_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19231\u201319242","author":"Wu Yanmin","year":"2023","unstructured":"Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, and Jian Zhang. 2023. EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19231\u201319242."},{"key":"e_1_3_2_1_38_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15325\u201315336","author":"Yan Bin","year":"2023","unstructured":"Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. 2023. 
Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15325\u201315336."},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 4145\u20134154","author":"Yang Sibei","year":"2019","unstructured":"Sibei Yang, Guanbin Li, and Yizhou Yu. 2019. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 4145\u20134154."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00142"},{"key":"e_1_3_2_1_41_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1791\u20131800","author":"Yuan Zhihao","year":"2021","unstructured":"Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Sheng Wang, Zhen Li, and Shuguang Cui. 2021. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1791\u20131800."},{"key":"e_1_3_2_1_42_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4158\u20134166","author":"Zhang Hanwang","year":"2018","unstructured":"Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. 
Grounding referring expressions in images by variational context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4158\u20134166."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.888718"},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. 2928\u20132937","author":"Zhao Lichen","year":"2021","unstructured":"Lichen Zhao, Daigang Cai, Lu Sheng, and Dong Xu. 2021. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
2928\u20132937."}],"event":{"name":"MMAsia '23: ACM Multimedia Asia","location":"Tainan Taiwan","acronym":"MMAsia '23","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["ACM Multimedia Asia 2023"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626405","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3595916.3626405","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:55Z","timestamp":1750178155000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626405"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,6]]},"references-count":44,"alternative-id":["10.1145\/3595916.3626405","10.1145\/3595916"],"URL":"https:\/\/doi.org\/10.1145\/3595916.3626405","relation":{},"subject":[],"published":{"date-parts":[[2023,12,6]]},"assertion":[{"value":"2024-01-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}