{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T16:57:38Z","timestamp":1780765058056,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant 61876177"],"award-info":[{"award-number":["Grant 61876177"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Beijing Natural Science Foundation","award":["4202034"],"award-info":[{"award-number":["4202034"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475397","type":"proceedings-article","created":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T23:03:42Z","timestamp":1672959822000},"page":"2344-2352","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":76,"title":["TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding"],"prefix":"10.1145","author":[{"given":"Dailan","family":"He","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yusheng","family":"Zhao","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Junyu","family":"Luo","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tianrui","family":"Hui","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shaofei","family":"Huang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Aixi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Si","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Panos Achlioptas Ahmed Abdelreheem Fei Xia Mohamed Elhoseiny and Leonidas Guibas. 2020. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In ECCV.  Panos Achlioptas Ahmed Abdelreheem Fei Xia Mohamed Elhoseiny and Leonidas Guibas. 2020. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In ECCV.","DOI":"10.1007\/978-3-030-58452-8_25"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_3_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba , Jamie Ryan Kiros, and Geoffrey E Hinton . 2016 . Layer normalization. arXiv preprint arXiv:1607.06450 (2016). Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.  Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_1_5_1","volume-title":"Scanrefer: 3d object localization in rgb-d scans using natural language. arXiv preprint arXiv:1912.08830","author":"Chen Dave Zhenyu","year":"2019","unstructured":"Dave Zhenyu Chen , Angel X Chang , and Matthias Nie\u00dfner . 2019. Scanrefer: 3d object localization in rgb-d scans using natural language. arXiv preprint arXiv:1912.08830 ( 2019 ). Dave Zhenyu Chen, Angel X Chang, and Matthias Nie\u00dfner. 2019. Scanrefer: 3d object localization in rgb-d scans using natural language. arXiv preprint arXiv:1912.08830 (2019)."},{"key":"e_1_3_2_1_6_1","volume-title":"Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364","author":"Chen Hanting","year":"2020","unstructured":"Hanting Chen , Yunhe Wang , Tianyu Guo , Chang Xu , Yiping Deng , Zhenhua Liu , Siwei Ma , Chunjing Xu , Chao Xu , and Wen Gao . 2020. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364 ( 2020 ). Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. 2020. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364 (2020)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Kan Chen Rama Kovvuri and Ram Nevatia. 2017. Query-guided regression network with context policy for phrase grounding. In ICCV.  Kan Chen Rama Kovvuri and Ram Nevatia. 2017. Query-guided regression network with context policy for phrase grounding. In ICCV.","DOI":"10.1109\/ICCV.2017.95"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3069041"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Marcella Cornia Matteo Stefanini Lorenzo Baraldi and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In CVPR.  Marcella Cornia Matteo Stefanini Lorenzo Baraldi and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_1_10_1","volume-title":"Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR.","author":"Dai Angela","year":"2017","unstructured":"Angela Dai , Angel X Chang , Manolis Savva , Maciej Halber , Thomas Funkhouser , and Matthias Nie\u00dfner . 2017 . Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR. Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nie\u00dfner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR."},{"key":"e_1_3_2_1_11_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Zhipeng Ding Xu Han and Marc Niethammer. 2019. VoteNet: A deep learning label fusion method for multi-atlas segmentation. In MICCAI.  Zhipeng Ding Xu Han and Marc Niethammer. 2019. VoteNet: A deep learning label fusion method for multi-atlas segmentation. In MICCAI.","DOI":"10.1007\/978-3-030-32248-9_23"},{"key":"e_1_3_2_1_13_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Lun Huang Wenmin Wang Jie Chen and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In ICCV.  Lun Huang Wenmin Wang Jie Chen and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In ICCV.","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Pin-Hao Huang Han-Hung Lee Hwann-Tzong Chen and Tyng-Luh Liu. 2021. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation. In AAAI.  Pin-Hao Huang Han-Hung Lee Hwann-Tzong Chen and Tyng-Luh Liu. 2021. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation. In AAAI.","DOI":"10.1609\/aaai.v35i2.16253"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Shaofei Huang Tianrui Hui Si Liu Guanbin Li Yunchao Wei Jizhong Han Luoqi Liu and Bo Li. 2020. Referring image segmentation via cross-modal progressive comprehension. In CVPR.  Shaofei Huang Tianrui Hui Si Liu Guanbin Li Yunchao Wei Jizhong Han Luoqi Liu and Bo Li. 2020. Referring image segmentation via cross-modal progressive comprehension. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01050"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413902"},{"key":"e_1_3_2_1_18_1","volume-title":"Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy , Armand Joulin , and Li Fei-Fei . 2014. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679 ( 2014 ). Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679 (2014)."},{"key":"e_1_3_2_1_19_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_20_1","unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV.  Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV."},{"key":"e_1_3_2_1_21_1","unstructured":"Chunxiao Liu Zhendong Mao Tianzhu Zhang Hongtao Xie Bin Wang and Yongdong Zhang. 2020 b. Graph structured network for image-text matching. In CVPR.  Chunxiao Liu Zhendong Mao Tianzhu Zhang Hongtao Xie Bin Wang and Yongdong Zhang. 2020 b. Graph structured network for image-text matching. In CVPR."},{"key":"e_1_3_2_1_22_1","volume-title":"2020 a. Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249","author":"Liu Liyuan","year":"2020","unstructured":"Liyuan Liu , Xiaodong Liu , Jianfeng Gao , Weizhu Chen , and Jiawei Han . 2020 a. Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249 ( 2020 ). Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020 a. Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249 (2020)."},{"key":"e_1_3_2_1_23_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . 2019 . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157129"},{"key":"e_1_3_2_1_25_1","volume-title":"Yen Lam Hoang Nguyen, and Lam Thu Bui.","author":"Phan Anh Viet","year":"2018","unstructured":"Anh Viet Phan , Minh Le Nguyen , Yen Lam Hoang Nguyen, and Lam Thu Bui. 2018 . Dgcnn : A convolutional neural network over large-scale labeled graphs. Neural Networks ( 2018). Anh Viet Phan, Minh Le Nguyen, Yen Lam Hoang Nguyen, and Lam Thu Bui. 2018. Dgcnn: A convolutional neural network over large-scale labeled graphs. Neural Networks (2018)."},{"key":"e_1_3_2_1_26_1","volume-title":"Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR.","author":"Qi Charles R","year":"2017","unstructured":"Charles R Qi , Hao Su , Kaichun Mo , and Leonidas J Guibas . 2017 a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR. Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR."},{"key":"e_1_3_2_1_27_1","volume-title":"Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413","author":"Qi Charles R","year":"2017","unstructured":"Charles R Qi , Li Yi , Hao Su , and Leonidas J Guibas . 2017b. Pointnet+ : Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 ( 2017 ). Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet+: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)."},{"key":"e_1_3_2_1_28_1","volume-title":"Attention is all you need. arXiv preprint arXiv:1706.03762","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Haoran Wang Ying Zhang Zhong Ji Yanwei Pang and Lin Ma. 2020. Consensus-aware visual-semantic embedding for image-text matching. In ECCV.  Haoran Wang Ying Zhang Zhong Ji Yanwei Pang and Lin Ma. 2020. Consensus-aware visual-semantic embedding for image-text matching. In ECCV.","DOI":"10.1007\/978-3-030-58586-0_2"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Zhengyuan Yang Boqing Gong Liwei Wang Wenbing Huang Dong Yu and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In ICCV.  Zhengyuan Yang Boqing Gong Liwei Wang Wenbing Huang Dong Yu and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In ICCV.","DOI":"10.1109\/ICCV.2019.00478"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413846"},{"key":"e_1_3_2_1_32_1","unstructured":"Zhou Yu Jun Yu Yuhao Cui Dacheng Tao and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In CVPR.  Zhou Yu Jun Yu Yuhao Cui Dacheng Tao and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In CVPR."},{"key":"e_1_3_2_1_33_1","volume-title":"Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. TNNLS","author":"Yu Zhou","year":"2018","unstructured":"Zhou Yu , Jun Yu , Chenchao Xiang , Jianping Fan , and Dacheng Tao . 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. TNNLS ( 2018 ). Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. TNNLS (2018)."},{"key":"e_1_3_2_1_34_1","volume-title":"InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. arXiv preprint arXiv:2103.01128","author":"Yuan Zhihao","year":"2021","unstructured":"Zhihao Yuan , Xu Yan , Yinghong Liao , Ruimao Zhang , Zhen Li , and Shuguang Cui . 2021. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. arXiv preprint arXiv:2103.01128 ( 2021 ). Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, and Shuguang Cui. 2021. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. arXiv preprint arXiv:2103.01128 (2021)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455360"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00611"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351063"},{"key":"e_1_3_2_1_38_1","volume-title":"Point transformer. arXiv preprint arXiv:2012.09164","author":"Zhao Hengshuang","year":"2020","unstructured":"Hengshuang Zhao , Li Jiang , Jiaya Jia , Philip Torr , and Vladlen Koltun . 2020. Point transformer. arXiv preprint arXiv:2012.09164 ( 2020 ). Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. 2020. Point transformer. arXiv preprint arXiv:2012.09164 (2020)."}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475397","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475397","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:32Z","timestamp":1750193312000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475397"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":38,"alternative-id":["10.1145\/3474085.3475397","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475397","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}