{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,24]],"date-time":"2025-11-24T16:44:11Z","timestamp":1764002651403,"version":"build-2065373602"},"publisher-location":"New York, NY, USA","reference-count":52,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T00:00:00Z","timestamp":1691107200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100012543","name":"Shanghai Science and Technology Development Foundation","doi-asserted-by":"publisher","award":["21511100401"],"award-info":[{"award-number":["21511100401"]}],"id":[{"id":"10.13039\/100012543","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["92270121"],"award-info":[{"award-number":["92270121"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,8,6]]},"DOI":"10.1145\/3580305.3599862","type":"proceedings-article","created":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T18:13:58Z","timestamp":1691172838000},"page":"5382-5392","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["M3PT: A Multi-Modal Model for POI Tagging"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4605-9676","authenticated-orcid":false,"given":"Jingsong","family":"Yang","sequence":"first","affiliation":[{"name":"Fudan University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3399-9645","authenticated-orcid":false,"given":"Guanzhou","family":"Han","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1390-3861","authenticated-orcid":false,"given":"Deqing","family":"Yang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8671-2302","authenticated-orcid":false,"given":"Jingping","family":"Liu","sequence":"additional","affiliation":[{"name":"East China University of Science and Technology, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8403-9591","authenticated-orcid":false,"given":"Yanghua","family":"Xiao","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2891-0145","authenticated-orcid":false,"given":"Xiang","family":"Xu","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3627-7058","authenticated-orcid":false,"given":"Baohua","family":"Wu","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7172-6077","authenticated-orcid":false,"given":"Shenghua","family":"Ni","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2023,8,4]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/944919.944965"},{"key":"e_1_3_2_2_2_1","volume-title":"Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119","author":"Ben-Baruch Emanuel","year":"2020","unstructured":"Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2020. Asymmetric loss for multi-label classification. 
arXiv preprint arXiv:2009.14119 (2020)."},{"key":"e_1_3_2_2_3_1","volume-title":"Proceedings, Part XXX (Lecture Notes in Computer Science","volume":"120","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXX (Lecture Notes in Computer Science, Vol. 12375), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 104--120. https:\/\/doi.org\/10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2018.02.017"},{"key":"e_1_3_2_2_5_1","volume-title":"VirTex: Learning Visual Representations from Textual Annotations. CoRR","author":"Desai Karan","year":"2020","unstructured":"Karan Desai and Justin Johnson. 2020. VirTex: Learning Visual Representations from Textual Annotations. CoRR, Vol. abs\/2006.06666 (2020). arXiv:2006.06666 https:\/\/arxiv.org\/abs\/2006.06666"},{"key":"e_1_3_2_2_6_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_7_1","volume-title":"9th International Conference on Learning Representations, ICLR 2021","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3--7, 2021. OpenReview.net. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_2_2_8_1","volume-title":"Proceedings, Part VII (Lecture Notes in Computer Science","volume":"438","author":"Feng Zheyun","unstructured":"Zheyun Feng, Songhe Feng, Rong Jin, and Anil K. Jain. 2014. Image Tag Completion by Noisy Matrix Recovery. 
In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6--12, 2014, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 8695), David J. Fleet, Tom\u00e1\u0161 Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer, 424--438. https:\/\/doi.org\/10.1007\/978-3-319-10584-0_28"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3487553.3524214"},{"key":"e_1_3_2_2_10_1","volume-title":"8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26--28, 2004, Proceedings (Lecture Notes in Computer Science","volume":"30","author":"Godbole Shantanu","year":"2004","unstructured":"Shantanu Godbole and Sunita Sarawagi. 2004. Discriminative Methods for Multi-labeled Classification. In Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26--28, 2004, Proceedings (Lecture Notes in Computer Science, Vol. 3056), Honghua Dai, Ramakrishnan Srikant, and Chengqi Zhang (Eds.). Springer, 22--30. 
https:\/\/doi.org\/10.1007\/978-3-540-24775-3_5"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.170"},{"key":"e_1_3_2_2_12_1","volume-title":"Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. IEEE Computer Society, 770--778. https:\/\/doi.org\/10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_13_1","article-title":"A Spatial-Temporal Topic Model for the Semantic Annotation of POIs in LBSNs","volume":"8","author":"He Tieke","year":"2016","unstructured":"Tieke He, Hongzhi Yin, Zhenyu Chen, Xiaofang Zhou, Shazia W. Sadiq, and Bin Luo. 2016a. A Spatial-Temporal Topic Model for the Semantic Annotation of POIs in LBSNs. ACM Trans. Intell. Syst. Technol., Vol. 8, 1 (2016), 12:1--12:24. https:\/\/doi.org\/10.1145\/2905373","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"e_1_3_2_2_14_1","volume-title":"ECIR 2013, Moscow, Russia, March 24--27, 2013. 
Proceedings (Lecture Notes in Computer Science","volume":"229","author":"Hegde Vinod","year":"2013","unstructured":"Vinod Hegde, Josiane Xavier Parreira, and Manfred Hauswirth. 2013. Semantic Tagging of Places Based on User Interest Profiles from Online Social Networks. In Advances in Information Retrieval - 35th European Conference on IR Research, ECIR 2013, Moscow, Russia, March 24--27, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7814), Pavel Serdyukov, Pavel Braslavski, Sergei O. Kuznetsov, Jaap Kamps, Stefan M. R\u00fcger, Eugene Agichtein, Ilya Segalovich, and Emine Yilmaz (Eds.). Springer, 218--229. https:\/\/doi.org\/10.1007\/978-3-642-36973-5_19"},{"key":"e_1_3_2_2_15_1","volume-title":"Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021","author":"Huang Zhicheng","year":"2021","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. 2021. Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19--25, 2021. Computer Vision Foundation \/ IEEE, 12976--12985. 
https:\/\/doi.org\/10.1109\/CVPR46437.2021.01278"},{"key":"e_1_3_2_2_16_1","volume-title":"Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, and Ji-Rong Wen.","author":"Huo Yuqi","year":"2021","unstructured":"Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Dan Yang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, and Ji-Rong Wen. 2021. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. CoRR, Vol. abs\/2103.06561 (2021). 
arXiv:2103.06561 https:\/\/arxiv.org\/abs\/2103.06561"},{"key":"e_1_3_2_2_17_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","volume":"4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 4904--4916. http:\/\/proceedings.mlr.press\/v139\/jia21b.html"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1518701.1518797"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2493432.2493504"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/PERCOM.2015.7146504"},{"key":"e_1_3_2_2_21_1","volume-title":"ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8","author":"Lagos Nikolaos","year":"2020","unstructured":"Nikolaos Lagos, Salah Ait-Mokhtar, and Ioan Calapodescu. 2020. Point-Of-Interest Semantic Tag Completion in a Global Crowdsourced Search-and-Discovery Database. 
In ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020) (Frontiers in Artificial Intelligence and Applications, Vol. 325), Giuseppe De Giacomo, Alejandro Catal\u00e1, Bistra Dilkina, Michela Milano, Sen\u00e9n Barro, Alberto Bugar\u00edn, and J\u00e9r\u00f4me Lang (Eds.). IOS Press, 2993--3000. https:\/\/doi.org\/10.3233\/FAIA200474"},{"key":"e_1_3_2_2_22_1","volume-title":"Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. 
arXiv preprint arXiv:1909.11942 (2019)."},{"key":"e_1_3_2_2_23_1","volume-title":"BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, ICML 2022","volume":"12900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, ICML 2022, 17--23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv\u00e1ri, Gang Niu, and Sivan Sabato (Eds.). PMLR, 12888--12900. https:\/\/proceedings.mlr.press\/v162\/li22n.html"},{"key":"e_1_3_2_2_24_1","volume-title":"VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR, Vol. abs\/1908.03557 (2019). 
arXiv:1908.03557 http:\/\/arxiv.org\/abs\/1908.03557"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.202"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2016.2518478"},{"key":"e_1_3_2_2_27_1","volume-title":"M6: A Chinese Multimodal Pretrainer. CoRR","author":"Lin Junyang","year":"2021","unstructured":"Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021. M6: A Chinese Multimodal Pretrainer. CoRR, Vol. abs\/2103.00823 (2021). arXiv:2103.00823 https:\/\/arxiv.org\/abs\/2103.00823"},{"key":"e_1_3_2_2_28_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 0--0.","author":"Lin Rongcheng","year":"2018","unstructured":"Rongcheng Lin, Jing Xiao, and Jianping Fan. 2018. Nextvlad: An efficient neural network to aggregate frame-level features for large-scale video classification. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 
0--0."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.212"},{"key":"e_1_3_2_2_30_1","volume-title":"Query2Label: A Simple Transformer Way to Multi-Label Classification. CoRR","author":"Liu Shilong","year":"2021","unstructured":"Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. 2021. Query2Label: A Simple Transformer Way to Multi-Label Classification. CoRR, Vol. abs\/2107.10834 (2021). arXiv:2107.10834 https:\/\/arxiv.org\/abs\/2107.10834"},{"key":"e_1_3_2_2_31_1","volume-title":"Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net. https:\/\/openreview.net\/forum?id=Bkg6RiCqY7"},{"key":"e_1_3_2_2_32_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","volume":"8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. 
In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748--8763. http:\/\/proceedings.mlr.press\/v139\/radford21a.html"},{"key":"e_1_3_2_2_33_1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., Vol. 21 (2020), 140:1--140:67. http:\/\/jmlr.org\/papers\/v21\/20-074.html","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_2_34_1","volume-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 
In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (Eds.). 91--99. https:\/\/proceedings.neurips.cc\/paper\/2015\/hash\/14bfa6bb14875e45bba028a21ed38046-Abstract.html"},{"key":"e_1_3_2_2_35_1","volume-title":"TResNet: High Performance GPU-Dedicated Architecture. In IEEE Winter Conference on Applications of Computer Vision, WACV 2021","author":"Ridnik Tal","year":"2021","unstructured":"Tal Ridnik, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar Friedman. 2021. TResNet: High Performance GPU-Dedicated Architecture. 
In IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, January 3--8, 2021. IEEE, 1399--1408. https:\/\/doi.org\/10.1109\/WACV48630.2021.00144"},{"key":"e_1_3_2_2_36_1","volume-title":"Learning Visual Representations with Caption Annotations. CoRR","author":"Sariyildiz Mert B\u00fc","year":"2020","unstructured":"Mert B\u00fclent Sariyildiz, Julien Perez, and Diane Larlus. 2020. Learning Visual Representations with Caption Annotations. CoRR, Vol. abs\/2008.01392 (2020). arXiv:2008.01392 https:\/\/arxiv.org\/abs\/2008.01392"},{"key":"e_1_3_2_2_37_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_2_38_1","volume-title":"Understanding the Behaviour of Contrastive Loss. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021","author":"Wang Feng","year":"2021","unstructured":"Feng Wang and Huaping Liu. 2021. Understanding the Behaviour of Contrastive Loss. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19--25, 2021. Computer Vision Foundation \/ IEEE, 2495--2504. 
https:\/\/doi.org\/10.1109\/CVPR46437.2021.00252"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133075"},{"key":"e_1_3_2_2_40_1","volume-title":"SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. In The Tenth International Conference on Learning Representations, ICLR 2022","author":"Wang Zirui","year":"2022","unstructured":"Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2022. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25--29, 2022. OpenReview.net. https:\/\/openreview.net\/forum?id=GUrhfTuf_3"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.adhoc.2022.103022"},{"key":"e_1_3_2_2_42_1","volume-title":"Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016","author":"Yang Dingqi","year":"2016","unstructured":"Dingqi Yang, Bin Li, and Philippe Cudr\u00e9-Mauroux. 2016. POISketch: Semantic Place Labeling over User Activity Streams. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9--15 July 2016, Subbarao Kambhampati (Ed.). IJCAI\/AAAI Press, 2697--2703. 
http:\/\/www.ijcai.org\/Abstract\/16\/383"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009982220290"},{"key":"e_1_3_2_2_44_1","volume-title":"FILIP: Fine-grained Interactive Language-Image Pre-Training. In The Tenth International Conference on Learning Representations, ICLR 2022","author":"Yao Lewei","year":"2022","unstructured":"Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2022. FILIP: Fine-grained Interactive Language-Image Pre-Training. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25--29, 2022. OpenReview.net. https:\/\/openreview.net\/forum?id=cpDhcsEDC2"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020491"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3001593"},{"key":"e_1_3_2_2_47_1","volume-title":"Tailor versatile multi-modal learning for multi-label emotion recognition. arXiv preprint arXiv:2201.05834","author":"Zhang Yi","year":"2022","unstructured":"Yi Zhang, Mingyuan Chen, Jundong Shen, and Chongjun Wang. 2022. Tailor versatile multi-modal learning for multi-label emotion recognition. 
arXiv preprint arXiv:2201.05834 (2022)."},{"key":"e_1_3_2_2_48_1","volume-title":"Langlotz","author":"Zhang Yuhao","year":"2020","unstructured":"Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. 2020. Contrastive Learning of Medical Visual Representations from Paired Images and Text. CoRR, Vol. abs\/2010.00747 (2020). arXiv:2010.00747 https:\/\/arxiv.org\/abs\/2010.00747"},{"key":"e_1_3_2_2_49_1","series-title":"ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL","volume-title":"Long Papers, Anna Korhonen, David R. Traum, and Llu\u00eds M\u00e0rquez (Eds.)","author":"Zhang Zhengyan","year":"2019","unstructured":"Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced Language Representation with Informative Entities. 
In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Llu\u00eds M\u00e0rquez (Eds.). Association for Computational Linguistics, 1441--1451. https:\/\/doi.org\/10.18653\/v1\/p19-1139"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475191"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330698"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.pmcj.2013.07.004"}],"event":{"name":"KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"],"location":"Long Beach CA USA","acronym":"KDD '23"},"container-title":["Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599862","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3580305.3599862","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:24Z","timestamp":1750182564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599862"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,4]]},"references-count":52,"alternative-id":["10.1145\/3580305.3599862","10.1145\/3580305"],"URL":"https:\/\/doi.org\/10.1145\/3580305.3599862","relation":{},"subject":[],"published":{"date-parts":[[2023,8,4]]},"assertion":[{"value":"2023-08-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}