{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,18]],"date-time":"2026-05-18T11:26:37Z","timestamp":1779103597458,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001381","name":"National Research Foundation Singapore","doi-asserted-by":"publisher","award":["NRF-NRFF13-2021-0008"],"award-info":[{"award-number":["NRF-NRFF13-2021-0008"]}],"id":[{"id":"10.13039\/501100001381","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004543","name":"China Scholarship Council","doi-asserted-by":"publisher","award":["202006460059"],"award-info":[{"award-number":["202006460059"]}],"id":[{"id":"10.13039\/501100004543","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62076024, 62006018"],"award-info":[{"award-number":["62076024, 62006018"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100014219","name":"National Science Fund for Distinguished Young Scholars","doi-asserted-by":"publisher","award":["62125601"],"award-info":[{"award-number":["62125601"]}],"id":[{"id":"10.13039\/501100014219","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547977","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"4564-4572","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA"],"prefix":"10.1145","author":[{"given":"Zan-Xia","family":"Jin","sequence":"first","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mike Zheng","family":"Shou","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fang","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Satoshi","family":"Tsutsui","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingyan","family":"Qin","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xu-Cheng","family":"Yin","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2339814"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107538"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00251"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00439"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219861"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219861"},{"key":"e_1_3_2_2_10_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171--4186."},{"key":"e_1_3_2_2_11_1","volume-title":"Anton Van den Hengel, and QiWu","author":"Gao Chenyu","year":"2021","unstructured":"Chenyu Gao , Qi Zhu , Peng Wang , Hui Li , Yuliang Liu , Anton Van den Hengel, and QiWu . 2021 . Structured multimodal attentions for textvqa. IEEE Transactions on Pattern Analysis and Machine Intelligence ( 2021). Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton Van den Hengel, and QiWu. 2021. Structured multimodal attentions for textvqa. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01276"},{"key":"e_1_3_2_2_13_1","volume-title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR","author":"Goyal Yash","year":"2017","unstructured":"Yash Goyal , Tejas Khot , Douglas Summers-Stay , Dhruv Batra , and Devi Parikh . 2017 . Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 6325--6334. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 6325--6334."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"crossref","unstructured":"Wei Han Hantao Huang and Tao Han. 2020. Finding the Evidence: Localizationaware Answer Prediction for Text Visual Question Answering. In COLING Donia Scott N\u00faria Bel and Chengqing Zong (Eds.). 3118--3131.  Wei Han Hantao Huang and Tao Han. 2020. Finding the Evidence: Localizationaware Answer Prediction for Text Visual Question Answering. In COLING Donia Scott N\u00faria Bel and Chengqing Zong (Eds.). 3118--3131.","DOI":"10.18653\/v1\/2020.coling-main.278"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2020.2996027"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01001"},{"key":"e_1_3_2_2_17_1","volume-title":"Pythia v0. 1: the winning entry to the vqa challenge","author":"Jiang Yu","year":"2018","unstructured":"Yu Jiang , Vivek Natarajan , Xinlei Chen , Marcus Rohrbach , Dhruv Batra , and Devi Parikh . 2018. Pythia v0. 1: the winning entry to the vqa challenge 2018 . arXiv preprint arXiv:1807.09956 (2018). Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018. Pythia v0. 1: the winning entry to the vqa challenge 2018. arXiv preprint arXiv:1807.09956 (2018)."},{"key":"e_1_3_2_2_18_1","volume-title":"Ruart: A novel text-centered solution for text-based visual question answering","author":"Jin Zan-Xia","year":"2021","unstructured":"Zan-Xia Jin , Heran Wu , Chun Yang , Fang Zhou , Jingyan Qin , Lei Xiao , and Xu-Cheng Yin . 2021 . Ruart: A novel text-centered solution for text-based visual question answering . IEEE Transactions on Multimedia ( 2021). Zan-Xia Jin, Heran Wu, Chun Yang, Fang Zhou, Jingyan Qin, Lei Xiao, and Xu-Cheng Yin. 2021. Ruart: A novel text-centered solution for text-based visual question answering. IEEE Transactions on Multimedia (2021)."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btz195"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2020.05.110"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_41"},{"key":"e_1_3_2_2_22_1","unstructured":"Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems. 1564--1574.  Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems. 1564--1574."},{"key":"e_1_3_2_2_23_1","volume-title":"Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR","author":"Diederik","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015 . Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015 , Yoshua Bengio and Yann LeCun (Eds.). Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Yoshua Bengio and Yann LeCun (Eds.)."},{"key":"e_1_3_2_2_24_1","unstructured":"Vladimir I Levenshtein etal 1966. Binary codes capable of correcting deletions insertions and reversals. In Soviet physics doklady Vol. 10. Soviet Union 707--710.  Vladimir I Levenshtein et al. 1966. Binary codes capable of correcting deletions insertions and reversals. In Soviet physics doklady Vol. 10. Soviet Union 707--710."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2020.02.031"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413924"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00297"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_5"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.4"},{"key":"e_1_3_2_2_30_1","volume-title":"Faster RCNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross B. Girshick , and Jian Sun . 2015 . Faster RCNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (Eds.). 91--99. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster RCNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (Eds.). 91--99."},{"key":"e_1_3_2_2_31_1","volume-title":"ASTER: An Attentional Scene Text Recognizer with Flexible Rectification","author":"Shi Baoguang","year":"2019","unstructured":"Baoguang Shi , Mingkun Yang , Xinggang Wang , Pengyuan Lyu , Cong Yao , and Xiang Bai . 2019 . ASTER: An Attentional Scene Text Recognizer with Flexible Rectification . IEEE transactions on pattern analysis and machine intelligence 41, 9 (2019), 2035--2048. Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2019. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE transactions on pattern analysis and machine intelligence 41, 9 (2019), 2035--2048."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8317--8326.  Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8317--8326.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00869"},{"key":"e_1_3_2_2_34_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008."},{"key":"e_1_3_2_2_35_1","unstructured":"Oriol Vinyals Meire Fortunato and Navdeep Jaitly. 2015. Pointer networks. In Advances in neural information processing systems. 2692--2700.  Oriol Vinyals Meire Fortunato and Navdeep Jaitly. 2015. Pointer networks. In Advances in neural information processing systems. 2692--2700."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475390"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"crossref","unstructured":"Zhaokai Wang Renda Bao Qi Wu and Si Liu. 2021. Confidence-aware nonrepetitive multimodal transformers for textcaps. In AAAI.  Zhaokai Wang Renda Bao Qi Wu and Si Liu. 2021. Confidence-aware nonrepetitive multimodal transformers for textcaps. In AAAI.","DOI":"10.1609\/aaai.v35i4.16389"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240572"},{"key":"e_1_3_2_2_39_1","volume-title":"HiGCIN: Hierarchical graph-based cross inference network for group activity recognition","author":"Yan Rui","year":"2020","unstructured":"Rui Yan , Lingxi Xie , Jinhui Tang , Xiangbo Shu , and Qi Tian . 2020. HiGCIN: Hierarchical graph-based cross inference network for group activity recognition . IEEE transactions on pattern analysis and machine intelligence ( 2020 ). Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, and Qi Tian. 2020. HiGCIN: Hierarchical graph-based cross inference network for group activity recognition. IEEE transactions on pattern analysis and machine intelligence (2020)."},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58598-3_13"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00864"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475606"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00134"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547977","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547977","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:31Z","timestamp":1750186831000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547977"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":43,"alternative-id":["10.1145\/3503161.3547977","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547977","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}