{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T09:44:55Z","timestamp":1768902295617,"version":"3.49.0"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"7","license":[{"start":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T00:00:00Z","timestamp":1711497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62276242"],"award-info":[{"award-number":["62276242"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Aviation Science Foundation","award":["2022Z071078001"],"award-info":[{"award-number":["2022Z071078001"]}]},{"name":"CAAI-Huawei MindSpore Open Fund","award":["CAAIXSJLJJ-2021-016B, CAAIXSJLJJ-2022-001A"],"award-info":[{"award-number":["CAAIXSJLJJ-2021-016B, CAAIXSJLJJ-2022-001A"]}]},{"name":"Anhui Province Key Research and Development Program","award":["202104a05020007"],"award-info":[{"award-number":["202104a05020007"]}]},{"name":"Dreams Foundation of Jianghuai Advance Technology Center","award":["2023-ZM01Z001"],"award-info":[{"award-number":["2023-ZM01Z001"]}]},{"name":"USTC-IAT Application Sci. & Tech. Achievement Cultivation Program","award":["JL06521001Y"],"award-info":[{"award-number":["JL06521001Y"]}]},{"name":"Sci. & Tech. Innovation Special Zone","award":["20-163-14-LZ-001-004-01"],"award-info":[{"award-number":["20-163-14-LZ-001-004-01"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,7,31]]},"abstract":"<jats:p>Scene Text Recognition (STR), the critical step in OCR systems, has attracted much attention in computer vision. 
Recent research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only optimizes the joint probability of the estimated characters generated from the Vision Model (VM) in a single language modality, ignoring the visual-semantic relations across modalities. Thus, LM-based methods can hardly generalize to challenging conditions in which the text has weak or multiple semantics, arbitrary shapes, and so on. To mitigate the above issue, in this paper, we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate Scene Text Recognition. Specifically, our MVSTRN builds a bridge between vision and language through its unified architecture and can reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, breaking the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module is introduced to combine the multimodal visual and textual semantics from the VM and LM to make the final predictions. 
Extensive experiments demonstrate our MVSTRN achieves state-of-the-art performance on several benchmarks.<\/jats:p>","DOI":"10.1145\/3646551","type":"journal-article","created":{"date-parts":[[2024,2,19]],"date-time":"2024-02-19T12:20:53Z","timestamp":1708345253000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Multimodal Visual-Semantic Representations Learning for Scene Text Recognition"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9795-5908","authenticated-orcid":false,"given":"Xinjian","family":"Gao","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6393-8930","authenticated-orcid":false,"given":"Ye","family":"Pang","sequence":"additional","affiliation":[{"name":"Ping An Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1541-5000","authenticated-orcid":false,"given":"Yuyu","family":"Liu","sequence":"additional","affiliation":[{"name":"Ping An Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9335-4025","authenticated-orcid":false,"given":"Maokun","family":"Han","sequence":"additional","affiliation":[{"name":"Ping An Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3197-8103","authenticated-orcid":false,"given":"Jun","family":"Yu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0550-497X","authenticated-orcid":false,"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"Ping An Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9132-3837","authenticated-orcid":false,"given":"Yuanxu","family":"Chen","sequence":"additional","affiliation":[{"name":"Ping An Technology Co., Ltd, Beijing, 
China"}]}],"member":"320","published-online":{"date-parts":[[2024,3,27]]},"reference":[{"key":"e_1_3_1_2_2","article-title":"Multimodal semi-supervised learning for text recognition","author":"Aberdam Aviad","year":"2022","unstructured":"Aviad Aberdam, Roy Ganz, Shai Mazor, and Ron Litman. 2022. Multimodal semi-supervised learning for text recognition. arXiv preprint arXiv:2205.03873 (2022).","journal-title":"arXiv preprint arXiv:2205.03873"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-86549-8_21"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00481"},{"key":"e_1_3_1_5_2","article-title":"BEiT: Bert pre-training of image transformers","author":"Bao Hangbo","year":"2021","unstructured":"Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).","journal-title":"arXiv preprint arXiv:2106.08254"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01467"},{"key":"e_1_3_1_7_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and others. 2020. Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33 (2020), 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58604-1_28"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.543"},{"key":"e_1_3_1_11_2","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).","journal-title":"arXiv preprint arXiv:1810.04805"},{"key":"e_1_3_1_12_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00702"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00917"},{"key":"e_1_3_1_15_2","article-title":"Learning pixel affinity pyramid for arbitrary-shaped text detection","author":"Fu Zilong","year":"2022","unstructured":"Zilong Fu, Hongtao Xie, Shancheng Fang, Yuxin Wang, MengTing Xing, and Yongdong Zhang. 2022. Learning pixel affinity pyramid for arbitrary-shaped text detection. 
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2022).","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.201"},{"key":"e_1_3_1_17_2","first-page":"2315","volume-title":"CVPR","author":"Gupta Ankush","year":"2016","unstructured":"Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In CVPR. 2315\u20132324."},{"key":"e_1_3_1_18_2","article-title":"Masked autoencoders are scalable vision learners","author":"He Kaiming","year":"2021","unstructured":"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021).","journal-title":"arXiv preprint arXiv:2111.06377"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.5555\/3016387.3016396"},{"key":"e_1_3_1_20_2","article-title":"Synthetic data and artificial neural networks for natural scene text recognition","author":"Jaderberg Max","year":"2014","unstructured":"Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and et al.2014. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227 (2014).","journal-title":"arXiv preprint arXiv:1406.2227"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0823-z"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-86549-8_19"},{"key":"e_1_3_1_23_2","first-page":"1156","volume-title":"ICDAR","year":"2015","unstructured":"Dimosthenis Karatzas, Lluis Gomez-Bigorda, and Anguelos Nicolaou. 2015. ICDAR 2015 competition on robust reading. In ICDAR. IEEE, 1156\u20131160."},{"key":"e_1_3_1_24_2","first-page":"1484","volume-title":"ICDAR","year":"2013","unstructured":"Dimosthenis Karatzas, Faisal Shafait, and Seiichi Uchida. 2013. 
ICDAR 2013 robust reading competition. In ICDAR. IEEE, 1484\u20131493."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00281"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018610"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.108455"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018714"},{"issue":"4","key":"e_1_3_1_29_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3356728","article-title":"AB-LSTM: Attention-based bidirectional LSTM model for scene text detection","volume":"15","author":"Liu Zhandong","year":"2019","unstructured":"Zhandong Liu, Wengang Zhou, and Houqiang Li. 2019. AB-LSTM: Attention-based bidirectional LSTM model for scene text detection. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 4 (2019), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"issue":"3","key":"e_1_3_1_30_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3440087","article-title":"MFECN: Multi-level feature enhanced cumulative network for scene text detection","volume":"17","author":"Liu Zhandong","year":"2021","unstructured":"Zhandong Liu, Wengang Zhou, and Houqiang Li. 2021. MFECN: Multi-level feature enhanced cumulative network for scene text detection. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 3 (2021), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_5"},{"key":"e_1_3_1_32_2","article-title":"The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision","author":"Mao Jiayuan","year":"2019","unstructured":"Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. 
Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584 (2019).","journal-title":"arXiv preprint arXiv:1904.12584"},{"key":"e_1_3_1_33_2","volume-title":"BMVC","author":"Mishra Anand","year":"2012","unstructured":"Anand Mishra, Karteek Alahari, and C. V. Jawahar. 2012. Scene text recognition using higher order language priors. In BMVC. BMVA."},{"key":"e_1_3_1_34_2","first-page":"569","volume-title":"ICCV","author":"Phan Trung Quy","year":"2013","unstructured":"Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and et al.2013. Recognizing text with perspective distortion in natural scenes. In ICCV. 569\u2013576."},{"key":"e_1_3_1_35_2","first-page":"2046","volume-title":"MM","author":"Qiao Zhi","year":"2021","unstructured":"Zhi Qiao, Yu Zhou, Jin Wei, Wang, and et al.2021. PIMNet: A parallel, iterative and mimicking network for scene text recognition. In MM. ACM, 2046\u20132055."},{"key":"e_1_3_1_36_2","first-page":"13528","volume-title":"CVPR","author":"Qiao Zhi","year":"2020","unstructured":"Zhi Qiao, Yu Zhou, Dongbao Yang, and et al.2020. SEED: Semantics enhanced encoder-decoder framework for scene text recognition. In CVPR. 13528\u201313537."},{"key":"e_1_3_1_37_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018)."},{"issue":"8","key":"e_1_3_1_38_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et\u00a0al. 2019. Language models are unsupervised multitask learners. 
OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2014.07.008"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2646371"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2848939"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6891"},{"key":"e_1_3_1_43_2","article-title":"Gated recurrent convolution neural network for OCR","volume":"30","author":"Wang Jianfeng","year":"2017","unstructured":"Jianfeng Wang and Xiaolin Hu. 2017. Gated recurrent convolution neural network for OCR. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126402"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6903"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01393"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01177"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3231737"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00670"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00035"},{"issue":"8","key":"e_1_3_1_51_2","article-title":"A holistic representation guided attention network for scene text recognition","volume":"414","author":"Yang L.","year":"2020","unstructured":"L. Yang, P. Wang, H. Li, Z. Li, and Y. Zhang. 2020. A holistic representation guided attention network for scene text recognition. Neurocomputing 414, 8 (2020).","journal-title":"Neurocomputing"},{"key":"e_1_3_1_52_2","first-page":"3","volume-title":"IJCAI","author":"Yang Xiao","year":"2017","unstructured":"Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C. Lee Giles. 2017. 
Learning to read irregular text with attention mechanisms. In IJCAI, Vol. 1. 3."},{"key":"e_1_3_1_53_2","article-title":"Neural-symbolic VQA: Disentangling reasoning from vision and language understanding","volume":"31","author":"Yi Kexin","year":"2018","unstructured":"Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01213"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58529-7_9"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00216"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58517-4_4"},{"key":"e_1_3_1_58_2","article-title":"SPIN: Structure-preserving inner offset network for scene text recognition","author":"Zhang Chengwei","year":"2020","unstructured":"Chengwei Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Fei Wu, and Futai Zou. 2020. SPIN: Structure-preserving inner offset network for scene text recognition. arXiv preprint arXiv:2005.13117 (2020).","journal-title":"arXiv preprint arXiv:2005.13117"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58586-0_44"},{"key":"e_1_3_1_60_2","article-title":"iBOT: Image BERT pre-training with online tokenizer","author":"Zhou Jinghao","year":"2021","unstructured":"Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. 2021. iBOT: Image BERT pre-training with online tokenizer. 
arXiv preprint arXiv:2111.07832 (2021).","journal-title":"arXiv preprint arXiv:2111.07832"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/2808210"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3646551","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3646551","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:41Z","timestamp":1750295861000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3646551"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,27]]},"references-count":60,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,31]]}},"alternative-id":["10.1145\/3646551"],"URL":"https:\/\/doi.org\/10.1145\/3646551","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,27]]},"assertion":[{"value":"2022-09-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-19","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}