{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,22]],"date-time":"2026-01-22T16:27:00Z","timestamp":1769099220115,"version":"3.49.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["92370119, 62376113, and 62276258"],"award-info":[{"award-number":["92370119, 62376113, and 62276258"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Jiangsu Science and Technology Programme","award":["BE2020006-4"],"award-info":[{"award-number":["BE2020006-4"]}]},{"name":"European Union\u2019s Horizon 2020 research and innovation programme","award":["956123"],"award-info":[{"award-number":["956123"]}]},{"name":"UK EPSRC under projects","award":["EP\/T026995\/1"],"award-info":[{"award-number":["EP\/T026995\/1"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. 
In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models.<\/jats:p>","DOI":"10.1145\/3633517","type":"journal-article","created":{"date-parts":[[2023,11,21]],"date-time":"2023-11-21T09:01:30Z","timestamp":1700557290000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-5968-3389","authenticated-orcid":false,"given":"Yijie","family":"Hu","sequence":"first","affiliation":[{"name":"School of Advanced Technology, Xi\u2019an Jiaotong-Liverpool University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8431-0644","authenticated-orcid":false,"given":"Bin","family":"Dong","sequence":"additional","affiliation":[{"name":"Ricoh Software Research Center(Beijing) Co., Ltd., China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3034-9639","authenticated-orcid":false,"given":"Kaizhu","family":"Huang","sequence":"additional","affiliation":[{"name":"Data Science 
Research Center, Duke Kunshan University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5201-5564","authenticated-orcid":false,"given":"Lei","family":"Ding","sequence":"additional","affiliation":[{"name":"Ricoh Software Research Center(Beijing) Co., Ltd., China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0707-8076","authenticated-orcid":false,"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Advanced Technology, Xi\u2019an Jiaotong-Liverpool University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6267-0366","authenticated-orcid":false,"given":"Xiaowei","family":"Huang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Liverpool, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0918-4606","authenticated-orcid":false,"given":"Qiu-Feng","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Advanced Technology, Xi\u2019an Jiaotong-Liverpool University, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"1508","volume-title":"Proceedings of the CVPR","author":"Bai Fan","year":"2018","unstructured":"Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. 2018. Edit probability for scene text recognition. In Proceedings of the CVPR. 1508\u20131516."},{"key":"e_1_3_2_3_2","first-page":"178","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Bautista Darwin","year":"2022","unstructured":"Darwin Bautista and Rowel Atienza. 2022. Scene text recognition with permuted autoregressive sequence models. In Proceedings of the European Conference on Computer Vision. Springer, 178\u2013196."},{"key":"e_1_3_2_4_2","first-page":"14940","volume-title":"Proceedings of the CVPR","author":"Bhunia Ayan Kumar","year":"2021","unstructured":"Ayan Kumar Bhunia, Aneeshan Sain, Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, and Yi-Zhe Song. 2021. 
Joint visual semantic reasoning: Multi-stage decoder for text recognition. In Proceedings of the CVPR. 14940\u201314949."},{"key":"e_1_3_2_5_2","first-page":"113","volume-title":"Proceedings of the AAAI","author":"Bian Xiaohang","year":"2022","unstructured":"Xiaohang Bian, Bo Qin, Xiaozhe Xin, Jianwu Li, Xuefeng Su, and Yanfeng Wang. 2022. Handwritten mathematical expression recognition via attention aggregation based bi-directional mutual learning. In Proceedings of the AAAI. 113\u2013121."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01185"},{"key":"e_1_3_2_7_2","unstructured":"Jingye Chen Haiyang Yu Jianqi Ma Mengnan Guan Xixi Xu Xiaocong Wang Shaobo Qu Bin Li and Xiangyang Xue. 2021. Benchmarking chinese text recognition: Datasets baselines and an empirical study. CoRR abs\/2112.15093 (2021)."},{"key":"e_1_3_2_8_2","first-page":"5076","volume-title":"Proceedings of the ICCV","author":"Cheng Zhanzhan","year":"2017","unstructured":"Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. 2017. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of the ICCV. 5076\u20135084."},{"key":"e_1_3_2_9_2","first-page":"322","volume-title":"Proceedings of the ECCV","author":"Da Cheng","year":"2022","unstructured":"Cheng Da, Peng Wang, and Cong Yao. 2022. Levenshtein OCR. In Proceedings of the ECCV. 322\u2013338."},{"key":"e_1_3_2_10_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth \\(16\\times 16\\) Words: Transformers for Image Recognition at Scale. In ICLR ."},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Yongkun Du Zhineng Chen Caiyan Jia Xiaoting Yin Tianlun Zheng Chenxia Li Yuning Du and Yu-Gang Jiang. 2022. SVTR: Scene Text Recognition with a Single Visual Model. 
In IJCAI . 884\u2013890.","DOI":"10.24963\/ijcai.2022\/124"},{"key":"e_1_3_2_12_2","first-page":"7098","volume-title":"Proceedings of the CVPR","author":"Fang Shancheng","year":"2021","unstructured":"Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. 2021. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the CVPR. 7098\u20137107."},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Mudasir A. Ganaie Minghui Hu A. K. Malik M. Tanveer and P. N. Suganthan. 2022. Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence 115 (2022) 105151.","DOI":"10.1016\/j.engappai.2022.105151"},{"key":"e_1_3_2_14_2","first-page":"888","volume-title":"Proceedings of the AAAI","author":"He Yue","year":"2022","unstructured":"Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, and Bo Du. 2022. Visual semantics allow for textual reasoning better in scene text recognition. In Proceedings of the AAAI. 888\u2013896."},{"key":"e_1_3_2_15_2","unstructured":"Geoffrey Hinton Oriol Vinyals Jeff Dean et\u00a0al. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531. Retrieved from https:\/\/arxiv.org\/abs\/1503.02531"},{"key":"e_1_3_2_16_2","first-page":"705","volume-title":"Proceedings of the International Conference on Neural Information Processing","author":"Hu Yijie","year":"2022","unstructured":"Yijie Hu, Bin Dong, Qiufeng Wang, Lei Ding, Xiaobo Jin, and Kaizhu Huang. 2022. Towards accurate alignment and sufficient context in scene text recognition. In Proceedings of the International Conference on Neural Information Processing. Springer, 705\u2013717."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-06073-2"},{"key":"e_1_3_2_18_2","unstructured":"Masakazu Iwamura. 2018. Advances of Scene Text Datasets. 
CoRR abs\/1812.05219 (2018)."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1080\/02664763.2018.1441383"},{"key":"e_1_3_2_20_2","first-page":"553","article-title":"Application of majority voting to pattern recognition: An analysis of its behavior and performance","author":"Lam Louisa","year":"1997","unstructured":"Louisa Lam and S. Y. Suen. 1997. Application of majority voting to pattern recognition: An analysis of its behavior and performance. IEEE T-SMC 27, 5 (1997), 553\u2013568.","journal-title":"IEEE T-SMC"},{"key":"e_1_3_2_21_2","first-page":"4050","volume-title":"Proceedings of the CVPR","author":"Lee Chen-Yu","year":"2014","unstructured":"Chen-Yu Lee, Anurag Bhardwaj, Wei Di, Vignesh Jagadeesh, and Robinson Piramuthu. 2014. Region-based discriminative feature pooling for scene text recognition. In Proceedings of the CVPR. 4050\u20134057."},{"key":"e_1_3_2_22_2","first-page":"8610","volume-title":"Proceedings of the AAAI","author":"Li Hui","year":"2019","unstructured":"Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. 2019. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI. 8610\u20138617."},{"key":"e_1_3_2_23_2","first-page":"152","volume-title":"Proceedings of the ICONIP","author":"Li Jing","year":"2020","unstructured":"Jing Li, Qiu-Feng Wang, Rui Zhang, and Kaizhu Huang. 2020. Adversarial rectification network for scene text regularization. In Proceedings of the ICONIP. 152\u2013163."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2011.17"},{"key":"e_1_3_2_25_2","unstructured":"Yuliang Liu Zhang Li Hongliang Li Wenwen Yu Mingxin Huang Dezhi Peng Mingyu Liu Mingrui Chen Chunyuan Li Lianwen Jin and Xiang Bai. 2023. On the hidden mystery of ocr in large multimodal models. CoRR abs\/2305.07895 (2023)."},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Simon M. 
Lucas Alex Panaretos Luis Sosa Anthony Tang Shirley Wong Robert Young Kazuki Ashida Hiroki Nagai Masayuki Okamoto Hiroaki Yamamoto Hidetoshi Miyao JunMin Zhu WuWen Ou Christian Wolf Jean-Michel Jolion Leon Todoran Marcel Worring and Xiaofan Lin. 2005. ICDAR 2003 robust reading competitions: entries results and future directions. International Journal of Document Analysis and Recognition (IJDAR) 7 2\u20133 (2005) 105\u2013122.","DOI":"10.1007\/s10032-004-0134-3"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.01.020"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2022.108889"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475238"},{"key":"e_1_3_2_30_2","first-page":"13528","volume-title":"Proceedings of the CVPR","author":"Qiao Zhi","year":"2020","unstructured":"Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, and Weiping Wang. 2020. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. In Proceedings of the CVPR. 13528\u201313537."},{"key":"e_1_3_2_31_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_32_2","doi-asserted-by":"crossref","unstructured":"Baoguang Shi Xiang Bai and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. 
IEEE T-PAMI 39 11 (2017) 2298\u20132304.","DOI":"10.1109\/TPAMI.2016.2646371"},{"key":"e_1_3_2_33_2","article-title":"Aster: An attentional scene text recognizer with flexible rectification","author":"Shi Baoguang","year":"2018","unstructured":"Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2018. Aster: An attentional scene text recognizer with flexible rectification. IEEE T-PAMI 41, 9 (2018), 2035\u20132048.","journal-title":"IEEE T-PAMI"},{"key":"e_1_3_2_34_2","first-page":"593","volume-title":"Proceedings of the CVPR","year":"1994","unstructured":"Jianbo Shi and Carlo Tomasi. 1994. Good features to track. In Proceedings of the CVPR. 593\u2013600."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_36_2","first-page":"4563","volume-title":"Proceedings of the CVPR","author":"Tang Jingqun","year":"2022","unstructured":"Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. 2022. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the CVPR. 4563\u20134572."},{"key":"e_1_3_2_37_2","unstructured":"Xin Tang Diao Liang Wang Jun Fang Rui Xie Guotong and Chen Weifu. 2022. Visual-Semantic Transformer for Scene Text Recognition. In BMVC . 772."},{"key":"e_1_3_2_38_2","first-page":"11425","volume-title":"Proceedings of the CVPR","author":"Wan Zhaoyi","year":"2020","unstructured":"Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, and Cong Yao. 2020. On vocabulary reliance in scene text recognition. In Proceedings of the CVPR. 11425\u201311434."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.264"},{"key":"e_1_3_2_40_2","first-page":"3304","volume-title":"Proceedings of the ICPR","author":"Wang Tao","year":"2012","unstructured":"Tao Wang, David J. Wu, Adam Coates, and Andrew Y. Ng. 2012. End-to-end text recognition with convolutional neural networks. 
In Proceedings of the ICPR. 3304\u20133308."},{"key":"e_1_3_2_41_2","first-page":"12216","volume-title":"Proceedings of the AAAI","author":"Wang Tianwei","year":"2020","unstructured":"Tianwei Wang, Yuanzhi Zhu, Lianwen Jin, Canjie Luo, Xiaoxue Chen, Yaqiang Wu, Qianying Wang, and Mingxiang Cai. 2020. Decoupled attention network for text recognition. In Proceedings of the AAAI. 12216\u201312224."},{"key":"e_1_3_2_42_2","first-page":"14194","volume-title":"Proceedings of the ICCV","author":"Wang Yuxin","year":"2021","unstructured":"Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, and Yongdong Zhang. 2021. From two to one: A new scene text recognizer with visual language modeling network. In Proceedings of the ICCV. 14194\u201314203."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3231737"},{"key":"e_1_3_2_44_2","first-page":"303","volume-title":"Proceedings of the ECCV","author":"Xie Xudong","year":"2022","unstructured":"Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, and Xiang Bai. 2022. Toward understanding WordArt: Corner-guided transformer for scene text recognition. In Proceedings of the ECCV. 303\u2013321."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.2975798"},{"issue":"1","key":"e_1_3_2_46_2","first-page":"43","article-title":"Task-adaptive attention for image captioning","volume":"32","author":"Yan Chenggang","year":"2021","unstructured":"Chenggang Yan, Yiming Hao, Liang Li, Jian Yin, Anan Liu, Zhendong Mao, Zhenyu Chen, and Xingyu Gao. 2021. Task-adaptive attention for image captioning. 
IEEE Transactions on Circuits and Systems for Video technology 32, 1 (2021), 43\u201351.","journal-title":"IEEE Transactions on Circuits and Systems for Video technology"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404374"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472810"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468872"},{"key":"e_1_3_2_50_2","first-page":"284","volume-title":"Proceedings of the CVPR","author":"Yan Ruijie","year":"2021","unstructured":"Ruijie Yan, Liangrui Peng, Shanyu Xiao, and Gang Yao. 2021. Primitive representation learning for scene text recognition. In Proceedings of the CVPR. 284\u2013293."},{"key":"e_1_3_2_51_2","first-page":"12113","volume-title":"Proceedings of the CVPR","author":"Yu Deli","year":"2020","unstructured":"Deli Yu, Xuan Li, Chengquan Zhang, Tao Liu, Junyu Han, Jingtuo Liu, and Errui Ding. 2020. Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the CVPR. 12113\u201312122."},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","unstructured":"Xinyun Zhang Binwu Zhu Xufeng Yao Qi Sun Ruiyu Li and Bei Yu. 2022. Context-based contrastive learning for scene text recognition. In AAAI Vol. 36 3353\u20133361.","DOI":"10.1609\/aaai.v36i3.20245"},{"key":"e_1_3_2_53_2","first-page":"4320","volume-title":"Proceedings of the CVPR","author":"Zhang Ying","year":"2018","unstructured":"Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In Proceedings of the CVPR. 4320\u20134328."},{"key":"e_1_3_2_54_2","first-page":"4159","volume-title":"Proceedings of the CVPR","author":"Zhang Zheng","year":"2016","unstructured":"Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. 2016. Multi-oriented text detection with fully convolutional networks. In Proceedings of the CVPR. 
4159\u20134167."},{"key":"e_1_3_2_55_2","unstructured":"Shuai Zhao Xiaohan Wang Linchao Zhu and Yi Yang. 2023. CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model. CoRR abs\/2305.14014 (2023)."},{"key":"e_1_3_2_56_2","unstructured":"Xizhou Zhu Weijie Su Lewei Lu Bin Li Xiaogang Wang and Jifeng Dai. 2021. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR ."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633517","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3633517","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:09Z","timestamp":1750287009000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633517"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":55,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3633517"],"URL":"https:\/\/doi.org\/10.1145\/3633517","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"2023-07-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}