{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T14:49:51Z","timestamp":1776350991750,"version":"3.51.2"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T00:00:00Z","timestamp":1617840000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61702502, U1836204, U1936108, and 61572221"],"award-info":[{"award-number":["61702502, U1836204, U1936108, and 61572221"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key Research and Development Program of China","award":["2016YFB0800402"],"award-info":[{"award-number":["2016YFB0800402"]}]},{"name":"Major Projects of the National Social Science Foundation","award":["16ZDA092"],"award-info":[{"award-number":["16ZDA092"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM\/IMS Trans. Data Sci."],"published-print":{"date-parts":[[2021,5,31]]},"abstract":"<jats:p>Recognizing irregular text from natural scene images is challenging due to the unconstrained appearance of text, such as curvature, orientation, and distortion. Recent recognition networks regard this task as a text sequence labeling problem and most networks capture the sequence only from a single-granularity visual representation, which to some extent limits the performance of recognition. In this article, we propose a hierarchical attention network to capture multi-granularity deep local representations for recognizing irregular scene text. It consists of several hierarchical attention blocks, and each block contains a Local Visual Representation Module (LVRM) and a Decoder Module (DM). Based on the hierarchical attention network, we propose a scene text recognition network. The extensive experiments show that our proposed network achieves the state-of-the-art performance on several benchmark datasets including IIIT-5K, SVT, CUTE, SVT-Perspective, and ICDAR datasets under shorter training time.<\/jats:p>","DOI":"10.1145\/3446971","type":"journal-article","created":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T16:54:59Z","timestamp":1617900899000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Multi-granularity Deep Local Representations for Irregular Scene Text Recognition"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4068-0042","authenticated-orcid":false,"given":"Hongchao","family":"Gao","sequence":"first","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, China and School of Cyber Security, University of Chinese Academy of Sciences, Hai Dian, Bei Jing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yujia","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, China and School of Cyber Security, University of Chinese Academy of Sciences, Hai Dian, Bei Jing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiao","family":"Dai","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Hai Dian, Bei Jing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xi","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Hai Dian, Bei Jing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jizhong","family":"Han","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ruixuan","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, China and School of Computer Science and Technology, Huazhong University of Science and Technology, Wu Han, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,4,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. IEEE, 4714\u20134722","author":"Baek Jeonghun","year":"2019","unstructured":"Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. IEEE, 4714\u20134722. DOI:https:\/\/doi.org\/10.1109\/ICCV.2019.00481"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Bai Fan","year":"2018","unstructured":"Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. 2018. Edit probability for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219861"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.38"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.543"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Cheng Zhanzhan","year":"2018","unstructured":"Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. 2018. AON: Towards arbitrarily oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 5571\u20135579."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305478"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240571"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Fu Jianlong","year":"2017","unstructured":"Jianlong Fu, Heliang Zheng, and Tao Mei. 2017. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 4438\u20134446."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 1672\u20131677","author":"Gao Hongchao","year":"2019","unstructured":"Hongchao Gao, Xi Wang, Yujia Li, Jizhong Han, Songlin Hu, and Ruixuan Li. 2019. Self-representation convolutional neural networks. In Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 1672\u20131677."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305510"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.254"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969465"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1412","author":"Jaderberg Max","year":"2015","unstructured":"Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep structured output learning for unconstrained text recognition. In Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1412.5903."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0823-z"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201914)","author":"Jaderberg Max","year":"2014","unstructured":"Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep features for text spotting. In Proceedings of the European Conference on Computer Vision (ECCV\u201914). 512\u2013528."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 37th IEEE International Conference on Computer Design. IEEE, 199\u2013207","author":"Jiang Tianming","year":"2019","unstructured":"Tianming Jiang, Jiangfeng Zeng, Ke Zhou, Ping Huang, and Tianming Yang. 2019. Lifelong disk failure prediction via GAN-based anomaly detection. In Proceedings of the 37th IEEE International Conference on Computer Design. IEEE, 199\u2013207. DOI:https:\/\/doi.org\/10.1109\/ICCD46524.2019.00033"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2015.7333942"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2013.221"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","first-page":"2033","DOI":"10.1109\/TPDS.2019.2902392","article-title":"Improving cache performance for large-scale photo stores via heuristic prefetching scheme","volume":"30","author":"Wang Hua","year":"2019","unstructured":"Hua Wang, Ping Huang, Xubin He, Ran Lai, Wenyan Li, Wenjie Liu, Tianming Yang, Ke Zhou, and Si Sun. 2019. Improving cache performance for large-scale photo stores via heuristic prefetching scheme. IEEE Trans. Parallel Distrib. Syst. 30, 9 (2019), 2033\u20132045.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.).","author":"Diederik","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http:\/\/arxiv.org\/abs\/1412.6980."},{"key":"e_1_2_1_25_1","unstructured":"Alex Krizhevsky and G. Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Lee C.","unstructured":"C. Lee, S. Osinderoet al. 2016. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 2231\u20132239."},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 33th AAAI Conference on Artificial Intelligence (AAAI\u201919)","author":"Li Hui","year":"2019","unstructured":"Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. 2019. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the 33th AAAI Conference on Artificial Intelligence (AAAI\u201919)."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201919)","author":"Liao Minghui","year":"2019","unstructured":"Minghui Liao, Jian Zhang, Zhaoyi Wan, et al. 2019. Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201919)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 5th International Conference on Learning Representations.","author":"Lin Zhouhan","year":"2017","unstructured":"Zhouhan Lin, Minwei Feng, C\u00edcero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of the 5th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=BJC_jUqxe."},{"key":"e_1_2_1_31_1","volume-title":"Berg","author":"Liu Wei","year":"2016","unstructured":"Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV\u201916). Springer, 21\u201337."},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the British Machine Vision Conference, Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (Eds.). BMVA Press.","author":"Liu Wei","year":"2016","unstructured":"Wei Liu, Chaofeng Chen, Kwan-Yee K. Wong, Zhizhong Su, and Junyu Han. 2016. STAR-Net: A SpaTial Attention Residue Network for scene text recognition. In Proceedings of the British Machine Vision Conference, Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (Eds.). BMVA Press. Retrieved from http:\/\/www.bmva.org\/bmvc\/2016\/papers\/paper043\/index.html."},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI\u201918)","author":"Liu Wei","year":"2018","unstructured":"Wei Liu, Chaofeng Chen, and Kwan-Yee K Wong. 2018. Char-Net: A character-aware neural network for distorted scene text recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI\u201918)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2018.2822781"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157129"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-004-0134-3"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.01.020"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.10.041"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000027790.02288.f2"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the British Machine Vision Conference (BMVC\u201912)","author":"Mishra Anand","unstructured":"Anand Mishra, Karteek Alahari, and C. V. Jawahar. 2012. Scene text recognition using higher order language priors. In Proceedings of the British Machine Vision Conference (BMVC\u201912)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2354409.2354999"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969073"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2002.1017623"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.76"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2014.07.008"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2646371"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.452"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.452"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2848939"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.).","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Tian Zhuotao","year":"2019","unstructured":"Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. 2019. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 4234\u20134243."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294803"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126402"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201919)","author":"Wang Peng","year":"2019","unstructured":"Peng Wang, Lu Yang, et al. 2019. A simple and robust convolutional-attention network for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201919)."},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7794\u20137803","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7794\u20137803."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.5555\/3172077.3172347"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.515"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2366765"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2893806"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.07.085"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300085"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.03.047"}],"container-title":["ACM\/IMS Transactions on Data Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3446971","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3446971","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T13:55:49Z","timestamp":1776347749000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3446971"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,8]]},"references-count":63,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,5,31]]}},"alternative-id":["10.1145\/3446971"],"URL":"https:\/\/doi.org\/10.1145\/3446971","relation":{},"ISSN":["2691-1922"],"issn-type":[{"value":"2691-1922","type":"print"}],"subject":[],"published":{"date-parts":[[2021,4,8]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}