{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T02:00:46Z","timestamp":1772676046589,"version":"3.50.1"},"reference-count":46,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2020,9,15]],"date-time":"2020-09-15T00:00:00Z","timestamp":1600128000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"CERNET innovation project","award":["NGII20180617"],"award-info":[{"award-number":["NGII20180617"]}]},{"name":"SJQU research project","award":["SJQ19010"],"award-info":[{"award-number":["SJQ19010"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Language-based person search retrieves images of a target person using natural language description and is a challenging fine-grained cross-modal retrieval task. A novel hybrid attention network is proposed for the task. The network includes the following three aspects: First, a cubic attention mechanism for person image, which combines cross-layer spatial attention and channel attention. It can fully excavate both important midlevel details and key high-level semantics to obtain better discriminative fine-grained feature representation of a person image. Second, a text attention network for language description, which is based on bidirectional LSTM (BiLSTM) and self-attention mechanism. It can better learn the bidirectional semantic dependency and capture the key words of sentences, so as to extract the context information and key semantic features of the language description more effectively and accurately. Third, a cross-modal attention mechanism and a joint loss function for cross-modal learning, which can pay more attention to the relevant parts between text and image features. 
It can better exploit both the cross-modal and intra-modal correlation and can better solve the problem of cross-modal heterogeneity. Extensive experiments have been conducted on the CUHK-PEDES dataset. Our approach obtains higher performance than state-of-the-art approaches, demonstrating the advantage of the approach we propose.<\/jats:p>","DOI":"10.3390\/s20185279","type":"journal-article","created":{"date-parts":[[2020,9,15]],"date-time":"2020-09-15T10:24:09Z","timestamp":1600165449000},"page":"5279","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Hybrid Attention Network for Language-Based Person Search"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4915-613X","authenticated-orcid":false,"given":"Yang","family":"Li","sequence":"first","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"},{"name":"School of Information Technology, Shanghai Jianqiao University, Shanghai 201306, China"}]},{"given":"Huahu","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}]},{"given":"Junsheng","family":"Xiao","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}]}],"member":"1968","published-online":{"date-parts":[[2020,9,15]]},"reference":[{"key":"ref_1","unstructured":"Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi, S. (2020). Deep learning for person re-identification: A survey and outlook. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Wu, A., Zheng, W.-S., Guo, X., and Lai, J.-H. (2019, January 15\u201321). Distilled person re-identification: Towards a more scalable system. 
Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00128"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Dong, Q., Zhu, X., and Gong, S. (November, January 29). Person search by text attribute query as zero-shot learning. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00375"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1016\/j.patcog.2019.06.006","article-title":"Improving person re-identification by attribute and identity learning","volume":"95","author":"Lin","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Li, Y., Xu, H., Bian, M., and Xiao, J. (2020). Attention based CNN-ConvLSTM for Pedestrian attribute recognition. Sensors, 20.","DOI":"10.3390\/s20030811"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., and Wang, X. (2017, January 21\u201326). Person search with natural language description. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.551"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, Y., and Lu, H. (2018, January 13\u201316). Deep Cross-Modal Projection Learning for Image-Text Matching. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_42"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"11189","DOI":"10.1609\/aaai.v34i07.6777","article-title":"Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search","volume":"Volume 34","author":"Jing","year":"2020","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1038\/nrn755","article-title":"Control of goal-directed and stimulus-driven attention in the brain","volume":"3","author":"Corbetta","year":"2002","journal-title":"Nat. Rev. Neurosci."},{"key":"ref_10","unstructured":"Larochelle, H., and Hinton, G.E. (2010, January 6\u20139). Learning to combine foveal glimpses with a third-order Boltzmann machine. Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, Canada."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zeiler, M.D., and Fergus, R. (2014, January 6\u201312). Visualizing and Understanding Convolutional Networks. Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. (2018, January 13\u201316). CBAM: Convolutional Block Attention Module. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_15","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA."},{"key":"ref_16","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3383184","article-title":"Dual-path convolutional image-text embeddings with instance loss","volume":"16","author":"Zheng","year":"2020","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Varior, R.R., Shuai, B., Lu, J., Xu, D., and Wang, G. (2016, January 8\u201316). A siamese long short-term memory architecture for human re-identification. Proceedings of the Computer Vision\u2014ECCV 2016, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46478-7_9"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Liu, X., Zhao, H., Tian, M., Sheng, L., Shao, J., Yi, S., Yan, J., and Wang, X. (2017, January 22\u201329). Hydraplus-net: Attentive deep features for pedestrian analysis. 
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.46"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"He, L., Liang, J., Li, H., and Sun, Z. (2018, January 18\u201322). Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00739"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Song, C., Huang, Y., Ouyang, W., and Wang, L. (2018, January 18\u201322). Mask-guided contrastive attention model for person re-identification. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00129"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Guo, Y., and Cheung, N.-M. (2018, January 18\u201322). Efficient and deep person re-identification using multi-level similarity. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00248"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Xin, X., Wu, X., Wang, Y., and Wang, J. (2019, January 22\u201325). Deep Self-Paced Learning for Semi-Supervised Person Re-Identification Using Multi-View Self-Paced Clustering. Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803290"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3243316","article-title":"Unsupervised person re-identification","volume":"14","author":"Fan","year":"2018","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Xiao, T., Li, S., Wang, B., Lin, L., and Wang, X. (2017, January 21\u201326). 
Joint detection and identification feature learning for person search. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.360"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., and Tian, Q. (2017, January 21\u201326). Person re-identification in the wild. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.357"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"012053","DOI":"10.1088\/1757-899X\/646\/1\/012053","article-title":"End to end person re-identification based on attention mechanism","volume":"Volume 646","author":"Li","year":"2019","journal-title":"Proceedings of the IOP Conference Series: Materials Science and Engineering"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Stefan, L.-D., Abdulamit, S., Dogariu, M., Constantin, M.G., and Ionescu, B. (2020, January 18\u201320). Deep learning-based person search with visual attention embedding. Proceedings of the 2020 13th International Conference on Communications (COMM), Bucharest, Romania.","DOI":"10.1109\/COMM48946.2020.9141958"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Dong, W., Zhang, Z., Song, C., and Tan, T. (2020, January 13\u201319). Instance guided proposal network for person search. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00266"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhong, Y., Wang, X., and Zhang, S. (2020, January 13\u201319). Robust partial matching for person search in the wild. 
Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00686"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Li, S., Xiao, T., Li, H., Yang, W., and Wang, X. (2017, January 22\u201329). Identity-aware textual-visual matching with latent co-attention. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.209"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Chen, T., Xu, C., and Luo, J. (2018, January 12\u201315). Improving text-based person search by spatial matching and adaptive threshold. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00208"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yamaguchi, M., Saito, K., Ushiku, Y., and Harada, T. (2017, January 22\u201329). Spatio-temporal person retrieval via natural language queries. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.162"},{"key":"ref_34","unstructured":"Shah, A., and Vuong, T. (2018). Natural Language Person Search Using Deep Reinforcement Learning. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Aggarwal, S., Babu, R.V., and Chakraborty, A. (2020, January 1\u20135). Text-based person search via attribute-aided matching. Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093640"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_37","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013, January 5\u201310). Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA."},{"key":"ref_38","unstructured":"Mikolov, T., Corrado, G., Chen, K., and Dean, J. (2013, January 2\u20134). Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F.F. (2009, January 20\u201325). ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1109\/72.279181","article-title":"Learning long-term dependencies with gradient descent is difficult","volume":"5","author":"Bengio","year":"1994","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_41","unstructured":"Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv."},{"key":"ref_42","unstructured":"Karpathy, A., Joulin, A., and Li, F. (2014, January 8\u201311). Deep fragment embeddings for bidirectional image sentence mapping. Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Nam, H., Ha, J.-W., and Kim, J. (2017, January 21\u201326). Dual attention networks for multimodal reasoning and matching. 
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.232"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Reed, S., Akata, Z., Lee, H., and Schiele, B. (2016, January 27\u201330). Learning deep representations of fine-grained visual descriptions. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.13"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Chen, D., Li, H., Liu, X., Shen, Y., Shao, J., Yuan, Z., and Wang, X. (2018, January 13\u201316). Improving deep visual representation for person re-identification by global and local image-language association. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01270-0_4"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/18\/5279\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:10:16Z","timestamp":1760177416000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/18\/5279"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,15]]},"references-count":46,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2020,9]]}},"alternative-id":["s20185279"],"URL":"https:\/\/doi.org\/10.3390\/s20185279","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,15]]}}}