{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,26]],"date-time":"2026-05-26T16:02:35Z","timestamp":1779811355007,"version":"3.53.1"},"reference-count":65,"publisher":"Elsevier BV","license":[{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.elsevier.com\/tdm\/userlicense\/1.0\/"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.elsevier.com\/legal\/tdmrep-license"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"stm-asf","delay-in-days":0,"URL":"https:\/\/doi.org\/10.15223\/policy-017"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"stm-asf","delay-in-days":0,"URL":"https:\/\/doi.org\/10.15223\/policy-037"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"stm-asf","delay-in-days":0,"URL":"https:\/\/doi.org\/10.15223\/policy-012"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"stm-asf","delay-in-days":0,"URL":"https:\/\/doi.org\/10.15223\/policy-029"},{"start":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T00:00:00Z","timestamp":1785542400000},"content-version":"stm-asf","delay-in-days":0,"URL":"https:\/\/doi.org\/10.15223\/policy-004"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["elsevier.com","sciencedirect.com"],"crossmark-restriction":true},"short-container-title":["Engineering Applications of Artificial Intelligence"],"published-print":{"date-parts":[[2026,8]]},"DOI":"10.1016\/j.engappai.2026.114990","type":"journal-article","created":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T12:36:56Z","timestamp":1778071016000},"page":"114990","update-policy":"https:\/\/doi.org\/10.1016\/elsevier_cm_policy","source":"Crossref","is-referenced-by-count":0,"special_numbering":"P1","title":["Learning Region-aware Patch Embedding with mask-text prompted contrastive supervision for vision-language tracking"],"prefix":"10.1016","volume":"178","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0469-1359","authenticated-orcid":false,"given":"Kang","family":"Liu","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6108-2097","authenticated-orcid":false,"given":"Long","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yunhe","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Pingyan","family":"Hu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"78","reference":[{"key":"10.1016\/j.engappai.2026.114990_b1","unstructured":"Chen, Hong-You, Lai, Zhengfeng, Zhang, Haotian, Wang, Xinze, Eichner, Marcin, You, Keen, Cao, Meng, Zhang, Bowen, Yang, Yinfei, Gan, Zhe, 2025. Contrastive Localized Language-Image Pre-Training. In: Forty-Second International Conference on Machine Learning."},{"key":"10.1016\/j.engappai.2026.114990_b2","doi-asserted-by":"crossref","unstructured":"Cheng, Tianheng, Song, Lin, Ge, Yixiao, Liu, Wenyu, Wang, Xinggang, Shan, Ying, 2024. Yolo-world: Real-time open-vocabulary object detection. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 16901\u201316911.","DOI":"10.1109\/CVPR52733.2024.01599"},{"key":"10.1016\/j.engappai.2026.114990_b3","doi-asserted-by":"crossref","unstructured":"Chunhui, Zhang, Xin, Sun, Li, Liu, Yiqian, Yang, Qiong, Liu, Xi, Zhou, Yanfeng, Wang, 2023. All in one: Exploring unified vision-language tracking with multi-modal alignment. In: ACM International Conference on Multimedia. ACMMM.","DOI":"10.1145\/3581783.3611803"},{"key":"10.1016\/j.engappai.2026.114990_b4","doi-asserted-by":"crossref","unstructured":"Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina, 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171\u20134186.","DOI":"10.18653\/v1\/N19-1423"},{"key":"10.1016\/j.engappai.2026.114990_b5","series-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021","article-title":"An image is worth 16x16 words: Transformers for image recognition at scale","author":"Dosovitskiy","year":"2021"},{"issue":"2","key":"10.1016\/j.engappai.2026.114990_b6","doi-asserted-by":"crossref","first-page":"439","DOI":"10.1007\/s11263-020-01387-y","article-title":"Lasot: A high-quality large-scale single object tracking benchmark","volume":"129","author":"Fan","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"10.1016\/j.engappai.2026.114990_b7","doi-asserted-by":"crossref","unstructured":"Fan, Heng, Lin, Liting, Yang, Fan, Chu, Peng, Deng, Ge, Yu, Sijia, Bai, Hexin, Xu, Yong, Liao, Chunyuan, Ling, Haibin, 2019. Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 5374\u20135383.","DOI":"10.1109\/CVPR.2019.00552"},{"key":"10.1016\/j.engappai.2026.114990_b8","doi-asserted-by":"crossref","unstructured":"Feng, Qi, Ablavsky, Vitaly, Bai, Qinxun, Sclaroff, Stan, 2021. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 5851\u20135860.","DOI":"10.1109\/CVPR46437.2021.00579"},{"key":"10.1016\/j.engappai.2026.114990_b9","first-page":"14903","article-title":"MemVLT: Vision-language tracking with adaptive memory-based prompts","volume":"37","author":"Feng","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"10.1016\/j.engappai.2026.114990_b10","series-title":"ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing","first-page":"1","article-title":"Enhancing vision-language tracking by effectively converting textual cues into visual cues","author":"Feng","year":"2025"},{"key":"10.1016\/j.engappai.2026.114990_b11","doi-asserted-by":"crossref","unstructured":"Ge, Jiawei, Cao, Jiuxin, Zhu, Xuelin, Zhang, Xinyu, Liu, Chang, Wang, Kun, Liu, Bo, 2024. Consistencies are all you need for semi-supervised vision-language tracking. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 1895\u20131904.","DOI":"10.1145\/3664647.3680657"},{"key":"10.1016\/j.engappai.2026.114990_b12","first-page":"4446","article-title":"Divert more attention to vision-language tracking","volume":"35","author":"Guo","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"10.1016\/j.engappai.2026.114990_b13","doi-asserted-by":"crossref","unstructured":"He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Doll\u00e1r, Piotr, Girshick, Ross, 2022. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000\u201316009.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"10.1016\/j.engappai.2026.114990_b14","doi-asserted-by":"crossref","unstructured":"He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross, 2020. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729\u20139738.","DOI":"10.1109\/CVPR42600.2020.00975"},{"issue":"2","key":"10.1016\/j.engappai.2026.114990_b15","first-page":"3","article-title":"Lora: Low-rank adaptation of large language models","volume":"1","author":"Hu","year":"2022","journal-title":"ICLR"},{"key":"10.1016\/j.engappai.2026.114990_b16","first-page":"25007","article-title":"A multi-modal global instance tracking benchmark (mgit): Better locating target in complex spatio-temporal and causal relationship","volume":"36","author":"Hu","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"10.1016\/j.engappai.2026.114990_b17","doi-asserted-by":"crossref","unstructured":"Huang, Fuxiang, Zhang, Lei, Fu, Xiaowei, Song, Suqi, 2024. Dynamic weighted combiner for mixed-modal image retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38, pp. 2303\u20132311.","DOI":"10.1609\/aaai.v38i3.28004"},{"issue":"5","key":"10.1016\/j.engappai.2026.114990_b18","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","article-title":"Got-10k: A large high-diversity benchmark for generic object tracking in the wild","volume":"43","author":"Huang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"10.1016\/j.engappai.2026.114990_b19","doi-asserted-by":"crossref","unstructured":"Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C, Lo, Wan-Yen, et al., 2023. Segment anything. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision. pp. 4015\u20134026.","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"10.1016\/j.engappai.2026.114990_b20","doi-asserted-by":"crossref","unstructured":"Law, Hei, Deng, Jia, 2018. Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 734\u2013750.","DOI":"10.1007\/978-3-030-01264-9_45"},{"key":"10.1016\/j.engappai.2026.114990_b21","article-title":"Multi-modal hybrid interaction vision-language tracking","author":"Lei","year":"2025","journal-title":"IEEE Trans. Multimed."},{"key":"10.1016\/j.engappai.2026.114990_b22","doi-asserted-by":"crossref","unstructured":"Li, Siyuan, Fischer, Tobias, Ke, Lei, Ding, Henghui, Danelljan, Martin, Yu, Fisher, 2023. Ovtrack: Open-vocabulary multiple object tracking. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 5567\u20135577.","DOI":"10.1109\/CVPR52729.2023.00539"},{"key":"10.1016\/j.engappai.2026.114990_b23","doi-asserted-by":"crossref","unstructured":"Li, Xin, Huang, Yuqing, He, Zhenyu, Wang, Yaowei, Lu, Huchuan, Yang, Ming-Hsuan, 2023. Citetracker: Correlating image and text for visual tracking. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision. pp. 9974\u20139983.","DOI":"10.1109\/ICCV51070.2023.00915"},{"key":"10.1016\/j.engappai.2026.114990_b24","article-title":"Boost tracking by natural language with prompt-guided grounding","author":"Li","year":"2024","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"10.1016\/j.engappai.2026.114990_b25","doi-asserted-by":"crossref","unstructured":"Li, Zhenyang, Tao, Ran, Gavves, Efstratios, Snoek, Cees GM, Smeulders, Arnold WM, 2017. Tracking by natural language specification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6495\u20136503.","DOI":"10.1109\/CVPR.2017.777"},{"key":"10.1016\/j.engappai.2026.114990_b26","doi-asserted-by":"crossref","unstructured":"Li, Yihao, Yu, Jun, Cai, Zhongpeng, Pan, Yuwen, 2022. Cross-modal target retrieval for tracking by natural language. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 4931\u20134940.","DOI":"10.1109\/CVPRW56347.2022.00540"},{"key":"10.1016\/j.engappai.2026.114990_b27","doi-asserted-by":"crossref","unstructured":"Li, Liunian Harold, Zhang, Pengchuan, Zhang, Haotian, Yang, Jianwei, Li, Chunyuan, Zhong, Yiwu, Wang, Lijuan, Yuan, Lu, Zhang, Lei, Hwang, Jenq-Neng, et al., 2022. Grounded language-image pre-training. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965\u201310975.","DOI":"10.1109\/CVPR52688.2022.01069"},{"key":"10.1016\/j.engappai.2026.114990_b28","article-title":"SIEVL-track: Exploring semantic information enhancement for visual-language object tracking","author":"Li","year":"2025","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"10.1016\/j.engappai.2026.114990_b29","series-title":"European Conference on Computer Vision","first-page":"300","article-title":"Tracking meets lora: Faster training, larger model, stronger performance","author":"Lin","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b30","first-page":"16743","article-title":"Swintrack: A simple and strong baseline for transformer tracking","volume":"35","author":"Lin","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"10.1016\/j.engappai.2026.114990_b31","series-title":"European Conference on Computer Vision","first-page":"740","article-title":"Microsoft coco: Common objects in context","author":"Lin","year":"2014"},{"key":"10.1016\/j.engappai.2026.114990_b32","series-title":"Decoupled weight decay regularization","author":"Loshchilov","year":"2017"},{"key":"10.1016\/j.engappai.2026.114990_b33","doi-asserted-by":"crossref","unstructured":"Ma, Yinchao, Tang, Yuyang, Yang, Wenfei, Zhang, Tianzhu, Zhang, Jinpeng, Kang, Mengxue, 2024. Unifying visual and vision-language tracking via contrastive learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38, pp. 4107\u20134116.","DOI":"10.1609\/aaai.v38i5.28205"},{"key":"10.1016\/j.engappai.2026.114990_b34","doi-asserted-by":"crossref","unstructured":"Ma, Ding, Wu, Xiangqian, 2021. Capsule-based object tracking with natural language specification. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 1948\u20131956.","DOI":"10.1145\/3474085.3475349"},{"key":"10.1016\/j.engappai.2026.114990_b35","doi-asserted-by":"crossref","first-page":"2254","DOI":"10.1109\/TIP.2025.3553290","article-title":"A swiss army knife for tracking by natural language specification","volume":"34","author":"Mao","year":"2025","journal-title":"IEEE Trans. Image Process."},{"key":"10.1016\/j.engappai.2026.114990_b36","series-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing","first-page":"8025","article-title":"Textual tokens classification for multi-modal alignment in vision-language tracking","author":"Mao","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b37","doi-asserted-by":"crossref","unstructured":"Muller, Matthias, Bibi, Adel, Giancola, Silvio, Alsubaihi, Salman, Ghanem, Bernard, 2018. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 300\u2013317.","DOI":"10.1007\/978-3-030-01246-5_19"},{"key":"10.1016\/j.engappai.2026.114990_b38","series-title":"Representation learning with contrastive predictive coding","author":"Oord","year":"2018"},{"key":"10.1016\/j.engappai.2026.114990_b39","doi-asserted-by":"crossref","DOI":"10.1016\/j.engappai.2025.110482","article-title":"Visual object tracking using learnable target-aware token emphasis","volume":"149","author":"Park","year":"2025","journal-title":"Eng. Appl. Artif. Intell."},{"key":"10.1016\/j.engappai.2026.114990_b40","series-title":"European Conference on Computer Vision","first-page":"462","article-title":"Online zero-shot classification with clip","author":"Qian","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b41","series-title":"International Conference on Machine Learning","first-page":"8748","article-title":"Learning transferable visual models from natural language supervision","author":"Radford","year":"2021"},{"key":"10.1016\/j.engappai.2026.114990_b42","doi-asserted-by":"crossref","unstructured":"Rezatofighi, Hamid, Tsoi, Nathan, Gwak, JunYoung, Sadeghian, Amir, Reid, Ian, Savarese, Silvio, 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 658\u2013666.","DOI":"10.1109\/CVPR.2019.00075"},{"key":"10.1016\/j.engappai.2026.114990_b43","doi-asserted-by":"crossref","unstructured":"Shao, Yanyan, He, Shuting, Ye, Qi, Feng, Yuchao, Luo, Wenhan, Chen, Jiming, 2024. Context-aware integration of language and visual references for natural language tracking. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 19208\u201319217.","DOI":"10.1109\/CVPR52733.2024.01817"},{"key":"10.1016\/j.engappai.2026.114990_b44","doi-asserted-by":"crossref","unstructured":"Shtedritski, Aleksandar, Rupprecht, Christian, Vedaldi, Andrea, 2023. What does clip know about a red circle? visual prompt engineering for vlms. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision. pp. 11987\u201311997.","DOI":"10.1109\/ICCV51070.2023.01101"},{"key":"10.1016\/j.engappai.2026.114990_b45","doi-asserted-by":"crossref","DOI":"10.1145\/3757322","article-title":"Language-guided visual tracking: Comprehensive and effective multimodal information fusion","author":"Song","year":"2025","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"10.1016\/j.engappai.2026.114990_b46","doi-asserted-by":"crossref","unstructured":"Wang, Xiao, Shu, Xiujun, Zhang, Zhipeng, Jiang, Bo, Wang, Yaowei, Tian, Yonghong, Wu, Feng, 2021. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 13763\u201313773.","DOI":"10.1109\/CVPR46437.2021.01355"},{"key":"10.1016\/j.engappai.2026.114990_b47","doi-asserted-by":"crossref","DOI":"10.1016\/j.engappai.2024.108329","article-title":"Dynamic region-aware transformer backbone network for visual tracking","volume":"133","author":"Wang","year":"2024","journal-title":"Eng. Appl. Artif. Intell."},{"key":"10.1016\/j.engappai.2026.114990_b48","series-title":"2024 IEEE International Conference on Multimedia and Expo","first-page":"1","article-title":"Joint language prompt and object tracking","author":"Weng","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b49","doi-asserted-by":"crossref","unstructured":"Wu, You, Wang, Xucheng, Yang, Xiangyang, Liu, Mengyuan, Zeng, Dan, Ye, Hengzhou, Li, Shuiwang, 2025. Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17103\u201317113.","DOI":"10.1109\/CVPR52734.2025.01594"},{"key":"10.1016\/j.engappai.2026.114990_b50","doi-asserted-by":"crossref","unstructured":"Wu, Size, Zhang, Wenwei, Xu, Lumin, Jin, Sheng, Liu, Wentao, Loy, Chen Change, 2024. Clim: Contrastive language-image mosaic for region representation. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38, pp. 6117\u20136125.","DOI":"10.1609\/aaai.v38i6.28428"},{"key":"10.1016\/j.engappai.2026.114990_b51","series-title":"European Conference on Computer Vision","first-page":"341","article-title":"Joint feature learning and relation modeling for tracking: A one-stream framework","author":"Ye","year":"2022"},{"key":"10.1016\/j.engappai.2026.114990_b52","first-page":"1","article-title":"RWKV-inspired multi-modal relation modeling for vision-language tracking","author":"Zhang","year":"2026","journal-title":"IEEE Trans. Multimed."},{"issue":"10","key":"10.1016\/j.engappai.2026.114990_b53","doi-asserted-by":"crossref","first-page":"9053","DOI":"10.1109\/TCSVT.2024.3395352","article-title":"One-stream stepwise decreasing for vision-language tracking","volume":"34","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"10.1016\/j.engappai.2026.114990_b54","unstructured":"Zhang, Guangtong, Zhong, Bineng, Liang, Qihua, Mo, Zhiyi, Song, Shuxiang, 2024b. Diffusion mask-driven visual-language tracking. In: Proc. 33rd Int. Joint Conf. Artif. Intell. pp. 1652\u20131660."},{"key":"10.1016\/j.engappai.2026.114990_b55","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1016\/j.patrec.2023.02.023","article-title":"Transformer vision-language tracking via proxy token guided cross-modal fusion","volume":"168","author":"Zhao","year":"2023","journal-title":"Pattern Recognit. Lett."},{"issue":"4","key":"10.1016\/j.engappai.2026.114990_b56","doi-asserted-by":"crossref","first-page":"2125","DOI":"10.1109\/TCSVT.2023.3301933","article-title":"Toward unified token learning for vision-language tracking","volume":"34","author":"Zheng","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"10.1016\/j.engappai.2026.114990_b57","doi-asserted-by":"crossref","unstructured":"Zheng, Yaozong, Zhong, Bineng, Liang, Qihua, Mo, Zhiyi, Zhang, Shengping, Li, Xianxian, 2024. Odtrack: Online dense temporal token learning for visual tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38, pp. 7588\u20137596.","DOI":"10.1609\/aaai.v38i7.28591"},{"key":"10.1016\/j.engappai.2026.114990_b58","doi-asserted-by":"crossref","unstructured":"Zhong, Yiwu, Yang, Jianwei, Zhang, Pengchuan, Li, Chunyuan, Codella, Noel, Li, Liunian Harold, Zhou, Luowei, Dai, Xiyang, Yuan, Lu, Li, Yin, et al., 2022. Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 16793\u201316803.","DOI":"10.1109\/CVPR52688.2022.01629"},{"key":"10.1016\/j.engappai.2026.114990_b59","series-title":"Findings of the Association for Computational Linguistics: ACL 2024","first-page":"15890","article-title":"Visual in-context learning for large vision-language models","author":"Zhou","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b60","doi-asserted-by":"crossref","unstructured":"Zhou, Yucheng, Song, Lingran, Shen, Jianbing, 2025. Improving medical large vision-language models with abnormal-aware feedback. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12994\u201313011.","DOI":"10.18653\/v1\/2025.acl-long.636"},{"key":"10.1016\/j.engappai.2026.114990_b61","doi-asserted-by":"crossref","unstructured":"Zhou, Kaiyang, Yang, Jingkang, Loy, Chen Change, Liu, Ziwei, 2022. Conditional prompt learning for vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 16816\u201316825.","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"10.1016\/j.engappai.2026.114990_b62","doi-asserted-by":"crossref","unstructured":"Zhou, Li, Zhou, Zikun, Mao, Kaige, He, Zhenyu, 2023. Joint visual grounding and tracking with natural language specification. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 23151\u201323160.","DOI":"10.1109\/CVPR52729.2023.02217"},{"key":"10.1016\/j.engappai.2026.114990_b63","series-title":"International Conference on Representation Learning","first-page":"18378","article-title":"MiniGPT-4: Enhancing vision-language understanding with advanced large language models","volume":"Vol. 2024","author":"Zhu","year":"2024"},{"key":"10.1016\/j.engappai.2026.114990_b64","doi-asserted-by":"crossref","DOI":"10.1016\/j.engappai.2025.110787","article-title":"Joint feature extraction and alignment in object tracking with vision-language model","volume":"152","author":"Zhu","year":"2025","journal-title":"Eng. Appl. Artif. Intell."},{"key":"10.1016\/j.engappai.2026.114990_b65","doi-asserted-by":"crossref","DOI":"10.1109\/TCSVT.2025.3557053","article-title":"Learning language prompt for vision-language tracking","author":"Zong","year":"2025","journal-title":"IEEE Trans. Circuits Syst. Video Technol."}],"container-title":["Engineering Applications of Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.elsevier.com\/content\/article\/PII:S095219762601273X?httpAccept=text\/xml","content-type":"text\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/api.elsevier.com\/content\/article\/PII:S095219762601273X?httpAccept=text\/plain","content-type":"text\/plain","content-version":"vor","intended-application":"text-mining"}],"deposited":{"date-parts":[[2026,5,26]],"date-time":"2026-05-26T15:05:31Z","timestamp":1779807931000},"score":1,"resource":{"primary":{"URL":"https:\/\/linkinghub.elsevier.com\/retrieve\/pii\/S095219762601273X"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,8]]},"references-count":65,"alternative-id":["S095219762601273X"],"URL":"https:\/\/doi.org\/10.1016\/j.engappai.2026.114990","relation":{},"ISSN":["0952-1976"],"issn-type":[{"value":"0952-1976","type":"print"}],"subject":[],"published":{"date-parts":[[2026,8]]},"assertion":[{"value":"Elsevier","name":"publisher","label":"This article is maintained by"},{"value":"Learning Region-aware Patch Embedding with mask-text prompted contrastive supervision for vision-language tracking","name":"articletitle","label":"Article Title"},{"value":"Engineering Applications of Artificial Intelligence","name":"journaltitle","label":"Journal Title"},{"value":"https:\/\/doi.org\/10.1016\/j.engappai.2026.114990","name":"articlelink","label":"CrossRef DOI link to publisher maintained version"},{"value":"article","name":"content_type","label":"Content Type"},{"value":"\u00a9 2026 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.","name":"copyright","label":"Copyright"}],"article-number":"114990"}}