{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:26:02Z","timestamp":1781483162967,"version":"3.54.1"},"reference-count":46,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2024,9,21]],"date-time":"2024-09-21T00:00:00Z","timestamp":1726876800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China","award":["62306310"],"award-info":[{"award-number":["62306310"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Accurate crop disease classification is crucial for ensuring food security and enhancing agricultural productivity. However, the existing crop disease classification algorithms primarily focus on a single image modality and typically require a large number of samples. Our research counters these issues by using pre-trained Vision\u2013Language Models (VLMs), which enhance the multimodal synergy for better crop disease classification than the traditional unimodal approaches. Firstly, we apply the multimodal model Qwen-VL to generate meticulous textual descriptions for representative disease images selected through clustering from the training set, which will serve as prompt text for generating classifier weights. Compared to solely using the language model for prompt text generation, this approach better captures and conveys fine-grained and image-specific information, thereby enhancing the prompt quality. Secondly, we integrate cross-attention and SE (Squeeze-and-Excitation) Attention into the training-free mode VLCD(Vision-Language model for Crop Disease classification) and the training-required mode VLCD-T (VLCD-Training), respectively, for prompt text processing, enhancing the classifier weights by emphasizing the key text features. The experimental outcomes conclusively prove our method\u2019s heightened classification effectiveness in few-shot crop disease scenarios, tackling the data limitations and intricate disease recognition issues. It offers a pragmatic tool for agricultural pathology and reinforces the smart farming surveillance infrastructure.<\/jats:p>","DOI":"10.3390\/s24186109","type":"journal-article","created":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T08:56:06Z","timestamp":1727168166000},"page":"6109","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["Few-Shot Image Classification of Crop Diseases Based on Vision\u2013Language Models"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-3624-5443","authenticated-orcid":false,"given":"Yueyue","family":"Zhou","sequence":"first","affiliation":[{"name":"School of Information Engineering, China University of Geosciences, Beijing 100083, China"},{"name":"State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4296-2289","authenticated-orcid":false,"given":"Hongping","family":"Yan","sequence":"additional","affiliation":[{"name":"School of Information Engineering, China University of Geosciences, Beijing 100083, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2256-8815","authenticated-orcid":false,"given":"Kun","family":"Ding","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tingting","family":"Cai","sequence":"additional","affiliation":[{"name":"School of Information Engineering, China University of Geosciences, Beijing 100083, China"},{"name":"State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yan","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Information Engineering, China University of Geosciences, Beijing 100083, China"},{"name":"State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2024,9,21]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"56683","DOI":"10.1109\/ACCESS.2021.3069646","article-title":"Plant disease detection and classification by deep learning\u2014A review","volume":"9","author":"Li","year":"2021","journal-title":"IEEE Access"},{"key":"ref_2","first-page":"154","article-title":"Image recognition of stored grain pests: Based on deep convolutional neural network","volume":"34","author":"Cheng","year":"2018","journal-title":"Chin. Agric. Sci. Bull."},{"key":"ref_3","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"5625","DOI":"10.1109\/TPAMI.2024.3369699","article-title":"Vision-Language Models for Vision Tasks: A Survey","volume":"46","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","unstructured":"Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., and Li, H. (2022). Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. European Conference on Computer Vision, Springer Nature Switzerland.","DOI":"10.1007\/978-3-031-19833-5_29"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhu, X., Zhang, R., He, B., Zhou, A., Wang, D., Zhao, B., and Gao, P. (2023, January 1\u20136). Not All Features Matter: Enhancing Few-Shot CLIP with Adaptive Prior Refinement. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00246"},{"key":"ref_8","unstructured":"Ng, A. (2024, July 04). AI Doesn\u2019t Have to Be Too Complicated or Expensive for Your Business. Harvard Business Review. Available online: https:\/\/hbr.org\/2021\/07\/ai-doesnt-have-to-be-too-complicated-or-expensive-for-your-business."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Hamid, O.H. (2023). Data-Centric and Model-Centric AI: Twin Drivers of Compact and Robust Industry 4.0 Solutions. Appl. Sci., 13.","DOI":"10.3390\/app13052753"},{"key":"ref_10","first-page":"367","article-title":"A novel approach for tomato leaf disease classification with deep convolutional neural networks","volume":"30","author":"Irmak","year":"2024","journal-title":"J. Agric. Sci."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1016\/j.compag.2018.01.009","article-title":"Deep learning models for plant disease detection and diagnosis","volume":"145","author":"Ferentinos","year":"2018","journal-title":"Comput. Electron. Agric."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2129","DOI":"10.3923\/itj.2014.2129.2136","article-title":"Design of automatic recognition of cucumber disease image","volume":"13","author":"Guo","year":"2014","journal-title":"Inf. Technol. J."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1016\/j.compag.2017.01.014","article-title":"Leaf image based cucumber disease recognition using sparse representation classification","volume":"134","author":"Zhang","year":"2017","journal-title":"Comput. Electron. Agric."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.compag.2019.01.041","article-title":"Analysis of transfer learning for deep neural network based plant classification models","volume":"158","author":"Kaya","year":"2019","journal-title":"Comput. Electron. Agric."},{"key":"ref_15","first-page":"1","article-title":"An interpretable high-accuracy method for rice disease detection based on multi-source data and transfer learning","volume":"13","author":"Bai","year":"2023","journal-title":"Agriculture"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13007-021-00770-1","article-title":"Semi-supervised few-shot learning approach for plant diseases recognition","volume":"17","author":"Li","year":"2021","journal-title":"Plant Methods"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Nuthalapati, S.V., and Tunga, A. (2021, January 10\u201317). Multi-domain few-shot learning and dataset for agricultural applications. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00161"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1109\/MGRS.2024.3383473","article-title":"Vision-language models in remote sensing: Current progress and future trends","volume":"12","author":"Li","year":"2024","journal-title":"IEEE Geosci. Remote Sens. Mag."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bossard, L., Guillaumin, M., and Van Gool, L. (2014, January 6\u201312). Food-101\u2013mining discriminative components with random forests. Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland. Part VI.","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"ref_20","first-page":"2611","article-title":"The hateful memes challenge: Detecting hate speech in multimodal memes","volume":"33","author":"Kiela","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhou, K., Yang, J., Loy, C.C., and Liu, Z. (2022, January 18\u201324). Conditional Prompt Learning for Vision-Language Models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yao, H., Zhang, R., and Xu, C. (2023, January 17\u201324). Visual-language prompt tuning with knowledge-guided context optimization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00653"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1007\/s11263-023-01891-x","article-title":"Clip-adapter: Better vision-language models with feature adapters","volume":"132","author":"Gao","year":"2023","journal-title":"Int. J. Comput. Vis."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yu, T., Lu, Z., Jin, X., Chen, Z., and Wang, X. (2023, January 17\u201324). Task residual for tuning vision-language models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01049"},{"key":"ref_26","unstructured":"Li, X., Lian, D., Lu, Z., Bai, J., Chen, Z., and Wang, X. (2023, January 3\u20136). GraphAdapter: Tuning Vision-Language Models with Dual Knowledge Graph. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA."},{"key":"ref_27","unstructured":"Lu, Z., Bai, J., Li, X., Xiao, Z., and Wang, X. (2024, January 21\u201327). Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models. Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_28","unstructured":"Lewis, K.M., Mu, E., Dalca, A.V., and Guttag, J. (2023). Gist: Generating image-specific text for fine-grained object classification. arXiv."},{"key":"ref_29","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2024, July 29). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/www.mikecaptain.com\/resources\/pdf\/GPT-1.pdf."},{"key":"ref_30","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). GPT-4 technical report. arXiv."},{"key":"ref_31","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_32","unstructured":"Martins, A., and Astudillo, R. (2016, January 19\u201324). From softmax to sparsemax: A sparse model of attention and multi-label classification. Proceedings of the International Conference on Machine Learning."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016, January 12\u201317). Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-1174"},{"key":"ref_34","first-page":"12077","article-title":"SegFormer: Simple and efficient design for semantic segmentation with transformers","volume":"34","author":"Xie","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"12581","DOI":"10.1109\/TPAMI.2023.3282631","article-title":"Uniformer: Unifying convolution and self-attention for visual recognition","volume":"45","author":"Li","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"339","DOI":"10.1007\/s41019-022-00200-9","article-title":"A multi-level mesh mutual attention model for visual question answering","volume":"7","author":"Lei","year":"2022","journal-title":"Data Sci. Eng."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichtenhofer, C. (2022, January 18\u201324). Trackformer: Multi-object tracking with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Maniparambil, M., Vorster, C., Molloy, D., Murphy, N., McGuinness, K., and O\u2019Connor, N.E. (2023, January 1\u20136). Enhancing CLIP with GPT-4: Harnessing visual descriptions as prompts. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCVW60793.2023.00034"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_40","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S., and Batra, N. (2020, January 5\u20137). PlantDoc: A dataset for visual plant disease detection. Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India.","DOI":"10.1145\/3371158.3371196"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_43","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020, January 30). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia."},{"key":"ref_44","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_46","unstructured":"Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.-W., Yao, Z., and Keutzer, K. (2021). How much can clip benefit vision-and-language tasks?. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/18\/6109\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:01:44Z","timestamp":1760112104000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/18\/6109"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,21]]},"references-count":46,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2024,9]]}},"alternative-id":["s24186109"],"URL":"https:\/\/doi.org\/10.3390\/s24186109","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,21]]}}}