{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T12:05:25Z","timestamp":1775217925453,"version":"3.50.1"},"reference-count":49,"publisher":"Institution of Engineering and Technology (IET)","issue":"1","license":[{"start":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T00:00:00Z","timestamp":1761264000000},"content-version":"vor","delay-in-days":296,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"},{"start":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T00:00:00Z","timestamp":1735689600000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/doi.wiley.com\/10.1002\/tdm_license_1.1"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62276240"],"award-info":[{"award-number":["62276240"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["CUC25CGJ02"],"award-info":[{"award-number":["CUC25CGJ02"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Image Processing"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>Image emotion classification remains a challenging task due to the intrinsic subjectivity of emotional perception and the semantic ambiguity inherent in visual content. Although recent studies have applied language supervision to exploit semantic cues, but most methods rely on fixed templates that exhibit limited emotional relevance, while generating high\u2010quality affective textual descriptions typically involves substantial manual effort and cost. To overcome these limitations, this paper proposes an emotion classification framework that integrates language supervision with instruction tuning. The proposed approach significantly improves emotion classification performance through three key components: (1) instruction\u2010guided generation of descriptive emotion captions (DECs), (2) cross\u2010modal pseudo\u2010label construction, and (3) adaptive multimodal fusion. We design emotion\u2010centric instructional prompts to guide large pre\u2010trained vision\u2010language models in generating semantically rich DECs, thereby surpassing the constraints of conventional template\u2010based methods. Instruction tuning is further employed to generate structured emotion pseudo\u2010labels, forming image\u2010caption\u2010pseudo\u2010label triplets that strengthen cross\u2010modal alignment. Finally, an adaptive fusion mechanism combined with a multi\u2010branch loss function is introduced to optimize classification efficacy. Extensive experiments conducted across multiple domain\u2010specific datasets demonstrate that our method achieves state\u2010of\u2010the\u2010art accuracy. Ablation studies confirm the critical role of multimodal collaboration in enhancing model performance. Furthermore, detailed linguistic analysis shows that DECs achieve high levels of emotional expressiveness, as evidenced by their length, degree of abstraction, and affective distribution, closely approximating the quality of human\u2010authored descriptions. 
"DOI":"10.1049\/ipr2.70235","type":"journal-article","created":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T14:13:21Z","timestamp":1761315201000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["An Image Emotion Classification Framework Based on Instruction\u2010Guided Triplets With Descriptive Captions"],"prefix":"10.1049","volume":"19","author":[{"given":"Fuxiao","family":"Zhang","sequence":"first","affiliation":[{"name":"Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Communication University of China, Beijing, China"},{"name":"School of Computer and Cyber Sciences, Communication University of China, Beijing, China"},{"name":"Beijing Key Laboratory of Modern Entertainment Technology, Communication University of China, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3571-5324","authenticated-orcid":false,"given":"Jingjing","family":"Zhang","sequence":"additional","affiliation":[{"name":"Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Communication University of China, Beijing, China"},{"name":"School of Computer and Cyber Sciences, Communication University of China, Beijing, China"},{"name":"Beijing Key Laboratory of Modern Entertainment Technology, Communication University of China, Beijing, China"}]},{"given":"Chunxiao","family":"Wang","sequence":"additional","affiliation":[{"name":"Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Communication University of China, Beijing, China"},{"name":"Beijing Key Laboratory of Modern Entertainment Technology, Communication University of China, Beijing, China"},{"name":"Center for Ethnic and Folk Literature and Art Development, Ministry of Culture and Tourism, P.R.C., Beijing, China"}]},{"given":"Yanhao","family":"Li","sequence":"additional","affiliation":[{"name":"Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Communication University of China, Beijing, China"},{"name":"School of Computer and Cyber Sciences, Communication University of China, Beijing, China"},{"name":"Beijing Key Laboratory of Modern Entertainment Technology, Communication University of China, Beijing, China"}]}],"member":"265","published-online":{"date-parts":[[2025,10,24]]},"reference":[{"key":"e_1_2_10_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2017.02.003"},{"key":"e_1_2_10_3_1","doi-asserted-by":"crossref","unstructured":"J. Machajdik and A. Hanbury, \u201cAffective Image Classification Using Features Inspired by Psychology and Art Theory,\u201d in Proceedings of the 18th ACM International Conference on Multimedia (Association for Computing Machinery 2010) 83\u201392.","DOI":"10.1145\/1873951.1873965"},{"key":"e_1_2_10_4_1","doi-asserted-by":"crossref","unstructured":"D. Borth, R. Ji, T. Chen, et\u00a0al., \u201cLarge\u2010Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs,\u201d in Proceedings of the 21st ACM International Conference on Multimedia (Association for Computing Machinery 2013) 223\u2013232, https:\/\/doi.org\/10.1145\/2502081.2502282.","DOI":"10.1145\/2502081.2502282"},
{"key":"e_1_2_10_5_1","doi-asserted-by":"crossref","unstructured":"J. Yang, D. She, Y. K. Lai, et\u00a0al., \u201cRetrieving and Classifying Affective Images via Deep Metric Learning,\u201d Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI Press 2018) 8.","DOI":"10.1609\/aaai.v32i1.11275"},{"key":"e_1_2_10_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_10_7_1","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, et\u00a0al., \u201cAttention is All You Need,\u201d in Advances in Neural Information Processing Systems (Curran Associates 2017) 5998\u20136008."},{"key":"e_1_2_10_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042\u2010016\u20104310\u20105"},{"key":"e_1_2_10_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.05.104"},{"key":"e_1_2_10_10_1","doi-asserted-by":"crossref","unstructured":"L. Xu, Z. Wang, B. Wu, et\u00a0al., \u201cMDAN: Multi\u2010Level Dependent Attention Network for Visual Emotion Analysis,\u201d Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2022) 9479\u20139488.","DOI":"10.1109\/CVPR52688.2022.00926"},{"key":"e_1_2_10_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2022.3225049"},{"key":"e_1_2_10_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2023.3331776"},{"key":"e_1_2_10_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3282953"},{"key":"e_1_2_10_14_1","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2015.00444"},{"key":"e_1_2_10_15_1","doi-asserted-by":"publisher","DOI":"10.1177\/0963721414553440"},{"key":"e_1_2_10_16_1","doi-asserted-by":"crossref","unstructured":"P. Achlioptas, M. Ovsjanikov, L. Guibas, et\u00a0al., \u201cAffection: Learning Affective Explanations for Real\u2010World Visual Data,\u201d Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2023) 6641\u20136651.","DOI":"10.1109\/CVPR52729.2023.00642"},{"key":"e_1_2_10_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2024.3372090"},{"key":"e_1_2_10_18_1","unstructured":"A. Radford, J. W. Kim, C. Hallacy, et\u00a0al., \u201cLearning Transferable Visual Models From Natural Language Supervision,\u201d in Proceedings of the 38th International Conference on Machine Learning (PMLR 2021) 8748\u20138763."},{"key":"e_1_2_10_19_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i2.25353"},{"key":"e_1_2_10_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3341840"},{"key":"e_1_2_10_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01653-1"},{"key":"e_1_2_10_22_1","doi-asserted-by":"crossref","unstructured":"K. Zhou, J. Yang, C. C. Loy, et\u00a0al., \u201cConditional Prompt Learning for Vision\u2010Language Models,\u201d in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2022) 16816\u201316825.","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"e_1_2_10_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41095-023-0389-6"},{"key":"e_1_2_10_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2024.102366"},{"key":"e_1_2_10_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.111790"},{"key":"e_1_2_10_26_1","doi-asserted-by":"crossref","unstructured":"O. Vinyals, A. Toshev, S. Bengio, et\u00a0al., \u201cShow and Tell: A Neural Image Caption Generator,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE 2015) 3156\u20133164.","DOI":"10.1109\/CVPR.2015.7298935"},
{"key":"e_1_2_10_27_1","unstructured":"K. Xu, J. Ba, R. Kiros, et\u00a0al., \u201cShow, Attend and Tell: Neural Image Caption Generation With Visual Attention,\u201d in Proceedings of the International Conference on Machine Learning (PMLR 2015) 2048\u20132057."},{"key":"e_1_2_10_28_1","doi-asserted-by":"crossref","unstructured":"X. Li, X. Yin, C. Li, et\u00a0al., \u201cOscar: Object\u2010Semantics Aligned Pre\u2010Training for Vision\u2010Language Tasks,\u201d in Proceedings of the European Conference on Computer Vision (Springer 2020) 121\u2013137.","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_2_10_29_1","doi-asserted-by":"crossref","unstructured":"P. Zhang, X. Li, X. Hu, et\u00a0al., \u201cVinVL: Revisiting Visual Representations in Vision\u2010Language Models,\u201d in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2021) 5579\u20135588.","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"e_1_2_10_30_1","unstructured":"J. Li, D. Li, C. Xiong, et\u00a0al., \u201cBLIP: Bootstrapping Language\u2010Image Pre\u2010Training for Unified Vision\u2010Language Understanding and Generation,\u201d in Proceedings of the International Conference on Machine Learning (PMLR 2022) 12888\u201312900."},{"key":"e_1_2_10_31_1","doi-asserted-by":"crossref","unstructured":"Z. Zeng, H. Zhang, R. Lu, et\u00a0al., \u201cConZIC: Controllable Zero\u2010Shot Image Captioning by Sampling\u2010Based Polishing,\u201d in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2023) 23465\u201323476.","DOI":"10.1109\/CVPR52729.2023.02247"},{"key":"e_1_2_10_32_1","doi-asserted-by":"crossref","unstructured":"A. Mathews, L. Xie, and X. He, \u201cSentiCap: Generating Image Descriptions With Sentiments,\u201d in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Press 2016) 3574\u20133580.","DOI":"10.1609\/aaai.v30i1.10475"},{"key":"e_1_2_10_33_1","doi-asserted-by":"crossref","unstructured":"P. Achlioptas, M. Ovsjanikov, K. Haydarov, et\u00a0al., \u201cArtEmis: Affective Language for Visual Art,\u201d in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2021) 11569\u201311579.","DOI":"10.1109\/CVPR46437.2021.01140"},{"key":"e_1_2_10_34_1","doi-asserted-by":"crossref","unstructured":"Y. Mohamed, F. F. Khan, K. Haydarov, et\u00a0al., \u201cIt is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection,\u201d Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (IEEE 2022) 21263\u201321272.","DOI":"10.1109\/CVPR52688.2022.02058"},{"key":"e_1_2_10_35_1","unstructured":"X. Chen, H. Fang, T. Y. Lin, et\u00a0al., \u201cMicrosoft COCO Captions: Data Collection and Evaluation Server,\u201d preprint, arXiv, April 2, 2015, https:\/\/arxiv.org\/abs\/1504.00325."},{"key":"e_1_2_10_36_1","unstructured":"J. Li, D. Li, S. Savarese, et\u00a0al., \u201cBLIP\u20102: Bootstrapping Language\u2010Image Pre\u2010Training With Frozen Image Encoders and Large Language Models,\u201d in Proceedings of the International Conference on Machine Learning (PMLR 2023) 19730\u201319742."},{"key":"e_1_2_10_37_1","unstructured":"W. Dai, J. Li, D. Li, et\u00a0al., \u201cInstructBLIP: Towards General\u2010Purpose Vision\u2010Language Models With Instruction Tuning,\u201d preprint, arXiv, May 12, 2023, https:\/\/arxiv.org\/abs\/2305.06500."},
{"key":"e_1_2_10_38_1","doi-asserted-by":"crossref","unstructured":"Q. You, J. Luo, H. Jin, et\u00a0al., \u201cBuilding a Large Scale Dataset for Image Emotion Recognition: The Fine Print and the Benchmark,\u201d in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Press 2016) 1190\u20131196.","DOI":"10.1609\/aaai.v30i1.9987"},{"key":"e_1_2_10_39_1","doi-asserted-by":"crossref","unstructured":"K. C. Peng, A. Sadovnik, A. Gallagher, et\u00a0al., \u201cWhere Do Emotions Come From? Predicting the Emotion Stimuli Map,\u201d in Proceedings of the IEEE International Conference on Image Processing (IEEE 2016) 614\u2013618.","DOI":"10.1109\/ICIP.2016.7532430"},{"key":"e_1_2_10_40_1","doi-asserted-by":"crossref","unstructured":"Q. You, J. Luo, H. Jin, et\u00a0al., \u201cRobust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks,\u201d Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Press 2015) 381\u2013387.","DOI":"10.1609\/aaai.v29i1.9179"},{"key":"e_1_2_10_41_1","doi-asserted-by":"crossref","unstructured":"J. Zhang, C. Lin, C. Wang, et\u00a0al., \u201cMovieEmotion\u2010IMG: An Emotion Distribution Dataset of Movie Scene Images,\u201d Proceedings of the International Conference on Culture\u2010Oriented Science and Technology (IEEE 2021) 498\u2013503.","DOI":"10.1109\/ICCST53801.2021.00110"},{"key":"e_1_2_10_42_1","doi-asserted-by":"crossref","unstructured":"J. Yang, Q. Huang, T. Ding, et\u00a0al., \u201cEmoSet: A Large\u2010Scale Visual Emotion Dataset With Rich Attributes,\u201d in Proceedings of the IEEE\/CVF International Conference on Computer Vision (IEEE 2023) 20383\u201320394.","DOI":"10.1109\/ICCV51070.2023.01864"},{"key":"e_1_2_10_43_1","doi-asserted-by":"publisher","DOI":"10.3758\/BF03192732"},{"key":"e_1_2_10_44_1","doi-asserted-by":"publisher","DOI":"10.1080\/02699939208411068"},{"key":"e_1_2_10_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2939744"},{"key":"e_1_2_10_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2928998"},{"key":"e_1_2_10_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3106813"},{"key":"e_1_2_10_48_1","doi-asserted-by":"publisher","DOI":"10.3758\/s13428\u2010013\u20100403\u20105"},{"key":"e_1_2_10_49_1","unstructured":"\u201cTextBlob,\u201d accessed November 16, 2020, https:\/\/textblob.readthedocs.io\/en\/dev\/."},{"key":"e_1_2_10_50_1","doi-asserted-by":"crossref","unstructured":"C. Hutto and E. Gilbert, \u201cVADER: A Parsimonious Rule\u2010Based Model for Sentiment Analysis of Social Media Text,\u201d in Proceedings of the International AAAI Conference on Web and Social Media (AAAI Press 2014) 216\u2013225.","DOI":"10.1609\/icwsm.v8i1.14550"}],"container-title":["IET Image Processing"],
Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.70235","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/full-xml\/10.1049\/ipr2.70235","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.70235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T11:28:37Z","timestamp":1775215717000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/ipr2.70235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["10.1049\/ipr2.70235"],"URL":"https:\/\/doi.org\/10.1049\/ipr2.70235","archive":["Portico"],"relation":{},"ISSN":["1751-9659","1751-9667"],"issn-type":[{"value":"1751-9659","type":"print"},{"value":"1751-9667","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1]]},"assertion":[{"value":"2025-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70235"}}