{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T05:15:13Z","timestamp":1765430113099,"version":"3.46.0"},"reference-count":60,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,12,7]],"date-time":"2025-12-07T00:00:00Z","timestamp":1765065600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62476060"],"award-info":[{"award-number":["62476060"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Controllable Image Captioning (CIC) aims to generate coherent and semantically faithful textual descriptions of images while adhering to user-specified constraints. Existing methods have achieved promising results under individual constraints such as sentimental style or sentence length. However, they typically fail to handle and satisfy multiple constraints simultaneously, as the controls often interact and interfere with one another. To overcome these challenges, we propose Internal\u2013External Multi-Agent Steering (IE-MAS) for CIC. IE-MAS introduces an internal multimodal steering (IMS) strategy to control affective coherence within the caption, and an external multi-agent collaboration system (EMCS) to guide visual grounding and contextual alignment. From an information-theoretic view, IMS reduces uncertainty in the generation process, while EMCS strengthens the dependency between captions and visual inputs, converting the length and sentiment constraints into information gains. 
Together, they achieve a stable balance among semantic consistency, affective expression, and length control through an adaptive steering process that dynamically weighs internal linguistic control against external perceptual grounding. Experimental results demonstrate that IE-MAS effectively coordinates multiple constraints, producing captions that satisfy the length constraint and are sentimentally expressive and visually faithful.<\/jats:p>","DOI":"10.3390\/e27121237","type":"journal-article","created":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T08:21:43Z","timestamp":1765182103000},"page":"1237","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["IE-MAS: Internal\u2013External Multi-Agent Steering for Controllable Image Captioning"],"prefix":"10.3390","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4254-313X","authenticated-orcid":false,"given":"Tiecheng","family":"Cai","sequence":"first","affiliation":[{"name":"College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China"}]},{"given":"Chao","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China"}]},{"given":"Shanshan","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6131-8449","authenticated-orcid":false,"given":"Sibo","family":"Ju","sequence":"additional","affiliation":[{"name":"College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China"}]},{"given":"Xiangwen","family":"Liao","sequence":"additional","affiliation":[{"name":"College of Computer and Data Science, Fuzhou University, Fuzhou 350108, 
China"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Farhadi, A., Hejrati, M., Sadeghi, M., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010, January 5\u201311). Every picture tells a story: Generating sentences from images. Proceedings of the European Conference on Computer Vision, Crete, Greece.","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Chen, S., Jin, Q., Wang, P., and Wu, Q. (2020, January 13\u201319). Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00998"},{"key":"ref_3","unstructured":"Wang, N., Xie, J., Wu, J., Jia, M., and Li, L. (2023, January 7\u201314). Controllable image captioning via prompting. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zeng, Z., Zhang, H., Lu, R., Wang, D., Chen, B., and Wang, Z. (2023, January 17\u201324). Conzic: Controllable zero-shot image captioning by sampling-based polishing. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02247"},{"key":"ref_5","unstructured":"Danescu-Niculescu-Mizil, C., Gamon, M., and Dumais, S. (April, January 28). Mark my words! Linguistic style accommodation in social media. Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3617592","article-title":"Deep learning approaches on image captioning: A review","volume":"56","author":"Ghandi","year":"2023","journal-title":"ACM Comput. 
Surv."},{"key":"ref_7","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-VL technical report. arXiv."},{"key":"ref_8","first-page":"2024","article-title":"Llama 3.2: Revolutionizing edge AI and vision with open, customizable models","volume":"20","author":"Meta","year":"2024","journal-title":"Meta AI Blog. Retrieved Dec."},{"key":"ref_9","unstructured":"Che, W., Nabende, J., Shutova, E., and Pilehvar, M.T. (August, January 27). Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Deng, C., Ding, N., Tan, M., and Wu, Q. (2020, January 23\u201328). Length-controllable image captioning. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58601-0_42"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Wang, N., Duan, F., Zhang, Y., Zhou, W., Xu, K., Huang, W., and Fu, J. (2024, January 12\u201316). PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA.","DOI":"10.18653\/v1\/2024.findings-emnlp.983"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"100143","DOI":"10.1016\/j.nlp.2025.100143","article-title":"Precise length control for large language models","volume":"11","author":"Butcher","year":"2025","journal-title":"Nat. Lang. Process. J."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gu, Y., Wang, W., Feng, X., Zhong, W., Zhu, K., Huang, L., Liu, T., and Qin, B. (August, January 27). Length Controlled Generation for Black-box LLMs. 
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria.","DOI":"10.18653\/v1\/2025.acl-long.825"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Retkowski, F., and Waibel, A. (May, January 29). Zero-Shot Strategies for Length-Controllable Summarization. Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA.","DOI":"10.18653\/v1\/2025.findings-naacl.34"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1207\/s15516709cog1202_4","article-title":"Cognitive load during problem solving: Effects on learning","volume":"12","author":"Sweller","year":"1988","journal-title":"Cogn. Sci."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Bai, L., Borah, A., Ignat, O., and Mihalcea, R. (May, January 29). The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA.","DOI":"10.18653\/v1\/2025.naacl-long.152"},{"key":"ref_17","unstructured":"Lee, S., Yoon, S., Bui, T., Shi, J., and Yoon, S. (2025, January 13\u201319). Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage. Proceedings of the Forty-second International Conference on Machine Learning, Vancouver, BC, Canada."},{"key":"ref_18","unstructured":"Yang, P., and Dong, B. (2025). Mocoll: Agent-based specific and general model collaboration for image captioning. arXiv."},{"key":"ref_19","first-page":"38154","article-title":"HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face","volume":"36","author":"Shen","year":"2023","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhang, X., Dong, X., Wang, Y., Zhang, D., and Cao, F. (2025). A Survey of Multi-AI Agent Collaboration: Theories, Technologies and Applications, Association for Computing Machinery.","DOI":"10.1145\/3745238.3745531"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.neucom.2018.03.078","article-title":"Image captioning by incorporating affective concepts learned from both visual and textual components","volume":"328","author":"Yang","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."},{"key":"ref_23","unstructured":"Cover, T.M., and Thomas, J.A. (1999). Elements of Information Theory, John Wiley & Sons."},{"key":"ref_24","first-page":"9694","article-title":"Align before fuse: Vision and language representation learning with momentum distillation","volume":"34","author":"Li","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Mathews, A., Xie, L., and He, X. (2016, January 12\u201317). SentiCap: Generating image descriptions with sentiments. Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10475"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gan, C., Gan, Z., He, X., Gao, J., and Deng, L. (2017, January 21\u201326). StyleNet: Generating attractive visual captions with styles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.108"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Guo, L., Liu, J., Yao, P., Li, J., and Lu, H. (2019, January 15\u201320). MSCap: Multi-Style Image Captioning with Unpaired Stylized Text. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00433"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, W., Shi, Y., and Zhao, J. (2021, January 13\u201315). A robustly optimized BERT pre-training approach with post-training. Proceedings of the China National Conference on Chinese Computational Linguistics, Hohhot, China.","DOI":"10.1007\/978-3-030-84186-7_31"},{"key":"ref_29","unstructured":"Radford, A., Kim, J., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Online."},{"key":"ref_30","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023, January 23\u201329). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tian, J., Yang, Z., and Shi, S. (2022, January 19\u201322). Unsupervised style control for image captioning. Proceedings of the International Conference of Pioneering Computer Scientists, Engineers and Educators, Chengdu, China.","DOI":"10.1007\/978-981-19-5194-7_31"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020, January 13\u201319). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_33","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. 
arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"ref_35","unstructured":"Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., and Le, Q.V. (2021). Finetuned language models are zero-shot learners. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Kikuchi, Y., Neubig, G., Sasano, R., Takamura, H., and Okumura, M. (2016, January 1\u20134). Controlling Output Length in Neural Encoder-Decoders. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA.","DOI":"10.18653\/v1\/D16-1140"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Takase, S., and Okazaki, N. (2019, January 3\u20135). Positional Encoding to Control Output Sequence Length. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA.","DOI":"10.18653\/v1\/N19-1401"},{"key":"ref_38","first-page":"1","article-title":"Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing","volume":"55","author":"Liu","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"5266","DOI":"10.1109\/TCSVT.2023.3343520","article-title":"Cascade semantic prompt alignment network for image captioning","volume":"34","author":"Li","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. (2024, January 11\u201316). Steering Llama 2 via Contrastive Activation Addition. 
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.acl-long.828"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Subramani, N., Suresh, N., and Peters, M.E. (2022, January 22\u201327). Extracting Latent Steering Vectors from Pretrained Language Models. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.findings-acl.48"},{"key":"ref_42","unstructured":"Turner, A.M., Thiergart, L., Leech, G., Udell, D., Vazquez, J.J., Mini, U., and MacDiarmid, M. (2023). Steering language models with activation engineering. arXiv."},{"key":"ref_43","unstructured":"Soo, S., Teng, W., Balaganesh, C., Tan, G., and Yan, M. (2025, January 24). Interpretable Steering of Large Language Models with Feature Guided Activation Additions. Proceedings of the Building Trust Workshop at ICLR 2025, Singapore."},{"key":"ref_44","unstructured":"Bayat, R., Rahimi-Kalahroudi, A., Pezeshki, M., Chandar, S., and Vincent, P. (2025). Steering large language model activations in sparse spaces. arXiv."},{"key":"ref_45","unstructured":"Su, J., Chen, J., Li, H., Chen, Y., Qing, L., and Zhang, Z. (August, January 27). Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria."},{"key":"ref_46","unstructured":"Kim, J., Lee, J., Choi, H.J., Hsu, T.Y., Huang, C.Y., Kim, S., Rossi, R., Yu, T., Giles, C.L., and Huang, T.H. (March, January 25). Multi-LLM Collaborative Caption Generation in Scientific Documents. Proceedings of the International Workshop on AI for Transportation, Philadelphia, PA, USA."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Jiang, A., Wang, D., Peng, C., and Wang, M. (2025). 
Relational Reasoning Image Captioning via Multi-Agent Retrieval-Augmented Generation. Knowl.-Based Syst., 114977.","DOI":"10.2139\/ssrn.5351688"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1073","DOI":"10.1007\/s11263-023-01752-7","article-title":"Sentimental visual captioning using multimodal transformer","volume":"131","author":"Wu","year":"2023","journal-title":"Int. J. Comput. Vis."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_50","unstructured":"Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., and Chen, C. (2022). Toy models of superposition. arXiv."},{"key":"ref_51","unstructured":"Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., and Dombrowski, A.K. (2023). Representation engineering: A top-down approach to ai transparency. arXiv."},{"key":"ref_52","unstructured":"Huben, R., Cunningham, H., Riggs Smith, L., Ewart, A., and Sharkey, L. (2024, January 7\u201311). Sparse Autoencoders Find Highly Interpretable Features in Language Models. Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria."},{"key":"ref_53","unstructured":"Wooldridge, M. (2009). An introduction to Multiagent Systems, John Wiley & Sons."},{"key":"ref_54","unstructured":"Ferber, J., and Weiss, G. (1999). Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence, Addison-Wesley Reading."},{"key":"ref_55","unstructured":"Tishby, N., Pereira, F.C., and Bialek, W. (2000, January 4\u20136). The Information Bottleneck Method. 
Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA."},{"key":"ref_56","unstructured":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019, January 8). DistilBERT: A Distilled Version of BERT\u2014Smaller, Faster, Cheaper and Lighter. Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Vancouver, BC, Canada."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Hessel, J., Holtzman, A., Forbes, M., and Choi, Y. (2021, January 7\u201311). CLIPScore: A Reference-Free Evaluation Metric for Image Captioning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Punta Cana, Dominican Republic. Online.","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Tewel, Y., Shalev, Y., Schwartz, I., and Wolf, L. (2022, January 18\u201324). Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01739"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/12\/1237\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T05:12:30Z","timestamp":1765429950000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/12\/1237"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,7]]},"references-count":60,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["e27121237"],"URL":"https:\/\/doi.org\/10.3390\/e27121237","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2025,12,7]]}}}