{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T15:08:58Z","timestamp":1763564938768,"version":"3.45.0"},"reference-count":34,"publisher":"Fuji Technology Press Ltd.","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JACIII","J. Adv. Comput. Intell. Intell. Inform."],"published-print":{"date-parts":[[2025,11,20]]},"abstract":"<jats:p>This study proposes an image captioning method designed to incorporate user-specific explanatory intentions into the generated text, as signaled by the user\u2019s trace on the image. We extract areas of interest from dense sections of the trace, determine the order of explanations by tracking changes in the pen-tip coordinates, and assess the degree of interest in each area by analyzing the time spent on them. Additionally, a diffusion language model is utilized to generate sentences in a non-autoregressive manner, allowing control over sentence length based on the temporal data of the trace. 
In the actual caption generation task, the proposed method achieved higher string similarity than conventional methods, including autoregressive models, and successfully captured user intent from the trace and faithfully reflected it in the generated text.<\/jats:p>","DOI":"10.20965\/jaciii.2025.p1417","type":"journal-article","created":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T15:02:09Z","timestamp":1763564529000},"page":"1417-1426","source":"Crossref","is-referenced-by-count":0,"title":["Interactive Image Caption Generation Reflecting User Intent from Trace Using a Diffusion Language Model"],"prefix":"10.20965","volume":"29","author":[{"given":"Satoko","family":"Hirano","sequence":"first","affiliation":[{"name":"Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"name":"Editorial Office","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7789-475X","authenticated-orcid":true,"given":"Ichiro","family":"Kobayashi","sequence":"additional","affiliation":[{"name":"Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"8550","published-online":{"date-parts":[[2025,11,20]]},"reference":[{"key":"key-10.20965\/jaciii.2025.p1417-1","unstructured":"D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, \u201cPalm-e: An embodied multimodal language model,\u201d arXiv preprint, arXiv:2303.03378, 2023. https:\/\/doi.org\/10.48550\/arXiv.2303.03378"},{"key":"key-10.20965\/jaciii.2025.p1417-2","unstructured":"A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. 
Sutskever, \u201cLearning transferable visual models from natural language supervision,\u201d arXiv preprint, arXiv:2103.00020, 2021. https:\/\/doi.org\/10.48550\/arXiv.2103.00020"},{"key":"key-10.20965\/jaciii.2025.p1417-3","unstructured":"J. Li, D. Li, C. Xiong, and S. Hoi, \u201cBLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,\u201d Proc. of the 39th Int. Conf. on Machine Learning, pp. 12888-12900, 2022."},{"key":"key-10.20965\/jaciii.2025.p1417-4","unstructured":"J. Li, D. Li, S. Savarese, and S. Hoi, \u201cBLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,\u201d Proc. of the 40th Int. Conf. on Machine Learning (ICML\u201923), pp. 19730-19742, 2023. https:\/\/doi.org\/10.48550\/arXiv.2301.12597"},{"key":"key-10.20965\/jaciii.2025.p1417-5","unstructured":"A. Bhattacharyya, M. Palmer, and C. Heckman, \u201cReCAP: Semantic role enhanced caption generation,\u201d Proc. of the 2024 Joint Int. Conf. on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 13633-13649, 2024."},{"key":"key-10.20965\/jaciii.2025.p1417-6","doi-asserted-by":"crossref","unstructured":"K. Basioti, M. A. Abdelsalam, F. Fancellu, V. Pavlovic, and A. Fazly, \u201cCic-bart-ssa: Controllable image captioning with structured semantic augmentation,\u201d arXiv preprint, arXiv:2407.11393, 2024. https:\/\/doi.org\/10.48550\/arXiv.2407.11393","DOI":"10.1007\/978-3-031-72848-8_26"},{"key":"key-10.20965\/jaciii.2025.p1417-7","doi-asserted-by":"crossref","unstructured":"S. Mao, C. Zhang, H. Su, H. Song, I. Shalyminov, and W. Cai, \u201cControllable contextualized image captioning: Directing the visual narrative through user-defined highlights,\u201d Proc. of the 18th European Conf. on Computer Vision (ECCV), 2024. https:\/\/doi.org\/10.1007\/978-3-031-72973-7_27","DOI":"10.1007\/978-3-031-72973-7_27"},{"key":"key-10.20965\/jaciii.2025.p1417-8","unstructured":"X. 
Wang, M. Diao, B. Li, H. Zhang, K. Liang, and Z. Ma, \u201cFrom simple to professional: A combinatorial controllable image captioning agent,\u201d arXiv preprint, arXiv:2412.11025, 2024. https:\/\/doi.org\/10.48550\/arXiv.2412.11025"},{"key":"key-10.20965\/jaciii.2025.p1417-9","unstructured":"Y. Zhao, Y. Liu, Z. Guo, W. Wu, C. Gong, F. Wan, and Q. Ye, \u201cControlcap: Controllable region-level captioning,\u201d arXiv preprint, arXiv:2401.17910, 2024. https:\/\/doi.org\/10.48550\/arXiv.2401.17910"},{"key":"key-10.20965\/jaciii.2025.p1417-10","unstructured":"A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, \u201cHierarchical text-conditional image generation with clip latents,\u201d arXiv preprint, arXiv:2204.06125, 2022. https:\/\/doi.org\/10.48550\/arXiv.2204.06125"},{"key":"key-10.20965\/jaciii.2025.p1417-11","doi-asserted-by":"crossref","unstructured":"R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, \u201cHigh-resolution image synthesis with latent diffusion models,\u201d arXiv preprint, arXiv:2112.10752, 2021. https:\/\/doi.org\/10.48550\/arXiv.2112.10752","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"key-10.20965\/jaciii.2025.p1417-12","unstructured":"X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. Hashimoto, \u201cDiffusion-lm improves controllable text generation,\u201d arXiv preprint, arXiv:2205.14217, 2022. https:\/\/doi.org\/10.48550\/arXiv.2205.14217"},{"key":"key-10.20965\/jaciii.2025.p1417-13","unstructured":"J. Ho and T. Salimans, \u201cClassifier-free diffusion guidance,\u201d arXiv preprint, arXiv:2207.12598, 2022. https:\/\/doi.org\/10.48550\/arXiv.2207.12598"},{"key":"key-10.20965\/jaciii.2025.p1417-14","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin, \u201cAttention is all you need,\u201d Proc. 
of Advances in Neural Information Processing Systems (NIPS 2017), Vol.30, 2017."},{"key":"key-10.20965\/jaciii.2025.p1417-15","doi-asserted-by":"crossref","unstructured":"O. Ronneberger, P. Fischer, and T. Brox, \u201cU-net: Convolutional networks for biomedical image segmentation,\u201d Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pp. 234-241, 2015. https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"key-10.20965\/jaciii.2025.p1417-16","doi-asserted-by":"crossref","unstructured":"H. Zhang, X. Liu, and J. Zhang, \u201cDiffuSum: Generation enhanced extractive summarization with diffusion,\u201d Findings of the Association for Computational Linguistics (ACL 2023), pp. 13089-13100, 2023. https:\/\/doi.org\/10.18653\/v1\/2023.findings-acl.828","DOI":"10.18653\/v1\/2023.findings-acl.828"},{"key":"key-10.20965\/jaciii.2025.p1417-17","unstructured":"S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong, \u201cDiffuSeq: Sequence to sequence text generation with diffusion models,\u201d Proc. of Int. Conf. on Learning Representations (ICLR 2023), 2023."},{"key":"key-10.20965\/jaciii.2025.p1417-18","unstructured":"H. Yuan, Z. Yuan, C. Tan, F. Huang, and S. Huang, \u201cSeqdiffuseq: Text diffusion with encoder-decoder transformers,\u201d arXiv preprint, arXiv:2212.10325, 2022. https:\/\/doi.org\/10.48550\/arXiv.2212.10325"},{"key":"key-10.20965\/jaciii.2025.p1417-19","unstructured":"P. Dhariwal and A. Nichol, \u201cDiffusion models beat gans on image synthesis,\u201d arXiv preprint, arXiv:2105.05233, 2021. https:\/\/doi.org\/10.48550\/arXiv.2105.05233"},{"key":"key-10.20965\/jaciii.2025.p1417-20","doi-asserted-by":"crossref","unstructured":"X. Han, S. Kumar, and Y. Tsvetkov, \u201cSSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control,\u201d Proc. of the 61st Annual Meeting of the Association for Computational Linguistics (Vol.1: Long Papers), pp. 
11575-11596, 2023. https:\/\/doi.org\/10.18653\/v1\/2023.acl-long.647","DOI":"10.18653\/v1\/2023.acl-long.647"},{"key":"key-10.20965\/jaciii.2025.p1417-21","doi-asserted-by":"crossref","unstructured":"Z. Horvitz, A. Patel, C. Callison-Burch, Z. Yu, and K. McKeown, \u201cParaguide: Guided diffusion paraphrasers for plug-and-play textual style transfer,\u201d arXiv preprint, arXiv:2308.15459, 2024. https:\/\/doi.org\/10.48550\/arXiv.2308.15459","DOI":"10.1609\/aaai.v38i16.29780"},{"key":"key-10.20965\/jaciii.2025.p1417-22","unstructured":"T. Wu, Z. Fan, X. Liu, Y. Gong, Y. Shen, J. Jiao, H.-T. Zheng, J. Li, Z. Wei, J. Guo, N. Duan, and W. Chen, \u201cAr-diffusion: Auto-regressive diffusion model for text generation,\u201d arXiv preprint, arXiv:2305.09515, 2023. https:\/\/doi.org\/10.48550\/arXiv.2305.09515"},{"key":"key-10.20965\/jaciii.2025.p1417-23","unstructured":"J. Ho, A. Jain, and P. Abbeel, \u201cDenoising diffusion probabilistic models,\u201d Proc. of Advances in Neural Information Processing Systems (NIPS\u201920), Vol.33, pp. 6840-6851, 2020."},{"key":"key-10.20965\/jaciii.2025.p1417-24","unstructured":"J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, \u201cDeep unsupervised learning using nonequilibrium thermodynamics,\u201d Proc. of the 32nd Int. Conf. on Machine Learning, pp. 2256-2265, 2015."},{"key":"key-10.20965\/jaciii.2025.p1417-25","doi-asserted-by":"crossref","unstructured":"J. Pont-Tuset, J. Uijlings, S. Changpinyo, R. Soricut, and V. Ferrari, \u201cConnecting vision and language with localized narratives,\u201d European Conf. on Computer Vision (ECCV 2020), pp. 647-664, 2020. https:\/\/doi.org\/10.1007\/978-3-030-58558-7_38","DOI":"10.1007\/978-3-030-58558-7_38"},{"key":"key-10.20965\/jaciii.2025.p1417-26","doi-asserted-by":"crossref","unstructured":"J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, \u201cCLIPScore: A reference-free evaluation metric for image captioning,\u201d Proc. of the 2021 Conf. 
on Empirical Methods in Natural Language Processing, pp. 7514-7528, 2021. https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.595","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"key-10.20965\/jaciii.2025.p1417-27","doi-asserted-by":"crossref","unstructured":"A. Karpathy and L. Fei-Fei, \u201cDeep visual-semantic alignments for generating image descriptions,\u201d 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, 2015. https:\/\/doi.org\/10.1109\/CVPR.2015.7298932","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"key-10.20965\/jaciii.2025.p1417-28","unstructured":"I. Loshchilov and F. Hutter, \u201cDecoupled weight decay regularization,\u201d arXiv preprint, arXiv:1711.05101, 2019. https:\/\/doi.org\/10.48550\/arXiv.1711.05101"},{"key":"key-10.20965\/jaciii.2025.p1417-29","doi-asserted-by":"crossref","unstructured":"K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, \u201cBleu: a method for automatic evaluation of machine translation,\u201d Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, 2002. https:\/\/doi.org\/10.3115\/1073083.1073135","DOI":"10.3115\/1073083.1073135"},{"key":"key-10.20965\/jaciii.2025.p1417-30","unstructured":"C.-Y. Lin, \u201cROUGE: A package for automatic evaluation of summaries,\u201d Text Summarization Branches Out (WAS 2004), pp. 74-81, 2004."},{"key":"key-10.20965\/jaciii.2025.p1417-31","unstructured":"T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, \u201cBertscore: Evaluating text generation with bert,\u201d Int. Conf. on Learning Representations (ICLR 2020), 2020."},{"key":"key-10.20965\/jaciii.2025.p1417-32","unstructured":"C. Deng, N. Ding, M. Tan, and Q. Wu, \u201cLength-controllable image captioning,\u201d arXiv preprint, arXiv:2007.09580, 2020. https:\/\/doi.org\/10.48550\/arXiv.2007.09580"},{"key":"key-10.20965\/jaciii.2025.p1417-33","unstructured":"J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. 
Wang, \u201cGit: A generative image-to-text transformer for vision and language,\u201d arXiv preprint, arXiv:2205.14100, 2022. https:\/\/doi.org\/10.48550\/arXiv.2205.14100"},{"key":"key-10.20965\/jaciii.2025.p1417-34","unstructured":"S. Watanabe and I. Kobayashi, \u201cImage captioning that reflects the intent of the explainer based on tracing with a pen,\u201d Proc. of the 36th Annual Conf. of the Japanese Society for Artificial Intelligence (JSAI2022), Article No.3Yin2-23, 2022 (in Japanese). https:\/\/doi.org\/10.11517\/pjsai.JSAI2022.0_3Yin223"}],"container-title":["Journal of Advanced Computational Intelligence and Intelligent Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.fujipress.jp\/main\/wp-content\/themes\/Fujipress\/hyosetsu.php?ppno=jacii002900060017","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T15:04:15Z","timestamp":1763564655000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.fujipress.jp\/jaciii\/jc\/jacii002900061417"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,20]]},"references-count":34,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,11,20]]},"published-print":{"date-parts":[[2025,11,20]]}},"URL":"https:\/\/doi.org\/10.20965\/jaciii.2025.p1417","relation":{},"ISSN":["1883-8014","1343-0130"],"issn-type":[{"type":"electronic","value":"1883-8014"},{"type":"print","value":"1343-0130"}],"subject":[],"published":{"date-parts":[[2025,11,20]]}}}