{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T07:59:50Z","timestamp":1771919990731,"version":"3.50.1"},"reference-count":97,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2024,6,12]],"date-time":"2024-06-12T00:00:00Z","timestamp":1718150400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"PNRR-M4C2","award":["PE00000013"],"award-info":[{"award-number":["PE00000013"]}]},{"name":"FAIR - Future Artificial Intelligence Research"},{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"name":"CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content","award":["CUP B87G22000460001"],"award-info":[{"award-number":["CUP B87G22000460001"]}]},{"name":"Italian Ministry of University and Research"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,8,31]]},"abstract":"<jats:p>The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach toward developing image captioning models that utilize an external <jats:italic>k<\/jats:italic>NN memory to improve the generation process. 
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a <jats:italic>k<\/jats:italic>NN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.<\/jats:p>","DOI":"10.1145\/3663667","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T11:56:37Z","timestamp":1714737397000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Towards Retrieval-Augmented Architectures for Image Captioning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1057-3374","authenticated-orcid":false,"given":"Sara","family":"Sarto","sequence":"first","affiliation":[{"name":"Department of Engineering \"Enzo Ferrari\", University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9640-9385","authenticated-orcid":false,"given":"Marcella","family":"Cornia","sequence":"additional","affiliation":[{"name":"Department of Education and Humanities, University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5125-4957","authenticated-orcid":false,"given":"Lorenzo","family":"Baraldi","sequence":"additional","affiliation":[{"name":"Department of Engineering \"Enzo Ferrari\", University of Modena and Reggio Emilia, Modena, 
Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5071-5687","authenticated-orcid":false,"given":"Alessandro","family":"Nicolosi","sequence":"additional","affiliation":[{"name":"Leonardo SpA, Roma, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2239-283X","authenticated-orcid":false,"given":"Rita","family":"Cucchiara","sequence":"additional","affiliation":[{"name":"Department of Engineering \"Enzo Ferrari\", University of Modena and Reggio Emilia, Modena, Italy"}]}],"member":"320","published-online":{"date-parts":[[2024,6,12]]},"reference":[{"key":"e_1_3_2_2_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Agrawal Harsh","year":"2019","unstructured":"Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. 2019. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE\/CVF International Conference on Computer Vision."},{"key":"e_1_3_2_3_2","volume-title":"Advances in Neural Information Processing Systems","year":"2022","unstructured":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Miko\u0142aj Bi\u0144kowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Kar\u00e9n Simonyan. 2022. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_4_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Anderson Peter","year":"2016","unstructured":"Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. 
In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_5_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_6_2","unstructured":"Simran Arora Avanika Narayan Mayee F. Chen Laurel J. Orr Neel Guha Kush Bhatia Ines Chami and Christopher Re. 2023. Ask Me Anything: A simple strategy for prompting language models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_7_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops."},{"key":"e_1_3_2_8_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Barraco Manuele","year":"2023","unstructured":"Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2023. With a little help from your own past: Prototypical memory networks for image captioning. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision."},{"key":"e_1_3_2_9_2","volume-title":"Proceedings of the International Conference on Machine Learning","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_10_2","volume-title":"Advances in Neural Information Processing Systems","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_11_2","volume-title":"Proceedings of the International Conference on Image Analysis and Processing","author":"Caffagni Davide","year":"2023","unstructured":"Davide Caffagni, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2023. SynthCap: Augmenting transformers with synthetic data for image captioning. 
In Proceedings of the International Conference on Image Analysis and Processing."},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Davide Caffagni Federico Cocchi Luca Barsellotti Nicholas Moratelli Sara Sarto Lorenzo Baraldi Marcella Cornia and Rita Cucchiara. 2024. The (r)evolution of multimodal large language models: A survey. arXiv:2402.12451. Retrieved from https:\/\/arxiv.org\/abs\/2402.12451","DOI":"10.18653\/v1\/2024.findings-acl.807"},{"key":"e_1_3_2_13_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Caffagni Davide","year":"2024","unstructured":"Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops."},{"key":"e_1_3_2_14_2","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Chen Wenhu","year":"2022","unstructured":"Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W. Cohen. 2022. MuRAG: Multimodal retrieval-augmented generator for open question answering over images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing."},{"key":"e_1_3_2_15_2","unstructured":"Wenhu Chen Hexiang Hu Chitwan Saharia and William W. Cohen. 2022. Re-Imagen: Retrieval-augmented text-to-image generator. arXiv:2209.14491. Retrieved from https:\/\/arxiv.org\/abs\/2209.14491"},{"issue":"4","key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3499027","article-title":"Cross-modal graph matching network for image-text retrieval","volume":"18","author":"Cheng Yuhao","year":"2022","unstructured":"Yuhao Cheng, Xiaoguang Zhu, Jiuchao Qian, Fei Wen, and Peilin Liu. 2022. 
Cross-modal graph matching network for image-text retrieval. ACM Transactions on Multimedia Computing, Communications and Applications 18, 4 (2022), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_17_2","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing."},{"issue":"2","key":"e_1_3_2_18_2","doi-asserted-by":"crossref","first-page":"111","DOI":"10.3233\/AIC-210172","article-title":"Explaining transformer-based image captioning models: An empirical analysis","volume":"35","author":"Cornia Marcella","year":"2022","unstructured":"Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2022. Explaining transformer-based image captioning models: An empirical analysis. AI Communications 35, 2 (2022), 111\u2013129.","journal-title":"AI Communications"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","unstructured":"Marcella Cornia Lorenzo Baraldi Giuseppe Fiameni and Rita Cucchiara. 2024. Generating more pertinent captions by leveraging semantics and style on multi-source datasets. International Journal of Computer Vision 123 5 (2024) 1701\u20131720.","DOI":"10.1007\/s11263-023-01949-w"},{"key":"e_1_3_2_20_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Cornia Marcella","year":"2020","unstructured":"Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_21_2","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Cui Chaoran","year":"2015","unstructured":"Chaoran Cui, Jialie Shen, Jun Ma, and Tao Lian. 2015. Social tag relevance estimation via ranking-oriented neighbour voting. In Proceedings of the ACM International Conference on Multimedia."},{"key":"e_1_3_2_22_2","volume-title":"Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics."},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","first-page":"30615","DOI":"10.1007\/s11042-020-09539-5","article-title":"Reference-based model using multimodal gated recurrent units for image captioning","volume":"79","author":"Nogueira Tiago do Carmo","year":"2020","unstructured":"Tiago do Carmo Nogueira, C\u00e1ssio Dener Noronha Vinhal, G\u00e9lson da Cruz J\u00fanior, and Matheus Rudolfo Diedrich Ullmann. 2020. Reference-based model using multimodal gated recurrent units for image captioning. Multimedia Tools and Applications 79 (2020), 30615\u201330635.","journal-title":"Multimedia Tools and Applications"},{"key":"e_1_3_2_24_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. 
An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo","author":"Dubey Shiv Ram","year":"2021","unstructured":"Shiv Ram Dubey, Satish Kumar Singh, and Wei-Ta Chu. 2021. Vision transformer hashing for image retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo."},{"key":"e_1_3_2_26_2","unstructured":"Alaaeldin El-Nouby Natalia Neverova Ivan Laptev and Herv\u00e9 J\u00e9gou. 2021. Training vision transformers for image retrieval. arXiv:2102.05644. Retrieved from https:\/\/arxiv.org\/abs\/2102.05644"},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the CHI Conference on Human Factors in Computing Systems","author":"Fruchard Bruno","year":"2023","unstructured":"Bruno Fruchard, Sylvain Malacria, G\u00e9ry Casiez, and St\u00e9phane Huot. 2023. User preference and performance using tagging and browsing for image labeling. In Proceedings of the CHI Conference on Human Factors in Computing Systems."},{"issue":"1","key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1109\/TIP.2012.2202676","article-title":"Visual-textual joint relevance learning for tag-based social image search","volume":"22","author":"Gao Yue","year":"2012","unstructured":"Yue Gao, Meng Wang, Zheng-Jun Zha, Jialie Shen, Xuelong Li, and Xindong Wu. 2012. Visual-textual joint relevance learning for tag-based social image search. IEEE Transactions on Image Processing 22, 1 (2012), 363\u2013376.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_2_29_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Guu Kelvin","year":"2020","unstructured":"Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. 
In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_30_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Guu Kelvin","year":"2020","unstructured":"Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_31_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_32_2","volume-title":"Advances in Neural Information Processing Systems","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems."},{"issue":"8","key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735\u20131780.","journal-title":"Neural Computation"},{"key":"e_1_3_2_34_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Hu Xiaowei","year":"2022","unstructured":"Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. 2022. Scaling up vision-language pre-training for image captioning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_35_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Hu Ziniu","year":"2023","unstructured":"Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi. 2023. REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Huang Lun","year":"2019","unstructured":"Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In Proceedings of the IEEE\/CVF International Conference on Computer Vision."},{"issue":"4","key":"e_1_3_2_37_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3460474","article-title":"Bi-directional co-attention network for image captioning","volume":"17","author":"Jiang Weitao","year":"2021","unstructured":"Weitao Jiang, Weixuan Wang, and Haifeng Hu. 2021. Bi-directional co-attention network for image captioning. ACM Transactions on Multimedia Computing, Communications and Applications 17, 4 (2021), 1\u201320.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"issue":"3","key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"535","DOI":"10.1109\/TBDATA.2019.2921572","article-title":"Billion-scale similarity search with GPUs","volume":"7","author":"Johnson Jeff","year":"2019","unstructured":"Jeff Johnson, Matthijs Douze, and Herv\u00e9 J\u00e9gou. 2019. Billion-scale similarity search with GPUs. 
IEEE Transactions on Big Data 7, 3 (2019), 535\u2013547.","journal-title":"IEEE Transactions on Big Data"},{"key":"e_1_3_2_39_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Karpathy Andrej","year":"2015","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_40_2","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Karpukhin Vladimir","year":"2020","unstructured":"Vladimir Karpukhin, Barlas O\u011fuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing."},{"key":"e_1_3_2_41_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Khandelwal Urvashi","year":"2020","unstructured":"Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_42_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","unstructured":"Alina Kuznetsova Hassan Rom Neil Alldrin Jasper Uijlings Ivan Krasin Jordi Pont-Tuset Shahab Kamali Stefan Popov Matteo Malloci Alexander Kolesnikov Tom Duerig and Vittorio Ferrari. 2018. 
The open images dataset V4: Unified image classification object detection and visual relationship detection at scale. International Journal of Computer Vision 128 7 (2018) 1956\u20131981.","DOI":"10.1007\/s11263-020-01316-z"},{"key":"e_1_3_2_44_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Lee Kuang-Huei","year":"2018","unstructured":"Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_45_2","volume-title":"Advances in Neural Information Processing Systems","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_46_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_47_2","doi-asserted-by":"crossref","unstructured":"Jingyu Li Zhendong Mao Hao Li Weidong Chen and Yongdong Zhang. 2024. Exploring visual relationships via transformer-based graphs for enhanced image captioning. ACM Transactions on Multimedia Computing Communications and Applications 20 5 (2024) 1\u201323.","DOI":"10.1145\/3638558"},{"key":"e_1_3_2_48_2","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Li Wenhui","year":"2023","unstructured":"Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, and An-An Liu. 
2023. Towards deconfounded image-text matching with causal inference. In Proceedings of the ACM International Conference on Multimedia."},{"issue":"8","key":"e_1_3_2_49_2","doi-asserted-by":"crossref","first-page":"2117","DOI":"10.1109\/TMM.2019.2896516","article-title":"Know more say less: Image captioning based on scene graphs","volume":"21","author":"Li Xiangyang","year":"2019","unstructured":"Xiangyang Li and Shuqiang Jiang. 2019. Know more say less: Image captioning based on scene graphs. IEEE Transactions on Multimedia 21, 8 (2019), 2117\u20132130.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_50_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et\u00a0al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_51_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops."},{"key":"e_1_3_2_52_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Lin Tsung-Yi","year":"2014","unstructured":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. 
In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_53_2","volume-title":"Proceedings of the International Conference on World Wide Web","author":"Liu Dong","year":"2009","unstructured":"Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, and Hong-Jiang Zhang. 2009. Tag ranking. In Proceedings of the International Conference on World Wide Web."},{"key":"e_1_3_2_54_2","volume-title":"Advances in Neural Information Processing Systems","author":"Liu Fenglin","year":"2020","unstructured":"Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, and Xu Sun. 2020. Prophet attention: Predicting attention with future attention. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_55_2","unstructured":"Wei Liu Sihan Chen Longteng Guo Xinxin Zhu and Jing Liu. 2021. CPTR: Full transformer network for image captioning. arXiv:2101.10804. Retrieved from https:\/\/arxiv.org\/abs\/2101.10804"},{"key":"e_1_3_2_56_2","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Luo Yunpeng","year":"2021","unstructured":"Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. 2021. Dual-level collaborative transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the International Conference on Content-Based Multimedia Indexing","author":"Messina Nicola","year":"2022","unstructured":"Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, and Rita Cucchiara. 2022. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. 
In Proceedings of the International Conference on Content-Based Multimedia Indexing."},{"key":"e_1_3_2_58_2","unstructured":"Gr\u00e9goire Mialon Roberto Dess\u00ec Maria Lomeli Christoforos Nalmpantis Ram Pasunuru Roberta Raileanu Baptiste Rozi\u00e8re Timo Schick Jane Dwivedi-Yu Asli Celikyilmaz Edouard Grave Yann LeCun and Thomas Scialom. 2023. Augmented language models: A survey. arXiv:2302.07842. Retrieved from https:\/\/arxiv.org\/abs\/2302.07842"},{"key":"e_1_3_2_59_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In Proceedings of the International Conference on Learning Representations."},{"issue":"5","key":"e_1_3_2_60_2","first-page":"1","article-title":"Bottom-up and top-down object inference networks for image captioning","volume":"19","author":"Pan Yingwei","year":"2023","unstructured":"Yingwei Pan, Yehao Li, Ting Yao, and Tao Mei. 2023. Bottom-up and top-down object inference networks for image captioning. ACM Transactions on Multimedia Computing, Communications and Applications 19, 5 (2023), 1\u201318.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_61_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Pan Yingwei","year":"2020","unstructured":"Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_62_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_3_2_63_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning."},{"issue":"8","key":"e_1_3_2_64_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_65_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Rajbhandari Samyam","year":"2020","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. 
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_2_66_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Ranzato Marc\u2019Aurelio","year":"2016","unstructured":"Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_67_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Rennie Steven J.","year":"2017","unstructured":"Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_68_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Sarto Sara","year":"2023","unstructured":"Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2023. Positive-augmented contrastive learning for image and video captioning evaluation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_69_2","volume-title":"Proceedings of the International Conference on Content-Based Multimedia Indexing","author":"Sarto Sara","year":"2022","unstructured":"Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2022. Retrieval-augmented transformer for image captioning. In Proceedings of the International Conference on Content-Based Multimedia Indexing."},{"key":"e_1_3_2_70_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Sennrich Rico","year":"2016","unstructured":"Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. 
Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_3_2_71_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Sharma Piyush","year":"2018","unstructured":"Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_3_2_72_2","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1007\/s00530-014-0399-4","article-title":"Accurate online video tagging via probabilistic hybrid modeling","volume":"22","year":"2016","unstructured":"Jialie Shen, Meng Wang, and Tat-Seng Chua. 2016. Accurate online video tagging via probabilistic hybrid modeling. Multimedia Systems 22 (2016), 99\u2013113.","journal-title":"Multimedia Systems"},{"key":"e_1_3_2_73_2","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Shen Jialie","year":"2011","unstructured":"Jialie Shen, Meng Wang, Shuicheng Yan, and Xian-Sheng Hua. 2011. Multimedia tagging: Past, present and future. In Proceedings of the ACM International Conference on Multimedia."},{"key":"e_1_3_2_74_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Shen Sheng","year":"2022","unstructured":"Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2022. How much can CLIP benefit vision-and-language tasks? In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_75_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. 
Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_76_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Socher Richard","year":"2010","unstructured":"Richard Socher and Li Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"issue":"1","key":"e_1_3_2_77_2","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1109\/TPAMI.2022.3148210","article-title":"From show to tell: A survey on deep learning-based image captioning","volume":"45","author":"Stefanini Matteo","year":"2022","unstructured":"Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. 2022. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 539\u2013559.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_78_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Szegedy Christian","year":"2015","unstructured":"Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_79_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Tolias Giorgos","year":"2016","unstructured":"Giorgos Tolias, Ronan Sicre, and Herv\u00e9 J\u00e9gou. 2016. Particular object retrieval with integral max-pooling of CNN activations. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_80_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_81_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Vedantam Ramakrishna","year":"2015","unstructured":"Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"issue":"3","key":"e_1_3_2_82_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3226037","article-title":"Image captioning with affective guiding and selective attention","volume":"14","author":"Wang Anqi","year":"2018","unstructured":"Anqi Wang, Haifeng Hu, and Liang Yang. 2018. Image captioning with affective guiding and selective attention. ACM Transactions on Multimedia Computing, Communications and Applications 14, 3 (2018), 1\u201315.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"issue":"2","key":"e_1_3_2_83_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3439734","article-title":"Integrating scene semantic knowledge into image captioning","volume":"17","author":"Wei Haiyang","year":"2021","unstructured":"Haiyang Wei, Zhixin Li, Feicheng Huang, Canlong Zhang, Huifang Ma, and Zhongzhi Shi. 2021. Integrating scene semantic knowledge into image captioning. 
ACM Transactions on Multimedia Computing, Communications and Applications 17, 2 (2021), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_84_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wu Mingrui","year":"2022","unstructured":"Mingrui Wu, Xuying Zhang, Xiaoshuai Sun, Yiyi Zhou, Chao Chen, Jiaxin Gu, Xing Sun, and Rongrong Ji. 2022. DIFNet: Boosting visual information flow for image captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_85_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Wu Yuhuai","year":"2022","unstructured":"Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_86_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_87_2","doi-asserted-by":"crossref","unstructured":"Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, and Heng Tao Shen. 2023. Multi-modal transformer with global-local alignment for composed query image retrieval. IEEE Transactions on Multimedia 25 (2023), 8346\u20138357.","DOI":"10.1109\/TMM.2023.3235495"},{"key":"e_1_3_2_88_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Yang Xu","year":"2019","unstructured":"Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. 
Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_89_2","doi-asserted-by":"crossref","unstructured":"Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2T: Image parsing to text description. Proceedings of the IEEE 98, 8 (2010), 1485\u20131508.","DOI":"10.1109\/JPROC.2010.2050411"},{"key":"e_1_3_2_90_2","doi-asserted-by":"crossref","unstructured":"Tao Yao, Yiru Li, Ying Li, Yingying Zhu, Gang Wang, and Jun Yue. 2023. Cross-modal semantically augmented network for image-text matching. ACM Transactions on Multimedia Computing, Communications and Applications 20, 4 (2023), 1\u201318.","DOI":"10.1145\/3631356"},{"key":"e_1_3_2_91_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Yao Ting","year":"2018","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_92_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"You Quanzeng","year":"2016","unstructured":"Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_93_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"You Yang","year":"2020","unstructured":"Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_94_2","first-page":"361","article-title":"Text mining in multimedia","author":"Zha Zheng-Jun","year":"2012","unstructured":"Zheng-Jun Zha, Meng Wang, Jialie Shen, and Tat-Seng Chua. 2012. Text mining in multimedia. Mining Text Data (2012), 361\u2013384.","journal-title":"Mining Text Data"},{"key":"e_1_3_2_95_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Pengchuan","year":"2021","unstructured":"Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_96_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Xuying","year":"2021","unstructured":"Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2021. RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_97_2","doi-asserted-by":"crossref","first-page":"3101","DOI":"10.1109\/TMM.2021.3093725","article-title":"Exploring pairwise relationships adaptively from linguistic context in image captioning","volume":"24","author":"Zhang Zongjian","year":"2021","unstructured":"Zongjian Zhang, Qiang Wu, Yang Wang, and Fang Chen. 2021. Exploring pairwise relationships adaptively from linguistic context in image captioning. IEEE Transactions on Multimedia 24 (2021), 3101\u20133113.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_98_2","unstructured":"Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et\u00a0al. 
2023. A survey of large language models. arXiv:2303.18223. Retrieved from https:\/\/arxiv.org\/abs\/2303.18223"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663667","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3663667","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:59Z","timestamp":1750294679000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663667"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,12]]},"references-count":97,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,8,31]]}},"alternative-id":["10.1145\/3663667"],"URL":"https:\/\/doi.org\/10.1145\/3663667","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,12]]},"assertion":[{"value":"2023-05-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-28","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}