{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T16:12:40Z","timestamp":1772554360357,"version":"3.50.1"},"reference-count":94,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T00:00:00Z","timestamp":1758153600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e Tecnologia (FCT)","award":["2021.06750.BD"],"award-info":[{"award-number":["2021.06750.BD"]}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e Tecnologia (FCT)","award":["UIDB\/50021\/2020"],"award-info":[{"award-number":["UIDB\/50021\/2020"]}]},{"name":"Portuguese national funds through FCT","award":["2021.06750.BD"],"award-info":[{"award-number":["2021.06750.BD"]}]},{"name":"Portuguese national funds through FCT","award":["UIDB\/50021\/2020"],"award-info":[{"award-number":["UIDB\/50021\/2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies used in the generation of these narratives, focusing on their principles, strengths, and limitations. The survey also covers tasks related to automatic story generation, such as image and video captioning, and Visual Question Answering. These tasks share common challenges with Visual Story Generation (VSG) and have served as inspiration for the techniques used in the field. We analyze the main datasets and evaluation metrics, providing a critical perspective on their limitations.<\/jats:p>","DOI":"10.3390\/info16090812","type":"journal-article","created":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T14:45:46Z","timestamp":1758206746000},"page":"812","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4413-6957","authenticated-orcid":false,"given":"Daniel A. P.","family":"Oliveira","sequence":"first","affiliation":[{"name":"INESC-ID, 1000-029 Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico, Universidade de Lisboa, 1649-004 Lisbon, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7147-8675","authenticated-orcid":false,"given":"Eug\u00e9nio","family":"Ribeiro","sequence":"additional","affiliation":[{"name":"INESC-ID, 1000-029 Lisbon, Portugal"},{"name":"Department of Information Science and Technology (ISTA), Instituto Universit\u00e1rio de Lisboa (ISCTE-IUL), 1649-026 Lisbon, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8631-2870","authenticated-orcid":false,"given":"David","family":"Martins de Matos","sequence":"additional","affiliation":[{"name":"INESC-ID, 1000-029 Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico, Universidade de Lisboa, 1649-004 Lisbon, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"H\u00fchn, P., Meister, J., Pier, J., and Schmid, W. (2014). Handbook of Narratology, De Gruyter. 
De Gruyter Handbook.","DOI":"10.1515\/9783110316469"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Abbott, H.P. (2008). The Cambridge Introduction to Narrative, Cambridge Introductions to Literature, Cambridge University Press. [2nd ed.].","DOI":"10.1017\/CBO9780511816932"},{"key":"ref_3","unstructured":"Jurafsky, D., and Martin, J.H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Pearson Prentice Hall."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Szeliski, R. (2010). Computer Vision: Algorithms and Applications, Springer. Texts in Computer Science.","DOI":"10.1007\/978-1-84882-935-0"},{"key":"ref_5","first-page":"49","article-title":"Computational Approaches to Storytelling and Creativity","volume":"30","author":"Gervas","year":"2009","journal-title":"AI Mag."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Krishna, R., Hata, K., Ren, F., Fei-Fei, L., and Niebles, J.C. (2017, January 22\u201329). Dense-Captioning Events in Videos. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.83"},{"key":"ref_7","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., and Bengio, Y. (2015, January 7\u20139). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Fan, A., Lewis, M., and Dauphin, Y. (2018, January 15\u201320). Hierarchical Neural Story Generation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1082"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Peng, N., Ghazvininejad, M., May, J., and Knight, K. (2018, January 5). Towards Controllable Story Generation. Proceedings of the First Workshop on Storytelling, New Orleans, LA, USA.","DOI":"10.18653\/v1\/W18-1505"},{"key":"ref_10","unstructured":"Simpson, J. (2002). Oxford English Dictionary: Version 3.0: Upgrade Version, Oxford University Press."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lau, S.Y., and Chen, C.J. (2008). Designing a Virtual Reality (VR) Storytelling System for Educational Purposes. Technological Developments in Education and Automation, Springer.","DOI":"10.1007\/978-90-481-3656-8_26"},{"key":"ref_12","unstructured":"Mitchell, D. (2008). Cloud Atlas: A Novel, Random House Publishing Group."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"DiBattista, M. (2011). Novel Characters: A Genealogy, Wiley.","DOI":"10.1002\/9781444327984"},{"key":"ref_14","unstructured":"Griffith, K. (2010). Writing Essays About Literature, Cengage Learning."},{"key":"ref_15","unstructured":"Truby, J. (2007). The Anatomy of Story: 22 Steps to Becoming a Master Storyteller, Faber & Faber."},{"key":"ref_16","unstructured":"Rowling, J. (2015). Harry Potter and the Sorcerer\u2019s Stone, Harry Potter, Pottermore Publishing."},{"key":"ref_17","unstructured":"Dibell, A. (1999). Elements of Fiction Writing\u2014Plot, F+W Media."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Pinault, D. (1992). Story-Telling Techniques in the Arabian Nights, Brill. 
Studies in Arabic Literature.","DOI":"10.1163\/9789004663084"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Huang, T.H.K., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., Girshick, R., He, X., Kohli, P., and Batra, D. (2016, January 12\u201317). Visual Storytelling. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-1147"},{"key":"ref_20","unstructured":"Korhonen, A., Traum, D., and M\u00e0rquez, L. (August, January 28). Visual Story Post-Editing. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"565","DOI":"10.1162\/tacl_a_00553","article-title":"Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences","volume":"11","author":"Hong","year":"2023","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_22","unstructured":"Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (2015). Expressing an Image Stream with a Sequence of Natural Sentences. Proceedings of the Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1007\/s11263-016-0987-1","article-title":"Movie Description","volume":"123","author":"Rohrbach","year":"2016","journal-title":"Int. J. Comput. Vis."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yu, Y., Chung, J., Yun, H., Kim, J., and Kim, G. (2021, January 15\u201319). Transitional Adaptation of Pretrained Models for Visual Storytelling. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01247"},{"key":"ref_25","unstructured":"Gurevych, I., and Miyao, Y. (2018, January 15\u201320). No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia."},{"key":"ref_26","first-page":"12747","article-title":"Two Heads are Better Than One: Hypergraph-Enhanced Graph Reasoning for Visual Event Ratiocination","volume":"Volume 139","author":"Meila","year":"2021","journal-title":"Proceedings of the 38th International Conference on Machine Learning"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Kim, K.M., Heo, M.O., Choi, S.H., and Zhang, B.T. (2017, January 19\u201325). DeepStory: Video story QA by deep embedded memory networks. Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia.","DOI":"10.24963\/ijcai.2017\/280"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Das, P., Xu, C., Doell, R., and Corso, J. (2013, January 23\u201328). A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.340"},{"key":"ref_29","unstructured":"Isabelle, P., Charniak, E., and Lin, D. (2002, January 6\u201312). Bleu: A Method for Automatic Evaluation of Machine Translation. 
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. (2018, January 8\u201312). Texygen: A Benchmarking Platform for Text Generation Models. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA.","DOI":"10.1145\/3209978.3210080"},{"key":"ref_31","unstructured":"Goldstein, J., Lavie, A., Lin, C.Y., and Voss, C. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Association for Computational Linguistics."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2014, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_33","unstructured":"Lin, C.Y. (2004, January 25\u201326). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_34","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv."},{"key":"ref_35","unstructured":"OpenAI (2025, February 09). Hello GPT-4o. Available online: https:\/\/openai.com\/index\/hello-gpt-4o."},{"key":"ref_36","unstructured":"OpenAI (2024, January 03). New Models and Developer Products Announced at DevDay. Available online: https:\/\/openai.com\/blog\/new-models-and-developer-products-announced-at-devday."},{"key":"ref_37","unstructured":"OpenAI (2024, March 04). GPT-4 and GPT-4 Turbo: Models Documentation. Available online: https:\/\/platform.openai.com\/docs\/models\/gpt-4-and-gpt-4-turbo."},{"key":"ref_38","unstructured":"OpenAI (2024, January 03). GPT-3.5: Models Documentation. Available online: https:\/\/platform.openai.com\/docs\/models\/gpt-3-5."},{"key":"ref_39","unstructured":"Touvron, H., Martin, L., Stone, K.R., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and Bhosale, S. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv."},{"key":"ref_40","unstructured":"Meta (2024, May 28). Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Available online: https:\/\/ai.meta.com\/blog\/meta-llama-3\/."},{"key":"ref_41","unstructured":"Mistral AI team (2025, September 10). Mixtral of Experts. Mistral AI Continues Its Mission to Deliver Open Models to the Developer Community, Introducing Mixtral 8x7B, a High-Quality Sparse Mixture of Experts Model. Available online: https:\/\/mistral.ai\/news\/mixtral-of-experts\/."},{"key":"ref_42","unstructured":"Mistral AI team (2025, July 27). Mistral NeMo: Our New Best Small Model. Available online: https:\/\/mistral.ai\/news\/mistral-nemo,."},{"key":"ref_43","unstructured":"Anthropic (2025, August 06). Introducing Claude 4. Available online: https:\/\/www.anthropic.com\/news\/claude-4."},{"key":"ref_44","unstructured":"Anthropic (2024, January 03). Claude-2.1: Overview and Specifications. Available online: https:\/\/www.anthropic.com\/index\/claude-2-1."},{"key":"ref_45","unstructured":"Anthropic (2024, January 03). 
Claude-2: Overview and Specifications. Available online: https:\/\/www.anthropic.com\/index\/claude-2."},{"key":"ref_46","unstructured":"Anthropic (2024, January 03). Introducing Claude. Available online: https:\/\/www.anthropic.com\/index\/introducing-claude."},{"key":"ref_47","unstructured":"Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., and Xing, E.P. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"e2305016120","DOI":"10.1073\/pnas.2305016120","article-title":"ChatGPT outperforms crowd workers for text-annotation tasks","volume":"120","author":"Gilardi","year":"2023","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., and Courville, A. (2015, January 7\u201313). Describing Videos by Exploiting Temporal Structure. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA.","DOI":"10.1109\/ICCV.2015.512"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2017, January 18\u201323). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_53","unstructured":"Chung, J., G\u00fcl\u00e7ehre, \u00c7., Cho, K., and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv."},{"key":"ref_54","unstructured":"Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is All you Need. Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_55","unstructured":"Olimov, F., Dubey, S., Shrestha, L., Tin, T.T., and Jeon, M. (2021). Image Captioning using Multiple Transformers for Self-Attention Mechanism. arXiv."},{"key":"ref_56","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 
Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"4467","DOI":"10.1109\/TCSVT.2019.2947482","article-title":"Multimodal Transformer With Multi-View Visual Representation for Image Captioning","volume":"30","author":"Yu","year":"2019","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2021, January 18\u201324). Video Swin Transformer. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., and Wang, L. (2021, January 18\u201324). SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01742"},{"key":"ref_61","unstructured":"Kazemi, V., and Elqursh, A. (2017). Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering. arXiv."},{"key":"ref_62","first-page":"32897","article-title":"VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts","volume":"Volume 35","author":"Koyejo","year":"2022","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., and Som, S. (2023, January 17\u201324). Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01838"},{"key":"ref_64","unstructured":"Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. (2022, January 17\u201323). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA."},{"key":"ref_65","unstructured":"Wang, P., Wang, S., Lin, J., Bai, S., Zhou, X., Zhou, J., Wang, X., and Zhou, C. (2023). ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities. arXiv."},{"key":"ref_66","unstructured":"Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D.M., Goodman, S., Grycner, A., Mustafa, B., and Beyer, L. (2022). PaLI: A Jointly-Scaled Multilingual Language-Image Model. arXiv."},{"key":"ref_67","unstructured":"Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (2012, January 3\u20136). ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_68","unstructured":"Bengio, Y., and LeCun, Y. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. 
Conference Track Proceedings."},{"key":"ref_69","unstructured":"Bernardi, R., Fernandez, R., Gella, S., Kafle, K., Kanan, C., Lee, S., and Nabi, M. (2019, January 6). The Steep Road to Happily Ever after: An Analysis of Current Visual Storytelling Models. Proceedings of the Second Workshop on Shortcomings in Vision and Language, Minneapolis, MN, USA."},{"key":"ref_70","unstructured":"Huang, Q., Gan, Z., Celikyilmaz, A., Wu, D.O., Wang, J., and He, X. (2018, January 2\u20137). Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA."},{"key":"ref_71","unstructured":"Jung, Y., Kim, D., Woo, S., Kim, K., Kim, S., and Kweon, I.S. (, January 7\u201312February). Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Park, J.S., Rohrbach, M., Darrell, T., and Rohrbach, A. (2018, January 15\u201320). Adversarial Inference for Multi-Sentence Video Description. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00676"},{"key":"ref_73","unstructured":"Mitchell, M., Huang, T.H.K., Ferraro, F., and Misra, I. (2018, January 5). A Pipeline for Creative Visual Storytelling. Proceedings of the First Workshop on Storytelling, New Orleans, LA, USA."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Halperin, B.A., and Lukin, S.M. (2023, January 23\u201328). Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, New York, NY, USA. CHI \u201923.","DOI":"10.1145\/3544548.3580744"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Halperin, B.A., and Lukin, S.M. (2024, January 1\u20135). Artificial Dreams: Surreal Visual Storytelling as Inquiry Into AI \u2019Hallucination\u2019. Proceedings of the 2024 ACM Designing Interactive Systems Conference, New York, NY, USA.","DOI":"10.1145\/3643834.3660685"},{"key":"ref_76","first-page":"7952","article-title":"Knowledge-Enriched Visual Storytelling","volume":"34","author":"Hsu","year":"2020","journal-title":"AAAI Conf. Artif. Intell."},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Hsu, C.C., Chen, Y.H., Chen, Z.Y., Lin, H.Y., Huang, T.H.K., and Ku, L.W. (2019, January 13\u201317). Dixit: Interactive Visual Storytelling via Term Manipulation. Proceedings of the World Wide Web Conference, New York, NY, USA.","DOI":"10.1145\/3308558.3314131"},{"key":"ref_78","unstructured":"Zong, C., Xia, F., Li, W., and Navigli, R. (2021, January 1\u20136). Plot and Rework: Modeling Storylines for Visual Storytelling. Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online."},{"key":"ref_79","unstructured":"Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019). The Curious Case of Neural Text Degeneration. arXiv."},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R.B. (2017). Mask R-CNN. arXiv.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_81","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA."},{"key":"ref_82","unstructured":"Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (August, January 28). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_84","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2025, September 15). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. Available online: https:\/\/cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf."},{"key":"ref_85","unstructured":"Gonzalez-Rico, D., and Pineda, G.F. (2018). Contextualize, Show and Tell: A Neural Visual Storyteller. arXiv."},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Caba Heilbron, F., Escorcia, V., Ghanem, B., and Niebles, J.C. (2015, January 7\u201312). ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"ref_87","unstructured":"Palmer, M., Hwa, R., and Riedel, S. (2017, January 9\u201311). Hierarchically-Attentive RNN for Album Summarization and Storytelling. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark."},{"key":"ref_88","unstructured":"Kim, T., Heo, M.O., Son, S., Park, K.W., and Zhang, B.T. (2018). GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation. arXiv."},{"key":"ref_89","doi-asserted-by":"crossref","first-page":"126486","DOI":"10.1016\/j.neucom.2023.126486","article-title":"AOG-LSTM: An adaptive attention neural network for visual storytelling","volume":"552","author":"Liu","year":"2023","journal-title":"Neurocomputing"},{"key":"ref_90","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/TCSVT.2022.3199603","article-title":"Coherent Visual Storytelling via Parallel Top-Down Visual and Topic Attention","volume":"33","author":"Gu","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_91","unstructured":"Bouamor, H., Pino, J., and Bali, K. (2023, January 6\u201310). Location-Aware Visual Question Generation with Lightweight Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore."},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Belz, J.H., Weilke, L.M., Winter, A., Hallgarten, P., Rukzio, E., and Grosse-Puppendahl, T. (2024, January 13\u201316). Story-Driven: Exploring the Impact of Providing Real-time Context Information on Automated Storytelling. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA. UIST \u201924.","DOI":"10.1145\/3654777.3676372"},{"key":"ref_93","unstructured":"Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., and Lv, C. (2025). Qwen3 Technical Report. 
arXiv."},{"key":"ref_94","unstructured":"Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., and Rosen, E. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/9\/812\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:48:19Z","timestamp":1760035699000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/9\/812"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,18]]},"references-count":94,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["info16090812"],"URL":"https:\/\/doi.org\/10.3390\/info16090812","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,18]]}}}