{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T00:37:29Z","timestamp":1770165449493,"version":"3.49.0"},"posted":{"date-parts":[[2026]]},"group-title":"SSRN","reference-count":38,"publisher":"Elsevier BV","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"abstract":"<jats:p>Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. We address these issues through entity grounding and propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for narrative modeling, and a grounding scheme linking textual elements to visual entities across frames. We fine-tune Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. 
Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations per story and improved creativity from 2.58 to 3.38 (+31.0%) compared to the base model.<\/jats:p>","DOI":"10.2139\/ssrn.6172566","type":"posted-content","created":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T11:42:42Z","timestamp":1770118962000},"source":"Crossref","is-referenced-by-count":0,"title":["StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation"],"prefix":"10.2139","author":[{"given":"Daniel","family":"Oliveira","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8631-2870","authenticated-orcid":true,"given":"David","family":"Martins de Matos","sequence":"additional","affiliation":[]}],"member":"78","reference":[{"key":"ref1","first-page":"2025","article-title":"Getting vit in shape: scaling laws for compute-optimal model design","author":"I Alabdulmohsin","year":"2023","journal-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23"},{"key":"ref2","author":"S Bai","year":"2025"},{"key":"ref3","first-page":"8259","article-title":"METEOR: An automatic metric for MT evaluation with improved correlation with human judgments","author":"S Banerjee","year":"2005","journal-title":"ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization"},{"key":"ref4","first-page":"1280","article-title":"Masked-attention mask transformer for universal image segmentation","author":"B Cheng","year":"2022","journal-title":"IEEE Conf. 
on Computer Vision and Pattern Recognition"},{"key":"ref5","article-title":"Arcface: Additive angular margin loss for deep face recognition","author":"J Deng","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"issue":"2","key":"ref6","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The pascal visual object classes (voc) challenge","volume":"88","author":"M Everingham","year":"2010","journal-title":"International Journal of Computer Vision"},{"issue":"8017","key":"ref7","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1038\/s41586-024-07421-0","article-title":"Detecting hallucinations in large language models using semantic entropy","volume":"630","author":"S Farquhar","year":"2024","journal-title":"Nature"},{"key":"ref8","first-page":"2025","author":"G Freytag","year":"1894","journal-title":"Freytag's Technique of the Drama: An Exposition of Dramatic Composition and Art. Scott, Foresman and Company, Chicago. An authorized translation from the 6th German edition of \"Die Technik des Dramas"},{"key":"ref9","doi-asserted-by":"crossref","first-page":"565","DOI":"10.1162\/tacl_a_00553","article-title":"Visual writing prompts: Character-grounded story generation with curated image sequences","volume":"11","author":"X Hong","year":"2023","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"ref10","author":"B Hsu","year":"2024","journal-title":"Liger kernel: Efficient triton kernels for llm training"},{"key":"ref11","article-title":"LoRA: Low-Rank Adaptation of Large Language Models","author":"E J Hu","year":"2022","journal-title":"Intl. Conf. 
on Learning Representations"},{"issue":"2","key":"ref12","doi-asserted-by":"crossref","DOI":"10.1145\/3703155","article-title":"A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions","volume":"43","author":"L Huang","year":"2025","journal-title":"ACM Trans. Inf. Syst"},{"key":"ref13","first-page":"1233","article-title":"Visual storytelling","author":"T.-H K Huang","year":"2016","journal-title":"Proceedings of the 2016 Conference of the North American Chapter"},{"key":"ref14","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1609\/icwsm.v8i1.14550","article-title":"VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text","volume":"8","author":"C J Hutto","year":"2014","journal-title":"Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media"},{"key":"ref15","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1177\/001316447003000105","article-title":"Estimating the reliability, systematic error and random error of interval data","volume":"30","author":"K Krippendorff","year":"1970","journal-title":"Educational and Psychological Measurement"},{"key":"ref16","doi-asserted-by":"crossref","DOI":"10.1145\/3600006.3613165","article-title":"Efficient memory management for large language model serving with pagedattention","author":"W Kwon","year":"2023","journal-title":"Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles"},{"key":"ref17","article-title":"BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models","author":"J Li","year":"2023","journal-title":"Proceedings of the 40th International Conference on Machine Learning, ICML'23"},{"key":"ref18","first-page":"3468","article-title":"REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction","volume":"229","author":"Z Liu","year":"2023","journal-title":"Proceedings of The 7th Conference on Robot 
Learning"},{"key":"ref19","author":"Z Liu","year":"2021","journal-title":"Swin transformer: Hierarchical vision transformer using shifted windows"},{"key":"ref20","article-title":"SGDR: stochastic gradient descent with warm restarts","author":"I Loshchilov","year":"2017","journal-title":"5th International Conference on Learning Representations"},{"key":"ref21","article-title":"Decoupled weight decay regularization","author":"I Loshchilov","year":"2019","journal-title":"International Conference on Learning Representations"},{"key":"ref22","author":"D A P Oliveira","year":"2024","journal-title":"Story generation from visual inputs: Techniques, related tasks, and challenges"},{"key":"ref23","author":"D A P Oliveira","year":"2025","journal-title":"GroundCap: A visually grounded image captioning dataset"},{"key":"ref24","first-page":"2025","author":"Openai","year":"2024","journal-title":"Hello GPT"},{"key":"ref25","article-title":"DINOv2: Learning Robust Visual Features without Supervision","author":"M Oquab","year":"2024","journal-title":"Transactions on Machine Learning Research"},{"key":"ref26","author":"K Park","year":"2024","journal-title":"A character-centric creative story generation via imagination"},{"key":"ref27","article-title":"Learning transferable visual models from natural language supervision","author":"A Radford","year":"2021","journal-title":"International Conference on Machine Learning"},{"key":"ref28","author":"C Schuhmann","year":"2021","journal-title":"LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs"},{"key":"ref29","first-page":"4580","article-title":"VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research","author":"X E Wang","year":"2019","journal-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"ref30","article-title":"Chain-of-thought prompting elicits reasoning in large language models","author":"J Wei","year":"2022","journal-title":"Proceedings of the 36th 
International Conference on Neural Information Processing Systems, NIPS '22"},{"key":"ref31","author":"G Welch","year":"1995","journal-title":"An introduction to the kalman filter"},{"key":"ref32","first-page":"2572","article-title":"Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval","author":"T Weyand","year":"2020","journal-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"ref33","first-page":"12658","article-title":"Transitional adaptation of pretrained models for visual storytelling","author":"Y Yu","year":"2021","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"ref34","first-page":"11941","article-title":"Sigmoid loss for language image pre-training","author":"X Zhai","year":"2023","journal-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"ref35","first-page":"14227","article-title":"GROUNDHOG: Grounding large language models to holistic segmentation","author":"Y Zhang","year":"2024","journal-title":"IEEE Conf. on Computer Vision and Pattern Recognition"},{"key":"ref36","article-title":"Multimodal chain-of-thought reasoning in language models","author":"Z Zhang","year":"2023","journal-title":"Trans. Mach. Learn. 
Res"},{"key":"ref37","first-page":"12747","article-title":"Two heads are better than one: Hypergraph-enhanced graph reasoning for visual event ratiocination","volume":"139","author":"W Zheng","year":"2021","journal-title":"Proceedings of the 38th International Conference on Machine Learning"},{"key":"ref38","article-title":"Storydiffusion: Consistent self-attention for long-range image and video generation","author":"Y Zhou","year":"2024","journal-title":"The Thirty-eighth Annual Conference on Neural Information Processing Systems"}],"container-title":[],"original-title":[],"deposited":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T11:46:23Z","timestamp":1770119183000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ssrn.com\/abstract=6172566"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026]]},"references-count":38,"URL":"https:\/\/doi.org\/10.2139\/ssrn.6172566","relation":{},"subject":[],"published":{"date-parts":[[2026]]},"subtype":"preprint"}}