{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:08:52Z","timestamp":1750219732639,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":28,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,29]],"date-time":"2023-10-29T00:00:00Z","timestamp":1698537600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,29]]},"DOI":"10.1145\/3607540.3617144","type":"proceedings-article","created":{"date-parts":[[2023,10,30]],"date-time":"2023-10-30T01:00:50Z","timestamp":1698627650000},"page":"57-63","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Story-to-Images Translation: Leveraging Diffusion Models and Large Language Models for Sequence Image Generation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7278-8303","authenticated-orcid":false,"given":"Haruka","family":"Kumagai","sequence":"first","affiliation":[{"name":"The University of Tokyo, Tokyo, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6940-8022","authenticated-orcid":false,"given":"Ryosuke","family":"Yamaki","sequence":"additional","affiliation":[{"name":"Ritsumeikan University &amp; ProPlace Inc., Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4595-8381","authenticated-orcid":false,"given":"Hiroki","family":"Naganuma","sequence":"additional","affiliation":[{"name":"Universit\u00e9 de Montr\u00e9al &amp; ProPlace Inc., Montr\u00e9al, PQ, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,10,29]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661","author":"Aky\u00fcrek Ekin","year":"2022","unstructured":"Ekin Aky\u00fcrek , Dale Schuurmans , Jacob Andreas , Tengyu Ma , and Denny Zhou . 2022. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661 ( 2022 ). Ekin Aky\u00fcrek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661 (2022)."},{"key":"e_1_3_2_1_2_1","volume-title":"Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491","author":"Chen Wenhu","year":"2022","unstructured":"Wenhu Chen , Hexiang Hu , Chitwan Saharia , and William W Cohen . 2022 . Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491 (2022). Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. 2022. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491 (2022)."},{"key":"e_1_3_2_1_3_1","volume-title":"Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005","author":"Conwell Colin","year":"2022","unstructured":"Colin Conwell and Tomer Ullman . 2022. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005 ( 2022 ). Colin Conwell and Tomer Ullman. 2022. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005 (2022)."},{"key":"e_1_3_2_1_4_1","volume-title":"An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618","author":"Gal Rinon","year":"2022","unstructured":"Rinon Gal , Yuval Alaluf , Yuval Atzmon , Or Patashnik , Amit H Bermano , Gal Chechik , and Daniel Cohen-Or . 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 ( 2022 ). Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01237-3_37"},{"key":"e_1_3_2_1_7_1","volume-title":"Denoising diffusion probabilistic models. Advances in neural information processing systems 33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho , Ajay Jain , and Pieter Abbeel . 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 ( 2020 ), 6840--6851. Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840--6851."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2020.102956"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00649"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.543"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.194"},{"key":"e_1_3_2_1_13_1","volume-title":"Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27","author":"Maharana Adyasha","year":"2022","unstructured":"Adyasha Maharana , Darryl Hannan , and Mohit Bansal . 2022 . Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27 , 2022, Proceedings, Part XXXVII. Springer , 70--87. Adyasha Maharana, Darryl Hannan, and Mohit Bansal. 2022. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXVII. Springer, 70--87."},{"key":"e_1_3_2_1_14_1","volume-title":"Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models. arXiv preprint arXiv:2211.10950","author":"Pan Xichen","year":"2022","unstructured":"Xichen Pan , Pengda Qin , Yuhong Li , Hui Xue , and Wenhu Chen . 2022. Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models. arXiv preprint arXiv:2211.10950 ( 2022 ). Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. 2022. Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models. arXiv preprint arXiv:2211.10950 (2022)."},{"key":"e_1_3_2_1_15_1","volume-title":"International conference on machine learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International conference on machine learning. PMLR, 8748--8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00246"},{"key":"e_1_3_2_1_17_1","volume-title":"Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh , Prafulla Dhariwal , Alex Nichol , Casey Chu , and Mark Chen . 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 ( 2022 ). Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)."},{"key":"e_1_3_2_1_18_1","volume-title":"International Conference on Machine Learning. PMLR, 8821--8831","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh , Mikhail Pavlov , Gabriel Goh , Scott Gray , Chelsea Voss , Alec Radford , Mark Chen , and Ilya Sutskever . 2021 . Zero-shot text-to-image generation . In International Conference on Machine Learning. PMLR, 8821--8831 . Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821--8831."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02155"},{"key":"e_1_3_2_1_21_1","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia , William Chan , Saurabh Saxena , Lala Li , Jay Whang , Emily L Denton , Kamyar Ghasemipour , Raphael Gontijo Lopes , Burcu Karagol Ayan , Tim Salimans , 2022 . Photorealistic text-to-image diffusion models with deep language understanding . Advances in Neural Information Processing Systems 35 (2022), 36479 -- 36494 . Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479--36494.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_22_1","unstructured":"Christoph Schuhmann Romain Beaumont Richard Vencu Cade Gordon Ross Wightman Mehdi Cherti Theo Coombes Aarush Katta Clayton Mullis Mitchell Wortsman etal 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).  Christoph Schuhmann Romain Beaumont Richard Vencu Cade Gordon Ross Wightman Mehdi Cherti Theo Coombes Aarush Katta Clayton Mullis Mitchell Wortsman et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022)."},{"key":"e_1_3_2_1_23_1","volume-title":"International Conference on Machine Learning. PMLR, 2256--2265","author":"Sohl-Dickstein Jascha","year":"2015","unstructured":"Jascha Sohl-Dickstein , Eric Weiss , Niru Maheswaranathan , and Surya Ganguli . 2015 . Deep unsupervised learning using nonequilibrium thermodynamics . In International Conference on Machine Learning. PMLR, 2256--2265 . Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256--2265."},{"key":"e_1_3_2_1_24_1","volume-title":"Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677","author":"von Oswald Johannes","year":"2022","unstructured":"Johannes von Oswald , Eyvind Niklasson , Ettore Randazzo , Jo\u00e3o Sacramento , Alexander Mordvintsev , Andrey Zhmoginov , and Max Vladymyrov . 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677 ( 2022 ). Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo\u00e3o Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677 (2022)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00143"},{"key":"e_1_3_2_1_26_1","volume-title":"Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu.","author":"Yu Jiahui","year":"2022","unstructured":"Jiahui Yu , Yuanzhong Xu , Jing Yu Koh , Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022 . Scaling Autoregressive Models for Content-Rich Text-to-Image Generation . Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3374587.3374649"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.629"}],"event":{"name":"MM '23: The 31st ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Ottawa ON Canada","acronym":"MM '23"},"container-title":["Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3607540.3617144","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3607540.3617144","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:28Z","timestamp":1750178188000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3607540.3617144"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,29]]},"references-count":28,"alternative-id":["10.1145\/3607540.3617144","10.1145\/3607540"],"URL":"https:\/\/doi.org\/10.1145\/3607540.3617144","relation":{},"subject":[],"published":{"date-parts":[[2023,10,29]]},"assertion":[{"value":"2023-10-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}