{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T11:10:19Z","timestamp":1777633819830,"version":"3.51.4"},"reference-count":85,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"EU HEU AI4TRUST","award":["101070190"],"award-info":[{"award-number":["101070190"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Neural video game simulators emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment\u2019s state driven by the actions of its agents. While such a paradigm enables users to <jats:italic>play<\/jats:italic> a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with <jats:italic>prompts<\/jats:italic> specified as a set of <jats:italic>natural language<\/jats:italic> actions and <jats:italic>desired states<\/jats:italic>. The result\u2014a Promptable Game Model (PGM)\u2014makes it possible for a user to <jats:italic>play<\/jats:italic> the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the <jats:italic>director\u2019s mode<\/jats:italic>, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning \u201cgame AI,\u201d encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. 
To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state-of-the-art. Our framework, data, and models are available at snap-research.github.io\/promptable-game-models.<\/jats:p>","DOI":"10.1145\/3635705","type":"journal-article","created":{"date-parts":[[2023,12,5]],"date-time":"2023-12-05T12:04:23Z","timestamp":1701777863000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Promptable Game Models: Text-guided Game Simulation via Masked Diffusion Models"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0715-9300","authenticated-orcid":false,"given":"Willi","family":"Menapace","sequence":"first","affiliation":[{"name":"University of Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9252-1775","authenticated-orcid":false,"given":"Aliaksandr","family":"Siarohin","sequence":"additional","affiliation":[{"name":"Snap Inc., USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6927-8930","authenticated-orcid":false,"given":"St\u00e9phane","family":"Lathuili\u00e8re","sequence":"additional","affiliation":[{"name":"LTCI, T\u00e9l\u00e9com Paris, Institut Polytechnique de Paris, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6427-6055","authenticated-orcid":false,"given":"Panos","family":"Achlioptas","sequence":"additional","affiliation":[{"name":"Snap Inc., 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1630-2006","authenticated-orcid":false,"given":"Vladislav","family":"Golyanik","sequence":"additional","affiliation":[{"name":"MPI for Informatics, SIC, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3465-1592","authenticated-orcid":false,"given":"Sergey","family":"Tulyakov","sequence":"additional","affiliation":[{"name":"Snap Inc., USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0228-1147","authenticated-orcid":false,"given":"Elisa","family":"Ricci","sequence":"additional","affiliation":[{"name":"University of Trento, Fondazione Bruno Kessler, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1,3]]},"reference":[{"key":"e_1_3_2_2_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201923)","author":"Achlioptas Panos","year":"2023","unstructured":"Panos Achlioptas, Ian Huang, Minhyuk Sung, Sergey Tulyakov, and Leonidas Guibas. 2023. ChangeIt3D: Language-assisted 3D shape edits and deformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201923)."},{"key":"e_1_3_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/3DV57658.2022.00053"},{"key":"e_1_3_2_4_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Babaeizadeh Mohammad","year":"2018","unstructured":"Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. 2018. Stochastic variational video prediction. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_3_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_3_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02161"},{"key":"e_1_3_2_7_1","volume-title":"Proceedings of Nucl.AI","author":"B\u00fcttner Michael","year":"2015","unstructured":"Michael B\u00fcttner and Simon Clavet. 2015. Motion matching\u2014The road to next gen animation. In Proceedings of Nucl.AI."},{"key":"e_1_3_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01565"},{"key":"e_1_3_2_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19824-3_20"},{"key":"e_1_3_2_10_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201921)","author":"Chen Nanxin","year":"2021","unstructured":"Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2021. WaveGrad: Estimating gradients for waveform generation. In Proceedings of the International Conference on Learning Representations (ICLR\u201921)."},{"key":"e_1_3_2_11_1","article-title":"Recurrent environment simulators","author":"Chiappa Silvia","year":"2017","unstructured":"Silvia Chiappa, S\u00e9bastien Racani\u00e8re, Daan Wierstra, and Shakir Mohamed. 2017. Recurrent environment simulators. arXiv (2017).","journal-title":"arXiv"},{"key":"e_1_3_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3561975.3562941"},{"key":"e_1_3_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00941"},{"key":"e_1_3_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19790-1_5"},{"key":"e_1_3_2_15_1","doi-asserted-by":"publisher","unstructured":"Sander Dieleman Laurent Sartran Arman Roshannai Nikolay Savinov Yaroslav Ganin Pierre H. Richemond Arnaud Doucet Robin Strudel Chris Dyer Conor Durkan Curtis Hawthorne R\u00e9mi Leblond Will Grathwohl and Jonas Adler. 2022. 
Continuous diffusion for categorical data. 10.48550\/arXiv.2211.15089","DOI":"10.48550\/arXiv.2211.15089"},{"key":"e_1_3_2_16_1","volume-title":"Proceedings of the International Conference on Artificial Intelligence and Statistics","author":"Fortuin Vincent","year":"2020","unstructured":"Vincent Fortuin, Dmitry Baranchuk, Gunnar R\u00e4tsch, and Stephan Mandt. 2020. GP-VAE: Deep probabilistic time series imputation. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR."},{"key":"e_1_3_2_17_1","article-title":"Sur la distance de deux lois de probabilit\u00e9","author":"Fr\u00e9chet Maurice","year":"1957","unstructured":"Maurice Fr\u00e9chet. 1957. Sur la distance de deux lois de probabilit\u00e9. Comptes Rendus Hebdom. Seances Acad. Sci. 244, 6 (1957), 689\u2013692.","journal-title":"Comptes Rendus Hebdom. Seances Acad. Sci."},{"key":"e_1_3_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00542"},{"key":"e_1_3_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01029"},{"key":"e_1_3_2_20_1","article-title":"Rich feature hierarchies for accurate object detection and semantic segmentation","author":"Girshick Ross","year":"2013","unstructured":"Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201913).","journal-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201913)"},{"key":"e_1_3_2_21_1","volume-title":"Game Engine Architecture","author":"Gregory Jason","year":"2018","unstructured":"Jason Gregory. 2018. Game Engine Architecture. 
CRC Press."},{"key":"e_1_3_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00360"},{"key":"e_1_3_2_23_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201917)","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS\u201917)."},{"key":"e_1_3_2_24_1","doi-asserted-by":"publisher","unstructured":"Jonathan Ho William Chan Chitwan Saharia Jay Whang Ruiqi Gao Alexey Gritsenko Diederik P. Kingma Ben Poole Mohammad Norouzi David J. Fleet and Tim Salimans. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. 10.48550\/arXiv.2210.02303","DOI":"10.48550\/arXiv.2210.02303"},{"key":"e_1_3_2_25_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201920)","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS\u201920)."},{"key":"e_1_3_2_26_1","volume-title":"Proceedings of the ICLR Workshop on Deep Generative Models for Highly Structured Data","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022b. Video diffusion models. In Proceedings of the ICLR Workshop on Deep Generative Models for Highly Structured Data."},{"key":"e_1_3_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392440"},{"key":"e_1_3_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073663"},{"key":"e_1_3_2_29_1","doi-asserted-by":"publisher","unstructured":"Wenyi Hong Ming Ding Wendi Zheng Xinghan Liu and Jie Tang. 2022. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. 
10.48550\/arXiv.2205.15868","DOI":"10.48550\/arXiv.2205.15868"},{"key":"e_1_3_2_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19787-1_31"},{"key":"e_1_3_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00094"},{"key":"e_1_3_2_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_43"},{"key":"e_1_3_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00813"},{"key":"e_1_3_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00576"},{"key":"e_1_3_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00131"},{"key":"e_1_3_2_36_1","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR\u201915) San Diego CA Conference Track Proceedings. Retrieved from http:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_2_37_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201920)","author":"Kong Zhifeng","year":"2020","unstructured":"Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. DiffWave: A versatile diffusion model for audio synthesis. In Proceedings of the International Conference on Learning Representations (ICLR\u201920)."},{"key":"e_1_3_2_38_1","article-title":"Panoptic neural fields: A semantic object-aware neural scene representation","author":"Kundu Abhijit","year":"2022","unstructured":"Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J. Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas A. Funkhouser. 2022. Panoptic neural fields: A semantic object-aware neural scene representation. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201922).","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201922)"},{"key":"e_1_3_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00191"},{"key":"e_1_3_2_40_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201922)","author":"Lam Max W. Y.","year":"2022","unstructured":"Max W. Y. Lam, Jun Wang, Dan Su, and Dong Yu. 2022. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. In Proceedings of the International Conference on Learning Representations (ICLR\u201922)."},{"key":"e_1_3_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3272127.3275071"},{"key":"e_1_3_2_42_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201922)","author":"Leng Yichong","year":"2022","unstructured":"Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiangyang Li, Tao Qin, sheng zhao, and Tie-Yan Liu. 2022. BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201922)."},{"key":"e_1_3_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/344779.344862"},{"key":"e_1_3_2_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19824-3_25"},{"key":"e_1_3_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00037"},{"key":"e_1_3_2_46_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201920)","author":"Liu Yinhan","year":"2020","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201920)."},{"key":"e_1_3_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2816795.2818013"},{"key":"e_1_3_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3450626.3459785"},{"key":"e_1_3_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00993"},{"key":"e_1_3_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00357"},{"key":"e_1_3_2_51_1","volume-title":"Proceedings of the NeurIPS 2022 Workshop on Score-based Methods","author":"Meng Chenlin","year":"2022","unstructured":"Chenlin Meng, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. 2022. On distillation of guided diffusion models. In Proceedings of the NeurIPS 2022 Workshop on Score-based Methods."},{"key":"e_1_3_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00272"},{"key":"e_1_3_2_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_24"},{"key":"e_1_3_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00394"},{"key":"e_1_3_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530127"},{"key":"e_1_3_2_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417804"},{"key":"e_1_3_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01129"},{"key":"e_1_3_2_58_1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems (NeurIPS\u201915)","author":"Oh Junhyuk","year":"2015","unstructured":"Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. 2015. Action-conditional video prediction using deep networks in Atari games. 
In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS\u201915)."},{"key":"e_1_3_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00288"},{"key":"e_1_3_2_60_1","article-title":"Nerfies: Deformable neural radiance fields","author":"Park Keunhong","year":"2021","unstructured":"Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021a. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201921).","journal-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201921)"},{"key":"e_1_3_2_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478513.3480487"},{"key":"e_1_3_2_62_1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel Colin","year":"2022","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2022. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1 (2022).","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_63_1","doi-asserted-by":"publisher","unstructured":"Aditya Ramesh Prafulla Dhariwal Alex Nichol Casey Chu and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. 10.48550\/arXiv.2204.06125","DOI":"10.48550\/arXiv.2204.06125"},{"key":"e_1_3_2_64_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML\u201921)","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. 
In Proceedings of the 38th International Conference on Machine Learning (ICML\u201921)."},{"key":"e_1_3_2_65_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201915)","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201915)."},{"key":"e_1_3_2_66_1","unstructured":"ReplayMod. 2022. ReplayMod. Retrieved from https:\/\/github.com\/ReplayMod\/ReplayMod"},{"key":"e_1_3_2_67_1","article-title":"High-resolution image synthesis with latent diffusion models","author":"Rombach Robin","year":"2021","unstructured":"Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj\u00f6rn Ommer. 2021. High-resolution image synthesis with latent diffusion models. arXiv (2021).","journal-title":"arXiv"},{"key":"e_1_3_2_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_3_2_69_1","doi-asserted-by":"crossref","unstructured":"Chitwan Saharia William Chan Saurabh Saxena Lala Li Jay Whang Emily Denton Seyed Kamyar Seyed Ghasemipour Raphael Gontijo-Lopes Burcu Karagol Ayan Tim Salimans Jonathan Ho David J. Fleet and Mohammad Norouzi. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems. Retrieved from https:\/\/openreview.net\/forum?id=08Yk-n5l2Al","DOI":"10.1145\/3528233.3530757"},{"key":"e_1_3_2_70_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201922)","author":"Salimans Tim","year":"2022","unstructured":"Tim Salimans and Jonathan Ho. 2022. Progressive distillation for fast sampling of diffusion models. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201922)."},{"key":"e_1_3_2_71_1","volume-title":"Proceedings of the Conference on Neural Information Processing Systems (NeurIPS\u201922) Datasets and Benchmarks Track","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W. Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R. Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS\u201922) Datasets and Benchmarks Track."},{"key":"e_1_3_2_72_1","unstructured":"Uriel Singer Adam Polyak Thomas Hayes Xi Yin Jie An Songyang Zhang Qiyuan Hu Harry Yang Oron Ashual Oran Gafni Devi Parikh Sonal Gupta and Yaniv Taigman. 2023. Make-A-Video: Text-to-video generation without text-video data."},{"key":"e_1_3_2_74_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201921)","author":"Song Jiaming","year":"2021","unstructured":"Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR\u201921)."},{"key":"e_1_3_2_75_1","volume-title":"Proceedings of the Eurographics\/ACM SIGGRAPH Symposium on Computer Animation","author":"Stanton Matt","year":"2016","unstructured":"Matt Stanton, Sascha Geddert, Adrian Blumer, Paul Hormis, Andy Nealen, Seth Cooper, and Adrien Treuille. 2016. Large-scale finite state game engines. 
In Proceedings of the Eurographics\/ACM SIGGRAPH Symposium on Computer Animation."},{"key":"e_1_3_2_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355089.3356505"},{"key":"e_1_3_2_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392450"},{"key":"e_1_3_2_78_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201921)","author":"Tashiro Yusuke","year":"2021","unstructured":"Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201921)."},{"key":"e_1_3_2_79_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20047-2_21"},{"key":"e_1_3_2_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01272"},{"key":"e_1_3_2_81_1","unstructured":"Thomas Unterthiner Sjoerd van Steenkiste Karol Kurach Rapha\u00ebl Marinier Marcin Michalski and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. CoRR abs\/1812.01717 (2018). Retrieved from http:\/\/arxiv.org\/abs\/1812.01717"},{"key":"e_1_3_2_82_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201917)","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201917)."},{"key":"e_1_3_2_83_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01573"},{"key":"e_1_3_2_84_1","doi-asserted-by":"crossref","unstructured":"Yinghao Xu Menglei Chai Zifan Shi Sida Peng Ivan Skorokhodov Aliaksandr Siarohin Ceyuan Yang Yujun Shen Hsin-Ying Lee Bolei Zhou and Sergey Tulyakov. 2023. 
DisCoScene: Spatially disentangled generative radiance fields for controllable 3D-aware scene synthesis. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201923). 4402\u20134412.","DOI":"10.1109\/CVPR52729.2023.00428"},{"key":"e_1_3_2_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00570"},{"key":"e_1_3_2_86_1","doi-asserted-by":"publisher","unstructured":"Mingyuan Zhang Zhongang Cai Liang Pan Fangzhou Hong Xinying Guo Lei Yang and Ziwei Liu. 2022. MotionDiffuse: Text-driven human motion generation with diffusion model. 10.48550\/arXiv.2208.15001","DOI":"10.48550\/arXiv.2208.15001"},{"key":"e_1_3_2_87_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00068"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635705","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3635705","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:56:59Z","timestamp":1750291019000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635705"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,3]]},"references-count":85,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3635705"],"URL":"https:\/\/doi.org\/10.1145\/3635705","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,3]]},"assertion":[{"value":"2023-05-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-11-13","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}