{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:23:08Z","timestamp":1775578988203,"version":"3.50.1"},"reference-count":100,"publisher":"Association for Computing Machinery (ACM)","issue":"9","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62301316, 62271312, and 62132006"],"award-info":[{"award-number":["62301316, 62271312, and 62132006"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003399","name":"STCSM","doi-asserted-by":"crossref","award":["22DZ2229005"],"award-info":[{"award-number":["22DZ2229005"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>\n            In recent years, AI-driven video generation has gained significant attention due to great advancements in visual and language generative techniques. Consequently, there is a growing need for accurate Video Quality Assessment (VQA) metrics to evaluate the perceptual quality of AI-generated content (AIGC) videos and optimize video generation models. However, assessing the quality of AIGC videos remains a significant challenge because these videos often exhibit highly complex distortions, such as unnatural actions and irrational objects. To address this challenge, we systematically investigate the AIGC-VQA problem in this article, considering both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the\n            <jats:italic toggle=\"yes\">L<\/jats:italic>\n            arge-scale\n            <jats:italic toggle=\"yes\">G<\/jats:italic>\n            enerated\n            <jats:italic toggle=\"yes\">V<\/jats:italic>\n            ideo\n            <jats:italic toggle=\"yes\">Q<\/jats:italic>\n            uality Assessment (LGVQ) dataset, consisting of\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(2,\\!808\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            AIGC videos generated by six video generation models using 468 carefully curated text prompts. Unlike previous subjective VQA experiments, we evaluate the perceptual quality of AIGC videos from three critical dimensions: spatial quality, temporal quality, and text-video alignment, which hold utmost importance for current video generation techniques. For the objective perspective, we establish a benchmark for evaluating existing quality assessment metrics on the LGVQ dataset. Our findings show that current metrics perform poorly on this dataset, highlighting a gap in effective evaluation tools. To bridge this gap, we propose the\n            <jats:italic toggle=\"yes\">U<\/jats:italic>\n            nify\n            <jats:italic toggle=\"yes\">G<\/jats:italic>\n            enerated\n            <jats:italic toggle=\"yes\">V<\/jats:italic>\n            ideo\n            <jats:italic toggle=\"yes\">Q<\/jats:italic>\n            uality Assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos. 
The UGVQ model integrates the visual and motion features of videos with the textual features of their corresponding prompts, forming a unified quality-aware feature representation tailored to AIGC videos. Experimental results demonstrate that UGVQ achieves state-of-the-art performance on the LGVQ dataset across all three quality dimensions, validating its effectiveness as an accurate quality metric for AIGC videos. We hope that our benchmark can promote the development of AIGC-VQA studies. Both the LGVQ dataset and the UGVQ model are publicly available on\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/zczhang-sjtu\/UGVQ.git\">https:\/\/github.com\/zczhang-sjtu\/UGVQ.git<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3749844","type":"journal-article","created":{"date-parts":[[2025,7,22]],"date-time":"2025-07-22T22:20:26Z","timestamp":1753222826000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1466-6383","authenticated-orcid":false,"given":"Zhichao","family":"Zhang","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8162-1949","authenticated-orcid":false,"given":"Wei","family":"Sun","sequence":"additional","affiliation":[{"name":"East China Normal University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7362-0532","authenticated-orcid":false,"given":"Xinyue","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5424-4284","authenticated-orcid":false,"given":"Jun","family":"Jia","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5693-0416","authenticated-orcid":false,"given":"Xiongkuo","family":"Min","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7247-7938","authenticated-orcid":false,"given":"Zicheng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0634-1710","authenticated-orcid":false,"given":"Chunyi","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8502-4110","authenticated-orcid":false,"given":"Zijian","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2943-4610","authenticated-orcid":false,"given":"Puyi","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8258-8109","authenticated-orcid":false,"given":"Fengyu","family":"Sun","sequence":"additional","affiliation":[{"name":"Huawei Technologies Co Ltd., Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1047-4264","authenticated-orcid":false,"given":"Shangling","family":"Jui","sequence":"additional","affiliation":[{"name":"Huawei Technologies Co Ltd., Shanghai, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8165-9322","authenticated-orcid":false,"given":"Guangtao","family":"Zhai","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,11]]},"reference":[{"key":"e_1_3_2_2_2","volume-title":"Methodology for the Subjective Assessment of the Quality of Television Pictures","author":"International Telecommunication Union","year":"2002","unstructured":"International Telecommunication Union. 2002. Methodology for the Subjective Assessment of the Quality of Television Pictures, ITU-R Recommendation BT.500-11."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_3_2_4_2","unstructured":"Mikolaj Binkowski Danica J. Sutherland Michael Arbel and Arthur Gretton. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717. Retrieved from https:\/\/arxiv.org\/abs\/1812.01717"},{"key":"e_1_3_2_5_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Black Kevin","year":"2023","unstructured":"Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. 2023. Training diffusion models with reinforcement learning. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2024.3378466"},{"key":"e_1_3_2_8_2","unstructured":"Haoxin Chen Menghan Xia Yingqing He Yong Zhang Xiaodong Cun Shaoshu Yang Jinbo Xing Yaofang Liu Qifeng Chen Xintao Wang et al. 2023. Videocrafter1: Open diffusion models for high-quality video generation. arXiv:2310.19512. Retrieved from https:\/\/arxiv.org\/abs\/2310.19512"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","unstructured":"Shoufa Chen Chongjian Ge Yuqi Zhang Yida Zhang Fengda Zhu Hao Yang Hongxiang Hao Hui Wu Zhichao Lai Yifei Hu et al. 2025. Goku: Flow based video generative foundation models. arXiv:2502.04896. Retrieved from https:\/\/arxiv.org\/abs\/2502.04896","DOI":"10.1109\/CVPR52734.2025.02190"},{"key":"e_1_3_2_10_2","unstructured":"Zijian Chen Wei Sun Yuan Tian Jun Jia Zicheng Zhang Wang Jiarui Ru Huang Xiongkuo Min Guangtao Zhai and Wenjun Zhang. 2024. GAIA: Rethinking action quality assessment for AI-generated videos. In Proceedings of the Advances in Neural Information Processing Systems Vol. 37 40111\u201340144."},{"key":"e_1_3_2_11_2","unstructured":"Iya Chivileva Philip Lynch Tomas E. Ward and Alan F. Smeaton. 2023. Measuring the quality of text-to-video model outputs: Metrics and dataset. arXiv:2309.08009. Retrieved from https:\/\/arxiv.org\/abs\/2309.08009"},{"key":"e_1_3_2_12_2","unstructured":"Joseph Cho Fachrina Dewi Puspitasari Sheng Zheng Jingyao Zheng Lik-Hang Lee Tae-Ho Kim Choong Seon Hong and Chaoning Zhang. 2024. Sora as an AGI world model? A complete survey on text-to-video generation. arXiv:2403.05131. Retrieved from https:\/\/arxiv.org\/abs\/2403.05131"},{"key":"e_1_3_2_13_2","first-page":"2216","article-title":"IRC-GAN: Introspective recurrent convolutional GAN for text-to-video generation","author":"Deng Kangle","year":"2019","unstructured":"Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. 2019. IRC-GAN: Introspective recurrent convolutional GAN for text-to-video generation. 
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2216\u20132222.","journal-title":"IJCAI"},{"key":"e_1_3_2_14_2","unstructured":"Ming Ding Wendi Zheng Wenyi Hong and Jie Tang. 2022. CogView2: Faster and better text-to-image generation via hierarchical transformers. arXiv:2204.14217. Retrieved from https:\/\/arxiv.org\/abs\/2204.14217"},{"key":"e_1_3_2_15_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_2_16_2","first-page":"121","volume-title":"Proceedings of the International Conference on Human-Computer Interaction","author":"Du Duo","year":"2023","unstructured":"Duo Du, Yanling Zhang, and Jiao Ge. 2023. Effect of AI generated content advertising on consumer engagement. In Proceedings of the International Conference on Human-Computer Interaction. Springer, 121\u2013129."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00675"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","first-page":"2693","DOI":"10.1109\/TIP.2023.3272480","article-title":"Study of spatio-temporal modeling in video quality assessment","author":"Fang Yuming","year":"2023","unstructured":"Yuming Fang, Zhaoqian Li, Jiebin Yan, Xiangjie Sui, and Hantao Liu. 2023. Study of spatio-temporal modeling in video quality assessment. IEEE Transactions on Image Processing 32 (2023), 2693\u20132702.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00373"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_3_2_21_2","unstructured":"Qihang Ge Wei Sun Yu Zhang Yunhao Li Zhongpeng Ji Fengyu Sun Shangling Jui Xiongkuo Min and Guangtao Zhai. 2024. LMM-VQA: Advancing video quality assessment with large multimodal models. arXiv:2408.14008. Retrieved from https:\/\/arxiv.org\/abs\/2408.14008"},{"key":"e_1_3_2_22_2","unstructured":"Dhruba Ghosh Hannaneh Hajishirzi and Ludwig Schmidt. 2024. Geneval: An object-focused framework for evaluating text-to-image alignment. In Proceedings of the Advances in Neural Information Processing Systems Vol. 36."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00404"},{"key":"e_1_3_2_24_2","first-page":"1759","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV)","author":"Goyal Yash","year":"2017","unstructured":"Yash Goyal, Ammar Khattak, Sandeep Kottur, Amit Agrawal, Dhruv Batra, and Devi Parikh. 2017. Making the VQA model smarter: Learning from the web. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1759\u20131767."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00633"},{"key":"e_1_3_2_26_2","unstructured":"Rongzhang Gu Hui Li Changyue Su and Wenyan Wu. 2023. Innovative digital storytelling with AIGC: Exploration and discussion of recent advances. arXiv:2309.14329. Retrieved from https:\/\/arxiv.org\/abs\/2309.14329"},{"key":"e_1_3_2_27_2","unstructured":"Yingqing He Tianyu Yang Yong Zhang Ying Shan and Qifeng Chen. 2022. Latent video diffusion models for high-fidelity long video generation. arXiv:2211.13221. 
Retrieved from https:\/\/arxiv.org\/abs\/2211.13221"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"7514","DOI":"10.18653\/v1\/2021.emnlp-main.595","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Hessel Jack","year":"2021","unstructured":"Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7514\u20137528."},{"key":"e_1_3_2_29_2","unstructured":"Martin Heusel Hubert Ramsauer Thomas Unterthiner Bernhard Nessler and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems. I. Guyon U. Von Luxburg S. Bengio H. Wallach R. Fergus S. Vishwanathan and R. Garnett (Eds.) Vol. 30 Curran Associates Inc."},{"key":"e_1_3_2_30_2","unstructured":"Jonathan Ho Ajay Jain and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems Vol. 33 6840\u20136851."},{"key":"e_1_3_2_31_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Hong Wenyi","year":"2022","unstructured":"Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. CogVideo: Large-scale pretraining for text-to-video generation via transformers. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_2_32_2","first-page":"1","volume-title":"Proceedings of the 2017 9th International Conference on Quality of Multimedia Experience (QoMEX)","author":"Hosu Vlad","year":"2017","unstructured":"Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tam\u00e1s Szir\u00e1nyi, Shujun Li, and Dietmar Saupe. 2017. The Konstanz natural video database (KoNViD-1k). In Proceedings of the 2017 9th International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 1\u20136."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2967829"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2024.0234"},{"key":"e_1_3_2_35_2","first-page":"21807","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Huang Ziqi","year":"2024","unstructured":"Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. 2024. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21807\u201321818."},{"key":"e_1_3_2_36_2","first-page":"6706","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Hudson David A.","year":"2019","unstructured":"David A. Hudson and Christopher D. Manning. 2019. GQA: Visual question answering with graph-structured scenes. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6706\u20136715."},{"key":"e_1_3_2_37_2","first-page":"448","volume-title":"Proceedings of the 32nd International Conference on Machine Learning","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. 
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. Francis Bach and David Blei (Eds.), Vol. 37, PMLR, Lille, France, 448\u2013456."},{"key":"e_1_3_2_38_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Jin Wonjoon","year":"2025","unstructured":"Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. 2025. FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Retrieved from https:\/\/arxiv.org\/abs\/2502.08244"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00510"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","unstructured":"Levon Khachatryan Andranik Movsisyan Vahram Tadevosyan Roberto Henschel Zhangyang Wang Shant Navasardyan and Humphrey Shi. 2023. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv:2303.13439. Retrieved from https:\/\/arxiv.org\/abs\/2303.13439","DOI":"10.1109\/ICCV51070.2023.01462"},{"key":"e_1_3_2_41_2","unstructured":"Yuval Kirstain Adam Poliak Uriel Singer and Omer Levy. 2023. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems Vol. 36."},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2923051"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680868"},{"key":"e_1_3_2_44_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR)","author":"Kumar Manoj","year":"2020","unstructured":"Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk P. Kingma. 2020. VideoFlow: A conditional flow-based model for stochastic video generation. In Proceedings of the International Conference on Learning Representations (ICLR). Retrieved from https:\/\/arxiv.org\/abs\/1903.01434"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00636"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3319020"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351028"},{"key":"e_1_3_2_48_2","article-title":"BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning (ICML).","journal-title":"ICML"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12233"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00945"},{"key":"e_1_3_2_51_2","unstructured":"Jian Liang Chenfei Wu Xiaowei Hu Zhe Gan Jianfeng Wang Lijuan Wang Zicheng Liu Yuejian Fang and Nan Duan. 2022. Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. In Proceedings of the Advances in Neural Information Processing Systems Vol. 
35 15420\u201315432."},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","unstructured":"Zhiqiu Lin Deepak Pathak Baiqi Li Jiayao Li Xide Xia Graham Neubig Pengchuan Zhang and Deva Ramanan. 2025. Evaluating text-to-visual generation with image-to-text generation. In Proceedings of the Computer Vision (ECCV \u201924). Ale\u0161 Leonardis Elisa Ricci Stefan Roth Olga Russakovsky Torsten Sattler and G\u00fcl Varol (Eds.) Springer Nature Switzerland Cham 366\u2013384.","DOI":"10.1007\/978-3-031-72673-6_20"},{"key":"e_1_3_2_54_2","unstructured":"Chen Liu and Tobias Ritschel. 2025. Generative video bi-flow. arXiv:2503.06364. Retrieved from https:\/\/arxiv.org\/abs\/2503.06364"},{"key":"e_1_3_2_55_2","first-page":"22139","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Liu Yaofang","year":"2024","unstructured":"Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. 2024. EvalCrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22139\u201322149."},{"key":"e_1_3_2_56_2","unstructured":"Yuanxin Liu Lei Li Shuhuai Ren Rundong Gao Shicheng Li Sishuo Chen Xu Sun and Lu Hou. 2023. FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation. In Proceedings of the Advances in Neural Information Processing Systems. A. Oh T. Naumann A. Globerson K. Saenko M. Hardt and S. Levine (Eds.) Vol. 36 Curran Associates Inc. 62352\u201362387."},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00984"},{"key":"e_1_3_2_59_2","unstructured":"Xiongkuo Min Huiyu Duan Wei Sun Yucheng Zhu and Guangtao Zhai. 2024. Perceptual video quality assessment: A survey. arXiv:2402.03413. Retrieved from https:\/\/arxiv.org\/abs\/2402.03413"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2214050"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2012.2227726"},{"key":"e_1_3_2_62_2","unstructured":"John Mullan Duncan Crawbuck and Aakash Sastry. 2023. Hotshot-XL. Retrieved from https:\/\/github.com\/hotshotco\/hotshot-xl"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01372"},{"key":"e_1_3_2_64_2","doi-asserted-by":"crossref","unstructured":"Sunghyun Park Kangyeol Kim Junsoo Lee Jaegul Choo Joonseok Lee Sookyung Kim and Edward Choi. 2021. Vid-ODE: Continuous-time video generation with neural ordinary differential equation. In Proceedings of the AAAI Conference on Artificial Intelligence. Retrieved from https:\/\/arxiv.org\/abs\/2010.08188","DOI":"10.1609\/aaai.v35i3.16342"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00644"},{"key":"e_1_3_2_66_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 
PMLR, 8748\u20138763."},{"key":"e_1_3_2_67_2","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Radford A.","year":"2021","unstructured":"A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning."},{"key":"e_1_3_2_68_2","unstructured":"Tim Salimans Ian Goodfellow Wojciech Zaremba Vicki Cheung Alec Radford and Xi Chen. 2016. Improved techniques for training GANs. In Proceedings of the Advances in Neural Information Processing Systems Vol. 29."},{"key":"e_1_3_2_69_2","unstructured":"Christoph Schuhmann Romain Beaumont Richard Vencu Cade Gordon Ross Wightman Mehdi Cherti Thea Uhlich Andreas Askell Quoc-Huy Tran and Clayton Szczepaniak. 2021. LAION-5B: A New Dataset for CLIP-based Training and Beyond. Retrieved December 18 2024 from https:\/\/laion.ai\/"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2024.0123"},{"key":"e_1_3_2_71_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Su Shaolin","year":"2020","unstructured":"Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. 2020. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_72_2","unstructured":"Rui Sun Yumin Zhang Tejal Shah Jiaohao Sun Shuoying Zhang Wenqi Li Haoran Duan Bo Wei and Rajiv Ranjan. 2024. From Sora what we can see: A survey of text-to-video generation. arXiv:2405.10674. Retrieved from https:\/\/arxiv.org\/abs\/2405.10674"},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548329"},{"key":"e_1_3_2_74_2","doi-asserted-by":"crossref","DOI":"10.1109\/JSTSP.2023.3270621","article-title":"Blind quality assessment for in-the-wild images via hierarchical feature fusion and iterative mixed database training","author":"Sun Wei","year":"2023","unstructured":"Wei Sun, Xiongkuo Min, Danyang Tu, Siwei Ma, and Guangtao Zhai. 2023. Blind quality assessment for in-the-wild images via hierarchical feature fusion and iterative mixed database training. IEEE Journal of Selected Topics in Signal Processing 17 (2023), 1178\u20131192.","journal-title":"IEEE Journal of Selected Topics in Signal Processing"},{"key":"e_1_3_2_75_2","doi-asserted-by":"crossref","DOI":"10.1109\/TPAMI.2024.3385364","article-title":"Analysis of video quality datasets via design of minimalistic video quality models","author":"Sun Wei","year":"2024","unstructured":"Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, and Kede Ma. 2024. Analysis of video quality datasets via design of minimalistic video quality models. IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2024), 7056\u20137071.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58536-5_24"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3072221"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/OJSP.2021.3090333"},{"key":"e_1_3_2_79_2","unstructured":"Thomas Unterthiner Sjoerd van Steenkiste Karol Kurach Raphael Marinier Marcin Michalski and Sylvain Gelly. 2019. 
Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717. Retrieved from https:\/\/arxiv.org\/abs\/1812.01717"},{"key":"e_1_3_2_80_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i2.25353"},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01398"},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681471"},{"key":"e_1_3_2_83_2","unstructured":"Yi Wang Yinan He Yizhuo Li Kunchang Li Jiashuo Yu Xin Ma Xinhao Li Guo Chen Xinyuan Chen Yaohui Wang et al. 2023. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv:2307.06942. Retrieved from https:\/\/arxiv.org\/abs\/2307.06942"},{"key":"e_1_3_2_84_2","unstructured":"Yi Wang Kunchang Li Yizhuo Li Yinan He Bingkun Huang Zhiyu Zhao Hongjie Zhang Jilan Xu Yi Liu Zun Wang et al. 2022. InternVideo: General video foundation models via generative and discriminative learning. arXiv:2212.03191. Retrieved from https:\/\/arxiv.org\/abs\/2212.03191"},{"key":"e_1_3_2_85_2","article-title":"A survey on ChatGPT: AI-generated contents, challenges, and solutions","author":"Wang Yuntao","year":"2023","unstructured":"Yuntao Wang, Yanghe Pan, Miao Yan, Zhou Su, and Tom H. Luan. 2023. A survey on ChatGPT: AI-generated contents, challenges, and solutions. IEEE Open Journal of the Computer Society 4 (2023), 280\u2013302.","journal-title":"IEEE Open Journal of the Computer Society"},{"key":"e_1_3_2_86_2","unstructured":"Shaoguo Wen and Junle Wang. 2021. A strong baseline for image and video quality assessment. arXiv:2111.07104. Retrieved from https:\/\/arxiv.org\/abs\/2111.07104"},{"key":"e_1_3_2_87_2","first-page":"720","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wu Chenfei","year":"2022","unstructured":"Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. 2022. N\u00fcwa: Visual synthesis pre-training for neural visual world creation. In Proceedings of the European Conference on Computer Vision. Springer, 720\u2013736."},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20068-7_31"},{"key":"e_1_3_2_89_2","volume-title":"Proceedings of the International Conference on Computer Vision (ICCV)","author":"Wu Haoning","year":"2023","unstructured":"Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2023. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the International Conference on Computer Vision (ICCV)."},{"key":"e_1_3_2_90_2","unstructured":"Haoning Wu Zicheng Zhang Weixia Zhang Chaofeng Chen Liang Liao Chunyi Li Yixuan Gao Annan Wang Erli Zhang Wenxiu Sun et al. 2023. Q-align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv:2312.17090. Retrieved from https:\/\/arxiv.org\/abs\/2312.17090"},{"key":"e_1_3_2_91_2","first-page":"7623","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Wu Jay Zhangjie","year":"2023","unstructured":"Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2023. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 7623\u20137633."},{"key":"e_1_3_2_92_2","unstructured":"Xiaoshi Wu Yiming Hao Keqiang Sun Yixiong Chen Feng Zhu Rui Zhao and Hongsheng Li. 2023. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv:2306.09341. Retrieved from https:\/\/arxiv.org\/abs\/2306.09341"},{"key":"e_1_3_2_93_2","first-page":"2096","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Wu Xiaoshi","year":"2023","unstructured":"Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. 2023. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 2096\u20132105."},{"key":"e_1_3_2_94_2","unstructured":"Jiazheng Xu Xiao Liu Yuchen Wu Yuxuan Tong Qinkai Li Min Ding Jie Tang and Yuxiao Dong. 2023. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_95_2","first-page":"14019","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Ying Zhenqiang","year":"2021","unstructured":"Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. 2021. Patch-VQ: \u201cPatching up\u201d the video quality problem. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14019\u201314029."},{"key":"e_1_3_2_96_2","first-page":"2045","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Young Peter","year":"2014","unstructured":"Peter Young, Devendra Hazarika, Soujanya Poria, and Erik Cambria. 2014. Image captioning and visual question answering based on deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2045\u20132054."},{"key":"e_1_3_2_97_2","first-page":"1473","volume-title":"Proceedings of the 2012 19th IEEE International Conference on Image Processing","author":"Zhang Lin","year":"2012","unstructured":"Lin Zhang and Hongyu Li. 2012. SR-SIM: A fast and high performance IQA index based on spectral residual. In Proceedings of the 2012 19th IEEE International Conference on Image Processing. IEEE, 1473\u20131476."},{"key":"e_1_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3061932"},{"key":"e_1_3_2_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01352"},{"key":"e_1_3_2_100_2","unstructured":"Zhichao Zhang Wei Sun Xinyue Li Yunhao Li Qihang Ge Jun Jia Zicheng Zhang Zhongpeng Ji Fengyu Sun Shangling Jui et al. 2024. Human-activity AGV quality assessment: A benchmark dataset and an objective evaluation metric. arXiv:2411.16619. Retrieved from https:\/\/arxiv.org\/abs\/2411.16619"},{"key":"e_1_3_2_101_2","unstructured":"Zhiwei Zhong Wen-Ting Hsu He Xu Tsung-Yi Lee Yung-Hsiang Chou Jan-Yu Lee Yi Yu Zhe Yang Chen Sun Anelia Angelova et al. 2021. WIT: Web-image text pretraining for cross-modal vision-language understanding. arXiv:2102.05246. 
Retrieved from https:\/\/arxiv.org\/abs\/2102.05246"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3749844","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:44:32Z","timestamp":1757619872000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3749844"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,11]]},"references-count":100,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3749844"],"URL":"https:\/\/doi.org\/10.1145\/3749844","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,11]]},"assertion":[{"value":"2024-12-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
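
The record above is a standard Crossref REST API "work" message. As a minimal sketch of how such a record can be fetched and post-processed, the following Python snippet pulls the same work by its DOI and extracts the title, author list, abstract, and deposited reference count. It assumes the public Crossref endpoint https://api.crossref.org/works/{doi} and the third-party requests package; all field names ("title", "author", "abstract", "references-count") are taken directly from the record itself.

import re
import requests

DOI = "10.1145/3749844"  # the work described in this record

def fetch_work(doi: str) -> dict:
    """Fetch a Crossref work record and return its 'message' payload."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    body = resp.json()
    # Crossref wraps single works as {"status": "ok", "message-type": "work", ...}
    assert body["status"] == "ok" and body["message-type"] == "work"
    return body["message"]

def strip_jats(markup: str) -> str:
    """Drop the JATS/XML tags embedded in the abstract and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", markup)).strip()

work = fetch_work(DOI)
print(work["title"][0])                       # "title" is a list of strings
for a in work.get("author", []):              # each author has given/family parts
    print(f'  {a.get("given", "")} {a.get("family", "")}'.rstrip())
print(strip_jats(work.get("abstract", ""))[:200], "...")
print("references deposited:", work["references-count"])

Note that JATS conventions such as the per-letter italics spelling out "LGVQ" and "UGVQ" in the abstract survive tag stripping as plain letters, which is why the whitespace-collapsing pass matters.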