{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:48:33Z","timestamp":1775580513197,"version":"3.50.1"},"reference-count":262,"publisher":"Association for Computing Machinery (ACM)","issue":"7","funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2021YFE0206700"],"award-info":[{"award-number":["2021YFE0206700"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62301310, 623B2073"],"award-info":[{"award-number":["62301310, 623B2073"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Sichuan Science and Technology Program","award":["2024NSFSC1426"],"award-info":[{"award-number":["2024NSFSC1426"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,7,31]]},"abstract":"<jats:p>Quality assessment, which evaluates the visual quality level of multimedia experiences, has garnered significant attention from researchers and has evolved substantially through dedicated efforts. Before the advent of large models, quality assessment typically relied on small expert models tailored for specific tasks. While these smaller models are effective at handling their designated tasks and predicting quality levels, they often lack explainability and robustness. With the advancement of large models, which align more closely with human cognitive and perceptual processes, many researchers are now leveraging the prior knowledge embedded in these large models for quality assessment tasks. This emergence of quality assessment within the context of large models motivates us to provide a comprehensive review focusing on two key aspects: (1) the assessment of large models and (2) the role of large models in assessment tasks. We begin by reflecting on the historical development of quality assessment. Subsequently, we move to detailed discussions of related works concerning quality assessment in the era of large models. Finally, we offer insights into the future progression and potential pathways for quality assessment in this new era. 
We hope that this survey will enable a rapid understanding of the development of quality assessment in the era of large models and inspire further advancements in the field.<\/jats:p>","DOI":"10.1145\/3722559","type":"journal-article","created":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T13:04:47Z","timestamp":1741698287000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Quality Assessment in the Era of Large Models: A Survey"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7247-7938","authenticated-orcid":false,"given":"Zicheng","family":"Zhang","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3915-8257","authenticated-orcid":false,"given":"Yingjie","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0634-1710","authenticated-orcid":false,"given":"Chunyi","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8929-8322","authenticated-orcid":false,"given":"Baixuan","family":"Zhao","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6377-4730","authenticated-orcid":false,"given":"Xiaohong","family":"Liu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8165-9322","authenticated-orcid":false,"given":"Guangtao","family":"Zhai","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,7,19]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Lorenzo Agnolucci Leonardo Galteri and Marco Bertini. 2024. Quality-aware image-text alignment for real-world image quality assessment. arXiv:2403.11176. Retrieved from https:\/\/arxiv.org\/abs\/2403.11176"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00904"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3355642"},{"key":"e_1_3_1_5_2","first-page":"282","volume-title":"Applications of Digital Image Processing XL","volume":"10396","author":"Alexiou Evangelos","year":"2017","unstructured":"Evangelos Alexiou and Touradj Ebrahimi. 2017. On the performance of metrics to predict quality in point cloud representations. In Applications of Digital Image Processing XL 10396. SPIE, 282\u2013297."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2018.8486512"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/QoMEX.2018.8463406"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMSP.2017.8122237"},{"key":"e_1_3_1_9_2","unstructured":"Anas Awadalla Irena Gao Josh Gardner Jack Hessel Yusuf Hanafy Wanrong Zhu Kalyani Marathe Yonatan Bitton Samir Gadre Shiori Sagawa et\u00a0al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv:2308.01390. Retrieved from https:\/\/arxiv.org\/abs\/2308.01390"},{"key":"e_1_3_1_10_2","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2023. 
Qwen-VL: A versatile vision-language model for understanding localization text reading and beyond. arXiv:2308.12966. Retrieved from https:\/\/arxiv.org\/abs\/2308.12966"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2017.2726542"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3073294"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2729891"},{"key":"e_1_3_1_14_2","unstructured":"Rohan Bavishi Erich Elsen Curtis Hawthorne Maxwell Nye Augustus Odena Arushi Somani and Sa\u011fnak Ta\u015f\u0131rlar. 2023. Introducing our Multimodal Models. Retrieved from https:\/\/www.adept.ai\/blog\/fuyu-8b"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11760-017-1166-8"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02161"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2011.2166245"},{"key":"e_1_3_1_18_2","unstructured":"Tim Brooks Bill Peebles Connor Holmes Will DePue Yufei Guo Li Jing David Schnurr Joe Taylor Troy Luhman Eric Luhman et\u00a0al. 2024. Video Generation Models as World Simulators. Retrieved from https:\/\/openai.com\/research\/video-generation-models-as-world-simulators"},{"key":"e_1_3_1_19_2","unstructured":"Rizhao Cai Zirui Song Dayan Guan Zhenhao Chen Xing Luo Chenyu Yi and Alex Kot. 2023. BenchLMM: Benchmarking cross-style visual capability of large multimodal models. arXiv:2312.02896. Retrieved from https:\/\/arxiv.org\/abs\/2312.02896"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13042-022-01611-w"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00667"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1002\/cav.2105"},{"key":"e_1_3_1_24_2","first-page":"3514","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Chang Kuang-Yu","year":"2017","unstructured":"Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen. 2017. Aesthetic critiques generation for photos. In Proceedings of the IEEE International Conference on Computer Vision, 3514\u20133523."},{"key":"e_1_3_1_25_2","unstructured":"Keqin Chen Zhao Zhang Weili Zeng Richong Zhang Feng Zhu and Rui Zhao. 2023. Shikra: Unleashing multimodal LLM\u2019s referential dialogue magic. arXiv:2306.15195. Retrieved from https:\/\/arxiv.org\/abs\/2306.15195"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.displa.2023.102540"},{"key":"e_1_3_1_27_2","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Dollar and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325. Retrieved from https:\/\/arxiv.org\/abs\/1504.00325"},{"key":"e_1_3_1_28_2","unstructured":"Yixiong Chen Li Liu and Chris Ding. 2023. X-iqe: Explainable image quality evaluation for text-to-image generation with visual large language models. arXiv:2305.10843. Retrieved from https:\/\/arxiv.org\/abs\/2305.10843"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","unstructured":"Zewen Chen Haina Qin Juan Wang Chunfeng Yuan Bing Li Weiming Hu and Liang Wang. 2024. PromptIQA: Boosting the performance and generalization for no-reference image quality assessment via prompts. arXiv:2403.04993. 
Retrieved from https:\/\/arxiv.org\/abs\/2403.04993","DOI":"10.1007\/978-3-031-73232-4_14"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Zijian Chen Wei Sun Yuan Tian Jun Jia Zicheng Zhang Jiarui Wang Ru Huang Xiongkuo Min Guangtao Zhai and Wenjun Zhang. 2024. GAIA: Rethinking action quality assessment for AI-generated videos. arXiv:2406.06087. Retrieved from https:\/\/arxiv.org\/abs\/2406.06087","DOI":"10.52202\/079017-1267"},{"key":"e_1_3_1_31_2","unstructured":"Zijian Chen Wei Sun Haoning Wu Zicheng Zhang Jun Jia Xiongkuo Min Guangtao Zhai and Wenjun Zhang. 2023. Exploring the naturalness of AI-generated images. arXiv:2312.05476. Retrieved from https:\/\/arxiv.org\/abs\/2312.05476"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00054"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2011.2104671"},{"key":"e_1_3_1_34_2","unstructured":"Iya Chivileva Philip Lynch Tomas E. Ward and Alan F. Smeaton. 2023. Measuring the quality of text-to-video model outputs: Metrics and dataset. arXiv:2309.08009. Retrieved from https:\/\/arxiv.org\/abs\/2309.08009"},{"key":"e_1_3_1_35_2","unstructured":"Hyung Won Chung Le Hou Shayne Longpre Barret Zoph Yi Tay William Fedus Eric Li Xuezhi Wang Mostafa Dehghani Siddhartha Brahma et\u00a0al. 2022. Scaling instruction-finetuned language models. arXiv:2210.11416. Retrieved from https:\/\/arxiv.org\/abs\/2210.11416"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00591"},{"key":"e_1_3_1_37_2","unstructured":"Wenliang Dai Junnan Li Dongxu Li Anthony Meng Huat Tiong Junqi Zhao Weisheng Wang Boyang Li Pascale Fung and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500. Retrieved from https:\/\/arxiv.org\/abs\/2305.06500"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.proeng.2013.09.086"},{"key":"e_1_3_1_39_2","first-page":"16890","article-title":"Cogview2: Faster and better text-to-image generation via hierarchical transformers","volume":"35","author":"Ding Ming","year":"2022","unstructured":"Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems 35 (2022), 16890\u201316902.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP40778.2020.9190956"},{"key":"e_1_3_1_41_2","unstructured":"Qingxiu Dong Lei Li Damai Dai Ce Zheng Zhiyong Wu Baobao Chang Xu Sun Jingjing Xu and Zhifang Sui. 2022. A survey on in-context learning. arXiv:2301.00234. 
Retrieved from https:\/\/arxiv.org\/abs\/2301.00234"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611923"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-14442-9_57"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.26"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.21227\/hp84-8m05"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2018.2822870"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMSP55362.2022.9949359"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00373"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1997.1504"},{"key":"e_1_3_1_50_2","unstructured":"Chaoyou Fu Peixian Chen Yunhang Shen Yulei Qin Mengdan Zhang Xu Lin Zhenyu Qiu Wei Lin Jinrui Yang Xiawu Zheng et\u00a0al. 2023. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394. Retrieved from https:\/\/arxiv.org\/abs\/2306.13394"},{"key":"e_1_3_1_51_2","unstructured":"Chaoyou Fu Yuhan Dai Yondong Luo Lei Li Shuhuai Ren Renrui Zhang Zihan Wang Chenyu Zhou Yunhang Shen Mengdan Zhang et\u00a0al. 2024. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv:2405.21075. Retrieved from https:\/\/arxiv.org\/abs\/2405.21075"},{"key":"e_1_3_1_52_2","unstructured":"Peng Gao Jiaming Han Renrui Zhang Ziyi Lin Shijie Geng Aojun Zhou Wei Zhang Pan Lu Conghui He Xiangyu Yue et\u00a0al. 2023. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv:2304.15010. Retrieved from https:\/\/arxiv.org\/abs\/2304.15010"},{"key":"e_1_3_1_53_2","unstructured":"Wentao Ge Shunian Chen Guiming Chen Junying Chen Zhihong Chen Shuo Yan Chenghao Zhu Ziyue Lin Wenya Xie Xidong Wang et\u00a0al. 2023. Mllm-bench evaluating multi-modal llms using gpt-4v. arXiv:2311.13951. Retrieved from https:\/\/arxiv.org\/abs\/2311.13951"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2707479"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_1_56_2","unstructured":"Jiaxi Gu Shicong Wang Haoyu Zhao Tianyi Lu Xing Zhang Zuxuan Wu Songcen Xu Wei Zhang Yu-Gang Jiang and Hang Xu. 2023. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv:2309.03549. Retrieved from https:\/\/arxiv.org\/abs\/2309.03549"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2014.2344471"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01363"},{"key":"e_1_3_1_59_2","unstructured":"Yuwei Guo Ceyuan Yang Anyi Rao Yaohui Wang Yu Qiao Dahua Lin and Bo Dai. 2023. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725. Retrieved from https:\/\/arxiv.org\/abs\/2307.04725"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01996"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/132"},{"key":"e_1_3_1_62_2","unstructured":"Yingqing He Tianyu Yang Yong Zhang Ying Shan and Qifeng Chen. 2022. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv:2211.13221. Retrieved from https:\/\/arxiv.org\/abs\/2211.13221"},{"key":"e_1_3_1_63_2","unstructured":"David Holz. 2023. Midjourney. 
Retrieved from https:\/\/www.midjourney.com"},{"key":"e_1_3_1_64_2","unstructured":"Wenyi Hong Ming Ding Wendi Zheng Xinghan Liu and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv:2205.15868. Retrieved from https:\/\/arxiv.org\/abs\/2205.15868"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/QoMEX.2017.7965673"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2967829"},{"key":"e_1_3_1_67_2","first-page":"4572","article-title":"Towards transparent deep image aesthetics assessment with tag-based content descriptors","author":"Hou Jingwen","year":"2023","unstructured":"Jingwen Hou, Weisi Lin, Yuming Fang, Haoning Wu, Chaofeng Chen, Liang Liao, and Weide Liu. 2023. Towards transparent deep image aesthetics assessment with tag-based content descriptors. IEEE Transactions on Image Processing 32 (2023), 4572\u20134584.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_68_2","doi-asserted-by":"crossref","unstructured":"Yipo Huang Xiangfei Sheng Zhichao Yang Quan Yuan Zhichao Duan Pengfei Chen Leida Li Weisi Lin and Guangming Shi. 2024. AesExpert: Towards multi-modality foundation model for image aesthetics perception. arXiv:2404.09624. Retrieved from https:\/\/arxiv.org\/abs\/2404.09624","DOI":"10.1145\/3664647.3680649"},{"key":"e_1_3_1_69_2","unstructured":"Yipo Huang Quan Yuan Xiangfei Sheng Zhichao Yang Haoning Wu Pengfei Chen Yuzhe Yang Leida Li and Weisi Lin. 2024. AesBench: An expert benchmark for multimodal large language models on image aesthetics perception. arXiv:2401.08276. Retrieved from https:\/\/arxiv.org\/abs\/2401.08276"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02060"},{"key":"e_1_3_1_71_2","unstructured":"Zhipeng Huang Zhizheng Zhang Yiting Lu Zheng-Jun Zha Zhibo Chen and Baining Guo. 2024. VisualCritic: Making LMMs perceive visual quality like humans. arXiv:2403.12806. Retrieved from https:\/\/arxiv.org\/abs\/2403.12806"},{"key":"e_1_3_1_72_2","unstructured":"Huggingface. 2023. Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model. Retrieved from https:\/\/huggingface.co\/blog\/idefics"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3037481"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2022.3207152"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11286"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350970"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15561-1_8"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2016.05.004"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2015.7351067"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00968"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01462"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2898732"},{"key":"e_1_3_1_83_2","article-title":"Pick-a-pic: An open dataset of user preferences for text-to-image generation","volume":"36","author":"Kirstain Yuval","year":"2024","unstructured":"Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. 2024. Pick-a-pic: An open dataset of user preferences for text-to-image generation. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_40"},{"key":"e_1_3_1_85_2","unstructured":"Tengchuan Kou Xiaohong Liu Zicheng Zhang Chunyi Li Haoning Wu Xiongkuo Min Guangtao Zhai and Ning Liu. 2024. Subjective-aligned dataset and metric for text-to-video quality assessment. arXiv:2403.11956. Retrieved from https:\/\/arxiv.org\/abs\/2403.11956"},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2960656"},{"key":"e_1_3_1_87_2","unstructured":"Bohao Li Rui Wang Guangzhi Wang Yuying Ge Yixiao Ge and Ying Shan. 2023. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv:2307.16125. Retrieved from https:\/\/arxiv.org\/abs\/2307.16125"},{"key":"e_1_3_1_88_2","unstructured":"Bo Li Peiyuan Zhang Jingkang Yang Yuanhan Zhang Fanyi Pu and Ziwei Liu. 2023. Otterhd: A high-resolution multi-modality model. arXiv:2311.04219. Retrieved from https:\/\/arxiv.org\/abs\/2311.04219"},{"key":"e_1_3_1_89_2","unstructured":"Bo Li Yuanhan Zhang Liangyu Chen Jinghao Wang Jingkang Yang and Ziwei Liu. 2023. Otter: A multi-modal model with in-context instruction tuning. arXiv:2305.03726. Retrieved from https:\/\/arxiv.org\/abs\/2305.03726"},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2010.5651833"},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00636"},{"key":"e_1_3_1_92_2","unstructured":"Chunyi Li Haoning Wu Zicheng Zhang Hongkun Hao Kaiwei Zhang Lei Bai Xiaohong Liu Xiongkuo Min Weisi Lin and Guangtao Zhai. 2024. Q-Refine: A perceptual quality refiner for AI-generated image. arXiv:2401.01117. Retrieved from https:\/\/arxiv.org\/abs\/2401.01117"},{"key":"e_1_3_1_93_2","unstructured":"Chunyi Li Xiele Wu Haoning Wu Donghui Feng Zicheng Zhang Guo Lu Xiongkuo Min Xiaohong Liu Guangtao Zhai and Weisi Lin. 2024. CMC-Bench: Towards a new paradigm of visual signal compression. arXiv:2406.09356. Retrieved from https:\/\/arxiv.org\/abs\/2406.09356"},{"key":"e_1_3_1_94_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3319020"},{"key":"e_1_3_1_95_2","first-page":"19730","volume-title":"International Conference on Machine Learning","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning. PMLR, 19730\u201319742."},{"key":"e_1_3_1_96_2","first-page":"12888","volume-title":"International Conference on Machine Learning","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888\u201312900."},{"key":"e_1_3_1_97_2","volume-title":"The 12th International Conference on Learning Representations","author":"Li Juncheng","year":"2023","unstructured":"Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. 2023. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. 
In The 12th International Conference on Learning Representations."},{"key":"e_1_3_1_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02095"},{"key":"e_1_3_1_99_2","unstructured":"Mingsheng Li Xin Chen Chi Zhang Sijin Chen Hongyuan Zhu Fukun Yin Gang Yu and Tao Chen. 2023. M3DBench: Let\u2019s instruct large models with multi-modal 3D prompts. arXiv:2312.10763. Retrieved from https:\/\/arxiv.org\/abs\/2312.10763"},{"key":"e_1_3_1_100_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00643"},{"key":"e_1_3_1_101_2","doi-asserted-by":"publisher","DOI":"10.1109\/QoMEX.2019.8743252"},{"key":"e_1_3_1_102_2","unstructured":"Zhiqiu Lin Deepak Pathak Baiqi Li Jiayao Li Xide Xia Graham Neubig Pengchuan Zhang and Deva Ramanan. 2024. Evaluating text-to-visual generation with image-to-text generation. arXiv:2404.01291. Retrieved from https:\/\/arxiv.org\/abs\/2404.01291"},{"key":"e_1_3_1_103_2","volume-title":"The 12th International Conference on Learning Representations","author":"Liu Fuxiao","year":"2023","unstructured":"Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2023. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The 12th International Conference on Learning Representations."},{"key":"e_1_3_1_104_2","unstructured":"Haotian Liu Chunyuan Li Yuheng Li and Yong Jae Lee. 2023. Improved Baselines with Visual Instruction Tuning."},{"key":"e_1_3_1_105_2","unstructured":"Haotian Liu Chunyuan Li Yuheng Li Bo Li Yuanhan Zhang Sheng Shen and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning OCR and world knowledge. Retrieved from https:\/\/llava-vl.github.io\/blog\/2024-01-30-llava-next\/"},{"key":"e_1_3_1_106_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual Instruction Tuning. arXiv:2304.08485. Retrieved from https:\/\/arxiv.org\/abs\/2304.08485"},{"key":"e_1_3_1_107_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2015.10.005"},{"key":"e_1_3_1_108_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2022.3167151"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3096060"},{"key":"e_1_3_1_110_2","unstructured":"Xiaohong Liu Xiongkuo Min Guangtao Zhai Chunyi Li Tengchuan Kou Wei Sun Haoning Wu Yixuan Gao Yuqin Cao Zicheng Zhang et\u00a0al. 2024. NTIRE 2024 quality assessment of AI-generated content challenge. arXiv:2404.16687. Retrieved from https:\/\/arxiv.org\/abs\/2404.16687"},{"key":"e_1_3_1_111_2","unstructured":"Yaofang Liu Xiaodong Cun Xuebo Liu Xintao Wang Yong Zhang Haoxin Chen Yang Liu Tieyong Zeng Raymond Chan and Ying Shan. 2023. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv:2310.11440. Retrieved from https:\/\/arxiv.org\/abs\/2310.11440"},{"key":"e_1_3_1_112_2","unstructured":"Yuan Liu Haodong Duan Yuanhan Zhang Bo Li Songyang Zhang Wangbo Zhao Yike Yuan Jiaqi Wang Conghui He Ziwei Liu et\u00a0al. 2023. MMBench: Is your multi-modal model an all-around player? arXiv:2307.06281. Retrieved from https:\/\/arxiv.org\/abs\/2307.06281"},{"key":"e_1_3_1_113_2","article-title":"Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation","volume":"36","author":"Liu Yuanxin","year":"2024","unstructured":"Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. 2024. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_114_2","doi-asserted-by":"publisher","DOI":"10.1145\/3550274"},{"key":"e_1_3_1_115_2","unstructured":"Pan Lu Hritik Bansal Tony Xia Jiacheng Liu Chunyuan Li Hannaneh Hajishirzi Hao Cheng Kai-Wei Chang Michel Galley and Jianfeng Gao. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv:2310.02255. Retrieved from https:\/\/arxiv.org\/abs\/2310.02255"},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2022.3221689"},{"key":"e_1_3_1_117_2","first-page":"164","volume-title":"International Forum on Digital TV and Wireless Multimedia Communications","author":"Lu Wei","year":"2021","unstructured":"Wei Lu, Wei Sun, Wenhan Zhu, Xiongkuo Min, Zicheng Zhang, Tao Wang, and Guangtao Zhai. 2021. A cnn-based quality assessment method for pseudo 4k contents. In International Forum on Digital TV and Wireless Multimedia Communications. Springer, 164\u2013176."},{"key":"e_1_3_1_118_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126498"},{"key":"e_1_3_1_119_2","first-page":"10209","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Luo Zhengxiong","year":"2023","unstructured":"Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. VideoFusion: Decomposed diffusion models for High-Quality video generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 10209\u201310218."},{"key":"e_1_3_1_120_2","unstructured":"Elman Mansimov Emilio Parisotto Jimmy Lei Ba and Ruslan Salakhutdinov. 2015. Generating images from captions with attention. arXiv:1511.02793. Retrieved from https:\/\/arxiv.org\/abs\/1511.02793"},{"key":"e_1_3_1_121_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00331"},{"key":"e_1_3_1_122_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2735192"},{"key":"e_1_3_1_123_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2018.2868771"},{"key":"e_1_3_1_124_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2902097"},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2012.2227726"},{"key":"e_1_3_1_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCE-Berlin58801.2023.10375662"},{"key":"e_1_3_1_127_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3402729"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2009.2015374"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2010.2043888"},{"key":"e_1_3_1_130_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6247954"},{"key":"e_1_3_1_131_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592786"},{"key":"e_1_3_1_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2020.3036153"},{"key":"e_1_3_1_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2718185"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2009.2014806"},{"key":"e_1_3_1_135_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2562513"},{"key":"e_1_3_1_136_2","unstructured":"OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_1_137_2","unstructured":"Wensheng Pan Timin Gao Yan Zhang Runze Hu Xiawu Zheng Enwei Zhang Yuting Gao Yutao Liu Yunhang Shen Ke Li et\u00a0al. 2024. 
Multi-modal prompt learning on blind image quality assessment. arXiv:2404.14949. Retrieved from https:\/\/arxiv.org\/abs\/2404.14949"},{"key":"e_1_3_1_138_2","first-page":"396","volume-title":"Chinese Conference on Pattern Recognition and Computer Vision (PRCV)","author":"Pan Wensheng","year":"2023","unstructured":"Wensheng Pan, Zhifu Yang, DingMing Liu, Chenxin Fang, Yan Zhang, and Pingyang Dai. 2023. Quality-aware CLIP for blind image quality assessment. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 396\u2013408."},{"key":"e_1_3_1_139_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311\u2013318."},{"key":"e_1_3_1_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2017.139"},{"key":"e_1_3_1_141_2","unstructured":"Zhiliang Peng Wenhui Wang Li Dong Yaru Hao Shaohan Huang Shuming Ma and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv: 2306.14824. Retrieved from https:\/\/arxiv.org\/abs\/2306.14824"},{"key":"e_1_3_1_142_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2014.10.009"},{"issue":"4","key":"e_1_3_1_143_2","first-page":"30","article-title":"TID2008-a database for evaluation of full-reference visual quality assessment metrics","volume":"10","author":"Ponomarenko Nikolay","year":"2009","unstructured":"Nikolay Ponomarenko, Vladimir Lukin, Alexander Zelensky, Karen Egiazarian, Marco Carli, and Federica Battisti. 2009. TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10, 4 (2009), 30\u201345.","journal-title":"Advances of Modern Radioelectronics"},{"key":"e_1_3_1_144_2","first-page":"8748","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et\u00a0al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_145_2","unstructured":"Aditya Ramesh Prafulla Dhariwal Alex Nichol Casey Chu and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125. Retrieved from https:\/\/arxiv.org\/abs\/2204.06125"},{"key":"e_1_3_1_146_2","first-page":"1060","volume-title":"International Conference on Machine Learning","author":"Reed Scott","year":"2016","unstructured":"Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 1060\u20131069."},{"key":"e_1_3_1_147_2","first-page":"27","volume-title":"Human Vision and Electronic Imaging XX","volume":"9394","author":"Rehman Abdul","year":"2015","unstructured":"Abdul Rehman, Kai Zeng, and Zhou Wang. 2015. Display device-adapted video quality-of-experience assessment. In Human Vision and Electronic Imaging XX, Vol. 9394. 
SPIE, 27\u201337."},{"key":"e_1_3_1_148_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_1_149_2","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et\u00a0al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479\u201336494.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_150_2","first-page":"2234","article-title":"Improved techniques for training gans","volume":"29","author":"Salimans Tim","year":"2016","unstructured":"Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in Neural Information Processing Systems 29 (2016), 2234\u20132242.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_151_2","doi-asserted-by":"publisher","DOI":"10.1145\/3507901"},{"key":"e_1_3_1_152_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2007.366046"},{"key":"e_1_3_1_153_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2010.2042111"},{"key":"e_1_3_1_154_2","unstructured":"Wenqi Shao Yutao Hu Peng Gao Meng Lei Kaipeng Zhang Fanqing Meng Peng Xu Siyuan Huang Hongsheng Li Yu Qiao et\u00a0al. 2023. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv:2308.03729. Retrieved from https:\/\/arxiv.org\/abs\/2308.03729"},{"key":"e_1_3_1_155_2","unstructured":"H. Sheikh. 2005. LIVE Image Quality Assessment Database Release 2. Retrieved from http:\/\/live.ece.utexas.edu\/research\/quality"},{"key":"e_1_3_1_156_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2006.881959"},{"key":"e_1_3_1_157_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611969"},{"key":"e_1_3_1_158_2","unstructured":"Seongjin Shin Sang-Woo Lee Hwijeen Ahn Sungdong Kim HyoungSeok Kim Boseop Kim Kyunghyun Cho Gichang Lee Woomyoung Park Jung-Woo Ha et\u00a0al. 2022. On the effect of pretraining corpora on in-context learning by a large-scale language model. arXiv:2204.13509. Retrieved from https:\/\/arxiv.org\/abs\/2204.13509"},{"key":"e_1_3_1_159_2","unstructured":"Uriel Singer Adam Polyak Thomas Hayes Xi Yin Jie An Songyang Zhang Qiyuan Hu Harry Yang Oron Ashual Oran Gafni et\u00a0al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792. Retrieved from https:\/\/arxiv.org\/abs\/2209.14792"},{"key":"e_1_3_1_160_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2869673"},{"key":"e_1_3_1_161_2","unstructured":"SkunkworksAI. 2024. BakLLaVA. Retrieved from https:\/\/github.com\/SkunkworksAI\/BakLLaVA"},{"key":"e_1_3_1_162_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2019.8803298"},{"key":"e_1_3_1_163_2","doi-asserted-by":"publisher","DOI":"10.1145\/2072298.2071977"},{"key":"e_1_3_1_164_2","unstructured":"Quan Sun Yufeng Cui Xiaosong Zhang Fan Zhang Qiying Yu Zhengxiong Luo Yueze Wang Yongming Rao Jingjing Liu Tiejun Huang et\u00a0al. 2023. Generative multimodal models are in-context learners. arXiv:2312.13286. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.13286"},{"key":"e_1_3_1_165_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2019.2955024"},{"key":"e_1_3_1_166_2","doi-asserted-by":"crossref","unstructured":"Wei Sun Haoning Wu Zicheng Zhang Jun Jia Zhichao Zhang Linhan Cao Qiubo Chen Xiongkuo Min Weisi Lin and Guangtao Zhai. 2024. Enhancing blind video quality assessment with rich quality-aware features. arXiv:2405.08745. Retrieved from https:\/\/arxiv.org\/abs\/2405.08745","DOI":"10.2139\/ssrn.5571157"},{"key":"e_1_3_1_167_2","doi-asserted-by":"crossref","unstructured":"Wei Sun Weixia Zhang Yanwei Jiang Haoning Wu Zicheng Zhang Jun Jia Yingjie Zhou Zhongpeng Ji Xiongkuo Min Weisi Lin et\u00a0al. 2024. Dual-branch network for portrait image quality assessment. arXiv:2405.08555.","DOI":"10.2139\/ssrn.5481622"},{"key":"e_1_3_1_168_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2016.07.033"},{"key":"e_1_3_1_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_1_170_2","unstructured":"InfiMM Team. 2024. InfiMM: Advancing Multimodal Understanding from Flamingo\u2019s Legacy through Diverse LLM Integration. Retrieved from https:\/\/huggingface.co\/Infi-MM\/"},{"key":"e_1_3_1_171_2","doi-asserted-by":"publisher","DOI":"10.1145\/2812802"},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.4324\/9781315128948-7"},{"key":"e_1_3_1_173_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_1_174_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_1_175_2","unstructured":"Kristi Tsukida and Maya R. Gupta. 2011. How to analyze paired comparison data. Technical Report. Department of Electrical Engineering University of Washington UWEETR-2011-0004 1."},{"key":"e_1_3_1_176_2","unstructured":"Thomas Unterthiner Sjoerd Van Steenkiste Karol Kurach Raphael Marinier Marcin Michalski and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric and challenges. arXiv:1812.01717. Retrieved from https:\/\/arxiv.org\/abs\/1812.01717"},{"key":"e_1_3_1_177_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_178_2","doi-asserted-by":"publisher","DOI":"10.1117\/1.JEI.23.1.013016"},{"key":"e_1_3_1_179_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2011.6116171"},{"key":"e_1_3_1_180_2","first-page":"46","volume-title":"CAAI International Conference on Artificial Intelligence","author":"Wang Jiarui","year":"2023","unstructured":"Jiarui Wang, Huiyu Duan, Jing Liu, Shi Chen, Xiongkuo Min, and Guangtao Zhai. 2023. Aigciqa2023: A large-scale image quality assessment database for AI generated images: From the perspectives of quality, authenticity and correspondence. In CAAI International Conference on Artificial Intelligence. Springer, 46\u201357."},{"key":"e_1_3_1_181_2","unstructured":"Jiuniu Wang Hangjie Yuan Dayou Chen Yingya Zhang Xiang Wang and Shiwei Zhang. 2023. Modelscope text-to-video technical report. arXiv:2308.06571. 
Retrieved from https:\/\/arxiv.org\/abs\/2308.06571"},{"key":"e_1_3_1_182_2","unstructured":"Ke Wang Junting Pan Weikang Shi Zimu Lu Mingjie Zhan and Hongsheng Li. 2024. Measuring multimodal mathematical reasoning with MATH-vision dataset. arXiv:2402.14804. Retrieved from https:\/\/arxiv.org\/abs\/2402.14804"},{"key":"e_1_3_1_183_2","doi-asserted-by":"crossref","unstructured":"Puyi Wang Wei Sun Zicheng Zhang Jun Jia Yanwei Jiang Zhichao Zhang Xiongkuo Min and Guangtao Zhai. 2024. Large multi-modality model assisted ai-generated image quality assessment. arXiv:2404.17762. Retrieved from https:\/\/arxiv.org\/abs\/2404.17762","DOI":"10.1145\/3664647.3681471"},{"key":"e_1_3_1_184_2","unstructured":"Xuezhi Wang Jason Wei Dale Schuurmans Quoc Le Ed Chi Sharan Narang Aakanksha Chowdhery and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171. Retrieved from https:\/\/arxiv.org\/abs\/2203.11171"},{"key":"e_1_3_1_185_2","unstructured":"Yaohui Wang Xinyuan Chen Xin Ma Shangchen Zhou Ziqi Huang Yi Wang Ceyuan Yang Yinan He Jiashuo Yu Peiqing Yang et\u00a0al. 2023. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv:2309.15103. Retrieved from https:\/\/arxiv.org\/abs\/2309.15103"},{"key":"e_1_3_1_186_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMSP.2019.8901772"},{"key":"e_1_3_1_187_2","doi-asserted-by":"publisher","DOI":"10.5555\/3019345"},{"key":"e_1_3_1_188_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2002.1004620"},{"key":"e_1_3_1_189_2","doi-asserted-by":"publisher","DOI":"10.1364\/JOSAA.24.000B61"},{"key":"e_1_3_1_190_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0923-5965(03)00076-6"},{"key":"e_1_3_1_191_2","unstructured":"Zijie J. Wang Evan Montoya David Munechika Haoyang Yang Benjamin Hoover and Duen Horng Chau. 2022. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. arXiv:2210.14896. Retrieved from https:\/\/arxiv.org\/abs\/2210.14896"},{"key":"e_1_3_1_192_2","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et\u00a0al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_193_2","unstructured":"Chenfei Wu Lun Huang Qianxi Zhang Binyang Li Lei Ji Fan Yang Guillermo Sapiro and Nan Duan. 2021. Godiva: Generating open-domain videos from natural descriptions. arXiv:2104.14806. Retrieved from https:\/\/arxiv.org\/abs\/2104.14806"},{"key":"e_1_3_1_194_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20068-7_31"},{"key":"e_1_3_1_195_2","unstructured":"Haoning Wu Chaofeng Chen Liang Liao Jingwen Hou Wenxiu Sun Qiong Yan Jinwei Gu and Weisi Lin. 2022. Neighbourhood representative sampling for efficient end-to-end video quality assessment. arXiv:2210.05357. Retrieved from https:\/\/arxiv.org\/abs\/2210.05357"},{"key":"e_1_3_1_196_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME55011.2023.00070"},{"key":"e_1_3_1_197_2","unstructured":"Haoning Wu Liang Liao Annan Wang Chaofeng Chen Jingwen Hou Erli Zhang Wenxiu Sun Qiong Yan and Weisi Lin. 2023. Towards robust text-prompted semantic criterion for in-the-wild video quality assessment. 
arXiv:2304.14672. Retrieved from https:\/\/arxiv.org\/abs\/2304.14672"},{"key":"e_1_3_1_198_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9747287"},{"key":"e_1_3_1_199_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01843"},{"key":"e_1_3_1_200_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611737"},{"key":"e_1_3_1_201_2","unstructured":"Haoning Wu Zicheng Zhang Erli Zhang Chaofeng Chen Liang Liao Annan Wang Chunyi Li Wenxiu Sun Qiong Yan Guangtao Zhai and Weisi Lin. 2023. Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision."},{"key":"e_1_3_1_202_2","unstructured":"Haoning Wu Zicheng Zhang Erli Zhang Chaofeng Chen Liang Liao Annan Wang Kaixin Xu Chunyi Li Jingwen Hou Guangtao Zhai et\u00a0al. 2023. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv:2311.06783. Retrieved from https:\/\/arxiv.org\/abs\/2311.06783"},{"key":"e_1_3_1_203_2","unstructured":"Haoning Wu Zicheng Zhang Weixia Zhang Chaofeng Chen Liang Liao Chunyi Li Yixuan Gao Annan Wang Erli Zhang Wenxiu Sun et\u00a0al. 2023. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv:2312.17090. Retrieved from https:\/\/arxiv.org\/abs\/2312.17090"},{"key":"e_1_3_1_204_2","unstructured":"Haoning Wu Hanwei Zhu Zicheng Zhang Erli Zhang Chaofeng Chen Liang Liao Chunyi Li Annan Wang Wenxiu Sun Qiong Yan et\u00a0al. 2024. Towards open-ended visual quality comparison. arXiv:2402.16641. Retrieved from https:\/\/arxiv.org\/abs\/2402.16641"},{"key":"e_1_3_1_205_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00701"},{"key":"e_1_3_1_206_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00200"},{"key":"e_1_3_1_207_2","article-title":"Imagereward: Learning and evaluating human preferences for text-to-image generation","volume":"36","author":"Xu Jiazheng","year":"2024","unstructured":"Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2024. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_208_2","doi-asserted-by":"publisher","DOI":"10.1109\/QoMEX.2014.6982328"},{"key":"e_1_3_1_209_2","doi-asserted-by":"crossref","unstructured":"Liu Yang Huiyu Duan Long Teng Yucheng Zhu Xiaohong Liu Menghan Hu Xiongkuo Min Guangtao Zhai and Patrick Le Callet. 2024. Aigcoiqa2024: Perceptual quality assessment of ai generated omnidirectional images. arXiv:2404.01024. Retrieved from https:\/\/arxiv.org\/abs\/2404.01024","DOI":"10.1109\/ICIP51287.2024.10647885"},{"key":"e_1_3_1_210_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3033117"},{"key":"e_1_3_1_211_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02050"},{"key":"e_1_3_1_212_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01924"},{"key":"e_1_3_1_213_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.541"},{"key":"e_1_3_1_214_2","unstructured":"Qinghao Ye Haiyang Xu Guohai Xu Jiabo Ye Ming Yan Yiyang Zhou Junyang Wang Anwen Hu Pengcheng Shi Yaya Shi et\u00a0al. 2023. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv:2304.14178. Retrieved from https:\/\/arxiv.org\/abs\/2304.14178"},{"key":"e_1_3_1_215_2","unstructured":"Qinghao Ye Haiyang Xu Jiabo Ye Ming Yan Haowei Liu Qi Qian Ji Zhang Fei Huang and Jingren Zhou. 2023. 
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv:2311.04257. Retrieved from https:\/\/arxiv.org\/abs\/2311.04257"},{"key":"e_1_3_1_216_2","article-title":"Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark","volume":"36","author":"Yin Zhenfei","year":"2024","unstructured":"Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et\u00a0al. 2024. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_217_2","first-page":"14019","volume-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ying Zhenqiang","year":"2021","unstructured":"Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. 2021. Patch-VQ: \u2018Patching Up\u2019 the video quality problem. In 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 14019\u201314029."},{"key":"e_1_3_1_218_2","unstructured":"Zhiyuan You Jinjin Gu Zheyuan Li Xin Cai Kaiwen Zhu Tianfan Xue and Chao Dong. 2024. Descriptive image quality assessment in the wild. arXiv:2405.18842. Retrieved from https:\/\/arxiv.org\/abs\/2405.18842"},{"key":"e_1_3_1_219_2","unstructured":"Zhiyuan You Zheyuan Li Jinjin Gu Zhenfei Yin Tianfan Xue and Chao Dong. 2023. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. arXiv:2312.08962. Retrieved from https:\/\/arxiv.org\/abs\/2312.08962"},{"key":"e_1_3_1_220_2","unstructured":"Jiahui Yu Yuanzhong Xu Jing Yu Koh Thang Luong Gunjan Baid Zirui Wang Vijay Vasudevan Alexander Ku Yinfei Yang Burcu Karagol Ayan et\u00a0al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789. Retrieved from https:\/\/arxiv.org\/abs\/2206.10789"},{"key":"e_1_3_1_221_2","unstructured":"Weihao Yu Zhengyuan Yang Linjie Li Jianfeng Wang Kevin Lin Zicheng Liu Xinchao Wang and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490. Retrieved from https:\/\/arxiv.org\/abs\/2308.02490"},{"key":"e_1_3_1_222_2","unstructured":"Xiang Yue Yuansheng Ni Kai Zhang Tianyu Zheng Ruoqi Liu Ge Zhang Samuel Stevens Dongfu Jiang Weiming Ren Yuxuan Sun et\u00a0al. 2023. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502. Retrieved from https:\/\/arxiv.org\/abs\/2311.16502"},{"key":"e_1_3_1_223_2","doi-asserted-by":"publisher","DOI":"10.2352\/ISSN.2470-1173.2019.10.IQSP-323"},{"key":"e_1_3_1_224_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.2004910"},{"key":"e_1_3_1_225_2","unstructured":"David Junhao Zhang Jay Zhangjie Wu Jia-Wei Liu Rui Zhao Lingmin Ran Yuchao Gu Difei Gao and Mike Zheng Shou. 2023. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv:2309.15818. Retrieved from https:\/\/arxiv.org\/abs\/2309.15818"},{"key":"e_1_3_1_226_2","unstructured":"Pan Zhang Xiaoyi Dong Bin Wang Yuhang Cao Chao Xu Linke Ouyang Zhiyuan Zhao Shuangrui Ding Songyang Zhang Haodong Duan et\u00a0al. 2023. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv:2309.15112. 
Retrieved from https:\/\/arxiv.org\/abs\/2309.15112"},{"key":"e_1_3_1_227_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01352"},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475294"},{"key":"e_1_3_1_229_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME57554.2024.10688055"},{"key":"e_1_3_1_230_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMEW59549.2023.00082"},{"key":"e_1_3_1_231_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP46576.2022.9897249"},{"key":"e_1_3_1_232_2","doi-asserted-by":"publisher","DOI":"10.1109\/VCIP53242.2021.9675389"},{"key":"e_1_3_1_233_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3186894"},{"key":"e_1_3_1_234_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMEW53276.2021.9455963"},{"key":"e_1_3_1_235_2","doi-asserted-by":"crossref","unstructured":"Zicheng Zhang Wei Sun Xiongkuo Min Quan Zhou Jun He Qiyuan Wang and Guangtao Zhai. 2022. Mm-pcqa: Multi-modal learning for no-reference point cloud quality assessment. arXiv:2209.00244. Retrieved from https:\/\/arxiv.org\/abs\/2209.00244","DOI":"10.24963\/ijcai.2023\/195"},{"key":"e_1_3_1_236_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME51207.2021.9428312"},{"key":"e_1_3_1_237_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS48785.2022.9937738"},{"issue":"4","key":"e_1_3_1_238_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3631357","article-title":"Subjective and objective quality assessment for in-the-wild computer graphics images","volume":"20","author":"Zhang Zicheng","year":"2023","unstructured":"Zicheng Zhang, Wei Sun, Tao Wang, Wei Lu, Quan Zhou, Qiyuan Wang, Xiongkuo Min, and Guangtao Zhai. 2023. Subjective and objective quality assessment for in-the-wild computer graphics images. ACM Transactions on Multimedia Computing, Communications and Applications 20, 4 (2023), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_1_239_2","doi-asserted-by":"publisher","DOI":"10.1145\/3643817"},{"key":"e_1_3_1_240_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME55011.2023.00423"},{"key":"e_1_3_1_241_2","unstructured":"Zicheng Zhang Wei Sun Yingjie Zhou Haoning Wu Chunyi Li Xiongkuo Min and Xiaohong Liu. 2023. Advancing zero-shot digital human quality assessment through text-prompted evaluation. arXiv:2307.02808. Retrieved from https:\/\/arxiv.org\/abs\/2307.02808"},{"key":"e_1_3_1_242_2","first-page":"8858","article-title":"Evaluating point cloud from moving camera videos: A no-reference metric","author":"Zhang Zicheng","year":"2023","unstructured":"Zicheng Zhang, Wei Sun, Yucheng Zhu, Xiongkuo Min, Wei Wu, Ying Chen, and Guangtao Zhai. 2023. Evaluating point cloud from moving camera videos: A no-reference metric. IEEE Transactions on Multimedia 25 (2023), 8858\u20138870.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_243_2","unstructured":"Zicheng Zhang Haoning Wu Zhongpeng Ji Chunyi Li Erli Zhang Wei Sun Xiaohong Liu Xiongkuo Min Fengyu Sun Shangling Jui et\u00a0al. 2023. Q-Boost: On visual quality assessment ability of low-level multi-modality foundation models. arXiv:2312.15300. Retrieved from https:\/\/arxiv.org\/abs\/2312.15300"},{"key":"e_1_3_1_244_2","unstructured":"Zicheng Zhang Haoning Wu Chunyi Li Yingjie Zhou Wei Sun Xiongkuo Min Zijian Chen Xiaohong Liu Weisi Lin and Guangtao Zhai. 2024. A-Bench: Are LMMs masters at evaluating AI-generated images? arXiv:2406.03070. 
Retrieved from https:\/\/arxiv.org\/abs\/2406.03070"},{"key":"e_1_3_1_245_2","unstructured":"Zicheng Zhang Haoning Wu Erli Zhang Guangtao Zhai and Weisi Lin. 2024. A benchmark for multi-modal foundation models on low-level vision: From single images to pairs. arXiv:2402.07116. Retrieved from https:\/\/arxiv.org\/abs\/2402.07116"},{"key":"e_1_3_1_246_2","doi-asserted-by":"crossref","unstructured":"Zicheng Zhang Haoning Wu Yingjie Zhou Chunyi Li Wei Sun Chaofeng Chen Xiongkuo Min Xiaohong Liu Weisi Lin and Guangtao Zhai. 2024. LMM-PCQA: Assisting point cloud quality assessment with LMM. arXiv:2404.18203. Retrieved from https:\/\/arxiv.org\/abs\/2404.18203","DOI":"10.1145\/3664647.3680946"},{"key":"e_1_3_1_247_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00174"},{"key":"e_1_3_1_248_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP48485.2024.10447636"},{"key":"e_1_3_1_249_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME55011.2023.00429"},{"key":"e_1_3_1_250_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10095347"},{"key":"e_1_3_1_251_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP49359.2023.10222061"},{"key":"e_1_3_1_252_2","unstructured":"Zicheng Zhang Yingjie Zhou Wei Sun Xiongkuo Min and Guangtao Zhai. 2023. Simple baselines for projection-based full-reference and no-reference point cloud quality assessment. arXiv:2310.17147. Retrieved from https:\/\/arxiv.org\/abs\/2310.17147"},{"key":"e_1_3_1_253_2","first-page":"1","article-title":"Quality-of-experience evaluation for digital twins in 6G network environments","author":"Zhang Zicheng","year":"2024","unstructured":"Zicheng Zhang, Yingjie Zhou, Long Teng, Wei Sun, Chunyi Li, Xiongkuo Min, Xiao-Ping Zhang, and Guangtao Zhai. 2024. Quality-of-experience evaluation for digital twins in 6G network environments. IEEE Transactions on Broadcasting 70 (2024), 1\u201311.","journal-title":"IEEE Transactions on Broadcasting"},{"key":"e_1_3_1_254_2","unstructured":"Daquan Zhou Weimin Wang Hanshu Yan Weiwei Lv Yizhe Zhu and Jiashi Feng. 2022. Magicvideo: Efficient video generation with latent diffusion models. arXiv:2211.11018. Retrieved from https:\/\/arxiv.org\/abs\/2211.11018"},{"key":"e_1_3_1_255_2","unstructured":"Junjie Zhou Yan Shu Bo Zhao Boya Wu Shitao Xiao Xi Yang Yongping Xiong Bo Zhang Tiejun Huang and Zheng Liu. 2024. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv:2406.04264. Retrieved from https:\/\/arxiv.org\/abs\/2406.04264"},{"key":"e_1_3_1_256_2","unstructured":"Xunchu Zhou Xiaohong Liu Yunlong Dong Tengchuan Kou Yixuan Gao Zicheng Zhang Chunyi Li Haoning Wu and Guangtao Zhai. 2024. Light-VQA+: A video quality assessment model for exposure correction with vision-language guidance. arXiv:2405.03333. Retrieved from https:\/\/arxiv.org\/abs\/2405.03333"},{"key":"e_1_3_1_257_2","doi-asserted-by":"crossref","unstructured":"Yingjie Zhou Zicheng Zhang Wei Sun Xiaohong Liu Xiongkuo Min Zhihua Wang Xiao-Ping Zhang and Guangtao Zhai. 2024. THQA: A perceptual quality assessment database for talking heads. arXiv:2404.09003. Retrieved from https:\/\/arxiv.org\/abs\/2404.09003","DOI":"10.1109\/ICIP51287.2024.10647507"},{"key":"e_1_3_1_258_2","first-page":"4","article-title":"Perceptual quality assessment for point clouds: A survey","volume":"21","author":"Zhou Yingjie","year":"2023","unstructured":"Yingjie Zhou, Zicheng Zhang, Wei Sun, Xiongkuo Min, and Guangtao Zhai. 2023. Perceptual quality assessment for point clouds: A survey. 
ZTE Communications 21 (2023), 4.","journal-title":"ZTE Communications"},{"key":"e_1_3_1_259_2","unstructured":"Zhaokun Zhou Qiulin Wang Bin Lin Yiwei Su Rui Chen Xin Tao Amin Zheng Li Yuan Pengfei Wan and Di Zhang. 2024. UNIAA: A unified multi-modal image aesthetic assessment baseline and benchmark. arXiv:2404.09619. Retrieved from https:\/\/arxiv.org\/abs\/2404.09619"},{"key":"e_1_3_1_260_2","unstructured":"Deyao Zhu Jun Chen Xiaoqian Shen Xiang Li and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592. Retrieved from https:\/\/arxiv.org\/abs\/2304.10592"},{"key":"e_1_3_1_261_2","doi-asserted-by":"crossref","unstructured":"Hanwei Zhu Xiangjie Sui Baoliang Chen Xuelin Liu Peilin Chen Yuming Fang and Shiqi Wang. 2024. 2AFC prompting of large multimodal models for image quality assessment. arXiv:2402.01162. Retrieved from https:\/\/arxiv.org\/abs\/2402.01162","DOI":"10.1109\/TCSVT.2024.3434999"},{"key":"e_1_3_1_262_2","unstructured":"Hanwei Zhu Haoning Wu Yixuan Li Zicheng Zhang Baoliang Chen Lingyu Zhu Yuming Fang Guangtao Zhai Weisi Lin and Shiqi Wang. 2024. Adaptive image quality assessment via teaching large multimodal model to compare. arXiv:2405.19298. Retrieved from https:\/\/arxiv.org\/abs\/2405.19298"},{"key":"e_1_3_1_263_2","first-page":"512","volume-title":"CAAI International Conference on Artificial Intelligence","author":"Zhu Xilei","year":"2023","unstructured":"Xilei Zhu, Huiyu Duan, Yuqin Cao, Yuxin Zhu, Yucheng Zhu, Jing Liu, Li Chen, Xiongkuo Min, and Guangtao Zhai. 2023. Perceptual quality assessment of omnidirectional audio-visual signals. In CAAI International Conference on Artificial Intelligence. Springer, 512\u2013525."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3722559","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T21:41:53Z","timestamp":1772228513000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722559"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,19]]},"references-count":262,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,31]]}},"alternative-id":["10.1145\/3722559"],"URL":"https:\/\/doi.org\/10.1145\/3722559","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,19]]},"assertion":[{"value":"2024-08-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
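
The record above is a standard Crossref work message (the payload sits under "message"). As a minimal, non-authoritative sketch of how such a record can be read, the snippet below fetches the same DOI from the public Crossref REST API and prints a few of the fields visible in the JSON; the endpoint URL is the well-known api.crossref.org works route, network access is assumed, and only the Python standard library is used.

```python
# Minimal sketch: fetch this Crossref work record by DOI and read a few
# of the fields shown in the JSON above. Assumes network access to the
# public Crossref REST API; standard library only.
import json
import re
from urllib.request import urlopen

DOI = "10.1145/3722559"

with urlopen(f"https://api.crossref.org/works/{DOI}") as resp:
    work = json.load(resp)["message"]  # the record payload sits under "message"

print(work["title"][0])  # "Quality Assessment in the Era of Large Models: A Survey"
print(", ".join(f"{a['given']} {a['family']}" for a in work["author"]))
print(work["container-title"][0], "| references:", work["references-count"])

# The abstract is deposited as JATS XML (<jats:p>...</jats:p>); strip the tags.
abstract = re.sub(r"<[^>]+>", " ", work.get("abstract", "")).strip()
print(abstract[:120], "...")
```

Note that "title", "container-title", and "author" are arrays in the Crossref schema, which is why the sketch indexes into them rather than treating them as plain strings.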