{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T05:44:31Z","timestamp":1777873471716,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":65,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,8,3]]},"DOI":"10.1145\/3711896.3737409","type":"proceedings-article","created":{"date-parts":[[2025,8,3]],"date-time":"2025-08-03T20:52:41Z","timestamp":1754254361000},"page":"5742-5753","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Judge Anything: MLLM as a Judge Across Any Modality"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-5174-3106","authenticated-orcid":false,"given":"Shu","family":"Pu","sequence":"first","affiliation":[{"name":"Huazhong University of Science and Technology, WuHan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6562-209X","authenticated-orcid":false,"given":"Yaochen","family":"Wang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9848-2557","authenticated-orcid":false,"given":"Dongping","family":"Chen","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7266-4842","authenticated-orcid":false,"given":"Yuhang","family":"Chen","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1046-1510","authenticated-orcid":false,"given":"Guohao","family":"Wang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0113-8637","authenticated-orcid":false,"given":"Qi","family":"Qin","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9951-3335","authenticated-orcid":false,"given":"Zhongyi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1611-8165","authenticated-orcid":false,"given":"Zhiyuan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-2199-8807","authenticated-orcid":false,"given":"Zetong","family":"Zhou","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1894-7812","authenticated-orcid":false,"given":"Shuang","family":"Gong","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-2841-7942","authenticated-orcid":false,"given":"Yi","family":"Gui","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6937-4180","authenticated-orcid":false,"given":"Yao","family":"Wan","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3491-5968","authenticated-orcid":false,"given":"Philip S.","family":"Yu","sequence":"additional","affiliation":[{"name":"University of Illinois Chicago, Chicago, Illinois, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,8,3]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"Abdelrahman Abouelenin Atabak Ashfaq 
Adam Atkinson et al. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743(2025)."},{"key":"e_1_3_2_2_2_1","volume-title":"Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890(2023).","author":"Bai Shuai","year":"2023","unstructured":"Shuai Bai, Shusheng Yang, Jinze Bai, et al., 2023. Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890(2023)."},{"key":"e_1_3_2_2_3_1","unstructured":"Yonatan Bitton Hritik Bansal Jack Hessel et al. 2023. VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models. In NeurIPS."},{"key":"e_1_3_2_2_4_1","volume-title":"Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818(2024).","author":"Cai Mu","year":"2024","unstructured":"Mu Cai, Reuben Tan, Jianrui Zhang, et al., 2024. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818(2024)."},{"key":"e_1_3_2_2_5_1","unstructured":"Dongping Chen Ruoxi Chen Shu Pu Zhaoyi Liu Yanru Wu Caixi Chen Benlin Liu Yue Huang Yao Wan Pan Zhou and Ranjay Krishna. 2025. Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment. In ICLR."},{"key":"e_1_3_2_2_6_1","unstructured":"Dongping Chen Ruoxi Chen Shilin Zhang Yinuo Liu Yaochen Wang Huichi Zhou Qihui Zhang Pan Zhou Yao Wan and Lichao Sun. 2024a. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In ICML."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"crossref","unstructured":"Honglie Chen Weidi Xie Andrea Vedaldi and Andrew Zisserman. 2020. VGGSound: A Large-scale Audio-Visual Dataset. In ICASSP.","DOI":"10.1109\/ICASSP40776.2020.9053174"},{"key":"e_1_3_2_2_8_1","volume-title":"Omnixr: Evaluating omni-modality language models on reasoning across modalities. 
In ICLR.","author":"Chen Lichang","year":"2024","unstructured":"Lichang Chen, Hexiang Hu, Mingda Zhang, et al., 2024c. Omnixr: Evaluating omni-modality language models on reasoning across modalities. In ICLR."},{"key":"e_1_3_2_2_9_1","first-page":"179","article-title":"Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering","author":"Chen Xiuyuan","year":"2024","unstructured":"Xiuyuan Chen, Yuan Lin, Yuchen Zhang, et al., 2024d. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. In ECCV. 179-195.","journal-title":"ECCV."},{"key":"e_1_3_2_2_10_1","volume-title":"Voicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196(2024).","author":"Chen Yiming","year":"2024","unstructured":"Yiming Chen, Xianghu Yue, Chen Zhang, et al., 2024 f. Voicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196(2024)."},{"key":"e_1_3_2_2_11_1","volume-title":"ICML 2024 FM-Wild Workshop.","author":"Chen Zhaorun","year":"2024","unstructured":"Zhaorun Chen, Yichao Du, Zichen Wen, et al., 2024b. MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?. In ICML 2024 FM-Wild Workshop."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Zhe Chen Weiyun Wang Hao Tian et al. 2024 e. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. arXiv preprint arXiv:2404.16821(2024).","DOI":"10.1007\/s11432-024-4231-5"},{"key":"e_1_3_2_2_13_1","first-page":"736","article-title":"Clotho: An audio captioning dataset","author":"Drossos Konstantinos","year":"2020","unstructured":"Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: An audio captioning dataset. In ICASSP. IEEE, 736-740.","journal-title":"ICASSP. 
IEEE"},{"key":"e_1_3_2_2_14_1","volume-title":"The USCF Rating System: Its Development, Theory, and Applications","author":"Elo A.E.","unstructured":"A.E. Elo. 1966. The USCF Rating System: Its Development, Theory, and Applications. United States Chess Federation."},{"key":"e_1_3_2_2_15_1","unstructured":"Patrick Esser Sumith Kulal Andreas Blattmann et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In ICML."},{"key":"e_1_3_2_2_16_1","volume-title":"Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075(2024).","author":"Fu Chaoyou","year":"2024","unstructured":"Chaoyou Fu, Yuhan Dai, Yongdong Luo, et al., 2024. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075(2024)."},{"key":"e_1_3_2_2_17_1","volume-title":"Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS.","author":"Ghosh Dhruba","year":"2023","unstructured":"Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. 2023. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS."},{"key":"e_1_3_2_2_18_1","unstructured":"Yuan Gong Hongyin Luo Alexander H Liu et al. 2023. Listen think and understand. arXiv preprint arXiv:2305.10790(2023)."},{"key":"e_1_3_2_2_19_1","unstructured":"Jiawei Gu Xuhui Jiang Zhichao Shi et al. 2024. A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594(2024)."},{"key":"e_1_3_2_2_20_1","first-page":"78723","article-title":"T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation","volume":"36","author":"Huang Kaiyi","year":"2023","unstructured":"Kaiyi Huang, Kaiyue Sun, Enze Xie, et al., 2023. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In NeurIPS, Vol. 36. 
78723-78747.","journal-title":"NeurIPS"},{"key":"e_1_3_2_2_21_1","first-page":"21807","article-title":"Vbench: Comprehensive benchmark suite for video generative models","author":"Huang Ziqi","year":"2024","unstructured":"Ziqi Huang, Yinan He, Jiashuo Yu, et al., 2024. Vbench: Comprehensive benchmark suite for video generative models. In CVPR. 21807-21818.","journal-title":"CVPR."},{"key":"e_1_3_2_2_22_1","unstructured":"Aaron Hurst Adam Lerer Adam P Goucher Adam Perelman Aditya Ramesh Aidan Clark AJ Ostrow Akila Welihinda Alan Hayes Alec Radford et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276(2024)."},{"key":"e_1_3_2_2_23_1","volume-title":"Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback. arXiv preprint arxiv:2412.15838(2024).","author":"Ji Jiaming","year":"2024","unstructured":"Jiaming Ji, Jiayi Zhou, Hantao Lou, et al., 2024. Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback. arXiv preprint arxiv:2412.15838(2024)."},{"key":"e_1_3_2_2_24_1","unstructured":"Yuhang Jia Yang Chen Jinghua Zhao et al. 2024. AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework. arXiv preprint arXiv:2409.12466(2024)."},{"key":"e_1_3_2_2_25_1","first-page":"79889","article-title":"Genai arena: An open evaluation platform for generative models","volume":"37","author":"Jiang Dongfu","year":"2025","unstructured":"Dongfu Jiang, Max Ku, Tianle Li, et al., 2025. Genai arena: An open evaluation platform for generative models. NeurIPS, Vol. 37, 79889-79908.","journal-title":"NeurIPS"},{"key":"e_1_3_2_2_26_1","volume-title":"Jameel Hassan, et al.","author":"Khattak Muhammad Uzair","year":"2024","unstructured":"Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, et al., 2024. How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs. 
arXiv preprint arXiv:2405.03690(2024)."},{"key":"e_1_3_2_2_27_1","first-page":"119","article-title":"AudioCaps","author":"Kim Chris Dongjoo","year":"2019","unstructured":"Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating Captions for Audios in The Wild. In NAACL. 119-132.","journal-title":"Generating Captions for Audios in The Wild. In NAACL."},{"key":"e_1_3_2_2_28_1","volume-title":"Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi.","author":"Lee Harrison","year":"2023","unstructured":"Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267(2023)."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.2307\/2685263"},{"key":"e_1_3_2_2_30_1","unstructured":"Dawei Li Bohan Jiang Liangjie Huang et al. 2024a. From generation to judgment: Opportunities and challenges of llm-as-a-judge. arXiv preprint arXiv:2411.16594(2024)."},{"key":"e_1_3_2_2_31_1","unstructured":"Juncheng Li Kaihang Pan Zhiqi Ge Minghe Gao Wei Ji Wenqiao Zhang Tat-Seng Chua Siliang Tang Hanwang Zhang and Yueting Zhuang. 2024c. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In ICLR."},{"key":"e_1_3_2_2_32_1","volume-title":"Lingpeng Kong, and Qi Liu.","author":"Li Lei","year":"2024","unstructured":"Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, and Qi Liu. 2024d. VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models. arXiv preprint arXiv:2411.17451(2024)."},{"key":"e_1_3_2_2_33_1","unstructured":"Shufan Li Konstantinos Kallidromitis Akash Gokul et al. 2024b. OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows. 
arXiv preprint arXiv:2412.01169(2024)."},{"key":"e_1_3_2_2_34_1","volume-title":"Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272(2024).","author":"Li Yizhi","year":"2024","unstructured":"Yizhi Li, Ge Zhang, Yinghao Ma, et al., 2024 e. Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272(2024)."},{"key":"e_1_3_2_2_35_1","volume-title":"WILDBENCH: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. In ICLR.","author":"Lin Bill Yuchen","year":"2025","unstructured":"Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, et al., 2025. WILDBENCH: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. In ICLR."},{"key":"e_1_3_2_2_36_1","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et al. 2024a. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437(2024)."},{"key":"e_1_3_2_2_37_1","volume-title":"Plumbley","author":"Liu Haohe","year":"2024","unstructured":"Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. 2024b. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE\/ACM Transactions on Audio, Speech, and Language Processing(2024)."},{"key":"e_1_3_2_2_38_1","first-page":"708","article-title":"VALOR","volume":"47","author":"Liu Jing","year":"2025","unstructured":"Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, and Jinhui Tang. 2025. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset. IEEE, Vol. 47 (2025), 708-724.","journal-title":"Vision-Audio-Language Omni-Perception Pretraining Model and Dataset. IEEE"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Yang Liu Dan Iter Yichong Xu Shuohang Wang Ruochen Xu and Chenguang Zhu. 2023. 
G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In EMNLP Houda Bouamor Juan Pino and Kalika Bali(Eds.).","DOI":"10.18653\/v1\/2023.emnlp-main.153"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Ziyang Luo Haoning Wu Dongxu Li Jing Ma Mohan Kankanhalli and Junnan Li. 2024. VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation. arXiv preprint arXiv:2411.13281(2024).","DOI":"10.1109\/CVPR52734.2025.00792"},{"key":"e_1_3_2_2_41_1","unstructured":"Yiwei Ma Jiayi Ji Ke Ye Weihuang Lin Yonghan Zheng Qiang Zhou Xiaoshuai Sun Rongrong Ji et al. 2024. I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing. In NeurIPS."},{"key":"e_1_3_2_2_42_1","unstructured":"Jinjie Ni Yifan Song Deepanway Ghosal et al. 2024. MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures. In ICLR."},{"key":"e_1_3_2_2_43_1","unstructured":"Jiajun Song Chuanhao Li Zhaopan Xu Yue Yang Ziyao Guo Hao Zhang Yuqi Lin Yefei He Lirui Zhao Shuo Liu Tianhua Li Yuxuan Xie Xiaojun Chang Yu Qiao Wenqi Shao Pengfei Zhou Xiaopeng Peng and Kaipeng Zhang. [n.d.]. GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation. In CVPR."},{"key":"e_1_3_2_2_44_1","unstructured":"Weiming Ren Huan Yang Ge Zhang Cong Wei Xinrun Du Wenhao Huang and Wenhu Chen. 2024. Consisti2v: Enhancing visual consistency for image-to-video generation. TMLR(2024)."},{"key":"e_1_3_2_2_45_1","first-page":"10684","article-title":"High-resolution image synthesis with latent diffusion models","author":"Rombach Robin","year":"2022","unstructured":"Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj\u00f6rn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. 
10684-10695.","journal-title":"CVPR."},{"key":"e_1_3_2_2_46_1","first-page":"1","article-title":"I hear your true colors: Image guided audio generation","author":"Sheffer Roy","year":"2023","unstructured":"Roy Sheffer and Yossi Adi. 2023. I hear your true colors: Image guided audio generation. In ICASSP. IEEE, 1-5.","journal-title":"ICASSP. IEEE"},{"key":"e_1_3_2_2_47_1","unstructured":"Kaiyue Sun Kaiyi Huang Xian Liu et al. 2024a. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In CVPR."},{"key":"e_1_3_2_2_48_1","unstructured":"Wenhao Sun Rong-Cheng Tu Jingyi Liao and Dacheng Tao. 2024b. Diffusion Model-Based Video Editing: A Survey. CoRR(2024)."},{"key":"e_1_3_2_2_49_1","volume-title":"Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818(2024).","author":"Team Chameleon","year":"2024","unstructured":"Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818(2024)."},{"key":"e_1_3_2_2_50_1","volume-title":"Ving Ian Lei, et al","author":"Team Gemini","year":"2024","unstructured":"Gemini Team, Petko Georgiev, Ving Ian Lei, et al., 2024a. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530(2024)."},{"key":"e_1_3_2_2_51_1","volume-title":"Aditya Srikanth Veerubhotla, et al","author":"Team LM","year":"2024","unstructured":"LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, et al., 2024b. LearnLM: Improving Gemini for Learning. arXiv preprint arXiv:2412.16429(2024)."},{"key":"e_1_3_2_2_52_1","unstructured":"Jason Wei Xuezhi Wang Dale Schuurmans et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS(2022)."},{"key":"e_1_3_2_2_53_1","volume-title":"Next-gpt: Any-to-any multimodal llm. In ICML.","author":"Wu Shengqiong","year":"2024","unstructured":"Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2024. 
Next-gpt: Any-to-any multimodal llm. In ICML."},{"key":"e_1_3_2_2_54_1","unstructured":"Xiaoshi Wu Yiming Hao Keqiang Sun et al. 2023. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341(2023)."},{"key":"e_1_3_2_2_55_1","volume-title":"Mmie: Massive multimodal interleaved comprehension benchmark for large vision-language models. arXiv preprint arXiv:2410.10139(2024).","author":"Xia Peng","year":"2024","unstructured":"Peng Xia, Siwei Han, Shi Qiu, et al., 2024. Mmie: Massive multimodal interleaved comprehension benchmark for large vision-language models. arXiv preprint arXiv:2410.10139(2024)."},{"key":"e_1_3_2_2_56_1","volume-title":"Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou.","author":"Xie Jinheng","year":"2024","unstructured":"Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. 2024. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. In ICLR."},{"key":"e_1_3_2_2_57_1","volume-title":"Llava-critic: Learning to evaluate multimodal models. arXiv preprint arXiv:2410.02712(2024).","author":"Xiong Tianyi","year":"2024","unstructured":"Tianyi Xiong, Xiyao Wang, Dong Guo, et al., 2024. Llava-critic: Learning to evaluate multimodal models. arXiv preprint arXiv:2410.02712(2024)."},{"key":"e_1_3_2_2_58_1","unstructured":"Jin Xu Zhifang Guo Jinzheng He et al. 2025. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215(2025)."},{"key":"e_1_3_2_2_59_1","volume-title":"Air-bench: Benchmarking large audio-language models via generative comprehension. In ACL.","author":"Yang Qian","year":"2024","unstructured":"Qian Yang, Jin Xu, Wenrui Liu, et al., 2024. Air-bench: Benchmarking large audio-language models via generative comprehension. 
In ACL."},{"key":"e_1_3_2_2_60_1","unstructured":"Zhuoyi Yang Jiayan Teng Wendi Zheng Ming Ding Shiyu Huang Jiazheng Xu Yuanming Yang Wenyi Hong Xiaohan Zhang Guanyu Feng et al. 2025. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. In ICLR."},{"key":"e_1_3_2_2_61_1","unstructured":"Michihiro Yasunaga Luke Zettlemoyer and Marjan Ghazvininejad. 2025. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191(2025)."},{"key":"e_1_3_2_2_62_1","volume-title":"Quantifying Biases in LLM-as-a-Judge. In The Thirteenth International Conference on Learning Representations.","author":"Ye Jiayi","year":"2025","unstructured":"Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. 2025. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_2_2_63_1","unstructured":"Weihao Yu Zhengyuan Yang Linjie Li Jianfeng Wang Kevin Lin Zicheng Liu Xinchao Wang and Lijuan Wang. [n.d.]. Mm-vet: Evaluating large multimodal models for integrated capabilities. ( [n. d.])."},{"key":"e_1_3_2_2_64_1","doi-asserted-by":"crossref","unstructured":"Lin Zhang Shentong Mo Yijing Zhang et al. 2024. Audio-Synchronized Visual Animation. In ECCV.","DOI":"10.1007\/978-3-031-72940-9_1"},{"key":"e_1_3_2_2_65_1","unstructured":"Jeffrey Zhou Tianjian Lu Swaroop Mishra et al. 2023. Instruction-following evaluation for large language models. 
arXiv preprint arXiv:2311.07911(2023)."}],"event":{"name":"KDD '25: The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining","location":"Toronto ON Canada","acronym":"KDD '25","sponsor":["SIGKDD ACM Special Interest Group on Knowledge Discovery in Data","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711896.3737409","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T17:57:14Z","timestamp":1777571834000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711896.3737409"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,3]]},"references-count":65,"alternative-id":["10.1145\/3711896.3737409","10.1145\/3711896"],"URL":"https:\/\/doi.org\/10.1145\/3711896.3737409","relation":{},"subject":[],"published":{"date-parts":[[2025,8,3]]},"assertion":[{"value":"2025-08-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}