{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,24]],"date-time":"2025-09-24T00:15:01Z","timestamp":1758672901357,"version":"3.44.0"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:p>Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure models' capabilities and underestimate their performance. Moreover, different models have different preferences for different prompts, so using the same prompt for all models causes evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces TP-Eval, a new evaluation framework with a prompt customization method that reduces evaluation bias and taps models' potential. TP-Eval rewrites the original prompts into customized prompts for each model. In particular, we propose well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.<\/jats:p>","DOI":"10.24963\/ijcai.2025\/232","type":"proceedings-article","created":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T08:10:40Z","timestamp":1758269440000},"page":"2081-2088","source":"Crossref","is-referenced-by-count":0,"title":["TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts"],"prefix":"10.24963",
"author":[{"given":"Yuxuan","family":"Xie","sequence":"first","affiliation":[{"name":"Shanghai Artificial Intelligence Laboratory"},{"name":"School of Computer Science, Shanghai Jiao Tong University"},{"name":"Zhiyuan College, Shanghai Jiao Tong University"}]},{"given":"Tianhua","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Artificial Intelligence Laboratory"},{"name":"School of Computer Science, Shanghai Jiao Tong University"}]},{"given":"Wenqi","family":"Shao","sequence":"additional","affiliation":[{"name":"Shanghai Artificial Intelligence Laboratory"}]},{"given":"Kaipeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Artificial Intelligence Laboratory"}]}],"member":"10584","event":{"number":"34","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2025","name":"Thirty-Fourth International Joint Conference on Artificial Intelligence {IJCAI-25}","start":{"date-parts":[[2025,8,16]]},"theme":"Artificial Intelligence","location":"Montreal, Canada","end":{"date-parts":[[2025,8,22]]}},"container-title":["Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence"],
"original-title":[],"deposited":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T11:33:22Z","timestamp":1758627202000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2025\/232"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2025,9]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2025\/232","relation":{},"subject":[],"published":{"date-parts":[[2025,9]]}}}