{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T00:50:55Z","timestamp":1774399855737,"version":"3.50.1"},"reference-count":26,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,12,1]],"date-time":"2024-12-01T00:00:00Z","timestamp":1733011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGIR Forum"],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:p>The first edition of the workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) took place in July 2024, co-located with the ACM SIGIR Conference 2024 in the USA (SIGIR 2024). The aim was to bring information retrieval researchers together around LLMs for evaluation in information retrieval, a topic that has gathered attention with the advancement of large language models and generative AI. Given the novelty of the topic, the workshop focused on multi-sided discussions, namely panels and poster sessions for the accepted proceedings papers.<\/jats:p>\n          <jats:p>\n            <jats:bold>Date<\/jats:bold>\n            : 18 July 2024.\n          <\/jats:p>\n          <jats:p>\n            <jats:bold>Website<\/jats:bold>\n            : https:\/\/llm4eval.github.io.\n          <\/jats:p>","DOI":"10.1145\/3722449.3722461","type":"journal-article","created":{"date-parts":[[2025,3,6]],"date-time":"2025-03-06T17:24:20Z","timestamp":1741281860000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024"],"prefix":"10.1145","volume":"58","author":[{"given":"Hossein A.","family":"Rahmani","sequence":"first","affiliation":[{"name":"University College London, London, UK"}]},{"given":"Clemencia","family":"Siro","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, The Netherlands"}]},{"given":"Mohammad","family":"Aliannejadi","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, The Netherlands"}]},{"given":"Nick","family":"Craswell","sequence":"additional","affiliation":[{"name":"Microsoft, Seattle, US"}]},{"given":"Charles L. A.","family":"Clarke","sequence":"additional","affiliation":[{"name":"University of Waterloo, Ontario, Canada"}]},{"given":"Guglielmo","family":"Faggioli","sequence":"additional","affiliation":[{"name":"University of Padua, Padua, Italy"}]},{"given":"Bhaskar","family":"Mitra","sequence":"additional","affiliation":[{"name":"Microsoft, Montr\u00e9al, Canada"}]},{"given":"Paul","family":"Thomas","sequence":"additional","affiliation":[{"name":"Microsoft, Adelaide, Australia"}]},{"given":"Emine","family":"Yilmaz","sequence":"additional","affiliation":[{"name":"University College London, London, UK"}]}],"member":"320","published-online":{"date-parts":[[2025,3,6]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Can we use large language models to fill relevance judgment holes? arXiv preprint arXiv:2405.05600","author":"Abbasiantaeb Zahra","year":"2024","unstructured":"Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi. Can we use large language models to fill relevance judgment holes? arXiv preprint arXiv:2405.05600, 2024. URL https:\/\/arxiv.org\/abs\/2405.05600."},{"key":"e_1_2_1_2_1","volume-title":"The challenges of evaluating llm applications: An analysis of automated, human, and llm-based approaches. arXiv preprint arXiv:2406.03339","author":"Abeysinghe Bhashithe","year":"2024","unstructured":"Bhashithe Abeysinghe and Ruhan Circi. The challenges of evaluating llm applications: An analysis of automated, human, and llm-based approaches. arXiv preprint arXiv:2406.03339, 2024. URL https:\/\/arxiv.org\/abs\/2406.03339."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3673791.3698431"},{"key":"e_1_2_1_4_1","volume-title":"Evaluating the retrieval component in llm-based question answering systems. arXiv preprint arXiv:2406.06458","author":"Alinejad Ashkan","year":"2024","unstructured":"Ashkan Alinejad, Krtin Kumar, and Ali Vahdat. Evaluating the retrieval component in llm-based question answering systems. arXiv preprint arXiv:2406.06458, 2024. URL https:\/\/arxiv.org\/abs\/2406.06458."},{"key":"e_1_2_1_5_1","volume-title":"A comparison of methods for evaluating generative ir. arXiv preprint arXiv:2404.04044","author":"Arabzadeh Negar","year":"2024","unstructured":"Negar Arabzadeh and Charles LA Clarke. A comparison of methods for evaluating generative ir. arXiv preprint arXiv:2404.04044, 2024. URL https:\/\/arxiv.org\/abs\/2404.04044."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591979"},{"key":"e_1_2_1_7_1","volume-title":"Text REtrieval Conference (TREC). NIST, TREC","author":"Craswell Nick","year":"2024","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. Overview of the trec 2023 deep learning track. In Text REtrieval Conference (TREC). NIST, TREC, February 2024. URL https:\/\/www.microsoft.com\/en-us\/research\/publication\/overview-of-the-trec-2023-deep-learning-track\/."},{"key":"e_1_2_1_8_1","volume-title":"Exploring large language models for relevance judgments in tetun. arXiv preprint arXiv:2406.07299","author":"de Jesus Gabriel","year":"2024","unstructured":"Gabriel de Jesus and S\u00e9rgio Nunes. Exploring large language models for relevance judgments in tetun. arXiv preprint arXiv:2406.07299, 2024. URL https:\/\/arxiv.org\/abs\/2406.07299v1."},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval","author":"Farzi Naghmeh","year":"2024","unstructured":"Naghmeh Farzi and Laura Dietz. Exam++: Llm-based answerability metrics for ir evaluation. In Proceedings of LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval, 2024. URL https:\/\/ceur-ws.org\/Vol-3752\/paper3.pdf."},{"key":"e_1_2_1_10_1","volume-title":"A novel evaluation framework for image2text generation","author":"Huang Jia-Hong","year":"2024","unstructured":"Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Alessio M. Pacces, and Evangelos Kanoulas. A novel evaluation framework for image2text generation, 2024. URL https:\/\/arxiv.org\/abs\/2408.01723."},{"key":"e_1_2_1_11_1","volume-title":"Jin Young Kim, and Juho Kim. Using llms to investigate correlations of conversational follow-up queries with user satisfaction. arXiv preprint arXiv:2407.13166","author":"Kim Hyunwoo","year":"2024","unstructured":"Hyunwoo Kim, Yoonseo Choi, Taehyun Yang, Honggu Lee, Chaneon Park, Yongju Lee, Jin Young Kim, and Juho Kim. Using llms to investigate correlations of conversational follow-up queries with user satisfaction. arXiv preprint arXiv:2407.13166, 2024. URL https:\/\/ceur-ws.org\/Vol-3752\/paper5.pdf."},{"key":"e_1_2_1_12_1","volume-title":"Selective fine-tuning on llm-labeled data may reduce reliance on human annotation: A case study using schedule-of-event table detection. arXiv preprint arXiv:2405.06093","author":"Kumar Bhawesh","year":"2024","unstructured":"Bhawesh Kumar, Jonathan Amar, Eric Yang, Nan Li, and Yugang Jia. Selective fine-tuning on llm-labeled data may reduce reliance on human annotation: A case study using schedule-of-event table detection. arXiv preprint arXiv:2405.06093, 2024. URL https:\/\/www.arxiv.org\/abs\/2405.06093."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3592032"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657846"},{"key":"e_1_2_1_15_1","volume-title":"Large language models for relevance judgment in product search. arXiv preprint arXiv:2406.00247","author":"Mehrdad Navid","year":"2024","unstructured":"Navid Mehrdad, Hrushikesh Mohapatra, Mossaab Bagdouri, Prijith Chandran, Alessandro Magnani, Xunfan Cai, Ajit Puthenputhussery, Sachin Yadav, Tony Lee, ChengXiang Zhai, et al. Large language models for relevance judgment in product search. arXiv preprint arXiv:2406.00247, 2024. URL https:\/\/arxiv.org\/abs\/2406.00247."},{"key":"e_1_2_1_16_1","volume-title":"Query performance prediction using relevance judgments generated by large language models. arXiv preprint arXiv:2404.01012","author":"Meng Chuan","year":"2024","unstructured":"Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. Query performance prediction using relevance judgments generated by large language models. arXiv preprint arXiv:2404.01012, 2024. URL https:\/\/arxiv.org\/abs\/2404.01012."},{"key":"e_1_2_1_17_1","volume-title":"Ms marco: A human generated machine reading comprehension dataset. choice, 2640: 660","author":"Nguyen Tri","year":"2016","unstructured":"Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human generated machine reading comprehension dataset. choice, 2640: 660, 2016."},{"key":"e_1_2_1_18_1","volume-title":"Reliable confidence intervals for information retrieval evaluation using generative ai. arXiv preprint arXiv:2407.02464","author":"Oosterhuis Harrie","year":"2024","unstructured":"Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, and Michael Bendersky. Reliable confidence intervals for information retrieval evaluation using generative ai. arXiv preprint arXiv:2407.02464, 2024. URL https:\/\/arxiv.org\/abs\/2407.02464."},{"key":"e_1_2_1_19_1","volume-title":"Evaluating rag-fusion with ragelo: an automated elo-based framework. arXiv preprint arXiv:2406.14783","author":"Rackauckas Zackary","year":"2024","unstructured":"Zackary Rackauckas, Arthur C\u00e2mara, and Jakub Zavrel. Evaluating rag-fusion with ragelo: an automated elo-based framework. arXiv preprint arXiv:2406.14783, 2024. URL https:\/\/arxiv.org\/abs\/2406.14783."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657942"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657992"},{"key":"e_1_2_1_22_1","volume-title":"Context does matter: Implications for crowdsourced evaluation labels in task-oriented dialogue systems. arXiv preprint arXiv:2404.09980","author":"Siro Clemencia","year":"2024","unstructured":"Clemencia Siro, Mohammad Aliannejadi, and Maarten de Rijke. Context does matter: Implications for crowdsourced evaluation labels in task-oriented dialogue systems. arXiv preprint arXiv:2404.09980, 2024. URL https:\/\/arxiv.org\/abs\/2404.09980."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657707"},{"key":"e_1_2_1_24_1","volume-title":"Dawn Lawrie, and Luca Soldaini. Followir: Evaluating and teaching information retrieval models to follow instructions. arXiv preprint arXiv:2403.15246","author":"Weller Orion","year":"2024","unstructured":"Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. Followir: Evaluating and teaching information retrieval models to follow instructions. arXiv preprint arXiv:2403.15246, 2024. URL https:\/\/arxiv.org\/abs\/2403.15246."},{"key":"e_1_2_1_25_1","volume-title":"Toward automatic relevance judgment using vision-language models for image-text retrieval evaluation","author":"Yang Jheng-Hong","year":"2024","unstructured":"Jheng-Hong Yang and Jimmy Lin. Toward automatic relevance judgment using vision-language models for image-text retrieval evaluation, 2024. URL https:\/\/arxiv.org\/abs\/2408.01363."},{"key":"e_1_2_1_26_1","volume-title":"Towards fine-grained citation evaluation in generated text: A comparative analysis of faithfulness metrics","author":"Zhang Weijia","year":"2024","unstructured":"Weijia Zhang, Mohammad Aliannejadi, Yifei Yuan, Jiahuan Pei, Jia-Hong Huang, and Evangelos Kanoulas. Towards fine-grained citation evaluation in generated text: A comparative analysis of faithfulness metrics, 2024. URL https:\/\/arxiv.org\/abs\/2406.15264."}],"container-title":["ACM SIGIR Forum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722449.3722461","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3722449.3722461","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:57:10Z","timestamp":1750298230000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722449.3722461"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12]]},"references-count":26,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["10.1145\/3722449.3722461"],"URL":"https:\/\/doi.org\/10.1145\/3722449.3722461","relation":{},"ISSN":["0163-5840"],"issn-type":[{"value":"0163-5840","type":"print"}],"subject":[],"published":{"date-parts":[[2024,12]]},"assertion":[{"value":"2025-03-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}