{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:04:08Z","timestamp":1750309448909,"version":"3.41.0"},"reference-count":81,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,11,4]],"date-time":"2024-11-04T00:00:00Z","timestamp":1730678400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council","doi-asserted-by":"crossref","award":["DP180102687"],"award-info":[{"award-number":["DP180102687"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Office of Naval Research contract","award":["N000142212688"],"award-info":[{"award-number":["N000142212688"]}]},{"name":"NSF grant number","award":["2143434"],"award-info":[{"award-number":["2143434"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2025,1,31]]},"abstract":"<jats:p>The effectiveness of clarification question models in engaging users within search systems is currently constrained, casting doubt on their overall usefulness. To improve the performance of these models, it is crucial to employ assessment approaches that encompass both real-time feedback from users (online evaluation) and the characteristics of clarification questions evaluated through human assessment (offline evaluation). However, the relationship between online and offline evaluations has been debated in information retrieval. This study aims to investigate how this discordance holds in search clarification. 
We use user engagement as ground truth and employ several offline labels to investigate to what extent the offline ranked lists of clarification resemble the ideal ranked lists based on online user engagement. Contrary to the current understanding that offline evaluations fall short of supporting online evaluations, we indicate that when identifying the most engaging clarification questions from the user\u2019s perspective, online and offline evaluations correspond with each other. We show that the query length does not influence the relationship between online and offline evaluations, and reducing uncertainty in online evaluation strengthens this relationship. We illustrate that an engaging clarification needs to excel from multiple perspectives, and SERP quality and characteristics of the clarification are equally important. We also investigate if human labels can enhance the performance of Large Language Models (LLMs) and Learning-to-Rank (LTR) models in identifying the most engaging clarification questions from the user\u2019s perspective by incorporating offline evaluations as input features. Our results indicate that LTR models do not perform better than individual offline labels. 
However, GPT, an LLM, emerges as the standout performer, surpassing all LTR models and offline labels.<\/jats:p>","DOI":"10.1145\/3681786","type":"journal-article","created":{"date-parts":[[2024,7,25]],"date-time":"2024-07-25T16:11:36Z","timestamp":1721923896000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Online and Offline Evaluation in Search Clarification"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5951-4052","authenticated-orcid":false,"given":"Leila","family":"Tavakoli","sequence":"first","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7801-0239","authenticated-orcid":false,"given":"Johanne R.","family":"Trippas","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0800-3340","authenticated-orcid":false,"given":"Hamed","family":"Zamani","sequence":"additional","affiliation":[{"name":"University of Massachusetts Amherst, Amherst, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9094-0810","authenticated-orcid":false,"given":"Falk","family":"Scholer","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0487-9609","authenticated-orcid":false,"given":"Mark","family":"Sanderson","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]}],"member":"320","published-online":{"date-parts":[[2024,11,4]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/1498759.1498824"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2011.03.002"},{"key":"e_1_3_2_4_2","unstructured":"Mohammad Aliannejadi Julia Kiseleva Aleksandr Chuklin Jeff Dalton and Mikhail Burtsev. 2020. 
ConvAI3: Generating clarifying questions for open-domain dialogue systems (ClariQ). arXiv:2009.11352."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.367"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331265"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-56608-5_70"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24592-8_12"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/2532508.2532512"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/1507509.1507511"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3020165.3022149"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2094072.2094078"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526711"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080804"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983829"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2484028.2484071"},{"key":"e_1_3_2_17_2","volume-title":"Factors Determining the Performance of Indexing Systems","author":"Cleverdon Cyril","year":"1966","unstructured":"Cyril Cleverdon, Jack Mills, and Michael Keen. 1966. Factors Determining the Performance of Indexing Systems. Vol. 1, Part 2. Cranfield Research Projects."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/2209310.2209314"},{"key":"e_1_3_2_19_2","unstructured":"V. Dang. 2013. The Lemur Project-Wiki-RankLib. Lemur Project. 
Retrieved from https:\/\/sourceforge.net\/p\/lemur\/wiki\/RankLib."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2645710.2645737"},{"issue":"1","key":"e_1_3_2_21_2","first-page":"57","article-title":"What ChatGPT means for universities: Perceptions of scholars and students","volume":"6","author":"Firat Mehmet","year":"2023","unstructured":"Mehmet Firat. 2023. What ChatGPT means for universities: Perceptions of scholars and students. Journal of Applied Learning and Teaching 6, 1 (2023), 57\u201363.","journal-title":"Journal of Applied Learning and Teaching"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/1059981.1059982"},{"key":"e_1_3_2_23_2","first-page":"28","volume-title":"Proceedings of the 2003 AAAI Spring Symposium. Workshop on Natural Language Generation in Spoken and Written Dialogue","author":"Gabsdil Malte","year":"2003","unstructured":"Malte Gabsdil. 2003. Clarification in spoken dialogue systems. In Proceedings of the 2003 AAAI Spring Symposium. Workshop on Natural Language Generation in Spoken and Written Dialogue, 28\u201335."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2645710.2645745"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2072298.2072308"},{"key":"e_1_3_2_26_2","volume-title":"Universal Methods of Design Expanded and Revised: 125 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions","author":"Hanington Bruce","year":"2019","unstructured":"Bruce Hanington and Bella Martin. 2019. Universal Methods of Design Expanded and Revised: 125 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions. 
Rockport publishers."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/1835449.1835499"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-022-01726-0"},{"key":"e_1_3_2_29_2","volume-title":"Proceedings of the 1st International Workshop on Generalization in Information Retrieval (GLARE \u201918)","author":"Ingber Amir","year":"2018","unstructured":"Amir Ingber, Liane Lewin-Eytan, Alexander Libov, Yoelle Maarek, and Eliyahu Osherovich. 2018. Offline vs. online evaluation in voice product search. In Proceedings of the 1st International Workshop on Generalization in Information Retrieval (GLARE \u201918). Retrieved from http:\/\/glare2018.dei.unipd.it\/paper\/glare2018-paper4.pdf."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684822.2685319"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775067"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/1150402.1150429"},{"key":"e_1_3_2_33_2","volume-title":"Text Mining","author":"Joachims Thorsten","year":"2003","unstructured":"Thorsten Joachims. 2003. Evaluating retrieval performance using clickthrough data. In Text Mining. J. Franke, G. Nakhaeizadeh, and I. Renz (Eds.), Physica."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3130332.3130334"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/1229179.1229181"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/959258.959260"},{"key":"e_1_3_2_37_2","volume-title":"Correlation Methods","author":"Kendall Maurice George","year":"1975","unstructured":"Maurice George Kendall. 1975. Correlation Methods (4th. ed.). 
Charles Griffin, London, United Kingdom.","edition":"4"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2006.11.008"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/2600428.2609468"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-008-0114-1"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3409256.3409817"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3412137"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220028"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.02.011"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482190"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767721"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242731"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-023-09426-1"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/1416950.1416952"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2020.102226"},{"key":"e_1_3_2_51_2","unstructured":"Wenjie Ou and Yue Lin. 2020. A clarifying question selection system from NTES_ALONG in Convai3 challenge. arXiv:2010.14202. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2010.14202"},{"key":"e_1_3_2_52_2","first-page":"1881","volume-title":"Proceedings of the 28th Annual Conference of the Cognitive Science Society","volume":"28","author":"O\u2019Brien Maeve","year":"2006","unstructured":"Maeve O\u2019Brien and Mark T Keane. 2006. Modeling result-list searching in the World Wide Web: The role of relevance topologies and trust bias. In Proceedings of the 28th Annual Conference of the Cognitive Science Society, Vol. 28. 
Citeseer, 1881\u20131886."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1098\/rsta.1896.0007"},{"key":"e_1_3_2_54_2","unstructured":"Gustavo Penha Alexandru Balan and Claudia Hauff. 2019. Introducing MANtIS: A novel multi-domain information seeking dialogues dataset. arXiv:1912.04639. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.1912.04639"},{"key":"e_1_3_2_55_2","volume-title":"Proceedings of the 11th Workshop on the Semantics and Pragmatics of Dialogue (Decalog \u201907)","volume":"83","author":"Quarteroni Silvia","year":"2007","unstructured":"Silvia Quarteroni and Suresh Manandhar. 2007. A chatbot-based interactive question answering system. In Proceedings of the 11th Workshop on the Semantics and Pragmatics of Dialogue (Decalog \u201907), 83."},{"key":"e_1_3_2_56_2","first-page":"1266","volume-title":"Proceedings of the Findings of the Association for Computational Linguistics (EACL \u201924)","author":"Rahmani Hossein A.","year":"2024","unstructured":"Hossein A. Rahmani, Xi Wang, Mohammad Aliannejadi, Mohammadmehdi Naghiaei, and Emine Yilmaz. 2024. Clarifying the path to user satisfaction: An investigation into clarification usefulness. In Proceedings of the Findings of the Association for Computational Linguistics (EACL \u201924), 1266\u20131277."},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.152"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1255"},{"key":"e_1_3_2_59_2","first-page":"143","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Rao Sudha","year":"2019","unstructured":"Sudha Rao and Hal Daum\u00e9 III. 2019. Answer-based adversarial training for generating clarification questions. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), 143\u2013155. DOI: https:\/\/aclanthology.org\/N19-1.pdf"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959176"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/2645710.2645746"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-72113-8_41"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.373"},{"key":"e_1_3_2_64_2","first-page":"1306","article-title":"Post hoc tests: Tukey honestly significant difference test","author":"Stoll A","year":"2017","unstructured":"A Stoll. 2017. Post hoc tests: Tukey honestly significant difference test. The SAGE encyclopedia of communication research methods (2017), 1306\u20131307.","journal-title":"The SAGE encyclopedia of communication research methods"},{"key":"e_1_3_2_65_2","first-page":"814","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Swaminathan Adith","year":"2015","unstructured":"Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the International Conference on Machine Learning. PMLR, 814\u2013823."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531750"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.24562"},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277894"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.2307\/3001913"},{"key":"e_1_3_2_70_2","volume-title":"TREC: Experiment and Evaluation in Information Retrieval","author":"Voorhees Ellen M.","year":"2005","unstructured":"Ellen M. Voorhees, Donna K. Harman (Eds.). 2005. TREC: Experiment and Evaluation in Information Retrieval. Vol. 63. 
MIT press."},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911537"},{"key":"e_1_3_2_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/1852102.1852106"},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.22.558.309"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1172"},{"key":"e_1_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2488215"},{"key":"e_1_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661953"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380126"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3412772"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401160"},{"key":"e_1_3_2_80_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210059"},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1145\/1864708.1864759"},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2022.103176"}],"container-title":["ACM Transactions on Information 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3681786","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3681786","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:02Z","timestamp":1750295402000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3681786"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,4]]},"references-count":81,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1,31]]}},"alternative-id":["10.1145\/3681786"],"URL":"https:\/\/doi.org\/10.1145\/3681786","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"type":"print","value":"1046-8188"},{"type":"electronic","value":"1558-2868"}],"subject":[],"published":{"date-parts":[[2024,11,4]]},"assertion":[{"value":"2024-03-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}