{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T21:41:44Z","timestamp":1776289304308,"version":"3.50.1"},"reference-count":82,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,5,2]],"date-time":"2025-05-02T00:00:00Z","timestamp":1746144000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Hum.-Comput. Interact."],"published-print":{"date-parts":[[2025,5,2]]},"abstract":"<jats:p>Chatbots have shown promise as tools to scale qualitative data collection. Recent advances in Large Language Models (LLMs) could accelerate this process by allowing researchers to easily deploy sophisticated interviewing chatbots. We test this assumption by conducting a large-scale user study (n=399) evaluating 3 different chatbots, two of which are LLM-based and a baseline which employs hard-coded questions. We evaluate the results with respect to participant engagement and experience, established metrics of chatbot quality grounded in theories of effective communication, and a novel scale evaluating ''richness'' or the extent to which responses capture the complexity and specificity of the social context under study. We find that, while the chatbots were able to elicit high-quality responses based on established evaluation metrics, the responses rarely capture participants' specific motives or personalized examples, and thus perform poorly with respect to richness. We further find low inter-rater reliability between LLMs and humans in the assessment of both quality and richness metrics. 
Our study offers a cautionary tale for scaling and evaluating qualitative research with LLMs.<\/jats:p>","DOI":"10.1145\/3710947","type":"journal-article","created":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T11:36:19Z","timestamp":1747740979000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Collecting Qualitative Data at Scale with Large Language Models: A Case Study"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6507-1334","authenticated-orcid":false,"given":"Alejandro","family":"Cuevas","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0404-9197","authenticated-orcid":false,"given":"Jennifer V.","family":"Scurrell","sequence":"additional","affiliation":[{"name":"ETH Zurich, Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2564-0373","authenticated-orcid":false,"given":"Eva M.","family":"Brown","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8960-4692","authenticated-orcid":false,"given":"Jason","family":"Entenmann","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9689-2442","authenticated-orcid":false,"given":"Madeleine I. G.","family":"Daepp","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, WA, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,5,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300484"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854946.2854960"},{"key":"e_1_2_1_3_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal et al. 2020. Language Models Are Few-Shot Learners. 
Advances in Neural Information Processing Systems (NeurIPS'20) (May 2020)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858498"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025919"},{"key":"e_1_2_1_6_1","volume-title":"ACM Workshop on Human-Centered Machine Learning (HCML'16)","author":"Kocielnik Rafal","year":"2016","unstructured":"Nan-chen Chen, Rafal Kocielnik, Margaret Drouhard, Vanessa Pe\u00f1a Araya, Jina Suh, Keting Cen, et al. 2016. Challenges of Applying Machine Learning to Qualitative Coding. In ACM Workshop on Human-Centered Machine Learning (HCML'16) (California, USA)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581122"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Felix Chopra and Ingar Haaland. 2023. Conducting qualitative interviews with AI. (2023).","DOI":"10.2139\/ssrn.4583756"},{"key":"e_1_2_1_9_1","unstructured":"Junjie Chu Yugeng Liu Ziqing Yang Xinyue Shen Michael Backes and Yang Zhang. 2024. Comprehensive Assessment of Jailbreak Attacks Against LLMs. arXiv:2402.05668 [cs.CR]"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1177\/00018392231194442"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.3758\/BRM.40.1.8"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1111\/tct.12953"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00064"},{"key":"e_1_2_1_14_1","volume-title":"The SAGE Handbook of Qualitative Research","author":"Denzin Norman","unstructured":"Norman Denzin and Yvonna Lincoln. 2011. The SAGE Handbook of Qualitative Research. 
SAGE Publications."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.3102\/0013189X016007016"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.3115\/1690219.1690270"},{"key":"e_1_2_1_17_1","volume-title":"Reduce Harms: Methods","author":"Ganguli Deep","year":"2022","unstructured":"Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, et al. 2022. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. (2022). arXiv:2209.07858 [cs.CL]"},{"key":"e_1_2_1_18_1","volume-title":"The Interpretation of Cultures","author":"Geertz Clifford","unstructured":"Clifford Geertz. 1973. The Interpretation of Cultures. Basic Books."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2305016120"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.chb.2019.01.020"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"HP Grice. 1975. Logic and Conversation.","DOI":"10.1163\/9789004368811_003"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300439"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3580688"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1243091"},{"key":"e_1_2_1_25_1","volume-title":"Stargazer: Well-Formatted Regression and Summary Statistics Tables. https:\/\/CRAN.R-project.org\/package=stargazer R Package Version 5.2.2.","author":"Hlavac Marek","year":"2018","unstructured":"Marek Hlavac. 2018. Stargazer: Well-Formatted Regression and Summary Statistics Tables. https:\/\/CRAN.R-project.org\/package=stargazer R Package Version 5.2.2."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.2478\/v10199-011-0040-1"},{"key":"e_1_2_1_27_1","volume-title":"Llama Guard: LLM-Based Input-Output Safeguard For Human-AI Conversations. 
arXiv:2312.06674 [cs.CL]","author":"Inan Hakan","year":"2023","unstructured":"Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, et al. 2023. Llama Guard: LLM-Based Input-Output Safeguard For Human-AI Conversations. arXiv:2312.06674 [cs.CL]"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3531146.3533097"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","unstructured":"Zhiqiu Jiang Mashrur Rashik Kunjal Panchal Mahmood Jasim Ali Sarvghad Pari Riahi et al. 2023. CommunityBots: Creating and Evaluating a Multi-Agent Chatbot Platform For Public Input Elicitation. doi:10.1145\/3579469","DOI":"10.1145\/3579469"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383652.3423870"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300316"},{"key":"e_1_2_1_32_1","volume-title":"Designing Social Inquiry: Scientific Inference in Qualitative Research","author":"King Gary","unstructured":"Gary King, Robert Keohane, and Sidney Verba. 2021. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton University Press."},{"key":"e_1_2_1_33_1","unstructured":"Takeshi Kojima Shixiang (Shane) Gu Machel Reid Yutaka Matsuo and Yusuke Iwasawa. 2022. Large Language Models Are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems (NeurIPS'22) (Virtual)."},{"key":"e_1_2_1_34_1","unstructured":"Michal Kosinski. 2024. Theory of Mind May Have Spontaneously Emerged in Large Language Models. (2024). arXiv:2302.02083 [cs.CL]"},{"key":"e_1_2_1_35_1","unstructured":"Sam Ladner. 2024. Assessing Quality in Qualitative Research. https:\/\/www.epicpeople.org\/assessing-quality-inqualitative-research\/."},{"key":"e_1_2_1_36_1","volume-title":"Jinjuan Heidi Feng, and Harry Hochheiser","author":"Lazar Jonathan","year":"2017","unstructured":"Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research Methods in Human-Computer Interaction. 
Morgan Kaufmann."},{"key":"e_1_2_1_37_1","unstructured":"Paul Leedy and Jeanne Ellis Ormrod. 2015. Practical Research. Pearson."},{"key":"e_1_2_1_38_1","first-page":"8","article-title":"Package 'Emmeans","volume":"1","author":"Lenth Russell","year":"2023","unstructured":"Russell Lenth, Henrik Singmann, Jonathon Love, Paul Buerkner, and Maxime Herve. 2023. Package 'Emmeans'. R Package Version 1, 8.8 (2023).","journal-title":"R Package Version"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1080\/01488376.2011.580697"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025171.3025206"},{"key":"e_1_2_1_41_1","unstructured":"Percy Liang Rishi Bommasani Tony Lee Dimitris Tsipras Dilara Soylu Michihiro Yasunaga et al. 2023. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (TMLR) (2023)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1230"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.225"},{"key":"e_1_2_1_44_1","volume-title":"a Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys (CSUR'23) 55, 9","author":"Liu Pengfei","year":"2023","unstructured":"Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-Train, Prompt, and Predict: a Systematic Survey of Prompting Methods in Natural Language Processing. 
ACM Computing Surveys (CSUR'23) 55, 9 (2023), 1--35."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858288"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1017\/S0043887109990220"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1093\/pan\/mpj017"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173922"},{"key":"e_1_2_1_49_1","volume-title":"Qualitative Research: a Guide to Design and Implementation","author":"Merriam Sharan","unstructured":"Sharan Merriam and Elizabeth Tisdell. 2015. Qualitative Research: a Guide to Design and Implementation. John Wiley & Sons."},{"key":"e_1_2_1_50_1","volume-title":"Accessed","year":"2023","unstructured":"Microsoft. [n. d.]. Orchestrate Your AI with Semantic Kernel | Microsoft Learn. https:\/\/learn.microsoft.com\/enus\/semantic-kernel\/overview\/. Accessed: September 14, 2023."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1177\/160940690200100202"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445383"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.21586\/ross0000004"},{"key":"e_1_2_1_54_1","volume-title":"Qualitative Research & Evaluation Methods: Integrating Theory and Practice","author":"Patton Michael Quinn","unstructured":"Michael Quinn Patton. 2014. Qualitative Research & Evaluation Methods: Integrating Theory and Practice. SAGE Publications."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.5555\/3237383.3237883"},{"key":"e_1_2_1_56_1","unstructured":"Stuart Russell. 2010. Artificial Intelligence a Modern Approach. Pearson Education."},{"key":"e_1_2_1_57_1","unstructured":"Mark Russinovich Ahmed Salem and Ronen Eldan. 2024. Great Now Write An Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv:2404.01833 [cs.CR]"},{"key":"e_1_2_1_58_1","unstructured":"Johnny Salda\u00f1a. 2021. The Coding Manual For Qualitative Researchers. 
(2021)."},{"key":"e_1_2_1_59_1","volume-title":"Xuhui Zhou, Yejin Choi, Yoav Goldberg, et al.","author":"Shapira Natalie","year":"2024","unstructured":"Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, et al. 2024. Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models. In Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models (St. Julian's, Malta)."},{"key":"e_1_2_1_60_1","volume-title":"Qualitative Literacy: a Guide to Evaluating Ethnographic and Interview Research","author":"Small Mario Luis","unstructured":"Mario Luis Small and Jessica McCrory Calarco. 2022. Qualitative Literacy: a Guide to Evaluating Ethnographic and Interview Research. Univ. of California Press."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","unstructured":"James W. A. Strachan Dalila Albergo Giulia Borghini Oriana Pansardi Eugenio Scaliti Saurabh Gupta et al. 2024. Testing Theory of Mind in Large Language Models and Humans. Nature Human Behaviour (May 2024). doi:10.1038\/s41562-024-01882-z","DOI":"10.1038\/s41562-024-01882-z"},{"key":"e_1_2_1_62_1","volume-title":"Accessed","year":"2023","unstructured":"Taivo. [n. d.]. GPT-3.5 and GPT-4 Response Times. https:\/\/www.taivo.ai\/__gpt-3--5-and-gpt-4-response-times\/. 
Accessed: September 14, 2023."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3174178"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401127"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0747-5632(02)00032-8"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1177\/1077800410383121"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.5555\/2041666.2041690"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.3115\/976909.979652"},{"key":"e_1_2_1_69_1","unstructured":"Eric Wallace Kai Xiao Reimar Leike Lilian Weng Johannes Heidecke and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208 [cs.CR]"},{"key":"e_1_2_1_70_1","volume-title":"Chi, et al","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, et al. 2022. Chain-Of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS'22) (2022)."},{"key":"e_1_2_1_71_1","unstructured":"Jules White Quchen Fu Sam Hays Michael Sandborn Carlos Olea Henry Gilbert et al. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382 [cs.SE]"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581252"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581754.3584136"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376131"},{"key":"e_1_2_1_75_1","doi-asserted-by":"crossref","unstructured":"Ziang Xiao Michelle X Zhou Vera Liao Gloria Mark Changyan Chi Wenxi Chen et al. 2020. Tell Me About Yourself: Using an AI-Powered Chatbot to Conduct Conversational Surveys with Open-ended Questions. 
ACM Transactions on Computer-Human Interaction (TOCHI'20) 27 3 (2020).","DOI":"10.1145\/3381804"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.235"},{"key":"e_1_2_1_77_1","unstructured":"Shunyu Yao Dian Yu Jeffrey Zhao Izhak Shafran Thomas L. Griffiths Yuan Cao et al. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS'23) (New Orleans LA USA)."},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581388"},{"key":"e_1_2_1_79_1","volume-title":"Data Quality, and User Evaluation. Communication Methods and Measures","author":"Zarouali Brahim","year":"2023","unstructured":"Brahim Zarouali, Theo Araujo, Jakob Ohme, and Claes de Vreese. 2023. Comparing Chatbots and Online Surveys For (Longitudinal) Data Collection: An Investigation of Response Characteristics, Data Quality, and User Evaluation. Communication Methods and Measures (2023), 1--20."},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2302491120"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501855"},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3232077"}],"container-title":["Proceedings of the ACM on Human-Computer 
Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3710947","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3710947","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T09:29:16Z","timestamp":1755768556000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3710947"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,2]]},"references-count":82,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,5,2]]}},"alternative-id":["10.1145\/3710947"],"URL":"https:\/\/doi.org\/10.1145\/3710947","relation":{},"ISSN":["2573-0142"],"issn-type":[{"value":"2573-0142","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,2]]},"assertion":[{"value":"2025-05-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}