{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:02:18Z","timestamp":1772906538290,"version":"3.50.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>\n            Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose\n            <jats:italic>data quality assertions<\/jats:italic>\n            to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. 
SPADE has been deployed as an offering within LangSmith, LangChain's LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.\n          <\/jats:p>","DOI":"10.14778\/3685800.3685835","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4173-4186","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines"],"prefix":"10.14778","volume":"17","author":[{"given":"Shreya","family":"Shankar","sequence":"first","affiliation":[{"name":"UC Berkeley"}]},{"given":"Haotian","family":"Li","sequence":"additional","affiliation":[{"name":"HKUST"}]},{"given":"Parth","family":"Asawa","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Madelon","family":"Hulsebos","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Yiming","family":"Lin","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"J. D.","family":"Zamfirescu-Pereira","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Harrison","family":"Chase","sequence":"additional","affiliation":[{"name":"LangChain"}]},{"given":"Will","family":"Fu-Hinthorn","sequence":"additional","affiliation":[{"name":"LangChain"}]},{"given":"Aditya G.","family":"Parameswaran","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Eugene","family":"Wu","sequence":"additional","affiliation":[{"name":"Columbia University"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314050"},{"key":"e_1_2_1_2_1","volume-title":"ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. 
arXiv preprint arXiv:2309.09128","author":"Arawjo Ian","year":"2023","unstructured":"Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena Glassman. 2023. ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. arXiv preprint arXiv:2309.09128 (2023)."},{"key":"e_1_2_1_3_1","volume-title":"Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441","author":"Arora Simran","year":"2022","unstructured":"Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher R\u00e9. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441 (2022)."},{"key":"e_1_2_1_4_1","unstructured":"Eric Breck Neoklis Polyzotis Sudip Roy Steven Whang and Martin Zinkevich. 2019. Data Validation for Machine Learning.. In MLSys."},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of SysML. https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/167","author":"Breck Eric","year":"2019","unstructured":"Eric Breck, Marty Zinkevich, Neoklis Polyzotis, Steven Whang, and Sudip Roy. 2019. Data Validation for Machine Learning. In Proceedings of SysML. https:\/\/mlsys.org\/Conferences\/2019\/doc\/2019\/167.pdf"},{"key":"e_1_2_1_6_1","volume-title":"Foundations and Applications of Sensor Management","author":"Castro Rui","unstructured":"Rui Castro and Robert Nowak. 2008. Active learning and sampling. In Foundations and Applications of Sensor Management. Springer, 177--200."},{"key":"e_1_2_1_7_1","volume-title":"Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201","author":"Chan Chi-Min","year":"2023","unstructured":"Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. 
arXiv preprint arXiv:2308.07201 (2023)."},{"key":"e_1_2_1_8_1","unstructured":"Yupeng Chang Xu Wang Jindong Wang Yuan Wu Kaijie Zhu Hao Chen Linyi Yang Xiaoyuan Yi Cunxiang Wang Yidong Wang et al. 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023)."},{"key":"e_1_2_1_9_1","volume-title":"How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009","author":"Chen Lingjiao","year":"2023","unstructured":"Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)."},{"key":"e_1_2_1_10_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)."},{"key":"e_1_2_1_11_1","volume-title":"Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains. arXiv preprint arXiv:2306.12028","author":"Cheng Yu","year":"2023","unstructured":"Yu Cheng, Jieshan Chen, Qing Huang, Zhenchang Xing, Xiwei Xu, and Qinghua Lu. 2023. Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains. arXiv preprint arXiv:2306.12028 (2023)."},{"key":"e_1_2_1_12_1","volume-title":"Is GPT-3 text indistinguishable from human text? SCARECROW: A framework for scrutinizing machine text. arXiv preprint arXiv:2107.01294","author":"Dou Yao","year":"2021","unstructured":"Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A Smith, and Yejin Choi. 2021. Is GPT-3 text indistinguishable from human text? SCARECROW: A framework for scrutinizing machine text. arXiv preprint arXiv:2107.01294 (2021)."},{"key":"e_1_2_1_13_1","volume-title":"RAGAS: Automated Evaluation of Retrieval Augmented Generation. 
arXiv preprint arXiv:2309.15217","author":"Es Shahul","year":"2023","unstructured":"Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2023. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217 (2023)."},{"key":"e_1_2_1_14_1","volume-title":"Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533","author":"Fan Angela","year":"2023","unstructured":"Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533 (2023)."},{"key":"e_1_2_1_15_1","volume-title":"Towards Autonomous Testing Agents via Conversational Large Language Models. arXiv preprint arXiv:2306.05152","author":"Feldt Robert","year":"2023","unstructured":"Robert Feldt, Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Towards Autonomous Testing Agents via Conversational Large Language Models. arXiv preprint arXiv:2306.05152 (2023)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611527"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"John Forrest and Robin Lougee-Heimer. 2005. CBC user guide. In Emerging theory methods and applications. INFORMS 257--277.","DOI":"10.1287\/educ.1053.0020"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.13715"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00726-w"},{"key":"e_1_2_1_20_1","volume-title":"Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows. arXiv preprint arXiv:2312.11681","author":"Grunde-McLaughlin Madeleine","year":"2023","unstructured":"Madeleine Grunde-McLaughlin, Michelle S Lam, Ranjay Krishna, Daniel S Weld, and Jeffrey Heer. 2023. Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows. 
arXiv preprint arXiv:2312.11681 (2023)."},{"key":"e_1_2_1_21_1","unstructured":"Guardrails 2023. Guardrails AI. https:\/\/github.com\/guardrails-ai\/guardrails."},{"key":"e_1_2_1_22_1","unstructured":"Nick Hynes D. Sculley and Michael Terry. 2017. The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets. http:\/\/learningsys.org\/nips17\/assets\/papers\/paper_19.pdf"},{"key":"e_1_2_1_23_1","volume-title":"Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648","author":"Kalai Adam Tauman","year":"2023","unstructured":"Adam Tauman Kalai and Santosh S Vempala. 2023. Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648 (2023)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517907"},{"key":"e_1_2_1_25_1","first-page":"481","article-title":"Model assertions for monitoring and improving ML models","volume":"2","author":"Kang Daniel","year":"2020","unstructured":"Daniel Kang, Deepti Raghavan, Peter Bailis, and Matei Zaharia. 2020. Model assertions for monitoring and improving ML models. Proceedings of Machine Learning and Systems 2 (2020), 481--496.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_26_1","volume-title":"EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. arXiv preprint arXiv:2309.13633","author":"Kim Tae Soo","year":"2023","unstructured":"Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2023. EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. arXiv preprint arXiv:2309.13633 (2023)."},{"key":"e_1_2_1_27_1","unstructured":"Langchain 2023. Langchain AI. 
https:\/\/github.com\/langchain-ai\/langchain."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00085"},{"key":"e_1_2_1_29_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459--9474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_30_1","first-page":"1","article-title":"Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing","volume":"55","author":"Liu Pengfei","year":"2023","unstructured":"Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1--35.","journal-title":"Comput. Surveys"},{"key":"e_1_2_1_31_1","unstructured":"Llama Index 2023. Llama Index. https:\/\/github.com\/run-llama\/llama_index."},{"key":"e_1_2_1_32_1","volume-title":"Fantastically ordered prompts and where to find them: Overcoming fewshot prompt order sensitivity. arXiv preprint arXiv:2104.08786","author":"Lu Yao","year":"2021","unstructured":"Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming fewshot prompt order sensitivity. arXiv preprint arXiv:2104.08786 (2021)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403205"},{"key":"e_1_2_1_34_1","volume-title":"Can foundation models wrangle your data? 
arXiv preprint arXiv:2205.09911","author":"Narayan Avanika","year":"2022","unstructured":"Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher R\u00e9. 2022. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022)."},{"key":"e_1_2_1_35_1","volume-title":"LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828","author":"Ouyang Shuyin","year":"2023","unstructured":"Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3533378"},{"key":"e_1_2_1_37_1","volume-title":"Revisiting Prompt Engineering via Declarative Crowdsourcing. arXiv preprint arXiv:2308.03854","author":"Parameswaran Aditya G","year":"2023","unstructured":"Aditya G Parameswaran, Shreya Shankar, Parth Asawa, Naman Jain, and Yujie Wang. 2023. Revisiting Prompt Engineering via Declarative Crowdsourcing. arXiv preprint arXiv:2308.03854 (2023)."},{"key":"e_1_2_1_38_1","volume-title":"Henley","author":"Parnin Chris","year":"2023","unstructured":"Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, and Austin Z. Henley. 2023. Building Your Own Product Copilot: Challenges, Opportunities, and Needs. arXiv:2312.14231 [cs.SE]"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299887.3299891"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-019-00552-1"},{"key":"e_1_2_1_41_1","volume-title":"Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. arXiv preprint arXiv:2310.10501","author":"Rebedea Traian","year":"2023","unstructured":"Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. 
Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. arXiv preprint arXiv:2310.10501 (2023)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352110"},{"key":"e_1_2_1_43_1","volume-title":"ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv preprint arXiv:2311.09476","author":"Saad-Falcon Jon","year":"2023","unstructured":"Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2023. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv preprint arXiv:2311.09476 (2023)."},{"key":"e_1_2_1_44_1","volume-title":"An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. arXiv preprint arXiv:2302.06527","author":"Sch\u00e4fer Max","year":"2023","unstructured":"Max Sch\u00e4fer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. arXiv preprint arXiv:2302.06527 (2023)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3555041.3589682"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3614786"},{"key":"e_1_2_1_48_1","volume-title":"Operationalizing machine learning: An interview study. arXiv preprint arXiv:2209.09125","author":"Shankar Shreya","year":"2022","unstructured":"Shreya Shankar, Rolando Garcia, Joseph M Hellerstein, and Aditya G Parameswaran. 2022. Operationalizing machine learning: An interview study. arXiv preprint arXiv:2209.09125 (2022)."},{"key":"e_1_2_1_49_1","volume-title":"Rethinking streaming machine learning evaluation. arXiv preprint arXiv:2205.11473","author":"Shankar Shreya","year":"2022","unstructured":"Shreya Shankar, Bernease Herman, and Aditya G Parameswaran. 2022. Rethinking streaming machine learning evaluation. 
arXiv preprint arXiv:2205.11473 (2022)."},{"key":"e_1_2_1_50_1","volume-title":"SPADE: Synthesizing Assertions for Large Language Model Pipelines. arXiv preprint arXiv:2401.03038","author":"Shankar Shreya","year":"2024","unstructured":"Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, JD Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G Parameswaran, and Eugene Wu. 2024. SPADE: Synthesizing Assertions for Large Language Model Pipelines. arXiv preprint arXiv:2401.03038 (2024)."},{"key":"e_1_2_1_51_1","volume-title":"SPADE: Automatically digging up evals based on prompt refinements. https:\/\/blog.langchain.dev\/spade-automatically-digging-up-evals-based-on-prompt-refinements\/","author":"Shankar Shreya","year":"2023","unstructured":"Shreya Shankar, Haotian Li, Will Fu-Hinthorn, Harrison Chase, J.D. Zamfirescu-Pereira, Yiming Lin, Sam Noyes, Eugene Wu, and Aditya Parameswaran. 2023. SPADE: Automatically digging up evals based on prompt refinements. https:\/\/blog.langchain.dev\/spade-automatically-digging-up-evals-based-on-prompt-refinements\/"},{"key":"e_1_2_1_52_1","volume-title":"Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150","author":"Si Chenglei","year":"2022","unstructured":"Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. 2022. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150 (2022)."},{"key":"e_1_2_1_53_1","volume-title":"Noshin Ulfat, Fahmid Al Rifat, and Vinicius Carvalho Lopes.","author":"Siddiq Mohammed Latif","year":"2023","unstructured":"Mohammed Latif Siddiq, Joanna Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinicius Carvalho Lopes. 2023. Exploring the Effectiveness of Large Language Models in Generating Unit Tests. arXiv preprint arXiv:2305.00418 (2023)."},{"key":"e_1_2_1_54_1","volume-title":"DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. 
arXiv preprint arXiv:2312.13382","author":"Singhvi Arnav","year":"2023","unstructured":"Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab. 2023. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv preprint arXiv:2312.13382 (2023)."},{"key":"e_1_2_1_55_1","volume-title":"Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation. arXiv preprint arXiv:2310.02368","author":"Steenhoek Benjamin","year":"2023","unstructured":"Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation. arXiv preprint arXiv:2310.02368 (2023)."},{"key":"e_1_2_1_56_1","volume-title":"Software testing with large language model: Survey, landscape, and vision. arXiv preprint arXiv:2307.07221","author":"Wang Junjie","year":"2023","unstructured":"Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2023. Software testing with large language model: Survey, landscape, and vision. arXiv preprint arXiv:2307.07221 (2023)."},{"key":"e_1_2_1_57_1","volume-title":"PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. In The Twelfth International Conference on Learning Representations.","author":"Wang Yidong","year":"2023","unstructured":"Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Wenjin Yao, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, et al. 2023. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_58_1","volume-title":"Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966","author":"Wang Yufei","year":"2023","unstructured":"Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. 
Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023)."},{"key":"e_1_2_1_59_1","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824--24837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491101.3519729"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517582"},{"key":"e_1_2_1_62_1","volume-title":"Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective. arXiv preprint arXiv:2305.14889","author":"Xiao Ziang","year":"2023","unstructured":"Ziang Xiao, Susu Zhang, Vivian Lai, and Q Vera Liao. 2023. Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective. arXiv preprint arXiv:2305.14889 (2023)."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581388"},{"key":"e_1_2_1_64_1","volume-title":"Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862","author":"Zhang Xinghua","year":"2023","unstructured":"Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. 2023. Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862 (2023)."},{"key":"e_1_2_1_65_1","volume-title":"Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba.","author":"Zhou Yongchao","year":"2022","unstructured":"Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. 
arXiv preprint arXiv:2211.01910 (2022)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685835","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:30:53Z","timestamp":1735623053000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685835"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":65,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685835"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685835","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}