{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T03:02:13Z","timestamp":1772593333646,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":21,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,5,2]],"date-time":"2024-05-02T00:00:00Z","timestamp":1714608000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,5,2]]},"DOI":"10.1145\/3613905.3650755","type":"proceedings-article","created":{"date-parts":[[2024,5,11]],"date-time":"2024-05-11T08:15:21Z","timestamp":1715415321000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0291-6026","authenticated-orcid":false,"given":"Minsuk","family":"Kahng","sequence":"first","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6225-9283","authenticated-orcid":false,"given":"Ian","family":"Tenney","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5903-5510","authenticated-orcid":false,"given":"Mahima","family":"Pushkarna","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8246-8736","authenticated-orcid":false,"given":"Michael Xieyang","family":"Liu","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8105-6998","authenticated-orcid":false,"given":"James","family":"Wexler","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3572-6234","authenticated-orcid":false,"given":"Emily","family":"Reif","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2337-0114","authenticated-orcid":false,"given":"Krystal","family":"Kallarackal","sequence":"additional","affiliation":[{"name":"Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9441-3337","authenticated-orcid":false,"given":"Minsuk","family":"Chang","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1941-939X","authenticated-orcid":false,"given":"Michael","family":"Terry","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1094-1675","authenticated-orcid":false,"given":"Lucas","family":"Dixon","sequence":"additional","affiliation":[{"name":"People + AI Research (PAIR), Google, France"}]}],"member":"320","published-online":{"date-parts":[[2024,5,11]]},"reference":[{"key":"e_1_3_3_3_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2702123.2702509"},{"key":"e_1_3_3_3_2_1","volume-title":"Concrete problems in AI safety. arXiv preprint arXiv:1606.06565","author":"Amodei Dario","year":"2016","unstructured":"Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man\u00e9. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016). https:\/\/arxiv.org\/abs\/1606.06565"},{"key":"e_1_3_3_3_3_1","volume-title":"PaLM 2 technical report. arXiv preprint arXiv:2305.10403","author":"Anil Rohan","year":"2023","unstructured":"Rohan Anil, Andrew\u00a0M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403 (2023). https:\/\/arxiv.org\/abs\/2305.10403"},{"key":"e_1_3_3_3_4_1","volume-title":"ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. arXiv preprint arXiv:2309.09128","author":"Arawjo Ian","year":"2023","unstructured":"Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena Glassman. 2023. ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. arXiv preprint arXiv:2309.09128 (2023). https:\/\/arxiv.org\/abs\/2309.09128"},{"key":"e_1_3_3_3_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3490099.3511122"},{"key":"e_1_3_3_3_6_1","volume-title":"The Role of Interactive Visualization in Explaining (Large) NLP Models: from Data to Inference. arXiv preprint arXiv:2301.04528","author":"Brath Richard","year":"2023","unstructured":"Richard Brath, Daniel Keim, Johannes Knittel, Shimei Pan, Pia Sommerauer, and Hendrik Strobelt. 2023. The Role of Interactive Visualization in Explaining (Large) NLP Models: from Data to Inference. arXiv preprint arXiv:2301.04528 (2023). https:\/\/arxiv.org\/abs\/2301.04528"},{"key":"e_1_3_3_3_7_1","unstructured":"Google Cloud. 2024. Perform automatic side-by-side evaluation. https:\/\/cloud.google.com\/vertex-ai\/docs\/generative-ai\/models\/side-by-side-eval"},{"key":"e_1_3_3_3_8_1","volume-title":"KnowledgeVIS: Interpreting Language Models by Comparing Fill-in-the-Blank Prompts","author":"Coscia Adam","year":"2023","unstructured":"Adam Coscia and Alex Endert. 2023. KnowledgeVIS: Interpreting Language Models by Comparing Fill-in-the-Blank Prompts. IEEE Transactions on Visualization and Computer Graphics (2023)."},{"key":"e_1_3_3_3_9_1","volume-title":"Boxer: Interactive comparison of classifier results. In Computer Graphics Forum, Vol.\u00a039","author":"Gleicher Michael","year":"2020","unstructured":"Michael Gleicher, Aditya Barve, Xinyi Yu, and Florian Heimerl. 2020. Boxer: Interactive comparison of classifier results. In Computer Graphics Forum, Vol.\u00a039. Wiley Online Library, 181\u2013193. https:\/\/arxiv.org\/abs\/2004.07964"},{"key":"e_1_3_3_3_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2017.2744718"},{"key":"e_1_3_3_3_11_1","volume-title":"EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. arXiv preprint arXiv:2309.13633","author":"Kim Tae\u00a0Soo","year":"2023","unstructured":"Tae\u00a0Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2023. EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. arXiv preprint arXiv:2309.13633 (2023). https:\/\/arxiv.org\/abs\/2309.13633"},{"key":"e_1_3_3_3_12_1","volume-title":"LiPO: Listwise Preference Optimization through Learning-to-Rank. arXiv preprint arXiv:2402.01878","author":"Liu Tianqi","year":"2024","unstructured":"Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, 2024. LiPO: Listwise Preference Optimization through Learning-to-Rank. arXiv preprint arXiv:2402.01878 (2024). https:\/\/arxiv.org\/abs\/2402.01878"},{"key":"e_1_3_3_3_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2018.2864812"},{"key":"e_1_3_3_3_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-demo.12"},{"key":"e_1_3_3_3_15_1","first-page":"1146","article-title":"Interactive and visual prompt engineering for ad-hoc task adaptation with large language models","volume":"29","author":"Strobelt Hendrik","year":"2022","unstructured":"Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, Hanspeter Pfister, and Alexander\u00a0M Rush. 2022. Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. IEEE Transactions on Visualization and Computer Graphics 29, 1 (2022), 1146\u20131156. https:\/\/arxiv.org\/abs\/2208.07852","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"e_1_3_3_3_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.15"},{"key":"e_1_3_3_3_17_1","volume-title":"Learning-from-disagreement: A model comparison and visual analytics framework","author":"Wang Junpeng","year":"2022","unstructured":"Junpeng Wang, Liang Wang, Yan Zheng, Chin-Chia\u00a0Michael Yeh, Shubham Jain, and Wei Zhang. 2022. Learning-from-disagreement: A model comparison and visual analytics framework. IEEE Transactions on Visualization and Computer Graphics (2022). https:\/\/arxiv.org\/abs\/2201.07849"},{"key":"e_1_3_3_3_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.657"},{"key":"e_1_3_3_3_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2962027"},{"key":"e_1_3_3_3_20_1","unstructured":"Lianmin Zheng Wei-Lin Chiang Ying Sheng Siyuan Zhuang Zhanghao Wu Yonghao Zhuang Zi Lin Zhuohan Li Dacheng Li Eric Xing Hao Zhang Joseph\u00a0E. Gonzalez and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Neural Information Processing Systems (NeurIPS): Datasets and Benchmarks Track. https:\/\/arxiv.org\/abs\/2306.05685"},{"key":"e_1_3_3_3_21_1","volume-title":"International Conference on Machine Learning (ICML). PMLR, 27099\u201327116","author":"Zhong Ruiqi","year":"2022","unstructured":"Ruiqi Zhong, Charlie Snell, Dan Klein, and Jacob Steinhardt. 2022. Describing differences between text distributions with natural language. In International Conference on Machine Learning (ICML). PMLR, 27099\u201327116. https:\/\/proceedings.mlr.press\/v162\/zhong22a.html"}],"event":{"name":"CHI '24: CHI Conference on Human Factors in Computing Systems","location":"Honolulu HI USA","acronym":"CHI '24","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction","SIGACCESS ACM Special Interest Group on Accessible Computing"]},"container-title":["Extended Abstracts of the CHI Conference on Human Factors in Computing Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613905.3650755","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3613905.3650755","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:44:16Z","timestamp":1750290256000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613905.3650755"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,2]]},"references-count":21,"alternative-id":["10.1145\/3613905.3650755","10.1145\/3613905"],"URL":"https:\/\/doi.org\/10.1145\/3613905.3650755","relation":{},"subject":[],"published":{"date-parts":[[2024,5,2]]},"assertion":[{"value":"2024-05-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}