{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T08:38:32Z","timestamp":1775205512070,"version":"3.50.1"},"reference-count":40,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T00:00:00Z","timestamp":1712880000000},"content-version":"vor","delay-in-days":102,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,4,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user\u2019s intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert\u2019s guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. 
We release our code and LLM prompts for the public to use.<\/jats:p>","DOI":"10.1162\/tacl_a_00648","type":"journal-article","created":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T19:02:55Z","timestamp":1712948575000},"page":"321-333","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":48,"title":["Large Language Models Enable Few-Shot Clustering"],"prefix":"10.1162","volume":"12","author":[{"given":"Vijay","family":"Viswanathan","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, USA"}]},{"given":"Kiril","family":"Gashteovski","sequence":"additional","affiliation":[{"name":"NEC Laboratories Europe, Germany"}]},{"given":"Kiril","family":"Gashteovski","sequence":"additional","affiliation":[{"name":"Center for Advanced Interdisciplinary Research, Ss. Cyril and Methodius Uni. of Skopje, Germany"}]},{"given":"Carolin","family":"Lawrence","sequence":"additional","affiliation":[{"name":"NEC Laboratories Europe, Germany"}]},{"given":"Tongshuang","family":"Wu","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, USA"}]},{"given":"Graham","family":"Neubig","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, USA"}]}],"member":"281","published-online":{"date-parts":[[2024,4,5]]},"reference":[{"key":"2024041219024513300_bib1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-3223-4_4","article-title":"A survey of text clustering algorithms","volume-title":"Mining Text Data","author":"Aggarwal","year":"2012"},{"key":"2024041219024513300_bib2","article-title":"k-means++: the advantages of careful seeding","volume-title":"ACM-SIAM Symposium on Discrete Algorithms","author":"Arthur","year":"2007"},{"key":"2024041219024513300_bib3","first-page":"3:1\u20133:35","article-title":"Local algorithms for interactive clustering","volume":"18","author":"Awasthi","year":"2013","journal-title":"Journal of Machine Learning 
Research"},{"issue":"1","key":"2024041219024513300_bib4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3340960","article-title":"Interactive clustering: A comprehensive review","volume":"53","author":"Bae","year":"2020","journal-title":"ACM Computing Surveys"},{"key":"2024041219024513300_bib5","article-title":"Open information extraction from the web","volume-title":"CACM","author":"Banko","year":"2007"},{"key":"2024041219024513300_bib6","article-title":"Semi-supervised clustering by seeding","volume-title":"International Conference on Machine Learning","author":"Basu","year":"2002"},{"key":"2024041219024513300_bib7","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972740.31","article-title":"Active semi-supervision for pairwise constrained clustering","volume-title":"SDM","author":"Basu","year":"2004"},{"key":"2024041219024513300_bib8","first-page":"2787","article-title":"Translating embeddings for modeling multi-relational data","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2","author":"Bordes","year":"2013"},{"key":"2024041219024513300_bib9","article-title":"Using encyclopedic knowledge for named entity disambiguation","volume-title":"Conference of the European Chapter of the Association for Computational Linguistics","author":"Bunescu","year":"2006"},{"key":"2024041219024513300_bib10","doi-asserted-by":"publisher","first-page":"1259","DOI":"10.1145\/2505515.2514692","article-title":"Clustering: Probably approximately useless?","volume-title":"Proceedings of the 22nd ACM International Conference on Information & Knowledge Management","author":"Caruana","year":"2013"},{"key":"2024041219024513300_bib11","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2020.nlp4convai-1.5","article-title":"Efficient intent detection with dual sentence encoders","volume-title":"Proceedings of the 2nd Workshop on Natural Language Processing for Conversational 
AI","author":"Casanueva","year":"2020"},{"key":"2024041219024513300_bib12","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611974973.27","article-title":"A method to accelerate human in the loop clustering","volume-title":"Proceedings of the 2017 SIAM International Conference on Data Mining","author":"Coden","year":"2017"},{"key":"2024041219024513300_bib13","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1613\/jair.3003","article-title":"Which clustering do you want? Inducing your ideal clustering with minimal feedback","volume":"39","author":"Dasgupta","year":"2010","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2024041219024513300_bib14","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.811","article-title":"Open knowledge graphs canonicalization using variational autoencoders","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Dash","year":"2020"},{"key":"2024041219024513300_bib15","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1007\/BF01890115","article-title":"Efficient algorithms for agglomerative hierarchical clustering methods","volume":"1","author":"Day","year":"1984","journal-title":"Journal of Classification"},{"key":"2024041219024513300_bib16","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2024041219024513300_bib17","article-title":"Identifying relations for open information extraction","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Fader","year":"2011"},{"key":"2024041219024513300_bib18","article-title":"GPTscore: Evaluate as you 
desire","volume":"abs\/2302.04166","author":"Jinlan","year":"2023","journal-title":"ArXiv"},{"key":"2024041219024513300_bib19","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2662073","article-title":"Canonicalizing open knowledge bases","author":"Gal\u00e1rraga","year":"2014","journal-title":"Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management"},{"key":"2024041219024513300_bib20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1278","article-title":"Minie: Minimizing facts in open information extraction","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Gashteovski","year":"2017"},{"key":"2024041219024513300_bib21","article-title":"Opiec: An open information extraction corpus","volume-title":"Proceedings of the Conference on Automatic Knowledge Base Construction (AKBC)","author":"Gashteovski","year":"2019"},{"key":"2024041219024513300_bib22","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3174023","article-title":"A data-driven analysis of workers\u2019 earnings on Amazon Mechanical Turk","author":"Hara","year":"2017","journal-title":"Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems"},{"key":"2024041219024513300_bib23","doi-asserted-by":"crossref","unstructured":"John J. Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Working Paper 31122, National Bureau of Economic Research. 
10.3386\/w31122","DOI":"10.3386\/w31122"},{"key":"2024041219024513300_bib24","article-title":"The Hungarian method for the assignment problem","volume":"52","author":"Kuhn","year":"1955","journal-title":"Naval Research Logistics (NRL)"},{"key":"2024041219024513300_bib25","doi-asserted-by":"publisher","first-page":"1311","DOI":"10.18653\/v1\/D19-1131","article-title":"An evaluation dataset for intent classification and out-of-scope prediction","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Larson","year":"2019"},{"issue":"2","key":"2024041219024513300_bib26","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in pcm","volume":"28","author":"Lloyd","year":"1982","journal-title":"IEEE Transactions on Information Theory"},{"key":"2024041219024513300_bib27","doi-asserted-by":"publisher","DOI":"10.1145\/1458082.1458150","article-title":"Learning to link with wikipedia","volume-title":"International Conference on Information and Knowledge Management","author":"Milne","year":"2008"},{"key":"2024041219024513300_bib28","article-title":"Generative agents: Interactive simulacra of human behavior","author":"Park","year":"2023","journal-title":"arXiv preprint arXiv:2304.03442"},{"key":"2024041219024513300_bib29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.nlp4convai-1.7","article-title":"Idas: Intent discovery with abstractive summarization","author":"De Raedt","year":"2023","journal-title":"ArXiv"},{"key":"2024041219024513300_bib30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410","article-title":"Sentence-bert: Sentence embeddings using siamese bert-networks","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language 
Processing","author":"Reimers","year":"2019"},{"key":"2024041219024513300_bib31","article-title":"Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter","author":"Sanh","year":"2019","journal-title":"ArXiv"},{"key":"2024041219024513300_bib32","doi-asserted-by":"publisher","first-page":"1578","DOI":"10.1145\/3534678.3539449","article-title":"Multi-view clustering for open knowledge base canonicalization","volume-title":"Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Shen","year":"2022"},{"key":"2024041219024513300_bib33","article-title":"One embedder, any task: Instruction-finetuned text embeddings","volume-title":"arXiv","author":"Hongjin","year":"2022"},{"key":"2024041219024513300_bib34","doi-asserted-by":"publisher","first-page":"1317","DOI":"10.1145\/3178876.3186030","article-title":"Cesi: Canonicalizing open knowledge bases using embeddings and side information","volume-title":"Proceedings of the 2018 World Wide Web Conference","author":"Vashishth","year":"2018"},{"key":"2024041219024513300_bib35","article-title":"Clustering with instance-level constraints","volume-title":"Proceedings of the Seventeenth International Conference on Machine Learning","author":"Wagstaff","year":"2000"},{"key":"2024041219024513300_bib36","doi-asserted-by":"publisher","first-page":"625","DOI":"10.1109\/ICDE.2016.7498276","article-title":"A model-based approach for text clustering with outlier detection","author":"Yin","year":"2016","journal-title":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)"},{"key":"2024041219024513300_bib37","doi-asserted-by":"publisher","first-page":"5419","DOI":"10.18653\/v1\/2021.naacl-main.427","article-title":"Supporting clustering with contrastive learning","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language 
Technologies","author":"Zhang","year":"2021"},{"key":"2024041219024513300_bib38","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-46150-8_4","article-title":"A framework for deep constrained clustering - algorithms and advances","volume-title":"ECML\/PKDD","author":"Zhang","year":"2019"},{"key":"2024041219024513300_bib39","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.858","article-title":"Clusterllm: Large language models as a guide for text clustering","author":"Zhang","year":"2023","journal-title":"ArXiv"},{"key":"2024041219024513300_bib40","doi-asserted-by":"publisher","first-page":"754","DOI":"10.18653\/v1\/2022.naacl-main.55","article-title":"Learning dialogue representations from consecutive utterances","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Zhou","year":"2022"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00648\/2362202\/tacl_a_00648.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00648\/2362202\/tacl_a_00648.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T19:03:05Z","timestamp":1712948585000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00648\/120476\/Large-Language-Models-Enable-Few-Shot-Clustering"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":40,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00648","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}