{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T04:11:34Z","timestamp":1780632694727,"version":"3.54.1"},"reference-count":54,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2025,7,11]],"date-time":"2025-07-11T00:00:00Z","timestamp":1752192000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,11]],"date-time":"2025-07-11T00:00:00Z","timestamp":1752192000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100007241","name":"Universit\u00e9 Paris-Saclay","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100007241","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Comput Soc Sc"],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Large Language Models have recently been applied to text annotation tasks from social sciences, equating or surpassing the performance of human workers at a fraction of the cost. However, very few inquiries in the social sciences have been made of the impact of prompt selection on labelling accuracy. In this study, we show that performance greatly varies between prompts, and we apply the method of automatic prompt optimization to systematically craft high quality prompts. 
We also provide the community with a simple, browser-based implementation of the method at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/prompt-ultra.github.io\/\" ext-link-type=\"uri\">https:\/\/prompt-ultra.github.io\/<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s42001-025-00388-6","type":"journal-article","created":{"date-parts":[[2025,7,11]],"date-time":"2025-07-11T02:28:19Z","timestamp":1752200899000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Prompt selection matters: enhancing text annotations for social sciences with large language models"],"prefix":"10.1007","volume":"8","author":[{"given":"Louis","family":"Abraham","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Charles","family":"Arnal","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7958-0153","authenticated-orcid":false,"given":"Antoine","family":"Marie","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,7,11]]},"reference":[{"key":"388_CR1","doi-asserted-by":"publisher","first-page":"1821","DOI":"10.1016\/j.procs.2020.03.201","volume":"167","author":"MZ Ansari","year":"2020","unstructured":"Ansari, M. Z., Aziz, M. B., Siddiqui, M. O., Mehra, H., & Singh, K. P. (2020). Analysis of political sentiment orientations on twitter. Procedia Computer Science, 167, 1821\u20131828.","journal-title":"Procedia Computer Science"},{"key":"388_CR2","volume-title":"Introduction to machine learning","author":"E Alpaydin","year":"2010","unstructured":"Alpaydin, E. (2010). Introduction to machine learning (2nd ed.). 
The MIT Press.","edition":"2"},{"key":"388_CR3","doi-asserted-by":"publisher","DOI":"10.3102\/10769986241279927","author":"KL Anglin","year":"2024","unstructured":"Anglin, K. L., & Ventura, C. (2024). Automatic text classification with large language models: A review of openai for zero- and few-shot classification. Journal of Educational and Behavioral Statistics. https:\/\/doi.org\/10.3102\/10769986241279927","journal-title":"Journal of Educational and Behavioral Statistics"},{"key":"388_CR4","doi-asserted-by":"crossref","unstructured":"Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel, P., F, M., Rosso, P., & Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In  Proceedings of the 13th international workshop on semantic evaluation (pp. 54\u201363). Minneapolis, Minnesota, USA, Association for Computational Linguistics.","DOI":"10.18653\/v1\/S19-2007"},{"key":"388_CR5","doi-asserted-by":"crossref","unstructured":"Barbieri, F., Camacho-Collados, J., Espinosa-Anke, L., & Neves, L. (2020). TweetEval: Unified Benchmark and comparative evaluation for tweet classification. In  Proceedings of findings of EMNLP.","DOI":"10.18653\/v1\/2020.findings-emnlp.148"},{"key":"388_CR6","doi-asserted-by":"crossref","unstructured":"Baly, R., Martino, G. D. S., Glass, J., & Nakov, P. (2020) We can detect your bias: Predicting the political ideology of news articles. In  Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), EMNLP\u00a0\u201920. (pp. 4982\u20134991).","DOI":"10.18653\/v1\/2020.emnlp-main.404"},{"key":"388_CR7","unstructured":"Barbieri, F., Anke, L. E., & Camacho-Collados, J. (2022) XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond. In  Proceedings of the Thirteenth language resources and evaluation conference (pp. 258\u2013266). Marseille, France, June . 
European Language Resources Association."},{"key":"388_CR8","unstructured":"Battle, R., & Gollapudi, T. (2024). The unreasonable effectiveness of eccentric automatic prompts. ArXiv, arXiv:abs\/2402.10949"},{"key":"388_CR9","unstructured":"Barrett, L. F., Lewis, M., & Haviland-Jones, J. M. (2016).  Handbook of emotions, Fourth Edition. Psychology (The Guilford Press). Guilford Publications"},{"key":"388_CR10","first-page":"1877","volume-title":"Advances in neural information processing systems","author":"T Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., \u2026 Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877\u20131901). Curran Associates Inc."},{"key":"388_CR11","unstructured":"Chen, W., Koenig, S., & Dilkina, B. (2024). Reprompt: Planning by automatic prompt engineering for large language models agents. 06 ."},{"key":"388_CR12","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1007\/978-3-031-08473-7_35","volume-title":"Natural language processing and information systems","author":"TM Doan","year":"2022","unstructured":"Doan, T. M., Kille, B., & Gulla, J. A. (2022). Using language models for classifying the party affiliation of political texts. In P. Rosso, V. Basile, R. Mart\u00ednez, E. M\u00e9tais, & F. Meziane (Eds.), Natural language processing and information systems (pp. 382\u2013393). Springer International Publishing."},{"key":"388_CR13","doi-asserted-by":"crossref","unstructured":"Devatine, N., Muller, P., & Braud, C. (2023). An integrated approach for political bias prediction and explanation based on discursive structure. 
In  Findings of the association for computational linguistics (EACL 2023) (pp. 11196\u201311211). Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.findings-acl.711"},{"key":"388_CR14","unstructured":"Achiam, OpenAI Josh, & et\u00a0al. (2023) Gpt-4 technical report. ."},{"issue":"30","key":"388_CR15","doi-asserted-by":"publisher","first-page":"e2305016120","DOI":"10.1073\/pnas.2305016120","volume":"120","author":"F Gilardi","year":"2023","unstructured":"Gilardi, F., Alizadeh, M., & Kubli, M. (2023). Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120.","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"388_CR16","unstructured":"Gildenblat, J. (2023) A python library for confidence intervals. https:\/\/github.com\/jacobgil\/confidenceinterval"},{"issue":"280","key":"388_CR17","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1111\/j.1475-4932.2011.00782.x","volume":"88","author":"JS Gans","year":"2012","unstructured":"Gans, J. S., & Leigh, A. (2012). How partisan is the press? multiple measures of media slant*. Economic Record, 88(280), 127\u2013147.","journal-title":"Economic Record"},{"issue":"1","key":"388_CR18","doi-asserted-by":"publisher","first-page":"205316802412362","DOI":"10.1177\/20531680241236239","volume":"11","author":"M Heseltine","year":"2024","unstructured":"Heseltine, M., & Clemm von Hohenberg, B. (2024). Large language models as a substitute for human experts in annotating political text. Research & Politics, 11(1), 20531680241236240.","journal-title":"Research & Politics"},{"key":"388_CR19","doi-asserted-by":"crossref","unstructured":"He, X., Lin, Z., Gong, Y., Jin, A. L., Zhang, H., Lin, C., Jiao, J., Yiu, S. M., Duan, N., & Chen, W. (2024). AnnoLLM: Making large language models to be better crowdsourced annotators. In Y. Yang, A. Davani, A. Sil, & A. 
Kumar (Eds.),  Proceedings of the 2024 conference of the north American chapter of the association for computational linguistics: human language technologies) (Volume 6: Industry Track), pp. 165\u2013190. Mexico City, Mexico, June . Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.naacl-industry.15"},{"key":"388_CR20","unstructured":"Han, X., Zhao, W., Ding, N., Liu, Z., & Sun, M. (2021). Ptr: Prompt tuning with rules for text classification. ArXiv, arXiv: abs\/2105.11259"},{"key":"388_CR21","doi-asserted-by":"crossref","unstructured":"Jain, A. P, & Dandannavar, P. (2016). Application of machine learning techniques to sentiment analysis. In  2016 2nd international conference on applied and theoretical computing and communication technology (iCATccT) (pp. 628\u2013632).","DOI":"10.1109\/ICATCCT.2016.7912076"},{"key":"388_CR22","unstructured":"Jin, C., Peng, H., Zhao, S., Zhenting, W., Xu, W., Han, L., Zhao, J., Zhong, K., Rajasekaran, S., Metaxas, D. (2024). Apeer: Automatic prompt engineering enhances large language model reranking. 06."},{"key":"388_CR23","first-page":"22199","volume":"35","author":"T Kojima","year":"2022","unstructured":"Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199\u201322213.","journal-title":"Advances in neural information processing systems"},{"key":"388_CR24","doi-asserted-by":"crossref","unstructured":"Lampinen, A. K., Dasgupta, I., Chan, S. C. Y., Matthewson, K., Tessler, M. H., Creswell, A., McClelland, J. L., Wang, J. X. & Hill, F. (2022) Can language models learn from explanations in context? 
ArXiv arXiv:abs\/2204.02329.","DOI":"10.18653\/v1\/2022.findings-emnlp.38"},{"issue":"2","key":"388_CR25","doi-asserted-by":"publisher","first-page":"100017","DOI":"10.1016\/j.metrad.2023.100017","volume":"1","author":"Y Liu","year":"2023","unstructured":"Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z., Zihao, W., Zhao, L., Zhu, D., Li, X., Qiang, N., Shen, D., Liu, T., & Ge, B. (2023). Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, 1(2), 100017.","journal-title":"Meta-Radiology"},{"key":"388_CR26","doi-asserted-by":"crossref","unstructured":"Li, J.(2024). A comparative study on annotation quality of crowdsourcing and llm via label aggregation (pp. 6525\u20136529), 04.","DOI":"10.1109\/ICASSP48485.2024.10447803"},{"key":"388_CR27","unstructured":"Liberals vs Conservatives on Reddit. (2024). Retrieved 16 June 2024 from https:\/\/www.kaggle.com\/datasets\/neelgajare\/liberals-vs-conservatives-on-reddit-13000-posts."},{"key":"388_CR28","doi-asserted-by":"publisher","first-page":"103654","DOI":"10.1016\/j.artint.2021.103654","volume":"304","author":"R Liu","year":"2022","unstructured":"Liu, R., Jia, C., Wei, J., Guangxuan, X., & Vosoughi, S. (2022). Quantifying and alleviating political bias in language models. Artificial Intelligence, 304, 103654.","journal-title":"Artificial Intelligence"},{"key":"388_CR29","unstructured":"Li, L., Li, J., Chen, C., Gui, F., Yang, H., Yu, C., Wang, Z., Cai, J., Zhou, J., Shen, B., Qian, A., Chen, W., Xue, Z., Sun, L., He, L., Chen, H., Ding, K., Du, Z., Mu, F., & Dong, Y. (2024). Political-llm: Large language models in political science, 12"},{"key":"388_CR30","first-page":"1","volume":"13","author":"Q Li","year":"2022","unstructured":"Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P., & He, L. (2022). A survey on text classification: From traditional to deep learning. 
ACM Transactions on Intelligent Systems and Technology, 13, 1\u201341.","journal-title":"ACM Transactions on Intelligent Systems and Technology"},{"key":"388_CR31","doi-asserted-by":"crossref","unstructured":"Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). Semeval-2018 task 1: Affect in tweets. In  Proceedings of the 12th international workshop on semantic evaluation (pp. 1\u201317).","DOI":"10.18653\/v1\/S18-1001"},{"issue":"1","key":"388_CR32","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/s11127-023-01097-2","volume":"198","author":"F Motoki","year":"2024","unstructured":"Motoki, F., Neto, V. P., & Rodrigues, V. (2024). More human than human: Measuring chatgpt political bias. Public Choice, 198(1), 3\u201323.","journal-title":"Public Choice"},{"key":"388_CR33","doi-asserted-by":"crossref","unstructured":"Neethu, M.\u00a0S., & Rajasree, R. (2013). Sentiment analysis in twitter using machine learning techniques. In  2013 fourth international conference on computing, communications and networking technologies (ICCCNT) (pp. 1\u20135).","DOI":"10.1109\/ICCCNT.2013.6726818"},{"key":"388_CR34","unstructured":"OpenAI. Gpt-4 technical report. (2023)."},{"key":"388_CR35","doi-asserted-by":"crossref","unstructured":"Ollion, E., Shen, R. , Macanovic, A., & Chatelain, A. (2023) Chatgpt for text annotation? mind the hype!, 10 .","DOI":"10.31235\/osf.io\/x58kn"},{"key":"388_CR36","doi-asserted-by":"crossref","unstructured":"Pennacchiotti, M., & Popescu, A.-M. (2011). A machine learning approach to twitter user classification. In Proceedings of the international AAAI conference on web and social media (vol.5, pp. 281\u2013288).","DOI":"10.1609\/icwsm.v5i1.14139"},{"key":"388_CR37","unstructured":"Prompt engineering guide. 
Retrieved 01 June 2024 from https:\/\/www.promptingguide.ai\/techniques\/."},{"issue":"1","key":"388_CR38","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1017\/S0007123423000042","volume":"54","author":"SHR Rasmussen","year":"2024","unstructured":"Rasmussen, S. H. R., Bor, A., Osmundsen, M., & Petersen, M. B. (2024). \u2018Super-unsupervised\u2019 classification for labelling text: Online political hostility as an illustration. British Journal of Political Science, 54(1), 179\u2013200.","journal-title":"British Journal of Political Science"},{"key":"388_CR39","doi-asserted-by":"crossref","unstructured":"Rosenthal, S., Farra, N., & Nakov, P. (2017). Semeval-2017 task 4: Sentiment analysis in twitter. In  Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017) (pp. 502\u2013518).","DOI":"10.18653\/v1\/S17-2088"},{"key":"388_CR40","doi-asserted-by":"crossref","unstructured":"Shum, K., Diao, S., & Zhang, T. (2023). Automatic prompt augmentation and selection with chain-of-thought from labeled data. In H. Bouamor, J. Pino, & K. Bali (Eds.),  Findings of the association for computational linguistics: EMNLP 2023 (pp. 12113\u201312139). Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.findings-emnlp.811"},{"key":"388_CR41","first-page":"21","volume-title":"Annotating and identifying emotions in text","author":"C Strapparava","year":"2010","unstructured":"Strapparava, C., & Mihalcea, R. (2010). Annotating and identifying emotions in text (pp. 21\u201338). Springer."},{"key":"388_CR42","doi-asserted-by":"crossref","unstructured":"Sahoo, P., Singh, A., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A systematic survey of prompt engineering in large language models: Techniques and applications. 02 .","DOI":"10.1007\/979-8-8688-0569-1_4"},{"key":"388_CR43","unstructured":"T\u00f6rnberg, P. (2023). Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. 
arXiv preprint arXiv:2304.06588."},{"key":"388_CR44","doi-asserted-by":"publisher","first-page":"03","DOI":"10.1007\/s10489-021-02635-5","volume":"52","author":"K Takahashi","year":"2022","unstructured":"Takahashi, K., Yamamoto, K., Kuchiba, A., & Koyama, T. (2022). Confidence interval for micro-averaged f1 and macro-averaged f1 scores. Applied Intelligence, 52, 03.","journal-title":"Applied Intelligence"},{"key":"388_CR45","unstructured":"Vatsal, S., & Dubey, H. (2024). A survey of prompt engineering methods in large language models for different nlp tasks. 07 ."},{"key":"388_CR46","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., & Uszkoreit, J. (2017). Llion Jones, Aidan N. Gomez: Lukasz Kaiser, and Illia Polosukhin. Attention is all you need."},{"key":"388_CR47","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1146\/annurev-polisci-052615-025542","volume":"20","author":"J Wilkerson","year":"2017","unstructured":"Wilkerson, J., & Casas, A. (2017). Large-scale computerized text analysis in political science: Opportunities and challenges. Annual Review of Political Science, 20, 529\u2013544.","journal-title":"Annual Review of Political Science"},{"key":"388_CR48","unstructured":"Weber, M., & Reichardt, M. (2023). Evaluation is all you need. Prompting generative large language models for annotation tasks in the social sciences. A primer using open models. 12 ."},{"key":"388_CR49","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H, Xia, F., Le, Q., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. ArXiv ArXiv: abs\/2201.11903"},{"key":"388_CR50","unstructured":"Ye, J., Chen, X., Xu, N., Zu, C., Shao, Z., Liu, S. , Cui, Y., Zhou, Z., Gong, C., Shen, Y., Zhou, J., Chen, S., Gui, T., Zhang, Q., & Huang, X. (2023). A comprehensive capability analysis of gpt-3 and gpt-3.5 series models. ArXiv ArXiv: abs\/2303.10420 ."},{"key":"388_CR51","unstructured":"Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. 
V., Zhou, D., & Chen, X. Large language models as optimizers (2024)."},{"issue":"1","key":"388_CR52","doi-asserted-by":"publisher","first-page":"e12","DOI":"10.1016\/S2589-7500(23)00225-X","volume":"6","author":"T Zack","year":"2024","unstructured":"Zack, T., Lehman, E., Suzgun, M., Rodriguez, J. A., Celi, L. A., Gichoya, J., Jurafsky, D., Szolovits, P., Bates, D. W., Abdulnour, R.-E.E., et al. (2024). Assessing the potential of gpt-4 to perpetuate racial and gender biases in health care: A model evaluation study. The Lancet Digital Health, 6(1), e12\u2013e22.","journal-title":"The Lancet Digital Health"},{"key":"388_CR53","unstructured":"Zhou, Y., Muresanu, A., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large language models are human-level prompt engineers, 11"},{"key":"388_CR54","doi-asserted-by":"crossref","unstructured":"Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). In  Proceedings of the 13th international workshop on semantic evaluation (pp. 
75\u201386).","DOI":"10.18653\/v1\/S19-2010"}],"container-title":["Journal of Computational Social Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42001-025-00388-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42001-025-00388-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42001-025-00388-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T07:44:27Z","timestamp":1757144667000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42001-025-00388-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,11]]},"references-count":54,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["388"],"URL":"https:\/\/doi.org\/10.1007\/s42001-025-00388-6","relation":{},"ISSN":["2432-2717","2432-2725"],"issn-type":[{"value":"2432-2717","type":"print"},{"value":"2432-2725","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,11]]},"assertion":[{"value":"6 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 April 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 July 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declared that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of 
interest"}}],"article-number":"73"}}