{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T14:03:48Z","timestamp":1772114628549,"version":"3.50.1"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T00:00:00Z","timestamp":1725408000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T00:00:00Z","timestamp":1725408000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003246","name":"Nederlandse Organisatie voor Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","award":["06.DI.19.059"],"award-info":[{"award-number":["06.DI.19.059"]}],"id":[{"id":"10.13039\/501100003246","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Minds & Machines"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>To prevent ordinary people from being harmed by natural language processing (NLP) technology, finding ways to measure the extent to which a language model is biased (e.g., regarding gender) has become an active area of research. One popular class of NLP bias measures are bias benchmark datasets\u2014collections of test items that are meant to assess a language model\u2019s preference for stereotypical versus non-stereotypical language. In this paper, we argue that such bias benchmarks should be assessed with models from the psychometric framework of item response theory (IRT). Specifically, we tie an introduction to basic IRT concepts and models with a discussion of how they could be relevant to the evaluation, interpretation and improvement of bias benchmark datasets. 
Regarding evaluation, IRT provides us with methodological tools for assessing the quality of both individual test items (e.g., the extent to which an item can differentiate highly biased from less biased language models) as well as benchmarks as a whole (e.g., the extent to which the benchmark allows us to assess not only severe but also subtle levels of model bias). Through such diagnostic tools, the quality of benchmark datasets could be improved, for example by deleting or reworking poorly performing items. Finally, in regards to interpretation, we argue that IRT models\u2019 estimates for language model bias are conceptually superior to traditional accuracy-based evaluation metrics, as the former take into account more information than just whether or not a language model provided a biased response.<\/jats:p>","DOI":"10.1007\/s11023-024-09695-9","type":"journal-article","created":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T10:02:38Z","timestamp":1725444158000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement"],"prefix":"10.1007","volume":"34","author":[{"given":"Dominik","family":"Bachmann","sequence":"first","affiliation":[]},{"given":"Oskar","family":"van der Wal","sequence":"additional","affiliation":[]},{"given":"Edita","family":"Chvojka","sequence":"additional","affiliation":[]},{"given":"Willem H.","family":"Zuidema","sequence":"additional","affiliation":[]},{"given":"Leendert","family":"van Maanen","sequence":"additional","affiliation":[]},{"given":"Katrin","family":"Schulz","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,9,4]]},"reference":[{"issue":"2","key":"9695_CR1","first-page":"291","volume":"5","author":"M Akour","year":"2013","unstructured":"Akour, M., & Al-Omari, H. (2013). Empirical investigation of the stability of IRT item-parameters estimation. 
International Online Journal of Educational Sciences, 5(2), 291\u2013301.","journal-title":"International Online Journal of Educational Sciences"},{"key":"9695_CR2","doi-asserted-by":"crossref","unstructured":"Amidei, J., Piwek, P., & Willis, A. (2020). Identifying annotator bias: a new IRT-based method for bias identification. In Proceedings of the 28th international conference on computational linguistics (pp. 4787\u20134797). https:\/\/aclanthology.org\/2020.coling-main.421\/","DOI":"10.18653\/v1\/2020.coling-main.421"},{"issue":"1","key":"9695_CR3","doi-asserted-by":"publisher","first-page":"44","DOI":"10.26407\/2018JRTDD.1.6","volume":"1","author":"L Anunciacao","year":"2018","unstructured":"Anunciacao, L. (2018). An overview of the history and methodological aspects of psychometrics: History and methodological aspects of psychometrics. Journal for ReAttach Therapy and Developmental Diversities, 1(1), 44\u201358.","journal-title":"Journal for ReAttach Therapy and Developmental Diversities"},{"key":"9695_CR4","unstructured":"Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., & Wolf, T. (2023). Open LLM leaderboard. https:\/\/huggingface.co\/spaces\/HuggingFaceH4\/open_llm_leaderboard"},{"key":"9695_CR5","doi-asserted-by":"publisher","unstructured":"Blodgett, S.\u00a0L., Barocas, S., Daum\u00e9 III, H., & Wallach, H. (2020). Language (technology) is power: a critical survey of \u201cbias\u201d in NLP. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 5454\u20135476). https:\/\/doi.org\/10.18653\/v1\/2020.acl-main.485","DOI":"10.18653\/v1\/2020.acl-main.485"},{"key":"9695_CR6","doi-asserted-by":"crossref","unstructured":"Blodgett, S.\u00a0L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: an inventory of pitfalls in fairness benchmark datasets. 
In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers, pp. 1004\u20131015). https:\/\/aclanthology.org\/2021.acl-long.81","DOI":"10.18653\/v1\/2021.acl-long.81"},{"key":"9695_CR7","unstructured":"Bommasani, R., Hudson, D.\u00a0A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.\u00a0S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J.\u00a0Q., Demszky, D., & Liang, P. (2022). On the opportunities and risks of foundation models. arxiv:abs\/2108.07258"},{"key":"9695_CR8","unstructured":"Bommasani, R., & Liang, P. (2022). Trustworthy social bias measurement. arxiv:abs\/2212.11672"},{"key":"9695_CR9","volume-title":"Multidimensional item response theory","author":"W Bonifay","year":"2019","unstructured":"Bonifay, W. (2019). Multidimensional item response theory. Sage Publications."},{"key":"9695_CR10","unstructured":"Brown, T.\u00a0B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.\u00a0M., Wu, J., Winter, C., & Amodei, D. (2020). Language models are few-shot learners. arxiv:abs\/2005.14165"},{"key":"9695_CR11","doi-asserted-by":"crossref","unstructured":"Byrd, M., & Srivastava, S. (2022). Predicting difficulty and discrimination of natural language questions. In Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 2: Short Papers, pp. 119\u2013130). 
https:\/\/aclanthology.org\/2022.acl-short.15\/","DOI":"10.18653\/v1\/2022.acl-short.15"},{"key":"9695_CR12","doi-asserted-by":"publisher","first-page":"297","DOI":"10.1146\/annurev-statistics-041715-033702","volume":"3","author":"L Cai","year":"2016","unstructured":"Cai, L., Choi, K., Hansen, M., & Harrell, L. (2016). Item response theory. Annual Review of Statistics and Its Application, 3, 297\u2013321. https:\/\/doi.org\/10.1146\/annurev-statistics-041715-033702","journal-title":"Annual Review of Statistics and Its Application"},{"key":"9695_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v048.i06","volume":"48","author":"RP Chalmers","year":"2012","unstructured":"Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1\u201329. https:\/\/doi.org\/10.18637\/jss.v048.i06","journal-title":"Journal of Statistical Software"},{"key":"9695_CR14","volume-title":"Introduction to classical and modern test theory","author":"L Crocker","year":"1986","unstructured":"Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. ERIC."},{"key":"9695_CR15","unstructured":"D\u2019Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M.\u00a0D., Hormozdiari, F., Houlsby, N., Hou, S., Jerfel, G., Karthikesalingam, A., Lucic, M., Ma, Y., McLean, C., Mincu, D., & Sculley, D. (2020). Underspecification presents challenges for credibility in modern machine learning. arxiv:abs\/2011.03395v2"},{"key":"9695_CR16","volume-title":"The theory and practice of item response theory","author":"RJ De Ayala","year":"2013","unstructured":"De Ayala, R. J. (2013). The theory and practice of item response theory. 
Guilford Publications."},{"key":"9695_CR17","doi-asserted-by":"publisher","DOI":"10.1201\/9781315200620","volume-title":"An introduction to the Rasch model with examples in R","author":"R Debelak","year":"2022","unstructured":"Debelak, R., Strobl, C., & Zeigenfuse, M. D. (2022). An introduction to the Rasch model with examples in R. CRC Press."},{"issue":"4","key":"9695_CR18","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1080\/15305058.2013.799067","volume":"13","author":"CE DeMars","year":"2013","unstructured":"DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13(4), 354\u2013378. https:\/\/doi.org\/10.1080\/15305058.2013.799067","journal-title":"International Journal of Testing"},{"key":"9695_CR19","doi-asserted-by":"crossref","unstructured":"Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., & Gupta, R. (2021). BOLD: dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 862\u2013872). https:\/\/dl.acm.org\/doi\/10.1145\/3442188.3445924","DOI":"10.1145\/3442188.3445924"},{"key":"9695_CR20","doi-asserted-by":"crossref","unstructured":"Du, Y., Fang, Q., & Nguyen, D. (2021). Assessing the reliability of word embedding gender bias measures. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 10012\u201310034). https:\/\/aclanthology.org\/2021.emnlp-main.785","DOI":"10.18653\/v1\/2021.emnlp-main.785"},{"key":"9695_CR21","doi-asserted-by":"crossref","unstructured":"Ethayarajh, K. (2020). Is your classifier actually biased? Measuring fairness under uncertainty with Bernstein bounds. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 2914\u20132919). 
https:\/\/www.aclweb.org\/anthology\/2020.acl-main.262","DOI":"10.18653\/v1\/2020.acl-main.262"},{"key":"9695_CR22","unstructured":"Fang, Q., Oberski, D.\u00a0L., & Nguyen, D. (2024). PATCH\u2014psychometrics-assisted benchmarking of large language models: A case study of mathematics proficiency. arxiv:abs\/2404.01799"},{"key":"9695_CR23","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4419-0742-4","volume-title":"Introduction to Bayesian response modeling","author":"J-P Fox","year":"2010","unstructured":"Fox, J.-P. (2010). Introduction to Bayesian response modeling. Springer."},{"key":"9695_CR24","volume-title":"Psychometrics: An introduction","author":"RM Furr","year":"2021","unstructured":"Furr, R. M. (2021). Psychometrics: An introduction (4th ed.). SAGE Publications.","edition":"4"},{"issue":"5","key":"9695_CR25","doi-asserted-by":"publisher","first-page":"936","DOI":"10.1177\/0013164420987582","volume":"81","author":"P Gilholm","year":"2021","unstructured":"Gilholm, P., Mengersen, K., & Thompson, H. (2021). Bayesian hierarchical multidimensional item response modeling of small sample, sparse data for personalized developmental surveillance. Educational and Psychological Measurement, 81(5), 936\u2013956. https:\/\/doi.org\/10.1177\/0013164420987582","journal-title":"Educational and Psychological Measurement"},{"key":"9695_CR26","doi-asserted-by":"crossref","unstructured":"Goldfarb-Tarrant, S., Marchant, R., Mu\u00f1oz S\u00e1nchez, R., Pandya, M., & Lopez, A. (2021). Intrinsic bias metrics do not correlate with application bias. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers, pp. 1926\u20131940). 
https:\/\/aclanthology.org\/2021.acl-long.150","DOI":"10.18653\/v1\/2021.acl-long.150"},{"issue":"1","key":"9695_CR27","doi-asserted-by":"publisher","first-page":"300","DOI":"10.32614\/RJ-2020-014","volume":"12","author":"A Hladk\u00e1","year":"2020","unstructured":"Hladk\u00e1, A., & Martinkov\u00e1, P. (2020). difNLR: Generalized logistic regression models for DIF and DDF detection. R Journal, 12(1), 300\u2013323.","journal-title":"R Journal"},{"key":"9695_CR28","doi-asserted-by":"publisher","unstructured":"Jacobs, A.\u00a0Z., & Wallach, H. (2021). Measurement and fairness. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 375\u2013385). https:\/\/doi.org\/10.1145\/3442188.3445901","DOI":"10.1145\/3442188.3445901"},{"key":"9695_CR29","unstructured":"Jurafsky, D., & Martin, J.\u00a0H. (2023). Speech and language processing (3rd\u00a0edn.). https:\/\/web.stanford.edu\/~jurafsky\/slp3\/"},{"issue":"2","key":"9695_CR30","doi-asserted-by":"publisher","first-page":"1813","DOI":"10.1016\/j.compedu.2011.02.003","volume":"57","author":"S Klinkenberg","year":"2011","unstructured":"Klinkenberg, S., Straatemeier, M., & van der Maas, H. L. (2011). Computer adaptive practice of maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57(2), 1813\u20131824. https:\/\/doi.org\/10.1016\/j.compedu.2011.02.003","journal-title":"Computers & Education"},{"issue":"4","key":"9695_CR31","doi-asserted-by":"publisher","first-page":"311","DOI":"10.1177\/0146621619893786","volume":"44","author":"C K\u00f6nig","year":"2020","unstructured":"K\u00f6nig, C., Spoden, C., & Frey, A. (2020). An optimized Bayesian hierarchical two-parameter logistic model for small-sample item calibration. Applied Psychological Measurement, 44(4), 311\u2013326. 
https:\/\/doi.org\/10.1177\/0146621619893786","journal-title":"Applied Psychological Measurement"},{"key":"9695_CR32","doi-asserted-by":"crossref","unstructured":"Lalor, J.\u00a0P., Wu, H., & Yu, H. (2016). Building an evaluation scale using item response theory. In Proceedings of the conference on empirical methods in natural language processing. Conference on empirical methods in natural language processing (p. 648). https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC5167538\/","DOI":"10.18653\/v1\/D16-1062"},{"key":"9695_CR33","doi-asserted-by":"crossref","unstructured":"Lalor, J.\u00a0P., & Yu, H. (2020). Dynamic data selection for curriculum learning via ability estimation. In Proceedings of the conference on empirical methods in natural language processing. conference on empirical methods in natural language processing (p. 545). https:\/\/aclanthology.org\/2020.findings-emnlp.48\/","DOI":"10.18653\/v1\/2020.findings-emnlp.48"},{"key":"9695_CR34","doi-asserted-by":"crossref","unstructured":"Levy, S., Lazar, K., & Stanovsky, G. (2021). Collecting a large-scale gender bias dataset for coreference resolution and machine translation. Findings of the association for computational linguistics: EMNLP 2021 (pp. 2470\u20132480). https:\/\/aclanthology.org\/2021.findings-emnlp.211","DOI":"10.18653\/v1\/2021.findings-emnlp.211"},{"key":"9695_CR35","volume-title":"Applications of item response theory to practical testing problems","author":"FM Lord","year":"1980","unstructured":"Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge."},{"key":"9695_CR36","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-69218-0","volume-title":"Computerized adaptive and multistage testing with R: Using packages catR and mstR","author":"D Magis","year":"2017","unstructured":"Magis, D., Yan, D., & von Davier, A. A. (2017). Computerized adaptive and multistage testing with R: Using packages catR and mstR. 
Springer."},{"key":"9695_CR37","doi-asserted-by":"publisher","DOI":"10.1201\/9781003054313","volume-title":"Computational aspects of psychometric methods: With R","author":"P Martinkov\u00e1","year":"2023","unstructured":"Martinkov\u00e1, P., & Hladk\u00e1, A. (2023). Computational aspects of psychometric methods: With R. CRC Press."},{"key":"9695_CR38","doi-asserted-by":"crossref","unstructured":"Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers, pp. 5356\u20135371). https:\/\/aclanthology.org\/2021.acl-long.416","DOI":"10.18653\/v1\/2021.acl-long.416"},{"key":"9695_CR39","doi-asserted-by":"crossref","unstructured":"Nangia, N., Vania, C., Bhalerao, R., & Bowman, S.\u00a0R. (2020). CrowS-Pairs: a challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 1953\u20131967). https:\/\/aclanthology.org\/2020.emnlp-main.154","DOI":"10.18653\/v1\/2020.emnlp-main.154"},{"key":"9695_CR40","doi-asserted-by":"publisher","DOI":"10.4324\/9781351008167","volume-title":"Using R for item response theory model applications","author":"I Paek","year":"2019","unstructured":"Paek, I., & Cole, K. (2019). Using R for item response theory model applications. Routledge."},{"key":"9695_CR41","doi-asserted-by":"crossref","unstructured":"Parmar, M., Mishra, S., Geva, M., & Baral, C. (2023). Don\u2019t blame the annotator: Bias already starts in the annotation instructions. arxiv:org\/abs\/2205.00415","DOI":"10.18653\/v1\/2023.eacl-main.130"},{"key":"9695_CR42","unstructured":"Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.\u00a0M., & Bowman, S.\u00a0R. (2021). 
BBQ: A hand-built bias benchmark for question answering. arxiv:abs\/2110.08193v2"},{"key":"9695_CR43","unstructured":"Polo, F.\u00a0M., Weber, L., Choshen, L., Sun, Y., Xu, G., & Yurochkin, M. (2024). tinyBenchmarks: Evaluating LLMs with fewer examples. arxiv:abs\/2402.14992v1"},{"key":"9695_CR44","first-page":"59","volume":"4","author":"F Rijmen","year":"2011","unstructured":"Rijmen, F. (2011). Hierarchical factor item response theory models for PIRLS: Capturing clustering effects at multiple levels. IERI Monograph Series: Issues and Methodologies in Large-scale Assessments, 4, 59\u201374.","journal-title":"IERI Monograph Series: Issues and Methodologies in Large-scale Assessments"},{"key":"9695_CR45","doi-asserted-by":"crossref","unstructured":"Rodriguez, P., Barrow, J., Hoyle, A.\u00a0M., Lalor, J.\u00a0P., Jia, R., & Boyd-Graber, J. (2021). Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers, pp. 4486\u20134503). https:\/\/aclanthology.org\/2021.acl-long.346","DOI":"10.18653\/v1\/2021.acl-long.346"},{"key":"9695_CR46","doi-asserted-by":"crossref","unstructured":"Rodriguez, P., Htut, P.\u00a0M., Lalor, J.\u00a0P., & Sedoc, J. (2022). Clustering examples in multi-dataset benchmarks with item response theory. Proceedings of the Third Workshop on Insights from Negative Results in NLP (pp. 100\u2013112). https:\/\/aclanthology.org\/2022.insights-1.14\/","DOI":"10.18653\/v1\/2022.insights-1.14"},{"key":"9695_CR47","doi-asserted-by":"crossref","unstructured":"Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies (Volume 2 (Short Papers), pp. 8\u201314). 
https:\/\/aclanthology.org\/N18-2002","DOI":"10.18653\/v1\/N18-2002"},{"issue":"1","key":"9695_CR48","first-page":"321","volume":"17","author":"A \u015eahin","year":"2017","unstructured":"\u015eahin, A., & An\u0131l, D. (2017). The effects of test length and sample size on item parameters in item response theory. Educational Sciences: Theory & Practice, 17(1), 321\u2013335.","journal-title":"Educational Sciences: Theory & Practice"},{"key":"9695_CR49","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/BF03372160","volume":"34","author":"F Samejima","year":"1969","unstructured":"Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34, 1\u201397. https:\/\/doi.org\/10.1007\/BF03372160","journal-title":"Psychometrika"},{"key":"9695_CR50","doi-asserted-by":"crossref","unstructured":"Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N.\u00a0A. (2019). The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 1668\u20131678). https:\/\/aclanthology.org\/P19-1163","DOI":"10.18653\/v1\/P19-1163"},{"key":"9695_CR51","doi-asserted-by":"crossref","unstructured":"Stanovsky, G., Smith, N.\u00a0A., & Zettlemoyer, L. (2019). Evaluating gender bias in machine translation. Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 1679\u20131684). https:\/\/aclanthology.org\/P19-1164","DOI":"10.18653\/v1\/P19-1164"},{"key":"9695_CR52","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1613\/jair.1.15195","volume":"79","author":"O van der Wal","year":"2024","unstructured":"van der Wal, O., Bachmann, D., Leidinger, A., van Maanen, L., Zuidema, W., & Schulz, K. (2024). Undesirable biases in NLP: Addressing challenges of measurement. Journal of AI Research, 79, 1\u201340. 
https:\/\/doi.org\/10.1613\/jair.1.15195","journal-title":"Journal of AI Research"},{"key":"9695_CR53","doi-asserted-by":"crossref","unstructured":"Vania, C., Htut, P.\u00a0M., Huang, W., Mungra, D., Yuanzhe Pang, R., Phang, J., Liu, H., Cho, K., & Bowman, S.\u00a0R. (2021). Comparing test sets with item response theory. arxiv:abs\/2106.00840","DOI":"10.18653\/v1\/2021.acl-long.92"},{"key":"9695_CR54","unstructured":"Warm, T.\u00a0A. (1978). A primer of item response theory (tech. rep.). Coast Guard Washington DC. https:\/\/files.eric.ed.gov\/fulltext\/ED171730.pdf"},{"key":"9695_CR55","unstructured":"Webster, K., Wang, X., Tenney, I., Beutel, A., Pitler, E., Pavlick, E., Chen, J., Chi, E., & Petrov, S. (2021). Measuring and reducing gendered correlations in pre-trained models. arxiv:abs\/2010.06032"},{"key":"9695_CR56","unstructured":"White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., & Schmidt, D.\u00a0C. (2023). A prompt pattern catalog to enhance prompt engineering with ChatGPT. arxiv:abs\/2302.11382"},{"key":"9695_CR57","doi-asserted-by":"publisher","unstructured":"Wu, M., Tam, H.\u00a0P., & Jen, T.-H. (2016). Differential item function. In Educational measurement for applied researchers: Theory into practice (pp.\u00a0207\u2013225). Springer. https:\/\/doi.org\/10.1007\/978-981-10-3302-5_11","DOI":"10.1007\/978-981-10-3302-5_11"},{"key":"9695_CR58","unstructured":"Zhang, H., Sneyd, A., & Stevenson, M. (2020). Robustness and reliability of gender bias assessment in word embeddings: the role of base pairs. In Proceedings of the 1st conference of the Asia-Pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing (pp. 759\u2013769). https:\/\/aclanthology.org\/2020.aacl-main.76"},{"key":"9695_CR59","doi-asserted-by":"crossref","unstructured":"Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). 
Gender bias in coreference resolution: evaluation and debiasing methods. In Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies (Volume 2 (Short Papers), pp. 15\u201320). https:\/\/aclanthology.org\/N18-2003","DOI":"10.18653\/v1\/N18-2003"}],"container-title":["Minds and Machines"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11023-024-09695-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11023-024-09695-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11023-024-09695-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,15]],"date-time":"2024-11-15T07:09:36Z","timestamp":1731654576000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11023-024-09695-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,4]]},"references-count":59,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["9695"],"URL":"https:\/\/doi.org\/10.1007\/s11023-024-09695-9","relation":{},"ISSN":["1572-8641"],"issn-type":[{"value":"1572-8641","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,4]]},"assertion":[{"value":"1 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 August 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 September 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}}],"article-number":"37"}}