{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T02:10:06Z","timestamp":1750299006090,"version":"3.41.0"},"reference-count":129,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,12,1]],"date-time":"2024-12-01T00:00:00Z","timestamp":1733011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGIR Forum"],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:p>More and more benchmarks, datasets, and evaluation tasks are becoming available. This is extremely useful for the community because it enables researchers and practitioners to test and evaluate new techniques. However, the construction, evaluation, and maintenance of data sets and benchmarks is opaque which creates problems with respect to stability and true representations. Our position is that we need to revisit how we design and implement benchmarks. The SPEC benchmark offers interesting perspectives that our community should consider. We use a data set of influential papers and resources to discuss important benchmark aspects such as realistic workloads, reliability, validity, leakage, and labeling. 
We conclude by proposing a list of principles for constructing evaluation benchmarks.<\/jats:p>","DOI":"10.1145\/3722449.3722467","type":"journal-article","created":{"date-parts":[[2025,3,6]],"date-time":"2025-03-06T17:24:20Z","timestamp":1741281860000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Evaluating the Evaluations: A Perspective on Benchmarks"],"prefix":"10.1145","volume":"58","author":[{"given":"Omar","family":"Alonso","sequence":"first","affiliation":[{"name":"Amazon, Palo Alto, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kenneth","family":"Church","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,3,6]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1145\/290941.290954","volume-title":"Proc. of SIGIR","author":"Allan James","year":"1998","unstructured":"James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proc. of SIGIR, pages 37--45, 1998."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130348.3130350"},{"key":"e_1_2_1_3_1","volume-title":"The Practice of Crowdsourcing. Synthesis Lectures on Information Concepts, Retrieval, and Services","author":"Alonso Omar","year":"2019","unstructured":"Omar Alonso. The Practice of Crowdsourcing. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, 2019."},{"key":"e_1_2_1_4_1","first-page":"1053","volume-title":"Information Processing & Management","volume":"48","author":"Alonso Omar","year":"2012","unstructured":"Omar Alonso and Stefano Mizzaro. Using crowdsourcing for TREC relevance assessment. 
In Information Processing & Management, volume 48, pages 1053--1066, 2012."},{"key":"e_1_2_1_5_1","first-page":"4211","volume-title":"LREC","author":"Ardila Rosana","year":"2020","unstructured":"Rosana Ardila, Megan Branson, Kelly Davis, et al. Common Voice: A massively-multilingual speech corpus. In LREC, pages 4211--4215, 2020."},{"key":"e_1_2_1_6_1","volume-title":"MS Marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268","author":"Bajaj Payal","year":"2016","unstructured":"Payal Bajaj, Daniel Campos, Nick Craswell, et al. MS Marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016."},{"key":"e_1_2_1_7_1","first-page":"11","volume-title":"Proc. of SIGIR","author":"Belew Richard","year":"1989","unstructured":"Richard Belew. Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents. In Proc. of SIGIR, pages 11--20, 1989."},{"key":"e_1_2_1_8_1","first-page":"219","volume-title":"Proc. of SIGIR","volume":"51","author":"Berger Adam","year":"1999","unstructured":"Adam Berger and John Lafferty. Information retrieval as statistical translation. In Proc. of SIGIR, volume 51, pages 219--226, 1999."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1145\/290941.290972","volume-title":"Proc. of SIGIR","author":"Bharat Krishna","year":"1998","unstructured":"Krishna Bharat and Monika Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. of SIGIR, pages 104--111, 1998."},{"key":"e_1_2_1_10_1","first-page":"2206","volume-title":"International conference on machine learning","author":"Borgeaud Sebastian","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206--2240. 
PMLR, 2022."},{"key":"e_1_2_1_11_1","first-page":"66","article-title":"Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references","author":"Bornmann Lutz","year":"2014","unstructured":"Lutz Bornmann and R\u00fcdiger Mutz. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66, 2014.","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"1","key":"e_1_2_1_12_1","first-page":"1","article-title":"Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases","volume":"8","author":"Bornmann Lutz","year":"2021","unstructured":"Lutz Bornmann, Robin Haunschild, and R\u00fcdiger Mutz. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications, 8(1):1--15, 2021.","journal-title":"Humanities and Social Sciences Communications"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","first-page":"901","DOI":"10.1613\/jair.1.14388","article-title":"A general model for aggregating annotations across simple, complex, and multi-object annotation tasks","volume":"78","author":"Braylan Alexander","year":"2023","unstructured":"Alexander Braylan, Madalyn Marabella, Omar Alonso, and Matthew Lease. A general model for aggregating annotations across simple, complex, and multi-object annotation tasks. J. Artif. Intell. Res., 78:901--973, 2023.","journal-title":"J. Artif. Intell. Res."},{"key":"e_1_2_1_14_1","first-page":"21","volume-title":"Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)","author":"Broder Andrei","year":"1997","unstructured":"Andrei Broder. On the resemblance and containment of documents. Proceedings. 
Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pages 21--29, 1997."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/792550.792552"},{"key":"e_1_2_1_16_1","volume-title":"Delphic costs and benefits in web search: A utilitarian and historical analysis","author":"Broder Andrei","year":"2023","unstructured":"Andrei Broder and Preston McAfee. Delphic costs and benefits in web search: A utilitarian and historical analysis, 2023. URL https:\/\/arxiv.org\/abs\/2308.07525."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.1987.1663532"},{"key":"e_1_2_1_18_1","volume-title":"Pearson Education","author":"Brooks Frederick P.","year":"1995","unstructured":"Frederick P. Brooks. The mythical man-month: essays on software engineering. Pearson Education, 1995."},{"key":"e_1_2_1_19_1","volume-title":"NeurIPS","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. NeurIPS, 2020."},{"key":"e_1_2_1_20_1","first-page":"235","volume-title":"SIGIR Forum","volume":"51","author":"Buckley Chris","year":"2000","unstructured":"Chris Buckley and Ellen Voorhees. Evaluating evaluation measure stability. In SIGIR Forum, volume 51, pages 235--242, 2000."},{"key":"e_1_2_1_21_1","first-page":"69","volume-title":"Text Retrieval Conference","author":"Buckley Chris","year":"1994","unstructured":"Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. Automatic query expansion using SMART: TREC 3. In Text Retrieval Conference, pages 69--80, 1994."},{"key":"e_1_2_1_22_1","first-page":"21","volume-title":"Proc. of SIGIR","author":"Callan Jamie","year":"1995","unstructured":"Jamie Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proc. 
of SIGIR, pages 21--28, 1995."},{"issue":"3","key":"e_1_2_1_23_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2856127","article-title":"A survey revisited","volume":"48","author":"Calzarossa Maria Carla","year":"2016","unstructured":"Maria Carla Calzarossa, Luisa Massari, and Daniele Tessera. Workload characterization: A survey revisited. ACM Computing Surveys, 48(3):1--43, 2016.","journal-title":"ACM Computing Surveys"},{"key":"e_1_2_1_24_1","volume-title":"Callhome american english speech","author":"Canavan Alexandra","year":"1997","unstructured":"Alexandra Canavan, David Graff, and George Zipperlen. Callhome american english speech. Linguistic Data Consortium, 1997."},{"key":"e_1_2_1_25_1","first-page":"209","volume-title":"SIGIR Forum","volume":"51","author":"Carbonell Jaime","year":"1998","unstructured":"Jaime Carbonell and Jade Goldstein-Stewart. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR Forum, volume 51, pages 209--210, 1998."},{"key":"e_1_2_1_26_1","volume-title":"Proc. of ICLR","author":"Carlini Nicholas","year":"2023","unstructured":"Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. Proc. of ICLR, 2023."},{"key":"e_1_2_1_27_1","first-page":"00","volume-title":"NAACL","author":"Charniak Eugene","year":"2000","unstructured":"Eugene Charniak. A maximum-entropy-inspired parser. In NAACL, 2000. URL https:\/\/aclanthology.org\/A00-2018."},{"key":"e_1_2_1_28_1","volume-title":"PaLM: Scaling language modeling with pathways. ArXiv, abs\/2204.02311","author":"Chowdhery Aakanksha","year":"2022","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling language modeling with pathways. 
ArXiv, abs\/2204.02311, 2022."},{"key":"e_1_2_1_29_1","first-page":"1","volume-title":"Natural Language Engineering","author":"Church Kenneth","year":"2024","unstructured":"Kenneth Church. Emerging trends: When can users trust GPT, and when should they intervene? Natural Language Engineering, pages 1--11, 2024."},{"issue":"6","key":"e_1_2_1_30_1","doi-asserted-by":"crossref","first-page":"1323","DOI":"10.1017\/S1351324924000068","article-title":"evaluating general purpose foundation models","volume":"30","author":"Church Kenneth","year":"2024","unstructured":"Kenneth Church and Omar Alonso. Emerging trends: evaluating general purpose foundation models. Natural Language Engineering, 30(6):1323--1335, 2024.","journal-title":"Natural Language Engineering"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324922000043"},{"key":"e_1_2_1_32_1","volume-title":"Ian Soboroff. Overview of the TREC 2009 web track. In Text Retrieval Conference","author":"Clarke Charles L. A.","year":"2009","unstructured":"Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. Overview of the TREC 2009 web track. In Text Retrieval Conference, 2009."},{"key":"e_1_2_1_33_1","first-page":"16","volume-title":"EACL","author":"Collins Michael","year":"1997","unstructured":"Michael Collins. Three generative, lexicalised models for statistical parsing. In EACL, pages 16--23. ACL, 1997."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103322753356"},{"key":"e_1_2_1_35_1","volume-title":"Word translation without parallel data. arXiv preprint arXiv:1710.04087","author":"Conneau Alexis","year":"2017","unstructured":"Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, et al. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"e_1_2_1_37_1","volume-title":"Ian Soboroff. Overview of the TREC 2005 enterprise track. 
In Text Retrieval Conference","author":"Craswell Nick","year":"2005","unstructured":"Nick Craswell, Arjen de Vries, and Ian Soboroff. Overview of the TREC 2005 enterprise track. In Text Retrieval Conference, 2005."},{"key":"e_1_2_1_38_1","volume-title":"Ellen Voorhees. Overview of the TREC 2019 deep learning track. In arXiv.org","volume":"2003","author":"Craswell Nick","year":"2020","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen Voorhees. Overview of the TREC 2019 deep learning track. In arXiv.org, volume abs\/2003.07820, 2020."},{"key":"e_1_2_1_39_1","volume-title":"Ellen Voorhees. Overview of the TREC 2020 deep learning track. In Text Retrieval Conference","volume":"2102","author":"Craswell Nick","year":"2021","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen Voorhees. Overview of the TREC 2020 deep learning track. In Text Retrieval Conference, volume abs\/2102.07662, 2021."},{"key":"e_1_2_1_40_1","first-page":"148","volume-title":"Proc. of SIGIR","volume":"51","author":"Cutting Douglas R.","year":"1992","unstructured":"Douglas R. Cutting, Jan O. Pedersen, David R. Karger, and John W. Tukey. Scatter\/gather: a cluster-based approach to browsing large document collections. In Proc. of SIGIR, volume 51, pages 148--159, 1992."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2211477"},{"key":"e_1_2_1_42_1","first-page":"4171","volume-title":"NAACL","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171--4186. ACL, 2019."},{"key":"e_1_2_1_43_1","first-page":"7","article-title":"Overview of the SPEC benchmarks","author":"Dixit Kaivalya M","year":"1993","unstructured":"Kaivalya M Dixit. Overview of the SPEC benchmarks. 
The Benchmark Handbook, 7, 1993.","journal-title":"The Benchmark Handbook"},{"key":"e_1_2_1_44_1","first-page":"1761","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Dong Xuanyi","year":"2019","unstructured":"Xuanyi Dong and Yezhou Yang. Searching for a robust neural architecture in four gpu hours. IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 1761--1770, 2019."},{"key":"e_1_2_1_45_1","volume-title":"Proc. of SIGIR","author":"Fagan Joel","year":"1987","unstructured":"Joel Fagan. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. In Proc. of SIGIR, 1987."},{"key":"e_1_2_1_46_1","first-page":"39","volume-title":"Proc. of ICTIR","author":"Faggioli Guglielmo","year":"2023","unstructured":"Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, et al. Perspectives on large language models for relevance judgment. In Proc. of ICTIR, pages 39--50, 2023."},{"key":"e_1_2_1_47_1","volume-title":"MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP","author":"Fisch Adam","year":"2019","unstructured":"Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, et al. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP, 2019."},{"key":"e_1_2_1_48_1","volume-title":"Houghton Mifflin","author":"Francis Winthrop Nelson","year":"1982","unstructured":"Winthrop Nelson Francis and Henry Ku\u010dera. Frequency analysis of English usage: Lexicon and grammar. Houghton Mifflin, 1982."},{"key":"e_1_2_1_49_1","first-page":"465","volume-title":"Proc. of SIGIR","author":"Furnas George","year":"1988","unstructured":"George Furnas, Scott Deerwester, Susan Dumais, Thomas Landauer, et al. 
Information retrieval using a singular value decomposition model of latent semantic structure. In Proc. of SIGIR, pages 465--480, 1988."},{"key":"e_1_2_1_50_1","volume-title":"et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027","author":"Gao Leo","year":"2020","unstructured":"Leo Gao, Stella Biderman, Sid Black, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020."},{"issue":"4","key":"e_1_2_1_51_1","doi-asserted-by":"crossref","first-page":"901","DOI":"10.1109\/TKDE.2016.2518669","article-title":"Challenges in data crowdsourcing","volume":"28","author":"Garcia-Molina Hector","year":"2016","unstructured":"Hector Garcia-Molina, Manas Joglekar, Adam Marcus, Aditya G. Parameswaran, and Vasilis Verroios. Challenges in data crowdsourcing. IEEE Trans. Knowl. Data Eng., 28(4):901--911, 2016.","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"e_1_2_1_52_1","first-page":"1","volume-title":"Text Retrieval Conference","author":"Garofolo John S.","year":"2000","unstructured":"John S. Garofolo, Cedric G. P. Auzanne, and Ellen Voorhees. The TREC spoken document retrieval track: A success story. In Text Retrieval Conference, pages 1--20, 2000."},{"key":"e_1_2_1_53_1","volume-title":"Garofolo et al. TIMIT acoustic-phonetic continuous speech corpus","author":"John","year":"1983","unstructured":"John S. Garofolo et al. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia, 1983."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1992.225858"},{"key":"e_1_2_1_55_1","first-page":"910","volume-title":"J. Assoc. Inf. Sci. Technol.","volume":"58","author":"Gonz\u00e1lez Jos\u00e9","year":"2007","unstructured":"Jos\u00e9 Gonz\u00e1lez and Jaime G\u00f3mez. TREC: Experiment and evaluation in information retrieval. In J. Assoc. Inf. Sci. 
Technol., volume 58, pages 910--911, 2007."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00474"},{"key":"e_1_2_1_57_1","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1109\/CAIA.1995.378787","volume-title":"Proc. of the 11th Conference on Artificial Intelligence for Applications","author":"Hammond Kristian J.","year":"1995","unstructured":"Kristian J. Hammond, R. Burke, C. Martin, and Steven L. Lytinen. FAQ finder: a case-based approach to knowledge navigation. Proc. of the 11th Conference on Artificial Intelligence for Applications, pages 80--86, 1995."},{"key":"e_1_2_1_58_1","first-page":"321","volume-title":"Proc. of SIGIR","author":"Harman Donna","year":"1988","unstructured":"Donna Harman. Towards interactive query expansion. In Proc. of SIGIR, pages 321--331, 1988."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_60_1","doi-asserted-by":"crossref","first-page":"37","DOI":"10.18653\/v1\/W18-2605","volume-title":"Proc. of the Workshop on Machine Reading for Question Answering","author":"He Wei","year":"2018","unstructured":"Wei He, Kai Liu, Jing Liu, et al. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Eunsol Choi, Minjoon Seo, Danqi Chen, Robin Jia, and Jonathan Berant, editors, Proc. of the Workshop on Machine Reading for Question Answering, pages 37--46. ACL, 2018."},{"key":"e_1_2_1_61_1","volume-title":"Computer Architecture: A Quantitative Approach","author":"Hennessy John","year":"2012","unstructured":"John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2012."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/312624.312682"},{"key":"e_1_2_1_63_1","volume-title":"Roberts. TREC 2007 genomics track overview. In Text Retrieval Conference","author":"Hersh William R.","year":"2007","unstructured":"William R. Hersh, Aaron M. Cohen, Lynn Ruslen, and Phoebe M. Roberts. 
TREC 2007 genomics track overview. In Text Retrieval Conference, 2007."},{"key":"e_1_2_1_64_1","first-page":"211","volume-title":"SIGIR Forum","volume":"51","author":"Hofmann Thomas","year":"1999","unstructured":"Thomas Hofmann. Probabilistic latent semantic indexing. In SIGIR Forum, volume 51, pages 211--218, 1999."},{"key":"e_1_2_1_65_1","volume-title":"Text Retrieval Conference","author":"Jaleel Nasreen Abdul","year":"2004","unstructured":"Nasreen Abdul Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah S. Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. Umass at TREC 2004: Novelty and hard. In Text Retrieval Conference, 2004."},{"key":"e_1_2_1_66_1","first-page":"62","volume-title":"SIGIR Forum","volume":"51","author":"Jones Karen Sp\u00e4rck","year":"1988","unstructured":"Karen Sp\u00e4rck Jones. A look back and a look forward. In SIGIR Forum, volume 51, pages 62--78, 1988."},{"key":"e_1_2_1_67_1","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/345508.345545","volume-title":"Proc. of SIGIR","author":"J\u00e4rvelin Kalervo","year":"2000","unstructured":"Kalervo J\u00e4rvelin and Jaana Kek\u00e4l\u00e4inen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR, pages 41--48, 2000."},{"key":"e_1_2_1_68_1","first-page":"106","volume-title":"Proc. of SIGIR","author":"Katzer Jeffrey","year":"1983","unstructured":"Jeffrey Katzer, Judith Tessier, William Frakes, and Padmini Das-Gupta. A study of the overlap among document representations. In Proc. of SIGIR, page 106--114. ACM, 1983."},{"key":"e_1_2_1_69_1","first-page":"284","volume-title":"ACL","author":"Khandelwal Urvashi","year":"2018","unstructured":"Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. In ACL, pages 284--294. 
ACL, 2018."},{"key":"e_1_2_1_70_1","volume-title":"Content analysis: An introduction to its methodology","author":"Krippendorff Klaus","year":"2018","unstructured":"Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications, 2018."},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_2_1_72_1","volume-title":"Learning multiple layers of features from tiny images. https:\/\/www.cs.utoronto.ca\/~kriz\/learning-features-2009-TR.pdf","author":"Krizhevsky Alex","year":"2009","unstructured":"Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. https:\/\/www.cs.utoronto.ca\/~kriz\/learning-features-2009-TR.pdf, 2009."},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276"},{"key":"e_1_2_1_74_1","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1145\/383952.383970","volume-title":"Proc. of SIGIR","author":"Lafferty John","year":"2001","unstructured":"John Lafferty and ChengXiang Zhai. Document language models, query models, and risk minimization for information retrieval. In Proc. of SIGIR, pages 111--119, 2001."},{"key":"e_1_2_1_75_1","first-page":"260","volume-title":"Proc. of SIGIR","volume":"51","author":"Lavrenko Victor","year":"2001","unstructured":"Victor Lavrenko and W. Bruce Croft. Relevance-based language models. In Proc. of SIGIR, volume 51, pages 260--267, 2001."},{"key":"e_1_2_1_76_1","volume-title":"MNIST handwritten digit database. ATT Labs [Online]. Available: http:\/\/yann.lecun.com\/exdb\/mnist, 2","author":"LeCun Yann","year":"2010","unstructured":"Yann LeCun, Corinna Cortes, and Christopher Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http:\/\/yann.lecun.com\/exdb\/mnist, 2, 2010."},{"key":"e_1_2_1_77_1","first-page":"8424","volume-title":"ACL","author":"Lee Katherine","year":"2022","unstructured":"Katherine Lee, Daphne Ippolito, Andrew Nystrom, et al. 
Deduplicating training data makes language models better. In ACL, pages 8424--8445. ACL, 2022."},{"key":"e_1_2_1_78_1","volume-title":"Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361--397","author":"Lewis David D","year":"2004","unstructured":"David D Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361--397, 2004."},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.3115\/1072228.1072378"},{"key":"e_1_2_1_80_1","first-page":"142","volume-title":"ACL","author":"Maas Andrew L.","year":"2011","unstructured":"Andrew L. Maas, Raymond E. Daly, Peter T. Pham, et al. Learning word vectors for sentiment analysis. In ACL, pages 142--150. ACL, 2011."},{"key":"e_1_2_1_81_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-02296-8","volume-title":"On the efficient determination of most near neighbors: horseshoes, hand grenades, web search and other situations when close is close enough","author":"Manasse Mark S.","year":"2015","unstructured":"Mark S. Manasse. On the efficient determination of most near neighbors: horseshoes, hand grenades, web search and other situations when close is close enough. Morgan & Claypool Publishers, 2015."},{"key":"e_1_2_1_82_1","volume-title":"Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5):935--948","author":"Manber Udi","year":"1993","unstructured":"Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5):935--948, 1993."},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.5555\/972470.972475"},{"key":"e_1_2_1_84_1","first-page":"1","volume-title":"Proc. of the IEEE Workload Characterization Symposium","author":"Mashey John","year":"2005","unstructured":"John Mashey. Summarizing performance is no mean feat [computer performance analysis]. 
In Proc. of the IEEE Workload Characterization Symposium, page 1, 2005."},{"key":"e_1_2_1_85_1","volume-title":"Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843","author":"Merity Stephen","year":"2016","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016."},{"key":"e_1_2_1_86_1","first-page":"20939","volume-title":"EMNLP","author":"Mohamed Youssef","year":"2024","unstructured":"Youssef Mohamed, Runjia Li, Ibrahim Said Ahmad, et al. No culture left behind: ArtELingo-28, a benchmark of WikiArt with captions in 28 languages. In EMNLP, pages 20939--20962. ACL, 2024."},{"key":"e_1_2_1_87_1","volume-title":"Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035","author":"Nasr Milad","year":"2023","unstructured":"Milad Nasr, Nicholas Carlini, Jonathan Hayase, et al. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023."},{"key":"e_1_2_1_88_1","volume-title":"Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages","author":"Nguyen Thuat","year":"2023","unstructured":"Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages, 2023."},{"key":"e_1_2_1_89_1","volume-title":"Ian Soboroff. Overview of the TREC 2011 microblog track. In Text Retrieval Conference","author":"Ounis Iadh","year":"2011","unstructured":"Iadh Ounis, Craig Macdonald, Jimmy J. Lin, and Ian Soboroff. Overview of the TREC 2011 microblog track. 
In Text Retrieval Conference, 2011."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_2_1_91_1","first-page":"40","volume-title":"Proc. of SIGIR","author":"Pejtersen Annelise M.","year":"1989","unstructured":"Annelise M. Pejtersen. A library system for information retrieval based on a cognitive task analysis and supported by an icon-based interface. In Proc. of SIGIR, pages 40--47, 1989."},{"key":"e_1_2_1_92_1","volume-title":"Vintage","author":"Platt Stephen R","year":"2019","unstructured":"Stephen R Platt. Imperial Twilight: The Opium War and the End of China's Last Golden Age. Vintage, 2019."},{"key":"e_1_2_1_93_1","doi-asserted-by":"crossref","first-page":"275","DOI":"10.1145\/290941.291008","volume-title":"Proc. of SIGIR","author":"Ponte Jay","year":"1998","unstructured":"Jay Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proc. of SIGIR, pages 275--281, 1998."},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.149.3683.510"},{"issue":"8","key":"e_1_2_1_95_1","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.","journal-title":"OpenAI blog"},{"key":"e_1_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_2_1_97_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2124"},{"key":"e_1_2_1_98_1","first-page":"23","volume-title":"SIGIR Forum","volume":"21","author":"Van Rijsbergen C. J.","year":"1986","unstructured":"C. J. Van Rijsbergen. A new theoretical framework for information retrieval. In SIGIR Forum, volume 21, pages 23--29, 1986."},{"key":"e_1_2_1_99_1","volume-title":"Overview of the TREC 2014 clinical decision support track. 
In Text Retrieval Conference","author":"Roberts Kirk","year":"2014","unstructured":"Kirk Roberts, Dina Demner-Fushman, Ellen Voorhees, and W. Hersh. Overview of the TREC 2014 clinical decision support track. In Text Retrieval Conference, 2014."},{"key":"e_1_2_1_100_1","first-page":"78","volume-title":"Proc. of CIKM","author":"Robertson Stephen","year":"2006","unstructured":"Stephen Robertson. On GMAP: and other transformations. In Proc. of CIKM, page 78--83, 2006."},{"key":"e_1_2_1_101_1","first-page":"109","volume-title":"Text Retrieval Conference","author":"Robertson Stephen","year":"1994","unstructured":"Stephen Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Text Retrieval Conference, pages 109--126, 1994."},{"key":"e_1_2_1_102_1","volume-title":"Text Retrieval Conference","author":"Robertson Stephen","year":"1995","unstructured":"Stephen Robertson, Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford, and A. Payne. Okapi at TREC-4. In Text Retrieval Conference, 1995."},{"key":"e_1_2_1_103_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2019.2946162"},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_105_1","volume-title":"A mathematical theory of communication. Bell system technical journal, 27(3): 379--423","author":"Shannon Claude","year":"1948","unstructured":"Claude Shannon. A mathematical theory of communication. Bell system technical journal, 27(3): 379--423, 1948."},{"key":"e_1_2_1_106_1","volume-title":"Prediction and entropy of printed english. Bell system technical journal, 30(1): 50--64","author":"Shannon Claude","year":"1951","unstructured":"Claude Shannon. Prediction and entropy of printed english. 
Bell system technical journal, 30(1): 50--64, 1951."},{"key":"e_1_2_1_107_1","first-page":"176","volume-title":"SIGIR Forum","volume":"51","author":"Singhal Amit","year":"1996","unstructured":"Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In SIGIR Forum, volume 51, pages 176--184, 1996."},{"key":"e_1_2_1_108_1","volume-title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615","author":"Srivastava Aarohi","year":"2022","unstructured":"Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022."},{"key":"e_1_2_1_109_1","volume-title":"Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137","author":"Sun Yu","year":"2021","unstructured":"Yu Sun, Shuohuan Wang, Shikun Feng, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021."},{"key":"e_1_2_1_110_1","volume-title":"Third Message Understanding Conference (MUC-3)","author":"Sundheim Beth M.","year":"1991","unstructured":"Beth M. Sundheim. Overview of the third Message Understanding Evaluation and Conference. In Third Message Understanding Conference (MUC-3), 1991."},{"key":"e_1_2_1_111_1","volume-title":"Fourth Message Understanding Conference (MUC-4)","author":"Sundheim Beth M.","year":"1992","unstructured":"Beth M. Sundheim. Overview of the fourth Message Understanding Evaluation and Conference. In Fourth Message Understanding Conference (MUC-4), 1992."},{"key":"e_1_2_1_112_1","volume-title":"Large language models can accurately predict searcher preferences. CoRR, abs\/2309.10621","author":"Thomas Paul","year":"2023","unstructured":"Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 
Large language models can accurately predict searcher preferences. CoRR, abs\/2309.10621, 2023."},{"key":"e_1_2_1_113_1","first-page":"1","volume-title":"Proc. of SIGIR","author":"Turtle Howard R.","year":"1989","unstructured":"Howard R. Turtle and W. Bruce Croft. Inference networks for document retrieval. In Proc. of SIGIR, pages 1--24, 1989."},{"key":"e_1_2_1_114_1","first-page":"188","volume-title":"Proc. of SIGIR","author":"Voorhees Ellen","year":"1985","unstructured":"Ellen Voorhees. The cluster hypothesis revisited. In Proc. of SIGIR, pages 188--196, 1985."},{"key":"e_1_2_1_115_1","doi-asserted-by":"publisher","DOI":"10.6028\/NIST.SP.500-242"},{"key":"e_1_2_1_116_1","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324901002789"},{"key":"e_1_2_1_117_1","first-page":"54","volume-title":"Text Retrieval Conference","author":"Voorhees Ellen","year":"2004","unstructured":"Ellen Voorhees. Overview of the TREC 2003 question answering track. In Text Retrieval Conference, pages 54--68, 2004."},{"key":"e_1_2_1_118_1","volume-title":"TREC: Experiment and evaluation in information retrieval","author":"Voorhees Ellen","year":"2005","unstructured":"Ellen Voorhees and Donna Harman. TREC: Experiment and evaluation in information retrieval, 2005."},{"key":"e_1_2_1_119_1","volume-title":"Text Retrieval Conference","author":"Voorhees Ellen","year":"2012","unstructured":"Ellen Voorhees and W. Hersh. Overview of the TREC 2012 medical records track. In Text Retrieval Conference, 2012."},{"key":"e_1_2_1_120_1","doi-asserted-by":"crossref","DOI":"10.6028\/NIST.SP.500-246","volume-title":"Text Retrieval Conference","volume":"3","author":"Voorhees Ellen","year":"2000","unstructured":"Ellen Voorhees and Dawn Tice. The TREC-8 question answering track evaluation. 
In Text Retrieval Conference, volume 3, 2000."},{"key":"e_1_2_1_121_1","doi-asserted-by":"publisher","DOI":"10.1145\/3487553.3527147"},{"key":"e_1_2_1_122_1","doi-asserted-by":"crossref","first-page":"353","DOI":"10.18653\/v1\/W18-5446","volume-title":"Proc. of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP","author":"Wang Alex","year":"2018","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353--355. ACL, 2018."},{"key":"e_1_2_1_123_1","volume-title":"SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Yada Pruksachatkun, Nikita Nangia, et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019."},{"key":"e_1_2_1_124_1","unstructured":"Wikimedia. ACL fourth conference on machine translation (WMT19) shared task: Machine translation of news. http:\/\/www.statmt.org\/wmt19\/translation-task.html 2019."},{"key":"e_1_2_1_125_1","first-page":"4","volume-title":"Proc. of SIGIR","author":"Xu Jinxi","year":"1996","unstructured":"Jinxi Xu and W. Bruce Croft. Query expansion using local and global document analysis. In Proc. of SIGIR, pages 4--11, 1996."},{"key":"e_1_2_1_126_1","doi-asserted-by":"crossref","first-page":"334","DOI":"10.1145\/383952.384019","volume-title":"Proc. of SIGIR","author":"Zhai ChengXiang","year":"2001","unstructured":"ChengXiang Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. 
of SIGIR, pages 334--342, 2001."},{"key":"e_1_2_1_127_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2017.2767044"},{"key":"e_1_2_1_128_1","volume-title":"A survey of large language models. ArXiv, abs\/2303.18223","author":"Zhao Wayne Xin","year":"2023","unstructured":"Wayne Xin Zhao, Kun Zhou, Junyi Li, et al. A survey of large language models. ArXiv, abs\/2303.18223, 2023."},{"key":"e_1_2_1_129_1","first-page":"18795","volume-title":"NeurIPS","volume":"33","author":"Zhuang Juntang","year":"2020","unstructured":"Juntang Zhuang, Tommy Tang, Yifan Ding, et al. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In NeurIPS, volume 33, pages 18795--18806, 2020."}],"container-title":["ACM SIGIR Forum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722449.3722467","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3722449.3722467","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:57:10Z","timestamp":1750298230000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722449.3722467"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12]]},"references-count":129,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["10.1145\/3722449.3722467"],"URL":"https:\/\/doi.org\/10.1145\/3722449.3722467","relation":{},"ISSN":["0163-5840"],"issn-type":[{"type":"print","value":"0163-5840"}],"subject":[],"published":{"date-parts":[[2024,12]]},"assertion":[{"value":"2025-03-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}