{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T13:11:13Z","timestamp":1774703473748,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":14,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,7,6]],"date-time":"2022-07-06T00:00:00Z","timestamp":1657065600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["NS-1822975 and CNS-182298"],"award-info":[{"award-number":["NS-1822975 and CNS-182298"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,7,6]]},"DOI":"10.1145\/3477495.3536321","type":"proceedings-article","created":{"date-parts":[[2022,7,7]],"date-time":"2022-07-07T15:12:08Z","timestamp":1657206728000},"page":"3360-3362","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":40,"title":["ClueWeb22: 10 Billion Web Documents with Rich Information"],"prefix":"10.1145","author":[{"given":"Arnold","family":"Overwijk","sequence":"first","affiliation":[{"name":"Microsoft, Redmond, WA, USA"}]},{"given":"Chenyan","family":"Xiong","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, WA, USA"}]},{"given":"Jamie","family":"Callan","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,7,7]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"et almbox","author":"Bajaj Payal","year":"2016","unstructured":"Payal Bajaj , Daniel Campos , Nick Craswell , Li Deng , Jianfeng Gao , Xiaodong Liu , Rangan Majumder , Andrew McNamara 
, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)."},{"key":"e_1_3_2_1_2_1","volume-title":"Overview of the TREC 2009 Web Track. Technical Report. NIST.","author":"Clarke Charles L","year":"2009","unstructured":"Charles L Clarke, Nick Craswell, and Ian Soboroff. 2009. Overview of the TREC 2009 Web Track. Technical Report. NIST."},{"key":"e_1_3_2_1_3_1","volume-title":"Overview of the TREC 2012 Web Track. Technical Report. NIST.","author":"Clarke Charles L","year":"2012","unstructured":"Charles L Clarke, Nick Craswell, and Ellen M Voorhees. 2012. Overview of the TREC 2012 Web Track. Technical Report. NIST."},{"key":"e_1_3_2_1_4_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2019. 4171--4186","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2019. 4171--4186."},{"key":"e_1_3_2_1_5_1","volume-title":"Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758","author":"Dodge Jesse","year":"2021","unstructured":"Jesse Dodge, Maarten Sap, Ana Marasovi\u0107, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758 (2021)."},{"key":"e_1_3_2_1_6_1","volume-title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv preprint arXiv:2101.03961","author":"Fedus William","year":"2021","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv preprint arXiv:2101.03961 (2021)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.220"},{"key":"e_1_3_2_1_8_1","volume-title":"Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text. arXiv preprint arXiv:2110.08417","author":"Ma Kaixin","year":"2021","unstructured":"Kaixin Ma, Hao Cheng, Xiaodong Liu, Eric Nyberg, and Jianfeng Gao. 2021. 
Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text. arXiv preprint arXiv:2110.08417 (2021)."},{"key":"e_1_3_2_1_9_1","unstructured":"Microsoft. 2019. BlingFire. https:\/\/github.com\/microsoft\/BlingFire"},{"key":"e_1_3_2_1_10_1","volume-title":"Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)."},{"key":"e_1_3_2_1_11_1","unstructured":"Carnegie Mellon University. 2009. ClueWeb09. http:\/\/lemurproject.org\/clueweb09\/"},{"key":"e_1_3_2_1_12_1","unstructured":"Carnegie Mellon University. 2012. ClueWeb12. http:\/\/lemurproject.org\/clueweb12\/"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1521"},{"key":"e_1_3_2_1_14_1","volume-title":"Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 
In International Conference on Learning Representations, ICLR","author":"Xiong Lee","year":"2021","unstructured":"Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In International Conference on Learning Representations, ICLR 2021."}],"event":{"name":"SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval","location":"Madrid Spain","acronym":"SIGIR '22","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information 
Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3477495.3536321","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3477495.3536321","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3477495.3536321","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:10:36Z","timestamp":1750183836000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3477495.3536321"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,6]]},"references-count":14,"alternative-id":["10.1145\/3477495.3536321","10.1145\/3477495"],"URL":"https:\/\/doi.org\/10.1145\/3477495.3536321","relation":{},"subject":[],"published":{"date-parts":[[2022,7,6]]},"assertion":[{"value":"2022-07-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}