{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T14:51:31Z","timestamp":1769179891089,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,7,10]],"date-time":"2024-07-10T00:00:00Z","timestamp":1720569600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Natural Science Foundation of China","award":["61902209, 62377044, U2001212"],"award-info":[{"award-number":["61902209, 62377044, U2001212"]}]},{"name":"Beijing Outstanding Young Scientist Program","award":["BJJWZYJH012019100020098"],"award-info":[{"award-number":["BJJWZYJH012019100020098"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,7,10]]},"DOI":"10.1145\/3626772.3657671","type":"proceedings-article","created":{"date-parts":[[2024,7,11]],"date-time":"2024-07-11T12:40:05Z","timestamp":1720701605000},"page":"2713-2718","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["An Integrated Data Processing Framework for Pretraining Foundation Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-1671-5016","authenticated-orcid":false,"given":"Yiding","family":"Sun","sequence":"first","affiliation":[{"name":"Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5146-1562","authenticated-orcid":false,"given":"Feng","family":"Wang","sequence":"additional","affiliation":[{"name":"Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9432-3251","authenticated-orcid":false,"given":"Yutao","family":"Zhu","sequence":"additional","affiliation":[{"name":"Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8333-6196","authenticated-orcid":false,"given":"Wayne Xin","family":"Zhao","sequence":"additional","affiliation":[{"name":"Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9257-5498","authenticated-orcid":false,"given":"Jiaxin","family":"Mao","sequence":"additional","affiliation":[{"name":"Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,7,11]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6--12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-564"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2309.02033"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2309.01431"},{"key":"e_1_3_2_1_5_1","article-title":"PaLM: Scaling Language Modeling with Pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 24 (2023), 240:1--240:113. http:\/\/jmlr.org\/papers\/v24\/22--1144.html","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_1_6_1","unstructured":"Together Computer. 2023. RedPajama: an Open Dataset for Training Large Language Models. https:\/\/github.com\/togethercomputer\/RedPajama-Data"},{"key":"e_1_3_2_1_7_1","volume-title":"GLaM: Efficient Scaling of Language Models with Mixtureof- Experts. In International Conference on Machine Learning, ICML 2022","volume":"5569","author":"Du Nan","year":"2022","unstructured":"Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, TaoWang, Yu EmmaWang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022. GLaM: Efficient Scaling of Language Models with Mixtureof- Experts. In International Conference on Machine Learning, ICML 2022, 17--23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv\u00e1ri, Gang Niu, and Sivan Sabato (Eds.). PMLR, 5547--5569. https:\/\/proceedings.mlr.press\/ v162\/du22c.html"},{"key":"e_1_3_2_1_8_1","volume-title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs\/2101.00027","author":"Gao Leo","year":"2021","unstructured":"Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs\/2101.00027 (2021). arXiv:2101.00027 https:\/\/arxiv.org\/abs\/ 2101.00027"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458723"},{"key":"e_1_3_2_1_10_1","unstructured":"Suriya Gunasekar Yi Zhang Jyoti Aneja Caio C\u00e9sar Teodoro Mendes Allie Del Giorno Sivakanth Gopi Mojan Javaheripi Piero Kauffmann Gustavo de Rosa Olli Saarikivi Adil Salim Shital Shah Harkirat Singh Behl Xin Wang S\u00e9bastien Bubeck Ronen Eldan Adam Tauman Kalai Yin Tat Lee and Yuanzhi Li. 2023. Textbooks Are All You Need. CoRR abs\/2306.11644 (2023). https:\/\/doi.org\/10. 48550\/ARXIV.2306.11644 arXiv:2306.11644"},{"key":"e_1_3_2_1_11_1","volume-title":"4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2--4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1511","author":"Hill Felix","year":"2016","unstructured":"Felix Hill, Antoine Bordes, Sumit Chopra, and JasonWeston. 2016. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2--4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1511.02301"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","unstructured":"Jordan Hoffmann Sebastian Borgeaud Arthur Mensch Elena Buchatskaya Trevor Cai Eliza Rutherford Diego de Las Casas Lisa Anne Hendricks Johannes Welbl Aidan Clark Tom Hennigan Eric Noland Katie Millican George van den Driessche Bogdan Damoc Aurelia Guy Simon Osindero Karen Simonyan Erich Elsen Jack W. Rae Oriol Vinyals and Laurent Sifre. 2022. Training Compute-Optimal Large Language Models. CoRR abs\/2203.15556 (2022). https:\/\/doi.org\/10.48550\/ARXIV.2203.15556 arXiv:2203.15556","DOI":"10.48550\/ARXIV.2203.15556"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_3_2_1_14_1","volume-title":"Scaling Laws for Neural Language Models. CoRR abs\/2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs\/2001.08361 (2020). arXiv:2001.08361 https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_2_1_15_1","volume-title":"Jia LI, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Mu\u00f1oz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro Von Werra, and Harm de Vries.","author":"Kocetkov Denis","year":"2023","unstructured":"Denis Kocetkov, Raymond Li, Loubna Ben allal, Jia LI, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Mu\u00f1oz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro Von Werra, and Harm de Vries. 2023. The Stack: 3 TB of permissively licensed source code. Transactions on Machine Learning Research (2023)."},{"key":"e_1_3_2_1_16_1","unstructured":"Hugo Lauren\u00e7on Lucile Saulnier Thomas Wang Christopher Akiki Albert Villanova del Moral Teven Le Scao Leandro von Werra Chenghao Mou Eduardo Gonz\u00e1lez Ponferrada Huu Nguyen J\u00f6rg Frohberg Mario Sasko Quentin Lhoest Angelina McMillan-Major G\u00e9rard Dupont Stella Biderman Anna Rogers Loubna Ben Allal Francesco De Toni Giada Pistilli Olivier Nguyen Somaieh Nikpoor Maraim Masoud Pierre Colombo Javier de la Rosa Paulo Villegas Tristan Thrush Shayne Longpre Sebastian Nagel Leon Weber Manuel Mu\u00f1oz Jian Zhu Daniel van Strien Zaid Alyafeai Khalid Almubarak Minh Chien Vu Itziar Gonzalez-Dios Aitor Soroa Kyle Lo Manan Dey Pedro Ortiz Suarez Aaron Gokaslan Shamik Bose David Ifeoluwa Adelani Long Phan Hieu Tran Ian Yu Suhas Pai Jenny Chim Violette Lepercq Suzana Ilic Margaret Mitchell Alexandra Sasha Luccioni and Yacine Jernite. 2022. The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 NeurIPS 2022 New Orleans LA USA November 28 - December 9 2022 Sanmi Koyejo S. Mohamed A. Agarwal Danielle Belgrave K. Cho and A. Oh (Eds.)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2022.ACL-LONG.577"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2309.05463"},{"key":"e_1_3_2_1_19_1","volume-title":"5th International Conference on Learning Representations, ICLR","author":"Merity Stephen","year":"2017","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=Byj72udxe"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","unstructured":"Chenghao Mou Chris Ha Kenneth Enevoldsen and Peiyuan Liu. 2023. ChenghaoMou\/text-dedup: Reference Snapshot. https:\/\/doi.org\/10.5281\/zenodo. 8364980","DOI":"10.5281\/zenodo"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-5468\/ac3a74"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1144"},{"key":"e_1_3_2_1_23_1","unstructured":"Alec Radford JeffreyWu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_1_24_1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21 (2020), 140:1--140:67.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","unstructured":"Teven Le Scao Angela Fan Christopher Akiki Ellie Pavlick Suzana Ilic Daniel Hesslow Roman Castagn\u00e9 Alexandra Sasha Luccioni Fran\u00e7ois Yvon Matthias Gall\u00e9 Jonathan Tow Alexander M. Rush Stella Biderman Albert Webson Pawan Sasanka Ammanamanchi Thomas Wang Beno\u00eet Sagot Niklas Muennighoff Albert Villanova del Moral Olatunji Ruwase Rachel Bawden Stas Bekman Angelina McMillan-Major Iz Beltagy Huu Nguyen Lucile Saulnier Samson Tan Pedro Ortiz Suarez Victor Sanh Hugo Lauren\u00e7on Yacine Jernite Julien Launay Margaret Mitchell Colin Raffel Aaron Gokaslan Adi Simhi Aitor Soroa Alham Fikri Aji Amit Alfassy Anna Rogers Ariel Kreisberg Nitzav Canwen Xu Chenghao Mou Chris Emezue Christopher Klamm Colin Leong Daniel van Strien David Ifeoluwa Adelani and et al. 2022. BLOOM: A 176BParameter Open-Access Multilingual Language Model. CoRR abs\/2211.05100 (2022). https:\/\/doi.org\/10.48550\/ARXIV.2211.05100 arXiv:2211.05100","DOI":"10.48550\/ARXIV.2211.05100"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2302.13971"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton-Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aur\u00e9lien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs\/2307.09288 (2023). https:\/\/doi.org\/10.48550\/ARXIV.2307.09288 arXiv:2307.09288","DOI":"10.48550\/ARXIV.2307.09288"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2308.11432"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV"},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020","author":"Wenzek Guillaume","year":"2020","unstructured":"Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzm\u00e1n, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11--16, 2020, Nicoletta Calzolari, Fr\u00e9d\u00e9ric B\u00e9chet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, H\u00e9l\u00e8ne Mazo, Asunci\u00f3n Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, 4003--4012. https:\/\/aclanthology.org\/2020.lrec-1.494\/"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.19860"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.10429"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.AIOPEN.2021.06.001"},{"key":"e_1_3_2_1_34_1","volume-title":"The Eleventh International Conference on Learning Representations, ICLR 2023","author":"Zeng Aohan","year":"2023","unstructured":"Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai,Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1--5, 2023. OpenReview.net."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2311.12537"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2308.07107"}],"event":{"name":"SIGIR 2024: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval","location":"Washington DC USA","acronym":"SIGIR 2024","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626772.3657671","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3626772.3657671","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:42:46Z","timestamp":1755841366000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626772.3657671"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,10]]},"references-count":37,"alternative-id":["10.1145\/3626772.3657671","10.1145\/3626772"],"URL":"https:\/\/doi.org\/10.1145\/3626772.3657671","relation":{},"subject":[],"published":{"date-parts":[[2024,7,10]]},"assertion":[{"value":"2024-07-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}