{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:06:06Z","timestamp":1750309566596,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,12,16]],"date-time":"2024-12-16T00:00:00Z","timestamp":1734307200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,12,16]]},"DOI":"10.1145\/3677389.3702591","type":"proceedings-article","created":{"date-parts":[[2025,3,13]],"date-time":"2025-03-13T16:53:52Z","timestamp":1741884832000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Can LLMs categorize the specialized documents from web archives in a better way?"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7110-8551","authenticated-orcid":false,"given":"Saran Pandian","family":"Pandi","sequence":"first","affiliation":[{"name":"Computer Science, University of Illinois at Chicago, Chicago, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3611-3018","authenticated-orcid":false,"given":"Seoyeon","family":"Park","sequence":"additional","affiliation":[{"name":"Hanyang University ERICA, Ansan, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5079-4347","authenticated-orcid":false,"given":"Praneeth","family":"Rikka","sequence":"additional","affiliation":[{"name":"University of North Texas, Denton, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9679-6730","authenticated-orcid":false,"given":"Mark Edward","family":"Phillips","sequence":"additional","affiliation":[{"name":"University Libraries, University of North Texas, Denton, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5664-2163","authenticated-orcid":false,"given":"Cornelia","family":"Caragea","sequence":"additional","affiliation":[{"name":"University of Illinois, Chicago, Chicago, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,3,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Marah Abdin Sam Ade Jacobs Ammar Ahmad Awan Jyoti Aneja Ahmed Awadallah Hany Awadalla Nguyen Bach Amit Bahree Arash Bakhtiari Jianmin Bao Harkirat Behl Alon Benhaim Misha Bilenko Johan Bjorck S\u00e9bastien Bubeck Qin Cai Martin Cai Caio C\u00e9sar Teodoro Mendes Weizhu Chen Vishrav Chaudhary Dong Chen Dongdong Chen Yen-Chun Chen Yi-Ling Chen Parul Chopra Xiyang Dai Allie Del Giorno Gustavo de Rosa Matthew Dixon Ronen Eldan Victor Fragoso Dan Iter Mei Gao Min Gao Jianfeng Gao Amit Garg Abhishek Goswami Suriya Gunasekar Emman Haider Junheng Hao Russell J. Hewett Jamie Huynh Mojan Javaheripi Xin Jin Piero Kauffmann Nikos Karampatziakis Dongwoo Kim Mahoud Khademi Lev Kurilenko James R. Lee Yin Tat Lee Yuanzhi Li Yunsheng Li Chen Liang Lars Liden Ce Liu Mengchen Liu Weishung Liu Eric Lin Zeqi Lin Chong Luo Piyush Madan Matt Mazzola Arindam Mitra Hardik Modi Anh Nguyen Brandon Norick Barun Patra Daniel Perez-Becker Thomas Portet Reid Pryzant Heyang Qin Marko Radmilac Corby Rosset Sambudha Roy Olatunji Ruwase Olli Saarikivi Amin Saied Adil Salim Michael Santacroce Shital Shah Ning Shang Hiteshi Sharma Swadheen Shukla Xia Song Masahiro Tanaka Andrea Tupini Xin Wang Lijuan Wang Chunyu Wang Yu Wang Rachel Ward Guanhua Wang Philipp Witte Haiping Wu Michael Wyatt Bin Xiao Can Xu Jiahang Xu Weijian Xu Sonali Yadav Fan Yang Jianwei Yang Ziyi Yang Yifan Yang Donghan Yu Lu Yuan Chengruidong Zhang Cyril Zhang Jianwen Zhang Li Lyna Zhang Yi Zhang Yue Zhang Yunan Zhang and Xiren Zhou. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] https:\/\/arxiv.org\/abs\/2404.14219"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1002\/meet.2014.14505101150"},{"key":"e_1_3_2_1_3_1","unstructured":"Jefferson Bailey. 2017. Twitter Post. https:\/\/twitter.com\/jefferson_bail\/status\/867808876917178368."},{"key":"e_1_3_2_1_4_1","volume-title":"A neural probabilistic language model. Advances in neural information processing systems 13","author":"Bengio Yoshua","year":"2000","unstructured":"Yoshua Bengio, R\u00e9jean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000)."},{"key":"e_1_3_2_1_5_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i2.19075"},{"key":"e_1_3_2_1_7_1","volume-title":"A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. Text databases and document management: Theory and practice 5478, 4","author":"Caropreso Maria Fernanda","year":"2001","unstructured":"Maria Fernanda Caropreso, Stan Matwin, and Fabrizio Sebastiani. 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. Text databases and document management: Theory and practice 5478, 4 (2001), 78--102."},{"key":"e_1_3_2_1_8_1","volume-title":"Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307","author":"Chen Yukang","year":"2023","unstructured":"Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307 (2023)."},{"key":"e_1_3_2_1_9_1","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1--113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_10_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_11_1","unstructured":"C. Dooley and G. Thomas. 2019. The library of congress web archives: Dipping a toe in a lake of data. https:\/\/blogs.loc.gov\/thesignal\/2019\/01\/the-library-of-congress-web-archivesdipping-a-toe-in-a-lake-of-data\/ (2019)."},{"key":"e_1_3_2_1_12_1","unstructured":"Nathaniel T Fox Mark E. Phillips and Hannah Tarver. 2020. Programmatic Extraction of 'Documents' from Web Archives: Identifying Document Characteristics from Content Selector Interviews. https:\/\/digital.library.unt.edu\/ark:\/67531\/metadc1757659\/."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/3176748.3176757"},{"key":"e_1_3_2_1_14_1","volume-title":"TRAIL: From Government Information Locator Service to Electronic Depository Program for Texas State Publications. DttP: Documents to the People 32","author":"Hartman Cathy Nelson","year":"2004","unstructured":"Cathy Nelson Hartman and Coby Condrey. 2004. TRAIL: From Government Information Locator Service to Electronic Depository Program for Texas State Publications. DttP: Documents to the People 32 (2004), 22--27. Issue 2."},{"key":"e_1_3_2_1_15_1","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL] https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0026683"},{"key":"e_1_3_2_1_17_1","volume-title":"A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188","author":"Kalchbrenner Nal","year":"2014","unstructured":"Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014)."},{"key":"e_1_3_2_1_18_1","volume-title":"Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461","author":"Lewis Mike","year":"2019","unstructured":"Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)."},{"key":"e_1_3_2_1_19_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_3_2_1_20_1","volume-title":"Web archiving methods and approaches: A comparative study","author":"Masan\u00e8s Julien","year":"2005","unstructured":"Julien Masan\u00e8s. 2005. Web archiving methods and approaches: A comparative study. Library trends 54, 1 (2005), 72--90."},{"key":"e_1_3_2_1_21_1","volume-title":"AAAI-98 workshop on learning for text categorization","volume":"752","author":"McCallum Andrew","year":"1998","unstructured":"Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, Vol. 752. Madison, WI, 41--48."},{"key":"e_1_3_2_1_22_1","volume-title":"Large language models: A survey. arXiv preprint arXiv:2402.06196","author":"Minaee Shervin","year":"2024","unstructured":"Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196 (2024)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.giq.2007.04.005"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988674"},{"key":"e_1_3_2_1_25_1","unstructured":"OpenAI. 2022. Introducing ChatGPT. https:\/\/openai.com\/blog\/chatgpt"},{"key":"e_1_3_2_1_26_1","unstructured":"Long Ouyang Jeffrey Wu Xu Jiang Diogo Almeida Carroll Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022) 27730--27744."},{"key":"e_1_3_2_1_27_1","volume-title":"The Library of Congress Web Archives: Dipping a Toe in a Lake of Data. https:\/\/blogs.loc.gov\/thesignal\/2019\/01\/the-library-of-congress-web-archives-dipping-a-toe-in-a-lake-of-data\/.","author":"Owens Tevor","year":"2019","unstructured":"Tevor Owens. 2019. The Library of Congress Web Archives: Dipping a Toe in a Lake of Data. https:\/\/blogs.loc.gov\/thesignal\/2019\/01\/the-library-of-congress-web-archives-dipping-a-toe-in-a-lake-of-data\/."},{"key":"e_1_3_2_1_28_1","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference. 1459--1468","author":"Patel Krutarth","year":"2020","unstructured":"Krutarth Patel, Cornelia Caragea, and Mark Phillips. 2020. Dynamic classification in web archiving collections. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 1459--1468."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383583.3398540"},{"key":"e_1_3_2_1_30_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans Ilya Sutskever et al. 2018. Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_2_1_31_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_1_32_1","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21, 140 (2020), 1--67.","journal-title":"Journal of machine learning research"},{"key":"e_1_3_2_1_33_1","volume-title":"Machine learning in automated text categorization. ACM computing surveys (CSUR) 34, 1","author":"Sebastiani Fabrizio","year":"2002","unstructured":"Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM computing surveys (CSUR) 34, 1 (2002), 1--47."},{"key":"e_1_3_2_1_34_1","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew M Dai Anja Hauth et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)."},{"key":"e_1_3_2_1_35_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_1_36_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_37_1","volume-title":"Aakanksha Chowdhery, and Denny Zhou.","author":"Wang Xuezhi","year":"2022","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)."},{"key":"e_1_3_2_1_38_1","volume-title":"Denny Zhou, et al.","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824--24837."},{"key":"e_1_3_2_1_39_1","volume-title":"Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36","author":"Yao Shunyu","year":"2024","unstructured":"Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_2_1_40_1","volume-title":"A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820","author":"Zhang Ye","year":"2015","unstructured":"Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)."}],"event":{"name":"JCDL '24: 24th ACM\/IEEE Joint Conference on Digital Libraries","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web","IEEE TCDL"],"location":"Hong Kong China","acronym":"JCDL '24"},"container-title":["Proceedings of the 24th ACM\/IEEE Joint Conference on Digital Libraries"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677389.3702591","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3677389.3702591","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:19:07Z","timestamp":1750295947000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677389.3702591"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,16]]},"references-count":40,"alternative-id":["10.1145\/3677389.3702591","10.1145\/3677389"],"URL":"https:\/\/doi.org\/10.1145\/3677389.3702591","relation":{},"subject":[],"published":{"date-parts":[[2024,12,16]]},"assertion":[{"value":"2025-03-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}