{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,28]],"date-time":"2026-05-28T18:03:27Z","timestamp":1779991407603,"version":"3.53.1"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","funder":[{"name":"Italian Ministry of University and Research","award":["PRIN 2022Y7HHNW"],"award-info":[{"award-number":["PRIN 2022Y7HHNW"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,29]]},"DOI":"10.1145\/3774905.3795601","type":"proceedings-article","created":{"date-parts":[[2026,5,28]],"date-time":"2026-05-28T17:14:56Z","timestamp":1779988496000},"page":"351-360","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Automating Articles Screening Processes Using Chain-of-Thought Large Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-5295-9969","authenticated-orcid":false,"given":"Carlo","family":"Arpini","sequence":"first","affiliation":[{"name":"Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9601-0403","authenticated-orcid":false,"given":"Mirko","family":"Cesarini","sequence":"additional","affiliation":[{"name":"Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-0994-2238","authenticated-orcid":false,"given":"Emmanuele","family":"Lotano","sequence":"additional","affiliation":[{"name":"Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, 
Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,5,28]]},"reference":[{"key":"e_1_3_2_1_1_1","author":"GROBID.","unstructured":"2008-2025. GROBID. https:\/\/github.com\/kermitt2\/grobid. swh:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c"},{"key":"e_1_3_2_1_2_1","unstructured":"Sandhini Agarwal et al. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv preprint arXiv:2508.10925 (2025). Open-weight reasoning models from OpenAI, Apache 2.0 license."},{"key":"e_1_3_2_1_3_1","volume-title":"International conference on learning representations.","author":"Arora Sanjeev","year":"2017","unstructured":"Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International conference on learning representations."},{"key":"e_1_3_2_1_4_1","volume-title":"Adam Sloan, Tomasz Tudrej, Ferhan Ture, et al.","author":"Atil Berk","year":"2024","unstructured":"Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, et al. 2024. Non-determinism of ''deterministic'' LLM settings. arXiv preprint arXiv:2408.04667 (2024)."},{"key":"e_1_3_2_1_5_1","unstructured":"Lochan Basyal and Mihir Sanghvi. 2023. Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models. arXiv:2310.10449 [cs.CL] https:\/\/arxiv.org\/abs\/2310.10449"},{"key":"e_1_3_2_1_6_1","volume-title":"Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. arXiv preprint arXiv:2011.04096","author":"Bhandari Manik","year":"2020","unstructured":"Manik Bhandari, Pranav Gour, Atabak Ashfaq, and Pengfei Liu. 2020. Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. 
arXiv preprint arXiv:2011.04096 (2020)."},{"key":"e_1_3_2_1_7_1","unstructured":"Christian Cao, Jason Sang, Rohit Arora, Robbie Kloosterman, Matt Cecere, Jaswanth Gorla, Richard Saleh, David Chen, Ian Drennan, Bijan Teja, et al. 2024. Prompting is all you need: LLMs for systematic review screening. medRxiv (2024), 2024-06."},{"key":"e_1_3_2_1_8_1","unstructured":"Nick Crews. 2025. llama-cpp-server-python: Bootstrap a server from llama-cpp in a few lines of Python. https:\/\/github.com\/NickCrews\/llama-cpp-server-python. Commit: 00cc5ece8783848139d41fb7f9c5e5c9b7a62686. MIT License."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1162\/coli.2007.33.1.63"},{"key":"e_1_3_2_1_10_1","first-page":"4171","volume-title":"Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171-4186."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3746252.3760834"},{"key":"e_1_3_2_1_12_1","volume-title":"How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512","author":"Ethayarajh Kawin","year":"2019","unstructured":"Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512 (2019)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00373"},{"key":"e_1_3_2_1_14_1","unstructured":"Georgi Gerganov. 2023. llama.cpp. 
https:\/\/github.com\/ggerganov\/llama.cpp."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. https:\/\/huggingface.co\/microsoft\/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext. arXiv:2007.15779. Model: microsoft\/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext.","DOI":"10.1145\/3458754"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1258\/jtt.2008.007007"},{"key":"e_1_3_2_1_17_1","volume-title":"Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.","author":"Hoffmann Jordan","year":"2022","unstructured":"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)."},{"key":"e_1_3_2_1_18_1","volume-title":"Andrea Madotto, and Pascale Fung.","author":"Ji Ziwei","year":"2023","unstructured":"Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM computing surveys 55, 12 (2023), 1-38."},{"key":"e_1_3_2_1_19_1","unstructured":"Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. 2024. A Comprehensive Evaluation of Quantization Strategies for Large Language Models. arXiv:2402.16775 [cs.CL] https:\/\/arxiv.org\/abs\/2402.16775"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/800125.804034"},{"key":"e_1_3_2_1_21_1","volume-title":"Benchmarking large language models in evidence-based medicine","author":"Li Jin","year":"2024","unstructured":"Jin Li, Yiyan Deng, Qi Sun, Junjie Zhu, Yu Tian, Jingsong Li, and Tingting Zhu. 
2024. Benchmarking large language models in evidence-based medicine. IEEE Journal of Biomedical and Health Informatics (2024)."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1093\/jamia\/ocaf030"},{"key":"e_1_3_2_1_23_1","volume-title":"A technique for the measurement of attitudes. Archives of psychology","author":"Likert Rensis","year":"1932","unstructured":"Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology (1932)."},{"key":"e_1_3_2_1_24_1","volume-title":"ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, 74-81","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, 74-81."},{"key":"e_1_3_2_1_25_1","volume-title":"Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172","author":"Liu Nelson F","year":"2023","unstructured":"Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023)."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-04346-8_62"},{"key":"e_1_3_2_1_27_1","volume-title":"SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896","author":"Manakul Potsawee","year":"2023","unstructured":"Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. 
arXiv preprint arXiv:2303.08896 (2023)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","first-page":"e0313401","DOI":"10.1371\/journal.pone.0313401","article-title":"ChatGPT-4o can serve as the second rater for data extraction in systematic reviews","volume":"20","author":"Jensen Mette Motzfeldt","year":"2025","unstructured":"Mette Motzfeldt Jensen, Mathias Brix Danielsen, Johannes Riis, Karoline Assifuah Kristjansen, Stig Andersen, Yoshiro Okubo, and Martin Gr\u00f8nbech J\u00f8rgensen. 2025. ChatGPT-4o can serve as the second rater for data extraction in systematic reviews. PloS one 20, 1 (2025), e0313401.","journal-title":"PloS one"},{"key":"e_1_3_2_1_29_1","volume-title":"Scientific writing and communication in agriculture and natural resources","author":"Ramachandran Nair PK","unstructured":"PK Ramachandran Nair and Vimala D Nair. 2014. Organization of a research paper: The IMRAD format. In Scientific writing and communication in agriculture and natural resources. Springer, 13-25."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11192-022-04536-x"},{"key":"e_1_3_2_1_31_1","unstructured":"NeuML. 2024. NeuML\/pubmedbert-base-embeddings. https:\/\/huggingface.co\/NeuML\/pubmedbert-base-embeddings. Sentence-transformers fine-tuned version of PubMedBERT for embeddings."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1019"},{"key":"e_1_3_2_1_33_1","volume-title":"Rayyan\u2014a web and mobile app for systematic reviews. Systematic reviews 5, 1","author":"Ouzzani Mourad","year":"2016","unstructured":"Mourad Ouzzani, Hossam Hammady, Zbys Fedorowicz, and Ahmed Elmagarmid. 2016. Rayyan\u2014a web and mobile app for systematic reviews. 
Systematic reviews 5, 1 (2016), 210."},{"key":"e_1_3_2_1_34_1","volume-title":"Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He.","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596 [cs.LG] https:\/\/arxiv.org\/abs\/2201.05596"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_3_2_1_36_1","volume-title":"The well-built clinical question: a key to evidence-based decisions. ACP journal club 123, 3","author":"Richardson Scott","year":"1995","unstructured":"W Scott Richardson, Mark C Wilson, Jennifer Nishikawa, and Robert SA Hayward. 1995. The well-built clinical question: a key to evidence-based decisions. ACP journal club 123, 3 (1995), A12-A13."},{"key":"e_1_3_2_1_37_1","volume-title":"Grobid-information extraction from scientific publications. ERCIM News 100","author":"Romary Laurent","year":"2015","unstructured":"Laurent Romary and Patrice Lopez. 2015. Grobid-information extraction from scientific publications. ERCIM News 100 (2015)."},{"key":"e_1_3_2_1_38_1","first-page":"606","article-title":"The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference","author":"Roy Olivier","year":"2007","unstructured":"Olivier Roy and Martin Vetterli. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference. IEEE, 606-610.","journal-title":"IEEE"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(88)90021-0"},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of ICLR.","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V Le, and Geoffrey Hinton. 2017. 
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of ICLR."},{"key":"e_1_3_2_1_41_1","first-page":"364","article-title":"The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey","volume":"92","author":"Sollaci Luciana B","year":"2004","unstructured":"Luciana B Sollaci and Mauricio G Pereira. 2004. The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey. Journal of the medical library association 92, 3 (2004), 364.","journal-title":"Journal of the medical library association"},{"key":"e_1_3_2_1_42_1","unstructured":"TEI Consortium. 2022. TEI P5: Guidelines for Electronic Text Encoding and Interchange. https:\/\/tei-c.org\/release\/doc\/tei-p5-doc\/en\/html\/ Version 4.5.0."},{"key":"e_1_3_2_1_43_1","volume-title":"Adam Dunn, Filippo Galgani, and Enrico Coiera.","author":"Tsafnat Guy","year":"2014","unstructured":"Guy Tsafnat, Paul Glasziou, Miew Keen Choong, Adam Dunn, Filippo Galgani, and Enrico Coiera. 2014. Systematic review automation technologies. Systematic reviews 3, 1 (2014), 74."},{"key":"e_1_3_2_1_44_1","unstructured":"Rens van de Schoot, Jonathan de Bruin, Raoul Schram, et al. 2020. ASReview: Active Learning for Systematic Reviews. https:\/\/github.com\/asreview. Accessed: 2025-09-10."},{"key":"e_1_3_2_1_45_1","volume-title":"Jonathan De Bruin","author":"De Schoot Rens Van","year":"2021","unstructured":"Rens Van De Schoot, Jonathan De Bruin, Raoul Schram, Parisa Zahedi, Jan De Boer, Felix Weijdema, Bianca Kramer, Martijn Huijts, Maarten Hoogerwerf, Gerbrich Ferdinands, et al. 2021. An open source machine learning framework for efficient and transparent systematic reviews. Nature machine intelligence 3, 2 (2021), 125-133."},{"key":"e_1_3_2_1_46_1","volume-title":"Attention is all you need. 
Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_47_1","volume-title":"Aakanksha Chowdhery, and Denny Zhou.","author":"Wang Xuezhi","year":"2023","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL] https:\/\/arxiv.org\/abs\/2203.11171"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1002\/jgm.3312"},{"key":"e_1_3_2_1_49_1","first-page":"24824","volume-title":"Oh (Eds.)","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824-24837. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf"},{"key":"e_1_3_2_1_50_1","volume-title":"The wisdom of the crowd in combinatorial problems. Cognitive science 36, 3","author":"Michael Yi Sheng Kung","year":"2012","unstructured":"Sheng Kung Michael Yi, Mark Steyvers, Michael D Lee, and Matthew J Dry. 2012. The wisdom of the crowd in combinatorial problems. Cognitive science 36, 3 (2012), 452-470."},{"key":"e_1_3_2_1_51_1","volume-title":"Do large language models know what they don't know? 
arXiv preprint arXiv:2305.18153","author":"Yin Zhangyue","year":"2023","unstructured":"Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don't know? arXiv preprint arXiv:2305.18153 (2023)."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3731445"},{"key":"e_1_3_2_1_53_1","volume-title":"BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675","author":"Zhang Tianyi","year":"2019","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)."},{"key":"e_1_3_2_1_54_1","volume-title":"BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR).","author":"Zhang Tianyi","year":"2020","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_55_1","volume-title":"Hao Liu, Chunhua Weng, and Yifan Peng.","author":"Zhou Yiliang","year":"2025","unstructured":"Yiliang Zhou, Abigail M Newbury, Gongbo Zhang, Betina Ross Idnay, Hao Liu, Chunhua Weng, and Yifan Peng. 2025. EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes. 
arXiv preprint arXiv:2506.05380 (2025)."}],"event":{"name":"WWW '26: The ACM Web Conference 2026","location":"Dubai United Arab Emirates","sponsor":["SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"]},"container-title":["Companion Proceedings of the ACM Web Conference 2026"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774905.3795601","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,28]],"date-time":"2026-05-28T17:16:36Z","timestamp":1779988596000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3774905.3795601"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,28]]},"references-count":55,"alternative-id":["10.1145\/3774905.3795601","10.1145\/3774905"],"URL":"https:\/\/doi.org\/10.1145\/3774905.3795601","relation":{},"subject":[],"published":{"date-parts":[[2026,5,28]]},"assertion":[{"value":"2026-05-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}