{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T10:35:41Z","timestamp":1753439741431,"version":"3.41.0"},"reference-count":77,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,10,29]],"date-time":"2018-10-29T00:00:00Z","timestamp":1540771200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2018,12,31]]},"abstract":"<jats:p>Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: Through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. 
In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.<\/jats:p>","DOI":"10.1145\/3242180","type":"journal-article","created":{"date-parts":[[2018,10,29]],"date-time":"2018-10-29T12:02:18Z","timestamp":1540814538000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["To Clean or Not to Clean"],"prefix":"10.1145","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5962-5983","authenticated-orcid":false,"given":"Dwaipayan","family":"Roy","sequence":"first","affiliation":[{"name":"Indian Statistical Institute, Kolkata, India"}]},{"given":"Mandar","family":"Mitra","sequence":"additional","affiliation":[{"name":"Indian Statistical Institute, Kolkata, India"}]},{"given":"Debasis","family":"Ganguly","sequence":"additional","affiliation":[{"name":"IBM Research, Dublin, Ireland"}]}],"member":"320","published-online":{"date-parts":[[2018,10,29]]},"reference":[{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582416"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983739"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2888422.2888439"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1645953.1646031"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767799"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3053408.3053421"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(02)00084-5"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983857"},{"volume-title":"The TREC 2006 terabyte track.","author":"B\u00fcttcher 
Stefan","key":"e_1_2_1_10_1"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/1766091.1766143"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1561\/1500000021"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1189702.1189703"},{"volume-title":"Overview of the TREC 2004 terabyte track.","author":"Clarke Charles L. A.","key":"e_1_2_1_14_1"},{"volume-title":"The TREC 2005 terabyte track.DOI:https:\/\/trec.nist.gov\/pubs\/trec14\/papers\/TERABYTE.OVERVIEW.pdf.","author":"Clarke Charles L. A","key":"e_1_2_1_15_1"},{"volume-title":"Overview of the TREC 2009 web track. In Proceedings of the 18th Text REtrieval Conference (TREC\u201909)","year":"2009","author":"Clarke Charles L. A.","key":"e_1_2_1_16_1"},{"volume-title":"Overview of the TREC 2011 web track. In Proceedings of the 20th Text REtrieval Conference (TREC\u201911)","author":"Clarke Charles L. A.","key":"e_1_2_1_17_1"},{"volume-title":"Overview of the TREC 2012 web track. In Proceedings of the 21st Text REtrieval Conference (TREC\u201912)","author":"Clarke Charles L. A.","key":"e_1_2_1_18_1"},{"volume-title":"Overview of the TREC 2010 web track. In Proceedings of the 19th Text REtrieval Conference (TREC\u201910)","author":"Clarke Charles L. A.","key":"e_1_2_1_19_1"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499178.2499179"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-011-9162-z"},{"volume-title":"Overview of the TREC 2004 web track. In Proceedings of the 13th Text REtrieval Conference (TREC\u201904)","year":"2004","author":"Craswell Nick","key":"e_1_2_1_22_1"},{"volume-title":"Overview of the TREC 2003 web track. 
In Proceedings of the 12th Text REtrieval Conference (TREC\u201903)","year":"2003","author":"Craswell Nick","key":"e_1_2_1_23_1"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3121050.3121053"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2746231"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983706"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983910"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080832"},{"volume-title":"Proceedings of the SIGIR Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR\u201915)","year":"2015","author":"Buccio Emanuele Di","key":"e_1_2_1_29_1"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018661.3018692"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020206"},{"key":"e_1_2_1_32_1","volume-title":"In Proceedings of the 38th European Conference on IR Research: Advances in Information Retrieval (ECIR\u201916)","volume":"9626","author":"Ferro N.","year":"2016"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964797.2964808"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911530"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3106426.3106442"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767801"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983769"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775182"},{"volume-title":"Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb\u201905)","year":"2005","author":"Gyongyi Zoltan","key":"e_1_2_1_39_1"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-16354-3","volume-title":"Proceedings of the 37th European Conference on IR Research: Advances in 
Information Retrieval (ECIR\u201915)","volume":"9022","author":"Hanbury A.","year":"2015"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-009-9101-4"},{"volume-title":"Proceedings of the 9th Text REtrieval Conference (TREC\u201900)","year":"2000","author":"Hawking David","key":"e_1_2_1_42_1"},{"volume-title":"Proceedings of the 10th Text REtrieval Conference (TREC\u201901)","year":"2001","author":"Hawking David","key":"e_1_2_1_43_1"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2808194.2809471"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/2816272.2816336"},{"volume-title":"Proceedings of the Text REtrieval Conference (TREC\u201904)","year":"2004","author":"Jaleel Nasreen Abdul","key":"e_1_2_1_46_1"},{"volume-title":"Proceedings of the 9th Text REtrieval Conference (TREC\u201900)","year":"2000","author":"Kraaij Wessel","key":"e_1_2_1_47_1"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/383952.383972"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-30671-1_30"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-007-9040-x"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1645953.1646259"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2034617.2034624"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/1135777.1135794"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767762"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767748"},{"volume-title":"Proceedings of the 1st International Workshop on Web Document Analysis (WDA\u201901)","author":"Rahman A. F. 
R.","key":"e_1_2_1_56_1"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2348283.2348417"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3121050.3121062"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2396761.2398514"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1561\/1500000019"},{"volume-title":"Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR\u201994)","year":"1994","author":"Robertson S. E.","key":"e_1_2_1_61_1"},{"key":"e_1_2_1_62_1","unstructured":"Gerard Salton and Chris Buckley. {n.d.}. SMART Stopword list. Retrieved from http:\/\/www.lextek.com\/manuals\/onix\/stopwords2.html."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/243199.243206"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(96)00008-8"},{"volume-title":"Smucker and James Allan","year":"2005","author":"Mark","key":"e_1_2_1_65_1"},{"key":"e_1_2_1_66_1","unstructured":"Ian Soboroff. 2013. Information retrieval evaluation demo. Retrieved from https:\/\/github.com\/isoboroff\/trec-demo."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/2207243.2207252"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2009952"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772789"},{"key":"e_1_2_1_70_1","unstructured":"Craig Willis. 2017. Evaluation Framework National Data Service\u2014Confluence. Retrieved from https:\/\/opensource.ncsa.illinois.edu\/confluence\/display\/NDS\/Evaluation+Framework."},
{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/2808194.2809446"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/2970398.2970415"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/2970398.2970403"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080831"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1561\/1500000008"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/383952.384019"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/984321.984322"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767700"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3242180","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3242180","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:43:36Z","timestamp":1750207416000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3242180"}},"subtitle":["Document Preprocessing and 
Reproducibility"],"short-title":[],"issued":{"date-parts":[[2018,10,29]]},"references-count":77,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,12,31]]}},"alternative-id":["10.1145\/3242180"],"URL":"https:\/\/doi.org\/10.1145\/3242180","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"type":"print","value":"1936-1955"},{"type":"electronic","value":"1936-1963"}],"subject":[],"published":{"date-parts":[[2018,10,29]]},"assertion":[{"value":"2017-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}