{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,7]],"date-time":"2026-02-07T22:46:18Z","timestamp":1770504378579,"version":"3.49.0"},"reference-count":36,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2018,5,17]],"date-time":"2018-05-17T00:00:00Z","timestamp":1526515200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Intelligent & Fuzzy Systems"],"published-print":{"date-parts":[[2018,5,24]]},"abstract":"<jats:p>In this paper, the use of collection term frequencies (i.e. the total number of occurrences of a term in a document collection) in the BM25 retrieval model is investigated by modifying its term frequency (TF) and inverse document frequency (IDF) components. Using selected examples extracted from TREC collections, it was observed that the informative nature, for retrieval purposes, of terms, either with the same TF (in a document) or IDF (in a collection) may be better revealed with the use of collection term frequencies (CTF). From three new heuristics based on those observations and deviations from a random Poisson model, collection term frequencies were integrated to TF and IDF factors. The novel formulations were tested by employing the TREC-1 to TREC-8 collections in the ad hoc task, for which BM25 was first developed and tested. Consistent and significant improvements were observed in mean average precision (MAP) reaching up to 17.67% for the TREC-8 dataset, and 7.16% averaged over all tested collections. These results were considerably better in comparison to other approaches surveyed aiming to improve BM25, proving in this way the effectiveness of the proposed heuristics and formulae. 
The proposed approach requires only additional offline pre-computations and does not entail extra computational complexity for retrieval while keeping the original spirit and parameter robustness of BM25.<\/jats:p>","DOI":"10.3233\/jifs-169475","type":"journal-article","created":{"date-parts":[[2018,5,18]],"date-time":"2018-05-18T10:37:22Z","timestamp":1526639842000},"page":"2887-2899","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":10,"title":["BM25-CTF: Improving TF and IDF factors in BM25 by using collection term\u00a0frequencies"],"prefix":"10.1177","volume":"34","author":[{"given":"Sergio","family":"Jimenez","sequence":"first","affiliation":[{"name":"Instituto Caro y Cuervo, Calle 10 #4-69, Bogot\u00e1, D.C., Colombia"}]},{"given":"Silviu-Petru","family":"Cucerzan","sequence":"additional","affiliation":[{"name":"Microsoft Research, One Microsoft Way, Redmond, WA, US"}]},{"given":"Fabio A.","family":"Gonzalez","sequence":"additional","affiliation":[{"name":"Universidad Nacional de Colombia, Ciudad Universitaria, Bogot\u00e1, D.C., Colombia"}]},{"given":"Alexander","family":"Gelbukh","sequence":"additional","affiliation":[{"name":"CIC, Instituto Polit\u00e9cnico Nacional, Av. Juan de Dios B\u00e1tiz, 07738, Mexico City, Mexico"}]},{"given":"George","family":"Due\u00f1as","sequence":"additional","affiliation":[{"name":"Instituto Caro y Cuervo, Calle 10 #4-69, Bogot\u00e1, D.C., Colombia"}]}],"member":"179","published-online":{"date-parts":[[2018,5,17]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(02)00021-3"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582416"},{"key":"e_1_3_2_4_2","volume-title":"In Proceedings of TREC 2005","author":"B\u00fcttcher S.","year":"2005","unstructured":"B\u00fcttcherS. 
and CharlesL.A., Clarke, Efficiency vs, effectiveness in terabyte-scale information retrieval, In Proceedings of TREC 20052005."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-017-2390-9_18"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-006-1682-6"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/1008992.1009004"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/1076034.1076116"},{"key":"e_1_3_2_9_2","first-page":"345","volume-title":"In Proceedings of the 23rd ACM SIGIR","author":"Franz M.","year":"2000","unstructured":"FranzM. and McCarleyJ.S., Word document density and relevance scoring, In Proceedings of the 23rd ACM SIGIR (2000), 345\u2013347."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-06028-6_31"},{"key":"e_1_3_2_11_2","first-page":"21","volume-title":"TREC: Experiments and Evaluation in Information Retrieval, Chapter the TREC Test Collections","author":"Harman D.","year":"2005","unstructured":"HarmanD., TREC: Experiments and Evaluation in Information Retrieval, Chapter the TREC Test CollectionsMIT Press (2005), 21\u201352."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2011.03.007"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/102675.102677"},{"key":"e_1_3_2_14_2","first-page":"187","volume-title":"In Proceedings of the 19th ACM SIGIR","author":"Kwok K.L.","year":"1996","unstructured":"KwokK.L., A new method of weighting query terms for adhoc retrieval, In Proceedings of the 19th ACM SIGIR (1996), 187\u2013195."},{"key":"e_1_3_2_15_2","first-page":"751","volume-title":"In Proceedings of the 30th ACM SIGIR","author":"Lee L.","year":"2007","unstructured":"LeeL., IDF revisited: A simple new derivation within the Robertson-Sp\u00e4rck Jones probabilistic model, In Proceedings of the 30th ACM SIGIR (2007), 
751\u2013752."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063576.2063871"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063576.2063584"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010070"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.1106"},{"issue":"2","key":"e_1_3_2_20_2","article-title":"The RATF formula (Kwok\u2019s formula): Exploiting average term frequency in cross-language retrieval","volume":"7","author":"Pirkola A.","year":"2002","unstructured":"PirkolaA., Lepp\u00e4nenE. and J\u00e4rvelinK., The RATF formula (Kwok\u2019s formula): Exploiting average term frequency in cross-language retrieval, Information Research7(2) (2002).","journal-title":"Information Research"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/290941.291008"},{"key":"e_1_3_2_22_2","first-page":"207","volume-title":"Proceedings of the 25th ECIR","author":"Rasolofo Y.","year":"2003","unstructured":"RasolofoY. and SavoyJ., Term proximity scoring for keywordbased retrieval systems, In Proceedings of the 25th ECIR (2003), 207\u2013218."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1108\/00220410410560582"},{"key":"e_1_3_2_24_2","first-page":"287","volume-title":"TREC: Experiments and Evaluation in Information Retrieval, Chapter How Okapi Came to TREC","author":"Robertson S.","year":"2005","unstructured":"RobertsonS., TREC: Experiments and Evaluation in Information Retrieval, Chapter How Okapi Came to TREC (2005), 287\u2013300MIT Press."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4471-2099-5_24"},{"key":"e_1_3_2_26_2","first-page":"73","article-title":"Okapi at TREC-4","author":"Robertson S.","year":"1996","unstructured":"RobertsonS., WalkerS., BeaulieuM.M., GatfordM. 
and PayneA., Okapi at TREC-4, In Proceedings of the 4th TREC (1996), 73\u201396.","journal-title":"Proceedings of the 4th TREC"},{"key":"e_1_3_2_27_2","first-page":"109","article-title":"Okapi at TREC-3","author":"Robertson S.","year":"1994","unstructured":"RobertsonS., WalkerS., JonesS., Hancock-BeaulieuM.M. and GatfordM., Okapi at TREC-3, In Proceedings of the 3rd TREC (1994), 109\u2013126.","journal-title":"Proceedings of the 3rd TREC"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/361219.361220"},{"key":"e_1_3_2_29_2","author":"Singhal A.","year":"1997","unstructured":"SinghalA., Term Weighting Revisited, PhD thesis, Cornell University, 1997.","journal-title":"Term Weighting Revisited, PhD thesis, Cornell University"},{"issue":"4","key":"e_1_3_2_30_2","first-page":"35","article-title":"Modern information retrieval: A brief overview","volume":"24","author":"Singhal A.","year":"2001","unstructured":"SinghalA., Modern information retrieval: A brief overview, IEEE Data Engineering Bulletin24(4) (2001), 35\u201343.","journal-title":"IEEE Data Engineering Bulletin"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1108\/00220410410560591"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277794"},{"key":"e_1_3_2_33_2","volume-title":"Proceedings of the 15th ACM CIKM","author":"Taylor M.","year":"2006","unstructured":"TaylorM., ZaragozaH., CraswellN., RobertsonS. 
and BurgesC., Optimization methods for ranking functions with multiple parameters, In Proceedings of the 15th ACM CIKM, 2006."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2682862.2682863"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277844"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/383952.384019"},{"key":"e_1_3_2_37_2","volume-title":"Human Behaviour and the Principle of Least-Effort","author":"Zipf G.K.","year":"1949","unstructured":"ZipfG.K., Human Behaviour and the Principle of Least-Effort, Addison-Wesley, 1949."}],"container-title":["Journal of Intelligent & Fuzzy Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-169475","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/JIFS-169475","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-169475","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T22:20:04Z","timestamp":1770416404000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.3233\/JIFS-169475"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,5,17]]},"references-count":36,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2018,5,24]]}},"alternative-id":["10.3233\/JIFS-169475"],"URL":"https:\/\/doi.org\/10.3233\/jifs-169475","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,5,17]]}}}