{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T14:37:56Z","timestamp":1758638276148,"version":"3.41.2"},"reference-count":26,"publisher":"Emerald","issue":"1","license":[{"start":{"date-parts":[[2012,3,30]],"date-time":"2012-03-30T00:00:00Z","timestamp":1333065600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,3,30]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-heading\">Purpose<\/jats:title><jats:p>Automatic text categorization has applications in several domains, for example e\u2010mail spam detection, sexual content filtering, directory maintenance, and focused crawling, among others. Most information retrieval systems contain several components which use text categorization methods. One of the first text categorization methods was designed using a na\u00efve Bayes representation of the text. Currently, a number of variations of na\u00efve Bayes have been discussed. The purpose of this paper is to evaluate na\u00efve Bayes approaches on text categorization introducing new competitive extensions to previous approaches.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Design\/methodology\/approach<\/jats:title><jats:p>The paper focuses on introducing a new Bayesian text categorization method based on an extension of the na\u00efve Bayes approach. Some modifications to document representations are introduced based on the well\u2010known BM25 text information retrieval method. The performance of the method is compared to several extensions of na\u00efve Bayes using benchmark datasets designed for this purpose. 
The method is compared also to training\u2010based methods such as support vector machines and logistic regression.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Findings<\/jats:title><jats:p>The proposed text categorizer outperforms state\u2010of\u2010the\u2010art methods without introducing new computational costs. It also achieves performance results very similar to more complex methods based on criterion function optimization as support vector machines or logistic regression.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Practical implications<\/jats:title><jats:p>The proposed method scales well regarding the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Originality\/value<\/jats:title><jats:p>The paper introduces a novel na\u00efve Bayes text categorization approach based on the well\u2010known BM25 information retrieval model, which offers a set of good properties for this problem.<\/jats:p><\/jats:sec>","DOI":"10.1108\/17440081211222591","type":"journal-article","created":{"date-parts":[[2012,3,24]],"date-time":"2012-03-24T08:52:43Z","timestamp":1332579163000},"page":"55-72","source":"Crossref","is-referenced-by-count":9,"title":["A new term\u2010weighting scheme for na\u00efve Bayes text categorization"],"prefix":"10.1108","volume":"8","author":[{"given":"Marcelo","family":"Mendoza","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","reference":[{"key":"key2022031219535676400_b1","doi-asserted-by":"crossref","unstructured":"Altin\u00e7ay, H. and Erenel, Z. (2010), \u201cAnalytical evaluation of term weighting schemes for text categorization\u201d, Pattern Recognition Letters, Vol. 31 No. 11, pp. 
1310\u201023.","DOI":"10.1016\/j.patrec.2010.03.012"},{"key":"key2022031219535676400_b2","doi-asserted-by":"crossref","unstructured":"Ault, T. and Yang, Y. (2002), \u201cInformation filtering in TREC\u20109 and TDT\u20103: a comparative analysis\u201d, Journal of Information Retrieval, Vol. 5 Nos 2\/3, pp. 159\u201087.","DOI":"10.1023\/A:1015745911767"},{"key":"key2022031219535676400_b3","unstructured":"Bennett, P. (2000), \u201cAssessing the calibration of naive Bayes posterior estimates\u201d, Technical Report CMU\u2010CS\u201000\u2010155, School of Computer Science, Carnegie\u2010Mellon University, Pittsburgh, PA."},{"key":"key2022031219535676400_b4","doi-asserted-by":"crossref","unstructured":"Chen, J., Huang, H., Tian, S. and Qu, Y. (2009), \u201cFeature selection for text classification with na\u00efve Bayes\u201d, Expert Systems with Applications, Vol. 36 No. 3, pp. 5432\u20105.","DOI":"10.1016\/j.eswa.2008.06.054"},{"key":"key2022031219535676400_b5","doi-asserted-by":"crossref","unstructured":"Church, K. and Gale, W. (1995), \u201cPoisson mixtures\u201d, Natural Language Engineering, Vol. 1, pp. 163\u201090.","DOI":"10.1017\/S1351324900000139"},{"key":"key2022031219535676400_b6","doi-asserted-by":"crossref","unstructured":"Datar, M. and Indyk, P. (2004), \u201cLocality\u2010sensitive hashing scheme based on p\u2010stable distributions\u201d, Proceedings of the 20th Annual Symposium on Computational Geometry, Brooklyn, NY, USA, pp. 253\u201062.","DOI":"10.1145\/997817.997857"},{"key":"key2022031219535676400_b7","doi-asserted-by":"crossref","unstructured":"Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY.","DOI":"10.1007\/978-0-387-21606-5"},{"key":"key2022031219535676400_b8","doi-asserted-by":"crossref","unstructured":"Indyk, P. (2004), \u201cNearest neighbors in high\u2010dimensional spaces\u201d, in Goodman, J. and O'Rourke, J. 
(Eds), Handbook of Discrete and Computational Geometry, Chapman and Hall\/CRC Press, New York, NY, pp. 877\u201092.","DOI":"10.1201\/9781420035315-39"},{"key":"key2022031219535676400_b9","doi-asserted-by":"crossref","unstructured":"Joachims, T. (2006), \u201cTraining linear SVMs in linear time\u201d, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, USA, pp. 217\u201026.","DOI":"10.1145\/1150402.1150429"},{"key":"key2022031219535676400_b10","doi-asserted-by":"crossref","unstructured":"Kim, S., Han, K., Rim, H. and Myaeng, S. (2006), \u201cSome effective techniques for na\u00efve Bayes text classification\u201d, IEEE Transactions on Knowledge and Data Engineering, Vol. 18 No. 11, pp. 1457\u201066.","DOI":"10.1109\/TKDE.2006.180"},{"key":"key2022031219535676400_b11","doi-asserted-by":"crossref","unstructured":"Kolcz, A. and Yih, W. (2007), \u201cRaising the baseline for high\u2010precision text classifiers\u201d, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07\u2009), San Jos\u00e9, CA, USA, pp. 525\u201033.","DOI":"10.1145\/1281192.1281237"},{"key":"key2022031219535676400_b12","unstructured":"Lewis, D. and Ringuette, M. (1994), \u201cA comparison of two learning algorithms for text categorization\u201d, Proceedings of the Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, USA, pp. 81\u201093."},{"key":"key2022031219535676400_b13","unstructured":"Lewis, D., Yang, Y., Rose, T. and Li, F. (2004), \u201cRCV1: a new benchmark collection for text categorization research\u201d, Journal of Machine Learning Research, Vol. 5, pp. 361\u201097."},{"key":"key2022031219535676400_b14","doi-asserted-by":"crossref","unstructured":"Liu, Y., Han, T. and Sun, A. (2009), \u201cImbalanced text classification: a term weighting approach\u201d, Expert Systems with Applications, Vol. 36 No. 1, pp. 
690\u2010701.","DOI":"10.1016\/j.eswa.2007.10.042"},{"key":"key2022031219535676400_b16","unstructured":"McCallum, A. and Nigam, K. (1998), \u201cA comparison of event models for na\u00efve Bayes text classification\u201d, Proceedings of the International Conference on Machine Learning, Workshop on Learning for Text Categorization, Madison, WI, USA, pp. 41\u20108."},{"key":"key2022031219535676400_b15","doi-asserted-by":"crossref","unstructured":"Maron, M. and Kuhns, J. (1960), \u201cOn relevance, probabilistic indexing, and information retrieval\u201d, Journal of the Association for Computing Machinery, Vol. 7 No. 3, pp. 216\u201044.","DOI":"10.1145\/321033.321035"},{"key":"key2022031219535676400_b17","unstructured":"Perkins, S., Lacker, K. and Theiler, J. (2003), \u201cGrafting: fast, incremental feature selection by gradient descent in function space\u201d, Journal of Machine Learning Research, Vol. 3, pp. 1333\u201056."},{"key":"key2022031219535676400_b18","doi-asserted-by":"crossref","unstructured":"Qiang, G. (2010), \u201cAn effective algorithm for improving the performance on naive Bayes for text classification\u201d, Proceedings of the 2nd International Conference on Computer Research and Development (ICCRD'10), Kuala Lumpur, Malaysia, pp. 699\u2010701.","DOI":"10.1109\/ICCRD.2010.160"},{"key":"key2022031219535676400_b19","unstructured":"Rennie, J., Shih, L., Teevan, J. and Karger, D. (2003), \u201cTackling the poor assumptions of naive Bayes text classifiers\u201d, Proceedings of the 20th International Conference on Machine Learning (ICML'03), Washington, DC, USA, pp. 616\u201023."},{"key":"key2022031219535676400_b20","doi-asserted-by":"crossref","unstructured":"Robertson, S. and Walker, S. (1994), \u201cSome simple effective approximations to the 2\u2010Poisson model for probabilistic weighted retrieval\u201d, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), Dublin, Ireland, pp. 
232\u201041.","DOI":"10.1007\/978-1-4471-2099-5_24"},{"key":"key2022031219535676400_b21","doi-asserted-by":"crossref","unstructured":"Salton, G. and Buckley, C. (1988), \u201cTerm\u2010weighting approaches in automatic retrieval\u201d, Information Processing & Management, Vol. 24 No. 5, pp. 513\u201023.","DOI":"10.1016\/0306-4573(88)90021-0"},{"key":"key2022031219535676400_b23","doi-asserted-by":"crossref","unstructured":"Schneider, K. (2005), \u201cTechniques for improving the performance of naive Bayes for text classification\u201d, Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing'06\u2009), Mexico City, Mexico, pp. 682\u201093.","DOI":"10.1007\/978-3-540-30586-6_76"},{"key":"key2022031219535676400_b22","doi-asserted-by":"crossref","unstructured":"Sebastiani, F. (2002), \u201cMachine learning in automated text categorization\u201d, ACM Computing Surveys, Vol. 34 No. 1, pp. 1\u201047.","DOI":"10.1145\/505282.505283"},{"key":"key2022031219535676400_b24","unstructured":"Vapnik, V. (1998), Statistical Learning Theory, Wiley\u2010Interscience, Hoboken, NJ."},{"key":"key2022031219535676400_b25","unstructured":"Voorhees, E. and Harman, D. (2005), TREC: Experiments and Evaluation in Information Retrieval, MIT Press, New York, NY."},{"key":"key2022031219535676400_b26","doi-asserted-by":"crossref","unstructured":"Wilbur, W. and Kim, W. (2009), \u201cThe ineffectiveness of within\u2010document term frequency in text classification\u201d, Information Retrieval, Vol. 12 No. 5, pp. 
509\u201025.","DOI":"10.1007\/s10791-008-9069-5"}],"container-title":["International Journal of Web Information Systems"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/www.emeraldinsight.com\/doi\/full-xml\/10.1108\/17440081211222591","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/17440081211222591\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/17440081211222591\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T00:25:05Z","timestamp":1753403105000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/ijwis\/article\/8\/1\/55-72\/164088"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,3,30]]},"references-count":26,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2012,3,30]]}},"alternative-id":["10.1108\/17440081211222591"],"URL":"https:\/\/doi.org\/10.1108\/17440081211222591","relation":{},"ISSN":["1744-0084"],"issn-type":[{"type":"print","value":"1744-0084"}],"subject":[],"published":{"date-parts":[[2012,3,30]]}}}