{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T04:27:03Z","timestamp":1777696023849,"version":"3.51.4"},"reference-count":57,"publisher":"SAGE Publications","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IDA"],"published-print":{"date-parts":[[2021,4,20]]},"abstract":"<jats:p>In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach for text classification in which feature selection is performed after the n-gram construction process, our proposal performs an embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, speeding up the process and making the construction of all the word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset with only the features that are relevant for classification, that can be used directly without the need for a feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance when compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case online reviews and ratings in various domains.<\/jats:p>","DOI":"10.3233\/ida-205154","type":"journal-article","created":{"date-parts":[[2021,4,23]],"date-time":"2021-04-23T14:46:44Z","timestamp":1619189204000},"page":"509-525","source":"Crossref","is-referenced-by-count":20,"title":["Efficient n-gram construction for text categorization using feature selection techniques"],"prefix":"10.1177","volume":"25","author":[{"given":"Maximiliano","family":"Garc\u00eda","sequence":"first","affiliation":[{"name":"Universidad de los Andes, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sebasti\u00e1n","family":"Maldonado","sequence":"additional","affiliation":[{"name":"Department of Management Control and Information Systems, School of Economics and Business, University of Chile, Santiago, Chile"},{"name":"Instituto Sistemas Complejos de Ingenier\u00eda (ISCI), Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carla","family":"Vairetti","sequence":"additional","affiliation":[{"name":"Universidad de los Andes, Santiago, Chile"},{"name":"Instituto Sistemas Complejos de Ingenier\u00eda (ISCI), Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/IDA-205154_ref1","unstructured":"R. Agrawal, R. Srikant et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vo. 1215, 1994, pp. 487\u2013499."},{"key":"10.3233\/IDA-205154_ref2","doi-asserted-by":"crossref","unstructured":"A. Ahmed, Y. Hifny, S. Toral and K. Shaalan, A call center agent productivity modeling using discriminative approaches, in: Intelligent Natural Language Processing: Trends and Applications, Springer, 2018, pp. 501\u2013520.","DOI":"10.1007\/978-3-319-67056-0_24"},{"key":"10.3233\/IDA-205154_ref3","doi-asserted-by":"crossref","unstructured":"P. Antony and K. Soman, Kernel based part of speech tagger for kannada, in: 2010 International Conference on Machine Learning and Cybernetics, IEEE, Vol. 4, 2010, pp. 2139\u20132144.","DOI":"10.1109\/ICMLC.2010.5580488"},{"key":"10.3233\/IDA-205154_ref4","unstructured":"A. Bakliwal, P. Arora, A. Patil and V. Varma, Towards enhanced opinion classification using nlp techniques, in: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology, 2011, pp. 101\u2013107."},{"issue":"1","key":"10.3233\/IDA-205154_ref5","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1007\/s10489-018-1299-7","article-title":"Hybrid attribute based sentiment classification of online reviews for consumer intelligence","volume":"49","author":"Bansal","year":"2019","journal-title":"Applied Intelligence"},{"issue":"23","key":"10.3233\/IDA-205154_ref7","doi-asserted-by":"crossref","first-page":"3365","DOI":"10.1093\/bioinformatics\/btu557","article-title":"A novel feature-based approach to extract drug-drug interactions from biomedical text","volume":"30","author":"Bui","year":"2014","journal-title":"Bioinformatics"},{"issue":"2","key":"10.3233\/IDA-205154_ref8","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1142\/S0218488517400165","article-title":"Detecting deceptive opinions: intra and cross-domain classification using an efficient representation","volume":"25","author":"Cagnina","year":"2017","journal-title":"International Journal of Uncertainty Fuzziness and Knowledge-Based Systems"},{"key":"10.3233\/IDA-205154_ref9","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.neucom.2017.11.077","article-title":"Feature selection in machine learning: a new perspective","volume":"300","author":"Cai","year":"2018","journal-title":"Neurocomputing"},{"key":"10.3233\/IDA-205154_ref11","doi-asserted-by":"crossref","unstructured":"A. Deshwal and S.K. Sharma, Twitter sentiment analysis using various classification algorithms, in: 2016 5th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), IEEE, 2016, pp. 251\u2013257.","DOI":"10.1109\/ICRITO.2016.7784960"},{"key":"10.3233\/IDA-205154_ref12","doi-asserted-by":"crossref","unstructured":"A. Ekbal and S. Bandyopadhyay, Part of speech tagging in bengali using support vector machine, in: 2008 International Conference on Information Technology, IEEE, 2008, pp. 106\u2013111.","DOI":"10.1109\/ICIT.2008.12"},{"key":"10.3233\/IDA-205154_ref13","doi-asserted-by":"crossref","unstructured":"S.R. El-Beltagy, Kp-miner: a simple system for effective keyphrase extraction, in: 2006 Innovations in Information Technology, IEEE, 2006, pp. 1\u20135.","DOI":"10.1109\/INNOVATIONS.2006.301948"},{"issue":"3","key":"10.3233\/IDA-205154_ref14","first-page":"119","article-title":"Effects of stop words elimination for arabic information retrieval: a comparative study","volume":"4","author":"El-Khair","year":"2006","journal-title":"International Journal of Computing & Information Sciences"},{"key":"10.3233\/IDA-205154_ref15","first-page":"1871","article-title":"Liblinear: a library for large linear classification","volume":"9","author":"Fan","year":"2008","journal-title":"Journal of Machine Learning Research"},{"issue":"5","key":"10.3233\/IDA-205154_ref16","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1255\/nirn.1194","article-title":"Double cross-validation","volume":"21","author":"Fearn","year":"2010","journal-title":"NIR News"},{"key":"10.3233\/IDA-205154_ref17","first-page":"1","article-title":"A study using n-gram features for text categorization","volume":"3","author":"F\u00fcrnkranz","year":"1998","journal-title":"Austrian Research Institute for Artifical Intelligence"},{"key":"10.3233\/IDA-205154_ref19","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1016\/j.knosys.2017.10.028","article-title":"Differential evolution for filter feature selection based on information theory and feature ranking","volume":"140","author":"Hancer","year":"2018","journal-title":"Knowledge-Based Systems"},{"key":"10.3233\/IDA-205154_ref20","doi-asserted-by":"crossref","unstructured":"F. Harrag, E. El-Qawasmeh and P. Pichappan, Improving arabic text categorization using decision trees, in: 2009 First International Conference on Networked Digital Technologies, IEEE, 2009, pp. 110\u2013115.","DOI":"10.1109\/NDT.2009.5272214"},{"key":"10.3233\/IDA-205154_ref21","doi-asserted-by":"crossref","unstructured":"F.M. Hasan, N. UzZaman and M. Khan, Comparison of different pos tagging techniques (n-gram, hmm and brill\u2019s tagger) for bangla, in: Advances and Innovations in Systems, Computing Sciences and Software Engineering, Springer, 2007, pp.\u00a0121\u2013126.","DOI":"10.1007\/978-1-4020-6264-3_23"},{"key":"10.3233\/IDA-205154_ref22","unstructured":"L. Jensen and T. Martinez, Improving text classification by using conceptual and contextual features, in: Proceedings of the Workshop on Text Mining at the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 101\u2013102."},{"key":"10.3233\/IDA-205154_ref23","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1016\/j.engappai.2016.02.002","article-title":"Deep feature weighting for naive bayes and its application to text classification","volume":"52","author":"Jiang","year":"2016","journal-title":"Engineering Applications of Artificial Intelligence"},{"key":"10.3233\/IDA-205154_ref24","doi-asserted-by":"crossref","first-page":"61065","DOI":"10.1109\/ACCESS.2018.2873634","article-title":"A new filter feature selection based on criteria fusion for gene microarray data","volume":"6","author":"Ke","year":"2018","journal-title":"IEEE Access"},{"issue":"8","key":"10.3233\/IDA-205154_ref25","doi-asserted-by":"crossref","first-page":"3123","DOI":"10.1007\/s10489-019-01425-4","article-title":"Enswf: effective features extraction and selection in conjunction with ensemble learning methods for document sentiment classification","volume":"49","author":"Khan","year":"2019","journal-title":"Applied Intelligence"},{"key":"10.3233\/IDA-205154_ref26","doi-asserted-by":"crossref","unstructured":"N. Kumar and K. Srinathan, Automatic keyphrase extraction from scientific documents using n-gram filtration technique, in: Proceedings of the Eighth ACM Symposium on Document Engineering, ACM, 2008, pp. 199\u2013208.","DOI":"10.1145\/1410140.1410180"},{"key":"10.3233\/IDA-205154_ref27","doi-asserted-by":"crossref","unstructured":"D.A. Kurniawan, S. Wibirama and N.A. Setiawan, Real-time traffic classification with twitter data mining, in: 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), IEEE, 2016, pp. 1\u20135.","DOI":"10.1109\/ICITEED.2016.7863251"},{"key":"10.3233\/IDA-205154_ref28","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1016\/j.engappai.2017.12.014","article-title":"A novel multivariate filter method for feature selection in text classification problems","volume":"70","author":"Labani","year":"2018","journal-title":"Engineering Applications of Artificial Intelligence"},{"key":"10.3233\/IDA-205154_ref29","unstructured":"S.L. Lam and D.L. Lee, Feature reduction for neural network based text categorization, in: Proceedings. 6th International Conference on Advanced Systems for Advanced Applications, IEEE, 1999, pp. 195\u2013202."},{"key":"10.3233\/IDA-205154_ref30","doi-asserted-by":"crossref","unstructured":"C. Li, B. Wang, V. Pavlu and J.A. Aslam, An empirical study of skip-gram features and regularization for learning on sentiment analysis, in: European Conference on Information Retrieval, Springer, 2016, pp. 72\u201387.","DOI":"10.1007\/978-3-319-30671-1_6"},{"key":"10.3233\/IDA-205154_ref32","doi-asserted-by":"crossref","unstructured":"Y. Li, L. Yao, C. Mao, A. Srivastava, X. Jiang and Y. Luo, Early prediction of acute kidney injury in critical care setting using clinical notes, in: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2018, pp.\u00a0683\u2013686.","DOI":"10.1109\/BIBM.2018.8621574"},{"key":"10.3233\/IDA-205154_ref33","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1016\/j.ins.2019.05.093","article-title":"Profit-based credit scoring based on robust optimization and feature selection","volume":"500","author":"L\u00f3pez","year":"2019","journal-title":"Information Sciences"},{"key":"10.3233\/IDA-205154_ref34","doi-asserted-by":"crossref","first-page":"441","DOI":"10.1016\/j.asoc.2017.11.006","article-title":"Whale optimization approaches for wrapper feature selection","volume":"62","author":"Mafarja","year":"2018","journal-title":"Applied Soft Computing"},{"key":"10.3233\/IDA-205154_ref35","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1016\/j.neucom.2018.04.035","article-title":"Ellipsoidal support vector regression based on second-order cone programming","volume":"305","author":"Maldonado","year":"2018","journal-title":"Neurocomputing"},{"key":"10.3233\/IDA-205154_ref36","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1016\/j.asoc.2018.12.024","article-title":"An alternative smote oversampling strategy for high-dimensional datasets","volume":"76","author":"Maldonado","year":"2019","journal-title":"Applied Soft Computing"},{"key":"10.3233\/IDA-205154_ref38","first-page":"163","article-title":"Opinion classification techniques applied to a spanish corpus","volume":"47","author":"Mart\u00ednez C\u00e1mara","year":"2011","journal-title":"Procesamiento del Lenguaje Natural"},{"key":"10.3233\/IDA-205154_ref39","unstructured":"D. Mladenic and M. Grobelnik, Word sequences as features in text-learning, in: In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98), 1998."},{"key":"10.3233\/IDA-205154_ref40","doi-asserted-by":"crossref","unstructured":"A. Moschitti, A study on convolution kernels for shallow semantic parsing, in: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004, p. 335.","DOI":"10.3115\/1218955.1218998"},{"key":"10.3233\/IDA-205154_ref41","doi-asserted-by":"crossref","unstructured":"Y. Nagano and R. Uda, Static analysis with paragraph vector for malware detection, in: Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, ACM, 2017, p. 80.","DOI":"10.1145\/3022227.3022306"},{"key":"10.3233\/IDA-205154_ref42","unstructured":"C. Nobata, S. Sekine, M. Murata, K. Uchimoto, M. Utiyama and H. Isahara, Sentence extraction system assembling multiple evidence, in: Proceedings of the Second NTCIR Workshop Meeting, 2001, pp. 213\u2013218."},{"issue":"1\u20132","key":"10.3233\/IDA-205154_ref43","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/1500000011","article-title":"Opinion mining and sentiment analysis","volume":"2","author":"Pang","year":"2008","journal-title":"Foundations and Trends in Information Retrieval"},{"issue":"2","key":"10.3233\/IDA-205154_ref44","doi-asserted-by":"crossref","first-page":"344","DOI":"10.1016\/j.ipm.2006.07.006","article-title":"Contextual feature selection for text classification","volume":"43","author":"Paradis","year":"2007","journal-title":"Information Processing & Management"},{"key":"10.3233\/IDA-205154_ref45","unstructured":"J. Plisson, N. Lavrac and D. Mladenic, A rule based approach to word lemmatization, in: Proceedings of the 7th International Multi-Conference Information Society IS-2004, 2004, pp. 83\u201386."},{"issue":"1\u20132","key":"10.3233\/IDA-205154_ref46","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1023\/A:1023864622883","article-title":"Comparing simple recurrent networks and n-grams in a large corpus","volume":"19","author":"Rodriguez","year":"2003","journal-title":"Applied Intelligence"},{"key":"10.3233\/IDA-205154_ref47","doi-asserted-by":"crossref","unstructured":"G. Roffo, S. Melzi and M. Cristani, Infinite feature selection, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4202\u20134210.","DOI":"10.1109\/ICCV.2015.478"},{"key":"10.3233\/IDA-205154_ref48","unstructured":"M.K. Saad and W.M. Ashour, Arabic text classification using decision trees, in: Proceedings of the 12th International Workshop on Computer Science and Information Technologies CSIT2010, 2010."},{"key":"10.3233\/IDA-205154_ref49","doi-asserted-by":"crossref","unstructured":"R.E. Schapire, Y. Singer and A. Singhal, Boosting and rocchio applied to text filtering, in: SIGIR, Vol. 98, 1998, pp.\u00a0215\u2013223.","DOI":"10.1145\/290941.290996"},{"key":"10.3233\/IDA-205154_ref50","doi-asserted-by":"crossref","unstructured":"H. Sch\u00fctze, D.A. Hull and J.O. Pedersen, A comparison of classifiers and document representations for the routing problem, in: Annual ACM Conference on Research and Development in Information Retrieval-ACM SIGIR, 1995.","DOI":"10.1145\/215206.215365"},{"key":"10.3233\/IDA-205154_ref51","unstructured":"R. Socher, C.C. Lin, C. Manning and A.Y. Ng, Parsing natural scenes and natural language with recursive neural networks, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 129\u2013136."},{"issue":"4","key":"10.3233\/IDA-205154_ref52","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1016\/S0306-4573(01)00045-0","article-title":"The use of bigrams to enhance text categorization","volume":"38","author":"Tan","year":"2002","journal-title":"Information Processing & Management"},{"issue":"4","key":"10.3233\/IDA-205154_ref53","doi-asserted-by":"crossref","first-page":"2622","DOI":"10.1016\/j.eswa.2007.05.028","article-title":"An empirical study of sentiment analysis for chinese documents","volume":"34","author":"Tan","year":"2008","journal-title":"Expert Systems with Applications"},{"issue":"6","key":"10.3233\/IDA-205154_ref54","doi-asserted-by":"crossref","first-page":"1602","DOI":"10.1109\/TKDE.2016.2522427","article-title":"A bayesian classification approach using class-specific features for text categorization","volume":"28","author":"Tang","year":"2016","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"9","key":"10.3233\/IDA-205154_ref55","doi-asserted-by":"crossref","first-page":"2508","DOI":"10.1109\/TKDE.2016.2563436","article-title":"Toward optimal feature selection in naive bayes for text categorization","volume":"28","author":"Tang","year":"2016","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"8","key":"10.3233\/IDA-205154_ref57","doi-asserted-by":"crossref","first-page":"1311","DOI":"10.1109\/TIFS.2014.2332820","article-title":"Multiple account identity deception detection in social media using nonverbal behavior","volume":"9","author":"Tsikerdekis","year":"2014","journal-title":"IEEE Transactions on Information Forensics and Security"},{"issue":"5","key":"10.3233\/IDA-205154_ref58","doi-asserted-by":"crossref","first-page":"1688","DOI":"10.1007\/s10489-018-1334-8","article-title":"Improved whale optimization algorithm for feature selection in arabic sentiment analysis","volume":"49","author":"Tubishat","year":"2019","journal-title":"Applied Intelligence"},{"key":"10.3233\/IDA-205154_ref59","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1007\/s00521-013-1368-0","article-title":"A review of feature selection methods based on mutual information","volume":"24","author":"Vergara","year":"2014","journal-title":"Neural Computing and Applications"},{"issue":"1","key":"10.3233\/IDA-205154_ref60","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1007\/s10115-014-0746-y","article-title":"Adapting naive bayes tree for text classification","volume":"44","author":"Wang","year":"2015","journal-title":"Knowledge and Information Systems"},{"key":"10.3233\/IDA-205154_ref61","unstructured":"J. Xu, Y. Wu, Y. Zhang, J. Wang, R. Liu, Q. Wei and H. Xu, Uth-ccb@ biocreative v cdr task: identifying chemical-induced disease relations in biomedical text, in: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, 2015, pp.\u00a0254\u2013259."},{"key":"10.3233\/IDA-205154_ref62","doi-asserted-by":"crossref","unstructured":"M. Zhang, J. Zhang, J. Su and G. Zhou, A composite kernel to extract relations between entities with both flat and structured features, in: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2006, pp.\u00a0825\u2013832.","DOI":"10.3115\/1220175.1220279"},{"issue":"8","key":"10.3233\/IDA-205154_ref63","doi-asserted-by":"crossref","first-page":"3093","DOI":"10.1007\/s10489-019-01441-4","article-title":"A quantum-inspired sentiment representation model for twitter sentiment analysis","volume":"49","author":"Zhang","year":"2019","journal-title":"Applied Intelligence"}],"container-title":["Intelligent Data Analysis"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/IDA-205154","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:19:04Z","timestamp":1777454344000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/IDA-205154"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,20]]},"references-count":57,"journal-issue":{"issue":"3"},"URL":"https:\/\/doi.org\/10.3233\/ida-205154","relation":{},"ISSN":["1088-467X","1571-4128"],"issn-type":[{"value":"1088-467X","type":"print"},{"value":"1571-4128","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,20]]}}}