{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T10:00:30Z","timestamp":1760608830937},"reference-count":27,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2014,10,15]],"date-time":"2014-10-15T00:00:00Z","timestamp":1413331200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2015,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We address the problem of unsupervised and semi-supervised SMS (Short Message Service) text message SPAM detection. We develop a content-based Bayesian classification approach which is a modest extension of the technique discussed by Resnik and Hardisty in 2010. The approach assumes that the bodies of the SMS messages arise from a probabilistic generative model and estimates the model parameters by Gibbs sampling using an unlabeled, or partially labeled, SMS training message corpus. The approach classifies new SMS messages as SPAM or HAM (non-SPAM) by zero-thresholding their logit estimates. We tested the approach on a publicly available SMS corpora collected from the UK. Used in semi-supervised fashion, the approach clearly outperformed a competing algorithm, Semi-Boost. Used in unsupervised fashion, the approach outperformed a fully supervised classifier, an SVM (Support Vector Machine), when the number of training messages used by the SVM was small and performed comparably otherwise. 
We believe the approach works well and is a useful tool for SMS SPAM detection.<\/jats:p>","DOI":"10.1017\/s1351324914000102","type":"journal-article","created":{"date-parts":[[2014,10,15]],"date-time":"2014-10-15T08:50:58Z","timestamp":1413363058000},"page":"553-567","source":"Crossref","is-referenced-by-count":6,"title":["(Un\/Semi-)supervised SMS text message SPAM detection"],"prefix":"10.1017","volume":"21","author":[{"given":"CHRIS R.","family":"GIANNELLA","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"RANSOM","family":"WINDER","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"BRANDON","family":"WILSON","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2014,10,15]]},"reference":[{"key":"S1351324914000102_ref026","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2012.3"},{"key":"S1351324914000102_ref024","first-page":"231","article-title":"Unsupervised SPAM detection by document probability estimation with maximal overlap method","volume":"6","author":"Uemura","year":"2011","journal-title":"Information and Media Technologies"},{"key":"S1351324914000102_ref023","unstructured":"The International Telecommunication Union (Online; accessed 1-April-2013) The World in 2010: ICT Facts and Figures www.itu.int\/ITUD\/ict\/facts\/2011\/material\/ICTFactsFigures2010.pdf"},{"key":"S1351324914000102_ref021","doi-asserted-by":"crossref","unstructured":"Sohn D. , Lee J. , and Rim H. 2009. The contribution of stylistic information to content-based mobile SPAM filtering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 
Suntec, Singapore: Association for Computational Linguistics.","DOI":"10.3115\/1667583.1667682"},{"key":"S1351324914000102_ref017","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.191"},{"key":"S1351324914000102_ref015","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2008.235"},{"key":"S1351324914000102_ref014","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324900000218"},{"key":"S1351324914000102_ref012","unstructured":"Huffington Post (Online, accessed 1-April-2013) SMS Fraud: 95m Spam Text Messages Sent Per Day, Up 300% In 12 Months. www.huffingtonpost.co.uk\/2012\/05\/28\/sms-fraud-95m-spam-text-m_n_1550193.html"},{"key":"S1351324914000102_ref005","unstructured":"Cloudmark (Online; accessed 1-April-2013) Mobile Messaging Security Solutions. www.cloudmark.com\/en\/industries\/mobile\/solutions"},{"key":"S1351324914000102_ref004","unstructured":"Blanzieri B. , and Bryl A. 2008. A survey of learning-based techniques of email SPAM filtering. Technical Report DIT-06-056, Information Engineering and Computer Science Department, University of Trento."},{"key":"S1351324914000102_ref002","doi-asserted-by":"crossref","unstructured":"Almeida T. , Gomez Hidalgo J. M. , and Yamakami A. 2011. Contribution to the study of SMS SPAM filtering: new collection and results. In Proceedings of the ACM Symposium on Document Engineering. Mountain View, CA USA: Association for Computing Machinery.","DOI":"10.1145\/2034691.2034742"},{"key":"S1351324914000102_ref001","first-page":"1","article-title":"Towards SMS spam filtering: results under a new dataset","volume":"2","author":"Almeida","year":"2013","journal-title":"International Journal of Information Security Science"},{"key":"S1351324914000102_ref025","doi-asserted-by":"crossref","unstructured":"Wang C. , Zhang Y. , Chen X. , Liu Z. , Shi L. , Chen G. , Qiu F. , Ying C. , and Lu W. 2010. A behavior-based SMS antispam system. 
IBM Journal of Research & Development 54 (6): 3:1\u20133:16.","DOI":"10.1147\/JRD.2010.2066050"},{"key":"S1351324914000102_ref016","doi-asserted-by":"publisher","DOI":"10.1002\/sec.577"},{"key":"S1351324914000102_ref009","doi-asserted-by":"crossref","unstructured":"Gomez Hidalgo J. M. , Cajigas Bringas G. , Puertas Sanz E. , and Carrero Garcia F. 2006. Content based SMS SPAM filtering. In Proceedings of the ACM Symposium on Document Engineering. Amsterdam, Netherlands: Association for Computing Machinery.","DOI":"10.1145\/1166160.1166191"},{"key":"S1351324914000102_ref011","unstructured":"Gunal S. , Ergin S. , and Gunal E. 2012. A novel framework for SMS SPAM filtering. In Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications. Trabzon, Turkey: Institute for Electrical and Electronic Engineers."},{"key":"S1351324914000102_ref007","doi-asserted-by":"crossref","unstructured":"Coskun B. , and Giura P. 2012. Mitigating SMS spam by online detection of repetitive near-duplicate messages. In Proceedings of the IEEE Communication and Information Systems Security Symposium. Ottawa, Ontario, Canada: Institute for Electrical and Electronic Engineers.","DOI":"10.1109\/ICC.2012.6363989"},{"key":"S1351324914000102_ref003","unstructured":"Balaguer E. , and Rosso P. 2011. Detection of near-duplicate user generated contents: the SMS spam collection. In Proceedings of the International Workshop on Search and Mining User-Generated Contents. Glasgow, UK: Association for Computing Machinery."},{"key":"S1351324914000102_ref018","doi-asserted-by":"crossref","unstructured":"Qian F. , Pathak A. , Hu Y. , Mao Z. , and Xie Y. 2010. A case for unsupervised-learning-based SPAM filtering. In Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 
New York, New York USA: Association for Computing Machinery.","DOI":"10.1145\/1811039.1811090"},{"key":"S1351324914000102_ref010","doi-asserted-by":"publisher","DOI":"10.1145\/1216016.1216017"},{"key":"S1351324914000102_ref013","first-page":"273","article-title":"A tutorial on practical prediction theory for classification","volume":"6","author":"Langford","year":"2006","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324914000102_ref022","unstructured":"Tagg C. 2009. A corpus linguistics study of SMS text messaging. PhD thesis, Department of English, University of Birmingham, UK."},{"key":"S1351324914000102_ref008","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2012.02.053"},{"key":"S1351324914000102_ref020","unstructured":"Settles B. 2009. Active learning literature survey. Technical Report 1648. Computer Sciences Department, University of Wisconsin. Madison, Wisconsin USA."},{"key":"S1351324914000102_ref027","doi-asserted-by":"crossref","unstructured":"Yadav K. , Kumaraguru P. , Goyal A. , Gupta A. , and Naik V. 2011. SMSAssassin: crowdsourcing driven mobile-based system for SMS SPAM filtering. In Proceedings of the International Workshop on Mobile Computing Systems and Applications. Phoenix, Arizona USA: Association for Computing Machinery.","DOI":"10.1145\/2184489.2184491"},{"key":"S1351324914000102_ref006","doi-asserted-by":"crossref","unstructured":"Cormack G. , Gomez Hidalgo J. M. , and Puertas Sanz E. 2007. Feature engineering for mobile (SMS) SPAM filtering. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Amsterdam, Netherlands: Association for Computing Machinery.","DOI":"10.1145\/1277741.1277951"},{"key":"S1351324914000102_ref019","unstructured":"Resnik P. , and Hardisty E. 2010. Gibbs sampling for the uninitiated. Technical Report LAMP-TR-153. Language and Media Processing Laboratory, University of Maryland College Park. 
College Park, Maryland USA."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324914000102","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,8,16]],"date-time":"2019-08-16T02:31:26Z","timestamp":1565922686000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324914000102\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,10,15]]},"references-count":27,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,8]]}},"alternative-id":["S1351324914000102"],"URL":"https:\/\/doi.org\/10.1017\/s1351324914000102","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,10,15]]}}}