{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T05:06:03Z","timestamp":1772600763729,"version":"3.50.1"},"reference-count":69,"publisher":"Emerald","issue":"ahead-of-print","license":[{"start":{"date-parts":[[2019,7,8]],"date-time":"2019-07-08T00:00:00Z","timestamp":1562544000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emeraldinsight.com\/page\/tdm"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["EL"],"published-print":{"date-parts":[[2019,7,8]]},"abstract":"<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title>\n<jats:p>The purpose of this paper is to develop a journal recommender system, which compares the content similarities between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine). The study examines the appropriateness of three text similarity measures and the impact of numerous aspects of corpus documents on system performance.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title>\n<jats:p>Implemented three similarity measures one at a time on a journal recommender system with two separate journal corpora. Two distinct samples of test abstracts were classified and evaluated based on the normalized discounted cumulative gain.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Findings<\/jats:title>\n<jats:p>The BM25 similarity measure outperforms both the cosine and unigram language similarity measures overall. The unigram language measure shows the lowest performance. The performance results are significantly different between each pair of similarity measures, while the BM25 and cosine similarity measures are moderately correlated. 
The cosine similarity achieves better performance for subjects with higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of corpus journals in the domain of social sciences achieved better performance for cosine similarity and BM25.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title>\n<jats:p>This is the first work related to comparing the suitability of a number of string-based similarity measures with distinct corpora for journal recommender systems.<\/jats:p>\n<\/jats:sec>","DOI":"10.1108\/el-08-2018-0165","type":"journal-article","created":{"date-parts":[[2019,7,11]],"date-time":"2019-07-11T04:29:18Z","timestamp":1562819358000},"source":"Crossref","is-referenced-by-count":5,"title":["Selecting a text similarity measure for a content-based recommender system"],"prefix":"10.1108","volume":"ahead-of-print","author":[{"given":"Manjula","family":"Wijewickrema","sequence":"first","affiliation":[]},{"given":"Vivien","family":"Petras","sequence":"additional","affiliation":[]},{"given":"Naomal","family":"Dias","sequence":"additional","affiliation":[]}],"member":"140","reference":[{"key":"key2019071108290962100_ref001","first-page":"163","article-title":"A survey of text classification algorithms","volume-title":"In Mining Text Data","year":"2012"},{"issue":"2","key":"key2019071108290962100_ref002","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/j.jksuci.2016.04.001","article-title":"Toward an enhanced arabic text classification using cosine similarity and latent semantic indexing","volume":"29","year":"2017","journal-title":"Journal of King Saud University-Computer and Information Sciences"},{"key":"key2019071108290962100_ref003","first-page":"1211","article-title":"LILI: a simple language independent approach for language identification","year":"2016"},{"key":"key2019071108290962100_ref004","first-page":"18","article-title":"Mitigating the 
paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing","volume-title":"The First International Conference on Human Language Technology Research","year":"2001"},{"key":"key2019071108290962100_ref005","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1109\/ReTIS.2015.7232847","article-title":"Sentiment analysis using cosine similarity measure","volume-title":"2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)","year":"2015"},{"key":"key2019071108290962100_ref006","volume-title":"Social Science Research: Principles, Methods, and Practices","year":"2012","edition":"2nd edition"},{"issue":"6","key":"key2019071108290962100_ref007","doi-asserted-by":"crossref","first-page":"e11273","DOI":"10.1371\/journal.pone.0011273","article-title":"Open access to the scientific journal literature: situation 2009","volume":"5","year":"2010","journal-title":"PLoS One"},{"key":"key2019071108290962100_ref008","first-page":"141","article-title":"Comparing and evaluating information retrieval algorithms for news recommendation","year":"2007"},{"key":"key2019071108290962100_ref009","first-page":"53","article-title":"Exploiting new sentiment-based Meta-level features for effective sentiment analysis","year":"2016"},{"key":"key2019071108290962100_ref010","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1016\/j.ins.2013.12.062","article-title":"Scaling up cosine interesting pattern discovery: a depth-first method","volume":"266","year":"2014","journal-title":"Information Sciences"},{"key":"key2019071108290962100_ref011","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.ins.2014.07.033","article-title":"A novel similarity measure between atanassov\u2019s intuitionistic fuzzy sets based on transformation techniques with applications to pattern recognition","volume":"291","year":"2015","journal-title":"Information 
Sciences"},{"key":"key2019071108290962100_ref012","first-page":"127","article-title":"Search engines for the world wide web: a comparative study and evaluation methodology","year":"1996"},{"key":"key2019071108290962100_ref013","first-page":"369","article-title":"The impact of corpus size on question answering performance","year":"2002"},{"issue":"7","key":"key2019071108290962100_ref014","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1108\/eb051463","article-title":"Estimating the recall performance of web search engines","volume":"49","year":"1997","journal-title":"Aslib Proceedings"},{"key":"key2019071108290962100_ref015","volume-title":"Statistical Power Analysis for the Behavioral Sciences","year":"1988","edition":"2nd edition"},{"issue":"13","key":"key2019071108290962100_ref016","doi-asserted-by":"crossref","first-page":"1448","DOI":"10.1002\/asi.20243","article-title":"Predicting reading difficulty with statistical language models","volume":"56","year":"2005","journal-title":"Journal of the American Society for Information Science and Technology"},{"key":"key2019071108290962100_ref017","unstructured":"Dai, A.M. Olah, C. and Le, Q.V. 
(2015), \u201cDocument embedding with paragraph vectors\u201d, Arxiv (preprint), available at: https:\/\/arxiv.org\/abs\/1507.07998 (accessed 10 January 2019)."},{"issue":"5","key":"key2019071108290962100_ref018","first-page":"295","article-title":"Research on automatic classification of documents in library environment: a literature review","volume":"40","year":"2014","journal-title":"Knowledge Organization"},{"key":"key2019071108290962100_ref019","unstructured":"Elsevier (2017), \u201cScopus content coverage guide\u201d, available at: www.elsevier.com\/__data\/assets\/pdf_file\/0007\/69451\/0597-Scopus-Content-Coverage-Guide-US-LETTER-v4-HI-singles-no-ticks.pdf (accessed 19 April 2017)."},{"key":"key2019071108290962100_ref020","doi-asserted-by":"crossref","first-page":"w12","DOI":"10.1093\/nar\/gkm221","article-title":"eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications","volume":"35","year":"2007","journal-title":"Nucleic Acids Research"},{"issue":"4","key":"key2019071108290962100_ref021","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1002\/leap.1112","article-title":"New web services that help authors to choose journals","volume":"30","year":"2017","journal-title":"Learned Publishing"},{"key":"key2019071108290962100_ref022","unstructured":"Frej, J. Chevallet, J.P. and Schwab, D. 
(2018), \u201cEnhancing translation language models with word embedding for information retrieval\u201d, Arxiv (preprint), available at: https:\/\/arxiv.org\/abs\/1801.03844 (accessed 10 January 2019)."},{"key":"key2019071108290962100_ref023","first-page":"1","article-title":"The impact of different training sets on medical documents classification","year":"2014"},{"issue":"13","key":"key2019071108290962100_ref024","doi-asserted-by":"crossref","first-page":"13","DOI":"10.5120\/11638-7118","article-title":"A survey of text similarity approaches","volume":"68","year":"2013","journal-title":"International Journal of Computer Applications"},{"key":"key2019071108290962100_ref025","first-page":"110","article-title":"Revisiting embedding features for simple semi-supervised learning","year":"2014"},{"key":"key2019071108290962100_ref026","first-page":"10","article-title":"A study of parameter tuning for term frequency normalization","year":"2003"},{"issue":"7","key":"key2019071108290962100_ref027","doi-asserted-by":"crossref","first-page":"9","DOI":"10.5120\/ijca2016907841","article-title":"Clustering techniques and the similarity measures used in clustering: a survey","volume":"134","year":"2016","journal-title":"International Journal of Computer Applications"},{"issue":"2\/10","key":"key2019071108290962100_ref028","article-title":"Semantic text similarity using corpus-based word similarity and string similarity","volume":"2","year":"2008","journal-title":"ACM Transactions on Knowledge Discovery from Data"},{"issue":"4","key":"key2019071108290962100_ref029","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/582415.582418","article-title":"Cumulated gain-based evaluation of IR techniques","volume":"20","year":"2002","journal-title":"ACM Transactions on Information Systems"},{"key":"key2019071108290962100_ref030","first-page":"2824","article-title":"Bag-of-embeddings for text classification","volume-title":"International Joint Conference on Artificial 
Intelligence","year":"2016"},{"key":"key2019071108290962100_ref031","unstructured":"Johnson, A. (2008), \u201cHow more like this works in lucene\u201d, available at: http:\/\/cephas.net\/blog\/2008\/03\/30\/how-morelikethis-works-in-lucene\/ (accessed 12 June 2017)."},{"issue":"6","key":"key2019071108290962100_ref032","doi-asserted-by":"crossref","first-page":"809","DOI":"10.1016\/S0306-4573(00)00016-9","article-title":"A probabilistic model of information retrieval: development and comparative experiments: part 2","volume":"36","year":"2000","journal-title":"Information Processing and Management"},{"key":"key2019071108290962100_ref033","first-page":"261","article-title":"Elsevier journal finder: recommending journals for your paper","year":"2015"},{"key":"key2019071108290962100_ref034","article-title":"A grammar-based semantic similarity algorithm for natural language sentences","volume":"2014","year":"2014","journal-title":"The Scientific World Journal"},{"key":"key2019071108290962100_ref035","unstructured":"Lenhard, W. and Lenhard, A. 
(2016), \u201cCalculation of effect sizes\u201d, available at: www.psychometrica.de\/effect_size.html (accessed 27 January 2018)."},{"issue":"4","key":"key2019071108290962100_ref036","doi-asserted-by":"crossref","first-page":"612","DOI":"10.1111\/j.1468-2958.2002.tb00828.x","article-title":"Eta squared, partial eta squared, and misreporting of effect size in communication research","volume":"28","year":"2002","journal-title":"Human Communication Research"},{"issue":"18","key":"key2019071108290962100_ref037","doi-asserted-by":"crossref","first-page":"2298","DOI":"10.1093\/bioinformatics\/btl388","article-title":"Text similarity: an alternative way to search medline","volume":"22","year":"2006","journal-title":"Bioinformatics (Oxford, England)"},{"key":"key2019071108290962100_ref038","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1016\/j.physa.2016.03.003","article-title":"An adaptive contextual quantum language model","volume":"456","year":"2016","journal-title":"Physica A: Statistical Mechanics and Its Applications"},{"key":"key2019071108290962100_ref039","first-page":"611","article-title":"Distance weighted cosine similarity measure for text classification","volume-title":"International Conference on Intelligent Data Engineering and Automated Learning","year":"2013"},{"key":"key2019071108290962100_ref040","first-page":"14","article-title":"A news automatic tagging method based on statistical language model","volume-title":"Tenth International Congress on Image and Signal Processing, BioMedical Engineering and Informatics","year":"2017"},{"issue":"3","key":"key2019071108290962100_ref041","first-page":"493","article-title":"Measuring semantic similarity and relatedness with distributional and knowledge-based approaches","volume":"10","year":"2015","journal-title":"Information and Media 
Technologies"},{"issue":"22","key":"key2019071108290962100_ref042","doi-asserted-by":"crossref","first-page":"3038","DOI":"10.1093\/bioinformatics\/btp529","article-title":"Identifying related journals through log analysis","volume":"25","year":"2009","journal-title":"Bioinformatics"},{"key":"key2019071108290962100_ref043","volume-title":"An Introduction to Information Retrieval","year":"2008"},{"issue":"2","key":"key2019071108290962100_ref044","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1087\/20150210","article-title":"JournalGuide: bringing authors and journals together","volume":"28","year":"2015","journal-title":"Learned Publishing"},{"issue":"3","key":"key2019071108290962100_ref045","first-page":"69","article-title":"Statistics corner: a guide to appropriate use of correlation coefficient in medical research","volume":"24","year":"2012","journal-title":"Malawi Medical Journal"},{"key":"key2019071108290962100_ref046","first-page":"383","article-title":"Capturing term dependencies using a language model based on sentence trees","year":"2002"},{"issue":"2","key":"key2019071108290962100_ref047","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1177\/0165551507082592","article-title":"A comparative study of two automatic document classification methods in a library setting","volume":"34","year":"2008","journal-title":"Journal of Information Science"},{"key":"key2019071108290962100_ref048","first-page":"275","article-title":"A language modeling approach to information retrieval","year":"1998"},{"key":"key2019071108290962100_ref049","first-page":"38","article-title":"Authorship attribution using probabilistic context-free grammars","year":"2010"},{"key":"key2019071108290962100_ref050","first-page":"27","article-title":"Temporal action detection using a statistical language model","volume-title":"IEEE Conference on Computer Vision and Pattern 
Recognition","year":"2016"},{"key":"key2019071108290962100_ref051","first-page":"232","article-title":"Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval","year":"1994"},{"key":"key2019071108290962100_ref052","first-page":"42","article-title":"Simple BM25 extension to multiple weighted fields","year":"2004"},{"key":"key2019071108290962100_ref053","first-page":"18","article-title":"Manuscript matcher: a content and bibliometrics-based scholarly journal recommendation system","year":"2017"},{"key":"key2019071108290962100_ref054","volume-title":"Automatic Text Processing","year":"1988"},{"issue":"5","key":"key2019071108290962100_ref055","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1093\/bioinformatics\/btn006","article-title":"Jane: Suggesting journals, finding experts","volume":"24","year":"2008","journal-title":"Bioinformatics (Oxford, England)"},{"issue":"1","key":"key2019071108290962100_ref056","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1145\/331403.331405","article-title":"Analysis of a very large web search engine query log","volume":"33","year":"1999","journal-title":"ACM SIGIR Forum"},{"issue":"4","key":"key2019071108290962100_ref057","first-page":"35","article-title":"Modern information retrieval: a brief overview","volume":"24","year":"2001","journal-title":"IEEE Data Engineering Bulletin"},{"key":"key2019071108290962100_ref058","first-page":"285","article-title":"Improving the sentiment analysis process of spanish tweets with BM25","volume-title":"International Conference on Applications of Natural Language to Information Systems","year":"2016"},{"issue":"133","key":"key2019071108290962100_ref059","first-page":"6627","article-title":"A novel technique for feature subset selection based on cosine similarity","volume":"6","year":"2012","journal-title":"Applied Mathematical 
Sciences"},{"issue":"3","key":"key2019071108290962100_ref060","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1087\/095315104323159649","article-title":"Authors and open access publishing","volume":"17","year":"2004","journal-title":"Learned Publishing"},{"key":"key2019071108290962100_ref061","first-page":"407","article-title":"Language model information retrieval with document expansion","year":"2006"},{"key":"key2019071108290962100_ref062","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/j.jbi.2014.03.005","article-title":"An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages","volume":"49","year":"2014","journal-title":"Journal of Biomedical Informatics"},{"key":"key2019071108290962100_ref063","first-page":"560","article-title":"Classification of web documents using a na\u00efve bayes method","year":"2003"},{"issue":"4","key":"key2019071108290962100_ref064","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1002\/leap.1113","article-title":"Journal selection criteria in an open access environment: a comparison between the medicine and social sciences","volume":"30","year":"2017","journal-title":"Learned Publishing"},{"issue":"3","key":"key2019071108290962100_ref065","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1108\/00330330610681295","article-title":"The porter stemming algorithm: then and now","volume":"40","year":"2006","journal-title":"Program: Electronic Library and Information Systems"},{"issue":"1","key":"key2019071108290962100_ref066","doi-asserted-by":"crossref","first-page":"7","DOI":"10.6017\/ital.v30i1.3040","article-title":"A simple scheme for book classification using wikipedia","volume":"30","year":"2011","journal-title":"Information Technology and Libraries"},{"key":"key2019071108290962100_ref067","volume-title":"Statistical Language Models for Information 
Retrieval","year":"2008"},{"issue":"2","key":"key2019071108290962100_ref068","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1145\/984321.984322","article-title":"A study of smoothing methods for language models applied to information retrieval","volume":"22","year":"2004","journal-title":"ACM Transactions on Information Systems"},{"key":"key2019071108290962100_ref069","first-page":"1","article-title":"Improving bag-of-words model with spatial information","volume-title":"25th International Conference of IEEE on Image and Vision Computing New Zealand (IVCNZ)","year":"2010"}],"container-title":["The Electronic Library"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emeraldinsight.com\/doi\/full-xml\/10.1108\/EL-08-2018-0165","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emeraldinsight.com\/doi\/full\/10.1108\/EL-08-2018-0165","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T01:07:52Z","timestamp":1753405672000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/el\/article\/37\/3\/506-527\/101604"}},"subtitle":["A comparison in two corpora"],"short-title":[],"issued":{"date-parts":[[2019,7,8]]},"references-count":69,"journal-issue":{"issue":"ahead-of-print","published-print":{"date-parts":[[2019,7,8]]}},"alternative-id":["10.1108\/EL-08-2018-0165"],"URL":"https:\/\/doi.org\/10.1108\/el-08-2018-0165","relation":{},"ISSN":["0264-0473","0264-0473"],"issn-type":[{"value":"0264-0473","type":"print"},{"value":"0264-0473","type":"print"}],"subject":[],"published":{"date-parts":[[2019,7,8]]}}}