{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T01:38:09Z","timestamp":1769823489383,"version":"3.49.0"},"reference-count":20,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2021,12,19]],"date-time":"2021-12-19T00:00:00Z","timestamp":1639872000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Intelligent & Fuzzy Systems"],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>\n                    This paper presents a computational model for the unsupervised authorship attribution task based on a traditional machine learning scheme. An improvement over the state of the art is achieved by comparing different feature selection methods on the PAN17 author clustering dataset. To achieve this improvement, specific pre-processing and feature extraction methods were proposed, such as a method to separate tokens by type to assign them to only one category. Similarly, special characters are used as part of the punctuation marks to improve the result obtained when applying typed character\n                    <jats:italic>n<\/jats:italic>\n                    -grams. The\n                    <jats:italic>Weighted cosine similarity<\/jats:italic>\n                    measure is applied to improve the\n                    <jats:italic>B<\/jats:italic>\n                    <jats:sup>3<\/jats:sup>\n                    F-score by reducing the vector values where attributes are exclusive. 
This measure is used to define distances between documents, which the clustering algorithm later uses to perform authorship attribution.\n                  <\/jats:p>","DOI":"10.3233\/jifs-219226","type":"journal-article","created":{"date-parts":[[2021,12,21]],"date-time":"2021-12-21T11:58:06Z","timestamp":1640087886000},"page":"4357-4367","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":2,"title":["Unsupervised authorship attribution using feature selection and weighted cosine similarity"],"prefix":"10.1177","volume":"42","author":[{"given":"Carolina","family":"Mart\u00edn-del-Campo-Rodr\u00edguez","sequence":"first","affiliation":[{"name":"Instituto Polit\u00e9cnico Nacional, Centro de Investigaci\u00f3n en Computaci\u00f3n, Av. Juan de Dios B\u00e1tiz, s\/n, Col. Nueva Industrial Vallejo, Mexico City, Mexico"}]},{"given":"Grigori","family":"Sidorov","sequence":"additional","affiliation":[{"name":"Instituto Polit\u00e9cnico Nacional, Centro de Investigaci\u00f3n en Computaci\u00f3n, Av. Juan de Dios B\u00e1tiz, s\/n, Col. Nueva Industrial Vallejo, Mexico City, Mexico"}]},{"given":"Ildar","family":"Batyrshin","sequence":"additional","affiliation":[{"name":"Instituto Polit\u00e9cnico Nacional, Centro de Investigaci\u00f3n en Computaci\u00f3n, Av. Juan de Dios B\u00e1tiz, s\/n, Col. Nueva Industrial Vallejo, Mexico City, Mexico"}]}],"member":"179","published-online":{"date-parts":[[2021,12,19]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2017.07.018"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-008-9066-8"},{"issue":"2016","key":"e_1_3_2_4_2","first-page":"345","article-title":"Visualization of similarity measures for binary data and 2\u00d72 tables","volume":"20","author":"Batyrshin I.","unstructured":"BatyrshinI., KubyshevaN., SolovyevV. 
and Villa-VargasL., Visualization of similarity measures for binary data and 2\u00d72 tables, Computaci\u00f3n y Sistemas 20 (2016), 345\u2013353.","journal-title":"Computaci\u00f3n y Sistemas"},{"key":"e_1_3_2_5_2","first-page":"1152","article-title":"Authorship attribution using content based features and n-gram features","volume":"9","author":"Dara R.","year":"2019","unstructured":"DaraR. and ReddyT.R., Authorship attribution using content based features and n-gram features, International Journal of Engineering and Advanced Technology 9 (2019), 1152\u20131156.","journal-title":"International Journal of Engineering and Advanced Technology"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2012.05.019"},{"key":"e_1_3_2_7_2","unstructured":"Garc\u00eda-MondejaY. Castro-CastroD. Lavielle-CastroV. and MunozR. Discovering Author Groups Using a \u03b2-compact Graph-based Clustering In CLEF 2017 Working Notes CEUR Workshop Proceedings 2017."},{"key":"e_1_3_2_8_2","unstructured":"G\u00f3mez-AdornoH. AlemanY. Vilari\u00f1oD. Sanchez-PerezM.A. PintoD. and SidorovG. Author Clustering using Hierarchical Clustering Analysis In CLEF 2017 Working Notes CEUR Workshop Proceedings 2017."},{"key":"e_1_3_2_9_2","unstructured":"HalvaniO. and GranerL. Author Clustering based on Compression-based Dissimilarity Scores In CLEF 2017 Working Notes CEUR Workshop Proceedings 2017."},{"key":"e_1_3_2_10_2","unstructured":"KocherM. and SavoyJ. UniNE at CLEF 2017: Author Clustering In CLEF 2017 Working Notes CEUR Workshop Proceedings 2017."},{"key":"e_1_3_2_11_2","unstructured":"LiuL. KangJ. YuJ. and WangZ. A comparative study on unsupervised feature selection methods for text clustering In 2005 International Conference on Natural Language Processing and Knowledge Engineering 2005 pp. 
597\u2013601."},{"key":"e_1_3_2_12_2","first-page":"252","article-title":"Performance evaluation of various feature extraction and classification techniques for authorship attribution","volume":"16","author":"Mahor U.","year":"2015","unstructured":"MahorU. and DasS., Performance evaluation of various feature extraction and classification techniques for authorship attribution, International Journal of Innovation and Scientific Research 16 (2015), 252\u2013259.","journal-title":"International Journal of Innovation and Scientific Research"},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"MarkovI. StamatatosE. and SidorovG. Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing CICLing 2017. Springer 2017.","DOI":"10.1007\/978-3-319-77116-8_21"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Mart\u00edn-del-Campo-Rodr\u00edguezC. SidorovG. and BatyrshinI.Z. Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity In I.Z. Batyrshin M. de Lourdes Mart\u00ednez-Villase\u00f1or and H.E.P. Espinosa (Eds.) Advances in Computational Intelligence \u2013 17th Mexican International Conference on Artificial Intelligence MICAI 2018 Guadalajara Mexico October 22\u201327 2018 pp. 49\u201356.","DOI":"10.1007\/978-3-030-04497-8_4"},{"key":"e_1_3_2_15_2","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa F.","year":"2011","unstructured":"PedregosaF., VaroquauxG., GramfortA., MichelV., ThirionB., GriselO., BlondelM., PrettenhoferP., WeissR., DubourgV., VanderplasJ., PassosA., CournapeauD., BrucherM., PerrotM. 
and DuchesnayE., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011), 2825\u20132830.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","unstructured":"PramokchonP. and Piamsa-ngaP. An unsupervised fast correlation-based filter for feature selection for data clustering In Proceedings of the First International Conference on Advanced Data and Information Engineering Springer Singapore 2013 pp. 87\u201394.","DOI":"10.1007\/978-981-4585-18-7_10"},{"key":"e_1_3_2_17_2","doi-asserted-by":"crossref","unstructured":"SapkotaU. BethardS. Montes-y-G\u00f3mezM. and SolorioT. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies. NAACL-HLT-15. Association for Computational Linguistics 2015 pp. 93\u2013102.","DOI":"10.3115\/v1\/N15-1010"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"SidorovG. Syntactic n-grams in computational linguistics. (1st ed.). Springer International Publishing 2019.","DOI":"10.1007\/978-3-030-14771-6_8"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.21001"},{"key":"e_1_3_2_20_2","unstructured":"TschuggnallM. StamatatosE. VerhoevenB. DaelemansW. SpechtG. SteinB. and PotthastM. 
Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering In Working Notes of CLEF 2017 \u2013 Conference and Labs of the Evaluation Forum Dublin Ireland September 11\u201314 2017."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2005.845141"}],"container-title":["Journal of Intelligent & Fuzzy Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-219226","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/JIFS-219226","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-219226","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T13:25:27Z","timestamp":1769779527000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.3233\/JIFS-219226"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,19]]},"references-count":20,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.3233\/JIFS-219226"],"URL":"https:\/\/doi.org\/10.3233\/jifs-219226","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12,19]]}}}