{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T15:13:00Z","timestamp":1773155580317,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,9,6]],"date-time":"2023-09-06T00:00:00Z","timestamp":1693958400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"crossref","award":["871042"],"award-info":[{"award-number":["871042"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"crossref","award":["951911"],"award-info":[{"award-number":["951911"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Italian Ministry of University and Research under the NextGenerationEU program"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2024,1,31]]},"abstract":"<jats:p>In this article, we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In \u201cclassic\u201d authorship analysis, a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. 
We instead investigate the situation in which a feature vector represents an unordered <jats:italic>pair<\/jats:italic> of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that, in some cases (e.g., authorship verification), it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call <jats:italic>Diff-Vectors<\/jats:italic>) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as it is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block. The code to reproduce our experiments is open-source and available online.<jats:xref ref-type=\"fn\"><jats:sup>1<\/jats:sup><\/jats:xref><\/jats:p>","DOI":"10.1145\/3609226","type":"journal-article","created":{"date-parts":[[2023,7,15]],"date-time":"2023-07-15T10:24:18Z","timestamp":1689416658000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Same or Different? 
Diff-Vectors for Authorship Analysis"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5284-1771","authenticated-orcid":false,"given":"Silvia","family":"Corbara","sequence":"first","affiliation":[{"name":"Scuola Normale Superiore, Pisa, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0377-1025","authenticated-orcid":false,"given":"Alejandro","family":"Moreo","sequence":"additional","affiliation":[{"name":"Istituto di Scienza e Tecnologie dell\u2019Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4221-6427","authenticated-orcid":false,"given":"Fabrizio","family":"Sebastiani","sequence":"additional","affiliation":[{"name":"Istituto di Scienza e Tecnologie dell\u2019Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,9,6]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1201\/b17320","volume-title":"Data Classification: Algorithms and Applications","author":"Aggarwal Charu C.","year":"2014","unstructured":"Charu C. Aggarwal. 2014. Instance-based learning: A survey. In Data Classification: Algorithms and Applications. Charu C. Aggarwal (Ed.), CRC Press, London, UK, 157\u2013185."},{"key":"e_1_3_2_3_2","volume-title":"Proceedings of the Working Notes of the 2011 Conference and Labs of the Evaluation Forum (CLEF 2011)","author":"Argamon Shlomo","year":"2011","unstructured":"Shlomo Argamon and Patrick Juola. 2011. Overview of the international authorship identification competition at PAN 2011. In Proceedings of the Working Notes of the 2011 Conference and Labs of the Evaluation Forum (CLEF 2011). 
Amsterdam, NL."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/1461928.1461959"},{"key":"e_1_3_2_5_2","volume-title":"Proceedings of the Working Notes of the 2015 Conference and Labs of the Evaluation Forum (CLEF 2015)","author":"Bartoli Alberto","year":"2015","unstructured":"Alberto Bartoli, Alex Dagri, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao. 2015. An author verification approach based on differential features. In Proceedings of the Working Notes of the 2015 Conference and Labs of the Evaluation Forum (CLEF 2015). Toulouse, FR."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1080\/09296174.2013.830549"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1634"},{"issue":"1","key":"e_1_3_2_8_2","article-title":"Who\u2019s at the keyboard? Authorship attribution in digital evidence investigations","volume":"4","author":"Chaski Carole E.","year":"2005","unstructured":"Carole E. Chaski. 2005. Who\u2019s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence 4, 1 (2005).","journal-title":"International Journal of Digital Evidence"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.24660"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-30754-7_15"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485822"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1023824908771"},{"key":"e_1_3_2_13_2","first-page":"105","volume-title":"Proceedings of the 13th International Conference on Machine Learning (ICML 1996)","author":"Domingos Pedro M.","year":"1996","unstructured":"Pedro M. Domingos and Michael J. Pazzani. 1996. Beyond independence: Conditions for the optimality of the simple bayesian classifier. In Proceedings of the 13th International Conference on Machine Learning (ICML 1996). 
Bari, IT, 105\u2013112."},{"issue":"1","key":"e_1_3_2_14_2","first-page":"99","article-title":"Style-markers in authorship attribution: A cross-language study of the authorial fingerprint","volume":"6","author":"Eder Maciej","year":"2011","unstructured":"Maciej Eder. 2011. Style-markers in authorship attribution: A cross-language study of the authorial fingerprint. Studies in Polish Linguistics 6, 1 (2011), 99\u2013114.","journal-title":"Studies in Polish Linguistics"},{"key":"e_1_3_2_15_2","first-page":"212","volume-title":"Encyclopedia of Machine Learning (2nd ed.)","author":"Flach Peter A.","year":"2017","unstructured":"Peter A. Flach. 2017. Classifier calibration. In Encyclopedia of Machine Learning (2nd ed.), Claude Sammut and Geoffrey I. Webb (Eds.), Springer, 212\u2013219."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fqr029"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40802-1_28"},{"key":"e_1_3_2_18_2","volume-title":"Benchmarking Authorship Attribution Techniques using over a Thousand Books by Fifty Victorian Era Novelists","author":"Gungor Abdulmecit","year":"2018","unstructured":"Abdulmecit Gungor. 2018. Benchmarking Authorship Attribution Techniques using over a Thousand Books by Fifty Victorian Era Novelists. Master\u2019s thesis. Department of Computer and Information Science, Purdue University, Indianapolis, US."},{"key":"e_1_3_2_19_2","unstructured":"Marjan Hosseinia and Arjun Mukherjee. 2018. Experiments with neural networks for small and large scale authorship verification. arXiv:1803.06456. Retrieved from https:\/\/arxiv.org\/abs\/1803.06456"},{"key":"e_1_3_2_20_2","first-page":"1995","volume-title":"Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021)","author":"Ikae Catherine","year":"2021","unstructured":"Catherine Ikae. 2021. UniNE at PAN-CLEF 2021: Authorship verification. 
In Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021). Bucharest, RO, 1995\u20132003."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1561\/1500000005"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-018-9424-0"},{"key":"e_1_3_2_23_2","first-page":"1743","volume-title":"Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021)","author":"Kestemont Mike","year":"2021","unstructured":"Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Benno Stein, and Martin Potthast. 2021. Overview of the cross-domain authorship verification task at PAN 2021. In Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021). Bucharest, RO, 1743\u20131759."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fqt063"},{"key":"e_1_3_2_25_2","first-page":"1","volume-title":"Proceedings of the Working Notes of the 2019 Conference and Labs of the Evaluation Forum (CLEF 2019)","author":"Kestemont Mike","year":"2019","unstructured":"Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. 2019. Overview of the cross-domain authorship attribution task at PAN-2019. In Proceedings of the Working Notes of the 2019 Conference and Labs of the Evaluation Forum (CLEF 2019). Lugano, CH, 1\u201315."},{"key":"e_1_3_2_26_2","first-page":"1","volume-title":"Proceedings of the Working Notes of the 2018 Conference and Labs of the Evaluation Forum (CLEF 2018)","author":"Kestemont Mike","year":"2018","unstructured":"Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, G\u00fcnther Specht, Benno Stein, and Martin Potthast. 2018. Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection. 
In Proceedings of the Working Notes of the 2018 Conference and Labs of the Evaluation Forum (CLEF 2018). Avignon, FR, 1\u201325."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30115-8_22"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/17.4.401"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.20961"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.22954"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.5555\/2755227"},{"key":"e_1_3_2_32_2","volume-title":"Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021)","author":"Menta Antonio","year":"2021","unstructured":"Antonio Menta and Ana Garcia-Serrano. 2021. Authorship verification with neural networks via stylometric feature concatenation. In Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021). Bucharest, RO."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.5194"},{"key":"e_1_3_2_34_2","unstructured":"Alejandro Moreo Andrea Esuli and Fabrizio Sebastiani. 2018. Revisiting distributional correspondence indexing: A Python reimplementation and new experiments. arXiv:1810.09311. Retrieved from https:\/\/arxiv.org\/abs\/1810.09311"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2883446"},{"key":"e_1_3_2_36_2","volume-title":"Inference and Disputed Authorship: The Federalist","author":"Mosteller Frederick","year":"1964","unstructured":"Frederick Mosteller and David L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, MA."},{"key":"e_1_3_2_37_2","first-page":"413","volume-title":"Proceedings of the 21st Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI 2005)","author":"Niculescu-Mizil Alexandru","year":"2005","unstructured":"Alexandru Niculescu-Mizil and Rich Caruana. 2005. 
Obtaining calibrated probabilities from boosting. In Proceedings of the 21st Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI 2005). Arlington, US, 413\u2013420."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102430"},{"key":"e_1_3_2_39_2","doi-asserted-by":"crossref","first-page":"61","DOI":"10.7551\/mitpress\/1113.003.0008","volume-title":"Advances in Large Margin Classifiers","author":"Platt John C.","year":"2000","unstructured":"John C. Platt. 2000. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Alexander Smola, Peter Bartlett, Bernard Sch\u00f6lkopf, and Dale Schuurmans (Eds.), The MIT Press, Cambridge, MA, 61\u201374."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2016.2603960"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(88)90021-0"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.24176"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00173"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.21001"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.13053\/rcs-123-1-1"},{"key":"e_1_3_2_46_2","first-page":"2585","volume-title":"Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012)","author":"Tetreault Joel R.","year":"2012","unstructured":"Joel R. Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012). 
Mumbai, IN, 2585\u20132602."},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fqw001"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.13135\/2532-5353\/3518"},{"key":"e_1_3_2_49_2","volume-title":"Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021)","author":"Weerasinghe Janith","year":"2021","unstructured":"Janith Weerasinghe, Rhia Singh, and Rachel Greenstadt. 2021. Feature vector difference based authorship verification for open world settings. In Proceedings of the Working Notes of the 2021 Conference and Labs of the Evaluation Forum (CLEF 2021). Bucharest, RO."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1016\/s0893-6080(05)80023-1"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.5555\/1005332.1016791"},{"key":"e_1_3_2_52_2","first-page":"412","volume-title":"Proceedings of the 14th International Conference on Machine Learning (ICML 1997)","author":"Yang Yiming","year":"1997","unstructured":"Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML 1997). 
Nashville, US, 412\u2013420."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/775107.775151"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.5555\/1115657.1115669"}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609226","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3609226","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:58Z","timestamp":1750182538000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609226"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,6]]},"references-count":53,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1,31]]}},"alternative-id":["10.1145\/3609226"],"URL":"https:\/\/doi.org\/10.1145\/3609226","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"value":"1556-4681","type":"print"},{"value":"1556-472X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,6]]},"assertion":[{"value":"2023-01-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-10","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}