{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:02:18Z","timestamp":1750309338204,"version":"3.41.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T00:00:00Z","timestamp":1723766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2024,8,31]]},"abstract":"<jats:p>Dealing with many of the problems related to the quality of textual content online involves identifying similar content. Algorithmic solutions for duplicate content classification typically rely on text vector representation, which maps textual information into a set of features. Ideally, this representation would capture all aspects of the underlying text, including length, word frequencies, syntax, and semantics. While recent advancements in text representation have led to improved performance, a comprehensive approach that explicitly incorporates all text features has not yet been proposed. In this study, we present the INCEPT framework that utilizes multiple representation methods to detect duplicate text pairs, taking advantage of their individual strengths. The core of our approach involves using a stacking ensemble of pairwise vector distance measurements that are computed from multiple text representation methods. A stacking classifier then utilizes these distance scores as input and learns to identify duplicate posts. We assess the proposed framework\u2019s effectiveness in identifying duplicate posts in an online Question and Answer platform. By combining several text representation methods, INCEPT performs well in the duplicate posts classification task. 
Our experiments demonstrate that specific framework configurations outperform the accuracy scores obtained from individual text representation methods. Therefore, we also infer that no single text representation method can independently capture a text\u2019s features.<\/jats:p>","DOI":"10.1145\/3677322","type":"journal-article","created":{"date-parts":[[2024,7,15]],"date-time":"2024-07-15T11:05:33Z","timestamp":1721041533000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5728-3959","authenticated-orcid":false,"given":"Erjon","family":"Skenderi","sequence":"first","affiliation":[{"name":"Tampere University, Tampere, Finland and University of Helsinki, Helsinki, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2707-108X","authenticated-orcid":false,"given":"Jukka","family":"Huhtam\u00e4ki","sequence":"additional","affiliation":[{"name":"Tampere University, Tampere, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3532-2387","authenticated-orcid":false,"given":"Salla-Maaria","family":"Laaksonen","sequence":"additional","affiliation":[{"name":"University of Helsinki, Helsinki, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1317-8062","authenticated-orcid":false,"given":"Kostas","family":"Stefanidis","sequence":"additional","affiliation":[{"name":"Tampere University, Tampere, 
Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,8,16]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/2901739"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3324997"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1162\/153244303322533223"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.5555\/944919.944937"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.1803.11175"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366715.3366739"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343484"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2017.7966144"},{"key":"e_1_3_2_11_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT \u201919). 4171\u20134186. https:\/\/arxiv.org\/abs\/1810.04805v2"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.3758\/BF03203370"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1002\/ARIS.1440380105"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3004296"},{"key":"e_1_3_2_16_2","unstructured":"M. Hoffman D. Blei and F. Bach. 2010. Online learning for latent Dirichlet allocation. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS \u201910) Vol. 1. 
856\u2013864."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007617005950"},{"volume-title":"Train a Sentence Embedding Model with 1B Training Pairs","year":"2021","key":"e_1_3_2_18_2","unstructured":"HuggingFace. 2021. Train a Sentence Embedding Model with 1B Training Pairs. Technical Report. HuggingFace. https:\/\/huggingface.co\/blog\/1b-sentence-embeddings"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICNN.1995.488968"},{"key":"e_1_3_2_20_2","first-page":"160","volume-title":"Information Access Evaluation. Multilinguality, Multimodality, and Interaction","author":"Kosmpoulos Aris","year":"2014","unstructured":"Aris Kosmpoulos, Georgios Paliouras, and Ion Androutsopoulos. 2014. The effect of dimensionality reduction on large scale hierarchical classification. In Information Access Evaluation. Multilinguality, Multimodality, and Interaction, Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms (Eds.). Springer International Publishing, Cham, 160\u2013171."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.3389\/FDATA.2020.00003\/BIBTEX"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.1405.4053"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer Veselin Stoyanov and Paul G. Allen. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019). DOI:10.48550\/arxiv.1907.11692","DOI":"10.48550\/arxiv.1907.11692"},{"key":"e_1_3_2_24_2","volume-title":"Proceedings of the 1st International Conference on Learning Representations (ICLR \u201913): Workshop Track","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. 
In Proceedings of the 1st International Conference on Learning Representations (ICLR \u201913): Workshop Track."},{"key":"e_1_3_2_25_2","article-title":"Advances in pre-training distributed word representations","author":"Mikolov Tomas","year":"2017","unstructured":"Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2017. Advances in pre-training distributed word representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation. 52\u201355. http:\/\/arxiv.org\/abs\/1712.09405","journal-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/S11042-020-10082-6\/TABLES\/11"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.2010.15036"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1155\/2022\/6584394"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/J.ESWA.2016.06.005"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3538491"},{"issue":"85","key":"e_1_3_2_31_2","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa Fabian","year":"2011","unstructured":"Fabian Pedregosa, Ga\u00ebl Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and \u00c9douard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 85 (2011), 2825\u20132830. http:\/\/jmlr.org\/papers\/v12\/pedregosa11a.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","unstructured":"Jeffrey Pennington Richard Socher and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. 
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP \u201914). 1532\u20131543. DOI:10.3115\/V1\/D14-1162","DOI":"10.3115\/V1\/D14-1162"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","unstructured":"Nina Poerner and Hinrich Sch\u00fctze. 2019. Multi-view domain adapted sentence embeddings for low-resource unsupervised duplicate question detection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP \u201919). 1630\u20131641. DOI:10.18653\/V1\/D19-1173","DOI":"10.18653\/V1\/D19-1173"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.26615\/978-954-452-056-4_116"},{"key":"e_1_3_2_35_2","unstructured":"Radim Rehurek and Petr Sojka. 2011. Gensim\u2013Python Framework for Vector Space Modelling. NLP Centre Faculty of Informatics Masaryk University Brno Czech Republic."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.1908.10084"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178541"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.1309.2388"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/SANER.2018.8330262"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","unstructured":"Prabhnoor Singh Rajkanwar Chopra Ojasvi Sharma and Rekha Singla. 2020. Stackoverflow tag prediction using tag associations and code analysis. Journal of Discrete Mathematical Sciences and Cryptography 23 1 (2020) 35\u201343. 
DOI:10.1080\/09720529.2020.1721857","DOI":"10.1080\/09720529.2020.1721857"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.3390\/info12120491"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3314183.3323460"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/2934687"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P19-1355","article-title":"Energy and policy considerations for deep learning in NLP","author":"Strubell Emma","year":"2019","unstructured":"Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL \u201919). http:\/\/arxiv.org\/abs\/1906.02243","journal-title":"In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL \u201919)."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","unstructured":"Ashish Upadhyay Tien Thanh Nguyen Stewart Massie and John McCall. 2020. WEC: Weighted ensemble of text classifiers. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC \u201920). DOI:10.1109\/CEC48606.2020.9185641","DOI":"10.1109\/CEC48606.2020.9185641"},{"key":"e_1_3_2_46_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS \u201917). 5999\u20136009. https:\/\/arxiv.org\/abs\/1706.03762v5"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1507"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1016\/J.DSS.2013.08.002"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","unstructured":"Jiapeng Wang and Yihong Dong. 2020. Measurement of text similarity: A survey. Information 11 9 (2020) 421. 
DOI:10.3390\/INFO11090421","DOI":"10.3390\/INFO11090421"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2968391"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0893-6080(05)80023-1"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/J.INS.2010.11.023"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/S10462-022-10283-5"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052701"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/S11390-015-1576-4"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3486622.3493928"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677322","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3677322","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:04:21Z","timestamp":1750291461000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677322"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,16]]},"references-count":55,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,8,31]]}},"alternative-id":["10.1145\/3677322"],"URL":"https:\/\/doi.org\/10.1145\/3677322","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"type":"print","value":"1559-1131"},{"type":"electronic","value":"1559-114X"}],"subject":[],"published":{"date-parts":[[2024,8,16]]},"assertion":[{"value":"2023-02-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-08-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}