{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T14:05:31Z","timestamp":1772460331889,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,1]],"date-time":"2026-03-01T00:00:00Z","timestamp":1772323200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>This paper addresses the challenge of deduplicating job postings in large, heterogeneous datasets by introducing an efficient, multi-stage methodology that combines embedding-based filtering with Large Language Model (LLM) validation. The proposed system begins with data preprocessing and semantic vectorization of key textual fields using a text embedding model. To reduce the computational cost of exhaustive pairwise comparisons, a clustering-based grouping mechanism is employed to restrict comparisons to semantically coherent clusters. Candidate duplicates are then validated using LLMs, which assess semantic equivalence across highlighted differences in job titles, descriptions, companies, and locations. The proposed system is evaluated on an augmented dataset of 50,000 job postings, producing 6669 candidate pairs for validation. Among the evaluated models, GPT-4o achieved the highest F1-score (95.10%), while the lightweight Phi-4 model demonstrated competitive performance (92.58%) with significantly lower computational cost. These findings demonstrate that the proposed hybrid framework achieves high semantic precision while remaining scalable for continuous large-scale deployment.<\/jats:p>","DOI":"10.3390\/info17030233","type":"journal-article","created":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T12:39:56Z","timestamp":1772455196000},"page":"233","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["On the Task of Job Posting Deduplication Using Embedding-Based Filtering and LLM Validation"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2969-8881","authenticated-orcid":false,"given":"Giannis","family":"Thivaios","sequence":"first","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 26334 Patras, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2710-1066","authenticated-orcid":false,"given":"Panagiotis","family":"Zervas","sequence":"additional","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 26334 Patras, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5989-6313","authenticated-orcid":false,"given":"Konstantinos","family":"Giotopoulos","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, University of Patras, 26334 Patras, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4073-7256","authenticated-orcid":false,"given":"Giannis","family":"Tzimas","sequence":"additional","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 26334 Patras, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"224","DOI":"10.32996\/jbms.2024.6.3.18","article-title":"Application of Artificial Intelligence (AI) in Recruitment and Selection: The Case of Company A and Company B","volume":"6","author":"Zhang","year":"2024","journal-title":"J. Bus. Manag. Stud."},{"key":"ref_2","unstructured":"Draisbach, U. (2022). Efficient Duplicate Detection and the Impact of Transitivity. [Ph.D. Thesis, Universitat Potsdam]."},{"key":"ref_3","unstructured":"Zhao, Y., Chen, H., and Mason, C.M. (2022). A framework for duplicate detection from online job postings. Proceedings of the WI-IAT\u201921: 20th IEEE\/WIC\/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Melbourne, Australia, 14\u201317 December 2021, Association for Computing Machinery."},{"key":"ref_4","first-page":"1","article-title":"Feature extraction and duplicate detection for text mining: A survey","volume":"16","author":"Ramya","year":"2017","journal-title":"Glob. J. Comput. Sci. Technol."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Tzimas, G., Zotos, N., Mourelatos, E., Giotopoulos, K.C., and Zervas, P. (2024). From Data to Insight: Transforming Online Job Postings into Labor-Market Intelligence. Information, 15.","DOI":"10.3390\/info15080496"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Engelbach, M., Klau, D., Kintz, M., and Ulrich, A. (2024). Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection. arXiv.","DOI":"10.13053\/cys-28-4-5306"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"7","DOI":"10.31695\/IJERAT.2022.8.4.2","article-title":"Techniques of Data Deduplication for Cloud Storage: A Review","volume":"8","author":"Adhab","year":"2024","journal-title":"Int. J. Eng. Res. Adv. Technol."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Burk, H., Javed, F., and Balaji, J. (2017). Apollo: Near-duplicate detection for job ads in the online recruitment domain. Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18\u201321 November 2017, IEEE.","DOI":"10.1109\/ICDMW.2017.29"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Gao, J., He, Y., Zhang, X., and Xia, Y. (2017). Duplicate short text detection based on Word2vec. Proceedings of the 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 24\u201326 November 2017, IEEE.","DOI":"10.1109\/ICSESS.2017.8342858"},{"key":"ref_10","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013, January 2\u20134). Efficient Estimation of Word Representations in Vector Space. Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Shi, H., Liu, X., Lv, F., Xue, H., Hu, J., Du, S., and Li, T. (2025). A Pre-trained Data Deduplication Model based on Active Learning. arXiv.","DOI":"10.1016\/j.eswa.2025.128628"},{"key":"ref_12","unstructured":"OpenAI (2024, May 01). API Reference\u2014OpenAI Platform. Available online: https:\/\/platform.openai.com\/docs\/api-reference."},{"key":"ref_13","first-page":"283","article-title":"Fake Job Posting Detection","volume":"4","author":"Ram","year":"2024","journal-title":"Int. J. Adv. Res. Sci. Commun. Technol."},{"key":"ref_14","unstructured":"OpenAI (2025, August 07). Introducing GPT-5. Available online: https:\/\/openai.com\/index\/introducing-gpt-5\/."},{"key":"ref_15","unstructured":"(2025, May 15). ESCO (European Skills, Competences, Qualifications and Occupations). Available online: https:\/\/esco.ec.europa.eu\/en\/classification\/occupation_main."},{"key":"ref_16","unstructured":"(2023, September 29). O*NET Web Services. Welcome to the O*Net Web Services Site!. Available online: https:\/\/services.onetcenter.org\/."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1145\/219717.219748","article-title":"WordNet: A Lexical Database for English","volume":"38","author":"Miller","year":"1995","journal-title":"Commun. ACM"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Colombo, S., D\u2019Amico, S., Malandri, L., Mercorio, F., and Seveso, A. (2025). JobSet: Synthetic Job Advertisements Dataset for Labour Market Intelligence. Proceedings of the SAC\u201925: 40th ACM\/SIGAPP Symposium on Applied Computing, Catania, Italy, 31 March\u20134 April 2025, Association for Computing Machinery.","DOI":"10.1145\/3672608.3707718"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Skondras, P., Zervas, P., and Tzimas, G. (2023). Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification. Information, 15.","DOI":"10.3390\/fi15110363"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Skondras, P., Zotos, N., Lagios, D., Zervas, P., and Tzimas, G. (2023). Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings. Future Internet, 14.","DOI":"10.3390\/info14110585"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1508","DOI":"10.22214\/ijraset.2025.74635","article-title":"Fake\/Real Job Posting Detection Using Machine Learning","volume":"13","author":"Itnal","year":"2025","journal-title":"Int. J. Res. Appl. Sci. Eng. Technol."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer Science and Business Media.","DOI":"10.1007\/978-3-642-31164-2"},{"key":"ref_23","unstructured":"Lavi, D., Medentsiy, V., and Graus, D. (2021). conSultantBERT: Fine-Tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Ortiz Martes, D., Gunderson, E., Neuman, C., and Kachouie, N.N. (2025). Transformer Models for Paraphrase Detection: A Comprehensive Semantic Similarity Study. Computers, 14.","DOI":"10.3390\/computers14090385"},{"key":"ref_25","unstructured":"Miller, D.L. (2024, October 24). WordLlama: Recycled Token Embeddings from Large Language Models. Available online: https:\/\/github.com\/dleemiller\/wordllama."},{"key":"ref_26","unstructured":"Bos, A. (2018). Visualizing Differences Between HTML Documents. [Bachelor\u2019s Thesis, Radboud University]."},{"key":"ref_27","unstructured":"Rajiv, Y. (2005). Detecting Similar HTML Documents Using a Sentence-Based Copy Detection Approach. [Master\u2019s Thesis, Department of Computer Science, Brigham Young University]."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1109\/TKDE.2013.19","article-title":"A Similarity Measure for Text Classification and Clustering","volume":"26","author":"Lin","year":"2014","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"012120","DOI":"10.1088\/1742-6596\/978\/1\/012120","article-title":"The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents","volume":"978","author":"Gunawan","year":"2018","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_30","unstructured":"Touvron, H., and Lavril, T. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv."},{"key":"ref_31","unstructured":"Jiang, A., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D., de Las Casas, D., Bressand, F., Lengyel, G., Lample, G., and Saulnier, L. (2023). Mistral 7B. arXiv."},{"key":"ref_32","unstructured":"Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R.J., Javaheripo, M., and Kauffmann, P. (2024). Phi-4 Technical Report. arXiv."},{"key":"ref_33","unstructured":"OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., and Altman, S. (2024). GPT-4 Technical Report. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., and Meng, J. (2024). Safeguarding Large Language Models: A Survey. arXiv.","DOI":"10.1007\/s10462-025-11389-2"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"109698","DOI":"10.1016\/j.compeleceng.2024.109698","article-title":"Privacy issues in Large Language Models","volume":"120","author":"Kibriya","year":"2024","journal-title":"Comput. Electr. Eng."},{"key":"ref_36","unstructured":"Powers, D.M.W. (2020). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","article-title":"A systematic analysis of performance measures for classification tasks","volume":"45","author":"Sokolova","year":"2009","journal-title":"Inf. Process. Manag."},{"key":"ref_38","unstructured":"Gwet, K. (2014). Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement; Among Raters, Advanced Analytics LLC. [4th ed.]."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"e101139","DOI":"10.1136\/bmjhci-2024-101139","article-title":"Large language models for data extraction from unstructured and semi-structured electronic health records: A multiple model performance evaluation","volume":"32","author":"Ntinopoulos","year":"2025","journal-title":"BMJ Health Care Inform."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"A134","DOI":"10.1161\/str.56.suppl_1.134","article-title":"Abstract 134: Use of Large Language Model to Allow Reliable Data Acquisition for International Pediatric Stroke Study","volume":"56","author":"Bhayana","year":"2025","journal-title":"Stroke"},{"key":"ref_41","unstructured":"Du, W., Yang, Y., and Welleck, S. (2024). Optimizing Temperature for Language Models with Multi-Sample Inference. arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/3\/233\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T13:12:01Z","timestamp":1772457121000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/3\/233"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,1]]},"references-count":41,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["info17030233"],"URL":"https:\/\/doi.org\/10.3390\/info17030233","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,1]]}}}