{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T00:49:30Z","timestamp":1772498970846,"version":"3.50.1"},"reference-count":45,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7B (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models\u2019 performance matched, and occasionally exceeded, that of models trained on real data. 
In the second scenario, the augmented classes outperformed their non-augmented counterparts in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class classification of job postings. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees more limited improvement. Selecting the right type and amount of augmentation is essential.<\/jats:p>","DOI":"10.3390\/info14110585","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T06:19:58Z","timestamp":1698214798000},"page":"585","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings"],"prefix":"10.3390","volume":"14","author":[{"given":"Panagiotis","family":"Skondras","sequence":"first","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22131 Tripolis, Greece"}]},{"given":"Nikos","family":"Zotos","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, University of Patras, 26334 Patras, Greece"}]},{"given":"Dimitris","family":"Lagios","sequence":"additional","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22131 Tripolis, Greece"}]},{"given":"Panagiotis","family":"Zervas","sequence":"additional","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22131 Tripolis, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5989-6313","authenticated-orcid":false,"given":"Konstantinos 
C.","family":"Giotopoulos","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, University of Patras, 26334 Patras, Greece"}]},{"given":"Giannis","family":"Tzimas","sequence":"additional","affiliation":[{"name":"Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22131 Tripolis, Greece"}]}],"member":"1968","published-online":{"date-parts":[[2023,10,25]]},"reference":[{"key":"ref_1","unstructured":"(2023, October 15). OpenAI API. Available online: https:\/\/bit.ly\/3UOELSX."},{"key":"ref_2","unstructured":"(2023, October 15). GPT4All API. Available online: https:\/\/docs.gpt4all.io\/index.html."},{"key":"ref_3","unstructured":"Ye, J., Chen, X., Xu, N., Zu, C., Shao, Z., Liu, S., Cui, Y., Zhou, Z., Gong, C., and Shen, Y. (2023). A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv."},{"key":"ref_4","unstructured":"Anand, Y., Nussbaum, Z., Duderstadt, B., Schmidt, B., and Mulyar, A. (2023, September 16). GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo. Available online: https:\/\/github.com\/nomic-ai\/gpt4all."},{"key":"ref_5","unstructured":"(2023, October 15). The Rise of Open-Source LLMs in 2023: A Game Changer in AI. Available online: https:\/\/www.ankursnewsletter.com\/p\/the-rise-of-open-source-llms-in-2023."},{"key":"ref_6","unstructured":"(2023, October 15). 12 Best Large Language Models (LLMs) in 2023. Available online: https:\/\/beebom.com\/best-large-language-models-llms\/."},{"key":"ref_7","unstructured":"Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. 
arXiv."},{"key":"ref_8","unstructured":"Chiang, W., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., and Gonzalez, J.E. (2023, October 15). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. Available online: https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/."},{"key":"ref_9","unstructured":"Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. (2023). WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv."},{"key":"ref_10","unstructured":"White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. (2023). A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv."},{"key":"ref_11","first-page":"1146","article-title":"Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models","volume":"29","author":"Strobelt","year":"2023","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_12","unstructured":"Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., and Liu, Y. (2023). Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gao, A. (2023, October 24). Prompt Engineering for Large Language Models. 2023. Available online: https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=4504303.","DOI":"10.2139\/ssrn.4504303"},{"key":"ref_14","unstructured":"Liu, V., and Chilton, L.B. (May, January 29). Design Guidelines for Prompt Engineering Text-to-Image Generative Models. Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA."},{"key":"ref_15","unstructured":"Sabit, E. (2023). Prompt Engineering for ChatGPT: A Quick Guide to Techniques, Tips, And Best Practices. TechRxiv, techrXiv:22683919.v2."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1145\/3544558","article-title":"
A Survey on Data Augmentation for Text Classification","volume":"55","author":"Bayer","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Shi, Z., and Lipani, A. (2023). Rethink the Effectiveness of Text Data Augmentation: An Empirical Analysis. arXiv.","DOI":"10.14428\/esann\/2023.ES2023-42"},{"key":"ref_18","unstructured":"Kumar, V., Choudhary, A., and Cho, E. (2021). Data Augmentation using Pre-trained Transformer Models. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3182","DOI":"10.14778\/3476311.3476403","article-title":"Data augmentation for ML-driven data preparation and integration","volume":"14","author":"Li","year":"2021","journal-title":"Proc. VLDB Endow."},{"key":"ref_20","unstructured":"Whitehouse, C., Choudhury, M., and Aji, A.F. (2023). LLM-powered Data Augmentation for Enhanced Crosslingual Performance. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., and Tar, C. (November, January 31). Universal Sentence Encoder for English. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-2029"},{"key":"ref_22","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_23","first-page":"6","article-title":"Machine Learning and Job Posting Classification: A Comparative Study","volume":"4","author":"Nasser","year":"2020","journal-title":"Int. J. Eng. Inf. Syst."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Zaroor, A., Maree, M., and Sabha, M. (2017, January 6\u20138). JRC: A Job Post and Resume Classification System for Online Recruitment. 
Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA.","DOI":"10.1109\/ICTAI.2017.00123"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"807","DOI":"10.1016\/j.ipm.2017.05.004","article-title":"Human resources for Big Data professions: A systematic classification of job roles and required skill sets","volume":"54","author":"Greco","year":"2018","journal-title":"Inf. Process. Manag."},{"key":"ref_26","unstructured":"Zhang, M., Jensen, K.N., and Plank, B. (2022). Kompetencer: Fine-grained Skill Classification in Danish Job Postings via Distant Supervision and Transfer Learning. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Goindani, M., Liu, Q., Chao, J., and Jijkoun, V. (2017, January 18\u201321). Employer Industry Classification Using Job Postings. Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA.","DOI":"10.1109\/ICDMW.2017.30"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Varelas, G., Lagios, D., Ntouroukis, S., Zervas, P., Parsons, K., and Tzimas, G. (2022). Employing Natural Language Processing Techniques for Online Job Vacancies Classification, Springer. IFIP Advances in Information and Communication Technology.","DOI":"10.1007\/978-3-031-08341-9_27"},{"key":"ref_29","unstructured":"(2023, October 15). Hugging Face Libraries. Available online: https:\/\/huggingface.co\/docs\/hub\/models-libraries."},{"key":"ref_30","unstructured":"(2023, October 13). Scrapy. Available online: https:\/\/scrapy.org\/."},{"key":"ref_31","unstructured":"(2023, October 15). Requests. Available online: https:\/\/python.langchain.com\/docs\/integrations\/tools\/requests."},{"key":"ref_32","unstructured":"(2023, October 15). Beautiful Soup. Available online: https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/."},{"key":"ref_33","unstructured":"(2023, October 13). MariaDB. 
Available online: https:\/\/mariadb.org."},{"key":"ref_34","unstructured":"(2023, October 15). ChatGPT\u2014Python Parameters Tuning. Available online: https:\/\/platform.openai.com\/docs\/api-reference\/completions\/create."},{"key":"ref_35","unstructured":"(2023, October 15). GPT4All\u2014Python Parameters Tuning. Available online: https:\/\/docs.gpt4all.io\/gpt4all_python.html#the-generate-method-api."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1108\/eb026526","article-title":"A Statistical Interpretation of Term Specificity and Its Application in Retrieval","volume":"28","author":"Sparck","year":"1972","journal-title":"J. Doc."},{"key":"ref_37","unstructured":"(2023, October 15). Cosine Similarity. Available online: https:\/\/www.sciencedirect.com\/topics\/computer-science\/cosine-similarity."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Josifoski, M., Sakota, M., Peyrard, M., and West, R. (2023). Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction. arXiv.","DOI":"10.18653\/v1\/2022.naacl-main.342"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Xu, B., Wang, Q., Lyu, Y., Dai, D., Zhang, Y., and Mao, Z. (2023, January 9\u201314). S2ynRE: Two-Stage Self-Training with Synthetic Data for Low-resource Relation Extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.acl-long.455"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Jeronymo, V., Bonifacio, L., Abonizio, H., Fadaee, M., Lotufo, R., Zavrel, J., and Nogueira, R. (2023). InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv.","DOI":"10.1145\/3477495.3531863"},{"key":"ref_41","unstructured":"Veselovsky, V., Ribeiro, M.H., Arora, A., Josifoski, M., Anderson, A., and West, R. (2023). 
Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science. arXiv."},{"key":"ref_42","unstructured":"Abonizio, H., Bonifacio, L., Jeronymo, V., Lotufo, R., Zavrel, J., and Nogueira, R. (2023). InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Skondras, P., Psaroudakis, G., Zervas, P., and Tzimas, G. (2023, January 10\u201312). Efficient Resume Classification through Rapid Dataset Creation Using ChatGPT. Proceedings of the 14th International Conference on Information, Intelligence, Systems and Applications (IISA 2023), Volos, Greece.","DOI":"10.1109\/IISA59645.2023.10345870"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3530811","article-title":"Efficient Transformers: A Survey","volume":"55","author":"Tay","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_45","unstructured":"(2023, October 15). Safeguarding LLMs with Guardrails. 
Available online: https:\/\/towardsdatascience.com\/safeguarding-llms-with-guardrails-4f5d9f57cff2."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/11\/585\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:11:29Z","timestamp":1760130689000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/11\/585"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,25]]},"references-count":45,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2023,11]]}},"alternative-id":["info14110585"],"URL":"https:\/\/doi.org\/10.3390\/info14110585","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,25]]}}}