{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T14:37:52Z","timestamp":1773758272546,"version":"3.50.1"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100020618","name":"Universit\u00e4t Bayreuth","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100020618","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Softw Syst Model"],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Automatic retrieval of formal business process models from their natural language descriptions is a well-established way to facilitate the time- and cost-intensive modeling procedure. Yet, a lack of data usable for developing and training new retrieval methods is impeding progress in this field of research. This issue can be overcome by either using methods less reliant on high-quality data, such as large language models, or creating bigger datasets. The latter is often preferable in the context of business process modeling, especially when internal workflows of organizations have to be treated confidentially. It is the more data-intensive solution, though, which is costly. Data augmentation techniques aim to improve both quality and quantity of existing datasets, by deliberate perturbations resulting in new, synthetic data. 
In this article, we present a collection of data augmentation techniques, which are specifically selected for the task of improving data quality in the context of process information extraction. We show why data augmentation techniques from the wider field of natural language processing are often not applicable to process information extraction, and how the resulting data differ in terms of linguistic variety, structure, and feature space coverage. In our experiments, data augmentation results in an absolute improvement in the\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$F_1$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:msub>\n                            <mml:mi>F<\/mml:mi>\n                            <mml:mn>1<\/mml:mn>\n                          <\/mml:msub>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    measure of\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$5.7\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>5.7<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    for extracting process-relevant entities from text and\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$4.5\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                   
       <mml:mrow>\n                            <mml:mn>4.5<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    for extracting relations between those entities. We make all code available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/JulianNeuberger\/pet-data-augmentation\" ext-link-type=\"uri\">https:\/\/github.com\/JulianNeuberger\/pet-data-augmentation<\/jats:ext-link>\n                    and results for our experiments at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/zenodo.org\/doi\/10.5281\/zenodo.10941423\" ext-link-type=\"uri\">https:\/\/zenodo.org\/doi\/10.5281\/zenodo.10941423<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1007\/s10270-025-01305-1","type":"journal-article","created":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T08:49:19Z","timestamp":1751446159000},"page":"329-350","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Repeat, reorder, rephrase: data augmentation for process information extraction"],"prefix":"10.1007","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-4244-7659","authenticated-orcid":false,"given":"Julian","family":"Neuberger","sequence":"first","affiliation":[]},{"given":"Lars","family":"Ackermann","sequence":"additional","affiliation":[]},{"given":"Stefan","family":"Jablonski","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,7,2]]},"reference":[{"key":"1305_CR1","unstructured":"Ackermann, L., K\u00e4ppel, M., Marcus, L., Moder, L., Dunzer, S., Hornsteiner, M., Liessmann, A., Zisgen, Y., Empl, P., Herm, L.-V., 
et\u00a0al.: Recent advances in data-driven business process management. arXiv preprint arXiv:2406.01786 (2024)"},{"key":"1305_CR2","doi-asserted-by":"crossref","unstructured":"Ackermann, L., Neuberger, J., and Jablonski, S.: Data-driven annotation of textual process descriptions based on formal meaning representations. In CAiSE (2021)","DOI":"10.1007\/978-3-030-79382-1_5"},{"key":"1305_CR3","doi-asserted-by":"crossref","unstructured":"Ackermann, L., Neuberger, J., K\u00e4ppel, M., and Jablonski, S.: Bridging research fields: An empirical study on joint, neural relation extraction techniques. In CAiSE (2023)","DOI":"10.1007\/978-3-031-34560-9_28"},{"key":"1305_CR4","doi-asserted-by":"crossref","unstructured":"Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M.: Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.\u00a02623\u20132631 (2019)","DOI":"10.1145\/3292500.3330701"},{"key":"1305_CR5","doi-asserted-by":"crossref","unstructured":"Bellan, P., Dragoni, M., and Ghidini, C.: Assisted process knowledge graph building using pre-trained language models. In Proceedings of AIxIA 2022 - Advances in Artificial Intelligence (2022)","DOI":"10.1007\/978-3-031-27181-6_5"},{"key":"1305_CR6","doi-asserted-by":"crossref","unstructured":"Bellan, P., Dragoni, M., and Ghidini, C.: Extracting business process entities and relations from text using pre-trained language models and in-context learning. In EDOC (2022)","DOI":"10.1007\/978-3-031-17604-3_11"},{"key":"1305_CR7","doi-asserted-by":"crossref","unstructured":"Bellan, P., Ghidini, C., Dragoni, M., Ponzetto, S.\u00a0P., and van\u00a0der Aa, H.: Process extraction from natural language text: the PET dataset and annotation guidelines. 
In NL4AI (2022)","DOI":"10.1007\/978-3-031-25383-6_23"},{"issue":"8","key":"1305_CR8","doi-asserted-by":"publisher","first-page":"1279","DOI":"10.1109\/5.880084","volume":"88","author":"JR Bellegarda","year":"2000","unstructured":"Bellegarda, J.R.: Exploiting latent semantic information in statistical language modeling. Proceed. IEEE 88(8), 1279\u20131296 (2000)","journal-title":"Proceed. IEEE"},{"key":"1305_CR9","unstructured":"Bergstra, J., Bardenet, R., Bengio, Y., and K\u00e9gl, B.: Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems 24 (2011)"},{"key":"1305_CR10","unstructured":"Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O.: Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 26 (2013)"},{"key":"1305_CR11","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)"},{"key":"1305_CR12","unstructured":"Dhole, K.\u00a0D., Gangal, V., Gehrmann, S., Gupta, A., Li, Z., Mahamood, S., Mahendiran, A., Mille, S., Shrivastava, A., Tan, S., et\u00a0al.: Nl-augmenter: a framework for task-sensitive natural language augmentation. arXiv preprint arXiv:2112.02721 (2021)"},{"key":"1305_CR13","unstructured":"Eldin, A.\u00a0N., Assy, N., Anesini, O., Dalmas, B., and Gaaloul, W.: A decomposed hybrid approach to business process modeling with llms"},{"key":"1305_CR14","doi-asserted-by":"crossref","unstructured":"Erdengasileng, A., Han, Q., Zhao, T., Tian, S., Sui, X., Li, K., Wang, W., Wang, J., Hu, T., Pan, F., et\u00a0al.: Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification. 
Database 2022, baac066 (2022)","DOI":"10.1093\/database\/baac066"},{"key":"1305_CR15","unstructured":"Feng, S.\u00a0Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E.: A survey of data augmentation approaches for NLP"},{"key":"1305_CR16","doi-asserted-by":"crossref","unstructured":"Ferreira., R. C.\u00a0B., Thom., L.\u00a0H., and Fantinato., M.: A semi-automatic approach to identify business process elements in natural language texts. In ICEIS (2017)","DOI":"10.5220\/0006305902500261"},{"key":"1305_CR17","doi-asserted-by":"crossref","unstructured":"Friedrich, F., Mendling, J., and Puhlmann, F.: Process model generation from natural language text. In CAiSE (2011)","DOI":"10.1007\/978-3-642-21640-4_36"},{"key":"1305_CR18","unstructured":"G\u00fcnther, M., Ong, J., Mohr, I., Abdessalem, A., Abel, T., Akram, M.\u00a0K., Guzman, S., Mastrapas, G., Sturua, S., Wang, B., Werk, M., Wang, N., and Xiao, H.: Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, (2023)"},{"key":"1305_CR19","doi-asserted-by":"crossref","unstructured":"Grohs, M., Abb, L., Elsayed, N., and Rehse, J.-R.: Large language models can accomplish business process management tasks. In International Conference on Business Process Management, Springer, pp.\u00a0453\u2013465 (2023)","DOI":"10.1007\/978-3-031-50974-2_34"},{"key":"1305_CR20","doi-asserted-by":"crossref","unstructured":"Jiang, Z., Han, J., Sisman, B., and Dong, X.\u00a0L.: Cori: Collective relation integration with data augmentation for open information extraction. arXiv preprint arXiv:2106.00793 (2021)","DOI":"10.18653\/v1\/2021.acl-long.363"},{"key":"1305_CR21","doi-asserted-by":"crossref","unstructured":"Kampik, T., Warmuth, C., Rebmann, A., Agam, R., Egger, L.\u00a0N., Gerber, A., Hoffart, J., Kolk, J., Herzig, P., Decker, G., et\u00a0al.: Large process models: Business process management in the age of generative ai. 
arXiv preprint arXiv:2309.00900 (2023)","DOI":"10.1007\/s13218-024-00884-3"},{"key":"1305_CR22","doi-asserted-by":"crossref","unstructured":"K\u00e4ppel, M., and Jablonski, S.: Model-agnostic event log augmentation for predictive process monitoring. In International Conference on Advanced Information Systems Engineering, Springer, pp.\u00a0381\u2013397 (2023)","DOI":"10.1007\/978-3-031-34560-9_23"},{"key":"1305_CR23","doi-asserted-by":"crossref","unstructured":"K\u00e4ppel, M., Sch\u00f6nig, S., and Jablonski, S.: Leveraging small sample learning for business process management. Information and Software Technology (2021)","DOI":"10.1016\/j.infsof.2020.106472"},{"key":"1305_CR24","doi-asserted-by":"crossref","unstructured":"Klievtsova, N., Benzin, J.-V., Kampik, T., Mangler, J., and Rinderle-Ma, S.: Conversational process modelling: state of the art, applications, and implications in practice. In International Conference on Business Process Management, Springer, pp.\u00a0319\u2013336 (2023)","DOI":"10.1007\/978-3-031-41623-1_19"},{"key":"1305_CR25","unstructured":"K\u00f6pke, J., and Safan, A.: Introducing the bpmn-chatbot for efficient llm-based process modeling"},{"key":"1305_CR26","doi-asserted-by":"crossref","unstructured":"Kourani, H., Berti, A., Schuster, D., and van\u00a0der Aalst, W.\u00a0M.: Process modeling with large language models. arXiv preprint arXiv:2403.07541 (2024)","DOI":"10.1007\/978-3-031-61007-3_18"},{"key":"1305_CR27","unstructured":"Kourani, H., Berti, A., Schuster, D., and van\u00a0der Aalst, W.\u00a0M.: Promoai: Process modeling with generative AI. arXiv preprint arXiv:2403.04327 (2024)"},{"key":"1305_CR28","doi-asserted-by":"crossref","unstructured":"Liu, J., Chen, Y., and Xu, J.: Machine reading comprehension as data augmentation: a case study on implicit event argument extraction. 
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\u00a02716\u20132725 (2021)","DOI":"10.18653\/v1\/2021.emnlp-main.214"},{"key":"1305_CR29","doi-asserted-by":"crossref","unstructured":"L\u00f3pez, H.\u00a0A., Str\u00f8msted, R., Niyodusenga, J.-M., and Marquard, M.: Declarative process discovery: Linking process and textual views. In International Conference on Advanced Information Systems Engineering (2021)","DOI":"10.1007\/978-3-030-79108-7_13"},{"key":"1305_CR30","unstructured":"L\u00f3pez-Acosta, H.-A., Hildebrandt, T., Debois, S., and Marquard, M.: The process highlighter: From texts to declarative processes and back. In CEUR Workshop Proceedings, CEUR Workshop Proceedings, pp.\u00a066\u201370 (2018)"},{"issue":"11","key":"1305_CR31","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1145\/219717.219748","volume":"38","author":"GA Miller","year":"1995","unstructured":"Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38(11), 39\u201341 (1995)","journal-title":"Commun. ACM"},{"key":"1305_CR32","doi-asserted-by":"crossref","unstructured":"Mohammed, R., Rawashdeh, J., and Abdullah, M.: Machine learning with oversampling and undersampling techniques: overview study and experimental results. In 2020 11th international conference on information and communication systems (ICICS), IEEE, pp.\u00a0243\u2013248 (2020)","DOI":"10.1109\/ICICS49469.2020.239556"},{"key":"1305_CR33","doi-asserted-by":"crossref","unstructured":"Neuberger, J., Ackermann, L., and Jablonski, S.: Beyond rule-based named entity recognition and relation extraction for process model generation from natural language text. 
In CoopIS (2023)","DOI":"10.1007\/978-3-031-46846-9_10"},{"key":"1305_CR34","doi-asserted-by":"crossref","unstructured":"Neuberger, J., Ackermann, L., van\u00a0der Aa, H., and Jablonski, S.: A universal prompting strategy for extracting process model information from natural language text using large language models. In International Conference on Conceptual Modeling, Springer, pp.\u00a038\u201355 (2024)","DOI":"10.1007\/978-3-031-75872-0_3"},{"key":"1305_CR35","doi-asserted-by":"crossref","unstructured":"Neuberger, J., Doll, L., Engelmann, B., Ackermann, L., and Jablonski, S.: Leveraging data augmentation for process information extraction. In International Conference on Business Process Modeling, Development and Support, Springer, pp.\u00a057\u201370 (2024)","DOI":"10.1007\/978-3-031-61007-3_6"},{"issue":"3","key":"1305_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v109.i03","volume":"109","author":"PG Poli\u010dar","year":"2024","unstructured":"Poli\u010dar, P.G., Stra\u017ear, M., Zupan, B.: Opentsne: a modular python library for t-sne dimensionality reduction and embedding. J. Stat. Softw. 109(3), 1\u201330 (2024)","journal-title":"J. Stat. Softw."},{"key":"1305_CR37","doi-asserted-by":"crossref","unstructured":"Quishpi, L., Carmona, J., and Padr\u00f3, L.: Extracting annotations from textual descriptions of processes. In BPM 2020 (2020)","DOI":"10.1007\/978-3-030-58666-9_11"},{"key":"1305_CR38","unstructured":"Radford, A.: Improving language understanding by generative pre-training"},{"issue":"1","key":"1305_CR39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0197-0","volume":"6","author":"C Shorten","year":"2019","unstructured":"Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6(1), 1\u201348 (2019)","journal-title":"J. 
Big Data"},{"key":"1305_CR40","doi-asserted-by":"crossref","unstructured":"Shorten, C., Khoshgoftaar, T.\u00a0M., and Furht, B.: Text data augmentation for deep learning. Journal of big Data (2021)","DOI":"10.21203\/rs.3.rs-650804\/v1"},{"key":"1305_CR41","doi-asserted-by":"crossref","unstructured":"van\u00a0der Aa, H., Di\u00a0Ciccio, C., Leopold, H., and Reijers, H.\u00a0A.: Extracting declarative process models from natural language. In CAiSE (2019)","DOI":"10.1007\/978-3-030-21290-2_23"},{"key":"1305_CR42","doi-asserted-by":"crossref","unstructured":"Yao, Y., Ye, D., Li, P., Han, X., Lin, Y., Liu, Z., Liu, Z., Huang, L., Zhou, J., and Sun, M.: Docred: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127 (2019)","DOI":"10.18653\/v1\/P19-1074"},{"key":"1305_CR43","doi-asserted-by":"crossref","unstructured":"Zoran, D., and Weiss, Y.: Scale invariance and noise in natural images. In 2009 IEEE 12th International Conference on Computer Vision, IEEE, pp.\u00a02209\u20132216 (2009)","DOI":"10.1109\/ICCV.2009.5459476"}],"container-title":["Software and Systems 
Modeling"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10270-025-01305-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10270-025-01305-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10270-025-01305-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T12:02:25Z","timestamp":1773748945000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10270-025-01305-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,2]]},"references-count":43,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["1305"],"URL":"https:\/\/doi.org\/10.1007\/s10270-025-01305-1","relation":{},"ISSN":["1619-1366","1619-1374"],"issn-type":[{"value":"1619-1366","type":"print"},{"value":"1619-1374","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,2]]},"assertion":[{"value":"30 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 May 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 June 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 July 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}