{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,22]],"date-time":"2026-03-22T05:48:56Z","timestamp":1774158536604,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2024,7,25]],"date-time":"2024-07-25T00:00:00Z","timestamp":1721865600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,25]],"date-time":"2024-07-25T00:00:00Z","timestamp":1721865600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"EU Horizon Framework","award":["101069543"],"award-info":[{"award-number":["101069543"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Intell Inf Syst"],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results. However, well-preparing data for machine learning is becoming difficult due to the variety of data quality issues and available data preparation tasks. For this reason, approaches that help users in performing this demanding phase are needed. This work proposes DIANA, a framework for data-centric AI to support data exploration and preparation, suggesting suitable cleaning tasks to obtain valuable analysis results. We design an adaptive self-service environment that can handle the analysis and preparation of different types of sources, i.e., tabular, and streaming data. The central component of our framework is a knowledge base that collects evidence related to the effectiveness of the data preparation actions along with the type of input data and the considered machine learning model. In this paper, we first describe the framework, the knowledge base model, and its enrichment process. Then, we show the experiments conducted to enrich the knowledge base in a particular case study: time series data streams.<\/jats:p>","DOI":"10.1007\/s10844-024-00867-8","type":"journal-article","created":{"date-parts":[[2024,7,25]],"date-time":"2024-07-25T08:01:54Z","timestamp":1721894514000},"page":"1503-1530","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Enhancing data preparation: insights from a time series case study"],"prefix":"10.1007","volume":"62","author":[{"given":"Camilla","family":"Sancricca","sequence":"first","affiliation":[]},{"given":"Giovanni","family":"Siracusa","sequence":"additional","affiliation":[]},{"given":"Cinzia","family":"Cappiello","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,25]]},"reference":[{"key":"867_CR1","unstructured":"Angles, R. (2018). The property graph database model. In Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management. CEUR Workshop Proceedings, vol. 2100. https:\/\/ceur-ws.org\/Vol-2100\/paper26.pdf"},{"key":"867_CR2","doi-asserted-by":"publisher","unstructured":"Arasu, A., & Manku, G. S. (2004). Approximate counts and quantiles over sliding windows. In C. Beeri, & A. Deutsch (eds.) Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, (pp. 286\u2013296). ACM. https:\/\/doi.org\/10.1145\/1055558.1055598","DOI":"10.1145\/1055558.1055598"},{"key":"867_CR3","doi-asserted-by":"publisher","unstructured":"Batini, C., & Scannapieco, M. (2016). Data and information quality - dimensions, principles and techniques. Data-Centric Systems and Applications. Springer. https:\/\/doi.org\/10.1007\/978-3-319-24106-7","DOI":"10.1007\/978-3-319-24106-7"},{"key":"867_CR4","doi-asserted-by":"publisher","unstructured":"Berti-\u00c9quille, L. (2019) Learn2clean: Optimizing the sequence of tasks for web data preparation. In The World Wide Web Conference, WWW 2019, (pp. 2580\u20132586). ACM. https:\/\/doi.org\/10.1145\/3308558.3313602","DOI":"10.1145\/3308558.3313602"},{"key":"867_CR5","unstructured":"Berti-\u00c9quille, L. (2020). Active reinforcement learning for data preparation: Learn2clean with human-in-the-loop. In CIDR 2020 Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/gongshow2020\/gongshow\/abstracts\/cidr2020_abstract59.pdf"},{"key":"867_CR6","doi-asserted-by":"publisher","unstructured":"Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. of the 2015 ACM SIGMOD, (pp. 1247\u20131261). ACM. https:\/\/doi.org\/10.1145\/2723372.2749431","DOI":"10.1145\/2723372.2749431"},{"key":"867_CR7","doi-asserted-by":"publisher","unstructured":"C\u00f4t\u00e9, N., Canu, A., Bouzid, M., & Mouaddib, A. (2012) Humans-robots sliding collaboration control in complex environments with adjustable autonomy. In 2012 IAT, (pp. 146\u2013153). IEEE Computer Society. https:\/\/doi.org\/10.1109\/WI-IAT.2012.215","DOI":"10.1109\/WI-IAT.2012.215"},{"key":"867_CR8","doi-asserted-by":"publisher","unstructured":"Cui, Q., Zheng, W., Hou, W., Sheng, M., Ren, P., Chang, W., & Li, X. (2022). Holocleanx: A multi-source heterogeneous data cleaning solution based on lakehouse. In HIS 2022, Proceedings. LNCS, vol. 13705, (pp. 165\u2013176). Springer. https:\/\/doi.org\/10.1007\/978-3-031-20627-6_16","DOI":"10.1007\/978-3-031-20627-6_16"},{"key":"867_CR9","doi-asserted-by":"publisher","first-page":"850611","DOI":"10.3389\/FDATA.2022.850611","volume":"5","author":"L Ehrlinger","year":"2022","unstructured":"Ehrlinger, L., & W\u00f6\u00df, W. (2022). A survey of data quality measurement and monitoring tools. Frontiers Big Data, 5, 850611. https:\/\/doi.org\/10.3389\/FDATA.2022.850611","journal-title":"Frontiers Big Data"},{"key":"867_CR10","first-page":"261","volume":"23","author":"M Feurer","year":"2022","unstructured":"Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., & Hutter, F. (2022). Auto-sklearn 2.0: Hands-free automl via meta-learning. Journal of Machine Learning Research, 23, 261\u2013126161.","journal-title":"Journal of Machine Learning Research"},{"key":"867_CR11","doi-asserted-by":"publisher","unstructured":"Foroni, D., Lissandrini, M., & Velegrakis, Y. (2021). Estimating the extent of the effects of data quality through observations. In ICDE 2021, (pp. 1913\u20131918). IEEE. https:\/\/doi.org\/10.1109\/ICDE51399.2021.00176","DOI":"10.1109\/ICDE51399.2021.00176"},{"issue":"3","key":"867_CR12","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1080\/10447318.2022.2153320","volume":"39","author":"\u00d6\u00d6 Garibay","year":"2023","unstructured":"Garibay, \u00d6. \u00d6., Winslow, B., et al. (2023). Six human-centered artificial intelligence grand challenges. International Journal of Human\u2013Computer Interaction, 39(3), 391\u2013437. https:\/\/doi.org\/10.1080\/10447318.2022.2153320","journal-title":"International Journal of Human\u2013Computer Interaction"},{"issue":"3","key":"867_CR13","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1145\/3444831.3444835","volume":"49","author":"M Hameed","year":"2020","unstructured":"Hameed, M., & Naumann, F. (2020). Data preparation: A survey of commercial tools. SIGMOD Record, 49(3), 18\u201329. https:\/\/doi.org\/10.1145\/3444831.3444835","journal-title":"SIGMOD Record"},{"key":"867_CR14","doi-asserted-by":"publisher","unstructured":"Issa, O., Bonifati, A., & Toumani, F. (2021). INCA: inconsistency-aware data profiling and querying. In SIGMOD \u201921, (pp. 2745\u20132749). ACM. https:\/\/doi.org\/10.1145\/3448016.3452760","DOI":"10.1145\/3448016.3452760"},{"issue":"8","key":"867_CR15","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3571724","volume":"66","author":"MH Jarrahi","year":"2023","unstructured":"Jarrahi, M. H., Memariani, A., & Guha, S. (2023). The principles of data-centric AI. Communications of the ACM, 66(8), 84\u201392. https:\/\/doi.org\/10.1145\/3571724","journal-title":"Communications of the ACM"},{"issue":"12","key":"867_CR16","doi-asserted-by":"publisher","first-page":"948","DOI":"10.14778\/2994509.2994514","volume":"9","author":"S Krishnan","year":"2016","unstructured":"Krishnan, S., Wang, J., Wu, E., Franklin, M. J., & Goldberg, K. (2016). Activeclean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment, 9(12), 948\u2013959. https:\/\/doi.org\/10.14778\/2994509.2994514","journal-title":"Proceedings of the VLDB Endowment"},{"key":"867_CR17","doi-asserted-by":"publisher","unstructured":"Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., & Zhang, C.: Cleanml: A study for evaluating the impact of data cleaning on ML classification tasks. In ICDE 2021, (pp. 13\u201324). IEEE. https:\/\/doi.org\/10.1109\/ICDE51399.2021.00009","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"867_CR18","doi-asserted-by":"publisher","unstructured":"Liu, F. T., Ting, K. M., & Zhou, Z. (2008). Isolation forest. In Proceedings of ICDM, pp. 413\u2013422. IEEE Computer Society. https:\/\/doi.org\/10.1109\/ICDM.2008.17","DOI":"10.1109\/ICDM.2008.17"},{"key":"867_CR19","doi-asserted-by":"publisher","unstructured":"Luo, Y., Chai, C., Qin, X., Tang, N., & Li, G. (2020). Interactive cleaning for progressive visualization through composite questions. In ICDE 2020 (pp. 733\u2013744). IEEE. https:\/\/doi.org\/10.1109\/ICDE48307.2020.00069","DOI":"10.1109\/ICDE48307.2020.00069"},{"key":"867_CR20","unstructured":"Mahdavi, M., & Abedjan, Z. (2021). Semi-supervised data cleaning with raha and baran. In 11th Conference on Innovative Data Systems Research, CIDR 2021. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2021\/papers\/cidr2021_paper14.pdf"},{"key":"867_CR21","unstructured":"Mahdavi, M., Neutatz, F., Visengeriyeva, L., & Abedjan, Z. (2019). Towards automated data cleaning workflows. In Proc. of the Conference on \"Lernen, Wissen, Daten, Analysen\". CEUR Workshop Proceedings, vol. 2454, (pp. 10\u201319). CEUR-WS.org. https:\/\/ceur-ws.org\/Vol-2454\/paper_8.pdf"},{"key":"867_CR22","doi-asserted-by":"publisher","unstructured":"Martin, N., Martinez-Millana, A., Valdivieso, B., & Fern\u00e1ndez-Llatas, C. (2019). Interactive data cleaning for process mining: A case study of an outpatient clinic\u2019s appointment system. In BPM 2019 International Workshops. LNBIP, vol. 362, pp. 532\u2013544. Springer. https:\/\/doi.org\/10.1007\/978-3-030-37453-2_43","DOI":"10.1007\/978-3-030-37453-2_43"},{"key":"867_CR23","unstructured":"Melgar, L. A., & Dao, D., et al. (2021). Ease.ml: A lifecycle management system for machine learning. In CIDR 2021. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2021\/papers\/cidr2021_paper26.pdf"},{"issue":"2","key":"867_CR24","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1007\/S13222-022-00413-2","volume":"22","author":"F Neutatz","year":"2022","unstructured":"Neutatz, F., Chen, B., Alkhatib, Y., Ye, J., & Abedjan, Z. (2022). Data cleaning and automl: Would an optimizer choose to clean? Datenbank-Spektrum, 22(2), 121\u2013130. https:\/\/doi.org\/10.1007\/S13222-022-00413-2","journal-title":"Datenbank-Spektrum"},{"issue":"4","key":"867_CR25","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1145\/3603709","volume":"15","author":"H Patel","year":"2023","unstructured":"Patel, H., Guttula, S. C., Gupta, N., Hans, S., Mittal, R. S., & Nagalapatti, L. (2023). A data-centric AI framework for automating exploratory data analysis and data quality tasks. ACM Journal of Data and Information Quality, 15(4), 44\u201314426. https:\/\/doi.org\/10.1145\/3603709","journal-title":"ACM Journal of Data and Information Quality"},{"key":"867_CR26","doi-asserted-by":"publisher","first-page":"2825","DOI":"10.5555\/1953048.2078195","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825\u20132830. https:\/\/doi.org\/10.5555\/1953048.2078195","journal-title":"Journal of Machine Learning Research"},{"issue":"9","key":"867_CR27","doi-asserted-by":"publisher","first-page":"3105","DOI":"10.3390\/S18093105","volume":"18","author":"R P\u00e9rez-Castillo","year":"2018","unstructured":"P\u00e9rez-Castillo, R., Carretero, A. G., Caballero, I., Rodr\u00edguez, M., Piattini, M., Mate, A., Kim, S., & Lee, D. (2018). DAQUA-MASS: an ISO 8000\u201361 based data quality management methodology for sensor data. Sensors, 18(9), 3105. https:\/\/doi.org\/10.3390\/S18093105","journal-title":"Sensors"},{"key":"867_CR28","doi-asserted-by":"publisher","unstructured":"Qi, Z., & Wang, H. (2021). Dirty-data impacts on regression models: An experimental evaluation. In DASFAA 2021. LNCS, vol. 12681, (pp. 88\u201395). Springer. https:\/\/doi.org\/10.1007\/978-3-030-73194-6_6","DOI":"10.1007\/978-3-030-73194-6_6"},{"issue":"4","key":"867_CR29","doi-asserted-by":"publisher","first-page":"806","DOI":"10.1007\/S11390-021-1344-6","volume":"36","author":"Z Qi","year":"2021","unstructured":"Qi, Z., Wang, H., & Wang, A. (2021). Impacts of dirty data on classification and clustering models: An experimental evaluation. Journal of Computer Science and Technology, 36(4), 806\u2013821. https:\/\/doi.org\/10.1007\/S11390-021-1344-6","journal-title":"Journal of Computer Science and Technology"},{"key":"867_CR30","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1016\/J.NEUCOM.2017.01.078","volume":"239","author":"S Ram\u00edrez-Gallego","year":"2017","unstructured":"Ram\u00edrez-Gallego, S., Krawczyk, B., Garc\u00eda, S., Wozniak, M., & Herrera, F. (2017). A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing, 239, 39\u201357. https:\/\/doi.org\/10.1016\/J.NEUCOM.2017.01.078","journal-title":"Neurocomputing"},{"issue":"11","key":"867_CR31","doi-asserted-by":"publisher","first-page":"1190","DOI":"10.14778\/3137628.3137631","volume":"10","author":"T Rekatsinas","year":"2017","unstructured":"Rekatsinas, T., Chu, X., Ilyas, I. F., & R\u00e9, C. (2017). Holoclean: Holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment, 10(11), 1190\u20131201. https:\/\/doi.org\/10.14778\/3137628.3137631","journal-title":"Proceedings of the VLDB Endowment"},{"key":"867_CR32","unstructured":"Sancricca, C., & Cappiello, C.: Supporting the design of data preparation pipelines. In Proceedings Of Sebd2022. CEUR Workshop Proceedings, vol. 3194, (pp. 149\u2013158). CEUR-WS.org. https:\/\/ceur-ws.org\/Vol-3194\/paper18.pdf"},{"key":"867_CR33","unstructured":"Shchur, O., T\u00fcrkmen, A.C., Erickson, N., Shen, H., Shirkov, A., Hu, T., & Wang, B. (2023). Autogluon-timeseries: Automl for probabilistic time series forecasting. In Proc. of the International Conference on Automated Machine Learning, vol. 228, (pp. 9\u2013121). PMLR. https:\/\/proceedings.mlr.press\/v228\/shchur23a.html"},{"issue":"6","key":"867_CR34","doi-asserted-by":"publisher","first-page":"495","DOI":"10.1080\/10447318.2020.1741118","volume":"36","author":"B Shneiderman","year":"2020","unstructured":"Shneiderman, B. (2020). Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human\u2013Computer Interaction, 36(6), 495\u2013504. https:\/\/doi.org\/10.1080\/10447318.2020.1741118","journal-title":"International Journal of Human\u2013Computer Interaction"},{"key":"867_CR35","doi-asserted-by":"publisher","unstructured":"Shrivastava, S., et al. (2019) DQA: scalable, automated and interactive data quality advisor. In Proc. of 2019 (IEEE BigData), (pp. 2913\u20132922). IEEE. https:\/\/doi.org\/10.1109\/BIGDATA47090.2019.9006187","DOI":"10.1109\/BIGDATA47090.2019.9006187"},{"key":"867_CR36","doi-asserted-by":"publisher","unstructured":"Sibai, R.E., Chabchoub, Y., Chiky, R., Demerjian, J., & Barbar, K.: Assessing and improving sensors data quality in streaming context. In ICCCI 2017, Nicosia. LNCS, vol. 10449, (pp. 590\u2013599). Springer. https:\/\/doi.org\/10.1007\/978-3-319-67077-5_57","DOI":"10.1007\/978-3-319-67077-5_57"},{"key":"867_CR37","doi-asserted-by":"publisher","unstructured":"Tan, S. C., Ting, K. M., & Liu, F. T. (2011). Fast anomaly detection for streaming data. In T. Walsh (ed.) IJCAI 2011, (pp. 1511\u20131516). IJCAI\/AAAI. https:\/\/doi.org\/10.5591\/978-1-57735-516-8\/IJCAI11-254","DOI":"10.5591\/978-1-57735-516-8\/IJCAI11-254"},{"issue":"4","key":"867_CR38","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1080\/07421222.1996.11518099","volume":"12","author":"RY Wang","year":"1996","unstructured":"Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5\u201333. https:\/\/doi.org\/10.1080\/07421222.1996.11518099","journal-title":"Journal of Management Information Systems"},{"issue":"9","key":"867_CR39","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1080\/24725854.2018.1530487","volume":"51","author":"M Yu","year":"2019","unstructured":"Yu, M., Wu, C., & Tsung, F. (2019). Monitoring the data quality of data streams using a two-step control scheme. IISE Transactions, 51(9), 985\u2013998. https:\/\/doi.org\/10.1080\/24725854.2018.1530487","journal-title":"IISE Transactions"}],"container-title":["Journal of Intelligent Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10844-024-00867-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10844-024-00867-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10844-024-00867-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,24]],"date-time":"2025-01-24T11:53:17Z","timestamp":1737719597000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10844-024-00867-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,25]]},"references-count":39,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["867"],"URL":"https:\/\/doi.org\/10.1007\/s10844-024-00867-8","relation":{},"ISSN":["0925-9902","1573-7675"],"issn-type":[{"value":"0925-9902","type":"print"},{"value":"1573-7675","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,25]]},"assertion":[{"value":"20 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 June 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 July 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 July 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical Approval"}},{"value":"The authors declare no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}