{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,2]],"date-time":"2026-07-02T13:17:21Z","timestamp":1782998241978,"version":"3.54.5"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,11,1]],"date-time":"2023-11-01T00:00:00Z","timestamp":1698796800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"European Commission H2020 project Crowd4SDG","award":["872944"],"award-info":[{"award-number":["872944"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations considering several aspects. This is often performed by adopting a trial-and-error approach that does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems due to the need to consider only posts relevant for the analysis, for its dependence from the context being considered, for its multimedia contents, and for the risk of filtering out informative posts with automatic filters. In this article, we propose a systematic approach to support the design of pipelines that are able to effectively extract a relevant dataset for the goal of the analysis of data from social media. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus providing the data analyst preliminary information on the expected quality of the resulting dataset in a context-aware manner. The generation of metadata related to the processing tasks has been recognized as essential for enabling data sharing and reusability. To this aim, the dataset resulting from the pipeline application is automatically annotated with provenance metadata to get a detailed description of all the activities performed by the pipeline on them. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during COVID-19.<\/jats:p>","DOI":"10.1145\/3597305","type":"journal-article","created":{"date-parts":[[2023,5,20]],"date-time":"2023-05-20T08:56:27Z","timestamp":1684572987000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Pipeline Design for Data Preparation for Social Media Analysis"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5734-1274","authenticated-orcid":false,"given":"Carlo A.","family":"Bono","sequence":"first","affiliation":[{"name":"Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6062-5174","authenticated-orcid":false,"given":"Cinzia","family":"Cappiello","sequence":"additional","affiliation":[{"name":"Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2034-9774","authenticated-orcid":false,"given":"Barbara","family":"Pernici","sequence":"additional","affiliation":[{"name":"Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5124-9047","authenticated-orcid":false,"given":"Edoardo","family":"Ramalli","sequence":"additional","affiliation":[{"name":"Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5258-1893","authenticated-orcid":false,"given":"Monica","family":"Vitali","sequence":"additional","affiliation":[{"name":"Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,11]]},"reference":[{"key":"e_1_3_2_2_2","series-title":"LNCS","first-page":"17","volume-title":"Proc. BPM Conf.","author":"Akkiraju Rama","year":"2020","unstructured":"Rama Akkiraju et\u00a0al. 2020. Characterizing machine learning processes: A maturity framework. In Proc. BPM Conf.(LNCS, Vol. 12168). Springer, 17\u201331."},{"issue":"2","key":"e_1_3_2_3_2","doi-asserted-by":"crossref","first-page":"76","DOI":"10.15346\/hc.v8i2.121","article-title":"Exploring the use of deep learning with crowdsourcing to annotate images","volume":"8","author":"Anjum Samreen","year":"2021","unstructured":"Samreen Anjum, Ambika Verma, Brandon Dang, and Danna Gurari. 2021. Exploring the use of deep learning with crowdsourcing to annotate images. Human Computation 8, 2 (2021), 76\u2013106.","journal-title":"Human Computation"},{"key":"e_1_3_2_4_2","first-page":"768","volume-title":"Proc. ISCRAM","author":"Barozzi Sara","year":"2019","unstructured":"Sara Barozzi, Jose Luis Fernandez-Marquez, Amudha Ravi Shankar, and Barbara Pernici. 2019. Filtering images extracted from social media in the response phase of emergency events. In Proc. ISCRAM. 768\u2013779."},{"key":"e_1_3_2_5_2","first-page":"15","article-title":"PROV-DM: The PROV data model","volume":"14","author":"Belhajjame Khalid","year":"2013","unstructured":"Khalid Belhajjame, Reza B\u2019Far, James Cheney, Sam Coppens, Stephen Cresswell, Yolanda Gil, Paul Groth, Graham Klyne, Timothy Lebo, Jim McCusker, et\u00a0al. 2013. PROV-DM: The PROV data model. W3C Recomm. 14 (2013), 15\u201316.","journal-title":"W3C Recomm."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3188721"},{"key":"e_1_3_2_7_2","first-page":"2580","volume-title":"Proc. WWW Conf.","author":"Berti-\u00c9quille Laure","year":"2019","unstructured":"Laure Berti-\u00c9quille. 2019. Learn2Clean: Optimizing the sequence of tasks for web data preparation. In Proc. WWW Conf.ACM, 2580\u20132586."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2208.02689"},{"key":"e_1_3_2_9_2","first-page":"674","volume-title":"ISCRAM 2022 Conference Proceedings \u2013 19th International Conference on Information Systems for Crisis Response and Management","author":"Bono Carlo","year":"2022","unstructured":"Carlo Bono, Barbara Pernici, Jose Luis Fernandez-Marquez, Amudha Ravi Shankar, Mehmet O\u011fuz M\u00fcl\u00e2yim, and Edoardo Nemni. 2022. TriggerCit: Early flood alerting using Twitter and geolocation\u2013a comparison with alternative sources. In ISCRAM 2022 Conference Proceedings \u2013 19th International Conference on Information Systems for Crisis Response and Management, Rob Grace and Hossein Baharmand (Eds.). Tarbes, France, 674\u2013686."},{"key":"e_1_3_2_10_2","series-title":"Conceptual Modeling - 40th International Conference, ER 2021, Virtual Event, Proceedings","first-page":"25","volume":"13011","author":"Cappiello Cinzia","year":"2021","unstructured":"Cinzia Cappiello, Barbara Pernici, and Monica Vitali. 2021. Modeling adaptive data analysis pipelines for crowd-enhanced processes. In Conceptual Modeling - 40th International Conference, ER 2021, Virtual Event, Proceedings(Lecture Notes in Computer Science, Vol. 13011), Aditya K. Ghose, Jennifer Horkoff, V\u00edtor E. Silva Souza, Jeffrey Parsons, and Joerg Evermann (Eds.). Springer, 25\u201335."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.14778\/3436905.3436911"},{"key":"e_1_3_2_12_2","first-page":"1","volume-title":"Journal of Physics: Conference Series","volume":"664","author":"Cranmer Kyle","year":"2015","unstructured":"Kyle Cranmer, Lukas Heinrich, Roger Jones, David M. South, ATLAS collaboration, et\u00a0al. 2015. Analysis preservation in ATLAS. In Journal of Physics: Conference Series, Vol. 664. IOP Publishing, 1\u20135."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1080\/13662716.2021.1976627"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458723"},{"issue":"1","key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3467022","article-title":"Knowledge-driven data ecosystems towards data transparency","volume":"14","author":"Geisler Sandra","year":"2022","unstructured":"Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias L\u00f3scio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, Barbara Pernici, and Jakob Rehof. 2022. Knowledge-driven data ecosystems towards data transparency. ACM Journal of Data and Information Quality 14, 1:3 (March2022), 1\u201312.","journal-title":"ACM Journal of Data and Information Quality"},{"key":"e_1_3_2_17_2","volume-title":"The Enteprise Big Data Lake","author":"Gorelik Alex","year":"2019","unstructured":"Alex Gorelik. 2019. The Enteprise Big Data Lake. O\u2019Reilly."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.3390\/s17122766"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"480","DOI":"10.1109\/ICDMW51313.2020.00071","volume-title":"2020 Intl. Conf. on Data Mining Workshops (ICDMW\u201920)","author":"Heidari Maryam","year":"2020","unstructured":"Maryam Heidari, James H. Jones, and Ozlem Uzuner. 2020. Deep contextualized word embedding for text-based online user profiling to detect social bots on Twitter. In 2020 Intl. Conf. on Data Mining Workshops (ICDMW\u201920). 480\u2013487."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.4135\/9781412985451"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-017-0486-1"},{"issue":"2","key":"e_1_3_2_22_2","first-page":"283","article-title":"Cost of quality in crowdsourcing","volume":"1","author":"Iren Deniz","year":"2014","unstructured":"Deniz Iren and Semih Bilgen. 2014. Cost of quality in crowdsourcing. Human Computation 1, 2 (2014), 283\u2013314.","journal-title":"Human Computation"},{"key":"e_1_3_2_23_2","first-page":"xxii\u2013xxiv","volume-title":"Proc. CAiSE","author":"Naumann Felix","year":"2021","unstructured":"Felix Naumann. 2021. Bad files, bad data, bad results: Data quality and data preparation. In Proc. CAiSE. xxii\u2013xxiv."},{"key":"e_1_3_2_24_2","first-page":"92","volume-title":"IEEE\/ACM 43rd Intl. Conf. on Software Engineering: Software Engineering in Society (ICSE-SEIS\u201921)","author":"Negri Virginia","year":"2021","unstructured":"Virginia Negri, Dario Scuratti, Stefano Agresti, Donya Rooein, Gabriele Scalia, Amudha Ravi Shankar, Jose Luis Fernandez Marquez, Mark James Carman, and Barbara Pernici. 2021. Image-based social sensing: Combining AI and the crowd to mine policy-adherence indicators from Twitter. In IEEE\/ACM 43rd Intl. Conf. on Software Engineering: Software Engineering in Society (ICSE-SEIS\u201921). IEEE, 92\u2013101."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377330.3377334"},{"key":"e_1_3_2_26_2","first-page":"20","volume-title":"Provenance and Annotation of Data and Processes","author":"Pina D\u00e9bora","year":"2020","unstructured":"D\u00e9bora Pina, Liliane Kunstmann, Daniel de Oliveira, Patrick Valduriez, and Marta Mattoso. 2020. Provenance supporting hyperparameter analysis in deep neural networks. In Provenance and Annotation of Data and Processes. Springer, 20\u201338."},{"key":"e_1_3_2_27_2","first-page":"206","volume-title":"Proc. Conf. on Web Intelligence (WI\u201918)","author":"Purohit Hemant","year":"2018","unstructured":"Hemant Purohit, Carlos Castillo, Muhammad Imran, and Rahul Pandey. 2018. Ranking of social media alerts with workload bounds in emergency operation centers. In Proc. Conf. on Web Intelligence (WI\u201918). IEEE, 206\u2013213."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.29"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10707-021-00446-x"},{"key":"e_1_3_2_30_2","first-page":"1620","volume-title":"IEEE 39th Intl. Conf. on Distr. Computing Systems","author":"Scherzinger Stefanie","year":"2019","unstructured":"Stefanie Scherzinger, Christin Seifert, and Lena Wiese. 2019. The best of both worlds: Challenges in linking provenance and explainability in distributed machine learning. In IEEE 39th Intl. Conf. on Distr. Computing Systems. 1620\u20131629."},{"key":"e_1_3_2_31_2","first-page":"1","volume-title":"Proc. of the AAAI Fall 2020 AI for Social Good Symposium","author":"Scheunemann Christoph","year":"2020","unstructured":"Christoph Scheunemann, Julian Naumann, Max Eichler, Kevin Stowe, and Iryna Gurevych. 2020. Data collection and annotation pipeline for social good projects. In Proc. of the AAAI Fall 2020 AI for Social Good Symposium. 1\u20137."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360646"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.18"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2723009"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3597305","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3597305","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:06Z","timestamp":1750182546000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3597305"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11]]},"references-count":33,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3597305"],"URL":"https:\/\/doi.org\/10.1145\/3597305","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"value":"1936-1955","type":"print"},{"value":"1936-1963","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11]]},"assertion":[{"value":"2022-05-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-04-11","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}