{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,8]],"date-time":"2025-09-08T06:24:02Z","timestamp":1757312642269},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,7]]},"abstract":"<jats:p>\n            Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such a mixture makes the problem of\n            <jats:italic>discovering<\/jats:italic>\n            tables or documents that are relevant to a user's query very challenging. Despite the recent efforts in\n            <jats:italic>data discovery<\/jats:italic>\n            , the problem remains wide open, especially on two fronts: (1) discovering relationships and relatedness across structured and unstructured datasets, where existing techniques either scale poorly, are customized for a specific problem type (e.g., entity matching or data integration), or destroy the structural properties along the way, and (2) developing a holistic system for integrating various similarity measurements and sketches in an effective way to boost the discovery accuracy.\n          <\/jats:p>\n          <jats:p>\n            In this paper, we propose a new data discovery system, named CMDL, for addressing these two limitations. CMDL supports the data discovery process over both structured and unstructured data while retaining the structural properties of tables. As a result, CMDL is the only system to date that empowers end-users to seamlessly pipeline the discovery tasks across the two modalities. 
We propose a novel multi-modal embedding representation that captures the similarities between text documents and tabular columns. The model training relies on labeled datasets generated through\n            <jats:italic>weak supervision<\/jats:italic>\n            , and thus the system is domain agnostic and easily generalizable. We evaluate CMDL on three real-world data lakes with diverse applications and show that our system is significantly more effective for cross-modality discovery compared to the search-based baseline techniques. Moreover, CMDL is more accurate and robust to different data types and distributions compared to the state-of-the-art systems that are limited to only the structured datasets.\n          <\/jats:p>","DOI":"10.14778\/3611479.3611533","type":"journal-article","created":{"date-parts":[[2023,8,25]],"date-time":"2023-08-25T02:08:08Z","timestamp":1692929288000},"page":"3377-3390","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Cross Modal Data Discovery over Structured and Unstructured Data Lakes"],"prefix":"10.14778","volume":"16","author":[{"given":"Mohamed Y.","family":"Eltabakh","sequence":"first","affiliation":[{"name":"Qatar Computing Research Institute (QCRI), Doha, Qatar"}]},{"given":"Mayuresh","family":"Kunjir","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Berlin, Germany"}]},{"given":"Ahmed K.","family":"Elmagarmid","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute (QCRI), Doha, Qatar"}]},{"given":"Mohammad Shahmeer","family":"Ahmad","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute (QCRI), Doha, Qatar"}]}],"member":"320","published-online":{"date-parts":[[2023,8,24]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n. d.]. A Dive into Metadata Hub Tools. https:\/\/towardsdatascience.com\/a-dive-into-metadata-hub-tools-67259804971f.  [n. d.]. 
A Dive into Metadata Hub Tools. https:\/\/towardsdatascience.com\/a-dive-into-metadata-hub-tools-67259804971f."},{"key":"e_1_2_1_2_1","unstructured":"[n. d.]. Amundsen --- Lyft's data discovery & metadata engine. https:\/\/eng.lyft.com\/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9. Accessed: 2021-07-21.  [n. d.]. Amundsen --- Lyft's data discovery & metadata engine. https:\/\/eng.lyft.com\/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9. Accessed: 2021-07-21."},{"key":"e_1_2_1_3_1","volume-title":"Scientist Report","year":"2017","unstructured":"[n. d.]. CrowdFlower: Data Scientist Report 2017. https:\/\/visit.crowdflower.com\/WC-2017-Data-Science-Report_LP.html. Accessed: 2021-07-21. [n. d.]. CrowdFlower: Data Scientist Report 2017. https:\/\/visit.crowdflower.com\/WC-2017-Data-Science-Report_LP.html. Accessed: 2021-07-21."},{"key":"e_1_2_1_4_1","unstructured":"[n. d.]. DrugBank. https:\/\/www.drugbank.com.  [n. d.]. DrugBank. https:\/\/www.drugbank.com."},{"key":"e_1_2_1_5_1","unstructured":"[n. d.]. MedWatch Online Voluntary Reporting Form. https:\/\/www.accessdata.fda.gov\/scripts\/medwatch\/index.cfm.  [n. d.]. MedWatch Online Voluntary Reporting Form. https:\/\/www.accessdata.fda.gov\/scripts\/medwatch\/index.cfm."},{"key":"e_1_2_1_6_1","unstructured":"[n. d.]. National Library of Medicine MedlinePlus. https:\/\/medlineplus.gov\/personalhealthrecords.html.  [n. d.]. National Library of Medicine MedlinePlus. https:\/\/medlineplus.gov\/personalhealthrecords.html."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2016.7498319"},{"key":"e_1_2_1_8_1","volume-title":"Unsupervised matching of data and text. arXiv preprint arXiv:2112.08776","author":"Ahmadi Naser","year":"2021","unstructured":"Naser Ahmadi, Hansjorg Sand, and Paolo Papotti. 2021. Unsupervised matching of data and text. arXiv preprint arXiv:2112.08776 (2021). Naser Ahmadi, Hansjorg Sand, and Paolo Papotti. 2021. 
Unsupervised matching of data and text. arXiv preprint arXiv:2112.08776 (2021)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2021.101846"},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"A. Bogatu A. A. A. Fernandes N. W. Paton and N. Konstantinou. 2020. Dataset Discovery in Data Lakes. In ICDE. 709--720.  A. Bogatu A. A. A. Fernandes N. W. Paton and N. Konstantinou. 2020. Dataset Discovery in Data Lakes. In ICDE. 709--720.","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_1_11_1","volume-title":"Enriching Word Vectors with Subword Information. CoRR abs\/1607.04606","author":"Bojanowski Piotr","year":"2016","unstructured":"Piotr Bojanowski , Edouard Grave , Armand Joulin , and Tom\u00e1s Mikolov . 2016. Enriching Word Vectors with Subword Information. CoRR abs\/1607.04606 ( 2016 ). Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tom\u00e1s Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs\/1607.04606 (2016)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_2_1_13_1","volume-title":"Cognitive database: A step towards endowing relational databases with artificial intelligence capabilities. arXiv preprint arXiv:1712.07199","author":"Bordawekar Rajesh","year":"2017","unstructured":"Rajesh Bordawekar , Bortik Bandyopadhyay , and Oded Shmueli . 2017. Cognitive database: A step towards endowing relational databases with artificial intelligence capabilities. arXiv preprint arXiv:1712.07199 ( 2017 ). Rajesh Bordawekar, Bortik Bandyopadhyay, and Oded Shmueli. 2017. Cognitive database: A step towards endowing relational databases with artificial intelligence capabilities. arXiv preprint arXiv:1712.07199 (2017)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Ursin Brunner and Kurt Stockinger. 2019. Entity matching on unstructured data : an active learning approach.  Ursin Brunner and Kurt Stockinger. 2019. 
Entity matching on unstructured data : an active learning approach.","DOI":"10.1109\/SDS.2019.00006"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1756006.1756042"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Zhimin Chen Vivek Narasayya and Surajit Chaudhuri. 2014. Fast foreign-key detection in Microsoft SQL server PowerPivot for Excel. In PVLDB.  Zhimin Chen Vivek Narasayya and Surajit Chaudhuri. 2014. Fast foreign-key detection in Microsoft SQL server PowerPivot for Excel. In PVLDB.","DOI":"10.14778\/2733004.2733014"},{"key":"e_1_2_1_17_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng , Raul Castro Fernandez , Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017 . The Data Civilizer System.. In Cidr . Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System.. In Cidr."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430921"},{"key":"e_1_2_1_19_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00046"},{"key":"e_1_2_1_21_1","unstructured":"Mohamed Y. Eltabakh Mayuresh Kunjir Ahmed Elmagarmid and Mohammad Shahmeer Ahmad. 2023. Extended Paper-Cross Modal Data Discovery over Structured and Unstructured Data Lakes. https:\/\/arxiv.org\/abs\/2306.00932.  Mohamed Y. Eltabakh Mayuresh Kunjir Ahmed Elmagarmid and Mohammad Shahmeer Ahmad. 2023. Extended Paper-Cross Modal Data Discovery over Structured and Unstructured Data Lakes. https:\/\/arxiv.org\/abs\/2306.00932."},{"key":"e_1_2_1_22_1","volume-title":"Aurum: A data discovery system","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez , Ziawasch Abedjan , Famien Koko , Gina Yuan , Samuel Madden , and Michael Stonebraker . 2018 . Aurum: A data discovery system . In ICDE. IEEE , 1001--1012. Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A data discovery system. In ICDE. IEEE, 1001--1012."},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In aiDM. 1--8.  Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In aiDM. 1--8.","DOI":"10.1145\/3329859.3329877"},{"key":"e_1_2_1_24_1","volume-title":"2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 989--1000","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez , Essam Mansour , Abdulhakim A Qahtan , Ahmed Elmagarmid , Ihab Ilyas , Samuel Madden , Mourad Ouzzani , Michael Stonebraker , and Nan Tang . 2018 . Seeping semantics: Linking datasets using word embeddings for data discovery . In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 989--1000 . 
Raul Castro Fernandez, Essam Mansour, Abdulhakim A Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 989--1000."},{"key":"e_1_2_1_25_1","volume-title":"Lazo: A cardinality-based method for coupled estimation of jaccard similarity and containment","author":"Fernandez Raul Castro","year":"2019","unstructured":"Raul Castro Fernandez , Jisoo Min , Demitri Nava , and Samuel Madden . 2019 . Lazo: A cardinality-based method for coupled estimation of jaccard similarity and containment . In ICDE. IEEE , 1190--1201. Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A cardinality-based method for coupled estimation of jaccard similarity and containment. In ICDE. IEEE, 1190--1201."},{"key":"e_1_2_1_26_1","volume-title":"Advances in Database Technology: EDBT 2021, 24th International Conference on Extending Database Technology: Nicosia, Cyprus, March 23--26, 2021: proceedings. OpenProceedings, 690--693","author":"Jes\u00fas Flores Herrera Javier De","year":"2021","unstructured":"Javier De Jes\u00fas Flores Herrera , Sergi Nadal Francesch , and \u00d3scar Romero Moral . 2021 . Effective and scalable data discovery with NextiaJD . In Advances in Database Technology: EDBT 2021, 24th International Conference on Extending Database Technology: Nicosia, Cyprus, March 23--26, 2021: proceedings. OpenProceedings, 690--693 . Javier De Jes\u00fas Flores Herrera, Sergi Nadal Francesch, and \u00d3scar Romero Moral. 2021. Effective and scalable data discovery with NextiaJD. In Advances in Database Technology: EDBT 2021, 24th International Conference on Extending Database Technology: Nicosia, Cyprus, March 23--26, 2021: proceedings. OpenProceedings, 690--693."},{"key":"e_1_2_1_27_1","unstructured":"Google. [n. d.]. Google Cloud-Cloud Data Fusion. 
https:\/\/cloud.google.com\/data-fusion.  Google. [n. d.]. Google Cloud-Cloud Data Fusion. https:\/\/cloud.google.com\/data-fusion."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2016.03.048"},{"key":"e_1_2_1_29_1","volume-title":"Goods: Organizing google's datasets. In SIGMOD. 795--806.","author":"Halevy Alon","year":"2016","unstructured":"Alon Halevy , Flip Korn , Natalya F Noy , Christopher Olston , Neoklis Polyzotis , Sudip Roy , and Steven Euijong Whang . 2016 . Goods: Organizing google's datasets. In SIGMOD. 795--806. Alon Halevy, Flip Korn, Natalya F Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Goods: Organizing google's datasets. In SIGMOD. 795--806."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330993"},{"key":"e_1_2_1_31_1","unstructured":"IBM. [n. d.]. IBM Watson Discovery. https:\/\/www.ibm.com\/cloud\/watson-discovery.  IBM. [n. d.]. IBM Watson Discovery. https:\/\/www.ibm.com\/cloud\/watson-discovery."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2731084"},{"key":"e_1_2_1_33_1","volume-title":"EDBT\/ICDT Workshops.","author":"Keim Daniel A.","year":"2014","unstructured":"Daniel A. Keim . 2014 . Exploring Big Data using Visual Analytics . In EDBT\/ICDT Workshops. Daniel A. Keim. 2014. Exploring Big Data using Visual Analytics. In EDBT\/ICDT Workshops."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1089"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/edbt.2021.03"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2019.2909204"},{"key":"e_1_2_1_37_1","volume-title":"The general pair-based weighting loss for deep metric learning. arXiv preprint arXiv:1905.12837","author":"Liu Haijun","year":"2019","unstructured":"Haijun Liu , Jian Cheng , Wen Wang , and Yanzhou Su. 2019. The general pair-based weighting loss for deep metric learning. 
arXiv preprint arXiv:1905.12837 ( 2019 ). Haijun Liu, Jian Cheng, Wen Wang, and Yanzhou Su. 2019. The general pair-based weighting loss for deep metric learning. arXiv preprint arXiv:1905.12837 (2019)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58595-2_41"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Mark Neumann Daniel King Iz Beltagy and Waleed Ammar. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing.  Mark Neumann Daniel King Iz Beltagy and Waleed Ammar. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing.","DOI":"10.18653\/v1\/W19-5034"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3393815"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384346"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3416028.3416033"},{"key":"e_1_2_1_45_1","volume-title":"Sen Wu, and Christopher R\u00e9.","author":"Ratner Alexander","year":"2017","unstructured":"Alexander Ratner , Stephen H. Bach , Henry R. Ehrenberg , Jason Alan Fries , Sen Wu, and Christopher R\u00e9. 2017 . Snorkel : Rapid Training Data Creation with Weak Supervision . abs\/1711.10160 (2017). arXiv:1711.10160 http:\/\/arxiv.org\/abs\/1711.10160 Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher R\u00e9. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. abs\/1711.10160 (2017). arXiv:1711.10160 http:\/\/arxiv.org\/abs\/1711.10160"},{"key":"e_1_2_1_46_1","volume-title":"Sen Wu, Daniel Selsam, and Christopher R\u00e9.","author":"Ratner Alexander","year":"2017","unstructured":"Alexander Ratner , Christopher De Sa , Sen Wu, Daniel Selsam, and Christopher R\u00e9. 2017 . 
Data Programming : Creating Large Training Sets, Quickly . arXiv:1605.07723 [stat.ML] Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher R\u00e9. 2017. Data Programming: Creating Large Training Sets, Quickly. arXiv:1605.07723 [stat.ML]"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45--50","author":"\u0158eh\u016f\u0159ek Radim","year":"2010","unstructured":"Radim \u0158eh\u016f\u0159ek and Petr Sojka . 2010 . Software Framework for Topic Modelling with Large Corpora . In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45--50 . http:\/\/is.muni.cz\/publication\/884893\/en. Radim \u0158eh\u016f\u0159ek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45--50. http:\/\/is.muni.cz\/publication\/884893\/en."},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Dominique Ritze Oliver Lehmberg and Christian Bizer. 2015. Matching HTML Tables to DBpedia. In WIMS (Larnaca Cyprus). Article 10 6 pages.  Dominique Ritze Oliver Lehmberg and Christian Bizer. 2015. Matching HTML Tables to DBpedia. In WIMS (Larnaca Cyprus). Article 10 6 pages.","DOI":"10.1145\/2797115.2797118"},{"key":"e_1_2_1_49_1","volume-title":"The probabilistic relevance framework: BM25 and beyond","author":"Robertson Stephen","unstructured":"Stephen Robertson and Hugo Zaragoza . 2009. The probabilistic relevance framework: BM25 and beyond . Now Publishers Inc . Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2015.7298682"},{"key":"e_1_2_1_51_1","unstructured":"SciSpacy. [n. d.]. SpaCy models for biomedical text processing. 
https:\/\/allenai.github.io\/scispacy\/.  SciSpacy. [n. d.]. SpaCy models for biomedical text processing. https:\/\/allenai.github.io\/scispacy\/."},{"key":"e_1_2_1_52_1","unstructured":"Spacy. [n. d.]. SpaCy-Industrial-Strength Natural Language Processing. https:\/\/spacy.io.  Spacy. [n. d.]. SpaCy-Industrial-Strength Natural Language Processing. https:\/\/spacy.io."},{"key":"e_1_2_1_53_1","volume-title":"Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins. arXiv preprint arXiv:2106.01501","author":"Suri Sahaana","year":"2021","unstructured":"Sahaana Suri , Ihab F Ilyas , Christopher R\u00e9 , and Theodoros Rekatsinas . 2021 . Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins. arXiv preprint arXiv:2106.01501 (2021). Sahaana Suri, Ihab F Ilyas, Christopher R\u00e9, and Theodoros Rekatsinas. 2021. Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins. arXiv preprint arXiv:2106.01501 (2021)."},{"key":"e_1_2_1_54_1","unstructured":"Joshua Tauberer. 2014. Open Government Data (The Book). https:\/\/opengovdata.io\/. Accessed: 2021-07-21.  Joshua Tauberer. 2014. Open Government Data (The Book). https:\/\/opengovdata.io\/. Accessed: 2021-07-21."},{"key":"e_1_2_1_55_1","volume-title":"An IDEA: an ingestion framework for data enrichment in AsterixDB. arXiv preprint arXiv:1902.08271","author":"Wang Xikui","year":"2019","unstructured":"Xikui Wang and Michael J Carey . 2019. An IDEA: an ingestion framework for data enrichment in AsterixDB. arXiv preprint arXiv:1902.08271 ( 2019 ). Xikui Wang and Michael J Carey. 2019. An IDEA: an ingestion framework for data enrichment in AsterixDB. arXiv preprint arXiv:1902.08271 (2019)."},{"key":"e_1_2_1_56_1","volume-title":"process, and challenges of exploratory data analysis: an interview study. arXiv preprint arXiv:1911.00568","author":"Wongsuphasawat Kanit","year":"2019","unstructured":"Kanit Wongsuphasawat , Yang Liu , and Jeffrey Heer . 2019. 
Goals , process, and challenges of exploratory data analysis: an interview study. arXiv preprint arXiv:1911.00568 ( 2019 ). Kanit Wongsuphasawat, Yang Liu, and Jeffrey Heer. 2019. Goals, process, and challenges of exploratory data analysis: an interview study. arXiv preprint arXiv:1911.00568 (2019)."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407793"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13042-010-0001-0"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300065"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115409"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"},{"key":"e_1_2_1_62_1","volume-title":"Data integration---problems, approaches, and perspectives. Conceptual modelling in information systems engineering","author":"Ziegler Patrick","year":"2007","unstructured":"Patrick Ziegler and Klaus R Dittrich . 2007. Data integration---problems, approaches, and perspectives. Conceptual modelling in information systems engineering ( 2007 ), 39--58. Patrick Ziegler and Klaus R Dittrich. 2007. Data integration---problems, approaches, and perspectives. 
Conceptual modelling in information systems engineering (2007), 39--58."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3611479.3611533","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,23]],"date-time":"2023-09-23T22:15:41Z","timestamp":1695507341000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3611479.3611533"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7]]},"references-count":62,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2023,7]]}},"alternative-id":["10.14778\/3611479.3611533"],"URL":"https:\/\/doi.org\/10.14778\/3611479.3611533","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,7]]},"assertion":[{"value":"2023-08-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}