{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T16:29:06Z","timestamp":1766507346886,"version":"3.48.0"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022YFB2702100"],"award-info":[{"award-number":["2022YFB2702100"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"crossref","award":["61932004, 62225203, U21A20516"],"award-info":[{"award-number":["61932004, 62225203, U21A20516"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>A data lake maintains large amounts of heterogeneous data with different data schemas and query interfaces. Efficiently querying and analyzing the heterogeneous data enables users to gain more complete insights. In this article, we study a novel problem of distributed keyword search across heterogeneous data sources. Traditional distributed search algorithms generally require the predefined crossing edges connecting relevant data instances for communication between different sources, which is unpractical for the data lake due to the schema heterogeneity. To effectively perform keyword search over the data lake, we first introduce canonical graphs and then develop a best-first search algorithm called UnifySea, which explores the answers across different sources based on the unified identification of related instances. 
To further improve the query efficiency, we propose a novel incremental keyword search algorithm called DistSea, which only needs to identify the promising relevant data between different sources. DistSea incrementally calculates the optimal answers based on locally partial evaluation. Equipped with several efficient pruning rules, DistSea reduces unpromising tree calculation across different sources. Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.<\/jats:p>","DOI":"10.1145\/3772001","type":"journal-article","created":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T10:52:14Z","timestamp":1761043934000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploring Heterogeneous Data Lake Based on Canonical\u00a0Graphs"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-7123-6155","authenticated-orcid":false,"given":"Qin","family":"Yuan","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0247-9866","authenticated-orcid":false,"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2914-912X","authenticated-orcid":false,"given":"Zhenyu","family":"Wen","sequence":"additional","affiliation":[{"name":"Zhejiang University of Technology, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0181-8379","authenticated-orcid":false,"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,23]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Amazon S3\u2014Cloud object storage. 2025. 
Retrieved from https:\/\/aws.amazon.com\/s"},{"key":"e_1_3_1_3_2","unstructured":"Huawei Data Storage. 2025. Retrieved from https:\/\/e.huawei.com\/en\/products\/storage"},{"key":"e_1_3_1_4_2","unstructured":"ChEMBL: A Database of Bioactive Drug-like Small Molecules. 2016. Retrieved from https:\/\/www.ebi.ac.uk\/chembl\/downloads"},{"key":"e_1_3_1_5_2","unstructured":"DrugCentral. 2017. Retrieved from http:\/\/drugcentral.org\/download"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00084"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3512026"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3085504.3091116"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2002.994756"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-020-09379-9"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3236252"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2010.5447818"},{"key":"e_1_3_1_13_2","volume-title":"Introduction to Algorithms","author":"Cormen Thomas H.","year":"2022","unstructured":"Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2022. Introduction to Algorithms. MIT Press."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453982"},{"key":"e_1_3_1_15_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT \u201919)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT \u201919), Vol. 
1 (Long and Short Papers). Jill Burstein, Christy Doran, and Thamar Solorio (Eds.), Association for Computational Linguistics, 4171\u20134186."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2007.367929"},{"key":"e_1_3_1_17_2","unstructured":"Grace Fan Jin Wang Yuliang Li Dan Zhang and Ren\u00e9e Miller. 2022. Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning. arXiv:2210.01922. Retrieved from https:\/\/arxiv.org\/abs\/2210.01922"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3639363"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3282488"},{"key":"e_1_3_1_20_2","first-page":"1001","volume-title":"Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE \u201918)","author":"Castro Fernandez Raul","year":"2018","unstructured":"Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A data discovery system. In Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE \u201918), 1001\u20131012."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3058740"},{"key":"e_1_3_1_22_2","unstructured":"M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness."},{"key":"e_1_3_1_23_2","doi-asserted-by":"crossref","unstructured":"L. Guo F. Shao C. Botev and J. Shanmugasundaram. 2003. XRANK: Ranked keyword search over XML documents. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data 16\u201327.","DOI":"10.1145\/872757.872762"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899389"},{"key":"e_1_3_1_25_2","unstructured":"Rihan Hai Christoph Quix and Matthias Jarke. 2021. Data lake concept and systems: A survey. arXiv:2106.09592. 
Retrieved from https:\/\/arxiv.org\/abs\/2106.09592"},{"key":"e_1_3_1_26_2","first-page":"795","volume-title":"Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data","author":"Halevy Alon Y.","year":"2016","unstructured":"Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Goods: Organizing Google\u2019s datasets. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data. ACM, 795\u2013806."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247516"},{"key":"e_1_3_1_28_2","volume-title":"Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR \u201917)","author":"Hellerstein Joseph M.","year":"2017","unstructured":"Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, et al. 2017. Ground: A data context service. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR \u201917)."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/B978-155860869-6\/50065-2"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.14778\/3654621.3654640"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2023.3313726"},{"issue":"6","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"2322","DOI":"10.1109\/TKDE.2019.2956535","article-title":"A generic ontology framework for indexing keyword search on massive graphs","volume":"33","author":"Jiang Jiaxin","year":"2019","unstructured":"Jiaxin Jiang, Byron Choi, Jianliang Xu, and Sourav S. Bhowmick. 2019. A generic ontology framework for indexing keyword search on massive graphs. 
IEEE Transactions on Knowledge and Data Engineering 33, 6 (2019), 2322\u20132336.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00046"},{"key":"e_1_3_1_34_2","unstructured":"Varun Kacholia Shashank Pandit Soumen Chakrabarti S. Sudarshan Rushi Desai and Hrishikesh Karambelkar. 2005. Bidirectional expansion for keyword search on graph databases. In Proceedings of the 31st International Conference on Very Large Data Bases 505\u2013516."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2015.7363784"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.14778\/2021017.2021025"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.124"},{"key":"e_1_3_1_38_2","first-page":"411","article-title":"Meaningful keyword search in relational databases with large and complex schema","author":"Kargar M.","year":"2015","unstructured":"M. Kargar, A. An, N. Cercone, P. Godfrey, and X. Yu. 2015. Meaningful keyword search in relational databases with large and complex schema. In ICDE, 411\u2013422.","journal-title":"ICDE"},{"issue":"2","key":"e_1_3_1_39_2","doi-asserted-by":"crossref","first-page":"601","DOI":"10.1109\/TKDE.2020.2985376","article-title":"Effective keyword search over weighted graphs","volume":"34","author":"Kargar Mehdi","year":"2020","unstructured":"Mehdi Kargar, Lukasz Golab, Divesh Srivastava, Jaroslaw Szlichta, and Morteza Zihayat. 2020. Effective keyword search over weighted graphs. IEEE Transactions on Knowledge and Data Engineering 34, 2 (2020), 601\u2013616.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1137\/S1064827595287997"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588689"},{"key":"e_1_3_1_42_2","unstructured":"Klyne Graham Carroll and J. Jeremy Brian. 2004. 
Resource description framework (RDF): Concepts and abstract syntax. W3C Recommendation."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376706"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431816"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331252"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-22655-7_1"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457258"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380605"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536217"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.67"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380110"},{"issue":"8","key":"e_1_3_1_54_2","first-page":"103","article-title":"A bridging model for parallel computation","volume":"33","author":"Valiant Leslie G.","year":"1990","unstructured":"Leslie G. Valiant. 1990. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990), 103\u2013111.","journal-title":"Communications of the ACM"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-07443-6_15"},{"key":"e_1_3_1_56_2","first-page":"249","volume-title":"A survey of algorithms for keyword search on graph data. Managing and Mining Graph Data.","author":"Wang H.","year":"2010","unstructured":"H. Wang and C. C. Aggarwal. 2010. A survey of algorithms for keyword search on graph data. In Managing and Mining Graph Data. 
Springer, 249\u2013273."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482008"},{"key":"e_1_3_1_58_2","doi-asserted-by":"crossref","unstructured":"Pengfei Wang Xiaocan Zeng Lu Chen Fan Ye Yuren Mao Junhao Zhu and Yunjun Gao. 2022. Promptem: Prompt-tuning for low-resource generalized entity matching. arXiv:2207.04802. Retrieved from https:\/\/arxiv.org\/abs\/2207.04802","DOI":"10.14778\/3565816.3565836"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00391"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1023"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00154-4"},{"key":"e_1_3_1_62_2","unstructured":"Jeffrey Xu Yu Lu Qin and Lijun Chang. 2010. Keyword search in relational databases: A survey. IEEE Computer Society Data Engineering Bulletin 33 1 (2010) 67\u201378."},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE65448.2025.00054"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE65448.2025.00053"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531759"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2017.2656079"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.222"},{"key":"e_1_3_1_68_2","volume-title":"Proceedings of the VLDB 2018 PhD Workshop co-Located with the 44th International Conference on Very Large Databases.","author":"Zhang Chao","year":"2018","unstructured":"Chao Zhang. 2018. Parameter curation and data generation for benchmarking multi-model queries. 
In Proceedings of the VLDB 2018 PhD Workshop co-Located with the 44th International Conference on Very Large Databases."},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00091"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389726"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10707-017-0299-9"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00205"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300065"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772001","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T14:06:36Z","timestamp":1766498796000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772001"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,23]]},"references-count":72,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3772001"],"URL":"https:\/\/doi.org\/10.1145\/3772001","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"type":"print","value":"1046-8188"},{"type":"electronic","value":"1558-2868"}],"subject":[],"published":{"date-parts":[[2025,12,23]]},"assertion":[{"value":"2025-01-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-10","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}