{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T02:05:35Z","timestamp":1768356335187,"version":"3.49.0"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2022,5,17]],"date-time":"2022-05-17T00:00:00Z","timestamp":1652745600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,5,17]],"date-time":"2022-05-17T00:00:00Z","timestamp":1652745600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["The VLDB Journal"],"published-print":{"date-parts":[[2023,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We focus on the key task of semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which present sparsity of data and a high degree of heterogeneity, even internally within each individual source. We assume each source provides a collection of entity specifications, i.e. entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. 
We take advantage of the opportunities arising from the redundancy of information across such sources and propose the iterative <jats:sc>RaF-STD<\/jats:sc> solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of <jats:sc>RaF-STD<\/jats:sc> over alternative approaches adapted from the literature.<\/jats:p>","DOI":"10.1007\/s00778-022-00743-3","type":"journal-article","created":{"date-parts":[[2022,5,17]],"date-time":"2022-05-17T06:02:39Z","timestamp":1652767359000},"page":"305-324","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Fine-grained semantic type discovery for heterogeneous sources using clustering"],"prefix":"10.1007","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4503-4947","authenticated-orcid":false,"given":"Federico","family":"Piai","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1513-4725","authenticated-orcid":false,"given":"Paolo","family":"Atzeni","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3852-8092","authenticated-orcid":false,"given":"Paolo","family":"Merialdo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7609-9217","authenticated-orcid":false,"given":"Divesh","family":"Srivastava","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,5,17]]},"reference":[{"issue":"12","key":"743_CR1","first-page":"993","volume":"9","author":"Z Abedjan","year":"2016","unstructured":"Abedjan, 
Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? PVLDB 9(12), 993\u20131004 (2016)","journal-title":"PVLDB"},{"key":"743_CR2","doi-asserted-by":"crossref","unstructured":"Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with coma++. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 906\u2013908 (2005)","DOI":"10.1145\/1066157.1066283"},{"key":"743_CR3","doi-asserted-by":"crossref","unstructured":"Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based overlapping clustering. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 532\u2013537 (2005)","DOI":"10.1145\/1081870.1081932"},{"issue":"2","key":"743_CR4","first-page":"71","volume":"41","author":"L Barbosa","year":"2018","unstructured":"Barbosa, L., Crescenzi, V., Dong, X.L., Merialdo, P., Piai, F., Qiu, D., Shen, Y., Srivastava, D.: Big data integration for product specifications. IEEE Data Eng. Bull. 41(2), 71\u201381 (2018)","journal-title":"IEEE Data Eng. Bull."},{"key":"743_CR5","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-16518-4","volume-title":"Schema Matching and Mapping","author":"Z Bellahsene","year":"2011","unstructured":"Bellahsene, Z., Bonifati, A., Rahm, E.: Schema Matching and Mapping. Springer Science & Business Media, Berlin (2011)"},{"key":"743_CR6","doi-asserted-by":"crossref","unstructured":"Berlin, J., Motro, A.: Autoplex: Automated discovery of content for virtual databases. In: International Conference on Cooperative Information Systems, pp. 108\u2013122. Springer (2001)","DOI":"10.1007\/3-540-44751-2_10"},{"key":"743_CR7","doi-asserted-by":"crossref","unstructured":"Bhagavatula, C.S., Noraset, T., Downey, D.: Tabel: Entity linking in web tables. In: International Semantic Web Conference, pp. 425\u2013441. 
Springer (2015)","DOI":"10.1007\/978-3-319-25007-6_25"},{"key":"743_CR8","doi-asserted-by":"crossref","unstructured":"Bilke, A., Naumann, F.: Schema matching using duplicates. In: 21st International Conference on Data Engineering (ICDE\u201905), pp. 69\u201380. IEEE (2005)","DOI":"10.1109\/ICDE.2005.126"},{"issue":"2","key":"743_CR9","doi-asserted-by":"publisher","first-page":"125","DOI":"10.3354\/meps005125","volume":"5","author":"SA Bloom","year":"1981","unstructured":"Bloom, S.A.: Similarity indices in community studies: potential pitfalls. Mar. Ecol. Prog. Ser 5(2), 125\u2013128 (1981)","journal-title":"Mar. Ecol. Prog. Ser"},{"key":"743_CR10","unstructured":"Brunner, U., Stockinger, K.: Entity matching with transformer architectures-a step forward in data integration. In: International Conference on Extending Database Technology, Copenhagen, 30 March-2 April 2020 (2020)"},{"key":"743_CR11","doi-asserted-by":"crossref","unstructured":"Cannaviccio, M., Barbosa, D., Merialdo, P.: Towards annotating relational data on the web with language models. In: Proceedings of the 2018 World Wide Web Conference, pp. 1307\u20131316 (2018)","DOI":"10.1145\/3178876.3186029"},{"issue":"2","key":"743_CR12","first-page":"10","volume":"41","author":"C Chen","year":"2018","unstructured":"Chen, C., Golshan, B., Halevy, A.Y., Tan, W.C., Doan, A.: Biggorilla: An open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull. 41(2), 10\u201322 (2018)","journal-title":"IEEE Data Eng. Bull."},{"key":"743_CR13","doi-asserted-by":"crossref","unstructured":"Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, pp. 
2201\u20132206 (2016)","DOI":"10.1145\/2882903.2912574"},{"issue":"7","key":"743_CR14","first-page":"680","volume":"5","author":"N Dalvi","year":"2012","unstructured":"Dalvi, N., Machanavajjhala, A., Pang, B.: An analysis of structured data on the web. PVLDB 5(7), 680\u2013691 (2012)","journal-title":"PVLDB"},{"issue":"3","key":"743_CR15","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1023\/A:1021765902788","volume":"50","author":"A Doan","year":"2003","unstructured":"Doan, A., Domingos, P., Halevy, A.: Learning to match the schemas of data sources: A multistrategy approach. Machine Learn. 50(3), 279\u2013301 (2003)","journal-title":"Machine Learn."},{"key":"743_CR16","doi-asserted-by":"crossref","unstructured":"Dong, X.L.: Challenges and innovations in building a product knowledge graph. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2869\u20132869. ACM (2018)","DOI":"10.1145\/3219819.3219938"},{"key":"743_CR17","doi-asserted-by":"crossref","unstructured":"Dong, X.L.: Building a broad knowledge graph for products. In: Proceedings of the 35th International Conference on Data Engineering (ICDE), pp. 25\u201325. IEEE (2019)","DOI":"10.1109\/ICDE.2019.00010"},{"issue":"1","key":"743_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-031-01853-4","volume":"7","author":"XL Dong","year":"2015","unstructured":"Dong, X.L., Srivastava, D.: Big data integration. Synthesis Lect. Data Manag. 7(1), 1\u2013198 (2015)","journal-title":"Synthesis Lect. Data Manag."},{"key":"743_CR19","unstructured":"Engmann, D., Massmann, S.: Instance matching with coma++. In: BTW workshops, vol.\u00a07, pp. 28\u201337 (2007)"},{"issue":"14","key":"743_CR20","first-page":"1845","volume":"7","author":"T Furche","year":"2014","unstructured":"Furche, T., Gottlob, G., Grasso, G., Guo, X., Orsi, G., Schallhart, C., Wang, C.: Diadem: thousands of websites to a single database. 
PVLDB 7(14), 1845\u20131856 (2014)","journal-title":"PVLDB"},{"key":"743_CR21","doi-asserted-by":"crossref","unstructured":"Guo, C., Hedeler, C., Paton, N.W., Fernandes, A.A.: Matchbench: benchmarking schema matching algorithms for schematic correspondences. In: British National Conference on Databases, pp. 92\u2013106. Springer (2013)","DOI":"10.1007\/978-3-642-39467-6_11"},{"key":"743_CR22","doi-asserted-by":"crossref","unstructured":"Hadjieleftheriou, M., Srivastava, D.: Approximate string processing. Foundations and Trends\u00ae in Databases 2(4), 267\u2013402 (2011)","DOI":"10.1561\/1900000010"},{"key":"743_CR23","doi-asserted-by":"crossref","unstructured":"Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., Demiralp, \u00c7., Hidalgo, C.: Sherlock: A deep learning approach to semantic data type detection. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1500\u20131508 (2019)","DOI":"10.1145\/3292500.3330993"},{"key":"743_CR24","doi-asserted-by":"crossref","unstructured":"Kang, J., Naughton, J.F.: On schema matching with opaque column names and data values. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 205\u2013216 (2003)","DOI":"10.1145\/872757.872783"},{"key":"743_CR25","doi-asserted-by":"crossref","unstructured":"Kannan, A., Givoni, I.E., Agrawal, R., Fuxman, A.: Matching unstructured product offers to structured product specifications. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 404\u2013412. ACM (2011)","DOI":"10.1145\/2020408.2020474"},{"issue":"12","key":"743_CR26","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1109\/2.116884","volume":"24","author":"W Kim","year":"1991","unstructured":"Kim, W., Seo, J.: Classifying schematic and data heterogeneity in multidatabase systems. 
Computer 24(12), 12\u201318 (1991)","journal-title":"Computer"},{"key":"743_CR27","doi-asserted-by":"crossref","unstructured":"Koutras, C., Siachamis, G., Ionescu, A., Psarakis, K., Brons, J., Fragkoulis, M., Lofi, C., Bonifati, A., Katsifodimos, A.: Valentine: Evaluating matching techniques for dataset discovery. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 468\u2013479. IEEE (2021)","DOI":"10.1109\/ICDE51399.2021.00047"},{"issue":"1","key":"743_CR28","first-page":"50","volume":"14","author":"Y Li","year":"2020","unstructured":"Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. PVLDB 14(1), 50\u201360 (2020)","journal-title":"PVLDB"},{"issue":"1\u20132","key":"743_CR29","first-page":"1338","volume":"3","author":"G Limaye","year":"2010","unstructured":"Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. PVLDB 3(1\u20132), 1338\u20131347 (2010)","journal-title":"PVLDB"},{"key":"743_CR30","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to Information Retrieval","author":"CD Manning","year":"2008","unstructured":"Manning, C.D., Raghavan, P., Sch\u00fctze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)"},{"key":"743_CR31","unstructured":"Mausam, M.: Open information extraction systems and downstream applications. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence, pp. 4074\u20134077 (2016)"},{"key":"743_CR32","doi-asserted-by":"crossref","unstructured":"Mork, P., Seligman, L., Rosenthal, A., Korb, J., Wolf, C.: The harmony integration workbench. In: Journal on Data Semantics XI, pp. 65\u201393. 
Springer (2008)","DOI":"10.1007\/978-3-540-92148-6_3"},{"key":"743_CR33","doi-asserted-by":"crossref","unstructured":"Nguyen, H., Fuxman, A., Paparizos, S., Freire, J., Agrawal, R.: Synthesizing products for online catalogs. PVLDB 4(7), 409\u2013418 (2011)","DOI":"10.14778\/1988776.1988777"},{"issue":"7","key":"743_CR34","first-page":"953","volume":"13","author":"M Ota","year":"2020","unstructured":"Ota, M., M\u00fcller, H., Freire, J., Srivastava, D.: Data-driven domain discovery for structured datasets. PVLDB 13(7), 953\u2013967 (2020)","journal-title":"PVLDB"},{"key":"743_CR35","doi-asserted-by":"crossref","unstructured":"Primpeli, A., Peeters, R., Bizer, C.: The wdc training dataset and gold standard for large-scale product matching. In: Companion Proceedings of The 2019 World Wide Web Conference, pp. 381\u2013386 (2019)","DOI":"10.1145\/3308560.3316609"},{"key":"743_CR36","doi-asserted-by":"crossref","unstructured":"Qiu, D., Barbosa, L., Crescenzi, V., Merialdo, P., Srivastava, D.: Big data linkage for product specification pages. In: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, pp. 67\u201381. ACM (2018)","DOI":"10.1145\/3183713.3183757"},{"issue":"4","key":"743_CR37","doi-asserted-by":"publisher","first-page":"334","DOI":"10.1007\/s007780100057","volume":"10","author":"E Rahm","year":"2001","unstructured":"Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334\u2013350 (2001)","journal-title":"VLDB J."},{"issue":"41","key":"743_CR38","first-page":"19","volume":"42","author":"D Ritze","year":"2017","unstructured":"Ritze, D., Bizer, C.: Matching web tables to dbpedia-a feature utility study. Context 42(41), 19\u201331 (2017)","journal-title":"Context"},{"key":"743_CR39","doi-asserted-by":"crossref","unstructured":"Ritze, D., Lehmberg, O., Bizer, C.: Matching html tables to dbpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, pp. 
1\u20136 (2015)","DOI":"10.1145\/2797115.2797118"},{"issue":"1","key":"743_CR40","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1007\/s11192-012-0889-0","volume":"96","author":"A Schubert","year":"2013","unstructured":"Schubert, A.: Measuring the similarity between the reference and citation distributions of journals. Scientometrics 96(1), 305\u2013313 (2013)","journal-title":"Scientometrics"},{"key":"743_CR41","unstructured":"Sekhavat, Y.A., Di\u00a0Paolo, F., Barbosa, D., Merialdo, P.: Knowledge base augmentation using tabular data. In: LDOW (2014)"},{"issue":"8","key":"743_CR42","doi-asserted-by":"publisher","first-page":"1254","DOI":"10.14778\/3457390.3457391","volume":"14","author":"N Tang","year":"2021","unstructured":"Tang, N., Fan, J., Li, F., Tu, J., Du, X., Li, G., Madden, S., Ouzzani, M.: Rpt: relational pre-trained transformer is almost all you need towards democratizing data preparation. Proc. VLDB Endowment 14(8), 1254\u20131261 (2021)","journal-title":"Proc. VLDB Endowment"},{"key":"743_CR43","doi-asserted-by":"crossref","unstructured":"Yan, C., He, Y.: Synthesizing type-detection logic for rich semantic data types using open-source code. In: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, pp. 35\u201350 (2018)","DOI":"10.1145\/3183713.3196888"},{"key":"743_CR44","unstructured":"Zhang, D., Li, D., Guo, L., Tan, K.L.: Unsupervised entity resolution with blocking and graph algorithms. IEEE Trans. Knowledge Data Eng. (2020)"},{"key":"743_CR45","doi-asserted-by":"crossref","unstructured":"Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, C., Tan, W.C.: Sato: Contextual semantic type detection in tables. PVLDB 13(11) (2019)","DOI":"10.14778\/3407790.3407793"},{"issue":"2","key":"743_CR46","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3372117","volume":"11","author":"S Zhang","year":"2020","unstructured":"Zhang, S., Balog, K.: Web table extraction, retrieval, and augmentation: a survey. 
ACM Trans. Intell. Syst. Technol. (TIST) 11(2), 1\u201335 (2020)","journal-title":"ACM Trans. Intell. Syst. Technol. (TIST)"}],"container-title":["The VLDB Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00778-022-00743-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00778-022-00743-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00778-022-00743-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,25]],"date-time":"2024-09-25T04:42:57Z","timestamp":1727239377000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00778-022-00743-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,17]]},"references-count":46,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3]]}},"alternative-id":["743"],"URL":"https:\/\/doi.org\/10.1007\/s00778-022-00743-3","relation":{},"ISSN":["1066-8888","0949-877X"],"issn-type":[{"value":"1066-8888","type":"print"},{"value":"0949-877X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,17]]},"assertion":[{"value":"4 May 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 December 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 March 2022","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 May 2022","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}