{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:06:43Z","timestamp":1755839203580,"version":"3.40.3"},"publisher-location":"Cham","reference-count":27,"publisher":"Springer International Publishing","isbn-type":[{"type":"print","value":"9783030883607"},{"type":"electronic","value":"9783030883614"}],"license":[{"start":{"date-parts":[[2021,1,1]],"date-time":"2021-01-01T00:00:00Z","timestamp":1609459200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,30]],"date-time":"2021-09-30T00:00:00Z","timestamp":1632960000000},"content-version":"vor","delay-in-days":272,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Semantic markup, such as , allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google\u2019s Dataset Search. Dataset Search relies on  to identify pages that describe datasets. While  was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide  markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search\u2019s Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with  markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.<\/jats:p>","DOI":"10.1007\/978-3-030-88361-4_20","type":"book-chapter","created":{"date-parts":[[2021,9,29]],"date-time":"2021-09-29T07:07:22Z","timestamp":1632899242000},"page":"338-356","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages"],"prefix":"10.1007","author":[{"given":"Tarfah","family":"Alrashed","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dimitris","family":"Paparas","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Omar","family":"Benjelloun","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ying","family":"Sheng","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Natasha","family":"Noy","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,9,30]]},"reference":[{"key":"20_CR1","doi-asserted-by":"publisher","unstructured":"Baykan, E., Henzinger, M., Marian, L., Weber, I.: Purely URL-based topic classification. In: 18th International Conference on World Wide Web. WWW 2009, pp. 1109\u20131110 (2009). https:\/\/doi.org\/10.1145\/1526709.1526880","DOI":"10.1145\/1526709.1526880"},{"key":"20_CR2","doi-asserted-by":"crossref","unstructured":"Benjelloun, O., Chen, S., Noy, N.: Google dataset search by the numbers. In: International Semantic Web Conference (2020)","DOI":"10.1007\/978-3-030-62466-8_41"},{"key":"20_CR3","doi-asserted-by":"publisher","unstructured":"Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P.: Liquid query: multi-domain exploratory search on the web. In: 19th International Conference on World Wide Web. WWW 2010, pp. 161\u2013170 (2010). https:\/\/doi.org\/10.1145\/1772690.1772708","DOI":"10.1145\/1772690.1772708"},{"issue":"1","key":"20_CR4","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1007\/s00778-019-00564-x","volume":"29","author":"A Chapman","year":"2019","unstructured":"Chapman, A., et al.: Dataset search: a survey. VLDB J. 29(1), 251\u2013272 (2019). https:\/\/doi.org\/10.1007\/s00778-019-00564-x","journal-title":"VLDB J."},{"key":"20_CR5","unstructured":"Choudhury, S., Batra, T., Hughes, C.: Content-based and link-based methods for categorical webpage classification (2016)"},{"key":"20_CR6","unstructured":"Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., Yang, S.: AdaNet: adaptive structural learning of artificial neural networks. In: International Conference on Machine Learning, pp. 874\u2013883 (2017)"},{"key":"20_CR7","unstructured":"Craven, M., McCallum, A., PiPasquo, D., Mitchell, T., Freitag, D.: Learning to extract symbolic knowledge from the world wide web, Tech. Rep. Carnegie-mellon univ pittsburgh pa school of computer Science (1998)"},{"issue":"1","key":"20_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41597-019-0031-8","volume":"6","author":"M Fenner","year":"2019","unstructured":"Fenner, M., Crosas, M., et al.: A data citation roadmap for scholarly data repositories. Sci. Data 6(1), 1\u20139 (2019). https:\/\/doi.org\/10.1038\/s41597-019-0031-8","journal-title":"Sci. Data"},{"key":"20_CR9","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1007\/11551362_33","volume-title":"Research and Advanced Technology for Digital Libraries","author":"K Golub","year":"2005","unstructured":"Golub, K., Ard\u00f6, A.: Importance of HTML structural elements and metadata in automated subject classification. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds.) ECDL 2005. LNCS, vol. 3652, pp. 368\u2013378. Springer, Heidelberg (2005). https:\/\/doi.org\/10.1007\/11551362_33"},{"key":"20_CR10","doi-asserted-by":"crossref","unstructured":"Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of structured data on the web. Commun. ACM 59(2), 44\u201351 (2016)","DOI":"10.1145\/2844544"},{"key":"20_CR11","doi-asserted-by":"publisher","unstructured":"Hern\u00e1ndez, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: A statistical approach to URL-based web page clustering. In: 21st International Conference on World Wide Web. WWW 2012 Companion, pp. 525\u2013526 (2012). https:\/\/doi.org\/10.1145\/2187980.2188109","DOI":"10.1145\/2187980.2188109"},{"key":"20_CR12","first-page":"26","volume":"628","author":"A Hogan","year":"2010","unstructured":"Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. LDOW 628, 26 (2010)","journal-title":"LDOW"},{"key":"20_CR13","unstructured":"Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)"},{"key":"20_CR14","doi-asserted-by":"publisher","unstructured":"Kocayusufoglu, F., et al.: Riser: learning better representations for richly structured emails. In: The Web Conference, WWW 2019, pp. 886\u2013895 (2019). https:\/\/doi.org\/10.1145\/3308558.3313720","DOI":"10.1145\/3308558.3313720"},{"key":"20_CR15","doi-asserted-by":"publisher","unstructured":"Koesten, L.M., Kacprzak, E., Tennison, J.F.A., Simperl, E.: The trials and tribulations of working with structured data: -a study on information seeking behaviour. In: CHI 2017 (2017). https:\/\/doi.org\/10.1145\/3025453.3025838","DOI":"10.1145\/3025453.3025838"},{"key":"20_CR16","doi-asserted-by":"crossref","unstructured":"Krutil, J., Kud\u011bka, M., Sn\u00e1\u0161el, V.: Web page classification based on schema.org collection. In: 2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN), pp. 356\u2013360 (2012)","DOI":"10.1109\/CASoN.2012.6412428"},{"key":"20_CR17","doi-asserted-by":"publisher","unstructured":"Lin, B.Y., Sheng, Y., Vo, N., Tata, S.: FreeDOM: a transferable neural architecture for structured information extraction on web documents. In: ACM KDD, pp. 1092\u20131102 (2020). https:\/\/doi.org\/10.1145\/3394486.3403153","DOI":"10.1145\/3394486.3403153"},{"key":"20_CR18","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"152","DOI":"10.1007\/978-3-319-18818-8_10","volume-title":"The Semantic Web. Latest Advances and New Domains","author":"R Meusel","year":"2015","unstructured":"Meusel, R., Paulheim, H.: Heuristics for fixing common errors in deployed schema.org microdata. In: Gandon, F., Sabou, M., Sack, H., d\u2019Amato, C., Cudr\u00e9-Mauroux, P., Zimmermann, A. (eds.) ESWC 2015. LNCS, vol. 9088, pp. 152\u2013168. Springer, Cham (2015). https:\/\/doi.org\/10.1007\/978-3-319-18818-8_10"},{"key":"20_CR19","doi-asserted-by":"crossref","unstructured":"Najork, M.: Web spam detection encyclopedia of database systems (2009)","DOI":"10.1007\/978-0-387-39940-9_465"},{"key":"20_CR20","doi-asserted-by":"publisher","unstructured":"Noy, N., Brickley, D., Burgess, M.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The Web Conference, WWW 2019 (2019). https:\/\/doi.org\/10.1145\/3308558.3313685","DOI":"10.1145\/3308558.3313685"},{"key":"20_CR21","doi-asserted-by":"publisher","unstructured":"Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: A Empirical Methods in Natural Language Processing, EMNLP, USA, pp. 79\u201386 (2002). https:\/\/doi.org\/10.3115\/1118693.1118704","DOI":"10.3115\/1118693.1118704"},{"key":"20_CR22","doi-asserted-by":"publisher","unstructured":"Qi, X., Davison, B.D.: Web page classification: Features and algorithms. ACM Comput. Surv. 41(2) (2009). https:\/\/doi.org\/10.1145\/1459352.1459357","DOI":"10.1145\/1459352.1459357"},{"issue":"1","key":"20_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1002\/meet.14504701240","volume":"47","author":"AH Renear","year":"2010","unstructured":"Renear, A.H., Sacchi, S., Wickett, K.M.: Definitions of dataset in the scientific and technical literature. Am. Soc. Inf. Sci. Technol. 47(1), 1\u20134 (2010). https:\/\/doi.org\/10.1002\/meet.14504701240","journal-title":"Am. Soc. Inf. Sci. Technol."},{"issue":"4","key":"20_CR24","first-page":"18","volume":"2","author":"R Shettar","year":"2007","unstructured":"Shettar, R., Bhuptani, R.: A vertical search engine-based on domain classifier. Int. J. Comp. Sci. Secur. 2(4), 18\u201327 (2007)","journal-title":"Int. J. Comp. Sci. Secur."},{"key":"20_CR25","doi-asserted-by":"publisher","unstructured":"Wang, Q., Kanagal, B., Garg, V., Sivakumar, D.: Constructing a comprehensive events database from the web. In: 28th ACM CIKM (2019). https:\/\/doi.org\/10.1145\/3357384.3357986","DOI":"10.1145\/3357384.3357986"},{"key":"20_CR26","doi-asserted-by":"crossref","unstructured":"Xiong, C., Liu, Z., Callan, J., Liu, T.Y.: Towards better text understanding and retrieval through kernel entity salience modeling. In: 41st ACM SIGIR (2018)","DOI":"10.1145\/3209978.3209982"},{"key":"20_CR27","doi-asserted-by":"crossref","unstructured":"Zhao, Q., Yang, W., Hua, R.: Design and research of composite web page classification network based on deep learning. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1531\u20131535. IEEE (2019)","DOI":"10.1109\/ICTAI.2019.00219"}],"container-title":["Lecture Notes in Computer Science","The Semantic Web \u2013 ISWC 2021"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-030-88361-4_20","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T22:18:08Z","timestamp":1634768288000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-030-88361-4_20"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"ISBN":["9783030883607","9783030883614"],"references-count":27,"URL":"https:\/\/doi.org\/10.1007\/978-3-030-88361-4_20","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2021]]},"assertion":[{"value":"30 September 2021","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"ISWC","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Semantic Web Conference","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2021","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"24 October 2021","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"28 October 2021","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"20","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"semweb2021","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/iswc2021.semanticweb.org\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Double-blind","order":1,"name":"type","label":"Type","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"Easychair","order":2,"name":"conference_management_system","label":"Conference Management System","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"202","order":3,"name":"number_of_submissions_sent_for_review","label":"Number of Submissions Sent for Review","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"42","order":4,"name":"number_of_full_papers_accepted","label":"Number of Full Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"0","order":5,"name":"number_of_short_papers_accepted","label":"Number of Short Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"21% - The value is computed by the equation \"Number of Full Papers Accepted \/ Number of Submissions Sent for Review * 100\" and then rounded to a whole number.","order":6,"name":"acceptance_rate_of_full_papers","label":"Acceptance Rate of Full Papers","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"3.5","order":7,"name":"average_number_of_reviews_per_paper","label":"Average Number of Reviews per Paper","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"2.5","order":8,"name":"average_number_of_papers_per_reviewer","label":"Average Number of Papers per Reviewer","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"Yes","order":9,"name":"external_reviewers_involved","label":"External Reviewers Involved","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}}]}}