{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T10:05:48Z","timestamp":1775815548792,"version":"3.50.1"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>\n            We introduce\n            <jats:italic>landmark grammars<\/jats:italic>\n            , a new family of context-free grammars aimed at describing the HTML source code of pages published by large and templated websites and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent\n            <jats:italic>ambiguity<\/jats:italic>\n            of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in literature.\n          <\/jats:p>\n          <jats:p>\n            We then formalize the\n            <jats:italic>Smallest Extraction Problem<\/jats:italic>\n            (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extract their data.\n          <\/jats:p>\n          <jats:p>Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. 
The experiments on consolidated benchmarks show that the approach can substantially contribute to improve the state-of-the-art.<\/jats:p>","DOI":"10.14778\/3476249.3476293","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T16:46:23Z","timestamp":1635353183000},"page":"2445-2458","source":"Crossref","is-referenced-by-count":4,"title":["The smallest extraction problem"],"prefix":"10.14778","volume":"14","author":[{"given":"Valerio","family":"Cetorelli","sequence":"first","affiliation":[{"name":"Universit\u00e0 Roma Tre, Rome, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paolo","family":"Atzeni","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre, Rome, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Valter","family":"Crescenzi","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre, Rome, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Franco","family":"Milicchio","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre, Rome, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/276305.276330"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872799"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.disc.2007.08.100"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICEEI.2017.8312458"},{"key":"e_1_2_1_5_1","volume-title":"The smallest grammar problem revisited. 
CoRR abs\/1908.06428","author":"Bannai Hideo","year":"2019"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536209"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-020-10013-w"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2005.850116"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142555"},{"key":"e_1_2_1_10_1","volume-title":"Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, and Divesh Srivastava.","author":"Crescenzi Valter","year":"2021"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/306766.306777"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1017460.1017462"},{"key":"e_1_2_1_13_1","volume-title":"AAAI-04 ATEM Workshop.","author":"Crescenzi Valter","year":"2004"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/645927.672370"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10619-014-7163-9"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3344720"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2016.05.003"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559882"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403323"},{"key":"e_1_2_1_20_1","unstructured":"Steve Faulkner Arron Eicholz Travis Leithead Alex Danilo and Sangwhan Moon. 2017. HTML 5.2. W3C. 
Retrieved January 17 (2017) 2018."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/321172.321179"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733085.2733091"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915214"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1062745.1062763"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1055558.1055560"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/1951778"},{"key":"e_1_2_1_27_1","volume-title":"Parsing Techniques. Monographs in Computer Science","author":"Grune Dick","year":"2007"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767842"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313529"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010020"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1230819.1241670"},{"key":"e_1_2_1_32_1","unstructured":"Crunchbase Inc. 2013. Lixto acquired by McKinsey. 
https:\/\/www.crunchbase.com\/organization\/lixto-software."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.82"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/18.841160"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/565117.565137"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.109"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/3231751.3231758"},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Lockard Colin","year":"2019"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371878"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1515\/gcc-2012-0016"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824120"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2016.7498320"},{"key":"e_1_2_1_43_1","unstructured":"Home page. 2021. Diffbot. https:\/\/www.diffbot.com\/."},{"key":"e_1_2_1_44_1","unstructured":"Home page. 2021. import.io. https:\/\/www.import.io\/."},{"key":"e_1_2_1_45_1","unstructured":"Home page. 2021. Lixto. http:\/\/www.lixto.com\/."},{"key":"e_1_2_1_46_1","volume-title":"Meltwater: Media Monitoring & Social Listening Platform. https:\/\/www.meltwater.com\/.","author":"Home","year":"2021"},{"key":"e_1_2_1_47_1","unstructured":"Home page. 2021. Wrapidity. 
https:\/\/www.wrapidity.com."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402707.3402735"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3479525"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380608"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/2553104"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.5555\/3360120"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2011.12.006"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/773294"},{"key":"e_1_2_1_55_1","volume-title":"Sicco Verwer, Menno van Zaanen","author":"Siyari Payam"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2013.161"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/322344.322346"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.5555\/832313.837482"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-012-0281-y"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060761"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060760"}],"container-title":["Proceedings of the VLDB 
Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476249.3476293","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:07:17Z","timestamp":1672222037000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476249.3476293"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":61,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476249.3476293"],"URL":"https:\/\/doi.org\/10.14778\/3476249.3476293","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}