{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:29:29Z","timestamp":1750220969883,"version":"3.41.0"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2019,9,24]],"date-time":"2019-09-24T00:00:00Z","timestamp":1569283200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2019,10,31]]},"abstract":"<jats:p>Wrapper inference deals in generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make possible the production of large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches exploiting the cooperation of crowd workers and unsupervised algorithms. 
Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.<\/jats:p>","DOI":"10.1145\/3344720","type":"journal-article","created":{"date-parts":[[2019,9,25]],"date-time":"2019-09-25T12:57:52Z","timestamp":1569416272000},"page":"1-43","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Hybrid Crowd-Machine Wrapper Inference"],"prefix":"10.1145","volume":"13","author":[{"given":"Valter","family":"Crescenzi","sequence":"first","affiliation":[{"name":"Roma Tre University, Roma, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3852-8092","authenticated-orcid":false,"given":"Paolo","family":"Merialdo","sequence":"additional","affiliation":[{"name":"Roma Tre University, Roma, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Disheng","family":"Qiu","sequence":"additional","affiliation":[{"name":"Roma Tre University, Roma, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,9,24]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723722"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2003.11.004"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00116829"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872799"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-010-5174-y"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536209"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1992.10475194"},{"volume-title":"Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing 
(CSCW\u201915)","author":"Cheng Justin","key":"e_1_2_1_8_1","unstructured":"Justin Cheng and Michael S. Bernstein. 2015. Flock: Hybrid crowd-machine learning classifiers. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW\u201915). 600--611."},{"key":"e_1_2_1_9_1","first-page":"22","article-title":"Word association norms, mutual information, and lexicography","volume":"16","author":"Church Kenneth Ward","year":"1990","unstructured":"Kenneth Ward Church and Patrick Hanks. 1990a. Word association norms, mutual information, and lexicography. Comput. Linguist. 16, 1 (March 1990), 22--29. http:\/\/dl.acm.org\/citation.cfm?id=89086.89095","journal-title":"Comput. Linguist."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/89086.89095"},{"volume-title":"Proceedings of the 11th International World Wide Web Conference (WWW\u201902)","author":"Cohen William W.","key":"e_1_2_1_11_1","unstructured":"William W. Cohen, Matthew Hurst, and Lee S. Jensen. 2002. A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the 11th International World Wide Web Conference (WWW\u201902). 
232--241."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-017-1057-x"},{"volume-title":"Proceedings of the International Conference on Very Large Data Bases (VLDB\u201901)","author":"Crescenzi V.","key":"e_1_2_1_13_1","unstructured":"V. Crescenzi, G. Mecca, and P. Merialdo. 2001. RoadRunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Data Bases (VLDB\u201901). 109--118."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1080\/08839510701853093"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487788.2487927"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2488388.2488412"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10619-014-7163-9"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the OTM Confederated International Conferences \u201cOn the Move to Meaningful Internet Systems\u201d","author":"Da Silva Altigran S.","year":"2007","unstructured":"Altigran S. Da Silva, Denilson Barbosa, Joao M. B. Cavalcanti, and Marco A. S. Sevalho. 2007. Labeling data extracted from the web. In Proceedings of the OTM Confederated International Conferences \u201cOn the Move to Meaningful Internet Systems\u201d. 
Springer, 1099--1116."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/1938545.1938547"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1977.tb01600.x"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402809"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2750550"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816716"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2014.07.007"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989331"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733085.2733091"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2588576"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Leo A Goodman. 1949. On the estimation of the number of classes in a population. Ann. Math. Stat. (1949) 572--579.","DOI":"10.1214\/aoms\/1177729949"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1055558.1055560"},{"key":"e_1_2_1_30_1","volume-title":"Information Retrieval: Algorithms and Heuristics","author":"Grossman David A.","year":"2012","unstructured":"David A. Grossman and Ophir Frieder. 2012. Information Retrieval: Algorithms and Heuristics, Vol. 15. 
Springer Science & Business Media."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767842"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1135777.1135859"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137646"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2535242"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3231751.3231758"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000044"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497419"},{"key":"e_1_2_1_38_1","article-title":"A unified statistical framework for crowd labeling","volume":"45","author":"Muhammadi Jafar","year":"2015","unstructured":"Jafar Muhammadi, Hamid R. Rabiee, and Abbas Hosseini. 2015. A unified statistical framework for crowd labeling. Knowl. Inf. Syst. 45, 2 (01 November 2015), 271--294. DOI:https:\/\/doi.org\/10.1007\/s10115-014-0790-7","journal-title":"Knowl. Inf. Syst."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/1622572.1622579"},{"key":"e_1_2_1_40_1","article-title":"Label noise correction and application in crowdsourcing","volume":"66","journal-title":"Expert Syst. Appl.","author":"Nicholson Bryce","year":"2016","unstructured":"Bryce Nicholson, Victor S. Sheng, and Jing Zhang. 2016. Label noise correction and application in crowdsourcing. Expert Syst. Appl. 
66, C (2016), 149--162."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824120"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2016.7498320"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/2831360.2831372"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-04114-8_3"},{"volume-title":"Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI\u201914)","author":"Rahmanian Bahareh","key":"e_1_2_1_45_1","unstructured":"Bahareh Rahmanian and Joseph G. Davis. 2014. User interface design for crowdsourcing systems. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI\u201914). Paolo Paolini and Franca Garzotto (Eds.), ACM, 405--408. DOI:https:\/\/doi.org\/10.1145\/2598153.2602248"},{"key":"e_1_2_1_46_1","first-page":"491","article-title":"Eliminating spammers and ranking annotators for crowdsourced labeling tasks","volume":"13","author":"Raykar Vikas C.","year":"2012","unstructured":"Vikas C. Raykar and Shipeng Yu. 2012. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J. Mach. Learn. Res. 13, Feb (2012), 491--518.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_47_1","first-page":"1297","article-title":"Learning from crowds","volume":"11","author":"Raykar Vikas C.","year":"2010","unstructured":"Vikas C. Raykar, Shipeng Yu, Linda H. 
Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. J. Mach. Learn. Res. 11, Apr (2010), 1297--1322.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/18.705570"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Sheng Victor S.","year":"2008","unstructured":"Victor S. Sheng, Foster J. Provost, and Panagiotis G. Ipeirotis. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Ying Li, Bing Liu, and Sunita Sarawagi (Eds.). ACM, 614--622. DOI:https:\/\/doi.org\/10.1145\/1401890.1401965"},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing.","author":"Sheshadri Aashish","year":"2013","unstructured":"Aashish Sheshadri and Matthew Lease. 2013. Square: A benchmark for research on computing crowd consensus. 
In Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.788640"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/2350229.2350263"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Whitehill Jacob","year":"2009","unstructured":"Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proceedings of the Advances in Neural Information Processing Systems. 2035--2043."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661880"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2594515"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2017.2677468"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219958"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-016-9491-9"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3055540.3055547"}],"container-title":["ACM Transactions on Knowledge Discovery from 
Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3344720","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3344720","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:27Z","timestamp":1750204467000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3344720"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,9,24]]},"references-count":59,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2019,10,31]]}},"alternative-id":["10.1145\/3344720"],"URL":"https:\/\/doi.org\/10.1145\/3344720","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"type":"print","value":"1556-4681"},{"type":"electronic","value":"1556-472X"}],"subject":[],"published":{"date-parts":[[2019,9,24]]},"assertion":[{"value":"2018-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-09-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}