{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T14:27:36Z","timestamp":1775053656859,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2009,6,1]],"date-time":"2009-06-01T00:00:00Z","timestamp":1243814400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100002920","name":"Research Grants Council, University Grants Committee, Hong Kong","doi-asserted-by":"publisher","award":["HKUST6172\/04E"],"award-info":[{"award-number":["HKUST6172\/04E"]}],"id":[{"id":"10.13039\/501100002920","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2009,6]]},"abstract":"<jats:p>Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different Web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.<\/jats:p>","DOI":"10.1145\/1538909.1538914","type":"journal-article","created":{"date-parts":[[2009,6,30]],"date-time":"2009-06-30T13:10:17Z","timestamp":1246367417000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":58,"title":["ODE"],"prefix":"10.1145","volume":"34","author":[{"given":"Weifeng","family":"Su","sequence":"first","affiliation":[{"name":"BNU-HKBU United International College and Shenzhen Key Laboratory of Intelligent Media and Speech, PKU-HKUST Shenzhen Hong Kong Institution, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiying","family":"Wang","sequence":"additional","affiliation":[{"name":"City University of Hong Kong, Kowloon, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Frederick H.","family":"Lochovsky","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Kowloon, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2009,7,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872799"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/74697.74700"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/234285.234289"},{"key":"e_1_2_1_4_1","volume-title":"The deep Web: Surfacing hidden value. White paper","author":"Bergman M. K.","unstructured":"Bergman , M. K. 2001. The deep Web: Surfacing hidden value. White paper , BrightPlanet Corporation . http:\/\/www.brightplanet.com\/resources\/details\/deepweb.html. Bergman, M. K. 2001. The deep Web: Surfacing hidden value. White paper, BrightPlanet Corporation. http:\/\/www.brightplanet.com\/resources\/details\/deepweb.html."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.126"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 21st International Conference on Distributed Computing Systems. 361--370","author":"Buttler D.","unstructured":"Buttler , D. , Liu , L. , and Pu , C . 2001. A fully automated object extraction system for the World Wide Web . In Proceedings of the 21st International Conference on Distributed Computing Systems. 361--370 . Buttler, D., Liu, L., and Pu, C. 2001. A fully automated object extraction system for the World Wide Web. In Proceedings of the 21st International Conference on Distributed Computing Systems. 361--370."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.372182"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1031570.1031584"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1024694.1024704"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511477"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 27th International Conference on Very Large Data Bases. 109--118","author":"Crescenzi V.","unstructured":"Crescenzi , V. , Mecca , G. , and Merialdo , P . 2001. Roadrunner: Towards automatic data extraction from large Web sites . In Proceedings of the 27th International Conference on Very Large Data Bases. 109--118 . Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large Web sites. In Proceedings of the 27th International Conference on Very Large Data Bases. 109--118."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-023X(99)00027-0"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1024452529781"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/635484.635485"},{"key":"e_1_2_1_16_1","volume-title":"Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology","author":"Gusfield D.","unstructured":"Gusfield , D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology . Cambridge University Press , Cambridge, UK . Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1132863.1132872"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/306766.306775"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(99)00100-9"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007584"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/956750.956826"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 23rd IEEE International Conference on Data Engineering. 376--385","author":"Lu Y.","unstructured":"Lu , Y. , He , H. , Zhao , H. , Meng , W. , and Yu , C . 2007. Annotating structured data of the deep Web . In Proceedings of the 23rd IEEE International Conference on Data Engineering. 376--385 . Lu, Y., He, H., Zhao, H., Meng, W., and Yu, C. 2007. Annotating structured data of the deep Web. In Proceedings of the 23rd IEEE International Conference on Data Engineering. 376--385."},{"key":"e_1_2_1_23_1","volume-title":"Department of Statistics","author":"Minka T.","unstructured":"Minka , T. 2003. A comparison of numerical optimizers for logistic regression. Tech. rep ., Department of Statistics , Carnegie Mellon University . Minka, T. 2003. A comparison of numerical optimizers for logistic regression. Tech. rep., Department of Statistics, Carnegie Mellon University."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/301136.301191"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 1st Empirical Methods in Natural Language Processing Conference. 133--141","author":"Ratnaparkhi A.","year":"1996","unstructured":"Ratnaparkhi , A. 1996 . A maximum entropy model for part-of-speech tagging . In Proceedings of the 1st Empirical Methods in Natural Language Processing Conference. 133--141 . Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the 1st Empirical Methods in Natural Language Processing Conference. 133--141."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/11896548_42"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1099554.1099672"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the Conference on Agent-Oriented Information Systems. 99--110","author":"Snoussi H.","unstructured":"Snoussi , H. , Magnin , L. , and Nie , J . -Y. 2001. Heterogeneous Web data extraction using ontologies . In Proceedings of the Conference on Agent-Oriented Information Systems. 99--110 . Snoussi, H., Magnin, L., and Nie, J.-Y. 2001. Heterogeneous Web data extraction using ontologies. In Proceedings of the Conference on Agent-Oriented Information Systems. 99--110."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/11687238_8"},{"key":"e_1_2_1_30_1","volume-title":"PADE: Pair-wise alignment-based data extraction. Tech. rep. HKUST-CS09-01, Department of Computer Science and Engineering","author":"Su W.","year":"2009","unstructured":"Su , W. , Wang , J. , Lochovsky , F. H. , and Liu , Y . 2009 . PADE: Pair-wise alignment-based data extraction. Tech. rep. HKUST-CS09-01, Department of Computer Science and Engineering , The Hong Kong University of Science and Technology , Hong Kong . Su, W., Wang, J., Lochovsky, F. H., and Liu, Y. 2009. PADE: Pair-wise alignment-based data extraction. Tech. rep. HKUST-CS09-01, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.datak.2009.02.010"},{"key":"e_1_2_1_32_1","volume-title":"Lecture Notes in Computer Science","volume":"4801","author":"Tao C.","unstructured":"Tao , C. and Embley , D. W . 2007. Automatic hidden-Web table interpretation by sibling page comparison. In Conceptual Modeling -- ER'07 . Lecture Notes in Computer Science , vol. 4801 Springer Berlin, 566--581. Tao, C. and Embley, D. W. 2007. Automatic hidden-Web table interpretation by sibling page comparison. In Conceptual Modeling -- ER'07. Lecture Notes in Computer Science, vol. 4801 Springer Berlin, 566--581."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-005-0360-8"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the XVII Simp\u00f3sio Brasileiro de Banco de Dados. 252--262","author":"Vivan O. M.","unstructured":"Vivan , O. M. and Heuser , C. A . 2002. Semiautomatic generation of data-extraction ontologies from relational databases . In Proceedings of the XVII Simp\u00f3sio Brasileiro de Banco de Dados. 252--262 . Vivan, O. M. and Heuser, C. A. 2002. Semiautomatic generation of data-extraction ontologies from relational databases. In Proceedings of the XVII Simp\u00f3sio Brasileiro de Banco de Dados. 252--262."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775179"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 30th International Conference on Very Large Data Bases. 408--419","author":"Wang J.","unstructured":"Wang , J. , Wen , J. , Lochovsky , F. H. , and Ma , W. Y . 2004. Instance-Based schema matching for Web databases by domain-specific query probing . In Proceedings of the 30th International Conference on Very Large Data Bases. 408--419 . Wang, J., Wen, J., Lochovsky, F. H., and Ma, W. Y. 2004. Instance-Based schema matching for Web databases by domain-specific query probing. In Proceedings of the 30th International Conference on Very Large Data Bases. 408--419."},{"key":"e_1_2_1_37_1","unstructured":"World Wide Web Consortium. 1999. HTML 4.01 specification. http:\/\/www.w3.org\/TR\/REC-html40\/.  World Wide Web Consortium. 1999. HTML 4.01 specification. http:\/\/www.w3.org\/TR\/REC-html40\/."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/11607380_2"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.197"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1011441423217"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060760"}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1538909.1538914","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1538909.1538914","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T20:26:54Z","timestamp":1750278414000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1538909.1538914"}},"subtitle":["Ontology-assisted data extraction"],"short-title":[],"issued":{"date-parts":[[2009,6]]},"references-count":40,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2009,6]]}},"alternative-id":["10.1145\/1538909.1538914"],"URL":"https:\/\/doi.org\/10.1145\/1538909.1538914","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,6]]},"assertion":[{"value":"2008-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-07-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}