{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T00:53:16Z","timestamp":1774399996349,"version":"3.50.1"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2015,8]]},"abstract":"<jats:p>We propose a method for extracting logical hierarchical structure of HTML documents. Because mark-up structure in HTML documents does not necessarily coincide with logical hierarchical structure, it is not trivial how to extract logical structure of HTML documents. Human readers, however, easily understand their logical structure. The key information used by them is headings in the documents. Human readers exploit the following properties of headings: (1) headings appear at the beginning of the corresponding blocks, (2) headings are given prominent visual styles, (3) headings of the same level share the same visual style, and (4) headings of higher levels are given more prominent visual styles. Our method also exploits these properties for extracting hierarchical headings and their associated blocks. Our experiment shows that our method outperforms existing methods. In addition, our method extracts not only hierarchical blocks but also their associated headings.<\/jats:p>","DOI":"10.14778\/2824032.2824058","type":"journal-article","created":{"date-parts":[[2015,9,16]],"date-time":"2015-09-16T12:18:17Z","timestamp":1442405897000},"page":"1606-1617","source":"Crossref","is-referenced-by-count":23,"title":["Extracting logical hierarchical structure of HTML documents based on headings"],"prefix":"10.14778","volume":"8","author":[{"given":"Tomohiro","family":"Manabe","sequence":"first","affiliation":[{"name":"Kyoto University, Sakyo, Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Keishi","family":"Tajima","sequence":"additional","affiliation":[{"name":"Kyoto University, Sakyo, Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536336.2536343"},{"key":"e_1_2_1_2_1","first-page":"374","volume-title":"Proc. of ICDAR","author":"Anjewierden A.","year":"2001","unstructured":"A. Anjewierden . AIDAS : Incremental logical structure discovery in PDF documents . In Proc. of ICDAR , pages 374 -- 378 , 2001 . A. Anjewierden. AIDAS: Incremental logical structure discovery in PDF documents. In Proc. of ICDAR, pages 374--378, 2001."},{"key":"e_1_2_1_3_1","first-page":"337","volume-title":"Proc. of SIGMOD","author":"Arasu A.","year":"2003","unstructured":"A. Arasu and H. Garcia-Molina . Extracting structured data from web pages . In Proc. of SIGMOD , pages 337 -- 348 , 2003 . 10.1145\/872757.872799 A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proc. of SIGMOD, pages 337--348, 2003. 10.1145\/872757.872799"},{"key":"e_1_2_1_4_1","first-page":"213","volume-title":"Proc. of CHI","author":"Buyukkokten O.","year":"2001","unstructured":"O. Buyukkokten , H. Garcia-Molina , and A. Paepcke . Accordion summarization for end-game browsing on PDAs and cellular phones . In Proc. of CHI , pages 213 -- 220 , 2001 . 10.1145\/365024.365102 O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Accordion summarization for end-game browsing on PDAs and cellular phones. In Proc. of CHI, pages 213--220, 2001. 10.1145\/365024.365102"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1145\/1008992.1009070","volume-title":"Proc. of SIGIR","author":"Cai D.","year":"2004","unstructured":"D. Cai , S. Yu , J.-R. Wen , and W.-Y. Ma . Block-based web search . In Proc. of SIGIR , pages 456 -- 463 , 2004 . 10.1145\/1008992.1009070 D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. Block-based web search. In Proc. of SIGIR, pages 456--463, 2004. 10.1145\/1008992.1009070"},{"key":"e_1_2_1_7_1","first-page":"377","volume-title":"Proc. of WWW Conf.","author":"Chakrabarti D.","year":"2008","unstructured":"D. Chakrabarti , R. Kumar , and K. Punera . A graph-theoretic approach to webpage segmentation . In Proc. of WWW Conf. , pages 377 -- 386 , 2008 . 10.1145\/1367497.1367549 D. Chakrabarti, R. Kumar, and K. Punera. A graph-theoretic approach to webpage segmentation. In Proc. of WWW Conf., pages 377--386, 2008. 10.1145\/1367497.1367549"},{"key":"e_1_2_1_8_1","series-title":"LNCS","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/978-3-540-28640-0_20","volume-title":"Document Analysis Systems VI","author":"Chao H.","year":"2004","unstructured":"H. Chao and J. Fan . Layout and content extraction for PDF documents . In Document Analysis Systems VI , volume 3163 of LNCS , pages 213 -- 224 . Springer-Verlag , 2004 . H. Chao and J. Fan. Layout and content extraction for PDF documents. In Document Analysis Systems VI, volume 3163 of LNCS, pages 213--224. Springer-Verlag, 2004."},{"key":"e_1_2_1_9_1","first-page":"541","volume-title":"Proc. of SIGMOD","author":"Cortez E.","year":"2011","unstructured":"E. Cortez , D. Oliveira , A. S. da Silva , E. S. de Moura , and A. H. Laender . Joint unsupervised structure discovery and information extraction . In Proc. of SIGMOD , pages 541 -- 552 , 2011 . 10.1145\/1989323.1989380 E. Cortez, D. Oliveira, A. S. da Silva, E. S. de Moura, and A. H. Laender. Joint unsupervised structure discovery and information extraction. In Proc. of SIGMOD, pages 541--552, 2011. 10.1145\/1989323.1989380"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.138"},{"key":"e_1_2_1_11_1","first-page":"305","volume-title":"Proc. of SITIS 2006","volume":"4879","author":"El-Shayeb M. A.","year":"2009","unstructured":"M. A. El-Shayeb , S. R. El-Beltagy , and A. A. Rafea . Extracting the latent hierarchical structure of web documents . In Proc. of SITIS 2006 , volume 4879 of LNCS, pages 305 -- 313 . Springer-Verlag , 2009 . 10.1007\/978-3-642-01350-8_28 M. A. El-Shayeb, S. R. El-Beltagy, and A. A. Rafea. Extracting the latent hierarchical structure of web documents. In Proc. of SITIS 2006, volume 4879 of LNCS, pages 305--313. Springer-Verlag, 2009. 10.1007\/978-3-642-01350-8_28"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","DOI":"10.1002\/0471445428","volume-title":"Statistical Methods for Rates and Proportions","author":"Fleiss J.","year":"2003","unstructured":"J. Fleiss , B. Levin , and M. Paik . Statistical Methods for Rates and Proportions . Wiley , John and Sons, Inc., 3 rd ed., 2003 . J. Fleiss, B. Levin, and M. Paik. Statistical Methods for Rates and Proportions. Wiley, John and Sons, Inc., 3rd ed., 2003.","edition":"3"},{"key":"e_1_2_1_13_1","first-page":"11","volume-title":"Proc. of JCDL","author":"Gao L.","year":"2011","unstructured":"L. Gao , Z. Tang , X. Lin , Y. Liu , R. Qiu , and Y. Wang . Structure extraction from PDF-based book documents . In Proc. of JCDL , pages 11 -- 20 , 2011 . 10.1145\/1998076.1998079 L. Gao, Z. Tang, X. Lin, Y. Liu, R. Qiu, and Y. Wang. Structure extraction from PDF-based book documents. In Proc. of JCDL, pages 11--20, 2011. 10.1145\/1998076.1998079"},{"key":"e_1_2_1_14_1","first-page":"911","volume-title":"Proc. of ICDAR","author":"Gao L.","year":"2009","unstructured":"L. Gao , Z. Tang , X. Lin , X. Tao , and Y. Chu . Analysis of book documents' table of content based on clustering . In Proc. of ICDAR , pages 911 -- 915 , 2009 . 10.1109\/ICDAR.2009.143 L. Gao, Z. Tang, X. Lin, X. Tao, and Y. Chu. Analysis of book documents' table of content based on clustering. In Proc. of ICDAR, pages 911--915, 2009. 10.1109\/ICDAR.2009.143"},{"key":"e_1_2_1_15_1","first-page":"529","volume-title":"Proc. of VLDB","author":"Graupmann J.","year":"2005","unstructured":"J. Graupmann , R. Schenkel , and G. Weikum . The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents . In Proc. of VLDB , pages 529 -- 540 , 2005 . J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. In Proc. of VLDB, pages 529--540, 2005."},{"key":"e_1_2_1_16_1","first-page":"207","volume-title":"Proc. of WWW Conf.","author":"Gupta S.","year":"2003","unstructured":"S. Gupta , G. Kaiser , D. Neistadt , and P. Grimm . DOM-based content extraction of HTML documents . In Proc. of WWW Conf. , pages 207 -- 214 , 2003 . 10.1145\/775152.775182 S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm. DOM-based content extraction of HTML documents. In Proc. of WWW Conf., pages 207--214, 2003. 10.1145\/775152.775182"},{"key":"e_1_2_1_17_1","first-page":"361","volume-title":"Proc. of WWW Conf.","author":"Hattori G.","year":"2007","unstructured":"G. Hattori , K. Hoashi , K. Matsumoto , and F. Sugaya . Robust web page segmentation for mobile terminal using content-distances and page layout information . In Proc. of WWW Conf. , pages 361 -- 370 , 2007 . 10.1145\/1242572.1242622 G. Hattori, K. Hoashi, K. Matsumoto, and F. Sugaya. Robust web page segmentation for mobile terminal using content-distances and page layout information. In Proc. of WWW Conf., pages 361--370, 2007. 10.1145\/1242572.1242622"},{"key":"e_1_2_1_18_1","first-page":"290","volume-title":"Proc. of WI","author":"Keller M.","year":"2013","unstructured":"M. Keller and H. Hartenstein . GRABEX: A graph-based method for web site block classification and its application on mining breadcrumb trails . In Proc. of WI , pages 290 -- 297 , 2013 . 10.1109\/WI-IAT.2013.42 M. Keller and H. Hartenstein. GRABEX: A graph-based method for web site block classification and its application on mining breadcrumb trails. In Proc. of WI, pages 290--297, 2013. 10.1109\/WI-IAT.2013.42"},{"key":"e_1_2_1_19_1","first-page":"1025","volume-title":"Proc. of WWW Conf.","author":"Keller M.","year":"2012","unstructured":"M. Keller and M. Nussbaumer . MenuMiner: Revealing the information architecture of large web sites by analyzing maximal cliques . In Proc. of WWW Conf. , pages 1025 -- 1034 , 2012 . 10.1145\/2187980.2188237 M. Keller and M. Nussbaumer. MenuMiner: Revealing the information architecture of large web sites by analyzing maximal cliques. In Proc. of WWW Conf., pages 1025--1034, 2012. 10.1145\/2187980.2188237"},{"key":"e_1_2_1_20_1","first-page":"1173","volume-title":"Proc. of CIKM","author":"Kohlsch\u00fctter C.","year":"2008","unstructured":"C. Kohlsch\u00fctter and W. Nejdl . A densitometric approach to web page segmentation . In Proc. of CIKM , pages 1173 -- 1182 , 2008 . 10.1145\/1458082.1458237 C. Kohlsch\u00fctter and W. Nejdl. A densitometric approach to web page segmentation. In Proc. of CIKM, pages 1173--1182, 2008. 10.1145\/1458082.1458237"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","first-page":"159","DOI":"10.2307\/2529310","article-title":"The measurement of observer agreement for categorical data","volume":"33","author":"Landis J. R.","year":"1977","unstructured":"J. R. Landis and G. G. Koch . The measurement of observer agreement for categorical data . Biometrics , 33 : 159 -- 174 , 1977 . J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33: 159--174, 1977.","journal-title":"Biometrics"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","first-page":"588","DOI":"10.1145\/775047.775134","volume-title":"Proc. of KDD","author":"Lin S.-H.","year":"2002","unstructured":"S.-H. Lin and J.-M. Ho . Discovering informative content blocks from web documents . In Proc. of KDD , pages 588 -- 593 , 2002 . 10.1145\/775047.775134 S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. In Proc. of KDD, pages 588--593, 2002. 10.1145\/775047.775134"},{"key":"e_1_2_1_23_1","first-page":"981","volume-title":"Proc. of WWW Conf.","author":"Miao G.","year":"2009","unstructured":"G. Miao , J. Tatemura , W.-P. Hsiung , A. Sawires , and L. E. Moser . Extracting data records from the web using tag path clustering . In Proc. of WWW Conf. , pages 981 -- 990 , 2009 . 10.1145\/1526709.1526841 G. Miao, J. Tatemura, W.-P. Hsiung, A. Sawires, and L. E. Moser. Extracting data records from the web using tag path clustering. In Proc. of WWW Conf., pages 981--990, 2009. 10.1145\/1526709.1526841"},{"key":"e_1_2_1_24_1","first-page":"2117","volume-title":"Proc. of SICE Annual Conf.","author":"Okada H.","year":"2011","unstructured":"H. Okada and H. Arakawa . Automated extraction of non &lang;h&rang;-tagged headers in webpages by decision trees . In Proc. of SICE Annual Conf. , pages 2117 -- 2120 , 2011 . H. Okada and H. Arakawa. Automated extraction of non &lang;h&rang;-tagged headers in webpages by decision trees. In Proc. of SICE Annual Conf., pages 2117--2120, 2011."},{"key":"e_1_2_1_25_1","first-page":"447","volume-title":"Proc. of ICAART","author":"Pembe F. C.","year":"2010","unstructured":"F. C. Pembe and T. G\u00fcng\u00f6r . A tree learning approach to web document sectional hierarchy extraction . In Proc. of ICAART , pages 447 -- 450 , 2010 . F. C. Pembe and T. G\u00fcng\u00f6r. A tree learning approach to web document sectional hierarchy extraction. In Proc. of ICAART, pages 447--450, 2010."},{"issue":"10","key":"e_1_2_1_26_1","first-page":"84","article-title":"A web page segmentation method based on page layouts and title blocks","volume":"11","author":"Sano H.","year":"2011","unstructured":"H. Sano , S. Shiramatsu , T. Ozono , and T. Shintani . A web page segmentation method based on page layouts and title blocks . International Journal of Computer Science and Network Security , 11 ( 10 ): 84 -- 90 , 2011 . H. Sano, S. Shiramatsu, T. Ozono, and T. Shintani. A web page segmentation method based on page layouts and title blocks. International Journal of Computer Science and Network Security, 11(10):84--90, 2011.","journal-title":"International Journal of Computer Science and Network Security"},{"key":"e_1_2_1_27_1","first-page":"381","volume-title":"Proc. of CIKM","author":"Simon K.","year":"2005","unstructured":"K. Simon and G. Lausen . ViPER: Augmenting automatic information extraction with visual perceptions . In Proc. of CIKM , pages 381 -- 388 , 2005 . 10.1145\/1099554.1099672 K. Simon and G. Lausen. ViPER: Augmenting automatic information extraction with visual perceptions. In Proc. of CIKM, pages 381--388, 2005. 10.1145\/1099554.1099672"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1046456.1046459"},{"key":"e_1_2_1_29_1","first-page":"956","volume-title":"Proc. of WWW Conf.","author":"Tatsumi Y.","year":"2005","unstructured":"Y. Tatsumi and T. Asahi . Analyzing web page headings considering various presentation . In Proc. of WWW Conf. , pages 956 -- 957 , 2005 . 10.1145\/1062745.1062816 Y. Tatsumi and T. Asahi. Analyzing web page headings considering various presentation. In Proc. of WWW Conf., pages 956--957, 2005. 10.1145\/1062745.1062816"},{"key":"e_1_2_1_30_1","first-page":"76","volume-title":"Proc. of WWW Conf.","author":"Zhai Y.","year":"2005","unstructured":"Y. Zhai and B. Liu . Web data extraction based on partial tree alignment . In Proc. of WWW Conf. , pages 76 -- 85 , 2005 . 10.1145\/1060745.1060761 Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proc. of WWW Conf., pages 76--85, 2005. 10.1145\/1060745.1060761"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2824032.2824058","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,14]],"date-time":"2023-08-14T05:49:57Z","timestamp":1691992197000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2824032.2824058"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,8]]},"references-count":29,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2015,8]]}},"alternative-id":["10.14778\/2824032.2824058"],"URL":"https:\/\/doi.org\/10.14778\/2824032.2824058","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2015,8]]}}}