{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T16:07:35Z","timestamp":1761581255748,"version":"build-2065373602"},"reference-count":41,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2014,12,1]],"date-time":"2014-12-01T00:00:00Z","timestamp":1417392000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to get and require tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models to identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach using both neural networks and decision trees to deal with the search interface identification problem. We show that the proposed model outperforms previous methods using only labeled data. 
We also show that adding unlabeled data improves the effectiveness of the proposed model.<\/jats:p>","DOI":"10.3390\/info5040634","type":"journal-article","created":{"date-parts":[[2014,12,2]],"date-time":"2014-12-02T02:07:08Z","timestamp":1417486028000},"page":"634-651","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach"],"prefix":"10.3390","volume":"5","author":[{"given":"Hong","family":"Wang","sequence":"first","affiliation":[{"name":"School of Mathematics & Statistics, Central South University, Changsha 410075, China"}]},{"given":"Qingsong","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Mathematics & Statistics, Central South University, Changsha 410075, China"}]},{"given":"Lifeng","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Mathematics & Statistics, Central South University, Changsha 410075, China"}]}],"member":"1968","published-online":{"date-parts":[[2014,12,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Bergman, M.K. (2001). White Paper: The deep web: Surfacing hidden value. J. Electron. Publ., 7.","DOI":"10.3998\/3336451.0007.104"},{"key":"ref_2","unstructured":"Cope, J., Craswell, N., and Hawking, D. (2003, January 4\u20137). Automated Discovery of Search Interfaces on the Web. Proceedings of the 14th Australasian Database Conference (ADC2003), Adelaide, Australia."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1241","DOI":"10.14778\/1454159.1454163","article-title":"Google\u2019s Deep Web crawl","volume":"1","author":"Madhavan","year":"2008","journal-title":"Proc. 
VLDB Endow."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1145\/1860702.1860708","article-title":"Understanding deep web search interfaces: A survey","volume":"39","author":"Khare","year":"2010","journal-title":"ACM SIGMOD Rec."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1016\/j.datak.2006.01.009","article-title":"Sampling, information extraction and summarisation of hidden web databases","volume":"59","author":"Hedley","year":"2006","journal-title":"Data Knowl. Eng."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Noor, U., Rashid, Z., and Rauf, A. (2011, January 5\u20137). TODWEB: Training-Less ontology based deep web source classification. Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services, Bali, Indonesia.","DOI":"10.1145\/2095536.2095569"},{"key":"ref_7","unstructured":"Balakrishnan, R., and Kambhampati, S. (April, January 28). Factal: Integrating deep web based on trust and relevance. Proceedings of the 20th international conference companion on World wide web, Hyderabad, India."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1016\/j.datak.2003.10.003","article-title":"Automatic generation of agents for collecting hidden web pages for data extraction","volume":"49","author":"Golgher","year":"2004","journal-title":"Data Knowl. Eng."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1145\/1031570.1031584","article-title":"Structured databases on the web: Observations and implications","volume":"33","author":"Chang","year":"2004","journal-title":"ACM SIGMOD Rec."},{"key":"ref_10","first-page":"387","article-title":"Feature weighting random forest for detection of hidden web search interfaces","volume":"13","author":"Ye","year":"2009","journal-title":"Comput. Linguist. Chin. Lang. Process."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Barbosa, L., and Freire, J. 
(2007, January 8\u201312). Combining classifiers to identify online databases. Proceedings of the 16th international conference on World Wide Web, Banff, AB, Canada.","DOI":"10.1145\/1242572.1242631"},{"key":"ref_12","unstructured":"Barbosa, L., and Freire, J. (2005, January 16\u201317). Searching for hidden-web databases. Proceedings of the Eighth International Workshop on the Web and Databases (WebDB 2005), Baltimore, MD, USA."},{"key":"ref_13","unstructured":"Shestakov, D. (2009, January 28). On building a search interface discovery system. Proceedings of the 2nd International Conference on Resource Discovery, Lyon, France."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"183","DOI":"10.4028\/www.scientific.net\/KEM.439-440.183","article-title":"Semi-Supervised Classification with Co-Training for Deep Web","volume":"439","author":"Fang","year":"2010","journal-title":"Key Eng. Mater."},{"key":"ref_15","unstructured":"Chang, K.C.C., He, B., and Zhang, Z. (2004, January 29). Metaquerier over the deep web: Shallow integration across holistic sources. Proceedings of the 2004 VLDB Workshop on Information Integration on the Web, Toronto, Canada."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"779","DOI":"10.2298\/CSIS100322028W","article-title":"Research on discovering deep web entries","volume":"8","author":"Wang","year":"2011","journal-title":"Comput. Sci. Inf. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1007\/s10844-012-0217-4","article-title":"Automatic discovery of Web Query Interfaces using machine learning techniques","volume":"40","year":"2013","journal-title":"J. Intell. Inf. Syst."},{"key":"ref_18","unstructured":"Bergholz, A., and Childlovskii, B. (2003, January 10\u201312). Crawling for domain-specific hidden Web resources. Proceedings of the Fourth International Conference on Web Information Systems Engineering, WISE 2003, Roma, Italy."},{"key":"ref_19","unstructured":"Lin, L., and Zhou, L. 
(2009, January 28). Web database schema identification through simple query interface. Proceedings of the 2nd International Conference on Resource Discovery, Lyon, France."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Chapelle, O., Sch\u00f6lkopf, B., and Zien, A. (2006). Semi-Supervised Learning, MIT press.","DOI":"10.7551\/mitpress\/9780262033589.001.0001"},{"key":"ref_21","unstructured":"Zhu, X. Semi-Supervised Learning Literature Survey. Available online: http:\/\/pages.cs.wisc.edu\/jerryzhu\/research\/ssl\/semireview.html."},{"key":"ref_22","first-page":"674","article-title":"Semi-supervised multiple classifier systems: Background and research directions","volume":"3541","author":"Roli","year":"2005","journal-title":"Mult. Classif. Syst."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1007\/s10115-009-0209-z","article-title":"Semi-supervised learning by disagreement","volume":"24","author":"Zhou","year":"2010","journal-title":"Knowl. Inf. Syst."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1109\/MCAS.2006.1688199","article-title":"Ensemble based systems in decision making","volume":"6","author":"Polikar","year":"2006","journal-title":"IEEE Circuits Syst. Mag."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1007\/978-3-642-02326-2_53","article-title":"When semi-supervised learning meets ensemble learning","volume":"5519","author":"Zhou","year":"2009","journal-title":"Mult. Classif. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1007\/BF00058655","article-title":"Bagging predictors","volume":"24","author":"Breiman","year":"1996","journal-title":"Mach. Learn."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1006\/jcss.1997.1504","article-title":"A decision-theoretic generalization of on-line learning and an application to boosting","volume":"55","author":"Freund","year":"1997","journal-title":"J. 
Comput. Syst. Sci."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"832","DOI":"10.1109\/34.709601","article-title":"The random subspace method for constructing decision forests","volume":"20","author":"Ho","year":"1998","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_29","unstructured":"d\u2019Alch\u00e8, F., Grandvalet, Y., and Ambroise, C. (2002). Advances in Neural Information Processing Systems 14, MIT Press."},{"key":"ref_30","first-page":"148","article-title":"Experiments with a new boosting algorithm","volume":"96","author":"Freund","year":"1996","journal-title":"ICML"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Bennett, K., Demiriz, A., and Maclin, R. (2002, January 23\u201325). Exploiting unlabeled data in ensemble methods. Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, AB, Canada.","DOI":"10.1145\/775047.775090"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"2000","DOI":"10.1109\/TPAMI.2008.235","article-title":"SemiBoost: Boosting for Semi-Supervised Learning","volume":"31","author":"Mallapragada","year":"2009","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1088","DOI":"10.1109\/TSMCA.2007.904745","article-title":"Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples","volume":"37","author":"Li","year":"2007","journal-title":"IEEE Trans. Syst. Man Cybern."},{"key":"ref_34","unstructured":"Melville, P., and Mooney, R.J. (2003, January 9\u201315). Constructing diverse classifier ensembles using artificial training examples. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. 
Learn."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1016\/S0167-9473(01)00065-2","article-title":"Stochastic gradient boosting","volume":"38","author":"Friedman","year":"2002","journal-title":"Comput. Stat. Data Anal."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1399","DOI":"10.1016\/S0893-6080(99)00073-8","article-title":"Ensemble learning via negative correlation","volume":"12","author":"Liu","year":"1999","journal-title":"Neural Netw."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Blum, A., and Mitchell, T. (1998, January 24\u201326). Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA.","DOI":"10.1145\/279943.279962"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"861","DOI":"10.1016\/j.patrec.2005.10.010","article-title":"An introduction to ROC analysis","volume":"27","author":"Fawcett","year":"2006","journal-title":"Pattern Recogn. Lett."},{"key":"ref_40","first-page":"1","article-title":"Statistical comparisons of classifiers over multiple data sets","volume":"7","year":"2006","journal-title":"J. Mach. Learn. Res."},{"key":"ref_41","unstructured":"R Core Team (2012). 
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/5\/4\/634\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T21:10:16Z","timestamp":1760217016000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/5\/4\/634"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,12,1]]},"references-count":41,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2014,12]]}},"alternative-id":["info5040634"],"URL":"https:\/\/doi.org\/10.3390\/info5040634","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2014,12,1]]}}}