{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:43:18Z","timestamp":1760233398672,"version":"build-2065373602"},"reference-count":54,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2021,1,14]],"date-time":"2021-01-14T00:00:00Z","timestamp":1610582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["17H00762"],"award-info":[{"award-number":["17H00762"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>The subpath kernel is a class of positive definite kernels defined over trees, which has the following advantages for the purposes of classification, regression and clustering: it can be incorporated into a variety of powerful kernel machines including SVM; It is invariant whether input trees are ordered or unordered; It can be computed by significantly fast linear-time algorithms; And, finally, its excellent learning performance has been proven through intensive experiments in the literature. In this paper, we leverage recent advances in tree kernels to solve real problems. As an example, we apply our method to the problem of detecting fake e-commerce sites. Although the problem is similar to phishing site detection, the fact that mimicking existing authentic sites is harmful for fake e-commerce sites marks a clear difference between these two problems. 
We focus on fake e-commerce site detection for three reasons: e-commerce fraud is a real problem that companies and law enforcement have been cooperating to solve; Inefficiency hampers existing approaches because datasets tend to be large, while subpath kernel learning overcomes these performance challenges; And we offer increased resiliency against attempts to subvert existing detection methods through incorporating robust features that adversaries cannot change: the DOM-trees of web-sites. Our real-world results are remarkable: our method has exhibited accuracy as high as 0.998 when training SVM with 1000 instances and evaluating accuracy for almost 7000 independent instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reached 0.996.<\/jats:p>","DOI":"10.3390\/make3010006","type":"journal-article","created":{"date-parts":[[2021,1,15]],"date-time":"2021-01-15T01:33:29Z","timestamp":1610674409000},"page":"95-122","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Learning DOM Trees of Web Pages by Subpath Kernel and Detecting Fake e-Commerce Sites"],"prefix":"10.3390","volume":"3","author":[{"given":"Kilho","family":"Shin","sequence":"first","affiliation":[{"name":"Computer Centre, Gakushuin University, Tokyo 1718588, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Taichi","family":"Ishikawa","sequence":"additional","affiliation":[{"name":"Information Networking Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0496-8962","authenticated-orcid":false,"given":"Yu-Lu","family":"Liu","sequence":"additional","affiliation":[{"name":"Cyber Security Defense Department, Rakuten, Inc., Tokyo 1580094, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David 
Lawrence","family":"Shepard","sequence":"additional","affiliation":[{"name":"Data Engineering, Evidation Health, Inc., San Mateo, CA 94402, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,1,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/322139.322143","article-title":"The tree-to-tree correction problem","volume":"26","year":"1979","journal-title":"J. ACM"},{"key":"ref_2","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Sov. Phys. Dokl."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001], MIT Press.","DOI":"10.7551\/mitpress\/1120.003.0085"},{"key":"ref_4","unstructured":"Kimura, D., and Kashima, H. (2012). Fast Computation of Subpath Kernel for Trees. arXiv."},{"key":"ref_5","unstructured":"Shin, K., and Ishikawa, T. (2018, January 2\u20134). Linear-time algorithms for the subpath kernel. Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018), Qingdao, China."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Corona, I., Biggio, B., Contini, M., Piras, L., Corda, R., Mereu, M., Mureddu, G., Ariu, D., and Roli, F. (2017). DeltaPhish: Detecting Phishing Webpages in Compromised Websites. arXiv, Available online: https:\/\/arxiv.org\/abs\/1707.00317.","DOI":"10.1007\/978-3-319-66402-6_22"},{"key":"ref_7","unstructured":"Zhang, Y., Egelman, S., Cranor, L., and Hong, J. (March, January 28). Phinding Phish: Evaluating anti-phishing tools. 
Proceedings of the 14th Annual Network and Distributed System Security Symposium, San Diego, CA, USA."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1007\/s11416-007-0050-4","article-title":"Usability Evaluation of Anti-Phishing Toolbars","volume":"3","author":"Li","year":"2007","journal-title":"J. Comput. Virol."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1109\/MC.2009.306","article-title":"A comparison of tools for detecting fake websites","volume":"42","author":"Abbasi","year":"2009","journal-title":"Computer"},{"key":"ref_10","unstructured":"Marchal, S., and Asokan, N. (2018, January 11\u201313). On Designing and Evaluating Phishing Webpage Detection Techniques for the Real World. Proceedings of the 11th USENIX Workshop on Cyber Security Experimentation and Test (CSET 18), Baltimore, MD, USA."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1007\/978-3-319-66402-6_22","article-title":"DeltaPhish: Detecting Phishing Webpages in Compromised Websites","volume":"10492","author":"Corona","year":"2017","journal-title":"Lect. Notes Comput. Sci."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1109\/MIC.2006.23","article-title":"An antiphishing strategy based on visual similarity assessment","volume":"10","author":"Liu","year":"2006","journal-title":"IEEE Internet Comput."},{"key":"ref_13","first-page":"1","article-title":"Phishing Websites Detection Based on Web Source Code and URL in the Webpage","volume":"1","author":"Satish","year":"2013","journal-title":"Int. J. Comput. Sci. Eng. Commun."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Marchal, S., Saari, K., Singh, N., and Asokan, N. (2016, January 27\u201330). Know Your Phish: Novel Techniques for Detecting Phishing Sites and Their Targets. 
Proceedings of the 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS), Nara, Japan.","DOI":"10.1109\/ICDCS.2016.10"},{"key":"ref_15","unstructured":"Whittaker, C., Ryner, B., and Nazif, M. (March, January 28). Large-Scale Automatic Classification of Phishing Pages. Proceedings of the NDSS \u201910, San Diego, CA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Hong, J.I., and Cranor, L.F. (2007, January 8\u201312). Cantina: A Content-based Approach to Detecting Phishing Web Sites. Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada.","DOI":"10.1145\/1242572.1242659"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"458","DOI":"10.1109\/TNSM.2014.2377295","article-title":"PhishStorm: Detecting Phishing With Streaming Analytics","volume":"11","author":"Marchal","year":"2014","journal-title":"IEEE Trans. Netw. Serv. Manag."},{"key":"ref_18","unstructured":"Gerbet, T., Kumar, A., and Lauradoux, C. (2014). (Un)Safe Browsing, INRIA. Technical Report RR-8594."},{"key":"ref_19","first-page":"1145","article-title":"A Survey of Phishing Website Detection Systems","volume":"7","author":"Raut","year":"2020","journal-title":"Int. Res. J. Eng. Technol."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Vazhayil, A., Vinayakumar, R., and Soman, K.P. (2018, January 10\u201312). Comparative Study of the Detection of Malicious URLs Using Shallow and Deep Networks. 
Proceedings of the 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Bengaluru, India.","DOI":"10.1109\/ICCCNT.2018.8494159"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"15196","DOI":"10.1109\/ACCESS.2019.2892066","article-title":"Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning","volume":"7","author":"Yang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Shima, K., Miyamoto, D., Abe, H., Ishihara, T., Okada, K., Sekiya, Y., Asai, H., and Doi, Y. (2018, January 19\u201322). Classification of URL bitstreams using bag of bytes. Proceedings of the 21st Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN), Paris, France.","DOI":"10.1109\/ICIN.2018.8401597"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"S\u00f6nmez, Y., Tuncer, T., G\u00f6kal, H., and Avc\u0131, E. (2018, January 22\u201325). Phishing web sites features classification based on extreme learning machine. Proceedings of the 2018 6th International Symposium on Digital Forensic and Security (ISDFS), Antalya, Turkey.","DOI":"10.1109\/ISDFS.2018.8355342"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Machado, L., and Gadge, J. (2017, January 17\u201318). Phishing Sites Detection Based on C4.5 Decision Tree Algorithm. 
Proceedings of the 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), Maharashtra, India.","DOI":"10.1109\/ICCUBEA.2017.8463818"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"435","DOI":"10.2307\/25750686","article-title":"Detecting Fake Websites: The Contribution of Statistical Learning Theory","volume":"34","author":"Abbasi","year":"2010","journal-title":"MIS Q."},{"key":"ref_26","first-page":"2","article-title":"Fake-Website Detection Tools: Identifying Elements that Promote Individuals\u2019 Use and Enhance Their Performance","volume":"16","author":"Zahedi","year":"2015","journal-title":"J. Assoc. Inf. Syst."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Shin, K., and Niiyama, T. (2018, January 16\u201318). The mapping distance\u2014A generalization of the edit distance\u2014And its application to trees. Proceedings of the 10th International Conference on Agent and Artificial Intelligence, ICAART 2018, Madeira, Portugal.","DOI":"10.5220\/0006721902660275"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Berg, C., Christensen, J.P.R., and Ressel, R. (1984). Harmonic Analysis on Semigroups. Theory of Positive Definite and Related Functions, Springer.","DOI":"10.1007\/978-1-4612-1128-0"},{"key":"ref_29","unstructured":"Haussler, D. (1999). Convolution Kernels on Discrete Structures, Dept. of Computer Science, University of California at Santa Cruz. UCSC-CRL 99-10."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Shin, K., and Kuboyama, T. (2008, January 5\u20139). A generalization of Haussler\u2019s convolution kernel\u2014Mapping kernel. Proceedings of the ICML 2008, Helsinki, Finland.","DOI":"10.1145\/1390156.1390275"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Shin, K., and Kuboyama, T. (2014). A Comprehensive Study of Tree Kernels. JSAI-isAI Post-Workshop Proceedings, Springer. 
Lecture Notes in Artificial Intelligence 8417.","DOI":"10.1007\/978-3-319-10061-6_22"},{"key":"ref_32","unstructured":"Kashima, H., and Koyanagi, T. (2002, January 8\u201312). Kernels for Semi-Structured Data. Proceedings of the 9th International Conference on Machine Learning (ICML 2002), Sydney, Australia."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Shin, K. (2015). A Theory of Subtree Matching and Tree Kernels based on the Edit Distance Concept. Ann. Math. Artif. Intell.","DOI":"10.1007\/s10472-015-9467-5"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1093\/biomet\/75.2.383","article-title":"A stagewise rejective multiple test procedure based on a modified Bonferroni test","volume":"75","author":"Hommel","year":"1988","journal-title":"Biometrika"},{"key":"ref_35","first-page":"1","article-title":"Statistical comparisons of classifiers over multiple data sets","volume":"7","year":"2006","journal-title":"J. Mach. Learn. Res."},{"key":"ref_36","unstructured":"Chang, C.C., and Lin, C.J. (2021, January 12). LIBSVM: A Library for Support Vector Machines. Available online: https:\/\/www.csie.ntu.edu.tw\/~cjlin\/libsvm\/."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1016\/j.procs.2015.06.017","article-title":"PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach","volume":"54","author":"Rao","year":"2015","journal-title":"Procedia Comput. Sci."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Tyagi, I., Shad, J., Sharma, S., Gaur, S., and Kaur, G. (2018, January 22\u201323). A Novel Machine Learning Approach to Detect Phishing Websites. 
Proceedings of the 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India.","DOI":"10.1109\/SPIN.2018.8474040"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1016\/0304-3975(95)80029-9","article-title":"Alignment of trees\u2014An alternative to tree edit","volume":"143","author":"Jiang","year":"1995","journal-title":"Theor. Comput. Sci."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1245","DOI":"10.1137\/0218082","article-title":"Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems","volume":"18","author":"Zhang","year":"1989","journal-title":"SICOMP"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1142\/S0129054196000051","article-title":"On the editing distance between undirected acyclic graphs","volume":"7","author":"Zhang","year":"1996","journal-title":"Int. J. Found. Comput. Sci."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1016\/0020-0190(92)90136-J","article-title":"On the editing distance between unordered labeled trees","volume":"42","author":"Zhang","year":"1996","journal-title":"Inf. Process. 
Lett."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1007\/BF01975866","article-title":"A Constrained Edit Distance Between Unordered Labeled Trees","volume":"15","author":"Zhang","year":"1996","journal-title":"Algorithmica"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1007\/3-540-44679-6_37","article-title":"A New Measure of Edit Distance between Labeled Trees","volume":"Volume 2108","author":"Lu","year":"2001","journal-title":"Lecture Notes in Computer Science"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1016\/S0031-3203(99)00199-5","article-title":"Finding similar consensus between trees: An algorithm and a distance hierarchy","volume":"34","author":"Wang","year":"2001","journal-title":"Pattern Recognit."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Kuboyama, T., Shin, K., Miyahara, T., and Yasuda, H. (2005, January 12\u201314). A theoretical analysis of alignment and edit problems for trees. Proceedings of the Theoretical Computer Science, The 9th Italian Conference, Siena, Italy.","DOI":"10.1007\/11560586_26"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Neuhaus, M., and Bunke, H. (2007). Bridging the Gap between Graph Edit Distance and Kernel Machines, World Scientific.","DOI":"10.1142\/9789812770202"},{"key":"ref_48","first-page":"91","article-title":"Computing the edit-distance between unrooted ordered trees","volume":"1461","author":"Klein","year":"1998","journal-title":"LNCS"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Dulucq, S., and Touzet, H. (2003, January 25\u201327). Analysis of tree edit distance algorithms. 
Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM), Michoacan, Mexico.","DOI":"10.1007\/3-540-44888-8_7"},{"key":"ref_50","first-page":"2","article-title":"An Optimal Decomposition Algorithm for Tree Edit Distance","volume":"6","author":"Demaine","year":"2006","journal-title":"ACM Trans. Algo."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"334","DOI":"10.14778\/2095686.2095692","article-title":"RTED: A Robust Algorithm for the Tree Edit Distance","volume":"5","author":"Pawlik","year":"2011","journal-title":"VLDB Endow."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1016\/0031-3203(94)00109-Y","article-title":"Algorithms for the constrained editing distance between ordered labeled trees and related problems","volume":"28","author":"Zhang","year":"1995","journal-title":"Pattern Recognit."},{"key":"ref_53","unstructured":"Richter, T. (1997). A New Measure of the Distance between Ordered Trees and Its Applications, Dept. of Computer Science, Univ. of Bonn. Technical Report 85166-CS."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1016\/j.tcs.2004.12.030","article-title":"A survey on tree edit distance and related problems","volume":"337","author":"Bille","year":"2005","journal-title":"Theor. Comput. 
Sci."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/3\/1\/6\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:11:07Z","timestamp":1760159467000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/3\/1\/6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,14]]},"references-count":54,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2021,3]]}},"alternative-id":["make3010006"],"URL":"https:\/\/doi.org\/10.3390\/make3010006","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2021,1,14]]}}}