{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T22:16:28Z","timestamp":1740176188389,"version":"3.37.3"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2020,5,18]],"date-time":"2020-05-18T00:00:00Z","timestamp":1589760000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,18]],"date-time":"2020-05-18T00:00:00Z","timestamp":1589760000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Sci. Eng."],"published-print":{"date-parts":[[2020,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>String kernels are attractive data analysis tools for analyzing string data.\nAmong them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computation complexity in the number of input strings, which limits large-scale applications in practice. We address this need by presenting the first approximation for string alignment kernels, which we call <jats:italic>space-efficient feature maps for edit distance with moves (SFMEDM)<\/jats:italic>, by leveraging a metric embedding named <jats:italic>edit-sensitive parsing<\/jats:italic> and <jats:italic>feature maps (FMs)<\/jats:italic> of <jats:italic>random Fourier features (RFFs)<\/jats:italic> for large-scale string analyses.\nThe original FMs for RFFs consume a huge amount of memory proportional to the dimension <jats:italic>d<\/jats:italic> of input vectors and the dimension <jats:italic>D<\/jats:italic> of output vectors, which prohibits its large-scale applications.\nWe present novel <jats:italic>space-efficient feature maps (SFMs)<\/jats:italic> of RFFs for a space reduction from <jats:italic>O<\/jats:italic>(<jats:italic>dD<\/jats:italic>) of the original FMs to <jats:italic>O<\/jats:italic>(<jats:italic>d<\/jats:italic>) of SFMs with a theoretical guarantee with respect to concentration bounds. We experimentally test SFMEDM on its ability to learn SVM for large-scale string classifications with various massive string data, and we demonstrate the superior performance of SFMEDM with respect to prediction accuracy, scalability and computation efficiency.<\/jats:p>","DOI":"10.1007\/s41019-020-00120-6","type":"journal-article","created":{"date-parts":[[2020,5,18]],"date-time":"2020-05-18T16:51:46Z","timestamp":1589820706000},"page":"168-179","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Space-Efficient Feature Maps for String Alignment Kernels"],"prefix":"10.1007","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2368-5607","authenticated-orcid":false,"given":"Yasuo","family":"Tabei","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yoshihiro","family":"Yamanishi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rasmus","family":"Pagh","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,5,18]]},"reference":[{"key":"120_CR1","doi-asserted-by":"crossref","unstructured":"Bahlmann C, Haasdonk B, Burkhardt H (2002) Online handwriting recognition with support vector machines-a kernel approach. In: Proceedings of the 8th international workshop on frontiers in handwriting recognition, pp 49\u201354","DOI":"10.1109\/IWFHR.2002.1030883"},{"key":"120_CR2","doi-asserted-by":"crossref","unstructured":"Chakraborty D, Goldenberg E, Kouck\u00fd M (2016) Streaming algorithms for embedding and computing edit distance in the low distance regime. In: Proceedings of the 48th annual ACM symposium on theory of computing, pp 715\u2013725","DOI":"10.1145\/2897518.2897577"},{"key":"120_CR3","doi-asserted-by":"publisher","first-page":"27:1","DOI":"10.1145\/1961189.1961199","volume":"2","author":"CC Chang","year":"2011","unstructured":"Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1\u201327:27","journal-title":"ACM Trans Intell Syst Technol"},{"key":"120_CR4","doi-asserted-by":"publisher","first-page":"2:1","DOI":"10.1145\/1186810.1186812","volume":"3","author":"G Cormode","year":"2007","unstructured":"Cormode G, Muthukrishnan S (2007) The string edit distance matching problem with moves. ACM Trans Algorithms 3:2:1\u20132:19","journal-title":"ACM Trans Algorithms"},{"key":"120_CR5","unstructured":"Cuturi M (2011) Fast global alignment kernels. In: Proceedings of the 28th international conference on machine learning, pp 929\u2013936"},{"key":"120_CR6","doi-asserted-by":"crossref","unstructured":"Cuturi M, Vert JP, Birkenes O, Matsui T (2007) A kernel for time series based on global alignments. In: Proceedings of the 2007 IEEE international conference on acoustics, speech and signal processing, pp 413\u2013416","DOI":"10.1109\/ICASSP.2007.366260"},{"key":"120_CR7","first-page":"1871","volume":"9","author":"RE Fan","year":"2008","unstructured":"Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871\u20131874","journal-title":"J Mach Learn Res"},{"key":"120_CR8","unstructured":"Farhan M, Tariq J, Zaman A, Shabbir M, Khan I (2017) Efficient approximation algorithms for strings kernel based sequence classification. In: Proceedings of the 30th international conference on neural information processing systems, pp 6938\u20136948"},{"key":"120_CR9","doi-asserted-by":"publisher","first-page":"783","DOI":"10.1137\/S1052623400374379","volume":"13","author":"MC Ferris","year":"2003","unstructured":"Ferris MC, Munson TS (2003) Interior-point methods for massive support vector machines. SIAM J Optim 13:783\u2013804","journal-title":"SIAM J Optim"},{"key":"120_CR10","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1145\/959242.959248","volume":"5","author":"T G\u00e4rtner","year":"2003","unstructured":"G\u00e4rtner T (2003) A survey of kernels for structured data. ACM SIGKDD Explor Newslett 5:49\u201358","journal-title":"ACM SIGKDD Explor Newslett"},{"key":"120_CR11","doi-asserted-by":"crossref","unstructured":"He R, McAuley J (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th international world wide web conference, pp 507\u2013517","DOI":"10.1145\/2872427.2883037"},{"key":"120_CR12","doi-asserted-by":"publisher","first-page":"1171","DOI":"10.1214\/009053607000000677","volume":"36","author":"T Hofmann","year":"2008","unstructured":"Hofmann T, Sch\u00f6lkopf B, Smola AJ (2008) Kernel methods in machine learning. Anna Stat 36:1171\u20131220","journal-title":"Anna Stat"},{"key":"120_CR13","doi-asserted-by":"crossref","unstructured":"Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the 12th ACM conference on knowledge discovery and data mining, pp 217\u2013226","DOI":"10.1145\/1150402.1150429"},{"key":"120_CR14","doi-asserted-by":"publisher","first-page":"D353","DOI":"10.1093\/nar\/gkw1092","volume":"45","author":"M Kanehisa","year":"2017","unstructured":"Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K (2017) Kegg: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 45:D353\u2013D361","journal-title":"Nucleic Acids Res"},{"key":"120_CR15","doi-asserted-by":"publisher","first-page":"D1202","DOI":"10.1093\/nar\/gkv951","volume":"44","author":"S Kim","year":"2016","unstructured":"Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) Pubchem substance and compound databases. Nucleic Acids Res 44:D1202\u2013D1213","journal-title":"Nucleic Acids Res"},{"key":"120_CR16","unstructured":"Kuksa PP, Huang PH, Pavlovic V (2008) Kernel methods and algorithms for general sequence analysis. Rutgers computer science technical report, vol. RU-DCS-TR630"},{"key":"120_CR17","unstructured":"Le Q, Sarlos T, Smola A (2013) Fastfood\u2014approximating kernel expansions in loglinear time. In: Proceedings of the 30th international conference on machine learning, pp 244\u2013252"},{"key":"120_CR18","unstructured":"Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the 7th pacific symposium on biocomputing, pp 566\u2013575"},{"key":"120_CR19","doi-asserted-by":"crossref","unstructured":"Li P (2017) Linearized GMM kernels and normalized random Fourier features. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 315\u2013324","DOI":"10.1145\/3097983.3098081"},{"key":"120_CR20","unstructured":"Li P, Shrivastava A, Moore J, K\u00f6ning AC (2011) Hashing algorithms for large-scale learning. In: Advances in neural information processing systems, pp 2672\u20132680"},{"key":"120_CR21","first-page":"419","volume":"2","author":"H Lodhi","year":"2002","unstructured":"Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419\u2013444","journal-title":"J Mach Learn Res"},{"key":"120_CR22","doi-asserted-by":"crossref","unstructured":"Maruyama S, Tabei Y, Sakamoto H, Sadakane K (2013) Fully-online grammar compression. In: Proceedings of international symposium on string processing and information retrieval, pp 218\u2013229","DOI":"10.1007\/978-3-319-02432-5_25"},{"key":"120_CR23","doi-asserted-by":"crossref","unstructured":"McAuley J, Targett C, Shi J, van\u00a0den Hagel A (2015) Image-based recommendations on styles and substitutes. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp 43\u201352","DOI":"10.1145\/2766462.2767755"},{"key":"120_CR24","doi-asserted-by":"crossref","unstructured":"Pham N, Pagh R (2013) Fast and scalable polynomial kernels via explicit feature maps. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 239\u2013247","DOI":"10.1145\/2487575.2487591"},{"key":"120_CR25","unstructured":"Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp 1177\u20131184"},{"key":"120_CR26","doi-asserted-by":"publisher","first-page":"1682","DOI":"10.1093\/bioinformatics\/bth141","volume":"20","author":"H Saigo","year":"2004","unstructured":"Saigo H, Vert JP, Ueda N, Akutsu T (2004) Protein homology detection using string alignment kernels. Bioinformatics 20:1682\u20131689","journal-title":"Bioinformatics"},{"key":"120_CR27","doi-asserted-by":"crossref","unstructured":"Shapira D, Storer JA (2002) Edit distance with move operations. In: Proceedings of the 13th symposium on combinatorial pattern matching, vol 2373, pp 85\u201398","DOI":"10.1007\/3-540-45452-7_9"},{"key":"120_CR28","unstructured":"Shimodaira H, Noma K, Nakai M, Sagayama S (2001) Dynamic time-alignment kernel in support vector machine. In: Advances in neural information processing systems, pp 921\u2013928"},{"key":"120_CR29","doi-asserted-by":"crossref","unstructured":"Singh R, Sekhon A, Kowsari K, Lanchantin J, Wang B, Qi Y (2017) Gakco: a fast gapped k-mer string kernel using counting. In: Proceedings of the European conference on machine learning and principles and practice on knowledge discovery in databases","DOI":"10.1101\/329425"},{"key":"120_CR30","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","volume":"147","author":"TF Smith","year":"1981","unstructured":"Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195\u2013197","journal-title":"J Mol Biol"},{"key":"120_CR31","doi-asserted-by":"crossref","unstructured":"Takabatake Y, Tabei Y, Sakamoto H (2014) Improved ESP-index: a practical self-index for highly repetitive texts. In: Proceedings of the 13th international symposium on experimental algorithms, pp 338\u2013350","DOI":"10.1007\/978-3-319-07959-2_29"},{"key":"120_CR32","unstructured":"Wu L, Yen IE, Xu F, Ravikumar P, Witbrock M (2018) D2KE: from distance to kernel and embedding. CoRR arXiv:1802.04956"},{"key":"120_CR33","unstructured":"Wu L, Yen IE, Yi J, Xu F, Lei Q, Witbrock M (2018) Random warping series: a random features method for time-series embedding. In: Proceedings of the 21st international conference on artificial intelligence and statistics, pp 793\u2013802"},{"key":"120_CR34","unstructured":"Yu F X, Suresh AT, Choromanski KM, Holtmann-Rice DN, Kumar S (2016) Orthogonal random features. In: Proceedings of the 29th international conference on neural information processing systems, pp 1975\u20131983"},{"key":"120_CR35","doi-asserted-by":"crossref","unstructured":"Zhang H, Zhang Q (2017) EmbedJoin: efficient edit similarity joins via embeddings. In: Proceedings of the 23rd SIGKDD international conference on knowledge discovery and data mining, pp 585\u2013594","DOI":"10.1145\/3097983.3098003"},{"key":"120_CR36","doi-asserted-by":"crossref","unstructured":"Zhou F, De\u00a0la Torre F, Cohn JF (2010) Unsupervised discovery of facial events. In: Proceedings of the 2010 IEEE computer society conference on computer vision and pattern recognition, pp 2574\u20132581","DOI":"10.1109\/CVPR.2010.5539966"}],"container-title":["Data Science and Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-020-00120-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41019-020-00120-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-020-00120-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,5,18]],"date-time":"2021-05-18T00:02:55Z","timestamp":1621296175000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41019-020-00120-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,18]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6]]}},"alternative-id":["120"],"URL":"https:\/\/doi.org\/10.1007\/s41019-020-00120-6","relation":{},"ISSN":["2364-1185","2364-1541"],"issn-type":[{"type":"print","value":"2364-1185"},{"type":"electronic","value":"2364-1541"}],"subject":[],"published":{"date-parts":[[2020,5,18]]},"assertion":[{"value":"4 February 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 April 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 April 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 May 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}