{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:38:54Z","timestamp":1721731134512},"reference-count":42,"publisher":"MIT Press","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2016,9]]},"abstract":"<jats:p>The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Recently, an approach that uses only character p-grams as features has been proposed for the task of native language identification (NLI). The approach obtained state-of-the-art results by combining several string kernels using multiple kernel learning. Despite the fact that the approach based on string kernels performs so well, several questions about this method remain unanswered. First, it is not clear why such a simple approach can compete with far more complex approaches that take words, lemmas, syntactic information, or even semantics into account. Second, although the approach is designed to be language independent, all experiments to date have been on English. This work is an extensive study that aims to systematically present the string kernel approach and to clarify the open questions mentioned above.<\/jats:p><jats:p>A broad set of native language identification experiments were conducted to compare the string kernels approach with other state-of-the-art methods. The empirical results obtained in all of the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in NLI, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the results obtained on both the Arabic and the Norwegian corpora demonstrate that the proposed approach is language independent. 
In the Arabic native language identification task, string kernels show an increase of more than 17% over the best accuracy reported so far. The results of string kernels on Norwegian native language identification are also significantly better than the state-of-the-art approach. In addition, in a cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state-of-the-art system by 32.3%.<\/jats:p><jats:p>To gain additional insights about the string kernels approach, the features selected by the classifier as being more discriminating are analyzed in this work. The analysis also offers information about localized language transfer effects, since the features used by the proposed model are p-grams of various lengths. The features captured by the model typically include stems, function words, and word prefixes and suffixes, which have the potential to generalize over purely word-based features. By analyzing the discriminating features, this article offers insights into two kinds of language transfer effects, namely, word choice (lexical transfer) and morphological differences. 
The goal of the current study is to give a full view of the string kernels approach and shed some light on why this approach works so well.<\/jats:p>","DOI":"10.1162\/coli_a_00256","type":"journal-article","created":{"date-parts":[[2016,6,17]],"date-time":"2016-06-17T19:28:59Z","timestamp":1466191739000},"page":"491-525","source":"Crossref","is-referenced-by-count":14,"title":["String Kernels for Native Language Identification: Insights from Behind the Curtains"],"prefix":"10.1162","volume":"42","author":[{"given":"Radu Tudor","family":"Ionescu","sequence":"first","affiliation":[{"name":"University of Bucharest"}]},{"given":"Marius","family":"Popescu","sequence":"additional","affiliation":[{"name":"University of Bucharest"}]},{"given":"Aoife","family":"Cahill","sequence":"additional","affiliation":[{"name":"Educational Testing Service"}]}],"member":"281","reference":[{"key":"R1","unstructured":"Abu-Jbara, Amjad, Rahul Jha, Eric Morley, and Dragomir Radev. 2013. Experimental results on the native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 82\u201388, Atlanta, GA."},{"key":"R2","unstructured":"Alfaifi, Abdullah, Eric Atwell, and Ibraheem Hedaya. 2014. Arabic Learner Corpus (ALC) v2: A New Written and Spoken Corpus of Arabic Learners. In Proceedings of the Learner Corpus Studies in Asia and the World, Kobe."},{"key":"R4","unstructured":"Bykh, Serhiy and Detmar Meurers. 2012. Native language identification using recurring n-grams\u2014investigating abstraction and domain dependence. In Proceedings of COLING, pages 425\u2013440, Mumbai."},{"key":"R5","unstructured":"Bykh, Serhiy and Detmar Meurers. 2014. Exploring syntactic features for native language identification: A variationist perspective on feature encoding and ensemble optimization. 
In Proceedings of COLING, pages 1962\u20131973, Dublin."},{"key":"R6","doi-asserted-by":"publisher","DOI":"10.1007\/BF00994018"},{"key":"R7","doi-asserted-by":"crossref","unstructured":"Cristianini, Nello, John Shawe-Taylor, Andr\u00e9 Elisseeff, and Jaz S. Kandola. 2001. On kernel-target alignment. In Proceedings of NIPS, pages 367\u2013373, Vancouver.","DOI":"10.7551\/mitpress\/1120.003.0052"},{"key":"R8","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0104006"},{"key":"R9","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2006.05.024"},{"key":"R10","unstructured":"Escalante, Hugo Jair, Thamar Solorio, and Manuel Montes-y-G\u00f3mez. 2011. Local histograms of character n-grams for authorship attribution. In Proceedings of ACL: HLT, 1:288\u2013298, Portland, OR."},{"key":"R11","unstructured":"Estival, Dominique, Tanja Gaustad, Son-Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for English emails. In Proceedings of PACLING, pages 263\u2013272, Melbourne."},{"key":"R12","unstructured":"Gebre, Binyam Gebrekidan, Marcos Zampieri, Peter Wittenburg, and Tom Heskes. 2013. Improving native language identification with tf-idf weighting. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 216\u2013223, Atlanta, GA."},{"key":"R13","unstructured":"Gonen, Mehmet and Ethem Alpaydin. 2011. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211\u20132268."},{"key":"R15","unstructured":"Grozea, Cristian, Christian Gehl, and Marius Popescu. 2009. ENCOPLOT: Pairwise sequence matching in linear time applied to plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship, and Social Software Misuse, page 10, San Sebastian."},{"key":"R18","unstructured":"Henderson, John, Guido Zarrella, Craig Pfeifer, and John D. Burger. 2013. Discriminating non-native English with 350 words. 
In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 101\u2013110, Atlanta, GA."},{"key":"R19","doi-asserted-by":"crossref","unstructured":"Ionescu, Radu Tudor. 2013. Local Rank Distance. In Proceedings of SYNASC, pages 221\u2013228, Timi\u015foara.","DOI":"10.1109\/SYNASC.2013.36"},{"key":"R20","doi-asserted-by":"crossref","unstructured":"Ionescu, Radu Tudor. 2015. A fast algorithm for Local Rank Distance: Application to Arabic native language identification. In Proceedings of ICONIP, pages 390\u2013400, Istanbul.","DOI":"10.1007\/978-3-319-26535-3_45"},{"key":"R21","doi-asserted-by":"crossref","unstructured":"Ionescu, Radu Tudor, Marius Popescu, and Aoife Cahill. 2014. Can characters reveal your native language? A language independent approach to native language identification. In Proceedings of EMNLP, pages 1363\u20131373, Doha.","DOI":"10.3115\/v1\/D14-1142"},{"key":"R22","unstructured":"Jarvis, Scott, Yves Bestgen, and Steve Pepper. 2013. Maximizing classification accuracy in native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 111\u2013118, Atlanta, GA."},{"key":"R23","unstructured":"Jarvis, Scott, Gabriela Casta\u00f1eda Jim\u00e9nez, and Rasmus Nielsen. 2004. Investigating L1 lexical transfer through learners' wordprints. Second Language Research Forum (SLRF). State College, PA."},{"key":"R25","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220290"},{"key":"R26","doi-asserted-by":"crossref","unstructured":"Koppel, Moshe, Jonathan Schler, and Kfir Zigdon. 2005. Automatically determining an anonymous author's native language. In Proceedings of ISI, pages 209\u2013217, Atlanta, GA.","DOI":"10.1007\/11427995_17"},{"key":"R27","doi-asserted-by":"publisher","DOI":"10.1162\/153244302760200687"},{"key":"R28","doi-asserted-by":"crossref","unstructured":"Maji, Subhransu, Alexander C. Berg, and Jitendra Malik. 2008. 
Classification using intersection kernel support vector machines is efficient. In Proceedings of CVPR, pages 1\u20138, Anchorage, AK.","DOI":"10.1109\/CVPR.2008.4587630"},{"key":"R29","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3625"},{"key":"R30","doi-asserted-by":"crossref","unstructured":"Malmasi, Shervin and Mark Dras. 2014b. Chinese native language identification. In Proceedings of EACL, 2:95\u201399, Gothenburg.","DOI":"10.3115\/v1\/E14-4019"},{"key":"R31","doi-asserted-by":"crossref","unstructured":"Malmasi, Shervin and Mark Dras. 2014c. Language transfer hypotheses with linear SVM weights. In Proceedings of EMNLP, pages 1385\u20131390, Doha.","DOI":"10.3115\/v1\/D14-1144"},{"key":"R32","unstructured":"Malmasi, Shervin, Sze-Meng Jojo Wong, and Mark Dras. 2013. NLI shared task 2013: MQ submission. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 124\u2013133, Atlanta, GA."},{"key":"R34","unstructured":"Mizumoto, Tomoya, Yuta Hayashibe, Keisuke Sakaguchi, Mamoru Komachi, and Yuji Matsumoto. 2013. NAIst at the NLI 2013 shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 134\u2013139, Atlanta, GA."},{"key":"R36","doi-asserted-by":"publisher","DOI":"10.3115\/1699510.1699525"},{"key":"R37","unstructured":"Pighin, Daniele and Alessandro Moschitti. 2010. On reverse feature engineering of syntactic tree kernels. In Proceedings of CoNLL, pages 223\u2013233, Uppsala."},{"key":"R40","unstructured":"Popescu, Marius and Cristian Grozea. 2012. Kernel methods and string kernels for authorship analysis. CLEF (Online Working Notes\/Labs\/ Workshop). Rome."},{"key":"R41","unstructured":"Popescu, Marius and Radu Tudor Ionescu. 2013. The story of the characters, the DNA and the native language. 
In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 270\u2013278, Atlanta, GA."},{"key":"R42","unstructured":"Rozovskaya, Alla and Dan Roth. 2010. Generating confusion sets for context-sensitive error correction. In Proceedings of EMNLP, pages 961\u2013970, Cambridge, MA."},{"key":"R43","doi-asserted-by":"publisher","DOI":"10.3115\/1610075.1610142"},{"key":"R45","doi-asserted-by":"crossref","unstructured":"Swanson, Ben and Eugene Charniak. 2014. Data driven language transfer hypotheses. In Proceedings of EACL, pages 169\u2013173, Gothenburg.","DOI":"10.3115\/v1\/E14-4033"},{"key":"R46","unstructured":"Tenfjord, K., P. Meurer, and K. Hofland. 2006. The ASK Corpus\u2014A language learner corpus of Norwegian as a second language. In Proceedings of LREC, pages 1821\u20131824, Genoa."},{"key":"R47","unstructured":"Tetreault, Joel, Daniel Blanchard, and Aoife Cahill. 2013. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 48\u201357, Atlanta, GA."},{"key":"R48","unstructured":"Tetreault, Joel, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. In Proceedings of COLING, pages 2585\u20132602, Mumbai."},{"key":"R49","doi-asserted-by":"publisher","DOI":"10.3115\/1073336.1073367"},{"key":"R50","doi-asserted-by":"crossref","unstructured":"Tsur, Oren and Ari Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 9\u201316, Prague.","DOI":"10.3115\/1629795.1629797"},{"key":"R51","unstructured":"Tsvetkov, Yulia, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. 
2013. Identifying the L1 of non-native writers: The CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279\u2013287, Atlanta, GA."},{"key":"R52","doi-asserted-by":"crossref","unstructured":"Vedaldi, Andrea and Andrew Zisserman. 2010. Efficient additive kernels via explicit feature maps. In Proceedings of CVPR, pages 3539\u20133546, San Francisco, CA.","DOI":"10.1109\/CVPR.2010.5539949"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/COLI_a_00256","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,17]],"date-time":"2024-06-17T18:33:30Z","timestamp":1718649210000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/42\/3\/491-525\/1541"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,9]]},"references-count":42,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2016,9]]}},"alternative-id":["10.1162\/COLI_a_00256"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00256","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,9]]}}}