{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T13:53:18Z","timestamp":1770731598753,"version":"3.49.0"},"reference-count":27,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2024,11,11]],"date-time":"2024-11-11T00:00:00Z","timestamp":1731283200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62137002"],"award-info":[{"award-number":["62137002"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["GJGJZD20210408092806017"],"award-info":[{"award-number":["GJGJZD20210408092806017"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["22GYB159"],"award-info":[{"award-number":["22GYB159"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Shenzhen Science and Technology Innovation Commission project","award":["62137002"],"award-info":[{"award-number":["62137002"]}]},{"name":"Shenzhen Science and Technology Innovation Commission project","award":["GJGJZD20210408092806017"],"award-info":[{"award-number":["GJGJZD20210408092806017"]}]},{"name":"Shenzhen Science and Technology Innovation Commission project","award":["22GYB159"],"award-info":[{"award-number":["22GYB159"]}]},{"name":"the 14th Five-Year Plan of Guangdong Association of Higher Education 2022 Higher Education Research Project","award":["62137002"],"award-info":[{"award-number":["62137002"]}]},{"name":"the 14th Five-Year Plan of Guangdong Association of Higher 
Education 2022 Higher Education Research Project","award":["GJGJZD20210408092806017"],"award-info":[{"award-number":["GJGJZD20210408092806017"]}]},{"name":"the 14th Five-Year Plan of Guangdong Association of Higher Education 2022 Higher Education Research Project","award":["22GYB159"],"award-info":[{"award-number":["22GYB159"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Script identification is easier to implement than language identification, and its identification rate is very high. The fewer languages are identified when using a language identification algorithm, the higher the identification rate is. However, no systematic study on SI involving multiple languages and determining how to construct relevant language identification datasets has been conducted. Therefore, in this paper, we discuss and design a script identification algorithm and the construction of a language identification dataset based on script groups. The data sources in this paper comprise 261 different languages\u2019 text corpora from the Leipzig Corpora Collection, which are grouped into 23 different script groups. In the Unicode encoding scheme, different scripts are arranged into different code regions. Based on this feature, we propose a written script identification algorithm based on regular expression matching, the micro F-score of which reaches 0.9929 in sentence-level script identification experiments. 
To reduce noise when constructing the language identification dataset for each script, a script identification algorithm is used to filter out other-script content in each text.<\/jats:p>","DOI":"10.3390\/data9110134","type":"journal-article","created":{"date-parts":[[2024,11,11]],"date-time":"2024-11-11T03:52:07Z","timestamp":1731297127000},"page":"134","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8860-951X","authenticated-orcid":false,"given":"Mamtimin","family":"Qasim","sequence":"first","affiliation":[{"name":"School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China"}]},{"given":"Wushour","family":"Silamu","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China"},{"name":"Key Multi-Lingual Laboratory of Xinjiang, Urumqi 830046, China"}]},{"given":"Minghui","family":"Qiu","sequence":"additional","affiliation":[{"name":"School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,11,11]]},"reference":[{"key":"ref_1","first-page":"21","article-title":"Optimizing n-gram order of an n-gram based language identification algorithm for 68 written languages","volume":"2","author":"Choong","year":"2009","journal-title":"Int. J. Adv. ICT Emerg. Reg."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1016\/j.csl.2012.01.004","article-title":"Factors that affect the accuracy of text-based language identification","volume":"26","author":"Botha","year":"2012","journal-title":"Comput. 
Speech Lang."},{"key":"ref_3","first-page":"491","article-title":"Effective language identification of forum texts based on statistical approaches","volume":"52","author":"Abainia","year":"2016","journal-title":"Inf. Process. Manag. Int. J."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1016\/j.jksuci.2014.12.004","article-title":"Word-length algorithm for language identification of under-resourced languages","volume":"28","author":"Selamat","year":"2015","journal-title":"J. King Saud Univ. Comput. Inf. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1613\/jair.1.11675","article-title":"Automatic language identification in texts: A survey","volume":"65","author":"Jauhiainen","year":"2019","journal-title":"J. Artif. Intell. Res."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Zampieri, M., Malmasi, S., Ljube\u0161i\u0107, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., and Aepli, N. (2017, January 3). Findings of the VarDial Evaluation Campaign. Proceedings of the VarDial Workshop, Valencia, Spain.","DOI":"10.18653\/v1\/W17-1201"},{"key":"ref_7","unstructured":"Apple (2021, February 10). Language Identification from Very Short Strings. Available online: https:\/\/machinelearning.apple.com\/research\/language-identification-from-very-short-strings."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Toftrup, M., S\u00f8rensen, S.A., Ciosici, M.R., and Assent, I. (2021, January 19\u201323). A reproduction of Apple\u2019s bi-directional LSTM models for language identification in short strings. 
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, Virtual.","DOI":"10.18653\/v1\/2021.eacl-srw.6"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"534","DOI":"10.3923\/itj.2007.534.540","article-title":"Unicode Aided Language Identification across Multiple Scripts and Heterogeneous Data","volume":"6","author":"Hanif","year":"2007","journal-title":"Inf. Technol. J."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Maimaitiyiming, H., and Wushour, S. (2018). On hierarchical text language-identification algorithms. Algorithms, 11.","DOI":"10.3390\/a11040039"},{"key":"ref_11","first-page":"354","article-title":"Three-stage short text language identification algorithm","volume":"15","author":"Hasimu","year":"2017","journal-title":"J. Digit. Inf. Manag."},{"key":"ref_12","unstructured":"Majlis, M. (2012, January 26). Yet another language identifier. Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Brown, R.D. (2013). Selecting and Weighting N-Grams to Identify 1100 Languages, Springer.","DOI":"10.1007\/978-3-642-40585-3_60"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Lui, M., and Baldwin, T. (2014, January 26\u201330). Accurate Language Identification of Twitter Messages. Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), Gothenburg, Sweden.","DOI":"10.18653\/v1\/W14-1303"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Blodgett, S.L., Wei, J., and O\u2019Connor, B. (2017, January 7). A Dataset and Classifier for Recognizing Social Media English. Proceedings of the 3rd Workshop on Noisy User-Generated Text, Copenhagen, Denmark.","DOI":"10.18653\/v1\/W17-4408"},{"key":"ref_16","unstructured":"Baldwin, T., and Lui, M. (2010, January 1\u20136). 
Language Identification: The Long and the Short of the Matter. Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, CA, USA."},{"key":"ref_17","unstructured":"Tan, L., Zampieri, M., Ljube\u0161i\u0107, N., and Tiedemann, J. (2014, January 27). Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland."},{"key":"ref_18","unstructured":"Vatanen, T., V\u00e4yrynen, J.J., and Virpioja, S. (2010, January 17\u201323). Language identification of short text segments with n-gram models. Proceedings of the International Conference on Language Resources & Evaluation, Valletta, Malta."},{"key":"ref_19","unstructured":"Scherrer, Y., Jauhiainen, T., Ljube\u0161i\u0107, N., Nakov, P., Tiedemann, J., and Zampieri, M. (2023). Findings of the VarDial Evaluation Campaign 2023. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), Association for Computational Linguistics."},{"key":"ref_20","unstructured":"Goldhahn, D., Eckart, T., and Quasthoff, U. (2012, January 23\u201325). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC\u201912), Istanbul, Turkey."},{"key":"ref_21","unstructured":"(2024, June 28). Leipzig Corpora Collection. Available online: https:\/\/cls.corpora.uni-leipzig.de\/en."},{"key":"ref_22","unstructured":"(2024, July 05). Leipzig Corpora Collection Download Page. Available online: https:\/\/wortschatz-leipzig.de\/en\/download."},{"key":"ref_23","unstructured":"(2024, July 05). ISO 639-2 Code. Available online: https:\/\/www.loc.gov\/standards\/iso639-2\/php\/code_list.php."},{"key":"ref_24","unstructured":"(2024, July 05). ISO 639-3 Code. 
Available online: https:\/\/iso639-3.sil.org\/code_tables\/639\/data."},{"key":"ref_25","unstructured":"(2024, July 08). Chinese and Japanese. Available online: https:\/\/unicode.org\/faq\/han_cjk.html."},{"key":"ref_26","unstructured":"(2024, July 05). List of Unicode Groups and Block Ranges. Available online: https:\/\/www.unicodepedia.com\/groups\/."},{"key":"ref_27","unstructured":"(2024, July 08). Updated Proposal to Encode the Tulu-Tigalari Script in Unicode. Available online: https:\/\/www.unicode.org\/L2\/L2022\/22031-tulu-tigalari-prop.pdf."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/9\/11\/134\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:29:54Z","timestamp":1760113794000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/9\/11\/134"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,11]]},"references-count":27,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,11]]}},"alternative-id":["data9110134"],"URL":"https:\/\/doi.org\/10.3390\/data9110134","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,11]]}}}