{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:05:03Z","timestamp":1760058303837,"version":"build-2065373602"},"reference-count":17,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T00:00:00Z","timestamp":1742860800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62137002","GJGJZD20210408092806017","24YJCZH142"],"award-info":[{"award-number":["62137002","GJGJZD20210408092806017","24YJCZH142"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"the Shenzhen Science and Technology Innovation Commission project","award":["62137002","GJGJZD20210408092806017","24YJCZH142"],"award-info":[{"award-number":["62137002","GJGJZD20210408092806017","24YJCZH142"]}]},{"name":"the 2024 Humanities and Social Science Research Youth Fund of the Ministry of Education of China","award":["62137002","GJGJZD20210408092806017","24YJCZH142"],"award-info":[{"award-number":["62137002","GJGJZD20210408092806017","24YJCZH142"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>While script identification is the first step in many natural language processing and text mining tasks, at present, there is no open-source script identification algorithm for text. For this reason, we analyze the Unicode encoding of each type of script and construct regular expressions in this study, in order to design an improved script identification algorithm. Because some scripts share common characters, it\u2019s impossible to count and summarize them. 
As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, if a new script identification feature is required, the regular expression for each script must be re-adjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, allowing us to design an improved script identification algorithm. Using this approach, we can fully consider all 169 Unicode script types. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered during script identification; furthermore, these items in the text are also included in the script identification results, thus ensuring the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while providing improvements on its basis.<\/jats:p>","DOI":"10.3390\/data10040043","type":"journal-article","created":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T10:53:54Z","timestamp":1742900034000},"page":"43","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8860-951X","authenticated-orcid":false,"given":"Mamtimin","family":"Qasim","sequence":"first","affiliation":[{"name":"School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wushour","family":"Silamu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China"},{"name":"Key Multi-Lingual Laboratory of 
Xinjiang, Urumqi 830046, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,25]]},"reference":[{"key":"ref_1","first-page":"21","article-title":"Optimizing n-gram order of an n-gram based language identification algorithm for 68 written languages","volume":"2","author":"Choong","year":"2009","journal-title":"Int. J. Adv. ICT Emerg. Reg."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1016\/j.csl.2012.01.004","article-title":"Factors that affect the accuracy of text-based language identification","volume":"26","author":"Botha","year":"2012","journal-title":"Comput. Speech Lang."},{"key":"ref_3","first-page":"491","article-title":"Effective language identification of forum texts based on statistical approaches","volume":"52","author":"Abainia","year":"2016","journal-title":"Inf. Process. Manag. Int. J."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1016\/j.jksuci.2014.12.004","article-title":"Word-length algorithm for language identification of under-resourced languages","volume":"28","author":"Selamat","year":"2015","journal-title":"J. King Saud Univ. Comput. Inf. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1613\/jair.1.11675","article-title":"Automatic language identification in texts: A survey","volume":"65","author":"Jauhiainen","year":"2019","journal-title":"J. Artif. Intell. Res."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Zampieri, M., Malmasi, S., Ljube\u0161i\u0107, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., and Aepli, N. (2017, January 3). Findings of the VarDial Evaluation Campaign. Proceedings of the VarDial Workshop, Valencia, Spain.","DOI":"10.18653\/v1\/W17-1201"},{"key":"ref_7","unstructured":"Apple (2021, February 10). Language Identification from Very Short Strings. 
Available online: https:\/\/machinelearning.apple.com\/research\/language-identification-from-very-short-strings."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Toftrup, M., S\u00f8rensen, S.A., Ciosici, M.R., and Assent, I. (2021, January 19\u201323). A reproduction of apple\u2019s bi-directional lstm models for language identification in short strings. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, Virtual.","DOI":"10.18653\/v1\/2021.eacl-srw.6"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Maimaitiyiming, H., and Wushour, S. (2018). On hierarchical text language-identification algorithms. Algorithms, 11.","DOI":"10.3390\/a11040039"},{"key":"ref_10","first-page":"354","article-title":"Three-stage short text language identification algorithm","volume":"15","author":"Hasimu","year":"2017","journal-title":"J. Digit. Inf. Manag."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"534","DOI":"10.3923\/itj.2007.534.540","article-title":"Unicode Aided Language Identification across Multiple Scripts and Heterogeneous Data","volume":"6","author":"Hanif","year":"2007","journal-title":"Inf. Technol. J."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Mamtimin, Q., Wushour, S., and Minghui, Q. (2024). The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset. Data, 9.","DOI":"10.3390\/data9110134"},{"key":"ref_13","unstructured":"(2025, January 24). Scripts-16.0.0.txt. Available online: https:\/\/www.unicode.org\/Public\/UNIDATA\/Scripts.txt."},{"key":"ref_14","unstructured":"(2024, June 28). Leipzig Corpora Collection. Available online: https:\/\/cls.corpora.uni-leipzig.de\/en."},{"key":"ref_15","unstructured":"(2024, July 05). Leipzig Corpora Collection Download Page. 
Available online: https:\/\/wortschatz-leipzig.de\/en\/download."},{"key":"ref_16","unstructured":"(2024, July 05). ISO 639-2 Code, Available online: https:\/\/www.loc.gov\/standards\/iso639-2\/php\/code_list.php."},{"key":"ref_17","unstructured":"(2024, July 05). ISO 639-3 Code. Available online: https:\/\/iso639-3.sil.org\/code_tables\/639\/data."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/4\/43\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:59:52Z","timestamp":1760029192000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/4\/43"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,25]]},"references-count":17,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["data10040043"],"URL":"https:\/\/doi.org\/10.3390\/data10040043","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,3,25]]}}}