{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,30]],"date-time":"2025-10-30T07:15:39Z","timestamp":1761808539748,"version":"3.40.5"},"reference-count":57,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2022,8,26]],"date-time":"2022-08-26T00:00:00Z","timestamp":1661472000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this article, we introduce an extended, freely available resource for the Romanian language, named <jats:monospace>RoLEX<\/jats:monospace>. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. <jats:monospace>RoLEX<\/jats:monospace> includes over 330,000 curated entries with information regarding lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. The process of selecting the list of word entries and semi-automatically annotating the complete lexical information associated with each of the entries is thoroughly described.<\/jats:p><jats:p>The dataset\u2019s inherent knowledge is then evaluated in a task of concurrent prediction of syllabification, lexical stress marking and phonemic transcription. The evaluation looked into several dataset design factors, such as the minimum viable number of entries for correct prediction, the optimisation of the minimum number of required entries through expert selection and the augmentation of the input with morphosyntactic information, as well as the influence of each task in the overall accuracy. The best results were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. A word error rate of 3.08% and a character error rate of 1.08% were obtained this way. We show that using a carefully selected subset of entries for training can result in a similar performance to the performance obtained by a larger set of randomly selected entries (twice as many). In terms of prediction complexity, the lexical stress marking posed most problems and accounts for around 60% of the errors in the predicted sequence.<\/jats:p>","DOI":"10.1017\/s1351324922000419","type":"journal-article","created":{"date-parts":[[2022,8,26]],"date-time":"2022-08-26T08:24:34Z","timestamp":1661502274000},"page":"720-745","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":4,"title":["RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7728-5863","authenticated-orcid":false,"given":"Be\u00e1ta","family":"L\u0151rincz","sequence":"first","affiliation":[]},{"given":"Elena","family":"Irimia","sequence":"additional","affiliation":[]},{"given":"Adriana","family":"Stan","sequence":"additional","affiliation":[]},{"given":"Verginica","family":"Barbu Mititelu","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,8,26]]},"reference":[{"key":"S1351324922000419_ref28","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Soviet Physics Doklady"},{"key":"S1351324922000419_ref39","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1184"},{"key":"S1351324922000419_ref45","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1547"},{"key":"S1351324922000419_ref30","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1436"},{"key":"S1351324922000419_ref24","doi-asserted-by":"publisher","DOI":"10.3758\/s13428-013-0400-8"},{"key":"S1351324922000419_ref16","doi-asserted-by":"publisher","DOI":"10.1023\/A:1024089129146"},{"key":"S1351324922000419_ref25","unstructured":"Halpern, J. (2022). Comprehensive Full-Form Lexicon for Arabic NLP and Speech Technology. Online. Available at https:\/\/www.cjk.org\/wp-content\/uploads\/Halpern-LREC2022Paper.pdf 18 July 2022."},{"key":"S1351324922000419_ref35","unstructured":"Rehm, G. , Berger, M. , Elsholz, E. , Hegele, S. , Kintzel, F. , Marheinecke, K. , Piperidis, S. , Deligiannis, M. , Galanis, D. , Gkirtzou, K. , Labropoulou, P. , Bontcheva, K. , Jones, D. , Roberts, I. , Haji\u010d, J. , Hamrlov\u00e1, J. , Ka\u010dena, L. , Choukri, K. , Arranz, V. , Vasi\u013cjevs, A. , Anvari, O. , Lagzdi\u0146\u0161, A. , Me\u013c\u0146ika, J. , Backfried, G. , Dikici, E. , Janosik, M. , Prinz, K. , Prinz, C. , Stampler, S. , Thomas-Aniola, D. , G\u00f3mez-P\u00e9rez, J. M. , Garcia Silva, A. , Berr\u00edo, C. , Germann, U. , Renals, S. and Klejch, O. (2020). European language grid: An overview. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille: European Language Resources Association, pp. 3366\u20133380."},{"key":"S1351324922000419_ref38","doi-asserted-by":"publisher","DOI":"10.1109\/SPED.2019.8906639"},{"key":"S1351324922000419_ref56","unstructured":"Zeineldeen, M. , Zeyer, A. , Zhou, W. , Ng, T. , Schl\u00fcter, R. and Ney, H. (2020). A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models. arXiv preprint, arXiv: 2005.09336."},{"key":"S1351324922000419_ref27","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0172493"},{"key":"S1351324922000419_ref2","unstructured":"Barbu Mititelu, V. , Tufi\u015f, D. and Irimia, E. (2018). The reference corpus of the contemporary Romanian language (CoRoLa). In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA), pp. 1178\u20131185."},{"key":"S1351324922000419_ref51","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems 30","author":"Vaswani","year":"2017"},{"volume-title":"Despartirea automata in silabe a cuvintelor din limba rom\u00e2n\u0103. Aplicatii in construc\u0163ia bazei de date a silabelor limbii rom\u00e2ne","year":"2004","author":"Dinu","key":"S1351324922000419_ref13"},{"key":"S1351324922000419_ref20","doi-asserted-by":"publisher","DOI":"10.3115\/1687878.1687897"},{"volume-title":"Fonetica Limbii Romane: Vol. 2 Dictionarul morfologic si fonetic al limbii romane (A-L), Vol. 3 Dictionarul morfologic si fonetic al limbii romane (M-Z)","year":"2015","author":"Diaconescu","key":"S1351324922000419_ref11"},{"key":"S1351324922000419_ref32","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1788"},{"volume-title":"Fonetica Limbii Romane: Vol. 2 Dictionarul morfologic si fonetic al limbii romane (A-L), Vol. 3 Dictionarul morfologic si fonetic al limbii romane (M-Z)","year":"2015","author":"Diaconescu","key":"S1351324922000419_ref12"},{"key":"S1351324922000419_ref33","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-010-9130-z"},{"key":"S1351324922000419_ref23","unstructured":"Georgescu, A.-L. , Cucu, H. , Buzo, A. and Burileanu, C. (2020). RSC: A Romanian read speech corpus for automatic speech recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, pp. 6606\u20136612."},{"key":"S1351324922000419_ref5","article-title":"Tools and resources for Romanian text-to-speech and speech-to-text applications","author":"Boro\u015f","year":"2018","journal-title":"CoRR"},{"key":"S1351324922000419_ref43","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2010.12.002"},{"key":"S1351324922000419_ref29","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2020.08.012"},{"key":"S1351324922000419_ref48","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2016.7846248"},{"key":"S1351324922000419_ref4","first-page":"36","article-title":"O posibil\u0103 clasificare a omografelor rom\u00e2ne\u015fti","volume":"V","author":"B\u0103cil\u0103","year":"2011","journal-title":"Philologica Banatica"},{"key":"S1351324922000419_ref52","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2004-484"},{"key":"S1351324922000419_ref47","doi-asserted-by":"publisher","DOI":"10.1109\/SPED.2017.7990435"},{"key":"S1351324922000419_ref1","unstructured":"Barbu, A.-M. (2008). Romanian lexical data bases: Inflected and syllabic forms dictionaries. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC\u201908) . Marrakech: European Language Resources Association (ELRA), pp. 1937\u20131941."},{"key":"S1351324922000419_ref42","doi-asserted-by":"publisher","DOI":"10.1109\/SpeD53181.2021.9587438"},{"key":"S1351324922000419_ref8","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2013.05.003"},{"key":"S1351324922000419_ref22","doi-asserted-by":"publisher","DOI":"10.1109\/SPED.2017.7990443"},{"key":"S1351324922000419_ref44","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-1208"},{"key":"S1351324922000419_ref55","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-1954"},{"key":"S1351324922000419_ref26","unstructured":"Ion, R. (2018). TEPROLIN: An extensible, online text preprocessing platform for Romanian. In Proceedings of the 13th International Conference on Linguistic Resources and Tools for Processing the Romanian Language, Ia\u015fi."},{"key":"S1351324922000419_ref7","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/E14-4013"},{"volume-title":"The Orthographic, Orthoepic and Morphologic Dictionary of the Romanian Language (DOOM2)","year":"2005","key":"S1351324922000419_ref19"},{"key":"S1351324922000419_ref3","doi-asserted-by":"publisher","DOI":"10.3115\/1620754.1620799"},{"key":"S1351324922000419_ref6","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462678"},{"key":"S1351324922000419_ref40","doi-asserted-by":"publisher","DOI":"10.1109\/SPED.2017.7990428"},{"key":"S1351324922000419_ref31","doi-asserted-by":"publisher","DOI":"10.21437\/ICSLP.2000-298"},{"key":"S1351324922000419_ref41","unstructured":"Stan, A. and Giurgiu, M. (2018). A comparison between traditional machine learning approaches and deep neural networks for text processing in Romanian. In Proceedings of the 13th International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR), Ia\u015fi."},{"key":"S1351324922000419_ref9","unstructured":"de Mare\u00fcil, P. B. , d\u2019Alessandro, C. , Yvon, F. , Auberg\u00e9, V. , Vaissi\u00e8re, J. and Amelot, A. (2000). A French phonetic lexicon with variants for speech and language processing. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC\u201900) . Athens: European Language Resources Association (ELRA)."},{"volume-title":"Dic\u0163ionarul ortografic, ortoepic \u015fi morfologic al limbii rom\u00e2ne","year":"1982","author":"Rom\u00e2n\u0103","key":"S1351324922000419_ref36"},{"key":"S1351324922000419_ref57","unstructured":"Zhang, A. , Lipton, Z. C. , Li, M. and Smola, A. J. (2020). Dive into Deep Learning. Available at https:\/\/d2l.ai"},{"key":"S1351324922000419_ref34","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178767"},{"key":"S1351324922000419_ref54","doi-asserted-by":"publisher","DOI":"10.3390\/app9061143"},{"key":"S1351324922000419_ref37","doi-asserted-by":"publisher","DOI":"10.3758\/s13428-018-1058-z"},{"key":"S1351324922000419_ref15","unstructured":"Dinu, L. and Dinu, A. (2006). On the data base of Romanian syllables and some of its quantitative and cryptographic aspects. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC\u201906) . Genoa: European Language Resources Association (ELRA), pp. 1795\u20131798."},{"key":"S1351324922000419_ref53","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-134"},{"key":"S1351324922000419_ref14","unstructured":"Dinu, L. , Ciobanu, A. M. , Chitoran, I. and Niculae, V. (2014). Using a machine learning model to assess the complexity of stress systems. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914) . Reykjavik: European Language Resources Association (ELRA), pp. 331\u2013336."},{"key":"S1351324922000419_ref17","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40585-3_57"},{"key":"S1351324922000419_ref46","first-page":"682","volume-title":"2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns","author":"Toma","year":"2009"},{"volume-title":"The Romanian Language in the Digital Era","year":"2012","author":"Trandabat","key":"S1351324922000419_ref49"},{"key":"S1351324922000419_ref18","unstructured":"Domokos, J. , Buza, O. and Toderean, G. (2012). 100K+ words, machine-readable, pronunciation dictionary for the Romanian language. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO) . Bucharest: IEEE, pp. 320\u2013324."},{"key":"S1351324922000419_ref50","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-1419"},{"key":"S1351324922000419_ref10","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Minneapolis, MN: Association for Computational Linguistics, pp. 4171\u20134186."},{"key":"S1351324922000419_ref21","unstructured":"Gehring, J. , Auli, M. , Grangier, D. , Yarats, D. and Dauphin, Y. N. (2017). Convolutional sequence to sequence learning. In International Conference on Machine Learning. Sydney: PMLR, pp. 1243\u20131252."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000419","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T07:31:44Z","timestamp":1684481504000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000419\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,26]]},"references-count":57,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,5]]}},"alternative-id":["S1351324922000419"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000419","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2022,8,26]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}