{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:26:30Z","timestamp":1760059590026,"version":"build-2065373602"},"reference-count":0,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T00:00:00Z","timestamp":1750464000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>This article introduces a freely available Spanish\u2013Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachistlahuaca, Northern Guerrero, and Xochapa) and Oaxaca (Western Coast, Southern Lowland, Santa Mar\u00eda Yosoy\u00faa, Central, Lower Ca\u00f1ada, Western Central, San Antonio Huitepec, Upper Western, and Southwestern Central). Texts are classified into four main domains as follows: education, law, health, and religion. To compile these data, we conducted a two-phase collection process as follows: first, an online search of government portals, religious organizations, and Mixtec language blogs; and second, an on-site retrieval of physical texts from the library of the Autonomous University of Quer\u00e9taro. Scanning and optical character recognition were then performed to digitize physical materials, followed by manual correction to fix character misreadings and remove duplicates or irrelevant segments. We conducted a preliminary evaluation of the collected data to validate its usability in automatic translation systems. From Spanish to Mixtec, a fine-tuned GPT-4o-mini model yielded a BLEU score of 0.22 and a TER score of 122.86, while two fine-tuned open source models mBART-50 and M2M-100 yielded BLEU scores of 4.2 and 2.63 and TER scores of 98.99 and 104.87, respectively. All code demonstrating data usage, along with the final corpus itself, is publicly accessible via GitHub and Figshare. We anticipate that this resource will enable further research into machine translation, speech recognition, and other NLP applications while contributing to the broader goal of preserving and revitalizing the Mixtec language.<\/jats:p>","DOI":"10.3390\/data10070094","type":"journal-article","created":{"date-parts":[[2025,6,24]],"date-time":"2025-06-24T10:44:41Z","timestamp":1750761881000},"page":"94","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Mixtec\u2013Spanish Parallel Text Dataset for Language Technology Development"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5650-1240","authenticated-orcid":false,"given":"Hermilo","family":"Santiago-Benito","sequence":"first","affiliation":[{"name":"Facultad de Inform\u00e1tica, Universidad Aut\u00f3noma de Quer\u00e9taro, Av. de las Ciencias S\/N, Campus Juriquilla, Quer\u00e9taro 76230, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5657-7752","authenticated-orcid":false,"given":"Diana-Margarita","family":"C\u00f3rdova-Esparza","sequence":"additional","affiliation":[{"name":"Facultad de Inform\u00e1tica, Universidad Aut\u00f3noma de Quer\u00e9taro, Av. de las Ciencias S\/N, Campus Juriquilla, Quer\u00e9taro 76230, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6662-0390","authenticated-orcid":false,"given":"Juan","family":"Terven","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Ciencia Aplicada y Tecnolog\u00eda Avanzada\u2014Unidad Quer\u00e9taro, Instituto Polit\u00e9cnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Quer\u00e9taro 76090, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8083-3891","authenticated-orcid":false,"given":"No\u00e9-Alejandro","family":"Castro-S\u00e1nchez","sequence":"additional","affiliation":[{"name":"Centro Nacional de Investigaci\u00f3n y Desarrollo Tecnol\u00f3gico, Tecnol\u00f3gico Nacional de M\u00e9xico, Interior Internado Palmira S\/N, Palmira, Cuernavaca 62493, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5524-2002","authenticated-orcid":false,"given":"Teresa","family":"Garc\u00eda-Ramirez","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Ciencia Aplicada y Tecnolog\u00eda Avanzada\u2014Unidad Quer\u00e9taro, Instituto Polit\u00e9cnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Quer\u00e9taro 76090, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7257-7595","authenticated-orcid":false,"given":"Julio-Alejandro","family":"Romero-Gonz\u00e1lez","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Ciencia Aplicada y Tecnolog\u00eda Avanzada\u2014Unidad Quer\u00e9taro, Instituto Polit\u00e9cnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Quer\u00e9taro 76090, Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1304-6791","authenticated-orcid":false,"given":"Jos\u00e9 M.","family":"\u00c1lvarez-Alvarado","sequence":"additional","affiliation":[{"name":"Facultad de Ingenier\u00eda, Universidad Aut\u00f3noma de Quer\u00e9taro, Quer\u00e9taro 76010, Mexico"}]}],"member":"1968","published-online":{"date-parts":[[2025,6,21]]},"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/7\/94\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:56:20Z","timestamp":1760032580000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/7\/94"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,21]]},"references-count":0,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["data10070094"],"URL":"https:\/\/doi.org\/10.3390\/data10070094","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,6,21]]}}}