{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:59:15Z","timestamp":1764997155897,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2018,3,29]],"date-time":"2018-03-29T00:00:00Z","timestamp":1522281600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"University of Petra"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>This paper intends to present a large-scale dataset for Arabic morphology from a cognitive point of view considering the uniqueness of the root\u2013pattern phenomenon. The center of attention is focused on studying this singularity in terms of estimating associative relationships between roots as a higher level of abstraction for words meaning, and all their potential occurrences with multiple morpho-phonetic patterns. A major advantage of this approach resides in providing a novel balanced large-scale language resource, which can be viewed as an instantiated global root\u2013pattern network consisting of roots, patterns, stems, and particles, estimated statistically for studying the morpho-phonetic level of cognition of Arabic. In this context, this paper asserts that balanced root-distribution is an additional significant key criterion for evaluating topic coverage in an Arabic corpus. Furthermore, some additional novel probabilistic morpho-phonetic measures and their distribution have been estimated in the form of root and pattern entropies besides bi-directional conditional probabilities of bi-grams of stems, roots, and particles. Around 29.2 million webpages of ClueWeb were extracted, filtered from non-Arabic texts, and converted into a large textual dataset containing around 11.5 billion word forms and 9.3 million associative relationships. 
As this dataset is predominantly considering the root\u2013pattern phenomenon in Semitic languages, the acquired data might be significant support for researchers interested in studying phenomena of Arabic such as visual word cognition, morpho-phonetic perception, morphological analysis, and cognitively motivated query expansion, spell-checking, and information retrieval. Furthermore, based on data distribution and frequencies, constructing balanced corpora will be easier.<\/jats:p>","DOI":"10.3390\/data3020010","type":"journal-article","created":{"date-parts":[[2018,3,29]],"date-time":"2018-03-29T12:51:56Z","timestamp":1522327916000},"page":"10","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Associative Root\u2013Pattern Data and Distribution in Arabic Morphology"],"prefix":"10.3390","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7033-9964","authenticated-orcid":false,"given":"Bassam","family":"Haddad","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Petra, 11196 Amman, Jordan"}]},{"given":"Ahmad","family":"Awwad","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Petra, 11196 Amman, Jordan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3165-2477","authenticated-orcid":false,"given":"Mamoun","family":"Hattab","sequence":"additional","affiliation":[{"name":"Arabic Textware, 11181 Amman, Jordan"}]},{"given":"Ammar","family":"Hattab","sequence":"additional","affiliation":[{"name":"Brown University, Providence, RI 02912, USA"}]}],"member":"1968","published-online":{"date-parts":[[2018,3,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Baranyi, P., Csapo, A., and Sallai, G. (2015). Cognitive Infocommunications (CogInfoCom), Springer International Publishing.","DOI":"10.1007\/978-3-319-19608-4"},{"key":"ref_2","unstructured":"(2018, March 28). Arabic. 
Available online: http:\/\/en.wikipedia.org\/wiki\/Arabic_language."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Al-Thubaity, A., Khan, M., Al-Mazura, M., and Al-Mousa, M. (2013, January 17\u201319). New Language Resources for Arabic: Corpus Containing More Than Two Million Words and A Corpus Processing Tool. Proceedings of the International Conference on Asian Language Processing (IALP), Urumqi, China.","DOI":"10.1109\/IALP.2013.21"},{"key":"ref_4","unstructured":"Zaghouani, W. (2014, January 27). Critical Survey of the Freely Available Arabic Corpora. Proceedings of the Workshop on Free\/Open-Source Arabic Corpora and Corpora Processing Tools, Reykjavik, Iceland."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1075\/ijcl.11.2.02als","article-title":"The design of a corpus of contemporary Arabic","volume":"11","author":"Atwell","year":"2006","journal-title":"Int. J. Corpus Linguist."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"El-Haj, M., Kruschwitz, U., and Fox, C. (2014). Creating language resources for under-resourced languages: Methodologies, and experiments with Arabic. Language Resources and Evaluation, Springer.","DOI":"10.1007\/s10579-014-9274-3"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Haddad, B. (2007). Semantic Representation of Arabic: A logical Approach towards Compositionality and Generalized Arabic Quantifiers. Int. J. Comput. Process. Orient. Lang., 20.","DOI":"10.1142\/S0219427907001585"},{"key":"ref_8","unstructured":"Feldman, L. (1994). Morphological Factors in word Identification in Hebrew. Morphological Aspects of Language Processing, Erlbaum."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Boudelaa, S. (2014). Is the Arabic Mental Lexicon Morpheme-Based or Stem-Based? Implications for Spoken and Written Word Recognition. 
Handbook of Arabic Literacy, Literacy Studies 9, Springer.","DOI":"10.1007\/978-94-017-8545-7_2"},{"key":"ref_10","unstructured":"Haddad, B. (2018, March 28). Cognitive Aspects of a Statistical Language Model for Arabic based on Associative Probabilistic Root-PATtern Relations: A-APRoPAT. Available online: http:\/\/www.infocommunications.hu\/documents\/169298\/393366\/2013_4_2Haddad.pdf."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Haddad, B. (2012, January 2\u20135). Probabilistic Bi-Directional Root-Pattern Relationships as Cognitive Model for Semantic Processing of Arabic. Proceedings of the 3rd IEEE International Conference on Cognitive Infocommunication 2012, Kosice, Slovakia.","DOI":"10.1109\/CogInfoCom.2012.6421994"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Croft, W., and Cruse, D.A. (2004). Cognitive Linguistics, Cambridge University Press.","DOI":"10.1017\/CBO9780511803864"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1207\/s15516709cog1001_1","article-title":"An Introduction to Cognitive Grammar","volume":"10","author":"Langacker","year":"1986","journal-title":"Cogn. Sci."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Haddad, B. (2018). Cognitively-Motivated Query Abstraction Model based on Associative Root-Pattern Networks, To be published, draft is available upon request.","DOI":"10.1515\/jisys-2017-0549"},{"key":"ref_15","unstructured":"Haddad, B. (2009, January 6\u20138). Representation of Arabic Words: An Approach towards Probabilistic Root-Pattern Relationship. Proceedings of the International Conference on Knowledge Engineering and Ontology Development, Madeira, Portugal."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1142\/S0219427907001706","article-title":"Detection and Correction of Non-Words in Arabic: A Hybrid Approach","volume":"20","author":"Haddad","year":"2007","journal-title":"Int. J. Comput. Process. Orient. Lang. 
IJCPOL"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Haddad, B., El-Khalili, N., and Hattab, M. (2014, January 5\u20137). A Cognitive Query Model for Arabic based on Probabilistic Associative Morpho-Phonetic Sub-Networks. Proceedings of the 5th IEEE Conference on Cognitive Infocommunications-CogInfoCom, Vietri sul Mare, Italy.","DOI":"10.1109\/CogInfoCom.2014.7020463"},{"key":"ref_18","first-page":"107","article-title":"Language Engineering for Creating Relevance Corpus","volume":"9","author":"Haddad","year":"2015","journal-title":"Int. J. Softw. Eng. Appl."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Meyer, C.F. (2002). English Corpus Linguistics An Introduction, Cambridge University Press.","DOI":"10.1017\/CBO9780511606311"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"505","DOI":"10.1016\/j.pragma.2004.08.002","article-title":"Corpus-based approaches and discourse analysis in relation to reduplication and repetition","volume":"37","author":"Wang","year":"2005","journal-title":"J. Pragmat."},{"key":"ref_21","unstructured":"Alansary, S., Nagi, M., and Adly, N. (2018, March 28). Towards Analyzing the International Corpus of Arabic (ICA): Progress of Morphological Stage. Available online: https:\/\/www.researchgate.net\/profile\/Sameh_Alansary\/publication\/263541571_Towards_Analyzing_the_International_Corpus_of_Arabic_ICA_Progress_of_Morphological_Stage\/links\/0a85e53b2e2622211d000000\/Towards-Analyzing-the-International-Corpus-of-Arabic-ICA-Progress-of-Morphological-Stage.pdf."},{"key":"ref_22","unstructured":"Yu, C.-H., and Chen, H.-H. (2018, March 28). Chinese Web Scale Linguistic Datasets and Toolkit. Available online: http:\/\/www.aclweb.org\/anthology\/C12-3063."},{"key":"ref_23","unstructured":"Pomikalek, J., Jakubicek, M., and Rychly, P. (2018, March 28). Building a 70 Billion Word Corpus of English from ClueWeb. 
Available online: http:\/\/www.lrec-conf.org\/proceedings\/lrec2012\/pdf\/1047_Paper.pdf."},{"key":"ref_24","unstructured":"Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., and Suchomel, V. (2018, March 28). arTenTen: A new, vast corpus for Arabic. Available online: https:\/\/www.sketchengine.co.uk\/wp-content\/uploads\/arTenTen_corpus_for_Arabic_2013.pdf."},{"key":"ref_25","unstructured":"Eckart, T., Alshargi, F., Quasthoff, U., and Goldhahn, D. (2018, March 28). Large Arabic Web Corpora of High Quality: The Dimensions Time and Origin. Available online: http:\/\/www.lrec-conf.org\/proceedings\/lrec2014\/workshops\/LREC2014Workshop-OSACT%20Proceedings.pdf#page=35."},{"key":"ref_26","unstructured":"Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. (2018, March 28). The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. Available online: https:\/\/www.researchgate.net\/profile\/Mohamed_Maamouri\/publication\/228693973_The_penn_arabic_treebank_Building_a_large-scale_annotated_arabic_corpus\/links\/0046351802c78190c5000000.pdf."},{"key":"ref_27","unstructured":"Yaseen, M., Attia, M., Maegaard, B., Choukri, K., Paulsson, N., Haamid, S., Krauwer, S., Bendahman, C., Fersoe, H., and Rashwan, M. (2018, March 28). Building Annotated Written and Spoken Arabic LR\u2019s in NEMLAR Project. Available online: https:\/\/pdfs.semanticscholar.org\/95d7\/1fc0a2de2228d62372026ff0913cf2a83959.pdf."},{"key":"ref_28","unstructured":"Alrabiah, M., Al-Salman, A., and Atwell, E. (2013, January 22). The design and construction of the 50 million words KSUCCA. Proceedings of the Second Workshop on Arabic Corpus Linguistics (WACL-2), Lancashire, UK."},{"key":"ref_29","unstructured":"Hattab, M., Haddad, B., Yaseen, M., Duraidi, A., and Shmias, A.A. (2018, March 28). Addaall Arabic Search Engine: Improving Search based on Combination of Morphological Analysis and Generation Considering Semantic Patterns. 
Available online: http:\/\/fafs.uop.edu.jo\/download\/research\/members\/202_778_Mamo.pdf."},{"key":"ref_30","unstructured":"Fischer, W. (1972). Grammatik des Klassischen Arabisch, Harrassowitz Verlag."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/3\/2\/10\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T14:59:01Z","timestamp":1760194741000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/3\/2\/10"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,3,29]]},"references-count":30,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2018,6]]}},"alternative-id":["data3020010"],"URL":"https:\/\/doi.org\/10.3390\/data3020010","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2018,3,29]]}}}