{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T19:03:34Z","timestamp":1772823814543,"version":"3.50.1"},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T00:00:00Z","timestamp":1720396800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T00:00:00Z","timestamp":1720396800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Discov Artif Intell"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Twitter is a rich resource for analyzing the contents of social media and extracting the age groups of users can be beneficial for recommender systems, marketing and advertising. Age detection task is an aspect of demographic information of users. In this study a large-scale corpus of Arabic Twitter users including 181k user profiles with diverse age groups consisting of \u221218, 18\u201324, 25\u201334, 35\u201349, 50\u201364, +65 is presented. The corpus is created by four methods: (1) collecting publicly available birthday announcement tweets using the Twitter Search application programming interface, (2) augmenting data, (3) fetching verified accounts, and (4) manual annotation. To have a best age detection model on the presented corpus, different evaluations are tested to find the model with highest accuracy and efficiency. Number of tweets, regression vs. classification, using metadata of users and tweets, using LSTM+CNN model vs. BERT are some parts of examinations done. 
Presented methodology is based on language and metadata features and final model is fine-tuned with BERT on 70k users and evaluated on 8200 manually annotated users. We show that our best model, compared with LSTM+CNN model and BERT-based similar model yields an improvement of up to 9% in F1-score and increment of 5% in accuracy, respectively. The model achieved macro-averaged F1-score of 44 on six age groups, and F1-score of 58 on three age groups of \u221225, 25\u201334, +35. The link of our proposed data is provided here: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"http:\/\/www.github.com\/exaco\/ExaAUAC\">www.github.com\/exaco\/ExaAUAC<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s44163-024-00145-0","type":"journal-article","created":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T16:02:18Z","timestamp":1720454538000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["ExaAUAC: Arabic Twitter user age prediction corpus based on language and metadata features"],"prefix":"10.1007","volume":"4","author":[{"given":"Reyhaneh","family":"Sadeghi","sequence":"first","affiliation":[]},{"given":"Ahmad","family":"Akbari","sequence":"additional","affiliation":[]},{"given":"Mohammad Mehdi","family":"Jaziriyan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,8]]},"reference":[{"key":"145_CR1","doi-asserted-by":"crossref","unstructured":"Abdul-Mageed M, Elmadany A, Nagoudi E. ARBERT & MARBERT: deep bidirectional transformers for Arabic. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021; 7088-7105","DOI":"10.18653\/v1\/2021.acl-long.551"},{"key":"145_CR2","unstructured":"Antoun W, Baly F, Hajj H. 
Arabert: transformer-based model for arabic language understanding. LREC 2020 Workshop Language Resources and Evaluation Conference, 2020; 9"},{"key":"145_CR3","first-page":"120","volume":"25","author":"G Bradski","year":"2000","unstructured":"Bradski G. The OpenCV Library. Dr Dobb\u2019s J Softw Tools. 2000;25:120. https:\/\/github.com\/opencv\/opencv\/wiki\/CiteOpenCV.","journal-title":"Dr. Dobb\u2019s J Softw Tools"},{"key":"145_CR4","doi-asserted-by":"crossref","unstructured":"Chamberlain B, Humby C, Deisenroth M. Probabilistic inference of twitter users\u2019 age based on what they follow. Lecture Notes In Computer Science (including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics). 10536 LNAI 2017; 191-203","DOI":"10.1007\/978-3-319-71273-4_16"},{"key":"145_CR5","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1613\/jair.4935","volume":"55","author":"A Culotta","year":"2016","unstructured":"Culotta A, Ravi N, Cutler J. Predicting twitter user demographics using distant supervision from website traffic data. J Artif Intell Res. 2016;55:389\u2013408.","journal-title":"J Artif Intell Res"},{"key":"145_CR6","unstructured":"Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings Of The 2019 Conference Of The North American Chapter Of The Association For Computational Linguistics: Human Language Technologies, Volume 1 (Long And Short Papers). 2019; 4171-4186, https:\/\/aclanthology.org\/N19-1423."},{"key":"145_CR7","doi-asserted-by":"publisher","first-page":"378","DOI":"10.1037\/h0031619","volume":"76","author":"J Fleiss","year":"1971","unstructured":"Fleiss J. Measuring nominal scale agreement among many raters. Psychol Bull. 
1971;76:378\u201382.","journal-title":"Psychol Bull"},{"key":"145_CR8","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1093\/llc\/fqi024","volume":"20","author":"P Juola","year":"2005","unstructured":"Juola P, Baayen R. A controlled-corpus experiment in authorship identification by cross-entropy. Lit Linguist Comput. 2005;20:59\u201367. https:\/\/doi.org\/10.1093\/llc\/fqi024.","journal-title":"Lit Linguist Comput"},{"key":"145_CR9","doi-asserted-by":"crossref","unstructured":"Klein A, Magge A, Gonzalez-Hernandez G. ReportAGE: automatically extracting the exact age of twitter users based on self-reports in tweets. PLoS ONE. 2022;17: e0262087. https:\/\/journals.plos.org\/plosone\/article\/citation?id=10.1371\/journal.pone.0262087.","DOI":"10.1371\/journal.pone.0262087"},{"key":"145_CR10","doi-asserted-by":"crossref","unstructured":"Morgan-Lopez A, Kim A, Chew R, Ruddle P. Predicting age groups of Twitter users based on language and metadata features. PLoS ONE. 2017;12: e0183537. https:\/\/journals.plos.org\/plosone\/article\/citation?id=10.1371\/journal.pone.0183537.","DOI":"10.1371\/journal.pone.0183537"},{"key":"145_CR11","doi-asserted-by":"crossref","unstructured":"Mubarak H, Hassan S, Abdelali A. Adult content detection on arabic twitter: analysis and experiments. 2021.","DOI":"10.1007\/978-3-030-60975-7_18"},{"key":"145_CR12","doi-asserted-by":"crossref","unstructured":"Pandya A, Oussalah M, Monachesi P, Kostakos P, Loven L. On the Use of URLs and Hashtags in Age Prediction of Twitter Users. 2018 IEEE International Conference On Information Reuse And Integration (IRI). 2018; 62-69.","DOI":"10.1109\/IRI.2018.00017"},{"key":"145_CR13","doi-asserted-by":"crossref","unstructured":"Pandya A, Oussalah M, Monachesi P, Kostakos P. On the use of distributed semantics of tweet metadata for user age prediction. Future Gener Comput Syst. 
2020;102:437\u201352.","DOI":"10.1016\/j.future.2019.08.018"},{"key":"145_CR14","doi-asserted-by":"publisher","first-page":"547","DOI":"10.1146\/annurev.psych.54.101601.145041","volume":"54","author":"J Pennebaker","year":"2003","unstructured":"Pennebaker J, Mehl M, Niederhoffer K. Psychological aspects of natural language. Use: our words, our selves. Ann Rev Psychol. 2003;54:547\u201377.","journal-title":"Ann Rev Psychol"},{"key":"145_CR15","doi-asserted-by":"crossref","unstructured":"Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep Contextualized Word Representations. Proceedings Of The 2018 Conference Of The North American Chapter Of The Association For Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018; 2227-2237, https:\/\/aclanthology.org\/N18-1202.","DOI":"10.18653\/v1\/N18-1202"},{"key":"145_CR16","unstructured":"Rangel F, Rosso P, Koppel M, Stamatatos E, Inches G. Overview of the Author Profiling Task at PAN 2013. Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS. Org. 2013; 1179."},{"key":"145_CR17","unstructured":"Rangel F, Rosso P, Chugur I, Potthast M, Trenkmann M, Stein B, Verhoeven B, Daelemans W. Overview of the 2nd Author Profiling Task at PAN 2014. Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS. Org. 2014; 1180."},{"key":"145_CR18","first-page":"70","volume":"2517","author":"F Rangel","year":"2019","unstructured":"Rangel F, Rosso P, Charfi A, Zaghouani W, Ghanem B, Sanchez-Junquera J. Overview of the track on author profiling and deception detection in Arabic. Work Notes FIRE 2019 CEUR-WS Org. 2019;2517:70\u201383.","journal-title":"Work Notes FIRE 2019 CEUR-WS Org"},{"key":"145_CR19","doi-asserted-by":"crossref","unstructured":"Safaya A, Abdullatif M, Yuret D. Kuisail at semeval-2020 task 12: Bert-cnn for offensive speech identification in social media. 
Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020; 2054-2059.","DOI":"10.18653\/v1\/2020.semeval-1.271"},{"key":"145_CR20","unstructured":"Schler J, Koppel M, Argamon S, Pennebaker J. Effects of age and gender on blogging. AAAI Spring Symposium: Computational Approaches To Analyzing Weblogs. 2006."},{"key":"145_CR21","unstructured":"Talafha B, Ali M, Za\u2019ter M, Seelawi H, Tuffaha I, Samir M, Farhan W, Al-Natsheh H. Multi-dialect Arabic BERT for Country-level Dialect Identification. Proceedings of the Fifth Arabic Natural Language Processing Workshop. 2020; pp. 111-118, https:\/\/aclanthology.org\/2020.wanlp-1.10."},{"key":"145_CR22","first-page":"84","volume":"2517","author":"C Zhang","year":"2019","unstructured":"Zhang C, Abdul-Mageed M. BERT-based Arabic social media author profiling. CEUR Workshop Proc. 2019;2517:84\u201391.","journal-title":"CEUR Workshop Proc"}],"container-title":["Discover Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-024-00145-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44163-024-00145-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-024-00145-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T16:52:54Z","timestamp":1720457574000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44163-024-00145-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,8]]},"references-count":22,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["145"],"URL":"https:\/\/doi.org\/10.1007\/s44163-024-00145-0","re
lation":{},"ISSN":["2731-0809"],"issn-type":[{"value":"2731-0809","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,8]]},"assertion":[{"value":"23 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no Competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"48"}}