{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T16:46:10Z","timestamp":1780505170792,"version":"3.54.1"},"reference-count":28,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,10,26]],"date-time":"2021-10-26T00:00:00Z","timestamp":1635206400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,26]],"date-time":"2021-10-26T00:00:00Z","timestamp":1635206400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Twitter and social media as a whole have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions, making them susceptible to false positives because keyword volume can be influenced by several social phenomena that may be unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper, we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. 
Specifically, we compare continuous representations against one-hot encoding for word-based, class-based (ontology-based) and subword units in the form of byte pair encodings. We also establish the desirable performance characteristics for multilingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.<\/jats:p>","DOI":"10.1186\/s40537-021-00528-5","type":"journal-article","created":{"date-parts":[[2021,10,26]],"date-time":"2021-10-26T18:02:42Z","timestamp":1635271362000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web"],"prefix":"10.1186","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2060-8625","authenticated-orcid":false,"given":"Mark Abraham","family":"Magumba","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Peter","family":"Nabende","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,10,26]]},"reference":[{"key":"528_CR1","doi-asserted-by":"crossref","unstructured":"Lee K, Agrawal A, Choudhary A. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 2013; p. 1474\u201377.","DOI":"10.1145\/2487575.2487709"},{"issue":"8","key":"528_CR2","doi-asserted-by":"publisher","first-page":"e103408","DOI":"10.1371\/journal.pone.0103408","volume":"9","author":"MJ Paul","year":"2014","unstructured":"Paul MJ, Dredze M. Discovering health topics in social media using topic models. PLoS ONE. 
2014;9(8):e103408.","journal-title":"PLoS ONE"},{"key":"528_CR3","doi-asserted-by":"publisher","first-page":"739","DOI":"10.1007\/978-3-319-46227-1_46","volume-title":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases","author":"RC Souza","year":"2016","unstructured":"Souza RC, Assun\u00e7\u00e3o RM, de Oliveira DM, de Brito DE, Meira W. Infection hot spot mining from social media trajectories. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer; 2016. p. 739\u201355."},{"key":"528_CR4","unstructured":"Aramaki E, Maskawa S, Morita M. Twitter catches the flu: detecting influenza epidemics using Twitter. In: Proceedings of the 2011 Conference on empirical methods in natural language processing. 2011. pp. 1568\u20131576."},{"key":"528_CR5","unstructured":"Beswick A. # Outbreak: An Exploration of Twitter metadata as a means to supplement influenza surveillance in Canada during the 2013\u20132014 influenza season (Doctoral dissertation); 2016."},{"key":"528_CR6","doi-asserted-by":"crossref","unstructured":"Doan S, Ohno-Machado L, Collier N. Enhancing Twitter data analysis with simple semantic filtering: Example in tracking influenza-like illnesses. In: 2012 IEEE second international conference on healthcare informatics, imaging and systems biology. IEEE; 2012. p. 62\u201371.","DOI":"10.1109\/HISB.2012.21"},{"key":"528_CR7","unstructured":"Lamb A, Paul M., & Dredze, M. Separating fact from fear: Tracking flu infections on twitter. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2013. p. 789\u2013795."},{"issue":"24","key":"528_CR8","doi-asserted-by":"publisher","first-page":"2940","DOI":"10.1093\/bioinformatics\/btn534","volume":"24","author":"N Collier","year":"2008","unstructured":"Collier N, Doan S, Kawazoe A, Goodwin RM, Conway M, Tateno, & Taniguchi, K. 
BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics. 2008;24(24):2940\u20131.","journal-title":"Bioinformatics"},{"key":"528_CR9","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1007\/978-1-4419-1278-7_14","volume-title":"Infectious Disease Informatics","author":"H Chen","year":"2010","unstructured":"Chen H, Zeng D, Yan P. HealthMap. In: Infectious Disease Informatics. New York: Springer; 2010. p. 183\u20136."},{"issue":"21","key":"528_CR10","doi-asserted-by":"publisher","first-page":"2153","DOI":"10.1056\/NEJMp0900702","volume":"360","author":"JS Brownstein","year":"2009","unstructured":"Brownstein JS, Freifeld CC, Madoff LC. Digital disease detection\u2014harnessing the Web for public health surveillance. N Engl J Med. 2009;360(21):2153.","journal-title":"N Engl J Med"},{"key":"528_CR11","doi-asserted-by":"crossref","unstructured":"Mutuvi S, Boros E, Doucet A, Lejeune G, Jatowt A, Odeo M. Multilingual Epidemiological Text Classification: A Comparative Study. In COLING, International Conference on Computational Linguistics; 2020.","DOI":"10.18653\/v1\/2020.coling-main.543"},{"key":"528_CR12","unstructured":"Mikolov T, Le QV, Sutskever I. Exploiting similarities among languages for machine translation; 2013. arXiv preprint arXiv:1309.4168."},{"key":"528_CR13","unstructured":"Klementiev A, Titov I, Bhattarai B. Inducing crosslingual distributed representations of words. In: Proceedings of COLING 2012; p. 1459\u201374."},{"key":"528_CR14","unstructured":"Conneau A, Lample G, Ranzato MA, Denoyer L, J\u00e9gou H. Word translation without parallel data; 2017. arXiv preprint arXiv:1710.04087."},{"key":"528_CR15","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in neural information processing systems; 2017, p. 5998\u20136008."},{"key":"528_CR16","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. 
Bert: Pre-training of deep bidirectional transformers for language understanding; 2018. arXiv preprint arXiv:1810.04805."},{"key":"528_CR17","doi-asserted-by":"crossref","unstructured":"Pires T, Schlinger E, Garrette D. How multilingual is multilingual BERT?. 2019; arXiv preprint arXiv:1906.01502.","DOI":"10.18653\/v1\/P19-1493"},{"key":"528_CR18","doi-asserted-by":"crossref","unstructured":"Cunningham H, Maynard D, Bontcheva K, Tablan V. GATE: an architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002, p. 168\u201375.","DOI":"10.3115\/1073083.1073112"},{"key":"528_CR19","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1007\/978-94-010-0201-1_1","volume":"1","author":"A Taylor","year":"2003","unstructured":"Taylor A, Marcus M, Santorini B. The Penn treebank: an overview. Treebanks. 2003;1:5\u201322.","journal-title":"Treebanks"},{"key":"528_CR20","first-page":"38","volume-title":"International Conference on Hybrid Artificial Intelligence Systems","author":"MA Magumba","year":"2017","unstructured":"Magumba MA, Nabende P. An ontology for generalized disease incidence detection on twitter. In: International Conference on Hybrid Artificial Intelligence Systems. Cham: Springer; 2017. p. 38\u201351."},{"key":"528_CR21","first-page":"279","volume":"121","author":"K Donnelly","year":"2006","unstructured":"Donnelly K. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform. 2006;121:279.","journal-title":"Stud Health Technol Inform"},{"issue":"11","key":"528_CR22","doi-asserted-by":"publisher","first-page":"1251","DOI":"10.1038\/nbt1346","volume":"25","author":"B Smith","year":"2007","unstructured":"Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 
2007;25(11):1251\u20135.","journal-title":"Nat Biotechnol."},{"key":"528_CR23","unstructured":"Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space; 2013. arXiv preprint arXiv:1301.3781."},{"key":"528_CR24","unstructured":"Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning; 2014. p. 1188\u201396. PMLR."},{"key":"528_CR25","doi-asserted-by":"crossref","unstructured":"Kim Y. Convolutional neural networks for sentence classification; 2014. arXiv preprint arXiv:1408.5882.","DOI":"10.3115\/v1\/D14-1181"},{"key":"528_CR26","unstructured":"Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks; 2010."},{"key":"528_CR27","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"528_CR28","doi-asserted-by":"crossref","unstructured":"Wolf T, Debut L, Sanh V, Chaumond, J, Delangue C, Moi A, et al. Huggingface's transformers: State-of-the-art natural language processing; 2019. 
arXiv preprint arXiv:1910.03771.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00528-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-021-00528-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00528-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,10]],"date-time":"2024-09-10T21:05:00Z","timestamp":1726002300000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-021-00528-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,26]]},"references-count":28,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["528"],"URL":"https:\/\/doi.org\/10.1186\/s40537-021-00528-5","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,26]]},"assertion":[{"value":"28 June 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 October 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"We, the 
authors, consent to the publication of this manuscript in the Journal of Big Data.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"139"}}