{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T09:13:41Z","timestamp":1766049221377,"version":"3.37.3"},"reference-count":52,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2019,10,1]],"date-time":"2019-10-01T00:00:00Z","timestamp":1569888000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,10,1]],"date-time":"2019-10-01T00:00:00Z","timestamp":1569888000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000092","name":"U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine","doi-asserted-by":"publisher","award":["R01LM011176","R01LM011176","R01LM011176"],"award-info":[{"award-number":["R01LM011176","R01LM011176","R01LM011176"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes\u2014the leading cause of infant mortality\u2014could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms\u2014feature-engineered and deep learning-based classifiers\u2014that automatically distinguish tweets referring to the user\u2019s pregnancy outcome from tweets that merely mention birth defects. 
Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F<jats:sub>1<\/jats:sub>-score of 0.65 for the \u201cdefect\u201d class and 0.51 for the \u201cpossible defect\u201d class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.<\/jats:p>","DOI":"10.1038\/s41746-019-0170-5","type":"journal-article","created":{"date-parts":[[2019,10,1]],"date-time":"2019-10-01T10:02:32Z","timestamp":1569924152000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":17,"title":["Towards scaling Twitter for digital epidemiology of birth defects"],"prefix":"10.1038","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8281-3464","authenticated-orcid":false,"given":"Ari Z.","family":"Klein","sequence":"first","affiliation":[]},{"given":"Abeed","family":"Sarker","sequence":"additional","affiliation":[]},{"given":"Davy","family":"Weissenbacher","sequence":"additional","affiliation":[]},{"given":"Graciela","family":"Gonzalez-Hernandez","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2019,10,1]]},"reference":[{"key":"170_CR1","first-page":"2000","volume":"64","author":"TJ 
Mathews","year":"2015","unstructured":"Mathews, T. J., MacDorman, M. F. & Thoma, M. E. Infant mortality statistics from the 2013 period linked birth\/infant death data set. Natl Vital. Stat. Rep. 64, 2000\u20132013 (2015).","journal-title":"Natl Vital. Stat. Rep."},{"key":"170_CR2","doi-asserted-by":"publisher","first-page":"e39","DOI":"10.1016\/j.whi.2012.10.003","volume":"23","author":"MC Blehar","year":"2013","unstructured":"Blehar, M. C. et al. Enrolling pregnant women: issues in clinical research. Women's Health Issues 23, e39\u2013e345 (2013).","journal-title":"Women's Health Issues"},{"key":"170_CR3","doi-asserted-by":"publisher","first-page":"410","DOI":"10.1016\/j.clindermatol.2016.02.014","volume":"34","author":"RI Hartman","year":"2016","unstructured":"Hartman, R. I. & Kimball, A. B. Performing research in pregnancy: challenges and perspectives. Clin. Dermatol. 34, 410\u2013415 (2016).","journal-title":"Clin. Dermatol."},{"key":"170_CR4","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1053\/sper.2001.24567","volume":"25","author":"RM Ward","year":"2001","unstructured":"Ward, R. M. Difficulties in the study of adverse fetal and neonatal effects of drug therapy during pregnancy. Semin. Perinatol. 25, 191\u2013195 (2001).","journal-title":"Semin. Perinatol."},{"key":"170_CR5","doi-asserted-by":"publisher","first-page":"215","DOI":"10.2165\/00002018-200427040-00001","volume":"27","author":"DL Kennedy","year":"2004","unstructured":"Kennedy, D. L., Uhl, K. & Kweder, S. L. Pregnancy exposure registries. Drug Saf. 27, 215\u2013228 (2004).","journal-title":"Drug Saf."},{"key":"170_CR6","unstructured":"US Department of Health and Human Services, Food and Drug Administration. Reviewer Guidance: Evaluating the Risks of Drug Exposure in Human Pregnancies. (US Department of Health and Human Services, Food and Drug Administration, 2005). 
https:\/\/www.fda.gov\/downloads\/Drugs\/%E2%80%A6\/Guidances\/ucm071645.pdf."},{"key":"170_CR7","doi-asserted-by":"crossref","first-page":"779","DOI":"10.1002\/pds.3659","volume":"23","author":"S Sinclair","year":"2014","unstructured":"Sinclair, S. et al. Advantages and problems with pregnancy registries: observations and surprises throughout the life of the International Lamotrigine Pregnancy Registry. Pharmacoepidemiol. Drug Saf. 23, 779\u2013786 (2014).","journal-title":"Pharmacoepidemiol. Drug Saf."},{"key":"170_CR8","unstructured":"Gliklich R. E., Dreyer, N. A. & Leavy, M. B. Registries for evaluating patient outcomes: a user\u2019s guide 3rd edn, (Agency for Healthcare Research and Quality, Rockville, MD, 2014)."},{"key":"170_CR9","unstructured":"Smith, A. & Anderson, M. Social media use in 2018. http:\/\/www.pewinternet.org\/2018\/03\/01\/social-media-use-in-2018\/ (2018)."},{"key":"170_CR10","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1016\/j.jbi.2018.10.001","volume":"87","author":"AZ Klein","year":"2018","unstructured":"Klein, A. Z., Sarker, A., Cai, H., Weissenbacher, D. & Gonzalez-Hernandez, G. Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J. Biomed. Inform. 87, 68\u201378 (2018).","journal-title":"J. Biomed. Inform."},{"key":"170_CR11","doi-asserted-by":"publisher","DOI":"10.2196\/jmir.8164","volume":"19","author":"A Sarker","year":"2017","unstructured":"Sarker, A. et al. Discovering cohorts of pregnant women from social media for safety surveillance and analysis. J. Med. Internet Res. 19, e361 (2017).","journal-title":"J. Med. Internet Res."},{"key":"170_CR12","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1007\/s40264-018-0731-6","volume":"42","author":"S Golder","year":"2019","unstructured":"Golder, S. et al. Pharmacoepidemiologic evaluation of birth defects from health-related postings in social media during pregnancy. 
Drug Saf. 42, 389\u2013400 (2019).","journal-title":"Drug Saf."},{"key":"170_CR13","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1097\/OGX.0000000000000405","volume":"72","author":"BS Harris","year":"2017","unstructured":"Harris, B. S. et al. Risk factors for birth defects. Obstet. Gynecol. Surv. 72, 123\u2013135 (2017).","journal-title":"Obstet. Gynecol. Surv."},{"key":"170_CR14","doi-asserted-by":"crossref","unstructured":"Klein, A., Sarker, A., Rouhizadeh, M., O\u2019Connor, K. & Gonzalez, G. Detecting personal medication intake in twitter: an annotated corpus and baseline classification system. In Proc. BioNLP 2017 Workshop 136\u2013142 (Association for Computational Linguistics, 2017).","DOI":"10.18653\/v1\/W17-2316"},{"key":"170_CR15","doi-asserted-by":"publisher","first-page":"j2249","DOI":"10.1136\/bmj.j2249","volume":"357","author":"ML Feldkamp","year":"2017","unstructured":"Feldkamp, M. L., Carey, J. C., Byrne, J. L. B., Krikov, S. & Botto, L. D. Etiology and clinical presentation of birth defects: population based study. BMJ 357, j2249 (2017).","journal-title":"BMJ"},{"key":"170_CR16","doi-asserted-by":"publisher","first-page":"208","DOI":"10.1002\/pds.4150","volume":"26","author":"K Gelperin","year":"2017","unstructured":"Gelperin, K. et al. A systematic review of pregnancy exposure registries: examination of protocol-specified pregnancy outcomes, target sample size, and comparator selection. Pharmacoepidemiol. Drug Saf. 26, 208\u2013214 (2017).","journal-title":"Pharmacoepidemiol. Drug Saf."},{"key":"170_CR17","doi-asserted-by":"crossref","unstructured":"Hong, L. & Davison, B. D. Empirical study of topic modeling in Twitter. In Proc. 1st Workshop on Social Media Analytics (SOMA) 80\u201388 (2010).","DOI":"10.1145\/1964858.1964870"},{"key":"170_CR18","first-page":"1","volume":"57","author":"L Rynn","year":"2008","unstructured":"Rynn, L., Cragan, J. & Correa, A. 
Update on overall prevalence of major birth defects: Atlanta, Georgia, 1978\u20132005. MMWR Morb. Mortal. Wkly. Rep. 57, 1\u20135 (2008).","journal-title":"MMWR Morb. Mortal. Wkly. Rep."},{"key":"170_CR19","doi-asserted-by":"crossref","unstructured":"De Choudhury, M., Counts, S. & Horvitz, E. Predicting postpartum changes in emotion and behavior via social media. In Proc. SIGCHI Conference on Human Factors in Computing Systems 3267\u20133276 (ACM, 2013).","DOI":"10.1145\/2470654.2466447"},{"key":"170_CR20","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1016\/j.ajog.2008.11.033","volume":"200","author":"T Pearlstein","year":"2009","unstructured":"Pearlstein, T., Howard, M., Salisbury, A. & Zlotnick, C. Postpartum depression. Am. J. Obstet. Gynecol. 200, 357\u2013364 (2009).","journal-title":"Am. J. Obstet. Gynecol."},{"key":"170_CR21","doi-asserted-by":"publisher","first-page":"220","DOI":"10.1016\/j.eswa.2016.12.035","volume":"73","author":"G Haixiang","year":"2017","unstructured":"Haixiang, G. et al. Learning from class-imbalanced data: reviews of methods and applications. Expert Syst. Appl. 73, 220\u2013239 (2017).","journal-title":"Expert Syst. Appl."},{"key":"170_CR22","doi-asserted-by":"publisher","first-page":"429","DOI":"10.3233\/IDA-2002-6504","volume":"6","author":"N Japkowicz","year":"2002","unstructured":"Japkowicz, N. & Stephen, S. The class imbalance problem: a systematic study. Intell. Data Anal. 6, 429\u2013449 (2002).","journal-title":"Intell. Data Anal."},{"key":"170_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1007730.1007733","volume":"6","author":"NV Chawla","year":"2004","unstructured":"Chawla, N. V., Japkowicz, N. & Kotcz, A. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl. 6, 1\u20136 (2004).","journal-title":"ACM SIGKDD Explor Newsl."},{"key":"170_CR24","doi-asserted-by":"crossref","unstructured":"Wang S. et al. Training deep neural networks on imbalanced data sets. 
In Proc. International Joint Conference on Neural Networks 4368\u20134374 (IEEE, 2016).","DOI":"10.1109\/IJCNN.2016.7727770"},{"key":"170_CR25","doi-asserted-by":"publisher","first-page":"196","DOI":"10.1016\/j.jbi.2014.11.002","volume":"53","author":"A Sarker","year":"2015","unstructured":"Sarker, A. & Gonzalez, G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J. Biomed. Inform. 53, 196\u2013207 (2015).","journal-title":"J. Biomed. Inform."},{"key":"170_CR26","doi-asserted-by":"publisher","first-page":"1274","DOI":"10.1093\/jamia\/ocy114","volume":"25","author":"A Sarker","year":"2018","unstructured":"Sarker, A. et al. Data and systems for medication-related text classification and concept normalization from Twitter: insights from Social Media Mining for Health (SMM4H) 2017 shared task. J. Am. Med. Inform. Assoc. 25, 1274\u20131283 (2018).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"170_CR27","unstructured":"Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https:\/\/arxiv.org\/abs\/1810.04805 (2018)."},{"key":"170_CR28","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1016\/j.neunet.2018.07.011","volume":"106","author":"M Buda","year":"2018","unstructured":"Buda, M., Maki, A. & Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249\u2013259 (2018).","journal-title":"Neural Netw."},{"key":"170_CR29","doi-asserted-by":"publisher","first-page":"972","DOI":"10.1002\/bdra.23461","volume":"103","author":"CT Mai","year":"2015","unstructured":"Mai, C. T. et al. Population-based birth defects data in the United States, 2008\u20132012: presentation of state-specific data and descriptive brief on variability of prevalence. Birth Defects Res. A Clin. Mol. Teratol. 103, 972\u2013993 (2015).","journal-title":"Birth Defects Res. A Clin. Mol. 
Teratol."},{"key":"170_CR30","doi-asserted-by":"publisher","first-page":"706","DOI":"10.1002\/bdra.20308","volume":"76","author":"Q Yang","year":"2006","unstructured":"Yang, Q. et al. Racial differences in infant mortality attributable to birth defects in the United States, 1989\u20132002. Birth Defects Res. A Clin. Mol. Teratol. 76, 706\u2013713 (2006).","journal-title":"Birth Defects Res. A Clin. Mol. Teratol."},{"key":"170_CR31","doi-asserted-by":"publisher","first-page":"2995","DOI":"10.1161\/CIRCULATIONAHA.106.183216","volume":"115","author":"KJ Jenkins","year":"2007","unstructured":"Jenkins, K. J. et al. Noninherited risk factors and congenital cardiovascular defects: current knowledge. Circulation 115, 2995\u20133014 (2007).","journal-title":"Circulation"},{"key":"170_CR32","first-page":"975","volume":"5","author":"TF Wu","year":"2004","unstructured":"Wu, T. F., Lin, C. J. & Weng, R. C. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5, 975\u20131005 (2004).","journal-title":"J. Mach. Learn. Res."},{"key":"170_CR33","doi-asserted-by":"crossref","unstructured":"Rouhizadeh, M., Magge, A., Klein, A., Sarker, A. & Gonzalez, G. A rule-based approach to determining pregnancy timeframe from contextual social media postings. In Proc. 8th International Conference on Digital Health 16\u201320 (ACM, 2018).","DOI":"10.1145\/3194658.3194679"},{"key":"170_CR34","unstructured":"National Birth Defects Prevention Network. Guidelines for conducting birth defects surveillance. (National Birth Defects Prevention Network, 2004). https:\/\/www.nbdpn.org\/docs\/NBDPN_Guidelines2012.pdf."},{"key":"170_CR35","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1002\/bdra.20351","volume":"79","author":"Metropolitan Atlanta Congenital Defects Program.","year":"2007","unstructured":"Metropolitan Atlanta Congenital Defects Program. Executive summary. Birth Defects Res. A Clin. Mol. Teratol. 
79, 66\u201393 (2007).","journal-title":"Birth Defects Res. A Clin. Mol. Teratol."},{"key":"170_CR36","unstructured":"Fornoff, J. E. & Shen, T. Birth defects and other adverse pregnancy outcomes in Illinois 2005\u20132009: a report on county-specific prevalence (2013). http:\/\/www.dph.illinois.gov\/sites\/default\/files\/publications\/ers14-03-birth-defects-inillinois-2005-2009-041516.pdf."},{"key":"170_CR37","unstructured":"EUROCAT. Guide 1.4: Instruction for the registration of congenital anomalies. (EUROCAT, 2013) http:\/\/www.eurocat-network.eu\/content\/Section%203.3-%2027_Oct2016.pdf."},{"key":"170_CR38","unstructured":"US National Library of Medicine. UMLS Reference Manual. (US National Library of Medicine, 2009) https:\/\/www.ncbi.nlm.nih.gov\/books\/NBK9676\/pdf\/Bookshelf_NBK9676.pdf."},{"key":"170_CR39","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1016\/j.jbi.2018.11.007","volume":"88","author":"A Sarker","year":"2018","unstructured":"Sarker, A. & Gonzalez-Hernandez, G. An unsupervised and customizable misspelling generator for mining noisy health-related text sources. J. Biomed. Inform. 88, 98\u2013107 (2018).","journal-title":"J. Biomed. Inform."},{"key":"170_CR40","first-page":"360","volume":"37","author":"AJ Viera","year":"2005","unstructured":"Viera, A. J. & Garrett, J. M. Understanding interobserver agreement: the kappa statistic. Fam. Med. 37, 360\u2013363 (2005).","journal-title":"Fam. Med."},{"key":"170_CR41","unstructured":"McCallum, A. & Nigam, K. A comparison of event models for na\u00efve bayes text classification. In Proc. AAAI-98 Learning for Text Categorization Workshop 41\u201348 (AAAI, 1998)."},{"key":"170_CR42","unstructured":"El-Manzalawy, Y. & Honavar, V. WLSVM: integrating LibSVM into Weka environment. http:\/\/www.cs.iastate.edu\/~yasser\/wlsvm (2005)."},{"key":"170_CR43","doi-asserted-by":"crossref","unstructured":"Chang, C. & Lin, C. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 
2, (2011).","DOI":"10.1145\/1961189.1961199"},{"key":"170_CR44","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735\u20131780 (1997).","journal-title":"Neural Comput."},{"key":"170_CR45","doi-asserted-by":"publisher","first-page":"130","DOI":"10.1108\/eb046814","volume":"14","author":"MF Porter","year":"1980","unstructured":"Porter, M. F. An algorithm for suffix stripping. Program 14, 130\u2013137 (1980).","journal-title":"Program"},{"key":"170_CR46","unstructured":"Yang, Y. & Pedersen, J. O. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning 412\u2013420 (Morgan Kaufmann Publishers Inc., 1997)."},{"key":"170_CR47","unstructured":"Social Security Administration. Top Names of the Period 2010\u20132017. (Social Security Administration) https:\/\/www.ssa.gov\/oact\/babynames\/decades\/names2010s.html."},{"key":"170_CR48","doi-asserted-by":"publisher","first-page":"231","DOI":"10.1007\/s40264-015-0379-4","volume":"39","author":"A Sarker","year":"2016","unstructured":"Sarker, A. et al. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Saf. 39, 231\u2013240 (2016).","journal-title":"Drug Saf."},{"key":"170_CR49","unstructured":"Owoputi, O., O\u2019Connor, B., Dyer, C., Gimpel, K. & Schneider, N. Part-of-speech tagging for Twitter: word clusters and other advances (2012). http:\/\/www.cs.cmu.edu\/~ark\/TweetNLP\/owoputi+etal.tr12.pdf."},{"key":"170_CR50","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R. & Manning, C. D. GloVe: global vectors for word representation. In Proc. 
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532\u20131543 (2014).","DOI":"10.3115\/v1\/D14-1162"},{"key":"170_CR51","unstructured":"Hsu, C., Chang, C. & Lin, C. A practical guide to support vector classification. https:\/\/www.csie.ntu.edu.tw\/~cjlin\/papers\/guide\/guide.pdf."},{"key":"170_CR52","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321\u2013357 (2002).","journal-title":"J. Artif. Intell. Res."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-019-0170-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-019-0170-5","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-019-0170-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,17]],"date-time":"2022-12-17T18:33:27Z","timestamp":1671302007000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-019-0170-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,1]]},"references-count":52,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,12]]}},"alternative-id":["170"],"URL":"https:\/\/doi.org\/10.1038\/s41746-019-0170-5","relation":{},"ISSN":["2398-6352"],"issn-type":[{"type":"electronic","value":"2398-6352"}],"subject":[],"published":{"date-parts":[[2019,10,1]]},"assertion":[{"value":"22 May 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 
August 2019","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 October 2019","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"96"}}