{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:05:22Z","timestamp":1755839122180},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"7","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,3]]},"abstract":"<jats:p>\n            Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenario is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the \"repaired\" source to train an ML model. During production, unlabeled target (a.k.a. test) data will also be repaired, and is then fed in the trained ML model for prediction. However, this process often causes a performance degradation when the source and target datasets are dirty with different\n            <jats:italic>noise patterns<\/jats:italic>\n            , which is common in practice.\n          <\/jats:p>\n          <jats:p>\n            In this paper, we propose an\n            <jats:italic>adaptive data augmentation<\/jats:italic>\n            approach, for handling missing data in supervised ML. The approach extracts noise patterns from target data, and adapts the source data with the extracted target noise patterns while still preserving supervision signals in the source. Then, it\n            <jats:italic>patches<\/jats:italic>\n            the ML model by retraining it on the adapted data, in order to better serve the target. To effectively support adaptive data augmentation, we propose a novel generative adversarial network (GAN) based framework, called DAGAN, which works in an unsupervised fashion. DAGAN consists of two connected GAN networks. 
The first GAN learns the noise pattern from the target, for\n            <jats:italic>target mask generation.<\/jats:italic>\n            The second GAN uses the learned target mask to augment the source data, for\n            <jats:italic>source data adaptation.<\/jats:italic>\n            The augmented source data is used to retrain the ML model. Extensive experiments show that our method significantly improves the ML model performance and is more robust than the state-of-the-art missing data imputation solutions for handling datasets with different missing value patterns.\n          <\/jats:p>","DOI":"10.14778\/3450980.3450989","type":"journal-article","created":{"date-parts":[[2021,4,12]],"date-time":"2021-04-12T16:17:16Z","timestamp":1618244236000},"page":"1202-1214","source":"Crossref","is-referenced-by-count":20,"title":["Adaptive data augmentation for supervised learning over missing data"],"prefix":"10.14778","volume":"14","author":[{"given":"Tongyu","family":"Liu","sequence":"first","affiliation":[{"name":"Renmin University of China"}]},{"given":"Ju","family":"Fan","sequence":"additional","affiliation":[{"name":"Renmin University of China"}]},{"given":"Yinqing","family":"Luo","sequence":"additional","affiliation":[{"name":"Renmin University of China"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"QCRI, HBKU"}]},{"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]},{"given":"Xiaoyong","family":"Du","sequence":"additional","affiliation":[{"name":"Renmin University of China"}]}],"member":"320","published-online":{"date-parts":[[2021,4,12]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"1996. Adult Data Set. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Adult."},{"key":"e_1_2_1_2_1","unstructured":"2013. 
EyeState Data Set. http:\/\/archive.ics.uci.edu\/ml\/datasets\/EEG+Eye+State."},{"key":"e_1_2_1_3_1","unstructured":"2014. Ipums Data Set. https:\/\/www.openml.org\/d\/381."},{"key":"e_1_2_1_4_1","unstructured":"2019. Okcupid Data Set. https:\/\/www.openml.org\/d\/41440."},{"key":"e_1_2_1_5_1","unstructured":"2020. Welfare Data Set. https:\/\/www.cnrds.com\/."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_7_1","volume-title":"CoRR abs\/1701.07875","author":"Arjovsky Mart\u00edn","year":"2017"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-009-5152-4"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1577069.1755858"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3367243.3367327"},{"key":"e_1_2_1_11_1","volume-title":"Le","author":"Cubuk Ekin D.","year":"2019"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2496141"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data Augmentation for Low-Resource Neural Machine Translation. In ACL. 
567--573.","DOI":"10.18653\/v1\/P17-2090"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407802"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407802"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2008.05.019"},{"key":"e_1_2_1_17_1","volume-title":"Lempitsky","author":"Ganin Yaroslav","year":"2017"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969125"},{"key":"e_1_2_1_19_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2015"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_21_1","volume-title":"Data Augmentation using Pre-trained Transformer Models. CoRR abs\/2003.02245","author":"Kumar Varun","year":"2020"},{"key":"e_1_2_1_22_1","volume-title":"Marlin","author":"Cheng-Xian Li Steven","year":"2019"},{"key":"e_1_2_1_23_1","first-page":"3128","article-title":"Detecting and Correcting for Label Shift with BlackBox Predictors","volume":"80","author":"Lipton Zachary C.","year":"2018","journal-title":"ICML"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M2051"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/3326943.3327008"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-019-00107-y"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389768"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_31_1","volume-title":"Conditional Generative Adversarial Nets. 
CoRR abs\/1411.1784","author":"Mirza Mehdi","year":"2014"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-014-0007-7"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3231751.3231757"},{"key":"e_1_2_1_34_1","volume-title":"A Review of Methods for Missing Data. Educational Research and Evaluation: An International Journal on Theory and Practice 7 (08","author":"Pigott Therese","year":"2010"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299887.3299891"},{"key":"e_1_2_1_36_1","volume-title":"A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models. Survey Methodology 27 (11","author":"Raghunathan Trivellore","year":"2000"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294996.3295083"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/773294"},{"key":"e_1_2_1_40_1","volume-title":"Tatsunori B. Hashimoto, and Percy Liang.","author":"Sagawa Shiori","year":"2019"},{"key":"e_1_2_1_41_1","volume-title":"Missing Data: Our View of the State of the Art. Psychological Methods 7 (06","author":"Schafer Joseph","year":"2002"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380604"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btr597"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.5555\/2986916.2987018"},{"key":"e_1_2_1_45_1","volume-title":"Wei and Kai Zou","author":"Jason","year":"2019"},{"key":"e_1_2_1_46_1","unstructured":"Richard Wu, Aoqian Zhang, Ihab F. Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. In MLSys. 
mlsys.org."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/2429619.2429648"},{"key":"e_1_2_1_50_1","volume-title":"ICML (Proceedings of Machine Learning Research)","volume":"80","author":"Yoon Jinsung"},{"key":"e_1_2_1_51_1","unstructured":"Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In ICLR.  Chiyuan Zhang Samy Bengio Moritz Hardt Benjamin Recht and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In ICLR."},{"key":"e_1_2_1_52_1","volume-title":"Efros","author":"Zhu Jun-Yan","year":"2017"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3450980.3450989","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:13:51Z","timestamp":1672222431000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3450980.3450989"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3]]},"references-count":51,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2021,3]]}},"alternative-id":["10.14778\/3450980.3450989"],"URL":"https:\/\/doi.org\/10.14778\/3450980.3450989","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,3]]}}}