{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T19:16:23Z","timestamp":1772910983215,"version":"3.50.1"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,2]]},"abstract":"<jats:p>Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with its neighbors. However, these methods often perform poorly, because an instance does not always share the same label as its neighbors. ML-based methods instead utilize trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted the mislabeled instances.<\/jats:p>\n          <jats:p>\n            In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to\n            <jats:italic>iteratively<\/jats:italic>\n            identify and remove mislabeled instances. In this process, influence-based verification is applied to enhance detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels, so that the iterative detection process can terminate. Finally, for the training instances whose status MisDetect is still uncertain about, MisDetect automatically produces pseudo labels to learn a binary classification model and leverages the generalization ability of this model to determine whether they are mislabeled. 
Our experiments on 15 datasets show that MisDetect outperforms 10 baseline methods, demonstrating its effectiveness in detecting mislabeled instances.\n          <\/jats:p>","DOI":"10.14778\/3648160.3648161","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T21:52:53Z","timestamp":1714773173000},"page":"1159-1172","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["MisDetect: Iterative Mislabel Detection using Early Loss"],"prefix":"10.14778","volume":"17","author":[{"given":"Yuhao","family":"Deng","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Chengliang","family":"Chai","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona\/MIT"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"HKUST (GZ)"}]},{"given":"Jiayi","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]},{"given":"Ju","family":"Fan","sequence":"additional","affiliation":[{"name":"Renmin University of China"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]}],"member":"320","published-online":{"date-parts":[[2024,5,3]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"1998. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Covertype."},{"key":"e_1_2_1_2_1","unstructured":"1999. https:\/\/yann.lecun.com\/exdb\/mnist\/."},{"key":"e_1_2_1_3_1","unstructured":"2009. http:\/\/www.cs.toronto.edu\/~kriz\/cifar.html."},{"key":"e_1_2_1_4_1","unstructured":"2011. http:\/\/ufldl.stanford.edu\/housenumbers\/."},{"key":"e_1_2_1_5_1","unstructured":"2023. 
https:\/\/www.kaggle.com\/datasets\/ghassenkhaled\/wine-quality-data."},{"key":"e_1_2_1_6_1","unstructured":"2023. https:\/\/www.kaggle.com\/datasets\/iabhishekofficial\/mobile-price-classification."},{"key":"e_1_2_1_7_1","unstructured":"2023. https:\/\/www.kaggle.com\/datasets\/teejmahal20\/airline-passenger-satisfaction."},{"key":"e_1_2_1_8_1","unstructured":"2023. https:\/\/www.kaggle.com\/datasets\/fedesoriano\/heart-failure-prediction."},{"key":"e_1_2_1_9_1","unstructured":"2023. https:\/\/www.kaggle.com\/datasets\/ahsan81\/hotel-reservations-classification-dataset."},{"key":"e_1_2_1_10_1","unstructured":"2023. http:\/\/codh.rois.ac.jp\/kmnist\/index.html.en."},{"key":"e_1_2_1_11_1","unstructured":"2023. https:\/\/www.kaggle.com\/datasets\/zalando-research\/fashionmnist."},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","first-page":"621","DOI":"10.1007\/3-540-44522-6_64","article-title":"Decontamination of training data for supevised pattern recognition","volume":"1876","author":"Badenas AI","year":"2000","unstructured":"AI Marques R Alejo J Badenas, JS Sanchez, and R Barandela. 2000. Decontamination of training data for supevised pattern recognition. Advances in Pattern Recognition Lecture Notes in Computer Science 1876 (2000), 621--630.","journal-title":"Advances in Pattern Recognition Lecture Notes in Computer Science"},{"key":"e_1_2_1_13_1","volume-title":"The tradeoffs of large scale learning. Advances in neural information processing systems 20","author":"Bottou L\u00e9on","year":"2007","unstructured":"L\u00e9on Bottou and Olivier Bousquet. 2007. The tradeoffs of large scale learning. Advances in neural information processing systems 20 (2007)."},{"key":"e_1_2_1_14_1","volume-title":"Random forests. Machine learning 45","author":"Breiman Leo","year":"2001","unstructured":"Leo Breiman. 2001. Random forests. 
Machine learning 45 (2001), 5--32."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.606"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389772"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915252"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589302"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523223"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599326"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/3297753.3297758"},{"key":"e_1_2_1_22_1","volume-title":"IJCAI","author":"Elkan Charles","year":"2001","unstructured":"Charles Elkan. 2001. The Foundations of Cost-Sensitive Learning. In IJCAI 2001. Morgan Kaufmann, 973--978."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/11564096_55"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning, ICML 2020","volume":"119","author":"Guo Chuan","year":"2020","unstructured":"Chuan Guo, Tom Goldstein, Awni Y. Hannun, and Laurens van der Maaten. 2020. Certified Data Removal from Machine Learning Models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research), Vol. 119. PMLR, 3832--3842. http:\/\/proceedings.mlr.press\/v119\/guo20c.html"},{"key":"e_1_2_1_25_1","volume-title":"Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS","author":"Han Bo","year":"2018","unstructured":"Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS 2018. 8536--8546. 
https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/a19744e268754fb0148b017647355b7b-Abstract.html"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00196"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_28_1","volume-title":"NeurIPS","author":"Hendrycks Dan","year":"2018","unstructured":"Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In NeurIPS 2018. 10477--10486. https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/ad554d8c3b06d6b97ee76a2448bd7913-Abstract.html"},{"key":"e_1_2_1_29_1","volume-title":"MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML 2018 (Proceedings of Machine Learning Research)","volume":"80","author":"Jiang Lu","year":"2018","unstructured":"Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML 2018 (Proceedings of Machine Learning Research), Vol. 80. PMLR, 2309--2318. http:\/\/proceedings.mlr.press\/v80\/jiang18c.html"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-28647-9_60"},{"key":"e_1_2_1_31_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_32_1","volume-title":"International conference on machine learning. PMLR","author":"Koh Pang Wei","year":"2017","unstructured":"Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International conference on machine learning. 
PMLR, 1885--1894."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3236226"},{"key":"e_1_2_1_34_1","volume-title":"ICLR","author":"Li Junnan","year":"2020","unstructured":"Junnan Li, Richard Socher, and Steven C. H. Hoi. 2020. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. In ICLR 2020. OpenReview.net. https:\/\/openreview.net\/forum?id=HJgExaVtwr"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415484"},{"key":"e_1_2_1_37_1","volume-title":"NeurIPS","author":"Malach Eran","year":"2017","unstructured":"Eran Malach and Shai Shalev-Shwartz. 2017. Decoupling \"when to update\" from \"how to update\". In NeurIPS 2017. 960--970. https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/58d4d1e7b1e97b258c9ed0b37e02d087-Abstract.html"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093342"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-25966-4_29"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.12125"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.240"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1994.6.1.147"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517886"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_45_1","volume-title":"ICML 2018 (Proceedings of Machine Learning Research)","volume":"80","author":"Ren Mengye","year":"2018","unstructured":"Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to Reweight Examples for Robust Deep Learning. In ICML 2018 (Proceedings of Machine Learning Research), Vol. 80. PMLR, 4331--4340. 
http:\/\/proceedings.mlr.press\/v80\/ren18a.html"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2006.211"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8655(02)00225-8"},{"key":"e_1_2_1_48_1","volume-title":"ICLR","author":"Toneva Mariya","year":"2019","unstructured":"Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2019. An Empirical Study of Example Forgetting during Deep Neural Network Learning. In ICLR 2019. OpenReview.net. https:\/\/openreview.net\/forum?id=BJlxm30cKm"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972771.28"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561267"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00906"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00458"},{"key":"e_1_2_1_53_1","volume-title":"MLSys","author":"Wu Richard","year":"2020","unstructured":"Richard Wu, Aoqian Zhang, Ihab F. Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. In MLSys 2020. mlsys.org. https:\/\/proceedings.mlsys.org\/book\/307.pdf"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389696"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476290"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2567393"},{"key":"e_1_2_1_57_1","volume-title":"Wright","author":"Zhang Xuezhou","year":"2018","unstructured":"Xuezhou Zhang, Xiaojin Zhu, and Stephen J. Wright. 2018. Training Set Debugging Using Trusted Items. In AAAI 2018. AAAI Press, 4482--4489. https:\/\/www.aaai.org\/ocs\/index.php\/AAAI\/AAAI18\/paper\/view\/16155"},{"key":"e_1_2_1_58_1","volume-title":"Sabuncu","author":"Zhang Zhilu","year":"2018","unstructured":"Zhilu Zhang and Mert R. Sabuncu. 2018. 
Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In NeurIPS 2018. 8792--8802. https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/f2925f97bc13ad2852a7a551802feea0-Abstract.html"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00931"},{"key":"e_1_2_1_60_1","volume-title":"Eliminating Class Noise in Large Datasets. In ICML 2003","author":"Zhu Xingquan","year":"2003","unstructured":"Xingquan Zhu, Xindong Wu, and Qijun Chen. 2003. Eliminating Class Noise in Large Datasets. In ICML 2003, Tom Fawcett and Nina Mishra (Eds.). AAAI Press, 920--927. http:\/\/www.aaai.org\/Library\/ICML\/2003\/icml03-119.php"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3648160.3648161","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T21:59:14Z","timestamp":1714773554000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3648160.3648161"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2]]},"references-count":60,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,2]]}},"alternative-id":["10.14778\/3648160.3648161"],"URL":"https:\/\/doi.org\/10.14778\/3648160.3648161","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,2]]},"assertion":[{"value":"2024-05-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}