{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T14:50:53Z","timestamp":1776351053847,"version":"3.51.2"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2020,5,30]],"date-time":"2020-05-30T00:00:00Z","timestamp":1590796800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM\/IMS Trans. Data Sci."],"published-print":{"date-parts":[[2020,5,31]]},"abstract":"<jats:p>\n                    Labeling datasets is one of the most expensive bottlenecks in data preprocessing tasks in machine learning. Therefore, organizations, in many domains, are applying weak supervision to produce noisy labels. However, since weak supervision relies on cheaper sources, the quality of the generated labels is problematic. Therefore, in this article, we present\n                    <jats:italic toggle=\"yes\">Asterisk<\/jats:italic>\n                    , an end-to-end framework to generate high-quality, large-scale labeled datasets. The system, first, automatically generates heuristics to assign initial labels. Then, the framework applies a novel data-driven active learning process to enhance the labeling quality. We present an algorithm that learns the selection policy by accommodating the modeled accuracies of the heuristics, along with the outcome of the generative model. Finally, the system employs the output of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework within a wide range of real-world applications. The experiments include 10 datasets of varying sizes with a maximum size of 11 million records. The results illustrate the effectiveness of the framework in producing high-quality labels and achieving high classification accuracy with minimal annotation efforts.\n                  <\/jats:p>","DOI":"10.1145\/3385188","type":"journal-article","created":{"date-parts":[[2020,5,31]],"date-time":"2020-05-31T00:09:30Z","timestamp":1590883770000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Asterisk"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7580-5757","authenticated-orcid":false,"given":"Mona","family":"Nashaat","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aindrila","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"James","family":"Miller","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shaikh","family":"Quader","sequence":"additional","affiliation":[{"name":"IBM Canada Software Lab, IBM Canada, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,5,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2017.2756658"},{"key":"e_1_2_1_2_1","first-page":"3567","volume-title":"Advances in Neural Information Processing Systems","author":"Ratner A. J.","unstructured":"A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. R\u00e9. 2016. Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems, pp. 3567--3575."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2017.2659740"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2849394"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2949741.2949756"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157797"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-012-0507-8"},{"key":"e_1_2_1_8_1","first-page":"7","volume-title":"International Workshop on Document Analysis Systems","author":"Gurjar N.","unstructured":"N. Gurjar, S. Sudholt, and G. A. Fink. 2018. Learning deep representations for word spotting under weak supervision. International Workshop on Document Analysis Systems, pp. 7--12."},{"key":"e_1_2_1_9_1","first-page":"1109","volume-title":"ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Chaidaroon S.","unstructured":"S. Chaidaroon, T. Ebesu, and Y. Fang. 2018. Deep semantic text hashing with weak supervision. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1109--1112."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2833850"},{"key":"e_1_2_1_11_1","first-page":"2234","volume-title":"Advances in Neural Information Processing Systems","author":"Salimans T.","unstructured":"T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. 2016. Improved techniques for training GANs. Advances in Neural Information Processing Systems, pp. 2234--2242."},{"key":"e_1_2_1_12_1","first-page":"273","volume-title":"Proc. 34th International Conference on Machine Learning","author":"Bach S. H.","unstructured":"S. H. Bach, B. He, A. Ratner, and C. R\u00e9. 2017. Learning the structure of generative models without labeled data. Proc. 34th International Conference on Machine Learning, pp. 273--282."},{"key":"e_1_2_1_13_1","first-page":"223","volume-title":"Proc. VLDB Endowment","author":"Varma P.","unstructured":"P. Varma and C. R\u00e9. 2018. Snuba: Automating weak supervision to label training data. Proc. VLDB Endowment pp. 223--236."},{"key":"e_1_2_1_14_1","first-page":"94","volume-title":"IEEE International Conference on Big Data","author":"Huang E.-C.","year":"2017","unstructured":"E.-C. Huang, H.-K. Pao, and Y.-J. Lee. 2017. Big active learning. IEEE International Conference on Big Data, pp. 94--101."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1093\/nsr\/nwx106"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-010-5174-y"},{"key":"e_1_2_1_17_1","unstructured":"B. Settles. 2009. Active Learning Literature Survey."},{"key":"e_1_2_1_18_1","first-page":"285","volume-title":"Proc. International Conference on World Wide Web","author":"Dalvi N.","unstructured":"N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. 2013. Aggregating crowdsourced binary ratings. Proc. International Conference on World Wide Web, pp. 285--294."},{"key":"e_1_2_1_19_1","first-page":"195","volume-title":"IEEE International Conference on Data Engineering","author":"Joglekar M.","unstructured":"M. Joglekar, H. Garcia-Molina, and A. Parameswaran. 2015. Comprehensive and reliable crowd assessment algorithms. IEEE International Conference on Data Engineering, pp. 195--206."},{"key":"e_1_2_1_20_1","unstructured":"P. Varma B. He D. Iter P. Xu R. Yu C. De Sa and C. R\u00e9. 2016. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. ArXiv161008123 Cs Stat 2016."},{"key":"e_1_2_1_21_1","unstructured":"P. Varma et al. 2016. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. ArXiv161008123 Cs Stat 2016."},{"key":"e_1_2_1_22_1","volume-title":"GOGGLES: Automatic training data generation with affinity coding. ArXiv190304552 Cs","author":"Das N.","year":"2019","unstructured":"N. Das, S. Chaba, S. Gandhi, D. H. Chau, and X. Chu. 2019. GOGGLES: Automatic training data generation with affinity coding. ArXiv190304552 Cs, 2019."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2009.2033421"},{"key":"e_1_2_1_24_1","first-page":"346","volume-title":"IEEE International Joint Conference on Neural Networks","author":"Prudencio R. B. C.","unstructured":"R. B. C. Prudencio and T. B. Ludermir. 2008. Active meta-learning with uncertainty sampling and outlier detection. IEEE International Joint Conference on Neural Networks, pp. 346--351."},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"K. Konyushkova R. Sznitman and P. Fua. 2015. Introducing geometry in active learning for image segmentation. ArXiv150804955 Cs 2015.","DOI":"10.1109\/ICCV.2015.340"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-014-0781-x"},{"key":"e_1_2_1_27_1","first-page":"1874","article-title":"Learning how to actively learn: A deep imitation learning approach","volume":"1","author":"Liu M.","year":"2018","unstructured":"M. Liu, W. Buntine, and G. Haffari. 2018. Learning how to actively learn: A deep imitation learning approach. Proc. Annual Meeting of the Association for Computational Linguistics. 1 (2018), 1874--1883.","journal-title":"Proc. Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"M. E. Ramirez-Loaiza M. Sharma G. Kumar and M. Bilgic. 2017. Active learning: An empirical study of common baselines. Data Min. Knowl. Discov. 31. 2 (2017) 287--313.","DOI":"10.1007\/s10618-016-0469-7"},{"key":"e_1_2_1_29_1","volume-title":"Learn: A deep reinforcement learning approach. ArXiv170802383 Cs","author":"Fang M.","year":"2017","unstructured":"M. Fang, Y. Li, and T. Cohn. 2017. Learning how to active Learn: A deep reinforcement learning approach. ArXiv170802383 Cs, Aug. 2017."},{"key":"e_1_2_1_30_1","unstructured":"K. Konyushkova R. Sznitman and P. Fua. 2017. Learning active learning from data. Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_31_1","first-page":"841","volume-title":"IEEE International Conference on Data Mining","author":"Chu H.","unstructured":"H. Chu and H. Lin. 2016. Can active learning experience be transferred? IEEE International Conference on Data Mining, pp. 841--846."},{"key":"e_1_2_1_32_1","unstructured":"K. Pang M. Dong Y. Wu and T. Hospedales. 2018. Meta-learning transferable active learning policies by deep reinforcement learning. ArXiv Prepr. ArXiv180604798 2018."},{"key":"e_1_2_1_33_1","first-page":"625","volume-title":"Proc. International Conference on Machine Learning","author":"Niculescu-Mizil A.","unstructured":"A. Niculescu-Mizil and R. Caruana. 2005. Predicting good probabilities with supervised learning. Proc. International Conference on Machine Learning, pp. 625--632."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1249"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1093\/jat\/bku127","article-title":"Determination of confidence intervals in non-normal data: Application of the bootstrap to cocaine concentration in femoral blood","volume":"39","author":"Desharnais B.","year":"2015","unstructured":"B. Desharnais, F. Camirand-Lemyre, P. Mireault, and C. D. Skinner. 2015. Determination of confidence intervals in non-normal data: Application of the bootstrap to cocaine concentration in femoral blood. J. Anal. Toxicol. 39, 2 (2015), 113--117.","journal-title":"J. Anal. Toxicol."},{"key":"e_1_2_1_36_1","first-page":"2156","volume-title":"AAAI Conference on Artificial Intelligence","author":"Xia R.","unstructured":"R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. 2014. Supervised hashing for image retrieval via image representation learning. AAAI Conference on Artificial Intelligence, pp. 2156--2162."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.13406"},{"key":"e_1_2_1_38_1","first-page":"46","volume-title":"IEEE International Conference on Big Data","author":"Nashaat M.","unstructured":"M. Nashaat, A. Ghosh, J. Miller, S. Quader, C. Marston, and J. F. Puget. 2018. Hybridization of active learning and data programming for labeling large industrial datasets. IEEE International Conference on Big Data, pp. 46--55."},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","first-page":"131","DOI":"10.1016\/j.infsof.2019.05.009","article-title":"M-Lean: An end-to-end development framework for predictive models in B2B scenarios","volume":"113","author":"Nashaat M.","year":"2019","unstructured":"M. Nashaat, A. Ghosh, J. Miller, S. Quader, and C. Marston. 2019. M-Lean: An end-to-end development framework for predictive models in B2B scenarios. Inf. Softw. Technol. 113 (2019), 131--145.","journal-title":"Inf. Softw. Technol."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2014.03.001"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2007.12.020"},{"key":"e_1_2_1_42_1","volume-title":"Searching for exotic particles in high-energy physics with deep learning. Nat. Comm. 5, 4308","author":"Baldi P.","year":"2014","unstructured":"P. Baldi, P. Sadowski, D. Whiteson. 2014. Searching for exotic particles in high-energy physics with deep learning. Nat. Comm. 5, 4308 (2014)."},{"issue":"2018","key":"e_1_2_1_43_1","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.enbuild.2015.11.071","article-title":"Accurate occupancy detection of an office room from light, temperature, humidity and CO 2 measurements using statistical learning models. 2016","volume":"112","author":"Candanedo L. M.","year":"2016","unstructured":"L. M. Candanedo and V. Feldheim. 2016. Accurate occupancy detection of an office room from light, temperature, humidity and CO 2 measurements using statistical learning models. 2016. Energy Build. 112 (2018), 28--39.","journal-title":"Energy Build."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.nima.2003.08.157"},{"key":"e_1_2_1_45_1","first-page":"535","volume-title":"Conference on Artificial Intelligence","author":"Fernandes K.","unstructured":"K. Fernandes, P. Vinagre, and P. Cortez. 2015. A proactive intelligent decision support system for predicting the popularity of online news. Conference on Artificial Intelligence, pp. 535--546."},{"key":"e_1_2_1_46_1","unstructured":"H. Xiao K. Rasul and R. Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. ArXiv170807747 Cs Stat 2017."},{"key":"e_1_2_1_47_1","unstructured":"P. Varma et al. 2017. Inferring generative model structure with static analysis. ArXiv170902477 Cs Stat 2017."},{"key":"e_1_2_1_48_1","first-page":"287","volume-title":"IEEE International Conference on Semantic Computing","author":"Beatty G.","unstructured":"G. Beatty, E. Kochis, and M. Bloodgood. 2019. The use of unlabeled data versus labeled data for stopping active learning for text classification. IEEE International Conference on Semantic Computing, pp. 287--294."},{"key":"e_1_2_1_49_1","first-page":"39","volume-title":"Proc. 13th Conference on Computational Natural Language Learning","author":"Bloodgood M.","unstructured":"M. Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. Proc. 13th Conference on Computational Natural Language Learning, pp. 39--47."},{"key":"e_1_2_1_50_1","volume-title":"Hawaii International Conference on System Sciences, submitted for publication.","author":"Nashaat M.","unstructured":"M. Nashaat, A. Ghosh, J. Miller, and S. Quader. WeSAL: Applying active supervision to find high-quality labels at industrial scale. Hawaii International Conference on System Sciences, submitted for publication."},{"key":"e_1_2_1_51_1","volume-title":"Proc. Conference on Bioinformatics. 19","author":"Tan A. C.","year":"2003","unstructured":"A. C. Tan and D. Gilbert. 2003. An empirical comparison of supervised machine learning techniques in bioinformatics. Proc. Conference on Bioinformatics. 19 (2003), 219--222."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13042-011-0012-5"},{"key":"e_1_2_1_53_1","first-page":"785","volume-title":"Proc. ACM SIGKDD","author":"Chen T.","unstructured":"T. Chen and C. Guestrin. 2016. XGBoost: A scalable tree boosting system. Proc. ACM SIGKDD, pp. 785--794."},{"key":"e_1_2_1_54_1","first-page":"401","volume-title":"International Conference on Business Process Management","author":"Teinemaa I.","unstructured":"I. Teinemaa, M. Dumas, F. M. Maggi, and C. Di Francescomarino. 2016. Predictive business process monitoring with structured and unstructured data. International Conference on Business Process Management, pp. 401--417."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1132"},{"key":"e_1_2_1_56_1","volume-title":"Proc. 2nd Workshop on Data Management for End-to-End Machine Learning.","author":"Ratner A.","unstructured":"A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. R\u00e9. 2018. Snorkel MeTaL: Weak supervision for multi-task learning. Proc. 2nd Workshop on Data Management for End-to-End Machine Learning."},{"key":"e_1_2_1_57_1","doi-asserted-by":"crossref","first-page":"6099","DOI":"10.1109\/TNNLS.2018.2820055","article-title":"SyMIL: MinMax latent SVM for weakly labeled data","volume":"29","author":"Durand T.","year":"2018","unstructured":"T. Durand, N. Thome, and M. Cord. 2018. SyMIL: MinMax latent SVM for weakly labeled data. IEEE Trans. Neural Netw. Learn. Syst. 29, 12 (2018), 6099--6112.","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"e_1_2_1_58_1","first-page":"2576","volume-title":"AAAI Conference on Artificial Intelligence","author":"Stewart R.","unstructured":"R. Stewart and S. Ermon. 2017. Label-free supervision of neural networks with physics and domain knowledge. AAAI Conference on Artificial Intelligence, pp. 2576--2582."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352138"},{"key":"e_1_2_1_60_1","first-page":"1301","volume-title":"Proc. International Conference on Management of Data","author":"Wu S.","year":"2018","unstructured":"S. Wu et al. 2018. Fonduer: Knowledge base construction from richly formatted data. Proc. International Conference on Management of Data, pp. 1301--1316."},{"key":"e_1_2_1_61_1","doi-asserted-by":"crossref","first-page":"868","DOI":"10.1109\/TKDE.2019.2897307","article-title":"ASCENT: Active supervision for semi-supervised learning","volume":"32","author":"Li Y.","year":"2020","unstructured":"Y. Li, Y. Wang, D. Yu, Y. Ning, P. Hu, and R. Zhao. 2020. ASCENT: Active supervision for semi-supervised learning. IEEE Trans. Knowl. Data Eng. 32, 5 (2020), 868--882.","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"e_1_2_1_62_1","volume-title":"Proc. International Conference on Machine Learning. 70","author":"Bachman P.","year":"2017","unstructured":"P. Bachman, A. Sordoni, and A. Trischler. 2017. Learning algorithms for active learning. Proc. International Conference on Machine Learning. 70 (2017), 301--310."},{"key":"e_1_2_1_63_1","volume-title":"NeurIPS MLSys Workshop.","author":"Kang D.","unstructured":"D. Kang, D. Raghavan, P. Bailis, and M. Zaharia. 2018. Model assertions for debugging machine learning. NeurIPS MLSys Workshop."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2869164"},{"key":"e_1_2_1_65_1","unstructured":"Z. Zhou J. Y. Shin S. R. Gurudu M. B. Gotway and J. Liang. 2018. AFT* Integrating active learning and transfer learning to reduce annotation efforts. ArXiv180200912 Cs Stat 2018."}],"container-title":["ACM\/IMS Transactions on Data Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3385188","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3385188","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T13:57:02Z","timestamp":1776347822000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3385188"}},"subtitle":["Generating Large Training Datasets with Automatic Active Supervision"],"short-title":[],"issued":{"date-parts":[[2020,5,30]]},"references-count":65,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,5,31]]}},"alternative-id":["10.1145\/3385188"],"URL":"https:\/\/doi.org\/10.1145\/3385188","relation":{},"ISSN":["2691-1922"],"issn-type":[{"value":"2691-1922","type":"print"}],"subject":[],"published":{"date-parts":[[2020,5,30]]},"assertion":[{"value":"2019-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-02-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}