{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T20:55:33Z","timestamp":1767992133168,"version":"3.49.0"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2020,11,15]],"date-time":"2020-11-15T00:00:00Z","timestamp":1605398400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004663","name":"Ministry of Science and Technology, Taiwan","doi-asserted-by":"crossref","award":["107-2221-E-008-085-MY2"],"award-info":[{"award-number":["107-2221-E-008-085-MY2"]}],"id":[{"id":"10.13039\/501100004663","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2020,11,30]]},"abstract":"<jats:p>Named entity recognition (NER) is an important task in natural language understanding, as it extracts the key entities (person, organization, location, date, number, etc.) and objects (product, song, movie, activity name, etc.) mentioned in texts. However, existing natural language processing (NLP) tools (such as Stanford NER) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a tool for NER model training is essential for low-resource language or entity information extraction. In this article, we study the problem of developing a tool to prepare training corpus from the Web with known seed entities for custom NER model training via distant supervision. 
The major challenge of automatic labeling lies in the long labeling time due to large corpus and seed entities as well as the concern to avoid false positive and false negative examples due to short and long seeds. To solve this problem, we adopt locality-sensitive hashing (LSH) for various length of seed entities. We conduct experiments on five types of entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names to demonstrate the improvements with the proposed Web NER model construction tool. Because the training corpus is obtained by automatic labeling of the seed entity\u2013related sentences, one could use either the entire corpus or the positive only sentences for model training. Based on the experimental results, we found the decision should depend on whether traditional linear chained conditional random fields (CRF) or deep neural network\u2013based CRF is used for model training as well as the completeness of the provided seed list.<\/jats:p>","DOI":"10.1145\/3422817","type":"journal-article","created":{"date-parts":[[2020,11,25]],"date-time":"2020-11-25T01:26:03Z","timestamp":1606267563000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["On the Construction of Web NER Model Training Tool based on Distant Supervision"],"prefix":"10.1145","volume":"19","author":[{"given":"Chien-Lung","family":"Chou","sequence":"first","affiliation":[{"name":"National Central University, Taiwan (R.O.C.)"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1101-6337","authenticated-orcid":false,"given":"Chia-Hui","family":"Chang","sequence":"additional","affiliation":[{"name":"National Central University, Taiwan (R.O.C.)"}]},{"given":"Yuan-Hao","family":"Lin","sequence":"additional","affiliation":[{"name":"National Central University, Taiwan 
(R.O.C.)"}]},{"given":"Kuo-Chun","family":"Chien","sequence":"additional","affiliation":[{"name":"National Central University, Taiwan (R.O.C.)"}]}],"member":"320","published-online":{"date-parts":[[2020,11,15]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075178.1075207"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376746"},{"key":"e_1_2_1_3_1","volume-title":"RAPID detection of gene-gene interactions in genome-wide association studies. Bioinformatics 26 22","author":"Brinza Dumitru","year":"2010"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2013.30"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the Technologies and Applications of Artificial Intelligence (TAAI\u201917)","author":"Chiang Chia-Feng","year":"2017"},{"key":"e_1_2_1_7_1","first-page":"1","article-title":"Leveraging memory-enhanced conditional random fields with convolutional and automatic lexical features for Chinese named entity recognition","volume":"24","author":"Chien Kuo-Chun","year":"2019","journal-title":"Int. J. Comput. Ling. Chinese Lang. Proc."},{"key":"e_1_2_1_8_1","first-page":"2","article-title":"Boosted web named entity recognition via tri-training","volume":"16","author":"Chou Chien-Lung","year":"2016","journal-title":"ACM Trans. Asian Low-resour. Lang. Inf. Proc."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/3059406.3059413"},{"key":"e_1_2_1_10_1","volume-title":"E-Commerce and Web Technologies","author":"Chuang Hsiu-Min"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING\u201917)","author":"Chung Chih-Yu","year":"2017"},{"key":"e_1_2_1_12_1","volume-title":"Natural language processing (almost) from scratch. J. Mach. Learn. Res. 
12 (Nov","author":"Collobert Ronan","year":"2011"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL\u201902)","author":"Cunningham Hamish","year":"2002"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242610"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.3115\/1219840.1219885"},{"key":"e_1_2_1_16_1","volume-title":"Advances in Neural Information Processing Systems 21","author":"Graves Alex"},{"key":"e_1_2_1_17_1","volume-title":"ROCLING, Lun-Wei Ku and Yu Tsao (Eds.)","author":"Hsu Kuo-Hsin"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1014052.1014073"},{"key":"e_1_2_1_19_1","volume-title":"Bidirectional LSTM-CRF models for sequence tagging. CoRR abs\/1508.01991","author":"Huang Zhiheng","year":"2015"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_2_1_21_1","volume-title":"Irrelevance reduction with locality-sensitive hash learning for efficient cross-media retrieval. Multimedia Tools Applic. (Feb","author":"Jia Yuhua","year":"2018"},{"key":"e_1_2_1_22_1","unstructured":"Sun Junyi. 2013. Jieba. Retrieved from https:\/\/github.com\/fxsjy\/jieba."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 14th Conference on Computational Natural Language Learning (CoNLL\u201910)","author":"Kim Su Nam","year":"2010"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-006-0027-5"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0097539798347177"},{"key":"e_1_2_1_26_1","unstructured":"John Lafferty. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 
Morgan Kaufmann 282--289."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9513"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1030"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1989.1.4.541"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2356471"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC\u201902)","author":"Lin Jimmy","year":"2002"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING\u201916)","author":"Lin Yuan-Hao","year":"2016"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00115"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP\u201917)","author":"Liu Fei","year":"2017"},{"key":"e_1_2_1_35_1","unstructured":"Apache Lucene. 1999. Apache Lucene Text Analyzer. Retrieved from https:\/\/lucene.apache.org."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1040"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242592"},{"key":"e_1_2_1_38_1","first-page":"265","article-title":"Audio fingerprint hierarchy searching strategies on GPGPU massively parallel computer","volume":"2","author":"Mau Toan Nguyen","year":"2018","journal-title":"J. Inf. Telecommun."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119206"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI\u201909)","author":"Michelson Matthew","year":"2009"},{"key":"e_1_2_1_41_1","unstructured":"Tomas Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. 
Efficient Estimation of Word Representations in Vector Space. Retrieved from arxiv:cs.CL\/1301.3781."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.5555\/1690219.1690287"},{"key":"e_1_2_1_43_1","unstructured":"Apache OpenNLP. 2004. Apache Software Foundation. Retrieved from https:\/\/opennlp.apache.org."},{"key":"e_1_2_1_44_1","volume-title":"Relation extraction: A survey. CoRR abs\/1712.05191","author":"Pawar Sachin","year":"2017"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics.","author":"Qiu Xipeng","year":"2013"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2348283.2348379"},{"key":"e_1_2_1_47_1","doi-asserted-by":"crossref","volume-title":"Mining of Massive Datasets","author":"Rajaraman Anand","DOI":"10.1017\/CBO9781139058452"},{"key":"e_1_2_1_48_1","volume-title":"Machine Learning and Knowledge Discovery in Databases","author":"Riedel Sebastian"},{"key":"e_1_2_1_49_1","first-page":"3","article-title":"Information extraction","volume":"1","author":"Sarawagi Sunita","year":"2008","journal-title":"Found. 
Trends Datab."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2043"},{"key":"e_1_2_1_51_1","first-page":"L","article-title":"Learning syntactic patterns for automatic hypernym discovery","volume":"17","author":"Snow Rion","year":"2005","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_52_1","volume-title":"ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.","author":"Sutton Charles","year":"2004"},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of the Eighth International Conference on Knowledge and Systems Engineering (KSE\u201916)","author":"Toan Nguyen Mau","year":"2016"},{"key":"e_1_2_1_54_1","volume-title":"Convolutional neural network with word embeddings for Chinese word segmentation. CoRR abs\/1711.04411","author":"Wang Chunqi","year":"2017"},{"key":"e_1_2_1_55_1","volume-title":"Jingkuan Song, and Jianqiu Ji.","author":"Wang Jingdong","year":"2014"},{"key":"e_1_2_1_57_1","volume-title":"Memory networks. CoRR abs\/1410.3916","author":"Weston Jason","year":"2014"},{"key":"e_1_2_1_58_1","volume-title":"Combine CRF and MMSEG to boost Chinese word segmentation in social media. CoRR abs\/1510.07099","author":"Yushi Yao","year":"2015"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1203"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983809"},{"key":"e_1_2_1_61_1","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1109\/TKDE.2005.186","article-title":"Tri-training: Exploiting unlabeled data using three classifiers","volume":"17","author":"Zhou Zhi-Hua","year":"2005","journal-title":"IEEE Trans. Knowl. 
Data Eng."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3422817","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3422817","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:56Z","timestamp":1750195496000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3422817"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11,15]]},"references-count":59,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11,30]]}},"alternative-id":["10.1145\/3422817"],"URL":"https:\/\/doi.org\/10.1145\/3422817","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,11,15]]},"assertion":[{"value":"2019-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}