{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,4,1]],"date-time":"2022-04-01T18:56:08Z","timestamp":1648839368295},"reference-count":66,"publisher":"Cambridge University Press (CUP)","issue":"1","license":[{"start":{"date-parts":[[2017,6,15]],"date-time":"2017-06-15T00:00:00Z","timestamp":1497484800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2018,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. 
Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.<\/jats:p>","DOI":"10.1017\/s135132491700016x","type":"journal-article","created":{"date-parts":[[2017,6,15]],"date-time":"2017-06-15T09:19:29Z","timestamp":1497518369000},"page":"39-75","source":"Crossref","is-referenced-by-count":5,"title":["A Semi-automatic and low-cost method to learn patterns for named entity recognition"],"prefix":"10.1017","volume":"24","author":[{"given":"M.","family":"MARRERO","sequence":"first","affiliation":[]},{"given":"J.","family":"URBANO","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2017,6,15]]},"reference":[{"key":"S135132491700016X_ref042","unstructured":"Nouvel D. , Antoine J. Y. , Friburger N. , and Soulet A. 2012. Coupling knowledge-based and data-driven systems for named entity recognition. In Proceedings of the ACL Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, Avignon, France, pp. 69\u201377."},{"key":"S135132491700016X_ref066","doi-asserted-by":"publisher","DOI":"10.1002\/asi.20119"},{"key":"S135132491700016X_ref033","doi-asserted-by":"publisher","DOI":"10.1109\/5254.920602"},{"key":"S135132491700016X_ref003","doi-asserted-by":"crossref","unstructured":"Asahara M. , and Matsumoto Y. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Canada: Edmonton, vol. 1, pp. 8\u201315.","DOI":"10.3115\/1073445.1073447"},{"key":"S135132491700016X_ref020","unstructured":"Gantz J. , and Reinsel D. 2012. The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. 
Technical Report, IDC."},{"key":"S135132491700016X_ref014","doi-asserted-by":"publisher","DOI":"10.1007\/BF01890115"},{"key":"S135132491700016X_ref011","first-page":"112","volume-title":"Annotation for the Semantic Web, Frontiers in Artificial Intelligence and Applications series","author":"Ciravegna","year":"2003"},{"key":"S135132491700016X_ref006","unstructured":"Borthwick A. , Sterling J. , Agichtein E. , and Grishman R. 1998. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the 6th Workshop on Very Large Corpora, Montreal, Canada, pp. 152\u2013160."},{"key":"S135132491700016X_ref058","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007562322031"},{"key":"S135132491700016X_ref019","unstructured":"Freitag D. 1998. Toward general-purpose learning for information extraction retargetability. In Proceedings of the 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 404\u20138."},{"key":"S135132491700016X_ref022","doi-asserted-by":"crossref","unstructured":"Hachey B. , Alex B. , and Becker M. 2005. Investigating the effects of selective sampling on the annotation task. In Proceedings of the 9th Conference on Computational Natural Language Learning, Ann Arbor, Michigan, pp. 144\u201351.","DOI":"10.3115\/1706543.1706569"},{"key":"S135132491700016X_ref041","first-page":"1","volume-title":"ACL Workshop on BioNLP","author":"N\u00e9dellec","year":"2013"},{"key":"S135132491700016X_ref050","unstructured":"Ritter A. , Clark S. , and Etzioni O. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, United Kingdom, pp. 1524\u201334."},{"key":"S135132491700016X_ref024","doi-asserted-by":"crossref","unstructured":"Irmak U. , and Kraft R. 2010. A scalable machine-learning approach for semi-structured named entity recognition. 
In Proceedings of the 19th International Conference on World Wide Web, Raleigh, USA, pp. 461\u201370.","DOI":"10.1145\/1772690.1772738"},{"key":"S135132491700016X_ref049","unstructured":"Ringger E. , et al. 2008. Assessing the costs of machine-assisted corpus annotation through a user study. In Proceedings of the International Conference on Language Resources and Evaluation, Marrakech, Morocco, pp. 3318\u201324."},{"key":"S135132491700016X_ref012","doi-asserted-by":"crossref","unstructured":"Culotta A. , and McCallum A. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, Pennsylvania, pp. 746\u201351.","DOI":"10.21236\/ADA440382"},{"key":"S135132491700016X_ref001","unstructured":"Alfonseca E. , and Manandhar S. 2002. An unsupervised method for general named entity recognition and automated concept discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, pp. 34\u201343."},{"key":"S135132491700016X_ref013","unstructured":"Cunningham H. , et al. 2013. Developing language processing components with GATE (a user guide). Technical Report, University of Sheffield Department of Computer Science."},{"key":"S135132491700016X_ref016","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2005.03.001"},{"key":"S135132491700016X_ref063","doi-asserted-by":"publisher","DOI":"10.1016\/j.websem.2005.10.002"},{"key":"S135132491700016X_ref060","doi-asserted-by":"crossref","unstructured":"Srikant R. , and Agrawal R. 1996. Mining sequential patterns: generalizations and performance improvements. In Proceedings of the International Conference on Extending Database, Avignon, France, pp. 1\u201317.","DOI":"10.1007\/BFb0014140"},{"key":"S135132491700016X_ref007","doi-asserted-by":"crossref","unstructured":"Brauer F. , Rieger R. , Mocan A. , and Barczynski W. M. 2011. 
Enabling information extraction by inference of regular expressions from sample entities. In Proceedings of the 20th Conference on Information and Knowledge Management, Glasgow, United Kingdom, pp. 1285\u201394.","DOI":"10.1145\/2063576.2063763"},{"key":"S135132491700016X_ref038","doi-asserted-by":"crossref","unstructured":"McCallum A. , and Li W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the 7th Conference on Natural Language Learning, Edmonton, Canada, pp. 188\u201391.","DOI":"10.3115\/1119176.1119206"},{"key":"S135132491700016X_ref015","first-page":"17","article-title":"Shallow processing with unification and typed feature structures: foundations and applications","volume":"1","author":"Drozdzynski","year":"2004","journal-title":"K\u00fcnstliche Intelligenz"},{"key":"S135132491700016X_ref021","doi-asserted-by":"crossref","unstructured":"Gupta S. , and Manning C. D. 2014. Improved pattern learning for bootstrapped entity extraction. In Proceedings of the 18th Conference on Computational Natural Language Learning, Baltimore, USA, pp. 98\u2013108.","DOI":"10.3115\/v1\/W14-1611"},{"key":"S135132491700016X_ref008","unstructured":"Califf M. E. 1998. Relational Learning Techniques for Natural Language Information Extraction. PhD Thesis, The University of Texas at Austin."},{"key":"S135132491700016X_ref056","doi-asserted-by":"crossref","unstructured":"Silberztein M. 2005. NooJ: a linguistic annotation system for corpus processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 10\u201311.","DOI":"10.3115\/1225733.1225739"},{"key":"S135132491700016X_ref048","unstructured":"Rinaldi F. , et al. 2005. CAFETIERE: conceptual annotations for facts, events, terms, individual entities, and RElations. 
Technical Report TR-U4.3.1, Parmenides Project IST-2001-39023."},{"key":"S135132491700016X_ref055","doi-asserted-by":"crossref","unstructured":"Shinyama Y. , and Sekine S. 2004. Named entity discovery using comparable news articles. In Proceedings of the International Conference on Computational Linguistics, Geneva, Switzerland, p. 848.","DOI":"10.3115\/1220355.1220477"},{"key":"S135132491700016X_ref023","unstructured":"Haertel R. A. , Seppi K. D. , Ringger E. K. , and Carroll J. L. 2008. Return on investment for active learning. NIPS Workshop on Cost-Sensitive Learning."},{"key":"S135132491700016X_ref017","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2014.04.005"},{"key":"S135132491700016X_ref061","unstructured":"Thompson C. A. , Califf M. E. , and Mooney R. J. 1999. Active learning for natural language parsing and information extraction. In Proceedings of the International Conference on Machine Learning, Bled, Slovenia, pp. 406\u201314."},{"key":"S135132491700016X_ref064","unstructured":"Vijayanarasimhan S. , and Grauman K. 2009. What\u2019s it going to cost you? Predicting effort versus informativeness for multi-label image annotations. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, Florida, pp. 2262\u20139."},{"key":"S135132491700016X_ref030","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324908004968"},{"key":"S135132491700016X_ref010","unstructured":"Chiticariu L. , and Reiss F. R. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Seattle, USA, pp. 827\u201332."},{"key":"S135132491700016X_ref032","doi-asserted-by":"publisher","DOI":"10.1145\/2414425.2414428"},{"key":"S135132491700016X_ref035","unstructured":"Marrero M. , S\u00e1nchez-Cuadrado S. , Urbano J. , Morato J. , and Moreiro J. A. 2012. Information retrieval systems adapted to the biomedical domain. 
arXiv:1203.6845 [cs.CL]."},{"key":"S135132491700016X_ref043","doi-asserted-by":"publisher","DOI":"10.1561\/1500000011"},{"key":"S135132491700016X_ref046","doi-asserted-by":"crossref","unstructured":"Ratinov L. , and Roth D. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Conference on Natural Language Learning, Boulder, Colorado, pp. 147\u201355.","DOI":"10.3115\/1596374.1596399"},{"key":"S135132491700016X_ref034","first-page":"47","article-title":"Evaluation of named entity extraction systems","volume":"41","author":"Marrero","year":"2009","journal-title":"Research in Computing Science"},{"key":"S135132491700016X_ref057","unstructured":"Siniakov P. 2008. GROPUS-an Adaptive Rule Based Algorithm for Information Extraction. PhD Thesis, Free University of Berlin."},{"key":"S135132491700016X_ref062","unstructured":"Tomanek K. , Wermter J. , and Hahn U. 2007. An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Prague, Czech Republic, pp. 486\u20135."},{"key":"S135132491700016X_ref004","doi-asserted-by":"crossref","unstructured":"Bikel D. M. , Miller S. , Schwartz R. , and Weischedel R. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the 5th Conference on Applied Natural Language Processing, Washington, DC, pp. 194\u2013201.","DOI":"10.3115\/974557.974586"},{"key":"S135132491700016X_ref025","unstructured":"Jones R. 2005. Learning to Extract Entities from Labelled and Unlabelled Text. PhD Thesis, Carnegie Mellon University."},{"key":"S135132491700016X_ref054","doi-asserted-by":"crossref","unstructured":"Shen D. , Zhang J. , Su J. , Zhou G. , and Tan C.-L. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the Annual Meeting of the ACL, Barcelona, Spain, pp. 
589\u201396.","DOI":"10.3115\/1218955.1219030"},{"key":"S135132491700016X_ref036","volume-title":"Advances in Information Retrieval. ECIR 2015","author":"Marrero","year":"2015"},{"key":"S135132491700016X_ref051","doi-asserted-by":"publisher","DOI":"10.1561\/1900000003"},{"key":"S135132491700016X_ref040","unstructured":"Nagesh A. , and Chiticariu L. 2012. Towards efficient named-entity rule induction for customizability. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Jeju Island, Korea, pp. 128\u201338."},{"key":"S135132491700016X_ref029","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Soviet Physics Doklady"},{"key":"S135132491700016X_ref031","doi-asserted-by":"crossref","unstructured":"Li Y. , Krishnamurthy R. , Raghavan S. , Vaithyanathan S. , and Jagadish H. 2008. Regular expression learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Waikiki, Hawaii, pp. 21\u201330.","DOI":"10.3115\/1613715.1613719"},{"key":"S135132491700016X_ref005","doi-asserted-by":"publisher","DOI":"10.1075\/cilt.260.07bog"},{"key":"S135132491700016X_ref045","doi-asserted-by":"crossref","unstructured":"Popescu A.-M. , and Etzioni O. 2005. Extracting product features and opinions from reviews. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 339\u201346.","DOI":"10.3115\/1220575.1220618"},{"key":"S135132491700016X_ref059","doi-asserted-by":"crossref","unstructured":"Srihari R. K. , and Li W. 1999. Information extraction supported question answering. Technical Report, Cymfony Inc.","DOI":"10.21236\/ADA460042"},{"key":"S135132491700016X_ref052","unstructured":"Sekine S. , Grishman R. , and Shinnou H. 1998. A decision tree method for finding and classifying names in japanese texts. 
In Proceedings of the 6th Workshop on Very Large Corpora, Montreal, Canada, pp. 171\u20138."},{"key":"S135132491700016X_ref044","unstructured":"Pasca M. , Lin D. , Bigham J. , Lifchits A. , and Jain A. 2006. Organizing and searching the world wide web of facts-step one: the one million fact extraction challenge. In Proceedings of the 21st National Conference on Artificial Intelligence, Boston, Massachusetts, pp. 1400\u20135."},{"key":"S135132491700016X_ref002","doi-asserted-by":"crossref","unstructured":"Appelt D. E. , and Onyshkevych B. 1998. The common pattern specification language. In Proceedings of the TIPSTER Text Program: Phase III, Baltimore, Maryland, pp. 23\u201330.","DOI":"10.21236\/ADA631525"},{"key":"S135132491700016X_ref065","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2007.12.001"},{"key":"S135132491700016X_ref026","unstructured":"Kazama J. , and Torisawa K. 2007. A new perceptron algorithm for sequence labeling with non-local features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Prague, Czech Republic, pp. 315\u201324."},{"key":"S135132491700016X_ref039","unstructured":"Nadeau D. 2007. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD Thesis, School of Information Technology and Engineering, University of Ottawa."},{"key":"S135132491700016X_ref037","doi-asserted-by":"publisher","DOI":"10.1016\/j.csi.2012.09.004"},{"key":"S135132491700016X_ref009","unstructured":"Chiticariu L. , Krishnamurthy R. , Li Y. , Reiss F. , and Vaithyanathan S. 2010. Domain adaptation of rule-based annotators for named-entity recognition tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Massachusetts, USA, pp. 
1002\u201312."},{"key":"S135132491700016X_ref028","volume-title":"AAAI Workshop on Adaptive Text Extraction and Mining","author":"Lavelli","year":"2004"},{"key":"S135132491700016X_ref053","doi-asserted-by":"publisher","DOI":"10.2200\/S00429ED1V01Y201207AIM018"},{"key":"S135132491700016X_ref027","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324914000114"},{"key":"S135132491700016X_ref018","doi-asserted-by":"crossref","unstructured":"Finkel J. R. , Grenager T. , and Manning C. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Ann Arbor, Michigan, pp. 363\u201370.","DOI":"10.3115\/1219840.1219885"},{"key":"S135132491700016X_ref047","first-page":"1634","volume-title":"ACM Symposium on Applied Computing","author":"Reeve","year":"2005"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S135132491700016X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,4,15]],"date-time":"2019-04-15T21:27:45Z","timestamp":1555363665000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S135132491700016X\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,6,15]]},"references-count":66,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2018,1]]}},"alternative-id":["S135132491700016X"],"URL":"https:\/\/doi.org\/10.1017\/s135132491700016x","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,6,15]]}}}