{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:17:47Z","timestamp":1750306667407,"version":"3.41.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2015,6,30]],"date-time":"2015-06-30T00:00:00Z","timestamp":1435622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NFS","award":["CCF-0938071, CCF-0938064, CNS-0716532, IIS-1423238 and IIS-1421247"],"award-info":[{"award-number":["CCF-0938071, CCF-0938064, CNS-0716532, IIS-1423238 and IIS-1421247"]}]},{"name":"a Yahoo! Key Scientific Challenges award"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2015,6,30]]},"abstract":"<jats:p>\n            It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and\/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a\n            <jats:italic>conceptual design<\/jats:italic>\n            for the annotated collection. In this article, we introduce and formally define the problem of\n            <jats:italic>cost-effective conceptual design<\/jats:italic>\n            where, given a collection, a set of relevant concepts, and a fixed budget, one likes to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, an\n            <jats:italic>approximate popularity maximization<\/jats:italic>\n            (APM for short), and\n            <jats:italic>approximate annotation-benefit maximization<\/jats:italic>\n            (AAM for short). We show that, if there are no constraints regrading the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.\n          <\/jats:p>","DOI":"10.1145\/2716321","type":"journal-article","created":{"date-parts":[[2015,6,30]],"date-time":"2015-06-30T15:45:37Z","timestamp":1435679137000},"page":"1-39","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Cost-Effective Conceptual Design for Information Extraction"],"prefix":"10.1145","volume":"40","author":[{"given":"Arash","family":"Termehchy","sequence":"first","affiliation":[{"name":"Oregon State University, OR"}]},{"given":"Ali","family":"Vakilian","sequence":"additional","affiliation":[{"name":"MIT, Cambridge, MA"}]},{"given":"Yodsawalai","family":"Chodpathumwan","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]},{"given":"Marianne","family":"Winslett","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]}],"member":"320","published-online":{"date-parts":[[2015,6,30]]},"reference":[{"volume-title":"Web Data Management","author":"Abiteboul Serge","key":"e_1_2_1_1_1","unstructured":"Serge Abiteboul , Ioana Manolescu , Philippe Rigaux , Marie-Christine Rousset , and Pierre Senellart . 2011. Web Data Management . Cambridge University Press . Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart. 2011. Web Data Management. Cambridge University Press."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2003.1260786"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR'13)","author":"Anderson Michael","year":"2013","unstructured":"Michael Anderson , Dolan Antenucci , Victor Bittorf , Matthew Burgess , Michael Cafarella , Arun Kumar , Feng Niu , Yongjoo Park , Christopher Re , and Ce Zhang . 2013 . Brainwash: A data system for feature engineering . In Proceedings of the Conference on Innovative Data Systems Research (CIDR'13) . Michael Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess, Michael Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Re, and Ce Zhang. 2013. Brainwash: A data system for feature engineering. In Proceedings of the Conference on Innovative Data Systems Research (CIDR'13)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772703"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018991717352"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the International Conference on World Wide Web (WWW'07)","author":"Cafarella Michael","year":"2007","unstructured":"Michael Cafarella , Dan Suciu , and Oren Etzioni . 2007 . Navigating extracted data with schema discovery . In Proceedings of the International Conference on World Wide Web (WWW'07) . Michael Cafarella, Dan Suciu, and Oren Etzioni. 2007. Navigating extracted data with schema discovery. In Proceedings of the International Conference on World Wide Web (WWW'07)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1135777.1135882"},{"key":"e_1_2_1_8_1","first-page":"1","article-title":"Enhancing search with structure","volume":"33","author":"Chakrabarti Soumen","year":"2010","unstructured":"Soumen Chakrabarti , Sunita Sarawagi , and S. Sudarshan . 2010 . Enhancing search with structure . IEEE Data Engin. Bull. 33 , 1 . Soumen Chakrabarti, Sunita Sarawagi, and S. Sudarshan. 2010. Enhancing search with structure. IEEE Data Engin. Bull. 33, 1.","journal-title":"IEEE Data Engin. Bull."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.60"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807339"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1321440.1321512"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148247"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559795.1559797"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1887568.1887594"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775178"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR'09)","author":"Doan Anhai","year":"2009","unstructured":"Anhai Doan , Jeff Naughton , Akanksha Baid , Xiaoyang Chai , Fei Chen , 2009 . The case for a structured approach to managing unstructured data . In Proceedings of the Conference on Innovative Data Systems Research (CIDR'09) . Anhai Doan, Jeff Naughton, Akanksha Baid, Xiaoyang Chai, Fei Chen, et al. 2009. The case for a structured approach to managing unstructured data. In Proceedings of the Conference on Innovative Data Systems Research (CIDR'09)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535568.2448938"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/1642293.1642459"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1961209.1961211"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213913"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807085.1807121"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the CSLDAMT Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (CSLDAMT-NAACL HLT'10)","author":"Finin Tim","year":"2010","unstructured":"Tim Finin , Will Murnane , Anand Karandikar , Nicholas Keller , Justin Martineau , and Mark Dredze . 2010 . Annotating named entities in Twitter data with crowdsourcing . In Proceedings of the CSLDAMT Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (CSLDAMT-NAACL HLT'10) . Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the CSLDAMT Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (CSLDAMT-NAACL HLT'10)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0377-2217(03)00274-1"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1287\/opre.43.3.500"},{"key":"e_1_2_1_25_1","volume-title":"Database Systems: The Complete Book","author":"Garciamolina Hector","year":"2008","unstructured":"Hector Garciamolina , Jeff Ullman , and Jennifer Widom . 2008 . Database Systems: The Complete Book . Prentice Hall . Hector Garciamolina, Jeff Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book. Prentice Hall."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the Conference on Very Large Databases (VLDB'04)","author":"Graupmann Jens","year":"2004","unstructured":"Jens Graupmann , Michael Biwer , Christian Zimmer , Patrick Zimmer , Matthias Bender , Martin Theobald , and Gerhard Weikum . 2004 . COMPASS: A concept-based Web search engine for HTML, XML, and Deep Web data . In Proceedings of the Conference on Very Large Databases (VLDB'04) . Jens Graupmann, Michael Biwer, Christian Zimmer, Patrick Zimmer, Matthias Bender, Martin Theobald, and Gerhard Weikum. 2004. COMPASS: A concept-based Web search engine for HTML, XML, and Deep Web data. In Proceedings of the Conference on Very Large Databases (VLDB'04)."},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the Conference on Very Large Databases (VLDB'05)","author":"Graupmann Jens","year":"2005","unstructured":"Jens Graupmann , Ralf Schenkel , and Gerhard Weikum . 2005 . The SphereSearch engine for unified ranked retrieval of heterogeneous XML and Web documents . In Proceedings of the Conference on Very Large Databases (VLDB'05) . Jens Graupmann, Ralf Schenkel, and Gerhard Weikum. 2005. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and Web documents. In Proceedings of the Conference on Very Large Databases (VLDB'05)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767842"},{"volume-title":"The Elements of Statistical Learning","author":"Hastie Trevor","key":"e_1_2_1_29_1","unstructured":"Trevor Hastie , Robert Tibshirani , and Jerome Friedman . 2009. The Elements of Statistical Learning . Springer . Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. Springer."},{"key":"e_1_2_1_30_1","first-page":"500","article-title":"Prioritization of domain-specific Web information extraction","volume":"43","author":"Huang Jian","year":"2010","unstructured":"Jian Huang and Cong Yu . 2010 . Prioritization of domain-specific Web information extraction . Oper. Res. 43 , 500 -- 508 . Jian Huang and Cong Yu. 2010. Prioritization of domain-specific Web information extraction. Oper. Res. 43, 500--508.","journal-title":"Oper. Res."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/321906.321909"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142504"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497472"},{"key":"e_1_2_1_34_1","unstructured":"Alpa Jain Panagiotis Ipeirotis and Luis Gravano. 2008b. Building query optimizers for information extraction: The SQoUT project. http:\/\/www.cs.columbia.edu\/&sim;gravano\/Papers\/2008\/sigmod-record08.pdf.  Alpa Jain Panagiotis Ipeirotis and Luis Gravano. 2008b. Building query optimizers for information extraction: The SQoUT project. http:\/\/www.cs.columbia.edu\/&sim;gravano\/Papers\/2008\/sigmod-record08.pdf."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1287\/moor.8.1.1"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2124295.2124328"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142591"},{"volume-title":"On the existence of fast approximation schemes","author":"Korte Bernhard","key":"e_1_2_1_38_1","unstructured":"Bernhard Korte and Rainer Schrader . 1981. On the existence of fast approximation schemes . In Nonlinear Programming, O. L. Mangasarian, R. R. Meyer, and S. M. Robinson, Eds. Academic Press , New York , 41--437. Bernhard Korte and Rainer Schrader. 1981. On the existence of fast approximation schemes. In Nonlinear Programming, O. L. Mangasarian, R. R. Meyer, and S. M. Robinson, Eds. Academic Press, New York, 41--437."},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the Conference on Very Large Databases (VLDB'06)","author":"Kowalkiewicz Marek","year":"2006","unstructured":"Marek Kowalkiewicz , Tomasz Kaczmarek , and Witold Abramowicz . 2006 . Myportal: Robust extraction and aggregation of Web content . In Proceedings of the Conference on Very Large Databases (VLDB'06) . Marek Kowalkiewicz, Tomasz Kaczmarek, and Witold Abramowicz. 2006. Myportal: Robust extraction and aggregation of Web content. In Proceedings of the Conference on Very Large Databases (VLDB'06)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/146802.146810"},{"volume-title":"An Introduction to Information Retrieval","author":"Manning Christopher","key":"e_1_2_1_41_1","unstructured":"Christopher Manning , Prabhakar Raghavan , and Hinrich Schutze . 2008. An Introduction to Information Retrieval . Cambridge University Press . Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. An Introduction to Information Retrieval. Cambridge University Press."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1105664.1105679"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807214"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the National Conference on Artificial Intelligence (AAAI'99)","author":"Riloff Ellen","year":"1999","unstructured":"Ellen Riloff and Rosie Jones . 1999 . Learning dictionaries for information extraction by multi-level bootstrapping . In Proceedings of the National Conference on Artificial Intelligence (AAAI'99) . Ellen Riloff and Rosie Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the National Conference on Artificial Intelligence (AAAI'99)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390420"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000003"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the Symposium on Database Systems for Business, Technology and Web (BTW'07)","author":"Schenkel Ralf","year":"2007","unstructured":"Ralf Schenkel , Fabian Suchanek , and Gjergji Kasneci . 2007 . YAWN: A semantically annotated Wikipedia XML corpus . In Proceedings of the Symposium on Database Systems for Business, Technology and Web (BTW'07) . Ralf Schenkel, Fabian Suchanek, and Gjergji Kasneci. 2007. YAWN: A semantically annotated Wikipedia XML corpus. In Proceedings of the Symposium on Database Systems for Business, Technology and Web (BTW'07)."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376718"},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the Conference on Very Large Databases (VLDB'07)","author":"Shen Warren","year":"2007","unstructured":"Warren Shen , Anhai Doan , Jeff Naughton , and Raghu Ramakrishnan . 2007 . Declarative information extraction using Datalog with embedded extraction predicates . In Proceedings of the Conference on Very Large Databases (VLDB'07) . Warren Shen, Anhai Doan, Jeff Naughton, and Raghu Ramakrishnan. 2007. Declarative information extraction using Datalog with embedded extraction predicates. In Proceedings of the Conference on Very Large Databases (VLDB'07)."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/860435.860466"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610496"},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the European Conference on IR Research (ECIR'07)","author":"Zwol Roelof Van","year":"2007","unstructured":"Roelof Van Zwol and Tim Van Loosbroek . 2007 . Effective use of semantic structure in XML retrieval . In Proceedings of the European Conference on IR Research (ECIR'07) . Roelof Van Zwol and Tim Van Loosbroek. 2007. Effective use of semantic structure in XML retrieval. In Proceedings of the European Conference on IR Research (ECIR'07)."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2716321","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2716321","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:00:43Z","timestamp":1750230043000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2716321"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,6,30]]},"references-count":52,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2015,6,30]]}},"alternative-id":["10.1145\/2716321"],"URL":"https:\/\/doi.org\/10.1145\/2716321","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"type":"print","value":"0362-5915"},{"type":"electronic","value":"1557-4644"}],"subject":[],"published":{"date-parts":[[2015,6,30]]},"assertion":[{"value":"2014-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-06-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}