{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T08:48:54Z","timestamp":1770281334334,"version":"3.49.0"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2017,4,24]],"date-time":"2017-04-24T00:00:00Z","timestamp":1492992000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100006785","name":"Google","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100006785","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Toshiba"},{"DOI":"10.13039\/100000936","name":"Gordon and Betty Moore Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000936","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-1353606"],"award-info":[{"award-number":["IIS-1353606"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","award":["XDATA (FA8750-12-2-0335), DEFT (FA8750-13-2-0039), MEMEX, SIMPLEX"],"award-info":[{"award-number":["XDATA (FA8750-12-2-0335), DEFT (FA8750-13-2-0039), MEMEX, SIMPLEX"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000006","name":"Office of Naval Research","doi-asserted-by":"publisher","award":["N000141210041, N000141310129"],"award-info":[{"award-number":["N000141210041, N000141310129"]}],"id":[{"id":"10.13039\/100000006","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000879","name":"Alfred P. 
Sloan Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000879","id-type":"DOI","asserted-by":"publisher"}]},{"name":"American Family Insurance"},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U54EB020405"],"award-info":[{"award-number":["U54EB020405"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Commun. ACM"],"published-print":{"date-parts":[[2017,4,24]]},"abstract":"<jats:p>The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, while not requiring users to directly write any probabilistic inference algorithms. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. 
We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.<\/jats:p>","DOI":"10.1145\/3060586","type":"journal-article","created":{"date-parts":[[2017,4,26]],"date-time":"2017-04-26T12:58:52Z","timestamp":1493211532000},"page":"93-102","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":44,"title":["DeepDive"],"prefix":"10.1145","volume":"60","author":[{"given":"Ce","family":"Zhang","sequence":"first","affiliation":[{"name":"ETH Zurich, Zurich, Switzerland"}]},{"given":"Christopher","family":"R\u00e9","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Michael","family":"Cafarella","sequence":"additional","affiliation":[{"name":"Lattice Data, Inc., Palo Alto, CA"}]},{"given":"Christopher","family":"De Sa","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Alex","family":"Ratner","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Jaeho","family":"Shin","sequence":"additional","affiliation":[{"name":"Lattice Data, Inc., Palo Alto, CA"}]},{"given":"Feiran","family":"Wang","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Sen","family":"Wu","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]}],"member":"320","published-online":{"date-parts":[[2017,4,24]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"et al. Stanford's 2014 slot filling systems. TAC KBP","author":"Angeli G.","year":"2014","unstructured":"Angeli , G. et al. Stanford's 2014 slot filling systems. TAC KBP ( 2014 ). Angeli, G. et al. Stanford's 2014 slot filling systems. TAC KBP (2014)."},{"key":"e_1_2_1_2_1","volume-title":"IJCAI","author":"Banko M.","year":"2007","unstructured":"Banko , M. et al. Open information extraction from the Web . 
In IJCAI ( 2007 ). Banko, M. et al. Open information extraction from the Web. In IJCAI (2007)."},{"key":"e_1_2_1_3_1","volume-title":"AAAI Spring Symposium(2009)","author":"Betteridge J.","unstructured":"Betteridge , J. , Carlson , A. , Hong , S.A. , Hruschka , E.R. , Jr , Law, E.L., Mitchell , T.M. , Wang , S.H. Toward never ending language learning . In AAAI Spring Symposium(2009) . Betteridge, J., Carlson, A., Hong, S.A., Hruschka, E.R., Jr, Law, E.L., Mitchell, T.M., Wang, S.H. Toward never ending language learning. In AAAI Spring Symposium(2009)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/10704656_11"},{"key":"e_1_2_1_5_1","volume-title":"et al. Tools and methods for building Watson. IBM Research Report","author":"Brown E.","year":"2013","unstructured":"Brown , E. et al. Tools and methods for building Watson. IBM Research Report ( 2013 ). Brown, E. et al. Tools and methods for building Watson. IBM Research Report (2013)."},{"key":"e_1_2_1_6_1","volume-title":"AAAI","author":"Carlson A.","year":"2010","unstructured":"Carlson , A. et al. Toward an architecture for never-ending language learning . In AAAI ( 2010 ). Carlson, A. et al. Toward an architecture for never-ending language learning. In AAAI (2010)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497503"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.60"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610516"},{"key":"e_1_2_1_10_1","volume-title":"Ensuring rapid mixing and low bias for asynchronous gibbs sampling. arXiv preprint arXiv:1602.07415","author":"De Sa C.","year":"2016","unstructured":"De Sa , C. , Olukotun , K. , R\u00e9 , C. Ensuring rapid mixing and low bias for asynchronous gibbs sampling. arXiv preprint arXiv:1602.07415 ( 2016 ). De Sa, C., Olukotun, K., R\u00e9, C. Ensuring rapid mixing and low bias for asynchronous gibbs sampling. 
arXiv preprint arXiv:1602.07415 (2016)."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/1855041"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732951.2732962"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939515"},{"key":"e_1_2_1_14_1","volume-title":"WWW","author":"Etzioni O.","year":"2004","unstructured":"Etzioni , O. et al. Web-scale information extraction in KnowItAll: Preliminary results . In WWW ( 2004 ). Etzioni, O. et al. Web-scale information extraction in KnowItAll: Preliminary results. In WWW (2004)."},{"key":"e_1_2_1_15_1","volume-title":"et al. Building Watson: An overview of the DeepQA project. AI Magazine","author":"Ferrucci D.","year":"2010","unstructured":"Ferrucci , D. et al. Building Watson: An overview of the DeepQA project. AI Magazine ( 2010 ). Ferrucci, D. et al. Building Watson: An overview of the DeepQA project. AI Magazine (2010)."},{"key":"e_1_2_1_16_1","volume-title":"ACL","author":"Govindaraju V.","year":"2013","unstructured":"Govindaraju , V. et al. Understanding tables in context using standard NLP toolkits . In ACL ( 2013 ). Govindaraju, V. et al. Understanding tables in context using standard NLP toolkits. In ACL (2013)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/170036.170066"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.3115\/992133.992154"},{"key":"e_1_2_1_19_1","volume-title":"ACL","author":"Hoffmann R.","year":"2011","unstructured":"Hoffmann , R. et al. Knowledge-based weak supervision for information extraction of overlapping relations . In ACL ( 2011 ). Hoffmann, R. et al. Knowledge-based weak supervision for information extraction of overlapping relations. 
In ACL (2011)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376686"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511790423"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2012.156"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1519103.1519110"},{"key":"e_1_2_1_24_1","volume-title":"Incrementally maintaining classification using an RDBMS. PVLDB","author":"Koc M.L.","year":"2011","unstructured":"Koc , M.L. , R\u00e9 , C. Incrementally maintaining classification using an RDBMS. PVLDB ( 2011 ). Koc, M.L., R\u00e9, C. Incrementally maintaining classification using an RDBMS. PVLDB (2011)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1519103.1519105"},{"key":"e_1_2_1_26_1","volume-title":"HLT","author":"Li Y.","year":"2011","unstructured":"Li , Y. , Reiss , F.R. , Chiticariu , L. System T: A declarative information extraction system . In HLT ( 2011 ). Li, Y., Reiss, F.R., Chiticariu, L. System T: A declarative information extraction system. In HLT (2011)."},{"key":"e_1_2_1_27_1","volume-title":"An asynchronous parallel stochastic coordinate descent algorithm. ICML","author":"Liu J.","year":"2014","unstructured":"Liu , J. and An asynchronous parallel stochastic coordinate descent algorithm. ICML ( 2014 ). Liu, J. and et al. An asynchronous parallel stochastic coordinate descent algorithm. ICML (2014)."},{"key":"e_1_2_1_28_1","volume-title":"CIDR","author":"Madhavan J.","year":"2007","unstructured":"Madhavan , J. et al. Web-scale data integration: You can only afford to pay as you go . In CIDR ( 2007 ). Madhavan, J. et al. Web-scale data integration: You can only afford to pay as you go. 
In CIDR (2007)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btv476"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.3115\/1690219.1690287"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935869"},{"key":"e_1_2_1_32_1","volume-title":"NIPS","author":"Niu F.","year":"2011","unstructured":"Niu , F. et al. Hogwild! A lock-free approach to parallelizing stochastic gradient descent . In NIPS ( 2011 ). Niu, F. et al. Hogwild! A lock-free approach to parallelizing stochastic gradient descent. In NIPS (2011)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","unstructured":"Niu F. et al. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB(2011).  Niu F. et al. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB (2011).","DOI":"10.14778\/1978665.1978669"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.4018\/jswis.2012070103"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2012.96"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0113523"},{"key":"e_1_2_1_37_1","volume-title":"AAAI","author":"Poon H.","year":"2007","unstructured":"Poon , H. , Domingos , P.. Joint inference in information extraction . In AAAI ( 2007 ). Poon, H., Domingos, P.. Joint inference in information extraction. In AAAI (2007)."},{"key":"e_1_2_1_38_1","volume-title":"Data programming: Creating large training sets, quickly. arXiv preprint arXiv:1605.07723","author":"Ratner A.","year":"2016","unstructured":"Ratner , A. , De Sa , C. , Wu , S. , Selsam , D. , R\u00e9 , C. Data programming: Creating large training sets, quickly. arXiv preprint arXiv:1605.07723 ( 2016 ). Ratner, A., De Sa, C., Wu, S., Selsam, D., R\u00e9, C. Data programming: Creating large training sets, quickly. arXiv preprint arXiv:1605.07723 (2016)."},{"key":"e_1_2_1_39_1","volume-title":"et al. 
Feature engineering for knowledge base construction","author":"R\u00e9 C.","year":"2014","unstructured":"R\u00e9 , C. et al. Feature engineering for knowledge base construction . IEEE Data Eng. Bull . ( 2014 ). R\u00e9, C. et al. Feature engineering for knowledge base construction. IEEE Data Eng. Bull. (2014)."},{"key":"e_1_2_1_40_1","volume-title":"Monte Carlo Statistical Methods","author":"Robert C.P","year":"2005","unstructured":"Robert , C.P , Casella , G. Monte Carlo Statistical Methods . Springer-Verlag New York, Inc. , Secaucus, NJ, USA , 2005 . Robert, C.P, Casella, G. Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005."},{"key":"e_1_2_1_41_1","volume-title":"VLDB","author":"Shen W.","year":"2007","unstructured":"Shen , W. et al. Declarative information extraction using datalog with embedded extraction predicates . In VLDB ( 2007 ). Shen, W. et al. Declarative information extraction using datalog with embedded extraction predicates. In VLDB (2007)."},{"key":"e_1_2_1_42_1","volume-title":"et al. Incremental knowledge base construction using deepdive. PVLDB","author":"Shin J.","year":"2015","unstructured":"Shin , J. et al. Incremental knowledge base construction using deepdive. PVLDB ( 2015 ). Shin, J. et al. Incremental knowledge base construction using deepdive. PVLDB (2015)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526794"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSP.2006.874409"},{"key":"e_1_2_1_45_1","volume-title":"Graphical models, exponential families, and variational inference. FTML","author":"Wainwright M.J.","year":"2008","unstructured":"Wainwright , M.J. , Jordan , M.I. Graphical models, exponential families, and variational inference. FTML ( 2008 ). Wainwright, M.J., Jordan, M.I. Graphical models, exponential families, and variational inference. 
FTML (2008)."},{"key":"e_1_2_1_46_1","volume-title":"PODS","author":"Weikum G.","year":"2010","unstructured":"Weikum , G. , Theobald , M. From information to knowledge: Harvesting entities and relationships from web sources . In PODS ( 2010 ). Weikum, G., Theobald, M. From information to knowledge: Harvesting entities and relationships from web sources. In PODS (2010)."},{"key":"e_1_2_1_47_1","volume-title":"et al. Scalable probabilistic databases with factor graphs and MCMC. PVLDB","author":"Wick M.","year":"2010","unstructured":"Wick , M. et al. Scalable probabilistic databases with factor graphs and MCMC. PVLDB ( 2010 ). Wick, M. et al. Scalable probabilistic databases with factor graphs and MCMC. PVLDB (2010)."},{"key":"e_1_2_1_48_1","volume-title":"NAACL","author":"Yates A.","year":"2007","unstructured":"Yates , A. et al. TextRunner: Open information extraction on the Web . In NAACL ( 2007 ). Yates, A. et al. TextRunner: Open information extraction on the Web. In NAACL (2007)."},{"key":"e_1_2_1_49_1","volume-title":"SIGMOD","author":"Zhang C.","year":"2013","unstructured":"Zhang , C. et al. GeoDeepDive: Statistical inference using familiar data-processing languages . In SIGMOD ( 2013 ). Zhang, C. et al. GeoDeepDive: Statistical inference using familiar data-processing languages. In SIGMOD (2013)."},{"key":"e_1_2_1_50_1","volume-title":"SIGMOD","author":"Zhang C.","year":"2013","unstructured":"Zhang , C. , R\u00e9 , C. Towards high- throughput Gibbs sampling at scale: A study across storage managers . In SIGMOD ( 2013 ). Zhang, C., R\u00e9, C. Towards high- throughput Gibbs sampling at scale: A study across storage managers. In SIGMOD (2013)."},{"key":"e_1_2_1_51_1","volume-title":"DimmWitted: A study of main-memory statistical analytics. PVLDB","author":"Zhang C.","year":"2014","unstructured":"Zhang , C. , R\u00e9 , C.. DimmWitted: A study of main-memory statistical analytics. PVLDB ( 2014 ). Zhang, C., R\u00e9, C.. 
DimmWitted: A study of main-memory statistical analytics. PVLDB (2014)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526724"},{"key":"e_1_2_1_53_1","unstructured":"Zinkevich M. et al. Parallelized stochastic gradient descent. In NIPS(2010) 2595--2603.  Zinkevich M. et al. Parallelized stochastic gradient descent. In NIPS (2010) 2595--2603."}],"container-title":["Communications of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3060586","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3060586","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3060586","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:03:21Z","timestamp":1750215801000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3060586"}},"subtitle":["declarative knowledge base construction"],"short-title":[],"issued":{"date-parts":[[2017,4,24]]},"references-count":53,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2017,4,24]]}},"alternative-id":["10.1145\/3060586"],"URL":"https:\/\/doi.org\/10.1145\/3060586","relation":{},"ISSN":["0001-0782","1557-7317"],"issn-type":[{"value":"0001-0782","type":"print"},{"value":"1557-7317","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,4,24]]},"assertion":[{"value":"2017-04-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}