{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T11:48:42Z","timestamp":1763466522198},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2015,1]]},"abstract":"<jats:p>Enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing from a database. To meet customers' pressing demands, database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes offer User-Defined Aggregate (UDA), a data-driven operator, to implement many of the analytical techniques in parallel. However, UDAs can not be used to implement statistical algorithms such as Markov chain Monte Carlo (MCMC), where most of the work is performed by iterative transitions over a large state that can not be naively partitioned due to data dependency. Typically, this type of statistical algorithm requires pre-processing to setup the large state in the first place and demands post-processing after the statistical inference. This paper presents General Iterative State Transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA, and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA-GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile applications: cross-document coreference and image denoising. We show that the in-database framework allows us to tackle a 27 times larger problem than solved by the state-of-the-art for the first application and achieves 43 times speedup over the state-of-the-art for the second application.<\/jats:p>","DOI":"10.14778\/2735479.2735488","type":"journal-article","created":{"date-parts":[[2015,5,12]],"date-time":"2015-05-12T15:37:52Z","timestamp":1431445072000},"page":"557-568","source":"Crossref","is-referenced-by-count":10,"title":["UDA-GIST"],"prefix":"10.14778","volume":"8","author":[{"given":"Kun","family":"Li","sequence":"first","affiliation":[{"name":"University of Florida"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daisy Zhe","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Florida"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alin","family":"Dobra","sequence":"additional","affiliation":[{"name":"University of Florida"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christopher","family":"Dudley","sequence":"additional","affiliation":[{"name":"University of Florida"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,1]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807224"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.3115\/980451.980859"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4302-1125-9_7"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2038037.1941561"},{"issue":"4","key":"e_1_2_1_5_1","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1080\/00031305.1995.10476177","article-title":"Understanding the metropolis-hastings algorithm","volume":"49","author":"Chib S.","year":"1995","unstructured":"S. Chib and E. Greenberg . Understanding the metropolis-hastings algorithm . The American Statistician , 49 ( 4 ): 327 -- 335 , 1995 . S. Chib and E. Greenberg. Understanding the metropolis-hastings algorithm. The American Statistician, 49(4): 327--335, 1995.","journal-title":"The American Statistician"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687576"},{"key":"e_1_2_1_7_1","first-page":"137","volume-title":"OSDI","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat . Mapreduce: Simplified data processing on large clusters . In OSDI , pages 137 -- 150 . USENIX Association , 2004 . J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150. USENIX Association, 2004."},{"key":"e_1_2_1_8_1","volume-title":"June","author":"Dobra A.","year":"2011","unstructured":"A. Dobra . Datapath: High-performance database engine , June 2011 . A. Dobra. Datapath: High-performance database engine, June 2011."},{"key":"e_1_2_1_9_1","first-page":"261","volume-title":"CVPR (1)","author":"Felzenszwalb P. F.","year":"2004","unstructured":"P. F. Felzenszwalb and D. P. Huttenlocher . Efficient belief propagation for early vision . In CVPR (1) , pages 261 -- 268 , 2004 . P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. In CVPR (1), pages 261--268, 2004."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007465528199"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/2367502.2367510"},{"key":"e_1_2_1_12_1","first-page":"905","volume-title":"Journal of Machine Learning Research","author":"Ihler A. T.","year":"2005","unstructured":"A. T. Ihler , J. Iii , and A. S. Willsky . Loopy belief propagation: Convergence and effects of message errors . In Journal of Machine Learning Research , pages 905 -- 936 , 2005 . A. T. Ihler, J. Iii, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. In Journal of Machine Learning Research, pages 905--936, 2005."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2486767.2486774"},{"key":"e_1_2_1_14_1","volume-title":"Graphlab: A new framework for parallel machine learning. CoRR, abs\/1006.4990","author":"Low Y.","year":"2010","unstructured":"Y. Low , J. Gonzalez , A. Kyrola , D. Bickson , C. Guestrin , and J. M. Hellerstein . Graphlab: A new framework for parallel machine learning. CoRR, abs\/1006.4990 , 2010 . Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new framework for parallel machine learning. CoRR, abs\/1006.4990, 2010."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/2212351.2212354"},{"key":"e_1_2_1_16_1","unstructured":"A. Mahout. Scalable machine-learning and data-mining library. available at mahout.apache.org. A. Mahout. Scalable machine-learning and data-mining library. available at mahout.apache.org."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5160991"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.963420"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/2073796.2073849"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/1978665.1978669"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4613-8190-7","volume-title":"Markov random fields","author":"Rozanov Y. A.","year":"1982","unstructured":"Y. A. Rozanov . Markov random fields . Springer , 1982 . Y. A. Rozanov. Markov random fields. Springer, 1982."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2146382.2146386"},{"key":"e_1_2_1_24_1","first-page":"793","volume-title":"ACL","author":"Singh S.","year":"2011","unstructured":"S. Singh , A. Subramanya , F. C. N. Pereira , and A. McCallum . Large-scale cross-document coreference using distributed inference and hierarchical models. In D. Lin, Y. Matsumoto, and R. Mihalcea, editors , ACL , pages 793 -- 803 . The Association for Computer Linguistics , 2011 . S. Singh, A. Subramanya, F. C. N. Pereira, and A. McCallum. Large-scale cross-document coreference using distributed inference and hierarchical models. In D. Lin, Y. Matsumoto, and R. Mihalcea, editors, ACL, pages 793--803. The Association for Computer Linguistics, 2011."},{"key":"e_1_2_1_25_1","volume-title":"Efficient in-database analytics with graphical models","author":"Wang D. Z.","year":"2014","unstructured":"D. Z. Wang , Y. Chen , C. Grant , and K. Li . Efficient in-database analytics with graphical models . IEEE Data Engineering Bulletin , 2014 . D. Z. Wang, Y. Chen, C. Grant, and K. Li. Efficient in-database analytics with graphical models. IEEE Data Engineering Bulletin, 2014."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/846219.847359"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2484425.2484427"},{"key":"e_1_2_1_28_1","first-page":"10","volume-title":"Proceedings of the 2nd USENIX conference on Hot topics in cloud computing","author":"Zaharia M.","year":"2010","unstructured":"M. Zaharia , M. Chowdhury , M. J. Franklin , S. Shenker , and I. Stoica . Spark: cluster computing with working sets . In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing , pages 10 -- 10 , 2010 . M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10--10, 2010."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2735479.2735488","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,8]],"date-time":"2024-06-08T23:05:23Z","timestamp":1717887923000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2735479.2735488"}},"subtitle":["an in-database framework to unify data-parallel and state-parallel analytics"],"short-title":[],"issued":{"date-parts":[[2015,1]]},"references-count":27,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2015,1]]}},"alternative-id":["10.14778\/2735479.2735488"],"URL":"https:\/\/doi.org\/10.14778\/2735479.2735488","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2015,1]]}}}