{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,17]],"date-time":"2024-08-17T13:08:51Z","timestamp":1723900131580},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2012,8]]},"abstract":"<jats:p>MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import\/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.<\/jats:p><jats:p>In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. 
We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.<\/jats:p><jats:p>MADlib is freely available at http:\/\/madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.<\/jats:p>","DOI":"10.14778\/2367502.2367510","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1700-1711","source":"Crossref","is-referenced-by-count":246,"title":["The MADlib analytics library"],"prefix":"10.14778","volume":"5","author":[{"given":"Joseph M.","family":"Hellerstein","sequence":"first","affiliation":[{"name":"U.C. Berkeley"}]},{"given":"Christopher","family":"R\u00e9","sequence":"additional","affiliation":[{"name":"U. Wisconsin"}]},{"given":"Florian","family":"Schoppmann","sequence":"additional","affiliation":[{"name":"Greenplum"}]},{"given":"Daisy Zhe","family":"Wang","sequence":"additional","affiliation":[{"name":"U. Florida"}]},{"given":"Eugene","family":"Fratkin","sequence":"additional","affiliation":[{"name":"Greenplum"}]},{"given":"Aleksander","family":"Gorajek","sequence":"additional","affiliation":[{"name":"Greenplum"}]},{"given":"Kee Siong","family":"Ng","sequence":"additional","affiliation":[{"name":"Greenplum"}]},{"given":"Caleb","family":"Welton","sequence":"additional","affiliation":[{"name":"Greenplum"}]},{"given":"Xixuan","family":"Feng","sequence":"additional","affiliation":[{"name":"U. Wisconsin"}]},{"given":"Kun","family":"Li","sequence":"additional","affiliation":[{"name":"U. Florida"}]},{"given":"Arun","family":"Kumar","sequence":"additional","affiliation":[{"name":"U. 
Wisconsin"}]}],"member":"320","published-online":{"date-parts":[[2012,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-009-5103-0"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","DOI":"10.1137\/1.9780898719604","volume-title":"LAPACK Users' Guide","author":"Anderson E.","year":"1999","unstructured":"E. Anderson , Z. Bai , C. Bischof , LAPACK Users' Guide . Society for Industrial and Applied Mathematics , third edition, 1999 . E. Anderson, Z. Bai, C. Bischof, et al. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, third edition, 1999."},{"key":"e_1_2_1_3_1","unstructured":"Apache Mahout. http:\/\/mahout.apache.org\/. Apache Mahout. http:\/\/mahout.apache.org\/."},{"key":"e_1_2_1_4_1","first-page":"405","volume-title":"FOCS","author":"Arthur D.","year":"2009","unstructured":"D. Arthur , B. Manthey , and H. Roglin . k-means has polynomial smoothed complexity . In FOCS , pages 405 -- 414 , 2009 . 10.1109\/FOCS.2009.14 D. Arthur, B. Manthey, and H. Roglin. k-means has polynomial smoothed complexity. In FOCS, pages 405--414, 2009. 10.1109\/FOCS.2009.14"},{"key":"e_1_2_1_5_1","first-page":"1027","volume-title":"SODA","author":"Arthur D.","year":"2007","unstructured":"D. Arthur and S. Vassilvitskii . k-means++: The advantages of careful seeding . In SODA , pages 1027 -- 1035 , 2007 . D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007."},{"key":"e_1_2_1_6_1","volume-title":"Athena Scientific","author":"Bertsekas D. P.","year":"1999","unstructured":"D. P. Bertsekas . Nonlinear Programming . Athena Scientific , 2 nd edition, 1999 . D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.","edition":"2"},{"key":"e_1_2_1_7_1","first-page":"1151","volume-title":"ICDE","author":"Borkar V.","year":"2011","unstructured":"V. Borkar , M. Carey , R. Grover , : A flexible and extensible foundation for data-intensive computing . 
In ICDE , pages 1151 -- 1162 , 2011 . 10.1109\/ICDE.2011.5767921 V. Borkar, M. Carey, R. Grover, et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pages 1151--1162, 2011. 10.1109\/ICDE.2011.5767921"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511804441","volume-title":"Convex Optimization","author":"Boyd S.","year":"2004","unstructured":"S. Boyd and L. Vandenberghe . Convex Optimization . Cambridge University Press , 2004 . S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004."},{"issue":"1","key":"e_1_2_1_9_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/0010-4655(96)00017-3","article-title":"ScaLAPACK: A portable linear algebra library for distributed memory computers -- design issues and performance","volume":"97","author":"Choi J.","year":"1996","unstructured":"J. Choi , J. Demmel , I. Dhillon , ScaLAPACK: A portable linear algebra library for distributed memory computers -- design issues and performance . Computer Physics Communications , 97 ( 1 ): 1 -- 15 , 1996 . J. Choi, J. Demmel, I. Dhillon, et al. ScaLAPACK: A portable linear algebra library for distributed memory computers -- design issues and performance. Computer Physics Communications, 97(1):1--15, 1996.","journal-title":"Computer Physics Communications"},{"key":"e_1_2_1_10_1","first-page":"281","volume-title":"NIPS","author":"Chu C.-T.","year":"2006","unstructured":"C.-T. Chu , S. K. Kim , Y.-A. Lin , Map-reduce for machine learning on multicore . In NIPS , pages 281 -- 288 , 2006 . C.-T. Chu, S. K. Kim, Y.-A. Lin, et al. Map-reduce for machine learning on multicore. In NIPS, pages 281--288, 2006."},{"issue":"2","key":"e_1_2_1_11_1","first-page":"1481","article-title":"MAD Skills: New analysis practices for big data","volume":"2","author":"Cohen J.","year":"2009","unstructured":"J. Cohen , B. Dolan , M. Dunlap , MAD Skills: New analysis practices for big data . 
PVLDB , 2 ( 2 ): 1481 -- 1492 , 2009 . J. Cohen, B. Dolan, M. Dunlap, et al. MAD Skills: New analysis practices for big data. PVLDB, 2(2):1481--1492, 2009.","journal-title":"PVLDB"},{"key":"e_1_2_1_12_1","volume-title":"The text mining handbook: advanced approaches in analyzing unstructured data","author":"Feldman R.","year":"2007","unstructured":"R. Feldman and J. Sanger . The text mining handbook: advanced approaches in analyzing unstructured data . Cambridge University Press , 2007 . R. Feldman and J. Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, 2007."},{"key":"e_1_2_1_13_1","first-page":"325","volume-title":"SIGMOD","author":"Feng X.","year":"2012","unstructured":"X. Feng , A. Kumar , B. Recht , Towards a unified architecture for in-RDBMS analytics . In SIGMOD , pages 325 -- 336 , 2012 . 10.1145\/2213836.2213874 X. Feng, A. Kumar, B. Recht, et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, pages 325--336, 2012. 10.1145\/2213836.2213874"},{"issue":"3","key":"e_1_2_1_14_1","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1109\/PROC.1973.9030","article-title":"The Viterbi algorithm","volume":"61","author":"Jr G. Forney","year":"1973","unstructured":"G. Forney Jr . The Viterbi algorithm . Proceedings of the IEEE , 61 ( 3 ): 268 -- 278 , 1973 . G. Forney Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268--278, 1973.","journal-title":"Proceedings of the IEEE"},{"key":"e_1_2_1_15_1","first-page":"231","volume-title":"ICDE","author":"Ghoting A.","year":"2011","unstructured":"A. Ghoting , R. Krishnamurthy , E. Pednault , : Declarative machine learning on MapReduce . In ICDE , pages 231 -- 242 , 2011 . 10.1109\/ICDE.2011.5767930 A. Ghoting, R. Krishnamurthy, E. Pednault, et al. SystemML: Declarative machine learning on MapReduce. In ICDE, pages 231--242, 2011. 
10.1109\/ICDE.2011.5767930"},{"issue":"4","key":"e_1_2_1_16_1","first-page":"28","article-title":"Using q-grams in a DBMS for approximate string processing","volume":"24","author":"Gravano L.","year":"2001","unstructured":"L. Gravano , P. Ipeirotis , H. Jagadish , Using q-grams in a DBMS for approximate string processing . IEEE Data Engineering Bulletin , 24 ( 4 ): 28 -- 34 , 2001 . L. Gravano, P. Ipeirotis, H. Jagadish, et al. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28--34, 2001.","journal-title":"IEEE Data Engineering Bulletin"},{"key":"e_1_2_1_17_1","unstructured":"G. Guennebaud B. Jacob etal Eigen v3. http:\/\/eigen.tuxfamily.org 2010. G. Guennebaud B. Jacob et al. Eigen v3. http:\/\/eigen.tuxfamily.org 2010."},{"key":"e_1_2_1_18_1","volume-title":"Speech and Language Processing","author":"Jurafsky D.","year":"2008","unstructured":"D. Jurafsky and M. J. H. Speech and Language Processing . Pearson Prentice Hall , 2008 . D. Jurafsky and M. J. H. Speech and Language Processing. Pearson Prentice Hall, 2008."},{"key":"e_1_2_1_19_1","first-page":"282","volume-title":"ICML","author":"Lafferty J. D.","year":"2001","unstructured":"J. D. Lafferty , A. McCallum , and F. C. N. Pereira . Conditional random fields: Probabilistic models for segmenting and labeling sequence data . In ICML , pages 282 -- 289 , 2001 . J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282--289, 2001."},{"key":"e_1_2_1_20_1","unstructured":"J. Langford. http:\/\/hunch.net\/~vw\/. J. Langford. http:\/\/hunch.net\/~vw\/."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},{"key":"e_1_2_1_22_1","first-page":"340","volume-title":"UAI","author":"Low Y.","year":"2010","unstructured":"Y. Low , J. Gonzalez , A. Kyrola , : A new framework for parallel machine learning . In UAI , pages 340 -- 349 , 2010 . Y. 
Low, J. Gonzalez, A. Kyrola, et al. GraphLab: A new framework for parallel machine learning. In UAI, pages 340--349, 2010."},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1007\/978-3-642-00202-1_24","volume-title":"WALCOM: Algorithms and Computation","author":"Mahajan M.","year":"2009","unstructured":"M. Mahajan , P. Nimbhorkar , and K. Varadarajan . The planar k-means problem is NP-hard . WALCOM: Algorithms and Computation , pages 274 -- 285 , 2009 . 10.1007\/978-3-642-00202-1_24 M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. WALCOM: Algorithms and Computation, pages 274--285, 2009. 10.1007\/978-3-642-00202-1_24"},{"key":"e_1_2_1_24_1","first-page":"135","volume-title":"SIGMOD","author":"Malewicz G.","year":"2010","unstructured":"G. Malewicz , M. H. Austern , A. J. Bik , : a system for large-scale graph processing . In SIGMOD , pages 135 -- 146 , 2010 . 10.1145\/1807167.1807184 G. Malewicz, M. H. Austern, A. J. Bik, et al. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010. 10.1145\/1807167.1807184"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/375360.375365"},{"key":"e_1_2_1_26_1","first-page":"263","volume-title":"Stochastic Optimization: Algorithms and Applications","author":"Nedic A.","year":"2000","unstructured":"A. Nedic and D. P. Bertsekas . Convergence rate of incremental subgradient algorithms . In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications , pages 263 -- 304 . Kluwer Academic Publishers , 2000 . A. Nedic and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263--304. Kluwer Academic Publishers, 2000."},{"key":"e_1_2_1_27_1","first-page":"89","volume-title":"PLDI","author":"Nethercote N.","year":"2007","unstructured":"N. Nethercote and J. Seward . 
Valgrind: A framework for heavyweight dynamic binary instrumentation . In PLDI , pages 89 -- 100 , 2007 . 10.1145\/1250734.1250746 N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, pages 89--100, 2007. 10.1145\/1250734.1250746"},{"key":"e_1_2_1_28_1","unstructured":"Oracle R Enterprise. http:\/\/www.oracle.com\/technetwork\/database\/options\/advanced-analytics\/r-enterprise\/index.html. Oracle R Enterprise. http:\/\/www.oracle.com\/technetwork\/database\/options\/advanced-analytics\/r-enterprise\/index.html."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.31"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.44"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1145\/342009.335468","volume-title":"SIGMOD","author":"Ordonez C.","year":"2000","unstructured":"C. Ordonez and P. Cereghini . SQLEM: Fast clustering in SQL using the EM algorithm . In SIGMOD , pages 559 -- 570 , 2000 . 10.1145\/335191.335468 C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000. 10.1145\/335191.335468"},{"key":"e_1_2_1_32_1","first-page":"165","volume-title":"SIGMOD","author":"Pavlo A.","year":"2009","unstructured":"A. Pavlo , E. Paulson , A. Rasin , A comparison of approaches to large-scale data analysis . In SIGMOD , pages 165 -- 178 . ACM, 2009 . 10.1145\/1559845.1559865 A. Pavlo, E. Paulson, A. Rasin, et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178. ACM, 2009. 10.1145\/1559845.1559865"},{"key":"e_1_2_1_33_1","unstructured":"Revloution Analytics. http:\/\/www.revolutionanalytics.com\/. Revloution Analytics. 
http:\/\/www.revolutionanalytics.com\/."},{"issue":"1","key":"e_1_2_1_34_1","doi-asserted-by":"crossref","first-page":"23","DOI":"10.11120\/msor.2001.01010023","article-title":"project in statistical computing","volume":"1","author":"Ripley B.","year":"2001","unstructured":"B. Ripley . The R project in statistical computing . MSOR Connections , 1 ( 1 ): 23 -- 25 , 2001 . B. Ripley. The R project in statistical computing. MSOR Connections, 1(1):23--25, 2001.","journal-title":"MSOR Connections"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins H.","year":"1951","unstructured":"H. Robbins and S. Monro . A stochastic approximation method . Annals of Mathematical Statistics , 22 : 400 -- 407 , 1951 . H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400--407, 1951.","journal-title":"Annals of Mathematical Statistics"},{"key":"e_1_2_1_36_1","volume-title":"Convex Analysis (Princeton Landmarks in Mathematics and Physics)","author":"Rockafellar R. T.","year":"1996","unstructured":"R. T. Rockafellar . Convex Analysis (Princeton Landmarks in Mathematics and Physics) . Princeton University Press , 1996 . R. T. Rockafellar. Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press, 1996."},{"key":"e_1_2_1_37_1","volume-title":"NICTA","author":"Sanderson C.","year":"2010","unstructured":"C. Sanderson . Armadillo : An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report , NICTA , 2010 . C. Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010."},{"key":"e_1_2_1_38_1","first-page":"1","volume-title":"SSDBM","author":"Stonebraker M.","year":"2011","unstructured":"M. Stonebraker , P. Brown , A. 
Poliakov , The architecture of SciDB . In SSDBM , pages 1 -- 16 , 2011 . M. Stonebraker, P. Brown, A. Poliakov, et al. The architecture of SciDB. In SSDBM, pages 1--16, 2011."},{"key":"e_1_2_1_39_1","unstructured":"The PostgreSQL Global Development Group. PostgreSQL 9.1.4 Documentation 2011. The PostgreSQL Global Development Group. PostgreSQL 9.1.4 Documentation 2011."},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the lasso","volume":"58","author":"Tibshirani R.","year":"1994","unstructured":"R. Tibshirani . Regression shrinkage and selection via the lasso . Journal of the Royal Statistical Society, Series B , 58 : 267 -- 288 , 1994 . R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267--288, 1994.","journal-title":"Journal of the Royal Statistical Society, Series B"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-008-0077-2"},{"key":"e_1_2_1_42_1","volume-title":"Dept. of CIS","author":"Wallach H. M.","year":"2004","unstructured":"H. M. Wallach . Conditional random fields : An introduction. Technical report , Dept. of CIS , Univ. of Pennsylvania , 2004 . H. M. Wallach. Conditional random fields: An introduction. Technical report, Dept. of CIS, Univ. of Pennsylvania, 2004."},{"key":"e_1_2_1_43_1","first-page":"517","volume-title":"SIGMOD","author":"Wang D.","year":"2011","unstructured":"D. Wang , M. Franklin , M. Garofalakis , Hybrid in-database inference for declarative information extraction . In SIGMOD , pages 517 -- 528 , 2011 . 10.1145\/1989323.1989378 D. Wang, M. Franklin, M. Garofalakis, et al. Hybrid in-database inference for declarative information extraction. In SIGMOD, pages 517--528, 2011. 
10.1145\/1989323.1989378"},{"issue":"1","key":"e_1_2_1_44_1","first-page":"1057","article-title":"Querying probabilistic information extraction","volume":"3","author":"Wang D. Z.","year":"2010","unstructured":"D. Z. Wang , M. J. Franklin , M. N. Garofalakis , Querying probabilistic information extraction . PVLDB , 3 ( 1 ): 1057 -- 1067 , 2010 . D. Z. Wang, M. J. Franklin, M. N. Garofalakis, et al. Querying probabilistic information extraction. PVLDB, 3(1):1057--1067, 2010.","journal-title":"PVLDB"},{"key":"e_1_2_1_45_1","first-page":"389","volume-title":"NIPS Workshop on Parallel and Large-Scale Machine Learning (BigLearn)","author":"Weimer M.","year":"2011","unstructured":"M. Weimer , T. Condie , R. Ramakrishnan , Machine learning in ScalOps, a higher order cloud computing language . In NIPS Workshop on Parallel and Large-Scale Machine Learning (BigLearn) , pages 389 -- 396 , 2011 . M. Weimer, T. Condie, R. Ramakrishnan, et al. Machine learning in ScalOps, a higher order cloud computing language. In NIPS Workshop on Parallel and Large-Scale Machine Learning (BigLearn), pages 389--396, 2011."},{"issue":"23","key":"e_1_2_1_47_1","first-page":"1","article-title":"Parallelized stochastic gradient descent","volume":"23","author":"Zinkevich M.","year":"2010","unstructured":"M. Zinkevich , M. Weimer , A. Smola , Parallelized stochastic gradient descent . NIPS , 23 ( 23 ): 1 -- 9 , 2010 . M. Zinkevich, M. Weimer, A. Smola, et al. Parallelized stochastic gradient descent. 
NIPS, 23(23):1--9, 2010.","journal-title":"NIPS"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2367502.2367510","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T00:01:00Z","timestamp":1716854460000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2367502.2367510"}},"subtitle":["or MAD skills, the SQL"],"short-title":[],"issued":{"date-parts":[[2012,8]]},"references-count":46,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2012,8]]}},"alternative-id":["10.14778\/2367502.2367510"],"URL":"http:\/\/dx.doi.org\/10.14778\/2367502.2367510","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2012,8]]}}}