{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:28:05Z","timestamp":1750220885962,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,11,19]],"date-time":"2019-11-19T00:00:00Z","timestamp":1574121600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>\n            We describe a\n            <jats:sc>simd<\/jats:sc>\n            technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate (\u201cbutterfly-patterned\u201d) form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using\n            <jats:sc>cuda<\/jats:sc>\n            7.5 on an\n            <jats:sc>nvidia<\/jats:sc>\n            Titan Black\n            <jats:sc>gpu<\/jats:sc>\n            show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1,024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.\n          <\/jats:p>","DOI":"10.1145\/3365662","type":"journal-article","created":{"date-parts":[[2019,11,19]],"date-time":"2019-11-19T13:46:47Z","timestamp":1574171207000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Using Butterfly-patterned Partial Sums to Draw from Discrete Distributions"],"prefix":"10.1145","volume":"6","author":[{"given":"Guy L. Steele","family":"Jr.","sequence":"first","affiliation":[{"name":"Oracle Labs, Burlington, MA, USA"}]},{"given":"Jean-Baptiste","family":"Tristan","sequence":"additional","affiliation":[{"name":"Oracle Labs, Burlington, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,11,19]]},"reference":[{"volume-title":"Proceedings of the 30th International Conference on Machine Learning (ICML\u201913)","author":"Ahmed Amr","key":"e_1_2_1_1_1"},{"volume-title":"Pattern Recognition and Machine Learning (Information Science and Statistics)","author":"Bishop Christopher M.","key":"e_1_2_1_2_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/2133806.2133826"},{"volume-title":"Jordan","year":"2003","author":"Blei David M.","key":"e_1_2_1_4_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1145\/1375527.1375559"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1002\/spe.4380240306"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1073\/pnas.0307752101"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/2623330.2623338"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1109\/JRPROC.1952.273898"},{"volume-title":"Proceedings of the 1989 ACM\/IEEE Conference on Supercomputing. ACM","author":"Johnsson S. Lennart","key":"e_1_2_1_10_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1145\/2939672.2939873"},{"volume-title":"Seminumerical Algorithms","author":"Knuth Donald E.","edition":"3","key":"e_1_2_1_12_1"},{"volume-title":"Sorting and Searching","author":"Knuth Donald E.","edition":"2","key":"e_1_2_1_13_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1198\/jcgs.2010.10039"},{"volume-title":"Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD\u201914)","author":"Li Aaron Q.","key":"e_1_2_1_15_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1007\/978-3-642-37401-2_20"},{"doi-asserted-by":"publisher","key":"e_1_2_1_17_1","DOI":"10.1145\/2908080.2908089"},{"doi-asserted-by":"publisher","key":"e_1_2_1_18_1","DOI":"10.1145\/366193.366228"},{"volume-title":"Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA\u201993)","year":"1993","author":"Matias Yossi","key":"e_1_2_1_19_1"},{"unstructured":"NVIDIA. 2015. Developer Zone website: CUDA Toolkit documentation: CUDA Toolkit v6.5 Programming Guide section B.14. Warp shuffle functions. Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#warp-shuffle-functions.  NVIDIA. 2015. Developer Zone website: CUDA Toolkit documentation: CUDA Toolkit v6.5 Programming Guide section B.14. Warp shuffle functions. Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#warp-shuffle-functions.","key":"e_1_2_1_20_1"},{"volume-title":"Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. Association for the Advancement of Artificial Intelligence","year":"2010","author":"Ramage Daniel","key":"e_1_2_1_21_1"},{"unstructured":"Guy L. Steele  Jr. 2016. Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions. GTC website. Retrieved from http:\/\/on-demand.gputechconf.com\/gtc\/2016\/video\/s6665-guy-steele-fast-splittable.mp4.  Guy L. Steele Jr. 2016. Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions. GTC website. Retrieved from http:\/\/on-demand.gputechconf.com\/gtc\/2016\/video\/s6665-guy-steele-fast-splittable.mp4.","key":"e_1_2_1_22_1"},{"volume-title":"NVIDIA GPU Technology Conference.","year":"2016","author":"Steele Guy L.","key":"e_1_2_1_23_1"},{"volume-title":"Using butterfly-patterned partial sums to optimize GPU memory accesses for drawing from discrete distributions. CoRR (Computing Research Repository at arXiv.org) (May","year":"2015","author":"Steele Guy L.","key":"e_1_2_1_24_1"},{"volume-title":"Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201917)","year":"1874","author":"Steele Guy L.","key":"e_1_2_1_25_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.1198\/jcgs.2010.10016"},{"volume-title":"Steele Jr","year":"2014","author":"Tristan Jean-Baptiste","key":"e_1_2_1_27_1"},{"volume-title":"Proceedings of the 32nd International Conference on Machine Learning (ICML\u201915)","author":"Tristan Jean-Baptiste","key":"e_1_2_1_28_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1109\/32.92917"},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1049\/el:19740097"},{"doi-asserted-by":"publisher","key":"e_1_2_1_31_1","DOI":"10.1145\/355744.355749"},{"volume-title":"The CUDA Handbook: A Comprehensive Guide to GPU Programming","author":"Wilt Nicholas","key":"e_1_2_1_32_1"},{"unstructured":"Feng Yan Ningyi Xu and Yuan Qi. 2009. Parallel inference for latent Dirichlet allocation on graphics processing units. In Advances in Neural Information Processing Systems 22. Curran Associates 2134--2142. Retrieved from http:\/\/papers.nips.cc\/book\/year-2009.  Feng Yan Ningyi Xu and Yuan Qi. 2009. Parallel inference for latent Dirichlet allocation on graphics processing units. In Advances in Neural Information Processing Systems 22. Curran Associates 2134--2142. Retrieved from http:\/\/papers.nips.cc\/book\/year-2009.","key":"e_1_2_1_33_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/2442516.2442539"},{"volume-title":"SAME but different: Fast and high-quality Gibbs parameter estimation. CoRR (Computing Research Repository at arXiv.org) (Sept","year":"2014","author":"Zhao Huasha","key":"e_1_2_1_35_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1080\/00029890.1959.11989389"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365662","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3365662","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:44:21Z","timestamp":1750203861000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365662"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,19]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3365662"],"URL":"https:\/\/doi.org\/10.1145\/3365662","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2019,11,19]]},"assertion":[{"value":"2018-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-11-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}