{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T05:52:02Z","timestamp":1781070722329,"version":"3.54.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2019,7]]},"abstract":"<jats:p>\n            Given a data set\n            <jats:italic>D<\/jats:italic>\n            containing millions of data points and a data consumer who is willing to pay for $\n            <jats:italic>X<\/jats:italic>\n            to train a machine learning (ML) model over\n            <jats:italic>D<\/jats:italic>\n            ,\n            <jats:italic>how should we distribute this $X to each data point to reflect its \"value\"?<\/jats:italic>\n            In this paper, we define the \"relative value of data\" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all\n            <jats:italic>N<\/jats:italic>\n            data points, it requires\n            <jats:italic>O<\/jats:italic>\n            (2\n            <jats:sup>\n              <jats:italic>N<\/jats:italic>\n            <\/jats:sup>\n            ) model evaluations for exact computation and\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>N<\/jats:italic>\n            log\n            <jats:italic>N<\/jats:italic>\n            ) for (\n            <jats:italic>\u03f5<\/jats:italic>\n            , \u03b4)-approximation.\n          <\/jats:p>\n          <jats:p>\n            In this paper, we focus on one popular family of ML models relying on\n            <jats:italic>K<\/jats:italic>\n            -nearest neighbors (\n            <jats:italic>K<\/jats:italic>\n            NN). The most surprising result is that for unweighted\n            <jats:italic>K<\/jats:italic>\n            NN classifiers and regressors, the Shapley value of all\n            <jats:italic>N<\/jats:italic>\n            data points can be computed,\n            <jats:italic>exactly<\/jats:italic>\n            , in\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>N<\/jats:italic>\n            log\n            <jats:italic>N<\/jats:italic>\n            ) time - an exponential improvement on computational complexity! Moreover, for (\n            <jats:italic>\u03f5<\/jats:italic>\n            , \u03b4)-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>N<\/jats:italic>\n            <jats:sup>\n              <jats:italic>h<\/jats:italic>\n              (\n              <jats:italic>\u03f5<\/jats:italic>\n              ,\n              <jats:italic>K<\/jats:italic>\n              )\n            <\/jats:sup>\n            log\n            <jats:italic>N<\/jats:italic>\n            ) when\n            <jats:italic>\u03f5<\/jats:italic>\n            is not too small and\n            <jats:italic>K<\/jats:italic>\n            is not too large. We empirically evaluate our algorithms on up to 10 million data points and even our\n            <jats:italic>exact<\/jats:italic>\n            algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further.\n          <\/jats:p>\n          <jats:p>\n            We then extend our algorithm to other scenarios such as (1) weighed\n            <jats:italic>K<\/jats:italic>\n            NN classifiers, (2) different data points are clustered by different\n            <jats:italic>data curators<\/jats:italic>\n            , and (3) there are\n            <jats:italic>data analysts<\/jats:italic>\n            providing computation who also requires proper valuation.\n            <jats:italic>Some<\/jats:italic>\n            of these extensions, although also being improved exponentially, are less practical for exact computation (e.g.,\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>\n              N\n              <jats:sup>K<\/jats:sup>\n            <\/jats:italic>\n            ) complexity for weigthed\n            <jats:italic>K<\/jats:italic>\n            NN). We thus propose an Monte Carlo approximation algorithm, which is\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>N<\/jats:italic>\n            (log\n            <jats:italic>N<\/jats:italic>\n            )\n            <jats:sup>2<\/jats:sup>\n            \/(log\n            <jats:italic>K<\/jats:italic>\n            )\n            <jats:sup>2<\/jats:sup>\n            ) times more efficient than the baseline approximation algorithm.\n          <\/jats:p>","DOI":"10.14778\/3342263.3342637","type":"journal-article","created":{"date-parts":[[2019,9,18]],"date-time":"2019-09-18T18:36:11Z","timestamp":1568831771000},"page":"1610-1623","source":"Crossref","is-referenced-by-count":130,"title":["Efficient task-specific data valuation for nearest neighbor algorithms"],"prefix":"10.14778","volume":"12","author":[{"given":"Ruoxi","family":"Jia","sequence":"first","affiliation":[{"name":"UC Berkeley"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Dao","sequence":"additional","affiliation":[{"name":"ETH Zurich"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Boxin","family":"Wang","sequence":"additional","affiliation":[{"name":"Zhejiang University"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Frances Ann","family":"Hubis","sequence":"additional","affiliation":[{"name":"ETH Zurich"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nezihe Merve","family":"Gurel","sequence":"additional","affiliation":[{"name":"ETH Zurich"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bo","family":"Li","sequence":"additional","affiliation":[{"name":"UIUC"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ce","family":"Zhang","sequence":"additional","affiliation":[{"name":"ETH Zurich"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Costas","family":"Spanos","sequence":"additional","affiliation":[{"name":"UC Berkeley"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dawn","family":"Song","sequence":"additional","affiliation":[{"name":"UC Berkeley"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2019,7]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"A Decentralized Kaggle: Inside Algorithmias Approach to Blockchain-Based AI Competitions. https:\/\/towardsdatascience.com\/a-decentralized-kaggle-inside-algorithmias-approach-to-blockchain-based-ai-competitions-8c6aec99e89b.  A Decentralized Kaggle: Inside Algorithmias Approach to Blockchain-Based AI Competitions. https:\/\/towardsdatascience.com\/a-decentralized-kaggle-inside-algorithmias-approach-to-blockchain-based-ai-competitions-8c6aec99e89b."},{"key":"e_1_2_1_2_1","unstructured":"Dawex. https:\/\/www.dawex.com\/en\/.  Dawex. https:\/\/www.dawex.com\/en\/."},{"key":"e_1_2_1_3_1","unstructured":"Google bigquery. https:\/\/cloud.google.com\/bigquery\/.  Google bigquery. https:\/\/cloud.google.com\/bigquery\/."},{"key":"e_1_2_1_4_1","unstructured":"Iota. https:\/\/data.iota.org\/#\/.  Iota. https:\/\/data.iota.org\/#\/."},{"key":"e_1_2_1_5_1","unstructured":"Xignite. https:\/\/apollomapping.com\/.  Xignite. https:\/\/apollomapping.com\/."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.aci.2014.10.001"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46759-7_15"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1007\/0-387-26281-4_6","volume-title":"Supply chain optimization","author":"Bartholdi J. J.","year":"2005","unstructured":"J. J. Bartholdi and E. Kemahlio\u011flu-Ziya . Using shapley value to allocate savings in a supply chain . In Supply chain optimization , pages 169 -- 208 . Springer , 2005 . J. J. Bartholdi and E. Kemahlio\u011flu-Ziya. Using shapley value to allocate savings in a supply chain. In Supply chain optimization, pages 169--208. Springer, 2005."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40776-5_19"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.509965"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3076246.3076250"},{"key":"e_1_2_1_12_1","volume-title":"Model-based pricing for machine learning in a data marketplace. arXiv preprint arXiv.1805.11450","author":"Chen L.","year":"2018","unstructured":"L. Chen , P. Koutris , and A. Kumar . Model-based pricing for machine learning in a data marketplace. arXiv preprint arXiv.1805.11450 , 2018 . L. Chen, P. Koutris, and A. Kumar. Model-based pricing for machine learning in a data marketplace. arXiv preprint arXiv.1805.11450, 2018."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3106723.3106732"},{"key":"e_1_2_1_14_1","volume-title":"Databright: Towards a global exchange for decentralized data ownership and trusted computation. arXiv preprint arXiv.1802.04780","author":"Dao D.","year":"2018","unstructured":"D. Dao , D. Alistarh , C. Musat , and C. Zhang . Databright: Towards a global exchange for decentralized data ownership and trusted computation. arXiv preprint arXiv.1802.04780 , 2018 . D. Dao, D. Alistarh, C. Musat, and C. Zhang. Databright: Towards a global exchange for decentralized data ownership and trusted computation. arXiv preprint arXiv.1802.04780, 2018."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/997817.997857"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137816"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1287\/moor.19.2.257"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMC.1976.5408784"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2008.05.003"},{"key":"e_1_2_1_21_1","volume-title":"Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868","author":"Ghorbani A.","year":"2019","unstructured":"A. Ghorbani and J. Zou . Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868 , 2019 . A. Ghorbani and J. Zou. Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019."},{"key":"e_1_2_1_22_1","first-page":"518","volume-title":"Vldb","volume":"99","author":"Gionis A.","year":"1999","unstructured":"A. Gionis , P. Indyk , R. Motwani , Similarity search in high dimensions via hashing . In Vldb , volume 99 , pages 518 -- 529 , 1999 . A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518--529, 1999."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.4086\/toc.2012.v008a014"},{"key":"e_1_2_1_24_1","first-page":"3315","volume-title":"Advances in neural information processing systems","author":"Hardt M.","year":"2016","unstructured":"M. Hardt , E. Price , N. Srebro , Equality of opportunity in supervised learning . In Advances in neural information processing systems , pages 3315 -- 3323 , 2016 . M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315--3323, 2016."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-09861-6_3"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042573.3042582"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3236266"},{"key":"e_1_2_1_29_1","volume-title":"AISTATS","author":"Jia R.","year":"2019","unstructured":"R. Jia , D. Dao , B. Wang , F. A. Hubis , N. Hynes , B. Li , C. Zhang , D. Song , and C. Spanos . Towards efficient data valuation based on the shapley value . AISTATS , 2019 . R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, B. Li, C. Zhang, D. Song, and C. Spanos. Towards efficient data valuation based on the shapley value. AISTATS, 2019."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1028128.1028156"},{"key":"e_1_2_1_31_1","first-page":"1885","volume-title":"International Conference on Machine Learning","author":"Koh P. W.","year":"2017","unstructured":"P. W. Koh and P. Liang . Understanding black-box predictions via influence functions . In International Conference on Machine Learning , pages 1885 -- 1894 , 2017 . P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885--1894, 2017."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/2367502.2367548"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465335"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2770870"},{"key":"e_1_2_1_35_1","volume-title":"MNIST handwritten digit database","author":"LeCun Y.","year":"2010","unstructured":"Y. LeCun and C. Cortes . MNIST handwritten digit database . 2010 . Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3139457"},{"key":"e_1_2_1_37_1","first-page":"19","volume-title":"WebDB","author":"Li C.","year":"2012","unstructured":"C. Li and G. Miklau . Pricing aggregate queries in a data marketplace . In WebDB , pages 19 -- 24 , 2012 . C. Li and G. Miklau. Pricing aggregate queries in a data marketplace. In WebDB, pages 19--24, 2012."},{"key":"e_1_2_1_38_1","volume-title":"Using the k-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Computational and mathematical methods in medicine","author":"Li C.","year":"2012","unstructured":"C. Li , S. Zhang , H. Zhang , L. Pang , K. Lam , C. Hui , and S. Zhang . Using the k-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Computational and mathematical methods in medicine , 2012 , 2012. C. Li, S. Zhang, H. Zhang, L. Pang, K. Lam, C. Hui, and S. Zhang. Using the k-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Computational and mathematical methods in medicine, 2012, 2012."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732939.2732948"},{"key":"e_1_2_1_41_1","volume-title":"Bounding the estimation error of sampling-based shapley value approximation. arXiv preprint arXiv.1306.4265","author":"Maleki S.","year":"2013","unstructured":"S. Maleki , L. Tran-Thanh , G. Hines , T. Rahwan , and A. Rogers . Bounding the estimation error of sampling-based shapley value approximation. arXiv preprint arXiv.1306.4265 , 2013 . S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers. Bounding the estimation error of sampling-based shapley value approximation. arXiv preprint arXiv.1306.4265, 2013."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.5555\/2512538.2512553"},{"key":"e_1_2_1_43_1","volume-title":"Ann: library for approximate nearest neighbour searching","author":"Mount D. M.","year":"1998","unstructured":"D. M. Mount and S. Arya . Ann: library for approximate nearest neighbour searching . 1998 . D. M. Mount and S. Arya. Ann: library for approximate nearest neighbour searching. 1998."},{"key":"e_1_2_1_44_1","first-page":"1382","volume-title":"International Conference on Machine Learning","author":"Ogawa K.","year":"2013","unstructured":"K. Ogawa , Y. Suzuki , and I. Takeuchi . Safe screening of non-support vectors in pathwise svm computation . In International Conference on Machine Learning , pages 1382 -- 1390 , 2013 . K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise svm computation. In International Conference on Machine Learning, pages 1382--1390, 2013."},{"key":"e_1_2_1_45_1","volume-title":"Data markets to support ai for all: Pricing, valuation and governance. arXiv preprint arXiv:1905.06462","author":"Raskar R.","year":"2019","unstructured":"R. Raskar , P. Vepakomma , T. Swedish , and A. Sharan . Data markets to support ai for all: Pricing, valuation and governance. arXiv preprint arXiv:1905.06462 , 2019 . R. Raskar, P. Vepakomma, T. Swedish, and A. Sharan. Data markets to support ai for all: Pricing, valuation and governance. arXiv preprint arXiv:1905.06462, 2019."},{"key":"e_1_2_1_46_1","volume-title":"A value for n-person games. Contributions to the Theory of Games, 2(28):307--317","author":"Shapley L. S.","year":"1953","unstructured":"L. S. Shapley . A value for n-person games. Contributions to the Theory of Games, 2(28):307--317 , 1953 . L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307--317, 1953."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.40"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/2168651.2168657"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007328.3007335"},{"key":"e_1_2_1_51_1","volume-title":"Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081","author":"Woodworth B.","year":"2017","unstructured":"B. Woodworth , S. Gunasekar , M. I. Ohannessian , and N. Srebro . Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081 , 2017 . B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081, 2017."},{"key":"e_1_2_1_52_1","first-page":"325","volume-title":"International Conference on Machine Learning","author":"Zemel R.","year":"2013","unstructured":"R. Zemel , Y. Wu , K. Swersky , T. Pitassi , and C. Dwork . Learning fair representations . In International Conference on Machine Learning , pages 325 -- 333 , 2013 . R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325--333, 2013."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMC.2019.2900243"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3342263.3342637","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:56:28Z","timestamp":1672221388000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3342263.3342637"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7]]},"references-count":52,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2019,7]]}},"alternative-id":["10.14778\/3342263.3342637"],"URL":"https:\/\/doi.org\/10.14778\/3342263.3342637","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2019,7]]}}}