{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"institution":[{"name":"Research Square"}],"indexed":{"date-parts":[[2024,9,23]],"date-time":"2024-09-23T19:47:02Z","timestamp":1727120822482},"posted":{"date-parts":[[2024,9,16]]},"group-title":"In Review","reference-count":37,"publisher":"Springer Science and Business Media LLC","license":[{"start":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T00:00:00Z","timestamp":1726444800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"accepted":{"date-parts":[[2024,8,1]]},"abstract":"<title>Abstract<\/title>\n        <p>Data valuation is the act of assigning a monetary value to data based on its estimated usefulness, potential impact, and scarcity. In the current landscape of digital technology, data has emerged as a valuable resource for various entities such as organizations, governments, and individuals. Consequently, data valuation plays an increasingly important role in managing data assets, ultimately facilitating informed decision-making in relation to data acquisition, sharing, analysis, and even monetization. In the context of the Machine Learning Data Market, a platform to exchange data considering its value, data valuation has an important role in putting economic value before trading data. The Shapley Value has assumed a central role in data valuation, due to its equitable value distribution among contributors. This paper focuses on data valuation within the context of the Machine Learning Data Market (MLDM). Our primary objective is to investigate whether data valuation methods based on the Shapley Value can result in improved performance in MLDM. We introduced the Gain Data Shapley Value (GDSV) method. This paper presents an extensive empirical study of its behavior and compares GDSV with performance-based data valuation in MLDM under different configurations and learning algorithms. Our findings confirm that considering the contribution of the data set to performance scores can lead to systematic improvements in learning performance.<\/p>","DOI":"10.21203\/rs.3.rs-4843564\/v1","type":"posted-content","created":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T08:05:17Z","timestamp":1726473917000},"source":"Crossref","is-referenced-by-count":0,"title":["Shapley Value-based Data Valuation for Machine Learning Data Markets"],"prefix":"10.21203","author":[{"given":"Hajar","family":"Baghcheband","sequence":"first","affiliation":[{"name":"University of Porto"}]},{"given":"Carlos","family":"Soares","sequence":"additional","affiliation":[{"name":"University of Porto"}]},{"given":"Luis Paulo","family":"Reis","sequence":"additional","affiliation":[{"name":"University of Porto"}]}],"member":"297","reference":[{"doi-asserted-by":"crossref","unstructured":"Campbell, S. L. and Gear, C. W. (1995) The index of general nonlinear {D}{A}{E}{S}. Numer. {M}ath. 72(2): 173--196","key":"ref1","DOI":"10.1007\/s002110050165"},{"doi-asserted-by":"crossref","unstructured":"Slifka, M. K. and Whitton, J. L. (2000) Clinical implications of dysregulated cytokine production. J. {M}ol. {M}ed. 78: 74--80 https:\/\/doi.org\/10.1007\/s001090000086","key":"ref2","DOI":"10.1007\/s001090000086"},{"doi-asserted-by":"crossref","unstructured":"Hamburger, C. (1995) Quasimonotonicity, regularity and duality for nonlinear systems of  partial differential equations. Ann. Mat. Pura. Appl. 169(2): 321--354","key":"ref3","DOI":"10.1007\/BF01759359"},{"doi-asserted-by":"crossref","unstructured":"Geddes, K. O. and Czapor, S. R. and Labahn, G. (1992) Algorithms for {C}omputer {A}lgebra. Kluwer, Boston","key":"ref4","DOI":"10.1007\/b102438"},{"doi-asserted-by":"crossref","unstructured":"Broy, M. Software engineering---from auxiliary to key technologies. In: Broy, M. and Denert, E. (Eds.) Software Pioneers, 1992, Springer, New {Y}ork, 10--13","key":"ref5","DOI":"10.1007\/978-3-642-59412-0_1"},{"unstructured":"(1981) Conductive {P}olymers. Plenum, New {Y}ork, Seymour, R. S.","key":"ref6"},{"doi-asserted-by":"crossref","unstructured":"Smith, S. E. (1976) Neuromuscular blocking drugs in man. Springer, Heidelberg, 593--660, Neuromuscular junction. {H}andbook of experimental pharmacology, 42, Zaimis, E.","key":"ref7","DOI":"10.1007\/978-3-642-45476-9_9"},{"doi-asserted-by":"crossref","unstructured":"Hao, Z. and AghaKouchak, A. and Nakhjiri, N. and Farahmand, A.. Global integrated drought monitoring and prediction system (GIDMaPS) data sets. figshare https:\/\/doi.org\/10.6084\/m9.figshare.853801. 2014","key":"ref8","DOI":"10.1038\/sdata.2014.1"},{"doi-asserted-by":"crossref","unstructured":"Babichev, S. A. and Ries, J. and Lvovsky, A. I.. Quantum scissors: teleportation of single-mode optical states by means  of a nonlocal single photon. Preprint at https:\/\/arxiv.org\/abs\/quant-ph\/0208066v1. 2002","key":"ref9","DOI":"10.1209\/epl\/i2003-00504-y"},{"doi-asserted-by":"crossref","unstructured":"Beneke, M. and Buchalla, G. and Dunietz, I. (1997) Mixing induced {CP} asymmetries in inclusive {B} decays. Phys. {L}ett. B393: 132-142 gr-gc, 0707.3168, arXiv","key":"ref10","DOI":"10.1016\/S0370-2693(96)01648-6"},{"unstructured":"Abbott, T. M. C. and others (2019) {Dark Energy Survey Year 1 Results: Constraints on Extended Cosmological Models from Galaxy Clustering and Weak Lensing}. Phys. Rev. D 99(12): 123505 https:\/\/doi.org\/10.1103\/PhysRevD.99.123505, FERMILAB-PUB-18-507-PPD, astro-ph.CO, arXiv, 1810.02499, DES","key":"ref11"},{"doi-asserted-by":"crossref","unstructured":"Baghcheband, H. and Soares C. and Reis L. P.. Machine Learning Data Markets: Evaluating the Impact of Data Exchange on the Agent Learning Performance. Paper presented at the 22nd Portuguese Conference on Artificial Intelligence,  5--8 September 2023. 2023","key":"ref12","DOI":"10.1007\/978-3-031-49008-8_27"},{"doi-asserted-by":"crossref","unstructured":"H. Baghcheband and C. Soares and L. Reis (2022) Machine Learning Data Markets: Trading Data using a Multi-Agent System. IEEE Computer Society, Los Alamitos, CA, USA, nov, 450-457, 2022 IEEE\/WIC\/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)","key":"ref13","DOI":"10.1109\/WI-IAT55865.2022.00073"},{"unstructured":"Ghorbani, Amirata and Kim, Michael P. and Zou, James (2020) {A distributional framework for data valuation}. 37th International Conference on Machine Learning, ICML 2020 119: 3493--3502 PMLR, III, Hal Daum \u00e9 and Singh, Aarti, 9781713821120, 2002.12334, 2002.12334, arXiv","key":"ref14"},{"unstructured":"Ghorbani, Amirata and Kim, Michael and Zou, James (2020) A Distributional Framework For Data Valuation. 13--18 Jul, Proceedings of Machine Learning Research, 119, 3535--3544, Proceedings of the 37th International Conference on Machine Learning","key":"ref15"},{"unstructured":"Ghorbani, Amirata and Zou, James (2019) Data Shapley: Equitable Valuation of Data for Machine Learning. 09--15 Jun, Proceedings of Machine Learning Research, 97, 2242--2251, Proceedings of the 36th International Conference on Machine Learning","key":"ref16"},{"doi-asserted-by":"crossref","unstructured":"Ghorbani, Amirata and Zou, James and Esteva, Andre (2022) Data Shapley Valuation for Efficient Batch Active Learning. 1456-1462, , , 2022 56th Asilomar Conference on Signals, Systems, and Computers","key":"ref17","DOI":"10.1109\/IEEECONF56349.2022.10064696"},{"unstructured":"Ruoxi Jia and David Dao and Boxin Wang and Frances Ann Hubis and Nick Hynes and Nezihe Merve Gurel and Bo Li and Ce Zhang and Dawn Song and Costas Spanos. Towards Efficient Data Valuation Based on the Shapley Value. cs.LG, arXiv, 1902.10275, 2023","key":"ref18"},{"doi-asserted-by":"crossref","unstructured":"Shapley, L. S A Value for n-Person Games. In: Kuhn, Harold W. and Tucker, Albert W. (Eds.) Contributions to the Theory of Games II, Princeton, Princeton University Press, 1953, 307--317","key":"ref19","DOI":"10.1515\/9781400881970-018"},{"unstructured":"Shapley, L. S. (1988) The Shapley value : essays in honor of Lloyd S. Shapley edited by Alvin E. Roth. Cambridge University Press","key":"ref20"},{"doi-asserted-by":"crossref","unstructured":"Tang, S. and Ghorbani, A. and Yamashita, R. and Rehman, S. and Dunnmon, J. A. and Zou, J. and Rubin, D. L. (2021) {Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset}. Scientific Reports 11: 1--9 Nature Publishing Group UK, 33863957, 20452322, 0123456789","key":"ref21","DOI":"10.1038\/s41598-021-87762-2"},{"unstructured":"Garrido-Lucero, F. and Heymann, B. and Vono, M. and Loiseau, P. and Perchet, V. . DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation. cs.AI, arXiv, 2306.02071, 2023","key":"ref22"},{"doi-asserted-by":"crossref","unstructured":"M. Wu and R. Jia and C. Lin and W. Huang and X. Chang (2023) Variance reduced Shapley value estimation for trustworthy data valuation. Computers and Operations Research 159: 106305 0305-0548","key":"ref23","DOI":"10.1016\/j.cor.2023.106305"},{"doi-asserted-by":"crossref","unstructured":"Courtnage, C. and Smirnov, E. (2021) Shapley-Value Data Valuation for Semi-supervised Learning. Springer International Publishing, Cham, 978-3-030-88942-5, 94--108, Discovery Science, Soares, Carlos and Torgo, Luis","key":"ref24","DOI":"10.1007\/978-3-030-88942-5_8"},{"unstructured":"Cohen, {S. B.} and G. Dror and E. Ruppin (2005) Feature Selection Based on the Shapley Value. Proceedings of IJCAI, 1--6","key":"ref25"},{"doi-asserted-by":"crossref","unstructured":"J. Vanschoren and J. van Rijn and B. Bischl and L. Torgo (2013) OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15(2): 49--60 New York, NY, USA, ACM, http:\/\/doi.acm.org\/10.1145\/2641190.2641198","key":"ref26","DOI":"10.1145\/2641190.2641198"},{"doi-asserted-by":"crossref","unstructured":"Agarwal, Anish and Dahleh, Munther and Sarkar, Tuhin (2019) {A marketplace for data: An algorithmic solution}. ACM EC 2019 - Proceedings of the 2019 ACM Conference on Economics and Computation : 701--726 9781450367929","key":"ref27","DOI":"10.1145\/3328526.3329589"},{"unstructured":"Kharman, Aida Manzano and Jursitzky, Christian and Zhou, Quan and Ferraro, Pietro and Marecek, Jakub and Pinson, Pierre and Shorten, Robert (2022) {On the Design of Decentralised Data Markets}. arXiv preprint arXiv:2206.06299 : 1--26 2206.06299","key":"ref28"},{"unstructured":"Sim, Rachael Hwee Ling and Zhang, Yehong and Chan, Mun Choon and Low, Bryan Kian Hsiang (2020) {Collaborative machine learning with incentive-aware model rewards}. 37th International Conference on Machine Learning, ICML 2020 16814(Ml): 8886--8895 9781713821120, 2010.12797","key":"ref29"},{"unstructured":"Stahl, Florian and Schomm, Fabian and Vossen, Gottfried (2014) {Data marketplaces: An emerging species}. Frontiers in Artificial Intelligence and Applications Databases (August 2013): 145--158 09226389, 9781614994572","key":"ref30"},{"doi-asserted-by":"crossref","unstructured":"Faroukhi, Abou Zakaria and {El Alaoui}, Imane and Gahi, Youssef and Amine, Aouatif (2020) {Big data monetization throughout Big Data Value Chain: a comprehensive review}. Journal of Big Data 7(1)Springer International Publishing, 21961115, 4053701902","key":"ref31","DOI":"10.1186\/s40537-019-0281-5"},{"unstructured":"Travizano, Matias and Sarraute, Carlos and Ajzenman, Gustavo and Minnoni, Martin (2018) {Wibson: A Decentralized Data Marketplace}. : 1--6 1812.09966, arXiv","key":"ref32"},{"unstructured":"H. Baghcheband and C. Soares and L. Reis (2024) MLDM: Machine Learning Data Market Based on Multi-Agent Systems. IEEE Internet Computing (01): 1-7 https:\/\/doi.org\/10.1109\/MIC.2024.3399049, may, Los Alamitos, CA, USA, IEEE Computer Society, 1089-7801, 1941-0131","key":"ref33"},{"unstructured":"Zhihua Tian and Jian Liu and Jingyu Li and Xinle Cao and Ruoxi Jia and Jun Kong and Mengdi Liu and Kui Ren. Private Data Valuation and Fair Payment in Data Marketplaces. cs.CR, arXiv, 2210.08723, 2023","key":"ref34"},{"doi-asserted-by":"crossref","unstructured":"Liu, Jinfei and Lou, Jian and Liu, Junxu and Xiong, Li and Pei, Jian and Sun, Jimeng (2021) Dealer: an end-to-end model marketplace with differential privacy. Proc. VLDB Endow. 14(6): 957 \u2013969 feb, 2150-8097, VLDB Endowment, February 2021","key":"ref35","DOI":"10.14778\/3447689.3447700"},{"doi-asserted-by":"crossref","unstructured":"Baghcheband, Hajar and Soares, Carlos and Reis, Luis Paulo (2024) Shapley-Based Data Valuation Method for the Machine Learning Data Markets (MLDM). Springer Nature Switzerland, Cham, 978-3-031-62700-2, 170--177, Foundations of Intelligent Systems","key":"ref36","DOI":"10.1007\/978-3-031-62700-2_16"},{"doi-asserted-by":"crossref","unstructured":"Lawrenz, Sebastian and Sharma, Priyanka and Rausch, Andreas (2019) {Blockchain technology as an approach for data marketplaces}. ACM International Conference Proceeding Series 1481: 55--59 https:\/\/doi.org\/10.1145\/3320154.3320165, 9781450362689","key":"ref37","DOI":"10.1145\/3320154.3320165"}],"container-title":[],"original-title":[],"link":[{"URL":"https:\/\/www.researchsquare.com\/article\/rs-4843564\/v1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.researchsquare.com\/article\/rs-4843564\/v1.html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T08:05:51Z","timestamp":1726473951000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.researchsquare.com\/article\/rs-4843564\/v1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,16]]},"references-count":37,"URL":"https:\/\/doi.org\/10.21203\/rs.3.rs-4843564\/v1","relation":{},"subject":[],"published":{"date-parts":[[2024,9,16]]},"subtype":"preprint"}}