{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,1]],"date-time":"2025-11-01T18:26:37Z","timestamp":1762021597772,"version":"build-2065373602"},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T00:00:00Z","timestamp":1589414400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T00:00:00Z","timestamp":1589414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010665","name":"H2020 Marie Sk\u0142odowska-Curie Actions","doi-asserted-by":"publisher","award":["745829"],"award-info":[{"award-number":["745829"]}],"id":[{"id":"10.13039\/100010665","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Appl Intell"],"published-print":{"date-parts":[[2020,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Lack of knowledge in the underlying data distribution in distributed large-scale data can be an obstacle when issuing analytics &amp; predictive modelling queries. Analysts find themselves having a hard time finding analytics\/exploration queries that satisfy their needs. In this paper, we study how exploration query results can be predicted in order to avoid the execution of \u2018bad\u2019\/non-informative queries that waste network, storage, financial resources, and time in a distributed computing environment. The proposed methodology involves clustering of a training set of exploration queries along with the cardinality of the results (score) they retrieved and then using query-centroid representatives to proceed with predictions. 
After the training phase, we propose a novel refinement process to increase the <jats:italic>reliability<\/jats:italic> of predicting the score of new unseen queries based on the refined query representatives. Comprehensive experimentation with real datasets shows that more <jats:italic>reliable predictions<\/jats:italic> are acquired after the proposed refinement method, which increases the reliability of the closest centroid and improves predictability under the right circumstances.<\/jats:p>","DOI":"10.1007\/s10489-020-01712-5","type":"journal-article","created":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T04:09:50Z","timestamp":1589429390000},"page":"3219-3238","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Predictive intelligence of reliable analytics in distributed computing environments"],"prefix":"10.1007","volume":"50","author":[{"given":"Yiannis","family":"Kathidjiotis","sequence":"first","affiliation":[]},{"given":"Kostas","family":"Kolomvatsos","sequence":"additional","affiliation":[]},{"given":"Christos","family":"Anagnostopoulos","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,5,14]]},"reference":[{"key":"1712_CR1","doi-asserted-by":"crossref","unstructured":"Aboulnaga A, Chaudhuri S (1999) Self-tuning histograms: building histograms without looking at data, pp 181\u2013192","DOI":"10.1145\/304181.304198"},{"issue":"9","key":"1712_CR2","doi-asserted-by":"publisher","first-page":"2546","DOI":"10.1007\/s10489-017-1093-y","volume":"48","author":"C Anagnostopoulos","year":"2018","unstructured":"Anagnostopoulos C, Savva F, Triantafillou P (2018) Scalable aggregation predictive analytics. 
Appl Intell 48(9):2546\u20132567","journal-title":"Appl Intell"},{"key":"1712_CR3","doi-asserted-by":"crossref","unstructured":"Anagnostopoulos C, Triantafillou P (2015) Learning to accurately count with query-driven predictive analytics. In: 2015 IEEE international conference on big data (big data), pp 14\u201323","DOI":"10.1109\/BigData.2015.7363736"},{"key":"1712_CR4","doi-asserted-by":"crossref","unstructured":"Bottou L (1998) On-line learning in neural networks. Chapter on-line learning and stochastic approximations, pp 9\u201342. New York, NY, USA","DOI":"10.1017\/CBO9780511569920.003"},{"key":"1712_CR5","doi-asserted-by":"crossref","unstructured":"Chaudhuri S (1990) Generalization and a framework for query modification. In: 1990 Proceedings. Sixth international conference on data engineering, pp 138\u2013145","DOI":"10.1109\/ICDE.1990.113463"},{"key":"1712_CR6","doi-asserted-by":"crossref","unstructured":"Chaudhuri S, Das G, Hristidis V, Weikum G (2004) Probabilistic ranking of database query results. In: Proceedings of the thirtieth international conference on very large data bases - vol 30, VLDB \u201904, pp 888\u2013899","DOI":"10.1016\/B978-012088469-8.50078-4"},{"issue":"5","key":"1712_CR7","doi-asserted-by":"publisher","first-page":"738","DOI":"10.1109\/69.317704","volume":"6","author":"WW Chu","year":"1994","unstructured":"Chu WW, Qiming C (1994) A structured approach for cooperative query answering. IEEE Trans Knowl Data Eng 6(5):738\u2013749","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1712_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/1900000004","volume":"4","author":"G Cormode","year":"2012","unstructured":"Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. 
Found Trends Databases 4:1\u2013294","journal-title":"Found Trends Databases"},{"key":"1712_CR9","doi-asserted-by":"publisher","first-page":"72","DOI":"10.1145\/1629175.1629198","volume":"53","author":"J Dean","year":"2010","unstructured":"Dean J, Ghemawat S (2010) Mapreduce: a flexible data processing tool. Commun ACM 53:72\u201377","journal-title":"Commun ACM"},{"key":"1712_CR10","doi-asserted-by":"crossref","unstructured":"Fagin R, Lotem A, Naor M (2001) Optimal aggregation algorithms for middleware. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 102\u2013113","DOI":"10.1145\/375551.375567"},{"key":"1712_CR11","unstructured":"Google (2019) BigQuery documentation pricing"},{"issue":"2","key":"1712_CR12","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1007\/s00778-003-0090-4","volume":"14","author":"D Gunopulos","year":"2005","unstructured":"Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2005) Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14(2):137\u2013154","journal-title":"The VLDB Journal"},{"issue":"2","key":"1712_CR13","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1145\/141484.130335","volume":"21","author":"PJ Haas","year":"1992","unstructured":"Haas PJ, Swami AN (1992) Sequential sampling procedures for query size estimation. SIGMOD Rec 21 (2):341\u2013350","journal-title":"SIGMOD Rec"},{"key":"1712_CR14","doi-asserted-by":"crossref","unstructured":"Hu Q, Wu J, Bai L, Zhang Y, Cheng J (2017) Fast k-means for large scale clustering. In: Proceedings of the 2017 ACM on conference on information and knowledge management, CIKM\u201917. 
ACM, New York, pp 2099\u20132102","DOI":"10.1145\/3132847.3133091"},{"issue":"4","key":"1712_CR15","doi-asserted-by":"publisher","first-page":"11:1","DOI":"10.1145\/1391729.1391730","volume":"40","author":"IF Ilyas","year":"2008","unstructured":"Ilyas IF, Beskales G, Soliman MA (2008) A survey of top-k query processing techniques in relational database systems. ACM Comput Surv 40(4):11:1\u201311:58","journal-title":"ACM Comput Surv"},{"key":"1712_CR16","unstructured":"Islam MS, Liu C, Zhou R (2012) On modeling query refinement by capturing user intent through feedback. In: Proceedings of the twenty-third australasian database conference, vol 124, pp 11\u201320"},{"issue":"6","key":"1712_CR17","doi-asserted-by":"publisher","first-page":"1580","DOI":"10.1016\/j.jss.2013.01.069","volume":"86","author":"MS Islam","year":"2013","unstructured":"Islam MS, Liu C, Zhou R (2013) A framework for query refinement with user feedback. Journal of Systems and Software 86(6):1580\u20131595","journal-title":"Journal of Systems and Software"},{"key":"1712_CR18","doi-asserted-by":"crossref","unstructured":"Mishra BK, Rath A, Nayak NR, Swain S (2012) Far efficient k-means clustering algorithm. In: Proceedings of the international conference on advances in computing, communications and informatics, ICACCI\u201912. ACM, New York, pp 106\u2013110","DOI":"10.1145\/2345396.2345414"},{"key":"1712_CR19","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1016\/j.chemolab.2013.10.012","volume":"130","author":"F Rodriguez-Lujan","year":"2014","unstructured":"Rodriguez-Lujan F, Vergara H (2014) Huerta on the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometrics and Intelligent Laboratory Systems 130:123\u2013134","journal-title":"Chemometrics and Intelligent Laboratory Systems"},{"key":"1712_CR20","doi-asserted-by":"crossref","unstructured":"To H, Chiang K, Shahabi C (2013) Entropy-based histograms for selectivity estimation. 
In: Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, CIKM \u201913. ACM, New York, pp 1939\u20131948","DOI":"10.1145\/2505515.2505756"},{"key":"1712_CR21","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1016\/j.snb.2012.01.074","volume":"166","author":"A Vergara","year":"2012","unstructured":"Vergara A, Vembu S, Ayhan T, Ryan MA, Homer ML, Huerta R (2012) Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical 166:320\u2013329","journal-title":"Sensors and Actuators B: Chemical"},{"key":"1712_CR22","doi-asserted-by":"crossref","unstructured":"Vitter JS, Wang M, Iyer B (1998) Data cube approximation and histograms via wavelets. In: Proceedings of the seventh international conference on information and knowledge management, CIKM\u201998. ACM, New York, pp 96\u2013104","DOI":"10.1145\/288627.288645"}],"container-title":["Applied Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-020-01712-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10489-020-01712-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-020-01712-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,5,14]],"date-time":"2021-05-14T00:09:45Z","timestamp":1620950985000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10489-020-01712-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,14]]},"references-count":22,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2020,10]]}},"alternative-id":["1712"],"URL":"https:\/\/doi.org\/10.1007\/s10489-020-01712-5","relation":{},"ISSN":["0924-669X","1573-7497"],"issn-type":[{"type":"print","value":"0924-669X"},{"type":"electronic","value":"1573-7497"}],"subject":[],"published":{"date-parts":[[2020,5,14]]},"assertion":[{"value":"14 May 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}