{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:30:54Z","timestamp":1772166654391,"version":"3.50.1"},"reference-count":23,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T00:00:00Z","timestamp":1755734400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T00:00:00Z","timestamp":1755734400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100009367","name":"Mansoura University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100009367","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>With the rapid growth of the Internet of Things (IoT) and the emergence of big data, handling massive amounts of data has become a major challenge. Traditional approaches involve sending raw data to cloud data centers for cleaning, processing, and interpretation using data warehouse tools. However, this study introduces BlueEdge, a fog edge mobile application that aims to shift the cleaning and preprocessing tasks from the cloud to the edge. We compare BlueEdge with four popular data cleaning tools (WinPure, DoubleTake, WizSame, and DQGlobal) that operate within data warehouse architectures, such as Hadoop servers. The comparison considers criteria such as time consumption, resource utilization (memory and CPU), and tool performance. BlueEdge utilizes Natural Language Processing (NLP) techniques, including those from the Natural Language Toolkit (NLTK) and Python packages, to connect with a real-time database. 
Our results show that BlueEdge achieved accuracy between 72% and 95% across 6 categories of name-based duplicate detection tasks, indicating competitive performance in mobile edge environments. The framework was validated on a larger dataset of 146 error cases, with statistically significant results and confidence intervals of between 3.4% and 5.8%. Statistical comparison indicates consistently significant differences (p\u2009&lt;\u20090.05) relative to the baseline settings of four commercial tools, with large effect sizes (Cohen's d: 0.89\u20131.34). BlueEdge handles duplicate-elimination cases such as different spellings and pronunciations (78.4%, CI: 73.1\u201383.7%), misspellings (72.0%, CI: 66.2\u201377.8%), name abbreviations (90.5%, CI: 86.1\u201394.9%), honorific prefixes (95.2%, CI: 91.8\u201398.6%), common nicknames (76.2%, C The reliable performance of edge-based data cleaning is verified through cross-validation analysis (81.7%\u2009\u00b1\u20092.3%), the results of which confirm its consistency. Additionally, BlueEdge uses a minimal bandwidth of only 5000 bytes per edge on mobile phones, whereas data warehouses require 10,000\u201360,000 bytes on Hadoop machines. BlueEdge is also designed to reduce data cleaning time to 1\u00a0s at the data edge, compared with the 4\u201330\u00a0s typical of data warehouses. BlueEdge is easy to use, requires no authorization on mobile devices, and is free of charge. The framework was validated through controlled experimental testing and real-world deployment at an IT services company, achieving an overall ITSQM quality score of 8.9\/10 and demonstrating practical effectiveness in organizational settings. 
This foundation has been further enhanced with neural network-based classification approaches, which are currently under peer review.<\/jats:p>","DOI":"10.1186\/s40537-025-01262-y","type":"journal-article","created":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T09:47:30Z","timestamp":1755769650000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["BlueEdge: application design for big data cleaning processing using mobile edge computing environments"],"prefix":"10.1186","volume":"12","author":[{"given":"Nagwa","family":"Elmobark","sequence":"first","affiliation":[]},{"given":"Haitham","family":"El-ghareeb","sequence":"additional","affiliation":[]},{"given":"Sara","family":"Elhishi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,21]]},"reference":[{"issue":"1","key":"1262_CR1","doi-asserted-by":"publisher","first-page":"450","DOI":"10.1109\/JIOT.2017.2750180","volume":"5","author":"N Abbas","year":"2017","unstructured":"Abbas N, Zhang Y, Taherkordi A, Skeie T. Mobile edge computing: A survey. IEEE Internet Things J. 2017;5(1):450\u201365.","journal-title":"IEEE Internet Things J"},{"key":"1262_CR2","unstructured":"Akhbardeh F. NLP and ML methods for pre-processing, clustering and classification of technical logbook datasets. July 2022. https:\/\/scholarworks.rit.edu\/theses\/11227"},{"key":"1262_CR3","doi-asserted-by":"crossref","unstructured":"Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003. p. 39\u201348.","DOI":"10.1145\/956750.956759"},{"key":"1262_CR4","doi-asserted-by":"crossref","unstructured":"Bird S. NLTK: The natural language toolkit. 
COLING\/ACL 2006\u201421st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, proceedings of the interactive presentation sessions. 2006, pp 69\u201372.","DOI":"10.3115\/1225403.1225421"},{"key":"1262_CR5","doi-asserted-by":"publisher","unstructured":"Bonomi F, Milito R, Zhu J, Addepalli S. Fog computing and its role in the internet of things. In: MCC'12\u2014proceedings of the 1st ACM mobile cloud computing workshop. 2012. p. 13\u201315. https:\/\/doi.org\/10.1145\/2342509.2342513","DOI":"10.1145\/2342509.2342513"},{"issue":"2","key":"1262_CR6","doi-asserted-by":"publisher","first-page":"834","DOI":"10.12928\/TELKOMNIKA.V16I2.7669","volume":"16","author":"A Bramantoro","year":"2018","unstructured":"Bramantoro A. Data cleaning service for data warehouse: an experimental comparative study on local data. Telkomnika (Telecommunication, Computing, Electronics, and Control). 2018;16(2):834\u201342. https:\/\/doi.org\/10.12928\/TELKOMNIKA.V16I2.7669.","journal-title":"Telkomnika (Telecommunication, Computing, Electronics, and Control)"},{"issue":"9","key":"1262_CR7","doi-asserted-by":"publisher","first-page":"1537","DOI":"10.1109\/TKDE.2011.127","volume":"24","author":"P Christen","year":"2012","unstructured":"Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng. 2012;24(9):1537\u201355.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"3","key":"1262_CR8","doi-asserted-by":"publisher","first-page":"288","DOI":"10.1145\/352595.352598","volume":"18","author":"WW Cohen","year":"2003","unstructured":"Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2003;18(3):288\u2013321.","journal-title":"ACM Trans Inf Syst"},{"key":"1262_CR9","unstructured":"Dong W, Douglis F, Reddy S, Li K, Shilane P, Patterson H. 
Tradeoffs in scalable data routing for deduplication clusters. In: Proceedings of FAST 2011: 9th USENIX conference on file and storage technologies, November 2017. p. 15\u201329."},{"key":"1262_CR10","doi-asserted-by":"publisher","unstructured":"Drolia U, Martins R, Tan J, Chheda A, Sanghavi M, Gandhi R, Narasimhan P. The case for mobile edge-clouds. In: Proceedings\u2014IEEE 10th international conference on ubiquitous intelligence and computing, UIC 2013 and IEEE 10th international conference on autonomic and trusted computing. ATC; 2013. p. 209\u2013215. https:\/\/doi.org\/10.1109\/UIC-ATC.2013.94","DOI":"10.1109\/UIC-ATC.2013.94"},{"key":"1262_CR11","unstructured":"Kobzdej P (Adam MU, Walig\u00f3ra D, Wielebi\u0144ska K, Paprzycki M (n.d.). Parallel application of levenshtein distance to establish similarity between strings."},{"issue":"12","key":"1262_CR12","doi-asserted-by":"publisher","first-page":"1878","DOI":"10.14778\/2367502.2367527","volume":"5","author":"L Kolb","year":"2012","unstructured":"Kolb L, Thor A, Rahm E. Dedoop: efficient deduplication with Hadoop. Proc VLDB Endow. 2012;5(12):1878\u201381.","journal-title":"Proc VLDB Endow"},{"issue":"3","key":"1262_CR13","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1109\/MSP.2020.2975749","volume":"37","author":"T Li","year":"2020","unstructured":"Li T, Sahu AK, Talwalkar A, Smith V. Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag. 2020;37(3):50\u201360.","journal-title":"IEEE Signal Process Mag"},{"key":"1262_CR14","unstructured":"M\u00fcller H, Freytag J. Problems, methods, and challenges in comprehensive data cleansing. informatics reports. Institute for Computer Science, Humboldt University of Berlin, HUB-IB-164, Humboldt University Berlin; 2003. p. 1\u201323. http:\/\/www.dbis.informatik.hu-berlin.de\/fileadmin\/research\/papers\/techreports\/2003-hub_ib_164-mueller.pdf"},{"key":"1262_CR15","unstructured":"Elmobark N, El-ghareeb H, Elhishi S. 
Perspectives on the integration of the internet of things and fog computing for geospatial big data analytics. Mach Intell Res. 2023;17(1):9515\u201328."},{"key":"1262_CR16","unstructured":"NLTK: nltk.metrics. Distance module. (n.d.). Retrieved 30 May 2023, from https:\/\/www.nltk.org\/api\/nltk.metrics.distance.html"},{"issue":"4","key":"1262_CR17","first-page":"3","volume":"23","author":"E Rahm","year":"2000","unstructured":"Rahm E, Do H. Data cleaning: problems and current approaches. IEEE Data Eng Bull. 2000;23(4):3\u201313. http:\/\/dc-pubs.dbs.uni-leipzig.de\/files\/Rahm2000DataCleaningProblemsand.pdf","journal-title":"IEEE Data Eng Bull"},{"issue":"9","key":"1262_CR18","doi-asserted-by":"publisher","first-page":"329","DOI":"10.3390\/fi16090329","volume":"16","author":"A Rancea","year":"2024","unstructured":"Rancea A, Anghel I, Cioara T. Edge computing in healthcare: innovations, opportunities, and challenges. Future Internet. 2024;16(9):329. https:\/\/doi.org\/10.3390\/fi16090329.","journal-title":"Future Internet"},{"issue":"2","key":"1262_CR19","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1109\/MPRV.2015.32","volume":"14","author":"M Satyanarayanan","year":"2015","unstructured":"Satyanarayanan M, Simoens P, Xiao Y, Pillai P, Chen Z, Ha K, Hu W, Amos B. Edge analytics in the internet of things. IEEE Pervas Comput. 2015;14(2):24\u201331.","journal-title":"IEEE Pervas Comput"},{"issue":"5","key":"1262_CR20","doi-asserted-by":"publisher","first-page":"637","DOI":"10.1109\/JIOT.2016.2579198","volume":"3","author":"W Shi","year":"2016","unstructured":"Shi W, Cao J, Zhang Q, Li Y, Xu L. Edge computing: vision and challenges. IEEE Internet Things J. 2016;3(5):637\u201346.","journal-title":"IEEE Internet Things J"},{"key":"1262_CR21","doi-asserted-by":"publisher","DOI":"10.3390\/jsan6030017","author":"MH Ur Rehman","year":"2017","unstructured":"Ur Rehman MH, Jayaraman PP, Malik UR, S., Ur Rehman Khan, A., & Gaber, M. M. 
RedEdge: a novel architecture for big data processing in mobile edge computing environments. J Sens Actuator Netw. 2017. https:\/\/doi.org\/10.3390\/jsan6030017.","journal-title":"J Sens Actuator Netw"},{"key":"1262_CR22","doi-asserted-by":"publisher","first-page":"680","DOI":"10.1016\/j.future.2016.11.009","volume":"78","author":"R Roman","year":"2018","unstructured":"Roman R, Lopez J, Mambo M. Mobile edge computing, fog et al.: a survey and analysis of security threats and challenges. Future Gener Comput Syst. 2018;78:680\u201398.","journal-title":"Future Gener Comput Syst"},{"key":"1262_CR23","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1016\/j.bdr.2014.07.001","volume":"1","author":"H Zou","year":"2014","unstructured":"Zou H, Yu Y, Tang W, Chen HWM. Flexanalytics: a flexible data analytics framework for big data applications with I\/O performance improvement. Big Data Res. 2014;1:4\u201313. https:\/\/doi.org\/10.1016\/j.bdr.2014.07.001.","journal-title":"Big Data Res"}],"container-title":["Journal of Big 
Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01262-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01262-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01262-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T04:31:22Z","timestamp":1757478682000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01262-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,21]]},"references-count":23,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1262"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01262-y","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3049779\/v1","asserted-by":"object"}]},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,21]]},"assertion":[{"value":"11 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 August 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"We declare that all the information provided in this manuscript is accurate and complete. 
Any errors or omissions are our responsibility.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"This research has been conducted per ethical principles and guidelines.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All individuals mentioned as authors have reviewed and approved the final version of the manuscript and have agreed to its submission to the Journal of Big Data for publication. Furthermore, we confirm that this manuscript has not been previously published and is not currently under consideration for publication.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"All authors in this study declare that they have no competing interests that could affect the submitted manuscript.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"204"}}