{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T17:21:07Z","timestamp":1740158467505,"version":"3.37.3"},"reference-count":44,"publisher":"Sociedade Brasileira de Computacao - SB","issue":"1","license":[{"start":{"date-parts":[[2019,10,16]],"date-time":"2019-10-16T00:00:00Z","timestamp":1571184000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,10,16]],"date-time":"2019-10-16T00:00:00Z","timestamp":1571184000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010663","name":"H2020 European Research Council","doi-asserted-by":"publisher","award":["690116"],"award-info":[{"award-number":["690116"]}],"id":[{"id":"10.13039\/100010663","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002425","name":"Governo Brasil","doi-asserted-by":"publisher","award":["MCT\/RNP EUBRA-BIGSEA"],"award-info":[{"award-number":["MCT\/RNP EUBRA-BIGSEA"]}],"id":[{"id":"10.13039\/501100002425","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Internet Serv Appl"],"published-print":{"date-parts":[[2019,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n              <jats:p>High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence with the proposal of a framework that addresses some of the programming issues derived from such integration. 
Our contribution is the development of an integrated environment that integrates (<jats:italic>i<\/jats:italic>) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (<jats:italic>ii<\/jats:italic>) Lemonade, a data mining and analysis tool; and (<jats:italic>iii<\/jats:italic>) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfer, reducing execution time. The integration with Lemonade facilitates COMPSs\u2019s use and may help its popularization in the Data Science community, by providing efficient algorithm implementations for experts from the data domain that want to develop applications with a higher level abstraction.<\/jats:p>","DOI":"10.1186\/s13174-019-0118-7","type":"journal-article","created":{"date-parts":[[2019,10,16]],"date-time":"2019-10-16T13:09:37Z","timestamp":1571231377000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Upgrading a high performance computing environment for massive data processing"],"prefix":"10.5753","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1480-0039","authenticated-orcid":false,"given":"Lucas M.","family":"Ponce","sequence":"first","affiliation":[]},{"given":"Walter dos","family":"Santos","sequence":"additional","affiliation":[]},{"suffix":"Jr.","given":"Wagner","family":"Meira","sequence":"additional","affiliation":[]},{"given":"Dorgival","family":"Guedes","sequence":"additional","affiliation":[]},{"given":"Daniele","family":"Lezzi","sequence":"additional","affiliation":[]},{"given":"Rosa 
M.","family":"Badia","sequence":"additional","affiliation":[]}],"member":"3742","published-online":{"date-parts":[[2019,10,16]]},"reference":[{"key":"118_CR1","doi-asserted-by":"publisher","unstructured":"Kamburugamuve S, et al.Twister2: Design of a big data toolkit. Concurr Comput: Pract Experience. 2019;31(14). \n                    https:\/\/doi.org\/10.1002\/cpe.5189\n                    \n                  .","DOI":"10.1002\/cpe.5189"},{"key":"118_CR2","doi-asserted-by":"publisher","unstructured":"Fox G, et al.Big data, simulations and HPC convergence. In: Big Data Benchmarking: 6th International Workshop, WBDB 2015, Toronto, ON, Canada, June 16-17, 2015 and 7th International Workshop, WBDB 2015, New Delhi, India, December 14-15, 2015, Revised Selected Papers. Cham, Switzerland: Springer: 2016. p. 3\u201317. \n                    https:\/\/doi.org\/10.1007\/978-3-319-49748-8_1\n                    \n                  .","DOI":"10.1007\/978-3-319-49748-8_1"},{"key":"118_CR3","doi-asserted-by":"publisher","unstructured":"Tejedor E, et al.PyCOMPSs: Parallel computational workflows in Python. Int High Perform Comput Appl. 2017; 31(1):66\u201382. \n                    https:\/\/doi.org\/10.1177\/1094342015594678\n                    \n                  .","DOI":"10.1177\/1094342015594678"},{"key":"118_CR4","doi-asserted-by":"publisher","unstructured":"Asch M, et al.Big data and extreme-scale computing: Pathways to convergence-toward a shaping strategy for a future software and data ecosystem for scientific inquiry. Int J High Perform Comput Appl. 2018; 32(4):435\u201379. \n                    https:\/\/doi.org\/10.1177\/1094342018778123\n                    \n                  .","DOI":"10.1177\/1094342018778123"},{"key":"118_CR5","doi-asserted-by":"publisher","unstructured":"Lezzi D, et al.Enabling e-Science applications on the cloud with COMPSs. In: Parallel Processing Workshops at European Conference on Parallel Processing (Euro-Par 2011). 
Berlin: Springer: 2011. p. 25\u201334. \n                    https:\/\/doi.org\/10.1007\/978-3-642-29737-3_4\n                    \n                  .","DOI":"10.1007\/978-3-642-29737-3_4"},{"key":"118_CR6","doi-asserted-by":"publisher","DOI":"10.1109\/pdp.2016.39","volume-title":"24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016)","author":"F Lordan","year":"2016","unstructured":"Lordan F, Ejarque J, Sirvent R, Badia RM. Energy-aware programming model for distributed infrastructures. In: 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016). Washington: IEEE Computer Society: 2016. p. 413\u20137. \n                    https:\/\/doi.org\/10.1109\/pdp.2016.39\n                    \n                  ."},{"key":"118_CR7","unstructured":"Zaharia M, et al.Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI \u201912). Berkeley: USENIX Association: 2012. p. 15\u201328. \n                    https:\/\/dl.acm.org\/citation.cfm?id=2228301\n                    \n                  ."},{"key":"118_CR8","doi-asserted-by":"publisher","unstructured":"Santos W, et al.Lemonade: A scalable and efficient Spark-based platform for Data Analytics. In: 17th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). Piscataway: IEEE Press: 2017. p. 745\u20138. \n                    https:\/\/doi.org\/10.1109\/CCGRID.2017.142\n                    \n                  .","DOI":"10.1109\/CCGRID.2017.142"},{"key":"118_CR9","doi-asserted-by":"publisher","unstructured":"Marozzo F, et al.Enabling cloud interoperability with COMPSs. In: Parallel Processing Workshops at European Conference on Parallel Processing (Euro-Par 2012). Berlin: Springer: 2012. p. 16\u201327. 
\n                    https:\/\/doi.org\/10.1007\/978-3-642-32820-6_4\n                    \n                  .","DOI":"10.1007\/978-3-642-32820-6_4"},{"key":"118_CR10","doi-asserted-by":"publisher","unstructured":"Ramon-Cortes C, et al.Transparent orchestration of task-based parallel applications in containers platforms. 2018; 16(1):137\u201360. \n                    https:\/\/doi.org\/10.1007\/s10723-017-9425-z\n                    \n                  .","DOI":"10.1007\/s10723-017-9425-z"},{"key":"118_CR11","unstructured":"Apache Cassandra. \n                    http:\/\/cassandra.apache.org\/\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.17487\/RFC5661","volume":"5661","author":"S Shepler","year":"2010","unstructured":"Shepler S, Eisler M, Noveck D. Network file system (NFS) version 4 minor version 1 protocol. RFC. 2010; 5661:1\u2013617. \n                    https:\/\/doi.org\/10.17487\/RFC5661\n                    \n                  .","journal-title":"RFC"},{"key":"118_CR13","unstructured":"Li H. Alluxio: A virtual distributed file system. EECS Department, University of California, Berkeley, USA. 2018. \n                    http:\/\/www2.eecs.berkeley.edu\/Pubs\/TechRpts\/2018\/EECS-2018-29.html\n                    \n                  ."},{"key":"118_CR14","unstructured":"Amazon Simple Storage Service (S3). \n                    https:\/\/aws.amazon.com\/s3\/\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR15","unstructured":"Microsoft Azure Storage. \n                    https:\/\/azure.microsoft.com\/services\/storage\/\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR16","volume-title":"Proceedings of the Linux Symposium","author":"P Schwan","year":"2003","unstructured":"Schwan P. Lustre: Building a file system for 1000-node clusters. In: Proceedings of the Linux Symposium. 
Ottawa: Linux symposium: 2003. p. 380\u20136. \n                    https:\/\/www.kernel.org\/doc\/ols\/2003\/ols2003-pages-380-386.pdf\n                    \n                  ."},{"key":"118_CR17","unstructured":"OpenStack Storage (Swift). \n                    https:\/\/docs.openstack.org\/swift\/\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR18","unstructured":"Weil SA, et al.Ceph: A scalable, high-performance distributed file system. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI \u201906). Berkeley: USENIX Association: 2006. p. 307\u201320. \n                    http:\/\/dl.acm.org\/citation.cfm?id=1298455.1298485\n                    \n                  ."},{"key":"118_CR19","doi-asserted-by":"publisher","unstructured":"Andersen DG, et al.FAWN: A fast array of wimpy nodes. In: Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP \u201909). New York: ACM: 2009. p. 1\u201314. \n                    https:\/\/doi.org\/10.1145\/1629575.1629577\n                    \n                  .","DOI":"10.1145\/1629575.1629577"},{"key":"118_CR20","doi-asserted-by":"publisher","unstructured":"DeCandia G, et al.Dynamo: Amazon\u2019s highly available key-value store. In: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP \u201907). New York: ACM: 2007. p. 205\u201320. \n                    https:\/\/doi.org\/10.1145\/1294261.1294281\n                    \n                  .","DOI":"10.1145\/1294261.1294281"},{"key":"118_CR21","unstructured":"Memcached: A distributed memory object caching system. \n                    http:\/\/memcached.org\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR22","unstructured":"Apache HBase. \n                    http:\/\/hbase.apache.org\/.\n                    \n                  . 
Accessed 4 July 2019."},{"key":"118_CR23","doi-asserted-by":"publisher","unstructured":"Palankar MR, et al.Amazon S3 for science grids: A viable solution? In: Proceedings of the 2008 International Workshop on Data-aware Distributed Computing (DADC \u201908). New York: ACM: 2008. p. 55\u201364. \n                    https:\/\/doi.org\/10.1145\/1383519.1383526\n                    \n                  .","DOI":"10.1145\/1383519.1383526"},{"key":"118_CR24","doi-asserted-by":"publisher","unstructured":"Wickramasinghe P, et al.Twister2:TSet high-performance iterative dataflow. In: International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS 2019). Piscataway: IEEE Press: 2019. p. 55\u201360. \n                    https:\/\/doi.org\/10.1109\/HPBDIS.2019.8735495\n                    \n                  .","DOI":"10.1109\/HPBDIS.2019.8735495"},{"issue":"21","key":"118_CR25","doi-asserted-by":"publisher","first-page":"2778","DOI":"10.1093\/bioinformatics\/btq524","volume":"26","author":"L Goodstadt","year":"2010","unstructured":"Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010; 26(21):2778\u20139. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/btq524\n                    \n                  .","journal-title":"Bioinformatics"},{"key":"118_CR26","doi-asserted-by":"publisher","unstructured":"Gafni E, et al.COSMOS: Python library for massively parallel workflows. Bioinformatics. 2014; 30(20):2956\u20138. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/btu385\n                    \n                  .","DOI":"10.1093\/bioinformatics\/btu385"},{"key":"118_CR27","doi-asserted-by":"publisher","unstructured":"Mierswa I, et al.YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2006. p. 935\u201340. 
\n                    https:\/\/doi.org\/10.1145\/1150402.1150531\n                    \n                  .","DOI":"10.1145\/1150402.1150531"},{"key":"118_CR28","unstructured":"Dem\u0161ar J, et al.Orange: Data mining toolbox in Python. J Mach Learn Res. 2013; 14(1):2349\u201353."},{"key":"118_CR29","doi-asserted-by":"publisher","unstructured":"Berthold MR, et al.KNIME - the konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explor Newsl. 2009; 11(1):26\u201331. \n                    https:\/\/doi.org\/10.1145\/1656274.1656280\n                    \n                  .","DOI":"10.1145\/1656274.1656280"},{"key":"118_CR30","volume-title":"OSDI\u201904: Sixth Symposium on Operating System Design and Implementation","author":"J Dean","year":"2004","unstructured":"Dean J, Ghemawat S. Mapreduce: Simplified data processing on large clusters. In: OSDI\u201904: Sixth Symposium on Operating System Design and Implementation. San Francisco: USENIX Association: 2004. p. 137\u201350."},{"key":"118_CR31","doi-asserted-by":"publisher","unstructured":"Kranjc J, et al.ClowdFlows: A cloud based scientific workflow platform. In: Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD 2012). Berlin: Springer: 2012. p. 816\u20139. \n                    https:\/\/doi.org\/10.1007\/978-3-642-33486-3_5\n                    \n                  .","DOI":"10.1007\/978-3-642-33486-3_5"},{"issue":"1","key":"118_CR32","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1093\/comjnl\/bxr077","volume":"55","author":"V Podpe\u010dan","year":"2012","unstructured":"Podpe\u010dan V, Zemenova M, Lavra\u010d N. Orange4WS environment for service-oriented data mining. Comput J. 2012; 55(1):82\u201398. \n                    https:\/\/doi.org\/10.1093\/comjnl\/bxr077\n                    \n                  .","journal-title":"Comput J"},{"key":"118_CR33","unstructured":"Microsoft Azure Machine Learning. 
\n                    https:\/\/azure.microsoft.com\/services\/machine-learning-studio\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR34","doi-asserted-by":"publisher","unstructured":"Conejero J, et al.Task-based programming in COMPSs to converge from HPC to big data. Int J Perform Comput Appl. 2018; 32(1):45\u201360. \n                    https:\/\/doi.org\/10.1177\/1094342017701278\n                    \n                  .","DOI":"10.1177\/1094342017701278"},{"key":"118_CR35","volume-title":"Hadoop: The Definitive Guide","author":"T White","year":"2015","unstructured":"White T. Hadoop: The Definitive Guide, 4th. Sebastopol: O\u2019Reilly Media, Inc.; 2015."},{"key":"118_CR36","unstructured":"Gonzales SD. PyWebHDFS: a Python wrapper for the Hadoop WebHDFS REST API. 2016. \n                    https:\/\/pypi.python.org\/pypi\/pywebhdfs\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR37","unstructured":"Luckow A. WebHDFS: HDFS Python client based on WebHDFS REST API. 2014. \n                    https:\/\/pypi.org\/project\/WebHDFS\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR38","unstructured":"Kalika M. Python WebHDFS. 2019. \n                    https:\/\/github.com\/mk23\/webhdfs.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR39","unstructured":"Rosen J. PySpark Internals. 2016. \n                    https:\/\/cwiki.apache.org\/confluence\/display\/SPARK\/PySpark+Internals\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR40","doi-asserted-by":"publisher","DOI":"10.1145\/1851476.1851594","volume-title":"Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing","author":"S Leo","year":"2010","unstructured":"Leo S, Zanetti G. Pydoop: a Python MapReduce and HDFS API for Hadoop. 
In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. New York: ACM: 2010. p. 819\u201325. \n                    https:\/\/doi.org\/10.1145\/1851476.1851594\n                    \n                  ."},{"key":"118_CR41","unstructured":"Apache Arrow Developers. Pyarrow: Python library for Apache Arrow. 2016. \n                    https:\/\/pypi.org\/project\/pyarrow\/.\n                    \n                  . Accessed 4 July 2019."},{"key":"118_CR42","doi-asserted-by":"publisher","unstructured":"Chang L, et al.HAWQ: A massively parallel processing SQL engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD \u201914). New York: ACM: 2014. p. 1223\u201334. \n                    https:\/\/doi.org\/10.1145\/2588555.2595636\n                    \n                  .","DOI":"10.1145\/2588555.2595636"},{"key":"118_CR43","volume-title":"Workshop on Python for High Performance and Scientific Computing Collocated with the 24rd International Conference for High Performance Computing, Networking, Storage and Analysis (SC \u201911)","author":"W McKinney","year":"2011","unstructured":"McKinney W. Pandas: a foundational Python library for data analysis and statistics. In: Workshop on Python for High Performance and Scientific Computing Collocated with the 24rd International Conference for High Performance Computing, Networking, Storage and Analysis (SC \u201911). New York: ACM: 2011."},{"key":"118_CR44","volume-title":"The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley Computer Publishing","author":"R Jain","year":"1991","unstructured":"Jain R. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley Computer Publishing. 
New York: Wiley; 1991."}],"container-title":["Journal of Internet Services and Applications"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13174-019-0118-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s13174-019-0118-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13174-019-0118-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,2,9]],"date-time":"2022-02-09T22:15:29Z","timestamp":1644444929000},"score":1,"resource":{"primary":{"URL":"https:\/\/jisajournal.springeropen.com\/articles\/10.1186\/s13174-019-0118-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,16]]},"references-count":44,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,12]]}},"alternative-id":["118"],"URL":"https:\/\/doi.org\/10.1186\/s13174-019-0118-7","relation":{},"ISSN":["1867-4828","1869-0238"],"issn-type":[{"type":"print","value":"1867-4828"},{"type":"electronic","value":"1869-0238"}],"subject":[],"published":{"date-parts":[[2019,10,16]]},"assertion":[{"value":"15 February 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 September 2019","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2019","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"19"}}