{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T02:53:51Z","timestamp":1779332031787,"version":"3.51.4"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2018,8]]},"abstract":"<jats:p>\n            This paper describes the challenges and experiences in the development of IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface for multiple computing engines. Each participating engine implements a\n            <jats:italic>runner<\/jats:italic>\n            to translate Beam applications into engine-specific programs. Hence, applications written with the Beam SDK can be executed on different underlying stream computing engines, with negligible migration penalty. IBM Streams is a widely-used enterprise streaming platform. It has a rich set of connectors and toolkits for easy integration of streaming applications with other enterprise applications. It also supports a broad range of programming language interfaces, including Java, C++, Python, Stream Processing Language (SPL) and Apache Beam. This paper focuses on our solutions to efficiently support the Beam programming abstractions in IBM Streams runner. Beam organizes data into discrete event time windows. This design, on the one hand, supports out-of-order data arrivals, but on the other hand, forces runners to maintain more states, which leads to higher space and computation overhead. IBM Streams runner mitigates this problem by efficiently indexing inter-dependent states, garbage-collecting stale keys, and enforcing bundle sizes. We also share performance concerns in Beam that could potentially impact applications. Evaluations show that IBM Streams runner outperforms Flink runner and Spark runner in most scenarios when running the Beam NEXMark benchmarks. IBM Streams runner is available for download from IBM Cloud Streaming Analytics service console.\n          <\/jats:p>","DOI":"10.14778\/3229863.3229864","type":"journal-article","created":{"date-parts":[[2018,9,10]],"date-time":"2018-09-10T12:12:28Z","timestamp":1536581548000},"page":"1742-1754","source":"Crossref","is-referenced-by-count":16,"title":["Challenges and experiences in building an efficient apache beam runner for IBM streams"],"prefix":"10.14778","volume":"11","author":[{"given":"Shen","family":"Li","sequence":"first","affiliation":[{"name":"IBM Research AI"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paul","family":"Gerver","sequence":"additional","affiliation":[{"name":"IBM Watson Cloud Platform"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"John","family":"MacMillan","sequence":"additional","affiliation":[{"name":"IBM Watson Cloud Platform"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel","family":"Debrunner","sequence":"additional","affiliation":[{"name":"IBM Watson Cloud Platform"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"William","family":"Marshall","sequence":"additional","affiliation":[{"name":"IBM Watson Cloud Platform"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kun-Lung","family":"Wu","sequence":"additional","affiliation":[{"name":"IBM Research AI"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Retrieved","author":"Storm Alibaba","year":"2018","unstructured":"Alibaba J Storm : an enterprise fast and stable streaming process engine. http:\/\/jstorm.io\/ . Retrieved Feb , 2018 . Alibaba JStorm: an enterprise fast and stable streaming process engine. http:\/\/jstorm.io\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_2_1","volume-title":"Retrieved","author":"Apex Apache","year":"2018","unstructured":"Apache Apex : Enterprise-grade unified stream and batch processing engine. https:\/\/apex.apache.org\/ . Retrieved Feb , 2018 . Apache Apex: Enterprise-grade unified stream and batch processing engine. https:\/\/apex.apache.org\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_3_1","volume-title":"Retrieved","author":"Beam Apache","year":"2018","unstructured":"Apache Beam : An advanced unified programming model. https:\/\/beam.apache.org . Retrieved Feb , 2018 . Apache Beam: An advanced unified programming model. https:\/\/beam.apache.org. Retrieved Feb, 2018."},{"key":"e_1_2_1_4_1","volume-title":"https:\/\/beam.apache.org\/documentation\/programming-guide\/. Retrieved","author":"Programming Guide Apache Beam","year":"2018","unstructured":"Apache Beam Programming Guide . https:\/\/beam.apache.org\/documentation\/programming-guide\/. Retrieved Feb , 2018 . Apache Beam Programming Guide. https:\/\/beam.apache.org\/documentation\/programming-guide\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_5_1","volume-title":"Retrieved","author":"Flink Apache","year":"2018","unstructured":"Apache Flink : an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. https:\/\/flink.apache.org\/ . Retrieved Feb , 2018 . Apache Flink: an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. https:\/\/flink.apache.org\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_6_1","volume-title":"Retrieved","author":"Gearpump Apache","year":"2018","unstructured":"Apache Gearpump : a real-time big data streaming engine. https:\/\/gearpump.apache.org . Retrieved Feb , 2018 . Apache Gearpump: a real-time big data streaming engine. https:\/\/gearpump.apache.org. Retrieved Feb, 2018."},{"key":"e_1_2_1_7_1","volume-title":"http:\/\/hadoop.apache.org\/. Retrieved","author":"Hadoop Apache","year":"2018","unstructured":"Apache Hadoop . http:\/\/hadoop.apache.org\/. Retrieved Feb , 2018 . Apache Hadoop. http:\/\/hadoop.apache.org\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_8_1","volume-title":"Retrieved","author":"Samza Apache","year":"2018","unstructured":"Apache Samza : a distributed stream processing framework. https:\/\/samza.apache.org\/ . Retrieved Feb , 2018 . Apache Samza: a distributed stream processing framework. https:\/\/samza.apache.org\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_9_1","volume-title":"Retrieved","author":"Spark Apache","year":"2018","unstructured":"Apache Spark : a fast and general engine for large-scale data processing. https:\/\/spark.apache.org\/ . Retrieved Feb , 2018 . Apache Spark: a fast and general engine for large-scale data processing. https:\/\/spark.apache.org\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_10_1","volume-title":"Retrieved","year":"2018","unstructured":"AthenaX : SQL-based streaming analytics platform at scale. http:\/\/athenax.readthedocs.io\/ . Retrieved Feb , 2018 . AthenaX: SQL-based streaming analytics platform at scale. http:\/\/athenax.readthedocs.io\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_11_1","volume-title":"Retrieved","author":"Dataflow Google","year":"2018","unstructured":"Google Dataflow : Simplified stream and batch data processing, with equal reliability and expressiveness. https:\/\/cloud.google.com\/dataflow\/ . Retrieved Feb , 2018 . Google Dataflow: Simplified stream and batch data processing, with equal reliability and expressiveness. https:\/\/cloud.google.com\/dataflow\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_12_1","volume-title":"Retrieved","author":"Stream Analytics IBM","year":"2018","unstructured":"IBM Stream Analytics : Leverage continuously available data from all sources to discover opportunities faster. https:\/\/www.ibm.com\/cloud\/streaming-analytics . Retrieved Feb , 2018 . IBM Stream Analytics: Leverage continuously available data from all sources to discover opportunities faster. https:\/\/www.ibm.com\/cloud\/streaming-analytics. Retrieved Feb, 2018."},{"key":"e_1_2_1_13_1","volume-title":"http:\/\/ibmstreams.github.io\/streamsx.documentation\/docs\/beamrunner\/. Retrieved","author":"Streams IBM","year":"2018","unstructured":"IBM Streams Runner for Apache Beam . http:\/\/ibmstreams.github.io\/streamsx.documentation\/docs\/beamrunner\/. Retrieved Feb , 2018 . IBM Streams Runner for Apache Beam. http:\/\/ibmstreams.github.io\/streamsx.documentation\/docs\/beamrunner\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_14_1","volume-title":"http:\/\/ibmstreams.github.io\/streamsx.topology\/. Retrieved","author":"Streams Topology Toolkit IBM","year":"2018","unstructured":"IBM Streams Topology Toolkit . http:\/\/ibmstreams.github.io\/streamsx.topology\/. Retrieved Feb , 2018 . IBM Streams Topology Toolkit. http:\/\/ibmstreams.github.io\/streamsx.topology\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_15_1","volume-title":"https:\/\/beam.apache.org\/documentation\/sdks\/java\/nexmark\/. Retrieved","author":"Nexmark","year":"2018","unstructured":"Nexmark benchmark suite. https:\/\/beam.apache.org\/documentation\/sdks\/java\/nexmark\/. Retrieved May , 2018 . Nexmark benchmark suite. https:\/\/beam.apache.org\/documentation\/sdks\/java\/nexmark\/. Retrieved May, 2018."},{"key":"e_1_2_1_16_1","volume-title":"https:\/\/beam.apache.org\/blog\/2017\/02\/13\/stateful-processing.html. Retrieved","author":"Apache Beam Stateful","year":"2018","unstructured":"Stateful processing with Apache Beam . https:\/\/beam.apache.org\/blog\/2017\/02\/13\/stateful-processing.html. Retrieved Feb , 2018 . Stateful processing with Apache Beam. https:\/\/beam.apache.org\/blog\/2017\/02\/13\/stateful-processing.html. Retrieved Feb, 2018."},{"key":"e_1_2_1_17_1","volume-title":"Retrieved","year":"2018","unstructured":"StreamsDev : IBM Streams Developer Community. https:\/\/developer.ibm.com\/streamsdev\/ . Retrieved Feb , 2018 . StreamsDev: IBM Streams Developer Community. https:\/\/developer.ibm.com\/streamsdev\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_18_1","volume-title":"Retrieved","author":"Heron Twitter","year":"2018","unstructured":"Twitter Heron : A realtime, distributed, fault-tolerant stream processing engine from Twitter. https:\/\/twitter.github.io\/heron\/ . Retrieved Feb , 2018 . Twitter Heron: A realtime, distributed, fault-tolerant stream processing engine from Twitter. https:\/\/twitter.github.io\/heron\/. Retrieved Feb, 2018."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536222.2536229"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824076"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-004-0147-z"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137777"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137786"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/IEMBS.2004.1403627"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.295"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2013.2243535"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007272"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742788"},{"key":"e_1_2_1_30_1","first-page":"97","volume-title":"USENIX Annual Technical Conference (USENIX ATC)","author":"Li S.","year":"2015","unstructured":"S. Li , S. Hu , R. Ganti , M. Srivatsa , and T. Abdelzaher . Pyro: A spatial-temporal big-data storage system . In USENIX Annual Technical Conference (USENIX ATC) , pages 97 -- 109 , 2015 . S. Li, S. Hu, R. Ganti, M. Srivatsa, and T. Abdelzaher. Pyro: A spatial-temporal big-data storage system. In USENIX Annual Technical Conference (USENIX ATC), pages 97--109, 2015."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137770"},{"key":"e_1_2_1_32_1","volume-title":"Safe data parallelism for general streaming","author":"Schneider S.","year":"2015","unstructured":"S. Schneider , M. Hirzel , B. Gedik , and K.-L. Wu . Safe data parallelism for general streaming . IEEE transactions on computers, 64(2):504--517, 2015 . S. Schneider, M. Hirzel, B. Gedik, and K.-L. Wu. Safe data parallelism for general streaming. IEEE transactions on computers, 64(2):504--517, 2015."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2988336.2990475"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062366"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/2752939.2752940"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2595641"},{"key":"e_1_2_1_37_1","first-page":"2","volume-title":"Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation","author":"Zaharia M.","year":"2012","unstructured":"M. Zaharia , M. Chowdhury , T. Das , A. Dave , J. Ma , M. McCauley , M. J. Franklin , S. Shenker , and I. Stoica . Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing . In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation , pages 2 -- 2 . USENIX Association , 2012 . M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2--2. USENIX Association, 2012."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522737"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3229863.3229864","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:07:32Z","timestamp":1672222052000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3229863.3229864"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,8]]},"references-count":38,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2018,8]]}},"alternative-id":["10.14778\/3229863.3229864"],"URL":"https:\/\/doi.org\/10.14778\/3229863.3229864","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2018,8]]}}}