{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T09:45:10Z","timestamp":1770284710469,"version":"3.49.0"},"reference-count":67,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,2,27]],"date-time":"2023-02-27T00:00:00Z","timestamp":1677456000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1750667"],"award-info":[{"award-number":["1750667"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2023,2,27]]},"abstract":"<jats:p>Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1)~input stream level, 2)~automaton-level and 3)~state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources.<\/jats:p>\n          <jats:p>To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length.<\/jats:p>\n          <jats:p>When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58\u00d7 speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4\u00d7 speedup.<\/jats:p>","DOI":"10.1145\/3579453","type":"journal-article","created":{"date-parts":[[2023,3,2]],"date-time":"2023-03-02T23:50:57Z","timestamp":1677801057000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Asynchronous Automata Processing on GPUs"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6961-6394","authenticated-orcid":false,"given":"Hongyuan","family":"Liu","sequence":"first","affiliation":[{"name":"William &amp; Mary \/ The Hong Kong University of Science and Technology (Guangzhou), Williamsburg, VA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3691-7238","authenticated-orcid":false,"given":"Sreepathi","family":"Pai","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5525-7204","authenticated-orcid":false,"given":"Adwait","family":"Jog","sequence":"additional","affiliation":[{"name":"William &amp; Mary \/ University of Virginia, Williamsburg, VA, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,3,2]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Scheduling and Execution of Compute Tasks. US 2013\/0185728A1.","author":"Karim M.","unstructured":"M. Karim Abdalla et al. 2013. Scheduling and Execution of Compute Tasks. US 2013\/0185728A1."},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"018)]% mnrl K. Angstadt J. Wadden V. Dang T. Xie D. Kramp W. Weimer M. Stan and K. Skadron. 2018. MNCaRT: An Open-Source Multi-Architecture Automata-Processing Research and Execution Ecosystem. IEEE Computer Architecture Letters (CAL) (2018).","DOI":"10.1109\/LCA.2017.2780105"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2015.2429918"},{"key":"e_1_2_1_4_1","unstructured":"009)]% gpgpu-sim A. Bakhoda G.L. Yuan W.W.L. Fung H. Wong and T.M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1323548.1323573"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1477942.1477950"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2008.4636093"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00068"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA).","author":"Brodie Benjamin C.","unstructured":"Benjamin C. Brodie, David E. Taylor, and Ron K. Cytron. 2006. A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching. In Proceedings of the International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_2_1_10_1","volume-title":"iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Computer Communication Review (CCR)","author":"Cascarano Niccolo'","year":"2010","unstructured":"Niccolo' Cascarano, Pierluigi Rolando, Fulvio Risso, and Riccardo Sisto. 2010. iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Computer Communication Review (CCR) (2010)."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.8"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).","author":"Amr","unstructured":"Amr S. Elhelw and Sreepathi Pai. 2020. Horus: A Modular GPU Emulator Framework. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO).","author":"Fang Yuanwei","unstructured":"Yuanwei Fang, Tung T. Hoang, Michela Becchi, and Andrew A. Chien. 2015. Fast Support for Unstructured Data Processing: The Unified Automata Processor. In Proceedings of the International Symposium on Microarchitecture (MICRO)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00023"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1070\/RM1961v016n05ABEH004112"},{"key":"e_1_2_1_16_1","volume-title":"Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems","author":"Green Todd J.","year":"2004","unstructured":"Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. 2004. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems (2004)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.184"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018760"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).","author":"Khairy Mahmoud","unstructured":"Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523456"},{"key":"e_1_2_1_21_1","volume-title":"Fischer","author":"Ladner Richard E.","year":"1980","unstructured":"Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. Journal of the ACM (JACM) (1980)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00078"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378471"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339597"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541988"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079100"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/PADSW.2018.8644852"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS51040.2020.00073"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378461"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446705"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967965"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079082"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM48280.2020.00027"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO).","author":"Ren Bin","unstructured":"Bin Ren, Tomi Poutanen, Todd Mytkowicz, Wolfram Schulte, Gagan Agrawal, and James R. Larus. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the International Symposium on Code Generation and Optimization (CGO)."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the USENIX Conference on System Administration (LISA).","author":"Roesch Martin","year":"1999","unstructured":"Martin Roesch. 1999. Snort - Lightweight Intrusion Detection for Networks. In Proceedings of the USENIX Conference on System Administration (LISA)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219889"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378459"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00017"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480934"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2019.2909870"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358324"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079084"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919649"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080207"},{"key":"e_1_2_1_45_1","volume-title":"Cache Automaton. In Proceedings of the International Symposium on Microarchitecture (MICRO).","author":"Subramaniyan Arun","year":"2017","unstructured":"Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, and Reetuparna Das. 2017. Cache Automaton. In Proceedings of the International Symposium on Microarchitecture (MICRO)."},{"key":"e_1_2_1_46_1","volume-title":"Shi Dong, and David Kaeli.","author":"Sun Yifan","year":"2020","unstructured":"Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli. 2020. Summarizing CPU and GPU Design Trends with Product Data. arxiv: 1911.11313 [cs.DC]"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-41321-1_11"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2015.11.001"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2011.6114181"},{"key":"e_1_2_1_50_1","unstructured":"Kien Chi Vu. 2020. Accelerating bit-based finite automaton on a GPGPU device. (2020)."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2016.7581271"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2018.8573482"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2968456.2976763"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2903150.2903172"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/IMIS.2011.107"},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).","author":"Wang Xiang","year":"2019","unstructured":"Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS53621.2022.00053"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374524"},{"key":"e_1_2_1_60_1","volume-title":"Siu Ming Yiu, and Lucas Chi Kwong Hui","author":"Xu Chengcheng","year":"2016","unstructured":"Chengcheng Xu, Shuhui Chen, Jinshu Su, Siu Ming Yiu, and Lucas Chi Kwong Hui. 2016. A Survey on Regular Expression Matching for Deep Packet Inspection: Applications, Algorithms, and Hardware Platforms. IEEE Communications Surveys Tutorials (2016)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2011.129"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442548"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2482767.2482791"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2881025.2881034"},{"key":"e_1_2_1_65_1","volume-title":"Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI).","author":"Zhao Zhipeng","year":"2020","unstructured":"Zhipeng Zhao, Hugo Sadok, Nirav Atre, James C. Hoe, Vyas Sekar, and Justine Sherry. 2020. Achieving 100Gbps Intrusion Prevention on a Single Server. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI)."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694369"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541989"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145833"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579453","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3579453","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3579453","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:23Z","timestamp":1750182683000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579453"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,27]]},"references-count":67,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,2,27]]}},"alternative-id":["10.1145\/3579453"],"URL":"https:\/\/doi.org\/10.1145\/3579453","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,27]]},"assertion":[{"value":"2023-03-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}