{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T16:02:42Z","timestamp":1774454562992,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,12,14]],"date-time":"2021-12-14T00:00:00Z","timestamp":1639440000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CNS-2124393, CNS-1704077, and CNS-2126327"],"award-info":[{"award-number":["CNS-2124393, CNS-1704077, and CNS-2126327"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001665","name":"Agence Nationale de la Recherche","doi-asserted-by":"crossref","award":["ANR-21-CE94-0001-01"],"award-info":[{"award-number":["ANR-21-CE94-0001-01"]}],"id":[{"id":"10.13039\/501100001665","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100006785","name":"Google","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100006785","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100016745","name":"Comcast","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100016745","id-type":"DOI","asserted-by":"publisher"}]},{"name":"France and Chicago Collaborating in the Sciences"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2021,12,14]]},"abstract":"<jats:p>Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. 
The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Towards this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. 
Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.<\/jats:p>","DOI":"10.1145\/3491052","type":"journal-article","created":{"date-parts":[[2021,12,15]],"date-time":"2021-12-15T18:32:19Z","timestamp":1639593139000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Traffic Refinery"],"prefix":"10.1145","volume":"5","author":[{"given":"Francesco","family":"Bronzino","sequence":"first","affiliation":[{"name":"Universit\u00e9 Savoie Mont Blanc, Annecy-le-Vieux, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paul","family":"Schmitt","sequence":"additional","affiliation":[{"name":"USC Information Sciences Institute, Los Angeles, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sara","family":"Ayoubi","sequence":"additional","affiliation":[{"name":"Nokia Bell Labs, Paris-Saclay, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hyojoon","family":"Kim","sequence":"additional","affiliation":[{"name":"Princeton University, Princeton, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Renata","family":"Teixeira","sequence":"additional","affiliation":[{"name":"Inria, Paris, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nick","family":"Feamster","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, IL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,12,15]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2018. Deep Learning models for network traffic classification. https:\/\/github.com\/echowei\/DeepTraffic\/."},{"key":"e_1_2_1_2_1","unstructured":"2018. DPDK Data Plane Development Kit. 
https:\/\/www.dpdk.org\/."},{"key":"e_1_2_1_3_1","unstructured":"2019. Corelight. https:\/\/corelight.com\/."},{"key":"e_1_2_1_4_1","volume-title":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"5","unstructured":"2019. Deepfield. https:\/\/www.nokia.com\/networks\/solutions\/deepfield\/."},{"key":"e_1_2_1_5_1","unstructured":"2019. Kentik. https:\/\/kentik.com\/."},{"key":"e_1_2_1_6_1","unstructured":"2020. Go language. https:\/\/golang.org\/."},{"key":"e_1_2_1_7_1","unstructured":"2020. Go Packet Library. https:\/\/godoc.org\/github.com\/google\/gopacket."},{"key":"e_1_2_1_8_1","unstructured":"2020. NIKSUN NetVCR. https:\/\/www.niksun.com\/product.php?id=110."},{"key":"e_1_2_1_9_1","unstructured":"2020. Nokia Traffica. https:\/\/www.nokia.com\/networks\/products\/traffica\/."},{"key":"e_1_2_1_10_1","unstructured":"2020. tcpdump and libpcap. https:\/\/www.tcpdump.org\/."},{"key":"e_1_2_1_11_1","unstructured":"2020. Tshark: terminal-based Wireshark. https:\/\/www.wireshark.org\/docs\/wsug_html_chunked\/AppToolstshark.html."},{"key":"e_1_2_1_12_1","unstructured":"2021. Traffic Refinery. https:\/\/github.com\/traffic-refinery\/traffic-refinery."},{"key":"e_1_2_1_13_1","volume-title":"Chimera: A Declarative Language for Streaming Network Traffic Analysis. In Presented as part of the 21st USENIX Security Symposium (USENIX Security 12)","author":"Borders Kevin","year":"2012","unstructured":"Kevin Borders, Jonathan Springer, and Matthew Burnside. 2012. Chimera: A Declarative Language for Streaming Network Traffic Analysis. In Presented as part of the 21st USENIX Security Symposium (USENIX Security 12). USENIX, Bellevue, WA, 365--379. 
https:\/\/www.usenix.org\/conference\/usenixsecurity12\/technical-sessions\/presentation\/borders"},{"key":"e_1_2_1_14_1","volume-title":"Austin Hounsel, and Paul Schmitt","author":"Borgolte Kevin","year":"2019","unstructured":"Kevin Borgolte, Tithi Chattopadhyay, Nick Feamster, Mihir Kshirsagar, Jordan Holland, Austin Hounsel, and Paul Schmitt. 2019. How DNS over HTTPS is Reshaping Privacy, Performance, and Policy in the Internet Ecosystem. Performance, and Policy in the Internet Ecosystem (July 27, 2019) (2019)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13174-018-0087-2"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2017.8258038"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366704"},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Benoit Claise Brian Trammell and Paul Aitken. 2013. Specification of the IP flow information export (IPFIX) protocol for the exchange of flow information. RFC 7011..","DOI":"10.17487\/rfc7012"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872838"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of SANE","volume":"2004","author":"Luca","unstructured":"Luca Deri et al. 2004. Improving passive packet capture: Beyond device polling. In Proceedings of SANE, Vol. 2004. Amsterdam, Netherlands, 85--93."},{"key":"e_1_2_1_22_1","volume-title":"Portable Network Graphics (PNG) Specification","author":"Duce David","unstructured":"David Duce. 2003. Portable Network Graphics (PNG) Specification (Second Edition). 
W3C Recommendation."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/633025.633056"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-13315-2_24"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230555"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3304109.3306226"},{"key":"e_1_2_1_27_1","unstructured":"Jordan Holland Paul Schmitt Nick Feamster and Prateek Mittal. 2020. nPrint: A Standard Data Representation for Network Traffic Analysis. (2020). arXiv:2008.02695 https:\/\/arxiv.org\/abs\/2008.02695"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3083187.3083193"},{"key":"e_1_2_1_29_1","volume-title":"Jun Jim Xu, and Jia Wang","author":"Kumar Abhishek","year":"2004","unstructured":"Abhishek Kumar, Minho Sung, Jun Jim Xu, and Jia Wang. 2004. Data streaming algorithms for efficient and accurate estimation of flow size distribution. In ACM SIGMETRICS Performance Evaluation Review, Vol. 32. ACM, 177--188."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934872.2934906"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNSM.2019.2924942"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3234200.3234238"},{"key":"e_1_2_1_33_1","volume-title":"DeepMAL--Deep Learning Models for Malware Traffic Detection and Classification. arXiv preprint arXiv:2003.04079","author":"Mar\u00edn Gonzalo","year":"2020","unstructured":"Gonzalo Mar\u00edn, Pedro Casas, and Germ\u00e1n Capdehourat. 2020. DeepMAL--Deep Learning Models for Malware Traffic Detection and Classification. arXiv preprint arXiv:2003.04079 (2020)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM.2018.8486321"},{"key":"e_1_2_1_35_1","volume-title":"INFOCOM, 2018 Proceedings IEEE. IEEE.","author":"Hammad Mazhar M.","year":"2018","unstructured":"M. Hammad Mazhar and Zubair Shafiq. 2018. 
Real-time Video Quality of Experience Monitoring for HTTPS and QUIC. In INFOCOM, 2018 Proceedings IEEE. IEEE."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-36480-3_11"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3098822.3098829"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/SURV.2008.080406"},{"key":"e_1_2_1_39_1","unstructured":"Angela Orebaugh Gilbert Ramirez and Jay Beale. 2006. Wireshark & Ethereal network protocol analyzer toolkit. Elsevier."},{"key":"e_1_2_1_40_1","volume-title":"Bro: a system for detecting network intruders in real-time. Computer networks 31, 23--24","author":"Paxson Vern","year":"1999","unstructured":"Vern Paxson. 1999. Bro: a system for detecting network intruders in real-time. Computer networks 31, 23--24 (1999), 2435--2463."},{"key":"e_1_2_1_41_1","volume-title":"Workshop Satin.","author":"Plonka David","year":"2011","unstructured":"David Plonka and Paul Barford. 2011. Flexible traffic and host profiling via DNS rendezvous. In Workshop Satin."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.17487\/RFC8094"},{"key":"e_1_2_1_43_1","first-page":"229","article-title":"Snort: Lightweight intrusion detection for networks","volume":"99","author":"Martin Roesch","year":"1999","unstructured":"Martin Roesch et al. 1999. Snort: Lightweight intrusion detection for networks.. In Lisa, Vol. 99. 229--238.","journal-title":"Lisa"},{"key":"e_1_2_1_44_1","unstructured":"David Sculley Gary Holt Daniel Golovin Eugene Davydov Todd Phillips Dietmar Ebner Vinay Chaudhary Michael Young Jean-Francois Crespo and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In Advances in neural information processing systems. 2503--2511."},{"key":"e_1_2_1_45_1","volume-title":"Arash Habibi Lashkari, and Ali A Ghorbani","author":"Sharafaldin Iman","year":"2018","unstructured":"Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. 2018. 
Toward generating a new intrusion detection dataset and intrusion traffic characterization.. In ICISSP. 108--116."},{"key":"e_1_2_1_46_1","first-page":"4349","article-title":"A survey on machine learning techniques for intrusion detection systems","volume":"2","author":"Singh Jayveer","year":"2013","unstructured":"Jayveer Singh and Manisha J Nene. 2013. A survey on machine learning techniques for intrusion detection systems. International Journal of Advanced Research in Computer and Communication Engineering 2, 11 (2013), 4349--4355.","journal-title":"International Journal of Advanced Research in Computer and Communication Engineering"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2017.2780250"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICOIN.2017.7899588"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230544"},{"key":"e_1_2_1_50_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Yu Da","year":"2019","unstructured":"Da Yu, Yibo Zhu, Behnaz Arzani, Rodrigo Fonseca, Tianrong Zhang, Karl Deng, and Lihua Yuan. 2019. dShark: A general, easy to program and scalable framework for analyzing in-network packet traces. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 207--220."},{"key":"e_1_2_1_51_1","volume-title":"Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). 29--42.","author":"Yu Minlan","unstructured":"Minlan Yu, Lavanya Jose, and Rui Miao. 2013. Software Defined Traffic Measurement with OpenSketch. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). 
29--42."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3098822.3098830"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2785956.2787483"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491052","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3491052","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3491052","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:25:06Z","timestamp":1750195506000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491052"}},"subtitle":["Cost-Aware Data Representation for Machine Learning on Network Traffic"],"short-title":[],"issued":{"date-parts":[[2021,12,14]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,12,14]]}},"alternative-id":["10.1145\/3491052"],"URL":"https:\/\/doi.org\/10.1145\/3491052","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12,14]]},"assertion":[{"value":"2021-12-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}