{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T21:53:13Z","timestamp":1776376393535,"version":"3.51.2"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"CoNEXT1","license":[{"start":{"date-parts":[[2023,6,30]],"date-time":"2023-06-30T00:00:00Z","timestamp":1688083200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Netw."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>Inferring the root cause of failures among thousands of components in a data center network is challenging, especially for \"gray\" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and system that achieves both high accuracy and speed at datacenter scale. Flock uses a probabilistic graphical model (PGM) to achieve high accuracy, coupled with new techniques to dramatically accelerate inference in discrete-valued Bayesian PGMs. Large-scale simulations and experiments in a hardware testbed show Flock speeds up inference by &gt;10000x compared to past PGM methods, and improves accuracy over the best previous datacenter fault localization approaches, reducing inference error by 1.19-11x on the same input telemetry, and by 1.2-55x after incorporating passive telemetry. We also prove Flock's inference is optimal in restricted settings.<\/jats:p>","DOI":"10.1145\/3595289","type":"journal-article","created":{"date-parts":[[2023,7,5]],"date-time":"2023-07-05T10:53:26Z","timestamp":1688554406000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Flock: Accurate Network Fault Localization at Scale"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-4767-945X","authenticated-orcid":false,"given":"Vipul","family":"Harsh","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1039-0056","authenticated-orcid":false,"given":"Tong","family":"Meng","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6407-5434","authenticated-orcid":false,"given":"Kapil","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Lawrence Berkeley National Lab, Champaign, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2930-1982","authenticated-orcid":false,"given":"Philip Brighten","family":"Godfrey","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign &amp; VMware, Urbana, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,7,5]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2016. NetNORAD: Troubleshooting networks via end-to-end probing. https:\/\/code.fb.com\/networking-traffic\/netnorad-troubleshooting-networks-via-end-to-end-probing\/."},{"key":"e_1_2_1_2_1","unstructured":"2020. In-band Network Telemetry (INT) Dataplane Specification. https:\/\/github.com\/p4lang\/p4-applications\/blob\/master\/docs\/INT_latest.pdf."},{"key":"e_1_2_1_3_1","unstructured":"Github. Flock code. https:\/\/github.com\/netarch\/FaultLocalization."},{"key":"e_1_2_1_4_1","unstructured":"Github. PF_RING by ntop software."},{"key":"e_1_2_1_5_1","unstructured":"URL. Manage engine traffic analyzer. https:\/\/www.manageengine.com\/products\/netflow\/."},{"key":"e_1_2_1_6_1","unstructured":"URL. Solarwinds traffic analyzer. https:\/\/www.solarwinds.com\/netflow-traffic-analyzer."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1402958.1402967"},{"key":"e_1_2_1_8_1","volume-title":"Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Alizadeh Mohammad","unstructured":"Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. 2012. Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, 253--266. https:\/\/www.usenix.org\/conference\/nsdi12\/technical-sessions\/presentation\/alizadeh"},{"key":"e_1_2_1_9_1","unstructured":"Andrew Lerner. 2015. Inclusion Criteria for the 2016 NPMD Magic Quadrant. https:\/\/blogs.gartner.com\/andrew-lerner\/2015\/06\/29\/gotnpmd\/."},{"key":"e_1_2_1_10_1","unstructured":"Arista. Accessed 2021-01--27. Arista Network Telemetry. https:\/\/www.arista.com\/en\/solutions\/software-defined-network-telemetry."},{"key":"e_1_2_1_11_1","volume-title":"15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)","author":"Arzani Behnaz","year":"2018","unstructured":"Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang (Harry) Liu, Jitu Padhye, Boon Thau Loo, and Geoff Outhred. 2018. 007: Democratically Finding the Cause of Packet Drops. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 419--435. https:\/\/www.usenix.org\/conference\/nsdi18\/presentation\/arzani"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1282380.1282383"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405894"},{"key":"e_1_2_1_14_1","unstructured":"British Telecommunications. 2018. Contract for BT Managed WAN Services. https:\/\/business.bt.com\/content\/dam\/terms\/it-solutions-support\/bt1190.pdf."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"J. Case M. Fedor M. Schoffstall and J. Davin. 1990. A Simple Network Management Protocol (SNMP). In RFC 1157. Internet Engineering Task Force. https:\/\/datatracker.ietf.org\/doc\/rfc1157\/","DOI":"10.17487\/rfc1157"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1015467.1015475"},{"key":"e_1_2_1_17_1","unstructured":"Cisco. 2018. Monitoring and Troubleshooting With Cisco Prime LAN Management Solution 4.1. https:\/\/www.cisco.com\/c\/en\/us\/td\/docs\/net_mgmt\/ciscoworks_lan_management_solution\/4--1\/user\/guide\/monitoring_troubleshooting\/mnt_ug\/SNMPInfo.html."},{"key":"e_1_2_1_18_1","volume-title":"Cisco Bug: CSCvn56156 - Silent packet drops may occur on FXOS platforms due to classifier table entry corruption. https:\/\/quickview.cloudapps.cisco.com\/quickview\/bug\/CSCvn56156.","year":"2020","unstructured":"Cisco. 2020. Cisco Bug: CSCvn56156 - Silent packet drops may occur on FXOS platforms due to classifier table entry corruption. https:\/\/quickview.cloudapps.cisco.com\/quickview\/bug\/CSCvn56156."},{"key":"e_1_2_1_19_1","unstructured":"Cisco. 2020. Configure Link Flap Prevention on a Cisco Business Switch using CLI. https:\/\/www.cisco.com\/c\/en\/us\/support\/docs\/smb\/switches\/Cisco-Business-Switching\/kmgmt-2249-configure-the-link-flap-prevention-settings-on-a-switch-thro.html."},{"key":"e_1_2_1_20_1","volume-title":"Internet Engineering Task Force","author":"Claise Benoit","year":"2004","unstructured":"Benoit Claise. 2004. Cisco Systems NetFlow services export version 9. RFC 3954 (Internet Standard), Internet Engineering Task Force (2004)."},{"key":"e_1_2_1_21_1","volume-title":"Internet Engineering Task Force","author":"Claise Benoit","year":"2013","unstructured":"Benoit Claise, Brian Trammell, and Paul Aitken. 2013. Specification of the IP flow information export (IPFIX) protocol for the exchange of flow information. RFC 7011 (Internet Standard), Internet Engineering Task Force (2013)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1364654.1364677"},{"key":"e_1_2_1_23_1","unstructured":"Divya Rao. 2020. Hot off the press: Introducing OpenConfig Telemetry on NX-OS with gNMI and Telegraf! https:\/\/www.cisco.com\/c\/en\/us\/td\/docs\/net_mgmt\/ciscoworks_lan_management_solution\/4--1\/user\/guide\/monitoring_troubleshooting\/mnt_ug\/SNMPInfo.html."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/90.251892"},{"key":"e_1_2_1_25_1","volume-title":"12th {USENIX} symposium on networked systems design and implementation ({NSDI} 15). 469--483.","author":"Fogel Ari","unstructured":"Ari Fogel, Stanley Fung, Luis Pedrosa, Meg Walraed-Sullivan, Ramesh Govindan, Ratul Mahajan, and Todd Millstein. 2015. A general approach to network configuration analysis. In 12th {USENIX} symposium on networked systems design and implementation ({NSDI} 15). 469--483."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446700"},{"key":"e_1_2_1_27_1","volume-title":"16th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19). 549--564.","author":"Geng Yilong","unstructured":"Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. 2019. {SIMON}: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks. In 16th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19). 549--564."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFCOM.2010.5461918"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2785956.2787496"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230555"},{"key":"e_1_2_1_31_1","unstructured":"Matt Hamblen. 2019. Programmable chips for data center switches catch fire with 20% annual growth. https:\/\/www.fierceelectronics.com\/electronics\/programmable-chips-for-data-center-switches-catch-fire-20-annual-growth."},{"key":"e_1_2_1_32_1","volume-title":"Flock: Accurate network fault localization at scale. https:\/\/arxiv.org\/pdf\/2305.03348.pdf.","author":"Harsh Vipul","year":"2023","unstructured":"Vipul Harsh, Tong Meng, Kapil Agrawal, and P. Brighten Godfrey. 2023. Flock: Accurate network fault localization at scale. https:\/\/arxiv.org\/pdf\/2305.03348.pdf."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2623330.2623365"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3103005"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405877"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405877"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1080173.1080178"},{"key":"e_1_2_1_38_1","volume-title":"9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12). 113--126.","author":"Kazemian Peyman","unstructured":"Peyman Kazemian, George Varghese, and Nick McKeown. 2012. Header space analysis: Static checking for networks. In 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12). 113--126."},{"key":"e_1_2_1_39_1","volume-title":"Veriflow: Verifying network-wide invariants in real time. In 10th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 13). 15--27.","author":"Khurshid Ahmed","year":"2013","unstructured":"Ahmed Khurshid, Xuan Zou, Wenxuan Zhou, Matthew Caesar, and P Brighten Godfrey. 2013. Veriflow: Verifying network-wide invariants in real time. In 10th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 13). 15--27."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 2Nd Conference on Symposium on Networked Systems Design & Implementation -","volume":"2","author":"Kompella Ramana Rao","unstructured":"Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C. Snoeren. 2005. IP Fault Localization via Risk Modeling. In Proceedings of the 2Nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2 (NSDI'05). USENIX Association, Berkeley, CA, USA, 57--70. http:\/\/dl.acm.org\/citation.cfm?id=1251203.1251208"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFCOM.2007.252"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2663716.2663723"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043164.2018470"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544216.3544242"},{"key":"e_1_2_1_45_1","volume-title":"Unified Fault Localization for Networked Systems. In 2014 USENIX Annual Technical Conference (USENIX ATC 14)","author":"Mysore Radhika Niranjan","year":"2014","unstructured":"Radhika Niranjan Mysore, Ratul Mahajan, Amin Vahdat, and George Varghese. 2014. Gestalt: Fast, Unified Fault Localization for Networked Systems. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). USENIX Association, Philadelphia, PA, 255--267. https:\/\/www.usenix.org\/conference\/atc14\/technical-sessions\/presentation\/mysore"},{"key":"e_1_2_1_46_1","unstructured":"P4.org Applications Working Group. 2020. In-band Network Telemetry (INT) Dataplane Specification Version 2.1. https:\/\/github.com\/p4lang\/p4-applications\/blob\/master\/docs\/INT_v2_1.pdf."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/637201.637214"},{"key":"e_1_2_1_48_1","unstructured":"Palo Alto Networks. 2020. Critical Issues Addressed in PAN-OS Releases. https:\/\/knowledgebase.paloaltonetworks.com\/KCSArticleDetail?id=kA10g000000Cm68CAC."},{"key":"e_1_2_1_49_1","volume-title":"2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"Peng Yanghua","year":"2017","unstructured":"Yanghua Peng, Ji Yang, Chuan Wu, Chuanxiong Guo, Chengchen Hu, and Zongpeng Li. 2017. deTector: a Topology-aware Monitoring System for Data Center Networks. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 55--68. https:\/\/www.usenix.org\/conference\/atc17\/technical-sessions\/presentation\/peng"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3278532.3278572"},{"key":"e_1_2_1_51_1","volume-title":"Passive Realtime Datacenter Fault Detection and Localization. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Roy Arjun","unstructured":"Arjun Roy, Hongyi Zeng, Jasmeet Bagga, and Alex C. Snoeren. 2017. Passive Realtime Datacenter Fault Detection and Localization. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 595--612. https:\/\/www.usenix.org\/conference\/nsdi17\/technical-sessions\/presentation\/roy"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/357401.357402"},{"key":"e_1_2_1_53_1","unstructured":"SolarWinds. Accessed 2021-01--24. Configure polling statistics intervals in the Orion Platform. https:\/\/documentation.solarwinds.com\/en\/Success_Center\/orionplatform\/content\/core-polling-statistics-intervals-sw1829.htm."},{"key":"e_1_2_1_54_1","volume-title":"Simplifying Datacenter Network Debugging with PathDump. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Tammana Praveen","year":"2016","unstructured":"Praveen Tammana, Rachit Agarwal, and Myungjin Lee. 2016. Simplifying Datacenter Network Debugging with PathDump. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 233--248. https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/tammana"},{"key":"e_1_2_1_55_1","volume-title":"NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Tan Cheng","year":"2019","unstructured":"Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 599--614. https:\/\/www.usenix.org\/conference\/nsdi19\/presentation\/tan"},{"key":"e_1_2_1_56_1","unstructured":"VMware. 2017. Possible data corruption after a Windows 2012 virtual machine network transfer. https:\/\/kb.vmware.com\/s\/article\/2058692."},{"key":"e_1_2_1_57_1","unstructured":"VMware. 2021. Network timeouts or packet drops with VMware Tools 11.x with Guest Introspection Driver on ESXi 6.5\/6.7. https:\/\/kb.vmware.com\/s\/article\/79185."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.5555\/1972457.1972464"},{"key":"e_1_2_1_59_1","unstructured":"Hongyi Zeng Ratul Mahajan Nick McKeown George Varghese Lihua Yuan and Ming Zhang. 2015. Measuring and Troubleshooting Large Operational Multipath Networks with Gray Box Testing. Technical Report MSR-TR-2015--55. https:\/\/www.microsoft.com\/en-us\/research\/publication\/measuring-and-troubleshooting-large-operational-multipath-networks-with-gray-box-testing\/"},{"key":"e_1_2_1_60_1","volume-title":"Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)","author":"Zhang Qiao","year":"2018","unstructured":"Qiao Zhang, Guo Yu, Chuanxiong Guo, Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, Murali Chintalapati, Arvind Krishnamurthy, and Thomas Anderson. 2018. Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 519--532. https:\/\/www.usenix.org\/conference\/nsdi18\/presentation\/zhang-qiao"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/1159913.1159939"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2592798.2592803"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3406214"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2785956.2787483"}],"container-title":["Proceedings of the ACM on Networking"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595289","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3595289","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T21:07:14Z","timestamp":1776373634000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595289"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,30]]},"references-count":64,"journal-issue":{"issue":"CoNEXT1","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3595289"],"URL":"https:\/\/doi.org\/10.1145\/3595289","relation":{},"ISSN":["2834-5509"],"issn-type":[{"value":"2834-5509","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,30]]},"assertion":[{"value":"2023-07-05","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}