{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T17:18:29Z","timestamp":1775668709839,"version":"3.50.1"},"reference-count":18,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2014,8]]},"abstract":"<jats:p>Scalability and fault-tolerance are two fundamental challenges for all distributed computing at Internet scale. Despite many recent advances from both academia and industry, these two problems are still far from settled. In this paper, we present Fuxi, a resource management and job scheduling system that is capable of handling the kind of workload at Alibaba where hundreds of terabytes of data are generated and analyzed every day to help optimize the company's business operations and user experiences. We employ several novel techniques to enable Fuxi to perform efficient scheduling of hundreds of thousands of concurrent tasks over large clusters with thousands of nodes: 1) an incremental resource management protocol that supports multi-dimensional resource allocation and data locality; 2) user-transparent failure recovery where the failure of any Fuxi component does not impact the execution of user jobs; and 3) an effective detection mechanism and a multi-level blacklisting scheme that prevents faulty nodes from affecting job execution. Our evaluation results demonstrate that scheduled CPU and memory utilization of 95% and 91%, respectively, can be achieved under synthetic workloads, and Fuxi is capable of achieving 2.36 TB\/minute throughput in GraySort. Additionally, the same Fuxi job experiences only an approximately 16% slowdown under a 5% fault-injection rate. The slowdown grows only to 20% when we double the fault-injection rate to 10%. 
Fuxi has been deployed in our production environment since 2009, and it now manages hundreds of thousands of server nodes.<\/jats:p>","DOI":"10.14778\/2733004.2733012","type":"journal-article","created":{"date-parts":[[2015,5,12]],"date-time":"2015-05-12T15:37:52Z","timestamp":1431445072000},"page":"1393-1404","source":"Crossref","is-referenced-by-count":133,"title":["Fuxi"],"prefix":"10.14778","volume":"7","author":[{"given":"Zhuo","family":"Zhang","sequence":"first","affiliation":[{"name":"Alibaba Cloud Computing Inc."}]},{"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba Cloud Computing Inc."}]},{"given":"Yangyu","family":"Tao","sequence":"additional","affiliation":[{"name":"Alibaba Cloud Computing Inc."}]},{"given":"Renyu","family":"Yang","sequence":"additional","affiliation":[{"name":"Beihang University and Alibaba Cloud Computing Inc."}]},{"given":"Hong","family":"Tang","sequence":"additional","affiliation":[{"name":"Alibaba Cloud Computing Inc."}]},{"given":"Jie","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Leeds"}]}],"member":"320","published-online":{"date-parts":[[2014,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Alibaba Cloud Computing. http:\/\/www.aliyun.com\/."},{"key":"e_1_2_1_2_1","unstructured":"Apache Tez. http:\/\/hortonworks.com\/hadoop\/tez\/."},{"key":"e_1_2_1_3_1","unstructured":"Cgroup. http:\/\/en.wikipedia.org\/wiki\/Cgroups."},{"key":"e_1_2_1_4_1","unstructured":"Fuxi. http:\/\/en.wikipedia.org\/wiki\/Fu_Xi."},{"key":"e_1_2_1_5_1","unstructured":"Google Petabyte Result. 
http:\/\/www.datacenterknowledge.com\/archives\/2008\/11\/24\/google-sorts-1-petabyte-of-data-in-6-hours\/."},{"key":"e_1_2_1_6_1","unstructured":"ODPS: Open Data Processing Service. http:\/\/www.aliyun.com\/product\/odps\/."},{"key":"e_1_2_1_7_1","unstructured":"Sort Benchmark. http:\/\/sortbenchmark.org\/."},{"key":"e_1_2_1_8_1","volume-title":"https:\/\/developer.yahoo.com\/blogs\/hadoop\/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-180650399.html","year":"2013","unstructured":"Hadoop at Yahoo! Sets New Gray Sort Record. https:\/\/developer.yahoo.com\/blogs\/hadoop\/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-180650399.html, 2013."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.2"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2408776.2408794"},{"key":"e_1_2_1_11_1","volume-title":"Proc. NSDI. Usenix","author":"Hindman B.","year":"2011","unstructured":"B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proc. NSDI. Usenix, 2011."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1272996.1273005"},{"key":"e_1_2_1_13_1","first-page":"10","article-title":"Big data: The management revolution","author":"McAfee A.","year":"2012","unstructured":"A. McAfee and B. Erik. 
Big data: The management revolution. Harvard Business Review, 10 2012.","journal-title":"Harvard Business Review"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522716"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1009382.1009793"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465351.2465386"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.v17:2\/4"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2733004.2733012","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:39:32Z","timestamp":1672220372000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2733004.2733012"}},"subtitle":["a fault-tolerant resource management and job scheduling system at internet scale"],"short-title":[],"issued":{"date-parts":[[2014,8]]},"references-count":18,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2014,8]]}},"alternative-id":["10.14778\/2733004.2733012"],"URL":"https:\/\/doi.org\/10.14778\/2733004.2733012","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2014,8]]}}}