{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,26]],"date-time":"2026-05-26T23:05:50Z","timestamp":1779836750375,"version":"3.53.1"},"reference-count":54,"publisher":"Cambridge University Press (CUP)","license":[{"start":{"date-parts":[[2016,3,17]],"date-time":"2016-03-17T00:00:00Z","timestamp":1458172800000},"content-version":"unspecified","delay-in-days":76,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Funct. Prog."],"published-print":{"date-parts":[[2016]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable\n                    <jats:italic>transparent<\/jats:italic>\n                    fault tolerant functional computation with\n                    <jats:italic>automatic<\/jats:italic>\n                    supervision and recovery of tasks. We do so by developing\n                    <jats:italic>HdpH-RS<\/jats:italic>\n                    , a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork\/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and an High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well e.g. up to 1,400 cores on the High Performance Computing; reliability and recovery overheads are consistently low even at scale.\n                  <\/jats:p>","DOI":"10.1017\/s095679681600006x","type":"journal-article","created":{"date-parts":[[2016,3,17]],"date-time":"2016-03-17T10:08:15Z","timestamp":1458209295000},"source":"Crossref","is-referenced-by-count":6,"title":["Transparent fault tolerance for scalable functional computation"],"prefix":"10.1017","volume":"26","author":[{"given":"ROBERT","family":"STEWART","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"PATRICK","family":"MAIER","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"PHIL","family":"TRINDER","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"56","published-online":{"date-parts":[[2016,3,17]]},"reference":[{"key":"S095679681600006X_ref28","doi-asserted-by":"publisher","DOI":"10.1017\/S0956796805005526"},{"key":"S095679681600006X_ref51","unstructured":"Stewart R. , Maier P. & Trinder P. (2015 June) Open access dataset for \u201cTransparent Fault Tolerance for Scalable Functional Computation\u201d. http:\/\/dx.doi.org\/10.5525\/gla.researchdata.189."},{"key":"S095679681600006X_ref35","doi-asserted-by":"crossref","unstructured":"Marlow S. , Newton R. & Jones S. L. P. (2011) A monad for deterministic parallelism. In Proceedings of the 4th ACM SIGPLAN Symposium on Haskell, Haskell 2011, Tokyo, Japan, 22 September 2011, pp. 71\u201382.","DOI":"10.1145\/2034675.2034685"},{"key":"S095679681600006X_ref9","volume-title":"Detection of Half-Open (Dropped) Connections","author":"Cleary","year":"2009"},{"key":"S095679681600006X_ref17","doi-asserted-by":"publisher","DOI":"10.1145\/1400097.1400105"},{"key":"S095679681600006X_ref10","unstructured":"Cole M. I. (1988) Algorithmic Skeletons: A Structured Approach to the Management of Parallel Computation. PhD Thesis, Computer Science Department, University of Edinburgh."},{"key":"S095679681600006X_ref27","doi-asserted-by":"crossref","unstructured":"Litvinova A. , Engelmann C. & Scott S. L. 2010 (February 16\u201318) A Proactive fault tolerance framework for high-performance computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria.","DOI":"10.2316\/P.2010.676-024"},{"key":"S095679681600006X_ref20","doi-asserted-by":"crossref","unstructured":"Harris T. , Marlow S. & Jones S. L. P. (2005) Haskell on a shared-memory multiprocessor. In Proceedings of the ACM SIGPLAN Workshop on Haskell, Haskell 2005, Tallinn, Estonia, September 30, 2005, pp. 49\u201361.","DOI":"10.1145\/1088348.1088354"},{"key":"S095679681600006X_ref46","first-page":"012022","article-title":"Understanding failures in Petascale computers","volume":"78","author":"Schroeder","year":"2007","journal-title":"J. Phys.: Conf. Ser."},{"key":"S095679681600006X_ref50","unstructured":"Stewart R. & Maier P. (2013) HdpH-RS source code. https:\/\/github.com\/robstewart57\/hdph-rs."},{"key":"S095679681600006X_ref49","unstructured":"Stewart R. (2013b November) Reliable Massively Parallel Symbolic Computing: Fault Tolerance for a Distributed Haskell. PhD Thesis, Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, Scotland."},{"key":"S095679681600006X_ref4","volume-title":"Distributed Mandelbrot Calculations","author":"Boije","year":"2009"},{"key":"S095679681600006X_ref2","doi-asserted-by":"publisher","DOI":"10.1145\/1810891.1810910"},{"key":"S095679681600006X_ref47","doi-asserted-by":"publisher","DOI":"10.21236\/ADA512459"},{"key":"S095679681600006X_ref24","doi-asserted-by":"crossref","unstructured":"John A. , Konnov I. , Schmid U. , Veith H. & Widder J. (2013) Towards modeling and model checking fault-tolerant distributed algorithms. In Model Checking Software \u2013 Proceedings of 20th International Symposium, SPIN 2013, Stony Brook, NY, July, 2013, pp. 209\u2013226.","DOI":"10.1007\/978-3-642-39176-7_14"},{"key":"S095679681600006X_ref32","unstructured":"Maier P. & Trinder P. (2012) Implementing a high-level distributed-memory parallel Haskell in Haskell. In Implementation and Application of Functional Languages, 23rd International Symposium 2011, Lawrence, KS, USA, October 3-5, 2011. Revised Selected Papers. Lecture Notes in Computer Science, vol. 7257. Springer, pp. 35\u201350."},{"key":"S095679681600006X_ref1","doi-asserted-by":"crossref","unstructured":"Aljabri M. , Loidl H.-W. & Trinder P W. (2014) The design and implementation of GUMSMP: A multilevel parallel Haskell implementation. In Proceedings of Implementation and Application of Functional Languages (IFL'13). New York, NY: ACM, pp. 37\u201348.","DOI":"10.1145\/2620678.2620682"},{"key":"S095679681600006X_ref36","unstructured":"Mattsson H. , Nilsson H. & Wikstrm C. (1999) Mnesia - a distributed robust DBMS for telecommunications applications. In Proceedings of PADL, San Antonio, Texas, USA, pp. 152\u2013163."},{"key":"S095679681600006X_ref48","unstructured":"Stewart R. (2013a December) Promela Abstraction of the HdpH-RS Scheduler. https:\/\/github.com\/robstewart57\/phd-thesis\/blob\/master\/spin_model\/hdph_scheduler.pml."},{"key":"S095679681600006X_ref43","doi-asserted-by":"crossref","unstructured":"Ramalingam G. & Vaswani K. (2013) Fault tolerance via idempotence. In Proceedings of POPL, Rome, Italy, pp. 249\u2013262.","DOI":"10.1145\/2429069.2429100"},{"key":"S095679681600006X_ref54","volume-title":"Load Balancing in Parallel Computers: Theory and Practice","author":"Xu","year":"1997"},{"key":"S095679681600006X_ref6","first-page":"212","article-title":"Fault tolerance in Petascale\/Exascale systems: Current knowledge, challenges and research opportunities","volume":"23","author":"Cappello","year":"2009","journal-title":"IJHPCA"},{"key":"S095679681600006X_ref42","volume-title":"Time and Modality","author":"Prior","year":"1957"},{"key":"S095679681600006X_ref30","first-page":"19","article-title":"Reliable scalable symbolic computation: The design of SymGridPar2","volume":"40","author":"Maier","year":"2014","journal-title":"Comput. Lang. Syst. Struct."},{"key":"S095679681600006X_ref3","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-01741-4","volume-title":"The Datacenter as a Computer","author":"Barroso","year":"2013"},{"key":"S095679681600006X_ref7","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"S095679681600006X_ref14","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"S095679681600006X_ref22","unstructured":"Hoff T. (2010 December) Netflix: Continually Test by Failing Servers with Chaos Monkey. http:\/\/highscalability.com."},{"key":"S095679681600006X_ref37","first-page":"268","article-title":"Exploring Beowulf clusters","volume":"18","author":"Meredith","year":"2003","journal-title":"J. Comput. Sci. Colleges"},{"key":"S095679681600006X_ref11","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"S095679681600006X_ref16","volume-title":"Akka Essentials","author":"Gupta","year":"2012"},{"key":"S095679681600006X_ref13","unstructured":"Edinburgh Parallel Computing Center (EPCC). (2008) HECToR National UK Super Computing Resource, Edinburgh. https:\/\/www.hector.ac.uk."},{"key":"S095679681600006X_ref25","doi-asserted-by":"crossref","unstructured":"Kuper L. , Turon A. , Krishnaswami N. R. & Newton R. R. (2014) Freeze after writing: Quasi-deterministic parallel programming with LVars and handlers. In Proceedings of POPL 2014, San Diego, ACM, pp. 257\u2013270.","DOI":"10.1145\/2535838.2535842"},{"key":"S095679681600006X_ref26","doi-asserted-by":"publisher","DOI":"10.1145\/359545.359563"},{"key":"S095679681600006X_ref39","first-page":"47","volume-title":"Engineering Theories of Software Construction","author":"Peyton","year":"2002"},{"key":"S095679681600006X_ref53","volume-title":"Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale","author":"White","year":"2012"},{"key":"S095679681600006X_ref8","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2016.01.002"},{"key":"S095679681600006X_ref21","unstructured":"Herington D. (2006\u20132013) Haskell Library: hunit Package. A Unit Testing Framework for Haskell. http:\/\/hackage.haskell.org\/package\/HUnit."},{"key":"S095679681600006X_ref38","doi-asserted-by":"publisher","DOI":"10.1038\/218019a0"},{"key":"S095679681600006X_ref23","volume-title":"The SPIN Model Checker - Primer and Reference Manual","author":"Holzmann","year":"2004"},{"key":"S095679681600006X_ref5","doi-asserted-by":"publisher","DOI":"10.1090\/S0025-5718-08-02036-X"},{"key":"S095679681600006X_ref45","doi-asserted-by":"publisher","DOI":"10.1017\/S0956796802004458"},{"key":"S095679681600006X_ref15","doi-asserted-by":"crossref","unstructured":"Epstein J. , Black A. P. & Jones S. L. P. (2011) Towards Haskell in the cloud. In Proceedings of the 4th ACM SIGPLAN Symposium on Haskell, Haskell 2011, Tokyo, Japan, 22 September 2011, pp. 118\u2013129.","DOI":"10.1145\/2034675.2034690"},{"key":"S095679681600006X_ref12","unstructured":"Dinu F. & Ng T. S. E. (2011) Hadoop's overload tolerant design exacerbates failure detection and recovery. In Proceedings of 6th International Workshop on Networking Meets Databases, NETDB 2011. Athens, Greece. June."},{"key":"S095679681600006X_ref33","doi-asserted-by":"crossref","unstructured":"Marlow S. , Jones S. L. P. & Singh S. (2009) Runtime support for multicore Haskell. In Proceedings of ICFP, Edinburgh, Scotland, pp. 65\u201378.","DOI":"10.1145\/1596550.1596563"},{"key":"S095679681600006X_ref41","doi-asserted-by":"crossref","unstructured":"Postel J. 1980 (August) User Datagram Protocol. RFC 768 Standard. http:\/\/www.ietf.org\/rfc\/rfc768.txt.","DOI":"10.17487\/rfc0768"},{"key":"S095679681600006X_ref29","unstructured":"Maier P. , Livesey D. , Loidl H.-W. & Trinder P. (2014a) High-performance computer algebra: A Hecke algebra case study. In Proceedings of Euro-par 2014 Parallel Processing - 20th International Conference, Porto, Portugal, August 25\u201329, 2014, Silva F. M. A. , de Castro Dutra I. & Costa V. S. (eds), Lecture Notes in Computer Science, vol. 8632. Springer, pp. 19\u201335."},{"key":"S095679681600006X_ref19","doi-asserted-by":"crossref","unstructured":"Hammond K. , Zain A. Al Cooperman, G. , Petcu D. & Trinder P. W. (2007) SymGrid: A framework for symbolic computation on the grid. In Proceedings of 13th International Euro-Par Conference, Rennes, France, August 28\u201331, 2007, pp. 457\u2013466.","DOI":"10.1109\/SYNASC.2007.86"},{"key":"S095679681600006X_ref31","doi-asserted-by":"crossref","unstructured":"Maier P. , Stewart R. J. & Trinder P. (2014c) The HdpH DSLs for scalable reliable computation. In Proceedings of the 2014 ACM SIGPLAN Symposium on Haskell, Gothenburg, Sweden, September 4\u20135, 2014. ACM, pp. 65\u201376.","DOI":"10.1145\/2775050.2633363"},{"key":"S095679681600006X_ref52","doi-asserted-by":"crossref","unstructured":"Trinder P. W. , Hammond K. Jr. , J. S. Mattson , Partridge A. S. & Jones S. L. P. (1996) GUM: A portable parallel implementation of Haskell. In Proceedings of ACM Programming Language Design and Implementation (PLDI'96), Philadephia, Pennsylvania, May, pp. 79\u201388.","DOI":"10.1145\/231379.231392"},{"key":"S095679681600006X_ref18","doi-asserted-by":"publisher","DOI":"10.1145\/4472.4478"},{"key":"S095679681600006X_ref34","unstructured":"Marlow S. & Newton R. (2013) Source code for monad-par library. https:\/\/github.com\/simonmar\/monad-par."},{"key":"S095679681600006X_ref40","doi-asserted-by":"crossref","unstructured":"Pnueli A. (1977) The temporal logic of programs. In Proceedings of 18th Annual Symposium on Foundations of Computer Science, Providence, Rhode Island, 31 October\u20131 November 1977. IEEE Computer Society, pp. 46\u201357.","DOI":"10.1109\/SFCS.1977.32"},{"key":"S095679681600006X_ref44","doi-asserted-by":"publisher","DOI":"10.2307\/2974691"}],"container-title":["Journal of Functional Programming"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S095679681600006X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,26]],"date-time":"2026-05-26T22:36:38Z","timestamp":1779834998000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S095679681600006X\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016]]},"references-count":54,"alternative-id":["S095679681600006X"],"URL":"https:\/\/doi.org\/10.1017\/s095679681600006x","relation":{},"ISSN":["0956-7968","1469-7653"],"issn-type":[{"value":"0956-7968","type":"print"},{"value":"1469-7653","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016]]},"article-number":"e5"}}