{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,1,12]],"date-time":"2025-01-12T00:10:24Z","timestamp":1736640624976,"version":"3.32.0"},"reference-count":22,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2006,10,24]],"date-time":"2006-10-24T00:00:00Z","timestamp":1161648000000},"content-version":"vor","delay-in-days":5379,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency: Pract. Exper."],"published-print":{"date-parts":[[1992,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high\u2010performance programs resilient to processor crashes is often too high, given the low probability of such events. If parallel systems become more large\u2010scaled, however, processor failures will become likely, so they should be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault\u2010tolerant transparently. It can log messages, make checkpoints, and so on. Second, the programmer can write explicit code for handling failures in an application\u2010specific way. The latter approach is potentially more efficient, but also requires more work from the programmer. In this paper, we intend to get some initial insight into how hard and efficient explicit fault\u2010tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language supporting parallelism as well as fault tolerance. Our experiences indicate that the extra effort needed for fault tolerance varies much between different applications. Also, trade\u2010offs can frequently be made between programming effort and efficiency. 
One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. As another result, the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be highly useful.<\/jats:p>","DOI":"10.1002\/cpe.4330040104","type":"journal-article","created":{"date-parts":[[2006,11,17]],"date-time":"2006-11-17T16:22:00Z","timestamp":1163780520000},"page":"37-55","source":"Crossref","is-referenced-by-count":6,"title":["Fault\u2010tolerant parallel programming in Argus"],"prefix":"10.1002","volume":"4","author":[{"given":"Henri E.","family":"Bal","sequence":"first","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2006,10,24]]},"reference":[{"key":"e_1_2_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/800217.806617"},{"key":"e_1_2_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3959.3962"},{"key":"e_1_2_1_4_2","doi-asserted-by":"crossref","unstructured":"A. P. Sistla and J. L. Welch \u2018Efficient distributed recovery using message logging\u2019 Proceedings 8th ACM Symposium Principles of Distributed Computing Edmonton Alberta August 1989 pp. 223\u2013238.","DOI":"10.1145\/72981.72997"},{"key":"e_1_2_1_5_2","doi-asserted-by":"crossref","unstructured":"K. Li, J. F. Naughton and J. 
S. Plank \u2018Real\u2010Time Concurrent Checkpoint for Parallel Programs\u2019 Proceedings 2nd Symposium on Principles and Practice of Parallel Programming Seattle Washington March 1990 pp. 79\u201388.","DOI":"10.1145\/99163.99173"},{"key":"e_1_2_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/2166.357215"},{"key":"e_1_2_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/42392.42399"},{"key":"e_1_2_1_8_2","doi-asserted-by":"crossref","unstructured":"B. Liskov, D. Curtis, P. Johnson and R. Scheifler \u2018Implementation of Argus\u2019 Proceedings 11th Symposium Operating Systems Principles Austin TX ACM SIGOPS (Nov. 1987) pp. 111\u2013122.","DOI":"10.1145\/37499.37514"},{"volume-title":"Programming Distributed Systems","year":"1990","author":"Bal H. E.","key":"e_1_2_1_9_2"},{"key":"e_1_2_1_10_2","doi-asserted-by":"crossref","unstructured":"H. E. Bal and A. S. Tanenbaum \u2018Distributed programming with shared data\u2019 Proceedings IEEE CS 1988 International Conference on Computer Languages Miami FL Oct. 1988 pp. 82\u201391.","DOI":"10.1109\/ICCL.1988.13046"},{"key":"e_1_2_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/72551.72552"},{"key":"e_1_2_1_12_2","unstructured":"H. E. Bal, M. F. Kaashoek and A. S. Tanenbaum \u2018A distributed implementation of the shared data\u2010object model\u2019 USENIX Workshop on Experiences with Building Distributed and Multiprocessor Systems Ft. Lauderdale FL Oct. 1989 pp. 1\u201319."},{"key":"e_1_2_1_13_2","doi-asserted-by":"crossref","unstructured":"H. E. Bal, M. F. Kaashoek and A. S. Tanenbaum \u2018Experience with distributed programming in Orca\u2019 IEEE CS International Conference on Computer Languages New Orleans Louisiana March 1990 pp. 79\u201389.","DOI":"10.1109\/ICCL.1990.63763"},{"key":"e_1_2_1_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-10571-9_11"},{"key":"e_1_2_1_15_2","unstructured":"J. E. B. Moss \u2018Nested transactions: an approach to reliable distributed computing\u2019 Report TR\u2010260 (Ph.D. dissertation) M.I.T. 
Cambridge MA 1981."},{"key":"e_1_2_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/359763.359789"},{"key":"e_1_2_1_17_2","doi-asserted-by":"crossref","unstructured":"I. Greif, R. Seliger and W. Weihl \u2018Atomic data abstractions in a distributed collaborative editing system\u2019 Proceedings 13th ACM Symposium Principles Programming Languages St. Petersburg FL pp. 160\u2013172 (Jan. 1986).","DOI":"10.1145\/512644.512659"},{"key":"e_1_2_1_18_2","doi-asserted-by":"crossref","unstructured":"M. S. Day \u2018Replication and reconfiguration in a distributed mail repository\u2019 Report TR\u2010376 M.I.T. Cambridge MA April 1987.","DOI":"10.1145\/503956.503970"},{"key":"e_1_2_1_19_2","unstructured":"G. T. Leavens \u2018The Hailstone System\u2019 DSG Note 148 M.I.T. Laboratory for Computer Science Cambridge MA March 1987."},{"key":"e_1_2_1_20_2","unstructured":"J.\u2010F. Jenq and S. Sahni \u2018All pairs shortest paths on a hypercube multiprocessor\u2019 Proceedings 1987 International Conference Parallel Processing St. 
Charles IL Aug. 1987 pp. 713\u2013716."},{"key":"e_1_2_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318.3319"},{"key":"e_1_2_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/22719.24067"},{"key":"e_1_2_1_23_2","unstructured":"S. Horiguchi and Y. Shigei \u2018A Parallel Sorting Algorithm for a Linearly Connected Multiprocessor System\u2019 Proceedings 6th International Conference on Distributed Computing Systems Cambridge Massachusetts May 1986 pp. 111\u2013118."}],"container-title":["Concurrency: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.4330040104","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.4330040104","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,11]],"date-time":"2025-01-11T23:52:16Z","timestamp":1736639536000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.4330040104"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[1992,2]]},"references-count":22,"journal-issue":{"issue":"1","published-print":{"date-parts":[[1992,2]]}},"alternative-id":["10.1002\/cpe.4330040104"],"URL":"https:\/\/doi.org\/10.1002\/cpe.4330040104","archive":["Portico"],"relation":{},"ISSN":["1040-3108","1096-9128"],"issn-type":[{"type":"print","value":"1040-3108"},{"type":"electronic","value":"1096-9128"}],"subject":[],"published":{"date-parts":[[1992,2]]}}}