[PhD Thesis] Fault-Tolerant Parallel Applications Using a Network of Workstations

Author(s): Smith J
Publication type: Report
Year: 1997
Full text is not currently available for this publication.
It is becoming common to employ a Network of Workstations, often referred to as a NOW, for general purpose computing, since the allocation of an individual workstation offers good interactive response. However, there may still be a need to perform very large scale computations which exceed the resources of a single workstation. It may be that the amount of processing implies an inconveniently long duration or that the data manipulated exceeds available storage. One possibility is to employ a more powerful single machine for such computations. However, there is growing interest in seeking a cheaper alternative by harnessing the significant idle time often observed in a NOW and also possibly employing a number of workstations in parallel on a single problem. Parallelisation permits use of the combined memories of all participating workstations, but also introduces a need for communication, and success in any hardware environment depends on the amount of communication relative to the amount of computation required. In the context of a NOW, much success is reported with applications which have low communication requirements relative to computation requirements. Here it is claimed that there is reason for investigation into the use of a NOW for parallel execution of computations which are demanding in storage, potentially even exceeding the sum of memory in all available workstations. Another consideration is that where a computation is of sufficient scale, some provision for tolerating partial failures may be desirable. However, generic support for storage management and fault-tolerance in computations of this scale for a NOW is not currently available and the suitability of a NOW for solving such computations has not been investigated to any large extent. The work described here is concerned with these issues.
The approach employed is to make use of an existing distributed system which supports nested atomic actions (atomic transactions) to structure fault-tolerant computations with persistent objects. This system is used to develop a fault-tolerant "bag of tasks" computation model, where the bag and shared objects are located on secondary storage. In order to understand the factors that affect the performance of large parallel computations on a NOW, a number of specific applications are developed. The performance of these applications is analysed using a semi-empirical model. The same measurements underlying these performance predictions may be employed in estimating the performance of alternative application structures. Using services provided by the distributed system referred to above, each application is implemented. The implementation allows verification of predicted performance and also permits identification of issues regarding construction of the components required to support the chosen application structuring technique. The work demonstrated that a NOW certainly offers some potential for gain through parallelisation and that, for large grain computation, the cost of implementing fault tolerance is low.
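The fault-tolerant "bag of tasks" idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the class `TaskBag`, the exception `WorkerCrash`, and the helper functions are hypothetical names, the bag is held in memory rather than on secondary storage, and the atomic action is only emulated by committing the task removal and the result together after the computation succeeds. Workers are simulated sequentially in a single process.

```python
import random
from collections import deque


class WorkerCrash(Exception):
    """Simulated workstation failure in the middle of a task (hypothetical)."""


class TaskBag:
    """Toy stand-in for a persistent bag of tasks plus a shared results object.

    Assumption: a real system would keep both on secondary storage and
    update them inside an atomic action. Here the state is in memory and
    the atomic action is emulated by committing only after success.
    """

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.results = {}

    def run_one(self, compute):
        """Process the next task; return False once the bag is empty.

        If compute raises WorkerCrash, nothing is committed: the task
        stays in the bag and no partial result is recorded.
        """
        if not self.pending:
            return False
        task = self.pending[0]  # tentative read, task not yet removed
        try:
            result = compute(task)
        except WorkerCrash:
            return True  # abort: bag and results are unchanged
        # Commit: removal and result are applied together after success.
        self.pending.popleft()
        self.results[task] = result
        return True


def flaky_square(rng, failure_rate):
    """Build a compute function that squares its input but sometimes crashes."""
    def compute(task):
        if rng.random() < failure_rate:
            raise WorkerCrash(task)
        return task * task
    return compute


def run_all(tasks, failure_rate=0.3, seed=0):
    """Drain the bag, retrying any task whose worker crashed."""
    bag = TaskBag(tasks)
    compute = flaky_square(random.Random(seed), failure_rate)
    while bag.run_one(compute):
        pass
    return bag.results


if __name__ == "__main__":
    print(run_all(range(5)))
```

Because a crashed task is never removed from the bag, every task is eventually completed exactly once in the results, whatever the pattern of simulated failures, provided the failure rate is below 1.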
Institution: Department of Computing Science, University of Newcastle upon Tyne
Place Published: Newcastle upon Tyne
Notes: British Lending Library DSC stock location number: DXN015493