

Enhancing Replica Management Services to Tolerate Group Failures

Newcastle University author(s): Dr Paul Ezhilchelvan, Emeritus Professor Santosh Shrivastava



Abstract

In a distributed system, replicating components such as objects is a well-known way of achieving availability. For increased availability, crashed and disconnected components must be replaced by new components on available spare nodes. This replacement results in the membership of the replicated group 'walking' over a number of machines during system operation. In this context, we address the problem of reconfiguring a group after the group as an entity has failed. Such a failure is termed a group failure; examples include the crash of every component in the group and the partitioning of the group into minority islands. The solution assumes crash-proof storage, eventual recovery of crashed nodes, and eventual healing of partitions. It guarantees that (i) at most one group is reconfigured after a group failure, and (ii) the reconfigured group contains a majority of the components that were members of the group just before the group failure occurred, so that the loss of state information due to a group failure is minimal. Although the protocol is subject to blocking, it remains efficient in terms of communication rounds and use of stable store, both during normal operation and during reconfiguration after a group failure.
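Guarantees (i) and (ii) are linked by a simple quorum property. Any two majorities of the same pre-failure membership share at least one member, so two disjoint groups can never both contain a majority of it. The following is a minimal Python sketch of that majority check, assuming members are identified by strings. `can_reconfigure` and `majorities_intersect` are hypothetical names used for illustration only. The sketch does not reproduce the paper's protocol and omits its stable-store records, blocking behaviour, and message rounds.

```python
from itertools import combinations


def can_reconfigure(last_view, candidates):
    """Hypothetical check: a new group may be formed only if the candidate
    members include a strict majority of the last view (the membership just
    before the group failure)."""
    survivors = candidates & last_view  # members carried over from the last view
    return 2 * len(survivors) > len(last_view)


def majorities_intersect(last_view):
    """Verify that every pair of minimal majorities of last_view overlaps.
    Larger majorities contain a minimal one, so checking these suffices."""
    members = sorted(last_view)
    k = len(members) // 2 + 1  # size of a minimal strict majority
    quorums = [set(c) for c in combinations(members, k)]
    return all(a & b for a in quorums for b in quorums)
```

For a last view `{a, b, c, d, e}`, candidates `{a, b, c}` pass the check. Candidates `{d, e, x}` fail because only two of the five previous members are present. With an even-sized view such as `{a, b, c, d}`, exactly half (`{a, b}`) is not enough, which is why a strict majority is used.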


Publication metadata

Author(s): Ezhilchelvan PD, Shrivastava SK

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: 2nd IEEE International Symposium on Object Oriented Real-Time Computing (ISORC '99)

Year of Conference: 1999

Pages: 263-270

Publisher: IEEE Computer Society Press

URL: http://dx.doi.org/10.1109/ISORC.1999.776388

DOI: 10.1109/ISORC.1999.776388


ISBN: 0769502075

