ePrints (Open Access)

Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

NU author(s): Dr Matthew Forshaw, Dr Stephen McGough, Dr Nigel Thomas

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0).


Abstract

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware or software failures, as well as interruptions from resource owners and more important tasks. Until recently, many researchers have focused on the performance gains achieved through checkpointing, but now, with growing scrutiny of the energy consumption of IT infrastructures, it is increasingly important to understand the energy impact of checkpointing within an HTC environment. In this paper we demonstrate, through trace-driven simulation of real-world datasets, that existing checkpointing strategies are inadequate at maintaining an acceptable level of energy consumption whilst maintaining the performance gains expected with checkpointing. Furthermore, we identify factors important in deciding whether to exploit checkpointing within an HTC environment, and propose novel strategies to curtail the energy consumption of checkpointing approaches whilst maintaining the performance benefits.


Publication metadata

Author(s): Forshaw M, McGough AS, Thomas N

Publication type: Article

Publication status: Published

Journal: Electronic Notes in Theoretical Computer Science

Year: 2015

Volume: 310

Pages: 65-90

Print publication date: 05/01/2015

Online publication date: 08/01/2015

Acceptance date: 01/01/1900

Date deposited: 26/05/2015

ISSN (electronic): 1571-0661

Publisher: Elsevier BV

URL: http://dx.doi.org/10.1016/j.entcs.2014.12.013

DOI: 10.1016/j.entcs.2014.12.013

