Lookup NU author(s): Dr Simon Woodman, Dr Hugo Hiden, Professor Paul Watson
This is the authors' accepted manuscript of a conference proceedings (inc. abstract) that has been published in its final definitive form by ACM, 2015.
For re-use rights please refer to the publisher's terms and conditions.
The storage and retrieval of provenance is a critical piece of functionality for many data processing systems. There are numerous cases where a full and accurate provenance trace is vital: to satisfy regulatory requirements (such as in drug development and medical data processing), to accurately reproduce results (scientific research), or to maintain financial transparency (for example, to meet Sarbanes-Oxley regulations in the US).

Whilst it is always possible to meet these requirements by storing every piece of intermediate data generated by a sequence of calculations, the cost of retaining data that may have a low probability of future retrieval is significant. There is, however, an opportunity to reduce the cost of storage by opting not to store certain intermediate results that can be regenerated given knowledge of the processing code and input data that generated them.

This paper presents an approach which is able, via a collection of past performance and provenance data, to make decisions based on the underlying storage and computation costs as to which intermediate data to retain and which to regenerate on demand.
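The retain-or-regenerate trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' cost model, which the abstract does not specify. It assumes a flat storage price per GB-month, a flat compute price per CPU-hour, and an estimate of how often each result will be requested again. All names (`Intermediate`, `should_store`, the price parameters) are hypothetical, and the sketch ignores dependencies between intermediate results, such as an upstream result that would itself have to be regenerated first.

```python
from dataclasses import dataclass


@dataclass
class Intermediate:
    """One intermediate result of a workflow (hypothetical structure)."""
    name: str
    size_gb: float              # size of the result if it were stored
    regen_cpu_hours: float      # CPU time needed to recompute it from its inputs
    expected_retrievals: float  # estimated future requests over the retention period


def storage_cost(item: Intermediate, price_per_gb_month: float, months: float) -> float:
    """Cost of keeping the result for the whole retention period."""
    return item.size_gb * price_per_gb_month * months


def regeneration_cost(item: Intermediate, price_per_cpu_hour: float) -> float:
    """Expected cost of recomputing the result each time it is requested."""
    return item.regen_cpu_hours * price_per_cpu_hour * item.expected_retrievals


def should_store(item: Intermediate, price_per_gb_month: float,
                 months: float, price_per_cpu_hour: float) -> bool:
    """Retain the result only if storing is no dearer than regenerating on demand."""
    return (storage_cost(item, price_per_gb_month, months)
            <= regeneration_cost(item, price_per_cpu_hour))


# Illustrative values only: a large, cheap-to-recompute, rarely requested result,
# and a small, expensive-to-recompute, frequently requested one.
bulky = Intermediate("bulky", size_gb=100.0, regen_cpu_hours=0.5, expected_retrievals=0.1)
costly = Intermediate("costly", size_gb=1.0, regen_cpu_hours=200.0, expected_retrievals=2.0)

for item in (bulky, costly):
    decision = "store" if should_store(item, 0.02, 12, 0.05) else "regenerate on demand"
    print(f"{item.name}: {decision}")
```

Under these assumed prices, the large result is cheaper to regenerate on demand, while the small but computationally expensive one is cheaper to keep. In practice, the estimates for regeneration time and retrieval likelihood would have to come from collected performance and provenance data, as the abstract indicates.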
Author(s): Woodman S, Hiden H, Watson P
Publication type: Conference Proceedings (inc. Abstract)
Publication status: Published
Conference Name: WORKS '15 Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science
Year of Conference: 2015
Print publication date: 15/11/2015
Date deposited: 16/12/2015