Workflow Provenance: An Analysis of Long Term Storage Costs

Woodman, S; Hiden, H; Watson, P

doi:10.1145/2822332.2822341

NU author(s): Dr Simon Woodman, Dr Hugo Hiden, Professor Paul Watson

Licence

This is the authors' accepted manuscript of a conference proceedings (inc. abstract) that has been published in its final definitive form by ACM, 2015.

For re-use rights please refer to the publisher's terms and conditions.

Abstract

The storage and retrieval of provenance is a critical piece of functionality for many data processing systems. There are numerous cases where a full and accurate provenance trace is vital: to satisfy regulatory requirements (such as those governing drug development and medical data processing), to accurately reproduce results (scientific research), or to maintain financial transparency (for example, to meet Sarbanes-Oxley regulations in the US). Whilst it is always possible to meet these requirements by storing every piece of intermediate data generated by a sequence of calculations, the costs associated with retaining data that may have a low probability of future retrieval are significant. There is, however, an opportunity to reduce the cost of storage by opting not to store certain intermediate results that can be regenerated given knowledge of the processing code and input data that generated them. This paper presents an approach which uses a collection of past performance and provenance data to decide, based on the underlying storage and computation costs, which intermediate data to retain and which to regenerate on demand.
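
The retain-or-regenerate trade-off described in the abstract can be illustrated with a minimal sketch. This is not the model from the paper: the record fields, the price parameters, and the simple rule (retain an item when storing it for the retention period costs less than the expected cost of regenerating it) are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class IntermediateResult:
    """Hypothetical record for one intermediate data item (all fields are assumptions)."""
    name: str
    size_gb: float                # size of the item if it is stored
    regen_cpu_hours: float        # compute time needed to regenerate it
    retrieval_probability: float  # estimated chance it is requested while retained


def storage_cost(item: IntermediateResult, price_per_gb_month: float, months: int) -> float:
    """Cost of keeping the item in storage for the whole retention period."""
    return item.size_gb * price_per_gb_month * months


def expected_regeneration_cost(item: IntermediateResult, price_per_cpu_hour: float) -> float:
    """Cost of recomputing the item, weighted by how likely it is to be requested."""
    return item.retrieval_probability * item.regen_cpu_hours * price_per_cpu_hour


def should_retain(item: IntermediateResult, price_per_gb_month: float,
                  price_per_cpu_hour: float, months: int) -> bool:
    """Retain the item only when storing it is cheaper than regenerating it on demand."""
    return (storage_cost(item, price_per_gb_month, months)
            < expected_regeneration_cost(item, price_per_cpu_hour))


if __name__ == "__main__":
    # Illustrative items and prices only; none of these figures come from the paper.
    items = [
        IntermediateResult("large_cheap_to_recompute", 500.0, 2.0, 0.05),
        IntermediateResult("small_expensive_to_recompute", 0.01, 200.0, 0.5),
    ]
    for item in items:
        decision = "retain" if should_retain(item, 0.02, 0.10, 60) else "regenerate on demand"
        print(f"{item.name}: {decision}")
```

In a sketch like this, the retrieval probability and regeneration time would be estimated from past performance and provenance data, which is the role the abstract assigns to that data. A fuller model would also need to account for dependencies between intermediate results, since regenerating one item may require regenerating the items it was derived from.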