Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study

Cala, J; Missier, P

doi:10.1016/j.bdr.2018.06.001

NU author(s): Dr Jacek Cala, Professor Paolo Missier

Downloads

Accepted version [.pdf]

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND).

Abstract

The value of knowledge assets generated by analytics processes using Data Science techniques tends to decay over time, as a consequence of changes in the elements the process depends on: external data sources, libraries, and system dependencies. For large-scale problems, refreshing those outcomes through greedy re-computation is both expensive and inefficient, as some changes have limited impact. In this paper we address the problem of refreshing past process outcomes selectively, that is, by trying to identify the subset of outcomes that will have been affected by a change, and by only re-executing fragments of the original process. We propose a technical approach to address the selective re-computation problem by combining multiple techniques, and present an extensive experimental study in Genomics, namely the calling of variants and their clinical interpretation, to show its effectiveness. In this case study, we are able to decrease the number of required re-computations on a cohort of individuals from 495 (blind) down to 71, and we can reduce runtime by at least 60% relative to the naïve blind approach, and in some cases by 90%. Starting from this experience, we then propose a blueprint for a generic re-computation meta-process that makes use of process history metadata to make informed decisions about selective re-computations in reaction to a variety of changes in the data.

Publication metadata

Author(s): Cala J, Missier P

Publication type: Article

Publication status: Published

Journal: Big Data Research

Year: 2018

Volume: 13

Pages: 76-94

Print publication date: 01/09/2018

Online publication date: 14/08/2018

Acceptance date: 20/06/2018

Date deposited: 27/06/2018

ISSN (print): 2214-5796

ISSN (electronic): 2214-580X

Publisher: Elsevier BV

URL: https://doi.org/10.1016/j.bdr.2018.06.001

DOI: 10.1016/j.bdr.2018.06.001

Funding

Funder reference	Funder name
EP/N01426X/1	EPSRC
