

Data mining and machine learning in e-Science Central using Weka

Lookup NU author(s): Dr Dominic Searson


Abstract

Weka is a mature and widely used set of Java software tools for machine learning, data-driven modelling and data mining, and is regarded as a current gold standard for the practical application of these techniques. This paper describes the integration and use of elements of the Weka open source machine learning toolkit within the cloud-based data analytics platform e-Science Central. The purpose is to extend the data mining capabilities of e-Science Central using trusted, widely used software components, so that non-specialists in machine learning can easily apply these techniques to their own data. To this end, around 25 Weka blocks have been added to the e-Science Central workflow palette. These blocks comprise (1) a representative sample of the supervised learning algorithms in Weka, (2) utility blocks for the manipulation and pre-processing of data, and (3) blocks that generate detailed model performance reports in PDF format. The blocks in the last group were created to extend existing Weka functionality: they let the user generate a single document so that model details and performance can be referenced outside of e-Science Central and Weka. Two real-world examples are used to demonstrate Weka functionality in e-Science Central workflows: a regression modelling problem, where the objective is to develop a model to predict a quality variable from an industrial distillation tower, and a classification problem, where the objective is to predict cancer diagnoses (tumours classified as 'Malignant' or 'Benign') based on measurements taken from lab cell nuclei imaging. Step-by-step methods are used to show how these data sets may be modelled, and the models evaluated, using blocks in e-Science Central workflows.


Publication metadata

Author(s): Searson D

Publication type: Report

Publication status: Published

Series Title: School of Computing Science Technical Report Series

Year: 2015

Pages: 24

Print publication date: 01/02/2015

Report Number: 1454

Institution: School of Computing Science, University of Newcastle upon Tyne

Place Published: Newcastle upon Tyne

URL: http://www.cs.ncl.ac.uk/publications/trs/papers/1454.pdf

