ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem

Triguero, I; del Rio, S; Lopez, V; Bacardit, J; Benitez, JM; Herrera, F

doi:10.1016/j.knosys.2015.05.027

Lookup NU author(s): Professor Jaume Bacardit

Downloads

Full text for this publication is not currently held within this repository. Alternative links are provided below where available.

Abstract

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that must be processed. Moreover, in many of these problems, such as contact map prediction (the problem tackled in this paper), it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods.

In this work we describe the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition, presenting an extensive experimental study that characterizes how our methodology works. From this analysis we can conclude that this approach is well suited to tackling large-scale bioinformatics classification problems. (C) 2015 Elsevier B.V. All rights reserved.
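As an informal illustration of steps (1) and (2) in the abstract, the following is a minimal single-machine sketch in plain Python. It is not the authors' MapReduce implementation: the function names `random_oversample` and `select_features`, the toy data, the feature weights and the threshold value are all assumptions made for this example. In particular, the weights are simply given by hand here, whereas the paper obtains them from an evolutionary feature weighting process.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Replicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority examples to replicate")
    majority_count = len(labels) - len(minority_idx)
    n_extra = max(0, majority_count - len(minority_idx))
    # Sample with replacement from the minority class.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def select_features(rows, weights, threshold):
    """Keep only the columns whose weight is at or above the threshold."""
    kept = [j for j, w in enumerate(weights) if w >= threshold]
    return [[row[j] for j in kept] for row in rows], kept


# Toy data (hypothetical): 6 negative and 2 positive examples, 3 features.
rows = [[0.1, 5.0, 1.0], [0.2, 4.0, 0.9], [0.3, 6.0, 1.1],
        [0.2, 5.5, 1.0], [0.1, 4.5, 0.8], [0.3, 5.0, 1.2],
        [0.9, 5.0, 0.1], [0.8, 4.8, 0.2]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical feature weights standing in for the evolutionary weighting step.
weights = [0.9, 0.1, 0.6]

balanced_rows, balanced_labels = random_oversample(rows, labels)
reduced_rows, kept = select_features(balanced_rows, weights, threshold=0.5)

print(balanced_labels.count(0), balanced_labels.count(1), kept)
# prints: 6 6 [0, 2]
```

In a full pipeline, the balanced and reduced data would then be passed to a Random Forest learner (step 3) whose model classifies the test data (step 4); those steps are omitted here to keep the sketch short.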

Publication metadata

Author(s): Triguero I, del Rio S, Lopez V, Bacardit J, Benitez JM, Herrera F

Publication type: Article

Publication status: Published

Journal: Knowledge-Based Systems

Year: 2015

Volume: 87

Pages: 69-79

Print publication date: 01/10/2015

Online publication date: 01/06/2015

Acceptance date: 28/05/2015

ISSN (print): 0950-7051

ISSN (electronic): 1872-7409

Publisher: Elsevier

URL: http://dx.doi.org/10.1016/j.knosys.2015.05.027

DOI: 10.1016/j.knosys.2015.05.027

Funding

Funder reference	Funder name
	Ghent University
P10-TIC-6858
P11-TIC-7765
P12-TIC-2958
TIN2014-57251-P
TIN2013-47210-P
