Meeting Bioinformatik Club 19.06.2017

Monday, 19 June 2017 - 16:30 to 18:00
  • Osama Mahmoud (University of Bristol): Improving Machine Learning from Functional Genomic Data by means of Overlapping Analysis

Machine learning approaches are concerned with understanding and modelling complex datasets.
Given a set of training data, their main aim is to learn a model that maps the relationship
between a set of input features and a response of interest in a predictive way.
Classification and clustering are the foremost tasks of such a learning process. These methods
have applications in many fields of modern biology, including array-based gene expression
analysis as well as other functional genomic experiments.

For instance, microarray technology allows tens of thousands of probe sets or genes (features)
to be measured simultaneously. However, the expression of these features is usually observed in
a much smaller number of tissue samples (observations), from tens to a few hundred. This
high dimensionality has a great impact on the learning process, since most genes are noisy,
redundant or irrelevant to the learning task at hand.

Both the prediction accuracy and interpretability of a machine learning model are believed to be
enhanced by performing the learning process based only on selected informative features.
Motivated by this notion, a novel statistical approach, named Proportional Overlapping Scores
(POS), was proposed for selecting features based on an analysis of how gene expression values
overlap across the different classes of a classification task. The method yields a novel measure
of a feature's relevance to the learning task.
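
To give a feel for what an overlap-based relevance measure can look like, here is a minimal Python sketch. It is an illustration only and not the POS measure as published: the functions `overlap_score` and `rank_features` are hypothetical names, and the score used (the fraction of samples that fall inside the region where the value ranges of two classes overlap) is a simplifying assumption made for this example.

```python
def overlap_score(values, labels):
    """Fraction of samples lying in the region where the two class ranges overlap.

    Simplified, assumed score for illustration (not the published POS measure):
    0.0 means the two classes are fully separated on this feature,
    1.0 means every sample lies in the shared range.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = [v for v, c in zip(values, labels) if c == classes[0]]
    b = [v for v, c in zip(values, labels) if c == classes[1]]
    # Overlap region = intersection of the two [min, max] class intervals.
    lo = max(min(a), min(b))
    hi = min(max(a), max(b))
    if lo > hi:
        return 0.0  # the intervals do not intersect
    inside = sum(1 for v in values if lo <= v <= hi)
    return inside / len(values)


def rank_features(rows, labels):
    """Rank feature indices from least to most overlapping.

    rows: list of samples, each a list of feature values (e.g. gene expressions).
    labels: class label of each sample.
    """
    n_features = len(rows[0])
    scores = [
        overlap_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j])
    return order, scores


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[1, 1], [2, 5], [3, 9], [10, 2], [11, 6], [12, 10]]
labels = [0, 0, 0, 1, 1, 1]
order, scores = rank_features(rows, labels)
```

In this toy example the feature with the smaller overlap is ranked first. Using the full range of each class makes the score sensitive to outliers, so a more robust interval per class would probably be preferable on real expression data.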

The main idea of the proposal was published in Mahmoud et al., BMC Bioinformatics, 2014,
and the method is implemented in an R package named “propOverlap”, publicly available from
the Comprehensive R Archive Network (CRAN).

In this talk, the POS method will be presented along with its application to a number of
publicly available gene expression data sets for different cancers. Moreover, I aim to introduce
some future plans and basic insights for extending the idea to the clustering of features in order
to enhance a model's predictive power.