Multi-class protein fold classification using a new ensemble machine learning approach.

Aik-Choon Tan, David Gilbert, Yves Deville

Research output: Contribution to journal › Article

64 Citations (Scopus)

Abstract

Protein structure classification is an important step in understanding the associations between sequence and structure, as well as possible functional and evolutionary relationships. Recent structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. The volume of structural data has made traditional methods, such as manual inspection of protein structures, infeasible. Machine learning has been widely and successfully applied to bioinformatics. This work proposes a novel ensemble machine learning method that improves the coverage of classifiers on multi-class imbalanced sample sets by integrating knowledge induced from different base classifiers, and we illustrate this idea by classifying multi-class SCOP protein fold data. We have compared our approach with PART and show that our method improves the sensitivity of the classifier in protein fold classification. Furthermore, we have extended this method to learning over multiple data types, preserving the independence of their corresponding data sources, and show that our new approach performs at least as well as the traditional technique over a single joined data source. These experimental results are encouraging, and the method can be applied to other bioinformatics problems similarly characterised by multi-class imbalanced data sets held in multiple data sources.
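The abstract reports results in terms of classifier sensitivity on imbalanced multi-class data and describes combining several base classifiers. The following is only an illustrative sketch of those two ideas, not the authors' method: it computes per-class sensitivity (true positives divided by all true members of a class) and combines base-classifier predictions by simple plurality vote, which is a generic stand-in for the integration scheme the paper proposes. The function names, class labels, and predictions are all hypothetical.

```python
from collections import Counter


def per_class_sensitivity(y_true, y_pred):
    """Return {class: TP / (TP + FN)} for every class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def plurality_vote(predictions):
    """Combine several prediction lists (one per base classifier) by
    taking the most common label for each sample. This is a generic
    combiner used for illustration, not the scheme from the paper."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical imbalanced data: class "a" dominates, "b" and "c" are rare.
y_true = ["a", "a", "a", "a", "b", "c"]

# Hypothetical outputs of three base classifiers on the same six samples.
clf1 = ["a", "a", "a", "a", "a", "c"]  # misses the rare class "b"
clf2 = ["a", "a", "b", "a", "b", "a"]  # misses the rare class "c"
clf3 = ["a", "a", "a", "a", "b", "c"]

combined = plurality_vote([clf1, clf2, clf3])

print(per_class_sensitivity(y_true, clf1))
print(per_class_sensitivity(y_true, combined))
```

Reporting sensitivity per class, rather than overall accuracy, keeps a rare fold that a classifier never predicts from being hidden by good performance on the dominant folds.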

Original language: English
Pages (from-to): 206-217
Number of pages: 12
Journal: Genome informatics series : proceedings of the . Workshop on Genome Informatics. Workshop on Genome Informatics
Volume: 14
Publication status: Published - 2003 Dec 1
Externally published: Yes

Fingerprint

Information Storage and Retrieval
Computational Biology
Proteins
Genomics
Learning
Databases
Machine Learning
Research

Cite this

Multi-class protein fold classification using a new ensemble machine learning approach. / Tan, Aik-Choon; Gilbert, David; Deville, Yves.

In: Genome informatics series : proceedings of the . Workshop on Genome Informatics. Workshop on Genome Informatics, Vol. 14, 01.12.2003, p. 206-217.

@article{9369c11ac3914e7bbb55089c68feaf2e,
title = "Multi-class protein fold classification using a new ensemble machine learning approach.",
abstract = "Protein structure classification represents an important process in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships. Recent structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. The amount of structural data has made traditional methods such as manual inspection of the protein structure become impossible. Machine learning has been widely applied to bioinformatics and has gained a lot of success in this research area. This work proposes a novel ensemble machine learning method that improves the coverage of the classifiers under the multi-class imbalanced sample sets by integrating knowledge induced from different base classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have compared our approach with PART and show that our method improves the sensitivity of the classifier in protein fold classification. Furthermore, we have extended this method to learning over multiple data types, preserving the independence of their corresponding data sources, and show that our new approach performs at least as well as the traditional technique over a single joined data source. These experimental results are encouraging, and can be applied to other bioinformatics problems similarly characterised by multi-class imbalanced data sets held in multiple data sources.",
author = "Aik-Choon Tan and David Gilbert and Yves Deville",
year = "2003",
month = "12",
day = "1",
language = "English",
volume = "14",
pages = "206--217",
journal = "Genome informatics. International Conference on Genome Informatics",
issn = "0919-9454",
publisher = "Universal Academy Press",

}

TY - JOUR

T1 - Multi-class protein fold classification using a new ensemble machine learning approach.

AU - Tan, Aik-Choon

AU - Gilbert, David

AU - Deville, Yves

PY - 2003/12/1

Y1 - 2003/12/1

N2 - Protein structure classification represents an important process in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships. Recent structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. The amount of structural data has made traditional methods such as manual inspection of the protein structure become impossible. Machine learning has been widely applied to bioinformatics and has gained a lot of success in this research area. This work proposes a novel ensemble machine learning method that improves the coverage of the classifiers under the multi-class imbalanced sample sets by integrating knowledge induced from different base classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have compared our approach with PART and show that our method improves the sensitivity of the classifier in protein fold classification. Furthermore, we have extended this method to learning over multiple data types, preserving the independence of their corresponding data sources, and show that our new approach performs at least as well as the traditional technique over a single joined data source. These experimental results are encouraging, and can be applied to other bioinformatics problems similarly characterised by multi-class imbalanced data sets held in multiple data sources.

UR - http://www.scopus.com/inward/record.url?scp=14944354760&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=14944354760&partnerID=8YFLogxK

M3 - Article

VL - 14

SP - 206

EP - 217

JO - Genome informatics. International Conference on Genome Informatics

JF - Genome informatics. International Conference on Genome Informatics

SN - 0919-9454

ER -