SVM2Motif-Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor

Marina M C Vidovic, Nico Görnitz, Klaus Muller, Gunnar Rätsch, Marius Kloft

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks but, owing to their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although POIMs are a major step towards explaining trained SVM models, their size grows exponentially with the length of the motif, which renders manual inspection feasible only for comparatively small motif sizes, typically k ≤ 5. In this work, we extend POIMs by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs, regardless of their length and complexity, underlying the predictions of a trained SVM model. Our framework treats the motifs as free parameters in a probabilistic model, a task that can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address with an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
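To make the exponential growth mentioned in the abstract concrete, the following minimal sketch counts the entries of a matrix that holds one score per (k-mer, start position) pair over the four-letter DNA alphabet. The function name `poim_entries`, the sequence length of 100, and the convention of counting only start positions where the k-mer fits entirely inside the sequence are illustrative assumptions, not details taken from the paper.

```python
def poim_entries(k: int, seq_len: int, alphabet_size: int = 4) -> int:
    """Count (k-mer, start position) pairs for sequences of length seq_len.

    Assumption (not from the paper): a k-mer may only start at positions
    where it fits entirely inside the sequence, i.e. seq_len - k + 1 starts.
    """
    if k < 1 or k > seq_len:
        raise ValueError("k must satisfy 1 <= k <= seq_len")
    n_kmers = alphabet_size ** k      # 4^k distinct DNA k-mers
    n_positions = seq_len - k + 1     # admissible start positions
    return n_kmers * n_positions


if __name__ == "__main__":
    # Hypothetical sequence length, chosen only for illustration.
    for k in (1, 3, 5, 10, 20):
        print(f"k={k:2d}: {poim_entries(k, seq_len=100):,} entries")
```

Under these assumptions, the count for k = 5 is still in the tens of thousands, while k = 20 already exceeds 10^13 entries, which is consistent with the abstract's remark that manual inspection is practical only for small k.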

Original language: English
Article number: e0144782
Journal: PLoS One
Volume: 10
Issue number: 12
DOIs: https://doi.org/10.1371/journal.pone.0144782
Publication status: Published - 2015 Dec 1

Fingerprint

Nucleotide Motifs
DNA sequences
Oligomers
Support vector machines
artificial intelligence
nucleotide sequences
Learning systems
probabilistic models
system optimization
Statistical Models
Computational Biology
bioinformatics
Nucleotides
nucleotides
genomics
prediction
Inspection
organisms
extracts
support vector machines

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

Vidovic, M. M. C., Görnitz, N., Muller, K., Rätsch, G., & Kloft, M. (2015). SVM2Motif-Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor. PLoS One, 10(12), [e0144782]. https://doi.org/10.1371/journal.pone.0144782

@article{2ff4e604978f44f0a75b8ec7cb5df8a4,
title = "SVM2Motif-Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor",
author = "Vidovic, {Marina M C} and Nico G{\"o}rnitz and Klaus Muller and Gunnar R{\"a}tsch and Marius Kloft",
year = "2015",
month = "12",
day = "1",
doi = "10.1371/journal.pone.0144782",
language = "English",
volume = "10",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "12",

}

TY - JOUR

T1 - SVM2Motif-Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor

AU - Vidovic, Marina M C

AU - Görnitz, Nico

AU - Muller, Klaus

AU - Rätsch, Gunnar

AU - Kloft, Marius

PY - 2015/12/1

Y1 - 2015/12/1

UR - http://www.scopus.com/inward/record.url?scp=84956919555&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84956919555&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0144782

DO - 10.1371/journal.pone.0144782

M3 - Article

C2 - 26690911

AN - SCOPUS:84956919555

VL - 10

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 12

M1 - e0144782

ER -