Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules

Timon Sebastian Schroeter, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, Klaus Muller

Research output: Contribution to journal (Article)

30 Citations (Scopus)

Abstract

We investigate the use of different machine learning methods to construct models for aqueous solubility. The models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules from Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars in order to estimate the domain of applicability (DOA) of each model. Specifically, we investigate error bars from a Bayesian model (Gaussian Process, GP), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation and on an external validation set of 536 molecules) and of how faithfully the individual error bars represent the actual prediction error.
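As an illustration of the distance-based idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes one simple variant, the Mahalanobis distance of a query compound to the mean of the training descriptors, and it adds a small check of how often the absolute error falls within a given error bar. The function names, the `ridge` regularizer, and the synthetic descriptors are all hypothetical; the paper may define the distance and the evaluation differently.

```python
import numpy as np


def mahalanobis_to_training(X_train, X_query, ridge=1e-6):
    """Mahalanobis distance of each query row to the training-set mean.

    Assumption: distance to the mean is used as a simple proxy for
    "distance to training data"; other definitions are possible.
    """
    mean = X_train.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_train, rowvar=False))
    # Small ridge term (an assumption) keeps the covariance invertible.
    cov = cov + ridge * np.eye(cov.shape[0])
    cov_inv = np.linalg.inv(cov)
    diff = X_query - mean
    # d_i = sqrt(diff_i^T @ cov_inv @ diff_i), computed row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


def fraction_within(y_true, y_pred, sigma, k=1.0):
    """Share of predictions whose absolute error is at most k * sigma."""
    return float(np.mean(np.abs(y_true - y_pred) <= k * sigma))


# Synthetic descriptors for demonstration only, not the paper's data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
near = np.zeros((1, 3))      # close to the bulk of the training data
far = np.full((1, 3), 8.0)   # far outside the training distribution

d_near = mahalanobis_to_training(X_train, near)[0]
d_far = mahalanobis_to_training(X_train, far)[0]
print(f"near: {d_near:.2f}, far: {d_far:.2f}")
```

A large distance suggests that a compound lies outside the region covered by the training set, so its prediction deserves a wider error bar. `fraction_within` gives one rough way to check whether error bars are faithful: if they are well calibrated, the share of errors inside `k * sigma` should be close to the coverage expected under the assumed error distribution.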

Original language: English
Pages (from-to): 485-498
Number of pages: 14
Journal: Journal of Computer-Aided Molecular Design
Volume: 21
Issue number: 9
DOI: https://doi.org/10.1007/s10822-007-9125-z
Publication status: Published - 2007 Sep 1
Externally published: Yes

Keywords

  • Aqueous
  • Bayesian modeling
  • Decision tree
  • Distance
  • Domain of applicability
  • Drug discovery
  • Ensemble
  • Error bar
  • Error estimation
  • Gaussian Process
  • Machine learning
  • Random forest
  • Ridge regression
  • Solubility
  • Support vector machine

ASJC Scopus subject areas

  • Molecular Medicine

Cite this

Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules. / Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Muller, Klaus.

In: Journal of Computer-Aided Molecular Design, Vol. 21, No. 9, 01.09.2007, p. 485-498.

Schroeter, TS, Schwaighofer, A, Mika, S, Ter Laak, A, Suelzle, D, Ganzer, U, Heinrich, N & Muller, K 2007, 'Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules', Journal of Computer-Aided Molecular Design, vol. 21, no. 9, pp. 485-498. https://doi.org/10.1007/s10822-007-9125-z
Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Muller, Klaus. / Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules. In: Journal of Computer-Aided Molecular Design. 2007; Vol. 21, No. 9. pp. 485-498.
@article{34924a84b94a403e8e05d5e52cc73202,
title = "Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules",
abstract = "We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.",
keywords = "Aqueous, Bayesian modeling, Decision tree, Distance, Domain of applicability, Drug discovery, Ensemble, Error bar, Error estimation, Gaussian Process, Machine learning, Random forest, Ridge regression, Solubility, Support vector machine",
author = "Schroeter, {Timon Sebastian} and Anton Schwaighofer and Sebastian Mika and {Ter Laak}, Antonius and Detlev Suelzle and Ursula Ganzer and Nikolaus Heinrich and Klaus Muller",
year = "2007",
month = "9",
day = "1",
doi = "10.1007/s10822-007-9125-z",
language = "English",
volume = "21",
pages = "485--498",
journal = "Journal of Computer-Aided Molecular Design",
issn = "0920-654X",
publisher = "Springer Netherlands",
number = "9",

}

TY - JOUR

T1 - Estimating the domain of applicability for machine learning QSAR models

T2 - A study on aqueous solubility of drug discovery molecules

AU - Schroeter, Timon Sebastian

AU - Schwaighofer, Anton

AU - Mika, Sebastian

AU - Ter Laak, Antonius

AU - Suelzle, Detlev

AU - Ganzer, Ursula

AU - Heinrich, Nikolaus

AU - Muller, Klaus

PY - 2007/9/1

Y1 - 2007/9/1

AB - We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.

KW - Aqueous

KW - Bayesian modeling

KW - Decision tree

KW - Distance

KW - Domain of applicability

KW - Drug discovery

KW - Ensemble

KW - Error bar

KW - Error estimation

KW - Gaussian Process

KW - Machine learning

KW - Random forest

KW - Ridge regression

KW - Solubility

KW - Support vector machine

UR - http://www.scopus.com/inward/record.url?scp=35748984950&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35748984950&partnerID=8YFLogxK

U2 - 10.1007/s10822-007-9125-z

DO - 10.1007/s10822-007-9125-z

M3 - Article

C2 - 17632688

AN - SCOPUS:35748984950

VL - 21

SP - 485

EP - 498

JO - Journal of Computer-Aided Molecular Design

JF - Journal of Computer-Aided Molecular Design

SN - 0920-654X

IS - 9

ER -