Machine learning models for lipophilicity and their domain of applicability

Timon Schroeter, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, Klaus Muller

Research output: Contribution to journal (Article)

24 Citations (Scopus)

Abstract

Unfavorable lipophilicity and water solubility cause many drug failures; these properties therefore have to be taken into account early in lead discovery. Commercial tools for predicting lipophilicity have usually been trained on small, neutral molecules and are thus often unable to predict in-house data accurately. Using a modern Bayesian machine learning algorithm, a Gaussian process model, this study constructs a log D7 model based on 14556 drug discovery compounds of Bayer Schering Pharma. Performance is compared with support vector machines, decision trees, ridge regression, and four commercial tools. In a blind test on 7013 new measurements from recent months (including compounds from new projects), 81% were predicted correctly within 1 log unit, compared to only 44% achieved by commercial software. Additional evaluations using public data are presented. We consider error bars for each method (model-based, ensemble-based, and distance-based approaches) and investigate how well they quantify the domain of applicability of each model.
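The headline figure in the abstract is the share of compounds predicted within 1 log unit of the measured value, and one of the error-bar types it mentions is ensemble based. The sketch below illustrates both ideas in plain Python. It is not the authors' code: the function names `fraction_within` and `ensemble_error_bar`, the use of the population standard deviation as the ensemble spread, and all numbers are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def fraction_within(predicted, measured, tolerance=1.0):
    """Share of predictions whose absolute error is at most `tolerance` log units."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not predicted:
        raise ValueError("need at least one compound")
    hits = sum(abs(p - m) <= tolerance for p, m in zip(predicted, measured))
    return hits / len(predicted)


def ensemble_error_bar(member_predictions):
    """Mean prediction and spread across ensemble members for one compound.

    Assumption: the spread is taken as the population standard deviation;
    the paper may define its ensemble-based error bar differently.
    """
    return mean(member_predictions), pstdev(member_predictions)


# Toy values, not data from the study.
measured = [1.2, 3.4, 0.5, 2.8]
predicted = [1.0, 2.1, 0.9, 4.5]
share = fraction_within(predicted, measured)

# Three hypothetical ensemble members predicting the same compound.
center, spread = ensemble_error_bar([2.0, 2.5, 3.0])
```

A large `spread` for a compound would suggest that the ensemble members disagree, which is one heuristic signal that the compound may lie outside the model's domain of applicability.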

Original language: English
Pages (from-to): 524-538
Number of pages: 15
Journal: Molecular Pharmaceutics
Volume: 4
Issue number: 4
DOI: https://doi.org/10.1021/mp0700413
Publication status: Published - 2007 Jul 1
Externally published: Yes

Fingerprint

Decision Trees
Drug Discovery
Solubility
Software
Water
Pharmaceutical Preparations
Lead
Support Vector Machine
Machine Learning

Keywords

  • Bayesian
  • Decision tree
  • Distance
  • Domain of applicability
  • Drug discovery
  • Ensemble
  • Error bar
  • Error estimation
  • Gaussian process
  • Machine learning
  • Modeling
  • Random forest
  • Support vector machine
  • Support vector regression

ASJC Scopus subject areas

  • Molecular Medicine
  • Pharmaceutical Science

Cite this

Schroeter, T., Schwaighofer, A., Mika, S., Ter Laak, A., Suelzle, D., Ganzer, U., ... Muller, K. (2007). Machine learning models for lipophilicity and their domain of applicability. Molecular Pharmaceutics, 4(4), 524-538. https://doi.org/10.1021/mp0700413

@article{de3b68e962c247a29bcab5909197c48c,
title = "Machine learning models for lipophilicity and their domain of applicability",
keywords = "Bayesian, Decision tree, Distance, Domain of applicability, Drug discovery, Ensemble, Error bar, Error estimation, Gaussian process, Machine learning, Modeling, Random forest, Support vector machine, Support vector regression",
author = "Timon Schroeter and Anton Schwaighofer and Sebastian Mika and {Ter Laak}, Antonius and Detlev Suelzle and Ursula Ganzer and Nikolaus Heinrich and Klaus Muller",
year = "2007",
month = "7",
day = "1",
doi = "10.1021/mp0700413",
language = "English",
volume = "4",
pages = "524--538",
journal = "Molecular Pharmaceutics",
issn = "1543-8384",
publisher = "American Chemical Society",
number = "4",

}

TY  - JOUR
T1  - Machine learning models for lipophilicity and their domain of applicability
AU  - Schroeter, Timon
AU  - Schwaighofer, Anton
AU  - Mika, Sebastian
AU  - Ter Laak, Antonius
AU  - Suelzle, Detlev
AU  - Ganzer, Ursula
AU  - Heinrich, Nikolaus
AU  - Muller, Klaus
PY  - 2007/7/1
Y1  - 2007/7/1
KW  - Bayesian
KW  - Decision tree
KW  - Distance
KW  - Domain of applicability
KW  - Drug discovery
KW  - Ensemble
KW  - Error bar
KW  - Error estimation
KW  - Gaussian process
KW  - Machine learning
KW  - Modeling
KW  - Random forest
KW  - Support vector machine
KW  - Support vector regression
UR  - http://www.scopus.com/inward/record.url?scp=34548159310&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=34548159310&partnerID=8YFLogxK
U2  - 10.1021/mp0700413
DO  - 10.1021/mp0700413
M3  - Article
C2  - 17637064
AN  - SCOPUS:34548159310
VL  - 4
SP  - 524
EP  - 538
JO  - Molecular Pharmaceutics
JF  - Molecular Pharmaceutics
SN  - 1543-8384
IS  - 4
ER  -