### Abstract

Machine learning (ML) based prediction of molecular properties across chemical compound space is an important alternative approach for efficiently estimating the solutions of highly complex many-electron problems in chemistry and physics. Statistical methods represent molecules as descriptors that should encode molecular symmetries and interactions between atoms. Many such descriptors have been proposed; all of them have advantages and limitations. Here, we propose a set of general two-body and three-body interaction descriptors that are invariant to translation, rotation, and atomic indexing. Adapting the established kernel ridge regression methods of machine learning, we evaluate our descriptors on predicting several properties of small organic molecules calculated using density-functional theory. We use two datasets. The GDB-7 set contains 6868 molecules with up to 7 heavy atoms of the elements C, N, and O. The GDB-9 set is composed of 131722 molecules with up to 9 heavy atoms of the elements C, N, and O. When trained on 5000 random molecules, our best model achieves an accuracy of 0.8 kcal/mol on the remaining 1868 molecules of GDB-7 and 1.5 kcal/mol on the remaining 126722 molecules of GDB-9. A linear regression model applied to our novel many-body descriptors performs almost as well as a non-linear kernelized model. Linear models are readily interpretable: a feature importance ranking measure helps to obtain qualitative and quantitative insights into the importance of two- and three-body molecular interactions for predicting molecular properties computed with quantum-mechanical methods.
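The invariance requirement stated in the abstract can be illustrated with a minimal sketch. The descriptor below is a hypothetical stand-in, not the definition used in the paper: for each unordered pair of element types it sums an inverse power of the interatomic distance over all atom pairs of that type. The names `ELEMENTS`, `PAIR_TYPES`, and `two_body_descriptor`, the exponent `power`, and the toy geometry are all assumptions made for illustration. Because the features depend only on distances and on sums over pairs, they should not change when the molecule is rotated, translated, or its atoms are reordered.

```python
import itertools
import numpy as np

# Assumed element set (H, C, N, O by atomic number); illustrative only.
ELEMENTS = (1, 6, 7, 8)
# All unordered element pairs, e.g. (1, 1), (1, 6), ..., (8, 8).
PAIR_TYPES = list(itertools.combinations_with_replacement(ELEMENTS, 2))


def two_body_descriptor(numbers, coords, power=1.0):
    """Hypothetical two-body descriptor: one feature per element pair,
    holding the sum of distance**(-power) over all atom pairs of that type."""
    numbers = np.asarray(numbers)
    coords = np.asarray(coords, dtype=float)
    features = np.zeros(len(PAIR_TYPES))
    for i, j in itertools.combinations(range(len(numbers)), 2):
        # Sort the two atomic numbers so (8, 1) and (1, 8) share one feature.
        pair = tuple(sorted((int(numbers[i]), int(numbers[j]))))
        distance = np.linalg.norm(coords[i] - coords[j])
        features[PAIR_TYPES.index(pair)] += distance ** (-power)
    return features


# Water-like toy geometry (angstrom); the values are made up for illustration.
numbers = np.array([8, 1, 1])
coords = np.array([[0.00, 0.00, 0.0],
                   [0.96, 0.00, 0.0],
                   [-0.24, 0.93, 0.0]])

rng = np.random.default_rng(0)
# Random orthogonal matrix from a QR decomposition; it preserves distances.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
shift = rng.normal(size=3)
order = rng.permutation(len(numbers))

original = two_body_descriptor(numbers, coords)
transformed = two_body_descriptor(
    numbers[order], (coords @ rotation.T + shift)[order]
)

# The two feature vectors should agree up to floating-point rounding.
print(np.allclose(original, transformed))
```

A fixed-length vector of this kind could then be passed to a linear or kernel ridge regression model; a three-body analogue would sum over atom triples in the same way.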

Original language | English |
---|---|
Journal | Journal of Chemical Theory and Computation |
DOIs | https://doi.org/10.1021/acs.jctc.8b00110 |
Publication status | Accepted/In press - 2018 Feb 2 |

### ASJC Scopus subject areas

- Computer Science Applications
- Physical and Theoretical Chemistry

### Cite this

*Journal of Chemical Theory and Computation*. https://doi.org/10.1021/acs.jctc.8b00110

**Many-Body Descriptors for Predicting Molecular Properties with Machine Learning: Analysis of Pairwise and Three-Body Interactions in Molecules.** / Pronobis, Wiktor; Tkatchenko, Alexandre; Muller, Klaus.

Research output: Contribution to journal › Article


TY - JOUR

T1 - Many-Body Descriptors for Predicting Molecular Properties with Machine Learning

T2 - Analysis of Pairwise and Three-Body Interactions in Molecules

AU - Pronobis, Wiktor

AU - Tkatchenko, Alexandre

AU - Muller, Klaus

PY - 2018/2/2

Y1 - 2018/2/2

UR - http://www.scopus.com/inward/record.url?scp=85047085761&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85047085761&partnerID=8YFLogxK

U2 - 10.1021/acs.jctc.8b00110

DO - 10.1021/acs.jctc.8b00110

M3 - Article

AN - SCOPUS:85047085761

JO - Journal of Chemical Theory and Computation

JF - Journal of Chemical Theory and Computation

SN - 1549-9618

ER -