Coping with unbalanced class data sets in oral absorption models

Newby, Danielle; Freitas, Alex A; Ghafourian, Taravat

Revised-Ghafourian-Danielle-Coping.pdf (626.94 kB)

journal contribution

posted on 2023-06-09, 03:25 authored by Danielle Newby, Alex A Freitas, Taravat Ghafourian

Class imbalance occurs frequently in drug discovery data sets. Oral absorption data sets in the literature contain considerably more highly absorbed compounds than poorly absorbed compounds. This produces models that are biased toward highly absorbed compounds and that generalize poorly to industry settings, where more early-stage drug candidates are poorly absorbed. This paper presents two strategies to cope with unbalanced class data sets: undersampling the majority high-absorption class and applying misclassification costs, both using classification decision trees. The published data set of Hou et al. (J. Chem. Inf. Model. 2007, 47, 208-218), which contains percentage human intestinal absorption for 645 drug and drug-like compounds, was used for the development and validation of classification trees using classification and regression tree (C&RT) analysis. The results indicate that undersampling the majority class (highly absorbed compounds) to give a balanced (50:50) training set can achieve better accuracies for poorly absorbed compounds, whereas the biased training set achieved higher accuracies for highly absorbed compounds. The use of misclassification costs resulted in improved class predictions when applied to reduce false positives or false negatives. Moreover, it was shown that the classical overall accuracy measure used in many publications is particularly misleading in the case of unbalanced data sets, and the more appropriate measures presented here may be used for a more realistic assessment of the classification models' performance. Thus, these strategies offer improvements for coping with unbalanced class data sets and for obtaining classification models applicable in industry. © 2013 American Chemical Society.
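The two ideas in the abstract that lend themselves to a small illustration are undersampling the majority class to a 50:50 training set and reporting per-class measures instead of overall accuracy alone. The following is a minimal sketch under stated assumptions, not the authors' C&RT workflow: the function names, the "high"/"low" labels, and the 90:10 toy split are all hypothetical and are not taken from the Hou et al. data set.

```python
import random
from collections import Counter


def undersample_majority(records, label_of, seed=0):
    """Randomly drop records so every class has as many members as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def class_metrics(y_true, y_pred, positive="high"):
    """Overall accuracy plus per-class rates (sensitivity and specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical toy data: 90 highly absorbed and 10 poorly absorbed compounds.
compounds = [("cmpd%d" % i, "high") for i in range(90)]
compounds += [("cmpd%d" % i, "low") for i in range(90, 100)]

# A degenerate model that always predicts the majority class.
y_true = [label for _, label in compounds]
y_pred = ["high"] * len(compounds)
metrics = class_metrics(y_true, y_pred)

# Balanced (50:50) training set obtained by undersampling the majority class.
balanced = undersample_majority(compounds, label_of=lambda rec: rec[1])
balanced_counts = Counter(label for _, label in balanced)
```

In this toy case the always-"high" model reaches an overall accuracy of 0.9 while its specificity (the rate of correctly identified poorly absorbed compounds) is 0.0, which is the kind of gap that overall accuracy hides on unbalanced data. Undersampling discards majority-class records, so in practice the random draw would usually be repeated or fixed with a seed, as assumed here.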

History

Publication status

Published

File Version

Accepted version

Journal

Journal of Chemical Information and Modeling

ISSN

1549-9596

Publisher

American Chemical Society

External DOI

https://doi.org/10.1021/ci300348u

Issue

2

Volume

53

Page range

461-474

Department affiliated with

Biochemistry Publications

Full text available

Yes

Peer reviewed?

Yes

Legacy Posted Date

2017-11-30

First Open Access (FOA) Date

2017-11-30

First Compliant Deposit (FCD) Date

2017-11-30

Keywords

Class imbalance; Class prediction; Classification and regression tree; Classification decision; Classification models; Classification trees; Data set; Drug candidates; Drug discovery; False negatives; False positive; Human intestinal absorptions; Misclassification costs; Oral absorption; Overall accuracies; Training sets; Unbalanced data; Under-sampling; Decision trees; Drug products; Classification (of information); absorption; article; biological model; decision tree; drug database; drug development; human; methodology; oral drug administration; regression analysis; Absorption; Administration Oral; Databases Pharmaceutical; Decision Trees; Drug Discovery; Humans; Models Biological; Regression Analysis

Licence

Copyright not evaluated
