Mutual information between discrete variables with many categories using recursive adaptive partitioning

Junhee Seok, Yeong Seon Kang

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Mutual information, a general measure of the relatedness between two random variables, has been actively used in the analysis of biomedical data. The mutual information between two discrete variables is conventionally calculated from their joint probabilities, which are estimated from the frequency of observed samples in each combination of variable categories. However, this conventional approach is no longer efficient for discrete variables with many categories, which are common in large-scale biomedical data such as diagnosis codes, drug compounds, and genotypes. Here, we propose a method to provide stable estimates of the mutual information between discrete variables with many categories. Simulation studies showed that the proposed method reduced the estimation errors 45-fold and improved the correlation coefficients with true values 99-fold, compared with the conventional calculation of mutual information. The proposed method was also demonstrated through a case study of diagnostic data in electronic health records. This method is expected to be useful in the analysis of various biomedical data with discrete variables.
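The conventional calculation mentioned in the abstract can be sketched as a plug-in estimate, in which every probability is replaced by an observed frequency. The sketch below is only an illustration of that baseline, not the method the authors propose; the function name `plugin_mutual_information` and the choice of natural logarithms (nats) are assumptions made for this example.

```python
from collections import Counter
from math import log


def plugin_mutual_information(xs, ys):
    """Frequency-based (plug-in) mutual information estimate, in nats.

    Hypothetical helper for illustration: it is the conventional
    baseline described in the abstract, not the proposed method.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("at least one sample is required")

    joint = Counter(zip(xs, ys))  # counts for each observed (x, y) pair
    count_x = Counter(xs)         # marginal counts for X
    count_y = Counter(ys)         # marginal counts for Y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))) written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


# Ten samples in which every category of X and Y is observed exactly once.
sparse_x = list(range(10))
sparse_y = list(reversed(range(10)))
sparse_estimate = plugin_mutual_information(sparse_x, sparse_y)
```

In the sparse case at the end, each category pair is seen only once, so the estimate equals log(10) regardless of whether the variables are truly related. This is one way to see why frequency-based estimates can become unreliable when the number of categories is large relative to the sample size.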

Original language: English
Article number: 10981
Journal: Scientific Reports
Volume: 5
DOI: 10.1038/srep10981
Publication status: Published - 2015 Jun 5

Fingerprint

Electronic Health Records
Genotype
Pharmaceutical Preparations

ASJC Scopus subject areas

  • General

Cite this

Mutual information between discrete variables with many categories using recursive adaptive partitioning. / Seok, Junhee; Kang, Yeong Seon.

In: Scientific Reports, Vol. 5, 10981, 05.06.2015.

@article{20318978e1024cb9a8b085cb2d3e2378,
title = "Mutual information between discrete variables with many categories using recursive adaptive partitioning",
author = "Seok, Junhee and Kang, {Yeong Seon}",
year = "2015",
month = "6",
day = "5",
doi = "10.1038/srep10981",
language = "English",
volume = "5",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
}

TY  - JOUR
T1  - Mutual information between discrete variables with many categories using recursive adaptive partitioning
AU  - Seok, Junhee
AU  - Kang, Yeong Seon
PY  - 2015/6/5
Y1  - 2015/6/5
UR  - http://www.scopus.com/inward/record.url?scp=84930656104&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84930656104&partnerID=8YFLogxK
U2  - 10.1038/srep10981
DO  - 10.1038/srep10981
M3  - Article
C2  - 26046461
AN  - SCOPUS:84930656104
VL  - 5
JO  - Scientific Reports
JF  - Scientific Reports
SN  - 2045-2322
M1  - 10981
ER  -