Composite large margin classifiers with latent subclasses for heterogeneous biomedical data

Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R. Kosorok

Research output: Contribution to journal, Article

Abstract

High-dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite the large number of classification techniques available, practitioners often face the dilemma of choosing between linear and general nonlinear classifiers. Simple linear classifiers are easy to interpret, but may be limited in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have a higher tendency to overfit. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, the composite large margin (CLM) classifier, to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function splits the data into two parts, and each part is then classified by a different linear classifier. Our method has prediction accuracy comparable to that of a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers in Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using the CLM not only provides a lower classification error in discriminating cases from controls, but also identifies subclasses of controls that are more likely to develop the disease in the future.
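
The decision rule described in the abstract (one linear function routes an observation to one of two linear classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hard split at zero, uses hand-picked weights rather than weights fitted jointly by a large margin objective, and the names `linear_score` and `composite_predict` are hypothetical.

```python
from typing import Sequence, Tuple

# A linear function is represented as (weights, bias). This representation
# is an assumption made for illustration only.
Linear = Tuple[Sequence[float], float]


def linear_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Return the value of the linear function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def composite_predict(x: Sequence[float], split: Linear,
                      clf_a: Linear, clf_b: Linear) -> int:
    """Route x with the split function, then classify with the chosen linear rule.

    Assumes a hard split: clf_a handles split score >= 0, clf_b handles the rest.
    """
    chosen = clf_a if linear_score(*split, x) >= 0 else clf_b
    return 1 if linear_score(*chosen, x) >= 0 else -1


# Hand-picked weights for an XOR-like layout that no single linear classifier
# can separate: the label is +1 when both coordinates share a sign.
split = ([1.0, 0.0], 0.0)    # splits the data on the first coordinate
clf_a = ([0.0, 1.0], 0.0)    # used where the first coordinate is non-negative
clf_b = ([0.0, -1.0], 0.0)   # used where the first coordinate is negative

points = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
predictions = [composite_predict(p, split, clf_a, clf_b) for p in points]
```

Each of the three functions stays linear, so the coefficients of the split and of each sub-classifier can be read directly, which is the interpretability argument made in the abstract. How the three functions are estimated simultaneously is described in the paper and is not reproduced here.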

Original language: English
Pages (from-to): 75-88
Number of pages: 14
Journal: Statistical Analysis and Data Mining
Volume: 9
Issue number: 2
DOIs: 10.1002/sam.11300
Publication status: Published - 2016 Apr 1

Keywords

  • Classification
  • Large margin
  • Latent subclasses
  • Principal component analysis

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Analysis

Cite this

Composite large margin classifiers with latent subclasses for heterogeneous biomedical data. / Chen, Guanhua; Liu, Yufeng; Shen, Dinggang; Kosorok, Michael R.

In: Statistical Analysis and Data Mining, Vol. 9, No. 2, 01.04.2016, p. 75-88.

@article{059b4a6f95d143388d4776ae55cc8adc,
title = "Composite large margin classifiers with latent subclasses for heterogeneous biomedical data",
abstract = "High-dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite a large number of candidate classification techniques available to use, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have higher tendency for overfitting. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the composite large margin (CLM) classifier, to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.",
keywords = "Classification, Large margin, Latent subclasses, Principal component analysis",
author = "Guanhua Chen and Yufeng Liu and Dinggang Shen and Kosorok, {Michael R.}",
year = "2016",
month = "4",
day = "1",
doi = "10.1002/sam.11300",
language = "English",
volume = "9",
pages = "75--88",
journal = "Statistical Analysis and Data Mining",
issn = "1932-1872",
publisher = "John Wiley and Sons Inc.",
number = "2",
}

TY - JOUR

T1 - Composite large margin classifiers with latent subclasses for heterogeneous biomedical data

AU - Chen, Guanhua

AU - Liu, Yufeng

AU - Shen, Dinggang

AU - Kosorok, Michael R.

PY - 2016/4/1

Y1 - 2016/4/1

AB - High-dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite a large number of candidate classification techniques available to use, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have higher tendency for overfitting. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the composite large margin (CLM) classifier, to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.

KW - Classification

KW - Large margin

KW - Latent subclasses

KW - Principal component analysis

UR - http://www.scopus.com/inward/record.url?scp=84961595289&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961595289&partnerID=8YFLogxK

U2 - 10.1002/sam.11300

DO - 10.1002/sam.11300

M3 - Article

AN - SCOPUS:84961595289

VL - 9

SP - 75

EP - 88

JO - Statistical Analysis and Data Mining

JF - Statistical Analysis and Data Mining

SN - 1932-1872

IS - 2

ER -